Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition

Tóth, L [Tóth, László (Mesterséges intel...), author] MTA-SZTE Mesterséges Intelligencia Kutatócsoport (SZTE / TTIK / ITCS)

English-language conference paper (book chapter), scientific
    Convolutional neural networks have proved very successful in image recognition, thanks to their tolerance to small translations. They have recently been applied to speech recognition as well, using a spectral representation as input. However, in this case the translations along the two axes (time and frequency) should be handled quite differently. So far, most authors have focused on convolution along the frequency axis, which offers invariance to speaker and speaking style variations. Other researchers have developed a different network architecture that applies time-domain convolution in order to process a longer time span of input in a hierarchical manner. These two approaches have different background motivations, and both offer significant gains over a standard fully connected network. Here we show that the two network architectures can be readily combined, and that the combined model retains the advantages of both. With the combined model we report an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
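The difference between the two convolution axes described in the abstract can be illustrated with a toy sketch. This is not the paper's architecture: the function names, the hand-set filter, the pooling size, and the tiny 4-frame by 8-band "spectrogram" are all assumptions chosen for illustration, and a real network would learn its filters. The sketch shows that frequency-axis convolution followed by max pooling yields the same features when a spectral peak moves by one band, and that a time-axis convolution can then be stacked on top to summarise several consecutive frames.

```python
# Toy illustration only: names, kernels, and sizes are assumptions,
# not details taken from the paper.

def conv_along_frequency(spectrogram, kernel):
    """Valid 1-D correlation of every time frame with `kernel` along frequency."""
    k = len(kernel)
    return [
        [sum(frame[f + i] * kernel[i] for i in range(k))
         for f in range(len(frame) - k + 1)]
        for frame in spectrogram
    ]

def conv_along_time(features, kernel):
    """Valid 1-D correlation of every feature channel with `kernel` along time."""
    k = len(kernel)
    n_frames, n_channels = len(features), len(features[0])
    return [
        [sum(features[t + i][c] * kernel[i] for i in range(k))
         for c in range(n_channels)]
        for t in range(n_frames - k + 1)
    ]

def max_pool(row, size):
    """Non-overlapping max pooling over a 1-D list."""
    return [max(row[i:i + size]) for i in range(0, len(row) - size + 1, size)]

peak_detector = [-1, 2, -1]   # hypothetical hand-set filter

frame = [0, 0, 1, 0, 0, 0, 0, 0]           # spectral peak in band 2
shifted_frame = [0, 0, 0, 1, 0, 0, 0, 0]   # same peak, one band higher

spectrogram = [frame] * 4           # 4 time frames x 8 frequency bands
shifted = [shifted_frame] * 4

# Frequency-domain convolution, then pooling along frequency.
freq_features = [max_pool(row, 3)
                 for row in conv_along_frequency(spectrogram, peak_detector)]
shifted_features = [max_pool(row, 3)
                    for row in conv_along_frequency(shifted, peak_detector)]

# Time-domain convolution stacked on the pooled features (3-frame window).
combined = conv_along_time(freq_features, [1, 1, 1])
```

Here the raw filter responses of the two inputs differ, but the pooled features are identical, which is the kind of tolerance to small frequency shifts the abstract attributes to frequency-domain convolution. The final step widens the temporal context from one frame to three.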