Convolutional neural networks have proved very successful in image recognition, thanks
to their tolerance to small translations. They have recently been applied to speech
recognition as well, using a spectral representation as input. However, in this case
the translations along the two axes, time and frequency, should be handled quite
differently. So far, most authors have focused on convolution along the frequency
axis, which offers invariance to speaker and speaking style variations. Other researchers
have developed a different network architecture that applies time-domain convolution
in order to process a longer time-span of input in a hierarchical manner. These two
approaches have different background motivations, and both offer significant gains
over a standard fully connected network. Here we show that the two network architectures
can be readily combined, along with their advantages. With the combined model we report
an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.