Factorized Hidden Layer Adaptation for Deep Neural Network based Acoustic Modeling
Abstract:
Speech-based interfaces enable human interaction with machines using a speech platform. Since speech is the primary mode of human communication, services powered by automatic speech recognition (ASR) technology have become widely popular in real-world applications. For an enhanced user experience, ASR services must perform at close to human parity, so we need to build ASR systems that are robust to variations in speaker, noise, and channel. Recently, Deep Neural Networks (DNNs) have been successfully integrated into ASR systems, yielding significant improvements over conventional Gaussian mixture model (GMM) based systems. However, DNNs, like other machine learning models, are susceptible to performance degradation when training and testing conditions are mismatched. Adaptation techniques are therefore needed to reduce this mismatch.
In this thesis, we propose two methods to adapt DNN acoustic models for ASR: the factorized hidden layer (FHL) and subspace learning hidden unit contributions (LHUC). FHL models speaker-dependent (SD) hidden layers by representing an SD affine transformation as a linear combination of bases. The combination weights are low-dimensional speaker parameters that can be initialized using speaker representations such as i-vectors and then reliably refined through unsupervised adaptation. The FHL method thus provides an efficient way to perform both adaptive training and test-time adaptation, and it can be extended to handle multiple variabilities of the speech signal. Subspace LHUC improves on the recently proposed LHUC method, in which a set of SD parameters is estimated to linearly recombine the hidden units in an unsupervised fashion. Subspace LHUC instead estimates the SD hidden unit contributions in a subspace that is connected to the various layers through a set of adaptively trained weights; adaptation then amounts to estimating a point in this SD subspace, which significantly reduces the number of SD parameters.
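To make the two layer types concrete, the following is a minimal NumPy sketch of their forward passes, not the thesis's exact formulation: all dimensions and variable names (W0, bases, d_s, M, v_s) are illustrative assumptions, and the 2·sigmoid amplitude re-parameterization is the one commonly used for LHUC.

    import numpy as np

    rng = np.random.default_rng(0)

    def fhl_layer(x, W0, b, bases, d_s):
        """Factorized hidden layer (FHL): the speaker-dependent weight
        matrix is the speaker-independent W0 plus a linear combination of
        bases, weighted by a low-dimensional speaker vector d_s (which can
        be initialized from an i-vector and refined unsupervised)."""
        # W(s) = W0 + sum_i d_s[i] * bases[i]
        W_s = W0 + np.tensordot(d_s, bases, axes=1)
        return np.maximum(0.0, x @ W_s.T + b)   # ReLU hidden activation

    def subspace_lhuc(h, M, v_s):
        """Subspace LHUC: per-unit amplitudes are predicted from a point
        v_s in a low-dimensional speaker subspace via an adaptively
        trained matrix M, rather than estimating one free parameter per
        hidden unit as in standard LHUC."""
        a = 2.0 / (1.0 + np.exp(-M @ v_s))      # amplitudes in (0, 2)
        return a * h                            # rescale hidden units

    # Toy dimensions (hypothetical): 40-dim input frame, 64 hidden units,
    # 8 FHL bases, 10-dim speaker subspace.
    n_in, n_hid, n_bases, n_spk = 40, 64, 8, 10
    W0    = rng.standard_normal((n_hid, n_in)) * 0.1
    b     = np.zeros(n_hid)
    bases = rng.standard_normal((n_bases, n_hid, n_in)) * 0.01
    M     = rng.standard_normal((n_hid, n_spk)) * 0.1

    d_s = rng.standard_normal(n_bases) * 0.1    # speaker combination weights
    v_s = rng.standard_normal(n_spk) * 0.1      # speaker point in LHUC subspace

    x = rng.standard_normal(n_in)               # one acoustic feature frame
    h = fhl_layer(x, W0, b, bases, d_s)
    y = subspace_lhuc(h, M, v_s)
    print(y.shape)                              # (64,)

In both cases, test-time adaptation only touches the small speaker vectors (d_s and v_s here), so the per-speaker footprint stays low regardless of the layer width.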
Experimental results show that both adaptation methods significantly improve ASR performance compared to standard DNN models, as well as other state-of-the-art DNN adaptation approaches, such as training on speaker-normalized constrained maximum likelihood linear regression (CMLLR) features, speaker-aware training (SaT) using i-vectors, and LHUC. Moreover, these methods have desirable properties such as robustness to the quality of the adaptation targets, a low per-speaker footprint, and gains from both speaker-level and utterance-level adaptation. On three benchmark ASR tasks, they achieve absolute reductions of 1.3-4.2% in the word error rate (WER).