RAPID FEATURE SPACE SPEAKER ADAPTATION FOR MULTI-STREAM HMM-BASED AUDIO-VISUAL SPEECH RECOGNITION (WedAmOR2)
Author(s):
Jing Huang (IBM, United States of America)
Etienne Marcheret (IBM, United States of America)
Karthik Visweswariah (IBM, United States of America)
Abstract: Multi-stream hidden Markov models (HMMs) have recently been very successful in audio-visual speech recognition, where the audio and visual streams are fused at the final decision level. In this paper we investigate fast feature-space speaker adaptation using multi-stream HMMs for audio-visual speech recognition. In particular, we focus on the performance of feature-space maximum likelihood linear regression (fMLLR), a fast and effective feature-space transform. Unlike common speaker adaptation techniques such as MAP or MLLR, fMLLR does not change any of the audio or visual HMM parameters; it simply applies a single transform to the test features. We also address the problem of fast and robust on-line fMLLR adaptation using feature-space maximum a posteriori linear regression (fMAPLR). Adaptation experiments are reported on the IBM infrared headset audio-visual database. Averaged over a 20-speaker, 1-hour speaker-independent test set, multi-stream fMLLR achieves a 31% relative gain in the clean audio condition and a 59% relative gain in the noisy audio condition (around 7 dB), compared to multi-stream HMMs without fMLLR adaptation.
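
To illustrate the feature-space idea the abstract describes (this is a minimal sketch, not the authors' implementation): an fMLLR transform is a single speaker-specific affine map x̂ = Ax + b applied to every test frame, while the HMM means and variances stay untouched. In the sketch below, `apply_fmllr`, the feature matrix, and the identity-initialized `A`, `b` are all hypothetical placeholders; in practice A and b would be estimated by maximum likelihood from the speaker's adaptation data.

```python
import numpy as np

def apply_fmllr(features: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a single fMLLR affine transform x_hat = A @ x + b to each frame.

    features: (num_frames, dim) matrix of test features
    A:        (dim, dim) speaker-specific transform matrix
    b:        (dim,) bias vector

    Only the observations fed to the decoder change; the audio and
    visual HMM parameters themselves are left unmodified.
    """
    return features @ A.T + b

# Hypothetical usage on 39-dimensional features.
dim = 39
frames = np.random.randn(100, dim)   # stand-in for extracted test features
A, b = np.eye(dim), np.zeros(dim)    # would be estimated from adaptation data
adapted = apply_fmllr(frames, A, b)  # transformed features passed to decoding
```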
