Phonological Models in Automatic Speech Recognition
Toyota Technological Institute at Chicago
Location: Cobb 201
The performance of automatic speech recognizers varies widely across contexts. Very good performance can be achieved on single-speaker, large-vocabulary dictation in a clean acoustic environment, as well as on small-vocabulary tasks with fewer constraints on the speakers and acoustics. One domain in which good performance remains elusive is spontaneous conversational speech. This type of speech poses a number of challenges, among them extreme variation in pronunciation. I will describe efforts in the speech recognition community to characterize and model pronunciation variation.
The most thoroughly studied approach is augmentation of a phonetic pronunciation lexicon with phonological rules. Despite successes in a few domains, it has been surprisingly difficult to obtain significant recognition improvements by including those phonetic pronunciations that appear to exist in the data. I will advocate an alternative view: that the phone unit may not be the most appropriate for modeling the lexicon. I will describe approaches using both larger (e.g. syllable-sized) and “smaller” (e.g. articulatory) units. In the class of “smaller” unit models, ideas from articulatory and autosegmental phonology motivate multi-tier models in which tiers have semi-independent behavior. I will present a particular model in which articulatory features are represented as variables in a dynamic Bayesian network.
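As a rough illustration of the rule-based lexicon augmentation mentioned above, the following minimal sketch expands a baseform pronunciation into surface variants by optionally applying context-dependent rewrite rules. The phone symbols, the vowel set, the two rules (flapping and /t/ deletion after /n/), and the names `RULES`, `options`, and `expand` are all assumptions chosen for illustration; they are not taken from any particular recognizer or lexicon.

```python
from itertools import product

# Illustrative vowel inventory (ARPAbet-like symbols; an assumption).
VOWELS = {"aa", "ae", "ah", "ax", "eh", "er", "ih", "iy", "ow", "uw"}

# Hypothetical rules: (target phone, replacement or None for deletion,
#                      allowed left context, allowed right context).
RULES = [
    ("t", "dx", VOWELS, VOWELS),  # flapping between vowels
    ("t", None, {"n"}, VOWELS),   # /t/ deletion after /n/ before a vowel
]

def options(phones, i):
    """Return the surface options for phones[i]: the baseform phone
    plus the output of every rule whose context matches."""
    left = phones[i - 1] if i > 0 else None
    right = phones[i + 1] if i + 1 < len(phones) else None
    opts = [phones[i]]
    for target, repl, left_ctx, right_ctx in RULES:
        if phones[i] == target and left in left_ctx and right in right_ctx:
            opts.append(repl)
    return opts

def expand(baseform):
    """Expand a space-separated baseform into the set of surface variants.
    Rule contexts are checked against the baseform, not the surface string."""
    phones = baseform.split()
    per_position = [options(phones, i) for i in range(len(phones))]
    variants = set()
    for choice in product(*per_position):
        variants.add(" ".join(p for p in choice if p is not None))
    return variants

if __name__ == "__main__":
    for word in ("b ah t er", "w ih n t er", "k ae t"):
        print(word, "->", sorted(expand(word)))
```

Because every rule is optional, the number of variants grows quickly with the number of rules and matching sites. In a sketch like this, each added variant is also a potential source of confusion with other words in the lexicon.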
Non-phonetic pronunciation models can involve model structures significantly different from those typically used in speech recognition, and as a result they may also entail modifications to other components, such as the observation model and training algorithms. At this point it is not clear what the “winning” approach will be. The success of a given approach may depend on the domain or on the amount and type of training data available. I will describe some of the current challenges and ongoing work, with a particular focus on the role of phonological theories in statistical models of pronunciation (and vice versa?).