This paper describes an audio-visual speech recognition (AVSR) system based on articulatory features (AF). It implements a
tandem approach where artificial neural networks (ANN), in particular multi-layer perceptrons (MLP), are used as posterior
probability estimators for transforming raw input data into the more abstract articulatory features. Such an approach is particularly
well suited when relatively little training data is available, a situation that is typical for AVSR. In addition, we present the MLP feature
extraction results and an analysis of recognition accuracy and confusions. Our AF-based AVSR system
was trained on the audio-visual speech corpus VIDTIMIT, which contains conversational speech based on a medium-sized vocabulary
of more than 1200 words.
Keywords: MLP - Articulatory Features - Audio-Visual Speech Recognition
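
As an illustration of the tandem idea summarized above, the following minimal sketch passes one input frame through a small MLP, reads the softmax outputs as posterior probabilities over articulatory feature classes, and takes their logarithm to obtain a feature vector for a downstream recognizer. It is not the system described in this paper: the layer sizes, the random weights, the three-class output, and the names `mlp_posteriors` and `tandem_features` are all assumptions chosen for illustration, and the log transform is the step commonly used in tandem systems rather than one confirmed by the text.

```python
import math
import random


def softmax(scores):
    """Turn raw output scores into probabilities that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_posteriors(frame, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer MLP (tanh hidden units, softmax output)."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, frame)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    return softmax(scores)


def tandem_features(posteriors, floor=1e-10):
    """Log-transform the posteriors; the floor avoids log(0)."""
    return [math.log(max(p, floor)) for p in posteriors]


def random_matrix(rows, cols, rng):
    """Hypothetical untrained weights, standing in for a trained network."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Assumed toy dimensions: 6 input values per frame, 4 hidden units,
# 3 classes of one articulatory feature (illustrative only).
N_INPUT, N_HIDDEN, N_CLASSES = 6, 4, 3
rng = random.Random(0)

w_hidden = random_matrix(N_HIDDEN, N_INPUT, rng)
b_hidden = [0.0] * N_HIDDEN
w_out = random_matrix(N_CLASSES, N_HIDDEN, rng)
b_out = [0.0] * N_CLASSES

frame = [rng.uniform(-1.0, 1.0) for _ in range(N_INPUT)]  # one synthetic input frame
posteriors = mlp_posteriors(frame, w_hidden, b_hidden, w_out, b_out)
features = tandem_features(posteriors)

print("posteriors:", posteriors)
print("tandem features:", features)
```

In a real system the weights would come from training on labelled frames, and one such estimator could be run per articulatory feature and per modality, with the resulting vectors concatenated before recognition.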