We describe the IBM systems submitted to the NIST RT06s Speech-to-Text (STT) evaluation campaign on the CHIL lecture meeting
data for three conditions: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone
(IHM). The system building process is similar to that of the IBM conversational telephone speech recognition system. However, the
best models for the far-field conditions (SDM and MDM) proved to be the ones that use neither variance normalization nor vocal
tract length normalization. Instead, feature-space minimum phone error discriminative training yielded the best results. Due
to the relatively small amount of CHIL-domain data, the acoustic models of our systems are built on publicly available meeting
corpora, with maximum a posteriori adaptation applied twice on CHIL data during training: first to the initial speaker-independent
model, and subsequently to the minimum phone error model. For language modeling, we utilized meeting transcripts, text from
scientific conference proceedings, and spontaneous telephone conversations. On development data, chosen in our work to be
the 2005 CHIL-internal STT evaluation test set, the resulting language model provided a 4% absolute reduction in word error rate
(WER) compared to the model used in last year’s CHIL evaluation. Furthermore, the developed STT system significantly outperformed
our results from last year, reducing the WER on close-talking microphone data from 36.9% to 25.4% on our development set. In the
NIST RT06s evaluation campaign, both the MDM and SDM systems scored well; however, the IHM system performed poorly because of
unsuccessful cross-talk removal.
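
To make the WER figures quoted above concrete, the following is a minimal sketch of how a word error rate and a relative WER reduction can be computed. It is an illustration only, not the scoring tool used in the evaluation: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the sketch assumes plain whitespace-tokenized transcripts with no text normalization or handling of overlapping speech.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Both arguments are lists of words. The error count is the word-level
    Levenshtein distance, computed one row at a time by dynamic programming.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy transcripts (made up for illustration): one inserted word.
    print(word_error_rate("the cat sat".split(), "the cat sat down".split()))
    # Development-set close-talking WERs quoted in the text.
    print(f"absolute reduction: {36.9 - 25.4:.1f} points")
    print(f"relative reduction: {relative_wer_reduction(36.9, 25.4):.1%}")
```

Under these assumptions, the drop from 36.9% to 25.4% corresponds to 11.5 points absolute, or roughly 31% relative to last year's system.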