A Dialectal Chinese Speech Recognition Framework
Jing Li (1), Thomas Fang Zheng (1), William Byrne (2, 3) and Dan Jurafsky (4)
(1) Center for Speech Technology, State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and
Technology, Tsinghua University, Beijing, 100084, P.R. China
(2) Machine Intelligence Laboratory, Cambridge University, U.K.
(3) Center for Language and Speech Processing, The Johns Hopkins University, U.S.A.
(4) Department of Linguistics, Stanford University, U.S.A.
Received: 20 December 2004 Accepted: 17 June 2005
Abstract: A framework for dialectal Chinese speech recognition is proposed and studied, in which a relatively small dialectal Chinese
(that is, Chinese influenced by the speaker's native dialect) speech corpus and dialect-related knowledge are used to transform
a standard Chinese (Putonghua, abbreviated as PTH) speech recognizer into a dialectal Chinese speech recognizer. Two kinds
of knowledge sources are explored: one is expert knowledge and the other is a small dialectal Chinese corpus. These knowledge
sources provide information at four levels: phonetic level, lexicon level, language level, and acoustic decoder level. This
paper takes Wu dialectal Chinese (WDC) as an example target language. The goal is to establish a WDC speech recognizer from
an existing PTH speech recognizer based on the Initial-Final structure of the Chinese language and a study of how dialectal
Chinese speakers speak Putonghua. The authors propose to use context-independent PTH-IF mappings (where IF means either a
Chinese Initial or a Chinese Final), context-independent WDC-IF mappings, and syllable-dependent WDC-IF mappings (obtained
from either experts or data), and combine them with the supervised maximum likelihood linear regression (MLLR) acoustic model
adaptation method. The IF mappings introduce a multi-pronunciation lexicon whose size can increase lexical confusion and hence
degrade recognition performance. To reduce the size of this lexicon, a Multi-Pronunciation Expansion (MPE) method based on
the accumulated uni-gram probability (AUP) is proposed. In addition, some commonly used WDC words are selected and added to
the lexicon. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieves a 10–18% absolute
Character Error Rate (CER) reduction when recognizing WDC, with only a 0.62% CER increase when recognizing PTH. The proposed
framework and methods are expected to work not only for Wu dialectal Chinese but also for other varieties of dialectal Chinese
and even for other languages.
Keywords: dialectal Chinese speech recognition, initial or final (IF), IF-mapping rule, pronunciation modeling, small quantity of speech data
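The lexicon expansion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the details of MPE or AUP, so the sketch assumes that words are ranked by uni-gram probability and that only the most frequent words, up to an accumulated probability threshold, receive pronunciation variants. The mapping table `IF_MAP`, the function names, and the toy lexicon are all hypothetical.

```python
from itertools import product

# Hypothetical context-independent IF mappings (assumption, for illustration):
# each canonical PTH unit maps to the surface forms a dialectal speaker may use.
IF_MAP = {"zh": ["zh", "z"], "ch": ["ch", "c"], "sh": ["sh", "s"]}


def expand(pron):
    """Return every pronunciation variant of a tuple of IF units."""
    options = [IF_MAP.get(unit, [unit]) for unit in pron]
    return [tuple(variant) for variant in product(*options)]


def select_by_aup(unigram, threshold):
    """Select the most frequent words until their accumulated uni-gram
    probability reaches the threshold (assumed selection rule)."""
    selected, accumulated = set(), 0.0
    for word, prob in sorted(unigram.items(), key=lambda kv: -kv[1]):
        if accumulated >= threshold:
            break
        selected.add(word)
        accumulated += prob
    return selected


def build_lexicon(base, unigram, threshold):
    """Expand pronunciations only for the selected words; keep the single
    canonical pronunciation for all other words."""
    chosen = select_by_aup(unigram, threshold)
    return {w: expand(p) if w in chosen else [p] for w, p in base.items()}


# Toy data (made up for the example).
base = {"shi": ("sh", "i"), "zhong": ("zh", "ong"), "chang": ("ch", "ang")}
unigram = {"shi": 0.5, "zhong": 0.3, "chang": 0.2}
lexicon = build_lexicon(base, unigram, threshold=0.7)
```

Under these assumptions the threshold controls the trade-off the abstract describes: a higher value gives more words alternative pronunciations and so enlarges the lexicon, while a lower value keeps the lexicon small at the cost of covering fewer dialectal variants.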
This paper is based upon a study supported by the US National Science Foundation under Grant No. 0121285. Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the
views of the National Science Foundation.
Jing Li is currently a Ph.D. candidate at the Center for Speech Technology, State Key Laboratory of Intelligent Technology and Systems,
Department of Computer Science and Technology, Tsinghua University. He received his B.S. degree in computer science and technology
from Tsinghua University in 1999. His current research focuses on dialectal Chinese speech recognition, acoustic modeling, and keyword
spotting.
Thomas Fang Zheng graduated from the Department of Computer Science & Technology of Tsinghua University, where he received his B.S., M.S. and Ph.D.
degrees in 1990, 1992 and 1997, respectively. Dr. Zheng is currently a professor at Tsinghua University.
He is Vice Dean of the Research Institute of Information Technology of Tsinghua University and Director of the Center for Speech
Technology, State Key Laboratory of Intelligent Technology and Systems. Dr. Zheng is now the Council Chair of the Chinese Corpus
Consortium, an IEEE member, an ISCA member, a senior member of China Computer Federation, a member of the Artificial Intelligence
and Pattern Recognition Technical Commission of China Computer Federation, a member of the editorial committee of the Journal
of Chinese Information Processing, and a key member of Oriental-COCOSDA. He was a senior member and a co-leader at the Johns
Hopkins University's Summer Workshop of Language and Speech Processing, in 2000 and 2004, working on pronunciation modeling
and dialectal Chinese recognition, respectively. His main research interests are speech recognition, natural language understanding,
and speaker recognition.
William Byrne received the B.S. degree in electrical engineering from Cornell University, Ithaca, NY in 1982, and the Ph.D. degree in electrical
engineering from the University of Maryland, College Park, MD in 1993. He has worked at Entropic Research Laboratory, Washington
DC, and the National Institutes of Health, Bethesda, MD. He is currently a research associate professor in the Department
of Electrical Engineering and the Center for Language and Speech Processing at the Johns Hopkins University, Baltimore, MD,
and a university lecturer in the Machine Intelligence Laboratory and a member of the Speech Research Group, Cambridge University,
UK. His main research interests are in statistical modeling techniques for speech and language processing, with a recent interest
in statistical machine translation.
Dan Jurafsky is an associate professor in the Department of Linguistics, Stanford University, which he joined in January 2004.
He received his B.A. degree in Linguistics in 1983, and his Ph.D. degree in computer science in 1992, both from UC Berkeley.
He then worked for 8 years at the University of Colorado at Boulder, where he was an assistant and associate professor in
the Department of Linguistics, the Institute of Cognitive Science, the Department of Computer Science, and the Center for
Spoken Language Research. He still maintains an adjunct position at the University of Colorado, and continues to work closely
with colleagues there. His research focuses on statistical models of human and machine language processing, especially computational
linguistics, automatic speech recognition and understanding, computational psycholinguistics, and natural language processing.
He received the National Science Foundation CAREER award in 1998 and the MacArthur Fellowship in 2002. His most recent book,
with James H. Martin, is the widely used textbook “Speech and Language Processing”.