Much electronic text in the languages of South Asia has been published on the Internet. However, while Unicode has emerged
as the favoured encoding system of corpus and computational linguists, most South Asian language data on the web uses one
of a wide range of non-standard legacy encodings. This paper describes the difficulties inherent in converting text in these
encodings to Unicode. Among the various legacy encodings for South Asian scripts, the most problematic are 8-bit fonts based
on graphical principles (as opposed to the logical principles of Unicode). Graphical fonts typically encode several features
in ways highly incompatible with Unicode. For instance, the half-form glyphs used to construct conjunct consonants usually have
their own code points in 8-bit fonts, whereas Unicode represents each half-form as the full consonant followed by
virama. This is only one of many such mismatches. The solution described here is an approach to text conversion based on
mapping rules. A small number of generalised rules (plus the capacity for more specialised rules) capture the behaviour of each character
in a font, and together they build up a conversion algorithm for that encoding. This system is embedded in a font-mapping program that outputs
CES-compliant SGML Unicode. This program, a generalised text-conversion tool, has been employed extensively in corpus-building
for South Asian languages.
Keywords: Unicode - Font - Devanagari - South Asian languages/scripts - Legacy text - Encoding - Conversion - Virama - Conjunct consonant - Vowel diacritic
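
As a rough illustration of the rule-based mapping described in the abstract, the sketch below converts bytes from an imagined 8-bit Devanagari font into Unicode. It is not the program discussed in this paper. The byte values, the table names (FULL_FORMS, HALF_FORMS, PREBASE_I) and the function convert are all hypothetical, chosen only for the example. It shows two kinds of rule: a half-form glyph becomes the full consonant plus virama, and a vowel sign i, assumed here to be stored before the consonant in visual order, is moved after the consonant cluster as Unicode's logical order requires.

```python
VIRAMA = "\u094D"        # DEVANAGARI SIGN VIRAMA
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I

# Hypothetical code points of an imagined 8-bit graphical font.
FULL_FORMS = {
    0x6B: "\u0915",  # full KA
    0x74: "\u0924",  # full TA
    0x73: "\u0938",  # full SA
}
HALF_FORMS = {
    0xC4: "\u0915",  # half KA -> KA + virama
    0xD8: "\u0938",  # half SA -> SA + virama
}
PREBASE_I = 0x69     # vowel sign i, assumed stored before the consonant


def convert(data: bytes) -> str:
    """Apply the mapping rules byte by byte and return a Unicode string."""
    out = []
    pending_i = False  # True while a pre-base i awaits the end of its cluster
    for b in data:
        if b == PREBASE_I:
            pending_i = True
        elif b in HALF_FORMS:
            # Rule: half-form glyph -> full consonant followed by virama.
            out.append(HALF_FORMS[b] + VIRAMA)
        elif b in FULL_FORMS:
            out.append(FULL_FORMS[b])
            if pending_i:
                # Rule: a full consonant closes the cluster, so the
                # vowel sign can now be emitted in logical order.
                out.append(VOWEL_SIGN_I)
                pending_i = False
        else:
            raise ValueError(f"no mapping rule for byte 0x{b:02X}")
    if pending_i:
        raise ValueError("vowel sign i without a following consonant")
    return "".join(out)


if __name__ == "__main__":
    # half SA + full TA, then the same cluster preceded by vowel sign i
    print(convert(bytes([0xD8, 0x74])))
    print(convert(bytes([0x69, 0xD8, 0x74])))
```

A real legacy font would need many more rules than this, and the tables would have to be built from the font's actual glyph inventory.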