Lecture Notes in Computer Science, 2001, Volume 2163/2001, 429-437, DOI: 10.1007/3-540-44796-2_36

Digitization, Coded Character Sets, and Optical Character Recognition for Multi-script Information Resources: The Case of the Letopis’ Zhurnal’nykh Statei

George Andrew Spencer

Abstract

Multilingual information resources whose texts span more scripts than a single 8-bit encoding scheme can represent are currently best represented with the Unicode multi-byte character-encoding scheme. However, the use of Unicode could decrease the accuracy of Optical Character Recognition (OCR) software because glyphs in certain scripts closely resemble one another. This decrease in OCR accuracy can dramatically increase the time needed to proofread the resulting electronic texts. An Indiana University Digital Library Program project to digitize a 20-year portion of the Letopis’ Zhurnal’nykh Statei is presented as an example of a digital library project dealing with a multi-script information resource for which Unicode has been used.
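
The glyph-similarity problem can be illustrated with a minimal Python sketch. It is not taken from the project itself: the function names and the sample word are hypothetical, and the script of each character is approximated by the first word of its Unicode character name. The sketch flags words that mix scripts, which is one way an OCR confusion between look-alike Latin and Cyrillic letters might surface during proofreading.

```python
import unicodedata

def scripts_in(word):
    """Approximate the scripts used in a word.

    Assumption: the first token of a letter's Unicode name
    (e.g. 'LATIN', 'CYRILLIC') stands in for its script.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }

def is_mixed_script(word):
    """Return True if the word contains letters from more than one script."""
    return len(scripts_in(word)) > 1

# Hypothetical sample: the same five glyph shapes in two scripts.
latin_word = "COBET"                             # Latin C, O, B, E, T
cyrillic_word = "\u0421\u041e\u0412\u0415\u0422"  # Cyrillic Es, O, Ve, Ie, Te
ocr_output = "\u0421" + "OBET"                   # Cyrillic Es followed by Latin letters

for word in (latin_word, cyrillic_word, ocr_output):
    print(repr(word), sorted(scripts_in(word)), is_mixed_script(word))
```

The first two strings look nearly identical in many typefaces yet share no code points, so an OCR engine working across the full Unicode repertoire has to choose between them for every such glyph. A mixed-script check like this one only catches words where the engine chose inconsistently; a word misrecognized entirely in the wrong script would pass it.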
