This paper addresses optical character recognition (OCR) of printed mathematical expressions. It discusses the construction
of a representative corpus of technical and scientific documents containing expressions. A statistical investigation
of the corpus is presented, and the usefulness of this analysis is demonstrated for four related research problems: (i)
identification and segmentation of expression zones from the rest of the document, (ii) recognition of expression symbols,
(iii) interpretation of expression structures, and (iv) performance evaluation of a mathematical expression recognition system.
In addition, a groundtruthing format is proposed to facilitate automatic evaluation of expression recognition techniques.
Keywords: OCR, Mathematical expressions, Database, Groundtruthing, Statistical learning, Performance evaluation
Received: 10 July 2003, Accepted: 22 November 2004, Published online: 18 March 2005
Correspondence to: Utpal Garain