Demand for structured digital libraries created from paper-based archives is growing in many scientific and
cultural fields, and is met neither by off-the-shelf OCR nor by commercial form-processing systems. This paper describes
and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with
text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently
being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (butterflies
and moths) and Coleoptera (beetles) to searchable digital archives. Evaluation results covering different aspects of the system,
from card scanning to overall word recognition rates for different database fields, are summarised for two datasets comprising
over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70–90% are
reported for key data fields, subject to availability of suitable electronic dictionaries. Further validation and correction
is supported through web-editing of the online digital archive.
Keywords: Document analysis · Digital archive · OCR