Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents.
Instead of using hand-coded templates or other methods manually customized for each given document format, it uses example-based
machine learning to adapt to customer-defined document and meta-data types. We report results from experiments on
recognizing document information in two document corpora: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured,
as the different journals use a variety of flexible layouts. The second set is largely free-form text drawn from poor-quality,
FAX-grade scans of legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to that of hand-coded
systems, and substantially lower accuracy on the legal documents.
Keywords: Document recognition, Document understanding, Neural networks, Support vector machines, Machine learning