Volume 7, Number 1, 66-76, DOI: 10.1007/s10044-004-0208-3

Improving recognition accuracy on structured documents by learning structural patterns

Gy. Hévízi, T. Marcinkovics and A. Lőrincz

Abstract

In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of a document's structure is the starting point of document classification. Our method is designed to augment other classification schemes and to complement pre-filtering information extraction procedures in order to reduce uncertainties. To this end, a probability distribution on the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection that makes use of the structure-based distribution function is also discussed. Demonstrations on model documents and on Internet XML documents are presented.

Keywords  Bayesian networks - Classification - Novelty detection - Probabilistic tree model - XML
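
The general idea summarised in the abstract (learn a distribution over XML structure, classify unseen documents with it, and flag unlikely structures as novel) can be illustrated with a minimal sketch. This is not the authors' probabilistic tree model or their Bayesian network parameterisation; it swaps in a much simpler naive Bayes model over (parent tag, child tag) pairs with Laplace smoothing. The names `structure_features`, `StructureClassifier`, `novelty_score` and the parameter `alpha` are hypothetical and chosen only for this illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def structure_features(xml_text):
    """Return the (parent tag, child tag) pairs of a document; text content is ignored."""
    root = ET.fromstring(xml_text)
    pairs = [("<root>", root.tag)]  # pseudo-parent for the document element
    for parent in root.iter():
        for child in parent:
            pairs.append((parent.tag, child.tag))
    return pairs


class StructureClassifier:
    """Naive Bayes over structural pairs (an assumed stand-in, not the paper's model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing constant
        self.counts = defaultdict(Counter)    # class label -> pair counts
        self.doc_counts = Counter()           # class label -> number of documents
        self.vocab = set()                    # all pairs seen during training

    def fit(self, docs, labels):
        for xml_text, label in zip(docs, labels):
            pairs = structure_features(xml_text)
            self.counts[label].update(pairs)
            self.doc_counts[label] += 1
            self.vocab.update(pairs)
        return self

    def _log_likelihood(self, pairs, label):
        total = sum(self.counts[label].values())
        # One extra slot in the denominator reserves mass for unseen pairs.
        denom = total + self.alpha * (len(self.vocab) + 1)
        return sum(
            math.log((self.counts[label][p] + self.alpha) / denom) for p in pairs
        )

    def predict(self, xml_text):
        """Return the class with the highest log prior plus log-likelihood."""
        pairs = structure_features(xml_text)
        n_docs = sum(self.doc_counts.values())
        return max(
            self.doc_counts,
            key=lambda c: math.log(self.doc_counts[c] / n_docs)
            + self._log_likelihood(pairs, c),
        )

    def novelty_score(self, xml_text):
        """Best per-pair average log-likelihood over classes; lower means more novel."""
        pairs = structure_features(xml_text)
        return max(
            self._log_likelihood(pairs, c) / len(pairs) for c in self.doc_counts
        )
```

In this sketch, a document would be treated as novel when `novelty_score` falls below a threshold chosen on held-out data; how such a threshold should be set is an assumption left open here.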
