Lecture Notes in Computer Science, 2001, Volume 1997/2001, 256-274, DOI: 10.1007/3-540-45271-0_17

Locating and Reconfiguring Records in Unstructured Multiple-Record Web Documents

David W. Embley and L. Xu

Abstract

Record extraction from data-rich, unstructured, multiple-record Web documents works well [9], but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated [10]), many do not. When some values of textual records are factored out, are split unnaturally across boundaries, are joined unnaturally within boundaries, or are linked by off-page connectors, or when desired records are interspersed with records that are not of interest, it is difficult to automatically cull records and piece values together to form clean, delineated chunks of text that each represent a single record of interest. In this paper we address this problem and propose an algorithm to find and rearrange (if necessary) records in an HTML document. The essential idea is to attempt to maximize a record-recognition heuristic with respect to a given application ontology. Tests we conducted for two widely differing applications show that this technique properly locates and reconfigures records.
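
The idea of choosing the record layout that maximizes a recognition heuristic can be illustrated with a minimal sketch. This is not the algorithm from the paper, which the abstract does not spell out. Everything named here is an assumption made for illustration: the toy two-field "ontology" (ONTOLOGY), the scoring rule (a chunk looks like one record when each expected field occurs exactly once), and the function names record_score, segmentation_score, and best_segmentation.

```python
import re
from typing import List, Sequence

# Hypothetical stand-in for an application ontology: one pattern per field
# that is expected to occur exactly once in a well-formed record.
ONTOLOGY = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "price": re.compile(r"\$\d[\d,]*"),
}


def record_score(chunk: str) -> float:
    """Fraction of ontology fields that occur exactly once in the chunk."""
    hits = 0
    for pattern in ONTOLOGY.values():
        occurrences = sum(1 for _ in pattern.finditer(chunk))
        if occurrences == 1:
            hits += 1
    return hits / len(ONTOLOGY)


def segmentation_score(chunks: Sequence[str]) -> float:
    """Average record score over all chunks of one candidate segmentation."""
    if not chunks:
        return 0.0
    return sum(record_score(c) for c in chunks) / len(chunks)


def best_segmentation(candidates: Sequence[List[str]]) -> List[str]:
    """Return the candidate segmentation with the highest heuristic score."""
    return max(candidates, key=segmentation_score)


if __name__ == "__main__":
    # Three candidate ways to cut the same text into records.
    well_split = ["1995 Honda Civic $3,500", "1998 Ford Escort $4,200"]
    joined = ["1995 Honda Civic $3,500 1998 Ford Escort $4,200"]
    over_split = ["1995 Honda Civic", "$3,500", "1998 Ford Escort", "$4,200"]
    print(best_segmentation([joined, over_split, well_split]))
```

In this toy setting the joined candidate scores 0.0 (each field occurs twice in its single chunk), the over-split candidate scores 0.5 (each chunk holds only one of the two fields), and the well-split candidate scores 1.0, so it is selected. A realistic version would have to generate the candidate rearrangements from the HTML structure, which this sketch leaves out.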
