Lecture Notes in Computer Science, 2001, Volume 2070/2001, 165-174, DOI: 10.1007/3-540-45517-5_20

Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web

Luo Xiao, Dieter Wissmann, Michael Brown and Stefan Jablonski

Abstract

This paper describes information extraction for applications that automatically fill templates from HTML input documents. We developed a complete system to extract information from Web sites. The system can use a number of algorithms to learn the document structure, the rules and keywords that locate specific information, and the spatial relations between different information items. Experiments with a well-known data set show a substantial performance improvement over standard wrapper systems.
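
To make the idea of keyword-driven template filling concrete, the following is a minimal sketch and not the authors' system. It assumes a hand-written keyword per template slot (the paper describes learning these) and a single spatial relation: the value is the text segment that immediately follows the segment containing the keyword. The names `TextCollector`, `fill_template`, and the sample page are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text segments of an HTML document in order."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(text)


def fill_template(html, slot_keywords):
    """Fill each slot with the text segment that follows its keyword.

    slot_keywords maps a slot name to a keyword (an assumption of this
    sketch; the keywords are given by hand here, not learned).
    """
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    segments = collector.segments

    template = {slot: None for slot in slot_keywords}
    # The last segment has no successor, so it cannot act as a label.
    for i, segment in enumerate(segments[:-1]):
        for slot, keyword in slot_keywords.items():
            if template[slot] is None and keyword.lower() in segment.lower():
                # Spatial relation assumed here: value directly follows label.
                template[slot] = segments[i + 1]
    return template


# Hypothetical input page with label/value pairs in table cells.
page = (
    "<table>"
    "<tr><td>Price:</td><td>120 EUR</td></tr>"
    "<tr><td>Location:</td><td>Erlangen</td></tr>"
    "</table>"
)
result = fill_template(page, {"price": "Price", "location": "Location"})
print(result)
```

This sketch treats the page as a flat sequence of text segments, so it ignores the document structure that the described system learns, and it would also pick up text inside script or style elements. It is meant only to show how a keyword and a simple positional relation can map HTML content onto template slots.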