
Information Extraction — Tree Alignment Approach to Pattern Discovery in Web Documents

Ajay Hemnani and Stephane Bressan

National University of Singapore, 3 Science Drive 2, 117543 Singapore
Abstract
The World Wide Web has now entered its mature age. It not only hosts and serves large numbers of pages but also offers large amounts of information potentially useful to individuals and businesses. Modern decision support can no longer be effective without timely and accurate access to this unprecedented source of data. However, unlike in a database, the structure of data available on the Web is not known a priori, and understanding it seems to require human intervention. Yet in many cases the conjunction of layout rules and simple domain knowledge enables the automatic understanding of such unstructured data. In such cases we say that the data is semi-structured. Wrapper generation for the automatic extraction of information from the Web has therefore been a crucial challenge in recent years. Various authors have suggested different approaches for extracting semi-structured data from the Web, ranging from analyzing the layout and syntax of Web documents to learning extraction rules from users' training examples. In this paper, we propose to exploit the HTML structure of Web documents that contain information in the form of multiple homogeneous records. We use a Tree Alignment algorithm with a novel combination of heuristics to detect repeated patterns and infer extraction rules. The performance study shows that our approach is effective in practice, yielding practical performance and accurate results.

Contact Information Ajay Hemnani
Email: hemnania@comp.nus.edu.sg

Contact Information Stephane Bressan
Email: steph@comp.nus.edu.sg