Book Chapter
Information Extraction — Tree Alignment Approach to Pattern Discovery in Web Documents
Book Series: Lecture Notes in Computer Science
Publisher: Springer Berlin / Heidelberg
ISSN: 0302-9743 (Print), 1611-3349 (Online)
Volume: 2453/2002
Book: Database and Expert Systems Applications
DOI (book): 10.1007/3-540-46146-9
Copyright: 2002
ISBN: 978-3-540-44126-7
DOI (chapter): 10.1007/3-540-46146-9_78
Pages: 789-798
Subject Collection: Computer Science
SpringerLink Date: Tuesday, January 01, 2002
Information Extraction — Tree Alignment Approach to Pattern Discovery in Web Documents
Ajay Hemnani and Stephane Bressan
National University of Singapore, 3 Science Drive 2, 117543 Singapore
Abstract
The World Wide Web has now entered its mature age. It not only hosts and serves large numbers of pages but also offers large amounts of information potentially useful to individuals and businesses. Modern decision support can no longer be effective without timely and accurate access to this unprecedented source of data. However, unlike in a database, the structure of data available on the Web is not known a priori, and understanding it seems to require human intervention. Yet in many cases the conjunction of layout rules and simple domain knowledge enables the automatic understanding of such unstructured data. In such cases we say that the data is semi-structured. Wrapper generation for the automatic extraction of information from the Web has therefore been a crucial challenge in recent years. Various authors have suggested different approaches for extracting semi-structured data from the Web, ranging from analyzing the layout and syntax of Web documents to learning extraction rules from users' training examples. In this paper, we propose to exploit the HTML structure of Web documents that contain information in the form of multiple homogeneous records. We use a Tree Alignment algorithm with a novel combination of heuristics to detect repeated patterns and infer extraction rules. The performance study shows that our approach is effective in practice, yielding practical performance and accurate results.
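To make the idea of detecting repeated records from HTML structure concrete, the following is a minimal sketch only. It is not the authors' algorithm or heuristics, which the abstract does not specify. It swaps in the classic Simple Tree Matching recurrence as a stand-in for tree alignment. It assumes the HTML uses explicit closing tags, and all names (`Node`, `TreeBuilder`, `match`, `similarity`, `repeated_children`) and the 0.8 threshold are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have a closing tag; they are added as leaves.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    return 1 + sum(size(child) for child in node.children)


def match(a, b):
    """Simple Tree Matching: count of node pairs in a best top-down alignment."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + match(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1


def similarity(a, b):
    """Normalized match score in [0, 1]; 1.0 means identical tag structure."""
    return 2 * match(a, b) / (size(a) + size(b))


def repeated_children(node, threshold=0.8):
    """Indices of children whose structure resembles an adjacent sibling."""
    hits = set()
    for i in range(len(node.children) - 1):
        if similarity(node.children[i], node.children[i + 1]) >= threshold:
            hits.update((i, i + 1))
    return sorted(hits)


page = ("<ul><li><h2>Head</h2></li>"
        "<li><b>A</b><i>1</i></li>"
        "<li><b>B</b><i>2</i></li>"
        "<li><b>C</b><i>3</i></li></ul>")
ul = parse(page).children[0]
print(repeated_children(ul))  # [1, 2, 3]
```

In this toy page the heading item differs structurally from the three record items, so only the latter are flagged as a repeated pattern. A real wrapper generator would go further and derive extraction rules from the aligned subtrees.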
Ajay Hemnani, Email: hemnania@comp.nus.edu.sg
Stephane Bressan, Email: steph@comp.nus.edu.sg