
Post-supervised Template Induction for Dynamic Web Sources

Zhongmin Shi (5), Evangelos Milios and Nur Zincir-Heywood

(5) Faculty of Computer Science, Dalhousie University, Halifax, N.S., Canada, B3H 1W5
Abstract
Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific template is straightforward but time-consuming, so it is desirable to generate template extraction programs automatically from examples of lists and tables in HTML documents. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples, while minimally involving the user to achieve high accuracy. We have developed unsupervised algorithms to determine the number of rows and adopted a dynamic programming algorithm for extracting columns. Our system, called TIDE (Template Induction for web Data Extraction), achieves high performance while requiring minimal user input compared to fully supervised techniques.

Contact Information Evangelos Milios
Email: eem@cs.dal.ca

Contact Information Nur Zincir-Heywood
Email: zincir@cs.dal.ca


