
SCOOP: A Record Extractor without Knowledge on Input

Yasuhiro Yamada, Daisuke Ikeda and Sachio Hirokawa

(3)  Graduate School of Information Science and Electrical Engineering, Kyushu University, 812-8581 Fukuoka, Japan
(4)  Computing and Communications Center, Kyushu University, 812-8581 Fukuoka, Japan
Abstract
We present SCOOP, a record extraction system. We assume that the semi-structured documents given to SCOOP share a similar format and that each of them contains only one record, consisting of several distinct fields. SCOOP treats a document as a plain string and uses no knowledge about the input except that each field is surrounded by delimiters, that a left delimiter ends with “>”, and that the corresponding right delimiter begins with “<”. By counting substrings, SCOOP roughly divides a document into two parts: the contents of the fields and everything else. It then counts the substrings near the boundaries between the two parts and extracts the most frequent ones as delimiters. We report experimental results on news articles written in English or Japanese, where a record consists of the headline and the body text. In these experiments, SCOOP extracts records at a high rate.
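The delimiter idea described in the abstract can be illustrated with a small sketch. This is not SCOOP's actual algorithm, which the abstract does not specify in detail. It is a simplified stand-in under stated assumptions: delimiters are approximated by fixed-width contexts (the hypothetical `width` parameter), a field must occur exactly once per document, and a field's content must vary across documents. The function names `candidate_fields` and `extract_records` are illustrative only.

```python
import re
from collections import Counter


def candidate_fields(doc, width=12):
    """Yield ((left, right), content) for every text run enclosed by '>' and '<'."""
    for m in re.finditer(r">([^<>]+)<", doc):
        content = m.group(1).strip()
        if not content:
            continue
        gt, lt = m.start(), m.end() - 1  # positions of the enclosing '>' and '<'
        left = doc[max(0, gt + 1 - width):gt + 1]  # assumed left delimiter, ends with '>'
        right = doc[lt:lt + width]                 # assumed right delimiter, begins with '<'
        yield (left, right), content


def extract_records(docs, width=12):
    """Return (delimiter pairs, one list of field contents per document)."""
    per_doc = []
    for doc in docs:
        counts, fields = Counter(), {}
        for pair, content in candidate_fields(doc, width):
            counts[pair] += 1
            fields[pair] = content
        # Assumption: a field occurs once per record, so drop pairs repeated in a document.
        per_doc.append({p: c for p, c in fields.items() if counts[p] == 1})

    # Count in how many documents each surviving pair appears.
    doc_freq = Counter(p for fields in per_doc for p in fields)
    # Keep pairs present in every document whose content is not constant
    # (constant text is treated as template, not as field content).
    delimiters = [
        p for p, n in doc_freq.items()
        if n == len(docs) and len({fields[p] for fields in per_doc}) > 1
    ]
    records = [[fields[p] for p in delimiters] for fields in per_doc]
    return delimiters, records


# Made-up example pages sharing one template (not data from the paper).
TEMPLATE = ('<html><ul><li>Home</li><li>News</li></ul>'
            '<h1 class="hl">{}</h1><div class="body">{}</div></html>')
docs = [
    TEMPLATE.format("Rain hits Fukuoka", "Heavy rain fell on Monday."),
    TEMPLATE.format("Election results", "Votes were counted overnight."),
]
delimiters, records = extract_records(docs)
for record in records:
    print(record)
```

The sketch needs at least two documents, since it tells field contents apart from template text by checking whether the text between a delimiter pair changes from one document to the next.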

Yasuhiro Yamada
Email: yshiro@matu.cc.kyushu-u.ac.jp

Daisuke Ikeda
Email: daisuke@cc.kyushu-u.ac.jp

Sachio Hirokawa
Email: hirokawa@cc.kyushu-u.ac.jp