SCOOP: A Record Extractor without Knowledge on Input
| Book Series | Lecture Notes in Computer Science |
| Publisher | Springer Berlin / Heidelberg |
| ISSN | 0302-9743 (Print) 1611-3349 (Online) |
| Volume | Volume 2226/2001 |
| Book | Discovery Science |
| DOI (book) | 10.1007/3-540-45650-3 |
| Copyright | 2001 |
| ISBN | 978-3-540-42956-2 |
| DOI (chapter) | 10.1007/3-540-45650-3_45 |
| Pages | 482-487 |
| Subject Collection | Computer Science |
| SpringerLink Date | Monday, January 01, 2001 |
Yasuhiro Yamada (3), Daisuke Ikeda (4) and Sachio Hirokawa (4)
(3) Graduate School of Information Science and Electrical Engineering, Kyushu University, 812-8581 Fukuoka, Japan
(4) Computing and Communications Center, Kyushu University, 812-8581 Fukuoka, Japan
Abstract
We present SCOOP, a record extraction system. We assume that the semi-structured documents given to SCOOP share a similar format
and that each contains exactly one record consisting of several distinct fields. SCOOP treats a document as a plain string and uses
no knowledge of its input except that each field is surrounded by delimiters, that a left delimiter ends with “>”, and that the
corresponding right delimiter begins with “<”. By counting substrings, SCOOP roughly divides each document into two parts: the
contents of the fields and everything else. It then counts the substrings near the boundaries between the two parts and extracts
the most frequent ones as delimiters. We report experimental results on news articles written in English or Japanese. In these
experiments, a record consists of a headline and its body text. SCOOP extracts records at a high rate.
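
The abstract gives only an outline of the method, so the following Python sketch is an illustration of the general idea rather than SCOOP's actual algorithm. It assumes a fixed context width around each text run, counts in how many documents each (left, right) delimiter pair occurs, and keeps the pairs that appear in every document while enclosing text that varies between documents. The function names, the `width` parameter, and the sample documents are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def delimiter_pairs(doc, width=4):
    """Yield (left, right, text) for each non-blank text run between '>' and '<'."""
    for m in re.finditer(r">([^<>]+)<", doc):
        text = m.group(1)
        if not text.strip():
            continue
        start, end = m.start(1), m.end(1)
        left = doc[max(0, start - width):start]  # candidate left delimiter, ends with '>'
        right = doc[end:end + width]             # candidate right delimiter, begins with '<'
        yield left, right, text


def guess_field_delimiters(docs, width=4):
    """Return delimiter pairs found in every document whose enclosed text varies."""
    doc_count = Counter()         # number of documents in which each pair occurs
    contents = defaultdict(set)   # distinct texts seen between each pair
    for doc in docs:
        seen = set()
        for left, right, text in delimiter_pairs(doc, width):
            contents[(left, right)].add(text)
            seen.add((left, right))
        doc_count.update(seen)
    # Simplifying assumption: template text is identical across documents,
    # while field contents differ from one document to the next.
    return [pair for pair in contents
            if doc_count[pair] == len(docs) and len(contents[pair]) > 1]


def extract_record(doc, delimiters, width=4):
    """Return the field contents of one document, in document order."""
    wanted = set(delimiters)
    return [text for left, right, text in delimiter_pairs(doc, width)
            if (left, right) in wanted]


# Made-up sample documents sharing one template (not data from the paper).
docs = [
    "<html><h1>Rain expected</h1><p>Heavy rain is forecast.</p><i>News Corp</i></html>",
    "<html><h1>Markets rise</h1><p>Stocks closed higher.</p><i>News Corp</i></html>",
    "<html><h1>Team wins</h1><p>The final score was 3-1.</p><i>News Corp</i></html>",
]

delims = guess_field_delimiters(docs)
print(extract_record(docs[0], delims))
# ['Rain expected', 'Heavy rain is forecast.']
```

In this sketch the footer text "News Corp" is rejected because it is identical in every document, which stands in for the abstract's rough split between field contents and everything else. A fixed `width` is a convenience of the example; the abstract does not say how long the counted substrings are.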