
Extracting Characteristic Structures among Words in Semistructured Documents

Kazuyoshi Furukawa, Tomoyuki Uchida, Kazuya Yamada, Tetsuhiro Miyahara, Takayoshi Shoudai and Yasuaki Nakamura

(4)  Faculty of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Japan
(5)  Department of Informatics, Kyushu University, Kasuga 816-8580, Japan
Abstract
The number of electronic documents such as SGML/HTML/XML files and LaTeX files has been increasing rapidly, driven by the progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since many semistructured documents contain large amounts of plain text, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k ≥ 2 be an integer and let (W_1, W_2, ..., W_k) be a list of words sorted in lexicographical order. First, we define a tree-association pattern on (W_1, W_2, ..., W_k) as a sequence 〈t_1; t_2; ...; t_{k-1}〉 of labeled rooted trees such that, for i = 1, 2, ..., k-1, either (1) t_i consists of only one node having the pair of words W_i and W_{i+1} as its label, or (2) t_i is a labeled rooted tree which has exactly two leaves, labeled with W_i and W_{i+1}, respectively. Next, we present a text mining algorithm for finding all frequent tree-association patterns in semistructured documents. Finally, we report experimental results showing that our algorithm is effective for extracting structural characteristics in semistructured documents.
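
The definition of a tree-association pattern can be made concrete with a small sketch. The Python code below is illustrative only and is not the authors' algorithm: the `Node` class, the function names, and the example tag labels are hypothetical. It assumes that "respectively" means the two leaves appear in left-to-right order, and that lexicographical order coincides with Python's default string ordering. It checks only whether a sequence of trees satisfies conditions (1) and (2); it does not compute pattern frequency.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

# A node label is either a single string (a word or a tag name)
# or a pair of words, as in condition (1) of the definition.
Label = Union[str, Tuple[str, str]]


@dataclass
class Node:
    """Hypothetical labeled rooted tree; children are ordered left to right."""
    label: Label
    children: List["Node"] = field(default_factory=list)


def leaf_labels(tree: Node) -> List[Label]:
    """Return the labels of all leaves, left to right."""
    if not tree.children:
        return [tree.label]
    labels: List[Label] = []
    for child in tree.children:
        labels.extend(leaf_labels(child))
    return labels


def links_words(tree: Node, w1: str, w2: str) -> bool:
    """Check conditions (1) and (2) for one tree t_i and words W_i, W_{i+1}."""
    if not tree.children:
        # Condition (1): a single node labeled with the pair of words.
        return tree.label == (w1, w2)
    # Condition (2): exactly two leaves, labeled w1 and w2
    # (left-to-right order is an assumption of this sketch).
    return leaf_labels(tree) == [w1, w2]


def is_tree_association_pattern(words: Sequence[str], trees: Sequence[Node]) -> bool:
    """Check that `trees` is a tree-association pattern on the word list `words`."""
    k = len(words)
    if k < 2 or len(trees) != k - 1:
        return False
    if list(words) != sorted(words):
        return False  # the word list must be in lexicographical order
    return all(links_words(t, words[i], words[i + 1]) for i, t in enumerate(trees))


# Example with k = 3; the tag names "p" and "b" are made up for illustration.
words = ["data", "mining", "text"]
t1 = Node(("data", "mining"))                                   # condition (1)
t2 = Node("p", [Node("b", [Node("mining")]), Node("text")])     # condition (2)
print(is_tree_association_pattern(words, [t1, t2]))
```

In this example, `t1` stands for two words occurring in the same text node, while `t2` stands for two words occurring under different descendants of a common ancestor element.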

Kazuyoshi Furukawa
Email: k_furukawa@toc.cs.hiroshima-cu.ac.jp

Tomoyuki Uchida
Email: uchida@cs.hiroshima-cu.ac.jp

Kazuya Yamada
Email: kazuy@toc.cs.hiroshima-cu.ac.jp

Tetsuhiro Miyahara
Email: miyahara@its.hiroshima-cu.ac.jp

Takayoshi Shoudai
Email: shoudai@i.kyushu-u.ac.jp

Yasuaki Nakamura
Email: nakamura@cs.hiroshima-cu.ac.jp