Extracting Characteristic Structures among Words in Semistructured Documents
Kazuyoshi Furukawa (4), Tomoyuki Uchida (4), Kazuya Yamada (4), Tetsuhiro Miyahara (4), Takayoshi Shoudai (5) and Yasuaki Nakamura (4)
(4) Faculty of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Japan
(5) Department of Informatics, Kyushu University, Kasuga 816-8580, Japan
Abstract
Electronic documents such as SGML/HTML/XML files and LaTeX files have been increasing rapidly with the progress of network
and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since
many semistructured documents contain large amounts of plain text, we focus on the structural characteristics among words in semistructured
documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem
of finding all frequent structured patterns among words in semistructured documents. Let $k \geq 2$ be an integer and let $(W_1, W_2, \ldots, W_k)$ be a list of words sorted in lexicographical order. First, we define a tree-association pattern on $(W_1, W_2, \ldots, W_k)$ as a sequence $\langle t_1; t_2; \ldots; t_{k-1} \rangle$ of labeled rooted trees such that, for $i = 1, 2, \ldots, k-1$, either (1) $t_i$ consists of only one node having the pair of words $W_i$ and $W_{i+1}$ as its label, or (2) $t_i$ is a labeled rooted tree which has exactly two leaves, labeled with $W_i$ and $W_{i+1}$, respectively. Next, we present a text mining algorithm for finding all frequent tree-association patterns in semistructured documents. Finally, by reporting experimental results on our algorithm, we show that it is effective for extracting structural characteristics in semistructured documents.
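
The definition of a tree-association pattern can be made concrete with a small sketch. The Python below only checks whether a given sequence of trees satisfies conditions (1) and (2) for a sorted word list; it is not the mining algorithm of the paper, which the abstract does not describe. The names `Node`, `leaves`, `satisfies_condition` and `is_tree_association_pattern` are hypothetical, as are the assumptions that internal nodes carry tag names as labels and that the two leaves may appear in either order.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# A label is a single string (a word or, by assumption, a tag name)
# or a pair of words, as in condition (1).
Label = Union[str, Tuple[str, str]]


@dataclass
class Node:
    """A node of a labeled rooted tree (hypothetical representation)."""
    label: Label
    children: List["Node"] = field(default_factory=list)


def leaves(t: Node) -> List[Node]:
    """Return the leaves of t in left-to-right order."""
    if not t.children:
        return [t]
    result: List[Node] = []
    for child in t.children:
        result.extend(leaves(child))
    return result


def satisfies_condition(t: Node, w_i: str, w_next: str) -> bool:
    """Check condition (1) or (2) of the definition for a single tree."""
    if not t.children:
        # Condition (1): a single node labeled with the pair (W_i, W_{i+1}).
        return t.label == (w_i, w_next)
    # Condition (2): exactly two leaves, labeled W_i and W_{i+1}.
    # Assumption: the order of the two leaves does not matter.
    labels = [leaf.label for leaf in leaves(t)]
    return labels in ([w_i, w_next], [w_next, w_i])


def is_tree_association_pattern(words: List[str], trees: List[Node]) -> bool:
    """Check that trees is a tree-association pattern on words."""
    k = len(words)
    if k < 2 or words != sorted(words) or len(trees) != k - 1:
        return False
    return all(
        satisfies_condition(trees[i], words[i], words[i + 1])
        for i in range(k - 1)
    )
```

For example, on the word list `["mining", "text", "tree"]`, a single node labeled `("mining", "text")` followed by a tree whose two leaves are labeled `"text"` and `"tree"` forms a valid pattern under these assumptions.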