The rapid progress of computer and network technologies makes it easy to collect and store a large amount of unstructured
or semi-structured texts such as Web pages, HTML/XML archives, E-mails, and text files. These text data can be thought of
large scale text databases, and thus it becomes important to develop an efficient tools to discover interesting knowledge
from such text databases.
There are a large body of data mining researches to discover interesting rules or patterns from well-structured data such
as transaction databases with boolean or numeric attributes [1,8,13]. However, it is difficult to directly apply the traditional data mining technologies to text or semi-structured data mentioned
above since these text databases consist of (i) heterogeneous and (ii) huge collections of (iii) un-structured or semi-structured
data. Therefore, there still have been a small number of studies on text mining, e.g., [4,5,12,17].
Our research goal is to devise an efficient semi-automatic tool that supports human discovery from large text databases. Therefore,
we require a fast pattern discovery algorithm that can work in time, e.g., O(n) to O(n log n), to respond in real time on an unstructured data set of total size n. Furthermore, such an algorithm has to be robust in the sense that it can work on a large amount of noisy and incomplete
data without the assumption of an unknown hypothesis class.
To achieve this goal, we adopt the framework of optimized pattern discovery [11], also known as Agnostic PAC learning [10] in computational learning theory. In optimized pattern discovery, an algorithm tries to find a pattern from a hypothesis
space that optimizes a given statistical measure, such as classification error [10], information entropy [11], and Gini index [6], to discriminate a set of interesting documents from a set of uninteresting ones. In the recent developments in computational learning theory, it is shown that such an algorithm can approximate arbitrary
distributions on data within a given class of hypotheses very well in the sense of classification accuracy [6,10].
This work is partially supported by the Ministry of Education, Science, Sports, and Culture, Grant-in-Aid for Scientific Research
on Priority Areas Informatics (No. 14019070) 2002.