We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive
mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these
requirements through linear-time classification and clustering engines which are applied interactively to speed the development
of accurate models.
Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive
Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our
enhancements include using Lidstone’s law of succession instead of Laplace’s law, under-weighting long documents, and over-weighting
author and subject.
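
To make the smoothing choice concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier with Lidstone smoothing, not Athena's actual implementation. The function names, the data layout, and the value lam=0.1 are assumptions made for illustration. Setting lam=1.0 recovers Laplace's law. The document-length and author/subject weightings are omitted because their exact form is not given here.

```python
import math
from collections import Counter, defaultdict


def train(docs, lam=0.1):
    """Fit a multinomial Naive Bayes model.

    docs: list of (label, tokens) pairs.
    lam:  Lidstone smoothing parameter (assumed value; lam=1.0 is Laplace's law).
    """
    vocab = {tok for _, toks in docs for tok in toks}
    word_counts = defaultdict(Counter)  # per-class token counts
    class_docs = Counter()              # number of documents per class
    for label, toks in docs:
        class_docs[label] += 1
        word_counts[label].update(toks)
    return {
        "vocab": vocab,
        "lam": lam,
        "counts": word_counts,
        "totals": {c: sum(word_counts[c].values()) for c in class_docs},
        "log_prior": {c: math.log(n / len(docs)) for c, n in class_docs.items()},
    }


def log_likelihood(model, label, token):
    """log P(token | label) = log((count + lam) / (total + lam * |V|))."""
    lam = model["lam"]
    num = model["counts"][label][token] + lam
    den = model["totals"][label] + lam * len(model["vocab"])
    return math.log(num / den)


def classify(model, tokens):
    """Return the class with the highest posterior log-score."""
    scores = {}
    for label, prior in model["log_prior"].items():
        score = prior
        for tok in tokens:
            if tok in model["vocab"]:  # skip tokens never seen in training
                score += log_likelihood(model, label, tok)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy usage with made-up data.
toy_docs = [
    ("sports", ["ball", "game", "team"]),
    ("tech", ["code", "bug", "code"]),
]
toy_model = train(toy_docs, lam=0.1)
toy_prediction = classify(toy_model, ["code", "game", "code"])
```

With a small lam, unseen tokens receive less probability mass than under Laplace's law, so observed counts dominate the estimate more strongly.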
We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate
cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification
algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves
considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative
clustering methods.
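
The three C-Evolve stages described above can be sketched roughly as follows. This is an illustrative outline under strong assumptions, not the algorithm itself: the digests are taken as given (how they are found is not specified here), user feedback is modeled as a hypothetical merge map, and the completion step uses a small Lidstone-smoothed Naive Bayes scorer as a stand-in classifier. All names are hypothetical.

```python
import math
from collections import Counter


def apply_merges(digests, merges):
    """Fold digests together according to user feedback.

    digests: {digest_id: [token_list, ...]}  (partial clusters, assumed given)
    merges:  {digest_id: target_digest_id}   (hypothetical feedback format)
    """
    merged = {}
    for digest_id, docs in digests.items():
        target = merges.get(digest_id, digest_id)
        merged.setdefault(target, []).extend(docs)
    return merged


def score(doc, cluster, digests, counts, totals, vocab, n_docs, lam):
    """Naive Bayes log-score of doc for one cluster, with Lidstone smoothing."""
    s = math.log(len(digests[cluster]) / n_docs)
    for tok in doc:
        if tok in vocab:  # ignore tokens absent from every digest
            s += math.log((counts[cluster][tok] + lam)
                          / (totals[cluster] + lam * len(vocab)))
    return s


def complete_partition(digests, unassigned, lam=0.1):
    """Assign every remaining document to its best-scoring digest."""
    vocab = {tok for docs in digests.values() for d in docs for tok in d}
    n_docs = sum(len(docs) for docs in digests.values())
    counts = {c: Counter(tok for d in docs for tok in d)
              for c, docs in digests.items()}
    totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
    clusters = {c: list(docs) for c, docs in digests.items()}
    for doc in unassigned:
        best = max(digests, key=lambda c: score(
            doc, c, digests, counts, totals, vocab, n_docs, lam))
        clusters[best].append(doc)
    return clusters


# Toy usage with made-up digests; the user merges digest "b" into "a".
toy_digests = {
    "a": [["ball", "game"]],
    "b": [["team", "ball"]],
    "c": [["code", "bug"]],
}
toy_merged = apply_merges(toy_digests, {"b": "a"})
toy_clusters = complete_partition(
    toy_merged, [["bug", "code", "fix"], ["game", "team"]])
```

The digest model here is trained once on the merged digests and is not updated as documents are assigned; whether C-Evolve retrains during completion is not stated above.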