Lecture Notes in Computer Science, 2003, Volume 2821/2003, 254-266, DOI: 10.1007/978-3-540-39451-8_19

Automatic Document Categorization
Interpreting the Performance of Clustering Algorithms

Benno Stein and Sven Meyer zu Eissen

Abstract

Clustering a document collection is the current approach to automatically deriving underlying document categories. The categorization performance of a document clustering algorithm can be captured by the F-Measure, which quantifies how closely the clustering resembles a human-defined categorization.
However, a bad F-Measure value tells us nothing about why a clustering algorithm performs poorly. Among the several possible explanations, the most interesting question is the following: Are the implicit assumptions of the clustering algorithm admissible with respect to a document categorization task?
Though the use of clustering algorithms for document categorization is widely accepted, no foundation or rationale has been stated for this admissibility question. This paper addresses that gap. It presents considerations and a measure to quantify the sensitivity of a clustering process with regard to geometric distortions of the data space. Along with the method of multidimensional scaling, this measure provides an instrument for assessing a clustering algorithm’s adequacy.
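
As an illustration of the F-Measure mentioned above, the following is a minimal Python sketch. The abstract does not state the formula, so the sketch assumes one definition commonly used in document clustering: each reference class is matched with the cluster that gives the best harmonic mean of precision and recall, and the results are averaged weighted by class size. The function name `clustering_f_measure` and its arguments are hypothetical and not taken from the paper.

```python
from collections import Counter


def clustering_f_measure(reference, clustering):
    """Class-size-weighted F-Measure of a clustering against reference labels.

    Assumed definition (not taken from the abstract): for every reference
    class, take the best F value over all clusters, then average these
    values weighted by the relative class size.
    """
    if len(reference) != len(clustering):
        raise ValueError("label lists must have equal length")
    n = len(reference)
    if n == 0:
        raise ValueError("label lists must not be empty")

    class_sizes = Counter(reference)        # documents per reference class
    cluster_sizes = Counter(clustering)     # documents per cluster
    overlap = Counter(zip(reference, clustering))  # documents per (class, cluster)

    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            common = overlap[(cls, clu)]
            if common == 0:
                continue  # no shared documents, F is zero for this pair
            precision = common / clu_size
            recall = common / cls_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += cls_size / n * best
    return total
```

Under this assumed definition, a clustering that reproduces the reference classes exactly scores 1.0, while merging two equally sized classes into a single cluster lowers the value because precision drops to 0.5 for both classes.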

Keywords: Document Categorization - Clustering - F-Measure - Multidimensional Scaling - Information Visualization
