Lecture Notes in Computer Science, 2002, Volume 2291/2002, 248-267, DOI: 10.1007/3-540-45886-7_17

Uncertainty-Based Noise Reduction and Term Selection in Text Categorization

C. Peters and C. H. A. Koster

Abstract

This paper introduces a new criterion for term selection, which is based on the notion of Uncertainty. Term selection according to this criterion is performed by eliminating noisy terms on a class-by-class basis, rather than by selecting the most significant ones. Uncertainty-based term selection (UC) is compared to a number of other criteria, namely Information Gain (IG), simplified χ² (SX), Term Frequency (TF) and Document Frequency (DF), in a Text Categorization setting. Experiments on data sets with different properties (Reuters-21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC-based term selection is not the most aggressive term selection criterion, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general “install-and-forget” term selection mechanism. We also describe and evaluate a hybrid Term Selection technique, which first applies UC to eliminate noisy terms and then uses another criterion to select the best terms.
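
The two-stage shape of the hybrid technique can be illustrated with a minimal sketch. The abstract does not define the Uncertainty measure, so the noise filter below is a caller-supplied placeholder (`is_noisy`, a hypothetical name) and should not be read as the authors' UC criterion. The second stage uses Document Frequency, one of the criteria named above. The function name `hybrid_select` and the parameter `k` are likewise assumptions made for illustration, and the sketch handles the documents of a single class.

```python
from collections import Counter
from typing import Callable, Iterable, List


def document_frequency(docs: Iterable[List[str]]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): each document counts a term once
    return df


def hybrid_select(
    docs: List[List[str]],
    is_noisy: Callable[[str, int], bool],
    k: int,
) -> List[str]:
    """Two-stage selection over the documents of one class.

    Stage 1 drops every term the placeholder predicate flags as noisy.
    Stage 2 keeps the k surviving terms with the highest document frequency.
    `is_noisy` is a stand-in, NOT the Uncertainty criterion of the paper.
    """
    df = document_frequency(docs)
    survivors = [term for term, count in df.items() if not is_noisy(term, count)]
    # Sort by descending document frequency, ties broken alphabetically.
    survivors.sort(key=lambda term: (-df[term], term))
    return survivors[:k]


if __name__ == "__main__":
    class_docs = [
        ["oil", "price", "the"],
        ["oil", "the", "xyzzy"],
        ["price", "oil", "the"],
    ]
    # Placeholder noise rule: a term seen in fewer than two documents.
    print(hybrid_select(class_docs, lambda term, count: count < 2, k=2))
```

Passing the noise test as a function keeps the pipeline independent of any particular noise measure, so a different first-stage filter or second-stage ranking can be swapped in without changing the overall structure.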
