We focus on two recently proposed algorithms in the family of “boosting”-based learners for automated text classification,
ADABOOST.MH and ADABOOST.MH^KR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the
idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization
experiments so far.
A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence
or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the “weighted” representations
(consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much
more informative rendition of the document’s content than binary representations.
In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods.
We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the
version with discretized continuous attributes outperforms the version with traditional binary representations.
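
To make the idea of entropy-based discretization concrete, the following is a minimal sketch, not the procedure evaluated in this paper. It assumes a single continuous term weight per document and a binary category label, and it selects only one cut point (the one that minimizes the class entropy of the two resulting intervals); the methods actually used may choose several cut points and apply a stopping criterion. The function names `entropy`, `best_cut`, and `binarize`, as well as the toy data, are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(weights, labels):
    """Return (threshold, information_gain) for the best single cut point.

    Candidate thresholds are the midpoints between consecutive distinct
    weight values. The candidate that minimizes the size-weighted entropy
    of the two resulting intervals (i.e. maximizes information gain) wins.
    Returns (None, 0.0) if no cut yields a positive gain.
    """
    pairs = sorted(zip(weights, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        split_entropy = (len(left) / n) * entropy(left) \
            + (len(right) / n) * entropy(right)
        gain = base - split_entropy
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


def binarize(weights, threshold):
    """Map each continuous weight to 1 if above the threshold, else 0."""
    return [1 if w > threshold else 0 for w in weights]


# Toy data (assumed): weights of one term in six documents, and whether
# each document belongs to the category of interest.
weights = [0.0, 0.05, 0.1, 0.4, 0.5, 0.9]
labels = [0, 0, 0, 1, 1, 1]

threshold, gain = best_cut(weights, labels)
binary_feature = binarize(weights, threshold)
print(threshold, gain, binary_feature)
```

In this sketch the resulting binary feature records whether the term weight exceeds the learned threshold rather than whether the term merely occurs, which is one way a learner restricted to binary inputs could still make use of weighted representations.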