Lecture Notes in Computer Science, 2001, Volume 2167/2001, 419-430, DOI: 10.1007/3-540-44795-4_36

Second Order Features for Maximising Text Classification Performance

Bhavani Raskutti, Herman Ferrá and Adam Kowalczyk

Abstract

The paper demonstrates that adding automatically selected word-pairs substantially increases the accuracy of text classification, contrary to most previously reported research. The word-pairs are selected using a technique based on frequencies of n-grams (sequences of characters), which takes into account both the frequencies of word-pairs and the context in which they occur.
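
As a rough illustration of what augmenting a bag-of-words representation with word-pairs can look like, here is a minimal Python sketch. It is not the selection technique of the paper, which is based on character n-gram frequencies and context and is not specified in this abstract. The sketch substitutes a much simpler rule: keep an adjacent word-pair if it occurs at least `min_count` times in the corpus. The function names, the threshold and the toy documents are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; an assumption made for this sketch only.
    return text.lower().split()


def adjacent_pairs(tokens):
    # Join each pair of neighbouring words into a single feature name.
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def select_pairs(docs, min_count=2):
    # Stand-in selection rule: keep pairs seen at least min_count times.
    # This is NOT the n-gram-based selection described in the abstract.
    counts = Counter()
    for doc in docs:
        counts.update(adjacent_pairs(tokenize(doc)))
    return {pair for pair, count in counts.items() if count >= min_count}


def featurize(doc, selected_pairs):
    # First-order features (single words) plus the selected word-pairs.
    tokens = tokenize(doc)
    features = Counter(tokens)
    features.update(p for p in adjacent_pairs(tokens) if p in selected_pairs)
    return features


docs = [
    "operating system update released",
    "new operating system announced",
    "system update delayed",
]
selected = select_pairs(docs)
features = featurize(docs[0], selected)
```

The resulting feature counts could then be passed to any classifier that accepts sparse term counts, such as an SVM or kNN.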
These improvements are reported for two classifiers, support vector machines (SVM) and k-nearest neighbours (kNN), and two text corpora. For the first corpus, a collection of articles from PC Week magazine, the addition of word-pairs increases micro-averaged breakeven accuracy by more than 6 percentage points from a baseline accuracy (without pairs) of around 40%. For the second, the standard Reuters benchmark, an SVM classifier augmented with pairs outperforms all previously reported results.
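
For readers unfamiliar with the metric, the following Python sketch shows one common way to compute a micro-averaged precision/recall breakeven point. It assumes that classifier scores are comparable across categories and pools all (document, category) decisions into one ranking; the abstract does not say how the authors computed the figure, so this is an illustration only. The function name, category names and scores are hypothetical.

```python
def micro_breakeven(scores, labels):
    # Pool every (score, true label) decision across all categories.
    pooled = [
        (s, y)
        for category in scores
        for s, y in zip(scores[category], labels[category])
    ]
    n_pos = sum(y for _, y in pooled)
    if n_pos == 0:
        raise ValueError("no positive labels; breakeven is undefined")
    # Rank decisions from highest to lowest score.
    pooled.sort(key=lambda item: item[0], reverse=True)
    # Predict exactly n_pos positives: precision and recall then share
    # the denominator n_pos, so they are equal (the breakeven point).
    true_pos = sum(y for _, y in pooled[:n_pos])
    return true_pos / n_pos


# Toy data: 4 documents scored against 2 categories (1 = relevant).
scores = {
    "hardware": [0.9, 0.2, 0.7, 0.1],
    "software": [0.3, 0.8, 0.6, 0.4],
}
labels = {
    "hardware": [1, 0, 0, 1],
    "software": [0, 1, 1, 0],
}
breakeven = micro_breakeven(scores, labels)
```

In this toy example, three of the four top-ranked decisions are correct, so precision and recall both equal 0.75.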