Lecture Notes in Computer Science, 2005, Volume 3248/2005, 456-465, DOI: 10.1007/978-3-540-30211-7_48

A Comparative Study on the Use of Labeled and Unlabeled Data for Large Margin Classifiers

Hiroya Takamura and Manabu Okumura

Abstract

We propose using both labeled and unlabeled data with the Expectation-Maximization (EM) algorithm to estimate a generative model, and then using this model to construct a Fisher kernel. A Naive Bayes generative model is used to represent each document. Through text categorization experiments, we show empirically that (a) the Fisher kernel with labeled and unlabeled data outperforms Naive Bayes classifiers with EM and other methods when a sufficient amount of labeled data is available, (b) the value of additional unlabeled data diminishes when the labeled data set is large enough to estimate a reliable model, (c) using categories as latent variables is effective, and (d) larger unlabeled training datasets yield better results.
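
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a multinomial Naive Bayes model with Laplace smoothing, takes the Fisher score only with respect to the class-conditional word probabilities, and approximates the Fisher information matrix by the identity; the abstract does not specify these choices. All function names are hypothetical.

```python
import numpy as np

def posteriors(X, log_prior, log_theta):
    """P(c | x) under multinomial Naive Bayes; shape (n_docs, n_classes)."""
    lj = X @ log_theta.T + log_prior          # log pi_c + sum_w x_w log theta_{w|c}
    lj = lj - lj.max(axis=1, keepdims=True)   # stabilise before exponentiating
    p = np.exp(lj)
    return p / p.sum(axis=1, keepdims=True)

def m_step(X, R, alpha=1.0):
    """Re-estimate parameters from responsibilities R (assumed Laplace smoothing)."""
    prior = (R.sum(axis=0) + alpha) / (R.sum() + alpha * R.shape[1])
    counts = R.T @ X + alpha                  # expected word counts per class
    theta = counts / counts.sum(axis=1, keepdims=True)
    return np.log(prior), np.log(theta)

def fit_em(X_l, y_l, X_u, n_classes, n_iter=20):
    """EM over labeled (fixed one-hot) and unlabeled (soft) documents."""
    R_l = np.eye(n_classes)[y_l]              # labeled responsibilities stay fixed
    log_prior, log_theta = m_step(X_l, R_l)   # initialise from labeled data only
    X_all = np.vstack([X_l, X_u])
    for _ in range(n_iter):
        R_u = posteriors(X_u, log_prior, log_theta)                   # E-step
        log_prior, log_theta = m_step(X_all, np.vstack([R_l, R_u]))   # M-step
    return log_prior, log_theta

def fisher_scores(X, log_prior, log_theta):
    """Gradient of log p(x) w.r.t. theta_{w|c}: P(c|x) * x_w / theta_{w|c}."""
    P = posteriors(X, log_prior, log_theta)
    theta = np.exp(log_theta)
    U = P[:, :, None] * X[:, None, :] / theta[None, :, :]
    return U.reshape(len(X), -1)

def fisher_kernel(X_a, X_b, log_prior, log_theta):
    """Fisher kernel with the information matrix approximated by the identity."""
    S_a = fisher_scores(X_a, log_prior, log_theta)
    S_b = fisher_scores(X_b, log_prior, log_theta)
    return S_a @ S_b.T
```

The resulting Gram matrix could then be passed to any large margin classifier that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.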
