In many learning problems, labeled examples are rare or expensive while numerous unlabeled and positive examples are available.
However, most learning algorithms only use labeled examples. Thus we address the problem of learning with the help of positive
and unlabeled data given a small number of labeled examples. We present both theoretical and empirical arguments showing that
learning algorithms can be improved by the use of both unlabeled and positive data. As an illustrative problem, we consider
the statistics-based learning algorithm for monotone conjunctions in the presence of classification noise and give empirical
evidence supporting our assumptions. We give theoretical results for the improvement of Statistical Query learning algorithms from
positive and unlabeled data. Lastly, we apply these ideas to tree induction algorithms. We modify the code of C4.5 to get
an algorithm which takes as input a set LAB of labeled examples, a set POS of positive examples and a set UNL of unlabeled
data and which uses these three sets to construct the decision tree. We provide experimental results based on data taken from
the UCI repository that confirm the relevance of this approach.
Key words: PAC model - Statistical Queries - Unlabeled Examples - Positive Examples - Decision Trees - Data Mining
This research was partially supported by “Motricité et Cognition”: Contrat par objectifs région Nord/Pas-de-Calais.