The sequence labeling problem is commonly encountered in many natural language and query processing tasks. SVM^struct
is a supervised learning algorithm that provides a flexible and effective way to solve this problem. However, a large number
of training examples is often required to train SVM^struct, which can be costly for many applications that generate long and complex sequence data. This paper proposes an active learning
technique to select the most informative subset of unlabeled sequences for annotation by choosing sequences that have the largest
uncertainty in their prediction. A unique aspect of active learning for sequence labeling is that it should take into consideration
the effort spent on labeling sequences, which depends on the sequence length. The proposed technique uses dynamic programming
to identify the best subset of sequences to be annotated, taking into account both the uncertainty
and the labeling effort. Experimental results show that our SVM^struct active learning technique can significantly reduce the number of sequences to be labeled while outperforming other existing
techniques.
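
The trade-off between uncertainty and labeling effort described above can be illustrated with a small sketch. It is not the formulation used in this paper, which the abstract does not spell out. It assumes, purely for illustration, that each unlabeled sequence has a scalar uncertainty score, that its labeling effort equals its length in tokens, and that the annotator has a fixed token budget. Under those assumptions the selection reduces to a 0/1 knapsack problem, which dynamic programming solves exactly. The function name `select_sequences` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_sequences(
    uncertainty: Sequence[float],
    lengths: Sequence[int],
    budget: int,
) -> Tuple[float, List[int]]:
    """Pick the subset with maximal total uncertainty whose total length fits the budget.

    Assumption (not from the paper): labeling effort is the sequence length,
    given as a positive integer, and uncertainty scores are additive.
    """
    n = len(uncertainty)
    # best[i][b]: max total uncertainty using only the first i sequences
    # with at most b units of labeling effort.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score, cost = uncertainty[i - 1], lengths[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: leave sequence i-1 unlabeled
            if cost <= b:
                candidate = best[i - 1][b - cost] + score  # option 2: label it
                if candidate > best[i][b]:
                    best[i][b] = candidate

    # Walk the table backwards to recover which sequences were selected.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


if __name__ == "__main__":
    total, chosen = select_sequences([0.9, 0.6, 0.5], [5, 2, 3], budget=5)
    print(chosen, round(total, 3))
```

In this toy input, picking only the single most uncertain sequence (index 0) would use the whole budget, whereas the dynamic program selects the two shorter sequences (indices 1 and 2), whose combined uncertainty is higher for the same effort.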
Keywords: Active Learning - Struct Support Vector Machine - Uncertainty - Sequence Labeling - Natural Language Processing - Subphrase Generation
The work was performed when the first author worked as a summer intern at Yahoo, Inc.