Machine-generated documents containing semi-structured text are rapidly becoming the bulk of the data stored in an organisation.
Given a feature-based representation of such data, methods like SVMs are able to construct good models for information extraction
(IE). But how are the feature-definitions to be obtained in the first place? (We are referring here to the representation
problem: selecting good features from the ones defined comes later.) So far, features have been defined manually or by using
special-purpose programs; neither approach scales well to the heterogeneity of the data or to new domain-specific information.
We suggest that Inductive Logic Programming (ILP) could assist in this. Specifically, we demonstrate the use of ILP to define
features for seven IE tasks using two disparate sources of information. Our findings are as follows: (1) the ILP system is
able to efficiently identify large numbers of good features. Typically, the time taken to identify the features is comparable
to the time taken to construct the predictive model; and (2) SVM models constructed with these ILP-features are better than
the best models reported to date that rely heavily on hand-crafted features. For the ILP practitioner, we also present evidence supporting
the claim that, for IE tasks, using an ILP system to assist in constructing an extensional representation of text data (in
the form of features and their values) is better than using it to construct intensional models for the tasks (in the form
of rules for information extraction).
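
To make the notion of an extensional representation concrete, the following is a minimal sketch in Python. It assumes each feature is a Boolean test on a token and its neighbours, and it tabulates the feature values that a model such as an SVM would take as input. The three features, the token fields (`word`, `pos`) and the example sentence are hypothetical illustrations written by hand; they are not features found by the ILP system described here, which would be expressed as first-order clauses.

```python
from typing import Callable, Dict, List

Token = Dict[str, str]                          # e.g. {"word": "Dr.", "pos": "NNP"}
Feature = Callable[[List[Token], int], bool]    # Boolean test on token i of a sentence

# Hypothetical stand-ins for ILP-defined features (hand-written for illustration).
TITLES = {"Dr.", "Prof.", "Mr.", "Ms."}

def capitalised(sent: List[Token], i: int) -> bool:
    """True if the token starts with an upper-case letter."""
    return sent[i]["word"][:1].isupper()

def preceded_by_title(sent: List[Token], i: int) -> bool:
    """True if the previous token is a personal title."""
    return i > 0 and sent[i - 1]["word"] in TITLES

def followed_by_verb(sent: List[Token], i: int) -> bool:
    """True if the next token carries a verb part-of-speech tag."""
    return i + 1 < len(sent) and sent[i + 1]["pos"].startswith("VB")

FEATURES: List[Feature] = [capitalised, preceded_by_title, followed_by_verb]

def featurise(sent: List[Token], i: int) -> List[int]:
    """Extensional representation of token i: one 0/1 value per feature."""
    return [int(f(sent, i)) for f in FEATURES]

# Hypothetical example sentence with assumed part-of-speech tags.
sentence: List[Token] = [
    {"word": "Dr.", "pos": "NNP"},
    {"word": "Smith", "pos": "NNP"},
    {"word": "will", "pos": "MD"},
    {"word": "speak", "pos": "VB"},
]

# One row per token; these rows are what a feature-based learner would consume.
matrix = [featurise(sentence, i) for i in range(len(sentence))]

for tok, row in zip(sentence, matrix):
    print(tok["word"], row)
```

In this sketch the learner sees only the table of feature values. An intensional model would instead use rules of this kind directly as the extraction procedure.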