Toward a Computational Theory of Data Acquisition and Truthing
David G. Stork
Ricoh California Research Center, 2882 Sand Hill Road Suite 115, Menlo Park, CA, 94025-7022
Abstract
The creation of a pattern classifier requires choosing or creating a model, collecting training data and verifying or “truthing”
this data, and then training and testing the classifier. In practice, individual steps in this sequence must be repeated a
number of times before the classifier achieves acceptable performance. The majority of the research in computational learning
theory addresses the issues associated with training the classifier (learnability, convergence times, generalization bounds,
etc.). While there has been modest research effort on topics such as cost-based collection of data in the context of a particular
classifier model, there remain numerous unsolved problems of practical importance associated with the collection and truthing
of data. Many of these can be addressed with the formal methods of computational learning theory. A number of these issues,
as well as new ones — such as the identification of “hostile” contributors and their data — are brought to light by the Open
Mind Initiative, where data is openly contributed over the World Wide Web by non-experts of varying reliabilities. This paper
states generalizations of formal results on the relative value of labeled and unlabeled data to the realistic case where a
labeler is not a foolproof oracle but is instead somewhat unreliable and error-prone. It also summarizes formal results on
strategies for presenting data to labelers of known reliability in order to obtain best estimates of model parameters. It
concludes with a call for a rich, powerful and practical computational theory of data acquisition and truthing, built upon
the concepts and techniques developed for studying general learning systems.
Keywords: monitoring data quality, data truthing, open data collection, anomalous data detection, learning with queries, cost-based learning, Open Mind Initiative