Linking records from two or more databases is an increasingly important data preparation step in many data mining projects,
as linked data can enable studies that are not feasible otherwise, or that would require expensive collection of specific
data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record
linkage is the accurate classification of record pairs into matches and non-matches. Many modern classification techniques
are based on supervised machine learning and thus require training data, which is often not available in real-world situations.
A novel two-step approach to unsupervised record pair classification is presented in this paper. In the first step, training
examples are selected automatically, and they are then used in the second step to train a binary classifier. An experimental
evaluation shows that this approach can outperform k-means clustering and can also be much faster than other classification techniques.
Keywords: data linkage, entity resolution, clustering, support vector machines, data mining preprocessing
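
The two-step idea can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper and rests on several assumptions that the abstract does not state: each record pair is represented by a vector of per-field similarity values in [0, 1]; seed examples are taken to be the vectors closest to the all-ones and all-zeros corners; and a simple nearest-centroid rule stands in for the binary classifier. The function names and the toy data are hypothetical.

```python
from math import dist


def select_seeds(vectors, n_match, n_non_match):
    """Step 1: select training examples without any labels.

    Assumption: vectors hold per-field similarities in [0, 1], so those
    nearest to (1, ..., 1) are likely matches and those nearest to
    (0, ..., 0) are likely non-matches.
    """
    dims = len(vectors[0])
    ones, zeros = (1.0,) * dims, (0.0,) * dims
    by_match = sorted(vectors, key=lambda v: dist(v, ones))
    by_non_match = sorted(vectors, key=lambda v: dist(v, zeros))
    return by_match[:n_match], by_non_match[:n_non_match]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def train_classifier(matches, non_matches):
    """Step 2: train a binary classifier on the selected seeds.

    A nearest-centroid rule is used here purely as a stand-in; any
    binary classifier could be trained on the same seed examples.
    """
    match_centre = centroid(matches)
    non_match_centre = centroid(non_matches)

    def classify(vector):
        # True means "match", False means "non-match".
        return dist(vector, match_centre) < dist(vector, non_match_centre)

    return classify


# Hypothetical similarity vectors for six record pairs (three fields each).
pairs = [
    (1.0, 0.9, 1.0),
    (0.9, 1.0, 0.8),
    (0.1, 0.0, 0.2),
    (0.0, 0.1, 0.0),
    (0.7, 0.8, 0.6),
    (0.3, 0.2, 0.4),
]

matches, non_matches = select_seeds(pairs, n_match=2, n_non_match=2)
classify = train_classifier(matches, non_matches)
labels = [classify(v) for v in pairs]
```

In this toy run the two clearest pairs at each extreme serve as seeds, and the two ambiguous pairs in the middle are then labelled by the trained classifier rather than by a fixed threshold.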