Welcome!
To use the personalized features of this site, please log in or register.
If you have forgotten your username or password, we can help.
|
 |
Identifying and Filtering Near-Duplicate Documents
| |
|
Identifying and Filtering Near-Duplicate Documents
Andrei Z. Broder6 
| (6) |
AltaVista Company, San Mateo, CA 94402, USA |
Abstract
The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance
can be estimated using a fixed size “sketch” for each document. For a large collection of documents (say hundreds of millions)
the size of this sketch is of the order of a few hundred bytes per document.
However, for efficient large scale web indexing it is not necessary to determine the actual resemblance value: it suffices
to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other
words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination
can be made using a “sample” of less than 50 bytes per document.
The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection
problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently
for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above
can be applied to arbitrary sets, and thus might be of independent interest.
The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for
the last three years in the context of the AltaVista search engine.
Most of this work was done while the author was at Compaq’s System Research Center in Palo Alto. A preliminary version of
this work was presented (but not published) at the “Fun with Algorithms” conference, Isola d’Elba, 1998.
Fulltext Preview (Small, Large)
 References secured to subscribers.
|
|
|
|
|
|