Welcome!
To use the personalized features of this site, please log in or register.
If you have forgotten your username or password, we can help.
My Menu
Saved Items

Identifying and Filtering Near-Duplicate Documents

Andrei Z. BroderContact Information

(6)  AltaVista Company, San Mateo, CA 94402, USA
Abstract
The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size “sketch” for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document.
However, for efficient large scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a “sample” of less than 50 bytes per document.
The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest.
The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.
Most of this work was done while the author was at Compaq’s System Research Center in Palo Alto. A preliminary version of this work was presented (but not published) at the “Fun with Algorithms” conference, Isola d’Elba, 1998.

Contact Information Andrei Z. Broder
Email: andrei.broder@av.com
Fulltext Preview (Small, Large)
Image of the first page of the fulltext

References secured to subscribers.



Export this chapter
Export this chapter as RIS | Text
 
Remote Address: 38.107.191.109 • Server: mpweb15
HTTP User Agent: CCBot/1.0 (+http://www.commoncrawl.org/bot.html)