paper

On the Resemblance and Containment of Documents

  • Authors:

📜 Abstract

We develop the mathematical foundations for the analysis of the resemblance and containment of documents as used, for example, in the AltaVista web search engine. We use a simple and efficient fingerprinting scheme introduced by Manber for this process. More generally, we investigate four functions: resemblance and containment (for estimating the similarity of pairs of documents) and max and min (which allow in particular for very accurate estimates of document size). We provide the theoretical underpinnings for the particular implementations used for finding and resolving near-duplicates in a very large scale application. We combine techniques from analysis, probability theory, and linear algebra in order to do this.

✨ Summary

This paper by Andrei Z. Broder, published in 1997, introduces a mathematical framework for understanding document similarity with applications to web search engines, like AltaVista. It uses a fingerprinting technique established by Manber, focusing on four key metrics: resemblance, containment, max, and min. The approach is grounded in analysis, probability theory, and linear algebra, which aids in detecting and resolving near-duplicate documents.

The influence of this work can be seen in the development of algorithms for web-scale duplicate detection and in projects related to improving the efficiency of search engines. Broder’s metrics have become foundational in information retrieval, especially for web indexing and the management of large datasets. The use of min-wise independent permutations, outlined in the paper, has been widely adopted in similarity estimations in massive data sets.

Citations: 1. Finding Similar Items in a Large Collection 2. Near Duplicate Detection: The Impact of Thresholds on Performance 3. Efficient Web Search and Clustering 4. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL 5. A Theoretical Model of Sequence Alignment