Near-Duplicate Detection
Posted by Dan Lecocq on January 27, 2015
Duplication of content on the Internet is inevitable, whether it be intentional or accidental, malicious or coincidental. There are a number of reasons that it's important to identify duplication, and almost as many approaches. We've historically used a small handful of techniques, and here we'll talk about them and how we've come to use what we use today.
Why Does Duplication Matter?
The primary reason to identify duplication is that it can cause problems with search engine rankings. From the search engines' view, it can represent cruft on the Internet and make it difficult to determine the definitive source. If you have several different URLs that serve effectively the same content, it can also dilute your link juice across many pages rather than aggregating it on a single, higher-ranking page. Using a rel canonical tag helps the search engines by identifying the definitive source for your content and concentrating your link juice at the same time. Common causes of duplication include:
- 'www' vs. 'non-www' versions of the same page
- pagination
- pages that differ only by a parameter, especially sort order
- different parameter orders, like '?a=1&b=2' and '?b=1&a=2'
What to Do About It?
Once you've detected duplicated content, you'll typically decide one of two things. The pages might really be intended to be unique content, and perhaps they only need to be better differentiated. Alternatively, you might discover that the pages are essentially the same with only subtle changes, in which case you would designate one as the primary source using rel canonical. While Moz tools do a good job of giving you insight into your duplicates over time, when you're actively fixing these issues it can be helpful to get more immediate feedback with spot checks using tools like webconfs.
How to Identify Duplication
There are many different ways that machines (that is, search engines and Moz) can attempt to identify duplicate content.
- Bag of Words - Comparing the words and frequency of those words on a page with those on another page.
- Shingling - This method improves on the "Bag of Words" approach by comparing short phrases, providing some context to the words.
- Hashing - This method further improves the process by eliminating the need to store copies of all of the content. The phrases are hashed into numbers, which can then be compared to identify duplication.
- MinHash - Helps streamline the process of storing content hashes.
- SimHash - Further streamlines the process of storing content hashes and duplication detection.
Bag of Words
Let's suppose that we've got a set of documents, and for each one we'd like to be able to find its duplicates. For this, we'd go to each document and then check every other to see if it's a match, so let's describe what we mean by a 'match.' Naïvely, we might begin by treating each document as a bag of words, so a document like "a bump on the log in the hole in the bottom of the sea" is transformed to the set of the words it contains:

{a, in, of, on, the, log, sea, bump, hole, bottom}

To compare two documents, we figure out how many words they have in common and compare that to the number of distinct words that appear in either of them. For those playing at home, this is called the Jaccard similarity coefficient. So if we compare our example document to another one, "a frog on the bump on the log in the hole in the bottom of the sea", we see that they have lots of words in common relative to the total number of distinct words. These, then, are considered very similar.
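As a rough sketch of that comparison (the jaccard() helper and the plain Python sets here are purely illustrative, not how a search engine would actually store documents):

```python
# A quick sketch of the bag-of-words comparison using plain Python sets.

def jaccard(a, b):
    """Jaccard similarity: shared elements over all distinct elements."""
    return len(a & b) / len(a | b)

doc1 = "a bump on the log in the hole in the bottom of the sea"
doc2 = "a frog on the bump on the log in the hole in the bottom of the sea"

bag1, bag2 = set(doc1.split()), set(doc2.split())
print(jaccard(bag1, bag2))  # ~0.91 -- the two documents look nearly identical
```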



Shingling
The problem with the "Bag of Words" approach is that it ignores the context of words, in particular the words that surround any given word. As the above example illustrates, this is important. So, instead of treating each document as a bag of words, let's turn each document into a set of overlapping phrases of a few words at a time. This approach is known as shingling, because each phrase overlaps its neighbors, much like shingles on a roof. So for "Your mother drives you in the car," we'd get a set like this:

{ 'your mother drives', 'mother drives you', 'drives you in', 'you in the', 'in the car' }

Applying this approach and comparing the number of phrases shared by our sentences relative to the total number of phrases, we see that they are actually quite different:
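As a sketch of how much context changes the numbers, here are our two example documents from above compared both as bags of words and as sets of three-word shingles (the shingles() helper and the choice of k=3 are just for illustration):

```python
# A sketch of shingling: overlapping three-word phrases instead of single words.

def shingles(text, k=3):
    """All overlapping k-word phrases in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

doc1 = "a bump on the log in the hole in the bottom of the sea"
doc2 = "a frog on the bump on the log in the hole in the bottom of the sea"

print(jaccard(set(doc1.split()), set(doc2.split())))  # bag of words: ~0.91
print(jaccard(shingles(doc1), shingles(doc2)))        # shingles: ~0.69
```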


Hashing
One of the problems with this bag of phrases approach is that we actually have to store the document several times over. In particular, if we use phrases of k words, then each word will appear in k phrases (except for words at the beginning and end), making the storage requirement O(nk), where n is the number of words in the document. To make our lives easier, we'll turn each phrase into a number representative of the phrase using a hash function. So now instead of a bag of phrases, we have a bag of numbers on which we can still perform the same set intersection. To avoid confusion, we'll refer to these as 'phrase hashes.'
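Here's a sketch of that step, using a truncated MD5 digest as the hash function (the phrase_hash() name and the choice of hash are arbitrary here; any well-behaved hash of the phrase string would do):

```python
import hashlib

def phrase_hash(phrase):
    """Map a phrase to a 64-bit integer (truncated MD5, purely illustrative)."""
    digest = hashlib.md5(phrase.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def shingles(text, k=3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

doc = "a bump on the log in the hole in the bottom of the sea"
phrase_hashes = {phrase_hash(p) for p in shingles(doc)}
# The same Jaccard comparison works on this bag of numbers as on the phrases.
```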
MinHash
While hashing reduces our per-document storage requirement to O(n) in the length of the document, it's still too much. A lot of pages are really big, and comparing two documents of lengths n and m to determine their overlap still takes at least O(n + m) time.
MinHash is an approach that uses constant storage, independent of the document length, while still producing a good estimate of our similarity measure. It distills each document down to a fixed-size set of hashes that acts as a rough signature of that document. This feat is accomplished by using a set of k randomizing hash functions. For each randomizing hash function, denoted π_i, we pass all of the document's phrase hashes through it and keep the minimum resulting value, denoted m_i.
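A minimal sketch of building such a signature, assuming the randomizing hash functions are simple universal hashes of the form (a*x + b) mod p and that we keep 128 of them; the names and parameters are assumptions for illustration, not a definitive implementation:

```python
import random

PRIME = (1 << 61) - 1   # a large Mersenne prime
NUM_HASHES = 128        # k randomizing hash functions

random.seed(0)
COEFFS = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
          for _ in range(NUM_HASHES)]

def minhash_signature(phrase_hashes):
    """For each randomizing hash function pi_i, keep the minimum value m_i
    it assigns to any phrase hash in the document."""
    return [min((a * h + b) % PRIME for h in phrase_hashes)
            for a, b in COEFFS]

def estimated_similarity(sig1, sig2):
    """The fraction of positions where two signatures agree approximates
    the Jaccard similarity of the underlying phrase-hash sets."""
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

sig_a = minhash_signature({1, 2, 3, 4, 5, 6, 7, 8, 9, 10})
sig_b = minhash_signature({1, 2, 3, 4, 5, 6, 7, 8, 9, 11})
print(estimated_similarity(sig_a, sig_b))  # roughly the true Jaccard of 9/11
```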


Simhash
Simhashing is a technique that streamlines this even further. Given a set of input hashes, simhash produces a single hash with an interesting property -- two similar sets of input hashes produce similar resultant hashes. This is what makes simhash unusual: most hash functions are chosen for the opposite property, where slightly different inputs produce very different outputs. To compute a simhash, for each bit position we count the number of input hashes with that bit set and subtract the number of input hashes with that bit not set. Once done, every position with a negative counter is set to 0, and every other position is set to 1.
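Here's a sketch of that bit-counting step for 64-bit input hashes (the simhash() function name and the 64-bit width are assumptions for illustration):

```python
def simhash(input_hashes, bits=64):
    """Combine a set of input hashes into one simhash (sketch only)."""
    counts = [0] * bits
    for h in input_hashes:
        for i in range(bits):
            # +1 if this bit is set in the input hash, -1 if it is not.
            counts[i] += 1 if (h >> i) & 1 else -1
    # Positions with a negative counter become 0; every other position becomes 1.
    result = 0
    for i, count in enumerate(counts):
        if count >= 0:
            result |= 1 << i
    return result

# Two similar sets of input hashes yield simhashes that differ in few bits.
a = simhash({0xDEADBEEF, 0xCAFEBABE, 0x12345678})
b = simhash({0xDEADBEEF, 0xCAFEBABE, 0x12345679})
print(bin(a ^ b).count("1"))  # a small Hamming distance
```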


