Duplicate Content: Why It Happens and How to Fix It

What is duplicate content?

Duplicate content is content that appears on the Internet in more than one place. That “one place” is defined as a location with a unique website address (URL), so if the same content appears at more than one web address, you’ve got duplicate content.

While not technically a penalty, duplicate content can still sometimes impact search engine rankings. When there are multiple pieces of, as Google calls it, “appreciably similar” content in more than one location on the Internet, it can be difficult for search engines to decide which version is more relevant to a given search query. Other websites may use your content without consent, creating confusion for search engines and diluting backlinks.

Why does duplicate content matter?

For search engines

Duplicate content can present three main issues for search engines:

  1. They don’t know which version(s) to include/exclude from their indices.
  2. They don’t know whether to direct the link metrics (trust, authority, link equity, etc.) to one page, or keep them split across multiple versions.
  3. They don’t know which version(s) to rank for query results.

Implementing a canonical URL can help search engines identify the preferred version of a page, consolidating link value and improving SEO performance.

For site owners

When duplicate content is present, site owners can suffer rankings and traffic losses. These losses often stem from two main problems:

  1. To provide the best search experience, search engines will rarely show multiple versions of the same content, and thus are forced to choose which version is most likely to be the best result. This dilutes the visibility of each of the duplicates.
  2. Link equity can be further diluted because other sites have to choose between the duplicates as well. Instead of all inbound links pointing to one piece of content, they link to multiple pieces, spreading the link equity among the duplicates. Because inbound links are a ranking factor, this can then impact the search visibility of a piece of content.

The net result? A piece of content doesn't achieve the search visibility it otherwise would.

How does duplicate content impact SEO?

Duplicate content can have a significant impact on a website’s SEO performance. When search engines encounter multiple versions of the same content, they may struggle to determine which version is the most relevant and authoritative. This can lead to a few issues:

  • Diluted link equity: When multiple versions of the same content exist, link equity is split among them, reducing the overall authority of each page. This means that instead of one strong page, you have several weaker ones, which can negatively affect your search engine rankings.
  • Reduced crawl budget: Search engines allocate a specific crawl budget to each website, which is the number of pages they will crawl and index within a given timeframe. If search engines waste this budget on duplicate pages, it reduces the frequency and speed of crawling and indexing new and updated content. This can delay the appearance of fresh content in search results.
  • Lower search engine rankings: Duplicate content can lead to lower search engine rankings because search engines may not know which version of the content to prioritize. This confusion can result in none of the duplicate pages ranking as well as they could if there were a single, authoritative version.

By understanding these impacts, website owners can better appreciate the importance of identifying and resolving duplicate content issues to maintain optimal SEO performance.

How do duplicate content issues happen?

In the vast majority of cases, website owners don't intentionally create duplicate content. But that doesn't mean it's not out there. In fact, by some estimates, up to 29% of the web is actually duplicate content!

Let's take a look at some of the most common ways duplicate content is unintentionally created:

1. URL variations

URL parameters, such as click tracking and some analytics code, can cause duplicate content issues. The problem can be caused not only by the parameters themselves, but also by the order in which those parameters appear in the URL.

For example, www.example.com/widgets?color=blue&sort=price and www.example.com/widgets?sort=price&color=blue could both serve the same page, yet be treated by search engines as two distinct URLs.

Similarly, session IDs are a common duplicate content creator. This occurs when each user that visits a website is assigned a different session ID that is stored in the URL.

Printer-friendly versions of content can also cause duplicate content issues when multiple versions of the pages get indexed.

One lesson here is that when possible, it's often beneficial to avoid adding URL parameters or alternate versions of URLs (the information those contain can usually be passed through scripts).
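To see how parameter order and tracking parameters create duplicates, it helps to normalize URLs before comparing them. Below is a minimal Python sketch of that idea (the tracking parameter names are hypothetical examples, not an authoritative list): it strips common tracking parameters and sorts the rest, so variants collapse to one canonical form.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical tracking parameters; adjust to match your own analytics setup.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}

def normalize_url(url):
    """Return a canonical form of `url`: tracking parameters removed and
    remaining query parameters sorted, so parameter order no longer matters."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
    params.sort()
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

# Both variants below collapse to the same canonical URL:
a = normalize_url("https://example.com/widgets?color=blue&utm_source=news&size=large")
b = normalize_url("https://example.com/widgets?size=large&color=blue")
print(a == b)  # True
```

This is the same normalization logic that crawlers and audit tools apply internally when deciding whether two URLs are "the same" page.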

2. HTTP vs. HTTPS or WWW vs. non-WWW pages

If your site has separate versions at "www.site.com" and "site.com" (with and without the "www" prefix), and the same content lives at both versions, you've effectively created duplicates of each of those pages. The same applies to sites that maintain versions at both http:// and https://. If both versions of a page are live and visible to search engines, you may run into a duplicate content issue.

3. Scraped or copied content

Content includes not only blog posts or editorial content, but also product information pages. Scrapers republishing your blog content on their own sites may be a more familiar source of duplicate content, but there’s a common problem for e-commerce sites, as well: product information. If many different websites sell the same items, and they all use the manufacturer’s descriptions of those items, identical content winds up in multiple locations across the web. Using a self-referencing canonical tag on a page signifies it as the original, authoritative source, thus helping to protect the content from scrapers and ensure proper recognition by search engines.

How to find duplicate content

Finding duplicate content is an essential step in identifying and fixing duplicate content issues. Here are a few ways to do it:

Site audit

A site audit is a comprehensive analysis of a website’s technical SEO performance. Using tools like Ahrefs’ Site Audit or Semrush’s Site Audit, you can identify duplicate content issues, including:

  • Duplicate pages: These tools can scan your website and highlight pages that have identical or very similar content.
  • Duplicate titles and meta descriptions: They can also identify pages with the same titles and meta descriptions, which are common indicators of duplicate content.
  • Duplicate content across multiple URLs: The audit will reveal if the same content is accessible through different URLs, helping you pinpoint where canonicalization or redirects are needed.
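The exact-duplicate part of such an audit can be approximated with a short script. The sketch below (URLs and page text are made-up examples) fingerprints each page's normalized text and groups URLs that share a fingerprint; real audit tools go further and use fuzzy matching to catch near-duplicates as well.

```python
import hashlib
from collections import defaultdict

def content_fingerprint(text):
    """Hash page text after collapsing whitespace and case, so trivially
    reformatted copies still produce the same fingerprint."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(pages):
    """pages: dict mapping URL -> page text. Returns lists of URLs whose
    content is identical after normalization."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]

pages = {
    "https://example.com/widgets": "Blue widgets on sale",
    "https://example.com/widgets?sessionid=42": "Blue  widgets on sale",
    "https://example.com/about": "About our shop",
}
print(find_duplicates(pages))
# [['https://example.com/widgets', 'https://example.com/widgets?sessionid=42']]
```

In practice you would feed this the rendered text of each crawled page rather than raw strings, but the grouping logic is the same.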

Indexed pages

Checking indexed pages in Google Search Console can also help identify duplicate content issues. By analyzing the number of indexed pages and the types of pages that are being indexed, you can identify potential duplicate content issues. Here’s how:

  • Review the Coverage Report: This report shows which pages Google has indexed and any issues encountered. Look for pages that should not be indexed or multiple versions of the same page.
  • Inspect URLs: Use the URL Inspection tool to check individual pages. This can help you see if Google has indexed multiple versions of the same content.

By regularly auditing your site and monitoring indexed pages, you can quickly identify and address duplicate content issues, ensuring your website remains optimized for search engines.

How to fix duplicate content issues

Fixing duplicate content issues all comes down to the same central idea: specifying which of the duplicates is the "correct" one.

Whenever content on a site can be found at multiple URLs, it should be canonicalized for search engines. Let's go over the main ways to do this: using a 301 redirect to the correct URL, the rel=canonical attribute, the meta robots noindex tag, and the parameter handling tool in Google Search Console.

301 redirect

In many cases, the best way to combat duplicate content is to set up a 301 redirect from the "duplicate" page to the original content page.

When multiple pages with the potential to rank well are combined into a single page, they not only stop competing with one another; they also create a stronger relevancy and popularity signal overall. This will positively impact the "correct" page's ability to rank well.
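A 301 is normally configured at the server level (for example in an .htaccess file or nginx config), but the decision logic is simple enough to sketch in Python. The hostname below is hypothetical; the function returns the status code and Location header a server would send.

```python
CANONICAL_HOST = "www.example.com"  # hypothetical preferred hostname

def redirect_response(scheme, host, path):
    """Return a (status, location) pair implementing a 301 redirect from any
    non-canonical scheme/host variant to the canonical HTTPS + www URL.
    Returns (200, None) when the request is already canonical."""
    if scheme == "https" and host == CANONICAL_HOST:
        return 200, None
    return 301, f"https://{CANONICAL_HOST}{path}"

print(redirect_response("http", "example.com", "/blue-widgets"))
# (301, 'https://www.example.com/blue-widgets')
```

The same rule, expressed in your web server's rewrite syntax, is what consolidates the http/https and www/non-www variants discussed earlier into a single canonical page.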

Rel="canonical"

Another option for dealing with duplicate content is to use the rel=canonical attribute. This tells search engines that a given page should be treated as though it were a copy of a specified URL, and all of the links, content metrics, and "ranking power" that search engines apply to this page should actually be credited to the specified URL.

The rel="canonical" attribute is part of the HTML head of a web page and looks like this:

General format:

<head>...[other code that might be in your document's HTML head]...<link href="URL OF ORIGINAL PAGE" rel="canonical" />...[other code that might be in your document's HTML head]...</head>

The rel=canonical attribute should be added to the HTML head of each duplicate version of a page, with the "URL OF ORIGINAL PAGE" portion above replaced by a link to the original (canonical) page. (Make sure you keep the quotation marks.) The attribute passes roughly the same amount of link equity (ranking power) as a 301 redirect, and, because it's implemented at the page (instead of server) level, often takes less development time to implement.

Below is an example of what a canonical attribute looks like in action:

Using MozBar to identify canonical attributes.

Here, we can see BuzzFeed is using the rel=canonical attribute to accommodate its use of URL parameters (in this case, click tracking). Although this page is accessible via two URLs, the rel=canonical attribute ensures that all link equity and content metrics are credited to the original page (/no-one-does-this-anymore).

Meta robots noindex

One meta tag that can be particularly useful in dealing with duplicate content is the meta robots tag, used with the value "noindex,follow". Commonly called Meta Noindex,Follow (technically content="noindex,follow"), this meta robots tag can be added to the HTML head of each individual page that should be excluded from a search engine's index.

General format:

<head>...[other code that might be in your document's HTML head]...<meta name="robots" content="noindex,follow">...[other code that might be in your document's HTML head]...</head>

The meta robots tag allows search engines to crawl the links on a page but keeps them from including the page itself in their indices. It's important that the duplicate page can still be crawled, even though you're telling Google not to index it, because Google explicitly cautions against restricting crawl access to duplicate content on your website. (Search engines like to be able to see everything in case you've made an error in your code. It allows them to make a [likely automated] "judgment call" in otherwise ambiguous situations.)

Using meta robots is a particularly good solution for duplicate content issues related to pagination.

Preferred domain and parameter handling in Google Search Console

Google Search Console historically allowed you to set the preferred domain of your site (i.e. http://yoursite.com instead of http://www.yoursite.com) and to specify whether Googlebot should crawl various URL parameters differently (parameter handling).

Depending on your URL structure and the cause of your duplicate content issues, setting up either your preferred domain or parameter handling (or both!) could provide a solution. Note, however, that Google has since retired both features: the preferred domain setting was removed with the old version of Search Console, and the URL Parameters tool was deprecated in 2022, with Google now handling parameters automatically via canonicalization signals.

The main drawback to using parameter handling as your primary method for dealing with duplicate content is that the changes you make only work for Google. Any rules put in place using Google Search Console will not affect how Bing or any other search engine's crawlers interpret your site; you'll need to use the webmaster tools for other search engines in addition to adjusting the settings in Search Console.

Additional methods for dealing with duplicate content

  • Maintain consistency when linking internally throughout a website. For example, if a webmaster determines that the canonical version of a domain is www.example.com/, then all internal links should go to http://www.example.co... rather than http://example.com/pa... (notice the absence of www).
  • When syndicating content, make sure the syndicating website adds a link back to the original content and not a variation on the URL. (Check out our Whiteboard Friday episode on dealing with duplicate content for more information.)
  • To add an extra safeguard against content scrapers stealing SEO credit for your content, it's wise to add a self-referential rel=canonical link to your existing pages. This is a canonical attribute that points to the URL it's already on, the point being to thwart the efforts of some scrapers.

A self-referential rel=canonical link: The URL specified in the rel=canonical tag is the same as the current page URL.

While not all scrapers will port over the full HTML code of their source material, some will. For those that do, the self-referential rel=canonical tag will ensure your site's version gets credit as the "original" piece of content.
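If you want to verify that your pages carry self-referential canonicals, the check can be scripted. Below is a minimal Python sketch using only the standard library (the example URL is hypothetical); it parses a page's HTML head and confirms the canonical link points back to the page's own URL.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of any <link rel="canonical"> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "link" and attr_map.get("rel") == "canonical":
            self.canonical = attr_map.get("href")

def has_self_canonical(url, html_text):
    """True if the page at `url` declares itself as the canonical URL."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical == url

page = '<head><link href="https://example.com/post" rel="canonical" /></head>'
print(has_self_canonical("https://example.com/post", page))  # True
```

Run across a crawl of your site, a check like this quickly surfaces pages whose canonical tag is missing or points somewhere unexpected.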

Conclusion

Duplicate content is a common issue that can have a significant impact on a website’s SEO performance. By understanding the causes of duplicate content, identifying duplicate content issues, and implementing solutions like canonical tags and 301 redirects, website owners can improve their website’s crawlability, reduce link equity dilution, and increase their search engine rankings. Regular site audits and monitoring of indexed pages can help identify and fix duplicate content issues, ensuring that your website is optimized for search engines and provides a better user experience.

By taking these steps, you can ensure that your content is unique, authoritative, and easily discoverable by search engines, ultimately leading to better web search results and a stronger online presence.

Updated by Chima Mmeje — November 18, 2024.