Duplicate Content & SEO: A Practical Guide for Everyone
Published: May 4, 2016
Author: David Portney
Even if you’re not an SEO professional, it’s likely you’ve heard about the problem called “duplicate content”.
You may have even heard it called “The Duplicate Content Penalty” (it’s not a penalty; we’ll get to that shortly).
Even worse, you may have read somewhere that “duplicate content is not a problem” (they lied with that headline to get you to read that post, by the way).
But you may not be sure why you should care, or even what it really is, or what to do about it.
That’s all about to change, because I’m going to pull back the curtain on this topic and concisely give you the information you need to figure out whether you have a duplicate content issue, and what to do about it.
What is duplicate content?
Here’s how Google defines duplicate content:
“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.”
In plain terms: you have a duplicate content problem if one or more pieces of content are copied (duplicated) on other URLs (pages), whether on your own domain, on other domain(s), or both.
As the definition above makes clear, duplicate content can live on just your site, on your site plus some other site(s), or both, and it can be something you planned and did on purpose (intentional) or something you never wanted or decided to do at all (unintentional).
In just a minute, we’ll take a look at some common ways that people (websites, actually) wind up with both intentional and unintentional duplicate content, but first let’s talk about why you should care.
Why Is Duplicate Content a Problem?
To begin to answer this question, it’s important to understand how search engines like Google find web pages on the internet, store them, then retrieve them in response to a Google search (referred to as a “query”).
First, how does Google discover all those websites and web pages out there on the internet, anyway?
They do this by doing something referred to as “crawling” (also called “fetching”, or “spidering”).
They use computers – lots and lots of computers – to accomplish this, and those computers run a program called “Googlebot,” which crawls the internet by following links within a website and also following the links those pages point to on other sites.
When Googlebot arrives at your website and finishes crawling, it knows about all the pages on your website. Google is then able to store those pages (that takes a lot of computers too) in what they call their “index,” which you can think of as a giant library.
Then, when someone performs a search on Google, like a robot librarian at your service, they’re able to retrieve pages from the index and show them in search results. Google wants very badly to present you with results that satisfy your query. It is Google’s proprietary ranking algorithms that determine which pages (or other resources like images, videos, etc.) are the best match for that query.
Sidebar: the above process may be news to you. Until now you may have thought that when you search using Google (or Bing, or Yahoo! or DuckDuckGo) that you were searching the internet, the World Wide Web. You are not – you are searching their index of the Web. You’re welcome.
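That crawl, index, and retrieve pipeline can be sketched in a few lines of Python. This is a deliberately toy model with a made-up three-page “web” – not how Google actually does it – but it shows why the index, not the live internet, is what answers your query:

```python
# Toy model of the crawl -> index -> retrieve pipeline.
# The "web" here is a hypothetical dict of URL -> page text;
# a real crawler fetches pages over HTTP and follows their links.

web = {
    "example.com/home": "we sell organic coffee beans",
    "example.com/blog": "how to brew coffee at home",
    "example.com/about": "our story began in a small roastery",
}

# "Crawling": visit every page and split it into words.
# "Indexing": build an inverted index mapping each word to the
# set of pages that contain it -- the "giant library."
index = {}
for url, text in web.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

def search(query):
    """Return pages containing every word of the query."""
    results = None
    for word in query.split():
        pages = index.get(word, set())
        results = pages if results is None else results & pages
    return sorted(results or [])

print(search("coffee"))       # both pages that mention coffee
print(search("brew coffee"))  # only the blog post
```

Queries are answered entirely out of the prebuilt index – nothing is fetched from the “web” at search time, which is exactly the point of the sidebar above.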
So, imagine that when Google shows up at your 10,000-page website it discovers that each of those 10,000 pages has 10 copies – how does it know which copy you meant for them to index?
If you guessed “they don’t know,” you’re right; they don’t. If you have duplicated versions of pages on your site, Google’s forced to “guess” which ones they should index and rank.
At this point you might be thinking “so what? – if all the copies are duplicates, they can show any one they want,” and if it were just that simple, there’d be no such thing as a duplicate content problem.
But crawling and indexing take up computing power, and duplicate pages force Google to spend extra computing power crawling and storing the same content over and over.
Embarrassing confession time: back in 2000 I had taken up website creation as a hobby, and I bought multiple domains and put the exact same website on something like 12 domains, thinking I’d have “more fishing nets in the water”… I was absolutely mortified when, 8 years later, I discovered the duplicate content problem. Hold tight, more embarrassing confessions to come.
Back to the example of your 10,000-page website with 10 copies of each page: that alone makes Google’s job difficult, but now add in the fact that bloggers, forums, news sites, and your mom’s and uncle’s websites have each linked to this or that version of the same duplicated page on your site.
Now Google has a really hard time because those incoming links are like “votes” (crudely speaking) and are part of how Google ranks pages in their search results.
So those links, which are signals of authority and trust from other sites, just confuse Google even more than before about which page they should show a searcher in Google’s search results.
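To make the “votes” idea concrete, here’s a toy illustration with invented link counts (not a real ranking formula – just arithmetic on a hypothetical page):

```python
# Hypothetical inbound-link counts for five duplicate versions of
# the same page. The URLs and numbers are invented for illustration.
links = {
    "http://example.com/widget":            12,
    "http://www.example.com/widget":         9,
    "https://example.com/widget":            7,
    "http://example.com/widget?sessid=42":   3,
    "http://example.com/widget/index.html":  2,
}

# Spread across duplicates, no single URL looks very strong...
strongest_duplicate = max(links.values())

# ...but consolidated onto one canonical URL, the page's real
# authority is much clearer.
consolidated = sum(links.values())

print(strongest_duplicate, consolidated)  # 12 vs. 33
```

Thirty-three links pointing at one URL is a much stronger signal than twelve at the best of five scattered duplicates – that’s the dilution those split “votes” cause.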
What about the “Duplicate Content Penalty”?
Yeah, no. That’s actually a myth. You’ll find that penalty when you find Bigfoot and the Loch Ness Monster.
There really is not a penalty; it’s more like you’re shooting yourself in the foot. Google is not going to penalize you; you’re just making their life harder. And yours. You did read all that stuff above about how Google crawls, indexes, and ranks websites, didn’t you?
It’s not a penalty. Period.
What Causes Duplicate Content?
There are a variety of ways duplicate content gets created. Here’s a short summary:
- Inconsistent internal linking (we see this a lot. All. The. Time.)
- Sorting and filtering functions (think eCommerce sites)
- Scraper sites (bad people write scripts that grab your content and put it on their own website)
- Syndication (you put up a blog post that also goes on several other sites with your permission)
- Posting the same content to multiple websites you own (see my embarrassing confession above)
- URL session IDs and tracking parameters (just say no)
- Printer-friendly pages
- Improper tagging of blog posts
- Serving your home page at both / and /index.html (or .aspx or .php)
- Serving your site via www and non-www URLs (surprise! – still fairly common)
- Serving your site via http and https URLs
- Any combination of the above
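The last few items on that list multiply fast. Here’s a short Python sketch, using a hypothetical example.com, that enumerates the common accidental combinations of scheme, www, and index file:

```python
from itertools import product

def homepage_variants(domain):
    """Enumerate common accidental duplicates of one home page:
    http/https x www/non-www x with/without index.html."""
    variants = []
    for scheme, host, tail in product(
        ("http", "https"),
        (domain, "www." + domain),
        ("/", "/index.html"),
    ):
        variants.append(f"{scheme}://{host}{tail}")
    return variants

urls = homepage_variants("example.com")
print(len(urls))  # 8 -- eight addresses for one page
```

Three binary choices alone yield eight addresses for a single home page; add tracking parameters or session IDs and the count explodes from there.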
How to Uncover If You Have a Duplicate Content Problem
As noted above, there are a good many ways to wind up with a duplicate content problem, which also means there’s no single, easy way to detect it.
A quick ’n dirty way to check is to load any URL on your website in a browser, then see whether the page still loads after adding or removing the www in the URL.
So if your website is sumtingwong.com, you want to see if the page loads successfully at both http://www.sumtingwong.com and also http://sumtingwong.com.
Yes, those are 2 different addresses on the internet and therefore duplicate pages of each other.
If you’re really lucky (that’s sarcasm), maybe your home page also loads with and without /index.html on the end, and over both http and https.
Congratulations – you have 5 versions of your home page.
Embarrassing confession time again: I had a very similar problem with my websites where I had 5 different versions of my home page and 2 versions of every other page on my sites. Have you ever gotten that weird sensation when all the blood drains out of your head and you feel like you’re about to pass out? I hope you don’t have 5 home pages…
Other ways to detect duplicate content:
- Copy and paste a snippet of text from your page (or your Title Tag) into the Google search box
- Check Google Search Console in the HTML improvements section (look for duplicate Title Tags and duplicate Meta Descriptions)
- Have an SEO perform a crawl of your website and analyze for duplicate content (ahem)
- Do a manual review of your site pages
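If you have crawl output handy, the duplicate-Title-Tag check can be automated. Here’s a minimal Python sketch using only the standard library, run against a few hypothetical pages (the URLs and HTML are made up for illustration):

```python
from collections import defaultdict
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Extract the contents of the first <title> tag on a page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.title:
            self.in_title = True
    def handle_data(self, data):
        if self.in_title:
            self.title += data
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

def duplicate_titles(pages):
    """Group URLs by <title>; groups of 2+ are duplicate suspects."""
    by_title = defaultdict(list)
    for url, html in pages.items():
        grabber = TitleGrabber()
        grabber.feed(html)
        by_title[grabber.title.strip()].append(url)
    return {t: sorted(u) for t, u in by_title.items() if len(u) > 1}

# Hypothetical crawl results: two URLs sharing one Title Tag,
# a classic symptom of sort/filter parameters on an eCommerce site.
pages = {
    "/shoes?sort=price": "<html><head><title>Shoes</title></head></html>",
    "/shoes?sort=name":  "<html><head><title>Shoes</title></head></html>",
    "/hats":             "<html><head><title>Hats</title></head></html>",
}
print(duplicate_titles(pages))
# {'Shoes': ['/shoes?sort=name', '/shoes?sort=price']}
```

This is essentially what the Google Search Console duplicate-Title report surfaces for you, just done locally on your own crawl data.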
How Do You Fix Duplicate Content?
That topic alone could make for a very long blog post.
The answer is “it depends” – it depends on what’s causing the duplicates in the first place.
That said, I do want to emphasize that your best course of action is to fix the problem at its root cause, whatever that turns out to be. Honestly, this is where you’ll be well served by having an SEO professional who is well versed in the technical side of search engine optimization be your guide.
With that caveat in mind, the fixes often fall into one or more of several categories:
- Fixing inconsistent internal linking
- Using the rel=”canonical” tag
- Using the Meta Robots tag
- Using the robots.txt file (be careful with this one!)
- Using the nofollow attribute on internal links
- Fixing improper use of blog tagging
Those are not all the fixes possible, but that list includes some of the more common methods used.
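As an illustration of the rel=”canonical” approach: each duplicate version of a page carries a link tag in its head pointing at the one preferred URL, which tells Google where to consolidate indexing and link signals. The sketch below (hypothetical example.com URL, standard library only) shows the tag and a tiny checker for it:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Pull the href of <link rel="canonical"> out of a page."""
    def __init__(self):
        super().__init__()
        self.canonical = None
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

def canonical_of(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

# Every duplicate version of the page would carry this same tag,
# pointing Google at the single preferred (canonical) URL.
page = ('<html><head><link rel="canonical" '
        'href="https://example.com/widget"></head></html>')
print(canonical_of(page))  # https://example.com/widget
```

A quick sanity check like this across your URL variants – do they all declare the same canonical? – is a useful way to confirm the fix actually shipped.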
As I said before, look to fix duplicate content problems at the root cause, rather than just putting duct tape and band aids on the problem.
If Dr. Frankenstein’s monster were an SEO professional, he’d say “Grrrr… duplicate content… bad…”. Friends don’t let friends have websites with duplicate content problems. Your stepdad will finally remember your birthday if you fix your duplicate content problems. Smokey the Bear says “only you can prevent duplicate content problems”.
Okay, I don’t know if any of that is true, but I do know that duplicate content is a serious problem for Google. The developer programs tech lead there told me so. Don’t make it hard on Google to know which pages they should index and serve in search results. Fix your duplicate content problems, and the internet will sleep better at night.