How Google treats duplicate content from scrapers
I know I’ve said it before, but if you run a website and you’re not paying attention to the Official Google Webmaster Central Blog, you’re ignoring a very good resource.
Just yesterday, Sven Naumann (who is on the search quality team) wrote a post dealing with concerns webmasters have about scraper sites that pull exact post content and republish it as their own.
So what does Google do? They use a couple of methods, which they don’t identify, to determine which copy of the duplicate content is the original piece, and then point searchers to that one.
For people who see scraper sites ranking higher in search results than their own original content, they offer this advice:
Some webmasters have asked what could cause scraped content to rank higher than the original source. That should be a rare case, but if you do find yourself in this situation:
- Check if your content is still accessible to our crawlers. You might unintentionally have blocked access to parts of your content in your robots.txt file.
- You can look in your Sitemap file to see if you made changes for the particular content that has been scraped.
- Check if your site is in line with our webmaster guidelines.
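On that first point, it’s worth remembering how easy an accidental block is. A single overly broad rule in robots.txt can keep Googlebot away from your original posts entirely, leaving the scraped copies as the only versions Google can index. For example (the `/blog/` path here is hypothetical, just to illustrate the pattern):

```
# This blocks Googlebot from everything under /blog/ —
# including the original posts you want indexed.
User-agent: Googlebot
Disallow: /blog/
```

If you meant to block only a subdirectory, the Disallow path needs to be that narrower path, not the whole section.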
Largely, the answer seems to be “don’t worry, we’ve got a handle on it.” But if you want more details on how to minimize duplicate content within your own domain, you can go check out the post.
What I think Google really needs to solve, though, is “old” content arriving online for the first time. How do you figure out who the real owner of that content is?