Due to a new tool that was released yesterday, Niche Store Writer, there has been some buzz about Duplicate Content Filtering and what it means for this tool, for tools similar to it, and in general. Hopefully I can provide a few answers for you here based on my study of the subject. I’ll keep the technical details and jargon to a minimum as best I can.
Search Engines’ Goal - A Good User Experience
Why do search engines like Google and Yahoo implement a duplicate content filter? The quick and easy answer is so that they can provide a good user experience.
First of all, they want to screen out obvious duplicate content like 404 pages and other common, useless pages that almost all websites have internally. Next, they want to ensure that there is only one record in their index for a page, particularly one generated dynamically by the server, as happens with WordPress, where the same content can often be reached at more than one URL. Lastly, they want to filter out any duplicates from outside the site as best they can.
All of this happens in multiple stages during the process of crawling, then indexing and scoring a page. This is one reason why you may see a page jump up in the rankings and then fall as it is scored and re-scored on duplicate content and other factors built into the algorithm.
Of course, one of the biggest parts of their algorithm is designed to remove references to bulk content that is duplicated across the Internet. A good example of this would be Wikipedia scraper sites. There was a trend a while back where blackhatters would scrape content from Wikipedia based on keywords and then republish it with their ads on their own site. Google has gotten quite good at filtering out these sites.
How Does Google Determine Duplicate Content?
While I don’t know for sure, we can make some logical guesses based on published patents and public comments on this topic.
The first step, as mentioned above, is to simply not index easily identified duplicate content like error pages. The earlier in the algorithm they can eliminate them, the less cluttered their index will be and the less work their bots will have to do in analyzing the page.
After this, they break up the content into chunks or snippets of about 5 to 8 words apiece. These bits and pieces become the fingerprints of the document, so to speak. They check these fragments against a database of common phrases, common quotations and quips, cliches and a lot of other stuff in order to reduce the impact of common language elements on the algorithm. Some time after this initial check, the data is also classified by type and relationships using LSI (Latent Semantic Indexing) and other data clustering and relationship analysis algorithms. As you might guess, this takes time and is an ongoing process that runs across many servers and datacenters over a period of several days, or even weeks, before the page is fully analyzed.
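To make the shingling idea a bit more concrete, here is a rough sketch in Python of how breaking a page into word-window fingerprints might look. Keep in mind this is my own toy illustration; the 6-word window and the MD5 hashing are assumptions made for the example, not anything Google has published.

    import hashlib

    def shingles(text, size=6):
        # Break the text into overlapping windows of `size` words.
        words = text.lower().split()
        return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

    def fingerprint(text, size=6):
        # Hash each shingle; the set of hashes acts as a rough fingerprint of the page.
        return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles(text, size)}

    page = "the quick brown fox jumps over the lazy dog down by the quiet river bank"
    print(len(shingles(page)))      # 10 overlapping 6-word windows
    print(len(fingerprint(page)))   # 10 distinct shingle hashes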
Finally, after the page is fully analyzed, its fingerprint is checked against those of other pages in the database. If it is a substantial match for an existing document, it will probably be flagged as duplicate content. I say ‘probably’ because there seem to be other factors at work that can allow a near-duplicate to keep its place in the search results. These could include the page’s relevance to the site as a whole, or the reputation and authority of the site hosting the potential duplicate content.
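Continuing that sketch, the simplest way I can think of to compare two fingerprints is the fraction of shingle hashes they share (Jaccard similarity). Again, this is my own illustration; the 0.8 cutoff is an arbitrary number, and whatever threshold and weighting Google actually uses is not public.

    def jaccard(fp_a, fp_b):
        # Fraction of shingle hashes the two fingerprints share (0.0 to 1.0).
        if not fp_a and not fp_b:
            return 1.0
        return len(fp_a & fp_b) / len(fp_a | fp_b)

    # fp_new and fp_indexed would be shingle-hash sets like the ones built above;
    # short made-up strings stand in for real hashes here.
    fp_new     = {"a1", "b2", "c3", "d4", "e5", "f6"}
    fp_indexed = {"a1", "b2", "c3", "d4", "e5", "x9"}

    DUPLICATE_THRESHOLD = 0.8   # arbitrary cutoff, for illustration only
    similarity = jaccard(fp_new, fp_indexed)
    print(f"similarity: {similarity:.2f}")   # 0.71 for these two sets
    if similarity >= DUPLICATE_THRESHOLD:
        print("substantial match - likely flagged as a duplicate")
    else:
        print("different enough to keep its own place in the index")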
How To Avoid Duplicate Content
First of all, what doesn’t work is adding a few sentences at the start and/or end of the duplicate content. Google’s algorithm has been smart enough to handle this for years.
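A quick back-of-the-envelope calculation, using the same shingling assumption as above and numbers I made up, shows why. Padding a copied article only creates new shingles at the edges; every shingle made purely of the original words survives intact, so the two fingerprints still overlap almost completely.

    # Pad a copied 500-word article with 20 new words at the start and 20 at the
    # end, then compare 6-word shingle fingerprints (all numbers are made up).
    article_words, pad_start, pad_end, size = 500, 20, 20, 6

    original_shingles = article_words - size + 1                        # 495
    padded_shingles = (pad_start + article_words + pad_end) - size + 1  # 535

    # The original article is still one contiguous block inside the padded copy,
    # so every one of its 495 shingles also appears in the copy.
    shared = original_shingles
    similarity = shared / (original_shingles + padded_shingles - shared)
    print(f"still {similarity:.0%} similar despite the added sentences")  # 93%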
Another thing that doesn’t work is switching out or switching around a few words. For example, simply swapping in “attorney” for “lawyer” in an article won’t avoid the filtering. Their algorithm, as indicated by their patents, detects common synonyms. Other simple forms of modifying content might also become detectable over time. Remember that Google has a lot of time and money to devote to the ongoing process of evaluating web pages, so you may not be filtered today but you may be tomorrow, or vice versa.
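One plausible way a filter could neutralize this, and this is my own guess at a mechanism rather than anything taken directly from the patents, is to map words to a canonical form before shingling. After that step, a synonym swap changes nothing at all. The tiny synonym table below is obviously a toy stand-in.

    # Map words to a canonical form before shingling, so a synonym swap produces
    # exactly the same shingles. The synonym table is a toy stand-in.
    SYNONYMS = {"attorney": "lawyer", "purchase": "buy"}

    def normalize(text):
        return " ".join(SYNONYMS.get(word, word) for word in text.lower().split())

    original = "hire an experienced lawyer before you buy the policy"
    spun     = "hire an experienced attorney before you purchase the policy"

    print(normalize(original) == normalize(spun))   # True: the swap changed nothing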
One thing that works to some degree is content of 5 words or more that varies automatically on each page load, or even gradually over time. As the content changes with each ‘bot crawl, this seems to make portions of Google’s algorithm reset themselves. Widgets like the WordPress Related Posts plugin seem to take advantage of this, as does simply adding a comment to a blog post.
Another approach that seems to work is to build your content around keywords or keyword phrases interlaced with short common phrases. This is what Niche Store Writer helps you do. How much it helps will depend a lot on a number of factors, such as keyword density and uniqueness, the length of the content, and how common the template text is across the Internet.
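To get a feel for the ‘how common the template text is’ factor, here is a small, purely illustrative sketch. The template and keywords are made up, but the point stands: two pages filled from the same template share most of their words, and that shared template text is exactly what a duplicate filter has to work with.

    # Two pages filled from the same made-up template with different two-word
    # keywords: most of each page is identical template text.
    template = ("Looking for the best {kw}? Our guide compares the top {kw} "
                "deals so you can buy with confidence.")

    page_a = template.format(kw="acoustic guitars")
    page_b = template.format(kw="espresso machines")

    words_a, words_b = page_a.lower().split(), page_b.lower().split()
    # Same-length keyword phrases keep the two pages word-aligned.
    shared = sum(1 for a, b in zip(words_a, words_b) if a == b)
    print(f"{shared} of {len(words_a)} words are identical template text")   # 16 of 20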
Lastly, writing a bit of text that passes Copyscape or other duplicate checkers should be sufficient to pass Google’s current duplicate filtering algorithms. Of course, doing this in the process of preparing a site can eat up a lot of time.
Yes, a lot of this is educated guesswork and theory based on my own observations and those of others. As far as Google’s algorithms go, things are always in flux, so what works today may not work tomorrow. Do you have any theories or observations to add? If so, leave a comment about them.