Duplicate Content Filtering Concepts

A new tool released yesterday, Niche Store Writer, has generated some buzz about duplicate content filtering and what it means for this tool, for similar tools, and in general. Hopefully I can provide a few answers for you here based on my study of the subject. I’ll try to keep the technical details and jargon to a minimum.

The Search Engines’ Goal - A Good User Experience


Why do search engines like Google and Yahoo implement a duplicate content filter? The quick and easy answer is so that they can provide a good user experience.

First of all, they want to screen out obvious duplicate content like 404 pages and other common, useless pages that almost all websites have internally. Next, they want to ensure that there is only one record in their index for each page, particularly for pages that are generated dynamically by the server, as happens with WordPress. Lastly, they want to filter out duplicates from outside the site as best they can.
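To make the “one record per page” idea concrete, here is a minimal sketch of how a crawler might collapse the many URL variants a WordPress-style site can generate for a single post down to one index key. The parameter names and rules below are assumptions for illustration only, not anything the search engines have published.

from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Query parameters that commonly change the URL without changing the content.
# This list is an assumption for the example, not a published standard.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "replytocom", "ref"}

def canonical_key(url: str) -> str:
    """Reduce several URL variants of the same page to one index key."""
    scheme, netloc, path, query, _fragment = urlsplit(url.lower())
    netloc = netloc.removeprefix("www.")   # treat www and bare host alike
    path = path.rstrip("/") or "/"         # collapse trailing-slash variants
    kept = [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(sorted(kept)), ""))

# All three of these hypothetical URLs map to the same index record:
print(canonical_key("http://example.com/blog/my-post/"))
print(canonical_key("http://www.example.com/blog/my-post?utm_source=feed"))
print(canonical_key("http://example.com/blog/my-post?replytocom=42"))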

All of this happens in multiple stages during the process of crawling, indexing and scoring a page. This is one reason why you may see a page jump up in the rankings and then fall as it is scored and re-scored for duplicate content and the other factors built into the algorithm.

Of course, one of the biggest parts of their algorithm is designed to remove references to bulk content that is duplicated across the Internet. A good example of this would be Wikipedia scraper sites. There was a trend a while back where blackhatters would scrape content from Wikipedia based on keywords and then republish it with their ads on their own sites. Google has gotten quite good at filtering out these sites.

How Does Google Determine Duplicate Content?


While I don’t know for sure, we can make some logical guesses based on published patents and public comments on the topic.

The first step, as mentioned above, is simply not indexing easily identified duplicate content like error pages. The earlier in the pipeline they can eliminate such pages, the less cluttered their index will be and the less work their bots will have to do analyzing them.

After this, they break the content up into chunks or snippets of about 5 to 8 words apiece. These fragments become the fingerprints of the document, so to speak. They are checked against a database of common phrases, quotations and quips, clichés and a lot of other material in order to reduce the impact of common language elements on the algorithm. Some time after this initial check, the data is also classified by type and relationship using LSI (Latent Semantic Indexing) and other data clustering and relationship analysis algorithms. As you might guess, this takes time and is an ongoing process that runs across many servers and datacenters over a period of several days, or even weeks, before the page is fully analyzed.
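As a rough illustration of that chunking step, here is a small sketch that reduces a document to a set of hashed word-windows. The 8-word window and MD5 hashing are assumptions for the example; the actual shingle size and hash functions Google uses are not public.

import hashlib
import re

def shingles(text: str, size: int = 8) -> set[str]:
    """Break text into overlapping word-windows and hash each one.

    The resulting set of hashes acts as a crude 'fingerprint' of the document.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    windows = (" ".join(words[i:i + size]) for i in range(len(words) - size + 1))
    return {hashlib.md5(w.encode()).hexdigest()[:16] for w in windows}

doc = "There was a trend a while back where blackhatters would scrape content from Wikipedia"
print(len(shingles(doc)))   # number of 8-word fingerprints in this snippet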

Finally, after the page is fully analyzed, its fingerprint is checked against those of other pages in the database. If it is a substantial match for an existing document, it will probably be flagged as duplicate content. I say ‘probably’ because there seem to be other factors at work that can allow a near-duplicate to keep its place in search results. These could include the page’s relevance to the site as a whole or the reputation and authority of the site carrying the potential duplicate content.
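A simple way to picture that final comparison is to measure how much two pages’ fingerprint sets overlap and flag anything above a threshold. This is only a sketch of the general near-duplicate idea, with an arbitrary cut-off, not Google’s actual method.

def overlap(fp_a: set[str], fp_b: set[str]) -> float:
    """Jaccard similarity of two fingerprint sets: 1.0 = identical, 0.0 = nothing shared."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

DUPLICATE_THRESHOLD = 0.8   # an arbitrary cut-off chosen for this example

def looks_duplicate(fp_a: set[str], fp_b: set[str]) -> bool:
    """Flag two pages as near-duplicates when their fingerprints mostly overlap."""
    return overlap(fp_a, fp_b) >= DUPLICATE_THRESHOLD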

How To Avoid Duplicate Content


First of all, what doesn’t work is adding in a few sentences at the start and/or end of the duplicate content. Google’s algorithm has been smart enough to handle this for years.

Another thing that doesn’t work is swapping out or shuffling a few words. For example, simply substituting “attorney” for “lawyer” in an article won’t avoid the filtering; their algorithm, as indicated by their patents, detects common synonyms. Other simple forms of modifying content may also become detectable over time. Remember that Google has a lot of time and money to devote to the ongoing process of evaluating web pages, so you may not be filtered today but you may be tomorrow, or vice versa.
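A sketch of why the synonym swap fails: if the fingerprinting step maps common synonyms to a single canonical term before hashing, “attorney” and “lawyer” produce exactly the same shingles. The little synonym table below is made up for the example; a real system would derive one from much larger data.

# A toy synonym table, invented for illustration only.
CANONICAL = {"attorney": "lawyer", "counsel": "lawyer", "automobile": "car", "auto": "car"}

def normalize(text: str) -> str:
    """Replace known synonyms with a canonical term before fingerprinting."""
    return " ".join(CANONICAL.get(w, w) for w in text.lower().split())

a = "hire a good lawyer before you sign the contract"
b = "hire a good attorney before you sign the contract"
print(normalize(a) == normalize(b))   # True - the swap changes nothing downstream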

One thing that works to some degree is content of 5 words or more that varies automatically on each page load, or even gradually over time. As the content changes with each bot crawl, portions of Google’s algorithm seem to reset themselves. Widgets like the WordPress Related Posts plugin seem to take advantage of this, as does simply adding a comment to a blog post.
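To make that concrete, here is a minimal sketch of the kind of widget being described: a block that shows a different random slice of related links on each render, so successive crawls of the page see slightly different text. It is purely illustrative, with made-up post titles, and is not how any particular plugin is actually implemented.

import random

# Hypothetical post titles used only for this illustration.
RELATED_POSTS = [
    "Duplicate Content Filtering Concepts",
    "How Search Engines Score Pages",
    "Rewriting PLR Articles",
    "Keyword Research Basics",
    "Building Niche Store Templates",
]

def related_posts_widget(count: int = 3) -> str:
    """Return a different random selection of related links on each page render."""
    picks = random.sample(RELATED_POSTS, k=count)
    return "\n".join(f"- {title}" for title in picks)

print(related_posts_widget())   # varies from one render (and crawl) to the next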

Another approach that seems to work is building your content around keywords or keyword phrases interlaced with short common phrases. This is what Niche Store Writer helps you do. How much it helps will depend on a number of factors, such as keyword density and uniqueness, length, and how common the template text is across the Internet.
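In spirit, that approach looks something like the sketch below. This is not Niche Store Writer’s actual template format, just an illustration of interlacing a keyword phrase with short connecting text.

# A hypothetical page-intro template with a keyword placeholder.
TEMPLATE = (
    "Looking for {keyword}? Below you will find a hand-picked selection of "
    "{keyword} deals. Every {keyword} listed here was chosen for value and quality."
)

def fill(keyword: str) -> str:
    """Drop a keyword phrase into short connecting text to produce a page intro."""
    return TEMPLATE.format(keyword=keyword)

print(fill("blue widget lamps"))
print(fill("vintage movie posters"))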

Lastly, writing a bit of text that passes Copyscape or other duplicate checkers should be sufficient to pass Google’s current duplicate filtering algorithms. Of course, doing this in the process of preparing a site can eat up a lot of time.

In Theory

Yes, a lot of this is educated guessing and theory based on my own observations and those of others. As far as Google’s algorithms go, things are always in flux, so what works today may not work tomorrow. Do you have any theories or observations to add? If so, leave a comment.

 



20 Comments

Comment by Vinny Lingo
2008-05-09 13:33:33

Great article, Frank. Regarding “how common the template text is across the Internet,” this is exactly why Dave only includes one template with Niche Store Writer. If everyone used the same template, what they were doing would be too obvious to Google. Niche Store Writer isn’t meant to replace the task of writing content, only to accelerate it. So everyone who purchases it needs to write original templates.

Comment by jfc
2008-05-09 13:59:57

Hi Vinny,

Try Googling “By purchasing from your local retail” or “there is no other place to go other than” or “paying top dollar prices” ;)

 
 
Comment by James
2008-05-09 13:57:27

I think the duplicate content fear is a bit overrated. Like your post states, there are a lot of factors that can cancel a site’s label as duplicate.
I know one internet marketer who admits that he sometimes does not rewrite PLR articles. He finds that if he uses them right away, he does not get a penalty. Of course, I would expect someone who used the same PLR material later to get the penalty.
Using the same meta title and description will get your own pages listed as duplicate material. This also shows that search engines do still count meta data to at least some extent.

Comment by jfc
2008-05-09 14:10:30

Hi James,

It often comes down to the age of the content and the authority of the site in question. Not only that, it depends on the search. For example, I wrote an article on rewriting a while back and posted an unchanged PLR article as an example. That article is indexed and usually goes supplemental in results, but with the right combination of keywords it will appear in the primary results. I may experiment with some link building on it to see what happens.

I’ve heard that about meta tags but, in my experience, this isn’t a factor with Google. Or, if it is, they don’t give it as much weight as they do other factors.

 
 
Comment by Houseboat Rentals
2008-05-10 12:12:06

Duplicate content. Always a topic with no closure. We can only go by what works for us. I have had the same article on two different sites with no issues, and at the same time I have a duplicate article I cannot even find. Who knows!

Denise

Comment by jfc
2008-05-10 19:18:29

Hi Denise,

Yes, it seems like it’s a moving target. Google engineers are always changing the algorithm, testing changes in various data centers, pulling them out, putting them back in, and so forth.

 
 
Comment by Vic
2008-05-10 18:42:10

Frank, it could not have been explained better. Dear friend, you are a born teacher! Thank you for writing such a thought-out explanation that I can send people to ;)

Comment by jfc
2008-05-10 19:22:05

Thanks Vic.

 
 
Comment by Terry
2008-05-11 02:59:35

Hey Frank, great explanation. The duplicate content issue has had me writing as originally as possible for as long as I can remember, although I do rewrite articles for submissions etc. and when I simply run out of ideas!

I think a good rule of thumb is to write original content for your own sites as much as possible, time permitting, and for submissions to article directories etc., rewrite existing material to get it past the directory’s own filter and call it job done.

Terry

Comment by jfc
2008-05-11 11:07:38

Hi Terry and thanks,

Passing Copyscape seems to work the best for now when I’m doing rewrites. Of course, 100% original material is the best but having time to do that is tough.

 
 
Comment by ND @ Touch Ipod
2008-05-15 11:46:52

Crap, I thought synonyms were enough to avoid duplicate content. Of course, as someone who’s written a parser in Perl, I should’ve known that comparing words against a dictionary is pretty straightforward.

I guess the trick is to use words that don’t really mean the same thing! For example, using “70% sure” instead of “absolutely certain” - do you think that would work?

Comment by jfc
2008-05-15 16:03:40

Hi iPod,

Anything that breaks up the pattern should help you avoid the duplicate penalty. Remember that the content gets broken up into 5 to 8 word chunks, so that’s the level to concentrate on. You might be able to retain “70% sure”, but the 5 or 6 words on either side of it are what would affect the document’s ‘fingerprint’.

 
 
Comment by Warenwirtschaft
2008-05-23 16:53:46

Normally I avoid using text from other sources, but when I have to, I always rewrite every sentence a bit. That has always worked fine for me.

Comment by jfc
2008-05-23 17:07:05

Hi Warenwirtschaft,

How much of a rewrite you have to do in order to avoid penalties really depends on a number of factors. Making it 70% different is the general rule, although there are apparently loopholes in that.

 
 
2008-07-20 08:09:00

[…] page. This can be easy or hard depending on what you wrote about. Frank of OpTempo wrote about duplicate content filtering concepts back in May and explained how it can be made to work for […]

 
Comment by Ed
2008-07-20 12:45:27

I find it inconceivable that a blogger would go to the trouble of rewriting someone else’s work to make it their own. It just seems like too much work, and why even blog if you don’t have the wherewithal to write your own ideas down in your own style? Rhetorical question, but you get the idea.

I have linked to an article I wrote a while back, turning the issue on its head and encouraging people to copy my content. Does that basic premise still apply today, or has Google made everyone overprotective of their own work in the name of ever-changing SEO etiquette?

Comment by jfc
2008-07-21 08:21:55

Hi Ed,

Content gets borrowed for rewrites all the time. Nowhere is this more evident than in the echo chamber that is the “Make Money Online Blogging” niche. There is very little original material.

In niche blogging, you may not be an expert in the subject matter at hand, so a rewrite of an existing article is one of your least expensive options.

People copying your content can give you some links, although the quality of those links may not be all that good. There is also a risk that somebody with a more authoritative site than yours could rank higher than you or push you into the supplemental results.

 
 
Comment by Jonathan Meager
2009-01-13 10:45:21

Does the same principle hold if you are submitting content to article sites? Will it count as duplication if you send the same article to, say, 10 different sites?

Comment by jfc
2009-01-13 15:10:42

Hi Jonathan,

Since I wrote this article I’ve found that Google doesn’t filter duplicate content all that aggressively. They’ll deindex or filter blatant copying, such as scraping Wikipedia. Beyond that, it’s basically the authority of the domain that matters, and multiple sites can rank well using essentially the same content if Google’s algorithms think the sites in question have keyword authority.

 
 
Comment by Moving House
2009-02-17 01:02:12

I found, while using Copyscape to rewrite an article entirely, that it’s important to make sure you don’t have a phrase of 5 or more words that is the same - even an address can sometimes trigger it. I assume Google does something similar.

 