Monday, January 03, 2011

StackOverflow and Scraper Sites

I recently noticed that Google searches were turning up a lot of sites that mirror StackOverflow content instead of the originals. It appears that I'm not alone. This morning Jeff Atwood blogged about how StackOverflow is having increasing problems with these sites receiving higher Google rankings than the original pages. His post, and especially the comments, are filled with righteous indignation about how it's the end of the Internet as we know it. Of course, I can't remember a time when search results weren't polluted with faked-out results, so I don't understand why he's so surprised.

But I do think this situation is different. The difference is that in many cases the presentation is far better than what I've previously seen from link farms. Historically, my impression is that pages on such sites generally have the following attributes:

  1. The recycled content is poorly formatted and often truncated
  2. Unrelated content from multiple sources is lumped together in a single page (usually there is some common keyword)
  3. Advertisements are extremely dominant and often unrelated to the content
  4. The link-to-content ratio is often very high, and the links are to unrelated content

In fact, if one were to develop a set of quantitative metrics that could be automatically measured for pages based on the above criteria, I think StackOverflow would perform worse than some of the sites that mirror its content. It's rather ironic, but if you think about it, if you wanted to defeat an algorithm that was developed to find the "best" source for some common content, you'd do everything you could to make your scraper site look more legitimate than the original. Let's compare a question on StackOverflow with its cousin on eFreedom.com.
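
As a rough illustration of what one such metric might look like, here is a minimal Python sketch that estimates a page's link-to-content ratio. The definition used here (visible characters inside links divided by all visible characters) is my own assumption for the sake of the example, and the names LinkTextCounter and link_to_content_ratio are hypothetical. I have no idea what any real search engine actually measures.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts visible characters inside <a> tags versus all visible characters."""

    # Text inside these tags is never shown to the reader, so it is ignored.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Count non-whitespace characters only.
        n = len("".join(data.split()))
        self.total_chars += n
        if self.link_depth:
            self.link_chars += n


def link_to_content_ratio(html):
    """Fraction of visible text that sits inside links (assumed definition)."""
    parser = LinkTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars
```

Feeding the saved HTML of two pages through link_to_content_ratio would give a crude number for criterion 4. Judging whether the links are related to the content would need more than this.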

Analysis

The primary content is essentially the same, with similar formatting. The two major differences are that eFreedom directly displays only the selected answer, as opposed to all of the answers with the selected answer on top, and that it displays none of the comments. This may help avoid triggering the "unrelated content" rule, because the selected answer is probably the most cohesive with the question, and comments frequently veer off-topic. But I suspect the effect is minimal.

Now consider the advertisements. eFreedom has a few more, but they are more closely tied to the content (using Google AdWords, which probably helps). The advertisements on StackOverflow are for jobs identified via geolocation (I live in Maryland), and the words in them don't correlate particularly well to the primary content, even though they are arguably more relevant to StackOverflow users than run-of-the-mill AdWords ones.

Now let's consider the related links. StackOverflow has links to ~25 related questions in a column along the side. The only content shown from each question is its title, and the majority seem to be related based on matching a single tag from the question. eFreedom, on the other hand, has links to 10 related questions (they appear to be matching on both tags), puts them inline with the main content, and includes a brief summary. As a human being, I think the StackOverflow links are much noisier and less useful. If I try to think like an algorithm, what I notice is that StackOverflow has a higher link-to-content ratio, and the links are to more weakly related content.
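
To make the "one tag versus both tags" difference concrete, here is a small Python sketch of tag-based relatedness. It is only a guess at the general idea; I don't know how either site actually picks its related questions. The function names, the min_shared parameter, and the use of Jaccard similarity for ranking are all my own assumptions.

```python
def tag_overlap(tags_a, tags_b):
    """Jaccard similarity of two tag collections: shared tags / all distinct tags."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related(question_tags, candidates, min_shared=1):
    """Return candidate titles sharing at least min_shared tags, best match first.

    candidates maps a (hypothetical) question title to its list of tags.
    """
    wanted = set(question_tags)
    scored = [
        (tag_overlap(wanted, tags), title)
        for title, tags in candidates.items()
        if len(wanted & set(tags)) >= min_shared
    ]
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]
```

With min_shared=1, any question sharing a single tag gets in, which would produce a long and loosely related list. With min_shared=2, only questions matching both tags of a two-tag question survive, giving a shorter and tighter list.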

The only other major difference is that eFreedom has a "social network bar" on the page. I'm not sure how this would affect ranking. It probably helps with obtaining links.

If you look at the HTML for each page, both use Google Analytics, both use what I'm assuming is a CDN for images, and StackOverflow appears to have a second analytics service. On casual inspection, neither appears to be laden with external content for tracking purposes, although without deeper inspection there's no way to be sure. But I presume that having a large number of tracking links would make a site look more like spam, and having few of them makes it look more legit.
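
The casual inspection described above could be roughly automated. The following Python sketch lists the distinct third-party hosts that a page pulls scripts, images, or iframes from. The names ExternalHostCollector and external_hosts are hypothetical, the choice of tags is an assumption, and the result is only a crude proxy: it cannot tell a tracker from a CDN.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalHostCollector(HTMLParser):
    """Collects hostnames of externally loaded scripts, images, and iframes."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "img", "iframe"):
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Relative URLs have no hostname and count as first-party.
        host = urlparse(src).hostname
        if host and host != self.page_host:
            self.hosts.add(host)


def external_hosts(html, page_host):
    """Return the set of hosts other than page_host that the page loads from."""
    collector = ExternalHostCollector(page_host)
    collector.feed(html)
    collector.close()
    return collector.hosts
```

Comparing the size of this set for the two pages would put a number on "laden with external content", though a subdomain of the same site would be counted as external by this simple equality check.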

Conclusion

I don't think either site looks like spam, but between the two, StackOverflow has more spam-like characteristics. eFreedom's content and links are considerably less noisy than StackOverflow's. Is eFreedom being a leech? Yes, certainly, but it, and I believe some of the other sites replicating StackOverflow's content, don't look like traditional link-spam sites. In fact, for a person who is just looking for content, as opposed to looking to participate in a community, eFreedom is at least as good, if not slightly better. There may be a moral argument that the original source should be given priority over replications, but from a pure content quality perspective StackOverflow and its copiers are essentially identical, and computer algorithms aren't generally in the business of making moral judgements. Also, there are many forums and mailing list archives out there with atrocious user interfaces, where the casual searcher is likely better off being directed to a content aggregator than to the original source, so I don't think a general rule giving preference to original sources would be productive. Ultimately, I think open community sites like StackOverflow are going to have to compete with better SEO and perhaps better search and browsing UIs for non-contributors, rather than relying on search engines to perform some miracle, because the truth is that from a content consumption perspective the replicated sites are just as good.


2 comments:

Gavriel said...

Very interesting and very true.

IMHO the only way stackoverflow could make its position better would be to require the sites that use their content to link back to the original content's page (http://stackoverflow.com/questions/1833762/scala-reflection-getdeclaringtrait in this example) as opposed to http://stackexchange.com/ and http://blog.stackoverflow.com/category/cc-wiki-dump/ .

sex shop said...

A great deal of useful information for me!