Panda and Big Data

July 15th 2011 // SEO

I’ve been thinking a lot about Google’s Panda update. Who in the SEO community hasn’t, right? But there’s one thing in particular that continues to bug me.

Why is Panda applied at the site level?

A site-wide quality metric seems very un-Googly. It was one of my major complaints when Panda (then called Farmer) was rolled out. It treats great content the same as the lousy content sitting next to it on the same site. This seems to run contrary to Google’s mission to return the best and most relevant search results.

You might argue that the great content will continue to rank well based on other signals, but there’s little doubt that it will be negatively impacted. And the content that now outranks it may not be better at the page level.

Panda Mechanics

At this point I believe we have a fairly good idea of how Panda is applied. Bill Slawski (here and here), Danny Sullivan and Eric Enge have all provided great insight into how Panda might have been constructed and implemented.

In general, the conclusion seems to be that Panda acts as a type of quality filter that sits on top of the algorithm, placing a penalty of sorts on those sites it deems to be unworthy.

My Panda Theory

It seems probable that Panda is a document based classifier that evaluates and scores the quality of a page. But why not integrate that page score as a true signal so lousy content would be demoted and quality content would rise? Let each piece of content compete on its own merits.

Could it be that the Panda classifier’s confidence level isn’t high enough on any single document?

If you sample enough pages from a site, though, the classifier might reach an acceptable confidence level. That would allow Google to pass judgment on the site as a whole, but not on an individual URL.

[Figure: Panda document scores]

If the Panda score is on a scale of 100 you might wind up with something like this. So while document three scores very well with a 94, the rest of the documents on the site drag the aggregate Panda score down.

This could explain why removing certain thin content pages might impact your Panda status, since the aggregate score might rise substantially.
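Here’s a rough sketch of that arithmetic in Python. Only the 94 for document three comes from the example above. Every other score, the 40-point thin content threshold, and the idea that the aggregate is a plain average are assumptions I made up for illustration, not anything Google has confirmed.

```python
from statistics import mean

# Hypothetical per-page scores on a 0-100 scale. Only doc-3's 94 comes
# from the post; the other values are invented for illustration.
page_scores = {
    "doc-1": 32,
    "doc-2": 41,
    "doc-3": 94,
    "doc-4": 28,
    "doc-5": 55,
}


def aggregate_score(scores):
    """Site-level score as a plain average (an assumption, not Google's method)."""
    return mean(scores.values())


def without_thin_pages(scores, threshold):
    """Drop pages scoring below a hypothetical thin-content threshold."""
    return {url: score for url, score in scores.items() if score >= threshold}


before = aggregate_score(page_scores)
after = aggregate_score(without_thin_pages(page_scores, threshold=40))
print(f"aggregate before: {before:.1f}, after removing thin pages: {after:.1f}")
```

With these made-up numbers, removing the two weakest pages lifts the average from 50 to a little over 63, even though no remaining page got any better.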

It could also explain why many saw sites with a small content corpus escape the wrath of Panda, since the lack of a viable sample made the aggregate Panda score invalid.
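The sample size idea can be sketched the same way. This is a minimal illustration, assuming the site score is an average of page scores and that Google would only act when the margin of error on that average is small. The 1.96 multiplier is the standard normal approximation for a 95% interval. The 10-point cutoff, the function names and the scores are all hypothetical.

```python
import math
from statistics import mean, stdev


def margin_of_error(scores, z=1.96):
    """Approximate 95% margin of error for the mean (normal approximation)."""
    return z * stdev(scores) / math.sqrt(len(scores))


def site_verdict(scores, max_margin=10.0):
    """Return the aggregate score, or None when the estimate is too uncertain.

    max_margin is a hypothetical cutoff chosen for illustration.
    """
    if len(scores) < 2:
        return None  # stdev needs at least two data points
    if margin_of_error(scores) > max_margin:
        return None  # sample too small or too noisy to judge the site
    return mean(scores)


# Made-up scores: a three-page site versus a 100-page site with the same
# kind of spread (the five values are simply repeated to keep it short).
small_site = [32, 94, 41]
large_site = [32, 94, 41, 28, 55] * 20

print("small site:", site_verdict(small_site))
print("large site:", site_verdict(large_site))
```

In this toy version the three-page site gets no verdict at all, while the 100-page site with the same mix of pages does, which is the pattern the theory would predict.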

Panda and Big Data

Part of this theory draws from Mike Cohen’s presentation at the Inside Search event. He stated that the way in which Google was improving the accuracy of voice search was through “massive amounts of data.”

And if you read In The Plex (which you should) you also come away with the feeling that it is sometimes less about tweaking the algorithm and more about feeding that algorithm more data. Machine learning requires big data.

Danny Sullivan reported from SMX Advanced that “the Panda filter isn’t running all the time. Right now, it’s too much computing power to be running this particular analysis of pages.”

Again, this seems to indicate that Panda isn’t part of the normal evaluation process and that it requires a substantial effort to recompute, even by Google’s standards.

Constraint or On Purpose?

[Image: Pee Wee Herman, “I meant to do that”]

Perhaps the iterative nature of the Panda updates will result in a more accurate document classifier that could be applied on the document level. Or maybe Google simply believes that a site’s overall content corpus should have an impact on all of the content on that site.

Maybe Google could apply Panda on the document level but instead believes site level application is more expedient.

There’s a small bit of logic there. If I buy products from a store and have them break again and again, I might not want to patronize that store even if a few of their other products were well-crafted and solid. So, the store (aka site) develops a reputation and the once bitten, twice shy adage kicks in. This dovetails nicely into the idea that your brand equity can have an impact on perceived relevance.

Of course, it’s possible that none of this is even remotely true, since I’m not a data scientist or statistician. I’m just a guy who reads a lot, experiments and enjoys uncovering patterns. And it doesn’t change the facts or how to get out of Panda Jail.

What do you think? Is Panda’s site level application a product of constraint or done by design?
