Google Now Topics

November 26 2013 // SEO + Technology // 16 Comments

Have you visited your Google Now Topics page? You should if you want to get a peek at how Google is translating queries into topics, which is at the core of the Hummingbird Update.

Google Now Topics

If you are in the United States and have Google Web History turned on you can go to your Google Now Topics page and see your query and click behavior turned into specific topics.

Google Now Topics Example

This is what my Google Now Topics page looked like a few weeks back. It shows specific topics that I’ve researched in the last day, week and month. If you’re unfamiliar with this page this alone might be eye opening. But it gets even more interesting when you look at the options under each topic.

Topic Intent

The types of content offered under each topic are different.

Why is this exciting? To me it shows that Google understands the intent behind each topic. So the topic of New York City brings up ‘attractions and photos’ while the topic of Googlebot just brings up ‘articles’. Google clearly understands that Back to the Future is a movie and that I’d want reviews for the Toyota Prius Plug-in Hybrid.

In essence, words map to a topic which in turn tells Google what type of content should most likely be returned. You can see how these topics were likely generated by looking back at Web History.

Search of Google Web History for Moto X

This part of my web history likely triggered a Moto X topic. I used the specific term ‘Moto X’ a number of times in a query which made it very easy to identify. (I did wind up getting the Moto X and love it.)

Tripping Google Now Topics

When I first saw this page back in March, and then again in June, I wanted to start playing around with what combination of queries would produce a Google Now Topic. However, I’ve been so busy with client work that I never got a chance to do that until now.

Here’s what I did. Logged into my Google account and using Chrome I tried the following series of queries (without clicking through on any results) at 1:30pm on November 13th.

the stranger
downeaster alexa
big shot
uptown girl
piano man

But nothing ever showed up in Google Now Topics. So I took a similar set of terms but this time engaged with the results at 8:35am on November 16th.

piano man (clicked through on Wikipedia)
uptown girl (clicked through on YouTube)
pressure (no click)
big shot (clicked through on YouTube)
the stranger lyrics (clicked through on atozlyrics, then YouTube)
scenes from an italian restaurant (no click)

Then at 9:20am a new Google Now Topic shows up!

Google Now Topic for Billy Joel Songs

Interestingly it understands that this is about music but it hasn’t made a direct connection to Billy Joel. I had purposefully not used his name in the queries to see if Google Now Topics would return him as the topic instead of just songs. Maybe Google knows but I had sort of hoped to get a Billy Joel topic to render and think that might be the better result.

YouTube Categories

Engagement certainly seems to count based on my limited tests. But I couldn’t help but notice that every one of the songs in that Google Now Topic was also a YouTube click. Could I get a Google Now Topic to render without a YouTube click?

The next morning I tried again with a series of queries at 7:04am.

shake it up (no click)
my best friend’s girl (lyricsfreak click)
let the good times roll (click on Wikipedia, click to disambiguated song)
hello again (no click)
just what i needed (lastfm click)
tonight she comes (songmeanings click)
shake it up lyrics (azlyrics click)

At 10:04 nothing showed up so I decided to try another search.

let the good times roll (clicked on YouTube)

At 10:59 nothing showed up and I was getting antsy, which was probably not smart. I should have waited! But instead I performed another query.

the cars (clicked on knowledge graph result for Ric Ocasek)

And at 12:04 I get a new Google Now Topic.

Let The Good Times Roll Google Now Topic

I’m guessing that if I’d waited a bit longer after my YouTube click that this would have appeared, regardless of the click on the knowledge graph result. It seems that YouTube is a pretty important part of the equation. It’s not the only way to generate a Google Now Topic but it’s one of the faster ways to do so right now.

Perhaps it’s easier to identify the topic because of the more rigid categorization on YouTube?

The Cars on YouTube

I didn’t have time to do more research here but am hoping others might begin to compile a larger corpus of tests so we can tease out some conclusions.

Topic Stickiness

I got busy again and by the time I was ready to write this piece I found that my topics had changed.

New Google Now Topics

It was fairly easy to deduce why each had been produced, though the Ice Bath result could have been simply from a series of queries. But what was even more interesting was what my Google Now Topics looked like this morning.

My Google Now Topics Today

Some of my previous topics are gone! Both Ice Bath and Let The Good Times Roll are nowhere to be found. This seems to indicate that there’s a depth of interaction and distance from event (time) factor involved in identifying relevant topics.

It would make sense for Google to distinguish intent that is more consistent from intent that is more ephemeral. I was interested in ice baths because my daughter has some plantar fascia issues. But I’ve never researched it before and likely (fingers crossed) won’t again. So it would make sense to drop it.

There are a number of ways that Google could determine which topics are more important to a user, including frequency of searching, query chains, depth of interaction as well as type and variety of content.
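To make the frequency and time factors concrete, here’s a minimal sketch of how a topic score might fade. Everything in it is an assumption for illustration: the `topic_score` function, the engagement weights and the seven day half-life are made up, and nothing here reflects how Google actually scores topics.

```python
def topic_score(events, now, half_life_days=7.0):
    """Sum engagement-weighted events, halving each one's value every half_life_days."""
    score = 0.0
    for day, engagement in events:
        age = now - day
        score += engagement * 0.5 ** (age / half_life_days)
    return score

# Each event is (day it happened, engagement weight).
# Hypothetical weights: 1.0 = query only, 2.0 = query plus a click.
ice_bath = [(0, 1.0), (0, 2.0)]                      # one burst, never repeated
moto_x = [(0, 2.0), (5, 2.0), (12, 2.0), (20, 1.0)]  # researched over several weeks

now = 21
scores = {
    "Ice Bath": topic_score(ice_bath, now),
    "Moto X": topic_score(moto_x, now),
}
for topic, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{topic}: {score:.2f}")
```

Under these made-up numbers the one-off Ice Bath research decays quickly while the repeated Moto X research stays on top, which is the sort of behavior I saw on my own Topics page.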

Google Now Topics and Hummingbird

OMG It's Full of Stars Cat

My analysis of the Hummingbird Update focused largely on the ability to improve topic modeling through a combination of traditional natural language text analysis and entity detection.

Google Now Topics looks like a Hummingbird learning lab.

Watching how queries and click behavior turn into topics (there’s that word again) and what types of content are displayed for each topic is a window into Google’s evolving abilities and application of entities into search results.

It may not be the full picture of what’s going on but there’s enough here to put a lot of paint on the canvas.


Google Now Topics provide a glimpse into the Hummingbird Update by showing how Google takes words, queries and behavior and turns them into topics with defined intent.

What Does The Hummingbird Say?

November 07 2013 // SEO + Technology // 29 Comments

What Does The Fox Say Video Screencap

Dog goes woof
Cat goes meow
Bird goes tweet
and mouse goes squeak

Cow goes moo
Frog goes croak
and the elephant goes toot

Ducks say quack
and fish go blub
and the seal goes ow ow ow ow ow

But there’s one sound
That no one knows
What does the hummingbird say?

What Does The Hummingbird Say?

For the last month or so the search industry has been trying to figure out Google’s new Hummingbird update. What is it? How does it work? How should you react?

There’s been a handful of good posts on Hummingbird including those by Danny Sullivan, Bill Slawski, Gianluca Fiorelli, Eric Enge (featuring Danny Sullivan), Ammon Johns and Aaron Bradley. I suggest you read all of these given the chance.

I share many of the views expressed in the referenced posts but with some variations and additions, which is the genesis of this post.

Entities, Entities, Entities

Are you sick of hearing about entities yet? You probably are but you should get used to it because they’re here to stay in a big way. Entities are at the heart of Hummingbird if you parse statements from Amit Singhal.

We now get that the words in the search box are real world people, places and things, and not just strings to be managed on a web page.

Long story short, Google is beginning to understand the meaning behind words and not just the words themselves. And in August 2013 Google published something specifically on this topic in relation to an open source toolkit called word2vec, which is short for word to vector.

Word2vec uses distributed representations of text to capture similarities among concepts. For example, it understands that Paris and France are related the same way Berlin and Germany are (capital and country), and not the same way Madrid and Italy are. This chart shows how well it can learn the concept of capital cities, just by reading lots of news articles — with no human supervision:

Example of Getting Meaning Behind Words

So that’s pretty cool isn’t it? It gets even cooler when you think about how these words are actually places that have a tremendous amount of metadata surrounding them.
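To make the capital and country idea concrete, here’s a minimal sketch of the vector arithmetic behind it. The two dimensional vectors are hand-made toys, not real word2vec output, and the `analogy` helper is my own illustration rather than part of the word2vec toolkit.

```python
import math

# Toy, hand-made 2-D vectors (NOT real word2vec output).
# Axis 0 loosely encodes "which country", axis 1 encodes "is a capital city".
vectors = {
    "France": (1.0, 0.0),
    "Paris": (1.0, 1.0),
    "Germany": (2.0, 0.0),
    "Berlin": (2.0, 1.0),
    "Italy": (3.0, 0.0),
    "Rome": (3.0, 1.0),
    "Spain": (4.0, 0.0),
    "Madrid": (4.0, 1.0),
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word d such that a is to b as c is to d."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))

# France is to Paris as Germany is to ...?
print(analogy("France", "Paris", "Germany"))
```

The real toolkit learns vectors with hundreds of dimensions from text, but the relationship it captures works the same way: the offset from a country to its capital is roughly the same for every pair.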

Topic Modeling

It’s my belief that the place where Hummingbird has had the most impact is in the topic modeling of sites and documents. We already know that Google is aggressively parsing documents and extracting entities.

When you type in a search query — perhaps Plato – are you interested in the string of letters you typed? Or the concept or entity represented by that string? But knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval — you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.

Reading this I think it becomes clear that once those entities are extracted Google is then performing a lookup on an entity database(s) and learning about what that entity means. In particular Google wants to know the topic/concept/subject to which that entity is connected.

Google seems to be pretty focused on that if you look at the Freebase home page today.

Freebase Topic Count

Tamar Yehoshua, VP of Search, also said as much during the Google Search Turns 15 event.

So the Knowledge Graph is great at letting you explore topics and sets of topics.

One of the examples she used was the search for impressionist artists. Google returned a list of artists and allowed you to navigate to different genres like cubists. It’s clear that Google is relating specific entities, artists in this case, to a concept or topic like impressionist artists, and further up to a parent topic of art.

Do you think that having those entities on a page might then help Google better understand what the topic of that page is about? You better believe it.
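Here’s a minimal sketch of that extract, lookup and roll-up idea. The `ENTITY_TOPICS` and `PARENT_TOPIC` dictionaries are tiny hypothetical stand-ins for an entity database like Freebase, and the substring matching is far cruder than whatever Google actually does.

```python
from collections import Counter

# Hypothetical stand-in for an entity database: entity name -> topic.
ENTITY_TOPICS = {
    "claude monet": "Impressionist artists",
    "pierre-auguste renoir": "Impressionist artists",
    "pablo picasso": "Cubist artists",
    "georges braque": "Cubist artists",
}
# Hypothetical topic hierarchy: topic -> parent topic.
PARENT_TOPIC = {
    "Impressionist artists": "Art",
    "Cubist artists": "Art",
}

def extract_entities(text):
    """Naive extraction: find known entity names anywhere in the text."""
    lowered = text.lower()
    return [name for name in ENTITY_TOPICS if name in lowered]

def page_topics(text):
    """Count how many extracted entities point at each topic and its parent."""
    topics = Counter()
    for entity in extract_entities(text):
        topic = ENTITY_TOPICS[entity]
        topics[topic] += 1
        topics[PARENT_TOPIC[topic]] += 1
    return topics

page = (
    "Claude Monet and Pierre-Auguste Renoir often painted side by side. "
    "Pablo Picasso took things in a different direction."
)
topics = page_topics(page)
for topic, count in topics.most_common():
    print(topic, count)
```

The more entities on the page that resolve to the same topic, the more confident you could be about what the page covers.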

Based on client data I think that the May 2013 Phantom Update was the first application of a combined topic model (aka Hummingbird). Two weeks later it was rolled back and then later reapplied with some adjustments.

Hummingbird refined the topic modeling of sites and pages that are essential to delivering relevant results.

Strings AND Things

Hybrid Car

This doesn’t mean that text based analysis has gone the way of the dodo bird. First off, Google still needs text to identify entities. Anyone who thinks that keywords (or perhaps it’s easier to call them subjects) in text aren’t meaningful is missing the boat.

In almost all cases you don’t have as much labeled data as you’d really like.

That’s a quote from a great interview with Jeff Dean and while I’m taking the meaning of labeled data out of context I think it makes sense here. Writing properly (using nouns and subjects) will help Google to assign labels to your documents. In other words, make it easy for Google to know what you’re talking about.

Google can still infer a lot about what that page is about and return it for appropriate queries by using natural language processing and machine learning techniques. But now they’ve been able to extract entities, understand the topics to which they refer and then feed that back into the topic model. So in some ways I think Hummingbird allows for a type of recursive topic modeling effort to take place.

If we use the engine metaphor favored by Amit and Danny, Hummingbird is a hybrid engine instead of a combustion or electric only engine.

From Caffeine to Hummingbird

Electrical Outlet with USB and Normal Sockets

One of the head scratching parts of the announcement was the comparison of Hummingbird to Caffeine. The latter was a huge change in the way that Google crawled and indexed data. In large part Caffeine was about the implementation of Percolator (incremental processing), Dremel (ad-hoc query analysis) and Pregel (graph analysis). It was about infrastructure.

So we should be thinking about Hummingbird in the same way. If we believe that Google now wants to use both text and entity based signals to determine quality and relevance they’d need a way to plug both sources of data into the algorithm.

Imagine a hybrid car that didn’t have a way to recharge the battery. You might get some initial value out of that hybrid engine but it would be limited. Because once out of juice you’d have to take the battery out and replace it with a new one. That would suck.

Instead, what you need is a way to continuously recharge the battery so the hybrid engine keeps humming along. So you can think of Hummingbird as the way to deliver new sources of data (fuel!) to the search engine.

Right now that new source of data is entities but, as Danny Sullivan points out, it could also be used to bring social data into the engine. I still don’t think that’s happening right now, but the infrastructure may now be in place to do so.

The algorithms aren’t really changing but the amount of data Google can now process allows for greater precision and insight.

Deep Learning

Mr. Fusion Home Reactor

What we’re really talking about is a field that is being referred to as deep learning, which you can think of as machine learning on steroids.

This is a really fascinating (and often dense) area that looks at the use of labeled and unlabeled data and the use of supervised and unsupervised learning models. These concepts are somewhat related and I’ll try to quickly explain them, though I may mangle the precise definitions. (Scholarly types are encouraged to jump in and provide correction or guidance.)

The vast majority of data is unlabeled, which is a fancy way of saying that it hasn’t been classified or doesn’t have any context. Labeled data has some sort of classification or identification to it from the start.

Unlabeled data would be the tub of old photographs while labeled data might be the same tub of photographs but with ‘Christmas 1982’, ‘Birthday 1983’, ‘Joe and Kelly’ etc. scrawled in black felt tip on the back of each one. (Here’s another good answer to the difference between labeled and unlabeled data.)

Why is this important? Let’s return to Jeff Dean (who is a very important figure in my view) to tell us.

You’re always going to have 100x, 1000x as much unlabeled data as labeled data, so being able to use that is going to be really important.

The difference between supervised learning and unsupervised learning is similar. Supervised learning means that the model is looking to fit things into a pre-conceived classification. Look at these photos and tell me which of them are cats. You already know what you want it to find. Unsupervised learning on the other hand lets the model find its own classifications.

If I have it right, supervised learning has a training set of labeled data whereas unsupervised learning has no initial training set. All of this is wrapped up in the fascinating idea of neural networks.
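Here’s a minimal sketch of the difference, using a single made-up number per photo instead of a neural network. The supervised half learns one centroid per label from labeled examples, while the unsupervised half runs plain k-means on unlabeled values. The data and function names are assumptions for illustration only.

```python
# Toy 1-D "photos": each number is a made-up feature value.
labeled = [(0.9, "cat"), (1.1, "cat"), (4.8, "dog"), (5.2, "dog")]
unlabeled = [1.0, 0.8, 5.0, 5.1, 1.2, 4.9]

def train_supervised(examples):
    """Learn one centroid per label from labeled examples."""
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(centroids, value):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

def cluster_unsupervised(values, k=2, rounds=10):
    """Plain k-means: groups emerge from the data with no names attached."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - v))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

centroids = train_supervised(labeled)
print(classify(centroids, 1.3))       # gets a name, because we supplied labels
groups = cluster_unsupervised(unlabeled)
print(groups)                         # two groups, but nobody told it what they are
```

The supervised model can say ‘cat’ because someone wrote ‘cat’ on the training examples. The unsupervised one only discovers that there are two clumps of similar things.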

The different models for learning via neural nets, and their variations and refinements, are myriad. Moreover, researchers do not always clearly understand why certain techniques work better than others. Still, the models share at least one thing: the more data available for training, the better the methods work.

The emphasis here is mine because I think it’s extremely relevant. Caffeine and Hummingbird allow Google to both use more data and to process that data quickly. Maybe Hummingbird is the ability to deploy additional layers of unsupervised learning across a massive corpus of documents?

And that cat reference isn’t just because I like LOLcats. A team at Google (including Jeff Dean) was able to use unlabeled, unsupervised learning to identify cats (among other things) in YouTube thumbnails (PDF).

So what does this all have to do with Hummingbird? Quite a bit if I’m connecting the dots the right way. Once again I’ll refer back to the Jeff Dean interview (which I seem to get something new out of each time I read it).

We’re also collaborating with a bunch of different groups within Google to see how we can solve their problems, both in the short and medium term, and then also thinking about where we want to be four years, five years down the road. It’s nice to have short-term to medium-term things that we can apply and see real change in our products, but also have longer-term, five to 10 year goals that we’re working toward.

Remember at the end of Back to The Future when Doc shows up and implores Marty to come to the future with him? The flux capacitor used to need plutonium to reach critical mass but this time all it takes is some banana peels and the dregs from some Miller Beer in a Mr. Fusion home reactor.

So not only is Hummingbird a hybrid engine but it’s hooked up to something that can turn relatively little into a whole lot.

Quantum Computing

So let’s take this a little bit further and look at Google’s interest in quantum computing. Back in 2009 Hartmut Neven was talking about the use of quantum algorithms in machine learning.

Over the past three years a team at Google has studied how problems such as recognizing an object in an image or learning to make an optimal decision based on example data can be made amenable to solution by quantum algorithms. The algorithms we employ are the quantum adiabatic algorithms discovered by Edward Farhi and collaborators at MIT. These algorithms promise to find higher quality solutions for optimization problems than obtainable with classical solvers.

This seems to have yielded positive results because in May 2013 Google upped the ante and entered into a quantum computer partnership with NASA. As part of that announcement we got some insight into Google’s use of quantum algorithms.

We’ve already developed some quantum machine learning algorithms. One produces very compact, efficient recognizers — very useful when you’re short on power, as on a mobile device. Another can handle highly polluted training data, where a high percentage of the examples are mislabeled, as they often are in the real world. And we’ve learned some useful principles: e.g., you get the best results not with pure quantum computing, but by mixing quantum and classical computing.

A highly polluted set of training data where many examples are mislabeled? Makes you wonder what that might be doesn’t it? Link graph analysis perhaps?

Are quantum algorithms part of Hummingbird? I can’t be certain. But I believe that Hummingbird lays the groundwork for these types of leaps in optimization.

What About Conversational Search?

Dog Answering The Phone

There’s also a lot of talk about conversational search (pun intended). I think many are conflating Hummingbird with the gains in conversational search. Mind you, the basis of voice and conversational search is still machine learning. But Google’s focus on conversational search is largely a nod to the future.

We believe that voice will be fundamental to building future interactions with the new devices that we are seeing.

And the first area where they’ve made advances is the ability to resolve pronouns in query chains.

Google understood my context. It understood what I was talking about. Just as if I was having a conversation with you and talking about the Eiffel Tower, I wouldn’t have to keep repeating it over and over again.

Does this mean that Google can resolve pronouns within documents? They’re getting better at that (there’s a huge corpus of research actually) but I doubt it’s to the level we see in this distinct search microcosm.

Conversational search has a different syntax and demands a slightly different language model to better return results. So Google’s betting that conversational search will be the dominant method of searching and is adapting as necessary.

What Does Hummingbird Do?

What's That Mean Far Field Productions

This seems to be the real conundrum when people look at Hummingbird. If it affects 90% of searches worldwide why didn’t we notice the change?

Hummingbird makes results even more useful and relevant, especially when you ask Google long, complex questions.

That’s what Amit says of Hummingbird and I think this makes sense and can map back to the idea of synonyms (which are still quite powerful). But now, instead of looking at a long query and looking at word synonyms Google could also be applying entity synonyms.

Understanding the meaning of the query might be more important than the specific words used in the query. It reminds me a bit of Aardvark which was purchased by Google in February 2010.

Aardvark analyzes questions to determine what they’re about and then matches each question to people with relevant knowledge and interests to give you an answer quickly.

I remember using the service and seeing how it would interpret messy questions and then deliver a ‘scrubbed’ question to potential candidates for answering. There was a good deal of technology at work in the background and I feel like I’m seeing it magnified with Hummingbird.

And it resonates with what Jeff Dean has to say about analyzing sentences.

I think we will have a much better handle on text understanding, as well. You see the very slightest glimmer of that in word vectors, and what we’d like to get to where we have higher level understanding than just words. If we could get to the point where we understand sentences, that will really be quite powerful. So if two sentences mean the same thing but are written very differently, and we are able to tell that, that would be really powerful. Because then you do sort of understand the text at some level because you can paraphrase it.

My take is that 90% of the searches were affected because documents that appear in those results were re-scored or refined through the addition of entity data and the application of machine learning across a larger data set.

It’s not that those results have changed but that they have the potential to change based on the new infrastructure in place.

Hummingbird Response

Le homard et le chat

How should you respond to Hummingbird? Honestly, there’s not a whole lot to do in many ways if you’ve been practicing a certain type of SEO.

Despite the advice to simply write like no one’s watching, you should make sure your writing is tight and is using subjects that can be identified by people and search engines. “It is a beautiful thing” won’t do as well as “Picasso’s Lobster and Cat is a beautiful painting”.

You’ll want to make your content easy to read and remember, link out to relevant and respected sources, build your authority by demonstrating your subject expertise, engage in the type of social outreach that produces true fans and conduct more traditional marketing and brand building efforts.


Hummingbird is an infrastructure change that allows Google to take advantage of additional sources of data, such as entities, as well as leverage new deep learning models that increase the precision of current algorithms. The first application of Hummingbird was the refinement of Google’s document topic modeling, which is vital to delivering relevant search results.

Authorship Is Dead, Long Live Authorship

October 24 2013 // SEO // 62 Comments

Google’s Authorship program is still a hot topic. A constant string of blog posts, conference sessions and ‘research’ projects about Authorship and the idea that it can be used as a ranking signal fill our community.

I Do Not Think It Means What You Think It Does

Yet, the focus on the actual markup and clock-watching when AuthorRank might show up may not be the best use of time.

Would it surprise you to learn that the Authorship Project at Google has been shuttered? Or that this signals not the death of Authorship but a different method of assigning Authorship?

Here’s my take on where Authorship stands today.

RIP Authorship Project

The Authorship Project at Google was headed up by Othar Hansson. He’s an incredibly smart and amiable guy, who from time to time was kind enough to provide answers and insight into Authorship. I was going to reach out to him again the other day and discovered something.

Othar Hansson Google+ About

Othar no longer works on the Authorship Project. He’s now a principal engineer on the Android search team, which is a pretty sweet gig. Congratulations!

Remember that it was Othar who announced the new markup back in June of 2011 and then appeared with Matt Cutts in the Authorship Markup video. His departure is meaningful. More so because I can’t locate a replacement. (That doesn’t mean there isn’t one but … usually I’m pretty good at connecting with folks.)

Not only that but there was no replacement for Sagar Kamdar, who left as product manager of Authorship (among other things) in July of 2012 to work at Google X and, ultimately, Project Loon.

At the time I thought the writing was on the wall. The Authorship Project wasn’t getting internal resources and wasn’t a priority for Google.

Authorship Adoption

Walter White with his Pontiac Aztek

The biggest problem with Authorship markup is adoption. Not everyone is participating. Study after study after study show that there are material gaps in who is and isn’t using the markup. Even the most rosy study of Authorship adoption by technology writers isn’t anything to write home about.

Google is unable to use Authorship as a ranking signal if important authors aren’t participating.

That means people like Neil Gaiman and Kevin Kelly wouldn’t rank as well since they don’t employ Authorship markup. It doesn’t take a lot of work to find important people who aren’t participating and that makes any type of AuthorRank that relies on markup a non-starter.

Authorship SERP Benefits

Search Result Heatmap For Authorship Snippet

Don’t get me wrong. Google still supports Authorship markup and there are clear click-through rate benefits to having an Authorship snippet on a search result. Even if you don’t believe me or Cyrus Shepard, you should believe Google and the research they’ve done on social annotations in 2012 (PDF) and 2013 (PDF).

So if you haven’t implemented Google Authorship yet it’s still a good idea to do so. You’ll receive a higher click-through rate and will build authority (different from AuthorRank), both of which may help you rank better over time.

Google knows users respond to Authorship.

Inferred Authorship

I Know What You Did Last Summer

It’s clear that Google still wants to do something about identifying authority and expertise. Any monkey with a keyboard can add content to the Internet. So increasingly it’s about who is creating that content and why you should trust and value their opinion.

One of the first ways Google was able to infer identity (aka authorship) was by crawling the public social graph. Rapleaf took the brunt of the backlash for this but Google was quietly mapping all of your social profiles as well.

So even if you don’t have Authorship markup on a Quora or Slideshare profile Google probably knows about it and could assign Authorship. All this data used to be available via social circles but Google removed this feature a few years ago. But that doesn’t mean Google isn’t mining the social graph.

Heck, Google could even employ usernames as a way to identify accounts from the same person. What we’re really talking about here is how Google can identify people and their areas of expertise.

Authors are People are Entities

But what if Google took another approach to identifying authors? Instead of looking for specific markup what if they looked for entities that happen to be people.

Authors are people are entities.

This would solve the adoption issue. And that’s what the Freebase Annotations of the ClueWeb Corpora (FACC) seems to indicate.

Identifying Authors in Text

The picture makes it pretty clear in my mind. Here we’re seeing that Google has been able to identify an entity (a person in this instance) within the text of a document and match it to a Freebase identifier.

Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%. Not every ClueWeb document is included in this corpus; documents in which we found no entities were excluded from the set. A document might be excluded because there were no entities to be found, because the entities in question weren’t in Freebase, or because none of the entities were resolved at a confidence level above the threshold.

At a glance you might think this means that Google still has a ‘coverage’ problem if they were to use entities as their approach to Authorship. But think about who is and isn’t in Freebase (or Wikipedia). In some ways, these repositories are biased towards those who have achieved some level of notoriety.

Would Google prefer to rely on self referring markup or a crowd based approach to identifying experts?

Google+ Is An Entity Platform

AJ Kohn Cheltenham High School ID

While Google might prefer to use a smaller set of crowd sourced entities to assign Authorship initially I think they’d ultimately like to have a larger corpus of Authors. That’s where Google+ fits into the puzzle.

I think most people understand that Google+ is an identity platform. But if people are entities (and so are companies) then Google+ is a huge entity platform, a massive database of people.

Google+ is the knowledge graph of everyday people.

And if we then harken back to social circles, to mapping the social graph and to measuring engagement and activity, we can begin to see how a comprehensive Authorship program might take shape.

Extract, Match and Measure

Concentration Board Game

Authorship then becomes about Google’s ability to extract entities from documents, matching those entities to a corpus that contains descriptors of that entity (i.e. – social profiles, official page(s), subjects) and then measuring the activity around that entity.

Perhaps Google could even go so far as to understand triples on a very detailed (document) level, noting which documents I might have authored as well as the documents in which I’ve been mentioned.

The presence of Authorship markup might increase the confidence level of the match but it will likely play a supporting and refining role instead of the defining role in the process.

Trust and Authority

Trust Me Sign

I’m reminded that Google talks frequently about trust and authority. For years that was about how it assessed sites but that same terminology can (and should) be applied to people as well.

Authorship markup is but one part of the equation but that alone won’t translate into some magical silver bullet of algorithmic success. Building authority is what will ultimately matter and be reflected in any related ranking signal.

Are the documents you author well regarded by your peers? Are they shared? By who? How often? With what velocity? And are you mentioned (or cited) by other documents? Do they sit on respected sites? Who are they authored by? What text surrounded your mention?

So part of this is doing the hard work of producing memorable content, marketing yourself and engaging with your community. The other part will be ensuring that your entity information is both comprehensive and up-to-date. That means filling out your entire Google+ profile and potentially finding ways to add yourself to traditional entity resources such as Wikipedia and Freebase.

Just as links are the result and not the goal of your efforts, any sort of AuthorRank will be the result of building your own trust and authority through content and engagement.


The Authorship Project at Google has been abandoned. But that doesn’t mean Authorship is dead. Instead it signals a change in tactics from Authorship markup to entity extraction as a way to identify experts and a pathway to using Authorship as a ranking signal.

Crawl Optimization

July 29 2013 // SEO // 84 Comments

Crawl optimization should be a priority for any large site looking to improve their SEO efforts. By tracking, monitoring and focusing Googlebot you can gain an advantage over your competition.

Crawl Budget

Ceiling Cat

It’s important to cover the basics before discussing crawl optimization. Crawl budget is the time or number of pages Google allocates to crawl a site. How does Google determine your crawl budget? The best description comes from an Eric Enge interview of Matt Cutts.

The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline.

Another way to think about it is that the low PageRank pages on your site are competing against a much larger pool of pages with the same or higher PageRank. There are a large number of pages on the web that have very little or close to zero PageRank. The pages that get linked to a lot tend to get discovered and crawled quite quickly. The lower PageRank pages are likely to be crawled not quite as often.

In other words, your crawl budget is determined by authority. This should not come as a shock. But that was pre-Caffeine. Have things changed since?



What is Caffeine? In this case it’s not the stimulant in your latte. But it is a stimulant of sorts. In June of 2010, Google rebuilt the way they indexed content. They called this change ‘Caffeine’ and it had a profound impact on the speed at which Google could crawl and index pages. The biggest change, as I see it, was incremental indexing.

Our old index had several layers, some of which were refreshed at a faster rate than others; the main layer would update every couple of weeks. To refresh a layer of the old index, we would analyze the entire web, which meant there was a significant delay between when we found a page and made it available to you.

With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index. That means you can find fresher information than ever before—no matter when or where it was published.

Essentially, Caffeine removed the bottleneck for getting pages indexed. The system they built to do this is aptly named Percolator.

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.

The speed at which Google can crawl is now matched by the speed of indexation. So did crawl budgets increase as a result? Some did, but not as much as you might suspect. And here’s where it gets interesting.

Googlebot seems willing to crawl more pages post-Caffeine but it’s often crawling the same pages (the important pages) with greater frequency. This makes a bit of sense if you think about Matt’s statement along with the average age of documents benchmark. Pages deemed to have more authority are given crawl priority.

Google is looking to ensure the most important pages remain the ‘freshest’ in the index.

Time Since Last Crawl

Googlebot's Google Calendar

What I’ve observed over the last few years is that pages that haven’t been crawled recently are given less authority in the index. To be more blunt, if a page hasn’t been crawled recently, it won’t rank well.

Last year I got a call from a client about a downward trend in their traffic. Using advanced segments it was easy to see that there was something wrong with their product page traffic.

Looking around the site I found that, unbeknownst to me, they’d implemented pagination on their category results pages. Instead of all the products being on one page, they were spread out across a number of paginated pages.

Products that were on the first page of results seemed to be doing fine but those on subsequent pages were not. I started to look at the cache date on product pages and found that those that weren’t crawled (I’m using cache date as a proxy for crawl date) in the last 7 days were suffering.

Undo! Undo! Undo!


That’s right, I told them to go back to unpaginated results. What happened?


You guessed it. Traffic returned.

Since then I’ve had success with depagination. The trick here is to think about it in terms of progressive enhancement and ‘mobile’ user experiences.

The rise of smartphones and tablets has made click based pagination a bit of an anachronism. Revealing more results by scrolling (or swiping) is an established convention and might well become the dominant one in the near future.

Can you load all the results in the background and reveal them only when users scroll to them without crushing your load time? It’s not always easy and sometimes there are tradeoffs but it’s a discussion worth having with your team.

Because there’s no better way to get those deep pages crawled than by having links to all of them on that first page of results.


Was I crazy to think that the time since last crawl could be a factor in ranking? It turns out I wasn’t alone. Adam Audette (a smart guy) mentioned he’d seen something like this when I ran into him at SMX West. Then at SMX Advanced I wound up talking with Mitul Gandhi, who had been tracking this in more detail at seoClarity.

seoClarity graph

Mitul and his team were able to determine that content not crawled within ~14 days receives materially less traffic. Not only that, but getting those same pages crawled more frequently produced an increase in traffic. (Think about that for a minute.)

At first, Google clearly crawls using PageRank as a proxy. But over time it feels like they’re assigning a self-referring CrawlRank to pages. Essentially, if a page hasn’t been crawled within a certain time period then it receives less authority. Let’s revisit Matt’s description of crawl budget again.

Another way to think about it is that the low PageRank pages on your site are competing against a much larger pool of pages with the same or higher PageRank. There are a large number of pages on the web that have very little or close to zero PageRank.

The pages that aren’t crawled as often are pages with little to no PageRank. CrawlRank is the difference in this very large pool of pages.

You win if you get your low PageRank pages crawled more frequently than the competition.

Now what CrawlRank is really saying is that document age is a material ranking factor for pages with little to no PageRank. I’m still not entirely convinced this is what is happening, but I’m seeing success using this philosophy.

Internal Links

One might argue that what we’re really talking about is internal link structure and density. And I’d agree with you!

Not only should your internal link structure support the most important pages of your site, it should make it easy for Google to get to any page on your site in a minimum of clicks.

One of the easier ways to determine which pages are deemed most important (based on your internal link structure) is by looking at the Internal Links report in Google Webmaster Tools.

Google Webmaster Tools Internal Links

Do the pages at the top reflect the most important pages on your site? If not, you might have a problem.

I have a client whose blog was receiving 35% of Google’s crawl each day. (More on how I know this later on.) This is a blog with 400 posts amid a total content corpus of 2 million+ URLs. Googlebot would crawl blog content 50,000+ times a day! This wasn’t where we wanted Googlebot spending its time.

The problem? They had menu links to the blog and each blog category on nearly all pages of the site. When I went to the Internal Links report in Google Webmaster Tools you know which pages were at the top? Yup. The blog and the blog categories.

So, we got rid of those links. Not only did it change the internal link density but it changed the frequency with which Googlebot crawls the blog. That’s crawl optimization in action.

Flat Architecture

Flat Architecture

Remember the advice to create a flat site architecture. Many ran out and got rid of subfolders thinking that if the URL didn’t have subfolders then the architecture was flat. Um … not so much.

These folks destroyed the ability for easy analysis, potentially removed valuable data in assessing that site, and did nothing to address the underlying issue of getting Google to pages faster.

How many clicks from the home page is each piece of content? That’s what was, and remains, important. It doesn’t matter how flat the URL looks if it takes Googlebot (and users) 8 clicks to get there.

Is that mega-menu on every single page really doing you any favors? Once you get someone to a leaf level page you want them to see similar leaf level pages. Related product or content links are the lifeblood of any good internal link structure and are, sadly, frequently overlooked.

Depagination is one way to flatten your architecture but a simple HTML sitemap, or specific A-Z sitemaps can often be very effective hacks.

Flat architecture shortens the distance between authoritative pages and all other pages, which increases the chances of low PageRank pages getting crawled on a frequent basis.
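If you want to put a number on ‘clicks from the home page’, a breadth-first search over your internal link graph will do it. This is only a minimal sketch: the `LINKS` graph and its URLs are made up for illustration, and in practice you’d build the graph from a crawl of your own site.

```python
from collections import deque

# Hypothetical internal link graph: page -> pages it links to.
LINKS = {
    "/": ["/category/a", "/category/b", "/blog"],
    "/category/a": ["/product/1", "/category/a?page=2"],
    "/category/a?page=2": ["/product/2", "/category/a?page=3"],
    "/category/a?page=3": ["/product/3"],
    "/category/b": ["/product/4"],
    "/blog": [],
    "/product/1": ["/product/3"],  # a related-product link
}


def click_depth(links, start="/"):
    """Return the minimum number of clicks from `start` to every reachable page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:  # first visit is the shortest path in BFS
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


for page, clicks in sorted(click_depth(LINKS).items(), key=lambda item: item[1]):
    print(clicks, page)
```

In this toy graph the related-product link on `/product/1` is what keeps `/product/3` at three clicks. Take it away and the only path runs through the paginated category pages, which pushes it to four.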

Tracking Googlebot

“A million dollars isn’t cool. You know what’s cool? A billion dollars.”

Okay, Sean Parker probably didn’t say that in real life but it’s an apt analogy for the difference in knowing how many pages Googlebot crawled versus where Googlebot is crawling, how often and with what result.

The Crawl Stats graph in Google Webmaster Tools only shows you how many pages are crawled per day.

Google Webmaster Tools Crawl Stats

For nearly five years I’ve worked with clients to build their own Googlebot crawl reports.

Googlebot Crawl Reporting That's Cool

That’s cool.

And it doesn’t always have to look pretty to be cool.

Googlebot Crawl Report by Page Type and Status

Here I can tell there’s a problem with this specific page type. More than 50% of the crawl on that page type is producing a 410. That’s probably not a good use of crawl budget.

All of this is done by parsing or ‘grepping’ log files (a line by line history of visits to the site) looking for Googlebot. Here’s a secret. It’s not that hard, particularly if you’re even half-way decent with Regular Expressions.

I won’t go into details (this post is long enough as it is) but you can check out posts by Ian Lurie and Craig Bradford for more on how to grep log files.

In the end I’m interested in looking at the crawl by page type and response code.

Googlebot Crawl Report Charts

You determine page type using RegEx. That sounds mysterious but all you’re doing is bucketing page types based on pattern matching.
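To make the bucketing less mysterious, here’s a rough sketch of the idea in Python. It assumes your server writes the common Apache/Nginx ‘combined’ log format, and the page type patterns and sample log lines are invented for illustration, so swap in your own. It also trusts the user agent string, which can be spoofed, so a real report would want to verify Googlebot (for example with a reverse DNS lookup).

```python
import re
from collections import Counter

# Assumes the Apache/Nginx "combined" log format; adjust for your server.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical URL patterns for each page type; the first match wins.
PAGE_TYPES = [
    ("product", re.compile(r"^/product/")),
    ("category", re.compile(r"^/category/")),
    ("blog", re.compile(r"^/blog(/|$)")),
]


def page_type(path):
    """Bucket a URL path into a page type by pattern matching."""
    for name, pattern in PAGE_TYPES:
        if pattern.search(path):
            return name
    return "other"


def crawl_report(lines):
    """Count Googlebot requests by (page type, response code)."""
    report = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Skip unparseable lines and anything that isn't Googlebot.
        if not match or "Googlebot" not in match.group("agent"):
            continue
        report[(page_type(match.group("path")), match.group("status"))] += 1
    return report


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Made-up sample lines, not real log data.
SAMPLE_LOG = [
    '66.249.66.1 - - [29/Jul/2013:10:00:01 -0700] "GET /product/123 HTTP/1.1" 200 5120 "-" "%s"' % GOOGLEBOT,
    '66.249.66.1 - - [29/Jul/2013:10:00:02 -0700] "GET /product/456 HTTP/1.1" 410 312 "-" "%s"' % GOOGLEBOT,
    '66.249.66.1 - - [29/Jul/2013:10:00:03 -0700] "GET /blog/crawl-optimization HTTP/1.1" 200 8042 "-" "%s"' % GOOGLEBOT,
    '203.0.113.9 - - [29/Jul/2013:10:00:04 -0700] "GET /product/123 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1) Chrome/28.0"',
]

for (ptype, status), hits in sorted(crawl_report(SAMPLE_LOG).items()):
    print(ptype, status, hits)
```

Run that against a day of logs and you have the raw numbers behind a crawl by page type and response code chart.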

I want to know where Googlebot is spending time on my site. As Mike King said, Googlebot is always your last persona. So tracking Googlebot is just another form of user experience monitoring. (Referencing it like this might help you get this project prioritized.)

You can also drop the crawl data into a database so you can query things like time since last crawl, total crawl versus unique crawl or crawls per page. Of course you could also give seoClarity a try since they’ve got a lot of this stuff right out of the box.
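As a rough idea of what those queries might look like, here’s a small sketch using SQLite from Python. The `crawl` table, its columns, the sample rows and the report date are all assumptions for illustration. Your own schema will depend on what you pull out of the logs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per Googlebot request.
conn.execute("CREATE TABLE crawl (url TEXT, crawled_on TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO crawl VALUES (?, ?, ?)",
    [
        ("/product/123", "2013-07-01", 200),
        ("/product/123", "2013-07-28", 200),
        ("/product/456", "2013-07-05", 200),
        ("/blog/post", "2013-07-27", 200),
        ("/blog/post", "2013-07-28", 200),
        ("/blog/post", "2013-07-29", 200),
    ],
)

REPORT_DATE = "2013-07-29"  # made-up "today" for the example


def crawl_summary(connection):
    """Return (total crawl, unique crawl) as a tuple of counts."""
    return connection.execute(
        "SELECT COUNT(*), COUNT(DISTINCT url) FROM crawl"
    ).fetchone()


def days_since_last_crawl(connection, as_of):
    """Return (url, crawls, days since last crawl), stalest pages first."""
    rows = connection.execute(
        """
        SELECT url,
               COUNT(*) AS crawls,
               CAST(julianday(?) - julianday(MAX(crawled_on)) AS INTEGER) AS days_since
        FROM crawl
        GROUP BY url
        ORDER BY days_since DESC
        """,
        (as_of,),
    )
    return rows.fetchall()


print(crawl_summary(conn))
for url, crawls, days_since in days_since_last_crawl(conn, REPORT_DATE):
    flag = "STALE" if days_since > 14 else ""
    print(url, crawls, days_since, flag)
```

The 14 day cutoff here simply echoes the seoClarity observation above. Treat it as a starting point and tune it against your own traffic data.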

If you’re not tracking Googlebot then you’re missing out on the first part of the SEO process.

You Are What Googlebot Eats

Cookie Monster Fruit

What you begin to understand is that you’re assessed based on what Googlebot crawls. So if they’re crawling a whole bunch of parameter based, duplicative URLs or you’ve left the email-a-friend link open to be crawled on every single product, you’re giving Googlebot a bunch of empty calories.

It’s not that Google will penalize you, it’s the opportunity cost for dirty architecture based on a finite crawl budget.

The crawl spent on junk could have been spent crawling low PageRank pages instead. So managing your URL Parameters and using robots.txt wisely can make a big difference.

Many large sites will also have robust external link graphs. I can leverage those external links, rely less on internal link density to rank well, and can focus my internal link structure to ensure low PageRank pages get crawled more frequently.

There’s no patent right or wrong answer. Every site will be different. But experimenting with your internal link strategies and measuring the results is what separates the great from the good.

Crawl Optimization Checklist

Here’s a quick crawl optimization checklist to get you started.

Track and Monitor Googlebot

I don’t care how you do it but you need this type of visibility to make any inroads into crawl optimization. Information is power. Learn to grep, perfect your RegEx. Be a collaborative partner with your technical team to turn this into an automated daily process.

Manage URL Parameters

Yes, it’s confusing. You will probably make some mistakes. But that shouldn’t stop you from using this feature and changing Googlebot’s diet.

Use Robots.txt Wisely

Stop feeding Googlebot empty calories. Use robots.txt to keep Googlebot focused and remember to make use of pattern matching.
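Googlebot supports two pattern matching characters in robots.txt paths: `*` matches any run of characters and `$` anchors the end of the URL. If you want to sanity check a rule before you ship it, a sketch like the one below can help. The `DISALLOW` rules are hypothetical, and this only tests Disallow patterns. It does not model Allow rules or the precedence between them, so don’t treat it as a full robots.txt parser.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule using * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape everything literally, then let each * match any run of characters.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


# Hypothetical Disallow rules for illustration only.
DISALLOW = ["/*?sort=", "/email-a-friend/", "/*.pdf$"]


def is_blocked(path, rules=DISALLOW):
    """Return True if any Disallow rule matches the URL path."""
    return any(rule_to_regex(rule).search(path) for rule in rules)


for path in ["/category/shoes?sort=price", "/email-a-friend/123",
             "/docs/guide.pdf", "/product/123"]:
    print(path, is_blocked(path))
```

Feed it a sample of the URLs Googlebot actually requested and you can see which empty calories a rule would cut, and whether it accidentally blocks pages you care about.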

Don’t Forget HTML Sitemap(s)

Seriously. I know human users might not be using these, but Googlebot is a different type of user with slightly different needs.

Optimize Your Internal Link Structure

Whether you try depagination to flatten your architecture, re-evaluate navigation menus, or play around with crosslink modules, find ways to optimize your internal link structure to get those low PageRank pages crawled more frequently.

Keywords Still Matter

June 05 2013 // SEO // 59 Comments

As content marketing becomes the new black I’m starting to hear people talk about how keywords don’t matter anymore. This sentiment appears in more than a few posts and the general tenor seems to be that keyword focused strategies are a thing of the past – a relic from a dark time.

The problem? You need keywords to produce successful content.

Dwight Meme Keywords

Keyword Syntax

How do people search for something? That’s what keywords are all about. It’s vital to ensuring your content will be found and resonate with your users.

keyword syntax

Are people searching for ‘all weather fluid displacement sculptures’ or ‘outdoor water fountains’? That’s an extreme example but it makes an important point.

You need to understand the user and the words they use to find your content.

Keyword Intent

Keywords can also tell you a lot about the intent of a search. Look (well) beyond informational, navigational and transactional intent and start thinking about how you can map keywords to the various stages of your site’s conversion funnel.

For instance, what does a query like ‘majestic seo vs open site explorer’ tell you? This user is probably further along in the purchase funnel. They’re aware of their choices and may have even narrowed it down to these two options. The keyword (yes, keyword) ‘vs’ makes it clear that they’re looking for comparison data.

Google SERP for Comparison Intent

Sure enough, most of the results returned are posts that compare these two tools. Those pieces of content squarely meet that intent, in part because they’re paying attention to keywords.

Majestic SEO has a result but … it’s the home page. Is that going to satisfy the desire to compare? Probably not. And where’s SEOMoz? Missing in action.

Each could rely on the blog posts presented to deliver this comparison. Or they could also develop content that met that keyword and intent, allowing them to tell their story and frame the debate.

I know some will shriek, “Are you crazy? You don’t want to promote your competition by mentioning them so prominently!” But that’s denying reality. Users are searching with this syntax and intent.

Now, I’m not saying you have to put content that meets this particular intent prominently on the site or in the normal conversion flow. But if you know someone is on the fence and comparing products, why wouldn’t you want a chance to engage that user on your own terms?

Keywords let you create content that matches user intent.
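One simple way to start mapping keywords to funnel stages is to bucket queries by their modifiers. The sketch below is only a toy: the stage names and the modifier lists are my own assumptions, and you’d want to build yours from your actual query data.

```python
import re

# Hypothetical modifier -> funnel stage rules; the first match wins.
INTENT_RULES = [
    ("comparison", re.compile(r"\b(vs|versus|compare|alternatives?)\b")),
    ("support", re.compile(r"\b(manual|repair|fix|parts)\b")),
    ("purchase", re.compile(r"\b(buy|price|coupon|discount)\b")),
]


def classify(query):
    """Return the funnel stage suggested by the modifiers in a query."""
    normalized = query.lower()
    for stage, pattern in INTENT_RULES:
        if pattern.search(normalized):
            return stage
    return "research"  # default bucket when no modifier matches


for query in ["majestic seo vs open site explorer",
              "eureka 313a manual",
              "outdoor water fountains"]:
    print(query, "->", classify(query))
```

Even a crude pass like this makes it obvious which stages you have content for and which you’re leaving to someone else.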

Magic Questions

Oh-O It's Magic!

There’s also a lot of meta information that comes along with a keyword. I’m fond of using a term like ‘eureka 313a manual’ as an example. It’s a query for a vacuum cleaner manual.

On the one hand it’s pretty simple. There’s explicit intent. Someone is looking for the manual to their vacuum cleaner. The content to meet that informational search would be … the manual. But, what’s really going on?

If you’re searching for the manual, odds are that something is wrong with your vacuum. There’s an implied intent at work. The vacuum is either not working right or is flat out broken. You have the opportunity to anticipate and answer magic questions.

How can I fix my vacuum? Where can I buy replacement parts? Are there repair shops near me? What vacuum should I get to replace this one if it can’t be fixed?

By decoding the keyword you can create a relevant and valuable page that meets explicit and implied intent.

Keyword Frequency

Keyword frequency is important. Yes, really. One of my favorite examples of this is LinkedIn. How did they secure their place in the competitive ‘name’ query space?

LinkedIn Keyword Frequency

LinkedIn wanted to make it clear what (or who) these pages were about. That’s what keyword frequency is about, making it easy for search engines and users to understand what that page is about.

LinkedIn doesn’t just do it with their headers either, but uses the name frequently elsewhere on the page. The result?

Marshall Simmonds Wordle

There’s no question what this page is about.

Keywords are Steve Krug for Googlebot.


This Is Not A Pipe

The reaction I get from many when I press on this issue is that it produces a poor user experience. Really? I’ve never heard anyone complain about LinkedIn and most never realize that it’s even going on.

Using the keywords people expect to see can only help make your content more readable, which is still a tremendously undervalued aspect of SEO. Because people scan text and rarely read word for word.

And what do you think they’re scanning for? What do you think is rattling around in their brain when they’re scanning your content? It’s not something random like ‘bellhop poodle duster’, it’s probably the keyword that brought them there.

You may think Google is smart enough to figure it out. You’ll claim that Google’s gotten far more sophisticated in the application of synonyms and topical modeling. And you’d be right to a degree. But why take the chance? Particularly since users crave the repetition and consistency.

They don’t want you to use four different ways to say the same thing and the hard truth is they’re probably only going to read one of those words anyway. You’ll create better content for users if you write for search engines.

Make sure you’re using the words users expect to see.


Keywords aren’t going away, they’re becoming more important. Query syntax and user intent are vital in producing relevant and valuable content that resonates with users and answers both explicit and implicit questions.

Google Removes Related Searches

April 19 2013 // Rant + SEO // 45 Comments

This morning I went to use one of my go-to techniques for keyword research and found it was … missing.

Related Searches Gone

Related Searches Option Gone

It was bad enough that the new Search tools interface was this awkward double-click menu but I understood that decision. Because most mainstream users don’t ever refine their results.

But to remove related searches from that menu altogether? In less than a year related searches went from being a search tip to being shuffled off to Buffalo?


Out of Insight

Clooney is Pissed

Google needs to understand that there are SEOs, or digital marketing professionals if that makes it easier, who are helping to make search results better. We’re helping sites understand the syntax and intent of their users and creating relevant and valuable experiences to match and satisfy those queries.

I wasn’t happy but wasn’t that upset when Google introduced (not provided). But as the amount of (not provided) traffic increases I see no reason why Google shouldn’t implement my (not provided) drill down suggestion. Seriously, get on that.

But then Google merged Google Trends with Google Insights for Search and in the process removed its most useful feature. That’s right, knowing what percentage of the traffic was attributed to each category let SEOs better understand the intent of that query.

Now Google’s taking away the interface for related searches? Yeah, you’ve gone too far now. Hulk mad.

Stop Ignoring Influencers

You Wouldn't Like Me When I'm Angry

Just like the decision to terminate Google Reader, Google doesn’t seem to understand that they need to address influencers. And believe it or not Google, SEOs are influencers. We’re demystifying search so that sites don’t fall for get-rank-quick schemes. And you need us to do that because you’re dreadful at SEO. Sites aren’t finding much of your educational content. They’re not. Really.

In the last year Google’s made it more and more difficult for SEOs to do good work. And you know who ultimately suffers? Google. Because the content coming out won’t match the right syntax and intent. It’ll get tougher for Google, over time, to find the ‘right’ content and users will feel the slow decline in search quality. You know, garbage in, garbage out.

Any good marketer understands that they have to serve more than one customer segment. Don’t like to think of SEOs as influencers? Fine. Call us power users and put us back on your radar and stop removing value from the search ecosystem.

Time To Long Click

April 17 2013 // SEO // 64 Comments

The internal metric Google uses to determine search success is time to long click. Understanding this metric is important for search marketers in assessing changes to the search landscape and developing better optimization strategies.

Short Clicks vs Long Clicks


Back in 2009 I wrote about the difference between short clicks and long clicks. A long click occurs when a user performs a search, clicks on a result and remains on that site for a long period of time. In the optimal scenario they do not return to the search results to click on another result or reformulate their query.

A long click is a proxy for user satisfaction and success.

On the other hand, a short click occurs when a user performs a search, clicks on a result and returns to the search results quickly to click on another result or reformulate their query. Short clicks are an indication of dissatisfaction.

Google measures success by how fast a search result produces a long click.

Bounce Rate vs Pogosticking

Before I continue I want to make sure we’re not conflating short clicks with bounce rate. While many bounces could be construed as short clicks, that’s not always the case. The bounce rate on Stack Overflow is probably very high. Users search for something specific, click through to a Stack Overflow result, get the answer they needed and move on with their life. This is not a bad thing. That’s actually a long click.

You can gain greater clarity on this by configuring an adjusted bounce rate or something even more advanced that takes into account the amount of time the user spent on the page. In the example above you’d likely see that users spent a material amount of time on that one page which would be a positive indicator.
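Here’s a small sketch of the difference between a plain bounce rate and a time-adjusted one. It assumes you can get time on the landing page for single-page sessions (standard analytics often can’t without a timed event), and the `SESSIONS` data and the 30 second threshold are made up for illustration.

```python
# Hypothetical session records: (pages viewed, seconds on landing page).
SESSIONS = [(1, 5), (1, 95), (1, 240), (3, 30), (1, 8), (2, 60)]


def bounce_rate(sessions):
    """Share of sessions that viewed only one page."""
    bounces = sum(1 for pages, _ in sessions if pages == 1)
    return bounces / len(sessions)


def adjusted_bounce_rate(sessions, min_seconds=30):
    """Share of sessions that viewed one page AND left before min_seconds."""
    short = sum(
        1 for pages, seconds in sessions
        if pages == 1 and seconds < min_seconds
    )
    return short / len(sessions)


print(round(bounce_rate(SESSIONS), 2))
print(round(adjusted_bounce_rate(SESSIONS), 2))
```

In this toy data four of the six sessions are bounces, but only two of them left within 30 seconds. The other two look a lot more like the Stack Overflow scenario than like dissatisfaction.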

The behavior you want to avoid is pogosticking. This occurs when users click through on a result, return quickly to the search results and click on another result. This indicates, to some extent, that the user was not satisfied with the original result.

Two problems present themselves with pogosticking. The first is that it’s impossible for sites to measure this metric. That sort of sucks. We can only look at short bounces as a proxy and even then can’t be sure that the user pogosticked to another result.

The second is that some verticals will naturally produce pogosticking behavior. Health related queries will show pogosticking behavior since users want to get multiple points of view (or opinions if you will) on that ailment or issue.

This could be overcome by measuring the normal pogosticking behavior for a vertical or query class and then determining which results produce lower and higher than normal pogosticking rates. I’m not sure Google is doing this but it’s not out of the question since they already have a robust understanding of query and vertical mapping.

But I digress.


Part of the way Google works on reducing the time to long click is by improving the speed of search results and the Internet in general. Their own research showed the impact of speed on search results.

All other things being equal, more usage, as measured by number of searches, reflects more satisfied users. Our experiments demonstrate that slowing down the search results page by 100 to 400 milliseconds has a measurable impact on the number of searches per user of -0.2% to -0.6% (averaged over four or six weeks depending on the experiment). That’s 0.2% to 0.6% fewer searches for changes under half a second!

Remember that while usage was the metric used, they were trying to measure satisfaction. Making it faster to get to information made people happier and more likely to use search for future information requests. Google’s simply reducing the friction of searching.

But it’s not just the speed of presenting results that matters. It’s how quickly Google gets someone to that long click. Search results that don’t produce long clicks are bad for business as are those that increase the time selecting a result. And pogosticking blows up the query timeline as users loop back and tack on additional seconds worth of selection and page load time.

Google Query Timeline

Make no mistake. Google wants to reduce every portion of this timeline they presented at Inside Search in 2011.



One of the ways in which we’ve seen Google reduce time to long click is through various ‘answers’ initiatives. Whether it’s a OneBox or a Knowledge Graph result the idea is that answers can often reduce the time to long click. It’s immediate gratification and in line with Amit Singhal’s Star Trek computer ideal.

In some cases a long click is measured by the absence of a click and reformulated query. If I search for weather, don’t click but don’t take any further actions, that should register as a long click.


John Henry Man vs Machine

You’ll also hear Google (and Bing) talk about the fact that ads are answers. Of course ads are what fill the coffers but they also provide another way to get people to a long click. Arguing the opposite (that ads aren’t contributing to satisfaction) is a lot like arguing that marketers and advertisers aren’t valuable.

Not only that, but Google has features in place to help ensure that good ad answers rise to the top. The auction model coupled with quality score and keyword level bidding all produce relevant ads that lead to long clicks.

The analysis of pixel space on search results is often used to show how Google is marginalizing organic search. Yet, the other way to look at it is that advertisers are getting better at delivering results (with the help of new Google ad extensions). Isn’t it, in some ways, man versus machine? The advertiser being able to deliver a better result than the algorithm?

Without doubt Google benefits financially from having more space dedicated to paid results but they still must result in long clicks for Google to optimize long-term use, which leads to long-term revenues and profits.

I would be very surprised if changes to search results (both paid and organic) weren’t measured by the impact they had in time to long click.


Bow Tie

All of this is interesting but what does the time to long click metric mean for SEO? More than you might suspect.

When I started in the SEO field I read everything I could get my hands on (which is not altogether different from now). At the time there was advice about becoming a hub.

There was a good deal of hand waving about the definition of a hub but the general idea was that you wanted to be at the center of a topic by providing value and resources. People would link to you and the traffic you received would often go on to the resources you provided. is a good example.

Funny thing is, this isn’t some well kept secret. Marshall Simmonds spells it out pretty clearly in this 2010 Whiteboard Friday video where he discusses bow tie theory (hubs) and link journalism. (I just watched this again while writing this and, man, this is an awesome video.)

Most people focus on the fact that hubs receive a lot of backlinks. They do because of the value they provide, which is often in the aggregation of and links to other content. In the end, the real value of hubs is that they play an important part in getting people to content and that long click.

Search is a multi-site experience.

This is what search marketers must realize. You will get credit for a long click if you’re part of the long click. If you ensure that the user doesn’t return to search results, even by sending them to another site, then you’re going to be rewarded.

Too often sites won’t link out. I regularly run into this as my clients navigate business development deals with partners. It’s frustrating. They think linking out is a sign of weakness and reduces their ability to consolidate PageRank.

While PageRank math might support not linking out, that strategy ultimately limits success.

Link Out!

Local Maxima Graph

Limiting your outlinks creates a local maxima problem. You’ll optimize only up to a certain ceiling based on constrained PageRank math. Again, not a real secret. Cyrus Shepard talked about this in a 2011 Whiteboard Friday video (though I wouldn’t stress too much about the anchor text myself).

Linking out can help you break through that local maxima by delivering more long clicks. Suddenly, your page is a sort of mini-hub. People search, get to your page and then go on to other relevant information.

Google wants to include results that contribute to reducing the time to long click for that query. 

I’m not advocating that you vomit up pages with a ton of links. What I’m recommending is that you link to other valuable sources of information when appropriate so that you fully satisfy that user’s query. In doing so you’ll generate more long clicks and earn more links over time, both of which can have profound and positive impact on your rankings.

Stop thinking about optimizing your page and think about optimizing the search experience instead. 

I ran into someone at SMX West who inherited a vast number of low quality sites. These sites used the old technique of being relevant enough to get someone to the page but not delivering enough value to answer their query. The desired result was a click on an ad. Simple arbitrage when you get down to it.

In a test, placing prominent links to relevant content on a sub-set of these pages had a material and positive impact on their ranking. It’s certainly not conclusive, but it showed the potential impact of being part of a multi-site long click search result.

As an aside, it’s not that those ad clicks were bad. Some of those probably resulted in long clicks. Just not enough of them. The majority either pogosticked to another result or wound up back at the search result after an ad click. And we already know this as search marketers by looking at the performance of search versus display campaigns.

Impact On Domain Diversity

If you believe time to long click is the way in which Google is measuring search success then you start to see some of the changes in a new light. I’ve been disappointed by the lack of domain diversity on many search results.

Yelp Dominating Search Results for Haircut in Concord CA

Sadly, this type of result hasn’t been that rare within the last year. Pete Myers has been doing amazing work on this topic.

For a while I just thought this was Google being stupid. But then it dawned on me. The lack of domain diversity may be reducing the time to long click. It might actually be improving the overall satisfaction metrics Google uses to optimize search!

In some ways this makes a bit of sense, if only from a straight up Paradox of Choice perspective. Selecting from 5 different domains rather than 10 might reduce cognitive strain. Too many choices overwhelm people, reducing both action and satisfaction. So perhaps Google’s just reflecting that in their results with both domain diversity (or lack thereof) and more instances of 7-result pages.

Downsides to Time To Long Click?

MC Escher Relativity Stairs

Are these long clicks truly a sign of satisfaction? The woman who had been cutting my hair for nearly 10 years retired. So I actually did need to find someone new. I hated the search result but did wind up clicking through and using Yelp to locate someone. So from Google’s perspective I was satisfied but in reality … not so much.

I wonder how long a time frame Google uses in assessing the value of long clicks. I abandoned my haircut search a number of times over the course of a month. In many of those instances I’m sure it looked like I was satisfied with the result. It looked like a long click. Yet, if you looked over a longer period of my search history it would become clear I wasn’t. I think this is a really difficult problem to solve. Is it satisfaction or abandonment?

The other danger here is that Google is training people to use another service. Now, I don’t particularly like Yelp but what this result tells me is that if I wanted to find something like this again I should just skip Google and go right to Yelp instead.

The same could be said of Google reflecting our own bias toward brands. While users may respond better to brands and the time to long click might be reduced, the long term implications could be that Google is training users to visit those brands directly. Why start my product search on Google when all they’re doing is giving me links to Amazon 90% of the time?

Of course, Google could argue that it will remain the hub for information requests because it continues to deliver value. (See what I did there?)


Google is using time to long click to measure the effectiveness of search results. Understanding this puts many search changes and initiatives into perspective and gives sites renewed reason to link out and think of search as a multi-site experience.

Tracking Image Search In Google Analytics

March 27 2013 // Analytics + SEO // 52 Comments

(This post has been updated as of 4/5/14 to reflect refinements to the filters as well as new caveats about Chrome.)

The Internet is becoming increasingly visual but the standard Google Analytics default lumps image search traffic in with organic traffic. The problem with that is these two types of traffic have radically different behaviors.

Google Analytics Y U No Track Image Search

So here’s a quick way for you to track image search in Google Analytics to gain insight into how images are performing for your business.

Image Search Referrers

After the last big image search update I was asked by Annie Cushing if I’d figured out a way to track images in Google Analytics. I’d meant to but hadn’t yet. Her reminder led me to find out what was possible. I fired up Firefox and used Live HTTP Headers to look at the referrers for image search traffic.

I found that there were two distinct referrers for Google, one from Google images and one from images that showed up via universal search results.

Here’s what the referrer looks like from Google image search.

Google Image Search Referrer

The parts to note here are the /url? and the source=images parameter. Now let’s look at what the referrer looks like from an image via universal search.

Google Image Referrer via Universal Search

The part to note here is that the URL doesn’t use /url? but /imgres? instead. This means you can track traffic from each source!

But there’s another wrinkle I discovered over time. Many of the international versions of Google use the old image search UX which also produces the /imgres? referrer. image search for ruby red slippers

In addition, most of these wind up being passed in the Google cookie with a medium of ‘referral’ and not ‘organic’. So you might be seeing Google domains cropping up in your referring reports (annoying!). Adding Full Referrer as a secondary dimension shows where the majority of these are coming from: imgres.

Google Referring Traffic in Google Analytics Reports

This means two things. First, we’re going to have to create a special case for universal search on the US Google domain so that it isn’t mixed up with image search from international properties. Second, we’re going to have to change the medium on the international image search traffic so that it is properly attributed to organic.

Finally, let’s take a look at Bing.

Bing Image Search Referrer

This is pretty straightforward and doesn’t change based on whether it’s from image search proper or via a universal result.

Google Analytics Image Search Filters

If you know the referrer patterns you can set up some Google Analytics filters to capture and reclassify this traffic into the appropriate buckets. Here’s the step-by-step way to do that.

From Google Analytics click Admin.

Google Analytics Admin

That takes you to a list of profiles.

Google Analytics Select or Create a Profile

Here you can either create a new profile or select a current one. I’d suggest creating a new profile to test this out before you decide to integrate it into your primary profile, because you might screw it up, may not like the detail or may not want the break in continuity. That said, I’ve created these filters so they’ll have the least amount of impact on your reporting while still delivering added insight.

Next you’ll reach the profile navigation pane where you’ll want to click on Filters.

Google Analytics Filters 2014

At that point you’ll want to go ahead and click the red New Filter button.

Google Analytics Red New Filter Button

That’s when the real fun begins and you construct a new advanced filter.

Creating a Google Analytics Google Image Search Filter

The first step is to name this filter. This won’t show up in your reports and is simply a way for you to know what that filter is doing. So make it descriptive and obvious.

Next you’ll want to select the Custom filter button (2) which then reveals a list of options. From that list you’ll want to select Advanced (3). This is where it gets a bit tricky.

In step 4 you’ll select Referral from the menu of options and then apply some RegEx to match the pattern we’ve identified. In this instance the RegEx I’m using is:


I love RegEx, which stands for Regular Expression, but I don’t always get it right the first time and regularly rely on this RegEx cheat sheet to remind and guide me. In this instance I’m looking for all Google domains (including any international domain using the new image search) with /url and source=images within the referrer.
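The RegEx itself isn’t reproduced above, so here’s a minimal Python sketch of the kind of pattern being described: any Google domain, a path that starts with /url and a source=images parameter somewhere in the query string. The pattern, the function name and the sample referrers are all assumptions made up for illustration, not the filter actually in use. Python is only there to test the pattern; the Google Analytics filter field would take the pattern alone.

```python
import re

# Hypothetical pattern (an assumption, not the original filter RegEx):
# any Google host such as google.com or google.co.uk, a path starting
# with /url, and a source=images parameter somewhere in the query.
GOOGLE_IMAGES = re.compile(
    r"^https?://(www\.)?google\.[a-z.]+/url\?.*\bsource=images\b",
    re.IGNORECASE,
)


def is_google_image_search(referrer: str) -> bool:
    """Return True when the referrer looks like Google image search."""
    return GOOGLE_IMAGES.search(referrer) is not None


# Made-up sample referrers, for illustration only.
samples = [
    "http://www.google.com/url?sa=i&rct=j&q=wifi+logo&source=images&cd=&docid=abc",
    "http://www.google.co.uk/url?sa=i&source=images&cd=",
    "http://www.google.com/imgres?imgurl=http://example.com/a.jpg",
    "http://www.google.com/url?sa=t&source=web&cd=1",
]

for referrer in samples:
    print(is_google_image_search(referrer), referrer)
```

Only the first two samples should match. The third uses /imgres instead of /url and the fourth is a regular web result, so both fall through to whatever filter comes next.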

In step 5 you’re selecting what you’re going to do when a referrer matches your RegEx. I’ve chosen Campaign Source from the menu and then created a new source called ‘google images’. You can name these whatever you like but I keep them lowercase to match the other sources.

You’ll note that the ‘Override Output Field’ is set to Yes which means that I’m going to change the Campaign Source for those that match this referrer pattern from what it is currently to ‘google images’. The great part about this is that you retain the fact that the medium is ‘organic’. So all those reports remain completely valid.

Finally, you click Save and then you wait for the filter to be applied to traffic coming into the site. Depending on the amount of traffic you get from these sources, it may take a few hours to a few days to see the filter working in your reports.

Next we have to put into place a filter for Google universal images, Google images from international properties not using the current image search UX as well as Bing images.

The RegEx for Google universal search images is:


Note that I’m only looking to match referrers coming from the US Google domain so that I’m not mixing international image search with US universal image search.

The RegEx for Google international search is crazy long and didn’t really work pasted here. So instead you can click here to copy and paste the Google ‘International’ image search filter RegEx.

Now, many of the domains won’t match because they’re using the new version of image search, which will match the first filter we created. But I figured I’d just be as inclusive as possible instead of validating the current image search UX on each domain. (I mean, it’s wicked time consuming too.)

Finally, the RegEx for Bing images is:


But we’re not done! Close, but not quite.

Changing Google Analytics Medium Filters

So after having these filters in place for a while I noticed that some of the new sources I created were showing up with a medium of ‘referral’ instead of ‘organic’. That means you’re still short-changing your organic efforts because Google is passing the wrong medium in their cookie.

So you have to create two new filters that change the medium of Google universal images and Google international images.

Google Analytics Filter to Change Medium

This is another Advanced filter. This one is much simpler but it must be very precise. In Field A you’re looking for the Campaign Source that exactly matches the source you created in the filter. For me, that means ‘google international images’ and ‘google universal images’. For you, it’s whatever you named the new sources.

Then you’re simply outputting and overriding the Campaign Medium to organic. Remember, you’ll create two of these. One for the ‘international’ images and one for ‘universal images’. My guess is that you might only need the one but I want to cover my bases.

To simplify, all you’re doing here is looking for the sources you created and then making sure that the medium associated with those sources is changed to organic.

Image Search Filter Order

The final step is to make sure that your filters are in the right order. The last two filters that change the medium based on a specific campaign source (that you created) must come at the end.

Google Analytics Google Image Search Filter Order

This makes sense right? You couldn’t match a source that you hadn’t already created, right? Stick to this order and you’ll ensure image search traffic is tracked appropriately.
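If it helps to see why the order matters, here’s a small Python sketch that mimics filters running one after another, each working on the output of the one before it. The patterns are simplified stand-ins (the real RegEx isn’t shown in this post), and the function names and the sample hit are hypothetical. It isn’t how Google Analytics is implemented, just the same logic in miniature.

```python
import re

# Simplified, hypothetical stand-ins for the referrer patterns.
SOURCE_FILTERS = [
    ("google images", re.compile(r"google\.[a-z.]+/url\?.*\bsource=images\b")),
    ("google universal images", re.compile(r"google\.com/imgres\?")),
    ("google international images", re.compile(r"google\.(?!com/)[a-z.]+/imgres\?")),
]

# Sources whose medium should be forced back to organic.
MEDIUM_FIXES = {"google universal images", "google international images"}


def rewrite_source(hit):
    """Source filters: rename the source when the referrer matches."""
    for name, pattern in SOURCE_FILTERS:
        if pattern.search(hit["referrer"]):
            return {**hit, "source": name}
    return hit


def fix_medium(hit):
    """Medium filters: only fire on sources that already exist."""
    if hit["source"] in MEDIUM_FIXES:
        return {**hit, "medium": "organic"}
    return hit


def apply_filters(hit, filters):
    """Run the filters in order, feeding each one the previous output."""
    for f in filters:
        hit = f(hit)
    return hit


# Made-up hit from an international property using the old image search UX.
hit = {
    "referrer": "http://www.google.es/imgres?imgurl=x",
    "source": "google.es",
    "medium": "referral",
}

right_order = apply_filters(hit, [rewrite_source, fix_medium])
wrong_order = apply_filters(hit, [fix_medium, rewrite_source])

print(right_order["source"], "/", right_order["medium"])
print(wrong_order["source"], "/", wrong_order["medium"])
```

With the medium filter last, the hit ends up as ‘google international images’ with a medium of organic. Flip the order and the source still gets renamed, but the medium stays as referral because the medium filter ran before there was a matching source to find.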

Image Search Reports

So what do you get to see in the reports?

Image Filters Create Better Google Analytics Reports

This is data from a client site where I’ve had all the filters in place for a few days. The medium for all of these is still organic but I’ve now got new sources for google images, universal images and bing images. (Update on 4/5/14) I’ve been using these filters successfully for a year now.

What you should see right away is the very large difference in how this traffic performs. Image search traffic in this instance has 1.5 Pages/Visit and a 3:00 Avg. Visit Duration while the web based organic traffic has 6 Pages/Visit and a 6:00 Avg. Visit Duration.

Most importantly, the conversion rate on these two types of traffic is different as well. Segmenting your image search traffic can bring more clarity to your analysis and help you make the right decisions on what’s working, how to allocate resources and what to optimize.

Image Search Filter Validation

So how do I know this is really working? I drill down into one of these new sources and then select keyword as the secondary dimension. Did I forget to mention that the keyword data remains intact?

Google Analytics Universal Images Keyword Report

Yup, sure does! So the next step here is to see if there really is a universal result for these keywords.

Google Search Result for Badass Over Here Real Pic

Sure enough, I’m the second result in this universal search result. Now let’s see if the filter for normal image search is working.

Google Analytics Google Images Keyword Report

I’ll use ‘wifi logo’ as my target term and first go to make sure that I’m not showing up in universal search results.

Google Search Result for Wifi Logo

Nope, not showing up there. But am I showing up in Google image search?

Google Images Search Results for Wifi Logo

Sure enough I’m there just inside the top 100 results from what I can tell. So I’m pretty confident that the filter is catching things and bucketing them appropriately. I’ve also validated this with very robust client data but can’t share that level of detail publicly.

What Is

You might have noticed another source in the report above. What’s that you ask? I don’t know. But I don’t think it’s traditional image search traffic since the user behavior of that source doesn’t conform to the other three image based sources. It’s also a small source of traffic so while my OCD senses are tingling I’m currently ignoring the urge to figure out exactly what it represents.

Tell me if you figure it out.


You Raise a Valid Point Ice Cream

The big question is why I wouldn’t just use the Google Webmaster Tools queries report and filter by image, right? Well, first off, the integration into Google Analytics still isn’t where I’d like it to be, making any type of robust reporting near impossible.

In addition, I don’t like mixing image search traffic with web search traffic in my normal reports because they’re so different. It makes any analysis you do using that mixed data less precise and prone to unintentional error.

More problematic is the fact that the data between Google Webmaster Tools and Google Analytics doesn’t match up.

I started looking at specific keywords via my filters versus what was reported in Google Webmaster Tools. There were just too many times when Google Webmaster Tools reported material amounts of traffic that wasn’t showing up in my Google Analytics reports.

Google Webmaster Tools Clicks

Here you can see that the top term received 170 clicks in this time frame. Yet during the same time frame here’s what the Google Analytics filter based method reports.

Google Analytics Image Based Clicks

170 versus 24! Even if I factor in the (not provided) percentage (which runs about 35% for this client) and add that back in I only get close to 40 visits.
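Here’s that back-of-the-envelope math as a quick Python sketch. It assumes that ‘adding back’ (not provided) means treating the 24 visible visits as the 65% of visits that still pass a keyword, which is only one way to read the adjustment. The variable names are made up for illustration.

```python
gwt_clicks = 170           # clicks reported in Google Webmaster Tools
ga_visits = 24             # visits reported via the filters for the same term
not_provided_share = 0.35  # approximate (not provided) share for this client

# Assumption: the visible visits are the share of traffic that still
# passes a keyword, so scale them up to estimate the full total.
estimated_visits = ga_visits / (1 - not_provided_share)

print(round(estimated_visits, 1))             # estimated visits for the term
print(gwt_clicks - round(estimated_visits))   # clicks still unaccounted for
```

Either way you slice it the estimate lands under 40 visits, which leaves well over 100 clicks unexplained.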

But that’s when the lightbulb went off. Maybe Google Analytics is reporting Visits while Google Webmaster Tools is reporting Clicks?

While I can’t confirm this I’m guessing that Google Webmaster Tools is counting all clicks on a result. Many of those clicks are going directly to the image and not the page the image resides on. That’s important since direct clicks to the image (i.e. – .jpg files and the like) aren’t going to be tracked in Google Analytics as a visit. There is no Google Analytics code on these files. The delta between the two could be the number of users who clicked directly to the image.

In addition, this method doesn’t catch any of the mobile clicks and visits since no image search visits (and very few universal images) show up using this filter when looking at mobile traffic. I’m pretty sure that the referrers are just getting stripped and these wind up going into direct instead which is part of the iOS and Android 4+ search attribution issue. (If someone else has an explanation here or finds a different referrer for mobile image search please let me know.)

Finally, there’s something funky with Chrome. When I look at the distribution of traffic to each bucket Chrome is an outlier for Google images.

Image Filters Browser Distribution

That 3.7% is just way out of proportion. And it’s not related to the amount of (not provided) traffic since Firefox actually has a higher percentage of (not provided) traffic (72%) than Chrome (64%) in this instance. So I can only conclude that there’s some amount of data loss going on with Chrome. Maybe that also contributes to the discrepancy I see between Google Analytics and Google Webmaster Tools.

This got even worse as of January when Chrome stopped passing rich referrer information.

Image Search by Browser

I can only guess that this is part of Google’s security and privacy efforts. Sadly, it means you’re capturing a lot less detail about image search and your data will be less accurate because of it.

Despite all of these caveats I love having the additional detail on image traffic which has wildly different intent and user behavior. Some insight is better than none.


Apply a few simple Google Analytics filters to gain insight into how much traffic you’re getting through image search. This is increasingly important as the Internet becomes more visual and the user behavior of these visits differs in material ways from traditional search traffic.

Bing People Snippets

March 19 2013 // SEO // 10 Comments

This morning (thanks to a tip from Search Engine Roundtable) I began researching what looked like authorship snippets on Bing. While it’s only been an hour or so here’s what I’ve seen and what I think I’ve figured out.

People Snippets

The new faces (for the most part) showing up in Bing search results are not authorship snippets per se but are people snippets derived from entities. It’s about who the content is about rather than who created the content.

If you haven’t seen them already here’s what one looks like when you search for Lauren Cohan.

Bing People Snippets for Lauren Cohan

They look remarkably like the authorship snippets that Google has implemented but they’re most certainly different in their application.

Structured Data?

The first assumption here is that Bing might be using structured data to present these new snippets. Perhaps they’re using the person attribute in markup?

Structured Data Results for Lauren Cohan page

Not so much. There’s no structured data on this page and I’ve found plenty of others getting the people snippet that are devoid of mark-up. So if Bing isn’t using structured data, what are they using to match and identify people?

People Pages

Clearly they rely heavily on sources such as Wikipedia, LinkedIn and Freebase. But they seem to be expanding their data sources on people to other sites and specific pages.

Simon Le Bon People Snippets on Bing

Searching for Simon Le Bon you’ll find that a people snippet appears for Wikipedia, IMDb and Biography. Wikipedia is a no-brainer and IMDb makes a good deal of sense too. Biography was the surprising one.

I noted that both IMDb and Biography had namespaces or folders (underlined in red) that seemed to be easy identifiers for entities. So I decided to look for more sources and I found them. Lots of them.


CrunchBase People Snippet for Jason Calacanis


MySpace People Snippet for Pete Myers

People Snippet for Pete Myers


Quora People Snippet for Jessica Guynn


TED People Snippet for Seth Godin


ESPN People Snippet for Claude Giroux

The Canadian Encyclopedia

Canadian Encyclopedia People Snippet for Douglas Coupland


Amazon People Snippet for Tony Basil


MTV People Snippet for Paula Abdul

People Snippet for Kim Carnes


Forbes People Snippet for Mark Cuban


NNDB People Snippet for Alan Greenspan


Facebook People Snippet for Matthew Inman


Twitter People Snippet for Neil deGrasse Tyson

Yahoo! Movies

Yahoo Movies People Snippet for Will Ferrell

People Snippet for Will Ferrell


AskMen People Snippet for Will Smith


FriendFeed People Snippet for Louis Gray

TV Guide

TV Guide People Snippet for Andrew Lincoln

Comedy Central

Comedy Central People Snippet for Daniel Tosh

Most of these either have a namespace that makes it easy to identify as a person or are clear profiles in the case of MySpace and FriendFeed. Whether it’s ‘player’, ‘artist’, ‘celebrities’, ‘person’, ‘profiles’ or ‘speakers’ it seems like Bing has determined pages that match these specific entities.
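To make the namespace idea concrete, here’s a toy Python sketch that flags a URL as a likely person page when one of its folders matches the names listed above. This is purely an illustration of the pattern and an assumption about how such matching could work. There’s no evidence Bing does it this way, and the example.com URLs and the function name are made up.

```python
from urllib.parse import urlparse

# Folder names pulled from the pages that earned a people snippet.
PERSON_NAMESPACES = {
    "player", "artist", "celebrities", "person", "profiles", "speakers",
}


def looks_like_person_page(url: str) -> bool:
    """Return True when a folder in the URL path is a person namespace."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    # Only folders count. The last segment is assumed to be the person's slug.
    return any(s in PERSON_NAMESPACES for s in segments[:-1])


# Made-up URLs, for illustration only.
print(looks_like_person_page("http://example.com/speakers/seth_godin"))
print(looks_like_person_page("http://example.com/nhl/player/_/id/3609"))
print(looks_like_person_page("http://example.com/blog/some-post"))
```

The first two come back True and the blog post comes back False. A real system would obviously need far more than a folder name, but it shows how easy these pages are to pick out.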

About Pages

People snippets show up far more often on about pages which supports the idea that Bing is looking for high confidence entity pages and not assigning real authorship.

Blind Five Year Old People Snippet for AJ Kohn

As you can see I get a people snippet on my about page but not on my site as a whole. Nor do I get it returned on any of my content. Here’s another example.

0at People Snippet for Matthew Inman

Again, the about page on Matthew Inman’s now defunct site is given a people snippet while the site as a whole isn’t. The people snippet is showing pages about that entity, not authored by that entity. It just so happens that there’s some overlap in those areas.

Sorta Structured Data

Many of these pages have a rich amount of data on them. While they aren’t marked-up with any structured data per se, search engines can clearly parse and use that information. Here’s a people snippet via Green Day Authority.

Green Day Authority People Snippet for Billie Joe Armstrong

That page has no structured data mark-up but it has structure.

Green Day Authority Page


Further pushing on the choice of pages to use to apply the people snippet I began to search for characters. First Harry Potter and then Derek Zoolander.

Derek Zoolander Bing Results

No people snippets are applied even though it’s still pulling from IMDb. The difference here is that it’s plucking out a title page and a character page instead. Maybe that’s not how it works but that’s how my pattern matching mind sees it right now.

[Update 3/21/13]

ChaosSEO noted that he could get character names to render people snippets. Sure enough, you can.

People Snippet for Olivia Dunham on Bing

And …

People Snippets for Jean-Luc Picard on Bing

I tend to think that there’s some special casing going on with IMDb so that it only applies the snippet to the name pages, but if you get a snippet to render for an IMDb character page please let me know.

Going through characters was actually really instructive. First I began to see that there were associations between the entities of person and character.

People Snippets for Hermione Granger on Bing

A search for Hermione Granger produces people snippets and a result for Emma Watson. Clearly there’s some understanding that the two are related. You can get that same dynamic for a number of character searches such as Gandalf or Chewbacca.

Gandalf People Snippet on Bing

Chewbacca Result on Bing

Finally, I found a result that makes me very confident that this is not authorship at all but entity detection.

Han Solo People Snippets on Bing

Clearly Harrison Ford (or Han Solo) is not the author of these pieces but the subject of them.

Decentralized Images

There are few instances where a site will get a people snippet. This seems to be rare and only occurs when Bing has high confidence that they have the right person.

Bing Seth Godin Results - Different Faces

Here we can see that a people snippet is applied to Seth’s site but that the image is pulled from that site and not from some central database. This provides some variety in what is displayed but also leads to some errors from time to time.

Bing Result for Jason Calacanis

So Bing seems confident that they have the right person associated with that site but the image they pulled is not Jason. It’s Wesley Chan.


Is this a form of Authorship? Sorta, kinda, not really. Sometimes you’ll see what looks like a people snippet pop up on a content page.

Tim Gunn People Snippet on HuffPo Article

Tim is the author of that piece so there’s a chance that they’ve identified and are trying to present Authorship based on that fact. But it’s more likely they just identified him as an entity. Because the images are decentralized it pulls what it can from that article.

Mathew Ingram People Snippet

Same thing happens with this piece by Mathew Ingram. In both cases there is structured data on that page that would indicate that each is the author of that piece (though neither has Google Authorship working).

So it’s not true authorship and very few pieces of content have a people snippet right now. But if Bing decides to follow this path and make the connections with all of their datasets (including their social sidebar results) then you could see Bing becoming a legitimate authorship platform.

Right now it seems like the snippets on content results are more like a side effect of people identification.


Bing has introduced people snippets that look like Google’s Authorship snippets but are more focused on identifying people as entities through a variety of sources rather than assigning authorship of content. For now.

Build Your Authority Not Your Author Rank

March 18 2013 // SEO // 49 Comments

It’s been a frustrating few weeks of discussion about Authorship and Author Rank.

Are We There Yet?

Here I will present a few things that may give you some more context overall and, in particular, my point of view on things.

Social Computing Research

Just the other day Google revealed that it gave $1.2 million in awards to those undertaking social computing research.

We know that interactions on the Web are diverse and people-centered. Google now enables social interactions to occur across many of our products, from Google+ to Search to YouTube. To understand the future of this socially connected web, we need to investigate fundamental patterns, design principles, and laws that shape and govern these social interactions.

We envision research at the intersection of disciplines including Computer Science, Human-Computer Interaction (HCI), Social Science, Social Psychology, Machine Learning, Big Data Analytics, Statistics and Economics. These fields are central to the study of how social interactions work, particularly driven by new sources of data, for example, open data sets from Web2.0 and social media sites, government databases, crowdsourcing, new survey techniques, and crisis management data collections. New techniques from network science and computational modeling, social network and sentiment analysis, application of statistical and machine learning, as well as theories from evolutionary theory, physics, and information theory, are actively being used in social interaction research.

We’re pleased to announce that Google has awarded over $1.2 million dollars to support the Social Interactions Research Awards, which are given to university research groups doing work in social computing and interactions. Research topics range from crowdsourcing, social annotations, a social media behavioral study, social learning, conversation curation, and scientific studies of how to start online communities.

What this says to me is that Google is intensely interested in understanding how to use social interaction data. But they’re not there yet. And why should they be? They’ve been working on link based signals and refinement for over 10 years but haven’t delved into social data until the last few.

This is a discipline that they are far from fully understanding. I can’t help but pick out words like ‘investigate’, ‘envision’ and ‘new’. This is a post about the exploration of the effects of social interaction on a host of fields. These are not papers as to their conclusions.

But we do have a few of those papers, areas where Google has begun to learn about how social interactions or signals might impact search. Let’s take social annotations as an example.

Social Annotations in Web Search

Social Annotations and Snippet Length Chart

This research was presented at the 2012 Conference on Human Factors in Computing Systems.

Remember when our SERPs had a whole bunch of smaller faces in them and other various social gestures? Well, Google found that those didn’t work. We hardly noticed them and when we did we didn’t always believe they added value.

In fact, the only thing that really did was the Authorship snippet. It’s a very interesting read if you’re interested in design and authority. The way we see results today is clearly influenced by this research and you can see Google learning more about how social connections and expertise work within search.

This study revealed a counter-intuitive result. Despite having the names and faces of familiar people, and despite being intended to be noticeable to searchers, subjects for the most part did not pay attention to the social annotations.

Our questions about contact closeness, expertise, and topic were answered by the reactions captured during the retrospective interviews. These interviews revealed the importance of contact expertise and closeness, and the importance of the search topics in determining whether social signals are useful, thus echoing past findings on the role of expertise in social search.

I walk away thinking that all of this is much tougher than we believe peering in from the outside. That and Google is at the start of this research, not the end.

Knowing that they aspire to understand these dynamics also makes the closure of Google Reader odd since there is a substantial amount of data that could be mined there, all tied back to identity and, by extension, topical expertise.

Whisper Down The Lane

I had a chance to speak on a panel at SMX West with Mike Arnesen and Lisa Weinberger about Authorship, Author Rank and Authority.

Overall, authorship and the potential for Author Rank was a hot topic that spilled out into multiple other sessions. Both Matt Cutts and Duane Forrester were asked about link based signals versus social signals. You could tell they are both tired of this question. Paraphrasing, they essentially said that while social signals are intriguing they’re not nearly as far along as we in the industry might believe (or want).

When prodded about the collapse of the link graph they noted that the link graph was just fine thank you very much. Link manipulation, the intent behind linking that we feel is so perverted, is not nearly as rampant as we assume. The mainstream blogger or site owner is linking for the right reasons. In short, the link graph is still valuable and with lower friction to producing digital content it may actually improve as more laypeople become content producers.

That’s not to say that social signals aren’t important but it will be a complement to or a refinement of the link graph, not a replacement. This is something I discussed in my original Author Rank post.

If we believe that search engines still view the link graph as viable there may be ways to simply use Authorship to make the link graph more accurate. Think of Authorship as meta information passed on every link. When looking for information on cancer the link given to an article from an established oncologist at a world renowned hospital would likely confer more value than a link to an article from ‘screwcancer888’ at a Q&A site.

In some ways this reminds me of delegating authority which Bill Slawski (always insightful) wrote about back in late 2010. What we’re really talking about is identifying expertise and allowing those experts to help curate our view of those topics where it matters – in search results.

Parsing Statements

So You're Telling Me There's A Chance?

It’s enticing to pick apart responses and statements by Googlers when they are asked to comment on Author Rank. The fact is that they’re not going to divulge much or commit one way or the other (at least publicly). They’ve been burned before by saying something that is true but interpreted in different ways.

So when asked, of course they’re going to reply that it’s something they’re experimenting with (because they do aspire to use the data) but that it is currently not a direct ranking signal and nothing to worry about now.

Of course that leads everyone to look for the experiments, to look for indirect ranking signals and to take the ‘now’ as a declaration of sorts for future implementation.

Authorship could be an indirect signal if you believe (like I do) that the click through rate (CTR) on a result can provide a positive feedback signal. And we know the CTR on authored results disrupts the normal click distribution of a SERP. Of course Google could take into account the Authorship snippet and normalize the CTR impact. So perhaps it isn’t having that indirect impact. See how confusing it can get?
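A rough sketch of what "normalizing the CTR impact" could look like is below. Every number and name here is an assumption made up for illustration (the expected CTR per position, the snippet lift factor and the function itself). The idea is simply to compare observed CTR against what you would expect for that position, raising the expectation when an Authorship snippet is shown.

```python
# Hypothetical baseline click-through rates by ranking position (made-up values).
EXPECTED_CTR_BY_POSITION = {1: 0.30, 2: 0.15, 3: 0.10}

# Hypothetical multiplier for the extra clicks an Authorship snippet attracts.
AUTHORSHIP_SNIPPET_LIFT = 1.5


def normalized_ctr_signal(position, clicks, impressions, has_author_snippet):
    """Return observed CTR divided by expected CTR for this kind of result.

    A value near 1.0 means the result performs as expected; above 1.0 means
    it attracts more clicks than its position (and snippet) would explain.
    """
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    observed = clicks / impressions
    expected = EXPECTED_CTR_BY_POSITION[position]
    if has_author_snippet:
        expected *= AUTHORSHIP_SNIPPET_LIFT
    return observed / expected


raw_view = normalized_ctr_signal(3, 150, 1000, has_author_snippet=False)
normalized_view = normalized_ctr_signal(3, 150, 1000, has_author_snippet=True)
```

With these made-up numbers the same result looks like a strong over-performer when the snippet is ignored and merely average once the snippet lift is factored in, which is exactly why the indirect impact is so hard to pin down from the outside.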

Just for fun, let’s think about what would transpire if a Googler simply said there is no such thing as Author Rank without any hedging or caveats. People would start to conflate that with Authorship, potentially reducing the adoption rate. Many would interpret it to mean that Google had abandoned author based weighting completely. Thus, when Google did figure it out and apply it the industry would point to the statement and shout ‘liar’ at the top of their lungs.

We’ve trained Google to provide us with these elliptical statements. I choose to view them through this lens.

What To Look For?

That’s not to say that we shouldn’t be interested in the topic. I like the testing Terry Simmonds is doing on the mechanics of Authorship because it documents how Google is trying to extend the mark-up to more of the content on the web. And that’s a constraint as far as I can tell right now. Conversations about the inability to roll out updates because of low adoption are not uncommon.

You can’t begin to rank results based on topical expertise if many of the experts aren’t included in the selection criteria. The participation rate in Authorship has to be such that using it would provide a materially better ranking of content. Reports have Authorship coverage as low as 9% and as high as 17%. That’s not a lot really and both studies are limited based on the relatively small data sets analyzed.

The problem? If you were to want information on astrophysics you’d probably want to include Neil deGrasse Tyson in those results. Yet, he’s not on Google+ (as far as I can tell) and isn’t part of the Authorship program.

Looking at how Google is trying to assign Authorship is important.

The mechanics and the indirect Authorship Google often grants are particularly intriguing. I noted that Jonathon Colman was receiving a bounce back Authorship link on a SlideShare URL for which no direct Authorship mark-up was present.

Indirect Authorship

I recall seeing this in the past on URLs from Quora, FriendFeed and Flickr. I swear some of these used to show up in Author Stats but I haven’t seen them lately (except for FriendFeed which I see at the tail end of my list.)

In fact, the bug that took Author Stats down might have been the exposure of indirect Authorship based on high confidence in matching public social graph data to Google+ profiles. Rapleaf got the brunt of the ire for crawling the public social graph but Google clearly has used and continues to use this information even though the social circles feature has been retired.

Looking today I see another interesting URL showing up in Author Stats – Twitter.

Twitter Discussion Gets Authorship

There’s quite a lot of evidence that Twitter is a fairly well trusted source of indirect Authorship, but that’s a post for another day. However, we can also look at the verbiage in the Structured Data Testing Tool, which has changed within the last few weeks.

Authorship rel=author Structured Data Testing Tool Results

The points of interest here are the ‘(direct or indirect)’ verbiage as well as the fact that the tool only checks the first rel=author link listed on a webpage. The former certainly makes me believe that assigning Authorship based on indirect links is important to Google.

The latter tells me two things. First that the tool should not be trusted as the final arbiter of whether the correct Authorship is or will be applied. Second that Google obviously sees multiple authors or entities (or agents) on the page.

Let’s go a step further. Google’s new Social Sign-In can be construed as a portable digital signature which might allow Google to rely on comments and other content produced outside of Google+. So it will be interesting to track how this is rolled out and whether the reviews that now flow under your profile are also granted Authorship.

I’ve been eager to see Author Rank implemented since I first saw Matt Cutts interview Steven Levy.

This actually predates Authorship and the follow-up question by Matt (along with a bit of body language) makes it clear that Google was thinking about this seriously. While I absolutely do look for connections and patterns that might paint a picture of the future I’m not looking for it behind every corner and trying to fit Author Rank into each and every odd result or anecdote.


You Will Respect My Authority!

I prefer to talk about how people might build authority rather than how they would build Author Rank. Just as links are the result and not the goal, Author Rank will be the result and not the goal of your efforts.

Discussions about what makes someone an authority and how Google might want to translate that into math are fascinating. What makes someone authoritative versus popular? Is there a difference? If so, how would you go about separating the two?

How do you map the decline of authority? Of someone who is no longer really an expert and just mailing it in? Can you identify this even if they remain popular? How can you tell if someone is endorsing content based on merit or friendship? Is it what you know or who you know?

Furthermore, you could find that one was popular for the wrong reasons. Would you want to rank someone highly who simply fanned the flames of dissent and created controversy? The tone and type of interaction will be important so sentiment analysis and other processes will need to determine how to use social interaction as a reliable signal.


We're Dealing With A Badass Over Here

And how does influence fit into this equation? One can be influential without being popular, but clearly being popular gives you a better chance of being influential just by sheer reach. Can you be influential without being an authority? I think so. Just look at Jenny McCarthy and her influence within the anti-vaccine movement.

The latter clearly strays into the subjective nature of quality, relevance and authority that I touched on after the Panda update. Personalization helps to ensure that your subjective view of authority is reflected back to you. That’s why search results are changed based on who you follow on Google+. And personalization of search results is the most important thing about Google+ in my view.

But in discussing how Google might identify authority and expertise, we’re dealing with the aggregate. So the question isn’t really about your personal view (which is reflected back in Search+ results) but how the aggregate views different figures and authorities.

Of course, being likable is part of the way you can obtain authority. And it is often not what you say, but how you say it (or present it) that gets you noticed. So part of building authority is in ensuring that you can communicate in a way that conveys that expertise but also makes it accessible and … memorable.

Yes, I see all of this as being related because the same content presented in Comic Sans without any images or paragraph breaks wouldn’t have nearly the same impact and would not, ultimately, convey authority. Even though the actual words are the same!

I had a similar conversation with Dan Shure where he wondered about the impact of publishing content from Rand Fishkin under somebody else’s name. Would it get as much ‘play’ and be received as well? I doubt it. So what does that say about the connection of authority, popularity and quality assessment?

These are just a few of the things that make this topic so incredible.


I believe Google wants to use Author Rank but I also believe that it’s far more difficult than we think. Focusing solely on Author Rank may blind us to tracking Google’s progress and building what is truly important. Authority.