Crawl optimization should be a priority for any large site looking to improve its SEO. By tracking, monitoring, and focusing Googlebot, you can gain an advantage over your competition.
It's important to cover the basics before discussing crawl optimization. Crawl budget is the time or number of pages Google allocates to crawl a site. How does Google determine your crawl budget? The best description comes from an Eric Enge interview of Matt Cutts.
The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we'll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we'll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline.
Another way to think about it is that the low PageRank pages on your site are competing against a much larger pool of pages with the same or higher PageRank. There are a large number of pages on the web that have very little or close to zero PageRank. The pages that get linked to a lot tend to get discovered and crawled quite quickly. The lower PageRank pages are likely to be crawled not quite as often.
In other words, your crawl budget is determined by authority. This should not come as a shock. But that was pre-Caffeine. Have things changed since?
What is Caffeine? In this case it's not the stimulant in your latte, though it is a stimulant of sorts. In June of 2010, Google rebuilt the way it indexed content. They called this change 'Caffeine' and it had a profound impact on the speed with which Google could crawl and index pages. The biggest change, as I see it, was incremental indexing.
Our old index had several layers, some of which were refreshed at a faster rate than others; the main layer would update every couple of weeks. To refresh a layer of the old index, we would analyze the entire web, which meant there was a significant delay between when we found a page and made it available to you.
With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index. That means you can find fresher information than ever before—no matter when or where it was published.
Essentially, Caffeine removed the bottleneck for getting pages indexed. The system they built to do this is aptly named Percolator.
We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.
The speed with which Google can crawl is now matched by the speed of indexing. So did crawl budgets increase as a result? Some did, but not as much as you might suspect. And here's where it gets interesting.
Googlebot seems willing to crawl more pages post-Caffeine but it's often crawling the same pages (the important pages) with greater frequency. This makes a bit of sense if you think about Matt's statement along with the average age of documents benchmark. Pages deemed to have more authority are given crawl priority.
Google is looking to ensure the most important pages remain the 'freshest' in the index.
Time Since Last Crawl
What I've observed over the last few years is that pages that haven't been crawled recently are given less authority in the index. To be more blunt, if a page hasn't been crawled recently, it won't rank well.
Last year I got a call from a client about a downward trend in their traffic. Using advanced segments it was easy to see that there was something wrong with their product page traffic.
Looking around the site I found that, unbeknownst to me, they'd implemented pagination on their category results pages. Instead of all the products being on one page, they were spread out across a number of paginated pages.
Products that were on the first page of results seemed to be doing fine, but those on subsequent pages were not. I started looking at the cache date on product pages (using cache date as a proxy for crawl date) and found that those that hadn't been crawled in the last 7 days were suffering.
Undo! Undo! Undo!
That's right, I told them to go back to unpaginated results. What happened?
You guessed it. Traffic returned.
Since then I've had success with depagination. The trick here is to think about it in terms of progressive enhancement and 'mobile' user experiences.
The rise of smartphones and tablets has made click-based pagination a bit of an anachronism. Revealing more results by scrolling (or swiping) is an established convention and might well become the dominant one in the near future.
Can you load all the results in the background and reveal them only when users scroll to them without crushing your load time? It's not always easy and sometimes there are tradeoffs but it's a discussion worth having with your team.
Because there's no better way to get those deep pages crawled than having links to all of them on that first page of results.
Was I crazy to think that the time since last crawl could be a factor in ranking? It turns out I wasn't alone. Adam Audette (a smart guy) mentioned he'd seen something like this when I ran into him at SMX West. Then at SMX Advanced I wound up talking with Mitul Gandhi, who had been tracking this in more detail at seoClarity.
Mitul and his team were able to determine that content not crawled within ~14 days receives materially less traffic. Not only that, but getting those same pages crawled more frequently produced an increase in traffic. (Think about that for a minute.)
At first, Google clearly crawls using PageRank as a proxy. But over time it feels like they're assigning a self-referring CrawlRank to pages. Essentially, if a page hasn't been crawled within a certain time period then it receives less authority. Let's revisit Matt's description of crawl budget again.
Another way to think about it is that the low PageRank pages on your site are competing against a much larger pool of pages with the same or higher PageRank. There are a large number of pages on the web that have very little or close to zero PageRank.
The pages that aren't crawled as often are pages with little to no PageRank. CrawlRank is the difference in this very large pool of pages.
You win if you get your low PageRank pages crawled more frequently than the competition.
Now what CrawlRank is really saying is that document age is a material ranking factor for pages with little to no PageRank. I'm still not entirely convinced this is what is happening, but I'm seeing success using this philosophy.
One might argue that what we're really talking about is internal link structure and density. And I'd agree with you!
Not only should your internal link structure support the most important pages of your site, it should make it easy for Google to get to any page on your site in a minimum of clicks.
One of the easier ways to determine which pages are deemed most important (based on your internal link structure) is by looking at the Internal Links report in Google Webmaster Tools.
Do the pages at the top reflect the most important pages on your site? If not, you might have a problem.
I have a client whose blog was receiving 35% of Google's crawl each day. (More on how I know this later on.) This is a blog with 400 posts amid a total content corpus of 2 million+ URLs. Googlebot would crawl blog content 50,000+ times a day! This wasn't where we wanted Googlebot spending its time.
The problem? They had menu links to the blog and each blog category on nearly all pages of the site. When I went to the Internal Links report in Google Webmaster Tools you know which pages were at the top? Yup. The blog and the blog categories.
So, we got rid of those links. Not only did it change the internal link density but it changed the frequency with which Googlebot crawls the blog. That's crawl optimization in action.
Remember the advice to create a flat site architecture? Many ran out and got rid of subfolders, thinking that if the URL didn't have subfolders then the architecture was flat. Um ... not so much.
These folks destroyed the ability to do easy analysis, potentially threw away valuable data for assessing the site, and did nothing to address the underlying issue of getting Google to pages faster.
How many clicks from the home page is each piece of content? That's what was, and remains, important. It doesn't matter if the URL is domain.com/product-name if it takes Googlebot (and users) 8 clicks to get there.
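A quick way to audit click depth is a breadth-first search over your internal link graph. Here's a minimal Python sketch; the link graph below is hypothetical, and in practice you'd build it from a crawl of your own site.

```python
from collections import deque

def click_depths(links, root="/"):
    """Breadth-first search over an internal link graph.
    `links` maps each URL to the list of URLs it links to."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit = shortest click path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site: a product on page 2 of a category sits 3 clicks deep
links = {
    "/": ["/category"],
    "/category": ["/product-a", "/category?page=2"],
    "/category?page=2": ["/product-z"],
}
depths = click_depths(links)
```

Any URL missing from the result is unreachable from the home page entirely, which is exactly the kind of page that never gets crawled.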
Is that mega-menu on every single page really doing you any favors? Once you get someone to a leaf level page you want them to see similar leaf level pages. Related product or content links are the lifeblood of any good internal link structure and are, sadly, frequently overlooked.
Depagination is one way to flatten your architecture but a simple HTML sitemap, or specific A-Z sitemaps can often be very effective hacks.
Flat architecture shortens the distance between authoritative pages and all other pages, which increases the chances of low PageRank pages getting crawled on a frequent basis.
"A million dollars isn’t cool. You know what’s cool? A billion dollars."
Okay, Sean Parker probably didn't say that in real life but it's an apt analogy for the difference in knowing how many pages Googlebot crawled versus where Googlebot is crawling, how often and with what result.
The Crawl Stats graph in Google Webmaster Tools only shows you how many pages are crawled per day.
For nearly five years I've worked with clients to build their own Googlebot crawl reports.
And it doesn't always have to look pretty to be cool.
Here I can tell there's a problem with this specific page type. More than 50% of the crawl on that page type is producing a 410. That's probably not a good use of crawl budget.
All of this is done by parsing or 'grepping' log files (a line-by-line history of visits to the site) looking for Googlebot. Here's a secret: it's not that hard, particularly if you're even half-way decent with Regular Expressions.
In the end I'm interested in looking at the crawl by page type and response code.
You determine page type using RegEx. That sounds mysterious but all you're doing is bucketing page types based on pattern matching.
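As a sketch of what that bucketing can look like, here's a minimal Python version that filters Googlebot hits out of a combined-format access log and tallies them by page type and response code. The page-type patterns and sample log lines are hypothetical; adjust both to your own URLs and log format.

```python
import re
from collections import Counter

# Hypothetical page-type patterns; bucket by matching the URL path
PAGE_TYPES = [
    ("product", re.compile(r"^/product/")),
    ("category", re.compile(r"^/category/")),
    ("blog", re.compile(r"^/blog/")),
]

# Pull the request path and status code out of a combined-format log line
LOG_LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def bucket(line):
    """Return (page_type, status) for a Googlebot hit, else None."""
    if "Googlebot" not in line:
        return None
    m = LOG_LINE.search(line)
    if not m:
        return None
    for name, pattern in PAGE_TYPES:
        if pattern.match(m.group("path")):
            return (name, m.group("status"))
    return ("other", m.group("status"))

def crawl_report(lines):
    return Counter(b for b in map(bucket, lines) if b)

# Hypothetical log lines: two Googlebot hits, one regular visitor
sample = [
    '66.249.66.1 - - [01/Jun/2013:00:00:01 +0000] "GET /product/widget HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [01/Jun/2013:00:00:02 +0000] "GET /blog/old-post HTTP/1.1" 410 0 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [01/Jun/2013:00:00:03 +0000] "GET /product/widget HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
report = crawl_report(sample)
```

One caveat: matching the user-agent string is the usual first pass, but verifying that a hit really came from Googlebot takes a reverse DNS lookup, since anyone can fake the user agent.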
I want to know where Googlebot is spending time on my site. As Mike King said, Googlebot is always your last persona. So tracking Googlebot is just another form of user experience monitoring. (Referencing it like this might help you get this project prioritized.)
You can also drop the crawl data into a database so you can query things like time since last crawl, total crawl versus unique crawl or crawls per page. Of course you could also give seoClarity a try since they've got a lot of this stuff right out of the box.
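As a minimal sketch of that database approach, using Python's built-in sqlite3 with a few made-up rows (in practice you'd load the parsed log data), queries like time since last crawl and total versus unique crawl are straightforward:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE crawls
                (url TEXT, page_type TEXT, status INTEGER, crawled_on TEXT)""")
# Hypothetical rows parsed out of log files
conn.executemany("INSERT INTO crawls VALUES (?, ?, ?, ?)", [
    ("/product/a", "product", 200, "2013-06-01"),
    ("/product/a", "product", 200, "2013-06-20"),
    ("/product/b", "product", 200, "2013-05-15"),
])

# Last crawl date per URL; sort ascending and stale URLs bubble to the top
last_crawl = dict(conn.execute(
    "SELECT url, MAX(crawled_on) FROM crawls GROUP BY url"))

# Total crawl versus unique crawl
total, unique = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT url) FROM crawls").fetchone()
```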
If you're not tracking Googlebot then you're missing out on the first part of the SEO process.
You Are What Googlebot Eats
What you begin to understand is that you're assessed based on what Googlebot crawls. So if they're crawling a whole bunch of parameter-based, duplicative URLs, or you've left the email-a-friend link open to be crawled on every single product, you're giving Googlebot a bunch of empty calories.
It's not that Google will penalize you; it's that dirty architecture carries an opportunity cost against a finite crawl budget.
The crawl spent on junk could have been spent crawling low PageRank pages instead. So managing your URL Parameters and using robots.txt wisely can make a big difference.
Many large sites will also have robust external link graphs. If you can leverage those external links and rely less on internal link density to rank well, you can focus your internal link structure on ensuring low PageRank pages get crawled more frequently.
There's no patently right or wrong answer. Every site will be different. But experimenting with your internal link strategies and measuring the results is what separates the great from the good.
Crawl Optimization Checklist
Here's a quick crawl optimization checklist to get you started.
Track and Monitor Googlebot
I don't care how you do it but you need this type of visibility to make any inroads into crawl optimization. Information is power. Learn to grep, perfect your RegEx. Be a collaborative partner with your technical team to turn this into an automated daily process.
Manage URL Parameters
Yes, it's confusing. You will probably make some mistakes. But that shouldn't stop you from using this feature and changing Googlebot's diet.
Use Robots.txt Wisely
Stop feeding Googlebot empty calories. Use robots.txt to keep Googlebot focused and remember to make use of pattern matching.
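Googlebot supports pattern matching in robots.txt: * matches any sequence of characters and $ anchors the match to the end of the URL. A hypothetical example that trims the empty calories (the paths are made up; swap in your own junk URLs):

```
User-agent: Googlebot
# Email-a-friend links don't need to be crawled on every product
Disallow: /email-a-friend/
# Block parameter-driven duplicates anywhere on the site
Disallow: /*?sort=
Disallow: /*&sort=
# $ anchors to the end of the URL: block print versions, not the pages
Disallow: /*.print$
```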
Don't Forget HTML Sitemap(s)
Seriously. I know human users might not be using these, but Googlebot is a different type of user with slightly different needs.
Optimize Your Internal Link Structure
Whether you try depagination to flatten your architecture, re-evaluate navigation menus, or play around with crosslink modules, find ways to optimize your internal link structure to get those low PageRank pages crawled more frequently.