New Insights into Googlebot
Google has found an intelligent way to arrange the results for a search query, but an interesting question is: where can we find that intelligence? A lot of people have researched the indexing process, and even more have tested the weight of individual ranking factors, but we wondered how smart Googlebot itself is. As a start, we took some common statements and widely used principles and tested how Googlebot handles them. Some results are questionable and would need to be tested on a few hundred domains to be certain, but they can give you some ideas.
Speed of the Crawler
The first thing we tested was Matt Cutts's statement: "... the number of pages that we crawl is roughly proportional to your PageRank".
This brings us to one of the challenges large content sites face: getting all of their pages indexed. Imagine Amazon.com were a new website; it would take a while for Google to crawl all 48 million pages, and if Matt Cutts's statement is true, it would be impossible without any incoming links.
To test this, we took a domain with no history (never registered, no backlinks) and made a page with 250 links on it. Those links point to pages that also contain 250 links (and so on…). The links and URLs were numbered from 1 to 250, in the same order as they appeared in the source code. We submitted the URL via "addurl" and waited. Because the domain has no incoming links, it has no, or at most a negligible, PageRank. If Matt Cutts's statement is correct, Googlebot would soon stop crawling.
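For illustration, here is a minimal sketch of how such an endless test site could be served. The port, path layout, and link labels are our own placeholders, not the exact setup used in the test; any request simply returns a page with 250 numbered links pointing one level deeper.

```javascript
// Sketch only: a tiny Node.js server where every page contains 250 numbered links,
// each pointing one level deeper, so the link graph is effectively endless.
const http = require("http");

const LINKS_PER_PAGE = 250;

const server = http.createServer((req, res) => {
  // Normalize the path so every page's links point one level deeper.
  const basePath = req.url.endsWith("/") ? req.url : req.url + "/";
  let body = "";
  for (let i = 1; i <= LINKS_PER_PAGE; i++) {
    // Links are numbered 1..250 in the same order they appear in the source code.
    body += `<a href="${basePath}${i}/">Link ${i}</a><br>\n`;
  }
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end(`<!DOCTYPE html><html><body>\n${body}</body></html>`);
});

server.listen(8080);
```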
As you can see in the graph, Googlebot started crawling the site at a rate of approximately 2,500 pages per hour. After three hours, it slowed down to approximately 25 pages per hour and maintained that rate for months. To verify this result, we ran the same test on two other domains. Both tests produced nearly the same results; the only difference was a lower peak at the beginning of Googlebot's visit.
Impact of Sitemaps
During the tests, the sitemap proved to be a very useful tool for influencing the crawl rate. We added a sitemap containing 50,000 uncrawled pages (indexation level 0). Googlebot placed the pages added via the sitemap at the top of the crawl queue, which means those pages got crawled before the F-levelled pages. What's really remarkable, though, is the extreme increase in crawl rate. At first, the number of visits had stabilized at 20-30 pages per hour. As soon as the sitemap was uploaded through Webmaster Central, the crawler accelerated to approximately 500 pages per hour, and within a few days it reached a peak of 2,224 pages per hour. Where the crawler at first visited 26.59 pages per hour on average, it grew to an average of 1,257.78 pages per hour, an increase of no less than 4,630.27%. The increase in crawl rate is not limited to the pages included in the sitemap; the other F- and 0-levelled pages benefit from it as well.
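As a reference, a minimal sketch of how a sitemap of that size could be generated before submitting it through Webmaster Central. The URL pattern and file name are our own placeholders; the only hard constraint is the sitemap protocol's limit of 50,000 URLs per file, which this just fits.

```javascript
// Sketch only: build a sitemap for 50,000 not-yet-crawled URLs.
// The sitemap protocol allows at most 50,000 URLs per file, so this just fits in one sitemap.
const fs = require("fs");

const urls = [];
for (let i = 1; i <= 50000; i++) {
  urls.push(`  <url><loc>https://example.com/page/${i}/</loc></url>`);
}

const xml =
  '<?xml version="1.0" encoding="UTF-8"?>\n' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
  urls.join("\n") +
  "\n</urlset>\n";

fs.writeFileSync("sitemap.xml", xml);
```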
It's quite remarkable that Google suddenly devotes more of its crawl capacity to the website. At the point where we submitted the sitemap, the crawl queue was filled with F-pages. Google probably attaches a lot of value to a submitted sitemap.
This brings us back to Matt Cutts's statement. After only 31 days, Googlebot had crawled about 375,000 pages of the website. If that is proportional to its PageRank (which is 0), it would mean Google crawls 140,625,000,000 pages of a PageRank 1 website in just 31 days (remember that PageRank is exponential, so one point up would mean roughly 375,000 times as many pages: 375,000 × 375,000 = 140,625,000,000). In other words, you would never have to worry about your PageRank, even if you owned the largest website on the web. So don't simply accept everything Matt says.
Number of Links
Rand Fishkin says: “…you really can go above Google’s recommended of 100 links per page, with a PageRank 7.5 you can think about 250-300 links” ( https://moz.rankious.com/_moz/blog/whiteboard-friday-flat-site-architecture )
The 100-links-per-page advice has always been a hot topic, especially for websites with a lot of pages. The reason the advice was originally given is that Google used to index only the first 100 kilobytes of a page. On a 100 KB page, 100 links seemed a reasonable maximum; if a page was much longer, there was a chance Google would truncate it and not index the entire page. These days, Google will index more than 1.5 MB, and user experience is the main reason Google keeps the "100 links" recommendation in its guidelines.
As described in the previous paragraph, Google does crawl 250 links per page, even on sites with no incoming links. But is there a limit? We used the same set-up as the websites described above, but with 5,000 links per page instead of 250. When Googlebot visited that website, something remarkable happened. Googlebot requested the following pages:
- http://example.com/1/
- http://example.com/10/
- http://example.com/100/
- http://example.com/1000/
On every level Google visits, we see the same page requests. It seems Googlebot doesn't know how to handle such a large number of links and falls back on a computer's way of solving it, sampling the numbered links at 1, 10, 100 and 1,000.
Semantic Intelligence
One of the SEO myths applied on almost every optimised website is placing links in heading tags. Recently it was mentioned again as one of the factors in the "Reasonable Surfer" patent. If Google respects semantics, it would definitely attach more value to those "heading" links. We had our doubts and put it to the test. We took a page with 250 links on it and marked some of them up with heading tags, a few levels deep. After a few weeks of waiting, nothing pointed in the direction that Googlebot preferred the "heading" links. This doesn't mean Googlebot doesn't use semantics in its algorithm; it just doesn't use headings to give some links more weight than others.
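To make the comparison concrete, here is a sketch of the two link variants being compared, generated as HTML strings; the URLs and anchor texts are our own placeholders, not the markup from the actual test pages.

```javascript
// Sketch only: a plain link versus the same link wrapped in a heading tag.
function plainLink(i) {
  return `<a href="/${i}/">Page ${i}</a>`;
}

function headingLink(i) {
  return `<h2><a href="/${i}/">Page ${i}</a></h2>`;
}

console.log(plainLink(1));   // <a href="/1/">Page 1</a>
console.log(headingLink(2)); // <h2><a href="/2/">Page 2</a></h2>
```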
Crawling JavaScript
Google says it keeps getting better at recognizing and executing JavaScript. Although JavaScript is not a good technique if you want to be sure Google follows your links, it's used quite a lot to achieve the opposite goal: when used for PageRank sculpting, the purpose of JavaScript links is to make those links visible only to users. If you use the technique for this purpose, it's good to keep yourself updated on what Google can and can't recognize and execute. To test Googlebot's JavaScript capabilities, we took the JavaScript snippets described in "The professional's guide to PageRank optimization" and put them to the test.
The only code Googlebot executed and followed during our test was the link in a simple "document.write" line. This doesn't exclude the possibility that Googlebot is capable of recognizing and executing the more advanced scripts; it may be that Google needs an extra trigger (such as incoming links) to put more effort into crawling JavaScript.
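For reference, this is the kind of simple document.write link meant here; the URL and anchor text are placeholders of our own.

```javascript
// A simple document.write link: the only variant Googlebot executed and followed in this test.
// URL and anchor text are placeholders.
document.write('<a href="https://example.com/target-page/">Target page</a>');
```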
Crawling Breadcrumbs
Breadcrumbs are a typical on-page element created specifically for users, but sometimes they are used to support the site structure as well. Last month we ran into some cases where Googlebot was not able to crawl its way up the structure, so we did some tests.
We made a page a few levels deep, with some content and links to the higher levels on it ( http://example.com/lvl1/lvl2/lvl3/ ). We gave the page some incoming links and waited for Googlebot. Although the deep page itself was visited three times by the crawler, the higher pages didn't get a visit.
To verify this result, we ran the same test on another domain, this time with the test page a few levels deeper in the site structure (http://example.com/lvl1/lvl2/lvl3/lvl4/lvl5/). This time Googlebot did follow some of the links pointing to pages higher in the site structure. Even so, it doesn't seem to be a good method for supporting a site structure: after a few weeks, Google still hadn't crawled all the higher pages. It looks like Googlebot would rather crawl deeper into the site structure than crawl up to higher pages.
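For illustration, a sketch of the kind of "links up" a breadcrumb gives the crawler on such a deep page; the path handling and labels are our own simplification, not the markup from the test pages.

```javascript
// Sketch only: build breadcrumb links to every higher level of a deep URL.
function breadcrumbLinks(pathname) {
  // "/lvl1/lvl2/lvl3/" -> links to "/", "/lvl1/" and "/lvl1/lvl2/"
  const parts = pathname.split("/").filter(Boolean);
  const links = ['<a href="/">Home</a>'];
  let href = "";
  for (let i = 0; i < parts.length - 1; i++) {
    href += "/" + parts[i];
    links.push(`<a href="${href}/">${parts[i]}</a>`);
  }
  return links.join(" &raquo; ");
}

console.log(breadcrumbLinks("/lvl1/lvl2/lvl3/"));
// <a href="/">Home</a> &raquo; <a href="/lvl1/">lvl1</a> &raquo; <a href="/lvl1/lvl2/">lvl2</a>
```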
Takeaways
In short, the lesson learned is that you can influence the crawl rate with a sitemap. This doesn't mean you should always upload a sitemap for your websites; you only want to increase the crawl rate if the bulk of your crawled pages actually get indexed. It takes longer for the crawler to return to an "F"-levelled page than to an indexed page, so if most of your pages get crawled but are then dropped from the index, you might want to get more incoming links before using a sitemap. The best thing to do is to monitor, for every page, when Googlebot last visited it (a sketch of this follows below); with that data you can always identify problems in your site structure.
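A minimal sketch of such monitoring, assuming a combined-format access log at ./access.log (the file name and log format are assumptions; adapt the parsing to your own server logs):

```javascript
// Sketch only: record the most recent Googlebot visit per URL from an access log,
// so pages that never get (re)crawled stand out. Lines are assumed to be in
// chronological order, so the last match per URL wins. We match on the user-agent
// string only; for certainty you could additionally verify the IP via reverse DNS.
const fs = require("fs");

const lastVisit = new Map();
const lines = fs.readFileSync("access.log", "utf8").split("\n");

for (const line of lines) {
  if (!line.includes("Googlebot")) continue;
  // Combined log format: ... [10/Oct/2010:13:55:36 +0200] "GET /some/page/ HTTP/1.1" ...
  const match = line.match(/\[([^\]]+)\] "GET ([^ ]+) /);
  if (match) lastVisit.set(match[2], match[1]);
}

for (const [url, date] of lastVisit) {
  console.log(`${date}  ${url}`);
}
```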
The number of links per page isn't limited to 250 (even if you have no incoming links), although 5,000 seems to be too many. We haven't found the exact limit yet, but if we do, we'll give you an update.
Links in heading tags for crawling purposes seem to be a waste of time. You can still use them for usability, because you're used to them, or because WordPress does it anyway, and if you're lucky it's still a ranking factor.
Another conclusion is that Googlebot isn't very good at crawling breadcrumbs, so don't rely on them for site structure purposes; Google just doesn't crawl up as well as it crawls down. In contrast to breadcrumbs, you can use JavaScript for sculpting purposes: Googlebot isn't top of the bill when it comes to recognizing and executing JavaScript links. Keep yourself updated on this subject, but for now you can definitely use some "advanced" JavaScript to do sculpting.
A last result that came up while researching the crawl process is the influence of URL length: a short URL gets crawled earlier than a long one. So always weigh the need for indexation and crawling when you choose your URLs.