
Lessons Learned While Crawling the Web

Bill Slawski

This YouMoz entry was submitted by one of our community members. The author’s views are entirely their own (excluding an unlikely case of hypnosis) and may not reflect the views of Moz.

One of the most helpful approaches I've found to understanding a client's website is actually crawling it, and learning about the structure of its URLs and how they connect to one another. I was recently invited to beta test a cloud-based crawler that can handle extremely large websites. I accepted the offer because the tool looked like it might be useful in campaigns where the crawling tools I presently use aren't as effective.

I also thought it might be worth sharing some of my past crawling experiences, in the hope that they might help others. Different tools have different value depending on the purpose of your crawl and the site you're crawling. I'd like to hear which crawling tools you use, and why you use them.

Learning to Crawl with Xenu

Back in January 2005, I'd just been assigned a client with what looked like a huge website (Google estimated 90,000+ URLs). I visited the home page and looked at the main navigation, and it really didn't provide many clues about how the site was organized overall. The client was a Fortune Global 100 company, and they'd been working for a while with the agency that I'd just joined. The business model we followed was one where we optimized specific pages on a site for selected keywords, and provided the client with quarterly recommendations and reports.

I decided that instead of limiting myself to the pages we were optimizing for specific keywords, it made sense to learn as much as possible about the site, so I crawled it with Xenu Link Sleuth. About an hour into the crawl (I was crawling the site pretty slowly), I started noticing some strange URLs with additional parameters at the end of their paths, like these: ?1=open, ?1=open&2=open, ?2=open&1=closed, and so on.

I stopped Xenu and went to the page that was producing these extra parameters, and noticed little triangular widgets on it. Clicking one expanded the content in that section, and I noticed that the click also caused the URL to change. When I clicked on the first triangle, ?1=open appeared after the URL in my browser address bar. If I then clicked on the second triangle, ?1=open&2=open was at the end of the URL. If I clicked on the 21st triangle on the page, I saw ?1=open&2=open&21=open at the end of the URL. Clicking a triangle a second time was matched by a "closed" in the URL, and clicking out of order meant that the numbers in the URL reflected the order I clicked in. So, if I then clicked on the second triangle a second time, I saw this in the URL: ?1=open&21=open&2=closed.
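To see why widgets like that can act as a spider trap, here's a rough back-of-the-envelope sketch (the widget counts are hypothetical, not the client's actual page) of how quickly those open/closed parameters multiply:

```python
# Each expandable section shows up in the query string as "open" or "closed",
# so a page with n widgets can surface on the order of 2**n distinct URLs --
# before you even account for the extra variants created by click order.
for n in (5, 10, 21):
    print(f"{n} widgets -> at least {2 ** n:,} crawlable URL variants")
```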

I disallowed that particular base URL (without the additional parameters) from being crawled in Xenu, and started the crawl all over again. After about 15 hours of crawling, I had found 28 pages on the site with these triangular widgets that caused the extra parameters. I looked for some JavaScript that would let content be expanded and collapsed like that while still allowing the collapsed content to be crawled, and it took about three minutes to find on Google.

I looked around at some additional options, and then sent an email to the client's developers explaining the problem and asking them to use that JavaScript instead of the code they were using. They changed it almost immediately.

During my crawl, I discovered that there were only about 3,600 pages on the site in total, instead of the 90,000 that Google was estimating. Around three weeks later, Google was estimating around 3,600 pages for the site, too. I also had a much better idea of how the site was organized.

What I really love Xenu for is its ability to create a report that makes it easy to address broken links and internal redirects on a site, and to change links on its pages so that links to broken pages and unnecessary internal redirects are (sometimes significantly) reduced. I still use it today for that purpose.

Crawling and Creating Content Inventories with Screaming Frog

Fast forward to a couple of years ago. I had started creating content inventories of clients' sites, which included things like URL, page title, meta description, main heading, and so on. I found these really helpful for understanding which keywords and concepts were covered by which pages, and how the pages worked together as a whole. This ability to focus on the bigger picture came at a cost, though: copying and saving information like that manually could take a lot of time. The upside of that obstacle was that it forced a lot of focus on the pages of the site itself, but maybe a little too much focus. There had to be a better way.

Then I saw a reference online to a program called Screaming Frog, which creates a crawl file that can be exported to Excel and provides most of those things and more. As fast as I could, I navigated over to their site and read up on the program. I downloaded it, ran the free trial, and was very satisfied with the results. It's been indispensable ever since, and I usually start exploring a site by running a crawl on it with Screaming Frog. After a crawl, I usually save a comma-separated values (CSV) file and then import that into Excel, where I reorganize the crawl data based on what I find.

Usually, I'll sort by content type first and create separate worksheets for images, CSS, JavaScript, and other types of content. I then sort by status code and create worksheets for URLs that return 301, 302, 404, and 500 status codes. Sorting by the meta robots data lets me find all of the pages with a "noindex" robots meta tag, and I'll put those in another worksheet. The main idea is to end up with a single sheet that contains all of the content I actually want to see indexed.
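If it helps to see that sorting step in code, here's a minimal sketch using pandas rather than doing it by hand in Excel. The file name and column names ("internal_all.csv", "Content", "Status Code") are assumptions based on a typical Screaming Frog export, so adjust them to whatever your version produces:

```python
import re
import pandas as pd

# Column names here are assumptions based on a typical Screaming Frog
# "Internal All" export; rename them to match your own file.
crawl = pd.read_csv("internal_all.csv")

def sheet_name(label):
    """Turn a content type or status code into a valid Excel sheet name."""
    return re.sub(r"[\\/?*\[\]:]", "_", str(label))[:31]

with pd.ExcelWriter("content_inventory.xlsx") as writer:
    # One worksheet per content type (HTML, images, CSS, JavaScript, ...).
    for content_type, rows in crawl.groupby("Content"):
        rows.to_excel(writer, sheet_name=sheet_name(content_type), index=False)

    # One worksheet per non-200 status code (301, 302, 404, 500, ...).
    redirects_and_errors = crawl[crawl["Status Code"] != 200]
    for status, rows in redirects_and_errors.groupby("Status Code"):
        rows.to_excel(writer, sheet_name=f"status_{status}", index=False)
```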

If there are canonical link elements for those indexable URLs, I'll create another sheet with two columns: one for the addresses of the pages I want indexed, and another for their canonicals (sometimes there is more than one canonical, which is tricky if they don't match). I'll select both columns and use conditional formatting in Excel to highlight cells where the content matches, then sort by cell background color so the cells that don't match move to the top, where I can compare them and try to understand why they don't match. This is where I often find that pages in a series (pagination pages), like category pages in WordPress or product pages on an ecommerce site, are incorrectly using the first URL in the series as the canonical link element for every page in that series.
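The same mismatch check can be scripted. This is only a sketch, and the column names ("Address", "Canonical Link Element 1") are assumptions about what a Screaming Frog export of HTML pages contains:

```python
import pandas as pd

# Assumed column names from a Screaming Frog export of indexable HTML pages.
pages = pd.read_csv("internal_html.csv")

has_canonical = pages["Canonical Link Element 1"].notna()
mismatched = pages[
    has_canonical & (pages["Canonical Link Element 1"] != pages["Address"])
]

# Pagination problems tend to surface here: pages 2, 3, 4... of a category
# all pointing their canonical at the first page in the series.
print(mismatched[["Address", "Canonical Link Element 1"]].to_string(index=False))
```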

As with Xenu, I'll often keep an eye on the URLs that show up as the program is crawling, to see if there are URLs I don't want crawled. Screaming Frog lets you exclude pages you might not want to include within a crawl, such as "email to a friend" pages, "write a review" pages, or "compare products" pages. You can also exclude certain parameters from a crawl, such as session IDs or tracking codes.
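If you prefer to prototype those exclusions against a URL list before configuring the crawler, a handful of regular expressions gets you most of the way there. The patterns below are hypothetical examples, not rules from any particular site:

```python
import re

# Hypothetical exclusion patterns: the kinds of thin URLs and tracking
# parameters I'd typically keep out of a crawl. Adjust for the site at hand.
EXCLUDE_PATTERNS = [
    r"/email-to-a-friend/",
    r"/write-a-review/",
    r"/compare-products/",
    r"[?&](sessionid|utm_[a-z]+)=",
]

def should_crawl(url: str) -> bool:
    """Return False for URLs matching any pattern we'd rather not crawl."""
    return not any(re.search(pattern, url, re.I) for pattern in EXCLUDE_PATTERNS)

print(should_crawl("https://example.com/product/123"))           # True
print(should_crawl("https://example.com/compare-products/?a=1"))  # False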

Sometimes you'll see sites where all of the pages can also be crawled as HTTPS pages, and as versions with and without "www".
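A quick way to see how much duplication that can create is to enumerate the protocol and hostname variants of a single page; the URL below is just a placeholder:

```python
from urllib.parse import urlsplit, urlunsplit

def variants(url: str):
    """List the http/https and www/non-www versions of a URL that might all be crawlable."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    host = netloc[4:] if netloc.startswith("www.") else netloc
    for s in ("http", "https"):
        for h in (host, "www." + host):
            yield urlunsplit((s, h, path, query, fragment))

for v in variants("https://example.com/widgets/"):
    print(v)  # four potential duplicates of one page
```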

The program can also be very useful when a site has faceted navigation and uses different types of parameters to sort and filter content, usually products. If these kinds of parameters start appearing during a crawl, it's often a good idea to visit those pages and see what the parameters are doing.

One particular ecommerce site I worked on had HTTPS versions of all pages crawlable, "www" and "non-www" versions that could be indexed, and multiple sorting facets that were getting indexed by the search engines. It also had lots of thin pages with very little actual content that let people share pages via email, contribute reviews, compare products side-by-side, sort by color, sort by brand, sort by price (high and low), sort alphabetically (and in reverse), and more.

Once I had a good understanding of how the site was organized and what it contained, I was able to stop my crawl, take notes, and exclude the URLs that I didn't want indexed. The site actually contained about 6,800 product and category pages that I did want to have indexed, but many more pages, tens of thousands of them, were being indexed by the search engines.

Discovery via Screaming Frog lets me decide how to handle all of those extra pages that I don't want included in search results, using meta robots noindex tags, robots.txt disallow statements, parameter handling, and other approaches. It lets me learn when canonical link elements might be set up incorrectly, and which links lead to redirects and broken pages. And it gives me the start of a content inventory that includes page titles, meta descriptions, and headings, along with the length of each, in a way that's easy to sort by size. That start toward a more complete content inventory lets you more easily identify which keywords you're targeting on different pages of a site, and which ones you might be missing.
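As a quick sanity check after adding robots.txt disallow rules, a few lines of Python can confirm that the URLs you want indexed are still crawlable while the unwanted ones are blocked. The domain and paths here are hypothetical; swap in the site's real robots.txt URL and a sample of URLs from your crawl export:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt location; replace with the site you're auditing.
robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()

urls_to_check = [
    "https://www.example.com/widgets/blue-widget",        # should stay crawlable
    "https://www.example.com/compare-products/?a=1&b=2",  # should be disallowed
]

for url in urls_to_check:
    verdict = "allowed" if robots.can_fetch("*", url) else "disallowed"
    print(f"{verdict}: {url}")
```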

Screaming Frog works well with most sites I work on, but there is a limit. If a site is too big, the memory available on my desktop computer isn't enough to crawl it. An 800,000-URL site couldn't be crawled with Screaming Frog. I went searching for solutions, and found a cloud-based one that doesn't have that limitation.

Cloud-Based Crawling

When you're faced with crawls of hundreds of thousands or even millions of URLs, chances are that a desktop computer isn't going to be able to handle the job. Over on Distilled's blog, I found a review of a program called DeepCrawl in "The Latest 5 Tools I've Added to my SEO Toolbox," and contacted the owners of the crawling program.

They hadn't really moved into the US market yet, and the prices of their crawls might be intimidating if you don't have the budget of an enterprise organization (after you translate the cost from pounds to dollars). But they demonstrated their offering to me, and have since offered me a beta account so I could explore what it's like to outsource the computing effort of a large crawl to the cloud. I suspect their prices will come down some as they grow and their audience expands (I believe they already have a bit).

During the demo, I quickly noticed that watching the crawl in action to discover parameters and URLs you might want to exclude isn't an option; that processing takes place in the cloud, so it isn't visible as it happens. There are ways to limit initial crawls to use for discovery, and spending some time browsing a site beforehand is probably a good idea as well. Asking the client or the client's developers about the site during a kickoff or follow-up meeting isn't a bad idea either.

DeepCrawl has a large number of reports available that are easily shared with any developers you might be working with, whether in-house or working for your client. Those reports are easy to access through an online interface, which is fairly intuitive to learn and understand after some exploration of the different features it offers.

Much like Screaming Frog, it offers a number of different ways to exclude specific URLs from being crawled, which lets you make sure you're checking only the URLs that you want indexed.

I can't create the content inventory that I like so much from Screaming Frog, and I can't watch along as I could with crawls from Xenu or Screaming Frog, but DeepCrawl will crawl very large sites quickly, and the ability to share reports that focus on specific issues streamlines addressing those issues.

One aspect of DeepCrawl that I like very much is the ability to run a second crawl and quickly compare it against an earlier crawl to pinpoint the changes between crawls. This can be extremely helpful when you perform a follow-up audit to make sure that the changes you've recommended have been implemented.
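DeepCrawl handles that comparison for you, but if you're working from exported URL lists, a rough version of the same idea looks like this (the file and column names are assumptions, not DeepCrawl's actual export format):

```python
import pandas as pd

# Diff the URL lists from two crawl exports to pinpoint what changed.
before = set(pd.read_csv("crawl_before.csv")["Address"])
after = set(pd.read_csv("crawl_after.csv")["Address"])

print(f"URLs removed since last crawl: {len(before - after)}")
print(f"URLs added since last crawl:   {len(after - before)}")
print(f"URLs present in both crawls:   {len(before & after)}")
```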

As the Distilled review notes, DeepCrawl is an enterprise tool. But given the large size of many sites, something like it may soon be a necessary part of every SEO's toolkit.

Conclusion

Xenu, the first tool that I used to seriously crawl websites, was created as a broken link checker, but it proved its usefulness during audits for hundreds of sites. It wasn't helpful for creating a content inventory, like Screaming Frog is, but it did a great job of finding broken links and redirects and helping me understand the link architecture of a site. I still use Xenu Link Sleuth to create reports on broken links and redirects, and on changes that I'd like to see made to a site.

Screaming Frog was created specifically as an SEO tool, and in conjunction with Excel, it's extremely helpful for identifying issues on a site that should be addressed. I've probably used it on multiple sites every day for the past year or so, and it's usually my go-to program for analyzing SEO issues on sites. It doesn't replace the knowledge and analysis that you need to perform SEO, but it saves a tremendous amount of time in performing that analysis.

DeepCrawl enables the crawling of large sites that a desktop application just can't handle, and when the site you're working on has millions of URLs, it makes it much more likely that you'll find the issues on that site that should be addressed.

That first agency I wrote about above focused on optimizing specific individual pages for keywords rather than analyzing the entire site. That was a number of years ago, but it was an approach I wasn't comfortable with. I wanted to identify spider traps and duplicate URLs, redirects and broken links. Removing low-quality pages that didn't need to be indexed, because they were unlikely to rank in search results, attract links, or be shared by visitors, was also important. So I integrated global changes into sites in addition to specific on-site changes.

I also wanted to be aware of how pages were connected together in terms of an information architecture. Crawling through a site and creating a content inventory of it gives you a roadmap to that information.

