Cracking Google's 1,000 Page Barrier
One of the frustrations of doing SEO for large websites is that Google makes it very difficult to see more than a small part of the search index. Even in Webmaster Tools, Google's index search is built on the same mechanics as its web search, which only lets you see the first 1,000 pages of any result. Whether you're trying to get pages discovered, struggling with duplicate content, confirming robots.txt changes, or doing advanced index sculpting, that 1,000-page barrier can be extremely limiting when you're dealing with a site that has 10,000 or more indexed pages.
So, how can we dig deeper into the index and really see the big picture?
The Tools – Site: and Inurl:
First off, you're going to need a couple of tools. I'll assume that most of you are familiar with Google's "site:" command, which returns the indexed pages from any given domain or subdomain. Let's take our friends here at SEOmoz as an example. Type "site:seomoz.org" into Google's search box, and you'll see something like this:
The other command we'll be using is "inurl:", which restricts the results to only those pages containing a specific keyword in the URL. When "inurl:" is paired with the "site:" command, Google returns only the indexed pages on that site whose URLs contain those keywords.
The Tactic – Index Deconstruction
Using our SEOmoz example, how can we find out which pages are included in the roughly 12,000-page index when we can only see those pages 1,000 at a time? Those last three words are the key: we can only see 1,000 pages at a time, but depending on how we construct our searches, they don't have to be the same 1,000 pages. By splitting up our index searches logically, we can break the full index up into manageable chunks. We'll do this by using "inurl:" to force the "site:" command to show us the index through smaller windows.
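As a rough sketch of what "smaller windows" means in practice, the snippet below builds one "site:" plus "inurl:" query string per URL keyword. The function name and the keyword list are my own illustrations, not part of any Google tool, and the queries still have to be typed (or pasted) into Google by hand.

```python
def windowed_queries(domain, url_keywords):
    """Return one 'site:' + 'inurl:' query string per URL keyword."""
    return [f"site:{domain} inurl:{kw}" for kw in url_keywords]


# Illustrative keyword list; in practice, take these from the site's folder structure.
for query in windowed_queries("seomoz.org", ["blog", "tools"]):
    print(query)
# site:seomoz.org inurl:blog
# site:seomoz.org inurl:tools
```

Each query shows a different slice of the index, and each slice gets its own 1,000-result window.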
An Example – Deconstructing SEOmoz
This is one of those techniques that's much easier to illustrate with an example. Let's say that we needed to dig deeply into SEOmoz's 12,000 indexed pages. The first thing that we might do is to take a look at the main navigation to get an idea of the URL/folder structure of the site. Looking at the top-right navigation on SEOmoz, we see the following (I've added the numbers 1-6 - see below):
Other than "Home," the first link goes to the "/blog" folder. That looks promising, so let's try out our combination "site:" and "inurl:" search:
After clicking the "omitted results" link to see the full list, we get 2,430 indexed pages whose URLs contain the word "blog." That's a good start, so let's see what we can do with a few more of the major folders (numbered above):
- (1) inurl:blog - 2,430
- (2) inurl:ugc - 712
- (3) inurl:articles - 96
- (4) inurl:tools - 29
- (5) inurl:users - 5,880
- (6) inurl:marketplace - 787
Not bad: with just 6 subfolders, we've accounted for 9,934 pages or over 80% of the index. This, of course, assumes minimal overlap, and the accuracy of Google's numbers may be questionable (I'll discuss some issues with "inurl:" at the end of the post), but it's more than adequate to get the job done.
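The arithmetic behind that figure is easy to check. Here is a minimal sketch using the six counts above; the 12,000 total is the rough index size quoted earlier, so the percentage is only approximate, and the sum assumes no overlap between folders.

```python
# Result counts reported by Google for each "inurl:" window (from the list above).
counts = {
    "blog": 2430,
    "ugc": 712,
    "articles": 96,
    "tools": 29,
    "users": 5880,
    "marketplace": 787,
}

INDEX_SIZE = 12000  # "roughly 12,000" pages; an approximation


def coverage(counts, index_size):
    """Return (total pages accounted for, share of the index), assuming no overlap."""
    total = sum(counts.values())
    return total, total / index_size


total, share = coverage(counts, INDEX_SIZE)
print(f"{total:,} pages, {share:.1%} of the index")
# 9,934 pages, 82.8% of the index
```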
Now, we're left with a couple of groups, such as "users" (5), that still exceed 1,000 pages. At this point, you'll have to use some logic and your knowledge of the site in question. As a frequent Moz user, I know that the "users" folder contains all of the user profiles. Digging a little, I can easily find that those profile URLs all contain "users/view." A new search on "inurl:users/view" reveals 5,810 user profiles, making up almost all of the pages in the "users" folder and almost half of the total index.
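To decide where to drill down next, it can help to flag every window that still exceeds the 1,000-result limit and to see how much of a folder a narrower search leaves unexplained. This is only a sketch built on the counts reported above; it assumes Google's numbers are accurate and that the "users/view" results are a subset of the "users" results.

```python
RESULT_LIMIT = 1000  # Google shows at most the first 1,000 results

counts = {
    "blog": 2430,
    "ugc": 712,
    "articles": 96,
    "tools": 29,
    "users": 5880,
    "marketplace": 787,
}


def oversized(counts, limit=RESULT_LIMIT):
    """Return the keywords whose result count is still above the visible limit."""
    return [kw for kw, n in counts.items() if n > limit]


print(oversized(counts))
# ['blog', 'users']

# Pages in "users" not explained by the narrower "users/view" search.
users_view = 5810
remainder = counts["users"] - users_view
print(remainder)
# 70
```

A small remainder like this suggests the narrower pattern captures nearly the whole folder, although that pattern still has to be split further before every page in it becomes visible.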
An Example – Canonical URLs
Most of the time, we aren't going to be trying to deconstruct the entire Google index for a site, but just need to answer a specific question. Let's take my own company site/blog as an example. Recently, I realized that I had left some loose ends in the code that were revealing both canonical and non-canonical URLs. So, for example, the same blog post might have the following two URLs:
- http://www.usereffect.com/topic/the-last-spam-youll-ever-need
- http://www.usereffect.com/index.php?id=154
I've recently made some code changes to fix the problem, but how do I find out if my fix is working? I simply look for "id" in the URL with a search command like "site:usereffect.com inurl:id". As of this writing, that search only shows 1 result, suggesting that my changes are having the desired effect.
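If you copy the result URLs out of Google by hand, a small local check can separate the non-canonical ones. The sketch below assumes, as on my site, that the non-canonical form always carries an "id" query parameter; the function name is my own. Note that it is stricter than "inurl:id", which matches the word "id" anywhere in the URL.

```python
from urllib.parse import urlparse, parse_qs


def has_id_parameter(url):
    """Return True if the URL's query string contains an 'id' parameter."""
    query = parse_qs(urlparse(url).query)
    return "id" in query


# The two example URLs from above.
urls = [
    "http://www.usereffect.com/topic/the-last-spam-youll-ever-need",
    "http://www.usereffect.com/index.php?id=154",
]

non_canonical = [u for u in urls if has_id_parameter(u)]
print(non_canonical)
# ['http://www.usereffect.com/index.php?id=154']
```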
Advanced Inurl Tips
I hope that I've demonstrated just how powerful two relatively simple search tools can be when effectively combined. Before you go out and put this to work, though, a couple of warnings about "inurl:", which has a tendency to misbehave.
First, "inurl:" seems to ignore punctuation, for the most part. A targeted search on the folder "inurl:/blog" returns the same results as "inurl:blog," which is to say that it returns every page that contains "blog" anywhere in the URL. In some cases, this won't be a problem, but you'll have to judge that on a case-by-case basis. Like standard Google search terms, "inurl:" only searches on whole words (but doesn't seem to allow word stems), and you can only use a single word at a time in any given "inurl:" statement.
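To make that behavior concrete, here is a rough local model of it: split the URL into words on anything that isn't a letter or digit, then test for whole-word membership. This only mimics what I've described above; it is not Google's actual matching logic, and the example URLs are hypothetical.

```python
import re


def url_words(url):
    """Split a URL into lowercase words, treating all punctuation as separators."""
    return {w for w in re.split(r"[^a-z0-9]+", url.lower()) if w}


def matches_inurl(url, keyword):
    """Approximate 'inurl:' as described: punctuation ignored, whole words only."""
    keyword_words = url_words(keyword)  # "/blog" reduces to the single word "blog"
    return bool(keyword_words) and keyword_words <= url_words(url)


# Hypothetical URLs, for illustration only.
print(matches_inurl("http://example.com/blog/some-post", "/blog"))   # True
print(matches_inurl("http://example.com/ugc/my-blog-post", "blog"))  # True
print(matches_inurl("http://example.com/blogs/some-post", "blog"))   # False
```

The second case is the one to watch for: a page outside the "blog" folder still matches because "blog" appears elsewhere in its URL.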
You can use multiple "inurl:" statements (one for each word) in your search, which are automatically combined with a logical AND. You can also use "-inurl:" to exclude specific URL keywords from any given search. Finally, you can combine "site:", "inurl:" and stand-alone keywords to target indexed pages by URL and content keywords in one statement.
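Putting those pieces together, a small helper can assemble the full search statement. This is a minimal sketch: the function and its parameter names are my own, the example keywords are illustrative, and it only builds the text you would type into Google's search box.

```python
def build_query(site, include=(), exclude=(), keywords=()):
    """Combine 'site:', 'inurl:', '-inurl:' and plain keywords into one query string."""
    parts = [f"site:{site}"]
    parts += [f"inurl:{word}" for word in include]    # one statement per word, ANDed
    parts += [f"-inurl:{word}" for word in exclude]   # URL keywords to leave out
    parts += list(keywords)                           # stand-alone content keywords
    return " ".join(parts)


print(build_query("seomoz.org", include=["blog"], exclude=["ugc"], keywords=["canonical"]))
# site:seomoz.org inurl:blog -inurl:ugc canonical
```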