
Solving the Sub-Domain Equation: Predicting Traffic and Value when Merging Sub-Domains

Hive Digital

The author's views are entirely their own (excluding the unlikely event of hypnosis) and may not always reflect the views of Moz.


To sub-domain or not to sub-domain, that is the question. Should you keep your content on separate sub-domains or the same domain? If I do merge my sub-domains, will I gain or lose traffic? How much?

Since my first days in SEO back in 2004, the sub-folder vs. sub-domain debate has echoed through nearly every site architecture discussion in which I have participated. It seems trivial in many respects that we would focus so intently on what essentially boils down to the ordering of words in a URL, especially given that www. itself is a sub-domain. However, for a long time, there has been good reason to consider the question very carefully. Today I am writing about the problem in general, and I propose a programmatic strategy for answering the sub-domain/sub-folder debate.

For the purposes of this article, let's assume there is a company named Example Business that sells baseball cards, baseball jerseys and baseball hats. They have two choices for setting up their site architecture.

They can use sub-domains...

Or, they can use directories...

Many of you have probably dealt with this exact question, and for some of you it has reared its head dozens if not hundreds of times. For those of you less familiar with the problem, let's run through a brief history of sub-domains, sub-folders, and their interaction with Google's algorithm so we can get a feel for the landscape.

Sub-domains and SEOs: A quick historical recap

First, really quickly, here is the breakdown of your average URL. We are most interested in comparing the sub-domain with the directory to determine which might be better for rankings.

[Image: the parts of a URL]
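To make the comparison concrete, here is a minimal sketch of pulling a URL apart with Python's standard library; the URL below is a hypothetical "Example Business" address, not one from the article.

```python
from urllib.parse import urlparse

# Split a URL into its sub-domain and first directory. The example
# address is hypothetical.
url = "http://baseball-hats.example.com/shop/new-arrivals"
parts = urlparse(url)

subdomain = parts.hostname.split(".")[0]  # "baseball-hats"
directory = parts.path.split("/")[1]      # "shop"
print(subdomain, directory)  # → baseball-hats shop
```

The sub-domain/sub-folder question is literally whether the keyword lives in `hostname` or in `path`.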

This may date me a bit, either as a Noob or an Old-Timer depending on when you got in the game. I started directly after the Florida update in 2003. At that time, if I recall correctly, the sub-domain/sub-folder debate was not quite as pronounced. Most of the decisions we were making at the time regarding sub-domains had more to do with quick technical solutions (i.e., putting one sub-domain on a different machine) than with explicit search optimization.

However, it seemed at that time our goal as SEOs was merely to find one more place to shove a keyword. Whether we used dashes (hell, I bought a double--dashed domain at one point) or sub-domains, Google's algorithms seemed, at least temporarily, to reward keyword-rich sub-domains. Domains were expensive, but sub-domains were free. Many SEOs, myself included, began rolling out sites with tons of unique, keyword-rich sub-domains.

Google wasn't blind to this manipulation, though, and beginning around 2004 it managed, with some degree of effectiveness, to kill off the apparent benefit of sub-domain spam. However, the tactic still persisted to some degree in discussions from 2006, 2007, 2008, and 2009. For a while, there seemed to be a feather in the cap of sub-domains specifically for SEO.

Fast forward a few years and Google introduced a new, wonderful feature: host crowding and indented results. Many of you likely remember this feature; essentially, if you had two pages from the same host ranking in the top 10, the second would be pulled up directly under the first and given an indent for helpful organization. This was a huge blow to sub-domain strategies. Ranking positions 1 and 10 on the same host was now essentially the same as owning the top two positions, but split across separate hosts those rankings earned no such boost. In this case, it would make sense for "Example Business" to use sub-folders rather than sub-domains. If the content shared the same sub-domain, every time their website had two listings in the top 10 for a keyword, the second would be tucked up nicely under the first, effectively jumping multiple positions. If the listings were on separate sub-domains, they would not get this benefit.

Host Crowding Made Consolidating to a Single Domain Beneficial

Google was not done, however. They have since taken away our beautiful indented listings and deliberate host crowding and, at the same time, given us Panda. Initial takes on Panda indicated that sub-domain and topical segregation could bring positive results, as Panda was applied at the host-name level. Now it might make sense for "Example Business" to use sub-domains, especially if segmenting off low-quality user-generated content.

Given these changes, it is understandable why the sub-domain debate has raged on. While many have tried to discredit the debate altogether, there are legitimate, algorithmic reasons to choose a sub-domain or a sub-folder.

Solving the sub-domain equation

One of the beauties of contemporary SEO is having access to far better data than we've ever had. While I do lament the loss of keyword data in Google Analytics, far more data is available at our fingertips than ever before. We now have the ability to transform the intuition of smart SEOs into cold, hard math.

When Virante, the company of which I am CTO, was approached a few months ago by a large website to help answer this question, we jumped at the opportunity. I now had the capability of turning my assumptions and confidences into variables and variances, and building a better solution. The client had used the sub-domain method for many years. They had heard concepts like "Domain Authority" and wondered if their sub-domains spread themselves too thin. Should they merge their sub-domains together? All of them, or just a few?

Choosing a mathematical model for analysis

OK, now for the fun stuff. There are a lot of things that we as SEOs don't know but have a pretty good idea about. We might call them assumptions, gut instincts, experience, or intuitions, but in math, we can refer to them as variables. For each of these assumptions, we also have confidence levels. We might be very confident about one assumption (like backlinks improve rankings) and less confident about another (longer content improves rankings). So, we have our variables and we have how confident we are about them. When we don't know the actual values of these variables (in science, we would refer to them as independent variables), Monte Carlo simulations often prove to be one of the most effective mathematical models we can use.

Definition: Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results; i.e., by running simulations many times over in order to calculate those same probabilities heuristically just like actually playing and recording your results in a real casino situation: hence the name. - Wikipedia

With Monte Carlo simulations, we essentially brute-force our way to an answer. We come up with all of the possibilities, drop them into a bag, and pick one from the bag over and over again until we have an average result. Or think about it this way. Let's say I handed you a bag with 10,000 marbles and asked you which color of marble in the bag is most common. You could pour them all out and try to count them, or you could shake the bag and pick out one marble at a time. Eventually, you would have a good sample of the marbles and be able to estimate the answer without having to count them all.
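The marble analogy fits in a few lines of Python. The bag's contents here are made up, but the sampling idea is exactly the one described above:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the estimate is reproducible

# A hypothetical bag of 10,000 marbles in unknown proportions.
bag = ["red"] * 5000 + ["blue"] * 3000 + ["green"] * 2000
random.shuffle(bag)

# Instead of counting every marble, draw a sample and estimate.
sample = [random.choice(bag) for _ in range(500)]
estimate = Counter(sample).most_common(1)[0][0]
print(estimate)  # → red
```

Five hundred draws are enough here: with a 50/30/20 split, the sample's most common color matches the bag's with near certainty.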

We can do the same thing here. Instead of asking which color a marble is, we ask "If I merge one URL with another, what is the likelihood that it will receive more traffic from Google?". We then just have to load all of the variables that go into answering that question into our proverbial bag (a database) and randomly select over and over again to get an estimate.

So here are the details, hopefully you can follow and do this yourself.

Step 1: Determine the keyword landscape

The first thing we need to know is every possible keyword for which the client might rank, how much potential traffic is available for each keyword, and how valuable that keyword is in terms of CPC. The CPC value allows us to determine the true value of the traffic, not just the volume; we want to improve rankings for valuable keywords more than for random ones. This client in particular is in a very competitive industry that relies on a huge number of mid- and long-tail keywords. We built a list of over 46,000 keywords related to their industry using GrepWords (you could use SEMrush to do the same).

Step 2: Determine the search landscape

We now need to know where they actually rank for these keywords, and we need to know all the potential sub-domains we might need to test. We queued all 46K keywords with the AuthorityLabs API, and within 24 hours we had the top 100 results in Google for each. We then parsed the data and extracted the position of every ranking sub-domain for the site. There were around 25 sub-domains in all, but we ultimately chose to analyze only the 9 that made up the majority of non-branded traffic.

Step 3: Determine the link overlap

Finally, we need to know about the links pointing to these sub-domains. If they all have links from the same sites, we might not get any benefit when we merge the sub-domains together. Using the Mozscape API's Link Metrics call, we pulled down the root linking domains for each sub-domain. When we run our Monte Carlo simulation, we can determine how their link profiles overlap and make decisions based on that impact.
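The article doesn't specify an overlap formula, but one reasonable choice (an assumption on my part) is the Jaccard similarity of the two sets of root linking domains. The domain lists below are made up for illustration:

```python
# Sketch: link overlap as Jaccard similarity of root-linking-domain
# sets (e.g., as pulled from the Mozscape API). Domains are hypothetical.

def link_overlap(domains_a: set, domains_b: set) -> float:
    """Jaccard similarity: shared domains / all unique domains."""
    if not (domains_a or domains_b):
        return 0.0
    return len(domains_a & domains_b) / len(domains_a | domains_b)

jerseys = {"espn.com", "mlb.com", "cardcollector.net"}
hats = {"espn.com", "mlb.com", "hatreview.org", "capworld.com"}
print(link_overlap(jerseys, hats))  # → 0.4 (2 shared of 5 unique)
```

High overlap means merging consolidates fewer new linking domains, so we would expect less net link benefit.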

Step 4: Create our assumptions

As we have mentioned, there are a lot of things we don't know, but we have a good idea about. Here we get to add in our assumptions as variables. You will see variables expressed as X and Y in these assumptions. This is where your expertise as an SEO comes into play.


Question 1: If two sub-domains rank for the same keyword in the top 10, what happens to the lower-ranked listing?
Assumption 1: X% of the time, the second ranking will be lost as Google values domain diversity.
Example: It turns out that http://baseball-jerseys.example.com and http://baseball-hats.example.com both rank in the top 10 for the keyword "Baseball Hats and Jerseys". We assume that 30% of the time, the lower of the two rankings will be lost because Google values domain diversity.

Question 2: If two sub-domains rank for the same keyword in the top 10, what happens to the top ranked subdomain?
Assumption 2: Depending on the X% of link overlap, there is a Y% chance of improving 1 position.
Example: It turns out that http://baseball-jerseys.example.com and http://baseball-hats.example.com both rank in the top 10 for the keyword "Baseball Hats and Jerseys". We assume that 70% of the time, based on X% of link overlap, the top ranking page will move up 1 position.

Question 3: If two sub-domains merge, what happens to all rankings of top ranked subdomain, even when dual rankings are not present?
Assumption 3: Depending on X% of link overlap, there is a Y% chance of improving 1 position.
Example: On keywords where http://baseball-jerseys.example.com and http://baseball-hats.example.com don't have overlapping keyword rankings, we assume that 20% of the time, based on X% of link overlap, their rankings will improve 1 position.

These are just some of the questions you might want to include in your modeling method. There might be other factors you want to take into account, and you certainly can. The model can be quite flexible.
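In code, each assumption becomes a probability and each scenario is a random draw against it. The 30/70/20 figures below are the illustrative values from the examples above, not measured constants; in a real model you would scale the Y% chance by the measured link overlap X.

```python
import random

# Illustrative probabilities from the examples above (assumptions, not
# measured values). In practice, Y would be scaled by link overlap X.
P_LOSE_SECOND = 0.30  # Assumption 1: lower dual ranking is lost
P_BOOST_TOP = 0.70    # Assumption 2: top dual ranking gains 1 position
P_BOOST_OTHER = 0.20  # Assumption 3: non-dual rankings gain 1 position

def draw(probability: float) -> bool:
    """One Bernoulli draw: does this random scenario trigger the event?"""
    return random.random() < probability
```

Scaling by overlap could be as simple as `p = base_p * (1 - overlap)`, so sub-domains with identical link profiles are predicted to gain nothing from merging.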

Step 5: Try not to set fire to the computer

So now that we have our variables, the idea is to pick the proverbial marble out of the bag. We will create a random scenario using our assumptions, sub-domains and keywords and determine what the result of that single random scenario is. We will then repeat this hundreds of thousands of times to get the average result for each sub-domain grouping.


We essentially need to do the following...

  1. Select a random set of sub-domains.
    For example, it might be sub-domains 1, 2 and 4. It could also be all of the sub-domains.
  2. Determine the link overlap between the sub-domains
  3. Loop through every keyword ranking those sub-domains we determined when building the Keyword and Search Landscape back in Step 2. Then, for each ranking...
    1. Randomly select our answer to #1 (i.e., is this the 3 out of 10 times that we will lose rankings?)
    2. Randomly select our answer to #2 (i.e., is this the 7 out of 10 times that we will increase rankings?)
    3. Randomly select our answer to #3 (i.e., is this the 2 out of 10 times we will increase rankings?)
  4. Find out what our new traffic and search value will be.
    Once you apply those variables above, you can guess what the new ranking will be. Use the Search Volume, CPC, and estimated CTR by ranking to determine what the new traffic and traffic value will be.
  5. Add It Up
    Add up the estimated search volume and the estimated search value for each of the keywords.
  6. Store that result
  7. Repeat hundreds of thousands of times.
    In our case, we ended up repeating around 800,000 times to make sure we had a tight variance around the individual combinations.
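Putting the steps above together, a single iteration might look like the following sketch. The rankings list, the role labels, and the CTR-by-position figures are hypothetical stand-ins for the real landscape data gathered in Steps 1-3:

```python
import random

# Minimal sketch of one Monte Carlo iteration. All data structures and
# CTR figures are hypothetical placeholders.

CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05,
       6: 0.04, 7: 0.03, 8: 0.025, 9: 0.02, 10: 0.015}

def simulate_once(rankings, p_lose=0.30, p_boost_top=0.70, p_boost=0.20):
    """One random scenario. Each ranking is a dict with keys
    position, volume, cpc, and role ("dual_second", "dual_top", "solo")."""
    traffic = value = 0.0
    for r in rankings:
        pos = r["position"]
        if r["role"] == "dual_second":
            if random.random() < p_lose:
                continue  # Assumption 1: second listing is lost
        elif r["role"] == "dual_top":
            if random.random() < p_boost_top:
                pos = max(1, pos - 1)  # Assumption 2: top listing gains
        elif random.random() < p_boost:
            pos = max(1, pos - 1)      # Assumption 3: other rankings gain
        clicks = r["volume"] * CTR.get(pos, 0.0)
        traffic += clicks
        value += clicks * r["cpc"]
    return traffic, value

def monte_carlo(rankings, n=100_000):
    """Steps 4-7: repeat, store, and average the results."""
    runs = [simulate_once(rankings) for _ in range(n)]
    return (sum(t for t, _ in runs) / n,   # average traffic
            sum(v for _, v in runs) / n)   # average traffic value
```

Running `monte_carlo` once per sub-domain combination, with the probabilities scaled by that combination's link overlap, produces the averages analyzed in the next step.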

Step 6: Analyze the results

OK, so now you have 800,000 results; what do we do with them? The first thing we do is segment those results by their sub-domain combination. In this case, we had a little over 500 different sub-domain combinations. Second, we compute the average traffic and traffic value for each of those sub-domain combinations across those 800,000 results. We can then graph all those results to see which sub-domain combination had, on average, the highest predicted traffic and value.
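The segmentation step is plain bookkeeping. Here is a small sketch; the `(combination, traffic, value)` rows are made-up examples of what gets stored during the simulation:

```python
from collections import defaultdict

# Sketch: averaging stored simulation results per sub-domain
# combination. The rows below are hypothetical stored results.
results = [
    (("blog", "shop"), 1200.0, 4800.0),
    (("blog", "shop"), 1400.0, 5200.0),
    (("blog", "shop", "help"), 900.0, 3100.0),
]

sums = defaultdict(lambda: [0.0, 0.0, 0])
for combo, traffic, value in results:
    sums[combo][0] += traffic
    sums[combo][1] += value
    sums[combo][2] += 1

averages = {c: (t / n, v / n) for c, (t, v, n) in sums.items()}
print(averages[("blog", "shop")])  # → (1300.0, 5000.0)
```

At 800,000 stored rows this still runs in seconds; no database heroics are required for the analysis itself.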

To be honest, graphs are a terrible way of figuring out the answer, but they are the best tool we have to convey it in a blog post. You can see exactly why below. With over 500 different potential sub-domain combinations, it is difficult to visualize all of them at the same time. In the graph below, you see all of them, with each bar representing the average score for an individual sub-domain combination. For all subsequent graphs, I have taken a random sample of only 50 of the sub-domain combinations so they are easier to visualize.

[Graph: average predicted traffic and value for each of the 500+ sub-domain combinations]

As mentioned previously, one of the things we try and predict is not just the volume of the traffic, but also the value of that traffic by multiplying it by CPC value of each keyword for which they rank. This is important if you care more about valuable commercial terms than just any keyword for which they might rank.

As the graph above exposes, there were some sub-domain combinations that influenced traffic more than value, and vice-versa. With this simulation, we could find a sub-domain combination that maximized the value or the traffic equation. A company that makes money off of display advertising might prefer to look at traffic, while one that makes money off of selling goods would likely pay more attention to the traffic value number.

There were some neat trends that the Monte Carlo simulation revealed. Of the sub-domains tested, 3 in particular tended to have a negative rankings effect in nearly all of the combinations. Whenever one of these 3 appeared in a combination, it slightly lowered the predicted traffic volume and traffic value. It turned out these 3 sub-domains had very few backlinks and only branded keyword rankings. Consequently, there was huge keyword overlap and almost no net link benefit when they were merged. We were easily able to exclude these from the sub-domain merger plan. We would never have guessed this, or seen this trend, without this kind of mathematical modeling.

Finally, we were able to look closely at sub-domain merger combinations that offered more search value and less search traffic, or vice-versa. Ultimately, though, 3 options vied for the top spot. They were statistically indistinguishable from one another in terms of potential traffic and traffic value. This meant the client wasn't tied to a single potential solution; they could weigh other factors like the difficulty of merging some sub-domains and internal political concerns.

Modeling uncertainty

As SEOs, there is a ton we don't know. Over time, we build up a huge number of assumptions and, with those assumptions, levels of confidence in each. I am very confident that a 301 redirect will pass along rankings, but not 100% confident. I am pretty confident that keyword usage in the title improves rankings, but again not 100% confident. The beauty of the Monte Carlo approach is that it allows us to graph our uncertainties.

The graphs you saw above were the averages (means) for each of the sub-domain combinations. There were actually hundreds of different outcomes generated for each of those sub-domain combinations; if we were to plot them, each combination would show a spread of results rather than a single bar. If I had just made a gut decision and modeled what I thought, without giving a range, I would have come up with only a single data point. Instead, I estimated my uncertainties, turned them into a range of values, and allowed the math to tell me how those uncertainties would play out. We put what we don't know in the graph, not just what we do know. By graphing all of the possibilities, I can present a more accurate, albeit less specific, answer to my client. Perhaps a better way of putting it is this: when we just go with our gut, we are choosing one marble out of the bag and hoping it is the right one.
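Reporting that spread takes only a couple of lines. The outcomes list below is a made-up sample of simulated traffic values for a single sub-domain combination:

```python
import statistics

# Sketch: summarizing the spread of outcomes for one combination
# instead of reporting only its mean. Values are hypothetical.
outcomes = [980, 1020, 1100, 1150, 1200, 1260, 1310, 1400, 1490, 1600]

mean = statistics.mean(outcomes)
spread = statistics.stdev(outcomes)
print(f"predicted traffic: {mean:.0f} ± {spread:.0f} visits")
# → predicted traffic: 1251 ± 202 visits
```

Presenting the mean together with its spread is exactly the "more accurate, albeit less specific" answer: a range the client can plan around rather than a single point estimate.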

Takeaways

  1. If you are an agency or consultant, it is time to step up your game. Your gut instinct may be better than anyone else's, but there are better ways to use your knowledge to get at an answer than just thinking it through.

  2. Don't assume that anything in our industry is unknowable. The uncertainty that exists is largely because we, as an industry, have not yet chosen to adopt the tools that are plainly available to us in other sciences that can take into account those uncertainties. Stop looking confused and grab a scientist or statistician to bring on board.

  3. Whenever possible, look to data. As a small business owner or marketer, demand that your provider give you sound, verifiable reasons for making changes.

  4. When in doubt, repeat. Always be testing and always repeat your tests. Making confident, research-driven decisions will give you an advantage over your competition that they can't hope to undo.

Follow up

This is an exciting time for search marketers. Our industry is rapidly maturing in both its access to data and its usage of improved techniques. If you have any more questions about this, feel free to ask in the comments below or hit me up on Twitter (@rjonesx). I'd love to talk through more ideas for improvements you might have!
