
Google Analytics - Perfect, Future-Proof, Awesome Data FTW

Jono Alderson

This YouMoz entry was submitted by one of our community members. The author’s views are entirely their own (excluding an unlikely case of hypnosis) and may not reflect the views of Moz.


Howdy mozzers!

I'm proud to say that I attended the recent (and frankly amazing) SearchLove London conference, and that I came away with some really exciting ideas and plans. One of the key focuses of the conference was working out how we can change our behaviour and processes to make it easier to 'get stuff done' - particularly with big clients or organisations - alongside a raft of related ideas touching on conversion rate optimisation, linkbait, campaigns and link building in general. This is all awesome. However...

A lot of this stuff needs buy-in. CEOs and clients need to be able to understand (or to feel reassured by some compelling spreadsheets) that their investment is going to pay off before they'll free up budget, and so we've got to ensure that we're accountable. One of the most exciting takeaways from the conference was the idea that we should try pitching 'big idea' campaigns to the guys at the top, but start small with examples of execution and success to justify a bigger investment. This shouldn't be rocket science, but get this - tracking, reporting, analysis, analytics and data integrity weren't mentioned even once at SearchLove London. If I don't have a well-configured Google Analytics account which lets me measure the impact of my work and show a clear ROI, I'll struggle to get the buy-in I need to get that [awesome campaign idea / pesky site issue / indexation problem / linkbait / infographic launch] sorted out. There's an underlying assumption here that we're all in a position where this is easy, achievable, and already in place; and I'm not convinced that this is the case.

I'll put my hands up and admit that I'm a data perfectionist. I get twitchy if there's data in my Google Analytics account that I know isn't correct, and I'm always pushing to make it cleaner, better, and ultimately more actionable. But this isn't an academic quirk; if I'm making recommendations to clients, who're in turn investing money in changes and improvements, I can't afford for the information I'm providing to be based on erroneous data. Given that this is the case, and given that to a greater or lesser extent we all need to be (or at least should be) justifying our ideas and results with measurable ROI, why is the data in so many Google Analytics accounts so shockingly fragmented, suboptimally configured and just plain messy?

So, here's the deal - I'm going to run through a bunch of ideas and tips which you can go away and implement, and I can work on getting over my neurosis a little in the knowledge that the data you're collecting is accurate and useful. Let's start small.

Exclusion of URL queries

You've all seen reports like this in your Google Analytics account. These are pages wayyyy down at the bottom of my content reports, and we're looking at URLs plus pageviews:

Example of arbitrary query parameters appended to URL page reports in Google Analytics

There are two things here that really wind me up. I'm getting small trickles of visits to pages via third party services such as Google Translate, caching services, and Facebook (and who knows what else) which are fragmenting my data by dropping in arbitrary query parameters. There are only a few visits here, but these add up, and this results in two problems:

  1. This report is a pain to navigate and use, as it's full of detritus. I can probably live with that, but more importantly;
  2. I don't know how many actual visits my pages have had, because there's an arbitrary percentage which ends up getting fragmented down in this mess and reported as different content.

I don't care if the visitor has come via Google Translate in this context; sure, that's interesting and useful, but it shouldn't be confused with my page consumption data. Naha. #1 in the table, which is a request for a page with some Facebook junk in the URL, should be part of the same data set as the rest of the information about the page in question; the fact that data for this page is in who-knows-how-many places means that I'm not seeing a representative picture of performance, and that at any given time an arbitrary percentage of the actual, real information about how users are interacting with that page is lost - or at least buried. If this information was consolidated (whilst preserving, in another report, the fact that the visit came via Translate services), then I'd know that I actually had X more visitors to the page in question, and my understanding of the way they consume my content might shift a little as a result.

Now, I hear you cry, this is easy to fix. I can use the 'Exclude URL Query Parameters' facility in the profile to simply strip these parameters out of the reports and consolidate the information about that page. This works brilliantly if you've got nothing better to do with your time than constantly dig around in the bottom of your Analytics - the issue here is that you're not solving the problem, merely mopping up the symptoms; there's nothing stopping more of this happening.

Let's use an example to demonstrate why this isn't really a solution.

If I access your site with an arbitrary URL parameter, say, example.com/great-page/?messyjunkinyoururl=true, that'll show up as a different content report than example.com/great-page/ in the way we've explored above. This is easily removed via the addition of an exclusion string for messyjunkinyoururl, but get this - there's no way to easily and proactively spot this sort of fragmentation happening. Unless you're spending an awful lot of time monitoring your content reports, the recommendations you're making based on the data could be significantly skewed simply because you might not have spotted that 50% of your visits to a page came via a URL including unexpected query parameters. What happens if the full, consolidated data paints a very different picture which contradicts the recommendations you've made? Nightmare!

There are two hard limitations to this tool, too. There are a finite number of parameters you can remove (I believe it's 50), and the parameters themselves are case-sensitive. It's not going to take long before you run out of the ability to mop up your data, and then you're back at square one.

Even without these limitations, this isn't a fix; every crawler, scraper, syndicator, bolt-on, integration and third party tool in the world is going to be adding proprietary junk to your content reports, and you'll never know what's going on. It's hard to justify your awesome SEO campaign to the CEO when you can't tell which pages people are browsing on your site; harder still to make the wrong decisions because of bad information and take the rap for it later.

Example of 'cleaned' page reporting data in Google Analytics

Using parameter exclusion will give you consolidated, great data like the above, but only as long as you proactively maintain it.

There's one more challenge here, which is that the same kind of duplicate content issue we face as SEOs applies here, too. If a page can be accessed via multiple URLs (never mind query strings), then it'll report as those multiple pages, rather than as a single entity. This isn't (generally*) fixable with query string exclusion, and necessitates a different approach.

*If you've a heavily query-driven URL system, then there may be scenarios where you can try some clever exclusion rules, but this is a bit of a 'dirty' solution and will only apply to a very small number of websites.

Aggressive pruning with virtual pagenames

Another approach is to cheat a little, and this is the approach that many other tracking solutions take.**

**There's an interesting discussion here around whether Google Analytics's attempts to promote the simplicity and ease of use of the product have undermined the quality of the data collected, because people aren't aware that an off-the-shelf installation is usually rubbish - but that's a conversation for another time!

The approach is that rather than using and relying on the URL of the page, each page is assigned a name. Now, when anybody visits that page, regardless of the URL used, the data is all automatically consolidated into the report for that page name.

All I need to do is tweak my Google Analytics code to add a page name value in, which I can do by changing the following code:

_gaq.push(['_trackPageview']); becomes _gaq.push(['_trackPageview', 'NAME OF PAGE']);

Provided 'NAME OF PAGE' is either manually or programmatically slotted into the tag (e.g., editing the HTML of each page for simple/static sites, or on complex sites defining rules of behaviour so that different types of pages will output consistent and accurate page names), every time that tag fires it'll use that value instead of the URL. Bingo! We've cleaned up all that parameter junk, and everything's clean and sparkly; the URL used to access the page doesn't affect the reporting.
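
To illustrate (this is a sketch rather than a prescribed implementation - the property ID and the '/products/blue-widget' page name are placeholders which your CMS or template would output), the complete asynchronous snippet might end up looking something like this:

<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-XXXXXXX-X']); // placeholder property ID
  // The second argument replaces the requested URL in content reports;
  // '/products/blue-widget' is a placeholder your CMS or template would output.
  _gaq.push(['_trackPageview', '/products/blue-widget']);

  // Standard asynchronous ga.js loader
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>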

An example of using virtual pageviews in Google Analytics

This is nice! If you've a CMS with some flexibility, or some developers who owe you favours, we can take this a little further by starting to apply some intelligent structure to it... maybe something like this:

An example of using structured virtual pageviews in Google Analytics

All we've done here is get the CMS to identify the type of page being viewed, and use some pre-defined rules to prefix that information cleanly to the page name. Now I can search and sort based on different types and classifications of content, and even set up custom reports, segments and filters to only show me specific content types.
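
As a rough sketch of how that might be rendered (the pageType and pageSlug variables are hypothetical names; in practice your CMS would output these values for the current request rather than hard-coding them):

// Values the CMS would render into the page for the current request (hypothetical names)
var pageType = 'blog';                    // e.g. 'blog', 'product', 'category'
var pageSlug = 'why-clean-data-matters';  // a unique, human-readable name for this page
// Prefix the page type to the page name so reports can be searched, sorted and
// segmented by content type, e.g. '/blog/why-clean-data-matters'
_gaq.push(['_trackPageview', '/' + pageType + '/' + pageSlug]);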

There's another really nice advantage to this approach, which is that if the URL of a page changes (e.g., after a 301 redirect), providing that the page type and name remain the same, the report will continue to associate data for that page with the same entry, allowing clean and consistent data across the redirect period. Without this, if a URL changes you'll find that the old one stops tracking, and the new one starts tracking as a different page. Analysing behaviour across these transitions is difficult!

But hold on - while we've fixed the original problem of fragmentation, there are some big drawbacks to this approach...

  1. We've broken all URL-based reporting in Google Analytics, such as in-page reports (because none of these pages represent actual URLs).
  2. We've broken the ability to visit URLs via Google Analytics (because none of these pages represent actual URLs).
  3. We've created a system which needs to be perpetually maintained whenever the site changes or grows, otherwise we're right back where we were originally with a mix of fragmented page reports.

Point three is the killer here - what happens if we introduce pagination? How about if we change the way in which filtering or ordering content works? What about categorisation, taxonomies and new content types? Anything which we'd like to record about our content needs to be explicitly designed, implemented and maintained, otherwise we've failed to solve the original problem of muddled data, or (arguably worse) incorrectly consolidated the data. As an example, if we introduce paginated content into areas of the website and don't create new rules on how the page names in these scenarios should be created, we'll find that all the paginated results (e.g., page 2, page 3, etc.) are either consolidated into one page report, or that they fall back to using URLs - both of which put us back at square one.
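
To make the point concrete, here's a sketch of the sort of extra rule you'd need for paginated listings (the listingName and pageNumber values are hypothetical, and would be output by the CMS rather than hard-coded):

// Hypothetical pagination rule: append the page number so that page 2, page 3 etc.
// report as distinct entries instead of collapsing into one or falling back to URLs.
var listingName = '/category/widgets'; // base page name output by the CMS
var pageNumber = 2;                    // current page of results, output by the CMS
var pageName = (pageNumber > 1) ? listingName + '/page/' + pageNumber : listingName;
_gaq.push(['_trackPageview', pageName]);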

Canonical Tracking

With every new GA solution I scope, I juggle between these two implementations; do you aim high for a clean solution which requires thinking time and extra planning to maintain (page names), or aim low and commit to a fragmented solution which you can dedicate time to spring cleaning periodically (default)? The website's complexity and the resources available will factor into this decision, but it always feels like a dirty compromise as neither is ever a perfect fit, and it's hard to account for future resource and requirements.

What we need is some kind of system which provides a hybrid of both; minimal maintenance, but clean data, with as few drawbacks as possible; and I think I've found the solution...

As SEOs, we've all spent the last year building and implementing the perfect solution in another arena. We have a system which allows us to consolidate and collate value in a single location in the form of the canonical URL tag. What happens if we use the value of this tag as the page name value in our tracking solution?

_gaq.push(['_trackPageview', 'CANONICAL URL']);

Bingo. This is our best-of-both solution, where any URL accessed will report as the canonical URL of that page, solving the issue of query fragmentation and duplicate content reporting. This puts us back at our original, URL-based reports, but guarantees that they'll be clean and consolidated without the need for constant housekeeping. It also maintains the functionality of the in-page reports, hyperlinks within reports, etc.
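
The neatest implementation is simply to have your templates print the canonical path straight into the tag, but as a purely illustrative sketch, the same value could also be read client-side from the existing canonical tag (falling back to the requested path if no tag is present):

// Sketch only: use the page's canonical URL as the page name.
var canonicalTag = document.querySelector('link[rel="canonical"]');
var pageName = canonicalTag
  ? canonicalTag.getAttribute('href').replace(/^https?:\/\/[^\/]+/, '') // strip protocol and host
  : window.location.pathname;                                           // fall back to the requested path
_gaq.push(['_trackPageview', pageName]);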

There are, unsurprisingly, a few considerations however:

  • Your canonical URL setup needs to be squeaky clean, otherwise pages will report with the wrong canonical URL (potentially resulting in more fragmentation, rather than less).
  • New functionality, page types and behaviour will still require some consideration (but they already would from a canonical URL perspective, anyhow, so it's no extra work).
  • There may be fringe cases where the canonical URL isn't the URL you want to report against.***

***For example, advanced canonicalisation scenarios such as pagination, where you might canonicalise to a 'view all' page but want to see content reports against individual paginated pages, in which case you'll need some case-by-case logic to output an alternative value.
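
Purely by way of illustration (the ?page=N parameter is an assumption about how pagination might be exposed on a given site), that case-by-case logic could look something like this:

// Hypothetical override: paginated pages canonicalise to a 'view all' URL,
// but here we report them against their individual requested URLs instead.
var canonicalTag = document.querySelector('link[rel="canonical"]');
var canonicalPath = canonicalTag
  ? canonicalTag.getAttribute('href').replace(/^https?:\/\/[^\/]+/, '')
  : window.location.pathname;
var isPaginated = /[?&]page=\d+/.test(window.location.search); // assumes ?page=N pagination
var pageName = isPaginated ? window.location.pathname + window.location.search : canonicalPath;
_gaq.push(['_trackPageview', pageName]);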

The icing on the cake - visibility and accountability

We need to know how well this is working; otherwise we're no better off than we were with fragmented data, making decisions that we're not sure are right. The page name and canonical page name solutions both present a risk of invisible over-pruning, where data is being cleaned and consolidated too much or in the wrong places, but we've no way of spotting or knowing that.

In order to maintain visibility of the level and types of consolidation that we're making, we can utilise a page-level custom variable slot to record the URL of the page in question, and then create custom reports to compare that to the page name or canonical page name. Try giving the following code a whizz immediately prior to the trackPageview line:

_gaq.push(['_setCustomVar',1,'URL Requested',window.location.pathname,3]);
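
Put together with the canonical page name approach, the pair of calls might look like the sketch below (the page name is a placeholder); the important detail is that the page-level custom variable - scope 3 - is pushed before the pageview it should be attached to:

// Record the URL actually requested in page-level custom variable slot 1...
_gaq.push(['_setCustomVar', 1, 'URL Requested', window.location.pathname, 3]);
// ...then fire the pageview against the clean canonical page name.
_gaq.push(['_trackPageview', '/canonical/page-name']); // placeholder canonical page name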

Now you can use custom reports in Google Analytics to compare pages (as a primary dimension) against requested URLs (as a secondary dimension), and easily see where the canonical pagename is different to the requested page name.

Example of comparing canonical pagenames with requested URLs

Much of the time, this will show that the canonical page name is doing its job, and that it's reporting as a single page when requested with nasty parameters; however, it'll also allow you to identify scenarios where it's either recording the wrong page, or where there are opportunities to introduce some more advanced naming conventions (such as the pagination example covered earlier in this post).

Perfect!

Now we're in a position where your tracking solution maintains itself (at least, more so than it would otherwise) directly in line with your canonical tagging strategy, and records all of the scenarios where you're correctly (or incorrectly) consolidating data which would otherwise have been all over the place. Within a couple of days of implementing this, I've seen significant changes in my understanding of page consumption behaviour, which have influenced recommendations I've made on actions to improve performance.

If you take one thing away from the post, I'd like for it to be that Google Analytics does not work as an off-the-shelf solution if you want to actually rely upon and use the data to make decisions. If you purchase an Omniture license, you'd expect to spend weeks (if not months) aligning the installation and configuration with your website and tracking requirements; just as much care should be given to your Google Analytics installation to make sure that it's telling you the right information with the right degree of integrity. Just because it's free and easy doesn't mean it doesn't need a carefully crafted installation and maintenance strategy.
