May 7, 2012

The Penguin Update & How Google Identifies Spam

The sheer fact that you are reading YouMoz is a strong indicator that you already know full well about the recently launched Google Penguin Update. This is in fact the "over-optimization" penalty alluded to by Matt Cutts a few weeks ago. In Google's own words, "Sites affected by this change might not be easily recognizable as spamming without deep analysis or expertise, but the common thread is that these sites are doing much more than white hat SEO; we believe they are engaging in webspam tactics to manipulate search engine rankings."

This statement is very telling. It doesn't say the common thread is that these sites are doing black hat SEO. It says they are doing much more than white hat SEO. Sounds to me like you can take white hat SEO a little too far, even to the point of being labeled "webspam tactics" by Google.

So if the Penguin update was aimed at nuking spam or those using spammy tactics, what is spam? What are spammy tactics?

Google gave a few examples, such as nonsensical spun content and keyword stuffing. This stuff is obvious, and quite frankly, if the examples they gave were actually not marked as spam before, they should be embarrassed.

One thing is for sure with this update: there has been some serious backlash. Not only can you read the comments on Google's own blog post, but very respected people in the SEO industry have voiced their concerns. Apparently Google and white hat SEO proponents had two very different definitions of spam.

Coincidentally, the following morning when I got to the office and checked my Gmail, I had a larger than normal amount of spam in my spam folder. A light bulb went off. Google runs Gmail, and if we are to glean any clues as to how they identify and classify spam, why not investigate Gmail?

I found some pretty interesting results. I think that by learning how Google identifies spam in email, we can learn how they are identifying spam in websites.

First of all, if you open a spam message, Google shows a little notice telling you why the email was marked as spam. I didn't know this before, probably because I never open spam. It looks something like this:

[Image: Gmail spam notice]

I decided to learn more and clicked through the link to see this Google support page explain a bit more about how Google identifies spam.

The first reason something would be marked as spam is for phishing. This is no surprise as Google doesn't want users to get duped into giving up personal or financial information to scammers.

Website equivalent? Sites with malware or maybe non-trusted merchants. Google is not a fan of any website that tries to infect computers with a virus or other malware.

Their second reason deals with messages from an unconfirmed sender. This is basically when someone sends you something from what appears to be an official website address, but they aren't actually affiliated with that website.

Website equivalent? Perhaps hacked or hijacked websites, websites registered and hosted outside the country, sites not registered with Google Webmaster Tools, or other sites that have suspicious ownership.

The next reason something would be marked as spam is because you previously marked it as spam. Persistent messages from the same user, identical subjects, stuff like that.

Website equivalent? In Chrome, you can block sites from your search results. Also sites with little engagement and high bounce rates would probably qualify here.

The next one is a biggie. It deals with similarity to suspicious messages. Google says here that "Gmail uses automated spam detection systems to analyze patterns and predict what types of messages are fraudulent or potentially harmful." They then go on to list some examples, such as typical spam language (adult, get rich quick), messages from accounts or IP addresses that previously sent spam messages, and suspicious attachments, to name a few.

Website equivalent? Here's where things get tricky. How does Google determine what is, in their words, "usually associated with spam?" We know the obvious ones, but what if your legitimate website in a legitimate industry was suddenly viewed as spam? Take SEO as an example: how many folks outside this industry do you suppose believe we are not spammers? Should the perception of the masses dictate what is or is not spam?

Other Google resources shed more light on this topic. Google released a video about fighting spam some time ago. It eventually leads you to a place where you can learn more about Gmail's spam fighting methods. Here is where things get really telling.

The first method they point out in combating spam is called community clicks. Basically, as more and more users mark stuff as spam, they use that data to determine what messages are spam. Think back to the Chrome extension to block sites in your search results. Think of the +1 button. Think of user statistics like bounce rate and engagement statistics like depth of visit, length of visit, etc.
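To make the community clicks idea concrete, here is a minimal sketch in Python of how user reports could be rolled up into a per-sender spam rate. This is my own illustration, not Gmail's code: the function name spam_rates, the min_events threshold, and the sample addresses are all assumptions made up for the example.

```python
from collections import defaultdict

def spam_rates(events, min_events=3):
    """Return the share of user actions that flagged each sender as spam.

    events: iterable of (sender, flagged) pairs, where flagged is True when
    a user marked the message as spam and False when they did not.
    Senders with fewer than min_events actions are skipped, since a couple
    of clicks is too little evidence either way (hypothetical threshold).
    """
    totals = defaultdict(int)
    flags = defaultdict(int)
    for sender, flagged in events:
        totals[sender] += 1
        if flagged:
            flags[sender] += 1
    return {s: flags[s] / totals[s] for s in totals if totals[s] >= min_events}

# Made-up feedback from several users.
events = [
    ("deals@example.com", True),
    ("deals@example.com", True),
    ("deals@example.com", False),
    ("deals@example.com", True),
    ("friend@example.com", False),
    ("friend@example.com", False),
    ("friend@example.com", False),
    ("new@example.com", True),
]
rates = spam_rates(events)
print(rates)
```

Notice that new@example.com gets no rate at all because there is only one action on record. If search works in a similar way, a site Google has little usage data on would sit in the same limbo until enough feedback comes in.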

We have all been trying to figure out what all of the sites that got hit had in common. What's the one thing we cannot see on other sites? We can see their content. We can see their links. We cannot see usage statistics. Google can.

Google even says this about Gmail spam: "Our team of leading spam-fighting scientists uses a number of advanced Google technologies. Though in many cases our best weapon is you." Can it be more revealing than this?

Gmail openly admits that their best asset in dealing with spam is user feedback. Why would we suspect their search results are any different? Why bother tracking search and browsing history, offering free analytics software, and implementing the +1 button and Chrome extension? To what end? For user feedback. Whether any of us knew it or not, we have all been sending feedback to the big G for quite some time. And now it is being put to use.

Think of this...why in each of our major keyword SERPs are there so many new sites? Why so many poor sites? Why so many that have never been there before? Because Google has no data on them. They have never been in the SERPs, so now that they are, Google will quickly realize, through this fancy new Penguin update, whether the result is good or spam. Count on search results to be much more volatile moving forward.

After talking about community clicks, the Gmail spam fighting page then goes on about quick adaptation. They laud their ability to quickly roll out new spam data as they receive it so that within minutes of new spam being created, they can identify it. What does this say about what we do? Think about the recent hit on link networks. Google can quickly discover and identify spam, and as of the Penguin update, they can roll it out globally in a hurry.

And in case you didn't think I was on to something here, the next spam fighting method says it all. I decided to post it all here, verbatim:

"Many Google teams provide pieces of the spam-protection puzzle, from distributed computing to language detection. For example, we use optical character recognition developed by the Google Book Search team to protect Gmail users from image spam. And machine-learning algorithms developed to merge and rank large sets of Google search results allow us to combine hundreds of factors to classify spam."

Okay, so work done by the Google Book Search team has helped Gmail to identify image spam. Work done by the search team has helped Gmail classify spam. Obviously, these different branches of Google work together. So whatever methods they are using at Gmail, you had better believe they are being shared with the search team.
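The phrase "combine hundreds of factors to classify spam" in the quote above can be pictured with a toy example. The sketch below squeezes three signals into a single probability using a weighted sum and a logistic function. Google has not said which model it uses, so treat this as one common way to combine signals, not their method. The signal names, the weights, and the bias are all invented for illustration.

```python
import math

# Hypothetical signals and weights, invented for illustration only.
WEIGHTS = {
    "user_spam_rate": 4.0,             # share of users who flagged the sender
    "spammy_phrase_share": 3.0,        # share of words that look like spam language
    "sender_previously_flagged": 2.0,  # 1.0 if the sender has a spam history
}
BIAS = -3.0  # hypothetical offset so a message with no bad signals scores low

def spam_probability(signals):
    """Combine weighted signals into a value between 0 and 1."""
    z = BIAS + sum(weight * signals.get(name, 0.0)
                   for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

clean = spam_probability({"user_spam_rate": 0.0, "spammy_phrase_share": 0.05})
shady = spam_probability({
    "user_spam_rate": 0.8,
    "spammy_phrase_share": 0.5,
    "sender_previously_flagged": 1.0,
})
print(round(clean, 3), round(shady, 3))
```

The point of the sketch is that no single factor decides the outcome. A real system would learn its weights from labeled data across far more signals, which is exactly where user feedback would come in.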

Think about that for a minute. Gmail is saying that they can filter email messages by their content and look for language typically associated with spam. I've noticed that anything related to insurance, pharmaceuticals, and loans ends up in my spam folder. What does that tell you? Gmail has learned that language associated with those products usually signals spam.
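For anyone curious how a filter can learn that certain language points to spam, here is a minimal sketch of a naive Bayes word filter, a classic technique from the spam filtering literature. I am not claiming Gmail uses it. The function names and the four training messages are made up for the example.

```python
import math
from collections import Counter

def train(messages):
    """Count words per class. messages: list of (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, is_spam in messages:
        docs[is_spam] += 1
        counts[is_spam].update(text.lower().split())
    return counts, docs

def spam_log_odds(text, counts, docs):
    """Positive result: the words lean spam. Negative: they lean legitimate."""
    vocab = set(counts[True]) | set(counts[False])
    spam_total = sum(counts[True].values())
    ham_total = sum(counts[False].values())
    score = math.log(docs[True] / docs[False])  # prior odds from class sizes
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        # Add-one smoothing keeps a zero count from producing log(0).
        p_spam = (counts[True][word] + 1) / (spam_total + len(vocab))
        p_ham = (counts[False][word] + 1) / (ham_total + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

# Tiny made-up training set.
training = [
    ("cheap loans approved today", True),
    ("discount pharmacy pills cheap", True),
    ("lunch meeting moved to noon", False),
    ("project notes from today", False),
]
counts, docs = train(training)
print(spam_log_odds("cheap loans", counts, docs))
print(spam_log_odds("lunch meeting notes", counts, docs))
```

With a training set this small the scores mean little, but the mechanics are the same at scale: words that show up mostly in flagged messages push the score toward spam, whether or not the message is legitimate.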

So what about websites? Wouldn't that knowledge of identifying and classifying spam be shared with the search and webspam teams? Don't you think Matt Cutts has access to Gmail's spam detection data? I bet he does, and I bet some of it is being seen in this Penguin update.

Gmail is a pretty well documented product. You can read up quite a bit on spam filters and how they work. I would recommend this to everyone as we can then get a better idea of what Google sees as spam content. For me the biggest takeaway is that Gmail openly admits to using user data and feedback in classifying and identifying spam. This should be a huge indicator to us all that user data is playing a role in how Google classifies and identifies spam on the web. The trick now is to figure out exactly what user data/feedback is being used.

This YouMoz entry was submitted by one of our community members. The author’s views are entirely their own (excluding an unlikely case of hypnosis) and may not reflect the views of Moz.

Daniel Deceuster
