New Reality: Google Follows Links in JavaScript.
This YouMoz entry was submitted by one of our community members. The author’s views are entirely their own (excluding an unlikely case of hypnosis) and may not reflect the views of Moz.
I must have missed something. I always thought Google doesn't see links inside JavaScript code. As Rand writes in the Beginner's Guide, JavaScript passes no ranking or spidering value and pages behind JavaScript navigation may never be found by search engines if they are not reachable via direct hyperlinks. Is this information obsolete?
Here is my story
I have a new (one-month-old) site. It's an online website-security service, and it makes extensive use of AJAX. This morning, in Analytics, I found that Google had sent me a visitor via a "types of hidden spam" query. I went back to Google and was glad to see that my site ranked #1 among the 14,300,000 results for that search.
However, the strange thing was that the search result linked to this "page": unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl=
I use this URL (or rather, part of it) in JavaScript to dynamically build customized links to display in reports. No pages on my site or on any other site link to unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl=
On the other hand, my site has another, static page, about finding compromised WordPress blogs, that does have direct inbound links and similar text about common types of hidden spam links.
So why did Google prefer that incomplete dynamic URL, buried inside JavaScript with no incoming links, over a similar page with a static URL and direct incoming links?
I entered the site:unmaskparasites.com query to check whether the static page was indexed. It was. The site is very small and all of its pages are indexed. Moreover, in the results I found pages that were never supposed to be indexed: URLs used only for AJAX requests (see the unmaskparasites.com/results/ and unmaskparasites.com/token/ results in the screenshot).
WTF! Where did Google get them from?!
Having analyzed my web application code and Google's cached results, I'm pretty sure Google parses my JavaScript, executes it and follows the links it finds there.
Here is the proof.
Links in AJAX requests
http://unmaskparasites.com/results/ and http://unmaskparasites.com/token/ are service URLs used exclusively in AJAX (JavaScript) requests. There are no other links to these "pages." Here are the JavaScript snippets with these URLs:
$.get('/token/', function(txt){ ...
and
$.post("/results/", { ...
As you can see, finding links in this code is not a trivial task. The URLs are relative and don't contain "http://" at all. A crawler would need to understand the code well enough to distinguish such links from other, non-link string literals. Once it has parsed them, Google adds the domain name to construct absolute URLs.
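To make the pattern concrete, here is a minimal sketch of the kind of code involved. It is not my actual code: the callback bodies and the data being posted are hypothetical, but the relative string-literal URLs are exactly how the real snippets look.

// a hypothetical, simplified version of the AJAX calls
$.get('/token/', function (txt) {
    // e.g. remember the token for the next request
    window.securityToken = txt;
});

$.post('/results/', { token: window.securityToken }, function (html) {
    // e.g. insert the returned report into the page
    $('#results').html(html);
});

Nothing here looks like a hyperlink: a crawler has to recognize that the first string argument of $.get() and $.post() is a URL.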
Links in JavaScript strings
The http://unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl= URL can be found only as part of the following string inside my JavaScript:
...'<a href="/security-tools/find-hidden-links/site/?siteUrl=' + escape($("#id_siteUrl").val() )+ '">' ...
When a crawler visits the page with this script, the value of the id_siteUrl field is blank, so if you execute the JavaScript you get the string '<a href="/security-tools/find-hidden-links/site/?siteUrl=">'. That is exactly the URL Google indexed (again, with the domain name added).
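Spelled out step by step (a hypothetical simplification of the real code):

// what the concatenation evaluates to for a crawler
var siteUrl = escape($("#id_siteUrl").val());   // "" -- the field is blank
var link = '<a href="/security-tools/find-hidden-links/site/?siteUrl=' + siteUrl + '">';
// link is now '<a href="/security-tools/find-hidden-links/site/?siteUrl=">'

To end up with the dangling siteUrl= URL, Google had to evaluate at least the string concatenation and treat the empty field value as an empty string.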
The Google crawler's JavaScript is not the same as your web browser's
It looks like Google's crawler executes only the parts of your JavaScript that have to do with links and skips the rest of the code.
In my case, the cached page clearly shows that Google fetched http://unmaskparasites.com/results/ with a GET request and empty parameters. If it had really executed all of the code, it 1) wouldn't have been able to pass validation and load the page, and 2) would have used a POST request (the code calls $.post).
So I assume Google's crawler is not equipped with a full-featured JavaScript interpreter. It just parses JavaScript, finds links, and maybe executes a reduced set of operations, such as string concatenation.
jQuery
My other guess is that Google knows how to interpret JavaScript that is based on well-known libraries. I use jQuery and load it directly from Google's servers:
http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js
This is the only external JavaScript file my pages load. So Google can be pretty sure that the $.post(...) and $.get(...) functions send AJAX requests and that the $('#results').html(...) call adds HTML code to the div with the id "results".
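In other words, my pages look roughly like this (a simplified sketch, not my actual markup):

<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js"></script>
<script>
// to anyone who knows jQuery, these calls are unambiguous:
// fetch two service URLs and put the returned HTML into the "results" div
$.get('/token/', function (txt) { /* ... */ });
$.post('/results/', { /* ... */ }, function (html) { $('#results').html(html); });
</script>

Since Google serves the library itself, it doesn't even have to fetch and analyze an unknown script to know what these functions do.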
Google Toolbar
I have Google Toolbar installed, and it could send information about the URLs I visit back to Google. That way Google could have learned about those JavaScript links. But several facts make me think the toolbar is not to blame:
- My AJAX URLs never appear in the address bar, so there is no reason to request PageRank info for them.
- The toolbar reports URLs as they are actually visited. So the indexed page would have had a URL like http://unmaskparasites.com/security-tools/find-hidden-links/site/ or http://unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl=example.com, not the bare ?siteUrl= version.
- Other visited "secret" maintenance URLs that are not mentioned in JavaScript are not indexed.
- Have you ever seen a web page with no incoming links, on a one-month-old domain, rank #1 for a query with 14 million other results?
Some information from Google
I've just found some indirect confirmation of my point on the official Google Webmaster Central Blog:
From "A spider's view of Web 2.0":
"One of the main issues with Ajax sites is that while Googlebot is great at following and understanding the structure of HTML links, it can have a difficult time finding its way around sites which use JavaScript for navigation. While we are working to better understand JavaScript, your best bet for creating a site that's crawlable by Google and other search engines is to provide HTML links to your content."
So they say it is "difficult," but not "impossible," and that they are "working to better understand JavaScript." Now, nine months later, they seem to be able to understand at least some JavaScript.
"Googlebot does not execute some types of JavaScript."
Read the other way around, this confirms my point: Googlebot does execute JavaScript, but its support for it is limited.
"Regarding ActionScript, we’re able to find new links loaded through ActionScript."
If they can find links in ActionScript, why not find links in JavaScript too?
New era?
Flash, JavaScript... Is this the beginning of a new era of more sophisticated search engine spiders that can "see" web pages the way human surfers see them? Check your JavaScript. Maybe you are exposing too much to Google. I have just added a few more "Disallow" rules to my robots.txt.
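For the curious, the new rules are along these lines (a sketch covering the service URLs mentioned above, not my exact robots.txt):

User-agent: *
Disallow: /results/
Disallow: /token/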
Do you think I'm paranoid?
P.S. Google has also indexed http://unmaskparasites.com/security-report/, which appears only as the action attribute of my HTML forms. Are form action URLs followed too?
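Something like this hypothetical markup (the field names are made up) is apparently enough for the URL to get picked up:

<form action="/security-report/" method="post">
    <input type="text" name="siteUrl">
    <input type="submit" value="Check">
</form>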
P.P.S. Hopefully, despite my terrible English, you were able to find some interesting information in the article.