The Functions and Features Tested
- Simple Variables: Can Google understand simple variable assignment, such as “var foo = 'test content'; document.write(foo);”
- Simple Variable Concatenation: Can Google interpret “var foo = 'test content'; foo += ' more '; document.write(foo);”
- Simple document.write()
- Simple element.innerHTML assignment
- Dummy Variables: We added this test to make sure Google indexes only data that is printed to the page, and not every string stored in a variable.
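For illustration, the five cases above might look roughly like the sketch below. The original test pages are not reproduced in this post, so the strings, the "target" element id, and the makeFakeDocument stand-in (used only so the snippet also runs outside a browser) are all assumptions.

```javascript
// Hypothetical stand-in for the browser's `document`, so the sketch runs in Node.
// On a real test page you would use the browser's own `document` instead.
function makeFakeDocument() {
  const page = { written: [], elements: {} };
  return {
    page,
    write(text) { page.written.push(String(text)); },
    getElementById(id) {
      if (!page.elements[id]) page.elements[id] = { innerHTML: '' };
      return page.elements[id];
    },
  };
}

function runTests(doc) {
  // 1. Simple variable: assign, then print.
  var foo = 'test content';
  doc.write(foo);

  // 2. Simple concatenation: `+=` on an existing variable (no second `var`).
  var bar = 'test content';
  bar += ' more ';
  doc.write(bar);

  // 3. Simple document.write() with a literal string.
  doc.write('written directly');

  // 4. innerHTML is a property, so it is assigned rather than called.
  doc.getElementById('target').innerHTML = 'set via innerHTML';

  // 5. Dummy variable: stored but never printed, so it should never be indexed.
  var dummy = 'never printed';
  return dummy;
}

const doc = makeFakeDocument();
runTests(doc);
```

If the dummy string ever shows up in search results, the crawler is reading raw script source rather than the output the script produces.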
The Methods Tested
Hold Your Breath
In short, no.
First, let me state that it is likely that Google will at some point (if they don’t already) use blocked .JS and .CSS as a negative signal. While there are legitimate reasons to block these files, there is no easy way for Google to verify that the contents of a page using these tactics are not greatly modified by the blocked files. So, be careful.
That being said, Matt was kind enough to respond in great detail to my findings, and pointed out several things to consider when blocking .JS files; overlooking them is what ultimately produced the false positives in my analysis:
- Give Your Robots.txt a Head Start: This makes a lot of sense, but most webmasters (myself included) handle the new content and robots.txt at the same time.
“In an ideal world, you’d wait 12 hours just to be completely safe. Essentially, any time you make a new directory and block it at the same time, there’s a race condition where it’s possible we would fetch the test.js before we saw it was blocked in robots.txt. That’s what happened here.” – Matt Cutts
It is certainly untenable for Googlebot to check the Robots.txt with every new file downloaded on your site, so giving that head start can make a big difference.
- User-Agent Directives Can Override One Another: This one was new to me, but it does make sense. If you begin with a generic “User-Agent: *” section and follow it with a specific “User-Agent: Googlebot” section, the latter replaces the former as far as Googlebot is concerned; it does not append to it.
“If you disallow user-agent: * and then have a disallow user-agent: Googlebot, the more specific Googlebot section overrides the more general section–it doesn’t supplement it.” – Matt Cutts
- Robots.txt Is Only Respected Up to 500,000 Characters: I know this is a pretty big number, but if you have a lot of unique URLs to block, it can get messy. This is particularly frustrating with the Google Webmaster Tools Robots.txt checker, which only analyzes the first 100,000 characters.
- To Be Certain, Use the X-Robots-Tag: There is a great writeup here on how to use the HTTP header X-Robots-Tag to tell Google that a file of any type should not be indexed. Because this header is sent along with the file, Googlebot can respect it in real time.
- .JS Files can Be Slow to Clear from Index: As is the case with any lower-priority crawled document, .JS files can take a while to clear Google’s index if for some reason Google finds the blocked .JS.
“The crawl team said that once a .js file has been fetched, it can be cached in our indexing process for a while.” – Matt Cutts
That is certainly no exaggeration. The .JS file indexed two weeks ago is still showing on pages that were indexed before Googlebot registered the exclusion. I believe, though, that you can always use the emergency removal tool if this happens.
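Two of the points above lend themselves to a quick sketch: the way a specific User-Agent section replaces the generic one, and sending X-Robots-Tag with script files. This is a simplified illustration, not Google's actual parser. The function names, the example paths, and the exact-name matching of crawler tokens are assumptions made for the example; "X-Robots-Tag: noindex" is the only part taken from the header itself.

```javascript
// Split robots.txt text into groups of the form { agents: [...], rules: [...] }.
function parseGroups(robotsTxt) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue; // blank or malformed line
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share a single group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (current && (field === 'allow' || field === 'disallow')) {
      current.rules.push({ type: field, path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// A group that names the crawler wins outright; the "*" group is only a
// fallback and is never merged in. (Simplification: exact name match.)
function rulesFor(robotsTxt, agent) {
  const groups = parseGroups(robotsTxt);
  const name = agent.toLowerCase();
  const specific = groups.filter((g) => g.agents.includes(name));
  const chosen = specific.length
    ? specific
    : groups.filter((g) => g.agents.includes('*'));
  return chosen.flatMap((g) => g.rules);
}

// Hypothetical helper: extra response headers to send with script files so
// they stay out of the index even if they do get fetched.
function robotsHeadersFor(urlPath) {
  return /\.js$/i.test(urlPath) ? { 'X-Robots-Tag': 'noindex' } : {};
}

// Hypothetical robots.txt: the generic section blocks /blocked-js/, but the
// Googlebot section does not repeat that rule.
const example = [
  'User-agent: *',
  'Disallow: /blocked-js/',
  '',
  'User-agent: Googlebot',
  'Disallow: /private/',
].join('\n');

const googlebotRules = rulesFor(example, 'Googlebot');
```

In this example googlebotRules contains only the /private/ rule, so /blocked-js/ is open to Googlebot unless the rule is repeated in its own section. The header helper does not depend on robots.txt timing at all, which is why it sidesteps the race condition described earlier.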
Re-Running the Test
Of course, after hearing back from Matt, I needed to re-run the blocked .JS test to confirm. Sure enough, now that the .JS file was behind a previously established blocked directory, Googlebot respected the disallow. (Also, just to be careful, I tested it on a separate domain with which Matt was not familiar, so I can assure you there was no trickery involved.)
- On Experimenting: Confirm, retest, ask, retest, confirm, confirm, write, confirm, revise, confirm, publish.
- On SEO: Learn new shit every day.