
Exclude-by-Keyword: Thoughts on Spam and Robots.txt

Note: This solution is for spam that cannot be filtered. There are already wonderful tools to help with comment / forum / wikispam such as LinkSleeve and Akismet. However, this proposed method would prevent the more nefarious methods such as HTML Injection, XSS, and Parasitic Hosting techniques.

Truth be told, I rarely use the Robots.txt file. Its functionality can be largely replicated on a page-by-page basis via the robots META tag and, frankly, we spend a lot more time on getting pages into the SERPs than excluding them.

However, after running / creating several large communities with tons of user-generated content, I realized that the Robots.txt file could offer a lot more powerful tools for exclusion. Essentially, exclude-by-keyword.

The truth is, there is no reason for the phrase “cheap cialis” to appear on my informational site about pets. If the keyword occurs anywhere on my site, it is because I was spammed.

So why not create a simple Robots.txt exclusion that is based on keywords?

User-Agent: *
Disallow-by-key: cialis
Disallow-by-key: viagra
Disallow-by-key: xxx
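
To make the idea concrete, here is a minimal sketch of how a crawler might honor such a directive. Disallow-by-key is only a proposal and no search engine supports it; the function names parse_keywords and is_excluded, and the choice to match whole words case-insensitively, are assumptions made for illustration.

```python
import re

# The proposed (hypothetical) directive from the example above.
ROBOTS_TXT = """\
User-Agent: *
Disallow-by-key: cialis
Disallow-by-key: viagra
Disallow-by-key: xxx
"""

def parse_keywords(robots_txt, user_agent="*"):
    """Collect Disallow-by-key values from groups that apply to user_agent."""
    keywords = []
    group_agents = []   # agents named by the current group
    in_rules = False    # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-Agent line after rules starts a new group
                group_agents = []
                in_rules = False
            group_agents.append(value.lower())
        else:
            in_rules = True
            if field == "disallow-by-key" and value:
                if "*" in group_agents or user_agent.lower() in group_agents:
                    keywords.append(value.lower())
    return keywords

def is_excluded(page_text, keywords):
    """True if any keyword appears as a whole word or phrase (assumed rule)."""
    for kw in keywords:
        if re.search(r"\b" + re.escape(kw) + r"\b", page_text, re.IGNORECASE):
            return True
    return False

keywords = parse_keywords(ROBOTS_TXT)
spammed = is_excluded("Buy cheap Cialis now", keywords)
clean = is_excluded("Ask a specialist about your cat", keywords)
```

Whole-word matching is a deliberate choice in this sketch: a plain substring test would exclude a page that merely mentions a "specialist", since that word contains "cialis".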

I understand that there are shortcomings. Maybe one day there will be a reason to include the phrase “hardcore threesome” on my site, but I am willing to risk losing that one page’s potential rankings in return for the peace of mind of not being spammed like crazy and putting my site’s reputation at risk.

Just thinking out loud.


12 Comments

  1. Pufone
    Aug 7, 2007

    Hello Russ,

    Good idea. Here’s the flaw:
    I want your pages removed, therefore I will look in your robots.txt file and get to work (spamming).

    Google wants things stupid simple. When should it revisit your excluded page?

    Let’s say you have the “latest comments” widget on. I sign up with the username viagra therefore some or all your pages will be excluded.

    etc, etc…

    Author Response: I had thought of this as well. There will never be a perfect substitute for webmaster vigilance.

  2. Michael Martinez
    Aug 7, 2007

    Great idea! You can count on me to throw it in Google’s face on an ongoing basis for the foreseeable future! This is how they should work with Webmasters to fight Webspam.

  3. mario
    Aug 7, 2007

    Nice idea. Helping Google however won’t ever solve the spam problem. They could have killed off spambots long ago by just blocking searches for “mortgages” and “cialis”. But it’s just too precious to them, so they leave the resulting mess to us.

    Spammers are just victims of Pagerank®, too.

  4. Lea de Groot
    Aug 7, 2007

    Nice concept, but I fear it would be a very long list, if it matches the ‘block comment by keyword’ list I have for wordpress comments :(

  5. Tom
    Aug 7, 2007

    I like this as a concept but I don’t think that the robots.txt file is the place for it. I think webmaster central is a much better place. Matt Cutts even included an option for something very similar in his recent poll of additions to webmaster central.

    Author Response: I wish that Google would spend more time working on standards solutions rather than cramming more stuff into Webmaster Central. Additionally, there is a benefit to keeping it in Robots.txt: it lets spammers know that spamming your site is useless.

  6. me
    Aug 7, 2007

    Not only would it be a long list, if it isn’t dynamic using bayesian or other machine learning techniques, it won’t work. Would you know to block these?

    Disallow-by-key: cial1s
    Disallow-by-key: cIalis
    Disallow-by-key: cial!s
    .etc..

    Author Response: Search engine spammers do not regularly target those keywords because no one searches for them. Unlike email spam, strange misspellings (like 1337 speak) are not worth targeting.

  7. Sebastian
    Aug 8, 2007

    Spammers don’t bother reading your robots.txt, their bots just test whether your comment script is attackable or not and if so they hit you. Usually there’s no human activity. Also, this attempt would be a great hook for negative SEO.

  8. cpons
    Aug 8, 2007

    I think it’s a great idea
    We could use Webmaster central but … what about search engines without webmaster console?

  9. Dave Dugdale
    Aug 10, 2007

    So I am guessing this page will be disallowed since it is packed with those keywords. :)

  10. sagbee
    Nov 19, 2007

    This doesn’t work. Googlebot can’t understand these directives. When I checked my robots.txt in Google Webmaster Tools, it showed me the error “Syntax not understood”… :|

  11. russ jones
    Nov 20, 2007

    It has not been accepted yet.

  12. DKB
    Dec 17, 2007

    I have a few sites that people can post their services for free. Naturally I get the spammers trying to dump their crap into my site. In many cases, but not always, these people use the same IP. I don’t think that a robots.txt file can block by ip, but that would be great if it could somehow do it. I do see that there would be flaws with this idea too.

Trackbacks/Pingbacks

  1. SEO Theory - SEO Theory and Analysis Blog » Support Russ Jones’ robots.txt proposal - [...] Jones suggested on the Google Cache that search engines honor a disallow-by-keyword directive in robots.txt. This is a great …
  2. Linky Goodness, August 7 - [...] Jones proposes a new robots.txt command—exclude by keyword. I like the idea (a lot), but I just don’t think …
  3. GroundFloorSEO.com | Top SEO Blogs & Bloggers » NYT To Lower The Gates, Google, Wikipedia & Robots.txt - [...] Jones proposes a new exclude-by-keyword directive for the standard robots.txt that would tell the search engines to ignore the …
  4. GroundFloorSEO.com | Top SEO Blogs & Bloggers » Linky Goodness, August 7 - [...] Jones proposes a new robots.txt command—exclude by keyword. I like the idea (a lot), but I just don’t think …
