The Strongest Cloaking Yet – Cross Domain Canonical Tag

For years, blackhat search engine optimizers have used the most advanced forms of bot detection, IP delivery, and JavaScript and Flash obfuscation to accomplish cloaking. These techniques, when used successfully, allowed the webmaster to pull the wool over the eyes of bots, showing them clean content while feeding sales-heavy (or worse) content to end users.

Google has fought valiantly to stop these techniques and, by and large, has shut down all but the most sophisticated of them. However, it has fallen on its own sword with the introduction of the new cross-domain canonical tag.

The Canonical Tag

The rel=canonical tag was a godsend for most webmasters. It allowed us to defeat duplicate content issues by placing a single line of code in the head of the HTML page, invisible to visitors, telling the visiting bot the intended URL for that piece of content. No matter how crafty, strange or contrived the URL used to access the page was, ultimately the bot would know what it should be. Users needn’t be jostled by redirects, and webmasters needn’t rely on more complex server-side technologies to prevent duplicate content. All was well in the kingdom.

One important aspect of the canonical tag was that it only affected same-domain content. I could place the tag on one page and declare that its canonical URL was another page, but only one on my own domain. This meant that even if I found a way to sneak my canonical tag onto another webmaster’s site, I could not expect to steal their PageRank.
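To make the mechanics concrete, here is a minimal Python sketch of how a crawler might read the tag and check whether it stays on the same host. It uses only the standard library. The class and function names, and the `example.com` URLs, are illustrative assumptions, not anything Google has published, and comparing hostnames is a simplification of whatever "same domain" means to a real search engine.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.href = attr["href"]


def canonical_url(page_url, html):
    """Return the absolute canonical URL declared in html, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    # A relative href is resolved against the URL the page was fetched from.
    return urljoin(page_url, finder.href)


def is_same_host(page_url, canonical):
    """True when the canonical points at the same host as the page (assumed rule)."""
    return urlsplit(page_url).hostname == urlsplit(canonical).hostname


# Hypothetical page reached through a contrived URL.
html = '<html><head><link rel="canonical" href="/widgets"></head><body></body></html>'
page = "http://www.example.com/widgets?sessionid=123&sort=asc"
target = canonical_url(page, html)
print(target)                      # http://www.example.com/widgets
print(is_same_host(page, target))  # True
```

Under the original same-domain rule, a crawler following this logic would simply ignore any canonical for which `is_same_host` returns False.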

That same-domain restriction, however, has changed.

Cross-Domain Canonical Tag

The new cross-domain canonical tag allows webmasters to place a single tag that tells GoogleBot that the real page exists on another domain. The user does not see this tag unless they take the time to view the source. A cleverer webmaster could even cloak the page so that the only thing that changes when GoogleBot visits is the presence of that one tag. How, then, could a webmaster use this to cloak? Simple.

Imagine creating two separate websites: one that discusses a kosher, if not academic, subject related to your target industry, and another that is just plain salesy as hell. For example, you could create a Poker and Gambling Addiction Awareness site that targets colleges and universities with programs and resources for fighting gambling addiction on campus. You could also create a poker referral site. Then, using the rel=canonical tag, you could convince GoogleBot that the Addiction Awareness site, which has no trouble getting great links from great sources, is really meant to live at the canonical poker affiliate site. A talented webmaster could even make the content of the two pages nearly identical, and just use different images and robots.txt-blocked JavaScript to change the look and feel of the second site.

All the PageRank, TrustRank, and potential rankings would be passed on to the target site.
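The robots.txt part of this scheme is easy to check with Python's standard `urllib.robotparser`. The sketch below assumes a hypothetical `/assets/` folder holding the scripts, styles and images on a made-up `poker.example` host. It only shows what a rule-abiding crawler is permitted to fetch, not how GoogleBot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the salesy site: the markup is crawlable,
# but everything that controls how it looks is off limits.
ROBOTS_TXT = """\
User-agent: *
Disallow: /assets/
"""


def crawler_can_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(crawler_can_fetch(ROBOTS_TXT, "Googlebot", "http://poker.example/index.html"))        # True
print(crawler_can_fetch(ROBOTS_TXT, "Googlebot", "http://poker.example/assets/layout.js"))  # False
```

A bot that honors these rules sees only the near-identical markup, while the visitor's browser loads the blocked files and renders something else entirely.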

A Solution in Search of a Problem

I don’t understand the cross-domain canonical tag. Nearly all webmasters with multiple-domain issues have server-side access, allowing them to easily prevent duplicate content with mod_rewrite, ISAPI_Rewrite, etc. Sites dealing with scrapers and content syndication will see no benefit from the cross-domain policy, as it is highly unlikely that scraper sites will be willing to drop the canonical tag in place and potentially lose any rankings they might have. Ultimately, Google has given blackhatters a strong tool for cloaking that will require vigilance on Google’s part to detect and prevent.
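For comparison, the server-side alternative boils down to one decision: if the request arrived on a secondary domain, answer with a 301 to the same path on the primary one. The Python sketch below illustrates that decision only. It is not mod_rewrite or ISAPI_Rewrite syntax, and `CANONICAL_HOST` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # hypothetical primary domain


def redirect_location(request_url, canonical_host=CANONICAL_HOST):
    """Return the URL to 301-redirect to, or None if the host is already canonical."""
    parts = urlsplit(request_url)
    if parts.hostname == canonical_host:
        return None
    # Keep scheme, path and query string; swap only the host.
    return urlunsplit((parts.scheme, canonical_host, parts.path, parts.query, ""))


print(redirect_location("http://example.net/widgets?id=7"))  # http://www.example.com/widgets?id=7
print(redirect_location("http://www.example.com/widgets"))   # None
```

Unlike the canonical tag, a redirect of this kind is visible to users as well as bots, so it cannot quietly show the two audiences different destinations.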

Good Luck.

9 Comments

  1. Dave Dugdale
    Dec 16, 2009

    Hmmm, very interesting. It will be interesting to see how this tag gets abused.

  2. Daniel Mcskelly
    Dec 16, 2009

    rel=canonical doesn’t work that way (i.e. concatenating wildly different content) on the same domain…why would it work that way cross-domain?

    Author Response: Hi Daniel, perhaps I did not explain clearly enough in the article how this would be accomplished. It is actually quite easy to use identical or nearly identical content to show vastly different pages to the user. Imagine, for example, that you choose not to add image heights or widths in your HTML. Imagine also that you store your JavaScript, CSS, and images in a folder that is blocked by the robots.txt file. You can then have two pages with identical mark-up that present drastically different views to the user. The images, layout, and actual readable text can be displayed in strikingly different fashions without changing the mark-up of the primary page.

    In the past, though, blackhatters had to take the risky step of cloaking to decide whether to 301 redirect the user from the clean, kosher domain to the dirty, salesy domain. Now that cloaking is not necessary, as the transfer happens behind the scenes and with Google’s endorsement.

  3. Jack Adams
    Dec 17, 2009

    Google processes the canonical tag as a suggestion, not a directive: there’s still an algorithm at work to determine the principal version of a page, of which the canonical tag is just a factor (if a very strong one).

    I’d imagine that with the cross-domain canonical tag, Google recognises the potential of your hypothetical situation and would tone down the weight that the tag had in determining the principal version of a piece of content.

    If one URL continued to attain links, trust and credibility, I’d imagine the cross-domain canonical wouldn’t help in sending this credibility elsewhere unless the new location showed evidence of attracting links of similar value.

    Author Response: Even if Google’s implementation became sophisticated enough to require that the canonical URL receive some similar weight, it would still lower the threshold of required trusted links for a blackhatter. He or she need only attract a handful of quality links to the target site. My guess is that Google has made a calculated decision that decreasing the overhead of billions of duplicate pages across the web is more important than the potential for search quality losses in a few already-spammed verticals.

  4. A
    Dec 19, 2009

    A thought on the author response on Daniel’s comment… A poker referral site would have a hard time making any money if it had the same textual content as a poker gambling addiction awareness site, wouldn’t it?

    Just as a thought, I wouldn’t be surprised if Google starts (or is already) sometimes crawling pages with a spoofed UA string to see if what it gets is the same as what the normal Googlebot UA string gets.

    Author Note: You assume that what the user sees and reads is that content. If the JavaScript and CSS sit in a robots.txt-blocked directory, the two can be used to hide the text and reveal only the images. Because the image size hasn’t been set, the poker referral site could use large images with text in the images themselves to convert the user and get them to click through to another site. In fact, the webmaster could use the JavaScript to inject any amount of content into the page.

    You are right about spoofing UA. There is little to no question that Google has for a long time used non-bots to compare the content rendered to a user against that rendered to GoogleBot.

  5. Utah
    Apr 21, 2010

    Well, I ran the test... it did not work cross-domain. I have very branded, keyword-specific rankings (40 in the top ten) which I tried to canonical to a new URL. The original site was removed from the index and lost all rankings; the new site is in the index with no rankings. So the link metrics were not passed to the new domain (complete duplicate content).

    Author Response: Very interesting. So it seems that Google is not honoring the cross-domain canonical tag the way they claimed they would, even when the canonical tag is legitimate. Have you posted this in Matt Cutts’s blog comments to get a response?

  6. Johar
    May 3, 2010

    Just learned a lot from the many articles on your blog about SEO, very interesting.

  7. Zions Bank
    Jul 20, 2010

    Update… It does seem to work cross-domain, although a brand new domain takes considerably longer to establish credibility than an aged domain name.

  8. Mark Carter
    Jun 23, 2011

    Hi there, thanks for this thought provoking article. I must admit that I hadn’t thought of these concerns, and can see why you’d highlight the issue. However, I’m not sure I’d agree that most webmasters have access to appropriate servers. I have many clients who are not in that position where they can access this for all the hosts of their content, so this is an important new tool in their armoury. This often comes up as well in the arena of affiliates, where enforcing best practice on them can be a nightmare.

  9. Dominic
    Sep 28, 2011

    Well, this is essentially a cloaking issue and the use of the cross-domain canonical tag here is secondary. As soon as the site can show, without being detected, a nice page to the bot and a spam page to the visitors, the spammer has succeeded. If he can do that, he does not need the canonical tag. I discuss a similar issue here http://civm.ca?store=queen .
