Cache Wars: Responses to Popular Redditor Comments

Language Issues: Please read this first, or you will sound like an idiot when you comment.

First, no one, in any court case (please prove me wrong), has sued on the basis of having their content “cached”. The issue is caching+redistribution. Most importantly, it is caching+redistribution on another site, often a for-profit site with advertisements, without consent from the author.

In terms of search engines, we will use the word “indexing” to refer to the simple act of “caching” a page. When I refer to “caching”, I mean the redistributed cache of the page, or the process of copying and redistributing the page.

Finally, let’s quickly acknowledge the similarity between “quotes” and “paraphrases” in the print world and “snippets”, “quotes” and “paraphrases” in the internet world. They are almost linguistically identical, and for our purposes will be treated as such.

The Basics: Please read this second, or you will still sound like an idiot when you comment.

One of the biggest cop-outs of individuals writing about this issue on the web is the following… “the copyright laws are too old and not suited for the internet”. People who say this are either (1) Lazy, (2) Stupid, or (3) Employees of Copyright-Infringing Companies.

Let’s do a simple breakdown using the language we employed earlier…

Activity                        Print Legality    Internet Legality
Indexing                        Legal             Legal
Snippets, Quotes, Paraphrases   Legal             Legal
Copying and Redistributing      Illegal           Legal???

It is pretty easy to see what is out of sync here. While only the medium has changed, the once sacred copyright protections have now dissolved, with little or no rational explanation.

However, the web is like neither public speech nor print. It is, in its current form, much more like television broadcasting. While the information is publicly available, you are limited in the ways you can use it:

  • You can watch it
  • You can quote / paraphrase / show snippets (think ESPN SportsCenter)
  • You can DVR it for personal use
  • You cannot rebroadcast it
  • You cannot make copies of the tape and sell them

This brings us to the issue of the day: caching and redistributing. Is it a fair, legal practice, or does it violate the copyright holder’s rights? In my opinion, it is a gross, obvious violation. Let’s take a look at some of the recent comments from Reddit on this issue…

=====

“She should have used robots.txt to define her rules. Ethical search engines generally abide by it.”
Browzer, 55 Points

This is one of the most common responses. First off, the robots.txt standard only addresses “indexing”, not “caching”; the cached copy is controlled separately, with the meta robots tag. However, the existence of methods to prevent copyright infringement does not settle the debate. You will be hard pressed to find a single crime for which the victim could not have employed some defensive mechanism. Unless we change the underlying law, requiring copyright holders to take defensive measures to assert their rights is akin to requiring individuals to post signs on their car, home, and person to assert those rights as well.
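For reference, here is roughly what the two mechanisms look like (the path and values below are illustrative placeholders, not a recommendation). The robots.txt rules govern whether a bot may crawl a page at all; the meta robots “noarchive” directive is the separate piece that asks a compliant engine not to redistribute its stored copy.

  # robots.txt (governs crawling and indexing only)
  User-agent: *
  Disallow: /private/

  <!-- In the page's <head>: asks compliant engines not to serve a cached copy -->
  <meta name="robots" content="noarchive">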

“There are a couple of interesting issues arising from this case, but the real impact of it is that if she succeeds on the merits, the search business is instantly – and for all practical purposes permanently – closed to new entrants.”
Dawg, 14 points

Another common response, and one that indicates the general lack of knowledge among web users. There is, once again, an undeniable difference between what we know to be “indexing” and “caching” in the search industry. If Google were forced by law to remove the “view the cache” link for sites that have not explicitly asked to be cached, there would be little to no impact on the internet, the search community, or future entrants.

“There are no merits to this case. Only lawyers with an agenda. That’s the reason this asinine thing got this far… F***ing with internet infrastructure is huge no no.”
smacfarl, 17 Points

Wow. First off, who are the “lawyers with an agenda”? Is there some secret conspiracy among the bar to topple the internet? Or is it more reasonable to believe that someone has a fairly legitimate copyright complaint and just chose the wrong way to go about enforcing it?

“If you don’t want your site spidered, then don’t put it on the web. It’s that simple. What an idiot.”
UncleOxidant, 16 points

Once again, it is not about indexing and quoting. It is about making exact copies and redistributing them. Remember the annoying photocopying rules in college? Imagine if, instead of just photocopying for yourself at the library, you made thousands of copies and made them available to anyone, at any time, in any quantity. Yeah, that’s what’s going on.

Perhaps the most important thing to consider at this point is what an internet without these copyright protections would look like. Do you believe that the vast sums of information placed on websites in a for-profit manner (i.e., advertising-funded) would continue to exist if copyrights could not be enforced?


Here is the only competent response that I found on Reddit, in my honest opinion, in defense of caching.

“imagine not being able to read books/magazines/newspapers which have been written more than ten years ago”
Luce7, 2 points

It is a legitimate need; however, the shortcomings of our current system overwhelm it:

  1. When you read a book, at some point in the chain the creator got paid for it. This is not the case with internet caching services. A bot does not click on ads, nor is it convinced to become a client.
  2. There is no method of compensating the content creator for internet caching services.
  3. There is no regulation of internet caching services. Anyone with minor PHP knowledge can write a spam scraper site and call it a caching service. Imagine someone copying books over and over again and building libraries with them, funded by advertising placed in the copied books, then saying they are just caching them. Wake up, folks: this is what is going on!

So, what are the simple solutions?

  1. Bots should assume no-cache (not no-index, just no-cache); see the sketch after this list.
  2. Standards should be developed by which caching sites can be regulated, such as:
    • All cached pages should display appropriate attribution, such as the URL, the date of the cache, and relevant whois info.
    • All cached pages should themselves be blocked by robots.txt, to prevent gross duplication across the web and to protect potential revenue for copyright holders.
    • All cached pages should clearly state that the caching site is not the owner or creator of the content, that the information may no longer be valid, and should encourage users to visit the live site.
    • Caching sites should publicly report visitor statistics for each page, so that content creators can know how much traffic they are potentially losing / earning via the caching service.
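To make point 1 concrete, here is a minimal sketch, in Python, of how a crawler could implement an “assume no-cache” default. The opt-in “archive” value it looks for is hypothetical; today’s engines work the other way around, honoring “noarchive” opt-outs, which is exactly the complaint.

  # Minimal sketch of the proposed "assume no-cache" default.
  # The opt-in "archive" directive is hypothetical; real engines only
  # honor the "noarchive" opt-out today.
  import urllib.request
  from html.parser import HTMLParser

  class RobotsMetaParser(HTMLParser):
      """Collects the directives from any <meta name="robots"> tags."""
      def __init__(self):
          super().__init__()
          self.directives = []

      def handle_starttag(self, tag, attrs):
          attrs = dict(attrs)
          if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
              content = attrs.get("content") or ""
              self.directives += [d.strip().lower() for d in content.split(",")]

  def may_cache(url):
      """Index the page, but only keep a redistributable copy if it opts in."""
      html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
      parser = RobotsMetaParser()
      parser.feed(html)
      # Opt-in default: cache only when the page explicitly says "archive",
      # and never when it also says "noarchive".
      return "archive" in parser.directives and "noarchive" not in parser.directives

Under a default like this, a page with no meta robots tag at all would still be indexed and quoted, but never redistributed from a cache.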

I hope this helps clear some things up a bit. Any other suggestions? Any other ways we can help clear this issue up? Thanks for taking the time to read…


2 Comments

  1. Crease
    Mar 19, 2007

    First of all, I assume you’re talking about the crazy child protective services lady, though I was unaware that had gotten to Reddit. I imagine you didn’t include that information because it easily polarizes the audience and clouds the underlying issue, which you’ve already addressed (and hopefully by referring to it vaguely here I won’t cloud things either).

    You’ve cast the issue as “does someone have the right to not have their information cached?”. For the most part, I would think yes, especially in commercial and trade secret situations. How that is enforced is still up in the air, though I’m guessing that a robots.txt like solution will evolve (it being far easier than the opposite of assuming that nobody wants to be cached unless they ask).

    A related question I have is “can the internet be as effective if certain parties control what stays in it (historically)?”. This isn’t “if they disallow caching the whole internet is DOOMED”. I’m thinking specifically of the church of scientology and other institutions that have a lot of information on the internet that they don’t want people reading, that people perhaps should be reading. If we come up with ways to limit that access and reach, we could end up in a situation with too many limitations instead of the too few we have now.

    Your comparison to the print model is somewhat telling, especially the idea of scrapers and other repurposers making money from other people’s content. But there are laws currently in place that theoretically do protect this information. Do they do enough? Who knows? Very few of them have been challenged, and those that have been rarely make it to court (e.g. RIAA). If you’re running a for-profit site and a scraper in Russia is giving away your content for free, there’s very little you can do, now or at any determinable point in the future, without some connections in the oligarchy of the mafiya.

    Your solutions are interesting, and for the most part I like them, but I don’t think they’ll prevent either of the scenarios I mentioned above. I’m not sure if this problem can be solved with technology.

  2. Steve Riley
    Mar 19, 2007

    Unfortunately it’s not quite so cut and dried.

    If I fire up 1000 browsers on 1000 machines and load a page, and then never reload the page (thus making my copy permanent), have I done anything wrong? I could have an unlimited number of people use my array of computers to read the text forever. What if my browser’s output device is a printer and not a monitor?

    If I grab free copies of a leaflet and continue grabbing free copies, can I redistribute them? If Google loaded the same page 1000 times, could they “give” that page out again to 1000 readers through a cache?

    If a library buys a copy of a book or DVD it can lend that book to any number of readers. If I GIVE a library a book then it can certainly do the same. A person can peruse all of the books in the library if they wish. If millions of websites make their content available to the “unsuspecting” public free of charge – which takes actual action and intent – can each single “copy” be served up in a “time delayed digital cache”? Can I cache a copy in my computer’s memory? Can my service provider cache the content? My service provider may cache the document and create thousands of copies.

    Under what conditions can I redistribute a web page that is served to me for free? If the author makes it publicly available and you happen to go to the page, have you trespassed? Does an entered URL equal a doorway entered, making you an automatic violator of a yet-unread license or copyright notice? The very act of going to the page copies it.

    Even uglier is the concept of a license vs. a copy that is owned. Obviously I don’t have the copyright of a book I own, but I do actually own that copy. Soon we may start to license books in the real world as well, in order to restrict our ability to redistribute them. We may license our films and movies instead of owning the individual copy.

    Without a reasonable uniform set of expectations who is going to have the infinite time to read every license agreement on every web site?

    The thought of slipping a page accidentally into a xerox machine and agreeing to something I was unaware of is unsettling. The same is true of public websites visited and cached for any purpose.

    If nothing else:

    Caching is implicit in the mechanism of delivery. Not only is it implicit in the mechanism of delivery, it is implicitly understood that the document will be cached in machines outside of the original copyright holder’s control. The copyright holder must acknowledge the fact that you are making copies in the very act of distributing the page. Only in a situation where the copyright holder distributed the hardware for playback could this be different.

    Licenses change all of this – a license can specify anything. What a nightmare. We need to be reasonable and put reasonable constraints in place to specify how these things should be used. There is currently no right answer.
