Data Integrity

Are caches allowed to check data integrity?

"Data integrity is an end-to-end issue; caches should not be checking or computing MD5 (or other) integrity checks, or changing any aspect of the requests or responses that would be covered by such checks."

vs.

"I think it is perfectly reasonable for caches to do integrity checks if they want to, using MD5 or whatever else is at hand, and possibly even re-request a corrupted resource before returning anything to the end requestor. It does very little good if a cache holds a corrupted resource and dishes it out to clients. On the other hand, the first time a client gets this corrupted resource, it will (er, might) re-request it, probably (though not necessarily) through the same cache, thereby clearing the problem."

Comment:
In the working group issues list, one of the comments on data integrity confused me a bit. The second of the two positions asserts that it is perfectly reasonable for a cache to do integrity checks, to avoid the cache serving nonsense to the end user. So far, I understand and agree. It goes on to say, however, that if the cache serves nonsense to the end user, the end user may re-request the document, thus correcting the problem in the cache.

Reply:
These were my comments, so let me try to clarify. If a user agent receives bad data, i.e., a local integrity check fails, it can retry. If it retries, it should do so in a manner that forces the request to go all the way to the point of origin (by using one of the cache-control directives), in order to make sure that the data wasn't stored in a garbled form in a cache somewhere in between. If a proxy cache in between the origin and the end user "sees" the ungarbled version going by, it may store that in its cache. If it does, that will repair the garbledness. If it doesn't, it won't. There's nothing especially magical about any of this.
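
The client side of this can be sketched in Python, purely as an illustration. The sketch assumes the origin advertises a base64-encoded MD5 digest of the body (in the style of a Content-MD5 header); the function names are hypothetical and appear in no spec or draft.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Base64-encoded MD5 digest of the body (assumed Content-MD5 style)."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def body_is_intact(body: bytes, advertised_md5: str) -> bool:
    """Local integrity check: compare the computed digest with the advertised one."""
    return content_md5(body) == advertised_md5.strip()


def retry_headers(original_headers: dict) -> dict:
    """Headers for a retry that must go all the way to the origin server.

    Adds Cache-Control: no-cache so that no intermediate cache answers
    the retry from a possibly garbled stored copy.
    """
    headers = dict(original_headers)
    headers["Cache-Control"] = "no-cache"
    return headers
```

A user agent would call body_is_intact() on each response and, on failure, reissue the request with retry_headers(). Whether a cache in between then replaces its own copy is up to that cache.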

Comment:
I'm not clear from the text whether this is meant to apply to the condition where integrity checks exist between proxy and server or whether it is meant to apply to the condition where they do not.

Reply:
It's meant to apply to the condition where they do not, i.e., when a cache does not perform an integrity check, and ends up with bad data.

Comment:
If the former, then the integrity check would presumably be applied the first time the cache stored the item;

Reply:
Yes

Comment:
if it passed somehow or was corrupted later, re-requesting the item won't clear the problem unless the proxy goes beyond current practice in re-checking the resource.

Reply:
Yes. If the retry request contains cache-control: no-cache (or, in Jeff's version, cache-control: reload) then the proxy will have to forward the request.

Comment:
Any time-based check, for example, won't fail until the resource has expired or changed.

Reply:
Yes

Comment:
A really aware client might still get the correct resource despite the cache by reloading with a Pragma: request, but I can't see why the cache would take that as a signal to update its copy (rather than simply pipelining the request on to the origin server).

Reply:
Because it can. The protocol doesn't force it to, but it would seem to be reasonable practice.

Comment:
Do we wish to suggest additions to current practice for the case where integrity checks are applied between proxy and server? If so, should these include: a method for the client to explicitly request a cache update, and/or recommendations for additional checks by the proxy when integrity checks are available?

Reply:
I am not convinced anything extra really needs to be added. The only thing that seems open to question is what a cache (should/must/may) do with the response to a request with cache-control: no-cache (or reload). My assumption was that the most reasonable thing for the cache to do would be to store the most recently received version of any document it receives, especially since one of the reasons for a forced reload would be to fix garbled data. Maybe words to this effect belong in the spec somewhere.
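
A minimal sketch of that assumed behavior, in Python. The class name and the fetch callable are hypothetical, and expiry, validation and integrity checking are deliberately left out. It shows only a cache that forwards a no-cache request and then keeps whatever comes back.

```python
class SketchCache:
    """Toy proxy cache that stores the most recently received copy."""

    def __init__(self, fetch_from_origin):
        # fetch_from_origin is a hypothetical callable: url -> bytes
        self._fetch = fetch_from_origin
        self._store = {}

    def get(self, url, request_headers=None):
        headers = {k.lower(): v for k, v in (request_headers or {}).items()}
        directives = {
            d.strip().lower()
            for d in headers.get("cache-control", "").split(",")
        }
        forced = "no-cache" in directives
        if not forced and url in self._store:
            # Ordinary request: answer from the stored copy, good or garbled.
            return self._store[url]
        # Forced (or first) request: go to the origin and keep the result,
        # which also replaces any garbled copy held so far.
        body = self._fetch(url)
        self._store[url] = body
        return body
```

With a garbled entry in the store, an ordinary get() keeps returning the garbled bytes. A single get() carrying Cache-Control: no-cache fetches from the origin and repairs the stored copy for later requests.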

Comment:
Thanks for your explanation. If we believe that proxies should reload pages as they pass by (even if they have a copy they think is "good"), then I suspect we should use cache-control: reload, rather than cache-control: no-cache. That language makes it clear enough that the proxy will and should update its copy.

Reply:
I believe it was Jeff Mogul's intention to use that language to signify "reload from the origin server", not "reload into your cache". Isn't English wonderful?

Comment:
With cache-control: no-cache, an implementor might assume that the directive was to be used when the user agent did not want integrity checks to be applied or did not want the results of the request stored in that cache (some info might not be appropriate for a public cache, for example).

Reply:
This points to the likelihood of confusion when overloading the same token ("no-cache") with a meaning in both requests and responses. In Roy's proposed spec for HTTP/1.1, cache-control: no-cache in requests indicates that the request cannot be serviced by any intermediate server -- only the origin server. In responses, cache-control: no-cache indicates that the response cannot be stored in a cache. So the hypothetical implementor you describe would be wrong to make that assumption.
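
The two meanings can be kept apart in code by asking two separate questions. The following Python sketch uses hypothetical helper names and follows only the reading of no-cache described in the paragraph above, nothing beyond it.

```python
def cache_directives(headers: dict) -> set:
    """Lower-cased Cache-Control directives from a header dictionary."""
    value = ""
    for name, field in headers.items():
        if name.lower() == "cache-control":
            value = field
    return {d.strip().lower() for d in value.split(",") if d.strip()}


def may_answer_from_cache(request_headers: dict) -> bool:
    """Request side: no-cache means only the origin server may answer."""
    return "no-cache" not in cache_directives(request_headers)


def may_store_response(response_headers: dict) -> bool:
    """Response side: no-cache means the response must not be stored."""
    return "no-cache" not in cache_directives(response_headers)
```

A no-cache request therefore says nothing about whether the response may be stored. That is decided by the response's own headers.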

Comment:
Switching to cache-control: reload from clients and cache-control: no-cache from servers would solve this particular confusion.

Reply:
No, it wouldn't. For example, the following is a Reload:

    GET / HTTP/1.1
    Cache-control: no-cache

and the following is not:

    GET / HTTP/1.1
    Cache-control: no-cache
    If-Modified-Since: Thu, 15 Feb 1996 15:05:20 GMT

The no-cache directive does not mean "Reload" -- it never has (even when used in Pragma).
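
The distinction between the two requests can be sketched in Python. The parsing is deliberately naive and the function names are hypothetical. The sketch assumes, following the two examples, that a no-cache request counts as a Reload only when it carries no If-Modified-Since condition.

```python
def parse_request(text: str):
    """Split a raw request into its request line and a lower-cased header dict."""
    lines = text.strip().splitlines()
    request_line = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers


def is_reload(headers: dict) -> bool:
    """True for an unconditional no-cache request, False otherwise."""
    directives = [
        d.strip().lower()
        for d in headers.get("cache-control", "").split(",")
    ]
    has_no_cache = "no-cache" in directives
    is_conditional = "if-modified-since" in headers
    return has_no_cache and not is_conditional


RELOAD = """
GET / HTTP/1.1
Cache-control: no-cache
"""

NOT_RELOAD = """
GET / HTTP/1.1
Cache-control: no-cache
If-Modified-Since: Thu, 15 Feb 1996 15:05:20 GMT
"""
```

Both requests must reach the origin server. Only the first asks for the full entity regardless of what the requester already holds.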