Characterizing Web Resources and Server Responses to Better Understand the Potential of Caching

Craig E. Wills and Mikhail Mikhailov
Computer Science Department
Worcester Polytechnic Institute
Worcester, MA 01609
{cew,mikhail}@cs.wpi.edu

Introduction

There have been many studies to better understand the characteristics of the Web [4,5,10]. Results from these studies have been used in designing improved caching policies and mechanisms [3,7,8,9]. However, no work has specifically examined how the meta information reported by servers correlates with changes in resources and how that information affects caching by Web browsers and proxy caches. To address this gap we have undertaken a study to monitor and better understand the characteristics of resource changes at servers and how these servers report meta data about the resources.

The approach is to study a set of URLs at a variety of sites and gather statistics about the rate and nature of changes correlated with the resource type. In addition, we gather response header information reported by the servers with each retrieved resource. Previous studies used proxy and server logs or network traces of user requests/responses, which constrained the resulting studies to the available data. In contrast, our approach is to retrieve each resource in the test set at intervals and for a duration needed to ``characterize'' the nature of the resource.

Our resulting study has two distinguishing aspects in characterizing the Web: it focuses on issues relevant to Web caching, and it uses a methodology that allows us to study changes to resources in a controlled manner. The following gives more details on our study and its current status.

Focus

The general goal of our work is to better understand how resources change at a collection of servers and how the meta information reported by servers reflects those changes. The overriding aim is to obtain data that can be used to better understand the potential benefit of caching and whether existing software is reaching this potential. Our work has several specific directions for investigation.

One direction of our work is to monitor a selected set of resources to study, in a controlled environment, the frequency at which these resources change. A similar study was done using a packet trace [4], but with our approach we can control when requests are made and test whether resources change on each request, periodically, or aperiodically. We also examine the predictability and locality of changes to a resource.
The availability of last modification time (lmodtime) information for a retrieved resource is important for efficiently validating a cached resource using a GET If-Modified-Since (IMS) request in HTTP. Previous studies have found that the percentage of server responses that contain the lmodtime for a resource varies from 50% to 80% [4,6,8,10].
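As an illustration of the IMS validation described above, the following is a minimal sketch of how a cache might build a conditional GET. The function names and the example URL are hypothetical and not part of our test environment; the sketch only constructs the request and does not contact a server.

```python
from urllib.request import Request


def build_ims_request(url, cached_lmodtime):
    """Build a GET that validates a cached copy with If-Modified-Since.

    cached_lmodtime is the Last-Modified value stored with the cached copy,
    or None when the server never supplied one. Without an lmodtime the
    request is an ordinary, unconditional GET.
    """
    req = Request(url, method="GET")
    if cached_lmodtime is not None:
        req.add_header("If-Modified-Since", cached_lmodtime)
    return req


def cached_copy_still_valid(status_code):
    """A 304 (Not Modified) response means the cached copy can be reused."""
    return status_code == 304


if __name__ == "__main__":
    req = build_ims_request("http://example.com/index.html",
                            "Tue, 15 Nov 1994 12:45:26 GMT")
    print(req.get_method(), req.header_items())
```

The sketch makes the dependence on lmodtime explicit: when a response carries no lmodtime, the cache has nothing to send in the IMS header and must retrieve the full resource again.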
Given these availability figures, another direction for our work is to examine the availability and accuracy of cache validation information reported by servers for requested resources. Our approach is to monitor the response headers returned along with a resource to discover lmodtime, size, and entity tag (etag) information. Etags are an HTTP/1.1 mechanism for servers to provide an ``opaque'' cache validator. Unfortunately, in previous work we found that lmodtime information was not always an accurate predictor of a changed resource; for example, a changed size did not always mean that the lmodtime had changed. There are also reported cases where the lmodtime changes, but the resource does not [4]. To check the accuracy of the validator information, we calculate the MD5 checksum of the resource contents as a means of determining whether a resource has actually changed. By correlating changes in MD5 checksums for a resource with changes in the other validators over successive retrievals, we can obtain a better understanding of the reliability of each validator.
One possible conclusion is that lmodtime is an unreliable validator and that servers and proxies need a more reliable mechanism, such as etags or MD5 checksums, for determining when documents change. Another possibility for study is whether the availability and accuracy of validator information depends on other factors such as content type or the server software being used. We plan to make additional correlations to answer these questions.
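The correlation between MD5 checksums and the other validators can be sketched as follows. This is an illustrative sketch only, not our analysis tool: the Observation record, the function names, and the outcome labels are assumptions made for the example, with the MD5 checksum treated as ground truth for whether the contents changed.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What is recorded for one retrieval of a resource (hypothetical layout)."""
    lmodtime: Optional[str]   # Last-Modified header, if the server sent one
    etag: Optional[str]       # ETag header, if the server sent one
    size: int                 # length of the body in bytes
    md5: str                  # MD5 checksum of the body


def observe(headers: dict, body: bytes) -> Observation:
    return Observation(
        lmodtime=headers.get("Last-Modified"),
        etag=headers.get("ETag"),
        size=len(body),
        md5=hashlib.md5(body).hexdigest(),
    )


def classify(prev: Observation, curr: Observation, validator: str) -> str:
    """Compare one validator against the MD5 checksum for two successive retrievals."""
    before, after = getattr(prev, validator), getattr(curr, validator)
    if before is None or after is None:
        return "unavailable"
    content_changed = prev.md5 != curr.md5
    validator_changed = before != after
    if content_changed == validator_changed:
        return "agrees"
    # Content changed but the validator did not, or the reverse.
    return "missed_change" if content_changed else "false_change"


if __name__ == "__main__":
    hdr = {"Last-Modified": "Mon, 01 Feb 1999 10:00:00 GMT"}
    first, second = observe(hdr, b"hello"), observe(hdr, b"hello!")
    for name in ("lmodtime", "etag", "size"):
        print(name, classify(first, second, name))
```

Counting the outcomes per validator, and then per content type or server software, gives the kind of additional correlation mentioned above.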
The third direction for our work is to better understand how servers respond to different types of requests for the same resource. One type of variation is whether servers are supplying cookies that clients are then including as part of subsequent requests. A recent study found that 30% of the requests made in a client trace included cookies, concluding that responses to these requests are uncachable [1]. This result raises a number of questions for study. Is there a similar proportion of replies and subsequent requests that contain cookies for our resource test set? Does the inclusion of a cookie in a request always result in a different resource response than obtained with a request containing no cookie? Do two separate requests with two separate cookies always result in different resource responses? We believe answers to these questions will provide us with a more complete picture of the impact of cookies on caching. If resource content does not always change in response to different request cookies (or absence thereof) then such resources could be cached and be used as a base for validation on subsequent requests.
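One way to approach the cookie questions above is to group the responses for a single URL by the checksum of their contents. The sketch below assumes the responses have already been retrieved; the function names and the mapping from cookie value to response body are hypothetical, with None standing for a request sent without a cookie.

```python
import hashlib
from collections import defaultdict


def group_by_content(responses):
    """Group cookie values whose responses had byte-identical contents.

    responses maps a cookie value (None for "no cookie") to the body bytes
    returned for the same URL. Returns a list of sets of cookie values.
    """
    groups = defaultdict(set)
    for cookie, body in responses.items():
        groups[hashlib.md5(body).hexdigest()].add(cookie)
    return list(groups.values())


def cookie_affects_content(responses):
    """True if at least two requests for the URL received different contents."""
    return len(group_by_content(responses)) > 1


if __name__ == "__main__":
    same = {None: b"<html>page</html>", "id=1": b"<html>page</html>"}
    print(cookie_affects_content(same))
```

A URL for which every cookie falls into a single group is a candidate for caching despite the presence of cookies in the request.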
Another area for study is the frequency of use and amount of change between successive requests for dynamically computed resource contents. A better understanding of Web resources is needed to determine the effectiveness of techniques such as delta-encoding [12] and active caches [2] in allowing resources that change frequently, but predictably, to be cached.
Previous work has found that images change less frequently than HTML and text resources [4]. Another direction of work is to examine how images and other embedded resources change relative to the HTML resources in which they are contained. Prior work indicates that images do not change at the same rate, but how does the use of embedded images change as these container resources change? Does the set of embedded images change, or is the set of images relatively constant with other aspects of the container page changing? Do cascading style sheets, proposed to replace some uses of embedded images [13], exhibit similar change characteristics?
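The question of whether the set of embedded images changes between retrievals of a container page can be sketched as a set comparison. This is a minimal sketch under the assumption that only img tags are of interest; the class and function names are hypothetical, and a fuller tool would also resolve relative URLs and handle other embedded resource types.

```python
from html.parser import HTMLParser


class EmbeddedImages(HTMLParser):
    """Collect the src attribute of every img tag in a page."""

    def __init__(self):
        super().__init__()
        self.srcs = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.add(src)


def image_set(html):
    parser = EmbeddedImages()
    parser.feed(html)
    parser.close()
    return parser.srcs


def compare_image_sets(old_html, new_html):
    """Report which embedded images were added, removed, or kept."""
    old, new = image_set(old_html), image_set(new_html)
    return {"added": new - old, "removed": old - new, "kept": old & new}


if __name__ == "__main__":
    old = '<body><img src="logo.gif"><img src="ad1.gif"><p>Monday</p></body>'
    new = '<body><img src="logo.gif"><img src="ad2.gif"><p>Tuesday</p></body>'
    print(compare_image_sets(old, new))
```

Combined with checksums of the images themselves, this separates pages whose image set is stable while the text changes from pages whose embedded images rotate.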

Methodology

There are two primary issues in our approach for studying the questions we identify: how to determine the test set of resources to monitor and how to actually do the monitoring. In determining the test set, we considered different approaches. One possibility is to gather a set of URLs from a relatively current proxy log trace. These URLs should be from a number of different servers and be of a number of different content types.

An alternate approach, and the one we are currently pursuing, is to identify frequently used sites, which are not necessarily the same as ``popular'' sites, and focus our study on resources at those sites. While such a test set may not be ``representative'' of a proxy trace, it provides us with a set of resources that are likely to have the most impact on long-term Web usage. We have explored several sources of such information, including Media Metrix [11] and 100hot.com [14]. We are currently using the set of Web sites identified by 100hot.com as a basis for our study.

The methodology of the study is to do a GET for each of the URLs in the test set on a daily basis. The time between successive retrievals for a URL may be lengthened or shortened as needed until we can ``characterize'' the resource, that is, determine whether it changes on each access, periodically, or arbitrarily. For each retrieved URL, we store the response headers and calculate an MD5 checksum on the contents. We also retrieve all embedded resources and compute their MD5 checksums. A selected subset of links from the retrieved HTML page is marked for subsequent retrieval. We also vary whether a cookie is sent as part of the request for a URL and, if so, which one.
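The characterization step can be sketched from the sequence of MD5 checksums recorded for a URL. The sketch below is illustrative only: the function names, the category labels, and in particular the halve-or-double rule for adjusting the retrieval interval are assumptions made for the example, since the methodology above only states that the interval may be lengthened or shortened as needed.

```python
def change_pattern(checksums):
    """Classify a resource from the MD5 checksums of successive retrievals."""
    if len(checksums) < 3:
        return "unknown"
    # changed[i] is True when retrieval i+1 differs from retrieval i.
    changed = [a != b for a, b in zip(checksums, checksums[1:])]
    if not any(changed):
        return "static"
    if all(changed):
        return "every access"
    # Spacing, in retrievals, between successive observed changes.
    idx = [i for i, c in enumerate(changed) if c]
    gaps = {b - a for a, b in zip(idx, idx[1:])}
    return "periodic" if len(gaps) == 1 else "aperiodic"


def next_interval(hours, changed, shortest=1, longest=24 * 7):
    """Assumed rule: halve the interval after a change, double it otherwise."""
    if changed:
        return max(shortest, hours // 2)
    return min(longest, hours * 2)


if __name__ == "__main__":
    daily = ["a", "a", "b", "b", "c", "c", "d"]
    print(change_pattern(daily), next_interval(24, changed=True))
```

A resource classified as changing on every access at the shortest interval is a candidate for the dynamically computed category, while a periodic resource suggests that its changes are predictable.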

Using this methodology over a period of time along with tools to automatically analyze the results allows us to answer the questions for our study. The advantage of this approach is that we can study the rate of change and other characteristics of a set of resources from a variety of servers without being constrained by the data from a set of logs or packet traces. On a longer term basis, our study provides a basis to monitor evolution of the Web by focusing on characteristics of URLs at frequently used sites.

Current Status

We are currently building our test environment and gathering preliminary data. The next step is to run the tests for a longer period and use analysis tools to extract information that helps us answer the questions of our study. Over time, we expect to vary the test set of URLs to obtain a more comprehensive study of URLs from different sources.

Summary

In summary, we believe our study is a contribution to the efforts of Web characterization, particularly as it applies to the potential benefits of Web caching. We believe the focus of our study and its results are a natural fit with the Web Characterization Workshop and the activities of the Web Characterization Activity working group, in which we are interested in being involved.

Bibliography

1: Ramon Caceres, Fred Douglis, Anja Feldmann, Gideon Glass, and Michael Rabinovich.
Web proxy caching: the devil is in the details.
In Workshop on Internet Server Performance, Madison, Wisconsin USA, June 1998.
http://www.cs.wisc.edu/~cao/WISP98/final-versions/anja.ps.
2: P. Cao, J. Zhang, and K. Beach.
Active cache: Caching dynamic contents (objects) on the web.
In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing (Middleware '98), The Lake District, England, September 1998. ACM.
http://www.cs.wisc.edu/~cao/papers/active-cache.html.
3: Pei Cao and Sandy Irani.
Cost-aware WWW proxy caching algorithms.
In Symposium on Internet Technology and Systems. USENIX Association, December 1997.
http://www.usenix.org/publications/library/proceedings/usits97/cao.html.
4: Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, and Jeffrey Mogul.
Rate of change and other metrics: a live study of the world wide web.
In Symposium on Internet Technology and Systems. USENIX Association, December 1997.
http://www.usenix.org/publications/library/proceedings/usits97/douglis_rate.html.
5: Brad Duska, David Marwood, and Michael J. Feeley.
The measured access characteristics of World Wide Web client proxy caches.
In USENIX Symposium on Internet Technology and Systems, Monterey, California, USA, December 1997. USENIX Association.
http://www.usenix.org/publications/library/proceedings/usits97/duska.html.
6: Steven D. Gribble and Eric A. Brewer.
System design issues for internet middleware services: Deductions from a large client trace.
In USENIX Symposium on Internet Technology and Systems, Monterey, California, USA, December 1997. USENIX Association.
http://www.usenix.org/publications/library/proceedings/usits97/gribble.html.
7: James Gwertzman and Margo Seltzer.
World-wide web cache consistency.
In Proceedings of the USENIX Technical Conference, pages 141-152. USENIX Association, January 1996.
http://www.usenix.org/publications/library/proceedings/sd96/seltzer.html.
8: Balachander Krishnamurthy and Craig E. Wills.
Study of piggyback cache validation for proxy caches in the world wide web.
In Symposium on Internet Technology and Systems. USENIX Association, December 1997.
http://www.usenix.org/publications/library/proceedings/usits97/krishnamurthy.html.
9: Balachander Krishnamurthy and Craig E. Wills.
Piggyback server invalidation for proxy cache coherency.
In Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.
http://www7.conf.au/programme/fullpapers/1844/com1844.htm.
10: Thomas M. Kroeger, Darrel D.E. Long, and Jeffrey C. Mogul.
Exploring the bounds of web latency reduction from caching and prefetching.
In Symposium on Internet Technology and Systems. USENIX Association, December 1997.
http://www.usenix.org/publications/library/proceedings/usits97/kroeger.html.
11: Media metrix.
http://www.mediametrix.com.
12: Jeffrey C. Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy.
Potential benefits of delta-encoding and data compression for HTTP.
In ACM SIGCOMM'97 Conference, September 1997.
http://www.acm.org/sigcomm/sigcomm97/papers/p156.html.
13: Henrik Frystyk Nielsen, Jim Gettys, Anselm Baird-Smith, Eric Prud'hommeaux, Håkon Lie, and Chris Lilley.
Network performance effects of HTTP/1.1, CSS1, and PNG.
In Proceedings of the ACM SIGCOMM '97 Conference. ACM, September 1997.
http://www.acm.org/sigcomm/sigcomm97/papers/p102.html.
14: 100hot.com.
http://www.100hot.com.