Characterizing Web Resources and Server Responses to Better Understand the Potential of Caching

Craig E. Wills and Mikhail Mikhailov
Computer Science Department
Worcester Polytechnic Institute
Worcester, MA 01609
{cew,mikhail}@cs.wpi.edu

Introduction

There have been many studies to better understand the characteristics of the Web [4,5,10]. Results from these studies have been used in designing improved caching policies and mechanisms [3,7,8,9]. However, no work has specifically examined how the meta information reported by servers correlates with changes in resources and how that information affects caching by Web browsers and proxy caches. To address this gap, we have undertaken a study to monitor and better understand how resources change at servers and how these servers report meta information about the resources.

The approach is to study a set of URLs at a variety of sites and gather statistics about the rate and nature of changes correlated with the resource type. In addition, we gather response header information reported by the servers with each retrieved resource. Previous studies used proxy and server logs or network traces of user requests/responses, which constrained the resulting studies to the available data. In contrast, our approach is to retrieve each resource in the test set at intervals and for a duration needed to ``characterize'' the nature of the resource.

Our study has two distinguishing aspects in characterizing the Web: it focuses on issues relevant to Web caching, and it uses a methodology that allows us to study changes to resources in a controlled manner. The following sections give more details on the study and its current status.

Focus

The general goal of our work is to better understand how resources change at a collection of servers and how the meta information reported by servers reflects those changes. The overriding goal is to obtain data that can be used to better understand the potential benefit of caching and whether existing software is reaching this potential. Our work has many specific directions for investigation.
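One such direction can be illustrated with a minimal sketch, which is not the tooling used in the study. It assumes that the meta information of interest is the HTTP Last-Modified response header and that content changes are detected by comparing checksums of two successive retrievals. The function name and the category labels are hypothetical.

```python
from typing import Optional


def compare_observations(
    prev_checksum: str,
    cur_checksum: str,
    prev_last_modified: Optional[str],
    cur_last_modified: Optional[str],
) -> str:
    """Report whether Last-Modified tracked the content between two retrievals.

    The labels are illustrative assumptions, not categories from the study.
    """
    if prev_last_modified is None or cur_last_modified is None:
        # The server supplied no Last-Modified value to compare against.
        return "no Last-Modified"
    content_changed = prev_checksum != cur_checksum
    header_changed = prev_last_modified != cur_last_modified
    if content_changed == header_changed:
        # The header moved exactly when the content did (or neither moved).
        return "agree"
    # Content changed silently, or the header moved without a content change.
    return "unreported change" if content_changed else "spurious change"
```

Under these assumptions, an "unreported change" is a case in which a cache relying on the header could serve stale content, while a "spurious change" is a case in which a cache would refetch content that is in fact unchanged.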

Methodology

There are two primary issues in our approach for studying the questions we identify: how to determine the test set of resources to monitor and how to actually do the monitoring. In determining the test set, we considered different approaches. One possibility is to gather a set of URLs from a relatively current proxy log trace. These URLs should be from a number of different servers and be of a number of different content types.

An alternate approach, and the one we are currently pursuing, is to identify frequently used sites, which are not necessarily the same as ``popular'' sites, and focus our study on resources at those sites. While such a test set may not be ``representative'' of a proxy trace, it provides us with a set of resources that are likely to have the most impact on long-term Web usage. We have explored different sources of such information, including Media Metrix [11] and 100hot.com [14]. We are currently using the set of Web sites identified by 100hot.com as the basis for our study.

The methodology of the study is to do a GET for each of the URLs in the test set on a daily basis. The time between successive retrievals for a URL may be lengthened or shortened as needed until we can ``characterize'' the resource: whether it changes on each access, periodically, or arbitrarily. For each retrieved URL, we store the response headers and calculate an MD5 checksum on the contents. We also retrieve all embedded resources and compute their MD5 checksums. A selected subset of links from the retrieved HTML page is marked for subsequent retrieval. We also vary whether a cookie is sent with the request for a URL and, if so, which one.
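The retrieval step described above can be sketched as follows. This is a minimal illustration, not the test environment used in the study. The names http_get, Snapshot, take_snapshot, and next_interval are hypothetical, and the halve-or-double interval policy and its bounds are assumptions, since the text says only that the interval may be lengthened or shortened.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple
from urllib.request import Request, urlopen

# A fetcher returns (response headers, body). It is injectable so that the
# logic below can be exercised without network access.
Fetcher = Callable[[str, Optional[str]], Tuple[Dict[str, str], bytes]]


def http_get(url: str, cookie: Optional[str] = None) -> Tuple[Dict[str, str], bytes]:
    """Issue a GET, optionally with a Cookie header, and return headers and body."""
    request = Request(url)
    if cookie is not None:
        request.add_header("Cookie", cookie)
    with urlopen(request) as response:
        # Note: dict() keeps only one value for a header name that repeats.
        return dict(response.headers.items()), response.read()


@dataclass
class Snapshot:
    url: str
    headers: Dict[str, str]
    md5: str


def take_snapshot(url: str, fetch: Fetcher = http_get,
                  cookie: Optional[str] = None) -> Snapshot:
    """Retrieve one resource; keep its response headers and an MD5 of its contents."""
    headers, body = fetch(url, cookie)
    return Snapshot(url, headers, hashlib.md5(body).hexdigest())


def next_interval(hours: float, changed: bool,
                  low: float = 1.0, high: float = 168.0) -> float:
    """Assumed policy: halve the interval after a change, double it otherwise."""
    proposed = hours / 2 if changed else hours * 2
    return max(low, min(high, proposed))
```

Starting from the daily interval, next_interval(24, changed) would be called after each retrieval, with changed set by comparing the new snapshot's md5 field against the previous one for the same URL. Embedded resources and selected links could be handled by calling take_snapshot on each of their URLs in turn.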

Using this methodology over a period of time, along with tools to automatically analyze the results, allows us to answer the questions for our study. The advantage of this approach is that we can study the rate of change and other characteristics of a set of resources from a variety of servers without being constrained by the data from a set of logs or packet traces. In the longer term, our study provides a basis for monitoring the evolution of the Web by focusing on characteristics of URLs at frequently used sites.
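As an illustration of the kind of automatic analysis this permits, the sketch below computes a rate of change from the sequence of checksums recorded for one URL. It is a simplified assumption about how such a tool might work, not a description of the study's tools. The function names and labels are hypothetical, and the sketch does not attempt to separate periodic from arbitrary changes, which would also require the retrieval times.

```python
from typing import List


def change_rate(checksums: List[str]) -> float:
    """Fraction of consecutive retrievals whose checksum differs from the previous one."""
    if len(checksums) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(checksums, checksums[1:]) if prev != cur)
    return changes / (len(checksums) - 1)


def characterize(checksums: List[str]) -> str:
    """Coarse label for one resource, based only on its checksum history."""
    if len(checksums) < 2:
        return "unknown"  # a single retrieval says nothing about change
    rate = change_rate(checksums)
    if rate == 0.0:
        return "static"
    if rate == 1.0:
        return "every access"
    return "intermittent"
```

For example, a resource whose four daily checksums were a, a, b, b changed once across three successive pairs, so it would be labeled "intermittent".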

Current Status

We are currently building our test environment and gathering preliminary data. The next step is to run the tests for a longer period and use analysis tools to extract information that helps us answer the questions of our study. Over time, we expect to vary the test set of URLs to obtain a more comprehensive study of URLs from different sources.

Summary

In summary, we believe our study is a contribution to the efforts of Web characterization, particularly as it applies to the potential benefits of Web caching. We believe the focus of our study and its results tie in naturally with the Web Characterization Workshop and with the Web Characterization Activity working group, in which we are interested in participating.

Bibliography

1
Ramon Caceres, Fred Douglis, Anja Feldmann, Gideon Glass, and Michael Rabinovich.
Web proxy caching: the devil is in the details.
In Workshop on Internet Server Performance, Madison, Wisconsin USA, June 1998.
http://www.cs.wisc.edu/~cao/WISP98/final-versions/anja.ps.

2
P. Cao, J. Zhang, and K. Beach.
Active cache: Caching dynamic contents (objects) on the web.
In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing (Middleware '98), The Lake District, England, September 1998. ACM.
http://www.cs.wisc.edu/~cao/papers/active-cache.html.

3
Pei Cao and Sandy Irani.
Cost-aware WWW proxy caching algorithms.
In Symposium on Internet Technology and Systems. USENIX Association, December 1997.
http://www.usenix.org/publications/library/proceedings/usits97/cao.html.

4
Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, and Jeffrey Mogul.
Rate of change and other metrics: a live study of the world wide web.
In Symposium on Internet Technology and Systems. USENIX Association, December 1997.
http://www.usenix.org/publications/library/proceedings/usits97/douglis_rate.html.

5
Brad Duska, David Marwood, and Michael J. Feeley.
The measured access characteristics of World Wide Web client proxy caches.
In USENIX Symposium on Internet Technology and Systems, Monterey, California, USA, December 1997. USENIX Association.
http://www.usenix.org/publications/library/proceedings/usits97/duska.html.

6
Steven D. Gribble and Eric A. Brewer.
System design issues for internet middleware services: Deductions from a large client trace.
In USENIX Symposium on Internet Technology and Systems, Monterey, California, USA, December 1997. USENIX Association.
http://www.usenix.org/publications/library/proceedings/usits97/gribble.html.

7
James Gwertzman and Margo Seltzer.
World-wide web cache consistency.
In Proceedings of the USENIX Technical Conference, pages 141-152. USENIX Association, January 1996.
http://www.usenix.org/publications/library/proceedings/sd96/seltzer.html.

8
Balachander Krishnamurthy and Craig E. Wills.
Study of piggyback cache validation for proxy caches in the world wide web.
In Symposium on Internet Technology and Systems. USENIX Association, December 1997.
http://www.usenix.org/publications/library/proceedings/usits97/krishnamurthy.html.

9
Balachander Krishnamurthy and Craig E. Wills.
Piggyback server invalidation for proxy cache coherency.
In Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.
http://www7.conf.au/programme/fullpapers/1844/com1844.htm.

10
Thomas M. Kroeger, Darrel D.E. Long, and Jeffrey C. Mogul.
Exploring the bounds of web latency reduction from caching and prefetching.
In Symposium on Internet Technology and Systems. USENIX Association, December 1997.
http://www.usenix.org/publications/library/proceedings/usits97/kroeger.html.

11
Media Metrix.
http://www.mediametrix.com.

12
Jeffrey C. Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy.
Potential benefits of delta-encoding and data compression for HTTP.
In ACM SIGCOMM'97 Conference, September 1997.
http://www.acm.org/sigcomm/sigcomm97/papers/p156.html.

13
Henrik Frystyk Nielsen, Jim Gettys, Anselm Baird-Smith, Eric Prud'hommeaux, Håkon Lie, and Chris Lilley.
Network performance effects of HTTP/1.1, CSS1, and PNG.
In Proceedings of the ACM SIGCOMM '97 Conference. ACM, September 1997.
http://www.acm.org/sigcomm/sigcomm97/papers/p102.html.

14
100hot.com.
http://www.100hot.com.