Craig E. Wills and Mikhail Mikhailov
Computer Science Department
Worcester Polytechnic Institute
Worcester, MA 01609
There have been many studies to better understand characteristics of the Web [4,5,10]. Results from these studies have been used in designing improved caching policies and mechanisms [3,7,8,9]. However, work has not been done to specifically understand how the meta information reported by servers correlates with changes in resources and how that information impacts caching by Web browsers and proxy caches. To address this gap we have undertaken a study to monitor and better understand the characteristics of resource changes at servers and how these servers report meta information about the resources.
The approach is to study a set of URLs at a variety of sites and gather statistics about the rate and nature of changes correlated with the resource type. In addition, we gather response header information reported by the servers with each retrieved resource. Previous studies used proxy and server logs or network traces of user requests/responses, which constrained the resulting studies to the available data. In contrast, our approach is to retrieve each resource in the test set at intervals and for a duration needed to ``characterize'' the nature of the resource.
Our resulting study has two distinguishing aspects in characterizing the Web: focusing the study on issues relevant for Web caching; and using a methodology that allows us to study changes to resources in a controlled manner. The following gives more details on our study and its current status.
The general goal of our work is to better understand the nature of how resources change at a collection of servers and how meta information reported by servers reflects those changes. The overriding goal of this work is to obtain data that can be used to better understand the potential benefit of caching and whether existing software is reaching this potential. Our work has many specific directions for investigation.
Based on these results, another direction for our work is to examine the availability and accuracy of cache validation information reported by servers for requested resources. Our approach is to monitor response headers returned along with a resource to discover last modification time (lmodtime), size and entity tag (etag) information. Etags are an HTTP/1.1 mechanism for servers to provide an ``opaque'' cache validator. Unfortunately, in previous work we found that lmodtime information was not always an accurate predictor of a changed resource--for example a changed size did not always guarantee the lmodtime would change. There are also reported cases where the lmodtime changes, but the resource does not. To check the accuracy of the validator information, we calculate the MD5 checksum of resource contents as a means to determine if a resource does change. By correlating changes in MD5 checksums for a resource with other validators over successive retrievals we can obtain a better understanding of the reliability of each validator.
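The correlation described above can be sketched in a few lines of Python. This is a minimal illustration, not our actual measurement code: it records the lmodtime, etag, size and an MD5 checksum for a retrieval, then checks whether each validator agreed with the checksum's verdict on whether the content actually changed.

```python
import hashlib
import urllib.request

def fetch_validators(url):
    """Retrieve a resource and record its cache validators plus an MD5
    checksum of the body; the checksum is our ground truth for change."""
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
        return {
            "lmodtime": resp.headers.get("Last-Modified"),
            "etag": resp.headers.get("ETag"),
            "size": len(body),
            "md5": hashlib.md5(body).hexdigest(),
        }

def classify_retrieval(prev, curr):
    """Compare two successive retrievals of the same resource: did the
    content actually change (MD5 differs), and did each validator's
    change/no-change verdict agree with that ground truth?"""
    changed = prev["md5"] != curr["md5"]
    return {
        "content_changed": changed,
        "lmodtime_agrees": (prev["lmodtime"] != curr["lmodtime"]) == changed,
        "etag_agrees": (prev["etag"] != curr["etag"]) == changed,
        "size_agrees": (prev["size"] != curr["size"]) == changed,
    }
```

Tallying the ``agrees'' fields over many retrieval pairs yields the per-validator reliability figures the study is after; a retrieval where lmodtime changed but the MD5 did not is exactly the unreliable case noted above.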
One possible conclusion is that lmodtime is an unreliable validator and that servers and proxies need a more reliable mechanism, such as etags or MD5 checksums, for determining when documents change. Another possibility for study is whether the availability and accuracy of validator information depends on other factors such as content type or the server software being used. We plan to make additional correlations to answer these questions.
There are two primary issues in our approach for studying the questions we identify: how to determine the test set of resources to monitor and how to actually do the monitoring. In determining the test set, we considered different approaches. One possibility is to gather a set of URLs from a relatively current proxy log trace. These URLs should be from a number of different servers and be of a number of different content types.
An alternate approach, and the one we are currently pursuing, is to identify frequently used sites (not necessarily the same as ``popular'' sites) and focus our study on resources at those sites. While such a test set may not be ``representative'' of a proxy trace, it provides us with a set of resources that are likely to have the most impact on long-term Web usage. We have explored different Web sites for gathering such information such as Media Metrix and 100hot.com. We are currently using the set of Web sites identified by 100hot.com as a basis for our study.
The methodology of the study is to do a GET for each of the URLs in the test set on a daily basis. The time between successive retrievals for a URL may be lengthened or shortened as needed until we can ``characterize'' the resource--whether it changes on each access, periodically or arbitrarily. For each retrieved URL, we store response headers and calculate an MD5 checksum on the contents. We also retrieve all embedded resources and compute their MD5 checksums. A selected subset of links from the retrieved HTML page are marked for subsequent retrieval. We also vary whether and which cookie is used as part of the request for a URL.
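One simple way to realize the ``lengthened or shortened as needed'' rule above is a multiplicative adjustment of the polling interval. The halving/doubling rule and the bounds below are illustrative assumptions, not the scheme the study prescribes:

```python
def next_interval(interval_hours, changed, min_hours=6, max_hours=168):
    """Adapt the polling interval for one URL (in hours): shorten it
    when the resource changed since the last visit, lengthen it when
    it did not, clamped to [min_hours, max_hours].

    The halving/doubling policy and the bounds are hypothetical
    parameters chosen for illustration."""
    if changed:
        # Resource is changing; poll more often to pin down its period.
        return max(min_hours, interval_hours // 2)
    # Resource is stable; back off to reduce load on the server.
    return min(max_hours, interval_hours * 2)
```

Starting from a daily (24-hour) interval, a resource that changes on every visit converges to the minimum interval, while a static resource backs off toward weekly polling; the retrieval history then classifies the resource as changing on each access, periodically, or arbitrarily.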
Using this methodology over a period of time along with tools to automatically analyze the results allows us to answer the questions for our study. The advantage of this approach is that we can study the rate of change and other characteristics of a set of resources from a variety of servers without being constrained by the data from a set of logs or packet traces. On a longer term basis, our study provides a basis to monitor evolution of the Web by focusing on characteristics of URLs at frequently used sites.
We are currently in the process of building our test environment and gathering preliminary data. The next step is to run the tests for a longer period and use analysis tools to extract information for helping us to answer the questions of our study. Over time, we expect to vary the test set of URLs to have a more comprehensive study of URLs from different sources.
In summary, we believe our study is a contribution to the efforts of Web characterization, particularly as it applies to the potential benefits of Web caching. We believe the focus of our study and its results are a natural tie-in with the Web Characterization Workshop and the activities of the Web Characterization Activity working group, which we are interested in being involved with.