Workshop on Web Characterization
About a year ago, with Anja Feldmann, Balachander Krishnamurthy, and Jeffrey Mogul, I worked on a study of the rate of change in the web. We were primarily interested in how certain characteristics of a resource (such as content type, age, and Internet top-level domain) affect the rate at which it is updated.
The abstract is repeated here (the above link refers to the full paper):
Caching in the World Wide Web is based on two critical assumptions: that a significant fraction of requests reaccess resources that have already been retrieved; and that those resources do not change between accesses.
We tested the validity of these assumptions, and their dependence on characteristics of Web resources, including access rate, age at time of reference, content type, resource size, and Internet top-level domain. We also measured the rate at which resources change, and the prevalence of duplicate copies in the Web.
We quantified the potential benefit of a shared proxy-caching server in a large environment by using traces that were collected at the Internet connection points for two large corporations, representing significant numbers of references. Only 22% of the resources referenced in the traces we analyzed were accessed more than once, but about half of the references were to those multiply-referenced resources. Of this half, 13% were to a resource that had been modified since the previous traced reference to it.
We found that the content type and rate of access have a strong influence on these metrics, the domain has a moderate influence, and size has little effect. In addition, we studied other aspects of the rate of change, including semantic differences such as the insertion or deletion of anchors, phone numbers, and email addresses.
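The three headline ratios in the abstract (resources accessed more than once, references to those resources, and re-references that found a modified resource) can be computed from a trace roughly as follows. This is only a sketch: the trace format, a list of (url, checksum) pairs in request order, and the function name `trace_stats` are assumptions made for illustration and are not the format or tooling used in the study.

```python
from collections import Counter


def trace_stats(trace):
    """Summarize reuse and change in a request trace.

    `trace` is a list of (url, checksum) pairs in request order.
    This format is hypothetical, chosen only for the example.
    """
    if not trace:
        raise ValueError("trace must contain at least one reference")

    # Count how many times each resource is referenced.
    counts = Counter(url for url, _ in trace)
    multi = {url for url, n in counts.items() if n > 1}

    refs_to_multi = 0   # references to multiply-referenced resources
    modified = 0        # of those, references that saw a changed checksum
    last_seen = {}      # url -> checksum at the previous traced reference

    for url, checksum in trace:
        if url in multi:
            refs_to_multi += 1
            if url in last_seen and last_seen[url] != checksum:
                modified += 1
        last_seen[url] = checksum

    return {
        "multi_resource_fraction": len(multi) / len(counts),
        "multi_reference_fraction": refs_to_multi / len(trace),
        "modified_fraction": modified / refs_to_multi if refs_to_multi else 0.0,
    }
```

With the figures reported above, about half of all references went to multiply-referenced resources and 13% of those saw a modification, so roughly 0.5 × 0.13 ≈ 6.5% of all traced references were to a resource that had changed since its previous traced reference.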
In parallel, I have been working for some time on a system (called the AT&T Internet Difference Engine, or AIDE) for tracking and displaying changes to web data. This system relies on a brute-force mechanism for detecting changes: the "AIDE server" polls each resource periodically and checks for a new timestamp or checksum. The same is true of Internet-wide systems such as URL-minder.
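As a rough illustration of the timestamp-or-checksum comparison described above, the following sketch compares two polls of the same resource. It is not AIDE's actual code: the `Snapshot` type, the function names, and the choice of SHA-256 as the checksum are all assumptions made for the example, and fetching the resource is left out entirely.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """What a poller remembers about one poll (hypothetical structure)."""
    last_modified: Optional[str]  # timestamp reported by the server, if any
    checksum: str                 # hex digest of the body


def take_snapshot(body: bytes, last_modified: Optional[str] = None) -> Snapshot:
    # SHA-256 is an arbitrary choice here; any stable checksum would do.
    return Snapshot(last_modified, hashlib.sha256(body).hexdigest())


def has_changed(old: Snapshot, new: Snapshot) -> bool:
    # A differing timestamp counts as a change when both polls reported one.
    if old.last_modified and new.last_modified:
        if old.last_modified != new.last_modified:
            return True
    # Otherwise (or in addition) fall back to comparing checksums.
    return old.checksum != new.checksum
```

A poller would call `take_snapshot` on each periodic fetch and compare the result with the stored snapshot. The checksum fallback matters for resources whose servers report no modification time at all.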
These two past and ongoing efforts go to some extent hand-in-hand. Aggregate information about the modification rate of classes of resources might be useful as hints for caches and for notification systems. In the former case, they might decrease the likelihood of serving stale data from a cache. In the latter, they could be used to set the polling frequency for resources at a rate that establishes an appropriate trade-off between load and timeliness.
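One way to picture the load/timeliness trade-off is a heuristic that polls at some fraction of the mean interval between observed changes for a class of resources, clamped to a sensible range. The sketch below is an assumption for illustration, not a method from the study or from AIDE; the function name, the `staleness_factor` parameter, and the default bounds are all hypothetical.

```python
def polling_interval(change_times,
                     staleness_factor=0.5,
                     min_interval=3600.0,
                     max_interval=7 * 86400.0):
    """Suggest a polling interval in seconds.

    `change_times` holds the times (in seconds) at which changes to a
    resource, or to a class of resources, were observed. All parameter
    names and defaults are hypothetical.
    """
    if len(change_times) < 2:
        # Too little history to estimate a rate: poll as rarely as allowed.
        return max_interval

    times = sorted(change_times)
    # Mean gap between consecutive observed changes.
    mean_gap = (times[-1] - times[0]) / (len(times) - 1)

    # Poll more often than the resource changes, within the allowed bounds.
    return max(min_interval, min(max_interval, staleness_factor * mean_gap))
```

A smaller `staleness_factor` favors timeliness at the cost of more polling load; a larger one does the reverse. The lower bound keeps rapidly changing resources from being polled without limit.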
Ultimately, distributed notification of changes may prove to be useful if it can be shown to scale well. Eolian's EagleWatch is an example of this, in that a master site monitors and propagates notifications of changes to other caches, but the master must itself poll the content providers. Pushing notifications from content providers to these distributed caches is perhaps the next step. Measurements of access patterns and the rate of change of web data are crucial to determining the effectiveness of this approach.
Another approach to the problem of polling sites for notification of changes is to provide an API for browsers, caches, or other software systems to download information about many recent updates at once. From the standpoint of caching, recent protocols like Piggyback server invalidation (PSI) and server volumes are mechanisms for amortizing the overhead of cache consistency. At a previous W3C workshop, I proposed similar support for user-level tracking of changes, rather than for cache validation. I never pursued this approach, but more complete information about how data changes in the web would permit trace-driven analysis.
Finally, workload characterization is at the heart of a raging debate about the effectiveness of caching proxies in today's Internet. Some traces have shown as much as 35% of requests involving cookies; if these resources are actually uncacheable, that puts a substantial limit on the best any proxy can do. I am participating in ongoing studies of the effectiveness of caching and the impact of changing data, and hope to share these with the W3C and the rest of the research community.
I recently issued a CFP for a special issue of World Wide Web on World Wide Web Characterization and Performance Evaluation. The call closed in September 1998, resulting in 17 submissions, many of which fall into the area of characterization.
I am also the program vice-chair for the performance track of WWW8, and the program chair of USITS'99.
Last modified: Wed Sep 30 15:14:30 1998