Position Paper for Distributed Indexing/Searching Workshop

Support for Temporal HTTP Queries: a Position Paper
Fred Douglis, AT&T Research

Much recent work has addressed tracking modifications to data on the Web. Typically, a user's agent polls a list of URLs periodically and compares their timestamps, or checksums when timestamps are not available, to determine what has changed. Examples of this approach include w3new, webwatch (now SurfBot), URL-minder, Netscape SmartMarks, and AIDE.
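As a rough illustration of the polling approach just described, the following Python sketch compares timestamps and falls back to a checksum of the body when no timestamp is available. It is not taken from any of the tools named above; the names fingerprint, poll, and fetch are hypothetical, and fetch stands in for whatever performs the actual HTTP request.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Tuple


def fingerprint(last_modified: Optional[str], body: bytes) -> str:
    """Prefer the server's timestamp; fall back to a checksum of the body."""
    if last_modified is not None:
        return "ts:" + last_modified
    return "md5:" + hashlib.md5(body).hexdigest()


def poll(
    urls: List[str],
    fetch: Callable[[str], Tuple[Optional[str], bytes]],  # hypothetical fetcher
    seen: Dict[str, str],
) -> List[str]:
    """Issue one request per URL and return the URLs whose fingerprint changed.

    `seen` holds the fingerprints from the previous poll and is updated in place.
    A URL seen for the first time is recorded but not reported as changed.
    """
    changed = []
    for url in urls:
        last_modified, body = fetch(url)
        fp = fingerprint(last_modified, body)
        if seen.get(url) not in (None, fp):
            changed.append(url)
        seen[url] = fp
    return changed
```

Note that the loop makes one request per URL, and that a page without a timestamp must be transferred in full before its checksum can be computed.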

This approach is not scalable. Even for systems that centralize polling for many users (URL-minder has done this all along, while AIDE has moved to this architecture recently), sending individual requests for each page is an unnecessary waste of network and server resources. This is especially true when data are dynamically generated and have no timestamp to compare, in which case the entire document must be generated and transferred to the agent performing the query, which will then compute a checksum. For slowly changing data, all this work is repeated needlessly.

The HTTP community has recognized that establishing a new connection for each request generates unnecessary traffic and results in poor network utilization (due to slow-start). As HTTP evolves to support more complicated or long-lasting requests, I propose that the protocol, and each compliant server, support the type of functionality that the agents listed above perform on the client side: given a list of MULTIPLE documents on a site, the server should return the last modification date and/or checksum for all of them in a single operation.
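One way to picture the proposed multi-document query is the server-side sketch below, written in Python. It assumes the server can look up each document's modification time and contents; the function name batch_status, the in-memory documents table, and the shape of the report are all assumptions made for illustration, not part of any existing HTTP specification.

```python
import hashlib
from email.utils import formatdate
from typing import Dict, List, Optional, Tuple

# Hypothetical store: path -> (modification time in seconds or None, body).
Documents = Dict[str, Tuple[Optional[float], bytes]]


def batch_status(documents: Documents, paths: List[str]) -> Dict[str, Optional[dict]]:
    """Return the modification date and checksum of every requested path at once."""
    report: Dict[str, Optional[dict]] = {}
    for path in paths:
        if path not in documents:
            report[path] = None  # the server does not know this document
            continue
        mtime, body = documents[path]
        report[path] = {
            # Dynamically generated pages may have no timestamp at all.
            "last_modified": formatdate(mtime, usegmt=True) if mtime is not None else None,
            "checksum": hashlib.md5(body).hexdigest(),
        }
    return report
```

With such an operation, the checksum of a dynamically generated page is computed on the server, so only a few bytes per document need to cross the network.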

Alternatively, or in addition, HTTP could support a REGISTER command to send back electronic notification (via email or a CGI interface) when a document changes. Some sites do this today, either locally or using a link to URL-minder. In the case of pages that are "expensive" to generate (e.g., queries against a large database), it may also be common for the server to cache the results and use some internal state to know when it is necessary to regenerate the page from scratch. In such cases, frequent requests for the same data place minimal load on the server, but sending the pages over the Internet places an unnecessary load on the network.
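The REGISTER idea can be sketched as a small server-side registry, shown below in Python. This is only an illustration under stated assumptions: the class ChangeRegistry and its methods are hypothetical, the notify callback stands in for sending email or invoking a CGI interface, and the server is assumed to call publish whenever it regenerates a page.

```python
from typing import Callable, Dict, List


class ChangeRegistry:
    """Remember who asked to be told about a URL and notify them on change."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = {}
        self._checksums: Dict[str, str] = {}

    def register(self, url: str, notify: Callable[[str], None]) -> None:
        """Record interest in `url`; `notify` is a hypothetical delivery hook."""
        self._subscribers.setdefault(url, []).append(notify)

    def publish(self, url: str, checksum: str) -> bool:
        """Store the current checksum; notify only if it differs from the last one."""
        previous = self._checksums.get(url)
        self._checksums[url] = checksum
        if previous is None or previous == checksum:
            return False  # first sighting, or nothing changed
        for notify in self._subscribers.get(url, []):
            notify(url)
        return True
```

Because the server pushes a notification only when the checksum changes, a registered agent never needs to request an unchanged page.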

Lastly, versioned data would provide a further level of support for tracking changes to documents. Versioning allows users to determine not only when pages have changed, but also how they have changed. While AIDE archives pages on demand to provide this data, server-side access to versions of documents would obviate the need for an external ad hoc solution, and avoid duplicating pages unnecessarily.
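To make the "how they have changed" point concrete, here is a minimal Python sketch of a versioned document that can report the difference between two stored versions. It is not how AIDE is implemented; the class VersionedDocument is hypothetical, and the standard difflib module is used only as a convenient way to produce a line-by-line difference.

```python
import difflib
from typing import List


class VersionedDocument:
    """Keep every distinct version of a page and show how two versions differ."""

    def __init__(self) -> None:
        self._versions: List[str] = []

    def commit(self, text: str) -> int:
        """Store `text` as a new version unless it matches the latest one."""
        if not self._versions or self._versions[-1] != text:
            self._versions.append(text)
        return len(self._versions) - 1  # number of the current version

    def diff(self, old: int, new: int) -> List[str]:
        """Return a unified diff between version `old` and version `new`."""
        return list(
            difflib.unified_diff(
                self._versions[old].splitlines(),
                self._versions[new].splitlines(),
                fromfile=f"v{old}",
                tofile=f"v{new}",
                lineterm="",
            )
        )
```

Storing a version only when the text differs from the latest one avoids keeping duplicate copies of unchanged pages.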

Note that any of these services can be supported at the CGI level, rather than as part of HTTP itself, as long as there is a standard for how to invoke them. Note also that in addition to these personal agents that may poll daily or weekly, there are search engines that periodically scan the entire Web. Their goal is similar: they want to find what new pages exist, and what pages have changed. An HTTP or CGI interface to return information about a list of URLs or about all URLs on a site will enable agents on other hosts to retrieve only the pages that have truly changed.
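From the agent's side, a site-wide report of this kind might be used as in the Python sketch below. The function plan_refetch and the two mappings from URL to checksum are assumptions for illustration: indexed is what the agent recorded on its last visit, and reported is what the hypothetical HTTP or CGI interface returns for the site.

```python
from typing import Dict, List


def plan_refetch(indexed: Dict[str, str], reported: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare a site's reported checksums with the agent's index.

    Only the URLs listed under "new" and "changed" need to be retrieved;
    everything else on the site can be skipped.
    """
    new = sorted(url for url in reported if url not in indexed)
    changed = sorted(
        url for url in reported if url in indexed and indexed[url] != reported[url]
    )
    gone = sorted(url for url in indexed if url not in reported)
    return {"new": new, "changed": changed, "gone": gone}
```

A search engine could apply the same comparison to every site it scans, retrieving only the new and changed pages.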

douglis@research.att.com

Last modified: Mon Apr 15 18:18:24 1996

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.
