Registration / Notification Breakout
Reported by Mike Heffernan (Fulcrum) and Dan LaLiberte (NCSA), and edited further by Mike Schwartz (@Home)

Keywords: Index Maintenance, Notification, Response, Bulk Transfer, Registration, Provider Push

Introduction and Motivation

There are three ways to maintain an index:

retrieval without prior coordination (e.g., as used by current robots)
retrieval after notification
notification followed by a provider push

This group explored ways to improve index maintenance capabilities by enhancing the interactions between web servers and indexing technologies. We began by exploring the relationship between notification and bulk content transfer, which had been grouped together during the plenary discussion because of the need to transfer several resources at once at the server's instigation. We ended up not discussing bulk content transfer after deciding that it is an orthogonal and simpler problem.

This group addressed several problems, some of which do not appear related until we look at the solutions:

Frequent crawling through the web is inefficient when only a small percentage of resources change.
Without frequent crawling, indexes may become inconsistent with resources.
Caches and replicas become inconsistent with the origin resources.
Users do not find out when interesting events occur without frequent visits to the servers where the events occur.
Server administrators have insufficient control over what is indexed on their server.

Required Work

We determined that the following work items need to be done, in order of complexity and precedence:

Define a collection protocol on top of HTTP. This protocol will be the basis for an orthogonal way to uniformly handle either one or many resources, where a collection is an object "comprised of" several other resources.
Define a package or compound document format for the transfer of collections. There are several possible formats that may be used including SOIF, multi-part MIME, and tar.
Define register and notify protocols. The initial form of this protocol should be very simple, with immediate notification for any changes.
Define register/notify parameters. This is an augmentation of the register/notify protocol to allow clients to request a schedule for the frequency and time of notification, specify what kinds of events are of interest, etc.
Define negotiation protocol for modes of communication. Clients need to be able to ask and servers need to be able to request that register/notify be used as opposed to client pull.
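As a rough illustration of the packaging work item, the following sketch bundles the elements of a collection into one multi-part MIME message, one of the candidate formats named above. The element URIs, the package_collection helper, and the use of a Content-Location header to label each part are assumptions made for this example; the group did not specify any of them.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def package_collection(elements):
    """Bundle a collection into a single multi-part MIME message.

    `elements` maps a (hypothetical) element URI to its HTML content.
    """
    package = MIMEMultipart("mixed")
    for uri, html in elements.items():
        part = MIMEText(html, "html")
        # Assumption: label each part with the URI of the element it carries.
        part["Content-Location"] = uri
        package.attach(part)
    return package


elements = {
    "/reports/a.html": "<p>Report A</p>",
    "/reports/b.html": "<p>Report B</p>",
}
package = package_collection(elements)

# Serialize for transfer, then parse again as a receiver would.
wire = package.as_string()
received = message_from_string(wire)
locations = [part["Content-Location"] for part in received.get_payload()]
```

The receiver recovers one part per element, so the same request and response machinery can carry either a single resource or many.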

Overview of Possible Design

Several issues that work together to provide an overall capability often tend to be bound to one another more tightly than necessary. This is the case with notification and bulk content transfer. Instead of a notification always involving a bulk transfer of resources, we pulled these two apart and explored the general categories of each. Notification is one mode of communication while bulk transfer is one kind of package for delivery of data.

We identified the following modes of communication:

Pull: A request for immediate, synchronous delivery: "Give me X"
Push: The synchronous delivery of a result: "Here is X"
Register: A request for delayed, asynchronous delivery: "Tell me if Y happens"
Notify: The asynchronous delivery of a result: "Y happened"

Note there are two major modes, synchronous and asynchronous, and within each major mode, there are two parts to a communication. (The "Pull" and "Push" modes may not be named or characterized correctly - maybe they should just be "Request" and "Response". We are more interested in the second pair: Register and Notify.)

When a client registers with a server, it is requesting that the server notify it when some event happens of interest to the client. The event might be as simple as 'a resource has changed', or as complex as 'a resource about biking has passed the final review stage'. After the registration is completed, the synchronous connection is dropped.

Some time later, when the event occurs, the server notifies the client. At this point the roles are reversed: the server acts as a client, initiating a connection to the original client, which must have a server actively listening for the notification. To avoid confusion, we will talk only in terms of the original client and server.

The server can notify the client in one of several ways. A synchronous connection, much like the first connection from client to server, could be attempted. SMTP-based email delivers messages using store-and-forward, which works even if both parties are not available at the same time for a synchronous communication. For a large number of clients interested in the same event, it may be more effective to use a flooding propagation of notifications via something like NNTP.

The message being transmitted in a notification should probably be very small, especially with a large number of registrants, so instead of sending a new or changed resource directly, the server should send just a reference to the resource. The client could later fetch the actual resource with a Pull/Push.
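The register, notify, and later pull sequence described above can be sketched in a few lines. This is a minimal in-memory model, not a proposed protocol: the Server and Notification classes, the "changed" event name, and the use of a callback to stand in for the client's listening endpoint are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Notification:
    """A small message: a reference to the resource, not its content."""
    event: str  # e.g. "changed"
    uri: str    # reference the client can later fetch with a Pull


@dataclass
class Server:
    # Hypothetical in-memory store: URI -> content.
    resources: Dict[str, str] = field(default_factory=dict)
    # URI -> callbacks standing in for the clients' listening endpoints.
    registrants: Dict[str, List[Callable[[Notification], None]]] = field(
        default_factory=dict
    )

    def register(self, uri, callback):
        """Register: 'Tell me if Y happens' (here, any change to uri)."""
        self.registrants.setdefault(uri, []).append(callback)

    def pull(self, uri):
        """Pull: 'Give me X', answered synchronously."""
        return self.resources[uri]

    def update(self, uri, content):
        """Change a resource, then Notify each registrant: 'Y happened'."""
        self.resources[uri] = content
        for callback in self.registrants.get(uri, []):
            callback(Notification(event="changed", uri=uri))


server = Server()
inbox: List[Notification] = []

server.register("/docs/biking.html", inbox.append)
server.update("/docs/biking.html", "final review passed")
server.update("/docs/other.html", "unrelated")  # nobody registered for this

# The client fetches the actual content only when it chooses to.
fetched = [server.pull(n.uri) for n in inbox]
```

Because each notification carries only a URI, its size stays constant no matter how large the changed resource is, which matters when there are many registrants.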

Entities Transferred

The entity transferred in any message may be one of several kinds of things all concerning a single resource:

A representation of the actual content of a resource.
Metadata (descriptive, prescriptive, related information, etc.) for a resource.
Extracts of the resource or differences relative to previous versions.

Which one of these things is transferred in a message must be known to the receiver. Either it is known implicitly to the client by the kind of request it made, or it is declared explicitly by the server if it would otherwise be unknown to the client.

These entities may be considered resources in their own right, especially if they are given URIs to identify them.

Collections

One kind of resource that is particularly important for this proposal is a collection. A collection is a set of other resources that together act as one resource. This abstraction allows us to send messages about collections as easily as we send them about simple resources.

Some other combinations to keep in mind: Rather than a collection of actual resources, we may also have collections of metadata or differences. We may also have metadata about a collection itself or differences of a collection relative to a previous version of the collection (i.e. new, changed, or deleted elements).
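To make "differences of a collection" concrete, here is a minimal sketch that compares two versions of a collection and reports new, changed, and deleted elements. It assumes each version is represented as a mapping from element URI to some version marker such as a checksum or modification stamp; that representation and the diff_collection name are illustrative choices, not something the group specified.

```python
def diff_collection(old, new):
    """Compare two versions of a collection.

    Each version maps an element URI to a version marker (hypothetical:
    a checksum or last-modified stamp).
    """
    return {
        "new": sorted(u for u in new if u not in old),
        "changed": sorted(u for u in new if u in old and new[u] != old[u]),
        "deleted": sorted(u for u in old if u not in new),
    }


previous = {"/a.html": "v1", "/b.html": "v1", "/c.html": "v1"}
current = {"/a.html": "v1", "/b.html": "v2", "/d.html": "v1"}

delta = diff_collection(previous, current)
# delta == {"new": ["/d.html"], "changed": ["/b.html"], "deleted": ["/c.html"]}
```

An indexer that receives such a difference only needs to fetch the new and changed elements and drop the deleted ones.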

Collections should be identified by URIs. There is a natural use of http and ftp URLs to identify collections. A request for an http URL of a directory, i.e. one ending in "/", is handled by servers either by generating an HTML document that lists and describes the accessible elements of the directory, or by returning a default "index.html" file instead. But a client might request that the server return a representation of a collection in another form by specifying what that form could be. The Accept line could include "text/SOIF", for example.
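A client-side request of this kind might look like the following sketch, which builds (but does not send) an HTTP request for a directory URL with "text/SOIF" in the Accept header. The host name and path are placeholders, and the lower-preference "text/html" fallback is an assumption added for illustration.

```python
from urllib.request import Request

# Hypothetical collection URL; the trailing "/" marks a directory.
collection_url = "http://www.example.org/reports/"

# Prefer a SOIF representation of the collection, and accept the usual
# HTML listing as a lower-preference fallback (an assumption here).
request = Request(
    collection_url,
    headers={"Accept": "text/SOIF, text/html;q=0.5"},
)

accept_line = request.get_header("Accept")
method = request.get_method()
```

A server that does not understand "text/SOIF" could still answer with its ordinary HTML listing, so existing behavior is unaffected.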

Other Observations

There needs to be a hierarchical brokerage capability to get economies of scale.
The group had some reservations about scaling the transport layer to support widespread registration/notification.
Authentication of search services would allow servers to limit the number of services indexing a given site.
The system of registrations and notifications should be open to third-party involvement to add value.

Related Work and Standards

Many standards are involved in the framework we are proposing, including:

HTTP - the dominant standard for resource retrieval
HTML - the dominant standard for text format
RDM - Resource Description Message (Netscape's proposed standard that encapsulates SOIF and provides syntax negotiation, stronger typing, and other features)
SOIF - Summary Object Interchange Format (defined by the Harvest Project)
MIME - parts used by HTTP to identify resource types
robots.txt - standard for servers to specify exclusion of robots (aka "Robot Exclusion Protocol")
PEP - Protocol Extension Protocol proposed for HTTP
URNs - the IETF effort to define Uniform Resource Names
NETLIB
Register-Notify Discussion (discussion forum set up by this group)

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.
