Background and Goals: This workshop brought together people involved with information server technologies, search technologies, and directory and online services, to discuss areas of mutual concern where repository interface standards could provide better approaches to distributed indexing and searching.
The first day of the workshop consisted of three technical sessions, each of which began with two invited talks intended to stir controversy on a particular topic, followed by a breakout session to discuss the topic in smaller groups. The technical session areas were Distributed Data Collection, Data Transfer Formats, and Distributed Search Architectures. Slides from the talks the first day are available (see the links in the Agenda), as are notes from the plenary sessions.
The morning of the second day was spent in a plenary session summarizing the preceding discussions and then culling and adjusting the topic list to provide charters for a final breakout session to write up workshop recommendations. The workshop turned out to focus more on indexing than on searching, and in particular on collecting the information needed for indexing.
The themes selected from the day 2 plenary session were:
The actual topics covered by these sessions diverged somewhat from these themes, by focusing on a subset of their "assigned" topics and expanding to include some related topics.
Below we summarize the recommendations resulting from the workshop. More detailed summaries of the individual sessions are also available.
Recommendations: The first breakout/writeup session focused primarily on determining the set of servers to which a query should be routed. Most of this discussion focused on centroids, which the group abstractly characterized as a table used to determine if a particular query should be sent to a particular service. The group felt the Whois++ framework specified in RFCs 1913 and 1914 was appropriate, although it might have shortcomings for particular applications; for example, there are currently no provisions for indicating that a centroid was created using stemming or stop lists. Alternatively, the informal agreement that Stanford is coordinating proposes centroids that include the agreed-upon content summaries (i.e., the words in the collection plus their document frequencies), together with supplementary information such as the stemming and stop words used.
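The centroid idea described above can be illustrated with a minimal sketch. It assumes a centroid is just a word-to-document-frequency table plus a record of the stop list and stemming used; the function names, the dictionary layout, and the "every query term must appear" routing rule are assumptions made for illustration, not the format defined in RFCs 1913 and 1914 or in the Stanford agreement.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_centroid(documents, stop_words=frozenset()):
    """Summarize a collection as word -> number of documents containing it.

    The stop list is recorded alongside the counts so that a client can
    apply the same filtering to its query (hypothetical layout).
    """
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(tokenize(doc)) - stop_words)
    return {
        "stop_words": sorted(stop_words),
        "stemming": None,  # no stemming is applied in this sketch
        "doc_freq": dict(doc_freq),
    }


def route_query(query, centroids):
    """Return the services whose centroid contains every query term."""
    terms = set(tokenize(query))
    selected = []
    for name, centroid in centroids.items():
        wanted = terms - set(centroid["stop_words"])
        if wanted and all(t in centroid["doc_freq"] for t in wanted):
            selected.append(name)
    return selected


centroids = {
    "library": build_centroid(
        ["Z39.50 profile for catalogs", "catalog search"],
        stop_words=frozenset({"for"}),
    ),
    "web": build_centroid(["robot indexing of web pages"]),
}
print(route_query("catalog search", centroids))
```

Because no stemming is applied here, "catalogs" and "catalog" are distinct entries, which shows why a centroid needs to state whether and how stemming was used.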
The group established several short-term goals: additions/modifications to RFC 1913/1914 to specify stop lists, stemming, language, administrative contacts, and field names/attribute keys; the creation of a mailing list for interested parties; and tools for creating centroids. As longer-term goals, the group recommended creating a centroid standard that interoperates among search engine vendors, perhaps starting with Bunyip's Digger software as a reference implementation of the Whois++ protocol; working out a stemming specification for centroids; measuring the size and computational costs of using centroids, perhaps as part of a proposed prototype implementation using the MetaCrawler Web search service, or as part of the ongoing University of California Whois++ testbed; considering ways to extend centroids for use with non-text databases; expanding the header generality in RFCs 1913 and 1914; adding support for comments; and adding data specification support so that clients can rank services, and to allow per-collection word frequency counts.
This breakout session also discussed the question of how to conduct searches. There the group focused on engine identifiers and merging heterogeneous result sets. They suggested defining a data structure and transport mechanism to allow clients to formulate queries and interpret results. Some of the pieces they considered included URIs, collection descriptions, query language and output formats, and support for active code. The group also felt it would be important to consider standards for query languages and refinement and the role of Z39.50, but they ran out of time for those discussions.
The second breakout/writeup session addressed the problem of defining a simple convention for embedding metadata within HTML documents without requiring additional tags or changes to browser software, and without unnecessarily compromising current practices for robot collection of data. The group noted that a registry may become necessary over time, but suggested that deployment proceed in the short term without requiring one. The group then went on to define an encoding scheme using META tags, gave examples of how the scheme might be used, and proposed a convention for linking to a schema's reference definition. Finally, they suggested that the semantics for metadata elements be related to existing well-known schemas whenever feasible, to promote consistency among schemas.
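As an illustration of embedding metadata with META tags, the following sketch renders a set of metadata elements as HTML. The exact syntax the group defined is not reproduced in this summary, so the naming pattern used here (a schema identifier, a dot, then the element name), the LINK element for the schema reference, the function name, and the example schema and URL are all assumptions.

```python
from html import escape


def meta_tags(schema, elements, schema_url=None):
    """Render (element, value) pairs as META tags named 'schema.element'.

    If schema_url is given, a LINK element pointing at the schema's
    reference definition is emitted first (hypothetical convention).
    """
    lines = []
    if schema_url is not None:
        rel = escape(f"SCHEMA.{schema}", quote=True)
        href = escape(schema_url, quote=True)
        lines.append(f'<LINK REL="{rel}" HREF="{href}">')
    for element, value in elements:
        name = escape(f"{schema}.{element}", quote=True)
        content = escape(value, quote=True)  # keep quotes from breaking the attribute
        lines.append(f'<META NAME="{name}" CONTENT="{content}">')
    return "\n".join(lines)


print(meta_tags(
    "example",  # hypothetical schema identifier
    [("title", "Workshop Report"), ("creator", "A. Author")],
    schema_url="http://example.org/schema",  # placeholder URL
))
```

Since the tags live in the document head and use only existing HTML elements, browsers ignore them and robots can collect them without any change to current practice.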
The third breakout/writeup session focused on mechanisms to allow information servers to notify indexers when content changes. They separated this issue from the choice of how bulk data transfer is performed, and noted that there are three ways to maintain an index: (a) retrieval without prior coordination (e.g., as used by current robots), (b) retrieval after notification, and (c) notification followed by a provider push. They suggested five areas where standards are needed: a bulk collection protocol on top of HTTP, a collection packaging format, notification and registration protocols, notification event scheduling, and a protocol for clients and servers to negotiate whether to push or pull updates. The group then proposed a basic design for this set of standards.
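The three ways of maintaining an index, and the push-or-pull negotiation, can be sketched as follows. This is not the design the group proposed, which is not detailed in this summary; the enum, the function name, and the "first mutually supported preference" rule are assumptions chosen only to make the options concrete.

```python
from enum import Enum


class UpdateMode(Enum):
    """The three index maintenance styles listed above."""
    POLL = "retrieval without prior coordination"
    NOTIFY_THEN_PULL = "retrieval after notification"
    NOTIFY_THEN_PUSH = "notification followed by a provider push"


def negotiate(provider_supports, indexer_prefers):
    """Pick the indexer's most preferred mode that the provider supports.

    Polling needs no cooperation from the provider, so it serves as the
    fallback when no preferred mode is supported.
    """
    for mode in indexer_prefers:
        if mode in provider_supports:
            return mode
    return UpdateMode.POLL


mode = negotiate(
    provider_supports={UpdateMode.POLL, UpdateMode.NOTIFY_THEN_PULL},
    indexer_prefers=[UpdateMode.NOTIFY_THEN_PUSH, UpdateMode.NOTIFY_THEN_PULL],
)
print(mode.name)  # NOTIFY_THEN_PULL
```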
The group expressed some concern about the ability of the transport layer to handle large-scale notifications, especially in the case of personal agents requesting notifications. The availability of authentication mechanisms could reduce this problem by allowing providers to limit notifications to a specific set of indexing services.
The group discussed the use of Netscape's Resource Description Message (RDM) extension to the Harvest SOIF format for performing incremental updates and bulk transfers. Darren Hardy has made a preliminary specification of RDM available.
Finally, the group observed that registration and bulk transfer standards should be open, to encourage competitive value addition by parties other than information providers and indexers.
Z39.50: There was a fair amount of discussion of Z39.50 at the workshop. Some participants felt there should be a standard information retrieval protocol for queries and that Z39.50 was a good choice; others felt that Z39.50 is too large and complex, and suggested that the Z39.50 community develop a lightweight rendering of Z39.50. The Library of Congress Z39.50 representative at the workshop agreed to work towards this goal.
BOFs: Two Birds-Of-a-Feather (BOF) sessions were also held at the workshop. At the first BOF, the Z39.50 Implementors Group community agreed to help the Stanford Digital Library Project produce a Z39.50 profile for the Stanford informal agreement and two alternative implementations, one using an ASCII encoding and the other using BER (Basic Encoding Rules).
While the overall workshop goal was to determine areas where standards could be pursued, the second BOF attempted to reach actual standards agreements about some immediate-term issues facing robot-based search services. The agreements fell into four areas: a ROBOTS meta-tag, meant to provide a per-document mechanism for users who cannot control the robots.txt file at their sites; a DESCRIPTION meta-tag, providing text that can be used by a search service when printing the document summary; a KEYWORDS meta-tag (which some workshop attendees felt was not appropriate for this BOF to specify without the participation of other parties that have been working on this metadata issue); and a list of other issues with ROBOTS.TXT that should be resolved in future discussions: ambiguities in the current specification, a means of canonicalizing sites, ways of supporting multiple robots.txt files per site, ways of advertising content that should be indexed (rather than just restricting content that should not be indexed), and information about the maximum acceptable speed and parallelism when indexing a site.
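A robot might read the three meta-tags described above roughly as in the sketch below. The text does not list the allowed ROBOTS values, so the NOINDEX and NOFOLLOW directives (the ones in common use) and the comma-separated KEYWORDS format are assumptions, as are the class and function names.

```python
from html.parser import HTMLParser


class RobotMetaParser(HTMLParser):
    """Collect the ROBOTS, DESCRIPTION, and KEYWORDS meta-tags of a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in ("robots", "description", "keywords"):
            self.meta[name] = attrs.get("content") or ""


def read_robot_meta(html_text):
    """Return per-document indexing hints extracted from meta-tags."""
    parser = RobotMetaParser()
    parser.feed(html_text)
    parser.close()
    meta = parser.meta
    directives = {
        d.strip().lower() for d in meta.get("robots", "").split(",") if d.strip()
    }
    return {
        "may_index": "noindex" not in directives,    # assumed directive name
        "may_follow": "nofollow" not in directives,  # assumed directive name
        "description": meta.get("description"),
        "keywords": [k.strip() for k in meta.get("keywords", "").split(",") if k.strip()],
    }


page = (
    '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">'
    '<meta name="description" content="Workshop report">'
    '<meta name="keywords" content="indexing, search"></head></html>'
)
print(read_robot_meta(page))
```

A page without any of these tags yields the permissive defaults, which matches the intent of the ROBOTS meta-tag as an opt-out mechanism for authors who cannot edit their site's robots.txt file.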