Distributed Indexing and Searching

Nick Arnett

Internet Evangelist

Verity Inc.

(narnett@verity.com)

Prepared for the Distributed Indexing/Searching Workshop

This paper is intended to describe current research and product development at Verity Inc. and to support development of open industry standards for distributed indexing and search on the Internet. Verity developed the first commercially available indexing spider, which the company continues to sell and develop in conjunction with its line of indexing, search and retrieval products. The viewpoints in this paper are subject to change.

The primary goal of open standards for indexing is to acquire data objects across the Internet in a way that supports efficient indexing and incremental updates of existing indexes. Secondary goals include reducing server and network loads. Verity's research and product development efforts are focused on merging existing and proposed standard protocols with new, open information gathering protocols. Although the company is not committed to a particular technical direction, it views certain technologies as important antecedents of the information gathering protocol that is to be developed. These include the "robots.txt" standard for robot exclusion (presently supported by Verity products) and the Harvest system developed at the University of Colorado (supported by third-party Verity developers, including the University). However, these antecedents do not address issues that the company believes are critical in today's Internet environment. Furthermore, the Web's primary transfer protocol, HTTP, and its predecessors, FTP and Gopher, are inefficient for Internet index maintenance operations, except as carrier protocols.

For example, the "robots.txt" exclusion file could be enhanced to carry more information about the documents and collections of documents stored on a server. There is currently no standard means of storing and obtaining meta-information, such as titles and owner/maintainer identification, for document groups. The "robots.txt" file or a similar resource description could provide this through an extensible set of well-known descriptive fields.
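As a rough illustration of the idea, the sketch below parses a robots.txt-style file that mixes the existing exclusion fields with collection metadata. Only "User-agent" and "Disallow" come from the robot exclusion standard; the "Collection", "Title" and "Maintainer" fields, the function name and the sample content are hypothetical and are not part of any existing specification.

```python
def parse_resource_description(text):
    """Split a robots.txt-style file into exclusion rules and collection metadata."""
    rules = []        # (user_agent, disallowed_path) pairs from the standard fields
    collections = []  # one dict per hypothetical "Collection" record
    agent = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding space
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            agent = value
        elif field == "disallow":
            rules.append((agent, value))
        elif field == "collection":
            # Hypothetical extension: start a new document-group description.
            collections.append({"path": value})
        elif collections:
            # Any other field is kept as metadata on the latest collection,
            # which is what makes the field set extensible.
            collections[-1][field] = value
    return rules, collections


SAMPLE = """\
User-agent: *
Disallow: /private/

# Hypothetical extension fields (not part of the exclusion standard)
Collection: /docs/
Title: Product Documentation
Maintainer: webmaster@example.com
"""

rules, collections = parse_resource_description(SAMPLE)
```

A robot that does not understand the extra fields could simply ignore them, so a file of this kind would remain usable by robots that support only the exclusion rules.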

The Harvest Summary Object Interchange Format (SOIF) was an important step forward in the effort to transfer large amounts of new and changed information with "push" and "pull" mechanisms, which are critical to efficiency. Verity believes that SOIF can be a building block in an open information gathering protocol that would be stronger than SOIF in terms of incremental updates and generation of data objects specific to the requirements of a robot.
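To make the incremental-update idea concrete, the following sketch emits summary records only for documents modified since a robot's last visit. The record layout is a simplified rendering modeled on SOIF (a template type and URL, followed by attributes that carry their byte length); it is not claimed to conform to the Harvest specification, and the function names, document fields and integer timestamps are assumptions made for illustration.

```python
def to_summary_object(url, attributes, template="FILE"):
    """Render one SOIF-style record; each attribute carries its byte length."""
    lines = ["@%s { %s" % (template, url)]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))  # length in bytes, not characters
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"


def incremental_summary(documents, last_visit):
    """Return records only for documents modified after the last visit."""
    changed = [d for d in documents if d["last_modified"] > last_visit]
    return "".join(
        to_summary_object(d["url"], {"Title": d["title"]}) for d in changed
    )


# Hypothetical server-side document list with integer timestamps.
DOCUMENTS = [
    {"url": "http://example.com/a", "title": "Hello", "last_modified": 100},
    {"url": "http://example.com/b", "title": "World", "last_modified": 300},
]

update = incremental_summary(DOCUMENTS, last_visit=200)
```

Because the server decides which records to send, the robot receives a single batch instead of issuing one request per document to discover what changed.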

Finally, Verity favors negotiation-based protocols, conceptually similar to those used in recent modem communications, in which the pair of communicators falls back to the most efficient commonly supported protocol. The least common denominator would be today's typical robot operation -- a series of GET or HEAD requests. The most sophisticated protocols would include "push" and "pull" requests, compression negotiation and search query-based index updates (which would take advantage of a search engine's ability to return results based on field data such as last-modified date). The protocols would allow the communicators to exchange data objects and deletion lists for index maintenance.
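The negotiation and index maintenance steps might be sketched as follows. The capability names, their ordering from least to most efficient, and the dictionary-based index are all assumptions chosen for illustration; no wire format or protocol is implied.

```python
# Hypothetical capability names, ordered from least to most efficient.
PREFERENCE = ["get-head", "pull-batch", "push-batch", "query-update"]
BASELINE = "get-head"  # today's robot behaviour: a series of GET or HEAD requests


def negotiate(robot_caps, server_caps):
    """Pick the most efficient capability that both sides support."""
    common = set(robot_caps) & set(server_caps)
    for capability in reversed(PREFERENCE):
        if capability in common:
            return capability
    return BASELINE  # nothing in common: fall back to plain requests


def apply_update(index, data_objects, deletion_list):
    """Apply one maintenance batch: remove deleted URLs, then add or replace objects."""
    for url in deletion_list:
        index.pop(url, None)  # ignore URLs that were never indexed
    for url, obj in data_objects.items():
        index[url] = obj
    return index


chosen = negotiate(
    robot_caps=["get-head", "pull-batch", "query-update"],
    server_caps=["get-head", "pull-batch"],
)

index = {"http://example.com/old": "stale", "http://example.com/a": "v1"}
apply_update(
    index,
    data_objects={"http://example.com/a": "v2", "http://example.com/new": "v1"},
    deletion_list=["http://example.com/old"],
)
```

In this sketch deletions are applied before new objects, so a URL that appears in both lists ends up present in the index; a real protocol would need to state that ordering explicitly.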

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.