Distributed Indexing and Searching:
A Big Picture

Leslie L. Daigle
<leslie@bunyip.com>
Bunyip Information Systems Inc.

Sandro Mazzucato
<pedro@bunyip.com>
Bunyip Information Systems Inc.

May 1, 1996

To properly address the problem of distributed indexing and searching of Internet resources, the component pieces of the problem must be identified and placed in the picture.

The suggested discussion topics in this workshop's call for participation seem to stem from a perspective closely focused on extant systems that provide some measure of resource indexing. Specifically, the focus seems to be on the creation and representation of index data for general-purpose resource discovery. A broader perspective must be taken.

Our work with Archie and Digger (based on Whois++) has provided experience with distribution, and insight into some of the issues faced when generating indexing data. We will focus here on two areas: distribution across servers, and distribution across time.

Distribution across servers -- cooperative indexing

The load that indexing imposes on information servers can be reduced by sharing gathered information amongst indexers. Replication of all, or only a portion, of the indexing information may be achieved in different ways. The approach used in Archie is to divide the gathering responsibilities among all or a subset of the indexing servers and let them collaborate to replicate the information.
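As a rough sketch of this kind of division of labour, consider the following fragment. It is illustrative only, not the actual Archie mechanism: the server names, the record format, and the gather() and replicate() operations are all hypothetical.

    import hashlib

    # Hypothetical pool of cooperating indexing servers.
    INDEXERS = ["archie1.example.net", "archie2.example.net",
                "archie3.example.net"]

    def responsible_indexer(site):
        """Partition gathering duties: each site is assigned to exactly
        one indexer, so no site is retrieved more than once per cycle."""
        digest = hashlib.md5(site.encode()).hexdigest()
        return INDEXERS[int(digest, 16) % len(INDEXERS)]

    def gather_and_replicate(me, sites, gather, replicate):
        """Gather only the sites this indexer is responsible for, then
        replicate the resulting records to its peers so that every
        server ends up holding the full index."""
        for site in sites:
            if responsible_indexer(site) == me:
                records = gather(site)            # one retrieval per site
                for peer in INDEXERS:
                    if peer != me:
                        replicate(peer, records)  # share, rather than re-gather

Each indexer thus pays the gathering cost for only a fraction of the sites, while the replication step keeps the full index available at every server.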

Similarly, the information server itself can cooperate in the indexing process -- there are contexts in which, rather than having indexers "pull" the indexing data, the information provider is better served by making agreements with indexing services to which the data can be "pushed". One convention that needs to be addressed here is the "robots.txt" structure. Knowledge about information is always greatest close to its source, so it seems natural to have a protocol that allows information providers to give more guidance to the different robots.
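The following sketch suggests how such provider guidelines might look to a robot, and how they might be consumed. The directives "Index-hint" and "Push-contact" are invented for illustration; of the fields shown, only "User-agent" and "Disallow" belong to the existing robots.txt convention.

    # A minimal sketch of a gatherer reading hypothetical provider
    # guidelines from an extended robots.txt file.
    SAMPLE = """\
    User-agent: *
    Disallow: /tmp/
    Index-hint: /news/ volatile
    Index-hint: /archive/ stable
    Push-contact: index-feed@example.com
    """

    def parse_guidelines(text):
        """Collect the provider's guidelines into a simple structure."""
        hints, contact = {}, None
        for line in text.splitlines():
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "index-hint":
                path, _, kind = value.partition(" ")
                hints[path] = kind.strip()   # e.g. "/news/" -> "volatile"
            elif field == "push-contact":
                contact = value              # where to arrange a "push" feed
        return hints, contact

    print(parse_guidelines(SAMPLE))
    # ({'/news/': 'volatile', '/archive/': 'stable'}, 'index-feed@example.com')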

Distribution across time -- responsible gathering

When designing an index data gathering system, the volatility of the information being indexed should be considered. The ideal is to design and build a service that provides the most recent and accurate information to its users. However, frequent access to data will not necessarily yield indexing information of greater quality; it may simply capture data that is modified often, such as daily news. On the other hand, one may not want to index this type of information at all, as it will reside at its location only briefly. A more useful approach might be to create an index of the (static) infrastructure information surrounding that volatile data. An extension to the robots.txt convention, specifying the minimum interval between retrievals, may help reduce the extra load on information servers.
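A gatherer honouring such an interval directive might behave along the following lines. The directive itself, the default interval, and the site name are assumptions made for the sake of illustration.

    import time

    # Hypothetical per-site minimum retrieval intervals, as they might be
    # declared in an extended robots.txt (e.g. "Visit-interval: 86400").
    MIN_INTERVAL = {"www.example.com": 86400}   # seconds
    last_visit = {}

    def may_retrieve(site, now=None):
        """Permit a retrieval only if the site's declared minimum
        interval has elapsed since our last visit."""
        now = time.time() if now is None else now
        interval = MIN_INTERVAL.get(site, 3600)  # assumed default: one hour
        previous = last_visit.get(site)
        if previous is not None and now - previous < interval:
            return False
        last_visit[site] = now
        return True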

These examples illustrate the many and varied issues that must be considered in order to develop truly global indexing systems. The focus of attention at the workshop must be raised beyond the simple creation of a representation for index data, as this is only a small piece of the big picture.

