Position Paper for Distributed Indexing/Searching Workshop: Indexing Proxies

Chris Weider
Ken Weiss

Problem Statement

Indexing systems such as Lycos and Alta Vista must actively download every document they wish to index. As services like these proliferate and the volume of information on the Internet grows, this will become increasingly difficult to do in a timely fashion, or indeed to do at all. In addition, many of these services perform exactly the same types of indexing on the documents. There may well also be cases (particularly in private or semi-private networks) where it is perfectly reasonable to export precomputed indices while securing the documents themselves. This will become more important as the Net evolves to contain a) many more services that cost money, which are typically not indexed today, and b) many more access-controlled resources.

Proposed Solution

A system of indexing proxies should be developed and deployed. Each proxy would generate indices for the information contained in a given group of resources and export them to indexing services. In this model, a system such as Lycos would contact the indexing proxy for a given resource site, ask for an index in a specific format, and then integrate that index into the rest of the service. This would also help prevent undesired replication of the entire data resource, a problem that is likely to become more prevalent as time goes on, particularly for smaller resources. It would also allow expensive resources to be integrated into the search tools without requiring a substantial up-front cost. Together, these properties reduce the barriers to entry for many smaller special-purpose index services.
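
The pull interaction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (IndexProxy, IndexingService), the format names ("word-list", "inverted"), and the in-memory method calls standing in for a network protocol are all hypothetical and are not part of any existing system.

```python
from dataclasses import dataclass, field


@dataclass
class IndexProxy:
    """Hypothetical proxy that indexes a site's documents locally."""
    documents: dict  # doc_id -> text; the text itself is never exported

    def export_index(self, fmt):
        """Return an index in the requested (assumed) format."""
        if fmt == "word-list":
            # Unique words across the whole site, with no document references.
            words = set()
            for text in self.documents.values():
                words.update(text.lower().split())
            return sorted(words)
        if fmt == "inverted":
            # Map each word to the identifiers of the documents containing it.
            inverted = {}
            for doc_id, text in self.documents.items():
                for word in set(text.lower().split()):
                    inverted.setdefault(word, []).append(doc_id)
            return {word: sorted(ids) for word, ids in inverted.items()}
        raise ValueError(f"unsupported index format: {fmt}")


@dataclass
class IndexingService:
    """Hypothetical search service that integrates exported indices."""
    merged: dict = field(default_factory=dict)  # word -> [(site, doc_id), ...]

    def integrate(self, site, proxy):
        # Ask the proxy for an index instead of downloading the documents.
        index = proxy.export_index("inverted")
        for word, doc_ids in index.items():
            self.merged.setdefault(word, []).extend((site, d) for d in doc_ids)
```

The service only ever sees words and document identifiers, so the proxy's site can keep the documents themselves behind access control or a paywall.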

In addition, if an indexing proxy sets up indexing relationships with a number of services, it can *push* any changed data to them without having to be polled for it. WHOIS++ successfully uses that model today.
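
A minimal sketch of the push model follows, assuming each indexing relationship is represented by a callback registered with the proxy. The names PushingProxy, subscribe, and update_document are hypothetical, and this is not the WHOIS++ protocol; it only shows the idea of notifying services when content actually changes.

```python
import hashlib


class PushingProxy:
    """Hypothetical proxy that pushes index updates to subscribed services."""

    def __init__(self):
        self._subscribers = []  # callbacks standing in for remote services
        self._digests = {}      # doc_id -> digest of the last indexed content

    def subscribe(self, callback):
        """Register a service; callback(doc_id, words) receives updates."""
        self._subscribers.append(callback)

    def update_document(self, doc_id, text):
        """Re-index one document and push the result only if it changed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self._digests.get(doc_id) == digest:
            return False  # content unchanged, nothing to push
        self._digests[doc_id] = digest
        words = sorted(set(text.lower().split()))
        for notify in self._subscribers:
            notify(doc_id, words)
        return True
```

Comparing content digests is one simple way to decide what counts as "changed"; a real deployment might instead rely on modification times or version numbers.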

Most types of indexing data can be propagated and perhaps even integrated together. Centroids, Glimpse full-text indices, WAIS indices and so forth are all good candidates for transmission. One possible disadvantage is that this may foster a reliance on a few types of indices, but this can probably be avoided.
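
As an illustration of how transmitted index data might be integrated, the sketch below builds and merges a deliberately simplified, centroid-like summary: for each attribute, the set of unique words appearing in any record. The function names and the record layout are assumptions made for this example, and the summary omits details of the actual WHOIS++ centroid format.

```python
def build_centroid(records):
    """Summarize records as attribute -> set of unique lowercase words."""
    centroid = {}
    for record in records:
        for attribute, value in record.items():
            centroid.setdefault(attribute, set()).update(value.lower().split())
    return centroid


def merge_centroids(*centroids):
    """Integrate several summaries by taking the union per attribute."""
    merged = {}
    for centroid in centroids:
        for attribute, words in centroid.items():
            merged.setdefault(attribute, set()).update(words)
    return merged


def may_match(centroid, attribute, word):
    """Tell whether a query word could match anything behind this summary."""
    return word.lower() in centroid.get(attribute, set())
```

Because merging is a plain set union, an indexing service can combine summaries from many proxies and still answer "which sites are worth querying for this word?" without holding any of the underlying records.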

This approach requires the definition of protocols and vocabularies for describing and transmitting indices, and perhaps standards for the specification of subsets of a given resource site.

Advantages of Proposed Solution

Disadvantages of Proposed Solution