Creating Collections with a Distributed Indexing Infrastructure

Position statement for Distributed Indexing/Searching Workshop
Jeremy Hylton, Corp. for National Research Initiatives

This note suggests two ideas that may be useful in the construction of an infrastructure for distributed indexing, searching, and browsing systems. First, I make a clear distinction between servers, which provide storage for digital objects, and collections, which organize related documents. Second, I argue that multiple, independent indexing systems may each require access to the original documents.

These ideas do not lead directly to suggestions for standards, nor are they completely original. They do offer a different perspective on the problem and place different constraints on developing standards.

Distributed information retrieval is often based on a model where many independent servers index local document collections and a directory server (or servers) guides users towards the independent indexes. This model assumes that the documents stored at a particular location define a collection.

In traditional information retrieval, term weights for a document are assigned using collection-wide statistics; e.g., words occurring in only a few documents are weighted more heavily. This collection-wide information (a term due to Viles and French [4]) greatly increases effectiveness and enables other useful services, such as automatically constructing hierarchies with scatter/gather [2] or helping users reformulate queries (content routing [3]).
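As a minimal sketch of what "collection-wide" means in practice, the following computes inverse document frequency (idf) over a toy collection and uses it to weight the terms of one document. The tf-idf formula is one common choice, not one prescribed by the works cited here, and the function names and sample sentences are made up for illustration.

```python
import math
from collections import Counter

def inverse_document_frequency(documents):
    """Return {term: idf} for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def term_weights(tokens, idf):
    """Weight each term of one document by term frequency times idf."""
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}

# Toy collection (illustrative data only).
collection = [
    "the market rose on strong earnings".split(),
    "the council approved the budget".split(),
    "the film opened to strong reviews".split(),
]
idf = inverse_document_frequency(collection)
weights = term_weights(collection[0], idf)
```

A word that appears in every document ("the") gets weight zero, while a word found in a single document ("market") gets the highest weight. The weights cannot be computed without knowing which documents make up the collection.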

Applying traditional term-weighting strategies in a distributed system is hard because the definition of "collection-wide" can be difficult to pin down and, even when it is pinned down, collecting the information can be expensive.

Sheldon [3] proposes a distributed IR model with the important characteristic that a collection of documents is described by a content label, and the content label can itself be treated as a document and included in another collection. Content labels help users manage and explore very large information spaces, but the idea could be usefully extended by treating collections (and their labels) separately from servers. Thus, a collection could include particular documents from many servers. (HyPursuit [5] moves in this direction.)
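The nesting of labels can be sketched as follows. This is a deliberate simplification: here a content label is just the most frequent terms of a collection, which is an assumption made for illustration and not the label format defined in [3]. The function name and sample data are hypothetical.

```python
from collections import Counter

def content_label(documents, size=3):
    """Summarize a collection as its `size` most frequent terms.

    Simplifying assumption: a label is a short string of terms, so it
    can be handled exactly like any other document.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    # Sort by descending count, then alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(term for term, _ in ranked[:size])

sports = ["hockey playoff score", "hockey trade rumor", "playoff schedule"]
finance = ["bond yield rises", "bond market rally", "yield curve"]

sports_label = content_label(sports)
finance_label = content_label(finance)

# Labels are plain text, so they can be members of a higher-level collection,
# which in turn has its own label.
directory = [sports_label, finance_label]
directory_label = content_label(directory, size=4)
```

Nothing in this sketch ties a collection to the server that stores its documents; the lists could just as well hold documents fetched from several servers.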

Consider a simple example: Several newspapers provide servers with their articles. We could construct many collections, each with different term weightings -- business articles from each of the newspapers, articles with a San Jose dateline, or movie reviews. Different terms would be useful in each collection.
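The newspaper example can be sketched in a few lines. The host names, sections, and article texts below are invented, and idf stands in for whatever term-weighting scheme a collection actually uses.

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Article:
    server: str   # hypothetical host that stores the article
    section: str
    text: str

def build_collection(articles, predicate):
    """A collection is a selection of documents, regardless of server."""
    return [a for a in articles if predicate(a)]

def idf(collection):
    """Inverse document frequency computed within one collection."""
    n = len(collection)
    doc_freq = Counter()
    for article in collection:
        doc_freq.update(set(article.text.lower().split()))
    return {term: math.log(n / df) for term, df in doc_freq.items()}

# Illustrative data only.
ARTICLES = [
    Article("news-a.example", "movies", "film review a tense thriller"),
    Article("news-b.example", "movies", "film review a gentle comedy"),
    Article("news-a.example", "business", "chip maker earnings rise"),
    Article("news-b.example", "business", "film studio earnings fall"),
]

movies = build_collection(ARTICLES, lambda a: a.section == "movies")
business = build_collection(ARTICLES, lambda a: a.section == "business")
movie_idf = idf(movies)
business_idf = idf(business)
```

Both collections draw on both servers. The word "film" carries no weight among the movie reviews, where every article contains it, but it is a distinguishing term among the business articles.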

Recent work in distributed indexing has focused mostly on efficient indexing -- minimizing load on servers and keeping indexes small. This is accomplished in part by indexing a surrogate for each document that includes only part of the text (in Harvest [1], the first 100 lines of text and the first line of later paragraphs).

There is a tension between efficient indexing and collection-based indexing: the indexing scheme that works best in general is not necessarily the best for a specific collection. An indexing surrogate may omit important terms that occur late in the document or misrepresent the frequency of particular terms.
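The effect is easy to reproduce. The sketch below applies the surrogate rule described above (the first lines of the text plus the first line of each later paragraph) to a short sample. It is not Harvest's code: the function name is made up, paragraphs are assumed to be separated by blank lines, and the cutoff is lowered from 100 lines to 2 so the example stays small.

```python
def make_surrogate(text, head_lines=100):
    """Keep the first `head_lines` lines, then only paragraph-opening lines."""
    lines = text.splitlines()
    kept = lines[:head_lines]
    for i in range(head_lines, len(lines)):
        previous_blank = i == 0 or not lines[i - 1].strip()
        if lines[i].strip() and previous_blank:
            kept.append(lines[i])  # first line of a later paragraph
    return "\n".join(kept)

# Illustrative document: the term "zebra" occurs only late in the text.
document = "\n".join([
    "Title line",
    "intro sentence",
    "more intro",
    "",
    "Second paragraph opens",
    "key term zebra appears here",
    "",
    "Third paragraph opens",
    "zebra again",
])

surrogate = make_surrogate(document, head_lines=2)
full_count = document.split().count("zebra")
surrogate_count = surrogate.split().count("zebra")
```

The full text mentions "zebra" twice, but the surrogate never does, so an index built from the surrogate cannot weight or even match that term.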

We can address this tension, in part, by creating a more flexible infrastructure that allows multiple indexing schemes access to the full content of the documents they are indexing. Where a Harvest gatherer describes a single surrogate for a document, a more flexible gatherer would generate surrogates according to a particular index's specifications.

Ideally, the system should be flexible enough to allow very different indexing schemes, including indexes that record word proximity information, n-gram based approaches that don't focus on words per se, and knowledge-based or natural language processing approaches. One possibility is for indexes to send the gatherer a program for generating document surrogates. The gatherer could run the program and return the results to the index.
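One way to picture this arrangement is sketched below. The Gatherer class and both surrogate functions are hypothetical, and a Python callable stands in for the program an index would send. A real system would also need a wire format for shipping the program and a way to run it safely, which this sketch leaves out.

```python
from collections import Counter

class Gatherer:
    """Hypothetical gatherer with access to the full text of its documents."""

    def __init__(self, documents):
        self._documents = dict(documents)  # document id -> full text

    def run(self, surrogate_program):
        """Apply an index-supplied function to every full document."""
        return {doc_id: surrogate_program(text)
                for doc_id, text in self._documents.items()}

def word_positions(text):
    """Surrogate for a proximity index: term -> list of word offsets."""
    positions = {}
    for offset, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(offset)
    return positions

def char_trigrams(text):
    """Surrogate for an n-gram index: counts of 3-character substrings."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

gatherer = Gatherer({"doc1": "to be or not to be"})
proximity_surrogates = gatherer.run(word_positions)
trigram_surrogates = gatherer.run(char_trigrams)
```

The same gatherer serves two indexes with unrelated notions of a surrogate, and neither index needs to fetch the documents itself.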

References

  1. C. Mic Bowman, et al. The Harvest Information Discovery and Access System. In Proc. of the 2nd World-Wide Web Conf., Chicago, December 1994.
  2. D. Cutting, et al. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proc. of SIGIR '92, Copenhagen, Denmark, June 1992.
  3. M. Sheldon. Content Routing: A Scalable Architecture for Network-Based Information Discovery. PhD thesis, MIT Dept. of EECS, October 1995.
  4. C. Viles and J. French. Dissemination of Collection Wide Information in a Distributed Information Retrieval System. In Proc. of SIGIR '95, Seattle, Washington, July 1995.
  5. R. Weiss, et al. HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering. In Proc. of Hypertext '96, Washington, DC, March 1996.

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.