Approximating a Single Index by Combining the Results of Distributed Searches @

Kshitij Shah# Haym Hirsh*Leon Shklaro*
#Bell Communications Research
444 Hoes Lane
Piscataway, NJ 08854
oBell Communications Research
445 South Street
Morristown, NJ 07960
*Rutgers University
Department of Computer Science
Piscataway, NJ 08855

It is becoming increasingly infeasible to maintain a single index for the enormous amount of information on the World Wide Web. Consequently, it has become important to be able to access multiple heterogenous indices in the execution of a single information query. The objective of this work is to project an image of a single index even when queries are actually executed against multiple indices distributed across many machines. Our approach to this problem is to meaningfully merge the documents retrieved when executing the query against multiple indices. The main complexity in doing this is that for most ranked retrieval indexing technologies the scores of retrieved documents only make sense relative to a particular index.

We are exploring three specific approaches to this problem:

  1. Indexing a temporary repository of retrievals: Here we create a temporary repository from the union of all the documents retrieved using each index. We execute the original query against an index created from this new, temporary repository.
  2. Learning scaling factors: This method finds query-specific scaling factors to renormalize the scores of the documents retrieved using each index. The union of the results is then sorted using the resulting scores.
  3. Learning ordering rules: Here we learn query-specific rules for identifying the desired relative order for any pair of retrieved documents. At query-execution time we use these learned rules to order the union of the retrieved results.

Our methods for each assume access to different levels of information about each document:

To test how well we project an image of a single index we are evaluating our methods by testing the extent to which we retrieve the same results that would be obtained when executing a query against a single index for all documents (the "gold list"). We use two measures for testing how the various experiments perform as compared to the gold list:

We are evaluating our work using two very different indexing technologies -- Wide Area Information Service (WAIS) [2] and Latent Semantic Indexing (LSI) [3]. Our preliminary results show that for WAIS the best results are obtained when the system has access to the full contents of each document so that our temporary-repository approach can be applied, and for LSI, simply sorting the union of retrieved results is comparable to more sophisticated methods that require much more effort with little gain in performance.

[1] M. Kendall and J. Gibbons, "Rank Correlation Methods", Fifth edition, Oxford University Press, 1996.

[2] B. Kahle and A. Medlar, "An Information System for Corporate Users: Wide Area Information Service", Connexions - The Interoperability Report, 5(11), November 1991.

[3] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Hashman, "Indexing by Latent Semantic Indexing", Journal of the American Society for Information Science, 41(6), 1990.

@Footnote: This is a position paper in response to the CFP for the Distributed Indexing/Searching Workshop to be held at MIT, May 28-29, 1996 Sponsored by the World Wide Web Consortium

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.