Kshitij Shah# | Haym Hirsh* | Leon Shklaro* |
#Bell
Communications Research 444 Hoes Lane Piscataway, NJ 08854 | oBell Communications
Research 445 South Street Morristown, NJ 07960 | *Rutgers University Department of Computer Science Piscataway, NJ 08855 |
It is becoming increasingly infeasible to maintain a single index for the enormous amount of information on the World Wide Web. Consequently, it has become important to be able to access multiple heterogenous indices in the execution of a single information query. The objective of this work is to project an image of a single index even when queries are actually executed against multiple indices distributed across many machines. Our approach to this problem is to meaningfully merge the documents retrieved when executing the query against multiple indices. The main complexity in doing this is that for most ranked retrieval indexing technologies the scores of retrieved documents only make sense relative to a particular index.
We are exploring three specific approaches to this problem:
Our methods for each assume access to different levels of information about each document:
To test how well we project an image of a single index we are evaluating our methods by testing the extent to which we retrieve the same results that would be obtained when executing a query against a single index for all documents (the "gold list"). We use two measures for testing how the various experiments perform as compared to the gold list:
We are evaluating our work using two very different indexing technologies -- Wide Area Information Service (WAIS) [2] and Latent Semantic Indexing (LSI) [3]. Our preliminary results show that for WAIS the best results are obtained when the system has access to the full contents of each document so that our temporary-repository approach can be applied, and for LSI, simply sorting the union of retrieved results is comparable to more sophisticated methods that require much more effort with little gain in performance.
[1] M. Kendall and J. Gibbons, "Rank Correlation Methods", Fifth edition, Oxford University Press, 1996.
[2] B. Kahle and A. Medlar, "An Information System for Corporate Users: Wide Area Information Service", Connexions - The Interoperability Report, 5(11), November 1991.
[3] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Hashman, "Indexing by Latent Semantic Indexing", Journal of the American Society for Information Science, 41(6), 1990.