Presented at the Distributed Indexing/Searching Workshop sponsored by W3C.

Abstract

This paper describes how Infoseek is approaching the problem of
distributed search and retrieval on the Internet.

WWW master site list

The Comprehensive List of Sites was not available at the time this
paper was written (May 15). We need a reliable and complete list of
all WWW sites that robots can retrieve. The list should also be
searchable by people using a fielded search and include basic contact
information. Infoseek would be happy to host such a list as a public
service.

Additional robots files needed

In order to minimize net traffic caused by robots and increase
the currency of data indexed, we propose that each WWW site create
a "robots1.txt" file containing a list of all files modified within
the last 24 hours that a robot would be interested in indexing,
e.g., the output from:
    (cd $SERVER_ROOT; find . -mtime -1 -print > robots1.txt)

In addition, "robots7.txt", "robots30.txt", and "robots0.txt" files
should be created by a cron script on a daily basis. The 7 and 30
files cover the last 7 and 30 days respectively; the robots0.txt file
would hold the complete list of all files indexable from the web site
(including all isolated files). This proposal has the advantages of
ease of installation (in most cases, a few simple crontab entries) and
compatibility with all existing WWW servers.
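For concreteness, the crontab entries might look like the sketch
below; the document root path and the 02:00 schedule are illustrative
assumptions, not part of the proposal:

    # Illustrative crontab entries; the document root path is hypothetical.
    # Rebuild the robots*.txt lists once a day at 02:00.
    0 2 * * * cd /usr/local/www && find . -mtime -1 -print > robots1.txt
    0 2 * * * cd /usr/local/www && find . -mtime -7 -print > robots7.txt
    0 2 * * * cd /usr/local/www && find . -mtime -30 -print > robots30.txt
    0 2 * * * cd /usr/local/www && find . -print > robots0.txt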

Collection identification

Infoseek's new full text indexing software (Ultraseek) creates a
sophisticated fingerprint file during the indexing process. This
fingerprint file can be adjusted by the user to contain every word and
multi-word phrase from the original corpus as well as a score for each
word and phrase. The user can also set a significance threshold for
more concise output. A requestor of the fingerprint file could
likewise apply its own threshold, though this would require a more
sophisticated interface than HTTP or FTP.
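The fingerprint format has not yet been published, so the following
Python sketch is only a guess at its shape: a map from each word or
phrase to a significance score, with thresholding as simple
truncation. All terms and scores here are invented for illustration.

    # Hypothetical fingerprint contents; the real Ultraseek format is
    # unpublished.
    fingerprint = {
        "distributed search": 0.92,   # multi-word phrase, high significance
        "full text indexing": 0.85,
        "robots": 0.61,
        "workshop": 0.08,             # low significance
    }

    def apply_threshold(fp, threshold):
        """Keep only the words and phrases whose score meets the threshold."""
        return {term: score for term, score in fp.items() if score >= threshold}

    concise = apply_threshold(fingerprint, 0.5)   # drops "workshop"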
Ultraseek is capable of running a user's query against
a meta-index of fingerprint files to
determine, with excellent precision, a rank-ordered list of the best
collections to run the query against. No manual indexing is required
for each collection. Once the system has been stabilized, we will
make the data formats publicly available.
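As a rough illustration of how such a meta-index could drive
collection selection, the sketch below scores each collection by
summing the fingerprint significance of the query terms; this scoring
rule is an assumption for exposition, not the actual Ultraseek
algorithm.

    def rank_collections(query_terms, meta_index):
        """Rank collections by the summed significance of the query terms
        in each collection's fingerprint. A stand-in scoring rule, not
        the actual Ultraseek algorithm."""
        scores = []
        for name, fingerprint in meta_index.items():
            score = sum(fingerprint.get(term, 0.0) for term in query_terms)
            if score > 0:
                scores.append((score, name))
        scores.sort(reverse=True)
        return [name for _, name in scores]

    # Example: pick the best collections for a two-term query.
    meta_index = {
        "site-a": {"robots": 0.61},
        "site-b": {"robots": 0.2, "indexing": 0.7},
    }
    best = rank_collections(["robots", "indexing"], meta_index)
    # -> ["site-b", "site-a"]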

Fusion of search results from heterogeneous servers

Ultraseek merges query results from distributed collections
in a unique way. We allow each search engine to handle the query using
the most appropriate scoring algorithms. The resulting DocIDs are
returned to the user, along with a few fundamental statistics about
each of the top ranked documents. This allows the documents to be
precisely re-scored at the user's workstation using a consistent
scoring algorithm. This approach is very efficient (an IDF collection
pass is not required), heterogeneous search engines are supported
(e.g., Verity and PLS), and, most importantly, a document's score is
completely independent of the collection statistics and the search
engine used.
Once the fundamental statistics have stabilized, we will make the
statistics spec and protocol publicly available. We currently plan
to use ILU to communicate between servers.
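Since the statistics spec has not yet been published, the sketch
below simply assumes each returned hit carries raw term frequencies
and a document length, and that the client aggregates document counts
across the collections it queried; every field name here is an
assumption, not the actual protocol.

    import math

    def rescore(hits, query_terms, total_docs, doc_freq):
        """Re-score merged hits with one consistent tf-idf formula so
        scores are comparable no matter which engine produced each hit.
        hits: [{"docid": ..., "tf": {term: count}, "length": word_count}]
        total_docs / doc_freq: counts aggregated over all queried
        collections."""
        rescored = []
        for hit in hits:
            score = 0.0
            for term in query_terms:
                tf = hit["tf"].get(term, 0)
                if tf and doc_freq.get(term):
                    idf = math.log(total_docs / doc_freq[term])
                    score += (tf / hit["length"]) * idf
            rescored.append((score, hit["docid"]))
        rescored.sort(reverse=True)
        return rescored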