Presented at the Distributed Indexing/Searching Workshop sponsored by W3C.

Abstract

This paper describes how Infoseek is approaching the problem of
distributed search and retrieval on the Internet.

WWW master site list

The Comprehensive List of Sites was not available at the time this
paper was written (May 15). We need a reliable and complete list of
all WWW sites that robots can retrieve. The list should also be
searchable by people using a fielded search and include basic contact
information. Infoseek would be happy to host such a list as a public
service.

Additional robots files needed

In order to minimize net traffic caused by robots and increase
the currency of data indexed, we propose that each WWW site create
a "robots1.txt" file containing a list of all files modified within
the last 24 hours that a robot would be interested in indexing,
e.g., the output from:
    (cd $SERVER_ROOT; find . -mtime -1 -print > robots1.txt)

In addition, "robots7.txt", "robots30.txt", and "robots0.txt" files
should be created by a cron script on a daily basis. The 7 and 30
files cover the last 7 and 30 days respectively; the robots0.txt file
would hold the complete list of all files indexable from the web site
(including all isolated files). This proposal has the advantages of
ease of installation (in most cases, a few simple crontab entries) and
compatibility with all existing WWW servers.
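For concreteness, the crontab entries might look like the sketch
below; the document root path and the 02:00 schedule are illustrative
assumptions, not part of the proposal:

    # Illustrative crontab entries; the document root path is hypothetical.
    # Rebuild the robots*.txt lists once a day at 02:00.
    0 2 * * * cd /usr/local/www && find . -mtime -1 -print > robots1.txt
    0 2 * * * cd /usr/local/www && find . -mtime -7 -print > robots7.txt
    0 2 * * * cd /usr/local/www && find . -mtime -30 -print > robots30.txt
    0 2 * * * cd /usr/local/www && find . -print > robots0.txt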

Collection identification

Infoseek's new full text indexing software (Ultraseek) creates a
sophisticated fingerprint file during the indexing process. This
fingerprint file can be adjusted by the user to contain every word and
multi-word phrase from the original corpus as well as a score for each
word and phrase. The user can also set a significance threshold for
more concise output. A requestor of the fingerprint file could
likewise apply its own threshold, though this would require a more
sophisticated interface than HTTP or FTP.
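The fingerprint format has not yet been published, so the following
Python sketch is only a guess at its shape: a map from each word or
phrase to a significance score, with thresholding as simple
truncation. All terms and scores here are invented for illustration.

    # Hypothetical fingerprint contents; the real Ultraseek format is
    # unpublished.
    fingerprint = {
        "distributed search": 0.92,   # multi-word phrase, high significance
        "full text indexing": 0.85,
        "robots": 0.61,
        "workshop": 0.08,             # low significance
    }

    def apply_threshold(fp, threshold):
        """Keep only the words and phrases whose score meets the threshold."""
        return {term: score for term, score in fp.items() if score >= threshold}

    concise = apply_threshold(fingerprint, 0.5)   # drops "workshop"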
Ultraseek is capable of running a user's query against
a meta-index of fingerprint files to
determine, with excellent precision, a rank-ordered list of the best
collections to run the query against. No manual indexing is required
for each collection. Once the system has been stabilized, we will
make the data formats publicly available.
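As a rough illustration of how such a meta-index could drive
collection selection, the sketch below scores each collection by
summing the fingerprint significance of the query terms; this scoring
rule is an assumption for exposition, not the actual Ultraseek
algorithm.

    def rank_collections(query_terms, meta_index):
        """Rank collections by the summed significance of the query terms
        in each collection's fingerprint. A stand-in scoring rule, not
        the actual Ultraseek algorithm."""
        scores = []
        for name, fingerprint in meta_index.items():
            score = sum(fingerprint.get(term, 0.0) for term in query_terms)
            if score > 0:
                scores.append((score, name))
        scores.sort(reverse=True)
        return [name for _, name in scores]

    # Example: pick the best collections for a two-term query.
    meta_index = {
        "site-a": {"robots": 0.61},
        "site-b": {"robots": 0.2, "indexing": 0.7},
    }
    best = rank_collections(["robots", "indexing"], meta_index)
    # -> ["site-b", "site-a"]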

Fusion of search results from heterogeneous servers

Ultraseek merges query results from distributed collections
in a unique way. We allow each search engine to handle the query using
the most appropriate scoring algorithms. The resulting DocIDs are
returned to the user, along with a few fundamental statistics about
each of the top ranked documents. This allows the documents to be
precisely re-scored at the user's workstation using a consistent
scoring algorithm. This approach is very efficient (an IDF collection
pass is not required), heterogeneous search engines are supported
(e.g., Verity and PLS), and, most importantly, a document's score is
completely independent of the collection statistics and the search
engine used.
Once the fundamental statistics have stabilized, we will make the
statistics spec and protocol publicly available. We currently plan
to use ILU to communicate between servers.
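Since the statistics spec has not yet been published, the sketch
below simply assumes each returned hit carries raw term frequencies
and a document length, and that the client aggregates document counts
across the collections it queried; every field name here is an
assumption, not the actual protocol.

    import math

    def rescore(hits, query_terms, total_docs, doc_freq):
        """Re-score merged hits with one consistent tf-idf formula so
        scores are comparable no matter which engine produced each hit.
        hits: [{"docid": ..., "tf": {term: count}, "length": word_count}]
        total_docs / doc_freq: counts aggregated over all queried
        collections."""
        rescored = []
        for hit in hits:
            score = 0.0
            for term in query_terms:
                tf = hit["tf"].get(term, 0)
                if tf and doc_freq.get(term):
                    idf = math.log(total_docs / doc_freq[term])
                    score += (tf / hit["length"]) * idf
            rescored.append((score, hit["docid"]))
        rescored.sort(reverse=True)
        return rescored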