Index and Search Position Paper

Title:    Index and Search Position Paper
Company:  vivid studios
Version:  1.0  May 6 1996
Author:   Christian Mogensen

Vivid's concerns fall into three areas: interoperability, communication, and internationalization.

Interoperability

The lack of a standard for conveying the context of a web document has led to workarounds such as keywords embedded in comments or META tags. Unfortunately, meta-data for non-HTML data is not stored at all, and the early work on self-indexing in ALIWEB and Harvest has not caught on as much as hoped. This has led to the evolution of mega-indexers like Inktomi, AltaVista, and Infoseek. These indexers neither share data with one another nor cooperate with sites when generating index data.

A standard summary or index format and collection point would make it easier for indexers to download an entire website's document collection. As a result, bandwidth and compute resources would be used more efficiently, since indexers would need to hit each website only once.

The solution, then, is to devise a more generic meta-data format that will let both HTML and non-HTML files be indexed and catalogued. The PICS format is one possibility. The important thing is to agree either on one standard that encapsulates the set of annotations or on a meta-standard that would allow gateways between various formats and/or annotation types.
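As a rough illustration of the gateway idea, the sketch below keeps one format-neutral record per resource and renders it into two annotation styles: HTML META tags and plain attribute lines. The record fields and the function names (ResourceRecord, to_meta_tags, to_attribute_lines) are assumptions made for this example; they are not taken from PICS or any other existing standard.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class ResourceRecord:
    """Hypothetical format-neutral description of one resource (HTML or not)."""
    url: str
    media_type: str
    title: str = ""
    keywords: list[str] = field(default_factory=list)


def to_meta_tags(rec: ResourceRecord) -> str:
    """Render the record as HTML META tags, skipping empty fields."""
    pairs = [("title", rec.title), ("keywords", ", ".join(rec.keywords))]
    return "\n".join(
        f'<META NAME="{escape(name)}" CONTENT="{escape(value)}">'
        for name, value in pairs
        if value
    )


def to_attribute_lines(rec: ResourceRecord) -> str:
    """Render the same record as attribute lines, usable for non-HTML files."""
    lines = [f"URL: {rec.url}", f"Media-Type: {rec.media_type}"]
    if rec.title:
        lines.append(f"Title: {rec.title}")
    if rec.keywords:
        lines.append("Keywords: " + ", ".join(rec.keywords))
    return "\n".join(lines)
```

Because both renderings are derived from the same record, a PDF or an image can carry the same annotations as an HTML page even though it has nowhere to embed META tags.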

Communication

A web server is optimized for serving a single document at a time, but an indexer wants collections of documents to work with. Either we need to come up with a special URL to allow indexers easy access to collections or we need to introduce a new service geared to the needs of indexers. Harvest's broker network is a step in the right direction, but the Summary Object Interchange Format (SOIF) needs the flexibility of PICS. The current crop of NSF-funded digital library projects could have a large bearing on this discussion.
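To make the collection idea concrete, here is a minimal sketch of what a collection URL might return: one response containing a summary for every document changed since the indexer's last visit. The record layout is only loosely modelled on SOIF's attribute-value style and is not a conforming SOIF writer. The function names and the (url, modified, attributes) tuple shape are assumptions for this example.

```python
from datetime import datetime


def soif_like_record(url: str, attributes: dict[str, str]) -> str:
    """Serialize one summary as a SOIF-like block (illustrative layout only)."""
    lines = ["@FILE { " + url]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))  # byte length of the value
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines)


def collection_since(documents, since: datetime) -> str:
    """Return summaries for all documents modified after `since`.

    `documents` is an iterable of (url, modified, attributes) tuples.
    """
    changed = [doc for doc in documents if doc[1] > since]
    return "\n\n".join(
        soif_like_record(url, attrs) for url, _modified, attrs in changed
    )
```

An indexer that remembers the time of its last visit could then pick up all changes with a single request instead of re-fetching every document on the site.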

Internationalization

Indexing engines are only now becoming sufficiently HTML/SGML-aware to resolve entity references before storing documents in their repositories. More work is needed to deal with even simple things such as accented letters and varying character sets. A document may exist in multiple language versions, all of which may be served from the same URL depending on the Accept-Language headers sent with the request. Meta-data is required to describe the dimensions on which a document may vary. (This meta-data could be sent as HTTP headers, for example.)
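Both problems can be sketched briefly. The first function below resolves entity references and normalizes accented characters before text is stored. The other two pick a language variant from an Accept-Language header. This is a simplified illustration: the function names are made up for this example, the "*" wildcard is not handled, and the fallback from a regional tag to its primary language is an assumption rather than required behaviour.

```python
import html
import unicodedata


def normalize_text(raw: str) -> str:
    """Resolve entity references, then apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", html.unescape(raw))


def parse_accept_language(header: str) -> list[str]:
    """Return language tags ordered by descending q-value (q=0 dropped)."""
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            key, _, val = param.strip().partition("=")
            if key.strip() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((tag, q))
    prefs.sort(key=lambda p: p[1], reverse=True)  # stable: ties keep header order
    return [tag for tag, q in prefs if q > 0]


def choose_variant(header: str, available: set[str]):
    """Pick the best available language tag, or None if nothing matches."""
    for tag in parse_accept_language(header):
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # assumed fallback, e.g. "en-gb" -> "en"
        if primary in available:
            return primary
    return None
```

An indexer that fetches a URL without recording which variant it received has no way of knowing that other language versions exist under the same address, which is why the dimensions of variation need to be described explicitly.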

In summary, there needs to be more cooperation between indexers and document servers in order to make better use of scarce resources. Content providers have to provide more meta-data as servers and document publishing systems become more complex. Exposing this data to the world in a standard way will add tremendous value to the document collection and ultimately make information more accessible and useful.

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.