Index and Search Position Paper
Company: vivid studios
Version: 1.0 May 6 1996
Author: Christian Mogensen
Vivid's concerns fall into three areas: interoperability,
communication, and internationalization.
Interoperability
The lack of a standard for conveying the context of a web document has led to
workarounds such as keywords embedded in comments or META tags.
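To illustrate the META-tag workaround, here is a minimal sketch of how an indexer might pull embedded keywords out of a page. It uses Python's standard html.parser module; the sample page, the class name MetaTagCollector, and the function extract_meta are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name")
        content = fields.get("content")
        if name and content:
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of META name -> content for one HTML document."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.meta


# Hypothetical sample page, not taken from any real site.
SAMPLE_PAGE = """<html><head>
<title>Example page</title>
<META NAME="keywords" CONTENT="indexing, search, meta-data">
<META NAME="description" CONTENT="A sample page for an indexer.">
</head><body>Hello</body></html>"""

meta = extract_meta(SAMPLE_PAGE)
keywords = [k.strip() for k in meta.get("keywords", "").split(",") if k.strip()]
```

Note that this approach only works for HTML: an image or a PostScript file has nowhere to carry such tags.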
Unfortunately, meta-data for non-HTML data is not stored, and the early work on
ALIWEB and Harvest self-indexing has not caught on as much as hoped. This has
led to the evolution of mega-indexers like Inktomi, AltaVista, and Infoseek.
These indexers share no data, nor do they cooperate with sites when
generating index data.
A standard summary or index format and collection point would
make it easier for indexers to download an entire website's document
collection. As a result, bandwidth and compute resources would
be used more efficiently, since indexers would need to hit each website only once.
The solution, then, is to devise a more generic meta-data format that will
let both HTML and non-HTML files be indexed and catalogued. The PICS
format is one possibility. The important thing is to agree either on
one standard that encapsulates the set of annotations or on a
meta-standard that would allow gateways between various formats and annotation types.
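As a rough sketch of what a format-neutral record could carry, the following Python describes HTML and non-HTML resources with the same structure. This is not PICS syntax; the field names, the extension table, the open-ended annotations mapping, and the example URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from posixpath import splitext
from urllib.parse import urlsplit

# Hypothetical extension table; a real system would use a fuller registry.
CONTENT_TYPES = {
    ".html": "text/html",
    ".pdf": "application/pdf",
    ".ps": "application/postscript",
    ".gif": "image/gif",
}


@dataclass
class ResourceRecord:
    """One catalogue entry, independent of the resource's file format."""
    url: str
    content_type: str
    title: str = ""
    keywords: list = field(default_factory=list)
    # Free-form annotations (ratings, language, and so on) so that
    # different annotation vocabularies can share one container.
    annotations: dict = field(default_factory=dict)


def describe(url, title="", keywords=(), **annotations):
    """Build a record, guessing the content type from the URL's extension."""
    extension = splitext(urlsplit(url).path)[1].lower()
    content_type = CONTENT_TYPES.get(extension, "application/octet-stream")
    return ResourceRecord(url, content_type, title, list(keywords), dict(annotations))


catalogue = [
    describe("http://www.example.com/index.html",
             title="Home page", keywords=["welcome"]),
    describe("http://www.example.com/papers/report.pdf",
             title="Annual report", keywords=["finance"], language="en"),
    describe("http://www.example.com/art/logo.gif",
             title="Company logo", rating="general"),
]
```

The point of the sketch is that the PDF and the GIF get the same kind of entry as the HTML page, which embedded META tags cannot provide.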
Communication
A web server is
optimized for serving a single document at a time, but an indexer
wants collections of documents to work with. Either we need to come
up with a special URL to allow indexers easy access to collections, or
we need to introduce a new service geared to the needs of indexers.
Harvest's broker network is a step in the right direction, but the
Summary Object Interchange Format (SOIF) needs the flexibility of PICS. The current crop of
NSF-funded digital
library projects could have a large bearing on this discussion.
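A minimal sketch of what a whole-collection dump for indexers could look like follows. It is loosely modeled on SOIF's template layout (a template type and URL, then attribute lines that carry a byte count), but the exact syntax shown should be treated as an assumption rather than a faithful SOIF implementation. The function names, attribute names, and example URLs are hypothetical.

```python
def to_summary_object(url, attributes, template_type="FILE"):
    """Serialize one document summary as a SOIF-like text block."""
    lines = ["@%s { %s" % (template_type, url)]
    for name, value in attributes.items():
        # The count is in bytes so that a reader can skip over values
        # without having to interpret their character encoding.
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines)


def dump_collection(documents):
    """Join the summaries of a whole site into one downloadable body."""
    blocks = [to_summary_object(url, attrs) for url, attrs in documents]
    return "\n\n".join(blocks) + "\n"


site = [
    ("http://www.example.com/index.html",
     {"Title": "Home page", "Keywords": "welcome, overview"}),
    ("http://www.example.com/papers/report.pdf",
     {"Title": "Annual report", "Type": "application/pdf"}),
]

collection_body = dump_collection(site)
```

A server could publish collection_body at a single agreed-upon URL, so that an indexer fetches one resource instead of crawling every document.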
Internationalization
Indexing engines are only now becoming HTML/SGML-aware enough to
resolve entity references before storing documents in their repository.
More work is needed to deal with simple things such as accented letters
and varying character sets. A document may exist
in multiple language versions, all of which may exist under the
same URL depending on the Accept-Language headers that are
sent. Meta-data is required to describe the dimensions on which
a document may vary. (This meta-data could be sent as HTTP headers, for example.)
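The multiple-language case can be sketched as follows: a server picks a variant from the Accept-Language header and labels its response so that an indexer can tell the URL has more than one version. Using the Content-Language and Vary response headers for this is one possible approach and an assumption of this sketch; the file names and function names are hypothetical.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language value, best first."""
    choices = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a tag without a q parameter has full preference
        for param in parts[1:]:
            key, _, value = param.strip().partition("=")
            if key.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by order of appearance.
            choices.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(choices)]


def select_variant(accept_language, variants, default="en"):
    """Pick a language variant and the headers that describe the choice."""
    language = default
    for tag in parse_accept_language(accept_language):
        # Try the full tag first, then its primary language ("fr-ca" -> "fr").
        match = next((c for c in (tag, tag.split("-")[0]) if c in variants), None)
        if match:
            language = match
            break
    headers = {
        "Content-Language": language,
        # Signals that the response depends on the request's Accept-Language.
        "Vary": "Accept-Language",
    }
    return variants[language], headers


# Hypothetical variants of one document, all served under the same URL.
VARIANTS = {"en": "index.en.html", "da": "index.da.html", "fr": "index.fr.html"}

body, headers = select_variant("da, en-gb;q=0.8, en;q=0.7", VARIANTS)
```

An indexer that sees such headers knows it has stored only one of several versions and can request the others by sending a different Accept-Language value.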
In summary, there needs to be more cooperation between
indexers and document servers in order to make better use of scarce
resources. Content providers have to provide more meta-data as
servers and document publishing systems become more complex. Exposing
this data to the world in a standard way will add tremendous value to the document
collection and ultimately make information more accessible and useful.
Copyright 1996. vivid studios
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.