Distributed Indexing/Searching Workshop Position Paper
Wayne C. Gramlich
The Internet is ripe for some search and index standards.
Currently, most searching and indexing technology tends
to be rather monolithic. The Harvest architecture provides
a perfectly adequate starting point for thinking about how
to break search and index technology into smaller and more
modular pieces. However, several areas beyond the Harvest
SOIF interface would also benefit from standardization:
-
TQL (Text Query Language)
-
The text search industry would benefit greatly
from standardizing on a search language, just
like the relational database industry standardized
on SQL (more or less). Users would benefit because
they could learn and master one language instead of
a multitude of similar but frustratingly different
text query languages. The language needs to be
designed so that the query features that are present
in almost all query engines are readily available,
while still preserving accessibility to higher level
functions that are only implemented in one or two
query engines. The design of TQL will be quite
challenging and controversial, but ultimately will be
quite well received by the user community.
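
One way to picture the split between common and engine-specific
features is a query tree with a small shared core plus a generic
extension node. The following Python sketch is an illustration only:
the class names, the "stem" operator, and the capability check are
all hypothetical and are not part of any existing TQL proposal.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple, Union


@dataclass(frozen=True)
class Term:
    """Core feature: match a single word."""
    word: str


@dataclass(frozen=True)
class And:
    """Core feature: every operand must match."""
    operands: Tuple["Query", ...]


@dataclass(frozen=True)
class Or:
    """Core feature: at least one operand must match."""
    operands: Tuple["Query", ...]


@dataclass(frozen=True)
class Extension:
    """Engine-specific operator, identified by a (hypothetical) name."""
    name: str
    argument: "Query"


Query = Union[Term, And, Or, Extension]


def required_extensions(query: Query) -> FrozenSet[str]:
    """Collect the names of all engine-specific operators in a query."""
    if isinstance(query, Term):
        return frozenset()
    if isinstance(query, Extension):
        return frozenset({query.name}) | required_extensions(query.argument)
    names: FrozenSet[str] = frozenset()
    for operand in query.operands:
        names |= required_extensions(operand)
    return names


def can_run(query: Query, engine_extensions: Iterable[str]) -> bool:
    """True if the engine supports every extension the query uses."""
    return required_extensions(query) <= frozenset(engine_extensions)
```

A query built only from Term, And, and Or would run on any engine,
while a query containing an Extension node would run only on engines
that advertise that extension.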
-
Spider Helper
-
Right now, spiders have to continually reprobe
the documents to ensure that they have not changed.
This wastes network bandwidth and time. There
needs to be an interface that allows web spiders to
find the documents that have changed since the
last time they probed. This can be organized as
a fairly simple CGI script that returns the list
of all documents that have been
modified/added/deleted since a specified time.
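
The core of such a script could be quite small. The Python sketch
below shows only the change-list logic that a CGI script might wrap.
The DocRecord layout, the changes_since name, and the use of Unix
timestamps are assumptions made for illustration; no particular
format is being proposed here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DocRecord:
    """Hypothetical per-document bookkeeping kept by the server."""
    url: str
    added: float                     # Unix time the document first appeared
    modified: float                  # Unix time of the last content change
    deleted: Optional[float] = None  # Unix time of removal, if removed


def changes_since(records: List[DocRecord], since: float) -> Dict[str, List[str]]:
    """Classify every document that changed after the time `since`."""
    result: Dict[str, List[str]] = {"added": [], "modified": [], "deleted": []}
    for rec in records:
        if rec.deleted is not None:
            # Report a deletion only if the spider could have seen the
            # document before and has not yet been told it is gone.
            if rec.deleted > since and rec.added <= since:
                result["deleted"].append(rec.url)
        elif rec.added > since:
            result["added"].append(rec.url)
        elif rec.modified > since:
            result["modified"].append(rec.url)
    return result
```

A spider would remember the time of its last visit, pass it as
`since`, and then fetch only the documents listed as added or
modified.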
-
Distributed Query Support
-
Right now each query engine implements its own
algorithm for rating query matches that is different
from all other query engines. While a standardized
algorithm is one possible solution, it is unlikely
that there is one "best" algorithm. Instead, it is
possible to contemplate an interface directly to
the inverted index that bypasses the query engine.
This interface would provide the ability to query
a document collection with a list of words and get
back a list of document names and the word positions
(for proximity search) in the documents. A
centralized query engine can collect the information
from a set of document collections and form a coherent
match list using the same rating algorithm rather than
trying to merge a multitude of different query engine
results. Again, this functionality can probably be
shoe-horned into a CGI script to speed deployment.
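
To make the idea concrete, here is a minimal Python sketch of the
two halves: a collection-side lookup that returns raw postings, and
a central routine that applies one rating function to everything it
collects. The index layout, the function names, and the toy scoring
rule (matched words plus a proximity bonus) are assumptions chosen
for illustration, not a proposed standard.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical index layout: word -> document name -> word positions.
Index = Dict[str, Dict[str, List[int]]]
# Lookup result: document name -> word -> word positions.
Postings = Dict[str, Dict[str, List[int]]]


def lookup(index: Index, words: List[str]) -> Postings:
    """Collection side: return raw postings without any scoring."""
    hits: Postings = defaultdict(dict)
    for word in words:
        for doc, positions in index.get(word, {}).items():
            hits[doc][word] = positions
    return dict(hits)


def score(doc_hits: Dict[str, List[int]], words: List[str]) -> float:
    """Toy rating: one point per matched word, plus a bonus that grows
    as consecutive matched query words appear closer together."""
    matched = [w for w in words if w in doc_hits]
    total = float(len(matched))
    for first, second in zip(matched, matched[1:]):
        gap = min(abs(a - b) for a in doc_hits[first] for b in doc_hits[second])
        total += 1.0 / max(gap, 1)
    return total


def federated_search(collections: List[Index], words: List[str]) -> List[Tuple[str, float]]:
    """Central side: gather postings from every collection and rank
    them all with the same rating function."""
    merged: Postings = {}
    for index in collections:
        # Assumes document names are unique across collections.
        merged.update(lookup(index, words))
    ranked = [(doc, score(hits, words)) for doc, hits in merged.items()]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

Because the collections return positions rather than scores, the
central engine is free to replace the rating function without any
change on the collection side.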
-
Document Filter Standardization
-
Right now, each search engine vendor has to write
its own document filter code to extract words
from documents prior to insertion into the
query index. When an organization comes up with
a new document format, it has to go
around to all of the search vendors and ask them
to write a filter for that format. It
would be much easier if a standardized document
filter interface were defined; such an interface
could easily be attached to the Harvest SOIF
interface. This would greatly simplify life for the
search vendors as well as for the organization
introducing the new document format.
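
As a rough illustration, a filter interface could be as simple as a
function from raw document bytes to a set of named attributes, with
a registry keyed by content type. In the Python sketch below, the
registry, the function names, and the plain-text filter are all
hypothetical, and the output is only a simplified SOIF-style record;
the exact SOIF syntax should be taken from the Harvest documentation.

```python
from typing import Callable, Dict

# A filter maps raw document bytes to attribute name -> text value.
Filter = Callable[[bytes], Dict[str, str]]

# Hypothetical registry of filters, keyed by content type.
FILTERS: Dict[str, Filter] = {}


def register_filter(content_type: str, func: Filter) -> None:
    """Make a filter available for one document format."""
    FILTERS[content_type] = func


def plain_text_filter(data: bytes) -> Dict[str, str]:
    """Example filter: first line as the title, all words as the body."""
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()
    title = lines[0].strip() if lines else ""
    return {"Title": title, "Body": " ".join(text.split())}


def to_soif(url: str, attributes: Dict[str, str], template: str = "FILE") -> str:
    """Render attributes as a simplified SOIF-style record, with the
    byte length of each value given in braces after its name."""
    lines = [f"@{template} {{ {url}"]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


def summarize(url: str, content_type: str, data: bytes) -> str:
    """Run the registered filter for this format and emit one record."""
    return to_soif(url, FILTERS[content_type](data))
```

Under this arrangement, the organization that defines a new format
would write one filter and register it, and every search engine
that reads the resulting records would support the format.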
-
HTML Meta Tags
-
Right now HTML only supports the definition of the
<TITLE> tag in the <HEAD>. It would be
relatively easy to expand this list to include
important information such as language, authors,
publisher, publication date, E-mail address,
keywords and phrases, etc. This can be done using
a combination of the <META> and <LINK>
tags so that there is no need to wait for browser
vendors to implement new HTML tags. In addition,
it makes sense to define a new HTML <ABSTRACT>
tag to explicitly delineate the abstract portion of
a paper if it exists. Similarly, it would be useful
to define some tags to support bibliographic entries.
Many search engines could put this additional
document information to good use.
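
An indexer could pick up such information with very little code.
The Python sketch below uses the standard html.parser module to
collect META name/content pairs and LINK relations from a page. The
attribute names in the sample page ("keywords", "language", and a
LINK with REV="made") are illustrative assumptions, not a proposed
vocabulary.

```python
from html.parser import HTMLParser
from typing import Dict, Tuple


class HeadMetadataParser(HTMLParser):
    """Collect META name/content pairs and LINK relations."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.links: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        fields = {name: (value or "") for name, value in attrs}
        if tag == "meta" and "name" in fields:
            self.meta[fields["name"].lower()] = fields.get("content", "")
        elif tag == "link":
            relation = fields.get("rel") or fields.get("rev")
            if relation:
                self.links[relation.lower()] = fields.get("href", "")


def extract_head_metadata(html: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (meta, links) dictionaries for one HTML document."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.meta, parser.links


SAMPLE_PAGE = """
<HTML><HEAD>
<TITLE>Position Paper</TITLE>
<META NAME="keywords" CONTENT="search, indexing">
<META NAME="language" CONTENT="en">
<LINK REV="made" HREF="mailto:author@example.com">
</HEAD><BODY>...</BODY></HTML>
"""
```

Calling extract_head_metadata(SAMPLE_PAGE) yields one dictionary of
META values and one of LINK targets, which an indexer could store
alongside the full-text index.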
While there are other opportunities for standardizing
interfaces for search and index functionality, I believe
that standardizing the interfaces above will be the most
fruitful.
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.