Distributed Indexing/Searching Workshop Position Paper
Wayne C. Gramlich

The Internet is ripe for search and index standards. Currently, most searching and indexing technology tends to be rather monolithic. The Harvest architecture provides a perfectly adequate starting point for thinking about how to break search and index technology into smaller, more modular pieces. However, there are several additional places where standardization would help, above and beyond the Harvest SOIF interface:

TQL (Text Query Language): The text search industry would benefit greatly from standardizing on a search language, just as the relational database industry standardized on SQL (more or less). Users would benefit because they could learn and master one language instead of a multitude of similar but frustratingly different text query languages. The language needs to be designed so that the query features present in almost all query engines are readily available, while still preserving access to higher-level functions that are implemented in only one or two query engines. The design of TQL will be challenging and controversial, but ultimately it will be well received by the user community.
Spider Helper: Right now, spiders have to continually reprobe documents to ensure that they have not changed. This wastes network bandwidth and time. There needs to be an interface that allows web spiders to find the documents that have changed since the last time they probed. This can be organized as a fairly simple CGI script that returns the list of all documents that have been modified, added, or deleted since a specified time.
Distributed Query Support: Right now each query engine implements its own algorithm for rating query matches, and that algorithm differs from those of all other query engines. While a standardized algorithm is one possible solution, it is unlikely that there is one "best" algorithm. Instead, it is possible to contemplate an interface directly to the inverted index that bypasses the query engine. This interface would provide the ability to query a document collection with a list of words and get back a list of document names and the word positions (for proximity search) in the documents. A centralized query engine can collect the information from a set of document collections and form a coherent match list using the same rating algorithm, rather than trying to merge a multitude of different query engine results. Again, this functionality can probably be shoe-horned into a CGI script to speed deployment.
Document Filter Standardization: Right now, each search engine vendor has to write its own document filter code that extracts words from documents prior to insertion into the query index. When an organization comes up with a new document format, the organization has to go around to all of the search vendors and ask them to write a filter for its document format. It would be much easier if a standardized document filter interface could be defined; such an interface could easily be attached to the Harvest SOIF interface. This would greatly simplify life for the search vendors as well as for the organization introducing the new document format.
HTML Meta Tags: Right now HTML only supports the definition of the <TITLE> tag in the <HEAD>. It would be relatively easy to expand this list to include important information such as language, authors, publisher, publication date, e-mail address, keywords and phrases, etc. This can be done using a combination of the <META> and <LINK> tags so that there is no need to wait for browser vendors to implement new HTML tags. In addition, it makes sense to define a new HTML <ABSTRACT> tag to explicitly delineate the abstract portion of a paper, if one exists. Similarly, it would be useful to define some tags to support bibliographic entries. Many search engines would be able to make good use of this additional document information.
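
To make the Spider Helper idea above more concrete, here is a minimal Python sketch of the core of such a script. It assumes the site keeps a change log; the `Change` record, the `changes_since` and `render` functions, and the tab-separated output format are all hypothetical illustrations, not part of any existing interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One entry in a site's change log (hypothetical record layout)."""
    timestamp: int  # seconds since the epoch when the change was recorded
    action: str     # "added", "modified", or "deleted"
    url: str


def changes_since(log, since):
    """Return the most recent change per URL recorded strictly after `since`."""
    latest = {}
    for change in sorted(log, key=lambda c: c.timestamp):
        if change.timestamp > since:
            latest[change.url] = change  # later entries overwrite earlier ones
    return sorted(latest.values(), key=lambda c: c.url)


def render(changes):
    """Format changes as 'action<TAB>timestamp<TAB>url' lines (assumed format)."""
    return "".join(f"{c.action}\t{c.timestamp}\t{c.url}\n" for c in changes)
```

A CGI wrapper would only need to parse the requested time from the query string and print the result of `render(changes_since(log, since))`. A spider could then refetch just the listed documents instead of reprobing the whole site.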
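
The Distributed Query Support proposal above can be sketched in the same way. In this Python sketch, each collection answers a word list with document names and word positions, and a central `search` function ranks every hit with a single shared rating function. The `Collection` class, its `lookup` method, and the placeholder `score` function are assumptions made for illustration; a real rating algorithm would likely also use the positions for proximity.

```python
from collections import defaultdict


class Collection:
    """A document collection exposing its inverted index (hypothetical interface)."""

    def __init__(self, docs):
        # index maps word -> {document name -> list of word positions}
        self.index = defaultdict(dict)
        for name, text in docs.items():
            for position, word in enumerate(text.lower().split()):
                self.index[word].setdefault(name, []).append(position)

    def lookup(self, words):
        """Return {document name: {word: [positions]}} for the query words."""
        hits = defaultdict(dict)
        for word in words:
            word = word.lower()
            for name, positions in self.index.get(word, {}).items():
                hits[name][word] = positions
        return dict(hits)


def score(postings):
    """Placeholder rating: distinct words matched, then total occurrences."""
    return (len(postings), sum(len(p) for p in postings.values()))


def search(collections, words):
    """Query every collection, then rank all hits with the one shared score()."""
    merged = {}
    for collection in collections:
        # Assumes document names are unique across collections.
        merged.update(collection.lookup(words))

    def rank_key(name):
        distinct, total = score(merged[name])
        return (-distinct, -total, name)  # best score first, ties by name

    return sorted(merged, key=rank_key)
```

Because the ranking happens in one place, swapping in a different `score` changes the ordering for every collection at once, which is the point of bypassing the per-site query engines.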
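
For the HTML Meta Tags proposal above, the following Python sketch shows how an indexer might collect such information from <META> and <LINK> tags using the standard library's `html.parser` module. The field names used here (`author`, `keywords`, and a `link:` prefix for <LINK> relations) are assumptions chosen for illustration, not an agreed vocabulary.

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Collect META name/content pairs and LINK relations from a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attributes = dict(attrs)
        if tag == "meta":
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.fields[name.lower()] = content
        elif tag == "link":
            relation = attributes.get("rev") or attributes.get("rel")
            href = attributes.get("href")
            if relation and href is not None:
                # "link:" prefix is a made-up convention for this sketch.
                self.fields["link:" + relation.lower()] = href


def extract_metadata(html):
    """Return a dict of the metadata fields found in `html`."""
    parser = HeadMetadata()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Given a page whose head contains `<META NAME="keywords" CONTENT="indexing, searching">`, `extract_metadata` would report that content under the `keywords` field, ready to be stored alongside the document's words in the index.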

While there are other opportunities for standardizing interfaces for search and index functionality, I believe that standardizing the interfaces above will be the most fruitful.

Wayne C. Gramlich

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.
