Moving Beyond File Retrieval For Distributed Indexing
Michael F. Schwartz, @Home Network

My motivation for co-chairing this workshop was to bring together a cross section of people involved with information server technologies, search technologies, and directory and online services, to discuss where repository interface standards could support better approaches to distributed indexing and searching. Beyond reducing the CPU and network load required for indexing, appropriate repository interface standards could allow the Internet/intranet searching market to grow by removing incompatibilities among current tools and services.

It is not the goal of this workshop to produce a standard; I don't believe it is possible to create a meaningful standard in a room with 50 people. Rather, the workshop will be an opportunity to uncover and discuss areas of mutual concern where standards might gain momentum.

I believe a key step towards establishing appropriate indexing and searching standards is to transcend the current file orientation of indexing. HTTP's object-at-a-time design was never intended to support indexing, and static files such as robots.txt are too flat and inflexible a mechanism to carry many types of meta data. Web crawlers arose to fill a market demand in an environment that provides no other guaranteed means of collecting information. Still, I hope this workshop can establish as a common goal the definition of a collection-oriented, programmatic indexing interface that can be used in addition to crawlers.

In Harvest we created a mechanism in which indexing data could be extracted before being transmitted across the network to an indexer, placed into a structured format (SOIF), and transmitted using a compressed streaming protocol that supports incremental updates. I see three important ways that those basic ideas might be shaped into a more encompassing framework.

First and foremost, I would like to see the ability to negotiate a common query language between a repository and an indexer. This would allow components that happen to speak the same language to communicate without an intermediate translation. It would also allow components to communicate using application-specific languages (e.g., utilizing a geo-spatial meta data standard), or using heavier-weight languages than are common in network information retrieval environments (e.g., SQL).

Second, it should be possible to retrieve the information needed to support the relevance ranking heuristics used by full text indexing systems, in addition to retrieving attribute-value structured meta data. Rather than defining an HTTP "MGET" mechanism, I believe the right approach would be the ability to retrieve a remotely generated index -- either by agreeing on a standard index retrieval format, or through an index format negotiation protocol.

Third, I would like to see standards for remote query interfaces -- both at the language and the user interface levels. In Harvest we defined a simple generic query language (and implemented mappings to several search engines), but in retrospect that was the least successful aspect of the project: because we chose a "least common denominator" approach, it did not support important features like relevance ranking and adjacency operators, and hence the language stood no chance of standardization. The Stanford Digital Library group has more recently taken some steps regarding this difficult problem.
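To make the idea of negotiating a common query language concrete, here is a minimal sketch of one way such a negotiation might work. It is an illustration only, not a description of Harvest or of any existing protocol: the function name, the language labels, and the "indexer states a preference order, repository advertises what it supports" model are all assumptions made for this example.

```python
from typing import Optional, Sequence


def negotiate_query_language(
    indexer_preferences: Sequence[str],
    repository_supported: Sequence[str],
) -> Optional[str]:
    """Pick the indexer's most-preferred language that the repository supports.

    Hypothetical helper: language names are compared case-insensitively,
    and None is returned when the two sides share no language.
    """
    supported = {name.lower() for name in repository_supported}
    for name in indexer_preferences:
        if name.lower() in supported:
            return name
    return None


# Hypothetical language labels, listed in the indexer's order of preference.
indexer_preferences = ["sql", "geo-spatial-metadata", "simple-keyword"]
repository_supported = ["simple-keyword", "SQL"]

chosen = negotiate_query_language(indexer_preferences, repository_supported)
if chosen is None:
    print("No shared query language; an intermediate translation is needed.")
else:
    print(f"Both sides will use: {chosen}")
```

When the two sides share a language, they can communicate directly; only when the result is None would an intermediate translation (or a fallback generic language) be required.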

We included "distributed searching" in the set of topics for this workshop, and I'm curious how the participants will rate the importance of this problem. I am aware of some interesting efforts to solve the problems that arise when merging the relevance-ranked results of a distributed query, but I question the extent to which people will deploy distributed search services in practice. The problem I see is analogous to the reason distributed database systems never really caught on: if it is possible to reach agreement about the schema, one might as well just run the database on a large centralized bank of servers -- especially since CPU, memory, and disk costs are improving faster than network bandwidth. I do believe it will be important to segment global search services into topical or community-focused components, but it's not clear that distributed search is a useful paradigm in such an environment.

Copyright © 1996 Michael F. Schwartz


This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.