DISW '96 Query routing and Searching Breakout

Erik Selberg <selberg@cs.washington.edu>

with many helpful suggestions and edits by:

Ron Daniel <rdanial@lanl.gov>
Ken Weiss <krweiss@ucdavis.edu>
Leslie Daigle <leslie@bunyip.com>
Luis Gravano <gravano@db.Stanford.EDU>

The Query Routing and Engine ID Breakout Session at DISW '96 was attended by a good cross-section of the online search industry. The discussion centered mostly around Query Routing, and then hit Engine Identification and at the end touched on Result Merging Strategies. Among the successes of the workshop was the creation of an informal working group which will explore various implementations of query routing.

There were two sets of problems addressed, which were divided into the broad categories: Where to Search and How to Search. There were also a few topics, namely Query Refinement, that did not easily fit into our simple categorization, and weren't discussed in the meeting.

Where to Search

Problems Addressed

The Query Routing Problem
i.e. given a Query and a set of Servers, to which Servers should the Query be sent?

Required Work

Most of our discussion centered around a centroid-model for query routing, although some alternatives were mentioned. A centroid in it's most rudimentary form is a table which a Query Router uses to see if a particular Query should go to a particular Service. An example of this would be a list of pairs (token, term freq) for each Service registered with the Query Router. The Router can quickly check each token in the query to determine its term frequency at any of the Services. Queries would only be routed to those Services for which the term frequency was above some threshold. The threshold value will vary depending on the nature of the underlying data and on performance considerations.

We agreed on the basics of a format for centroids to enable query routing. The general framework of the Whois++ Centroids specified in RFC 1913/1914 seemed appropriate enough, although there was discussion as to shortcomings of those standards in relation to general databases vs White Pages (e.g. there are currently no provisions for flags describing the centroid as being created using stemming or stop lists). Alternatively, the informal agreement that Stanford is coordinating proposes centroids that include the content summaries that we agreed upon (i.e., the words in the collection plus their document frequencies), together with information about stemming and stop words, for example.

Short Term Goals (90 day range)

Additions / Modifications to RFC 1913/1914
Some headers that were mentioned are:
- Stop list
- Stemming flag
- Language
- Administrative contact
- Field names/attribute key
The Stanford informal agreement already includes all these headers, with the exception of the Administrative contact.
Creation of an Informal Working Group
The first step to this has already been completed; Ken Weiss volunteered to maintain the q-route mailing list at q-route@ucdavis.edu. To subscribe, send mail to listproc@ucdavis.edu with the body text: subscribe q-route. This mailing list will serve as the forum by which members of the Query Routing Working Group can express their ideas, voice concerns, etc.
Easy tools to create centroids
Ron Daniel <rdanial@lanl.gov> will be doing some work with lq-text in the near future.

Medium (end of 1996) and Long Term Goals

Prototype Implementation
Erik Selberg of UW has offered (threatened? :) to co-ordinate an implementation of query routing on the MetaCrawler web search service (http://www.metacrawler.com). The idea is that the RFC and suggested modifications can be implemented and we can see what problems arise therein. The first step in this is to identify other parties interested.
Apart from the technological issues of defining and implementing an efficient query routing standard, it is of paramount importance to consider the issues facing existing search engine systems (e.g. Verify, Fulcrum) which have not been designed with centroids in mind. That is, what tools and techniques will be necessary in order to allow existing systems to generate effective standard centroid-information?
Some answers can be drawn from existing work with centroid-passing software, such as Bunyip's Digger software, which implements the Whois++ protocol. This is currently being used for meshes of white pages (people) data, and there are other projects underway for using this with corpora of other template-based data (e.g., metadata).
Potential problems arise when engines use stemming. Some agreement on the format of Centroids when they use stemming is going to be needed.
One question which needs to be explored are the performance dimensions of the Centroids, i.e. How big are they? How much CPU time on the server's end do they take to compute?
There are significant research efforts already underway to explore the scaling issues associated with the use of centroids in query routing. One example is the NSF/CNIDR/MCNC supported University of California systemwide Whois++ testbed (http://www.ucdavis.edu/whoisplus/). Another example is the GlOSS project at Stanford (http://gloss.stanford.edu).

Related Standards / Scope / References

RFC 1913 (WHOIS++ syntax), RFC 1914 (WHOIS++ interact) based
FIND minutes from at Los Angeles, soon to be updated and improved.
Stanford Digital Library at http://www-db.stanford.edu/~gravano/standards
FIND working group mailing list, subscribe by sending mail to find-request@bunyip.com with the body subscribe.
An archive of this mailing list is available at ftp://ftp.bunyip.com/pub/mailing-lists/find.archive/ as well as the charter for the FIND group.
The Stanford informal agreement
Other Routing Standards (hardware)?

Centroid Issues

These aren't general enough for non-text databases (e.g. image databases), so we should be open for more expansion.
Headers
- Need to expand 1913/14 for databases (defined headers are for white pages and aren't general enough)
- Stemming could be tricky
Comments
Data
- "Enough for clients to rank services"
- <word> <coll freq> (coll freq = # of docs in coll that contain word)
Transport
Patrik said RFC 1913/4 had stuff for this, so we'll use that

Alternatives to Centroid Model

We touched briefly on alternatives, but didn't go into any details. Below are two ideas which got over a sentence's worth of discussion. Mike - this is pretty much all I intent to write for these; they're here more for completeness sake than anything else. Feel free to scrap 'em. -E

DNS-like (query gives implicit routing)
Query Generators

How to Search

This portion of the Breakout was less defined, as it encompassed a hodge-podge of various ideas. Time constraints made us focus on two: Engine ID and the Merging of Heterogeneous Results

Engine ID

Problem Addressed

How does a Client Interact with appropriate Servers?

Related Standards / Scope / References

FIND folks Query Routing model
Resource Discovery?
KQML - Knowledge Query and Manipulation Language, out of UMBC, which is used with agents in the AI arena.

Required Work

We spent a lot of time discussing what the problem we were trying to address actually was. In the end, we determined that a Client needs to obtain information necessary to communicate with a Server, formulate queries, and interpret results. This breaks the problem in two parts: Definition of a Data Structure which contains this information, and Transport of this Data Structure from the Engine / Engine Provider to the Client.

Engine ID Data Structure Def (EID)

We didn't come up with a lot here ourselves; it seems that it'll have to go through some prototypes to ensure everything that's necessary is in the standard. Alternatively, a dynamically expandable array-value list could be used without great difficulty if people are afraid of a pre-defined structured query.

"All the info needed for a Client to interact with a Server"
URL/URI
Collection Description
Protocol / Ports (subsumed by urLs?)
Query language format (e.g. "JoeBob OmniEngine, version 1.2b")
Output format
Could contain active code

Transport

There are lots of solutions here, most of which are compatible with other solutions. Here are two:

EID Originates at the Server and is propagated through any intermediaries (i.e. Query Routers)
EID's are stored in Large Global Tables available at some well-known repository

Merging Results

We spend the last minutes of the session brainstorming on what meta-data is needed to merge heterogeneous results. Mike - this was at the behest of one of the attendees; while others have also done more work on this, it was more for our own internal benefit. I suspect it can be dropped from the global writeup; it's here mostly because of completeness. -E

Problem Addressed

How do you merge results from heterogeneous engines?

Related Standards / Scope / Standards

MetaCrawler -- normalize scores & sum
Ellen Voorhees and George Fox have independently done some stuff, see TREC-III/IV Proceedings
The Stanford informal agreement.

Brainstorm of attributes to merge data

Raw Score
Scale, endian (are high scores better), distribution
Term stats (may be problem for some search engines)
Concept-based expansion?
Cluster result sets
Dupes
Re-index
Thresholding

Issues we didn't get to, but think are important:

Query Language (although prob part of engine ID)
Query refinement
Access methods (to Z39.50 or not to z39.50)

Written by Erik Selberg, 5/31/96
Last Modified 6/10/96

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.