DISW '96 Query routing and Searching Breakout

Erik Selberg <selberg@cs.washington.edu>

with many helpful suggestions and edits by:

Ron Daniel <rdanial@lanl.gov>
Ken Weiss <krweiss@ucdavis.edu>
Leslie Daigle <leslie@bunyip.com>
Luis Gravano <gravano@db.Stanford.EDU>



The Query Routing and Engine ID Breakout Session at DISW '96 was attended by a good cross-section of the online search industry. The discussion centered mostly on Query Routing, then moved to Engine Identification, and at the end touched on Result Merging Strategies. Among the successes of the workshop was the creation of an informal working group which will explore various implementations of query routing.
     Two sets of problems were addressed, divided into two broad categories: Where to Search and How to Search. A few topics, notably Query Refinement, did not fit easily into this simple categorization and weren't discussed in the meeting.

Where to Search

Problems Addressed

Required Work

Most of our discussion centered around a centroid model for query routing, although some alternatives were mentioned. A centroid in its most rudimentary form is a table which a Query Router uses to see if a particular Query should go to a particular Service. An example of this would be a list of pairs (token, term freq) for each Service registered with the Query Router. The Router can quickly check each token in the query to determine its term frequency at any of the Services. Queries would only be routed to those Services for which the term frequency was above some threshold. The threshold value will vary depending on the nature of the underlying data and on performance considerations.
     We agreed on the basics of a format for centroids to enable query routing. The general framework of the Whois++ Centroids specified in RFC 1913/1914 seemed appropriate enough, although there was discussion of the shortcomings of those standards for general databases as opposed to White Pages (e.g., there are currently no provisions for flags describing the centroid as having been created using stemming or stop lists). Alternatively, the informal agreement that Stanford is coordinating proposes centroids that include the content summaries that we agreed upon (i.e., the words in the collection plus their document frequencies), together with related information such as stemming and stop words.
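
The routing check described above can be sketched in a few lines of Python. This is only an illustration, not the Whois++ format or the format the working group agreed on: the `Centroid` class, its field names, the `route` function, and the rule that a Service is selected when any single query token exceeds the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Centroid:
    """Summary of one Service: token -> term frequency (hypothetical layout)."""
    service: str
    term_freq: dict = field(default_factory=dict)
    # Assumed extras, reflecting the flags the group found missing:
    stemmed: bool = False                 # was stemming applied when building?
    stop_words: frozenset = frozenset()   # stop list used, if any


def route(query, centroids, threshold=0):
    """Return the names of Services worth sending the query to.

    Assumption: a Service qualifies if at least one query token has a
    term frequency strictly above the threshold at that Service.
    """
    tokens = query.lower().split()
    selected = []
    for centroid in centroids:
        if any(centroid.term_freq.get(t, 0) > threshold for t in tokens):
            selected.append(centroid.service)
    return selected


if __name__ == "__main__":
    registered = [
        Centroid("A", {"whois": 5, "index": 2}),
        Centroid("B", {"music": 9}),
    ]
    print(route("whois index", registered))
    print(route("music index", registered, threshold=2))
```

Raising the threshold trades recall for fewer Services contacted, which is the performance consideration mentioned above; a Router could just as well require all tokens to pass, or sum the frequencies.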

Short Term Goals (90 day range)

Medium (end of 1996) and Long Term Goals

Related Standards / Scope / References

Centroid Issues

Alternatives to Centroid Model

We touched briefly on alternatives, but didn't go into any details. Below are two ideas which got over a sentence's worth of discussion. Mike - this is pretty much all I intend to write for these; they're here more for completeness' sake than anything else. Feel free to scrap 'em. -E

How to Search

This portion of the Breakout was less defined, as it encompassed a hodge-podge of various ideas. Time constraints made us focus on two: Engine ID and the Merging of Heterogeneous Results.

Engine ID

Problem Addressed

Related Standards / Scope / References

Required Work

We spent a lot of time discussing what problem we were actually trying to address. In the end, we determined that a Client needs to obtain the information necessary to communicate with a Server, formulate queries, and interpret results. This breaks the problem into two parts: Definition of a Data Structure which contains this information, and Transport of this Data Structure from the Engine / Engine Provider to the Client.

Engine ID Data Structure Def (EID)

We didn't come up with a lot here ourselves; it seems that it'll have to go through some prototypes to ensure everything that's necessary is in the standard. Alternatively, a dynamically expandable array-value list could be used without great difficulty if people are afraid of a pre-defined structured query.
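
As a rough illustration of the expandable value list mentioned above, the sketch below keeps an EID as an open-ended set of name/value pairs that can be written out and read back as text. No EID format was defined in the session, so everything here is hypothetical: the `EngineID` class, the `name: value` line format, and the attribute names used in the example.

```python
class EngineID:
    """Open-ended name/value list describing one engine (hypothetical)."""

    def __init__(self, **attributes):
        self.attributes = dict(attributes)

    def set(self, name, value):
        # New attributes can be added at any time; nothing is pre-declared.
        self.attributes[name] = value

    def get(self, name, default=None):
        # Clients should tolerate attributes they do not know or cannot find.
        return self.attributes.get(name, default)

    def dumps(self):
        """Serialize as one 'name: value' line per attribute."""
        return "\n".join(f"{k}: {v}" for k, v in self.attributes.items())

    @classmethod
    def loads(cls, text):
        """Parse the line format written by dumps()."""
        eid = cls()
        for line in text.splitlines():
            if not line.strip():
                continue
            name, _, value = line.partition(":")  # split at the first colon
            eid.set(name.strip(), value.strip())
        return eid


if __name__ == "__main__":
    eid = EngineID(host="search.example.org", query_syntax="boolean")
    eid.set("result_format", "ranked-list")
    print(eid.dumps())
```

Because unknown attributes are simply carried along, prototypes could add fields as needs are discovered without breaking older Clients.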

Transport

There are lots of solutions here, most of which are compatible with other solutions. Here are two:

Merging Results

We spent the last minutes of the session brainstorming on what meta-data is needed to merge heterogeneous results. Mike - this was at the behest of one of the attendees; while others have done more work on this, it was more for our own internal benefit. I suspect it can be dropped from the global writeup; it's here mostly for completeness. -E

Problem Addressed

How do you merge results from heterogeneous engines?

Related Standards / Scope / References

Brainstorm of attributes to merge data



Issues we didn't get to, but think are important:



Written by Erik Selberg, 5/31/96
Last Modified 6/10/96

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.