DISW '96 Query routing and Searching Breakout
with many helpful suggestions and edits by:
Ron Daniel <email@example.com>
Ken Weiss <firstname.lastname@example.org>
Leslie Daigle <email@example.com>
Luis Gravano <gravano@db.Stanford.EDU>
The Query Routing and Engine ID Breakout Session at DISW '96 was
attended by a good cross-section of the online search industry. The
discussion centered mostly around Query Routing, and then hit Engine
Identification and at the end touched on Result Merging Strategies.
Among the successes of the workshop was the creation of an informal
working group which will explore various implementations of query
There were two sets of problems addressed, which were
divided into the broad categories:
Where to Search and
How to Search.
There were also a few topics, namely Query Refinement,
that did not easily fit into our simple categorization, and weren't
discussed in the meeting.
Where to Search
- The Query Routing Problem
i.e. given a Query and a set of Servers, to which Servers should
the Query be sent?
Most of our discussion centered around a centroid-model for query
routing, although some alternatives were mentioned. A centroid in
it's most rudimentary form is a table which a Query Router uses to
see if a particular Query should go to a particular Service. An example of this would be a list of pairs (token, term freq) for each
Service registered with the Query Router. The Router can quickly check each
token in the query to determine its term frequency at any of the Services.
Queries would only be routed to those Services for which the term frequency
was above some threshold. The threshold value will vary depending on the
nature of the underlying data and on performance considerations.
We agreed on the basics of a format for centroids to enable query routing.
The general framework of the Whois++
Centroids specified in RFC 1913/1914
seemed appropriate enough,
although there was discussion as to shortcomings of those standards in
relation to general databases vs White Pages (e.g. there are currently
no provisions for flags describing the centroid as being created
using stemming or stop lists).
Alternatively, the informal agreement that Stanford is coordinating
that include the content summaries that we agreed upon (i.e., the
words in the collection plus their document frequencies), together
with information about stemming and stop words, for example.
Short Term Goals (90 day range)
Medium (end of 1996) and Long Term Goals
- Prototype Implementation
of UW has offered (threatened? :) to co-ordinate an
implementation of query routing on the MetaCrawler web search service
(http://www.metacrawler.com). The idea is that the RFC and suggested
modifications can be implemented and we can see what problems arise
therein. The first step in this is to identify other parties
- Apart from the technological issues of defining and implementing an
efficient query routing standard, it is of paramount importance to consider
the issues facing existing search engine systems (e.g. Verify, Fulcrum) which
have not been designed with centroids in mind. That is, what tools
and techniques will be necessary in order to allow existing systems to
generate effective standard centroid-information?
- Some answers can be drawn from existing work with centroid-passing software,
such as Bunyip's Digger software, which implements the Whois++ protocol.
This is currently being used for meshes of white pages (people) data, and
there are other projects underway for using this with corpora
of other template-based data (e.g., metadata).
- Potential problems arise when engines use stemming. Some agreement
on the format of Centroids when they use stemming is going to be needed.
- One question which needs to be explored are the performance
dimensions of the Centroids, i.e. How big are they? How much CPU time on the
server's end do they take to compute?
- There are significant research efforts already underway to explore the
scaling issues associated with the use of centroids in query routing. One
example is the NSF/CNIDR/MCNC supported University of California systemwide
testbed (http://www.ucdavis.edu/whoisplus/). Another example is the GlOSS project at Stanford
Related Standards / Scope / References
- These aren't general enough for non-text databases (e.g. image
databases), so we should be open for more expansion.
- Need to expand 1913/14 for databases (defined headers are
for white pages and aren't general enough)
- Stemming could be tricky
- "Enough for clients to rank services"
- <word> <coll freq> (coll freq = # of docs in
coll that contain word)
Patrik said RFC 1913/4 had stuff for this, so we'll use that
Alternatives to Centroid Model
We touched briefly on alternatives, but didn't go into any
details. Below are two ideas which got over a sentence's worth of
discussion. Mike - this is pretty much all I intent to write for
these; they're here more for completeness sake than anything
else. Feel free to scrap 'em. -E
- DNS-like (query gives implicit routing)
- Query Generators
How to Search
This portion of the Breakout was less defined, as it encompassed
a hodge-podge of various ideas. Time constraints made us focus
on two: Engine ID and the Merging of Heterogeneous Results
- How does a Client Interact with appropriate Servers?
Related Standards / Scope / References
- FIND folks Query Routing model
- Resource Discovery?
- KQML - Knowledge Query
and Manipulation Language, out of UMBC, which is used with agents
in the AI arena.
We spent a lot of time discussing what the problem we were trying to address
actually was. In the end, we determined that a Client needs to obtain
information necessary to communicate with a Server, formulate queries, and
interpret results. This breaks the problem in two parts: Definition of a Data Structure which contains this
information, and Transport of this Data Structure from
the Engine / Engine Provider to the Client.
Engine ID Data Structure Def (EID)
We didn't come up with a lot here ourselves; it seems that it'll have to go
through some prototypes to ensure everything that's necessary is in the
standard. Alternatively, a dynamically expandable array-value list could be
used without great difficulty if people are afraid of a pre-defined structured
- "All the info needed for a Client to interact with a Server"
- Collection Description
- Protocol / Ports (subsumed by urLs?)
- Query language format (e.g. "JoeBob OmniEngine, version 1.2b")
- Output format
- Could contain active code
There are lots of solutions here, most of which are compatible with other
solutions. Here are two:
- EID Originates at the Server and is propagated through any
intermediaries (i.e. Query Routers)
- EID's are stored in Large Global Tables available at some
We spend the last minutes of the session brainstorming on what meta-data is
needed to merge heterogeneous results.
Mike - this was at the behest of one of the attendees; while
others have also done more work on this, it was more for our own
internal benefit. I suspect it can be dropped from the global writeup;
it's here mostly because of completeness. -E
How do you merge results from heterogeneous engines?
Related Standards / Scope / Standards
- MetaCrawler -- normalize scores & sum
- Ellen Voorhees and George Fox have independently done some stuff,
see TREC-III/IV Proceedings
- The Stanford informal agreement.
Brainstorm of attributes to merge data
- Raw Score
- Scale, endian (are high scores better), distribution
- Term stats (may be problem for some search engines)
- Concept-based expansion?
- Cluster result sets
Issues we didn't get to, but think are important:
- Query Language (although prob part of engine ID)
- Query refinement
- Access methods (to Z39.50 or not to z39.50)
Written by Erik Selberg, 5/31/96
Last Modified 6/10/96
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.