Integrating Heterogeneous Search Engines
Position Paper for the
W3C Distributed Indexing/Searching Workshop

Gary Adams and W. A. Woods, Sun Microsystems Laboratories, Chelmsford, Mass.
contact: Gary.Adams@East.Sun.Com, William.Woods@East.Sun.Com

Introduction

Integrating heterogeneous search engines will require protocols for communicating with search engines about their capabilities and for reporting, in result lists, which scoring method was used and what constitutes a hit. The growing diversity of search methods poses interesting challenges to integration, but these can be addressed if the protocols are sufficiently expressive. For example, the Conceptual Indexing System being developed at Sun Microsystems Laboratories is a concept-matching engine that reports a penalty-based score for dynamically identified text passages. In this dynamic passage retrieval system, scores are assigned to regions of text determined at query time, based on groupings of query terms or conceptually related terms. This differs from document retrieval, which generates scores for entire documents, and from static passage retrieval, which identifies rankable passages at indexing time. Integrating this system with a traditional system requires a way to identify dynamic passages and a way to signal that smaller penalty scores are better.
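The last point can be illustrated with a minimal sketch, assuming each engine labels its hits with a score type. The names Hit, LOWER_IS_BETTER, and rank_hits are hypothetical, and only the direction of PENALTY scores comes from the text above; every other score type is assumed, for illustration, to treat larger values as better. The sketch orders hits within one score type and deliberately refuses to compare scores of different types.

```python
from dataclasses import dataclass

# Assumption: the set of score types for which smaller values are better.
# Only PENALTY is stated in the paper; all other types are assumed here
# to be "larger is better".
LOWER_IS_BETTER = {"PENALTY"}


@dataclass
class Hit:
    url: str
    score: float
    score_type: str


def rank_hits(hits):
    """Return hits sharing a single score type, best hit first."""
    types = {h.score_type for h in hits}
    if len(types) > 1:
        # Scores of different types are not directly comparable.
        raise ValueError("mixed score types: %s" % sorted(types))
    if not hits:
        return []
    # Sort descending unless the score type is penalty-like.
    descending = hits[0].score_type not in LOWER_IS_BETTER
    return sorted(hits, key=lambda h: h.score, reverse=descending)
```

Merging lists that use different score types would need an additional normalization step, which this sketch leaves out.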

Negotiating about Engine Capabilities and Reporting Results

A multi-engine search system may want to interrogate a search engine to determine its capabilities, or to negotiate with the engine about what information the engine should return. For example, it may want to determine whether a given engine supports a proximity operator and, for engines that do not, pass the results through a postprocessing filter. A system that integrates heterogeneous results may also want to ask a search engine to report certain kinds of information for each returned hit, if available, such as hit positions, scores, word frequencies, and the type of score used. One could use SOIF (Summary Object Interchange Format) notation to make such requests. For example, the following might be used to specify desired capabilities, and a similar format could be used to report available capabilities:
@CAPABILITIES-REQUEST { labboot:9112
POSITIONS{1}:	Y
SCORES{1}:	Y
WORD-FREQUENCIES{1}:	Y
SCORE-TYPE{33}:	TWIDF,IDF,PROB,WORD-COUNT,PENALTY
}
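As a rough sketch of how a client might generate such a request, the function below writes one SOIF-style object, computing the number in braces as the byte length of each value. The function name soif_object is hypothetical, and encoding values as UTF-8 and placing the closing brace on its own line are assumptions of this sketch rather than requirements stated in this paper.

```python
def soif_object(template, identifier, attributes):
    """Serialize one SOIF-style object.

    attributes is a list of (name, value) string pairs. The number in
    braces is the size of the value in bytes (UTF-8 assumed here).
    """
    lines = ["@%s { %s" % (template, identifier)]
    for name, value in attributes:
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"


# Rebuild the capabilities request shown above.
request = soif_object("CAPABILITIES-REQUEST", "labboot:9112", [
    ("POSITIONS", "Y"),
    ("SCORES", "Y"),
    ("WORD-FREQUENCIES", "Y"),
    ("SCORE-TYPE", "TWIDF,IDF,PROB,WORD-COUNT,PENALTY"),
])
print(request)
```

Computing the sizes rather than typing them by hand avoids mismatches between the declared and actual value lengths, which a size-driven parser would otherwise misread.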
Returning a result list as a collection of SOIF objects would give a way to encode collateral information about results. For example, the following could be a passage retrieval result:
@DPASSAGE { http://www.sunlabs.com/
SCORE{3}:	.01
SCORE-TYPE{7}:	PENALTY
PASSAGE-REGION{11}:	01736,01895
HIGHLIGHT-REGIONS{23}:	01754,01799 01804,01815
}
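An integrating system would then need to read such objects back. The following is a minimal sketch of a size-driven reader for a single object like the one above; the names parse_soif and regions are hypothetical. Treating each region as a start,end pair of offsets is an assumption, since the paper does not say how regions are measured, and UTF-8 is again assumed for the byte sizes.

```python
import re

# Header line: "@TEMPLATE { identifier"
HEADER = re.compile(rb"\s*@([\w-]+)\s*\{\s*(\S*)[ \t]*\r?\n")
# Attribute prefix: "NAME{size}:<TAB>", followed by exactly size bytes
ATTR = re.compile(rb"\s*([\w-]+)\{(\d+)\}:\t")


def parse_soif(text):
    """Parse one SOIF-style object into (template, identifier, attrs)."""
    data = text.encode("utf-8")
    head = HEADER.match(data)
    if head is None:
        raise ValueError("not a SOIF-style object")
    template = head.group(1).decode("utf-8")
    identifier = head.group(2).decode("utf-8")
    attrs = {}
    pos = head.end()
    while True:
        m = ATTR.match(data, pos)
        if m is None:
            break  # reached the closing brace or the end of input
        size = int(m.group(2))
        start = m.end()
        # Use the declared size, so values may contain any characters.
        attrs[m.group(1).decode("utf-8")] = data[start:start + size].decode("utf-8")
        pos = start + size
    return template, identifier, attrs


def regions(value):
    """Split '01754,01799 01804,01815' into (start, end) integer pairs."""
    return [tuple(int(n) for n in pair.split(",")) for pair in value.split()]


SAMPLE = (
    "@DPASSAGE { http://www.sunlabs.com/\n"
    "SCORE{3}:\t.01\n"
    "SCORE-TYPE{7}:\tPENALTY\n"
    "PASSAGE-REGION{11}:\t01736,01895\n"
    "HIGHLIGHT-REGIONS{23}:\t01754,01799 01804,01815\n"
    "}\n"
)

template, identifier, attrs = parse_soif(SAMPLE)
passage = regions(attrs["PASSAGE-REGION"])
highlights = regions(attrs["HIGHLIGHT-REGIONS"])
```

Because the reader consumes exactly the declared number of bytes for each value, it accepts the closing brace either on its own line or directly after the last value.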

This page is part of the DISW 96 workshop.
Last modified: Tue Jul 9 17:19:02 EST 1996.