Report of the Track II Breakout Group B on Distributed Search Issues

Chair (and Notes Editor): Clifford Lynch (UC Office of the President)

The first part of the session discussed what was actually meant by distributed search and why it was needed; terminology proved to be a problem. The group decided that it might in fact be more productive to draw a distinction among three cases:

- Centrally managed search, where a single designer manages a search service that might be implemented on a single centralized system or as a distributed system.

- Distributed homogeneous search, where multiple, perhaps autonomously managed, implementations of the same software are used as a distributed search system, and searches might be decomposed and routed to the various nodes by some query planning process.

- Federated search, where the same search might be routed to multiple autonomously managed and heterogeneous systems for evaluation, after which the results are combined in some fashion for presentation to the user.

It was noted that the limits to the scalability of centrally managed search were not clear; that distributed and federated search systems were in the early stages of actual deployment, so experience with such systems was relatively limited; and that in fact the highest payoff in using federated search might be the ability to combine systems that used discipline-specific indexing and organization of content, where scaling provides leverage on an intellectual problem of information organization rather than simply a problem of computer system scalability.

Several other important points were noted in passing, although discussion was limited by the time available to the group. These included the observation that the model of a distributed system as it was presented to the user was a critical issue -- whether the user selected the databases or systems to be searched, or whether the search was assigned automatically and transparently to appropriate information resources. Automatic assignment of searches was viewed as still largely a research problem. The central issue here is the level of abstraction presented to the user.

It was noted that much of the focus of the workshop had been on searches submitted by non-expert users, and that there was in fact a segment of the user community that would want very precise control over the formulation and execution of searches -- for example, in scientific environments where the user was searching for genetic sequences or chemical structure matches. It was recognized that such users should not be disenfranchised or penalized because of their expertise.

Finally, it was recognized that there were powerful business-model and economic disincentives to the federation of search services in the Web environment, where one service did not want to be "hidden" behind another, and each service wanted direct control over the user interface in order to gather demographic and usage data and to present advertising to users.

The group developed a number of recommendations; some of these repeated and expanded themes that had already emerged in other breakout sessions. The recommendations for further work intermixed research questions with standards development issues, and time did not permit the two to be carefully distinguished. The recommendations for follow-on work included:

1. Merging ranked result sets from multiple search services. This is a basic technology.
One question is what data is needed from the participating services, and the extent to which rather superficial data could still be used to provide satisfactory results, while avoiding complexity and also skirting the need for participating services to reveal proprietary algorithms. (A rank-fusion sketch along these lines appears after this list.)

2. Duplicate detection and consolidation. The question here is which data elements are most useful in consolidating, and potentially eliminating, duplicate records from multiple search services. (A consolidation sketch appears after this list.) This leads into the broader research question of search result presentation and management. There was general recognition that, as the size of the Web continues to grow, approaches such as clustering and graphical navigation will be needed to avoid confronting users with thousands of potentially relevant "hits" in a simple, impossible-to-manage list.

3. Characterizing databases and search services for query routing was identified as another important area for work. This is more on the research than the standards agenda at present. Some potentially useful work has been done on this problem as part of the Stanford and Michigan Digital Library projects. (A routing sketch appears after this list.)

4. It was noted that the current IBM infoMarket project represents one of the largest attempts to deploy this type of technology at present, and the work on it (and early results) is a valuable source of insights.

5. Finally, it was agreed that it would be useful to spend some additional time trying to achieve consensus on terminology, taxonomies, and architectural models for the various types of centralized, distributed, and federated search service applications. Much of the work in this area dates back to federated and distributed databases in the 1980s, and the networked information environment introduces a number of new features that need to be integrated into models that characterize the kinds of systems under consideration today.
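To make recommendation 1 concrete, the sketch below shows one possible rank-based merging rule, reciprocal rank fusion. It is offered purely as an illustration, not as an approach the group discussed: it uses nothing but the rank positions returned by each service, which is about as superficial as merging data can get, and so sidesteps any need for services to reveal proprietary scoring algorithms. The service names and document identifiers are hypothetical.

    # A minimal sketch of rank-based result merging, assuming each service
    # returns only an ordered list of document identifiers. Reciprocal rank
    # fusion scores a document by the sum of 1/(k + rank) over the lists in
    # which it appears; the constant k damps the influence of top ranks,
    # and 60 is a conventional choice.

    def reciprocal_rank_fusion(result_lists, k=60):
        """Merge several ranked lists of document ids into one ranking."""
        scores = {}
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        # A higher fused score means a better combined rank.
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical results from two autonomous search services.
    service_a = ["doc3", "doc1", "doc7"]
    service_b = ["doc1", "doc9", "doc3"]
    print(reciprocal_rank_fusion([service_a, service_b]))
    # doc1 and doc3 appear in both lists and rise to the top.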
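For recommendation 2, the following sketch consolidates duplicates on a crude match key built from a few superficial data elements. The field names (title, author, year, sources) and the normalization rules are assumptions for illustration, not elements the group endorsed; the point is that duplicate consolidation across services reduces to choosing which elements go into the key and how aggressively to normalize them.

    # A minimal sketch of duplicate consolidation across services, assuming
    # each record is a dict with "title", "author", "year", and "sources"
    # fields (illustrative names, not from the report). Records that
    # normalize to the same key are treated as duplicates and merged,
    # keeping track of every service that returned them.

    import re

    def normalize_key(record):
        """Build a crude match key from a few superficial data elements."""
        title = re.sub(r"\W+", " ", record["title"]).lower().strip()
        author = record["author"].split(",")[0].lower().strip()
        return (title, author, record["year"])

    def consolidate(records):
        merged = {}
        for rec in records:
            key = normalize_key(rec)
            if key in merged:
                merged[key]["sources"].extend(rec["sources"])
            else:
                merged[key] = {**rec, "sources": list(rec["sources"])}
        return list(merged.values())

    # Two hypothetical records for the same item from different services.
    records = [
        {"title": "Federated Search!", "author": "Lynch, C.", "year": 1996,
         "sources": ["service_a"]},
        {"title": "federated search", "author": "Lynch, Clifford", "year": 1996,
         "sources": ["service_b"]},
    ]
    print(consolidate(records))  # One consolidated record, two sources.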
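For recommendation 3, the sketch below routes a query using per-database summaries of term document frequencies, loosely in the spirit of database-characterization work such as Stanford's GlOSS. The summary format, the scoring rule, and the database names are illustrative assumptions, not a standard: the idea is simply that a compact statistical characterization of each service can be enough to decide where a query should go.

    # A minimal sketch of query routing from database summaries. Each
    # database is characterized only by its per-term document frequencies
    # ("df") and its total size; a query is routed to the databases whose
    # summaries suggest the largest fraction of matching documents.

    def route_query(query_terms, summaries, top_n=2):
        """Rank databases by a crude estimate of query-term coverage."""
        scores = {}
        for name, summary in summaries.items():
            df, size = summary["df"], summary["size"]
            # Crude estimate: summed per-term document frequencies,
            # scaled by collection size.
            scores[name] = sum(df.get(t, 0) for t in query_terms) / size
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:top_n]

    # Hypothetical summaries for three autonomous databases.
    summaries = {
        "genomics_db": {"df": {"sequence": 900, "genetic": 800}, "size": 1000},
        "chemistry_db": {"df": {"structure": 700, "compound": 600}, "size": 1000},
        "news_db": {"df": {"sequence": 40, "structure": 30}, "size": 5000},
    }
    print(route_query(["genetic", "sequence"], summaries))
    # The genomics database dominates for this query.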