Report of the Track II Breakout Group B on Distributed Search Issues

Chair (and Notes Editor): Clifford Lynch (UC Office of the President)

The first part of the session discussed what was actually meant by distributed search and why it was needed; terminology proved to be a problem. The group decided that it might in fact be more productive to draw a distinction among three cases:

- Centrally managed search, where a single designer manages a search service that might be implemented on a single centralized system or as a distributed system.

- Distributed homogeneous search, where multiple, perhaps autonomously managed, implementations of the same software are used as a distributed search system, and searches might be decomposed and routed to the various nodes by some query planning process.

- Federated search, where the same search might be routed to multiple autonomously managed and heterogeneous systems for evaluation, after which the results are combined in some fashion for presentation to the user.

It was noted that the limits to the scalability of centrally managed search were not clear; that distributed and federated search systems were in the early stages of actual deployment, so experience with such systems was relatively limited; and that in fact the highest payoff in using federated search might be the ability to combine systems that used discipline-specific indexing and organization of content, where scaling provides leverage on an intellectual problem of information organization rather than simply a problem of computer system scalability.

Several other important points were noted in passing, although discussion was limited by the time available to the group. These included the observation that the model of a distributed system as it was presented to the user was a critical issue -- whether the user selected the databases or systems to be searched, or whether the search was assigned automatically and transparently to appropriate information resources. Automatic assignment of searches was viewed as still largely a research problem. The central issue here is the level of abstraction presented to the user.

It was noted that much of the focus of the workshop had been on searches submitted by non-expert users, and that there was in fact a segment of the user community that would want very precise control over the formulation and execution of searches -- for example, in scientific environments where the user was searching for genetic sequences or chemical structure matches. It was recognized that such users should not be disenfranchised or penalized because of their expertise.

Finally, it was recognized that there were powerful business-model and economic disincentives to the federation of search services in the Web environment, where one service did not want to be "hidden" behind another, and each service wanted direct control over the user interface in order to gather demographic and usage data and to present advertising to users.

The group developed a number of recommendations; some of these repeated and expanded themes that had already emerged in other breakout sessions. The recommendations for further work intermixed research questions with standards development issues, and time did not permit the two to be carefully distinguished. The recommendations for follow-on work included:

1. Merging ranked result sets from multiple search services. This is a basic technology.
One question is what data is needed from the participating services, and the extent to which rather superficial data could still be used to provide satisfactory results, while avoiding complexity and also skirting the need for participating services to reveal proprietary algorithms. (A rank-fusion sketch along these lines appears after this list.)

2. Duplicate detection and consolidation. The question here is which data elements are most useful in consolidating, and potentially eliminating, duplicate records from multiple search services. (A consolidation sketch appears after this list.) This leads into the broader research question of search result presentation and management. There was general recognition that, as the size of the Web continues to grow, approaches such as clustering and graphical navigation will be needed to avoid confronting users with thousands of potentially relevant "hits" in a simple, impossible-to-manage list.

3. Characterizing databases and search services for query routing was identified as another important area for work. This is more on the research than the standards agenda at present. Some potentially useful work has been done on this problem as part of the Stanford and Michigan Digital Library projects. (A routing sketch appears after this list.)

4. It was noted that the current IBM infoMarket project represents one of the largest attempts to deploy this type of technology at present, and the work on it (and early results) is a valuable source of insights.

5. Finally, it was agreed that it would be useful to spend some additional time trying to achieve consensus on terminology, taxonomies, and architectural models for the various types of centralized, distributed, and federated search service applications. Much of the work in this area dates back to federated and distributed databases in the 1980s, and the networked information environment introduces a number of new features that need to be integrated into models that characterize the kinds of systems under consideration today.
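To make recommendation 1 concrete, the sketch below shows one possible rank-based merging rule, reciprocal rank fusion. It is offered purely as an illustration, not as an approach the group discussed: it uses nothing but the rank positions returned by each service, which is about as superficial as merging data can get, and so sidesteps any need for services to reveal proprietary scoring algorithms. The service names and document identifiers are hypothetical.

    # A minimal sketch of rank-based result merging, assuming each service
    # returns only an ordered list of document identifiers. Reciprocal rank
    # fusion scores a document by the sum of 1/(k + rank) over the lists in
    # which it appears; the constant k damps the influence of top ranks,
    # and 60 is a conventional choice.

    def reciprocal_rank_fusion(result_lists, k=60):
        """Merge several ranked lists of document ids into one ranking."""
        scores = {}
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        # A higher fused score means a better combined rank.
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical results from two autonomous search services.
    service_a = ["doc3", "doc1", "doc7"]
    service_b = ["doc1", "doc9", "doc3"]
    print(reciprocal_rank_fusion([service_a, service_b]))
    # doc1 and doc3 appear in both lists and rise to the top.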
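For recommendation 2, the following sketch consolidates duplicates on a crude match key built from a few superficial data elements. The field names (title, author, year, sources) and the normalization rules are assumptions for illustration, not elements the group endorsed; the point is that duplicate consolidation across services reduces to choosing which elements go into the key and how aggressively to normalize them.

    # A minimal sketch of duplicate consolidation across services, assuming
    # each record is a dict with "title", "author", "year", and "sources"
    # fields (illustrative names, not from the report). Records that
    # normalize to the same key are treated as duplicates and merged,
    # keeping track of every service that returned them.

    import re

    def normalize_key(record):
        """Build a crude match key from a few superficial data elements."""
        title = re.sub(r"\W+", " ", record["title"]).lower().strip()
        author = record["author"].split(",")[0].lower().strip()
        return (title, author, record["year"])

    def consolidate(records):
        merged = {}
        for rec in records:
            key = normalize_key(rec)
            if key in merged:
                merged[key]["sources"].extend(rec["sources"])
            else:
                merged[key] = {**rec, "sources": list(rec["sources"])}
        return list(merged.values())

    # Two hypothetical records for the same item from different services.
    records = [
        {"title": "Federated Search!", "author": "Lynch, C.", "year": 1996,
         "sources": ["service_a"]},
        {"title": "federated search", "author": "Lynch, Clifford", "year": 1996,
         "sources": ["service_b"]},
    ]
    print(consolidate(records))  # One consolidated record, two sources.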
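For recommendation 3, the sketch below routes a query using per-database summaries of term document frequencies, loosely in the spirit of database-characterization work such as Stanford's GlOSS. The summary format, the scoring rule, and the database names are illustrative assumptions, not a standard: the idea is simply that a compact statistical characterization of each service can be enough to decide where a query should go.

    # A minimal sketch of query routing from database summaries. Each
    # database is characterized only by its per-term document frequencies
    # ("df") and its total size; a query is routed to the databases whose
    # summaries suggest the largest fraction of matching documents.

    def route_query(query_terms, summaries, top_n=2):
        """Rank databases by a crude estimate of query-term coverage."""
        scores = {}
        for name, summary in summaries.items():
            df, size = summary["df"], summary["size"]
            # Crude estimate: summed per-term document frequencies,
            # scaled by collection size.
            scores[name] = sum(df.get(t, 0) for t in query_terms) / size
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:top_n]

    # Hypothetical summaries for three autonomous databases.
    summaries = {
        "genomics_db": {"df": {"sequence": 900, "genetic": 800}, "size": 1000},
        "chemistry_db": {"df": {"structure": 700, "compound": 600}, "size": 1000},
        "news_db": {"df": {"sequence": 40, "structure": 30}, "size": 5000},
    }
    print(route_query(["genetic", "sequence"], summaries))
    # The genomics database dominates for this query.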