Session IV Track C - Central vs. Distributed Searching/Indexing

Chair (and Notes Editor): Ken Weiss (UC Davis)

This session opened with consideration of some questions: Will central indices survive? Are distributed indices feasible? Are central and distributed architectures mutually exclusive?

Central indices were defined as the model offered by AltaVista, InfoSeek, Excite, Lycos, and others: large indices maintained at a single site, which attempt to collect all (or at least a very significant percentage) of the information available in their chosen domain (WWW, USENET news, and so on). Distributed indices are information systems in which a large number of servers are interconnected, each holding only a part of the full database.

After brief discussion the group concluded that, for the near future, central and distributed services will coexist. There were some concerns about the scaling of central services, particularly in light of the differing growth curves of the quantity of content to be indexed and the capacity of hardware and networks to process and transfer that content. However, the representatives of the central indexing services (most notably Mike Frumkin of Excite) felt that, with improvements in crawling algorithms and the addition of some information to guide spiders, the central services can scale for the next several years. Other issues raised on the viability of central indices included the coming problem of indexing large binary object data (multimedia), and the need to let content providers determine what gets indexed in a richer way than robots.txt can handle.

Distributed indices are still considered research projects; current testbeds have not demonstrated scalability to multi-million-record data collections. However, distributed indices are a promising technology for the creation of virtual communities, in which smaller content providers push their indexing information into a distributed, subject-specific mesh. This approach is more akin to Harvest and Whois++ than to the crawler-based central services. Another possible method for grouping related information would be to post metadata in a form that lets a central index construct a virtual community. In a distributed model there may be a need to publish the source of information along with the data itself, to provide some means of assessing the quality and trustworthiness of the material.

Central indices may evolve into a distributed architecture in response to the problems of scaling. As for-profit servers are aggregated into a distributed mesh, a new business model will have to develop to support the exchange of value-added content by unrelated parties. These models could be as simple as an agreement to retain advertisements when search results are redirected through a metacrawler, or as ambitious as the IBM InfoMarket project.

Once agreement was reached on the definition of the problem, discussion turned to standards areas that will facilitate the scaling of central indices and the development of distributed indices. The following areas were identified as promising for standards work, sorted by time frame:

0-12 months:

Enhancements to robots.txt (a hypothetical sketch follows this list)
* How often should content be indexed (volatility indicator)?
* Last-modified listing
* Explicit request for indexing
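None of these extensions exist in the robots exclusion convention, which defines only User-agent and Disallow; the directive names below (Visit-interval, Last-modified, Please-index) are invented purely to illustrate what a richer robots.txt along these lines might look like:

    # Hypothetical robots.txt extensions -- the directive names are
    # invented for illustration; only User-agent and Disallow are
    # part of the existing robots exclusion convention.
    User-agent: *
    Disallow: /private/

    # Volatility indicator: suggested revisit interval per subtree
    Visit-interval: /news/ 1d
    Visit-interval: /archive/ 30d

    # Last-modified listing: lets a spider skip unchanged subtrees
    Last-modified: /archive/ 1996-05-01

    # Explicit request to index content a crawler would not discover
    Please-index: /catalog/full-listing.html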
Cooperation between search services and WWW server vendors
* Out-of-band data exchange (keep it off HTTP)
* Bulk transfer of content from server to indexer

13-24 months:

Extended server/indexer cooperation
* Negotiated push model for transfer of metadata (RDM? If not, what's missing?)

Infrastructure for topical or semantic virtual communities (a sketch of the push model appears at the end of these notes)
* Common Indexing Protocol (ftp://ds.internic.net/internet-drafts/draft-ietf-find-cip-01.txt)
* Multipoint registration/notification for receipt of index PUSHes (Whois++ model)
* Standard semantic profiles of metadata
* Simple tools to create and distribute indexing/metadata information

25-36 months:
* Multicast or net news model for distribution of metadata
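To make the multipoint-PUSH idea concrete, the following is a minimal sketch in the Whois++/CIP spirit, assuming an invented summary-record format and an invented MeshIndexer interface (make_centroid, receive_push, route); the actual Common Indexing Protocol draft cited above defines its own record formats and transport. A small content provider reduces its documents to a compact word list and pushes it to every subject mesh it has registered with; the mesh servers then answer queries by referral rather than by returning documents.

    # Sketch of multipoint index PUSH into a subject-specific mesh.
    # The record format and MeshIndexer API are hypothetical, not CIP.
    import re

    def make_centroid(server_name, documents):
        """Reduce a provider's documents to a compact word list
        (a 'centroid') that mesh indexers can use for routing."""
        words = set()
        for text in documents:
            words.update(re.findall(r"[a-z]+", text.lower()))
        return {"source": server_name, "terms": sorted(words)}

    class MeshIndexer:
        """A subject-specific index server holding centroids from
        many small content providers."""
        def __init__(self, topic):
            self.topic = topic
            self.centroids = []

        def receive_push(self, centroid):
            # Multipoint notification: every registered indexer
            # receives the same centroid from the provider.
            self.centroids.append(centroid)

        def route(self, term):
            # Referral, not retrieval: answer "which providers might
            # hold this term", as a Whois++ mesh would.
            return [c["source"] for c in self.centroids
                    if term in c["terms"]]

    # A small provider pushes one centroid to two meshes at once.
    docs = ["Grape phylloxera in Davis vineyards",
            "Irrigation schedules for new rootstock"]
    centroid = make_centroid("viticulture.ucdavis.edu", docs)

    indexers = [MeshIndexer("agriculture"), MeshIndexer("california")]
    for ix in indexers:
        ix.receive_push(centroid)

    print(indexers[0].route("phylloxera"))
    # -> ['viticulture.ucdavis.edu']

The point the sketch illustrates is that the mesh holds only compact summaries, never full content, so an indexer stays small even as the number of providers behind it grows; full retrieval happens by referral back to the source, which also preserves the provenance information discussed above.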