Report of Session II, Break-out B - Distributed Data Collection

Chair (and report editor): Carl Lagoze (Cornell)

Our group started out, with the encouragement of the chair, by questioning whether the current model of distributed data collection is really viable in the long term. The current model treats the net as a single "collection" and indexes solely on content rather than metadata. Some members of the group felt that this model does not scale and, by its nature, leads to search results that ignore cross-domain vocabulary problems and other issues that have troubled IR researchers for a long time. Members of the break-out also noted that the current method ignores non-textual documents. A number of participants, while agreeing that the current system is not perfect, felt that it has value and could be tuned to improve its functionality.

In the end, we concluded that we really need to discuss two separate problems: the "searching the Internet as a whole" problem (or wading through chaos), and the more controlled search of a selected information space.

Solutions to the first problem lie in improving robots.txt so that it better guides spiders. Group members agreed that robots.txt should perform a "search here" function in addition to the current "don't search there" (a hypothetical sketch of such a file appears below).

We then discussed approaches to the second problem: methods for attaching differing qualities and types of metadata to networked objects. Spiders and crawlers could then partition networked information into documents with no metadata and documents with metadata, and further partition the second set by the type of metadata present (e.g., Dublin Core records, geo-spatial records, MARC records); a sketch of such a partitioning also appears below. This might be a start on defining different qualities, or levels of integrity, of information in the global information space.
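To make the robots.txt discussion concrete, here is a sketch of what a file offering both functions might look like. Only User-agent and Disallow are part of the robots exclusion convention as it stands; the Allow lines are an assumed "search here" extension, shown for illustration only.

    # Current convention: "don't search there"
    User-agent: *
    Disallow: /private/
    Disallow: /tmp/

    # Hypothetical "search here" extension (an assumption, not part
    # of the current convention): point spiders at the parts of the
    # site that are worth indexing.
    Allow: /papers/
    Allow: /reports/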
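The partitioning of documents by metadata could work roughly as follows. This is a minimal Python sketch under the assumption that metadata is carried in HTML meta tags and that tag-name prefixes (e.g., "DC." for Dublin Core) identify the metadata type; the prefixes and the classify_document function are illustrative inventions, not an agreed standard.

    from html.parser import HTMLParser

    # Assumed mapping from meta-tag name prefixes to metadata types;
    # these prefixes are illustrative, not an agreed standard.
    METADATA_PREFIXES = {
        "DC.": "dublin-core",
        "GEO.": "geo-spatial",
        "MARC.": "marc",
    }

    class MetaTagScanner(HTMLParser):
        """Collects the 'name' attribute of every <meta> tag on a page."""
        def __init__(self):
            super().__init__()
            self.meta_names = []

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                for key, value in attrs:
                    if key == "name" and value:
                        self.meta_names.append(value)

    def classify_document(html_text):
        """Partition a document: {'no-metadata'}, or the set of
        metadata types found (dublin-core, geo-spatial, marc)."""
        scanner = MetaTagScanner()
        scanner.feed(html_text)
        found = {
            mtype
            for name in scanner.meta_names
            for prefix, mtype in METADATA_PREFIXES.items()
            if name.startswith(prefix)
        }
        return found or {"no-metadata"}

    # Example: a page carrying a Dublin Core record.
    page = '<html><head><meta name="DC.Title" content="A Paper"></head></html>'
    print(classify_document(page))  # {'dublin-core'}

A spider could run such a classifier over each fetched document and route it into the corresponding partition of its index.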