Report of Session II, Break-out B - Distributed Data Collection

Chair (and report editor): Carl Lagoze (Cornell)

Our group started out, with the encouragement of the chair, by questioning whether the current model of distributed data collection is really viable in the long term. The current model treats the net as a single "collection" and indexes solely on content rather than metadata. Some members of the group felt that this model does not scale and, by its nature, leads to search results that ignore cross-domain vocabulary problems and other issues that have troubled IR researchers for a long time. Members of the break-out also noted that the current method ignores non-textual documents. A number of participants, while agreeing that the current system is not perfect, felt that it has value and could be tuned to improve its functionality.

In the end, we concluded that we really need to discuss two separate problems: the "searching the Internet as a whole" problem (or wading through chaos), and the more controlled search of a selected information space.

Solutions to the first problem lie in improving robots.txt so that it better guides spiders. Group members agreed that robots.txt should perform a "search here" function in addition to the current "don't search there" (a hypothetical sketch of such a file appears below).

We then discussed approaches to the second problem: methods for attaching differing qualities and types of metadata to networked objects. Spiders and crawlers could then partition networked information into documents with no metadata and documents with metadata, and further partition the second set by the type of metadata present (e.g., Dublin Core records, geo-spatial records, MARC records); a sketch of such a partitioning also appears below. This might be a start on defining different qualities, or levels of integrity, of information in the global information space.
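To make the robots.txt discussion concrete, here is a sketch of what a file offering both functions might look like. Only User-agent and Disallow are part of the robots exclusion convention as it stands; the Allow lines are an assumed "search here" extension, shown for illustration only.

    # Current convention: "don't search there"
    User-agent: *
    Disallow: /private/
    Disallow: /tmp/

    # Hypothetical "search here" extension (an assumption, not part
    # of the current convention): point spiders at the parts of the
    # site that are worth indexing.
    Allow: /papers/
    Allow: /reports/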
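The partitioning of documents by metadata could work roughly as follows. This is a minimal Python sketch under the assumption that metadata is carried in HTML meta tags and that tag-name prefixes (e.g., "DC." for Dublin Core) identify the metadata type; the prefixes and the classify_document function are illustrative inventions, not an agreed standard.

    from html.parser import HTMLParser

    # Assumed mapping from meta-tag name prefixes to metadata types;
    # these prefixes are illustrative, not an agreed standard.
    METADATA_PREFIXES = {
        "DC.": "dublin-core",
        "GEO.": "geo-spatial",
        "MARC.": "marc",
    }

    class MetaTagScanner(HTMLParser):
        """Collects the 'name' attribute of every <meta> tag on a page."""
        def __init__(self):
            super().__init__()
            self.meta_names = []

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                for key, value in attrs:
                    if key == "name" and value:
                        self.meta_names.append(value)

    def classify_document(html_text):
        """Partition a document: {'no-metadata'}, or the set of
        metadata types found (dublin-core, geo-spatial, marc)."""
        scanner = MetaTagScanner()
        scanner.feed(html_text)
        found = {
            mtype
            for name in scanner.meta_names
            for prefix, mtype in METADATA_PREFIXES.items()
            if name.startswith(prefix)
        }
        return found or {"no-metadata"}

    # Example: a page carrying a Dublin Core record.
    page = '<html><head><meta name="DC.Title" content="A Paper"></head></html>'
    print(classify_document(page))  # {'dublin-core'}

A spider could run such a classifier over each fetched document and route it into the corresponding partition of its index.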