IBM T. J. Watson Research Center
Proceedings of the W3C Distributed Indexing/Searching Workshop
Due to the size and growth rate of the Web, a good distributed indexing/searching mechanism must be integrated with a distributed data-gathering mechanism. Traditionally, this gathering is done by means of a web crawler. Absent a notification protocol in HTTP, however, the crawler must look everywhere to find the latest data, and since many web pages change frequently, the crawler must be continually active. This burdens both remote servers and the network itself, and the problem is compounded by the fact that many crawlers are simultaneously trying to do the same thing.
Our investigations suggest that an approach similar to Harvest's use of Gatherers and Brokers is required, but with more generality. In particular, SOIF usage needs to be extended to accommodate link information and hierarchical representations. With that done, a Harvest-like system can interoperate with arbitrary web crawlers by producing standard sets of output files. One such file would be an associated configuration file, describing the location, format, date, and content of the other files. These files may be absent altogether if the site administration does not want to participate, or may cover only a subset of the site's public pages if the administration deems the remainder not useful to crawlers. This methodology turns the essentially confrontational 'robots.txt' approach into a collaborative one in which everybody wins.
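As a concrete illustration, a site might serialize each page into a SOIF object, extended with link information as proposed above. The sketch below follows the standard SOIF syntax (@TEMPLATE { URL, then Attribute-Name{value-length}: value lines); the "Outgoing-Links" attribute is a hypothetical extension, not part of the standard SOIF attribute set.

```python
# Sketch: serializing one extended SOIF-like record.
# "Outgoing-Links" is a hypothetical extension attribute.

def soif_record(template, url, attributes):
    """Emit one SOIF object: @TEMPLATE { URL, followed by
    Attribute-Name{value-length}: value lines and a closing brace."""
    lines = ["@%s { %s" % (template, url)]
    for name, value in attributes.items():
        lines.append("%s{%d}:\t%s" % (name, len(value), value))
    lines.append("}")
    return "\n".join(lines)

record = soif_record(
    "FILE",
    "http://www.research.ibm.com/index.html",
    {
        "Title": "IBM Research",
        "Last-Modification-Time": "841234567",
        # Hypothetical extension: newline-separated outgoing links
        "Outgoing-Links": "http://www.ibm.com/\nhttp://www.w3.org/",
    },
)
print(record)
```

Because each value carries its own byte count, a crawler can parse such records without understanding every attribute, which is what makes incremental extension practical.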
In addition to the files representing the text, other files, such as lists of outgoing links, also need to be generated to provide a complete functional replacement for a web crawler visiting every page at a site. If a site is unwilling to provide all this data, then at a minimum a file enumerating all of the site's URLs, along with their last-modified dates, could be used to advantage (some FTP sites already do something similar).
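Generating that minimal URL-enumeration file is straightforward. The following is a sketch only, assuming a local document root mapped one-to-one onto a base URL; it walks the tree and emits one "URL, tab, last-modified timestamp" line per file, analogous to the recursive listings some FTP sites publish.

```python
# Sketch: enumerate a site's URLs with last-modified dates.
# Assumes docroot maps directly onto base_url (no aliases or CGI).
import os
import time

def enumerate_urls(docroot, base_url):
    lines = []
    for dirpath, _dirnames, filenames in os.walk(docroot):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, docroot).replace(os.sep, "/")
            mtime = time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                  time.gmtime(os.path.getmtime(path)))
            lines.append("%s/%s\t%s" % (base_url.rstrip("/"), rel, mtime))
    return "\n".join(lines)
```

A crawler comparing this file against its previous copy can fetch only the URLs whose dates have changed, replacing a full site walk with a single request.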
A serious issue, not yet resolved, is exactly what form the text output should take. It could be a representation of the web pages themselves, which in turn raises the question of whether it should contain keywords only, full text, full text plus tags, or full text augmented by the results of name-finding and related processes. Alternatively, it could be an inverted index, which raises questions of both format and content. No single answer will satisfy everybody, in part because different sites will be willing and able to devote different resources to generating and maintaining these output files. It is suggested that a variety of standard levels of detail be established, and that the aforementioned configuration file be used to describe the choices made.
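Purely as an illustration of how such a configuration file might describe a site's choices, consider the following fragment; every field name and value here is hypothetical, since no such format has been standardized.

```
# Hypothetical site-description file; all field names are illustrative.
Detail-Level:  full-text+tags      # one of: urls-only, keywords,
                                   #         full-text, full-text+tags
Text-File:     /export/site-text.soif
Link-File:     /export/site-links.txt
URL-List:      /export/site-urls.txt
Format:        SOIF
Generated:     1996-05-01T00:00:00Z
Exclude:       /private/           # subtrees withheld from crawlers
```

The point is not the particular syntax but that a crawler can read this one small file, discover what the site offers and at what level of detail, and fetch only the output files it can use.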