Notes by Raj Vaswani (@Home Network)
DH began by calling robot activity ``the ultimate reverse spam,'' since it is an attempt to download from, rather than send to, the world. He suggested that this activity is increasing as robots are increasingly being used not simply to index, but also to preload caches, and may soon be used as personal agents. There is, however, no cooperation between these disparate robots. Furthermore, providers have no control over the terms that are used to index their content. DH cited some difficulties in controlling when robots may be allowed to run against one's site; these included lack of interchange protocols and of standard metadata formats.
DH went on to describe Netscape's attempt to address these issues, their Catalog Server product. The Catalog Server is based on the Harvest architecture, which DH first described as follows:
Netscape's Catalog Server extends this architecture as follows:
DH outlined a simple indexing example as follows:
Finally, DH listed some summarizing points and future directions (which were also meant to be potentially controversial):
The first questioner asked to what extent DH's presentation was official Netscape policy. DH replied that he is the lead architect of the Catalog Server product, which contains all of the features he had described. The speculative and controversial portions of the talk were his own opinions.
The second questioner asked DH to define the scope of the information space addressed by RDM, specifically whether it only referred to objects on the Internet, or whether it was meant also to encompass objects not network-accessible. DH replied that SOIF tags objects using their URLs, but that a ``no-op URL'' could perhaps be used to indicate network inaccessibility; he was unsure whether or not the URL specification allowed this.
The third questioner wanted to make a distinction between the creation and indexing of metadata. He suggested that DH's talk had been about the first, and not the second. DH agreed.
The fourth questioner asked about infrastructure for guiding robots within intranets, and how this was married to the structure DH had described for Internet indexing. DH replied that he thought of the word ``intranet'' as signifying indexing for ``one's own purposes'', but that this was not mutually exclusive with publishing for an Internet audience.
The fifth questioner asked about the impact of content access fees on the cost of searching. DH replied that today, robots crawl what is freely accessible, but that the intranet market may change this (robots will be needed to crawl private data for internal consumption). Someone else suggested that someone (e.g., the robot owner) pays the provider the cost of the robot's accesses. Yet another person suggested a breakout session on search engine business models, guessing that this may influence standards more than any other single factor. [A BOF on this topic was later held, but reportedly yielded no conclusions.]
The final questioner commented on limitations of SOIF. She first mentioned that SOIF's use of an object's URL as its identifier is poorly suited to situations in which there are multiple sources of metadata about the object, and suggested that URNs may be a better fit. DH replied that Harvest allows a URL + Gatherer ID pair to be used as the unique key, which somewhat mitigates the problem, but acknowledged that it was indeed a problem. He commented that URNs would allow specification of the reply content type (e.g., whether a French or English version was desired). The questioner pointed out a second problem, namely that SOIF requires complex data models/structures to be flattened into attribute/value pairs. DH agreed that SOIF was very oriented towards machine-readability, and was not meant to support complex hierarchies as, e.g., SGML nesting might.
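For context on both points, a SOIF record keys a flat list of attribute/value pairs to a single URL; a minimal sketch (the template type, attribute names, and URL here are hypothetical illustrations, and the number in braces gives the byte length of each value):

```
@FILE { http://example.com/report.html
Title{13}: Annual Report
Author{9}: J. Public
}
```

The URL on the first line is the record's identifier, which is why multiple metadata sources for one object collide, and the flat attribute list is why nested structures must be flattened.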
MK began by making clear that what was to follow was not official AOL policy. His talk was divided into two major sections:
MK felt that single repository approaches to data collection suffered from scalability problems. He said that the attitudes of indexing services towards these problems fell into 3 classes:
MK next summarized requirements on data collection:
MK stated one goal of robots.txt to be to allow administrators to warn robots which ``bad'' documents to avoid. He felt that widespread use of the ``protocol'' suggested success in meeting this goal, and attributed this success to the protocol's simplicity of administration (just requires a text editor), implementation (result is both human and machine readable), and deployment.
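As an illustration of that simplicity, a robots.txt file is just a few human-readable lines per rule; a minimal sketch (the paths shown are hypothetical):

```
# Applies to all robots; keep them out of two "bad" areas.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
```

An administrator can write and maintain such a file with nothing more than a text editor, which is the property MK credits for the protocol's uptake.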
Next, MK pointed out several problems with robots.txt:
MK noted that the list above was not meant to be a litany of things wrong with the Web; there are many other problems -- document changes, URL equivalence, etc. -- that it is not the job of robots.txt to fix. He argued that ``simple = good'': robots.txt's simple approach seemed to be having more impact than long ongoing discussions that had yet to bear fruit. He suggested 3 changes to enhance/improve the protocol:
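The simplicity MK praises also shows on the robot side: honoring the protocol takes only a few lines. A sketch using Python's standard-library robotparser (the agent name and URLs are hypothetical):

```python
from urllib import robotparser

# A site's robots.txt, as a list of lines (normally fetched over HTTP
# from http://<site>/robots.txt before crawling).
lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

# A well-behaved robot checks each URL against the rules before fetching.
print(rp.can_fetch("AnyBot", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))         # True
```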
Expanding on binding policy with resources, MK discussed 3 possible ways of implementing this:
MK closed with a one-sentence summary of his message: ``empower information providers with richness of expression, but keep it simple.''
The first questioner noted that MK had jumped from the robots.txt model of a single global file of metadata to a model involving metadata per resource; he asked for MK's opinions on the middle ground between those extremes, namely tying metadata to collections of files. MK responded that this could be achieved with some combination of global and per-file metadata; he added that PICS needed to address this issue as well, and that he would be interested to see how that turned out.
The second questioner felt that because robots.txt provided a mechanism by which to warn off robots, and because robots in turn respected it, no major search service had been sued for wrongful access to (indexing of) data; he felt this was a significant contribution of the protocol. He did note an additional limitation, however: there was no way to specify that images should not be read (there is nowhere to put the metadata). MK replied that he was happy to defer handling non-HTML formats.
The third questioner pointed out that robots.txt had no concept of robot classes -- e.g., indexing vs. personal agents -- and that this distinction may be useful in increasing content provider flexibility.
The fourth questioner suggested that robots engage in two different activities: discovering resources and collecting them. He stated that robots.txt in general addresses the second class, and that the proposed ``PleaseVisit'' directive addresses the first. He felt that this was a useful separation to make, e.g., sites could publish resources they have. MK commented that the normal case shouldn't be to ask for everything on a site, a query that is rarely necessary unless one is trying to index the whole Web. He also gave a concrete example in which redirection could be useful, hypothesizing a corporate home page from which individual employee pages may be reachable, but which pages the provider would not want indexed; in this case, ``PleaseVisit'' would point from the individual pages to the main home page, resulting in an index that conformed to the provider's wishes.
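The ``PleaseVisit'' directive had no settled syntax at the time; purely as a hypothetical sketch of the redirection example above (directive spelling and placement are assumptions, not a published specification), a site might write:

```
# Hypothetical fragment: keep robots out of individual employee
# pages, but point them at the corporate home page for indexing.
User-agent: *
Disallow: /people/
PleaseVisit: /index.html
```

A robot discovering an employee page under /people/ would skip it and index the main home page instead, matching the provider's wishes.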
The final questioner cited the ``big inhale'' problem from knowledge engineering -- the problem of trying to consume vast amounts of data, such as that currently available on the Web. He suggested that the community was trying to address this problem by pushing responsibility for it onto the provider (e.g., making him/her responsible for summarizing the data), but that eventually it would become impossible to ``inhale'' even the summaries. He questioned whether it was necessary, feasible, or even desirable to index everything on the Web.