Notes by Raj Vaswani (@Home Network)
DH began by calling robot activity ``the ultimate reverse spam,'' since it is an attempt to download from, rather than send to, the world. He suggested that this activity is increasing as robots are increasingly being used not simply to index, but also to preload caches, and may soon be used as personal agents. There is, however, no cooperation between these disparate robots. Furthermore, providers have no control over the terms that are used to index their content. DH cited some difficulties in controlling when robots may be allowed to run against one's site; these included lack of interchange protocols and of standard metadata formats.
DH went on to describe Netscape's attempt to address these issues, their Catalog Server product. The Catalog Server is based on the Harvest architecture, which DH first described as follows:
Netscape's Catalog Server extends this architecture as follows:
DH outlined a simple indexing example as follows:
Finally, DH listed some summarizing points and future directions (which were also meant to be potentially controversial):
The first questioner asked to what extent DH's presentation was official Netscape policy. DH replied that he is the lead architect of the Catalog Server product, which contains all of the features he had described. The speculative and controversial portions of the talk were his own opinions.
The second questioner asked DH to define the scope of the information space addressed by RDM, specifically whether it only referred to objects on the Internet, or whether it was meant also to encompass objects not network-accessible. DH replied that SOIF tags objects using their URLs, but that a ``no-op URL'' could perhaps be used to indicate network inaccessibility; he was unsure whether or not the URL specification allowed this.
The third questioner wanted to make a distinction between the creation and indexing of metadata. He suggested that DH's talk had been about the first, and not the second. DH agreed.
The fourth questioner asked about infrastructure for guiding robots within intranets, and how this was married to the structure DH had described for Internet indexing. DH replied that he thought of the word ``intranet'' as signifying indexing for ``one's own purposes'', but that this was not mutually exclusive with publishing for an Internet audience.
The fifth questioner asked about the impact of content access fees on the cost of searching. DH replied that today, robots crawl what is freely accessible, but that the intranet market may change this (robots will be needed to crawl private data for internal consumption). Someone else suggested that someone (e.g., the robot owner) pays the provider the cost of the robot's accesses. Yet another person suggested a breakout session on search engine business models, guessing that this may influence standards more than any other single factor. [A BOF on this topic was later held, but reportedly yielded no conclusions.]
The final questioner commented on limitations of SOIF. She first mentioned that SOIF's use of an object's URL as its identifier is poorly suited to situations in which there are multiple sources of metadata about the object, and suggested that URNs may be a better fit. DH replied that Harvest allows a URL + Gatherer ID pair to be used as the unique key, which somewhat mitigates the problem, but acknowledged that it was indeed a problem. He commented that URNs would allow specification of the reply content type (e.g., whether a French or English version was desired). The questioner pointed out a second problem, namely that SOIF requires complex data models/structures to be flattened into attribute/value pairs. DH agreed that SOIF was very oriented towards machine-readability, and was not meant to support complex hierarchies as, e.g., SGML nesting might.
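For context on both points, a SOIF record keys a flat list of attribute/value pairs to a single URL; a minimal sketch (the template type, attribute names, and URL here are hypothetical illustrations, and the number in braces gives the byte length of each value):

```
@FILE { http://example.com/report.html
Title{13}: Annual Report
Author{9}: J. Public
}
```

The URL on the first line is the record's identifier, which is why multiple metadata sources for one object collide, and the flat attribute list is why nested structures must be flattened.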
MK began by making clear that what was to follow was not official AOL policy. His talk was divided into two major sections:
MK felt that single repository approaches to data collection suffered from scalability problems. He said that the attitudes of indexing services towards these problems fell into 3 classes:
MK next summarized requirements on data collection:
MK stated one goal of robots.txt to be to allow administrators to warn robots which ``bad'' documents to avoid. He felt that widespread use of the ``protocol'' suggested success in meeting this goal, and attributed this success to the protocol's simplicity of administration (just requires a text editor), implementation (result is both human and machine readable), and deployment.
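As an illustration of that simplicity, a robots.txt file is just a few human-readable lines per rule; a minimal sketch (the paths shown are hypothetical):

```
# Applies to all robots; keep them out of two "bad" areas.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
```

An administrator can write and maintain such a file with nothing more than a text editor, which is the property MK credits for the protocol's uptake.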
Next, MK pointed out several problems with robots.txt:
MK noted that the list above was not meant to be a litany of things wrong with the Web; there are many other problems -- document changes, URL equivalence, etc. -- that it is not the job of robots.txt to fix. He argued that ``simple = good'': robots.txt's simple approach seemed to be having more impact than long ongoing discussions that had yet to bear fruit. He suggested 3 changes to enhance/improve the protocol:
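The simplicity MK praises also shows on the robot side: honoring the protocol takes only a few lines. A sketch using Python's standard-library robotparser (the agent name and URLs are hypothetical):

```python
from urllib import robotparser

# A site's robots.txt, as a list of lines (normally fetched over HTTP
# from http://<site>/robots.txt before crawling).
lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

# A well-behaved robot checks each URL against the rules before fetching.
print(rp.can_fetch("AnyBot", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))         # True
```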
Expanding on binding policy with resources, MK discussed 3 possible ways of implementing this:
MK closed with a one-sentence summary of his message: ``empower information providers with richness of expression, but keep it simple.''
The first questioner noted that MK had jumped from the robots.txt model of a single global file of metadata to a model involving metadata per resource; he asked for MK's opinions on the middle ground between those extremes, namely tying metadata to collections of files. MK responded that this could be achieved with some combination of global and per-file metadata; he added that PICS needed to address this issue as well, and that he would be interested to see how that turned out.
The second questioner felt that because robots.txt provided a mechanism by which to warn off robots, and because robots in turn respected it, no major search service had been sued for wrongful access to (indexing of) data; he felt this was a significant contribution of the protocol. He did note an additional limitation, however: there was no way to specify that images should not be read (there is nowhere to put the metadata). MK replied that he was happy to defer handling non-HTML formats.
The third questioner pointed out that robots.txt had no concept of robot classes -- e.g., indexing vs. personal agents -- and that this distinction may be useful in increasing content provider flexibility.
The fourth questioner suggested that robots engage in two different activities: discovering resources and collecting them. He stated that robots.txt in general addresses the second class, and that the proposed ``PleaseVisit'' directive addresses the first. He felt that this was a useful separation to make, e.g., sites could publish resources they have. MK commented that the normal case shouldn't be to ask for everything on a site, a query that is rarely necessary unless one is trying to index the whole Web. He also gave a concrete example in which redirection could be useful, hypothesizing a corporate home page from which individual employee pages may be reachable, but which pages the provider would not want indexed; in this case, ``PleaseVisit'' would point from the individual pages to the main home page, resulting in an index that conformed to the provider's wishes.
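The ``PleaseVisit'' directive had no settled syntax at the time; purely as a hypothetical sketch of the redirection example above (directive spelling and placement are assumptions, not a published specification), a site might write:

```
# Hypothetical fragment: keep robots out of individual employee
# pages, but point them at the corporate home page for indexing.
User-agent: *
Disallow: /people/
PleaseVisit: /index.html
```

A robot discovering an employee page under /people/ would skip it and index the main home page instead, matching the provider's wishes.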
The final questioner cited the ``big inhale'' problem from knowledge engineering -- the problem of trying to consume vast amounts of data, such as that currently available on the Web. He suggested that the community was trying to address this problem by pushing responsibility for it onto the provider (e.g., making him/her responsible for summarizing the data), but that eventually it would become impossible to ``inhale'' even the summaries. He questioned whether it was necessary, feasible, or even desirable to index everything on the Web.