SPIDERING BOF REPORT
Report by Michael Mauldin (Lycos)
(later edited by Michael Schwartz)
While the overall workshop goal was to determine areas where standards
could be pursued, the Spidering BOF attempted to reach actual standards
agreements about some immediate-term issues facing robot-based search
services, at least among spider-based search service representatives who
were in attendance at the workshop (Excite, InfoSeek, and Lycos). The
agreements fell into four areas, but we report only three of them here
because the fourth area concerned a KEYWORDS tag that many workshop
participants felt was not appropriate for specification by this BOF
without the participation of other groups that have been working on that
issue.
The remaining three areas were:
1. ROBOTS meta-tag
default = empty = "ALL"
"NONE" = "NOINDEX, NOFOLLOW"
The filler (the value of the tag's CONTENT attribute) is a
comma-separated list of terms:
ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.
Discussion: This tag is meant for users who cannot control the
robots.txt file at their sites. It provides a last chance to
keep their content out of search services. It was decided not to
add syntax to allow robot-specific permissions within the meta-tag.
INDEX means that robots are welcome to include this page in
search services.
FOLLOW means that robots are welcome to follow links from this
page to find other pages.
So a value of "NOINDEX" allows the subsidiary links to be explored,
even though the page is not indexed. A value of "NOFOLLOW" allows the
page to be indexed, but no links from the page are explored (this may
be useful if the page is a free entry point into pay-per-view content,
for example). A value of "NONE" tells the robot to ignore the page.
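As an illustration only, the rules above might be applied by a robot
roughly as follows. This is a minimal sketch, not something agreed by
the BOF: it assumes the tag appears in the usual HTML form
<META NAME="ROBOTS" CONTENT="...">, and the names parse_robots_content,
RobotsMetaFinder, and robots_permissions are hypothetical. Two choices
go beyond what the report specifies: unknown terms are ignored, and
when terms conflict the later one wins.

```python
from html.parser import HTMLParser


def parse_robots_content(content):
    """Map a ROBOTS filler string to (index_allowed, follow_allowed)."""
    index, follow = True, True  # default = empty = "ALL"
    for term in content.split(","):
        term = term.strip().upper()
        if term == "ALL":
            index, follow = True, True
        elif term == "NONE":  # "NONE" = "NOINDEX, NOFOLLOW"
            index, follow = False, False
        elif term == "INDEX":
            index = True
        elif term == "NOINDEX":
            index = False
        elif term == "FOLLOW":
            follow = True
        elif term == "NOFOLLOW":
            follow = False
        # Assumption: any other term (including an empty one) is ignored.
    return index, follow


class RobotsMetaFinder(HTMLParser):
    """Records the CONTENT value of the last ROBOTS meta-tag seen."""

    def __init__(self):
        super().__init__()
        self.content = ""  # empty filler if the page has no ROBOTS tag

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if (attributes.get("name") or "").upper() == "ROBOTS":
            self.content = attributes.get("content") or ""


def robots_permissions(html_text):
    """Return (index_allowed, follow_allowed) for a page's HTML text."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return parse_robots_content(finder.content)
```

A robot using this sketch would add the page to its index only when the
first value is true, and would queue the page's links only when the
second value is true.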
2. DESCRIPTION meta-tag
The intent is that the text can be used by a search service when
printing a summary of the document. The text should not contain
any formatting information.
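A search service might pick up this text along the following lines.
This is a hedged sketch under stated assumptions: the tag is taken to
be written as <META NAME="DESCRIPTION" CONTENT="...">, the names
DescriptionFinder and page_summary are hypothetical, and collapsing
runs of whitespace is one possible reading of "no formatting
information" rather than a rule from the BOF.

```python
from html.parser import HTMLParser


class DescriptionFinder(HTMLParser):
    """Captures the first DESCRIPTION meta-tag found in a page."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if (attributes.get("name") or "").upper() == "DESCRIPTION":
            text = attributes.get("content") or ""
            # Assumption: collapse whitespace so the summary is one plain line.
            self.description = " ".join(text.split())


def page_summary(html_text):
    """Return the page's description text, or None if it has none."""
    finder = DescriptionFinder()
    finder.feed(html_text)
    return finder.description or None
```

When page_summary returns None, the service would have to fall back on
whatever summary method it already uses.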
3. Other issues with ROBOTS.TXT
These are issues recommended for future standards discussion that
could not be resolved within the scope of this workshop.
- Ambiguities in the current specification
http://www.kollar.com/robots.html
- A means of canonicalizing sites, using:
HTTP-EQUIV HOST
ROBOTS.TXT ALIAS
- Ways of supporting multiple robots.txt files per site ("robotsN.txt")
- Ways of advertising content that should be indexed (rather than
just restricting content that should not be indexed)
- Flow control information: retrieval interval or maximum
connections open to server