Additions to the robots.txt Standard

Introduction

The robots.txt standard is a very useful tool for both webmasters and the people who run web crawlers. This standard could be even more useful with several additions. The additions suggested below were inspired both by comments from webmasters and by front-line experience developing and running the Excite web crawler.

Site naming

Site naming poses several problems for maintainers of web indexes. Sites can be referenced by many names, and it can be hard to determine which name the webmaster prefers. Also, large sites can be referenced by many different physical IP addresses.

Multiple names

Most sites can be referenced by several names. To avoid duplication, crawlers usually canonicalize these names by converting them to IP addresses. When presenting the results of a search, it is desirable to use the name instead of the IP address. Sometimes it is obvious which of several names to use (e.g. the one that starts with www), but in many cases it is not. The robots.txt file should have an entry that states the preferred name for the site.
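As a purely illustrative sketch (the directive name and the host shown here are assumptions of ours, not part of any existing standard), such an entry might look like:

    # Hypothetical entry naming the host the webmaster prefers
    # crawlers and indexes to use when referring to this site.
    Preferred-Name: www.example.com

A crawler that reached the site through an IP address or an alternate alias could then display and canonicalize on the stated name.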

Multiple IP addresses

Many high-traffic sites use multiple servers. Machines are added frequently, and their IP addresses often change. Crawlers do not have a good, inexpensive way to understand and keep track of the ever-changing mapping of servers to logical sites. This causes needless duplication of effort by the crawler and higher traffic at the sites. The robots.txt file is an ideal place to include a list of IP addresses that map to a logical site.
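For illustration only (the directive names and the addresses below are hypothetical), the mapping might be expressed as:

    # Hypothetical list of server addresses that all serve
    # the single logical site www.example.com.
    Preferred-Name: www.example.com
    IP-Address: 192.0.2.10
    IP-Address: 192.0.2.11
    IP-Address: 192.0.2.12

A crawler that resolved any of these addresses could then treat them as one site and crawl it only once.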

Freshness of content

HTTP provides mechanisms for determining how recently a file has been modified; it even provides mechanisms for avoiding data transfer costs if the file has not changed since the last visit of a browser or crawler. However, the performance of both crawlers and the sites they visit could be improved by providing higher-level information about when content on a site has changed.

Freshness of web pages

One addition that could dramatically reduce traffic would be a representation of modification dates for various parts of the site. Today, the only way for a crawler to tell which pages need to be re-fetched is to issue a request with the If-Modified-Since request-header field, which costs a connection per page. Having this information centralized in the robots.txt file would decrease server loads. The information could be presented at a directory or file level, depending on the size of the site and the granularity of information the webmaster wants to present. A useful representation might be a reverse-chronological list of files and the dates that they were last modified.
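One possible form for such a list, shown only as a sketch (the field name, date format, and paths are invented here for illustration), is:

    # Hypothetical modification summary, most recently changed first.
    Modified: 1996-06-18  /products/index.html
    Modified: 1996-06-15  /products/
    Modified: 1996-05-30  /staff/

A crawler could scan this list once per visit and re-fetch only the files or directories that have changed since its last crawl.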

Freshness of the robots.txt file

The robots.txt file should include a time-to-live (TTL) value that tells crawlers how often they should update the robots.txt information for that site. Some sites very rarely change their robots.txt files and do not want the extra traffic of having them frequently re-read by multiple crawlers; even if the If-Modified-Since request-header field is used, a connection still has to be created each time. Other sites change their robots.txt files regularly and are hurt by extensive caching of robots.txt information by crawlers. An explicit TTL value would help crawlers satisfy each site's requirements.
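A sketch of how such a value might appear (the directive name and the choice of seconds as the unit are assumptions made here for illustration):

    # Hypothetical time-to-live: cached robots.txt information for
    # this site should be discarded after 7 days (604800 seconds).
    TTL: 604800

A site that changes its rules often could publish a much smaller value, and crawlers would know to re-read the file correspondingly sooner.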

Flexibility of the robots.txt file

Although the simplicity of the robots.txt file is a benefit, many sites on the internet today have structures that are too complex to represent with the current robots.txt format.

Multiple content providers

In some instances, many people provide the content for a single site. A good example is a university site that has a separate area for each student. Each of these individuals might want to control access to his or her own section of the site, and it is often unreasonable to allow all of them to edit one global robots.txt file. The robots.txt file should therefore have a way to redirect the crawler to separate robots.txt files located further down in the site. This allows different robots.txt information to be specified for separate parts of the site.
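Purely as an illustration (the directive name and the paths are hypothetical), such a redirection might look like:

    # Hypothetical delegation: each student area carries its own
    # robots.txt file, maintained by that student.
    Sub-Robots: /students/alice/robots.txt
    Sub-Robots: /students/bob/robots.txt

A crawler would read the top-level file first and then fetch each delegated file to learn the rules for that part of the site.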

Complex directory structures

The Disallow statement of the current robots.txt standard could be made more powerful. For various reasons, some sites cannot change their on-disk layout and may have very large directories. Excluding part of a large directory with the current Disallow statement is very cumbersome. A more powerful regular-expression syntax, or an 'Allow' directive to override a Disallow for specific files, would be useful.
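As a sketch of both ideas (the 'Allow' directive and the wildcard pattern shown here are assumptions of ours, not part of the current standard; only User-agent and Disallow are):

    # Hypothetical: exclude a large directory except for one file,
    # and exclude any path matching a pattern.
    User-agent: *
    Disallow: /archive/
    Allow: /archive/index.html
    Disallow: /*/drafts/

Either mechanism would let a webmaster carve out exceptions without restructuring the directories on disk.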

Description of the site

An optional description of the site would also be a welcome addition to the robots.txt standard. A brief human-readable statement about the site's purpose and the kind of content it contains would provide useful information to the end users of the various repositories created by web crawlers.
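For example (the directive name and wording are illustrative only):

    # Hypothetical human-readable summary of the site.
    Description: Product information and support documents for the Example Widget Company.

Crawlers could carry this text through to their indexes and present it alongside search results for the site.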

Conclusion

We have described several additions to the robots.txt standard that would improve the performance and usefulness of both web crawlers and web sites. We have only sketched, rather than fully specified, what the changes to the format should look like, but none of the improvements seem difficult to specify.


Mike Frumkin (mfrumkin@excite.com)
Graham Spencer (gspencer@excite.com)
Excite Inc.
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.