Open Text Corporation

Position Paper: Internet/Intranet Indexing Suggestions

Open Text Corporation provides text search and retrieval tools for the Internet and for corporate intranets. We currently operate a free text search and retrieval service covering a substantial portion of the published documents on the WWW. We also sell products that bring the same functionality to companies wishing to provide similar facilities on their corporate intranets. Based on our experience at Open Text, we have collected a number of suggestions for improving the current standards (or de facto standards), with an eye toward better indexing and search technology. The following list is a sample of the changes that Open Text believes would go a long way toward improving the current search and retrieval experience.

Reducing Bandwidth Requirements

There is currently a great deal of discussion about how to reduce the amount of data and the number of TCP/IP connections required to obtain a page along with all of its embedded GIFs, frame components, Java applets, and other objects.

Ability to fetch multiple documents in one connection

A multiple GET facility in the HTTP protocol (similar to the mget facility in FTP) would greatly reduce the overhead these connections entail. Robots could use such a facility to fetch several (perhaps unrelated) documents over a single connection. The facility should be robust enough that unwanted data is not included in the transmission.
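
As a rough illustration only (the request syntax below is invented and is not part of any current HTTP specification), such a facility might let a robot name several documents in one request and receive only those documents in reply:

    MGET /index.html /products/overview.html /press/releases.html HTTP/1.0
    User-Agent: ExampleRobot/1.0

    [the server returns the three documents in turn, for example as a
     multipart response, so that no unrequested data is transmitted]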

Server provided document lists

Robots periodically check whether documents on a server have changed using the "if-modified-since" mechanism. While this saves transferring the document when it has not changed, it still requires the overhead of one connection per document to make this assessment. A better approach might be to provide a list of all documents available at the site, together with the size, last modification date, and MIME type of each. Robots could collect this one file periodically and from it infer which documents need to be re-fetched. On the server side, some form of daemon may be required to maintain this file, but that should not be too difficult to create. The file could be given a reserved name similar to robots.txt, perhaps "sitedocs.txt". For large servers, this file could point to other files, perhaps reflecting the directory organization in use at the site.
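
A sketch of what a "sitedocs.txt" might contain (the column layout here is only a suggestion):

    # URL                      size     last-modified                  MIME type
    /index.html                4311     Wed, 12 Jun 1996 08:30:00 GMT  text/html
    /products/overview.html    10234    Thu, 30 May 1996 16:05:00 GMT  text/html
    /papers/whitepaper.ps      187220   Tue, 02 Apr 1996 11:00:00 GMT  application/postscript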

Ability to "get-all-documents-modified-since"

An extension of the multiple-transfer and "if-modified-since" mechanisms would be a "get-all-documents-modified-since" request understood by the HTTPD. With this, a browser or robot could make one connection and fetch all the documents on the site that have been modified or added since a specified time. The transfer could additionally include a special block listing all URLs that were deleted from the site since that time.
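
One possible shape for such an exchange (the method name and response framing below are purely illustrative):

    GET-ALL-MODIFIED-SINCE / HTTP/1.0
    Since: Sat, 01 Jun 1996 00:00:00 GMT

    [the server returns every document added or changed since that date,
     followed by a block listing the URLs deleted since that date]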

Server side document conversion

Few servers currently provide mechanisms for converting documents on the server side. Converting large word-processing or PDF files into smaller text or HTML documents before transmission could save a great deal of bandwidth.
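
One way a client could ask for such a conversion, assuming the server were willing and able to perform it, is ordinary HTTP content negotiation: the client lists the formats it prefers, and the server returns a smaller, converted representation if it can. For example:

    GET /reports/annual-report.doc HTTP/1.0
    Accept: text/html, text/plain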

Document Change Frequency

Anything that helps define the frequency at which documents are expected to change would improve the currency of the information available from search engines. Mechanisms such as mapping URLs (perhaps defined by a regular expression) to an expected change frequency, or attaching expiry dates to documents, could help, but they could also be easily abused.
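
As an example of the kind of mapping intended (the file and its syntax are hypothetical):

    # URL pattern              expected change frequency
    /news/.*                   daily
    /products/.*\.html         monthly
    /archive/.*                never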

Improving the Robot Exclusion Protocol

The robot exclusion protocol indicates which robots are allowed to crawl which parts of a given site. A number of sites exclude all robots because of perceived performance implications. A few additions could improve the situation so that robots minimize the impact of visiting and collecting documents from a site.
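
For reference, the existing robots.txt format lets a site state only which paths a given robot should not visit, for example:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/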

Time slots for robots.txt

The robots.txt file should be able to detail time slots (in GMT) for crawling, perhaps on a per-robot basis. These could be expressed with the same flexibility as the UNIX cron facility, so that a site could, for example, restrict robots to late-night slots on weekends only.
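
A possible sketch, borrowing cron-style fields (the "Visit-time" directive is invented here and is not part of the current exclusion protocol):

    User-agent: *
    # minute hour day-of-month month day-of-week, all times GMT
    Visit-time: 0 1-5 * * 6,0    # weekends only, 01:00 to 05:00 GMT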

Bandwidth suggestions for robots.txt

The robot exclusion guidelines suggest how often a robot may access a site for a document. These suggestions may not be appropriate given the wide variety of hardware and the popularity of some sites. Perhaps the robots.txt file itself should provide more site-specific guidance: how many files, bytes, or connections to request per unit of time, perhaps defined in time slots over the day, and perhaps per robot.
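
For instance, the file might carry hints along these lines (the directive names are invented for illustration):

    User-agent: *
    Request-rate: 1/30s           # at most one request every 30 seconds
    Max-bytes-per-hour: 5000000   # stay under roughly 5 MB per hour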

Refetching requirements of robots.txt

There should be some indication of how frequently the robots.txt file itself should be re-fetched; the administrators we contacted varied widely on this point. The robots.txt file should contain an expiry date, or an indication of how often it should be re-fetched.
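
For example (again, the directive is only a suggestion):

    # Treat this file as stale after the date below
    Expires: Thu, 27 Jun 1996 00:00:00 GMT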

Improved Information Content

Identifiable summaries or abstracts

Most search engines provide a short summary of each document. At present those summaries are generated from the content of the page at hand, which in some cases leads to difficult-to-read verbiage. A more formalized way for authors to supply a summary or abstract may yield better summary lists.
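
One such formalization could be an author-supplied abstract carried in the document itself, for instance in an HTML META tag (the content shown is illustrative):

    <META NAME="description"
          CONTENT="Open Text position paper on indexing the Web and corporate intranets.">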

Improved keyword identification

Anything that can be done to improve the quality of search keywords in a document would assist search and retrieval engines. Our experience, however, has been that these mechanisms are generally abused on the Internet, but not on corporate intranets.
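
An author-supplied keyword list could take a similar form, for example:

    <META NAME="keywords"
          CONTENT="full-text search, indexing, robots, intranet">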

Improved character encoding and language identification

There should be a standard for character encodings and for identifying which encoding scheme a document uses. This is a particularly troublesome situation in countries like Japan, where Unicode, EUC, and other encoding schemes are sometimes at odds with one another. In a similar vein, knowing what natural language(s) a document is written in (i.e. English, French, Japanese, etc.) would allow a search engine to tailor results for a particular user.
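
Both pieces of information fit naturally into existing HTTP headers; a Japanese page might be served with, for example:

    Content-Type: text/html; charset=iso-2022-jp
    Content-Language: ja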

Site Specific Information

There are a number of improvements that can be made on a per site basis.

A graphics logo file per site

Each site should provide a small graphics file that represents the site, perhaps a corporate logo. These logos could be used to improve the appearance of a summary page, to provide a more graphical means of navigating the net, or for other as-yet-unforeseen applications. The file should probably have a reserved name like robots.txt, perhaps "sitelogo.gif".

Geographic location of site

It should be possible to determine the geographic location of a server, either from the HTTPD or from an auxiliary file located on the server, perhaps "location.txt". Robots could use this to optimize the collection of documents, and applications could be tailored to regional requirements.
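
Such a "location.txt" could be as simple as the following (the field names and values are hypothetical):

    Country: Canada
    Region: Ontario
    City: Waterloo
    Latitude: 43.47
    Longitude: -80.52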

Site summary

Sites could provide summary information describing the nature or purpose of the documents they serve, perhaps in a file called "summary.txt". For sites providing many services or types of documents, this file could hold a number of summaries, each associated with a regular expression defining the URLs it covers.
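
A sketch of such a file (the layout is only a suggestion):

    /products/.*   Product descriptions and data sheets for Open Text search tools.
    /support/.*    Technical support notes and frequently asked questions.
    .*             Corporate information from Open Text Corporation.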

Multiple server sites

Many larger HTTP servers are implemented using several machines, and hence multiple IP addresses and hostnames. There are benefits to knowing both the IP address and the hostname of the "main" server. The necessary functionality already exists in the DNS protocol; however, to use it a webmaster needs to know DNS setup intimately and must have control of the site's DNS. Many webmasters do not have this control, as DNS may be considered the responsibility of a system administrator or even the ISP.
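
In DNS terms the mapping itself is simple; a zone could name one canonical "main" host and alias the others to it, along these lines (the hostnames and address are invented for illustration):

    www    IN  A      192.0.2.10
    www2   IN  CNAME  www
    www3   IN  CNAME  www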

(C) Copyright 1996 Open Text Corporation

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.