Position Paper: Internet/Intranet Indexing Suggestions
Open Text Corporation provides text search and retrieval tools for
the Internet and for corporate intranets. We currently operate a
free text search and retrieval service
for a substantial portion of the published documents on the WWW. We also
sell products that bring this functionality to companies
wishing to provide similar facilities for their corporate intranets.
Drawing on our experience at Open Text, we have collected a number of
suggestions for improving the current standards (or de facto
standards) with an eye towards improving our indexing and search
technologies. The following list is a sample of the things that
Open Text believes would go a long way toward improving the current
search and retrieval experience.
Reducing Bandwidth Requirements
There is currently much discussion of how to reduce the amount of
data and the number of TCP/IP connections required to obtain a page with
all of its embedded GIFs, frame components, Java applets, and other objects.
Ability to fetch multiple documents in one connection
A multiple-GET facility in the HTTP protocol (similar to the mget
facility in FTP) would greatly reduce the overhead these
connections entail. Robots could use such a facility
to fetch several (perhaps unrelated) documents over one connection.
The multiple GET should be robust enough that unwanted data is
not included in the transmission.
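No multiple-GET verb exists in HTTP today; a robot can approximate the idea with the emerging HTTP/1.1 persistent-connection mechanism by sending several GET requests over one socket. A minimal sketch that only builds the request payload (the host and paths are placeholders):

```python
def build_pipelined_requests(host, paths):
    """Build several HTTP/1.1 GET requests to be sent over one TCP
    connection. Keep-alive lets the client reuse the connection,
    avoiding the per-document connection overhead described above."""
    requests = []
    for i, path in enumerate(paths):
        last = (i == len(paths) - 1)
        requests.append(
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Connection: {'close' if last else 'keep-alive'}\r\n"
            "\r\n"
        )
    return "".join(requests)
```

This relies on every request being sent up front and the responses arriving in order; a true multiple-GET verb could additionally let the server omit documents the client does not want.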
Server provided document lists
Robots periodically check to see if documents have changed on a
server using the "if-modified-since" mechanism. While this saves
transferring the document in those cases where the document has not
changed, it still requires the overhead of one connection per
document on the server to make this assessment.
A better approach might be to provide a list of all documents
available at the site, coupled with the size, last modification date,
and mime type of those documents. Robots could collect this one
file periodically, and from it infer which documents need to be
re-fetched. On the server side, some form of daemon may be required
to administer this file, but that should not be too difficult to
create. This document could be given a reserved name similar to
robots.txt, perhaps "sitedocs.txt". For large servers, this file
could point to other files, perhaps reflecting the directory organization
in use at the site.
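The layout of such a "sitedocs.txt" file is not specified anywhere; one plausible format is one tab-separated record per document carrying the path, size, last-modified date, and MIME type. A sketch of a parser for that invented format:

```python
from datetime import datetime

def parse_sitedocs(text):
    """Parse a hypothetical sitedocs.txt: one tab-separated record per
    line, giving URL path, size in bytes, last-modified date (ISO 8601),
    and MIME type. Blank lines and '#' comments are skipped."""
    docs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        path, size, modified, mime = line.split("\t")
        docs.append({
            "path": path,
            "size": int(size),
            "modified": datetime.fromisoformat(modified),
            "mime": mime,
        })
    return docs
```

A robot could diff this listing against its own records and re-fetch only the documents whose size or date changed.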
Ability to "get-all-documents-modified-since"
An extension of the multiple file transfers and "if-modified-since"
mechanism is to provide a "get-all-documents-modified-since" protocol
to the HTTPD. With this, a browser or robot could make one connection
and fetch all the documents from the site that have been modified or
added since a specified time. Additionally, the transfer could include
a special block listing all URLs that were deleted from the site since
the specified time.
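The response such a protocol would produce can be described concretely. A sketch, assuming the server holds a simple list of (path, last-modified) pairs plus a record of deleted paths:

```python
from datetime import datetime

def documents_modified_since(listing, deleted, since):
    """Compute what a hypothetical 'get-all-documents-modified-since'
    response would carry: the paths changed or added after `since`,
    plus the special block of URLs deleted since that time."""
    changed = [path for path, modified in listing if modified > since]
    return {"changed": changed, "deleted": list(deleted)}
```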
Server side document conversion
Few servers currently provide mechanisms for converting documents on the
server side. Converting large word processing or PDF files into smaller
text or html documents before transmission could save a lot of bandwidth.
Document Change Frequency
Anything that helps define the frequency at which documents
are expected to change would improve the currency of information
available from search engines. Mechanisms such as mapping
URLs (perhaps defined by regular expressions) to an expected change
frequency, or attaching expiry dates to documents, could help, but they
could also be easily abused.
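A regular-expression mapping of the kind suggested could look like the following sketch; the patterns and intervals are invented for illustration:

```python
import re

# Hypothetical mapping of URL regex -> expected re-fetch interval in days.
CHANGE_FREQUENCY = [
    (re.compile(r"^/news/"), 1),      # news pages change daily
    (re.compile(r"^/archive/"), 365), # archives are essentially static
]

def expected_refetch_days(path, default=30):
    """Return how many days a robot should wait before re-fetching
    `path`, based on the first matching pattern."""
    for pattern, days in CHANGE_FREQUENCY:
        if pattern.search(path):
            return days
    return default
```

The abuse risk noted above applies here too: a site could claim frequent changes to attract more robot visits, so robots would need to sanity-check these hints.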
Improving the Robot Exclusion Protocol
The robot exclusion protocol provides some information on which robots
are allowed to crawl which parts of a given site. A number of sites
are excluding all robots because of perceived performance implications.
There are a few things that could improve the situation so that
robots minimize the impact of visiting and collecting documents from
a site.
Time slots for robots.txt
The robots.txt file should detail time slots (in GMT) during which
crawling is permitted, perhaps on a per-robot basis. These could be
expressed with the same flexibility as the UNIX cron facility, so that
sites can, for example, restrict robots to late-night slots on weekends only.
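On the robot's side, honouring such a window is a simple check. A sketch, where the hour and weekday ranges stand in for values a site would publish (they are not part of the existing protocol):

```python
from datetime import datetime, timezone

def in_crawl_slot(when, hours=range(1, 5), weekdays=(5, 6)):
    """Return True if `when` (a UTC datetime) falls inside the allowed
    crawl window: by default 01:00-04:59 GMT, Saturday and Sunday only
    (weekday() is 5 for Saturday, 6 for Sunday)."""
    return when.weekday() in weekdays and when.hour in hours
```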
Bandwidth suggestions for robots.txt
The robot exclusion protocol provides guidelines on how often a robot
may access a site for a document. These may not be appropriate given
the wide variety of hardware and the popularity of some sites. Perhaps
the robots.txt file itself should provide more site-specific guidelines,
such as how many files, bytes, or connections are acceptable per unit of
time, perhaps broken down by time slot over the day, and perhaps by robot.
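A published budget of this kind would let a robot compute its pacing directly. A sketch, assuming the site advertises a maximum number of connections per hour:

```python
def seconds_between_requests(max_connections_per_hour):
    """Translate a site's hypothetical connection budget into the
    minimum delay (in seconds) a robot should leave between
    successive requests to that site."""
    if max_connections_per_hour <= 0:
        raise ValueError("connection budget must be positive")
    return 3600.0 / max_connections_per_hour
```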
Refetching requirements of robots.txt
There should be some indication of how frequently the robots.txt file
itself should be re-fetched. The administrators we contacted varied
widely on this point. The robots.txt file should contain an expiry date,
or an indication of how frequently it should be re-fetched.
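A hypothetical "Expires" line in robots.txt would settle the question; a robot could check it before deciding whether its cached copy is still valid. A sketch (the "Expires:" field is an assumption, not part of the existing protocol):

```python
from datetime import datetime

def robots_txt_expired(robots_text, now):
    """Return True if a cached robots.txt carrying a hypothetical
    'Expires:' line (ISO 8601 date) should be re-fetched at `now`.
    With no Expires line, conservatively treat the copy as expired."""
    for line in robots_text.splitlines():
        if line.lower().startswith("expires:"):
            expiry = datetime.fromisoformat(line.split(":", 1)[1].strip())
            return now >= expiry
    return True
```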
Improved Information Content
Identifiable summaries or abstracts
Most search engines provide a short summary of each document.
For now, those summaries are generated from the content of the page
at hand, which in some cases leads to difficult-to-read verbiage. A more
formalized approach may yield better summary lists.
Improved keyword identification
Anything that can be done to improve the quality of search keywords in
a document would assist search and retrieval engines. Our experience,
however, has been that these mechanisms are generally abused on the Internet,
but not on corporate intranets.
Improved character encoding and language identification
There should be a standard for character encoding(s) and for identifying
the character encoding scheme in use. This is a particularly troublesome
situation in countries like Japan, where Unicode, EUC, and other encoding
mechanisms are sometimes at odds with each other.
In a similar vein, knowing what natural language(s) a document is written
in (e.g. English, French, Japanese) would allow a search engine to tailor
results for a particular user.
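HTTP already has hooks for both pieces of information: the charset parameter of Content-Type and the Content-Language header. A sketch of extracting them from response headers, assuming servers actually set them:

```python
def encoding_and_language(headers):
    """Extract the character encoding and natural language from HTTP
    response headers, where declared. Returns (charset, language),
    with None for anything the server did not send."""
    charset = None
    content_type = headers.get("Content-Type", "")
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            charset = value.strip('"').lower()
    language = headers.get("Content-Language")
    return charset, language
```

The difficulty the section describes is precisely that many servers send neither field, leaving robots to guess the encoding heuristically.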
Site Specific Information
There are a number of improvements that can be made on a per site basis.
A graphics logo file per site
Each site should provide a small graphics file that represents the site,
perhaps a corporate logo. These logos could be used to improve the
appearance of a summary page, provide a more graphical means of navigating
the net, or other unforeseen applications. This file should probably
have a reserved name like robots.txt, perhaps "sitelogo.gif".
Geographic location of site
It should be possible to determine the geographic location of a server, either
from the HTTPD or from an auxiliary file located on the server site, perhaps
"location.txt". Robots can use this to optimize the collection of documents,
and applications can be tailored for regional requirements.
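The contents of such a "location.txt" file are entirely unspecified; one plausible layout is key/value lines giving latitude, longitude, and a country code. A sketch of parsing that invented format (the sample values are illustrative only):

```python
def parse_location_txt(text):
    """Parse a hypothetical location.txt made of 'Key: value' lines
    giving the server's latitude, longitude, and country code."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip().lower()] = value.strip()
    return {
        "latitude": float(fields["latitude"]),
        "longitude": float(fields["longitude"]),
        "country": fields["country"],
    }
```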
Site summary
Sites could provide summary information describing the nature or purpose
of the documents provided by the site, perhaps in a file called "summary.txt".
For sites providing many services or types of documents, this file could
allow a number of summaries organized by regular expressions defining
the URLs that are associated with those summaries.
Multiple server sites
Many larger HTTP sites are implemented using several machines,
and hence multiple IP addresses and hostnames. There are benefits
in knowing both the IP address and hostname of the "main" server.
The functionality already exists in the DNS protocol, but exploiting
it requires intimate knowledge of the DNS setup, and the webmaster
must also control the site's DNS. Many webmasters lack that control,
since DNS may be in the hands of a system administrator or even the ISP.
(C) Copyright 1996 Open Text Corporation
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.