Keeping Track of WWW Servers

How does www keep track of the available servers?

Q

How does www keep track of the available servers? How does a user know where to go to get a specific piece of information? According to the description of the HTTP protocol, when a user wants to do a search, the corresponding UDI specifies, among other things, the server's address. How does the user find out about the server's address? Or, from the server's perspective, how does a server announce its existence?

The resource discovery problem

14 May 1992

This is what people seem to call this problem in general.

As a physical server can serve many different types of information from different servers, we talk about finding documents and indexes, as that is what the user sees. To the reader, the web is a continuum. When a new server appears, it may serve many databases of data from different sources and on different subjects. The new data must be incorporated into the web. This means putting links to data on the new server (especially to a general overview document for the server if there is one) from existing documents which interested readers might be reading, or putting it into an index which people might search.

The person publishing the data must go through the same process as the person searching for it. When (s)he has found an overview page which (s)he feels ought to refer to the new data, (s)he can ask the author of that document (who ought to have signed it with a link to his or her mail address) to put in a link. There may be several links from different documents: there is not one master list. Of course, some servers are put up for internal use only, and links are only made from local documents. I only find out about these servers by word of mouth, but they exist.

Currently, there are three parallel trees in the web for finding data starting from scratch. The most interesting one is a classification by subject. I've got an "Other subjects" link from CERN's home page to a master page of information by subject. From that I have links to individual servers of all kinds (W3, WAIS and Gopher), and in cases where there are a lot, like physics and biology, a link to a page about one specific subject. In this way you can browse the web by subject like a library. I am looking for people in other disciplines to take over the subtrees for those disciplines as the load gets heavier (I may have candidates for some). The tree tends to be out of date, and its authors rely on feedback to put in things which are missing.
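As a rough illustration of browsing by subject, here is a minimal sketch in Python. It is not how the actual subject pages are stored (they are ordinary hypertext documents); the nested dictionary, the server names, and the functions browse and servers_under are all assumptions made up for the example.

```python
# A hypothetical subject tree: each node holds links to servers and to subtrees.
# Subject names, server names and protocols are illustrative assumptions only.
SUBJECT_TREE = {
    "servers": [],
    "subjects": {
        "physics": {
            "servers": [("example-hep-index", "WAIS"), ("example-physics-docs", "W3")],
            "subjects": {},
        },
        "biology": {
            "servers": [("example-bio-gopher", "Gopher")],
            "subjects": {},
        },
    },
}


def browse(tree, path):
    """Follow a list of subject names down from the root and return the node reached."""
    node = tree
    for subject in path:
        node = node["subjects"][subject]
    return node


def servers_under(node):
    """Collect every server linked from this node or from any subtree below it."""
    found = list(node["servers"])
    for child in node["subjects"].values():
        found.extend(servers_under(child))
    return found


# Browse to one subject, then list everything reachable from the top of the tree.
print(browse(SUBJECT_TREE, ["physics"])["servers"])
print(servers_under(SUBJECT_TREE))
```

Handing a subtree over to someone in another discipline corresponds, in this sketch, to letting them maintain one entry under "subjects" independently of the rest.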

The other trees are by organization and by server type. The list by server type is easy, because the people responsible for each protocol keep a list of the servers using it. That is, there is a tree of gophers, and there is an index of WAIS indexes. There is the W3/WAIS/Archie server for FTP sites. This tree isn't so useful unless you know what sort of a server you are looking for, but it tends to be more up-to-date than the subject index. It also includes things which aren't just about subjects. The third tree was going to be a geographic tree of organizations, but that isn't at all up-to-date.

By the way, it would be easy in principle for a third party to run over these trees and make indexes of what they find. It's just that no one has done it as far as I know, because there isn't yet an indexer which runs over the web directly.
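To show what "running over the trees" could look like in principle, here is a minimal sketch in Python. It does not fetch anything over the network: the WEB dictionary is a made-up stand-in for a handful of linked documents, and build_index is a hypothetical helper, not an existing tool.

```python
from collections import deque

# Hypothetical in-memory "web": document name -> (text, names of linked documents).
# The names and contents are illustrative assumptions, not real documents.
WEB = {
    "subject-tree": ("overview of information by subject", ["physics", "biology"]),
    "physics": ("physics servers and preprint indexes", ["hep-server"]),
    "biology": ("biology databases", []),
    "hep-server": ("high energy physics preprint archive", ["physics"]),
}


def build_index(web, root):
    """Visit every document reachable from root and map each word to the documents containing it."""
    index = {}
    seen = {root}
    queue = deque([root])
    while queue:
        name = queue.popleft()
        text, links = web[name]
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
        for target in links:
            # Skip dangling links and documents already visited (links can form cycles).
            if target in web and target not in seen:
                seen.add(target)
                queue.append(target)
    return index


index = build_index(WEB, "subject-tree")
print(sorted(index["physics"]))
```

The traversal is breadth-first, so documents close to the starting page are indexed first; the "seen" set is what stops it looping when two pages link to each other.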

As you can see, the web is sufficiently flexible to allow a number of ways of finding information. In the end, I think a typical resource discovery session will involve someone starting on their "home" document, following one or two links to an index, then doing a search, and following several links from what they have found. In some cases, there will be more than one index search involved, such as at first for an organization, and having found that, a search within it for a person or document. We need to keep this flexibility, as the available information in different places has such different characteristics.
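The two-step case (find an organization, then search within it) can be sketched as follows. This is only an illustration under assumed data: the index contents, the names ORGANIZATION_INDEX and DIRECTORIES, and the functions search and two_level_search are all hypothetical.

```python
# Hypothetical indexes; organization and entry names are made up for illustration.
ORGANIZATION_INDEX = {
    "example-lab": "example-lab-directory",
    "example-university": "example-university-directory",
}
DIRECTORIES = {
    "example-lab-directory": {"a. reader": "office 1", "b. writer": "office 2"},
    "example-university-directory": {"c. student": "room 3"},
}


def search(index, term):
    """Return the entries whose key contains the search term, ignoring case."""
    term = term.lower()
    return {key: value for key, value in index.items() if term in key.lower()}


def two_level_search(org_term, person_term):
    """Search for an organization first, then search each match's own directory."""
    results = {}
    for org, directory in search(ORGANIZATION_INDEX, org_term).items():
        hits = search(DIRECTORIES[directory], person_term)
        if hits:
            results[org] = hits
    return results


print(two_level_search("lab", "writer"))
```

Each directory here is searched the same way, but nothing requires that: in practice each index could have quite different characteristics, which is the reason for keeping the two steps separate.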

In the long term, when there is a really large mass of data out there, with deep interconnections, then there is some really exciting work to be done on automatic algorithms to make multi-level searches.

Tim BL