Internet protocols

A protocol is a language that is used between computers. Most protocols are fairly simple, consisting of not much more than a handful of commands and a description of the format for the returned answers. For example, the NNTP protocol lists a number of commands such as article, list, and newgroups, and it says that every command must be on a separate line and that the responses will be preceded by a line with a 3-digit number. The Gopher protocol is even simpler. A protocol is not meant to be used by humans, because it is designed to be simple for computers, which is not necessarily simple for human beings.
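As a rough illustration of how small such a protocol can be, the Python sketch below puts a command on a line of its own and splits the 3-digit number off the first line of a response. The function names and the sample response text are made up for this example; only the one-command-per-line rule and the 3-digit prefix come from the description above.

```python
def format_command(name, *args):
    """Build one command line; NNTP command lines end with CR LF."""
    return " ".join([name, *args]) + "\r\n"


def parse_status_line(line):
    """Split a response line into its 3-digit code and the remaining text."""
    code, _, text = line.strip().partition(" ")
    if len(code) != 3 or not code.isdigit():
        raise ValueError("response does not start with a 3-digit number: %r" % line)
    return int(code), text


print(repr(format_command("list")))
print(parse_status_line("215 list of newsgroups follows\r\n"))
```

A program can act on the number alone and ignore the text after it, which is part of what makes the format easy for computers and awkward for people.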

This chapter describes a number of protocols that are in use on the Internet, insofar as they are useful for the purposes of this report, namely information retrieval and distributed services. Low-level protocols that are not directly relevant to the subjects of this report are left out.

One of the hypotheses underlying this report is that more of the protocols should be hidden, in favour of a single client program (or just a few) that speaks all of them. Such clients should be organized by function, not by protocol. Too many of the current client programs know just a single protocol: ftp is built around the FTP protocol, gopher around the Gopher protocol. Even though part of their functionality (fetching files) overlaps, the user still has to choose a protocol, not a function.

Usenet/NNTP: The main news service and public discussion forum, uses the NNTP protocol.
Mailing lists: Also called Listservers, after an often used program. Discussions among a limited number of people. Makes use of E-mail.
Gopher: An easy-to-use file retrieval program, based on hierarchical, distributed menus. See also Veronica.
FTP: File Transfer Protocol, a protocol for copying files to and from remote machines.
Archie: A database of locations for all files that are publicly available through FTP. Uses the Prospero protocol.
World Wide Web/HTTP: A distributed hypermedia system, uses the HTTP protocol.
WAIS/Z39.50: A full text indexing system, works both stand-alone and over a network, in the latter case it uses the Z39.50 protocol.
E-mail: The electronic equivalent of the postal service. Several protocols are in use (SMTP, UUCP, POP, etc.)
Telnet, rlogin: Protocols that allow people to `log in' to remote machines.
rcp, NFS, AFS: Rcp is `remote copy', a sort of one-shot FTP. NFS and AFS are systems for `mounting' the file system of a remote machine as if it were a local hard disk.
Hyper-G: A distributed hypermedia system, that supports multiple navigation models.
DEC VTX: An early hypermedia system by Digital Equipment Corporation, described as a `Videotex' system.
Prospero: A `virtual file system'; offers multiple views of a distributed file system.

Usenet

Usenet is the collective name for a public discussion forum based on the NNTP protocol. Newsreader programs show a list of newsgroups and each newsgroup contains articles. Old articles are automatically removed after a certain period. There are over 2500 different newsgroups and the number is growing every day.

Client programs include rn, xrn, nn, and tin (all on Unix and/or X) and trumpet (MS-DOS).

An article that is posted to Usenet quickly makes its way to all connected computers around the world. Small computers store only a subset of the articles or none at all. Large computers store all of them, many megabytes each day. People at small computers contact the nearest larger computer via the NNTP protocol in order to read and post articles. The same protocol is also used between the larger computers when distributing articles.

NNTP is anonymous, which means that it doesn't care about the identity of the client; no passwords are required.

Gopher

Gopher is a networked information retrieval and publishing tool, based on the concept of hierarchical menus. Basically, it supports three types of items: documents (including images, sounds, etc.), menus (or directories, containing links to other items), and services (called `links', such as telnet or CSO servers). Information as to where documents or menus are stored remains hidden from the user, giving the impression that `Gopher space' is a single, extremely large system.
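The way a menu can hide the location of an item can be sketched in a few lines of Python. The tab-separated layout used here (a one-character type glued to the title, then a selector, a host, and a port) is the one given in the Gopher protocol specification, RFC 1436, not something defined in this report; the host name and the class and function names are invented for the example.

```python
from typing import NamedTuple


class MenuItem(NamedTuple):
    item_type: str  # one character, e.g. "0" = document, "1" = menu
    title: str      # the only part a browser shows to the user
    selector: str   # opaque string that is sent back to the server
    host: str       # where the item really lives (hidden from the user)
    port: int


def parse_menu_line(line):
    """Parse one tab-separated line of a Gopher menu."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4 or not fields[0]:
        raise ValueError("not a Gopher menu line: %r" % line)
    return MenuItem(fields[0][0], fields[0][1:], fields[1], fields[2], int(fields[3]))


# Hypothetical menu line; gopher.example.org is a placeholder host.
item = parse_menu_line("1Weather reports\t/weather\tgopher.example.org\t70\r\n")
print(item.title)
```

Only the title reaches the screen; the host and port are used silently when the user selects the item, which is what creates the impression of one large system.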

Gopher, unlike WWW, doesn't distinguish between the system, the protocol and the format of the data. It also stresses the simplicity of the client programs (the Gopher browsers), requiring any added intelligence to be put into the servers instead. This allowed a rapid spread of Gopher over the world, but seems rather inflexible with regard to future enhancements.

Table: Types of data and services and their representation in Gopher menus and Gopher viewers. Note that `local' here means `local to the server', in other words, the table is created from the viewpoint of the information publisher. The item type is what is indicated on the screen, next to the item's name. The information source is of interest only to the server that must provide the information. The output format is used by the client program to select the right kind of viewer.

Item type  	Information source         	Output format  	

Document   	Local file,                	Text (ASCII),  	
           	Compressed local file,     	GIF, Image,  	
           	Remote (other Gopher),     	Sound, MPEG,  	
           	FTP'ed file,               	Binary,  	
           	Output of a program        	MIME-encoded  	

Menu       	Local menu file,           	(implied)  	
           	Compressed local file,     	  	
           	Remote (other Gopher),     	  	
           	FTP directory,             	  	
           	Segmented (mail) file,     	  	
           	Search for files (grep),   	  	
           	Search for files (WAIS),   	  	
           	Search for menu items,     	  	
           	Search segmented file,     	  	
           	Output of a program        	  	

Form       	Local form file +          	Text (ASCII),  	
           	local program              	GIF, Image  	
           	                           	Sound, MPEG,  	
           	                           	Binary,  	
           	                           	MIME-encoded,  	
           	                           	Menu  	

Service    	Telnet, Tn3270, CSO        	(implied)

The inventors of Gopher were inconsistent when they implemented the system. Based on what different Gopher implementations provide (in particular the Minnesota `original' gopherd and John Franks' gn), the following could be a breakdown of menu-items vs information sources vs output formats (see table).

Not all combinations are currently implemented and a few additional output formats are defined, though it seems better to reduce the number of formats defined by Gopher itself and instead rely on the MIME standard to encode the contents. In the table above that advice has been followed, even though it remains to be seen if the Gopher community will actually go in that direction.

The latest version of the Gopher protocol is called Gopher+ and it includes facilities for automatic negotiations between server and client to determine the best format for some piece of information and an extension for interactive forms.

Veronica

Veronica is a database of Gopher items (an item is a title plus a pointer to a document or menu). It is updated daily. There are at the moment four such databases, in different parts of the world. The database is accessible through Gopher. It accepts queries for keywords and responds with a Gopher menu consisting of all matching titles.

Veronica stores all titles that appear in Gopher menus anywhere in the world, but they are stored without their context. A query for a particular keyword returns a list of matching titles, but removed from their context, the titles may be rather uninformative. E.g., a title that consists of just the word `Europe' might have been meaningful in its original menu, but taken out of context and stored in a Veronica database, there is very little indication of what the document entitled `Europe' actually contains.
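The loss of context is easy to see in a toy version of such a database. The Python sketch below is only an illustration: the index entries, host names, and selectors are invented, and the matching rule (case-insensitive substring) is an assumption, not a description of how Veronica itself matches keywords.

```python
def search_titles(index, keyword):
    """Return every (title, pointer) pair whose title contains the keyword."""
    needle = keyword.lower()
    return [(title, pointer) for title, pointer in index if needle in title.lower()]


# Hypothetical entries: a title plus a pointer (host, selector), and nothing
# about the menu in which the title originally appeared.
index = [
    ("Europe", ("a.example.org", "/item/17")),
    ("Europe", ("b.example.org", "/item/4")),
    ("Asia", ("a.example.org", "/item/18")),
]

for title, pointer in search_titles(index, "europe"):
    print(title, pointer)
```

Both hits carry the same title, and nothing in the result says whether either one is a travel guide, a weather forecast, or a map.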

Still, despite its limitations, Veronica is a very useful tool when looking for information in `Gopher-space.'

FTP

FTP is a protocol for file management on a remote machine. It has commands for copying files to and from remote machines and for renaming and deleting files. It protects access by username and password combinations, but it is mostly used in the form of `anonymous ftp', which means that the username `anonymous' is recognized, with any password. Internet etiquette demands that people who make contact as `anonymous' provide their E-mail address as password, so that the maintainers of the ftp site can more easily see who has used the facilities.
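A minimal sketch of the anonymous login convention, in Python. USER and PASS are the standard FTP login commands; the function name and the check on the address are made up for this example, and nothing is sent over a network.

```python
def anonymous_login(email):
    """Return the two FTP command lines that open an anonymous session."""
    if "@" not in email:
        # Hypothetical check: etiquette asks for an E-mail address as password.
        raise ValueError("please give an E-mail address as password")
    return ["USER anonymous\r\n", "PASS %s\r\n" % email]


for line in anonymous_login("someone@example.org"):
    print(repr(line))
```

On a live connection, Python's standard ftplib module sends the same pair of commands when FTP.login is called with user 'anonymous' and the address as passwd.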

Archie

Just as Veronica is an index into Gopher, Archie is an index into anonymous FTP. The Archie database stores filenames from a large number of anonymous FTP archives. The database can be queried with partial filenames or regular expressions and it will return a list of matching filenames together with the addresses where they can be found.
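A query of this kind can be imitated with a short Python sketch. The database contents and host names below are invented, and the use of Python's re module is only a stand-in for whatever pattern syntax a real Archie server accepts.

```python
import re


def archie_query(database, pattern):
    """Return (host, path) pairs whose filename matches the regular expression."""
    regex = re.compile(pattern)
    hits = []
    for host, path in database:
        filename = path.rsplit("/", 1)[-1]
        if regex.search(filename):  # search() also accepts partial filenames
            hits.append((host, path))
    return hits


# Hypothetical archive listing: (FTP host, full path of the file).
database = [
    ("ftp.a.example.org", "/pub/gnu/emacs-19.22.tar.gz"),
    ("ftp.b.example.org", "/mirrors/gnu/emacs-19.21.tar.gz"),
    ("ftp.a.example.org", "/pub/docs/README"),
]
print(archie_query(database, r"^emacs-.*\.tar"))
```

Each hit keeps the host and the full path, so the person asking can pick the nearest copy or the highest version number.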

Although the Archie databases (there are about twenty of them around the world) are not updated as frequently as Veronica, they are great for finding the latest versions or nearest copies of software or documents. Of course, Archie suffers from the same problem as Veronica, and that is that the filenames do not convey much information about the contents of a file, but in Archie the context is shown in the form of a directory path.

Archie is an application of the Prospero protocol, but for people without Prospero clients, there is also the possibility to log into a machine running an Archie database and give commands inside a restricted shell.

World Wide Web (WWW)

WWW is a distributed hypermedia system. It defines both a protocol, HTTP (HyperText Transfer Protocol), and a hypertext file format, HTML (HyperText Markup Language). Many machines around the world act as WWW servers, meaning that they have a collection of hyper-documents that they will transfer on request. Each document has a unique name, a so-called Universal Resource Locator (URL). Inside a document there may be references to other documents, also in the form of URL's. The client program that a user runs on his own machine knows how to contact these servers and how to obtain a document, given its URL.
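What a client does with a URL before it contacts a server can be sketched with Python's standard urllib.parse module. The URLs and the function name are placeholders chosen for this example.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a URL into the parts a client needs to fetch the document."""
    parts = urlsplit(url)
    return {"protocol": parts.scheme, "server": parts.hostname, "path": parts.path}


print(describe("http://www.example.org/docs/intro.html"))
print(describe("gopher://gopher.example.org/1/menu"))
```

The first part of the name tells the client which protocol to speak, the second which machine to contact, and the rest is handed to that machine as the name of the document.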

Documents need not be text. They can be single-media or multi-media. Hyperlinks are possible in text and in pictures; not yet in time-based media, such as sound and movies.

The client determines the range of formats that it recognizes. Some clients are smarter than others. A typical list of supported formats is: formatted and unformatted text, PostScript, images in various formats, sound in various formats, and animations in MPEG format.

Typically, WWW clients know a number of protocols, such as FTP, Gopher, NNTP and, of course, HTTP. (The most recent definition of the protocol can be found in Geneva.)

A list of WWW clients is also available from CERN in Geneva.

If someone wants to use WWW to publish his own work, he (or rather his system operator) will need to set up a WWW server. At least on a Unix system that is not difficult to do. The details are also available from CERN.

WWW supports format negotiation between server and client for cases where information is available in several formats. Interactive forms are possible, with a range of buttons, radio buttons, and input fields.
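The idea behind format negotiation can be shown with a deliberately simplified Python sketch. It assumes that the client sends its formats in order of preference and that the server picks the first one it can supply; the real protocol is more elaborate, and the function name and format names are chosen for this example only.

```python
def negotiate(client_prefers, server_has):
    """Pick the first format in the client's preference order that the server has."""
    for fmt in client_prefers:
        if fmt in server_has:
            return fmt
    return None  # no common format


client_prefers = ["application/postscript", "text/html", "text/plain"]
server_has = {"text/plain", "text/html"}
print(negotiate(client_prefers, server_has))
```

When no format is shared the function returns None, and a real server would have to report an error or fall back to a default format.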

Authentication is possible in three different ways (as of October 1993): through username & password, Kerberos, and Internet address masks.
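Of the three methods, the address mask is the simplest to sketch. The Python fragment below uses the standard ipaddress module and writes masks in prefix notation such as 192.0.2.0/24; that notation, the function name, and the addresses are assumptions made for the example, not the syntax of any particular WWW server.

```python
import ipaddress


def allowed(client_address, masks):
    """Return True if the client's Internet address falls under any of the masks."""
    address = ipaddress.ip_address(client_address)
    return any(address in ipaddress.ip_network(mask) for mask in masks)


masks = ["192.0.2.0/24"]  # hypothetical: only machines on this network may read
print(allowed("192.0.2.17", masks))
print(allowed("198.51.100.5", masks))
```

A mask identifies machines, not people, so it is a much weaker check than a password or Kerberos.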

DEC VTX

VTX is a product of computer manufacturer DEC (Digital Equipment Corporation). It is described as a `videotex' system. Information is structured as a tree, with each subject collected in `stories' of several `pages'. Each page is either a menu, a query form, or an information page. Pages can indicate that they need an external application to be displayed properly. A story can be distributed over several machines, without the user being aware of that.

The VTX databases must be kept on DEC VAX machines, but client programs (readers) are available for other computers as well.

In contrast to the other systems described in this chapter, VTX is a commercial system, which means that license fees have to be paid for every VTX server and every reader program on another computer.

The KUB-gids project of the University of Brabant (Tilburg) uses DEC VTX as the basis for the KUB-gids CWIS. KUB-gids cannot be accessed from the outside, except by logging in to one of the university's machines. (Log in as `kubgids' on machine kubgids.kub.nl.)

Some information on DEC VTX can be found in an article in Byte or in a flyer from DEC that is available by FTP.

WAIS/Z39.50

WAIS is a full-text indexing system that can work both locally and over a network. A collection of files together forms a database, and WAIS is used to create an extensive index into this database; usually every word of every file is indexed. A WAIS server processes incoming queries, consisting of a number of keywords, and returns a list of matches. The server can also return the full text of a file in response to a query. Instead of keywords, a query can also refer to a document, which is interpreted as a query for other documents that are `similar to' the indicated one.

WAIS uses a scoring mechanism to determine how `similar' a document is to a set of keywords or to another document. The server computes a number between 0 and 1000, with 1000 being assigned to the best match. The computation is based on the number of times a keyword occurs and how many of the keywords occur.
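The Python sketch below shows one way such a score could be computed. It is not the formula WAIS uses: the raw score (total number of occurrences multiplied by the number of distinct keywords found) is an assumption made for this example, as are the file names. Only the scaling to a range of 0 to 1000, with 1000 for the best match, follows the description above.

```python
def raw_score(text, keywords):
    """Combine how often keywords occur with how many different keywords occur."""
    words = text.lower().split()
    counts = [words.count(keyword.lower()) for keyword in keywords]
    distinct = sum(1 for count in counts if count > 0)
    return sum(counts) * distinct


def rank(documents, keywords):
    """Return (name, score) pairs, scaled so that the best match scores 1000."""
    raw = {name: raw_score(text, keywords) for name, text in documents.items()}
    best = max(raw.values(), default=0)
    if best == 0:
        return [(name, 0) for name in raw]
    scored = [(name, round(1000 * value / best)) for name, value in raw.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-file database.
documents = {
    "a.txt": "gopher menus and gopher servers",
    "b.txt": "menus in general",
    "c.txt": "nothing relevant here",
}
print(rank(documents, ["gopher", "menus"]))
```

Because the scores are relative to the best match in one result list, a score of 1000 says nothing about how good that match is in absolute terms.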

The WAIS protocol is a subset of the ANSI standard Z39.50 protocol, which was developed especially for queries to bibliographic databases, such as library catalogues.

E-mail

E-mail is used for communication between two people. It is similar to the normal mail, but much faster. One person writes a letter and sends it to an address. An address is usually of the form user@machine, where machine can contain several parts separated by dots, such as let.rug.nl for the Faculty of Arts (let) of the University of Groningen (rug) in the Netherlands (nl).
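Taking such an address apart is straightforward, as the following Python sketch shows. The function name and the user name `someone' are invented; the machine name is the one from the example above.

```python
def split_address(address):
    """Split user@machine into the user and the dot-separated parts of the machine."""
    user, sep, machine = address.rpartition("@")
    if not sep or not user or not machine:
        raise ValueError("not of the form user@machine: %r" % address)
    return user, machine.split(".")


print(split_address("someone@let.rug.nl"))
```

The parts of the machine name run from specific to general: the department first, the country last.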

The message is copied to the recipient's machine and stored in a file known as the `mailbox.' E-mail programs, so-called Mail User Agents (MUA's), provide several commands to help with composing a letter, remembering addresses, and replying to mail.

There are several protocols in use between machines to exchange mail. A few of the more common ones are UUCP, SMTP, and POP. Since E-mail already has a long history, there are several gateways in operation, so that it is possible to send mail to a user on a machine that uses a different protocol.

Apart from the protocols, there are also standards that describe how the contents of a letter must be coded when it contains something other than plain text. Such a standard is MIME. The encoding and decoding is normally handled by the MUA, but not all E-mail readers can handle MIME yet.
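What an MUA does when it encodes a letter with a non-text part can be sketched with Python's standard email package. The addresses, the subject, the file name, and the six bytes standing in for an image are all placeholders.

```python
from email.message import EmailMessage


def build_letter(sender, recipient, text, image_bytes):
    """Compose a letter with a plain-text body and one GIF attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Picture attached"
    msg.set_content(text)
    # The library chooses a MIME encoding that survives text-only mail transport.
    msg.add_attachment(image_bytes, maintype="image", subtype="gif",
                       filename="picture.gif")
    return msg


letter = build_letter("a@example.org", "b@example.org",
                      "See the attached picture.\n", b"GIF89a")
print(letter.get_content_type())
```

Printing the whole letter with print(letter) shows the extra headers and the boundary lines that a reader without MIME support would present to the user as clutter.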

Mailing lists/listservers

A listserver is a program that continually watches for incoming mail on a certain mailbox and forwards any message to a list of other addresses. Such a mailing list can bring together people with a common interest. There are hundreds of mailing lists, each for a different subject.

Most listservers can automatically handle requests to subscribe or unsubscribe. Many also keep an archive of past discussions. Subscribers can send special messages to a separate address to retrieve files from the archive or to get other information about the mailing list (such as who is on it).
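The behaviour described in the last two paragraphs fits in a small Python class. This is a sketch only: the request words `subscribe', `unsubscribe', and `who' are hypothetical and differ between real listserver programs, and no mail is actually sent.

```python
class MailingList:
    """Toy listserver: one address for postings, one for requests."""

    def __init__(self):
        self.subscribers = set()
        self.archive = []

    def handle_request(self, sender, command):
        """Process a message that was sent to the separate request address."""
        command = command.strip().lower()
        if command == "subscribe":
            self.subscribers.add(sender)
        elif command == "unsubscribe":
            self.subscribers.discard(sender)
        elif command == "who":
            return sorted(self.subscribers)
        else:
            raise ValueError("unknown request: %r" % command)
        return None

    def handle_post(self, sender, body):
        """Archive a posting and return the (recipient, body) copies to send out."""
        self.archive.append((sender, body))
        return [(address, body) for address in sorted(self.subscribers)]


mailing_list = MailingList()
mailing_list.handle_request("a@example.org", "subscribe")
mailing_list.handle_request("b@example.org", "subscribe")
print(mailing_list.handle_post("a@example.org", "Hello, list"))
```

Keeping requests on a separate address prevents a message saying `unsubscribe' from being forwarded to everyone on the list.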

Unfortunately, mail readers (MUA's) have no special support for mailing lists. They handle messages from such lists the same as messages coming from a normal address. The proposal includes some comments about possible enhancements to mail readers.

Telnet & rlogin

Telnet and rlogin are programs that let some computer act as a terminal for another computer. They tell the remote computer that it now has an extra terminal and from then on they simply copy whatever the user types on the local machine to the remote machine; the output from the remote machine is similarly copied to the display of the local machine.

They work in much the same way as the popular communications software for PCs that works over modem lines, except that an Internet connection is used instead of a modem (dial-up) line.

The underlying protocol is called TCP/IP. It is a low-level protocol, in the sense that it doesn't deal with the meaning of the transferred data (characters), but only with ensuring that it is transferred without errors. Protocols such as Gopher and HTTP work on a higher level: they rely on TCP/IP for transferring data error-free, but they also interpret the data in some way, to determine what to do with it.
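The division of labour between the two levels can be sketched in Python. In this example socket.socketpair() stands in for a TCP/IP connection between two machines, so that the code runs without a network; the helper names and the request text are invented.

```python
import socket


def send_line(sock, text):
    """Higher level: a request is a line of text ending in CR LF."""
    sock.sendall(text.encode("ascii") + b"\r\n")


def read_line(sock):
    """Collect bytes until CR LF; the transport itself knows nothing about lines."""
    data = b""
    while not data.endswith(b"\r\n"):
        chunk = sock.recv(1)
        if not chunk:  # the other side closed the connection
            break
        data += chunk
    return data.decode("ascii").rstrip("\r\n")


# Assumption: a connected socket pair plays the role of the TCP/IP connection.
local, remote = socket.socketpair()
send_line(local, "/weather")
request = read_line(remote)
print("the higher-level protocol sees the request:", request)
local.close()
remote.close()
```

The socket only delivers bytes in order; deciding that the bytes form a line, and that the line names a document, is the work of the higher-level protocol.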

rcp, NFS, AFS

Rcp is a program to copy files from one machine to another. It is a bit like ftp, except that it is much simpler.

NFS and AFS are systems that make a connection to a remote machine and then present the filesystem of the remote machine as if it were a disk attached directly to the local computer. Once NFS or AFS is started, remote files and directories are indistinguishable from local ones.

Hyper-G

Hyper-G is a large, networked hypermedia system, being developed at Graz University of Technology (Austria). Like WWW it uses an SGML-like notation for hypertext, but unlike WWW it stores all link information in a central database, instead of in the documents themselves.

Hyper-G supports different types of organization: not only hypertext, but also hierarchical. Keyword searches are also available. There are several levels of user authentication. Hyper-G also automatically selects a document in the user's language, if translated versions are available. Users can attach annotations to all documents; the annotations are visible to other users.

Compared to WWW, Hyper-G has more similarities than differences, and the latest developments in both suggest that the developers have similar ideas. There is a gateway from the Hyper-G system in Graz to WWW. Hyper-G is not as portable as WWW, and there are only two sites running the system. But it is an interesting system that still includes some ideas that have not yet shown up in WWW. On the other hand, the proposed new standard for WWW includes things that are not available in Hyper-G, such as interactive forms.

Prospero

Prospero is a system for presenting `virtual directories.' The directories contain files and other directories, which may actually be located on different machines. Prospero automatically uses FTP (and possibly other protocols, including its own) to retrieve files transparently.

The directories can be composed by every user to his own taste. It is also possible to make personal directories available to others, which gives people a way of publishing information or offering information services. Publishing directories requires the installation of a Prospero server, in the same manner as for Gopher, WWW, etc.

Directories can also be created by so-called filters, programs that apply certain criteria to a set of files & directories in order to distribute them over directories. These automatic directories are dynamically updated, whenever the original files or directories change.
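The effect of a filter can be imitated in Python. The sketch below is not Prospero's own mechanism or syntax: the file list, the host names, and the function names are invented, and `dynamic updating' is imitated simply by applying the filter again whenever the directory is looked at.

```python
def apply_filter(files, predicate):
    """Build a virtual directory: name -> (host, path) for every file that passes."""
    return {
        path.rsplit("/", 1)[-1]: (host, path)
        for host, path in files
        if predicate(host, path)
    }


def postscript_only(host, path):
    """Example criterion: keep only PostScript files."""
    return path.endswith(".ps")


# Hypothetical files, spread over two machines.
files = [
    ("ftp.a.example.org", "/pub/papers/prospero.ps"),
    ("ftp.b.example.org", "/docs/gopher.txt"),
    ("ftp.b.example.org", "/docs/www.ps"),
]

view = apply_filter(files, postscript_only)
print(sorted(view))
```

The resulting directory lists files from both machines side by side, while each entry still records where the file really is, so that it can be fetched when the user opens it.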

With suitable presentation software, it is possible to make the virtual directories look like normal Unix-style directories, but also like Gopher menus, or even WWW documents.

The best-known application of Prospero is in the Archie software database. Archie queries are in fact filters, and the result is a virtual directory containing all files and directories that match the query.