Characteristics

This section discusses the characteristics of various naming schemes, requirements which some existing schemes meet, and requirements for the URL scheme itself, as an introduction to and background for the Recommendations section.

Uses of names and addresses

A name allows a user, with the help of a "client" program, to retrieve or operate on objects via a "server" program. A name may be passed, for example:

In communication of any form between two people, to refer to a document, or part of a document;
As part of the description of a link associated with a hypertext document;
As part of the result of searching an index.

Some typical requirements on a name, which are met to a varying degree by various schemes, are for example that the name is:

Persistent: A given name will remain valid as long as it is needed;
Extensible: A given naming syntax will remain valid through the introduction of new protocols and directory technologies;
Resolvable: A name will contain enough information to allow the document or index to which it refers to be accessed, perhaps via resolution into an intermediate, more physical, name.
Unique: Each object can only have one such name. The fact that two such names are different implies that the objects to which they refer are different (in some way).
Unambiguous: The fact that two names are identical implies that the objects named are the same (in some way).

The syntax discussed is the syntax of one name, be it a lasting name or a physical address. When a directory server or hypertext link contains a set of alternative names, then that is beyond the scope of this syntax. Similarly, a syntax for describing a compound object is outside the scope of this syntax. The specific locator name spaces (defined under the umbrella of the general syntax) each meet the requirements above to a greater or lesser extent.

Current practice

Current protocols use many different standards for names. For some protocols, such as ISO-10163 Search and Retrieve protocol[16], the names returned in a search are only valid during the session. For others, such as FTP[9], they are lasting names which may be used for object retrieval at a later time. Typically, however, they are not long-lasting names which are independent of the location of the object. Such names may be provided using directory servers such as x.500. They will refer to the registration, however formal or informal, of an object with a particular organisation or person. Both hypertext and manual references rely on long-lasting names. Current names are basically location specifiers (addresses). These may be known as Uniform Resource Locators (URLs). They give the necessary parts of an address for a reader to access an information provider using the given protocol, and ask for the object required. Examples of names used by various protocols include:

File Transfer Protocol (Postel 1985):

Host name or IP-address
[TCP port]
[user name, password]
Filename

W.A.I.S. (Kahle 1990)

Host name or IP-address
[TCP port]
local document id

Gopher (Alberti 1991)

Host name or IP-address
[TCP port]
database name
selector string

HTTP (Berners-Lee 1991)

Host name or IP-address
[TCP port]
local object id

NNTP (Kantor 1986)

NNTP group

Group name

NNTP article

Host name
unique message identifier

Prospero links (Neuman 1992)

Host name or IP address
[UDP port]
Host specific object name
[version]
[identifier]*

x.500 distinguished name

Country
Organisation
Organisational unit
Person
Local object identifier

Other systems with their own naming schemes include the BITNET "LISTSERV" application, FTAM file retrieval, SQLnet(TM) remote database search, proprietary distributed file systems, etc. The conventional syntax for writing these addresses involves various forms of punctuation to separate the parts. This sometimes, but not always, allows the naming scheme to be deduced from the punctuation. For example, a name of the form xxx.yyy.zz.edu:/pub.aa.bb.cc often implies anonymous FTP access. However, there is no well-defined algorithm for parsing an arbitrary name, as there is no common syntax.
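The kind of guesswork described above can be illustrated with a short Python sketch. The pattern and the function name guess_access are assumptions made for this illustration only; they are not part of any protocol, and the same punctuation is used by other tools, which is exactly why such a heuristic cannot be relied upon.

```python
import re

# Hypothetical heuristic: a dotted host name, a colon, then an absolute path.
# This mirrors the "xxx.yyy.zz.edu:/pub.aa.bb.cc" convention mentioned in the text.
HOST_COLON_PATH = re.compile(
    r"^(?P<host>[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+):(?P<path>/.*)$"
)

def guess_access(name):
    """Guess the access method from punctuation alone; return None if unsure."""
    match = HOST_COLON_PATH.match(name)
    if match:
        # "Often implies" anonymous FTP: a guess, not a certainty.
        return ("anonymous-ftp", match.group("host"), match.group("path"))
    return None

print(guess_access("xxx.yyy.zz.edu:/pub.aa.bb.cc"))
print(guess_access("comp.lang.c"))
```

A name that matches might equally be intended for some other remote-copy convention, and a name that does not match tells the reader nothing about which scheme applies. An explicit identifier for the scheme removes the need to guess.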

Expandability

There will necessarily be a phase during which lasting names will become more common, as the deployment of directory services increases to the point where every user has direct or indirect access to one. Even then, however, one can envisage more than one competing directory system, and cases in which physical names are still required. A directory service takes a lasting name and reduces it to a physical address (or set of addresses) which, though less useful for lasting reference, is the only way to actually retrieve the object. An addressing syntax is required which will be able to encompass existing physical address spaces, and be extendible to any future protocols. This requires that it contain an identifier for the protocol in use. The format of the rest of the address will necessarily depend to a certain extent on the protocol.
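As a rough illustration of why a protocol identifier makes the syntax extensible, the following Python sketch splits off a leading identifier and hands the remainder to a parser registered for that protocol. The "identifier:rest" layout, the "//host/path" form, the parser names and the PARSERS table are all assumptions made for this example; the actual syntax is the subject of the Recommendations section.

```python
def parse_host_path(rest):
    """Assumed layout '//host/path' for protocols that name a server."""
    if not rest.startswith("//"):
        raise ValueError("expected '//host/path', got %r" % rest)
    host, _, path = rest[2:].partition("/")
    return {"host": host, "path": "/" + path}

def parse_group(rest):
    """Assumed layout for a scheme that needs no server: just a group name."""
    return {"group": rest}

# Hypothetical registry: new protocols are added here without
# changing the general splitting step below.
PARSERS = {"ftp": parse_host_path, "http": parse_host_path, "news": parse_group}

def parse_address(address):
    scheme, sep, rest = address.partition(":")
    if not sep or not scheme:
        raise ValueError("no protocol identifier in %r" % address)
    parser = PARSERS.get(scheme.lower())
    # An unknown protocol is still recognisable; its remainder stays opaque.
    fields = parser(rest) if parser else {"opaque": rest}
    return scheme.lower(), fields

print(parse_address("ftp://xx.yy.edu/pub/doc/README"))
print(parse_address("news:comp.lang.c"))
print(parse_address("future:anything-at-all"))
```

The point of the design is that only the first step is common to all addresses. Everything after the identifier is interpreted by protocol-specific code, so a future protocol requires a new table entry rather than a new syntax.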

Relevance

The life of a name is limited by any information contained within it which may become prematurely invalid. It is therefore necessary to limit the contents of a name to the information required for the operations above. Other extraneous information about the object (its size, data format, authorisation details, etc.) may in general change with time and should not be part of the name. One might expect such information to be part of the "header" of an object, and for protocols to allow the header information to be retrieved independently of the objects themselves. Any physical address may be subject to change with time: hence we encourage the move to lasting names and directory services.

Uniqueness

Clearly one requires unambiguous names in the sense that one name should refer to only one logical object. This is the case with all the addressing schemes in use, whether they are directory systems or physical addresses. (The internet addresses all rely on the domain name (Mockapetris 1987) of the host to achieve this.) However, given that names can be translated, many apparently different names may lead to the same object. Any object may therefore be referred to by many names. One needs to be able to know whether two objects, retrieved through different paths, are in fact the same object.

It is suggested that each object have a unique "official" name. This name could be stored in the object in some representations, or stored in a database accessible to the server, for example. Any references within that object should be parsed in the context of the official name. In the presence of a directory service, the official name will normally be the registered name of the object. However, a name in any scheme will do, so long as it is completely specified. On systems which do not allow the name to be stored (such as anonymous FTP archive sites), a possible ambiguity will always exist as to whether two similarly named objects are in fact the same.

Note that Internet newsgroup names are unique world-wide, and news articles carry a unique message id. In most other cases, however, there is no guarantee that dereferencing a URL will work, or that if it does the object it refers to will in fact be the object intended. URLs such as FTP addresses are transient in that files may be moved and even replaced by different files of the same name. This disorganisation may be limited by good server management, but a naming scheme which is also independent of the internet host name is obviously preferable.

Readability by people

This requirement has been put forward by several people (Clifford Lynch, Douglas Engelbart among others), and disputed by others. The author's view is that it will be a while before technology and standardisation have reached the point at which names and addresses will be hidden from human beings. As long as they must be written on the backs of envelopes and "cut and pasted" between workstation windows, there is a strong need for names to be

Short
Composed of printable (preferably non-white) characters
To a certain extent, understandable by a human being.

Structure of names and addresses

A physical address is required in order for:

The user's program to contact the server;
The server to perform the operation (e.g. search an index, retrieve an object, or look up the name) and return a result;
The user's program to locate an individual position or element within a returned object.

This suggests that a name be structured, such that the parts necessary for these three operations be separate and only used by those system elements which need those parts. This corresponds to the basic principle of information hiding. In fact, four parts are necessary, including the indicator of the naming scheme to be used:

The naming scheme: a registered identifier for the protocol.
The name of a suitable server. The format of this part must be well defined. It will depend on the lower-layer protocols in use. Systems which use widely distributed information, such as x.500 and NNTP, do not need this part as each client generally contacts his nearest server (or a particular server).
Information to be passed to the server. This may be private to the server, as all names may be generated and used by the same server. This part of the name should be opaque to the client.
Information to be used by the application once the object has been retrieved. This part is private to the application (or, more strictly, the data format) and so cannot be defined here.
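The four parts listed above can be made concrete with a small Python sketch. It uses the standard library function urllib.parse.urlsplit, whose field layout is an assumption here: it reflects the later, standardised URL syntax rather than anything defined in this section. The host name info.example.org and the key names in the returned dictionary are hypothetical.

```python
from urllib.parse import urlsplit

def four_parts(name):
    """Split a name into the four parts, each used by a different element."""
    parts = urlsplit(name)
    # Path and query travel to the server together and stay opaque to the client.
    for_server = parts.path + ("?" + parts.query if parts.query else "")
    return {
        "scheme": parts.scheme,             # used to choose the protocol
        "server": parts.netloc,             # used by the client to contact a server
        "for_server": for_server,           # passed to the server unexamined
        "for_application": parts.fragment,  # used only after retrieval
    }

print(four_parts("http://info.example.org/hypertext/doc.html#section2"))
print(four_parts("news:comp.lang.c"))
```

In the second call the server part is empty, which matches the observation that widely distributed systems do not need it: the client simply contacts its usual server.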

Both lasting names and physical addresses often share a hierarchical structure. This follows often from the organisation of the system. From the naming point of view, it has the advantage that a reference in one object to another object need not include that part of the structure which is common to both names.
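The advantage of hierarchical structure for references can be sketched in Python with the standard library function urllib.parse.urljoin, which resolves a partial name in the context of a complete one. The host name info.example.org, the file names and the wrapper function resolve are hypothetical, and the resolution rules are those of the later standardised URL syntax, assumed here for illustration.

```python
from urllib.parse import urljoin

def resolve(official_name, reference):
    """Complete a partial reference using the name of the containing object."""
    return urljoin(official_name, reference)

base = "http://info.example.org/pub/doc/README"

# Only the part that differs from the base needs to be written down.
print(resolve(base, "INSTALL"))
print(resolve(base, "../src/main.c"))
print(resolve(base, "/other/file"))
```

A group of objects that refer to each other in this way can be moved together to another place in the hierarchy, or to another server, without rewriting the references between them.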

Choices for a universal syntax

The requirements above leave little room for choice save for the order and punctuation of the elements of an address. It is only reasonable for the order of writing of the parts to be consistently from left to right (or right to left) with increasing specificity. Punctuation schemes fall into two categories (Huitema 1991): tagged schemes, in which fields are given names, and untagged schemes, which rely on special characters and field order. The latter tend to be more compact.

        protocol: aftp host: xxx.yyy.edu path: /pub/doc/README

        PR=aftp; H=xx.yy.edu; PA=/pub/doc/README;

        PR:aftp/xx.yy.edu/pub/doc/README
  
        /aftp/xx.yy.edu/pub/doc/README

Fig 1. Some alternative tagged and untagged representations
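The equivalence of the tagged and untagged forms in Fig 1 can be shown with a brief Python sketch that reads the second representation and writes the fourth. The function names are hypothetical, and the tags PR, H and PA are taken from the figure only; none of this is a proposed syntax.

```python
def parse_tagged(text):
    """Read 'TAG=value;' pairs; in a tagged scheme the field order is free."""
    fields = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        tag, _, value = item.partition("=")
        fields[tag.strip()] = value.strip()
    return fields

def to_untagged(fields):
    """Write the same fields with meaning carried by position alone."""
    return "/" + fields["PR"] + "/" + fields["H"] + fields["PA"]

tagged = "PR=aftp; H=xx.yy.edu; PA=/pub/doc/README;"
print(to_untagged(parse_tagged(tagged)))
```

The untagged result is shorter, but it can only be read back correctly by a parser that knows the field order in advance.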

The choice of special symbols for punctuation tends to be a matter of taste. It is easier to read addresses whose symbols correspond to those of one's favourite operating system. A variety of symbols is needed so that when a name is abbreviated it is possible to tell which parts have been omitted.

The recommendation below uses special characters in order to achieve a compact name, and uses where possible punctuation symbols established in the internet or unix community.

The choice of escape character for introducing representations of non-allowed characters also tends to be a matter of taste. An ANSI standard exists in the C language, using the back-slash character "\". The use of this character on unix command lines, however, can be a problem as it is interpreted by many shell programs, and would itself have to be escaped.

There is a conflict between the need to be able to represent many characters, including spaces, directly within a URL, and the need to be able to use a URL in environments which have limited character sets or in which certain characters are prone to corruption. This conflict has been resolved by use of a hexadecimal escaping method which may be applied to any characters forbidden in a given context. When URLs are moved between contexts, the set of characters escaped may be enlarged or reduced unambiguously.
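The effect of enlarging or reducing the escaped set can be sketched in Python with the standard library functions urllib.parse.quote and urllib.parse.unquote. These use "%" followed by two hexadecimal digits, which is an assumption as far as this section is concerned, since the escape character is not named here. The file name and the helper function escape_for are hypothetical.

```python
from urllib.parse import quote, unquote

def escape_for(name, allowed):
    """Escape every character except letters, digits, '_.-~' and `allowed`."""
    return quote(name, safe=allowed)

original = "/pub/my file.txt"

# Context 1: "/" is permitted, so only the space is escaped.
lenient = escape_for(original, "/")
# Context 2: "/" is also forbidden, so the escaped set is enlarged.
strict = escape_for(original, "")

print(lenient)  # /pub/my%20file.txt
print(strict)   # %2Fpub%2Fmy%20file.txt

# Both forms reduce to the same name, so no information is lost.
print(unquote(lenient) == unquote(strict) == original)  # True
```

When moving a name between contexts it is safer to unescape first and then escape for the new context. Escaping an already escaped name would escape the escape character itself.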

The use of multiple white space characters is discouraged in URLs to be printed or sent by electronic mail. This is because extraneous white space is frequently introduced when lines are wrapped, whether by systems such as mail or by the sheer necessity of a narrow column width, and because various forms of white space are converted into one another during character code conversion and the transfer of text between applications.