Tim Berners-Lee, Jean-François Groff, Robert Cailliau
CERN, 1211 Geneva 23, Switzerland
February 1992

Universal Document Identifiers on the Network


OSI-DS-XX

Status of this memo

This draft document is for unlimited distribution for discussion. Please send 
comments to the author at timbl@info.cern.ch or to the discussion list 
CNI-ARCH@UCCVMA.BITNET. Following such discussion a future revision of 
this paper may be submitted to the Internet RFC editor for consideration as an 
internet standard.

Abstract

Many protocols and systems for document search and retrieval are currently in 
use, and many more protocols or refinements of existing protocols are to be 
expected in a field whose expansion is explosive.

These systems are aiming to achieve global search and readership of documents 
across differing computing platforms, and despite a plethora of protocols and 
data formats.   As protocols evolve, gateways can allow global access to remain 
possible. As data formats evolve, format conversion programs can preserve 
global access.  There is one area, however, in which it is impractical to make 
conversions, and that is in the names used to identify documents.  This is 
because names of documents are passed on in so many ways, from the backs of 
envelopes to hypertext documents, and may have a long life.

This paper discusses the requirements on a universal naming syntax which can 
be used to refer to documents available using existing protocols, and may be 
extended with technology.  It makes a recommendation for a generic syntax, and 
for specific forms using existing internet protocols.

Terms

The objects on the network which are to be named include objects which can be 
retrieved, and objects which can be searched.

In this paper we refer to the first as "documents", no matter what they contain. 
We imply nothing about the contents at this stage.  The "document" is the unit of 
retrieval and need not correspond to any unit of storage.  We refer to objects 
which can be searched as "indexes".  We emphasize that this is the abstract view 
of the users, and these objects need not correspond to physical files on 
computers. We refer to the person who does the retrieval or searching as the 
user.

We use the terms "name" and "identifier" synonymously. The term "address" is 
reserved for an identifier which specifies a more or less physical location.

Uses of a document name

The name allows a user, with the help of a "client" program, to search indexes 
and/or retrieve documents from a "server" program.  A name may be passed, for 
example:

•	In communication of any form between two people, to refer to a 
document, or part of a document;

•	As part of the description of a link associated with a hypertext 
document;

•	As part of the result of searching an index.

Basic requirements on a name are that

•	A given name will remain valid as long as it is needed;

•	A given naming syntax will remain valid through the introduction of 
new protocols and directory technologies;

•	A name will contain enough information to allow the document or index 
to which it refers to be accessed.

The syntax discussed is the syntax of one name, be it a lasting name or a 
physical address.  When a directory server or hypertext link contains a set of 
alternative names, then that is beyond the scope of this syntax. Similarly, a 
syntax for describing a compound document is outside the scope of this syntax.  


Current practice


Current protocols use many different standards for names.  For some protocols, 
such as the ISO-10163 Search and Retrieve protocol (International Standards 
Organization 1991), the identifiers returned in a search are only valid during 
the session. For others, such as FTP (Postel 1985), they are lasting names 
which may be used for document retrieval at a later time.  
Typically, however, they are not long-lasting names which are independent of 
the location of the document.  Such names may be provided using directory 
servers such as x.500.  They will refer to the registration, however formal or 
informal, of a document with a particular organization or person.  Both hypertext 
and  manual references rely on long-lasting names.

Current names are basically location specifiers (addresses). They give the 
necessary parts of an address for a reader to access an information provider 
using the given protocol, and ask for the document required. Examples of names 
used by various protocols include

File transfer protocol (Postel 1985)   Host name or IP-address
                                       [IP port]
                                       [user name, password]
                                       Filename

W.A.I.S. (Kahle 1990)                  Host name or IP-address
                                       [IP port]
                                       database name
                                       local document id

Gopher (Alberti 1991)                  Host name or IP-address
                                       [IP port]
                                       database name
                                       selector string

HTTP (Berners-Lee 1991)                Host name or IP-address
                                       [IP port]
                                       local document id

NNTP (Kantor 1986) group               Group name

NNTP article                           Host name
                                       unique message identifier

x.500 distinguished name               Country
                                       Organization
                                       Organizational unit
                                       Person
                                       Local document identifier


Other systems with their own naming schemes include the BITNET "LISTSERV" 
application, FTAM file retrieval, SQLnet(TM) remote database search, proprietary 
distributed file systems, etc. Conventional syntaxes for writing these addresses 
involve various forms of punctuation to separate these parts.  This sometimes, 
but not always, allows the naming scheme to be deduced from the punctuation. 
For example, a name of the form  xxx.yyy.zz.edu:/pub.aa.bb.cc 
often implies anonymous FTP access.  However, there is no well-defined 
algorithm for parsing an arbitrary name, as there is no common syntax.


Expandability


There will necessarily be a phase during which lasting names will become more 
common, as the deployment of directory services increases to the point where 
every user has direct or indirect access to one.  Even then, however, one can 
envisage more than one competing directory system, and cases in which physical 
names are still required.  A directory service takes a lasting name and reduces it 
to a physical address (or set of addresses) which, though less useful for lasting 
reference, is the only way to actually retrieve the document.

An addressing syntax is required which will be able to encompass existing 
physical address spaces, and be extendable to any future protocols.  This 
requires that it contain an identifier for the protocol in use. The format of the rest 
of the address will necessarily depend to a certain extent on the protocol.  
Obviously, ISO standard lower protocol layers will have their own forms of 
addressing, and new applications will have their own forms for addressing 
subsections of documents.


Relevance

The life of a name is limited by any information contained within it which may 
become prematurely invalid. It is therefore necessary to limit the contents of a 
name to the information required for the operations above.  Other extraneous 
information about the document (its size, data format, authorization details, etc) 
may in general change with time and should not be part of the name.

One might expect such information to be part of the "header" of a document, 
and for protocols to allow the header information to be retrieved independently 
of the documents themselves. 

Any physical address may be subject to change with time: hence we encourage 
the move to lasting names and directory services.

Uniqueness

Clearly one requires uniqueness in the sense that one name should refer to only 
one logical document. This is the case with all the addressing schemes in use, 
whether they are directory systems or physical addresses. (The internet addresses 
all rely on the domain name (Mockapetris 1987) of the host to achieve this).

However, given that names can be translated, many apparently different names 
may lead to the same object. Any object may therefore be referred to by many 
names. One needs to be able to know whether two documents, retrieved through 
different paths, are in fact the same document.

It is suggested that each document have one "official" name. This name could be 
stored in the document in some representations, or stored in a database 
accessible to the server, for example.  Any references within that document 
should be parsed in the context of the official name.  In the presence of a 
directory service, the official name will normally be the registered name of the 
document. However, a name in any scheme will do, so long as it is completely 
specified.  On systems which do not allow the name to be stored (such as 
anonymous FTP archive sites), a possible ambiguity will always exist as to 
whether two similarly named documents are in fact the same.

Note that internet newsgroup names are unique worldwide, and news articles 
carry a unique message id.

Readability by people
 
This requirement has been put forward by several people (Clifford Lynch, Doug 
Engelbart among others), and disputed by others.  The author's view is that it 
will be a while before technology and standardization have reached the point at 
which names and identifiers will be hidden from human beings. As long as they 
must be written on the backs of envelopes and "cut and pasted" between 
workstation windows, there is a strong need for names to be

•	Short
•	Composed of printable (preferably non-white) characters
•	To a certain extent, parsable by a human being.


Structure

A physical address is required in order for 

•	The user's program to contact the server

•	The server to search an index or retrieve a document

•	The user's program to locate an individual position or element within a 
document.

This suggests that a name be structured, such that the parts necessary for these 
three operations be separate and only used by those system elements which need 
those parts. This corresponds to the basic principle of information hiding.  In 
fact,  four parts are necessary, including the indicator of the naming scheme to 
be used:

•	The naming scheme: a registered identifier for the protocol.

•	The name of a suitable server. The format of this part must be well 
defined. It will depend on the lower-layer protocols in use.  Systems 
which use widely distributed information, such as x.500 and NNTP, do 
not need this part as each client generally contacts its nearest server (or 
a particular server).

•	Information to be passed to the server. This may be private to the server, 
as all names may be generated and used by the same server. The client 
should normally pass this part of the name on without interpreting it.

•	Information to be used by the application once the document has been 
retrieved.  This part is private to the application (or, more strictly, the 
data format) and so cannot be defined here.


Both lasting names and physical addresses often share a hierarchical structure. 
This often follows from the organization of the system. From the naming point 
of view, it has the advantage that a reference in one document to another 
document need not include that part of the structure which is common to both 
names.


Choices

The requirements above leave little room for choice save for the order and 
punctuation of the elements of an address.  It is only reasonable for the order of 
writing of the parts to be consistently from left to right (or right to left) with 
increasing specificity.  Punctuation schemes fall into two categories (Huitema 
1991): tagged schemes, in which fields are given names, and schemes which use 
special characters and field order. The latter tend to be more compact.


protocol: aftp host: xxx.yyy.edu path: /pub/doc/README

PR=aftp; H=xx.yy.edu; PA=/pub/doc/README;

PR:aftp/xx.yy.edu/pub/doc/README

/aftp/xx.yy.edu/pub/doc/README

Fig 1. Some alternative tagged and untagged representations


The choice of special symbols for punctuation tends to be a matter of taste: it is 
easier to read addresses whose symbols correspond to those of one's favorite 
operating system.  A variety of symbols is needed so that when a name is 
abbreviated it is possible to tell which parts have been omitted. The 
recommendation below uses special characters in order to achieve a compact 
name, and uses where possible punctuation symbols established in the internet or 
unix community.

The choice of escape character for introducing representations of non-allowed 
characters also tends to be a matter of taste. An ANSI standard exists in the C 
language, using the back-slash character "\". The use of this character on unix 
command lines, however, can be a problem as it is interpreted by many shell 
programs, and would have itself to be escaped.

The use of white space characters has been avoided in UDIs: spaces are not 
legal characters.  This was done because extraneous white space is frequently 
introduced when lines are wrapped by systems such as mail, or by the sheer 
necessity of narrow column width, and because various forms of white space 
are converted into one another during character code conversion and the 
transfer of text between applications.


Recommendation

The syntax is described in two parts. First, the syntax rules of a completely 
specified name are given; then, the rules under which parts of the name may be 
omitted in a well-defined context. 

Full form

A complete address consists of a naming scheme specifier followed by an 
address whose format is a function of the naming scheme. For physical 
addresses of information on the internet, a common syntax is used for the 
internet address part. A BNF description of the UDI syntax is given in an 
appendix.  The components are as follows.

Anchor-id

This represents a part of, or a sub-function within, a document. Its syntax and 
semantics are defined by the application responsible for the document. The only 
definition here is of the allowed characters by which it may be represented in a 
UDI.

The anchor-id follows the UDI of the whole document from which it is 
separated by a hash sign (#).  If the anchor-id is void, the hash sign may be 
omitted. A void anchor-id, with or without the hash sign, means that the UDI 
refers to the whole document.
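
As a minimal sketch of the rule above, a client might separate the anchor-id from the document address as follows. The function name split_anchor is hypothetical and not part of this proposal; the sketch relies on the appendix grammar, in which the hash sign is not an allowed character anywhere else in a UDI.

```python
def split_anchor(udi):
    """Split a UDI into (document address, anchor-id).

    Hypothetical helper. A missing or void anchor-id is returned as "",
    meaning that the UDI refers to the whole document.
    """
    # The hash sign is not among the characters allowed elsewhere in a UDI,
    # so the first one found (if any) introduces the anchor-id.
    docaddress, _hash, anchorid = udi.partition("#")
    return docaddress, anchorid
```

A UDI with a trailing hash sign and a UDI with no hash sign at all both yield an empty anchor-id, so the two forms are treated identically.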

Scheme

Within the UDI of a document, the first element is the name of the scheme, 
separated from the rest of the UDI by a colon. The rest of the UDI follows 
the colon in a format depending on the scheme.

Internet protocol parts

Those schemes which refer to internet protocols have a common syntax for the 
rest of the document name. This starts with a double slash "//" to indicate its 
presence, and continues until the following slash "/".  Within that section are

•	An optional user name, if this must be quoted to the server, followed by 
a commercial at sign "@".  (Use of this field is discouraged. Provision 
for encoding a password after the user name, delimited by a colon, could 
be made but obviously is only useful when the password is public, in 
which case it should not be necessary, so that is also discouraged.)

•	The internet domain name of the host in RFC 1034 format (Mockapetris 
1987) or, optionally and less advisably, the IP address as a set of four 
decimal numbers separated by dots

•	The port number, if it is not the default number for the protocol, given 
in decimal notation after a colon.
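
The decomposition just described can be sketched as below. This is an illustration under stated assumptions, not a definitive parser: the function name parse_internet_udi and the dictionary keys are hypothetical, any anchor-id is assumed to have been removed already, and a password (whose use is discouraged) is left attached to the user name rather than split out.

```python
def parse_internet_udi(udi):
    """Split scheme://[user@]host[:port][/path] into its parts.

    Hypothetical helper. Returns a dict with the keys scheme, user,
    host, port (an int, or None for the protocol default) and path
    (the text after the slash that ends the internet part).
    """
    scheme, colon, rest = udi.partition(":")
    if not colon or not rest.startswith("//"):
        raise ValueError("not an internet-protocol UDI: %r" % udi)
    # The internet part runs from the "//" up to the following slash.
    netpart, _slash, path = rest[2:].partition("/")
    user = None
    if "@" in netpart:
        # Optional user name, terminated by the commercial at sign.
        user, netpart = netpart.rsplit("@", 1)
    host, has_port, port = netpart.partition(":")
    return {
        "scheme": scheme,
        "user": user,
        "host": host,
        "port": int(port) if has_port else None,
        "path": path,
    }
```

Returning None for an absent port leaves the choice of default to the code that knows the protocol, which matches the rule that the port is written only when it differs from that default.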


Path

The rest of the address is known as the "path". It may define details of how the 
client should communicate with the server, including information to be passed 
transparently to the server without any processing by the client.

The path is interpreted in a manner dependent on the protocol being used.  
However, when it contains slashes, these must imply a hierarchical structure. 


Partial form

Within a document whose UDI is well defined, the UDI of another document 
may be given in abbreviated form, where parts of the two UDIs are the same. 
This allows documents within a group to refer to each other without requiring 
the space for a complete reference, and it incidentally allows the group of 
documents  to be moved without changing any references.

This relies on a property of the UDI syntax that certain characters ("/") and 
certain path elements ("..", ".") have a significance reserved for representing a 
hierarchical space, and must be recognized as such by both clients and servers.

The rules for the use of a partial name are:

•	If the scheme parts are different, the whole absolute address must be 
given. Otherwise, the scheme is omitted, and:

•	If the host and/or port parts are different, the host, port and all 
the rest of the address must be given.

•	If the scheme and host parts are the same, then the path may be given in 
absolute (fully qualified) or relative form. Within the path:

•	If a leading slash is present, the path is absolute. Otherwise, a relative 
path is interpreted as follows:

•	The last part of the path of the context address (anything following the 
rightmost slash) is removed, and the given relative address appended in 
its place.

•	Within the result, all occurrences of "/xxx/.." or "/." are recursively 
removed, where xxx, ".." and "." are complete path elements.


Mapping Local Names

When a system uses a local addressing scheme, it is useful to provide a mapping 
from local addresses into UDIs so that references to documents within the 
addressing scheme may be referred to globally, and possibly accessed through 
gateway servers.

Any mapping scheme may be defined provided it is unambiguous, reversible, 
and provides valid UDIs. It is recommended that where hierarchical aspects to 
the local naming scheme exist, they be mapped onto the hierarchical UDI path 
syntax in order to allow the partial form to be used.

The following escaping method is used for mapping WAIS and Gopher 
addresses onto UDIs. Where the local naming scheme uses ASCII characters 
which are not allowed in the UDI,  these may be represented in the UDI by a 
percent sign "%" followed by two hexadecimal digits (0-9, A-F) giving the 
ASCII value for that character. If non-ASCII characters are used, then a similar 
escaping system should be used. Character codes other than those allowed by the 
syntax shall not be used in a UDI.
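
A minimal sketch of this escaping method for ASCII names follows. The function names escape and unescape are hypothetical. The set of characters passed through unescaped is taken from the xalpha rule in the appendix, with one assumption made here for reversibility: the percent sign itself is always escaped. Slashes are escaped too, so the sketch applies to a single path element, not to a whole hierarchical path.

```python
import string

# Characters passed through unchanged, following the xalpha rule of the
# appendix. Assumption: "%" is left out of the set so that a literal
# percent sign is escaped and the mapping stays reversible.
SAFE = set(string.ascii_letters + string.digits + "$_@!^&*().")


def escape(local_name):
    """Map one element of an ASCII local name onto allowed UDI characters."""
    out = []
    for ch in local_name:
        if ord(ch) > 127:
            raise ValueError("non-ASCII character: %r" % ch)
        out.append(ch if ch in SAFE else "%%%02X" % ord(ch))
    return "".join(out)


def unescape(udi_part):
    """Reverse escape(): turn each %XX back into the ASCII character."""
    out = []
    i = 0
    while i < len(udi_part):
        if udi_part[i] == "%":
            out.append(chr(int(udi_part[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(udi_part[i])
            i += 1
    return "".join(out)
```

With this mapping the Gopher selector element "Information About Gopher" becomes Information%20About%20Gopher, which is the form used in the reference list of this paper.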

The same considerations apply to mapping local anchor identifiers onto the 
anchorid part of a UDI.

Specific Naming Schemes

The mapping for some existing standard and experimental protocols is outlined 
in the BNF syntax definition.  Notes on particular protocols follow.

File

The file: prefix indicates a file which is to be picked up from the file system of 
the given host. The FTP protocol is normally used. The port number if given 
gives the port of the FTP server if not the FTP default. The client may use local 
file access to retrieve objects which are available through more efficient means 
such as local file open or NFS mounting. 

The syntax allows for the inclusion of a user name and even a password for 
those systems which do not use the anonymous FTP convention. The default, 
however, if no user or password is supplied, will be to use that convention, viz 
that the username is "anonymous" and the password the user's mail address.

The adoption of a unix-style syntax involves the conversion into non-unix local 
forms by either the client or server. Some non-unix servers do this, but clients 
wishing to access sites which do not have unix-style naming will need certain 
algorithms to enable  other file systems to be identified and treated.  Client 
software may also have to be flexible in terms of the sequence of FTP 
commands used with different varieties of server.  In view of a tendency for file 
systems to look increasingly similar, it was felt that the UDI convention should 
not be weighed down by extra mechanisms for identifying these cases.

The data format of a file can only, in the general FTP case, be deduced from the 
name, normally the suffix of the name. This is not standardized. The transfer 
mode (binary or text) must in turn be deduced from the data format.  It is 
recommended that conventions for suffixes of public archives be established, but 
this is outside the scope of this paper.

News

The news addresses refer to either news group names or article message 
identifiers which must conform to the rules of RFC 850.  A message identifier 
may be distinguished from a news group name by the presence of the 
commercial at "@" character.  These rules imply that within an article, a 
reference to a news group or to another article will be a valid UDI (in the partial 
form).
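
The distinction between the two kinds of news address can be sketched as follows. The function name classify_news is hypothetical, and the group name and message identifier in the usage note are invented for illustration only.

```python
def classify_news(udi):
    """Return ("article", id) or ("group", name) for a news UDI.

    Hypothetical helper. Accepts either the full form ("news:...") or
    the partial form used within an article (no scheme prefix).
    """
    name = udi[len("news:"):] if udi.startswith("news:") else udi
    # A message identifier is recognized by the commercial at character.
    kind = "article" if "@" in name else "group"
    return kind, name
```

For instance, an illustrative group name such as "news:comp.infosystems" is classified as a group, while an illustrative identifier such as "1234@info.cern.ch" is classified as an article.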

An outstanding problem is that the message identifier is insufficient to allow the 
retrieval of an expired article, as no algorithm exists for deriving an archive site 
and filename.  The addition of the date and news group set to the article's UDI 
would allow this if a directory existed of archive sites by news group.

WAIS

The current public domain WAIS implementation requires that a client know the 
"type" and length of a document prior to retrieval.  These values are returned 
along with the internal document identifier in the search response.  They have 
been encoded into the path part of the UDI in order to make the UDI sufficient 
for the retrieval of the document.  If  changes to WAIS specs make the internal 
id something which is sufficient for later retrieval then this will not be necessary.

Within the WAIS world, identifiers do not, of course, need to be prefixed by 
"wais:" (by the partial form rules). 

Prospero

The Prospero (Neuman 1992) UDP-based virtual file system protocol is used. The 
host and port parts are used, and are optional.  The path part may be the name 
of a file, or anything else according to the server.  If the path ends with a 
final slash "/", that indicates to the client that the object is a directory to 
be listed. Prospero links of the form EXTERNAL are converted into UDIs of 
non-prospero naming schemes (such as "file:").

Gopher

The first character of the UDI path part (after the initial single slash) is a 
single-character "type" field which is that used by the Gopher protocol.  The 
rest of the path is the "selector string", with unprintable characters and 
spaces encoded. 
Gopher links which refer to different protocols may be converted into UDIs for 
those protocols.
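
As a hedged sketch of the mapping just described, a client might recover the Gopher type and selector string from the path part as below. The function name split_gopher_path is hypothetical. The sketch follows the prose of this section (the first character is the type, and everything after it is the selector) and assumes the percent escaping described under Mapping Local Names; the standard Python function urllib.parse.unquote is used to decode it.

```python
from urllib.parse import unquote


def split_gopher_path(path):
    """Split a Gopher UDI path into (type character, selector string).

    Hypothetical helper. `path` is the UDI path part after the initial
    single slash. An empty path yields ("", "").
    """
    if not path:
        return "", ""
    # The first character is the Gopher type; the remainder is the
    # selector string, with %XX escapes decoded.
    return path[0], unquote(path[1:])
```

Applied to the path of the Gopher UDI cited in the reference list, "00/Information%20About%20Gopher/About%20Gopher", this gives type "0" and the selector "0/Information About Gopher/About Gopher".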

Telnet, rlogin

The use of UDIs to represent interactive sessions is a convenient extension to 
their uses for documents.  This allows access to information systems which only 
provide an interactive service, and no information server.  As information within 
the service cannot be addressed individually or, in general, automatically 
retrieved, this is a less desirable, though currently common, solution.

x500

The mapping of x500 names onto UDIs is not defined here. A decision is 
required as to whether "distinguished names" or "user friendly names" (ufn), or 
both, should be allowed. If any punctuation conversions are needed from the 
adopted x500 representation (such as the use of slashes between parts of a ufn) 
they must be defined. This is a subject for study.  


Registration of naming schemes

A new naming scheme may be introduced by defining a mapping onto a 
conforming UDI syntax, using a new scheme identifier.  Experimental scheme 
identifiers may be used by mutual agreement between parties, and must start 
with the characters "x-".

It is proposed that the Internet Assigned Numbers Authority perform the 
function of registration of new schemes.  Any submission of a new scheme must 
include a definition of an algorithm for the retrieval of any object within that 
scheme. The algorithm must take the UDI and produce either a set of UDI(s) 
which will lead to the desired object, or the object itself, in a well-defined or 
determinable format. It is recommended that those proposing a new scheme 
demonstrate its utility and operability by the provision of a gateway which will 
provide images of objects in the new scheme for clients using an existing 
protocol.

It is likewise recommended that, where a protocol allows for retrieval by UDI, 
the client software have provision for being configured to use specific 
gateway addresses for new naming schemes.

Conclusion

A need has been demonstrated, and a number of requirements have been stated 
for universal document identifiers (UDIs). A scheme has been proposed which 
builds on existing conventions to define a syntax for UDIs.  Adoption of the 
scheme in correspondence, standards and software will ease the use of 
references to online information in a flexible way as the coming information age 
arrives.

Acknowledgements

This paper builds on much discussion of these issues by many people on the 
network.  The discussion was particularly stimulated by articles by Clifford 
Lynch (1991), Brewster Kahle (1991) and Wengyik Yeong (1991b). Contributions 
from John Curran (NSF), Clifford Neuman (ISI) and Ed Vielmetti (MSEN) have 
been incorporated into this issue of this paper.


REFERENCES

Alberti, R., et al. (1991) "Notes on the Internet Gopher Protocol", University of Minnesota, 
December 1991, UDI=file://boombox.micro.umn.edu/pub/gopher/gopher_protocol. See also 
UDI=gopher://gopher.micro.umn.edu:70/00/Information%20About%20Gopher/About%20Gopher
Berners-Lee, T. (1991) "HTTP as implemented in WWW", CERN, December 1991, 
UDI=file://info.cern.ch./pub/www/doc/http.txt
International Standards Organization (1991) Information and Documentation - Search and 
Retrieve Application Protocol Specification for Open Systems Interconnection, ISO-10163
Huitema, C. (1991) "Naming: strategies and techniques", Computer Networks and ISDN 
Systems 23 (1991) 107-110.
Davis, F., et al. (1990) "WAIS Interface Protocol: Prototype Functional Specification", 
Thinking Machines Corporation, April 23, 1990, 
UDI=file://quake.think.com/pub/wais/doc/protspec.txt
Kahle, Brewster (1991) "Document Identifiers, or International Standard Book Numbers for 
the Electronic Age", UDI=file://quake.think.com/pub/wais/doc/doc-ids.txt
Kantor, B., and Lapsley, P. (1986) "A proposed standard for the stream-based transmission of 
news", Internet RFC-977, February 1986. UDI=file://nnsc.nsf.net/rfc/rfc977.txt
Lynch, C., Coalition for Networked Information (1991) "Workshop on ID and Reference 
Structures for Networked Information", November 1991. See 
UDI=wais://quake.think.com/wais-discussion-archives?lynch
Mockapetris, P. (1987) "Domain names - concepts and facilities", RFC-1034, USC-ISI, 
November 1987, UDI=file://nnsc.nsf.net/rfc/rfc1034.txt
Neuman, B. Clifford (1992) "Prospero: A Tool for Organizing Internet Resources", Electronic 
Networking: Research, Applications and Policy, Vol 1 No 2, Meckler Westport CT USA.  See also 
UDI=file://prospero.isi.edu/pub/prospero/oir.ps
Postel, J. and Reynolds, J. (1985) "File Transfer Protocol (FTP)", Internet RFC-959, October 
1985. UDI=file://nnsc.nsf.net/rfc/rfc959.txt
Yeong, W. (1991a) "Towards Networked Information Retrieval", Technical report 
91-06-25-01, June 1991, Performance Systems International, Inc. UDI=file://uu.psi.com/wp/nir.txt
Yeong, W. (1991b) "Representing Public Archives in the Directory", Internet Draft, 
November 1991.  In UDI=wais://nnsc.nsf.net/internet-drafts?yeong

Appendix:  BNF syntax of Universal Document Identifiers

This is a BNF-like description of the W3 addressing syntax. We use a vertical 
line "|" to indicate alternatives, and [brackets] to indicate  optional parts.   Spaces 
are representational only: no spaces are actually allowed within a UDI. Single 
letters stand for single letters. All words of more than one letter below are 
entities described somewhere in this description. 


anchoraddress	docaddress [ # anchorid ]

docaddress	generic | httpaddress | fileaddress | newsaddress |
	prosperoaddress | telnetaddress | gopheraddress | waisaddress

generic	scheme :  path

scheme	ialpha

httpaddress	 h t t p :   / / hostport  [  / path ] [ ? search ]

fileaddress	 f i l e : / / host / path

newsaddress	 n e w s : groupart

waisaddress	 waisindex | waisdoc

waisindex	 w a i s : / / hostport / database [ ? search ]

waisdoc	 w a i s : / / hostport / database / wtype / digits / path

groupart	 * | group | article

group	 ialpha [ . group ]

article	 xalphas @ host

database	 xalphas

wtype	 xalphas

prosperoaddress	p r o s p e r o : / / path

telnetaddress	 t e l n e t : / / [ user @ ] hostport

gopheraddress	 g o p h e r : / / hostport [ / gtype [ / selector ] ] [ ? search ]

hostport	 host [ : port ]

host	 hostname | hostnumber

hostname	 ialpha [  .  hostname ]

hostnumber	 digits . digits . digits . digits

port	 digits

selector	 path

path	 void |  xalphas  [  / path ]

search	 xalphas [ + search ]

user	 xalphas

anchorid	 xalphas

gtype	 xalpha

xalpha	 alpha | $ | _ | @ | ! | % | ^ | & | * | ( | ) | . | digit

xalphas	 xalpha [ xalphas ]

ialpha	 alpha [ xalphas ]

alpha	 a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
	 p | q | r | s | t | u | v | w | x | y | z | A | B | C |
	 D | E | F | G | H | I | J | K | L | M | N | O | P |
	 Q | R | S | T | U | V | W | X | Y | Z

digit	 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

digits	 digit [ digits ]

alphanum	 alpha | digit

alphanums	 alphanum [ alphanums ]

void