This section outlines the characteristics
of various naming schemes, requirements
which some existing schemes meet, and
requirements for the URL scheme itself,
as an introduction to and background
for the Recommendations section.
Uses of names and addresses
A name allows a user, with the help
of a "client" program, to retrieve
or operate on objects via a "server"
program. A name may be passed for
example:
- In communication of any form between
two people, to refer to a document,
or part of a document;
- As part of the description of a link
associated with a hypertext document;
- As part of the result of searching
an index.
Some typical requirements on a name
which are met to a varying degree
by various schemes are for example
that the name is
- Persistent: A given name will remain
  valid as long as it is needed;
- Extensible: A given naming syntax
  will remain valid through the introduction
  of new protocols and directory technologies;
- Resolvable: A name will contain enough
  information to allow the document
  or index to which it refers to be
  accessed, perhaps via resolution
  into an intermediate, more physical,
  name.
- Unique: Each object can only have
  one such name. The fact that two
  such names are different implies
  that the objects to which they refer
  are different (in some way).
- Unambiguous: The fact that two names
  are identical implies that the objects
  named are the same (in some way).
The syntax discussed is the syntax
of one name, be it a lasting name
or a physical address. When a directory
server or hypertext link contains
a set of alternative names, then
that is beyond the scope of this
syntax. Similarly, a syntax for
describing a compound object is outside
the scope of this syntax. The specific
locator name spaces (defined under
the umbrella of the general syntax)
each meet the requirements above
to a greater or lesser extent.
Current practice
Current protocols use many different
standards for names. For some protocols,
such as ISO-10163 Search and Retrieve
protocol[16], the names returned
in a search are only valid during
the session. For others, such as
FTP[9], they are lasting names which
may be used for object retrieval
at a later time. Typically, however,
they are not long-lasting names which
are independent of the location of
the object. Such names may be provided
using directory servers such as X.500.
They will refer to the registration,
however formal or informal, of an
object with a particular organisation
or person. Both hypertext and manual
references rely on long-lasting
names. Current names are basically
location specifiers (addresses).
These may be known as Uniform Resource
Locators (URLs). They give the necessary
parts of an address for a reader
to access an information provider
using the given protocol, and ask
for the object required. Examples
of names used by various protocols
include
File Transfer Protocol (Postel 1985):
- Host name or IP-address
- [TCP port]
- [user name, password]
- Filename
W.A.I.S. (Kahle 1990)
- Host name or IP-address
- [TCP port]
- local document id
Gopher (Alberti 1991)
- Host name or IP-address
- [TCP port]
- database name
- selector string
HTTP (Berners-Lee 1991)
- Host name or IP-address
- [TCP port]
- local object id
NNTP (Kantor 1986)
NNTP group
NNTP article
- Host name
- unique message identifier
Prospero links (Neuman 1992)
- Host name or IP address
- [UDP port]
- Host specific object name
- [version]
- [identifier]*
X.500 distinguished name
- Country
- Organisation
- Organisational unit
- Person
- Local object identifier
Other systems with their own naming
schemes include the BITNET "LISTSERV"
application, FTAM file retrieval,
SQLnet(TM) remote database search,
proprietary distributed file systems,
etc. The conventional syntax for writing
these addresses involves various forms
of punctuation to separate these
parts. This sometimes, but not
always, allows the naming scheme
to be deduced from the punctuation.
For example, a name of the form xxx.yyy.zz.edu:/pub.aa.bb.cc
often implies anonymous FTP access.
However, there is no well-defined
algorithm for parsing an arbitrary
name, as there is no common syntax.
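To illustrate why deduction from punctuation is fragile, the following
is a minimal sketch of such a guess. The function name guess_scheme, the
regular expression, and the scheme label "aftp" (borrowed from Fig 1) are
assumptions made for illustration only; nothing here is a defined
algorithm.

```python
import re

# Hypothetical heuristic: a name shaped like "host.domain:/path" is read
# as anonymous FTP. This is a guess from punctuation, not a real parser.
_HOST_COLON_PATH = re.compile(
    r"^(?P<host>[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+):(?P<path>/.*)$"
)


def guess_scheme(name):
    """Return ("aftp", host, path) if the name looks like host:/path.

    Returns None when the punctuation does not identify a scheme.
    """
    match = _HOST_COLON_PATH.match(name)
    if match is None:
        return None
    return ("aftp", match.group("host"), match.group("path"))
```

A name such as comp.lang.c contains the same dots but no colon, so the
heuristic returns None for it. Each further naming scheme would need its
own special case, which is the problem a common syntax is meant to remove.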
Expandability
There will necessarily be a phase
during which lasting names will become
more common, as the deployment of
directory services increases to the
point where every user has direct
or indirect access to one. Even
then, however, one can envisage more
than one competing directory system,
and cases in which physical names
are still required. A directory
service takes a lasting name and
reduces it to a physical address
(or set of addresses) which, though
less useful for lasting reference,
is the only way to actually retrieve
the object. An addressing syntax
is required which will be able to
encompass existing physical address
spaces, and be extendible to any
future protocols. This requires
that it contain an identifier for
the protocol in use. The format of
the rest of the address will necessarily
depend to a certain extent on the
protocol.
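The role of the protocol identifier can be sketched as a small registry:
the leading identifier selects a parser, and only that parser knows the
format of the rest of the address. The names PARSERS, register and
parse_address, the ":" separator, and the "//host/path" layout are all
assumptions for illustration, not part of the requirements above.

```python
# Hypothetical registry mapping a protocol identifier to the parser for
# the protocol-specific remainder of an address.
PARSERS = {}


def register(scheme, parser):
    """Add support for a protocol without touching existing parsers."""
    PARSERS[scheme.lower()] = parser


def parse_address(address):
    """Split off the leading protocol identifier and dispatch on it."""
    scheme, sep, rest = address.partition(":")
    if not sep or not scheme:
        raise ValueError("address carries no protocol identifier")
    parser = PARSERS.get(scheme.lower())
    if parser is None:
        raise ValueError("unknown protocol identifier: " + scheme)
    return scheme.lower(), parser(rest)


def parse_host_and_path(rest):
    """Assumed layout for illustration: //host/path."""
    if not rest.startswith("//"):
        raise ValueError("expected //host/path")
    host, _, path = rest[2:].partition("/")
    return {"host": host, "path": "/" + path}


register("aftp", parse_host_and_path)
# A future protocol is added by registration alone.
register("news", lambda rest: {"group": rest})
```

Under these assumptions, parse_address("aftp://xx.yy.edu/pub/doc/README")
yields the host and path, while an address with an unregistered
identifier is rejected rather than misread.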
Relevance
The life of a name is limited by
any information contained within
it which may become prematurely
invalid. It is therefore necessary
to limit the contents of a name to
the information required for the
operations above. Other extraneous
information about the object (its
size, data format, authorisation
details, etc.) may in general change
with time and should not be part
of the name. One might expect such
information to be part of the "header"
of an object, and for protocols to
allow the header information to be
retrieved independently of the objects
themselves. Any physical address
may be subject to change with time:
hence we encourage the move to lasting
names and directory services.
Uniqueness
Clearly one requires unambiguous
names in the sense that one name
should refer to only one logical
object. This is the case with all
the addressing schemes in use, whether
they are directory systems or physical
addresses. (The internet addresses
all rely on the domain name (Mockapetris
1987) of the host to achieve this).
However, given that names can be
translated, many apparently different
names may lead to the same object.
Any object may therefore be referred
to by many names. One needs to be
able to know whether two objects,
retrieved through different paths,
are in fact the same object. It
is suggested that each object have
a unique "official" name. This name
could be stored in the object in
some representations, or stored in
a database accessible to the server,
for example. Any references within
that object should be parsed in the
context of the official name. In
the presence of a directory service,
the official name will normally be
the registered name of the object.
However, a name in any scheme will
do, so long as it is completely specified.
On systems which do not allow the
name to be stored (such as anonymous
FTP archive sites), a possible ambiguity
will always exist as to whether two
similarly named objects are in fact
the same. Note that Internet newsgroup
names are unique world-wide, and
news articles carry a unique message
id. In most other cases, however,
there is no guarantee that dereferencing
a URL will work, or that if it does
the object it refers to will in fact
be the object intended. URLs such
as FTP addresses are transient in
that files may be moved and even
replaced by different files of the
same name. This disorganisation
may be limited by good server management,
but a naming scheme which is also
independent of the internet host name
is obviously preferable.
Readability by people
This requirement has been put forward
by several people (Clifford Lynch,
Douglas Engelbart among others),
and disputed by others. The author's
view is that it will be a while before
technology and standardisation have
reached the point at which names
and addresses will be hidden from
human beings. As long as they must
be written on the backs of envelopes
and "cut and pasted" between workstation
windows, there is a strong need for
names to be
- Short
- Composed of printable (preferably
non-white) characters
- To a certain extent, understandable
by a human being.
Structure of names and addresses
A physical address is required in
order for:
- The user's program to contact the
server;
- The server to perform the operation
(e.g. search and index, retrieve
an object, or look up the name) and
return a result;
- The user's program to locate an individual
position or element within a returned
object.
This suggests that a name be structured,
such that the parts necessary for
these three operations be separate
and only used by those system elements
which need those parts. This corresponds
to the basic principle of information
hiding. In fact, four parts are
necessary, including the indicator
of the naming scheme to be used:
- The naming scheme: a registered identifier
for the protocol.
- The name of a suitable server. The
format of this part must be well
defined. It will depend on the lower-layer
protocols in use. Systems which
use widely distributed information,
such as X.500 and NNTP, do not need
this part as each client generally
contacts its nearest server (or a
particular server).
- Information to be passed to the server.
This may be private to the server,
as all names may be generated and
used by the same server. This part
of the name should be opaque to the
client.
- Information to be used by the application
once the object has been retrieved.
This part is private to the application
(or, more strictly, the data format)
and so cannot be defined here.
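The four parts listed above can be sketched with Python's standard
urllib.parse module. This assumes the "scheme://server/path#fragment"
punctuation of later URL practice, which this section does not itself
define; the NameParts type and split_name function are hypothetical
names introduced for the example.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class NameParts(NamedTuple):
    scheme: str       # registered identifier for the protocol
    server: str       # name of a suitable server (may be empty)
    server_info: str  # opaque to the client, passed to the server
    fragment: str     # used by the application after retrieval


def split_name(name):
    """Separate a name into the four parts, one per system element."""
    parts = urlsplit(name)
    server_info = parts.path
    if parts.query:
        server_info += "?" + parts.query
    return NameParts(parts.scheme, parts.netloc, server_info, parts.fragment)
```

For a name in a widely distributed system, such as "news:comp.lang.c",
the server part comes back empty, matching the remark that such systems
do not need it.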
Both lasting names and physical addresses
often share a hierarchical structure.
This often follows from the organisation
of the system. From the naming point
of view, it has the advantage that
a reference in one object to another
object need not include that part
of the structure which is common
to both names.
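The advantage of hierarchical structure can be sketched with
urllib.parse.urljoin from the Python standard library, which resolves an
abbreviated reference against the full name of the object containing it.
The "ftp://" punctuation and the host and file names are assumptions
chosen for illustration.

```python
from urllib.parse import urljoin

# Full name of the object that contains the references (hypothetical).
BASE = "ftp://xx.yy.edu/pub/doc/README"


def resolve(reference, base=BASE):
    """Expand a partial reference using the parts shared with the base."""
    return urljoin(base, reference)


# A sibling in the same directory omits everything up to the last "/".
sibling = resolve("INSTALL")
# A reference may climb the hierarchy before descending again.
cousin = resolve("../src/main.c")
# A complete name is left unchanged.
absolute = resolve("ftp://other.example/file")
```

Here sibling is "ftp://xx.yy.edu/pub/doc/INSTALL" and cousin is
"ftp://xx.yy.edu/pub/src/main.c": in both cases the protocol and host
are supplied by the containing object's name.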
Choices for a universal syntax
The requirements above leave little
room for choice save for the order
and punctuation of the elements of
an address. It is only reasonable
for the order of writing of the parts
to be consistently from left to right
(or right to left) with increasing
specificity. Punctuation schemes
fall into two categories (Huitema
1991): tagged schemes in which fields
are given names, and schemes which
use special characters and field
order. The latter tend to be more
compact.
protocol: aftp host: xxx.yyy.edu path: /pub/doc/README
PR=aftp; H=xx.yy.edu; PA=/pub/doc/README;
PR:aftp/xx.yy.edu/pub/doc/README
/aftp/xx.yy.edu/pub/doc/README
Fig 1. Some alternative tagged and untagged representations
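The difference between the two categories can be sketched by parsing one
tagged and one untagged form from Fig 1 into the same triple. The
function names are hypothetical, and the parsers handle only these two
layouts; they are not a general grammar.

```python
def parse_tagged(text):
    """Parse 'PR=aftp; H=xx.yy.edu; PA=/pub/doc/README;' by field name."""
    fields = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ";"
        tag, _, value = item.partition("=")
        fields[tag.strip()] = value.strip()
    return fields["PR"], fields["H"], fields["PA"]


def parse_positional(text):
    """Parse '/aftp/xx.yy.edu/pub/doc/README' by field order alone."""
    _, protocol, host, path = text.split("/", 3)
    return protocol, host, "/" + path


tagged = parse_tagged("PR=aftp; H=xx.yy.edu; PA=/pub/doc/README;")
positional = parse_positional("/aftp/xx.yy.edu/pub/doc/README")
```

Both calls produce the same protocol, host and path. The tagged form
tolerates reordered fields, while the positional form is shorter but
relies entirely on order and on the special character "/".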
The choice of special symbols for
punctuation tends to be a matter
of taste. It is easier to read addresses
whose symbols correspond to those
of one's favourite operating system.
A variety of symbols is needed so
that when a name is abbreviated it
is possible to tell which parts have
been omitted.
The recommendation below uses special
characters in order to achieve a
compact name, and uses where possible
punctuation symbols established in
the internet or unix community.
The choice of escape character for
introducing representations of non-allowed
characters also tends to be a matter
of taste. The ANSI standard for
the C language uses the back-slash
character "\". The use of this character
on Unix command lines, however, can
be a problem as it is interpreted
by many shell programs, and would
itself have to be escaped.
There is a conflict between the need
to be able to represent many characters
including spaces within a URL directly,
and the need to be able to use a
URL in environments which have limited
character sets or in which certain
characters are prone to corruption.
This conflict has been resolved by
use of a hexadecimal escaping method
which may be applied to any characters
forbidden in a given context. When
URLs are moved between contexts,
the set of characters escaped may
be enlarged or reduced unambiguously.
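Enlarging and reducing the escaped set can be sketched with quote and
unquote from Python's urllib.parse. The "%" introducer followed by two
hexadecimal digits is the convention implemented by that module and is
assumed here; the helper name reescape and the sample path are
hypothetical.

```python
from urllib.parse import quote, unquote


def reescape(url_part, allowed):
    """Re-encode url_part so that only `allowed` punctuation stays literal.

    Decoding first and encoding again means the result does not depend
    on which characters happened to be escaped in the previous context.
    """
    return quote(unquote(url_part), safe=allowed)


original = "/pub/my file (draft).txt"

# A strict context: only "/" may appear unescaped.
strict = reescape(original, "/")
# A more permissive context: parentheses are also allowed.
relaxed = reescape(strict, "/()")
```

Here strict is "/pub/my%20file%20%28draft%29.txt" and relaxed is
"/pub/my%20file%20(draft).txt". Both decode to the same original string,
which is the sense in which the change of escaped set is unambiguous.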
The use of multiple white space characters
is discouraged in URLs to be printed
or sent by electronic mail. This
is because of the frequent introduction
of extraneous white space when lines
are wrapped by systems such as mail,
or by the sheer necessity of narrow column
width, and because of the inter-conversion
of various forms of white space which
occurs during character code conversion
and the transfer of text between
applications.
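One consequence can be sketched as follows: if significant spaces in a
URL are always escaped, a reader may discard any literal white space
found in a printed or mailed URL as an artifact of wrapping. That
premise, and the function name strip_white_space, are assumptions for
illustration.

```python
def strip_white_space(text):
    """Remove every run of white space (spaces, tabs, line breaks).

    Safe only under the assumption that any space which is really part
    of the URL was escaped before the URL was printed or mailed.
    """
    return "".join(text.split())


# A URL broken across lines by a mail system (hypothetical example).
wrapped = "ftp://xx.yy.edu/pub/doc/\n    README"
recovered = strip_white_space(wrapped)
```

With the wrapped example above, recovered is
"ftp://xx.yy.edu/pub/doc/README".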