Status of this memo

This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

Internet Drafts are working documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress".

Distribution of this document is unlimited. Please send comments to the author at timbl@info.cern.ch or to the discussion list ietf-url@merit.edu.

The need for a universal syntax

Many protocols and systems for document search and retrieval are currently in use, and many more protocols or refinements of existing protocols are to be expected in a field whose expansion is explosive.

These systems are aiming to achieve global search and readership of documents across differing computing platforms, and despite a plethora of protocols and data formats. As protocols evolve, gateways can allow global access to remain possible. As data formats evolve, format conversion programs can preserve global access. There is one area, however, in which it is impractical to make conversions, and that is in the names and addresses used to identify objects. This is because names and addresses of objects are passed on in so many ways, from the backs of envelopes to hypertext objects, and may have a long life.

A common feature of almost all the data models of past and proposed systems is something which can be mapped onto a concept of "object" and some kind of name, address, or identifier for that object. One can therefore define a set of name spaces in which these objects can be said to exist.

Practical systems need to access and mix objects which are part of different existing and proposed systems. Therefore, the concept of the universal set of all objects, and hence the universal set of names and addresses, in all name spaces, becomes important. This allows names in different spaces to be treated in a common way, even though names in different spaces have differing characteristics, as do the objects to which they refer.

URIs

This document defines a way to encapsulate a name in any registered name space, and label it with the name space, producing a member of the universal set. Such an encoded and labelled member of this set is known as a Universal Resource Identifier, or URI.

The universal syntax allows access of objects available using existing protocols, and may be extended with technology.

URLs

For existing Internet access protocols, it is necessary in most cases to define the encoding of the access algorithm into something concise enough to be termed an address. URIs which refer to objects accessed with existing protocols are known as "Uniform Resource Locators" (URLs) and are described in a separate document.

URNs

There is currently a drive to define a space of more persistent names than any URLs. These "Uniform Resource Names" are the subject of an IETF working group's discussions. (See Sollins and Masinter, Functional Specifications for URNs, circulated informally.)

The URI syntax and URL forms have been in widespread use by World-Wide Web software since 1990.

Design criteria and choices

This section is not part of the specification: it is simply an explanation of the way in which the specification was derived.

Design criteria

The syntax was designed to be
Extensible
New naming schemes may be added later.
Complete
It is possible to encode any naming scheme.
Printable
It is possible to express any URI using 7-bit ASCII characters so that URIs may, if necessary, be passed using pen and ink.

Choices for a universal syntax

For the syntax itself there is little choice save for the order and punctuation of the elements, and the acceptable characters and escaping rules.

The extensibility requirement is met by allowing an arbitrary (but registered) string to be used as a prefix. A prefix is chosen as left to right parsing is more common than right to left. The choice of a colon as separator of the prefix from the rest of the URL was arbitrary.

The decoding of the rest of the string is defined as a function of the prefix. New prefixes are introduced for new schemes as necessary, in agreement with the registration authority. The registration of a new scheme clearly requires the definition of the decoding of the URI into a given name space, and a definition of the properties and, where applicable, resolution protocols, for the name space.

The completeness requirement is easily met by allowing particularly strange or plain binary names to be encoded in base 16 or 64 using the acceptable characters.

The printability requirement could have been met by requiring all schemes to encode characters not part of a basic set. This led to many discussions of what the basic set should be. A difficult case, for example, is when an ISO Latin 1 string appears in a URL: within an application with ISO Latin 1 capability, it can be handled intact. However, for transport in general, the non-ASCII characters need to be escaped.

The solution to this was to specify a safe set of characters, and a general escaping scheme which may be used for encoding "unsafe" characters. This "safe" set is suitable, for example, for use in electronic mail. This is the canonical form of a URI. Because the escaping mechanism is reversible, the escaping level may be reduced in particular contexts where the practically safe set is greater.

The choice of escape character for introducing representations of non-allowed characters also tends to be a matter of taste. An ANSI standard exists in the C language, using the back-slash character "\". The use of this character on unix command lines, however, can be a problem as it is interpreted by many shell programs, and would have itself to be escaped. It is also a character which is not available on certain keyboards. The equals sign is commonly used in the encoding of names having attribute=value pairs. The percent sign was eventually chosen as a suitable escape character.

There is a conflict between the need to be able to represent many characters including spaces within a URI directly, and the need to be able to use a URI in environments which have limited character sets or in which certain characters are prone to corruption. This conflict has been resolved by use of a hexadecimal escaping method which may be applied to any characters forbidden in a given context. When URLs are moved between contexts, the set of characters escaped may be enlarged or reduced unambiguously.

The use of white space characters is risky in URIs to be printed or sent by electronic mail, and the use of multiple white space characters is very risky. This is because of the frequent introduction of extraneous white space when lines are wrapped by systems such as mail, or sheer necessity of narrow column width, and because of the inter-conversion of various forms of white space which occurs during character code conversion and the transfer of text between applications. This is why the canonical form for URIs has all white spaces encoded.

Recommendations

This section describes the syntax for URIs as used in the World-Wide Web initiative. The generic syntax provides a framework for new schemes for names to be resolved using as yet undefined protocols.

URI syntax

A complete URL consists of a naming scheme specifier followed by a string whose format is a function of the naming scheme. For locators of information on the internet, a common syntax is used for the IP address part. A BNF description of the URL syntax is given in a later section. The components are as follows. Fragment identifiers and relative URIs are not involved in the basic URL definition.

Scheme

Within the URL of an object, the first element is the name of the scheme, separated from the rest of the object by a colon.

Path

The rest of the URL follows the colon in a format depending on the scheme. The path is interpreted in a manner dependent on the protocol being used. However, when it contains slashes, these must imply a hierarchical structure.

Reserved characters

The path in the URI has a significance defined by the particular scheme. Typically it is used to encode a name in a given name space, or an algorithm for accessing an object. In either case, the encoding may use those characters allowed by the BNF syntax, or hexadecimal encodings of other characters.

Some of the reserved characters have special uses as defined here.

The percent sign

The percent sign ("%", ASCII 25 hex) is used in the encoding scheme and is never allowed for anything else.

Hierarchical forms

The slash ("/", ASCII 2F hex) character is reserved for the delimiting of substrings whose relationship is hierarchical. This enables partial forms of the URI. Substrings consisting of single or double dots ("." or "..") are similarly reserved.
Note
The similarity to unix and msdos filename conventions should be taken as purely coincidental, and should not be taken to indicate that URIs should be interpreted as filenames.

Hash for Fragment Identifiers

The hash ("#", ASCII 23 hex) character is reserved as a delimiter to separate the URI of an object from a fragment identifier.

Query strings

The question mark ("?", ASCII 3F hex) is used to delimit the boundary between the URL of a queryable object, and a set of words used to express a query on that object. When this form is used, the combined URI stands for the object which results from the query being applied to the original object.

Within the query string, the plus sign is reserved as shorthand notation for a space. Therefore, real plus signs must be encoded. This method was used to make query URLs easier to pass in systems which did not allow spaces.
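The plus-sign convention can be illustrated with a minimal Python sketch. The function names and the example URL are hypothetical, and only the three characters that would be misread inside a query (plus, percent and space) are escaped here; a real implementation would escape every character outside the allowed set.

```python
import re

def _escape_word(word):
    # Escape only the characters that would be misread inside a query word.
    return "".join("%%%02X" % ord(ch) if ch in "+% " else ch for ch in word)

def _unescape(text):
    # Expand every %XX sequence into the character it encodes.
    return re.sub(r"%([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), text)

def build_query(url, words):
    """Append a query made of the given words, joined by plus signs."""
    return url + "?" + "+".join(_escape_word(w) for w in words)

def parse_query(uri):
    """Split a query URI into the URL of the queried object and its words."""
    url, sep, query = uri.partition("?")
    # Split on '+' BEFORE unescaping, so that an encoded plus stays literal.
    words = [_unescape(part) for part in query.split("+")] if sep else []
    return url, words
```

Splitting on the plus sign before expanding the escapes is what keeps a real plus sign (sent as %2B) distinct from the shorthand for a space.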

Unsafe characters

The URI specification specifies that in canonical form, certain characters such as spaces, control characters, and some characters whose ASCII code is used differently in different national character variant 7-bit sets, are not used unencoded. This is a recommendation for trouble-free interchange, and as indicated below, the safe set may under certain circumstances be extended or reduced.

Encoding reserved characters

When a system uses a local addressing scheme, it is useful to provide a mapping from local addresses into URLs so that references to objects within the addressing scheme may be referred to globally, and possibly accessed through gateway servers.

For a new naming scheme, any mapping scheme may be defined provided it is unambiguous, reversible, and provides valid URIs. It is recommended that where hierarchical aspects to the local naming scheme exist, they be mapped onto the hierarchical URL path syntax in order to allow the partial form to be used.

It is also recommended that the conventional scheme below be used in all cases except for any scheme which encodes binary data as opposed to text, in which case a more compact encoding such as pure hexadecimal or base 64 might be more appropriate. For example, the conventional URI encoding method is used for mapping WAIS, FTP, Prospero and Gopher addresses in the URL specification.

Conventional URI encoding scheme

Where the local naming scheme uses ASCII characters which are not allowed in the URL, these may be represented in the URL by a percent sign "%" followed by two hexadecimal digits (0-9, A-F) giving the ISO Latin 1 code for that character. Character codes other than those allowed by the syntax shall not be used unencoded in a URL.
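As an illustration only, the conventional encoding might be sketched in Python as follows. The function names are hypothetical, and the default safe set (letters and digits) is a deliberately conservative assumption; the authoritative set of allowed characters is the one given by the BNF. The sketch relies on the fact that the first 256 Unicode code points coincide with ISO Latin 1.

```python
import string

# Assumption: a conservative stand-in for the set of allowed characters.
ASSUMED_SAFE = frozenset(string.ascii_letters + string.digits)

def percent_encode(text, safe=ASSUMED_SAFE):
    """Replace every character outside `safe` by '%' and two hex digits."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0xFF:
            raise ValueError("character has no ISO Latin 1 code: %r" % ch)
        out.append(ch if ch in safe else "%%%02X" % code)
    return "".join(out)

def percent_decode(text):
    """Reverse percent_encode; reject a '%' not followed by two hex digits."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "%":
            digits = text[i + 1:i + 3]
            if len(digits) != 2 or any(d not in string.hexdigits for d in digits):
                raise ValueError("'%' not followed by two hexadecimal digits")
            out.append(chr(int(digits, 16)))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

The decoder in this sketch also accepts lower-case hexadecimal digits, which is a leniency rather than something the text above requires.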

Reduced or increased safe character sets

The same encoding method may be used for encoding characters whose use, although technically allowed in a URL, would be unwise due to problems of corruption by imperfect gateways or misrepresentation due to the use of variant character sets, or which would simply be awkward in a given environment. Because a % sign always indicates an encoded character, a URL may be made "safer" simply by encoding any characters considered unsafe, while leaving already encoded characters still encoded. Similarly, in cases where a larger set of characters is acceptable, % signs can be selectively and reversibly expanded.

Before two URIs can be compared, it is therefore necessary to bring them to the same encoding level.

However, the reserved characters mentioned above have a quite different significance when encoded, and so may NEVER be encoded and unencoded in this way.

The percent sign intended as such must always be encoded, as its presence otherwise always indicates an encoding. Sequences which start with a percent sign but are not followed by two hexadecimal characters are reserved for future extension.

Example 1
The URIs
		http://www.w3.org/albert/bertram/marie-claude

and
		http://www.w3.org/albert/bertram/marie%2Dclaude

are identical, as the %2D encodes a hyphen character.
Example 2
The URIs
		http://www.w3.org/albert/bertram/marie-claude

and
		http://www.w3.org/albert/bertram%2Fmarie-claude

are NOT identical, as in the second case the encoded slash does not have hierarchical significance.
Example 3
The URIs
		fxqn:/us/va/reston/cnri/ietf/24/asdf%*.fred

and
		news:12345667123%asdghfh@info.cern.ch

are illegal, as all % characters imply encodings, and there is no decoding defined for "%*" or "%as" in this recommendation.
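The three examples can be checked mechanically. The following Python sketch brings two URIs to a common encoding level before comparing them, leaving encoded reserved characters encoded. The function names are hypothetical, and the reserved set used is an assumption limited to the characters singled out in this section.

```python
import re

# Assumption: only the characters discussed above are treated as reserved.
ASSUMED_RESERVED = set("%/#?+")

def check_escapes(uri):
    """Raise ValueError unless every '%' is followed by two hex digits."""
    for match in re.finditer(r"%(.{0,2})", uri):
        if not re.fullmatch(r"[0-9A-Fa-f]{2}", match.group(1)):
            raise ValueError("undefined escape %r in %r" % (match.group(0), uri))

def to_common_level(uri):
    """Expand escapes of non-reserved characters; keep reserved ones encoded."""
    check_escapes(uri)

    def expand(match):
        char = chr(int(match.group(1), 16))
        if char in ASSUMED_RESERVED:
            return match.group(0).upper()  # stays encoded, hex case unified
        return char

    return re.sub(r"%([0-9A-Fa-f]{2})", expand, uri)

def same_uri(first, second):
    """Compare two URIs after bringing both to the same encoding level."""
    return to_common_level(first) == to_common_level(second)
```

The expanded string is meant for comparison only: it may contain characters, such as spaces, which the canonical form would keep encoded.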

Partial (relative) form

Within an object whose URL is well defined, the URI of another object may be given in abbreviated form, where parts of the two URIs are the same. This allows objects within a group to refer to each other without requiring the space for a complete reference, and it incidentally allows the group of objects to be moved without changing any references. This is not discussed in detail here; it is only mentioned so that the characters required by the technique be reserved for that purpose. It must be emphasised that when a reference is passed in anything other than a well controlled context, the full form must always be used.

In the World-Wide Web applications, the context is the URI of the document or object containing a reference. In this case partial URIs can be generated by virtual objects and stored in real objects, without the need for dramatic change if the higher-order parts of a hierarchical naming system are modified. Apart from terseness, this gives greater robustness to practical systems, by enabling information hiding between system components.

The partial form relies on a property of the URI syntax that certain characters ("/") and certain path elements ("..", ".") have a significance reserved for representing a hierarchical space, and must be recognised as such by both clients and servers.

A partial form can be distinguished from a full form in that a full form must have a colon and that colon must occur before any slash characters. Systems not requiring partial forms should not use any unencoded slashes in their naming schemes.
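The test described above is small enough to sketch directly. The function name is hypothetical.

```python
def is_full_form(uri):
    """True if the URI has a colon and it occurs before any slash."""
    colon = uri.find(":")
    slash = uri.find("/")
    return colon != -1 and (slash == -1 or colon < slash)
```

For instance, "magic://a/b" and "g:a" are full forms under this test, whereas "../g", "/g" and "a/b:c" are partial forms.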

The rules for the use of a partial name relative to the URI of the context are:

  • If the scheme parts are different, the whole absolute locator must be given. Otherwise, the scheme is omitted, and:
  • If the partial URI starts with a non-zero number of consecutive slashes, then everything from the context URI up to (but not including) the first occurrence of exactly the same number of consecutive slashes is taken to be the same and so prepended to the partial URL to form the full URL. Otherwise:
  • The last part of the path of the context URI (anything following the rightmost slash) is removed, and the given partial URI appended in its place, and then:
  • Within the result, all occurrences of "xxx/../" or "/." are recursively removed, where xxx, ".." and "." are complete path elements.
Note: If a path of the context locator ends in slash, partial URIs will be treated differently to their treatment with respect to the same path without a slash. The trailing slash indicates a void segment of the path.
Examples
In the context of URI
			magic://a/b/c//d/e/f

the partial URIs would expand as follows:
g
magic://a/b/c//d/e/g
/g
magic://a/g
//g
magic://g
../g
magic://a/b/c//d/g
g:a
g:a
In the context of the URI
			magic://a/b/c//d/e/

the results would be exactly the same.
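The rules and examples above can be sketched in Python as follows. This is one reading of the rules, not a definitive implementation: the function names are hypothetical, the removal of "xxx/../" and "/." is applied only in the last case, a context without any slash keeps its scheme prefix, and no attempt is made to protect a host name part from being removed by "..".

```python
import re

def is_full_form(uri):
    """A full form has a colon occurring before any slash."""
    colon, slash = uri.find(":"), uri.find("/")
    return colon != -1 and (slash == -1 or colon < slash)

def collapse_dots(uri):
    """Repeatedly remove '/.' and 'xxx/../' where these are whole elements."""
    previous = None
    while uri != previous:
        previous = uri
        uri = re.sub(r"/\.(?=/|$)", "", uri)
        # xxx must be a complete, non-empty element other than '..' itself.
        uri = re.sub(r"(?<=/)(?!\.\./)[^/]+/\.\./", "", uri)
    return uri

def resolve(context, partial):
    """Expand a partial URI relative to the full URI of its context."""
    if is_full_form(partial):
        return partial
    leading = len(partial) - len(partial.lstrip("/"))
    if leading:
        # First run of exactly `leading` consecutive slashes in the context.
        match = re.search(r"(?<!/)/{%d}(?!/)" % leading, context)
        if match is None:
            raise ValueError("context has no run of %d slashes" % leading)
        return context[:match.start()] + partial
    # Drop whatever follows the rightmost slash (or the scheme's colon).
    cut = max(context.rfind("/"), context.find(":"))
    return collapse_dots(context[:cut + 1] + partial)
```

With the context magic://a/b/c//d/e/f, calling resolve on "g", "/g", "//g", "../g" and "g:a" reproduces the expansions listed in the examples.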

Fragment-id

This represents a part of, fragment of, or a sub-function within, an object. Its syntax and semantics are defined by the application responsible for the object, or the specification of the content type of the object. The only definition here is of the allowed characters by which it may be represented in a URL.

Specific syntaxes for representing fragments in text documents by line and character range, or in graphics by coordinates, or in structured documents using ladders, are suitable for standardisation but not currently defined.

The fragment-id follows the URL of the whole object from which it is separated by a hash sign (#). If the fragment-id is void, the hash sign may be omitted: A void fragment-id with or without the hash sign means that the URL refers to the whole object.
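A minimal Python sketch of this separation follows; the function name and the example URLs are hypothetical. Because an unencoded hash sign is reserved as the delimiter, the first one found marks the start of the fragment-id.

```python
def split_fragment(reference):
    """Return (url, fragment_id); the fragment-id is '' when void."""
    url, _, fragment = reference.partition("#")
    return url, fragment
```

Both "http://a/b" and "http://a/b#" yield ("http://a/b", ""), reflecting that a void fragment-id refers to the whole object with or without the hash sign.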

While this hook is allowed for identification of fragments, the question of addressing of parts of objects, or of the grouping of objects and relationship between contained and containing objects, is not addressed by this document.

Fragment identifiers do NOT address the question of objects which are different versions of a "living" object, nor of expressing the relationships between different versions and the living object.

Specific Schemes

The mapping for some existing standard and experimental protocols is outlined in the BNF syntax definition. Notes on particular protocols follow. The schemes covered are
http
Hypertext Transfer Protocol
ftp
File Transfer protocol
gopher
The Gopher protocol
mailto
Electronic mail address
mid
Message identifiers for electronic mail
cid
Content identifiers for MIME body part
news
Usenet news
nntp
Usenet news for local NNTP access only
prospero
Access using the prospero protocols
telnet, rlogin and tn3270
Reference to interactive sessions
wais
Wide Area Information Servers
The schemes for x.500, network management database and whois++ have not been specified and may be the subject of further study.

New schemes may be registered at a later time.

FTP

The ftp: prefix indicates a file which is to be picked up from the file system of the given host. The FTP protocol is used, as defined in RFC 959 or any successor. The port number, if given, gives the port of the FTP server if not the FTP default. (A client may in practice use local file access to retrieve objects which are available through more efficient means such as local file open or NFS mounting, where this is available and equivalent.)

The syntax allows for the inclusion of a user name and even a password for those systems which do not use the anonymous FTP convention. The default, however, if no user or password is supplied, will be to use that convention, viz. that the user name is "anonymous" and the password the user's internet-style mail address.

The FTP protocol allows for a sequence of CWD commands (change working directory) prior to a RETR command which actually accesses a file. The arguments of any CWD commands are successive segment parts of the URL, and the filename argument to the RETR command is the final segment of the URL path.
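The mapping from a URL path to the command sequence might be sketched as below. The function names are hypothetical, the path is assumed to have already been separated from the host part, and the expansion of escapes within each segment after splitting is an assumption consistent with an encoded slash having no hierarchical significance.

```python
import re

def _unescape(segment):
    # Expand every %XX sequence into the character it encodes.
    return re.sub(r"%([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), segment)

def ftp_commands(path):
    """Turn a URL path into CWD commands followed by one RETR command."""
    body = path[1:] if path.startswith("/") else path
    # Split on unencoded slashes first, then expand escapes per segment.
    segments = [_unescape(s) for s in body.split("/")]
    *directories, filename = segments
    return ["CWD " + d for d in directories] + ["RETR " + filename]
```

For example, the path /pub/docs/readme.txt gives the sequence CWD pub, CWD docs, RETR readme.txt.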

Note
In the case in which the file system of the server is known or guessed, the path may possibly be converted into a filename. This may allow the file to be retrieved in one command. In the case of unix, the filename will look the same as the path. This must NOT be taken to indicate that the URL is a unix filename. In practice, as many FTP servers in fact have or emulate unix file systems, it may in fact be time-efficient to attempt first a direct retrieval guessing unix syntax, and, if that fails, to attempt the official sequence of directory changes followed by a RETR command.

There is no common hierarchical model to the FTP protocol, so if a directory change command has been given, it is impossible in general to deduce what sequence should be given to navigate to another directory for a second retrieval, if the paths are different. The only reliable algorithm is to disconnect and reestablish the control connection. However, if no directory changes have been made, but direct retrieval has been done, then the control

(This note previously read: "The adoption of a unix-style syntax involves the conversion into non-unix local forms by either the client or server. Some non-unix servers do this, but clients wishing to access sites which do not have unix-style naming will need certain algorithms to enable other file systems to be identified and treated. Client software may also have to be flexible in terms of the sequence of FTP commands used with different varieties of server. In view of a tendency for file systems to look increasingly similar, it was felt that the URL convention should not be weighed down by extra mechanisms for identifying these cases." )

Note
The data format of a file can only, in the general FTP case, be deduced from the name, normally the suffix of the name. This is not standardized. An alternative is for it to be transferred in information outside the URL. The transfer mode (binary or text) must in turn be deduced from the data format. It is recommended that conventions for suffixes of public archives be established, but it is outside the scope of this paper.

HTTP

The HTTP protocol specifies that the path is handled transparently by those who handle URLs, except for the servers which de-reference them. The path is passed by the client to the server with any request, but is not otherwise understood by the client. The fragment-id part is not sent with the request. The search part, if present, is sent. Spaces and control characters in URLs must be escaped for transmission in HTTP.

Gopher

The first character of the URL path part (after the initial single slash) is a single-character "type" field which is that used by the Gopher protocol. The rest of the path is the "selector string", with disallowed characters encoded. Note that some selector strings begin with a copy of the gopher type character, in which case that character will occur twice consecutively in the URL. If the type character and selector are omitted, the type defaults to "1". Gopher links which refer to non-Gopher protocols are represented directly as URLs of the underlying access method and are not represented as Gopher URLs.
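As a sketch only, the path of a Gopher URL might be taken apart as follows. The function names and the example selectors are hypothetical, and the path is assumed to have already been separated from the host part.

```python
import re

def _unescape(text):
    # Expand every %XX sequence into the character it encodes.
    return re.sub(r"%([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), text)

def parse_gopher_path(path):
    """Return (type character, selector string) for a Gopher URL path."""
    body = path[1:] if path.startswith("/") else path
    if not body:
        return "1", ""  # type and selector omitted: the type defaults to "1"
    return body[0], _unescape(body[1:])
```

A path such as /11/docs yields the type "1" and the selector "1/docs", showing the case in which the selector itself begins with a copy of the type character.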

Mailto

This allows a URL to specify an RFC822 addr-spec mail address. Note that use of %, for example as used in forming a gatewayed mail address, requires conversion to %25 in a URL.

The semantics may be considered to be that the object referred to by the mailto: URL is the set of messages sent to or from that address. There is no algorithm to retrieve this set, but the SMTP protocol allows messages to be added to it, and any given user may be aware of a subset of its members.

Telnet, rlogin, tn3270

The use of URLs to represent interactive sessions is a convenient extension to their uses for objects. This allows access to information systems which only provide an interactive service, and no information server. As information within the service cannot be addressed individually or, in general, automatically retrieved, this is a less desirable, though currently common, solution.

Provisional and Speculative schemes

Message-Id

Within the context of information transferred using mail protocols, there is a need to be able to make cross-references between different items of information, even though, by the nature of mail, those items are only available to a restricted set of people.

Two schemes are defined. The first, "mid:", refers to the RFC822 Message-Id of a mail message. This identifier is already used in RFC822 in, for example, the References and In-Reply-To fields. The rest of the URL after the "mid:" is the RFC822 msg-id with the constant <> wrapper removed, leaving an identifier whose format in fact happens to be the same as the addr-spec format for mailboxes (though the semantics are different).
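The construction of a "mid:" URL from a Message-Id can be sketched as below. The function name and the example identifier are hypothetical, and escaping only the percent sign is a simplifying assumption; other characters not allowed in a URL would need the same treatment.

```python
def mid_url(message_id):
    """Build a mid: URL from an RFC822 msg-id such as '<local@domain>'."""
    inner = message_id.strip()
    if inner.startswith("<") and inner.endswith(">"):
        inner = inner[1:-1]  # remove the constant <> wrapper
    # A literal percent sign must itself be encoded in a URL.
    return "mid:" + inner.replace("%", "%25")
```

For example, the hypothetical identifier <9401051234.AA01@info.cern.ch> would give mid:9401051234.AA01@info.cern.ch.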

The use of a "mid" URL implies access to a body of mail already received. If a message has been distributed using NNTP or other usenet protocols over the news system, then the "news:" form should be used.

Content-Id

The second scheme, "cid:", is similar to "mid:", but makes reference to a body part of a MIME message by the value of its content-id field. This allows, for example, a master document being the first part of a multipart/related MIME message to refer to component parts which are transferred in the same message.
Note
Beware, however, that content identifiers are only required to be unique within the context of a given MIME message, and so the cid: URL is only meaningful within the context of the same MIME message. For a reference outside the message, it would need to be appended to the message-id of the whole message. A syntax for this has not been defined.

x500

The mapping of x500 names onto URLs is not defined here. A decision is required as to whether "distinguished names" or "user friendly names" (ufn), or both, should be allowed. If any punctuation conversions are needed from the adopted x500 representation (such as the use of slashes between parts of a ufn) they must be defined. This is a subject for study.

WHOIS

This prefix describes the access using the "whois++" scheme in the process of definition. The host name part is the same as for other IP based schemes. The path part can be either a whois handle for a whois object, or it can be a valid whois query string. This is a subject for further study.

Network Management Database

This is a subject for study.

Registration of naming schemes

A new naming scheme may be introduced by defining a mapping onto a conforming URL syntax, using a new scheme identifier. Experimental scheme identifiers may be used by mutual agreement between parties, and must start with the characters "x-". The scheme name "urn:" is reserved for the work in progress on a scheme for more persistent names. Therefore URNs (Names) and URLs (Locators) will be distinguishable. An object which is either a URL or a URN is known as a URI (Identifier).

It is proposed that the Internet Assigned Numbers Authority (IANA) perform the function of registration of new schemes. Any submission of a new URI scheme must include a definition of an algorithm for the retrieval of any object within that scheme. The algorithm must take the URI and produce either a set of URL(s) which will lead to the desired object, or the object itself, in a well-defined or determinable format.

It is recommended that those proposing a new scheme demonstrate its utility and operability by the provision of a gateway which will provide images of objects in the new scheme for clients using an existing protocol. If the new scheme is not a locator scheme, then the properties of names in the new space should be clearly defined. It is likewise recommended that, where a protocol allows for retrieval by URI, the client software have provision for being configured to use specific gateway locators for indirect access through new naming schemes.

News

The news locators refer to either news group names or article message identifiers which must conform to the rules of RFC 850. A message identifier may be distinguished from a news group name by the presence of the commercial at "@" character. These rules imply that within an article, a reference to a news group or to another article will be a valid URL (in the partial form).
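The distinction between the two kinds of news URL can be sketched in a few lines of Python. The function name and the example values are hypothetical.

```python
def classify_news(url):
    """Return 'article' for a message identifier, 'group' for a news group."""
    scheme, sep, rest = url.partition(":")
    if not sep or scheme != "news":
        raise ValueError("not a news URL: %r" % url)
    # Only a message identifier contains the commercial at character.
    return "article" if "@" in rest else "group"
```

Thus news:comp.infosystems.www is classified as a group, and news:12345@info.cern.ch as an article.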

A news URL may be dereferenced using NNTP or using any other protocol for the conveyance of usenet news articles, or by reference to a body of news articles already received.

Note 1:
Among URLs the news: URLs are anomalous in that they are location-independent. They are unsuitable as URN candidates because the NNTP architecture relies on the expiry of articles and therefore a small number of articles being available at any time. When a news: URL is quoted, the assumption is that the reader will fetch the article or group from his or her local news host. News host names are NOT part of news URLs.
Note 2:
An outstanding problem is that the message identifier is insufficient to allow the retrieval of an expired article, as no algorithm exists for deriving an archive site and file name. The addition of the date and news group set to the article's URL would allow this if a directory existed of archive sites by news group. This is a suggested subject of study in conjunction with the NNTP WG. A possible further extension would be to allow the naming of subject threads as addressable objects.

NNTP

This is an alternative form of reference for news articles, specifically to be used with NNTP servers, and particularly those incomplete server implementations which do not allow retrieval by message identifier. In all other cases the "news" scheme should be used.

The news server name, newsgroup name, and index number of an article within the newsgroup on that particular server are given.

Note
This form of URL is not of global accessibility, as typically NNTP servers only allow access from local clients. This form of URL should not be quoted outside this local area. It should not be used within news articles for wider circulation than the one server. This is a local identifier for a resource which is often available globally, and so is not recommended except in the case in which incomplete NNTP implementations on the local server force its adoption.

Prospero

The Prospero (Neuman, 1991) directory service is used to resolve the URL yielding an access method for the object (which can then itself be represented as a URL if translated). The host part contains a host name or internet address. The port part is optional.

The path part contains a host specific object name and an optional version number. If present, the version number is separated from the host specific object name by the characters "%00" (percent zero zero), this being an escaped string terminator (null). External Prospero links are represented as URLs of the underlying access method and are not represented as Prospero URLs.
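The separation of the version number from the object name might be sketched as follows. The function name and the example path are hypothetical, and the version is returned as text, since its format is not defined here.

```python
def split_prospero_path(path):
    """Return (object name, version); version is None when absent."""
    name, sep, version = path.partition("%00")
    return name, (version if sep else None)
```

For instance, the hypothetical path /pub/papers/uri%003 separates into the object name /pub/papers/uri and the version 3.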

WAIS

The current public domain WAIS implementation requires that a client know the "type" of an object prior to retrieval. This value is returned along with the internal object identifier in the search response. It has been encoded into the path part of the URL in order to make the URL sufficient for the retrieval of the object. Within the WAIS world, names of course do not need to be prefixed by "wais:" (by the partial form rules).

Schemes for Further Study

x500

The mapping of x500 names onto URLs is not defined here. A decision is required as to whether "distinguished names" or "user friendly names" (ufn), or both, should be allowed. If any punctuation conversions are needed from the adopted x500 representation (such as the use of slashes between parts of a ufn) they must be defined. This is a subject for study.

WHOIS

This prefix describes access using the "whois++" scheme, which is in the process of definition. The host name part is the same as for other IP based schemes. The path part can be either a whois handle for a whois object, or a valid whois query string. This is a subject for further study.

Network Management Database

This is a subject for study.

Registration of naming schemes

A new naming scheme may be introduced by defining a mapping onto a conforming URL syntax, using a new scheme identifier. Experimental scheme identifiers may be used by mutual agreement between parties, and must start with the characters "x-". The scheme name "urn:" is reserved for the work in progress on a scheme for more persistent names. Therefore URNs (Names) and URLs (Locators) will be distinguishable. An object which is either a URL or a URN is known as a URI (Identifier).
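
As an illustration of these prefix conventions, the following Python sketch sorts a URI by its scheme identifier. The function name classify_scheme and its three return labels are hypothetical. The comparison is made exactly as the prefixes are written above, because case handling is not addressed here, and the scheme is taken to end at the first colon, which is an assumption of the sketch.

```python
def classify_scheme(uri):
    """Return "urn", "experimental" or "other" for the scheme of a URI.

    Hypothetical helper: the labels are illustrative, not defined terms.
    """
    # Assumption: the scheme identifier ends at the first colon.
    scheme, colon, _rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError("no scheme identifier in %r" % uri)
    if scheme == "urn":
        return "urn"           # reserved for persistent names
    if scheme.startswith("x-"):
        return "experimental"  # used by mutual agreement only
    return "other"             # any other scheme identifier


if __name__ == "__main__":
    for sample in ("urn:something", "x-demo:thing", "news:1234@host"):
        print(sample, "->", classify_scheme(sample))
```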

It is proposed that the Internet Assigned Numbers Authority (IANA) perform the function of registration of new schemes. Any submission of a new URI scheme must include a definition of an algorithm for the retrieval of any object within that scheme. The algorithm must take the URI and produce either a set of URL(s) which will lead to the desired object, or the object itself, in a well-defined or determinable format.
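
The shape of such a retrieval algorithm can be sketched in Python as follows. This is only one possible reading, under stated assumptions: the registry RESOLVERS, the functions register_scheme and resolve, the scheme "x-demo" and the gateway address gateway.example are all invented for illustration. A resolver here returns a pair whose first element says whether the second is a list of URLs or the object itself.

```python
# Hypothetical in-memory registry: scheme identifier -> retrieval algorithm.
RESOLVERS = {}


def register_scheme(scheme, resolver):
    """Record the retrieval algorithm submitted with a new scheme."""
    if not callable(resolver):
        raise TypeError("resolver must be callable")
    RESOLVERS[scheme] = resolver


def resolve(uri):
    """Apply the registered algorithm to a URI.

    Returns ("urls", list_of_urls) or ("object", data).
    """
    scheme, colon, _rest = uri.partition(":")
    if not colon:
        raise ValueError("no scheme identifier in %r" % uri)
    if scheme not in RESOLVERS:
        raise LookupError("no algorithm registered for %r" % scheme)
    kind, value = RESOLVERS[scheme](uri)
    if kind not in ("urls", "object"):
        raise ValueError("resolver returned unknown kind %r" % kind)
    return kind, value


def demo_resolver(uri):
    """Made-up algorithm: map an x-demo name onto a made-up gateway URL."""
    name = uri.partition(":")[2]
    return "urls", ["http://gateway.example/x-demo/" + name]


register_scheme("x-demo", demo_resolver)

if __name__ == "__main__":
    print(resolve("x-demo:report42"))
```

Returning URLs rather than the object lets an existing client complete the retrieval with a protocol it already implements.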

It is recommended that those proposing a new scheme demonstrate its utility and operability by the provision of a gateway which will provide images of objects in the new scheme for clients using an existing protocol. If the new scheme is not a locator scheme, then the properties of names in the new space should be clearly defined. It is likewise recommended that, where a protocol allows for retrieval by URL, the client software have provision for being configured to use specific gateway locators for indirect access through new naming schemes.

BNF syntax

This is a BNF-like description of the URI syntax.

A vertical line "|" indicates alternatives, and [brackets] indicate optional parts. Spaces are represented by the word "space", and the vertical line character by "vline". Single letters stand for single letters. All words of more than one letter below are entities described somewhere in this description.

The "generic" production gives a higher level parsing of the same URLs as the other productions. The "national" and "punctuation" characters do not appear in any productions and therefore may not appear in URLs.

fragmentaddress
uri [ # fragmentid ]
uri
scheme : path [ ? search ]
scheme
ialpha
path
void | xpalphas [ / path ]
search
xalphas [ + search ]
fragmentid
xalphas
xalpha
alpha | digit | safe | extra | escape
xalphas
xalpha [ xalphas ]
xpalpha
xalpha | +
xpalphas
xpalpha [ xpalphas ]
ialpha
alpha [ xalphas ]
alpha
a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
digit
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
safe
$ | - | _ | @ | . | &
extra
! | * | " | ' | ( | ) | : | ; | , | space
escape
% hex hex
hex
digit | a | b | c | d | e | f | A | B | C | D | E | F
national
{ | } | vline | [ | ] | \ | ^ | ~
punctuation
< | >
void
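
The character classes and the top-level "fragmentaddress" and "uri" productions above can be exercised with a short Python sketch. It is an illustration under stated assumptions, not a complete parser: the names is_xalphas, unescape and split_fragmentaddress are hypothetical, and because ":" also appears among the "extra" characters, the sketch simply assumes that the scheme ends at the first colon.

```python
import re

SAFE = "$-_@.&"
EXTRA = "!*\"'():;, "   # the final character is "space"
# xalpha: alpha | digit | safe | extra | escape
XALPHA = "(?:[a-zA-Z0-9" + re.escape(SAFE + EXTRA) + "]|%[0-9a-fA-F]{2})"


def is_xalphas(text):
    """True if text is one or more xalpha characters or escapes."""
    return re.fullmatch("(?:" + XALPHA + ")+", text) is not None


def unescape(text):
    """Replace each "% hex hex" escape by the character it encodes."""
    return re.sub(r"%([0-9a-fA-F]{2})",
                  lambda match: chr(int(match.group(1), 16)), text)


def split_fragmentaddress(address):
    """Split into scheme, path, search and fragmentid.

    search and fragmentid are None when "?" or "#" is absent.
    Assumption: the scheme ends at the first ":".
    """
    uri, hash_sign, fragmentid = address.partition("#")
    head, question, search = uri.partition("?")
    scheme, colon, path = head.partition(":")
    if not colon:
        raise ValueError("missing scheme in %r" % address)
    return {
        "scheme": scheme,
        "path": path,
        "search": search if question else None,
        "fragmentid": fragmentid if hash_sign else None,
    }


if __name__ == "__main__":
    parts = split_fragmentaddress("x-demo:a/b?cat+dog#sec2")
    print(parts)
    print(is_xalphas("a%2Fb"), is_xalphas("a{b"), unescape("a%2Fb"))
```

Splitting is done on the escaped form, so that an escaped "/", "?" or "#" inside a component is not mistaken for a delimiter; unescape is applied to individual components afterwards.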

Author's address

               Tim Berners-Lee
    Address:   World-Wide Web project
               CERN,
               1211 Geneva 23,
               Switzerland

    Telephone: +41 (22)767 3755
    Fax:       +41 (22)767 7155
    Email:     timbl@info.cern.ch