A Universal Resource Identifier (URI) is a member of this universal set of names in registered name spaces and addresses referring to registered protocols or name spaces. A Uniform Resource Locator (URL), defined elsewhere, is a form of URI which expresses an address which maps onto an access algorithm using network protocols. Existing URI schemes which correspond to the (still mutating) concept of IETF URLs are listed here. The Uniform Resource Name (URN) debate attempts to define a name space (and presumably resolution protocols) for persistent object names. This area is not addressed by this document, which is written in order to document existing practice and provide a reference point for URL and URN discussions.
This document is therefore to be issued under the "informational RFC" disclaimer.
The world-wide web protocols are discussed on the mailing list firstname.lastname@example.org and the newsgroup comp.infosystems.www is preferable for beginner's questions. The mailing list email@example.com has discussion related particularly to the URI issue. The author may be contacted as firstname.lastname@example.org.
This document is available in hypertext form at http://www.w3.org/hypertext/WWW/Addressing/URL/URI_Overview.html
Internet Drafts are working documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress".
Distribution of this document is unlimited.
Many protocols and systems for document search and retrieval are currently in use, and many more protocols or refinements of existing protocols are to be expected in a field whose expansion is explosive.
These systems are aiming to achieve global search and readership of documents across differing computing platforms, and despite a plethora of protocols and data formats. As protocols evolve, gateways can allow global access to remain possible. As data formats evolve, format conversion programs can preserve global access. There is one area, however, in which it is impractical to make conversions, and that is in the names and addresses used to identify objects. This is because names and addresses of objects are passed on in so many ways, from the backs of envelopes to hypertext objects, and may have a long life.
A common feature of almost all the data models of past and proposed systems is something which can be mapped onto a concept of "object" and some kind of name, address, or identifier for that object. One can therefore define a set of name spaces in which these objects can be said to exist.
Practical systems need to access and mix objects which are part of different existing and proposed systems. Therefore, the concept of the universal set of all objects, and hence the universal set of names and addresses, in all name spaces, becomes important. This allows names in different spaces to be treated in a common way, even though names in different spaces have differing characteristics, as do the objects to which they refer.
The universal syntax allows access of objects available using existing protocols, and may be extended with technology.
The specification of the URI syntax does not imply anything about the properties of names and addresses in the various name spaces which are mapped onto the set of URI strings. The properties follow from the specifications of the protocols and the associated usage conventions for each scheme.
The URI syntax and URL forms have been in widespread use by World-Wide Web software since 1990.
The extensibility requirement is met by allowing an arbitrary (but registered) string to be used as a prefix. A prefix is chosen as left to right parsing is more common than right to left. The choice of a colon as separator of the prefix from the rest of the URI was arbitrary.
The decoding of the rest of the string is defined as a function of the prefix. New prefixes are introduced for new schemes as necessary, in agreement with the registration authority. The registration of a new scheme clearly requires the definition of the decoding of the URI into a given name space, and a definition of the properties and, where applicable, resolution protocols, for the name space.
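As an illustration only, the prefix rule can be sketched in Python: the string is split at its first colon, and the remainder is handed to a decoder chosen by the prefix. The DECODERS table and the two decoders in it are hypothetical stand-ins for a real registry, not part of this specification.

```python
def split_prefix(uri):
    """Split a URI at its first colon into (prefix, rest)."""
    prefix, sep, rest = uri.partition(":")
    if not sep:
        raise ValueError("no prefix found in %r" % uri)
    return prefix, rest

# Hypothetical registry: each registered prefix maps to the function
# that knows how to decode the rest of the string for that scheme.
DECODERS = {
    "http": lambda rest: ("hierarchical", rest),
    "news": lambda rest: ("opaque", rest),
}

def decode(uri):
    """Decode the rest of the string as a function of the prefix."""
    prefix, rest = split_prefix(uri)
    if prefix not in DECODERS:
        raise ValueError("unregistered prefix %r" % prefix)
    return DECODERS[prefix](rest)

print(split_prefix("http://www.w3.org/hypertext/WWW/"))
print(decode("news:comp.infosystems.www"))
```

Because only the first colon is significant, later colons (for example before a port number) remain part of the scheme-specific string.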
The completeness requirement is easily met by allowing particularly strange or plain binary names to be encoded in base 16 or 64 using the acceptable characters.
The printability requirement could have been met by requiring all schemes to encode characters not part of a basic set. This led to many discussions of what the basic set should be. A difficult case, for example, is when an ISO Latin-1 string appears in a URL: within an application with ISO Latin-1 capability, it can be handled intact. However, for transport in general, the non-ASCII characters need to be escaped.
The solution to this was to specify a safe set of characters, and a general escaping scheme which may be used for encoding "unsafe" characters. This "safe" set is suitable, for example, for use in electronic mail. This is the canonical form of a URI.
The choice of escape character for introducing representations of non-allowed characters also tends to be a matter of taste. An ANSI standard exists in the C language, using the back-slash character "\". The use of this character on unix command lines, however, can be a problem as it is interpreted by many shell programs, and would have itself to be escaped. It is also a character which is not available on certain keyboards. The equals sign is commonly used in the encoding of names having attribute=value pairs. The percent sign was eventually chosen as a suitable escape character.
There is a conflict between the need to be able to represent many characters including spaces within a URI directly, and the need to be able to use a URI in environments which have limited character sets or in which certain characters are prone to corruption. This conflict has been resolved by use of a hexadecimal escaping method which may be applied to any characters forbidden in a given context. When URLs are moved between contexts, the set of characters escaped may be enlarged or reduced unambiguously.
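A minimal sketch of the escaping method, using Python's standard urllib.parse module. The helper name escape, its keep parameter, and the sample name are assumptions made for illustration. The set of characters that Python leaves unescaped by default is Python's own choice and is not necessarily the safe set of this document.

```python
from urllib.parse import quote, unquote

def escape(text, keep=""):
    """Percent-escape every character outside Python's always-safe set
    (letters, digits, and "_.-~") unless it is listed in `keep`."""
    return quote(text, safe=keep)

name = "annual report!.txt"
strict = escape(name)             # space and "!" are both escaped
relaxed = escape(name, keep="!")  # only the space is escaped

print(strict)   # annual%20report%21.txt
print(relaxed)  # annual%20report!.txt
# Enlarging or reducing the escaped set loses nothing:
print(unquote(strict) == unquote(relaxed) == name)  # True
```

The two encoded forms differ only in how many characters are escaped, which is why the set can be changed when a URL moves between contexts.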
The use of white space characters is risky in URIs to be printed or sent by electronic mail, and the use of multiple white space characters is very risky. This is because of the frequent introduction of extraneous white space when lines are wrapped by systems such as mail, or sheer necessity of narrow column width, and because of the inter-conversion of various forms of white space which occurs during character code conversion and the transfer of text between applications. This is why the canonical form for URIs has all white spaces encoded.
Some of the reserved characters have special uses as defined here.
The significance of the slash between two segments is that the segment of the path to the left is more significant than the segment of the path to the right. ("Significance" in this case refers solely to closeness to the root of the hierarchical structure and makes no value judgement!)
Within the query string, the plus sign is reserved as shorthand notation for a space. Therefore, real plus signs must be encoded. This method was used to make query URIs easier to pass in systems which did not allow spaces.
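The plus convention can be sketched with the quote_plus and unquote_plus functions of Python's urllib.parse, which follow the same shorthand. The sample query text is an assumption chosen to contain both a space and real plus signs.

```python
from urllib.parse import quote_plus, unquote_plus

text = "C++ tutorial"
encoded = quote_plus(text)

print(encoded)                # C%2B%2B+tutorial
print(unquote_plus(encoded))  # C++ tutorial
```

The real plus signs travel as %2B, so the single unencoded plus can only mean a space.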
The query string represents some operation applied to the object, but this specification gives no common syntax or semantics for it. In practice the syntax and semantics may depend on the scheme, and may even depend on the base URI.
For a new naming scheme, any mapping scheme may be defined provided it is unambiguous, reversible, and provides valid URIs. It is recommended that where hierarchical aspects to the local naming scheme exist, they be mapped onto the hierarchical URL path syntax in order to allow the partial form to be used.
It is also recommended that the conventional scheme below be used in all cases except for any scheme which encodes binary data as opposed to text, in which case a more compact encoding such as pure hexadecimal or base 64 might be more appropriate. For example, the conventional URI encoding method is used for mapping WAIS, FTP, Prospero and Gopher addresses in the URI specification.
Before two URIs can be compared, it is therefore necessary to bring them to the same encoding level.
However, the reserved characters mentioned above have a quite different significance when encoded, and so may NEVER be encoded and unencoded in this way.
The percent sign intended as such must always be encoded, as its presence otherwise always indicates an encoding. Sequences which start with a percent sign but are not followed by two hexadecimal characters are reserved for future extension. (see example 3 )
The URIs http://www.w3.org/albert/bertram/marie-claude and http://www.w3.org/albert/bertram/marie%2Dclaude are identical, as the %2D encodes a hyphen character.
The URIs http://www.w3.org/albert/bertram/marie-claude and http://www.w3.org/albert/bertram%2Fmarie-claude are NOT identical, as in the second case the encoded slash does not have hierarchical significance.
news:email@example.com illegal, as all % characters imply encodings, and there is no decoding defined for "%*" or "%as" in this recommendation.
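One possible way to bring two URIs to the same encoding level before comparing them is sketched here. The RESERVED set is an assumption made for this sketch and is not the normative list. Escapes of reserved, space, control, and non-ASCII characters are left encoded, and every other escape is decoded.

```python
import re

# Assumption: characters treated as reserved for this sketch only.
RESERVED = set("%/#?+=;:@&")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def same_level(uri):
    """Decode escapes of ordinary printable characters; keep the rest."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        if ch in RESERVED or not (0x20 < ord(ch) < 0x7F):
            # Reserved, space, control, or non-ASCII: stays encoded.
            return "%" + match.group(1).upper()
        return ch
    return _ESCAPE.sub(repl, uri)

def equivalent(a, b):
    """Compare two URIs after bringing both to the same encoding level."""
    return same_level(a) == same_level(b)

plain = "http://www.w3.org/albert/bertram/marie-claude"
print(equivalent(plain, "http://www.w3.org/albert/bertram/marie%2Dclaude"))  # True
print(equivalent(plain, "http://www.w3.org/albert/bertram%2Fmarie-claude"))  # False
```

The encoded slash is never decoded, so it cannot be mistaken for a hierarchical separator during the comparison.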
In the World-Wide Web applications, the context URI is that of the document or object containing a reference. In this case partial URIs can be generated in virtual objects or stored in real objects, without the need for dramatic change if the higher-order parts of a hierarchical naming system are modified. Apart from terseness, this gives greater robustness to practical systems, by enabling information hiding between system components.
The partial form relies on a property of the URI syntax that certain characters ("/") and certain path elements ("..", ".") have a significance reserved for representing a hierarchical space, and must be recognized as such by both clients and servers.
A partial form can be distinguished from an absolute form in that the latter must have a colon and that colon must occur before any slash characters. Systems not requiring partial forms should not use any unencoded slashes in their naming schemes. If they do, absolute URIs will still work, but confusion may result. (See note on Gopher below).
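The test for an absolute form can be sketched directly from that rule. The function name is_absolute is an assumption. Fragment and query parts are ignored here for brevity.

```python
def is_absolute(uri):
    """True if the URI has a colon that occurs before any slash."""
    colon = uri.find(":")
    if colon == -1:
        return False          # no colon at all: a partial form
    slash = uri.find("/")
    return slash == -1 or colon < slash

for candidate in ["http://www.w3.org/", "g:h", "../g", "//g", "a/b:c"]:
    print(candidate, is_absolute(candidate))
```

The last candidate shows why a scheme that needs no partial forms should still avoid unencoded slashes: a colon after a slash is read as part of a partial path.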
The rules for the use of a partial name relative to the URI of the context are:
magic://a/b/c//d/e/f
the partial URIs would expand as follows:
magic://a/b/c//d/e/
the results would be exactly the same.
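For illustration, the expansion of partial forms can be tried with urljoin from Python's urllib.parse. Two assumptions apply: urljoin follows a later specification of relative references rather than the rules of this document, and it resolves partial forms only for schemes it knows, so a simplified http base stands in for the "magic" one.

```python
from urllib.parse import urljoin

base = "http://a/b/c/d/e/f"   # assumed stand-in for the context URI

expanded = {partial: urljoin(base, partial)
            for partial in ["g", "/g", "//g", "../g", "g:h"]}

for partial, absolute in expanded.items():
    print(partial, "->", absolute)
# g    -> http://a/b/c/d/e/g
# /g   -> http://a/g
# //g  -> http://g
# ../g -> http://a/b/c/d/g
# g:h  -> g:h

# A context URI ending in a slash gives the same results.
same = all(urljoin("http://a/b/c/d/e/", p) == u for p, u in expanded.items())
print(same)  # True
```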
Specific syntaxes for representing fragments in text documents by line and character range, or in graphics by coordinates, or in structured documents using ladders, are suitable for standardization but not defined here.
The fragment-id follows the URL of the whole object from which it is separated by a hash sign (#). If the fragment-id is void, the hash sign may be omitted: A void fragment-id with or without the hash sign means that the URL refers to the whole object.
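A short sketch of separating the fragment-id from the URL of the whole object, using urldefrag from Python's urllib.parse. The helper name split_fragment is an assumption.

```python
from urllib.parse import urldefrag

def split_fragment(url):
    """Return (URL of the whole object, fragment-id)."""
    whole, fragment = urldefrag(url)
    return whole, fragment

print(split_fragment("http://www.myu.edu/org/admin/people#andy"))
# A void fragment-id, with or without the hash sign, refers to the whole object.
print(split_fragment("http://www.myu.edu/org/admin/people#"))
print(split_fragment("http://www.myu.edu/org/admin/people"))
```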
While this hook is allowed for identification of fragments, the question of addressing of parts of objects, or of the grouping of objects and relationship between continued and containing objects, is not addressed by this document.
Fragment identifiers do NOT address the question of objects which are different versions of a "living" object, nor of expressing the relationships between different versions and the living object.
There is no implication that a fragment identifier refers to anything which can be extracted as an object in its own right. It may, for example, refer to an indivisible point within an object.
The "urn" prefix is reserved for use in encoding a Uniform Resource Name when that has been developed by the IETF working group.
New schemes may be registered at a later time.
The host details are not passed on to the client when the URL is an http URL which refers to the server in question. In this case the string sent starts with the slash which follows the host details. However, when an http server is being used as a gateway (or "proxy") then the entire URI, whether HTTP or some other scheme, is passed on the HTTP command line.
The search part, if present, is sent as part of the HTTP command, and may in this respect be treated as part of the path.
No fragmentid part of a WWW URI (the hash sign and following) is sent with the request. Spaces and control characters in URLs must be escaped for transmission in HTTP, as must other disallowed characters.
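The string placed on the HTTP command line can be sketched as follows. The function name request_target and its via_proxy flag are assumptions. Sending "/" for an empty path is also an assumption of this sketch.

```python
from urllib.parse import urlsplit

def request_target(url, via_proxy=False):
    """Return the part of the URL that is sent with the HTTP command."""
    parts = urlsplit(url)
    if via_proxy:
        # A gateway receives the entire URI, but never the fragment-id.
        return parts._replace(fragment="").geturl()
    target = parts.path or "/"   # assumption: empty path is sent as "/"
    if parts.query:
        target += "?" + parts.query   # the search part travels with the path
    return target

print(request_target("http://info.my.org/AboutUs/Index/Phonebook?dobbins"))
print(request_target("http://www.myu.edu/org/admin/people#andy"))
print(request_target("http://www.myu.edu/org/admin/people#andy", via_proxy=True))
```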
http://www.my.work.com/
As the rest of the URL (after the hostname and port) is opaque to the client, it shows great variety but the following are all fairly typical.
http://www.my.uni.edu/info/matriculation/enroling.html
http://info.my.org/AboutUs/Phonebook
http://www.library.my.town.va.us/Catalogue/76523471236%2Fwen44--4.98
http://www.my.org/462F4F2D4241522A314159265358979323846
A URL for a server on a different port to 80 looks like
http://www.w3.org:8000/imaginary/test
A reference to a particular part of a document may, including the fragment identifier, look like
http://www.myu.edu/org/admin/people#andy
in which case the string "#andy" is not sent to the server, but is retained by the client and used when the whole object has been retrieved.
A search on a text database might look like
http://info.my.org/AboutUs/Index/Phonebook?dobbins
and on another database
http://www.w3.org/RDB/EMP?*%20where%20name%%3Ddobbins
In all cases the client passes the path string to the server uninterpreted, and for the client to deduce anything from
Where possible, this mail address should correspond to a usable mail address for the user, and preferably give a DNS host name which resolves to the IP address of the client. Note that servers currently vary in their treatment of the anonymous password.
The arguments of any CWD commands are successive segment parts of the URL delimited by slash, and the final segment is suitable as the filename argument to the RETR command for retrieval or the directory argument to the NLST (name list) command.
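A minimal sketch of turning the path of an FTP URL into the command sequence described here. The function name, the directory flag, and the host ftp.example.org are assumptions. NLST is the name-list command as spelled in the FTP specification. Each segment is decoded only after the path has been split, so an encoded slash stays inside its segment.

```python
from urllib.parse import urlsplit, unquote

def ftp_commands(url, directory=False):
    """List the CWD commands and the final RETR or NLST for an FTP URL."""
    segments = [unquote(s) for s in urlsplit(url).path.split("/")[1:]]
    if not segments:
        raise ValueError("URL has no path to retrieve")
    *dirs, last = segments
    commands = ["CWD " + d for d in dirs]
    commands.append(("NLST " if directory else "RETR ") + last)
    return commands

print(ftp_commands("ftp://ftp.example.org/pub/docs/readme.txt"))
# ['CWD pub', 'CWD docs', 'RETR readme.txt']
print(ftp_commands("ftp://ftp.example.org/pub/docs", directory=True))
# ['CWD pub', 'NLST docs']
```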
For some file systems (Unix in particular), the "/" used to denote the hierarchical structure of the URL corresponds to the delimiter used to construct a file name hierarchy, and thus, the filename will look the same as the URL path. This does NOT mean that the URL is a Unix filename.
An FTP URL may optionally specify the FTP data transfer type by which an object is to be retrieved. Most of the methods correspond to the FTP "Data Types" ASCII and IMAGE for the retrieval of a document, as specified in FTP by the TYPE command. One method indicates directory access.
The data type is specified by a suffix to the URL. Possible suffixes are:
When the gopher command string contains characters (such as embedded CR LF and HT characters) not allowed in a URL, these are encoded using the conventional encoding.
Note that some gopher selector strings begin with a copy of the gopher type character, in which case that character will occur twice consecutively. Also note that the gopher selector string may be an empty string since this is how gopher clients refer to the top-level directory on a gopher server.
If the encoded command string (with trailing CR LF stripped) would be void then the gopher type character may be omitted and "1" (ASCII 31 hex) is assumed.
Note that slash "/" in gopher selector strings may not correspond to a level in a hierarchical structure.
A news URL may be dereferenced using NNTP (RFC977, Kantor 86) (the ARTICLE by message-id command) or using any other protocol for the conveyance of usenet news articles, or by reference to a body of news articles already received.
An example might be a name such as
urn:/iana/dns/ch/cern/cn/techdoc/94/1642-3
but the reader should refer to the latest URN drafts or specifications.
The wpath of a WAIS URL consists of encoded fields of the WAIS identifier, in the same order as in the WAIS identifier. For each field, the identifier field number is the digits before the equals sign, and the field contents follow, encoded in the conventional encoding, terminated by ";".
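The field layout of a wpath can be sketched as a small parser. The function name parse_wpath and the field numbers and contents in the example are assumptions. A literal ";" inside field contents is taken to be escaped as %3B, in line with the conventional encoding.

```python
from urllib.parse import unquote

def parse_wpath(wpath):
    """Parse 'number=contents;' fields into a dict, in order of appearance."""
    fields = {}
    for item in wpath.split(";"):
        if not item:
            continue              # nothing follows the final terminator
        number, _, contents = item.partition("=")
        fields[int(number)] = unquote(contents)
    return fields

print(parse_wpath("1=doc%3B1;7=my%20source;"))
# {1: 'doc;1', 7: 'my source'}
```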
There is however a real practical requirement to be able to generate a URL for an object in a machine's local file system.
The syntax is similar to the ftp syntax, but in this case the slash is used to denote boundaries between directory levels of a hierarchical file system. The "client" software converts the file URL into a file name in the local file name conventions. This allows local files to be treated just as network objects without any necessity to use a network server for access. This may be used for example for defining a user's "home" document in WWW.
There is clearly a danger of confusion that a link made to a local file should be followed by someone on a different system, with unexpected and possibly harmful results. Therefore, the convention is that even a "file" URL is provided with a host part. This allows a client on another system to know that it cannot access the file system, or perhaps to use some other local mechanism to access the file.
The special value "localhost" is used in the host field to indicate that the filename should really be used on whatever host one is using. This for example allows links to be made to files which are distributed on many machines, or to "your Unix local password file" subject of course to consistency across the users of the data.
A void host field is equivalent to "localhost".
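A sketch of how a client might apply the host convention when converting a file URL to a local file name. The function name local_path and the this_host parameter are assumptions. The path is returned in Unix-style form, whereas a real client would convert it to the local file name conventions.

```python
from urllib.parse import urlsplit, unquote

def local_path(url, this_host="localhost"):
    """Return the local file name, or None if the file is on another host."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file URL: %r" % url)
    host = parts.netloc or "localhost"   # a void host means "localhost"
    if host not in ("localhost", this_host):
        return None                      # cannot access that file system
    return unquote(parts.path)

print(local_path("file://localhost/etc/passwd"))   # /etc/passwd
print(local_path("file:///etc/passwd"))            # /etc/passwd
print(local_path("file://other.example/etc/passwd"))  # None
```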
Two schemes are defined. The first, "mid:", refers to the RFC822 Message-Id of a mail message. This identifier is already used in RFC822 in, for example, the References and In-Reply-To fields. The rest of the URL after the "mid:" is the RFC822 msg-id with the constant <> wrapper removed, leaving an identifier whose format in fact happens to be the same as addr-spec format for mailboxes (though the semantics are different).
The use of a "mid" URL implies access to a body of mail already received. If a message has been distributed using NNTP or other usenet protocols over the news system, then the "news:" form should be used.
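Forming a "mid" URL from a Message-Id can be sketched in a few lines. The function name mid_url and the sample identifier are assumptions, and escaping of any characters not allowed in a URL is left out of this sketch.

```python
def mid_url(message_id):
    """Build a mid: URL by removing the constant <> wrapper."""
    if not (message_id.startswith("<") and message_id.endswith(">")):
        raise ValueError("expected an RFC822 msg-id in angle brackets")
    return "mid:" + message_id[1:-1]

print(mid_url("<9401121230.AA01@host.example>"))
# mid:9401121230.AA01@host.example
```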
The news server name, newsgroup name, and index number of an article within the newsgroup on that particular server are given. The NNTP protocol must be used.
This form of URL should not be quoted outside this local area. It should not be used within news articles for wider circulation than the one server. This is a local identifier for a resource which is often available globally, and so is not recommended except in the case in which incomplete NNTP implementations on the local server force its adoption.
The path part contains a host specific object name and an optional version number. If present, the version number is separated from the host specific object name by the characters "%00" (percent zero zero), this being an escaped string terminator (null). External Prospero links are represented as URLs of the underlying access method and are not represented as Prospero URLs.
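Separating the optional version number from the object name can be sketched as follows. The function name split_version and the sample path are assumptions. The version is returned as text, since its format is not stated here.

```python
from urllib.parse import unquote

def split_version(path):
    """Split a Prospero path into (object name, version or None)."""
    name, sep, version = path.partition("%00")
    return unquote(name), (unquote(version) if sep else None)

print(split_version("pub/papers/uri%003"))  # ('pub/papers/uri', '3')
print(split_version("pub/papers/uri"))      # ('pub/papers/uri', None)
```

The split is made on the still-encoded string, before any decoding, so the "%00" terminator is found where it was written.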
It is proposed that the Internet Assigned Numbers Authority (IANA) perform the function of registration of new schemes. Any submission of a new URI scheme must include a definition of an algorithm for the retrieval of any object within that scheme. The algorithm must take the URI and produce either a set of URL(s) which will lead to the desired object, or the object itself, in a well-defined or determinable format.
It is recommended that those proposing a new scheme demonstrate its utility and operability by the provision of a gateway which will provide images of objects in the new scheme for clients using an existing protocol. If the new scheme is not a locator scheme, then the properties of names in the new space should be clearly defined. It is likewise recommended that, where a protocol allows for retrieval by URL, the client software have provision for being configured to use specific gateway locators for indirect access through new naming schemes.
A vertical line "|" indicates alternatives, and [brackets] indicate optional parts. Spaces are represented by the word "space", and the vertical line character by "vline". Single letters stand for single letters. All words of more than one letter below are entities described somewhere in this description.
The "generic" production gives a higher level parsing of the same URIs as the other productions. The "national" and "punctuation" characters do not appear in any productions and therefore may not appear in URIs.
The current IETF URI working group preference is for the prefixedurl production. (Nov 1993. July 93: url).
The "national" and "punctuation" characters do not appear in any productions and therefore may not appear in URLs.
The "afsaddress" is left in as a historical note, but is not a url production.
Address: World-Wide Web project
1211 Geneva 23,
Telephone: +41 (22)767 3755
Fax: +41 (22)767 7155