This section describes the syntax for URIs as used in
the World-Wide Web initiative.
The generic syntax provides a framework for new schemes for names to be
resolved using as yet undefined protocols.
URI syntax
A complete URI consists of a naming scheme specifier followed
by a string whose format is a function of the naming scheme.
For locators of information on the Internet,
a common syntax is used for the IP address part.
A BNF description of the URL syntax is given in a later section.
The components are as follows.
Fragment identifiers and
relative URIs are not involved in the basic URL definition.
Scheme
Within the URI of an object,
the first element is the name of the scheme,
separated from the rest of the object by a colon.
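As an illustration only, the separation of the scheme from the rest of the URI can be sketched in Python. The name split_scheme is hypothetical and not part of this specification; the sketch assumes that the first colon in the string is the scheme delimiter.

```python
from typing import Tuple


def split_scheme(uri: str) -> Tuple[str, str]:
    """Split a URI at its first colon into (scheme, rest).

    Hypothetical helper: assumes the first colon delimits the scheme.
    """
    scheme, sep, rest = uri.partition(":")
    if not sep or not scheme:
        raise ValueError("no scheme specifier in %r" % uri)
    return scheme, rest


if __name__ == "__main__":
    print(split_scheme("http://www.w3.org/albert/bertram/marie-claude"))
```

How the second element is interpreted is then a matter for the scheme named by the first.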
Path
The rest of the URI follows the colon in a format depending on the
scheme.
The path is interpreted in a manner dependent on the protocol being used.
However, when it contains slashes, these must imply a hierarchical structure.
Reserved characters
The path in the URI has a significance defined by the
particular scheme.
Typically it is used to encode a name in a given name space,
or an algorithm for accessing an object.
In either case,
the encoding may use those characters allowed by the BNF syntax,
or hexadecimal encoding of other characters.
Some of the reserved characters have special uses as defined here.
The percent sign
The percent sign ("%",
ASCII 25 hex) is used as the escape character in the encoding scheme and is
never allowed for anything else.
Hierarchical forms
The slash ("/",
ASCII 2F hex) character is reserved for the delimiting of substrings whose
relationship is hierarchical. This enables partial forms of the URI.
Substrings consisting of single or double dots ("." or "..") are similarly
reserved.
The significance of the slash between two segments is that the segment of
the path to the left is more significant than the segment of the path to the
right.
("Significance" in this case refers solely to closeness to the root of the
hierarchical structure and makes no value judgement!)
Note
The similarity to Unix and other disk operating system filename
conventions should be taken as purely coincidental,
and should not be taken to indicate that URIs should be interpreted as file
names.
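The left-to-right significance of slash-delimited segments can be sketched as follows. This is a minimal illustration, not a definition: the names split_hierarchy and is_under are hypothetical, and the sketch deliberately does not decode any %XX sequences, so an encoded slash is treated as ordinary data inside a segment.

```python
from typing import List


def split_hierarchy(path: str) -> List[str]:
    """Return the slash-delimited segments, most significant first."""
    return path.split("/")


def is_under(path: str, ancestor: str) -> bool:
    """True if `ancestor`'s segments are a leading run of `path`'s segments."""
    p = split_hierarchy(path)
    a = split_hierarchy(ancestor)
    return len(a) <= len(p) and p[:len(a)] == a


if __name__ == "__main__":
    print(split_hierarchy("/albert/bertram/marie-claude"))
    print(is_under("/albert/bertram/marie-claude", "/albert/bertram"))
```

Only literal slashes create levels here; the string is split as a sequence of characters and is never handed to a file system.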
Hash for Fragment Identifiers
The hash ("#",
ASCII 23 hex) character is reserved as a delimiter to separate the URI of an
object from a fragment identifier.
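A minimal sketch of separating the URI of an object from a fragment identifier is given below. The name split_fragment is hypothetical, and the sketch assumes the first hash in the string is the delimiter.

```python
from typing import Optional, Tuple


def split_fragment(reference: str) -> Tuple[str, Optional[str]]:
    """Split at the first hash into (uri, fragment).

    The fragment is None when no hash is present, and the empty
    string when the hash is the last character.
    """
    uri, sep, fragment = reference.partition("#")
    return uri, (fragment if sep else None)


if __name__ == "__main__":
    print(split_fragment("http://www.w3.org/albert#section2"))
```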
Query strings
The question mark ("?",
ASCII 3F hex) is used to delimit the boundary between the URI of a queryable
object, and a set of words used to express a query on that object.
When this form is used,
the combined URI stands for the object which results from the query being
applied to the original object.
Within the query string,
the plus sign is reserved as shorthand notation for a space.
Therefore, real plus signs must be encoded.
This method was used to make query URIs easier to pass in systems which did
not allow spaces.
The query string represents some operation applied to the
object, but this specification gives no common syntax or semantics for it.
In practice the syntax and semantics may depend on the scheme and may even
depend on the base URI.
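The treatment of the question mark and the plus sign can be illustrated with a short sketch. The names decode_query and split_query are hypothetical. The sketch assumes that the first question mark is the delimiter, that plus signs are expanded before %XX sequences are decoded (so that an encoded %2B survives as a real plus sign), and that lower-case hexadecimal digits are tolerated; none of these choices is mandated by this section.

```python
import re
from typing import Optional, Tuple

# A percent sign followed by exactly two hexadecimal digits.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_query(query: str) -> str:
    """Expand '+' to a space, then decode %XX as ISO Latin-1 codes."""
    spaced = query.replace("+", " ")
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), spaced)


def split_query(uri: str) -> Tuple[str, Optional[str]]:
    """Split at the first '?' into (object URI, decoded query or None)."""
    base, sep, query = uri.partition("?")
    return base, (decode_query(query) if sep else None)


if __name__ == "__main__":
    print(split_query("http://www.w3.org/index?library+catalogue"))
    print(decode_query("1%2B1+equals+2"))
```

Because a bare plus sign always becomes a space, a real plus sign reaches the query only in its encoded form.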
Other reserved characters
The asterisk ("*",
ASCII 2A hex) and exclamation mark ("!",
ASCII 21 hex) are reserved for use as having special significance within
specific schemes.
Unsafe characters
In canonical form,
certain characters such as spaces, control characters,
some characters whose ASCII code is used differently in different national
7-bit character set variants,
and all 8-bit characters beyond DEL (7F hex) of the ISO Latin-1 set,
shall not be used unencoded.
This is a recommendation for trouble-free interchange, and as indicated below,
the encoded set may be extended or reduced.
When a system uses a local addressing
scheme,
it is useful to provide a mapping from local addresses into URIs so that
objects within the addressing scheme may be referred to globally,
and possibly accessed through gateway servers.
For a new naming scheme,
any mapping scheme may be defined provided it is unambiguous, reversible,
and provides valid URIs.
It is recommended that where hierarchical aspects to the local naming scheme
exist,
they be mapped onto the hierarchical URL path syntax in order to allow the
partial form to be used.
It is also recommended that the conventional scheme below be used in all
cases except for any scheme which encodes binary data as opposed to text,
in which case a more compact encoding such as pure hexadecimal or base 64
might be more appropriate.
For example, the conventional URI encoding method is used for mapping WAIS,
FTP, Prospero and Gopher addresses in the URI specification.
Conventional URI encoding scheme
Where the local naming scheme uses ASCII
characters which are not allowed in the URI,
these may be represented in the URL by a percent sign "%" immediately followed
by two hexadecimal digits (0-9,
A-F) giving the ISO Latin-1 code for that character.
Character codes other than those allowed by the syntax shall not be used
unencoded in a URI.
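The conventional encoding can be sketched as below. This is an illustration under stated assumptions, not a normative algorithm: the name encode_local_name is hypothetical, and ASSUMED_ALLOWED is only a stand-in for the characters permitted by the BNF, which is not reproduced in this section. Slashes in the local name are assumed to be hierarchical and are left as they are.

```python
import string

# Assumption: a conservative stand-in for the characters the BNF allows.
# Slashes are kept on the assumption that they are hierarchical.
ASSUMED_ALLOWED = frozenset(string.ascii_letters + string.digits + "-_./")


def encode_local_name(name: str, allowed=ASSUMED_ALLOWED) -> str:
    """Replace each character outside `allowed` by '%' plus two hex digits."""
    out = []
    for ch in name:
        if ch in allowed:
            out.append(ch)
            continue
        code = ord(ch)
        if code > 0xFF:
            # No ISO Latin-1 code exists for this character.
            raise ValueError("character %r is outside ISO Latin-1" % ch)
        out.append("%{:02X}".format(code))
    return "".join(out)


if __name__ == "__main__":
    print(encode_local_name("docs/my file.txt"))
```

A literal percent sign is not in the allowed set, so it is itself encoded, which keeps the mapping reversible.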
Reduced or increased safe character sets
The same encoding method may be
used for encoding characters whose use, although technically allowed in a URI,
would be unwise due to problems of corruption by imperfect gateways or
misrepresentation due to the use of variant character sets,
or which would simply be awkward in a given environment.
Because a % sign always indicates an encoded character,
a URI may be made "safer" simply by encoding any characters considered unsafe,
while leaving already encoded characters still encoded.
Similarly, in cases where a larger set of characters is acceptable,
% signs can be selectively and reversibly expanded.
Before two URIs can be
compared,
it is therefore necessary to bring them to the same encoding level.
However,
the reserved characters mentioned above have a quite different significance
when encoded, and so may NEVER be encoded and unencoded in this way.
The percent sign intended as such must always be encoded,
as its presence otherwise always indicates an encoding.
Sequences which start with a percent sign but are not followed by two
hexadecimal characters are reserved for future extension.
(See Example 3.)
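Bringing two URIs to the same encoding level before comparison can be sketched as follows. The names to_common_level and same_uri are hypothetical, and ASSUMED_SAFE is an illustrative choice of characters that are expanded when found encoded. The reserved characters are outside that set, so an encoded reserved character is never expanded. The sketch only expands escapes; it does not encode characters that appear unencoded in one URI and encoded in the other.

```python
import re
import string

# A percent sign followed by exactly two hexadecimal digits.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

# Assumption: characters considered safe to expand. Reserved characters
# such as % / # ? + * ! and the dot are deliberately excluded.
ASSUMED_SAFE = frozenset(string.ascii_letters + string.digits + "-_")


def to_common_level(uri: str) -> str:
    """Expand %XX for assumed-safe characters; keep all other escapes."""
    def expand(match):
        ch = chr(int(match.group(1), 16))
        if ch in ASSUMED_SAFE:
            return ch
        # Left encoded; hex digits upper-cased so %2f and %2F compare equal.
        return "%" + match.group(1).upper()
    return _ESCAPE.sub(expand, uri)


def same_uri(a: str, b: str) -> bool:
    """Compare two URIs after bringing both to the same encoding level."""
    return to_common_level(a) == to_common_level(b)


if __name__ == "__main__":
    print(same_uri("http://www.w3.org/albert/bertram/marie-claude",
                   "http://www.w3.org/albert/bertram/marie%2Dclaude"))
    print(same_uri("http://www.w3.org/albert/bertram/marie-claude",
                   "http://www.w3.org/albert/bertram%2Fmarie-claude"))
```

With these assumptions an encoded hyphen compares equal to a literal hyphen, while an encoded slash stays distinct from a literal one.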
Example 1
The URIs
http://www.w3.org/albert/bertram/marie-claude
and
http://www.w3.org/albert/bertram/marie%2Dclaude
are identical, as the %2D encodes a hyphen character.
Example 2
The URIs
http://www.w3.org/albert/bertram/marie-claude
and
http://www.w3.org/albert/bertram%2Fmarie-claude
are NOT identical,
as in the second case the encoded slash does not have hierarchical
significance.
Example 3
The URIs
fxqn:/us/va/reston/cnri/ietf/24/asdf%*.fred
and
news:12345667123%asdghfh@info.cern.ch
are illegal, as all % characters imply encodings,
and there is no decoding defined for "%*" or "%as" in this recommendation.
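A check for such illegal sequences can be sketched as below. The name find_bad_escapes is hypothetical, and accepting lower-case hexadecimal digits is an assumption of the sketch rather than something this section states.

```python
import re
from typing import List

# A percent sign NOT followed by two hexadecimal digits.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def find_bad_escapes(uri: str) -> List[str]:
    """Return each offending '%' with up to two characters that follow it."""
    return [uri[m.start():m.start() + 3] for m in _BAD_ESCAPE.finditer(uri)]


if __name__ == "__main__":
    for candidate in (
        "fxqn:/us/va/reston/cnri/ietf/24/asdf%*.fred",
        "news:12345667123%asdghfh@info.cern.ch",
        "http://www.w3.org/albert/bertram/marie%2Dclaude",
    ):
        print(candidate, find_bad_escapes(candidate))
```

An empty list means every percent sign in the string introduces a well-formed two-digit encoding.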