This section describes the syntax
for "Uniform Resource Locators" (URLs):
that is, basically physical addresses
of objects which are retrievable
using protocols already deployed
on the net. The generic syntax provides
a framework for new schemes for names
to be resolved using as yet undefined
protocols.
The syntax is described in two parts.
Firstly, we give the syntax rules
of a completely specified name; secondly,
we give the rules under which parts
of the name may be omitted in a well-defined
context.
URL syntax
A complete URL consists of a naming
scheme specifier followed by a string
whose format is a function of the
naming scheme. For locators of information
on the internet, a common syntax
is used for the IP address part.
A BNF description of the URL syntax
is given in a later section. The
components are as follows. Fragment
identifiers and partial URLs are
not involved in the basic URL definition.
To be a Uniform Resource Locator
as currently defined by the URI working
group, the whole string must start
with a constant prefix "URL:". Note
that to save space in this document,
some URLs may have been quoted throughout
without this prefix.
Scheme
Within the URL of an object, the first
element is the name of the scheme,
separated from the rest of the object
by a colon. The rest of the URL follows
the colon in a format depending on
the scheme.
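As an illustration only, the split at the first colon might be sketched
as follows. The function name split_scheme and the example address are
hypothetical, and the handling of the "URL:" prefix assumes it appears
exactly in that spelling.

```python
def split_scheme(url):
    """Split a URL into (scheme, rest) at the first colon."""
    # Assumption: the optional constant prefix is spelled exactly "URL:".
    if url.startswith("URL:"):
        url = url[len("URL:"):]
    scheme, colon, rest = url.partition(":")
    if not colon or not scheme:
        raise ValueError("no scheme found in %r" % url)
    # The format of `rest` depends on the scheme.
    return scheme, rest


print(split_scheme("URL:ftp://ftp.example.org/pub/readme.txt"))
# prints ('ftp', '//ftp.example.org/pub/readme.txt')
```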
Internet protocol parts
Those schemes which refer to internet
protocols mostly have a common syntax
for the rest of the object name.
This starts with a double slash "//"
to indicate its presence, and continues
until the following slash "/". Within
that section are
- An optional user name, if required
(as it is with a few FTP servers).
The password, if present, follows
the user name, separated from it
by a colon; the user name and optional
password are followed by a commercial
at sign "@". The use of user names
and passwords which are public is
discouraged.
- The internet domain name of the host
in RFC1037 format (or, optionally
and less advisably, the IP address
as a set of four decimal numbers
separated by periods).
- The port number, if it is not the
default number for the protocol,
is given in decimal notation after
a colon.
- Path. The rest of the locator is known
as the "path". It may define details
of how the client should communicate
with the server, including information
to be passed transparently to the
server without any processing by
the client.
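The parts listed above could be pulled apart roughly as shown here.
This is a minimal sketch, not a full parser: the function name
split_internet_part, the dictionary keys and the example host are
hypothetical, and the input is assumed to be the text that follows
"scheme:".

```python
def split_internet_part(rest):
    """Split the text after 'scheme:' into user, password, host, port, path."""
    if not rest.startswith("//"):
        raise ValueError("no '//' internet part present")
    # The internet part runs from after "//" up to the following "/".
    # Assumption: the path is returned without that separating slash.
    internet, _, path = rest[2:].partition("/")

    user = password = port = None
    if "@" in internet:
        # User name and optional password precede the "@" sign.
        userinfo, _, hostport = internet.rpartition("@")
        user, colon, password_text = userinfo.partition(":")
        password = password_text if colon else None
    else:
        hostport = internet

    # The port, if given, follows the host after a colon, in decimal.
    host, colon, port_text = hostport.partition(":")
    if colon:
        port = int(port_text)

    return {"user": user, "password": password,
            "host": host, "port": port, "path": path}


parts = split_internet_part("//anonymous:guest@ftp.example.org:2121/pub/readme.txt")
print(parts["host"], parts["port"], parts["path"])
# prints: ftp.example.org 2121 pub/readme.txt
```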
The path is interpreted in a manner
dependent on the scheme being used.
Generally, the reserved slash "/"
character (ASCII 2F hex) denotes
a level in a hierarchical structure,
the higher level part to the left
of the slash.
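For a scheme that does use the slash hierarchically, the levels can be
read off with a simple split. The helper names path_levels and
parent_path below are hypothetical, and the sketch assumes a path with
no leading slash.

```python
def path_levels(path):
    """Return the levels of a hierarchical path, highest level first."""
    return path.split("/")


def parent_path(path):
    """Return the part to the left of the last slash (the higher level)."""
    head, slash, _ = path.rpartition("/")
    return head if slash else ""


print(path_levels("pub/docs/readme.txt"))  # prints ['pub', 'docs', 'readme.txt']
print(parent_path("pub/docs/readme.txt"))  # prints pub/docs
```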
When a system uses a local addressing
scheme, it is useful to provide a
mapping from local addresses into
URLs so that references to objects
within the addressing scheme may
be referred to globally, and possibly
accessed through gateway servers.
Any mapping scheme may be defined
provided it is unambiguous, reversible,
and provides valid URLs. It is recommended
that where hierarchical aspects to
the local naming scheme exist, they
be mapped onto the hierarchical URL
path syntax in order to allow the
partial form to be used.
The following encoding method shall
be used for mapping WAIS, FTP, Prospero
and Gopher addresses onto URLs. Where
the local naming scheme uses octet
values which are not allowed in the
URL, these shall be represented in
the URL by a percent sign "%" followed
by two hexadecimal digits (0-9, A-F)
giving the value for that octet.
This specification makes no assumptions
or requirements about the character
sets, if any, referred to by the
(decoded) octets of a URL. Character
codes other than those allowed by
the syntax shall not be used unencoded
in a URL.
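The percent encoding described above works on octets, not characters,
and might be sketched as follows. The ALLOWED set here is a placeholder
chosen for illustration; the real set is whatever the BNF of the URL
syntax permits. Accepting lower-case hexadecimal digits when decoding
is likewise an assumption of this sketch.

```python
# Assumption: illustrative set of octets permitted unencoded in a URL.
ALLOWED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/.-_"
)
HEX_DIGITS = "0123456789ABCDEFabcdef"


def encode_octets(data):
    """Map local-address octets to URL text, using %XX where needed."""
    out = []
    for octet in data:  # iterating over bytes yields integers 0..255
        if octet in ALLOWED:
            out.append(chr(octet))
        else:
            out.append("%%%02X" % octet)  # "%" plus two hex digits
    return "".join(out)


def decode_octets(text):
    """Reverse encode_octets, returning the original octets."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "%":
            pair = text[i + 1:i + 3]
            if len(pair) != 2 or any(c not in HEX_DIGITS for c in pair):
                raise ValueError("malformed escape at position %d" % i)
            out.append(int(pair, 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


print(encode_octets(b"my file.txt"))  # prints my%20file.txt
```

Because "%" itself is outside the allowed set, it is always written as
"%25", which keeps the mapping reversible.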
The same encoding method may be used
for encoding characters whose use,
although technically allowed in a
URL, would be unwise due to problems
of corruption by imperfect gateways
or misrepresentation due to the use
of variant character sets, or which
would simply be awkward in a given
environment. Because a % sign always
indicates an encoded character, a
URL may be made safer simply by encoding
any characters considered unsafe,
while leaving already encoded characters
still encoded. Similarly, in cases
where a larger set of characters
is acceptable, % signs can be selectively
and reversibly expanded.
The reserved characters shall however
never be arbitrarily encoded and decoded.
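Making a URL "safer", and the reverse expansion, might look like the
sketch below. The names make_safer and expand are hypothetical. Only
the slash is named as reserved in the text above, so RESERVED is a
placeholder to be replaced by the full reserved set from the BNF.

```python
import re

# Assumption: placeholder reserved set; only "/" is named in the text.
RESERVED = frozenset("/")


def make_safer(url, unsafe):
    """Percent-encode each character of `url` that appears in `unsafe`.

    Existing %XX sequences are left alone, because "%" is never encoded.
    """
    for ch in unsafe:
        if ch == "%" or ch in RESERVED:
            raise ValueError("refusing to encode %r" % ch)
        if ord(ch) > 0x7F:
            raise ValueError("only ASCII characters are handled here")
    return "".join("%%%02X" % ord(ch) if ch in unsafe else ch for ch in url)


def expand(url, acceptable):
    """Decode %XX only where the result is in `acceptable` and not reserved."""
    def replace(match):
        ch = chr(int(match.group(1), 16))
        if ch in acceptable and ch != "%" and ch not in RESERVED:
            return ch
        return match.group(0)  # leave this escape encoded
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, url)


print(make_safer("a b/c%20d", " "))  # prints a%20b/c%20d
print(expand("a%20b%2Fc", " /"))     # prints a b%2Fc
```

In the second call the encoded slash stays encoded even though "/" was
offered as acceptable, since a reserved character changes meaning when
it is decoded.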