5 HTML and URLs

Contents

Uniform Resource Locators (URLs)

The World Wide Web is a network of information resources. The Web relies on three mechanisms to make these resources readily available to the widest possible audience:

A uniform naming scheme for locating resources on the Web (e.g., URLs).
Protocols for access to named resources over the Web (e.g., HTTP).
Hypertext for easy navigation among resources (e.g., HTML).

HTML documents use URLs to specify hypertext links. This section gives a brief introduction to URLs.

5.1 Uniform Resource Locators (URLs)

Every resource available on the Web (HTML document, image, video clip, program, etc.) has an address that may be encoded by a Uniform Resource Locator, or "URL" (defined in [RFC1738]).

URLs typically consist of three pieces:

The scheme identifying the protocol used to access the resource.
The name of the machine hosting the resource.
The name of the resource itself, given as a path.

Consider the URL that designates the current HTML specification:

http://www.w3.org/TR/WD-html4/cover.html

This URL may be read as follows: Use the HTTP protocol (see [RFC2068]) to transfer the data residing on the machine www.w3.org in the file "/TR/WD-html4/cover.html". Other schemes you may see in HTML documents include "mailto" for email and "ftp" for FTP.
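As a rough illustration of this reading, the three pieces can be pulled apart with Python's standard urllib.parse module. This is only a sketch: the variable names are chosen here for illustration and are not part of this specification.

```python
from urllib.parse import urlsplit

# Split the example URL into the three pieces described above.
parts = urlsplit("http://www.w3.org/TR/WD-html4/cover.html")

scheme = parts.scheme      # the scheme identifying the protocol
machine = parts.hostname   # the name of the machine hosting the resource
path = parts.path          # the name of the resource, given as a path

print(scheme, machine, path)
```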

URLs in general are case-sensitive (with the exception of machine names). There may be URLs, or parts of URLs, where case doesn't matter, but identifying these may not be easy. Users should therefore always treat URLs as case-sensitive.
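The following sketch shows one way a program might compare two URLs under this rule. The helper name same_url is hypothetical. It relies on urllib.parse lowercasing the scheme and the machine name when splitting, and it leaves every other component untouched.

```python
from urllib.parse import urlsplit

def same_url(a: str, b: str) -> bool:
    """Compare two URLs, ignoring case only where urlsplit normalizes it.

    urlsplit lowercases the scheme and the hostname; the path, query and
    fragment are compared exactly, i.e. treated as case-sensitive.
    """
    pa, pb = urlsplit(a), urlsplit(b)
    key_a = (pa.scheme, pa.hostname, pa.port, pa.path, pa.query, pa.fragment)
    key_b = (pb.scheme, pb.hostname, pb.port, pb.path, pb.query, pb.fragment)
    return key_a == key_b

# The machine name differs only in case: treated as the same URL.
print(same_url("http://WWW.W3.ORG/TR/", "http://www.w3.org/TR/"))
# The path differs in case: treated as different URLs.
print(same_url("http://www.w3.org/TR/", "http://www.w3.org/tr/"))
```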

The character set of URLs that appear in HTML is specified in [RFC1738].

5.1.1 Fragment identifiers

Some URLs refer to a location within a resource. As specified in [RFC1808], this kind of URL ends with "#" followed by an anchor identifier (called the "fragment identifier"). For instance, here is a URL pointing to an anchor named section_2:

http://somesite.com/html/top.html#section_2
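As a small sketch, Python's standard urllib.parse.urldefrag separates the fragment identifier from the rest of such a URL. The variable names are illustrative only.

```python
from urllib.parse import urldefrag

# Separate the resource URL from its fragment identifier.
resource, fragment = urldefrag("http://somesite.com/html/top.html#section_2")

print(resource)   # the URL of the resource itself
print(fragment)   # the anchor identifier that followed "#"
```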

5.1.2 Relative URLs

A relative URL (defined in [RFC1808]) doesn't contain any protocol or machine information. Its path generally refers to a resource on the same machine as the current document. Relative URLs may contain relative path components (".." means one level up in the hierarchy defined by the path), and may contain fragment identifiers.

Relative URLs are resolved to full URLs using a base URL. [RFC1808] defines the normative algorithm for this process.

As an example of relative URL resolution, assume we have the base URL "http://www.acme.com/support/intro.html". The relative URL in the following markup for a hypertext link:

  <A href="suppliers.html">Suppliers</A>

would expand to the full URL "http://www.acme.com/support/suppliers.html", while the relative URL in the following markup for an image:

  <IMG src="../icons/logo.gif" alt="logo">

would expand to the full URL "http://www.acme.com/icons/logo.gif".
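The two resolutions above can be reproduced with Python's standard urllib.parse.urljoin, as a sketch. Note that urljoin follows the later RFC 3986 resolution rules rather than [RFC1808] itself; for simple cases such as these the results agree.

```python
from urllib.parse import urljoin

base = "http://www.acme.com/support/intro.html"

# A sibling document: the last path segment of the base is replaced.
link = urljoin(base, "suppliers.html")

# ".." moves one level up in the path hierarchy before descending again.
image = urljoin(base, "../icons/logo.gif")

print(link)
print(image)
```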

5.1.3 URLs in HTML

In HTML, URLs play a role in these situations:

Linking to another document or resource (see the A and LINK elements).
Linking to an external style sheet or script (see the LINK and SCRIPT elements).
Including images, objects and applets in a page (see the IMG, OBJECT, APPLET and INPUT elements).
Defining image maps (see the MAP and AREA elements).
Submitting forms (see the FORM element).
Defining frames (see the FRAME and IFRAME elements).
Citing an external reference (see the Q, BLOCKQUOTE, INS and DEL elements).
Referring to metadata conventions describing a document (see the HEAD element).

User agents should calculate the base URL for resolving relative URLs according to [RFC1808]. The following is a summary of how [RFC1808] applies to HTML. User agents should determine the base URL according to the following order of precedence (highest priority to lowest):

The base URL is set by the BASE element.
The base URL is given by an HTTP header (see [RFC2068]).
By default, the base URL is that of the current document.

Additionally, the OBJECT and APPLET elements define attributes that take precedence over the value set by the BASE element. Please consult the definitions of these elements for more information about URL issues specific to them.
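The three-level order of precedence listed above might be sketched as follows. The function name effective_base and its parameters are hypothetical, as is the "http://www.acme.com/products/" BASE value used in the usage lines. The sketch deliberately ignores the OBJECT and APPLET attributes just mentioned.

```python
from typing import Optional
from urllib.parse import urljoin

def effective_base(document_url: str,
                   http_header_base: Optional[str] = None,
                   base_element_href: Optional[str] = None) -> str:
    """Pick the base URL, highest priority first."""
    if base_element_href:      # 1. set by the BASE element
        return base_element_href
    if http_header_base:       # 2. given by an HTTP header
        return http_header_base
    return document_url        # 3. default: the current document's URL

doc = "http://www.acme.com/support/intro.html"

# No BASE element and no HTTP header: resolve against the document itself.
default_link = urljoin(effective_base(doc), "suppliers.html")

# A (hypothetical) BASE element overrides the document's own URL.
based_link = urljoin(
    effective_base(doc, base_element_href="http://www.acme.com/products/"),
    "suppliers.html",
)

print(default_link)
print(based_link)
```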

Link elements specified by HTTP headers are handled exactly as LINK elements that appear explicitly in a document.

MAILTO URLs

In addition to HTTP URLs, authors might want to include MAILTO URLs (see [RFC1738]) in their documents. MAILTO URLs cause email to be sent to a specified email address. For instance, the author might create a link that, when activated, causes the user agent to open a mail program with the destination address in the "To:" field.

MAILTO URLs have the following syntax:

mailto:email-address

User agents may support MAILTO URL extensions that are not yet Internet standards (e.g., appending subject information to a URL with the syntax "?Subject=my%20subject" where any space characters are replaced by "%20"). Some user agents also support "?Cc=email-address".
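As a sketch of the subject extension described above, a MAILTO URL can be assembled with urllib.parse.quote from the Python standard library, which replaces each space with "%20". The helper name mailto_url and the address "someone@example.com" are placeholders chosen for illustration. Since the extension is not an Internet standard, whether a given user agent honors the result is not guaranteed.

```python
from typing import Optional
from urllib.parse import quote

def mailto_url(address: str, subject: Optional[str] = None) -> str:
    """Build a MAILTO URL, optionally appending the non-standard Subject extension."""
    url = "mailto:" + address
    if subject is not None:
        # quote() percent-encodes spaces as "%20", as the extension requires.
        url += "?Subject=" + quote(subject)
    return url

plain = mailto_url("someone@example.com")
with_subject = mailto_url("someone@example.com", "my subject")

print(plain)
print(with_subject)
```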