Much of what makes the World Wide Web useful and interesting is the ability to link from one document to another.

For example, a link to chapter two is written as:

  <a href="chapter2.html">chapter two</a>

Links can go to resources as close as the same document:

  <a href="#section3">section three on mammals</a>

Or as far as a service from another country:

  <a href="http://www.genome.jp/">Genome Net</a>

The <a href=" part is HTML markup, but what goes inside the href attribute is shared with many other data formats on the Web. Since January 2005, the Internet Standard for writing these references has been Uniform Resource Identifier (URI): Generic Syntax (RFC 3986).

Let's quickly review URI syntax using some tests*:

$ a = document.createElement('a'); null
$ a.href = "http://example:8000/path?query#frag"; null
$ a.protocol
"http:"
$ a.host
"example:8000"
$ a.hostname
"example"
$ a.port
"8000"
$ a.pathname
"/path"
$ a.search
"?query"
$ a.hash
"#frag"

Note the connection between linking syntax and protocols: when the protocol part is "http:", the host is used to make a connection to an HTTP server, and the path and search parts are used to produce a GET request:

GET /path?query HTTP/1.1
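
As a rough sketch of that connection, the start of the request can be derived from the decomposed URL. This uses the standard URL class rather than the <a> element so that it also runs outside a browser; the helper name requestHead is made up for illustration, and a real client would send further headers.

```javascript
// Hypothetical helper: derive the start of an HTTP/1.1 GET request from a URL.
function requestHead(href) {
  const url = new URL(href);
  if (url.protocol !== "http:") {
    throw new Error("this sketch only handles http: URLs");
  }
  // The fragment (url.hash) is never sent to the server.
  const requestLine = "GET " + url.pathname + url.search + " HTTP/1.1";
  // url.host includes the port when it is not the default for the scheme.
  const hostHeader = "Host: " + url.host;
  return [requestLine, hostHeader];
}

console.log(requestHead("http://example:8000/path?query#frag").join("\n"));
```

The fragment is left out on purpose: it is interpreted by the client after the resource has been retrieved.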

Space in path

The path part of an HTTP request line contains no spaces; spaces separate the request method, the path, and the version identifier.

But links are sometimes made using filenames, and spaces are allowed in modern filesystems, so HTML reduces paths with spaces to paths without spaces using the %xx encoding convention (cf. section 2.1, Percent-Encoding, of RFC 3986):

$ a = document.createElement('a'); null
$ a.href = "http://example/book1/chapter 2.html"; null
$ a.pathname
"/book1/chapter%202.html"

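The same reduction can be observed outside a document. The following is a minimal sketch using the standard URL class and encodeURI, not the <a> element that the tests here rely on; http://example/ is simply the placeholder host used throughout.

```javascript
// Parsing percent-encodes the space in the path as %20.
const url = new URL("http://example/book1/chapter 2.html");
console.log(url.pathname);

// encodeURI applies the same %xx convention to a bare file name.
const encodedName = encodeURI("chapter 2.html");
console.log(encodedName);

// Decoding restores the original file name.
console.log(decodeURIComponent(encodedName));
```
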
Note that spaces at the beginning and end of href attribute values are stripped:

$ a = document.getElementById('strip1'); null
$ a.protocol
"http:"
$ a = document.getElementById('strip2'); null
$ a.pathname
"/book1/chapter%202.html"

In this case, we initialize the href by parsing this document; in other cases, we use JavaScript assignment. In theory, these could be different code paths. Should we check both in every case?
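
A further code path that could be compared is the standard URL constructor, which also discards leading and trailing spaces before parsing. The sketch below assumes a padded value of the kind the strip tests exercise; the actual attribute values live in this document's markup and are not reproduced here.

```javascript
// Assumed stand-in for an href attribute value padded with spaces.
const padded = "   http://example/book1/chapter 2.html   ";
const parsed = new URL(padded);

// Outer spaces are dropped; the inner space is percent-encoded.
console.log(parsed.protocol);
console.log(parsed.pathname);
console.log(parsed.href);
```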

Colon in path

Since protocol names must begin with a letter, "111:" cannot be read as a protocol, so the following reference is resolved relative to the base of this document.

$ a = document.createElement('a'); null
$ a.href = "111:foo"; null
$ parts = a.pathname.split("/"); parts[parts.length-1]
"111:foo"

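The rule behind this can be sketched with the scheme grammar from RFC 3986 (section 3.1). The helper name startsWithScheme and the base URL below are placeholders for illustration, not part of this document's test setup.

```javascript
// RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
const SCHEME_PREFIX = /^[A-Za-z][A-Za-z0-9+.-]*:/;

// Hypothetical helper: does the reference start with a scheme?
function startsWithScheme(reference) {
  return SCHEME_PREFIX.test(reference);
}

console.log(startsWithScheme("http://example/")); // true
console.log(startsWithScheme("111:foo"));         // false

// With no scheme, the reference is resolved against a base URL.
// The base below is a placeholder, not this document's real address.
const resolved = new URL("111:foo", "http://example/book1/index.html");
console.log(resolved.pathname);
```
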
Non-ASCII characters in path

The URI standard only discusses making links with ASCII characters, but links are also made with other characters, such as ☺:

$ a = document.createElement('a'); null
$ happy = parseInt("263A", 16); null
$ a.href = "http://example/" + String.fromCharCode(happy); null
$ a.pathname
"/%E2%98%BA"

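The three escapes in that result are the UTF-8 bytes of U+263A. A minimal sketch of the byte-level step follows; the helper percentEncodeUtf8 is hypothetical and, unlike a URL parser, escapes every byte, including ASCII letters.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of a string.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  let out = "";
  for (const byte of bytes) {
    out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

const happy = String.fromCharCode(0x263a); // U+263A
console.log(percentEncodeUtf8(happy));

// The URL parser produces the same bytes for the path.
console.log(new URL("http://example/" + happy).pathname);
```
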
Some features can only be tested in documents in other character encodings:

About the tests

The examples above are executable, thanks to Doctest/JS by Ian Bicking.




Acknowledgements

Thanks to Ian Hickson for suggesting and documenting the URL decomposition attributes and for providing materials.

Incidentally, when it comes time to write the test suite, I recommend using the <a> element's URL decomposition attributes as a good way to test this stuff.
Ian H.
Pine.LNX.4.62.0903181737330.2690@hixie.dreamhostps.com