Bug 11379 - [pending URL spec] definition of hierarchical URL inconsistent with rfc 3986
Product: WHATWG
Classification: Unclassified
Component: URL
Hardware: All All
Importance: P4 normal
Target Milestone: Unsorted
Assigned To: Anne
Reported: 2010-11-22 18:40 UTC by Glenn Adams
Modified: 2012-12-15 10:47 UTC
Description Glenn Adams 2010-11-22 18:40:56 UTC
Section 2.6.1 defines a hierarchical URL thus:

"An absolute URL is a hierarchical URL if, when resolved and then parsed, there is a character immediately after the <scheme> component and it is a U+002F SOLIDUS character (/)."

However, RFC3986 Section 3 defines all URIs as containing a hierarchical part as follows:

URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

and, further, does not require the hierarchical part to start with "/". In particular, it defines hier-part as:

hier-part   = "//" authority path-abempty
                  / path-absolute
                  / path-rootless
                  / path-empty

Expanding these components into their definitions gives:

          = "//" authority
          | "//" authority 1*( "/" segment )
          | "/" [ segment-nz *( "/" segment ) ]
          | segment-nz *( "/" segment )
          | 0<pchar>

Note that the last two alternatives do not start with "/", yet are still considered a "hierarchical" part by RFC3986. For example, the following URIs match this syntax, with hier-part mapping to path-rootless:


To avoid confusion, it may be desirable for HTML5 to use a term other than "hierarchical URL" here. Alternatively, a note could be added that distinguishes the usage defined in HTML5 from the like-named (but different) constructs in RFC 3986.

I would also note that, under the definitions in 2.6.1, all "authority-based URLs" are also "hierarchical URLs". I can't tell whether this is intentional. If it is, a note saying so would be useful.
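
To make the mismatch concrete, here is a minimal Python sketch that classifies a URI by the RFC 3986 hier-part alternative it uses and, separately, applies an approximation of the 2.6.1 test. The function names and the sample URIs mentioned afterwards are assumptions chosen for illustration, and the draft test is read here as "the first character after `scheme:` is `/`", which may not match the spec's resolve-then-parse wording exactly.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), per RFC 3986 Section 3.1
_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):(.*)$", re.DOTALL)


def hier_part_kind(uri: str) -> str:
    """Name the RFC 3986 hier-part alternative that a URI uses (hypothetical helper)."""
    match = _SCHEME.match(uri)
    if match is None:
        raise ValueError(f"no scheme found in {uri!r}")
    # Drop the optional query and fragment; what remains is the hier-part.
    hier = re.split(r"[?#]", match.group(2), maxsplit=1)[0]
    if hier.startswith("//"):
        return "authority path-abempty"
    if hier.startswith("/"):
        return "path-absolute"
    if hier:
        return "path-rootless"
    return "path-empty"


def is_hierarchical_per_draft(uri: str) -> bool:
    """Approximate the 2.6.1 test: a "/" immediately follows "scheme:" (assumed reading)."""
    match = _SCHEME.match(uri)
    return match is not None and match.group(2).startswith("/")
```

Under these assumptions, `mailto:user@example.com` and `urn:isbn:0451450523` both classify as `path-rootless`, so RFC 3986 still gives them a hier-part, yet `is_hierarchical_per_draft` returns `False` for both. `http://example.com/a` classifies as `authority path-abempty` and passes the draft test, which is consistent with every authority-based URL also being a "hierarchical URL" under 2.6.1.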

Comment 1 Ian 'Hixie' Hickson 2011-01-01 05:50:13 UTC
I'll look into this in more detail once Adam's spec on how to parse URLs is ready. From a quick glance, though, it seems not too unreasonable to come up with different terminology if there's a better term than "hierarchical" here. Any suggestions?
Comment 2 Adam Barth 2011-01-01 21:51:32 UTC
I've been using the term "standard URL" but that might not be the optimal term either.
Comment 3 Michael[tm] Smith 2011-08-04 05:13:26 UTC
mass-move component to LC1
Comment 4 Anne 2012-09-28 10:51:51 UTC
http://url.spec.whatwg.org/ defines URLs now. Per that document a URL is always "absolute" (perhaps invalid, but always absolute). The input to the parsing algorithm may be relative to something else, but you always end up with a URL that has all the relevant information (although it could be invalid if the input is relative and there is nothing to resolve it against).
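
A rough Python sketch of that behaviour, for illustration only: `urllib.parse` follows RFC 3986-style resolution and is not an implementation of the algorithm in the URL spec, and the `resolve` helper and its `None` return value are assumptions made for this example.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def resolve(input_url: str, base: Optional[str] = None) -> Optional[str]:
    """Return an absolute URL string, or None when no absolute result exists.

    Hypothetical helper: None stands in for the "invalid" outcome of relative
    input with nothing to resolve it against.
    """
    candidate = urljoin(base, input_url) if base else input_url
    # An absolute result must carry a scheme; a relative input with no base has none.
    return candidate if urlsplit(candidate).scheme else None
```

With a base, `resolve("../c", "http://example.com/a/b")` gives `http://example.com/c`. Without one, `resolve("../c")` gives `None`, while an already absolute input such as `http://example.com/x` is returned unchanged.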