Re: CfC: Request transition of HTML5 to Candidate Recommendation

On Nov 14, 2012, at 2:15 PM, Sam Ruby wrote:

> In accordance with both the W3C process's requirement to record the group's decision to request advancement[1], and with the steps identified in the "Plan 2014" CfC[2], this is a Call for Consensus (CfC) to request transition to CR for the following document:
> 
> http://htmlwg.org/cr/html/index.html
> 
> Silence will be taken to mean there is no objection, but positive
> responses are encouraged. If there are no objections by Monday,
> November 26th, this resolution will carry.
> 
> Considerations to note:
> 
> - A request to advance indicates that the Working Group believes the specification is stable and appropriate for implementation.
> 
> - The specification MAY still change based on implementation experience.
> 
> - Sam Ruby, on behalf of the W3C co-chairs
> 
> [1] http://www.w3.org/2005/10/Process-20051014/tr.html#transition-reqs
> [2] http://lists.w3.org/Archives/Public/public-html/2012Oct/0026.html


This specification continues to use terminology and definitions
that are arbitrarily different from the other specifications of
Web architecture, resulting in needless argumentation in support
of willful violations that are really just a failure to use the
right terms at the right times.

  URL       --> reference
  resource  --> representation
  encoding  --> charset (or character encoding scheme)

The section on URLs

  http://htmlwg.org/cr/html/urls.html

is particularly egregious since it redefines URL to be a reference
and then modifies the ABNF of RFC3986 in order to parse and
resolve a URL (== reference) to an absolute URL, reversing the
incorrect terms in order to invoke the other specification's
algorithms.  AFAICT, the rest of the HTML5 specification does not
need to use the term URL except as part of a defined phrase, such
as "valid URL potentially surrounded by spaces".

In fact, the places where the defined phrase is used do allow
any string as input (a reference) and do not perform any sort
of validation on that input.  They also don't treat arbitrary
whitespace characters verbatim, as described in the algorithm.
The places where validation is relevant are the DOM setters,
which define their own conversion algorithms specific to each
component.

Where the 3986 parsed components are used, technical errors
have been introduced for more unnecessary definitions. E.g.,
sec 2.6.4:

  An absolute URL is a hierarchical URL if, when resolved and
  then parsed, there is a character immediately after the
  <scheme> component and it is a "/" (U+002F) character.

  An absolute URL is an authority-based URL if, when resolved
  and then parsed, there are two characters immediately after
  the <scheme> component and they are both "//" (U+002F)
  characters.

In both cases, the character immediately after the <scheme>
component is a colon (":"), because the colon is a separator
and not part of the component.  This differs from the DOM
attribute "protocol", whose getter (due to some ancient bug)
appends the ":" to the scheme.

It would be correct to say that an absolute URL is
authority-based if, after parsing, the authority component
is defined.  Likewise, a URL is hierarchical if it is
authority-based or the pathname begins with "/".  However,
I doubt that the specification needs either of these terms;
if they are used somewhere, then define them where they are used.
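To illustrate the corrected definitions, here is a minimal sketch
in Python that splits a URL with the regular expression from
RFC3986 Appendix B.  The function names (split_uri,
is_authority_based, is_hierarchical, char_after_scheme) are
hypothetical and do not come from either specification.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Split a URI into its five components; an undefined one is None."""
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def is_authority_based(uri):
    # Authority-based: the authority component is defined (it may be empty).
    return split_uri(uri)["authority"] is not None


def is_hierarchical(uri):
    # Hierarchical: authority-based, or the path begins with "/".
    parts = split_uri(uri)
    return parts["authority"] is not None or parts["path"].startswith("/")


def char_after_scheme(uri):
    # The separator that follows the scheme is not part of the component.
    scheme = split_uri(uri)["scheme"]
    return uri[len(scheme)] if scheme is not None else None
```

With these definitions, char_after_scheme("http://a/b/c/d;p?q")
yields the colon, "file:/etc/hosts" is hierarchical without being
authority-based, and "mailto:user@example.com" is neither.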

And in 2.6.7

   o.protocol [ = value ]

   Returns the current scheme of the underlying URL.

   Can be set, to change the underlying URL's scheme.

is likewise incorrect because the getter returns the URL scheme
followed by ":".  I am not sure what happens when it is set, with
or without a ":".

Sec 2.6.5 apparently defines an incorrect fragment-escape
algorithm within a section entitled "URL manipulation and
creation".  I am not sure how that is supposed to be used.
If it is just for fragments, then the section title should
be corrected and the algorithm changed to avoid double-encoding
an existing pct-encoded sequence.  If it is for any URL component,
then delete the section because it is hopelessly wrong.
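As a sketch of what a fragment-only escape that avoids
double-encoding might look like (this is not the spec's algorithm;
the name escape_fragment and the choice of UTF-8 for non-ASCII
characters are my assumptions), in Python:

```python
import re

# Characters RFC 3986 allows unescaped in a fragment:
#   fragment = *( pchar / "/" / "?" )
#   pchar    = unreserved / pct-encoded / sub-delims / ":" / "@"
_FRAGMENT_SAFE = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~"          # rest of unreserved
    "!$&'()*+,;="   # sub-delims
    ":@/?"
)
_PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def escape_fragment(fragment):
    """Pct-encode a fragment, leaving existing pct-encoded triplets alone."""
    out = []
    i = 0
    while i < len(fragment):
        if _PCT_ENCODED.match(fragment, i):
            # Already a valid pct-encoded sequence: copy it through as-is.
            out.append(fragment[i:i + 3])
            i += 3
            continue
        ch = fragment[i]
        if ch in _FRAGMENT_SAFE:
            out.append(ch)
        else:
            # Assumption: non-URL characters are encoded as UTF-8 octets.
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
        i += 1
    return "".join(out)
```

Because valid triplets pass through untouched and a stray "%"
becomes "%25", applying the function twice gives the same result
as applying it once.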

I also find it curious that the spec defines the meaning and
attributes of the anchor (a) element within the section on Links
(without an actual section xref).  It should say somewhere that
the a element's href attribute contains a reference (not a URL)
and that the a element's href DOM property is the output of
transcoding and resolving that reference to absolute URL
form (including a fragment, if any).

Instead, it says in

  http://htmlwg.org/cr/html/the-a-element.html#the-a-element

  The IDL attributes href, target, rel, media, hreflang, and type,
  must reflect the respective content attributes of the same name.

which is not consistent with how the href DOM attribute is
implemented in Firefox and Chrome (other UAs not yet tested).

Examples of how embedded whitespace is treated can be seen
by looking at the href DOM property's result of references
like:

<PRE>
<a href=" g "> g (leading and trailing)</a>    = http://a/b/c/g

<a href="g o">g o (embedded)</a>               = http://a/b/c/g%20o

<a href="g
o">g o (embedded linefeed)</a>                 = http://a/b/c/go

<a href="g 
 o">g o (embedded space linefeed space)</a>    = http://a/b/c/g%20o

<a href="g
 o">g o (embedded linefeed and space)</a>      = http://a/b/c/g%20o

<a href="g
  o">g o (embedded linefeed and 2 spaces)</a>  = http://a/b/c/g%20%20o

<a href="g	o">g o (embedded tab)</a>      = http://a/b/c/go

<a href="g
	o">g o (embedded linefeed and tab)</a> = http://a/b/c/go

<a href="g
	 o">g o (embedded linefeed space tab)</a> = http://a/b/c/g%20o

<a href="g 
	o">g o (embedded space linefeed tab)</a> = http://a/b/c/g%20o
</PRE>

In other words, this would suggest that linefeeds and HTAB
characters are ignored, along with leading and trailing SP
characters, but each embedded SP is replaced with a %20.
This was tested on Chrome, so other UAs might differ,
and I have only been testing <a href>, not the other contexts
where references are used in HTML.
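The rule as summarised above can be sketched in Python as follows.
This models only the summary, not Chrome's actual implementation.
The function names are hypothetical, the order of the steps is an
assumption, and the base URL is assumed to be the one used in the
RFC3986 resolution examples.

```python
from urllib.parse import urljoin

# Assumed base URL, borrowed from the RFC 3986 resolution examples.
BASE = "http://a/b/c/d;p?q"


def preprocess_reference(ref):
    """Apply the whitespace rule summarised above to an href value."""
    # Linefeeds and HTAB characters are ignored.
    ref = ref.replace("\n", "").replace("\t", "")
    # Leading and trailing SP characters are ignored.
    ref = ref.strip(" ")
    # Each remaining (embedded) SP is replaced with %20.
    return ref.replace(" ", "%20")


def resolve(ref, base=BASE):
    """Preprocess the reference, then resolve it against the base."""
    return urljoin(base, preprocess_reference(ref))


if __name__ == "__main__":
    for ref in [" g ", "g o", "g\no", "g\n  o", "g\to", "g \n\to"]:
        print(repr(ref), "=", resolve(ref))
```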


RFC3986 does not define a single standard for converting an
arbitrary string reference into a standard URL.  It does not
do so because those rules have (in the past) differed
based on context, such as the differences in algorithms for
<a href>, <form>, <img src>, and the Location dialog/bar on
GUI-based browsers.  There is even less commonality among reference
algorithms across different data formats (RFC3986 defines URLs
for the entire Internet, not just HTML). It has been assumed
that individual data formats, like HTML, will define their own
algorithms for converting reference strings, using something
like the regular expression in the appendix to split the
arbitrary string into the syntax components.

That conversion algorithm has to take into account subjects that
3986 doesn't even attempt to address, like the document character
encoding scheme, surrounding and embedded whitespace, and how to
compose a query component.  Defining those things in HTML5 is not
a willful violation of RFC3986, for the same reason that converting
HTML character entities before processing them isn't a violation;
it is simply preprocessing the supplied data in order to form the
URL.

What would be a violation of RFC3986 is if a UA were to send an
arbitrary reference string, without pct-encoding the invalid
characters and resolving it relative to the base, in a protocol
element that expects a valid URL (e.g., the request target of
an HTTP request).

A better algorithm for Resolving References would accurately
describe how embedded whitespace is stripped or replaced with
a single pct-encoded space, components split using a regular
expression (as in RFC3986), and non-URL characters processed in
a component-specific way, to produce the URL that is used for
fetching and for the URL decomposition IDL attributes.
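A rough sketch of such a pipeline in Python, under several
assumptions of mine: the name web_reference_to_url is
hypothetical, the path and fragment are encoded as UTF-8 while
the query uses the document charset, the authority is passed
through untouched, and "%" is left alone so that existing
pct-encoded sequences are not double-encoded (which also means a
stray "%" is not repaired).

```python
import re
from urllib.parse import quote, urljoin

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# sub-delims, ":" and "@" from RFC 3986, plus "%" to avoid double-encoding.
_PCHAR_SAFE = "%:@!$&'()*+,;="


def web_reference_to_url(ref, base, charset="utf-8"):
    """Turn an arbitrary reference string into an absolute URL."""
    # 1. Whitespace: drop LF and HTAB, trim surrounding SP.
    ref = ref.replace("\n", "").replace("\t", "").strip(" ")

    # 2. Split into components using the RFC 3986 regular expression.
    m = URI_RE.match(ref)
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)

    # 3. Component-specific pct-encoding of non-URL characters.
    path = quote(path, safe="/" + _PCHAR_SAFE, encoding="utf-8")
    if query is not None:
        # Assumption: the query is encoded in the document charset.
        query = quote(query, safe="/?" + _PCHAR_SAFE, encoding=charset)
    if fragment is not None:
        fragment = quote(fragment, safe="/?" + _PCHAR_SAFE, encoding="utf-8")

    # 4. Recompose, then resolve against the base.
    out = ""
    if scheme is not None:
        out += scheme + ":"
    if authority is not None:
        out += "//" + authority
    out += path
    if query is not None:
        out += "?" + query
    if fragment is not None:
        out += "#" + fragment
    return urljoin(base, out)
```

The returned string is the only thing that would be called a URL
here; the input remains a reference throughout.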

Instead, the specification takes on a bizarre "Us vs The Man"
attitude about 3986 (a standard for protocol elements SENT),
redefines URL as an INPUT reference, converts that "URL" in
place to absolute and pct-encoded form and calls that result
the "URL", and then makes requests on the "URL" for the URL,
sometimes in the same sentence (e.g., "When a URL is to be
fetched, the URL identifies a resource to be obtained.").

I don't care if the WG insists on using the acronym URL instead
of URI -- they are defined to be equivalent in 3986.  I do
care that the HTML5 spec is defining the input to its
preprocessing as a URL and the output of its preprocessing
as a URL, since that is both confusing and inaccurate.

In my opinion, these inconsistencies should be fixed before
HTML5 is advanced to CR.  These sections cause more damage
than the benefit gained over simply referencing RFC3986.
I am aware of Anne's work -- it does not seem intended to
fix any of these inconsistencies and is not currently on
track for HTML5.

If not fixed, then these sections should be removed from HTML5
and replaced with forward-looking definitions to be supplied by
later extension specs.  For example, define a "Web reference"
as an arbitrary string that is to be transformed into an
absolute URL reference, from which the IDL attribute values
are obtained and the activated actions are targeted.  RFC3986's
algorithms are sufficient to define the components that
make up the IDL attributes.

If the WG decides to advance the HTML5 specification to CR
without fixing these errors and inconsistencies, then please
consider this a formal objection.


Cheers,

Roy T. Fielding                 <http://roy.gbiv.com/>
Sr. Principal Scientist, Adobe  <http://adobe.com/>

Received on Sunday, 25 November 2012 23:18:39 UTC