Re: Proposed revision of CSS2.1 description of backslash escapes from Zack Weinberg on 2010-07-15 (www-style@w3.org from July 2010)

From: Zack Weinberg <zweinberg@mozilla.com>
Date: Thu, 15 Jul 2010 12:49:38 -0700
To: W3C Emailing list for WWW Style <www-style@w3.org>, L. David Baron <dbaron@dbaron.org>, fantasai <fantasai@inkedblade.net>
Message-ID: <20100715124938.0b111d83@moxana.local>
[ I never received dbaron's response to my proposal, so I'm replying to
the text at
http://lists.w3.org/Archives/Public/www-style/2010Jun/0658.html - sorry
for breaking threading ]

> In http://lists.w3.org/Archives/Public/www-style/2010Feb/0221.html ,
> Zack Weinberg wrote:
> >   <li><p>Backslash (\) characters are not significant inside
> >       <a href="#comments">comments</a>.  Elsewhere, they
> >       introduce <span class="index-def" title="backslash
> >       escapes"><a name="escaped-characters"><dfn>character
> >       escapes</dfn></a></span>.</p>
>
> As an introductory piece of text, I think this is hard to scan,
> since it puts the main point inside an "Elsewhere" clause.  I think
> it would be clearer written as (with the same links as above):
>   # Backslash (\) characters introduce character escapes, except
>   # inside of comments, where they are not significant.

Good point.  I'd be fine with that change.

> Your proposal includes the text "normal character" and later "normal
> punctuation character", which isn't a defined term.  I think you
> mean "tokenized as a single-character DELIM token", though there
> might be a better way to say that.

That's not quite what I mean -- inside a string, for instance, it
wouldn't be tokenized as a DELIM.  I don't immediately see a less
clunky way to put it, alas.

> The rewording of the rules on escapes for the zero codepoint removes
> the authoring conformance requirement in the old text "must not be
> zero".  I think this could be solved by replacing:
>   # If a hexadecimal escape would insert the character with code
>   # point U+0000, the behavior is undefined.
> with:
>   # Style sheets must not contain escapes that would insert the code
>   # point U+0000.  If a user-agent encounters such an escape, the
>   # behavior is undefined.

Ok, except that HTML5 now requires U+0000 to be converted to U+FFFD
very early in processing.  (I believe this is still a "parse error" in
HTML5 terms, i.e. an authoring conformance violation, but whatwg.org
is down right now and I can't find the parser algorithm on the W3C
site.)  So it is tempting to make CSS2.1 match:

  # Style sheets must not contain escapes that would insert the code
  # point U+0000.  If a user-agent encounters such an escape, it is to
  # insert the REPLACEMENT CHARACTER, U+FFFD, instead.

and, perhaps as an additional bullet point to the list of "The
following rules always hold":

  # Style sheets must not contain the character with code point U+0000,
  # or characters in the range U+D800--U+DFFF (except as properly
  # encoded UTF-16 surrogate pairs).  If a user-agent encounters any of
  # these characters, it is to behave as if it had encountered the
  # REPLACEMENT CHARACTER, U+FFFD, instead.

(I also brought this up in
http://lists.w3.org/Archives/Public/www-style/2010Jun/0109.html .)
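
To make the proposed replacement rule concrete, here is a minimal Python
sketch.  It is an illustration only, not text from any spec or any
browser's code.  The names REPLACEMENT and replace_forbidden are
hypothetical.  It assumes the style sheet has already been decoded to
code points, so any surrogate still present is unpaired, and it takes
the surrogate range to be the full U+D800 through U+DFFF.

```python
REPLACEMENT = "\ufffd"  # REPLACEMENT CHARACTER, U+FFFD


def replace_forbidden(text: str) -> str:
    """Map U+0000 and unpaired surrogates to U+FFFD (hypothetical helper).

    Assumption: `text` is already decoded, so a surrogate code point
    in it cannot be half of a well-formed UTF-16 pair.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0 or 0xD800 <= cp <= 0xDFFF:
            out.append(REPLACEMENT)
        else:
            out.append(ch)
    return "".join(out)
```

Doing this once, before tokenizing, would mean the tokenizer never has
to special-case these code points in raw input; escapes that produce
U+0000 would still need the separate rule above.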

> Your proposal also erroneously drops this part of the current text:
>   # In this case, user agents should treat a "CR/LF" pair
>   # (U+000D/U+000A) as a single white space character.

Good catch.  That was unintentional.

> It would probably also be good to reincorporate these pieces of the
> current text:
>   # Note that this means that a "real" space after the escape sequence
>   # must itself either be escaped or doubled.

Hang on, that's not quite right.  a\26 \ x is the same identifier as
a\26\20x, whereas a\26  x is two identifiers, [a\26] [x], yes?  So it
shouldn't say "escaped".  But apart from that, I would be fine with
putting the note back.
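
The three cases above can be checked with a small Python sketch of
identifier scanning.  This is a simplification under stated assumptions,
not the CSS2.1 grammar: it ignores the restrictions on how an identifier
may start, it maps an escaped U+0000 or an out-of-range code point to
U+FFFD (an assumption; CSS2.1 leaves the zero case undefined), and the
names read_ident and idents are hypothetical.

```python
HEX = "0123456789abcdefABCDEF"
WS = " \t\r\n\f"


def read_ident(src: str, pos: int = 0):
    """Scan identifier characters from pos; return (text, new_pos)."""
    out = []
    n = len(src)
    while pos < n:
        ch = src[pos]
        if ch == "\\" and pos + 1 < n and src[pos + 1] not in "\r\n\f":
            pos += 1
            if src[pos] in HEX:
                # Hexadecimal escape: one to six hex digits.
                start = pos
                while pos < n and pos - start < 6 and src[pos] in HEX:
                    pos += 1
                cp = int(src[start:pos], 16)
                # One trailing white space character is swallowed;
                # a CR/LF pair counts as a single one.
                if src.startswith("\r\n", pos):
                    pos += 2
                elif pos < n and src[pos] in WS:
                    pos += 1
                # Assumption: zero and out-of-range values become U+FFFD.
                out.append(chr(cp) if 0 < cp <= 0x10FFFF else "\ufffd")
            else:
                # Any other escaped character stands for itself.
                out.append(src[pos])
                pos += 1
        elif (ch.isascii() and (ch.isalnum() or ch in "-_")) or ord(ch) >= 0x80:
            out.append(ch)
            pos += 1
        else:
            break
    return "".join(out), pos


def idents(src: str):
    """Return the identifiers found in src, skipping everything else."""
    result = []
    pos = 0
    while pos < len(src):
        ident, new_pos = read_ident(src, pos)
        if new_pos == pos:
            pos += 1  # white space or other punctuation; skipped here
        else:
            result.append(ident)
            pos = new_pos
    return result
```

With these definitions, idents(r"a\26 \ x") and idents(r"a\26\20x") both
give the single identifier "a& x", while idents(r"a\26  x") gives the
two identifiers "a&" and "x".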

> I think it would also be beneficial in the introductory paragraph to
> point out that there are three types of escaping: causing a newline
> to be ignored, canceling the meaning of special characters, and
> inserting a character by codepoint.
>
> Otherwise the proposal seems fine, although:
>  * I suspect others will find further issues,
>  * I'm not sure such a big rewrite is really necessary, and
>  * it does have the usual problem, present throughout CSS 2.1, of
>    not specifying conformance requirements clearly using RFC2119
>    keywords, and not clearly distinguishing conformance requirements
>    on different parties (style sheets, user-agents, etc.).
> (I wonder whether it would be better to try to keep more of the
> current text as a statement of style sheet conformance and then
> write a separate statement of processor conformance.)

I'd be okay with a much smaller patch.  I didn't like my previous
attempts to just insert the new normative requirements without also
revising the whole section, but here's another go at it:

  * Replace "indicates three types of character escapes" with "may
    indicate one of three types of character escape.  Inside a CSS
    comment, a backslash has no special meaning, and if a backslash is
    immediately followed by the end of the style sheet, it also has no
    special meaning."

  * Append "Outside a string, a backslash followed by a newline has no
    special meaning." to the paragraph beginning "First, inside a
    string".

  * Delete "Except within CSS comments" from the paragraph beginning
    "Second, it cancels".

  * Delete ", where allowed," from the note at the bottom of the
    section.

  * Append this text to the first paragraph of the note at the bottom of
    the section: "When a backslash has 'no special meaning', it is
    tokenized like any other punctuation character without special
    meaning: as part of a comment, part of a string, or as a DELIM,
    based on the context."

  * Possibly change "must itself either be escaped or doubled" to "must
    be doubled", but this is a nitpick on a non-normative aside.

How does that sound?

zw
Received on Thursday, 15 July 2010 19:50:13 UTC