[css2.1] eliminating arbitrary back-up in lexical rules

The CSS 2.1 core lexical productions for COMMENT and URI tokens can
absorb an arbitrary amount of text and then fail to match if their
terminating punctuation is missing.  This requires a conforming
implementation to back up an arbitrary distance and restart, which can
be very difficult to implement.  As no syntactically correct document
contains unterminated COMMENTs or URIs, the extra code required is
pointless.
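
To make the failure mode concrete, here is a rough Python sketch using
simplified, ASCII-only stand-ins for the core COMMENT and unquoted-URI
productions ({nonascii}, {escape}, and quoted URIs are omitted for
brevity):

import re

# Simplified, ASCII-only stand-ins for the CSS 2.1 core macros.
W       = r'[ \t\r\n\f]*'
COMMENT = r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/'
URI     = r'url\(' + W + r'[!#$%&*-~]*' + W + r'\)'

print(re.match(COMMENT, '/* this comment never ends'))   # None
print(re.match(URI,     'url(foo.png'))                  # None

# Neither pattern matches these unterminated inputs, but a
# longest-match (flex-style) scanner built from them reads all the
# way to end of input before it can tell, and must then back up to
# the last, much shorter match (e.g. "/" as a DELIM, "url(" as a
# FUNCTION) and re-tokenize everything after it.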

Empirically, nobody seems to implement backing up for unterminated
comments, and browsers are not consistent about backing up for URIs
lacking the closing paren (it is not possible to test what happens for
quoted URIs lacking the closing quote, because the conformant parse
after backing up would simply absorb all the following text in an
INVALID token).  See the attached test case.

I would like to propose that the following additional INVALID
productions be added to the core tokenization rules to avoid this
awkward requirement for implementors.  With this change in place, it
is still necessary to back up more than one character in some cases
when CDO, CDC, or UNICODE-RANGE fails to match, but not to back up
over an arbitrary amount of text.

New macros:

invalid-comment1    \/\*[^*]*\*+([^/*][^*]*\*+)*
invalid-comment2    \/\*[^*]*(\*+[^/*][^*]*)*

invalid-url1    url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
invalid-url2    url\({w}{invalid}

Changed production:

INVALID    {invalid}|{invalid-comment1}|{invalid-comment2}
           |{invalid-url1}|{invalid-url2}
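
As a sanity check on the new macros, here is a rough Python sketch in
the same simplified, ASCII-only style (with {invalid-url2}, {nonascii},
and {escape} again left out), showing that they match the unterminated
text up to a well-defined stopping point, so a longest-match scanner
can simply return INVALID instead of backing up:

import re

W = r'[ \t\r\n\f]*'
INVALID_COMMENT1 = r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*'
INVALID_COMMENT2 = r'/\*[^*]*(?:\*+[^/*][^*]*)*'
INVALID_URL1     = r'url\(' + W + r'[!#$%&*-~]*' + W

for text in ('/* this comment never ends',
             '/* ends with stars ***',
             'url(foo.png'):
    m = (re.match(INVALID_COMMENT1, text)
         or re.match(INVALID_COMMENT2, text)
         or re.match(INVALID_URL1, text))
    print(repr(m.group(0)))   # each match covers the whole fragment

Because each of the invalid macros stops just short of the terminating
"/" or ")", a properly terminated comment or URI is strictly longer and
therefore still tokenizes as COMMENT or URI under longest-match, not as
INVALID.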

Make analogous changes to Appendix G.  For clarity, it might be
nice to rename the existing {invalid} macro to {invalid-string}, as
well.

zw

Received on Tuesday, 16 June 2009 19:46:52 UTC