Re: [CSS21] last edition: pity from Bert Bos on 2009-05-07 (www-style@w3.org from May 2009)

From: Bert Bos <bert@w3.org>
Date: Thu, 7 May 2009 21:14:03 +0200
To: www-style@w3.org
Message-Id: <200905072114.04612.bert@w3.org>
(Hello Andrey, Your tone doesn't really inspire a response, but I know
that exclamation marks often mask a lack of knowledge of a foreign
language.)

It's quite possible that we've made mistakes; that's why we're asking
for comments. People raised issues on the previous version and we tried
to solve them.

There is no intention to change the syntax. There were people who asked 
for more rules for handling non-CSS input and although it is almost 
impossible to give rules without dictating a particular parsing 
algorithm, we gave it a try anyway. Those rules are new, but they only 
deal with input that isn't CSS, will never be CSS and for which there 
previously were no rules. If we inadvertently changed something in how 
*valid* CSS is handled, that's a bug.

The same is true, mutatis mutandis, for bolder/lighter. Maybe it looked 
like it was well-defined before, but in reality it wasn't. At least we 
received issues on it. It's possible that there are better ways to 
solve those issues...


On Tuesday 05 May 2009, Andrey Mikhalev wrote:
> 1. Appendix G. Grammar of CSS 2.1,
>     G.2 Lexical scanner,
>     following production was removed:
>       {s}+\/\*[^*]*\*+([^/*][^*]*\*+)*\/ {unput(' '); /*replace by
> space*/}
>
>     production essential for selector parsing: without it selectors
> like 'A /**/>B'
>     became invalid (token sequence is "ident,s,greater,ident" instead
> of "ident,greater,ident")

This indeed looks like a mistake. But the unput() wasn't correct
either, because it allowed 'A/**/B' without any combinator.

The change was made in response to an issue[1] that was raised on this 
mailing list. When we discussed it, we noticed that the grammar in 
appendix G and in the Selectors module were different. We thought the 
latter looked better and copied it. It seems now that the Selectors 
module wasn't correct either.

Here is my new attempt, in the form of a "unified diff" (i.e., lines
that start with "-" are to be removed, lines with "+" are added, and
lines with a space are unchanged).

There were also errors in the resolution of issue 104[2]. The first of
the changes is meant to fix that.

(Yves probably won't like this grammar, because it is again not LL(1),
although I believe it is LALR(1).)

----------------------------------------------------------------------
 stylesheet
   : [ CHARSET_SYM STRING ';' ]?
-    [S|CDO|CDC]* [ import [ [CDO|CDC] [S|CDO|CDC] ]* ]*
-    [ [ ruleset | media | page ] [ [CDO|CDC] [S|CDO|CDC] ]* ]*
+    [S|CDO|CDC]* [ import [ CDO S* | CDC S* ]* ]*
+    [ [ ruleset | media | page ] [ CDO S* | CDC S* ]* ]*
   ;
 import
   : IMPORT_SYM S*
-    [STRING|URI] S* [ medium [ COMMA S* medium]* ]? ';' S*
+    [STRING|URI] S* [ medium [ ',' S* medium]* ]? ';' S*
   ;
 media
-  : MEDIA_SYM S* medium [ COMMA S* medium ]* LBRACE S* ruleset* '}' S*
+  : MEDIA_SYM S* medium [ ',' S* medium ]* '{' S* ruleset* '}' S*
   ;
 medium
   : IDENT S*
   ;
 page
   : PAGE_SYM S* pseudo_page?
-    LBRACE S* declaration? [ ';' S* declaration? ]* '}' S*
+    '{' S* declaration? [ ';' S* declaration? ]* '}' S*
   ;
 pseudo_page
   : ':' IDENT S*
   ;
 operator
-  : '/' S* | COMMA S*
+  : '/' S* | ',' S*
   ;
 combinator
-  : PLUS S*
-  | GREATER S*
+  : S* '+' S*
+  | S* '>' S*
   | S+
   ;
 unary_operator
-  : '-' | PLUS
+  : '-' | '+'
   ;
 property
   : IDENT S*
   ;
 ruleset
-  : selector [ COMMA S* selector ]*
-    LBRACE S* declaration? [ ';' S* declaration? ]* '}' S*
+  : selector [ ',' S* selector ]*
+    '{' S* declaration? [ ';' S* declaration? ]* '}' S*
   ;
 selector
-  : simple_selector [ combinator simple_selector ]*
+  : simple_selector [ combinator simple_selector ]* S*
   ;
 simple_selector
   : element_name [ HASH | class | attrib | pseudo ]*
   | [ HASH | class | attrib | pseudo ]+
   ;
 class
   : '.' IDENT
   ;
 element_name
   : IDENT | '*'
   ;
 attrib
   : '[' S* IDENT S* [ [ '=' | INCLUDES | DASHMATCH ] S*
     [ IDENT | STRING ] S* ]? ']'
   ;
 pseudo
   : ':' [ IDENT | FUNCTION S* [IDENT S*]? ')' ]
   ;
 declaration
   : property ':' S* expr prio?
   ;
 prio
   : IMPORTANT_SYM S*
   ;
 expr
   : term [ operator? term ]*
   ;
 term
   : unary_operator?
     [ NUMBER S* | PERCENTAGE S* | LENGTH S* | EMS S* | EXS S* | ANGLE S* |
       TIME S* | FREQ S* ]
   | STRING S* | IDENT S* | URI S* | hexcolor | function
   ;
 function
   : FUNCTION S* expr ')' S*
   ;
 /*
  * There is a constraint on the color that it must
  * have either 3 or 6 hex-digits (i.e., [0-9a-fA-F])
  * after the "#"; e.g., "#000" is OK, but "#abcd" is not.
  */
 hexcolor
   : HASH S*
   ;
----------------------------------------------------------------------

And some lines can be removed from the tokenizer:

----------------------------------------------------------------------
 %option case-insensitive

 h               [0-9a-f]
 nonascii        [\200-\377]
 unicode         \\{h}{1,6}(\r\n|[ \t\r\n\f])?
 escape          {unicode}|\\[^\r\n\f0-9a-f]
 nmstart         [_a-z]|{nonascii}|{escape}
 nmchar          [_a-z0-9-]|{nonascii}|{escape}
 string1         \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
 string2         \'([^\n\r\f\\']|\\{nl}|{escape})*\'
 invalid1        \"([^\n\r\f\\"]|\\{nl}|{escape})*
 invalid2        \'([^\n\r\f\\']|\\{nl}|{escape})*

 comment         \/\*[^*]*\*+([^/*][^*]*\*+)*\/
 ident           -?{nmstart}{nmchar}*
 name            {nmchar}+
 num             [0-9]+|[0-9]*"."[0-9]+
 string          {string1}|{string2}
 invalid         {invalid1}|{invalid2}
 url             ([!#$%&*-~]|{nonascii}|{escape})*
 s               [ \t\r\n\f]+
 w               {s}?
 nl              \n|\r\n|\r|\f

 A               a|\\0{0,4}(41|61)(\r\n|[ \t\r\n\f])?
 C               c|\\0{0,4}(43|63)(\r\n|[ \t\r\n\f])?
 D               d|\\0{0,4}(44|64)(\r\n|[ \t\r\n\f])?
 E               e|\\0{0,4}(45|65)(\r\n|[ \t\r\n\f])?
 G               g|\\0{0,4}(47|67)(\r\n|[ \t\r\n\f])?|\\g
 H               h|\\0{0,4}(48|68)(\r\n|[ \t\r\n\f])?|\\h
 I               i|\\0{0,4}(49|69)(\r\n|[ \t\r\n\f])?|\\i
 K               k|\\0{0,4}(4b|6b)(\r\n|[ \t\r\n\f])?|\\k
 L               l|\\0{0,4}(4c|6c)(\r\n|[ \t\r\n\f])?|\\l
 M               m|\\0{0,4}(4d|6d)(\r\n|[ \t\r\n\f])?|\\m
 N               n|\\0{0,4}(4e|6e)(\r\n|[ \t\r\n\f])?|\\n
 O               o|\\0{0,4}(4f|6f)(\r\n|[ \t\r\n\f])?|\\o
 P               p|\\0{0,4}(50|70)(\r\n|[ \t\r\n\f])?|\\p
 R               r|\\0{0,4}(52|72)(\r\n|[ \t\r\n\f])?|\\r
 S               s|\\0{0,4}(53|73)(\r\n|[ \t\r\n\f])?|\\s
 T               t|\\0{0,4}(54|74)(\r\n|[ \t\r\n\f])?|\\t
 U               u|\\0{0,4}(55|75)(\r\n|[ \t\r\n\f])?|\\u
 X               x|\\0{0,4}(58|78)(\r\n|[ \t\r\n\f])?|\\x
 Z               z|\\0{0,4}(5a|7a)(\r\n|[ \t\r\n\f])?|\\z

 %%

 {s}                     {return S;}

 \/\*[^*]*\*+([^/*][^*]*\*+)*\/          /* ignore comments */

 "<!--"          {return CDO;}
 "-->"                   {return CDC;}
 "~="                    {return INCLUDES;}
 "|="                    {return DASHMATCH;}

-{w}"{"                  {return LBRACE;}
-{w}"+"                  {return PLUS;}
-{w}">"                  {return GREATER;}
-{w}","                  {return COMMA;}
-
 {string}                {return STRING;}
 {invalid}               {return INVALID; /* unclosed string */}

 {ident}                 {return IDENT;}

 "#"{name}               {return HASH;}

 @{I}{M}{P}{O}{R}{T}     {return IMPORT_SYM;}
 @{P}{A}{G}{E}           {return PAGE_SYM;}
 @{M}{E}{D}{I}{A}        {return MEDIA_SYM;}
 "@charset "             {return CHARSET_SYM;}

 "!"({w}|{comment})*{I}{M}{P}{O}{R}{T}{A}{N}{T}  {return IMPORTANT_SYM;}

 {num}{E}{M}             {return EMS;}
 {num}{E}{X}             {return EXS;}
 {num}{P}{X}             {return LENGTH;}
 {num}{C}{M}             {return LENGTH;}
 {num}{M}{M}             {return LENGTH;}
 {num}{I}{N}             {return LENGTH;}
 {num}{P}{T}             {return LENGTH;}
 {num}{P}{C}             {return LENGTH;}
 {num}{D}{E}{G}          {return ANGLE;}
 {num}{R}{A}{D}          {return ANGLE;}
 {num}{G}{R}{A}{D}       {return ANGLE;}
 {num}{M}{S}             {return TIME;}
 {num}{S}                {return TIME;}
 {num}{H}{Z}             {return FREQ;}
 {num}{K}{H}{Z}          {return FREQ;}
 {num}{ident}            {return DIMENSION;}

 {num}%                  {return PERCENTAGE;}
 {num}                   {return NUMBER;}

 {U}{R}{L}"("{w}{string}{w}")"   {return URI;}
 {U}{R}{L}"("{w}{url}{w}")"      {return URI;}

 {ident}"("              {return FUNCTION;}

 .                       {return *yytext;}
----------------------------------------------------------------------
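
To illustrate how the revised combinator rule is meant to handle
Andrey's selector 'A /**/>B', here is a minimal Python sketch. It is
not the flex/yacc grammar above: it assumes a much reduced token set
(IDENT, S, comments, single characters), treats a simple_selector as
a bare IDENT, and the names tokenize() and parse_selector() are
made up for this illustration.

```python
import re

# Reduced scanner (assumption: only what a plain selector needs).
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)
  | (?P<S>[ \t\r\n\f]+)
  | (?P<IDENT>-?[_a-zA-Z][_a-zA-Z0-9-]*)
  | (?P<CHAR>.)
""", re.VERBOSE | re.DOTALL)


def tokenize(text):
    """Return token names; comments are dropped, single chars stand for themselves."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "COMMENT":
            continue  # ignore comments, as the scanner does
        tokens.append(m.group() if kind == "CHAR" else kind)
    return tokens


def parse_selector(tokens):
    """Match: IDENT [ combinator IDENT ]* S*
    with combinator = S* '+' S* | S* '>' S* | S+.
    Return the reduced sequence, or None if the tokens do not match."""
    i, n = 0, len(tokens)
    if n == 0 or tokens[0] != "IDENT":
        return None
    result = ["IDENT"]
    i = 1
    while i < n:
        start = i
        while i < n and tokens[i] == "S":
            i += 1
        if i < n and tokens[i] in ("+", ">"):
            comb = tokens[i]
            i += 1
            while i < n and tokens[i] == "S":
                i += 1
        elif i > start:
            comb = " "  # descendant combinator (S+)
        else:
            return None  # two simple selectors with nothing in between
        if i >= n:
            # Trailing white space is covered by the final S*;
            # a dangling '+' or '>' is an error.
            return result if comb == " " else None
        if tokens[i] != "IDENT":
            return None
        result += [comb, "IDENT"]
        i += 1
    return result


print(tokenize("A /**/>B"))                  # ['IDENT', 'S', '>', 'IDENT']
print(parse_selector(tokenize("A /**/>B")))  # ['IDENT', '>', 'IDENT']
print(parse_selector(tokenize("A/**/B")))    # None
```

With white space handled in the combinator production rather than in
the scanner, the S token before '>' no longer makes the selector
invalid, and 'A/**/B' is still rejected for lack of a combinator.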


[1] http://wiki.csswg.org/spec/css2.1#issue-5
[2] http://wiki.csswg.org/spec/css2.1#issue-104


>
> 2. 4 Syntax and basic data types,
>     4.2 Rules for handling parsing errors,
>     Invalid at-keywords:
>       User agents must ignore an invalid at-keyword together with
>       everything following it, up to and including ...
>     following sentence added:
>       the end of the block (}) that contains the invalid at-keyword
>
>     what you are talking about? _ignore_ _end of the block_?!!
>
>     @media x { /*...*/ @invalid } /*... style here belongs to what?*/

I think we meant "up to the end of the block (}) that contains the
invalid at-keyword" rather than "up to and including." The intention
was precisely to make it clear that the "}" should *not* be ignored.

Maybe change from

    User agents must ignore an invalid at-keyword together with
    everything following it, up to and including the next
    semicolon (;), the next block ({...}), or the end of the
    block (}) that contains the invalid at-keyword, whichever
    comes first.

to

    User agents must ignore an invalid at-keyword together with
    everything following it, up to the end of the block that
    contains the invalid at-keyword, or up to and including the
    next semicolon (;) or up to and including the next
    block ({...}), whichever comes first.

Or, more verbosely:

    User agents must ignore an invalid at-keyword together with
    everything following it, up to and including the next
    semicolon (;) or the next block ({...}), whichever
    comes first. If the invalid at-keyword occurs inside a block,
    and there is no semicolon or block between the at-keyword and
    the end of that block, then everything from the at-keyword up
    to the end of the block is ignored.
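
The intended behaviour can be sketched in a few lines of Python. This
is only an illustration under simplifying assumptions: it works on
raw characters, and it does not handle strings, comments, escapes or
(), [] pairs, which a real parser must. The function name
end_of_ignored() is hypothetical.

```python
def end_of_ignored(css, start):
    """Return the index just past the text that is ignored for an
    invalid at-keyword beginning at css[start]."""
    depth = 0
    i = start
    while i < len(css):
        ch = css[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                # End of the block that contains the at-keyword:
                # stop before the '}', which is NOT ignored.
                return i
            depth -= 1
            if depth == 0:
                # End of the next block: the whole {...} is ignored.
                return i + 1
        elif ch == ";" and depth == 0:
            # Next semicolon: ignored together with what precedes it.
            return i + 1
        i += 1
    return len(css)


css = "@media x { p{} @invalid } h1 {color: green}"
start = css.index("@invalid")
end = end_of_ignored(css, start)
print(css[:start] + css[end:])  # @media x { p{} } h1 {color: green}
```

In this example the "}" that closes the @media block survives, so
the h1 rule that follows is outside the @media block, as intended.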

>
> 3. 4 Syntax and basic data types,
>     4.2 Rules for handling parsing errors,
>     following paragraph added:
>       Malformed statements.
>       User agents must handle unexpected tokens encountered while
> parsing a statement by reading until the end of the statement, while
> observing the rules for matching pairs of (), [], {}, "", and '', and
> correctly handling escapes.
>       ...
>
>     most evil idea, violate nearly everything in chapter 4, starting
>     from formal core syntax.
>     in short:
>       'unexpected token' in 'statement' cannot occur - since
> 'statement' is not a checkpoint (not a Real Thing, precisely).
>       handling of parsing errors differs for selectors / declarations
> / at-rules.
>       paragraph above redundant and introduce conflict with them.

The intention is that the rule for malformed declarations takes
precedence over that for malformed statements, as it comes
first in the spec. Thus an unexpected token in a declaration
causes just the declaration to be ignored, not the whole statement.

The new rule about malformed statements is a generalization of
that in 4.1.7 about errors in selectors: not only does an error in a
selector cause a statement to be ignored, but so does an error that
occurs after an at-keyword, e.g.:

    @media @error {...}

In fact, although it may not be very clear from the text (which is
kept as short as possible), it is hopefully clear from the examples:
if an unexpected token occurs anywhere a statement *could* occur,
then that token is ignored together with the next statement. E.g.,
the whole first line is ignored in this:

   } h2 {color: orange}
   h1 {color: green}
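
A rough Python sketch of that reading follows. It is an assumption-
laden illustration, not the spec's algorithm: it ignores strings,
comments, escapes and (), [] pairs, and is_malformed() is a crude
stand-in for "an unexpected token was seen" that only looks for a
stray "}" or a second "@" before the first "{". Both function names
are invented here.

```python
def split_statements(css):
    """Split top-level text into statements. A statement ends at a
    top-level ';' or at the '}' that closes its first block. A stray
    '}' at top level does not end anything; it is just read along."""
    statements, depth, start = [], 0, 0
    for i, ch in enumerate(css):
        if ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                statements.append(css[start:i + 1].strip())
                start = i + 1
        elif ch == ";" and depth == 0:
            statements.append(css[start:i + 1].strip())
            start = i + 1
    tail = css[start:].strip()
    if tail:
        statements.append(tail)
    return statements


def is_malformed(statement):
    """Crude check (assumption): a '}' or a non-initial '@' before
    the first '{' counts as an unexpected token."""
    prelude = statement.split("{", 1)[0]
    return "}" in prelude or "@" in prelude[1:]


css = "} h2 {color: orange}\nh1 {color: green}"
kept = [s for s in split_statements(css) if not is_malformed(s)]
print(kept)  # ['h1 {color: green}']
```

The stray "}" is read as part of the statement that ends with the
h2 block, so that whole statement is dropped and only the h1 rule
remains.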



> 4. 15 Fonts,
>     15.6 Font boldness : the 'font-weight' property:
>       'bolder' selects the next weight that is assigned to a font
> that is darker than the inherited one.
>     following sentence removed:
>       If there is no such weight, it simply results in the next
> darker numerical value (and the font remains unchanged), unless the
> inherited value was '900' in which case the resulting weight is also
> '900'.
>     [similar in 'lighter']
>     following paragraph added:
>       Note: A set of nested elements that mix 'bolder' and 'lighter'
> will give unpredictable results depending on the UA, OS, and font
> availability. This behavior will be more precisely defined in CSS3.
>
>     - changing _defined_ behaviour to _undefined_ is not an
> improvement. - css3 reference nonsence.
>       (imo: if someone tries to turn css2 specification into
>       'css3 todo list' - shoot, don't talk)
>     - the weight metric is independent from font[family].
>       as value of independent metric, 'bolder' SHOULD result next
> numerical value.
>       futher - as a hack for non-perfect world - it MAY (or MAY NOT)
>       yield to next available font's weight.
>       what was unclean here? why you killing primary objectives of
>       property/value, leaving only hack description?

Imagine four nested elements, from outside to inside they have

    font-weight: normal
    font-weight: bolder
    font-weight: bolder
    font-weight: lighter

The old spec said the computed value of the innermost is "one of the
legal number values combined with one or more of the relative
values (bolder or lighter)." But does that mean

    400 + bolder + bolder + lighter
or
    400 + bolder
or
    400 + 1 * lighter + 2 * bolder?

That makes a difference. Assume a font with weights 400 (normal)
and 900 (extra bold). A UA that does the first will end up at 400,
while a UA that does the second will choose 900.
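
The three readings can be compared with a small Python sketch. The
model is an assumption made for illustration: 'bolder' and 'lighter'
move to the next weight the font actually has and stay put if there
is none, and the function names are invented. "net" is my reading of
"400 + bolder" (one 'bolder' cancels the 'lighter').

```python
AVAILABLE = [400, 900]  # the font from the example: normal and extra bold


def bolder(weight):
    """Next available heavier weight, or unchanged if there is none."""
    heavier = [w for w in AVAILABLE if w > weight]
    return min(heavier) if heavier else weight


def lighter(weight):
    """Next available lighter weight, or unchanged if there is none."""
    thinner = [w for w in AVAILABLE if w < weight]
    return max(thinner) if thinner else weight


def sequential(start, steps):
    """Reading 1: apply each relative value in nesting order."""
    weight = start
    for step in steps:
        weight = bolder(weight) if step == "bolder" else lighter(weight)
    return weight


def net(start, steps):
    """Reading 2: cancel bolder/lighter pairs, then apply the rest."""
    count = steps.count("bolder") - steps.count("lighter")
    move = bolder if count > 0 else lighter
    weight = start
    for _ in range(abs(count)):
        weight = move(weight)
    return weight


def lighter_first(start, steps):
    """Reading 3: apply all 'lighter' steps, then all 'bolder' steps."""
    ordered = sorted(steps, key=lambda s: s != "lighter")
    return sequential(start, ordered)


steps = ["bolder", "bolder", "lighter"]
print(sequential(400, steps))     # 400
print(net(400, steps))            # 900
print(lighter_first(400, steps))  # 900
```

Under this model the first reading ends at 400 and the second at
900, which is the difference described above.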

The text about taking the next available weight or the next
numerical value if there is no next weight available dates from
the old CSS2 REC, and assumed that the computed value was a number.
But it didn't define what happened for elements with more than one
font family, so it's likely that taking the next numerical value
isn't actually a good idea.

Maybe we will find a good solution before we progress CSS 2.1 to
Recommendation. That's currently issue 111[3]. But maybe we won't,
and will leave the algorithm undefined in CSS 2.1.

[3] http://wiki.csswg.org/spec/css2.1#issue-111



Bert
-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/people/bos                               W3C/ERCIM
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France
Received on Thursday, 7 May 2009 19:14:43 UTC