This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 9965 - White space in attributes - please justify why multiple spaces/linebreaks are problematic - or delete this requirement
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML/XHTML Compat. Authoring Guide (ed: Eliot Graff)
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: FPWD
Assignee: Eliot Graff
QA Contact: HTML WG Bugzilla archive list
URL: http://dev.w3.org/html5/html-xhtml-au...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-06-21 01:58 UTC by Leif Halvard Silli
Modified: 2011-01-27 21:55 UTC

Description Leif Halvard Silli 2010-06-21 01:58:06 UTC
Currently the first paragraph of the Attributes section says:

]]
A polyglot document does not contain line breaks and multiple white space characters within attribute values. These are handled inconsistently by user agents.
[[

Firstly, I suggest being explicit about whether this is an XML-compatibility requirement or an HTML-compatibility requirement. Please use wording such as "For XML compatibility, this and that must be like so and so".

Secondly, is this a problem at all? At first I thought that this had to be an XML-related problem. But a quick internet search pointed me to Appendix C of XHTML 1.0. Hence, it seems to me that we are dealing with a requirement that was meant to make XML (XHTML 1.0) compatible with HTML.

However, I believe that this is no longer a problem. At least I have not heard about it as a text/html problem. So I think this requirement should be deleted.
Comment 1 Henri Sivonen 2010-06-21 08:32:27 UTC
(In reply to comment #0)
> These are handled inconsistently by user agents.
[...]
> However, I believe that this is no longer a problem. At least I have not
> heard about it as a text/html problem. So I think this requirement should be deleted.

The polyglot publication should not be based on beliefs about what user agents do. Doing that would just get us another Appendix C. It should be based on logical inferences from the normative statements in HTML5, XML, Namespaces in XML and the DOM.

The correct thing to say here is:
Tabs, line feeds and carriage returns in attribute values must be encoded as numeric character references. At least every second of consecutive spaces in attribute values must be encoded as numeric character references (so that there are no consecutive literal spaces in attribute values). This is because XML parsing doesn't preserve unescaped tabs, line feeds, carriage returns or consecutive spaces in attribute values.
Comment 2 Leif Halvard Silli 2010-06-21 11:22:11 UTC
FIRST: My use of "believe" was not meant to indicate that we should base the spec on belief; on the contrary, I asked for documentation.

SECOND: I will, once again, point out that the Appendix C claim was meant to solve a _text/html_ problem and *not* an XML problem. (If Appendix C _really_ meant this particular advice to solve an _XML_ problem, then its authors did not manage to express themselves precisely, which is of course a possibility.) In my THIRD point below, I give an example of what Appendix C talks about.

Appendix C: http://www.w3.org/TR/xhtml1/#C_5

      ]] 
C.5. Line Breaks within Attribute Values

Avoid line breaks and multiple white space characters within attribute values. These are handled inconsistently by user agents.
    [[

THIRD: What do you mean by ]]Tabs, line feeds and carriage returns in attribute values[[ ? Do you mean, e.g., class names with multiple spaces and line breaks between them? Is XML "against" doing this?

         class="   class1        class2
                         class3 "

as opposed to this:

         class="class1 class2 class3"
Comment 3 Henri Sivonen 2010-06-21 12:13:39 UTC
(In reply to comment #2)
> THIRD: What do you mean by ]]Tabs, line feeds and carriage returns in
> attribute values[[ ?

I mean the case where the author wants the *output* of the parser to have U+0009, U+000A or U+000D in an attribute value.

> Do you mean, e.g., class names with multiple spaces and
> line breaks between them? Is XML "against" doing this?
> 
>          class="   class1        class2
>                          class3 "
> 
> as opposed to this:
> 
>          class="class1 class2 class3"

Those parse to the same DOM in XML but to different DOMs in text/html. The first case is not polyglot, since there's an observable DOM difference between HTML and XML.
Comment 4 Leif Halvard Silli 2010-06-21 14:17:55 UTC
(In reply to comment #3)

> >          class="   class1        class2
> >                          class3 "
> > 
> > as [o]pposed to this: 
> > 
> >          class="class1 class2 class3"
> 
> Those parse to the same DOM in XML but to different DOMs in text/html. The

Did you *mean* "DOMs", in the plural, for 'text/html'? For what it is worth, there *do* seem to be differences in how text/html browsers treat white space:

IE8 in IE8/edge mode seems to handle the first example above more or less as XML parsers do (that is, if I can trust Live DOM Viewer).
Whereas IE8 in IE7 mode preserves all space characters but converts the line break to a space (again, if I can trust Live DOM Viewer). Trusting Live DOM Viewer here seems risky, though.

In Opera, Firefox and Safari, the DOM inspectors (Dragonfly, Firebug and Web Inspector, respectively) showed that line breaks were preserved in text/html mode.

> first case is not polyglot, since there's an observable DOM difference between
> HTML and XML.

I probably agree, when it comes to tabs, line feeds and carriage returns. As you said in Comment #1:

> The correct thing to say here is:
> Tabs, line feeds and carriage returns in attribute values must be encoded as
> numeric character references.

But I question what you said about consecutive spaces:

> At least every second of consecutive spaces in
> attribute values must be encoded as numeric character references (so that there
> are no consecutive literal spaces in attribute values). This is because XML
> parsing doesn't preserve unescaped tabs, line feeds, carriage returns or
> consecutive spaces in attribute values.

When I consulted Dragonfly, Firebug and Web Inspector, spaces *were* preserved in both XML and text/html. So what is this claim based on?
Comment 5 Leif Halvard Silli 2010-06-21 14:42:31 UTC
(In reply to comment #4)

> > At least every second of consecutive spaces in
> > attribute values must be encoded as numeric character references (so that there
> > are no consecutive literal spaces in attribute values). This is because XML
> > parsing doesn't preserve unescaped tabs, line feeds, carriage returns or
> > consecutive spaces in attribute values.
> 
> When I consulted Dragonfly, Firebug and Web Inspector, spaces *were* preserved
> in both XML and text/html. So what is this claim based on?

Just to be clear: _multiple_ consecutive spaces are preserved in XML, according to the DOM inspectors I have consulted.
Comment 6 Leif Halvard Silli 2010-06-21 15:49:26 UTC
(In reply to comment #5)
> (In reply to comment #4)

> Just to be clear: _multiple_ consecutive spaces are preserved in XML,
> according to the DOM inspectors I have consulted.

My conclusion: if there are problems with regard to multiple spaces, then they are related to differences between XML parsers. Thus, multiple spaces in _CDATA_ attributes are not a polyglot HTML/XML issue. The polyglot spec might still forbid them, but if so, it has to justify that by pointing to differences inside XML, and not to differences between XML and HTML. It is not, I think, the task of the Polyglot spec to solve all problems on the XML side.

See: http://www.usingxml.com/Basics/XmlSpace#AttributeWhiteSpaceHandling

By the way, it seems like XML Canonicalisation (C14N) is related to the Polyglot spec.

See: http://www.usingxml.com/Basics/XmlSpace#TreatingWhiteSpaceinaUniformWay
And: http://www.w3.org/TR/xml-c14n

For example, according to C14N:

* <li/> should be written as <li></li>
* only UTF-8 is valid
* class="   one     two" becomes class="one two"
   see http://www.w3.org/TR/xml-c14n#Example-Chars

If the Polyglot spec really should be more of an "XHTML/HTML with helmets" kind of spec (Sam used that idiom once), rather than a spec that identifies the common ground between XHTML5 and HTML5, then it would be a good idea to take C14N as the starting point.

C14N also lays out rules for how to write numeric character references; e.g. leading zeros are forbidden, so &#x0d; becomes &#xD;. Again, see http://www.w3.org/TR/xml-c14n#Example-Chars
Removing leading zeros in numeric character references *is* more IE compatible, for instance ...

I am not certain that I want the strict rules of C14N, though.
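
For illustration only, some of the C14N points above can be tried with Python's standard library. This is a sketch under stated assumptions: `xml.etree.ElementTree.canonicalize` (Python 3.8+) implements C14N 2.0 rather than the 1.0 document linked above, and the sample markup is made up. Without a DTD the class attribute is CDATA, so its consecutive spaces should be left alone; the collapsing shown in the C14N example depends on the attribute being declared as a tokenized type.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an empty-element tag, a character reference with a
# leading zero, and consecutive spaces in an undeclared (hence CDATA) attribute.
source = '<li title="&#x0d;" class="   one     two"/>'

# canonicalize() returns the canonical form as a string when no output
# file object is passed.
canonical = ET.canonicalize(source)
print(canonical)
```

The printed form should show the start-tag/end-tag pair instead of `<li/>` and the carriage return written as `&#xD;`, with the spaces in the class value unchanged.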
Comment 7 Henri Sivonen 2010-06-22 05:59:22 UTC
(In reply to comment #4)

> IE8 in IE8/edge mode seems to handle the first example above more

I think the polyglot publication should be a set of inferences from specs--not from the behavior of browsers that aren't HTML5-compliant.

The key thing is http://www.w3.org/TR/xml/#AVNormalize and that text/html parsing doesn't (and won't due to compat concerns) have an equivalent step.
Comment 8 Leif Halvard Silli 2010-06-22 06:10:25 UTC
(In reply to comment #7)

> The key thing is http://www.w3.org/TR/xml/#AVNormalize and that text/html
> parsing doesn't (and won't due to compat concerns) have an equivalent step.

]]
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
[[

That point in XML - http://www.w3.org/TR/xml/#AVNormalize - does _not_ say that multiple spaces (#x20) should be collapsed into a single space. And, like I said: the DOM inspectors of Firefox, Opera and Safari, _in XML mode_, also don't remove multiple consecutive occurrences of the space (#x20) character.

Are we in agreement about that?
Comment 9 Leif Halvard Silli 2010-06-22 06:19:02 UTC
(In reply to comment #8)

> And, like I
> said: the DOM inspectors of Firefox, Opera and Safari, _in XML mode_, also
> don't remove multiple consecutive occurrences of the space (#x20) character.

Correction: It _looks_ like they remove them. But if you click on the node, you get to see the uncollapsed value, and then you'll see that the multiple consecutive #x20 characters are kept, whereas #xD, #xA and #x9 have been converted to #x20.

Do we agree?
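
The behaviour described above for CDATA attributes can be reproduced outside a browser. The following is a minimal sketch, not a conformance test: it assumes Python's expat-based `xml.etree.ElementTree` as the XML parser and uses the standard library's `html.parser` as a stand-in for text/html parsing, even though `html.parser` is not a conforming HTML5 parser. The sample markup and the helper names (`AttrCollector`, `xml_value`, `html_value`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: a literal tab and line feed plus runs of spaces.
MARKUP = '<p class="  a \t b\nc "></p>'


class AttrCollector(HTMLParser):
    """Records the attributes of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(dict(attrs))


def xml_value(markup, name):
    # The XML parser applies attribute-value normalization while parsing.
    return ET.fromstring(markup).get(name)


def html_value(markup, name):
    # html.parser hands back the attribute value without normalizing it.
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen[0][name]


print(repr(xml_value(MARKUP, "class")))   # tab and line feed become spaces
print(repr(html_value(MARKUP, "class")))  # literal characters are kept
```

In both cases the runs of #x20 survive; only the tab and the line feed differ between the two results. Writing those characters as &#x9; and &#xA; instead should give the same value from both parsers.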
Comment 10 Henri Sivonen 2010-06-22 06:28:04 UTC
Oops. Sorry. Collapsing consecutive U+0020 characters applies only to non-CDATA attributes.
Comment 11 Leif Halvard Silli 2010-06-22 07:01:06 UTC
(In reply to comment #10)

You mean things like <ol start="   9    ">? Yes. But that is already forbidden in HTML5, so it is probably not a spec issue. Or are there non-CDATA attributes where it is an issue?
Comment 12 Henri Sivonen 2010-06-22 07:17:09 UTC
With <!DOCTYPE html>, all attributes are CDATA attributes.

However, if you use one of the permitted XHTML 1.0 doctypes and the XML processor processes external entities, some attributes aren't CDATA attributes. As for it being an issue, the document tree would be *different* which implies not polyglot.
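
A small sketch of how a declared attribute type changes the parsed value. It assumes Python's expat-based `xml.etree.ElementTree` and, instead of one of the XHTML 1.0 DTDs, a made-up internal DTD subset that declares id as type ID; the element and values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical internal subset standing in for an external DTD:
# it declares p/@id as a non-CDATA (ID) attribute.
DECLARED = (
    '<!DOCTYPE p [<!ATTLIST p id ID #IMPLIED>]>'
    '<p id="   intro   "/>'
)
# Same element without any declaration, so the attribute is CDATA.
UNDECLARED = '<p id="   intro   "/>'

declared_value = ET.fromstring(DECLARED).get("id")
undeclared_value = ET.fromstring(UNDECLARED).get("id")

print(repr(declared_value))    # leading/trailing spaces discarded
print(repr(undeclared_value))  # spaces kept
```

The two documents differ only in whether the parser has seen a declaration, yet they yield different attribute values, which is the kind of tree difference referred to above.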
Comment 13 Leif Halvard Silli 2010-06-22 08:48:22 UTC
(In reply to comment #12)

> As for it being an issue, the document tree would be *different* which implies
> not polyglot.

I do agree that this is a complicating factor. But any such difference should be possible to handle with the right authoring requirement, shouldn't it? Again, HTML5 forbids whitespace inside attributes such as @start. Or does what you say mean that not only the content but also the "color" of the @start attribute itself is different in a parser that reads external entities, compared with one that doesn't?
Comment 14 Eliot Graff 2010-06-24 20:35:51 UTC
(In reply to comment #1)
> (In reply to comment #0)
[...]
> The correct thing to say here is:
> Tabs, line feeds and carriage returns in attribute values must be encoded as
> numeric character references. At least every second of consecutive spaces in
> attribute values must be encoded as numeric character references (so that there
> are no consecutive literal spaces in attribute values). This is because XML
> parsing doesn't preserve unescaped tabs, line feeds, carriage returns or
> consecutive spaces in attribute values.

Henri, can you point to documentation for this? Thanks.
Comment 15 Leif Halvard Silli 2010-06-24 21:51:27 UTC
(In reply to comment #14)
> (In reply to comment #1)
> > (In reply to comment #0)
> [...]
> > The correct thing to say here is:
> > Tabs, line feeds and carriage returns in attribute values must be encoded as
> > numeric character references. At least every second of consecutive spaces in
> > attribute values must be encoded as numeric character references (so that there
> > are no consecutive literal spaces in attribute values). This is because XML
> > parsing doesn't preserve unescaped tabs, line feeds, carriage returns or
> > consecutive spaces in attribute values.
> 
> Henri, can you point to documentation for this? Thanks.

He quoted it already: http://www.w3.org/TR/xml/#AVNormalize

But he later admitted an error with regard to the collapsing of spaces:

--- Comment #10 From Henri Sivonen 2010-06-22 06:28:04 ---
Oops. Sorry. Collapsing consecutive U+0020 characters applies only to non-CDATA
attributes.

So Henri and I agree that http://www.w3.org/TR/xml/#AVNormalize requires XHTML parsers to normalize tabs, line feeds and carriage returns into space characters within CDATA attributes.
Comment 16 Eliot Graff 2010-09-27 21:42:37 UTC
The 27 September editor's draft contains the following changes:

Section 7 now reads:

**************************

7. Attributes
Because of attribute-value normalization in XML [XML10], polyglot markup does not contain tabs, line feeds, or carriage returns within CDATA attributes. 

Polyglot markup surrounds all attribute values with quotation marks. Attribute values may be surrounded either by single quotation marks or by double quotation marks.

**************************

I _believe_ this satisfies what, ultimately, this bug requested. Please let me know if I've omitted anything.

Thanks for your patience.
Comment 17 Leif Halvard Silli 2011-01-21 06:02:12 UTC
(In reply to comment #16)
> The 27 September editor's draft contains the following changes:
> 
> Section 7 now reads:
> 
> **************************
> 
> 7. Attributes
> Because of attribute-value normalization in XML [XML10], polyglot markup does
> not contain tabs, line feeds, or carriage returns within CDATA attributes. 
> 
> Polyglot markup surrounds all attribute values with quotation marks. Attribute
> values may be surrounded either by single quotation marks or by double
> quotation marks.
> 
> **************************
> 
> I _believe_ this satisfies what, ultimately, this bug requested. Please let me
> know if I've omitted anything.
> 
> Thanks for your patience.

Sorry for my delay. I think this should be more specific. Tabs, line-feeds and carriage-returns *can* be added, but if added, then they need to be escaped. See:

http://dev.w3.org/html5/spec/the-iframe-element.html#process-the-iframe-attributes

which says:

]] Due to restrictions of the XML syntax, in XML the U+003C LESS-THAN SIGN character (<) needs to be escaped as well. In order to prevent attribute-value normalization, some of XML's whitespace characters, specifically U+0009 CHARACTER TABULATION (HT), U+000A LINE FEED (LF), and U+000D CARRIAGE RETURN (CR), also need to be escaped. [XML] [[

The simplest correction of your text above would probably be to add the word "unescaped" to the first paragraph, like so:

]]
 Because of attribute-value normalization in XML [XML10], polyglot markup does
 not contain **unescaped** tabs, line feeds, or carriage returns within CDATA attributes. 
[[
Comment 18 Eliot Graff 2011-01-27 21:55:49 UTC
Hi Leif.

The current editor's draft (26 January) says this:

7. Attributes
 
Within an attribute's value, polyglot markup represents tabs, line feeds, and carriage returns as numeric character references rather than by using literal characters. For example, within an attribute's value, polyglot markup uses &#x9; for a tab rather than the literal character '\t'. This is because of attribute-value normalization in XML [XML10].
 
I believe this satisfies the request in comment 17. Please let me know if it does not or if there's anything else I can do.

Thanks so much.

Eliot
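
As an illustration of the rule quoted from the 26 January draft in comment 18, here is a minimal sketch of an escaping helper with a round-trip check. The function name `escape_attr_whitespace` and the sample value are hypothetical, and the helper assumes that `&`, `<` and quotation marks are escaped elsewhere.

```python
import xml.etree.ElementTree as ET

# Characters affected by XML attribute-value normalization, mapped to
# hexadecimal numeric character references.
_WHITESPACE_REFS = str.maketrans({"\t": "&#x9;", "\n": "&#xA;", "\r": "&#xD;"})


def escape_attr_whitespace(value: str) -> str:
    """Return value with tab, LF and CR written as character references.

    Assumes markup-significant characters (&, <, quotes) are handled elsewhere.
    """
    return value.translate(_WHITESPACE_REFS)


original = "line one\n\tline two"
markup = f'<p title="{escape_attr_whitespace(original)}"/>'

# Parse the result as XML and confirm the value survives normalization.
round_tripped = ET.fromstring(markup).get("title")
print(markup)
print(round_tripped == original)
```

Because the references are resolved after normalization has been applied to literal characters, the XML parser should return the tab and line feed intact.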