14360 – Count Unicode "combining marks" together with "inter-element whitespace"

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: Count Unicode "combining marks" together with "inter-element whitespace"

Status:	RESOLVED NEEDSINFO

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	contributor
QA Contact:	HTML WG Bugzilla archive list

URL:	http://dev.w3.org/html5/spec/content-...

Reported:	2011-10-03 03:29 UTC by Leif Halvard Silli
Modified:	2011-12-09 23:30 UTC
CC List:	6 users

Description Leif Halvard Silli 2011-10-03 03:29:11 UTC

SPEC SAYS:

]] As a general rule, elements whose content model allows **any flow content** should have either at least one descendant text node that is not inter-element whitespace, [[

PROPOSALS: 
  1)  After the last comma above, add roughly this text:
       "and that also isn't a Unicode combining mark".
  2)  Also, in a parenthesis or side note, state that if an isolated
       combining mark is needed, then one should, in line with
       Unicode 6.0, combine it with U+00A0 no-break space.
  3)  Allow conformance checkers to warn if a combining mark,
       with or without U+0020, is the sole text node of an element
       "whose content model allows any flow content", as well as
       when, regardless of whether the element allows any flow content,
       it combines with or is placed adjacent to U+0020.
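
A minimal sketch of how proposals 1) and 3) might look inside a conformance checker. The function names are hypothetical, and treating "combining mark" as Unicode General_Category Mn, Mc or Me is an assumption; the proposal itself does not define the term.

```python
import unicodedata

# HTML's "space characters": space, tab, LF, FF, CR.
HTML_SPACE_CHARACTERS = " \t\n\f\r"


def is_combining_mark(ch: str) -> bool:
    """True if ch has General_Category Mn, Mc or Me (assumed definition)."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def counts_as_content(text: str) -> bool:
    """Hypothetical rule for proposal 1): a text node counts only if some
    character is neither inter-element whitespace nor a combining mark."""
    return any(
        ch not in HTML_SPACE_CHARACTERS and not is_combining_mark(ch)
        for ch in text
    )


def has_space_guarded_mark(text: str) -> bool:
    """Hypothetical warning trigger for proposal 3): a combining mark
    placed directly after U+0020."""
    return any(
        prev == " " and is_combining_mark(ch)
        for prev, ch in zip(text, text[1:])
    )
```

Under this sketch a text node of U+0020 followed by U+0301 would not count as content, while U+00A0 followed by U+0301 would, because U+00A0 is not one of HTML's space characters.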

TEST CASE: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1167

PROBLEM DESCRIPTION: Bug 13502 resulted in a de facto permission to let text runs begin with combining marks. While this should perhaps not be completely forbidden, if an element "whose content model allows any flow content" contains nothing but (inter-element) space + combining mark (or even solely a combining mark), then there are several potential issues:

1)  White space collapsing means that the combining character doesn't really
     combine with the space character
2)  Combining marks that combine with nothing or with a space are hard to select with the mouse
3)  Visually, such marks may look as if they combine with something outside the element
     (See third paragraph in test case)
4)  When the first letter is a combining mark, then the CSS *:first-letter{} selector may
     seem, to authors, to not work

UNICODE ARGUMENTS: In bug 13502, comment number 4, it came up how to represent isolated combining marks. (http://www.w3.org/Bugs/Public/show_bug.cgi?id=13502#c4) However, the solution mentioned there, to use U+0020, is no longer the recommended method, due to the space character normalization rules of XML. Citing Unicode 6.0:

]]
7.9 Combining Marks
   [ snip ]
Marks as Spacing Characters. By convention, combining marks may be exhibited in (apparent) isolation by applying them to U+00A0 no-break space. This approach might be taken, for example, when referring to the diacritical mark itself as a mark, rather than using it in its normal way in text. Prior to Version 4.1 of the Unicode Standard, the standard also recommended the use of U+0020 space for display of isolated combining marks. This is no longer recommended, however, because of potential conflicts with the handling of sequences of U+0020 space characters in such contexts as XML.
[[
   [ For RTL scripts, it is slightly more complicated - see section 7.9 of Unicode 6.0.]
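
The conflict Unicode alludes to can be sketched as follows. The example assumes a whitespace-collapsing step modelled on the "collapse" value of XML Schema's whiteSpace facet; that particular mechanism is chosen only for illustration and is not named in the Unicode text. The function name is hypothetical.

```python
import re


def collapse_xml_whitespace(value: str) -> str:
    """Collapse whitespace in the manner of XML Schema whiteSpace="collapse".

    Only U+0009, U+000A, U+000D and U+0020 are treated as whitespace;
    U+00A0 no-break space is left alone.
    """
    value = re.sub(r"[\t\n\r]", " ", value)   # tab, LF, CR become U+0020
    value = re.sub(r" {2,}", " ", value)      # runs of U+0020 become one
    return value.strip(" ")                   # leading/trailing U+0020 removed


space_guarded = "\u0020\u0301"  # U+0020 + COMBINING ACUTE ACCENT
nbsp_guarded = "\u00a0\u0301"   # U+00A0 + COMBINING ACUTE ACCENT

# The U+0020 base is stripped, leaving a bare combining mark ...
assert collapse_xml_whitespace(space_guarded) == "\u0301"
# ... while the U+00A0 base survives.
assert collapse_xml_whitespace(nbsp_guarded) == nbsp_guarded
```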

The justifications for somewhat aligning with inter-element whitespace, rather than completely forbidding combining marks that combine with U+0020, are:
  1)  The same as for the permission to have empty elements: it may be used as a placeholder or template. E.g. a combining accent might be combined with different letters via scripting.
  2)  Furthermore, Unicode contains "Spacing Clones of Diacritical Marks", most of which "have compatibility decomposition mappings involving U+0020 space, but implementers should be cautious in making use of those decomposition mappings because of the complications that can arise from replacing a spacing character with a space + combining mark sequence". (The point is that, even if Unicode warns against it, one can probably not completely forbid combining marks combined with U+0020 when Unicode itself operates with normalization that includes the U+0020.)
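
The decomposition mappings mentioned in point 2) can be inspected with Python's standard unicodedata module. This is only a sketch: the helper name is hypothetical, and the four characters were picked as examples of spacing diacritics, not taken from the report.

```python
import unicodedata


def compatibility_decomposition(ch: str) -> list[str]:
    """Return the NFKD form of ch as a list of U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", ch)]


# Example spacing diacritics: DIAERESIS, ACUTE ACCENT, BREVE, SMALL TILDE.
for clone in ("\u00a8", "\u00b4", "\u02d8", "\u02dc"):
    print(unicodedata.name(clone), compatibility_decomposition(clone))
```

Each of these decomposes to U+0020 followed by the corresponding combining mark, which is the space + combining mark sequence the quoted Unicode text cautions about.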

Comment 1 David Carlisle 2011-10-03 12:08:01 UTC

note 

 2)  Also, in a parenthesis or side note, state that if an isolated 
       combining mark is needed, then one should, in line with
       Unicode 6.0, combine it  with U+00A0 no-break space.

this would make any use of the entities 

DownBreve tdot TripleDot DotDot

non-conforming, see

http://www.w3.org/TR/2010/REC-xml-entity-names-20100401/#chars_math-multiple-tables

Prefixing with #160 rather than #32 wasn't really an option due to legacy use of <mo>&tdot;</mo>
to get a triple dot accent.
Space characters are ignored in MathML processing, so changing the definition of tdot from U+20DB to U+0020 U+20DB (at MathML 2, if I recall correctly) wouldn't affect processing but did meet the requirement not to start an entity with a combining character. Using U+00A0 instead would have affected the spacing if this were used alone, and would most likely have made this character unrecognised if used in accent constructs.

> Prior to Version 4.1 of
> the Unicode Standard, the standard also recommended the use of U+0020 space for
> display of isolated combining marks. This is no longer recommended,

Unicode may have changed its recommendation here but these entities had been standardised years earlier.

Comment 2 Leif Halvard Silli 2011-10-03 16:03:47 UTC

(In reply to comment #1)
> note 
> 
>  2)  Also, in a parenthesis or side note, state that if an isolated 
>        combining mark is needed, then one should, in line with
>        Unicode 6.0, combine it  with U+00A0 no-break space.
> 
> this would make any use of the entities 
>
>       DownBreve tdot TripleDot DotDot
>
> Non conforming, see
> http://www.w3.org/TR/2010/REC-xml-entity-names-20100401/#chars_math-multiple-tables

Are there any XML parsers that actually resolve e.g. &DotDot; into &#x0020;&#x20DB; ? What you say about MathML behaviour below indicates that the answer is no. The illustration in that document of how &DotDot; is supposed to be rendered does not contain any space:

http://www.w3.org/TR/2010/REC-xml-entity-names-20100401/glyphs/020/U020DB.png

For verification, check Firefox and Webkit, the only HTML parsers that thus far implement the &DotDot; entity. Neither of them includes the U+0020 as part of the entity:

http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1171

I note as well that the spec says that DotDot means U+020DC.
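
The same check can be sketched outside a browser. This assumes Python's html.unescape, which resolves named character references according to the HTML5 table, is an acceptable stand-in for an HTML parser here; it says nothing about what an XML parser with a DTD would do.

```python
import html

# The four entity names discussed in comment 1.
for name in ("DownBreve", "tdot", "TripleDot", "DotDot"):
    expansion = html.unescape(f"&{name};")
    code_points = " ".join(f"U+{ord(c):04X}" for c in expansion)
    guarded = expansion.startswith("\u0020")
    print(f"&{name}; -> {code_points} (starts with U+0020: {guarded})")
```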

Btw and fwiw: note that [Charmod-norm] says:

]] Full-normalization prevents the use of entities for expressing composing characters. This limitation can be circumvented by using character escapes or by using entities representing complete combining character sequences. With appropriate entity definitions, instead of A&acute;, write &Aacute; (or better, use '

Comment 3 David Carlisle 2011-10-03 16:21:48 UTC

(In reply to comment #2)

> Are there any XML parsers that actually resolves e.g. &DotDot; into 
> &#x0020;&#x20DB; ?

All of them I would imagine if given a dtd that defines it that way.
Or did you mean html parsers?

>  What you say about MathML behaviour below indicates that the
> answer is no. The illustration in that document, of how the &DotDot; is
> supposed to be rendered, does not contain any space:

White space at the start and end of text nodes is parsed but doesn't affect rendering in MathML; that's why it's important to guard the combining character with a space rather than nbsp, precisely so it doesn't affect rendering.
 
> 
> http://www.w3.org/TR/2010/REC-xml-entity-names-20100401/glyphs/020/U020DB.png
> 
> For verification, check how Firefox and Webkit - the only HTML parsers that
> thus far implements the &DotDot; entity. Neither of them includes the U+0020 as
> part of the entity:

Ah, they are following the spec. I must have missed that when I checked the HTML5 entities. So if the intention is that charmod-norm ever comes out of draft and that the HTML entities comply with it, those entities should be defined to have the combining character guarded with a space, to match the XML entity spec definitions from which they were copied.


> Btw and fwiw: note that [Charmod-norm] says:
> 
> ]] Full-normalization prevents the use of entities for expressing composing
> characters. This limitation can be circumvented by using character escapes or
> by using entities representing complete combining character sequences. With
> appropriate entity definitions, instead of A&acute;, write &Aacute; (or better,
> use '

Comment 4 Leif Halvard Silli 2011-10-03 17:42:06 UTC

(In reply to comment #3)
> (In reply to comment #2)
> 
> > Are there any XML parsers that actually resolves e.g. &DotDot; into 
> > &#x0020;&#x20DB; ?
> 
> All of them I would imagine if given a dtd that defines it that way.
> Or did you mean html parsers?

No, no. I meant XML.
 
> >  What you say about MathML behaviour below indicates that the
> > answer is no. The illustration in that document, of how the &DotDot; is
> > supposed to be rendered, does not contain any space:
> 
> White space at the start and end of text nodes is parsed but doesn't affect
> rendering in mathml,

So MathML and HTML behave the same way, then, I think? E.g. white space at the beginning of a <p> doesn't affect the rendering in HTML: there is no space in the end result.

> that's why it's important to guard the combining character
> with space rather than nbsp, precisely so it doesn't affect rendering.

Apart from the fact that HTML does not allow me to define entities, I spot no difference between XML parsers:
   http://tinyurl.com/6lyx5mg 
Or HTML parsers: 
  http://tinyurl.com/5stsb7r

Note that a space before the combining character behaves differently if the space + combining character together are the first characters of a block or inline-block element, compared to if the space comes between a character and the combining character. (In my demo documents, it only works in Opera and Webkit, though. Not in Firefox. I don't know how it works in IE9.)

I suspect that MathML is just like XML, but that MathML (compared to XHTML and HTML) has very many display:inline-block elements.

Please note that this bug is about elements that can contain "any flow content". Most such elements are container elements and of display:block type (or something equivalent). I suspect that the <mo/> element, for instance, cannot contain "any flow content". Moreover, I suspect that <mo>&tdot;</mo> is an inline-block element, and thus it works.

It seems to me that your objection to *this* bug perhaps is invalid. Note as well that I did not say that it would be invalid, I just recommended that conformance checkers will warn - or at least recommend - that combining characters are combined with something other than the space character.

> > http://www.w3.org/TR/2010/REC-xml-entity-names-20100401/glyphs/020/U020DB.png
> > 
> > For verification, check how Firefox and Webkit - the only HTML parsers that
> > thus far implements the &DotDot; entity. Neither of them includes the U+0020 as
> > part of the entity:
> 
> ah they are following the spec, I must have missed that when I checked the
> html5 entities, so if the intention is that charmod-norm ever comes out of
> draft and that the html entities comply with it those entities should be
> defined to have the combining character guarded with a space to match the xml
> entity spec definitions from where they were copied.

If &DotDot; were to begin with a space character, that would make it generally unusable, because a combining entity that begins with a space character would not combine with the preceding character, unless the combining character itself is inside an element or in a position where the effect of the U+0020 character is cancelled. (Hint: display:inline-block etc.)

> > Btw and fwiw: note that [Charmod-norm] says:
> > 
> > ]] Full-normalization prevents the use of entities for expressing composing
> > characters. This limitation can be circumvented by using character escapes or
> > by using entities representing complete combining character sequences. With
> > appropriate entity definitions, instead of A&acute;, write &Aacute; (or better,
> > use '

Comment 5 Leif Halvard Silli 2011-10-03 17:56:30 UTC

(In reply to comment #4)

> (In my demo documents, it  only works in Opera and
> Webkit, though. Not in Firefox. I don't know how it works in IE9.)

Actually, when I zoom in, then I see that it works in Firefox too - though only in the HTML version.

Comment 6 Ian 'Hixie' Hickson 2011-10-03 18:52:55 UTC

(In reply to comment #0)
> 
> PROBLEM DESCRIPTION: Bug 13502 resulted in a de-facto permission to let text
> runs begin with combining marks. However, while it should perhaps not be
> completely forbidden, still - if an element "whose content model allows any
> flow content"  contains nothing but (inter-element) space + combining mark (or
> even solely a combining mark), then there are several potential issues:
> 
> 1)  White space collapsing means that the combining character doesn't really
>      combine with the space character

Why would this be a problem?

> 2)  Combining marks that combine with nothing or with a space are hard to select with
> the mouse

Why would they be any harder than combining with a letter?

> 3)  Visually, such marks may look as if they combine with something outside the
> element

They might well combine with something outside the element's border box. Why is this a problem?

> 4)  When the first letter is a combining mark, then the CSS *:first-letter{}
> selector may seem, to authors, to not work

Why not? It would do exactly what CSS says it should, no?

Comment 7 Leif Halvard Silli 2011-10-03 21:38:07 UTC

(In reply to comment #6)
> (In reply to comment #0)

> > 1)  White space collapsing means that the combining character doesn't really
> >      combine with the space character
> 
> Why would this be a problem?

Because the assumption is that the author wants to represent the combining mark as a letter in itself, as if it were a spacing character. The assumption is also that he/she wants this character to behave the same regardless of where it is placed, and not that it becomes extra difficult to control each time it (or space+combiningMark) is the first character(s) of the line.

> > 2)  Combining marks that combine with nothing or with a space are hard to select with
> > the mouse
> 
> Why would they be any harder than combining with a letter?

Because it is typically difficult to select a combining character. One typically selects the base character plus the combining character as a whole. And then, if there is no base character, it is rather understandable that it becomes hard to select it.

Section '5.11 Editing and Selection' of Unicode 6.0 has some details about these matters. For example, it says:

Comment 8 Ian 'Hixie' Hickson 2011-10-03 22:41:43 UTC

> Because the assumption is that the author wants to represent the combining mark
> as a letter in itself, as if it was a spacing character. The assumption is
> also that he/she wants this character to behave equally regardless of where it
> is placed, and not that it becomes extra difficult to control it each time it
> (or space+combiningMark) is the first character(s) of the line.

Even if the combining mark becomes an isolated one, it'll still render fine. It doesn't become more difficult at the start of the line. So this is not, IMHO, a real problem.


> Because it is typically difficult to select a combining character. One
> typically selects the base character plus the combining character as a whole.
> And then, if there is no base character, it is rather understandable that it
> becomes hard to select it.

There's always a base character. Isolated combining characters essentially magic one out of nowhere to combine with, as if it was a space. So this is not, IMHO, a real problem.

There may be some implementation issues; those should be filed as bugs with the implementations.

This is not, IMHO, a real problem, at least not one in the spec.


> > > 3)  Visually, such marks may look as if they combine with something outside the
> > > element
> > 
> > They might well combine with something outside the element's border box. Why is
> > this a problem?
> 
> The situation I described was one where it *looks* as if it combines with
> something (that is: with something invisible) outside the element.  That is: A
> situation where there is nothing to combine with. (For all I know, it combines
> with the box - rather than a character - outside the element.)
> 
> If the combining character is inside an element with display:inline-block, and
> combines with another character in a mathml element, then that is another
> matter - and not a problem. 

I have no idea what you're saying here. Could you elaborate? Maybe a concrete example?


> > > 4)  When the first letter is a combining mark, then the CSS *:first-letter{}
> > > selector may seem, to authors, to not work
> > 
> > Why not? It would do exactly what CSS says it should, no?
> 
> I said "may seem to not". I did not say "does not". (In addition, there are
> bugs.)

If there are bugs, please file them with the implementations. That does not affect the spec.


Leaving open for clarification on point 3 above.

Comment 9 Leif Halvard Silli 2011-10-04 02:21:41 UTC

(In reply to comment #8)
> > > > 3)  Visually, such marks may look as if they combine with something outside the
> > > > element
> > > 
> > > They might well combine with something outside the element's border box. Why is
> > > this a problem?
> > 
> > The situation I described was one where it *looks* as if it combines with
> > something (that is: with something invisible) outside the element.  That is: A
> > situation where there is nothing to combine with. (For all I know, it combines
> > with the box - rather than a character - outside the element.)
> > 
> > If the combining character is inside an element with display:inline-block, and
> > combines with another character in a mathml element, then that is another
> > matter - and not a problem. 
> 
> I have no idea what you're saying here. Could you elaborate? Maybe a concrete
> example?

W.r.t. point 3) of the initial comment, look here:
<http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1174>
The demo shows that combining marks, when "alone", move outside the left border of the element, as if they combine with something that is outside the element. This perhaps does not happen with every combining mark, but it seems to happen at least with diacritics.

W.r.t. my subsequent reply to your comment, then look at the "inline-block" examples at the bottom of the following demo: <http://tinyurl.com/5stsb7r>. When I flesh it out a bit more, that demo has this code:
<p>

Comment 10 Ian 'Hixie' Hickson 2011-10-06 23:06:53 UTC

> <http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1174>

Do you have a minimised example? I can't make heads or tails of this page.


> The demo shows that combining marks, when "alone", move outside the left border
> of the element, as if it combines with something that is outside the element.

I don't see what this has to do with HTML. It's a rendering issue; either an issue to bring up with the CSS working group or with the browser vendors.

> W.r.t. my subsequent reply to your comment, then look at the "inline-block"
> examples at the bottom of the following demo: <http://tinyurl.com/5stsb7r>.

I can't work out what's going on on that page.


> When I flesh it out a bit more, that demo has this code:
> <p>�<span style="display:inline-block">&#x0020;&#x20DC;</span>
>    Despite the space between the '�' and the combining character, the combining
> character combines with the '�'.  This, I said is fine. Only if we removed the
> "�", would there be a problem as then there would be nothing to combine with:
> <p><span style="display:inline-block">&#x0020;&#x20DC;</span>

I don't understand the problem here. Unicode is clear about what you do with isolated combining characters.


> NOTE 1: This bug is filed against the 'Flow content' section, where you give a
> description of the  general rule of what """elements whose content model allows
> any flow content""" as a minimum **should** contain.

Actually that's been moved into its own section now.


> The spec says that the
> minimum is not a strict rule: 'not a hard requirement'. And I simply would like
> that this "not hard requirement" is stretched to include combining characters
> too. 

I don't see why they need mentioning at all.


>       Btw, conformance checkers do not display a warning if e.g. the <body>
> element is empty, and so they would not need to actually warn if the <body>
> only contains a combining mark either ... It would be enough for me if the spec
> explained that an element "whose content model allows any flow content", is
> more than spaces and combining marks.

What's wrong with just having an isolated combining mark? It's perfectly legal per Unicode.


> NOTE 2: Do you disagree with the advice of Unicode6, that  authors, when they
> want to represent a combining character as if was an independent, spacing
> character, should combine with no-break space?  If you don't, how  can one get
> this authoring advice into the spec?

Why would we need to mention it at all? That's a Unicode issue.


> NOTE 3: It is not the case that *my* proposal circumvents all
> implementation bugs. Far from it. So it is not a proposal that seeks to
> circumvent implementation bugs. In fact, my proposal emphasizes that

I've no idea what you're saying here.


> NOTE 3: Do I misunderstand "any flow content"? I read it as "every sort" but
> perhaps it is meant "some sort"?

I do not think you misunderstand it.


> NOTE 4: This variant of my previous demo, has dd:first-letter{white-space:pre}.
> And as you can see, this makes the line where there is a space plus a combining
> mark identical with the line where there is no-break space and a combining
> mark. However, the line where there is only a combining mark as first letter,
> is not affected - given that not every implementation has CSS enabled, one
> can't rely on this:
> 
> http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1176

I have no idea how this is relevant to HTML.

Comment 11 Ian 'Hixie' Hickson 2011-12-09 23:30:27 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Did Not Understand Request
Change Description: no spec change
Rationale: see previous comment