Bug 11211 - Need a way to force a line wrap with the bidi semantics of LINE SEPARATOR when necessary.
Status: RESOLVED WORKSFORME
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assigned To: Ian 'Hixie' Hickson
Depends on: 10828
Reported: 2010-11-03 22:37 UTC by Aharon Lanin
Modified: 2010-11-14 10:18 UTC
Description Aharon Lanin 2010-11-03 22:37:26 UTC
Bug 10828, when fixed, will define <br> as a paragraph break for bidi purposes, in order to match widespread usage and the way IE and WebKit treat it today (despite the HTML 4 spec saying otherwise).

Nevertheless, as bug 10828 states, we still need a way, at times, to force a line wrap with the bidi behavior that the Unicode Bidi Algorithm assigns to the LINE SEPARATOR character, i.e. the bidi behavior that <br> was defined to have in HTML 4.01, and which is the way that Firefox and Opera currently treat <br>.
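
As a rough illustration of the distinction drawn above, the sketch below uses Python's standard `unicodedata` module to look up the Unicode Bidi_Class of the characters involved. The helper names (`bidi_class`, `is_bidi_paragraph_break`) are hypothetical and only for this example; the point is that LINE FEED and PARAGRAPH SEPARATOR are class B (paragraph separator), while LINE SEPARATOR is class WS (whitespace) and so does not end a bidi paragraph.

```python
import unicodedata

# Candidate break characters discussed in this bug.
CANDIDATES = {
    "LINE FEED": "\u000A",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
}

def bidi_class(ch):
    """Return the Unicode Bidi_Class short name for a single character."""
    return unicodedata.bidirectional(ch)

def is_bidi_paragraph_break(ch):
    """True when the character has Bidi_Class B (Paragraph Separator)."""
    return bidi_class(ch) == "B"

for name, ch in CANDIDATES.items():
    print(f"U+{ord(ch):04X} {name}: class {bidi_class(ch)}")
```

This only reports character properties; it says nothing about whether a given browser actually starts a new line at U+2028.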

For use case see bug 10828, comment 18.

<http://www.w3.org/International/docs/html-bidi-requirements/#br-as-separator> suggests (and bug 10828 used to include) addressing this by adding a bidibreak=soft|hard attribute that would determine the behavior of <br> in its descendants (defaults to hard for the root element).
Comment 1 Ian 'Hixie' Hickson 2010-11-04 05:53:38 UTC
Specifically, the use case in bug 10828 comment 18 is to use HTML as a formatting language so as to be able to replicate line breaking seen in other media, such as books or newspapers, as in:

   http://newspapers.nla.gov.au/ndp/del/article/1118868

Such line breaks must not be paragraph separators for bidi purposes.
Comment 2 Ian 'Hixie' Hickson 2010-11-04 07:32:37 UTC
(Isn't the right solution here to just use PDF or SVG or another layout language? This seems like a bit of an abuse of HTML.)
Comment 3 fantasai 2010-11-04 13:46:51 UTC
The right way to capture non-semantic line-breaking copied from another medium is <pre>, aka Preformatted.

But a valid use case would be poetry, where line breaks are semantic but soft breaks are appropriate. I would be interested in hearing other use cases as well, though.
Comment 4 Aharon Lanin 2010-11-04 23:41:33 UTC
(In reply to comment #3)
> The right way to capture non-semantic line-breaking copied from another medium
> is <pre>, aka Preformatted.

I have no opinion on whether HTML is the right format for OCR output.

If we are on the subject, though, OCR from bidi text is devilishly hard. You have to run a visual-to-logical transformation (which is enough of a complication by itself). And you would have to guess which of the line breaks in the original text are actually line wraps, and which are paragraph breaks, since for the line wraps, you do indeed need line separators. For example, let's say this is the original visual order in the printed RTL book:

   ali baba and the" SI YROTS EHT FO EMAN EHT
                                 ."40 thieves

The correct logical-order content would be:

THE NAME OF THE STORY IS "ali baba and the[LINE SEPARATOR]
40 thieves".[PARAGRAPH SEPARATOR]

If one used a [PARAGRAPH SEPARATOR] at the end of the first line, it would get displayed as:

   ali baba and the" SI YROTS EHT FO EMAN EHT
                                 ."thieves 40
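
The difference between the two separators in the example above can be sketched in Python. This is a simplified model, not an implementation of the Unicode Bidi Algorithm: the hypothetical helper `bidi_paragraphs` only splits text at characters whose Bidi_Class is B, which is the step that decides whether "40 thieves" is reordered together with the rest of the sentence or on its own.

```python
import unicodedata

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

def bidi_paragraphs(text):
    """Split text at characters whose Bidi_Class is B.

    Simplification: every class-B character ends a paragraph and is dropped.
    """
    paragraphs, current = [], []
    for ch in text:
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        paragraphs.append("".join(current))
    return paragraphs

# Logical-order text from the example, with a placeholder for the separator.
story = 'THE NAME OF THE STORY IS "ali baba and the{sep}40 thieves".'
with_ls = bidi_paragraphs(story.format(sep=LS))
with_ps = bidi_paragraphs(story.format(sep=PS))
print(len(with_ls), len(with_ps))
```

With LS the whole sentence stays in one bidi paragraph, so the LTR run continues across the line wrap; with PS the second line becomes a paragraph of its own and is ordered independently.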

> I would be interested in hearing other use cases as
> well, though.

I don't have anything. But I know that it will come and bite me on the behind one day.
Comment 5 Ian 'Hixie' Hickson 2010-11-05 07:21:09 UTC
The simplest solution here is an attribute on <br>. I don't know how to map it to CSS though. Any suggestions on that front? Right now the spec says: "The br element is expected to render as if its contents were a single U+000A LINE FEED (LF) character and its 'white-space' property was 'pre'." Is there any way in CSS right now to get the effect desired by this bug?
Comment 6 fantasai 2010-11-05 09:29:19 UTC
Yes. Have <br> insert LS instead of LF. :) Most implementations currently have lots of difficulty rendering LS... but theoretically it should work. :P
Comment 7 Adil 2010-11-05 15:16:01 UTC
(In reply to comment #3)
> The right way to capture non-semantic line-breaking copied from another medium
> is <pre>, aka Preformatted.
> 
> But a valid use case would be poetry, where line breaks are semantic but soft
> breaks are appropriate. I would be interested in hearing other use cases as
> well, though.

The example I gave is a current issue I am developing now. I need the text as HTML as I want it as copyable text. Using <pre> is possible but then I guess the DOM would treat the whole text as a single element where my application would prefer to work on one element for each line.

The other use case is for web applications - the situation is theoretical but I believe will be real as more web applications support bidi. e.g. an email application that shows the first line of an email then reveals the rest on pressing a "more" button. Predicting where to truncate a long string in a web app is a bit hit and miss - so it is safe to overcompensate. But once the rest of the paragraph is revealed I would not want words to magically appear on the first line. So: Imagine something like this..

| FROM: John MESSAGE: Dear John, Dont be hard on yourself,     |
|                     <more...>                                |

And clicking on "more" would reveal the rest of the lines of the email..

| FROM: John MESSAGE: Dear John, Dont be hard on yourself,     |
|                     give yourself a break, life wasn't meant |
|                     to be run, the race is over you won.     |

The "give" would fit but now needs to be wrapped to the next line. Assuming the paragraph contains mixed bidi text a <br> would break the correct ordering.
Comment 8 Ian 'Hixie' Hickson 2010-11-08 08:07:53 UTC
(In reply to comment #6)
> Yes. Have <br> insert LS instead of LF. :) Most implementations currently have
> lots of difficulty rendering LS... but theoretically it should work. :P

Where in CSS does it define that LS creates a new line box? I was considering doing it this way but I couldn't find anything that defined this appropriately. (It isn't obvious that it should Just Work, for the same reason e.g. CR and FF don't "just work".)


The use case in comment 7 seems to be a presentation issue that should be handled in CSS only.

Incidentally, are there any use cases for a <br> that _doesn't_ act as described here in a bidi context? I know we have to make <br> act that way, I'm just asking if there are any cases where one might actually legitimately _want_ to use <br> with mixed RTL and LTR text even though it breaks paragraphs. All the cases I can think of are strictly presentational and would be best handled by <p>... Am I missing any?
Comment 9 Aharon Lanin 2010-11-08 11:23:02 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > Yes. Have <br> insert LS instead of LF. :) Most implementations currently have
> > lots of difficulty rendering LS... but theoretically it should work. :P
> 
> Where in CSS does it define that LS creates a new line box? I was considering
> doing it this way but I couldn't find anything that defined this appropriately.
> (It isn't obvious that it should Just Work, for the same reason e.g. CR and FF
> don't "just work".)
> 
> 
> The use case in comment 7 seems to be a presentation issue that should be
> handled in CSS only.
> 
> Incidentally, are there any use cases for a <br> that _doesn't_ act as
> described here in a bidi context? I know we have to make <br> act that way, I'm
> just asking if there are any cases where one might actually legitimately _want_
> to use <br> with mixed RTL and LTR text even though it breaks paragraphs. All
> the cases I can think of are strictly presentational and would be best handled
> by <p>... Am I missing any?

The best one that I can think of certainly should not be done with <p>'s, and would probably be legitimate with <br>, but would probably best be done with <pre>. It is when you break the text you are quoting into short lines and prefix each one with &gt;. Bidi paragraph breaks are imperative there because otherwise the &gt; gets tangled up:

&gt; he said 'PLEASE SHOW ME<br>
&gt; THE EXAMPLE' and i did

If <br> has the semantics of only a bidi line separator, this will come out as:

> he said 'EM WOHS ESAELP
ELPMAXE EHT <' and i did

instead of as:

> he said 'EM WOHS ESAELP
> ELPMAXE EHT' and i did
Comment 10 Adil 2010-11-08 15:37:32 UTC
Just to add to the use cases.. (this is probably ancient history for most web developers) my own desktop publishing application provides two types of line-break - one that breaks a paragraph and one that does not. The users make all kinds of use of this - e.g.
- to control the line breaks in headlines that span multiple lines, 
- control line breaks in advertising copy where the text has to match a certain pattern,
- create a line wrap without breaking the paragraph formatting - e.g indents and paragraph spacing.

In each of these cases the user intends to insert an entity into the line that semantically means "line break" but does not mean "paragraph break".

With bug 10828 we are semantically redefining <br> as a lightweight paragraph break. But, also, there is a need to have a way of defining a line break that does not break a paragraph and semantically means this - with the advantage that it forces a line wrap with the bidi behavior of LINE SEPARATOR.
Comment 11 Ian 'Hixie' Hickson 2010-11-10 17:36:23 UTC
(In reply to comment #9)
> 
> &gt; he said 'PLEASE SHOW ME<br>
> &gt; THE EXAMPLE' and i did

That's what <blockquote> and/or <pre> are for. I don't think <br> would be appropriate there.


(In reply to comment #10)
> - to control the line breaks in headlines that span multiple lines, 

That's presentational, and should be handled in CSS, not in the markup.

> - control line breaks in advertising copy where the text has to match a certain
> pattern,

That's presentational also, probably an SVG issue, not HTML.

> - create a line wrap without breaking the paragraph formatting - e.g indents
> and paragraph spacing.

That's entirely CSS (text-indent, padding, etc).

> In each of these cases the user intends to insert an entity into the line that
> semantically means "line break" but does not mean "paragraph break".

I don't think there are any line break semantics above, it's all presentation.


Can someone from the CSS working group confirm whether the LS character in content will cause a line break without a bidi paragraph break, and cite the relevant part of the relevant spec? Given that, I think we'd be good to go with just adding an attribute to <br> to use LS rather than LF (though I'm still not convinced we have any use cases for <br> _without_ this feature, so I'm tempted to make the attribute required if there's any RTL content around the <br> element).
Comment 12 Maciej Stachowiak 2010-11-11 02:36:55 UTC
(In reply to comment #11)
> (In reply to comment #9)

> 
> Can someone from the CSS working group confirm whether the LS character in
> content will cause a line break without a bidi paragraph break, and cite the
> relevant part of the relevant spec? Given that, I think we'd be good to go with
> just adding an attribute to <br> to use LS rather than LF (though I'm still not
> convinced we have any use cases for <br> _without_ this feature, so I'm tempted
> to make the attribute required if there's any RTL content around the <br>
> element).

I think U+2028 should in theory work regardless of CSS white-space mode (it is not subject to collapsing), so &#x2028; should satisfy the use case in this bug (though it doesn't currently work in any browser).
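
For authors who generate markup, a minimal sketch of this approach might look like the following. It assumes Python's standard `html` module; the helper name `join_with_soft_breaks` and the sample lines are hypothetical. The sketch only shows that the character reference decodes to U+2028 and introduces no class-B (paragraph separator) character; it does not show how any browser renders it.

```python
import html
import unicodedata

LS_CHARREF = "&#x2028;"  # numeric character reference for U+2028 LINE SEPARATOR

def join_with_soft_breaks(lines):
    """Escape each line for HTML and join the lines with the LS reference."""
    return LS_CHARREF.join(html.escape(line) for line in lines)

# Hypothetical input: the two printed lines from comment 4, in logical order.
markup = join_with_soft_breaks([
    'THE NAME OF THE STORY IS "ali baba and the',
    '40 thieves".',
])
decoded = html.unescape(markup)

# Count the soft breaks and any bidi paragraph separators in the decoded text.
soft_breaks = decoded.count("\u2028")
hard_breaks = sum(unicodedata.bidirectional(ch) == "B" for ch in decoded)
print(soft_breaks, hard_breaks)
```

Writing the reference with its trailing semicolon avoids any ambiguity when the next line happens to start with a hexadecimal digit.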
Comment 13 Ian 'Hixie' Hickson 2010-11-11 19:13:40 UTC
I talked to fantasai about this in #css earlier.

I recommend that we try using &#x2028; for now, and see where that gets us. Since the CSS layer doesn't yet support this, there's not much we can do right now. If &#x2028; works, then we can get the MathML WG to add a &ls; named charref for us in the future to make it more usable. If it doesn't, we can try to find a better solution (like a new element). I'm a little reluctant to add an attribute to <br> for this because I can't really come up with a good name and any name longer than three characters is longer than "&#x2028;" would be.
Comment 14 Ian 'Hixie' Hickson 2010-11-11 19:14:08 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: no spec change
Rationale: see above
Comment 15 Adil 2010-11-12 13:34:06 UTC
I agree that &#x2028; should cover most of the use cases, at least in theory, and that adding an attribute on <br> is too cumbersome. Most of the examples I gave in comment 10 can be handled by CSS, but that is more complex to manage in a web app, and in all the examples the creator would want to keep the same line break even if the CSS changes.

&#x2028; will work for now, but as Aharon said in comment 4, I suspect this will bite us on the behind one day.

My preferred solution is a new tag (e.g. <lbr>). It would be consistent with <br> and <wbr> and I still think HTML needs a tag that has the semantic meaning of breaking a line without creating a paragraph. But I can understand that this may be too large a change now. 

So take this as an acceptance with reservations.
Comment 16 Amit Aronovitch 2010-11-12 17:44:24 UTC
(In reply to comment #13)
> I talked to fantasai about this in #css earlier.
> 
> I recommend that we try using &#x2028; for now, and see where that gets us.
> Since the CSS layer doesn't yet support this, there's not much we can do right
> now. If &#x2028; works, then we can get the MathML WG to add a &ls; named

Was that a typo? AFAIK MathML has nothing to do with this. Maybe XML WG?
Can't/shouldn't html add entities beyond the predefined XML set? (I do not know - this is for my general knowledge...)

> charref for us in the future to make it more usable. If it doesn't, we can try
> to find a better solution (like a new element). I'm a little reluctant to add
> an attribute to <br> for this because I can't really come up with a good name
> and any name longer than three characters is longer than "&#x2028;" would be.

How about <br ubi> as suggested by Aharon in bug 10828, comment 22?

As for later comments - I'm not sure I understand the current direction: was the intention to: (a) Add some new attribute/element and use &#x2028; in the specification of its required behaviour? 
or: (b) Add nothing new, and have the content-authors implement the linebreak &#x2028; ?

I must say that I do not like option (b).

The objections that Ian raised on comment 11 are probably correct: <br> *is* a presentational thing, which causes all use cases to be some sort of "abuse". But the same can be said about the parabreaking <br> (bug 10828).

Now, it seems that we are going to keep <br> in HTML despite that (this seems to be a consensus, apparently for practical reasons), but decide to *change* its behavior, to match one sort of "abusive" usage (re: comments 8 and 11: for usecases we should probably search Mozilla bug-reports), and not the other (usecases in comments 7 and 10). For one thing, this is aesthetically awkward. Furthermore, from the POV of those browser-makers that stuck to the HTML4 standard despite pressure from their customers, it would seem like the standards are following implementation (so why should they want to keep following it?).

If it is in the scope of HTML to *require* implementation of entities, then a possible solution would be to require compliant applications to properly handle both U+2028 and U+2029, and to add a comment in the <br> spec, saying that users should prefer using U+2029 (or better yet, a named entity, e.g. &ps;) instead of <br> (or using U+2028 (or a named entity, e.g. &ls;) for the usecase described in this bug).
Comment 17 Aharon Lanin 2010-11-14 10:18:42 UTC
(In reply to comment #16)
> (In reply to comment #13)
> > I recommend that we try using &#x2028; for now, and see where that gets us.

As far as I understand, LINE SEPARATOR support has been added to CSS 2.1 tests. Perhaps this will cause browsers to start supporting it. Currently they don't - it either has no effect at all or is displayed as a rectangle or something.

> > Since the CSS layer doesn't yet support this, there's not much we can do right
> > now.

Well, theoretically the CSS layer does support it, since it does not explicitly say anything about LINE SEPARATOR, so the Unicode spec should apply, and the new tests should enforce it.

> > If &#x2028; works, then we can get the MathML WG to add a &ls; named
> 
> Was that a typo? AFAIK MathML has nothing to do with this. Maybe XML WG?
> Can't/shouldn't html add entities beyond the predefined XML set? (I do not know
> - this is for my general knowledge...)

Yeah, who controls named entities? Will we have to wait for HTML6 to get new ones if we don't do it now?

> How about <br ubi> as suggested by Aharon in bug 10828, comment 22?

There is no ubi attribute. Instead, there is a bdi element, which does not help us here. See there for discussion.

> As for later comments - I'm not sure I understand the current direction: was
> the intention to: (a) Add some new attribute/element and use &#x2028; in the
> specification of its required behaviour? 
> or: (b) Add nothing new, and have the content-authors implement the linebreak
> &#x2028; ?

It's b.

> If it is in the scope of HTML to *require* implementation of entities, then a
> possible solution would be to require compliant application to properly handle
> both U+2028 and U+2029, and to add a comment in the <br> spec, saying that
> users should prefer using U+2029 (or better yet, named entity e.g. &ps;)
> instead of <br> ( or using U+2028 (or named entity e.g. &ls;), for the usecase
> described in this bug).

It's not the entity that's the problem. It's the character itself that all current browsers don't support.