Bug 19505 - Describe visual direction when document encoding is iso-8859-8
Summary: Describe visual direction when document encoding is iso-8859-8
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
Reported: 2012-10-12 11:47 UTC by Anne
Modified: 2014-03-08 00:44 UTC
Description Anne 2012-10-12 11:47:18 UTC
While iso-8859-8 and iso-8859-8-i both encode and decode Hebrew in the same way, the former displays in visual direction while the latter displays in logical direction.

The CSS specification needs to describe this in some way.

See http://encoding.spec.whatwg.org/ for details on the encodings. See bug 17003 and http://krijnhoetmer.nl/irc-logs/whatwg/20121012#l-540 for some of the history behind this filing.

As part of this you might want to consider studying -webkit-rtl-ordering. (I have not been able to find much.)
Comment 1 fantasai 2012-11-26 19:52:04 UTC
Hmm. I guess there would be two logical ways to handle that:
  a) Automatically set direction and override based on the encoding used
  b) Treat Hebrew characters as strong LTR instead of strong RTL.

smontagu probably knows what actually goes on though
Comment 2 Anne 2012-11-26 20:08:34 UTC
For easy comparison:

data:text/html;charset=iso-8859-8,%E0%E1%E2%E3
data:text/html;charset=iso-8859-8-i,%E0%E1%E2%E3
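A minimal sketch of what those four bytes are, assuming Python's built-in "iso-8859-8" codec is an acceptable stand-in for the byte-to-code-point table that both labels share. It only shows decoding; it says nothing about how a browser orders the result on screen.

```python
import unicodedata
from urllib.parse import unquote_to_bytes

# Percent-encoded payload shared by both data: URLs above.
payload = unquote_to_bytes("%E0%E1%E2%E3")

# Assumption: Python's "iso-8859-8" codec stands in for the shared decoder.
text = payload.decode("iso-8859-8")

# Code points and Unicode bidi classes of the decoded characters.
code_points = [f"U+{ord(ch):04X}" for ch in text]
bidi_classes = [unicodedata.bidirectional(ch) for ch in text]

print(code_points)
print(bidi_classes)
```

The decoded characters are ordinary Hebrew letters with a strong right-to-left bidi class, so any difference in display between the two URLs has to come from the encoding label rather than from the characters themselves.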
Comment 3 Simon Montagu 2012-11-26 21:48:41 UTC
Not sure what CSS needs to say about this other than "don't do that". See http://www.w3.org/TR/REC-html40/struct/dirlang.html#bidi88598 for comparison.
Comment 4 Anne 2012-11-26 21:56:30 UTC
We want CSS to list the requirements for browser vendors. (So new browsers can enter the market, existing browsers can more easily rewrite their code, etc.) Of course developers should just use utf-8 etc.
Comment 5 fantasai 2012-11-26 23:06:47 UTC
Looks like 'direction' is not affected. Between a few rough tests and the comments in Gecko, I think the rules here are:
  1. Treat all characters in iso-8859-8 documents as matching the embedding
     direction / treat them all as neutral (including any escaped characters).
  2. Treat form controls normally; they are not affected by the encoding.

data:text/html;charset=iso-8859-8,<li>%E0%E1%E2%E3<input%20value="%E0%E1-%E2%E3*"><li%20dir="rtl">%E0%E1%E2%E3<li>&rlm;%E0%E1%E2%E3&rlm;<li%20style="direction:%20rtl">%E0%E1AB%E2%E3

As for what spec to put that in... I'd suggest a combination of Encoding and HTML, e.g. Encoding for rule #1, and HTML for rule #2. This is very specific to iso-8859-8.

smontagu, does that seem right?
Comment 6 Martin Dürst 2012-11-27 01:54:44 UTC
(In reply to comment #5)

> As for what spec to put that in... I'd suggest a combination of Encoding and
> HTML, e.g. Encoding for rule #1, and HTML for rule #2. This is very specific
> to iso-8859-8.

The best thing would be if all of this were in HTML, as a (piece of a) default style sheet. To get around the fact, as Fantasai's example shows, that dir attribute values are ignored, one could use !important in the default style sheet.

This wouldn't be absolutely perfect, but I doubt there are people who use visual Hebrew and stylesheets where they tweak bidi rendering properties, even more !important. It would give (future) implementers a hopefully easy way to cover this. They wouldn't need special rendering logic, just a switch to change the default style sheet.

The more basic question is how many iso-8859-8 pages are still around, overall. Does anybody have any numbers? Mark Davis should have them as part of his "UTF-8 reached more than 50% of the Web" survey.

[One set of data I found was http://w3techs.com/technologies/details/en-iso885908/all/all. That shows iso-8859-8 at between 0.002% and 0.001% (that is, a fraction between 0.00002 and 0.00001, or about one Web page in every 50,000 to 100,000). However, http://w3techs.com/technologies/overview/character_encoding/all doesn't list iso-8859-8-i at all, so I don't trust this data. And none of the top pages of the sites listed uses iso-8859-8.]
Comment 7 Simon Montagu 2012-11-27 05:26:12 UTC
I don't trust the data at http://w3techs.com/technologies/details/en-iso885908/all/all either: of the sites it lists under "Random selection of sites using ISO-8859-8" some are using ISO-8859-8-I, and some are using Windows-1255; not one actually uses ISO-8859-8 on its main page, nor the alias "visual" that used to be common.
Comment 8 Martin Dürst 2012-11-27 05:36:00 UTC
(In reply to comment #6)

> Mark Davis should have them as part
> of his "UTF-8 reached more than 50% of the Web" survey.

Here's what I got back from Mark:

I don't have the exact stats, but as I recall the iso-8859-8-i was in the
noise.
Comment 9 Simon Montagu 2012-11-27 06:20:36 UTC
(In reply to comment #8)
> Here's what I got back from Mark:
> 
> I don't have the exact stats, but as I recall the iso-8859-8-i was in the
> noise.

iso-8859-8-i is not the question: only iso-8859-8 and its aliases are relevant to visual direction.
Comment 10 fantasai 2012-11-27 16:34:52 UTC
(In reply to comment #6)
>
> The best thing would be if all of this were in HTML, as a (piece of a)
> default style sheet. To get around the fact, as Fantasai's example shows,
> that dir attribute values are ignored, one could use !important in the
> default style sheet.

It's not a default style sheet thing. Look again at the testcases. The dir attribute values are honored in Opera, Chrome, and Gecko. All of the characters however are treated as neutral. This isn't something you can simulate with CSS or Unicode control codes--you'd have to add a new neutral-override feature. Also, this is not specific to HTML: a plaintext file in this encoding should behave the same way.
Comment 11 Matitiahu Allouche 2012-11-27 17:46:51 UTC
(In reply to comment #10)
fantasai wrote:
> It's not a default style sheet thing. ... All of the
> characters however are treated as neutral. This isn't something you can
> simulate with CSS or Unicode control codes--you'd have to add a new
> neutral-override feature.

I don't see why this cannot be simulated with Unicode LRO/RLO/PDF or the CSS override equivalents.
Comment 12 fantasai 2012-11-27 21:55:17 UTC
Hmm. You could probably get most of the way there with
  * { unicode-bidi: bidi-override; }
  input, textarea, etc. { unicode-bidi: normal; }

The one thing you can't simulate that way is the interaction with actual Unicode bidi control codes: those are ignored/treated as invisible neutral characters afaict.
Comment 13 Simon Montagu 2012-11-27 22:12:40 UTC
(In reply to comment #12)
> The one thing you can't simulate that way is the interaction with actual
> Unicode bidi control codes: those are ignored/treated as invisible neutral
> characters afaict.

That seems not unreasonable, on the assumption that (a) the bidi control codes exist to fine-tune the UBA and (b) visual direction bypasses the UBA altogether.

(One has to make some assumptions, since AFAIK visual direction isn't formally defined anywhere and is only defined de facto by what authors did to get RTL pages looking reasonable on browsers without proper Bidi support. Thus the exception for form controls comes from the fact that form controls in browsers from that period were generally implemented by native OS widgets, which generally did have Bidi support on popular OSs in RTL locales)
Comment 14 Matitiahu Allouche 2012-11-27 22:39:40 UTC
(In reply to comment #12)
fantasai wrote:
> The one thing you can't simulate that way is the interaction with actual
> Unicode bidi control codes: those are ignored/treated as invisible neutral
> characters afaict.

Well, we are talking about pages encoded with iso-8859-8, a code page which does not include Unicode bidi control characters except LRM and RLM, and those interact just like regular characters.
It is still possible to generate other Unicode control characters using NCRs like &#x202B;, but pages encoded with iso-8859-8 containing such NCRs would be nonsense. The "raison d'être" of visual pages is to run on systems without bidi support. With no bidi support, there is no motive for the author to use any Unicode directional control character.
I expect that there are no such pages whatsoever. Even if there were, it would not be worth introducing a new feature to take care of such sickly cases.
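A rough check of the code page claim, assuming Python's "iso-8859-8" codec reflects the same repertoire. The helper name `encodable` is made up for this sketch.

```python
import unicodedata

# Assumption: Python's iso-8859-8 table maps bytes 0xFD and 0xFE to LRM and RLM.
marks = b"\xfd\xfe".decode("iso-8859-8")
mark_names = [unicodedata.name(ch) for ch in marks]


def encodable(ch: str) -> bool:
    """Return True if the character has a byte in the iso-8859-8 code page."""
    try:
        ch.encode("iso-8859-8")
        return True
    except UnicodeEncodeError:
        return False


# U+202B (RLE) is one of the embedding controls that can only arrive via an NCR.
rle = "\u202b"
print(mark_names)
print(encodable("\u200f"), encodable(rle))
```

Under that assumption, only the two directional marks can appear as raw bytes; embedding and override controls have no byte of their own and would have to be written as character references.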
Comment 15 fantasai 2012-11-27 23:03:02 UTC
data:text/html;charset=iso-8859-8,<p>%E0%E1%E2%E3<span%20dir=rtl>%E0%E1%E2</span><bdo%20dir=rtl>%E3%E0%E1%E2%E3</bdo><p%20dir=rtl>%E0%E1%E2%E3

More weirdness: On Webkit and Gecko, 'direction' is honored on blocks, but not on inlines. Opera honors it on both. Not sure what IE does.
Comment 16 Martin Dürst 2012-12-03 09:39:18 UTC
With regards to the frequency of iso-8859-8 (visual Hebrew), here is what I got from Aharon Lanin:

>>>>>>>>
Here are some numbers from Google web crawling. These are percentages of
total web pages.

0.061% Windows Hebrew CP1255
0.00081% visual order ISO Hebrew iso-8859-8
0.00060% logical order ISO Hebrew iso-8859-8-i
>>>>>>>>

I suspect that these are actual numbers, i.e. they reflect the actual encoding of the page (as judged by Google, of course), rather than the declared encoding. That would also best explain that windows-1255 is 100 times more popular than iso-8859-8-i.

The question is how much spec work we want to do for something that represents around one in 120,000 pages, and hopefully is on the decline.
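For reference, the arithmetic behind "one in 120,000" and "100 times more popular" can be sketched as follows, using only the three percentages quoted above. The variable names are illustrative.

```python
# Percent of all crawled pages, as quoted from the Google numbers above.
shares = {
    "windows-1255": 0.061,
    "iso-8859-8": 0.00081,
    "iso-8859-8-i": 0.00060,
}


def one_in(percent: float) -> float:
    """Convert a percentage share into 'one page in every N pages'."""
    return 100.0 / percent


pages_per_visual_page = one_in(shares["iso-8859-8"])
ratio_1255_to_logical = shares["windows-1255"] / shares["iso-8859-8-i"]

print(f"visual Hebrew: about one page in {pages_per_visual_page:,.0f}")
print(f"windows-1255 vs iso-8859-8-i: about {ratio_1255_to_logical:.0f}x")
```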
Comment 17 Anne 2012-12-03 10:06:17 UTC
(In reply to comment #12)
> Hmm. You could probably get most of the way there with
>   * { unicode-bidi: bidi-override; }
>   input, textarea, etc. { unicode-bidi: normal; }

So this would be specific to where the encoding is iso-8859-8? If we make it *|* and add that to the HTML rendering section it should cover text/plain, text/xml etc.

Martin, either we document it so new players know what to implement to interoperate with existing clients, or we remove it from existing clients. I don't think it's responsible to leave things lingering. (And it's not that much trouble anyway.)
Comment 18 Martin Dürst 2012-12-03 10:56:39 UTC
(In reply to comment #17)
> (In reply to comment #12)
> > Hmm. You could probably get most of the way there with
> >   * { unicode-bidi: bidi-override; }
> >   input, textarea, etc. { unicode-bidi: normal; }
> 
> So this would be specific to where the encoding is iso-8859-8?

Yes.

> If we make
> it *|* and add that to the HTML rendering section it should cover
> text/plain, text/xml etc.

Great.

> Martin, either we document it so new players know what to implement to
> interoperate with existing clients, or we remove it from existing clients. I
> don't think it's responsible to leave things lingering. (And it's not that
> much trouble anyway.)

Well, something like the above stylesheet is indeed not too much trouble; that's why I suggested it in #c6. If we try to find all the special cases in current implementations (as fantasai has started to do), then it's probably a lot of work.

So now the question is (a) whether browser vendors would be willing to converge to something like the above stylesheet, and (b) whether there are any serious number of pages out there where something like the above stylesheet isn't good enough. I'm not the right person to answer either (a) or (b). My general implementation experience tells me that specing it like above would make it easier for new market entrants than if there were lots of special cases.
Comment 19 Aharon Lanin 2012-12-03 12:39:43 UTC
(In reply to comment #16)
> I suspect that these [...] reflect the actual
> encoding of the page (as judged by Google, of course), rather than the
> declared encoding.

That is correct.
Comment 20 Koji Ishii 2014-02-04 20:56:37 UTC
So, is everyone here in consensus not to fix this bug?
Just wanted to make sure before I change this to RESOLVED WONTFIX.
Comment 21 Anne 2014-02-05 09:02:35 UTC
No that is not the consensus, see second paragraph of comment 17 and follow up comment by Martin.
Comment 22 Koji Ishii 2014-02-05 17:49:04 UTC
> Martin, either we document it so new players know what to implement to
> interoperate with existing clients, or we remove it from existing clients.

I don't think new players would want to add logic for 0.00081% and further decreasing, and it's also not worth spending our time to investigate and spec for 0.00081%.

I'm ok with vendors removing it, but I don't know if vendors want to spend their time on 0.00081%. It'd be the vendors' call, not ours, anyway.

I've asked implementers at the Unicode Technical Conference to take a look at this bug. They may or may not take action, that part I don't know, but I don't think there is any work left for W3C.
Comment 23 Koji Ishii 2014-02-05 19:57:37 UTC
Got another feedback offline that it should be spec'ed.

So, Anne, since you mentioned it should be spec'ed in the HTML rendering section, should this bug be passed to the HTML WG? I'm not clear on how to pass a bug between WGs in this system. If you know, I'd appreciate hearing how, or I can ask Mike or someone else how to give this bug to the HTML WG.
Comment 24 Koji Ishii 2014-02-05 20:35:35 UTC
Changed the Product to WHATWG as suggested by Mike.
Comment 25 Anne 2014-02-05 21:52:22 UTC
Ian, see comment 17.

Having said that, if usage is really that low, maybe we should try to remove this feature.
Comment 26 Koji Ishii 2014-02-05 23:05:32 UTC
(In reply to Anne from comment #25)
> Ian, see comment 17.
> 
> Having said that, if usage is really that low, maybe we should try to remove
> this feature.

Yeah, that was what I thought in the first place. One of the members mentioned, however, that the website of the bank he uses in Israel still uses visual iso-8859-8 in dynamically generated web pages, for example. I'm not strongly pushing either way, but it's understandable that big companies such as financial institutions might be the last ones to move away from old encodings.
Comment 27 Ian 'Hixie' Hickson 2014-02-06 00:33:32 UTC
So... this would be a set of rules that only apply when, exactly?
Are browsers more interested in implementing this than removing support?
Comment 28 Koji Ishii 2014-02-06 03:55:56 UTC
(In reply to Ian 'Hixie' Hickson from comment #27)
> So... this would be a set of rules that only apply when, exactly?

From comment #9: when the encoding is iso-8859-8 or one of its aliases ("only iso-8859-8 and its aliases are relevant to visual direction").

> Are browsers more interested in implementing this than removing support?

In my understanding, it is implemented in all existing browsers. Anne believes that it should be either a) spec'ed for new players, or b) removed. So the question to browsers is whether they want to remove it or not.

I talked with a few bidi people at UTC and in Google; they think it should NOT be removed, because although the numbers are low, some critical sites such as banks are still using the encoding (and rely on browsers to interpret it as visual order).

Logically speaking, there's another option: c) keep the existing code and do nothing in the spec. I'd vote for this; I won't be surprised if new browsers do not support 10-year-old sites. But Anne seems to be opposed to this option, and I'm not sure whether it is a good policy when compared to W3C/WHATWG policies, so I'd appreciate your opinion.
Comment 29 Matitiahu Allouche 2014-02-06 08:48:48 UTC
Here is my point of view, as one with some experience of the Hebrew scene.

a) The percentage of pages marked with ISO-8859-8 may be very low for the whole internet, but for Hebrew users it may be nonetheless more than "noise" (I don't have any statistics).

b) Hebrew data in visual order may come from 2 main sources:
- pages created when the most common browsers did not have bidi support. It is to be expected that the number of such active pages will decrease steadily.
- data from mainframes and large data bases. Typically, those are owned by big institutions, who started using IT when bidi support was limited to keyboards and fonts (no support for logical order) and are very slow modernizing their software systems. Nobody can tell when such usage will become marginal.

c) The current state of things is good enough. The main browsers have reasonable support for visual data. If new browsers won't support them, or support them with simple means like a default stylesheet, it is still ok for the users.

d) My suggestions: 
1. Do nothing to make existing browsers remove the support they provide, don't outlaw the visually ordered data, so that the service to Hebrew users is not impaired.
2. On the other hand, I see no reason to invest much effort in specs or in new implementations. Newcomers may decide to support this part of the Hebrew market or not, like any other business decision.
3. Make very clear to anyone making first steps in the bidi world that ISO-8859-8 and visually ordered data are a dead end.
Comment 30 Tomer Mahlin 2014-02-06 10:18:21 UTC
One minor comment on "visually ordered data are a dead end". It is absolutely not true in the context of mainframes. The main source of visually ordered data is mainframes and large databases. First of all, no one is planning to stop producing / developing / marketing mainframes. Second, no one is going to change the ground rules of Bidi data storage on mainframes. Finally, not only the legacy data (huge volumes of data) but also new data generated and stored on mainframes is in many cases still visually ordered. Thus, at least from the storage perspective, there is no short- or long-term strategy of moving away from visually ordered data to logically ordered data.

In the context currently discussed we are interested in scenarios in which this data from legacy system appears on the web page. 
There are different ways to address the problem. For example: 
1. Change bidirectional layout of the data before it reaches the web page (this way the data when it reaches web page is already logically ordered)
2. Provide tools (basically set of widgets) which allow proper work with visual data in a general web application.
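Option 1 above can be sketched very roughly as follows. This is a deliberately naive illustration, not a real bidi layout transform: it assumes each input string is a single right-to-left line, treats only ASCII letters and digits as left-to-right runs, and ignores mirrored punctuation and line wrapping. The function name `visual_to_logical` is made up for this sketch.

```python
import re

# Runs of ASCII letters/digits are assumed to be the only left-to-right content.
LTR_RUN = re.compile(r"[A-Za-z0-9]+")


def visual_to_logical(line: str) -> str:
    """Naively convert one visually ordered RTL line to logical order.

    Step 1: reverse the whole line, so Hebrew ends up in reading order.
    Step 2: reverse each LTR run back, since step 1 flipped it too.
    """
    reversed_line = line[::-1]
    return LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)


# Visual order stores the leftmost displayed character first.
visual = "123 \u05d1\u05d0"
logical = visual_to_logical(visual)
print([f"U+{ord(ch):04X}" for ch in logical])
```

A production converter would need a proper inverse of the Unicode Bidirectional Algorithm; the point here is only that the data is transformed once on the server side and then served as ordinary logical text.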

Considerably more details can be found at: http://www.w3.org/International/questions/qa-visual-vs-logical 
and also in articles / materials linked to it.

With the general tendency of software to move to the Cloud / Mobile worlds, the need to work properly with legacy (visually ordered) data is more obvious than ever. There are efforts underway to develop a JS-based widget library (based on Dojo) which has full support for legacy data. In this solution the data is converted to Unicode (however, it remains visually ordered). To support work with visually ordered data we use standard HTML markup.
Comment 31 Ian 'Hixie' Hickson 2014-02-06 19:27:39 UTC
If the browsers don't match the specs, doing nothing is not an option. We're not writing works of fiction here.

Ok, here's a concrete proposal. The following text would be added at the end of the rendering subsection titled "Bidirectional text":

--------------------8<--------------------
When the document's character encoding is iso-8859-8, the following rules are additionally expected to apply, following those above:

 address, blockquote, center, div, figure, figcaption, footer, form,
 header, hr, legend, listing, main, p, plaintext, pre, summary, xmp, article,
 aside, h1, h2, h3, h4, h5, h6, hgroup, nav, section, table, caption,
 colgroup, col, thead, tbody, tfoot, tr, td, th, dir, dd, dl, dt, menu,
 ol, ul, li, [dir=ltr i], [dir=rtl i], [dir=auto i], *|* {
   unicode-bidi: bidi-override; 
 }
 input:not([type=submit i]):not([type=reset i]):not([type=button i]), textarea,
 keygen { unicode-bidi: normal; }
--------------------8<--------------------

Does that look right?
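As a sketch of the "when the document's character encoding is iso-8859-8" condition: an implementation could resolve the declared label first and only then decide whether to add the rules above. The label sets below are a partial list written from memory of the Encoding Standard and should be treated as an assumption; the function names are hypothetical.

```python
# Assumption: partial label lists recalled from the Encoding Standard.
VISUAL_LABELS = frozenset(
    {"iso-8859-8", "iso8859-8", "iso_8859-8", "hebrew", "visual"}
)
LOGICAL_LABELS = frozenset({"iso-8859-8-i", "logical"})

ASCII_WHITESPACE = "\t\n\x0c\r "


def normalize_label(label: str) -> str:
    """Strip ASCII whitespace and ASCII-lowercase a declared encoding label."""
    stripped = label.strip(ASCII_WHITESPACE)
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in stripped
    )


def wants_visual_rules(label: str) -> bool:
    """True if the extra bidi-override rules would apply for this label."""
    return normalize_label(label) in VISUAL_LABELS


for declared in ("ISO-8859-8", " visual ", "iso-8859-8-i", "utf-8"):
    print(repr(declared), wants_visual_rules(declared))
```

The logical-order labels are listed only to make the contrast explicit: they share the decoder but would not trigger the extra rules.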
Comment 32 Ian 'Hixie' Hickson 2014-02-06 20:37:54 UTC
Anne said it did on IRC.
Comment 33 contributor 2014-02-06 20:38:01 UTC
Checked in as WHATWG revision r8470.
Check-in comment: Handle visual hebrew
http://html5.org/tools/web-apps-tracker?from=8469&to=8470
Comment 34 Martin Dürst 2014-02-08 03:44:35 UTC
(In reply to Ian 'Hixie' Hickson from comment #32)
> Anne said it did on IRC.

It looks good to me, too. I was wondering whether we need "direction: ltr;" together with "unicode-bidi: normal;", but my understanding is that it's not necessary because that's what's the overall default for CSS.
Comment 36 Aharon Lanin 2014-02-20 11:53:47 UTC
Sorry to get involved so late, but I just realized that it is unclear that CSS (any CSS) can fully describe how to handle iso-8859-8. The problem is with the title, alt, and placeholder attributes.

All that the HTML spec says about them is that they should be displayed in the element's directionality (unless the element has dir="auto" in which case each of these have to be each displayed in the directionality determined from the content of each attribute separately). Directionality is just LTR or RTL - it does not include unicode-bidi.

As for the CSS spec, I have no idea whether it covers the display of those attribute values at all. Writing Modes Level 3 certainly does not say anything about them.

There is a CSS test that does cover unicode-bidi:override and the alt attribute, http://www.w3.org/Style/CSS/Test/CSS2.1/20100127/html4/bidi-alt-001.htm, and it does demand that unicode-bidi:override be applied to the alt attribute. One problem with that is that the test fails in Mozilla, WebKit, Blink, and IE. A clean sweep.

Furthermore, I believe that the test's demands are inappropriate.

One of them is that an element's unicode-bidi:override be applied to the alt of an image *inside* that element, e.g.

<bdo dir="rtl">abc <img alt="def"> ghi</bdo>

I believe that this is clearly inappropriate because the alt is displayed in a separate box. Displaying it as "fed" would make as much sense as displaying "fed" for the inside span below:

<bdo dir=rtl>abc <span style="display:inline-block">def</span> ghi</bdo>

(I believe that the CSS spec as it stands would prohibit that happening. Needless to say it does not happen in any browser.)

More to the point in the context here, the test also demands that an element's unicode-bidi:override be applied to its own alt. And although the CSS rules suggested here currently don't include img in the list of elements to which they apply unicode-bidi:override, they could be modified to do so.

The question is whether it is a good idea to say that an element's unicode-bidi applies to its attributes, outside the scope of iso-8859-8. I don't think so. Consider the following (in a utf-8 page):

<input dir="ltr" style="unicode-bidi:override" placeholder="HEBREW FOR 'PASTE VISUAL HEBREW HERE'">

Do we really want the placeholder (or title) to be displayed backwards?
Comment 37 Koji Ishii 2014-02-24 19:24:53 UTC
(In reply to Aharon Lanin from comment #36)
> Sorry to get involved so late, but I just realized that it is unclear that
> CSS (any CSS) can fully describe how to handle iso-8859-8. The problem is
> with the title, alt, and placeholder attributes.
> :
> Do we really want the placeholder (or title) to be displayed backwards?

I don't have answers for you, sorry about that, but I guess it'd work better if you could open another bug, since this was not covered in the original description, and this bug has already been fixed in a good way as far as I understand.

Also if I understand correctly, what you raised is about <bdo> and bidi-override in general, while this bug is about iso-8859-8. Having a separate bug might work for all of us to discuss and handle easier.
Comment 38 Aharon Lanin 2014-02-24 19:55:16 UTC
Maybe I wasn't brief or clear enough.

I don't think that this bug is actually fixed because I don't think the CSS changes do anything to the way the title, alt, and placeholder attributes are displayed. In other words, in an iso-8859-8 page, alt="OLLEH" needs to be displayed as OLLEH, not as HELLO (which would be visually backwards). But if one would take an iso-8859-8 page, add to it the CSS that has been given here, and re-label it as iso-8859-8-I, alt="OLLEH" would indeed be displayed as HELLO because the CSS simply does not and should not apply to alt.

I am not changing the status of this bug or filing a new bug because I do not think that there is a way to fix the remaining issue. I am just recording it for posterity.
Comment 39 Koji Ishii 2014-02-25 14:51:39 UTC
I'm ok to handle bugs as you like, I'm not very familiar with rules at W3C.

Allow me to confirm a couple of understandings:
1. The iso-8859-8 pages now display correctly with the fix, except the display direction of alt text.
2. To fix alt for iso-8859-8, we need additional text in HTML5 spec (not in CSS)

Is this correct? Would you then suggest the text to fix?

Given iso-8859-8 is 0.00081%, I personally don't think alt is critical to fix, but I'm not the one to make the call anyway, and I don't have good knowledge on bidi to suggest a fix either.
Comment 40 Aharon Lanin 2014-02-25 15:14:22 UTC
(In reply to Koji Ishii from comment #39)
> 1. The iso-8859-8 pages now display correctly with the fix, except the
> display direction of alt text.

Yes, as far as I know

> 2. To fix alt for iso-8859-8, we need additional text in HTML5 spec (not in
> CSS)
> 
> Is this correct? Would you then suggest the text to fix?

Nothing comes to mind.

> Given iso-8859-8 is 0.00081%, I personally don't think alt is critical to
> fix, but I'm not the one to make the call anyway, and I don't have good
> knowledge on bidi to suggest a fix either.

I have no idea how important this is either. Just being a good citizen and reporting what I have found. Over and out.
Comment 41 Ian 'Hixie' Hickson 2014-03-08 00:44:16 UTC
(In reply to Koji Ishii from comment #39)
> 
> Given iso-8859-8 is 0.00081%

0.00081% of what? The 100 trillion or so pages on the Web? Or of page loads of users in predominantly right-to-left locales? Because 0.00081% of 100 trillion pages is a LOT of pages. How many pages do you think are in predominantly right-to-left locales? If there's a billion such pages, then that 0.00081% looks more like 80%...
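The back-of-the-envelope numbers above work out as follows. The 100 trillion and one billion figures are the hypothetical totals from this comment, not measurements.

```python
from fractions import Fraction

# 0.00081 percent expressed as an exact fraction of all pages.
share = Fraction("0.00081") / 100

# Hypothetical totals taken from the comment above.
total_pages = 100 * 10**12       # "100 trillion or so pages on the Web"
rtl_locale_pages = 10**9         # "If there's a billion such pages"

visual_hebrew_pages = share * total_pages
share_of_rtl_pages = visual_hebrew_pages / rtl_locale_pages

print(f"{int(visual_hebrew_pages):,} visual Hebrew pages")
print(f"{float(share_of_rtl_pages):.0%} of the hypothetical RTL-locale pages")
```

Exact fractions are used so the result is not blurred by floating-point rounding.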


Anyway, it may be that there's nothing more to fix here, to get parity with implementations:
   http://junkyard.damowmow.com/533 (Firefox)
   http://junkyard.damowmow.com/534 (Chrome)

If you do think something should be fixed, please file a new bug.