I18N / HTML5 Break-Out

Friday 09 Nov 2007

See also: IRC log


Anne, Addison, Richard, Hixie, Fantasai, Najib, Amit, Philippe, Hsivonen, J.Graham
Addison Phillips (I18N)


<aphillip_> http://www.w3.org/html/wg/html5/#determining0

<anne> http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html

<aphillip_> http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0088.html

<Hixie> http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html#parsing

<Hixie> http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html#the-input0

<scribe> ScribeNick: fantasai

Addison: There was a badly-titled thread saying something about making windows-1252 the default encoding.
... Our first reaction was, wouldn't it be nice if that were something else, say utf-8
... At the same time we recognize that there's a legacy encoding issue, since previous versions of HTML required iso-????

<hsivonen> http://hsivonen.iki.fi/charmod-checking/

<hsivonen> http://hsivonen.iki.fi/charmod-norm-checking/

Addison: If you actually look at the sections, 8.2 and ....
... It does not in fact say that the default encoding of the universe at large is windows 1252
... In the sequence there's looking at byte sequences, then using heuristics, etc.
... at the end of that sequence there's a paragraph that says
... if all else fails, you have to supply some implementation-defined default and we recommend you do these things.
... And windows-1252 just appears out of nowhere.
... One thought we had was for us to provide some information on why windows-1252 is preferable and how it differs from the standard ISO encodings.

<Hixie> "When a user agent would otherwise use the ISO-8859-1 encoding, it must instead use the Windows-1252 encoding."

Henri: that part is a violation of charmod

Addison doesn't consider that a violation of charmod

Addison: There are superset encodings and they're often tagged with the subset encodings.
... using the superset interpretation doesn't conflict with using the subset interpretation
... We're not proposing a substantive change, just providing more justification for what you're doing.
... We also looked at the structure of the paragraph, and had some concerns.
... one was the phrasing of "western demographics" etc
... We had several reactions.
... One, it's not clear what a western demographic is and how you tell when you're talking to one on the internet.
... We proposed 2 things, one of which was to turn two things around.
... We have a love of utf-8, and we'd like you to mention that one first and then the legacy thing
... We also think the wording could be changed somewhat on the windows-1252 to say that "in a legacy context, if you have to guess, you should guess this one"
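
A minimal sketch of the superset point Addison makes above, assuming Python's built-in codecs. The function name `decode_labelled` and the alias set are hypothetical and only for illustration; this is not the algorithm text from the HTML5 draft.

```python
# Labels treated as "really windows-1252" in this sketch (hypothetical, incomplete list).
LATIN1_LABELS = {"iso-8859-1", "iso8859-1", "latin1", "latin-1"}


def decode_labelled(data: bytes, label: str) -> str:
    """Decode data, reading an ISO-8859-1 label as windows-1252."""
    normalized = label.strip().lower()
    if normalized in LATIN1_LABELS:
        normalized = "windows-1252"
    # Python's cp1252 table leaves a few bytes (e.g. 0x81) undefined;
    # errors="replace" turns those into U+FFFD instead of raising.
    return data.decode(normalized, errors="replace")


# Bytes 0xA0-0xFF mean the same thing under both encodings.
same = decode_labelled(b"caf\xe9", "ISO-8859-1")

# Bytes 0x80-0x9F differ: ISO-8859-1 maps them to C1 control characters,
# windows-1252 maps most of them to printable punctuation.
quotes = decode_labelled(b"\x93hi\x94", "ISO-8859-1")
```

Under strict ISO-8859-1, the bytes 0x93 and 0x94 decode to the invisible controls U+0093 and U+0094, which is rarely what the author of a legacy page intended.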

Ian: I haven't gotten to that issue yet, haven't looked at it in detail, sounds ok

Richard: Is it purely editorial?

Addison: It doesn't change the result, it just changes how you explain the result.

Ian: Do you have any recommendation for dealing with say Japan and other parts of East Asia?

Addison: There are a variety of things in step #7 that allow for various heuristics and sniffing.

Ian: windows-1252 is fine for US and UK, but what about other places?

Felix: Depends on what device.

Addison: Most implementations use information in the browser, e.g. what the browser uses or if a narrower auto-detect is set (as for Japanese)

Ian: So in the Japanese cases, you expect that the rest of the steps would take care of it?

Addison: I think you'd trap those encodings before you get to step 7(?)
... Might want to mention that in some cases, when you get a subset encoding label, you should use the superset encoding.
... I think we can provide that information.

Ian: I believe when I wrote that section that I checked a browser and that was the only mapping they had.

Addison: Most browsers don't just do GBK, but do ????
... There are some cases, such as in Japan, where the byte patterns are completely different.
... where the encoding schemes are different even though the charset is the same
... that kind of autodetection is a separate thing
... I think this is still valid.
... The only question I have is, if you're thinking "what should happen in step 7" is some language-dependent or context-dependent thing ...

Hixie: In this final step, you don't have any information from the content

Addison: You might want to think about splitting step 7 and doing a utf-8 detection first
... UTF-8 has recognizable byte patterns, it would be great to put that first before saying "use your favorite legacy encoding"
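
A rough sketch of the "try UTF-8 first, then a legacy default" idea, assuming Python's strict UTF-8 decoder is an acceptable stand-in for byte-pattern detection. The function name `guess_decode` and its `legacy_fallback` parameter are hypothetical; the draft's step 7 does not define them.

```python
from typing import Tuple


def guess_decode(data: bytes, legacy_fallback: str = "windows-1252") -> Tuple[str, str]:
    """Return (text, encoding_used), preferring UTF-8 when the bytes are valid UTF-8."""
    try:
        # Strict decoding rejects stray lead bytes and continuation bytes,
        # which is why most legacy-encoded non-ASCII text fails here.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Implementation-defined default; windows-1252 is only one possible choice.
        return data.decode(legacy_fallback, errors="replace"), legacy_fallback


utf8_result = guess_decode("caf\u00e9".encode("utf-8"))
legacy_result = guess_decode(b"caf\xe9")
```

This only covers decoding. It says nothing about which encoding a form submission from such a page should use, which is the concern Hixie raises next.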

Hixie: The concern is what happens if the user enters some bytes into the form and then submits it?

Addison: We were just looking at that in the i18n working group

Hixie: We'd have to make sure that that's what the server was expecting.

Felix: What information are you looking at to guess what encoding the user applies?

Hixie: Typically different localizations of the browser have different default encodings.
... well, the email's in my pile. I don't know when I'll get to it.

Addison: We'll look at superset encodings and try to write up a document that you can reference.


Richard Ishida: W3C Internationalization Lead

Anne van Kesteren: Opera Software

Elika: fantasai, CSSWG Invited Expert, works on international text layout

Addison Phillips: Yahoo, i18n wg

Amit Parashar: something-or-other chair

Henri Sivonen: working on HTML5 conformance checker

Ian Hickson: HTML5 editor

Felix Sasaki: i18n Core, i18n ITS and Web Services Policy WG [W3C]

<plh> Philippe Le Hegaret: W3C, Architecture Domain (XML, Web Services, i18n), and Video

Ishida: Can you explain the alt text issue?

<najib> Najib Tounsi, W3C Morocco Office Mgr.

Ishida: We believe that you should never put human-readable text in an attribute value because you can't put markup in it
... which is important for various i18n reasons: bidi, language annotation, ruby, etc.

Hixie: We still have the <img> element; we can't get rid of it. It still has alt attr, because it's had that.
... We can't give it content because HTML parsers all close it right after the start tag.
... We also have the <object> tag, which has full fallback capabilities.

Ishida: Would the group advise the <object> tag then?

Hixie: I don't think we'll have a recommendation one way or another; if your fallback content needs element content, then you'll have to use <object>
... We've been doing some work, e.g. Acid2, on making sure the <object> tag works properly in various browsers.

Ishida asks about some XHTML2 stuff

Hixie: The XHTML2 group did two things, one was switching some attributes into elements, e.g. title attributes.
... Then they also went and started using RDF for everything: we are certainly not going to do that.
... For the first one, I'm not convinced that the benefits of using an element for these things outweigh the costs
... We can try not to do things like that in the future though
... This problem comes up in many places, e.g. in DOM APIs that take a string.
... There are also places where we can't make such changes, such as the <title> element
... whose content winds up in places like filenames where you can't have structured markup anyway

Ishida: Can you use bidi in filenames?

Hixie: probably, but I'm not going to recommend it

Ishida: We might need to start thinking about how to convert text from markup to strings with bidi control characters.

<anne> (I think HTML 5 should get &rlo;, &lro;, and &pdf; (or something in that direction) for BiDi. These are already in IE.)

Hixie: We did consider having a DOM attribute that would pull out e.g. bidi control characters from the markup and alt text from images
... not sure where that's going
... I would recommend finding solutions for plaintext, since that will work for both
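
One way the markup-to-plaintext conversion Ishida mentions might look, sketched with Python's standard `html.parser`. The class `BidiFlattener`, the helper `flatten`, and the short `VOID` list are hypothetical, and the sketch assumes well-nested input; it is not the DOM attribute Hixie refers to.

```python
from html.parser import HTMLParser

LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
EMBED = {"ltr": LRE, "rtl": RLE}
VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class BidiFlattener(HTMLParser):
    """Replace dir="ltr"/dir="rtl" elements with embedding control characters."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []  # one flag per open element: did it open an embedding?

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        direction = (dict(attrs).get("dir") or "").lower()
        mark = EMBED.get(direction)
        if mark:
            self.out.append(mark)
        self.stack.append(mark is not None)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack.pop():
            self.out.append(PDF)

    def handle_data(self, data):
        self.out.append(data)


def flatten(markup: str) -> str:
    parser = BidiFlattener()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


flat = flatten('abc <span dir="rtl">XYZ</span> def')
```

The resulting string carries the direction information in the text itself, so it can be placed in an attribute value or passed to an API that only takes a string.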

Discussion of that

language tags are in Unicode, but were deprecated as soon as they were added: they were added as deprecated and should never be used

<anne> (even though the characters they map to are apparently deprecated)

discussion of markup-plaintext thing

<apppp> reference RFC 3066 should point to BCP 47

Addison notes that the i18n group needs to review the date parsing things

<najib> +1 for adding &rle;, ..., &pdf; in HTML

Henri notes that it's using ISO dates anyway

najib, if we're adding more entities I want &zwsp;


<najib> It depends on usage frequencies. :-)

Validator checking entity reqs

Henri: I don't check that character entities are only used for characters that are unclear.
... because I can't tell mechanically whether the character is unclear

<anne> fantasai, I think &zwsp; is also supported by IE


let's add it :P

all the characters next to it have names,

zwnj, zwj etc

<najib> I don't have IE on MacOS :-( & :-)

Ishida explains that this part of charmod is about best practices

it's not should in the normative sense

Elika: Maybe you should go through the document and change the wording of should sentences that don't match RFC2119 to something else

Ishida: Well, we mean it that way for authors. Maybe we need to create different classes and explain which recommendations apply to which

<fsasaki> http://hsivonen.iki.fi/charmod-norm-checking/

Henri: I documented which constructs in HTML5 result in a continuous string
... I don't have any other comment there except that I wrote this and it is available :)
... I have another comment, but it's targeted at the Unicode/ICU specs

Ishida: Might want to post to the unicode list

<apppp> Title: I18N / HTML5 break out session

<apppp> Scribe: fantasai
