See also: IRC log
<aphillip_> http://www.w3.org/html/wg/html5/#determining0
<anne> http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html
<aphillip_> http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0088.html
<Hixie> http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html#parsing
<Hixie> http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html#the-input0
<scribe> ScribeNick: fantasai
Addison: There was a badly-titled
thread saying something about making windows-1252 the default
encoding.
... Our first reaction was, wouldn't it be nice if that were
something else, say utf-8
... At the same time we recognize that there's a legacy
encoding issue, since previous versions of HTML required
iso-????
<hsivonen> http://hsivonen.iki.fi/charmod-checking/
<hsivonen> http://hsivonen.iki.fi/charmod-norm-checking/
Addison: If you actually look at
the sections, 8.2 and ....
... It does not in fact say that the default encoding of the
universe at large is windows-1252
... In the sequence there's looking at byte sequences, then
using heuristics, etc.
... at the end of that sequence there's a paragraph that
says
... if all else fails, you have to supply some
implementation-defined default and we recommend you do these
things.
... And windows-1252 just appears out of nowhere.
... One thought we had was for us to provide some information
on why windows-1252 is preferable and how it differs from the
standard ISO encodings.
<Hixie> "When a user agent would otherwise use the ISO-8859-1 encoding, it must instead use the Windows-1252 encoding."
Henri: that part is a violation of charmod
Addison doesn't consider that a violation of charmod
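
As an illustration of the rule quoted above, the following is a minimal sketch of where the two encodings actually differ. The helper name `decode_both` is hypothetical, and Python's `cp1252` codec is used here as a stand-in for windows-1252. Bytes 0x80-0x9F decode to C1 control characters under ISO-8859-1 but to printable characters (curly quotes, dashes and so on) under windows-1252; every other byte decodes identically.

```python
def decode_both(data: bytes):
    """Decode the same bytes as ISO-8859-1 and as windows-1252 (Python's cp1252)."""
    return data.decode("iso-8859-1"), data.decode("cp1252")


# 0x93 and 0x94 are curly quotes in windows-1252, C1 controls in ISO-8859-1.
latin1_text, cp1252_text = decode_both(b"\x93hi\x94")
print(repr(latin1_text))  # control characters around "hi"
print(repr(cp1252_text))  # curly quotes around "hi"

# Outside 0x80-0x9F the two decodings agree.
same_a, same_b = decode_both(b"caf\xe9")
print(same_a == same_b)
```

Content labelled ISO-8859-1 rarely intends C1 control characters, so reading it as windows-1252 generally changes only bytes that would otherwise be unprintable.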
Addison: There are superset
encodings and they're often tagged with the subset
encodings.
... using the superset interpretation doesn't conflict with
using the subset interpretation
... We're not proposing a substantive change, just providing
more justification for what you're doing.
... We also looked at the structure of the paragraph, and had
some concerns.
... one was the phrasing of "western demographics" etc
... We had several reactions.
... One, it's not clear what a western demographic is and how you
tell when you're talking to one on the internet.
... We proposed two things, one of which was to swap the order of
the two recommendations.
... We have a love of utf-8, and we'd like you to mention that
one first and then the legacy thing
... We also think the wording could be changed somewhat on the
windows-1252 to say that "in a legacy context, if you have to
guess, you should guess this one"
Ian: I haven't gotten to that issue yet, haven't looked at it in detail, sounds ok
Richard: Is it purely editorial?
Addison: It doesn't change the result, it just changes how you explain the result.
Ian: Do you have any recommendation for dealing with say Japan and other parts of East Asia?
Addison: There are a variety of things in step #7 that allow for various heuristics and sniffing.
Ian: windows-1252 is fine for US and UK, but what about other places?
Felix: Depends on what device.
Addison: Most implementations use information in the browser, e.g. what the browser uses or if a narrower auto-detect is set (as for Japanese)
Ian: So in the Japanese cases, you expect that the rest of the steps would take care of it?
Addison: I think you'd trap those
encodings before you get to step 7(?)
... Might want to mention that in some cases, when you get a
subset encoding label, you should use the superset encoding.
... I think we can provide that information.
Ian: I believe when I wrote that section that I checked a browser and that was the only mapping they had.
Addison: Most browsers don't just
do GBK, but do ????
... There are some cases, such as in Japan, where the byte
patterns are completely different.
... where the encoding schemes are different even though the
charset is the same
... that kind of autodetection is a separate thing
... I think this is still valid.
... The only question I have is, if you're thinking "what
should happen in step 7" is some language-dependent or
context-dependent thing ...
Hixie: In this final step, you don't have any information from the content
Addison: You might want to think
about splitting step 7 and doing a utf-8 detection first
... UTF-8 has recognizable byte patterns, it would be great to
put that first before saying "use your favorite legacy
encoding"
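
A minimal sketch of the split Addison suggests, assuming a strict UTF-8 decode is an acceptable detection test and that windows-1252 (Python's `cp1252`) is the implementation-defined legacy default. The function name `decode_utf8_first` and its `legacy` parameter are hypothetical and do not come from the HTML5 draft.

```python
from typing import Tuple


def decode_utf8_first(data: bytes, legacy: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used): strict UTF-8 if the bytes are valid, else the legacy default."""
    try:
        # Strict decoding rejects stray lead or continuation bytes, which
        # legacy 8-bit text containing non-ASCII characters usually has.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" because Python's cp1252 leaves a few bytes
        # (for example 0x81) unmapped.
        return data.decode(legacy, errors="replace"), legacy


print(decode_utf8_first("\u65e5\u672c\u8a9e".encode("utf-8")))
print(decode_utf8_first(b"caf\xe9"))
```

Pure ASCII input passes the UTF-8 test trivially, and a short legacy byte sequence can happen to be valid UTF-8, so this check does not settle which encoding a server expects on form submission.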
Hixie: The concern is what happens if the user enters some bytes into the form and then submits it?
Addison: We were just looking at that in the i18n working group
Hixie: We'd have to make sure that that's what the server was expecting.
Felix: What information are you looking at to guess what encoding the user applies?
Hixie: Typically different
localizations of the browser have different default
encodings.
... well, the email's in my pile. I don't know when I'll get to
it.
Addison: We'll look at superset encodings and try to write up a document that you can reference.
Introductions
Richard Ishida: W3C Internationalization Lead
Anne van Kesteren: Opera Software
Elika: fantasai, CSSWG Invited Expert, works on international text layout
Addison Phillips: Yahoo, i18n wg
Amit Parashar: something-or-other chair
Henri Sivonen: working on HTML5 conformance checker
Ian Hickson: HTML5 editor
Felix Sasaki: i18n Core, i18n ITS and Web Services Policy WG [W3C]
<plh> Philippe Le Hegaret: W3C, Architecture Domain (XML, Web Services, i18n), and Video
Ishida: Can you explain the alt text issue?
<najib> Najib Tounsi, W3C Morocco Office Mgr.
Ishida: We believe that you
should never put human-readable text in an attribute value
because you can't put markup in it
... which is important for various i18n reasons: bidi, language
annotation, ruby, etc.
Hixie: We still have the
<img> element; we can't get rid of it. It still has alt
attr, because it's had that.
... We can't give it content because HTML parsers all close it
right after the start tag.
... We also have the <object> tag, which has full
fallback capabilities.
Ishida: Would the group advise the <object> tag then?
Hixie: I don't think we'll have a
recommendation one way or another; if your fallback content
needs element content, then you'll have to use
<object>
... We've been doing some work, e.g. Acid2, on making sure the
<object> tag works properly in various browsers.
Ishida asks about some XHTML2 stuff
Hixie: The XHTML2 group did two
things, one was switching some attributes into elements, e.g.
title attributes.
... Then they also went and started using RDF for everything: we
are certainly not going to do that.
... For the first one, I'm not convinced that the benefits of
using an element for these things outweigh the
costs
... We can try not to do things like that in the future
though
... This problem comes up in many places, e.g. in DOM APIs that
take a string.
... There are also places where we can't make such changes,
such as the <title> element
... whose content winds up in places like filenames where you
can't have structured markup anyway
Ishida: Can you use bidi in filenames?
Hixie: probably, but I'm not going to recommend it
Ishida: We might need to start thinking about how to convert text from markup to strings with bidi control characters.
<anne> (I think HTML 5 should get &rlo;, &lro;, and &pdf; (or something in that direction) for BiDi. These are already in IE.)
Hixie: We did consider having a
DOM attribute that would pull out e.g. bidi control characters
from the markup and alt text from images
... not sure where that's going
... I would recommend finding solutions for plaintext, since
that will work for both
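
A rough sketch of the markup-to-plaintext conversion discussed above, under several assumptions: `dir` on `<bdo>` maps to the Unicode override controls (LRO/RLO), `dir` on any other element maps to the embedding controls (LRE/RLE), each is closed with PDF, and an `<img>` contributes its `alt` text. The class `BidiFlattener` and function `flatten` are hypothetical names, not a proposed DOM API, and the mapping simplifies how browsers treat block-level `dir`.

```python
from html.parser import HTMLParser

LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"
VOID = {"img", "br", "hr", "input", "meta", "link", "wbr"}


class BidiFlattener(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # (tag, emitted_control) for each open non-void element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.out.append(attrs.get("alt") or "")  # alt text stands in for the image
        if tag in VOID:
            return
        direction = (attrs.get("dir") or "").lower()
        control = ""
        if direction in ("ltr", "rtl"):
            if tag == "bdo":
                control = RLO if direction == "rtl" else LRO
            else:
                control = RLE if direction == "rtl" else LRE
        self.out.append(control)
        self.stack.append((tag, bool(control)))

    def handle_endtag(self, tag):
        if tag in VOID or not any(t == tag for t, _ in self.stack):
            return  # ignore void elements and stray end tags
        while self.stack:
            open_tag, emitted = self.stack.pop()
            if emitted:
                self.out.append(PDF)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def flatten(markup: str) -> str:
    parser = BidiFlattener()
    parser.feed(markup)
    parser.close()
    # Terminate any embeddings left open by unclosed elements.
    tail = "".join(PDF for _, emitted in parser.stack if emitted)
    return "".join(parser.out) + tail


print(repr(flatten('a <span dir="rtl">bc</span> d')))
```

Because the result is an ordinary string, it could be used anywhere plain text is expected, which is the property Hixie is asking for.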
Discussion of that
language tags are in Unicode, but were deprecated as soon as they were added: they were added as deprecated and should never be used
<anne> (even though the characters they map to are apparently deprecated)
discussion of markup-plaintext thing
<apppp> reference RFC 3066 should point to BCP 47
Addison notes that the i18n group needs to review the date parsing things
<najib> +1 to adding &rle;, ..., &pdf; in HTML
Henri notes that it's using ISO dates anyway
najib, if we're adding more entities I want &zwsp;
:)
<najib> It depends on usage frequencies. :-)
Henri: I don't check that
character entities are only used for characters that are
unclear.
... because I can't tell mechanically whether the character is
unclear
<anne> fantasai, I think &zwsp; is also supported by IE
cool
let's add it :P
all the characters next to it have names,
zwnj, zwj etc
<najib> I don't have IE on MacOS :-( & :-)
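
For reference, a small sketch of what the entity names floated above would expand to. The names in `PROPOSED` are only the ones mentioned in this discussion and are assumptions, not entities defined by the HTML5 draft; the code points are the standard Unicode ones. The helper `expand` is hypothetical.

```python
import re
import unicodedata

# Hypothetical entity names from the discussion, mapped to Unicode code points.
PROPOSED = {
    "lre": "\u202a",   # LEFT-TO-RIGHT EMBEDDING
    "rle": "\u202b",   # RIGHT-TO-LEFT EMBEDDING
    "pdf": "\u202c",   # POP DIRECTIONAL FORMATTING
    "lro": "\u202d",   # LEFT-TO-RIGHT OVERRIDE
    "rlo": "\u202e",   # RIGHT-TO-LEFT OVERRIDE
    "zwsp": "\u200b",  # ZERO WIDTH SPACE
}


def expand(text: str) -> str:
    """Replace the proposed entity references; leave any other reference untouched."""
    return re.sub(r"&([A-Za-z]+);", lambda m: PROPOSED.get(m.group(1), m.group(0)), text)


for name, char in PROPOSED.items():
    print(f"&{name}; -> U+{ord(char):04X} {unicodedata.name(char)}")
```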
Ishida explains that this part of charmod is about best practices;
it's not "should" in the normative sense
Elika: Maybe you should go through the document and change the wording of "should" sentences that don't match RFC 2119 to something else
Ishida: Well, we mean it that way for authors. Maybe we need to create different classes and explain which recommendations apply to which
<fsasaki> http://hsivonen.iki.fi/charmod-norm-checking/
Henri: I documented which
constructs in HTML5 result in a continuous string
... I don't have any other comment there except that I wrote
this and it is available :)
... I have another comment, but it's targeted at the
Unicode/ICU specs
Ishida: Might want to post to the unicode list
<apppp> Title: I18N / HTML5 break out session
<apppp> Scribe: fantasai
<apppp> ScribeNick: fantasai