This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 6858 - More details needed for "ASCII-compatible encoding"
Summary: More details needed for "ASCII-compatible encoding"
Status: VERIFIED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL: http://www.w3.org/TR/html5/infrastruc...
Whiteboard:
Keywords: NoReply
Depends on:
Blocks:
 
Reported: 2009-05-01 08:48 UTC by Martin Dürst
Modified: 2010-10-04 13:58 UTC
CC List: 4 users

See Also:


Description Martin Dürst 2009-05-01 08:48:17 UTC
The draft currently says:
"An ASCII-compatible character encoding is one that is a superset of US-ASCII (specifically, ANSI_X3.4-1968) for bytes in the set 0x09, 0x0A, 0x0C, 0x0D, 0x20 - 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A."
What exactly is meant? Is Shift_JIS ASCII-compatible? It encodes the US-ASCII characters at the same bytes as US-ASCII, but it also uses these bytes as the second bytes of multibyte characters.
http://www.w3.org/TR/html5/forms.html#application-x-www-form-urlencoded-encoding-algorithm refers to ASCII-compatible, and in that context, I would want to have Shift_JIS be ASCII-compatible, but from the above definition, I'd lean to the conclusion that Shift_JIS may not be ASCII-compatible.
Please clarify whether ASCII-compatible means that the above bytes are only used for US-ASCII characters, or whether it means that for US-ASCII characters, only the above bytes are used.
If you use ASCII-compatible in several places in your spec, you might have to check and maybe split the definition into two, to take into account different circumstances.
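The ambiguity raised in this description can be illustrated with a short Python sketch. It assumes Python's built-in "shift_jis" codec is a reasonable stand-in for the encoding under discussion; the helper name byte_roles and the sample string are hypothetical and chosen only for illustration. It shows byte 0x41 both standing alone as "A" and appearing as the second byte of a two-byte character.

```python
# Illustrative sketch only: Python's "shift_jis" codec stands in for the
# encoding discussed in the report; byte_roles is a hypothetical helper.
ASCII_A = 0x41  # one of the bytes listed in the draft definition


def byte_roles(text, encoding, byte):
    """For each occurrence of `byte` in the encoded form of `text`, report
    whether it encodes a character on its own ("single") or is part of a
    longer sequence ("multibyte")."""
    roles = []
    for ch in text:
        encoded = ch.encode(encoding)
        for b in encoded:
            if b == byte:
                roles.append("single" if len(encoded) == 1 else "multibyte")
    return roles


# "A" is one byte (0x41); katakana "ア" is two bytes whose second is 0x41.
roles = byte_roles("Aア", "shift_jis", ASCII_A)
print(roles)
```

Under the first reading in the report (the listed bytes are used only for US-ASCII characters), the second occurrence would disqualify Shift_JIS; under the second reading (US-ASCII characters use only the listed bytes), it would not.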
Comment 1 Ian 'Hixie' Hickson 2009-06-28 10:10:49 UTC
I've tried to fix this; please let me know if the new text is ok.
Comment 2 Martin Dürst 2009-06-29 01:57:49 UTC
http://www.w3.org/TR/html5/infrastructure.html#ascii-compatible-character-encoding doesn't seem to be updated. I'm assuming you are working on an editing copy; if you tell me where it is, I'll gladly have a look at it.
Comment 3 Ian 'Hixie' Hickson 2009-06-29 04:25:50 UTC
Oh, yeah, you don't want to use the TR version, that's out of date even before it gets published. Use the WHATWG version, it's the most up to date:
   http://www.whatwg.org/specs/web-apps/current-work/
...or alternatively the version on dev.w3.org, which is only slightly behind:
   http://dev.w3.org/html5/spec/Overview.html
Comment 4 Martin Dürst 2009-06-29 07:19:59 UTC
Looking at http://dev.w3.org/html5/spec/Overview.html#ascii-compatible-character-encoding:

This solves the problem, but is needlessly complex. Instead of

An ASCII-compatible character encoding is a single-byte or variable-length encoding in which the bytes 0x09, 0x0A, 0x0C, 0x0D, 0x20 - 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A, ignoring bytes that are the second and later bytes of multibyte sequences, all correspond to single-byte sequences that map to the same Unicode characters as those bytes in ANSI_X3.4-1968 (US-ASCII).

the following would say the same but would be simpler:

An ASCII-compatible character encoding is a character encoding in which the Unicode characters that have byte values 0x09, 0x0A, 0x0C, 0x0D, 0x20 - 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A in ANSI_X3.4-1968 (US-ASCII, [RFC1345]) are represented by exactly and only the same byte values.

The note after that is also a good start, but also needs some more work. Shift_JIS is used on every Japanese PC and Mac, so I wouldn't call this an exotic encoding. On the other hand, I didn't find a *submitted* draft for UTF-8+names, so whatever you think about it, it's clearly a dead end at this point in time. So I would reword:

Note: This includes such exotic encodings as Shift_JIS and variants of ISO-2022, even though it is possible for bytes like 0x70 to be part of longer sequences that are unrelated to their interpretation as ASCII. It excludes such encodings as UTF-7, UTF-8+names, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC variants.

to something like:

Note: This includes encodings such as Shift_JIS and variants of ISO-2022, where it is possible for bytes like 0x70 to appear as part of multibyte sequences that are unrelated to their interpretation as ASCII. It excludes encodings such as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC variants.
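One part of the definition quoted in this comment can be tested mechanically: whether each listed byte, standing alone, decodes to the same character as in US-ASCII. The Python sketch below checks only that necessary condition, using Python's own codec names as stand-ins for the encodings named in the note; the function name passes_single_byte_check is hypothetical. It does not model the "ignoring bytes that are the second and later bytes of multibyte sequences" clause, and it cannot by itself rule out stateful encodings such as UTF-7 or HZ-GB-2312, which the note excludes for reasons a single-byte test does not see.

```python
# Bytes listed in the draft definition quoted above.
LISTED_BYTES = (
    [0x09, 0x0A, 0x0C, 0x0D]
    + list(range(0x20, 0x23))   # 0x20 - 0x22
    + [0x26, 0x27]
    + list(range(0x2C, 0x40))   # 0x2C - 0x3F
    + list(range(0x41, 0x5B))   # 0x41 - 0x5A
    + list(range(0x61, 0x7B))   # 0x61 - 0x7A
)


def passes_single_byte_check(encoding):
    """Necessary condition only: every listed byte, decoded on its own,
    must yield the character US-ASCII assigns to that byte."""
    for b in LISTED_BYTES:
        try:
            decoded = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            # The byte is not a complete sequence in this encoding.
            return False
        if decoded != chr(b):
            return False
    return True


# Python codec names, used here as assumptions about the encodings named.
results = {
    name: passes_single_byte_check(name)
    for name in ("utf-8", "shift_jis", "utf-16-le", "cp037")
}
print(results)
```

Here "cp037" is used as one example of an EBCDIC variant and "utf-16-le" as UTF-16 without a BOM.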
Comment 5 Ian 'Hixie' Hickson 2009-06-29 08:36:34 UTC
Your rephrasing of the first paragraph doesn't work because it is ambiguous about whether multibyte sequences that contain ASCII-like bytes are included or not. Also, it excludes other representations of those same characters, which as far as I can tell isn't necessary.

I'll fix the second paragraph.
Comment 6 Martin Dürst 2009-06-29 10:03:22 UTC
Multibyte sequences that contain ASCII-like bytes are included by the fact that they are not excluded. Other representations of the same characters are excluded, because I thought that that's what you want. If the only thing that you need for an ASCII-compatible encoding is that characters such as !, ", &, ',... can, *among other ways*, be encoded by the relevant ASCII bytes, then indeed that has to be worded differently.

But before we lock down one wording, it might be good to understand why we have the ASCII-compatible encoding restriction in the first place.
Comment 7 Ian 'Hixie' Hickson 2009-06-29 10:43:50 UTC
The ASCII-compatible encoding concept exists for two reasons; first, to prevent authors from trying to declare the encoding using the <meta charset> feature in the cases where the encoding detection algorithm wouldn't ever find the charset (e.g. using <meta charset> alone in a UTF-16 file, with no BOM and no external charset declaration), and second, to restrict the character encodings that can be used in form submission to those that aren't incompatible with opaquely treating a URL as ASCII with some unknown bytes.
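The first reason can be illustrated with a small Python sketch. It is a deliberately simplified stand-in for a byte-level scan, not the encoding detection algorithm from the spec: it only asks whether the ASCII bytes of "<meta charset" appear literally in the encoded file. The sample document, the needle, and the function name declaration_visible are assumptions made for this illustration.

```python
# Simplified illustration: a byte-level scan for an ASCII declaration.
# This is NOT the spec's detection algorithm, only a literal byte search.
document = '<meta charset="utf-16"><title>t</title>'
needle = b"<meta charset"


def declaration_visible(text, encoding):
    """Return True if the ASCII bytes of the declaration appear literally
    in the file as encoded with `encoding`."""
    return needle in text.encode(encoding)


visibility = {
    name: declaration_visible(document, name)
    for name in ("utf-8", "shift_jis", "utf-16-le")
}
print(visibility)
```

In the UTF-16 case (here "utf-16-le", with no BOM) every ASCII character is followed by a zero byte, so a scan that treats the file as ASCII never sees the declaration, which is the situation the comment describes.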
Comment 8 Ian 'Hixie' Hickson 2009-06-29 23:28:19 UTC
Fixed the wording of the second paragraph.

Please feel free to reopen this bug if you think that other aspects should still change; I merely change the resolutions as a way to manage which bugs are on my "TODO" list, it is not an attempt to prevent further discussion.
Comment 9 Maciej Stachowiak 2010-03-14 14:48:07 UTC
This bug predates the HTML Working Group Decision Policy.

If you are satisfied with the resolution of this bug, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
  http://dev.w3.org/html5/decision-policy/decision-policy.html

This bug is now being moved to VERIFIED. Please respond within two weeks. If this bug is not closed, reopened or escalated within two weeks, it may be marked as NoReply and will no longer be considered a pending comment.