This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 23646 - "us-ascii" should not be an alias for "windows-1252"
Summary: "us-ascii" should not be an alias for "windows-1252"
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: All
OS: All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-27 00:22 UTC by Paul Eggert
Modified: 2014-11-30 02:30 UTC
CC List: 9 users

See Also:


Attachments

Description Paul Eggert 2013-10-27 00:22:15 UTC
http://encoding.spec.whatwg.org/ lists US-ASCII and equivalents (e.g., ASCII) as an alias for "windows-1252".  This isn't correct, as Windows 1252 has several code points above 127 that are not ASCII characters.  US-ASCII should be kept distinct from Windows 1252, just as it should be kept distinct from Latin-1 and all the other encodings that are extensions of US-ASCII.
Comment 1 Addison Phillips 2013-10-27 02:24:20 UTC
US-ASCII the 7-bit encoding certainly is distinct from windows-1252. However, the Encoding spec treats it as an alias for windows-1252 for the same reason it treats ISO 8859-1 as an alias for windows-1252. In both cases, windows-1252 is a true superset of the specified encoding. When you are decoding a byte sequence in one of these encodings and encounter a byte that US-ASCII or ISO 8859-1 treats as unassigned but which is assigned in windows-1252, it is highly likely that the byte sequence actually uses the windows-1252 encoding. 

The alternative (keeping these other encodings distinct) would result in additional replacement characters being generated in both the decoding and encoding directions. This is generally best practice on the Web, although the Encoding spec could be a bit more verbose in spelling this out.

This is, incidentally, one of the "willful violations" from the early drafts of HTML5, in this case of the W3C Character Model, which forbids this sort of renaming. While I tend to agree that software generally should use the encoding I specify and accept no substitutes, in practice this turns out to be a better choice.
Comment 2 Addison Phillips 2013-10-27 02:26:26 UTC
I could have been a bit clearer there :-/. One of those sentences should have read:

It's generally best practice on the Web to *not* generate additional replacement characters, since that removes information to no one's particular benefit.
Comment 3 John C Klensin 2013-10-27 14:15:54 UTC
Addison, I certainly agree that generating additional replacement characters is a bad idea. 

But the argument that ASCII (and ISO 8859-1) are reasonable aliases for "windows-1252" because the latter is a proper superset could also be used to argue that ASCII is a reasonable alias for UTF-8, because UTF-8 is also a proper superset. If one then assumes transitivity of aliases, your suggestion breaks down: if one thinks something is ASCII and some octet is out of range, that out-of-range octet could be either windows-1252 (or ISO 8859-1 if it is in the 0xA1 - 0xFF range) or part of a UTF-8 sequence.
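
A small sketch of that ambiguity (Python codecs, with 'cp1252' standing in for windows-1252):

# The same out-of-range bytes are plausible in more than one superset of ASCII.
payload = b"caf\xc3\xa9"
print(payload.decode("utf-8"))    # 'café'  -- 0xC3 0xA9 is a single UTF-8 sequence
print(payload.decode("cp1252"))   # 'cafÃ©' -- the same bytes read as two windows-1252 characters
print(payload.decode("latin-1"))  # 'cafÃ©' -- and likewise for ISO 8859-1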

The document itself is probably ok because of the way it separates single-byte and multi-byte operations but, if an implementation gets even slightly sloppy about terminology or labeling, it seems to me that those aliases help us get onto rather thin ice.  For better or worse, such sloppy behavior appears frequently on the Internet, probably most commonly induced by copying strings from one document and pasting them into another.
Comment 4 Addison Phillips 2013-10-27 17:38:50 UTC
John, I agree generally.

The problem here is that, when a document declares an encoding and/or UTF-8 has not been detected, one can instantiate one and only one encoder/decoder to handle the text.

Latin-1 is the more obvious one here. If you're decoding a page declared as "iso-8859-1" and you see a byte like 0x80, you *could* treat it as a C1 control character. But the C1 controls add no value to the page. It's very likely that byte is actually U+20AC (EURO SIGN). In fact, browsers and major websites already make that assumption and have done for quite a while. Hence the alias appearing in this document.
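
A minimal sketch of that difference (Python codecs, with 'cp1252' standing in for windows-1252 and 'latin-1' for ISO 8859-1):

payload = b"price: \x80 5"    # byte 0x80 in a page labeled "iso-8859-1"

# A strict ISO 8859-1 decoder maps 0x80 to the invisible C1 control U+0080.
print(payload.decode("latin-1"))   # 'price: \x80 5' -- a control character, no value added

# Decoding as windows-1252 yields the character the author almost certainly meant.
print(payload.decode("cp1252"))    # 'price: € 5' -- 0x80 becomes U+20AC EURO SIGN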

US-ASCII is a little different. It is, after all, a subset of virtually all encodings on the Web. But if you have a page declared in US-ASCII and instantiate a true US-ASCII-7 transcoder, you have to do something with the bytes from 0x80 to 0xFF. Making lots of U+FFFD is not very useful. Using Latin-1 as the converter for US-ASCII then makes sense. 

It might make more sense, in that case, if US-ASCII used the *true* ISO 8859-1 converter, since that encoding's mapping to Unicode is just to round trip the bytes with the first 256 Unicode characters. That, in fact, is a common enough trick for data of unknown origin and encoding where you don't want to lose the original byte values. But for a Web page this isn't very useful. The C1 controls are still invisible or tofu junk. Converting to likely printable characters is more useful. If it's wrong, at least you can see the mojibake, and there is a reasonable likelihood that it'll be the right way to interpret the bytes.
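
For illustration, the round-trip property mentioned above (a Python sketch using the 'latin-1' codec):

# ISO 8859-1 maps bytes 0x00-0xFF one-to-one onto U+0000-U+00FF, so decoding
# and re-encoding preserves arbitrary byte values exactly.
data = bytes(range(256))
assert data.decode("latin-1").encode("latin-1") == data

# windows-1252 lacks this property in its classic definition: bytes 0x81, 0x8D,
# 0x8F, 0x90 and 0x9D are unassigned, and most of the 0x80-0x9F range maps to
# punctuation rather than to the C1 controls.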

Still, that does call out: a true transcoder implementation (think iconv or what have you), really *DOES* need to distinguish each of these encodings. If you use the "Latin-1 encoding trick" I mention above but your transcoder treats the bytes as windows-1252, you'll be downright unhappy and annoyed (I know I'd be furious). But in a Web page, I want the browser to produce likely visible characters and C1 controls are (almost) always wrong.
Comment 5 John C Klensin 2013-10-28 11:51:13 UTC
Addison,

This all sounds reasonable.  Especially given the tendency for folks with other applications to try to generalize from web specs because the web is so popular and familiar to users, I think it suggests that the document would be considerably improved by some introductory text that stresses the limited applicability (could go in the preface or in Section 4 before the table) and what "alias" actually means in that table (either before the table or in an appropriate note).

"Encoding names that can usually be considered equivalent for rendering HTML web pages" would probably do the latter job although you or Anne can probably come up with something better.

FWIW, we should also probably try to get "US-ASCII" out of our preferred terminology repertoire.  It results, AFAIK, from some misunderstandings around the IETF and elsewhere 20-odd years ago.  The name of the relevant standard, repertoire, and coding system is "ASCII" ("American Standard Code for Information Interchange"), named back when ANSI's name was still "American Standards Association" or "ASA".  "US-ASCII" would be justified if there were, e.g., "CA-ASCII", "MX-ASCII", "BR-ASCII", etc., but there aren't.  "EU-ASCII", "JP-ASCII", or "ISO-ASCII" don't exist either and would be oxymorons if they did.  No problem with the document -- listing it as a non-preferred synonym for "ASCII" is fine -- but I note that this discussion has used it as if it were the preferred and most precise form.
Comment 6 Anne 2013-10-28 14:08:38 UTC
The applicability is not that limited... This document describes the encoding subsystem of the web platform.

ASCII itself is probably best defined as the U+0000 to U+007F code point range or 0x00 to 0x7F byte range, rather than saying just ASCII. Newer web platform specifications do that.

Anyway, this is what browsers do for the "us-ascii" label. They decode it per the windows-1252 encoding. So this is WONTFIX.
Comment 7 Jirka Kosek 2014-06-27 07:59:03 UTC
The trouble is that the Encoding spec defines algorithms both for encoding and decoding. For decoding it's completely reasonable to treat "us-ascii" as an alias for "windows-1252". However, when you encode text using the "us-ascii" encoding, any character outside U+0000-U+007F must be replaced or escaped (in HTML mode), or it's an error.
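
A minimal sketch of that distinction (Python codecs, with 'ascii' as a strict 7-bit encoder and 'cp1252' standing in for windows-1252):

text = "Copyright © 2013"

# A strict us-ascii encoder treats non-ASCII characters as an error...
try:
    text.encode("ascii")
except UnicodeEncodeError as e:
    print("error:", e)

# ...or replaces / escapes them, depending on the error mode.
print(text.encode("ascii", errors="replace"))            # b'Copyright ? 2013'
print(text.encode("ascii", errors="xmlcharrefreplace"))  # b'Copyright &#169; 2013'

# Per the Encoding Standard the "us-ascii" label means windows-1252, so the
# copyright sign instead becomes the single byte 0xA9.
print(text.encode("cp1252"))                             # b'Copyright \xa9 2013'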

So either "us-ascii" must be handled separately (at least for encoding purposes) or the Encoding spec should be changed to talk about decoding only.

Jirka
Comment 8 Anne 2014-06-27 09:51:26 UTC
That is not how the platform works. Please don't reopen bugs without suitable evidence.
Comment 9 Jirka Kosek 2014-06-28 10:27:21 UTC
(In reply to Anne from comment #8)
> That is not how the platform works. Please don't reopen bugs without
> suitable evidence.

Which part of the platform directly encodes characters above U+007F when the "us-ascii" encoding is used?

Either I'm missing something essential, or the scope of the Encoding spec is not specific enough.

Consider the following example. I have a page containing the copyright symbol (U+00A9). I want to save it in the "us-ascii" encoding using the XHTML syntax of HTML5. With the current spec the copyright symbol would go through as the byte 0xA9, which would cause a parsing error when the result is read back with an XML parser.
Comment 10 Anne 2014-06-28 10:35:39 UTC
Why would the XML parser result in an error? Surely it should use the same encoding layer.
Comment 11 Jirka Kosek 2014-06-28 10:47:42 UTC
(In reply to Anne from comment #10)
> Why would the XML parser result in an error? Surely it should use the same
> encoding layer.

Because 0xA9 is an invalid sequence in a 7-bit encoding. I have tried two randomly chosen XML parsers and both choke on this example:

$ cat test.xml
<?xml version="1.0" encoding="us-ascii"?>
<test>©</test>

$ xmllint --noout test.xml 
I/O error : encoder error
test.xml:2: parser error : Premature end of data in tag test line 2
<test>
      ^

$ xjparse test.xml
Attempting validating, namespace-aware parse
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Byte "169" is not a member of the (7-bit) ASCII character set.

So in my opinion the Encoding spec breaks compatibility with existing content and implementations with regard to the "us-ascii" encoding.
Comment 12 Anne 2014-06-28 10:51:41 UTC
All browsers agree on the Encoding Standard:

data:text/xml,<?xml%20version="1.0"%20encoding="us-ascii"?><test>%A9</test>

And given that this cannot be changed for HTML and having different encoding layers for different formats is crazy, given that everyone ought to use utf-8, I'm not particularly concerned.
Comment 13 Jirka Kosek 2014-06-28 11:03:58 UTC
(In reply to Anne from comment #12)
> All browsers agree on the Encoding Standard:
> 
> data:text/xml,<?xml%20version="1.0"%20encoding="us-ascii"?><test>%A9</test>

IE mangles such characters on decoding, so not all browsers.

> And given that this cannot be changed for HTML and having different encoding
> layers for different formats is crazy, given that everyone ought to use
> utf-8, I'm not particularly concerned.

Then please clearly state that this spec covers only HTML and web browsers, if that's your intention.

Otherwise the spec will be out of sync with the text encoders available in various programming platforms.
Comment 14 Anne 2014-06-28 11:08:13 UTC
That seems silly. The whole point of having specs is to forge interoperability and get programming languages to agree. So far encoding "standards" have done a bad job at that. The plan is to fix that.
Comment 15 Jirka Kosek 2014-06-28 11:09:11 UTC
(In reply to Anne from comment #12)
> And given that this cannot be changed for HTML and having different encoding

Is there any real reason why *encoding* when "us-ascii" is used for HTML can't be changed?

> given that everyone ought to use utf-8

It's usually when you are producing non-HTML/XML content that you need to use non-UTF encodings, and then you must be much more picky about how encoders work. So if the spec should be bent in some direction, it should be to cater for such cases.
Comment 16 Anne 2014-06-28 11:17:07 UTC
(In reply to Jirka Kosek from comment #15)
> Is there any real reason why *encoding* when "us-ascii" is used for HTML
> can't be changed?

Web content.


> It's usually when you are using non-HTML/XML content that you need to use
> non-UTF encodings and then you must be much more picky about how encoders
> work. So if spec should be bended some way then to cater for such cases.

Why would that be true? <form> submission and URL query parameter encoding are a thing and given the scale of the web they are widespread.

It seems much easier to ensure in a closed environment that the input to the encoder is in the range U+0000 to U+007F.
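
A minimal sketch of that check (Python; the function name is illustrative):

def encode_strict_ascii(text: str) -> bytes:
    # In a closed environment the producer can enforce the 7-bit range up
    # front, so how the "us-ascii" label is resolved never comes into play.
    if any(ord(ch) > 0x7F for ch in text):
        raise ValueError("input contains characters outside U+0000 to U+007F")
    return text.encode("ascii")

print(encode_strict_ascii("plain ASCII only"))   # b'plain ASCII only'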
Comment 17 Jirka Kosek 2014-06-28 11:36:16 UTC
(In reply to Anne from comment #16)
> (In reply to Jirka Kosek from comment #15)
> > Is there any real reason why *encoding* when "us-ascii" is used for HTML
> > can't be changed?
> 
> Web content.

Please be specific. There is surely existing content that relies on *decoding* "us-ascii" as "windows-1252". But where is the content, or where are the applications, that rely on this for *encoding*?

> > It's usually when you are using non-HTML/XML content that you need to use
> > non-UTF encodings and then you must be much more picky about how encoders
> > work. So if spec should be bended some way then to cater for such cases.
> 
> Why would that be true? <form> submission and URL query parameter encoding
> are a thing and given the scale of the web they are widespread.

Is "us-ascii" used there?
Comment 18 Jirka Kosek 2014-06-28 11:40:13 UTC
(In reply to Anne from comment #14)
> That seems silly. The whole point of having specs is to forge
> interoperability and get programming languages to agree. So far encoding
> "standards" did a bad job at that. The plan is to fix that.

If your plan starts with saying that "us-ascii" behaves differently than in the past and differently from how it's implemented in .NET, the JDK, iconv and many other libraries, then this seems silly to me.

I agree that all should agree on how encodings work, but it seems that in this quest for unification everything except browsers is ignored.
Comment 19 Anne 2014-06-28 11:43:05 UTC
I've seen sites rely on it for iso-8859-1. No doubt that's the case for us-ascii too. People label it as one, start using a password that contains €, rely on that being byte 0x80 because that is what the browser submits, done.
Comment 20 Paul Eggert 2014-06-28 16:05:50 UTC
As the original reporter, I'd like to mention the use case that prompted the bug report. I maintain the web page for the IANA time zone database, which gets patches by correspondents all over the world, many of whom are not experts in encodings. For decades the database has had a policy of using only ASCII, to avoid interoperability problems. This includes its web pages. We don't want these web pages to contain any non-ASCII characters; if they do, it's an error, and the browser should display a botch.

We are now slowly migrating to UTF-8, having (thankfully) bypassed the Latin-1 disasters entirely. At some point I expect we'll even allow UTF-8 in our web pages (but not yet). In the meantime, we don't want to give anybody the mistaken impression that the database will use windows-1252 or Latin-1 or any other unibyte encoding, because for us these encodings would be a disaster. And yet that's what our users' browsers tell them.

I can understand the use case that prompted some web developers to say "hey, just treat US-ASCII as Windows-1252". That may have made sense back in 1995 when unibyte encodings were still the typical use on the Web. But it doesn't make sense any more, and this discussion is a symptom of it.

Because of this problem, I have given up on charset="US-ASCII" and have switched our web pages to charset="UTF-8" even though they are strictly ASCII and any non-ASCII characters in them are an error. I suggest adding commentary to the standard that suggests to developers what to do in my situation, since evidently charset="US-ASCII" is not the right thing to do, and certainly charset="windows-1252" is not right either. If there's nothing developers can do and the standard does not support this use case, then the commentary should say so.

With this suggestion in mind, I also suggest that we change the status of this report back to REOPENED.
Comment 21 Anne 2014-06-28 16:40:28 UTC
Paul, in http://encoding.spec.whatwg.org/#names-and-labels we have 'Authors must use the utf-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it.' The standard offers no options other than utf-8. Using charset=us-ascii is already non-conforming and the browser is encouraged to alert web developers of this fact through the error console. Validators will do so as well.
Comment 22 Jirka Kosek 2014-06-29 16:05:57 UTC
(In reply to Anne from comment #19)
> I've seen sites rely on it for iso-8859-1. No doubt that's the case for
> us-ascii too. 

Do you have any evidence of us-ascii being used there? I would be very surprised to find such examples. In the '90s iso-8859-1 was the default. But if your characters were not available in iso-8859-1, other single-byte encodings were used. In my country there were something like five different encodings in use, depending on the browser and operating system, and you then had to find a way to handle this encoding mess. But there was no reason to change anything if iso-8859-1 was a superset of the characters you needed.

So rather than relying on speculation that 20 years ago us-ascii was used in some strange way, I would prefer having its definition reflect how it works in many existing implementations and libraries.

Anyway, the Encoding spec is now in Last Call, and since you reverted my previous reopening of this bug, I suggest that you reopen the bug yourself to give it proper LC handling and consideration.

Thanks, Jirka
Comment 23 Paul Eggert 2014-06-29 16:21:54 UTC
It was never common practice to use charset="us-ascii" when the text was actually Latin-1 or some other extension to ASCII. The default was Latin-1, and some validators would recommend charset="us-ascii" when the text was limited to characters in the range 00-7F. So the longstanding meaning of charset="us-ascii" was "This document is not using any characters outside the ASCII range, and I've checked it and that's what I want".

Again, I'm not asking that the standard be *changed*, only that this issue be *explained*. Currently this stuff is entirely a mystery to a non-expert (and it appears, even to some experts). That's not right.
Comment 24 Glenn Adams 2014-06-30 14:06:02 UTC
Anne, you are ignoring the substance of the comments. It is not acceptable to dismiss the stated concerns simply because you think "everyone ought to use utf-8".

Reopening.
Comment 25 Anne 2014-06-30 14:19:16 UTC
Jirka, the evidence I have is that the most widely deployed software exhibits this behavior. Unless that changes, I don't see a reason to change this document.

Paul, there are numerous things in this standard that are weird. That is why all the weird stuff is cataloged as "legacy". If you think a note of sorts would have helped you, could you perhaps suggest some wording and a location?

Glenn, thanks for the triaging help. Please be patient, we'll get through this.
Comment 26 Addison Phillips 2014-06-30 15:57:13 UTC
(In reply to Paul Eggert from comment #23)
> It was never common practice to use charset="us-ascii" when the text was
> actually Latin-1 or some other extension to ASCII. The default was Latin-1,
> and some validators would recommend charset="us-ascii" when the text was
> limited to characters in the range 00-7F. So the longstanding meaning of
> charset="us-ascii" was "This document is not using any characters outside
> the ASCII range, and I've checked it and that's what I want".

Look at it from the browser (or search engine or document consumer) point of view. If you have a document that declares "us-ascii", but, in fact, contains non-ASCII byte values, what should happen to those byte values when the document is interpreted? 

I find myself writing text here that I already said in or around comment 1, so I won't repeat myself. 

> 
> Again, I'm not asking that the standard be *changed*, only that this issue
> be *explained*. Currently this stuff is entirely a mystery to a non-expert
> (and it appears, even to some experts). That's not right.

I agree that an explanation is desirable. There is no discussion of superset encodings or why any of this occurs in the Encoding spec. A note is probably called for so that it won't be a mystery. Perhaps just after the "violation of UTS#22" note in section 4.2:

--
In many cases the legacy single-byte encoding selected has a larger character repertoire than that of the label actually used in the document. For example, both the "iso8859-1" and "us-ascii" labels use the "windows-1252" encoding. This is because user agents have historically applied the larger "superset" encoding in practice, since document authors tend to be imprecise in identifying the correct label.
--
Comment 27 Paul Eggert 2014-06-30 19:34:38 UTC
(In reply to Addison Phillips from comment #26)
> If you have a document that declares "us-ascii", but, in fact, contains non-ASCII byte values, what should happen to those byte values when the document is interpreted?

If the byte values are UTF-8 text, they should be interpreted as UTF-8.  We use UTF-8 for our other text files, and occasionally the UTF-8 inadvertently leaks into the HTML, so treating it as UTF-8 would be the most useful for us.  I don't think we're alone in this.

I realize that in this context many browsers interpret non-ASCII bytes using a unibyte encoding for legacy reasons, but some newer browsers do treat it as UTF-8.  I just now tried eww (which will be part of the next GNU Emacs release; see <http://www.emacswiki.org/emacs/eww>) and that's how it works.  The standard should allow this behavior.  More generally, the standard should allow the browser to heuristically decode invalid bytes in ways appropriate for the current user and context.

So I guess I am asking for a change to the standard after all.  Here's a proposed change, inspired by your wording.

* In section 4.2 step 2, change "the corresponding encoding" to "a corresponding encoding".

* In section 4.2's table, add the "us-ascii" label to the utf-8 encoding.

* Append the following text after section 4.2's Note:

In practice document authors tend to be imprecise in identifying the correct label, and the following table gives decoders advice and some leeway when dealing with incorrectly labeled documents.  For example, because the "iso8859-1" and "us-ascii" labels both correspond to the windows-1252 encoding, a user-agent given a document with either label can treat the document as if it were windows-1252.  Conversely, because the "us-ascii" label corresponds to both the utf-8 and the windows-1252 encodings, a user-agent given a document labeled "us-ascii" can decode it as either utf-8 or as windows-1252, depending on user preferences or other heuristics.


Assuming that the above suggestion is acceptable, I suppose we could also add other labels to superset encodings as appropriate, e.g., add "us-ascii" to "euc-jp".  This is not needed for my use case, though.
Comment 28 Anne 2014-07-01 06:53:59 UTC
Paul, that is unacceptable. Having browsers behave in non-deterministic ways when it comes to something as crucial as encodings would be extremely bad for interoperability and the future health of the web. It is exactly such differences (though typically more minor) for which we created this document.
Comment 29 Jirka Kosek 2014-07-01 07:12:22 UTC
(In reply to Anne from comment #25)
> Jirka, the evidence I have is that the most widely deployed software
> exhibits this behavior. Unless that changes, I don't see a reason to change
> this document.

Anne, could you please show me a browser function which deals with text encoding and which treats us-ascii the same way as windows-1252? So far, I haven't seen this, but I have provided several examples where treating us-ascii as an alias for windows-1252 creates interop problems.

Please note that I'm fine with treating us-ascii as windows-1252 for purposes of decoding.
Comment 30 Anne 2014-07-01 07:22:26 UTC
I mentioned <form> and URLs. There's also the TextEncoder() API.
Comment 31 Jirka Kosek 2014-07-01 07:31:54 UTC
(In reply to Anne from comment #30)
> I mentioned <form> and URLs. 

But you haven't provided an example of a real page which will try to encode form data or a URL using us-ascii. I don't believe there are such pages, and I have seen really crazy things during the web encoding hell of the '90s.

> There's also the TextEncoder() API.

It is a new API, and if I'm not mistaken existing implementations support only utf-* encodings, not us-ascii.
Comment 32 Anne 2014-07-01 07:37:17 UTC
(In reply to Jirka Kosek from comment #31)
> But you haven't provided example of real page which will try encode form
> data or URL using us-ascii.

You keep asking for different things. You asked for a browser function, not a web page. We already discussed the latter.
Comment 33 Jirka Kosek 2014-07-01 07:48:16 UTC
(In reply to Anne from comment #32)
> (In reply to Jirka Kosek from comment #31)
> > But you haven't provided example of real page which will try encode form
> > data or URL using us-ascii.
> 
> You keep asking for different things. You asked for a browser function, not
> a web page. We already discussed the latter.

The spec you have created treats us-ascii as an alias for windows-1252 for purposes of encoding. I disagree and ask for special treatment of us-ascii for encoding, based on examples showing what will break with the current treatment of us-ascii. 

You are disagreeing with such a change and saying that it will break something, but so far you have failed to provide an example of what exactly becomes broken with such a change.
Comment 34 Anne 2014-07-01 08:17:18 UTC
The examples in comment 11 are not compelling and do not indicate a widespread problem. That some packages operate in a way different from products shipping to billions of people is not indicative of much.

I'm sure you appreciate it's hard to prove a negative. The evidence I have is that all browsers operate in the same way, and there's likely content that depends on that, e.g. per the scenario from comment 19.
Comment 35 Jirka Kosek 2014-07-01 08:30:24 UTC
(In reply to Anne from comment #34)
> The examples in comment 11 are not compelling and do not indicate a
> widespread problem. That some packages operate in a way different from
> products shipping to billions of people is not indicative of much.

I wouldn't call the JRE or the .NET Runtime "some packages" that are "not indicative".

> I'm sure you appreciate it's hard to prove a negative. The evidence I have
> is that all browsers operate in the same way and given that there's likely
> content that depends on that, e.g. per the scenario from comment 19.

The scenario from comment 19 talks about ISO-8859-1; here I agree. But I have never seen a page that asked the browser to send form data in the us-ascii encoding. Have you?
Comment 36 Henri Sivonen 2014-07-01 09:42:34 UTC
(In reply to Jirka Kosek from comment #18)
> I agree that all should agree on how encodings works, but it seems that in
> this quest for unification everything except browsers is ignored.

I think the Encoding Standard should describe the required behavior for implementations of the Web Platform--i.e. browser engines. Other software that wants to be Web-compatible is welcome to implement the spec, too. However, I think it would be wrong to change the spec and browser implementations to make pre-existing non-browser behaviors "correct" per spec.

(In reply to Jirka Kosek from comment #9)
> Consider the following example. I have page containing copyright symbol
> (U+00A9). I want to save it in "us-ascii" encoding using XHTML syntax of
> HTML5.

You want something that's hostile to interop, then. The spec should not accommodate what you want.

XML doesn't require implementations to support "us-ascii". UTF-8 support, however, is required. If you are generating XML and you use an encoding that XML processors aren't required to support, you are engaging in an activity that's hostile to interop compared to using an encoding that XML processors are required to support.

In other words, the right solution is to always use UTF-8 when you create XML documents. 

(In reply to Paul Eggert from comment #27)
> I realize that in this context many browsers interpret non-ASCII bytes using
> a unibyte encoding for legacy reasons, but some newer browers do treat it as
> UTF-8.  I just now tried eww (which will be part of the next GNU Emacs
> release; see <http://www.emacswiki.org/emacs/eww>) and that's how it works. 

You should be doing your Web compat reasoning from browsers with substantial market share. eww isn't one.

> The standard should allow this behavior.  More generally, the standard
> should allow the browser to heuristically decode invalid bytes in ways
> appropriate for the current user and context.

I strongly disagree. We should move towards determinism from the heuristic mess we have instead of having more heuristics.

- -

I request this Bugzilla item be WONTFIXed.
Comment 37 Jirka Kosek 2014-07-01 09:57:23 UTC
(In reply to Henri Sivonen from comment #36)
> (In reply to Jirka Kosek from comment #18)
> > I agree that all should agree on how encodings works, but it seems that in
> > this quest for unification everything except browsers is ignored.
> 
> I think the Encoding Standard should describe the required behavior for
> implementations of the Web Platform--i.e. browser engines. 

Then this should be clearly articulated in the scope of the Encoding Standard. The current wording and responses from Anne indicate that the intention of the Encoding Standard is to cover the entire universe.

> Other software
> that wants to be Web-compatible is welcome to implement the spec, too.
> However, I think it would be wrong to change the spec and browser
> implementations to make pre-existing non-browser behaviors "correct" per
> spec.

So far no one has shown where browsers implement us-ascii as a windows-1252 alias for *encoding*, so to me it seems that the change I propose (replace or escape characters above U+007F for the us-ascii encoding) doesn't change anything in browser implementations.

> (In reply to Jirka Kosek from comment #9)
> > Consider the following example. I have page containing copyright symbol
> > (U+00A9). I want to save it in "us-ascii" encoding using XHTML syntax of
> > HTML5.
> 
> You want something that's hostile to interop, then. The spec should not
> accommodate what you want.
> 
> In other words, the right solution is to always use UTF-8 when you create
> XML documents. 

In an ideal world, yes, but when you have other constraints and you know that the receiver can handle us-ascii, why should it be broken?

Also, the same could be applied to any text-based format other than XML. XML was only an example that easily showed the error using widely deployed libraries.

> I request this Bugzilla item be WONTFIXed.

I would rather see whether browser vendors can prove that changing the current definition of the us-ascii encoding breaks anything. Please note that the Encoding Standard changes how the us-ascii encoding behaved in the past, so this change must be justified and well reasoned, not vice versa.
Comment 38 Henri Sivonen 2014-07-01 10:12:21 UTC
(In reply to Jirka Kosek from comment #37)
> So far no one showen where browsers implement us-ascii as an windows-1252
> alias for *encoding*, so to me it seems that change I propose (replace or
> escape characters above U+007F for us-ascii encoding) doesn't change
> anything in browser implementations.

Not so.

The us-ascii label is resolved to the windows-1252 encoding early when loading a document. The original label isn't around anymore at the time of e.g. form submission. Therefore, the label mapping for decoding also affects encoding.

At least in Firefox, there is no separate mapping from labels to encodings for encoding. You can see the mapping that Firefox uses at
https://mxr.mozilla.org/mozilla-central/source/dom/encoding/labelsencodings.properties#148
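
A rough sketch of that effect (Python; the table and names are illustrative, not Gecko's actual code):

# The label from the page is resolved to an encoding exactly once, at load time...
LABEL_TO_ENCODING = {
    "us-ascii": "cp1252",     # per the Encoding Standard, "us-ascii" labels windows-1252
    "iso-8859-1": "cp1252",
    "utf-8": "utf-8",
}
document_encoding = LABEL_TO_ENCODING["us-ascii"]   # only "cp1252" is remembered

# ...so later form submission sees the resolved encoding, not the original label.
form_value = "naïve"
print(form_value.encode(document_encoding))   # b'na\xefve' -- byte 0xEF, not an ASCII error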

> > (In reply to Jirka Kosek from comment #9)
> > > Consider the following example. I have page containing copyright symbol
> > > (U+00A9). I want to save it in "us-ascii" encoding using XHTML syntax of
> > > HTML5.
> > 
> > You want something that's hostile to interop, then. The spec should not
> > accommodate what you want.
> > 
> > In other words, the right solution is to always use UTF-8 when you create
> > XML documents. 
> 
> In ideal world yes, but when you have other constraints and you know that
> receiver can handle us-ascii then why it should be broken?

What "other constraints"?

If you know what the receiver can handle, you don't need specs to bless your bilateral arrangement.

> I would rather see if browser vendors can prove that changing the current
> definition of us-ascii encoding breaks anything.

No, that's not how this works.

> Please note that the
> Encoding Standard changes how us-ascii encoding behaved in the past, so this
> change must be justified and well reasoned. 

Citation-needed for the Encoding Standard describing a change compared to pre-Encoding Standard browser behavior.
Comment 39 Jirka Kosek 2014-07-01 10:32:20 UTC
(In reply to Henri Sivonen from comment #38)
> The us-ascii label is resolved to the windows-1252 encoding early when
> loading a document. The original label isn't around anymore at the time of
> e.g. form submission. Therefore, the label mapping for decoding also affects
> encoding.

But are there any pages using the us-ascii encoding in the wild? If not, then there is no problem with having different aliases for decoding/encoding.

> > In ideal world yes, but when you have other constraints and you know that
> > receiver can handle us-ascii then why it should be broken?
> 
> What "other constraints"?

For example, a 15-year-old POS terminal with no UTF-8 support.

> If you know what the receiver can handle, you don't need specs to bless your
> bilateral arrangement.

If I'm asking an encoder to produce us-ascii output, I'm not expecting to get bytes with values larger than 127 in my output. 

> > Please note that the
> > Encoding Standard changes how us-ascii encoding behaved in the past, so this
> > change must be justified and well reasoned. 
> 
> Citation-needed for the Encoding Standard describing a change compared to
> pre-Encoding Standard browser behavior.

I think the definition of US-ASCII is pretty clear: it's a 7-bit encoding. I'm talking about us-ascii in general, not only in browsers, because the Encoding Standard seems to apply to everything, not only to browsers. If the scope is narrowed to browsers only, then do as you wish. But it would be silly to have two different definitions of us-ascii -- one for browsers and a second for other environments.
Comment 40 Henri Sivonen 2014-07-01 10:51:48 UTC
(In reply to Jirka Kosek from comment #39)
> But are there any pages using us-ascii encoding in a wild?

It would be extremely surprising if there weren't.

> If no, then there
> is no problem with having different aliases for decoding/encoding.

As noted in my previous comment, when e.g. submitting a form, browsers use the encoding of the submitting document. The document stores the identity of the encoding. It doesn't store the original label, so you don't have a chance to re-resolve the label according to a different mapping.

As for the TextEncoding API, it doesn't support non-UTF-* encodings anyway, so the issue of "us-ascii" is moot. 

> > > In ideal world yes, but when you have other constraints and you know that
> > > receiver can handle us-ascii then why it should be broken?
> > 
> > What "other constraints"?
> 
> For example 15 years old POS terminal with no UTF-8 support.

Without UTF-8 support, they can't have conforming XML support. It's not the Encoding Standard's problem to accommodate XML interchange with fundamentally XML-non-conforming legacy systems.

> > If you know what the receiver can handle, you don't need specs to bless your
> > bilateral arrangement.
> 
> If I'm asking encoder to produce us-ascii output I'm not expecting getting
> bytes with value larger then 127 in my output. 

The point where things go wrong is asking an encoder to produce something other than UTF-8. :-)

> > > Please note that the
> > > Encoding Standard changes how us-ascii encoding behaved in the past, so this
> > > change must be justified and well reasoned. 
> > 
> > Citation-needed for the Encoding Standard describing a change compared to
> > pre-Encoding Standard browser behavior.
> 
> I think that definition of US-ASCII is pretty clear, it's 7-bit encoding.

I said "browser behavior"--not (de jure) "definition".

> I'm talking about us-ascii in general not only in browsers because the
> Encoding Standard seems to apply to everything, not only to browsers. If the
> scope is narrowed to browsers only, then do as you wish. But it would be
> silly to have two different definitions of us-ascii -- one for browsers and
> second for other environments.

I think we should focus the spec on the Web Platform--i.e. browsers. As other systems find the need to consume Web content, they'll eventually grow Encoding Standard-compliant encoding subsystems.

It's clear that there exist encoding libraries whose label handling is IANA-oriented. Those will probably stick around for a long time for compatibility with their old selves. It's unfortunate that the Web behavior and e.g. the IANA-oriented JDK behavior differ, but we should just admit the existence of two different legacies and not try to mix e.g. the JDK legacy into Web specs.
Comment 41 Jirka Kosek 2014-07-01 12:57:23 UTC
(In reply to Henri Sivonen from comment #40)
> As for the TextEncoding API, it doesn't support non-UTF-* encodings anyway,
> so the issue of "us-ascii" is moot. 

Fortunately, it doesn't. But there is still a generic definition of an encoder which allows any encoding. And there is nothing in the Encoding Standard which prevents the creation of other APIs that allow more encodings and sooner or later will cause interop problems with encoding libraries that strictly follow the IANA definitions of the characters available in each encoding.

> I think we should focus the spec on the Web Platform--i.e. browsers. As
> other systems find the need to consume Web content, they'll eventually grow
> Encoding Standard-compliant encoding subsystems.
> 
> It's clear that there exist encoding libraries whose label handling is
> IANA-oriented. Those will probably stick around for a long time for
> compatibility with their old selves. It's unfortunate that the Web behavior
> and e.g. the IANA-oriented JDK behavior differ, but we should just admit the
> existence of two different legacies and not try to mix e.g. the JDK legacy
> into Web specs.

OK, so what if the scope of the Encoding Standard incorporated what is in the two paragraphs above, and also stated that the encoding and decoding algorithms are defined in a way that is compatible with existing usage on the web, and that any APIs built on top of the Encoding Standard should support only utf-* encodings, since otherwise interop with other encoding libraries is not guaranteed?
Comment 42 Paul Eggert 2014-07-01 15:22:13 UTC
> You should be doing your Web compat reasoning from browsers with substantial market share. eww isn't one.

Well, I happened to be using Emacs at the time, but how about Chromium and Firefox? Please see <http://cs.ucla.edu/~eggert/tz/text.html>: it is labeled as charset=US-ASCII but actually encoded in UTF-8 with some non-ASCII characters. Firefox 30.0 and Chromium 34.0.1847.116 both behave like Emacs eww, and decode it as UTF-8.

Again, I propose that the standard be changed to allow decoders to treat UTF-8 properly even if it's mislabeled as US-ASCII, since that's what browsers actually do in some cases, and it's helpful behavior that should not be prohibited. The argument that the behavior is "bad for interoperability" is a weak one: for a user, seeing text is more interoperable than seeing mojibake.
Comment 43 Jirka Kosek 2014-07-01 15:28:26 UTC
(In reply to Paul Eggert from comment #42)
> > You should be doing your Web compat reasoning from browsers with substantial market share. eww isn't one.
> 
> Well, I happened to be using Emacs at the time, but how about Chromium and
> Firefox? Please see <http://cs.ucla.edu/~eggert/tz/text.html>: it is labeled
> as charset=US-ASCII but actually encoded in UTF-8 with some non-ASCII
> characters. Firefox 30.0 and Chromium 34.0.1847.116 both behave like Emacs
> eww, and decode it as UTF-8.

Probably your UTF-8 content was saved with a BOM, which has priority over charset=us-ascii?
Comment 44 Paul Eggert 2014-07-01 15:41:35 UTC
(In reply to Jirka Kosek from comment #43)
> Probably your UTF-8 content was saved with BOM which has priority over
> charset=us-ascii?

Yes, it does have a BOM. Sorry, I didn't know that overrode the label. Please ignore my previous example.
Comment 45 Henri Sivonen 2014-07-04 10:31:32 UTC
(In reply to Jirka Kosek from comment #41)
> (In reply to Henri Sivonen from comment #40)
> > As for the TextEncoding API, it doesn't support non-UTF-* encodings anyway,
> > so the issue of "us-ascii" is moot. 
> 
> Fortunately, it doesn't. But there is still generic definition of encoder
> which allows any encoding. And there is nothing in the Encoding Standard
> which prevents creation of another APIs which will allow more encodings and
> sooner and later will cause interop problems with encoding libraries that
> strictly follow IANA defintions of characters available in each encoding.

An encoding API that exposes all the Encoding Standard encodings should strongly encourage users to 
 1) Always use UTF-8 anyway
and
 2) When rule #1 is violated, to label the data using the canonical name of the encoding from the spec.

For example, the API should refuse to give you an encoder for "us-ascii" to force you to resolve the label "us-ascii" to windows-1252 first and then request an encoder for that. Once you've had to resolve the label to the encoding anyway, you should then use the canonical name of the encoding when labeling the outgoing data.
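
A sketch of an API shaped this way (Python; the names and table are hypothetical, not an existing browser API):

CANONICAL = {"us-ascii": "windows-1252", "iso-8859-1": "windows-1252",
             "utf-8": "utf-8"}

def encoding_for_label(label: str) -> str:
    # Step one: the caller resolves the label to a canonical encoding name.
    return CANONICAL[label.strip().lower()]

def get_encoder(encoding: str):
    # Step two: encoders are handed out for canonical encoding names only.
    if encoding not in CANONICAL.values():
        raise LookupError("pass an encoding, not a label: " + encoding)
    return lambda text: text.encode("cp1252" if encoding == "windows-1252" else encoding)

# get_encoder("us-ascii") raises; the caller is forced to write this instead,
# and to label the outgoing data "windows-1252".
enc = get_encoder(encoding_for_label("us-ascii"))
print(enc("déjà vu"))   # b'd\xe9j\xe0 vu'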

> > I think we should focus the spec on the Web Platform--i.e. browsers. As
> > other systems find the need to consume Web content, they'll eventually grow
> > Encoding Standard-compliant encoding subsystems.
> > 
> > It's clear that there exist encoding libraries whose label handling is
> > IANA-oriented. Those will probably stick around for a long time for
> > compatibility with their old selves. It's unfortunate that the Web behavior
> > and e.g. the IANA-oriented JDK behavior differ, but we should just admit the
> > existence of two different legacies and not try to mix e.g. the JDK legacy
> > into Web specs.
> 
> OK, so what about if the scope of the Encoding Standard will incorporate
> what is in two paragraphs above and also it would state that encoding and
> decoding algorithms are defined in a way that it's compatible with the
> existing usage on the web and that any APIs build on the top of the Encoding
> Standard should support only utf-* encodings, otherwise interop with other
> encoding libraries is not guaranteed?

Well, if you follow my advice above, labeling the data as windows-1252 rather than us-ascii increases interop with non-Encoding Standard receivers quite a bit. This pattern applies to all the single-byte encodings in the Encoding Standard as well as UTF-*, GB18030 and, AFAICT, EUC-JP: Use the canonical name for labeling and you get interop. 

Unfortunately, it might not quite work for Shift_JIS, EUC-KR and Big5. I'm not sure.
Comment 46 Simon Pieters 2014-08-04 12:42:16 UTC
Examples of pages that break if us-ascii label doesn't encode as windows-1252 for URL query component or <form> submission:

* http://futbolenlatele.com/ the "Copa Brasileña" link should point to http://futbolenlatele.com/indexf.php?comp=Copa%20Brasile%F1a

* http://www.wooglie.com/?page=contact has a login form and a contact form that might break.

* http://mysafelistmailer.com/contact.php contact form might break.

data from http://webdevdata.org 2013-09-01

$ find . -type f -print0 | xargs -0 -P 4 -n 40 grep -Ei ";\s*charset\s*=\s*[\"']?((us-)?ascii|ansi_x3\.4-1968)" >> ../us-ascii.txt
Comment 47 Henri Sivonen 2014-08-05 09:25:31 UTC
(In reply to Simon Pieters from comment #46)
> Examples of pages that break if us-ascii label doesn't encode as
> windows-1252 for URL query component or <form> submission:

Thank you! Can we now resolve this as WONTFIX, please?

(In reply to Henri Sivonen from comment #45)
> For example, the API should refuse to give you an encoder for "us-ascii" to
> force you to resolve the label "us-ascii" to windows-1252 first and then
> request and encoder for that. Once you've had to resolve the label to the
> encoding anyway, you should then use the canonical name of the encoding when
> labeling the outgoing data.

FWIW, as a real-world existence proof of an API of this nature, the Gecko-internal API for obtaining a decoder and the Gecko-internal API for obtaining an encoder take encodings--not labels. If a label is what you have, you have to first call a method that resolves a label into an encoding. (It's unfortunate, though, that Gecko represents both encodings and labels as strings of 8-bit code units instead of representing encodings either as an enumeration or as singleton objects inheriting from an abstract class. When properly designing a new API, you should not use strings to represent encodings.)

The introduction of this API has already exposed a bug in SeaMonkey.