Currently, many single-byte encodings map undefined bytes in the 0x80-0x9F range to the C1 code points with the same byte value. This seems to be what most, if not all, browsers do, so it may be difficult to change.
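A small sketch of the contrast, using Python's stdlib cp1252 codec purely as a stand-in for a "narrow" de jure converter (it rejects the same bytes the Java converters mentioned below do); the browser-style mapping is the one the Encoding Standard specifies:

```python
# Contrast strict de jure decoding with browser-style (Encoding Standard)
# behavior for the byte 0x81 in windows-1252.

raw = b"\x81"

# Python's cp1252 codec treats 0x81 as undefined and raises in strict mode.
try:
    raw.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Browsers (per the Encoding Standard) map the undefined byte to the
# C1 code point with the same value, i.e. U+0081.
browser_style = chr(raw[0])

print(strict_ok)                   # False
print(browser_style == "\u0081")   # True
```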
However, validators work differently. For example, the W3C Validator at http://validator.w3.org/check, when seeing a 0x81 byte in windows-1252, says the following:
Sorry! This document cannot be checked.
Sorry, I am unable to validate this document because on line 213 it contained one or more bytes that I cannot interpret as windows-1252 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: cp1252 "\x81" does not map to Unicode
As far as I understand, the current encoding spec would disallow such behavior. However, \x81 is clearly garbage (I only included it in a file in order to test browsers), so it makes a lot of sense that this is caught by validators and similar tools, and not produced by tool chains.
Well, the current encoding specification declares all those encodings as legacy and requires everyone to use UTF-8, so implementing extra logic for legacy encodings seems kind of pointless.
FWIW, I’m not interested in implementing checks that would uphold the fiction of certain legacy encodings being narrower than they actually are in practice.
(In reply to comment #2)
> FWIW, I’m not interested in implementing checks that would uphold the
> fiction of certain legacy encodings being narrower than they actually are in
> practice.
That would be validator.nu, yes? You don't have to implement any new checks, they are already implemented. Here is what I get when trying to validate a document with virtually all byte values in windows-1252 (except for 0x00 and 0x0D, because these don't survive document load).
(The character encoding override was just to be on the safe side; I also deleted the actual code line to avoid getting some weird stuff into bugzilla.)
Warning: Overriding document character encoding from none to Windows-1252.
Error: Forbidden code point U+0001.
At line 77, column 41
[similar for most code points in C0 area, shortened]
Warning: This document is not mappable to XML 1.0 without data loss due to U+000c which is not a legal XML 1.0 character.
At line 89, column 41
[U+000C is treated somewhat specially]
Error: Unmappable byte sequence: 81.
At line 222, column 41
Error: Forbidden code point U+007f.
At line 218, column 41
Error: Unmappable byte sequence: 8d.
At line 234, column 41
Error: Unmappable byte sequence: 8f.
At line 236, column 41
Error: Unmappable byte sequence: 90.
At line 239, column 41
Error: Unmappable byte sequence: 9d.
At line 252, column 41
The above five "Unmappable byte sequence" errors are exactly the ones I'm concerned about (in windows-1252).
I haven't looked at your code, but I originally wrote the corresponding code in the W3C validator (in Perl; not sure that's still used :-). My guess is that your validator produces these errors because it uses the Java character code conversion, where of course 0x81, 0x8d, 0x8f, 0x90, 0x9d are invalid.
So in this case, we have at least two validator implementations that do the same thing, which is the right thing for validators, because these bytes can be nothing other than garbage. So the only thing we would need is some wording in the spec that allows this practice.
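For what it's worth, the exact set of rejected bytes is easy to enumerate. Here Python's cp1252 codec again stands in for the narrow converters both validators use; the five bytes it rejects are precisely the ones listed above:

```python
# Enumerate which byte values a strict (de jure) windows-1252 converter
# rejects, using Python's cp1252 codec as the reference.

def decodes(b: int) -> bool:
    """True if the single byte b is defined in the strict cp1252 mapping."""
    try:
        bytes([b]).decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False

undefined = [hex(b) for b in range(0x100) if not decodes(b)]
print(undefined)  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```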
(In reply to comment #3)
> (In reply to comment #2)
> > FWIW, I’m not interested in implementing checks that would uphold the
> > fiction of certain legacy encodings being narrower than they actually are in
> > practice.
> That would be validator.nu, yes?
> You don't have to implement any new checks,
> they are already implemented.
Oh, I thought you were talking about cases like the part of Windows-1252 that isn’t part of ISO-8859-1, or the part of Windows-31j that isn’t part of de jure Shift_JIS.
> Error: Forbidden code point U+0001.
> At line 77, column 41
This is not an encoding-level error. It’s an error detected after decoding into Unicode.
> Warning: This document is not mappable to XML 1.0 without data loss due to
> U+000c which is not a legal XML 1.0 character.
> At line 89, column 41
This isn’t an encoding-level warning.
> Error: Unmappable byte sequence: 81.
> At line 222, column 41
> Error: Forbidden code point U+007f.
> At line 218, column 41
> Error: Unmappable byte sequence: 8d.
> At line 234, column 41
> Error: Unmappable byte sequence: 8f.
> At line 236, column 41
> Error: Unmappable byte sequence: 90.
> At line 239, column 41
> Error: Unmappable byte sequence: 9d.
> At line 252, column 41
These are indeed encoding-level errors. Validator.nu is currently using decoders from the JDK and ICU4J (not sure whether the Windows-1252 decoder is from the JDK or ICU4J). That is, Validator.nu does not have a set of decoders that’d conform to the Encoding Standard at present.
Moreover, Validator.nu currently only has the capability of handling the sort of decoder errors that result in a REPLACEMENT CHARACTER in the output. Mapping the above bytes to the REPLACEMENT CHARACTER is non-compliant per the Encoding Standard.
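A minimal illustration of that REPLACEMENT CHARACTER style of error handling, again using Python's cp1252 codec as a stand-in for the JDK/ICU4J decoders (assumption: their replace-mode behavior is analogous):

```python
# Replace-mode decoding turns the unmappable byte into U+FFFD.
# Per the Encoding Standard this is non-compliant for windows-1252,
# which maps 0x81 to U+0081 instead.

decoded = b"A\x81B".decode("cp1252", errors="replace")
print(decoded == "A\ufffdB")  # True
```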
Developing a mechanism for reporting errors that don’t result in a REPLACEMENT CHARACTER in the output does not seem like a particularly high-value endeavor. It would be more interesting to have a post-decode Unicode-level mechanism for whining about the C1 Unicode range—even if those code points came in as valid UTF-8. That said, I think the point you *are* making is more valid than the point *I previously thought* you were making.
(In reply to comment #4)
I meant: “Yes”.
> > Error: Forbidden code point U+007f.
> > At line 218, column 41
> These are indeed encoding-level errors.
Except the one about U+007f is not.
Once validators update their encoders/decoders to match the specification, this issue seems moot. On top of that, the idea is to flag all non-UTF-8 usage. Not really sure what remains to be fixed here.
The Internationalization WG considered this bug in our January 23rd teleconference (per Martin's request), and I've been delegated to add a comment here giving our "sense of the working group".
Generally speaking, the WG supports the intent of this bug, which we understand to be: that validators should be allowed to (and probably should) emit a warning when they detect that a byte sequence in the document being checked is not convertible, represents a malformed sequence, or is otherwise illegal.
We don't think that having the validator fail to check the remainder of the document is a good thing (obviously a truly wrong encoding declaration may lead to an interminable number of document errors).
Henri's reply suggests that Validator.nu would ignore any byte sequence that doesn't result in U+FFFD, and that the currently reported errors might change or disappear for some of the sequences Martin mentions if/when the decoders used in the validator are updated to match the encoding specification. We think validators should be permitted to continue reporting these errors, although we tend to agree that this is more of a warning and that maybe nothing is necessary here?
(In reply to comment #6)
> Once validators update their encoders/decoders
Why would they? It would only be "make work" for them, for no real benefit. It isn't easy to throw away converters that come with your programming language and create your own converters, and it doesn't make sense to do so just to match a specification, when the actual data is crap, and that's what the validators are supposed to catch.
> On top of that the idea is to flag all non-utf-8 usage.
That's not a bad idea in and of itself, but it should remain separate. For many existing web sites, moving from a legacy encoding to UTF-8 may be a major undertaking, while removing an occasional crappy byte may be much easier.
If the validator does not match the standards, its reports will be out of sync with the experiences of developers using products that implement those standards. We've had that problem big-time with HTML4. Let's not go there again.
(In reply to comment #8)
> (In reply to comment #6)
> > Once validators update their encoders/decoders
> Why would they?
To conform to the specs they are supposed to be dealing with. It seems pretty bogus for a validator not to implement the specs the validator is supposed to be checking for. (Yes, Validator.nu is currently bogus when it comes to the Encoding Standard.)
In the case of Validator.nu, the parser is also available for non-validator apps. In order to make the HTML consumption of those apps Web-compatible, the parser should use Encoding Standard-compliant decoders. It's sad that it isn't already. I have had too much other stuff to pay attention to.
> It isn't easy to throw away converters that come with your programming
> language and create your own converters
Should be pretty easy for the single-byte converters. Considerably more work for the multibyte ones, though.
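To give a sense of how little work this is for the single-byte case, here is a hypothetical sketch (the function name and the fallback approach are my own, not anyone's actual implementation) of an Encoding Standard-compliant windows-1252 decoder built on top of an existing strict converter:

```python
# Hypothetical sketch: wrap a strict cp1252 converter so that the five
# undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) fall back to the C1
# code point with the same value, as the Encoding Standard's
# windows-1252 index requires.

def decode_windows1252(data: bytes) -> str:
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in strict cp1252: map to U+0081, U+008D, etc.
            out.append(chr(b))
    return "".join(out)

print(decode_windows1252(b"\x80\x81"))  # '\u20ac\u0081' (€ then C1 0x81)
```

A real implementation would of course use a 256-entry lookup table rather than per-byte exception handling, but the mapping itself is the whole of the logic.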
Per comment 9 and comment 10, WONTFIX.