This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 719 - Doctype/encoding fallback issues
Summary: Doctype/encoding fallback issues
Status: VERIFIED FIXED
Alias: None
Product: Validator
Classification: Unclassified
Component: check
Version: 0.6.5
Hardware: Other other
Importance: P1 major
Target Milestone: 0.6.6
Assignee: Ville Skyttä
QA Contact: qa-dev tracking
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-05-15 06:21 UTC by Ville Skyttä
Modified: 2004-05-17 21:46 UTC

Comment 1 Terje Bless 2004-05-15 12:25:17 UTC
I've added the O_CHARSET flag to &abort_if_error_flagged that gets triggered on
byte errors. This has the potential to break in some of the other exception
cases that get handled by this particular instance, so it bears watching for
further weirdness (there was a reason why it was disabled IIRC).

This enables the Charset popup in result pages for URLs with no charset and
bytes that are invalid in UTF-8 (e.g. unlabeled Latin 1, Win-1252, etc.).
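
The condition described above (no declared charset, and bytes that do not
decode as UTF-8) can be sketched roughly as follows. This is an illustrative
Python sketch, not the validator's actual code; the function name
needs_charset_prompt and its parameters are hypothetical.

```python
from typing import Optional


def needs_charset_prompt(body: bytes, declared_charset: Optional[str]) -> bool:
    """Hypothetical helper: True when no charset was declared and the
    document bytes are not valid UTF-8, i.e. the case where offering a
    charset override to the user is useful."""
    if declared_charset:
        # A charset was declared somewhere; this case does not apply.
        return False
    try:
        body.decode("utf-8")  # strict decoding raises on invalid byte sequences
    except UnicodeDecodeError:
        return True
    return False
```

For example, the Latin 1 bytes b"caf\xe9" with no declared charset would
trigger the prompt, while the same text encoded as UTF-8 would not.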
Comment 2 Terje Bless 2004-05-15 22:33:40 UTC
Both Charset and DOCTYPE were borked due to logic errors. Now fixed, but the
code is somewhat hairy so there could still be edge cases and it's likely to
break again the next time someone futzes about in this part of the code. Anyone
have ideas for a complete revamp of this code?
Comment 3 Ville Skyttä 2004-05-16 07:13:48 UTC
Still some issues: try for example http://www.hut.fi

- HTTP Content-Type: text/html (no charset)
- No <meta> element with a charset in markup
- Not XML, no XML encoding.

--> validator uses UTF-8.

Shouldn't this be iso-8859-1 based on the "strong default" of HTTP?
Comment 4 Terje Bless 2004-05-16 21:52:33 UTC
cf. Comment #3

"Shouldn't this be iso-8859-1 based on the "strong default" of HTTP?"

No. As we've been over a gazillion and one times on w-v, the HTML Recs have seen
fit to override the HTTP RFC making it impossible to conform with both standards
at the same time. Given the mess this issue is in, we're erring on the side of
1) obeying the W3C Recs since we're a W3C hosted service, and 2) promoting
Unicode over limited legacy charsets.

This is not ideal, but the best we can do under the circumstances.


If you have specific proposed changes (other than changing the default to
ISO-8859-1) please outline them (warn about the fallback perhaps?). Otherwise,
please close this bug.
Comment 5 Ville Skyttä 2004-05-17 02:17:15 UTC
Correct me if I'm wrong:

I don't know where the conflict between the specs in the situation outlined in
comment 3 is.  AFAIK, one of the specs has a "strong default", the other
(speaking HTML here, not X(HT)ML) does not have any default.  FWIW, I disagree
with bluntly acting against the HTTP spec _when not necessary_.

Anyway, we already have warnings about not being able to find a character
encoding to use in the validator code.  Why aren't those shown in this case?  In
which cases are they shown, then?

I think the warnings should be shown no matter what charset we choose if none is
explicitly specified.  In addition, if we choose to use UTF-8 in these cases,
for which the reasoning is not at all obvious IMO, a blurb/statement about it
needs to be included in the documentation.
Comment 6 Terje Bless 2004-05-17 03:05:01 UTC
cf. Comment #5;

HTTP specifies that the absence of a charset parameter in the Content-Type field
means a default of ISO-8859-1. The HTML 4.01 Recommendation says something along
the lines of "This has turned out to be sub-optimal. You should disregard this
and default to UTF-8 instead."

IOW, when no other charset information is present -- including any defaults
implied from an XML Content-Type -- we can pick either ISO-8859-1 or UTF-8
depending on whether we choose to listen to the IETF or the W3C. After many
(*many*) discussions on w-v we've ended up listening to the W3C.
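
The fallback described above can be sketched roughly like this. It is an
illustrative Python sketch only, not the validator's code: the function name
pick_charset, its parameters, and the precedence order among the explicit
sources are assumptions. Only the final UTF-8 fallback, and the idea of
warning about it, come from this thread.

```python
from typing import List, Optional, Tuple

# Fallback policy described in this thread; ISO-8859-1 would be the
# HTTP-default alternative.
FALLBACK_CHARSET = "utf-8"


def pick_charset(
    http_charset: Optional[str] = None,
    xml_encoding: Optional[str] = None,
    meta_charset: Optional[str] = None,
) -> Tuple[str, List[str]]:
    """Hypothetical helper: return (charset, warnings).

    The first explicit source wins (this ordering is an assumption);
    if none is present, fall back and record a warning.
    """
    warnings: List[str] = []
    for value in (http_charset, xml_encoding, meta_charset):
        if value:
            return value.lower(), warnings
    warnings.append(
        "No character encoding information found; falling back to "
        + FALLBACK_CHARSET.upper() + "."
    )
    return FALLBACK_CHARSET, warnings
```

For the case in Comment #3 (no charset in the Content-Type, no <meta> charset,
not XML), pick_charset() returns "utf-8" together with one warning.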


As for why there is no warning in the case outlined in Comment #3, this is due
to the page generating a fatal error. The exception handler is conservative in
what it tries to spit out because fatal errors usually occur too early and the
data structures are in a garbage state.

I'll look into whether we can fix it in this particular case.
Comment 7 Terje Bless 2004-05-17 03:57:37 UTC
Ok, I've added the accumulated warnings to the output for all cases that the
metadata table is also output for. I think the initialization state for
&add_table and &add_warning is the same at that stage.

It should be up on qa-dev now; try it on hut.fi and let me know if it does
the job.
Comment 8 Bj 2004-05-17 09:19:28 UTC
HTML 4.01 actually says do what you want, but do not default to ISO-8859-1
blindly as that would fail in many situations. The specification mentions UTF-8
only as an example for common encodings and in the section on how to handle
illegal URIs. If we do not attempt to be clever, ISO-8859-1 makes the most
sense among possible default encodings, as there are most likely more
ISO-8859-1 documents than UTF-8 documents which do not declare any encoding on
the web.
Comment 9 Terje Bless 2004-05-17 16:54:02 UTC
As mentioned, this has been discussed, at length, on several occasions on w-v.
Bugzilla is not the place to take this discussion up again, and 0.6.6 is not the
target for revisiting this issue. Closing this bug as FIXED; let's hash this out
on the list and iff necessary make any changes with a target of 0.7.
Comment 10 Ville Skyttä 2004-05-17 17:46:35 UTC
Now that the warnings are visible, the fallback is a lot less confusing; good
enough for 0.6.6 IMO.