Bug 15076 - Make UAs use UTF-8 as fallback encoding if the page has a HTML5 doctype
Alias: None
Product: HTML WG
Classification: Unclassified
Component: HTML5 spec
Version: unspecified
Hardware: PC All
Importance: P3 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL: http://dev.w3.org/html5/spec/parsing#...
Depends on:
Reported: 2011-12-06 05:02 UTC by Leif Halvard Silli
Modified: 2011-12-07 08:19 UTC
Description Leif Halvard Silli 2011-12-06 05:02:58 UTC
Quoting Kornel Lesiński:

> Could <!DOCTYPE html> be an opt-in to default UTF-8 encoding?
> It would be nice to minimize number of declarations a page needs to include.


Such a UA behaviour would, presumably, involve formalizing a new step in the encoding sniffing algorithm, between the current step 5 and step 6. In essence, the UA would default to UTF-8 if the other metadata fails to yield an encoding - see: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034069.html

Presumably, UAs would need to change before the spec could officially allow authors to rely on the DOCTYPE.
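
To make the proposal concrete, here is a rough sketch of where the new fallback would sit. It is not the spec's algorithm: the step ordering is simplified, the meta prescan is reduced to a single regular expression, and the names `pick_encoding`, `transport_charset` and `legacy_default` are hypothetical, as is the choice of windows-1252 as the legacy default.

```python
import codecs
import re
from typing import Optional

# Simplified stand-in for the meta prescan (assumption): only recognises
# <meta charset=...> within the first 1024 bytes.
_META_CHARSET = re.compile(
    rb'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)
# Matches an HTML5 doctype at the very start of the byte stream.
_HTML5_DOCTYPE = re.compile(rb'\s*<!DOCTYPE\s+html\s*>', re.IGNORECASE)


def pick_encoding(
    data: bytes,
    transport_charset: Optional[str] = None,
    legacy_default: str = "windows-1252",  # assumed locale default
) -> str:
    """Return an encoding label for `data` (illustrative ordering only)."""
    # Existing metadata: transport-layer charset, BOM, <meta charset>.
    if transport_charset:
        return transport_charset.lower()
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    meta = _META_CHARSET.search(data[:1024])
    if meta:
        return meta.group(1).decode("ascii").lower()
    # Proposed new step: all other metadata failed, so let the
    # HTML5 doctype act as an opt-in to UTF-8.
    if _HTML5_DOCTYPE.match(data):
        return "utf-8"
    # Otherwise fall back to the legacy default, as today.
    return legacy_default
```

With this sketch, `pick_encoding(b"<!DOCTYPE html><p>x")` would yield `"utf-8"`, while the same page without the doctype would still get the legacy default.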
Comment 1 Ian 'Hixie' Hickson 2011-12-06 05:50:35 UTC
Try to get a browser vendor to ship it. If they can, I'd be happy to change the spec.
Comment 2 Henri Sivonen 2011-12-07 08:19:06 UTC
(In reply to comment #1)
> Try to get a browser vendor to ship it. If they can, I'd be happy to change the
> spec.

My editorial assistant hat asked my HTML parser module owner hat. Therefore:

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the tracker issue; or you may create a tracker issue
yourself, if you are able to do so. For more details, see this document:

Status: Rejected
Change Description: no spec change

We already have *three* backwards-compatible ways to opt into UTF-8. <!DOCTYPE html> isn't one of them. Making the change proposed here would violate the Don't Reinvent the Wheel design principle.

Moreover, I think it's a mistake to bundle a lot of unrelated things into one mode switch instead of having legacy-compatible defaults and having granular ways to opt into legacy-incompatible behaviors. (That is, I think, in retrospect, it's bad that we have a doctype-triggered standards mode with legacy-incompatible CSS defaults instead of having legacy-compatible CSS defaults and CSS properties for opting into different behaviors.)

Making this change would make the encoding selection behavior even more confusing to authors than it is now, since using <!DOCTYPE html> would lead to radically different behavior in old and new browsers.

Furthermore, doctype mode processing currently happens at the tree builder level, but efficient encoding sniffing needs to happen before tokenization. It would be *even* more confusing to have cases where encoding sniffing of <!DOCTYPE html> and CSS mode sniffing of <!DOCTYPE html> get out of sync.

If the author wishes to minimize declarations, (s)he can put the UTF-8 BOM followed immediately by <!DOCTYPE html> at the start of the file.
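
As an illustration of that suggestion, the following sketch produces such a file prefix in Python. The helper name `with_bom_and_doctype` and the sample body are hypothetical; the only point is that the three BOM bytes come first and the doctype follows with nothing in between.

```python
import codecs


def with_bom_and_doctype(body: str) -> bytes:
    """Encode `body` as UTF-8, prefixed by the UTF-8 BOM and <!DOCTYPE html>."""
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    return codecs.BOM_UTF8 + ("<!DOCTYPE html>\n" + body).encode("utf-8")


page = with_bom_and_doctype("<title>Blåbær</title>")
```

Decoding `page` with Python's `utf-8-sig` codec strips the BOM again, so the decoded text starts directly with the doctype.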