27435 – Document.inputEncoding

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WebAppsWG
Classification:	Unclassified
Component:	DOM
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Anne
QA Contact:	public-webapps-bugzilla

Reported:	2014-11-25 23:23 UTC by Philip Jägenstedt
Modified:	2014-11-27 17:25 UTC

Description Philip Jägenstedt 2014-11-25 23:23:30 UTC

https://dom.spec.whatwg.org/#dom-document-inputencoding

Usage is currently around 0.4%:
https://www.chromestatus.com/metrics/feature/timeline/popularity/114

AFAICT no browser has removed it yet.

In Blink/WebKit it's just an alias of characterSet, while in Gecko it returns null for in-memory documents and is otherwise an alias of characterSet. Making it an alias seems simplest.
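
As a rough illustration of the two behaviours described above, the following sketch models a document as a plain object. The names `makeDocument`, `encodingName`, `inMemory` and `geckoStyle` are hypothetical and exist only for this example; real engines implement these getters natively.

```javascript
// Hypothetical model: `encodingName` stands in for the document's encoding,
// `inMemory` marks a document that was not parsed from a byte stream, and
// `geckoStyle` selects the Gecko behaviour described in this bug.
function makeDocument(encodingName, inMemory, geckoStyle) {
  return {
    get characterSet() {
      return encodingName;
    },
    get inputEncoding() {
      // Gecko, as described above: null for in-memory documents.
      if (geckoStyle && inMemory) return null;
      // Otherwise (and always in the Blink/WebKit model): a plain alias.
      return this.characterSet;
    },
  };
}

const aliased = makeDocument("UTF-8", true, false);
const gecko = makeDocument("UTF-8", true, true);
console.log(aliased.inputEncoding); // same value as aliased.characterSet
console.log(gecko.inputEncoding);   // null
```

Under the alias proposal, the `geckoStyle` branch would simply disappear.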

Comment 1 Boris Zbarsky 2014-11-26 02:13:33 UTC

Making it an alias is also an explicit violation of the previous spec for it, right?

Comment 2 Anne 2014-11-26 08:00:49 UTC

Yeah, see http://www.w3.org/TR/DOM-Level-3-Core/core.html#Document3-inputEncoding

Comment 3 Philip Jägenstedt 2014-11-26 08:40:06 UTC

Yes, but do we care, given that a plain alias seems Web compatible?

In IE11, document.implementation.createHTMLDocument('').inputEncoding is "UTF-8" while .characterSet is "utf-8". In Chrome both are null.  In Firefox inputEncoding is null while characterSet is "UTF-8". In other words, the in-memory case doesn't have great interop right now.

(For documents served over HTTP it's all "UTF-8" except characterSet in IE11 which is "utf-8".)
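
A comparison like the one above can be reproduced with a small probe. This is only a sketch: `describeEncoding` is a hypothetical helper, and the browser-only part is guarded so the snippet also loads outside a browser.

```javascript
// Collects the two properties compared above from any Document-like object.
function describeEncoding(doc) {
  return {
    inputEncoding: doc.inputEncoding,
    characterSet: doc.characterSet,
    // True only when both getters agree exactly, including letter case.
    sameValue: doc.inputEncoding === doc.characterSet,
  };
}

// In a browser console, probe the in-memory case and the current page.
if (typeof document !== "undefined") {
  const inMemory = document.implementation.createHTMLDocument("");
  console.log("in-memory:", describeEncoding(inMemory));
  console.log("current page:", describeEncoding(document));
}
```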

Comment 4 Anne 2014-11-26 08:43:08 UTC

I don't care. I'm happy for them all to be "utf-8" (assuming no other encoding was used).

Comment 5 Boris Zbarsky 2014-11-26 15:15:38 UTC

You mean "UTF-8", since that's the one thing people more or less agree on?

I can probably deal with the alias thing, but just wanted to point out that this is an explicit behavior change and an explicit spec change.  I do fully expect to get some compat fallout from it, but not much.

Comment 6 Anne 2014-11-27 09:36:08 UTC

https://github.com/whatwg/dom/commit/03e170351f095e4fe749e0259a3aafc0cbb49c91

Comment 7 Philip Jägenstedt 2014-11-27 09:53:37 UTC

Why not just uppercase the return value? Are there any common cases not listed to worry about?

Comment 8 Anne 2014-11-27 09:54:33 UTC

E.g. windows-1252 is not uppercased.
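
To illustrate why blanket uppercasing is not enough, here is a minimal sketch of a per-name lookup next to plain uppercasing. The table entries are illustrative only, taken from values reported in this bug, and are not the table from the linked commit; `COMPAT_CASE`, `compatName` and `upperName` are hypothetical names.

```javascript
// Illustrative entries only: lowercase name -> legacy-cased name.
const COMPAT_CASE = new Map([
  ["utf-8", "UTF-8"],
  ["big5", "Big5"],
  ["gbk", "GBK"],
]);

// Returns the legacy-cased form if the name is listed, else the name unchanged.
function compatName(name) {
  return COMPAT_CASE.has(name) ? COMPAT_CASE.get(name) : name;
}

// Blanket uppercasing, for comparison.
function upperName(name) {
  return name.toUpperCase();
}

console.log(compatName("utf-8"));        // "UTF-8"
console.log(compatName("windows-1252")); // "windows-1252"
console.log(upperName("windows-1252"));  // "WINDOWS-1252"
```

The lookup leaves unlisted names such as windows-1252 untouched, which plain uppercasing cannot do.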

Comment 9 Philip Jägenstedt 2014-11-27 11:39:56 UTC

Ugh, that's unfortunate. It seems like Chromium already returns uppercase "UTF-8" but lowercase "windows-1252", though I'm sure there are discrepancies.

Are there other Web-facing APIs that are supposed to return lowercase encoding names? Do they actually in shipping implementations?

Comment 10 Anne 2014-11-27 12:18:02 UTC

TextDecoder does, yes. If we are to expose these elsewhere I would hope we align with that. Having to guess the case of the encoding name is no fun.

Comment 11 Philip Jägenstedt 2014-11-27 12:55:37 UTC

In Blink, the TextDecoder.encoding getter lowercases the returned string, so somewhere internally the canonical names already differ by case.

I guess it doesn't matter how the specs phrase this as long as the observable behavior is the same.
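
The lowercase behaviour of TextDecoder can be observed directly. A minimal sketch, assuming an environment where `TextDecoder` is a global (browsers and recent Node.js versions):

```javascript
// Labels are matched case-insensitively; the `encoding` getter reports
// the lowercase name regardless of how the label was written.
const labels = ["UTF-8", "utf-8", "utf8"];
const names = labels.map((label) => new TextDecoder(label).encoding);
console.log(names); // each entry is "utf-8"
```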

Comment 12 Boris Zbarsky 2014-11-27 14:38:08 UTC

In Gecko the canonical encoding name for UTF-8 is "UTF-8".

The canonical encoding name for windows-1252 is "windows-1252".

It sounds like Blink does the same.

What do other UAs do?

Seems to me like ideally the canonical names in the encoding standard would match UA behavior to the extent it's interoperable.

inputEncoding should just return the canonical name, imo.  We shouldn't be adding stupid complexity and special casing here if we can avoid it.

Fwiw, what TextEncoder/TextDecoder do in Gecko is to just always lowercase our internal canonical name before returning.  :(

> Having to guess the case of the encoding name is no fun.

I agree, but neither is breaking compat.  :(

Comment 13 Anne 2014-11-27 14:49:27 UTC

(In reply to Boris Zbarsky from comment #12)
> Seems to me like ideally the canonical names in the encoding standard would
> match UA behavior to the extent it's interoperable.

It's not interoperable for gbk/gb18030 (I aligned with Blink, which has uppercase). Not sure what IE does.


> inputEncoding should just return the canonical name, imo.  We shouldn't be
> adding stupid complexity and special casing here if we can avoid it.

I think we shouldn't add silly casing to the Encoding Standard as they might leak elsewhere. That's why I chose this setup.

Comment 14 Boris Zbarsky 2014-11-27 15:46:43 UTC

I think having different parts of the platform have different "canonical" case for encodings is just bizarre beyond belief, personally.

Comment 15 Philip Jägenstedt 2014-11-27 16:01:22 UTC

Given that we've already shipped Document.characterSet as "UTF-8" and TextDecoder.encoding as "utf-8", is there a way out of this bizarre situation?

1. Let Document.characterSet and aliases return lowercase, like IE.

2. Make TextDecoder.encoding match characterSet's variable case.

Option 1 seems slightly better long-term, but also far more likely to break stuff.

Comment 16 Arkadiusz Michalski (Spirit) 2014-11-27 17:25:49 UTC

(In reply to Philip Jägenstedt from comment #15)
> 1. Let Document.characterSet and aliases return lowercase, like IE.

But IE always returns uppercase for inputEncoding (for all names, like UTF-8, BIG5, GB18030, WINDOWS-1250). So an alias of characterSet will never be exactly right there (if we take letter case into account).

On the other hand, the encoding names returned by browsers are really inconsistent. I don't think anyone really writes code that uses them in conditions without first converting to uppercase or lowercase. Would changing to always return lowercase, in all cases, really break compatibility?
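
The defensive pattern mentioned here, normalizing case before comparing, might look like the following sketch. `sameEncodingName` is a hypothetical helper; it only ignores letter case and does not resolve aliases between differently named encodings.

```javascript
// Compares two encoding names ignoring letter case.
// A null or undefined value (e.g. from an in-memory document) never matches.
function sameEncodingName(a, b) {
  if (a == null || b == null) return false;
  return a.toLowerCase() === b.toLowerCase();
}

console.log(sameEncodingName("BIG5", "Big5"));  // true
console.log(sameEncodingName("gb2312", "GBK")); // false
console.log(sameEncodingName(null, "UTF-8"));   // false
```

Code written this way would be unaffected by a change in the returned case.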

Some interesting results:

<meta charset="big5">
Firefox
document.characterSet: Big5
document.charset: undefined
document.inputEncoding: Big5
Chrome
document.characterSet: Big5
document.charset: Big5
document.inputEncoding: Big5
IE
document.characterSet: big5
document.charset: big5
document.inputEncoding: BIG5 

<meta charset="utf-8">
Firefox
document.characterSet: UTF-8
document.charset: undefined
document.inputEncoding: UTF-8 
Chrome
document.characterSet: UTF-8
document.charset: UTF-8
document.inputEncoding: UTF-8
IE
document.characterSet: utf-8
document.charset: utf-8
document.inputEncoding: UTF-8 

<meta charset="gbk">
Firefox
document.characterSet: gbk
document.charset: undefined
document.inputEncoding: gbk 
Chrome
document.characterSet: GBK
document.charset: GBK
document.inputEncoding: GBK
IE
document.characterSet: gb2312
document.charset: gb2312
document.inputEncoding: GB2312

<meta charset="gb18030">
Firefox
document.characterSet: gb18030
document.charset: undefined
document.inputEncoding: gb18030
Chrome
document.characterSet: gb18030
document.charset: gb18030
document.inputEncoding: gb18030
IE
document.characterSet: GB18030
document.charset: GB18030
document.inputEncoding: GB18030

If the various APIs can return encoding names in different letter case, then I think the minimum is to add that information to the Encoding spec (somewhere near the table which lists those names).