Bug 27435: Document.inputEncoding
Opened: 2014-11-25 23:23:30 +0000
Last modified: 2014-11-27 17:25:49 +0000
Classification: Unclassified
Product: WebAppsWG
Component: DOM
Version: unspecified
Platform: PC, OS: All
Status: RESOLVED FIXED
Priority: P2
Severity: normal
People: philipj, annevk, bzbarsky, crimsteam, mike, www-dom, public-webapps-bugzilla

Comment 0 (id 115484), philipj, 2014-11-25 23:23:30 +0000
https://dom.spec.whatwg.org/#dom-document-inputencoding

Usage is currently around 0.4%:
https://www.chromestatus.com/metrics/feature/timeline/popularity/114

AFAICT no browser has removed it yet. In Blink/WebKit it's just an alias of characterSet, while in Gecko it returns null for in-memory document and is otherwise an alias of characterSet.

Making it an alias seems simplest.

Comment 1 (id 115490), bzbarsky, 2014-11-26 02:13:33 +0000
Making it an alias is also an explicit violation of the previous spec for it, right?

Comment 2 (id 115493), annevk, 2014-11-26 08:00:49 +0000
Yeah, see http://www.w3.org/TR/DOM-Level-3-Core/core.html#Document3-inputEncoding

Comment 3 (id 115497), philipj, 2014-11-26 08:40:06 +0000
Yes, but do we care, given that a plain alias seems Web compatible?

In IE11, document.implementation.createHTMLDocument('').inputEncoding is "UTF-8" while .characterSet is "utf-8". In Chrome both are null. In Firefox inputEncoding is null while characterSet is "UTF-8".

In other words, the in-memory case doesn't have great interop right now.

(For documents served over HTTP it's all "UTF-8" except characterSet in IE11 which is "utf-8".)

Comment 4 (id 115498), annevk, 2014-11-26 08:43:08 +0000
I don't care. I'm happy for them all to be "utf-8" (assuming no other encoding was used).

Comment 5 (id 115521), bzbarsky, 2014-11-26 15:15:38 +0000
You mean "UTF-8", since that's the one thing people more or less agree on?

I can probably deal with the alias thing, but just wanted to point out that this is an explicit behavior change and an explicit spec change. I do fully expect to get some compat fallout from it, but not much.
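
Editorial sketch, not part of the original thread: the cross-browser comparison in comment 3 could be reproduced with a small helper that reads the encoding-related properties from a document. The name `encodingReport` and the `inputEncodingIsAlias` field are hypothetical; the helper only reads `characterSet`, `charset`, and `inputEncoding`, the properties discussed in this bug. It makes no claim about what any particular browser returns.

```javascript
// Hypothetical helper: collect the encoding-related properties of a
// Document-like object so that values can be compared across browsers.
function encodingReport(doc) {
  const report = {};
  for (const prop of ["characterSet", "charset", "inputEncoding"]) {
    // A property the browser does not implement reads as undefined.
    report[prop] = doc[prop];
  }
  // True when inputEncoding has exactly the same value as characterSet
  // for this document, including letter case.
  report.inputEncodingIsAlias = doc.inputEncoding === doc.characterSet;
  return report;
}
```

In a browser console this could be called as `encodingReport(document)` for the loaded page, or as `encodingReport(document.implementation.createHTMLDocument(''))` for the in-memory case.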
Comment 6 (id 115603), annevk, 2014-11-27 09:36:08 +0000
https://github.com/whatwg/dom/commit/03e170351f095e4fe749e0259a3aafc0cbb49c91

Comment 7 (id 115607), philipj, 2014-11-27 09:53:37 +0000
Why not just uppercase the return value? Are there any common cases not listed to worry about?

Comment 8 (id 115608), annevk, 2014-11-27 09:54:33 +0000
E.g. windows-1252 is not uppercased.

Comment 9 (id 115617), philipj, 2014-11-27 11:39:56 +0000
Ugh, that's unfortunate. It seems like Chromium already returns "UTF-8" but "windows-1252", but I'm sure there are discrepancies.

Are there other Web-facing APIs that are supposed to return lowercase encoding names? Do they actually in shipping implementations?

Comment 10 (id 115618), annevk, 2014-11-27 12:18:02 +0000
TextDecoder does, yes. If we are to expose these elsewhere I would hope we align with that. Having to guess the case of the encoding name is no fun.

Comment 11 (id 115619), philipj, 2014-11-27 12:55:37 +0000
In Blink, the TextDecoder.encoding getter lowercases the returned string, so somewhere internally the canonical names already differ by case. I guess it doesn't matter how the specs phrase this as long as the observable behavior is the same.

Comment 12 (id 115622), bzbarsky, 2014-11-27 14:38:08 +0000
In Gecko the canonical encoding name for UTF-8 is "UTF-8". The canonical encoding name for windows-1252 is "windows-1252". It sounds like Blink does the same. What do other UAs do?

Seems to me like ideally the canonical names in the encoding standard would match UA behavior to the extent it's interoperable.

inputEncoding should just return the canonical name, imo. We shouldn't be adding stupid complexity and special casing here if we can avoid it.

Fwiw, what TextEncoder/TextDecoder do in Gecko is to just always lowercase our internal canonical name before returning. :(

> Having to guess the case of the encoding name is no fun.

I agree, but neither is breaking compat.
:(

Comment 13 (id 115625), annevk, 2014-11-27 14:49:27 +0000
(In reply to Boris Zbarsky from comment #12)
> Seems to me like ideally the canonical names in the encoding standard would
> match UA behavior to the extent it's interoperable.

It's not interoperable for gbk/gb18030 (I aligned with Blink, which has uppercase). Not sure what IE does.

> inputEncoding should just return the canonical name, imo. We shouldn't be
> adding stupid complexity and special casing here if we can avoid it.

I think we shouldn't add silly casing to the Encoding Standard as they might leak elsewhere. That's why I chose this setup.

Comment 14 (id 115626), bzbarsky, 2014-11-27 15:46:43 +0000
I think having different parts of the platform have different "canonical" case for encodings is just bizarre beyond belief, personally.

Comment 15 (id 115628), philipj, 2014-11-27 16:01:22 +0000
Given that we've already shipped Document.characterSet as "UTF-8" and TextDecoder.encoding as "utf-8", is there a way out of this bizarre situation?

1. Let Document.characterSet and aliases return lowercase, like IE.
2. Make TextDecoder.encoding match characterSet's variable case.

Option 1 seems slightly better long-term, but also far more likely to break stuff.

Comment 16 (id 115629), crimsteam, 2014-11-27 17:25:49 +0000
(In reply to Philip Jägenstedt from comment #15)
> 1. Let Document.characterSet and aliases return lowercase, like IE.

But IE always returns uppercase for inputEncoding (for all names, like UTF-8, BIG5, GB18030, WINDOWS-1250). So an alias to characterSet will never match exactly (if we take letter case into account).

On the other hand, the encoding names returned by browsers are really inconsistent. I don't think anyone really writes code that uses these values in conditions without first converting them to uppercase or lowercase. Would changing to always return lowercase letters, in all cases, really break compatibility?
Some interesting results:

<meta charset="big5">

Firefox
document.characterSet: Big5
document.charset: undefined
document.inputEncoding: Big5

Chrome
document.characterSet: Big5
document.charset: Big5
document.inputEncoding: Big5

IE
document.characterSet: big5
document.charset: big5
document.inputEncoding: BIG5

<meta charset="utf-8">

Firefox
document.characterSet: UTF-8
document.charset: undefined
document.inputEncoding: UTF-8

Chrome
document.characterSet: UTF-8
document.charset: UTF-8
document.inputEncoding: UTF-8

IE
document.characterSet: utf-8
document.charset: utf-8
document.inputEncoding: UTF-8

<meta charset="gbk">

Firefox
document.characterSet: gbk
document.charset: undefined
document.inputEncoding: gbk

Chrome
document.characterSet: GBK
document.charset: GBK
document.inputEncoding: GBK

IE
document.characterSet: gb2312
document.charset: gb2312
document.inputEncoding: GB2312

<meta charset="gb18030">

Firefox
document.characterSet: gb18030
document.charset: undefined
document.inputEncoding: gb18030

Chrome
document.characterSet: gb18030
document.charset: gb18030
document.inputEncoding: gb18030

IE
document.characterSet: GB18030
document.charset: GB18030
document.inputEncoding: GB18030

If the various APIs can return encoding names in different letter case, then I think the minimum is to add that information to the Encoding spec (somewhere near the table that lists those names).
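
Editorial sketch, not part of the original thread: comment 16 suggests that careful page code normalizes letter case before comparing encoding names. A minimal version of that idea, assuming a hypothetical helper named `sameEncodingName`, might look like this. It only folds letter case; it does not resolve different labels for related encodings, so the "gb2312" that IE reports above would still differ from "GBK".

```javascript
// Hypothetical helper: compare two encoding names without regard to
// letter case, since the reported case differs between browsers and
// between APIs (for example "UTF-8" versus "utf-8").
function sameEncodingName(a, b) {
  // null or undefined (for example a missing document.charset)
  // never matches anything.
  if (typeof a !== "string" || typeof b !== "string") return false;
  return a.toLowerCase() === b.toLowerCase();
}

// Example usage with values taken from the results above.
const utf8Matches = sameEncodingName("UTF-8", "utf-8");
const big5Matches = sameEncodingName("BIG5", "Big5");
const gbkMatches = sameEncodingName("gb2312", "GBK");
```

With this helper, a check such as `sameEncodingName(document.characterSet, "utf-8")` would not depend on which case a given browser happens to return.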