<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>27435</bug_id>
          
          <creation_ts>2014-11-25 23:23:30 +0000</creation_ts>
          <short_desc>Document.inputEncoding</short_desc>
          <delta_ts>2014-11-27 17:25:49 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WebAppsWG</product>
          <component>DOM</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Philip Jägenstedt">philipj</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>bzbarsky</cc>
    
    <cc>crimsteam</cc>
    
    <cc>mike</cc>
    
    <cc>www-dom</cc>
          
          <qa_contact>public-webapps-bugzilla</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>115484</commentid>
    <comment_count>0</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2014-11-25 23:23:30 +0000</bug_when>
    <thetext>https://dom.spec.whatwg.org/#dom-document-inputencoding

Usage is currently around 0.4%:
https://www.chromestatus.com/metrics/feature/timeline/popularity/114

AFAICT no browser has removed it yet.

In Blink/WebKit it&apos;s just an alias of characterSet, while in Gecko it returns null for in-memory documents and is otherwise an alias of characterSet. Making it an alias seems simplest.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115490</commentid>
    <comment_count>1</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2014-11-26 02:13:33 +0000</bug_when>
    <thetext>Making it an alias is also an explicit violation of the previous spec for it, right?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115493</commentid>
    <comment_count>2</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-26 08:00:49 +0000</bug_when>
    <thetext>Yeah, see http://www.w3.org/TR/DOM-Level-3-Core/core.html#Document3-inputEncoding</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115497</commentid>
    <comment_count>3</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2014-11-26 08:40:06 +0000</bug_when>
    <thetext>Yes, but do we care, given that a plain alias seems Web compatible?

In IE11, document.implementation.createHTMLDocument(&apos;&apos;).inputEncoding is &quot;UTF-8&quot; while .characterSet is &quot;utf-8&quot;. In Chrome both are null.  In Firefox inputEncoding is null while characterSet is &quot;UTF-8&quot;. In other words, the in-memory case doesn&apos;t have great interop right now.

(For documents served over HTTP it&apos;s all &quot;UTF-8&quot; except characterSet in IE11 which is &quot;utf-8&quot;.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115498</commentid>
    <comment_count>4</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-26 08:43:08 +0000</bug_when>
    <thetext>I don&apos;t care. I&apos;m happy for them all to be &quot;utf-8&quot; (assuming no other encoding was used).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115521</commentid>
    <comment_count>5</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2014-11-26 15:15:38 +0000</bug_when>
    <thetext>You mean &quot;UTF-8&quot;, since that&apos;s the one thing people more or less agree on?

I can probably deal with the alias thing, but just wanted to point out that this is an explicit behavior change and an explicit spec change.  I do fully expect to get some compat fallout from it, but not much.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115603</commentid>
    <comment_count>6</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-27 09:36:08 +0000</bug_when>
    <thetext>https://github.com/whatwg/dom/commit/03e170351f095e4fe749e0259a3aafc0cbb49c91</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115607</commentid>
    <comment_count>7</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2014-11-27 09:53:37 +0000</bug_when>
    <thetext>Why not just uppercase the return value? Are there any common cases not listed to worry about?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115608</commentid>
    <comment_count>8</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-27 09:54:33 +0000</bug_when>
    <thetext>E.g. windows-1252 is not uppercased.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115617</commentid>
    <comment_count>9</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2014-11-27 11:39:56 +0000</bug_when>
    <thetext>Ugh, that&apos;s unfortunate. It seems like Chromium already returns &quot;UTF-8&quot; in uppercase but &quot;windows-1252&quot; in lowercase, though I&apos;m sure there are discrepancies.

Are there other Web-facing APIs that are supposed to return lowercase encoding names? Do they actually in shipping implementations?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115618</commentid>
    <comment_count>10</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-27 12:18:02 +0000</bug_when>
    <thetext>TextDecoder does, yes. If we are to expose these elsewhere I would hope we align with that. Having to guess the case of the encoding name is no fun.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115619</commentid>
    <comment_count>11</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2014-11-27 12:55:37 +0000</bug_when>
    <thetext>In Blink, the TextDecoder.encoding getter lowercases the returned string, so somewhere internally the canonical names already differ by case.
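
A minimal sketch of the observable behavior, assuming an environment that exposes TextDecoder as a global (the helper name decoderName is hypothetical, used only for illustration):

```typescript
// Hypothetical helper: return the name that TextDecoder reports for a label.
// The Encoding Standard specifies that the encoding getter returns the
// name in lowercase, whatever case the label was written in.
function decoderName(label: string): string {
  return new TextDecoder(label).encoding;
}

console.log(decoderName(`UTF-8`)); // utf-8
console.log(decoderName(`utf8`));  // utf-8
```

Whether the lowercasing happens in the getter or in the internal canonical name is not visible from script; only the returned string is.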

I guess it doesn&apos;t matter how the specs phrase this as long as the observable behavior is the same.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115622</commentid>
    <comment_count>12</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2014-11-27 14:38:08 +0000</bug_when>
    <thetext>In Gecko the canonical encoding name for UTF-8 is &quot;UTF-8&quot;.

The canonical encoding name for windows-1252 is &quot;windows-1252&quot;.

It sounds like Blink does the same.

What do other UAs do?

Seems to me like ideally the canonical names in the encoding standard would match UA behavior to the extent it&apos;s interoperable.

inputEncoding should just return the canonical name, imo.  We shouldn&apos;t be adding stupid complexity and special casing here if we can avoid it.

Fwiw, what TextEncoder/TextDecoder do in Gecko is to just always lowercase our internal canonical name before returning.  :(

&gt; Having to guess the case of the encoding name is no fun.

I agree, but neither is breaking compat.  :(</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115625</commentid>
    <comment_count>13</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-11-27 14:49:27 +0000</bug_when>
    <thetext>(In reply to Boris Zbarsky from comment #12)
&gt; Seems to me like ideally the canonical names in the encoding standard would
&gt; match UA behavior to the extent it&apos;s interoperable.

It&apos;s not interoperable for gbk/gb18030 (I aligned with Blink, which has uppercase). Not sure what IE does.


&gt; inputEncoding should just return the canonical name, imo.  We shouldn&apos;t be
&gt; adding stupid complexity and special casing here if we can avoid it.

I think we shouldn&apos;t add silly casing to the Encoding Standard as it might leak elsewhere. That&apos;s why I chose this setup.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115626</commentid>
    <comment_count>14</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2014-11-27 15:46:43 +0000</bug_when>
    <thetext>I think having different parts of the platform have different &quot;canonical&quot; case for encodings is just bizarre beyond belief, personally.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115628</commentid>
    <comment_count>15</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2014-11-27 16:01:22 +0000</bug_when>
    <thetext>Given that we&apos;ve already shipped Document.characterSet as &quot;UTF-8&quot; and TextDecoder.encoding as &quot;utf-8&quot;, is there a way out of this bizarre situation?

1. Let Document.characterSet and aliases return lowercase, like IE.

2. Make TextDecoder.encoding match characterSet&apos;s variable case.

Option 1 seems slightly better long-term, but also far more likely to break stuff.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115629</commentid>
    <comment_count>16</comment_count>
    <who name="Arkadiusz Michalski (Spirit)">crimsteam</who>
    <bug_when>2014-11-27 17:25:49 +0000</bug_when>
    <thetext>(In reply to Philip Jägenstedt from comment #15)
&gt; 1. Let Document.characterSet and aliases return lowercase, like IE.

But IE always returns uppercase for inputEncoding (for all names, like UTF-8, BIG5, GB18030, WINDOWS-1250). So an alias of characterSet will never match IE (if we take letter case into account).

On the other hand, the encoding names returned by browsers are really inconsistent. I don&apos;t think anyone really writes code that uses them in conditions without first converting to uppercase or lowercase. Would changing to always return lowercase, in all cases, really break compatibility?
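
A minimal sketch of the kind of normalization meant here, assuming page code compares names only after folding case (sameEncodingName is a hypothetical helper, not an existing API):

```typescript
// Hypothetical helper: compare two encoding names while ignoring letter case.
// It only folds case; it does not resolve labels or aliases, so two different
// names for a related encoding still compare as different.
function sameEncodingName(a: string, b: string): boolean {
  return a.toLowerCase() === b.toLowerCase();
}

console.log(sameEncodingName(`Big5`, `BIG5`));   // true
console.log(sameEncodingName(`UTF-8`, `utf-8`)); // true
console.log(sameEncodingName(`GBK`, `gb2312`));  // false
```

Code written this way would not notice a change in the case of the returned name, but it would still see a difference when browsers report different names altogether.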

Some interesting results:

&lt;meta charset=&quot;big5&quot;&gt;
Firefox
document.characterSet: Big5
document.charset: undefined
document.inputEncoding: Big5
Chrome
document.characterSet: Big5
document.charset: Big5
document.inputEncoding: Big5
IE
document.characterSet: big5
document.charset: big5
document.inputEncoding: BIG5 

&lt;meta charset=&quot;utf-8&quot;&gt;
Firefox
document.characterSet: UTF-8
document.charset: undefined
document.inputEncoding: UTF-8 
Chrome
document.characterSet: UTF-8
document.charset: UTF-8
document.inputEncoding: UTF-8
IE
document.characterSet: utf-8
document.charset: utf-8
document.inputEncoding: UTF-8 

&lt;meta charset=&quot;gbk&quot;&gt;
Firefox
document.characterSet: gbk
document.charset: undefined
document.inputEncoding: gbk 
Chrome
document.characterSet: GBK
document.charset: GBK
document.inputEncoding: GBK
IE
document.characterSet: gb2312
document.charset: gb2312
document.inputEncoding: GB2312

&lt;meta charset=&quot;gb18030&quot;&gt;
Firefox
document.characterSet: gb18030
document.charset: undefined
document.inputEncoding: gb18030
Chrome
document.characterSet: gb18030
document.charset: gb18030
document.inputEncoding: gb18030
IE
document.characterSet: GB18030
document.charset: GB18030
document.inputEncoding: GB18030

If the various APIs can return encoding names in different case, then I think the minimum is to add that information to the Encoding spec (somewhere near the table which lists those names).</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>