<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>14526</bug_id>
          
          <creation_ts>2011-10-20 15:27:03 +0000</creation_ts>
          <short_desc>WF2: When adding filenames to the data set, should there be normalization of decomposed forms?</short_desc>
          <delta_ts>2012-07-20 04:31:50 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>HTML</component>
          <version>unspecified</version>
          <rep_platform>Other</rep_platform>
          <op_sys>other</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc>http://www.whatwg.org/specs/web-apps/current-work/#constructing-the-form-data-set</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter>contributor</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>bzbarsky</cc>
    
    <cc>ej</cc>
    
    <cc>ian</cc>
    
    <cc>me</cc>
    
    <cc>mike</cc>
    
    <cc>mjs</cc>
    
    <cc>naruse</cc>
    
    <cc>VYV03354</cc>
          
          <qa_contact>contributor</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>58546</commentid>
    <comment_count>0</comment_count>
    <who name="">contributor</who>
    <bug_when>2011-10-20 15:27:03 +0000</bug_when>
    <thetext>Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html
Multipage: http://www.whatwg.org/C#constructing-the-form-data-set
Complete: http://www.whatwg.org/c#constructing-the-form-data-set

Comment:
When adding filenames to the data set, should there be normalization of
decomposed forms?

Posted from: 71.184.125.56
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:10.0a1) Gecko/20111017 Firefox/10.0a1</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>58547</commentid>
    <comment_count>1</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2011-10-20 15:27:27 +0000</bug_when>
    <thetext>Apparently at least some sites make assumptions about precomposed vs decomposed forms; see https://bugzilla.mozilla.org/show_bug.cgi?id=695995</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>58564</commentid>
    <comment_count>2</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-10-20 20:16:03 +0000</bug_when>
    <thetext>Why is https://bugzilla.mozilla.org/show_bug.cgi?id=695995 a problem? The bug doesn&apos;t say why it matters what the uploaded filename is in that case. Are there servers doing comparisons or something?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>58567</commentid>
    <comment_count>3</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2011-10-20 20:19:44 +0000</bug_when>
    <thetext>I&apos;m still trying to get that information.

Note that the bug also cites https://support.mozilla.com/fi/questions/874246 which seems to suggest that servers are doing _something_ dumb with it.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>58701</commentid>
    <comment_count>4</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2011-10-22 05:39:36 +0000</bug_when>
    <thetext>Yes, the server involved in the cited Mozilla bug is doing comparisons without normalizing.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>58871</commentid>
    <comment_count>5</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-10-25 05:07:43 +0000</bug_when>
    <thetext>Hmm. I&apos;m not sure whether that&apos;s truly a problem. I mean, what if the uploaded filename is in uppercase vs lowercase? Or has one space or two somewhere in the filename? Surely basing anything on the file name of the uploaded file is rife with problems, canonicalisation being the least of them.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>58874</commentid>
    <comment_count>6</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2011-10-25 05:12:41 +0000</bug_when>
    <thetext>Dunno.  I&apos;m passing on what info I have so far.  If I get more, I&apos;ll pass on more!

But the fact remains that there appears to be (mostly) browser interop here on an observable behavior that&apos;s broken at least some servers...

A good question is how strong that interop actually is.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>58877</commentid>
    <comment_count>7</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-10-25 05:28:48 +0000</bug_when>
    <thetext>Yeah, I guess I&apos;ll have to test it.

I don&apos;t suppose there&apos;s a convenient test I can start from, by any chance? Otherwise I&apos;ll just build one.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>58883</commentid>
    <comment_count>8</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2011-10-25 06:00:47 +0000</bug_when>
    <thetext>I don&apos;t have a test offhand, sorry.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59116</commentid>
    <comment_count>9</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-10-28 00:53:42 +0000</bug_when>
    <thetext>Ok I built a test: http://damowmow.com/playground/demos/filename-upload/

Results:
Firefox/10.0a1 on Mac and Opera/9.80 on Mac send the filename decomposed.
Everyone else I tested (IE/9, Firefox/5 on Windows, Safari/5 on Mac and Windows, Chrome/16 on Mac and Windows) sends the filename composed.

No difference between GET and POST.

I guess I&apos;ll update the spec to say to send the filename composed. Any particular guess as to what kind of normalisation I should be applying here? NFC?

I&apos;ll test to see what browsers do using the example in the third row of figure 6 of http://unicode.org/reports/tr15/ unless someone gets there before me.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59117</commentid>
    <comment_count>10</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2011-10-28 00:59:38 +0000</bug_when>
    <thetext>No idea on choice of normalization.  Not something I know well enough to comment on intelligently...</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59143</commentid>
    <comment_count>11</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-10-28 19:39:40 +0000</bug_when>
    <thetext>I created a test that would distinguish normalisation forms:
http://damowmow.com/playground/demos/filename-upload/002.html

When I create the given file name on Mac, I get a file whose name&apos;s bytes are displayed to the console by ls(1) piped through hexdump as:

   c5 bf cc a3 cc 87 e2 84 ab

This isn&apos;t what I expected. In particular, it means that it is not normalising singletons, but is doing NFD for composition. As far as I can tell.

Uploading this file results in the following (recent builds or latest shipping copy in all cases, only testing POST):

Mac Firefox: same as file system (c5 bf cc a3 cc 87 e2 84 ab)
Mac Opera: same as file system (c5 bf cc a3 cc 87 e2 84 ab)
Mac Safari: NFC (e1 ba 9b cc a3 c3 85)
Mac Chrome: NFC (e1 ba 9b cc a3 c3 85)

On Windows I had more trouble creating the file. I copied and pasted the string from the page in IE to a command shell to create the file. According to dir, the file had three characters, which it displayed as &quot;??Å.txt&quot;. No idea what kind of &quot;Å&quot; that is, unfortunately. Then I tried uploading it (sorry about the old software versions):

IE9: original bytes (e1 ba 9b cc a3 e2 84 ab)
Win Firefox 5: original bytes (e1 ba 9b cc a3 e2 84 ab)
Win Safari 5: NFC (e1 ba 9b cc a3 c3 85)
Win Chrome: NFC (e1 ba 9b cc a3 c3 85)

So basically, as far as I can tell, all browsers except WebKit-based browsers do no normalisation; they just trust the file system. On Mac this is slightly problematic only because Mac&apos;s file system does its own normalisation. WebKit always does NFC normalisation on the file name before submission.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59427</commentid>
    <comment_count>12</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-11-01 17:30:15 +0000</bug_when>
    <thetext>I went with requiring NFC, since that seems like the only option that will lead to any kind of interop.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59428</commentid>
    <comment_count>13</comment_count>
    <who name="">contributor</who>
    <bug_when>2011-11-01 17:31:17 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r6810.
Check-in comment: Require NFC for file names from &lt;input type=file&gt;.
http://html5.org/tools/web-apps-tracker?from=6809&amp;to=6810</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59466</commentid>
    <comment_count>14</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2011-11-02 02:59:44 +0000</bug_when>
    <thetext>We may not be out of the woods here.  See https://bugzilla.mozilla.org/show_bug.cgi?id=695995#c18

I asked the commenter to comment here directly as needed...</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59522</commentid>
    <comment_count>15</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2011-11-02 20:43:18 +0000</bug_when>
    <thetext>Mac OS uses a special variant of NFD that avoids normalizing CJK Compatibility Ideographs, because some Compatibility Ideographs are important (even required) in Japan. Roughly speaking, it excludes specific ranges of code points from normalization.

I found a proposal document from Apple (but rejected by UTC).
http://www.unicode.org/review/resolved-pri.html#pri7
http://www.unicode.org/review/pr-7b.html
Note that this proposal is a bit different from what Mac OS is actually using. Mac OS also excludes code points from U+2000 to U+2FFF.

I think we should define &quot;willful violation of UAX #15&quot; or &quot;Web Normalization&quot; or something other than NFC.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59552</commentid>
    <comment_count>16</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-11-03 15:31:19 +0000</bug_when>
    <thetext>So how exactly should it be defined? &quot;File names must be exposed in a normalized form, whether in the DOM (e.g. in File objects) or in form submission, regardless of the conventions of the user agent&apos;s platform&apos;s file system. The normalization form used must be Unicode normalization Form C (NFC), except that input characters in the range U+2000 to U+2FFF, U+F900 to U+FA6A, and U+2F800 to U+2FA1D must be left unchanged in the output.&quot;?

This isn&apos;t what any browser does as far as I can tell. Are we sure that what WebKit does is broken for CJK?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59577</commentid>
    <comment_count>17</comment_count>
    <who name="NARUSE, Yui">naruse</who>
    <bug_when>2011-11-03 18:33:00 +0000</bug_when>
    <thetext>(In reply to comment #15)
&gt; I found a proposal document from Apple (but rejected by UTC).
&gt; http://www.unicode.org/review/resolved-pri.html#pri7
&gt; http://www.unicode.org/review/pr-7b.html
&gt; Note that this proposal is a bit different from what Mac OS is actually using.
&gt; Mac OS also excludes code points from U+2000 to U+2FFF.
&gt; 
&gt; I think we should define &quot;willful violation of UAX #15&quot; or &quot;Web Normalization&quot;
&gt; or something other than NFC.

A recent document says exactly what you say:
&quot;U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed&quot;
http://developer.apple.com/library/mac/#qa/qa1173/_index.html</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59611</commentid>
    <comment_count>18</comment_count>
    <who name="NARUSE, Yui">naruse</who>
    <bug_when>2011-11-04 06:25:31 +0000</bug_when>
    <thetext>(In reply to comment #16)
&gt; So how exactly should it be defined? &quot;File names must be exposed in a
&gt; normalized form, whether in the DOM (e.g. in File objects) or in form
&gt; submission, regardless of the conventions of the user agent&apos;s platform&apos;s file
&gt; system. The normalization form used must be Unicode normalization Form C (NFC),
&gt; except that input characters in the range U+2000 to U+2FFF, U+F900 to U+FA6A,
&gt; and U+2F800 to U+2FA1D must be left unchanged in the output.&quot;?

I think so.
But whether such behavior should be portable (that is, applied on platforms other than Mac OS X) is debatable.

Imagine the following situation: a directory has two files, U+795E.txt and U+FA19.txt,
and the user wants to upload them. As you can see, after normalization the DOM and the receiving server
can&apos;t distinguish them. Normalization considered harmful.

It is harmless only when the file&apos;s filesystem normalizes names itself
and the filesystem and the browser use exactly the same algorithm.

Ideally, normalization of filenames should be done only for files on normalizing
filesystems such as HFS Plus (although assuming that filenames on Mac OS X are
normalized may be acceptable).

&gt; This isn&apos;t what any browser does as far as I can tell. Are we sure that what
&gt; WebKit does is broken for CJK?

Yes, current WebKit normalizes those kanji, and that is considered breakage.
You can see the breakage by uploading U+FA19.txt.
After uploading, it becomes U+795E.txt, and you can see that the left part of the kanji has changed.
These kanji have the same meaning, &quot;god&quot;, and were specified as compatibility characters for
some political reason, but people don&apos;t want them normalized except in a genuine
normalization situation.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>61100</commentid>
    <comment_count>19</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-12-07 18:32:04 +0000</bug_when>
    <thetext>That argues for not doing any kind of normalisation.

bz: What do you think? Looks like NFC is out, and modified NFC would cause problems on Windows. Suggestions? I&apos;m leaning back towards &quot;trust the filesystem&quot;.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>61114</commentid>
    <comment_count>20</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2011-12-07 20:18:51 +0000</bug_when>
    <thetext>I guess I can live with that if UAs actually converge on it....</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>61200</commentid>
    <comment_count>21</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-12-08 00:02:52 +0000</bug_when>
    <thetext>I guess we should file a bug on WebKit and see if they&apos;re willing to change?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>61344</commentid>
    <comment_count>22</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-12-09 22:23:17 +0000</bug_when>
    <thetext>I&apos;m going to remove the normalisation stuff and, if nobody else gets there before me, file a bug on WebKit to remove the normalisation.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>70773</commentid>
    <comment_count>23</comment_count>
    <who name="">contributor</who>
    <bug_when>2012-07-18 17:27:56 +0000</bug_when>
    <thetext>This bug was cloned to create bug 18153 as part of operation convergence.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>71240</commentid>
    <comment_count>24</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2012-07-20 04:31:17 +0000</bug_when>
    <thetext>Filed https://bugs.webkit.org/show_bug.cgi?id=91817 and reverted spec.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>71241</commentid>
    <comment_count>25</comment_count>
    <who name="">contributor</who>
    <bug_when>2012-07-20 04:31:50 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r7195.
Check-in comment: Revert r6810 since it doesn&apos;t work.
http://html5.org/tools/web-apps-tracker?from=7194&amp;to=7195</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>