This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 18153 - WF2: When adding filenames to the data set, should there be normalization of decomposed forms?
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: HTML5 spec
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: ---
Assignee: This bug has no owner yet - up for the taking
QA Contact: HTML WG Bugzilla archive list
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-18 17:27 UTC by contributor
Modified: 2012-09-14 07:13 UTC

Description contributor 2012-07-18 17:27:52 UTC
This was cloned from bug 14526 as part of operation convergence.
Originally filed: 2011-10-20 15:27:00 +0000

================================================================================
 #0   contributor@whatwg.org                          2011-10-20 15:27:03 +0000 
--------------------------------------------------------------------------------
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html
Multipage: http://www.whatwg.org/C#constructing-the-form-data-set
Complete: http://www.whatwg.org/c#constructing-the-form-data-set

Comment:
When adding filenames to the data set, should there be normalization of
decomposed forms?

Posted from: 71.184.125.56
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:10.0a1) Gecko/20111017 Firefox/10.0a1
================================================================================
 #1   Boris Zbarsky                                   2011-10-20 15:27:27 +0000 
--------------------------------------------------------------------------------
Apparently at least some sites make assumptions about precomposed vs decomposed forms; see https://bugzilla.mozilla.org/show_bug.cgi?id=695995
================================================================================
 #2   Ian 'Hixie' Hickson                             2011-10-20 20:16:03 +0000 
--------------------------------------------------------------------------------
Why is https://bugzilla.mozilla.org/show_bug.cgi?id=695995 a problem? The bug doesn't say why it matters what the uploaded filename is in that case. Are there servers doing comparisons or something?
================================================================================
 #3   Boris Zbarsky                                   2011-10-20 20:19:44 +0000 
--------------------------------------------------------------------------------
I'm still trying to get that information.

Note that the bug also cites https://support.mozilla.com/fi/questions/874246 which seems to suggest that servers are doing _something_ dumb with it.
================================================================================
 #4   Boris Zbarsky                                   2011-10-22 05:39:36 +0000 
--------------------------------------------------------------------------------
Yes, the server involved in the cited Mozilla bug is doing comparisons without normalizing.
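
A minimal sketch of the failure mode described here, assuming a server that compares an uploaded file name against a stored one. The file name and the helper same_name are hypothetical and are not taken from the cited Mozilla bug.

```python
import unicodedata

# Hypothetical file name in precomposed form (U+00E9 for each "é").
composed = "r\u00e9sum\u00e9.txt"
# The same name in decomposed form ("e" followed by U+0301).
decomposed = unicodedata.normalize("NFD", composed)


def same_name(a: str, b: str) -> bool:
    """Compare two file names after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain code-point comparison treats the two spellings as different names.
naive_match = composed == decomposed
# Normalizing both sides first makes them compare equal.
normalized_match = same_name(composed, decomposed)

print(naive_match, normalized_match)
```

A server that normalizes both operands before comparing does not depend on which form the browser happens to send.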
================================================================================
 #5   Ian 'Hixie' Hickson                             2011-10-25 05:07:43 +0000 
--------------------------------------------------------------------------------
Hmm. I'm not sure whether that's truly a problem. I mean, what if the uploaded filename is in uppercase vs lowercase? Or has one space or two somewhere in the filename? Surely basing anything on the file name of the uploaded file is rife with problems, canonicalisation being the least of them.
================================================================================
 #6   Boris Zbarsky                                   2011-10-25 05:12:41 +0000 
--------------------------------------------------------------------------------
Dunno.  I'm passing on what info I have so far.  If I get more, I'll pass on more!

But the fact remains that there appears to be (mostly) browser interop here on an observable behavior that's broken at least some servers...

A good question is how strong that interop actually is.
================================================================================
 #7   Ian 'Hixie' Hickson                             2011-10-25 05:28:48 +0000 
--------------------------------------------------------------------------------
Yeah, I guess I'll have to test it.

I don't suppose there's a convenient test I can start from, by any chance? Otherwise I'll just build one.
================================================================================
 #8   Boris Zbarsky                                   2011-10-25 06:00:47 +0000 
--------------------------------------------------------------------------------
I don't have a test offhand, sorry.
================================================================================
 #9   Ian 'Hixie' Hickson                             2011-10-28 00:53:42 +0000 
--------------------------------------------------------------------------------
Ok I built a test: http://damowmow.com/playground/demos/filename-upload/

Results:
Firefox/10.0a1 on Mac and Opera/9.80 on Mac send the filename decomposed.
Everyone else I tested (IE/9, Firefox/5 on Windows, Safari/5 on Mac and Windows, Chrome/16 on Mac and Windows) send the filename composed.

No difference between GET and POST.

I guess I'll update the spec to say to send the filename composed. Any particular guess as to what kind of normalisation I should be applying here? NFC?

I'll test to see what browsers do using the example in the third row of figure 6 of http://unicode.org/reports/tr15/ unless someone gets there before me.
================================================================================
 #10  Boris Zbarsky                                   2011-10-28 00:59:38 +0000 
--------------------------------------------------------------------------------
No idea on choice of normalization.  Not something I know well enough to comment on intelligently...
================================================================================
 #11  Ian 'Hixie' Hickson                             2011-10-28 19:39:40 +0000 
--------------------------------------------------------------------------------
I created a test that would distinguish normalisation forms:
http://damowmow.com/playground/demos/filename-upload/002.html

When I create the given file name on Mac, I get a file whose name's bytes are displayed to the console by ls(1) piped through hexdump as:

   c5 bf cc a3 cc 87 e2 84 ab

This isn't what I expected. In particular, it means that it is not normalising singletons, but is doing NFD for composition. As far as I can tell.

Uploading this file results in the following (recent builds or latest shipping copy in all cases, only testing POST):

Mac Firefox: same as file system (c5 bf cc a3 cc 87 e2 84 ab)
Mac Opera: same as file system (c5 bf cc a3 cc 87 e2 84 ab)
Mac Safari: NFC (e1 ba 9b cc a3 c3 85)
Mac Chrome: NFC (e1 ba 9b cc a3 c3 85)

On Windows I had more trouble creating the file. I copied and pasted the string from the page in IE to a command shell to create the file. According to dir, the file had three characters, which it displayed as "??Å.txt". No idea what kind of "Å" that is, unfortunately. Then I tried uploading it (sorry about the old software versions):

IE9: original bytes (e1 ba 9b cc a3 e2 84 ab)
Win Firefox 5: original bytes (e1 ba 9b cc a3 e2 84 ab)
Win Safari 5: NFC (e1 ba 9b cc a3 c3 85)
Win Chrome: NFC (e1 ba 9b cc a3 c3 85)

So basically as far as I can tell, all browsers except WebKit-based browsers do no normalisation, they just trust the file system. On Mac this is slightly problematic only because Mac's file system does its own normalisation. WebKit always does NFC normalisation on the file name before submission.
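
The byte sequences above can be reproduced with Python's unicodedata module. This is a minimal sketch that assumes the test string on the page is U+1E9B U+0323 U+212B (inferred from the "original bytes" reported for Windows) and that ignores the file extension; utf8_hex and the variable names are illustrative only.

```python
import unicodedata


def utf8_hex(s: str) -> str:
    """Return the UTF-8 encoding of s as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))


# Assumed test string: long s with dot above, combining dot below, angstrom sign.
page_string = "\u1e9b\u0323\u212b"
# Name as reported by the Mac file system: decomposed, angstrom sign untouched.
mac_fs_name = "\u017f\u0323\u0307\u212b"

nfc = unicodedata.normalize("NFC", page_string)
nfd = unicodedata.normalize("NFD", page_string)

print("page  :", utf8_hex(page_string))  # e1 ba 9b cc a3 e2 84 ab
print("mac fs:", utf8_hex(mac_fs_name))  # c5 bf cc a3 cc 87 e2 84 ab
print("NFC   :", utf8_hex(nfc))          # e1 ba 9b cc a3 c3 85
print("NFD   :", utf8_hex(nfd))          # c5 bf cc a3 cc 87 41 cc 8a
```

Under that assumption, the Mac file system bytes match neither standard form: the first three characters are in NFD, while the singleton U+212B is neither decomposed nor replaced.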
================================================================================
 #12  Ian 'Hixie' Hickson                             2011-11-01 17:30:15 +0000 
--------------------------------------------------------------------------------
I went with requiring NFC, since that seems like the only option that will lead to any kind of interop.
================================================================================
 #13  contributor@whatwg.org                          2011-11-01 17:31:17 +0000 
--------------------------------------------------------------------------------
Checked in as WHATWG revision r6810.
Check-in comment: Require NFC for file names from <input type=file>.
http://html5.org/tools/web-apps-tracker?from=6809&to=6810
================================================================================
 #14  Boris Zbarsky                                   2011-11-02 02:59:44 +0000 
--------------------------------------------------------------------------------
We may not be out of the woods here.  See https://bugzilla.mozilla.org/show_bug.cgi?id=695995#c18

I asked the commenter to comment here directly as needed...
================================================================================
 #15  Masatoshi Kimura                                2011-11-02 20:43:18 +0000 
--------------------------------------------------------------------------------
Mac OS uses a special variant of NFD to avoid normalizing CJK Compatibility Ideographs, because some of the Compatibility Ideographs are important (even required) in Japan. Roughly speaking, it excludes specific ranges of code points from normalization.

I found a proposal document from Apple (but rejected by UTC).
http://www.unicode.org/review/resolved-pri.html#pri7
http://www.unicode.org/review/pr-7b.html
Note that this proposal is a bit different from what Mac OS is actually using. Mac OS also excludes code points from U+2000 to U+2FFF.

I think we should define "willful violation of UAX #15" or "Web Normalization" or something other than NFC.
================================================================================
 #16  Ian 'Hixie' Hickson                             2011-11-03 15:31:19 +0000 
--------------------------------------------------------------------------------
So how exactly should it be defined? "File names must be exposed in a normalized form, whether in the DOM (e.g. in File objects) or in form submission, regardless of the conventions of the user agent's platform's file system. The normalization form used must be Unicode normalization Form C (NFC), except that input characters in the range U+2000 to U+2FFF, U+F900 to U+FA6A, and U+2F800 to U+2FA1D must be left unchanged in the output."?

This isn't what any browser does as far as I can tell. Are we sure that what WebKit does is broken for CJK?
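
One possible reading of the proposed wording, as a sketch only: apply NFC to each run of characters outside the listed ranges and copy characters inside the ranges through unchanged. The names EXCLUDED_RANGES, is_excluded and modified_nfc are hypothetical, and the run-by-run approach is an assumption; it does not address how combining marks adjacent to an excluded character should interact with it.

```python
import unicodedata
from itertools import groupby

# Ranges from the proposed wording in this comment (inclusive bounds).
EXCLUDED_RANGES = ((0x2000, 0x2FFF), (0xF900, 0xFA6A), (0x2F800, 0x2FA1D))


def is_excluded(ch: str) -> bool:
    """True if ch falls in one of the ranges that must be left unchanged."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EXCLUDED_RANGES)


def modified_nfc(name: str) -> str:
    """NFC-normalize every run of non-excluded characters; keep the rest as-is."""
    parts = []
    for excluded, run in groupby(name, key=is_excluded):
        text = "".join(run)
        parts.append(text if excluded else unicodedata.normalize("NFC", text))
    return "".join(parts)


# The compatibility ideograph U+FA19 survives, unlike with plain NFC.
print(modified_nfc("\ufa19.txt") == "\ufa19.txt")                        # True
print(unicodedata.normalize("NFC", "\ufa19.txt") == "\u795e.txt")        # True
# Decomposed input outside the ranges is still composed.
print(modified_nfc("\u017f\u0323\u0307\u212b") == "\u1e9b\u0323\u212b")  # True
```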
================================================================================
 #17  NARUSE, Yui                                     2011-11-03 18:33:00 +0000 
--------------------------------------------------------------------------------
(In reply to comment #15)
> I found a proposal document from Apple (but rejected by UTC).
> http://www.unicode.org/review/resolved-pri.html#pri7
> http://www.unicode.org/review/pr-7b.html
> Note that this proposal is a bit different from what Mac OS is actually using.
> Mac OS also excludes code points from U+2000 to U+2FFF.
> 
> I think we should define "willful violation of UAX #15" or "Web Normalization"
> or something other than NFC.

A recent document says exactly what you say:
"U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed"
http://developer.apple.com/library/mac/#qa/qa1173/_index.html
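
A rough sketch of the rule quoted above, assuming it can be approximated by applying standard NFD to every run of characters outside the listed ranges. Apple's actual implementation may differ in details such as the Unicode version of its tables. NOT_DECOMPOSED and hfs_like_nfd are hypothetical names, and the input string is the assumed test string from comment #11.

```python
import unicodedata
from itertools import groupby

# Ranges quoted from the Apple document (inclusive bounds).
NOT_DECOMPOSED = ((0x2000, 0x2FFF), (0xF900, 0xFAFF), (0x2F800, 0x2FAFF))


def hfs_like_nfd(name: str) -> str:
    """Decompose with NFD, except for characters in the quoted ranges."""

    def kept(ch: str) -> bool:
        return any(lo <= ord(ch) <= hi for lo, hi in NOT_DECOMPOSED)

    return "".join(
        "".join(run) if skip else unicodedata.normalize("NFD", "".join(run))
        for skip, run in groupby(name, key=kept)
    )


# Assumed test string from comment #11: U+1E9B U+0323 U+212B.
result = hfs_like_nfd("\u1e9b\u0323\u212b")
print(result.encode("utf-8").hex())  # c5bfcca3cc87e284ab
```

With that input, the sketch yields the same bytes that ls(1) reported on Mac in comment #11, which is consistent with U+212B being left alone because it lies in U+2000 through U+2FFF.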
================================================================================
 #18  NARUSE, Yui                                     2011-11-04 06:25:31 +0000 
--------------------------------------------------------------------------------
(In reply to comment #16)
> So how exactly should it be defined? "File names must be exposed in a
> normalized form, whether in the DOM (e.g. in File objects) or in form
> submission, regardless of the conventions of the user agent's platform's file
> system. The normalization form used must be Unicode normalization Form C (NFC),
> except that input characters in the range U+2000 to U+2FFF, U+F900 to U+FA6A,
> and U+2F800 to U+2FA1D must be left unchanged in the output."?

I think so.
But whether such behavior should be portable (that is, applied on platforms other than Mac OS X) is debatable.

Imagine the following situation: a directory has two files, U+795E.txt and U+FA19.txt,
and the user wants to upload them. As you can see, after normalization the DOM and the
receiving server can't distinguish them. Normalization considered harmful.

It is harmless only where the file's filesystem itself normalizes names,
and the filesystem and the browser use exactly the same algorithm.

Ideally, normalization of filenames should be done only for files on normalizing
filesystems such as HFS Plus. (But assuming that filenames on Mac OS X are
normalized can be acceptable.)

> This isn't what any browser does as far as I can tell. Are we sure that what
> WebKit does is broken for CJK?

Yes, current WebKit normalizes those kanji, and this is considered breakage.
You can see the breakage by uploading U+FA19.txt.
After uploading, it becomes U+795E.txt, and you can see that the left part of the kanji has changed.
These kanji have the same meaning, "god", and were specified as compatibility characters for
some political reason, but people don't want them normalized outside of a genuine
normalization situation.
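
The collision between U+795E.txt and U+FA19.txt described in this comment can be checked directly. This is a small sketch using Python's unicodedata module; the variable names are illustrative.

```python
import unicodedata

# Two distinct file names that differ only in the ideograph used.
names = ["\u795e.txt", "\ufa19.txt"]

# What a browser applying plain NFC would submit for each file.
submitted = [unicodedata.normalize("NFC", n) for n in names]

distinct_on_disk = len(set(names))
distinct_after_nfc = len(set(submitted))

print(distinct_on_disk, distinct_after_nfc)  # 2 1
```

Both files reach the server under the name U+795E.txt, so the original distinction is lost.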
================================================================================
 #19  Ian 'Hixie' Hickson                             2011-12-07 18:32:04 +0000 
--------------------------------------------------------------------------------
That argues for not doing any kind of normalisation.

bz: What do you think? Looks like NFC is out, and modified NFC would cause problems on Windows. Suggestions? I'm leaning back towards "trust the filesystem".
================================================================================
 #20  Boris Zbarsky                                   2011-12-07 20:18:51 +0000 
--------------------------------------------------------------------------------
I guess I can live with that if UAs actually converge on it....
================================================================================
 #21  Ian 'Hixie' Hickson                             2011-12-08 00:02:52 +0000 
--------------------------------------------------------------------------------
I guess we should file a bug on WebKit and see if they're willing to change?
================================================================================
 #22  Ian 'Hixie' Hickson                             2011-12-09 22:23:17 +0000 
--------------------------------------------------------------------------------
I'm going to remove the normalisation stuff and, if nobody else gets there before me, file a bug on WebKit to remove the normalisation.
================================================================================
Comment 1 Silvia Pfeiffer 2012-09-14 07:13:04 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
   you are satisfied with this response, please change the state of
   this bug to CLOSED. If you have additional information and would
   like the Editor to reconsider, please reopen this bug. If you would
   like to escalate the issue to the full HTML Working Group, please
   add the TrackerRequest keyword to this bug, and suggest title and
   text for the Tracker Issue; or you may create a Tracker Issue
   yourself, if you are able to do so. For more details, see this
   document:       http://dev.w3.org/html5/decision-policy/decision-policy.html

   Status: Revert Accepted
   Change Description:
https://github.com/w3c/html/commit/93267bb4ace76bb67582890bd5e8c5e47c9cedca
   Rationale: applied the commit of WHATWG, see comments at https://www.w3.org/Bugs/Public/show_bug.cgi?id=14526