This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 14526 - WF2: When adding filenames to the data set, should there be normalization of decomposed forms?
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-10-20 15:27 UTC by contributor
Modified: 2012-07-20 04:31 UTC

Description contributor 2011-10-20 15:27:03 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html
Multipage: http://www.whatwg.org/C#constructing-the-form-data-set
Complete: http://www.whatwg.org/c#constructing-the-form-data-set

Comment:
When adding filenames to the data set, should there be normalization of
decomposed forms?

Posted from: 71.184.125.56
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:10.0a1) Gecko/20111017 Firefox/10.0a1
Comment 1 Boris Zbarsky 2011-10-20 15:27:27 UTC
Apparently at least some sites make assumptions about precomposed vs decomposed forms; see https://bugzilla.mozilla.org/show_bug.cgi?id=695995
Comment 2 Ian 'Hixie' Hickson 2011-10-20 20:16:03 UTC
Why is https://bugzilla.mozilla.org/show_bug.cgi?id=695995 a problem? The bug doesn't say why it matters what the uploaded filename is in that case. Are there servers doing comparisons or something?
Comment 3 Boris Zbarsky 2011-10-20 20:19:44 UTC
I'm still trying to get that information.

Note that the bug also cites https://support.mozilla.com/fi/questions/874246 which seems to suggest that servers are doing _something_ dumb with it.
Comment 4 Boris Zbarsky 2011-10-22 05:39:36 UTC
Yes, the server involved in the cited Mozilla bug is doing comparisons without normalizing.
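
A minimal sketch of that failure mode, assuming a hypothetical server-side helper (the actual server code is not shown in the cited bug). The file names and the function name `same_filename` are illustrative only:

```python
import unicodedata

def same_filename(a, b):
    """Compare two file names after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The same visible name, once precomposed and once decomposed.
composed = "caf\u00e9.txt"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301.txt"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)               # naive comparison, as the server does
print(same_filename(composed, decomposed))  # comparison after normalizing both sides
```

The naive comparison treats the two names as different, which is why a browser that sends the decomposed form can break a server expecting the precomposed one.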
Comment 5 Ian 'Hixie' Hickson 2011-10-25 05:07:43 UTC
Hmm. I'm not sure whether that's truly a problem. I mean, what if the uploaded filename is in uppercase vs lowercase? Or has one space or two somewhere in the filename? Surely basing anything on the file name of the uploaded file is rife with problems, canonicalisation being the least of them.
Comment 6 Boris Zbarsky 2011-10-25 05:12:41 UTC
Dunno.  I'm passing on what info I have so far.  If I get more, I'll pass on more!

But the fact remains that there appears to be (mostly) browser interop here on an observable behavior that's broken at least some servers...

A good question is how strong that interop actually is.
Comment 7 Ian 'Hixie' Hickson 2011-10-25 05:28:48 UTC
Yeah, I guess I'll have to test it.

I don't suppose there's a convenient test I can start from, by any chance? Otherwise I'll just build one.
Comment 8 Boris Zbarsky 2011-10-25 06:00:47 UTC
I don't have a test offhand, sorry.
Comment 9 Ian 'Hixie' Hickson 2011-10-28 00:53:42 UTC
Ok I built a test: http://damowmow.com/playground/demos/filename-upload/

Results:
Firefox/10.0a1 on Mac and Opera/9.80 on Mac send the filename decomposed.
Everyone else I tested (IE/9, Firefox/5 on Windows, Safari/5 on Mac and Windows, Chrome/16 on Mac and Windows) send the filename composed.

No difference between GET and POST.

I guess I'll update the spec to say to send the filename composed. Any particular guess as to what kind of normalisation I should be applying here? NFC?

I'll test to see what browsers do using the example in the third row of figure 6 of http://unicode.org/reports/tr15/ unless someone gets there before me.
Comment 10 Boris Zbarsky 2011-10-28 00:59:38 UTC
No idea on choice of normalization.  Not something I know well enough to comment on intelligently...
Comment 11 Ian 'Hixie' Hickson 2011-10-28 19:39:40 UTC
I created a test that would distinguish normalisation forms:
http://damowmow.com/playground/demos/filename-upload/002.html

When I create the given file name on Mac, I get a file whose name's bytes are displayed to the console by ls(1) piped through hexdump as:

   c5 bf cc a3 cc 87 e2 84 ab

This isn't what I expected. In particular, it means that it is not normalising singletons, but is doing NFD for composition. As far as I can tell.

Uploading this file results in the following (recent builds or latest shipping copy in all cases, only testing POST):

Mac Firefox: same as file system (c5 bf cc a3 cc 87 e2 84 ab)
Mac Opera: same as file system (c5 bf cc a3 cc 87 e2 84 ab)
Mac Safari: NFC (e1 ba 9b cc a3 c3 85)
Mac Chrome: NFC (e1 ba 9b cc a3 c3 85)

On Windows I had more trouble creating the file. I copied and pasted the string from the page in IE to a command shell to create the file. According to dir, the file had three characters, which it displayed as "??Å.txt". No idea what kind of "Å" that is, unfortunately. Then I tried uploading it (sorry about the old software versions):

IE9: original bytes (e1 ba 9b cc a3 e2 84 ab)
Win Firefox 5: original bytes (e1 ba 9b cc a3 e2 84 ab)
Win Safari 5: NFC (e1 ba 9b cc a3 c3 85)
Win Chrome: NFC (e1 ba 9b cc a3 c3 85)

So basically as far as I can tell, all browsers except WebKit-based browsers do no normalisation, they just trust the file system. On Mac this is slightly problematic only because Mac's file system does its own normalisation. WebKit always does NFC normalisation on the file name before submission.
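
The byte sequences above can be reproduced roughly as follows. This is a sketch using Python's standard unicodedata module; the test string is inferred from the "original bytes" reported for IE9 (U+1E9B, U+0323, U+212B), and `utf8_hex` is a helper introduced here for illustration:

```python
import unicodedata

# Inferred test string: U+1E9B, U+0323, U+212B (ANGSTROM SIGN).
original = "\u1e9b\u0323\u212b"

def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return text.encode("utf-8").hex(" ")

nfc = unicodedata.normalize("NFC", original)
nfd = unicodedata.normalize("NFD", original)

print("original:", utf8_hex(original))
print("NFC:     ", utf8_hex(nfc))
print("NFD:     ", utf8_hex(nfd))
```

The NFC line matches what Safari and Chrome sent. Plain NFD ends in 41 cc 8a, because it also decomposes U+212B, whereas the Mac file system name ends in e2 84 ab; that difference is the singleton that was left alone.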
Comment 12 Ian 'Hixie' Hickson 2011-11-01 17:30:15 UTC
I went with requiring NFC, since that seems like the only option that will lead to any kind of interop.
Comment 13 contributor 2011-11-01 17:31:17 UTC
Checked in as WHATWG revision r6810.
Check-in comment: Require NFC for file names from <input type=file>.
http://html5.org/tools/web-apps-tracker?from=6809&to=6810
Comment 14 Boris Zbarsky 2011-11-02 02:59:44 UTC
We may not be out of the woods here.  See https://bugzilla.mozilla.org/show_bug.cgi?id=695995#c18

I asked the commenter to comment here directly as needed...
Comment 15 Masatoshi Kimura 2011-11-02 20:43:18 UTC
Mac OS uses a special variant of NFD that avoids normalizing CJK Compatibility Ideographs, because some Compatibility Ideographs are important (even required) in Japan. Roughly speaking, it excludes specific ranges of code points from normalization.

I found a proposal document from Apple (but rejected by UTC).
http://www.unicode.org/review/resolved-pri.html#pri7
http://www.unicode.org/review/pr-7b.html
Note that this proposal is a bit different from what Mac OS is actually using. Mac OS also excludes code points from U+2000 to U+2FFF.

I think we should define "willful violation of UAX #15" or "Web Normalization" or something other than NFC.
Comment 16 Ian 'Hixie' Hickson 2011-11-03 15:31:19 UTC
So how exactly should it be defined? "File names must be exposed in a normalized form, whether in the DOM (e.g. in File objects) or in form submission, regardless of the conventions of the user agent's platform's file system. The normalization form used must be Unicode normalization Form C (NFC), except that input characters in the range U+2000 to U+2FFF, U+F900 to U+FA6A, and U+2F800 to U+2FA1D must be left unchanged in the output."?

This isn't what any browser does as far as I can tell. Are we sure that what WebKit does is broken for CJK?
Comment 17 NARUSE, Yui 2011-11-03 18:33:00 UTC
(In reply to comment #15)
> I found a proposal document from Apple (but rejected by UTC).
> http://www.unicode.org/review/resolved-pri.html#pri7
> http://www.unicode.org/review/pr-7b.html
> Note that this proposal is a bit different from what Mac OS is actually using.
> Mac OS also excludes code points from U+2000 to U+2FFF.
> 
> I think we should define "willful violation of UAX #15" or "Web Normalization"
> or something other than NFC.

A recent Apple document says exactly what you say:
"U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed"
http://developer.apple.com/library/mac/#qa/qa1173/_index.html
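
A rough approximation of that rule, assuming the three ranges quoted above and nothing else about Apple's implementation. The function names are hypothetical, and splitting the name into runs before decomposing is a simplification that may differ from the real algorithm where combining marks sit next to an excluded character:

```python
import unicodedata
from itertools import groupby

# Ranges quoted from the Apple document above; these code points are left alone.
EXCLUDED_RANGES = ((0x2000, 0x2FFF), (0xF900, 0xFAFF), (0x2F800, 0x2FAFF))

def is_excluded(ch):
    """True if ch falls in one of the ranges that are not decomposed."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EXCLUDED_RANGES)

def approx_mac_decompose(name):
    """Apply NFD to runs of ordinary characters; keep excluded runs unchanged."""
    parts = []
    for excluded, run in groupby(name, key=is_excluded):
        text = "".join(run)
        parts.append(text if excluded else unicodedata.normalize("NFD", text))
    return "".join(parts)

print(approx_mac_decompose("\u1e9b\u0323\u212b").encode("utf-8").hex(" "))
```

For the test string from comment 11 this yields c5 bf cc a3 cc 87 e2 84 ab, the same bytes observed on the Mac file system.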
Comment 18 NARUSE, Yui 2011-11-04 06:25:31 UTC
(In reply to comment #16)
> So how exactly should it be defined? "File names must be exposed in a
> normalized form, whether in the DOM (e.g. in File objects) or in form
> submission, regardless of the conventions of the user agent's platform's file
> system. The normalization form used must be Unicode normalization Form C (NFC),
> except that input characters in the range U+2000 to U+2FFF, U+F900 to U+FA6A,
> and U+2F800 to U+2FA1D must be left unchanged in the output."?

I think so.
But whether such behavior should be portable (i.e. applied on platforms other than Mac OS X) is debatable.

Imagine the following situation: a directory has two files, U+795E.txt and U+FA19.txt,
and the user wants to upload them. As you can see, the DOM and the receiving server
can't distinguish them. Normalization considered harmful.

It is harmless only where the file's filesystem itself normalizes,
and the filesystem and the browser use exactly the same algorithm.

Ideally, normalization of filenames should be done only for files on normalizing
filesystems such as HFS Plus (though an assumption that filenames on Mac OS X are
normalized can be acceptable).

> This isn't what any browser does as far as I can tell. Are we sure that what
> WebKit does is broken for CJK?

Yes, current WebKit normalizes those kanji, and that is considered breakage.
You can see the breakage by uploading U+FA19.txt.
After uploading, it becomes U+795E.txt, and you can see that the left part of the kanji has changed.
These kanji have the same meaning, "god", and were specified as compatibility characters for
political reasons, but people don't want them normalized except in a true
normalization situation.
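
The collision can be seen with a short sketch using Python's standard unicodedata module. The list `names` stands in for the two files in the hypothetical directory described above:

```python
import unicodedata

# Two distinct file names: the unified ideograph and its compatibility variant.
names = ["\u795e.txt", "\ufa19.txt"]

# What a browser applying NFC before submission would send.
normalized = [unicodedata.normalize("NFC", n) for n in names]

print(len(set(names)))       # number of distinct names on disk
print(len(set(normalized)))  # number of distinct names after NFC
```

After NFC both entries become U+795E.txt, so the server can no longer tell the two uploads apart.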
Comment 19 Ian 'Hixie' Hickson 2011-12-07 18:32:04 UTC
That argues for not doing any kind of normalisation.

bz: What do you think? Looks like NFC is out, and modified NFC would cause problems on Windows. Suggestions? I'm leaning back towards "trust the filesystem".
Comment 20 Boris Zbarsky 2011-12-07 20:18:51 UTC
I guess I can live with that if UAs actually converge on it....
Comment 21 Ian 'Hixie' Hickson 2011-12-08 00:02:52 UTC
I guess we should file a bug on WebKit and see if they're willing to change?
Comment 22 Ian 'Hixie' Hickson 2011-12-09 22:23:17 UTC
I'm going to remove the normalisation stuff and, if nobody else gets there before me, file a bug on WebKit to remove the normalisation.
Comment 23 contributor 2012-07-18 17:27:56 UTC
This bug was cloned to create bug 18153 as part of operation convergence.
Comment 24 Ian 'Hixie' Hickson 2012-07-20 04:31:17 UTC
Filed https://bugs.webkit.org/show_bug.cgi?id=91817 and reverted spec.
Comment 25 contributor 2012-07-20 04:31:50 UTC
Checked in as WHATWG revision r7195.
Check-in comment: Revert r6810 since it doesn't work.
http://html5.org/tools/web-apps-tracker?from=7194&to=7195