16909 – multipart/form-data: field name encoding is not specified; browsers do incompatible things

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED MOVED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	Other All

Importance:	P3 major
Target Milestone:	Unsorted
Assignee:	Ian 'Hixie' Hickson
QA Contact:	contributor

URL:	http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:

Duplicates (1):	19879
Depends on:
Blocks:	19879

Reported:	2012-05-02 20:09 UTC by contributor
Modified:	2019-03-29 21:49 UTC
CC List:	11 users

See Also:

Attachments

Description contributor 2012-05-02 20:09:21 UTC

Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html
Multipage: http://www.whatwg.org/C#multipart-form-data
Complete: http://www.whatwg.org/c#multipart-form-data

Comment:
The specification is unclear about how field names should be encoded. In
particular, what should be done if they include special characters (e.g.
quotes, newlines, Unicode)? I started a mailing list thread on this
issue...

Posted from: 74.66.64.60
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5

Comment 1 Evan Jones 2012-05-02 20:10:52 UTC

The specification is unclear about how field names should be encoded. In particular, what should be done if they include special characters (e.g. quotes, newlines, Unicode)?

Comment 2 Evan Jones 2012-05-02 20:41:21 UTC

Argh; whoops. Sorry for the bugzilla spam. I didn't realize that the "comment" thingy just filed a bugzilla bug.

HTML5 states: "Encode the (now mutated) form data set using the rules described by RFC 2388". However, it then modifies the rules:

"The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified. Their names and values must be encoded using the character encoding selected above (field names in particular do not get converted to a 7-bit safe encoding as suggested in RFC 2388)."

http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html#multipart-form-data

So the problem is: what are we supposed to do with field names? In particular, what if they contain "special" MIME characters (e.g. \r\n newlines, backslashes, double quotes, or semicolons)? Different browsers do different things, meaning that server code currently must detect the browser to do the right thing.

Example: <input name='bàz%22\"\' value="foo">

Firefox 13b: Content-Disposition: form-data; name="bàz%22\\"\"
Webkit nightly: Content-Disposition: form-data; name="bàz%22\%22\"

Firefox backslash-escapes double quotes, but it fails to escape backslashes. This means its header fails to parse according to the MIME specification (it sort of decodes as bàz%22\ with an extra trailing \").

Webkit %-escapes the double quotes, but does not %-escape the percent. Thus the above form control could be either name='bàz"\"\' or the desired name. Webkit has a bug open on this issue, asking for specification guidance: https://bugs.webkit.org/show_bug.cgi?id=62107

HTML5 should specify exactly how field names are encoded. Some potential solutions:

1) Bless Firefox's backslash quoting rules (they are very weird but I think they are unambiguous?). This means Webkit POSTs will be decoded to the wrong field names, and POSTs to older servers may parse incorrectly if the name includes a \ (but that must already happen for Firefox?).

2) Bless Webkit's percent escaping rules (ideally also escaping %). Servers that strictly parse this format will fail to parse Firefox POSTs if the name includes a \, and will

3) Adopt RFC 6266's approach of having two name parameters when there are special characters: one with the existing escaping, and one with an unambiguously escaped version. Ideally, existing servers will parse the first name and not break unless the form value contains a special character. As servers are upgraded, they will be able to unambiguously parse the new header. See: http://tools.ietf.org/html/rfc6266
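
As a rough illustration of option 3, the following Python sketch emits both a legacy name parameter and an extended name* parameter in the RFC 5987 ext-value form that RFC 6266 uses for filename*. The name* parameter for form fields is hypothetical here (it is the proposal, not something browsers are known to send), and the helper names are made up for the example.

```python
from urllib.parse import quote

# Characters RFC 5987 allows unescaped in an ext-value (attr-char),
# beyond the letters, digits, and "_.-~" that quote() never escapes.
ATTR_CHAR_EXTRA = "!#$&+-.^_`|~"


def ext_value(name: str) -> str:
    # RFC 5987 form: charset, empty language tag, percent-encoded UTF-8.
    return "UTF-8''" + quote(name, safe=ATTR_CHAR_EXTRA, encoding="utf-8")


def content_disposition(name: str, legacy_encoded: str) -> str:
    # legacy_encoded is whatever the browser already sends today;
    # name* carries the unambiguous version for upgraded servers.
    return (
        f'Content-Disposition: form-data; name="{legacy_encoded}"; '
        f"name*={ext_value(name)}"
    )
```

Because '%' itself is percent-encoded in the ext-value, an upgraded server could recover the original name exactly, while an older server would keep reading the first parameter.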

Aside: The *same* issue happens for uploaded file names. I started a mailing list thread to attempt to collect more information about this: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-May/035610.html

Comment 3 contributor 2012-07-18 17:24:14 UTC

This bug was cloned to create bug 18135 as part of operation convergence.

Comment 4 Ian 'Hixie' Hickson 2012-12-01 21:59:07 UTC

Larry, any chance RFC 2388 will get updated to resolve this issue?

Comment 5 Ian 'Hixie' Hickson 2013-02-13 00:41:17 UTC

Dropped a mail to Larry, we'll see what he says.

Comment 6 Ian 'Hixie' Hickson 2013-03-27 23:47:51 UTC

http://www.ietf.org/mail-archive/web/apps-discuss/current/msg08908.html

I'm marking this with the same milestone as other form-related stuff, but I doubt I'll actually do this in the HTML spec. Any volunteers want to write this up as a new spec? See the e-mail above if you want to do this in the IETF space, or contact me on IRC if you want to do it in the WHATWG space, I'm sure either way you'll find people eager to help you.

Comment 7 Ian 'Hixie' Hickson 2013-03-27 23:53:45 UTC

*** Bug 19879 has been marked as a duplicate of this bug. ***

Comment 8 Larry Masinter 2013-07-16 07:04:58 UTC

RFC 2388 was clear:
   Field names originally in non-ASCII character sets may be encoded
   within the value of the "name" parameter using the standard method
   described in RFC 2047.
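
For reference, the RFC 2047 encoded-word form mentioned in that passage can be produced with Python's standard email.header module. This is only a sketch of what the encoding looks like; the helper names are hypothetical, and nothing in this thread indicates that any browser sends field names this way.

```python
from email.header import Header, decode_header, make_header


def rfc2047_encode(name: str) -> str:
    # Produce an ASCII-only encoded-word such as =?utf-8?...?=
    return Header(name, "utf-8").encode()


def rfc2047_decode(encoded: str) -> str:
    # Reverse the encoding back to the original text.
    return str(make_header(decode_header(encoded)))
```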

For reasons I don't understand, browsers did different, incompatible
things. 

I think the main advice is: 

* those creating HTML forms 
   SHOULD use ASCII field names, since deployed HTML processors vary,
   and field names shouldn't be visible to the user anyway.

* Those developing server infrastructure to read multipart/form-data uploads
   SHOULD be aware of the varying behavior of the browsers in translating
   non-ASCII field names, and look for any of the variants (if they're 
   expecting non-ASCII field names). 

* Those developing browsers should migrate toward a standard 
  encoding, but the server infrastructure will still have to do
  fuzzy match for a long while.
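
A fuzzy match of the kind described in these points might look like the Python sketch below. The set of variants is an assumption assembled from this thread (WebKit-style %22, Firefox-style backslash-quote, and an RFC 2047 encoded-word); it is not an exhaustive or authoritative list of what browsers send, and the function names are hypothetical.

```python
from email.errors import HeaderParseError
from email.header import decode_header, make_header


def candidate_names(received: str) -> set[str]:
    """Return plausible original field names for a received name value."""
    candidates = {received}
    # WebKit-style: %22 may stand for a double quote.
    candidates.add(received.replace("%22", '"'))
    # Firefox-style: backslash-quote may stand for a double quote.
    candidates.add(received.replace('\\"', '"'))
    # RFC 2047 encoded-word, only attempted when the value looks like one.
    if received.startswith("=?") and received.endswith("?="):
        try:
            candidates.add(str(make_header(decode_header(received))))
        except (HeaderParseError, LookupError, UnicodeError):
            pass  # Not a valid encoded-word; keep the other candidates.
    return candidates


def matches(expected: str, received: str) -> bool:
    # True if the expected field name is among the plausible decodings.
    return expected in candidate_names(received)
```

A matcher like this accepts several wire forms for one expected name, so it can also produce false positives when two distinct field names share a wire form.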

What should the browsers migrate to?

 http://www.rfc-editor.org/rfc/rfc5987.txt 
seems like a more recent proposal and possibly implemented in HTTP anyway.

Sites that use non-ASCII field names and want to work with multiple
browsers already have to do fuzzy matching.

The problem is that the fuzzy matchers already deployed might not
recognize any *NEW* encodings.

So I suppose having a name* value would be necessary.

Comment 9 Ian 'Hixie' Hickson 2013-09-11 18:37:32 UTC

(In reply to Larry Masinter from comment #8)
> RFC 2388 was clear:
>    Field names originally in non-ASCII character sets may be encoded
>    within the value of the "name" parameter using the standard method
>    described in RFC 2047.

"may" is what makes this not clear. It means that the above is one option, but what are the other options? What else can they do?

Specs should basically never say MAY or SHOULD when it comes to describing what they put on the wire.


> * those creating HTML forms 
>    SHOULD use ASCII field names, since deployed HTML processors vary,
>    and field names shouldn't be visible to the user anyway.

The goal on the HTML side is to have HTML processors not vary.


> * Those developing server infrastructure to read multipart/form-data uploads
>    SHOULD be aware of the varying behavior of the browsers in translating
>    non-ASCII field names, and look for any of the variants (if they're 
>    expecting non-ASCII field names). 

If the servers have to look for variants, we should define those variants.


> * Those developing browsers should migrate toward a standard 
>   encoding, but the server infrastructure will still have to do
>   fuzzy match for a long while.
>
> What should the browsers migrate to?

What do they do now? Presumably what they do now is the right answer.


> So I suppose having a name* value would be necessary.

I don't think adding new features here is viable. We should specify what most browsers do, and just stick with that. IMHO.

Comment 10 Larry Masinter 2013-09-15 22:06:56 UTC

in Comment 8:

> "may" is what makes this not clear

draft-masinter-multipart-form-data-00 (current revision
as of this note) doesn't use may or MAY

> The goal on the HTML side is to have HTML processors not vary.

=>  https://github.com/masinter/multipart-form-data/issues/8


That's a great goal for HTML, and this definition of
multipart/form-data shouldn't interfere with that goal.

> If the servers have to look for variants,
> we should define those variants.


> > What should the browsers migrate to?
>
> What do they do now? Presumably what they do now is the right answer.


draft-masinter-multipart-form-data fixes

Comment 11 Larry Masinter 2013-09-15 23:00:43 UTC

see https://github.com/masinter/multipart-form-data/ 

contains draft of RFC2388bis, plus proposed HTML spec (excerpted content).

Comment 12 Domenic Denicola 2016-04-08 23:54:08 UTC

I believe this might have been fixed in https://github.com/whatwg/html/pull/710. Larry, could you confirm that https://html.spec.whatwg.org/#multipart/form-data-encoding-algorithm correctly delegates to RFC 7578, and that RFC 7578 handles the cases discussed here?

From my reading I am not so sure... the HTML spec now says "Encode the (now mutated) form data set using the rules described by RFC 7578", but I can't find an algorithm in RFC 7578 that takes as input a form data set and gives as output a byte stream.

Comment 13 Anne 2016-04-28 11:25:37 UTC

How to encode names containing e.g., quotes, is still not defined. Apparently Chrome/WebKit uses percent-encoding to some extent and Firefox did not. See https://bugzilla.mozilla.org/show_bug.cgi?id=136676. I suppose at some point we'll need to define this format completely someplace.

Comment 14 Anne 2017-08-11 08:18:03 UTC

Another problem: do filenames get normalized? See https://bugzilla.mozilla.org/show_bug.cgi?id=695995. (Though this may also affect application/x-www-form-urlencoded I suspect.)

Comment 15 anforowicz 2017-08-23 17:18:03 UTC

Another problem: which characters are allowed in a mime multipart boundary?

For example - Chromium and WebKit restrict the allowed characters to a subset of what is allowed by RFC 2046.  The restriction helps achieve compatibility with some of the servers - in particular see the analysis in https://bugs.webkit.org/show_bug.cgi?id=13352#c29 which says that some servers cannot process boundaries that include the '/' character.

References:

1) https://crbug.com/575779#c10 which tracks the following TODO in the Chromium code https://chromium.googlesource.com/chromium/src/+/79420989569478d5b9a05e35a841a10d9d836cc4/net/base/mime_util.cc#592 :

    // Characters to be used for mime multipart boundary.
    //
    // TODO(rsleevi): crbug.com/575779: Follow the spec or fix the spec.
    // The RFC 2046 spec says the alphanumeric characters plus the
    // following characters are legal for boundaries:  '()+_,-./:=?
    // However the following characters, though legal, cause some sites
    // to fail: (),./:=+
    const char kMimeBoundaryCharacters[] =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

2) Equivalent code and comment in WebKit: https://github.com/WebKit/webkit/blob/d071f76012298b17327ca14981ca5ffdbd1621df/Source/WebCore/platform/network/FormDataBuilder.cpp#L79
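
For comparison, a boundary generator restricted to the same alphanumeric subset could be sketched in Python as follows. The prefix and length are assumptions chosen for illustration, not the values Chromium or WebKit use.

```python
import secrets
import string

# Same conservative alphabet as kMimeBoundaryCharacters above:
# digits and ASCII letters only.
BOUNDARY_ALPHABET = string.digits + string.ascii_letters

# Hypothetical prefix; hyphens are legal boundary characters in RFC 2046.
BOUNDARY_PREFIX = "----FormBoundary"


def make_boundary(length: int = 16) -> str:
    # Random suffix drawn only from the restricted alphabet.
    suffix = "".join(secrets.choice(BOUNDARY_ALPHABET) for _ in range(length))
    return BOUNDARY_PREFIX + suffix
```

With the default length the result is 32 characters, well under the 70-character limit RFC 2046 places on boundaries.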

Comment 16 Domenic Denicola 2019-03-29 21:49:01 UTC

https://github.com/whatwg/html/issues/3223 appears to be the best current tracking issue for this.