This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html
Multipage: http://www.whatwg.org/C#multipart-form-data
Complete: http://www.whatwg.org/c#multipart-form-data
Comment: The specification is unclear about how field names should be encoded. In particular, what should be done if they include special characters (e.g. quotes, new lines, unicode, etc.)? I started a mailing list thread on this issue...
Posted from: 74.66.64.60
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5
Argh; whoops. Sorry for the bugzilla spam. I didn't realize that the "comment" thingy just filed a bugzilla bug.

HTML5 states: "Encode the (now mutated) form data set using the rules described by RFC 2388". However, it then modifies the rules:

"The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified. Their names and values must be encoded using the character encoding selected above (field names in particular do not get converted to a 7-bit safe encoding as suggested in RFC 2388)."
http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html#multipart-form-data

So the problem is: what are we supposed to do with field names? In particular, what if they contain "special" MIME characters (e.g. \r\n newlines, backslashes, double quotes, or semicolons)? Different browsers do different things, meaning that server code currently must detect the browser to do the right thing.

Example:
<input name='bàz%22\"\' value="foo">

Firefox 13b:
Content-Disposition: form-data; name="bàz%22\\"\"

Webkit nightly:
Content-Disposition: form-data; name="bàz%22\%22\"

Firefox backslash-escapes double quotes, but it fails to escape backslashes. This means its header fails to parse according to the MIME specification (it sort of decodes as bàz%22\ with an extra trailing \").

Webkit %-escapes the double quotes, but does not %-escape the percent sign. Thus the above form control could be either name='bàz"\"\' or the desired name.

Webkit has a bug open on this issue, asking for specification guidance: https://bugs.webkit.org/show_bug.cgi?id=62107

HTML5 should specify exactly how field names are encoded. Some potential solutions:

1) Bless Firefox's backslash quoting rules (they are very weird but I think they are unambiguous?). This means Webkit POSTs will be decoded to the wrong field names, and POSTs to older servers may parse incorrectly if the name includes a \ (but that must already happen for Firefox?).

2) Bless Webkit's percent escaping rules (ideally also escaping %). Servers that strictly parse this format will fail to parse Firefox POSTs if the name includes a \, and will

3) Adopt RFC 6266's approach of having two name parameters when there are special characters: one with the existing escaping, and one with an unambiguously escaped version. Ideally, existing servers will parse the first name and not break unless the form value contains a special character. As servers are upgraded, they will be able to unambiguously parse the new header. See: http://tools.ietf.org/html/rfc6266

Aside: The *same* issue happens for uploaded file names.

I started a mailing list thread to attempt to collect more information about this: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-May/035610.html
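To make the ambiguity in this comment concrete, here is a minimal Python sketch of the two escaping behaviors as reported above. The function names are hypothetical, and each function is only a reconstruction from the single example header quoted in this comment, not the browsers' actual code. The third function illustrates the "ideally also escaping %" variant of solution 2.

```python
def escape_backslash_style(name: str) -> str:
    """Backslash-escape double quotes only; backslashes are left untouched.

    Assumption: reconstructed from the Firefox 13b header quoted above.
    """
    return name.replace('"', '\\"')


def escape_percent_style(name: str) -> str:
    """Percent-escape double quotes only; '%' itself is left untouched.

    Assumption: reconstructed from the WebKit nightly header quoted above.
    """
    return name.replace('"', "%22")


def escape_percent_strict(name: str) -> str:
    """Percent-escape '%' first, then double quotes, so decoding is unambiguous.

    Hypothetical variant of solution 2 ("ideally also escaping %").
    """
    return name.replace("%", "%25").replace('"', "%22")


def header(escaped_name: str) -> str:
    """Build the Content-Disposition line for a non-file field."""
    return f'Content-Disposition: form-data; name="{escaped_name}"'


# The field name from the example: b à z % 2 2 \ " \
NAME = 'bàz%22\\"\\'
# A different field name that collides with NAME under percent-style escaping.
OTHER = 'bàz"\\"\\'

if __name__ == "__main__":
    print(header(escape_backslash_style(NAME)))
    print(header(escape_percent_style(NAME)))
    # Two distinct names, one wire representation:
    print(escape_percent_style(NAME) == escape_percent_style(OTHER))
    # Escaping '%' as well removes the collision:
    print(escape_percent_strict(NAME) == escape_percent_strict(OTHER))
```

With percent-style escaping a server cannot tell NAME and OTHER apart, which is the decoding problem described for Webkit; the strict variant keeps them distinct at the cost of changing what existing servers receive.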
This bug was cloned to create bug 18135 as part of operation convergence.
Larry, any chance RFC 2388 will get updated to resolve this issue?
Dropped a mail to Larry, we'll see what he says.
http://www.ietf.org/mail-archive/web/apps-discuss/current/msg08908.html I'm marking this with the same milestone as other form-related stuff, but I doubt I'll actually do this in the HTML spec. Any volunteers want to write this up as a new spec? See the e-mail above if you want to do this in the IETF space, or contact me on IRC if you want to do it in the WHATWG space, I'm sure either way you'll find people eager to help you.
*** Bug 19879 has been marked as a duplicate of this bug. ***
RFC 2388 was clear:

  Field names originally in non-ASCII character sets may be encoded within the value of the "name" parameter using the standard method described in RFC 2047.

For reasons I don't understand, browsers did different, incompatible things. I think the main advice is:

* Those creating HTML forms SHOULD use ASCII field names, since deployed HTML processors vary, and field names shouldn't be visible to the user anyway.
* Those developing server infrastructure to read multipart/form-data uploads SHOULD be aware of the varying behavior of the browsers in translating non-ASCII field names, and look for any of the variants (if they're expecting non-ASCII field names).
* Those developing browsers should migrate toward a standard encoding, but the server infrastructure will still have to do fuzzy match for a long while.

What should the browsers migrate to? http://www.rfc-editor.org/rfc/rfc5987.txt seems like a more recent proposal and possibly implemented in HTTP anyway.

Sites that use non-ASCII field names and want to work with multiple browsers already have to do fuzzy matching. The problem is that the fuzzy matchers already deployed might not recognize any *NEW* encodings. So I suppose having a name* value would be necessary.
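For illustration, a `name*` value in the RFC 5987 style could look like the output of the following Python sketch. The helper names are hypothetical, and a `name*` parameter for multipart/form-data is only what this comment floats, not something any specification cited in this bug defines. The sketch assumes the RFC 5987 `ext-value` form `charset'language'value` with an empty language tag.

```python
from urllib.parse import quote

# RFC 5987 attr-char beyond what quote() already leaves unescaped.
# quote() never escapes ASCII letters, digits, and "_.-~".
ATTR_CHAR_EXTRA = "!#$&+^`|"


def rfc5987_ext_value(text: str, charset: str = "UTF-8") -> str:
    """Encode text as an RFC 5987 ext-value with an empty language tag."""
    encoded = quote(text, safe=ATTR_CHAR_EXTRA, encoding=charset)
    return f"{charset}''{encoded}"


def content_disposition_with_name_star(name: str) -> str:
    """Hypothetical header carrying only the extended parameter."""
    return f"Content-Disposition: form-data; name*={rfc5987_ext_value(name)}"


if __name__ == "__main__":
    # The field name used earlier in this bug: b à z % 2 2 \ " \
    print(content_disposition_with_name_star('bàz%22\\"\\'))
```

Because `%`, `"` and `\` are all percent-encoded, the value round-trips unambiguously, but servers deployed today would not know to look for `name*`, which is the migration concern raised above.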
(In reply to Larry Masinter from comment #8)
> RFC 2388 was clear:
> Field names originally in non-ASCII character sets may be encoded
> within the value of the "name" parameter using the standard method
> described in RFC 2047.

"may" is what makes this not clear. It means that the above is one option, but what are the other options? What else can they do? Specs should basically never say MAY or SHOULD when it comes to describing what they put on the wire.

> * those creating HTML forms
> SHOULD use ASCII field names, since deployed HTML processors vary,
> and field names shouldn't be visible to the user anyway.

The goal on the HTML side is to have HTML processors not vary.

> * Those developing server infrastructure to read multipart/form-data uploads
> SHOULD be aware of the varying behavior of the browsers in translating
> non-ASCII field names, and look for any of the variants (if they're
> expecting non-ASCII field names).

If the servers have to look for variants, we should define those variants.

> * Those developing browsers should migrate toward a standard
> encoding, but the server infrastructure will still have to do
> fuzzy match for a long while.
>
> What should the browsers migrate to?

What do they do now? Presumably what they do now is the right answer.

> So I suppose having a name* value would be necessary.

I don't think adding new features here is viable. We should specify what most browsers do, and just stick with that. IMHO.
in Comment 8:
> "may" is what makes this not clear

draft-masinter-multipart-form-data-00 (current revision as of this note) doesn't use may or MAY

> The goal on the HTML side is to have HTML processors not vary.

=> https://github.com/masinter/multipart-form-data/issues/8
That's a great goal for HTML, and this definition of multipart/form-data shouldn't interfere with that goal.

> If the servers have to look for variants, we should define those variants. What should the browsers migrate to? What do they do now? Presumably what they do now is the right answer.

draft-masinter-multipart-form-data fixes
See https://github.com/masinter/multipart-form-data/ which contains a draft of RFC2388bis, plus a proposed HTML spec (excerpted content).
I believe this might have been fixed in https://github.com/whatwg/html/pull/710. Larry, could you confirm that https://html.spec.whatwg.org/#multipart/form-data-encoding-algorithm correctly delegates to RFC 7578, and that RFC 7578 handles the cases discussed here? From my reading I am not so sure... the HTML spec now says "Encode the (now mutated) form data set using the rules described by RFC 7578", but I can't find an algorithm in RFC 7578 that takes as input a form data set and gives as output a byte stream.
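To show what kind of algorithm is being asked for here, the following Python sketch maps a list of (name, value) string entries to a multipart/form-data byte stream. It is not taken from RFC 7578 or the HTML spec. The function and parameter names are hypothetical, only non-file fields are handled, and name escaping is deliberately left as a pluggable `escape_name` callback because that step is exactly what this bug says is undefined. Per the HTML text quoted earlier in this bug, non-file parts carry no Content-Type header and names and values use the selected character encoding.

```python
from typing import Callable, Iterable, Tuple


def encode_form_data(
    entries: Iterable[Tuple[str, str]],
    boundary: str,
    charset: str = "utf-8",
    escape_name: Callable[[str], str] = lambda name: name,
) -> bytes:
    """Serialize non-file form entries as a multipart/form-data body.

    escape_name is a hypothetical hook for the undefined name-escaping step;
    the default performs no escaping at all.
    """
    body = bytearray()
    for name, value in entries:
        # Each part starts with the dash-boundary delimiter line.
        body += f"--{boundary}\r\n".encode("ascii")
        # Non-file fields: Content-Disposition only, no Content-Type header.
        disposition = f'Content-Disposition: form-data; name="{escape_name(name)}"\r\n'
        body += disposition.encode(charset)
        # Blank line separates the part headers from the part body.
        body += b"\r\n"
        body += value.encode(charset) + b"\r\n"
    # The closing delimiter ends the whole body.
    body += f"--{boundary}--\r\n".encode("ascii")
    return bytes(body)


if __name__ == "__main__":
    print(encode_form_data([("greeting", "héllo")], "XyZ"))
```

Passing a different `escape_name` (for example one of the browser behaviors discussed above) changes only the header line, which makes it easy to compare candidate rules against real server parsers.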
How to encode names containing, e.g., quotes is still not defined. Apparently Chrome/WebKit uses percent-encoding to some extent and Firefox did not. See https://bugzilla.mozilla.org/show_bug.cgi?id=136676. I suppose at some point we'll need to define this format completely someplace.
Another problem: do filenames get normalized? See https://bugzilla.mozilla.org/show_bug.cgi?id=695995. (Though I suspect this may also affect application/x-www-form-urlencoded.)
Another problem: which characters are allowed in a MIME multipart boundary? For example, Chromium and WebKit restrict the allowed characters to a subset of what is allowed by RFC 2046. The restriction helps achieve compatibility with some servers; in particular, see the analysis in https://bugs.webkit.org/show_bug.cgi?id=13352#c29 which says that some servers cannot process boundaries that include the '/' character.

References:

1) https://crbug.com/575779#c10 which tracks the following TODO in the Chromium code https://chromium.googlesource.com/chromium/src/+/79420989569478d5b9a05e35a841a10d9d836cc4/net/base/mime_util.cc#592 :

    // Characters to be used for mime multipart boundary.
    //
    // TODO(rsleevi): crbug.com/575779: Follow the spec or fix the spec.
    // The RFC 2046 spec says the alphanumeric characters plus the
    // following characters are legal for boundaries: '()+_,-./:=?
    // However the following characters, though legal, cause some sites
    // to fail: (),./:=+
    const char kMimeBoundaryCharacters[] =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

2) Equivalent code and comment in WebKit: https://github.com/WebKit/webkit/blob/d071f76012298b17327ca14981ca5ffdbd1621df/Source/WebCore/platform/network/FormDataBuilder.cpp#L79
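The gap between the RFC 2046 grammar and the restricted alphabet quoted above can be sketched in Python as follows. The function names and the default length of 16 are hypothetical, and this is not how Chromium or WebKit actually generate boundaries. The validity check assumes the RFC 2046 rule of 1 to 70 characters drawn from the legal set, with space allowed except as the last character.

```python
import secrets
import string

# Alphanumeric-only alphabet, matching the kMimeBoundaryCharacters string above.
ALNUM = string.digits + string.ascii_lowercase + string.ascii_uppercase
# Full RFC 2046 set without space: alphanumerics plus '()+_,-./:=?
RFC2046_BCHARS_NOSPACE = ALNUM + "'()+_,-./:=?"


def make_conservative_boundary(length: int = 16) -> str:
    """Random boundary using only characters that reportedly never break servers."""
    return "".join(secrets.choice(ALNUM) for _ in range(length))


def is_valid_rfc2046_boundary(boundary: str) -> bool:
    """Check a boundary against the RFC 2046 grammar (1-70 chars, no trailing space)."""
    if not 1 <= len(boundary) <= 70:
        return False
    if boundary.endswith(" "):
        return False
    return all(c in RFC2046_BCHARS_NOSPACE or c == " " for c in boundary)


if __name__ == "__main__":
    b = make_conservative_boundary()
    print(b, is_valid_rfc2046_boundary(b))
    # Legal per RFC 2046, yet the kind of boundary some servers reportedly reject:
    print(is_valid_rfc2046_boundary("abc/def"))
```

Every conservative boundary passes the RFC 2046 check, while the reverse does not hold, so a specification that pins the restricted alphabet would stay compatible with conforming parsers.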
https://github.com/whatwg/html/issues/3223 appears to be the best current tracking issue for this.