6742 – pre-encoded form values should be restorable as submitted

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	CLOSED INVALID

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	All All

Importance:	P3 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Reported:	2009-03-27 03:54 UTC by Nick Levinson
Modified:	2010-10-04 14:47 UTC
CC List:	6 users

Attachments
illustration from email (11.33 KB, image/png) 2009-03-30 06:22 UTC, Nick Levinson
source for previous attachment's essential part (6.09 KB, image/png) 2009-04-06 03:29 UTC, Nick Levinson

Description Nick Levinson 2009-03-27 03:54:32 UTC

Suppose a submitting user types into an HTML form field the following sentence:

I am puzzled by %26.

None of this is dangerous, so none of this needs encoding, so an HTML-compliant user agent will submit the sentence unchanged to the next stage, which might result in simple storage in a database.

Now suppose the sentence is retrieved and decoded. If whether decoding is needed is determined by the presence of a percent-encoding or numeric entity reference and not by a separate flag or list, then the parser function should produce the following:

I am puzzled by &.

This may not be what the submitting user intended. If the submitting user is not very computer-savvy, they won't know about encoding at all and therefore won't anticipate it.

The human recipient, if also not very computer-savvy, won't know what happened and will think they are seeing literally what was entered. The resulting human-to-human conversation via HTML would likely be confusing, if communications don't break down entirely.

I propose that an option exist to permit accurate restoration. The option should be asserted in the HTML markup for the form or in the document head. The user agent should then be responsible for either suspending or delaying all encoding or for delimiting or listing passages that are not encoded by the user agent even if they already appear in encoded style.

This is in reference to section 4.10.15 of the Working Draft of February 12, 2009, including 4.10.15.3, step 6, substeps 1 and 2, or any successor provisions.

Since safety requires encoding, it must be performed at some stage. Therefore, if assertion of the option suspends or delays all encoding, that would be to permit the website author to provide a method that when executed results in safety, i.e., in no need for further encoding. The website author might, for example, devise a method that lists passages not to be changed when restored later. If the method is not executed or it does not produce the result required for safety, the user agent would then encode as it must now.

A delimiter can be any string that does not appear elsewhere in the submission. The number of possible delimiters is infinite; e.g., if j, jj, and jjj are in the submission, then jjjj might qualify as the delimiter. The delimiter would not have to be the same in all submissions using that form, that page, that website, or that user agent, as long as a field or value is added to the form to declare the delimiter for the submission. The existence or absence of a delimiter declaration in a submission would signal the parser function as to what to do.
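The delimiter idea described above can be sketched as follows. This is only an illustration of the proposal as worded in this comment; the function name pick_delimiter and the seed parameter are hypothetical and come from no specification.

```python
def pick_delimiter(submission: str, seed: str = "j") -> str:
    """Return a string built from `seed` that does not occur in `submission`."""
    if not seed:
        raise ValueError("seed must be a non-empty string")
    candidate = seed
    # Keep lengthening the candidate until it no longer appears anywhere.
    # This terminates because the candidate eventually outgrows the input.
    while candidate in submission:
        candidate += seed
    return candidate


# Mirrors the example in the text: j, jj and jjj are taken, so jjjj qualifies.
print(pick_delimiter("j jj jjj"))
```

Under this scheme the chosen delimiter would also have to travel with the submission in a separate field, as the comment notes.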

This leaves open whether the website author or the user agent is responsible for a method once the option for accurate restoration is asserted and a method is needed. One is responsible, and as a fallback it will be the user agent, as now. If there's a possibility for creativity or variety in method, then the website author should be allowed to interject a method, such as a third-party-developed method.

Thank you.

--
Nick

Comment 1 Nick Levinson 2009-03-27 04:41:59 UTC

Correcting my error in setting Component menu. Don't know how this affects assignment, etc.

-- 
Nick

Comment 2 Ian 'Hixie' Hickson 2009-03-27 06:55:34 UTC

"HTML5: The Markup Language" is this draft:
   http://www.w3.org/html/wg/markup-spec/
Is that what you meant? It doesn't seem to have a 4.10.15. This one does:
   http://www.whatwg.org/specs/web-apps/current-work/#form-submission-0
...so I presume you mean that.

I don't really understand what you mean, though. As far as I can tell, the spec covers all the submission and retrieval steps in an unambiguous manner (with one exception, out-of-encoding characters, but that's a separate issue). Are you sure the problem you describe isn't a server-side issue?

Comment 3 Boris Zbarsky 2009-03-27 16:28:00 UTC

> If whether decoding is needed is determined by the presence of a percent-
> encoding or numeric entity reference and not by a separate flag or list

Then the function that stores the data in the database needs to make sure to escape '%' signs in its input if the input is not expected to already be URI-escaped.

It sounds to me like the server-side form handler here is just buggy.
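A minimal sketch of the server-side bug being described, assuming a handler written in Python. The variable names are illustrative, and urllib.parse.quote_plus is used here only as a stand-in for what a browser sends; its set of unescaped characters differs slightly from the HTML5 draft's.

```python
from urllib.parse import quote_plus, unquote_plus

typed = "I am puzzled by %26."

# Stand-in for the browser: a literal "%" goes on the wire as "%25".
wire = quote_plus(typed)

# Correct handler: decode exactly once, then store the result verbatim.
stored = unquote_plus(wire)

# Buggy handler: treats the stored text as if it were still URL-encoded
# and decodes it a second time.
decoded_again = unquote_plus(stored)

print(wire)           # I+am+puzzled+by+%2526.
print(stored)         # I am puzzled by %26.
print(decoded_again)  # I am puzzled by &.
```

Only the second decode turns the user's "%26" into an ampersand, which is why the fault lies with the handler rather than with the submission format.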

Comment 4 Nick Levinson 2009-03-29 05:22:42 UTC

No, I was referring to <http://www.w3.org/TR/html5/> (presumably http://www.w3.org/TR/2009/WD-html5-20090212/). I didn't even know about the two you cited; I've now gone through them in some detail. (Accesses were Mar. 27-28, 2009.) The corresponding provisions in the documents you cited are in sections 4.10.15.4 through -.6 of the WHAT WG document (<http://www.whatwg.org/specs/web-apps/current-work/>, Draft Recommendation, 28 March 2009 (the same date appeared when accessed on the 27th)). The other doc (Editor's Draft (24 March 2009), accessed 27th) doesn't cover it.

For example, WHAT WG's section 4.10.15.4, step 6, substep 1, requires replacing a section symbol with &167; and substep 2 requires replacing an ampersand with %26, yielding %26167;. This will be reversed later, presumably first to recover an ampersand and then for a section symbol. But if the submitter originally entered literally &167; into the field that is now being encoded, then restoration will still result in a section symbol even though the submitter never typed one. The result will thus differ from what was submitted.

But it's not a good idea to turn encoding off or make it optional, because unencoded data streams present security risks. Instead, what's needed is a method for knowing that certain strings should not be reversed because they only look reversible but actually were not due to encoding by a user agent. Given such a method, encoding can still be completed within the UA, ensuring safety.

Two methods are available: listing and flagging strings not to be reversed.

A flag cannot be universal, as it, too, then becomes susceptible to the same problem described above. But a flag can be nonappearing in one string (such as a URL's query string) in which it will be used. For example, if pkXwq doesn't appear, then pkXwq can become that flag, placed both before and after the string that is not to be reversible and in a separate user-noneditable field that declares what specific string is the flag. For the next submission, jjt might be the flag.

The other method, listing, would construct a separate list of strings not to be reversed. If a string is reversible if it's in one location but not if it's in another, the location of the string not to be reversed would be in the list. If a string is reversible if it's in many locations but not if it's in one or a few particular locations, the locations of the many strings to be reversed would be given instead of the location/s of the string not to be reversed, the latter option intended to shorten the transmission length but requiring a T/F value to indicate that the list is of what to reverse rather than of what not to reverse.

Flagging is likely more efficient. Both require programming in an implementation but flagging probably makes a shorter URL than would listing.

On the other hand, is it possible that reversal is already selective and I just haven't seen a mechanism or specification for it? A database management system could handle it but would have to know when, and for that the sender's UA would have to be the informant. Have I missed something?

Thanks.

--
Nick

Comment 5 Nick Levinson 2009-03-30 06:22:28 UTC

Created attachment 675
illustration from email

Comment 6 Nick Levinson 2009-03-30 06:38:50 UTC

A real-life example came up. A person posted a hypothetical tag for discussion purposes. The post was in the W3C Bugzilla (at Bug 6746 Comment #5), which then emailed me with the text of the post. The posted string included a character entity, complete with ampersand and semicolon. When it arrived at my Yahoo email inbox, the entity had become a letter. I don't know whether the translation occurred during the Bugzilla's preparing the email to me or during Yahoo's presenting of the email to me, or elsewhere, but nothing in the email alerted me that it should not have translated because the original text was already in entity form.

Arguably, this isn't the best real-life example possible because it's not restoring to the original medium, such as a Web page formed from a submitter's comment after CGI acceptance, but the effect occurred in an email from a Web page and that's close enough.

Attached hereto is a picture of the Yahoo result (a cropped PNG screenshot enlarged 4-fold without interpolation in Gimp 2.2.7, saved as PNG, and named about-tag.png). Compare the tag from this email to the one in the bug report comment.

-- 
Nick

Comment 7 Ian 'Hixie' Hickson 2009-03-30 06:41:04 UTC

The latter is a bug in Yahoo! Mail. (I haven't read your earlier comments in detail yet, so I haven't commented on those yet.)

Comment 8 Henri Sivonen 2009-03-30 09:34:54 UTC

The problem isn't percent encoding. The problem is UA-performed NCR encoding when the submission charset is not UTF-8.

The only interoperable and backwards compatible way to avoid this problem is that the server 
1) Serves all its forms as UTF-8.
2) Does not expand NCRs, because strings that look like NCRs will always be user-entered and not UA-generated when the form was served as UTF-8.

All in all, any site that has forms should use UTF-8. Or more generally, all sites should use UTF-8.

(Migration to UTF-8 is about as easy as migration to any novel workaround for this old problem. And migration to UTF-8 works with existing UAs while novel workarounds would not.)
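The ambiguity described in this comment can be reproduced with a short sketch. It assumes Python's "xmlcharrefreplace" error handler as an approximation of the browser step that substitutes numeric character references for characters the form's charset cannot represent; the helper name browser_bytes is hypothetical.

```python
def browser_bytes(text: str, charset: str) -> bytes:
    """Approximate the UA step that swaps unencodable characters for NCRs."""
    return text.encode(charset, errors="xmlcharrefreplace")


typed_char = "\u0416"   # CYRILLIC CAPITAL LETTER ZHE, absent from windows-1252
typed_ncr = "&#1046;"   # a user literally typing those seven characters

# Legacy charset: both inputs collapse to the same bytes, so the server
# cannot tell which one the user actually typed.
legacy_a = browser_bytes(typed_char, "windows-1252")
legacy_b = browser_bytes(typed_ncr, "windows-1252")
print(legacy_a == legacy_b)

# UTF-8: every character is representable, so no NCR is ever generated
# and the two inputs stay distinct.
utf8_a = browser_bytes(typed_char, "utf-8")
utf8_b = browser_bytes(typed_ncr, "utf-8")
print(utf8_a == utf8_b)
```

With a UTF-8 form, any NCR-looking string in the submission must have come from the user, so the server can safely leave it unexpanded.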

Comment 9 Nick Levinson 2009-04-02 16:26:43 UTC

I think I like the preference for UTF-8. It's already preferred in HTML5. I recall a preference in general authoring practice for UTF-8 anyway (and I use it in websites). But, since that sounds like it would require and not just prefer UTF-8 in the HTML5 standard, maybe someone with a preference for another charset would like to weigh in on why not UTF-8, so concerns are recognized.

If UTF-8 is to be required, I would suggest that UTF-8 apply to the page. I would not suggest sitewide applicability given the possibility of mixing markup languages within a site, e.g., the practice of making sitemaps in XML though other pages be, say, HTML.

The issue isn't specific to percent-encoding vs. encoding into character references; any system of encoding would pass the same problem. The issue is what and when to encode, and so the solution has to be timed and must preserve pre-encoding.

I invited Yahoo's response, but so far I haven't heard. Do we know what specific bug Yahoo has? I don't think that generically having dereferenced a numeric character reference is their bug, unless we're saying that that particular NCR should not have been dereferenced or that any NCR in that context should not have been dereferenced, if we define the context.

-- 
Nick

Comment 10 Nick Levinson 2009-04-06 03:29:34 UTC

Created attachment 676
source for previous attachment's essential part

This is source code from the same email as was snipped and attached earlier, with syntax highlighting off, again in Firefox 1.0.4, again via a screen shot. Yahoo sent NCRs, my browser interpreted them to what they represent, and nothing in or with the email said not to.

Thanks.

-- 
Nick

Comment 11 Ian 'Hixie' Hickson 2009-06-28 01:51:59 UTC

If the server is interpreting strings like "&123;" as character escapes when using UTF-8, then that's a bug on the server. If the server is not using UTF-8 and expecting round-tripping of user-entered characters outside the declared encoding, then that's a bug on the server.

Other than those two cases, I don't understand the problem being described here. As far as I can tell, there is no ambiguity anywhere if you use UTF-8 and follow the spec properly.

Comment 12 Nick Levinson 2009-06-28 08:03:29 UTC

Consider a one-way communication, using UTF-8. The key parts of a server are the UA and I/O. UTF-8 is used here, for convenience and in case the plan for HTML 5 is to require UTF-8 everywhere relevant.

Say a form is used by a human to contact another human. Four parties take part:
-- the sending human;
-- the sending UA;
-- the receiving UA; and
-- the receiving human.

A communication as it should work follows:
-- Sending human types: I said "this & that."
-- Sending UA encodes to: I said %22this %26 that.%22
-- Receiving UA will decode from: I said %22this %26 that.%22
-- Receiving human reads: I said "this & that."

But suppose the sender, who we'll say is a programmer, types a percent-code into the original message, while typing quote marks as usual:
-- Sending human types: You inserted %26 on line 7. Like Pat said, "you shouldn't have."
-- Sending UA encodes to: You inserted %26 on line 7. Like Pat said, %22you shouldn't have.%22
-- Receiving UA will decode from: You inserted %26 on line 7. Like Pat said, %22you shouldn't have.%22
-- Receiving UA, without further information, assumes that %26 previously replaced an ampersand and so replaces it now with an ampersand.
-- Receiving human reads: You inserted & on line 7. Like Pat said, "you shouldn't have."

Result: The receiving human does not receive the message that was sent, but a different one. The receiving human could well reply, "I didn't insert &." The sending human might send a new message, "I didn't say you did. I said you inserted %26, and you shouldn't have. & would have been better." The receiving human will see, "I didn't say you did. I said you inserted &, and you shouldn't have. & would have been better.", and may reply, "What's the difference between & and &?"

This responds to sec. 4.10.16.4, step 6, substep 2, subsubstep 1, and sec. 8.2 of <http://www.w3.org/TR/html5/single-page/>, Working Draft 23 April 2009, as accessed 6-28-09. Sec. 8.2.4 appears relevant except that I couldn't find a subsection thereof that specifically governed percent-decoding, or I missed it; perhaps something should be added on the assumption that UA makers infer its existence anyway.

UTF-8 is recommended but not mandatory, thus a UA not using UTF-8 might not be a violation. See especially section 4.10.16.4, step 2; also, e.g., ". . . windows-1252 is recommended as a [fallback] default . . . ." (sec. 8.2.2.1, step 7), "User agents must at a minimum support the UTF-8 and Windows-1252 encodings, but may support more." (sec. 2.8), "The [meta element's] charset attribute specifies the character encoding used by the document. . . . If the attribute is present in an XML document, its value must be an ASCII case-insensitive match for the string 'UTF-8' (and the document is therefore required to use UTF-8 as its encoding)." (sec. 4.2.5), "Authors are encouraged to use UTF-8. Conformance checkers may advise against authors using legacy encodings." (sec. 4.2.5.5), and secs. 2.7.2-2.7.3 & 2.7.6. Thus, UTF-8 is not required for non-XML documents except as otherwise required.

Correcting a prior error of mine: Of the options of listing and flagging, if listing is chosen, and if one or more instances of a single representation are to be reversed to recover original strings but another one or more instances are to be left as they are, only the fewer instances would be listed to save on bandwidth, as long as T/F will flag whether the list is for reversing or preserving.

A use case is not limited to online conversations between programmers. This also applies to scholarly writing in which storage and transmission of a submission have to be highly accurate and paraphrasing of the "we know what was meant" variety may not be acceptable to content authors. Since even programmers who are expert in other languages having little to do with the Web, such as Cobol or PostScript, might have conversations like that hypothesized above, familiarity with the existence of HTML's percent-encoding should not be assumed even for programmers in general, thus adding to the use case.

Thank you.

--
Nick

Comment 13 Ian 'Hixie' Hickson 2009-06-28 08:33:39 UTC

> A communication as it should work follows:
> -- Sending human types: I said "this & that."
> -- Sending UA encodes to: I said %22this %26 that.%22
> -- Receiving UA will decode from: I said %22this %26 that.%22
> -- Receiving human reads: I said "this & that."

If by "Receiving UA" you mean the server, then this is correct.


> But suppose the sender, who we'll say is a programmer, types a percent-code
> into the original message, while typing quote marks as usual:
> -- Sending human types: You inserted %26 on line 7. Like Pat said, "you
> shouldn't have."
> -- Sending UA encodes to: You inserted %26 on line 7. Like Pat said, %22you
> shouldn't have.%22

This is incorrect. When sending the text, the "%" entered by the user must be encoded as %25, so the sending UA encodes to: You inserted %2526 on line 7. Like Pat said, %22you shouldn't have.%22


> This responds to sec. 4.10.16.4, step 6, substep 2, subsubstep 1, and sec. 8.2
> of <http://www.w3.org/TR/html5/single-page/>, Working Draft 23 April 2009, as
> accessed 6-28-09. Sec. 8.2.4 appears relevant except that I couldn't find a
> subsection thereof that specifically governed percent-decoding, or I missed it;
> perhaps something should be added on the assumption that UA makers infer its
> existence anyway.

Specifically, in section 4.10.16.4 URL-encoded form data, notice that in step 6.2.1 the "%" character is not included in the list of characters that are left unencoded, which means it must itself be encoded:

"If the character isn't in the range U+0020, U+002A, U+002D, U+002E, U+0030 .. U+0039, U+0041 .. U+005A, U+005F, U+0061 .. U+007A then replace the character with a string formed as follows: Start with the empty string, and then, taking each byte of the character when expressed in the selected character encoding in turn, append to the string a U+0025 PERCENT SIGN character (%) followed by two characters in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9) and U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z representing the hexadecimal value of the byte (zero-padded if necessary)."
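A literal reading of the quoted substep can be sketched as below. The function name encode_step and the UNRESERVED set are illustrative, not taken from the spec, and the sketch covers only this one substep: spaces are left as they are, and characters that the selected encoding cannot represent are assumed to have been dealt with by an earlier step.

```python
# Code points the quoted substep leaves untouched:
# space, "*", "-", ".", "_", digits, A-Z and a-z.
UNRESERVED = (
    {0x20, 0x2A, 0x2D, 0x2E, 0x5F}
    | set(range(0x30, 0x3A))
    | set(range(0x41, 0x5B))
    | set(range(0x61, 0x7B))
)


def encode_step(text: str, charset: str = "utf-8") -> str:
    """Percent-encode every character outside UNRESERVED, byte by byte."""
    out = []
    for ch in text:
        if ord(ch) in UNRESERVED:
            out.append(ch)
        else:
            # One "%XX" triplet per byte of the character in `charset`.
            out.extend(f"%{byte:02X}" for byte in ch.encode(charset))
    return "".join(out)


print(encode_step("You inserted %26 on line 7."))
print(encode_step("\u00a7", "utf-8"))
print(encode_step("\u00a7", "windows-1252"))
```

Because "%" (U+0025) is not in the untouched set, a typed "%26" leaves this step as "%2526" regardless of the charset; only the bytes of non-ASCII characters such as the section sign vary with the encoding.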


> UTF-8 is recommended but not mandatory, thus a UA not using UTF-8 might not be
> a violation.

The particular case you are mentioning is unaffected by the encoding used, the escaping of "%" happens in all encodings.

Comment 14 Nick Levinson 2009-06-28 09:32:52 UTC

Got it. (I tried two more tests, with "5%" and "5%25" in the sending human's text, and both behaved as you said they should.)

Thank you.

-- 
Nick

Comment 15 Maciej Stachowiak 2010-03-14 13:17:07 UTC

This bug predates the HTML Working Group Decision Policy.

If you are satisfied with the resolution of this bug, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

This bug is now being moved to VERIFIED. Please respond within two weeks. If this bug is not closed, reopened or escalated within two weeks, it may be marked as NoReply and will no longer be considered a pending comment.