Re: Change proposal for issue 103, was: ISSUE-103 change proposal from Philip Taylor on 2010-03-24 (public-html@w3.org from March 2010)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Wed, 24 Mar 2010 10:49:14 +0000
To: Maciej Stachowiak <mjs@apple.com>
CC: public-html@w3.org
Message-ID: <4BA9EE2A.1060808@cam.ac.uk>

Maciej Stachowiak wrote:
> 
> On Mar 24, 2010, at 2:29 AM, Philip Taylor wrote:
> 
>> Maciej Stachowiak wrote:
>>> [...]
>>> Julian & Philip, how confident are you that the full set of 
>>> characters that need escaping is U+003C, U+000D, U+000A, U+0009 and 
>>> U+0020? Does & need to be escaped?
>>
>> It needs these characters "as well" as the ones already mentioned in 
>> the previous paragraph in the spec (quotes and &s).
>>
>> I can't think of any other characters that have particularly special 
>> behaviour, but what is the purpose of this note? If it is aimed at 
>> people writing software that emits XML syntax fragments given an 
>> arbitrary string of Unicode codepoints, attempting to tell them 
>> everything they need to know in order to serialise safely (i.e. 
>> without allowing the content to break their entire page), then it 
>> would probably also have to say that U+FFFE and U+FFFF and other 
>> characters in U+0000..U+001F aren't allowed, and that they must be 
>> encoded in the same character encoding as the rest of the document, etc.
>>
>> It seems silly to duplicate the XML spec in that much detail here - if 
>> someone's correctly implementing XML then they should already have an 
>> XML serialiser that deals with all these issues, and repeating the 
>> information here will be a source of bugs and a waste of time.
>>
>> If it's aimed at people writing XHTML by hand, telling them about 
>> common things to be careful of, it probably doesn't need to bother 
>> mentioning U+0020 because (as far as I can see) that only matters in 
>> obscure cases when the DTD has set srcdoc to be non-CDATA. But <iframe 
>> srcdoc> is not a useful feature when writing markup by hand - the use 
>> cases were things like sandboxing untrusted user comments, and the 
>> whole point is that people will write software to serialise these 
>> values, so it's not useful to give advice intended for hand-authoring.
> 
> What spec change (if any) would you recommend on this issue? (I'm not 
> sure from the above if you are arguing for a detailed note, a shorter 
> but partially incomplete note, no note, or something else.)

Mostly I'm just trying to provide information, not argue :-)

I'm happy with the current spec text - it tells XML authors to be 
careful, and they can use existing tools and documentation to work out 
exactly what to do.

I wouldn't like to entirely remove the note about XML, because authors 
may think the note about the HTML syntax applies to them too. I wouldn't 
like to remove the note about the HTML syntax, because it's simple and 
is sufficient for the case of writing software that safely embeds 
untrusted user input, so it helps authors use this feature correctly for 
its intended purpose.
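
To illustrate why the HTML-syntax case is simple, here is a minimal sketch. It assumes the HTML-syntax note amounts to wrapping the value in double quotes and escaping quotes and ampersands, as mentioned earlier in this thread. The function names and the use of the sandbox attribute are illustrative assumptions, not spec text.

```python
def escape_srcdoc_html(text: str) -> str:
    """Escape a string for a double-quoted srcdoc attribute in HTML syntax.

    Assumption: only '&' and '"' need escaping in this case.
    """
    # Ampersands go first so the '&' in the '&quot;' we add is not re-escaped.
    return text.replace("&", "&amp;").replace('"', "&quot;")


def iframe_srcdoc(untrusted_html: str) -> str:
    """Hypothetical helper: build an <iframe> that embeds untrusted markup."""
    return '<iframe sandbox srcdoc="' + escape_srcdoc_html(untrusted_html) + '"></iframe>'


print(iframe_srcdoc('<a href="x">R&D</a>'))
```

The ordering of the two replacements is the only subtle point: escaping quotes before ampersands would turn the result into &amp;quot;.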

I wouldn't like text about XML that only gives half the details (e.g. 
doesn't say how to protect against well-formedness errors from invalid 
characters), because people are likely to think it's intended as a 
sufficient description when it's not.

I wouldn't like text about XML that gives the full details, because it 
would be long and unwieldy and a likely source of confusion (for authors 
and reviewers) - it's a complex enough topic that it seems better to 
have authors look at XML documentation for a full discussion.

If the text did attempt to give the full details for embedding untrusted 
input, I think it should cover the need to escape ", &, <, U+000D, 
U+000A, U+0009, U+0020, and the need to delete/replace U+FFFE and U+FFFF 
and other characters in U+0000..U+001F, and the need to encode it 
validly in the document's character encoding. I'm not certain that's 
all, but I can't think of any more right now.
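
The list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, disallowed characters are replaced with U+FFFD rather than deleted, and, like the list itself, it is not guaranteed to be complete. A real XML serialiser remains the safer choice.

```python
import re

# Characters from the list above that must be deleted or replaced:
# C0 controls other than tab/LF/CR, plus U+FFFE and U+FFFF.
_DISALLOWED = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")

# Characters from the list above that must be escaped in the attribute value.
_ESCAPES = {
    "&": "&amp;",
    '"': "&quot;",
    "<": "&lt;",
    "\r": "&#xD;",
    "\n": "&#xA;",
    "\t": "&#x9;",
    " ": "&#x20;",  # only matters if the DTD makes srcdoc non-CDATA
}


def escape_srcdoc_xml(text: str, replacement: str = "\ufffd") -> str:
    """Escape a string for a double-quoted srcdoc attribute in XML syntax."""
    cleaned = _DISALLOWED.sub(replacement, text)
    return "".join(_ESCAPES.get(ch, ch) for ch in cleaned)


def serialise_srcdoc_xml(text: str, encoding: str = "utf-8") -> bytes:
    """Escape, then encode in the document's encoding.

    Characters the encoding cannot represent become numeric character
    references via Python's built-in 'xmlcharrefreplace' error handler.
    """
    return escape_srcdoc_xml(text).encode(encoding, "xmlcharrefreplace")


print(escape_srcdoc_xml('<p a="b">&\n'))
```

Escaping every space is heavy-handed for the common CDATA case, but it keeps the sketch aligned with the full list.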

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Wednesday, 24 March 2010 10:49:44 UTC