Re: Change proposal for issue 103, was: ISSUE-103 change proposal

Julian Reschke, Thu, 30 Sep 2010 15:23:15 +0200:
> On 30.09.2010 13:06, Leif Halvard Silli wrote:
>> Julian Reschke, Thu, 18 Mar 2010 13:09:36 +0100:
>> 
>>> "Note: Due to restrictions of the XML syntax, in XML the U+003C
>>> LESS-THAN SIGN (<) needs be escaped as well. Also, XML's whitespace
>>> characters -- U+0009 CHARACTER TABULATION (HT), U+000A LINE FEED
>>> (LF), U+000D CARRIAGE RETURN (CR) and U+0020 SPACE -- need to be
>>> escaped in order to prevent attribute-value normalization ([XML],
>>> Section 3.3.3)."
>> 
>> (This is a follow-up to my reply in the poll.)
> 
> For the record: I don't see a reply from you here: 
> <http://www.w3.org/2002/09/wbs/40318/issue-103-objection-poll/results>

That must be an error in the system. I got a WBS Mailer confirmation 
some minutes before the deadline of the poll. I hope that gets fixed.

>> To say that all XML white space characters have to be escaped, seems
>> more complicated than what is correct.
>> 
>> 1 #xA will, in CDATA attributes (and @srcdoc is CDATA) be
>>    normalized to #x20. Thus, if white space is significant, then
>>    #xA must be escaped. The same goes for #x9. But if it is not
>>    significant, then lack of escaping is no danger.
>> 2 when it comes to #xD, then it is in principle not
>>    regulated by Section 3.3.3. of XML 1.0 but by section 2.3:
>>    ]] all #xD characters literally present in an XML document are
>>       either removed or replaced by #xA [[
>>    Thus it is "a black sheep" which is generally treated as #xA.
>>    If one really needs to avoid the default of being treated as
>>    a non-escaped #xA, then it must be escaped.
>> 3 however, it is not true that one needs to escape U+0020, see
>>    Henri's last two comments in bug 9965 (against Polyglot spec).
>> ...
> 
> Well, we certainly wouldn't want to put all of this into the note. 
> Would saying "significant whitespace" address your concern?

Yes, that should work, I think, as long as you also remove #x20 from 
the list of characters that need to be escaped. (Both "&#x20;" and 
"<space>" get normalized to "<space>".)
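
To illustrate, here is a minimal sketch, assuming Python's 
xml.etree.ElementTree (which uses expat) as the XML processor and an 
attribute without a DTD declaration, so that it is treated as CDATA. The 
element name "e", the attribute name "a" and the helper attr_value are 
made up for the example.

```python
import xml.etree.ElementTree as ET


def attr_value(raw: str) -> str:
    """Return the value the parser reports for attribute 'a' written as `raw`.

    'e' and 'a' are hypothetical names; with no DTD the attribute is CDATA.
    """
    doc = '<e a="' + raw + '"/>'
    return ET.fromstring(doc).get("a")


# A literal space and an escaped space give the same value.
print(repr(attr_value("x y")))
print(repr(attr_value("x&#x20;y")))

# Literal LF, TAB and CR inside the attribute are each reported as a space.
print(repr(attr_value("x\ny")))
print(repr(attr_value("x\ty")))
print(repr(attr_value("x\ry")))
```

With this parser all five lines print 'x y', so escaping #x20 changes 
nothing, while literal #xA, #x9 and #xD are lost unless they are escaped.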

> Keep in mind that the advice is for people who already have a 
> character sequence, and need to figure out what to do in order to put 
> it into the attribute. At this point, it's not trivial to distinguish 
> between significant and insignificant anymore.

But perhaps it is rather trivial to tell whether it is part of an 
attribute or not?

If one can escape all #xA, #xD and #x9 without any side effects, then 
fine. I guess it will work, since it is parsed twice: first, &#xA; 
becomes normalized to <line-feed>, and then, in step 2, <line-feed> 
(if it is part of a CDATA attribute) becomes normalized to <space>.
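
That is only a guess. Section 3.3.3 of XML 1.0 lists character 
references and literal white space as separate cases, so the second step 
is worth checking against a real parser rather than assuming. A small 
sketch, again assuming Python's xml.etree.ElementTree as the processor; 
the names "e", "a" and parsed_attr are hypothetical:

```python
import xml.etree.ElementTree as ET


def parsed_attr(raw: str) -> str:
    """Parse <e a="..."/> and return the value reported for attribute 'a'."""
    return ET.fromstring('<e a="' + raw + '"/>').get("a")


# Print what the parser reports for each escaped white space character,
# so one can see whether a second normalization to <space> takes place.
for name, ref in [("LF", "&#xA;"), ("CR", "&#xD;"), ("TAB", "&#x9;")]:
    print(name, repr(parsed_attr("x" + ref + "y")))
```

If the printed values still contain the referenced character rather than 
a space, then escaping is what preserves it, and the escaped form is not 
normalized a second time.
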
-- 
leif halvard silli

Received on Thursday, 30 September 2010 14:42:52 UTC