Bug 8245 - [Ser] Error for characters that are not permitted in HTML omits some control characters
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Serialization 1.0
Version: Recommendation
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Henry Zongaro
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL: http://www.w3.org/TR/xslt-xquery-seri...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-11-09 04:08 UTC by Henry Zongaro
Modified: 2010-06-29 13:53 UTC

Description Henry Zongaro 2009-11-09 04:08:39 UTC
According to section 7.3 of Serialization,[1] "Certain characters, specifically the control characters #x7F-#x9F, are legal in XML but not in HTML. It is a serialization error [err:SERE0014] to use the HTML output method when such characters appear in the instance of the data model. The serializer MUST signal the error."

The definition of the error in appendix B[2] repeats this with a slightly different formulation:  "It is an error to use the HTML output method when characters which are legal in XML but not in HTML, specifically the control characters #x7F-#x9F, appear in the instance of the data model."

It is true that the control characters #x7F through #x9F were the only characters permitted in XML 1.0 that were not permitted in HTML.  In addition, the control characters #x01 through #x1F, excepting #x09, #xA and #xD, are permitted in XML 1.1 (though only as character references), but not in HTML per the SGML declaration of HTML 4.[3]
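As a rough illustration of the ranges described above, the check a serializer would need before using the HTML output method might be sketched as follows. The helper name and the range table are hypothetical and are not taken from the Serialization specification or from any implementation; the code points are simply the ones listed in this report.

```python
# Hypothetical sketch, not part of any serializer API.
# Control characters legal in XML 1.0 and/or XML 1.1 but not permitted
# in HTML 4, using the ranges given in this bug report.
HTML_FORBIDDEN_RANGES = [
    (0x01, 0x08),  # XML 1.1 only
    (0x0B, 0x0C),  # XML 1.1 only
    (0x0E, 0x1F),  # XML 1.1 only
    (0x7F, 0x9F),  # XML 1.0 and XML 1.1
]


def forbidden_in_html(text):
    """Return (index, code point) pairs for characters not permitted in HTML."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if any(lo <= ord(ch) <= hi for lo, hi in HTML_FORBIDDEN_RANGES)
    ]
```

A serializer following section 7.3 would presumably signal err:SERE0014 whenever this list is non-empty for character data it is about to write with the HTML output method.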


I suggest the following corrections:

. In the third paragraph of section 7.3, change "specifically the control characters #x7F-#x9F, are legal in XML" to "specifically the control characters #x1-#x8, #xB, #xC, #xE-#x1F and #x7F-#x9F, are legal in one or both versions of XML, but not in HTML"

. In appendix B, in the definition of err:SERE0014, change "specifically the control characters #x7F-#x9F" to "specifically the control characters #x1-#x8, #xB, #xC, #xE-#x1F and #x7F-#x9F"


[1] http://www.w3.org/TR/xslt-xquery-serialization/#HTML_CHARDATA
[2] http://www.w3.org/TR/xslt-xquery-serialization/#ERRSERE0014
[3] http://www.w3.org/TR/html401/sgml/sgmldecl.html
Comment 1 Michael Kay 2009-11-09 09:30:58 UTC
Is this now a complete list? Will it always remain a complete list? Might it not be better to change the "specifically" to "such as"?
Comment 2 Henry Zongaro 2009-11-12 16:35:09 UTC
Yes, it's quite possible that an explicit enumeration of characters will become out of date.  I had worried about that, but I was also concerned that the list of proscribed characters in HTML is so obscure that simply saying "such as" wouldn't be of much help to either implementers or users.  (After seven years of experience with implementing XSLT, it took me about an hour to discover where the list appears.  I'd like to save others that pain.)

How would you feel about the following proposed edits, which list all the control characters, while still hedging by using "such as"?

. In the third paragraph of section 7.3, change "specifically the control characters #x7F-#x9F, are legal in XML" to "such as the control characters #x1-#x8, #xB, #xC, #xE-#x1F and #x7F-#x9F, are legal in one or both versions of XML, but not in HTML"

. In appendix B, in the definition of err:SERE0014, delete ", specifically the control characters #x7F-#x9F,"
Comment 3 Michael Kay 2009-11-12 16:41:21 UTC
That looks fine to me.
Comment 4 Henry Zongaro 2009-11-26 20:00:29 UTC
At its teleconference of 2009-11-12,[4] the WG suggested the wording proposed in comment #2 should be reworked to make it clear which control characters are permitted by which version of XML - particularly as many people will not be as familiar with the XML 1.1 Recommendation.  This is my revised proposal:

. In the third paragraph of section 7.3, change "Certain characters, specifically the control characters #x7F-#x9F, are legal in XML but not in HTML." to "Certain characters are legal in XML, but not in HTML -- for example, the control characters #x7F-#x9F, are legal in both XML 1.0 and XML 1.1, and the control characters #x1-#x8, #xB, #xC and #xE-#x1F are legal in XML 1.1, but none of these is permitted in HTML."

. In appendix B, in the definition of err:SERE0014, delete ", specifically the control characters #x7F-#x9F,"
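The distinction the WG asked for, namely which XML version permits which of these control characters, can be summarised in a small lookup. This is only a sketch under the wording proposed above; the function name is made up for illustration and covers just the control characters named in the revised text.

```python
# Hypothetical helper illustrating the revised wording; not a real API.
def xml_versions_permitting(cp):
    """Return the XML versions that permit code point cp, for the control
    characters named in the revised wording; an empty tuple otherwise."""
    if 0x7F <= cp <= 0x9F:
        return ("1.0", "1.1")
    if 0x01 <= cp <= 0x08 or cp in (0x0B, 0x0C) or 0x0E <= cp <= 0x1F:
        return ("1.1",)
    return ()
```

An empty result here only means the code point is not one of the examples in the proposed text; it says nothing about whether the character is legal in XML or HTML.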


[4] http://lists.w3.org/Archives/Member/w3c-xsl-wg/2009Nov/0028.html (Member-only link)
Comment 5 Henry Zongaro 2009-12-02 11:29:58 UTC
At the joint teleconference of the XQuery and XSL Working Groups of 2009-12-01,[5] the proposal in comment #4 was accepted.  As only a few members of the XSL WG were present on the call, I will bring the proposal back to that working group for final ratification.

[5] http://lists.w3.org/Archives/Member/w3c-xsl-query/2009Dec/0005.html (Member-only link)
Comment 6 Henry Zongaro 2009-12-02 18:57:24 UTC
[Revising the abstract.]
Comment 7 Henry Zongaro 2009-12-03 19:58:36 UTC
At its teleconference of 2009-12-03,[6] the XSL Working Group ratified the decision reported in comment #5.

This will be Serialization erratum SE.E15.

[6] http://lists.w3.org/Archives/Member/w3c-xsl-wg/2009Dec/0008.html