This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 28479 - [Ser 3.1] Character Maps
Summary: [Ser 3.1] Character Maps
Status: RESOLVED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Serialization 3.1
Version: Last Call drafts
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: C. M. Sperberg-McQueen
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2015-04-13 09:54 UTC by Michael Kay
Modified: 2015-07-16 18:09 UTC

Description Michael Kay 2015-04-13 09:54:08 UTC
Some changes have occurred in the 3.0 and 3.1 specs regarding character maps, whose implications do not appear to have been fully thought through.

In 3.0, item-separator was added. There's no clear indication as to whether character-map substitution is applied before or after insertion of item separators. The closest we get is:

Character mapping is applied to the characters that actually appear in a text or attribute node in the instance of the data model, before any other serialization operations such as escaping or Unicode Normalization are applied.

which I propose we change to:

Character mapping is applied to the characters that actually appear in a text or attribute node in the instance of the data model, before any other serialization operations such as sequence normalization, escaping, or Unicode normalization are applied.

In 3.1, character mapping is applied to strings, as well as to text and attribute nodes. This change was presumably intended primarily for JSON, though it's not clear quite what the expected use case is. 

We say "If a character is mapped, then it is not subjected to XML or HTML escaping, nor to Unicode Normalization." I would think that for character maps to be useful with JSON, this should say "XML or HTML or JSON escaping" (indeed, the rest of the paragraph could be interpreted as implying this).

(We actually say thrice that it is not subjected to XML or HTML escaping. Presumably this is on the theory that "what I say three times is true").

I'm slightly worried that the extension of character maps to apply to strings causes a backwards incompatibility for the XML and HTML output methods. In XSLT 2.0 it wasn't possible for the XSLT processor to send a string to the serializer: only XML result trees were sent, which means any string would be turned into a text node which would be subject to character mapping. But XQuery 3.0 could certainly send a string to an XML or HTML output method, and if we accept that character mapping was supposed to happen before sequence normalization, then character maps would not be applied to the string.

Finally, for the JSON case, I think it's not quite precise enough to say that character mapping applies to "strings". We treat miscellaneous data types such as dates, times, anyURIs and untypedAtomics by conversion to strings: does it apply to these? Does it apply to the keys in maps as well as the values?

I would also point out that the idea of applying character maps early in the serialization pipeline, and then treating mapped characters differently from unmapped characters in later stages of the pipeline, is very messy from an implementation viewpoint. We're saddled with this for XML and HTML serialization, but do we really want to do this for JSON?
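
To make the implementation concern concrete, here is a minimal Python sketch of character expansion for a single XML text node. It is an illustration only, not the spec's algorithm: the function names, the reduced escape table, and the sample character map are all assumptions. It shows why a serializer has to remember which characters were mapped: a simple "map first, escape afterwards" pass would escape the replacement strings as well.

```python
# Reduced XML escape table, enough for the illustration (assumption).
XML_ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def expand_text(text: str, character_map: dict[str, str]) -> str:
    """Single pass: a mapped character is emitted verbatim and never escaped."""
    out = []
    for ch in text:
        if ch in character_map:
            out.append(character_map[ch])
        else:
            out.append(XML_ESCAPES.get(ch, ch))
    return "".join(out)


def naive_two_pass(text: str, character_map: dict[str, str]) -> str:
    """Map everything, then escape everything: the replacement text gets escaped too."""
    mapped = text.translate(str.maketrans(character_map))
    return "".join(XML_ESCAPES.get(ch, ch) for ch in mapped)


# Hypothetical character map: write U+00A0 as a raw entity reference.
cmap = {"\u00a0": "&nbsp;"}
sample = "a\u00a0< b & c"

print(expand_text(sample, cmap))     # a&nbsp;&lt; b &amp; c
print(naive_two_pass(sample, cmap))  # a&amp;nbsp;&lt; b &amp; c
```

The single-pass version keeps the mapped/unmapped distinction implicit in the loop; any design that separates the two phases has to carry that distinction through the pipeline in some other way.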
Comment 1 Josh Spiegel 2015-04-13 15:38:39 UTC
In section 4 (phases of serialization), I think it is clear that sequence normalization (if applicable) comes before character expansion. The item-separator is inserted during sequence normalization and character mapping is applied during character expansion. So why isn't it clear which should come first?

You said:
"if we accept that character mapping was supposed to happen before sequence normalization, then character maps would not be applied to the string."

The other thing that jumps to mind is the interaction with cdata-section-elements. When cdata-section-elements applies to an element, character mapping is skipped (see bullet (b) of character expansion).

I agree with your comment about strings and the JSON method (see bug 27330).
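
The ordering question can be illustrated with a small Python sketch. It models items as plain strings and sequence normalization as a simple join, which is a deliberate simplification; the function names and the sample character map are hypothetical. The two orders only differ when the item-separator contains a mapped character.

```python
def apply_character_map(text: str, character_map: dict[str, str]) -> str:
    """Replace each mapped character by its replacement string."""
    return "".join(character_map.get(ch, ch) for ch in text)


def normalize_then_map(items, item_separator, character_map):
    # Order described for section 4: separators are inserted first,
    # so they are part of the text that character mapping sees.
    joined = item_separator.join(items)
    return apply_character_map(joined, character_map)


def map_then_normalize(items, item_separator, character_map):
    # Alternative order: each item is mapped on its own,
    # and the separator is inserted afterwards, untouched.
    mapped = [apply_character_map(item, character_map) for item in items]
    return item_separator.join(mapped)


cmap = {",": ";"}          # hypothetical character map
items = ["a,b", "c"]

print(normalize_then_map(items, ",", cmap))  # a;b;c
print(map_then_normalize(items, ",", cmap))  # a;b,c
```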
Comment 2 Michael Kay 2015-04-13 16:21:58 UTC
OK, I wasn't looking at section 4. If that's the case, then the paragraph in section 11 (Character maps)

Character mapping is applied to the characters that actually appear in a text or attribute node in the instance of the data model, before any other serialization operations such as escaping or Unicode Normalization are applied.

is misleading, because (a) "the instance of the data model" seems to be saying "the input to the serializer" rather than "the output of sequence normalization", and (b) "any other serialization operations" is too broad.

The way the spec was written for XML/HTML, it was clear that character maps did NOT apply to markup generated during the "Markup Generation" phase. I think it is desirable that this principle should also apply for JSON; but it's not at all clear under the current spec whether it does; and indeed the inclusion of "strings" in the input to the character expansion phase muddies the water for the traditional XML/HTML methods as well.
Comment 3 John Snelson 2015-07-16 18:09:40 UTC
DECISION: Adopt MKay's proposal in the following email, adding encoding as the final step in the serialization process.
https://lists.w3.org/Archives/Member/w3c-xsl-query/2015Jul/0026.html
Also include the paragraph: "Any occurrence of quotation mark, backspace, form-feed, newline, carriage return, or tab is replaced by \", \b, \f, \n, \r, or \t respectively, and any other codepoint in the range 1-31 or 127-159 is replaced by an escape in the form \uHHHH where HHHH is the (upper- or lower-case) hexadecimal representation of the codepoint value. In addition, any character that cannot be encoded using the selected encoding is escaped using the \uHHHH notation."
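
As a reading aid, the quoted paragraph can be sketched in Python roughly as follows. This is not taken from the spec or from any implementation: the function names are hypothetical, upper-case hex digits are chosen arbitrarily (the text allows either case), and writing a non-BMP character as a surrogate pair of \uHHHH escapes is an assumption, since the paragraph does not say how such characters are represented. The sketch covers only the characters the paragraph names, so backslash handling is out of scope here.

```python
# Short escapes named in the quoted paragraph.
SHORT_ESCAPES = {
    '"': '\\"', "\b": "\\b", "\f": "\\f",
    "\n": "\\n", "\r": "\\r", "\t": "\\t",
}


def u_escape(ch: str) -> str:
    """Return the \\uHHHH form of a character (upper-case hex)."""
    cp = ord(ch)
    if cp > 0xFFFF:
        # Assumption: characters outside the BMP become a surrogate pair.
        cp -= 0x10000
        return "\\u%04X\\u%04X" % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
    return "\\u%04X" % cp


def encodable(ch: str, encoding: str) -> bool:
    """True if the character can be written in the selected encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def escape_json_string(value: str, encoding: str = "utf-8") -> str:
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])
        elif 1 <= cp <= 31 or 127 <= cp <= 159:
            out.append(u_escape(ch))
        elif not encodable(ch, encoding):
            out.append(u_escape(ch))
        else:
            out.append(ch)
    return "".join(out)


print(escape_json_string('say "hi"\n'))        # say \"hi\"\n
print(escape_json_string("caf\u00e9", "ascii"))  # caf\u00E9
```

Under the resolution above, a character that has been replaced by a character map would bypass this function entirely; only unmapped characters reach it.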