28476 – [SER 3.1]JSON serialization: escaping strings

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 28476 - [SER 3.1]JSON serialization: escaping strings

Status:	RESOLVED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Serialization 3.1
Version:	Last Call drafts
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	C. M. Sperberg-McQueen
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2015-04-12 21:50 UTC by Michael Kay
Modified:	2015-07-16 18:09 UTC
CC List:	2 users

Description Michael Kay 2015-04-12 21:50:04 UTC

Section 9 says:

* An atomic value [XP31] in the data model instance of any other type is serialized to a JSON string by outputting the result of applying the fn:string function to the item.

* A node in the data model instance is serialized as a JSON string by outputting the result of serializing the node using the method specified by the json-node-output-method parameter. If the json-node-output-method parameter is set to xml or xhtml then the node is serialized with the additional serialization parameter omit-xml-declaration set to yes.

In both cases it fails to mention the need to enclose the string in quotes, and the need to escape special characters (such as quotes) to make them legal JSON.

Less obviously, it fails to describe the detailed escaping rules.

Suitable rules can be found in the xml-to-json function:

* Any occurrence of backslash (\) is replaced by \\.

* Any occurrence of quotation mark, backspace, form-feed, newline, carriage return, or tab is replaced by \", \b, \f, \n, \r, or \t respectively, and any other codepoint in the range 1-31 or 127-159 is replaced by an escape in the form \uHHHH where HHHH is the hexadecimal representation of the codepoint value.
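The two rules above can be sketched in Python as follows. This is only an illustration of the quoted xml-to-json rules, not normative text; the helper name `escape_json_string` is hypothetical, and upper-case hex digits are an arbitrary choice.

```python
# Sketch of the xml-to-json escaping rules quoted above.
# escape_json_string is a hypothetical helper name, not part of any spec.
_SHORT_ESCAPES = {
    "\\": "\\\\",   # backslash
    '"': '\\"',     # quotation mark
    "\b": "\\b",    # backspace
    "\f": "\\f",    # form-feed
    "\n": "\\n",    # newline
    "\r": "\\r",    # carriage return
    "\t": "\\t",    # tab
}


def escape_json_string(value: str) -> str:
    """Escape `value` and enclose it in quotation marks."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in _SHORT_ESCAPES:
            out.append(_SHORT_ESCAPES[ch])
        elif 1 <= cp <= 31 or 127 <= cp <= 159:
            # Remaining control characters become \uHHHH.
            out.append(f"\\u{cp:04X}")
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'
```

The short escapes are checked first, so the `\uHHHH` branch only sees the control characters that have no single-letter form.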

I wonder if we should reconsider the rule in 9.1.3: "If the instance of the data model contains a character that cannot be represented in the encoding that the serializer is using for output, the serializer MUST signal a serialization error [err:SERE0008]." Would it not be friendlier to escape any such character? It seems reasonable to ask for JSON in US-ASCII encoding, with the intent that all non-ASCII characters should be represented using \u escape sequences.
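To illustrate the difference between the two behaviours, here is a rough analogy using Python's standard library. It is not an XSLT/XQuery serializer; a strict `str.encode` stands in for signalling a serialization error, and `json.dumps(..., ensure_ascii=True)` stands in for the escaping behaviour.

```python
import json

text = "caf\u00e9"  # contains U+00E9, which US-ASCII cannot represent

# Strict encoding: roughly analogous to raising a serialization error.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Escaping instead: non-ASCII characters are written as \uXXXX,
# so the result can always be encoded as US-ASCII.
escaped = json.dumps(text, ensure_ascii=True)
ascii_bytes = escaped.encode("ascii")
```

With escaping, a request for US-ASCII output always succeeds, and a JSON parser recovers the original characters from the `\u` sequences.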

Comment 1 Michael Kay 2015-04-12 21:51:18 UTC

Note that the escaping rules should also apply to the serialization of keys in maps.

Comment 2 Michael Kay 2015-04-12 22:46:28 UTC

Note that test Serialization-json-57 uses US-ASCII encoding for the output and currently allows either a serialization error or escaping of the non-ASCII character in the output. This doesn't appear to match the spec.

Comment 3 Michael Kay 2015-04-13 13:24:51 UTC

Also affects Serialization-json-42

Comment 4 Michael Kay 2015-04-13 13:37:54 UTC

I now see that section 4 clause 3(e) provides additional material on JSON escaping, and contradicts section 9.1.3 by saying that escaping IS applied to characters that cannot be represented in the chosen encoding.

However, section 4 clause 3 is still a little vague as to what strings it applies to. It starts: "Character expansion is concerned with the representation of characters appearing in text and attribute nodes and strings in the sequence." Normally I would interpret this as "instances of xs:string appearing as items in the sequence to be serialized". But the intended meaning is clearly more general than this: JSON escaping needs to be applied to anything that appears as a string in the output, whether or not it started life as a "string in the sequence". For example, it applies to strings within arrays or maps; to values of type untypedAtomic or anyURI or QName; and to the strings that result from serializing nodes.
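A minimal sketch of that broader reading: every value that ends up as a JSON string, including map keys and non-string atomic values, goes through the same escaping routine. The Python `dict`/`list` modelling of maps and arrays, the `AnyURI` class, and the function names are all assumptions made for illustration; node serialization and most atomic types are left out.

```python
class AnyURI:
    """Hypothetical stand-in for an xs:anyURI value."""

    def __init__(self, value: str):
        self.value = value

    def __str__(self) -> str:  # plays the role of fn:string
        return self.value


def quote(s: str) -> str:
    """Escape per the rules in the description and add quotation marks."""
    short = {"\\": "\\\\", '"': '\\"', "\b": "\\b", "\f": "\\f",
             "\n": "\\n", "\r": "\\r", "\t": "\\t"}
    parts = []
    for ch in s:
        cp = ord(ch)
        if ch in short:
            parts.append(short[ch])
        elif 1 <= cp <= 31 or 127 <= cp <= 159:
            parts.append(f"\\u{cp:04X}")
        else:
            parts.append(ch)
    return '"' + "".join(parts) + '"'


def serialize(item) -> str:
    if isinstance(item, dict):   # map: keys are escaped too
        members = (f"{quote(str(k))}:{serialize(v)}" for k, v in item.items())
        return "{" + ",".join(members) + "}"
    if isinstance(item, list):   # array: escaping applies to each member
        return "[" + ",".join(serialize(v) for v in item) + "]"
    if isinstance(item, bool):   # checked before int: bool is an int subclass
        return "true" if item else "false"
    if isinstance(item, int):
        return str(item)
    # Any other value is converted to a string first, then escaped.
    return quote(str(item))
```

The point of the design is that `quote` is the single exit for anything emitted as a string, regardless of where the value came from.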

Comment 5 Josh Spiegel 2015-04-13 14:38:59 UTC

Serialization-json-42 and Serialization-json-57 allow the serialization error (SESU0007) since implementations are not required to support US-ASCII. See section 9.1.3:

"The encoding parameter identifies the encoding that the JSON output method MUST use to convert sequences of characters to sequences of bytes. Serializers are REQUIRED to support values of UTF-8 and UTF-16. A serialization error [err:SESU0007] occurs if the serializer does not support the encoding specified by the encoding parameter."

These tests also allow the character to be escaped. From section 4, bullet 3(e):
"Escape according to the rules of the XML, HTML, or JSON output method, ... where JSON requires escaping, and any characters that cannot be represented in the selected encoding."

By the way, I raised the same concern as you about strings (see bug 27330).

Comment 6 John Snelson 2015-07-16 18:09:19 UTC

DECISION: Adopt MKay's proposal in the following email, adding encoding as the final step in the serialization process.
https://lists.w3.org/Archives/Member/w3c-xsl-query/2015Jul/0026.html
Also include the paragraph: "Any occurrence of quotation mark, backspace, form-feed, newline, carriage return, or tab is replaced by \", \b, \f, \n, \r, or \t respectively, and any other codepoint in the range 1-31 or 127-159 is replaced by an escape in the form \uHHHH where HHHH is the (upper- or lower-case) hexadecimal representation of the codepoint value. In addition, any character that cannot be encoded using the selected encoding is escaped using the \uHHHH notation."
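
A sketch of how the adopted paragraph might be implemented, with encoding as the final step. Several things here are assumptions rather than part of the decision text: the function names are hypothetical, the backslash rule is carried over from the description, upper-case hex is an arbitrary choice, and characters above U+FFFF are written as a surrogate pair because a single \uHHHH cannot hold them.

```python
_SHORT = {"\\": "\\\\", '"': '\\"', "\b": "\\b", "\f": "\\f",
          "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def _encodable(ch: str, encoding: str) -> bool:
    """Return True if `ch` can be represented in `encoding`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def _u_escape(cp: int) -> str:
    """Write a codepoint as \\uHHHH (a surrogate pair above U+FFFF: an assumption)."""
    if cp > 0xFFFF:
        cp -= 0x10000
        high = 0xD800 + (cp >> 10)
        low = 0xDC00 + (cp & 0x3FF)
        return f"\\u{high:04X}\\u{low:04X}"
    return f"\\u{cp:04X}"


def serialize_string(value: str, encoding: str) -> bytes:
    """Escape first, then encode to bytes as the last step."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in _SHORT:
            out.append(_SHORT[ch])
        elif 1 <= cp <= 31 or 127 <= cp <= 159:
            out.append(_u_escape(cp))
        elif not _encodable(ch, encoding):
            out.append(_u_escape(cp))
        else:
            out.append(ch)
    return ('"' + "".join(out) + '"').encode(encoding)
```

Because every unencodable character has already been replaced by ASCII escape text, the final `encode` call cannot fail for any encoding that can represent ASCII.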