29216 2015-10-21 09:47:42 +0000 JSON Conversion: Handling of surrogate pairs 2016-03-22 10:03:53 +0000 1 1 1 Unclassified XPath / XQuery / XSLT Functions and Operators 3.1 Candidate Recommendation PC Windows NT CLOSED FIXED P2 normal --- 1 christian.gruen mike public-qt-comments oldest_to_newest

123816 0 christian.gruen 2015-10-21 09:47:42 +0000

I believe that the parsing of surrogate pairs in the JSON conversion process needs some clarification.

In the current "escape" option rules for fn:parse-json and fn:json-to-xml, it is only implied that surrogate pairs need to be considered as well: "(for example, unpaired surrogates)", "This includes codepoints representing unpaired surrogates". But I am wondering what happens if a high surrogate is found that is not followed by a valid low surrogate. The following query...

  fn:parse-json('"\uD800\uD83C\uDC1C"', map { 'escape': true() })

might return one of the following results:

a) \uD800, followed by the surrogate pair for U+1F01C, or
b) \uD800\uD83C\uDC1C

Intuitively, I would expect a) to be correct: as \uD83C is not a valid low surrogate, it is not combined with the preceding high surrogate. b) would be correct if \uD83C were consumed as the low surrogate of \uD800; as a result, \uDC1C would then be invalid as well.

Any thoughts? Maybe the parsing of surrogate pairs is already standardized somewhere else (I couldn't find anything so far)?

123832 1 mike 2015-10-21 22:25:46 +0000

RFC 7159 section 8.2 says, pragmatically:

However, the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate). Instances of this have been observed, for example, when a library truncates a UTF-16 string without checking whether the truncation split a surrogate pair.
The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.

Since the JSON RFC says the effects of doing this kind of thing are unpredictable, I really don't think it's necessary that we pin it down any further than we do at the moment. I would also tend to expect your option (a), but I really don't think it matters greatly if the software does something else. Anyone who puts unpaired surrogates in their data deserves what they get.

123965 2 mike 2015-10-27 16:31:01 +0000

We decided to add a note to the effect: Unpaired surrogates don't cause an error, but the exact treatment might depend on the parsing algorithm used.
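
To make the difference between readings (a) and (b) from the original report concrete, here is a minimal Python sketch of two ways a parser might pair \uXXXX escapes. It is only an illustration, not how any XQuery processor is known to work. The names `code_units`, `render` and the `lookahead` flag are hypothetical. The sketch assumes the input consists only of \uXXXX escapes and that unpaired surrogates are kept in backslash-escaped form, as with 'escape': true().

```python
import re

# Matches one JSON \uXXXX escape and captures its four hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def code_units(json_string):
    # Hypothetical helper: collect the UTF-16 code units named by the
    # escapes; any other characters in the input are ignored.
    return [int(digits, 16) for digits in ESCAPE.findall(json_string)]


def render(units, lookahead):
    # lookahead=True  -> reading (a): a high surrogate is paired only if
    #                    the next unit really is a low surrogate.
    # lookahead=False -> reading (b): a high surrogate always consumes the
    #                    next unit; an invalid pair stays escaped as a whole.
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if is_high(unit) and nxt is not None and is_low(nxt):
            # Valid pair: combine into one supplementary code point.
            out.append(chr(0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00)))
            i += 2
        elif is_high(unit) and nxt is not None and not lookahead:
            # Invalid pair, but the second unit has already been consumed.
            out.append("\\u%04X\\u%04X" % (unit, nxt))
            i += 2
        elif is_high(unit) or is_low(unit):
            # Unpaired surrogate: keep it in escaped form.
            out.append("\\u%04X" % unit)
            i += 1
        else:
            out.append(chr(unit))
            i += 1
    return "".join(out)


units = code_units(r'"\uD800\uD83C\uDC1C"')
reading_a = render(units, lookahead=True)   # \uD800 followed by U+1F01C
reading_b = render(units, lookahead=False)  # \uD800\uD83C\uDC1C, all escaped
```

Both strategies agree on well-formed input and differ only on malformed sequences, which fits the agreed note that the exact treatment might depend on the parsing algorithm used.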