This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
I believe that the parsing of surrogate pairs in the JSON conversion process needs some clarification. The current "escape" option rules for fn:parse-json and fn:json-to-xml only imply that surrogate pairs need to be considered as well: "(for example, unpaired surrogates)", "This includes codepoints representing unpaired surrogates". What they do not say is what happens if a high surrogate is found that is not followed by a valid low surrogate.

The following query:

fn:parse-json('"\uD800\uD83C\uDC1C"', map { 'escape': true() })

might return one of the following results:

a) \uD800, followed by the surrogate pair for U+1F01C, or
b) \uD800\uD83C\uDC1C

Intuitively, I would expect a) to be correct: as \uD83C is not a valid low surrogate, it is not combined with the preceding high surrogate \uD800, and it instead pairs with the following \uDC1C. b) would be correct if \uD83C were interpreted as the low surrogate of \uD800; as a result, \uDC1C would then be left unpaired and invalid as well.

Any thoughts? Maybe the parsing of surrogate pairs is already standardized somewhere else (I couldn't find anything so far)?
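To make interpretation a) concrete, here is a minimal Python sketch of one possible pairing rule. It is an illustration only, not the algorithm of any XQuery processor or of the specification: the function name combine_units is hypothetical, and the input is assumed to be the list of 16-bit code units already decoded from the \uXXXX escapes. A high surrogate is combined only with an immediately following low surrogate; any other surrogate is left as an escaped \uXXXX sequence, roughly as the 'escape': true() option would keep it.

```python
def is_high(unit):
    """True if the 16-bit code unit is a high (leading) surrogate."""
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    """True if the 16-bit code unit is a low (trailing) surrogate."""
    return 0xDC00 <= unit <= 0xDFFF


def combine_units(units):
    """Combine code units following interpretation a).

    A high surrogate is paired only with an immediately following low
    surrogate. Unpaired surrogates are emitted as literal backslash-u
    escapes instead of raising an error (an assumption of this sketch).
    """
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_high(unit) and i + 1 < len(units) and is_low(units[i + 1]):
            # Valid pair: compute the supplementary code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        elif is_high(unit) or is_low(unit):
            # Unpaired surrogate: keep it in escaped form.
            out.append("\\u%04X" % unit)
            i += 1
        else:
            # Ordinary BMP character.
            out.append(chr(unit))
            i += 1
    return "".join(out)


# The code units from the query above: \uD800 \uD83C \uDC1C
result = combine_units([0xD800, 0xD83C, 0xDC1C])
```

With this rule, \uD800 stays escaped and \uD83C\uDC1C becomes the single character U+1F01C. A parser following interpretation b) would instead consume \uD800\uD83C together and leave all three units unpaired.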
RFC 7159 section 8.2 says, pragmatically:

However, the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate). Instances of this have been observed, for example, when a library truncates a UTF-16 string without checking whether the truncation split a surrogate pair. The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.

Since the JSON RFC says the effects of doing this kind of thing are unpredictable, I really don't think it's necessary that we pin it down any further than we do at the moment. I would also tend to expect your option (a), but I really don't think it matters greatly if the software does something else. Anyone who puts unpaired surrogates in their data deserves what they get.
We decided to add a note to this effect: unpaired surrogates do not cause an error, but their exact treatment may depend on the parsing algorithm used.