This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
How should the following be handled?

(1) parse-json('["\uD834"]')
(2) parse-json('["\uD834"]a')
(3) parse-json('["\udD1E"]')
(4) parse-json('["a\udD1E"]')
(5) parse-json('["\uD834\uD834\udD1E"]')

I would guess that (1) and (3) invoke the fallback option, giving e.g. �. But would (2) and (4) consume the two characters as one badly encoded codepoint, or as two characters? That is, is the result �a or just �?
The rules state:

The function is called when the JSON input contains a special character (as defined under the escape option) that is valid according to the JSON grammar, whether the special character is represented in the input directly or as an escape sequence. The function is called once for any surrogate that is not properly paired with another surrogate. The string supplied as the argument will always be a two- or six-character escape sequence, starting with a backslash, that conforms to the rules in the JSON grammar.

This seems pretty clear to me. You process the input one nibble at a time, where a nibble is a character or an escape sequence introduced by "\". If you hit a high surrogate that isn't followed by a low surrogate, you emit FFFD and move on to the next nibble. If you hit a low surrogate that isn't preceded by a high surrogate, you emit FFFD and move on to the next nibble.
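The unit-at-a-time processing described above could be sketched roughly as follows. This is only an illustration in Python, not the spec's algorithm or any processor's implementation. The names decode_json_string_body and fallback are hypothetical, the input is assumed to be the already-validated content between the quotes of a JSON string, and only the surrogate part of the fallback rule is modelled (other special characters and the escape option are ignored). The default fallback returning U+FFFD is likewise an assumption of the sketch.

```python
import re

REPLACEMENT = "\ufffd"

# One unit is a \uXXXX escape, a two-character escape, or one literal character.
_UNIT = re.compile(r'\\u([0-9a-fA-F]{4})|\\(["\\/bfnrt])|(.)', re.DOTALL)
_SIMPLE = {'"': '"', '\\': '\\', '/': '/', 'b': '\b',
           'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}


def _is_high(cp):
    return 0xD800 <= cp <= 0xDBFF


def _is_low(cp):
    return 0xDC00 <= cp <= 0xDFFF


def decode_json_string_body(body, fallback=lambda escape: REPLACEMENT):
    """Decode the content of a JSON string (hypothetical helper).

    The fallback is called once per unpaired surrogate, with the
    six-character escape sequence as its argument.
    """
    units = list(_UNIT.finditer(body))
    out = []
    i = 0
    while i < len(units):
        m = units[i]
        if m.group(1) is not None:
            cp = int(m.group(1), 16)
            if _is_high(cp):
                nxt = units[i + 1] if i + 1 < len(units) else None
                if nxt is not None and nxt.group(1) is not None:
                    low = int(nxt.group(1), 16)
                    if _is_low(low):
                        # Properly paired: combine into one codepoint.
                        out.append(chr(0x10000 + ((cp - 0xD800) << 10)
                                       + (low - 0xDC00)))
                        i += 2
                        continue
                # High surrogate not followed by a low surrogate.
                out.append(fallback(m.group(0)))
            elif _is_low(cp):
                # Low surrogate not preceded by a high surrogate.
                out.append(fallback(m.group(0)))
            else:
                out.append(chr(cp))
        elif m.group(2) is not None:
            out.append(_SIMPLE[m.group(2)])
        else:
            out.append(m.group(3))
        i += 1
    return "".join(out)
```

Under these assumptions, and reading case (2) as the string body \uD834a, an unpaired surrogate consumes only its own escape sequence: the following character is kept as a separate character, and in case (5) the first \uD834 is replaced while the second pairs with \udD1E.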
Agreed. Thanks.