A dialog on surrogate characters in XML

C. M. Sperberg-McQueen. Cambridge, Mass.; Sophia-Antipolis; Tokyo: World Wide Web Consortium, 2007.

Transcribed from an email to Chris Lilley.

21 March 2007. C. M. Sperberg-McQueen, with some help from my friends.

A friend writes: (Chris Lilley, "Bare surrogates in XML - must halt and catch fire?" Email to W3C XML Coordination Group and others, 7 March 2007.)

In XML 4th edition:

[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char. ] http://www.w3.org/TR/xml/#charsets

This makes it clear that potentially valid characters must be accepted. The character range is also clear:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Charmod is clear about bare surrogates:

Unicode contains some code points for internal use (such as noncharacters) or special functions (such as surrogate code points).

C079 [S] Specifications SHOULD NOT allow the use of codepoints reserved by Unicode for internal use. http://www.w3.org/TR/charmod/#C079

C078 [S] Specifications MUST NOT allow the use of surrogate code points. http://www.w3.org/TR/charmod/#C078

What is not clear is that XML specifically forbids bare surrogates (i.e., half of a surrogate pair). This came up in recent SVG WG discussions. Is the XML parser required to reject an XML document containing a bare surrogate? Would that be a well-formedness error, or some other sort of error?

This is my reply.

I believe the short answers are yes, and unspecified (but most processors are likely to treat it as a WF error).

But the short answers are imprecise.

To be more precise, let us consider an octet stream we receive, with respect to which we wish to ask "is it a well-formed XML document?" Let us suppose we recognize the octet stream as UTF-16, either by following the rules in the XML spec, or on account of an external label, or because an omniscient being, or just a being with particular knowledge of the case (such as the creator of the data stream, in this case me), has whispered "UTF-16" in our ear.

If we ask a UTF-16-savvy dump utility to show us the data, we might see this:

    003C 003F 0078 006D   < ? x m
    006C 0020 0076 0065   l   v e
    0072 0073 0069 006F   r s i o
    006E 003D 0022 0031   n = " 1
    002E 0030 0022 003F   . 0 " ?
    003E 000A 003C 0078   >   < x
    003E 0048 0069 002C   > H i ,
    D801 0020 004D 006F   .   M o
    006D 002E 003C 002F   m . < /
    0078 003E 000A        x >

(The assistance of Richard Ishida's Unicode Code Converter v4 is gratefully acknowledged. You don't think I translate this stuff by hand, do you? And if you think the converter is cool, check out his other utilities, too.)

So what we've got looks a lot like

    <?xml version="1.0"?>
    <x>Hi,* Mom.</x>

except that where the * appears in the lines above, we have the 16-bit value D801, which in a normal UTF-16 encoding would be half of a surrogate pair. We can ask several questions:

Is this a well-formed XML document?

What do you mean by "this"?

I mean the octet stream.

Octet streams are streams of bits. XML documents are sequences of characters. The question seems to embody a category error.

Who are you, Spock? Does this octet stream represent a well-formed XML document?

The term "represent" is fraught with difficulties; I think you must mean "encode".

Long pause. ... nine, ten.

OK. Does this octet stream encode a well-formed XML document?

Now the question is conceptually well-formed.

Pedant.

Hey, you ask a language-lawyer question, you get a language-lawyer answer. No, the octet stream doesn't encode a well-formed XML document.

Why not?

Because the octet stream does not encode a sequence of characters in the UTF-16 encoding. To encode a well-formed XML document, an octet stream must encode a sequence of characters which match the 'document' production from the XML spec and satisfy some other constraints, and which thus constitute a well-formed XML document. The 16-bit value D801, followed as it is here by 0020 rather than by a low surrogate, does not encode a character. The octet sequence is not UTF-16.
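That claim can be checked mechanically. What follows is only an illustrative sketch in Python, not part of the original exchange; the names `UNITS`, `find_unpaired_surrogates`, and `decodes_as_utf16` are invented here. It rebuilds the octet stream from the dump above as big-endian 16-bit units, looks for surrogate code units that lack a partner, and then hands the octets to Python's strict `utf-16-be` decoder.

```python
import struct

# The 16-bit values from the dump above, in order.
UNITS = [
    0x003C, 0x003F, 0x0078, 0x006D, 0x006C, 0x0020, 0x0076, 0x0065,
    0x0072, 0x0073, 0x0069, 0x006F, 0x006E, 0x003D, 0x0022, 0x0031,
    0x002E, 0x0030, 0x0022, 0x003F, 0x003E, 0x000A, 0x003C, 0x0078,
    0x003E, 0x0048, 0x0069, 0x002C, 0xD801, 0x0020, 0x004D, 0x006F,
    0x006D, 0x002E, 0x003C, 0x002F, 0x0078, 0x003E, 0x000A,
]


def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that are not part of a pair."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: legal only if a low surrogate follows directly.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no high surrogate in front of it.
            bad.append(i)
        i += 1
    return bad


def decodes_as_utf16(data):
    """True if the octets are accepted by the strict big-endian UTF-16 decoder."""
    try:
        data.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False


# Serialize the code units as big-endian octets (no byte order mark).
octets = struct.pack(">%dH" % len(UNITS), *UNITS)

print(find_unpaired_surrogates(UNITS))
print(decodes_as_utf16(octets))
```

The scan points at the unit D801, and the strict decoder refuses the stream, which is the sense in which the octets are not UTF-16 at all.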

What if I said it was encoded not in UTF-16 (which has defined the surrogate characters) but in UCS-2 (which doesn't define surrogate characters)?

I'd have to check the Unicode specs to be sure. Hold on ...

Wait, don't bother. Suppose I invented an encoding and called it x-myencoding and said that this sequence of octets is a legitimate encoding of a sequence of Unicode 1.0 characters, and D801 represents, er, encodes U+D801, or equivalently the Unicode 1.0 character whose integer value is 55297.

I don't think Unicode defines a character at that point. In fact, I'm pretty sure they say explicitly that there isn't one and can never be one.

Not in Unicode 1.0. Surrogates weren't added until later. Is it well-formed then?

No.

Why not?

Two reasons. First, by not including an encoding declaration, you implicitly claimed that the encoding was either UTF-8 or UTF-16, or else reliably given by an external authority. (You will have to read up on the current state of the various RFCs to get a chapter and verse account of when and where and how and why for all of this.) The external authority who whispered in my ear distinctly said UTF-16, not x-myencoding.

So if I added an encoding declaration would it be well-formed?

No. You told me that the octets in the relevant bit of the data stream encode the Unicode 1.0 characters whose integers are (in hex, I can't do decimal conversions on the fly) ... 002C, D801, 0020, ... I'm taking your word for it that the octet stream correctly encodes those characters. But production [2] of XML says clearly that the second of those characters, the one whose number is D801, is not a legal XML character. So if the octet stream is correctly recognized as being encoded in x-myencoding, then we have a sequence of characters but not a well-formed XML document.
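A rough sketch of that argument in Python, under stated assumptions: x-myencoding is hypothetical, so it is imitated here with Python's `utf-16-be` codec and the `surrogatepass` error handler, which lets a lone surrogate through as a code point of the same number. The function name `is_xml_char` is invented for the illustration; its ranges are copied from production [2].

```python
def is_xml_char(cp):
    """True if the code point cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# The relevant stretch of the stream: 002C, D801, 0020 as big-endian octets.
fragment = bytes.fromhex("002C D801 0020")

# Stand-in for the hypothetical x-myencoding: every 16-bit unit is taken
# as the code point with the same number, surrogates included.
chars = fragment.decode("utf-16-be", errors="surrogatepass")

# Collect the code points that production [2] does not allow.
illegal = [hex(ord(c)) for c in chars if not is_xml_char(ord(c))]
print(illegal)
```

Decoding succeeds this time, so there is a sequence of characters to talk about; it is the check against Char that fails, and only for the one numbered D801.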

What if I told you that I was wrong, earlier, when I said that x-myencoding treats D801 as an encoding of the Unicode 1.0 character whose integer is xD801?

I wouldn't be the least bit surprised.

What if I told you that D801 is recognized as a valid encoding of the character whose number is 33, i.e. x21?

That would be exclamation point.

So is the octet stream a well-formed XML docu— I mean, does the octet stream now repre— er, encode, a well-formed XML document?

You're telling me it encodes the sequence of characters whose conventional display form is

    <?xml version="1.0"?>
    <x>Hi,! Mom.</x>

That sequence of characters is indeed a well-formed XML document. I have to grant that, even if I deplore your choice of character encodings. And your English punctuation isn't too hot, either.
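For the skeptical, a small Python sketch of this last variant. Everything about x-myencoding is of course made up, and so is the decoder `decode_x_myencoding` below: it assumes big-endian 16-bit units that each stand for the code point of the same number, except that D801 stands for U+0021. The decoded characters are then given to the standard library's ElementTree parser as an ordinary well-formedness check.

```python
import struct
import xml.etree.ElementTree as ET


def decode_x_myencoding(data):
    """Decode the hypothetical x-myencoding described in the dialog.

    Each big-endian 16-bit unit stands for the code point of the same
    number, except D801, which stands for U+0021 (exclamation point).
    """
    count = len(data) // 2
    units = struct.unpack(">%dH" % count, data)
    return "".join(chr(0x21 if u == 0xD801 else u) for u in units)


# The same octets as in the dump: ordinary text as big-endian 16-bit units,
# with the lone value D801 spliced in after the comma.
octets = (
    '<?xml version="1.0"?>\n<x>Hi,'.encode("utf-16-be")
    + b"\xd8\x01"
    + " Mom.</x>\n".encode("utf-16-be")
)

text = decode_x_myencoding(octets)
root = ET.fromstring(text)
print(root.tag, repr(root.text))
```

Nothing about the octets has changed since the first example; only the rule for turning them into characters has, and with it the answer.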

So going back to the earlier examples, when we assumed a UTF-16 encoding. The octet stream wasn't a—I mean, didn't encode a well-formed XML document.

So did it have a well-formedness error? And crucially, is a processor required to detect encoding errors?

Good questions. I think informed opinion may differ on the first.

Most readers of the spec seem to agree that a sequence of characters which fails to match the 'document' production, or violates some WF constraint in the spec, has a well-formedness error. (They are taking the term "textual object" to mean "sequence of characters", which may or may not be a perfect interpretation.)

It's less clear whether something which is not a sequence of characters, or not a textual object, can rise to the status of having a well-formedness error.

The coffee cup in my hand does not match the 'document' production of the XML spec. It is not, and does not encode, any well-formed XML document. At least, not using any encoding in common use. I could invent one tomorrow in which my coffee cup encodes the character sequence <x/>, just to be able to say that my coffee cup encodes an XML document.

But today I'm busy. So today, my coffee cup encodes no WF XML document. Can we infer, then, that it has a well-formedness error? The spec neither requires us to say so, nor forbids it. At least, not that I noticed when I looked just now.

Is a processor required to detect encoding errors?

If I hand you a document encoded in ISO 8859-7 and tell you it's in ISO 8859-1, do you guarantee that you will detect the error?

That could be hard.

Yes. Impossible in principle. On the other hand, if you tell me the data stream is encoded in some encoding E, and it turns out, when decoded using the rules of encoding E, not to produce a well-formed XML document, it's probably worth reporting, right?

Right. But aren't I the one supposed to be asking the questions? For the third time of asking, is a processor required to detect and report the issue with the D801 character in the example we started with?

Yes. The XML spec says, in section 4.3.3:

It is a fatal error if an XML entity is determined ... to be in a certain encoding but contains byte sequences that are not legal in that encoding.

Conforming processors are required to detect fatal errors and report them. The one in the example thus may or may not be a well-formedness error, but it's definitely a fatal error.

On the other hand, I suspect many implementations take the reasonable view that distinguishing between well-formedness errors and other fatal errors is a game for language lawyers, and call them all well-formedness errors, if they call them anything at all.

Reading the spec. Hold on a sec. Didn't you just tell me that it was impossible in principle to detect all cases in which the encoding declaration is inaccurate?

I did.

So why does the XML spec require the detection of an error you say is impossible, in principle, to detect in the general case? In section 4.3.3 the Rec also says:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration,

That seems to mean that if I told you the document was in ISO 8859-1 when it was really in 8859-7, you would be obligated to detect it.

Are you telling me it's impossible in principle to write a conforming XML processor?

Yeah, you know, I was wondering about that myself. I thought at first that maybe the Core WG had snuck that in later, after the first edition, but no, it's been there all along. I think that those who think as I do on this subject must just have lost the argument with the rest of the XML WG on that one.

Fortunately, there's a metaphysical defense. It's true that the octet stream you gave me encoded a document in ISO 8859-7, and that may have been the one you wanted me to validate and process. But it also encodes a document encoded in ISO 8859-1, which is the one I actually did validate and process. That document didn't make much sense — a number of passages just looked like gibberish, to be honest — but when I'm playing the role of well-formedness checker I try to avoid making stylistic comments on my users' prose. It alarms them. And they find most of the suggestions pedantic.

Why do you think that would be?
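The ISO 8859-1 versus ISO 8859-7 point made above can also be tried out. The Python sketch below is only an illustration; the sample document and the variable names are invented. It encodes a small Greek document in ISO 8859-7, decodes the same octets under both labels, and parses each result with the standard library's ElementTree.

```python
import xml.etree.ElementTree as ET

# A small document whose content is Greek ("good morning").
doc = "<x>Καλημέρα</x>"
octets = doc.encode("iso-8859-7")

# The same octets, read under the right label and under the wrong one.
as_greek = ET.fromstring(octets.decode("iso-8859-7")).text
as_latin1 = ET.fromstring(octets.decode("iso-8859-1")).text

print(as_greek)
print(as_latin1)
```

Both readings parse without complaint. The second one is gibberish, but it is well-formed gibberish, and nothing in the octets themselves says which reading was intended.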