<!DOCTYPE TEI.2 PUBLIC '-//C. M. Sperberg-McQueen//DTD
          TEI Lite 1.0 plus SWeb (XML)//EN'
          'http://www.w3.org/People/cmsmcq/lib/swebxml.dtd' [
<!ENTITY mdash  "&#x2014;" ><!--=em dash-->
]>
<?xml-stylesheet type="text/xsl" href="dialog.xsl"?> 
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>A dialog on surrogate characters in XML</title>
<author>C. M. Sperberg-McQueen</author>
</titleStmt>
<publicationStmt>
<pubPlace>Cambridge, Mass.</pubPlace>
<pubPlace>Sophia-Antipolis</pubPlace>
<pubPlace>Tokyo</pubPlace>
<publisher>World Wide Web Consortium</publisher>
<date>2007</date>
</publicationStmt>
<sourceDesc>
<p>Transcribed from an email to Chris Lilley.</p>
<!-- one of (listBibl biblFull bibl p) -->
</sourceDesc>
</fileDesc>
</teiHeader>
<text>

<front>
<titlePage>
<docTitle>
<titlePart>A dialog on surrogate characters</titlePart>
<titlePart>in XML</titlePart>
</docTitle>
<docDate>21 March 2007</docDate>
<docAuthor>C. M. Sperberg-McQueen</docAuthor>
<titlePart>with some help from my friends</titlePart>
</titlePage>
</front>

<body>
<p>A friend writes:<note place="foot"><bibl>Chris Lilley,
<title level="a">Bare surrogates in XML - must halt and catch fire?</title>
Email to W3C XML Coordination Group and others, 
7 March 2007.</bibl></note>
<q type="block"><p>
In XML 4th edition:
<q type="block"><p>
   [Definition: A parsed entity contains text, a sequence of
   characters, which may represent markup or character data.]
   [Definition: A character is an atomic unit of text as specified by
   ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab,
   carriage return, line feed, and the legal characters of Unicode and
   ISO/IEC 10646. The versions of these standards cited in A.1
   Normative References were current at the time this document was
   prepared. New characters may be added to these standards by
   amendments or new editions. Consequently, XML processors MUST
   accept any character in the range specified for Char. ]
   <xref>http://www.w3.org/TR/xml/#charsets</xref>
</p></q>
This makes it clear that potentially valid characters must be
accepted. The character range is also clear:
<q type="block"><p>
  <code>[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
  [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
  blocks, FFFE, and FFFF. */</code>
</p></q>
Charmod is clear about bare surrogates:
<q type="block"><p>
  Unicode contains some code points for internal use (such as
  noncharacters) or special functions (such as surrogate code points).
</p>
<p>
  C079 [S] Specifications SHOULD NOT allow the use of codepoints
   reserved by Unicode for internal use.
   <xref>http://www.w3.org/TR/charmod/#C079</xref>
</p>
<p>
  C078 [S]  Specifications MUST NOT allow the use of surrogate
    code points.
    <xref>http://www.w3.org/TR/charmod/#C078</xref>
</p></q>
</p>
<p>
What is not clear is that XML specifically forbids bare surrogates
(ie, half of a surrogate pair). This came up in recent SVG WG
discussions.  Is the XML parser required to reject an xml document
containing a bare surrogate? Would that be a well formedness error, or
some other sort of error?</p></q>
</p>
<p>This is my reply.</p>

<p>I believe the short answers are <q>yes</q> to the first question and
<q>unspecified</q> to the second (though most processors are likely to treat
it as a well-formedness error).
</p>
<p>But the short answers are imprecise.</p>

<p>
To be more precise, let us consider an octet stream we receive,
with respect to which we wish to ask <q>is it a well-formed XML
document?</q>  Let us suppose we recognize the octet stream as UTF-16,
either by following the rules in the XML spec or on account of an
external label, or because an omniscient being, or just a being
with particular knowledge of the case (such as the creator of
the data stream, in this case me), has whispered <q>UTF-16</q> in our ear.
</p>
<p>
If we ask a UTF-16-savvy dump utility to show us the 
data,<note place="foot">The assistance of Richard Ishida's
<xref doc="tool">Unicode Code Converter v4</xref> is gratefully
acknowledged.  You don't think I translate this stuff by hand, 
do you?  And if you think the converter is cool, check out
his other utilities, too.</note> we might
see this:
<eg>
  003C 003F 0078 006D &lt; ? x m
  006C 0020 0076 0065 l   v e
  0072 0073 0069 006F r s i o
  006E 003D 0022 0031 n = " 1
  002E 0030 0022 003F . 0 " ?
  003E 000A 003C 0078 >   &lt; x
  003E 0048 0069 002C > H i ,
  D801 0020 004D 006F .   M o
  006D 002E 003C 002F m . &lt; /
  0078 003E 000A      x >
</eg>
So what we've got looks a lot like
<eg><![CDATA[
  <?xml version="1.0"?>
  <x>Hi,* Mom.</x>
]]></eg>
except that where the * appears in the lines above, we have
the 16-bit value D801, which in a normal UTF-16 encoding
would be half of a surrogate pair.  We can ask several
questions:
</p>
<sp><speaker>Q</speaker><p>Is this a well-formed XML document?</p></sp>
<sp><speaker>A</speaker><p>What do you mean by <q><mentioned>this</mentioned></q>?</p></sp>
<sp><speaker>Q</speaker><p>I mean the octet stream.</p></sp>
<sp><speaker>A</speaker><p>Octet streams are streams of bits.  XML documents are
   sequences of characters.  The question seems to embody
   a category error.</p></sp>
<sp><speaker>Q</speaker><p>Who are you, Spock?  Does this octet stream represent a
   well-formed XML document?</p></sp>
<sp><speaker>A</speaker><p>The term <q><mentioned>represent</mentioned></q> is
fraught with difficulties; I think you must mean <q><mentioned>encode</mentioned></q>.</p></sp>
<sp><speaker>Q</speaker><p><stage>Long pause.</stage> ... nine, ten.</p>
<p>OK. Does this octet stream encode a
   well-formed XML document?</p></sp>
<sp><speaker>A</speaker><p>Now the question is conceptually well-formed.</p></sp>
<sp><speaker>Q</speaker><p>Pedant.</p></sp>
<sp><speaker>A</speaker><p>Hey, you ask a language-lawyer question, you get a
   language-lawyer answer.  No, the octet stream doesn't
   encode a well-formed XML document.</p></sp>
<sp><speaker>Q</speaker><p>Why not?</p></sp>
<sp><speaker>A</speaker><p>Because the octet stream does not encode a sequence
   of characters in the UTF-16 encoding.  To encode a
   well-formed XML document, an octet stream must encode
   a sequence of characters which match the <ident>document</ident>
   production from the XML spec and satisfy some other constraints,
   and which thus constitute a well-formed
   XML document.  The 16-bit value D801, followed as it is
   here by 0020, does not encode a character.  The
   octet sequence is not UTF-16.</p></sp>
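<p>The point about D801 can be checked mechanically.  What follows is a
minimal sketch in Python, not anything taken from the XML spec: it packs the
16-bit values of the element from the dump into big-endian octets and asks
Python's own <code>utf-16-be</code> codec to decode them.  The names
<code>UNITS</code>, <code>to_octets</code> and <code>try_utf16</code> are
invented for the illustration, and the XML declaration is left out to keep
the list short.
<eg><![CDATA[
```python
import struct

# Assumption: the 16-bit values from the dump, starting at "<x>".
UNITS = [0x003C, 0x0078, 0x003E, 0x0048, 0x0069, 0x002C,
         0xD801, 0x0020, 0x004D, 0x006F, 0x006D, 0x002E,
         0x003C, 0x002F, 0x0078, 0x003E, 0x000A]


def to_octets(units):
    """Serialize 16-bit code units as big-endian octets."""
    return struct.pack(">%dH" % len(units), *units)


def try_utf16(octets):
    """Return (text, None) on success, or (None, error) if the
    octets are not legal big-endian UTF-16."""
    try:
        return octets.decode("utf-16-be"), None
    except UnicodeDecodeError as err:
        return None, err


text, err = try_utf16(to_octets(UNITS))
print(text, err)
```
]]></eg>
With D801 followed by 0020 the codec should decline to produce any text at
all, whereas a high surrogate followed by a low one (D801 DC00, say) decodes
to a single character.</p>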
<sp><speaker>Q</speaker><p>What if I said it was encoded not in UTF-16 (which has
   defined the surrogate characters) but in UCS-2 (which
   doesn't define surrogate characters)?</p></sp>
<sp><speaker>A</speaker><p>I'd have to check the Unicode specs to be sure.  Hold
   on ...</p></sp>
<sp><speaker>Q</speaker><p>Wait, don't bother.  Suppose I invented an encoding and
   called it x-myencoding and said that this sequence of octets is a legitimate
   encoding of a sequence of Unicode 1.0 characters, and
   D801 represents, er, encodes U+D801, or equivalently the Unicode 1.0
   character whose integer value is 55297.</p></sp>
<sp><speaker>A</speaker><p>I don't think Unicode defines a character at that point.
   In fact, I'm pretty sure they say explicitly that
   there isn't one and can never be one.</p></sp>
<sp><speaker>Q</speaker><p>Not in Unicode 1.0.  Surrogates weren't added until later.
   Is it well-formed then?</p></sp>
<sp><speaker>A</speaker><p>No.</p></sp>
<sp><speaker>Q</speaker><p>Why not?</p></sp>
<sp><speaker>A</speaker><p>Two reasons.  First, by not including an encoding
   declaration, you implicitly claimed that the encoding
   was either UTF-8 or UTF-16, or else reliably given by an
   external authority.  (You will have to read up on the
   current state of the various RFCs to get a chapter and
   verse account of when and where and how and why for
   all of this.)  The external authority who whispered in
   my ear distinctly said <q>UTF-16</q>, not <q>x-myencoding</q>.</p></sp>
<sp><speaker>Q</speaker><p>So if I added an encoding declaration would it be
   well-formed?</p></sp>
<sp><speaker>A</speaker><p>No.  You told me that the octets in the relevant bit
   of the data stream encode the Unicode 1.0 characters
   whose integers are (in hex, I can't do decimal conversions
   on the fly) ... 002C, D801, 0020, ...
   I'm taking your word for it that the octet stream
   correctly encodes those characters.  But production [2]
   of XML says clearly that the second of those characters,
   the one whose number is D801, is not a legal XML
   character.  So if the octet stream is correctly recognized
   as being encoded in x-myencoding, then we have a
   sequence of characters but not a well-formed XML
   document.</p></sp>
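<p>Production [2] is simple enough to transcribe directly.  The following
Python sketch does so, assuming the characters have already been decoded to
integer code points; <code>is_xml_char</code> and
<code>first_illegal</code> are names made up for this example.
<eg><![CDATA[
```python
def is_xml_char(cp):
    """True if the code point matches production [2] Char of XML 1.0."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def first_illegal(code_points):
    """Return the index of the first code point that is not a Char,
    or None if every code point is legal."""
    for i, cp in enumerate(code_points):
        if not is_xml_char(cp):
            return i
    return None


# The three characters named in the answer above: 002C, D801, 0020.
print(first_illegal([0x002C, 0xD801, 0x0020]))
```
]]></eg>
The check is on characters, not octets, which is why it applies no matter
what x-myencoding does at the octet level.</p>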
<sp><speaker>Q</speaker><p>What if I told you that I was wrong, earlier, when I
   said that x-myencoding treats D801 as an encoding
   of the Unicode 1.0 character whose integer is xD801?</p></sp>
<sp><speaker>A</speaker><p>I wouldn't be the least bit surprised.</p></sp>
<sp><speaker>Q</speaker><p>What if I told you that D801 is recognized as a
   valid encoding of the character whose number is
   33, i.e. x21?</p></sp>
<sp><speaker>A</speaker><p>That would be exclamation point.</p></sp>
<sp><speaker>Q</speaker><p>So is the octet stream a well-formed XML docu&mdash; I mean,
   does the octet stream now repre&mdash; er, encode, a well-formed
   XML document?</p></sp>
<sp><speaker>A</speaker><p>You're telling me it encodes the sequence of characters whose
   conventional display form is
<eg><![CDATA[
   <?xml version="1.0"?>
   <x>Hi,! Mom.</x>
]]></eg>
   That sequence of characters is indeed a well-formed XML
   document.  I have to grant that, even if I deplore your choice
   of character encodings.  And your English punctuation isn't
   too hot, either.</p></sp>
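<p>For the curious, here is a sketch in Python of what a decoder for
x-myencoding might look like.  Everything about it is an assumption made for
illustration: the encoding is imaginary, so the rule (one character per
big-endian 16-bit value, with D801 read as U+0021) and the names
<code>HYPOTHETICAL_MAP</code>, <code>encode_units</code> and
<code>decode_x_myencoding</code> are invented here.
<eg><![CDATA[
```python
import struct
import xml.etree.ElementTree as ET

# Assumption: the single special case in the imaginary x-myencoding.
HYPOTHETICAL_MAP = {0xD801: 0x21}


def encode_units(text, star=0xD801):
    """Build the example octets: big-endian 16-bit values, with '*'
    in the input standing in for the value D801."""
    return b"".join(struct.pack(">H", star if ch == "*" else ord(ch))
                    for ch in text)


def decode_x_myencoding(octets):
    """Hypothetical decoder: one character per 16-bit value,
    remapped through HYPOTHETICAL_MAP."""
    units = struct.unpack(">%dH" % (len(octets) // 2), octets)
    return "".join(chr(HYPOTHETICAL_MAP.get(u, u)) for u in units)


octets = encode_units('<?xml version="1.0"?>\n<x>Hi,* Mom.</x>\n')
decoded = decode_x_myencoding(octets)
root = ET.fromstring(decoded)  # parses the characters, not the octets
print(root.tag, root.text)
```
]]></eg>
The parser here is handed the decoded characters, so the question of
well-formedness is asked only after the encoding has had its say.</p>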
<sp><speaker>Q</speaker><p>So going back to the earlier examples, when we assumed a UTF-16
   encoding.  The octet stream wasn't a&mdash;I mean, didn't encode
   a well-formed XML document.  </p>
<p>So did it have a well-formedness
   error?  And crucially, is a processor required to detect
   encoding errors?</p></sp>
<sp><speaker>A</speaker>
<p>Good questions.  I think informed opinion may differ on the first.</p>
<p>Most readers of the spec seem to agree that a sequence of
   characters which fails to match the <ident>document</ident> production
   or violates some WF constraint in the spec has a well-formedness
   error.  (They are taking the term <term>textual object</term> to mean
   <gloss>sequence of characters</gloss>, which may or may not be a perfect
   interpretation.)  </p>
<p>It's less clear whether something which
   is not a sequence of characters, or not a textual object,
   can rise to the status of having a well-formedness error.
</p>
<p>
   The coffee cup in my hand does not match the <ident>document</ident>
   production of the XML spec.  It is not, and does not encode,
   any well-formed XML document.  At least, not using any encoding
   in common use.  I could invent one tomorrow in which my
   coffee cup encodes the character sequence <code>&lt;x/></code>, just to be
   able to say that my coffee cup encodes an XML document.
</p>
<p>
   But today I'm busy.  So today, my coffee cup encodes no WF
   XML document.  Can we infer, then, that it has a well-formedness
   error?  The spec neither requires us to say so, nor forbids
   it.  At least, not that I noticed when I looked just now.</p></sp>
<sp><speaker>Q</speaker><p>Is a processor required to detect encoding errors?</p></sp>
<sp><speaker>A</speaker><p>If I hand you a document encoded in ISO 8859-7 and tell you
   it's in ISO 8859-1, do you guarantee that you will detect
   the error?</p></sp>
<sp><speaker>Q</speaker><p>That could be hard.</p></sp>
<sp><speaker>A</speaker><p>Yes. Impossible in principle.  On the other hand, if you tell me
   the data stream is encoded in some encoding <ident>E</ident>, and it turns out,
   when decoded using the rules of encoding <ident>E</ident>, not to produce
   a well-formed XML document, it's probably worth reporting,
   right?</p></sp>
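<p>The mislabeling case is easy to try out.  In this Python sketch (the
function name <code>misread</code> and the sample Greek word are assumptions
chosen for the example), a small document is encoded in ISO 8859-7 and then
decoded as if it were ISO 8859-1, using Python's <code>iso8859_7</code> and
<code>iso8859_1</code> codecs.
<eg><![CDATA[
```python
import xml.etree.ElementTree as ET


def misread(text, actual="iso8859_7", claimed="iso8859_1"):
    """Encode text in one encoding, then decode the same octets
    under the rules of another."""
    return text.encode(actual).decode(claimed)


# A Greek word ("kalimera"), written with escapes to keep the source ASCII.
greek = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"
doc = "<x>" + greek + "</x>"

wrong = misread(doc)
root = ET.fromstring(wrong)  # no decoding error, and no parse error
print(root.text)
```
]]></eg>
Every octet is a legal character in ISO 8859-1, so nothing at the encoding
level can signal the mistake; the result is simply a different, equally
well-formed document.</p>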
<sp><speaker>Q</speaker><p>Right.  But aren't I the one supposed to be asking the questions?
   For the third time of asking, is a processor required to detect
   and report the issue with the D801 character in the example
   we started with?</p></sp>
<sp><speaker>A</speaker><p>Yes.  The XML spec says, in section 4.3.3:
<q type="block"><p>
      It is a fatal error if an XML entity is determined ...
      to be in a certain encoding but contains byte sequences
      that are not legal in that encoding.
</p></q>
   Conforming processors are required to detect fatal errors
   and report them.  The one in the example thus may or may not
   be a well-formedness error, but it's definitely a fatal
   error.</p>
<p>On the other hand, I suspect many implementations take the 
reasonable view that distinguishing between well-formedness errors
and other fatal errors is a game for language lawyers, and call
them all well-formedness errors, if they call them anything at 
all.</p></sp>
<sp><speaker>Q</speaker><p><stage>Reading the spec.</stage>
Hold on a sec.  Didn't you just tell me that it was impossible in principle
   to detect all cases in which the encoding declaration is
   inaccurate?</p></sp>
<sp><speaker>A</speaker><p>I did.</p></sp>
<sp><speaker>Q</speaker><p>So why does the XML spec require the detection of an error you
   say is impossible to detect in principle, in the general
   case?  In section 4.3.3 the Rec also says
<q type="block"><p>
      In the absence of information provided by an external
      transport protocol (e.g. HTTP or MIME), it is a fatal
      error for an entity including an encoding declaration to
      be presented to the XML processor in an encoding other
      than that named in the declaration, ...
</p></q>
   That seems to mean that if I told you the document was in
   ISO 8859-1 when it was really in 8859-7, you would be
   obligated to detect it.</p>
<p>Are you telling me it's impossible in principle to write
a conforming XML processor?</p></sp>
<sp><speaker>A</speaker><p>Yeah, you know, I was wondering about that myself.  I thought
   at first that maybe the Core WG had snuck that in later,
   after the first edition, but no, it's been there all along.
   I think that those who think as I do on this subject
   must just have lost the argument with the rest of
   the XML WG on that one.</p>
<p>Fortunately, there's a metaphysical
   defense.  It's true that the octet stream you gave me encoded
   a document in ISO 8859-7, and that may have been the one you
   wanted me to validate and process.  But it <emph>also</emph> encodes a
   document in ISO 8859-1, which is the one I actually
   did validate and process.  That document didn't make much
   sense &mdash; a number of passages just looked like gibberish,
   to be honest &mdash; but when I'm playing the role of well-formedness
   checker I try to avoid making stylistic comments on my
   users' prose.  It alarms them.  And they find most of the
   suggestions pedantic.</p></sp>
<sp><speaker>Q</speaker><p>Why do you think that would be?</p></sp>
</body>
</text>
</TEI.2>
<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-default-dtd-file:"/Library/SGML/Public/Emacs/sweb.ced"
sgml-omittag:t
sgml-shorttag:t
End:
-->
