This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 4372 - [Serialization] Lexical checking of doctype-public
Summary: [Serialization] Lexical checking of doctype-public
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Serialization 1.0
Version: Recommendation
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Scott Boag
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-03-07 09:02 UTC by Michael Kay
Modified: 2007-11-15 14:40 UTC
0 users

Description Michael Kay 2007-03-07 09:02:17 UTC
Bjoern Hoehrmann [derhoermi@gmx.net] raised the following point today on public-qt-comments. I am transferring it here for tracking purposes. Please ensure that any decisions are relayed to Bjoern!

Dear XSL Working Group,

  In http://www.w3.org/1999/11/REC-xslt-19991116-errata/ E4, XSLT 1.0 processors are required to generate well-formed XML documents. I think this erratum is incomplete (the last sentence of the first paragraph in 3.1 would also need to be changed, and arguably also the first one in 16.1) and I do not think processors can implement the requirement. A similar issue exists in XSLT 2.0 and in "XSLT 2.0 and XQuery 1.0 Serialization".

The reason is that neither version of XSLT requires lexical checking of the doctype-public parameter; both specify the content model as just "string", but XML 1.0 places additional restrictions on it. For example,

  <xsl:output
    method="xml"
    version="1.0"
    doctype-system="x"
    doctype-public="-//W3C//DTD&#x9;XHTML 1.0 Transitional//EN"
  />

or

  <xsl:output
    method="xml"
    version="1.0"
    doctype-system="x"
    doctype-public="x&#xf6;y"
  />

would result in ill-formed XML, as neither U+0009 nor U+00F6 is allowed in the public identifier. In the case of XSLT 1.0 it seems processors are not allowed to signal an error here, and in the case of XSLT 2.0 it can be argued that this should result in the generic err:SERE0003 error, but e.g. Saxon 8.7.1J emits ill-formed XML instead. I think both XSLT 1.0 and XSLT 2.0 should require doctype-public to be syntactically correct, or failing that, XSLT 1.0's E4 should be modified to allow the processor to signal an error in the cases above.

regards,
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Comment 1 C. M. Sperberg-McQueen 2007-03-15 18:01:55 UTC
The XSL Working Group discussed this issue on today's call.  Your point
appears to be well taken; the consensus of the group was that serializers
should indeed check the values of public identifiers for conformance with
the relevant production of the XML spec.  We expect to draft errata for
the relevant documents and approve corrections in due course.

We note for the record that checking the characters of the public identifier
is NOT the same as checking the public identifier for conformance to the
grammar for formal public identifiers in ISO 8879.  XML does not require
that public identifiers be formal public identifiers, and such checking
doesn't feel as if it belongs at the well-formedness level.
Comment 2 Michael Kay 2007-03-15 18:20:09 UTC
The relevant rules for XML appear to be:

[12]  PubidLiteral  ::=  '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
[13]  PubidChar     ::=  #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]

and I think it's fairly straightforward for us to add a rule to the serialization spec that says it's an error if doctype-public doesn't conform to this syntax.
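
As a rough illustration of such a check, the following Python sketch tests a doctype-public value against production [13]. The function name is_valid_pubid is hypothetical and is not taken from any specification or processor; the two rejected inputs are the values from the original report.

```python
import re

# PubidChar, XML 1.0 production [13]:
#   #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
_PUBID_CHARS = re.compile(r"[\x20\x0D\x0Aa-zA-Z0-9\-'()+,./:=?;!*#@$_%]*")


def is_valid_pubid(value: str) -> bool:
    """Return True if every character of value is a PubidChar.

    Hypothetical helper for illustration only. PubidChar excludes the
    double quote, so any value that passes can be written as "...".
    """
    return _PUBID_CHARS.fullmatch(value) is not None


print(is_valid_pubid("-//W3C//DTD HTML 4.01//EN"))                 # True
print(is_valid_pubid("-//W3C//DTD\tXHTML 1.0 Transitional//EN"))   # False (U+0009)
print(is_valid_pubid("x\u00f6y"))                                  # False (U+00F6)
```

Note that this only checks the character set; it says nothing about whether the value is a formal public identifier.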

The more difficult question is what to do about HTML. In principle we could require that the doctype-public is one of the official FPIs appearing in the HTML recommendation, for example "-//W3C//DTD HTML 4.01//EN". However, that would almost certainly break many existing stylesheets, since a lot of code is probably getting away with undetected typos in such a string. Arguably XSLT processors should tell people when they are generating bad HTML, but I personally don't want to be the one in the firing line on this: although we could have done it earlier, it's a bad candidate for an erratum. Also, it's not future-proof: we don't know what FPIs will be allowed in future versions of HTML.

I think my preference would be that we impose the same rules for HTML as we do for XML - that is, a simple restriction on the permitted character set.
Comment 3 Colin Adams 2007-03-16 07:42:49 UTC
Future-proofing for HTML is not a problem, as we require the implementation to state what versions of HTML it supports (or does this only apply to XSLT? I don't know XQuery at all).

It also means we don't need an erratum, I think (unless it is to define a new error code).
If the public identifier does not match the requested version of HTML, then the processor doesn't support this pseudo-HTML version. So the processor is already free to issue an error message (and, I would think, is obliged to in order to claim conformance).

The same applies for XHTML.
Comment 4 Michael Kay 2007-03-16 08:30:49 UTC
I'm not sure quite what you had in mind in comment #3, Colin. But I suppose if we chose to do so we could have a rule that stated: with the HTML method, if a doctype-public attribute is specified, then it must be a value that is permitted by the (explicitly or implicitly) chosen version of the HTML specification; we wouldn't need to enumerate the valid values. However, I still don't fancy the idea that we suddenly start rejecting stylesheets which (as far as the user is concerned) have been working for years. That's because I have to answer the bug reports...
Comment 5 Colin Adams 2007-03-16 09:13:54 UTC
What I had in mind is that there is no need for a rule for this - it is implicit in supporting a particular version of HTML.

If a user requests version 4.0 HTML serialization, but specifies for doctype-public the FPI for version 3.2, then there is a contradiction in the user's xsl:output statements. So the user has simultaneously requested both version 3.2 and 4.0.

But I guess this is a slightly different error from just specifying version 3.2 in the version attribute (and the implementation only supports 4.0).

Nevertheless, I would feel perfectly justified in issuing an error message saying that doctype-public requests a public identifier for a version of HTML not supported by this implementation (of course, I don't have your problem of thousands of users likely to complain :-).

Still, as I write this, I see another problem. Suppose the implementation supports both 4.0 and 3.2. In that case, a different error message is appropriate (and appears not to be necessarily authorized by the current spec).
Comment 6 David Carlisle 2007-03-16 10:06:59 UTC
I'd strongly argue that the HTML serialisation should not enforce particular PUBLIC ID values; certainly that would be a potentially breaking change, not a fix for an erratum. But even if compatibility with existing practice were not a concern, I would still think that this would be a bad idea. The FPI is (was) _intended_ to be locally adapted. There have been dozens of HTML FPIs published (and probably many more not published); see for example
http://dbaron.org/mozilla/doctypes
for one list. At most the HTML method could check that the value matches the FPI syntax

http://www.oasis-open.org/cover/tauber-fpi.html,

but I think that the consistent thing to do is just to check the (simpler) XML rule even in the HTML case.

David
Comment 7 Scott Boag 2007-04-12 16:20:06 UTC
The formal proposal is for the following to be added to section three in the Serialization spec, in the row for doctype-public:

  It is an error if doctype-public does not conform to the syntax of PubidLiteral {with xml external link notation}.

Similar wording should also be added to the XSLT specification.
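
To make the effect of this rule concrete, here is a minimal Python sketch of a serializer step that refuses to write a DOCTYPE declaration when doctype-public is not made up of PubidChar characters. The function doctype_decl and the use of ValueError are assumptions for illustration; the proposal itself does not say which error a processor should raise.

```python
import re

# Characters allowed by XML 1.0 production [13] (PubidChar).
_PUBID = re.compile(r"[\x20\x0D\x0Aa-zA-Z0-9\-'()+,./:=?;!*#@$_%]*")


def doctype_decl(root: str, public: str, system: str) -> str:
    """Build a DOCTYPE declaration, rejecting a bad doctype-public.

    Hypothetical sketch: ValueError stands in for whatever serialization
    error a real processor would signal.
    """
    if _PUBID.fullmatch(public) is None:
        raise ValueError(f"doctype-public is not a valid PubidLiteral: {public!r}")
    # A SystemLiteral may use either quote but cannot contain its delimiter.
    if '"' not in system:
        quote = '"'
    elif "'" not in system:
        quote = "'"
    else:
        raise ValueError("doctype-system contains both quote characters")
    # PubidChar never includes '"', so double quotes are always safe here.
    return f'<!DOCTYPE {root} PUBLIC "{public}" {quote}{system}{quote}>'
```

With this check in place, the two xsl:output examples from the original report would be rejected instead of producing ill-formed output.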
Comment 8 Sharon Adler 2007-05-15 16:02:46 UTC
The XSL WG discussed this bug and accepted the text outlined by Scott Boag in Comment # 7 on 12 April.  This bug and the recommended change were presented to the XQuery WG for information at the joint meeting in North Carolina.  The change was accepted.  This bug will be closed.  Note: the changes need to be made in both Serialization and XSLT.
Comment 9 Michael Kay 2007-10-10 18:26:07 UTC
The XSLT side of this is handled by Erratum E3.
Comment 10 Henry Zongaro 2007-11-15 14:40:16 UTC
This will be Serialization erratum E1.