2474 2005-11-07 20:10:53 +0000 [SER] Can fully-normalized be implemented? 2006-02-01 15:11:01 +0000 1 1 1 Unclassified XPath / XQuery / XSLT Serialization 1.0 Candidate Recommendation PC Linux CLOSED FIXED P2 normal --- 1 colin scott_boag public-qt-comments oldest_to_newest

7078 0 colin 2005-11-07 20:10:53 +0000

Can normalization-form="fully-normalized" be implemented?

Suppose a user codes a stylesheet which includes a literal result element xi:include, where xi is bound to the XInclude namespace. Then for the serializer to fully-normalize the output, it must act as an XInclude processor and normalize the content of the included text, and then replace the xi:include element with the normalized text. But this is contrary to the syntax of a literal result element. So I conclude that it is impossible to implement this normalization form in XSLT (I don't know XQuery, so I cannot say).

An alternative is to inspect the contents of the to-be-included resource, and raise a serialization error if it is not already normalized. But this still involves the serializer having to act as an XInclude processor.

But it's not just xi:include elements that are includes. What about if doctype-system is specified? And in general, how can the serializer know whether an LRE is meant to function as include syntax?

7348 1 cmsmcq 2005-12-08 17:55:29 +0000

It seems to me that the character stream which is to be fully normalized is the stream output by the serializer, not the stream which would result from applying some inclusion process to it.

If that's not sufficiently clear from the text of the spec, we should perhaps say explicitly that "fully-normalized" requires full normalization of the output of the serializer, NOT full normalization of the result of running an XInclude processor (or any other processor) on that output.

[Speaking only for myself.]

7367 2 colin 2005-12-09 07:09:55 +0000

But that's not full normalization, it's just NFC, as according to the definition it subsumes include normalization.

7368 3 davidc 2005-12-09 10:18:10 +0000

(In reply to comment #2)
> But that's not full normalization, it's just NFC, as according to the definition
> it subsumes include normalization.

I agree with you (but I'm not in the WG as you know). I don't think the serialiser can guarantee full normalisation. Apart from xinclude, there may be entity references (from character maps, or html known entities) that the serialiser doesn't really have full control over, and it may be impossible to fully normalise without either expanding the reference or modifying the referenced entity, or by modifying the tree (eg putting a space before an entity reference if the character before it could possibly be affected by normalisation with a character at the start of a referenced entity)

David

7417 4 mike 2005-12-13 13:14:45 +0000

I think this comes down to a question of the definition (or our interpretation of the definition) of "fully normalized". We refer to CharMod, which says this:

Text is fully-normalized if... the text is in a Unicode encoding form, is include-normalized and none of the constructs comprising the text begin with a composing character or a character escape representing a composing character;

Text is include-normalized if... the text is Unicode-normalized and does not contain any character escapes or includes whose expansion would cause the text to become no longer Unicode-normalized;

The definition of "includes" is:

An include is an instance of a syntactic device specified in a language to include text at the position of the include, replacing the include itself. Examples of includes are entity references in XML, @import rules in CSS and the #include preprocessor statement in C/C++.

Colin seems to be assuming that an XInclude element is an "include" in this sense. We decided that it was not.
XInclude operates at a higher level of the stack than we do: from our perspective it is an application-level construct, not a "syntactic device". It's no different from <xsl:include> or <xsd:include>: we can't be expected to understand the semantics of every element in every XML vocabulary.

Entity references are "includes" in this sense, but the serializer never generates them, so they don't affect the outcome.

I think the only difference between "fully-normalized" and NFC, as far as our serialization spec is concerned, is that with "fully-normalized" the output cannot start with a composing character or a character escape representing a composing character.

Michael Kay

7419 5 davidc 2005-12-13 14:03:58 +0000

I agree that XSLT can't know the semantics of the elements it writes out so xs:include is probably out of scope, but I'm not sure about entity references. If I use a character map to write out &foo; and use xsl:output to specify a doctype that defines foo to be a combining acute, can it really be said that the result is fully-normalised if the result contains e&foo; ? Perhaps it can, as basically the output of a character map is absolved from requirements of Unicode normalisation, well-formedness etc.

> with "fully-normalized" the output cannot start with a composing

yes, so long as by start here you mean start of each text node (or attribute value), not just the start of the whole result tree? If that isn't the case I see it raises err:SERE0012. Perhaps it would be helpful if the spec defined rather more explicitly than "any relevant construct" exactly which constructs are affected. Presumably this is text nodes and attribute values. Does it include each individual token in an attribute or text node that is schema typed with a list type?
(If the system knows the result schema.) I'm assuming (but haven't checked) that no NameStartChar is a combining character (otherwise "relevant constructs" would need to include element and attribute names as well).

David

7420 6 mike 2005-12-13 14:34:29 +0000

> with "fully-normalized" the output cannot start with a composing

Yes: I should have said "no 'relevant construct' may start with a composing...".

The serialization spec defines "relevant construct" by reference to section 2.13 of XML 1.1, which defines it thus:

1. The replacement text of all parsed entities
2. All text matching, in context, one of the following productions:
   1. CData
   2. CharData
   3. content
   4. Name
   5. Nmtoken

I think we're only concerned with one parsed entity, namely the one we are generating, and the rest is all perfectly well-defined, if somewhat onerous to check.

7421 7 davidc 2005-12-13 15:07:10 +0000

> by reference to section 2.13

grr sorry about that, before posting I read that (but read the wrong bit of the xml spec by mistake). I know it's W3C house style, but the practice of linking to the reference list at the end of the current document rather than directly linking to the correct section of an external document is _so_ annoying/confusing.

> I think we're only concerned with one parsed entity,

as I say, that depends on just how far out of scope references to parsed entities generated by character maps are. The note in section 9 warning that character maps can result in non-wellformed documents could perhaps be extended to say that they can also make the resulting document not conform to whatever Unicode normalisation form is specified.

As for 4 and 5 on the list, they are also a bit questionable, as list-typed content (eg IDREFS) is a whitespace-separated list of Name tokens, so arguably they each need to be checked individually (although conversely, arguably the serialiser doesn't know that the element or attribute is so typed...) I don't mind either way, but I can't tell from the spec.
I suspect that the answer is that the serialisation doesn't need to check individual tokens in a list, but that a validating parser which chooses to warn of normalisation errors _will_ warn when reading the resulting document if tokens are not in normal form.

David

7442 8 mike 2005-12-14 16:29:36 +0000

Colin Adams asked me to enter the following comment on his behalf:

I DO think XInclude is in the same category as C's #include. But even if the WG disagree with this, there is still the question of #include for <xsl:output method="text"/>. As you say, we cannot expect to be able to understand all such constructs. But the definition of fully-normalized seems to require this. Which is why I think it cannot be implemented by an XSLT serializer.

7443 9 mike 2005-12-14 16:39:37 +0000

Both <xi:include> and #include are things that operate at a higher semantic level. We're concerned with plain XML or plain text; we've no idea what the XML vocabulary is, or whether the text is supposed to be a C# program. We could perhaps take the media-type into account, but that's upside-down in terms of a layered architecture: it's the responsibility of a higher level of software to look after constraints that apply at its own level.

I think the CharMod spec could perhaps have made it clearer that the concept of an "include" is a relative one: it can only be defined in relation to the knowledge of the document's syntax and semantics available at a particular level of the system.

Moreover, when we create one file in a family of files that contain "include" references to each other, our responsibility can only be to ensure that the file we are generating is a legitimate component of a fully-normalized document; we cannot take any responsibility for the other files.

Michael Kay (speaking personally)

7475 10 colin 2005-12-16 05:17:08 +0000

Then I think some extra wording would be appropriate, to explain the responsibilities of the serializer when fully-normalized is specified.
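
Editorial illustration (not part of the original thread): the e&foo; case from comment 5 can be sketched in Python. This is only a rough model under stated assumptions: the entity table and the expand() helper are hypothetical, standing in for a DOCTYPE that declares foo as U+0301 COMBINING ACUTE ACCENT, and NFC is used as the Unicode normalization form. It shows text that is NFC as serialized but no longer NFC once the entity reference is expanded, which is the include-normalization gap being discussed.

```python
import unicodedata

# Hypothetical entity table: assumes the DOCTYPE declares
# <!ENTITY foo "&#x301;"> (U+0301 COMBINING ACUTE ACCENT).
ENTITIES = {"foo": "\u0301"}


def is_nfc(text):
    """True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def expand(text, entities):
    """Naive stand-in for entity expansion: replace &name; with its value."""
    for name, value in entities.items():
        text = text.replace("&" + name + ";", value)
    return text


serialized = "e&foo;"                     # what the serializer writes
expanded = expand(serialized, ENTITIES)   # what a parser would see

print(is_nfc(serialized))  # the raw output is plain ASCII, so it is NFC
print(is_nfc(expanded))    # "e" + U+0301 composes to U+00E9 under NFC
```

Under these assumptions the serializer's own output passes an NFC check, while the expanded text does not; detecting that would require the serializer to read the entity declarations.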

8009 11 mike 2006-01-27 19:30:29 +0000

I have been asked to propose clarifying text that will resolve this issue. (Colin, please let us know whether you are happy with this text, though at the moment it's only a proposal.)

In Serialization 5.1.8 we currently say:

It is a serialization error [err:SERE0012] if the value of the parameter is fully-normalized and any relevant construct of the result begins with a combining character. The serializer MUST signal the error. See Section 2.13 of [XML11] for the definition of the relevant constructs of XML.

I propose that we change this to:

If the value of the parameter is <code>fully-normalized</code>, then no <emph>relevant construct</emph> of the parsed entity created by the serializer may start with a composing character. The term <emph>relevant construct</emph> has the meaning defined in section 2.13 of [XML11]. If this condition is not satisfied, a serialization error [err:SERE0012] MUST be signalled.

Note: specifying <code>fully-normalized</code> as the value of this parameter does not guarantee that the XML document output by the serializer will in fact be fully normalized as defined in [XML11]. This is because the serializer does not check that the text is <code>include normalized</code>, which would involve checking all external entities that it refers to (such as an external DTD). Furthermore, the serializer does not check whether any character escape generated using character maps represents a composing character.

8011 12 colin 2006-01-27 19:35:20 +0000

I am quite content with the proposed change.

8083 13 joannet 2006-02-01 15:10:51 +0000

The XSL and XQuery working groups accepted the proposed text in message #11 at the Feb 1, 2006 F2F meeting. Thank you for raising this comment.

Joanne
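
Editorial illustration (not part of the original thread): a minimal Python sketch of the check described in the proposed text, namely that no relevant construct may start with a composing character. It rests on assumptions that should be read as such: the function names are hypothetical, the caller is assumed to have already split the output into relevant constructs (text nodes, names and so on), and "composing character" is approximated as "non-zero canonical combining class", which is narrower than the CharMod definition.

```python
import unicodedata


def starts_with_composing(text):
    """Approximation: treat a character with a non-zero canonical
    combining class as a composing character (CharMod's definition
    is broader than this)."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def offending_constructs(constructs):
    """Return the constructs that would trigger err:SERE0012 in this
    sketch. `constructs` is assumed to be the already-extracted
    relevant constructs of the serialized output."""
    return [c for c in constructs if starts_with_composing(c)]


# A text node that begins with U+0301 COMBINING ACUTE ACCENT.
constructs = ["abc", "\u0301bc", ""]
bad = offending_constructs(constructs)
print(len(bad))

# NFC normalization alone does not remove the problem: a leading
# combining mark has no base character to compose with.
print(unicodedata.normalize("NFC", "\u0301bc") == "\u0301bc")
```

This matches the point made in comment 4: for the serializer, the practical difference between NFC and fully-normalized is the extra check on how each construct begins, since NFC leaves a leading combining mark untouched.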