2474 – [SER] Can fully-normalized be implemented?

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Serialization 1.0
Version:	Candidate Recommendation
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	Scott Boag
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2005-11-07 20:10 UTC by Colin Adams
Modified:	2006-02-01 15:11 UTC
CC List:	0 users

Description Colin Adams 2005-11-07 20:10:53 UTC

Can normalization-form="fully-normalized" be implemented?

Suppose a user codes a stylesheet which includes a literal result element
xi:include, where xi is bound to the XInclude namespace.
Then for the serializer to fully-normalize the output, it must act as an
XInclude processor and normalize the content of the included text, and then
replace the xi:include element with the normalized text.
But this is contrary to the syntax of a literal result element.

So I conclude that it is impossible to implement this normalization form in XSLT
(I don't know XQuery, so I cannot say).

An alternative is to inspect the contents of the to-be-included resource, and
raise a serialization error if it is not already normalized.
But this still involves the serializer having to act as an XInclude processor.

But it's not just xi:include elements that are includes.
What about if doctype-system is specified?
And in general, how can the serializer know whether an LRE is meant to function
as include syntax?

Comment 1 C. M. Sperberg-McQueen 2005-12-08 17:55:29 UTC

It seems to me that the character stream which is to
be fully normalized is the stream output by the 
serializer, not the stream which would result from
applying some inclusion process to it.

If that's not sufficiently clear from the text of the spec,
we should perhaps say explicitly that full normalization
requires full normalization of the output of the
serializer, NOT full normalization of the result of
running an XInclude processor (or any other processor)
on that output.

[Speaking only for myself.]

Comment 2 Colin Adams 2005-12-09 07:09:55 UTC

But that's not full normalization, it's just NFC, as according to the definition
it subsumes include normalization.

Comment 3 David Carlisle 2005-12-09 10:18:10 UTC

(In reply to comment #2)
> But that's not full normalization, it's just NFC, as according to the definition
> it subsumes include normalization.

I agree with you (but I'm not in the WG as you know). I don't think the
serialiser can guarantee full normalisation. Apart from xinclude, there may be
entity references (from character maps, or html known entities) that the
serialiser doesn't really have full control over, and it may be impossible
to fully normalise without either expanding the reference or modifying the
referenced entity, or by modifying the tree (eg putting a space before an entity
reference if the character before it could possibly be affected by normalisation
with a character at the start of a referenced entity)

David

Comment 4 Michael Kay 2005-12-13 13:14:45 UTC

I think this comes down to a question of the definition (or our interpretation
of the definition) of "fully normalized".

We refer to CharMod, which says this:

Text is fully-normalized if... the text is in a Unicode encoding form, is
include-normalized and none of the constructs comprising the text begin with a
composing character or a character escape representing a composing character;

Text is include-normalized if... the text is Unicode-normalized and does not
contain any character escapes or includes whose expansion would cause the text
to become no longer Unicode-normalized; 

The definition of "includes" is: An include is an instance of a syntactic device
specified in a language to include text at the position of the include,
replacing the include itself. Examples of includes are entity references in XML,
@import rules in CSS and the #include preprocessor statement in C/C++.

Colin seems to be assuming that an XInclude element is an "include" in this
sense. We decided that it was not. XInclude operates at a higher level of the
stack than we do: from our perspective it is an application-level construct, not
a "syntactic device". It's no different from <xsl:include> or <xsd:include>: we
can't be expected to understand the semantics of every element in every XML
vocabulary.

Entity references are "includes" in this sense, but the serializer never
generates them, so they don't affect the outcome.

I think the only difference between "fully-normalized" and NFC, as far as our
serialization spec is concerned, is that with "fully-normalized" the output
cannot start with a composing character or a character escape representing a
composing character.
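
As an illustration of that distinction, here is a minimal Python sketch (not taken from the spec or from any serializer). It assumes that "composing character" can be approximated by a non-zero canonical combining class, which may be narrower than the CharMod definition; the function names are hypothetical.

```python
import unicodedata


def is_nfc(text):
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def begins_with_composing(text):
    """Approximation (assumption): non-zero combining class means 'composing'."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def passes_fully_normalized_check(text):
    """NFC plus the extra rule: must not start with a composing character."""
    return is_nfc(text) and not begins_with_composing(text)


# COMBINING ACUTE ACCENT followed by "abc": already NFC, yet it starts
# with a composing character, so only the stricter check rejects it.
sample = "\u0301abc"
print(is_nfc(sample), passes_fully_normalized_check(sample))
```

Under this reading, a string can satisfy NFC and still fail the "fully-normalized" condition purely because of its first character.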

Michael Kay

Comment 5 David Carlisle 2005-12-13 14:03:58 UTC

I agree that XSLT can't know the semantics of the elements it writes out so
xs:include is probably out of scope, but I'm not sure about entity references.
If I use a character map to write out &foo; and use xsl:output to specify a
doctype that defines foo to be a combining acute can it really be said that the
result is fully-normalised if the result contains e&foo; ? Perhaps it can, as
basically the output of a character map is absolved from requirements of unicode
normalisation, well formedness etc.
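
The e&foo; case can be made concrete with a small Python sketch. It is only an illustration: the entity expansion is simulated with a plain string replacement rather than a real XML parser, and the variable names are hypothetical.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"


def is_nfc(text):
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# What the serializer writes: the entity reference is just ASCII text.
serialized = "e&foo;"

# What a consumer sees after expanding &foo; to a combining acute
# (simulated here with str.replace, an assumption for illustration).
expanded = serialized.replace("&foo;", COMBINING_ACUTE)

print(is_nfc(serialized))  # the serialized characters themselves are NFC
print(is_nfc(expanded))    # the expanded text is not: NFC would compose it
```

The serialized stream passes an NFC check on its own characters, while the expanded text does not, which is the gap between checking the serializer's output and checking the result of expanding its includes.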

> with "fully-normalized" the output cannot start with a composing

yes, so long as by start here you mean start of each text node (or attribute
value) not just the start of the whole result tree?

If that isn't the case I see it raises
err:SERE0012

perhaps it would be helpful if the spec defined rather more explicitly than
"any relevant construct"
exactly which constructs are affected.
Presumably this is text nodes and attribute values.
Does it include each individual token in an attribute or text node that is schema
typed with a list type? (If the system knows the result schema)
I'm assuming (but haven't checked) that no NameStartChar are combining
characters (otherwise "relevant constructs" would need to include element
and attribute names as well)

David

Comment 6 Michael Kay 2005-12-13 14:34:29 UTC

> with "fully-normalized" the output cannot start with a composing

Yes: I should have said "no 'relevant construct' may start with a composing..."

The serialization spec defines "relevant construct" by reference to section 2.13
of XML 1.1, which defines it thus:

   1. The replacement text of all parsed entities
   2. All text matching, in context, one of the following productions:
         1. CData
         2. CharData
         3. content
         4. Name
         5. Nmtoken

I think we're only concerned with one parsed entity, namely the one we are
generating, and the rest is all perfectly well-defined, if somewhat onerous to
check.
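
To give a feel for what such a check might involve, here is a rough Python sketch. It is a simplification under stated assumptions: it parses the finished document with the standard library's ElementTree, looks only at element names, attribute names and character data, ignores namespaces, and approximates "composing character" by a non-zero canonical combining class. The function names are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET


def starts_with_combining(text):
    """Approximation (assumption): non-zero combining class means 'composing'."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def offending_constructs(xml_text):
    """Return (kind, value) pairs for constructs that begin with a combining char."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        # Element name, then the character data inside and after the element.
        candidates = [("Name", elem.tag),
                      ("CharData", elem.text),
                      ("CharData", elem.tail)]
        # Attribute names are Names as well.
        candidates += [("Name", name) for name in elem.attrib]
        for kind, value in candidates:
            if value and starts_with_combining(value):
                found.append((kind, value))
    return found


print(offending_constructs("<a>\u0301x<b>ok</b></a>"))
```

A real serializer would more plausibly run the test while writing each construct rather than re-parsing its own output; the re-parse here only keeps the sketch short.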

Comment 7 David Carlisle 2005-12-13 15:07:10 UTC

> by reference to section 2.13
grr, sorry about that: before posting I read that (but read the wrong bit of the
xml spec by mistake). I know it's W3C house style, but the practice of linking to
the reference list at the end of the current document rather than directly
linking to the correct section of an external document is _so_ annoying/confusing. 

> I think we're only concerned with one parsed entity,
as I say that depends just how far out of scope are references to parsed
entities generated by character maps. The note in section 9 warning that
character maps can result in non-wellformed documents could perhaps be extended
to say that they can also make the resulting document not conform to whatever
unicode normalisation form is specified.

As for 4 and 5 on the list, they are also a bit questionable, as list-typed
content (eg IDREFS) is a white space list of Name tokens, so arguably they each
need to be checked individually (although conversely, arguably the serialiser
doesn't know that the element or attribute is so typed...) I don't mind either
way, but I can't tell from the spec. I suspect that the answer is that the
serialisation doesn't need to check individual tokens in a list, but that a 
validating parser which chooses to warn of normalisation errors _will_ warn
when reading the resulting document if tokens are not in normal form.

David

Comment 8 Michael Kay 2005-12-14 16:29:36 UTC

Colin Adams asked me to enter the following comment on his behalf:

I DO think XInclude is in the same category as C's #include.

But even if the WG disagree with this, there is still the question of #include
for <xsl:output method="text"/>.

As you say, we cannot expect to be able to understand all such constructs. But
the definition of fully-normalized seems to require this.

Which is why I think it cannot be implemented by an XSLT serializer.

Comment 9 Michael Kay 2005-12-14 16:39:37 UTC

Both <xi:include> and #include are things that operate at a higher semantic
level. We're concerned with plain XML or plain text; we've no idea what the XML
vocabulary is, or whether the text is supposed to be a C# program. We could
perhaps take the media-type into account, but that's upside-down in terms of a
layered architecture: it's the responsibility of a higher level of software to
look after constraints that apply at its own level.

I think the CharMod spec could perhaps have made it clearer that the concept of
an "include" is a relative one: it can only be defined in relation to the
knowledge of the document's syntax and semantics available at a particular level
of the system.

Moreover, when we create one file in a family of files that contains "include"
references to each other, our responsibility can only be to ensure that the file
we are generating is a legitimate component of a fully-normalized document; we
cannot take any responsibility for the other files.

Michael Kay
(speaking personally)

Comment 10 Colin Adams 2005-12-16 05:17:08 UTC

Then I think some extra wording would be appropriate, to explain the
responsibilities of the serializer when fully-normalized is specified.

Comment 11 Michael Kay 2006-01-27 19:30:29 UTC

I have been asked to propose clarifying text that will resolve this issue.
(Colin, please let us know whether you are happy with this text: though it's at
the moment only a proposal). 

In Serialization 5.1.8 we currently say:

It is a serialization error [err:SERE0012] if the value of the parameter is
fully-normalized and any relevant construct of the result begins with a
combining character. The serializer MUST signal the error. See Section 2.13 of
[XML11] for the definition of the relevant constructs of XML.

I propose that we change this to:

If the value of the parameter is <code>fully-normalized</code>, then no
<emph>relevant construct</emph> of the parsed entity created by the serializer
may start with a composing character. The term <emph>relevant construct</emph>
has the meaning defined in section 2.13 of [XML11]. If this condition is not
satisfied, a serialization error [err:SERE0012] MUST be signalled.

Note: specifying <code>fully-normalized</code> as the value of this parameter
does not guarantee that the XML document output by the serializer will in fact
be fully normalized as defined in [XML11]. This is because the serializer does
not check that the text is <code>include normalized</code>, which would involve
checking all external entities that it refers to (such as an external DTD).
Furthermore, the serializer does not check whether any character escape
generated using character maps represents a composing character.
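
The behaviour described by the proposed text might look roughly like the following Python sketch. It is an illustration only: the exception class and function name are hypothetical, the caller is assumed to supply the relevant constructs as strings, and "composing character" is approximated by a non-zero canonical combining class.

```python
import unicodedata


class SerializationError(Exception):
    """Hypothetical exception type carrying a serialization error code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def check_fully_normalized(constructs):
    """Signal SERE0012 if any construct starts with a combining character.

    Only the first character of each construct is inspected. Text reached
    through includes or character escapes is deliberately not examined,
    mirroring the limits stated in the note.
    """
    for text in constructs:
        if text and unicodedata.combining(text[0]) != 0:
            raise SerializationError(
                "SERE0012",
                f"relevant construct begins with U+{ord(text[0]):04X}",
            )


check_fully_normalized(["title", "e&foo;"])  # passes: the escape is not expanded
try:
    check_fully_normalized(["title", "\u0301x"])
except SerializationError as err:
    print(err)
```

The second construct in the passing call shows the caveat from the note: a reference produced by a character map is accepted even if it would expand to a composing character.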

Comment 12 Colin Adams 2006-01-27 19:35:20 UTC

I am quite content with the proposed change.

Comment 13 Joanne Tong 2006-02-01 15:10:51 UTC

The XSL and XQuery working groups accepted the proposed text in comment #11 at
the F2F meeting on Feb 1, 2006.

Thank you for raising this comment.

Joanne