1309 – white space in the DM

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: white space in the DM

Status:	CLOSED WONTFIX

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Data Model 1.0
Version:	Last Call drafts
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Norman Walsh
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2005-05-07 15:44 UTC by David Carlisle
Modified:	2005-10-03 12:34 UTC
CC List:	0 users

Description David Carlisle 2005-05-07 15:44:45 UTC

Some of the following issues have been raised on earlier drafts but it
seems safest to raise them again as last call issues in bugzilla. 



6.7.3 Construction [of text nodes] from an Infoset

says

   If the resulting Text Node consists entirely of white space and the
   Text Node occurs in Element content[XML], the content of the Text Node
   is the zero-length string.  

The reference to the Element Content XML production is inappropriate, as
the input to this procedure is an infoset rather than a literal XML
document. The [element content whitespace] infoset property is flagged
a few lines up as being optionally used, so this could say

   If the resulting Text Node consists entirely of characters with an
   [element content whitespace] property with value true, the content
   of the Text Node is the zero-length string.
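
As a minimal sketch of the proposed wording (Python is used only for illustration; the function name and the (character, flag) representation are hypothetical and not taken from any specification), the rule would depend solely on the per-character [element content whitespace] property:

```python
# Hypothetical model: each character information item is a (char, ecw) pair,
# where ecw is True, False, or None when the parser did not report the property.

def text_node_content(character_items):
    """Return the Text Node content under the proposed wording."""
    if all(ecw is True for _, ecw in character_items):
        # Every character is flagged as element content white space.
        return ""
    return "".join(ch for ch, _ in character_items)


reported = [(" ", True), ("\n", True)]
unreported = [(" ", None), ("\n", None)]

print(repr(text_node_content(reported)))    # ''
print(repr(text_node_content(unreported)))  # ' \n'
```

The same white space characters give different Text Node content depending only on whether the property was reported.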

This would make the document consistent. However, with either wording,
this clause introduces a very large incompatibility with XPath 1.

I think it would be better to drop this clause altogether. Systems
requiring white space nodes to be dropped can use the PSVI mapping
or a proprietary mapping to the data model, neither of which has any
XPath 1 compatibility implications.

Dropping white space from declared element content from schema
validated (PSVI) input makes sense and is something that could be
tested in a conformance test. Dropping white space from the infoset
mapping if [element content whitespace] is reported isn't really
testable, as non-validating parsers may or may not report this property
and don't need to document whether they do or they don't.

As it stands, this means that given
<!DOCTYPE x [
<!ELEMENT x (x*)>
]>
<x>
  <x/>
  <x/>
</x>

a simple XPath of /x/node()[2] is completely undefined: it may pick up
the first or the second empty x node.
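
The ambiguity can be sketched in Python with the standard xml.dom.minidom parser. This is only an illustration under stated assumptions: the helper names are hypothetical, the DTD is left out, and the stripping of element content white space is applied by hand through a flag instead of relying on what any particular parser reports.

```python
from xml.dom import minidom

# The DTD above declares x as element-only; it is omitted here because this
# sketch applies the stripping rule by hand rather than relying on the parser.
SOURCE = "<x>\n  <x/>\n  <x/>\n</x>"


def is_white_space_text(node):
    """True for a text node made up only of XML white space characters."""
    return node.nodeType == node.TEXT_NODE and node.data.strip(" \t\r\n") == ""


def second_child(xml_text, strip):
    """Describe what /x/node()[2] selects, with or without stripping."""
    root = minidom.parseString(xml_text).documentElement
    nodes = [n for n in root.childNodes if not (strip and is_white_space_text(n))]
    elements = [n for n in nodes if n.nodeType == n.ELEMENT_NODE]
    node = nodes[1]  # XPath positions are 1-based, so [2] is list index 1
    number = next(i for i, e in enumerate(elements, 1) if e is node)
    return "x element number %d" % number


print(second_child(SOURCE, strip=False))  # x element number 1
print(second_child(SOURCE, strip=True))   # x element number 2
```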

If this clause is kept, it should be highlighted here that it is
incompatible with XPath 1's data model, and the XPath (and XSLT)
Compatibility appendices should also mention this.





For the reverse mapping, 6.7.5 (and J7) states that all characters get
mapped to infoset items with [element content whitespace] of unknown.

The infoset has a constraint that all non-white-space characters have a
value of false for this property.
http://www.w3.org/TR/xml-infoset/#infoitem.character
says: "... It is always false for characters that are not white space."

So I think the mapping from the DM to the infoset should set this
property to false or to unknown depending on whether the character is
white space.
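
A minimal Python sketch of the suggested reverse mapping follows. The function names and the use of the strings "false" and "unknown" as property values are illustrative assumptions, not text from the Data Model or Infoset documents.

```python
# The four characters matched by the XML white space production S.
XML_WHITE_SPACE = " \t\r\n"


def element_content_whitespace(ch):
    """Suggested [element content whitespace] value for a single character."""
    return "unknown" if ch in XML_WHITE_SPACE else "false"


def to_character_items(text):
    """Map Text Node content to (character, property value) pairs."""
    return [(ch, element_content_whitespace(ch)) for ch in text]


print(to_character_items("a b"))
# [('a', 'false'), (' ', 'unknown'), ('b', 'false')]
```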

David

Comment 1 Paul Cotton 2005-05-08 14:54:11 UTC

Is it possible to identify when these comments were previously made so that we 
can figure out what the WG did about the original comments?  

In particular, were these comments made on the immediately previous DM Last 
Call document, and if so, can you point to the issue(s) in the Last Call issues 
list?
http://www.w3.org/2005/04/data-model-issues.html

/paulc

Comment 2 David Carlisle 2005-05-09 10:05:36 UTC

> Is it possible to identify when these comments were previously made 



This current report is strongly related to

http://www.w3.org/2005/04/data-model-issues.html#toc.qt-2003Dec0085-01


However, the text that I commented on there has been largely
removed or rewritten, so most of the text of that comment is no longer
relevant. The final comment lodged there is Norm's:

  White space is now significant in all cases except element-only content
  where it is not significant. The draft has been clarified to reflect this.

So the first of my comments in this bug report could be rephrased as
saying that:

a) the clarification added at this point isn't correct, as it
   uses the "Element Content" XML production, which isn't necessarily
   available in an infoset: even an infoset generated by parsing an
   XML document may not have the information. It will (or may) have the
   [element content whitespace] property reported, though, so this should
   be used as described in my message.

b) If this clause isn't removed, the negative impact it has on XPath
   compatibility should be documented.


The second of my comments that the reverse mapping should not always set
the infoset property [element content whitespace] to "unknown", as it
violates a constraint in the infoset spec, is new.

David

Comment 3 David Carlisle 2005-05-10 09:45:49 UTC

I just noticed that the first two points of this report are identical to the two
points in report 1303 from XML Core.

Comment 4 Norman Walsh 2005-06-03 15:55:35 UTC

The WG has discussed this comment:

  http://lists.w3.org/Archives/Member/w3c-xsl-query/2005May/0069.html

and declines to make any changes. Please let us know if you accept this
resolution.

Comment 5 David Carlisle 2005-06-09 11:57:57 UTC

> Please let us know if you accept this resolution.

I object to this resolution.

If a breaking change is going to be introduced between 1.0
and 2.0 then the least that could be done is document that fact.

Even if this is documented I object to the resolution.

It's not just incompatibility with XSLT 1 that must be documented: it's
incompatibility between XSLT 2 (and XQuery) systems, because whether or
not [element content whitespace] is reported by a
non-validating-parser-that-reads-a-dtd is entirely parser specific as
far as I can see, and parsers of this type have traditionally been the
ones most commonly used with XSLT.

If the resolution is to keep the currently specified behaviour, the
editorial change from referencing Element content[XML] to
referencing [element content whitespace] should be made, as Element
content[XML] is a property of an XML document (a sequence of Unicode
characters), not a property of an infoset, i.e. it's not a property of the
input to the mapping being defined.

I also don't understand the WG's intention to specify a mapping from the
XQuery data model that always sets [element content whitespace] to
unknown.  Do they disagree with the analysis that this produces an
infoset that violates a constraint specified in the infoset
recommendation?

David

Comment 6 David Carlisle 2005-09-21 16:41:56 UTC

Just an additional comment to confirm that these issues are not addressed in the
new drafts.

As an example of the incompatibility this introduces between XPath 1 and 2 (that
is still not documented in the incompatibilities section), consider

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output method="text"/>

<xsl:template match="x">
<xsl:copy-of select="node()[position() mod 2 = 0]"/>
</xsl:template>
 
</xsl:stylesheet>


applied to

<!DOCTYPE x [
<!ELEMENT x (y*)>
<!ELEMENT y (#PCDATA)>
]>
<x>
 <y>s</y>
 <y>kill </y>
 <y>ti</y>
 <y>me </y>
</x>



With a 1.0 processor one gets the output
skill time
whereas with a 2.0 processor as specified here one would get the output
kill me
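
The two results can be reproduced with a small Python simulation. It is a sketch under stated assumptions, not an XSLT processor: the helper names are hypothetical, the DTD is omitted, and the stripping of white space text nodes in element-only content is applied by hand through a flag.

```python
from xml.dom import minidom

# Same instance document as above, without the DTD; x is treated as
# element-only content when the strip flag is set.
SOURCE = "<x>\n <y>s</y>\n <y>kill </y>\n <y>ti</y>\n <y>me </y>\n</x>"


def is_white_space_text(node):
    """True for a text node made up only of XML white space characters."""
    return node.nodeType == node.TEXT_NODE and node.data.strip(" \t\r\n") == ""


def string_value(node):
    """Concatenate the text descendants of a node, as text output would."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(string_value(child) for child in node.childNodes)


def even_position_text(xml_text, strip):
    """Emulate node()[position() mod 2 = 0] on the children of x."""
    root = minidom.parseString(xml_text).documentElement
    nodes = [n for n in root.childNodes if not (strip and is_white_space_text(n))]
    # position() is 1-based, so even positions are list indexes 1, 3, 5, ...
    return "".join(string_value(n) for n in nodes[1::2])


print(repr(even_position_text(SOURCE, strip=False)))  # 'skill time '
print(repr(even_position_text(SOURCE, strip=True)))   # 'kill me '
```

The trailing space in each result comes from the content of the last selected y element.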


Having the same construct work _without error_ in version 2 but produce
radically different output to version 1 is something that should be avoided
if at all possible, and documented _very_ clearly if it is impossible to avoid
the incompatibility.

Comment 7 Norman Walsh 2005-09-27 12:02:00 UTC

The XSL and XML Query WGs have decided to make this an issue for which we
explicitly solicit CR feedback.

Comment 8 David Carlisle 2005-09-27 12:07:57 UTC

OK that sounds reasonable.

What's the protocol: close this report and open new reports commenting on the CR
draft, or keep this report open and add comments to it when the new draft appears?

Feel free to close this if that is appropriate.

Comment 9 Norman Walsh 2005-09-28 15:00:13 UTC

The XSL WG also agreed to add the following non-normative text to the XSLT
specification:

It occurred to me during the discussion today that we ought in section 4.4
(which discusses xsl:strip-space) to mention the statement in the data model
that whitespace text nodes in element-only content are stripped before
xsl:strip-space/preserve-space comes into play. The place for this seems to
be in the existing note at the end of the section which currently reads:

Note:
A source document is supplied as input to the XSLT processor in the form of
a tree conforming to the data model described in [Data Model]. Nothing in
this specification states that this tree must be built by parsing an XML
document; nor does it state that the application that constructs the tree is
required to treat whitespace in any particular way. The provisions in this
section relate only to whitespace text nodes that are present in the tree
supplied as input to the processor. In particular, the processor cannot
preserve whitespace text nodes unless they were actually present in the
supplied tree.

I propose to change this to:

Note:
In [Data Model], processes are described for constructing a tree (an
instance of the data model) from an Infoset or from a PSVI. Those processes
deal with whitespace according to their own rules, and the provisions in
this section apply to the resulting tree. In practice this means that
elements that are defined in a DTD or a Schema to contain element-only
content will have whitespace text nodes stripped, regardless of the
xsl:strip-space and xsl:preserve-space declarations in the stylesheet.

However, source trees are not necessarily constructed using those processes;
indeed, they are not necessarily constructed by parsing XML documents.
Nothing in the XSLT specification constrains how the source tree is
constructed, or what happens to whitespace during its construction. The
provisions in this section relate only to whitespace text nodes that are
present in the tree supplied as input to the XSLT processor. The XSLT
processor cannot preserve whitespace text nodes unless they were actually
present in the supplied tree.


I think we should also say something in the compatibility appendix. I'd
suggest a new section J.1.2 before the existing J.1.2:

J.1.2 Tree Construction: whitespace stripping

In both 1.0 and in 2.0, the XSLT specification places no constraints on the
way in which source trees are constructed. For XSLT 2.0, however, the [Data
Model] specification describes explicit processes for constructing a tree
from an Infoset or a PSVI, while also permitting other processes to be used.
The process described in [Data Model] has the effect of stripping whitespace
text nodes from elements declared to have element-only content. Although the
XSLT 1.0 specification did not preclude such behavior, it differs from the
way that most existing XSLT 1.0 implementations work. It is RECOMMENDED that
an XSLT 2.0 implementation wishing to provide maximum interoperability and
backwards compatibility should offer the user the option either to construct
source trees using the processes described in [Data Model], or alternatively
to retain or remove whitespace according to the common practice of previous
XSLT 1.0 implementations.

To write transformations that give the same result regardless of the
whitespace stripping applied during tree construction, stylesheet authors
can:

* use the xsl:strip-space declaration to remove whitespace text nodes from
elements having element-only content (this has no effect if the whitespace
has already been stripped)

* use instructions such as <xsl:apply-templates select="*"/> that cause only
the element children of the context node to be processed, and not its text
nodes. 
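
The second suggestion can be illustrated with a short Python sketch. It is an assumption-laden simulation and not XSLT: the helper names are hypothetical, and stripping during tree construction is emulated with a flag. Selecting only element children gives the same result whether or not white space text nodes were stripped.

```python
from xml.dom import minidom

SOURCE = "<x>\n <y>s</y>\n <y>kill </y>\n <y>ti</y>\n <y>me </y>\n</x>"


def is_white_space_text(node):
    """True for a text node made up only of XML white space characters."""
    return node.nodeType == node.TEXT_NODE and node.data.strip(" \t\r\n") == ""


def element_child_text(xml_text, strip):
    """Text of the element children only, like selecting * instead of node()."""
    root = minidom.parseString(xml_text).documentElement
    nodes = [n for n in root.childNodes if not (strip and is_white_space_text(n))]
    elements = [n for n in nodes if n.nodeType == n.ELEMENT_NODE]
    return [e.firstChild.data for e in elements]


# Both calls return ['s', 'kill ', 'ti', 'me '].
print(element_child_text(SOURCE, strip=False))
print(element_child_text(SOURCE, strip=True))
```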



I also spotted while reading section 4.4 that the following Note:

Note:
This implies that if an xml:space attribute is specified on a literal result
element, it will be included in the result.

is misplaced in section 4.4, since literal result elements do not occur in
source documents. I suggest we add it to the note at the end of 4.2,
rephrasing it to fit:

Note:
If an xml:space attribute is specified on a literal result element, it will
be copied to the result tree in the same way as any other attribute.

Comment 10 Richard Tobin for XML Core WG 2005-10-03 12:34:52 UTC

(In reply to comment #9)

> In both 1.0 and in 2.0, the XSLT specification places no constraints on the
> way in which source trees are constructed.

I dispute this assertion.  XSLT 1 refers to XPath for its data model, and XPath
describes the correspondence between its data model and an XML document.  In
particular, it says that "The children of an element node are the element nodes,
comment nodes, processing instruction nodes and text nodes for its content".
There is no doubt from the XML spec that the content of an element includes
whitespace characters regardless of the content model.

I therefore believe that when an XSLT 1 processor constructs a data model
from an XML document, it must not remove element content whitespace.

However, the XML Core WG is happy with your decision to solicit CR feedback
on this issue.