This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 2729 - Whitespace text nodes
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Data Model 1.0
Version: Candidate Recommendation
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Norman Walsh
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2006-01-19 14:04 UTC by David Carlisle
Modified: 2006-02-02 17:41 UTC

Description David Carlisle 2006-01-19 14:04:25 UTC
This is essentially a re-raising of bug #1309 which was explicitly
deferred for comment on the CR drafts.

I agree with the requirement to strip white space text nodes in trees
built from schema-validated input. This report just concerns the
default mapping from a non schema-validated infoset.

The requirement to strip white space text nodes from elements declared
in a DTD introduces a large incompatibility between XPath 1 and XPath 2.
This incompatibility is highlighted in the XSLT draft (J.1.1) but not in the
XPath draft. If no changes are made to the specification to remove the
incompatibility then similar wording to XSLT J.1.1 should be added to
XPath I.1, as otherwise the small list of edge cases in appendix I.1
gives a rather over-optimistic view of the compatibility between the
two versions.


However, perhaps even more important than the compatibility between
XPath 1 and XPath2, is compatibility between XPath2 (and XQuery)
systems. The current requirement makes such compatibility rather hard to
achieve.

Typically a system will document which XML parser it uses, or give the
user a choice of which to use, or give a choice of whether to use the
parser in non-validating or validating mode.

If a validating parser is used, the [element content whitespace]
property will be reported, so in this case all XPath2 (and XQuery)
systems will act in the same way (although in a way incompatible with
XPath1; this would be something I could "live with", in W3C working
group consensus-speak).

However traditionally the most common type of parser used with XSLT
(in particular) has been a non-validating-parser-which-reads-a-dtd
(as the structure of the XSLT language means that this type of parser
is more or less required to read the XSLT file, and typically the same
parser is used on input documents). For this kind of parser there is,
as far as I can tell, no specification at all that says whether
they should, or should not, report the [element content whitespace]
property on elements for which they have read a DTD declaration.
So typically a user will have no way of knowing whether or not white
space will be stripped and no way of changing the behaviour if it is
unwanted. Incompatibility with XPath1 is something that will hopefully
become less important over time, but incompatibility between
different XPath2/XQuery systems is something that should be avoided if
at all possible.

I offer 3 options

A. Do not change the specification.
   In this case, the XPath compatibility appendix should document the
   incompatibility.

B. Change the requirement to strip white space nodes so that it only
   applies to infosets constructed by a _validating_ XML parser. (DTD
   validated, so that if you validate with a DTD, the whitespace
   behaviour matches that of schema validation).

C. Remove the requirement to strip white space when building from an
   Infoset (keeping it in the case of building from a PSVI).
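
The three options can be read as three different stripping predicates. What follows is only an illustrative sketch in Python, not text from any of the specifications: the function name `should_strip`, its parameters, and the use of `None` for "not known" are all assumptions made for the example.

```python
from typing import Optional


def should_strip(option: str,
                 element_content_whitespace: Optional[bool],
                 built_by_validating_parser: Optional[bool]) -> bool:
    """Decide whether a whitespace-only text node is dropped (hypothetical helper).

    element_content_whitespace: the Infoset property as reported by the
        parser, or None if the parser did not report it.
    built_by_validating_parser: True or False if known, None if unknown.
    """
    reported = element_content_whitespace is True
    if option == "A":
        # Status quo: strip whenever the parser reports the property.
        return reported
    if option == "B":
        # Strip only when the infoset is known to come from a validating parser.
        return reported and built_by_validating_parser is True
    if option == "C":
        # Never strip when building from a plain Infoset.
        return False
    raise ValueError(f"unknown option: {option!r}")


# A non-validating parser that has read the DTD and reports the property:
for opt in "ABC":
    print(opt, should_strip(opt, True, False))
```

Under (A) the outcome depends on what a non-validating parser happens to report; under (B) and (C) it does not.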



The status quo (A) has the largest incompatibility with XPath 1 and
introduces similarly large incompatibilities between XQuery and XPath2
systems running on different XML parsers.

Taking either option (B) or (C) would cause all XPath2 and XQuery systems
to work the same way.

Option (C) is the most compatible with XPath1, and the one that I
personally prefer, but perhaps option (B) would be a useful compromise
position that should be considered.

David
Comment 1 Michael Kay 2006-01-19 15:51:18 UTC
I must admit that I thought the specs currently said [B]. Reading it more
carefully, I see that it is indeed possible that a parser doesn't do validation,
but does distinguish whitespace appearing in element content from whitespace
appearing in mixed content. Are there real parsers that do this, however? 

Looking at your option B, is there any way one can look at an InfoSet and
determine whether it was constructed by a validating parser? If not, [B] looks
like a lost cause (unless we abandon the pretence that we only look at the
infoset and have no idea how it was created).
Comment 2 David Carlisle 2006-01-19 15:59:40 UTC
(In reply to comment #1)
> Are there real parsers that do this, however? 

I don't know, and that's my concern. Even just using saxon, I have no idea what
happens with each of the parsers that can easily be used, short of testing each
case by trial and error.


> 
> Looking at your option B, is there any way one can look at an InfoSet and
> determine whether it was constructed by a validating parser? 

No, not as far as I can see. The nearest thing is [all declarations
processed] on the document item, which I suppose could be used, although that
doesn't really do the right thing here.

> If not, [B] looks
> like a lost cause (unless we abandon the pretence that we only look at the
> infoset and have no idea how it was created).

I would word (B) such that if you don't know how it was created, you don't
strip. Only strip if you know it was generated by a validating parser.


Comment 3 David Carlisle 2006-01-19 16:40:29 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > Are there real parsers that do this, however? 

Yes. The default parser used with saxon does this. I have slightly modified the
example given in bug #1309 to show this, making ws.xml invalid.

The default parser probably depends on the JVM which is:
$ java -version
java version "1.5.0_06"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode, sharing)



ws.xml
<!DOCTYPE x [<!ELEMENT x (z*)>]>
<x>
 <y>s</y>
 <y>kill </y>
 <y>ti</y>
 <y>me </y>
</x>


ws.xsl
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output method="text"/>

<xsl:template match="x">
<xsl:copy-of select="node()[position() mod 2 = 0]"/>
</xsl:template>
 
</xsl:stylesheet>



with XSLT1 you get

$ saxon ws.xml ws.xsl
skill time

with XSLT2 using a validating parser, you get validation errors but then (if you
carry on) a completely different result

$ saxon8 -v ws.xml ws.xsl
Warning: Running an XSLT 1.0 stylesheet with an XSLT 2.0 processor
Recoverable error on line 3 column 5 of file:/c:/tmp/ws.xml:
  SXXP0003: Error reported by XML parser: Element type "y" must be declared.
Recoverable error on line 4 column 5 of file:/c:/tmp/ws.xml:
  SXXP0003: Error reported by XML parser: Element type "y" must be declared.
Recoverable error on line 5 column 5 of file:/c:/tmp/ws.xml:
  SXXP0003: Error reported by XML parser: Element type "y" must be declared.
Recoverable error on line 6 column 5 of file:/c:/tmp/ws.xml:
  SXXP0003: Error reported by XML parser: Element type "y" must be declared.
Recoverable error on line 7 column 5 of file:/c:/tmp/ws.xml:
  SXXP0003: Error reported by XML parser: The content of element type "x" must
match "(z)*".
kill me



and using the same parser in _non_-validating mode you get no warning (about
validity or white space) but again a dramatically different result from that
obtained by XSLT1:

$ saxon8 ws.xml ws.xsl
Warning: Running an XSLT 1.0 stylesheet with an XSLT 2.0 processor
kill me
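
The difference between the two results comes down to which child nodes occupy the even positions. The following Python sketch reproduces the selection on the same ws.xml; it is not how saxon works internally. It uses the standard library's `xml.dom.minidom` (non-validating, keeps whitespace text nodes) and simulates stripping with a filter. The helper names are made up for the illustration.

```python
from xml.dom import minidom

WS_XML = """<!DOCTYPE x [<!ELEMENT x (z*)>]>
<x>
 <y>s</y>
 <y>kill </y>
 <y>ti</y>
 <y>me </y>
</x>"""


def text_of(node):
    """String value of a node: concatenation of its descendant text."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def is_whitespace_text(node):
    return node.nodeType == node.TEXT_NODE and node.data.strip() == ""


def even_position_text(children):
    """Mimic node()[position() mod 2 = 0]; XPath positions are 1-based."""
    return "".join(text_of(n) for i, n in enumerate(children, start=1)
                   if i % 2 == 0)


def run():
    root = minidom.parseString(WS_XML).documentElement
    kept = list(root.childNodes)                               # whitespace nodes retained
    stripped = [n for n in kept if not is_whitespace_text(n)]  # simulated stripping
    return even_position_text(kept), even_position_text(stripped)


if __name__ == "__main__":
    with_ws, without_ws = run()
    print(repr(with_ws))
    print(repr(without_ws))
```

With the whitespace nodes retained, x has nine children and the four y elements sit at positions 2, 4, 6 and 8; once they are stripped, only the second and fourth y elements are at even positions.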
Comment 4 Michael Rys 2006-01-19 17:29:03 UTC
Note that there are widely deployed non-validating parsers that strip 
whitespace-only text nodes as a user-option.
Comment 5 David Carlisle 2006-01-19 22:12:00 UTC
(In reply to comment #4)
> Note that there are widely deployed non-validating parsers that strip 
> whitespace-only text nodes as a user-option.

True, although the most widely deployed one that I know of (msxml) strips
whitespace nodes (if that option is chosen) whether or not the element is
declared, so as it stands that option is incompatible with all three of the
specification versions that I suggested, including the current status quo in the
data model draft.
As it says in the status section of the data model spec, the behaviour 
specified is "incompatible with current common practice".

David
Comment 6 Michael Rys 2006-01-19 23:31:18 UTC
It is not incompatible. It just does not follow the Infoset to Data model 
mapping outlined in the data model document. But the data model document 
allows other processes to generate the data model.
Comment 7 David Carlisle 2006-01-20 00:01:30 UTC
(In reply to comment #6)
> It is not incompatible. It just does not follow the Infoset to Data model 
> mapping outlined in the data model document. But the data model document 
> allows other processes to generate the data model.

Oh yes, I agree. As I think I mentioned in my pre-CR version of this thread,
nothing here stops any system building an XDM tree any way it likes from
whatever data sources it has. Perhaps "incompatible" wasn't the best word. What I
meant was that the behaviour of stripping all white space nodes, although implemented
in a very widely used system (I use it quite a bit :-), doesn't really help the
present discussion decide any course of action, as no system doing that is
implementing the infoset to XDM mapping outlined in the data model spec. If the
resulting tree meets the consistency constraints on an XDM tree (which I'm sure
it will) it's conformant behaviour but irrelevant to the discussion of this part
of the spec, surely.

David
Comment 8 Michael Kay 2006-02-02 17:10:25 UTC
We debated the issue yesterday and decided (reluctantly) that option A is the
best we can do. B doesn't work because it depends on knowing or influencing how
the infoset was constructed, and in our processing model we don't have access to
that information (though of course real products can attempt to do things this
way). Option C implies an incompatibility between DTD-based and schema-based
processing which many people felt would be just as troublesome as the other
incompatibilities mentioned.
Comment 9 David Carlisle 2006-02-02 17:41:31 UTC
I can't say I'm surprised by this decision given the reaction of the WGs
to previous reports on this subject (from me and xml core, at least). However I
think it's a pretty bad decision.

> Option C implies an incompatibility between DTD-based and schema-based
> processing 

Given that processing with or without schema is pretty much completely
incompatible (or, as the XSLT2 spec puts it more delicately, "This may lead to a
number of differences in behavior"), I am surprised that white space would be
considered an issue here.

As we are finding in the test suite, order by expressions (for example)
typically produce completely different orderings (numeric or textual)
depending on whether a schema was used.
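
A small illustration of that kind of difference, using Python's `sorted` as a stand-in for an order by expression. The values are invented for the example, and the assumption is that untyped values compare as strings while schema-typed integers compare numerically.

```python
values = ["10", "9", "100", "2"]

# No schema: the values are untyped and are ordered as text.
textual = sorted(values)

# With a schema that types them as integers: ordered numerically.
numeric = sorted(values, key=int)

print(textual)  # ['10', '100', '2', '9']
print(numeric)  # ['2', '9', '10', '100']
```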

If reviewing these specs were my day-job and I had time to carry on the argument,
I would certainly re-open this so that it is flagged as an issue at termination
of CR. As neither of those things is true, I am instead going to close this
report, which is why I'm taking this last opportunity to complain (if not
formally object) in this comment.


David