29479 – [XSLT30] Streaming and non-well-formed documents

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED WORKSFORME

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	XSLT 3.0
Version:	Candidate Recommendation
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2016-02-18 12:27 UTC by Abel Braaksma
Modified:	2016-04-14 16:58 UTC
CC List:	0 users

Description Abel Braaksma 2016-02-18 12:27:05 UTC

Martin Honnen brought this to my attention in a bug report on Exselt (ECS-12). He quoted a part of the spec:

"A streamed transformation that only accesses part of the input
document (for example, a header at the start of a document) is not
required to continue reading once the data it needs has been read.
This means that XML well-formedness or validity errors occurring in
the unread part of the input stream may go undetected."

and gave this example:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">
 
<xsl:param name="input-uri" as="xs:string" select="'test201602170101.xml'"/>

<xsl:param name="items-to-copy" as="xs:integer" select="4"/>
<xsl:variable name="children-to-copy" as="xs:integer" select="$items-to-copy + 1"/>

<xsl:mode streamable="yes"/>

<xsl:output indent="yes"/>

<xsl:template name="xsl:initial-template">
  <xsl:stream href="{$input-uri}">
    <xsl:apply-templates/>
  </xsl:stream>
</xsl:template>

<xsl:template match="/*">
  <xsl:copy>
    <xsl:iterate select="*">
      <xsl:copy-of select="."/>
      <xsl:if test="position() eq $children-to-copy">
        <xsl:break/>
      </xsl:if>
    </xsl:iterate>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

with the following input:

<root>
  <header>...</header>
  <item name="1">...</item>
  <item name="2">...</item>
  <item name="3">...</item>
  <item name="4">...</item>
  <item>
</root>

This input is deliberately not well-formed.

He ran the example with Saxon as well, which threw no error. My product threw a rather unclear internal error which is clearly a bug. 
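
The same effect can be reproduced outside XSLT. The following is a minimal Python sketch, assuming a SAX-style parser stands in for the streaming processor; the names `EnoughRead`, `FirstChildren` and `first_children` are hypothetical and not taken from Saxon or Exselt. It stops as soon as the requested number of children has been seen, so the unclosed `<item>` further down is never reported.

```python
import xml.sax

# The deliberately non-well-formed input from the report:
# the last <item> is never closed.
BROKEN_INPUT = b"""<root>
  <header>...</header>
  <item name="1">...</item>
  <item name="2">...</item>
  <item name="3">...</item>
  <item name="4">...</item>
  <item>
</root>
"""


class EnoughRead(Exception):
    """Hypothetical signal used to abandon parsing early."""


class FirstChildren(xml.sax.ContentHandler):
    """Collect the names of the first `limit` children of the root element."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.depth = 0
        self.names = []

    def startElement(self, name, attrs):
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        if self.depth == 1:  # a child of the root element has just closed
            self.names.append(name)
            if len(self.names) == self.limit:
                raise EnoughRead  # stop before the malformed tail is reached


def first_children(data, limit):
    """Return the names of the first `limit` root children, then stop reading."""
    handler = FirstChildren(limit)
    try:
        xml.sax.parseString(data, handler)
    except EnoughRead:
        pass  # we have what we need; the rest of the input is not inspected
    return handler.names
```

Calling `first_children(BROKEN_INPUT, 5)` returns the header and the four complete items without any error. Asking for more children than the document contains forces the parser on to the mismatched `</root>`, and an `xml.sax.SAXParseException` is raised instead.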

However, this shows a peculiar situation that may arise with non well-formed documents. I would challenge that in this case the error can be ignored, because the xsl:copy is shallow-copying the <root> element. To complete that copy it needs to read through to the end.

If the template were written differently, this error may not need to arise:

<xsl:template match="/*">
   <xsl:element name="{name()}">
      <xsl:iterate....>
   </xsl:element>
</xsl:template>

But even then, whether or not an error is raised will be entirely implementation dependent. 

I am wondering if we can make this more interoperable, for instance by requiring an option to at least read through to the end. This will not always be feasible, hence it must be a user option, but one that a processor *must* support.

Conversely, how much a processor looks ahead before it "breaks" further processing is implementation defined (recall that <xsl:break> is not a real break: it just skips over the next items, which doesn't mean that these items should not be processed). I wonder if we could be more prescriptive about where and when a processor is really allowed to skip further processing of a document.

The main use-case for adding the line above is for when a user is interested only in a certain leaf node, or existence of one, and further processing is not needed. The problem is: can we define when "further processing is not needed"?

Comment 1 Michael Kay 2016-02-18 14:23:18 UTC

>I would challenge that in this case the error can be ignored, because the xsl:copy is shallow-copying the <root> element. To complete that copy it needs to read through to the end.

Why? To shallow-copy an element you only need to know the name of the element. Shallow-copy doesn't (intrinsically) depend on the content of the element or on anything found in its end-tag.

>I am wondering if we can make this more interoperable. For instance by requiring an option to at least [read] through to the end.

I'm reluctant. It's an intrinsic property of functional languages that you don't read more input than is needed to compute the output. If the user wants a complete well-formedness or validity check then they can always add one to the processing pipeline.
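
One way such a pipeline step might look is sketched below in Python. This is an assumption about how a user could arrange it, not something prescribed by the spec or by any XSLT processor; the helper name `is_well_formed` is hypothetical. It parses the whole document with a handler that ignores every event, so the only observable outcome is whether the parser reached the end without a fatal error.

```python
import xml.sax


def is_well_formed(data):
    """Parse the complete document and report whether it is well-formed."""
    try:
        # The base ContentHandler ignores all events; only parse errors matter.
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True
```

For example, `is_well_formed(b"<foo><bar/></foo>")` succeeds, while a document whose root end tag is missing, such as `b"<foo><bar/>"`, is rejected once the end of the input is reached. The cost is a full extra pass over the input, which is exactly the work a streamed transformation with early exit avoids.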

Comment 2 Abel Braaksma 2016-02-19 17:44:04 UTC

> If the user wants a complete well-formedness or validity check then they can 
> always add one to the processing pipeline.

I'm not sure how this can be done, other than by forcing the processor to go over every node, which will hardly ever be needed. And even if no action is taken on a node (say, it is simply dismissed), there's still the issue of closing the root element: following your logic, it may never be necessary to find that close tag, which makes it hard to detect an error in a document like:

<foo>
  <bar />
  <bar />
<!-- missing end </foo> -->

> Why? To shallow-copy an element you only need to know the name of the element.

I agree in principle, but isn't it also true that XSLT instructions close over (if that's the term) their beginning and end tags?

My worry is mainly with the large amount of processor-dependent behavior in this area, as Martin Honnen's example shows. If one processor looks ahead one item, it may fail; if another does not look ahead, it may succeed, leading to processor-dependent behavior.

I'm not sure if there's an easy way to fix this, but if there is, I think it is worthwhile to have a go at it.

Comment 3 Michael Kay 2016-03-03 16:52:49 UTC

My preference is to do nothing. I think we're in the area of "quality of implementation" differences between products (including trade-offs between performance and usability).

Comment 4 Abel Braaksma 2016-04-14 16:58:15 UTC

As discussed at the telcon of 14 April 2016, the WG decided to do nothing. Closing the bug as WORKSFORME.