13747 – [XPath 3.0] Determinism of expressions returning constructed nodes

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 13747 - [XPath 3.0] Determinism of expressions returning constructed nodes

Summary: [XPath 3.0] Determinism of expressions returning constructed nodes

Status:	RESOLVED LATER

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	XPath 3.0 (show other bugs)
Version:	Working drafts
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Jonathan Robie
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2011-08-10 10:20 UTC by Michael Kay
Modified:	2014-10-08 21:21 UTC (History)
CC List:	1 user (show)

See Also:

Attachments

Description Michael Kay 2011-08-10 10:20:09 UTC

This bug is prompted by bug #13494 which raises issues against XSLT 2.0, but there are some more specific issues in the 3.0 specifications (XSLT and XQuery more than XPath) that need to be addressed.

I think XQuery 1.0 tries to ensure that expressions whose value is a newly constructed node construct a new node each time they are evaluated. This prevents optimizations such as loop-lifting and common-subexpression-refactoring, which rewrite multiple evaluations of such an expression with a single evaluation. Such a restriction on the optimizer is acceptable so long as it is possible to analyze statically where the restriction applies; it becomes a serious problem once we have more dynamic constructs (such as calls on function items) where the static analysis is not possible. As bug #13494 demonstrates, the introduction of generate-id() to XQuery also introduces a new set of challenges for the optimizer.

It's not clear what our rules are here. We've introduced the possibility of type annotations to say that function items are non-deterministic, but this is all implementation-defined territory. I suspect that the rules that require a strict interpretation of constructed node identity have actually disappeared with the formal semantics, and we are left with a folklore as to what the expected behaviour is, rather than a clear statement in the specs.

I'd like to suggest the possibility of abandoning the rule (that multiple evaluations must return distinct nodes) entirely. That is, make it implementation-dependent whether two evaluations of the same expression with the same operands and the same context should return the same constructed node or different constructed nodes. I find it hard to imagine a real application that would be adversely affected by this change, other than the kind of application discussed in bug #13494 that is deliberately taking advantage of the rule in order to construct non-deterministic functions.

The benefit of making this change would be to simplify the semantics of the language and make it easier to perform strong optimizations such as loop-lifting.

Comment 1 Vladimir Nesterovsky 2011-08-10 14:36:26 UTC

(In reply to comment #0)

> I'd like to suggest the possibility of abandoning the rule (that multiple
> evaluations must return distinct nodes) entirely.

1. This can have consequences that go beyond of "loop-lifting and
common-subexpression-refactoring".

The code like the following may potentially become undefined:

<!-- 
  repeatable big subtree can be respresented by optimizer,
  as a hidden function call: $impl$:subtree(). 
-->
<xsl:variable name="a">
  <a/> 
</xsl:variable>
<xsl:variable name="b">
  <a/>
</xsl:variable>

<xsl:sequence select="$a is $b"/>

As implementation might (or might not) refactor it the way that $a is the same node as $b.

2. Xslt spec already specifies which instruction produce new nodes, so there is no ambiguity present in each particular case.

3. Engine is allowed to optimize the code but not to violate rules defined by spec (see 2)

4. If engine is not certain if it's possible to perform an optimization then it should not do it. This addresses dynamic calls that cannot be verified statically.

Comment 2 Vladimir Nesterovsky 2011-08-10 14:47:54 UTC

(In reply to comment #1)
> (In reply to comment #0)
> > I'd like to suggest the possibility of abandoning the rule (that multiple
> > evaluations must return distinct nodes) entirely.
> 1. This can have consequences that go beyond of "loop-lifting and
> common-subexpression-refactoring".

another example is:

<xsl:function name="t:copy-of" as="element()">
  <xsl:param name="element" as="element()"/>

  <xsl:copy-of select="$element"/>
</xsl:function>

This MUST return a distinct nodes on each call.

Comment 3 Michael Kay 2011-08-10 15:10:39 UTC

>This [t:copy-of()] MUST return a distinct nodes on each call.

Can you show us a realistic application that fails if it doesn't? I.e., not an application whose sole purpose is to exploit the fact that it currently returns distinct nodes on each call?

Comment 4 Vladimir Nesterovsky 2011-08-11 06:40:52 UTC

(In reply to comment #3)
> >This [t:copy-of()] MUST return a distinct nodes on each call.
> Can you show us a realistic application that fails if it doesn't? I.e., not an
> application whose sole purpose is to exploit the fact that it currently returns
> distinct nodes on each call?

This sample is based on real application:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt/public"
  exclude-result-prefixes="xs t">

  <xsl:output indent="yes"/>

  <xsl:variable name="input">
    <data>
      <block parts="a b c d"/>
      <block parts="c b"/>
      <block parts="d b"/>
    </data>
  </xsl:variable>

  <xsl:template match="/" name="main">
    <xsl:variable name="result">
      <result>
        <xsl:apply-templates mode="t:process" select="$input"/>
      </result>
    </xsl:variable>

    <xsl:message select="$result"/>

    <xsl:sequence select="$result"/>
  </xsl:template>

  <xsl:template mode="t:process" match="*">
    <xsl:copy>
      <xsl:sequence select="@*"/>
      <xsl:apply-templates mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template mode="t:process" match="block">
    <block>
      <xsl:for-each select="tokenize(@parts, '\s+')">
        <xsl:sequence select="t:generate-part(.)"/>
      </xsl:for-each>
    </block>
  </xsl:template>

  <xsl:function name="t:generate-part" as="element()">
    <xsl:param name="id" as="xs:string"/>

    <xsl:variable name="statements" as="element()+"
      select="t:get-statements-by-id($id)"/>

    <label id="{generate-id($statements[1])}">
      <xsl:sequence select="$statements"/>
    </label>
  </xsl:function>

  <xsl:function name="t:get-statements-by-id" as="element()">
    <xsl:param name="id" as="xs:string"/>

    <xsl:variable name="handler" as="element()">
      <xsl:element name="{$id}"/>
    </xsl:variable>

    <xsl:apply-templates mode="t:call" select="$handler"/>
  </xsl:function>

  <xsl:template mode="t:call" match="a">
    <scope>a statements ...</scope>
  </xsl:template>

  <xsl:template mode="t:call" match="b">
    <scope>b statements ...</scope>
  </xsl:template>
  
  <xsl:template mode="t:call" match="c">
    <scope>c statements ...</scope>
  </xsl:template>
  
  <xsl:template mode="t:call" match="d">
    <scope>d statements ...</scope>
  </xsl:template>

</xsl:stylesheet>

Comment 5 Michael Kay 2011-08-11 10:32:50 UTC

Thanks. So the requirement here, if I have reverse-engineered it correctly, is to generate unique ID attributes on nodes in the result tree, and this is being achieved by (a) relying on the uniqueness of constructed nodes, and (b) calling generate-id() on those constructed nodes. In this particular example the IDs do not appear to be referenced from anywhere, but in another example I guess they could be.

There ought to be a way of generating unique identifiers in a functional programming language without relying on the "side-effect" of creating new nodes. The generally quoted reference in the FP literature seems to be 

Functional Pearl: On generating unique names
Lennart Augustssona, Mikael Rittria and Dan Syneka
Journal of Functional Programming, 1994 

I haven't read it, but will try to get hold of a copy.

Comment 6 Vladimir Nesterovsky 2011-08-11 11:45:14 UTC

(In reply to comment #5)
> ...the requirement here, if I have reverse-engineered it correctly, is
> to generate unique ID attributes on nodes in the result tree, and this is 
> being achieved by (a) relying on the uniqueness of constructed nodes, and 
> (b) calling generate-id() on those constructed nodes.

Yes, that's correct.

I want also to point out that t:generate-part() is "infected" with non-determenism, i.e. we cannot say that its result depends on arguments only.

Comment 7 Jonathan Robie 2011-08-12 16:37:41 UTC

(In reply to comment #5)
> Thanks. So the requirement here, if I have reverse-engineered it correctly, is
> to generate unique ID attributes on nodes in the result tree, and this is being
> achieved by (a) relying on the uniqueness of constructed nodes, and (b) calling
> generate-id() on those constructed nodes. In this particular example the IDs do
> not appear to be referenced from anywhere, but in another example I guess they
> could be.

A node constructor creates a new node. If you create two nodes, and update one of them, that doesn't affect the other one. The uniqueness of nodes matters in XQuery.

Clearly, it's easier to optimize if you don't need to ensure unique nodes. If a processor is able to determine that the identity of nodes doesn't affect the result, it can perform optimizations that discard this identity. But I do think the uniqueness of newly constructed nodes.

I think XQuery is clear that constructors create new nodes, e.g. 

> An element constructor creates an element node.

And in our data model, it's clear that nodes have identity:

> Each node has a unique identity.

Comparisons like "is", "<<", ">>" are affected by this identity.

Comment 8 Jonathan Robie 2011-09-06 16:23:58 UTC

You haven't responded to my Comment #7.

Can this Bug be closed against XPath 3.0? As I pointed out, we need unique IDs because XQuery allows nodes to be updated, and because some expressions rely on  "is", "<<", ">>", etc.

Comment 9 Jonathan Robie 2011-09-13 15:56:41 UTC

The Working Group agreed that default behavior should continue to require these nodes to be constructed with unique IDs.

We believe that this is the kind of thing implementations can do with annotations or declaration options, and it would be best to get implementation experience with this before standardizing.