This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 24317 - [xslt 3.0] Parallel splitting: the dynamic multi-coloured widgets problem
Summary: [xslt 3.0] Parallel splitting: the dynamic multi-coloured widgets problem
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: XSLT 3.0 (show other bugs)
Version: Last Call drafts
Hardware: PC All
: P2 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-01-17 09:02 UTC by Michael Kay
Modified: 2014-07-04 10:01 UTC (History)
2 users (show)

See Also:


Attachments

Description Michael Kay 2014-01-17 09:02:03 UTC
There is a streaming use case that we currently cannot handle, exemplified by the following stylesheet:

<xsl:for-each-group select="/*/widget" group-by="colour"
   bind-group="g" bind-grouping-key="k">
  <xsl:result-document href="{$k}.xml">
   <widgets colour="{$k}">
     <xsl:sequence select="$g"/>
   </widgets>
  </xsl:result-document>
</xsl:for-each-group>

It's easy to see that this is in principle capable of executing with bounded memory (or at worst, memory proportional to the number of colours encountered).

We can handle the case where the set of colours is known statically, by using xsl:fork, but I don't think we have a solution for the case where the set of colours is not known in advance.

One solution might simply be to deem xsl:for-each-group/@group-by to be streamable (grounded) provided that the body of the instruction is an xsl:result-document instruction.

Or of course, we might decide that this requirement is out of scope.
Comment 1 Michael Kay 2014-01-17 09:24:12 UTC
To avoid complications (multiple downward selections), I should modify the example either to do group-by="@colour", or to do select="/*/widget/copy-of()". This doesn't affect the underlying problem.
Comment 2 Michael Kay 2014-01-29 11:24:34 UTC
I'm a bit inclined to make xsl:for-each-group/@group-by streamable under the same conditions as xsl:for-each-group/@group-adjacent. This would be consistent with our approach to distinct-values().

Otherwise there is no convenient way to do something like

<xsl:for-each-group select="*" group-by="colour"
  bind-group="g" bind-grouping-key="k">
  <a colour="{$k}" count="{count($g)}"/>
</xsl:for-each-group>

which is streamable for all practical purposes (it needs memory proportional to the number of groups, which in many cases will be constant memory independent of the size of the population).

In essence this is the same as xsl:fork, except the number of "prongs" in the fork is not fixed statically by the stylesheet author, but instead is one-per-group based on the number of distinct grouping key values found in the data.
Comment 3 C. M. Sperberg-McQueen 2014-02-10 15:00:21 UTC
We discussed this at the Prague face to face.

The parallel with xsl:fork seems a strong one, but the requirement to buffer what might be a great deal of output worried some WG members.  

Pavel Labath suggested possibly an attribute on for-each-group to signal a request for streaming treatment; an analogy with 'treat as' for static typing was detected.

There was a good deal of sentiment for accepting the proposal made here, but we decided to sleep on it and come back to this bug before making a final resolution.
Comment 4 Abel Braaksma 2014-03-13 16:00:06 UTC
I have some trouble understanding the implications mentioned in this bug report without an example of input data. Is it possible to update with a (small) XML source document?
Comment 5 Michael Kay 2014-03-19 22:10:05 UTC
In response to comment 4, a typical use case is to take an input document of the form:

<widgets>
  <widget colour="red" affability="3.2"/>
  <widget colour="blue" affability="3.3"/>
  <widget colour="orange" affability="3.1"/>
  <widget colour="lilac" affability="3.5"/>
  <widget colour="red" affability="3.6"/>
  <widget colour="green" affability="3.1"/>
  <widget colour="green" affability="3.2"/>
  <widget colour="blue" affability="3.4"/>
</widgets>

where the set of colours is not known in advance, and produce output of the form

<out>
  <group colour="red" count="2" average-affability="3.4"/>
  <group colour="blue" count="2" average-affability="3.3"/>
  <group colour="orange" count="1" average-affability="3.1"/>
  <group colour="lilac" count="1" average-affability="3.5"/>
  <group colour="green" count="2" average-affability="3.15"/>
</out>

A particular variant of the problem is where the output of processing each group is a separate result document; in this case there is a streaming strategy that needs no buffering (except I/O buffer space proportional to the number of groups).
Comment 6 Michael Kay 2014-06-04 11:08:31 UTC
The WG asked me to make a concrete proposal to meet this requirement.

I propose to use the construct:

<xsl:fork>
  <xsl:for-each-group select="/*/widget" group-by="colour">
    <xsl:result-document href="{current-grouping-key()}.xml">
      <widgets colour="{current-grouping-key()}">
        <xsl:sequence select="current-group()"/>
      </widgets>
    </xsl:result-document>
  </xsl:for-each-group>
</xsl:fork>

for this use case. The content model of xsl:fork is changed to allow either its current content (xsl:sequence+), or a single xsl:for-each-group element; wrapping xsl:for-each-group inside xsl:fork acts as a signal that the user recognizes that buffering of output will be needed and accepts the consequences.

The rules for streamability of xsl:for-each-group (19.4.8.19) rules 2 and 3 become:

2. If there is a group-by attribute and the xsl:for-each-group instruction is not a child of xsl:fork, then roaming and free-ranging.
3. If there is a group-adjacent or group-by attribute that is not motionless, then roaming and free-ranging.

The rules for streamability of xsl:fork (19.4.8.20) change to add a rule 0:

0. If the content of the xsl:fork instruction comprises a single xsl:for-each-group instruction, then the posture and sweep of that instruction.
Comment 7 Michael Kay 2014-06-05 16:37:10 UTC
Proposal accepted, but with a possible addition to be investigated: allow xsl:for-each inside the xsl:fork as well as xsl:for-each-group.
Comment 8 Michael Kay 2014-06-06 09:32:28 UTC
I'm going to push back on the idea of allowing xsl:for-each as a child instruction. I don't think it achieves anything useful; I don't think the streamability rules would be any different for an xsl:for-each appearing as a child of xsl:fork than for any other xsl:for-each instruction.

I have applied the changes to allow xsl:for-each-group as a child of xsl:fork.
Comment 9 Abel Braaksma 2014-06-06 10:08:39 UTC
Upon reconsidering my suggestion during the latests telcon on xsl:for-each and xsl:apply-templates as a child of xsl:fork - while I do think it has _some_ merit in explicitly telling the processor "I want to do a dynamic fork" - I agree with your conclusion. 

So I'll pushback on my own suggestion as well ;). Let's not extend the syntax further unless absolutely necessary. If someone really wants to fork on each node, you can now use a workaround through xsl:for-each-group with a group-size of one.
Comment 10 Michael Kay 2014-07-03 15:43:00 UTC
Looking at the discussion, we decided that we have reached consensus and can mark this resolved.