29889 – [xslt30] Add clarifications on stylesheet invocation options

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: [xslt30] Add clarifications on stylesheet invocation options

Status:	RESOLVED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	XSLT 3.0
Version:	Candidate Recommendation
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2016-09-30 11:12 UTC by Michael Kay
Modified:	2017-01-12 23:27 UTC
CC List:	1 user

Description Michael Kay 2016-09-30 11:12:16 UTC

The WG discussed the suggestions for clarification of stylesheet invocation options posted here:

https://lists.w3.org/Archives/Public/public-xsl-wg/2016Sep/0001.html

and decided that many of these points were worthy of incorporation. This bug entry is raised to provide a reference for labelling the resulting changes.

Comment 1 Michael Kay 2016-09-30 11:54:59 UTC

<div2 id="streaming-non-xml" diff="add" at="T-bug29889"> <head>Streaming of non-XML data</head> 

<p>The facilities in this specification designed to enable large data sets to be processed in a streaming manner are oriented almost entirely to XML data. This does not mean that there is never a requirement to stream non-XML data, or that the Working Group has ignored this requirement; rather, the Working Group has concluded that for the most part, streaming of non-XML data can be achieved by implementations without the need for specific language features in XSLT.</p> 

<p>To make streamed processing of unparsed text files easier, the function <xfunction>unparsed-text-lines</xfunction> has been introduced. This is not only more convenient for stylesheet authors than reading the entire input using the <xfunction>unparsed-text</xfunction> function and then tokenizing the result; it is also easier for implementations to optimize, allowing each line of text to be discarded from memory after it has been processed.</p> 

<p>For all functions that access external data, including <function>document</function>, <xfunction>doc</xfunction>, <xfunction>collection</xfunction>, <xfunction>unparsed-text</xfunction>, <xfunction>unparsed-text-lines</xfunction>, and (in XPath 3.1) <xfunction spec="FO31">json-doc</xfunction>, the requirements on determinism can now be relaxed using <termref def="dt-implementation-defined"/> configuration options. This is significant because it means that when a transformation reads the same external resource more than once, it becomes legitimate for the contents of the resource to be different on different invocations, and this eliminates the need for the processor to cache the contents of the resource in memory.</p> 

<p>In the XDM data model, every value is a sequence, and (as with most functional programming languages), processing of sequences of items is pervasive throughout the XSLT and XPath languages and their function library. Good performance of a functional programming language often depends on sequence-based operations being pipelined, and being evaluated in a lazy fashion (that is, many operations process items in a sequence one at a time, in order; and many operations can deliver a result without processing the entire sequence). The semantics of XSLT and XPath permit pipelined and lazy evaluation (for example, the error handling semantics are carefully written to ensure this), but they do not require it: the details are left to implementations. Pipelined processing of a sequence is not the same thing as streamed processing of a tree, and where the XSLT specification talks of operations being "guaranteed streamable", this is always referring to processing of trees, not of sequences.</p> 

<p>The facilities for streaming of XML trees include operations such as <xfunction>copy-of</xfunction> and <xfunction>snapshot</xfunction>, which are able to take a sequence of streamed nodes as input and produce a sequence of in-memory (unstreamed) nodes as output. It is also possible to generate a sequence of strings or other atomic values through the process of atomization. The actual memory usage of a streamed XSLT application may depend significantly on whether the processing of the resulting sequence of in-memory nodes or atomic values is pipelined or not. The specification, however, has nothing to say on this matter: it is considered an area where implementors can exercise their discretion and ingenuity.</p> 

<p>Streaming of JSON input receives little attention in this specification. One can envisage an implementation of the <function>json-to-xml</function> function in which the XML delivered by the function consists of streamed nodes; but the Working Group has not researched the feasibility of such an implementation in any detail.</p> 

</div2>
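
As an illustration of the unparsed-text-lines point in the proposed text, here is a minimal sketch (not part of the proposed spec wording). The parameter name input-uri, the file name log.txt, and the search string ERROR are assumptions made for the example. Whether each line is actually discarded from memory after processing remains up to the implementation.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Assumed parameter: URI of a (possibly large) text file -->
  <xsl:param name="input-uri" as="xs:string" select="'log.txt'"/>

  <xsl:template name="xsl:initial-template">
    <!-- Each line is handled independently of the others, so a processor
         is free to read and release the lines one at a time -->
    <xsl:for-each select="unparsed-text-lines($input-uri)">
      <xsl:if test="contains(., 'ERROR')">
        <xsl:value-of select="., '&#10;'" separator=""/>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The equivalent formulation with tokenize(unparsed-text($input-uri), '\r?\n') asks for the whole file as a single string first, which is the pattern the proposed text describes as harder to optimize.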

Comment 2 Michael Kay 2016-09-30 14:01:19 UTC

The intro to comment #1 got lost through a cut-and-paste error. The text in comment #1 is proposed as a new subsection of Chapter 2 (Concepts) and addresses a couple of the points in Abel's suggestions, specifically points 5 and 6.

Comment 3 Abel Braaksma 2016-10-03 03:35:42 UTC

This actually looks like a very good addition and explanation of many of the rather fuzzy areas of the spec w.r.t. streaming. I find it very understandably written. I didn't even realize the json-to-xml issue, but it makes sense to mention it here.

Rather than going through a function, I think it would make more sense for implementations to be able to broaden the input that xsl:source-document allows. For instance, if it encounters JSON and there is an (implementation-defined?) option, say @source-format or @xml-from-type="json", it would be easier to apply streaming abilities.

Strictly speaking, the spec doesn't allow such extensions. Should we become more lenient here, so as to make it easier for the spec to evolve in the future? Or leave as is with implementations probably going to ignore the strictness requirements of extensions anyway?

Comment 4 Michael Kay 2016-10-07 21:15:10 UTC

I propose:

#1 and #4. In 19.10, change the sentence

If a construct is guaranteed-streamable then it must be processed using streaming.

to:

If a construct is guaranteed-streamable and the input is provided in streamable form, then the input must be processed using streaming.

with a Note: the requirement to process the input using streaming does not apply if the processor is able to determine that this would convey no benefit: for example, if the input is supplied as a tree in memory. However, this does not remove the requirement to verify that the relevant stylesheet constructs are guaranteed streamable.
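
For concreteness, a minimal sketch of a construct that should qualify as guaranteed-streamable under the rules as I read them. The file name transactions.xml and the element name transaction are assumptions made for the example.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template name="xsl:initial-template">
    <!-- streamable="yes" asks for the document to be processed in a
         streaming manner; the body makes a single downward pass -->
    <xsl:source-document streamable="yes" href="transactions.xml">
      <xsl:value-of select="count(/*/transaction)"/>
    </xsl:source-document>
  </xsl:template>

</xsl:stylesheet>
```

Here the input is identified by URI, so the processor controls how it is read. The revised sentence and Note matter more for the invocation cases, where the caller may hand over either a stream or a tree already in memory.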

#2. In 2.3.5, Function Call Invocation, add:

If the initial function is declared streamable, a streaming processor SHOULD allow the value of the first argument to be supplied in streamable form, and if it is supplied in this form, then it MUST be processed using streaming.
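
A minimal sketch of a function that could serve as the initial function under this rule. The namespace http://example.com/functions, the function name f:count-items, and the element name item are hypothetical; how the first argument is supplied in streamable form is left to the processor's invocation API and is not shown.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/functions"
    exclude-result-prefixes="xs f">

  <!-- visibility="public" so that the function can be named as the
       initial function; streamability="absorbing" declares that the
       first argument may be consumed in a single pass -->
  <xsl:function name="f:count-items" as="xs:integer"
                streamability="absorbing" visibility="public">
    <xsl:param name="doc" as="document-node()"/>
    <xsl:sequence select="count($doc/*/item)"/>
  </xsl:function>

</xsl:stylesheet>
```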

#3. In 2.3.3 Apply-templates invocation, replace the existing note:

If the initial mode is a streamable mode, then streaming will only be possible if nodes in the input sequence are supplied in a form that allows such processing: for example, as a reference to a stream of parsing events.

with:

If the initial mode is declared streamable, a streaming processor SHOULD allow some or all of the items in the initial match selection to be nodes supplied in streamable form, and any nodes that are supplied in this form MUST then be processed using streaming.
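
A minimal sketch of a stylesheet whose initial (unnamed) mode is declared streamable. The element names item and title are assumptions made for the example; the initial match selection would be supplied through the processor's invocation API, which is not shown.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Declares the unnamed mode streamable; nodes with no matching
       template rule are skipped, but their children are still visited -->
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

  <xsl:output method="text"/>

  <!-- The match pattern looks only at the node and its ancestors, and
       the body reads the matched subtree once -->
  <xsl:template match="item/title">
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

If the caller supplies an in-memory document node instead of a stream, the same stylesheet still applies; under the proposed wording only nodes supplied in streamable form carry the obligation to be streamed.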

#7. I'm not convinced anything needs saying here. Remember that everything is allowed unless we say it isn't - we don't have to list all the things that processors might choose to do.

#8. I find it difficult to see what we should say beyond the existing paragraph in 19.10:

For a non-streaming processor, the processor must evaluate the construct delivering the same results as if execution used streaming, but with no constraints on the evaluation strategy. (Processing may, of course, fail due to insufficient memory being available, or for other reasons.) A non-streaming processor is not required to assess whether constructs are guaranteed-streamable, or to apply restrictions such as the rules for where calls on the functions accumulator-before and accumulator-after may appear. However, a non-streaming processor must enforce the constraint implied by a use-accumulators attribute restricting which accumulators can be used with a particular document.

#9. (the table) I'll take another look at this. I'm a bit concerned at the risk that the table might say (or be perceived as saying) something different from the current prose.

Comment 5 Michael Kay 2017-01-12 23:27:15 UTC

The WG accepted the proposal in comment #4. (In fact, these changes had already been applied to the spec.)