2553 – [F+O] Stability of collection()

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 1.0
Version:	Candidate Recommendation
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ashok Malhotra
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2005-11-26 08:42 UTC by Michael Kay
Modified:	2006-11-16 18:49 UTC
CC List:	0 users

Description Michael Kay 2005-11-26 08:43:02 UTC

A while back I introduced an implementation of collection() in Saxon that uses a
URI to identify a set of XML documents in filestore, with the ability to do
pattern matching on the file names, recursively traverse the directory
structure, and so on. This has proved very popular with XSLT users in
particular: for example it allows you to build an index over a large set of
source documents. The problem is that it isn't stable: if you call collection()
again with the same URI, and files have been created or deleted, you will get a
different result the next time. I've tried various devices to get around this
problem, but the only conformant solution I can think of is to abandon using the
collection() function for this purpose and introduce a proprietary extension
function instead, which doesn't seem to be in anyone's interests.

As far as I can tell there are only two ways of making the collection() function
stable. One is to lock the stored collection against updates for the duration of
the query or transformation. This is only possible where you have exclusive
access to the data; it is not a practical solution for files in filestore. The
other approach is to take a snapshot of the entire collection. But that's
hideously expensive, given that the collection will usually be too big to fit in
memory, and that the chances are that 99% of the time it will only be read once,
often with each document being processed to completion before the next one is
examined.
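
As an illustration of the snapshot approach, here is a minimal Python sketch. It is not Saxon's implementation: the QueryContext class and its collection() method are hypothetical names, and the sketch freezes only the membership of the collection (the list of file names), not the document contents, which is where the real cost described above would arise.

```python
import os
import tempfile


class QueryContext:
    """Hypothetical per-query context that makes collection() stable."""

    def __init__(self):
        self._snapshots = {}  # directory -> tuple of file names frozen at first call

    def collection(self, directory):
        # First call lists the directory; later calls reuse the frozen list,
        # so files created or deleted mid-query do not change the result.
        if directory not in self._snapshots:
            names = sorted(n for n in os.listdir(directory) if n.endswith(".xml"))
            self._snapshots[directory] = tuple(names)
        return self._snapshots[directory]


def demo():
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "a.xml"), "w").close()
        ctx = QueryContext()
        first = ctx.collection(d)
        open(os.path.join(d, "b.xml"), "w").close()  # file created mid-query
        second = ctx.collection(d)                   # still the frozen list
        fresh = QueryContext().collection(d)         # a new query sees both files
        return first, second, fresh
```

Within one context the two calls agree even though a file appeared between them; only a new context observes the change.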

So I think there's a strong case for relaxing the requirement that collection()
should be stable. David Carlisle made an interesting suggestion: one could
define the semantics so that collection() is guaranteed to create new nodes
rather than return existing nodes. Since our processing model already allows a
function to create new nodes each time it is called, this shouldn't be problematic. 

Of course for XQuery scenarios involving a database that might be updated, one
does want a reference to the existing node, which suggests a need for two
options or modes.

Michael Kay

Comment 1 Colin Adams 2005-11-27 09:35:34 UTC

I implemented collection for a directory of files in a stable manner, by
keeping all documents in storage. Of course this does not scale well.

We could have two different functions, as Michael suggests - say collection and
stable-collection. Alternatively, we could have a form with three arguments, the
third being a boolean flag. In this case the stability could be computed at
runtime. I'm not sure though that there are any use cases for this.

Comment 2 David Carlisle 2005-12-01 22:27:54 UTC

Similar comments would apply to doc() and XSLT's document() functions.

Consider (to take a real example) processing the set of XQueryX files
in the XQuery test suite (as input documents, to do some kind of transformation,
or query-on-queries).

The files are (typically) on the filestore, so Michael's comments about the
non-availability of database-style locking mechanisms apply, but
you wouldn't normally want to load them via collection(), but rather with something like

<xsl:for-each select="descendant::qt:test-case">
...
  select="doc(concat(@FilePath,query/@name,'.xqx'))" ...


If you try doing this on a system that holds all documents in memory you run out
of memory pretty quickly (or have a much bigger machine than I have).

Saxon's discard-document extension function has proved invaluable in making
this type of operation, processing large numbers of smallish files, feasible.

Something in the standard that addresses this (it doesn't have to be exactly
discard-document) would be really useful, I think, otherwise I suspect that
discard-document is going to be 2.0's xx:node-set(), i.e. a must-have extension
that every implementer has to implement, and which causes confusion to beginners
who don't know why they need an extension anyway, and causes interoperability
problems for non-beginners (due to the usual extension function issues about
differences of behaviour in edge cases, different namespaces, etc.).

I know it's late in the process, but CR is about implementation experience, and
a worryingly large proportion of my XSLT 2.0 stylesheets already use this
extension.

David

Comment 3 Colin Adams 2005-12-03 09:16:26 UTC

Re. David's comment that similar considerations should apply to doc() and XSLT's
document(), and the reference to Saxon's discard-document() extension function.

This is fundamentally unsound, I believe.

If after a call to discard-document(), the stylesheet later encounters a
reference to nodes within that same document, then the document will have to be
re-parsed. Although it is not difficult to implement generate-id() in such a way
that its results are guaranteed to be the same before and after re-parsing a
document for a given node (determined by its numbering in document order), this
is not sufficient to guarantee node identity for the same generated id, as the
same URI may no longer refer to the same document contents (this is often the
case for HTTP URIs referring to dynamically generated documents).
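
A minimal Python sketch of this failure mode follows. The ids_by_document_order function and the "e1", "e2" id scheme are hypothetical stand-ins for a position-based generate-id(), and the two XML strings are invented to play the role of the same URI returning different content before and after a discard and re-parse.

```python
import xml.etree.ElementTree as ET


def ids_by_document_order(xml_text):
    """Assign a position-based id to each element, numbering in document order."""
    root = ET.fromstring(xml_text)
    # root.iter() visits the root and its descendants in document order.
    return {f"e{n}": elem for n, elem in enumerate(root.iter(), start=1)}


# First parse, then a re-parse after the resource behind the URI has changed.
before = ids_by_document_order("<doc><price>10</price></doc>")
after = ids_by_document_order("<doc><error>gone</error></doc>")

# The same generated id now names a different node.
print(before["e2"].tag, after["e2"].tag)
```

The ids are reproducible across parses, but an id held from the first parse no longer identifies the same node once the content has changed.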

Comment 4 Michael Kay 2005-12-03 17:24:52 UTC

David's comment that the same arguments apply to document()/doc() is true in
principle: there are use cases where you read a large number of documents using
doc(), and where you only access each document once, and where the "stability"
provision therefore gives you a lot of pain and no gain by locking all the
documents into memory.

Perhaps we can solve this as follows:

(a) we specify that doc() and collection() are stable by default (in SQL terms,
the default isolation level is SERIALIZABLE)
(b) we specify that implementations may provide an option to select a different
isolation level
(c) we specify that a call on doc() or collection() may fail if the
implementation cannot provide access to the requested resource with the
requested isolation level

This is anticipating a more comprehensive treatment of transactions and
isolation levels in a future version of the spec.
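
The three points (a) to (c) can be sketched in Python as below. This is only an illustration under assumptions: DocLoader, Isolation, StabilityError, the can_retain flag and fake_fetch are all hypothetical names, not part of any specification or of Saxon, and "stable" is modelled simply as retaining the first result per URI.

```python
from enum import Enum


class Isolation(Enum):
    STABLE = "stable"    # (a) default: repeated calls return the same document
    RELAXED = "relaxed"  # (b) opt-in: no guarantee across calls


class StabilityError(Exception):
    """Raised when the requested isolation level cannot be provided."""


class DocLoader:
    def __init__(self, fetch, isolation=Isolation.STABLE, can_retain=True):
        self._fetch = fetch            # callable: uri -> document
        self._isolation = isolation
        self._can_retain = can_retain  # whether documents may be held in memory
        self._retained = {}

    def doc(self, uri):
        if self._isolation is Isolation.RELAXED:
            return self._fetch(uri)    # re-read on every call
        if not self._can_retain:
            # (c) fail rather than silently return an unstable result
            raise StabilityError(f"cannot provide stable access to {uri}")
        if uri not in self._retained:
            self._retained[uri] = self._fetch(uri)
        return self._retained[uri]


calls = []


def fake_fetch(uri):
    # Stand-in for reading a resource whose content changes between reads.
    calls.append(uri)
    return {"uri": uri, "version": len(calls)}
```

With the default level, two calls for the same URI return the identical object; with Isolation.RELAXED each call re-fetches; with can_retain=False and the default level, doc() raises instead of degrading silently.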

Michael Kay

Comment 5 David Carlisle 2005-12-03 22:35:08 UTC

(In reply to comment #3)
> Re. David's comment that similar considerations should apply to doc() and XSLT's
> document(), and the reference to Saxon's discard-document() extension function.
> 
> This is fundamentally unsound, I believe.
> 

Yes but your comments re soundness apply equally to collection(), so I don't
think you were really disagreeing with my comment that doc() and collection()
could be considered equally.

I was going to answer (but I see Michael already made a similar suggestion)
that the solution may be along the lines of having a mode that drops the
requirement that the same nodes are returned if you call doc() twice with the
same URI.

There are use cases where the guaranteed stability is a good thing but it isn't
really an essential part of the language. Other functions that generate nodes do
not have this feature.  A function definition like

declare function x:f () {<x/>};

means that x:f() returns a new node each time, so it is not a pure function in
that sense. A mode in which doc() acted the same way, would be very useful I think.
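
The distinction can be modelled in Python, using object identity as a stand-in for node identity. This is a sketch only: Node, f, stable_doc and unstable_doc are hypothetical names, and memoizing by URI is just one assumed way of obtaining the stable behaviour.

```python
from functools import lru_cache


class Node:
    """Stand-in for an XML node; identity is Python object identity."""

    def __init__(self, name):
        self.name = name


def f():
    # Analogue of x:f(): constructs a fresh node on every call.
    return Node("x")


@lru_cache(maxsize=None)
def stable_doc(uri):
    # Analogue of a stable doc(): one node per URI while the cache lives.
    return Node(uri)


def unstable_doc(uri):
    # Analogue of the suggested mode: like f(), a new node each call.
    return Node(uri)


n1, n2 = f(), f()
s1, s2 = stable_doc("a.xml"), stable_doc("a.xml")
u1, u2 = unstable_doc("a.xml"), unstable_doc("a.xml")

print(n1 is n2)  # False: two distinct nodes
print(s1 is s2)  # True: the same node both times
print(u1 is u2)  # False: behaves like f()
```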

David

Comment 6 Colin Adams 2005-12-04 07:14:58 UTC

You are right that the lack of soundness applies to collection too.
I like Michael's suggestion that an implementation may provide non-default
isolation levels, and that an error may result in these cases. I'm going to
implement such a scheme starting today.

Comment 7 Michael Kay 2006-01-10 19:27:20 UTC

At the joint meeting on 10 Jan the change was accepted in principle and I was
actioned to supply detailed text. Here is the proposal:

1. In the XQuery book, section 2.4.4, delete the paragraph

"If one of the above functions is invoked repeatedly with arguments that resolve
to the same absolute URI during the processing of a single query, each
invocation must return the same node sequence. This rule applies also to
repeated invocations of fn:collection with zero arguments during the processing
of a single query."

(This information is currently redundant)

2. In F+O, 1.7, under the definition of the term "stable", add a note after the
first paragraph: "Note: in the case of fn:collection() and fn:doc(), the
requirement for stability may be relaxed: see the function definitions for details"

3. In F+O 15.5.6 fn:collection, delete the sentence "This function is ·stable·."
and replace it with the following paragraph:

By default, this function is stable. This means that repeated calls on the
function specifying the same URI will return the same result each time. However,
for performance reasons, implementations may provide a user option to evaluate
the function without a guarantee of stability. The manner in which any such
option is provided is implementation-defined. If the user has not selected such
an option, a call of the function must either return a stable result or must
raise an error.

4. In F+O 15.5.4, fn:doc, change "This function is stable" to "By default, this
function is stable". After the example explaining what this means, add the
paragraph:

However, for performance reasons, implementations may provide a user option to
evaluate the function without a guarantee of stability. The manner in which any
such option is provided is implementation-defined. If the user has not selected
such an option, a call of the function must either return a stable result or
must raise an error.

At the end of this section, add a fifth bullet:

* Implementations may provide user options that relax the requirement for the
function to return stable results.

5. In F+O 15.5.5, fn:doc-available, add a final sentence: "However, if
non-stable processing has been selected for the fn:doc function, this guarantee
is lost."

6. In XSLT 5.4.3 (Initializing the Dynamic Context) add after the second paragraph:

As specified in [F+O], implementations may provide user options that relax the
requirement for the doc and collection functions (and therefore, by implication,
the document function) to return stable results. By default, however, the
functions must be stable. The manner in which such user options are provided, if
at all, is implementation-defined.

Comment 8 Ashok Malhotra 2006-01-30 23:16:21 UTC

Fixed as per the wording suggested by Michael Kay and modified in the minutes of
the joint XSL/XQuery telcon Jan 24, 2006.