27498 – [ser 3.1]Unfailing serialization

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: [ser 3.1]Unfailing serialization

Status:	RESOLVED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Serialization 3.1
Version:	Last Call drafts
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	C. M. Sperberg-McQueen
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-12-03 15:42 UTC by Michael Kay
Modified:	2015-01-29 22:36 UTC
CC List:	5 users

See Also:	27880

Attachments

Description Michael Kay 2014-12-03 15:42:39 UTC

When a user enters a query at a command line or other "ad-hoc" interface, they expect to see an answer displayed. They don't expect to be told that the query succeeded, but the results can't be displayed because there was a serialization error.

This wasn't a problem in the past because in nearly all cases, if the query succeeded, then serialization would also succeed (one of the notable exceptions was a query that returns attribute nodes).

But with the introduction of maps and arrays, we're seeing lots of ad-hoc queries that produce serialization errors. In fact, we're moving towards displaying the results of queries using our own ad-hoc serialization that doesn't correspond to anything in the serialization spec. The logic is something like: display nodes using XML serialization, display maps and arrays using JSON serialization.

I feel uncomfortable that we are finding we need to do serialization in a way that isn't covered by the spec. Anyone else feel the same?

Especially as the item-separator property was apparently invented for this kind of use case.

I think there's a need for some kind of "adaptive" serialization where a sequence is serialized by choosing an appropriate serialization method for each item based on what kind of item it is, and then separating the items using item-separator.

Comment 1 Jonathan Robie 2014-12-09 17:31:28 UTC

I agree, it's really hard to debug queries if you can't see the output.

There's one major decision point:  if JSON contains XML nodes, do you want them to be placed in a string so that the output is valid JSON, or do you want to serialize the XML without a string?  Or does this depend on the serialization mode?

Comment 2 Jonathan Robie 2014-12-09 17:33:16 UTC

Incidentally, JSONiq can serialize mixtures of arrays, maps, and XML nodes, including maps or arrays that contain XML nodes.

Comment 3 Michael Kay 2014-12-11 09:49:07 UTC

As an outline, I propose the following:

1. A new serialization method, called perhaps "adaptive".

2. Sequence normalization is not carried out.

3. Items in the supplied sequence are serialized individually, as follows, with an occurrence of the chosen item-separator between successive items:

a. document, element, text, comment, and processing instruction nodes are serialized using the XML output method

b. attribute and namespace nodes are serialized as if they had a containing element: for example xsi:type="xs:integer", or xmlns:xs="http://..../". Note that this may result in output of QNames containing prefixes whose binding is not displayed.

c. atomic values are serialized by casting the value to a string.

d. maps and arrays are serialized using the JSON output method.

e. functions are serialized to the representation "function fn:name#A", where fn:name is the function name and A is the arity. If the function is anonymous, "fn:name" is replaced by the string "(anonymous)".

4. If any value cannot be output because doing so would cause a serialization error, the processor SHOULD attempt to recover by inserting an implementation-defined error indicator into the output, and serializing as much of the input as can be serialized without error.

5. If the output is sent to a destination that allows hyperlinks to be included in the generated text, then the serializer MAY include implementation-dependent hyperlinks to provide additional information, for example:

a) to allow the type of atomic values to be ascertained
b) to allow the namespace binding of prefixes to be ascertained
c) to provide further information about the cause of error indicators
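
The outline above can be sketched roughly as follows. This is only an illustration in Python, not anything from the spec or from an existing processor: the `Attribute` and `FunctionItem` classes, the use of Python dicts and lists for maps and arrays, the `[unserializable]` error indicator, and the newline default for the separator are all assumptions made for the example. Only element nodes are modelled, and Python's `json` module stands in for the JSON output method.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import quoteattr


@dataclass
class Attribute:
    """Hypothetical stand-in for a parentless attribute node."""
    name: str
    value: str


@dataclass
class FunctionItem:
    """Hypothetical stand-in for a function item; name is None if anonymous."""
    name: Optional[str]
    arity: int


def serialize_item(item) -> str:
    if isinstance(item, ET.Element):
        # 3a: nodes use the XML output method (approximated by ElementTree).
        return ET.tostring(item, encoding="unicode")
    if isinstance(item, Attribute):
        # 3b: attributes are shown as they would appear inside an element.
        return f"{item.name}={quoteattr(item.value)}"
    if isinstance(item, FunctionItem):
        # 3e: function name (or "(anonymous)") followed by the arity.
        return f"function {item.name or '(anonymous)'}#{item.arity}"
    if isinstance(item, (dict, list)):
        # 3d: maps and arrays use the JSON output method.
        return json.dumps(item)
    if isinstance(item, bool):
        return "true" if item else "false"
    # 3c: other atomic values are cast to a string.
    return str(item)


def serialize_adaptive(items, item_separator="\n",
                       error_indicator="[unserializable]") -> str:
    parts = []
    for item in items:  # 2: no sequence normalization, item by item
        try:
            parts.append(serialize_item(item))
        except (TypeError, ValueError):
            # 4: recover with an error indicator instead of failing.
            parts.append(error_indicator)
    return item_separator.join(parts)
```

In this sketch a map that contains a function item falls under rule 4, because the JSON step cannot represent it, so the error indicator is emitted for that item and the remaining items are still displayed.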

Comment 4 Christian Gruen 2014-12-17 10:29:50 UTC

I like the proposal! A few comments:

1. I would recommend separating items with a newline if no item-separator is specified. Experience tells me that if an item-separator is declared, "&#xa;" is the only value that is assigned to it anyway.

2. If maps or arrays have values with more than one item (example: [(1,2)]), the JSON output method will raise SERE0023. This is probably not desirable. We could

a) serialize sequences inside maps and arrays as arrays, or
b) suppress SERE0023 for the adaptive mode (then, we would need to clarify how/if the serialized items are separated from each other)

If we believe that it's not too confusing for the user, we could generally adopt a) for JSON serialization whenever the serialized sequence is not the top level sequence.

3. Similar to 2, we need to handle attributes or namespace nodes that occur as values in maps or arrays (example: [<a b='c'/>/@b]). We could serialize them in the same way as at the top level.

4. Do you have an example of "hyperlinks" (even if it's supposed to be implementation-dependent)?

Comment 5 Michael Kay 2015-01-01 11:24:07 UTC

I've just had a request from a Saxon user which suggests an additional requirement: they are interested in serializing the query result (an arbitrary XDM value) not for human consumption, but for transmission to a client application that can reconstruct the XDM value from its serialized form. This suggests additional requirements such as including the type of an atomic value, not just its string value.

Comment 6 Hans-Juergen Rennau 2015-01-02 22:04:07 UTC

(In reply to Michael Kay from comment #5)
> I've just had a request from a Saxon user which suggests an additional
> requirement: they are interested in serializing the query result (an
> arbitrary XDM value) not for human consumption, but for transmission to a
> client application that can reconstruct the XDM value from its serialized
> form. This suggests additional requirements such as including the type of an
> atomic value, not just its string value.

Incidentally, this additional requirement (as well as the need for XDM serialization in general) has been argued for over several years by David Lee, e.g. on the XQuery talk list and at two Balisage conferences, alas, without receiving any attention or response.

Comment 7 C. M. Sperberg-McQueen 2015-01-05 18:56:09 UTC

If we adopt round-trippability as a requirement (as implicitly suggested at least for arrays and maps in comment 5 and endorsed in comment 6), does the requirement also apply to XML data?

One story that would be simple to tell would be:  serialize it using a new serialization method, and then you will be able to reconstitute an isomorphic collection of XDM data from the serialization.  We seem to be missing a couple of things here:

1 a way to annotate XML nodes with type information that can be reliably reconstituted (as long as all the appropriate in-scope schema information is available) -- remember that revalidating with the in-scope schema starting at the root of each maximal XDM tree is not guaranteed to produce the same results;

2 a way to read the serialized data and re-type everything the same way.

It's not clear at first glance how best to add the type annotations required for reliable write + read round tripping for either JSON or XML, without getting in the way of non-XDM systems.

And it would be nice to be able to serialize the entire collection of data without loss; but that involves being able to handle parentless attributes and functions (and possibly other things I'm forgetting at the moment).  Or is there a plausible subset of XDM for which reliable write + read round-tripping can be easily defined and which will suffice for all imaginable purposes?  all rational imaginable purposes?  all rational purposes that don't involve meta-programming or other unusual or unnatural acts?  most rational purposes?  many purposes?  

I thank Hans-Jürgen Rennau for pointing to some earlier discussions that have not been raised as bugs or enhancement requests in Bugzilla; I'll have to refresh my memory to see if solutions have already been suggested for these problems.

Comment 8 Michael Kay 2015-01-05 23:14:30 UTC

I don't think it's difficult to define an XML representation of the full XDM model, but I doubt it would be very human-readable, so it's a very different objective from the original requirement of this thread.

Parsing that XML representation to reconstitute the XDM would not be possible using pure XSLT and XQuery programs because the only way we allow type annotations to be set is by using validate expressions. But we could define a magic function to do it.

Comment 9 Christian Gruen 2015-01-06 11:47:37 UTC

I agree that a serialization method that allows users to reconstruct original query results would be helpful. As the "adaptive" serialization method is probably not the best target for all that, I have just added a new bug entry for further discussion (Bug 27763).

Comment 10 David Lee 2015-01-06 20:55:13 UTC

A few years back I started a discussion on this and created a wiki with quite a few of these issues.

http://xml.calldei.com/XDMSerialize

Mike encouraged me to start with Use Cases ... and I definitely agree.
For example, a primary use case I have is "streaming" XDM producers, such as those producing log messages or long-lived sessions.
The Efficient XML group (while I was on it) had several real world 'customers' who needed this as well (but for efficient XML), and the solution is non-ideal. One example was for "Instant Message" applications. Each message is an XML element, but the entire stream is a long-lasting document because there is no standardized way of representing streams of XML documents ... The requirement is to get each message without over-reading the socket ... That may be an edge case, but consider typical "feeds" ... Twitter, Facebook, stocks, news, message queues.
Or simply log files ... how to parse a log file before it's "done" ...

This site is still up and discusses many of the use cases I considered. I put this on hold when I realized I didn't have a clean solution to item types like maps or functions, and that some use cases have contradictory requirements such as full fidelity vs minimal output.
An example: do you really need to expose node identity? Without it you can't reconstruct the XDM perfectly, but is that needed? For what cases?

It's good to see some renewed interest in this topic.

If we can't get XDM (of some sort ... ) in and out of our XDM tools using some format that has a reasonable chance of being recognized by another tool set ... that's a big barrier ... To me, the "human readable text output" is interesting but not that problematic, as any vendor can solve that differently ... (humans are tolerant of differences).

Comment 11 Christian Gruen 2015-01-20 12:54:22 UTC

One more hint regarding Comment 3:

> e. functions are serialized to the representation "function fn:name#A",
> where fn:name is the function name and A is the arity. If the function
> is anonymous, "fn:name" is replaced by the string "(anonymous)"

If "function name" is supposed to be the string representation of the function item's QName, we could illustrate this with some examples:

* exists#1 → function exists#1
* fn:exists#1 → function fn:exists#1
* Q{http://www.w3.org/2005/xpath-functions}exists#1 → function exists#1
* function($a) { $a } → function (anonymous)#1
* exists(?) → function (anonymous)#1
* exists#1(?) → function (anonymous)#1
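
To make the pattern in these examples concrete, here is a small Python sketch. It assumes a hypothetical `FunctionItem` model in which the name keeps whatever prefix was written in the query (none for an unprefixed or `Q{...}` name), and in which partial application always yields an anonymous function whose arity is the number of `?` placeholders. None of these names come from the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FunctionItem:
    """Hypothetical model of a function item for display purposes."""
    arity: int
    local_name: Optional[str] = None  # None models an anonymous function
    prefix: Optional[str] = None      # prefix as written, if there was one


def partial_apply(f: FunctionItem, placeholders: int) -> FunctionItem:
    # The result of a partial application has no name;
    # its arity is the number of "?" placeholders.
    return FunctionItem(arity=placeholders)


def function_repr(f: FunctionItem) -> str:
    if f.local_name is None:
        return f"function (anonymous)#{f.arity}"
    qname = f"{f.prefix}:{f.local_name}" if f.prefix else f.local_name
    return f"function {qname}#{f.arity}"
```

Under these assumptions, `exists(?)` and `exists#1(?)` both display as anonymous functions of arity 1, which matches the last two examples.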

Comment 12 Andrew Coleman 2015-01-21 10:42:30 UTC

Thanks Christian for the suggested examples.  The WG agreed to add these in as a non-normative note and mark this bug resolved/fixed.

Comment 13 Christian Gruen 2015-01-21 14:50:10 UTC

One more detail: The working draft states that

  An array item or a map item is serialized using the JSON output method
  described in 9 JSON Output Method, with the following amendments to the
  JSON serialization rules:

  * A sequence of length greater than one in the data model instance will be
    serialized using the Adaptive output method rather than raising a
    serialization error [err:SERE0023].

As a consequence, sequences with a single item will be serialized differently outside and inside maps/arrays:

* true#0  →  true#0
* [ true#0 ]  →  error/implementation-defined fallback
* [ (true#0, false#0) ]  →  [ true#0 false#0 ]
* map { 'x': true#0 }  →  error/implementation-defined fallback
* map { 'x': (true#0, false#0) }  →  { "x": true#0 false#0 }

My advice would be to simplify the first rule to:

  * Members of arrays and values of maps will be serialized using the
    Adaptive output method.

The serialized results would then be:

* true#0  →  true#0
* [ true#0 ]  →  [ true#0 ]
* [ (true#0, false#0) ]  →  [ true#0 false#0 ]
* map { 'x': true#0 }  →  { "x": true#0 }
* map { 'x': (true#0, false#0) }  →  { "x": true#0 false#0 }

Comment 14 Andrew Coleman 2015-01-21 22:54:16 UTC

I think you make a good point (comment 13), and I agree that we could make a better effort to handle the examples you have given. However, the wording that you have proposed does not directly align with any of the rules in the JSON output method. We could amend the first and second rules (starting with 'An array item...' and 'A map item...') to say that all members/values are serialized using the adaptive method as you have suggested, but the adaptive method would lose the ability to output chunks of JSON syntax for portions of the XDM that contain valid JSON data.

The primary design principle for the adaptive method was to be able to take XDM instances that contain a mixture of data from XML and JSON sources and output a serialized form that contains a mixture of XML syntax and JSON syntax, while at the same time trying to avoid throwing errors when other noise gets into the mix.

I think we could avoid throwing the errors you have observed by amending the 'catchall' rule in the JSON output method to say

* Any item in the data model instance of type not specified in the above list will be serialized using the Adaptive output method, rather than raising a serialization error [err:SERE0021].

This amendment would be in addition to all the other amendments currently in the working draft.

This would allow single functions in maps and arrays to be serialized without throwing an error.

We would need to convince ourselves that there aren't situations where we go into infinite recursion.

Comment 15 Christian Gruen 2015-01-22 08:30:47 UTC

Andrew, thanks for replying. I think it's a good idea to amend the last JSON rule as you proposed in Comment 14.

I'd like to correct the serialized results of my examples in Comment 13; they should have looked as follows:

* true#0  →  function true#0
* [ true#0 ]  →  [ function true#0 ]
* [ (true#0, false#0) ]  →  [ function true#0 function false#0 ]
* map { 'x': true#0 }  →  { "x": function true#0 }
* map { 'x': (true#0, false#0) }  →  { "x": function true#0 function false#0 }
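
The combined effect of the amended rules can be sketched in Python as follows. The data model is an assumption made for illustration: a tuple stands for a sequence, a list for an array, a dict with string keys for a map, and `FunctionItem` for a named function item. The spacing inside brackets and the single-space separator simply mirror the examples above and are not mandated by anything in this thread.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionItem:
    """Hypothetical stand-in for a named XDM function item."""
    name: str
    arity: int


def adaptive(value, separator=" "):
    """Adaptive method: a tuple models a sequence, anything else one item."""
    items = value if isinstance(value, tuple) else (value,)
    return separator.join(adaptive_item(item) for item in items)


def adaptive_item(item):
    if isinstance(item, FunctionItem):
        return f"function {item.name}#{item.arity}"
    if isinstance(item, (list, dict)):
        return json_method(item)  # maps and arrays go to the JSON rules
    if isinstance(item, bool):
        return "true" if item else "false"
    return str(item)  # other atomic values: cast to string


def json_method(item):
    if isinstance(item, list):
        return "[ " + ", ".join(json_member(m) for m in item) + " ]"
    if isinstance(item, dict):
        pairs = (f"{json.dumps(k)}: {json_member(v)}" for k, v in item.items())
        return "{ " + ", ".join(pairs) + " }"
    if isinstance(item, (str, int, float)):
        return json.dumps(item)
    # Amended catch-all: fall back to adaptive instead of raising SERE0021.
    return adaptive_item(item)


def json_member(value):
    if isinstance(value, tuple):
        if len(value) > 1:
            # Existing amendment: adaptive instead of raising SERE0023.
            return adaptive(value)
        if len(value) == 0:
            return "null"  # empty sequence under the JSON rules
        value = value[0]
    return json_method(value)
```

In this sketch the catch-all branch is only reached for items that are neither maps nor arrays, and `adaptive_item` never calls back into `json_method` for such items, so the fallback cannot loop. A real implementation would still need the check for infinite recursion mentioned in comment 14.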

Comment 16 Andrew Coleman 2015-01-29 22:36:24 UTC

The WG agreed to adopt the change proposed in comment 14 at the telecon on 2015-01-27.