This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 29831 - [FO31] fn:transform and serialization to string
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 3.1
Version: Candidate Recommendation
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2016-09-20 12:59 UTC by Abel Braaksma
Modified: 2016-12-16 19:55 UTC

Description Abel Braaksma 2016-09-20 12:59:27 UTC
When serializing to a string, what serialization-options are applicable? And in the case of a 1.0 processor, does disable-output-escaping come into play (provided a processor supports it)?

Consider:

<xsl:value-of select="'&lt;br>'" disable-output-escaping="yes" />

The output would be "<br>".

and: <xsl:value-of select="'&lt;br>'" disable-output-escaping="no" />

The output would be "&lt;br>"

Does the returned string contain '<', 'b', 'r', '>' ("<br>", and therefore not well-formed XML), or does it contain '&', 'l', 't', ';', 'b', 'r', '>' ("&lt;br>", and therefore well-formed XML)?
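
The difference between the two candidate strings can be checked mechanically. The following is only a sketch in Python's standard library, not an XSLT processor: the `<r>` wrapper element and the `is_well_formed` helper are hypothetical names introduced here so that the fragment can be parsed at all, and the escaping is the minimal form used in this report (only '&' and '<').

```python
import xml.etree.ElementTree as ET

text = "<br>"  # string value of the text node being written

# Minimal output escaping: only '&' and '<' are replaced.
escaped = text.replace("&", "&amp;").replace("<", "&lt;")
# With escaping disabled, the characters are emitted unchanged.
unescaped = text


def is_well_formed(fragment: str) -> bool:
    """Parse the fragment inside a hypothetical <r> wrapper element."""
    try:
        ET.fromstring(f"<r>{fragment}</r>")
        return True
    except ET.ParseError:
        return False


print(escaped, len(escaped), is_well_formed(escaped))
print(unescaped, len(unescaped), is_well_formed(unescaped))
```

The escaped form is seven characters long and parses as element content; the unescaped form is four characters long and does not, because `<br>` is read as an unclosed start tag.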

And with respect to other output options:
- if a non-UTF encoding is specified, do we return character references for the characters that encoding cannot represent?
- if character maps are specified, are they applied (leading, again, to potentially non-well-formed XML)?
- if the HTML output method is specified, do we return the string serialized as HTML, or a parsable XML string?
Comment 1 Michael Kay 2016-09-20 18:53:55 UTC
Looking at the spec, I think it is as clear as it needs to be.

Firstly, it's clear how you request a serialized result, and it's clear that serialization-params specified in the request take precedence over those specified in xsl:output.

There is a slight mismatch because we require the serialized result as a string. The serialization spec says:

Note:
Serialization is only defined in terms of encoding the result as a stream of octets. However, a serializer MAY provide an option that allows the encoding phase to be skipped, so that the result of serialization is a stream of Unicode characters. The effect of any such option is implementation-defined, and a serializer is not required to support such an option.


and we might perhaps refer to that note. If the serializer does not support serialization to a string, then the stream of octets can always be decoded as a string. It might be worth mentioning that there are two possible ways of doing this: either skip the encoding phase (in which case escaping of non-encodable characters probably doesn't happen), or decode the octet stream (in which case you're probably left with character references for unencodable characters).
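
The two routes can be illustrated outside any XSLT processor. This is a minimal sketch using Python's built-in codecs, on the assumption that US-ASCII is the requested encoding; a real serializer may choose a different form of character reference (hexadecimal rather than decimal, for example).

```python
text = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, not representable in US-ASCII

# Route 1: skip the encoding phase. The characters come back unchanged.
skipped = text

# Route 2: run the encoding phase, replacing unencodable characters with
# character references, then decode the octets back into a string.
octets = text.encode("ascii", errors="xmlcharrefreplace")
decoded = octets.decode("ascii")

print(len(skipped), skipped)
print(len(decoded), decoded)
```

Route 1 yields a one-character string; route 2 yields the seven-character string `&#1046;`, so the two results differ whenever the content contains characters outside the target encoding.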

Support for disable-output-escaping in XSLT has always been optional, and remains so for the fn:transform function. It's a deprecated feature and I don't think we need to say anything about it: if you use it, interoperability is not guaranteed.

I don't see any reason to restrict the output methods available. HTML and JSON make perfectly good sense, for example.
Comment 2 Abel Braaksma 2016-09-21 00:16:56 UTC
Re comment#1:

I think you are right. I now vaguely remember having seen that paragraph before, and perhaps even asking this question before (which in itself may be an indication that a Note could be helpful).

I used d-o-e as an example. I would have done better using character-maps as an example (they have similar semantics but are interoperable). 

You quote: "However, a serializer MAY provide an option that allows the encoding phase to be skipped".

That suggests that character maps (which are applied before encoding) do take effect. It also suggests that the output method takes effect (which, I agree, makes sense; otherwise one would use "raw" if only a tree were needed).

All in all: everything *except* (optionally) encoding takes place and the result is given back as a string (that is, a series of characters, not octets). If encoding takes place, you may get the string "&#x416;" (7 characters) instead of the string "Ж" (Cyrillic Zhe, 1 character).
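
The ordering described here can be sketched as a toy text-node serializer. This is an illustration under stated assumptions, not the algorithm from the Serialization specification: `serialize_text`, its parameters, and the sample character map are all hypothetical, and the sketch ignores the rules a real serializer applies when a character map's replacement text is itself unencodable.

```python
def serialize_text(text, char_map, encoding=None):
    """Toy serializer for one text node (hypothetical helper).

    Characters found in char_map are replaced and bypass escaping;
    all other characters get minimal XML escaping. If an encoding is
    given, the encoding phase runs and the octets are decoded again.
    """
    out = []
    for ch in text:
        if ch in char_map:
            out.append(char_map[ch])  # replacement is emitted as-is
        elif ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        else:
            out.append(ch)
    chars = "".join(out)
    if encoding is None:
        return chars  # encoding phase skipped
    # Encoding phase: unencodable characters become character references.
    return chars.encode(encoding, errors="xmlcharrefreplace").decode(encoding)


# Sample character map (an assumption for illustration only).
char_map = {"\u00ab": "<", "\u00bb": ">"}
sample = "a\u00abbr\u00bb\u0416"

print(serialize_text(sample, char_map))
print(serialize_text(sample, char_map, encoding="ascii"))
```

In both calls the character map has already produced an unescaped `<br>`; only the second call replaces the Zhe with a character reference, because only there does the encoding phase run.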
Comment 3 Tim Mills 2016-09-27 15:24:55 UTC
See section 4, "Phases of Serialization", in the serialization spec (the note under point 5).
Comment 4 Michael Kay 2016-09-27 20:03:45 UTC
The WG decided that there was scope here for editorial clarification, but no substantive error in the spec.

I have explained the point about serializing to a character string by reference to the fn:serialize function where the same considerations apply.