This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Bug 2459 - [Serialization] Phases of Character Expansion
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Serialization 1.0
Version: Candidate Recommendation
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Scott Boag
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-04 16:44 UTC by Michael Kay
Modified: 2006-02-16 15:56 UTC

See Also:


Attachments
Section 4 Phases of Serialization (14.83 KB, text/html)
2006-02-16 15:43 UTC, Joanne Tong

Description Michael Kay 2005-11-04 16:44:50 UTC
Serialization section 4, list item 2, contains the rule:

The substitution processes that apply are listed below, in priority order: a
character that is handled by one process in this list will be unaffected by
processes appearing later in the list, except that a character affected....

My question is, what does "handled" mean here? Does it mean the same as
"affected"?

Consider this example (a Saxon bug report today). The result tree contains

<a href="mailto:sales@backbase.com">sales@backbase.com</a>

and there is a character map that translates "@" to "(at)"

Saxon is doing the character mapping for the text node but not for the
attribute node, because the characters in the attribute node are all
"handled" by the URI escaping phase, even though they are unchanged by that
phase. Is this a correct interpretation? It isn't an interpretation that
makes much sense for this use case, and I can't really think of a use case
where it does make sense. So perhaps "handled" should be "affected", or even
"altered".
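
To make the two readings concrete, here is a minimal Python sketch. It is not Saxon's code; the function names are hypothetical, and the URI escaping phase is simplified to "percent-encode non-ASCII characters only", which is an assumption made for illustration.

```python
from urllib.parse import quote

CHAR_MAP = {"@": "(at)"}  # the character map from the example above


def uri_escape_char(ch: str) -> str:
    # Assumption: the URI escaping phase only changes non-ASCII characters.
    return ch if ord(ch) < 128 else quote(ch)


def serialize_uri_attribute(value: str, reading: str) -> str:
    """Serialize a URI-valued attribute under one reading of the rule.

    reading="handled":  every character passes through URI escaping and is
                        therefore exempt from the character map.
    reading="affected": only characters that URI escaping actually changed
                        are exempt from the character map.
    """
    out = []
    for ch in value:
        escaped = uri_escape_char(ch)
        if reading == "handled" or escaped != ch:
            out.append(escaped)
        else:
            out.append(CHAR_MAP.get(ch, ch))
    return "".join(out)


print(serialize_uri_attribute("mailto:sales@backbase.com", "handled"))
print(serialize_uri_attribute("mailto:sales@backbase.com", "affected"))
```

Under the "handled" reading the attribute comes out unchanged, which is the behaviour reported against Saxon; under the "affected" reading the "@" is mapped to "(at)" as it is in the text node.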

This leads me to question these rules from first principles. The rules have
become increasingly messy. Let's look at all the interactions between
phases: for reference these are

a   URI escaping
b   character mapping
c   Unicode normalization
d   CDATA sections
e   ampersand escaping

Looking at all possible pairs of phases, let's ask the question "should a
character that's changed by the first phase also be changed by the second?"

ab  - unlikely to affect practical use cases
ac  - makes no difference (we have recently added a new rule to normalize
      before URI escaping)
ad  - makes no difference
ae  - makes no difference
bc  - probably yes, though currently no
bd  - makes no difference (we have a special rule here that elements
      specified as cdata-section-elements are not affected by character
      mapping)
be  - definitely no
cd  - definitely yes (the exception to the general rule is already stated)
ce  - definitely yes (the exception to the general rule is already stated)
de  - definitely no

This is far from the blanket "no" that the general rule implies. I think we
could make the whole thing a lot simpler by inverting the general rule, so
that characters output by one phase act as input to the next, with stated
exceptions:

(1) an & or < character produced as a result of string replacement in a
character map is not ampersand-escaped in step (e)

(2) steps (d) and (e) are alternatives, rather than being sequential: a text
node is either processed by (d) or by (e).
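
As a rough illustration of the inverted rule for a text node, the following Python sketch covers only phases (b), (d) and (e) and the two stated exceptions; phases (a) and (c) are left out for brevity. The function name and parameters are hypothetical, and this is a sketch of the proposal as described here, not the wording later adopted in the specification.

```python
def serialize_text_node(text: str, char_map: dict, cdata: bool = False) -> str:
    """Sketch of the proposed phase ordering for a single text node."""
    if cdata:
        # Exception (2): (d) and (e) are alternatives, so a text node inside
        # a cdata-section-element is wrapped and never ampersand-escaped.
        # Such elements are also not affected by character mapping.
        # A literal "]]>" has to be split across two CDATA sections.
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

    out = []
    for ch in text:
        if ch in char_map:
            # Exception (1): a replacement string produced by the character
            # map is emitted as-is, even if it contains "&" or "<".
            out.append(char_map[ch])
        elif ch == "&":
            out.append("&amp;")  # phase (e)
        elif ch == "<":
            out.append("&lt;")   # phase (e)
        else:
            out.append(ch)
    return "".join(out)


print(serialize_text_node("a & b @ c", {"@": "<at/>"}))
print(serialize_text_node("x < y", {"@": "(at)"}, cdata=True))
```

The first call escapes the literal "&" but leaves the "<" introduced by the character map alone; the second shows that a CDATA text node bypasses both the character map and ampersand escaping.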

Michael Kay
Previously raised at
http://lists.w3.org/Archives/Member/w3c-xsl-query/2005Oct/0010.html
Comment 1 Joanne Tong 2006-02-16 15:43:20 UTC
Created attachment 407
Section 4 Phases of Serialization
Comment 2 Joanne Tong 2006-02-16 15:55:44 UTC
The XSL and XQuery working groups accepted the proposed revision to section 4
of the serialization specification (see the attachment in this bug report) on
Feb. 1, 2006. The character expansion phase has been rewritten so that the
sequence of processes to be applied is unambiguous and should produce a
reasonable result.