2459 – [Serialization] Phases of Character Expansion

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Serialization 1.0
Version:	Candidate Recommendation
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Scott Boag
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2005-11-04 16:44 UTC by Michael Kay
Modified:	2006-02-16 15:56 UTC
CC List:	0 users

See Also:

Attachments
Section 4 Phases of Serialization (14.83 KB, text/html) 2006-02-16 15:43 UTC, Joanne Tong

Description Michael Kay 2005-11-04 16:44:50 UTC

Serialization section 4, list item 2, contains the rule:

The substitution processes that apply are listed below, in priority order: a
character that is handled by one process in this list will be unaffected by
processes appearing later in the list, except that a character affected....

My question is, what does "handled" mean here? Does it mean the same as
"affected"?

Consider this example (a Saxon bug report today). The result tree contains

<a href="mailto:sales@backbase.com">sales@backbase.com</a>

and there is a character map that translates "@" to "(at)".

Saxon is doing the character mapping for the text node but not for the
attribute node, because the characters in the attribute node are all
"handled" by the URI escaping phase, even though they are unchanged by that
phase. Is this a correct interpretation? It isn't an interpretation that
makes much sense for this use case, and I can't really think of a use case
where it does make sense. So perhaps "handled" should be "affected", or even
"altered".
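
To make the two readings concrete, here is a minimal Python sketch. It is a toy model, not Saxon's code: the function names are hypothetical, and URI escaping is assumed to touch only non-ASCII characters, so every character in the example href passes through it unchanged.

```python
from urllib.parse import quote

# The character map from the report: "@" becomes "(at)".
CHAR_MAP = {"@": "(at)"}

def uri_escape(value):
    """Assumed model of URI escaping: percent-encode non-ASCII characters only."""
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in value)

def serialize_attr_handled(value):
    """Reading 1: every character of a URI attribute is "handled" by URI
    escaping, so character mapping never sees any of them."""
    return uri_escape(value)

def serialize_attr_altered(value):
    """Reading 2: only characters actually changed by URI escaping are
    exempt; unchanged characters still go through the character map."""
    out = []
    for ch in value:
        escaped = uri_escape(ch)
        if escaped != ch:
            out.append(escaped)               # altered by URI escaping: leave alone
        else:
            out.append(CHAR_MAP.get(ch, ch))  # untouched: character map applies
    return "".join(out)

href = "mailto:sales@backbase.com"
print(serialize_attr_handled(href))  # "@" survives, as in the Saxon report
print(serialize_attr_altered(href))  # "@" is mapped, matching the text node
```

Under the first reading the attribute keeps its "@" while the text node gets "(at)"; under the second, both are mapped consistently.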

This leads me to question these rules from first principles. The rules have
become increasingly messy. Let's look at all the interactions between
phases. For reference, these are:

a   URI escaping
b   character mapping
c   Unicode normalization
d   CDATA sections
e   ampersand escaping

Looking at all possible pairs of phases, let's ask the question "should a
character that's changed by the first phase also be changed by the second?"

ab  - unlikely to affect practical use cases
ac  - makes no difference (we have recently added a new rule to normalize
before URI escaping)
ad  - makes no difference
ae  - makes no difference
bc  - probably yes, though currently no
bd  - makes no difference (we have a special rule here that elements
specified as cdata-section-elements are not affected by character mapping)
be  - definitely no
cd  - definitely yes (the exception to the general rule is already stated)
ce  - definitely yes (the exception to the general rule is already stated)
de  - definitely no

This is far from the blanket "no" that the general rule implies. I think we
could make the whole thing a lot simpler by inverting the general rule, so
that characters output by one phase act as input to the next, with stated
exceptions:

(1) an & or < character produced as a result of string replacement in a
character map is not ampersand-escaped in step (e)

(2) steps (d) and (e) are alternatives, rather than being sequential: a text
node is either processed by (d) or by (e).
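
The inverted rule can be sketched as a small pipeline for a single text node. This is only an illustration under stated assumptions: the function names are hypothetical, URI escaping (a) is omitted because it applies to attributes, normalization is applied run by run rather than across the whole string (a simplification), and CDATA text skips character mapping per the existing special rule mentioned under "bd".

```python
import unicodedata

def map_chars(text, char_map):
    """Step (b): split text into (run, from_map) pairs so that later steps
    can tell mapped replacement strings from ordinary characters."""
    runs = []
    for ch in text:
        if ch in char_map:
            runs.append((char_map[ch], True))
        elif runs and not runs[-1][1]:
            runs[-1] = (runs[-1][0] + ch, False)  # extend the current unmapped run
        else:
            runs.append((ch, False))
    return runs

def escape_markup(s):
    """Step (e): escape '&' first, then '<' and '>'."""
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def serialize_text(text, char_map, cdata=False, form="NFC"):
    if cdata:
        # Exception (2): the node goes through (d) instead of (e).
        body = unicodedata.normalize(form, text)
        return "<![CDATA[" + body.replace("]]>", "]]]]><![CDATA[>") + "]]>"
    out = []
    for run, from_map in map_chars(text, char_map):
        run = unicodedata.normalize(form, run)  # (c) sees the output of (b)
        # Exception (1): strings produced by the character map are not escaped.
        out.append(run if from_map else escape_markup(run))
    return "".join(out)

print(serialize_text("a < b @ c", {"@": "<at/>"}))
print(serialize_text("x ]]> y", {}, cdata=True))
```

The only state carried between phases is the from_map flag, which is what exception (1) needs; everything else is plain output-feeds-input.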

Michael Kay
Previously raised at http://lists.w3.org/Archives/Member/w3c-xsl-query/2005Oct/0010.html

Comment 1 Joanne Tong 2006-02-16 15:43:20 UTC

Created attachment 407
Section 4 Phases of Serialization

Comment 2 Joanne Tong 2006-02-16 15:55:44 UTC

The XSL and XQuery working groups accepted the proposed revision to section 4
of the serialization specification (see the attachment in this bug report) on
Feb. 1, 2006. The character expansion phase has been rewritten so that the
sequence of processes to be applied is unambiguous and should produce a
reasonable result.