This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 3446 - [XQueryX] constructing attribute values with whitespace characters
Summary: [XQueryX] constructing attribute values with whitespace characters
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: XQueryX 1.0
Version: Candidate Recommendation
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2006-07-11 15:02 UTC by Andrew Eisenberg
Modified: 2007-02-25 23:55 UTC

Description Andrew Eisenberg 2006-07-11 15:02:03 UTC
I might want to construct an XQueryX instance that is equivalent to the following XQuery:

<elem attr="a&#x09;b"></elem>

I can try to do this with the following:

<xqx:module xmlns:xqx="http://www.w3.org/2005/XQueryX"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.w3.org/2005/XQueryX
                                http://www.w3.org/2005/XQueryX/xqueryx.xsd">
  <xqx:mainModule>
    <xqx:queryBody>
      <xqx:elementConstructor>
        <xqx:tagName>elem</xqx:tagName>
        <xqx:attributeList>
          <xqx:attributeConstructor>
            <xqx:attributeName>attr</xqx:attributeName>
            <xqx:attributeValue>a&#x09;b</xqx:attributeValue>
          </xqx:attributeConstructor>
        </xqx:attributeList>
      </xqx:elementConstructor>
    </xqx:queryBody>
  </xqx:mainModule>
</xqx:module>

The XQueryX stylesheet turns this into the following:

<elem attr="a	b"></elem>

This XQuery will not produce the same result as the XQuery that I desire.

When I use "a&amp;#x09;b" for the attribute value, then the stylesheet produces the following:

<elem attr="a&amp;#x09;b"></elem>

Again, not what I desire.

This issue applies to all of the whitespace characters, #x09;, #x0a;, and #x0d;.

Do we need additional markup in order to preserve char refs such as these? Something like the following:

<xqx:attributeValue>a<xqx:charRefHex>09</xqx:charRefHex>b</xqx:attributeValue>
Comment 1 Andrew Eisenberg 2006-07-11 15:28:28 UTC
To expand on this a bit further, this problem is also seen in the following XQuery:

"&#x0d;"

The "natural" XQueryX for this would be

<?xml version="1.0"?>
<xqx:module xmlns:xqx="http://www.w3.org/2005/XQueryX"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.w3.org/2005/XQueryX
                                http://www.w3.org/2005/XQueryX/xqueryx.xsd">
  <xqx:mainModule>
    <xqx:queryBody>
      <xqx:stringConstantExpr>
        <xqx:value>&#xd;</xqx:value>
      </xqx:stringConstantExpr>
    </xqx:queryBody>
  </xqx:mainModule>
</xqx:module>

The stylesheet transforms this XQueryX into the following:

""

Comment 2 David Carlisle 2006-07-11 15:50:40 UTC
(non WG response)

> Do we need additional markup in order to preserve char refs such as these?
> Something like the following:

No, I don't think so. I think that the stylesheet should quote the white
space while writing the attribute, as XQuery and XSLT serialisers do when
writing attributes in XML or HTML output (the XQuery text is written with
xsl:output method="text", which is why the "attribute" is not automatically
quoted here).

The stylesheet already has a template for quoting characters and applies it to &
and {; it just needs to do white space as well. (It would be easier in XSLT 2,
but it's not hard, just tedious, to do it in XSLT 1; see bug #1285.)

Alternatively, of course, xqx:attributeConstructor could be converted to a
computed attribute constructor.

If XQuery processes
attribute attr {"a
b"}

it does automatically quote the newline in the result, so in that case the
stylesheet would not have to do anything, although the error behaviour is
slightly different in that case (duplicate direct attribute constructors are
reported statically with a different error code than duplicate attributes
generated by a computed element constructor).

David
Comment 3 David Carlisle 2006-07-11 15:53:42 UTC
(In reply to comment #1)
> To expand on this a bit further, this problem is also seen in the following
> XQuery:
> 

When I process that XQueryX with the stylesheet (using Saxon) I don't get "" but rather
"<ctrlm>"
where <ctrlm> is a single #13 character, which is equivalent to the original query. 

David
Comment 4 Jim Melton 2006-07-29 02:30:10 UTC
Thanks for this report.  However, I must be missing something important. 

When I create the following XQuery (derived from the XQueries that appeared in this bug report):
   <x><elem attr="a&#x09;b"></elem><y/><elem attr="a	b"></elem></x>
(where the value of the attr attribute of the second elem element contains a literal ASCII horizontal tab) and run it through at least one XQuery processor (Stylus Studio 6 Enterprise Edition Release 3), the result is serialized as:
   <x><elem attr="a	b"/><y/><elem attr="a	b"/></x>
where the gap between the 'a' and 'b' is the control character with the value 9 (that is, in ASCII, a horizontal tab) in BOTH attributes! 

I'm afraid that I do not see the difference between the two. 

Therefore, I feel that I must dispute the claim that "[The second] XQuery will not produce the same result as the XQuery that I desire." To the best of my ability to tell, the two XQueries produce identical results. 

I infer, however, that you would prefer for the generated XQuery expression to include an attribute whose value is "a&#x09;b" (with the six visible characters between the 'a' and 'b'). 

Clearly, the first XQueryX attempt (<xqx:attributeValue>a&#x09;b</xqx:attributeValue>) cannot do what you wish, because the XML parser that transforms the XQueryX document into its internal form for transformation turns the '&#x09;' into a literal tab character. 
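
The parser behaviour described above can be reproduced with any conforming XML parser. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree as a stand-in for whatever parser sits in front of the XQueryX stylesheet (an assumption; the thread does not name that parser). The helper name attribute_value_text is hypothetical.

```python
import xml.etree.ElementTree as ET

XQX = "http://www.w3.org/2005/XQueryX"


def attribute_value_text(xqueryx_fragment: str) -> str:
    """Return the text a stylesheet would see inside xqx:attributeValue."""
    root = ET.fromstring(xqueryx_fragment)
    return root.text or ""


# First attempt from the report: a character reference in the XQueryX source.
char_ref = f'<xqx:attributeValue xmlns:xqx="{XQX}">a&#x09;b</xqx:attributeValue>'
# Second attempt: the ampersand itself is escaped.
escaped_amp = f'<xqx:attributeValue xmlns:xqx="{XQX}">a&amp;#x09;b</xqx:attributeValue>'

seen_char_ref = attribute_value_text(char_ref)
seen_escaped_amp = attribute_value_text(escaped_amp)

print(repr(seen_char_ref))     # the reference is already a literal tab
print(repr(seen_escaped_amp))  # the six characters &#x09; survive as text
```

In both cases the stylesheet never sees what the author typed, only the parsed characters, so any distinction has to be re-created when the XQuery text is written out.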

The second attempt (<xqx:attributeValue>a&amp;#x09;b</xqx:attributeValue>) should do what you wish, because the XML parser translates the '&amp;' into a literal ampersand.  However, the XQueryX stylesheet contains code that always transforms literal ampersands into '&amp;' to preserve them (consider the XQuery expression 'element a {This & that}')!  Therefore, this attempt does not give the desired results.

It would appear that the stylesheet should not, in fact, be transforming '&' (ampersand) characters into '&amp;' entity references, because there are no places in an XQuery or in an XQueryX document where a "naked" ampersand can occur, so there is no reason to "preserve" them. 

Right? 

The escaping of '&' to '&amp;' has been removed from the stylesheet, but all reviewers are urged to enter another response if there are situations in which an XQueryX document can contain legitimate ampersands that must be escaped. 
Comment 5 Michael Kay 2006-07-29 08:45:00 UTC
The two elements in 

<x><elem attr="a&#x09;b"></elem><y/><elem attr="a    b"></elem></x>

should not produce the same result. XQuery 3.7.1.1 clause 1 says:

"Each consecutive sequence of literal characters in the attribute content is treated as a string containing those characters. Attribute value normalization is then applied to normalize whitespace and expand character references and predefined entity references. An XQuery processor that supports XML 1.0 uses the rules for attribute value normalization in Section 3.3.3 of [XML 1.0]; "

and the relevant rule in XML 1.0 is 

* For a character reference, append the referenced character to the normalized value.

* For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.

Thus a character reference #x9 should result in a tab character, whereas a tab character should result in a single space (#x20) character.
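
The two XML 1.0 normalization rules quoted above can be observed directly. This is a small sketch in Python using the standard library's expat-based xml.etree.ElementTree; it illustrates the XML 1.0 behaviour that XQuery 3.7.1.1 refers to, not the behaviour of any particular XQuery processor. The helper name attr_chars is hypothetical.

```python
import xml.etree.ElementTree as ET


def attr_chars(xml_text: str) -> list[str]:
    """Parse one element and list the code points of its 'attr' attribute."""
    value = ET.fromstring(xml_text).get("attr")
    return [f"#x{ord(c):X}" for c in value]


# Character reference: the referenced character (tab) is appended as-is.
with_char_ref = attr_chars('<elem attr="a&#x09;b"/>')
# Literal tab in the source text: normalized to a single space.
with_literal_tab = attr_chars('<elem attr="a\tb"/>')

print(with_char_ref)
print(with_literal_tab)
```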
Comment 6 Michael Kay 2006-07-29 09:12:34 UTC
In fact, Saxon's output from this query is 

<x><elem attr="a&#x9;b"/><y/><elem attr="a b"/></x>

where the second attr contains a single space (x20) character. In the result tree, the first attribute contains the three characters (a, x9, b), while the second contains the three characters (a, x20, b). The serializer converts the tab character to a character reference to ensure that it round-trips correctly.
Comment 7 David Carlisle 2006-07-30 20:31:53 UTC
> The escaping of '&' to '&amp;' has been removed from the stylesheet,

> because there are no
> places in an XQuery or in an XQueryX document where a "naked" ampersand can
> occur, so there is no reason to "preserve" them. 

The stylesheet needs to quote &, otherwise an XQueryX string of
<xqx:value>AT&amp;T</xqx:value>
would be translated to the XQuery syntax error
"AT&T"

I believe the correct fix, as mentioned in comment #2 is to quote #9,10,13 as well as &.

Comment 8 David Carlisle 2006-07-31 13:09:43 UTC
In addition to quoting all white space in attribute values, the stylesheet needs to quote (just) #13 in strings; otherwise it interacts badly with the XQuery line-ending normalisation. Essentially, the rule needs to be that because the stylesheet is using xsl:output method="text", it needs to do "by hand" all the XML quoting that would be automatic if xsl:output method="xml" were used, as the XQuery white space rules are designed to mimic the XML ones.

<x>&#13;&#10;</x>/string-length(.)

should evaluate to 2. It should (I claim) be encoded in XQueryX as shown below,
but the stylesheet generates

<x>{"
"}</x>/string-length(.)
with literal #13 and #10 characters, which are merged into a single #10 when parsed, so this expression evaluates to 1.

If the stylesheet followed the rules of the serialisation spec's xml output, then #13 would be serialised as &#13; (or something equivalent) and it would then not be normalised to #10 on parsing.

David


<xqx:module xmlns:xqx="http://www.w3.org/2005/XQueryX">
   <xqx:mainModule>
      <xqx:queryBody>
         <xqx:pathExpr>
            <xqx:stepExpr>
               <xqx:filterExpr>
                  <xqx:elementConstructor>
                     <xqx:tagName>x</xqx:tagName>
                     <xqx:elementContent>
                        <xqx:stringConstantExpr>
                           <xqx:value>&#xD;&#xA;</xqx:value>
                        </xqx:stringConstantExpr>
                     </xqx:elementContent>
                  </xqx:elementConstructor>
               </xqx:filterExpr>
            </xqx:stepExpr>
            <xqx:stepExpr>
               <xqx:filterExpr>
                  <xqx:functionCallExpr>
                     <xqx:functionName>string-length</xqx:functionName>
                     <xqx:arguments>
                        <xqx:pathExpr>
                           <xqx:stepExpr>
                              <xqx:filterExpr>
                                 <xqx:contextItemExpr/>
                              </xqx:filterExpr>
                           </xqx:stepExpr>
                        </xqx:pathExpr>
                     </xqx:arguments>
                  </xqx:functionCallExpr>
               </xqx:filterExpr>
            </xqx:stepExpr>
         </xqx:pathExpr>
      </xqx:queryBody>
   </xqx:mainModule>
</xqx:module>
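
The line-ending effect described in this comment has a direct analogue in XML parsing, which the XQuery rules are said to mimic. The sketch below uses Python's standard xml.etree.ElementTree to show the XML side only; it does not run an XQuery processor, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Literal CR LF in the source text: the parser merges the pair into one LF.
literal = ET.fromstring("<x>\r\n</x>").text
# CR and LF written as character references: both characters survive.
quoted = ET.fromstring("<x>&#13;&#10;</x>").text

print(len(literal), repr(literal))
print(len(quoted), repr(quoted))
```

Writing #13 as a character reference is what keeps it out of reach of line-ending normalisation, which is the reason given above for quoting it in generated strings.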
Comment 9 Jim Melton 2006-08-01 00:52:55 UTC
Clearly, I should not be trying to resolve complex issues (which whitespace is for me) in the evening after a very long day.  Thanks to Mike and David both for persevering in the face of my, er, ill-advised "solution".

Now that I have worked on this while I was (more nearly) awake, I have done the following:

* Always: quoted #xD (CR), #x85 (NEL), and #x2028 (LINE SEPARATOR) per "XSLT 2.0 and XQuery 1.0 Serialization", Section 5 (XML Output Method), paragraph beginning "A consequence of this rule..."

* Additionally, in attribute nodes: quoted #xA (NL) and #x9 (TAB) per the same paragraph. 

I have *not* quoted any of the non-white space characters in the ranges #x1 through #x1F and #x7F through #x9F, even though that same paragraph says they should be represented as character references. 

Furthermore, although I have always quoted ampersands (by making them into character references) and doubled double-quotes, I have not quoted less-than/left-angle-bracket characters ("<").  QUESTION: Should I?
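
The quoting rules listed in this comment can be sketched as a single function. This is an illustration in Python, not the stylesheet's actual XSLT; the function name quote_for_xquery, the table names, and the exact spelling of each character reference (for example &#x26; for the ampersand) are assumptions.

```python
# Characters quoted everywhere, per the comment: ampersand, CR, NEL, LINE SEPARATOR.
ALWAYS_QUOTED = {
    "&": "&#x26;",
    "\r": "&#xD;",
    "\u0085": "&#x85;",
    "\u2028": "&#x2028;",
}
# Characters quoted only inside attribute values: NL and TAB.
ATTRIBUTE_ONLY = {"\n": "&#xA;", "\t": "&#x9;"}


def quote_for_xquery(text: str, in_attribute: bool = False) -> str:
    """Quote text so it survives being written out as XQuery source."""
    table = dict(ALWAYS_QUOTED)
    if in_attribute:
        table.update(ATTRIBUTE_ONLY)
    out = []
    for ch in text:
        if ch == '"':
            out.append('""')  # double-quotes are doubled, not referenced
        else:
            out.append(table.get(ch, ch))
    return "".join(out)


string_literal = '"' + quote_for_xquery("\r\n") + '"'
attribute = 'attr="' + quote_for_xquery("a\tb", in_attribute=True) + '"'
company = '"' + quote_for_xquery("AT&T") + '"'

print(string_literal)
print(attribute)
print(company)
```

Processing one character at a time means an ampersand introduced by an earlier replacement is never quoted a second time.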

I've run a number of tests and believe that this is now resolved.  Therefore, I am once again marking it FIXED and invite further testing, comment, etc. so it can eventually be marked CLOSED.

Thanks again!
Comment 10 Jim Melton 2006-08-01 00:55:42 UTC
One more note: Nobody needs to answer the question about "Should I [escape "<"]?" because bug #3474 makes it clear that that character must also be escaped. 

"Never mind"
Comment 11 Jim Melton 2007-02-25 23:55:29 UTC
Closing bug because commenter has not objected to the resolution posted and more than two weeks have passed.