<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>3550</bug_id>
          
          <creation_ts>2006-07-31 19:58:15 +0000</creation_ts>
          <short_desc>expected result for ns-queries-results-q5</short_desc>
          <delta_ts>2006-08-01 20:46:47 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>XML Query Test Suite</product>
          <component>XML Query Test Suite</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows XP</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Andrew Eisenberg">andrew.eisenberg</reporter>
          <assigned_to name="Carmelo Montanez">carmelo</assigned_to>
          <cc>jonathan.robie</cc>
          
          <qa_contact name="Mailing list for public feedback on specs from XSL and XML Query WGs">public-qt-comments</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>10898</commentid>
    <comment_count>0</comment_count>
    <who name="Andrew Eisenberg">andrew.eisenberg</who>
    <bug_when>2006-07-31 19:58:15 +0000</bug_when>
    <thetext>I&apos;m on shaky ground here, but I&apos;m going to question the expected result for ns-queries-results-q5.

The last &lt;remark&gt; element in the expected result is given as:

            &lt;remark xml:lang=&quot;de&quot;&gt; Columbia Records 12&quot; 33-1/3 rpm LP,
                #FC-38641, Stereo. Die Platte ist noch immer sauber
                und glÃÂ¤nzend und sieht ungespielt aus
                (NM Zustand). Das Cover hat leichte Abnutzungen an
                OberflÃÂ¤che und Ecken.
            &lt;/remark&gt;

The test case was taken from the Use Cases document, where this last &lt;remark&gt; element in the result appears as:

                &lt;remark xml:lang=&quot;de&quot;&gt;Columbia Records 12&quot; 33-1/3 rpm LP,
                #FC-38641, Stereo. Die Platte ist noch immer sauber
                und glänzend und sieht ungespielt aus
                (NM Zustand). Das Cover hat leichte Abnutzungen an
                Oberfläche und Ecken.&lt;/remark&gt;

I don&apos;t believe that &quot;ÃÂ¤&quot; is the correct UTF-8 encoding for &quot;ä&quot;.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10899</commentid>
    <comment_count>1</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2006-07-31 20:22:09 +0000</bug_when>
    <thetext>I think you have displayed the file (and copied text from it) using a text editor that wasn&apos;t configured to read UTF-8. If you look at the file in hex, I think you will see that the relevant bytes are xC3A4, which I think is the correct UTF-8 representation of lower-case-a-with-umlaut, codepoint E4. 

00E4 = 00000000 11100100 =&gt; 11000011 10100100 = C3A4 
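
The bit layout above can be checked with a few lines of Python. This is only a minimal sketch for illustration, not part of the test suite; the helper name utf8_two_byte is made up, and it handles only the two-byte range.

```python
def utf8_two_byte(codepoint):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if codepoint not in range(0x80, 0x800):
        raise ValueError("only the two-byte range U+0080..U+07FF is handled")
    lead = 0xC0 | (codepoint // 64)   # 110xxxxx: top five of the eleven payload bits
    trail = 0x80 | (codepoint % 64)   # 10xxxxxx: low six payload bits
    return bytes([lead, trail])

encoded = utf8_two_byte(0xE4)
print(encoded.hex())                          # c3a4
print(encoded == "\u00e4".encode("utf-8"))    # True
```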

Michael Kay</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10924</commentid>
    <comment_count>2</comment_count>
    <who name="Andrew Eisenberg">andrew.eisenberg</who>
    <bug_when>2006-08-01 03:34:03 +0000</bug_when>
    <thetext>I still believe that the expected result is incorrect. I believe that it contains the byte sequence xC383C2A4 (in 2 locations) where Michael expects xC3A4.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10934</commentid>
    <comment_count>3</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2006-08-01 13:05:53 +0000</bug_when>
    <thetext>UltraEdit in hex mode shows the character as C3 A4, but when I read the file into a Java InputStream and display the bytes I do indeed get 

c3 83 c2 a4

That&apos;s actually the UTF-8 encoding of C3 A4 (with each of those two bytes treated as a separate character), and C3 A4 is itself the UTF-8 encoding of E4. So it&apos;s been doubly encoded into UTF-8. I got confused by UltraEdit: in hex mode it doesn&apos;t actually show the octets present in the file, it shows the UTF-16 characters after decoding from UTF-8.
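
The double encoding can be reproduced in a few lines of Python. This is a sketch for illustration only: it assumes the intermediate step read each byte as one character, which is what an ISO-8859-1 decode does, and the variable names are made up.

```python
original = "\u00e4"                       # lower-case a with umlaut, U+00E4
once = original.encode("utf-8")           # the correct bytes: c3 a4
misread = once.decode("iso-8859-1")       # each byte read as a separate character
twice = misread.encode("utf-8")           # each of those characters encoded again

print(once.hex(" "))     # c3 a4
print(twice.hex(" "))    # c3 83 c2 a4

# Reversing the two steps recovers the intended character.
repaired = twice.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repaired == original)   # True
```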

I&apos;m seeing the same byte sequence in the result file produced by Saxon, so I suspect this might be the cause of the problem. Perhaps I supplied a result file at some stage and this was incorporated into the distribution. I suspect this double-encoding is happening as a result of the way I do canonicalization - as it&apos;s done to both files it doesn&apos;t normally show up.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10936</commentid>
    <comment_count>4</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2006-08-01 13:37:25 +0000</bug_when>
    <thetext>On further investigation, the problem is with the source file auction.xml, which declares its encoding as iso-8859-1, but which is actually encoded in UTF-8. This leads to the two octets C3A4 being read as two separate characters, each of which is separately encoded into UTF-8 in the result file.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10940</commentid>
    <comment_count>5</comment_count>
    <who name="Carmelo Montanez">carmelo</who>
    <bug_when>2006-08-01 16:34:24 +0000</bug_when>
    <thetext>Mike:

Any suggestions on how to convert a UTF-8 file into iso-8859-1? Perhaps some kind of tool or editor?

Thanks,
Carmelo </thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10941</commentid>
    <comment_count>6</comment_count>
    <who name="David Carlisle">davidc</who>
    <bug_when>2006-08-01 16:51:22 +0000</bug_when>
    <thetext>(In reply to comment #5)
&gt; Mike:
&gt; 
&gt; Any suggestions on how to convert a UTF-8 file into iso-8859-1? Perhaps some kind of
&gt; tool or editor?
&gt; 
&gt; Thanks,
&gt; Carmelo 
&gt; 
In a test suite it would be better simply to change the XML declaration to UTF-8
and leave the characters in UTF-8 encoding. An XML parser is not obliged to be able to handle iso-8859-1.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10951</commentid>
    <comment_count>7</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2006-08-01 17:10:55 +0000</bug_when>
    <thetext>The simplest is just to change the XML declaration so that it declares the encoding correctly, i.e. change it to encoding=&quot;UTF-8&quot;.
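
As a rough illustration of that first option, the declaration could be patched in Python without touching the character data. This is a sketch only: the function name, the assumption that the declaration sits on the first line, and the exact spelling of the existing encoding value are all assumptions, and this is not the tool used for the test suite.

```python
def declare_utf8(data):
    """Rewrite a wrongly declared iso-8859-1 encoding to UTF-8.

    Only the first line (assumed to hold the XML declaration) is touched.
    The bytes must already be valid UTF-8, otherwise UnicodeDecodeError is raised.
    """
    data.decode("utf-8")                      # sanity check: content really is UTF-8
    head, sep, rest = data.partition(b"\n")
    for old in (b'encoding="iso-8859-1"', b'encoding="ISO-8859-1"'):
        head = head.replace(old, b'encoding="UTF-8"')
    return head + sep + rest

# Hypothetical in-memory sample, not the real auction.xml; the angle brackets
# of the declaration are left out to keep the sample plain.
sample = '?xml version="1.0" encoding="iso-8859-1"?\ngl\u00e4nzend'.encode("utf-8")
print(declare_utf8(sample))
```

The character data is left byte for byte as it was; only the label changes, so a parser that trusts the declaration now decodes C3 A4 as a single character.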

If you then want to change the encoding, try

java net.sf.saxon.Query -s auction.xml -o auction2.xml {.} !encoding=iso-8859-1</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10966</commentid>
    <comment_count>8</comment_count>
    <who name="Carmelo Montanez">carmelo</who>
    <bug_when>2006-08-01 19:51:08 +0000</bug_when>
<thetext>Thanks.  I think I got this right (at least one of my editors tells me so).
Changed encoding to UTF-8 and generated and submitted new results.

Thanks,
Carmelo</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>