This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 3550 - expected result for ns-queries-results-q5
Summary: expected result for ns-queries-results-q5
Status: CLOSED FIXED
Alias: None
Product: XML Query Test Suite
Classification: Unclassified
Component: XML Query Test Suite
Version: unspecified
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Carmelo Montanez
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2006-07-31 19:58 UTC by Andrew Eisenberg
Modified: 2006-08-01 20:46 UTC

Description Andrew Eisenberg 2006-07-31 19:58:15 UTC
I'm on shaky ground here, but I'm going to question the expected result for ns-queries-results-q5.

The last <remark> element in the expected result is given as:

            <remark xml:lang="de"> Columbia Records 12" 33-1/3 rpm LP,
                #FC-38641, Stereo. Die Platte ist noch immer sauber
                und glänzend und sieht ungespielt aus
                (NM Zustand). Das Cover hat leichte Abnutzungen an
                Oberfläche und Ecken.
            </remark>

The test case was taken from the Use Cases document, where this last <remark> element in the result appears as:

                <remark xml:lang="de">Columbia Records 12" 33-1/3 rpm LP,
                #FC-38641, Stereo. Die Platte ist noch immer sauber
                und glänzend und sieht ungespielt aus
                (NM Zustand). Das Cover hat leichte Abnutzungen an
                Oberfläche und Ecken.</remark>

I don't believe that "ä" is the correct UTF-8 encoding for "ä".
Comment 1 Michael Kay 2006-07-31 20:22:09 UTC
I think you have displayed the file (and copied text from it) using a text editor that wasn't configured to read UTF-8. If you look at the file in hex, I think you will see that the relevant bytes are xC3A4, which I think is the correct UTF-8 representation of lower-case-a-with-umlaut, codepoint E4. 

00E4 = 00000000 11100100 => 11000011 10100100 = C3A4 
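
The bit arithmetic above can be checked with a short sketch. The function name utf8_encode_two_byte is made up for illustration, and it only handles the two-byte range of UTF-8 (U+0080 to U+07FF); it is not a general encoder.

```python
def utf8_encode_two_byte(codepoint: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 octets (illustrative helper)."""
    if not 0x80 <= codepoint <= 0x7FF:
        raise ValueError("only the two-byte range U+0080..U+07FF is handled")
    lead = 0b11000000 | (codepoint >> 6)           # 110xxxxx carries the top 5 bits
    trail = 0b10000000 | (codepoint & 0b00111111)  # 10xxxxxx carries the low 6 bits
    return bytes([lead, trail])


encoded = utf8_encode_two_byte(0xE4)
print(encoded.hex())  # c3a4

# Cross-check against Python's built-in encoder.
builtin = "\u00e4".encode("utf-8")
print(builtin.hex())  # c3a4
```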

Michael Kay
Comment 2 Andrew Eisenberg 2006-08-01 03:34:03 UTC
I still believe that the expected result is incorrect. I believe that it contains the byte sequence xC383C2A4 (in 2 locations) where Michael expects xC3A4.
Comment 3 Michael Kay 2006-08-01 13:05:53 UTC
UltraEdit in hex mode shows the character as C3 A4, but when I read the file into a Java InputStream and display the bytes I do indeed get 

c3 83 c2 a4

That's actually the UTF-8 encoding of the two characters C3 and A4, which are in turn the two octets of the UTF-8 encoding of E4. So it's been doubly encoded into UTF-8. I got confused by UltraEdit: in hex mode it doesn't show the octets actually present in the file, it shows the UTF-16 characters after decoding from UTF-8.
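
As a rough illustration of peeling the two layers apart, the sketch below starts from the four octets reported above and decodes them twice. It is written in Python rather than the Java used for the original check, and the variable names are made up for the example.

```python
# The four octets actually present in the expected-result file.
raw = bytes.fromhex("c383c2a4")

# First decode: UTF-8 yields two characters, U+00C3 and U+00A4.
once = raw.decode("utf-8")
codepoints = [f"{ord(ch):04X}" for ch in once]
print(codepoints)  # ['00C3', '00A4']

# Those two code points are themselves the octets C3 A4, so map each
# character back to a single byte (ISO-8859-1 does exactly that) ...
inner = once.encode("iso-8859-1")
print(inner.hex())  # c3a4

# ... and decode a second time to reach the intended character.
char = inner.decode("utf-8")
print(f"{ord(char):04X}")  # 00E4
```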

I'm seeing the same byte sequence in the result file produced by Saxon, so I suspect this might be the cause of the problem. Perhaps I supplied a result file at some stage and this was incorporated into the distribution. I suspect this double-encoding is happening as a result of the way I do canonicalization - as it's done to both files it doesn't normally show up.
Comment 4 Michael Kay 2006-08-01 13:37:25 UTC
On further investigation, the problem is with the source file auction.xml, which declares its encoding as iso-8859-1, but which is actually encoded in UTF-8. This leads to the two octets C3A4 being read as two separate characters, each of which is separately encoded into UTF-8 in the result file.
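
A minimal sketch of that mechanism, assuming the reader honours the declared encoding and the result is serialized as UTF-8. The sample word is taken from the remark element; nothing here is read from the real auction.xml.

```python
# What the author of auction.xml intended, and the octets actually stored.
intended = "gl\u00e4nzend"
file_bytes = intended.encode("utf-8")       # ... 67 6c c3 a4 6e ...

# The declaration says iso-8859-1, so each octet is read as one character:
# C3 becomes U+00C3 and A4 becomes U+00A4.
misread = file_bytes.decode("iso-8859-1")
print(len(intended), len(misread))          # 8 9

# Serializing the result as UTF-8 then encodes each of those separately.
result_bytes = misread.encode("utf-8")
print(result_bytes.hex(" "))                # 67 6c c3 83 c2 a4 6e 7a 65 6e 64
```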
Comment 5 Carmelo Montanez 2006-08-01 16:34:24 UTC
Mike:

Any suggestions on how to convert a UTF-8 file into iso-8859-1? Perhaps some kind of tool or editor?

Thanks,
Carmelo 
Comment 6 David Carlisle 2006-08-01 16:51:22 UTC
(In reply to comment #5)
> Mike:
> 
> Any suggestions on how to convert a UTF-8 file into iso-8859-1? Perhaps some kind of
> tool or editor?
> 
> Thanks,
> Carmelo 
> 
In a test suite it would be better simply to change the XML declaration to UTF-8
and leave the characters in UTF-8 encoding. An XML parser is not obliged to be able to handle iso-8859-1.
Comment 7 Michael Kay 2006-08-01 17:10:55 UTC
The simplest is just to change the XML declaration so that it declares the encoding correctly, i.e. change it to encoding="UTF-8".

If you then want to change the encoding, try

java net.sf.saxon.Query -s auction.xml -o auction2.xml {.} !encoding=iso-8859-1
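
If Saxon is not to hand, a plain text-level transcoding is one possible alternative. The sketch below is not what the command above does internally; it is a hypothetical helper (transcode_to_latin1) that assumes the input really is UTF-8 and that any encoding declaration sits in the XML declaration at the start of the document. Characters outside ISO-8859-1 are written as numeric character references, which is only valid in text and attribute values, not in names, comments or CDATA sections.

```python
import re


def transcode_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 XML octets as ISO-8859-1 and fix the declaration (illustrative)."""
    text = data.decode("utf-8")
    # Rewrite the encoding pseudo-attribute so it matches the new octets.
    # If there is no encoding declaration, the text is left unchanged here,
    # and one would have to be added by hand.
    text = re.sub(
        r'(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
        r"\1\2iso-8859-1\2",
        text,
        count=1,
    )
    # Anything that does not fit in ISO-8859-1 becomes &#NNNN;.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


source = '<?xml version="1.0" encoding="UTF-8"?>\n<remark>gl\u00e4nzend</remark>'.encode("utf-8")
converted = transcode_to_latin1(source)
print(converted)
```

For real files the same function could be applied to the result of open("auction.xml", "rb").read() and written back in binary mode.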
Comment 8 Carmelo Montanez 2006-08-01 19:51:08 UTC
Thanks.  I think I got this right (at least one of my editors tells me so).
Changed encoding to UTF-8 and generated and submitted new results.

Thanks,
Carmelo