This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
I'm on shaky ground here, but I'm going to question the expected result for ns-queries-results-q5. The last <remark> element in the expected result is given as:

<remark xml:lang="de"> Columbia Records 12" 33-1/3 rpm LP, #FC-38641, Stereo. Die Platte ist noch immer sauber und glänzend und sieht ungespielt aus (NM Zustand). Das Cover hat leichte Abnutzungen an Oberfläche und Ecken. </remark>

The test case was taken from the Use Cases document, where this last <remark> element in the result appears as:

<remark xml:lang="de">Columbia Records 12" 33-1/3 rpm LP, #FC-38641, Stereo. Die Platte ist noch immer sauber und glänzend und sieht ungespielt aus (NM Zustand). Das Cover hat leichte Abnutzungen an Oberfläche und Ecken.</remark>

I don't believe that "ä" is the correct UTF-8 encoding for "ä".
I think you have displayed the file (and copied text from it) using a text editor that wasn't configured to read UTF-8. If you look at the file in hex, I think you will see that the relevant bytes are xC3A4, which I think is the correct UTF-8 representation of lower-case-a-with-umlaut, codepoint E4:

00E4 = 00000000 11100100 => 11000011 10100100 = C3A4

Michael Kay
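The bit arithmetic above can be checked directly with the standard library; this is a sketch (not part of the original thread, class name is illustrative) showing that encoding U+00E4 as UTF-8 does produce the bytes C3 A4:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static void main(String[] args) {
        // "\u00E4" is lower-case a with umlaut (ä)
        byte[] bytes = "\u00E4".getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            // mask to 0xFF so negative byte values print as unsigned hex
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());  // prints: C3 A4
    }
}
```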
I still believe that the expected result is incorrect. I believe that it contains the byte sequence xC383C2A4 (in 2 locations) where Michael expects xC3A4.
UltraEdit in hex mode shows the character as C3 A4, but when I read the file into a Java InputStream and display the bytes I do indeed get C3 83 C2 A4.

That's actually the UTF-8 encoding of the two-character sequence C3 A4, which is itself the UTF-8 encoding of E4. So the character has been doubly encoded into UTF-8. I was confused by UltraEdit: in hex mode it doesn't actually show the octets present in the file, it shows the UTF-16 characters after decoding from UTF-8.

I'm seeing the same byte sequence in the result file produced by Saxon, so I suspect this might be the cause of the problem. Perhaps I supplied a result file at some stage and this was incorporated into the distribution. I suspect this double encoding happens as a result of the way I do canonicalization; as it's done to both files, it doesn't normally show up.
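The double-encoding described above is easy to reproduce: encode ä once, mis-decode the resulting bytes as a single-byte charset (each byte becomes its own character), then encode again. A minimal sketch (not from the thread; the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncode {
    public static void main(String[] args) {
        // First encoding: U+00E4 (ä) -> bytes C3 A4
        byte[] once = "\u00E4".getBytes(StandardCharsets.UTF_8);

        // Mis-decode those bytes as ISO-8859-1: each byte becomes one
        // character, yielding the two characters U+00C3 and U+00A4
        String misread = new String(once, StandardCharsets.ISO_8859_1);

        // Second encoding: each of those characters needs two UTF-8 bytes
        byte[] twice = misread.getBytes(StandardCharsets.UTF_8);

        StringBuilder hex = new StringBuilder();
        for (byte b : twice) hex.append(String.format("%02X", b & 0xFF));
        System.out.println(hex);  // prints: C383C2A4
    }
}
```

This matches the observed sequence exactly: C3 -> C3 83 and A4 -> C2 A4.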
On further investigation, the problem is with the source file auction.xml, which declares its encoding as iso-8859-1 but is actually encoded in UTF-8. This leads to the two octets C3 A4 being read as two separate characters, each of which is then separately encoded into UTF-8 in the result file.
Mike:

Any suggestions on how to convert a UTF-8 file into iso-8859-1? Perhaps some kind of tool or editor?

Thanks,
Carmelo
(In reply to comment #5)
> Mike:
>
> Any suggestions on how to convert a UTF-8 into iso-8859-1. perhaps some kind of
> a tool or editor?
>
> Thanks,
> Carmelo

In a test suite it would be better simply to change the XML declaration to utf-8 and leave the characters in UTF-8 encoding. An XML parser is not obliged to be able to handle iso-8859-1.
The simplest is just to change the XML declaration so that it declares the encoding correctly, i.e. change it to encoding="UTF-8". If you then want to change the encoding, try:

java net.sf.saxon.Query -s auction.xml -o auction2.xml {.} !encoding=iso-8859-1
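If Saxon isn't to hand, the same transcoding can be sketched in plain Java: decode the file as UTF-8, fix the declaration's encoding label so it stays truthful, and re-encode as ISO-8859-1. This is a hedged illustration, not the mechanism the test suite used; the file names and the exact declaration string matched by replace() are assumptions:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Transcode {
    // Re-encode UTF-8 text as ISO-8859-1, updating the XML declaration's
    // encoding label so the file is not mislabelled again. Assumes the
    // declaration literally contains encoding="UTF-8".
    static byte[] transcode(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        text = text.replace("encoding=\"UTF-8\"", "encoding=\"iso-8859-1\"");
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // File names are illustrative
        byte[] in = Files.readAllBytes(Paths.get("auction.xml"));
        Files.write(Paths.get("auction2.xml"), transcode(in));
    }
}
```

Note this only works for documents whose characters all fit in ISO-8859-1; characters outside that repertoire would need character references instead.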
Thanks. I think I got this right (at least one of my editors tells me so). I changed the encoding declaration to UTF-8 and generated and submitted new results.

Thanks,
Carmelo