This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
I'm on shaky ground here, but I'm going to question the expected result for ns-queries-results-q5. The last <remark> element in the expected result is given as:

<remark xml:lang="de"> Columbia Records 12" 33-1/3 rpm LP, #FC-38641, Stereo. Die Platte ist noch immer sauber und glänzend und sieht ungespielt aus (NM Zustand). Das Cover hat leichte Abnutzungen an Oberfläche und Ecken. </remark>

The test case was taken from the Use Cases document, where this last <remark> element in the result appears as:

<remark xml:lang="de">Columbia Records 12" 33-1/3 rpm LP, #FC-38641, Stereo. Die Platte ist noch immer sauber und glänzend und sieht ungespielt aus (NM Zustand). Das Cover hat leichte Abnutzungen an Oberfläche und Ecken.</remark>

I don't believe that "ä" is the correct UTF-8 encoding for "ä".
I think you have displayed the file (and copied text from it) using a text editor that wasn't configured to read UTF-8. If you look at the file in hex, I think you will see that the relevant bytes are xC3A4, which I think is the correct UTF-8 representation of lower-case-a-with-umlaut, codepoint E4:

00E4 = 00000000 11100100 => 11000011 10100100 = C3A4

Michael Kay
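The bit arithmetic above can be checked directly with the standard library; this is a sketch (not part of the original thread, class name is illustrative) showing that encoding U+00E4 as UTF-8 does produce the bytes C3 A4:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static void main(String[] args) {
        // "\u00E4" is lower-case a with umlaut (ä)
        byte[] bytes = "\u00E4".getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            // mask to 0xFF so negative byte values print as unsigned hex
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());  // prints: C3 A4
    }
}
```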
I still believe that the expected result is incorrect. I believe that it contains the byte sequence xC383C2A4 (in 2 locations) where Michael expects xC3A4.
UltraEdit in hex mode shows the character as C3 A4, but when I read the file into a Java InputStream and display the bytes I do indeed get C3 83 C2 A4.

That's actually the UTF-8 encoding of the two-character sequence C3 A4, which is itself the UTF-8 encoding of E4. So the character has been doubly encoded into UTF-8. I was confused by UltraEdit: in hex mode it doesn't actually show the octets present in the file, it shows the UTF-16 characters after decoding from UTF-8.

I'm seeing the same byte sequence in the result file produced by Saxon, so I suspect this might be the cause of the problem. Perhaps I supplied a result file at some stage and this was incorporated into the distribution. I suspect this double encoding happens as a result of the way I do canonicalization; as it's done to both files, it doesn't normally show up.
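The double-encoding described above is easy to reproduce: encode ä once, mis-decode the resulting bytes as a single-byte charset (each byte becomes its own character), then encode again. A minimal sketch (not from the thread; the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncode {
    public static void main(String[] args) {
        // First encoding: U+00E4 (ä) -> bytes C3 A4
        byte[] once = "\u00E4".getBytes(StandardCharsets.UTF_8);

        // Mis-decode those bytes as ISO-8859-1: each byte becomes one
        // character, yielding the two characters U+00C3 and U+00A4
        String misread = new String(once, StandardCharsets.ISO_8859_1);

        // Second encoding: each of those characters needs two UTF-8 bytes
        byte[] twice = misread.getBytes(StandardCharsets.UTF_8);

        StringBuilder hex = new StringBuilder();
        for (byte b : twice) hex.append(String.format("%02X", b & 0xFF));
        System.out.println(hex);  // prints: C383C2A4
    }
}
```

This matches the observed sequence exactly: C3 -> C3 83 and A4 -> C2 A4.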
On further investigation, the problem is with the source file auction.xml, which declares its encoding as iso-8859-1 but is actually encoded in UTF-8. This leads to the two octets C3 A4 being read as two separate characters, each of which is then separately encoded into UTF-8 in the result file.
Mike:

Any suggestions on how to convert a UTF-8 file into iso-8859-1? Perhaps some kind of tool or editor?

Thanks,
Carmelo
(In reply to comment #5)
> Mike:
>
> Any suggestions on how to convert a UTF-8 into iso-8859-1. perhaps some kind of
> a tool or editor?
>
> Thanks,
> Carmelo

In a test suite it would be better simply to change the XML declaration to utf-8 and leave the characters in UTF-8 encoding. An XML parser is not obliged to be able to handle iso-8859-1.
The simplest is just to change the XML declaration so that it declares the encoding correctly, i.e. change it to encoding="UTF-8". If you then want to change the encoding, try:

java net.sf.saxon.Query -s auction.xml -o auction2.xml {.} !encoding=iso-8859-1
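If Saxon isn't to hand, the same transcoding can be sketched in plain Java: decode the file as UTF-8, fix the declaration's encoding label so it stays truthful, and re-encode as ISO-8859-1. This is a hedged illustration, not the mechanism the test suite used; the file names and the exact declaration string matched by replace() are assumptions:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Transcode {
    // Re-encode UTF-8 text as ISO-8859-1, updating the XML declaration's
    // encoding label so the file is not mislabelled again. Assumes the
    // declaration literally contains encoding="UTF-8".
    static byte[] transcode(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        text = text.replace("encoding=\"UTF-8\"", "encoding=\"iso-8859-1\"");
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // File names are illustrative
        byte[] in = Files.readAllBytes(Paths.get("auction.xml"));
        Files.write(Paths.get("auction2.xml"), transcode(in));
    }
}
```

Note this only works for documents whose characters all fit in ISO-8859-1; characters outside that repertoire would need character references instead.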
Thanks. I think I got this right (at least one of my editors tells me so). I changed the encoding declaration to UTF-8 and generated and submitted new results.

Thanks,
Carmelo