Bug 3550: expected result for ns-queries-results-q5
Created: 2006-07-31 19:58:15 +0000
Last modified: 2006-08-01 20:46:47 +0000
Classification: Unclassified
Product: XML Query Test Suite
Component: XML Query Test Suite
Version: unspecified
Platform: PC, Windows XP
Status: CLOSED FIXED
Priority: P2
Severity: normal
Reporter: andrew.eisenberg
Assigned to: carmelo
CC: jonathan.robie
QA contact: public-qt-comments

Comment 0 (id 10898) from andrew.eisenberg, 2006-07-31 19:58:15 +0000

I'm on shaky ground here, but I'm going to question the expected result for ns-queries-results-q5.

The last <remark> element in the expected result is given as:

<remark xml:lang="de"> Columbia Records 12" 33-1/3 rpm LP, #FC-38641, Stereo. Die Platte ist noch immer sauber und glÃÂ¤nzend und sieht ungespielt aus (NM Zustand). Das Cover hat leichte Abnutzungen an OberflÃÂ¤che und Ecken. </remark>

The test case was taken from the Use Cases document, where this last <remark> element in the result appears as:

<remark xml:lang="de">Columbia Records 12" 33-1/3 rpm LP, #FC-38641, Stereo. Die Platte ist noch immer sauber und glänzend und sieht ungespielt aus (NM Zustand). Das Cover hat leichte Abnutzungen an Oberfläche und Ecken.</remark>

I don't believe that "ÃÂ¤" is the correct UTF-8 encoding for "ä".

Comment 1 (id 10899) from mike, 2006-07-31 20:22:09 +0000

I think you have displayed the file (and copied text from it) using a text editor that wasn't configured to read UTF-8. If you look at the file in hex, I think you will see that the relevant bytes are xC3A4, which I think is the correct UTF-8 representation of lower-case-a-with-umlaut, codepoint E4.

00E4 = 00000000 11100100 => 11000011 10100100 = C3A4

Michael Kay

Comment 2 (id 10924) from andrew.eisenberg, 2006-08-01 03:34:03 +0000

I still believe that the expected result is incorrect. I believe that it contains the byte sequence xC383C2A4 (in 2 locations) where Michael expects xC3A4.
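The byte arithmetic in the comments above can be checked with a short Python sketch. It is not part of the original report, and the variable names are illustrative. It builds the two-byte UTF-8 form of codepoint E4 by hand, then shows one way the four-byte sequence xC383C2A4 can arise: reading the two correct bytes as ISO-8859-1 characters and encoding them as UTF-8 a second time.

```python
# Two-byte UTF-8 layout for code points U+0080..U+07FF: 110xxxxx 10xxxxxx
cp = 0x00E4  # LATIN SMALL LETTER A WITH DIAERESIS

lead = 0b11000000 | (cp >> 6)            # upper bits of the code point
trail = 0b10000000 | (cp & 0b00111111)   # low 6 bits of the code point
manual = bytes([lead, trail])

# The hand-built bytes agree with Python's own UTF-8 encoder.
assert manual == "\u00e4".encode("utf-8")
print(manual.hex())  # c3a4

# Misreading those two bytes as ISO-8859-1 gives two characters
# (U+00C3 and U+00A4); encoding them as UTF-8 again gives four bytes.
double = manual.decode("iso-8859-1").encode("utf-8")
print(double.hex())  # c383c2a4
```

Inspecting the raw bytes this way avoids relying on how a particular editor chooses to render them.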

Comment 3 (id 10934) from mike, 2006-08-01 13:05:53 +0000

UltraEdit in hex mode shows the character as C3 A4, but when I read the file into a Java InputStream and display the bytes I do indeed get

c3 83 c2 a4

That's actually the UTF-8 encoding of the two characters C3 and A4, which are themselves the UTF-8 encoding of E4. So it's been doubly encoded into UTF-8.

I got confused by UltraEdit - in hex mode it doesn't actually show the octets present in the file, it shows the UTF-16 characters after decoding from UTF-8.

I'm seeing the same byte sequence in the result file produced by Saxon, so I suspect this might be the cause of the problem. Perhaps I supplied a result file at some stage and this was incorporated into the distribution. I suspect this double encoding is happening as a result of the way I do canonicalization - as it's done to both files it doesn't normally show up.

Comment 4 (id 10936) from mike, 2006-08-01 13:37:25 +0000

On further investigation, the problem is with the source file auction.xml, which declares its encoding as iso-8859-1 but is actually encoded in UTF-8. This leads to the two octets C3 A4 being read as two separate characters, each of which is separately encoded into UTF-8 in the result file.

Comment 5 (id 10940) from carmelo, 2006-08-01 16:34:24 +0000

Mike:

Any suggestions on how to convert a UTF-8 file into iso-8859-1? Perhaps some kind of tool or editor?

Thanks,
Carmelo

Comment 6 (id 10941) from davidc, 2006-08-01 16:51:22 +0000

(In reply to comment #5)
> Mike:
>
> Any suggestions on how to convert a UTF-8 file into iso-8859-1? Perhaps some kind of
> tool or editor?
>
> Thanks,
> Carmelo

In a test suite it would be better simply to change the XML declaration to UTF-8 and leave the characters in UTF-8 encoding. An XML parser is not obliged to be able to handle iso-8859-1.

Comment 7 (id 10951) from mike, 2006-08-01 17:10:55 +0000

The simplest is just to change the XML declaration so that it declares the encoding correctly, i.e. change it to encoding="UTF-8".
If you then want to change the encoding, try

    java net.sf.saxon.Query -s auction.xml -o auction2.xml {.} !encoding=iso-8859-1

Comment 8 (id 10966) from carmelo, 2006-08-01 19:51:08 +0000

Thanks. I think I got this right (at least one of my editors tells me so). Changed the encoding to UTF-8, then generated and submitted new results.

Thanks,
Carmelo
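The diagnosis and the two fixes discussed in this thread can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the `raw` bytes are a hypothetical two-line stand-in for auction.xml, not the real test-suite file, and the standard library's `xml.etree.ElementTree` is used in place of Saxon. The document declares iso-8859-1 while its bytes are really UTF-8, so a parser that trusts the declaration misreads "ä" as two characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for auction.xml: declared ISO-8859-1, bytes are UTF-8.
raw = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n'
    '<remark xml:lang="de">gl\u00e4nzend</remark>\n'
).encode("utf-8")

# The parser trusts the declaration, so C3 A4 becomes two characters.
misread = ET.fromstring(raw).text
print(misread.encode("utf-8").hex())  # contains c383c2a4 when re-serialized

# Fix 1: leave the bytes alone and correct the declaration.
fixed_decl = raw.replace(b'encoding="iso-8859-1"', b'encoding="UTF-8"', 1)

# Fix 2: keep the declaration and transcode the bytes to match it.
transcoded = raw.decode("utf-8").encode("iso-8859-1")

# Either way the parser now sees the intended text.
assert ET.fromstring(fixed_decl).text == "gl\u00e4nzend"
assert ET.fromstring(transcoded).text == "gl\u00e4nzend"
```

Fix 1 corresponds to the advice to change the declaration to encoding="UTF-8"; Fix 2 corresponds to converting the file so that it really is iso-8859-1.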