This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
We are hitting character encoding issues within the test suite in Mercurial, specifically in the test set 'number'. For example, see test case number-5075. In some tools and environments the special characters are corrupted. It may be better to generate the characters using entity references to avoid this corruption.
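As a rough illustration of the suggestion above, the following sketch rewrites every non-ASCII character as a hexadecimal numeric character reference, so that the file content is pure ASCII on disk. The class and method names (`CharRefs`, `escapeNonAscii`) are hypothetical and not part of the test suite or its tooling, and the sketch deliberately leaves `&` and `<` alone, so it is not a general XML escaper.

```java
public class CharRefs {
    // Hypothetical helper: replace each code point above U+007F with a
    // hexadecimal character reference such as &#xA4;. ASCII is copied as-is.
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        // codePoints() combines surrogate pairs, so non-BMP characters
        // become a single reference rather than two broken halves.
        text.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00A4 (currency sign) and U+00A0 (no-break space) are examples only.
        System.out.println(escapeNonAscii("\u00A4 1\u00A0234"));
        // prints: &#xA4; 1&#xA0;234
    }
}
```

An XML parser expands such references back to the original characters, so the parsed test content is unchanged while the stored bytes no longer depend on how a tool guesses the encoding.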
I am reluctant: many test sets use non-ASCII Unicode characters, or even non-BMP Unicode characters, and they don't seem to cause problems. So I'd rather try to find out why these particular tests are now causing trouble. Perhaps a BOM is missing at the beginning of the file, and adding it would prevent Mercurial from treating the file as ASCII and corrupting it? Or should we encode as UTF-16?
O'Neil, do you still have issues with this? Or has it been resolved in the meantime?
Just to confirm, we are still having problems with this test-set - it is working in some environments and not others. We have yet to pin the issue down.
We have drilled more deeply into this problem and discovered that the difference between environments that work and those that don't is the choice of XML parser - Apache Xerces gets it right, the JDK parser (even in JDK 8) gets it wrong. It's unbelievable that the JDK parser is still buggy after so many years, but that seems to be the case. We will once again raise a JDK bug on this, and once again remind ourselves and our users always to use the Apache parser in preference. Meanwhile, closing this bug.
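Since the outcome depends on which XML parser is picked up, it can help to check what JAXP actually resolves in a given environment. The sketch below is only a diagnostic, with hypothetical names (`ParserCheck`, `domFactoryName`, `saxFactoryName`, `isJdkBuiltIn`). It assumes that the parser bundled with the JDK lives under the `com.sun.org.apache.xerces.internal` package, whereas the standalone Apache Xerces distribution uses `org.apache.xerces`.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

public class ParserCheck {
    // Name of the concrete DOM factory class that JAXP resolved at runtime.
    public static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Name of the concrete SAX factory class that JAXP resolved at runtime.
    public static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Assumption: the JDK's bundled parser classes use this package prefix.
    public static boolean isJdkBuiltIn(String factoryClassName) {
        return factoryClassName.startsWith("com.sun.org.apache.xerces.internal.");
    }

    public static void main(String[] args) {
        String dom = domFactoryName();
        String sax = saxFactoryName();
        System.out.println("DOM factory: " + dom + " (JDK built-in: " + isJdkBuiltIn(dom) + ")");
        System.out.println("SAX factory: " + sax + " (JDK built-in: " + isJdkBuiltIn(sax) + ")");
    }
}
```

If the output shows the built-in parser where Apache Xerces was expected, the Xerces jar is probably missing from the classpath or is being overridden by a system property such as `javax.xml.parsers.DocumentBuilderFactory`.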
For my own future reference, we raised essentially the same problem as a JDK bug here http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8145969 and it has been reported by others here: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8058175 and they claim it is fixed in JDK 9.