This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
The validator reports that Unicode characters U+0080 through U+009F should not be used in the XHTML source, but the XML specification allows these characters. See also http://bugzilla.wikipedia.org/show_bug.cgi?id=5732.
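For reference, the claim that XML allows these characters can be checked against the Char production in section 2.2 of the XML 1.0 specification. The following is only a minimal sketch: the name `is_xml10_char` is made up for illustration and is not part of the validator.

```python
# Code point ranges from the Char production in XML 1.0, section 2.2:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML10_CHAR_RANGES = [
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_xml10_char(code_point: int) -> bool:
    """Return True if code_point matches the XML 1.0 Char production.

    Hypothetical helper for illustration only.
    """
    return any(lo <= code_point <= hi for lo, hi in XML10_CHAR_RANGES)


# Every C1 control character (U+0080 through U+009F) falls inside #x20-#xD7FF.
c1_allowed = all(is_xml10_char(cp) for cp in range(0x80, 0xA0))
print(c1_allowed)  # prints True
```

So the rejection does not come from the XML 1.0 character rules themselves.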
This bug still appears to be present in the latest version of OpenSP, see: http://qa-dev.w3.org/wmvs/0.7/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2 but for some reason the development version using Bjoern's S::P::O library does not appear to be affected. Bjoern, do you know why this is the case? Thanks.
Forgot the pointer to the spo-enabled version: http://qa-dev.w3.org/wmvs/HEAD/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2
The difference in HEAD is due to a bug in S::P::O; it will be fixed in the next version.
If my understanding is correct, this is:
* a bug in OpenSP
* which in HEAD is masked by a bug in S::P::O

Terje, could I ask you to look at whether this is on the radar of the OpenJade group? Bjoern, any idea when the new version with the spo bug fix would be in? Thank you.
My information on OpenJade is not current and my attention right now is... elsewhere. I vaguely recall looking at this a while back and concluding it was fixable, but I don't recall specifics. However, the non-SGML Character warnings are emitted based on the SGML Declaration in use, so updating that may be all that's required.
The characters in question are indeed allowed by the XML 1.0 specification (although in 1.1, I believe they are allowed only in the form of character references, not as literals). There appear to be discrepancies in every SGML declaration I've found which claims to represent XML in SGML terms: they all declare these as UNUSED characters (i.e. non-SGML characters, not to appear literally).

But are they allowed by XHTML 1.0? XHTML 1.0 describes itself as a reformulation in XML of HTML 4. And HTML 4 includes an SGML declaration (which I believe to be normative) which excludes these characters.

http://www.w3.org/TR/1999/REC-html401-19991224/sgml/sgmldecl.html

The relevant part of the document character set declaration in the HTML 4 SGML declaration reads:

    127 1 UNUSED
    128 32 UNUSED

If the character-repertoire restrictions of HTML 4 are inherited by XHTML 1.0, then I think the validator is right to reject these characters. Further discussion and details of this logic may be found at http://www.w3.org/People/cmsmcq/2007/C1.xml
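To make the two UNUSED entries concrete, here is a rough sketch that expands "start count UNUSED" entries into the code points they exclude. It handles only the two lines quoted above, not a full SGML declaration, and the function name `unused_code_points` is an assumption made for illustration.

```python
def unused_code_points(descset_lines):
    """Collect code points marked UNUSED in 'start count UNUSED' entries.

    Hypothetical helper: it understands only this three-field form,
    not the rest of the SGML declaration syntax.
    """
    unused = set()
    for line in descset_lines:
        fields = line.split()
        if len(fields) == 3 and fields[2] == "UNUSED":
            start, count = int(fields[0]), int(fields[1])
            unused.update(range(start, start + count))
    return unused


# The two entries quoted from the HTML 4 SGML declaration.
html4_excerpt = ["127 1 UNUSED", "128 32 UNUSED"]
excluded = unused_code_points(html4_excerpt)
print(f"U+{min(excluded):04X}..U+{max(excluded):04X}")  # prints U+007F..U+009F
```

Read this way, the entries exclude DEL (U+007F) plus the full C1 range U+0080 through U+009F, which matches the range the validator complains about.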
(In reply to comment #4)
> Bjoern, any idea when the new version with the spo bug fix would be in?

I don't know what bug I was talking about here, so I can't comment on that. For the Wikipedia document, HEAD seems to emit garbage (ISO-8859-1 encoded chars in a UTF-8 encoded document). My guess is that the source code does not have the utf-8 bit on, or that the output stream is not marked as utf-8, or something along those lines. It seems I did release a new spo version after my comment, so I suppose it's one of

- fixed a bug in how parse_string handles encodings
- fixed a bug in handling warnings(qw/multiple args/)

That you get errors for C1 characters is due to xml.dcl, which has

    CHARSET DESCSET 128 32 UNUSED

which is from http://www.w3.org/TR/NOTE-sgml-xml-971215 and probably wrong.
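The "garbage" symptom mentioned above (ISO-8859-1 encoded chars in a UTF-8 encoded document) can be reproduced in isolation. The sketch below is in Python rather than the validator's Perl, and the sample text and the helper `is_valid_utf8` are illustrative assumptions, not taken from the validator or S::P::O.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "café"
as_utf8 = text.encode("utf-8")         # 'é' becomes the two bytes C3 A9
as_latin1 = text.encode("iso-8859-1")  # 'é' becomes the single byte E9

# A page labelled UTF-8 that contains a Latin-1 encoded fragment.
page = b"<p>" + as_latin1 + b"</p>"

print(is_valid_utf8(as_utf8))  # prints True
print(is_valid_utf8(page))     # prints False
```

A lone byte E9 starts a three-byte UTF-8 sequence, so when it is followed by ASCII the output is no longer valid UTF-8.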
(In reply to comment #6)
> If the character-repertoire restrictions of HTML 4 are inherited by
> XHTML 1.0, then I think the validator is right to reject these characters.

Note http://www.w3.org/TR/xhtml1/DTD/xhtml1.dcl but I don't think you will find the new HTML Working Group argue in favour of this restriction.
Changing milestone - I am currently checking that the explanations given by Michael in comment #6 mean that we should indeed update our "sgml declaration for xml" to allow this range of characters in XML mode. Bjoern is also right in comment #7 and comment #8: I don't recall the exact wording, but I think the HTML working group just reused the declaration from the http://www.w3.org/TR/NOTE-sgml-xml-971215 document when publishing http://www.w3.org/TR/xhtml1/DTD/xhtml1.dcl. Both should be fixed, I believe.
The xml.dcl file used by the validator has been changed in CVS http://lists.w3.org/Archives/Public/www-validator-cvs/2007May/0012.html and I expect other published "SGML declaration of XML" files, notably those used in the XHTML family, will be amended soon too.
The latest change in CVS seems to fix http://www.w3.org/mid/200704281505.13824.ville.skytta@iki.fi for me, thanks.
A bug in the transcoding routines of the validator also triggered the 128-159 character error for non-C1 characters; the patch http://lists.w3.org/Archives/Public/www-validator-cvs/2007May/0064.html fixes this.
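As an illustration of how a transcoding error can make non-C1 characters look like C1 characters, consider UTF-8 bytes being decoded as ISO-8859-1. This is one plausible mechanism only, assumed here for the sake of example; it is not taken from the patch, and the helper `c1_code_points` is hypothetical.

```python
def c1_code_points(text: str):
    """Return the characters of text that lie in U+0080 through U+009F."""
    return [ch for ch in text if 0x80 <= ord(ch) <= 0x9F]


original = "À"                                # U+00C0, outside the C1 range
utf8_bytes = original.encode("utf-8")         # the two bytes C3 80
misdecoded = utf8_bytes.decode("iso-8859-1")  # each byte becomes one character

print(c1_code_points(original))         # prints []
print(len(c1_code_points(misdecoded)))  # prints 1
```

The second UTF-8 byte, 80, turns into U+0080 when read as ISO-8859-1, so a parser using a declaration that marks 128 through 159 as UNUSED would flag a document that never contained a C1 character.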