3164 2006-04-28 01:11:08 +0000 non SGML character number 128-159 2008-12-01 03:04:26 +0000 1 1 1 Unclassified Validator check 0.7.2 PC Linux RESOLVED FIXED http://validator.w3.org/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2&charset=%28detect+automatically%29&doctype=Inline P2 normal 0.8.0 1 inbox link aaz bjoern cmsmcq link www-validator-cvs oldest_to_newest

9519 0 inbox 2006-04-28 01:11:08 +0000
The validator reports that Unicode characters U+0080 - U+009F should not be used in the XHTML source. The XML specification allows these characters to be used. Also see http://bugzilla.wikipedia.org/show_bug.cgi?id=5732.

11319 1 ot 2006-08-30 06:34:05 +0000
This bug appears to still be present in the latest version of OpenSP, see:
http://qa-dev.w3.org/wmvs/0.7/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2
but for some reason the development version using Bjoern's S::P::O library appears not to be affected. Bjoern, do you know why this is the case? Thanks.

11320 2 ot 2006-08-30 06:35:16 +0000
Forgot the pointer to the S::P::O-enabled version:
http://qa-dev.w3.org/wmvs/HEAD/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2

11325 3 bjoern 2006-08-30 14:39:00 +0000
The difference in HEAD is a bug in S::P::O; it will be fixed in the next version.

14531 4 ot 2007-03-23 02:38:34 +0000
If my understanding is correct, this is:
* a bug in OpenSP
* which in HEAD is masked by a bug in S::P::O
Terje, could I ask you to look at whether this is on the radar of the OpenJade group? Bjoern, any idea when the new version with the S::P::O bug fix would be in? Thank you.

14533 5 link 2007-03-23 10:34:55 +0000
My information on OpenJade is not current and my attention right now is... elsewhere. I vaguely recall looking at this a while back and concluding it was fixable, but I don't recall specifics. However, the non-SGML character warnings are emitted based on the SGML declaration in use, so updating that may be all that's required.

14536 6 cmsmcq 2007-03-24 02:04:12 +0000
The characters in question are indeed allowed by the XML 1.0 specification (although in 1.1, I believe they are allowed only in the form of character references, not as literals). There appear to be discrepancies in every SGML declaration I've found which claims to represent XML in SGML terms: they all declare these as UNUSED characters (i.e. non-SGML characters, not to appear literally).

But are they allowed by XHTML 1.0? XHTML 1.0 describes itself as a reformulation in XML of HTML 4. And HTML 4 includes an SGML declaration (which I believe to be normative) which excludes these characters.
http://www.w3.org/TR/1999/REC-html401-19991224/sgml/sgmldecl.html

The relevant part of the document character set declaration in the HTML 4 SGML declaration reads:
    127 1 UNUSED
    128 32 UNUSED

If the character-repertoire restrictions of HTML 4 are inherited by XHTML 1.0, then I think the validator is right to reject these characters. Further discussion and details of this logic may be found at http://www.w3.org/People/cmsmcq/2007/C1.xml

14942 7 bjoern 2007-04-29 22:46:29 +0000
(In reply to comment #4)
> Bjoern, any idea when the new version with the spo bug fix would be in?

I don't know what bug I was talking about here, so I can't comment on that. For the Wikipedia document, HEAD seems to emit garbage (ISO-8859-1 encoded chars in a UTF-8 encoded document). My guess is that the source code does not have the utf-8 bit on, or that the output stream is not marked as utf-8, or something along those lines. It seems I did release a new S::P::O version after my comment, so I suppose it's one of
- fixed a bug in how parse_string handles encodings
- fixed a bug in handling warnings(qw/multiple args/)

That you get errors for C1 characters is due to xml.dcl, which has
    CHARSET DESCSET 128 32 UNUSED
which is from http://www.w3.org/TR/NOTE-sgml-xml-971215 and probably wrong.

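The character ranges discussed in the comments above can be checked mechanically. The following is a minimal Python sketch, not code from the validator, OpenSP, or S::P::O. It expands the `128 32 UNUSED` entry into the code points it covers and compares them with the `Char` production of XML 1.0 and the `RestrictedChar` production of XML 1.1, both transcribed here by hand. The function names (`descset_unused_range`, `in_ranges`, `find_c1`) are hypothetical and exist only for this illustration.

```python
# Illustrative only: ranges transcribed from the XML 1.0 "Char" production
# and the XML 1.1 "RestrictedChar" production. All names are hypothetical.

XML10_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

XML11_RESTRICTED_RANGES = [
    (0x1, 0x8), (0xB, 0xC), (0xE, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
]


def descset_unused_range(start, count):
    """Expand a 'start count UNUSED' DESCSET entry into the code points it covers."""
    return range(start, start + count)


def in_ranges(code_point, ranges):
    """Return True if code_point falls inside any inclusive (low, high) pair."""
    return any(low <= code_point <= high for low, high in ranges)


def find_c1(text):
    """Return (offset, code point) pairs for literal C1 characters in text."""
    c1 = descset_unused_range(128, 32)
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in c1]


if __name__ == "__main__":
    c1 = descset_unused_range(128, 32)
    # The entry "128 32 UNUSED" covers 32 code points starting at 128.
    print(f"U+{c1[0]:04X}..U+{c1[-1]:04X}")  # U+0080..U+009F
    # Every one of them matches the XML 1.0 Char production.
    print(all(in_ranges(cp, XML10_CHAR_RANGES) for cp in c1))  # True
    # In XML 1.1 all but U+0085 are restricted to character references.
    print([f"U+{cp:04X}" for cp in c1
           if not in_ranges(cp, XML11_RESTRICTED_RANGES)])  # ['U+0085']
    print(find_c1("ab\u0085c\u009f"))  # [(2, 133), (4, 159)]
```

This only shows what the XML productions permit. Whether XHTML 1.0 inherits the HTML 4 restriction, as comment 6 asks, is a question about the specifications that the sketch does not settle.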
14943 8 bjoern 2007-04-29 22:51:03 +0000
(In reply to comment #6)
> If the character-repertoire restrictions of HTML 4 are inherited by
> XHTML 1.0, then I think the validator is right to reject these characters.

Note http://www.w3.org/TR/xhtml1/DTD/xhtml1.dcl but I don't think you will find the new HTML Working Group arguing in favour of this restriction.

14956 9 ot 2007-05-01 17:43:28 +0000
Changing milestone - I am currently checking that the explanations given by Michael in Comment #6 mean that we should indeed update our "sgml declaration for xml" to allow this range of characters in XML mode.

Bjoern in Comment #7 and Comment #8 is also right. I don't recall the exact wording, but I think the HTML working group just reused the declaration from the http://www.w3.org/TR/NOTE-sgml-xml-971215 document when publishing http://www.w3.org/TR/xhtml1/DTD/xhtml1.dcl - both should be fixed, I believe.

14992 10 ot 2007-05-04 19:11:29 +0000
The xml.dcl file used by the validator has been changed in CVS
http://lists.w3.org/Archives/Public/www-validator-cvs/2007May/0012.html
and I expect other published "SGML declaration of XML" files, notably those used in the XHTML family, will be amended soon too.

15022 11 ville.skytta 2007-05-06 07:43:16 +0000
The latest change in CVS seems to fix http://www.w3.org/mid/200704281505.13824.ville.skytta@iki.fi for me, thanks.

15112 12 ot 2007-05-18 01:00:43 +0000
A bug in the transcoding routines of the validator also triggered the 128-159 character error for non-C1 characters; the patch
http://lists.w3.org/Archives/Public/www-validator-cvs/2007May/0064.html
fixes this.
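Comment 7 mentions ISO-8859-1 encoded characters showing up in a UTF-8 document, and comment 12 attributes spurious 128-159 errors to a transcoding bug, without describing the mechanism. The Python sketch below illustrates one hypothetical way such a mix-up can produce a C1 code point from a character that is not in the C1 range. It is an assumption made for illustration, not a reconstruction of the validator's actual bug or of the patch. The helper names are invented here.

```python
# Hypothetical illustration: UTF-8 bytes wrongly decoded as ISO-8859-1 can
# yield C1 code points even when the original text contains none.


def c1_code_points(text):
    """List the C1 control characters (U+0080..U+009F) found in text."""
    return [f"U+{ord(ch):04X}" for ch in text if 0x80 <= ord(ch) <= 0x9F]


def misdecode_utf8_as_latin1(text):
    """Simulate a transcoding mistake: encode as UTF-8, decode as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    original = "\u00c0"  # LATIN CAPITAL LETTER A WITH GRAVE, not a C1 character
    garbled = misdecode_utf8_as_latin1(original)
    print(c1_code_points(original))  # []
    # UTF-8 for U+00C0 is the bytes C3 80; read as ISO-8859-1 they become
    # U+00C3 followed by U+0080, and U+0080 is in the C1 range.
    print(c1_code_points(garbled))   # ['U+0080']
```

Under this assumption, any character whose UTF-8 encoding contains a byte between 0x80 and 0x9F would be reported as a 128-159 error once the bytes are misread.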