This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 3164 - non SGML character number 128-159
Summary: non SGML character number 128-159
Status: RESOLVED FIXED
Alias: None
Product: Validator
Classification: Unclassified
Component: check
Version: 0.7.2
Hardware: PC Linux
Importance: P2 normal
Target Milestone: 0.8.0
Assignee: Terje Bless
QA Contact: qa-dev tracking
URL: http://validator.w3.org/check?uri=htt...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-04-28 01:11 UTC by Ruud Koot
Modified: 2008-12-01 03:04 UTC
CC List: 4 users

See Also:


Attachments

Description Ruud Koot 2006-04-28 01:11:08 UTC
The validator reports that Unicode characters U+0080 - U+009F should not be used in the XHTML source, but the XML specification allows these characters. See also http://bugzilla.wikipedia.org/show_bug.cgi?id=5732.
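
For reference, the XML 1.0 `Char` production permits #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so the whole range U+0080 - U+009F falls inside it. A minimal Python sketch of that range test follows; the function name `is_xml10_char` is hypothetical, and this only mirrors the grammar, not whatever check the validator or OpenSP performs.

```python
def is_xml10_char(code_point: int) -> bool:
    """Return True if code_point matches the XML 1.0 `Char` production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


# Check every code point in the disputed C1 range U+0080 - U+009F.
c1_all_allowed = all(is_xml10_char(cp) for cp in range(0x80, 0xA0))
```

Under this reading of the grammar, `c1_all_allowed` is true, while code points such as U+0000 or U+FFFE are rejected.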
Comment 1 Olivier Thereaux 2006-08-30 06:34:05 UTC
This bug still appears to be present with the latest version of OpenSP; see:
http://qa-dev.w3.org/wmvs/0.7/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2

but for some reason the development version, which uses Bjoern's S::P::O library, does not appear to be affected.

Bjoern, do you know why this is the case?

Thanks.
Comment 2 Olivier Thereaux 2006-08-30 06:35:16 UTC
Forgot the pointer to the S::P::O-enabled version:
http://qa-dev.w3.org/wmvs/HEAD/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2
Comment 3 Bjoern 2006-08-30 14:39:00 UTC
The difference in HEAD is due to a bug in S::P::O; it will be fixed in the next version.
Comment 4 Olivier Thereaux 2007-03-23 02:38:34 UTC
If my understanding is correct, this is:
* a bug in OpenSP
* which in HEAD is masked by a bug in S::P::O

Terje, could I ask you to check whether this is on the radar of the OpenJade group?

Bjoern, any idea when the new version with the S::P::O bug fix will be available?

Thank you.

Comment 5 Terje Bless 2007-03-23 10:34:55 UTC
My information on OpenJade is not current and my attention right now is... elsewhere.

I vaguely recall looking at this a while back and concluding it was fixable, but I don't recall specifics.

However, the non-SGML Character warnings are emitted based on the SGML Declaration in use, so updating that may be all that's required.
Comment 6 C. M. Sperberg-McQueen 2007-03-24 02:04:12 UTC
The characters in question are indeed allowed by the XML 1.0 specification
(although in 1.1, I believe they are allowed only in the form of character
references, not as literals).  There appear to be discrepancies in every
SGML declaration I've found which claims to represent XML in SGML terms: they
all declare these as UNUSED characters (i.e. non-SGML characters, not
to appear literally).
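
The XML 1.1 distinction mentioned above can be sketched as a range test. In XML 1.1, `RestrictedChar` covers #x1-#x8, #xB-#xC, #xE-#x1F, #x7F-#x84 and #x86-#x9F, and such characters may appear only as character references; U+0085 (NEL) is the one C1 code point left out of that set. The function name below is hypothetical, and the snippet is only an illustration of the grammar.

```python
def xml11_needs_char_ref(code_point: int) -> bool:
    """True if code_point is an XML 1.1 RestrictedChar (character reference only)."""
    return (
        0x1 <= code_point <= 0x8
        or 0xB <= code_point <= 0xC
        or 0xE <= code_point <= 0x1F
        or 0x7F <= code_point <= 0x84
        or 0x86 <= code_point <= 0x9F
    )


# C1 code points that XML 1.1 still accepts as literal characters.
c1_literals_allowed = [cp for cp in range(0x80, 0xA0) if not xml11_needs_char_ref(cp)]
```

Here `c1_literals_allowed` contains only 0x85, which matches the recollection above except for that single code point.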

But are they allowed by XHTML 1.0?  XHTML 1.0 describes itself as a
reformulation in XML of HTML 4.  And HTML 4 includes an SGML declaration
(which I believe to be normative) which excludes these characters.

http://www.w3.org/TR/1999/REC-html401-19991224/sgml/sgmldecl.html

The relevant part of the document character set declaration in the HTML 4
SGML declaration reads:

                 127     1       UNUSED
                 128     32      UNUSED

If the character-repertoire restrictions of HTML 4 are inherited by
XHTML 1.0, then I think the validator is right to reject these characters.

Further discussion and details of this logic may be found at
http://www.w3.org/People/cmsmcq/2007/C1.xml
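
To make the two DESCSET entries quoted above concrete: each entry reads as a starting code point, a count, and a mapping, so `128 32 UNUSED` covers code points 128 through 159. A minimal Python sketch that expands such entries is shown here; the helper name and the one-entry-per-line layout are assumptions for illustration, and a real SGML declaration parser has to handle far more than this.

```python
def unused_code_points(descset: str) -> set:
    """Collect the code points that a DESCSET fragment marks as UNUSED.

    Assumes one entry per line: <first code point> <count> <mapping>.
    """
    unused = set()
    for line in descset.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2] == "UNUSED":
            start, count = int(fields[0]), int(fields[1])
            unused.update(range(start, start + count))
    return unused


# The two entries quoted from the HTML 4 SGML declaration.
HTML4_FRAGMENT = """
    127     1       UNUSED
    128     32      UNUSED
"""

non_sgml = unused_code_points(HTML4_FRAGMENT)
```

With these two entries, `non_sgml` is exactly the code points 127 through 159, that is, U+007F plus the C1 range U+0080 - U+009F.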
Comment 7 Bjoern 2007-04-29 22:46:29 UTC
(In reply to comment #4)
> Bjoern, any idea when the new version with the spo bug fix would be in?

I don't know what bug I was talking about here, so I can't comment on that. For the Wikipedia document, HEAD seems to emit garbage (ISO-8859-1 encoded characters in a UTF-8 encoded document). My guess is that the source code does not have the utf-8 bit on, or that the output stream is not marked as utf-8, or something along those lines. It seems I did release a new S::P::O version after my comment, so I suppose it is one of these changes:

  - fixed a bug in how parse_string handles encodings
  - fixed a bug in handling warnings(qw/multiple args/)

The errors you get for C1 characters are due to xml.dcl, which has

  CHARSET
    DESCSET
      128 32 UNUSED

which is from http://www.w3.org/TR/NOTE-sgml-xml-971215 and probably wrong.
Comment 8 Bjoern 2007-04-29 22:51:03 UTC
(In reply to comment #6)
> If the character-repertoire restrictions of HTML 4 are inherited by
> XHTML 1.0, then I think the validator is right to reject these characters.

Note http://www.w3.org/TR/xhtml1/DTD/xhtml1.dcl, but I don't think you
will find the new HTML Working Group arguing in favour of this restriction.
Comment 9 Olivier Thereaux 2007-05-01 17:43:28 UTC
Changing milestone. I am currently checking whether the explanations given by Michael in Comment #6 mean that we should indeed update our "SGML declaration for XML" to allow this range of characters in XML mode.

Bjoern in Comment #7 and Comment #8 is also right. I don't recall the exact wording, but I think the HTML working group just reused the declaration from the http://www.w3.org/TR/NOTE-sgml-xml-971215 document when publishing http://www.w3.org/TR/xhtml1/DTD/xhtml1.dcl. Both should be fixed, I believe.
Comment 10 Olivier Thereaux 2007-05-04 19:11:29 UTC
The xml.dcl file used by the validator has been changed in CVS:
http://lists.w3.org/Archives/Public/www-validator-cvs/2007May/0012.html
I expect other published "SGML declarations for XML", notably those used in the XHTML family, will be amended soon too.
Comment 11 Ville Skyttä 2007-05-06 07:43:16 UTC
The latest change in CVS seems to fix http://www.w3.org/mid/200704281505.13824.ville.skytta@iki.fi for me, thanks.
Comment 12 Olivier Thereaux 2007-05-18 01:00:43 UTC
A bug in the validator's transcoding routines also triggered the 128-159 character error for non-C1 characters. The patch at
http://lists.w3.org/Archives/Public/www-validator-cvs/2007May/0064.html
fixes this.
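
The patch is not reproduced here, but one general way a transcoding mistake can produce code points in the 128-159 range from non-C1 input is decoding UTF-8 bytes as ISO-8859-1. A minimal Python sketch under that assumption follows; it illustrates the class of problem only and is not necessarily the bug that was fixed, and the helper name is hypothetical.

```python
def c1_code_points(text: str) -> list:
    """Return the code points in text that fall in U+0080 - U+009F."""
    return [ord(ch) for ch in text if 0x80 <= ord(ch) <= 0x9F]


source = "\u20ac"                             # EURO SIGN, not a C1 character
utf8_bytes = source.encode("utf-8")           # three bytes: E2 82 AC
misdecoded = utf8_bytes.decode("iso-8859-1")  # wrong charset, on purpose

clean = c1_code_points(source)         # the real text has no C1 code points
spurious = c1_code_points(misdecoded)  # byte 0x82 is now read as U+0082
```

The original string contains no C1 characters, yet the mis-decoded copy contains U+0082, which a parser using the old declaration would report as a non-SGML character.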