<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>3164</bug_id>
          
          <creation_ts>2006-04-28 01:11:08 +0000</creation_ts>
          <short_desc>non SGML character number 128-159</short_desc>
          <delta_ts>2008-12-01 03:04:26 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>Validator</product>
          <component>check</component>
          <version>0.7.2</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc>http://validator.w3.org/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2&amp;charset=%28detect+automatically%29&amp;doctype=Inline</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>0.8.0</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Ruud Koot">inbox</reporter>
          <assigned_to name="Terje Bless">link</assigned_to>
          <cc>aaz</cc>
    
    <cc>bjoern</cc>
    
    <cc>cmsmcq</cc>
    
    <cc>link</cc>
          
          <qa_contact name="qa-dev tracking">www-validator-cvs</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>9519</commentid>
    <comment_count>0</comment_count>
    <who name="Ruud Koot">inbox</who>
    <bug_when>2006-04-28 01:11:08 +0000</bug_when>
    <thetext>The validator reports that the Unicode characters U+0080 to U+009F should not be used in the XHTML source, but the XML specification allows these characters. See also http://bugzilla.wikipedia.org/show_bug.cgi?id=5732.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>11319</commentid>
    <comment_count>1</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2006-08-30 06:34:05 +0000</bug_when>
    <thetext>This bug still appears to be present with the latest version of OpenSP; see:
http://qa-dev.w3.org/wmvs/0.7/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2

For some reason, however, the development version using Bjoern&apos;s S::P::O library appears not to be affected.

Bjoern, do you know why this is the case?

Thanks.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>11320</commentid>
    <comment_count>2</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2006-08-30 06:35:16 +0000</bug_when>
    <thetext>Forgot the pointer to the S::P::O-enabled version:
http://qa-dev.w3.org/wmvs/HEAD/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>11325</commentid>
    <comment_count>3</comment_count>
    <who name="Bj">bjoern</who>
    <bug_when>2006-08-30 14:39:00 +0000</bug_when>
    <thetext>The difference in HEAD is a bug in S::P::O; it will be fixed in the next version.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>14531</commentid>
    <comment_count>4</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2007-03-23 02:38:34 +0000</bug_when>
    <thetext>If my understanding is correct, this is:
* a bug in OpenSP
* which in HEAD is masked by a bug in S::P::O

Terje, could I ask you to look at whether this is on the radar of the OpenJade group?

Bjoern, any idea when the new version with the spo bug fix would be in?

Thank you.

</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>14533</commentid>
    <comment_count>5</comment_count>
    <who name="Terje Bless">link</who>
    <bug_when>2007-03-23 10:34:55 +0000</bug_when>
    <thetext>My information on OpenJade is not current and my attention right now is... elsewhere.

I vaguely recall looking at this a while back and concluding it was fixable, but I don&apos;t recall specifics.

However, the non-SGML Character warnings are emitted based on the SGML Declaration in use, so updating that may be all that&apos;s required.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>14536</commentid>
    <comment_count>6</comment_count>
    <who name="C. M. Sperberg-McQueen">cmsmcq</who>
    <bug_when>2007-03-24 02:04:12 +0000</bug_when>
    <thetext>The characters in question are indeed allowed by the XML 1.0 specification
(although in 1.1, I believe they are allowed only in the form of character
references, not as literals).  There appear to be discrepancies in every
SGML declaration I&apos;ve found which claims to represent XML in SGML terms: they
all declare these as UNUSED characters (i.e. non-SGML characters, not
to appear literally).

But are they allowed by XHTML 1.0?  XHTML 1.0 describes itself as a
reformulation in XML of HTML 4.  And HTML 4 includes an SGML declaration
(which I believe to be normative) which excludes these characters.

http://www.w3.org/TR/1999/REC-html401-19991224/sgml/sgmldecl.html

The relevant part of the document character set declaration in the HTML 4
SGML declaration reads:

                 127     1       UNUSED
                 128     32      UNUSED

If the character-repertoire restrictions of HTML 4 are inherited by
XHTML 1.0, then I think the validator is right to reject these characters.

Further discussion and details of this logic may be found at
http://www.w3.org/People/cmsmcq/2007/C1.xml
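
A minimal sketch, in Python, of the range check that the declaration excerpt above implies. It is an illustration only: the names are hypothetical, and this is neither validator nor OpenSP code. It assumes that an entry of the form 128 32 UNUSED covers 32 code points starting at 128, that is, 128 through 159 (U+0080 to U+009F).

```python
# Illustrative sketch only; names are hypothetical, not validator code.
# Assumption: the entry "128 32 UNUSED" means 32 code points starting at 128.
UNUSED_START = 128
UNUSED_COUNT = 32
UNUSED_RANGE = range(UNUSED_START, UNUSED_START + UNUSED_COUNT)  # 128..159


def find_c1_characters(text):
    """Return (offset, code point label) pairs for characters in U+0080..U+009F."""
    return [
        (offset, "U+%04X" % ord(ch))
        for offset, ch in enumerate(text)
        if ord(ch) in UNUSED_RANGE
    ]


# U+00E9 lies outside the range and is ignored; U+0085 (a C1 control) is reported.
sample = "caf\u00e9 \u0085 ok"
print(find_c1_characters(sample))
```

A parser using the declaration above would flag every character this function reports; whether it should do so for XHTML 1.0 is the question raised in this comment.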
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>14942</commentid>
    <comment_count>7</comment_count>
    <who name="Bj">bjoern</who>
    <bug_when>2007-04-29 22:46:29 +0000</bug_when>
    <thetext>(In reply to comment #4)
&gt; Bjoern, any idea when the new version with the spo bug fix would be in?

I don&apos;t know what bug I was talking about here, so I can&apos;t comment on that. For the Wikipedia document, HEAD seems to emit garbage (ISO-8859-1 encoded chars in a UTF-8 encoded document). My guess is that the source code does not have the utf-8 bit on, or that the output stream is not marked as utf-8, or something along those lines. It seems I did release a new S::P::O version after my comment, so I suppose it&apos;s one of

  - fixed a bug in how parse_string handles encodings
  - fixed a bug in handling warnings(qw/multiple args/)

That you get errors for C1 characters is due to xml.dcl which has

  CHARSET
    DESCSET
      128 32 UNUSED

which is from http://www.w3.org/TR/NOTE-sgml-xml-971215 and probably wrong.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>14943</commentid>
    <comment_count>8</comment_count>
    <who name="Bj">bjoern</who>
    <bug_when>2007-04-29 22:51:03 +0000</bug_when>
    <thetext>(In reply to comment #6)
&gt; If the character-repertoire restrictions of HTML 4 are inherited by
&gt; XHTML 1.0, then I think the validator is right to reject these characters.

Note http://www.w3.org/TR/xhtml1/DTD/xhtml1.dcl but I don&apos;t think you
will find the new HTML Working Group arguing in favour of this restriction.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>14956</commentid>
    <comment_count>9</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2007-05-01 17:43:28 +0000</bug_when>
    <thetext>Changing milestone. I am currently checking that the explanations given by Michael in Comment #6 mean that we should indeed update our &quot;sgml declaration for xml&quot; to allow this range of characters in XML mode.

Bjoern in Comment #7 and Comment #8 is also right. I don&apos;t recall the exact wording, but I think the HTML working group just reused the declaration from the http://www.w3.org/TR/NOTE-sgml-xml-971215 document when publishing http://www.w3.org/TR/xhtml1/DTD/xhtml1.dcl. Both should be fixed, I believe.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>14992</commentid>
    <comment_count>10</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2007-05-04 19:11:29 +0000</bug_when>
    <thetext>The xml.dcl file used by the validator has been changed in CVS:
http://lists.w3.org/Archives/Public/www-validator-cvs/2007May/0012.html
I expect other published &quot;SGML declarations for XML&quot;, notably those used in the XHTML family, will be amended soon too.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>15022</commentid>
    <comment_count>11</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2007-05-06 07:43:16 +0000</bug_when>
    <thetext>The latest change in CVS seems to fix http://www.w3.org/mid/200704281505.13824.ville.skytta@iki.fi for me, thanks.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>15112</commentid>
    <comment_count>12</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2007-05-18 01:00:43 +0000</bug_when>
    <thetext>A bug in the transcoding routines of the validator also triggered the 128-159 character error for non-C1 characters.

The patch
http://lists.w3.org/Archives/Public/www-validator-cvs/2007May/0064.html
fixes this.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>