<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>690</bug_id>
          
          <creation_ts>2004-04-26 07:43:09 +0000</creation_ts>
          <short_desc>Unusual processing instructions confuse the validator</short_desc>
          <delta_ts>2004-05-16 08:21:39 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>Validator</product>
          <component>check</component>
          <version>0.6.1</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>LATER</resolution>
          
          
          <bug_file_loc>http://www.ltg.ed.ac.uk/~ht/xx.html</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P5</priority>
          <bug_severity>minor</bug_severity>
          <target_milestone>1.0</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Henry S. Thompson">ht</reporter>
          <assigned_to name="Terje Bless">link</assigned_to>
          
          
          

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>1740</commentid>
    <comment_count>0</comment_count>
    <who name="Henry S. Thompson">ht</who>
    <bug_when>2004-04-26 07:43:09 +0000</bug_when>
    <thetext>This document is valid per SP 1.3.4, but the online validator complains that it
can&apos;t detect an encoding.

I suspect the charset sniffer is not getting past the PIs at the beginning -- if
I remove them, it&apos;s happy.

[Note I have no idea what version I&apos;m using -- neither the form nor the result
page gives a version number that I can see -- sorry if I&apos;m missing something
obvious]</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1771</commentid>
    <comment_count>1</comment_count>
    <who name="Terje Bless">link</who>
    <bug_when>2004-04-26 23:59:50 +0000</bug_when>
    <thetext>That URL is 404 Compliant.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1772</commentid>
    <comment_count>2</comment_count>
    <who name="Henry S. Thompson">ht</who>
    <bug_when>2004-04-27 04:03:23 +0000</bug_when>
    <thetext>Sorry, my screw-up; it&apos;s in place now.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1836</commentid>
    <comment_count>3</comment_count>
    <who name="Terje Bless">link</who>
    <bug_when>2004-05-16 04:21:39 +0000</bug_when>
    <thetext>I think this is a case of &quot;Don&apos;t Do That Then&quot;. The charset sniffer is groping
around for a &lt;meta&gt; charset because there isn&apos;t one in the Content-Type (Strike
#1), and fails to find it because at this stage we&apos;re using a non-SGML parser
(Perl&apos;s HTML::Parser) which can&apos;t handle weird constructs prior to the information
it&apos;s after (Strike #2).

The only way setting encoding info inside the document can ever work is if you
take extreme care to make the bytes up to that point be easily parsed. This
includes avoiding any non-vanilla constructs and making sure whatever encoding
it&apos;s in looks identical to US-ASCII up to that point.
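
As a rough illustration of that last condition only (a minimal sketch in
Python, not the validator code; the function name and the "charset=" marker
are assumptions made for the example), a pre-check could refuse to trust an
in-document declaration unless every byte before it is plain US-ASCII:

```python
def ascii_prefix_before_charset(data, marker=b"charset="):
    """Return the bytes that precede the first in-document charset marker.

    Returns None when no marker is found, or when the bytes before it are
    not pure US-ASCII. Hypothetical helper, for illustration only.
    """
    # bytes.lower() only folds ASCII letters, so this is safe on raw bytes.
    pos = data.lower().find(marker)
    if pos == -1:
        return None  # no declaration the sniffer could read
    prefix = data[:pos]
    # bytes.isascii() requires Python 3.7 or later.
    return prefix if prefix.isascii() else None
```

A document in an encoding such as UTF-16 fails this check outright, because
the marker bytes are not contiguous in the raw byte stream.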


I don&apos;t think we&apos;re going to be able to fix this without some fairly elaborate
digging inside HTML::Parser&apos;s guts, or by using OpenSP and doing a two-pass
parse. Given the overhead and the low gain, I&apos;m not sure it&apos;s worth it for this
bug alone (but it&apos;s another point in favour of doing a two-pass parse).

Resolving as &quot;LATER&quot;, and setting Target to 1.0 (aka. &quot;Once upon a time...&quot;).

Thanks for the catch, Henry!</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>