<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>10174</bug_id>
          
          <creation_ts>2010-07-15 15:32:51 +0000</creation_ts>
          <short_desc>Bogus error reported for UTF-8 characters in larger documents</short_desc>
          <delta_ts>2015-08-23 07:07:03 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>HTML Checker</product>
          <component>General</component>
          <version>unspecified</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc>http://nisza.org/rendering_tests/utf-8-validation.html</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Adam">a.lyskawa</reporter>
          <assigned_to name="Michael[tm] Smith">mike</assigned_to>
          <cc>hsivonen</cc>
          <cc>ville.skytta</cc>
          
          <qa_contact name="qa-dev tracking">www-validator-cvs</qa_contact>
          <comment_sort_order>oldest_to_newest</comment_sort_order>
          <long_desc isprivate="0" >
    <commentid>36920</commentid>
    <comment_count>0</comment_count>
    <who name="Adam">a.lyskawa</who>
    <bug_when>2010-07-15 15:32:51 +0000</bug_when>
    <thetext>The provided test document is perfectly valid HTML5. It uses UTF-8 encoding without a BOM and is served as text/html; charset=utf-8. The validator reports 1 error, pointing at one of the UTF-8 characters. There&apos;s no pattern to which character is flagged; the only constant is that it is a multibyte UTF-8 character. After inserting any text before the bogus error position, a different character is marked. I think it&apos;s a kind of overflow bug.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>55909</commentid>
    <comment_count>1</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2011-08-27 10:08:37 +0000</bug_when>
    <thetext>This seems to have something to do with the local validator.nu backend used by the markup validator.  It is also reproducible on qa-dev.w3.org, which uses its own validator.nu instance (http://localhost:8888/html5/), but not on my local box, which uses http://validator.nu/; and if I configure the qa-dev markup validator to use http://validator.nu/ instead of http://localhost:8888/html5/, the problem goes away.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>58914</commentid>
    <comment_count>2</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-10-25 13:54:32 +0000</bug_when>
    <thetext>(In reply to comment #1 from Ville)
&gt; This seems to have something to do with the local validator.nu backend used by
&gt; the markup validator.  It is reproducible also in qa-dev.w3.org which uses its
&gt; own validator.nu instance as well (http://localhost:8888/html5/),

I&apos;ve not been able to reproduce this with any standalone version of the validator.nu backend -- that is, using it directly through its own Web UI, rather than using it through the validator.w3.org UI and its perl interface to the validator.nu backend.

Specifically, I have not been able to reproduce it using the Web UI at http://qa-dev.w3.org:8888 at all, despite trying for quite a while.

But I can reproduce it easily with the validator.w3.org frontend, typically on the first or second try.

So I can&apos;t find any problem to fix here in the validator.nu source itself. Whatever the problem is, it does not occur within an instance of the validator.nu service itself -- that is, when a validator.nu backend communicates with its own frontend. The problem instead seems to occur only when a validator.nu backend communicates with an instance of the validator.w3.org backend+frontend.

So to me that suggests that the problem here could instead be in the validator.w3.org perl code that consumes data from a validator.nu backend, or even be an OS problem or something in the local environments of the validator.w3.org hosts.

But I suppose it could also be a problem in the validator.nu REST interface, so I&apos;ll try that directly from a local validator.nu instance and see if I can reproduce it that way.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59088</commentid>
    <comment_count>3</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-10-27 19:39:47 +0000</bug_when>
    <thetext>Ville,

So I have now tried this using curl with the REST interface to a validator.nu instance on one of the W3C validator hosts, but I have not been able to reproduce it.

The curl command I&apos;m using is this:

curl -F file=@utf-8-validation.html http://localhost:8888

(where utf-8-validation.html is a local copy of the problem file)

That produces HTML output with a message saying, &quot;The document validates according to the specified schema(s) and to additional constraints checked by the validator.&quot;

Also, just to get briefer output, I tried it with:

curl -F out=gnu -F file=@utf-8-validation.html http://localhost:8888

(that is, with the &quot;gnu&quot; output format specified)

That produces no error messages (just two expected &quot;info&quot; messages).

So as near as I can tell thus far, the problem does not seem to be caused by anything in the validator.nu REST interface.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59101</commentid>
    <comment_count>4</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2011-10-27 20:55:59 +0000</bug_when>
    <thetext>The validator does not use the form-based file upload interface of validator.nu; it POSTs the document as the request entity body:
http://wiki.whatwg.org/wiki/Validator.nu_POST_Body_Input

This has something to do with whether the request is gzipped or not.  It seems to always work if it is gzipped, but not if it isn&apos;t.  Reproducing on qa-dev.w3.org (validator perl code uses out=xml, out=gnu is here for readability):

-------------
$ curl --data-binary @utf-8-validation.html -H &quot;Content-Type: text/html&quot; &quot;http://localhost:8888/?out=gnu&quot;
: info: The Content-Type was </thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59195</commentid>
    <comment_count>5</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-10-29 12:38:02 +0000</bug_when>
    <thetext>(In reply to comment #4)
&gt; Whether validator&apos;s perl code gzips the request or not depends on whether debug
&gt; mode is enabled, and whether the HTML5 validator seems to be on the same host
&gt; as the validator; gzip is used if debug mode is off and the HTML5 validator
&gt; appears to be non-local.  Since IIUC both validator.w3.org and qa-dev use local
&gt; HTML5 validator instances, they end up always _not_ gzipping the response, thus
&gt; triggering the problem.

So should we consider always having the validator gzip the request?

By the way, thanks very much for the details.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59196</commentid>
    <comment_count>6</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2011-10-29 13:26:45 +0000</bug_when>
    <thetext>(In reply to comment #5)

&gt; So should we consider always having the validator gzip the request?

I suppose that would be a good workaround for now unless there is a fix for the HTML5 backend side in sight.  Let me know if you plan to look into the HTML5 validator further in the coming few days or if you&apos;d like me to apply the workaround to validator now (I plan to tag a new validator release really soon).

Not gzipping the request has the benefit that it makes things easier to debug with tools such as wireshark, and I suppose it has some small performance benefits when traffic to the HTML5 validator goes over a very fast network connection such as a loopback interface.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59199</commentid>
    <comment_count>7</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-10-29 18:27:30 +0000</bug_when>
    <thetext>(In reply to comment #4)
&gt; -------------
&gt; $ curl --data-binary @utf-8-validation.html -H &quot;Content-Type: text/html&quot;
&gt; &quot;http://localhost:8888/?out=gnu&quot;

I&apos;ve tried that in my local environment and can reproduce the problem 100% of the time when I do. So the cause is not something specific to the W3C validator server environment. I don&apos;t know why it&apos;s not reproducible with http://validator.nu.

That said, I cannot reproduce the problem if I run the same curl command with the --data switch instead of the --data-binary switch; that is:

curl --data @utf-8-validation.html -H &quot;Content-Type: text/html&quot; &quot;http://localhost:8888/?out=gnu&quot;

The bug never happens if I run curl that way instead.

And in reading the curl docs, I&apos;m not sure why the --data-binary switch would be used in this case. The curl man page says, &quot;To post data purely binary, you should instead use the --data-binary option.&quot; So I can see why that switch would be needed for the gzipped case, since that&apos;s binary data. But in the case of the non-gzipped file, the data is not binary -- it&apos;s text.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59200</commentid>
    <comment_count>8</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-10-29 18:31:48 +0000</bug_when>
    <thetext>(In reply to comment #6)
&gt; (In reply to comment #5)
&gt; &gt; So should we consider always having the validator gzip the request?
&gt; 
&gt; I suppose that would be a good workaround for now unless there is a fix for the
&gt; HTML5 backend side in sight.  Let me know if you plan to look into the HTML5
&gt; validator further in the coming few days or if you&apos;d like me to apply the
&gt; workaround to validator now (I plan to tag a new validator release really
&gt; soon).

I&apos;ve pinged Henri about it, so let&apos;s at least wait until we hear back from him.

In the mean time, I have looked at it myself, but so far what I&apos;ve found is what I noted in comment #7 -- that the problem does not occur for the non-gzipped case if I use the --data switch with curl instead of the --data-binary switch.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59201</commentid>
    <comment_count>9</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2011-10-29 18:44:22 +0000</bug_when>
    <thetext>(In reply to comment #7)

&gt; And in reading the curl docs, I&apos;m not sure why the --data-binary switch would
&gt; be used in this case. The curl man page says, &quot;To post data purely binary, you
&gt; should instead use the --data-binary option.&quot;.

The curl man page is indeed pretty confusing wrt. what exactly --data does.  But it does say this: &quot;-d/--data is the same as --data-ascii.&quot; and then later for --data-binary &quot;Data is posted in a similar manner as --data-ascii does, except that newlines are preserved and conversions are never done.&quot;.  I don&apos;t think it&apos;s actually a matter of binary vs text, but rather posting as-is or with some conversions.

Not sure what conversions they mean other than something related to newlines, but what I have verified locally with wireshark is that --data-binary POSTs files as-is, as I want it to (and as the validator does), while --data on the other hand at least discards newlines, and probably also leading whitespace (which would mean problems with line and column numbers in results if the validator did that).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59208</commentid>
    <comment_count>10</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-10-30 01:08:21 +0000</bug_when>
    <thetext>(In reply to comment #9)
&gt; The curl man page is indeed pretty confusing wrt. what exactly --data does. 
&gt; But it does say this: &quot;-d/--data is the same as --data-ascii.&quot; and then later
&gt; for --data-binary &quot;Data is posted in a similar manner as --data-ascii does,
&gt; except that newlines are preserved and conversions are never done.&quot;.  I don&apos;t
&gt; think it&apos;s actually a matter of binary vs text, but rather posting as-is or
&gt; with some conversions.

Ah, OK. Yeah, after re-reading the curl man page, I understand now. That switch really ought to be called &quot;--data-as-is&quot; or something instead... 

&gt; Not sure what conversions they mean other than something related to newlines,
&gt; but I have verified locally with wireshark is that --data-binary POSTs files
&gt; as-is as I want it to (and like the validator does)

Yeah, I verified the same thing using the curl --trace option; e.g.,

curl --data-binary @utf-8-validation.html -H &quot;Content-Type: text/html&quot; --trace - &quot;http://localhost:8888/?out=gnu&quot;

And looking at the hex dump of that, I notice that the position at which it reports &quot;End of file seen&quot; (line 1254, column 13) is byte 0x2000 (decimal 8192, 8KB) of the last chunk of data in the post. And I then notice sort of the same thing that Adam mentions in comment #0 -- if I insert a different character before that position, then the behavior changes. But unlike Adam&apos;s case (where he seems to be saying that he still gets an error, just reported for a different character), I no longer get any error at all -- instead, the document validates as expected.

So, there&apos;s definitely something weird going on here. The fact that the error gets reported at exactly the 8KB mark really does make it seem like it&apos;s running into some kind of limit.

&gt; and --data on the other
&gt; hand at least discards newlines, probably also leading whitespace (which would
&gt; mean problems with line and column numbers in results if validator did that).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>59209</commentid>
    <comment_count>11</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-10-30 01:27:46 +0000</bug_when>
    <thetext>(In reply to comment #10)
&gt; And looking at the hex dump of that, I notice that the position at which it
&gt; reports &quot;End of file seen&quot; (line 1254, column 13) is byte 0x2000 (decimal 8096,
&gt; 8KB) of the last chunk of data in the post.

I should have also noted that the first three chunks are all 16KB, so that point in that file is also exactly byte 0xE000 (decimal 57,344, 56KB).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>60339</commentid>
    <comment_count>12</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2011-11-22 09:53:31 +0000</bug_when>
    <thetext>Sorry about the delay. This was a bug in the bytes to UTF-16 code units conversion loop. I&apos;ve pushed a fix:
https://hg.mozilla.org/projects/htmlparser/rev/70a48b922c95

Thanks for catching this!</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>60340</commentid>
    <comment_count>13</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-11-22 10:12:41 +0000</bug_when>
    <thetext>I pushed Henri&apos;s fix to validator.w3.org</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>60357</commentid>
    <comment_count>14</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2011-11-22 21:06:59 +0000</bug_when>
    <thetext>Thanks, Henri and Michael.  Michael, could you push the fix to qa-dev.w3.org&apos;s instance as well?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>60363</commentid>
    <comment_count>15</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-11-23 06:16:29 +0000</bug_when>
    <thetext>Hi Ville, 

I&apos;ve now pushed the fix to qa-dev as well</thetext>
  </long_desc>
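Henri&apos;s fix above landed in the htmlparser&apos;s bytes-to-UTF-16 conversion loop (Java source, linked in comment #12). The symptoms in this report -- a bogus error at exactly an 8KB buffer boundary that moves when text is inserted -- are characteristic of a decoder that loses state when a multibyte UTF-8 sequence straddles a fixed-size chunk. The following standalone Python sketch illustrates that class of bug only; it is not the actual fix, and the 8192-byte chunk size is taken from the observation in comment #10:

```python
# Illustration (not the actual htmlparser fix): a multibyte UTF-8
# character whose bytes straddle a fixed-size buffer boundary breaks
# a naive per-chunk decoder, while a stateful incremental decoder
# carries the partial byte sequence over to the next chunk.
import codecs

CHUNK = 8192  # the 8KB buffer size observed in comment #10

# Place a two-byte character (U+00E9) so it straddles the boundary.
text = "x" * (CHUNK - 1) + "\u00e9" + "y" * 10
data = text.encode("utf-8")
chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

# Naive approach: decode each chunk independently. The first chunk
# ends mid-character, so decoding raises a spurious error.
naive_failed = False
try:
    "".join(chunk.decode("utf-8") for chunk in chunks)
except UnicodeDecodeError:
    naive_failed = True

# Correct approach: an incremental decoder keeps the pending bytes
# between calls, so the split character decodes cleanly.
decoder = codecs.getincrementaldecoder("utf-8")()
decoded = "".join(decoder.decode(chunk) for chunk in chunks)
decoded += decoder.decode(b"", final=True)

assert naive_failed
assert decoded == text
```

The stateful decoder reproduces the input exactly, which matches the shape of the fix: the conversion loop must not treat a chunk boundary as end of input.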
      
      

    </bug>

</bugzilla>