This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Bug 10174 - Bogus error reported for UTF-8 characters in larger documents
Summary: Bogus error reported for UTF-8 characters in larger documents
Status: RESOLVED FIXED
Alias: None
Product: HTML Checker
Classification: Unclassified
Component: General
Version: unspecified
Hardware: All
OS: All
Importance: P2 normal
Target Milestone: ---
Assignee: Michael[tm] Smith
QA Contact: qa-dev tracking
URL: http://nisza.org/rendering_tests/utf-...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-15 15:32 UTC by Adam
Modified: 2015-08-23 07:07 UTC
CC List: 2 users

See Also:


Description Adam 2010-07-15 15:32:51 UTC
The provided test document is perfectly valid HTML5. It uses UTF-8 encoding, without a BOM, and is served as text/html; charset=utf-8. The validator reports 1 error, pointing at one of the UTF-8 characters. There is no pattern to which character gets flagged; the only constant is that it is a multibyte UTF-8 character. After inserting any text before the bogus error position, a different character is marked. I think it's a kind of overflow bug.
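
For readers without access to the original test file (the URL above is truncated in this archive), a similarly large, valid HTML5 document full of multibyte UTF-8 characters can be generated with a shell snippet along these lines; the file name, language, and filler text are illustrative only, not the reporter's actual document:

$ {
    printf '<!DOCTYPE html>\n<html lang="pl">\n<head><meta charset="utf-8"><title>UTF-8 test</title></head>\n<body>\n'
    # roughly 3000 repeated paragraphs of multibyte UTF-8 text,
    # enough to push the file well past the 56 KB mark discussed below
    yes '<p>Zażółć gęślą jaźń</p>' | head -n 3000
    printf '</body>\n</html>\n'
  } > utf-8-validation.html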
Comment 1 Ville Skyttä 2011-08-27 10:08:37 UTC
This seems to have something to do with the local validator.nu backend used by the markup validator. It is also reproducible on qa-dev.w3.org, which uses its own validator.nu instance as well (http://localhost:8888/html5/), but it is not reproducible on my local box, which uses http://validator.nu/; and if I configure the qa-dev markup validator to use http://validator.nu/ instead of http://localhost:8888/html5/, the problem goes away.
Comment 2 Michael[tm] Smith 2011-10-25 13:54:32 UTC
(In reply to comment #1 from Ville)
> This seems to have something to do with the local validator.nu backend used by
> the markup validator. It is also reproducible on qa-dev.w3.org, which uses its
> own validator.nu instance as well (http://localhost:8888/html5/),

I've not been able to reproduce this with any standalone version of the validator.nu backend -- that is, using it directly through its own Web UI rather than through the validator.w3.org UI and its Perl interface to the validator.nu backend.

Specifically, I have not been able to reproduce it using the Web UI at http://qa-dev.w3.org:8888 at all, despite trying for quite a while.

But I can reproduce it easily with the validator.w3.org frontend, typically on the first or second try.

So I can't find any problem to fix here in the validator.nu source itself. Whatever the problem is, it does not seem to occur within an instance of the validator.nu service itself -- that is, when a validator.nu backend communicates with its own frontend. The problem instead seems to occur only when a validator.nu backend communicates with an instance of the validator.w3.org backend+frontend.

So to me that suggests that the problem could instead be in the validator.w3.org Perl code that consumes data from a validator.nu backend, or could even be an OS problem or something in the local environments of the validator.w3.org hosts.

But I suppose it could also be a problem in the validator.nu REST interface, so I'll try that directly against a local validator.nu instance and see if I can reproduce it that way.
Comment 3 Michael[tm] Smith 2011-10-27 19:39:47 UTC
Ville,

So I have now tried this using curl with the REST interface to a validator.nu instance on one of the W3C validator hosts, but I have not been able to reproduce it.

The curl command I'm using is this:

curl -F file=@utf-8-validation.html http://localhost:8888

(where utf-8-validation.html is a local copy of the problem file)

That produces HTML output with a message saying, "The document validates according to the specified schema(s) and to additional constraints checked by the validator."

Also, just to get briefer output, I tried it with:

curl -F out=gnu -F file=@utf-8-validation.html http://localhost:8888

(that is, with the "gnu" output format specified)

That produces no error messages (just the two expected "info" messages).

So as near as I can tell thus far, the problem does not seem to be caused by anything in the validator.nu REST interface.
Comment 4 Ville Skyttä 2011-10-27 20:55:59 UTC
The validator does not use the form-based file upload interface of validator.nu; it POSTs the document as the request entity body:
http://wiki.whatwg.org/wiki/Validator.nu_POST_Body_Input

This has something to do with whether the request is gzipped or not.  It seems to always work if the request is gzipped, but not if it isn't (a gzipped example follows below).  Reproducing on qa-dev.w3.org (the validator Perl code uses out=xml; out=gnu is used here for readability):

-------------
$ curl --data-binary @utf-8-validation.html -H "Content-Type: text/html" "http://localhost:8888/?out=gnu"
: info: The Content-Type was 
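
For comparison, the gzipped variant, which does not fail for me, can be exercised with something like the following; this exact invocation is my reconstruction, relying on validator.nu's documented support for gzipped request bodies (Content-Encoding: gzip) described on the wiki page above:

$ gzip -c utf-8-validation.html | curl --data-binary @- -H "Content-Type: text/html" -H "Content-Encoding: gzip" "http://localhost:8888/?out=gnu"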
Comment 5 Michael[tm] Smith 2011-10-29 12:38:02 UTC
(In reply to comment #4)
> Whether validator's perl code gzips the request or not depends on whether debug
> mode is enabled, and whether the HTML5 validator seems to be on the same host
> as the validator; gzip is used if debug mode is off and the HTML5 validator
> appears to be non-local.  Since IIUC both validator.w3.org and qa-dev use local
> HTML5 validator instances, they end up always _not_ gzipping the request, thus
> triggering the problem.

So should we consider always having the validator gzip the request?

By the way, thanks very much for the details.
Comment 6 Ville Skyttä 2011-10-29 13:26:45 UTC
(In reply to comment #5)

> So should we consider always having the validator gzip the request?

I suppose that would be a good workaround for now unless there is a fix for the HTML5 backend side in sight.  Let me know if you plan to look into the HTML5 validator further in the coming few days or if you'd like me to apply the workaround to validator now (I plan to tag a new validator release really soon).

Not gzipping the request has the benefit of making things easier to debug with tools such as Wireshark, and I suppose it also has some small performance benefit when traffic to the HTML5 validator goes over a very fast network connection such as the loopback interface.
Comment 7 Michael[tm] Smith 2011-10-29 18:27:30 UTC
(In reply to comment #4)
> -------------
> $ curl --data-binary @utf-8-validation.html -H "Content-Type: text/html"
> "http://localhost:8888/?out=gnu"

I've tried that in my local environment and can reproduce the problem 100% of the time when I do. So the cause is not something specific to the W3C validator server environment. I don't know why it's not reproducible with http://validator.nu.

That said, I cannot reproduce the problem if I run the same curl command with the --data switch instead of the --data-binary switch; that is:

curl --data @utf-8-validation.html -H "Content-Type: text/html" "http://localhost:8888/?out=gnu"

The bug does not ever happen if I run curl that way instead.

And in reading the curl docs, I'm not sure why the --data-binary switch would be used in this case. The curl man page says, "To post data purely binary, you should instead use the --data-binary option." So I can see why that switch would be needed for the gzipped case, since that's binary data. But in the case of the non-gzipped file, the data is not binary -- it's text.
Comment 8 Michael[tm] Smith 2011-10-29 18:31:48 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > So should we consider always having the validator gzip the request?
> 
> I suppose that would be a good workaround for now unless there is a fix for the
> HTML5 backend side in sight.  Let me know if you plan to look into the HTML5
> validator further in the coming few days or if you'd like me to apply the
> workaround to validator now (I plan to tag a new validator release really
> soon).

I've pinged Henri about it, so let's at least wait until we hear back from him.

In the meantime, I have looked at it myself, but so far what I've found is what I noted in comment #7 -- that the problem does not occur for the non-gzipped case if I use the --data switch with curl instead of the --data-binary switch.
Comment 9 Ville Skyttä 2011-10-29 18:44:22 UTC
(In reply to comment #7)

> And in reading the curl docs, I'm not sure why the --data-binary switch would
> be used in this case. The curl man page says, "To post data purely binary, you
> should instead use the --data-binary option."

The curl man page is indeed pretty confusing wrt. what exactly --data does.  But it does say "-d/--data is the same as --data-ascii." and then later, for --data-binary, "Data is posted in a similar manner as --data-ascii does, except that newlines are preserved and conversions are never done."  I don't think it's actually a matter of binary vs. text, but rather of posting as-is vs. with some conversions.

I'm not sure what conversions they mean other than something related to newlines, but what I have verified locally with Wireshark is that --data-binary POSTs files as-is, as I want it to (and as the validator does), whereas --data at least discards newlines, and probably also leading whitespace (which would mean problems with the line and column numbers in the results if the validator did that).
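
A quick way to see the difference is to dump the outgoing request with curl's --trace-ascii; this is a hypothetical demonstration, with the file name and host illustrative:

$ printf 'one\ntwo\n' > nl.txt
$ curl --data @nl.txt --trace-ascii - -o /dev/null "http://localhost:8888/"
(the trace shows the entity body as "onetwo", i.e. with the newlines stripped)
$ curl --data-binary @nl.txt --trace-ascii - -o /dev/null "http://localhost:8888/"
(the trace shows the body byte-for-byte: "one", newline, "two", newline)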
Comment 10 Michael[tm] Smith 2011-10-30 01:08:21 UTC
(In reply to comment #9)
> The curl man page is indeed pretty confusing wrt. what exactly --data does.
> But it does say "-d/--data is the same as --data-ascii." and then later, for
> --data-binary, "Data is posted in a similar manner as --data-ascii does,
> except that newlines are preserved and conversions are never done."  I don't
> think it's actually a matter of binary vs. text, but rather of posting as-is
> vs. with some conversions.

Ah, OK. Yeah, after re-reading the curl man page, I understand now. That switch really ought to be called "--data-as-is" or something instead... 

> I'm not sure what conversions they mean other than something related to
> newlines, but what I have verified locally with Wireshark is that
> --data-binary POSTs files as-is, as I want it to (and as the validator does)

Yeah, I verified the same thing using the curl --trace option; e.g.,

curl --data-binary @utf-8-validation.html -H "Content-Type: text/html" --trace - "http://localhost:8888/?out=gnu"

And looking at the hex dump of that, I notice that the position at which it reports "End of file seen" (line 1254, column 13) is byte 0x2000 (decimal 8192, i.e. 8 KB) of the last chunk of data in the POST. I then noticed much the same thing that Adam mentions in the original report -- if I insert a different character before that position, the behavior changes. But unlike in Adam's case (where he seems to be saying that he still gets an error, just reported for a different character), I no longer get any error at all -- instead, the document validates as expected.

So, there's definitely something weird going on here. The fact that the error gets reported at exactly the 8 KB mark really does make it seem like it's running into some kind of limit.

Comment 11 Michael[tm] Smith 2011-10-30 01:27:46 UTC
(In reply to comment #10)
> And looking at the hex dump of that, I notice that the position at which it
> reports "End of file seen" (line 1254, column 13) is byte 0x2000 (decimal 8192,
> i.e. 8 KB) of the last chunk of data in the POST.

I should have also noted that the first three chunks are all 16 KB, so that point in the file is also exactly byte 0xE000 (decimal 57,344, i.e. 56 KB: three 16 KB chunks plus 8 KB into the fourth).
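
To double-check which bytes sit at that offset in the file, something like this hypothetical invocation works (57344 being the decimal form of 0xE000):

$ dd if=utf-8-validation.html bs=1 skip=57344 count=8 2>/dev/null | xxd

If a multibyte UTF-8 sequence happens to straddle that buffer boundary, a conversion loop that consumes input in fixed-size chunks could split the sequence and misreport it, which would fit the behavior seen here.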
Comment 12 Henri Sivonen 2011-11-22 09:53:31 UTC
Sorry about the delay. This was a bug in the bytes-to-UTF-16 code units conversion loop. I've pushed a fix:
https://hg.mozilla.org/projects/htmlparser/rev/70a48b922c95
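
For anyone who wants to inspect the change locally, something along these lines should work; the revision hash is taken from the URL above:

$ hg clone https://hg.mozilla.org/projects/htmlparser/
$ hg -R htmlparser log -p -r 70a48b922c95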

Thanks for catching this!
Comment 13 Michael[tm] Smith 2011-11-22 10:12:41 UTC
I pushed Henri's fix to validator.w3.org.
Comment 14 Ville Skyttä 2011-11-22 21:06:59 UTC
Thanks, Henri and Michael.  Michael, could you push the fix to qa-dev.w3.org's instance as well?
Comment 15 Michael[tm] Smith 2011-11-23 06:16:29 UTC
Hi Ville, 

I've now pushed the fix to qa-dev as well.