This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 4586 - LWP retrieves empty page
Summary: LWP retrieves empty page
Status: RESOLVED INVALID
Alias: None
Product: Validator
Classification: Unclassified
Component: check
Version: 0.8.0b1
Hardware: PC Windows 2000
Importance: P2 critical
Target Milestone: ---
Assignee: Olivier Thereaux
QA Contact: qa-dev tracking
URL: http://www.athrasoft.com
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-05-26 18:08 UTC by Paulo Fran
Modified: 2007-10-24 03:42 UTC

Description Paulo Fran 2007-05-26 18:08:03 UTC
Since older versions of the Markup Validator, I've been facing a big problem.
Again, in this beta 0.8, the problem remains.

The validator works fine both via "File Upload" and "Direct Input", but FAILS when fired by referer ("uri=?referer"). It also fails when the complete URL is passed as a parameter.

The complete error message:
-------------------------------------------------
 This page is not Valid (no Doctype found)!
 Error Line 1 column 0: end of document in prolog.
-------------------------------------------------

My XHTML 1.0 Transitional page is hosted on a site that redirects to an IP-based one. Whether I enter the URL of the first one (domain-based) or the IP-based one (using my current IP), the validation fails.

However, the validator at http://www.validome.org is able to parse the doc via URL Parameter, without any problem. See:

BOTH FAIL:
  http://validator.w3.org/check?uri=http://athrasoft.com/en/Home.htm
  http://validator-test.w3.org/check?uri=http://athrasoft.com/en/Home.htm
BOTH WORK FINE:
  http://www.validome.org/validate/?uri=http://athrasoft.com/en/Home.htm
  http://www.validome.org/get/http://athrasoft.com/en/Home.htm

The redirection is correctly managed by W3C's validator, but the result is always "No DocType Found". Even if I enter the IP-based URL as the target, I get the error.

My DocType is correct, as well as the whole page.
Any clue?

Thanks in advance.
Paulo França
  http://www.athrasoft.com
Comment 1 Paulo Fran 2007-05-26 19:37:32 UTC
Just discovered this also works fine:
  http://www.validome.org/referer

So why doesn't W3C's?

The link above is now on every page in my site.
Just fire it from there and see it's working.

This proves there's no problem with my DocType, nor with calling the validator by referer. What happens to W3C's validator when checking a URL?

Thanks.
Paulo França
Comment 2 Olivier Thereaux 2007-05-27 00:25:36 UTC
(In reply to comment #0)
> The validator works fine both via "File Upload" and "Direct Input", but FAILS
> when fired by referer ("uri=?referer"). It also fails when the complete URL is
> passed as a parameter.
> 
> The complete error message:
> -------------------------------------------------
>  This page is not Valid (no Doctype found)!
>  Error Line 1 column 0: end of document in prolog.
> -------------------------------------------------

Apparently this has nothing to do with the document type. If you turn on the "show page source" option, you will see that the validator is given an empty document to validate (and hence, does not find a doctype - or any content for that matter).

I traced the issue down to libwww-perl, the (usually extremely robust) library the validator uses to retrieve documents on the web. When using this library, one gets empty content back from your server.

Not sure what is going on between LWP and your server, but this is a very rare case. Possibly, the fact that your server is sending two "Content-Type" HTTP headers could be confusing LWP, but I'm not sure about it yet.
Comment 3 Paulo Fran 2007-05-27 07:37:02 UTC
Hi, Olivier.
 
> Apparently this has nothing to do with the document type. If you turn on the
> "show page source" option, you will see that the validator is given an empty
> document to validate (and hence, does not find a doctype - or any content for
> that matter).

Yes, I noticed that.
\\:^.

--o-o-o-o-o-o-o--

> Not sure what is going on between LWP and your server, but this is a very rare
> case. Possibly, the fact that your server is sending two "Content-Type" HTTP
> headers could be confusing LWP, but I'm not sure about it yet.

The server has been set up and is maintained by myself, so I'm free to do any
modifications needed to solve the issue.

As for the headers, I tested them by using some online header-readers and saw
nothing unusual. If you look at the page about browser compliance, there's a screenshot available that proves the pages are being rendered correctly by a
number of browsers:
   http://www.athrasoft.com/en/Compliance.htm

I have set up the HTTP Headers (on IIS/Win2k) so that it serves two different
ones:
1) UTF-8 for the forums root folder, otherwise phpbb 3 messes up
   accented characters. This is working fine this way.
2) Windows-1252 (the same as in the pages' meta statements), for English/Portuguese
   pages on the web site. Also working fine.

I tested W3C's validator by serving the pages with several different
HTTP headers (UTF-8, ISO-8859-1, etc.) and also with none at all - nothing works.

I am yet to try reading the headers from a Delphi program for which I got the
source code - ICS from François Piette. I'm going to change one of its sample
projects so that it traces the read of server headers. I'll let you know if I
discover something weird (as the duplicate header you mentioned).

For now, thanks for replying so fast.
Very good this bug tracker!

Best regards,
Paulo França
Comment 4 Paulo Fran 2007-05-27 08:24:10 UTC
(In reply to comment #3)

> I am yet to try reading the headers from a Delphi program for which I got the
> source code - ICS from François Piette. I'm going to change one of its sample
> projects so that it traces the read of server headers. I'll let you know if I
> discover something weird (as the duplicate header you mentioned).

Ok, I have just done the test.
This is the header data received by the tool from my server:
----------------------------------------------------------------------
 HTTP/1.1 200 OK
 Server: Microsoft-IIS/5.0
 Content-Type: text/html; charset=Windows-1252
 Cache-Control: no-cache
 Expires: Sun, 27 May 2007 07:41:22 GMT
 Date: Sun, 27 May 2007 07:41:22 GMT
 Content-Type: text/html
 Accept-Ranges: bytes
 Last-Modified: Sat, 26 May 2007 20:28:28 GMT
 ETag: "0fef96ad49fc71:892"
 Content-Length: 15633
 StatusCode = 200
----------------------------------------------------------------------
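A header dump like the one above can be checked mechanically for repeated fields. The following is only a minimal sketch in Python: the function name repeated_header_fields is made up for illustration, and the input is simply the dump pasted as a string.

```python
# Header block as reported in this comment, pasted verbatim.
RAW_HEADERS = """\
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Content-Type: text/html; charset=Windows-1252
Cache-Control: no-cache
Expires: Sun, 27 May 2007 07:41:22 GMT
Date: Sun, 27 May 2007 07:41:22 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Sat, 26 May 2007 20:28:28 GMT
ETag: "0fef96ad49fc71:892"
Content-Length: 15633
"""


def repeated_header_fields(raw_headers):
    """Return {field-name: [values]} for fields that occur more than once.

    Hypothetical helper for illustration; field names are lower-cased
    because HTTP header names are case-insensitive.
    """
    values = {}
    for line in raw_headers.splitlines()[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        values.setdefault(name.strip().lower(), []).append(value.strip())
    return {name: vals for name, vals in values.items() if len(vals) > 1}


print(repeated_header_fields(RAW_HEADERS))
```

For this dump the only repeated field is Content-Type, once with a charset parameter and once without.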

Ok, the "Content-Type" was partially duplicated, so I changed IIS so that
it does not send that header entry. As result, the "duplication" is gone,
and only the "Content-Type: text/html" entry is returned (although not
sent by IIS). Even so, W3C's validator KEEPS going crazy, and
Validome starts returning an error too.

Then I changed back to the original HTTP Header at server. At least
Validome works fine that way.

If the partially duplicated "Content-Type" was the cause, it would
have worked when I removed it, but it didn't.

Back to point zero again.
\\:^)
Comment 5 Terje Bless 2007-05-27 19:33:33 UTC
Your server is borked.

Apart from what's evident in the trace below, the server also times out way too quickly on input. I'd venture a guess that your server is either running custom IIS plugins or is deferring far too much to homebrew CGI or ASP type code.

[[[
  $ telnet athrasoft.com http
  Trying 216.98.141.250...
  Connected to athrasoft.com.
  Escape character is '^]'.
  GET /en/Home.htm HTTP/1.0
  Host: athrasoft.com
  
  HTTP/1.0 301 Found
  Server: Apache
  Status: 301 Found
  Expires: Mon, 28 May 2007 19:25:08 GMT
  Date: Sun, 27 May 2007 19:25:08 GMT
  location: http://200.222.134.220:8081/en/Home.htm
  
  Connection closed by foreign host.
  
  $ telnet 200.222.134.220 8081
  Trying 200.222.134.220...
  Connected to 200222134220.user.veloxzone.com.br.
  Escape character is '^]'.
  GET /en/Home.htm HTTP/1.0
  Host: athrasoft.com
  
  HTTP/1.1 200 OK
  Server: Microsoft-IIS/5.0
  Content-Type: text/html; charset=Windows-1252
  Cache-Control: no-cache
  Expires: Sun, 27 May 2007 19:24:58 GMT
  Date: Sun, 27 May 2007 19:24:58 GMT
  Content-Type: text/html
  Accept-Ranges: bytes
  Last-Modified: Sat, 26 May 2007 20:28:28 GMT
  ETag: "0fef96ad49fc71:893"
  Content-Length: 15633
  
  Connection closed by foreign host.
  
  $ 
]]]
Comment 6 Paulo Fran 2007-05-27 23:45:40 UTC
(In reply to comment #5)

> Your server is borked.

First of all, my server's IP has changed now to
201.79.145.166 , but at that time it was
200.222.134.220 .

Just took this picture (before changing the ip):
  http://img518.imageshack.us/img518/4713/validomeau5.png

Sincerely, I can't see anything wrong in the trace you posted.
If you use the domain-based url, it leads you to my IP-based server.

 athrasoft.com = 216.98.141.250:80 (domain monger redirector)
 -> 200.222.134.220:8081 (my IIS/5.0 server)

And your telnet session has reached my server normally, and has
received the server's header as any other header reader currently does.

The W3C's validator is correctly following the redirection. This is not
the problem.
Even if you type in the final url ( http://200.222.134.220:8081/en/Home.htm )
directly in validator's uri field, the validator fails (W3C's, not Validome).

I didn't see any evidence of the "bork" in the trace you posted.


>  $ telnet athrasoft.com http
>  Trying 216.98.141.250...
>  Connected to athrasoft.com.
>  Escape character is '^]'.
>  GET /en/Home.htm HTTP/1.0
>  Host: athrasoft.com
>
>  HTTP/1.0 301 Found
>  Server: Apache
>  Status: 301 Found
>  Expires: Mon, 28 May 2007 19:25:08 GMT
>  Date: Sun, 27 May 2007 19:25:08 GMT
>  location: http://200.222.134.220:8081/en/Home.htm

So at this point the redirection completed fine.


>  $ telnet 200.222.134.220 8081
>  Trying 200.222.134.220...
>  Connected to 200222134220.user.veloxzone.com.br.
>  Escape character is '^]'.
>  GET /en/Home.htm HTTP/1.0
>  Host: athrasoft.com
>
>  HTTP/1.1 200 OK

It says "OK", and "200222134220.user.veloxzone.com.br" corresponds
to my server's ip (at that time).


>  HTTP/1.1 200 OK
>  Server: Microsoft-IIS/5.0
>  Content-Type: text/html; charset=Windows-1252
>  Cache-Control: no-cache
>  Expires: Sun, 27 May 2007 19:24:58 GMT
>  Date: Sun, 27 May 2007 19:24:58 GMT
>  Content-Type: text/html
>  Accept-Ranges: bytes
>  Last-Modified: Sat, 26 May 2007 20:28:28 GMT
>  ETag: "0fef96ad49fc71:893"
>  Content-Length: 15633
>
>  Connection closed by foreign host.

So the whole header has been reached.


> the server also times out way too quickly on input.

In fact, a traceroute from some locations takes some
time to complete, and in some cases the traceroute is aborted.

It seems the problem is really a bottleneck somewhere in the
path. Validome takes 5-8 seconds to show the report; maybe it's
closer to Brazil, dunno.


> I'd venture a guess that your server is either running custom
> IIS plugins or is deferring far too much to homebrew CGI or ASP
> type code.

No plugin, no ASP at all, just serving plain html pages.

thanks for your effort, Olivier.
I think we have found the problem: the short timeout from W3C's
validator.

I remember being able to traceroute my server from the USA a few days ago,
but today it's really being a pain. I'll try again in the next
few days to see if something changes in this regard.

Thanks again.
Comment 7 Paulo Fran 2007-05-27 23:59:08 UTC
Just a note:

W3C's CSS Validator is validating my CSS (in the same "/en/Home.htm" page)
quickly and via URI. The same for the Link Checker: working fine!

The CSS and link validators are hosted in a subdomain of W3C, so my previous
guess about the traceroute was wrong!

WORKING:
  http://jigsaw.w3.org/css-validator/validator?warning=no&profile=css3&usermedium=all&uri=http://athrasoft.com/en/Home.htm
  http://validator.w3.org/checklink?hide_redirects=on&hide_type=all&check=Check&uri=http://athrasoft.com/en/Home.htm

See? There's something wrong with the Markup Validator, indeed.
Comment 8 Terje Bless 2007-05-28 21:11:05 UTC
(In reply to comment #6)
>>Connection closed by foreign host.
>
> So the whole header has been reached.

Actually, no, that's kind of the point... :-)

The server returns the headers, but not the actual content; in spite of emitting something that looks plausible in the Content-Length header field. At that point in the output I should have gotten pretty much the same thing you'd see if you did View Source in a browser, and not Connection closed by foreign host.
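The mismatch described here (a plausible Content-Length followed by no body at all) can be detected from the raw bytes of a response. This is a minimal Python sketch; check_body_length is a hypothetical helper, and the sample response is a shortened version of the headers from the trace in comment 5.

```python
def check_body_length(raw_response):
    """Compare the declared Content-Length with the body bytes received.

    Returns (declared, received); declared is None when the header is
    absent. Hypothetical helper, assuming an unchunked response.
    """
    head, _, body = raw_response.partition(b"\r\n\r\n")
    declared = None
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-length":
            declared = int(value.strip())
    return declared, len(body)


# Headers only, then the connection closes: the case seen in the trace.
truncated = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Microsoft-IIS/5.0\r\n"
    b"Content-Length: 15633\r\n"
    b"\r\n"
)

declared, received = check_body_length(truncated)
if declared is not None and received < declared:
    print(f"body truncated: expected {declared} bytes, got {received}")
```

A client that receives zero bytes where 15633 were announced has nothing to parse, which is consistent with the validator reporting an empty document.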
Comment 9 Paulo Fran 2007-05-28 21:43:07 UTC
(In reply to comment #8)

> The server returns the headers, but not the actual content...

Could you please test it again?
At the time I read your previous post, I looked at IIS and realised I would have to restart it (this sometimes happens to IIS, 1 or 2 times a week).
After doing so, the server got back to its normal state.

Despite your next tests (in case you are willing to do so), the fact is the same W3C is able to read my contents (CSS and all links), so how come the contents cannot be read? How is W3C's Link Checker able to trace every link on my pages without reading their contents first?!  \\8^*

Not to mention that, besides W3C's CSS and Link Checker, Validome too reads my page's contents (as do my site visitors).  \\;^)
Comment 10 Paulo Fran 2007-05-28 21:51:07 UTC
As I mentioned in my latest post about Link Checker...
  http://img253.imageshack.us/img253/7476/linkcheckerdv4.png
  (just took the snapshot)

So Link Checker is able to read my contents but the Markup Validator is not?!
\\:^.
Comment 11 Terje Bless 2007-05-28 22:40:00 UTC
(In reply to comment #9)
> Could you please test it again?

Sure. I checked a little more in-depth and it looks like the difference is caused by the value of the 'Connection' HTTP header. When it's 'close' the server fails to return a body, when the value is 'keep-alive' it performs as expected.

In the below trace I have elided (marked with []) irrelevant bits for clarity:

[[[
  $ telnet 201.79.145.166 8081 
  GET /en/Home.htm HTTP/1.0
  Host: athrasoft.com
  Connection: close
  
  HTTP/1.1 200 OK
  []
  Content-Length: 15617
  
  Connection closed by foreign host.
  
  $ telnet 201.79.145.166 8081 
  GET /en/Home.htm HTTP/1.0
  Host: athrasoft.com
  Connection: keep-alive
  
  HTTP/1.1 200 OK
  []
  Content-Length: 15617
  
  <!DOCTYPE html []>
  []
  <title>&nbsp;AthraSoft Components &bull; Home of SmartPlugin&nbsp;</title>
  []
]]]
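The two telnet sessions above can also be reproduced from a script. What follows is only a sketch in Python: build_request and fetch_raw are made-up names, and the address in the usage note is the one from the trace, which may no longer be reachable.

```python
import socket


def build_request(path, host, connection):
    """Build the same HTTP/1.0 request that was typed into telnet."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        f"Connection: {connection}",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(address, port, request, timeout=10.0):
    """Send raw request bytes and return everything the server sends back."""
    chunks = []
    with socket.create_connection((address, port), timeout=timeout) as sock:
        sock.sendall(request)
        while True:
            try:
                chunk = sock.recv(4096)
            except socket.timeout:
                break  # a keep-alive server may leave the connection open
            if not chunk:
                break  # server closed the connection
            chunks.append(chunk)
    return b"".join(chunks)


print(build_request("/en/Home.htm", "athrasoft.com", "close").decode("ascii"))
```

Calling fetch_raw("201.79.145.166", 8081, ...) once with connection="close" and once with connection="keep-alive", then comparing the number of bytes after the blank line, would show the same difference as the trace, assuming the server still behaved that way.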

Comment 12 Paulo Fran 2007-05-28 22:52:09 UTC
(In reply to comment #11)

> Sure. I checked a little more in-depth and it looks like the difference is
> caused by the value of the 'Connection' HTTP header. When it's 'close' the
> server fails to return a body, when the value is 'keep-alive' it performs as
> expected.

Thanks, Terje.

In fact, the "keep-alive" is crucial for the redirection to succeed. However, even though it's turned on, W3C's Markup Validator keeps saying no doctype has been found (and even when typing in the IP uri to dismiss the redirection). Duh!

It's a pity the Markup Validator does not use telnet.  \\;^)

Only those guys who developed the markup, link and css validators are able to determine why on earth the markup validator is unable to read my contents.
Such "feature" has been around for years!

Thank you for your good will.
Comment 13 Terje Bless 2007-05-28 23:18:13 UTC
(In reply to comment #12)
> In fact, the "keep-alive" is crucial for the redirection to succeed.

Just to make sure I'm being clear... That the server is behaving differently (by not returning the document data) when the Connection header field is set to 'close' is a bug in the server and not a problem with the Validator.

Both of the test cases I quoted should have behaved identically up to that point.

HTTP Keep-Alive should only have had any effect _after_ the full document was returned to the client.


> Only those guys who developed the markup, link and css validators are able to
> determine why on earth the markup validator is unable to read my contents.

The Markup Validator, quite correctly, simply happens to not make use of the (optional) 'Keep-Alive' feature of HTTP, and, it would appear, the two other tools cited happen to make use of it. That the server behaves differently for the two cases is a bug (incorrect behavior) in the server (IIS 5.0).

Comment 14 Paulo Fran 2007-05-28 23:46:14 UTC
(In reply to comment #13)

> ...
> The Markup Validator, quite correctly, simply happens to not make use of the
> (optional) 'Keep-Alive' feature of HTTP, and, it would appear, the two other
> tools cited happen to make use of it. That the server behaves differently for
> the two cases is a bug (incorrect behavior) in the server (IIS 5.0).

Ok, I got it now.

But if the other two tools make use of the (optional) "keep-alive" feature, why doesn't the Markup Validator do the same? This would solve the problem not only for me, but for all other sites using IIS 5.

The CSS Validator loads my contents fast, so using that feature wouldn't be that time-consuming. Would it be too hard to change that in the validator code?

Notice I am comparing three tools from W3C, where two of them make use of the
keep-alive and one doesn't. If this one that doesn't is facing problems, then it should do it too, or not?!

--o-o-o-o-o-o--

> The Markup Validator, quite correctly, simply happens to not make use...

Why "quite correctly"? How would the use of the keep-alive flag hurt the validator?

Sorry for my lack of knowledge: I've never developed a validator in my life. I'm just a Delphi programmer trying to self-host a server (tired of remote ones). \\:^)

And thank you very much for your patience. I'm taking the opportunity to learn a bit more about servers, headers, and related subjects.
Comment 15 Terje Bless 2007-05-29 09:24:46 UTC
(In reply to comment #14)
> But if the other two tools make use of the (optional) "keep-alive" feature, why
> doesn't the Markup Validator do the same? This would solve the problem not only
> for me, but for all other sites using IIS 5.

We're not aware of any general problem with the Markup Validator and IIS 5.0. This appears to be a local configuration issue with your particular installation; either due to how it is configured or due to an ISAPI filter or...

The Markup Validator has no need to use HTTP Keep Alive because it only requests a single document from the remote server. HTTP Keep Alive is an optional method to optimize retrieval of multiple documents from the same server within a short period of time. Both the CSS Validator and the Link Checker have need to fetch multiple documents to satisfy the request, so for these tools HTTP Keep Alive is an appropriate optimization technique to use.
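To illustrate what Keep Alive buys a client that fetches several documents, here is a small self-contained Python sketch. It is not how any of the W3C tools are implemented; it starts a throwaway local server (RecordingHandler and distinct_connections are hypothetical names) and counts how many TCP connections two requests end up using.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Local test server that notes the client port of every request."""

    protocol_version = "HTTP/1.1"  # lets the server keep connections open
    seen_client_ports = []

    def do_GET(self):
        RecordingHandler.seen_client_ports.append(self.client_address[1])
        body = b"<!DOCTYPE html><title>demo</title>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def distinct_connections(reuse_connection):
    """Fetch two documents and return how many TCP connections were used."""
    RecordingHandler.seen_client_ports = []
    server = HTTPServer(("127.0.0.1", 0), RecordingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        if reuse_connection:
            # One persistent connection carries both requests.
            conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
            for path in ("/a.css", "/b.css"):
                conn.request("GET", path)
                conn.getresponse().read()
            conn.close()
        else:
            # A fresh connection per document, closed after each response.
            for path in ("/a.css", "/b.css"):
                conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
                conn.request("GET", path, headers={"Connection": "close"})
                conn.getresponse().read()
                conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return len(set(RecordingHandler.seen_client_ports))


print("connections with keep-alive:", distinct_connections(True))
```

With a single document there is nothing to reuse the connection for, which is the point made above about the Markup Validator.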

Comment 16 Paulo Fran 2007-05-29 15:35:41 UTC
(In reply to comment #15)

> We're not aware of any general problem with the Markup Validator and IIS 5.0.

Well, now you are.  \\:^)

--o-o-o-o-o-o-o--

> This appears to be a local configuration issue with your particular
> installation...

It seems W3C's Markup Validator is quite sensitive in this regard.
My server is reachable and is also content-readable all over the web (except for
that tool).

--o-o-o-o-o-o-o--

> ...either due to how it is configured or due to an ISAPI filter or...

No filter at all. All default filters removed other than php interpreter (used by my web-based community).

--o-o-o-o-o-o-o--

> The Markup Validator has no need to use HTTP Keep Alive because it only
> requests a single document from the remote server. HTTP Keep Alive is an
> optional method to optimize retrieval of multiple documents from the same
> server within a short period of time. Both the CSS Validator and the Link
> Checker have need to fetch multiple documents to satisfy the request, so for
> these tools HTTP Keep Alive is an appropriate optimization technique to use.

Understood. That seems logical.
However, making it respect the keep-alive flag wouldn't hurt.

--o-o-o-o-o-o-o--

Terje, I'd like to thank you and to Olivier for all time spent on this.
I'll drop it and leave it as-is.

I have a feeling that, if my server was behind some *$oft company or similar, this "feature" in the validator would deserve more attention - I mean inside the code itself. Still don't see any inconvenience in respecting the keep-alive.

After all, I have managed to make the guys see that there is a bad side-effect of the validator's decision to not take that flag into account.
Also have learned a lot with you.

Thanks again and sorry for any inconvenience.

Best regards,
Paulo França
Comment 17 Paulo Fran 2007-05-29 16:22:53 UTC
Back again!  \\;^)
Just an interesting detail...

THIS WORKS:
  http://validator.w3.org/check?uri=http://www.athrasoft.com/cgi-bin/Environ.cgi

The dynamically created page above works both with W3C and Validome.
It's a Delphi console application (.exe renamed to .cgi) made by myself.

In the very beginning of the application I write this to the console:
  Content-Type: text/html; charset=windows-1252

The rest (page header and contents) is the same as in all my static (.htm) pages. Now I think I'll go nuts!!

The final server-generated headers (reported by Validome) differ between static and dynamic pages:

STATIC (.htm):
  HTTP/1.1 200 OK
  Server: Microsoft-IIS/5.0
  Content-Type: text/html; charset=Windows-1252
  Cache-Control: no-cache
  Expires: Tue, 29 May 2007 16:11:59 GMT
  Content-Location: http://201.79.145.166:8081/Default.htm
  Date: Tue, 29 May 2007 16:11:59 GMT
  Content-Type: text/html
  Accept-Ranges: bytes
  Last-Modified: Tue, 29 May 2007 00:19:10 GMT
  ETag: "01347fa86a1c71:893"
  Content-Length: 1618

DYNAMIC (.cgi):
  HTTP/1.1 200 OK
  Server: Microsoft-IIS/5.0
  Date: Tue, 29 May 2007 16:12:03 GMT
  Connection: close
  Content-Type: text/html; charset=Windows-1252  <<-- my cgi did this !!

Maybe something in the server headers for the static pages holds things that it shouldn't. I'll investigate it better when I have the courage. \\;^)
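The two header sets above can be compared field by field. This is just a rough Python sketch; field_names is a made-up helper, and the two strings are the dumps from this comment with the inline annotation on the last line removed.

```python
STATIC_HEADERS = """\
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Content-Type: text/html; charset=Windows-1252
Cache-Control: no-cache
Expires: Tue, 29 May 2007 16:11:59 GMT
Content-Location: http://201.79.145.166:8081/Default.htm
Date: Tue, 29 May 2007 16:11:59 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Tue, 29 May 2007 00:19:10 GMT
ETag: "01347fa86a1c71:893"
Content-Length: 1618
"""

DYNAMIC_HEADERS = """\
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Tue, 29 May 2007 16:12:03 GMT
Connection: close
Content-Type: text/html; charset=Windows-1252
"""


def field_names(raw_headers):
    """Return the set of lower-cased header field names in a dump."""
    names = set()
    for line in raw_headers.splitlines()[1:]:  # skip the status line
        name, sep, _ = line.partition(":")
        if sep:
            names.add(name.strip().lower())
    return names


only_static = field_names(STATIC_HEADERS) - field_names(DYNAMIC_HEADERS)
only_dynamic = field_names(DYNAMIC_HEADERS) - field_names(STATIC_HEADERS)
print("only in static response: ", sorted(only_static))
print("only in dynamic response:", sorted(only_dynamic))
```

The comparison only lists which fields differ; it does not by itself say which of them, if any, is related to the empty body.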

Kind regards,
Paulo.
Comment 18 Terje Bless 2007-05-29 17:14:49 UTC
(In reply to comment #17)
> THIS WORKS: [] /cgi-bin/Environ.cgi

Yes, this script works regardless of whether the client requested a keep-alive connection or not. IOW, it behaves differently from the previous test case ('/en/Home.htm').
Comment 19 Paulo Fran 2007-05-29 18:09:50 UTC
(In reply to comment #18)

> Yes, this script works regardless of whether the client requested a keep-alive
> connection or not. IOW, it behaves differently from the previous test case
> ('/en/Home.htm').

Do you know which headers specifically the W3C Markup Validator looks at?

Unfortunately, there isn't a "show server headers" option (as in Validome), so one cannot guess what exactly is being taken into account regarding server headers. That would be very useful for diagnostics.

In the near future, I plan on recreating my site by providing dynamic pages, mainly for serving the pages with the more appropriate Content Type. For instance, "application/xhtml+xml" for validators and those browsers which correctly support it, and "text/html" for the rest (mainly legacy browsers), based on this (very interesting) report:
  http://www.w3.org/People/mimasa/test/xhtml/media-types/results

As such, knowing which headers the W3C Markup Validator pays attention to, I will be better prepared for setting up the dynamic site.
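The kind of negotiation described above is usually driven by the request's Accept header. Below is a minimal Python sketch of one possible selection rule; parse_accept and choose_media_type are hypothetical names, and the rule itself (serve XHTML only when the client lists it explicitly and rates it at least as high as HTML) is an assumption, not something the validator prescribes.

```python
def parse_accept(accept_header):
    """Return {media-range: q-value} parsed from an Accept header value."""
    prefs = {}
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            key, sep, value = param.partition("=")
            if sep and key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        prefs[media_range] = q
    return prefs


def choose_media_type(accept_header):
    """Pick application/xhtml+xml only for clients that explicitly list it."""
    prefs = parse_accept(accept_header or "")
    xhtml_q = prefs.get("application/xhtml+xml", 0.0)
    html_q = prefs.get("text/html", prefs.get("text/*", prefs.get("*/*", 0.0)))
    if xhtml_q > 0 and xhtml_q >= html_q:
        return "application/xhtml+xml"
    return "text/html"


print(choose_media_type("text/html,application/xhtml+xml,*/*;q=0.8"))
print(choose_media_type("text/html, */*"))
```

A server that varies its response this way would normally also send "Vary: Accept" so that caches keep the two variants apart.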

Thanks,
Paulo.
Comment 20 Olivier Thereaux 2007-06-01 05:57:29 UTC
Closing bug, as I gather the problem was IIS misconfiguration.

Discussion on negotiation would probably be better on www-validator mailing-list, as it is becoming off-topic for this particular bug report.

Thank you!
Comment 21 Olivier Thereaux 2007-10-24 03:42:26 UTC
*** Bug 5222 has been marked as a duplicate of this bug. ***