This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 978 - systematic xml preparse mode triggers wrong parse mode for xml documents with broken xml declaration
Alias: None
Product: Validator
Classification: Unclassified
Component: check
Version: 0.6.7
Hardware: PC Windows XP
Importance: P3 blocker
Target Milestone: 0.8.0
Assignee: Olivier Thereaux
QA Contact: qa-dev tracking
Depends on:
Reported: 2004-12-30 11:30 UTC by Philipp Lucas
Modified: 2007-03-28 01:48 UTC

See Also:


Description Philipp Lucas 2004-12-30 11:30:27 UTC
I am trying to convert an HTML 4.01 Strict web page to XHTML. My first attempt
apparently failed, but the validator does not tell me why. I stripped most of
the page's content, so that it now consists only of an XHTML skeleton.

"This page is not Valid !
Below are the results of attempting to parse this document with an SGML parser."

... and that's it. Surely the validator should tell me the problem.

The page in question is
Comment 1 Bj 2004-12-30 15:27:10 UTC
<?xml version="1.0" encoding="UTF-8">

must be

<?xml version="1.0" encoding="UTF-8"?>
Comment 2 Terje Bless 2005-02-04 13:39:22 UTC
While we don't catch this error specifically, the current development code will,
due to other code changes, no longer produce empty results.
Comment 3 Olivier Thereaux 2005-10-20 06:16:41 UTC
Update: the latest release (0.7.1) still returns an empty result, while the
current development code (HEAD, 0.8-dev) reports errors at lines
beyond the end of the validated document.
Comment 4 Olivier Thereaux 2007-02-22 05:56:17 UTC
renaming, raising priority.
Comment 5 Olivier Thereaux 2007-03-22 09:12:10 UTC
Checking the page with the ;debug option is useful in understanding what's happening.

* an XHTML document is sent as text/html (curse the day text/html was said to be OK for XHTML...)
* the parse mode is set to TBD 
* preparse looks at document
  - by default HTML::Parser was set to XML mode
  - pre-parsing cannot find end of XML declaration, and thus parses the whole doc as if...
  - the doctype cannot be found
* as a result, XML mode is NOT triggered
* OpenSP is launched in SGML mode
* OpenSP parses the XML DTD as an SGML DTD and complains
* errors are reported in the DTD (which is why it looks as though errors are reported in the document, but at odd line numbers).

FIX: run the pre-parser in XML mode only if the content type has unambiguously shown that we should do so.
In the case of text/html, cautiously use SGML pre-parsing. Finding an XHTML document type will later trigger XML mode in the actual parser and validator.

my $p = HTML::Parser->new(api_version => 3);

- $p->xml_mode(TRUE);

+ # if content-type has shown we should pre-parse with XML mode, use that
+ # otherwise (mostly text/html cases) use default mode
+ $p->xml_mode(TRUE) if ($File->{Mode} eq 'XML');

I have to test this patch against a number of other test cases, but I'm hopeful it should be the solution to this problem, as well as Bug #14.
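
For illustration, the failure chain and the effect of the patch can be sketched with a simplified stand-in for the pre-parser. This is not the validator's code and does not use HTML::Parser: find_doctype and preparse_in_xml_mode are hypothetical helpers, and the only behaviour assumed is that an XML-mode pre-parse ends a processing instruction at "?>" while an SGML-mode pre-parse ends it at ">".

```perl
use strict;
use warnings;

# Hypothetical stand-in for the pre-parser (NOT HTML::Parser itself):
# skip a leading processing instruction, then look for a DOCTYPE.
sub find_doctype {
    my ($doc, $xml_mode) = @_;

    # Assumption: XML mode ends a processing instruction at "?>",
    # SGML mode ends it at the first ">".
    my $pi_end = $xml_mode ? '?>' : '>';

    if ($doc =~ /\A\s*<\?/) {
        my $pos = index($doc, $pi_end);
        # Unterminated declaration: the rest of the document is swallowed,
        # so no doctype can be seen.
        return undef if $pos < 0;
        $doc = substr($doc, $pos + length($pi_end));
    }

    return $doc =~ /<!DOCTYPE\s+([^>]+)>/i ? $1 : undef;
}

# Mirrors the patched condition: XML pre-parsing only when the mode
# derived from the content type is unambiguously 'XML'.
sub preparse_in_xml_mode {
    my ($mode) = @_;
    return $mode eq 'XML' ? 1 : 0;
}

# XHTML skeleton with the broken XML declaration from Comment 1.
my $broken = qq{<?xml version="1.0" encoding="UTF-8">\n}
           . qq{<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" }
           . qq{"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n}
           . qq{<html xmlns="http://www.w3.org/1999/xhtml"></html>\n};

for my $mode ('XML', 'TBD') {
    my $doctype = find_doctype($broken, preparse_in_xml_mode($mode));
    printf "%-3s pre-parse: %s\n", $mode,
        defined $doctype ? 'doctype found' : 'no doctype found';
}
```

With the mode left at TBD (the text/html case), the SGML-style pre-parse steps over the broken declaration and still sees the XHTML doctype, which is what lets XML mode be triggered later. An unconditional XML-mode pre-parse never finds the end of the declaration and so never sees the doctype.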
Comment 6 Olivier Thereaux 2007-03-22 09:12:58 UTC
Assigning to me.
Comment 7 Olivier Thereaux 2007-03-28 01:48:09 UTC
Patch mentioned in Comment #5 has been applied, and it works, as far as I (and my tests) can tell.