5921 – The checker crashes on invalid XML pages that cannot be tidied properly

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	mobileOK Basic checker
Classification:	Unclassified
Component:	Java Library
Version:	unspecified
Hardware:	PC Linux

Importance:	P2 minor
Target Milestone:	---
Assignee:	fd

Reported:	2008-07-31 09:30 UTC by fd
Modified:	2009-04-21 07:39 UTC
CC List:	0 users

Description fd 2008-07-31 09:30:08 UTC

This could be related to an encoding problem, because it only seems to occur on pages that are not encoded as UTF-8.

The underlying issue is that the tidied body may still not be valid enough for the XML parser to parse it.

The offending code is in HTTPXHTMLResource.parseTidiedDOM, which may return null, while the calling method (the constructor) does not expect that to ever happen.

Example:
Run the checker on http://www.china-avs.com/
It currently fails with:
Exception in thread "main" java.lang.NullPointerException
 at org.w3c.mwi.mobileok.basic.HTTPXHTMLResource.<init>(HTTPXHTMLResource.java:145)
 at org.w3c.mwi.mobileok.basic.Preprocessor.preprocess(Preprocessor.java:50)
 at org.w3c.mwi.mobileok.basic.Tester.getPreprocessorResults(Tester.java:90)
 at org.w3c.mwi.mobileok.basic.Tester.main(Tester.java:202)


Two things:
1/ we should try to understand what really happens here
2/ even if we fix the parseTidiedDOM function, we should make sure that the Tester returns an error when the content is so invalid that no test can be run (a message in MAIN_DOCUMENT perhaps?)

I'm flagging this bug as minor because it only occurs when the page is "fairly" invalid.

Comment 1 Abel Rionda 2008-07-31 11:56:27 UTC

We have looked into this a little. It seems that the root of the problem is in the constructor of HTTPTextResource, where we force UTF-8 decoding when no encoding can be detected.

if (statedEncoding == null) {
    // No stated encoding, try UTF-8
    validateUTF8();
    decodeBodyAsUTF8();
}

The problem is that this decoding inserts garbage characters into the body, which later causes the crash.
We think the best option is to *detect* the real encoding, using one of the available third-party libraries. Hopefully this will solve the problem, but it would be worth checking.
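
As a rough illustration of the point above, the sketch below decodes a body as UTF-8 in strict mode, so that malformed input is reported rather than silently replaced with garbage. It is not the checker's code and it does not detect the real encoding: the class name StrictUtf8 and the method decodeOrNull are hypothetical, and only standard java.nio.charset calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the mobileOK checker.
public class StrictUtf8 {

    /** Returns the decoded text, or null if the bytes are not well-formed UTF-8. */
    public static String decodeOrNull(byte[] body) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of inserting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(body)).toString();
        } catch (CharacterCodingException e) {
            // The caller can try another encoding or report an error.
            return null;
        }
    }
}
```

A null result tells the caller that the body is not UTF-8, at which point an encoding detector (or an error message) can take over instead of passing a corrupted string to the parser.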

Comment 2 fd 2009-04-21 07:39:51 UTC

Bug 5921, bug 6284, bug 6718 and bug 6818 are similar because they all relate to primary documents that cannot be decoded or parsed. The Checker should return an error in such cases, and not raise an exception that makes it look as if something is wrong within the Checker.

The problem is that the mobileOK Basic Tests 1.0 specification does not define an error that exactly covers a document that cannot be parsed. The only FAIL that more or less matches this case is:
 CONTENT_FORMAT_SUPPORT-4
 [If the document is not an HTML document, FAIL]
We might want to have a more dedicated error message in the future. I'll raise a bug on this.


Changes in TextContent
-----
The content is now decoded as UTF-8 when we do not support the stated encoding. If that works, we can run further tests. If not, at least we will have tried, and the changes committed to fix bug 6818 ensure that a CHARACTER_ENCODING_SUPPORT FAIL message is reported.
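
The fallback described above could look roughly like the following sketch. It assumes that "supported" means "supported by the running JVM", which may not match the checker's own definition; the class name EncodingFallback and the method chooseCharset are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the mobileOK checker.
public class EncodingFallback {

    /** Returns the stated charset if the JVM supports it, UTF-8 otherwise. */
    public static Charset chooseCharset(String statedEncoding) {
        if (statedEncoding != null) {
            try {
                if (Charset.isSupported(statedEncoding)) {
                    return Charset.forName(statedEncoding);
                }
            } catch (IllegalCharsetNameException e) {
                // The stated name is not even syntactically legal: fall through.
            }
        }
        // No usable stated encoding: try UTF-8 and let later checks report failures.
        return StandardCharsets.UTF_8;
    }
}
```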


Changes in XhtmlContent
-----
parseTidiedDOM now returns null when an exception occurs in the tag soup parser.
We might want to log the exception in the future, because it could help reveal bugs in the underlying library.

As mentioned above, CONTENT_FORMAT_SUPPORT-4 is now always returned when no DOM tree can be constructed from the received content.
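
The behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the JDK's own XML parser as a stand-in for the tag soup parser, and the class DomOrFail with its methods parseOrNull and check are hypothetical names, not the checker's API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the checker's actual classes.
public class DomOrFail {

    /** Returns a DOM tree, or null if the body cannot be parsed. */
    public static Document parseOrNull(String body) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(
                    new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // A real implementation might log e here before giving up.
            return null;
        }
    }

    /** Maps a missing DOM to a FAIL instead of letting a NullPointerException escape. */
    public static String check(String body) {
        Document dom = parseOrNull(body);
        return dom == null ? "FAIL CONTENT_FORMAT_SUPPORT-4" : "DOM available";
    }
}
```

The key design point is that the null check happens at the single place where the DOM is obtained, so callers never dereference a missing tree.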