SAXing up the Markup Validator: from Validator to Conformance Checker

One of the main weaknesses of the Markup Validator is that it is a validator. This may sound a little odd until one knows precisely what validation is. Validation, roughly speaking, is the process of comparing a document written in a certain language against a machine-readable grammar for that language.

So, when the validator checks a document written in HTML 4.01 Strict, it doesn't actually know any of the prose found in the HTML 4.01 Specification; it only knows the machine-readable grammar (called a DTD in the case of HTML and most markup languages standardized to date). In some ways, that is a good thing: prose can be ambiguous, a DTD is not. But there are some things you cannot define, or enforce, with a DTD. For example, attribute values are declared as being of a certain type (identifier, URI, character data), but the value itself cannot be enforced with a DTD.
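As a hedged illustration of that gap: in the HTML 4.01 DTD, the datetime attribute of ins and del is declared as plain character data, while the prose of the specification asks for the form YYYY-MM-DDThh:mm:ssTZD. The Python sketch below shows the kind of value check a DTD cannot carry. The function name is hypothetical, it is not part of the validator, and it only tests the shape of the value, not whether the date exists in the calendar.

```python
import re

# Shape of the HTML 4.01 date/time format: YYYY-MM-DDThh:mm:ssTZD,
# where TZD is "Z" or a +hh:mm / -hh:mm offset.
DATETIME_SHAPE = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}"      # date part
    r"T[0-9]{2}:[0-9]{2}:[0-9]{2}"     # time part
    r"(Z|[+-][0-9]{2}:[0-9]{2})"       # time zone designator
)


def datetime_value_problem(value):
    """Return a message if value lacks the expected shape, else None.

    Hypothetical helper for illustration only: a DTD would accept any
    character data here, so this check has to live outside the grammar.
    """
    if DATETIME_SHAPE.fullmatch(value) is None:
        return f"datetime value {value!r} does not match YYYY-MM-DDThh:mm:ssTZD"
    return None
```

A value such as "last Tuesday" is perfectly valid against the DTD, yet a check like this one would report it.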

As long as the validator remains a validator stricto sensu, validation will be one of its main limitations. The alternatives are:

  1. Change the technology used to describe the language's grammar: languages being standardized today tend to use other, more recent technologies, such as RELAX NG, Schematron or XML Schema. Although more expressive, these technologies are in no way a complete solution to the complexity of describing markup languages, as explained in, e.g., Henri Sivonen's thesis on an HTML5 conformance checker.
  2. Evolve the validator into more than a validator: what if the validator knew about the prose in the HTML specifications? Would it not solve the problem? It would help. Of course, the validator would then no longer be a validator, but would instead enter the realm of conformance checkers.

Recent changes in the validator will help move in that direction. Up to version 0.7.4 (released in November 2006), the validator used OpenSP, the venerable parser for SGML (and XML). Starting with the next version, the validator will still use the same parser, but wrapped in a smarter, faster package. In addition to being much faster, the Perl module SGML::Parser::OpenSP provides us with a SAX-equivalent, event-based parsing interface.

A SAX API (or the OpenSP equivalent) is an opportunity to go well beyond grammar-based validation: additional checks, additional features. Additional checks can follow a strategy similar to that of the RSS validator, which Mark Pilgrim documented in his Inside the RSS Validator column.

There was one obvious candidate for the first extra check we could create. In the list of criteria for a Strictly Conforming XHTML 1.0 Document, all but one can be enforced with grammar-based checking. The exception exists because a DTD cannot express, at the same time, that an attribute MUST be specified and MUST have a fixed value.
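To make the idea concrete, here is a minimal sketch of how such a check can hang off an event-based parser, applied to the xmlns attribute of the root html element. It uses Python's standard xml.sax module rather than the validator's own Perl and OpenSP stack, so the class name, function name and messages are assumptions for illustration only, not the validator's code. The sample documents carry no DOCTYPE, so no default value from a DTD comes into play.

```python
import xml.sax

XHTML_NS = "http://www.w3.org/1999/xhtml"


class RootNamespaceCheck(xml.sax.ContentHandler):
    """Record a message if the root html element lacks a proper xmlns."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many elements are currently open
        self.messages = []    # findings collected during the parse

    def startElement(self, name, attrs):
        # Only the outermost element is of interest for this check.
        if self.depth == 0 and name == "html":
            value = attrs.get("xmlns")
            if value is None:
                self.messages.append(
                    "root html element has no xmlns attribute")
            elif value != XHTML_NS:
                self.messages.append(
                    "xmlns on root html element is not the XHTML namespace")
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1


def check_root_namespace(document):
    """Parse document (bytes) and return the list of messages found."""
    handler = RootNamespaceCheck()
    # Namespace processing is off by default, so xmlns arrives as an
    # ordinary attribute in startElement.
    xml.sax.parseString(document, handler)
    return handler.messages
```

The appeal of this design is that each extra check is a small piece of event-handling logic, independent of the grammar, and more checks can be added without touching the parser.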

The latest version of the validator (still in development for a few more weeks) can now spot XHTML documents that are missing the xmlns attribute on the root html element. The technical side of the issue is now closed, but most of the problems are on the road ahead:

  • The wardens of a certain definition of "validator" will probably not be pleased by such a blatant drift from formal validation into conformance checking.
  • Some users of the validator will be puzzled to see their once-validating documents now rejected by the validator. It is a natural reaction, particularly from users who tend to consider the validator a "reference", forgetting that any software may have bugs and ignoring the too-often-seen note that "the validator's XML support has some limitations".

One way to please everyone may be to issue only warnings, not errors, for checks that go beyond formal validation into conformance checking. In the long run, however, that would probably not have the same mending effect on the web as a strict stance.

Hopefully, the upcoming weeks of beta testing for the new version will help figure out which road to take. Opinions welcome.
