W3C

SAXing up the Markup Validator: from Validator to Conformance Checker

One of the main weaknesses of the Markup Validator is that it is a validator. This may sound a little odd, until one knows precisely what validation is. Validation, roughly speaking, is the process of comparing a document written in a certain language against a machine-readable grammar for that language.

So, when the validator checks a document written in HTML 4.01 Strict, it doesn’t actually know any of the prose that one can find in the HTML 4.01 Specification; it just knows the machine-readable grammar (called a DTD in the case of HTML and most markup languages standardized to date). In some ways, that is a good thing: prose can be ambiguous, a DTD is not. But there are some things you cannot define, or enforce, with a DTD: for example, attribute values are defined as being of a certain type (identifier, URI, character data), but the values themselves cannot be checked by a DTD.
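To make the limitation concrete: in the HTML 4.01 DTD, a URI-typed attribute such as href is ultimately declared as character data, so any string passes validation. The sketch below shows the kind of value-level check a DTD cannot express. It is only a rough, character-level test based on the character set of RFC 3986, not a full URI parser, and the function name is hypothetical (it is not part of the validator).

```python
import re

# Characters RFC 3986 permits in a URI reference: unreserved, reserved, and "%".
_URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")
# A "%" must be followed by exactly two hexadecimal digits.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def uri_value_problems(value):
    """Return character-level problems found in a URI attribute value.

    Hypothetical helper: a necessary check only, not a full URI parser.
    """
    problems = []
    if not _URI_CHARS.fullmatch(value):
        problems.append("contains characters not allowed in a URI")
    if _BAD_ESCAPE.search(value):
        problems.append("contains a malformed percent-escape")
    return problems
```

A value such as "http://example.org/a page" is perfectly valid as far as the DTD is concerned, yet the check above flags the unescaped space.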

As long as the validator remains a validator stricto sensu, validation will be one of its main limitations. The alternatives are:

  1. Change the technology used to describe the language’s grammar: languages being standardized today tend to use other, more recent technologies, such as Relax NG, Schematron or XML Schema. Although more expressive, these technologies are in no way a solution to the complexity of describing markup languages, as explained in e.g. Henri Sivonen’s thesis on an HTML5 conformance checker.
  2. Evolve the validator into more than a validator: what if the validator knew about the prose in the HTML specifications? Would it not solve the problem? It would help. Of course, the validator would no longer be a validator; it would instead enter the realm of conformance checkers.

Recent changes in the validator will help move in that direction. Up to version 0.7.4 (released in Nov 2006) the validator was using OpenSP, the venerable parser for SGML (and XML). Starting with the next version, the validator will still use the same parser, but wrapped in a smarter, faster package. In addition to being much faster, the Perl module SGML::Parser::OpenSP provides us with a SAX-equivalent event-based parsing interface.

A SAX (or the OpenSP equivalent) API is an opportunity to go way beyond grammar-based validation: additional checks, additional features. Additional checks can follow a strategy similar to that of the RSS validator, as Mark Pilgrim documented in his Inside the RSS Validator column.

There was one obvious candidate for the first extra check we could create: in the list of criteria for a Strictly Conforming XHTML 1.0 Document, all but one can be enforced with grammar-based checking. The exception is the xmlns declaration required on the root element: DTDs cannot, at the same time, express that an attribute MUST be specified, and MUST have a fixed value.
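As an illustration of what such an event-based check can look like, here is a minimal sketch using Python’s standard xml.sax module. This is an assumption made for the sake of a self-contained example: the validator itself is written in Perl and uses SGML::Parser::OpenSP, whose interface differs. The class and function names are hypothetical.

```python
import xml.sax

XHTML_NS = "http://www.w3.org/1999/xhtml"


class RootNamespaceCheck(xml.sax.ContentHandler):
    """Record a problem if the root html element lacks the XHTML xmlns."""

    def __init__(self):
        super().__init__()
        self.seen_root = False
        self.problems = []

    def startElement(self, name, attrs):
        # Only the first start-element event (the root) is of interest.
        if self.seen_root:
            return
        self.seen_root = True
        if name != "html":
            return
        value = attrs.get("xmlns")
        if value is None:
            self.problems.append("root html element has no xmlns attribute")
        elif value != XHTML_NS:
            self.problems.append(
                "xmlns must be %r, found %r" % (XHTML_NS, value))


def check_root_namespace(document):
    """Parse an XML document given as bytes; return the problems found."""
    handler = RootNamespaceCheck()
    xml.sax.parseString(document, handler)
    return handler.problems
```

The sketch relies on xml.sax’s default non-namespace mode, in which xmlns is reported like any other attribute. The check sits alongside grammar-based validation rather than replacing it: the handler only looks at the one thing the DTD cannot require.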

The latest version of the validator (still in development for a few more weeks) can now spot documents in XHTML missing the xmlns attribute for the root html element. The technical side of the issue is now closed, but most of the problems are on the road ahead:

  • The wardens of a certain definition of “validator” will probably not be pleased by such a blatant drifting from formal validation into conformance checking.
  • Some users of the validator will be puzzled to see their once-validating documents now rejected by the validator. It is a natural reaction, particularly from users who tend to consider the validator a “reference”, forgetting that any software may have bugs and ignoring the too-often-seen note that “the validator’s XML support has some limitations”.

One way to please everyone may be to issue only warnings, not errors, for checks that belong to conformance checking beyond formal validation. In the long run, however, that would probably not have the same mending effect on the web as a strict stance.
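The two options can be pictured as a single policy switch. The sketch below is purely illustrative: the Finding type, the from_grammar field and the strict flag are hypothetical names, not anything the validator exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    message: str
    from_grammar: bool  # True when DTD-based validation reported it


def severity(finding, strict=False):
    """Grammar violations are always errors; extra checks follow the policy."""
    if finding.from_grammar or strict:
        return "error"
    return "warning"
```

With strict left off, a missing xmlns found by an extra check would be reported as a warning and the document would still pass formal validation; with strict on, it is rejected.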

Hopefully, the upcoming weeks of beta testing for the new version will help figure out which road to take. Opinions welcome.

6 thoughts on “SAXing up the Markup Validator: from Validator to Conformance Checker”

  1. This sounds very interesting. As we are about to set up the Test Suite for RDFa, I’d like to ask you for more details, and maybe some hints how this could be applied to RDFa. Secondly, I’d very much like to hear Dan Connolly’s comments on conformance checking w.r.t. this one.

    Cheers,
    Michael

  2. You mean it doesn’t check these things already?

    There’s no real problem here. The tool is advertised as an HTML validator, not a DTD validator. “validator” has many meanings and in the context of HTML that includes things like checking that an a element does not contain other a element descendants. It’s irrelevant how this is checked: with a DTD or W3C XML schema or Schematron or Java code. All that matters is that the constraint is validated.

  3. I think this is a wonderful development and the right way to move forward. Very few people have knowledge of what “validation” in the DTD-sense is anyway. I believe most developers think “conformance check” when they validate their documents and not “validation against a DTD and nothing else”. They want to know if their documents are correct, that’s all.

    And with that, I believe the validator could report errors and warnings for things that go far outside the scope of the HTML specification, like malformed HTTP headers, etc.

  4. Michael,

    For RDFa, which I’ve been looking at intermittently, there are two things: make sure RDFa in XHTML can properly be validated as XHTML, and get the validator to make RDFa-specific checks.

    The former is probably the more important and urgent, and from what I’ve seen in recent publications, we’re pretty close.

    The way RDFa now has namespaces in attribute values is smart in this regard: it solves most of the problem. The big remaining issue is that, for a DTD-based validator, such a construct is forbidden:

    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:cal="http://www.w3.org/2002/12/cal/ical#" …
    because XML DTDs are not namespace aware. Somewhere in a corner of my mind, I think I eventually want to filter out these “errors” from validator output in the case of XML documents, as a bit of a cheating way to make it more namespace-aware.

    The rest is a matter of making sure that attributes and elements follow some established schema. I played with examples from the RDFa primer the other day, and saw that they use attributes such as “content” and “about”, which aren’t in XHTML 1.0. Making an XHTML-based language with XHTML 1.0 as a basis, plus the stuff needed by RDFa, will be a solution to this.

    (And just as I type this, I’m given a link to XHTML RDFa Modules, which indeed solves the issue…)

    Adding checks to the validator specifically for RDFa is still rather blurry for me, but I think it’s not outside of the realm of possibility. Just need to learn more about it before I can have an opinion.

  5. Asbjørn,

    Thanks a lot for your insight, much appreciated. Good to hear that the idea doesn’t sound completely heretical to you.

    Additional checks would be great, indeed. It may be difficult to bring them in, because some people get very, very upset at seeing warnings in otherwise “correct” documents (see some discussions on the CSS validator’s list about accessibility-related warnings…), so we’d have to be smart and have a good UI.

    Warnings about broken HTTP headers are a pretty good idea. The version of the validator in development does that, to some extent, when detecting that documents are sent with bogus MIME types (see the MIME type section in the validator’s little test suite for an idea). If you have more ideas like that, please, shoot :).
