Validator Dev Watch: fuzzy matching for unknown elements/attributes

Unknown elements and attributes at top of the validation errors charts

According to MAMA's survey of markup validation across a few million web pages, the most common validation error is either “There is no attribute X” or “Element X undefined” – in other words, cases where the document uses elements or attributes that are not standard. As explained in the Validator's documentation of errors, the most likely causes of these errors are:

  1. Typos. The user wrote <acornym> when what was really meant was <acronym>. I am not sure whether this is the most common error, but it can be a terribly frustrating one. “What do you mean, acronym is not a standard element? Of course it is! Oh, wait, I made a typo…”
  2. Non-standard elements and attributes. Again, I don't have statistics about which elements or attributes trigger this error most of the time, but I would bet on the <embed> element and the target attribute (which, by the way, is only available in Transitional doctypes). For those we can't do much, other than recommend another doctype and point to standard ways of using <object> to display Flash content.
  3. Case-sensitive XHTML. This one bites me more often than I'd like to admit. Copy and paste a snippet of code that uses, e.g., the onLoad attribute, test the functionality in a few browsers – they will gladly oblige – then watch the validator throw an error because, of course, in lowercase XHTML, onLoad isn't a known attribute; onload is.

What makes these errors frustrating is not so much the difficulty they present. Anyone carefully reading the error message and the explanation that comes with it will easily fix their markup. Unfortunately, for a number of good and bad reasons, few of us ever read the explanations: they tend to be a bit long, proposing several possible causes for the problem along with a list of potential solutions – and most people just ignore or gloss over them.

Suggestive power

One way we found to make the validator more user-friendly here is to promote the most likely solution into the error message itself. In other words, compare:

Error Line 12, Column 14: there is no attribute "crass"

<spam crass="foo">typos in attribute and element</span>

lengthy explanation here…


Error Line 12, Column 14: there is no attribute "crass". Maybe you meant "class" or "classid"?

<spam crass="foo">typos in attribute and element</span>

same lengthy explanation here…

The former is what the latest stable release of the markup validator will output. The latter is what I implemented last week, and can be tested on our test instance of the validator.

How is it implemented?

Since the validator is written in Perl, we looked for Perl modules implementing algorithms to calculate the edit distance between strings. We found String::Approx, which implements the Levenshtein algorithm. Take this algorithm, plug in a list of all known elements and attributes in HTML, and after moderate hacking, my code could very easily find that <spam> should be <span>; some extra tweaking yielded good results suggesting that <acornym> could be corrected to <acronym>.
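The actual code is Perl built on String::Approx, but the core idea – match an unknown name against a dictionary of known names and surface the closest hits – is easy to sketch in Python with the standard library's difflib. Note that get_close_matches ranks by a similarity ratio rather than true Levenshtein distance, and the name lists below are a tiny illustrative subset, not the validator's real tables:

```python
import difflib

# Hypothetical subsets of the known HTML element and attribute names.
KNOWN_ELEMENTS = ["span", "strong", "acronym", "abbr", "samp", "form"]
KNOWN_ATTRIBUTES = ["class", "classid", "id", "onload", "onclick", "style"]

def suggest(unknown, known, n=2, cutoff=0.6):
    """Return up to n known names similar enough to the unknown one."""
    return difflib.get_close_matches(unknown, known, n=n, cutoff=cutoff)

print(suggest("spam", KNOWN_ELEMENTS))     # "span" among the suggestions
print(suggest("acornym", KNOWN_ELEMENTS))  # -> ['acronym']
print(suggest("crass", KNOWN_ATTRIBUTES))  # "class" among the suggestions
```

The cutoff matters: set it too low and every unknown name gets a far-fetched suggestion, too high and genuine typos go unmatched.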

For some reason, however, I could not find a way to make the String::Approx algorithm reliably suggest onload as a replacement for onLoad – it seems to consider character substitution expensive, regardless of the fact that the substitution is from a character to its uppercase equivalent. A trivial additional test took care of this glitch, and we seem to be all set to have a more usable validator in the upcoming release.
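That extra test can be sketched as follows – again in Python rather than the validator's Perl, with a hypothetical attribute list. The idea is simply to check for a pure case mismatch before falling back to the fuzzy pass:

```python
import difflib

# Hypothetical list of known attribute names (all lowercase in XHTML).
KNOWN_ATTRIBUTES = ["onload", "onclick", "class", "id", "style"]

def suggest(unknown, known):
    """Suggest corrections, treating a pure case mismatch as an exact hit."""
    folded = unknown.lower()
    if folded in known:
        # "onLoad" differs from "onload" only in case: suggest it directly,
        # instead of letting the edit-distance pass weigh each substitution.
        return [folded]
    return difflib.get_close_matches(unknown, known, n=2)

print(suggest("onLoad", KNOWN_ATTRIBUTES))  # -> ['onload']
print(suggest("crass", KNOWN_ATTRIBUTES))   # falls through to fuzzy matching
```

The case-folding check is also much cheaper than computing edit distances against the whole dictionary, so running it first costs nothing.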

What do you think?

What do you think of this feature? Would you have implemented it differently?

Any suggestions for a better way to word or present the suggested correction for unknown elements/attributes? Any thoughts on other small improvements to the validator that would dramatically improve its usability?
