Validator Dev Watch: fuzzy matching for unknown elements/attributes
Unknown elements and attributes at top of the validation errors charts
According to MAMA's survey of validation of a few million web pages, the most common validation error is either “There is no attribute X” or “Element X undefined”. In other words, the document uses elements or attributes that are not standard. As explained in the Validator's documentation of errors, the most likely reasons for these errors are:
- Typos. The user wrote <acornym> when what was really meant was <acronym>. I am not sure if this is the most common error, but it can be a terribly frustrating one. “What do you mean, acronym is not a standard element? Of course it is! Oh, wait, I made a typo…”
- Non-standard elements. Again, I don't have statistics about which elements/attributes trigger this error most often, but I would bet on the <embed> element and the target attribute (which, by the way, is only available in Transitional Doctypes). For those we can't do much, other than recommend using another doctype and point to standard ways of using <object> to display Flash content.
- Case-sensitive XHTML. This one bites me more often than I'd like to admit. Copy and paste a snippet of code that uses e.g. the onLoad attribute, test the functionality in a few browsers (they will gladly oblige), then see the validator throw an error because, of course, in lower-case XHTML, onLoad isn't a known attribute; onload is.
What makes these errors frustrating is not so much the difficulty they present. Anyone carefully reading the error message and the explanation that comes with it will easily fix their markup. Unfortunately, for a number of good and bad reasons, few of us ever read the explanations: those tend to be a bit long, propose possible causes for the problem, and a list of potential solutions – and most people will just ignore or gloss over them.
Suggestive power
One way we found of making the validator more user-friendly here is to escalate the most likely solution up into the error message itself. In other words, compare:
http://validator.w3.org/check?uri=http://qa-dev.w3.org/wmvs/HEAD/dev/tests/4412-fuzzymatch.xhtml
Line 12, Column 14: there is no attribute "crass"
<spam crass="foo">typos in attribute and element</span>
lengthy explanation here…
with...
http://qa-dev.w3.org/wmvs/HEAD/check?uri=http://qa-dev.w3.org/wmvs/HEAD/dev/tests/4412-fuzzymatch.xhtml
Line 12, Column 14: there is no attribute "crass". Maybe you meant "class" or "classid"?
<spam crass="foo">typos in attribute and element</span>
same lengthy explanation here…
The former is what the latest stable release of the markup validator will output. The latter is what I implemented last week, and can be tested on our test instance of the validator.
How is it implemented?
Since the validator is coded in Perl, we looked for Perl modules implementing an algorithm to calculate the edit distance between strings. We found String::Approx, which implements the Levenshtein algorithm. Take this algorithm, plug in a list of all known elements and attributes in HTML, and after moderate hacking, my code would very easily find that <spam> should be <span>, and some extra tweaking yielded good results suggesting <acornym> could be corrected as <acronym>.
For some reason, however, I could not find a way to make the String::Approx algorithm reliably suggest onload as a replacement for onLoad: it seems to consider character substitution as expensive, regardless of the fact that the substitution is from a character to its uppercase equivalent. A trivial additional test took care of this glitch, and we seem to be all set to have a more usable validator at the upcoming release.
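To make the idea concrete, here is a minimal sketch of the approach in Python rather than Perl. It is not the validator's code and does not use String::Approx: the edit_distance and suggest functions, the max_distance threshold and the abridged name lists are all assumptions made for illustration. The case-insensitive check stands in for the "trivial additional test" mentioned above, whose actual form is not shown here, and the sketch will not necessarily produce the same suggestions as the validator.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical, heavily abridged lists; a real checker would load every
# element and attribute name known for the document type.
KNOWN_ELEMENTS = ["span", "acronym", "object"]
KNOWN_ATTRIBUTES = ["class", "classid", "onload", "target", "title"]


def suggest(name: str, known: list[str], max_distance: int = 2) -> list[str]:
    """Return known names that are plausible corrections for `name`."""
    # Case check first: onLoad -> onload is a case problem, not a typo.
    lowered = name.lower()
    if lowered != name and lowered in known:
        return [lowered]
    # Otherwise rank candidates by edit distance, closest first.
    # Distance 0 is excluded so a name is never suggested for itself.
    scored = sorted((edit_distance(name, c), c) for c in known)
    return [c for d, c in scored if 0 < d <= max_distance]


if __name__ == "__main__":
    print(suggest("spam", KNOWN_ELEMENTS))
    print(suggest("acornym", KNOWN_ELEMENTS))
    print(suggest("onLoad", KNOWN_ATTRIBUTES))
```

The threshold is the main knob: too low and <acornym> (two substitutions away from <acronym>) gets no suggestion, too high and short names start matching almost anything.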
What do you think?
What do you think of this feature? Would you have implemented it differently?
Any suggestion for a better way to word/present the suggested correction for unknown element/attributes? Any thought on other small improvements to the validator which would dramatically improve its usability?
If the onLoad attribute in XHTML actually worked that would be a bug in browsers.
Anne,
Nice troll :)
Thanks for your feedback.
The onLoad attribute apparently “works” just fine for XHTML served as text/html. In every desktop browser I know, including the one made by your employer ;).
This is a super feature.
I once hacked together something to help on the case sensitivity, as many people had "viewbox" instead of SVG's "viewBox".
And help in spotting typos can't be specific enough, as you can easily overlook one even though you're looking right at it.
Hi Stelt,
Great! Good to hear you've had success in the past with a similar feature. Did you then have a chance to observe whether the feature helped your users?
Also, given that you have similar experience, if you can suggest any improvement to how the suggestion is made, that'd be excellent. I'm not 100% sure about «Maybe you meant "class" or "classid"». Other similar implementations seem to stick with «did you mean ...?»
@Olivier
I did get some replies telling me it found some errors the W3C didn't back then.
Just thinking about how I use the validator and imagining how others might too:
Every time I have to scan a sentence up to the point where it has the only information I use: "crass ..."
I hardly ever use "Line 12, Column 14" as it's faster to use ctrl-F in my editor.
I don't use "there is no attribute" all the time either.
"There is no attribute X" could to some people read as "you should have used attribute X, but didn't".
Suggestion:
X "crass" is not a valid attribute. Is "class" or "classid" what you meant?
Line 12, Column 14
I think "crass" should stand out 'visually' and to a lesser degree "class" and "classid" too. I think the opposite for "Line 12, Column 14".
I just remembered one more thing:
What if you find an attribute invalid on the current element, but valid on other elements?
Do you fuzzy-suggest a different attribute, or do you tell what elements the attribute would be valid on?
@stelt - indeed the implementation I have isn't that smart. For it to do what you are hinting at we'd need something much smarter, which knows which elements/attributes are allowed in which context. Maybe for a future version...
In the meantime, I have to at least fix this one… When validating e.g.
<span target="blah">
The validator would reply:
there is no attribute "target". Maybe you meant "target"?
Obviously that would not help! Should be easy to patch, I have that on my todo.
Patched with improved logic and wording.
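The patch itself is not shown in this thread. As a rough sketch of the kind of logic discussed above, assuming a table of which attributes are allowed on which elements, one could first check whether the attribute exists elsewhere, and only fall back to fuzzy suggestions otherwise. The ALLOWED table, the function name and the message wording below are hypothetical, and Python's difflib.get_close_matches (a similarity-ratio match, not Levenshtein distance) is used purely for brevity.

```python
from difflib import get_close_matches

# Hypothetical, heavily abridged table: element -> attributes allowed on it.
ALLOWED = {
    "a": ["class", "href", "target", "title"],
    "span": ["class", "title"],
}


def explain_unknown_attribute(element: str, attribute: str) -> str:
    """Build an error message for an attribute not allowed on `element`."""
    message = f'there is no attribute "{attribute}" on element "{element}".'

    # Case 1: the attribute is real, just not valid on this element.
    # Suggesting the same name again would be useless, so say where it fits.
    valid_on = sorted(e for e, attrs in ALLOWED.items() if attribute in attrs)
    if valid_on:
        return message + " It is only allowed on: " + ", ".join(valid_on) + "."

    # Case 2: probably a typo. Suggest close names allowed on this element,
    # never the name that was just rejected.
    allowed_here = ALLOWED.get(element, [])
    candidates = [c for c in get_close_matches(attribute, allowed_here, n=2)
                  if c != attribute]
    if candidates:
        quoted = " or ".join(f'"{c}"' for c in candidates)
        return message + f" Maybe you meant {quoted}?"
    return message


if __name__ == "__main__":
    print(explain_unknown_attribute("span", "target"))
    print(explain_unknown_attribute("span", "crass"))
```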
What happened to the Clean up Markup with HTML Tidy feature in HTML Validator? I used to be able to get cleaned code copy by clicking that box when validating by Direct Input, but can't now. Is the feature somewhere else now?