Re: Comparing conformance requirements against real-world docs

On Aug 30, 2008, at 10:58, Henri Sivonen wrote:

> I reran the numbers for pages that
> 1) did not trigger the quirks mode
> AND
> 2) had zero parse errors (still ignoring tree builder-level doctype  
> errors)
> AND
> 3) had at least one validation error

First, some philosophical assumptions underlying my conclusions:
  1) Time is a precious resource for people. Therefore, wasting  
people's time is bad.
  2) A validator should primarily be a tool that authors use to help  
themselves in their authoring task. The primary purpose of a validator  
is not imposing a particular code aesthetic onto other people.

With legacy language features it becomes problematic to help people  
not waste their time. If an author is writing a new HTML page, the  
author's time is wasted if we make a useless piece of syntax  
conforming and pundits convince the author to use the useless  
syntax. In this sense, conforming no-op syntax isn't harmless. On the  
other hand, if an author has existing HTML templates and then adds a  
new HTML5 feature (<video>) to his/her site and starts using an HTML5  
validator as a quality assurance tool, since an HTML5 validator  
recognizes the new feature, the author's time is wasted if the  
validator spews a lot of errors about legacy language features that  
are interoperably implemented and don't really cause harm beyond  
wasted bandwidth (and perhaps slightly lower maintainability).

More concretely, it wastes people's time if experts advise people to  
write <style type=text/css> instead of <style> but it also wastes  
people's time if a validator tells people to take type=text/css out  
when it already has been written.

I don't have a good solution for this problem at this time. However, I  
think it's safe to say that the approach HTML 4 and XHTML 1.0 took  
isn't the solution. Those specs defined two conformance targets:  
practical aka. Transitional and wishful aka. Strict. It turned out  
that most people who care about validation aim for the more permissive  
conformance target. Also, some people spent a lot of time coding  
around the strictness of Strict, which, I'd argue, was in some cases a  
big waste of time and, therefore, bad.

Yet, it appears that when the more permissive conformance target  
doesn't forbid the things that people want to do (with the notable  
exception of <embed>), in a decade part of the HTML output out there  
does converge towards the more permissive conformance target.

Anyway, I'd like to make the conformance definition of HTML5 not waste  
the time of people who are upgrading from a previous level of HTML.  
There will be authors using the <video> element pretty soon. For them,  
the <video> element will be a killer feature that matters more than  
HTML 4.01 or XHTML 1.x validation. At that point, they should be able  
to turn to an HTML5 tool so that the tool is useful for them and  
doesn't waste their time. For this scenario, it doesn't make sense to  
make the HTML5 conformance definition something that maybe 30% of HTML  
output has converged on after a decade.

(Things grouped together a bit below.)

> 0.1142	The internal character encoding declaration must be the first  
> child of the “head” element.

I think we should go back to requiring the declaration to occur within  
the first 512 bytes. Whether it has non-ASCII before it doesn't matter  
in that case, even for streaming implementations that perform a prescan  
on the first 512 bytes.
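
To make the 512-byte rule concrete, here is a rough Python sketch of the check a validator might run. It is only an approximation under stated assumptions: the regular expression is a simplification and not the actual HTML5 prescan algorithm, and the names PRESCAN_LIMIT and find_declared_encoding are made up for illustration.

```python
import re

PRESCAN_LIMIT = 512  # size of the byte window discussed above

# Simplified pattern (an assumption, not the real prescan algorithm):
# a complete <meta ...> tag that contains charset=<label>.
_META_CHARSET = re.compile(
    rb'<meta\b[^>]*\bcharset\s*=\s*["\']?\s*([A-Za-z0-9_.:\-]+)[^>]*>',
    re.IGNORECASE,
)


def find_declared_encoding(data: bytes, limit: int = PRESCAN_LIMIT):
    """Return the declared encoding label (lowercased) if a complete
    meta declaration lies within the first `limit` bytes, else None."""
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Because only the first 512 bytes are inspected, it makes no difference to this check whether non-ASCII bytes (for example in a title element) precede the declaration, as long as the whole declaration fits in the window.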

The old definition is theoretically ugly, but it seems to be more  
practical for everyone except validator writers, and for me as a  
validator writer it's a sunk cost already.

> 0.1001	Attribute “border” not allowed on element “img” at this point.

It seems to me that Gecko's and Trident's default image border is  
extremely unpopular among authors, and making border=0 non-conforming  
is unhelpful, too. I reiterate my suggestion to make border=0  
conforming.

> 0.1013	Attribute “cellspacing” not allowed on element “table” at  
> this point.
> 0.0951	Attribute “cellpadding” not allowed on element “table” at  
> this point.
> 0.0935	Attribute “border” not allowed on element “table” at this  
> point.
> 0.0924	Attribute “width” not allowed on element “table” at this point.
> 0.0779	Attribute “valign” not allowed on element “td” at this point.
> 0.0759	Attribute “width” not allowed on element “td” at this point.
> 0.0451	Attribute “height” not allowed on element “td” at this point.
> 0.0365	Attribute “align” not allowed on element “table” at this point.
> 0.0273	Attribute “height” not allowed on element “table” at this  
> point.


It's clear by now that the layout model offered by HTML tables is  
something that authors find useful. Using layout tables in HTML and  
using CSS is not an either-or choice. Since people who use CSS for  
some things still use layout tables, this is an indication that the  
CSS language or its incumbent implementations don't make it easy to  
make the kinds of layouts that authors use tables for.

Realistically, it will take many years for CSS grid layout to be as  
deployable by authors as HTML layout tables are today. Moreover, the  
current installed base of browsers doesn't make CSS table layout a  
viable alternative for HTML table layout. Chances are that this won't  
change until the computers that came with Windows XP pre-installed  
have been disposed of. Chances are that there will be demand for  
validating HTML5 language features before then.

Considering the above, it seems unhelpful for HTML5 to take the  
position that layout tables are not conforming.

(Aside: The accessibility argument against layout tables is moot.  
Layout tables are so abundant out there that accessibility technology  
must deal with them anyhow.)

> 0.0793	Attribute “language” not allowed on element “script” at this  
> point.

<script language=JavaScript> is as harmless and useless as <script  
type=text/javascript>.

> 0.0638	Attribute “align” not allowed on element “td” at this point.

I think this one isn't like the other "presentational" table  
attributes. The alignment of table cells is often tightly coupled with  
the kind of content the cells have.

Moreover, if structure and presentation are truly separated, it should  
be possible to write a style sheet ahead of time for a given set of  
content features. Here a content feature can be something like  
"multi-paragraph blockquotes" or "tables with both numbers and text in  
them". However, intuitively, "tables with numbers in the fifth column"  
is too specific to be a generic content feature that a style sheet is  
written to support.

If you need to tweak your CSS and class attributes whenever you make a  
table with a new column mix, structure and presentation are not really  
being separated. Once you get there, why not encode the alignment in  
HTML?

> 0.0609	Attribute “size” not allowed on element “input” at this point.

This HTML feature doesn't have a convenient CSS alternative that would  
be deployable today considering the existing installed base of browsers. I  
think we should just make this attribute conforming.

> 0.0529	Attribute “align” not allowed on element “div” at this point.
> 0.0282	Attribute “align” not allowed on element “p” at this point.
> 0.0372	Attribute “align” not allowed on element “img” at this point.


Wow.

It would be interesting to examine the use cases for aligning divs and  
paragraphs. I'd be interested to know if the popularity of the align  
attribute has something to do with legacy RTL authoring habits.

> 0.0401	Bad value (consolidated) for attribute “http-equiv” on  
> element “meta”.

I don't know what values these are, but I hadn't implemented  
Content-Language yet.

> 0.0386	Attribute “name” not allowed on element “a” at this point.

That one just refuses to go away. :-(

> 0.0354	The “font” element is obsolete.
> 0.0208	Attribute “color” not allowed on element “font” at this point.


<font color> is the simplest way to map color-coded text from a  
WYSIWYG editor to HTML. Would <span style='color:red;'> be any better  
for color-based emphasis or annotations? (Yeah, yeah, it's not good  
for accessibility, but neither of those is. Is it realistic to kill  
color UI in WYSIWYG editors?)

> 0.0279	Attribute “accesskey” not allowed on element “a” at this point.

The design of accesskey sucks, but the attribute seems relatively  
popular.

> 0.0236	Attribute “profile” not allowed on element “head” at this  
> point.

The profile instances are mostly due to WordPress. The scheme of  
picking at most one page per *hostname* still picked a lot of  
username.wordpress.com blogs. Also, there are a lot of other WP  
instances out there. These could be knocked out by a single WP version  
update.

> 0.0224	Attribute “size” not allowed on element “font” at this point.
> 0.0202	Attribute “bgcolor” not allowed on element “td” at this point.


Presentationalism.

This is the cut-off for errors that would not have been errors in HTML  
4.01 Transitional.

> 0.0194	Element “link” not allowed in this context. (The parent was  
> element “div”.) Suppressing further errors from this subtree.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Sunday, 31 August 2008 17:24:41 UTC