Re: Heuristic Tests for Data Tables (Discussion) from Ben 'Cerbera' Millard on 2007-09-05 (public-html@w3.org from September 2007)

From: Ben 'Cerbera' Millard <cerbera@projectcerbera.com>
Date: Wed, 5 Sep 2007 11:20:14 +0100
To: "James Graham" <jg307@cam.ac.uk>, "Robert Burns" <rob@robburns.com>
Cc: "HTMLWG" <public-html@w3.org>
Message-ID: <00be01c7efa6$5b3889b0$0201a8c0@ben9xr3up2lv7v>
This is really valuable work, James. The colour-coding when clicking a cell 
is super cool. :-)

Typo: "Interprert" should be "Interpret" in the checkbox labels.

Is the "two tables per page not supported" bug fixed? This e-mail reviews a 
page with one table and a page with more than one table.

Simon 'zcorpan' Pieters and I have ideas about a good implicit algorithm, 
parts of which I've written about generally. Anyone who wants to chat about 
it is welcome to MSN me:

<cerbera@projectcerbera.com>

Simon is often in #whatwg:

<irc://irc.freenode.net/whatwg>

I make websites for a living, so I am often unavailable. But iterating 
through ideas is much faster in instant messaging or IRC than in e-mails.

James Graham wrote:
>  * In rows or columns that consist only of headings the other headings 
> from that row or column are not applied to cells in the row/col.

Treating continuous headers like this makes sense. At the moment, this part 
doesn't seem to be working:

1. In HTML4 mode, all column headers after the first column header get 
associated with the first column header:
    a. 
<http://james.html5.org/cgi-bin/tables/table_inspector.py?uri=http%3A%2F%2Fsitesurgeon.co.uk%2Ftables%2Fthatcher%2F01-water%2Fminimal.html&algorithm=html4>
    b. 
<http://james.html5.org/cgi-bin/tables/table_inspector.py?uri=http%3A%2F%2Fprojectcerbera.com%2Ftutorials%2Fgtavc%2Fpaths%2Fdefinition&algorithm=html4&scope=1&headers=1>
2. In HTML5 mode, this does not occur:
    a. 
<http://james.html5.org/cgi-bin/tables/table_inspector.py?uri=http%3A%2F%2Fsitesurgeon.co.uk%2Ftables%2Fthatcher%2F01-water%2Fminimal.html&algorithm=html5>
    b. 
<http://james.html5.org/cgi-bin/tables/table_inspector.py?uri=http%3A%2F%2Fprojectcerbera.com%2Ftutorials%2Fgtavc%2Fpaths%2Fdefinition&algorithm=html5>
3. In Experimental mode with all options turned off, this does not occur:
    a. 
<http://james.html5.org/cgi-bin/tables/table_inspector.py?uri=http%3A%2F%2Fsitesurgeon.co.uk%2Ftables%2Fthatcher%2F01-water%2Fminimal.html&algorithm=experimental>
        Multiple headings are found for some cells. This seems correct on 
the whole until the final <th colspan>. See notes after this list.
    b. 
<http://james.html5.org/cgi-bin/tables/table_inspector.py?uri=http%3A%2F%2Fprojectcerbera.com%2Ftutorials%2Fgtavc%2Fpaths%2Fdefinition&algorithm=experimental>
        (No headers are found.)

Adjacent header cells need different treatment from headers with data 
between them. Simon and I found many cases where this is true vertically. I 
mentioned some in a [previous] message:

1. <http://sports.espn.go.com/mlb/stats/aggregate?statType=fielding&group=9>
2. <http://www.ircseries.com/html/Standings_Results.asp>

The "water-minimal.html" example [Thatcher] is a very simple one to get your 
teeth into. Each column is like this:

 <th>Header 1
 <th colspan=2>Header 2
 <td>Data 1
 <td>Data 2
 <td>Data 3
 <td>Data 4
 <td>Data 5
 <td>Data 6
 <th colspan=2>Header 3
 <td>Data 7

For Data 1-6, Header 1 & 2 must be applied as you'd expect. But for Data 7, 
Header 1 & 3 must be applied. Header 3 replaces Header 2 whilst Header 1 
remains. This is obvious when looking at the table in a browser but not so 
clear when thinking about the markup.

In these cases, I never found a <tbody> scope="rowgroup" except for the 
personal websites of markup enthusiasts like Project Cerbera, my own site. 
:-)  But it's fairly common to see these colspan-sensitive table header 
arrangements. So I think it's worth making this work natively without extra 
markup, at least in those cases where <th> or an alias of <th> has been 
used.

My rough attempt at writing the steps for this goes column by column, from 
the first cell downwards through all subsequent cells, collecting and 
applying header associations on the way. HTML4 describes searching up from 
each <td> until reaching a <th>. You can write it for either direction, but 
I find this way a bit easier to follow.

For each column in the table:

1. Collect the <th> and go down one cell.
2. If this is another <th>, associate them regardless of colspan. Go down 
one cell.
3. Repeat 2 until you reach a <td>.
4. Associate all the <th>s so far with the new <td>. Go down one cell.
5. Repeat 4 until you reach another <th>.
6. For each <th> collected so far:
    a. Check its colspan with that of the new <th>
         i. If they are different, associate them.
        ii. If they are identical (including if they are both colspan=1), 
replace the collected <th> with the new <th>.
    b. Go down one cell.
7. Repeat 6 until you reach a <td>.
8. Repeat 4-7 until the end of the table.

This assumes there are no scope or headers attributes and the topmost cell 
of the column is a <th> (including aliases <td><b> or <td><strong>).
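
A minimal Python sketch of the steps above, under the same assumptions (no 
scope or headers attributes, topmost cell of the column is a <th>). The Cell 
class and the headers_for_column name are hypothetical, made up for 
illustration; they are not part of the table inspector. It reads step 6 as 
"drop every collected <th> whose colspan equals the new one, then add the 
new one", which is only one possible reading of the steps:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One cell of a column (hypothetical structure for this sketch)."""
    kind: str         # "th" or "td"
    text: str
    colspan: int = 1


def headers_for_column(column):
    """Walk one column top to bottom and return a list of
    (data cell text, [header texts]) pairs."""
    collected = []      # headers currently in effect
    seen_data = False   # True once the first <td> has been passed
    result = []
    for cell in column:
        if cell.kind == "td":
            # Step 4: associate every header collected so far.
            seen_data = True
            result.append((cell.text, [h.text for h in collected]))
        elif not seen_data:
            # Steps 1-3: leading headers are kept regardless of colspan.
            collected.append(cell)
        else:
            # Steps 6-7: a later header replaces collected headers with
            # the same colspan; headers with a different colspan remain.
            collected = [h for h in collected if h.colspan != cell.colspan]
            collected.append(cell)
    return result


# The column from the water-minimal.html example.
WATER_COLUMN = (
    [Cell("th", "Header 1"), Cell("th", "Header 2", colspan=2)]
    + [Cell("td", f"Data {n}") for n in range(1, 7)]
    + [Cell("th", "Header 3", colspan=2), Cell("td", "Data 7")]
)

for data, headers in headers_for_column(WATER_COLUMN):
    print(data, "<-", ", ".join(headers))
```

For that column, Data 1-6 get Header 1 & 2 and Data 7 gets Header 1 & 3. 
Note that the replacement header is appended at the end of the list rather 
than swapped in at the old position, so the order of headers can differ from 
a strict in-place replacement.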

If you could add a new mode to implement this aspect alone, I will test how 
well it works in practice, in isolation from the effect of any other steps. 
You could call this mode "Be smart about colspan" or choose a name yourself. 
Testing each subprocedure in isolation would be a more manageable way to 
test their feasibility, imho.

Eventually, I imagine we'll have several functions amalgamated into the 
final algorithm so a wide variety of tables are accommodated. For example, 
"associate" in the above steps would actually mean "if present, use the abbr 
attribute value, otherwise use the element's content" to cut down verbosity. 
This could be a little function for re-use in other parts of the algorithm. 
Perhaps you could make it an option for each of the existing algorithms so 
we can test its feasibility?
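
A rough Python sketch of that little function. The name header_label is 
hypothetical, and the example parses with xml.etree.ElementTree from the 
standard library, so it assumes well-formed markup rather than real-world 
HTML. Treating an empty abbr as absent is an assumption of this sketch, not 
something settled above:

```python
import xml.etree.ElementTree as ET


def header_label(th):
    """Return the text to announce for a header cell: the abbr attribute
    if it is present and non-empty, otherwise the element's text content."""
    abbr = th.get("abbr")
    if abbr is not None and abbr.strip():
        return abbr.strip()
    # itertext() walks the text of the element and all its descendants.
    return "".join(th.itertext()).strip()


with_abbr = ET.fromstring('<th abbr="Qty">Quantity ordered</th>')
without_abbr = ET.fromstring('<th>Quantity <b>ordered</b></th>')

print(header_label(with_abbr))
print(header_label(without_abbr))
```

Keeping this separate from the association steps means each existing 
algorithm could call it wherever it currently reads a header's content.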

You could MSN me or IRC with Simon to discuss any details like this and 
iterate ideas. I would help with the coding but I'm rubbish at it. :-)

[Thatcher] <http://sitesurgeon.co.uk/tables/thatcher/01-water/minimal.html>
[previous] 
<http://lists.w3.org/Archives/Public/public-html/2007Aug/1005.html>

--
Ben 'Cerbera' Millard
Collections of Interesting Data Tables
<http://sitesurgeon.co.uk/tables/readme.html>
Received on Wednesday, 5 September 2007 10:21:21 UTC