This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 26491 - Predictable data format
Summary: Predictable data format
Status: RESOLVED WORKSFORME
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL:
Whiteboard: tools
Keywords:
Depends on:
Blocks:
 
Reported: 2014-07-31 20:28 UTC by Anne
Modified: 2017-07-21 09:23 UTC
CC List: 4 users

See Also:


Description Anne 2014-07-31 20:28:25 UTC
Ideally elements, their associated attributes, IDL fragments, events, etc. would be marked up in such a way that you can extract them from the document as collections.

So that projects such as https://twitter.com/codylindley/status/494155769307078662 do not have to be manual.
Comment 1 Ian 'Hixie' Hickson 2014-07-31 22:51:16 UTC
What exactly do you want? I mean, for example, you say "events" but almost any arbitrary string can be an event that gets dispatched (e.g. by EventSource, or dispatchEvent). Are we also looking to include any sort of associated prose? Other metadata?
Comment 2 Anne 2014-08-01 04:17:26 UTC
I guess the class(es) it implements and maybe some prose, if available. You can make this as rich as you want. Basically, make the data the specification contains more scrapable for tools that have no AI.
Comment 3 Ian 'Hixie' Hickson 2014-08-01 16:25:02 UTC
I need precise detail about what you want to be able to expose it.
Comment 4 Anne 2014-08-04 07:41:54 UTC
For detailed requirements I guess you could ask those who raised this on Twitter and developers of sites that show such aggregate data. I expect what they want to evolve over time and would be quite different from site to site.

(But you could start simple. If you download the current HTML specification without the indexes, it's hard to get a list of elements out of it.)
Comment 5 Ian 'Hixie' Hickson 2014-08-04 17:40:41 UTC
Right. That's why the HTML specification has indices. If you download it without the section on <ol>, then you wouldn't have information on how <ol> works either. :-)
Comment 6 philip 2014-08-04 18:06:52 UTC
About a year ago I created a JavaScript tool called HTML Inspector. The idea was that teams could write their own rules about how their HTML should be structured, and offending commits would break the build.

In addition to custom rules, it also came with some default rules for basic HTML5 conformance. It would error when obsolete elements were used or when required element attributes were missing, etc.

This JavaScript file should give you a general idea of the types of data the tool is interested in:
https://github.com/philipwalton/html-inspector/blob/master/src/modules/validation.js

I generated these datasets mostly manually, copying and pasting from here:
http://drafts.htmlwg.org/html/master/index.html#index

It would be great if much of this information could be made available to access programmatically, ideally without having to parse HTML (unless the structure was predictable and unchanging, though that seems unlikely and hard to guarantee).

I have not updated HTML Inspector with new conformance data in over a year because the process for doing so is simply too time consuming. I would like to be able to run a build script that queries the latest spec and forms these rules automatically, but there didn't seem to be a good (and future-friendly) way to do that.
Comment 7 philip 2014-08-04 18:08:06 UTC
By the way, if such a method already exists, and I simply don't know about it, please let me know!
Comment 8 Ian 'Hixie' Hickson 2014-08-05 18:13:41 UTC
Actually describing conformance requirements in a declarative language is basically impossible; that's why we don't bother providing DTDs any more. (All the "*"s in the index are cases where the rules are more complicated than that table indicates.) I can approximate the rules in a non-normative data file if there's a compelling use case for not implementing a real validator and if someone tells me exactly what the schema should be for this data, but I'm very reluctant to make up a format myself since I wouldn't know how it would be used.
Comment 9 philip 2014-08-07 23:19:02 UTC
I think maybe it's worth simplifying the use case before continuing this discussion.

Let's say I'm building a code editor, and I want to add some sort of error-like syntax highlighting when someone is using an HTML tag that either doesn't exist or is obsolete. And maybe I want to take this a bit further to inform people that they're using incorrect or outdated attributes (like align="right" on a <div>).
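
As an illustration of the editor use case described above, here is a minimal Python sketch that flags unknown or obsolete elements and obsolete attributes. The three datasets are hypothetical, hand-copied, and deliberately tiny; `TagChecker` and `check` are names invented for this example and are not part of HTML Inspector or any real tool.

```python
from html.parser import HTMLParser

# Hypothetical, hand-maintained datasets (assumptions for illustration only;
# far from complete and not taken from any machine-readable spec source).
KNOWN_ELEMENTS = {"html", "head", "body", "title", "div", "p", "span", "a", "ol", "ul", "li"}
OBSOLETE_ELEMENTS = {"center", "font", "marquee"}
OBSOLETE_ATTRIBUTES = {("div", "align")}


class TagChecker(HTMLParser):
    """Record a (line, message) pair for each problem found in start tags."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        line, _column = self.getpos()
        if tag in OBSOLETE_ELEMENTS:
            self.problems.append((line, f"obsolete element <{tag}>"))
        elif tag not in KNOWN_ELEMENTS:
            self.problems.append((line, f"unknown element <{tag}>"))
        for name, _value in attrs:
            if (tag, name) in OBSOLETE_ATTRIBUTES:
                self.problems.append((line, f"obsolete attribute {name!r} on <{tag}>"))


def check(source):
    """Return the list of problems for an HTML source string."""
    checker = TagChecker()
    checker.feed(source)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    for line, message in check('<div align="right"><center>hi</center></div>'):
        print(f"line {line}: {message}")
```

The check itself is simple; the hard part, as the rest of this comment explains, is keeping the three datasets current.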

Now, I could easily just go to the spec, copy and paste the list of current HTML elements and acceptable attributes, and ship a version of my code editor.

The problem is: how do I keep this information up to date?

What do browser makers do? (honest question)

Does the spec provide diffs from one version to the next?

I think what I'm ultimately seeking is not some full-featured API that tells me exactly what is and what is not valid. I'm looking for either a more reliable way to get current spec info in a machine-readable format *or* a more convenient way to see what's changed between the current version and some previous version of the spec.
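
To make the second option concrete: if two snapshots of such data were available, comparing them would be straightforward. The Python sketch below assumes a hypothetical snapshot shape (a dict mapping each element name to a set of attribute names); the sample values are invented for illustration and do not reflect real spec history.

```python
def diff_snapshots(old, new):
    """Compare two {element_name: set_of_attribute_names} snapshots.

    The snapshot shape is an assumption made for this sketch; the spec
    does not publish its data in this form.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {
        name: {
            "added": sorted(new[name] - old[name]),
            "removed": sorted(old[name] - new[name]),
        }
        for name in sorted(old.keys() & new.keys())
        if old[name] != new[name]
    }
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    previous = {"div": {"align", "class"}, "center": set()}
    current = {"div": {"class"}, "main": set()}
    print(diff_snapshots(previous, current))
```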

As a specific request: if the data in the tables on the index page[1] were provided in a more machine-accessible format, that would accomplish 99% of what I'm after. (As they are now, the tables do not have any identifying attributes, so scraping the page would potentially be quite error-prone.) Is there a lower-level source of this info that's used to generate those tables?

Anyway, hopefully that helps provide a little more context.

[1]: http://drafts.htmlwg.org/html/master/index.html#index
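
The fragility mentioned above can be seen in a small Python sketch. With no identifying attributes, a scraper can only pick a table by its position in the document. The `SAMPLE` markup is a made-up miniature, not the real index page, and `TableCollector` and `element_names` are hypothetical names; the parser assumes explicit end tags and no nested tables.

```python
from html.parser import HTMLParser

# Made-up miniature of an index page (an assumption, not the real markup).
SAMPLE = """
<table>
  <tr><th>Element</th><th>Description</th></tr>
  <tr><td><code>a</code></td><td>Hyperlink</td></tr>
  <tr><td><code>ol</code></td><td>Ordered list</td></tr>
</table>
<table>
  <tr><th>Attribute</th><th>Element(s)</th></tr>
  <tr><td><code>href</code></td><td>a; area</td></tr>
</table>
"""


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def element_names(html, table_index=0):
    """Return the first column of the chosen table, skipping the header row.

    Selecting by position is the fragile part: if a table is inserted
    earlier in the page, this silently reads the wrong data.
    """
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return [row[0] for row in collector.tables[table_index][1:]]


if __name__ == "__main__":
    print(element_names(SAMPLE))
```

Calling `element_names(SAMPLE, table_index=1)` returns attribute names instead of element names, which is exactly the kind of silent mix-up that stable identifiers on the tables would prevent.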
Comment 10 Ian 'Hixie' Hickson 2014-08-13 18:33:45 UTC
What browser makers and validators do is update their software when the spec is updated, and then ship the updates. The same should happen with editors. There is no way to get the spec in a machine-readable form short of implementing the spec. That's what the spec is for.

The index tables are maintained by hand. (Note that the page you cited is not the canonical version of the spec, it's some W3C fork that I would strongly recommend ignoring. The spec is at http://whatwg.org/html .)

The tables in question contain mostly vastly simplified information, which is why they are non-normative. I don't recommend trying to base any code on those tables; if you do, it'll be woefully incomplete and simplistic.


Getting a better way to list differences between versions is certainly something I'm interested in doing, but I don't know how to do it better than:
   http://html5.org/tools/web-apps-tracker
If you have any ideas for how to do better than that, please do file a separate bug and I'll see what I can do.
Comment 11 Anne 2017-07-21 09:23:43 UTC
This is now (being) done with the Bikeshed/Shepherd model.