HTML 5, the markup

By: Karl Dubost

Published: 14 November 2008

HTML 5 is a giant specification. It covers the content model, the APIs, the DOM, the parsing algorithm, and more. We received many comments that it was very hard to read for implementers and documentation writers who simply want to better understand how HTML 5 documents are written.

Discover the editor's draft of HTML 5: The Markup Language! Mike Smith has extracted the parts of HTML 5 related to the content model. This document is aimed at people who would like to focus on the content model: reviewers, authoring tool implementers, and documentation writers.

We hope that it will help everyone to have a better understanding of the HTML 5 content model. An additional document for learning HTML 5, named Web Authoring Guidelines, should be provided in the future.

Comments (16)

IrnBru001 - 14 November 2008 at 15:37:52 UTC

I don't understand why some of this stuff is in HTML5. I try to be a code conscious web developer but a lot of what is in the HTML5 spec seems to fly in the face of everything I've understood about 'clean code'. For example why does the spec allow for the html tag to be omitted? Or why are attribute values allowed to be unquoted? I thought XHTML was to help us move away from these 'sloppy' omissions but now there is a new version of HTML that says it's ok.
Why isn't this moving towards what has for years now been popularly considered 'coding the right way'? Are there reasons that necessitate the omission of an html tag or unquoted attribute values? If it is never necessitated, why is it allowed?
I hope the reason is something I've not thought of and not the only thing I have thought of... which is some strange understanding of backwards compatibility.
Is there a document that explains the 'philosophical' reason for these decisions?
- Karl Dubost - 14 November 2008 at 16:03:55 UTC
  
  Because it is what browsers actually do.
  That said, nothing forbids you from using a stricter syntax for writing HTML 5 (as HTML or as XML). I will have more time in December for this specific topic, as I mentioned above (Web Authoring Guidelines).
  So keep your coding practices as you have done for a while now. That is good, easier to read for most people, easier to teach. Close your elements, keep your double quotes.
Kornel - 14 November 2008 at 20:31:11 UTC

@IrnBru001: HTML4 and earlier allow omission of <html>, <body> and a few other tags too. It's not the same as sloppy coding and omitting, e.g., </div>. The parsing rules clearly define which tags are optional and when they can be omitted. Whether you insert them or not, the browser will parse the document exactly the same way. The same goes for quotes: you can omit them if the value is limited to certain safe characters and there's absolutely no ambiguity there.
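
To make the point about quotes concrete, here is a minimal Python sketch. It uses the standard library's `html.parser` purely as a stand-in: it is not a browser and does not implement the HTML 5 parsing algorithm. The names `can_omit_quotes`, `StartTagCollector` and `start_tags` are made up for this illustration, and the set of unsafe characters reflects the HTML 5 syntax rules for unquoted attribute values as I understand them.

```python
from html.parser import HTMLParser

# Characters that make an unquoted attribute value ambiguous (assumption:
# ASCII whitespace plus " ' = < > and the backtick).
UNSAFE_CHARS = set(" \t\n\f\r\"'=<>`")


def can_omit_quotes(value):
    """Return True if the value is non-empty and has no unsafe character."""
    return value != "" and not any(ch in UNSAFE_CHARS for ch in value)


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def start_tags(markup):
    """Parse the markup and return the list of (tag, attributes) pairs."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


if __name__ == "__main__":
    unquoted = start_tags("<a href=index.html class=nav>")
    quoted = start_tags('<a href="index.html" class="nav">')
    print(unquoted == quoted)              # same parse result either way
    print(can_omit_quotes("nav"))          # safe value
    print(can_omit_quotes("two words"))    # whitespace needs quotes
```

With a safe value, the quoted and unquoted forms give the parser the same attribute list; once the value contains whitespace or one of the other listed characters, quotes are the only unambiguous option.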
Jarvklo - 16 November 2008 at 14:04:28 UTC
@Kornel and Karl
IMHO IrnBru001 made some valid points.
As good as some of the new stuff in HTML5 promises to be, the optional tags issue also complicates stuff that was made simple several years ago.
Example - the HTML element.
HTML4: "In some elements the HTML start- and/or end- tag are optional"
(expressed in the DTDs as eg.

<!ELEMENT HTML O O (%html.content;) -- document root element -->

for the HTML element)
XHTML1: "Start- and end tags are required for all non-empty elements"
HTML5: One section in http://www.w3.org/html/wg/markup-spec/#omittable-tags full of special treatment cases to memorize for anyone interested in learning HTML5 thoroughly...
E.g. - the optional tagging of the html element is described as:

An html element's start tag may be omitted if the first thing inside
the html element is not a comment.
An html element's end tag may be omitted if the html element is not immediately
followed by a comment and the element contains a body element that is either not
empty or whose start tag has not been omitted.

And that's just one element...
Progress?
I for one am not convinced when it comes to optional tags at least :P
"nothing forbids you to use a stricter syntax ... ..."
Clearly - but this question is more related to where you guys managed to find an actual business case to motivate complicating things almost beyond recognition.
The principles state something about "User benefits over coder benefits over Implementor benefits" - right? Iterating "Because that's what Browsers actually do" seems to hint more at benefiting the Implementors over coders or users, don't you agree?
"Whether you insert them or not, the browser will parse the document exactly the same way"
Clearly - but still - going backwards to a more complicated model for achieving the same thing that is already handled by the simpler and already well accepted "all non-empty elements are to be closed" principle is hardly progress - don't you agree?
- Karl Dubost - 17 November 2008 at 04:40:51 UTC
  
  Let's say it in a different way.
  The language which is implemented and interpreted by browsers (parsers) covers a lot of different cases. They have to cover all these cases, because of what is html on the Web today.
  Now, if we talk about Web practices, about Web coding, etc., nobody forbids you to stick to a very strict syntax. I do myself encourage this because indeed it is a lot easier to teach. I think it's what I will try to show in the Web Authoring Guidelines.
  An authoring tool implementer will have to cope with reading broken documents, know how to handle them, and know how to save them without modifying them. Usually when a tool decides that your coding style is not the one that it should be, people become angry.
  So yes I agree, xhtml style syntax rules are simpler, at least for me. No questions about that.
Nicolas Krebs - 16 November 2008 at 16:34:28 UTC

Please update the web page of the working group http://www.w3.org/html/wg/ with all its draft, including "HTML 5 differences from HTML 4", HTML 5 for authors, "HTML: The Markup Language".
IrnBru001 - 17 November 2008 at 19:48:37 UTC

Thanks for the response, and for tolerating this lurker. This last comment has two very interesting lines I want to highlight.

An authoring tool implementer will have to cope with reading broken documents, know how to handle them, and know how to save them without modifying them.

That, paying close attention to my em, is kinda a tell. A 'broken document', we all 'know' it shouldn't be that way, but actually it seems that what we know is broken is actually acceptable for HTML5 (in some cases, like those we've been talking about).

They have to cover all these cases, because of what is html on the Web today

This was my worry. I understand a browser should tolerate it (and must), but should the spec? Isn't that what quirks mode is for?
I think I may just have a different understanding of what the spec's role is. Intuitively I feel like it should be the rules for the way things should be, and that a different 'spec' should be around to handle the way things are (parser implementation recommendations?). At this point it raises the question of what "a very strict syntax" is: HTML5 tags etc. but XHTML's syntax for the gold star?
Just my thoughts.
- Karl Dubost - 18 November 2008 at 06:55:30 UTC
  
  A broken document which can be parsed following the HTML 5 parsing algorithm is not necessarily a valid (conformant) HTML 5 document. That is a very clear distinction.
  The specification has more than one goal. It is here to help browser implementers to recover broken documents (more than 90% of the Web).
  On the other side, HTML 5 has a stricter content model than HTML 4 with the same liberal syntax. For example, the center element is gone. The align attribute is gone. If a big enough community of authors, Web designers, and authoring tool implementers is pushing for a stricter syntax in writing documents, and actually implementing these requirements, then you will get what you would like.
  In the meantime the specification, as I said above, doesn't forbid you a strict syntax with closing elements and double quotes around attributes.
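
A stricter authoring style of this kind can be checked by a tool without any change to the specification. The sketch below is a hypothetical illustration in Python, again using the standard library's `html.parser` rather than a real HTML 5 parser. It only checks the "close your elements" half of the advice, and the names `StrictStyleChecker`, `strict_style_problems` and `VOID_ELEMENTS` are invented for this example. It is a style check, not a conformance checker.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (assumption: the usual void elements).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "param", "source", "track", "wbr",
}


class StrictStyleChecker(HTMLParser):
    """Hypothetical checker: every non-void start tag needs its end tag."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements
        self.problems = []   # human-readable complaints

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # XHTML-style <br /> is already balanced; nothing to track.
        pass

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def strict_style_problems(markup):
    """Return a list of complaints; an empty list means the style is met."""
    checker = StrictStyleChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems + [f"unclosed <{tag}>" for tag in checker.stack]


if __name__ == "__main__":
    print(strict_style_problems("<p>one</p><p>two</p>"))
    print(strict_style_problems("<p>one<p>two"))
```

Markup with omitted end tags can still be parseable; the checker simply flags it as not following the stricter house style.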
Olaf - 20 November 2008 at 18:30:59 UTC

Maybe the XHTML2 draft goes more in the direction of progress that is interesting for authors. 'HTML5' seems to be interesting mainly for implementors, who learn how to present broken documents and many other things that are not very interesting for document authors.
On the other hand, 'HTML5' provides a few new elements with semantic meaning that the XHTML2 draft does not yet have. Personally I think there is a lot of room for improvement and for new elements with semantic meaning, which would be more interesting for authors than rules on how to fix broken documents. If we compare the semantically poor (X)HTML with other languages that also care about text markup, like DAISY, DocBook, FictionBook, LML, we can see how impressively expressive a markup language for text can be, so why not (X)HTML? This might get pretty interesting combined with the RDFa+XHTML approach and maybe the idea of metadata for each element and switch as available in SVG and SMIL. Ten years after HTML4 it is time for a change, for some progress. There should be much more than fixing broken documents in HTML5, and what is intended for authors should be obviously simple and understandable enough to get fewer broken documents in the future.
I think that with 'HTML5' we will get even more broken/stupid documents around, because the majority of authors will not understand 'HTML5' and whether they should write broken documents now or something with a relevant semantic structure.
Olivier Wehner - 28 November 2008 at 11:06:53 UTC

Because it is what browsers actually do.

That is not an argument.
It is not the function of a specification to give blessings to all the odds that have been done before.

On this page we had extensive discussions on tag soup and how to cope with it, but no one ever came up with the idea that tag soup should be specified! I can hardly believe I am reading what I'm reading here.

The specification has more than one goal. It is here to help browser implementers to recover broken documents (more than 90% of the Web).

No! Most definitely no! This is exactly what the OP meant with "some strange understanding of backwards compatibility".

It is not the task of this working group to do the browser vendors work.

I agree that user agents should be tolerant, but the specification MUST be strict. Otherwise it's not worth the title "specification" (nor the effort to write it). A spec must be plain, clean and, wherever possible, simple.

90% of the web (rather the html on it, no?) is broken? So what? That is the problem of the authors. It may (and should) be a concern of the browser vendors. But it must not guide the development of the specification.

W3C may have to catch up with the development out there, but surely not with the quirks!
- Karl Dubost - 29 November 2008 at 09:11:15 UTC
  
  Class of Products is the key to your answer. When you define a technical specification, there are different categories of products using this specification.
  For example, consumers such as a desktop browser, an HTML tidying library, and a validator are 3 different products belonging to the user agents. All of them need to read the content which is sent with text/html. It is called the parsing phase. They are part of one class of products.
  Let's continue with producers such as an authoring tool (WYSIWYG or text only), an HTML [PHP, Python, Perl, Java, …] library, or a simple human. All of them need to write conformant HTML which has to be sent with text/html or with application/xhtml+xml. They are part of another class of products.
  The HTML 5 specification caters for both.
  Section 8, the parsing algorithm, is dedicated to how to correctly read the content available online, both the HTML and the tag soup. It is the first time that this has been done. Until now browsers had to create their own techniques, and it led to big interoperability issues.
  The HTML 5 specification also caters for the content model, which is what you use for writing HTML. That is not at all the same as parsing and doesn't address the same class of products. It happens that the HTML 5 content model is stricter than that of HTML 4. You MUST write a conformant HTML 5 document served as text/html (which is not tag soup) or a conformant HTML 5 document written as XML and served as application/xhtml+xml.
  One thing which is missing is an algorithm for html tidying libraries which would help developers to write interoperable tools parsing tag soups and creating conformant html 5 markup.
  Hope it helps to understand.
Robin Alexander - 30 November 2008 at 23:48:50 UTC

Strongly agree with IrnBru001 and Olivier Wehner. Another aspect of this is to ease the unnecessary mental work needed to code and read code. For the web there are so many markup, scripting and programming syntax structures one needs to know, anything that makes life easier is better. And it is easier if one knows that there is one right way to do something (such as quotes around attribute values). That means I don't have to worry about it; just use the quotes. I don't have to use brain cells to wonder, "are quotes required here or not?" I find the more unnecessary choices I have, the harder it is to master the numerous syntax systems we have to learn and use. And yes, a specification should be the ideal, not just a codification of what people do, good and bad.
Before getting fascinated with web development I was in accounting. There is a field where the "standards" are frequently just what people were doing all along and there is no reliable theory behind them. We see the mess that's gotten us into!
somestrangeguy - 2 December 2008 at 14:24:28 UTC

I'm wondering which version of a given site will render faster in IE - v1 based on pure HTML4.01 or v2 based on pure HTML5 :)
g1smd - 6 December 2008 at 12:19:32 UTC

A core tenet of software design has always been "be strict in what you send, and liberal in what you accept".
HTML 5 seems to fly in the face of that, allowing liberal variation in what is sent - seemingly even giving blessing to it.
Sure, the spec might hint at standard ways to recover a broken document, but it should not be encouraging the authoring of broken documents.
Karl Dubost - 8 December 2008 at 20:24:06 UTC

@g1smd You will be happy. The HTML 5 specification doesn't encourage the authoring of broken documents; see my comment above.
Henri de Solages - 27 December 2008 at 09:35:17 UTC

Experience proves that loose standards lead to buggy implementations as well as security issues, and make implementors spend plenty of time on useless syntax exceptions rather than on more interesting new features. A serious problem to deal with is that SGML (of which HTML is an application) is based on a tree approach, that is, a SINGLE-inheritance system, but this doesn't fit all documents, especially not DOUBLE-entry tables, which are intrinsically based on double inheritance. This is why most column attributes defined in CSS are not inherited by cells, for instance. So rather than allowing non-closed tags etc., it'd be much more useful to define a two-step parsing process allowing multiple inheritance: a first pass to build the tree, a second one to infer multiple inheritance from the tree.