ParseIssues

From HTML WG Wiki

Parse issues for text/html serialization

Whitespace handling

Repair of invalid table elements

Repair of invalid list elements

Handling of unknown head elements

The parsing algorithm in the current draft disallows all unknown elements within the document's head element. Current browsers treat unknown head elements as:

  • Internet Explorer: void only
  • Firefox: void only
  • Safari: void only
  • Opera: an unknown element ends the head, as in the current HTML5 draft

Treating unknown elements as non-void provides the greatest flexibility in terms of updating the HTML specification and for author ad hoc extensions. The trend has also been to augment the content models of elements, as we see in XHTML2. It might make sense to treat all elements as non-void, whether in the head or the body, and to require unknown elements to be explicitly closed. Perhaps the XML self-closing tag could even be adopted without breaking existing content, or applied only to new and unknown elements.

Handling of unknown body elements

The parsing algorithm in the current draft calls for treating all unknown elements as phrase elements. Current browsers treat unknown body elements as:

  • Internet Explorer: void
  • Firefox: phrase
  • Safari: non-void
  • Opera: non-void

Treating unknown elements as non-void provides the greatest flexibility in terms of updating the HTML specification and for author ad hoc extensions.
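
The practical difference between the void and non-void treatments can be illustrated with a toy tree builder. The Python sketch below is only an illustration under stated assumptions: it is not the HTML5 parsing algorithm and does not reproduce any browser's implementation. The token format, the build_tree function and the x-note element name are all hypothetical, and every start tag is assumed to name an unknown element.

```python
def build_tree(tokens, unknown_is_void):
    """Build a nested (name, children) tree from a flat token list.

    Toy model: every start tag is assumed to be an unknown element.
    If unknown_is_void is True, the element never takes children.
    """
    root = ("body", [])
    stack = [root]  # stack of open elements, innermost last
    for kind, value in tokens:
        if kind == "start":
            node = (value, [])
            stack[-1][1].append(node)
            if not unknown_is_void:
                # Non-void: the element stays open until its end tag.
                stack.append(node)
        elif kind == "text":
            stack[-1][1].append(value)
        elif kind == "end":
            # Close the element only if it is the innermost open one;
            # any other end tag is ignored in this toy model.
            if len(stack) > 1 and stack[-1][0] == value:
                stack.pop()
    return root


# Hypothetical input: <x-note>hello</x-note>
TOKENS = [("start", "x-note"), ("text", "hello"), ("end", "x-note")]

void_tree = build_tree(TOKENS, unknown_is_void=True)
non_void_tree = build_tree(TOKENS, unknown_is_void=False)
```

Under the void policy the text becomes a sibling of the empty x-note element; under the non-void policy it becomes its child, which is what an author writing an ad hoc extension element would most likely expect.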

Document character set

The text/html serialization of HTML5 allows some 28 additional characters compared with XML 1.0 (XML 1.1 allows these characters as character references, but it also requires an additional 33 control characters in the range U+007F – U+009F to be written as character references). While these characters are permitted as part of the document conformance norms, their meaning is undefined by the HTML5 recommendation. They are not explicitly included as whitespace characters, though some of them do play that role in the HTML5 parsing algorithm and in other places. These include:

  • control characters: U+0001 – U+0008 (8 characters)
  • control character: U+000B
  • control character: U+000C
  • control character: U+000E – U+001F (18 characters)
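
The ranges listed above can be checked mechanically. The following Python sketch simply encodes the four ranges from this list and locates any such characters in a string; the names EXTRA_CONTROL_RANGES, is_extra_control and find_extra_controls are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# The control-character ranges listed above (inclusive bounds).
EXTRA_CONTROL_RANGES = [
    (0x0001, 0x0008),
    (0x000B, 0x000B),
    (0x000C, 0x000C),
    (0x000E, 0x001F),
]

# Total number of code points covered by the ranges.
EXTRA_CONTROL_COUNT = sum(hi - lo + 1 for lo, hi in EXTRA_CONTROL_RANGES)


def is_extra_control(cp: int) -> bool:
    """Return True if code point cp falls in one of the listed ranges."""
    return any(lo <= cp <= hi for lo, hi in EXTRA_CONTROL_RANGES)


def find_extra_controls(text: str) -> List[Tuple[int, str]]:
    """Return (index, 'U+XXXX') for each listed control character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_extra_control(ord(ch))
    ]
```

The ranges sum to the 28 characters mentioned above. Tab (U+0009), line feed (U+000A) and carriage return (U+000D) fall outside them, since XML 1.0 already permits those three.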

Also:

  • surrogate characters: U+D800 – U+DFFF (2,048 surrogate characters)

The surrogate characters are allowed in XML, though not separately from their paired twin, and they are interpreted only as the non-Basic Multilingual Plane character they reference.
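
How a surrogate pair references a single character outside the Basic Multilingual Plane can be shown with the standard UTF-16 arithmetic. The Python sketch below is a minimal illustration; the function name combine_surrogates is hypothetical, and the example pair is chosen arbitrarily.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF  # high (lead) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF    # low (trail) surrogates

# Size of the whole surrogate block U+D800 - U+DFFF.
SURROGATE_COUNT = LOW_END - HIGH_START + 1


def combine_surrogates(high: int, low: int) -> int:
    """Return the code point referenced by a UTF-16 surrogate pair.

    Raises ValueError if either half is outside its range, which
    mirrors the rule that a surrogate has no meaning on its own.
    """
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not LOW_START <= low <= LOW_END:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# Example pair D83D DE00, which references U+1F600.
example = combine_surrogates(0xD83D, 0xDE00)
```

Each half contributes 10 bits to the result, so the 1,024 high and 1,024 low surrogates together address the code points from U+10000 to U+10FFFF.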

See also