W3C NOTE-xh-19980511

XML in HTML Meeting Report

W3C Note 11 May 1998

This Version:
http://www.w3.org/TR/1998/NOTE-xh-19980511
$Date: 2007/01/26 10:12:49 $
Latest Version:
http://www.w3.org/TR/NOTE-xh
Editors:
Dan Connolly <connolly@w3.org> W3C
Lauren Wood <lauren@softquad.com> Softquad

Status of This Document

This document summarizes the discussion and conclusions of a meeting held to coordinate across several W3C Working Groups. While the decisions of this forum are not binding on any of the working groups, they represent substantial experience and analysis and should guide future work.

Please direct comments to www-html, a public discussion forum.

This document is a NOTE made available by the W3 Consortium for discussion only. This indicates no endorsement of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by the NOTE.


Contents

  1. About the Meeting
    1. Background
    2. Participants
  2. Summary of Discussion
    1. RDF Requirements
    2. MathML Requirements
    3. Types of HTML
    4. General Requirements
    5. Possible Solutions
  3. Conclusions and Future Work
    1. Including XML by Reference
    2. Using Attributes to Hide New Idioms
    3. Using <XML>, an HTML Enhancement
    4. Using Script to Hide Content in Older Browsers
    5. Using Namespaces, Stylesheets, and the DOM
  4. References


About the Meeting

A number of issues regarding the use of XML[XML] in HTML documents were brought to the attention of the W3C Hypertext Coordination Group. In particular, MathML[MathML] and RDF[RDF] are written in XML and intended to be used in HTML documents.

In response, the coordination group held a meeting 11-12 Feb 1998 in San Jose, CA. We would like to thank the host, Sun Microsystems.

Background

As discussed in [Dialects], evolution of the HTML specification proceeds by introduction of new idioms which interact with deployed software in one of the following ways:

The idiom is ignored altogether.
  For example, <img src="..."> was ignored by the deployed software base when it was introduced. New empty elements and new attributes generally behave this way.
The enhanced functionality of the new idiom is ignored, but the content is otherwise handled sensibly.
  For example, <em>abc</em> displays without emphasis on some very old user agents. New "inline" elements often behave this way.
The idiom is disruptive in deployed software.
  For example, forms and tables display as a jumble of noise in software deployed before they were introduced. New block elements are particularly difficult to deploy gracefully.

For the past few years, the HTML Working Group has vetted new proposals on behalf of the web community, considering the value of each versus the cost of deployment. But with the introduction of XML into the web, markup design is decentralized. Each community or even each user can use whatever elements and attributes they choose and give them whatever meaning and significance they choose. As MathML and RDF show, at least some of this XML markup is intended for use inside HTML documents.

This meeting explored mechanisms to use XML markup in HTML documents: existing mechanisms and possible enhancements.

Participants

Participants from all W3C working groups, especially RDF, MathML, CSS&FP, XML, and DOM, were invited. A wide variety of experience and requirements was represented by the meeting participants.

Miscellaneous

The participants request that W3C make the W3C site searchable.

Summary of Discussion

RDF Requirements

Appendix B of [RDF] says:

The recommended technique for embedding RDF statements in an HTML document is simply to insert the RDF in-line. This will make the resulting document non-conformant to HTML specifications up to and including HTML 4.0 but the RDF Working Group hopes that the HTML specification will evolve to support this.

The discussion of the RDF requirements identified three possible solutions: putting all the information into attributes, putting it in an external file, and putting it at the end of the document. In general, the participants thought that putting information into attributes was safer than putting it in an external file, because of worries about security and about forcing tools to cope with multiple files. Since many tools already have to cope with multiple files, other participants thought this was not a drawback where security was not an issue. Some participants thought that putting the information in an external file would sometimes be a necessity, so tools would have to learn to cope.

MathML Requirements

MathML has many requirements. One is a system that can cope with several small chunks of XML in one document, since a document may contain many small equations. MathML has extreme formatting requirements, only some of which are shared by other objects. There was some discussion of MathML's needs in terms of the DOM and formatting properties. It must be possible to pass the MathML as a chunk to an external renderer, and to format the XML in a reasonable way. MathML does not include HTML elements within it; this was discussed within the MathML WG, but rejected. The requirement that content should not show up in down-level browsers was not as strong for MathML as for RDF, although some of the participants thought hiding it would be best.

Types of HTML

The participants definitely agreed on doing an XML block, where the contents of the block are well-formed XML, without any HTML semantics. There was much discussion about whether a reasonable method of including significant non-standard non-empty elements could be found, and whether it was possible to define some sort of "good" HTML that people would use. Reasons for not allowing HTML semantics in the XML block, even on elements with the same element types as exist in HTML, included:

  1. Browsers would need to expose their rendering model to other processors too soon.
  2. HTML and XML have different error-handling mechanisms.
  3. All XML processors would need to process HTML, and users might expect that processing to match current HTML browsers.

There was also some support for doing an XML version of HTML, where all the XML rules would apply.

The discussion about whether the contents of any non-standard element could be required to be well-formed XML mostly concluded that they could not, or that the requirement would be extremely expensive for users who simply want to add, e.g., a CHAPTER element to their pages. There was support for the notion that there is a difference between adding XML to pages (where the contents of the XML would be well-formed XML) and adding unknown elements in a standard way to HTML (where the contents of the unknown element would not follow the XML well-formedness rules). Whether the HTML in an unknown HTML element needed to be "good" HTML wasn't fully clarified at the meeting.

Another problem is that old browsers render processing instructions (PIs) as visible text.

General Requirements

During the discussion the following requirements were generally agreed upon.

Agreement on terminology: XML blocks, significant non-standard HTML elements (sometimes also called sprinkles), and crud (or real-world HTML). But how do we distinguish between XML blocks and significant elements? An XML block contains XML -- not HTML. A significant element contains HTML -- not XML (unless it's empty, of course; we have to be able to distinguish between empty and non-empty).

Possible Solutions

The question of how to "sprinkle" non-standard elements in an HTML document while retaining HTML semantics of all elements with HTML element and attribute types devoured most of the meeting. We did not come to a final conclusion on this subject. One proposed solution was to use new elements called CONTAINER and LEAF, with the CLASS attribute used to show the type. The drawback is that users can't define non-standard attributes. There was also much discussion as to whether users would accept this sort of solution, or whether they would want to invent their own element types. It was felt that this solution would allow users to keep on using "real" HTML (a.k.a. crud) inside the wrapper elements.

Another proposal was to allow users to define their own wrapper elements. If all elements within the block have end tags, even if they are EMPTY elements, then this could be the way to extensible HTML (not XML). There were several points against this, including the large number of non-standard EMPTY elements that already exist. Many participants thought that defining browser behaviour for this would be almost impossible, and that migrating HTML users to XML with the HTML tagset was a better solution.

How to clean up HTML came up again and again in the discussions. The participants agreed that it is impossible in the general case to create valid HTML from an arbitrary page on the Web without human intervention. Users will not want to risk breaking documents which function. Current HTML has three components: the element type names, default rendering, and semantics (e.g. forms).

There was a strong contingent that said users should wait for XML tools to become generally available and use those, rather than trying to add XML to HTML.

The MathML group would like a mechanism to tell browsers a plain-text string to render, if the equation can't be rendered. This sort of mechanism would potentially be useful for other XML content with high rendering requirements as well.

The biggest reason to come up with a standard method for adding XML (or unknown HTML) to HTML is to allow people to use styles and the DOM with these elements. Currently they can't. Browsers do not apply CSS styles to unknown elements, and unknown container elements are not exposed as containers in the MSIE object model. (The DOM WG decided not to tackle the problem, and only talks about valid HTML 4.0 documents, and XML as a separate entity.)

A potential solution was to write HTML as XML, i.e. with the MIME type text/xml. Then all the XML rules would apply. One problem with this is that some browsers sniff the document irrespective of MIME type and display the content as HTML if it looks like HTML according to some heuristic ([InetSDK], Appendix A). This may include, for example, having a TITLE element anywhere within the first 200 bytes of the document. Thus document providers may have to add a comment long enough to defeat the heuristic.
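The padding workaround can be sketched as follows. This is only an illustration: the 200-byte window is the example figure quoted above, not a documented constant, and the function name pad_against_sniffing and the filler character are hypothetical. The sketch keeps any XML declaration at the very start of the document and inserts a comment long enough that all remaining markup begins at or after the window.

```python
def pad_against_sniffing(document: str, window: int = 200) -> str:
    """Insert a comment so that markup following the XML declaration
    starts at or beyond byte offset `window` (an assumed sniffing window)."""
    head, body = "", document
    # An XML declaration, if present, must remain the first thing in the file.
    if document.startswith("<?xml"):
        end = document.index("?>") + 2
        head, body = document[:end], document[end:]
    opener, closer = "<!-- ", " -->"
    used = len((head + opener + closer).encode("utf-8"))
    # The filler avoids "--", which is not allowed inside an XML comment.
    filler = "x" * max(0, window - used)
    return head + opener + filler + closer + body


page = '<?xml version="1.0"?><html><head><title>T</title></head></html>'
padded = pad_against_sniffing(page)
```

Whether a given browser's heuristic is actually avoided by this padding would need to be checked against that browser.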

Conclusions and Future Work

Including XML by Reference

The first option for using XML in HTML documents is to include it by reference, using <LINK>, <A>, <OBJECT> or perhaps even <IMG>. This markup conforms to existing W3C Recommendations. This gives predictable behaviour across the whole spectrum of HTML user agents, at the cost of managing and accessing the compound document.
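As a rough illustration of inclusion by reference, the sketch below lists the external resources an HTML page points to through the four element types named above. The file names, the REL value, and the helper names (ReferenceCollector, external_references) are hypothetical; the mapping of each element to its URL-carrying attribute follows HTML 4.0.

```python
from html.parser import HTMLParser

# Attribute that carries the URL for each element type named in the text.
URL_ATTRIBUTE = {"link": "href", "a": "href", "object": "data", "img": "src"}


class ReferenceCollector(HTMLParser):
    """Collect (element, URL) pairs for by-reference inclusions."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        wanted = URL_ATTRIBUTE.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.references.append((tag, value))


def external_references(markup):
    collector = ReferenceCollector()
    collector.feed(markup)
    collector.close()
    return collector.references


SAMPLE = """<HTML><HEAD><TITLE>Report</TITLE>
<LINK REL="meta" HREF="report.rdf">
</HEAD><BODY>
<P>See the <A HREF="proof.xml">proof</A>.</P>
<OBJECT DATA="equation.xml" TYPE="text/xml">x squared</OBJECT>
</BODY></HTML>"""

references = external_references(SAMPLE)
```

The sample page itself contains no XML, which is why this approach conforms to existing Recommendations; the cost is that the page and its XML resources must be managed and fetched as separate files.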

Using Attributes to Hide New Idioms

Another option with predictable behaviour is to use tags and attributes only, and avoid character data which will be displayed by deployed software. Strictly speaking, documents enhanced this way do not conform to the HTML 2, 3.2, or 4.0 specifications, but each of those specifications included a note to implementors to ignore unknown attributes.

The XML namespace facility[XML-Names] should be used to manage the risk of name collisions for new attributes and elements. Note that unfortunately, much of the deployed base of user agents will display XML namespace declarations as text.
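The difference between attribute-only markup and markup with character data can be sketched with a crude stand-in for an older user agent: one that skips every tag and attribute and displays whatever character data remains. The NOTE and AUTHOR names are hypothetical, and real browsers of the period differed in detail, so this is an approximation rather than a model of any particular product.

```python
from html.parser import HTMLParser


class DownLevelText(HTMLParser):
    """Rough stand-in for an older user agent: tags and attributes are
    skipped, character data is displayed."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def displayed_text(markup):
    viewer = DownLevelText()
    viewer.feed(markup)
    viewer.close()
    # Collapse runs of white space, as HTML rendering does.
    return " ".join("".join(viewer.chunks).split())


# New information carried only in attributes: nothing extra is displayed.
attributes_only = '<P>Hello <NOTE AUTHOR="Jane Doe"></NOTE>world</P>'
# The same information as element content leaks into the rendered text.
with_content = "<P>Hello <NOTE><AUTHOR>Jane Doe</AUTHOR></NOTE> world</P>"

hidden = displayed_text(attributes_only)
leaked = displayed_text(with_content)
```

Here hidden is "Hello world" while leaked is "Hello Jane Doe world", which is the behaviour the attribute-only style is meant to avoid.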

Using <XML>, an HTML Enhancement

The linking and attributes mechanisms do not satisfy all of the requirements presented at the meeting. It was agreed that an enhancement to HTML to accommodate XML blocks is necessary.

The definition of an XML block is a chunk of well-formed XML that is inside an HTML document. Any elements within the chunk that happen to have the same element types as HTML elements are not considered to be HTML elements. The error-handling as defined in the XML specification applies, i.e. the processor must halt on well-formedness errors.

There were two proposals for this. (Other proposals that were discussed were discovered to be variations of these).

  1. using namespaces, which means the presence of a colon in an element type implies that the contents are well-formed XML
  2. using a specific element type (the discussion centered around XML and XML-BLOCK and eventually we settled for XML)

Using a specific element type has the advantage that the meaning is clear, and that attributes can be added to the element for such things as the MIME type and a link to an external file containing the XML content.

For the XML block case, the group decided, by a vote of 10 for and 1 abstention (none against), to use an element called XML. This must be added to a future version of HTML. The attributes are TYPE for the MIME type and SRC for the URL of the content if it is in an external file. The contents of the XML element are XML. There is an xml PI at the beginning of the XML block that contains all other information that the XML block needs.
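A minimal sketch of how a processor might handle inline XML blocks is given below, assuming the XML element simply wraps the block's content and that blocks do not contain a nested element that is itself called XML. The regular expression and the function name parse_xml_blocks are illustrative only, since the detailed syntax was left to a later Working Draft. Each block is handed to an XML parser, so HTML-looking element types inside it are treated as plain XML, and a well-formedness error stops processing.

```python
import re
import xml.etree.ElementTree as ET

# Matches <XML ...> ... </XML>; assumes no nested element named XML.
XML_BLOCK = re.compile(r"<XML(?:\s[^>]*)?>(.*?)</XML\s*>",
                       re.IGNORECASE | re.DOTALL)


def parse_xml_blocks(markup):
    """Return one parsed tree per inline XML block.

    ET.ParseError propagates on the first well-formedness error, which
    mirrors the 'processor must halt' rule described in the text.
    """
    trees = []
    for match in XML_BLOCK.finditer(markup):
        content = match.group(1).strip()
        if not content:
            # Presumably the content lives at the SRC URL; not fetched here.
            continue
        trees.append(ET.fromstring(content))
    return trees


good = ('<P>Area:</P><XML TYPE="text/xml">'
        "<formula><P>not an HTML paragraph</P></formula></XML>")
bad = "<XML><formula><P>unclosed</formula></XML>"

trees = parse_xml_blocks(good)
try:
    parse_xml_blocks(bad)
    halted = False
except ET.ParseError:
    halted = True
```

In the first document the P inside the block is just a child element of formula; in the second, the missing end tag causes the parser to raise instead of attempting HTML-style recovery.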

Using Script to Hide Content in Older Browsers

Interoperability with the 3.0 generation of browsers is required for successful deployment of RDF, among other applications. This means that the XML block is not a complete solution either.

There are a number of ways in which content can be made to not show up in browsers that don't understand the element.

  1. The XML could be in a separate file, linked to from the HTML document in some way.
  2. The XML could be in the HEAD of the HTML document.
  3. The DTD for the XML fragment could be written in such a way that all content appears as attribute values.
  4. The XML content could be put at the end of the document, which doesn't really hide it but does get it out of the way of the main document content.

Of these, putting the content in the HEAD is the most problematic because of the difficulties for HTML browsers of defining where the HEAD ends.

None of these methods is considered to break HTML or XML, and the participants decided that they should be written up (with the exception of putting content in the HEAD) as the recommended methods for coping with XML where the content should not show up in older browsers.

There are, of course, times when none of these methods are suitable for some reason. The group therefore decided to also figure out which of the many disliked methods was the least undesirable. The choices were OBJECT, APPLET, SCRIPT, and comments.

The proposal to put the XML content inside an OBJECT element was quickly rejected, as it would not work in Netscape Navigator 3.0.

The problem with APPLET is that if the user has applet loading turned off, the content will show. The problem with SCRIPT is that it breaks the currently defined content model of SCRIPT. There were also worries about whether future XML users will use the SCRIPT element themselves, which would not be possible if it were a reserved element. This concern wasn't shared by the entire group. The problem with using comments is that comments are meant to not contain parsed data, and users couldn't put another comment inside the XML content.

The vote (1 per company) was 1 for comments, 1 for APPLET, and 8 for SCRIPT.

Details of the XML block and SCRIPT mechanisms are the subject of a Working Draft in progress.

Using Namespaces, Stylesheets, and the DOM

The discussion of using XML markup in HTML documents such that it would be "significant" to stylesheet and DOM implementations did not reach a clear consensus.

We observed that XML can be modelled using the HTML 4.0 DIV, SPAN, and CLASS markup, which are significant to stylesheet and DOM implementations. Some experience with this style suggested the community would not embrace it, but the discussion was not conclusive.
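One way to picture this modelling is a mechanical mapping from an XML tree to DIV and SPAN elements, with the original element type recorded in CLASS. The choice of DIV for elements that have child elements and SPAN for leaves is an assumption made for this sketch, as are the function names and the sample element types; XML attributes are not carried over.

```python
import xml.etree.ElementTree as ET


def to_div_span(element):
    """Map an XML element onto DIV (has child elements) or SPAN (leaf),
    recording the original element type in the CLASS attribute.
    Attributes of the source element are dropped in this sketch."""
    html_tag = "DIV" if len(element) else "SPAN"
    mapped = ET.Element(html_tag, {"CLASS": element.tag})
    mapped.text = element.text
    mapped.tail = element.tail
    for child in element:
        mapped.append(to_div_span(child))
    return mapped


def model_as_html(xml_text):
    root = to_div_span(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")


modelled = model_as_html(
    "<chapter><title>Intro</title><para>Text</para></chapter>")
```

The result is markup that a stylesheet can address through class selectors such as DIV.chapter, at the price of losing the author's own element names in the source.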

A proposal for a "sprinkles" mechanism is the subject of a Working Draft in progress.

References

[RDF]
Resource Description Framework (RDF) Model and Syntax
W3C Working Draft 16 Feb 1998
Ora Lassila, Ralph R. Swick, eds.
[XML]
Extensible Markup Language (XML) 1.0
W3C Recommendation 10-February-1998
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, eds.
[HTML4]
HTML 4.0 Specification
W3C Recommendation 18-Dec-1997
Dave Raggett, Arnaud Le Hors, Ian Jacobs, eds.
[MathML]
Mathematical Markup Language (MathML) 1.0 Specification
W3C Recommendation 07-April-1998
Patrick Ion, Robert Miner
[Dialects]
HTML Dialects: Internet Media and SGML Document Types
W3C Working Draft 06-Mar-96
Daniel W. Connolly
[InetSDK]
Internet Client SDK, December 19, 1997, Microsoft Corporation
[XML-Names]
Namespaces in XML, W3C Working Draft 27-March-1998
Tim Bray, Dave Hollander, Andrew Layman