XML Document Interpretation

This is a collection of notes from chatting with TimBL and Sandro. It will probably evolve into something genuinely useful.

HTTP and SMTP provide a mechanism for a document source to associate a mime type with transmitted data. This allows downstream processors to interpret the data correctly without inspecting the data and guessing the contents. In a common conventional HTTP scenario, an agent requests a document with a preference for image/*. The document server takes this preference into account when serving, for instance, an image/png document.
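The negotiation scenario above can be sketched roughly as follows. This is a minimal illustration, not a complete HTTP implementation: the function names (parse_accept, matches, specificity, choose_representation) are hypothetical, and only the q parameter of the Accept header is handled.

```python
def parse_accept(header):
    """Parse an Accept header into a list of (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_range, q))
    return ranges


def matches(media_range, media_type):
    """True if media_type (image/png) falls under media_range (image/*)."""
    r_type, _, r_sub = media_range.partition("/")
    m_type, _, m_sub = media_type.lower().partition("/")
    return r_type in ("*", m_type) and r_sub in ("*", m_sub)


def specificity(media_range):
    """Rank */* below type/* below type/subtype."""
    r_type, _, r_sub = media_range.partition("/")
    if r_type == "*":
        return 0
    if r_sub == "*":
        return 1
    return 2


def choose_representation(accept_header, available):
    """Return the available media type the agent prefers most, or None."""
    ranges = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in available:
        candidates = [(specificity(r), q) for r, q in ranges
                      if matches(r, media_type)]
        if not candidates:
            continue
        # The most specific matching range decides the q value.
        q = max(candidates)[1]
        if q > best_q:
            best, best_q = media_type, q
    return best
```

For instance, choose_representation("image/*, text/html;q=0.5", ["text/html", "image/png"]) selects image/png, which is the case described in the paragraph above.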

A growing number of data formats, for instance, SVG and MathML, are encoded in XML with specified mime types. This mime type is used by an agent requesting image/* or image/svg+xml forms of the document. Many XML processors could act on this document if the mime type text/xml was associated with it. This leads to a conflict in defining the media types for emerging XML data formats. The following mime tree segments apply to such a document:

text/: Data is encoded in us-ascii or utf-8 and could be presented directly to the user without causing undue stomach upset.
text/xml: XML processors may expect to find a well-formed XML document encoded to comply with the rules for the top-level text/ mime tree.
application/xml: Data is a well-formed XML document, but not meant for textual presentation to the user.
image/: The data is a pictographic format, suitable for rendering, for instance, as an inline image.
image/svg: The data conforms to W3C's SVG recommendation and should be interpreted as such.

Deciding between `text/xml` and `application/xml`

There are no rules beyond encoding restrictions for deciding between text/xml and application/xml; both are registered in RFC 3023. In practice, the majority of the XML [I've seen] satisfies the requirements for text/.

Labeling XML as application/xml conveys the preference that this data not be presented to the user. An application not designed to handle application/xml should default to the application/octet-stream handler. For data with no more precise mime type, the data source may choose text/xml if the data consumer (or intervening proxies) may benefit from treating it as a text document. For instance, an example docbook document starts with

<articleinfo>
<title>XML From Your Palm</title>
<pubdate>11 Oct 2000</pubdate>
<releaseinfo role="meta">
$Id: 06-XML-document-interpretation.html,v 1.22 2001/04/11 04:47:00 eric Exp $
</releaseinfo>

which is certainly helpful to the reader with the naive browser.
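The fallback behaviour described above might look like the following sketch. The select_handler function and the handler table are hypothetical, and treating an unknown text/ subtype as text/plain is an assumption about the naive browser rather than something this note specifies.

```python
def select_handler(media_type, handlers):
    """Pick a handler for media_type, falling back when it is unknown.

    handlers maps media types to handlers and is assumed to contain
    an entry for application/octet-stream.
    """
    essence = media_type.split(";")[0].strip().lower()
    if essence in handlers:
        return handlers[essence]
    # Assumption: an unknown text/ subtype can still be shown as plain text.
    top_level = essence.partition("/")[0]
    if top_level == "text" and "text/plain" in handlers:
        return handlers["text/plain"]
    # Everything else is treated as opaque data.
    return handlers["application/octet-stream"]
```

With a table containing only text/plain and application/octet-stream, a text/xml document reaches the plain-text handler and its markup is shown to the reader, while the same bytes labeled application/xml go to the octet-stream handler.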

the `+xml` mime type modifier

RFC 3023, section 7 attempts to make XML data available to generic XML processors as well as negotiation schemes for higher level data formats by appending +xml to the end of the mime type. This introduces another level of mime hierarchy to address the common scenario where there is one more desired level of mime type above XML, e.g. SVG and MathML.

The arguments supporting the Network Working Group's decision on +xml are documented in RFC 3023 Appendix A.
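A generic XML processor could apply the +xml convention with a check along these lines. The function name is_xml_media_type is hypothetical, and the sketch ignores media type parameters such as charset.

```python
def is_xml_media_type(media_type):
    """True if the media type is text/xml, application/xml, or ends in +xml."""
    essence = media_type.split(";")[0].strip().lower()
    if essence in ("text/xml", "application/xml"):
        return True
    subtype = essence.partition("/")[2]
    # Require something before the suffix, so a bare "+xml" is rejected.
    return subtype.endswith("+xml") and len(subtype) > len("+xml")
```

An agent negotiating for image/* still sees image/svg+xml as an image type, while a generic XML processor can recognize the same label as XML without knowing anything about SVG.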

heterogeneous XML data

XML namespaces provide a reliable mechanism for identifying XML data formats. This enables multiple data formats to be embedded in a single XML document. For instance, an XHTML document may come from a document source with a mime type of text/html. The agent processing this document can further distinguish the document as XHTML by encountering a root element identified by the tuple (http://www.w3.org/1999/xhtml, html). The agent may be able to handle other data formats embedded in the document. An agent with the facilities for rendering SVG will know what to do if it encounters the tuple (http://www.w3.org/2000/svg, svg) in the document. MathML may be embedded in XHTML via a similar mechanism.
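Recognizing formats by (namespace, element name) tuple can be sketched with Python's standard xml.etree.ElementTree module. The KNOWN_FORMATS table and the function names are hypothetical; a real agent would invoke a renderer for each format rather than just collecting names.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# (namespace, local name) tuples this agent knows how to handle.
KNOWN_FORMATS = {
    (XHTML_NS, "html"): "XHTML",
    (SVG_NS, "svg"): "SVG",
    (MATHML_NS, "math"): "MathML",
}


def qualified_name(element):
    """Split ElementTree's '{namespace}local' tag into a tuple."""
    tag = element.tag
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return (namespace, local)
    return ("", tag)


def embedded_formats(xml_text, known=KNOWN_FORMATS):
    """List the known formats found in the document, in document order."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():  # the root element is visited first
        name = known.get(qualified_name(element))
        if name and name not in found:
            found.append(name)
    return found
```

Applied to an XHTML document that contains an svg element in the SVG namespace, embedded_formats reports XHTML first (from the root element) and then SVG.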

all `text/xml` model

An alternative to the +xml mime types is to assert that all data formats with an XML encoding use text/xml for that encoding. The document's root node then clarifies the data format. Naturally, this leaves the more precise data format unavailable to metadata queries and content negotiation.
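The cost of this model can be illustrated with a small sketch: when the label is text/xml, the precise format is only known once the body is available. The names ROOT_FORMATS, format_from_root, and precise_format are hypothetical, and the sketch uses Python's standard xml.etree.ElementTree module.

```python
import io
import xml.etree.ElementTree as ET

ROOT_FORMATS = {
    ("http://www.w3.org/1999/xhtml", "html"): "XHTML",
    ("http://www.w3.org/2000/svg", "svg"): "SVG",
    ("http://www.w3.org/1998/Math/MathML", "math"): "MathML",
}


def format_from_root(body):
    """Identify the data format from the root element of an XML body (bytes)."""
    for _event, element in ET.iterparse(io.BytesIO(body), events=("start",)):
        tag = element.tag
        namespace, local = "", tag
        if tag.startswith("{"):
            namespace, _, local = tag[1:].partition("}")
        # The first start event is the root element; stop there.
        return ROOT_FORMATS.get((namespace, local), "unknown XML")
    return "unknown XML"


def precise_format(media_type, body=None):
    """Return the most precise format knowable from the label, then the body."""
    essence = media_type.split(";")[0].strip().lower()
    if essence not in ("text/xml", "application/xml"):
        return essence  # the label alone is precise, e.g. image/svg+xml
    if body is None:
        return None  # metadata-only query: the format cannot be determined
    return format_from_root(body)
```

A metadata query that sees only text/xml gets None here, whereas the image/svg+xml label answers the question without fetching or parsing anything.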

One decision point in choosing to use special mime types has been whether the medium introduces media-specific fragment identifiers (exhibit A and exhibit B). Another has, naturally, been compatibility with the application category in the registered tree (exhibit C).

current XHTML technology

The two most popular browsers leave content providers with a choice of supporting one or the other, but not both. Here is the behavior for different content types:

as a table

browser	mime type	behaviour
Netscape	text/xml	dispatches on namespace; for example, renders MathML upon finding a `(http://www.w3.org/1998/Math/MathML, math)` tuple.
Netscape	text/html	does not invoke the namespace handler. MathML markup is rendered as un-understood tags.
IE	text/xml	will render XHTML delivered as text/xml if it has a <?xml-stylesheet ""?>
IE	text/html	has MathML tags patched into html machine.

as a dl of dls

Netscape

text/xml: dispatches on namespace; for example, renders MathML upon finding a (http://www.w3.org/1998/Math/MathML, math) tuple.
text/html: does not invoke the namespace handler. MathML markup is rendered as un-understood tags.

IE

text/xml: will render XHTML delivered as text/xml if it has a <?xml-stylesheet ""?>
text/html: has MathML tags patched into html machine.

References

Referenced texts that were not available by http URL:

ASCII: "US-ASCII. Coded Character Set -- 7-Bit American Standard Code for Information Interchange", ANSI X3.4-1986, 1986.
ISO8859: "ISO-8859. International Standard -- Information Processing -- 8-bit Single-Byte Coded Graphic Character Sets -- Part 1: Latin alphabet No. 1, ISO-8859-1:1987", 1987.

notes

Binary media types are tight shoes for SVG (text, xml, image, svg), even tighter for an XHTML document with MathML and SVG embedded. There will be growing discomforts and lost interoperability opportunities as long as there is no n-ary document description in the (conventional) metadata. One route out is to make unconventional metadata conventional (GET-META and accept recombination) or use an extended header: 13-Alternative-Content-Type

The mime tree is already overloaded with the non-IETF subtrees: vendor (*/vnd.*) and personal (*/prs.*). */xml.* (for instance, image/xml.svg) could collide with these subtrees.

Eric Prud'hommeaux

CVS revision: $Id: 06-XML-document-interpretation.html,v 1.22 2001/04/11 04:47:00 eric Exp $
Last modified: Wed Apr 11 00:37:47 EDT 2001