XML Document Interpretation

This is a collection of notes from chatting with TimBL and Sandro. It will probably evolve into something genuinely useful.

HTTP and SMTP provide a mechanism for a document source to associate a mime type with transmitted data. This allows downstream processors to interpret the data correctly without inspecting the data and guessing the contents. In a common conventional HTTP scenario, an agent requests a document with a preference for image/*. The document server takes this preference into account when serving, for instance, an image/png document.
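
The image/* scenario above can be sketched in a few lines. This is a minimal illustration, not how any particular server implements negotiation: the function names (parse_accept, matches, specificity, choose_variant) are hypothetical, and the sketch assumes only media ranges and q values matter, ignoring other Accept parameters.

```python
def parse_accept(header):
    """Parse an Accept header into a list of (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        q = 1.0  # q defaults to 1 when the agent does not state it
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_range, q))
    return ranges


def matches(media_range, media_type):
    """True if a media range such as image/* covers a concrete type."""
    r_type, _, r_sub = media_range.partition("/")
    m_type, _, m_sub = media_type.lower().partition("/")
    return r_type in ("*", m_type) and r_sub in ("*", m_sub)


def specificity(media_range):
    """Rank */* below image/* below image/png."""
    r_type, _, r_sub = media_range.partition("/")
    if r_type == "*":
        return 0
    if r_sub == "*":
        return 1
    return 2


def choose_variant(accept_header, available):
    """Return the available type the agent prefers most, or None."""
    ranges = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in available:
        covering = [(specificity(r), q) for r, q in ranges if matches(r, media_type)]
        if not covering:
            continue
        q = max(covering)[1]  # the most specific covering range decides q
        if q > best_q:
            best, best_q = media_type, q
    return best
```

With an agent sending "image/*, text/html;q=0.5" and a server holding text/html and image/png variants, choose_variant selects image/png.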

A growing number of data formats, for instance SVG and MathML, are encoded in XML and have their own specified mime types. The specific mime type serves an agent requesting the image/* or image/svg+xml form of the document. Many XML processors could also act on this document if the mime type text/xml were associated with it. This leads to a conflict in defining the media types for emerging XML data formats. The following mime tree segments apply to such a document:

text/
Data is encoded in us-ascii or utf-8 and could be presented directly to the user without causing undue stomach upset.
text/xml
XML processors may expect to find a well-formed XML document encoded to comply with the rules for the top-level text/ mime tree.
application/xml
Data is a well-formed XML document, but is not meant for textual presentation to the user.
image/
The data is in a pictographic format, suitable for rendering, for instance, as an inline image.
image/svg
The data conforms to W3C's SVG recommendation and should be interpreted as such.

Deciding between text/xml and application/xml

There are no rules beyond encoding restrictions for deciding between text/xml and application/xml; both are registered in RFC 3023. In practice, the majority of the XML [I've seen] satisfies the requirements for text/.

Labeling XML as application/xml conveys the preference that this data not be presented to the user. An application not designed to handle application/xml should default to the application/octet-stream handler. For data with no more precise mime type, the data source may choose text/xml if the data consumer (or intervening proxies) may benefit from treating it as a text document. For instance, an example docbook document starts with

<articleinfo>
<title>XML From Your Palm</title>
<pubdate>11 Oct 2000</pubdate>
<releaseinfo role="meta">
$Id: 06-XML-document-interpretation.html,v 1.22 2001/04/11 04:47:00 eric Exp $
</releaseinfo>

which is certainly helpful to the reader with the naive browser.

the +xml mime type modifier

RFC 3023, section 7, attempts to make XML data available both to generic XML processors and to negotiation schemes for higher-level data formats by appending +xml to the end of the mime type. This introduces another level of mime hierarchy to address the common scenario where there is one more desired level of mime type above XML, e.g. SVG and MathML.

The arguments supporting the Network Working Group's decision on +xml are documented in RFC 3023 Appendix A.
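
As a rough illustration of what the suffix buys a generic processor, the sketch below splits a mime type into its top-level type, subtype, and suffix, and treats anything ending in +xml as XML. The function names split_media_type and is_xml are hypothetical, and the sketch assumes parameters such as charset can simply be ignored.

```python
def split_media_type(media_type):
    """Split 'image/svg+xml; charset=utf-8' into (top_level, subtype, suffix)."""
    essence = media_type.split(";", 1)[0].strip().lower()
    top_level, _, subtype = essence.partition("/")
    base, plus, suffix = subtype.rpartition("+")
    if plus:
        return top_level, base, suffix
    return top_level, subtype, None


def is_xml(media_type):
    """True for text/xml, application/xml, and any type with a +xml suffix."""
    top_level, subtype, suffix = split_media_type(media_type)
    if suffix == "xml":
        return True
    return (top_level, subtype) in {("text", "xml"), ("application", "xml")}
```

A generic XML processor can use is_xml to decide whether to parse the data, while an image-aware agent still sees the image/ top-level type.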

heterogeneous XML data

XML namespaces provide a reliable mechanism for identifying XML data formats. This enables multiple data formats to be embedded in a single XML document. For instance, an XHTML document may come from a document source with a mime type of text/html. The agent processing this document can further distinguish the document as XHTML by encountering a root element identified by the tuple (http://www.w3.org/1999/xhtml, html). The agent may be able to handle other data formats embedded in the document. An agent with the facilities for rendering SVG will know what to do if it encounters the tuple (http://www.w3.org/2000/svg, svg) in the document. MathML may be embedded in XHTML via a similar mechanism.
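
The tuple lookup described above might look like the following sketch, which uses Python's standard xml.etree.ElementTree parser. The HANDLERS table, the function names, and the sample document are illustrative assumptions, and the sketch uses the lowercase local name svg for the SVG root element.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical dispatch table: (namespace, local name) -> format label.
HANDLERS = {
    (XHTML_NS, "html"): "xhtml",
    (SVG_NS, "svg"): "svg",
    (MATHML_NS, "math"): "mathml",
}


def qualified_name(element):
    """Return the (namespace, local name) tuple for an element."""
    tag = element.tag  # ElementTree spells qualified names as {namespace}local
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return None, tag


def find_formats(document_text):
    """List, in document order, the known formats found in the document."""
    root = ET.fromstring(document_text)
    found = []
    for element in root.iter():  # iter() visits the root first
        fmt = HANDLERS.get(qualified_name(element))
        if fmt is not None:
            found.append(fmt)
    return found


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>A circle and a variable:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
      <circle cx="5" cy="5" r="4"/>
    </svg>
    <math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math>
  </body>
</html>"""
```

Calling find_formats(SAMPLE) reports the XHTML root followed by the embedded SVG and MathML fragments; a real agent would hand each subtree to a renderer rather than collect labels.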

all text/xml model

An alternative to the +xml mime types is to assert that all data formats with an XML encoding use text/xml for that encoding. The document's root node then clarifies the data format. Naturally, this leaves the more precise data format unavailable to metadata queries and content negotiation.
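
Under this model a consumer has to start parsing before it knows what it holds. The sketch below reads only as far as the root element and maps it to a format label; ROOT_TO_FORMAT and sniff_format are hypothetical names, and the sketch assumes the body is well-formed XML delivered as bytes.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical table of root elements, in ElementTree's {namespace}local form.
ROOT_TO_FORMAT = {
    "{http://www.w3.org/1999/xhtml}html": "XHTML",
    "{http://www.w3.org/2000/svg}svg": "SVG",
    "{http://www.w3.org/1998/Math/MathML}math": "MathML",
}


def sniff_format(body):
    """Return the data format named by the root element, or None if unknown."""
    for _event, element in ET.iterparse(io.BytesIO(body), events=("start",)):
        # The first "start" event is always the root element.
        return ROOT_TO_FORMAT.get(element.tag)
    return None
```

Nothing in the transmitted mime type carries this answer, which is why it stays invisible to content negotiation.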

One deciding factor in whether to use special mime types has been whether the media introduces media-specific fragment identifiers (exhibit A and exhibit B). Another has, naturally, been compatibility with the application category in the registered tree (exhibit C).

current XHTML technology

The two most popular browsers leave content providers with a choice of supporting one or the other, but not both. Here is the behavior for different content types:

as a table

browser | mime type | behaviour
Netscape | text/xml | dispatches on namespace; for example, renders MathML upon finding a (http://www.w3.org/1998/Math/MathML, math) tuple.
Netscape | text/html | does not invoke the namespace handler. MathML markup is rendered as unrecognized tags.
IE | text/xml | will render XHTML delivered as text/xml if it has a <?xml-stylesheet ""?>
IE | text/html | has MathML tags patched into the html machine.

as a dl of dls

Netscape
text/xml
dispatches on namespace; for example, renders MathML upon finding a (http://www.w3.org/1998/Math/MathML, math) tuple.
text/html
does not invoke the namespace handler. MathML markup is rendered as unrecognized tags.
IE
text/xml
will render XHTML delivered as text/xml if it has a <?xml-stylesheet ""?>
text/html
has MathML tags patched into the html machine.

related reading

rfc2376 - XML Media Types
proposes text/xml and application/xml media types
rfc3023
addendum to XML Media Types with these differences: (1) the addition of text/xml-external-parsed-entity, application/xml-external-parsed-entity, and application/xml-dtd, (2) the +xml suffix convention (which also updates the RFC 2048 registration process), and (3) the discussion of "utf-16le" and "utf-16be".
Namespaces in XML
simple method for qualifying element and attribute names used in XML.
SVG document fragment
inserting an SVG document fragment in an XML document
Embedding MathML in other Documents
inserting a MathML document fragment in an XML document
HTTP/1.1 accept header
chapter from the HTTP 1.1 specification describing the use of Accept to help the document server return the most appropriate resource.
HTTP/1.1 content negotiation
more abstract description of enumerable negotiation scenarios.

References

Referenced texts that were not available by http URL:

ASCII
"US-ASCII. Coded Character Set -- 7-Bit American Standard Code for Information Interchange", ANSI X3.4-1986, 1986.
ISO8859
"ISO-8859. International Standard -- Information Processing -- 8-bit Single-Byte Coded Graphic Character Sets -- Part 1: Latin alphabet No. 1, ISO-8859-1:1987", 1987.

notes

Binary media types are tight shoes for SVG (text, xml, image, svg), even tighter for an XHTML document with MathML and SVG embedded. There will be growing discomfort and lost interoperability opportunities as long as there is no n-ary document description in the (conventional) metadata. One route out is to make unconventional metadata conventional (GET-META and accept recombination) or use an extended header: 13-Alternative-Content-Type

The mime tree is already overloaded with the non-IETF subtrees: vendor (*/vnd.*) and personal (*/prs.*). */xml.* (for instance, image/xml.svg) could collide with these subtrees.



Eric Prud'hommeaux

CVS revision: $Id: 06-XML-document-interpretation.html,v 1.22 2001/04/11 04:47:00 eric Exp $
Last modified: Wed Apr 11 00:37:47 EDT 2001