Evolution of the Web: MIME

Editor's Draft 24 December 2011; see Overview for copyright, caveats, etc.

This section lays out the specific problems with MIME types and charsets, with reference to the previous sections on identifiers in general and on registries. It is also the place to discuss sniffing and the issues around it, the web as reached from file systems and non-HTTP sources, and alternatives to using MIME for discovering language. Perhaps "sniffing" needs a section of its own.

On the web, two important protocol elements whose values are identifiers drawn from a registry come from MIME (the Multipurpose Internet Mail Extensions set of specifications): the "Internet Media Type" and the "charset".

MIME: a framework for transmitting content within protocols
Internet Media Type: A protocol element used in MIME as an identifier of languages. There is an Internet Media Type registry; each entry includes several fields, including a pointer to a specification of the language.
content-type: A protocol element (used in the HTTP protocol, email protocols and many others) which uses the Internet Media Type protocol element, and (in some cases) additional parameters associated with that.
charset: A protocol element (used in many Internet protocols and languages, including as a parameter of the content-type in HTTP and within XML and HTML) which is an identifier for a set of characters (scripts) and their encoding.
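
As a rough illustration of how a content-type value carries both an Internet Media Type and a charset parameter, the following Python sketch splits a header value using the standard library's email package. The function name parse_content_type and the sample header values are assumptions made for this example; they do not come from any specification.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() lower-cases the type/subtype pair;
    # get_content_charset() returns the charset parameter lower-cased,
    # or None when the parameter is absent.
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/html; charset=UTF-8"))
print(parse_content_type("application/json"))
```

The second call shows the common case where no charset parameter is present and the recipient has to fall back on defaults or on the content itself.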

The contexts of email and the Web are sufficiently different that their requirements for registration, as well as the practices of deployed agents that use MIME, have diverged. The result is a mismatch between the properties each context wants from the Internet Media Type protocol element.

The typical use pattern of email is that the transmission of data is unanticipated, and often between parties where the sender has no knowledge of the capabilities of the recipient. The typical use pattern of the web is that data is requested explicitly, and often much is known about the requirements and expected content for a retrieval.

Where are these used?

HTTP tagging of results, within HTML for inline content (text/css), the HTTP Accept header, the Accept-Charset header, ...

Other ways of determining language (without using content-type)

Often it is possible to determine the language of data, with relatively high accuracy, by examining the data itself. In some cases on the web (retrieval by FTP or file system access) there is no independent channel for communicating the content-type, which leaves only other indicators and sniffing.

file extensions: A common practice in many systems was to use the end of the name of a file in the file system (the "file extension") to identify the type of the file. This practice has now spread to other systems.
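
A minimal sketch of extension-based guessing, using Python's standard mimetypes module. The helper name guess_from_extension and the file names are assumptions for illustration. Note that the lookup consults only the name, never the content.

```python
import mimetypes


def guess_from_extension(filename):
    """Return the media type suggested by the file extension, or None."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


print(guess_from_extension("index.html"))
print(guess_from_extension("README"))
```

A name with no extension yields no guess at all, which is one reason extension-based identification is usually combined with other indicators.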

sniffing: In many contexts, language can be guessed by looking for some unique string, number or pattern that appears only in files of that language. Where this was a unique number, it was called a "magic number", although the concept has since been extended to textual patterns. In some cases, sniffing is employed to override a (syntactically correct) content-type label, because of previous experience with mislabeled content.
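
The idea of a magic number can be sketched as a prefix match against a small table of byte signatures. The table below covers only a handful of widely known formats, and the override policy in effective_type is a hypothetical simplification, not the behavior of any particular browser or specification.

```python
# Leading byte signatures ("magic numbers") for a few well-known formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff(data):
    """Return the media type whose signature prefixes data, else None."""
    for signature, media_type in SIGNATURES:
        if data.startswith(signature):
            return media_type
    return None


def effective_type(declared, data):
    """Hypothetical policy: a recognized signature overrides the label."""
    return sniff(data) or declared


print(effective_type("text/plain", b"%PDF-1.4\n"))
print(effective_type("text/plain", b"hello, world"))
```

The first call shows a label being overridden by the content; the second shows the label being kept because no signature matches.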

Information about these other ways of determining language was gathered for the Internet Media Type registry; those registering types are encouraged to also describe 'magic numbers', Mac file type, and common file extensions. However, since there was no formal use of that information, its quality in the registry is haphazard.

In some applications, implementations of some languages and protocols have interpreted identifiers in ways inconsistent with their registry entries, to the point where the specifications of those languages have needed to provide for "willful violations" of the registry entries (which cannot change if they are used differently by other languages and protocols that use the same protocol element).

Polyglot, Multiview, Specializations

(Reasons why the 'MIME type' model doesn't work for some important cases. One of the problems with MIME types and sniffing is that the same content might properly be considered to be in two different languages.)

There are some interesting cases where the same content can be viewed as being in multiple languages:

Polyglot: A 'polyglot' document is data that can be treated as being in more than one language, where the meaning of the data is not significantly different in the two languages. Designing new languages so that there are significant use cases for polyglot content is part of a transition strategy: it allows content providers (senders) to manage, produce, store and deliver the same data under two different labels, and have it work equivalently with two different kinds of implementations (one that knows one language and one that knows the other). This use case was part of the transition strategy from HTML to an XML-based XHTML, and was also a way for a single service to offer both HTML-based and XML-based processing (e.g., the same content useful for news articles and Web pages).
Multiview: This use case seems similar but it's quite different. In this case, the same data has very different meaning when served as two different content-types, but that difference is intentional; for example, the same data served as text/html is a document, and served as an RDFa type is some specific data.
Specialization: In these cases, there is a class-subclass hierarchy of languages: the same document is both a general XML document and an instance of a specific +xml type; a general JSON data structure and an instance of a +json type; or a ZIP archive that is used for a particular purpose with a manifest. DNG and TIFF are another such pair.
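
The class-subclass relationship in the Specialization case can be sketched by reading the "+suffix" of a media type as a pointer to the more general language. The SUFFIX_BASE table and the function name generalizations are assumptions for this example; they cover only the XML, JSON and ZIP cases mentioned above.

```python
# Assumed mapping from a "+suffix" to the general language it specializes.
SUFFIX_BASE = {
    "xml": "application/xml",
    "json": "application/json",
    "zip": "application/zip",
}


def generalizations(media_type):
    """List the type itself followed by any more general type it implies."""
    result = [media_type]
    subtype = media_type.partition("/")[2]
    _, plus, suffix = subtype.rpartition("+")
    if plus and suffix in SUFFIX_BASE:
        result.append(SUFFIX_BASE[suffix])
    return result


print(generalizations("image/svg+xml"))
print(generalizations("text/html"))
```

A single label can thus name two languages at once, which is exactly the situation a strictly partitioned set of types has trouble expressing.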

(Additional considerations... MIME assumes the sets of language uses are partitioned. PNG and its use in Fireworks. Google re-using JPEG for a new Google image format.)

Compound documents

(Languages sometimes allow other languages to be mixed in. The MIME type names only the 'top level'; how are nesting and mix-ins indicated? How do you name HTML+RDFa?)

Fragment identifiers

The Web added the ability to address part of a piece of content, rather than the whole, by adding a 'fragment identifier' to the URL that addressed the data. This made sense for the original Web with just HTML, but how would it apply to other content? The URL spec glibly noted that "the definition of the fragment identifier meaning depends on the Internet Media Type", but unfortunately few of the Internet Media Type definitions included this information, and practices diverged greatly.

If the interpretation of fragment identifiers depends on the MIME type, though, it becomes hard to use fragment identifiers when content negotiation is wanted, because the same URL may then yield representations of different types, each with its own reading of the fragment.
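
The tension with content negotiation can be made concrete with a small sketch. It splits a URL with the standard urllib.parse.urldefrag function and then looks up how the fragment would be read for a given media type. The FRAGMENT_RULES table is a simplified, hypothetical summary written for this example, and the URL is a placeholder.

```python
from urllib.parse import urldefrag

# Hypothetical, simplified per-type rules; real definitions, where they
# exist, live in each media type's own specification.
FRAGMENT_RULES = {
    "text/html": "element whose id or name equals the fragment",
    "application/pdf": "open parameters such as page=N",
}


def interpret(url, media_type):
    """Return (URL without fragment, fragment, rule for this media type)."""
    base, fragment = urldefrag(url)
    rule = FRAGMENT_RULES.get(media_type, "undefined for this media type")
    return base, fragment, rule


url = "http://example.org/report#page=3"
for negotiated in ("text/html", "application/pdf", "image/png"):
    print(interpret(url, negotiated))
```

The same fragment names an element in one representation, a page in another, and nothing defined in a third, depending only on which type the server chose to send.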

Sniffing security uses scriptability info

If the Internet Media Type registry is more explicit about which kinds of content allow what kind of scriptability access, then the specifications for sniffing can reference the registry to determine what kinds of sniffing constitute a 'privilege upgrade'.

Note that any sniffing can be a privilege upgrade if the recipient is buggy. Bugs can be fixed, however, whereas violations built into specifications are a lasting problem.
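
A minimal sketch of the check described above, assuming the registry exposed scriptability as a simple yes/no property per type. The SCRIPTABLE set is an assumption chosen for illustration; the registry carries no such field in this form.

```python
# Assumption for illustration: media types whose content can run script.
SCRIPTABLE = {"text/html", "application/xhtml+xml", "image/svg+xml"}


def is_privilege_upgrade(declared, sniffed):
    """True when sniffing turns non-scriptable content into scriptable."""
    return sniffed in SCRIPTABLE and declared not in SCRIPTABLE


print(is_privilege_upgrade("text/plain", "text/html"))
print(is_privilege_upgrade("text/html", "image/png"))
```

Under this model, treating content labeled text/plain as text/html is an upgrade, while treating text/html as an image is not.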

MIME Registry Process Issues

The Internet Media Types registry does not correspond to the media types in actual use:

Lots of file types aren't registered (no entry in IANA), even for file types that have been deployed for over a decade. For example, "image/svg+xml", "image/jp2" and "video/mp4" are not registered.
For many file types that are registered, the registration is incomplete or incorrect (the people doing the registration didn't understand the 'magic number' or other fields).
The actual content deployed or created by deployed software doesn't match the registration.

Findings

(Is there anything left to 'find' here?).

Use of references in MIME to point to (evolving) specifications?
"Happy IANA": make registration easier -- we support it
"Update MIME spec": review and make sure it meets web's needs
"Fix sniffing"
Move "willful violations" out of specification and into registry or shadow registry