This section lays out the specific problems with MIME types and charsets, with reference to the previous sections on identifiers in general and on registries. It is also a place to discuss sniffing and the issues around it, use of the web from file systems and non-HTTP sources, and alternatives to using MIME for discovering language. Perhaps "sniffing" needs a section of its own.
On the web, two important protocol elements whose values are identifiers drawn from a registry come from MIME (the Multipurpose Internet Mail Extensions set of specifications): the "Internet Media Type" and the "charset".
The contexts of email and the Web are sufficiently different that the differing requirements for email registration and Web registration, together with practices in the deployment of agents that use MIME, have led to some mismatch between the properties that email and the Web each want from the Internet Media Type protocol element.
The typical use pattern of email is that the transmission of data is unanticipated, and often between parties where the sender has no knowledge of the capabilities of the recipient. The typical use pattern of the web is that data is requested explicitly, and often much is known about the requirements and expected content for a retrieval.
HTTP tagging results, within HTML for inline (text/css), HTTP Accept header, Accept-charset header, ...
Often it is possible to determine the language of data, with relatively high accuracy, by examining the data itself. In some cases on the web (retrieval by FTP or file system access) there is no independent channel for communicating the content type, so other indicators and sniffing are used.
file extensions: A common practice in many systems was to use the end of a file's name in the file system (the "file extension") to identify the type of the file. This practice has now extended to other systems.
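As a minimal sketch of the file-extension practice, the following maps the end of a file name to a media type without looking at the content. The table and the function name guess_type_from_name are hypothetical and deliberately tiny; real systems use much larger, platform-specific tables.

```python
import os

# Hypothetical, deliberately small extension table for illustration only.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".css": "text/css",
    ".png": "image/png",
    ".txt": "text/plain",
}

def guess_type_from_name(filename, default="application/octet-stream"):
    """Guess a media type from the file extension alone, ignoring the content."""
    _, ext = os.path.splitext(filename)
    # Extensions are compared case-insensitively; unknown ones fall back to the default.
    return EXTENSION_TYPES.get(ext.lower(), default)
```

Note that only the last extension is consulted, so a name such as archive.tar.gz is judged by ".gz" alone, which is one reason this indicator is weak.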
sniffing: in many contexts, language can be guessed by looking for some unique string, number or pattern, which only appears in files of that language. In circumstances where this was a unique number, it was called a "magic number", although this concept has been extended to other textual patterns. In some cases, sniffing will be employed to override a (syntactically correct) content-type label, because of previous experience with mis-labeled content.
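The following sketch illustrates magic-number sniffing and the override of a declared label described above. The signature table is a small illustrative sample, not a normative sniffing algorithm, and the function names sniff and effective_type are assumptions made for this example.

```python
# Illustrative leading-byte signatures; not a complete or normative table.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]

def sniff(data):
    """Return a media type if the leading bytes match a known signature, else None."""
    for signature, media_type in MAGIC_NUMBERS:
        if data.startswith(signature):
            return media_type
    return None

def effective_type(declared, data):
    """Let a sniffed type override the declared label; otherwise keep the label."""
    sniffed = sniff(data)
    return sniffed if sniffed is not None else declared
```

In this sketch a body that begins with a GIF signature but is labeled text/plain would be treated as image/gif, which is exactly the kind of override that makes the label no longer authoritative.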
Information about these other ways of determining language was gathered for the Internet Media Type registry; those registering types are encouraged to also describe 'magic numbers', Mac file type, and common file extensions. However, since there was no formal use of that information, its quality in the Internet Media Type registry is haphazard.
In some applications, implementations of some languages and protocols have interpreted identifiers in ways inconsistent with their registry entries, to the point where the specifications of those languages have needed to provide for "willful violations" of the registry entries (which cannot change if they are used differently by other languages and protocols that use the same protocol element).
(Reasons why the 'MIME type' model doesn't work for some important cases. One of the problems with MIME types and sniffing is that the same content might properly be considered to be in two different languages.)
There are some interesting cases where the same content can be viewed as being in multiple languages:
(Additional considerations... MIME assumes the sets of language uses are partitioned. PNG and its use in Fireworks. Google re-using JPEG for a new Google image format.)
(Languages sometimes allow other languages to be mixed in. The MIME type names only the 'top level'; how do nesting and mix-ins happen? How do you name HTML+RDFa?)
The Web added the notion of addressing part of a piece of content, rather than the whole, by adding a 'fragment identifier' to the URL that addressed the data. This made sense for the original Web with just HTML, but how would it apply to other content? The URL spec glibly noted that "the definition of the fragment identifier meaning depends on the Internet Media Type", but unfortunately, few of the Internet Media Type definitions included this information, and practices diverged greatly.
If the interpretation of fragment identifiers depends on the MIME type, though, fragment identifiers become hard to use when content negotiation is wanted: the same URL may yield representations of different types, each of which may interpret the fragment differently.
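A small sketch may make the dependence concrete. It splits the fragment from a URL with the standard urllib.parse.urldefrag and then looks up an interpretation by media type. The rule table and the function interpret_fragment are hypothetical simplifications for illustration, not the definitions given in any registration.

```python
from urllib.parse import urldefrag

# Assumed, simplified interpretations for illustration only.
FRAGMENT_RULES = {
    "text/html": "element whose id is {frag!r}",
    "application/pdf": "viewer parameters {frag!r}",
}

def interpret_fragment(url, media_type):
    """Split off the fragment and describe what it means for the given media type."""
    resource_url, frag = urldefrag(url)
    if not frag:
        return resource_url, None
    rule = FRAGMENT_RULES.get(media_type)
    # Types with no rule in the table have no defined fragment meaning here.
    meaning = rule.format(frag=frag) if rule else "undefined for " + media_type
    return resource_url, meaning
```

Calling interpret_fragment on one URL with two different media types gives two different meanings for the same fragment, which is the difficulty that arises when a negotiated resource can be served in more than one type.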
If the Internet Media Type registry is more explicit about which kinds of content allow what kind of scriptability access, then the specifications for sniffing can reference the Internet Media Type registry to determine what kinds of sniffing constitute a 'privilege upgrade'.
Note that any sniffing can be a privilege upgrade if the recipient is buggy. Bugs can be fixed, but spec violations are a problem.
The Internet Media Types registry does not correspond to the media types in actual use:
(Is there anything left to 'find' here?).