ContentTypeIssues

From HTML WG Wiki

Content Type Issues

Many problems persist regarding the determination of content type. On the one hand, the ability to set authoritative content types allows authors greater flexibility in how users consume their content. On the other hand, persistent errors in content type designation lead HTML UAs to make extra efforts to determine content type even when the content type of a file is already designated by the server or another mechanism. These extra measures handle cases where authors incorrectly designate content types; however, they also undermine the ability of authors to designate unconventional content types on files for special handling by UAs.

The question, then: is there any way for HTML, through document and UA norms, to address the problem of mislabeled content types while at the same time giving authors more control over the way content is handled?

State: Unresolved

Quick Links for This Page

Use cases | Research results | Further research | Proposed solutions | Discussion | Examples | E-mails

Use cases

  • An author creates a valid and well-formed XHTML 1.0 document and wants to display it in its raw text form in an IFRAME or OBJECT element. The author sets the file’s filename extension to '.txt' so that the HTTP server will map that filename extension to a content-type header of 'text/plain'. Note that this is an unconventional content type for an XHTML file, so not treating the content-type header as authoritative may lead the UA to mishandle the content.
  • A user wants to view a web page that embeds or links to several sub-resources. One or more of the sub-resources is misidentified by the server as 'text/html' when it really should be designated 'application/xhtml+xml'. With the wrong type designation, the browser cannot display the content to the user in a way the user can understand.

Research Results

Current browser behavior

For each cell in the following table, indicate whether the UA relies on the content-type header (head), the filename extension (ext), the @type attribute (type), or sniffed content (sniff) to determine the file's type. If the UA's handling differs based on the server response header, list each case in a separate row: <none> (i.e., no content-type header at all), text/plain, text/html, application/octet-stream, or <any other> (i.e., testing for any content sniffing, filename extension overrides, or other special handling).

Cell value: 'head', 'ext', 'type', or 'sniff'.

Resource | Trident | Gecko | WebKit | Presto | iCab | KHTML
Main Resource (HTML)
Main Resource (XHTML)
Main Resource (CSS)
Main Resource (XML/)
Main Resource (image)
Main Resource (video)
Main Resource (audio)
LINK@href (style sheet)
LINK@href (other)
AREA@href
A@href
SCRIPT@src
OBJECT@data
IMG@src
@longdesc
@cite

Further research

Understand filename extension errors

Filename extension errors may be a rare source of the problem. Problems could occur when:

  • the author's OS environment handles a filename extension as a different type than the server maps it to, or when there is no clear private or registered MIME type for a filename extension. This may be less a filename extension error than another filename extension mapping error that arises once the file reaches the server.
  • some operating systems try to mask the complexities of filenames from users. When the user sets a filename extension the operating system is unaware of, the operating system may simply hide the actual filename extension and mislead the user into thinking the filename extension has changed.
  • some operating systems use other attributes of a file to determine its type handling. For example, Mac OS uses separate filesystem attributes (TypeCode and CreatorCode) to determine type handling that take precedence over filename extensions. BeOS used a separate filesystem attribute to store MIME types to determine type handling for a file. In either case, a user can freely use a file with an unrecognized or inappropriate filename extension without any feedback from their operating environment.

Understand filename extension server mapping errors

  • Some filename extensions map to different MIME types in different circles (e.g., .rpm can be either a RealPlayer media file or a Red Hat Package Manager file).
  • A long-time bug in Apache leads the server to map all unmapped filename extensions (or all unknown file types in general) to 'text/plain'. A similar long-time bug in Microsoft IIS leads the server to map all unmapped filename extensions (or all unknown file types in general) to 'application/octet-stream'.
  • Many application server environments default to using filename extensions that map to 'text/html' or map HTTP responses to 'text/html' even when the content is not 'text/html'. Application server developers may not be aware of errors because testing with most browsers will mask their mistakes.
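The mapping errors described above can be sketched in a few lines. This is only an illustration, not the behavior of any particular server: the names `header_for`, `OLD_MAP`, and `SERVER_DEFAULTS`, and the "apache-like" and "iis-like" labels, are hypothetical, and the extension map is deliberately tiny to mimic a server configured before newer formats appeared.

```python
import posixpath

# Hypothetical fallback types for unmapped extensions, following the two
# default behaviors described in the list above.
SERVER_DEFAULTS = {
    "apache-like": "text/plain",
    "iis-like": "application/octet-stream",
}

# Illustrative extension map, assumed to predate '.xhtml'.
OLD_MAP = {
    ".html": "text/html",
    ".txt": "text/plain",
    ".png": "image/png",
}


def header_for(path, extension_map, server="apache-like"):
    """Return the Content-Type value a server would send for `path`."""
    ext = posixpath.splitext(path)[1].lower()
    # An unmapped extension silently falls back to the server default.
    return extension_map.get(ext, SERVER_DEFAULTS[server])
```

Under these assumptions, a correctly named 'page.xhtml' is labeled 'text/plain' by one server and 'application/octet-stream' by the other, even though the author made no filename extension error.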

Understand current content sniffing

Most browsers perform some content sniffing. It will be useful to understand how each of the major browsers determines content type before determining new interoperability norms.

  • A) For the following browsers:
    • Internet Explorer 7
    • Firefox 3
    • Safari 3
    • Opera 9
    • iCab 3
  • B) identify how browsers handle content based on
    • content-type header
    • filename extension
    • @type attribute
    • sniffed content
  • C) when the server sends the following content-type header values:
    • <none> (i.e., no content-type header at all)
    • text/plain
    • text/html
    • application/octet-stream
    • <any other> (i.e., testing for any content sniffing, filename extension overrides or any other special handling for any other MIME types)
  • D) determine the browser priorities for the content indicators listed in 'B' for:
    • the main resource (including HTML, XML, RSS, ATOM, MPEG, JPEG, PNG, etc.)
    • LINK@href for style sheet data
    • LINK@href for other data
    • SCRIPT@src
    • OBJECT@data
    • IMG@src
    • EMBED@src
    • IMG@longdesc
    • @cite
    • A@href
    • AREA@href
    • etc.

There are several axes of tests here. For each browser, the determination of type may be handled differently (through content-type headers, content sniffing, etc.) depending on the content-type header response from the server and on the markup or other means (e.g., an address-bar URL request or an HTML document's embedded resource) that leads to the resource request.
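One way to keep track of these axes is to enumerate every combination up front and fill in results as tests are run. The following is a minimal sketch assuming the browsers, header values, and request contexts listed above (the open-ended "etc." entry is left out); `build_matrix` and `record` are hypothetical helper names, and no test results are implied.

```python
from itertools import product

BROWSERS = ["Internet Explorer 7", "Firefox 3", "Safari 3", "Opera 9", "iCab 3"]
HEADER_VALUES = ["<none>", "text/plain", "text/html",
                 "application/octet-stream", "<any other>"]
CONTEXTS = ["main resource", "LINK@href (style sheet)", "LINK@href (other)",
            "SCRIPT@src", "OBJECT@data", "IMG@src", "EMBED@src",
            "IMG@longdesc", "@cite", "A@href", "AREA@href"]
INDICATORS = ("head", "ext", "type", "sniff")


def build_matrix():
    """Create one empty slot per (browser, header value, context) test case."""
    return {case: None for case in product(BROWSERS, HEADER_VALUES, CONTEXTS)}


def record(matrix, browser, header, context, priorities):
    """Store the observed indicator priority order for one test case."""
    unknown = [p for p in priorities if p not in INDICATORS]
    if unknown:
        raise ValueError(f"unknown indicator(s): {unknown}")
    key = (browser, header, context)
    if key not in matrix:
        raise KeyError(key)
    matrix[key] = list(priorities)
```

Each slot would hold an ordered list such as the indicators a browser consults first, second, and so on; slots left as `None` show which cases still need testing.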

Proposed solutions

Treat content type as authoritative

Advantages

  • Gives authors with server access control over how their content is consumed

Disadvantages

  • When errors occur, users cannot consume content
  • When some browsers opt out of this solution, they appear to be better-quality browsers than those following the standard
  • Authors without server access, or without prior knowledge of where their content will be served from, have no way to control the content type handling

UA client-side determination

Advantages

  • Improves handling of errant headers (which have been common due to bugs in ubiquitous servers)

Disadvantages

  • Authors have no control to change the type handling of a file

Markup approach

A markup approach might add attributes to existing HTML elements to allow authors to control type handling of content. For example, the OBJECT element has a type attribute that provides advisory type information. Stronger, authoritative attributes such as handledataas, handlesrcas, or handlehrefas could be used to alter the UA handling of embedded and linked resources. For linked files, authors may also want to indicate that the file should be downloaded even when a UA can handle the file internally. This could be accomplished simply by adding a new target attribute keyword such as '_download'.
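As a rough sketch of how a UA might apply such attributes, assume the proposed (not standardized) names handledataas, handlesrcas, and handlehrefas, plus the proposed '_download' target keyword. The function name `handling_for` and the rule that an override always wins over the server's header are assumptions made for illustration only.

```python
# Proposed attribute names from the text above; none exist in HTML today.
OVERRIDE_ATTRS = ("handledataas", "handlesrcas", "handlehrefas")


def handling_for(attrs, header_type):
    """Return (action, media type) for a resource referenced by an element.

    `attrs` is the element's attribute dictionary and `header_type` is the
    media type from the server's Content-Type header.
    """
    # Assumption: an author-supplied override replaces the header's type.
    effective_type = header_type
    for name in OVERRIDE_ATTRS:
        if attrs.get(name):
            effective_type = attrs[name]
            break
    # Assumption: '_download' forces a download even for types the UA
    # could otherwise handle internally.
    action = "download" if attrs.get("target") == "_download" else "render"
    return action, effective_type
```

For the first use case, an OBJECT with data pointing at an XHTML file served as 'application/xhtml+xml' and handledataas set to 'text/plain' would be rendered as plain text without renaming the file or touching the server configuration.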

Advantages

  • Authors have control over how content is consumed
  • Optional attributes only necessary when altering the normal UA treatment of a file's type
  • Control over content handling does not require access to the server or prior knowledge of server configuration
  • Documents continue to work as the author intended regardless of the file's location

Disadvantages

  • Does not work with main resources that are not HTML (other resources would need to be loaded through a small HTML file)

Discussion

We have a problem: file types are not labeled properly. Some have identified one part of this issue as a disconnect between the local practice of filename extensions and the server practice of using content headers. Some suggest we might change those headers by indicating an '!important' in the content-header value or something similar. Browsers have tried to solve this problem by sniffing content (which actually contributes to the problem, since content sniffing leaves authors unaware of their errors in setting metadata). In addition to other approaches, HTML5 might want to treat mismatches between content headers and sniffed content types as an error (even if we require that the error be handled gracefully but somehow reported to authors).
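The idea of treating a header/sniff mismatch as a reportable but gracefully handled error could look roughly like the sketch below. The signature table is a tiny illustrative subset of well-known file signatures and is not any browser's actual sniffing algorithm; `sniff` and `check_label` are hypothetical names, and keeping the declared type on mismatch is an assumption.

```python
# A few well-known leading byte signatures (illustrative subset only).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff(body):
    """Return a media type guessed from leading bytes, or None if unknown."""
    for magic, media_type in SIGNATURES:
        if body.startswith(magic):
            return media_type
    return None


def check_label(content_type_header, body):
    """Keep the declared type, but report a mismatch with sniffed content."""
    # Drop parameters such as '; charset=utf-8' before comparing.
    declared = content_type_header.split(";")[0].strip().lower()
    sniffed = sniff(body)
    if sniffed is not None and sniffed != declared:
        warning = f"mismatch: labeled {declared}, content looks like {sniffed}"
        return declared, warning
    return declared, None
```

Here the declared type stays authoritative and the warning is only surfaced to the author, for example in an error console, so the author learns about the mislabeling without the UA silently overriding it.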

• For local files, filename extensions have become the nearly universal practice for setting file type handling in local file systems. It would be useful to determine whether authors make any significant number of errors in setting filename extensions. If they do, where does this happen, and can we learn anything about why it happens? Is there anything we as the HTML WG (or in cooperation with other groups) can do to address this problem? Mac OS certainly creates opportunities for author error in this respect, in that other type-setting mechanisms can prevent authors from noticing a missing filename extension (needed once the file goes to the server or is accessed using HTTP). However, modern Mac OS applications make it difficult to create new content that doesn't have a proper filename extension.
• Servers typically try to map filename extensions to content type headers. Servers may also be configured to provide content headers that are not based solely on the filename extension (or not based on it at all). Is this a large source of the problem?

Many presume the server mapping issue is a big source of the problem. That is, authors set their filename extensions correctly because they receive immediate local OS feedback when the filename extension is wrong (with rare exceptions on Mac OS, which provides other file type mapping techniques, though they are seldom used for web-family resource files).

It would be useful to know if this is a big source of the problem. For example, it will do no good to specify a new 'important' content header syntax if that too will be misconfigured. Typically, the problem may occur when new file formats become common on a server that was installed and configured long before those formats (and their associated filename extensions) came onto the scene.

Investigating these two issues (filename extensions and extension to header mapping) might require mere discussion among the WG members. What do we think about mistaken filename extensions? What do we think about mistaken filename extension to content header mappings? Is there any library research or research from W3C members that might shed some light on the issue?

For example, if we explore these issues and determine that filename extensions nearly universally reflect the authors' intentions, then perhaps content sniffing is not the way to go (this is just a hypothetical; it may not be the case). In that case, browsers that think they're providing greater value to their users by sniffing content are not doing so. It is the browser that treats filename extensions, or filename extensions in combination with content headers, as authoritative that will provide a better experience than content sniffing. Sure, content sniffing an image may be easy to do, but if the only time an image has a different filename extension is when an author wants it treated as a download (just as a semi-flawed example), then the browser that doesn't sniff provides a better user experience. At other times, a filename extension may be missing or unknown. There it might make sense to turn to sniffing as another (probably rare) fallback mechanism.

RFC 2616

RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1

Any HTTP/1.1 message containing an entity-body SHOULD include a Content-Type header field defining the media type of that body. If and only if the media type is not given by a Content-Type field, the recipient MAY attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource. If the media type remains unknown, the recipient SHOULD treat it as type "application/octet-stream".
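Read literally, the rule quoted above gives a three-step decision: use the Content-Type header if present, otherwise optionally guess from content or the URI's name extension, otherwise fall back to "application/octet-stream". The sketch below is one possible reading, not a normative implementation; `media_type`, `EXTENSION_TYPES`, the single PNG signature, and the choice to try content before the extension are all assumptions for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative extension map; a real recipient would use a fuller table.
EXTENSION_TYPES = {".html": "text/html", ".txt": "text/plain", ".png": "image/png"}
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def media_type(headers, url, body=b""):
    """Determine a media type following the quoted RFC 2616 rule."""
    # HTTP header names are case-insensitive.
    lowered = {name.lower(): value for name, value in headers.items()}
    content_type = lowered.get("content-type")
    if content_type:
        # The header is given, so no guessing is permitted.
        return content_type.split(";")[0].strip().lower()
    # No header: the recipient MAY inspect the content ...
    if body.startswith(PNG_SIGNATURE):
        return "image/png"
    # ... and/or the name extension of the URI.
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    # Still unknown: SHOULD treat as application/octet-stream.
    return EXTENSION_TYPES.get(ext, "application/octet-stream")
```

Note that under this reading a PNG served with 'text/plain' stays 'text/plain', which is exactly the case where sniffing browsers depart from the RFC.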

Examples

E-mails


See also