24337 – Authors should be able to use both "utf8" and "utf-8" labels, case-insensitively

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: Authors should be able to use both "utf8" and "utf-8" labels, case-insensitively

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Reported:	2014-01-20 18:22 UTC by Geoffrey Sneddon
Modified:	2014-01-24 22:47 UTC
CC List:	3 users

Description Geoffrey Sneddon 2014-01-20 18:22:29 UTC

Currently the spec says: 'Authors must use the utf-8 encoding and must use the "utf-8" label to identify it.'

Given that label matching is done case-insensitively, it is not clear whether authors must write this label in exactly this lowercase form. This should be clarified, preferably to allow any case (there is no practical benefit in requiring lowercase).
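To illustrate what case-insensitive label matching means on the decoder side, here is a minimal sketch. The `LABELS` table is a small illustrative subset rather than the spec's full list, and `get_encoding` is a hypothetical helper name. It assumes matching strips leading and trailing ASCII whitespace and lowercases only ASCII letters.

```python
# Illustrative subset of labels that map to the utf-8 encoding.
# The real label table in the Encoding Standard is much larger.
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "unicode-1-1-utf-8": "utf-8",
}

ASCII_WHITESPACE = "\t\n\x0c\r "
# Lowercase only A-Z so that non-ASCII characters are never folded.
ASCII_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"
)


def get_encoding(label):
    """Return the encoding name for a label, or None if it is unknown."""
    key = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    return LABELS.get(key)
```

Under these assumptions, `get_encoding("UTF-8")` and `get_encoding(" Utf8\n")` both resolve to `"utf-8"`, which is why the authoring requirement reads ambiguously next to the matching rules.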

We should also make the "utf8" label conforming. Making this non-conforming is of no practical benefit and makes a large number of documents non-conforming.

Comment 1 Martin Dürst 2014-01-21 05:34:57 UTC

(In reply to Geoffrey Sneddon from comment #0)
> Currently the spec says: 'Authors must use the utf-8 encoding and must use
> the "utf-8" label to identify it.'
> 
> Given the label matching is done case-insensitively, it is not entirely
> clear whether authors must use this label case-sensitively or not. This
> should be clarified, preferably to allow either case (there is no practical
> benefit of requiring it to be lowercased).

Agreed.

> We should also make the "utf8" label conforming. Making this non-conforming
> is of no practical benefit and makes a large number of documents
> non-conforming.

This looks innocuous at first. However, in some products (in particular Oracle Databases), the label "utf8" is used for a variant of UTF-8 in which each character outside the BMP is encoded as a surrogate pair, with each surrogate taking 3 bytes, for a total of 6 bytes (the scheme known as CESU-8). For security reasons, UTF-8 itself prohibits encoded surrogates.
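The difference can be shown with a short sketch. `surrogate_pair_encode` is a hypothetical helper that mimics the surrogate-based variant described above; it is not Oracle's implementation. The example relies on Python's strict `utf-8` codec rejecting encoded surrogates.

```python
def surrogate_pair_encode(ch):
    """Encode one character the way the surrogate-based variant would.

    BMP characters are encoded as ordinary UTF-8. Characters outside the
    BMP are split into a UTF-16 surrogate pair, and each surrogate is then
    written as a 3-byte sequence (6 bytes in total).
    """
    cp = ord(ch)
    if cp < 0x10000:
        return ch.encode("utf-8")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    # "surrogatepass" lets Python emit bytes for lone surrogates.
    return (chr(high) + chr(low)).encode("utf-8", "surrogatepass")


def is_valid_utf8(data):
    """Return True if the bytes decode under a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


emoji = "\U0001F600"                          # a character outside the BMP
variant_bytes = surrogate_pair_encode(emoji)  # 6 bytes, surrogate-based
standard_bytes = emoji.encode("utf-8")        # 4 bytes, standard UTF-8
```

Here `is_valid_utf8(standard_bytes)` is true while `is_valid_utf8(variant_bytes)` is false, so data produced under the surrogate-based convention is not interchangeable with real UTF-8.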

Comment 2 Anne 2014-01-24 22:47:23 UTC

Yeah, only utf-8 was intentional. Clarified the case stuff.

https://github.com/whatwg/encoding/commit/61af3cdf199b4ab86babd47b7d48bb328c54a702
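As a rough sketch of the resolution from the author's side, a conformance check might look like the following. It assumes the clarification permits any ASCII case of "utf-8" while "utf8" stays non-conforming. The exact wording is in the linked commit and is not reproduced here. `is_conforming_utf8_label` is a hypothetical name, and surrounding whitespace is deliberately not stripped because this thread does not say how it should be treated.

```python
# Lowercase only A-Z; non-ASCII characters are left untouched.
ASCII_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"
)


def is_conforming_utf8_label(label):
    """Return True if an author-written label is "utf-8" in any ASCII case."""
    return label.translate(ASCII_LOWER) == "utf-8"
```

Under that assumption, `"UTF-8"` and `"Utf-8"` pass, while `"utf8"` and `"UTF8"` do not, even though a decoder would still recognize them.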