7380 – Suggest heuristic detection of UTF-8

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: Suggest heuristic detection of UTF-8

Status:	CLOSED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

URL:	http://dev.w3.org/html5/spec/Overview...
Whiteboard:
Keywords:	NE

Depends on:
Blocks:

Reported:	2009-08-20 07:29 UTC by Maciej Stachowiak
Modified:	2010-10-04 14:32 UTC
CC List:	3 users

See Also:

Attachments

Description Maciej Stachowiak 2009-08-20 07:29:12 UTC

Step 6 of the encoding detection algorithm should specifically suggest the possibility of algorithmically detecting UTF-8. Here is some suggested wording from the I18N WG:

"Note: The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes > 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents that do not match it definitely are not. While not full autodetection, it may be appropriate for a user-agent to search for this common encoding."
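To illustrate the kind of check the suggested note describes, here is a minimal sketch in Python. It is not taken from the spec or from any user agent: the function name `sniff_utf8` and its three return labels are hypothetical, and it relies on Python's strict built-in UTF-8 decoder to stand in for "matches the UTF-8 pattern". It also assumes the whole buffer is available; a real implementation that sniffs only a prefix of the stream would need to tolerate a multi-byte sequence cut off at the end of the buffer.

```python
def sniff_utf8(data: bytes) -> str:
    """Classify a byte buffer for the purposes of a UTF-8 heuristic.

    Returns one of three hypothetical labels:
      "ascii-only"   - no byte > 0x7F, so there is no evidence either way
      "utf-8-likely" - bytes > 0x7F are present and all form valid UTF-8
      "not-utf-8"    - at least one byte sequence violates the UTF-8 pattern
    """
    # Pure ASCII is valid in UTF-8 and in most legacy encodings alike,
    # so it tells us nothing about which encoding was intended.
    if all(b <= 0x7F for b in data):
        return "ascii-only"

    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and encoded surrogates.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"

    return "utf-8-likely"


if __name__ == "__main__":
    print(sniff_utf8(b"plain text"))
    print(sniff_utf8("caf\u00e9".encode("utf-8")))
    print(sniff_utf8("caf\u00e9".encode("latin-1")))
```

The three-way result mirrors the wording of the note: a match is only "very likely" UTF-8, a mismatch rules UTF-8 out, and an ASCII-only buffer leaves the question to the other steps of the detection algorithm.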