Heuristic detection and non-ASCII-superset encodings

> The user agent may attempt to autodetect the character encoding from  
> applying frequency analysis or other algorithms to the data stream.  
> If autodetection succeeds in determining a character encoding, then  
> return that encoding, with the confidence tentative, and abort these  
> steps.

I think only US-ASCII superset encodings should be allowed as outcomes
of heuristic encoding detection. If a page is misdetected as UTF-16,
there's no later meta recourse: the bytes of a later meta declaration
no longer decode as ASCII, so the parser can never find it.
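
To make the proposal concrete, here is a rough Python sketch of the
kind of restriction I have in mind. The function name and the exact
contents of the allowlist are mine and purely illustrative; nothing
here is taken from the spec.

    # Illustrative allowlist: every entry is a superset of US-ASCII,
    # so a later meta still decodes as ASCII. The exact contents are
    # an example, not a normative list.
    ASCII_SUPERSET_DETECTABLE = {
        "utf-8", "windows-1251", "windows-1252", "koi8-r", "iso-8859-2",
    }

    def constrain_detection(guess):
        """Accept a heuristic detector's guess only if it names an
        ASCII superset; otherwise act as if detection had not
        succeeded, so the usual fallback encoding is used (with the
        confidence tentative either way)."""
        if guess is not None and guess.lower() in ASCII_SUPERSET_DETECTABLE:
            return guess
        return None  # e.g. a UTF-16LE or UTF-16BE guess is discarded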

Consider this case that I just programmed around:
A Russian page is encoded as Windows-1251. The page fails the meta  
prescan. A heuristic detector misdetects the page as UTF-16 Chinese. A  
later meta gets garbled and the parser output is garbage.
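
As a quick illustration of that failure mode (the markup is invented
for the example; only the windows-1251 and UTF-16 byte semantics
matter):

    # A windows-1251 page whose only in-document label is a late meta.
    page = '<html><p>Привет, мир</p><meta charset="windows-1251"></html>'
    raw = page.encode("windows-1251")

    # Misdetected as UTF-16LE, every pair of bytes fuses into one code
    # unit. None of the source bytes is 0x00, so not a single ASCII
    # character survives: the later meta is destroyed along with the
    # Cyrillic, and the parser can never see it.
    garbled = raw.decode("utf-16-le", errors="replace")
    print("charset" in garbled)   # False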

When only US-ASCII supersets can be detected, a later meta will set  
things right even if the heuristic detector fails.
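
Continuing the same toy example: even a wrong ASCII-superset guess
leaves every ASCII byte intact, so the declaration survives and can
correct the tentative encoding later.

    # The same windows-1251 bytes, misdetected as a different ASCII
    # superset (windows-1252 here): the Cyrillic turns into Latin
    # mojibake, but the ASCII bytes still decode as ASCII, so the
    # later meta is found.
    page = '<html><p>Привет, мир</p><meta charset="windows-1251"></html>'
    raw = page.encode("windows-1251")
    tentative = raw.decode("windows-1252", errors="replace")
    print('charset="windows-1251"' in tentative)   # True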

I don't have statistics to back this up, but my educated guess based
on anecdotal evidence is that HTTP-unlabeled, BOMless UTF-16BE and
UTF-16LE content is very rare, if not non-existent, on the Web. On the
other hand, Russian pages that CJK-biased detector software can
misdetect as UTF-16 are a more likely occurrence.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Friday, 21 March 2008 09:56:45 UTC