[WebIDL] Bugs in DOMString conversion to Unicode characters (was Re: "send data using the Web Socket" and UCS-2)

On Wednesday 2009-06-17 16:26 +1000, Cameron McCormack wrote:
> Jonas Sicking:
> > Yes, I don't see how we could handle this in WebIDL, other than
> > defining that all DOMStrings must be structurally correct UTF-16.
> > However that would be prohibitively expensive since we would have to
> > add checks in many many places.
> 
> I agree, I don’t think it would be good to require this.
> 
> Anne van Kesteren:
> > Web IDL could define algorithms how you convert a DOMString to and
> > from UTF-8. And maybe other encodings if that is desirable.
> 
> I added a simple algorithm that converts a sequence of 16 bit code units
> to a sequence of Unicode characters, inserting U+FFFD characters when
> bad surrogates are used:
> 
>   http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode
> 
> Nothing in Web IDL references this algorithm.  Other specs can do so if
> it is useful.

This algorithm seems incorrect in two ways:

 * It gets the ranges for high and low surrogates backwards.  High
   surrogates are U+D800 - U+DBFF, low surrogates are U+DC00 -
   U+DFFF, and in UTF-16 a surrogate pair is a high surrogate
   followed by a low surrogate.  So swapping the ranges in the
   headings should make the algorithm correct, modulo the next
   point.  But you should definitely double-check this. :-)

 * It incorrectly handles unpaired high surrogates by eating the
   next character.  Instead, I would prefer that the unpaired high
   surrogate is replaced by U+FFFD, and the following character is
   interpreted normally.  (That's what Gecko does, anyway.
   Furthermore, I think it makes sense to match the handling of
   unpaired low surrogates.)
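
To make the two points concrete, here is a rough sketch of the
behavior described above, in Python.  The function and constant
names are made up for illustration; this is not the text of the
Web IDL algorithm and not Gecko's code, just one way the
corrected steps could look.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def code_units_to_code_points(units):
    """Convert 16-bit code units to a list of Unicode code points.

    Hypothetical helper: unpaired surrogates (high or low) each become
    U+FFFD, and the code unit after an unpaired high surrogate is
    still interpreted normally rather than being swallowed.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: only valid when a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            # Unpaired high surrogate: replace it, consume only this unit.
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            out.append(REPLACEMENT)
        else:
            # Ordinary BMP code unit.
            out.append(u)
        i += 1
    return out
```

With this sketch, the code units 0xD800 0x0041 come out as U+FFFD
followed by U+0041, so the "A" after the unpaired high surrogate
is preserved.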

-David

-- 
L. David Baron                                 http://dbaron.org/
Mozilla Corporation                       http://www.mozilla.com/

Received on Tuesday, 30 June 2009 22:44:17 UTC