Bugzilla – Bug 17620
Add steps to convert a sequence of Unicode characters to a DOMString
Last modified: 2012-06-28 22:12:18 UTC
The Web API proposed in http://wiki.whatwg.org/wiki/StringEncoding requires interpreting DOMString code units as an encoding of Unicode characters, so that DOMStrings can be encoded to and decoded from binary encodings.
WebIDL defines "steps to convert a DOMString to a sequence of Unicode characters" at http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode
The proposed API requires the reverse as well, and defines "steps to convert a sequence of Unicode characters to a DOMString" at http://wiki.whatwg.org/wiki/StringEncoding#Steps_to_convert_a_sequence_of_Unicode_characters_to_a_DOMString
Would it be possible to add the latter to the WebIDL specification so that both directions are defined in one place?
The proposed text could be (sans-formatting, using _ for subscript and ^ for superscript):
The following algorithm defines a way to convert a sequence of Unicode characters to a DOMString:
1. Let U_0...n-1 be the sequence of Unicode characters.
2. Initialize i to 0.
3. Initialize S to be an empty sequence of code units.
4. While i < n:
   1. Let c be the code point of the Unicode character in U at index i.
   2. If c ≥ 2^16, then:
      1. Append to S a code unit equal to (c - 2^16) / 2^10 + 0xD800, where "/" represents integer division.
      2. Append to S a code unit equal to (c - 2^16) % 2^10 + 0xDC00, where "%" represents the remainder of an integer division.
   3. Otherwise, append to S a code unit equal to c.
   4. Set i to i+1.
5. Return the IDL DOMString value that represents the sequence of code units S.
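For illustration only, the steps above could be sketched in Python roughly as follows. The function name code_points_to_code_units is made up for this example and is not part of WebIDL or the StringEncoding proposal; the input is assumed to be a sequence of integer code points in the range 0 to 0x10FFFF, and the result is a plain list of 16-bit code units rather than an actual DOMString.

```python
def code_points_to_code_units(code_points):
    """Convert a sequence of Unicode code points (ints) to UTF-16 code units.

    Hypothetical helper: mirrors the proposed steps, nothing more.
    """
    units = []  # S: the sequence of code units being built
    for c in code_points:
        if c >= 0x10000:  # c >= 2^16, so a surrogate pair is needed
            offset = c - 0x10000
            units.append(offset // 0x400 + 0xD800)  # lead (high) surrogate
            units.append(offset % 0x400 + 0xDC00)   # trail (low) surrogate
        else:
            units.append(c)  # fits in a single code unit
    return units


# U+0041 stays a single unit; U+1F600 becomes the pair 0xD83D, 0xDE00.
print([hex(u) for u in code_points_to_code_units([0x41, 0x1F600])])
```

Like the proposed text, this sketch does not check whether the input contains surrogate code points; it just copies them through as single code units.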
I think the reason I didn't include this reverse algorithm is that there's only one correct way of converting a Unicode string into UTF-16 code units, whereas going the other way you need to deal with illegal UTF-16 sequences, so there were some different approaches we could have taken. So you could probably just write a single line in your spec saying, for example:
Let s be the DOMString that represents the sequence of code units resulting
from encoding the sequence of Unicode characters t as UTF-16.
I guess I just want to avoid re-specifying the UTF-16 encoding algorithm. But if you think the above is not precise enough I guess I can add your suggested text.
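As a rough illustration of why the one-line wording may be precise enough: any conforming UTF-16 encoder should yield the same code units for well-formed input. The Python sketch below uses the standard "utf-16-be" codec and struct.unpack to recover the 16-bit code units; the helper name utf16_code_units is invented for this example, and a Python str stands in for the sequence of Unicode characters t.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper: relies on the standard library's UTF-16 encoder
    instead of re-specifying the surrogate-pair arithmetic.
    """
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    count = len(data) // 2           # each code unit is two bytes
    return list(struct.unpack(">%dH" % count, data))


# "A" followed by U+1F600 gives three code units.
print([hex(u) for u in utf16_code_units("A\U0001F600")])
```

Note that Python's encoder raises an error for lone surrogates in the input, which is one detail the one-line wording leaves to the UTF-16 definition.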
(In reply to comment #1)
> I guess I just want to avoid re-specifying the UTF-16 encoding algorithm. But
> if you think the above is not precise enough I guess I can add your suggested
You're right, that should be fine. If anyone complains about not having it detailed it shouldn't be problematic to add it later since, as you point out, there's only one way to do it.