This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Unlike Han characters, Hangul syllables actually have Unicode names, and the Korean index should probably use those instead of just saying ‘<Hangul Syllable>’. See Unicode 6.1, Section 3.12 Conjoining Jamo Behavior, Hangul Syllable Name Generation [*]. For instance, U+D4DB is ‘HANGUL SYLLABLE PWILH’. [*] <http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf>, pp. 111-112.
As long as UnicodeData.txt (https://raw.github.com/whatwg/encoding/master/UnicodeData.txt, or the equivalent file on unicode.org) does not contain those names, that seems unlikely to happen.
Are you saying that Unicode would have to publish an official list rather than an algorithm, or would it be enough if someone implemented the algorithm, checked the output against lists believed to be correct (e.g., http://www.inames.net/lang/out/out_p1s3_hangul.html) and sent you the data in a suitable format?
There is also <http://www.itscj.ipsj.or.jp/sc2/open/02n4168/HangulSy.txt>, which appears to be part of ISO 10646.
Well, if you are willing to do the work, I do not think I would object. It seems like a welcome addition. A patch to https://github.com/whatwg/encoding/blob/master/index.py would be best, but I can do that last bit too.
The easiest solution is probably to add the algorithm to index.py:

@@ -5,6 +5,13 @@
 data = json.loads(open("indexes.json", "r").read())
 # Copy from ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
 names = open("UnicodeData.txt", "r").readlines()
+jamo = [["G","GG","N","D","DD","R","M","B","BB","S","SS","","J","JJ","C","K",
+         "T","P","H"],
+        ["A","AE","YA","YAE","EO","E","YEO","YE","O","WA","WAE","OE","YO","U",
+         "WEO","WE","WI","YU","EU","YI","I"],
+        ["","G","GG","GS","N","NJ","NH","D","L","LG","LM","LB","LS","LT","LP",
+         "LH","M","B","BS","S","SS","NG","J","C","K","T","P","H"]]
+
 def char(cp):
     if cp > 0xFFFF:
         hi, lo = divmod(cp-0x10000, 0x400)
@@ -23,7 +30,9 @@ def get_name(cp):
     elif cp >= 0x4E00 and cp <= 0x9FCB:
         return "<CJK Ideograph>"
     elif cp >= 0xAC00 and cp <= 0xD7A3:
-        return "<Hangul Syllable>"
+        i = cp - 0xAC00
+        s = jamo[0][i/21/28] + jamo[1][i%(21*28)/28] + jamo[2][i%28]
+        return "HANGUL SYLLABLE " + s
     elif cp >= 0xE000 and cp <= 0xF8FF:
         return "<Private Use>"
     elif cp >= 0x20000 and cp <= 0x2A6D6:
Alternative, slightly simpler formula:

        s = jamo[0][i/21/28] + jamo[1][i/28%21] + jamo[2][i%28]
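For anyone who wants to try the algorithm outside index.py, here is a minimal standalone sketch in Python 3. It is not the code that was committed. The names hangul_syllable_name, JAMO_L, JAMO_V and JAMO_T are made up for illustration, and it uses divmod rather than the "/" operator from the patch, since "/" is true division in Python 3. The jamo short names are the ones from the patch above.

```python
import unicodedata

# Constants from the Hangul syllable composition scheme:
# 19 leading consonants x 21 vowels x 28 trailing consonants (incl. none).
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

# Jamo short names, copied from the patch (hypothetical variable names).
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS", "",
          "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def hangul_syllable_name(cp):
    """Return the generated name for a code point in U+AC00..U+D7A3."""
    index = cp - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable: U+%04X" % cp)
    # Peel off the trailing consonant first, then split the rest
    # into leading consonant and vowel.
    lv, t = divmod(index, T_COUNT)
    l, v = divmod(lv, V_COUNT)
    return "HANGUL SYLLABLE " + JAMO_L[l] + JAMO_V[v] + JAMO_T[t]


print(hangul_syllable_name(0xD4DB))  # HANGUL SYLLABLE PWILH

# Cross-check against the names Python's unicodedata module reports.
mismatches = [cp for cp in range(0xAC00, 0xD7A4)
              if hangul_syllable_name(cp) != unicodedata.name(chr(cp))]
print(len(mismatches))
```

The cross-check at the end is one way to do the "check the output against lists believed to be correct" step mentioned earlier, with the unicodedata module standing in for an external list.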
https://github.com/whatwg/encoding/commit/9cc353b228d5ec9c6fdbb7a516ef18d8970fdd27 Thanks!
It would make more sense to write the first division as i/28/21. (This does of course make no difference to the result, and this code is not part of the specification, so I leave the bug closed.)
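To back up the claim that the order of the divisions makes no difference, here is a small brute-force check in Python 3 over all 11172 syllable indices. The function name index_formulas_agree is made up for this comment, and "//" is used where the Python 2 patch has "/". It also confirms that the two vowel-index formulas discussed above (i%(21*28)/28 and i/28%21) agree.

```python
def index_formulas_agree():
    """Compare the index formulas from this bug for every syllable index."""
    for i in range(19 * 21 * 28):
        # Leading consonant: either order of nested floor division
        # equals a single division by 21*28.
        if not (i // 21 // 28 == i // 28 // 21 == i // (21 * 28)):
            return False
        # Vowel: formula from the original patch vs. the simpler alternative.
        if i % (21 * 28) // 28 != i // 28 % 21:
            return False
    return True


print(index_formulas_agree())  # True
```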
Done.