Bug 19944: Editorial: Hangul names missing from the Korean index

Created: 2012-11-11 23:23:28 +0000
Last modified: 2013-01-21 16:25:24 +0000
Classification: Unclassified
Product: WHATWG
Component: Encoding
Version: unspecified
Platform: All
OS: All
Status: RESOLVED FIXED
Priority: P2
Severity: trivial
Target milestone: Unsorted
Reporter: pub-w3
Assignee: annevk
CC: mike, VYV03354
QA contact: sideshowbarker+encodingspec

Comment 0 (78212), pub-w3, 2012-11-11 23:23:28 +0000

Unlike Han characters, Hangul syllables actually have Unicode names, and the Korean index should probably use those instead of just saying ‘<Hangul Syllable>’. See Unicode 6.1, Section 3.12 Conjoining Jamo Behavior, Hangul Syllable Name Generation [*]. For instance, U+D4DB is ‘HANGUL SYLLABLE PWILH’.

[*] <http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf>, pp. 111-112.

Comment 1 (78214), annevk, 2012-11-12 08:09:23 +0000

As long as https://raw.github.com/whatwg/encoding/master/UnicodeData.txt does not contain them (or the equivalent file on Unicode.org) that seems unlikely to happen.

Comment 2 (78238), pub-w3, 2012-11-12 21:16:48 +0000

Are you saying that Unicode would have to publish an official list rather than an algorithm, or would it be enough if someone implemented the algorithm, checked the output against lists believed to be correct (e.g., http://www.inames.net/lang/out/out_p1s3_hangul.html) and sent you the data in a suitable format?

Comment 3 (78239), pub-w3, 2012-11-12 21:28:24 +0000

There is also <http://www.itscj.ipsj.or.jp/sc2/open/02n4168/HangulSy.txt>, which appears to be part of ISO 10646.

Comment 4 (78243), annevk, 2012-11-12 22:06:12 +0000

Well, if you are willing to do the work I do not think I would object. It seems like a welcome addition. A patch to https://github.com/whatwg/encoding/blob/master/index.py would be best, but I can do that last bit too.

Comment 5 (78333), pub-w3, 2012-11-15 00:40:34 +0000

The easiest solution is probably to add the algorithm to index.py:

@@ -5,6 +5,13 @@ data = json.loads(open("indexes.json", "r").read())
 # Copy from ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
 names = open("UnicodeData.txt", "r").readlines()
 
+jamo = [["G","GG","N","D","DD","R","M","B","BB","S","SS","","J","JJ","C","K",
+         "T","P","H"],
+        ["A","AE","YA","YAE","EO","E","YEO","YE","O","WA","WAE","OE","YO","U",
+         "WEO","WE","WI","YU","EU","YI","I"],
+        ["","G","GG","GS","N","NJ","NH","D","L","LG","LM","LB","LS","LT","LP",
+         "LH","M","B","BS","S","SS","NG","J","C","K","T","P","H"]]
+
 def char(cp):
     if cp > 0xFFFF:
         hi, lo = divmod(cp-0x10000, 0x400)
@@ -23,7 +30,9 @@ def get_name(cp):
     elif cp >= 0x4E00 and cp <= 0x9FCB:
         return "<CJK Ideograph>"
     elif cp >= 0xAC00 and cp <= 0xD7A3:
-        return "<Hangul Syllable>"
+        i = cp - 0xAC00
+        s = jamo[0][i/21/28] + jamo[1][i%(21*28)/28] + jamo[2][i%28]
+        return "HANGUL SYLLABLE " + s
     elif cp >= 0xE000 and cp <= 0xF8FF:
         return "<Private Use>"
     elif cp >= 0x20000 and cp <= 0x2A6D6:

Comment 6 (78336), pub-w3, 2012-11-15 01:43:21 +0000

Alternative, slightly simpler formula:

    s = jamo[0][i/21/28] + jamo[1][i/28%21] + jamo[2][i%28]

Comment 7 (78402), annevk, 2012-11-16 12:30:50 +0000

https://github.com/whatwg/encoding/commit/9cc353b228d5ec9c6fdbb7a516ef18d8970fdd27

Thanks!

Comment 8 (81599), pub-w3, 2013-01-19 15:00:41 +0000

It would make more sense to write the first division as i/28/21. (This does of course make no difference to the result, and this code is not part of the specification, so I leave the bug closed.)

Comment 9 (81888), annevk, 2013-01-21 16:25:24 +0000

Done.
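
For readers who want to try the name-generation formula from comments 5, 6, and 8 outside index.py, here is a minimal standalone sketch in Python 3. It is not the code that was committed: the function name hangul_syllable_name and the constants LEADS, VOWELS, TAILS, S_BASE, and S_LAST are hypothetical names chosen for illustration. The three jamo tables are copied from the patch in comment 5. The patch relies on Python 2 integer division with "/"; in Python 3 that operator yields a float, so this sketch uses divmod instead.

```python
# Illustrative sketch only; all names here are hypothetical, not from index.py.

# Short jamo names, copied from the patch in comment 5.
LEADS = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS", "",
         "J", "JJ", "C", "K", "T", "P", "H"]
VOWELS = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
TAILS = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
         "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
         "K", "T", "P", "H"]

S_BASE = 0xAC00  # first precomposed Hangul syllable
S_LAST = 0xD7A3  # last precomposed Hangul syllable


def hangul_syllable_name(cp):
    """Return the Unicode name of the Hangul syllable at code point cp."""
    if not S_BASE <= cp <= S_LAST:
        raise ValueError("not a precomposed Hangul syllable: U+%04X" % cp)
    i = cp - S_BASE
    # Split the index into lead, vowel, and tail positions using integer
    # arithmetic (divmod avoids Python 3's float-returning "/").
    lead, rest = divmod(i, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return "HANGUL SYLLABLE " + LEADS[lead] + VOWELS[vowel] + TAILS[tail]


if __name__ == "__main__":
    print(hangul_syllable_name(0xD4DB))  # HANGUL SYLLABLE PWILH
```

The printed name matches the example given in comment 0. One way to check the whole range, in the spirit of comment 2, is to compare each result with unicodedata.name(chr(cp)) from the Python standard library.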