This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 19944 - Editorial: Hangul names missing from the Korean index
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: All All
Importance: P2 trivial
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
Reported: 2012-11-11 23:23 UTC by pub-w3
Modified: 2013-01-21 16:25 UTC

Description pub-w3 2012-11-11 23:23:28 UTC
Unlike Han characters, Hangul syllables actually have Unicode names, and the Korean index should probably use those instead of just saying ‘<Hangul Syllable>’.

See Unicode 6.1, Section 3.12 Conjoining Jamo Behavior, Hangul Syllable Name Generation [*].  For instance, U+D4DB is ‘HANGUL SYLLABLE PWILH’.

[*] <http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf>, pp. 111-112.
Comment 1 Anne 2012-11-12 08:09:23 UTC
As long as https://raw.github.com/whatwg/encoding/master/UnicodeData.txt (or the equivalent file on Unicode.org) does not contain them, that seems unlikely to happen.
Comment 2 pub-w3 2012-11-12 21:16:48 UTC
Are you saying that Unicode would have to publish an official list rather than an algorithm, or would it be enough if someone implemented the algorithm, checked the output against lists believed to be correct (e.g., http://www.inames.net/lang/out/out_p1s3_hangul.html) and sent you the data in a suitable format?
Comment 3 pub-w3 2012-11-12 21:28:24 UTC
There is also <http://www.itscj.ipsj.or.jp/sc2/open/02n4168/HangulSy.txt>, which appears to be part of ISO 10646.
Comment 4 Anne 2012-11-12 22:06:12 UTC
Well, if you are willing to do the work, I do not think I would object. It seems like a welcome addition. A patch to https://github.com/whatwg/encoding/blob/master/index.py would be best, but I can do that last bit too.
Comment 5 pub-w3 2012-11-15 00:40:34 UTC
The easiest solution is probably to add the algorithm to index.py:

@@ -5,6 +5,13 @@ data = json.loads(open("indexes.json", "r").read())
 # Copy from ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
 names = open("UnicodeData.txt", "r").readlines()
 
+jamo = [["G","GG","N","D","DD","R","M","B","BB","S","SS","","J","JJ","C","K",
+         "T","P","H"],
+        ["A","AE","YA","YAE","EO","E","YEO","YE","O","WA","WAE","OE","YO","U",
+         "WEO","WE","WI","YU","EU","YI","I"],
+        ["","G","GG","GS","N","NJ","NH","D","L","LG","LM","LB","LS","LT","LP",
+         "LH","M","B","BS","S","SS","NG","J","C","K","T","P","H"]]
+
 def char(cp):
     if cp > 0xFFFF:
         hi, lo = divmod(cp-0x10000, 0x400)
@@ -23,7 +30,9 @@ def get_name(cp):
     elif cp >= 0x4E00 and cp <= 0x9FCB:
         return "<CJK Ideograph>"
     elif cp >= 0xAC00 and cp <= 0xD7A3:
-        return "<Hangul Syllable>"
+        i = cp - 0xAC00
+        s = jamo[0][i/21/28] + jamo[1][i%(21*28)/28] + jamo[2][i%28]
+        return "HANGUL SYLLABLE " + s
     elif cp >= 0xE000 and cp <= 0xF8FF:
         return "<Private Use>"
     elif cp >= 0x20000 and cp <= 0x2A6D6:
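
The diff above relies on Python 2 integer division (`/` on two ints); under Python 3 the same expressions would produce floats and fail as list indices. As a standalone illustration, here is a minimal Python 3 sketch of the same name-generation step. It is not the actual index.py code: the names `LEADS`, `VOWELS`, `TAILS`, `hangul_syllable_name` and `mismatches_against_unicodedata` are hypothetical, and the tables are copied from the `jamo` list in the diff. The cross-check uses the standard `unicodedata` module as the reference list.

```python
import unicodedata

# Jamo short names, copied from the `jamo` table in the diff above.
LEADS = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS", "",
         "J", "JJ", "C", "K", "T", "P", "H"]
VOWELS = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
TAILS = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
         "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
         "K", "T", "P", "H"]

S_BASE = 0xAC00                                    # first Hangul syllable
S_COUNT = len(LEADS) * len(VOWELS) * len(TAILS)    # 19 * 21 * 28 = 11172


def hangul_syllable_name(cp):
    """Return the Unicode name of the precomposed Hangul syllable `cp`."""
    i = cp - S_BASE
    if not 0 <= i < S_COUNT:
        raise ValueError("not a Hangul syllable: U+%04X" % cp)
    # Split the syllable index into lead, vowel and tail indices.
    lead, rest = divmod(i, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return "HANGUL SYLLABLE " + LEADS[lead] + VOWELS[vowel] + TAILS[tail]


def mismatches_against_unicodedata():
    """List code points whose generated name differs from unicodedata's."""
    return [cp for cp in range(S_BASE, S_BASE + S_COUNT)
            if hangul_syllable_name(cp) != unicodedata.name(chr(cp))]
```

For the example from the description, `hangul_syllable_name(0xD4DB)` returns `'HANGUL SYLLABLE PWILH'`, and `mismatches_against_unicodedata()` is expected to return an empty list.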
Comment 6 pub-w3 2012-11-15 01:43:21 UTC
Alternative, slightly simpler formula:
s = jamo[0][i/21/28] + jamo[1][i/28%21] + jamo[2][i%28]
Comment 8 pub-w3 2013-01-19 15:00:41 UTC
It would make more sense to write the first division as i/28/21.
(This of course makes no difference to the result, and this code is not part of the specification, so I leave the bug closed.)
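
The claim that the three spellings of the index computation give the same result can be checked exhaustively. The sketch below is only an illustration in Python 3, using `//` where the thread's Python 2 code uses `/`; the function names are hypothetical and merely label which comment each formula comes from.

```python
S_COUNT = 19 * 21 * 28  # number of precomposed Hangul syllables (11172)


def indices_comment5(i):
    # Formula from the diff in comment 5.
    return (i // 21 // 28, i % (21 * 28) // 28, i % 28)


def indices_comment6(i):
    # Simpler vowel index from comment 6.
    return (i // 21 // 28, i // 28 % 21, i % 28)


def indices_comment8(i):
    # First division reordered as suggested in comment 8.
    return (i // 28 // 21, i // 28 % 21, i % 28)


def formulas_agree():
    """True if all three formulas agree for every syllable index."""
    return all(
        indices_comment5(i) == indices_comment6(i) == indices_comment8(i)
        for i in range(S_COUNT)
    )
```

For the syllable index of U+D4DB (`0xD4DB - 0xAC00`), each formula yields `(17, 16, 15)`, that is P, WI, LH.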
Comment 9 Anne 2013-01-21 16:25:24 UTC
Done.