19944 – Editorial: Hangul names missing from the Korean index

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	All All

Importance:	P2 trivial
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Reported:	2012-11-11 23:23 UTC by pub-w3
Modified:	2013-01-21 16:25 UTC
CC List:	2 users

Description pub-w3 2012-11-11 23:23:28 UTC

Unlike Han characters, Hangul syllables actually have Unicode names, and the Korean index should probably use those instead of just saying ‘<Hangul Syllable>’.

See Unicode 6.1, Section 3.12 Conjoining Jamo Behavior, Hangul Syllable Name Generation [*].  For instance, U+D4DB is ‘HANGUL SYLLABLE PWILH’.

[*] <http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf>, pp. 111-112.

Comment 1 Anne 2012-11-12 08:09:23 UTC

As long as https://raw.github.com/whatwg/encoding/master/UnicodeData.txt (or the equivalent file on Unicode.org) does not contain them, that seems unlikely to happen.

Comment 2 pub-w3 2012-11-12 21:16:48 UTC

Are you saying that Unicode would have to publish an official list rather than an algorithm, or would it be enough if someone implemented the algorithm, checked the output against lists believed to be correct (e.g., http://www.inames.net/lang/out/out_p1s3_hangul.html) and sent you the data in a suitable format?

Comment 3 pub-w3 2012-11-12 21:28:24 UTC

There is also <http://www.itscj.ipsj.or.jp/sc2/open/02n4168/HangulSy.txt>, which appears to be part of ISO 10646.

Comment 4 Anne 2012-11-12 22:06:12 UTC

Well, if you are willing to do the work, I do not think I would object. It seems like a welcome addition. A patch to https://github.com/whatwg/encoding/blob/master/index.py would be best, but I can do that last bit too.

Comment 5 pub-w3 2012-11-15 00:40:34 UTC

The easiest solution is probably to add the algorithm to index.py:

@@ -5,6 +5,13 @@ data = json.loads(open("indexes.json", "r").read())
 # Copy from ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
 names = open("UnicodeData.txt", "r").readlines()
 
+jamo = [["G","GG","N","D","DD","R","M","B","BB","S","SS","","J","JJ","C","K",
+         "T","P","H"],
+        ["A","AE","YA","YAE","EO","E","YEO","YE","O","WA","WAE","OE","YO","U",
+         "WEO","WE","WI","YU","EU","YI","I"],
+        ["","G","GG","GS","N","NJ","NH","D","L","LG","LM","LB","LS","LT","LP",
+         "LH","M","B","BS","S","SS","NG","J","C","K","T","P","H"]]
+
 def char(cp):
     if cp > 0xFFFF:
         hi, lo = divmod(cp-0x10000, 0x400)
@@ -23,7 +30,9 @@ def get_name(cp):
     elif cp >= 0x4E00 and cp <= 0x9FCB:
         return "<CJK Ideograph>"
     elif cp >= 0xAC00 and cp <= 0xD7A3:
-        return "<Hangul Syllable>"
+        i = cp - 0xAC00
+        s = jamo[0][i/21/28] + jamo[1][i%(21*28)/28] + jamo[2][i%28]
+        return "HANGUL SYLLABLE " + s
     elif cp >= 0xE000 and cp <= 0xF8FF:
         return "<Private Use>"
     elif cp >= 0x20000 and cp <= 0x2A6D6:
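
For readers who want to try the algorithm outside index.py, here is a minimal standalone sketch in Python 3. The jamo tables are copied from the patch above; the function name hangul_syllable_name and the constant names are hypothetical and not part of index.py. The patch targets Python 2, where / on integers floors; the sketch uses divmod so that it also behaves correctly under Python 3. As a sanity check it compares every syllable name against the interpreter's own unicodedata module, which is assumed here as a stand-in for the reference lists mentioned in Comments 2 and 3.

```python
import unicodedata

# Short jamo names, copied from the patch in Comment 5.
LEADING = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS", "",
           "J", "JJ", "C", "K", "T", "P", "H"]
VOWEL = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
         "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
TRAILING = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
            "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
            "K", "T", "P", "H"]


def hangul_syllable_name(cp):
    """Hypothetical helper: name of a precomposed Hangul syllable."""
    if not 0xAC00 <= cp <= 0xD7A3:
        raise ValueError("not a precomposed Hangul syllable: U+%04X" % cp)
    i = cp - 0xAC00
    # Peel off the trailing consonant first, then the vowel;
    # what remains is the leading consonant index.
    rest, t = divmod(i, len(TRAILING))
    l, v = divmod(rest, len(VOWEL))
    return "HANGUL SYLLABLE " + LEADING[l] + VOWEL[v] + TRAILING[t]


# The example from the Description.
print(hangul_syllable_name(0xD4DB))  # HANGUL SYLLABLE PWILH

# Cross-check all 11172 syllables against Python's Unicode database.
mismatches = [cp for cp in range(0xAC00, 0xD7A4)
              if hangul_syllable_name(cp) != unicodedata.name(chr(cp))]
print(len(mismatches))  # 0
```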

Comment 6 pub-w3 2012-11-15 01:43:21 UTC

Alternative, slightly simpler formula:
s = jamo[0][i/21/28] + jamo[1][i/28%21] + jamo[2][i%28]
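
That the two vowel-index expressions agree can be checked exhaustively over all syllable indexes. The following is a small sketch in Python 3, so // replaces the Python 2 integer / used in the thread; the function and constant names are made up for illustration.

```python
N_VOWEL, N_TRAILING = 21, 28
N_SYLLABLES = 19 * N_VOWEL * N_TRAILING  # 11172 precomposed syllables


def vowel_index_original(i):
    # Comment 5: reduce to the position within one leading-consonant
    # block, then drop the trailing-consonant part.
    return i % (N_VOWEL * N_TRAILING) // N_TRAILING


def vowel_index_simpler(i):
    # Comment 6: drop the trailing-consonant part, then wrap at 21 vowels.
    return i // N_TRAILING % N_VOWEL


agree = all(vowel_index_original(i) == vowel_index_simpler(i)
            for i in range(N_SYLLABLES))
print(agree)  # True
```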

Comment 7 Anne 2012-11-16 12:30:50 UTC

https://github.com/whatwg/encoding/commit/9cc353b228d5ec9c6fdbb7a516ef18d8970fdd27

Thanks!

Comment 8 pub-w3 2013-01-19 15:00:41 UTC

It would make more sense to write the first division as i/28/21.
(This of course makes no difference to the result, and this code is not part of the specification, so I am leaving the bug closed.)
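
One way to see why the order does not matter, and why i/28/21 reads more naturally: dividing by 28 first strips the trailing consonant and leaves the index of the (leading consonant, vowel) pair, which is then split by 21. Dividing by 21 first gives the same final quotient but an intermediate value with no meaning. A short Python 3 sketch, with hypothetical variable names and // standing in for the Python 2 integer /:

```python
N_VOWEL, N_TRAILING = 21, 28
N_SYLLABLES = 19 * N_VOWEL * N_TRAILING

# For non-negative integers, nested floor divisions equal a single floor
# division by the product, so the order of the divisors is irrelevant.
all_agree = all(
    i // N_VOWEL // N_TRAILING
    == i // N_TRAILING // N_VOWEL
    == i // (N_VOWEL * N_TRAILING)
    for i in range(N_SYLLABLES)
)
print(all_agree)  # True

# With the divisors in the order 28, 21 the intermediate value is the
# index of the (leading consonant, vowel) pair.
i = 0xD4DB - 0xAC00          # U+D4DB, the example from the Description
lv_pair = i // N_TRAILING
leading, vowel = divmod(lv_pair, N_VOWEL)
print(leading, vowel)        # 17 16, i.e. "P" and "WI"
```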

Comment 9 Anne 2013-01-21 16:25:24 UTC

Done.