This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
https://encoding.spec.whatwg.org/#shift_jis-encoder Proposal: Add the following step between steps 4 and 5 of the shift_jis encoder: * If code point is U+2212, return two bytes: 0x81 and 0x7C. See crbug.com/425417. With a Japanese input method enabled, pressing the "-" key produces U+2212 on Mac and U+FF0D on Windows. Browsers using the ICU mapping convert U+2212 to the same bytes as U+FF0D, so we had no issues. Google Chrome switched to the strict mapping specified by the Encoding API specification, and that caused problems. "-" is used very frequently by Japanese users because it appears in postal addresses.
Simon, Henri, opinions? I'm inclined to add this given it's rather trivial.
So the consequence would be that Shift_JIS 0x817c would always decode as U+FF0B, but both U+FF0B and U+2212 would be encoded as 0x817C? And as a corollary, U+2212 wouldn't round-trip?
(In reply to Simon Montagu from comment #2) > So the consequence would be that Shift_JIS 0x817c would always decode as > U+FF0B, but both U+FF0B and U+2212 would be encoded as 0x817C? And as a > corollary, U+2212 wouldn't round-trip? That's right. (FF0B -> FF0D)
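The asymmetry confirmed here can be sketched with two tiny lookup tables. These are hypothetical stand-ins that hold only the entries under discussion, not the real Shift_JIS index, and the function name is made up for illustration.

```python
# Toy tables (hypothetical, two entries only), not the real Shift_JIS index.
DECODE = {b"\x81\x7c": "\uff0d"}   # 0x81 0x7C always decodes to U+FF0D
ENCODE = {
    "\uff0d": b"\x81\x7c",         # ordinary round-trip mapping
    "\u2212": b"\x81\x7c",         # proposed encoding-only mapping
}


def round_trips(ch: str) -> bool:
    """Return True if encoding and then decoding yields the same character."""
    return DECODE[ENCODE[ch]] == ch
```

Under these assumptions both characters encode to the same two bytes, but only U+FF0D comes back unchanged after decoding.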
Firefox also has the non-round-trip mapping (U+2212 to 0x817C). Internet Explorer and Microsoft Edge don't, but they will never receive U+2212 from Japanese input method because they are Windows only. I agree to add this mapping to the spec.
I also support this change and will make the change to Chrome's ICU quickly. BTW, virtually all the encoding-only mappings (fromUnicode direction) were dropped in the encoding spec, probably under the assumption that emitting NCRs would work with most servers. This is an example where that assumption breaks down. We need to review other cases of 'dropped encoding-only mappings'.
(In reply to Masatoshi Kimura from comment #4) > Firefox also has the non-round-trip mapping (U+2212 to 0x817C). What other entries does Firefox have for 'encoding-only mapping' (fromUnicode) in Shift_JIS and other encodings?
(In reply to Anne from comment #1) > Simon, Henri, opinions? I'm not competent to have an opinion on this.
Created attachment 1603 [details] encoding-only mappings found on Firefox (In reply to Jungshik Shin from comment #6) > What other entries does Firefox have for 'encoding-only mapping' > (fromUnicode) in Shift_JIS and other encodings? Looks like Firefox has some bugs in its gbk/gb18030 encoder. UAO has many "fallback" mappings, and so does Big5-UAO (the Firefox Big5 variant). Shift_JIS and EUC-JP have 11 encoding-only mappings to compensate for the difference between MS mappings and JIS mappings. ISO-2022-JP has halfwidth-to-fullwidth katakana conversion as well as the 11 mappings. Only multibyte encodings have encoding-only mappings on Firefox.
Created attachment 1604 [details] encoding-only mappings found on Chrome Looks like Chrome also has some bugs...
Created attachment 1605 [details] Test script This script was too slow on Internet Explorer to wait for the result. Please run it on Safari and attach the result.
Created attachment 1606 [details] Test script (won't work on IE) The previous script didn't work well with ISO-2022-JP. Please use this instead.
(In reply to Masatoshi Kimura from comment #9) > Created attachment 1604 [details] > encoding-only mappings found on Chrome > > Looks like Chrome also has some bugs...
These and a number of other entries do not make any sense. (I'm aware that your test script generated them.)
euc-kr: ffe0(¢)=>ffe1(£)
euc-kr: ffe1(£)=>ffe2(¬)
euc-kr: ffe2(¬)=>ffe3( ̄)
euc-kr: ffe4(¦)=>ffe5(¥)
euc-kr: ffe5(¥)=>ffe6(₩)
Chrome's EUC-KR table does not have any one-way mapping in either direction. See https://code.google.com/p/chromium/codesearch#chromium/src/third_party/icu/source/data/mappings/euc-kr-html.ucm&q=euc&sq=package:chromium&l=1 A one-way mapping should have '|1' or '|3' at the end of its entry. BTW, GBK is not yet aligned with the spec.
Created attachment 1607 [details] encoding-only mappings found on IE 11 Most entries are bogus, but IE has one encoding-only (U+00A5 to 0x5c) for Japanese encodings and fullwidth-to-halfwidth mappings for ISO-2022-JP.
(In reply to Masatoshi Kimura from comment #13) > Created attachment 1607 [details] > encoding-only mappings found on IE 11 > > Most entries are bogus, but IE has one encoding-only (U+00A5 to 0x5c) for > Japanese encodings and fullwidth-to-halfwidth mappings for ISO-2022-JP. The current encoding spec (and Chrome's Shift_JIS) has two one-way mappings (fromUnicode): If code point is U+00A5, return byte 0x5C. If code point is U+203E, return byte 0x7E. ICU's default Shift_JIS (ibm-943) has 47 encoding-only mappings. Most of them are kanji, but several are symbols or punctuation such as wave dash (two of them are U+00A5 and U+203E).
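As a rough sketch of how these special cases sit in front of the table lookup, the following models the two existing one-way steps plus the U+2212 step proposed in this bug. The names `SPECIAL_CASES`, `TOY_TABLE` and `encode_code_point` are hypothetical, the table holds a single illustrative entry instead of the real index, and the other steps of the spec's encoder are omitted.

```python
# Encoding-only special cases checked before the index lookup.
SPECIAL_CASES = {
    0x00A5: b"\x5c",      # existing step: U+00A5 -> 0x5C
    0x203E: b"\x7e",      # existing step: U+203E -> 0x7E
    0x2212: b"\x81\x7c",  # proposed step: U+2212 -> 0x81 0x7C
}

# Hypothetical one-entry stand-in for the real index lookup.
TOY_TABLE = {0xFF0D: b"\x81\x7c"}


def encode_code_point(code_point: int):
    """Return the bytes for one code point, or None if this sketch cannot encode it."""
    if code_point <= 0x7F:            # ASCII passes through unchanged
        return bytes([code_point])
    if code_point in SPECIAL_CASES:   # one-way (fromUnicode only) mappings
        return SPECIAL_CASES[code_point]
    return TOY_TABLE.get(code_point)  # remaining encoder steps are omitted
```

For example, `encode_code_point(0x2212)` and `encode_code_point(0xFF0D)` give the same two bytes, while U+301C (wave dash) is unencodable in this sketch because no fallback for it is listed.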
Created attachment 1609 [details] Test script v3
Hmm, the result on Safari 8.0.6 cannot be attached here because it is over 1 MB. Which part do you need?
Please check if the following entries are present.
euc-jp: a2(¢)=>ffe0(¢)
euc-jp: a3(£)=>ffe1(£)
euc-jp: a5(¥)=>5c(\)
euc-jp: a6(¦)=>ffe4(¦)
euc-jp: ac(¬)=>ffe2(¬)
euc-jp: 2014(—)=>2015(―)
euc-jp: 2016(‖)=>2225(∥)
euc-jp: 203e(‾)=>7e(~)
euc-jp: 2212(−)=>ff0d(-)
euc-jp: 22ef(⋯)=>2026(…)
euc-jp: 301c(〜)=>ff5e(~)
iso-2022-jp: a2(¢)=>ffe0(¢)
iso-2022-jp: a3(£)=>ffe1(£)
iso-2022-jp: a5(¥)=>5c(\)
iso-2022-jp: a6(¦)=>ffe4(¦)
iso-2022-jp: ac(¬)=>ffe2(¬)
iso-2022-jp: 2014(—)=>2015(―)
iso-2022-jp: 2016(‖)=>2225(∥)
iso-2022-jp: 203e(‾)=>7e(~)
iso-2022-jp: 2212(−)=>ff0d(-)
iso-2022-jp: 22ef(⋯)=>2026(…)
iso-2022-jp: 301c(〜)=>ff5e(~)
shift_jis: a2(¢)=>ffe0(¢)
shift_jis: a3(£)=>ffe1(£)
shift_jis: a5(¥)=>5c(\)
shift_jis: a6(¦)=>ffe4(¦)
shift_jis: ac(¬)=>ffe2(¬)
shift_jis: 2014(—)=>2015(―)
shift_jis: 2016(‖)=>2225(∥)
shift_jis: 203e(‾)=>7e(~)
shift_jis: 2212(−)=>ff0d(-)
shift_jis: 22ef(⋯)=>2026(…)
shift_jis: 301c(〜)=>ff5e(~)
Created attachment 1610 [details] Test result on Safari 8.0.6 (Yosemite)
(In reply to Jungshik Shin from comment #14) > (In reply to Masatoshi Kimura from comment #13) > > Created attachment 1607 [details] > > encoding-only mappings found on IE 11 > > > > Most entries are bogus, but IE has one encoding-only (U+00A5 to 0x5c) for > > Japanese encodings and fullwidth-to-halfwidth mappings for ISO-2022-JP. > > The current encoding spec (and Chrome's Shift_JIS) has two one-way mapping > (fromUnicode): > > If code point is U+00A5, return byte 0x5C. > > If code point is U+203E, return byte 0x7E. > > ICU's default Shift_JIS (ibm-943) has 47 encoding-only mappings. Most of > them are Kanjis, but several of them are various symbols/punctuations like > wave dash (two of them are U+00A5 and U+203E)
In addition to the above two one-way mappings in the current encoding spec, ICU's default Shift_JIS has the following one-way mapping in the fromUnicode direction (those with '|1'). I'm excluding all the entries for Kanjis (about 40 of them).
<UFF5E> \x81\x60 |0
<U301C> \x81\x60 |1
<U2225> \x81\x61 |0
<U2016> \x81\x61 |1
<UFF0D> \x81\x7C |0
<U2212> \x81\x7C |1
<U2116> \x87\x82 |0
<UF86F> \x87\x82 |1
<UFFE4> \xFA\x55 |0
<U00A6> \xFA\x55 |1
The above list is a subset of what's listed in comment 17 for Safari's Shift_JIS. I don't know what webkit is doing. (they use ICU's default converter on Mac OS X/iOS, but hard-code some additional mappings to Webkit if they find it necessary)
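Entries in this form can be filtered mechanically. The sketch below parses the ten lines quoted in this comment and keeps only those flagged '|1' (fromUnicode-only fallbacks). The regular expression and the function names are illustrative assumptions, not part of ICU, and a real .ucm file has headers and other line types that this does not handle.

```python
import re

# The ten entries quoted in the comment, copied as raw text.
UCM_SAMPLE = r"""
<UFF5E> \x81\x60 |0
<U301C> \x81\x60 |1
<U2225> \x81\x61 |0
<U2016> \x81\x61 |1
<UFF0D> \x81\x7C |0
<U2212> \x81\x7C |1
<U2116> \x87\x82 |0
<UF86F> \x87\x82 |1
<UFFE4> \xFA\x55 |0
<U00A6> \xFA\x55 |1
"""

# <Uxxxx>, one or more \xHH byte escapes, then the |N precision flag.
ENTRY = re.compile(r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)")


def parse_ucm(text):
    """Return a list of (code_point, encoded_bytes, flag) tuples."""
    entries = []
    for match in ENTRY.finditer(text):
        code_point = int(match.group(1), 16)
        hex_pairs = re.findall(r"\\x([0-9A-Fa-f]{2})", match.group(2))
        encoded = bytes(int(pair, 16) for pair in hex_pairs)
        entries.append((code_point, encoded, int(match.group(3))))
    return entries


def encoding_only(entries):
    """Keep only the '|1' entries, i.e. fromUnicode-only fallbacks."""
    return {cp: raw for cp, raw, flag in entries if flag == 1}
```

Calling `encoding_only(parse_ucm(UCM_SAMPLE))` yields five code points, among them U+2212 mapped to the bytes 0x81 0x7C.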
tkent@ : I couldn't enter U+2212 on Mac OS 10.10 with Japanese IME - Hiragana, Romaji, Katakana (in both English UI and Japanese UI). Can you tell me how to enter U+2212? What I got is U+30FC (Hiragana-Katakana prolonged sound mark), which is rather strange. BTW, one of choices I got for '~' with 'Hiragana' is U+301C (Wave Dash), which is not included in the current table for Shift_JIS although there's a fallback mapping (encoding-only) in ICU's converter. (see the previous comment). Obviously, wave dash won't be used as often as U+2212 in postal code.
Created attachment 1611 [details] Test script using the href attribute This is much faster and more accurate, but does not work on IE/Edge. Looks like Chrome NFC-normalizes the href attribute. Is this a spec-compliant behavior?
Sorry for expanding the scope of this bug far beyond that of the initial report. We seem to have an agreement on mapping U+2212 to 0x81 0x7C in Shift_JIS. While we're at it, I propose that we do the same for EUC-JP and ISO-2022-JP (add a one-way encoding-only mapping for U+2212). ICU uses the Shift_JIS table for ISO-2022-JP, so if the Shift_JIS table is changed, ISO-2022-JP will get the change, too.
(In reply to Jungshik Shin from comment #20) > tkent@ : I couldn't enter U+2212 on Mac OS 10.10 with Japanese IME - > Hiragana, Romaji, Katakana (in both English UI and Japanese UI). Can you > tell me how to enter U+2212? Enable Hiragana with Kotoeri, or Hiragana with Google Japanese Input, then type "1-2-3". It puts U+FF11 U+2212 U+FF12 U+2212 U+FF13. > What I got is U+30FC (Hiragana-Katakana prolonged sound mark), which is > rather strange. > > BTW, one of choices I got for '~' with 'Hiragana' is U+301C (Wave Dash), > which is not included in the current table for Shift_JIS although there's a > fallback mapping (encoding-only) in ICU's converter. (see the previous > comment). Obviously, wave dash won't be used as often as U+2212 in postal > code. Other one-way mapping characters are not important at all. I have no idea of use cases in web forms. (In reply to Jungshik Shin from comment #22) > While we're at it, I propose that we do the same for EUC-JP and ISO-2022-JP. > (add an one-way encoding-only mapping for U+2212). It's very reasonable.
(In reply to Kent Tamura from comment #23) > (In reply to Jungshik Shin from comment #20) > > tkent@ : I couldn't enter U+2212 on Mac OS 10.10 with Japanese IME - > > Hiragana, Romaji, Katakana (in both English UI and Japanese UI). Can you > > tell me how to enter U+2212? > > Enable Hiragana with Kotoeri, or Hiragana with Google Japanese Input, then > type "1-2-3". It puts U+FF11 U+2212 U+FF12 U+2212 U+FF13. Thank you. I couldn't get U+2212 because I just tried typing '-' by itself and none of candidates was U+2212. In the context of '1-2-3', I do get U+2212 as well as full-width digits. > > What I got is U+30FC (Hiragana-Katakana prolonged sound mark), which is > > rather strange. > > > > BTW, one of choices I got for '~' with 'Hiragana' is U+301C (Wave Dash), > > which is not included in the current table for Shift_JIS although there's a > > fallback mapping (encoding-only) in ICU's converter. (see the previous > > comment). Obviously, wave dash won't be used as often as U+2212 in postal > > code. > > Other one-way mapping characters are not important at all. I have no idea > of use cases in web forms. Thank you for the answer. > (In reply to Jungshik Shin from comment #22) > > While we're at it, I propose that we do the same for EUC-JP and ISO-2022-JP. > > (add an one-way encoding-only mapping for U+2212). > > It's very reasonable. I'll go ahead adding a one-way mapping (encoding-only) for U+2212 to all the Japanese legacy encodings (ISO-2022-JP automatically gets it from Shift_JIS) in Chrome's copy of ICU. (all these tables for the encoding spec will be contributed back to ICU eventually along with the aliases table per spec.).
So, it seems shift_jis and euc-jp both emit 0x81 0x7C whereas iso-2022-jp emits 0x21 0x5D (in the jis0208 state) for U+2212 based on testing with <form>. Changing the specification to align with this seems rather trivial.
For EUC-JP, it'll be 0xA1 0xDD :-) For Shift_JIS and ISO-2022-JP, your comment 25 is right.
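The three byte sequences are consistent with a single table position. The sketch below assumes that U+FF0D sits at pointer 60 in the jis0208 index (a value derived here from the bytes quoted in this thread) and that each encoder turns a pointer into bytes with the arithmetic shown. The formulas are written from memory of the Encoding Standard and should be checked against it; the function names are made up.

```python
# Assumed jis0208 pointer for U+FF0D, the code point U+2212 is mapped onto.
POINTER_FF0D = 60


def shift_jis_bytes(pointer: int) -> bytes:
    """Pointer to Shift_JIS lead/trail bytes (188 trail slots per lead byte)."""
    lead, trail = divmod(pointer, 188)
    lead_offset = 0x81 if lead < 0x1F else 0xC1
    trail_offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead + lead_offset, trail + trail_offset])


def euc_jp_bytes(pointer: int) -> bytes:
    """Pointer to EUC-JP bytes (94 cells per row, both bytes offset by 0xA1)."""
    lead, trail = divmod(pointer, 94)
    return bytes([lead + 0xA1, trail + 0xA1])


def iso_2022_jp_bytes(pointer: int) -> bytes:
    """Pointer to ISO-2022-JP bytes in the jis0208 state (offset 0x21)."""
    lead, trail = divmod(pointer, 94)
    return bytes([lead + 0x21, trail + 0x21])
```

With these assumptions, pointer 60 gives 0x81 0x7C for Shift_JIS, 0xA1 0xDD for EUC-JP and 0x21 0x5D for ISO-2022-JP, matching the values reported above.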
Thank you, I must have made a mistake during testing since now I get the same.
Kent, I copied your name in kanji from the HTML Standard. Please let me know if you wish that to be changed in some manner. The way you appear in the acknowledgments is up to you. Also, thank you for your report. I used comment 3 to fix this in the simplest way possible. Initially I wanted to just emit the literal bytes, but actually changing the input code point was easier for iso-2022-jp. https://github.com/whatwg/encoding/commit/a7ab97e891773bd7a564b463c6a1cc31196a5bdd
(In reply to Anne from comment #28) > Kent, I copied your name in kanji from the HTML Standard. Please let me know > if you wish that to be changed in some manner. The way you appear in the > acknowledgments is up to you. > > Also, thank you for your report. I used comment 3 to fix this in the > simplest way possible. Initially I wanted to just emit the literal bytes, > but actually changing the input code point was easier for iso-2022-jp. > > https://github.com/whatwg/encoding/commit/ > a7ab97e891773bd7a564b463c6a1cc31196a5bdd This commit is erroneous. This bug is about U+2212 but the commit has U+2022. I filed https://github.com/whatwg/encoding/issues/21 on that.