28661 – U+2212 in shift_jis encoder

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: U+2212 in shift_jis encoder

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Reported:	2015-05-20 06:11 UTC by Kent Tamura
Modified:	2015-12-10 19:52 UTC
CC List:	7 users

See Also:

Attachments
encoding-only mappings found on Firefox (181.15 KB, text/plain) 2015-05-27 22:05 UTC, Masatoshi Kimura
encoding-only mappings found on Chrome (32.28 KB, text/plain) 2015-05-27 22:06 UTC, Masatoshi Kimura
Test script (2.77 KB, text/html) 2015-05-27 22:09 UTC, Masatoshi Kimura
Test script (won't work on IE) (2.61 KB, text/html) 2015-05-27 22:26 UTC, Masatoshi Kimura
encoding-only mappings found on IE 11 (56.62 KB, text/plain) 2015-05-28 10:47 UTC, Masatoshi Kimura
Test script v3 (2.92 KB, text/html) 2015-05-28 23:10 UTC, Masatoshi Kimura
Test result on Safari 8.0.6 (Yosemite) (769 bytes, text/plain) 2015-05-29 14:13 UTC, Masayuki Nakano
Test script using the href attribute (5.84 KB, text/html) 2015-05-30 07:56 UTC, Masatoshi Kimura

Description Kent Tamura 2015-05-20 06:11:52 UTC

https://encoding.spec.whatwg.org/#shift_jis-encoder

Proposal:  Adding the following step between step 4 and 5 of the shift_jis encoder:

* If code point is U+2212, return two bytes: 0x81 and 0x7C.
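
A minimal sketch of the proposed step, in Python. It assumes that Python's built-in "cp932" codec can stand in for the spec's jis0208 index for this one character (it maps U+FF0D to 0x81 0x7C). The function name `encode_with_minus_fix` is hypothetical and is not taken from the spec or from any browser.

```python
# Sketch only: treat U+2212 as U+FF0D before the table lookup.
# Assumption: the "cp932" codec approximates the spec's jis0208 index
# for U+FF0D; it is not the spec's shift_jis encoder.
MINUS_SIGN = "\u2212"              # what the Mac input method produces for "-"
FULLWIDTH_HYPHEN_MINUS = "\uff0d"  # what the Windows input method produces


def encode_with_minus_fix(text: str) -> bytes:
    """Encode text so that U+2212 yields the same bytes as U+FF0D."""
    normalized = text.replace(MINUS_SIGN, FULLWIDTH_HYPHEN_MINUS)
    return normalized.encode("cp932")


print(encode_with_minus_fix(MINUS_SIGN).hex())
```

Replacing the code point before the lookup, rather than returning literal bytes, keeps the special case in one place if the same index is shared by several encoders.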


See crbug.com/425417 .
When a Japanese input method is enabled, pressing the "-" key produces U+2212 on Mac and U+FF0D on Windows.  Browsers using the ICU mapping convert U+2212 to the same bytes as U+FF0D, so we had no issues.  Google Chrome switched to the strict mapping specified by the Encoding API specification, and that caused problems.
"-" is very frequently used by Japanese users because it appears in postal addresses.

Comment 1 Anne 2015-05-23 01:44:42 UTC

Simon, Henri, opinions?

I'm inclined to add this given it's rather trivial.

Comment 2 Simon Montagu 2015-05-24 19:20:13 UTC

So the consequence would be that Shift_JIS 0x817c would always decode as U+FF0B, but both U+FF0B and U+2212 would be encoded as 0x817C? And as a corollary, U+2212 wouldn't round-trip?

Comment 3 Kent Tamura 2015-05-25 02:07:10 UTC

(In reply to Simon Montagu from comment #2)
> So the consequence would be that Shift_JIS 0x817c would always decode as
> U+FF0B, but both U+FF0B and U+2212 would be encoded as 0x817C? And as a
> corollary, U+2212 wouldn't round-trip?

That's right.
(FF0B -> FF0D)
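
The round-trip consequence confirmed here can be checked with a toy table. This is a sketch under the assumption that only the two code points and the one byte pair discussed in this bug matter; `ENCODE`, `DECODE` and `round_trips` are illustrative names, not part of any real encoder.

```python
# Toy tables covering only the byte pair discussed in this bug.
ENCODE = {
    "\uff0d": b"\x81\x7c",  # FULLWIDTH HYPHEN-MINUS, two-way mapping
    "\u2212": b"\x81\x7c",  # MINUS SIGN, proposed encoding-only mapping
}
DECODE = {
    b"\x81\x7c": "\uff0d",  # the decoder always yields U+FF0D
}


def round_trips(ch: str) -> bool:
    """Return True if encoding then decoding gives back the same character."""
    return DECODE[ENCODE[ch]] == ch


for ch in ("\uff0d", "\u2212"):
    print(f"U+{ord(ch):04X} round-trips: {round_trips(ch)}")
```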

Comment 4 Masatoshi Kimura 2015-05-25 10:56:50 UTC

Firefox also has the non-round-trip mapping (U+2212 to 0x817C).
Internet Explorer and Microsoft Edge don't, but they will never receive U+2212 from Japanese input method because they are Windows only.
I agree with adding this mapping to the spec.

Comment 5 Jungshik Shin 2015-05-26 18:54:16 UTC

I also support this change and will make a change to Chrome's ICU quickly.

BTW, virtually all the encoding-only mappings (fromUnicode direction) were dropped in the encoding spec probably under the assumption that emitting NCRs would work with most servers. This is an example where that assumption breaks down. 

We need to review other cases of 'dropped encoding-only mappings'.

Comment 6 Jungshik Shin 2015-05-26 18:55:53 UTC

(In reply to Masatoshi Kimura from comment #4)
> Firefox also has the non-round-trip mapping (U+2212 to 0x817C).

What other entries does Firefox have for 'encoding-only mapping' (fromUnicode) in Shift_JIS and other encodings?

Comment 7 Henri Sivonen 2015-05-27 13:47:10 UTC

(In reply to Anne from comment #1)
> Simon, Henri, opinions?

I'm not competent to have an opinion on this.

Comment 8 Masatoshi Kimura 2015-05-27 22:05:38 UTC

Created attachment 1603 [details]
encoding-only mappings found on Firefox

(In reply to Jungshik Shin from comment #6)
> What other entries does Firefox have for 'encoding-only mapping'
> (fromUnicode) in Shift_JIS and other encodings?

Looks like Firefox has some bugs in gbk/gb18030 encoder.
UAO has many "fallback" mappings, so does Big5-UAO (the Firefox Big5 variant).
Shift_JIS and EUC-JP have 11 encoding-only mappings to compensate for the difference between MS mappings and JIS mappings.
ISO-2022-JP has halfwidth-to-fullwidth katakana conversion as well as 11 mappings.
Only multibyte encodings have encoding-only mapping on Firefox.

Comment 9 Masatoshi Kimura 2015-05-27 22:06:55 UTC

Created attachment 1604 [details]
encoding-only mappings found on Chrome

Looks like Chrome also has some bugs...

Comment 10 Masatoshi Kimura 2015-05-27 22:09:44 UTC

Created attachment 1605 [details]
Test script

This script was too slow on Internet Explorer for me to wait for the result.
Please run this on Safari and attach the result.

Comment 11 Masatoshi Kimura 2015-05-27 22:26:15 UTC

Created attachment 1606 [details]
Test script (won't work on IE)

The previous script didn't work well with ISO-2022-JP. Please use this instead.

Comment 12 Jungshik Shin 2015-05-28 05:11:31 UTC

(In reply to Masatoshi Kimura from comment #9)
> Created attachment 1604 [details]
> encoding-only mappings found on Chrome
> 
> Looks like Chrome also has some bugs...

These and a number of other entries do not make any sense. (I'm aware that your test script generated them.)

euc-kr: ffe0(￠)=>ffe1(￡)
euc-kr: ffe1(￡)=>ffe2(￢)
euc-kr: ffe2(￢)=>ffe3(￣)
euc-kr: ffe4(￤)=>ffe5(￥)
euc-kr: ffe5(￥)=>ffe6(￦)

Chrome's EUC-KR table does not have any one-way mapping in either direction. See
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/icu/source/data/mappings/euc-kr-html.ucm&q=euc&sq=package:chromium&l=1  

A one-way mapping should have '|1' or '|3' at the end of its entry.

BTW, GBK is not yet aligned with the spec.

Comment 13 Masatoshi Kimura 2015-05-28 10:47:45 UTC

Created attachment 1607 [details]
encoding-only mappings found on IE 11

Most entries are bogus, but IE has one encoding-only (U+00A5 to 0x5c) for Japanese encodings and fullwidth-to-halfwidth mappings for ISO-2022-JP.

Comment 14 Jungshik Shin 2015-05-28 22:31:17 UTC

(In reply to Masatoshi Kimura from comment #13)
> Created attachment 1607 [details]
> encoding-only mappings found on IE 11
> 
> Most entries are bogus, but IE has one encoding-only (U+00A5 to 0x5c) for
> Japanese encodings and fullwidth-to-halfwidth mappings for ISO-2022-JP.

The current encoding spec (and Chrome's Shift_JIS) has two one-way mappings (fromUnicode):

If code point is U+00A5, return byte 0x5C.

If code point is U+203E, return byte 0x7E.

ICU's default Shift_JIS (ibm-943) has 47 encoding-only mappings. Most of them are kanji, but several are symbols and punctuation marks such as the wave dash (two of them are U+00A5 and U+203E).
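
The two spec mappings quoted above could be sketched as follows. The helper name `encode_single_byte` is hypothetical, and the ASCII pass-through branch is an assumption about the surrounding encoder steps rather than something stated in this comment.

```python
from typing import Optional


def encode_single_byte(code_point: int) -> Optional[bytes]:
    """Return the single-byte encoding, or None if a table lookup is needed."""
    if code_point <= 0x7F:
        # Assumed ASCII pass-through (not stated in this comment).
        return bytes([code_point])
    if code_point == 0x00A5:  # YEN SIGN, one-way mapping
        return b"\x5c"
    if code_point == 0x203E:  # OVERLINE, one-way mapping
        return b"\x7e"
    return None


# Both mappings are encoding-only: byte 0x5C is also what "\" encodes to,
# so decoding that byte cannot tell the two code points apart.
print(encode_single_byte(0x00A5) == encode_single_byte(ord("\\")))
```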

Comment 15 Masatoshi Kimura 2015-05-28 23:10:59 UTC

Created attachment 1609 [details]
Test script v3

Comment 16 Masayuki Nakano 2015-05-29 06:35:11 UTC

Hmm, the result on Safari 8.0.6 cannot be attached here because it is over 1 MB. Which part do you need?

Comment 17 Masatoshi Kimura 2015-05-29 10:34:27 UTC

Please check if the following entries are present.
euc-jp: a2(¢)=>ffe0(￠)
euc-jp: a3(£)=>ffe1(￡)
euc-jp: a5(¥)=>5c(\)
euc-jp: a6(¦)=>ffe4(￤)
euc-jp: ac(¬)=>ffe2(￢)
euc-jp: 2014(—)=>2015(―)
euc-jp: 2016(‖)=>2225(∥)
euc-jp: 203e(‾)=>7e(~)
euc-jp: 2212(−)=>ff0d(－)
euc-jp: 22ef(⋯)=>2026(…)
euc-jp: 301c(〜)=>ff5e(～)
iso-2022-jp: a2(¢)=>ffe0(￠)
iso-2022-jp: a3(£)=>ffe1(￡)
iso-2022-jp: a5(¥)=>5c(\)
iso-2022-jp: a6(¦)=>ffe4(￤)
iso-2022-jp: ac(¬)=>ffe2(￢)
iso-2022-jp: 2014(—)=>2015(―)
iso-2022-jp: 2016(‖)=>2225(∥)
iso-2022-jp: 203e(‾)=>7e(~)
iso-2022-jp: 2212(−)=>ff0d(－)
iso-2022-jp: 22ef(⋯)=>2026(…)
iso-2022-jp: 301c(〜)=>ff5e(～)
shift_jis: a2(¢)=>ffe0(￠)
shift_jis: a3(£)=>ffe1(￡)
shift_jis: a5(¥)=>5c(\)
shift_jis: a6(¦)=>ffe4(￤)
shift_jis: ac(¬)=>ffe2(￢)
shift_jis: 2014(—)=>2015(―)
shift_jis: 2016(‖)=>2225(∥)
shift_jis: 203e(‾)=>7e(~)
shift_jis: 2212(−)=>ff0d(－)
shift_jis: 22ef(⋯)=>2026(…)
shift_jis: 301c(〜)=>ff5e(～)

Comment 18 Masayuki Nakano 2015-05-29 14:13:08 UTC

Created attachment 1610 [details]
Test result on Safari 8.0.6 (Yosemite)

Comment 19 Jungshik Shin 2015-05-29 18:00:57 UTC

(In reply to Jungshik Shin from comment #14)
> (In reply to Masatoshi Kimura from comment #13)
> > Created attachment 1607 [details]
> > encoding-only mappings found on IE 11
> > 
> > Most entries are bogus, but IE has one encoding-only (U+00A5 to 0x5c) for
> > Japanese encodings and fullwidth-to-halfwidth mappings for ISO-2022-JP.
> 
> The current encoding spec (and Chrome's Shift_JIS) has two one-way mapping
> (fromUnicode):
> 
> If code point is U+00A5, return byte 0x5C.
> 
> If code point is U+203E, return byte 0x7E.
> 
> ICU's default Shift_JIS (ibm-943) has 47 encoding-only mappings. Most of
> them are Kanjis, but several of them are various symbols/punctuations like
> wave dash (two of them are U+00A5 and U+203E)


In addition to the above two one-way mappings in the current encoding spec, ICU's default Shift_JIS has the following one-way mappings in the fromUnicode direction (those with '|1'). I'm excluding all the entries for kanji (about 40 of them).

<UFF5E> \x81\x60 |0
<U301C> \x81\x60 |1
<U2225> \x81\x61 |0
<U2016> \x81\x61 |1
<UFF0D> \x81\x7C |0
<U2212> \x81\x7C |1
<U2116> \x87\x82 |0
<UF86F> \x87\x82 |1
<UFFE4> \xFA\x55 |0
<U00A6> \xFA\x55 |1

The above list is a subset of what's listed in comment 17 for Safari's Shift_JIS. I don't know what WebKit is doing. (They use ICU's default converter on Mac OS X/iOS, but hard-code some additional mappings into WebKit if they find it necessary.)
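
For readers who want to pair each '|1' entry in the list above with its '|0' counterpart, here is a small sketch. It assumes the entries have exactly the three-column shape shown in this comment; `parse_ucm` and `fallback_pairs` are hypothetical helper names, not ICU APIs.

```python
import re

# The entries quoted in this comment, copied verbatim.
UCM_LINES = r"""
<UFF5E> \x81\x60 |0
<U301C> \x81\x60 |1
<U2225> \x81\x61 |0
<U2016> \x81\x61 |1
<UFF0D> \x81\x7C |0
<U2212> \x81\x7C |1
<U2116> \x87\x82 |0
<UF86F> \x87\x82 |1
<UFFE4> \xFA\x55 |0
<U00A6> \xFA\x55 |1
"""

ENTRY = re.compile(r"<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)")


def parse_ucm(text):
    """Yield (code_point, raw_bytes, flag) for each mapping line."""
    for match in ENTRY.finditer(text):
        code_point = int(match.group(1), 16)
        raw = bytes.fromhex(match.group(2).replace("\\x", ""))
        yield code_point, raw, int(match.group(3))


def fallback_pairs(text):
    """Map each '|1' code point to (bytes, round-trip code point or None)."""
    entries = list(parse_ucm(text))
    roundtrip = {raw: cp for cp, raw, flag in entries if flag == 0}
    return {cp: (raw, roundtrip.get(raw)) for cp, raw, flag in entries if flag == 1}


for cp, (raw, target) in sorted(fallback_pairs(UCM_LINES).items()):
    print(f"U+{cp:04X} -> {raw.hex()} (decodes as U+{target:04X})")
```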

Comment 20 Jungshik Shin 2015-05-30 00:26:23 UTC

tkent@ : I couldn't enter U+2212 on Mac OS 10.10 with Japanese IME - Hiragana, Romaji, Katakana (in both English UI and Japanese UI). Can you tell me how to enter U+2212? 

What I got is U+30FC (Hiragana-Katakana prolonged sound mark), which is rather strange. 

BTW, one of the choices I got for '~' with 'Hiragana' is U+301C (Wave Dash), which is not included in the current table for Shift_JIS, although there's a fallback (encoding-only) mapping in ICU's converter (see the previous comment). Obviously, the wave dash won't be used as often as U+2212 in postal codes.

Comment 21 Masatoshi Kimura 2015-05-30 07:56:19 UTC

Created attachment 1611 [details]
Test script using the href attribute

This is much faster and more accurate, but does not work on IE/Edge.
Looks like Chrome NFC-normalizes the href attribute. Is this a spec-compliant behavior?

Comment 22 Jungshik Shin 2015-06-01 22:59:55 UTC

Sorry for expanding the scope of this bug far beyond that of the initial report. 

We seem to have an agreement on mapping U+2212 to 0x81 0x7C in Shift_JIS. 

While we're at it, I propose that we do the same for EUC-JP and ISO-2022-JP (add a one-way encoding-only mapping for U+2212).

ICU uses the Shift_JIS table for ISO-2022-JP, so if the Shift_JIS table is changed, ISO-2022-JP will get the change, too.

Comment 23 Kent Tamura 2015-06-02 08:31:59 UTC

(In reply to Jungshik Shin from comment #20)
> tkent@ : I couldn't enter U+2212 on Mac OS 10.10 with Japanese IME -
> Hiragana, Romaji, Katakana (in both English UI and Japanese UI). Can you
> tell me how to enter U+2212? 

Enable Hiragana with Kotoeri, or Hiragana with Google Japanese Input, then type "1-2-3".  This produces U+FF11 U+2212 U+FF12 U+2212 U+FF13.


> What I got is U+30FC (Hiragana-Katakana prolonged sound mark), which is
> rather strange. 
> 
> BTW, one of choices I got for '~' with 'Hiragana' is  U+301C (Wave Dash),
> which is not included in the current table for Shift_JIS although there's a
> fallback mapping (encoding-only) in ICU's converter. (see the previous
> comment).  Obviously, wave dash won't be used as often as U+2212 in postal
> code.

The other one-way mapping characters are not important at all.  I can't think of any use cases for them in web forms.

(In reply to Jungshik Shin from comment #22)
> While we're at it, I propose that we do the same for EUC-JP and ISO-2022-JP.
> (add an one-way encoding-only mapping for U+2212). 

It's very reasonable.

Comment 24 Jungshik Shin 2015-06-02 17:27:36 UTC

(In reply to Kent Tamura from comment #23)
> (In reply to Jungshik Shin from comment #20)
> > tkent@ : I couldn't enter U+2212 on Mac OS 10.10 with Japanese IME -
> > Hiragana, Romaji, Katakana (in both English UI and Japanese UI). Can you
> > tell me how to enter U+2212? 
> 
> Enable Hiragana with Kotoeri, or Hiragana with Google Japanese Input, then
> type "1-2-3".  It puts U+FF11 U+2212 U+FF12 U+2212 U+FF13.

Thank you. I couldn't get U+2212 because I just tried typing '-' by itself and none of the candidates was U+2212. In the context of '1-2-3', I do get U+2212 as well as full-width digits.
 
> > What I got is U+30FC (Hiragana-Katakana prolonged sound mark), which is
> > rather strange. 
> > 
> > BTW, one of choices I got for '~' with 'Hiragana' is  U+301C (Wave Dash),
> > which is not included in the current table for Shift_JIS although there's a
> > fallback mapping (encoding-only) in ICU's converter. (see the previous
> > comment).  Obviously, wave dash won't be used as often as U+2212 in postal
> > code.
> 
> Other one-way mapping characters are not important at all.  I have no idea
> of use cases in web forms.


Thank you for the answer. 
 
> (In reply to Jungshik Shin from comment #22)
> > While we're at it, I propose that we do the same for EUC-JP and ISO-2022-JP.
> > (add an one-way encoding-only mapping for U+2212). 
> 
> It's very reasonable.

I'll go ahead and add a one-way (encoding-only) mapping for U+2212 to all the Japanese legacy encodings in Chrome's copy of ICU (ISO-2022-JP automatically gets it from Shift_JIS). All these tables for the encoding spec will eventually be contributed back to ICU, along with the aliases table per spec.

Comment 25 Anne 2015-08-19 16:03:30 UTC

So, it seems shift_jis and euc-jp both emit

  0x81 0x7C

whereas iso-2022-jp emits

  0x21 0x5D (in the jis0208 state)

for U+2212 based on testing with <form>. Changing the specification to align with this seems rather trivial.

Comment 26 Jungshik Shin 2015-08-19 16:54:22 UTC

For EUC-JP, it'll be 0xA1 0xDD :-) 
For Shift_JIS and ISO-2022-JP, your comment 25 is right.
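
The three byte pairs in this exchange all follow from one table position. The sketch below assumes that U+FF0D sits at pointer 60 in the spec's jis0208 index (row 1, cell 61) and uses the usual lead/trail arithmetic for each encoding. The function names are hypothetical, and the ISO-2022-JP escape sequence that switches into the jis0208 state is left out.

```python
# Assumed position of U+FF0D in the jis0208 index (row 1, cell 61).
POINTER_FF0D = 60


def shift_jis_bytes(pointer: int) -> bytes:
    """Two-byte Shift_JIS form of a jis0208 pointer."""
    lead, trail = divmod(pointer, 188)
    lead_offset = 0x81 if lead < 0x1F else 0xC1
    trail_offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead + lead_offset, trail + trail_offset])


def euc_jp_bytes(pointer: int) -> bytes:
    """Two-byte EUC-JP form of a jis0208 pointer."""
    lead, trail = divmod(pointer, 94)
    return bytes([lead + 0xA1, trail + 0xA1])


def iso_2022_jp_bytes(pointer: int) -> bytes:
    """Two bytes emitted in the jis0208 state (escape sequence omitted)."""
    lead, trail = divmod(pointer, 94)
    return bytes([lead + 0x21, trail + 0x21])


for name, fn in (("shift_jis", shift_jis_bytes),
                 ("euc-jp", euc_jp_bytes),
                 ("iso-2022-jp", iso_2022_jp_bytes)):
    print(name, fn(POINTER_FF0D).hex())
```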

Comment 27 Anne 2015-08-19 16:56:37 UTC

Thank you, I must have made a mistake during testing since now I get the same.

Comment 28 Anne 2015-08-20 06:59:33 UTC

Kent, I copied your name in kanji from the HTML Standard. Please let me know if you wish that to be changed in some manner. The way you appear in the acknowledgments is up to you.

Also, thank you for your report. I used comment 3 to fix this in the simplest way possible. Initially I wanted to just emit the literal bytes, but actually changing the input code point was easier for iso-2022-jp.

https://github.com/whatwg/encoding/commit/a7ab97e891773bd7a564b463c6a1cc31196a5bdd

Comment 29 Jungshik Shin 2015-12-10 19:52:00 UTC

(In reply to Anne from comment #28)
> Kent, I copied your name in kanji from the HTML Standard. Please let me know
> if you wish that to be changed in some manner. The way you appear in the
> acknowledgments is up to you.
> 
> Also, thank you for your report. I used comment 3 to fix this in the
> simplest way possible. Initially I wanted to just emit the literal bytes,
> but actually changing the input code point was easier for iso-2022-jp.
> 
> https://github.com/whatwg/encoding/commit/
> a7ab97e891773bd7a564b463c6a1cc31196a5bdd

This commit is erroneous. This bug is about U+2212 but the commit has U+2022. 
I filed https://github.com/whatwg/encoding/issues/21 on that.