[Bug 28141] New: treatment of invalid 2-byte sequence is different in different CJK encodings

https://www.w3.org/Bugs/Public/show_bug.cgi?id=28141

            Bug ID: 28141
           Summary: treatment of invalid 2-byte sequence is different in
                    different CJK encodings
           Product: WHATWG
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Encoding
          Assignee: annevk@annevk.nl
          Reporter: jshin@chromium.org
        QA Contact: sideshowbarker+encodingspec@gmail.com
                CC: mike@w3.org, www-international@w3.org

Per bug 16691 comment 15, I'm tightening Blink's encoding tables for CJK
encodings to handle unmappable 2-byte sequences in a safe manner.



The current spec has the following provisions after looking up |pointer|:

* EUC-KR decoder
   If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to
   stream.

* Big5 decoder
   If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to
   stream.

* Shift_JIS decoder
   If pointer is null, prepend byte to stream.

* EUC-JP decoder
   If byte is not in the range 0xA1 to 0xFE, prepend byte to stream.

* GB18030 decoder
   If pointer is null, prepend byte to stream.
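
To make the differences easier to compare, the provisions listed above can be
written as predicates on the second byte. This is only a rough sketch: the
names PREPEND_RULES and second_byte_is_reprocessed are hypothetical, and it
assumes each rule is evaluated only after the lookup has failed (the EUC-JP
wording quoted above does not mention pointer).

```python
# Hypothetical model of the provisions quoted above, not the spec's own text.
# ASSUMPTION: each rule is consulted only after the pointer lookup has failed.

def is_ascii(byte):
    return 0x00 <= byte <= 0x7F

# Each rule answers: is the second byte prepended to the stream, so that it
# is examined again as the start of the next character?
PREPEND_RULES = {
    "EUC-KR": is_ascii,
    "Big5": is_ascii,
    "Shift_JIS": lambda byte: True,
    "EUC-JP": lambda byte: not (0xA1 <= byte <= 0xFE),
    "GB18030": lambda byte: True,
}

def second_byte_is_reprocessed(encoding, byte):
    return PREPEND_RULES[encoding](byte)

# Compare the decoders for the non-ASCII second byte 0xE0.
for name in PREPEND_RULES:
    print(name, second_byte_is_reprocessed(name, 0xE0))
```

Under this model only Shift_JIS and GB18030 put a non-ASCII second byte such
as 0xE0 back into the stream.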

For now, let's put aside EUC-JP and GB18030. 

I don't see a reason to make the Shift_JIS (SJIS) decoder behave differently
from the EUC-KR and Big5 decoders. One possible reason may be that, in SJIS, a
byte in the range 0xA1 to 0xDF is a character by itself (a half-width
katakana).

In SJIS, "0xFC 0xE0" [1] is turned into U+FFFD, but the second byte (0xE0)
is then reused as the lead byte of what follows.

In EUC-KR, "0xFE 0xE0" is turned into U+FFFD and the next lead byte is taken
from the byte *after* 0xE0.

Shouldn't we change the part of the SJIS decoder quoted above to the following?

  If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to
  stream.
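
The two examples and the proposed change can be traced with a toy loop. This is
a minimal sketch under stated assumptions, not the spec's algorithm: the
function lead_positions and its helpers are hypothetical names, the lead-byte
ranges are the ones I understand the Encoding Standard to use, and every
lead+trail pair in the input is assumed to fail the lookup (as the report says
of "0xFC 0xE0" and "0xFE 0xE0"); no real index table is consulted.

```python
def is_ascii(byte):
    return 0x00 <= byte <= 0x7F

def always(byte):
    return True

def sjis_lead(byte):
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC

def euckr_lead(byte):
    return 0x81 <= byte <= 0xFE

def lead_positions(data, is_lead, prepend_trail):
    """Return the indexes that a toy decoder treats as lead bytes.

    ASSUMPTION: every lead+trail pair in `data` fails the index lookup
    (pointer is null), so only resynchronization is modeled.
    """
    leads = []
    i = 0
    while i < len(data):
        if not is_lead(data[i]):
            i += 1  # single byte, not modeled further
            continue
        leads.append(i)
        if i + 1 == len(data):
            break  # lead byte at end of input
        trail = data[i + 1]
        # Prepending the trail byte means it is examined again as a fresh byte.
        i += 1 if prepend_trail(trail) else 2
    return leads

print(lead_positions(bytes([0xFC, 0xE0]), sjis_lead, always))     # current SJIS rule
print(lead_positions(bytes([0xFC, 0xE0]), sjis_lead, is_ascii))   # proposed SJIS rule
print(lead_positions(bytes([0xFE, 0xE0]), euckr_lead, is_ascii))  # EUC-KR rule
```

With the current SJIS rule the byte at index 1 (0xE0) is picked up as a second
lead byte; with the proposed rule, as with EUC-KR, only index 0 is.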


Received on Wednesday, 4 March 2015 23:33:26 UTC