ISSUE-455: treatment of invalid 2-byte sequence is different in different CJK encodings
- State:
- CLOSED
- Product:
- encoding
- Raised by:
- Richard Ishida
- Opened on:
- 2015-03-30
- Description:
- https://www.w3.org/Bugs/Public/show_bug.cgi?id=28141
This issue tracks the bug listed above and was created as part of the WG CR process.
---
Reporter: jshin@chromium.org
Per bug 16691 comment 15, I'm tightening Blink's encoding tables for CJK
encodings to handle unmappable 2-byte sequences in a safe manner.
The current spec has the following provisions, applied after looking up |pointer|.
* EUC-KR decoder
If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to
stream.
* Big5 decoder
If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to
stream.
* Shift_JIS decoder
If pointer is null, prepend byte to stream.
* EUC-JP decoder
If byte is not in the range 0xA1 to 0xFE, prepend byte to stream.
* GB18030 decoder
If pointer is null, prepend byte to stream.
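The five provisions above can be compared side by side as predicates. This is only a sketch transcribed from the quoted text, not the spec's algorithm: the names `PREPEND_RULES`, `is_ascii`, and `prepends` are hypothetical, and the EUC-JP rule is written exactly as quoted, without a pointer condition.

```python
# When does each decoder push the second byte back onto the stream?
# Transcribed from the provisions quoted in this report; all names here
# are illustrative and do not come from the spec or from Blink.

def is_ascii(byte):
    return 0x00 <= byte <= 0x7F

PREPEND_RULES = {
    "EUC-KR": lambda pointer_is_null, byte: pointer_is_null and is_ascii(byte),
    "Big5": lambda pointer_is_null, byte: pointer_is_null and is_ascii(byte),
    "Shift_JIS": lambda pointer_is_null, byte: pointer_is_null,
    # As quoted, this rule does not mention pointer at all.
    "EUC-JP": lambda pointer_is_null, byte: not (0xA1 <= byte <= 0xFE),
    "GB18030": lambda pointer_is_null, byte: pointer_is_null,
}

def prepends(encoding, pointer_is_null, byte):
    """Return True if the quoted rule for `encoding` prepends `byte`."""
    return PREPEND_RULES[encoding](pointer_is_null, byte)

if __name__ == "__main__":
    # Second byte 0xE0 after a failed lookup: only some decoders re-read it.
    for name in PREPEND_RULES:
        print(name, prepends(name, True, 0xE0))
```

With a null pointer and a non-ASCII second byte such as 0xE0, the Shift_JIS and GB18030 rules prepend the byte, while the EUC-KR and Big5 rules consume it.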
For now, let's put aside EUC-JP and GB18030.
I don't see a reason to make the SJIS decoder behave differently from the
EUC-KR and Big5 decoders. One possible reason may be that a byte in the range
0xA1 to 0xDF is a character by itself.
In SJIS, "0xFC 0xE0" [1] is turned into U+FFFD, but the second byte (0xE0)
becomes the lead byte of what follows.
In EUC-KR, "0xFE 0xE0" is turned into U+FFFD and the next lead byte is taken
from the byte *after* 0xE0.
Shouldn't we change the part of the SJIS decoder quoted above to the following?
If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to
stream.
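The difference between the current and the proposed Shift_JIS rule can be seen in a toy decoder. This is a minimal sketch under stated assumptions, not the spec's decoder: `decode`, `current_rule`, and `proposed_rule` are hypothetical names, the one-entry `TOY_INDEX` is a made-up mapping standing in for the real jis0208 index, and single-byte katakana (0xA1 to 0xDF) are not modelled.

```python
REPLACEMENT = "\uFFFD"

def is_sjis_lead(byte):
    # Shift_JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC

def decode(data, index, prepend_if):
    """Toy two-byte decoder that only models the failed-lookup path.

    `index` maps (lead, trail) pairs to characters; `prepend_if(trail)`
    decides whether the trail byte is pushed back after a failed lookup.
    """
    out = []
    pos = 0
    while pos < len(data):
        lead = data[pos]
        pos += 1
        if lead <= 0x7F:
            out.append(chr(lead))
            continue
        if not is_sjis_lead(lead) or pos == len(data):
            # Simplification: every other byte, or a truncated pair, is an error.
            out.append(REPLACEMENT)
            continue
        trail = data[pos]
        pos += 1
        char = index.get((lead, trail))
        if char is not None:
            out.append(char)
            continue
        out.append(REPLACEMENT)
        if prepend_if(trail):
            pos -= 1  # "prepend byte to stream": read the trail byte again
    return "".join(out)

def current_rule(trail):
    return True            # "If pointer is null, prepend byte to stream."

def proposed_rule(trail):
    return trail <= 0x7F   # prepend only bytes in the range 0x00 to 0x7F

# Made-up mapping for illustration only; not taken from the real index.
TOY_INDEX = {(0xE0, 0x41): "\u6F22"}

if __name__ == "__main__":
    data = bytes([0xFC, 0xE0, 0x41])
    print(ascii(decode(data, TOY_INDEX, current_rule)))
    print(ascii(decode(data, TOY_INDEX, proposed_rule)))
```

Under the current rule, 0xE0 is re-read as a lead byte and pairs with the following 0x41; under the proposed rule, 0xE0 is consumed together with 0xFC and the 0x41 is decoded on its own as "A".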
- Related Action Items:
- No related actions
- Related emails:
- I18N-ISSUE-455 (BUG28141): treatment of invalid 2-byte sequence is different in different CJK encodings [encoding] (from sysbot+tracker@w3.org on 2015-03-30)
Related notes:
These issues are now tracked at http://www.w3.org/International/docs/encoding/encoding-cr-doc
Richard Ishida, 16 Sep 2015, 11:41:17