16862 – Indexes: GB18030 and Microsoft encodings should support PUA code points

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	All Windows 3.1

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Duplicates (2):	16697 21145
Blocks:	24130

Reported:	2012-04-25 23:24 UTC by Masatoshi Kimura
Modified:	2013-12-18 15:55 UTC
CC List:	5 users

Description Masatoshi Kimura 2012-04-25 23:24:50 UTC

Unlike in other encodings, the PUA mappings of gbk and gb18030 are part of the standard.
They are defined on purpose to ensure round-trip conversion between the entire gbk 2-byte code range and Unicode code points.

Comment 1 Anne 2012-04-26 06:30:19 UTC

There's also bug 16697.

Comment 2 pub-w3 2012-05-06 22:41:55 UTC

PUA characters have been used to allow round-trip conversion to/from Unicode of explicitly undefined positions in a number of legacy encodings.  Why is this of particular importance for GBK or GB18030?

PUA characters have also been used in GBK to encode specific characters missing from Unicode at the time, so not all PUA mappings are there just to allow round-trip conversion.

Comment 3 Masatoshi Kimura 2012-05-06 23:24:01 UTC

(In reply to comment #2)
> PUA characters have been used to allow round-trip conversion to/from Unicode of
> explicitly undefined positions in a number of legacy encodings.  Why is this of
> particular importance for GBK or GB18030?
You're right. I think we should add PUA mappings for all Microsoft encodings rather than inventing many new similar but different encodings.
Regarding GB18030, PUA mappings are explicitly defined as part of the GB18030:2000/2005 national standard. Those are NOT undefined positions. And GB18030 is supposed to cover all Unicode code points except isolated surrogates.
Why is U+E865 included in the Encoding Standard's gb18030 decoder while U+E864 is not? That is quite inconsistent.

Comment 4 Masatoshi Kimura 2012-05-06 23:35:32 UTC

PUA code point mappings for Windows encodings are public on MS Download Center.
http://www.microsoft.com/en-us/download/details.aspx?id=10921 (Windows Supported Code Page Data Files.zip)

Comment 5 pub-w3 2012-05-06 23:42:10 UTC

(In reply to comment #3)

> I think we should add PUA mappings for all Microsoft encodings
> rather than inventing many new similar but different encodings.

Note that there is a difference between what Microsoft specifies and what Microsoft implements, and that this issue is not specific to Microsoft.

How exactly are PUA mappings more useful than mappings to U+FFFD?

> Regarding GB18030, PUA mappings are explicitly defined as a part of
> GB18030:2000/2005 national standard. Those are NOT undefined positions.

There is not much difference in practice between an undefined position and a position mapped to a PUA character explicitly defined not to map to a specific glyph, is there?

> GB18030 [is] supposed to cover all Unicode code points except isolated surrogates.

That is difficult to reconcile with proper handling of the GBK subset, cf. bug 16697.

Comment 6 Masatoshi Kimura 2012-05-06 23:51:56 UTC

(In reply to comment #5)
> Note that there is a difference between what Microsoft specifies and what
> Microsoft implements,
It is consistent about PUA mappings.
> and that this issue is not specific to Microsoft.
PUA mappings of GB18030 and Microsoft encodings are official.

> How exactly are PUA mappings more useful than mappings to U+FFFD?
> There is not much difference in practice between an undefined position and a
> position mapped to a PUA character explicitly defined not to map to a specific
> glyph, is there?
Why does the UTF-8 decoder include such useless mappings?

> That is difficult to reconcile with proper handling of the GBK subset, cf. bug
> 16697.
Is GBK subset required at all? What's wrong with supporting non-BMP code points for all GB2312/GBK families? GB2312 is an alias of GB18030 (not GBK) on Gecko.

Comment 7 pub-w3 2012-05-07 11:10:08 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > Note that there is a difference between what Microsoft specifies and what
> > Microsoft implements,
> It is consistent about PUA mappings.

According to <http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1255.txt>,
the byte FF in Windows 1255 maps to U+F896.

More 'official' mappings, however, e.g.,
<http://msdn.microsoft.com/en-us/goglobal/cc305148>,
leave this position completely undefined.

> > and that this issue is not specific to Microsoft.
> PUA mappings of GB18030 and Microsoft encodings are official.

Apple also uses the PUA for round-tripping:

<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CORPCHAR.TXT>:
# The following (1) is for mapping the single undefined code point in
# the Mac OS Greek and Turkish encodings, thus permitting full
# round-trip fidelity. This character is also used for mapping EURO SIGN
# when mapping to Unicode 1.1 (e.g. for Mac OS Roman and Symbol).
0xF8A0	# undefined1, also EURO SIGN for Unicode 1.1 # Turkish-0xF5, Roman-0xDB, Symbol-0xA0

Again, such mappings are not included in more 'official' versions of the character encoding definitions.

> Why does the UTF-8 decoder include such useless mappings?

Good question!  More seriously, I do appreciate that GB18030 can be seen as a Unicode Transformation Format, which is an argument for complete coverage including the PUA.
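
To make the "Unicode Transformation Format" property concrete: a UTF lets every code point, private-use ones included, survive an encode/decode cycle. The sketch below is only an illustration and uses Python's built-in UTF-8 codec as a stand-in; the helper name `round_trips` is hypothetical, and nothing here describes how any particular gb18030 implementation behaves.

```python
def round_trips(code_point: int, encoding: str = "utf-8") -> bool:
    """Return True if the code point survives an encode/decode cycle."""
    ch = chr(code_point)
    return ch.encode(encoding).decode(encoding) == ch


# Check the whole BMP Private Use Area (U+E000..U+F8FF), which contains
# the two code points discussed in this bug (U+E864 and U+E865).
pua_ok = all(round_trips(cp) for cp in range(0xE000, 0xF900))
print("BMP PUA round-trips in UTF-8:", pua_ok)
```

A decoder whose index omits a PUA entry cannot pass this kind of check for the omitted code point.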

> > [Covering all Unicode code points] is difficult to reconcile with proper handling
> > of the GBK subset, cf. bug 16697.
> Is GBK subset required at all?

The problem is that, e.g., the byte sequence A8 BC is defined as ḿ (lower-case m with acute accent) in GBK (implemented by using the PUA code point U+E7C7 since, I assume, there was no such character in Unicode at the time), whereas GB18030 defines it as U+E7C7 with no associated glyph and encodes ḿ elsewhere.  Mapping A8 BC to U+1E3F (LATIN SMALL LETTER M WITH ACUTE) would work for GBK, but would make U+E7C7 non-encodable in GB18030, thus preventing complete PUA coverage.
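
The conflict can be sketched with toy tables. These are not the real indexes: the key `ELSEWHERE` is a hypothetical placeholder for whatever other byte sequence GB18030 uses for ḿ, and the helper names are made up for illustration.

```python
A8BC = bytes([0xA8, 0xBC])
ELSEWHERE = b"<other byte sequence>"  # placeholder, not a real encoding

M_ACUTE = "\u1e3f"  # LATIN SMALL LETTER M WITH ACUTE
PUA = "\ue7c7"      # private-use code point with no associated glyph

# GB18030 reading: A8 BC is the PUA code point, ḿ lives elsewhere.
gb18030_reading = {A8BC: PUA, ELSEWHERE: M_ACUTE}

# GBK reading applied on top of GB18030: A8 BC now means ḿ.
gbk_reading_merged = {A8BC: M_ACUTE, ELSEWHERE: M_ACUTE}


def unencodable(decoder_table: dict, wanted: set) -> list:
    """Code points in `wanted` that no byte sequence in the table decodes to."""
    return sorted(wanted - set(decoder_table.values()))


wanted = {M_ACUTE, PUA}
print(unencodable(gb18030_reading, wanted))     # nothing is lost
print(unencodable(gbk_reading_merged, wanted))  # U+E7C7 has no byte sequence left
```

In the merged table two byte sequences decode to ḿ while U+E7C7 has none, which is exactly the loss of complete PUA coverage described above.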

> What's wrong with supporting non-BMP code points
> for all GB2312/GBK families? GB2312 is an alias of GB18030 (not GBK) on Gecko.

GBK and GB18030 are both supersets of (the EUC encoding of) GB2312 (known as EUC-CN), so handling EUC-CN as either GBK or GB18030 will work.  Unfortunately, though, GBK and GB18030 have ended up being slightly incompatible, and more so if PUA code points are considered important.

Comment 8 pub-w3 2012-05-08 20:41:02 UTC

(In reply to comment #3)
> Why is U+E865 included in the Encoding Standard's gb18030 decoder while U+E864
> is not? That is quite inconsistent.

Indeed, this inconsistency should be fixed somehow.

Comment 9 Masatoshi Kimura 2012-05-09 15:00:39 UTC

(In reply to comment #5)
> There is not much difference in practice between an undefined position and a
> position mapped to a PUA character explicitly defined not to map to a specific
> glyph, is there?
It's quite different for XML processors. If PUA mappings are removed and an XML document contains even one PUA character, the entire document will not be displayed at all.
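
The difference between a fatal decoder and a replacing one can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's UTF-8 codec and an invalid 0xFF byte as a stand-in for an unmapped gbk/gb18030 byte sequence, and the function names are hypothetical.

```python
data = b"<doc>a\xffb</doc>"  # 0xFF never occurs in well-formed UTF-8


def decode_fatal(raw: bytes) -> str:
    """Behave like an XML processor: any undecodable byte aborts decoding."""
    return raw.decode("utf-8", errors="strict")


def decode_replacement(raw: bytes) -> str:
    """Behave like an HTML decoder: undecodable bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


try:
    decode_fatal(data)
    fatal_failed = False
except UnicodeDecodeError:
    fatal_failed = True

print("fatal mode rejected the document:", fatal_failed)
print(decode_replacement(data))
```

With a PUA mapping in the index, the byte sequence decodes to a code point and neither mode reports an error.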

Comment 10 pub-w3 2012-05-14 21:54:04 UTC

(In reply to comment #9)
> It's quite different for XML processors.

This affects a large number of code points in many encodings and includes complicated cases such as deprecated mappings (e.g., ‘non verifiable’ Han characters found in earlier versions of Big5-HKSCS) and recently added characters (e.g., the new Korean postal code symbol added at a previously completely unused position).  Might it be possible to define certain U+FFFD mappings as non-fatal even in XML?

Comment 11 Anne 2012-05-15 09:18:35 UTC

If you include them directly in the index that should theoretically be enough. But is there any XML out there that actually hits those edge cases?

Comment 12 pub-w3 2012-05-15 20:09:35 UTC

My question (if that is what you were replying to) was more about whether non-fatal U+FFFD mappings would be acceptable for XML (in cases where an undefined byte sequence might reasonably correspond to a private extension, not for clearly illegal sequences triggering reprocessing).

In any case, the problem of undefined mappings for XML processing strikes me as a different issue from the one originally raised, and one that cannot be solved by adding PUA mappings since a number of code points that can be (and have been) used to extend various East Asian encodings do not have standard or semi-standard mappings to the Unicode private-use area.  [The IBM extension to JIS X 0212 (Bug 16941) might have been a better example, but this is somewhat beside the point given the number of apparently widespread extensions listed by Lunde.]

Comment 13 Anne 2012-11-16 14:30:17 UTC

So say we want to make gb18030 a UTF, how many problems as hinted at in comment 7 do we hit with gbk? Can we special case those somehow for gbk or can we live with making gbk a label for gb18030 and creating some incompatibility or would we need an index for gbk?

Comment 14 pub-w3 2012-11-17 09:42:51 UTC

(In reply to comment #13)
> So say we want to make gb18030 a UTF, how many problems as hinted at in
> comment 7 do we hit with gbk?

The problematic characters are all listed in bug 16697:  10 vertical variants, accented m, 14 Chinese characters missing from an early version of Unicode; 25 characters in total.

Comment 15 Henri Sivonen 2013-12-13 07:52:11 UTC

(In reply to pub-w3 from comment #14)
> (In reply to comment #13)
> > So say we want to make gb18030 a UTF, how many problems as hinted at in
> > comment 7 do we hit with gbk?
> 
> The problematic characters are all listed in bug 16697:  10 vertical
> variants, accented m, 14 Chinese characters missing from an early version of
> Unicode; 25 characters in total.

Is there data about the occurrence of the byte sequences for these on pages that are either unlabeled or whose label maps to gbk according to the current state of the Encoding Standard?

Comment 16 Anne 2013-12-16 16:09:53 UTC

*** Bug 16697 has been marked as a duplicate of this bug. ***

Comment 17 Anne 2013-12-16 16:11:16 UTC

*** Bug 21145 has been marked as a duplicate of this bug. ***

Comment 18 Anne 2013-12-16 16:15:06 UTC

So reading these comments and having updated my understanding of the situation in http://lists.w3.org/Archives/Public/www-archive/2013Dec/0010.html I think we can make "gbk" et al a label for gb18030.

The ḿ case mentioned in comment 7 and the others in bug 16697 are mostly a font issue. That gb18030 has a different position for those does not matter if the user has a font installed that makes it display correctly if the site in fact uses the two-byte sequence. You might hit some issues with form submission and that code point, but really if you are not using utf-8 for form submission you are in a world of hurt anyway.

So my plan is that I update the gbk table to include PUA as seen in gb18030 and replace gbk with gb18030 (removing the gb18030 flag).

I also think I should rename index gb18030 to table gb18030 (it's not really a linear index) and rename index gbk to index gb18030.

Feedback welcome.

Comment 19 Masatoshi Kimura 2013-12-16 16:21:03 UTC

Wholeheartedly agree.
Gecko used gb18030 mappings to decode gb2312 for a long time, until it implemented the Encoding Standard requirements, so it should not be a big deal, at least for decoding.

Comment 20 Henri Sivonen 2013-12-18 07:30:40 UTC

(In reply to Anne from comment #18)
> So reading these comments and having updated my understanding of the
> situation in
> http://lists.w3.org/Archives/Public/www-archive/2013Dec/0010.html I think we
> can make "gbk" et al a label for gb18030.

Is there a reason why this can't be tested in IE by having a page served as gbk and a page served as gb18030 and having a script inspect the DOM upon onload?

Comment 21 Henri Sivonen 2013-12-18 07:31:14 UTC

Considering how popular IE is in China, testing IE would seem prudent.

Comment 22 Masatoshi Kimura 2013-12-18 10:57:08 UTC

Also, IE10 supported FileReaderSync. IE11 supported overrideMimeType.

Comment 23 Anne 2013-12-18 12:37:40 UTC

I made it work in IE. The tables in IE for gbk and gb18030 are identical to those in Chrome, except that in gb18030 index 6555 maps to U+3000 in Gecko/Chrome and to U+E5E5 in IE (U+3000 is also at another location). I plan not to copy this difference and to still follow the plan from comment 18.
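
For readers who want to relate an index position such as 6555 to bytes on the wire, here is a minimal sketch of the two-byte pointer arithmetic. The formula is not stated in this thread; it is my reading of the two-byte gbk/gb18030 algorithm in the Encoding Standard (190 trail bytes per lead byte, with 0x7F skipped), so treat it as an assumption. The function names are hypothetical.

```python
def bytes_to_pointer(lead: int, trail: int) -> int:
    """Map a two-byte sequence to its index pointer."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("lead byte out of range")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
        raise ValueError("trail byte out of range")
    offset = 0x40 if trail < 0x7F else 0x41  # 0x7F is skipped
    return (lead - 0x81) * 190 + (trail - offset)


def pointer_to_bytes(pointer: int) -> tuple:
    """Inverse of bytes_to_pointer for two-byte pointers."""
    if not 0 <= pointer <= 23939:
        raise ValueError("pointer out of two-byte range")
    lead = pointer // 190 + 0x81
    trail = pointer % 190
    offset = 0x40 if trail < 0x3F else 0x41
    return lead, trail + offset


lead, trail = pointer_to_bytes(6555)
print(f"{lead:02X} {trail:02X}")       # A3 A0
print(bytes_to_pointer(0xA8, 0xBC))    # 7533, the ḿ position from comment 7
```

Under this reading, index 6555 corresponds to the byte sequence A3 A0.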

Comment 24 Henri Sivonen 2013-12-18 14:56:12 UTC

(In reply to Anne from comment #18)
> You might hit some issues with form
> submission and that code point, but really if you are not using utf-8 for
> form submission you are in a world of hurt anyway.

FWIW, Firefox whines upon form submission about all encodings other than UTF-8 and GB18030, because at the time of adding the whine, I thought GB18030 was a well-defined UTF.

https://mxr.mozilla.org/mozilla-central/source/content/html/content/src/nsFormSubmission.cpp#713

Comment 25 Anne 2013-12-18 15:14:46 UTC

It is a well-defined UTF, but now that we have decided to map other labels to it, there is a potential for submission issues if the label was not gb18030.

I am about to merge these encodings, so if we want to keep some of it still separate, now would be a good time to speak up.

Comment 26 Anne 2013-12-18 15:36:17 UTC

https://github.com/whatwg/encoding/commit/182ad9e607a7c6f0fa51d9dd6c638edaa5ec59fd

Comment 27 Anne 2013-12-18 15:37:59 UTC

Masatoshi, if you still feel we should have PUA mappings for other encodings as you seem to indicate in comment 3, could you file those as separate bugs or acknowledge here in a comment that is what you want (in which case I will file them), thanks!

Comment 28 Masatoshi Kimura 2013-12-18 15:55:01 UTC

OK, filed bug 24130.