Re: Surrogate Code Points in Tests? from Eric Prud'hommeaux on 2013-11-03 (public-rdf-comments@w3.org from November 2013)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Sat, 2 Nov 2013 20:02:26 -0400
To: Alex Milowski <alex@milowski.com>
Cc: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <20131103000225.GF13691@w3.org>
* Alex Milowski <alex@milowski.com> [2013-05-17 11:17-0700]
> The attached file (UTF-8 version) is essentially the same as the one I got
> from the test suite with just slightly different line endings.
> 
> The problem here is that U+10000 and U+EFFFF are not representable in UCS-2
> and it appears that most browsers are using UCS-2 at some level for the
> strings in Javascript.  As such, these code points are turned into surrogate
> pairs (U+D800 + U+DC00 and U+DB7F + U+DFFF) which aren't actually part of
> UCS-2 as legal code points.  The surrogate characters are also not allowed
> by the productions in prefixes and other such names.
> 
> The test is unusable in an environment, like most current browsers, that
> produce surrogate pairs from the UTF-8 encoding.  That is, unusable without
> some enhanced level of support for surrogates.
> 
> There are a number of options to resolve this:
> 
>    1. The test will just fail for such environments without special support
> for surrogates.
>    2. We mark this test as requiring surrogate / UTF-16 handling.  Passing
> the test is quality of implementation question.
>    3. Such environments are expected to check for pairs of surrogate values
> in place of the code points [#x10000-#xEFFFF] even though the resulting
> identifier is unlikely to be generally usable.
> 
> Option (1) means I just fail the test and move on. I'm not particularly
> partial to this.
> 
> Option (2) allows greater diversity of processor environments but certainly
> runs counter to promoting proper Unicode support.  It would be good to have
> a variant that tests only those in the Basic Multilingual Plane.
> 
> Option (3) allows a processor to pass and punts the use of the code point
> back to the user.  In the case of these particular code points and
> identifiers, the user of the result will have similar issues when
> inspecting or using values from the processor.  This will be no different
> for data received from other places that has the same code points as
> Javascript will treat them all the same and produce surrogate pairs.
> 
> There absolutely needs to be much more documentation attached to this test.
>  The identifiers (e.g. prefix) are constructed by taking the first and last
> character of each range in the PN_CHARS_BASE production.  I have to admit I
> really didn't recognize that until much later in my research into why this
> test didn't pass.  In addition, the general issue of UCS-2  vs UTF-16
> representations and surrogates needs to be minimally referenced in the
> description.
> 
> It would be good for everyone else to avoid a lot of the research I just did
> with a well-written test description.
> 
> BTW, there's a good write up of Javascript's issues at [1]
> 
> [1] http://mathiasbynens.be/notes/javascript-encoding

Dear Alex,

On 17 May, in
<http://www.w3.org/mid/CABp3FNKKtN8vac7=7gjzL2Y_KfH9wq0p56zAA5KAaO8Rv-SmEg@mail.gmail.com>,
you proposed 3 alternatives for addressing characters outside of the
range of #0000-#FFFD. Your third choice got support from others on rdf-comments:
[[
 3. Such environments are expected to check for pairs of surrogate values
in place of the code points [#x10000-#xEFFFF] even though the resulting
identifier is unlikely to be generally usable.
]]
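
As a rough illustration of what option 3 asks of an implementation that
only sees UTF-16 code units, the sketch below combines a high/low
surrogate pair back into a code point and checks it against the
[#x10000-#xEFFFF] range. The function names are hypothetical and are not
taken from the test suite or from any Turtle parser.

```python
def is_surrogate_pair(high: int, low: int) -> bool:
    """True if (high, low) is a well-formed UTF-16 surrogate pair."""
    return 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into the code point it encodes."""
    if not is_surrogate_pair(high, low):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def pair_in_supplementary_range(high: int, low: int) -> bool:
    """Check a pair of code units against the range [#x10000-#xEFFFF]."""
    if not is_surrogate_pair(high, low):
        return False
    return 0x10000 <= combine_surrogates(high, low) <= 0xEFFFF
```

With these definitions, the pairs U+D800 + U+DC00 and U+DB7F + U+DFFF
combine to U+10000 and U+EFFFF respectively, so both fall inside the
range, while a pair such as U+DB80 + U+DC00 (U+F0000) does not.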

In response to this, we added the following text to the README in the
turtle test suite:
[[
CHARACTER ENCODING:

The Turtle language uses UTF-8 encoding. The following tests include
non-ascii characters:
  localName_with_assigned_nfc_bmp_PN_CHARS_BASE_character_boundaries
  localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries *
  localName_with_nfc_PN_CHARS_BASE_character_boundaries *
  labeled_blank_node_with_PN_CHARS_BASE_character_boundaries *
  LITERAL1_with_UTF8_boundaries *
  LITERAL_LONG1_with_UTF8_boundaries *
  LITERAL2_with_UTF8_boundaries *
  LITERAL_LONG2_with_UTF8_boundaries *

Those marked with a * include characters with codepoints greater than
U+FFFD and are thus expressed as a pair of surrogate characters when
represented in UCS2.
]]
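
For readers unfamiliar with the mechanics, here is a minimal sketch of
how a code point at or above U+10000 is split into a surrogate pair in a
16-bit string representation. It only shows the standard UTF-16
arithmetic; the function name is made up for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# The two boundary code points used in the starred tests.
for cp in (0x10000, 0xEFFFF):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> U+{high:X} U+{low:X}")
# U+10000 -> U+D800 U+DC00
# U+EFFFF -> U+DB7F U+DFFF
```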

This did annotate the tests with non-BMP characters but did neither of:

  Define comparison in terms of surrogate pairs (instead of code points).
  Indicate that those were not required for conformance.

Noting that none of the test submitters failed those tests:
  <http://www.w3.org/2011/rdf-wg/wiki/Turtle_Candidate_Recommendation_Comments#Implementations>
including your own implementation report:
  <http://www.w3.org/mid/CABp3FNLOOHZHwVSUdUK09Kmp_yujfXZmKOeNBC4TC5V8iUp1Nw@mail.gmail.com>
I propose to close this comment as satisfactorily addressed by the
additional comments in README. If you disagree, please indicate what
would satisfy your comment. If you agree, please respond with the
subject prefixed by "[RESOLVED]".


> On Fri, May 17, 2013 at 4:37 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
> 
> > * Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]
> > > In looking at test:
> > >
> > >    prefix_with_PN_CHARS_BASE_character_boundaries.ttl  [1]
> > >
> > > There are the code points u+dc00, u+db7f, and u+dfff in the last part of
> > > the prefix.  The code points u+d800-u+dfff are not valid unicode
> > > characters.
> > >
> > > Why are these in a positive test?
> > >
> > > [1]
> > > https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl
> >
> > In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
> > u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
> > u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
> > u+effff .
> >
> > I've attached two variants of the test, one encoded in UTF-8, which is
> > legal turtle and has only the codepoints above, and UTF-16, which uses
> > surrogates to encode the codepoints u+10000 and u+effff . Is it
> > possible that your buffer got re-encoded as UTF-16 at some point?
> >
> > If this message resolves your comment, please reply with "[RESOLVED]"
> > in the subject.
> >
> > > --
> > > --Alex Milowski
> > > "The excellence of grammar as a guide is proportional to the paucity of
> > > the
> > > inflexions, i.e. to the degree of analysis effected by the language
> > > considered."
> > >
> > > Bertrand Russell in a footnote of Principles of Mathematics
> >
> > --
> > -ericP
> >

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.
Received on Sunday, 3 November 2013 00:02:57 UTC