surrogates in literals

* Dave Beckett <dave@dajobe.org> [2013-03-23 15:38-0700]
> … [eliding license issues addressed in a separate sub-thread]
> I've got some tests I made for raptor after the original Turtle submission
> that the WG might want to use.  I give permission for them to be used
> under the W3C software license
> http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231
> 
> This is what they test:
> … [eliding other tests addressed in a separate sub-thread]
>    test-38.ttl - unicode surrogates ok or not

<http://www.w3.org/TR/2013/WD-rdf11-concepts-20130115/#dfn-literal>'s
reference to a "Unicode string" means that "\ud801\udc69" is not a
valid RDF literal:

[[
D80 Unicode string:
A code unit sequence containing code units of a particular Unicode
encoding form
…
D92 UTF-8 encoding form:
The Unicode encoding form that assigns each Unicode scalar value to an
unsigned byte sequence of one to four bytes in length, as specified in
Table 3-6 and Table 3-7.
…

• Because surrogate code points are not Unicode scalar values,
  any UTF-8 byte sequence that would otherwise map to code points
  D800..DFFF is ill-formed.
]]
— <http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf> D80-D92
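
To make the quoted definitions concrete, here is a minimal sketch in
Python (my choice of language for illustration; the helper names
is_scalar_value and utf8_encodable are hypothetical, not part of any
spec or test suite). It relies only on the fact that Python's strict
UTF-8 codec refuses surrogate code points:

```python
def is_scalar_value(cp):
    """True if cp is a Unicode scalar value: any code point outside D800..DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def utf8_encodable(s):
    """True if every code point in s can be encoded as well-formed UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Two separate surrogate code points, as written in the example above.
pair = "\ud801\udc69"

# The single scalar value that the same two code units denote in UTF-16:
# 0x10000 + ((0xD801 - 0xD800) << 10) + (0xDC69 - 0xDC00) == 0x10469
astral = "\U00010469"

print(utf8_encodable(pair))    # False: surrogates are not scalar values
print(utf8_encodable(astral))  # True
```

The same codec also rejects the byte sequence ED A0 81, which would
otherwise map to D801, when decoding.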

I propose to add a note in the non-normative description of quoted
literals <http://www.w3.org/TR/turtle/#turtle-literals>:
[[
Note that RDF literals are Unicode strings, so they must be composed of
valid Unicode characters. The code points in the Unicode surrogate
code range, U+D800-U+DFFF, are not Unicode characters.
]]
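
For implementers who want to follow the note, one way to spot the
offending escapes might look like the sketch below. It is only an
illustration under stated assumptions: the function name
surrogate_escapes is hypothetical, the input is assumed to be the body
of a quoted literal with the surrounding quotes already removed, and
the regular expression is a simplification rather than the Turtle
grammar.

```python
import re

# Match one backslash escape at a time so that an escaped backslash ("\\")
# is consumed whole and cannot be misread as the start of a \u escape.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))", re.DOTALL)


def surrogate_escapes(literal_body):
    """Return the \\uXXXX / \\UXXXXXXXX escapes that name surrogate code points."""
    found = []
    for match in _ESCAPE.finditer(literal_body):
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            continue  # some other escape, such as \n or \\
        if 0xD800 <= int(hex_digits, 16) <= 0xDFFF:
            found.append(match.group(0))
    return found


print(surrogate_escapes(r"\ud801\udc69"))  # both escapes are reported
print(surrogate_escapes(r"\U00010469"))    # empty list: a scalar value
```

Whether a parser treats a non-empty result as an error or merely as a
warning is left open here, in line with not making this a negative
test.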

Andy Seaborne asked that the tests exercise good practice
<http://www.w3.org/mid/514AE55F.5080103@epimorphics.com>. However, so
as not to burden implementations, I have not included test 38 as a
negative test.

If you are satisfied with the resolution of test 38, please reply with
[RESOLVED] in the subject.
-- 
-ericP

Received on Sunday, 24 March 2013 14:52:23 UTC