[OK?] Re: SPARQL and Unicode versions from Eric Prud'hommeaux on 2006-01-26 (public-rdf-dawg-comments@w3.org from January 2006)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Wed, 25 Jan 2006 21:14:44 -0500
To: Dan Connolly <connolly@w3.org>
Cc: Dave Beckett <dave@dajobe.org>, public-rdf-dawg-comments@w3.org
Message-ID: <20060126021444.GZ17752@w3.org>

On Sun, Jan 08, 2006 at 08:54:15AM -0600, Dan Connolly wrote:
> 
> On Sat, 2006-01-07 at 20:01 -0800, Dave Beckett wrote:
> > Dan Connolly wrote:
> > > On Sat, 2006-01-07 at 12:38 -0800, Dave Beckett wrote:
> > > 
> > >>SPARQL refers to:
> > >>
> > >>[[
> > >>  [UNICODE]
> > >>    The Unicode Standard, Version 4. ISBN 0-321-18578-1, as updated from
> > >>  time to time by the publication of new versions. The latest version of
> > >>  Unicode and additional information on versions of the standard and of
> > >>  the Unicode Character Database is available at
> > >>  http://www.unicode.org/unicode/standard/versions/.
> > >>
> > >>]]
> > >>
> > >>which cites a moving target.  Please define SPARQL in terms of a
> > >>particular version of Unicode only, and no other.  Otherwise if or when
> > >>this Unicode consortium makes some incompatible changes, all existing
> > >>implementations become invalid.
> > > 
> > > 
> > > How so? How is conformance to SPARQL sensitive to changes in Unicode?
> > 
> > The SPARQL query syntax is defined on Unicode characters:
> > 
> > [[
> > A. SPARQL Grammar
> > 
> > A SPARQL query string is a Unicode character string (c.f. section 6.1
> > String concepts of [CHARMOD])
> > ...
> > ]]
> > 
> > although the grammar defines precise ranges of codepoints for particular
> > things such as names of variables (based on XML 1.1 I think).
> > 
> > If the definition of a Unicode character string changes in some future
> > Unicode revision, such as for example by allowing additional codepoints,
> > then there will be additional codepoints allowed in a SPARQL query
> > string, following the sentence above.
> 
> I believe that's by design, following...
> 
> "C063  [S]  A generic reference to the Unicode Standard MUST be made if
> it is desired that characters allocated after a specification is
> published are usable with that specification".
>   http://www.w3.org/TR/2005/REC-charmod-20050215/#C063
> 
> I suppose I should check with the WG.
> 
> > Any part of the grammar that uses a negated range such as with '[^...]'
> > will allow such codepoints.  Examples include:
> >   http://www.w3.org/TR/rdf-sparql-query/#rQ_IRI_REF
> > and all string literals.
> > 
> > These codepoints may be refused by something implementing Unicode 4.0
> > and no more.
> 
> I suppose we need a test case that uses a codepoint that isn't currently
> allocated in Unicode 4.0.
> 
> I still can't think of any reason why changes in Unicode specs would
> make any difference to SPARQL producers/consumers. It's not like
> they need to reference the Unicode tables to check the grammar or
> anything.

Due to lineage and good intentions, the SPARQL grammar mirrors the
XML1.1 spec. For instance, our name chars
  http://www.w3.org/2001/sw/DataAccess/rq23/#rNCCHAR1p
are slight liberalizations of XML name chars
  http://www.w3.org/TR/xml11/#NT-NameStartChar
Strings
  http://www.w3.org/2001/sw/DataAccess/rq23/#rSTRING_LITERAL1
are analogous to CharData
  http://www.w3.org/TR/xml11/#NT-CharData
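
To make the "fixed ranges" idea concrete, here is a minimal Python sketch
of the XML 1.1 NameStartChar production as a plain list of code-point
ranges. The names NAME_START_RANGES and is_name_start_char are
hypothetical, and the SPARQL name-character productions differ slightly
from these ranges, so treat it as an illustration rather than the SPARQL
grammar itself.

```python
# Code-point ranges of the XML 1.1 NameStartChar production, written as
# inclusive (low, high) pairs. SPARQL's own name-character productions
# differ slightly, so this is an illustration, not the SPARQL grammar.
NAME_START_RANGES = [
    (0x3A, 0x3A),      # ":"
    (0x41, 0x5A),      # A-Z
    (0x5F, 0x5F),      # "_"
    (0x61, 0x7A),      # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def is_name_start_char(ch: str) -> bool:
    """Return True if the single character ch falls in one of the fixed ranges."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in NAME_START_RANGES)
```

The check never consults the Unicode character database: a code point
such as U+50000 is accepted whether or not any Unicode version has
assigned it, while U+F0000 is rejected because it lies above #xEFFFF.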

Basically, our grammar follows XML's lead and maps out the use of Unicode
chars from #x00 to #xEFFFF. All Unicode chars are in this range, but
there are lots of holes (currently undefined chars). My reading of the
XML spec is that the grammar stays fixed as Unicode grows and fills these
holes. However, if Unicode extends beyond #xEFFFF, XML1.1 apps will
not handle these new chars. To clarify this, and to address
Björn's comments, I will propose the following text at the top of the
grammar definition:

[[
A SPARQL query string is a Unicode character string (c.f. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production.  The EBNF format is the
same as that used in the XML 1.1 specification[XML11]. Please see the
"Notation" section of that specification for specific information
about the notation.

[ Informative: this specification maps out the usage of Unicode
characters between #x00 and #xEFFFF. Excluded character sets,
for example "[^<>'{}|^`]", indicate the range of [#x00-#xEFFFF] minus
the listed characters. This specification does not include any
future Unicode characters outside of the range [#x00-#xEFFFF]. ]

The following sections list all additional constraints on a valid
SPARQL query:
...
A.5 Escape sequences in strings

Escaped characters in strings (STRING_LITERAL1, STRING_LITERAL2,
STRING_LITERAL_LONG1, STRING_LITERAL_LONG2) must be in the character
ranges defined by those rules.
]]
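
As a rough sketch of how the informative note could be read, the Python
below treats the excluded set "[^<>'{}|^`]" as "any code point in
[#x00-#xEFFFF] except the listed characters". The constant and function
names are hypothetical, and this is only one plausible interpretation of
the proposed wording, not a conformance test.

```python
# Upper bound taken from the proposed informative note.
MAX_CODE_POINT = 0xEFFFF

# The characters listed inside the example class [^<>'{}|^`].
EXCLUDED = frozenset("<>'{}|^`")


def matches_excluded_class(ch: str) -> bool:
    """True if ch is within [#x00-#xEFFFF] and is not a listed character."""
    return ord(ch) <= MAX_CODE_POINT and ch not in EXCLUDED


def all_chars_allowed(text: str) -> bool:
    """True if every character of text matches the excluded-set class."""
    return all(matches_excluded_class(ch) for ch in text)
```

Under this reading, a string containing a code point that is unassigned
in Unicode 4.0 but below #xEFFFF is accepted, and one containing a code
point above #xEFFFF is not.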

Dave, Björn, what do you think?
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Thursday, 26 January 2006 02:14:48 UTC