proposed clarifications to the SPARQL grammar

I addressed the "SPARQL and Unicode versions" comment with some text
proposed in
  http://www.w3.org/mid/20060126021444.GZ17752@w3.org
Bjoern Hoehrmann pointed out several remaining shortcomings in
  http://www.w3.org/mid/90vnt1dqjg0d74lfe4j21f69bpofniafea@hive.bjoern.hoehrmann.de
To address these issues, I propose the following change to
  http://www.w3.org/2001/sw/DataAccess/rq23/#grammar

I would like to change A. SPARQL Grammar from
[[
A SPARQL query string is a Unicode character string (c.f. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production.  The EBNF format is the same
as that used in the XML 1.1 specification[XML11]. Please see the
"Notation" section of that specification for specific information about
the notation.

In addition, the following sections apply.
]]
to
[[
A SPARQL query string is a Unicode character string (c.f. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production. For compatibility with future
versions of Unicode, the characters in this string may include unassigned
Unicode codepoints (see Identifier and Pattern Syntax [UNIID] section 4
Pattern Syntax). For productions with excluded character classes (for
example "[^<>'{}|^`]"), the characters are excluded from the range #x00 -
#x10FFFF.

The EBNF notation used in the grammar is defined in Extensible Markup
Language (XML) 1.1 [XML11] section 6 Notation.

In addition, rules A.1 to A.5 apply.
]]

and add an informative reference to

[UNIID] Identifier and Pattern Syntax 4.1.0, Mark Davis, Unicode Standard
Annex #31, 25 March 2005, http://www.unicode.org/reports/tr31/tr31-5.html .
Latest version available at http://www.unicode.org/reports/tr31/ .
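
To illustrate the excluded-character-class wording, here is a minimal
sketch in Python (not part of the proposed spec text). It treats the example
class as the complement of the listed characters over the whole code space,
taking #x10FFFF (the top of the Unicode code space) as the upper bound. The
names MAX_CODEPOINT, EXCLUDED and matches_excluded_class are hypothetical,
introduced only for this example.

```python
MAX_CODEPOINT = 0x10FFFF  # upper bound of the Unicode code space

# Characters listed in the example class [^<>'{}|^`]
EXCLUDED = set("<>'{}|^`")


def matches_excluded_class(cp: int) -> bool:
    """Return True if code point cp is matched by [^<>'{}|^`].

    Every code point in #x00 - #x10FFFF matches unless it is listed in
    the class. No check is made for whether the code point is assigned,
    so characters added by future Unicode versions are accepted as-is.
    """
    if not 0x00 <= cp <= MAX_CODEPOINT:
        return False
    return chr(cp) not in EXCLUDED


print(matches_excluded_class(ord("a")))   # ordinary character
print(matches_excluded_class(ord("<")))   # listed in the class
print(matches_excluded_class(0x0378))     # assumed unassigned code point
```

The point of the design is that the parser never consults a table of
assigned characters, so it does not need updating when Unicode grows.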



Further, I would like to address Bjoern's comments on escape sequences by
modifying
[[
A.5 Escape sequences in strings

Strings are used for the lexical form of RDF terms and in expressions.
Within a string, the following escape sequences apply. The escape
character is backslash "\" (#x5C). No other escape sequences are defined
for strings.  Names for characters given are the common names.

These escape sequences apply to all rules making up the rule for string
(rules: STRING_LITERAL1, STRING_LITERAL2, STRING_LITERAL_LONG1,
STRING_LITERAL_LONG2).

<table>

where HEX  is a hexadecimal character

    HEX ::= [0-9] | [A-F] | [a-f]

Examples:
...
]]
to
[[
A.5 Escape sequences in strings

The following escape sequences may be used in any string production
(e.g. STRING_LITERAL1, STRING_LITERAL2, STRING_LITERAL_LONG1,
STRING_LITERAL_LONG2):

<table>

Any escaped character in the range #x00 - #x10FFFF may appear in any
string production. For instance, "\n" may appear in a STRING_LITERAL1 even
though the unescaped form is not valid in that production.
]]
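
The A.5 behaviour can be sketched in Python as follows. This is an
illustration under stated assumptions, not a reference implementation: the
escape table is elided in this message, so the SIMPLE_ESCAPES mapping and the
\uXXXX / \UXXXXXXXX forms below are assumptions inferred from the HEX
definition and the mention of \u and \U escapes. The names SIMPLE_ESCAPES,
ESCAPE_RE and unescape are hypothetical.

```python
import re

# ASSUMPTION: the message elides the real table; this mapping is a guess.
SIMPLE_ESCAPES = {
    "t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f",
    '"': '"', "'": "'", "\\": "\\",
}

# A backslash followed by \uXXXX, \UXXXXXXXX, or any single character.
ESCAPE_RE = re.compile(r"\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.)", re.DOTALL)


def unescape(body: str) -> str:
    """Decode escape sequences in the body of a string literal.

    A trailing lone backslash is left untouched by this sketch.
    """
    def replace(match: re.Match) -> str:
        esc = match.group(1)
        if esc[0] in "uU" and len(esc) > 1:
            cp = int(esc[1:], 16)
            if cp > 0x10FFFF:
                raise ValueError(f"code point out of range: \\{esc}")
            return chr(cp)
        if esc not in SIMPLE_ESCAPES:
            raise ValueError(f"undefined escape sequence: \\{esc}")
        return SIMPLE_ESCAPES[esc]

    return ESCAPE_RE.sub(replace, body)


# "\n" written as an escape yields a real newline, even in a production
# whose unescaped form may not contain one.
print(repr(unescape(r"line1\nline2")))
# #x00 is reachable through an escape.
print(repr(unescape(r"\u0000")))
```

Decoding happens after tokenizing, which is why the escaped form can carry
characters that the production itself excludes.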

This clarifies four points:
  - parsers must be able to process currently unassigned Unicode characters.
  - SPARQL strings include the character #x00.
  - which codepoints can be produced through \u and \U escape sequences.
  - there *is* a difference between escaped characters in strings and
    escaped characters in variable names and IRI references.

I specify the range to be #x00 - #x10FFFF while XML 1.1 uses #x01 -
#x10FFFF, citing "Due to potential problems with APIs, #x0 is still
forbidden both directly and as a character reference." I read our LC
document as allowing #x00 - #x10FFFF and am trying to avoid any
changes to the language at this late date. I don't think the
liberalization will hurt us.
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Thursday, 9 March 2006 22:35:28 UTC