Re: [RESOLVED] PN_CHARS_BASE outside of IRI range

On Mar 23, 2013, at 10:45 PM, Eric Prud'hommeaux <eric@w3.org> wrote:

> * Gregg Kellogg <gregg@greggkellogg.net> [2013-03-23 15:35-0700]
>> I've been struggling with the localName_with_PN_CHARS_BASE_character_boundaries.ttl test, which tests the range of characters allowed by PN_CHARS_BASE. Within the grammar, this is defined as follows:
>> 
>> [163s]	PN_CHARS_BASE	::=	[A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
>> 
>> This explicitly includes characters beyond what is allowed by the RFC-3987 [2] ucschar production:
>> 
>> ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>>                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>>                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>>                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>>                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>>                  / %xD0000-DFFFD / %xE1000-EFFFD
>> 
>> As a result, even though my Turtle processor parses the test, it fails when I try to validate the output, where I ensure that IRIs are also valid. My read of the ucschar production is that a valid IRI does not include %xEFFFE or %xEFFFF, which _are_ included in Turtle (and SPARQL I believe).
>> 
>> (Interestingly, it also excludes some ranges that are included in ucschar, but that is the subject of issue-190 [3]).
>> 
>> Since the horse has probably left the barn, I don't expect PN_CHARS_BASE to change at this point, but tests such as localName_with_PN_CHARS_BASE_character_boundaries.ttl should probably be limited to valid IRIs according to RFC-3987, as that spec is normatively referenced.
> 
> Thank you for resolving a long-standing mystery. The Turtle productions come from the SPARQL productions which come from
> 
>  [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> — http://www.w3.org/TR/REC-xml/#NT-NameStartChar
> 
> Per <http://www.unicode.org/charts/PDF/UEFF80.pdf>, U+EFFFE and U+EFFFF are intended for process-internal use and will thus never be useful in IRIs.
> (Relevant only to literals, U+10FFFE and U+10FFFF are also process-internal per <http://www.unicode.org/charts/PDF/U10FF80.pdf>.)
> 
> For reasons I couldn't recover, I had U+EFFFD in my own implementation, but changed it to U+EFFFF 'cause the coverage test seemed justified by the production. Given your observation, I've updated the tests (and reverted my code).
>  https://dvcs.w3.org/hg/rdf/rev/dad56881f954
> 
> I think I could argue that this is an error in the spec and could be fixed without another Last Call. I've raised ISSUE-123 to poke at "PN_CHARS_BASE permits up to U+EFFFF but RFC-3987 stops at U+EFFFD".
>  https://www.w3.org/2011/rdf-wg/track/issues/123

I'd perhaps put it in Errata, but wouldn't go out of LC for this. After all, it still must be an IRI according to RFC-3987; having a more liberal RegExp isn't in conflict with this requirement. Indeed, if an implementation relied on an alternative IRI validation, the actual Terminal definition used in a parser could be much more liberal.
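
To illustrate, here is a minimal Python sketch of that split: a liberal terminal check plus a separate ucschar check. The helper names are hypothetical, and the range tables are just transcribed from the two productions quoted above, so treat it as an illustration rather than anyone's actual implementation.

```python
# Ranges transcribed from the Turtle PN_CHARS_BASE production [163s].
PN_CHARS_BASE = [
    (0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Ranges transcribed from the RFC-3987 ucschar production:
# the BMP ranges, planes 1 through 13 up to xFFFD, then E1000-EFFFD.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)


def in_ranges(code_point, ranges):
    """Return True if code_point falls inside any inclusive (low, high) range."""
    return any(low <= code_point <= high for low, high in ranges)


def is_pn_chars_base(code_point):
    """Liberal check, as a parser terminal might apply it (hypothetical helper)."""
    return in_ranges(code_point, PN_CHARS_BASE)


def is_ucschar(code_point):
    """Stricter check a separate IRI validator might apply (hypothetical helper)."""
    return in_ranges(code_point, UCSCHAR)


if __name__ == "__main__":
    # The boundary code points discussed in this thread.
    for cp in (0xEFFFD, 0xEFFFE, 0xEFFFF):
        print(f"U+{cp:X}: PN_CHARS_BASE={is_pn_chars_base(cp)} ucschar={is_ucschar(cp)}")
```

With these tables, U+EFFFD passes both checks, while U+EFFFE and U+EFFFF are accepted by the terminal and rejected by the ucschar check, which is the mismatch the test was tripping over.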

> I hope this complete agreement with your position is a satisfactory resolution of your comment. If so, please respond with "[RESOLVED]" at the beginning of your subject.

I am satisfied with this resolution.

Gregg

>> Gregg Kellogg
>> gregg@greggkellogg.net
>> 
>> [1] http://www.w3.org/TR/turtle/#sec-grammar-grammar
>> [2] http://www.ietf.org/rfc/rfc3987.txt
>> [3] http://www.w3.org/International/track/issues/190
> -- 
> -ericP

Received on Sunday, 24 March 2013 18:20:23 UTC