PN_CHARS_BASE outside of IRI range

I've been struggling with the localName_with_PN_CHARS_BASE_character_boundaries.ttl test, which tests the range of characters allowed by PN_CHARS_BASE. Within the Turtle grammar [1], this is defined as follows:

[163s]	PN_CHARS_BASE	::=	[A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

This explicitly includes characters beyond what is allowed by the ucschar production in RFC-3987 [2]:

ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

As a result, even though my Turtle processor parses the test, validation of the output fails, because I also check that the resulting IRIs are valid. My read of the ucschar production is that a valid IRI does not include %xEFFFE or %xEFFFF, which _are_ included in Turtle (and, I believe, in SPARQL).
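
To make the mismatch concrete, here is a rough Python sketch that subtracts the characters an IRI may contain from PN_CHARS_BASE. The range tables are transcribed by hand from the two productions quoted above; the names PN_CHARS_BASE, IRI_ALLOWED and subtract are just illustrative, and I'm assuming ASCII letters are covered separately by ALPHA in RFC-3987's iunreserved. Treat it as a quick check, not a validator.

```python
# Inclusive code point ranges, transcribed from the Turtle grammar quoted above.
PN_CHARS_BASE = [
    (0x41, 0x5A), (0x61, 0x7A),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x02FF),
    (0x0370, 0x037D), (0x037F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Assumption: ASCII letters are allowed in IRIs via ALPHA, so they are listed
# here alongside ucschar to keep them out of the reported differences.
IRI_ALLOWED = [(0x41, 0x5A), (0x61, 0x7A)]
IRI_ALLOWED += [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# %x10000-1FFFD through %xD0000-DFFFD: one range per plane 1..13.
IRI_ALLOWED += [(p * 0x10000, p * 0x10000 + 0xFFFD) for p in range(1, 14)]
IRI_ALLOWED.append((0xE1000, 0xEFFFD))


def subtract(ranges, allowed):
    """Return the parts of `ranges` not covered by `allowed`.

    Both arguments are lists of inclusive (lo, hi) pairs; `allowed` must not
    contain overlapping ranges.
    """
    gaps = []
    allowed = sorted(allowed)
    for lo, hi in ranges:
        cur = lo  # first code point of this range not yet accounted for
        for a_lo, a_hi in allowed:
            if a_hi < cur:
                continue  # allowed range lies entirely before the cursor
            if a_lo > hi:
                break  # allowed range lies entirely after this range
            if a_lo > cur:
                gaps.append((cur, a_lo - 1))
            cur = a_hi + 1
            if cur > hi:
                break
        if cur <= hi:
            gaps.append((cur, hi))
    return gaps


gaps = subtract(PN_CHARS_BASE, IRI_ALLOWED)
for lo, hi in gaps:
    print(f"U+{lo:04X}-U+{hi:04X}")
```

If I've transcribed the ranges correctly, the output includes U+EFFFE-U+EFFFF, and also U+FFF0-U+FFFD, the last two code points of planes 1 through 12, and U+DFFFE-U+E0FFF.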

(Interestingly, PN_CHARS_BASE also excludes some ranges that are included in ucschar, but that is the subject of issue-190 [3].)

Since the horse has probably left the barn, I don't expect PN_CHARS_BASE to change at this point, but tests such as localName_with_PN_CHARS_BASE_character_boundaries.ttl should probably be limited to characters that produce valid IRIs according to RFC-3987, as that spec is normatively referenced.

Gregg Kellogg
gregg@greggkellogg.net

[1] http://www.w3.org/TR/turtle/#sec-grammar-grammar
[2] http://www.ietf.org/rfc/rfc3987.txt
[3] http://www.w3.org/International/track/issues/190

Received on Saturday, 23 March 2013 22:35:58 UTC