Re: claimed completion on "ACTION-233: Publish the consolidated test suite"

* Andy Seaborne <andy.seaborne@epimorphics.com> [2013-03-21 10:47+0000]
> 
> 
> On 21/03/13 03:51, Eric Prud'hommeaux wrote:
> >RDF and I18N folks, we have an interesting situation where we permit
> >U+F900-U+FA0D to appear in local names, but advise against anything
> >which is not NFC. So, what do we test?
> 
> The grammar is wider than the acceptable URIs in several places -
> it's inevitable.  We're expecting URI checking to be done after
> parsing in a very strict implementation.
> 
> So test good practice and recognize that not everywhere is
> completely up-to-date on everything.

I guess we're narrowing in on a notion of "good practice".

Is it good practice for Turtle parsers to handle future codepoints
reserved in Unicode, or is it good practice to give errors for
anything outside the currently assigned codepoints? My guess is that
we should follow XML's lead here 'cause we're basically filling the
same role. A quick grep through the XML tests for the end of one of
the NameStartChar ranges found 48 tests with U+1FFF.
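
To make the two options concrete, here is a rough sketch (Python 3,
standard library only) that separates "allowed by the grammar" from
"currently assigned in Unicode". The ranges are copied from the
PN_CHARS_BASE production quoted further down; the helper names
in_pn_chars_base and is_assigned are made up for illustration, and the
"assigned" answer depends on the Unicode version bundled with whatever
Python you run, so treat it as an illustration rather than a spec.

```python
import unicodedata

# PN_CHARS_BASE ranges, copied from the Turtle grammar quoted in this thread.
PN_CHARS_BASE_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

def in_pn_chars_base(cp):
    """True if the grammar permits this code point (hypothetical helper)."""
    return any(lo <= cp <= hi for lo, hi in PN_CHARS_BASE_RANGES)

def is_assigned(cp):
    """True if this Python's Unicode tables assign the code point.

    'Cn' is the general category reported for unassigned code points.
    """
    return unicodedata.category(chr(cp)) != "Cn"

# U+1FFE and U+1FFF sit at the top of a range; U+037E is the hole in the grammar.
for cp in (0x1FFE, 0x1FFF, 0x037E):
    print("U+%04X grammar=%s assigned=%s"
          % (cp, in_pn_chars_base(cp), is_assigned(cp)))
```

A lenient parser would only consult in_pn_chars_base; a strict one
would also reject whatever is_assigned says no to, and so would change
its answers as Unicode grows.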

As to NFC, yeah, it's kind of mean to ask the vigilant developer to
put in a --quiet switch just for testing. Otoh, the prescribed
behavior (in Turtle and XML) is that parsing a local name with U+F900
preserves that codepoint. Others have advice?
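
For anyone without perl handy, the NFC checks quoted below can be
approximated in Python 3 with the standard unicodedata module. This is
a sketch, not a validator: is_nfc and first_nfc_stable are names I
made up, and it only compares a string against its own NFC form.

```python
import unicodedata

def is_nfc(s):
    """True if s is unchanged by NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s

def first_nfc_stable(lo, hi):
    """Return the first code point in [lo, hi] that NFC leaves alone, or None."""
    for cp in range(lo, hi + 1):
        if is_nfc(chr(cp)):
            return cp
    return None

# U+F900 is a CJK compatibility ideograph; NFC maps it to a unified ideograph.
print(hex(ord(unicodedata.normalize("NFC", "\uf900"))))

# Scan the [#xF900-#xFDCF] range of PN_CHARS_BASE for an NFC-safe candidate.
print(hex(first_nfc_stable(0xF900, 0xFDCF)))
```

A parser that follows the prescribed behavior never calls normalize on
the local name, so U+F900 in the input stays U+F900 in the resulting
IRI; a check like is_nfc would then belong in a separate, optional
warning pass.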



>  Andy
> 
> >Everything a Turtle parser
> >could encounter? Currently assigned characters that are in NFC?
> >Identifiers consisting of a single letter 'a', under the assumption
> >that all others will work by extension?
> >
> >
> >* Gavin Carothers <gavin@carothers.name> [2013-03-20 13:17-0700]
> >>http://www.unicode.org/charts/PDF/U1F00.pdf U+1FFF is not a character.
> >>http://www.unicode.org/charts/PDF/U2150.pdf U+218F is not a character.
> >>No chart for code point U+2FEF could be located. Most likely this is
> >>because no character is assigned to this code point yet.
> >>http://www.unicode.org/charts/PDF/UD7B0.pdf U+D7FF is not a character.
> >>http://www.unicode.org/charts/PDF/UFB50.pdf U+FDCF is not a character.
> >>No chart for code point U+EFFFF could be located. Most likely this is
> >>because no character is assigned to this code point yet
> >>
> >>
> >>New string based on the above missing characters tested in Python 3.3
> >>(earlier versions of python not supported, only one with Unicode 6.1.0)
> >
> >I banged briefly on finding an ubuntu package for Python 3.3
> >(currently at 3.2). Ended up with something called perl. sigh.
> >
> >use Unicode::Normalize;
> >$s = "AZaz\x{00c0}\x{00d6}\x{00d8}\x{00f6}\x{00f8}\x{02ff}\x{0370}\x{037d}\x{0384}\x{1ffe}\x{200c}\x{200d}\x{2070}\x{217f}\x{2c00}\x{2fcf}\x{3001}\x{d7fb}\x{f900}\x{fdc7}\x{fdf0}\x{fffd}\x{00010000}\x{0001f52b}";
> >p $s cmp NFC($s);
> >=> 1 -- strings are different. so now to look for the first candidate:
> >
> >for (0xf900..0xfdcf) {
> >     if (ord(Unicode::Normalize::NFC(chr($_))) == $_) {
> >         printf("%x\n", $_);
> >         last;
> >     }
> >}
> >=> fa0e
> >
> ># checked with
> >$s = "AZaz\x{00c0}\x{00d6}\x{00d8}\x{00f6}\x{00f8}\x{02ff}\x{0370}\x{037d}\x{0384}\x{1ffe}\x{200c}\x{200d}\x{2070}\x{217f}\x{2c00}\x{2fcf}\x{3001}\x{d7fb}\x{fa0e}\x{fdc7}\x{fdf0}\x{fffd}\x{00010000}\x{0001f52b}";
> >p $s cmp NFC($s);
> >=> 0 -- equivalent
> >
> >The currently unassigned characters don't impact NFC:
> >$s = "AZaz\x{00c0}\x{00d6}\x{00d8}\x{00f6}\x{00f8}\x{02ff}\x{0370}\x{037d}\x{037f}\x{1fff}\x{200c}\x{200d}\x{2070}\x{218f}\x{2c00}\x{2fef}\x{3001}\x{d7ff}\x{fa0e}\x{fdcf}\x{fdf0}\x{fffd}\x{10000}\x{effff}";
> >p $s cmp NFC($s);
> >=> 0 -- equivalent
> >
> >
> >>import unicodedata
> >>s = "AZaz\u00c0\u00d6\u00d8\u00f6\u00f8\u02ff\u0370\u037d\u0384\u1ffe\u200c\u200d\u2070\u217f\u2c00\u2fcf\u3001\ud7fb\uf900\ufdc7\ufdf0\ufffd\U00010000\U0001f52b"
> >>
> >>def display_string(s):
> >>    for c in s:
> >>        print("""Character: {c!s}
> >>Codepoint: {code:x}
> >>Name: {name}
> >>Combining: {combining}
> >>""".format(
> >>            c=c,
> >>            code=ord(c),
> >>            name=unicodedata.name(c),
> >>            combining=unicodedata.combining(c),
> >>        ))
> >>
> >>n = unicodedata.normalize("NFC", s)
> >>
> >>display_string(s)
> >>print("\n ------------------ \n ")
> >>display_string(n)
> >>
> >>assert n == s
> >>
> >>Yeah, they aren't the same. The offending character is f900:
> >>
> >>CJK COMPATIBILITY IDEOGRAPH-F900 which in normal form is CJK UNIFIED
> >>IDEOGRAPH-8C48
> >>
> >>Finding something in the F900ish range is left to Eric. Script above can be
> >>modified until it passes.
> >>
> >>Cheers,
> >>Gavin
> >>
> >>
> >>
> >>
> >>
> >>On Wed, Mar 20, 2013 at 12:13 PM, Eric Prud'hommeaux <eric@w3.org> wrote:
> >>
> >>>* Andy Seaborne <andy.seaborne@epimorphics.com> [2013-03-20 17:36+0000]
> >>>>The TTL has U+037E but ...
> >>>>
> >>>>PN_CHARS_BASE has a hole specifically for that
> >>>>
> >>>>[#x0370-#x037D] | [#x037F-#x1FFF]
> >>>>
> >>>>=> not a legal char.
> >>>
> >>>Yeah, I screwed that up. I should have gone the other way 'cause it's at
> >>>the bottom of a range (unlike all the other unassigned chars). Attached are
> >>>the same tests with s/37f/384/. Could you chop off after the "AZaz" and see
> >>>if that works and do a binary search to see what it's complaining about?
> >>>
> >>>I18N folks, could you tell me why an NFC validator is objecting to this
> >>>(beautiful) IRI and if there's some validator I can use for testing:?
> >>>   <http://a.example/AZazÀÖØöø˿Ͱͽ΄῾‌‍⁰↉Ⰰ⿕、ퟻ豈ﷇﷰ�𐀀>
> >>>The goal is to test as much as possible the valid input to <
> >>>http://www.w3.org/TR/turtle/#grammar-production-PrefixedName>. In turtle,
> >>>the localName gets appended to the namespace, hence the url above. The
> >>>
> >>>   [163s] PN_CHARS_BASE ::=    [A-Z] | [a-z] | [#x00C0-#x00D6] |
> >>>[#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] |
> >>>[#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
> >>>[#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> >>>
> >>>production is taken from <http://www.w3.org/TR/REC-xml/#NT-NameStartChar>:
> >>>
> >>>   [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
> >>>[#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
> >>>[#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
> >>>[#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> >>>
> >>>
> >>>
> >>>>Removing it (Greek question mark), I then get:
> >>>>
> >>>>WARN  [line: 2, col: 43] Bad IRI:
> >>>><http://a.example/AZaz???????????????????????> Code: 46/NOT_NFC in
> >>>>PATH: The IRI is not in Unicode Normal Form C.
> >>>>WARN  [line: 2, col: 43] Bad IRI:
> >>>><http://a.example/AZaz???????????????????????> Code: 47/NOT_NFKC in
> >>>>PATH: The IRI is not in Unicode Normal Form KC.
> >>>>WARN  [line: 2, col: 43] Bad IRI:
> >>>><http://a.example/AZaz???????????????????????> Code:
> >>>>56/COMPATIBILITY_CHARACTER in PATH: TODO
> >>>>
> >>>>with or without the last char.
> >>>>
> >>>>>I poked around looking for composing characters in the PN_CHARS_BASE
> >>>>>character ranges. \u02ff MODIFIER LETTER LOW LEFT ARROW seemed like it
> >>>>>could be a culprit, but fileformat.info claims it's not in a combining
> >>>>>class. Likewise \ufffd REPLACEMENT CHARACTER
> >>>>>
> >>>>>There are a bunch of yet-unassigned characters which could be confusing
> >>>>>a vigilant IRI checker. I've mapped those to the highest
> >>>>>currently-assigned characters in their respective range (per
> >>>>>fileformat.info):
> >>>>>
> >>>>>     \u037f   37e
> >>>>>     \u1fff  1ffe
> >>>>>     \u218f  2189
> >>>>>     \u2fef  2fd5
> >>>>>     \ud7ff  d7fb
> >>>>>     \ufdcf  fdc7
> >>>>>\U000effff e01ef
> >>>>>
> >>>>>attached is a variant of
> >>>>>   localName_with_PN_CHARS_BASE_character_boundaries.{nt,ttl}
> >>>>>with the values substituted. (I pass this modified test so there
> >>>>>shouldn't be any typos in it.) If it still doesn't work, try chopping
> >>>>>off the last character 'cause it's a variation selector which ostensibly
> >>>>>is NF{,K}{C,D} valid, but may not have been when jjc wrote your checker.
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>--
> >>>-ericP
> >>>
> >>>
> >
> 

-- 
-ericP

Received on Friday, 22 March 2013 07:21:22 UTC