Meeting minutes
Approval of last week’s minutes
<pfps> minutes look acceptable
<ora> PROPOSAL: Approve last week's minutes
<AndyS> +1
<ora> +0 (not present)
<pfps> +1
<TallTed> +1
<niklasl> +1
<j22> +1
<gtw> +1
<Tpt> +0 (not present)
<Souri> +1
<lisp> +1
<doerthe> +1
RESOLUTION: Approve last week's minutes
Allowing \u escaped surrogate pairs
<AndyS> https://
AndyS: There has been an email thread. The sentiment seems to be against allowing \u surrogate pairs, despite the i18n suggestion
AndyS: The choice is whether or not to prevent parsers from doing that
AndyS: We haven't had much response from people working in UTF-16-based programming languages like JavaScript, Python or Java
ora: Your suggestion would have been to prevent them?
<AndyS> """
<AndyS> Two adjacent numeric escape sequences forming a Surrogate Pair MAY be converted to a supplementary codepoint as described by Unicode 17.0 section 3.9.2 UTF-16.
<AndyS> """
AndyS: I am pretty neutral
AndyS: We can use a MAY so that there is no obligation to support it in any way
<Souri> +1 to MAY
<Zakim> pfps, you wanted to say that the current document is perfect
pfps: It appears to me that the document in its current form is perfect
pfps: It states that surrogates are not allowed and that Turtle parsers are allowed to do whatever they want outside the standard
pfps: If we state that "Turtle documents MAY include escaped surrogate pairs", it means that every processor should support them
pfps: I read Turtle 1.1 as preventing surrogate pairs; they are not characters
AndyS: There is no "Unicode character"; there are "abstract characters", which you don't write, and 1.1 does not explicitly prevent surrogates. This is the i18n reading
pfps: The intent of 1.1 is that surrogates are not allowed
AndyS: Unfortunately, intent is not speccable
pfps: You are right, "Unicode character" is not a well-defined thing, but I don't think surrogates are "characters"
<gtw> no characters, but there are "noncharacters" (which is orthogonal to this discussion)
AndyS: The nearest thing at this level is a code point. So technically U+0020 is not a character, it's a code point
lisp: I really don't understand what is being argued about. We are talking about something encoded in UTF-8
lisp: If you have surrogate characters, you get an invalid sequence
AndyS: At the UTF-8 level you get the \u... syntax, which is valid UTF-8
AndyS: What is going on is that you process the input as UTF-8 and happen to find some escape sequences that encode surrogates
lisp: Even if there is a pair of surrogates, this does not give a valid sequence of code points
gtw: At the point we come across the \u escape, we are no longer parsing the UTF-8 input string; we are manipulating code points
gtw: My point is that \u is something we have defined in our format. It is used to encode Unicode escapes, but it is something we handle
lisp: If that is the case, then Turtle is not UTF-8
lisp: UTF-8 is a sequence of code units (1..4 per code point). If you put into it an escape for a code unit, then we are doing something invalid
AndyS: The 12 characters \u....\u.... are decoded into a single Unicode code point, just like \n gives you the newline code point
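To illustrate the decoding AndyS describes, here is a minimal Python sketch. It is not taken from any Turtle parser: it only applies the surrogate-pair arithmetic defined for UTF-16, and the helper names `combine_surrogates` and `decode_escaped_pair` are hypothetical.

```python
import re

# Matches exactly one 12-character \uXXXX\uXXXX sequence (hypothetical pattern).
_PAIR = re.compile(r"\\u([0-9A-Fa-f]{4})\\u([0-9A-Fa-f]{4})")


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def decode_escaped_pair(text: str) -> str:
    """Decode one escaped surrogate pair into a one-code-point string."""
    match = _PAIR.fullmatch(text)
    if match is None:
        raise ValueError("expected a \\uXXXX\\uXXXX sequence")
    high, low = (int(group, 16) for group in match.groups())
    return chr(combine_surrogates(high, low))


escaped = r"\uD83D\uDE00"  # 12 ASCII characters in the input text
decoded = decode_escaped_pair(escaped)
print(len(escaped), f"U+{ord(decoded):X}")  # 12 U+1F600
```

Whether a Turtle parser should perform this step at all is exactly the point under discussion; the sketch only shows what the conversion would compute.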
lisp: We may do that, but then you are no longer decoding UTF-8
lisp: This sequence of characters has a meaning in Unicode: it is a sequence of two surrogates, and that is not allowed
<Zakim> pfps, you wanted to say that the 12 bytes are *Turtle*
gtw: Turtle sees a sequence of 12 characters \u....\u.... and has to choose whether they are legal or illegal
lisp: If you are decoding UTF-8, it has a definition for code points; you are not out of UTF-8 yet
lisp: If the WG wants to say it's UTF-8, then it's not possible
ora: What you want to say is that it's a "UTF-8 escape sequence"
<pfps> UTF-8 does have a definition of the twelve bytes \unnnn\unnnn, it's twelve Unicode code units, each of which is in the ASCII block
lisp: No, it ends up as a UTF-16 escape sequence. Surrogates are not valid escape sequences
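Both views in this exchange can be checked against Python's built-in codecs. The following is a minimal sketch, assuming CPython 3's strict `utf-8` codec; `can_encode_utf8` is a hypothetical helper. It shows that the escape text itself is 12 ASCII bytes and therefore valid UTF-8, while a surrogate code point on its own has no UTF-8 encoding.

```python
def can_encode_utf8(code_point: int) -> bool:
    """Return True if the strict UTF-8 codec can encode this single code point."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Python's strict UTF-8 encoder rejects U+D800..U+DFFF.
        return False


escaped = r"\uD83D\uDE00"        # the escape as it appears in a document
raw = escaped.encode("utf-8")    # plain ASCII bytes, so valid UTF-8
print(len(raw), raw.isascii())   # 12 True

print(can_encode_utf8(0x1F600))  # True: a supplementary code point
print(can_encode_utf8(0xD83D))   # False: a lone surrogate
```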
<Zakim> AndyS, you wanted to move on to changes - if any - and response to i18n
AndyS: Can we find some way forward on what we answer to i18n?
AndyS: What I am hearing is that we can keep the current text (surrogates are not allowed) and leave it at that
AndyS: I would be happy if we had some response from people who have worked in UTF-16 centric languages
AndyS: I changed Jena last fall to bring it as close to the spec as possible. Java misrepresents code on output; it currently will accept invalid surrogate pairs but states in the code "this is an extension"
<ora> STRAWPOLL: Keep current text: surrogates not allowed
<pfps> +1
<ora> +1
<Tpt> +0 (neutral)
<lisp> +1
<doerthe> +0
<gtw> +1
<TallTed> +0.8
<niklasl> +0
<j22> -0.5
<AndyS> For \u: A Unicode code point in the ranges U+0000 to U+D7FF and U+E000 to U+FFFF, corresponding to the value encoded by the four hexadecimal digits interpreted from most significant to least significant digit.
<AndyS> "A numeric escape sequence MUST NOT produce a code point value in the range U+D800 to U+DFFF, which is the range for Unicode surrogates."
<Souri> +0
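The rule quoted by AndyS above can be sketched as a small check in Python. This is an illustrative assumption about how a processor might apply it, not code from Jena or any other implementation. The helper name `unescape_u` is hypothetical, and the sketch is deliberately simplified: it handles only the four-digit \u form and does not account for escaped backslashes.

```python
import re

# Four-hex-digit numeric escape, as in the quoted text (simplified pattern).
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_u(text: str) -> str:
    """Replace \\uXXXX escapes, rejecting any that name a surrogate."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # "MUST NOT produce a code point value in the range U+D800 to U+DFFF"
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"surrogate not allowed: U+{code_point:04X}")
        return chr(code_point)

    return _U_ESCAPE.sub(replace, text)


print(unescape_u(r"caf\u00E9"))  # café

try:
    unescape_u(r"\uD83D\uDE00")
except ValueError as error:
    print(error)  # surrogate not allowed: U+D83D
```

Under this reading, an escaped surrogate pair is rejected at its first half, whether or not a matching low surrogate follows.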
j22: I do not have strong feelings. It is a feature in the spec that could increase compatibility in existing specs
j22: So I thought it would be reasonable to follow the suggestion, but I do not feel strongly enough about it
lisp: The text which Andy wrote is an accurate rendition of what UTF-8 requires
lisp: The notion that one will increase compatibility by allowing extensions means people won't know what they can expect
lisp: "Surrogates MAY appear" means parsers must implement them to get interoperability
AndyS: The spec text uses the U+ notation, which talks about code points after decoding
AndyS: If you want to talk about interoperability, we should talk about what systems produce
ora: If we don't allow surrogates, what are the bad things that would happen? I do not have enough experience to know how marginal this issue is
AndyS: The change in Jena to exclude bad use of surrogates did not cause any user reports
Tpt: According to the issue author, DotNetRdf writes escaped surrogate pairs
<pfps> Someone indicated that Python was UTF-16 friendly. I'm not seeing that in documents about strings in Python.
ora: The low-energy option is to keep the old text
Tpt: We could put in a note about a common extension to Turtle.
Tpt: What about stating that parsers MAY accept escaped surrogates and serializers MUST NOT use them?
gtw: That is the current status quo - don't feel that we need to mention it as it suggests it is acceptable.
gtw: I don't think it's a good idea, it suggests it is something "fine" to do. I am concerned it will cause compatibility issues
ora: Do people want to decide now? In the grand scheme of things, it's a small thing
ora: Adrian and I are giving a small status update at the Knowledge Graph Conference
<ora> PROPOSAL: Keep current text: surrogates not allowed
<ora> +1
<Souri> +1
<Tpt> +1
<j22> +1
<lisp> +1
<gtw> +1
<niklasl> +1
<AndyS> +0
<pfps> +1
<TallTed> +1
RESOLUTION: Keep current text: surrogates not allowed
AndyS: What we answer to i18n: "the WG has considered your position and decided to keep the current text as it is clearer"
ora: AndyS will respond
Review of open PRs
ora: Where are we in tests?
<j22> +1 to rdf/xml informal meeting next week
Tpt: Tests for N-Triples/N-Quads/Turtle/TriG are in good shape, RDF/XML is blocked on possible substantive changes, SPARQL is still missing some parts like EXISTS
AndyS: What about an informal session next week on RDF/XML?
<gtw> is there a SPARQL TF call tomorrow?
ora: Neither Adrian nor I will likely be there next week
<niklasl> Perhaps also talk a little about w3c/
<gb> Issue 189 Consolidate advice about versions of RDF syntax and HTTP access (by niklasl) [needs discussion] [ms:CR]
ora: Let's not cancel the meeting, but send an email to the mailing list making this explicit
ora: Thank you, everyone