RDF and SPARQL Working Group

Meeting minutes

Approval of last week’s minutes

<pfps> minutes look acceptable

<ora> PROPOSAL: Approve last week's minutes

<AndyS> +1

<ora> +0 (not present)

<pfps> +1

<TallTed> +1

<niklasl> +1

<j22> +1

<gtw> +1

<Tpt> +0 (not present)

<Souri> +1

<lisp> +1

<doerthe> +1

RESOLUTION: Approve last week's minutes

Allowing \u escaped surrogate pairs

<AndyS> https://lists.w3.org/Archives/Public/public-rdf-star-wg/2026Apr/0028.html

AndyS: There has been an email thread. The sentiment seems to be against allowing \u surrogate pairs, despite the i18n suggestion

AndyS: The choice is whether or not to prevent parsers from doing that

AndyS: We haven't had much response from people working in UTF-16 programming languages like JavaScript, Python or Java

ora: Your suggestion would have been to prevent them?

<AndyS> """

<AndyS> Two adjacent numeric escape sequences forming a Surrogate Pair MAY be converted to a supplementary codepoint as described by Unicode 17.0 section 3.9.2 UTF-16.

<AndyS> """

AndyS: I am pretty neutral

AndyS: We can use a MAY so that there is no obligation to support it in any way
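
For illustration only (not part of the discussion): the conversion that the proposed MAY text refers to is the standard UTF-16 surrogate pair arithmetic. A minimal Python sketch follows; the function name `combine_surrogates` is hypothetical and not taken from any implementation mentioned in the meeting.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The escaped pair \uD83D\uDE00 would correspond to U+1F600.
print(f"U+{combine_surrogates(0xD83D, 0xDE00):04X}")
```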

<Souri> +1 to MAY

<Zakim> pfps, you wanted to say that the current document is perfect

pfps: It appears to me that the document in its current form is perfect

pfps: It states that surrogates are not allowed and that Turtle parsers are allowed to do whatever they want outside the standard

pfps: If we state that "Turtle document MAY include escaped surrogate pairs" it means that every processor should support them

pfps: I read Turtle 1.1 as disallowing surrogate pairs; they are not characters

AndyS: There is no such thing as a "Unicode character"; there are "abstract characters", which you don't write, and 1.1 does not explicitly disallow surrogates. This is the i18n reading

pfps: The intent of 1.1 is that surrogates are not allowed

AndyS: Unfortunately, intent is not speccable

pfps: You are right that "Unicode character" is not a well-defined thing, but I don't think surrogates are "characters"

<gtw> no characters, but there are "noncharacters" (which is orthogonal to this discussion)

AndyS: The nearest thing at this level is a code point. So technically U+0020 is not a character, it's a code point

lisp: I really don't understand what is being argued about. We are talking about something encoded in UTF-8

lisp: If you have surrogate characters you get an invalid sequence
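
For illustration only: the point that surrogate code points have no well-formed UTF-8 form can be checked with Python's built-in strict UTF-8 codec. This is a sketch of codec behaviour in one language, not a statement about any Turtle parser.

```python
# A Python str can hold a lone surrogate code point, but the strict
# UTF-8 codec refuses to encode it.
lone_surrogate = chr(0xD83D)
try:
    lone_surrogate.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ED A0 BD is what a 3-byte encoding of U+D83D would look like;
# the strict decoder rejects it as ill-formed.
try:
    b"\xed\xa0\xbd".decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(encodable, decodable)
```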

AndyS: At the UTF-8 level you get the \u... syntax, which is valid UTF-8

AndyS: What is going on is that you process the input as UTF-8 and you happen to find some escape sequences that encode surrogates

lisp: Even if there is a pair of surrogates, this does not give a valid sequence of code points

gtw: At the point we come across the \u escape, we are no longer parsing the UTF-8 input string; we are manipulating code points

gtw: My point is that \u is something we have defined in our format. It is used to encode Unicode escapes, but it is something we handle

lisp: If that is the case, then Turtle is not UTF-8

lisp: UTF-8 is a sequence of code units (1..4). If you put an escape for a code unit into it, then we are doing something invalid

AndyS: The 12 characters \u....\u.... are decoded into a single Unicode code point, just like \n gives you the newline code point

lisp: We may do that, but then you are no longer decoding UTF-8

lisp: This sequence of characters has a meaning in Unicode: it is a sequence of two surrogates, and that is not allowed

<Zakim> pfps, you wanted to say that the 12 bytes are *Turtle*

gtw: Turtle sees a sequence of 12 characters \u....\u.... and has to choose whether they are legal or illegal
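
For illustration only: the twelve characters under discussion are plain ASCII, so at the byte level they are well-formed UTF-8 whatever a Turtle parser later decides about the escapes. A small Python check, using \uD83D\uDE00 as an arbitrary example pair:

```python
# Raw string: twelve literal characters, not a decoded escape.
escaped = r"\uD83D\uDE00"
data = escaped.encode("utf-8")

print(len(data))                      # number of bytes
print(all(b < 0x80 for b in data))    # every byte is in the ASCII range
print(data.decode("utf-8") == escaped)  # round-trips as ordinary UTF-8
```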

lisp: If you are decoding UTF-8, it has a definition for code point; you are not out of UTF-8 yet

lisp: If the WG wants to say it's UTF-8, then it's not possible

ora: What you want to say is that it's a "UTF-8 escape sequence"

<pfps> UTF-8 does have a definition of the twelve bytes \unnnn\unnnn, it's twelve Unicode code units, each of which are in the ASCII block

lisp: No, it ends up as a UTF-16 escape sequence. Surrogates are not valid escape sequences

<Zakim> AndyS, you wanted to move on to changes - if any - and response to i18n

AndyS: Can we find some way forward on what we answer to i18n?

AndyS: What I am hearing is that we can keep the current text (surrogates are not allowed) and leave it at that

AndyS: I would be happy if we had some response from people who have worked in UTF-16 centric languages

AndyS: I changed Jena last fall to bring it as close to the spec as possible. Java misrepresents code on output; it currently will accept invalid surrogate pairs but states in the code "this is an extension"

<ora> STRAWPOLL: Keep current text: surrogates not allowed

<pfps> +1

<ora> +1

<Tpt> +0 (neutral)

<lisp> +1

<doerthe> +0

<gtw> +1

<TallTed> +0.8

<niklasl> +0

<j22> -0.5

<AndyS> For \u: A Unicode code point in the ranges U+0000 to U+D7FF and U+E000 to U+FFFF, corresponding to the value encoded by the four hexadecimal digits interpreted from most significant to least significant digit.

<AndyS> "A numeric escape sequence MUST NOT produce a code point value in the range U+D800 to U+DFFF, which is the range for Unicode surrogates."

<Souri> +0

j22: I do not have a strong feeling. It is a feature in the spec that could increase compatibility in existing specs

j22: So, I thought it would be reasonable to follow the suggestion, but I do not have a strong enough feeling

lisp: The text which Andy wrote is an accurate rendition of what UTF-8 requires

lisp: The notion that one will increase compatibility by allowing extensions will mean people won't know what they can expect

lisp: "Surrogates MAY appear" means parsers must implement them to get interoperability

AndyS: The spec text uses the U+ notation, which talks about code points after decoding
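
For illustration only: a strict reading of the \u text quoted above (the escape yields a code point, and the range U+D800 to U+DFFF is rejected) could look like the Python sketch below. The name `decode_u_escapes` is hypothetical, and the sketch is simplified: it handles only the four-digit \u form and does not account for an escaped backslash preceding the `u`.

```python
import re

# Matches a backslash, "u", and exactly four hexadecimal digits.
_U4 = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text: str) -> str:
    """Replace \\uXXXX escapes with their code points, rejecting surrogates."""
    def replace(match: re.Match) -> str:
        value = int(match.group(1), 16)
        if 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"surrogate in numeric escape: \\u{match.group(1)}")
        return chr(value)

    return _U4.sub(replace, text)


print(decode_u_escapes(r"caf\u00E9"))
```

Under this reading, an input such as `\uD83D\uDE00` raises an error on the first escape rather than being combined into a supplementary code point.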

AndyS: If you want to talk about interoperability, we should talk about what systems produce

ora: If we don't allow surrogates, what are the bad things that would happen? I do not have enough experience to know how marginal this issue is

AndyS: The change in Jena to exclude bad use of surrogates did not cause any user reports

Tpt: According to the issue author, DotNetRdf writes escaped surrogate pairs

<pfps> Someone indicated that Python was UTF-16 friendly. I'm not seeing that in documents about strings in Python.

ora: the low energy option is to keep the old text

Tpt: We could put in a note about a common extension to Turtle.

Tpt: What about stating that parsers MAY accept escaped surrogates and serializers MUST NOT use them

gtw: That is the current status quo - don't feel that we need to mention it as it suggests it is acceptable.

gtw: I don't think it's a good idea; it suggests it is something "fine" to do. I am concerned it will cause compatibility issues

ora: Do people want to decide now? In the grand scheme of things, it's a small thing
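
For illustration only: on the serializer side, a writer never needs escaped surrogate pairs, because the Turtle grammar also has an eight-digit \U escape for code points above U+FFFF. The Python sketch below uses a hypothetical function name, `escape_non_ascii`, and covers numeric escapes only; quoting of backslashes, quotes and control characters is left out.

```python
def escape_non_ascii(text: str) -> str:
    """Escape non-ASCII code points as \\uXXXX or \\UXXXXXXXX, never as surrogate pairs."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # A lone surrogate in the input cannot be written out.
            raise ValueError(f"cannot serialize surrogate U+{cp:04X}")
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            # Supplementary code point: one 8-digit escape instead of two 4-digit ones.
            out.append(f"\\U{cp:08X}")
    return "".join(out)


print(escape_non_ascii("smile \U0001F600"))
```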

ora: Adrian and I are giving a small status update at the Knowledge Graph Conference

<ora> PROPOSAL: Keep current text: surrogates not allowed

<ora> +1

<Souri> +1

<Tpt> +1

<j22> +1

<lisp> +1

<gtw> +1

<niklasl> +1

<AndyS> +0

<pfps> +1

<TallTed> +1

RESOLUTION: Keep current text: surrogates not allowed

AndyS: What we answer to i18n: "the WG has considered your position and decided to keep the current text as it is clearer"

ora: AndyS will respond

Review of open PRs

ora: Where are we in tests?

<j22> +1 to rdf/xml informal meeting next week

Tpt: Tests for NTriples/NQuads/Turtle/TriG are in good shape, RDF/XML is blocked on possible substantive changes, SPARQL still misses some parts like EXISTS

AndyS: What about an informal session next week on RDF/XML?

<gtw> is there a SPARQL TF call tomorrow?

ora: Neither Adrian nor I will likely be there next week

<niklasl> Perhaps also talk a little about w3c/rdf-star-wg#189 next week? (Related.)

<gb> Issue 189 Consolidate advice about versions of RDF syntax and HTTP access (by niklasl) [needs discussion] [ms:CR]

ora: Let's not cancel the meeting, but send an email to the mailing list making this explicit

ora: thank you everyone

– DRAFT –
30 April 2026
