(MEETING TITLE) – 28 September 2022

Meeting minutes

markus_sabadello: We use IRC for queuing, if you want to say something you type "q+".

markus_sabadello: Phil Archer isn't here today, he's the other co-chair.

markus_sabadello: We had a couple of meetings, a kickoff and one at TPAC. Anyone who wasn't there to introduce?

AndyS: I'm Andy Seaborne, I've done W3C work with RDF and SPARQL, I'm interested in making sure there's an extension here for RDF star and looking at some of the issues around how the work is communicated to developers.

markus_sabadello: Yes, RDF* has been mentioned in the meetings before.

Kazue: My name is Kazue Sako, I was at TPAC 2019, at that time I went to the VC WG. I'm looking into doing anonymous credentials and selective disclosure using BBS signatures and I'd be very happy if people would use this for VCs in a very secure way.

markus_sabadello: Anyone else?

Ahmad Alobaid: I made it to the first meeting, couldn't make it to the second meeting.

markus_sabadello: I thought we'd do two things. First, we'd follow up on the TPAC meeting, where we had a joint meeting with the VCWG, and talk about the insights / conclusions from that meeting.

markus_sabadello: The second thing is next steps / first steps on how to get the work started.

markus_sabadello: At the moment, we have github repos and some input documents but we haven't really started to do the actual work and that would be the proposal: follow up on TPAC and then talk next steps.

markus_sabadello: Anything else in terms of agenda items?

markus_sabadello: Also meeting times -- Kazue, you are here and could contribute a lot; is there a better time?

Kazue: This time slot is better for me because otherwise I have ordinary day work meetings. I'm comfortable with this time slot.

markus_sabadello: At the moment we have the meeting scheduled bi-weekly (every other week) at this time and we can discuss moving it if it's an issue for people with the time.

seabass: Hello! I'm Sebastian Crane; pending any paperwork I'll be an Invited Expert to this WG. I'm a member of the SPDX project, which aims to improve the communication of software package metadata.

markus_sabadello: Thanks to Sebastian for the introduction.

<markus_sabadello> Thanks seabass !

Scope of the deliverables

markus_sabadello: At TPAC we had a number of topics, one was scope of deliverables.

markus_sabadello: We looked a little at two deliverables, one was RDF Dataset Canonicalization and another was RDF Dataset Hash

markus_sabadello: We discussed a little at TPAC what the inputs and outputs of the two documents would be and how they relate to each other.

markus_sabadello: Since then there was a little discussion on github.

<markus_sabadello> https://github.com/w3c/rch-rdc/issues/4

<markus_sabadello> https://github.com/w3c/rch-rdh/issues/2

markus_sabadello: These were about what the deliverables will do (inputs / outputs).

markus_sabadello: A question is around whether the output of canonicalization is a final string or a set of labels / blank nodes or an abstract data model of some kind.

markus_sabadello: That was one of the discussions we had and we don't have to solve that right now, I just wanted to ask right now if there are any more thoughts.

markus_sabadello: Whether you were at TPAC or not -- please feel free to provide input / new thoughts on this topic: the scope of the deliverables and how they relate / what they are for.

gkellogg: My thoughts were thinking about the possibility of something like a map / bijection that relates the blank nodes in the input document to their canonicalized form. It seems a little bit fraught -- you can't refer to the blank nodes by their original labels, meaning the output needs to be more concrete.

gkellogg: At least for writing a spec and writing tests for it it would be better to have the input / output be a concrete serialization such as N-quads.

<AndyS> +1

+1 to gkellogg

pchampin: I would argue any implementation ... [bad audio].

markus_sabadello: You said the easiest would be if the output was a concrete serialization, Gregg?

gkellogg: Yes.

markus_sabadello: I think we also said that the output could be something other than the final serialization. Something related to individual statements perhaps. That would be interesting to ZKP / selective disclosure.

markus_sabadello: That's for if there's interest in signing parts of the data.

markus_sabadello: Intermediate results or labeling of blank nodes or whatever could make sense.

markus_sabadello: I was also wondering about the outputs of both specifications. RDF Dataset Canonicalization and RDF Dataset Hash.

markus_sabadello: Do you have an opinion on what those should output?

gkellogg: I think the output of the hash algorithm would be the hash string. I guess the question is what is the input? The mechanisms I'm familiar with would take the N-Quads as input, as the subject over which the hash is computed.

gkellogg: One of the possibilities ... and I confess I didn't go through these issues because I didn't see them until now ... but the canonicalization form runs over statements and creates blank node identifiers for them which would be serialized.

gkellogg: There was some thought that only the blank node identifiers would need to be the output, but the problem is relating them back to the original blank nodes.

gkellogg: Because the hash needs to take into consideration all the data, not just the blank nodes. Perhaps there are some intermediate steps and some form of a structure that would allow access to the structure intermediately, perhaps an ordered array of the statements.

gkellogg: That might satisfy some of the thoughts about ZKP and others.

pchampin: I think Gregg is convincing me. My point was that the algorithm was getting an internal representation of the dataset where blank nodes have a stable identity for the duration of the program.

pchampin: So I'd think a mapping of those blank nodes and the canonical label could be part of the algorithm at least internally. The tricky part is the output.

pchampin: Specifying it -- maybe it's easier to specify a concrete serialization output, but I'd rather like to have some kind of array / abstraction rather than some big string and we can parse it up after the fact but that's overkill.
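To make the trade-off between a concrete string and an array-like abstraction more tangible, here is a minimal sketch of an output that carries both a blank node mapping and an ordered array of statements, with the N-Quads string derived from it on demand. Everything in it is an assumption made for illustration: the `CanonicalizationResult` shape, the helper names, the `_:c14n` label style and the example.org IRIs do not come from either input document, and the mapping is supplied by hand rather than computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalizationResult:
    # Hypothetical shape, not taken from any specification.
    label_map: dict    # input blank node label -> canonical label
    statements: tuple  # relabelled quads (tuples of term strings), sorted


def relabel(quads, label_map):
    """Apply a given blank node mapping to quads and sort the result."""
    relabelled = [
        tuple(label_map.get(term, term) for term in quad)
        for quad in quads
    ]
    return CanonicalizationResult(dict(label_map), tuple(sorted(relabelled)))


def to_nquads(result):
    """Serialize the abstract result into an N-Quads-like string."""
    return "".join(" ".join(quad) + " .\n" for quad in result.statements)


quads = [
    ("_:b1", "<http://example.org/knows>", "_:b0"),
    ("_:b0", "<http://example.org/name>", '"Alice"'),
]
label_map = {"_:b0": "_:c14n0", "_:b1": "_:c14n1"}  # assumed, not computed
result = relabel(quads, label_map)
print(to_nquads(result), end="")
```

A consumer interested only in hashing can call `to_nquads`, while one interested in individual statements (for example for selective disclosure) can work with `result.statements` directly, without parsing a string back apart.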

+1 to pchampin if we can find a way to specify it cleanly

ivan: I don't see a use case to maintain the output, to maintain the relationship between the original blank node labels and the new labels.

ivan: I don't see a use case for that, because I don't think the canonicalized version of the graph is interesting in itself. We just want to see if two graphs are the same, or to input them into the hash function, etc.; the original bnodes are uninteresting.

ivan: I don't think we should go out of our way to output the mapping to the original bnodes, etc.

ivan: First of all, N-Quads is sort of the assembly language of RDF, it's the simplest one for people working in this field, so I'm perfectly fine with saying that the result of the canonicalization is N-Quads and that means we need to define a canonical version of N-Quads, which you need for hashing, it cannot be ignored.

ivan: If we say that the output is a canonical version of N-Quads, sorting them, no extra spaces, etc. it's not a big deal to define that, but that makes it relatively clear and then the hashing algorithm becomes clear and non-problematic. Making it more abstract than that looks nice but complicates the specification more than is needed.
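As a rough illustration of why a sorted, canonical N-Quads output makes the hashing step straightforward, the following sketch hashes a set of statements independently of their arrival order. It assumes the statements already carry canonical blank node labels and normalized whitespace; the function names, the `_:c14n0` label, the example.org IRIs and the choice of SHA-256 are all assumptions for illustration, not decisions of the group.

```python
import hashlib


def canonical_nquads_document(lines):
    """Join already-canonicalized N-Quads statements into one string.

    Assumption: each statement is already in canonical form
    (canonical blank node labels, normalized whitespace and escapes).
    """
    # Drop empty lines and duplicates, then sort by Unicode code point.
    statements = sorted({line.strip() for line in lines if line.strip()})
    return "".join(statement + "\n" for statement in statements)


def hash_dataset(lines):
    """Hash the sorted document; SHA-256 is an illustrative choice."""
    document = canonical_nquads_document(lines)
    return hashlib.sha256(document.encode("utf-8")).hexdigest()


quads_a = [
    '_:c14n0 <http://example.org/p> "b" .',
    '<http://example.org/s> <http://example.org/p> _:c14n0 .',
]
quads_b = list(reversed(quads_a))  # same statements, different order

print(hash_dataset(quads_a) == hash_dataset(quads_b))  # prints True
```

Because the statements are sorted before hashing, the order in which a parser delivers them does not affect the result.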

<gkellogg> Canonical N-Quads would be a straight-forward extension to that for N-Triples https://www.w3.org/TR/n-triples/#canonical-ntriples

<ivan> +1 gkellogg

<pchampin> except that canonical N-Triples is not strictly canonical: "This section defines a canonical form of N-Triples which has less variability in layout"

AndyS: I completely agree with Ivan that the reverse mapping is problematic. The representation can differ, N-Quads one time and Turtle another, and sometimes you can only refer to things by their syntactic location in the document.

<seabass> Wouldn't it be useful to correlate the blank nodes between serialisation formats, so you can easily locate the bnodes found in, say, a JSON-LD document, with the same one in a Turtle document?

AndyS: I think it would be a good idea to avoid defining a new syntax and that will drain WG time. The other thing is that the input may not come directly or easily from a document.

AndyS: Where triples are arriving from a parser in indeterminate order.

<Zakim> dlongley, you wanted to provide a use case

<markus_sabadello> dlongley: Agree with ivan, but there is a use case for having mapping of blank nodes.

<markus_sabadello> dlongley: Use case is related to selective disclosure, and trying to make VCs unlinkable through blank node identifiers.

<markus_sabadello> dlongley: When you assign blank node IDs to a dataset, you may reveal more information than you want to selectively disclose.

<Kazue> Dan and I have a paper on the use case mentioned by Dave

<markus_sabadello> dlongley: Use case to create more herd privacy where you don't use the canonical labels directly, but apply other type of change.

<Kazue> https://sako-lab.jp/download.php?article=ssr2022_proceedings_dan.pdf

<markus_sabadello> dlongley: You can sort dummy values, break ties. Reduce number of combinations.

<markus_sabadello> dlongley: In that use case, it can make sense to have the mapping.

markus_sabadello: Thanks, Dave, for the use case.

ivan: I hadn't realized this use case, it's really interesting. What you need at the end is some transformation of the canonical labels; you don't need the original labels, but you need something that changes the labels.

<markus_sabadello> dlongley: Reason for keeping original labels would be the same for everybody -> create herd privacy. Only when ties need to be broken, you'd get information that is specific to a user. This doesn't occur often.

<markus_sabadello> dlongley: Often, people will have the same labels.

<markus_sabadello> dlongley: If you have an array of values where all predicates are the same, only values are different. In this case the ordering is specific to an individual.

<markus_sabadello> dlongley: In other circumstances, everybody will get herd privacy.

Kazue: I'm not yet used to this conversation -- I have nothing to add to what Dave said, but Dan and I wrote a paper on how we can securely do selective disclosure and I put the link in the chat.

markus_sabadello: Thank you, I read that paper. If you haven't read that paper, you should.

markus_sabadello: Maybe what we should do is continue this topic on the github repo.

markus_sabadello: Reminder that we can create issues on the repositories to keep track of discussions.

markus_sabadello: We don't want to forget them.

<pchampin> Repo for the Canonicalization deliverable: https://github.com/w3c/rch-rdc

markus_sabadello: Do we want to continue this discussion now about the outputs / mapping / labeling?

markus_sabadello: If not, there's another topic to discuss which is follow up from TPAC.

<pchampin> Repo for the Hashing deliverable: https://github.com/w3c/rch-rdh

AndyS: Generalized RDF has bnodes in all sorts of places, such as predicate, that's something to perhaps discuss.

<pchampin> Generalized RDF

AndyS: Generalized RDF is in the 1.1 spec, a lot of the machinery defined for RDF applies to whatever the types of the terms in RDF are, it's not restricted.

AndyS: It does arise naturally in some processing situations where you're forming intermediates that may not escape the system, but it's easier to use the generalized form internally

<Zakim> AndyS, you wanted to ask about generalized RDF.

markus_sabadello: Could you raise an issue?

AndyS: In the canonicalization repo? Should that be the default repo?

ivan: I would think canonicalization will be the main one in practice, but the repos are per document.

<AndyS> I will raise an issue.

<markus_sabadello> Thanks AndyS !

gkellogg: To follow on that, JSON-LD 1.0 allowed and showed use cases for blank nodes as properties. That's been made archaic in some form in 1.1 because I believe Activity Streams makes use of that by setting the `@vocab` to a blank node. You'd get generalized RDF when using that. It's not entirely theoretical.

gkellogg: There are other ways we could have satisfied that use case, but it's a REC, so it's not going away.

markus_sabadello: I want to say hello to a few more people who have joined.

Two presentations from Dave and Aidan

markus_sabadello: We have 20 minutes left -- what I thought would be helpful that we could do on this call, is to talk a little bit about the two presentations during TPAC.

markus_sabadello: We had a presentation by Dave, who is here, about one algorithm for canonicalizing RDF Datasets and one presentation from Aidan Hogan, who also worked on that same topic.

markus_sabadello: In order to get our work started, we could just decide to start from scratch, and define something completely new, but that wouldn't be too smart, it would be better to start with something that exists already.

markus_sabadello: My understanding is too high level to really know what the differences are. I wanted to ask what everyone thought about the two presentations and ask how we can start our work.

markus_sabadello: To be more concrete, should we decide to start with one of them and use that as a basis for the work -- and then can we merge or align them because they are similar, or what do we want to do?

markus_sabadello: We've seen these two pieces of work and they are concrete and have been implemented, what do we want to do? How do they relate to one another or maybe is there another pre-existing work we want to consider? Any thoughts on that?

gkellogg: So I did specifically ask Aidan about his opinion about the congruity of the two algorithms. His feeling was that they substantially do the same thing but don't have the same output. Given the amount of work and maturity of the community specification around Dave's algorithm, including a robust set of test cases, my suggestion would be that the WG adopt that as the basis of the RDF Dataset Canonicalization work.

gkellogg: Of course there's more work to do, but we'd simplify our task immensely if we do that, I see the CCG hasn't put it out as the final report, but once that happens, that would be my suggestion.

markus_sabadello: Concrete proposals are always very helpful, thanks for that.

<Zakim> gkellogg, you wanted to note that Activity Streams can result in blank-node properties and to

ivan: One of those rare cases where Gregg and I don't really agree. It was a bit unfortunate that the two presentations were very different, Aidan went into the algorithm weeds, Dave was more high level.

<gkellogg> Draft CCG report: https://w3c-ccg.github.io/rdf-dataset-canonicalization/spec/index.html

<markus_sabadello> Issue in CCG to publish it as a final report: https://github.com/w3c-ccg/community/issues/234

ivan: Both of the algorithms are two-step algorithms, the first looks at the immediate environment of the blank node and does immediate assignment of the node names. I think that covers the large percentage of the graphs out there. Then there are cases that "do not behave properly" so to say, and more complexity is needed.

ivan: This is where the two algorithms differ from one another. I think due diligence would require some kind of comparison that concentrates on the second part. I think the differences are maybe minor in the first step but I think we still owe the community looking at the second step.
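The first step ivan describes can be illustrated with a simplified sketch: each blank node is hashed from its immediate surroundings only, nodes with a unique hash receive a label straight away, and nodes whose hashes collide are set aside for the more complex second step. This is only a sketch of the general idea and is not the algorithm from either presentation; the masking placeholders `_:a` and `_:z`, the `_:c14n` labels, the function names and the use of SHA-256 are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def is_blank(term):
    return term.startswith("_:")


def first_degree_hash(node, quads):
    """Hash the quads mentioning `node`, hiding all blank node labels.

    The node itself becomes "_:a" and every other blank node "_:z", so
    the hash depends only on the node's immediate surroundings.
    """
    lines = []
    for quad in quads:
        if node in quad:
            masked = [
                "_:a" if term == node else "_:z" if is_blank(term) else term
                for term in quad
            ]
            lines.append(" ".join(masked) + " .\n")
    return hashlib.sha256("".join(sorted(lines)).encode("utf-8")).hexdigest()


def first_pass(quads):
    """Return (labels for uniquely hashed nodes, groups of tied nodes)."""
    nodes = {term for quad in quads for term in quad if is_blank(term)}
    by_hash = defaultdict(list)
    for node in sorted(nodes):
        by_hash[first_degree_hash(node, quads)].append(node)
    labels, ties = {}, []
    for digest in sorted(by_hash):
        group = by_hash[digest]
        if len(group) == 1:
            labels[group[0]] = f"_:c14n{len(labels)}"
        else:
            ties.append(group)  # would need the more complex second step
    return labels, ties


quads = [
    ("_:x", "<http://example.org/name>", '"Alice"'),
    ("_:y", "<http://example.org/name>", '"Bob"'),
    ("_:p", "<http://example.org/knows>", "_:q"),
    ("_:q", "<http://example.org/knows>", "_:p"),
]
labels, ties = first_pass(quads)
print(sorted(labels), ties)
```

In this example `_:x` and `_:y` are distinguished by their literal values, while `_:p` and `_:q` look identical from their immediate surroundings and end up in a tie group.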

<aalobaid> +1

ivan: To understand the differences in efficiency, ease of implementation -- I think to decide on one and throw out the other is not the proper way to do it. I realize we can get into long discussions. But perhaps give us two months to produce a comparison.

ivan: I think we owe that in order to do proper work.

markus_sabadello: I put myself on the queue to mention that the CCG spec is not a final report but we'll be working on that so we can consider that as an input document. I agree with both Gregg and Ivan. I think the spec by Dave in the CCG has quite a bit of exposure and adoption. I think it's a good idea to do a comparison.

<markus_sabadello> dlongley: One approach could be to adopt the CCG spec as a base, and then proceed with the comparison to see how the algorithm could be improved, based on Aidan's work.

<markus_sabadello> dlongley: Make sure that whatever improvements are made are justified, given existing incubation and implementation work.

<markus_sabadello> dlongley: We should be careful how we proceed, need to consider the existing incubation and interoperability.

markus_sabadello: I wonder who would be able to do these comparisons. If we do that comparison and then end up doing what Dave just said -- I still think what Ivan says makes sense to compare them. Who is qualified to do that? Besides Dave, Aidan, Ivan?

ivan: I'm not qualified.

markus_sabadello: I'm not qualified either. Is there anyone in the group who wants to spend some time explaining what the differences are?

markus_sabadello: Perhaps not on the technical level but what the differences mean for users? How many people think they could understand the differences?

Kazue: Where can we find more about these two presentations we are talking about? Maybe that's not why I'd be able to do the comparison.

markus_sabadello: Those presentations were at the TPAC conference, and there are minutes of the meeting. What I can do after the call is post the minutes to the mailing list.

<ivan> relevant minutes of the groups' joint meeting

AndyS: The documents were available, was there a video? I'm mildly interested in getting involved in the explanation, but the key part is not the technical details but to explain what the differences are in a more formal way.

AndyS: The general concepts rather than the exact details. I think that would be good to run through the documents we've produced to give this non-spec grade, but more approachable descriptions of what these complicated things do.

AndyS: The spec-grade stuff just puts people off.

<ivan> Dave's slides start here

<ivan> Aidan's slides start here

markus_sabadello: Maybe we create a github issue for that as well. We want to start doing concrete work, perhaps we want to use the CCG specification as a concrete basis.

markus_sabadello: Perhaps we can find some people within our group to explain a little bit the differences between the approaches.

markus_sabadello: At the same time, not spend too many months on that, because we want to get started on concrete work.

ivan: There is a reference to a paper for Dave's version, which you will find in the minutes as well. For Aidan's published papers, I don't see any reference in the minutes, but I think it's something we can find by the usual research paper searches. I can find that in my files if you need it.

Kazue: Thank you, I'll try.

markus_sabadello: I will find them and create a github issue and link that to the mailing list.

markus_sabadello: Hopefully we can get some more input and we will have this on record.

<ivan> explainer to charter

ivan: There was an explainer attached to the charter.

ivan: There is a reference to Aidan Hogan's paper and Rachel Arnold and Dave Longley's paper.

Kazue: Thank you.

markus_sabadello: We should try to end the meeting 5 minutes before the top of the hour, apologies, best practice for W3C meetings.

markus_sabadello: Thanks for the concrete proposal to start with one document but first see if we can do a comparison.

markus_sabadello: We can talk more at the next meeting in two weeks, in the meantime please use github and the mailing list and thanks for the discussion!

markus_sabadello: Look forward to next steps.

<seabass> thanks; bye!

<yamdan> I really apologize for the very late arrival. I was completely wrong about the time.

<markus_sabadello> Thanks for coming yamdan ! I know it's late. We can consider changing the meeting time if it's inconvenient for you.

<yamdan> Thank you Markus, but I can accept this time. I will avoid this type of mistake...

<markus_sabadello> @Pierre-Antoine Do I have to do anything with the Zakim bot at the end of the meeting?
