Re: Need for a canonical byte stream for an RDF graph

I have sent a related member confidential message concerning this topic 
to patent-issues

http://lists.w3.org/Archives/Member/patent-issues/2011AprJun/0002.html

Please do not read that message if your organization prohibits you from 
doing so.
If in doubt about your organizational policy please refer to your AC rep.
(Although frankly, it is not worth the effort)

Please also do not follow up on the content of that message on this list 
or any other technically focused list of the W3C (including 
member-rdf-wg), out of respect to those participants whose organizations 
have such policies.

I am not sure where discussion of such matters is permitted; I am sure 
the chairs will advise if they believe discussion is necessary.
I am also quite happy to follow up to private e-mails etc.

Jeremy


On 6/23/2011 4:39 PM, Gavin Carothers wrote:
> One of the commonly mentioned uses cases for named graphs and graph
> literals is their use with digital signatures. At the moment signing
> an RDF graph is impossible. Today applications and systems can sign
> serializations of an RDF graph. However, there are issues with all of
> those serializations, and the whole notion of signing the serialized
> form. For example as we know a given RDF graph can be serialized into
> a huge number of equivalent RDF/XML files. We also know that RDF/XML
> can't represent all RDF graphs. If we leave the creation of a
> canonical form up to each serialization we are likely to end up with a
> large number of canonicalization methods. One for RDF/XML, one for
> Turtle, one for JSON-LD, one for RDF-JSON, etc. If we look at RDFa,
> the situation gets really strange. What the heck do you sign? The
> whole HTML, what if you just want to sign the data?
>
> It seems that creating a standard canonical byte stream based on an
> abstract syntax graph would be very useful. It also seems reasonably
> easy. The simplest naive approach of sorting N-Triples works
> reasonably well. Until that is you encounter our good old friend Blank
> Nodes. Happily the majority of the work needed to address blank nodes
> can be found in Jeremy Carroll's paper "Signing RDF Graphs"
> http://www.hpl.hp.com/techreports/2003/HPL-2003-142.html from back in
> 2003.
>
> What follows is a simple summary of the method in "Signing RDF Graphs
> 2003". Please read the paper for all the gory details. Basically, we
> sort the triples in lexicographic order, ignoring blank nodes. There
> are some remaining issues with different triples that are identical
> when ignoring blank nodes. It is usually possible to number the blank
> nodes in document order (in the output document), to break these ties.
> There are cases for which it is not possible to make such a
> tie-breaking numbering. These cases can be resolved before saving the
> document by modifying the graph by explicitly and nondeterministically
> adding formally meaningless triples. Output it all as N-Triples. This
> process is always O(n log (n)), there are also a number of possible
> optimizations mentioned in the paper.
>
> The nondeterministic nature of the process isn't great. Happily lots
> and lots of real world RDF graphs will never get to the
> nondeterministic step. In fact it would be perfectly reasonable for a
> system or application present giant red flashing lights when
> publishing such data.
>
> There is also a bit of an issue in that at the moment the paper
> defines the canonical graph as after deleting all non-distinctive
> triples the first occurrence of each blank node identifier appear in
> numeric order starting at 1 without gaps. It is possible to make sure
> that the blank node labels stay consistent as well.
>
> There are other non-signing uses. By using a line oriented canonical
> form existing version control and differencing tools will work right
> out of the box. A normal unified diff is suddenly useful not just for
> code, but for data.
>
> Welcome further discussion and feedback, specifically from people
> looking at implementing canonicalization in a specific serialization,
> and if this can meet your use cases.
>
> Cheers,
> Gavin
>
> PS. House keeping, the addition of a semantics free predicate may mean
> reopening ISSUE-56.
>

Received on Thursday, 23 June 2011 23:49:49 UTC