Re: deterministic naming of blank nodes from ahogan@dcc.uchile.cl on 2015-05-13 (semantic-web@w3.org from May 2015)

From: <ahogan@dcc.uchile.cl>
Date: Wed, 13 May 2015 03:14:50 -0300
To: semantic-web@w3.org
Message-ID: <ec6834a5cba3b5369b05d1130a4d41a2.squirrel@webmail.dcc.uchile.cl>
Hi,

Sorry for resurrecting old threads but this discussion sort of prompted me
to work on the topic of "deterministic naming of blank nodes" and I
thought that it might be time (six months or so later) to report back on
some findings.

The result is the paper here:

Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
WWW, Florence, Italy, May 18–22, 2015 (to appear).
Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf

I'll be presenting it next Friday if anyone happens to be around the
Florence area.

Roughly speaking, the paper presents an algorithm for deterministically
labelling blank nodes of any RDF graph such that the labels of two
isomorphic RDF graphs will "correspond". This could be used to find
isomorphic RDF graphs in a large set of RDF graphs (without needing
pairwise checks), for computing signatures, or for Skolemising in such a
way that the output of two RDF graphs will be equal if and only if they
are isomorphic.

The algorithm again works for general RDF graphs and doesn't assume any
well-behavedness or similar. The algorithm is exponential (based roughly
on a Nauty-style recursion) but the experiments show that, in terms of
isomorphism at least, you would have to have really really really *really*
weird RDF graphs to run into any real trouble (we're talking cliques of
blank nodes with k>55 or Miyazki graphs or similar). We skolemised all the
individual graphs in the BTC-14 collection ... the slowest took 31–40
seconds, mainly due to its size: 7.3 million triples with 254 thousand
blank nodes.

One drawback of the current approach is that at the moment you have to
load the RDF graph into memory, which may be prohibitive for very very
large graphs (of connected blank nodes), but I'm not sure how often that
happens in practice. (It wasn't a problem for any of the graphs in the
BTC-14 dataset for example.)

There's also a software package available here:

https://github.com/aidhog/blabel/wiki

Of course the usual disclaimers about research code apply!

Anyways, I thought these results might be of interest and hopefully might
have applications in the context of Skolemisation or signing RDF graphs or
similar.

Best,
Aidan

On 21/10/2014 04:30, Laurens Rietveld wrote:
> Coincidentally, I will present SampLD[1] in two hours at the ISWC (11.00
> local time, sala 300), which might be suitable for your experiment.
> SampLD is a scalable approach for sampling RDF graphs, solely using the
> topology of the graph:
>
>     The Linked Data cloud has grown to become the largest knowledge base
>     ever constructed.
>     Its size is now turning into a major bottleneck for many
>     applications. In order to facilitate access to this structured
>     information, this paper proposes an automatic sampling method
>     targeted at maximizing answer coverage for applications using SPARQL
>     querying.
>     The approach presented in this paper is novel: no similar RDF
>     sampling approach exist. Additionally, the concept of creating a
>     sample aimed at maximizing SPARQL answer coverage, is unique.
>     We empirically show that the relevance of triples for sampling (a
>     semantic notion) is influenced by the topology of the graph (purely
>     structural), and can be determined
>     without prior knowledge of the queries.
>     Experiments show a significantly higher recall of topology based
>     sampling methods over random and naive baseline approaches (e.g. up
>     to 90% for Open-BioMed at a sample size of 6%).
>
>
> And, to support LOD-scale experiments, you might be interested in the
> LOD Laundromat[2,3] as well, presented this Thursday at ISWC as well.
> Here, we (re)publish clean, sanitized versions of Linked Data sets as
> sorted gzipped n-triples (13 billion triples and counting), with
> corresponding meta-data (e.g. the number of blank nodes in this
dataset[4]).
>
> Best, Laurens
>
> [1]
> http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf
> [2]
> http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf
> [3] http://lodlaundromat.org
> [4]
> http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100
> --
>
> VU University Amsterdam____
>
> Faculty of Exact Sciences____
>
> Department of Computer Science____
>
> De Boelelaan 1081 A____
>
> 1081 HV Amsterdam____
>
> The Netherlands
>
> www.laurensrietveld.nl <http://www.laurensrietveld.nl>
>
> laurens.rietveld@vu.nl <mailto:laurens.rietveld@vu.nl>
>
> Visiting address:
>
> De Boelelaan 1081____
>
> Science Building Room T312
>
>
> On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org
> <mailto:sandro@w3.org>> wrote:
>
>     On 10/07/2014 10:25 PM, Sampo Syreeni wrote:
>     > On 2014-10-07, Sandro Hawke wrote:
>     >
>     >> That may be true, but it is hard for me to see how any benefit this
>     >> could bring would outweigh the absolute pain in the ass it would be
>     >> for everyone to change their RDF stacks.
>
>     It was not me who said that.   That was Alan Ruttenberg.
>
>     >
>     > So, why not subdivide the process? It ought to be easy and efficient
>     > enough to detect a rather expansive subset of graphs which do admit
>     > unique and efficient labeling. At the very least graphs which only
use
>     > a blank node precisely twice (to define something and to refer to it
>     > once as in the bracket notation) are pretty simple, using a simple
>     > hash table with counters -- that perhaps being the commonest case as
>     > well.
>     >
>     > If the test succeeds, define a unique labeling based on the rest of
>     > the attributes of the triple and lexical ordering; if not, ask the
>     > user whether general graph isomorphism comparison is wanted, and if
>     > so, do that, somehow signaling that it really went that far (perhaps
>     > inband in the format of the labels? or out of band as the case may
>     > be); if not, or if you can't do graph isomorphism in your code, then
>     > slap on nonunique labels, again differentiating them somehow from the
>     > first two cases.
>     >
>     > That is certainly not an easy or clean solution, but it doesn't break
>     > the stack, and it works in most of the places where you want to do
>     > fast path processing under the assumption that in fact the labels are
>     > canonical, and can be relied upon to have 1-1 correpondence from
>     > syntax to node.
>
>     I agree.   Does anyone have a good sampling of the LOD cloud we could
>     easily use for this experiment?
>
>            -- Sandro
Received on Wednesday, 13 May 2015 06:15:16 UTC