Re: shared bnodes (Skolems, SPARQL) from Steve Harris on 2012-08-31 (public-rdf-wg@w3.org from August 2012)

From: Steve Harris <steve.harris@garlik.com>
Date: Fri, 31 Aug 2012 13:59:59 +0100
To: Sandro Hawke <sandro@w3.org>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-Id: <FDD7E990-3D12-43F5-9095-6E4DD8305180@garlik.com>
On 2012-08-29, at 19:50, Sandro Hawke wrote:
> On 08/29/2012 11:19 AM, Steve Harris wrote:
>> On 2012-08-29, at 15:59, Sandro Hawke wrote:
>>> On 08/29/2012 08:43 AM, Steve Harris wrote:
>>>> On 2012-08-29, at 13:15, Sandro Hawke wrote:
>>>>> On 08/28/2012 11:52 AM, Steve Harris wrote:
>>>>>> On 2012-08-24, at 16:52, Sandro Hawke wrote:
>>>>>> 
>>>>>> … snip …
>>>>>> 
>>>>>>>> And sub/union graphs in general.
>>>>>>>> 
>>>>>>>> Union graphs for those systems that already make one graph the union of all others.  Whether we like it or not, those systems are common, maybe even the majority, and have been for several years.
>>>>>>>> 
>>>>>>>> It is the compromise of the context point-of-view and the multiple-graphs point-of-view.  In the context POV,
>>>>>>>> 
>>>>>>>> (this is not advocacy, more like 'history')
>>>>>>>> 
>>>>>>> agreed.
>>>>>>> 
>>>>>>> To put that slightly differently: shared bnodes are also required for the SPARQL dump & restore use case.
>>>>>> Yup, that was one of the motivations for Skolem URIs.
>>>>> How would that work, if there was already Skolemized RDF in the dataset? (And there will be, if your dataset comes from crawling other people's data sources, and those data sources emit Skolemized RDF, as we're expecting they will sometimes.)
>>>>> 
>>>>> I guess you could make a new Skolem prefix (eg http://example.com/.well-known/genid/backup-20120829T081103/) and genid your bnodes to new URLs starting with that string -- and then pass that string along with the backup file.     But keeping those together might be difficult, and if you're going to do that, there's no need for any sort of standard format for Skolems.
>>>> Well, in 4store (for e.g.) the Skolem URIs generally look like:
>>>>    http://4store.org/.well-known/genid/[UUID_for_DB]/[ID_number]
>>>> so the store can recognise its own bNodes, and convert them back into internal IDs if it gets them back.
>>> I'm not quite sure what a DB is, but it seems like it would be kind of hard to control whether the nodes are de-Skolemized on database-restore -- users would have to understand whether they were loading it into the "same" DB or not.
>> DB = Database / graphstore / instance / whatever term you want to use.
>> 
>> The users don't need to understand it; that case was covered by my 2nd para. If the data did come from this store, then the store will recognise the prefix+UUID when it gets it back. That case is easy.
> 
> But how does a user know when two 4store instances are the "same" DB?   Sometimes I want to move a database from one instance to another (or is it the same one? I dunno) without anything about it being changed.  It sounds like with the Skolem approach, I couldn't do that.    (Although, yes, the change would be at a level that we might be considering noise.)

I think I wasn't being clear - the user doesn't have to know - it will work either way.

The two cases were an either-or situation: whichever one applies (or a combination of both), the data will come back in as shared bNodes, as long as it's legal to translate Skolem constant → bNode identifier internally.
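
As a rough illustration of those two cases, here is a minimal Python sketch. It is not 4store's code: the class name, the database UUID, and the starting value for fresh IDs are all made up for the example, and it assumes that translating Skolem constants into internal bNode identifiers is permitted at all.

```python
import re
from typing import Dict, Optional

# Hypothetical values for illustration only; not taken from a real 4store instance.
GENID_PREFIX = "http://4store.org/.well-known/genid/"
LOCAL_DB_UUID = "123e4567-e89b-12d3-a456-426614174000"

# Any IRI whose path starts with /.well-known/genid/ is treated as a Skolem IRI.
GENID_RE = re.compile(r"^https?://[^/]+/\.well-known/genid/.+$")


class BNodeMapper:
    """Maps Skolem IRIs to internal bNode IDs, one ID per IRI across all graphs."""

    def __init__(self, db_uuid: str, first_free_id: int = 1_000_000) -> None:
        self.own_prefix = f"{GENID_PREFIX}{db_uuid}/"
        self.foreign: Dict[str, int] = {}
        # Assumption: fresh IDs start above every ID the store has already issued.
        self.next_id = first_free_id

    def to_internal(self, iri: str) -> Optional[int]:
        # Case 1: the IRI carries this store's own prefix + UUID, so the
        # trailing number is the original internal ID.
        if iri.startswith(self.own_prefix):
            tail = iri[len(self.own_prefix):]
            if re.fullmatch(r"[0-9]+", tail):
                return int(tail)
        # Case 2: any other .well-known/genid IRI. It is globally unique, so the
        # same IRI always maps to the same fresh ID, whichever graph it is in.
        if GENID_RE.match(iri):
            if iri not in self.foreign:
                self.foreign[iri] = self.next_id
                self.next_id += 1
            return self.foreign[iri]
        # Not a Skolem IRI: leave it as an ordinary IRI.
        return None
```

Either way the user never has to say whether the data is going back into the "same" DB; the prefix check decides which case applies.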

>>>> Other Skolem URIs in .well-known form can also be converted into new internal bNode identifiers, in the same way bNode labels are, but they're globally unique, so you can safely map any Skolem URI to the same bNode ID across graphs. I don't know for sure if 4store does this or not, but it could.
>>> Is it okay for RDF client software to silently and automatically turn Skolem IRIs back into blank nodes?    (That will change the results of some SPARQL queries on that data.)   If it does this, how long does it have to keep the IRI-bnode map around?   For as long as it has that blank node?
>> That's a good question. http://www.w3.org/TR/2011/WD-rdf11-concepts-20110830/#section-skolemization says you can turn bNodes into Skolem URIs, which also changes SPARQL queries… unless ISBLANK(<http://example.com/.well-known/genid/1>) is true, which I believe it is not.
>> 
>> It doesn't say that you can turn Skolem URIs into internal bNode identifiers (not actually the same thing as a bNode, but that's more-or-less an implementation detail). It is something we discussed at the F2F though, the rationale being that internal bNode identifiers are more efficient to store.
>> 
>>> I don't know the best answer to these questions, or even if we have to answer them, I guess.
>> The one thing we need to agree is what happens if you see:
>> 
>> <http://example.com/a> { _:a a <Foo> }
>> <http://example.com/b> { _:a a <Bar> }
>> 
>> i.e. is there one bNode in two graphs, or two, one in each graph?
> 
> Exactly.   This is ISSUE-21 ("Can Node-IDs be shared between parts of a quad/multigraph format?")
> 
> We could do a strawpoll on that here and now.
> 
> My vote, not surprising anyone, would be:
> 
> +1 (shared bnodes are needed for several use cases and are simpler than using Skolem nodes)


-0.5 it's a significant change in behaviour for some systems, with unknown implications [would be -1 if Jena didn't do it already]
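
To make the behaviour change concrete, the two readings of the _:a fragment quoted above can be sketched in Python. This is only a toy model: the tuple layout and the function name are invented for the example and do not correspond to any particular store or parser.

```python
from typing import List, Set, Tuple

# (graph IRI, bNode label, class) rows standing in for the two-graph fragment.
QUADS: List[Tuple[str, str, str]] = [
    ("http://example.com/a", "a", "Foo"),
    ("http://example.com/b", "a", "Bar"),
]


def distinct_bnodes(quads: List[Tuple[str, str, str]], shared: bool) -> Set[Tuple[str, str]]:
    """Return the set of distinct bNodes under one of the two readings."""
    if shared:
        # Labels are scoped to the whole document: _:a is one node everywhere.
        return {("", label) for _, label, _ in quads}
    # Labels are scoped per graph: _:a in each graph is a different node.
    return {(graph, label) for graph, label, _ in quads}
```

With shared labels the fragment contains one bNode typed as both Foo and Bar; with graph-scoped labels it contains two unrelated bNodes.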

We're not really big users of TriG, so I'd like to hear from people that are. If there aren't any big users of TriG, then I guess we probably should make the change, but I have to question why we're bothering.

- Steve

-- 
Steve Harris, CTO
Garlik, a part of Experian
+44 7854 417 874  http://www.garlik.com/
Registered in England and Wales 653331 VAT # 887 1335 93
Registered office: Landmark House, Experian Way, Nottingham, Notts, NG80 1ZZ
Received on Friday, 31 August 2012 13:00:31 UTC