Re: ISSUE-9 Another question about Generate Blank Nodes

* Juan Sequeda <juanfederico@gmail.com> [2011-01-31 10:11-0600]
> This may be something we have talked about, so sorry if I'm asking about
> something that already has an answer.
> 
> We assume that a table that does not have a primary key will have a blank
> node as the Row identifier for each tuple.
> 
> But what happens if the table does not have a primary key but does have a
> candidate key(s). Are we still generating a blank node as the Row identifier
> for each tuple? Or could we consider building an IRI with the candidate
> keys?

That would make the rule a bit more complicated to explain to users
and would lead to some design questions: Which candidate key would
dominate when there are several to choose from (e.g. the Projects
table)? How would the dominant key's value be made available when
generating reference triples which link to a non-dominant key?
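
To make the first question concrete, here is a rough Python sketch. The
key_iri helper, the percent-encoding, and the particular pair of
candidate keys on Projects are all assumptions made for illustration;
none of them comes from the direct mapping document.

```python
from urllib.parse import quote

def key_iri(table, row, key_columns):
    """Build a row IRI from the given key columns (hypothetical scheme)."""
    parts = ",".join(
        f"{col}={quote(str(row[col]), safe='')}" for col in key_columns
    )
    return f"{table}/{parts}"

# One row of Projects, with two assumed candidate keys.
row = {"lead": 8, "name": "pencil survey",
       "deptName": "accounting", "deptCity": "cambridge"}
candidate_keys = [("lead", "name"), ("name", "deptName", "deptCity")]

# The same row gets a different IRI under each candidate key.
iris = [key_iri("Projects", row, key) for key in candidate_keys]
```

Both entries of iris identify the same row, so the rule would have to
say which one wins, and a foreign key that carries only the other
key's columns could not rebuild the winning IRI without looking up the
referenced row.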

One use case I want to be sure to address is that of a typical
warehouse merging data from multiple sources, re-populated at a
regular interval (say 3am daily). Sometimes its tables have no primary
key (the candidate keys serve for linking purposes) because a primary
key's values would change every day. Sometimes they do have a primary
key, but its volatility dictates that the key be a secret used only by
the import scripts.


> Consider the following example
> 
> Schema
> Projects(lead, name, deptName, deptCity) where UNIQUE(name, deptName,
> deptCity)

I read this as a superkey encompassing the two candidate keys
described in
<http://www.w3.org/2001/sw/rdb2rdf/directMapping/#ref-no-pk>.

> Instances
> Projects(8, pencil survey, accounting, cambridge)
> Projects(8, eraser survey, accounting, cambridge)
> 
> For each tuple we could create a fresh blank node, or we could create a Row
> IRI for each tuple using the candidate key :
> 
> <Projects/name=pencil survey,deptName=accounting,deptCity=cambridge>
> <Projects/name=eraser survey,deptName=accounting,deptCity=cambridge>
> 
> These IRIs are unique because they come from unique keys.
> 
> What is the consensus here? I do not think this case is covered in the
> current direct mapping doc (right Eric?)

The modeling you're exploring isn't used in the direct mapping doc,
but the use case is addressed. "Referencing tables with empty primary
keys" includes the table with two unique keys and no primary key that
you describe above. The generated graph maintains referential
integrity by labeling the node for one row of the Projects table
as _:c and using that node as the object of all arcs which reference
that row.
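
A minimal Python sketch of that idea follows. It is not the algorithm
from the direct mapping doc: the fresh_bnode and referenced_node
helpers, the node_by_key index, and the two unique keys chosen for
Projects are hypothetical names and data, used only to show one blank
node being shared by every reference to a row.

```python
import itertools

_ids = itertools.count()

def fresh_bnode():
    """Return a new blank node label, distinct from all earlier ones."""
    return f"_:b{next(_ids)}"

# Assumed unique keys on Projects; the table has no primary key.
UNIQUE_KEYS = [("lead", "name"), ("name", "deptName", "deptCity")]

projects = [
    {"lead": 8, "name": "pencil survey",
     "deptName": "accounting", "deptCity": "cambridge"},
    {"lead": 8, "name": "eraser survey",
     "deptName": "accounting", "deptCity": "cambridge"},
]

# Mint one blank node per row and index it under every unique key value,
# so a reference through any of the keys resolves to the same node.
node_by_key = {}
for row in projects:
    node = fresh_bnode()
    for key in UNIQUE_KEYS:
        node_by_key[(key, tuple(row[col] for col in key))] = node

def referenced_node(key, values):
    """Resolve a foreign key value to the blank node of the referenced row."""
    return node_by_key[(key, tuple(values))]

via_first = referenced_node(("lead", "name"), (8, "pencil survey"))
via_second = referenced_node(("name", "deptName", "deptCity"),
                             ("pencil survey", "accounting", "cambridge"))
```

Here via_first and via_second are the same label, so arcs arriving
through either unique key point at the same row node, and no key has
to be declared dominant.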

I think the simple consistency of the current rule will appeal more to
users and implementers. We now have two cases:

  table has a primary key → row node is a function of that primary
  key value.

  table has no primary key → row node is a new blank node.

If row IRIs were built from candidate keys, we would instead have three cases:

  table has a primary key (and any number of candidate keys) → row
  node is a function of that primary key value.

  table has no primary key and no candidate key → row node is a new
  blank node.

  table has no primary key and some candidate keys → row node is a
  function of those candidate key values.
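
The two rules can be compared side by side in a short Python sketch.
The function names, the IRI layout, the percent-encoding, and the
"first candidate key wins" tie-break are assumptions made for the
example, not anything the working group has agreed.

```python
import itertools
from urllib.parse import quote

_ids = itertools.count()

def _key_iri(table, row, key):
    """Format an IRI from the key columns (layout is an assumption)."""
    parts = ",".join(f"{c}={quote(str(row[c]), safe='')}" for c in key)
    return f"<{table}/{parts}>"

def row_node(table, row, primary_key):
    """Two-case rule: primary key gives an IRI, otherwise a blank node."""
    if primary_key:
        return _key_iri(table, row, primary_key)
    return f"_:r{next(_ids)}"

def row_node_with_candidates(table, row, primary_key, candidate_keys):
    """Three-case rule: fall back to a candidate key before a blank node."""
    if primary_key:
        return _key_iri(table, row, primary_key)
    if candidate_keys:
        # Arbitrary tie-break: the first listed candidate key dominates.
        return _key_iri(table, row, candidate_keys[0])
    return f"_:r{next(_ids)}"

person = {"ID": 7}
project = {"lead": 8, "name": "pencil survey"}

with_pk = row_node("People", person, ("ID",))
no_pk_a = row_node("Projects", project, ())
no_pk_b = row_node("Projects", project, ())
with_ck = row_node_with_candidates("Projects", project, (), [("lead", "name")])
```

The extra branch is small in code, but it is the branch that needs a
documented tie-break and a way for referencing tables to find the
chosen key.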


> Cheers
> 
> Juan Sequeda
> +1-575-SEQ-UEDA
> www.juansequeda.com
> 
> 
> On Fri, Jan 21, 2011 at 2:41 PM, RDB2RDF Working Group Issue Tracker <
> sysbot+tracker@w3.org> wrote:
> 
> >
> > ISSUE-9 (bn_directmapping): Generate Blank Nodes for duplicate tuples
> > [Direct Mapping]
> >
> > http://www.w3.org/2001/sw/rdb2rdf/track/issues/9
> >
> > Raised by: Juan Sequeda
> > On product: Direct Mapping
> >
> > Given a table that does not have a primary key, which has duplicate tuples,
> > a different blank node must be created for each tuple.
> >
> > In the Direct Mapping as rules section of the Direct Mapping document, we
> > described this scenario by using all the values of the tuple to create the
> > blank node [1] [2]. However, there is a bug, raised by Alexandre [3]. The
> > issue is that datalog cannot deal with duplicates. Consequently, Marcelo
> > raised the point that we can use simple versions of datalog that can deal
> > with duplicate solutions.
> >
> > Possible solutions:
> >
> > 1) assume that each table implicitly has a row id which is part of its set
> > of attributes. The row id is unique.
> > 2) associates to each tuple an annotation that corresponds to the
> > multiplicity of the tuple in the database. This annotation function
> > corresponds to the function card in the definition of the semantics of
> > SPARQL
> >
> >
> > [1]
> > http://www.w3.org/TR/2010/WD-rdb-direct-mapping-20101118/#rules_table_triples_no_pk
> > [2]
> > http://www.w3.org/TR/2010/WD-rdb-direct-mapping-20101118/#rules_literal_triples_no_pk
> > [3]
> > http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2011Jan/0044.html
> >
> >
> >
> >

-- 
-ericP

Received on Monday, 31 January 2011 19:34:55 UTC