Warning:
This wiki has been archived and is now read-only.

Non-unique Tables

From RDB2RDF
Jump to: navigation, search

SQL tables and views have 0 or more unique keys. Those with 0 unique keys may have repeated rows. This case is captured by test cases 1table3columns3rows2duplicates and 2tables2duplicates0nulls.

A R2RML subjectMap is one of

  • a constant RDF term - rr:constant <foo>
  • a column value - rr:column "PAGE" (where the PAGE column has URLs)
  • a generated IRI - rr:template "http..{EMPNO}"); rr:termType rr:IRI
  • a generated blank node - rr:template "{lname}|{amount}"; rr:termType rr:BlankNode

A table like:

<tbody>
 </tbody>
IOUs
fname lname amount
BobSmith30
SueJones20
BobSmith30

Because the 1st and 3rd rows have the same values, no template string can create different subjects for the two rows. Because RDF is a set of triples, a assertion generated for each of the repeated rows:

_:Smith30 <IOUs#amount> 30.
_:Smith30 <IOUs#amount> 30.

yield a single triple.

Because the DM generated unique blank nodes for each row in a table with no primary key, these triples would have different subjects:

_:a <IOUs#amount> 30.
_:b <IOUs#amount> 30.

This makes it impossible for one to generate an R2RML mapping to the DM of a table with potentially repeated rows. Richard's proposal is to relax this requirement to permit these triples to have a single blank node, thus collaposing to one triple. In the cases where the person configuring R2RML knows these rows are unique (perhaps do to domain knowledge or constraints in the applications which populate the database), it would do no harm for the default behavior to generate repeated blank nodes for repeated rows because the user knows there are no repeated rows. Likewise if the user knows that repeated rows are uninteresting to the application. they may also be served by a default of generating the same node.

application environment enforces uniqueness

If the database schema doesn't contain any assertion of uniqueness, but it has a timestamp with a 1s granularity and the user knows that he recieves IOUs serially (no multi-tasking) and it takes him more than a second then this table:

<tbody>
 </tbody>
IOUs
when fname lname amount
2012-05-13 01:23:45BobSmith30
2012-05-13 01:23:46SueJones20
2012-05-13 01:23:47BobSmith30

could be safely captured with subject blank nodes which are keyed off the values in the table:

_:2012-05-13 01:23:45|Smith|Bob|30 <IOUs#amount> 30.
_:2012-05-13 01:23:47|Smith|Bob|30 <IOUs#amount> 30.

ericP's counter argument

R2RML can't capture the use and meaning of a table which legitimately has no unique key; it only captures the meaning of those tables which are missing unique constraints. Before recieving this extra information, it would be inaccurate to claim to capture the meaning and use of the table, just as it would be inaccurate to interpret all strings as integers just because some malformed databases may have strings which are interpreted by the application as integers. If the user user knows that a unique key is missing, they can either change the schema passed to the R2RML tool or write a subject map which is consistent with the actual data rather than the reported schema.