Warning:
This wiki has been archived and is now read-only.

RDBNullValues

From RDB2RDF
Jump to: navigation, search

This is a workspace for proposals how to deal with NULL values in RDB tables (see ISSUE-41 and ISSUE-42.

Dealing with RDB NULL values

First, the problem is described, then for the DM and R2RML the proposed solutions are captured.

Problem

Cells in a RDB table may contain a NULL value. The following is the definition of the NULL value as per SQL spec (part 1 of t2006 draft of SQL-2008):

 DEFINITIONS
  
 null value: A special value that is used to indicate the absence of any data value. (p5)
 
 
 4.4.2	The null value
 
 Every data type includes a special value, called the null value, sometimes denoted by the keyword NULL. 
 This value differs from other values in the   following respects:
 
 Since the null value is in every data type, the data type of the null value implied by the keyword NULL cannot be inferred; 
 hence NULL can be used to denote the null value only in certain contexts, rather than everywhere that a literal is permitted.
 
 Although the null value is neither equal to any other value nor not equal to any other value — it is unknown whether or not it
 is equal to any given value — in some contexts, multiple null values are treated together; for example, the <group by clause>
 treats all null values   together. (p15)

For both the DM and R2RML we need to specify how we deal with NULL values.

Direct Mapping

This is ISSUE-42

Comments and Proposal by Enrico

The story:
Suppose I have a RDB which gives me some answers when queried in SQL. I translate the RDB in RDF, and the SQL query in SPARQL (with a direct translation). In absence of NULL values, I get always the same answers from both the SQL and the SPARQL. This somehow proves that the translation is meaningful - indeed we say that the translation of the original query answering problem from RDB/SQL to RDF/SPARQL is sound and complete. However, when NULL values are involved, there are cases when the SPARQL query gives me less or spurious tuples. In this case we say that the translation of the original query answering problem from RDB/SQL to RDF/SPARQL is incomplete and/or unsound.


Consider the following relational databases:

DB1:
R +----+------+
  | ID |   A  |
  +----+------+
  |  1 | NULL |
  +----+------+
DB2:
R +----+
  | ID |
  +----+
  |  1 |
  +----+

Consider now queries and their outcome according to the standard SQL semantics for NULLs.

The query (a) asks for the values in the column A: (:bn type R).(:bn A ?X)
and it should return
in DB1: { {?X=null} }
in DB2: {}

The query (b) asks for the values in the column A after a self-join of R over the column A: (:bn1 type R).(:bn1 A ?X).(:bn2 type R).(:bn2 A ?X)
and it should return
in DB1: {}
in DB2: {}

The query (c) (Q1 MINUS Q2) has two components: Q1 selects the identifiers of all the tuples, and Q2 selects the identifiers of the tuples which have an existing value for the other attributes (i.e., by projecting out the other attributes).
In SQL the two components are written in the same way, since a tuple can not have a 'hole' - namely a missing value (due to the 1NF): (select ID from R) MINUS (select ID from R).
In SPARQL the query would be: (?X type R) MINUS (?X A :bn)
and it should return
in DB1: {}
in DB2: {}

Consider a direct mapping in RDF which does not translate the NULL attribute values.

The query (a) incorrectly returns (INCOMPLETE answer)
in DB1: {}
in DB2: {}

The query (c) incorrectly returns (UNSOUND answer)
in DB1: { {?X=1} }
in DB2: {}

Consider a direct mapping in RDF which does translate the NULL attribute values as a constant NULL.

The query (b) incorrectly returns (INCOMPLETE answer)
in DB1: { {?X=null} }
in DB2: {}

So, the naive approaches based on not translating NULL values or on translating the NULL value as a special constant are wrong.

I am proposing a recipe on how to *use* unmodified standard normative RDF and SPARQL (or any other access language such as a rule language or anything similar) in order to get the correct answers over the graph obtained by a direct mapping where the NULL value is translated as a special recognisable constant.

Consider a direct mapping in RDF which does translate the NULL attribute values as a constant NULL, and the original queries are changed so that for every repeated (aka joined) query variable in the BGP we add a not-null test.

The query (b'): (:bn1 type R).(:bn1 A ?X).(:bn2 type R).(:bn2 A ?X).(?X ≠ NULL)
correctly returns
in DB1: {}
in DB2: {}

This is the standard solution to capture the semantics of SQL NULLs for query languages able to express conjunctive queries and their boolean combinations.

Please note that it is possible to avoid the translation of the NULL value as a constant if there is schema information from which is is possible to derive that a missing value of an attribute is indeed a NULL value. In this case it is also important that NULL values returned in the answers are recognisable as such.

Proposal

It may well be the case that the WG and the SW community at large does not like this approach. I have therefore alternative proposals.

Proposal EF-1: If a relational database contains NULL values, then the direct mapping is not applicable. This case is postponed for consideration to a future WG.

Proposal EF-2: Do generate triples also when the RDB has a NULL, by using a special recognisable constant;
add a note saying that in this case it is possible (by following a simple recipe for SPARQL or rules or other access methods) to relate this interpretation to the semantics of the NULL of the rdb it comes from.
The recipe is as follows: whenever an access method tries to retrieve the value of an attribute, and it is required that this value should be equal to the retrieved value of some other attribute, then the user should add a test checking that such a value is not equal to the NULL value. This captures correctly the semantics of the NULLs of the original relational (SQL) database.

Proposal EF-3: Depending on the value of a flag (default F=true):
F=true) Do generate triples also when the RDB has a NULL, by using a special recognisable constant;
add a note saying that in this case it is possible (by following a simple recipe for SPARQL or rules or other access methods) to relate this interpretation to the semantics of the NULL of the rdb it comes from.
The recipe is as follows: whenever an access method tries to retrieve the value of an attribute, and it is required that this value should be equal to the retrieved value of some other attribute, then the user should add a test checking that such a value is not equal to the NULL value. This captures correctly the semantics of the NULLs of the original relational (SQL) database.
F=false) Do not generate triples when the RDB has a NULL;
add a note saying that in this case it is not known how to relate this interpretation to the semantics of the NULL of the rdb it comes from.

Comments and Proposal by Juan

Consider the same databases from Enrico's proposal. We now have the following SQL queries

Query A: select A from R

DB1: { {A = null} }

DB2: { }

Query B: select * from R as x, R as y where x.A = y.A

DB1: { }

DB2: { }

Now let's show the Direct Mapping which DOES NOT translate NULL attribute values and call this DM_No_Nulls

DB1:

<R/ID=1><type> <R>.
<R/ID=1><R#ID> 1.

DB2:

<R/ID=1><type> <R>. 
<R/ID=1><R#ID> 1.

Now let's show the Direct Mapping which DOES translate NULL attribute values and call this DM_Nulls

DB1:

<R/ID=1><type> <R>. 
<R/ID=1><R#ID> 1.
<R/ID=1><R#A> NULL.

DB2:

<R/ID=1><type> <R>.
<R/ID=1><R#ID> 1.

Now I'm going to show how to get the same answers as the SQL queries but with different SPARQL queries:

Query A on DB1 DM_No_Nulls

SELECT ?A
WHERE{
?x <type> <R>
OPTIONAL{
  ?x <R#A> ?A
}
}

DB1: { {A = null} }

Query A on DB2 DM_No_Nulls

SELECT ?A
WHERE{
?x <type> <R>
 ?x <R#A> ?A
}

DB2: { }

Query A on DB1 DM_Nulls

SELECT ?A
WHERE{
?x <type> <R>
 ?x <R#A> ?A
}

DB1: { {A = null} }

Query A on DB2 DM_Nulls

SELECT ?A
WHERE{
?x <type> <R>
 ?x <R#A> ?A
}

DB2: { }


Query B on DB1 DM_No_Nulls

SELECT ?x
WHERE {
 ?b1 <type> <R>
 ?b1 <R#A> ?x
 ?b2 <type> <R>
 ?b2 <R#A> ?x
}

DB1: { }

Query B on DB2 DM_No_Nulls

SELECT ?x
WHERE {
 ?b1 <type> <R>
 ?b1 <R#A> ?x
 ?b2 <type> <R>
 ?b2 <R#A> ?x
}

DB2: { }

Query B on DB1 DM_Nulls

Yes, we will have a problem.

Proposal

The presence of SQL NULLs needs to be detectable in the Direct Mapping so that either RDF-to-RDF transformations or SPARQL queries can make use of the presence of the NULL. So the Direct Mapping will include triples representing the relational schema and will omit triples for NULL values.

R2RML

This is ISSUE-41.

Proposal:

1) by default R2RML will suppress triples when the subject, predicate, or object columns are NULL (this applies to any of the columns used in template expressions as well as direct column references)

2) if the application needs other handling for NULL values then a SQLQuery can be defined in the mapping to convert NULL values to some other application specific value

3) If one or more of the columns used in a rr:graphTemplate are NULL, then corresponding triples will not be generated either.