Warning:
This wiki has been archived and is now readonly.
Semantics of R2RML
Contents
Introduction
This document presents the semantics of R2RML. The semantics of R2RML are equivalent to nonrecursive datalog with negation. We first present an overview of Datalog and present the syntax and semantics of datalog. Additionally we present two variations of mappings that can be done based on this semantics. First, we define the DATABASEINSTANCEONLY mapping. In this mapping, we only consider the instances of the database in order to generate RDF. Second, we define the DATABASESCHEMAONLY mapping. This mapping only considers the schema of the database in order to generate the ontology in RDFS/OWL.
TODO: Map the syntax of R2RML to datalog
Datalog
Syntax of Datalog rules
A Datalog rule is a rule of the form:

P(x) ← P_{1}(x_{1}), ..., P_{k}(x_{k}), not Q_{1}(y_{1}), ..., not Q_{m}(y_{m}), u_{1} ≠ v_{1}, ..., u_{n} ≠ v_{n}
where:
 k > 0, m ≥ 0 and n ≥ 0
 P, P_{1}, ..., P_{k}, Q_{1}, ..., Q_{m} are (non necessarily distinct) relation symbols
 each of x, x_{1}, ..., x_{k}, y_{1}, ..., y_{m} is a tuple of variables (elements from V) and constants (elements from U)
 every u_{i} (1 ≤ i ≤ n), and every v_{i}, is either a variable or a constant
 the following safety conditions are satisfied:
 every variable in x is mentioned in some tuple x_{j} (1 ≤ j ≤ k)
 every variable in y_{i} (1 ≤ i ≤ m) is mentioned in some tuple x_{j} (1 ≤ j ≤ k)
 if u_{i} is a variable (1 ≤ i ≤ n), then u_{i} is mentioned in some tuple x_{j} (1 ≤ j ≤ k)
 if v_{i} is a variable (1 ≤ i ≤ n), then v_{i} is mentioned in some tuple x_{j} (1 ≤ j ≤ k)
It should be noticed that if m = 0 in the previous Datalog rule, then the rule does not include any element of the form "not Q(x)". Similarly, if n = 0 in the previous Datalog rule, then it does not include any inequalities.
The following is a Datalog rule for our running example:
triple(x, "name", y) ← student(x, y)
In this rule, x and y are variables, and "name" is a constant (an element from U). From now on, we use double quotes in Datalog rules to denote constants.
As a second example, consider the following Datalog rules for our running example:
A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z B(x) ← enrolled(x, y_{1}), enrolled(x, y_{2}), course(y_{1}, w_{1}, z_{1}), course(y_{2}, w_{2}, z_{2}), y_{1} ≠ y_{2}, z_{1} ≠ "CS", z_{2} ≠ "CS"
Intuitively, the first rule retrieves all the students that are taking at least two distinct courses, while the second rule retrieves all the students that are taking at least two distinct courses that are not given in the "CS" department. Finally, consider the following Datalog rule that uses the relation symbol A just defined:
C(x) ← enrolled(x, y), not A(x)
Intuitively, this rule retrieves all the students that are taking exactly one course: enrolled(x, y) indicates that x is taking at least one course, while not A(x) indicates that x is not in the table A, that is, x is not taking at least two distinct courses.
We conclude this section by mentioning that in the Datalog rule shown at the beginning, P(x) is the head of the rule and P_{1}(x_{1}), ..., P_{k}(x_{k}), not Q_{1}(y_{1}), ..., not Q_{m}(y_{m}), u_{1} ≠ v_{1}, ..., u_{n} ≠ v_{n} is the body of the rule.
Nonrecursive Datalog programs
A Datalog program Π is a finite set of Datalog rules. In Π, a relation symbol P is intensional if it is mentioned in the head of some rule of Π, and it is extensional otherwise. Then a Datalog program Π is said to be defined over a schema R if the set of extensional relation symbols of Π is a subset of R. For example, the following is a Datalog program over the schema of our running example:
A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z C(x) ← enrolled(x, y), not A(x)
To see why this is the case, it should be noticed that {A, C} is the set of intesional relation symbols of this program, while {enrolled} is the set of extensional relation symbols of this program.
A Datalog program Π is said to be nonrecursive if there exists a function f that assigns a positive number to each relation symbol in Π in such a way that for every rule in Π, if P is the relation symbol in the head of the rule and Q is a relation symbol in the body of the rule, then f(P) > f(Q). Thus, for example, the Datalog program:
A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z C(x) ← enrolled(x, y), not A(x)
is nonrecursive as the function f defined as f(C) = 3, f(A) = 2, f(enrolled) = 1 satisfies the aforementioned condition. On the other hand, the following Datalog program:
A(x) ← E(x,y), A(y)
as well as:
A(x) ← E(x,y) A(x) ← B(x) B(x) ← C(x) C(x) ← A(x)
are recursive Datalog programs.
Intuitively, a Datalog program is recursive if one of its intensional predicates is defined in terms of itself. The following is a typical example of a recursive Datalog program:
flight(x,y) ← non_stop_flight(x,y) flight(x,y) ← non_stop_flight(x,z), flight(z,y)
This program retrieves the pairs of cities (x,y) such that there is a way to flight from x to y that may include an arbitrary number of stopovers.
Important: In what follows, we consider only nonrecursive Datalog programs.
Semantics of Datalog programs
To define the semantics of Datalog programs, we need to introduce some terminology. A substitution σ is a function from V ∪ U to U that is the identity on the constants (that is, σ(c) = c for every c ∈ U). Given a tuple x=(x_{1}, ..., x_{k}) of constants and variables and a substitution σ, tuple σ(x) is defined as (σ(x_{1}), ..., σ(x_{k})). For example, if x= (x, "name", y) and σ is a substitution such that σ(x) = "1" and σ(y) = "John", then σ(x) = ("1", "name", "John").
Let R be a schema, D an instance of R, and Π a Datalog program over R, and assume that f is a function that assigns a positive number to each relation symbol in Π in such a way that for every rule in Π, if P is the relation symbol in the head of the rule and Q is a relation symbol in the body of the rule, then f(P) > f(Q) (such a function exists since Π is assumed to be nonrecursive). The evaluation of Π over D assigns to every predicate symbol R mentioned in Π a relation R^{D} of the corresponding arity. Formally, this evaluation is recursively defined as follows.
(1) For every extensional relation symbol R mentioned in Π, relation R^{D} is just the relation assigned to R by D.
(2) Assume that P is an intensional relation symbol in Π and that for every relation symbol Q such that f(P) > f(Q), relation Q^{D} has already been computed. Then a tuple c of constants is in P^{D} if and only if there exist a rule in Π:
and a substitution σ such that:
 σ(x) = c
 for every i ∈ {1, ..., k}: σ(x_{i}) is a tuple in P_{i}^{D}
 for every i ∈ {1, ..., m}: σ(y_{i}) is not a tuple in Q_{i}^{D}
 for every i ∈ {1, ..., n}: σ(u_{i}) ≠ σ(v_{i})
It is important to notice that in the previous definition, a function f is used to define the evaluation of Π over D. Thus, it is natural to ask what would happen if one replaces this function by another function g satisfying the condition mentioned in the definition. Given that f is only used to determine the order in which the rules of Π have to be evaluated, it can be formally proved that if one uses g instead of f when evaluating Π over D, then the result is the same.
We now show an example of the evaluation process. Let Π be the following Datalog program over the schema of our running example:
A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z C(x) ← enrolled(x, y), not A(x)
Moreover, let f be a function defined as f(C) = 3, f(A) = 2 and f(enrolled) = 1, which actually shows that Π is a nonrecursive Datalog program. To evaluate Π over our running database instance D, we first notice that:
enrolled^{D} = { (1,2) }
Then we consider intensional predicate A, as we have that f(A) > f(enrolled), enrolled^{D} has already been computed and enrolled in the only relation symbol mentioned in Π whose image under f is smaller than f(A). In this case, we have that:
A^{D} = ∅
as there is no substitution σ such that (σ(x), σ(y)) is in enrolled^{D}, (σ(x), σ(z)) is in enrolled^{D} and σ(y) ≠ σ(z). Finally, we consider intensional predicate C, for which we have that:
C^{D} = { 1 }
since for the substitution σ such that σ(x) = 1 and σ(y) = 2, we have that (σ(x), σ(y)) is in enrolled^{D} and σ(x) is not in A^{D}. Notice that this result corresponds with our intuition about the definition of relation symbol C, as C^{D} contains the set of students in D that are taking exactly one course.
Builtin predicates
Fix a relational schema R, and assume that B is a set of relation symbols that are not mentioned in R. Moreover, assume that for every relation symbol R in B of arity n, there exists a (nonnecessarily finite) nary relation I of elements from U such that for every instance D of R, it holds that R^{D} = I. That is, the interpretation of each relation symbol in B is fixed (it does not depend on the database instances) and may be infinite.
Each relation symbol in B is called a builtin predicate. For example, if U is the set of natural numbers, then < is a builtin predicate whose interpretation in each database instance D is:
<^{D} = { (n,m)  n and m are natural numbers and n is smaller than m }
Builtin predicates can be included in Datalog rules. More precisely, a Datalog rule with builtin predicates is a rule of the form:

P(x) ← P_{1}(x_{1}), ..., P_{k}(x_{k}), not Q_{1}(y_{1}), ..., not Q_{m}(y_{m}), u_{1} ≠ v_{1}, ..., u_{n} ≠ v_{n}, R_{1}(w_{1}), ..., R_{s}(w_{s})
where:
 k > 0, m ≥ 0, n ≥ 0 and s ≥ 0
 P, P_{1}, ..., P_{k}, Q_{1}, ..., Q_{m} are (non necessarily distinct) relation symbols not mentioned in B
 R_{1}, ..., R_{s} are (non necessarily distinct) builtin predicate symbols (relation symbols from B)
 each of x, x_{1}, ..., x_{k}, y_{1}, ..., y_{m}, w_{1}, ..., w_{s} is a tuple of variables and constants
 every u_{i} (1 ≤ i ≤ n), and every v_{i}, is either a variable or a constant
 the following safety conditions are satisfied:
 every variable in x is mentioned in some tuple x_{j} (1 ≤ j ≤ k)
 every variable in y_{i} (1 ≤ i ≤ m) is mentioned in some tuple x_{j} (1 ≤ j ≤ k)
 if u_{i} is a variable (1 ≤ i ≤ n), then u_{i} is mentioned in some tuple x_{j} (1 ≤ j ≤ k)
 if v_{i} is a variable (1 ≤ i ≤ n), then v_{i} is mentioned in some tuple x_{j} (1 ≤ j ≤ k)
 every variable in w_{i} (1 ≤ i ≤ s) is mentioned in some tuple x_{j} (1 ≤ j ≤ k)
It is important to notice that the last safety condition is imposed to avoid rules like the following:
A(x) ← B(y), y < x
which may have an infinite number of solutions if U is the set of natural numbers and < is the usual order on these numbers.
A Datalog program Π with builtin predicates is a finite set of Datalog rules with builtin predicates. A Datalog program Π is said to be defined over relational schema R with builtin predicates B if the set of extensional relation symbols of Π is a subset of R ∪ B.
The semantics of Datalog programs with builtin predicates is a simple extension of the semantics defined above. Let D be an instance of schema R and Π a Datalog program over R with builtin predicates B, and assume that f is a function that assigns a positive number to each relation symbol in Π in such a way that for every rule in Π, if P is the relation symbol in the head of the rule and Q is a relation symbol in the body of the rule, then f(P) > f(Q) (such a function exists since Π is assumed to be nonrecursive). The evaluation of Π over D assigns to every predicate symbol R mentioned in Π a relation R^{D} of the corresponding arity. Formally, this evaluation is recursively defined as follows.
(1) For every extensional relation symbol R mentioned in Π, relation R^{D} is just the relation assigned to R by D (in particular, for every builtin predicate R mentioned in Π, relation R^{D} is just the fixed interpretation assigned to R by D).
(2) Assume that P is an intensional relation symbol in Π and that for every relation symbol Q such that f(P) > f(Q), relation Q^{D} has already been computed. Then a tuple c of constants is in P^{D} if and only if there exist a rule in Π:
and a substitution σ such that:
 σ(x) = c
 for every i ∈ {1, ..., k}: σ(x_{i}) is a tuple in P_{i}^{D}
 for every i ∈ {1, ..., m}: σ(y_{i}) is not a tuple in Q_{i}^{D}
 for every i ∈ {1, ..., n}: σ(u_{i}) ≠ σ(v_{i})
 for every i ∈ {1, ..., s}: σ(w_{i}) is a tuple in R_{i}^{D}
DatabaseInstanceOnly Mapping
...
DatabaseSchemaOnly Mapping
blabla