Warning:
This wiki has been archived and is now read-only.
Semantics of R2RML
Contents
Introduction
This document presents the semantics of R2RML. The semantics of R2RML are equivalent to non-recursive datalog with negation. We first present an overview of Datalog and present the syntax and semantics of datalog. Additionally we present two variations of mappings that can be done based on this semantics. First, we define the DATABASE-INSTANCE-ONLY mapping. In this mapping, we only consider the instances of the database in order to generate RDF. Second, we define the DATABASE-SCHEMA-ONLY mapping. This mapping only considers the schema of the database in order to generate the ontology in RDFS/OWL.
TO-DO: Map the syntax of R2RML to datalog
Datalog
Syntax of Datalog rules
A Datalog rule is a rule of the form:
-
P(x) ← P1(x1), ..., Pk(xk), not Q1(y1), ..., not Qm(ym), u1 ≠ v1, ..., un ≠ vn
where:
- k > 0, m ≥ 0 and n ≥ 0
- P, P1, ..., Pk, Q1, ..., Qm are (non necessarily distinct) relation symbols
- each of x, x1, ..., xk, y1, ..., ym is a tuple of variables (elements from V) and constants (elements from U)
- every ui (1 ≤ i ≤ n), and every vi, is either a variable or a constant
- the following safety conditions are satisfied:
- every variable in x is mentioned in some tuple xj (1 ≤ j ≤ k)
- every variable in yi (1 ≤ i ≤ m) is mentioned in some tuple xj (1 ≤ j ≤ k)
- if ui is a variable (1 ≤ i ≤ n), then ui is mentioned in some tuple xj (1 ≤ j ≤ k)
- if vi is a variable (1 ≤ i ≤ n), then vi is mentioned in some tuple xj (1 ≤ j ≤ k)
It should be noticed that if m = 0 in the previous Datalog rule, then the rule does not include any element of the form "not Q(x)". Similarly, if n = 0 in the previous Datalog rule, then it does not include any inequalities.
The following is a Datalog rule for our running example:
triple(x, "name", y) ← student(x, y)
In this rule, x and y are variables, and "name" is a constant (an element from U). From now on, we use double quotes in Datalog rules to denote constants.
As a second example, consider the following Datalog rules for our running example:
A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z B(x) ← enrolled(x, y1), enrolled(x, y2), course(y1, w1, z1), course(y2, w2, z2), y1 ≠ y2, z1 ≠ "CS", z2 ≠ "CS"
Intuitively, the first rule retrieves all the students that are taking at least two distinct courses, while the second rule retrieves all the students that are taking at least two distinct courses that are not given in the "CS" department. Finally, consider the following Datalog rule that uses the relation symbol A just defined:
C(x) ← enrolled(x, y), not A(x)
Intuitively, this rule retrieves all the students that are taking exactly one course: enrolled(x, y) indicates that x is taking at least one course, while not A(x) indicates that x is not in the table A, that is, x is not taking at least two distinct courses.
We conclude this section by mentioning that in the Datalog rule shown at the beginning, P(x) is the head of the rule and P1(x1), ..., Pk(xk), not Q1(y1), ..., not Qm(ym), u1 ≠ v1, ..., un ≠ vn is the body of the rule.
Non-recursive Datalog programs
A Datalog program Π is a finite set of Datalog rules. In Π, a relation symbol P is intensional if it is mentioned in the head of some rule of Π, and it is extensional otherwise. Then a Datalog program Π is said to be defined over a schema R if the set of extensional relation symbols of Π is a subset of R. For example, the following is a Datalog program over the schema of our running example:
A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z C(x) ← enrolled(x, y), not A(x)
To see why this is the case, it should be noticed that {A, C} is the set of intesional relation symbols of this program, while {enrolled} is the set of extensional relation symbols of this program.
A Datalog program Π is said to be non-recursive if there exists a function f that assigns a positive number to each relation symbol in Π in such a way that for every rule in Π, if P is the relation symbol in the head of the rule and Q is a relation symbol in the body of the rule, then f(P) > f(Q). Thus, for example, the Datalog program:
A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z C(x) ← enrolled(x, y), not A(x)
is non-recursive as the function f defined as f(C) = 3, f(A) = 2, f(enrolled) = 1 satisfies the aforementioned condition. On the other hand, the following Datalog program:
A(x) ← E(x,y), A(y)
as well as:
A(x) ← E(x,y) A(x) ← B(x) B(x) ← C(x) C(x) ← A(x)
are recursive Datalog programs.
Intuitively, a Datalog program is recursive if one of its intensional predicates is defined in terms of itself. The following is a typical example of a recursive Datalog program:
flight(x,y) ← non_stop_flight(x,y) flight(x,y) ← non_stop_flight(x,z), flight(z,y)
This program retrieves the pairs of cities (x,y) such that there is a way to flight from x to y that may include an arbitrary number of stopovers.
Important: In what follows, we consider only non-recursive Datalog programs.
Semantics of Datalog programs
To define the semantics of Datalog programs, we need to introduce some terminology. A substitution σ is a function from V ∪ U to U that is the identity on the constants (that is, σ(c) = c for every c ∈ U). Given a tuple x=(x1, ..., xk) of constants and variables and a substitution σ, tuple σ(x) is defined as (σ(x1), ..., σ(xk)). For example, if x= (x, "name", y) and σ is a substitution such that σ(x) = "1" and σ(y) = "John", then σ(x) = ("1", "name", "John").
Let R be a schema, D an instance of R, and Π a Datalog program over R, and assume that f is a function that assigns a positive number to each relation symbol in Π in such a way that for every rule in Π, if P is the relation symbol in the head of the rule and Q is a relation symbol in the body of the rule, then f(P) > f(Q) (such a function exists since Π is assumed to be non-recursive). The evaluation of Π over D assigns to every predicate symbol R mentioned in Π a relation RD of the corresponding arity. Formally, this evaluation is recursively defined as follows.
(1) For every extensional relation symbol R mentioned in Π, relation RD is just the relation assigned to R by D.
(2) Assume that P is an intensional relation symbol in Π and that for every relation symbol Q such that f(P) > f(Q), relation QD has already been computed. Then a tuple c of constants is in PD if and only if there exist a rule in Π:
and a substitution σ such that:
- σ(x) = c
- for every i ∈ {1, ..., k}: σ(xi) is a tuple in PiD
- for every i ∈ {1, ..., m}: σ(yi) is not a tuple in QiD
- for every i ∈ {1, ..., n}: σ(ui) ≠ σ(vi)
It is important to notice that in the previous definition, a function f is used to define the evaluation of Π over D. Thus, it is natural to ask what would happen if one replaces this function by another function g satisfying the condition mentioned in the definition. Given that f is only used to determine the order in which the rules of Π have to be evaluated, it can be formally proved that if one uses g instead of f when evaluating Π over D, then the result is the same.
We now show an example of the evaluation process. Let Π be the following Datalog program over the schema of our running example:
A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z C(x) ← enrolled(x, y), not A(x)
Moreover, let f be a function defined as f(C) = 3, f(A) = 2 and f(enrolled) = 1, which actually shows that Π is a non-recursive Datalog program. To evaluate Π over our running database instance D, we first notice that:
enrolledD = { (1,2) }
Then we consider intensional predicate A, as we have that f(A) > f(enrolled), enrolledD has already been computed and enrolled in the only relation symbol mentioned in Π whose image under f is smaller than f(A). In this case, we have that:
AD = ∅
as there is no substitution σ such that (σ(x), σ(y)) is in enrolledD, (σ(x), σ(z)) is in enrolledD and σ(y) ≠ σ(z). Finally, we consider intensional predicate C, for which we have that:
CD = { 1 }
since for the substitution σ such that σ(x) = 1 and σ(y) = 2, we have that (σ(x), σ(y)) is in enrolledD and σ(x) is not in AD. Notice that this result corresponds with our intuition about the definition of relation symbol C, as CD contains the set of students in D that are taking exactly one course.
Built-in predicates
Fix a relational schema R, and assume that B is a set of relation symbols that are not mentioned in R. Moreover, assume that for every relation symbol R in B of arity n, there exists a (non-necessarily finite) n-ary relation I of elements from U such that for every instance D of R, it holds that RD = I. That is, the interpretation of each relation symbol in B is fixed (it does not depend on the database instances) and may be infinite.
Each relation symbol in B is called a built-in predicate. For example, if U is the set of natural numbers, then < is a built-in predicate whose interpretation in each database instance D is:
<D = { (n,m) | n and m are natural numbers and n is smaller than m }
Built-in predicates can be included in Datalog rules. More precisely, a Datalog rule with built-in predicates is a rule of the form:
-
P(x) ← P1(x1), ..., Pk(xk), not Q1(y1), ..., not Qm(ym), u1 ≠ v1, ..., un ≠ vn, R1(w1), ..., Rs(ws)
where:
- k > 0, m ≥ 0, n ≥ 0 and s ≥ 0
- P, P1, ..., Pk, Q1, ..., Qm are (non necessarily distinct) relation symbols not mentioned in B
- R1, ..., Rs are (non necessarily distinct) built-in predicate symbols (relation symbols from B)
- each of x, x1, ..., xk, y1, ..., ym, w1, ..., ws is a tuple of variables and constants
- every ui (1 ≤ i ≤ n), and every vi, is either a variable or a constant
- the following safety conditions are satisfied:
- every variable in x is mentioned in some tuple xj (1 ≤ j ≤ k)
- every variable in yi (1 ≤ i ≤ m) is mentioned in some tuple xj (1 ≤ j ≤ k)
- if ui is a variable (1 ≤ i ≤ n), then ui is mentioned in some tuple xj (1 ≤ j ≤ k)
- if vi is a variable (1 ≤ i ≤ n), then vi is mentioned in some tuple xj (1 ≤ j ≤ k)
- every variable in wi (1 ≤ i ≤ s) is mentioned in some tuple xj (1 ≤ j ≤ k)
It is important to notice that the last safety condition is imposed to avoid rules like the following:
A(x) ← B(y), y < x
which may have an infinite number of solutions if U is the set of natural numbers and < is the usual order on these numbers.
A Datalog program Π with built-in predicates is a finite set of Datalog rules with built-in predicates. A Datalog program Π is said to be defined over relational schema R with built-in predicates B if the set of extensional relation symbols of Π is a subset of R ∪ B.
The semantics of Datalog programs with built-in predicates is a simple extension of the semantics defined above. Let D be an instance of schema R and Π a Datalog program over R with built-in predicates B, and assume that f is a function that assigns a positive number to each relation symbol in Π in such a way that for every rule in Π, if P is the relation symbol in the head of the rule and Q is a relation symbol in the body of the rule, then f(P) > f(Q) (such a function exists since Π is assumed to be non-recursive). The evaluation of Π over D assigns to every predicate symbol R mentioned in Π a relation RD of the corresponding arity. Formally, this evaluation is recursively defined as follows.
(1) For every extensional relation symbol R mentioned in Π, relation RD is just the relation assigned to R by D (in particular, for every built-in predicate R mentioned in Π, relation RD is just the fixed interpretation assigned to R by D).
(2) Assume that P is an intensional relation symbol in Π and that for every relation symbol Q such that f(P) > f(Q), relation QD has already been computed. Then a tuple c of constants is in PD if and only if there exist a rule in Π:
and a substitution σ such that:
- σ(x) = c
- for every i ∈ {1, ..., k}: σ(xi) is a tuple in PiD
- for every i ∈ {1, ..., m}: σ(yi) is not a tuple in QiD
- for every i ∈ {1, ..., n}: σ(ui) ≠ σ(vi)
- for every i ∈ {1, ..., s}: σ(wi) is a tuple in RiD
Database-Instance-Only Mapping
...
Database-Schema-Only Mapping
blabla