Warning:
This wiki has been archived and is now read-only.

Semantics of R2RML

From RDB2RDF

1 Introduction
2 Datalog
3 Database-Instance-Only Mapping
4 Database-Schema-Only Mapping

Introduction

This document presents the semantics of R2RML, which are equivalent to non-recursive Datalog with negation. We first give an overview of Datalog, including its syntax and semantics. We then present two variations of mappings that can be defined on top of this semantics. First, we define the DATABASE-INSTANCE-ONLY mapping, which considers only the instances of the database in order to generate RDF. Second, we define the DATABASE-SCHEMA-ONLY mapping, which considers only the schema of the database in order to generate the ontology in RDFS/OWL.

TO-DO: Map the syntax of R2RML to datalog

Datalog

Syntax of Datalog rules

A Datalog rule is a rule of the form:

P(x) ← P₁(x₁), ..., P_k(x_k), not Q₁(y₁), ..., not Q_m(y_m), u₁ ≠ v₁, ..., u_n ≠ v_n

where:

k > 0, m ≥ 0 and n ≥ 0
P, P₁, ..., P_k, Q₁, ..., Q_m are (not necessarily distinct) relation symbols
each of x, x₁, ..., x_k, y₁, ..., y_m is a tuple of variables (elements from V) and constants (elements from U)
every u_i (1 ≤ i ≤ n), and every v_i, is either a variable or a constant
the following safety conditions are satisfied:
- every variable in x is mentioned in some tuple x_j (1 ≤ j ≤ k)
- every variable in y_i (1 ≤ i ≤ m) is mentioned in some tuple x_j (1 ≤ j ≤ k)
- if u_i is a variable (1 ≤ i ≤ n), then u_i is mentioned in some tuple x_j (1 ≤ j ≤ k)
- if v_i is a variable (1 ≤ i ≤ n), then v_i is mentioned in some tuple x_j (1 ≤ j ≤ k)

Note that if m = 0 in the previous Datalog rule, then the rule does not include any element of the form "not Q(x)". Similarly, if n = 0, then the rule does not include any inequalities.
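The safety conditions above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical in-memory representation of rules (the classes Var, Atom and Rule and the function is_safe are names introduced here for illustration, not part of R2RML or of any Datalog engine). Variables are Var objects and constants are plain Python values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable (an element of V); any other Python value is a constant."""
    name: str


@dataclass(frozen=True)
class Atom:
    relation: str
    terms: tuple


@dataclass(frozen=True)
class Rule:
    head: Atom
    positive: tuple            # P_1(x_1), ..., P_k(x_k)
    negative: tuple = ()       # atoms under "not": Q_1(y_1), ..., Q_m(y_m)
    inequalities: tuple = ()   # pairs (u_i, v_i) standing for u_i ≠ v_i


def variables(terms):
    """Return the set of variables occurring in a tuple of terms."""
    return {t for t in terms if isinstance(t, Var)}


def is_safe(rule):
    """Check k > 0 and the four safety conditions of the definition."""
    if len(rule.positive) == 0:
        return False
    # Variables mentioned in some positive tuple x_j.
    bound = set()
    for atom in rule.positive:
        bound |= variables(atom.terms)
    # Variables of the head, of negated atoms and of inequalities.
    used = variables(rule.head.terms)
    for atom in rule.negative:
        used |= variables(atom.terms)
    for u, v in rule.inequalities:
        used |= variables((u, v))
    return used <= bound


x, y, z = Var("x"), Var("y"), Var("z")
safe_rule = Rule(
    head=Atom("A", (x,)),
    positive=(Atom("enrolled", (x, y)), Atom("enrolled", (x, z))),
    inequalities=((y, z),),
)
unsafe_rule = Rule(head=Atom("A", (x,)), positive=(Atom("B", (y,)),))

print(is_safe(safe_rule))    # True
print(is_safe(unsafe_rule))  # False: x is not mentioned in any positive atom
```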

The following is a Datalog rule for our running example:

triple(x, "name", y) ← student(x, y)

In this rule, x and y are variables, and "name" is a constant (an element from U). From now on, we use double quotes in Datalog rules to denote constants.
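As a rough operational reading of this rule, the following Python sketch represents the student table as a set of tuples and derives the triple relation from it. The contents of the student table are assumed here for illustration only.

```python
# Hypothetical instance of the student table: (student id, name).
student = {("1", "John"), ("2", "Mary")}

# triple(x, "name", y) ← student(x, y)
# Each tuple (x, y) in student yields one tuple in triple, with the
# constant "name" in the middle position.
triple = {(x, "name", y) for (x, y) in student}

print(sorted(triple))
```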

As a second example, consider the following Datalog rules for our running example:

A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z 
B(x) ← enrolled(x, y₁), enrolled(x, y₂), course(y₁, w₁, z₁), course(y₂, w₂, z₂), y₁ ≠ y₂, z₁ ≠ "CS", z₂ ≠ "CS"

Intuitively, the first rule retrieves all the students that are taking at least two distinct courses, while the second rule retrieves all the students that are taking at least two distinct courses that are not given in the "CS" department. Finally, consider the following Datalog rule that uses the relation symbol A just defined:

C(x) ← enrolled(x, y), not A(x)

Intuitively, this rule retrieves all the students that are taking exactly one course: enrolled(x, y) indicates that x is taking at least one course, while not A(x) indicates that x is not in the table A, that is, x is not taking at least two distinct courses.
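The three rules A, B and C can be mimicked with Python set comprehensions, which makes the joins, inequalities and negation explicit. This is only a sketch: the table contents below, and the reading of course as (course id, title, department), are assumptions made for the example.

```python
# Hypothetical tables: enrolled(student, course) and
# course(course id, title, department).
enrolled = {(1, 10), (1, 20), (2, 10), (3, 30), (3, 40)}
course = {
    (10, "Databases", "CS"),
    (20, "Logic", "Math"),
    (30, "Algebra", "Math"),
    (40, "Mechanics", "Physics"),
}

# A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z
A = {x for (x, y) in enrolled for (x2, z) in enrolled
     if x == x2 and y != z}

# B(x) ← enrolled(x, y1), enrolled(x, y2), course(y1, w1, z1),
#        course(y2, w2, z2), y1 ≠ y2, z1 ≠ "CS", z2 ≠ "CS"
B = {x
     for (x, y1) in enrolled for (x2, y2) in enrolled
     for (c1, _w1, z1) in course for (c2, _w2, z2) in course
     if x == x2 and c1 == y1 and c2 == y2
     and y1 != y2 and z1 != "CS" and z2 != "CS"}

# C(x) ← enrolled(x, y), not A(x)
C = {x for (x, y) in enrolled if x not in A}

print(A)  # {1, 3}
print(B)  # {3}
print(C)  # {2}
```

Note that C is computed only after A is fully known, which is the ordering that the semantics of negation requires.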

We conclude this section by mentioning that in the Datalog rule shown at the beginning, P(x) is the head of the rule and P₁(x₁), ..., P_k(x_k), not Q₁(y₁), ..., not Q_m(y_m), u₁ ≠ v₁, ..., u_n ≠ v_n is the body of the rule.

Non-recursive Datalog programs

A Datalog program Π is a finite set of Datalog rules. In Π, a relation symbol P is intensional if it is mentioned in the head of some rule of Π, and it is extensional otherwise. Then a Datalog program Π is said to be defined over a schema R if the set of extensional relation symbols of Π is a subset of R. For example, the following is a Datalog program over the schema of our running example:

A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z 
C(x) ← enrolled(x, y), not A(x)

To see why this is the case, notice that {A, C} is the set of intensional relation symbols of this program, while {enrolled} is the set of extensional relation symbols of this program.

A Datalog program Π is said to be non-recursive if there exists a function f that assigns a positive number to each relation symbol in Π in such a way that for every rule in Π, if P is the relation symbol in the head of the rule and Q is a relation symbol in the body of the rule, then f(P) > f(Q). Thus, for example, the Datalog program:

A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z 
C(x) ← enrolled(x, y), not A(x)

is non-recursive as the function f defined as f(C) = 3, f(A) = 2, f(enrolled) = 1 satisfies the aforementioned condition. On the other hand, the following Datalog program:

A(x) ← E(x,y), A(y)

as well as:

A(x) ← E(x,y)
A(x) ← B(x)
B(x) ← C(x)
C(x) ← A(x)

are recursive Datalog programs.
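Whether such a function f exists can be decided by looking for cycles in the dependency graph between relation symbols. The following sketch assumes a simplified representation in which each rule is a pair (head symbol, list of body symbols); the function name level_function is hypothetical. It returns one possible f, or None if the program is recursive.

```python
def level_function(rules):
    """Return a dict f with f[P] > f[Q] for every rule with head P and
    body symbol Q, or None if no such function exists (recursive program)."""
    deps = {}
    for head, body in rules:
        deps.setdefault(head, set()).update(body)
        for q in body:
            deps.setdefault(q, set())

    f = {}
    in_progress = set()

    def visit(p):
        if p in f:
            return True
        if p in in_progress:
            return False  # p depends on itself: the program is recursive
        in_progress.add(p)
        level = 1
        for q in deps[p]:
            if not visit(q):
                return False
            level = max(level, f[q] + 1)
        in_progress.discard(p)
        f[p] = level
        return True

    for p in deps:
        if not visit(p):
            return None
    return f


non_recursive = [("A", ["enrolled", "enrolled"]), ("C", ["enrolled", "A"])]
recursive_1 = [("A", ["E", "A"])]
recursive_2 = [("A", ["E"]), ("A", ["B"]), ("B", ["C"]), ("C", ["A"])]

print(level_function(non_recursive))  # {'enrolled': 1, 'A': 2, 'C': 3}
print(level_function(recursive_1))    # None
print(level_function(recursive_2))    # None
```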

Intuitively, a Datalog program is recursive if one of its intensional predicates is defined in terms of itself. The following is a typical example of a recursive Datalog program:

flight(x,y) ← non_stop_flight(x,y)
flight(x,y) ← non_stop_flight(x,z), flight(z,y)

This program retrieves the pairs of cities (x,y) such that there is a way to fly from x to y that may include an arbitrary number of stopovers.

Important: In what follows, we consider only non-recursive Datalog programs.

Semantics of Datalog programs

To define the semantics of Datalog programs, we need to introduce some terminology. A substitution σ is a function from V ∪ U to U that is the identity on the constants (that is, σ(c) = c for every c ∈ U). Given a tuple x=(x₁, ..., x_k) of constants and variables and a substitution σ, tuple σ(x) is defined as (σ(x₁), ..., σ(x_k)). For example, if the tuple is (x, "name", y) and σ is a substitution such that σ(x) = "1" and σ(y) = "John", then applying σ to this tuple yields ("1", "name", "John").
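A substitution can be sketched in Python as a dictionary defined on variables only, with constants left untouched. The names Var and apply_substitution are hypothetical and introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable; any other Python value is treated as a constant."""
    name: str


def apply_substitution(sigma, terms):
    """Return sigma applied to a tuple of terms.

    sigma is a dict from variables to constants; constants are mapped
    to themselves, as required by the definition.
    """
    return tuple(sigma[t] if isinstance(t, Var) else t for t in terms)


x, y = Var("x"), Var("y")
sigma = {x: "1", y: "John"}

print(apply_substitution(sigma, (x, "name", y)))  # ('1', 'name', 'John')
```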

Let R be a schema, D an instance of R, and Π a Datalog program over R, and assume that f is a function that assigns a positive number to each relation symbol in Π in such a way that for every rule in Π, if P is the relation symbol in the head of the rule and Q is a relation symbol in the body of the rule, then f(P) > f(Q) (such a function exists since Π is assumed to be non-recursive). The evaluation of Π over D assigns to every predicate symbol R mentioned in Π a relation R^D of the corresponding arity. Formally, this evaluation is recursively defined as follows.

(1) For every extensional relation symbol R mentioned in Π, relation R^D is just the relation assigned to R by D.

(2) Assume that P is an intensional relation symbol in Π and that for every relation symbol Q such that f(P) > f(Q), relation Q^D has already been computed. Then a tuple c of constants is in P^D if and only if there exist a rule in Π:

P(x) ← P₁(x₁), ..., P_k(x_k), not Q₁(y₁), ..., not Q_m(y_m), u₁ ≠ v₁, ..., u_n ≠ v_n

and a substitution σ such that:

σ(x) = c
for every i ∈ {1, ..., k}: σ(x_i) is a tuple in P_i^D
for every i ∈ {1, ..., m}: σ(y_i) is not a tuple in Q_i^D
for every i ∈ {1, ..., n}: σ(u_i) ≠ σ(v_i)

It is important to notice that in the previous definition, a function f is used to define the evaluation of Π over D. Thus, it is natural to ask what would happen if one replaces this function by another function g satisfying the condition mentioned in the definition. Given that f is only used to determine the order in which the rules of Π have to be evaluated, it can be formally proved that if one uses g instead of f when evaluating Π over D, then the result is the same.

We now show an example of the evaluation process. Let Π be the following Datalog program over the schema of our running example:

A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z 
C(x) ← enrolled(x, y), not A(x)

Moreover, let f be a function defined as f(C) = 3, f(A) = 2 and f(enrolled) = 1, which actually shows that Π is a non-recursive Datalog program. To evaluate Π over our running database instance D, we first notice that:

enrolled^D = { (1,2) }

Then we consider intensional predicate A: we have that f(A) > f(enrolled), enrolled^D has already been computed, and enrolled is the only relation symbol mentioned in Π whose image under f is smaller than f(A). In this case, we have that:

A^D = ∅

as there is no substitution σ such that (σ(x), σ(y)) is in enrolled^D, (σ(x), σ(z)) is in enrolled^D and σ(y) ≠ σ(z). Finally, we consider intensional predicate C, for which we have that:

C^D = { 1 }

since for the substitution σ such that σ(x) = 1 and σ(y) = 2, we have that (σ(x), σ(y)) is in enrolled^D and σ(x) is not in A^D. Notice that this result corresponds with our intuition about the definition of relation symbol C, as C^D contains the set of students in D that are taking exactly one course.
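The evaluation process just described can be sketched as a small Python program. This is an illustrative sketch under assumptions, not a reference implementation: a rule is encoded as a 4-tuple (head, positive atoms, negated atoms, inequalities), an atom as a pair (relation name, tuple of terms), and the names Var, subst, match, eval_rule and evaluate are hypothetical. Unary relations are stored as sets of 1-tuples, so C^D = { 1 } appears as {(1,)}.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


def subst(sigma, terms):
    """Apply substitution sigma (a dict on variables) to a tuple of terms."""
    return tuple(sigma[t] if isinstance(t, Var) else t for t in terms)


def match(terms, fact, sigma):
    """Extend sigma so that sigma(terms) == fact, or return None."""
    if len(terms) != len(fact):
        return None
    sigma = dict(sigma)
    for t, c in zip(terms, fact):
        if isinstance(t, Var):
            if sigma.setdefault(t, c) != c:
                return None
        elif t != c:
            return None
    return sigma


def eval_rule(rule, rel):
    """Return the head tuples derived by one rule from the relations in rel."""
    head, positive, negative, inequalities = rule
    sigmas = [{}]
    for name, terms in positive:
        # Keep every way of extending a substitution to match this atom.
        sigmas = [s2 for s in sigmas for fact in rel[name]
                  if (s2 := match(terms, fact, s)) is not None]
    derived = set()
    for s in sigmas:
        if any(subst(s, terms) in rel[name] for name, terms in negative):
            continue  # some sigma(y_i) is in Q_i^D
        if any(subst(s, (u,)) == subst(s, (v,)) for u, v in inequalities):
            continue  # some sigma(u_i) equals sigma(v_i)
        derived.add(subst(s, head[1]))
    return derived


def evaluate(program, database, f):
    """Evaluate a non-recursive program over a database, in the order given by f."""
    rel = {name: set(tuples) for name, tuples in database.items()}
    heads = sorted({rule[0][0] for rule in program}, key=lambda p: f[p])
    for p in heads:
        rel[p] = set()
        for rule in program:
            if rule[0][0] == p:
                rel[p] |= eval_rule(rule, rel)
    return rel


x, y, z = Var("x"), Var("y"), Var("z")
program = [
    # A(x) ← enrolled(x, y), enrolled(x, z), y ≠ z
    (("A", (x,)), [("enrolled", (x, y)), ("enrolled", (x, z))], [], [(y, z)]),
    # C(x) ← enrolled(x, y), not A(x)
    (("C", (x,)), [("enrolled", (x, y))], [("A", (x,))], []),
]
database = {"enrolled": {(1, 2)}}
f = {"enrolled": 1, "A": 2, "C": 3}

result = evaluate(program, database, f)
print(result["A"])  # set()
print(result["C"])  # {(1,)}
```

Replacing f by another function that satisfies the same condition, for example one with values 5, 10 and 20, yields the same relations in this sketch, since f is only used to order the intensional predicates.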

Built-in predicates

Fix a relational schema R, and assume that B is a set of relation symbols that are not mentioned in R. Moreover, assume that for every relation symbol R in B of arity n, there exists a (not necessarily finite) n-ary relation I of elements from U such that for every instance D of R, it holds that R^D = I. That is, the interpretation of each relation symbol in B is fixed (it does not depend on the database instances) and may be infinite.

Each relation symbol in B is called a built-in predicate. For example, if U is the set of natural numbers, then < is a built-in predicate whose interpretation in each database instance D is:

<^D = { (n,m) | n and m are natural numbers and n is smaller than m }

Built-in predicates can be included in Datalog rules. More precisely, a Datalog rule with built-in predicates is a rule of the form:

P(x) ← P₁(x₁), ..., P_k(x_k), not Q₁(y₁), ..., not Q_m(y_m), u₁ ≠ v₁, ..., u_n ≠ v_n, R₁(w₁), ..., R_s(w_s)

where:

k > 0, m ≥ 0, n ≥ 0 and s ≥ 0
P, P₁, ..., P_k, Q₁, ..., Q_m are (not necessarily distinct) relation symbols not mentioned in B
R₁, ..., R_s are (not necessarily distinct) built-in predicate symbols (relation symbols from B)
each of x, x₁, ..., x_k, y₁, ..., y_m, w₁, ..., w_s is a tuple of variables and constants
every u_i (1 ≤ i ≤ n), and every v_i, is either a variable or a constant
the following safety conditions are satisfied:
- every variable in x is mentioned in some tuple x_j (1 ≤ j ≤ k)
- every variable in y_i (1 ≤ i ≤ m) is mentioned in some tuple x_j (1 ≤ j ≤ k)
- if u_i is a variable (1 ≤ i ≤ n), then u_i is mentioned in some tuple x_j (1 ≤ j ≤ k)
- if v_i is a variable (1 ≤ i ≤ n), then v_i is mentioned in some tuple x_j (1 ≤ j ≤ k)
- every variable in w_i (1 ≤ i ≤ s) is mentioned in some tuple x_j (1 ≤ j ≤ k)

It is important to notice that the last safety condition is imposed to avoid rules like the following:

A(x) ← B(y), y < x

which may have an infinite number of solutions if U is the set of natural numbers and < is the usual order on these numbers.
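To see the problem concretely, the following sketch enumerates the solutions of this unsafe rule among the natural numbers below a given bound. The relation B^D = { 3 } and the function name solutions_up_to are assumptions made for the illustration. The number of solutions keeps growing with the bound, so no finite relation A^D can contain all of them.

```python
B = {3}  # hypothetical finite relation B^D


def solutions_up_to(bound):
    """Values x below bound that satisfy A(x) ← B(y), y < x."""
    return {x for y in B for x in range(bound) if y < x}


print(len(solutions_up_to(10)))    # 6
print(len(solutions_up_to(1000)))  # 996
```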

A Datalog program Π with built-in predicates is a finite set of Datalog rules with built-in predicates. A Datalog program Π is said to be defined over relational schema R with built-in predicates B if the set of extensional relation symbols of Π is a subset of R ∪ B.

The semantics of Datalog programs with built-in predicates is a simple extension of the semantics defined above. Let D be an instance of schema R and Π a Datalog program over R with built-in predicates B, and assume that f is a function that assigns a positive number to each relation symbol in Π in such a way that for every rule in Π, if P is the relation symbol in the head of the rule and Q is a relation symbol in the body of the rule, then f(P) > f(Q) (such a function exists since Π is assumed to be non-recursive). The evaluation of Π over D assigns to every predicate symbol R mentioned in Π a relation R^D of the corresponding arity. Formally, this evaluation is recursively defined as follows.

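(1) For every extensional relation symbol R mentioned in Π, relation R^D is just the relation assigned to R by D (in particular, for every built-in predicate R mentioned in Π, relation R^D is just the fixed interpretation of R).

(2) Assume that P is an intensional relation symbol in Π and that for every relation symbol Q such that f(P) > f(Q), relation Q^D has already been computed. Then a tuple c of constants is in P^D if and only if there exist a rule in Π: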

P(x) ← P₁(x₁), ..., P_k(x_k), not Q₁(y₁), ..., not Q_m(y_m), u₁ ≠ v₁, ..., u_n ≠ v_n, R₁(w₁), ..., R_s(w_s)

and a substitution σ such that:

σ(x) = c
for every i ∈ {1, ..., k}: σ(x_i) is a tuple in P_i^D
for every i ∈ {1, ..., m}: σ(y_i) is not a tuple in Q_i^D
for every i ∈ {1, ..., n}: σ(u_i) ≠ σ(v_i)
for every i ∈ {1, ..., s}: σ(w_i) is a tuple in R_i^D
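Since the interpretation of a built-in predicate may be infinite, an implementation would typically test membership with a function rather than enumerate the relation. The sketch below illustrates this for a hypothetical rule older(x, y) ← age(x, a), age(y, b), b < a. The age table, the rule and the dictionary builtin_predicates are assumptions introduced for the example.

```python
import operator

# Fixed interpretations of built-in predicates, given as membership tests.
# They do not depend on the database instance.
builtin_predicates = {"<": operator.lt}

# Hypothetical database relation age(person, years).
age = {("ann", 31), ("bob", 25), ("eve", 25)}

# older(x, y) ← age(x, a), age(y, b), b < a
# The variables a and b of the built-in atom are bound by the two age
# atoms, as the last safety condition requires.
older = {(x, y)
         for (x, a) in age for (y, b) in age
         if builtin_predicates["<"](b, a)}

print(sorted(older))  # [('ann', 'bob'), ('ann', 'eve')]
```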

Database-Instance-Only Mapping

...

Database-Schema-Only Mapping

blabla

Retrieved from "https://www.w3.org/2001/sw/rdb2rdf/wiki/index.php?title=Semantics_of_R2RML&oldid=682"
