RIF example UC8: Vocabulary Mapping for Data Integration

Summary

This use case is about rules which handle RDF data.

To write any such rules in the XML syntax of RIF Core WD1, we have to make some assumptions about the mapping from RDF to RIF. It is easy to invent such mappings, but there are several alternatives.

Once we have the mapping, expressing the rules in a concrete XML syntax (as opposed to the abstract syntax) means filling in the blanks on tedious details like namespace handling and datatypes.

We would also like to be able to annotate the rules and rulesets in various ways. This example does so using RDF.

It was fairly easy to guess what the intended XML syntax was from the current write-up, but my first guesses had mistakes. This may just be my fault, but I think it suggests the need for a rather clearer specification of the syntax.

Background

The use case "Vocabulary Mapping for Data Integration" is about integrating data from multiple sources. The sources are provided as RDF conforming to RDFS/OWL vocabularies. Rules are used to translate the individual source representations to a common vocabulary.

The use case is loosely based on several existing applications of Jena and JenaRules, at least one of which is shipping commercially.

Source rules

These are artificial rules loosely based on the existing applications. They have been chosen to illustrate the typical range of features used in this class of applications.

  @prefix it: <http://jena.hpl.hp.com/rifUC8/ITDatabase#>.
  @prefix fn: <http://jena.hpl.hp.com/rifUC8/finance#>.
  @prefix bp: <http://jena.hpl.hp.com/rifUC8/businessProcesses#>.
  @prefix  t: <http://jena.hpl.hp.com/rifUC8/target#>.

  # Simple data mapping 
  # - a ComputeNode with a network interface card is mapped to a 
  #   Server with an IP address (no explicit NIC)
  [r1-computeNodeToServer:
     (?x rdf:type it:ComputeNode) 
     (?x it:hasNIC ?i) (?i it:hasIP ?p)
   ->
     (?x rdf:type t:Server) (?x t:address ?p)]

  # Simple join
  # - find the cage housing the rack housing the compute node
  #   and find the maintenance contract for that cage
  [r2-joinBasedOnLocation:
     (?x rdf:type it:ComputeNode) 
     (?x it:rack ?r) (?r it:cage ?c)
     (?mc fn:maintenaceContractForCage ?c)
    ->
     (?x rdf:type t:Server) (?x t:maintenanceContract ?mc)]

  # Object introduction
  # - the application data doesn't have an explicit representation
  #   of the database hosting server so we invent one
  [r3-applicationHost:
     (?a rdf:type bp:Application) 
     (?a bp:discoveredAtIP ?p)
     makeTemp(?n)
    ->
     (?n rdf:type t:Server) (?n t:address ?p)  (?n t:hosts ?a)]

  # Datatypes and builtins
  # - assume that bulk maintenance contracts will get 25% discount
  [r4-discount:
     (?mc fn:baseCost ?c)
     (?mc fn:category fn:Bulk)
     product(?c, 0.75, ?cd)
   ->
     (?mc t:assumedCost ?cd)]

  # Vocabulary access and predicate variables
  # - there are several relationships between an application
  #   and its subcomponents but any of them should induce a
  #   dependency
  [r5-dependency:
     (?a rdf:type bp:Application)
     (?a ?P ?subApp)  (?P rdfs:subPropertyOf bp:comprises)
     (?n t:hosts ?subApp)
    ->
     (?a t:dependsOn ?n)]

Analysis and issues

Core is Horn

The simple rules fall within RIF Core in that the rules have bodies which are conjunctions of triple patterns, and variables are universally quantified across each rule. There is no negation.

The syntax:

   [rulename:  B1 .. Bn -> H1 .. Hm]

is simply syntactic sugar for a set of Horn rules:

    B1 .. Bn -> H1
    ...
    B1 .. Bn -> Hm
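
The expansion above can be sketched in a few lines of Python. This is only an illustration: the Rule type, the expand_to_horn helper and the "#i" naming of the expanded rules are assumptions made for the sketch, not part of JenaRules or RIF.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """A rule with a conjunctive body and one or more head atoms (hypothetical type)."""
    name: str
    body: tuple   # body atoms B1 .. Bn, each an (s, p, o) triple pattern
    heads: tuple  # head atoms H1 .. Hm


def expand_to_horn(rule):
    """Split a multi-head rule into one single-head rule per head atom."""
    return [
        Rule(f"{rule.name}#{i}", rule.body, (head,))
        for i, head in enumerate(rule.heads, start=1)
    ]


# Rule r1 from the source rules, written as triple patterns.
r1 = Rule(
    name="r1-computeNodeToServer",
    body=(
        ("?x", "rdf:type", "it:ComputeNode"),
        ("?x", "it:hasNIC", "?i"),
        ("?i", "it:hasIP", "?p"),
    ),
    heads=(
        ("?x", "rdf:type", "t:Server"),
        ("?x", "t:address", "?p"),
    ),
)

horn_rules = expand_to_horn(r1)
for rule in horn_rules:
    print(rule.name, rule.body, "->", rule.heads[0])
```

Each expanded rule shares the body of the original, which is why the multi-head form is purely a syntactic convenience.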

RDF triple mapping

To map the sample rules to RIF we have to decide how to map RDF triple patterns to RIF Core Expressions (Uniterms). There are (at least) three reasonable options:

1. Use an "rdf" ternary relation.
2. Map all RDF (s P o) triples to binary relations P(s,o).
3. Map all RDF type triples (s rdf:type T) to unary relations T(s) and map all other triples to binary relations P(s,o).

The first is the simplest and supports quantification over RDF predicates without requiring quantification over RIF relations. The second is in some ways the most "natural" since we would normally regard an RDF triple as representing an instance of a binary relation.

For this exercise we chose the second option.
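
To make the difference between the three options concrete, here is a small Python sketch that maps one triple pattern under each of them. Atoms are represented as plain tuples with the relation name first; that representation and the function names are assumptions made for the illustration only.

```python
RDF_TYPE = "rdf:type"


def as_ternary(s, p, o):
    """Option 1: one ternary "rdf" relation; the predicate is an ordinary argument."""
    return ("rdf", s, p, o)


def as_binary(s, p, o):
    """Option 2: every triple (s P o) becomes the binary relation P(s, o)."""
    return (p, s, o)


def as_unary_or_binary(s, p, o):
    """Option 3: type triples become T(s); all other triples become P(s, o)."""
    return (o, s) if p == RDF_TYPE else (p, s, o)


triple = ("?x", "rdf:type", "it:ComputeNode")
for mapping in (as_ternary, as_binary, as_unary_or_binary):
    print(mapping.__name__, mapping(*triple))
```

Under option 1 the predicate stays in an argument position, so a rule such as r5 needs no variable in the relation position. Under options 2 and 3 the predicate becomes the relation itself.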

RDF Resource mapping

Next we have to decide how to map all the URIs like it:ComputeNode into Const(ant)s.

They could be simply strings, they could be instances of some URI sort or they could be called out as special cases in the abstract syntax.

Strings are the easiest, but for the concrete syntax we would prefer some sort of qname/curie syntax. At the abstract syntax level this is irrelevant and boring. At the concrete syntax level, handling namespaces in XML is a tedious headache. To simplify this we extend the syntax to use attributes for (c)URI(es).

So for example the first triple pattern in the first rule would look like:

  <Uniterm>
    <Const rif:uri="rdf:type" />
    <Var>x</Var>
    <Const rif:uri="it:ComputeNode" />
  </Uniterm>

or in the bipartitioned graph proposal this would be:

  <Uniterm>
    <Const rif:uri="rdf:type" />
    <Var>x</Var>
    <Const rif:uri="it:ComputeNode" />
  </Uniterm>

This assumes some RIF-mandated rule about expansion of curies. This may not be acceptable W3C practice, in which case wherever you see "pre:foo" imagine you actually see "&pre;foo" where pre is an XML Entity.
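
As an illustration of what such an expansion rule might amount to, the following Python sketch expands a curie against a prefix table. The prefix URIs are taken from the rules in this example; the expand_curie function and its error handling are assumptions, since RIF does not define this behaviour.

```python
# Prefix table using namespaces that appear in this example.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "it": "http://jena.hpl.hp.com/rifUC8/ITDatabase#",
    "t": "http://jena.hpl.hp.com/rifUC8/target#",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand "pre:local" to a full URI by concatenating namespace and local name."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"cannot expand {curie!r}: unknown or missing prefix")
    return prefixes[prefix] + local


print(expand_curie("it:ComputeNode"))
```

The XML Entity alternative gives the same result by plain textual substitution, performed by the XML parser rather than by the rule processor.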

Quantification over predicates

Having chosen the "natural" mapping that RDF predicates map to binary RIF relations (Consts) we have a problem with rule r5 which quantifies over such predicates.

To cope with this we extended the syntax in an "obvious" way so the second triple pattern in r5 would look like:

  <Uniterm>
    <Const><Var>P</Var></Const>
    <Var>a</Var>
    <Var>subApp</Var>
  </Uniterm>

However, from discussion with Harold it seems that Const is intended to be a leaf node, not a role specifier, and that there is a role specifier <op> that can be used in this situation, so the recommended syntax is:

  <Uniterm>
    <op><Var>P</Var></op>
    <Var>a</Var>
    <Var>subApp</Var>
  </Uniterm>
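
Operationally, a variable in the relation position can be matched like any other variable. The Python sketch below shows this for atoms written as (relation, arg1, arg2) tuples. The matcher and the sample facts (bp:uses, app1, db1) are hypothetical and only illustrate the idea; they say nothing about how a RIF processor would be implemented.

```python
def is_var(term):
    """Variables are written with a leading '?', as in the source rules."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact, bindings):
    """Match one pattern against one ground atom; return extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    result = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            # Bind the variable, or check consistency with an earlier binding.
            if result.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return result


# Hypothetical ground atoms in the P(s, o) form.
uses = ("bp:uses", "app1", "db1")
sub_property = ("rdfs:subPropertyOf", "bp:uses", "bp:comprises")

# Second and third patterns of r5; ?P is first bound in the relation position.
b1 = match(("?P", "?a", "?subApp"), uses, {})
b2 = match(("rdfs:subPropertyOf", "?P", "bp:comprises"), sub_property, b1)
print(b2)
```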

Datatypes

Rule r4 has a numeric constant (the rule syntax 0.75 will translate into the RDF literal "0.75"^^xsd:double). RIF has no agreed concrete syntax for such constants, so we adopt an obvious one: an attribute gives the datatype, and we assume that all reasonable XSD atomic datatypes and the curie/qname syntax are supported.

So we assume the constant will look like:

   <Const rif:datatype="xsd:double">0.75</Const>

In the bipartitioned proposal this would become:

   <Data rif:datatype="xsd:double">0.75</Data>

Builtins

Rule r4 also refers to a builtin function ("product"). We haven't yet discussed specific sets of builtins for RIF. Since we are using XSD for atomic types it might be logical to use XQuery functions and operators but that doesn't supply URIs to identify the operators like "*". Similarly we could use MathML but that only gives us QNames and not URIs. So we'll just pretend RIF has defined a set of builtins.

So the third pattern in rule r4 becomes:

    <Equal>
      <Var>cd</Var>
      <Uniterm>
        <Const rif:uri="rif:multiply" />
        <Var>c</Var>
        <Const rif:datatype="xsd:double">0.75</Const>
      </Uniterm>
    </Equal>
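
One possible reading of this pattern, sketched in Python: evaluate the function term under the current bindings and then bind or check the variable on the other side of the equality. The BUILTINS table, eval_term and solve_equal are assumptions made for the illustration, just as rif:multiply itself is a pretend builtin.

```python
# Pretend builtin table; "rif:multiply" is the assumed name used in this example.
BUILTINS = {"rif:multiply": lambda a, b: a * b}


def eval_term(term, bindings):
    """Evaluate a constant, a variable or a function application (op, arg1, ..., argn)."""
    if isinstance(term, tuple):
        op, *args = term
        return BUILTINS[op](*(eval_term(arg, bindings) for arg in args))
    if isinstance(term, str) and term.startswith("?"):
        return bindings[term]
    return term


def solve_equal(var, term, bindings):
    """Treat Equal(var, term) as: bind var if it is unbound, otherwise test equality."""
    value = eval_term(term, bindings)
    if var in bindings:
        return bindings if bindings[var] == value else None
    return {**bindings, var: value}


# Third pattern of r4, with ?c already bound to a hypothetical base cost.
result = solve_equal("?cd", ("rif:multiply", "?c", 0.75), {"?c": 200.0})
print(result)
```

This reading requires ?c to be bound before the equality is evaluated, which is one reason the choice between a builtin relation and equality matters.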

bNodes

This class of JenaRule application goes outside Horn in that the rules can manufacture new bNodes to represent objects we know to be present but to which we are not assigning a URI. This is treating bNodes simply as Skolem constants. We've represented this as the makeTemp builtin in rule r3. To translate this rule we assume some equivalent genSym function that can manufacture a Skolem constant.
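
A minimal Python sketch of such a genSym function follows, assuming that the generated constant should be determined entirely by the key arguments so that firing the same rule twice on the same bindings yields the same node. The gen_sym name, the hashing scheme and the "_:sk" label format are all assumptions made for the illustration.

```python
import hashlib


def gen_sym(*keys):
    """Return a Skolem-constant label that depends only on the key arguments."""
    # Join the keys with a separator that is unlikely to occur inside them.
    joined = "\x1f".join(str(key) for key in keys)
    digest = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    return "_:sk" + digest[:16]


# Hypothetical bindings for ?a and ?p in rule r3, plus the rule name as a key.
node = gen_sym("app1", "10.0.0.1", "r3")
print(node)
```

Keying on the bound variables and the rule name means that distinct applications get distinct servers, while repeated firings for the same application and address do not multiply nodes.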

Rule naming

A boring but sometimes useful feature of the source rules is the syntactic rule label. We'd like to be able to attach arbitrary descriptive metadata to rules, such as names, descriptions, authors etc.

For the sake of this example we are going to use the suggestion in http://lists.w3.org/Archives/Public/public-rif-wg/2006Sep/0077.html

We could have a single Literal for the entire rule and follow the B.1 rule syntax for that literal. However, then for the rules with repeated bodies (lots in this example) we end up with a lot of duplication. To make the translation marginally less unreadable we've chosen to go with a head/body/vars split to enable us to have multiple heads as a purely syntactic convenience and just use the A.1 syntax for those components.

[Note that rif:vars as used here implies universal quantification. If we were to adopt something like it then the quantification should be explicit.]

This is not a fundamental issue and switching to a single rif:ruleSrc property which points to a B.1 literal with all rules expanded in full would be perfectly acceptable.

Rule and Ruleset labelling

It would be convenient to be able to label these rules as being intended for processing RDF data so a translator knows to expect only binary predicates.

It would also be convenient to be able to label them as intended for model transformation rather than deductive closure. In the original application the rules are actually production rules with implicit "asserts" for each triple in the conclusion. The desired output of the rule processor is just the set of newly asserted conclusions, not the full deductive closure. This procedural usage is presumably outside RIF Core but we can at least annotate the ruleset to indicate this was the original intended usage.

For both of these we've invented RDFS/OWL classes and used them as annotations on the base Ruleset. One could argue about whether they should be Rule rather than Ruleset classifications.
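
The difference between model transformation and deductive closure can be sketched in Python as follows. This is a naive forward chainer written for illustration only; it is not how Jena works, and the sample triples (n1, nic1, 10.0.0.1) are hypothetical. Rules are given as (body, heads) pairs of triple patterns.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact, bindings):
    """Match one triple pattern against one triple; return bindings or None."""
    out = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if out.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return out


def solutions(body, facts, bindings=None):
    """Yield every set of bindings that satisfies all body patterns."""
    bindings = {} if bindings is None else bindings
    if not body:
        yield bindings
        return
    for fact in facts:
        extended = match(body[0], fact, bindings)
        if extended is not None:
            yield from solutions(body[1:], facts, extended)


def transform(rules, source):
    """Run the rules to a fixpoint, then return only the newly asserted triples."""
    known = set(source)
    while True:
        new = set()
        for body, heads in rules:
            for b in solutions(body, known):
                new.update(tuple(b.get(t, t) for t in head) for head in heads)
        new -= known
        if not new:
            break
        known |= new
    return known - set(source)  # the closure would be `known` itself


r1 = (
    (("?x", "rdf:type", "it:ComputeNode"), ("?x", "it:hasNIC", "?i"), ("?i", "it:hasIP", "?p")),
    (("?x", "rdf:type", "t:Server"), ("?x", "t:address", "?p")),
)
source = {
    ("n1", "rdf:type", "it:ComputeNode"),
    ("n1", "it:hasNIC", "nic1"),
    ("nic1", "it:hasIP", "10.0.0.1"),
}
output = transform([r1], source)
print(sorted(output))
```

Only the target-vocabulary triples appear in the output. The source triples, which a deductive closure would also contain, are dropped.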

RIF Translation

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rdf:RDF [
    <!ENTITY rdf  'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
    <!ENTITY rdfs 'http://www.w3.org/2000/01/rdf-schema#'>
    <!ENTITY xsd  'http://www.w3.org/2001/XMLSchema#'>
    <!ENTITY owl  "http://www.w3.org/2002/07/owl#" >
    <!ENTITY rif  "http://www.w3.org/2006/10/rif#" >
    <!ENTITY jena  "http://jena.hpl.hp.com/vocabs/rif#" >
    <!ENTITY  dc  "http://purl.org/dc/elements/1.1/" >
    <!ENTITY  it  "http://jena.hpl.hp.com/rifUC8/ITDatabase#" >
    <!ENTITY  fn  "http://jena.hpl.hp.com/rifUC8/finance#" >
    <!ENTITY  bp  "http://jena.hpl.hp.com/rifUC8/businessProcesses#" >
    <!ENTITY   t  "http://jena.hpl.hp.com/rifUC8/target#" >
]>

<rdf:RDF xmlns:rdf="&rdf;" xmlns:rdfs="&rdfs;" xmlns:xsd="&xsd;" xmlns:owl="&owl;"
         xmlns:rif="&rif;" xmlns:it="&it;" xmlns:fn="&fn;" xmlns:bp="&bp;" 
         xmlns:jena="&jena;" xmlns:t="&t;" xmlns:dc="&dc;"
         xml:base="&t;" xmlns="&t;">

<rif:Ruleset rdf:ID="uc8rules">

  <!-- label outer ruleset as only expecting RDF compatible rules -->
  <rdf:type rdf:resource="&rif;RDFRuleset" />
  
  <!-- label outer ruleset with a jena-specific concept of transformation rules, 
       can be ignored by other processors -->
  <rdf:type rdf:resource="&jena;TransformationRules" />
  
  <!-- The first rule -->
  <rif:rule><rif:Implies rdf:ID="r1">
  
     <!-- Descriptive metadata -->
     <rdfs:label>r1-computeNodeToServer</rdfs:label>
     <rdfs:comment>Simple data mapping</rdfs:comment>
     <dc:creator>Dave Reynolds</dc:creator>

     <!-- Rule variables, universally quantified -->
     <rif:vars rdf:parseType='Literal' xmlns="&rif;">
          <Var>x</Var>
          <Var>i</Var>
          <Var>p</Var>
     </rif:vars>

     <!-- rule body -->
     <rif:if rdf:parseType='Literal' xmlns="&rif;">
              <And>
                <Uniterm>
                  <Const rif:uri="rdf:type" />
                  <Var>x</Var>
                  <Const rif:uri="it:ComputeNode" />
                </Uniterm>
                <Uniterm>
                  <Const rif:uri="it:hasNIC" />
                  <Var>x</Var>
                  <Var>i</Var>
                </Uniterm>
                <Uniterm>
                  <Const rif:uri="it:hasIP" />
                  <Var>i</Var>
                  <Var>p</Var>
                </Uniterm>
              </And> 
     </rif:if>

     <!-- rule head -->
     <rif:then rdf:parseType='Literal' xmlns="&rif;"> 
               <Uniterm>
                  <Const rif:uri="rdf:type" />
                  <Var>x</Var>
                  <Const rif:uri="t:Server" />
               </Uniterm>
     </rif:then>

     <!-- multiple heads, syntactic sugar for repeated rules with same body -->
     <rif:then rdf:parseType='Literal' xmlns="&rif;">
                <Uniterm>
                  <Const rif:uri="t:address" />
                  <Var>x</Var>
                  <Var>p</Var>
                </Uniterm>
     </rif:then>
  </rif:Implies></rif:rule>

  <rif:rule><rif:Implies rdf:ID="r2">
     <rdfs:label>r2-joinBasedOnLocation</rdfs:label>
     <rdfs:comment>Simple join</rdfs:comment>
     <dc:creator>Dave Reynolds</dc:creator>

     <!-- Rule variables, universally quantified -->
     <rif:vars rdf:parseType='Literal' xmlns="&rif;">
          <Var>x</Var>
          <Var>r</Var>
          <Var>c</Var>
          <Var>mc</Var>
     </rif:vars>

     <!-- rule body -->
     <rif:if rdf:parseType='Literal' xmlns="&rif;">
              <And>
                <Uniterm>
                  <Const rif:uri="rdf:type" />
                  <Var>x</Var>
                  <Const rif:uri="it:ComputeNode" />
                </Uniterm>
                <Uniterm>
                  <Const rif:uri="it:rack" />
                  <Var>x</Var>
                  <Var>r</Var>
                </Uniterm>
                <Uniterm>
                  <Const rif:uri="it:cage" />
                  <Var>r</Var>
                  <Var>c</Var>
                </Uniterm>
                <Uniterm>
                  <Const rif:uri="fn:maintenaceContractForCage" />
                  <Var>mc</Var>
                  <Var>c</Var>
                </Uniterm>
              </And> 
     </rif:if>

     <!-- rule head -->
     <rif:then rdf:parseType='Literal' xmlns="&rif;"> 
                <Uniterm>
                  <Const rif:uri="rdf:type" />
                  <Var>x</Var>
                  <Const rif:uri="t:Server" />
                </Uniterm>
     </rif:then>
     
     <rif:then rdf:parseType='Literal' xmlns="&rif;"> 
                <Uniterm>
                  <Const rif:uri="t:maintenanceContract" />
                  <Var>x</Var>
                  <Var>mc</Var>
                </Uniterm>
     </rif:then> 
  </rif:Implies></rif:rule>
    
  <rif:rule><rif:Implies rdf:ID="r3">
     <rdfs:label>r3-applicationHost</rdfs:label>
     <rdfs:comment>Object introduction</rdfs:comment>
     <dc:creator>Dave Reynolds</dc:creator>
     <!-- Rule variables, universally quantified -->
     <rif:vars rdf:parseType='Literal' xmlns="&rif;">
          <Var>a</Var>
          <Var>p</Var>
          <Var>n</Var>
     </rif:vars>

     <!-- rule body -->
     <rif:if rdf:parseType='Literal' xmlns="&rif;">
              <And>
                <Uniterm>
                  <Const rif:uri="rdf:type" />
                  <Var>a</Var>
                  <Const rif:uri="bp:Application" />
                </Uniterm>
                <Uniterm>
                  <Const rif:uri="bp:discoveredAtIP" />
                  <Var>a</Var>
                  <Var>p</Var>
                </Uniterm>
                <!-- genSym built in assumed to take n-1 bound variables and 
                     constants as keys and bind the final variable to an Ind keyed from them -->
                <Uniterm>
                  <Const rif:uri="rif:genSym" />
                  <Var>a</Var>
                  <Var>p</Var>
                  <Const>r3</Const>
                  <Var>n</Var>
                </Uniterm>
              </And> 
     </rif:if>

     <!-- rule head -->
     <rif:then rdf:parseType='Literal' xmlns="&rif;"> 
                <Uniterm>
                  <Const rif:uri="rdf:type" />
                  <Var>n</Var>
                  <Const rif:uri="t:Server" />
                </Uniterm>
     </rif:then>

     <rif:then rdf:parseType='Literal' xmlns="&rif;"> 
                <Uniterm>
                  <Const rif:uri="t:address" />
                  <Var>n</Var>
                  <Var>p</Var>
                </Uniterm>
     </rif:then>
     
     <rif:then rdf:parseType='Literal' xmlns="&rif;"> 
                <Uniterm>
                  <Const rif:uri="t:hosts" />
                  <Var>n</Var>
                  <Var>a</Var>
                </Uniterm>
     </rif:then>
  </rif:Implies></rif:rule>
  
  <rif:rule><rif:Implies rdf:ID="r4">
  
     <!-- Descriptive metadata -->
     <rdfs:label>r4-discount</rdfs:label>
     <rdfs:comment>Datatypes and builtins</rdfs:comment>
     <dc:creator>Dave Reynolds</dc:creator>

     <!-- Rule variables, universally quantified -->
     <rif:vars rdf:parseType='Literal' xmlns="&rif;">
          <Var>mc</Var>
          <Var>cd</Var>
          <Var>c</Var>
     </rif:vars>

     <!-- rule body -->
     <rif:if rdf:parseType='Literal' xmlns="&rif;">
              <And>
                <Uniterm>
                  <Const rif:uri="fn:baseCost" />
                  <Var>mc</Var>
                  <Var>c</Var>
                </Uniterm>
                <Uniterm>
                  <Const rif:uri="fn:category" />
                  <Var>mc</Var>
                  <Const rif:uri="fn:Bulk" />
                </Uniterm>
                <!-- not sure if this should be a builtin relation or use equality -->
                <Equal>
                  <Var>cd</Var>
                  <Uniterm>
                      <!-- assuming builtin multiply function -->
                      <Const rif:uri="rif:multiply" />
                      <Var>c</Var>
                      <Const rif:datatype="xsd:double">0.75</Const>
                  </Uniterm>
                </Equal>
              </And> 
     </rif:if>

     <!-- rule head -->
     <rif:then rdf:parseType='Literal' xmlns="&rif;"> 
                <Uniterm>
                  <Const rif:uri="t:assumedCost" />
                  <Var>mc</Var>
                  <Var>cd</Var>
                </Uniterm>
     </rif:then>
     
  </rif:Implies></rif:rule>
  
  <rif:rule><rif:Implies rdf:ID="r5">
  
     <!-- Descriptive metadata -->
     <rdfs:label>r5-dependency</rdfs:label>
     <rdfs:comment>Vocabulary access and predicate variables</rdfs:comment>
     <dc:creator>Dave Reynolds</dc:creator>

     <!-- Rule variables, universally quantified -->
     <rif:vars rdf:parseType='Literal' xmlns="&rif;">
          <Var>a</Var>
          <Var>P</Var>
          <Var>n</Var>
          <Var>subApp</Var>
     </rif:vars>

     <!-- rule body -->
     <rif:if rdf:parseType='Literal' xmlns="&rif;">
              <And>
                <Uniterm>
                  <Const rif:uri="rdf:type" />
                  <Var>a</Var>
                  <Const rif:uri="bp:Application" />
                </Uniterm>
                <!-- Variable in relation position -->
                <Uniterm>
                  <op><Var>P</Var></op>
                  <Var>a</Var>
                  <Var>subApp</Var>
                </Uniterm>
                <Uniterm>
                  <Const rif:uri="rdfs:subPropertyOf" />
                  <Var>P</Var>
                  <Const rif:uri="bp:comprises" />
                </Uniterm>
                <Uniterm>
                  <Const rif:uri="t:hosts" />
                  <Var>n</Var>
                  <Var>subApp</Var>
                </Uniterm>
              </And> 
     </rif:if>

     <!-- rule head -->
     <rif:then rdf:parseType='Literal' xmlns="&rif;"> 
                <Uniterm>
                  <Const rif:uri="t:dependsOn" />
                  <Var>a</Var>
                  <Var>n</Var>
                </Uniterm>
     </rif:then>

  </rif:Implies></rif:rule>
          
</rif:Ruleset>

</rdf:RDF>

Changes

30/10/06 Fixed the syntax slightly after suggestions from Harold to stick closer to the core syntax (dropped use of separate Ind, fixed up use of <Rel>).
31/10/06 Changed nesting of <Equal> in response to change/clarification of A.1 BNF.
31/10/06 Switched to Harold's proposal of <op><Var> to designate variables in the Relation position.
31/10/06 Added comments on use of vars/head/body versus ruleSrc.
21/05/07 Updated Dave's use case towards the XML syntax of RIF Core WD1.