Data Sets

This is a work-in-progress page for the issue of referencing data sets within RIF.

[The idea of such pages is that they should pin down what the problem is (Framing), explore what RIF might or might not do about it (Discussion), then when some clear picture starts to emerge we can draft text which might go in the final document (Draft text). This page is still very much at the discussion stage, though it is now on its second iteration of that.]

Framing

Rule sets are typically intended to act over some separate set of data. What should RIF be able to express about such data sets?

There are several dimensions to this problem.

Data set identification. Consider a supply chain application in which a supplier publishes a rule set for computing product discounts. This rule set might act on two separate data sets - the order being processed (local to the rule set user) and a database of prices and discount classes (published by the supplier). Should RIF be able to indicate the specific global datasets (such as the discount table in this example) which are needed in order for the rule set to function? And if so, how?
Data model identification. RIF is expected to support applications using data in many formats such as RDF, XML, object models, databases. Within each of these formats a given rule set might assume a particular data model (ontology, XML Schema, meta-model, relational schema). Should RIF be able to indicate the data model(s) to which a given rule set is applicable? And if so, how?
Data model usage. If a ruleset operates over data expressed in some external data model then it will need to be able to navigate that data. What are the RIF mechanisms which enable rules to navigate data expressed in such an external data model? How and where is the mapping from that external data model to the corresponding RIF mechanism defined?

Discussion

Discussions on these issues tend to get sucked into the last topic, data model usage, but let's try to separate them as much as possible.

Data set identification

Yes, there are cases where a rule set needs to refer to an external source and RIF should provide a mechanism for this. This mechanism should be the minimum needed to facilitate interchange and should avoid saying anything about local execution environments. E.g. in the supply chain example RIF shouldn't say anything about how to identify the order to be processed; that is relevant only to the end application, not to interchange.

Thus we propose treating the problem as one of metadata. We suggest annotating rulesets with metadata which references any required datasets. This metadata should be open ended so that specific applications can extend it without requiring a change to the RIF specification.

The metadata mechanism for RIF is not yet defined but for the purposes of this discussion we assume the metadata is expressed in RDF, even if the syntactic form is not RDF/XML.

The bare minimum starting vocabulary might look like (N3/Turtle syntax):

  rif:RuleSet a rdfs:Class .

  rif:DataSet a rdfs:Class .

  rif:requiresDataSet a rdf:Property;
    rdfs:domain rif:RuleSet ;
    rdfs:range  rif:DataSet .

Thus in our supply chain example one might have:

 [] a rif:RuleSet;
 
   rif:requiresDataSet <http://example.com/discountStructure> ;

   rif:requiresDataSet 
      [ rdfs:label "order";
        rdfs:comment "The order to be processed" ] .

The notion here is that the required global datasets are identified by URIs, as in http://example.com/discountStructure. However, these are intended purely as identifiers. The fact that one might choose to use an http URI does not necessarily mean that the data itself has to be accessible over the web, though that might be the case for many applications.

The local datasets, the order data in the supply chain example, are left anonymous by using a bNode but annotated. RIF should specify a minimal set of metadata which should be associated with an anonymous dataset. That minimum is probably an open descriptive text field plus any data model metadata (see below). I suggest using rdfs:comment for this descriptive text but dc:description would also be acceptable.
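
As a rough illustration of how a consumer of this metadata might separate global from local datasets, here is a minimal Python sketch. It is not part of any RIF proposal: the tuple representation of triples, the "_:" prefix for bNodes, and the function names objects and classify_datasets are all assumptions made for the example.

```python
# Hypothetical in-memory form of the example metadata: (subject, predicate, object).
# Blank nodes are written with a "_:" prefix, URIs as plain strings (an assumption
# of this sketch, not something the RIF metadata proposal defines).
TRIPLES = [
    ("_:ruleset", "rdf:type", "rif:RuleSet"),
    ("_:ruleset", "rif:requiresDataSet", "http://example.com/discountStructure"),
    ("_:ruleset", "rif:requiresDataSet", "_:order"),
    ("_:order", "rdfs:label", "order"),
    ("_:order", "rdfs:comment", "The order to be processed"),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in triples."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def classify_datasets(triples, ruleset):
    """Split the datasets required by a rule set into global and local ones."""
    global_sets, local_sets = [], []
    for ds in objects(triples, ruleset, "rif:requiresDataSet"):
        if ds.startswith("_:"):
            # Anonymous dataset: keep its descriptive text (or None if missing)
            # so the end application can decide what to bind to it.
            comments = objects(triples, ds, "rdfs:comment")
            local_sets.append((ds, comments[0] if comments else None))
        else:
            # Named dataset: the URI is only an identifier, not a fetch location.
            global_sets.append(ds)
    return global_sets, local_sets


if __name__ == "__main__":
    global_sets, local_sets = classify_datasets(TRIPLES, "_:ruleset")
    print("global:", global_sets)
    print("local:", local_sets)
```

A local dataset whose description comes back as None would be one that lacks the minimal metadata suggested above.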

Note: When we add modules to RIF then the dataset metadata will probably need extending to include a module name by which the dataset will be known in the ruleset, including some notion of a default module.

Data model identification

Some datasets may be simply RIF rulesets comprising just facts. In that case no further information is needed. Referencing a global dataset in that case can be achieved using a simple ruleset import mechanism.

We could stop there and say that how datasets in some application-specific datamodel get translated to RIF fact-sets is outside the scope of RIF.

However, there seems to be a desire for RIF to be able to reference datasets in application-specific formats and at least document what the format is.

The first part of this problem is simply identifying what the data model is. At a minimum this allows a RIF processor to determine whether or not this is a data format it supports. Ideally the identification is sufficient to enable a RIF processor that supports it to determine the appropriate mapping needed to handle the data. Whether that mapping is implemented by translating the data to RIF facts or by connecting RIF builtins to native data accesses is open.

Based on a poll at F2F6, RIF members would like support for at least three data models:

XML with the model specified via XML Schema
RDF with the model specified via RDFS/OWL Full
Object data with the model specified via UML or ODM

The problem of identifying the application data model to be used is again a metadata problem. The proposal is to extend the above metadata schema with support for this.

  rif:dataModel  a rdf:Property ;
    rdfs:domain rif:DataSet ;
    rdfs:range  rif:DataModel .

  rif:DataModel a rdfs:Class .

  rif:schema a rdf:Property ;
    rdfs:comment "Identifies the application-specific ontology, schema or object model of the dataset (optional)" ;
    rdfs:domain rif:DataModel ;
    rdfs:range  rdfs:Resource .

  rif:RIFDataModel        rdfs:subClassOf rif:DataModel .

  rif:OWLFullDataModel    rdfs:subClassOf rif:DataModel .
  rif:RDFSDataModel       rdfs:subClassOf rif:DataModel, rif:OWLFullDataModel .
  rif:RDFDataModel        rdfs:subClassOf rif:DataModel, rif:RDFSDataModel .

  rif:XMLSchemaDataModel  rdfs:subClassOf rif:DataModel .

  rif:ObjectDataModel     rdfs:subClassOf rif:DataModel .
  rif:MOFDataModel        rdfs:subClassOf rif:ObjectDataModel .

We could thus extend our supply chain example:

 [] a rif:RuleSet;
 
   rif:requiresDataSet <http://example.com/discountStructure> ;

   rif:requiresDataSet 
      [ rdfs:label "order";
        rdfs:comment "The order to be processed" ;
        rif:dataModel <http://example.com/orderDataModel> ] .

 <http://example.com/orderDataModel> a rif:XMLSchemaDataModel ;
    rdfs:comment "The identifier for an order schema agreed by the consortium";
    rif:schema <http://example.com/orderDataModel.xs> .

 <http://example.com/discountStructure> a rif:DataSet;
    rif:dataModel rif:RIFDataModel .

Here we have specified that the rule set processes an order which conforms to an application-specific XML Schema, and that the discountStructure is available as a set of RIF facts which require no translation or adaptation.
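
To make the "does this processor support the data model?" check concrete, here is a minimal Python sketch. The subclass table is copied from the proposed vocabulary; everything else is an assumption of the example, in particular the names superclasses and is_supported and the reading that a processor supporting a superclass (other than the root rif:DataModel) can handle data declared with a subclass. The proposal itself does not state that rule.

```python
# Subclass edges copied from the proposed vocabulary.
SUBCLASS_OF = {
    "rif:RIFDataModel": ["rif:DataModel"],
    "rif:OWLFullDataModel": ["rif:DataModel"],
    "rif:RDFSDataModel": ["rif:DataModel", "rif:OWLFullDataModel"],
    "rif:RDFDataModel": ["rif:DataModel", "rif:RDFSDataModel"],
    "rif:XMLSchemaDataModel": ["rif:DataModel"],
    "rif:ObjectDataModel": ["rif:DataModel"],
    "rif:MOFDataModel": ["rif:ObjectDataModel"],
}

ROOT = "rif:DataModel"


def superclasses(cls):
    """Return cls together with all of its direct and indirect superclasses."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, []))
    return seen


def is_supported(model_class, supported):
    """Assumed rule: a model is usable if it, or a non-root superclass, is supported."""
    return bool((superclasses(model_class) - {ROOT}) & set(supported))


if __name__ == "__main__":
    processor_supports = {"rif:RIFDataModel", "rif:OWLFullDataModel"}
    for model in ("rif:RDFDataModel", "rif:XMLSchemaDataModel"):
        print(model, is_supported(model, processor_supports))
```

Under this reading, a processor that declares support for rif:OWLFullDataModel would accept a dataset typed as rif:RDFDataModel but would reject the XML Schema order data in the example.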

Data model usage

Identifying the application-specific data model is only a small part of the battle.

To be useful we have to be able to write rules which manipulate data according to that model, and a RIF processor then has to be able to connect those rules to the requisite instance data.

We have several possible approaches to this question under discussion at the moment. These include:

Library of translators. For a given category of data model we define a transformation which will convert that data to a RIF native format. The rules are written to manipulate that RIF native form. In the case of RDF we have proposed a translation algorithm which would apply to any RDF (and thus RDFS or OWL) document, irrespective of the associated ontology. For some data models, such as XML, the translation may be driven by a schema document. RIF itself would define only a small number of such translators (perhaps just RDF (schemaless) and XML (driven by XML Schema)). Other groups could define additional translators and record that fact in metadata, but there would be no automatic way for an existing RIF implementation to discover and use such extension translators.
Abstract datatypes. In this approach, rather than translating data into RIF facts, the data is exposed as concrete data types and manipulated via a set of model-specific builtin predicates and functions. In the case of RDF we would have types like graph, triple, node and associated predicates including SPARQL queries; in the case of XML we would have types like Element, Document and associated predicates such as XPath-like navigation.
Single meta translator. In this approach RIF adopts a single meta-meta modeling language (perhaps MOF or KM3) and defines a single transformation algorithm. For any data model category we define a meta-model for it, expressed in our standard meta-meta modeling language. The transformation algorithm can translate any instance data for which there is an associated model and meta-model to RIF native format. RIF would only define a small number of such meta-models but external groups could adopt additional ones. In this case the associated meta-model could be referenced in the ruleset metadata and a RIF implementation could apply the meta-translator without further modification.
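
The "library of translators" option can be pictured as a lookup from data model class to a translation function. The Python sketch below is only an illustration under stated assumptions: the registry, the decorator translator, the function load_dataset, and the encoding of each RDF statement as a ternary "triple" fact are all hypothetical and are not the translation algorithm that has been proposed for RDF.

```python
# Hypothetical registry: data model class -> function producing RIF-style facts.
TRANSLATORS = {}


def translator(model_class):
    """Decorator that registers a translation function for one data model class."""
    def register(fn):
        TRANSLATORS[model_class] = fn
        return fn
    return register


@translator("rif:RDFDataModel")
def rdf_to_facts(triples):
    # Assumed encoding for this sketch: one ternary 'triple' fact per statement,
    # applied without looking at any ontology (schemaless translation).
    return [("triple", s, p, o) for s, p, o in triples]


def load_dataset(model_class, raw_data):
    """Translate raw_data to facts, failing if no translator is built in."""
    try:
        translate = TRANSLATORS[model_class]
    except KeyError:
        # An extension translator recorded only in metadata cannot be
        # discovered here; the implementation has to be extended by hand.
        raise LookupError(f"no translator registered for {model_class}") from None
    return translate(raw_data)


if __name__ == "__main__":
    order = [("ex:order1", "ex:total", "100")]
    print(load_dataset("rif:RDFDataModel", order))
```

The LookupError branch corresponds to the limitation noted for this option: a translator defined by another group is invisible to an existing implementation until someone registers code for it. The single meta translator option aims to avoid that step by making the meta-model itself the input.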

Draft text

TBD.