Feature:CompositeDatasets

From SPARQL Working Group

Feature: Composite Datasets

It would be useful in general to be able to refer to arbitrary subsets of the named graphs in a dataset, either by a long-lived name or by an expression that groups such subsets from the given dataset 'on the fly.' This feature addresses that need comprehensively. The All Named Graphs feature, by contrast, covers only the simplest extension: the ability for a client to request that the default graph be the RDF merge of all the named graphs of the dataset.

Feature description

The main issue this feature is meant to address is that SPARQL doesn't "restrict the relationships of named and default graphs" (section 8.1, Examples of RDF Datasets). Furthermore, it doesn't specify any way to re-cast the dataset into groupings that are meaningful but ad hoc, and thus best implemented as a layer on top of the given dataset.

The base case, however, is where a user simply wants to dispatch a query that ranges over the merge of all the named graphs in a reproducible way. This base case (alone) is dealt with in another feature: All Named Graphs.

( Ivan Mikhailov proposes two alternative semantics for the feature.

Case 1.

If the composed graph is not free from duplicates then the feature is very simple, but it might be convenient to allow persistent declarations of composed named lists (say, for security).

Case 2.

If the composed graph should be free from duplicates, then the execution cost is similar to the cost of creating any other temporary graph. Thus, it may be practical to "split" the feature into several:

  1) allow an arbitrary CONSTRUCT or DESCRIBE for a temporary graph;
  2) allow both an anonymous temporary graph ( FROM CONSTRUCT { ... } ) and COMPOSE <iri> AS CONSTRUCT ...;
  3) allow COMPOSE <iri> ( <graph1> ... <graphN> ) as a shorthand for CONSTRUCT { ?s ?p ?o } FROM <graph1> ... FROM <graphN> WHERE { ?s ?p ?o }.

This would make performance no worse than a plain union, while remaining relatively cheap to implement and "overkill enough" to be stable for a long time. )
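The difference between the two cases can be illustrated with a minimal Python sketch. It is not part of the proposal: the dictionary-based dataset, the `ex:` names, and both function names are hypothetical, and the sketch assumes graphs without blank nodes, so that the RDF merge reduces to a plain union of triples.

```python
# Hypothetical toy dataset: graph IRI -> list of (subject, predicate, object) triples.
# No blank nodes are used, so an RDF merge is simply a union of triples here.
DATASET = {
    "ex:g1": [("ex:p1", "ex:sex", "male"), ("ex:p1", "ex:age", "54")],
    "ex:g2": [("ex:p1", "ex:sex", "male"), ("ex:p2", "ex:sex", "female")],
}


def compose_with_duplicates(dataset, graph_iris):
    """Case 1: concatenate the member graphs; repeated triples are kept."""
    return [triple for iri in graph_iris for triple in dataset[iri]]


def compose_distinct(dataset, graph_iris):
    """Case 2: build a temporary graph in which each triple appears once."""
    return set(compose_with_duplicates(dataset, graph_iris))


bag = compose_with_duplicates(DATASET, ["ex:g1", "ex:g2"])
merged = compose_distinct(DATASET, ["ex:g1", "ex:g2"])
```

In this sketch Case 1 only records which graphs belong to the composite and reads them in sequence, whereas Case 2 has to materialize a temporary graph, which is where the extra cost described above comes from.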

Example

   PREFIX trials: <tag:info@example.com,2008:Cohorts#>
   PREFIX trial:  <tag:info@example.com,2008:Cohorts#>
   SELECT ...
   COMPOSE GRAPH trials:MalePatients ( trial:B211 trial:B233 trial:A422 trial:A511 trial:CB30 )
   FROM trials:MalePatients
   WHERE {
      ... basic graph pattern ...
      # The pattern above ranges over the default graph, which is the RDF merge
      # of graphs trial:B211, trial:B233, etc.
   }

   PREFIX trials: <tag:info@example.com,2008:Cohorts#>
   PREFIX trial:  <tag:info@example.com,2008:Cohorts#>
   SELECT ...
   COMPOSE GRAPH trials:MyTrial  ( trial:B211 trial:B233 trial:A422 trial:A511 trial:CB30 )
   COMPOSE GRAPH trials:MyTrial2 ( trial:C211 trial:F003 trial:C022 trial:X111 trial:ZB30 )
   FROM NAMED trials:MyTrial
   FROM NAMED trials:MyTrial2
   WHERE {
      GRAPH ?RECORD {
        ... basic graph pattern ...
        # The pattern above ranges over two named graphs, each of which is the
        # RDF merge of 5 named graphs from the original dataset.
        # In addition, the solutions are 'confined' to one of the two aggregate graphs,
        # per the standard semantics of GRAPH patterns.
      }
   }

   PREFIX trial: <tag:info@example.com,2008:Cohorts#>
   SELECT ...
   FROM *
   WHERE {
      ... basic graph pattern ...
      # The pattern above ranges over the
      # RDF merge of all the named graphs from the original dataset.
   }

The expression

 FROM * 

is a convenient syntactic shortcut for:

   COMPOSE GRAPH :AllNamedGraphs ( graphIRI1 graphIRI2 ... graphIRIN )
   FROM :AllNamedGraphs

where ( graphIRI1 graphIRI2 ... graphIRIN ) is an explicit enumeration of the IRIs of all the named graphs in the dataset.
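The expansion above is purely mechanical, as the following Python sketch suggests. The function name `expand_from_star` is hypothetical, as are the `http://example.org/` graph IRIs; writing the members in `<...>` form and sorting them (so that the same dataset always yields the same text) are assumptions of this sketch rather than anything the proposal specifies.

```python
def expand_from_star(named_graph_iris):
    """Rewrite FROM * into the explicit COMPOSE GRAPH / FROM form.

    named_graph_iris: iterable of the IRIs of all named graphs in the dataset.
    The IRIs are sorted so the expansion is reproducible (an assumption here).
    """
    members = " ".join(f"<{iri}>" for iri in sorted(named_graph_iris))
    return (
        f"COMPOSE GRAPH :AllNamedGraphs ( {members} )\n"
        "FROM :AllNamedGraphs"
    )


expanded = expand_from_star(["http://example.org/g2", "http://example.org/g1"])
print(expanded)
```

A query processor supporting the proposed syntax could apply a rewrite of this kind before evaluation, so that FROM * would need no evaluation semantics of its own.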

Existing Implementation(s)

Various implementations provide mechanisms for this:

  • Named graphs in Open Anzo. In fairness, Open Anzo does not have any mechanism to write a single query that works against multiple graphs, each of which is the RDF merge of other graphs.
  • Jena

Other implementations provide a 'system-level' flag that hardcodes the default graph to always be the RDF merge of all the named graphs.

Existing Specification / Documentation

Compatibility

This feature is purely additive, so SPARQL queries that do not use the added syntax should behave in the same manner as under the previous specification.

Links to postponed Issues

Related Features

The Parameterized Inference feature mentions the need for this capability for answering queries such as:

 SELECT NAMES OF ALL PEOPLE I KNOW OR NAMES OF PEOPLE KNOWN BY THESE PEOPLE

Champions

Chimezie Ogbuji, Cleveland Clinic Foundation

Use cases

The primary motivation for this feature is the scenario where SPARQL is used as the mechanism for managing a cohort as an RDF dataset.

 Cohort: A defined population group followed prospectively in an epidemiological study.

In such an arrangement, it is useful to allocate a named graph for each patient and query the entire population to find specific groups that match certain criteria (where the groups targeted can vary by year, race, sex or other such demographic characteristics). In this scenario, you typically set up the RDF dataset once (similar to a conventional data warehouse) and dispatch different queries over different subsets of the cohort. Being able to specify a particular subset (possibly in a way that is long-lived and re-usable in later query sessions) is therefore critical for using SPARQL in this way.
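As a rough illustration of this scenario, the Python sketch below models a cohort with one named graph per patient and two re-usable composite definitions, then evaluates a single triple pattern within each composite, loosely mirroring a GRAPH pattern over composed graphs. Everything in it is hypothetical: the patient data, the `ex:` predicates, and the function names are invented for the example, and the graphs contain no blank nodes, so the merge is a plain set union.

```python
# Hypothetical cohort: one named graph (a set of triples) per patient.
COHORT = {
    "trial:B211": {("ex:pt-B211", "ex:sex", "male"), ("ex:pt-B211", "ex:enrolled", "2007")},
    "trial:B233": {("ex:pt-B233", "ex:sex", "male"), ("ex:pt-B233", "ex:enrolled", "2008")},
    "trial:C211": {("ex:pt-C211", "ex:sex", "female"), ("ex:pt-C211", "ex:enrolled", "2007")},
}

# Long-lived composite definitions: composite graph name -> member graph names.
COMPOSITES = {
    "trials:MyTrial": ["trial:B211", "trial:B233"],
    "trials:MyTrial2": ["trial:C211"],
}


def compose(dataset, members):
    """Merge the member graphs into one graph (set union; no blank nodes assumed)."""
    merged = set()
    for iri in members:
        merged |= dataset[iri]
    return merged


def subjects_matching(graph, predicate, obj):
    """Sorted subjects s such that (s, predicate, obj) is in the graph."""
    return sorted(s for s, p, o in graph if p == predicate and o == obj)


def query_composites(dataset, composites, names, predicate, obj):
    """Evaluate one triple pattern separately inside each named composite."""
    return {
        name: subjects_matching(compose(dataset, composites[name]), predicate, obj)
        for name in names
    }


result = query_composites(
    COHORT, COMPOSITES, ["trials:MyTrial", "trials:MyTrial2"], "ex:enrolled", "2007"
)
```

Because the composite definitions are kept apart from the per-patient graphs, the same dataset can be queried under different groupings without being reloaded, which is the property this use case depends on.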

References

See also an experiment with dynamic changes to the RDF dataset of a query: Walking the Web.