Extensions Proposed By OpenLink

From SPARQL Working Group
Jump to: navigation, search

Proposed SPARQL Extensions for SPARQL 2.0 WG

Orri Erling, Ivan Mikhailov

OpenLink Software


Introduction

The broadest expression of our agenda is to make the web into a database capable of answering complex structured queries, including analytics. A semantic dimension is required for meaningful joining of data from diverse sources. RDF and SPARQL offer the potential for being the concluding chapter in the data integration saga, across the open web as well as in the enterprise. To support this, SPARQL must match and exceed SQL in all capabilities proper to SQL in addition to providing access to semantically rich data, inference and so forth.

In light of this, OpenLink has implemented numerous extensions to SPARQL, of which we offer some for consideration to the SPARQL 2.0 WG.

Below is a list of proposed extensions, with examples and implementation information where applicable. The features seen as more important are mentioned first. Separate lists are presented for the language and the protocol.

SPARQL Language

Updates

We suggest adoption of SPARUL as defined by Andy Seaborne et al in its current form. Virtuoso and ARQ implement this. The only work we see as remaining is finalizing a list of interoperability tests.

Expressions

We suggest allowing scalar valued expressions in select and construct as follows:

  • As a column in select
  • In order by and group by criteria
  • In the variable position in triple patterns

We do not propose introducing a "let" construct for establishing nested scopes for variables. We note that the subquery construct outlined below can serve to give names to expressions, so as to avoid repetition.

Nested Queries

We propose allowing an embedded select statement in the place of a triple pattern. Such a select statement may specify distinct, aggregation, order by etc. It may select expressions and assign an alias to them for reference in the enclosing context. This corresponds to the SQL derived table (select in a FROM clause).

The scope rules are as in SQL, where the nested query sees no variables of the enclosing context. Specifying conditions on selected columns of the subquery is used for joining between subquery columns and variables in the enclosing scope. The subquery has an optional correlation name. A reference of the form ?cname.var can distinguish between variables found in different subqueries if they have a common name. With such rules, a SQL view look-alike may also be introduced.

Scalar Subquery and Existence

We propose allowing a single column select in the place of a scalar expression. Such a select is expected to select 0 or one rows. If 0 rows are returned, the value is unbound, else the single column of the first row is returned. Returning multiple rows is an error. Returning a single row can be enforced by making the query and aggregate or with a order by plus limit clause.

We propose allowing exists (<scalar subquery>) in the place of a condition in a filter. The condition is false if the select returns 0 rows, else it is true. If * occurs in the selection, this is equivalent to selecting an arbitrary constant and is not taken to mean all variables bound inside the subquery.

The scope rules for the scalar and exists subqueries are the same as for an optional group pattern. This is similar to SQL usage.

?x foaf:name "Alice". filter ( exists ((select * where { ?x foaf:knows ?y . ?y foaf:name "John"})))

selects an ?x with name Alice such that ?x knows at least one subject with foaf:name John.



Aggregation and Grouping

  • We propose allowing aggregates and group by in select statements in a manner similar to SQL.
  • We propose allowing a column ordinal number in the place of a column in order by, as in SQL, to avoid repetition of possibly long expressions in a selection if sorting based on these expressions.
  • We propose including the grouping set construct from Oracle SQL. This allows putting multiple group by clauses in a select, so that data can be fetched once but aggregated in many ways. This is functionally related to the SQL 99 cube and rollup options but allows more fine grained control over which aggregations are produced.

If a selection contains expressions that involve aggregates and expressions that do not, then a group by clause is required.


Examples of Nesting, Aggregation and Grouping

Below are a few queries that show nesting, aggregation and grouping as presently implemented in Virtuoso. We consider these features essential for real world SPARQL but are flexible as concerns their syntax. The below is not necessarily our final proposal on the matter.

Celebrities

This query shows the top 10 people ranked by how many people claim to know them without being known in return. This shows a basic group by with an existence test.

select ?celeb, count (*)
where {
  ?claimant foaf:knows ?celeb .
  filter (!bif:exists ((select (1) where { ?celeb foaf:knows ?claimant })))
  } group by ?celeb order by desc 2 limit
10


People With Common Interests

This query takes people who have at least one interest in common with foaf handle plaid_skirt. Then this counts how many interests each such person shares with plaid_skirt and how many total interests each such person has. The results are sorted on the count of shared interests. This shows a nested distinct subquery and the use of scalar subqueries and the reference to a selection column in order by with the column position.

select distinct ?n ((select count (*) where {?p foaf:interest ?i . ?ps foaf:interest ?i}))
   ((select count (*) where { ?p foaf:interest ?i})) 
where
{
?ps foaf:nick
"plaid_skirt"@en .
{select distinct ?p ?psi where {?p foaf:interest ?i . ?psi
foaf:interest ?i }} . 
  filter (?ps = ?psi)
  ?p
foaf:nick ?n
} order by desc 2 limit
50


Distinctive Shared Interests

This query finds foaf:interests which are shared by only few people. It then returns pairs of people interested in the same specialty where the two people do not know each other. We have both a nested group by and existence tests.

select ?i ?cnt ?n1 ?n2 ?p1 ?p2
  where {
      {select ?i count (*) as ?cnt
          where {
              ?p foaf:interest ?i}
          group by ?i
      }
      filter ( ?cnt > 1 && ?cnt < 10) .
      ?p1 foaf:interest ?i .
      ?p2 foaf:interest ?i .
      filter (?p1 != ?p2 &&
               !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &&
               !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
      ?p1 foaf:nick ?n1 .
      ?p2 foaf:nick ?n2 .
   }
  order by ?cnt limit 10


Syntax for Pragmas

We propose an XQuery-like feature for pragmas in SPARQL. The syntax may be (* qname ... *) and would be allowed in places where it

  • Affects the whole query, as in the prefface with namespace declarations
  • Affects a triple pattern
  • Affects a group pattern or subquery. What comes after the qname depends on the qname.

An implementation would signal an error if the qname were unknown to it. Thus a query could assert that it required certain functionality. Note that not all functionality is designated by special syntax, for example run time inference does not have a corresponding syntactic consttruct but still a query might state that it expects a certain inference to be made.


Full Text

We propose support for a full text search capability. Since a full text match typically returns a match score, this would be in the form of a pattern like

?o contains <text condition> ?score .

Some implementations exist, e.g. Virtuoso and ARQ. They are not interoperable however. For the convenience of standardization, we propose to reuse the XPATH/XQuery full text extension. This will not interoperate with existing implementations. Not reusing the XPATH or possibly SQL MM specifications would make the task unachievable within the scope of the WG. Reusing existing proposals makes the addition trivial. Support of the XPATH/XQuery text expression syntax is not very hard on implementors since they already must parse some syntax for text conditions.


Path Expressions

Both Virtuoso and ARQ have some support for paths consisting of dereferencing properties in SPARQL. ARQ offers a complex path language with transitivity but does not allow paths in expressions. Virtuoso allows simple paths in expressions and has a separate feature for transitivity. We propose a merge of these specifications.

In ARQ, ?x foaf:knows / foaf:knows ?y means the same as ?x foaf:knows ?t . ?t foaf:knows ?y .

In Virtuoso, ?x+>foaf:knows foaf:knows ?y means the same thing.

In Virtuoso

select ?f+>foaf:name ?f|>foaf:mbox where { ?x foaf:name "Alice" . ?x foaf:knows ?f . filter (?f+>foaf:name = "John") } 

means

select ?fname ?mbox
where {
  ?x foaf:knows ?f .
  ?x foaf:knows ?f .
  optional {?f foaf:mbox ?mbox} .
  ?f foaf:name ?fname .
  ?x foaf:name "Alice" .
  ?x foaf:knows ?f2 .
  ?f2 foaf:name
"John" }.

Which selects all names and optionally mailboxes of friends of all Alices who know a John.


We suggest:

  • Adopting the / operator and * (transitivity option) from ARQ. We do not object to adopting more path features from ARQ but would consider the aforementioned a minimum.
  • Adding an inner and outer dereference operator for expressions, as +> and |> above.


End Point Self-Description

As SPARQL becomes as complex and more so than SQL, differences in extent of support between implementations will become inevitable.

For this purpose, we suggest assigning a URI to distinct areas of functionality, such as aggregation, inference control and full text.

An end point would thus be able to advertize the supported SPARQL profile without applications having to know about the extent of support of each version of each vendor's implementation.

On the reverse side, a query could use the pragma construct with any of these URI's to assert that it expects a certain feature to be provided by the implementation.

Some of the obvious self-description information includes:

  • Make and version of server.
  • Maximum timeout before queries are terminated.
  • SPARQL capabilities supported
  • XQuery function library functions supported
  • Graph IRI's of graphs containing VoID descriptions of data sets hosted on this instance.

For purposes of query federation and estimation of cardinalities, we believe the VoID (Vocabulary of Interlinked Data) is a good start and removes much of the need for the SPARQL WG to focus on this.


SPARQL Federation

Ideally, federating queries should be transparent. With SPARQL,this comes at a high cost, specially if any of multiple end points may in principle match any of the triple patterns in a query. This means that information about colocation of data within one end point is hard to express.

For this purpose, we suggest that a subquery in a SPARQL query, see above, could have an additional federated option. This option would specify that this subquery were sent as a whole to each of possibly many end points and that the union of the results of these would be considered the subquery's evaluation.

Thus if the subquery were a conjunction of patterns p1 and p2, the matches of p1 and p2 would have to be found at the same end point for the solution to be included in the solution set of the subquery.

Without such a device, pushing joins to end points is somewhat problematical and imposes a heavy penalty on any federated system.

We note that ARQ implements a SPARQL service construct for embedding a a remotely executed subquery into a SPARQL query. This is however a departure from SPARQL's declarative semantic and scope rules, as we understand them. The proposed federated subquery would leave all latitude to the implementation to decide on join order and would thus be more in the non-procedural spirit of SPARQL.

For example:

select ?contact1 where {
  {select ?contact1 where { ?me foaf:nick "Orri" . ?me foaf:knows ?f . ?f foaf:name ?contact1 }} option (federated
<http://www.semanticweb.com/sparql>) .
  {select ?contact2 where { ?me foaf:nick "Orri" . ?me foaf:knows ?f . ?f foaf:name ?contact2 }} option (federated
<http://www.myopenlink.net/sparql>) . 
  filter (?contact1 = ?contact2)
}

This would return the people whom Orri knows at both semanticweb.com and myopenlink.net. It is clear from the use case that it makes little sense to look for contacts of Orri's IRI at myopenlink.net in the repository at semanticweb.com if both sites were known to assign different person URI's. Without the explicit partitioning of the patterns, it would be difficult to infer this.

The query optimizer would have all latitude in determining which site had fewer contacts and putting this first in a loop join or for retrieving both in parallel and doing a hash intersection or any other possible execution plan. The plans' relative merit is entirely dependent on the expected cardinalities and access latencies of the sites.

Control of Inference

An implementation may offer inference either at run time or may hold precalculated inference results apart from the original data. In both cases, it will be relevant to specify what inferred triples will be considered in producing an evaluation of a query.

Virtuoso implements this by introducing a so-called inference context. Such a context specifies what subclasses, subproperties, IFP's and the like will be considered in evaluating a query. Either the whole query or a single triple pattern may be qualified with an inference context.

?x a yago:entity matches if ?x is a direct instance of yago:entity. ?x a yago:entity option (input:inference <yago>) matches if ?x is an instance of yago:entity or any subclass thereof, as specified in the inference context <yago>.

For a query wide declaration, the clause define input:inference <yago> may be added in the preface of the query, together with namespace prefixes.

Specifying what functionality including an inference context may imply is beyond the scope of the WG but specifying an interoperable syntax is useful. We propose using the pragma feature above for declaring this.

An implementation might be able to include a set of RIF rules, RDFS or OWL ontologies inside an inference context. Evaluating a query in such a context would evaluate it as if the entailed triples were present.

One-Of Group Pattern

It is possible that an application knows of many semantically equivalent expressions of a query. In principle, given any expression of the query, the query optimizer should be able to arrive at the optimal plan. This is not always so, though.

There are cases where an optimizer cannot make this choice. This is for example the case if at the application level it is known that equivalent data (for application purposes) exists in many sources, e.g. a RDBMS mapped on demand to SPARQL and an extract of said RDBMS stored as RDF. In such a case, the application can issue a special kind of union of which only one term is to be evaluated, at the discretion of the cost model.

We see this as specially useful for federated queries.

For example

select ?contact where { graph <mapped> { ?x foaf:name "Alice". ?x foaf:knows ?f . ?f foaf:name ?contact }
 alternate graph <etl> { ?x foaf:name "Alice". ?x foaf:knows ?f . ?f foaf:name ?contact }}

The scope rules are as for union. A trivial implementation would always choose the first alternative, thus the minimal cost is low. Since a set of alternate patterns may occur at any level of nesting in a query, this is not readily replaced by giving the application access to cost model estimates.

SPARQL Protocol

Parameters

Experience demonstrates that producing an execution plan may take up to half of the processing time of many queries, specially if these queries select relatively little data. This is demonstrated for example by the Berlin SPARQL benchmark.

This may be easily remedied by supporting parametrized queries. Since the protocol is stateless, a ODBC/JDBC style notion of a prepared statement does not apply. However, it is simple to cache a number of recent parametrized execution plans on a server for eventual reuse.

In Virtuoso, if the name of a URL parameter in the SPARQL GET or POST request corresponds with a variable of the same name in the query, the value of this URL parameter is taken as as the initial binding of the variable, which is in effect for this execution of the query. This is an existing feature of Virtuoso. ARQ also offers parametrized queries, however not over the SPARQL protocol.

For better error checking, we propose a special notation for parameter placeholders. In this manner, it is possible to signal an error if a parameter value is expected but is not present in the request.  This could be ?:xx in the place ?xx.

Execution Comments and Warnings

A result set, whether of select, ask or construct, should be able to carry metadata pertaining to the execution of the query. Example use cases are:

  • Returning a partial result set if the query is interrupted for any reason, such as run time error, timeout or such.
  • Returning warnings, as may pertain to the query, for example presence of filter conditions that are identically false, use of deprecated features etc.
  • Returning resource utilization information.

Timeout and Resource Constraints

In an open web environment, specially if queries are being federated, a best effort approach will often be required. This means omitting results from sources which do not answer in the required time. Also, in interactive situations such as faceted browsing, results may cease to be relevant if they are not returned within a short time limit. For efficient resource utilization, such timeouts ought to be enforced server side.

For this purpose, we propose an extra timeout parameter for the different SPARQL request formats.


Cost Model Interface

For federated query optimization, a SPARQL end point may expose a hook into its query cost model. Thus, given the text of a SPARQL query, the implementation would respond with its guess of the quantity of solutions produced and query execution time. The implementation cost is small, since implementations must anyhow have a cost model. The expected user of such a service would be a federated SPARQL query optimizer.


Cursors

We note that OWL QL has a notion of a handle on a query. Such a handle may be used in order to obtain additional results for a query. An implementation is free to process this in any manner of its choice, from keeping an open cursor to re-evaluating the query with different offset-limit settings. We do not consider this as particularly critical but also would not object to this feature being copied from OWL QL into SPARQL.