Design:Aggregate

From SPARQL Working Group
Jump to: navigation, search


Definition and scope of feature

We extend the existing SPARQL algebra with aggregate operators and functions. The goal is to re-use the general approach in relational algebra for creating aggregate formation operators and well-defined aggregate functions that extend a multiset of solutions with bindings to the result of applying these aggregate functions to partitions of the multiset distinguished by uniqueness for the values of a user-specified set of variables.

Status

The open issues are below:

Syntax

SPARQL queries that leverage aggregate operators mostly add

 GROUP BY ......

expression to the end of a SELECT, ASK?, ... queries and the list of projected variables are updated to include the use of aggregate functions along with the variables their values are bound to

SelectQuery          ::=  'SELECT' ( 'DISTINCT' | 'REDUCED' )? ( (Var | AggregateFunc '(' Var ') AS' Var)+ | '*' ) 
                           DatasetClause* WhereClause SolutionModifier
WhereClause          ::=  'WHERE'? GroupGraphPattern AggregateExpression?
AggregateExpression  ::=  'GROUP BY' ?Var+ HavingExpression?
HavingExpression     ::= HAVING Constraint
Constraint           ::= BrackettedExpression | BuiltInCall | AggregateFunc '(' Var ')'
PrimaryExpression    ::= BrackettedExpression | BuiltInCall | IRIrefOrFunction | RDFLiteral | NumericLiteral | 
BooleanLiteral | Var | AggregateFunc '(' Var ')'

An example:

SELECT ?org sum(?lprice) as ?totalPrice
WHERE { 
  ?org :affiliates ?auth .
  ?auth :writesBook ?book .
  ?book :price ?lprice 
} 
GROUP BY ?org
HAVING (sum(?lprice) > 10)

A.8 Grammar in SPARQL/Query 1.0.

Algebra operators

An algebraic operator Aggregation is introduced that returns a solution mapping. It has the following signature documenting its arguments:

   Aggregation(VarList, FuncAndVars, Omega, Mu)

VarList is a list of one or more variables whose values will be used to form partitions of Omega. Each partition corresponds to those mappings where the values (or the range) for the variables in VarList are equivalent. FuncAndVars is a set of 3-tuples (f,Vin,Vout) where f is one of a set of aggregate functions. Each aggregate function takes a set of RDF terms (Literals only?) and return a numeric value as an RDF literal. Vin is the variable whose bindings are picked from the mappings in a partition and used as input to the aggregate functions. Vout is the variable that will be used to bind the result of the aggregate function call. Omega is a SPARQL multiset of solution mappings, one of which is Mu

The intention is for Aggregation to return a solution mapping whose domain is all the variables (Vout) that are either:

  • arguments of each aggregate function in the SELECT clause
  • one of the variables in the GROUP BY expression

The range is all the results from applying the function to a set of numeric values. Vin is used to create this set from the values bound to it amongst the solution mappings in Omega that are compatible with Mu.

Another operator Partitions is introduced which returns a set (so duplicates are not allowed) of mappings

   Partitions(VarList, Omega)

Partitions returns a set of solution mappings restricted to the variables in VarList, forming partitions within Omega where the bindings for the given variables are the same.

Finally, the Groups operator is added which also returns a set of solution mappings.

   Groups(VarList, FuncAndVars, Omega)

It returns a set of solution mappings starting with the partitions formed using VarList against Omega. Each solution mapping in the partition is extended with the result of calling Aggregation using the given arguments ( VarList, FuncAndVars, and Omega ) and the current partition solution mapping

Draft Formalization

From 2009JulSep/0144 email

Single valued function: returns a solution projected down to named variables only:

 key(varlist, μ) = { (v,x) | (v,x) in μ, v in varlist }

and the set of all keys:

 key(varlist, Ω) = { k | μ in Ω, k=key(varlist,μ) }

The partition of the multiset Ω is:

 Partition(varlist, Ω) = { (k,µ) | µ in Ω, k=key(varlist, µ) }

First Version

This is the version in the email. It allows for multiple aggregations over the same groupings but needs to introduce FuncAndVars.

Let agg(VarList,Ω') be the aggregation function run on a multiset of solutions Ω', taking variables VarList. Then the aggregation is the set of solutions defined by each key in the multiset Ω being evaluated:

(this is the version in the original email)

Aggregation(VarList, FuncAndVars, Ω) =
  { merge(k, (Vout, agg(Vin,X)) | (k, X) in Partition(varlist, Ω), 
                                  FuncAndVars= set (f, Vin, Vout), 
                                  agg in set f }

Simplified version

Simple form: single aggregator. Aggregation function agg(key, X) which produces a single value for each key and partition for that key (key, X).

Aggregation(VarList, Var, agg, Ω) =
  { merge(k, (Var, agg(k, Ω')) | (k, Ω') in Partition(varlist, Ω), Var is a variable }

Var is not mentioned in the pattern used to determine Ω. Generalize this form to multiple aggregations as per first version.

Aggregate functions

.. Common restrictions that each aggregate function needs to adhere to ..

Mapping from AST to algebra

Test Cases

Data