Warning:
This wiki has been archived and is now read-only.
Design:Aggregate
Contents
Definition and scope of feature
We extend the existing SPARQL algebra with aggregate operators and functions. The goal is to re-use the general approach in relational algebra for creating aggregate formation operators and well-defined aggregate functions that extend a multiset of solutions with bindings to the result of applying these aggregate functions to partitions of the multiset distinguished by uniqueness for the values of a user-specified set of variables.
Status
The open issues are below:
- What specific issues are there due to the use of multisets / bags?
- http://www.w3.org/2009/sparql/track/issues/14 - What specific aggregate functions will be included? (the standard relational algebra functions are: SUM, AVG, COUNT, MIN, MAX, and COUNT), including COUNT(*), COUNT(DISTINCT *), COUNT(?x), COUNT(DISTINCT ?x). Do we allow custom (URI-named) aggregates?
- What are the results when the group has no valid entries? e.g. SUM(?x) where the only ?x is a URI.
- How does REDUCED & DISTINCT interact with solution aggregation?
- Does the use of 'AS' with aggregate functions require updating the grammar to require commas in between terms in the projection list?
- http://www.w3.org/2009/sparql/track/issues/16 - What happens when we have MIN over different value spaces? e.g. 1 and "string".
- http://www.w3.org/2009/sparql/track/issues/15 - Support for custom aggregate functions (named by IRI)? What syntax?
- http://www.w3.org/2009/sparql/track/issues/35 - Do aggregates support distinct?
- http://www.w3.org/2009/sparql/track/issues/41 - Does GROUP BY allow expressions? multiple expressions?
Syntax
SPARQL queries that leverage aggregate operators mostly add
GROUP BY ......
expression to the end of a SELECT, ASK?, ... queries and the list of projected variables are updated to include the use of aggregate functions along with the variables their values are bound to
SelectQuery ::= 'SELECT' ( 'DISTINCT' | 'REDUCED' )? ( (Var | AggregateFunc '(' Var ') AS' Var)+ | '*' ) DatasetClause* WhereClause SolutionModifier WhereClause ::= 'WHERE'? GroupGraphPattern AggregateExpression? AggregateExpression ::= 'GROUP BY' ?Var+ HavingExpression? HavingExpression ::= HAVING Constraint Constraint ::= BrackettedExpression | BuiltInCall | AggregateFunc '(' Var ')' PrimaryExpression ::= BrackettedExpression | BuiltInCall | IRIrefOrFunction | RDFLiteral | NumericLiteral | BooleanLiteral | Var | AggregateFunc '(' Var ')'
An example:
SELECT ?org sum(?lprice) as ?totalPrice WHERE { ?org :affiliates ?auth . ?auth :writesBook ?book . ?book :price ?lprice } GROUP BY ?org HAVING (sum(?lprice) > 10)
A.8 Grammar in SPARQL/Query 1.0.
Algebra operators
An algebraic operator Aggregation is introduced that returns a solution mapping. It has the following signature documenting its arguments:
Aggregation(VarList, FuncAndVars, Omega, Mu)
VarList is a list of one or more variables whose values will be used to form partitions of Omega. Each partition corresponds to those mappings where the values (or the range) for the variables in VarList are equivalent. FuncAndVars is a set of 3-tuples (f,Vin,Vout) where f is one of a set of aggregate functions. Each aggregate function takes a set of RDF terms (Literals only?) and return a numeric value as an RDF literal. Vin is the variable whose bindings are picked from the mappings in a partition and used as input to the aggregate functions. Vout is the variable that will be used to bind the result of the aggregate function call. Omega is a SPARQL multiset of solution mappings, one of which is Mu
The intention is for Aggregation to return a solution mapping whose domain is all the variables (Vout) that are either:
- arguments of each aggregate function in the SELECT clause
- one of the variables in the GROUP BY expression
The range is all the results from applying the function to a set of numeric values. Vin is used to create this set from the values bound to it amongst the solution mappings in Omega that are compatible with Mu.
Another operator Partitions is introduced which returns a set (so duplicates are not allowed) of mappings
Partitions(VarList, Omega)
Partitions returns a set of solution mappings restricted to the variables in VarList, forming partitions within Omega where the bindings for the given variables are the same.
Finally, the Groups operator is added which also returns a set of solution mappings.
Groups(VarList, FuncAndVars, Omega)
It returns a set of solution mappings starting with the partitions formed using VarList against Omega. Each solution mapping in the partition is extended with the result of calling Aggregation using the given arguments ( VarList, FuncAndVars, and Omega ) and the current partition solution mapping
Draft Formalization
Single valued function: returns a solution projected down to named variables only:
key(varlist, μ) = { (v,x) | (v,x) in μ, v in varlist }
and the set of all keys:
key(varlist, Ω) = { k | μ in Ω, k=key(varlist,μ) }
The partition of the multiset Ω is:
Partition(varlist, Ω) = { (k,µ) | µ in Ω, k=key(varlist, µ) }
First Version
This is the version in the email. It allows for multiple aggregations over the same groupings but needs to introduce FuncAndVars.
Let agg(VarList,Ω') be the aggregation function run on a multiset of solutions Ω', taking variables VarList. Then the aggregation is the set of solutions defined by each key in the multiset Ω being evaluated:
(this is the version in the original email)
Aggregation(VarList, FuncAndVars, Ω) = { merge(k, (Vout, agg(Vin,X)) | (k, X) in Partition(varlist, Ω), FuncAndVars= set (f, Vin, Vout), agg in set f }
Simplified version
Simple form: single aggregator. Aggregation function agg(key, X) which produces a single value for each key and partition for that key (key, X).
Aggregation(VarList, Var, agg, Ω) = { merge(k, (Var, agg(k, Ω')) | (k, Ω') in Partition(varlist, Ω), Var is a variable }
Var is not mentioned in the pattern used to determine Ω. Generalize this form to multiple aggregations as per first version.
Aggregate functions
.. Common restrictions that each aggregate function needs to adhere to ..