# Design:Aggregate

## Contents

## Definition and scope of feature

We extend the existing SPARQL algebra with aggregate operators and functions. The goal is to re-use the general approach in relational algebra for creating *aggregate formation* operators and well-defined aggregate functions that extend a multiset of solutions with bindings to the result of applying these aggregate functions to partitions of the multiset distinguished by uniqueness for the values of a user-specified set of variables.

## Status

The open issues are below:

- What specific issues are there due to the use of multisets / bags?
- http://www.w3.org/2009/sparql/track/issues/14 - What specific aggregate functions will be included? (the standard relational algebra functions are: SUM, AVG, COUNT, MIN, MAX, and COUNT), including COUNT(*), COUNT(DISTINCT *), COUNT(?x), COUNT(DISTINCT ?x). Do we allow custom (URI-named) aggregates?
- What are the results when the group has no valid entries? e.g. SUM(?x) where the only ?x is a URI.
- How does REDUCED & DISTINCT interact with solution aggregation?
- Does the use of 'AS' with aggregate functions require updating the grammar to require commas in between terms in the projection list?
- http://www.w3.org/2009/sparql/track/issues/16 - What happens when we have MIN over different value spaces? e.g. 1 and "string".
- http://www.w3.org/2009/sparql/track/issues/15 - Support for custom aggregate functions (named by IRI)? What syntax?
- http://www.w3.org/2009/sparql/track/issues/35 - Do aggregates support distinct?
- http://www.w3.org/2009/sparql/track/issues/41 - Does GROUP BY allow expressions? multiple expressions?

## Syntax

SPARQL queries that leverage aggregate operators mostly add

GROUP BY ......

expression to the end of a SELECT, ASK?, ... queries and the list of projected variables are updated to include the use of *aggregate functions* along with the variables their values are bound to

SelectQuery ::= 'SELECT' ( 'DISTINCT' | 'REDUCED' )? ( (Var | AggregateFunc '(' Var ') AS' Var)+ | '*' ) DatasetClause* WhereClause SolutionModifier WhereClause ::= 'WHERE'? GroupGraphPattern AggregateExpression? AggregateExpression ::= 'GROUP BY' ?Var+ HavingExpression? HavingExpression ::= HAVING Constraint Constraint ::= BrackettedExpression | BuiltInCall | AggregateFunc '(' Var ')' PrimaryExpression ::= BrackettedExpression | BuiltInCall | IRIrefOrFunction | RDFLiteral | NumericLiteral | BooleanLiteral | Var | AggregateFunc '(' Var ')'

An example:

SELECT ?org sum(?lprice) as ?totalPrice WHERE { ?org :affiliates ?auth . ?auth :writesBook ?book . ?book :price ?lprice } GROUP BY ?org HAVING (sum(?lprice) > 10)

A.8 Grammar in SPARQL/Query 1.0.

## Algebra operators

An algebraic operator **Aggregation** is introduced that returns a solution mapping. It has the following signature documenting its arguments:

(AggregationVarList,FuncAndVars,Omega,Mu)

**VarList** is a list of one or more variables whose values will be used to form partitions of **Omega**. Each partition corresponds to those mappings where the values (or the range) for the variables in * VarList* are equivalent.

*is a set of 3-tuples (*

**FuncAndVars***,*

**f***,*

**Vin***) where*

**Vout***is one of a set of*

**f***aggregate functions*. Each aggregate function takes a set of RDF terms (Literals only?) and return a numeric value as an RDF literal.

*is the variable whose bindings are picked from the mappings in a partition and used as input to the aggregate functions.*

**Vin***is the variable that will be used to bind the result of the aggregate function call.*

**Vout***is a SPARQL multiset of solution mappings, one of which is*

**Omega****Mu**

The intention is for **Aggregation** to return a solution mapping whose domain is all the variables (**Vout**) that are either:

- arguments of each aggregate function in the SELECT clause
- one of the variables in the GROUP BY expression

The range is all the results from applying the function to a set of numeric values. **Vin** is used to create this set from the values bound to it amongst the solution mappings in **Omega** that are compatible with *Mu*.

Another operator **Partitions** is introduced which returns a *set* (so duplicates are not allowed) of mappings

(PartitionsVarList,Omega)

**Partitions** returns a set of solution mappings restricted to the variables in **VarList**, forming partitions within **Omega** where the bindings for the given variables are the same.

Finally, the **Groups** operator is added which also returns a set of solution mappings.

(GroupsVarList,FuncAndVars,Omega)

It returns a set of solution mappings starting with the partitions formed using **VarList** against **Omega**. Each solution mapping in the partition is extended with the result of calling **Aggregation** using the given arguments ( **VarList**, **FuncAndVars**, and **Omega** ) and the current partition solution mapping

### Draft Formalization

Single valued function: returns a solution projected down to named variables only:

key(varlist, μ) = { (v,x) | (v,x) in μ, v in varlist }

and the set of all keys:

key(varlist, Ω) = { k | μ in Ω, k=key(varlist,μ) }

The partition of the multiset Ω is:

Partition(varlist, Ω) = { (k,µ) | µ in Ω, k=key(varlist, µ) }

*First Version*

This is the version in the email. It allows for multiple aggregations over the same groupings but needs to introduce `FuncAndVars`.

Let `agg(VarList,Ω')` be the aggregation function run on a multiset of solutions Ω', taking variables VarList. Then the aggregation is the set of solutions defined by each key in the multiset Ω being evaluated:

(this is the version in the original email)

Aggregation(VarList, FuncAndVars, Ω) = { merge(k, (Vout, agg(Vin,X)) | (k, X) in Partition(varlist, Ω), FuncAndVars= set (f, Vin, Vout), agg in set f }

*Simplified version*

Simple form: single aggregator. Aggregation function `agg(key, X)` which produces a single value for each key and partition for that key `(key, X)`.

Aggregation(VarList, Var, agg, Ω) = { merge(k, (Var, agg(k, Ω')) | (k, Ω') in Partition(varlist, Ω), Var is a variable }

`Var` is not mentioned in the pattern used to determine Ω. Generalize this form to multiple aggregations as per first version.

## Aggregate functions

.. Common restrictions that each aggregate function needs to adhere to ..