Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations. The Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows or other multi-dimensional data sets.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This publication transitions previous work on this subject onto the W3C Recommendation Track.
This document was published by the Government Linked Data Working Group as a Last Call Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-gld-comments@w3.org (subscribe, archives). The Last Call period ends 08 April 2013. All comments are welcome.
Publication as a Last Call Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This is a Last Call Working Draft and thus the Working Group has determined that this document has satisfied the relevant technical requirements and is sufficiently stable to advance through the Technical Recommendation process.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Classes: qb:Attachable qb:AttributeProperty qb:CodedProperty qb:ComponentProperty qb:ComponentSet qb:ComponentSpecification qb:DataSet qb:DataStructureDefinition qb:DimensionProperty qb:HierarchicalCodeList qb:MeasureProperty qb:Observation qb:Slice qb:ObservationGroup qb:SliceKey
Properties: qb:attribute qb:codeList qb:component qb:componentAttachment qb:componentProperty qb:componentRequired qb:concept qb:dataSet qb:dimension qb:hierarchyRoot qb:measure qb:measureDimension qb:measureType qb:observation qb:observationGroup qb:order qb:parentChildProperty qb:slice qb:sliceKey qb:sliceStructure qb:structure
This section is non-normative.
Statistical data is a foundation for policy prediction, planning and adjustments and underpins many of the mash-ups and visualisations we see on the web. There is strong interest in being able to publish statistical data in a web-friendly format to enable it to be linked and combined with related information.
At the heart of a statistical dataset is a set of observed values organized along a group of dimensions, together with associated metadata. The Data Cube vocabulary enables such information to be represented using the W3C RDF (Resource Description Framework) standard and published following the principles of linked data. The vocabulary is based upon the approach used by the SDMX ISO standard for statistical data exchange. This cube model is very general and so the Data Cube vocabulary can be used for other data sets such as survey data, spreadsheets and OLAP data cubes [OLAP].
The Data Cube vocabulary is focused purely on the publication of multi-dimensional data on the web. We envisage a series of modular vocabularies being developed which extend this core foundation. In particular, we see the need for an SDMX extension vocabulary to support the publication of additional context to statistical data (such as the encompassing Data Flows and associated Provision Agreements). Other extensions are possible to support metadata for surveys (so called "micro-data", as encompassed by DDI) or publication of statistical reference metadata.
The Data Cube in turn builds upon the following existing RDF vocabularies:
This section is non-normative.
Linked data is an approach to publishing data on the web, enabling datasets to be linked together through references to common concepts. The approach [LOD] recommends use of HTTP URIs to name the entities and concepts so that consumers of the data can look up those URIs to get more information, including links to other related URIs. RDF [RDF-PRIMER] provides a standard for the representation of the information that describes those entities and concepts, and is returned by dereferencing the URIs.
There are a number of benefits to being able to publish multi-dimensional data, such as statistics, using RDF and the linked data approach:
The Statistical Data and Metadata Exchange (SDMX) Initiative was organised in 2001 by seven international organisations (BIS, ECB, Eurostat, IMF, OECD, World Bank and the UN) to realise greater efficiencies in statistical practice. These organisations all collect significant amounts of data, mostly from the national level, to support policy. They also disseminate data at the supra-national and international levels.
There have been a number of important results from this work: two versions of a set of technical specifications - ISO:TS 17369 (SDMX) - and the release of several recommendations for structuring and harmonising cross-domain statistics, the SDMX Content-Oriented Guidelines. All of the products are available at www.sdmx.org. The standards are now being widely adopted around the world for the collection, exchange, processing, and dissemination of aggregate statistics by official statistical organisations. The UN Statistical Commission recommended SDMX as the preferred standard for statistics in 2007.
The SDMX specification defines a core information model which is reflected in concrete form in two syntaxes - SDMX-ML (an XML syntax) and SDMX-EDI.
The RDF Data Cube vocabulary builds upon the core of the SDMX 2.0 Information Model [SDMX20].
A key component of the SDMX standards package is the set of Content-Oriented Guidelines (COGs): cross-domain concepts, code lists, and categories that support interoperability and comparability between datasets by providing a shared terminology between SDMX implementers [COG]. RDF versions of these terms are available separately for use along with the Data Cube vocabulary; see Content oriented guidelines.
This document describes the Data Cube vocabulary. It is aimed at people wishing to publish statistical or other multi-dimensional data in RDF. Mechanics of cross-format translation from other formats such as SDMX-ML are not covered here.
The names of RDF entities -- classes, predicates, individuals -- are URIs. These are usually expressed using a compact notation where the name is written prefix:localname, and where the prefix identifies a namespace URI. The namespace identified by the prefix is prepended to the localname to obtain the full URI.
The following namespaces are used in this document:
Prefix | Namespace | Reference |
---|---|---|
qb | http://purl.org/linked-data/cube# | This document |
skos | http://www.w3.org/2004/02/skos/core# | [SKOS-REFERENCE] |
scovo | http://purl.org/NET/scovo# | [SCOVO] |
void | http://rdfs.org/ns/void# | [VOID] |
foaf | http://xmlns.com/foaf/0.1/ | [FOAF] |
org | http://www.w3.org/ns/org# | [ORG] |
dct | http://purl.org/dc/terms/ | [DC11] |
owl | http://www.w3.org/2002/07/owl# | [OWL2-PRIMER] |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | [RDF-CONCEPTS] |
rdfs | http://www.w3.org/2000/01/rdf-schema# | [RDF-SCHEMA] |
admingeo | http://data.ordnancesurvey.co.uk/ontology/admingeo/ | (Non-normative, used for examples only) |
eg | http://example.org/ns# | (Non-normative, used for examples only) |
All RDF examples are written in Turtle syntax [TURTLE-TR].
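As a small illustration of the compact notation described above, the following sketch writes the same triple twice, once with prefixes and once with every URI spelled out. The resource name eg:dataset-example is hypothetical and is used only for this illustration.

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix eg: <http://example.org/ns#> .

# Compact form: each prefix stands for the namespace URI declared above,
# and "a" is Turtle shorthand for rdf:type.
eg:dataset-example a qb:DataSet .

# Expanded form of the same triple, with every URI written out in full.
<http://example.org/ns#dataset-example>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    <http://purl.org/linked-data/cube#DataSet> .
```

The remaining examples in this document omit the prefix declarations for brevity and assume the namespaces listed in the table above.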
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words must, must not, required, should, should not, recommended, may, and optional in this specification are to be interpreted as described in [RFC2119].
A data interchange, however that interchange occurs, is conformant with Data Cube if:
A conforming data interchange:
This section is non-normative.
This section is non-normative.
A statistical data set comprises a collection of observations made at some points across some logical space. The collection can be characterized by a set of dimensions that define what the observation applies to (e.g. time, area, gender) along with metadata describing what has been measured (e.g. economic activity, population), how it was measured and how the observations are expressed (e.g. units, multipliers, status). We can think of the statistical data set as a multi-dimensional space, or hyper-cube, indexed by those dimensions. This space is commonly referred to as a cube for short; though the name shouldn't be taken literally, it is not meant to imply that there are exactly three dimensions (there can be more or fewer) nor that all the dimensions are somehow similar in size.
A cube is organized according to a set of dimensions, attributes and measures. We collectively call these components.
The dimension components serve to identify the observations. A set of values for all the dimension components is sufficient to identify a single observation. Examples of dimensions include the time to which the observation applies, or a geographic region which the observation covers.
The measure components represent the phenomenon being observed.
The attribute components allow us to qualify and interpret the observed value(s). They enable specification of the units of measure, any scaling factors and metadata such as the status of the observation (e.g. estimated, provisional).
This section is non-normative.
It is frequently useful to group subsets of observations within a dataset. In particular, it is useful to fix all but one (or a small subset) of the dimensions and to refer to all observations with those dimension values as a single entity. We call such a selection a slice through the cube. For example, given a data set on regional performance indicators, we might group all the observations about a given indicator and a given region into a slice; each slice would then represent a time series of observed values.
A data publisher may identify slices through the data for various purposes. They can be a useful grouping to which metadata might be attached, for example to note a change in measurement process which affects a particular time or region. Slices also enable the publisher to identify and label particular subsets of the data which should be presented to the user - they can enable the consuming application to more easily construct the appropriate graph or chart for presentation.
In statistical applications it is common to work with slices in which a single dimension is left unspecified. In particular, slices in which the single free dimension is time are referred to as Time Series, and slices along non-time dimensions are referred to as Sections. Within the Data Cube vocabulary we allow arbitrary dimensionality slices and do not give different names to particular types of slice. Such sub-classes of slice could be added in extension vocabularies.
This section is non-normative.
In order to illustrate the use of the data cube vocabulary we will use a small demonstration data set extracted from StatsWales report number 003311 which describes life expectancy broken down by region (unitary authority), age and time. The extract we will use is:
Region | 2004-2006 Male | 2004-2006 Female | 2005-2007 Male | 2005-2007 Female | 2006-2008 Male | 2006-2008 Female |
---|---|---|---|---|---|---|
Newport | 76.7 | 80.7 | 77.1 | 80.9 | 77.0 | 81.5 |
Cardiff | 78.7 | 83.3 | 78.6 | 83.7 | 78.7 | 83.4 |
Monmouthshire | 76.6 | 81.3 | 76.5 | 81.5 | 76.6 | 81.7 |
Merthyr Tydfil | 75.5 | 79.1 | 75.5 | 79.4 | 74.9 | 79.6 |
We can see that there are three dimensions - time period (rolling averages over three year timespans), region and sex. Each observation represents the life expectancy for that population (the measure) and we will need an attribute to define the units (years) of the measured values.
An example of slicing the data would be to define slices in which the time and sex are fixed for each slice. Such slices then show the variation in life expectancy across the different regions, i.e. corresponding to the columns in the above tabular layout.
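A minimal sketch of how one such slice might be written down, using the qb:SliceKey, qb:Slice, qb:sliceStructure, qb:componentProperty and qb:observation terms of the vocabulary. The names eg:sliceByRegion and eg:slice-2004-male are hypothetical, and eg:refPeriod and the observation URIs eg:o1 to eg:o3 are placeholders for dimension properties and observations that a publisher would define elsewhere.

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix rdfs:           <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-code:      <http://purl.org/linked-data/sdmx/2009/code#> .
@prefix eg:             <http://example.org/ns#> .

# Slice key: names the dimensions that are fixed in every slice of this kind.
eg:sliceByRegion a qb:SliceKey ;
    rdfs:label "slice by region"@en ;
    qb:componentProperty eg:refPeriod, sdmx-dimension:sex .

# One slice: male life expectancy for 2004-2006.
# Time and sex are fixed on the slice, so only the region varies
# between the observations it groups.
eg:slice-2004-male a qb:Slice ;
    qb:sliceStructure eg:sliceByRegion ;
    eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex sdmx-code:sex-M ;
    qb:observation eg:o1, eg:o2, eg:o3 .
```

This slice corresponds to the first column of the table; a full publication would list one observation per region.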
A qb:DataStructureDefinition defines the structure of one or more datasets. In particular, it defines the dimensions, attributes and measures used in the dataset along with qualifying information such as ordering of dimensions and whether attributes are required or optional. For well-formed data sets much of this information is implicit within the RDF component properties found on the observations. However, the explicit declaration of the structure has several benefits:
It is common, when publishing statistical data, to have a regular series of publications which all follow the same structure. The notion of a Data Structure Definition (DSD) allows us to define that structure once and then reuse it for each publication in the series. Consumers can then be confident that the structure of the data has not changed.
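The reuse described above might look like the following sketch, in which two publications in a series point at the same structure through qb:structure. All eg: names are hypothetical and the component list of the DSD is left out.

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix eg:   <http://example.org/ns#> .

# A single structure definition, declared once (components omitted here).
eg:dsd-series a qb:DataStructureDefinition ;
    rdfs:label "structure shared by the whole series"@en .

# Each publication in the series refers to the same definition.
eg:dataset-2012 a qb:DataSet ;
    rdfs:label "2012 release"@en ;
    qb:structure eg:dsd-series .

eg:dataset-2013 a qb:DataSet ;
    rdfs:label "2013 release"@en ;
    qb:structure eg:dsd-series .
```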
The Data Cube vocabulary represents the dimensions, attributes and measures as RDF properties. Each is an instance of the abstract qb:ComponentProperty class, which in turn has sub-classes qb:DimensionProperty, qb:AttributeProperty and qb:MeasureProperty.
A component property encapsulates several pieces of information:
The same concept can be manifested in different components. For example, the concept of currency may be used as a dimension (in a data set dealing with exchange rates) or as an attribute (when describing the currency in which an observed trade took place). The concept of time is typically used only as a dimension but may be encoded as a data value (e.g. an xsd:dateTime) or as a symbolic value (e.g. a URI drawn from the reference time URI set developed by data.gov.uk). In statistical agencies it is common to have a standard thesaurus of statistical concepts which underpin the components used in multiple different data sets.
To support this reuse of general statistical concepts the data cube vocabulary provides the qb:concept property which links a qb:ComponentProperty to the concept it represents. We use the SKOS vocabulary [SKOS-PRIMER] to represent such concepts. This is very natural for those cases where the concepts are already maintained as a controlled term list or thesaurus.
When developing a data structure definition for an informal data set there may not be an appropriate concept already. In those cases, if the concept is likely to be reused in other guises it is recommended to publish a skos:Concept along with the specific qb:ComponentProperty. However, if such reuse is not expected then it is not required to do so - the qb:concept link is optional and a simple instance of the appropriate subclass of qb:ComponentProperty is sufficient.
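The two options just described can be sketched as follows. All eg: names are hypothetical: the first component publishes a reusable concept alongside the property, the second is a bare component property with no concept link.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix eg:   <http://example.org/ns#> .

# Option 1: the concept is expected to be reused, so it is published
# as a skos:Concept and linked from the component via qb:concept.
eg:concept-householdSize a skos:Concept ;
    skos:prefLabel "household size"@en .

eg:householdSize a rdf:Property, qb:DimensionProperty ;
    rdfs:label "household size"@en ;
    qb:concept eg:concept-householdSize .

# Option 2: no reuse is expected, so a simple instance of the
# appropriate subclass of qb:ComponentProperty is enough.
eg:surveyWave a rdf:Property, qb:DimensionProperty ;
    rdfs:label "survey wave"@en .
```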
The representation of the possible values of the component is described using the rdfs:range property of the component in the usual RDF manner. Thus, for example, values of a time dimension might be represented using literals of type xsd:dateTime or as URIs drawn from a time reference service. In statistical data sets it is common for values to be encoded using some (possibly hierarchical) code list and it can be useful to be able to easily identify the overall code list in some more structured form. To cater for this a component can also be optionally annotated with a qb:codeList to indicate a set of skos:Concepts which may be used as codes. The qb:codeList value may be a skos:ConceptScheme, skos:Collection or qb:HierarchicalCodeList.
In such a case the rdfs:range of the component might be left as simply skos:Concept but a useful design pattern is to also define an rdfs:Class whose members are all the skos:Concepts within a particular scheme. In that way the rdfs:range can be made more specific which enables generic RDF tools to perform appropriate range checking.
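A sketch of this design pattern, under the assumption of a small hypothetical code list for housing tenure (none of the eg: names below come from this specification). The dimension points at the concept scheme through qb:codeList and narrows its rdfs:range to a class whose members are exactly the codes in that scheme.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix eg:   <http://example.org/ns#> .

# The code list itself, as a SKOS concept scheme.
eg:tenure-scheme a skos:ConceptScheme ;
    skos:prefLabel "housing tenure code list"@en .

# A class whose members are all the concepts in the scheme.
eg:Tenure a rdfs:Class ;
    rdfs:subClassOf skos:Concept ;
    rdfs:label "housing tenure codes"@en .

# Each code is both a skos:Concept in the scheme and a member of the class.
eg:tenure-owned a skos:Concept, eg:Tenure ;
    skos:prefLabel "owner occupied"@en ;
    skos:inScheme eg:tenure-scheme .

eg:tenure-rented a skos:Concept, eg:Tenure ;
    skos:prefLabel "rented"@en ;
    skos:inScheme eg:tenure-scheme .

# The dimension names the code list and uses the specific class as its range.
eg:tenure a rdf:Property, qb:DimensionProperty ;
    rdfs:label "housing tenure"@en ;
    qb:codeList eg:tenure-scheme ;
    rdfs:range eg:Tenure .
```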
Note that in any SDMX extension vocabulary there would be one further item of information to encode about components - the role that they play within the structure definition. In particular, it is sometimes convenient for consumers to be able to easily identify which is the time dimension, which component is the primary measure and so forth. It turns out that such roles are intrinsic to the concepts and so this information can be encoded by providing subclasses of skos:Concept for each role. The particular choice of roles here is specific to the SDMX standard and so is not included within the core Data Cube vocabulary.
Before illustrating the components needed for our running example, there is one more piece of machinery to introduce, a reusable set of concepts and components based on SDMX.
This section is non-normative.
The SDMX standard includes a set of content oriented guidelines (COG) [COG] which define a set of common statistical concepts and associated code lists that are intended to be reusable across data sets. A community group has developed RDF encodings of these guidelines. These comprise:
Prefix | Namespace | Description |
---|---|---|
sdmx-concept | http://purl.org/linked-data/sdmx/2009/concept# | SKOS Concepts for each COG defined concept |
sdmx-code | http://purl.org/linked-data/sdmx/2009/code# | SKOS Concepts and ConceptSchemes for each COG defined code list |
sdmx-dimension | http://purl.org/linked-data/sdmx/2009/dimension# | component properties corresponding to each COG concept that can be used as a dimension |
sdmx-attribute | http://purl.org/linked-data/sdmx/2009/attribute# | component properties corresponding to each COG concept that can be used as an attribute |
sdmx-measure | http://purl.org/linked-data/sdmx/2009/measure# | component properties corresponding to each COG concept that can be used as a measure |
These resources are provided as a convenience and do not form part of the Data Cube standard at this time. However, they are used by a number of existing Data Cube publications and so we will reference them within our worked examples.
This section is non-normative.
Turning to our example data set then we can see there are three dimensions to represent - time period, region (unitary authority) and sex of the population. There is a single (primary) measure which corresponds to the topic of the data set (life expectancy) and encodes a value in years. Hence, we need the following components.
Time. There is a suitable predefined concept in the SDMX-COG for this, REF_PERIOD, so we could reuse the corresponding component property sdmx-dimension:refPeriod. However, to represent the time period itself it would be convenient to use the data.gov.uk reference time service and to declare this within the data structure definition.

    eg:refPeriod a rdf:Property, qb:DimensionProperty;
        rdfs:label "reference period"@en;
        rdfs:subPropertyOf sdmx-dimension:refPeriod;
        rdfs:range interval:Interval;
        qb:concept sdmx-concept:refPeriod .

Region. Again there is a suitable COG concept and associated component that we can use for this, and again we can customize the range of the component. In this case we can use the Ordnance Survey Administrative Geography Ontology [OS-GEO].

    eg:refArea a rdf:Property, qb:DimensionProperty;
        rdfs:label "reference area"@en;
        rdfs:subPropertyOf sdmx-dimension:refArea;
        rdfs:range admingeo:UnitaryAuthority;
        qb:concept sdmx-concept:refArea .

Sex. In this case we can use the corresponding COG component sdmx-dimension:sex directly, since the default code list for it includes the terms we need.
Measure. This property will give the value of each observation. We could use the default sdmx-measure:obsValue for this (defining the topic being observed using metadata). However, it can aid readability and processing of the RDF data sets to use a specific measure corresponding to the phenomenon being observed.

    eg:lifeExpectancy a rdf:Property, qb:MeasureProperty;
        rdfs:label "life expectancy"@en;
        rdfs:subPropertyOf sdmx-measure:obsValue;
        rdfs:range xsd:decimal .

Unit measure attribute. The primary measure on its own is a plain decimal value. To correctly interpret this value we need to define what units it is measured in (years in this case). This is defined using attributes which qualify the interpretation of the observed value. Specifically in this example we can use the predefined sdmx-attribute:unitMeasure which in turn corresponds to the COG concept of UNIT_MEASURE. To express the value of this attribute we would typically use a common thesaurus of units of measure. For the sake of this simple example we will use the DBpedia resource http://dbpedia.org/resource/Year which corresponds to the topic of the Wikipedia page on "Years".
This covers the minimal components needed to define the structure of this data set.
To combine the components into a specification for the structure of this dataset we need to declare a qb:DataStructureDefinition resource which in turn will reference a set of qb:ComponentSpecification resources. The qb:DataStructureDefinition will be reusable across other data sets with the same structure.
In the simplest case the qb:ComponentSpecification simply references the corresponding qb:ComponentProperty (usually using one of the sub-properties qb:dimension, qb:measure or qb:attribute). However, it is also possible to qualify the component specification in several ways.
An attribute can be declared as required or optional using qb:componentRequired. In the absence of such a declaration an attribute is assumed to be optional. The qb:componentRequired declaration may only be applied to component specifications of attributes - measures and dimensions are always required.
The components can be given an order using qb:order. This order carries no semantics but can be useful to aid consuming agents in generating appropriate user interfaces. It can also be useful in the publication chain to enable synthesis of appropriate URIs for observations.
A component can be attached at a level other than the individual observation. In that case the qb:componentAttachment property of the specification should reference the class corresponding to the attachment level (e.g. qb:DataSet for attributes that will be attached to the overall data set).
In the case of our running example the dimensions can be usefully ordered. There is only one attribute, the unit measure, and this is required. In the interest of illustrating the vocabulary use we will declare that this attribute will be attached at the level of the data set, although normalized representations are in general easier to query and combine.
So the structure of our example data set (and other similar datasets) can be declared by:

    eg:dsd-le a qb:DataStructureDefinition;
        # The dimensions
        qb:component [qb:dimension eg:refArea;         qb:order 1];
        qb:component [qb:dimension eg:refPeriod;       qb:order 2];
        qb:component [qb:dimension sdmx-dimension:sex; qb:order 3];
        # The measure(s)
        qb:component [qb:measure eg:lifeExpectancy];
        # The attributes
        qb:component [qb:attribute sdmx-attribute:unitMeasure;
                      qb:componentRequired "true"^^xsd:boolean;
                      qb:componentAttachment qb:DataSet;] .

Note that we have given the data structure definition (DSD) a URI since it will be reused across different datasets with the same structure. Similarly the component properties themselves can be reused across different DSDs. However, the component specifications are only useful within the scope of a particular DSD and so we have chosen to represent them using blank nodes.
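For contrast, here is a sketch of a component specification for an optional attribute. The attribute eg:obsNote and the structure eg:dsd-le-notes are hypothetical and are not part of the running example; eg:refArea and eg:lifeExpectancy are the components defined earlier. Since no qb:componentAttachment is given, the attribute would be attached to individual observations.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix eg:   <http://example.org/ns#> .

# Hypothetical free-text attribute for annotating single observations.
eg:obsNote a rdf:Property, qb:AttributeProperty ;
    rdfs:label "observation note"@en .

eg:dsd-le-notes a qb:DataStructureDefinition ;
    qb:component [ qb:dimension eg:refArea ; qb:order 1 ] ;
    qb:component [ qb:measure eg:lifeExpectancy ] ;
    # Explicitly optional: observations may omit eg:obsNote.
    qb:component [ qb:attribute eg:obsNote ;
                   qb:componentRequired "false"^^xsd:boolean ] .
```

Stating "false" explicitly is equivalent to omitting qb:componentRequired, because attributes are assumed to be optional by default; the explicit form simply documents the intent.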
Our example data set is relatively simple in having a single observable (in this case "life expectancy") that is being measured. In other data sets there can be multiple measures. These measures may be of similar nature (e.g. a data set on local government performance might provide multiple different performance indicators for each region) or quite different (e.g. a data set on trades might provide quantity, value, weight for each trade).
There are two approaches to representing multiple measures. In the SDMX information model, each observation can record a single observed value. In a data set with multiple measures we then add an additional dimension whose value indicates the measure. This is appropriate for applications where the measures are separate aggregate statistics. In other domains, such as clinical statistics or sensor networks, the term observation usually denotes an observation event which can include multiple observed values. Similarly in Business Intelligence applications and OLAP, a single "cell" in the data cube will typically contain values for multiple measures.
The data cube vocabulary permits either representation approach to be used though they cannot be mixed within the same data set.
Both representation approaches require that, for every point in the space of dimensions for which there is an observation, then a value must be given for every measure. In the case of multi-measure observations then each measure must be present on each observation. In cubes which use a measure dimension then there are sets of observations for each populated point in the cube and within each of those sets there must be an observation giving each measure.
This approach allows multiple observed values to be attached to an individual observation. It is suited to representation of things like sensor data and OLAP cubes. To use this representation you simply declare multiple qb:MeasureProperty components in the data structure definition and attach an instance of each property to the observations within the data set.
For example, if we have a set of shipment data containing unit count and total weight for each shipment then we might have a data structure definition such as:

    eg:dsd1 a qb:DataStructureDefinition;
        rdfs:comment "shipments by time (multiple measures approach)"@en;
        qb:component
            [ qb:dimension sdmx-dimension:refTime; ],
            [ qb:measure eg-measure:quantity; ],
            [ qb:measure eg-measure:weight; ] .

This would correspond to individual observations such as:

    eg:dataset1 a qb:DataSet;
        qb:structure eg:dsd1 .

    eg:obs1a a qb:Observation;
        qb:dataSet eg:dataset1;
        sdmx-dimension:refTime "2010-07-30"^^xsd:date;
        eg-measure:weight 1.3 ;
        eg-measure:quantity 42 ;
        .

Note that one limitation of the multi-measure approach is that it is not possible to attach an attribute to a single observed value. An attribute attached to the observation instance will apply to the whole observation (e.g. to indicate who made the observation). Attributes can also be attached directly to the qb:MeasureProperty itself (e.g. to indicate the unit of measure for that measure) but that attachment applies to the whole data set (indeed any data set using that measure property) and cannot vary for different observations. For applications where this limitation is a problem then use the measure dimension approach.
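Attaching an attribute to the measure property itself might be sketched as follows. The structure name eg:dsd1-units, the eg-measure namespace URI and the choice of the DBpedia Kilogram resource as the unit are assumptions made for illustration only.

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix eg:             <http://example.org/ns#> .
# Hypothetical namespace for the example measures.
@prefix eg-measure:     <http://example.org/measure#> .

eg:dsd1-units a qb:DataStructureDefinition ;
    qb:component
        [ qb:dimension sdmx-dimension:refTime ],
        [ qb:measure eg-measure:quantity ],
        [ qb:measure eg-measure:weight ],
        # The unit attribute is declared as attached to measure properties
        # rather than to individual observations.
        [ qb:attribute sdmx-attribute:unitMeasure ;
          qb:componentAttachment qb:MeasureProperty ] .

# The unit is stated once, on the measure property itself, and so applies
# to every observation (in any data set) that uses this measure.
eg-measure:weight sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Kilogram> .
```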
This approach restricts observations to having a single measured value but allows a data set to carry multiple measures by adding an extra dimension, a measure dimension. The value of the measure dimension denotes which particular measure is being conveyed by the observation. This is the representation approach used within SDMX and the SDMX-in-RDF extension vocabulary introduces a subclass of qb:DataStructureDefinition which is restricted to using the measure dimension representation.
To use this representation you declare an additional dimension within the data structure definition to play the role of the measure dimension. For use within the Data Cube vocabulary we provide a single distinguished component for this purpose -- qb:measureType. An extension vocabulary could generalize this through the provision of roles to identify concepts which act as measure types, enabling other measure dimensions to be declared.
In the special case of using qb:measureType as the measure dimension, the set of allowed measures is assumed to be those measures declared within the DSD. There is no need to define a separate code list or enumerated class to duplicate this information. Thus, qb:measureType is a “magic” dimension property with an implicit code list.
The data structure definition for our above example, using this representation approach, would then be:

    eg:dsd2 a qb:DataStructureDefinition;
        rdfs:comment "shipments by time (measure dimension approach)"@en;
        qb:component
            [ qb:dimension sdmx-dimension:refTime; ],
            [ qb:measure eg-measure:quantity; ],
            [ qb:measure eg-measure:weight; ],
            [ qb:dimension qb:measureType; ] .

This would correspond to individual observations such as:

    eg:dataset2 a qb:DataSet;
        qb:structure eg:dsd2 .

    eg:obs2a a qb:Observation;
        qb:dataSet eg:dataset2;
        sdmx-dimension:refTime "2010-07-30"^^xsd:date;
        qb:measureType eg-measure:weight ;
        eg-measure:weight 1.3 .

    eg:obs2b a qb:Observation;
        qb:dataSet eg:dataset2;
        sdmx-dimension:refTime "2010-07-30"^^xsd:date;
        qb:measureType eg-measure:quantity ;
        eg-measure:quantity 42 .

Note the duplication of having the measure property show up both as the property that carries the measured value, and as the value of the measure dimension. We accept this duplication as necessary to ensure the uniform cube/dimension mechanism and a uniform way of declaring and using measure properties on all kinds of datasets.
Those familiar with SDMX should also note that in the RDF representation there is no need for a separate "primary measure" which subsumes each of the individual measures; those individual measures are used directly. Extension vocabularies could address the round-tripping of the SDMX primary measure by use of a separate annotation on the data structure definition.
A DataSet is a collection of statistical data that corresponds to a given data structure definition. The data in a data set can be roughly described as belonging to one of the following kinds:
A resource representing the entire data set is created and typed as qb:DataSet and linked to the corresponding data structure definition via the qb:structure property.
Pitfall: Note the capitalization of qb:DataSet, which differs from the capitalization in other vocabularies, such as void:Dataset and dcat:Dataset. This unusual capitalization is chosen for compatibility with the SDMX standard. The same applies to the related property qb:dataSet.
Each observation is represented as an instance of type qb:Observation. In the basic case then values for each of the attributes, dimensions and measurements are attached directly to the observation (remember that these components are all RDF properties). The observation is linked to the containing data set using the qb:dataSet property.
Thus for our running example we might expect to have:

    eg:dataset-le1 a qb:DataSet;
        rdfs:label "Life expectancy"@en;
        rdfs:comment "Life expectancy within Welsh Unitary authorities - extracted from Stats Wales"@en;
        qb:structure eg:dsd-le ;
        .

    eg:o1 a qb:Observation;
        qb:dataSet eg:dataset-le1 ;
        eg:refArea ex-geo:newport_00pr ;
        eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
        sdmx-dimension:sex sdmx-code:sex-M ;
        sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
        eg:lifeExpectancy 76.7 ;
        .

    eg:o2 a qb:Observation;
        qb:dataSet eg:dataset-le1 ;
        eg:refArea ex-geo:cardiff_00pt ;
        eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
        sdmx-dimension:sex sdmx-code:sex-M ;
        sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
        eg:lifeExpectancy 78.7 ;
        .

    eg:o3 a qb:Observation;
        qb:dataSet eg:dataset-le1 ;
        eg:refArea ex-geo:monmouthshire_00pp ;
        eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
        sdmx-dimension:sex sdmx-code:sex-M ;
        sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
        eg:lifeExpectancy 76.6 ;
        .

    ...

This normalized structure makes it easy to query and combine data sets, but there is some redundancy here. For example, the unit of measure for the life expectancy is uniform across the whole data set and does not change between observations. To cater for situations like this the Data Cube vocabulary allows components to be attached at a higher level in the nested structure. Indeed, if we re-examine our original data structure definition we see that we declared the unit of measure to be attached at the data set level. So an improved version of the example is:
eg:dataset-le1 a qb:DataSet;
    rdfs:label "Life expectancy"@en;
    rdfs:comment "Life expectancy within Welsh Unitary authorities - extracted from Stats Wales"@en;
    qb:structure eg:dsd-le ;
    sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
    .

eg:o1 a qb:Observation;
    qb:dataSet eg:dataset-le1 ;
    eg:refArea ex-geo:newport_00pr ;
    eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex sdmx-code:sex-M ;
    eg:lifeExpectancy 76.7 ;
    .

eg:o2 a qb:Observation;
    qb:dataSet eg:dataset-le1 ;
    eg:refArea ex-geo:cardiff_00pt ;
    eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex sdmx-code:sex-M ;
    eg:lifeExpectancy 78.7 ;
    .

eg:o3 a qb:Observation;
    qb:dataSet eg:dataset-le1 ;
    eg:refArea ex-geo:monmouthshire_00pp ;
    eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex sdmx-code:sex-M ;
    eg:lifeExpectancy 76.6 ;
    .

...
In a data set containing just observations with no intervening structure, each observation must have a complete set of dimension values, along with all the measure values. If the set is structured by using slices then further abbreviation is possible, as discussed in the next section.
Slices allow us to group subsets of observations together. This is not intended to represent arbitrary selections from the observations but uniform slices through the cube in which one or more of the dimension values are fixed.
Slices may be used for a number of reasons:
To illustrate the use of slices let us group the sample data set into geographic series. That will enable us to refer to e.g. "male life expectancy observations for 2004-2006" and guide applications to present a comparative chart across regions.
We first define the structure of the slices we want by associating a "slice key" with the data structure definition. This is done by creating a qb:SliceKey which lists the component properties (which must be dimensions) which will be fixed in the slice. The key is attached to the DSD using qb:sliceKey. For example:
eg:sliceByRegion a qb:SliceKey;
    rdfs:label "slice by region"@en;
    rdfs:comment "Slice by grouping regions together, fixing sex and time values"@en;
    qb:componentProperty eg:refPeriod, sdmx-dimension:sex .

eg:dsd-le-slice1 a qb:DataStructureDefinition;
    qb:component
        [qb:dimension eg:refArea; qb:order 1],
        [qb:dimension eg:refPeriod; qb:order 2],
        [qb:dimension sdmx-dimension:sex; qb:order 3],
        [qb:measure eg:lifeExpectancy],
        [qb:attribute sdmx-attribute:unitMeasure; qb:componentAttachment qb:DataSet;] ;
    qb:sliceKey eg:sliceByRegion .
In the instance data, slices are represented by instances of qb:Slice which link to the observations in the slice via qb:observation and to the key by means of qb:sliceStructure. Data sets indicate the slices they contain by means of qb:slice. Thus in our example we would have:
eg:dataset-le2 a qb:DataSet;
    rdfs:label "Life expectancy"@en;
    rdfs:comment "Life expectancy within Welsh Unitary authorities - extracted from Stats Wales"@en;
    qb:structure eg:dsd-le-slice2 ;
    sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
    qb:slice eg:slice2;
    .

eg:slice2 a qb:Slice;
    qb:sliceStructure eg:sliceByRegion ;
    eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex sdmx-code:sex-M ;
    qb:observation eg:o1b, eg:o2b, eg:o3b, ... .

eg:o1b a qb:Observation;
    qb:dataSet eg:dataset-le2 ;
    eg:refArea ex-geo:newport_00pr ;
    eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex sdmx-code:sex-M ;
    eg:lifeExpectancy 76.7 ;
    .

eg:o2b a qb:Observation;
    qb:dataSet eg:dataset-le2 ;
    eg:refArea ex-geo:cardiff_00pt ;
    eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex sdmx-code:sex-M ;
    eg:lifeExpectancy 78.7 ;
    .

eg:o3b a qb:Observation;
    qb:dataSet eg:dataset-le2 ;
    eg:refArea ex-geo:monmouthshire_00pp ;
    eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex sdmx-code:sex-M ;
    eg:lifeExpectancy 76.6 ;
    .

...
Note that here we are still repeating the dimension values on the individual observations. This normalized representation means that a consuming application can still query for observed values uniformly without having to first parse the data structure definition and search for slice definitions. If it is desired, this redundancy can be reduced by declaring different attachment levels for the dimensions. For example:
eg:dsd-le-slice3 a qb:DataStructureDefinition;
    qb:component
        [qb:dimension eg:refArea; qb:order 1],
        [qb:dimension eg:refPeriod; qb:order 2; qb:componentAttachment qb:Slice],
        [qb:dimension sdmx-dimension:sex; qb:order 3; qb:componentAttachment qb:Slice],
        [qb:measure eg:lifeExpectancy],
        [qb:attribute sdmx-attribute:unitMeasure; qb:componentAttachment qb:DataSet;] ;
    qb:sliceKey eg:sliceByRegion .

eg:dataset-le3 a qb:DataSet;
    rdfs:label "Life expectancy"@en;
    rdfs:comment "Life expectancy within Welsh Unitary authorities - extracted from Stats Wales"@en;
    qb:structure eg:dsd-le-slice3 ;
    sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
    qb:slice eg:slice3 ;
    .

eg:slice3 a qb:Slice;
    qb:sliceStructure eg:sliceByRegion ;
    eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex sdmx-code:sex-M ;
    qb:observation eg:o1c, eg:o2c, eg:o3c, ... .

eg:o1c a qb:Observation;
    qb:dataSet eg:dataset-le3 ;
    eg:refArea ex-geo:newport_00pr ;
    eg:lifeExpectancy 76.7 ;
    .

eg:o2c a qb:Observation;
    qb:dataSet eg:dataset-le3 ;
    eg:refArea ex-geo:cardiff_00pt ;
    eg:lifeExpectancy 78.7 ;
    .

eg:o3c a qb:Observation;
    qb:dataSet eg:dataset-le3 ;
    eg:refArea ex-geo:monmouthshire_00pp ;
    eg:lifeExpectancy 76.6 ;
    .

...
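The grouping that a slice key describes can also be sketched outside RDF. The following Python fragment is an informal illustration only and is not part of the vocabulary. It assumes observations are modelled as plain dictionaries; the key names, the period labels and the helper group_into_slices are hypothetical, and one observation is invented so that a second slice appears.

```python
from collections import defaultdict

# Observations modelled as plain dicts (an assumption made for brevity).
observations = [
    {"refArea": "newport_00pr", "refPeriod": "2004-2006", "sex": "M", "lifeExpectancy": 76.7},
    {"refArea": "cardiff_00pt", "refPeriod": "2004-2006", "sex": "M", "lifeExpectancy": 78.7},
    {"refArea": "monmouthshire_00pp", "refPeriod": "2004-2006", "sex": "M", "lifeExpectancy": 76.6},
    # Hypothetical extra observation, invented so that a second slice appears.
    {"refArea": "newport_00pr", "refPeriod": "2005-2007", "sex": "M", "lifeExpectancy": 77.0},
]


def group_into_slices(observations, fixed_dimensions):
    """Group observations by the values of the dimensions fixed in a slice key."""
    slices = defaultdict(list)
    for obs in observations:
        key = tuple(obs[dim] for dim in fixed_dimensions)
        slices[key].append(obs)
    return dict(slices)


# Fix refPeriod and sex, as eg:sliceByRegion does; refArea remains free.
slices = group_into_slices(observations, ["refPeriod", "sex"])
for key, members in slices.items():
    print(key, [obs["refArea"] for obs in members])
```

Each resulting group corresponds to one qb:Slice: the tuple used as the key plays the role of the dimension values attached to the slice, and the members are the observations the slice would list via qb:observation.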
There are also situations in which a publisher wishes to group a set of observations together for ease of access or presentation purposes but where that set is not defined by simply fixing a set of dimension values. For example, in representing weather observations it can be desirable to group together the latest observation available from each station even though each observation may have been taken at a different time. For those situations the Data Cube vocabulary supports qb:ObservationGroup. A qb:ObservationGroup can contain an arbitrary collection of observations. A qb:Slice is a special case of a qb:ObservationGroup.
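To make the weather example concrete, the following Python fragment sketches how a publisher might select the members of such a group. It is illustrative only: the station names, timestamps and values are invented, and the helper latest_per_station is hypothetical rather than anything defined by the vocabulary.

```python
# Weather readings as (station, ISO 8601 timestamp, temperature) tuples.
# All values here are invented for illustration.
readings = [
    ("station-a", "2013-03-01T09:00:00", 4.5),
    ("station-a", "2013-03-01T12:00:00", 7.0),
    ("station-b", "2013-03-01T10:30:00", 5.2),
]


def latest_per_station(readings):
    """Return the most recent reading for each station, sorted by station."""
    latest = {}
    for station, timestamp, value in readings:
        # Timestamps in one ISO 8601 format and time zone sort lexicographically.
        if station not in latest or timestamp > latest[station][1]:
            latest[station] = (station, timestamp, value)
    return sorted(latest.values())


group = latest_per_station(readings)
print(group)
```

The selected readings carry different timestamps, so no choice of fixed dimension values would pick out exactly this set. That is the situation in which an observation group, rather than a slice, is the appropriate grouping.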
The values for dimensions within a data set must be unambiguously defined. They may be typed values (e.g. xsd:dateTime for time instances) or codes drawn from some code list. Similarly, many attributes used in data sets represent coded values from some controlled term list rather than free text descriptions. In the Data Cube vocabulary such codes are represented by URI references in the usual RDF fashion.
Sometimes appropriate URI sets already exist for the relevant dimensions (e.g. the representations of area and time periods in our running example). In other cases the data set being converted may use controlled terms from some scheme which does not yet have associated URIs. In those cases we recommend use of SKOS, representing the individual code values using skos:Concept and the overall set of admissible values using skos:ConceptScheme or skos:Collection.
We illustrate this with an example drawn from the translation of the SDMX COG code list for gender, as used already in our worked example. The relevant subset of this code list is:
sdmx-code:sex a skos:ConceptScheme;
    skos:prefLabel "Code list for Sex (SEX) - codelist scheme"@en;
    rdfs:label "Code list for Sex (SEX) - codelist scheme"@en;
    skos:notation "CL_SEX";
    skos:note "This code list provides the gender."@en;
    skos:definition <http://sdmx.org/wp-content/uploads/2009/01/02_sdmx_cog_annex_2_cl_2009.pdf> ;
    rdfs:seeAlso sdmx-code:Sex ;
    skos:hasTopConcept sdmx-code:sex-F, sdmx-code:sex-M .

sdmx-code:Sex a rdfs:Class, owl:Class;
    rdfs:subClassOf skos:Concept ;
    rdfs:label "Code list for Sex (SEX) - codelist class"@en;
    rdfs:comment "This code list provides the gender."@en;
    rdfs:seeAlso sdmx-code:sex .

sdmx-code:sex-F a skos:Concept, sdmx-code:Sex;
    skos:topConceptOf sdmx-code:sex;
    skos:prefLabel "Female"@en ;
    skos:notation "F" ;
    skos:inScheme sdmx-code:sex .

sdmx-code:sex-M a skos:Concept, sdmx-code:Sex;
    skos:topConceptOf sdmx-code:sex;
    skos:prefLabel "Male"@en ;
    skos:notation "M" ;
    skos:inScheme sdmx-code:sex .
skos:prefLabel is used to give a name to the code, skos:note gives a description and skos:notation can be used to record a short form code which might appear in other serializations. The SKOS specification [SKOS-REFERENCE] recommends the generation of a custom datatype for each use of skos:notation, but here the notation is not intended for use within RDF encodings; it merely documents the notation used in other representations (which do not use such a datatype).
It is convenient and good practice when developing a code list to also create a Class to denote all the codes within the code list, irrespective of hierarchical structure. This allows the range of a qb:ComponentProperty to be defined by using rdfs:range, which then permits standard RDF closed-world checkers to validate use of the code list without requiring custom SDMX-RDF-aware tooling. We do that in the above example by using the common convention that the class name is the same as that of the concept scheme but with leading upper case.
This code list can then be associated with a coded property, such as a dimension:
eg:sex a qb:DimensionProperty, qb:CodedProperty;
    qb:codeList sdmx-code:sex ;
    rdfs:range sdmx-code:Sex .

Explicitly declaring the code list using qb:codeList is not mandatory but can be helpful in those cases where a concept scheme has been defined.
In some cases code lists have a hierarchical structure. In particular, this is used in SDMX when the data cube includes aggregations of data values (e.g. aggregating a measure across geographic regions).
Hierarchical code lists should be represented using the skos:narrower relationship to link from the skos:hasTopConcept codes down through the tree or lattice of child codes. In some publishing tool chains the corresponding transitive closure skos:narrowerTransitive will be automatically inferred. The use of skos:narrower makes it possible to declare new concept schemes which extend an existing scheme by adding additional aggregation layers on top. All items are linked to the scheme via skos:inScheme.
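The relationship between skos:narrower and its transitive closure can be illustrated with a small Python sketch. This is an informal illustration under stated assumptions: the hierarchy is modelled as a dictionary from a parent code to its immediate children, the code names are invented, and narrower_transitive is a hypothetical helper rather than the behaviour of any particular tool chain.

```python
def narrower_transitive(narrower):
    """Compute, for each parent code, every code reachable via narrower links."""

    def descendants(concept, seen):
        for child in narrower.get(concept, ()):
            if child not in seen:  # guards against cycles in a lattice
                seen.add(child)
                descendants(child, seen)
        return seen

    return {concept: descendants(concept, set()) for concept in narrower}


# Hypothetical hierarchy: parent code -> immediate child codes.
narrower = {
    "uk": ["wales", "scotland"],
    "wales": ["cardiff", "newport"],
}

closure = narrower_transitive(narrower)
print(sorted(closure["uk"]))
```

Only the direct links need to be published; the closure can be derived from them when required, which is why the text above recommends asserting skos:narrower and leaving skos:narrowerTransitive to inference.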
It is sometimes convenient to be able to specify a hierarchical arrangement of concepts other than through the use of the SKOS relation skos:narrower. There are several situations where this is useful:
The Data Cube vocabulary supports this situation through the qb:HierarchicalCodeList class. An instance of qb:HierarchicalCodeList defines a set of root concepts in the hierarchy (qb:hierarchyRoot) and a parent-to-child relationship (qb:parentChildProperty) which links a term in the hierarchy to its immediate sub-terms.
Thus a qb:HierarchicalCodeList is similar to a skos:ConceptScheme in which qb:hierarchyRoot plays the same role as skos:hasTopConcept, and the value of qb:parentChildProperty plays the same role as skos:narrower. In the case where a code list is already available as a SKOS concept scheme or collection then those should be used directly; qb:HierarchicalCodeList is provided for cases where the terms are not available as SKOS but are available in some other RDF representation suitable for reuse.
For example, the Ordnance Survey of Great Britain publishes a geographic hierarchy which has eleven roots (European Regions such as Wales, Scotland, the South West) and uses a spatial relations ontology to define a containment hierarchy. This could be represented as a qb:HierarchicalCodeList using the following.
@prefix spatial: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/> .

eg:GBgeoHierarchy a qb:HierarchicalCodeList;
    rdfs:label "Geographic Hierarchy for Great Britain"@en;
    qb:hierarchyRoot
        <http://data.ordnancesurvey.co.uk/id/7000000000041427>, # South West
        <http://data.ordnancesurvey.co.uk/id/7000000000041426>, # West Midlands
        <http://data.ordnancesurvey.co.uk/id/7000000000041421>, # South East
        <http://data.ordnancesurvey.co.uk/id/7000000000041430>, # Yorkshire & the Humber
        <http://data.ordnancesurvey.co.uk/id/7000000000041423>, # East Midlands
        <http://data.ordnancesurvey.co.uk/id/7000000000041425>, # Eastern
        <http://data.ordnancesurvey.co.uk/id/7000000000041428>, # London
        <http://data.ordnancesurvey.co.uk/id/7000000000041431>, # North West
        <http://data.ordnancesurvey.co.uk/id/7000000000041422>, # North East
        <http://data.ordnancesurvey.co.uk/id/7000000000041424>, # Wales
        <http://data.ordnancesurvey.co.uk/id/7000000000041429>; # Scotland
    qb:parentChildProperty spatial:contains;
    .

eg:geoDimension a qb:DimensionProperty ;
    qb:codeList eg:GBgeoHierarchy .
Note that in some cases the hierarchy to be reused may only have a property relating child concepts to parent concepts. This situation is handled by declaring the qb:parentChildProperty to be the owl:inverseOf of the child-to-parent property. For example:
@prefix spatial: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/> .

eg:GBgeoHierarchy a qb:HierarchicalCodeList;
    qb:parentChildProperty [owl:inverseOf spatial:within] .
Future extensions of Data Cube may support additional sub classes of qb:HierarchicalCodeList, for example to declare hierarchies in which each parent is a disjoint union of its children.
The use of SKOS, or non-SKOS, hierarchies makes it possible to publish aggregated statistics for the non-leaf concepts in the hierarchy. The Data Cube vocabulary itself imposes no constraints on how such aggregation is done. Indeed in statistical applications the appropriate statistical corrections to make to aggregated values may be non-trivial and dependent on the data and precise analysis methodology. Even in simple, non-statistical, applications such as OLAP a number of different aggregation operators are commonly used.
Vocabulary terms to represent the aggregation operations employed within a given dataset, and how one dataset might be derived from another, are not supported in this version of the Data Cube specification. This area may be addressed by future extensions to Data Cube.
DataSets should be marked up with metadata to support discovery, presentation and processing. Dublin Core Terms [DC11] should be used for representing the key metadata annotations commonly needed for DataSets. The RDFS terms for display label (rdfs:label) and descriptive comment (rdfs:comment) should be given as well, for compatibility with earlier versions of Data Cube and common RDF practice.
The recommended core set of metadata terms is:
dct:title
rdfs:label - may be same as dct:title
dct:description
rdfs:comment - may be same as dct:description
dct:issued
dct:modified
dct:subject
dct:publisher
dct:license
Other documents, notably [DCAT], provide additional recommendations for metadata terms for data sets which may be used for describing Data Cube DataSets.
Publishers of statistics often categorize their data sets into different statistical domains, such as Education, Labour, or Transportation. We encourage use of dct:subject to record such a classification of a whole data set. The classification terms can include coarse grained classifications, such as the List of Subject-matter Domains from the SDMX Content-oriented Guidelines, and fine grained classifications to support discovery of data sets.
The classification schemes are typically represented using the SKOS vocabulary. For convenience the SDMX Subject-matter Domains have been encoded as a SKOS concept scheme at http://purl.org/linked-data/sdmx/2009/subject#.
Thus our sample dataset might be marked up by:
eg:dataset1 a qb:DataSet;
    rdfs:label "Life expectancy"@en;
    dct:title "Life expectancy"@en;
    rdfs:comment "Life expectancy within Welsh Unitary authorities - extracted from Stats Wales"@en;
    dct:description "Life expectancy within Welsh Unitary authorities - extracted from Stats Wales"@en;
    dct:issued "2010-08-11"^^xsd:date;
    dct:subject
        sdmx-subject:3.2 , # regional and small area statistics
        sdmx-subject:1.4 , # Health
        eg:Wales; # Wales
    ...

where eg:Wales is a skos:Concept drawn from an appropriate controlled vocabulary for places.
The organization that publishes a dataset should be recorded as part of the dataset metadata. Again we recommend use of the Dublin Core term dct:publisher for this. The organization should be represented as an instance of foaf:Agent, or some more specific subclass such as org:Organization [ORG].
eg:dataset1 a qb:DataSet;
    dct:publisher <http://example.com/meta#organization> .

<http://example.com/meta#organization> a org:Organization, foaf:Agent;
    rdfs:label "Example org" .
Extension vocabularies may provide additional metadata properties and may impose constraints on what metadata must be provided.
In normal form, the qb:Observations which make up a Data Cube have property values for each of the required dimensions, attributes and measures as declared in the associated data structure definition. This form for a Data Cube is termed normalized. It is a convenient format for querying data and makes it possible to write uniform queries which extract sets of observations, including from across multiple cubes. However, the verbosity of a fully normalized representation incurs overheads in transmission and storage of Data Cubes which may be problematic in some settings.
To address this the Data Cube vocabulary supports a notion of an abbreviated format in which component properties may be attached to other levels in the Data Cube. Specifically they may be attached to a qb:DataSet or qb:Slice. In those cases the attached property is taken to be applied to all the qb:Observation instances associated with that attachment point. For illustration see example 4, in which the unit of measure is declared to be attached to the whole data set and need not be repeated for every observation.
It is also possible to attach attributes to a qb:MeasureProperty, in which case the attribute is intended to apply only to that property and not to the observations in which that property occurs.
We define these notions by means of a transformation algorithm which can normalize an abbreviated Data Cube. We express this transformation using the SPARQL 1.1 Update language [SPARQL-UPDATE-11]. Use of this notation does not imply that the transformation must be implemented this way. Information exchanges using Data Cube may retain data in abbreviated form and use other techniques such as query rewriting to ease access, may implement the normalization algorithm by other means or may handle all data in normalized form or any mix of these.
The normalization algorithm comprises two sets of SPARQL Update operations which should be applied in turn to a SPARQL Dataset in which the default graph contains the Data Cube RDF graph to be normalized.
The first update operation performs selective type and property closure operations. These serve two purposes. They ensure that rdf:type assertions on instances of qb:Observation and qb:Slice may be omitted in an abbreviated Data Cube. They also simplify the second set of update operations by expanding the sub properties of qb:componentProperty (specifically qb:dimension, qb:measure and qb:attribute).
Phase 1: Type and property closure

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX qb: <http://purl.org/linked-data/cube#>

INSERT {
    ?o rdf:type qb:Observation .
} WHERE {
    [] qb:observation ?o .
};

INSERT {
    ?o rdf:type qb:Observation .
} WHERE {
    ?o qb:dataSet [] .
};

INSERT {
    ?s rdf:type qb:Slice .
} WHERE {
    [] qb:slice ?s.
};

INSERT {
    ?cs qb:componentProperty ?p .
    ?p rdf:type qb:DimensionProperty .
} WHERE {
    ?cs qb:dimension ?p .
};

INSERT {
    ?cs qb:componentProperty ?p .
    ?p rdf:type qb:MeasureProperty .
} WHERE {
    ?cs qb:measure ?p .
};

INSERT {
    ?cs qb:componentProperty ?p .
    ?p rdf:type qb:AttributeProperty .
} WHERE {
    ?cs qb:attribute ?p .
}
These closure operations are implied by the RDFS semantics of the Data Cube vocabulary. Data Cube processors may apply full RDFS closure in place of the update operation defined here.
The second update operation checks the components of the data structure definition of the data set for declared attachment levels. For each of the possible attachment levels it looks for occurrences of that component to be pushed down to the corresponding observations.
Phase 2: Push down attachment levels

PREFIX qb: <http://purl.org/linked-data/cube#>

# Dataset attachments
INSERT {
    ?obs ?comp ?value
} WHERE {
    ?spec qb:componentProperty ?comp ;
          qb:componentAttachment qb:DataSet .
    ?dataset qb:structure [qb:component ?spec];
             ?comp ?value .
    ?obs qb:dataSet ?dataset.
};

# Slice attachments
INSERT {
    ?obs ?comp ?value
} WHERE {
    ?spec qb:componentProperty ?comp;
          qb:componentAttachment qb:Slice .
    ?dataset qb:structure [qb:component ?spec];
             qb:slice ?slice .
    ?slice ?comp ?value;
           qb:observation ?obs .
};

# Dimension values on slices
INSERT {
    ?obs ?comp ?value
} WHERE {
    ?spec qb:componentProperty ?comp .
    ?comp a qb:DimensionProperty .
    ?dataset qb:structure [qb:component ?spec];
             qb:slice ?slice .
    ?slice ?comp ?value;
           qb:observation ?obs .
}
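As noted above, the transformation need not be implemented in SPARQL. The following Python fragment is a minimal sketch of the data set attachment step only, assuming the graph is held as a set of (subject, predicate, object) string tuples and that Phase 1 has already expanded qb:attribute into qb:componentProperty. The helper names, the blank node label _:spec and the abbreviated IRI dbpedia:Year are illustrative assumptions.

```python
def objects(graph, s, p):
    """All objects of triples with the given subject and predicate."""
    return {o for (s2, p2, o) in graph if s2 == s and p2 == p}


def subjects(graph, p, o):
    """All subjects of triples with the given predicate and object."""
    return {s for (s, p2, o2) in graph if p2 == p and o2 == o}


def push_down_dataset_attachments(graph):
    """Copy component values attached to a data set onto each of its observations."""
    inferred = set()
    for spec in subjects(graph, "qb:componentAttachment", "qb:DataSet"):
        for comp in objects(graph, spec, "qb:componentProperty"):
            for dsd in subjects(graph, "qb:component", spec):
                for dataset in subjects(graph, "qb:structure", dsd):
                    for value in objects(graph, dataset, comp):
                        for obs in subjects(graph, "qb:dataSet", dataset):
                            inferred.add((obs, comp, value))
    return graph | inferred


# A cut-down version of the running example, after Phase 1.
graph = {
    ("_:spec", "qb:componentProperty", "sdmx-attribute:unitMeasure"),
    ("_:spec", "qb:componentAttachment", "qb:DataSet"),
    ("eg:dsd-le", "qb:component", "_:spec"),
    ("eg:dataset-le1", "qb:structure", "eg:dsd-le"),
    ("eg:dataset-le1", "sdmx-attribute:unitMeasure", "dbpedia:Year"),
    ("eg:o1", "qb:dataSet", "eg:dataset-le1"),
    ("eg:o2", "qb:dataSet", "eg:dataset-le1"),
}

normalized = push_down_dataset_attachments(graph)
```

The nested loops mirror the triple patterns of the first INSERT operation in Phase 2. The slice attachment step follows the same shape, with qb:slice and qb:observation taking the place of the qb:dataSet link.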
An instance of an RDF Data Cube should conform to a set of integrity constraints which we define in this section.
A well-formed RDF Data Cube is an RDF graph describing one or more instances of qb:DataSet for which each of the integrity checks defined here passes.
A well-formed abbreviated RDF Data Cube is an RDF graph which, when expanded using the normalization algorithm, yields a well-formed RDF Data Cube.
Each integrity constraint is expressed as narrative prose and, where possible, a SPARQL [SPARQL-QUERY-11] ASK query or query template. If the ASK query is applied to an RDF graph then it will return true if that graph contains one or more Data Cube instances which violate the corresponding constraint.
Using SPARQL queries to express the integrity constraints does not imply that integrity checking must be performed this way. Implementations are free to use alternative query formulations or alternative implementation techniques to perform equivalent checks.
Each integrity constraint query assumes the following set of prefix bindings:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
The complete set of constraints is listed below.
The RDF graph must be consistent under RDF D-entailment [RDF-MT] using a datatype map containing all the datatypes used within the graph.
Every qb:Observation has exactly one associated qb:DataSet.

ASK {
  {
    # Check observation has a data set
    ?obs a qb:Observation .
    FILTER NOT EXISTS { ?obs qb:dataSet ?dataset1 . }
  } UNION {
    # Check has just one data set
    ?obs a qb:Observation ;
         qb:dataSet ?dataset1, ?dataset2 .
    FILTER (?dataset1 != ?dataset2)
  }
}
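Since implementations are free to perform equivalent checks by other means, the following Python fragment sketches one possible non-SPARQL formulation of this constraint. It assumes the graph is a set of (subject, predicate, object) string tuples with explicit rdf:type triples; the function name violates_ic1 and the faulty graph are hypothetical.

```python
def violates_ic1(graph):
    """Return True if some qb:Observation lacks a data set or has more than one."""
    observations = {s for (s, p, o) in graph
                    if p == "rdf:type" and o == "qb:Observation"}
    for obs in observations:
        datasets = {o for (s, p, o) in graph if s == obs and p == "qb:dataSet"}
        if len(datasets) != 1:
            return True
    return False


good = {
    ("eg:o1", "rdf:type", "qb:Observation"),
    ("eg:o1", "qb:dataSet", "eg:dataset-le1"),
}
# Hypothetical faulty graph: eg:o2 has no data set at all.
bad = good | {("eg:o2", "rdf:type", "qb:Observation")}

print(violates_ic1(good), violates_ic1(bad))
```

As with the ASK query, a result of true signals a violation rather than a pass.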
Every qb:DataSet has exactly one associated qb:DataStructureDefinition.

ASK {
  {
    # Check dataset has a dsd
    ?dataset a qb:DataSet .
    FILTER NOT EXISTS { ?dataset qb:structure ?dsd . }
  } UNION {
    # Check has just one dsd
    ?dataset a qb:DataSet ;
             qb:structure ?dsd1, ?dsd2 .
    FILTER (?dsd1 != ?dsd2)
  }
}
Every qb:DataStructureDefinition must include at least one declared measure.

ASK {
  ?dsd a qb:DataStructureDefinition .
  FILTER NOT EXISTS { ?dsd qb:component [qb:componentProperty [a qb:MeasureProperty]] }
}
Every dimension declared in a qb:DataStructureDefinition must have a declared rdfs:range.

ASK {
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?dim rdfs:range [] }
}
Every dimension with range skos:Concept must have a qb:codeList.

ASK {
  ?dim a qb:DimensionProperty ;
       rdfs:range skos:Concept .
  FILTER NOT EXISTS { ?dim qb:codeList [] }
}
The only components of a qb:DataStructureDefinition that may be marked as optional, using qb:componentRequired, are attributes.

ASK {
  ?dsd qb:component ?componentSpec .
  ?componentSpec qb:componentRequired "false"^^xsd:boolean ;
                 qb:componentProperty ?component .
  FILTER NOT EXISTS { ?component a qb:AttributeProperty }
}
Every qb:SliceKey must be associated with a qb:DataStructureDefinition.

ASK {
  ?sliceKey a qb:SliceKey .
  FILTER NOT EXISTS { [a qb:DataStructureDefinition] qb:sliceKey ?sliceKey }
}
Every qb:componentProperty on a qb:SliceKey must also be declared as a qb:component of the associated qb:DataStructureDefinition.

ASK {
  ?sliceKey a qb:SliceKey;
            qb:componentProperty ?prop .
  ?dsd qb:sliceKey ?sliceKey .
  FILTER NOT EXISTS { ?dsd qb:component [qb:componentProperty ?prop] }
}
Each qb:Slice must have exactly one associated qb:sliceStructure.

ASK {
  {
    # Slice has a key
    ?slice a qb:Slice .
    FILTER NOT EXISTS { ?slice qb:sliceStructure ?key }
  } UNION {
    # Slice has just one key
    ?slice a qb:Slice ;
           qb:sliceStructure ?key1, ?key2 .
    FILTER (?key1 != ?key2)
  }
}
Every qb:Slice must have a value for every dimension declared in its qb:sliceStructure.

ASK {
  ?slice qb:sliceStructure [qb:componentProperty ?dim] .
  FILTER NOT EXISTS { ?slice ?dim [] }
}
Every qb:Observation has a value for each dimension declared in its associated qb:DataStructureDefinition.

ASK {
  ?obs qb:dataSet/qb:structure/qb:component/qb:componentProperty ?dim .
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?obs ?dim [] }
}
No two qb:Observations in the same qb:DataSet may have the same value for all dimensions.

ASK {
  FILTER( ?allEqual )
  {
    # For each pair of observations test if all the dimension values are the same
    SELECT (MIN(?equal) AS ?allEqual) WHERE {
      ?obs1 qb:dataSet ?dataset .
      ?obs2 qb:dataSet ?dataset .
      FILTER (?obs1 != ?obs2)
      ?dataset qb:structure/qb:component/qb:componentProperty ?dim .
      ?dim a qb:DimensionProperty .
      ?obs1 ?dim ?value1 .
      ?obs2 ?dim ?value2 .
      BIND( ?value1 = ?value2 AS ?equal)
    } GROUP BY ?obs1 ?obs2
  }
}
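An equivalent check can be sketched without pairwise comparison. The Python fragment below is illustrative only: it assumes the observations of a single data set have already been normalized into dictionaries carrying a value for every dimension, and the function name has_duplicate_observations and the sample data are hypothetical. It uses a hash of the dimension values in place of the query's comparison of each pair of observations.

```python
def has_duplicate_observations(observations, dimensions):
    """Return True if two observations share the same value for every dimension."""
    seen = set()
    for obs in observations:
        key = tuple(obs[dim] for dim in dimensions)
        if key in seen:
            return True
        seen.add(key)
    return False


dimensions = ["refArea", "refPeriod", "sex"]
observations = [
    {"refArea": "newport_00pr", "refPeriod": "2004-2006", "sex": "M", "lifeExpectancy": 76.7},
    {"refArea": "cardiff_00pt", "refPeriod": "2004-2006", "sex": "M", "lifeExpectancy": 78.7},
]
# Hypothetical duplicate: same dimension values as the first observation.
duplicated = observations + [
    {"refArea": "newport_00pr", "refPeriod": "2004-2006", "sex": "M", "lifeExpectancy": 76.8},
]

print(has_duplicate_observations(observations, dimensions))
print(has_duplicate_observations(duplicated, dimensions))
```

Two observations that differ only in a measure or attribute value still count as duplicates, because only the dimensions identify a point in the cube.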
Every qb:Observation has a value for each declared attribute that is marked as required.

ASK {
  ?obs qb:dataSet/qb:structure/qb:component ?component .
  ?component qb:componentRequired "true"^^xsd:boolean ;
             qb:componentProperty ?attr .
  FILTER NOT EXISTS { ?obs ?attr [] }
}
In a qb:DataSet which does not use a Measure dimension, each individual qb:Observation must have a value for every declared measure.

ASK {
  # Observation in a non-measureType cube
  ?obs qb:dataSet/qb:structure ?dsd .
  FILTER NOT EXISTS { ?dsd qb:component/qb:componentProperty qb:measureType }

  # verify every measure is present
  ?dsd qb:component/qb:componentProperty ?measure .
  ?measure a qb:MeasureProperty .
  FILTER NOT EXISTS { ?obs ?measure [] }
}
In a qb:DataSet which uses a Measure dimension, each qb:Observation must have a value for the measure corresponding to its given qb:measureType.

ASK {
  # Observation in a measureType-cube
  ?obs qb:dataSet/qb:structure ?dsd ;
       qb:measureType ?measure .
  ?dsd qb:component/qb:componentProperty qb:measureType .
  # Must have value for its measureType
  FILTER NOT EXISTS { ?obs ?measure [] }
}
In a qb:DataSet which uses a Measure dimension, each qb:Observation must only have a value for one measure (by IC-15 this will be the measure corresponding to its qb:measureType).

ASK {
  # Observation with measureType
  ?obs qb:dataSet/qb:structure ?dsd ;
       qb:measureType ?measure ;
       ?omeasure [] .
  # Any measure on the observation
  ?dsd qb:component/qb:componentProperty qb:measureType ;
       qb:component/qb:componentProperty ?omeasure .
  ?omeasure a qb:MeasureProperty .
  # Must be the same as the measureType
  FILTER (?omeasure != ?measure)
}
In a qb:DataSet which uses a Measure dimension, if there is an Observation for some combination of non-measure dimensions then there must be other Observations with the same non-measure dimension values for each of the declared measures.

ASK {
  {
    # Count number of other measures found at each point
    SELECT ?numMeasures (COUNT(?obs2) AS ?count) WHERE {
      {
        # Find the DSDs and check how many measures they have
        SELECT ?dsd (COUNT(?m) AS ?numMeasures) WHERE {
          ?dsd qb:component/qb:componentProperty ?m.
          ?m a qb:MeasureProperty .
        } GROUP BY ?dsd
      }

      # Observation in measureType cube
      ?obs1 qb:dataSet/qb:structure ?dsd;
            qb:dataSet ?dataset ;
            qb:measureType ?m1 .

      # Other observation at same dimension value
      ?obs2 qb:dataSet ?dataset ;
            qb:measureType ?m2 .
      FILTER NOT EXISTS {
        ?dsd qb:component/qb:componentProperty ?dim .
        FILTER (?dim != qb:measureType)
        ?dim a qb:DimensionProperty .
        ?obs1 ?dim ?v1 .
        ?obs2 ?dim ?v2.
        FILTER (?v1 != ?v2)
      }
    } GROUP BY ?obs1 ?numMeasures
      HAVING (?count != ?numMeasures)
  }
}
If a qb:DataSet D has a qb:slice S, and S has an qb:observation O, then the qb:dataSet corresponding to O must be D.

ASK {
  ?dataset qb:slice ?slice .
  ?slice qb:observation ?obs .
  FILTER NOT EXISTS { ?obs qb:dataSet ?dataset . }
}
If a dimension property has a qb:codeList, then the value of the dimension property on every qb:Observation must be in the code list.
The following integrity check queries must be applied to an RDF graph which contains the definition of the code list as well as the Data Cube to be checked. In the case of a skos:ConceptScheme then each concept must be linked to the scheme using skos:inScheme. In the case of a skos:Collection then the collection must link to each concept (or to nested collections) using skos:member. If the collection uses skos:memberList then the entailment of skos:member values defined by S36 in [SKOS-REFERENCE] must be materialized before this check is applied.

ASK {
  ?obs qb:dataSet/qb:structure/qb:component/qb:componentProperty ?dim .
  ?dim a qb:DimensionProperty ;
       qb:codeList ?list .
  ?list a skos:ConceptScheme .
  ?obs ?dim ?v .
  FILTER NOT EXISTS { ?v a skos:Concept ; skos:inScheme ?list }
}

ASK {
  ?obs qb:dataSet/qb:structure/qb:component/qb:componentProperty ?dim .
  ?dim a qb:DimensionProperty ;
       qb:codeList ?list .
  ?list a skos:Collection .
  ?obs ?dim ?v .
  FILTER NOT EXISTS { ?v a skos:Concept . ?list skos:member+ ?v }
}
If a dimension property has a qb:HierarchicalCodeList with a non-blank qb:parentChildProperty then the value of that dimension property on every qb:Observation must be reachable from a root of the hierarchy using zero or more hops along the qb:parentChildProperty links.
This check cannot be made by a simple fixed SPARQL query. Instead a query template is supplied. An instance of the template should be generated for each qb:HierarchicalCodeList which has an IRI value for its qb:parentChildProperty. That is, for each binding of ?p in the following instantiation query:

SELECT ?p WHERE {
  ?hierarchy a qb:HierarchicalCodeList ;
             qb:parentChildProperty ?p .
  FILTER ( isIRI(?p) )
}

The template is then instantiated by replacing the string $p by the IRI found by the instantiation query. The template is:

ASK {
  ?obs qb:dataSet/qb:structure/qb:component/qb:componentProperty ?dim .
  ?dim a qb:DimensionProperty ;
       qb:codeList ?list .
  ?list a qb:HierarchicalCodeList .
  ?obs ?dim ?v .
  FILTER NOT EXISTS { ?list qb:hierarchyRoot/<$p>* ?v }
}
If a dimension property has a qb:HierarchicalCodeList with an inverse qb:parentChildProperty then the value of that dimension property on every qb:Observation must be reachable from a root of the hierarchy using zero or more hops along the inverse qb:parentChildProperty links.
This check cannot be made by a simple fixed SPARQL query. Instead a query template is supplied. An instance of the template should be generated for each qb:HierarchicalCodeList which has a blank-node value for its qb:parentChildProperty, with an associated inverse property. That is, for each binding of ?p in the following instantiation query:
SELECT ?p WHERE {
    ?hierarchy a qb:HierarchicalCodeList ;
        qb:parentChildProperty ?pcp .
    FILTER( isBlank(?pcp) )
    ?pcp owl:inverseOf ?p .
    FILTER( isIRI(?p) )
}
The template is then instantiated by replacing the string $p by the IRI found by the instantiation query. The template is:
ASK {
    ?obs qb:dataSet/qb:structure/qb:component/qb:componentProperty ?dim .
    ?dim a qb:DimensionProperty ;
        qb:codeList ?list .
    ?list a qb:HierarchicalCodeList .
    ?obs ?dim ?v .
    FILTER NOT EXISTS { ?list qb:hierarchyRoot/(^<$p>)* ?v }
}
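The difference between following a property and following its inverse, as in the path (^<$p>)*, can be illustrated with a small Python sketch of "reachable in zero or more hops". It assumes triples are (subject, predicate, object) tuples of prefixed names; the ex: resources, the ex:within property and the function name are hypothetical. In the example data the links point from child to parent, so the hierarchy has to be walked against the direction of the property.

```python
def reachable_from_roots(triples, roots, prop, inverse=False):
    """Return all nodes reachable from `roots` in zero or more hops along `prop`.

    With inverse=True each hop goes from the object of a triple to its
    subject, mirroring the ^ operator in a SPARQL property path.
    """
    seen = set(roots)  # zero hops: the roots themselves are reachable
    frontier = list(roots)
    while frontier:
        node = frontier.pop()
        for s, p, o in triples:
            if p != prop:
                continue
            if inverse and o == node:
                nxt = s
            elif not inverse and s == node:
                nxt = o
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


# Hypothetical hierarchy in which ex:within links a child to its parent.
triples = {
    ("ex:wales", "ex:within", "ex:uk"),
    ("ex:cardiff", "ex:within", "ex:wales"),
    ("ex:paris", "ex:within", "ex:france"),
}

via_inverse = reachable_from_roots(triples, ["ex:uk"], "ex:within", inverse=True)
via_forward = reachable_from_roots(triples, ["ex:uk"], "ex:within")
```

Here `via_inverse` contains the root and its descendants, while `via_forward` contains only the root, which is why a code list whose links run from child to parent needs the inverse form of the check.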
See Section Expressing data sets.
qb:DataSet (Sub class of: qb:Attachable; Equivalent to: scovo:Dataset)
See Section Expressing data sets.
qb:Observation (Sub class of: qb:Attachable; Equivalent to: scovo:Item)
qb:dataSet (Domain: qb:Observation -> Range: qb:DataSet)
qb:observation (Domain: qb:Slice -> Range: qb:Observation)
See Section Slices.
qb:ObservationGroup
qb:Slice (Sub class of: qb:Attachable, qb:ObservationGroup)
qb:slice (Domain: qb:DataSet -> Range: qb:Slice; sub property of: qb:observationGroup)
qb:observationGroup (Domain: -> Range: qb:ObservationGroup)
See Section Dimensions, attributes and measures.
qb:Attachable
qb:ComponentProperty (Sub class of: rdf:Property)
qb:DimensionProperty (Sub class of: qb:ComponentProperty, qb:CodedProperty)
qb:AttributeProperty (Sub class of: qb:ComponentProperty)
qb:MeasureProperty (Sub class of: qb:ComponentProperty)
qb:CodedProperty (Sub class of: qb:ComponentProperty)
See Section Measure dimensions.
qb:measureType (Domain: -> Range: qb:MeasureProperty)
See Section ComponentSpecifications and DataStructureDefinitions.
qb:DataStructureDefinition (Sub class of: qb:ComponentSet)
qb:structure (Domain: qb:DataSet -> Range: qb:DataStructureDefinition)
qb:component (Domain: qb:DataStructureDefinition -> Range: qb:ComponentSpecification)
See Section ComponentSpecifications and DataStructureDefinitions.
qb:ComponentSpecification (Sub class of: qb:ComponentSet)
qb:ComponentSet
qb:componentProperty (Domain: qb:ComponentSet -> Range: qb:ComponentProperty)
qb:order (Domain: qb:ComponentSpecification -> Range: xsd:int)
qb:componentRequired (Domain: qb:ComponentSpecification -> Range: xsd:boolean)
qb:componentAttachment (Domain: qb:ComponentSpecification -> Range: rdfs:Class)
qb:dimension (Domain: -> Range: qb:DimensionProperty; sub property of: qb:componentProperty)
qb:measure (Domain: -> Range: qb:MeasureProperty; sub property of: qb:componentProperty)
qb:attribute (Domain: -> Range: qb:AttributeProperty; sub property of: qb:componentProperty)
qb:measureDimension (Domain: -> Range: qb:DimensionProperty; sub property of: qb:componentProperty)
See Section Slices.
qb:SliceKey (Sub class of: qb:ComponentSet)
qb:sliceStructure (Domain: qb:Slice -> Range: qb:SliceKey)
qb:sliceKey (Domain: qb:DataSet -> Range: qb:SliceKey)
See Section Concept schemes and code lists.
qb:concept (Domain: qb:ComponentProperty -> Range: skos:Concept)
qb:codeList (Domain: qb:CodedProperty -> Range: owl:unionOf(skos:ConceptScheme skos:Collection qb:HierarchicalCodeList))
See Section Non-SKOS hierarchies.
qb:HierarchicalCodeList
qb:hierarchyRoot (Domain: qb:HierarchicalCodeList)
qb:parentChildProperty (Domain: qb:HierarchicalCodeList -> Range: rdf:Property)
This work is based on a collaboration that was initiated in a workshop on Publishing statistical datasets in SDMX and the semantic web, hosted by ONS in Sunningdale, United Kingdom in February 2010 and continued at the ODaF 2010 workshop in Tilburg. The authors would like to thank all the participants at those workshops for their input into this work, and especially Arofan Gregory for his patient explanations of SDMX and insight into the need and requirements for a core Data Cube representation.
The editors would like to thank John Sheridan for his comments, suggestions and support for the original work.
Many individuals provided valuable comments on this specification as it made its way through the W3C process. We would like to especially acknowledge the contributions of Benedikt Kaempgen, Sarven Capadisli and Curran Kelleher.
qb:componentRequired is only applicable to attributes and that it defaults to optional.
qb:ObservationGroup as a generalization of qb:Slice. ISSUE-33.
qb:subSlice as being problematic in how they interact with attachment levels. ISSUE-34.
qb:codeList to allow skos:Collection. ISSUE-39.