From CSV on the Web Working Group Wiki

Proposals for expressing complex data in tabular form

Contributed by Jeremy Tandy

(Note: whilst I’ve made some concrete proposals below based on Jeni’s unofficial “Linked CSV” proposal, my intent is to use these suggestions as a way to describe the problem in more detail … if there are more elegant ways to achieve these outcomes then I’m happy to adopt them.)

To illustrate the proposal, we will look at a subset of data from the Met Office’s Weather Observation Website (WOW), a system that crowd-sources weather observation reports. WOW has somewhere approaching 100 million individual weather reports, each of which is associated with a specific location, or “station” in meteorological parlance. The majority of locations are within the UK, but a growing number are in Australia, where a second web portal has recently been released in collaboration with the Bureau of Meteorology.

Whilst WOW does not yet expose data in CSV format, the Met Office is keen that the “second generation” of WOW will expose data APIs that can be consumed by down-stream applications. The intent is that the data provided by those APIs will be made available in a number of formats, each of which is able to express the rich semantics required to understand the weather observation in context.

The design work is yet to begin in earnest, but some early work has been done to map the WOW dataset into RDF using the RDF Data Cube, Semantic Sensor Network (SSN), Observations and Measurements (O&M2) and Quantities, Units, Dimensions and Data Types (QUDT) ontologies.

Both SSN and O&M2 provide a means to describe the observation event itself. From O&M2, an observation is an event whose result is an estimate of the value of some property (or properties) of the feature of interest using a given procedure - where the “feature of interest” is the real world target of the observation. SSN, being based on the “stimulus-sensor-observation” design pattern, provides more in depth descriptions of the “procedure” used in the observation event (e.g. describing the sensor itself, its operating conditions and the “sensing process” that it operates).

O&M2 introduces the concept of “sampling features” that one can use to characterise the sampling regime and provide an indirect relationship to the ultimate target of the observation. For example, we may deploy an automatic weather station (AWS) at Exeter International Airport. Here, the sampling regime comprises an individual point location where the AWS is deployed; an identified point location where observations are made. This is the “sampling feature” (or in this case more specifically, a “sampling point”). O&M2 clearly states that the observed properties relate to the physical medium at the location rather than any physical artefact such as the AWS platform itself. Thus use of a sampling point implies that we are measuring the properties of the atmosphere in its local vicinity, which we infer to be representative of the atmospheric properties for Exeter International Airport itself. O&M2 defines the property “sampledFeature” to relate the sampling feature with the ultimate feature of interest - in this case, Exeter International Airport. In this way we can bind our observations to “real-world things” that may be maintained by other parties, such as Ordnance Survey’s NamedPlace resource describing Exeter International Airport, enabling the weather observations to be linked with other sources of information that reference that NamedPlace.

QUDT provides the means to describe the physical properties being observed (“quantity kinds”), the measured values themselves and the units of measurement within which those values are expressed.

Finally, the RDF Data Cube vocabulary provides a mechanism to describe the overall structure of the dataset (e.g. the dimensions, time and location, and the measured properties) and enables the data to be packaged up into manageable subsets termed “slices”.

As is typical of meteorologists, we choose to partition the WOW dataset such that we have one time-series per station (a.k.a. “feature of interest”). These are the “slices” described in the DataCube “DataSetDefinition”.

Furthermore, each slice can be conveniently packaged for consumption within a single file.

Because the feature of interest does not vary for the entire time-series, we can “attach” the feature of interest property (om:featureOfInterest) to the slice. Thus we can infer that every observation within that slice has the same feature of interest. This mechanism of “component attachment” is one possible way that property values repeated in every row of a given CSV file instance could be summarised at a file level.
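
To make the idea of “component attachment” a little more concrete, here is a minimal Python sketch of how a publisher might detect property values that repeat in every row and lift them to the file level. The function name `hoist_constant_columns` and the feature-of-interest identifier `example:ExeterSamplingPoint` are hypothetical and used only for illustration; the sketch also assumes rows have already been parsed into dictionaries.

```python
def hoist_constant_columns(rows):
    """Split rows into (file_level, per_row).

    file_level holds every key whose value is identical in all rows;
    per_row holds what is left of each row once those keys are removed.
    """
    if not rows:
        return {}, []
    first = rows[0]
    # A key is "constant" if every row carries the same value for it.
    constant = {
        key: value
        for key, value in first.items()
        if all(key in row and row[key] == value for row in rows)
    }
    per_row = [
        {key: value for key, value in row.items() if key not in constant}
        for row in rows
    ]
    return constant, per_row


# Hypothetical rows: the feature of interest is the same for the whole slice.
rows = [
    {"om:featureOfInterest": "example:ExeterSamplingPoint",
     "time": "2013-12-13T08:00:00Z", "airTemperature_C": "11.2"},
    {"om:featureOfInterest": "example:ExeterSamplingPoint",
     "time": "2013-12-13T09:00:00Z", "airTemperature_C": "12.0"},
]
file_level, per_row = hoist_constant_columns(rows)
print(file_level)
print(per_row)
```

Note that a purely mechanical check like this would also lift a measured value that merely happens to be the same in every row, so in practice the attachment is better declared by the publisher than inferred from the data.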

As a result, much of the complexity of describing a time-series of weather observations (e.g. the sampling point, the real-world feature it relates to and the sensor “hosted” at that sampling point) can be expressed once at the file level - perhaps by providing a reference to a description of the slice itself.

The only properties that vary with each and every observation event are then the time at which the observation occurred (om:phenomenonTime) and the measured values themselves.

The O&M2 data model requires that the range of “om:phenomenonTime” be a time instant or interval; om:phenomenonTime cannot refer directly to an XSDDateTime string. Thus om:phenomenonTime refers to an object of type time:Instant, albeit a “blank node” in the graph, which then uses the property time:inXSDDateTime to provide an ISO 8601 date-time string.

The measured values are provided via the om:result property. Here, we are able to bundle multiple values into a single record using an instance of type ssn:SensorOutput. Each measured value is provided as an instance of qudt:QuantityValue using the property qudt:numericValue to express the data value itself. Rather than specify the unit of measurement and the physical property (“quantity kind”) being measured, it is possible to infer these using a set of OWL axioms. For example, use of the (locally defined) property wow-def-obp:airTemperature_C means that we can infer the range of that property to be an instance of type qudt:QuantityValue where the unit of measurement and quantity kind are constrained to be Celsius and air temperature respectively.

So, rather than the semantics for each property in the table being expressed as a single RDF property, the semantics are more like a “path” wherein intermediate objects are treated as blank nodes, possibly with their type able to be inferred from associated axioms.

So assuming we were looking at a time-series for Exeter International Airport (site no. 22580943), a single observation in that time-series might have:

  • phenomenon time = 2013-12-13T08:00:00Z
  • air temperature = 11.2 Cel
  • dew-point temperature = 10.2 Cel

(in reality, there will be more measured properties - but I need to keep the example simple!)

So the RDF (in TTL syntax) for that observation (using fictitious URIs) might look like:

    a ssn:Observation , qb:Observation , om:Observation ;
    qb:dataSet <> ;
    om:phenomenonTime [ time:inXSDDateTime "2013-12-13T08:00:00Z"^^xsd:dateTime ] ;
    om:result [
        a ssn:SensorOutput ;
        wow-def-obp:airTemperature_C [ qudt:numericValue "11.2"^^xsd:double ] ;
        wow-def-obp:dewPointTemperature_C [ qudt:numericValue "10.2"^^xsd:double ] ] .

Reusing the syntax from LDPath we could express the path between the observation entity <> and the phenomenon time as:

om:phenomenonTime/time:inXSDDateTime

… whilst the paths to the air temperature and dew-point temperature values are:

om:result/wow-def-obp:airTemperature_C/qudt:numericValue and om:result/wow-def-obp:dewPointTemperature_C/qudt:numericValue

Taking Jeni’s unofficial “Linked CSV” proposal as a starting point, we can encode the weather observations as follows:

#,                                                      $id,                           Date-time,                                   Air temperature (Cel),                                     Dew-point temperature (Cel)
meta,                                                      ,                                base,,
see,,                                    ,                                                        ,                                                                  
see,                                          site/22580943,                                    ,                                                        ,                                                                  
url,                                                       ,om:phenomenonTime/time:inXSDDateTime,om:result/wow-def-obp:airTemperature_C/qudt:numericValue,   om:result/wow-def-obp:dewPointTemperature_C/qudt:numericValue
type,                                                      ,                                time,                                                  double,                                                          double
,                    site/22580943/date-time/20131213T0800Z,                2013-12-13T08:00:00Z,                                                    11.2,                                                            10.2
,                    site/22580943/date-time/20131213T0900Z,                2013-12-13T09:00:00Z,                                                    12.0,                                                            10.2

The see prolog lines are used to reference both the top-level WOW dataset itself, and the slice for site 22580943 (Exeter International Airport) - wherein one can find more information about the feature of interest etc.
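
As a rough illustration of how a consumer might read such a file, the following Python sketch separates prolog lines from data rows. It assumes, as in the example above, that the first column carries the prolog line type and is left empty for data rows, and that the padding whitespace is not significant. The function name `read_linked_csv` is mine, and the sample omits the meta line and the first see line because their URI values are not shown above.

```python
import csv
import io

SAMPLE = """\
#,$id,Date-time,Air temperature (Cel),Dew-point temperature (Cel)
see,site/22580943,,,
url,,om:phenomenonTime/time:inXSDDateTime,om:result/wow-def-obp:airTemperature_C/qudt:numericValue,om:result/wow-def-obp:dewPointTemperature_C/qudt:numericValue
type,,time,double,double
,site/22580943/date-time/20131213T0800Z,2013-12-13T08:00:00Z,11.2,10.2
,site/22580943/date-time/20131213T0900Z,2013-12-13T09:00:00Z,12.0,10.2
"""


def read_linked_csv(text):
    """Return (header, prolog, data) for a Linked-CSV-style document.

    header: column names, excluding the leading "#" column
    prolog: dict mapping prolog line type -> list of rows (cells after column 1)
    data:   list of dicts, one per data row, keyed by column name
    """
    rows = [[cell.strip() for cell in row] for row in csv.reader(io.StringIO(text))]
    header, body = rows[0][1:], rows[1:]
    prolog, data = {}, []
    for row in body:
        if not any(row):
            continue  # skip blank lines
        kind, cells = row[0], row[1:]
        if kind:
            # Non-empty first column: a prolog line such as "url" or "type".
            prolog.setdefault(kind, []).append(cells)
        else:
            # Empty first column: an ordinary data row.
            data.append(dict(zip(header, cells)))
    return header, prolog, data


header, prolog, data = read_linked_csv(SAMPLE)
print(prolog["url"][0][1])
print(data[0]["Air temperature (Cel)"])
```

Keeping each prolog type as a list of rows allows for line types that occur more than once, such as see.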

We can see some amendments to Jeni’s original proposal:

  1. Use of LDPath-like syntax in the url prolog line
  2. Addition of a single occurrence of a base type within a meta prolog line - this is just for convenience, but it certainly makes the tabular data easier to read! It’s not always a safe assumption that the URIs for entities are relative to the location of the Linked CSV file.

In unpacking this CSV file as RDF, we would need to insert bnodes for the intermediate entities (as shown above in the RDF snippet). It’s worth noting that both measurement values are children of the same blank node object (path om:result); thus any conversion algorithm should look to conflate bnodes with the same path that are expressed in the same row of the CSV file.
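
The conflation rule can be sketched in a few lines of Python. This is only one possible reading of the rule, not a defined algorithm: within a single row, a blank node is created the first time a path prefix is seen and reused for every later cell that shares that prefix. The function name `row_to_triples` and the `_:bN` labels are illustrative, values are kept as plain strings (datatype handling from the type prolog line is left out), and paths are assumed to use compact URIs so that splitting on “/” is safe.

```python
from itertools import count


def row_to_triples(subject, paths, values):
    """Expand one data row into (subject, predicate, object) triples.

    Intermediate objects become blank nodes. Cells in the same row whose
    paths share a prefix reuse the blank node created for that prefix.
    """
    ids = count(1)
    bnodes = {}  # path prefix (tuple of steps) -> blank node label
    triples = []
    for path, value in zip(paths, values):
        steps = path.split("/")
        node = subject
        for depth, step in enumerate(steps[:-1], start=1):
            prefix = tuple(steps[:depth])
            if prefix not in bnodes:
                bnodes[prefix] = f"_:b{next(ids)}"
                triples.append((node, step, bnodes[prefix]))
            node = bnodes[prefix]
        # The last step of the path points at the cell value itself.
        triples.append((node, steps[-1], value))
    return triples


paths = [
    "om:phenomenonTime/time:inXSDDateTime",
    "om:result/wow-def-obp:airTemperature_C/qudt:numericValue",
    "om:result/wow-def-obp:dewPointTemperature_C/qudt:numericValue",
]
values = ["2013-12-13T08:00:00Z", "11.2", "10.2"]
triples = row_to_triples("site/22580943/date-time/20131213T0800Z", paths, values)
for triple in triples:
    print(triple)
```

For the first data row this yields a single om:result triple, with both temperature nodes hanging off the same blank node, which matches the shape of the RDF snippet above.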

The LDPath expressions used above employ compact URIs; there’s no definition of the prefixes for “om”, “time”, “wow-def-obp”, “qudt” etc. Whilst we could predefine these prefixes in the standard, it is much easier to allow a data publisher to express their own prefixes.

Furthermore, it would be useful to assert the “type” of the entity described in each row without having to repeat this for every line - especially noting that the example here uses three type declarations!

The final CSV file might look like:

#,                                                      $id,                           Date-time,                                   Air temperature (Cel),                                     Dew-point temperature (Cel)
meta,                                                      ,                              prefix,                                                   qudt:,                      
meta,                                                      ,                              prefix,                                                     qb:,                     
meta,                                                      ,                              prefix,                                                     om:,
meta,                                                      ,                              prefix,                                            wow-def-obp:,    
meta,                                                      ,                              prefix,                                                   time:,                          
meta,                                                      ,                                base,,
meta,                                                      ,                                type,                                                     url,                                                 ssn:Observation
meta,                                                      ,                                type,                                                     url,                                                  qb:Observation
meta,                                                      ,                                type,                                                     url,                                                  om:Observation
see,,                                    ,                                                        ,                                                                 
see,                                          site/22580943,                                    ,                                                        ,                                                                 
url,                                                       ,om:phenomenonTime/time:inXSDDateTime,om:result/wow-def-obp:airTemperature_C/qudt:numericValue,   om:result/wow-def-obp:dewPointTemperature_C/qudt:numericValue
type,                                                      ,                                time,                                                  double,                                                          double
,                    site/22580943/date-time/20131213T0800Z,                2013-12-13T08:00:00Z,                                                    11.2,                                                            10.2
,                    site/22580943/date-time/20131213T0900Z,                2013-12-13T09:00:00Z,                                                    12.0,                                                            10.2

… with these additional amendments to Jeni’s original proposal:

  1. Addition of multiple occurrences of a prefix type within a meta prolog line
  2. Addition of multiple occurrences of a type type within a meta prolog line
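
To show how a consumer might use the prefix declarations, here is a small Python sketch that collects them from the meta prolog lines and expands a compact URI. The namespace URIs are placeholders under `http://example.org/` (the real ones are not given above), and the function names are hypothetical. Each meta row is assumed to hold four cells after the first column: $id, the key (“prefix”), the prefix name and the namespace URI.

```python
def collect_prefixes(meta_rows):
    """Build a prefix -> namespace mapping from meta prolog rows."""
    prefixes = {}
    for _id, key, name, value in meta_rows:
        if key == "prefix":
            # Prefix names are written with a trailing colon, e.g. "qudt:".
            prefixes[name.rstrip(":")] = value
    return prefixes


def expand_curie(curie, prefixes):
    """Expand a compact URI; return it unchanged if the prefix is unknown."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return curie


# Placeholder namespaces, for illustration only.
meta_rows = [
    ["", "prefix", "qudt:", "http://example.org/ns/qudt#"],
    ["", "prefix", "om:", "http://example.org/ns/om#"],
    ["", "type", "url", "om:Observation"],
]
prefixes = collect_prefixes(meta_rows)
path = "om:result/qudt:numericValue"
expanded = [expand_curie(step, prefixes) for step in path.split("/")]
print(expanded)
```

The path is split into steps before expansion because the expanded URIs themselves contain “/”.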

Finally, it occurs to me that given the need for flexibility in how the "metadata headers" are expressed for a given CSV file (e.g. where many CSV files have similar structure), it should be possible to refer to an external resource - much in the same way that JSON-LD uses @context. Further thought required, but this could be achieved using a single occurrence of a context type within a meta prolog line?

Obviously a _real_ example would have more than two observations. A CSV file (without the additional header information) is provided (albeit in Excel format because I can't upload a CSV or TXT file!) with a larger set of observations. The source data can be extracted from WOW here.

For reference, a PDF diagram is provided showing the graph of objects and their properties. Green relates to the RDF Data Cube DataSet Definition whilst Red indicates information that can be inferred.