CEO-LD Kick Off Meeting Report

29-30 September

Present:

UK: Geoffrey Boulton, Simon Hodson (CODATA); Phil Archer (W3C); Denise McKenzie (OGC); Jeremy Tandy (UK Met Office); Maik Riechert (University of Reading); Yang Gao, Adina Gillespie (Surrey Satellite Technology); Payam Baraghi (University of Surrey); Simon Agass (Satellite Applications Catapult)
China: Li Jianhui (CNIC/CAS); Chunming Hu (Beihang University/W3C); Jitao Yang (Institute of Remote Sensing/CAS)

Raw minutes day 1, day 2. Further notes were made on the wiki.

Introduction

The kick-off meeting of the CEO-LD project took place in late September at the Royal Society's premises in central London. The participants came from the satellite data industry and academia, with a shared interest in using satellite data and making it available. As Maik Riechert said during the opening tour de table: "I come from the Web developer community, I just want to use the data easily."

A recurring theme for the two days was the problem of scale. Earth Observation creates immense volumes of data. Making that available to researchers and other users on the open Web requires automation and standardisation. It was emphasised that it is that final step – making the data available – that this group is concerned with. The processes for downloading, expanding and decrypting the data are out of scope.

Simon Agass highlighted the value in data being contextualised. Historically, satellite data has not been shared on the Web, other than perhaps some metadata. The value that can support innovation comes from linking the data, mixing it and adding context. This raised the question of what sort of questions we want to be able to answer. It's usually easy to ask 'what are the properties of this location?' but it can be harder to find all the locations that have a given set of properties. It's especially difficult when properties are expressed differently in different contexts. For example, one country's 'coniferous forest' might be another's 'forest'. The ability to link and map between terms is going to be important, especially if the data is to be used between countries as different as the UK and China. Likewise, different systems will often use different identifiers for the same things, but it won't be obvious from the outside that this is the case.
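The term-mapping point can be illustrated with a minimal sketch. It assumes two hypothetical vocabularies (prefixed `uk:` and `cn:`), a hand-written table of 'broader term' links in the spirit of SKOS, and invented site names; none of these come from the meeting. It shows how a query for one vocabulary's 'forest' can also find a location labelled with another vocabulary's 'coniferous forest'.

```python
# Hypothetical mapping: each narrower term points to a broader term
# in another vocabulary (similar in spirit to SKOS broader/narrower links).
BROADER = {
    "uk:coniferous-forest": "cn:forest",
    "uk:broadleaved-forest": "cn:forest",
}

# Hypothetical data: location -> land cover term as used by its source.
LAND_COVER = {
    "site-a": "uk:coniferous-forest",
    "site-b": "cn:forest",
    "site-c": "uk:grassland",
}

def matches(term, wanted):
    """Return True if term equals wanted or maps up to it via BROADER."""
    seen = set()
    while term is not None and term not in seen:
        if term == wanted:
            return True
        seen.add(term)          # guard against accidental mapping cycles
        term = BROADER.get(term)
    return False

def locations_with(wanted):
    """Find all locations whose term is, or maps to, the wanted term."""
    return sorted(loc for loc, term in LAND_COVER.items()
                  if matches(term, wanted))

print(locations_with("cn:forest"))  # ['site-a', 'site-b']
```

Without the mapping table, the same query would return only `site-b`, which is the gap the discussion identified.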

Understanding the needs of users is, of course, a key topic. Li Jianhui said that their platform is provided to scientists and they know the requirements for that community but not others. The Satellite Applications Catapult works with users and value adders and is a potential source of new use cases. The problems of complexity are all too apparent: defining and describing a slice through 4-dimensional space and time is a problem that has been solved many times but in different ways. For data to be shared easily on the Web, we need convergence on how it's done.

The meeting loosely identified a number of user groups:

(i) users interested in observation and measurement (O&M) data;
(ii) users interested in (i) linked to other assets on the Web;
(iii) users interested in (i) and (ii) plus provenance data and the processes that have been applied to the data in the pipeline.

Different users will have different skill sets. Many data scientists at CAS will understand the data but not the Web technologies; for Web developers it will be the other way around. So, whichever way you approach it, there is a deficit in expertise that we need to address, but the basis must be in representing the O&M data in a way that can be linked.

These initial discussions provided the context for the bulk of the meeting, which was devoted to deciding what needs to be covered in the final report.

Expected Report Structure

The group's report should have three principal sections:

The purpose (a robust statement of scope).
Functional principles – the principles being adopted to achieve the purpose.
Enabling Access and Delivery – the technical detail.

The functional principles are already clear.

The aim is to put data in the hands of people who want to use and manipulate it without having to be a geo expert.
At the same time, more expert users/researchers must be supported in finding the detailed data and provenance information that they need.
The CEO-LD group must be open: we're contributing something distinctive to the W3C/OGC Working Group.
No wheel reinvention. Something should be deliverable/actionable, not an academic exercise.
Minimum cost.
Whatever is delivered should be discoverable, both the work we do and the data it describes.
The work needs to be applicable both to real-time data publication and access to data archives.

The discussion of the detail of how to enable access and discovery raised many issues.

It was agreed that we need identifiers for datasets and for distributions of those datasets. This is a distinction made in, for example, the Data Catalog Vocabulary (DCAT). A dataset might be available in multiple formats, in which case each of those formats is a distribution of the more abstract idea of the dataset. No dataset can be used if it is not discovered, so how should the metadata be provided? The European Commission's GeoDCAT Application Profile matches DCAT with ISO 19115 terms and seems a good candidate.
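As an illustration of the dataset/distribution distinction, the following sketch builds a small JSON-LD-style description in which one abstract dataset has two distributions, each with its own identifier. The `dcat:` and `dct:` terms are taken from DCAT and Dublin Core; the example.org identifiers, the title and the choice of media types are hypothetical placeholders, not anything agreed at the meeting.

```python
import json

# Hypothetical identifier: example.org URLs are placeholders, not real data.
DATASET_ID = "http://example.org/dataset/land-cover-2015"

def make_distribution(dataset_id, suffix, media_type):
    """Describe one concrete, downloadable form of a dataset."""
    return {
        "@id": f"{dataset_id}/{suffix}",
        "@type": "dcat:Distribution",
        "dcat:downloadURL": {"@id": f"{dataset_id}/{suffix}/download"},
        "dcat:mediaType": media_type,  # illustrative media type strings
    }

# The dataset is the abstract thing; its distributions are the formats.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": DATASET_ID,
    "@type": "dcat:Dataset",
    "dct:title": "Land cover 2015 (illustrative)",
    "dcat:distribution": [
        make_distribution(DATASET_ID, "netcdf", "application/x-netcdf"),
        make_distribution(DATASET_ID, "geotiff", "image/tiff"),
    ],
}

print(json.dumps(dataset, indent=2))
```

Both distributions can be cited separately, while catalogue-level metadata such as the title is stated once, on the dataset.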

More interesting, in the context of the large volumes of satellite data, are identifiers for slices and subsets of data. The RDF Data Cube vocabulary offers this, but RDF is a verbose format and not suitable for storing satellite data itself. Formats like GeoTIFF, HDF5, JSON and NetCDF are more usual, and CSV on the Web facilitates the provision of metadata that allows tabular data to be converted readily to other formats. One option would be to convert the data to RDF at the point of query and share it in that way.
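The idea of converting to RDF only at the point of query could look roughly like the sketch below. It assumes the data is held in a tabular form (a small inline CSV stands in for the native store), filters rows by a latitude range, and emits N-Triples only for the matching rows. The example.org namespaces, column names and sample values are invented for illustration; only the `rdf:type` and `qb:Observation` IRIs are real vocabulary terms.

```python
import csv
import io

# Assumed sample data; in practice this would come from the native store.
CSV_TEXT = """lat,lon,time,temperature
51.5,-0.1,2015-09-29T12:00:00Z,17.2
39.9,116.4,2015-09-29T12:00:00Z,22.8
"""

BASE = "http://example.org/obs/"   # hypothetical observation namespace
PROP = "http://example.org/def/"   # hypothetical property namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB_OBSERVATION = "http://purl.org/linked-data/cube#Observation"

def rows_to_ntriples(csv_text, lat_min, lat_max):
    """Yield N-Triples lines only for rows inside the latitude range."""
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        if not (lat_min <= float(row["lat"]) <= lat_max):
            continue  # rows outside the query are never converted
        subject = f"<{BASE}{i}>"
        yield f"{subject} <{RDF_TYPE}> <{QB_OBSERVATION}> ."
        for column, value in row.items():
            # Values are emitted as plain string literals for simplicity.
            yield f'{subject} <{PROP}{column}> "{value}" .'

triples = list(rows_to_ntriples(CSV_TEXT, 50.0, 60.0))
print("\n".join(triples))
```

The verbose RDF form exists only for the rows a user actually asked for; the bulk data stays in its compact native format.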

A possible way forward for defining identifiers for slices of datasets would be to use or adapt OGC's OpenSearch Geo and Time Extensions (PDF). This defines technology-neutral identifiers that encode queries against large datasets, such as a search engine's index.
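A rough sketch of how such a query-encoding identifier might be produced is given below. It fills an OpenSearch-style URL template containing the `{geo:box}`, `{time:start}` and `{time:end}` placeholders, assuming the west,south,east,north ordering for the bounding box. The example.org endpoint and the query parameter names `bbox`, `start` and `end` are hypothetical; a real service would advertise its own template.

```python
from urllib.parse import quote

# Hypothetical endpoint; "bbox", "start" and "end" are assumed names.
TEMPLATE = ("http://example.org/search?"
            "bbox={geo:box}&start={time:start}&end={time:end}")

def slice_identifier(template, west, south, east, north, start, end):
    """Fill an OpenSearch-style template to get a URL naming a data slice."""
    values = {
        "{geo:box}": f"{west},{south},{east},{north}",
        "{time:start}": start,
        "{time:end}": end,
    }
    url = template
    for placeholder, value in values.items():
        # Percent-encode each value, keeping commas readable in the box.
        url = url.replace(placeholder, quote(value, safe=","))
    return url

url = slice_identifier(TEMPLATE, -0.5, 51.3, 0.3, 51.7,
                       "2015-09-29T00:00:00Z", "2015-09-30T00:00:00Z")
print(url)
```

Because the same inputs always give the same URL, the result can be used as a stable identifier for that slice as well as a query.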

This also touches on the question of what APIs should be provided for satellite data. Should data be accessible by geoposition? By time? By observed property? Should there be different APIs for expert and non-expert users?

The meeting also discussed issues that are widely recognised in the more general area of data publishing, such as annotations for data quality and provenance.

Next Steps

The participants are anxious to ensure that solutions are practical. In this respect it is noteworthy that the CEO-LD project coincides with the MELODIES project, led by the University of Reading. That intersection will allow Maik Riechert in particular to contribute to the development of relevant extensions for the most widely used open source data catalogue platform, CKAN, particularly focusing on RDF/DCAT support, spatiotemporal filtering, and data previewing.

The CEO-LD Working Group will deliver its report in May 2016. Discussions will continue via e-mail until then, with a second face-to-face meeting scheduled for Sunday 28 to Monday 29 February, hosted by Beihang University. This is immediately prior to the next Research Data Alliance plenary in Japan.