Coverage draft requirements
- 1 Definitions
- 2 Requirements in UCR (highlighted as relevant for coverage)
- 3 Other requirements in UCR (not highlighted, but still seem relevant)
- 4 Additional requirements, not mentioned in UCR
- 5 Relevant BPs
- 6 Questions to discuss
- 7 Miscellaneous notes and things not to forget but not yet put in order
- coverage - point to existing ones
- extract - a part of coverage 'file' or 'document' (terminology?)
- 'endpoint' (better word for that?)
Requirements in UCR (highlighted as relevant for coverage)
5.4 It should be possible to add temporal references to spatial coverage data.
Notes: One approach is to treat time as another dimension, with the same status as the spatial dimensions. An alternative would be to treat time as a metadata item, with a single time applicable to a large group (array, grid, point cloud, whatever) of spatial data points. That's only viable if there is a 'slice' of spatial data that all applies to the same time, which might not always be the case (eg a 'trajectory'). Given that there are many useful use cases that relate to taking a time-slice through a data cube of some sort, it would seem preferable to treat the time dimension in the same way as the spatial dimensions.
- for a coverage extract in our chosen format, can you tell what time it applies to?
- can a single coverage extract include data relating to two or more different times?
5.10 It should be easy to find spatial data on the Web, e.g. by means of metadata aimed at discovery. When spatial data are published on the Web, both humans and machines should be able to discover those data.
Notes: this seems to be one of the biggest practical challenges around coverage data at present, especially satellite data. The search and discovery facilities of eg the Sentinel data (https://scihub.copernicus.eu) are not particularly easy to use (in the opinion of Bill Roberts!). There is an API providing machine readable data but not currently amenable to crawlers. There are undoubtedly design challenges to doing this, particularly with the very large quantities of data, but clearly this could be done better.
- What needs to be included in discovery metadata? (DCAT style stuff, as opposed to eg metadata on how to translate grid coords into spatial coords)
- a bounding box (perhaps n-dimensional, and including time)
- what variables are provided in the data
- licence information
- contact information
- links to documentation on how to use it
- see http://w3c.github.io/dwbp/bp.html#metadata
- What should metadata be attached to?
- we are specifying a way to identify/describe/deliver an extract of coverage. In general there might be an enormous number of different ways in which a large coverage data collection could be divided into extracts. If the publisher decides to offer a defined list of extract options, each one could have its own metadata. If the publisher is providing an API service, where eg the user can choose an arbitrary bounding box, then it would still be useful for the delivered coverage to come with descriptive metadata, but it may also be necessary to provide metadata about the limits of the service. Similar considerations come up with other kinds of data - eg should you bundle all years of a statistical data collection into one dataset? or have separate datasets per year? Or do you put it in a database and let users say they want only the data about Edinburgh and only for 2013, 2014 and 2015?
- What format should the metadata be provided in? (DCAT RDF?)
- How does it become crawlable so that web search engines can discover it is there?
- in current practice web crawlers mostly index HTML pages and a bit of RDFa if you are lucky - so should a publisher provide a web page about each coverage dataset, coverage extract, or collection of coverages? Should it use schema.org? Does schema.org include enough properties for our purpose or do we need to add more?
- How does a user relate common ways of specifying location to which bits of coverage data exist or might be useful? eg to provide a named place, or coordinates of a point, or coordinates of a polygon? Maybe that's something a data service could offer, but needn't be part of a standard for coverage data. It would seem to be a generic 'translation service' that can interconnect between different ways of talking about place.
[Maik]: Why can't we just ask search engines what they want from us? Ed Parsons maybe? [Kerry]: SSN should be able to do this. There is a big advantage to a graph representation of metadata because it can attach metadata to even the smallest units, giving info on pixel-level corrections right up to the sensor and platform. I am unaware of any use of SSN in this way for coverages, but the SSN deliverable has a plan to ensure better alignment with the RDF Data Cube than is possible in its current form (see for example http://dl.acm.org/citation.cfm?id=2887690). SSN from the charter has to address time series data, at least, and the Data Cube is the front-runner option at present.
- once we specify what metadata needs to be provided and by what mechanism, we can define a specific test to verify that
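To make the discovery-metadata discussion above concrete, here is a minimal sketch of DCAT-style metadata for one coverage extract, serialised as JSON-LD. The property choices loosely follow DCAT/Dublin Core, but the bounding-box and variable encodings, the URLs, and the function name are illustrative assumptions, not settled vocabulary:

```python
import json

# Sketch of DCAT-style discovery metadata for one coverage extract,
# serialised as JSON-LD. Property choices loosely follow DCAT / Dublin
# Core; the bounding-box and variable encodings are illustrative only.
def discovery_metadata(extract_url, title, bbox, t_start, t_end, variables):
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": extract_url,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:license": "http://creativecommons.org/licenses/by/4.0/",
        "dcat:contactPoint": "mailto:data@example.org",  # placeholder
        # n-dimensional bounds: x/y plus time, per requirement 5.10
        "bbox": {"west": bbox[0], "south": bbox[1],
                 "east": bbox[2], "north": bbox[3]},
        "temporal": {"start": t_start, "end": t_end},
        "variables": variables,
    }

meta = discovery_metadata(
    "http://example.org/coverages/sst/2015",
    "Sea surface temperature, 2015",
    (-10.0, 50.0, 2.0, 61.0),
    "2015-01-01", "2015-12-31",
    ["sea_surface_temperature"],
)
print(json.dumps(meta, indent=2))
```

The same structure could be embedded in an HTML landing page (eg as schema.org JSON-LD) to address the crawlability question.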
5.14 The coverage data model should consider the inclusion of metadata to allow georectification to an arbitrary grid.
Notes: the user needs to be able to work out which place the coverage data refers to. This particular requirement is not currently phrased in a very generic way, but is probably equivalent to: for every data point in a coverage extract, it must be possible to identify the location and time to which it refers.
Not sure of significance of 'arbitrary grid' here.
[Maik]: I don't see how georectification applies to coverages at all. The term georectify means that you haven't established a CRS/coordinates for your satellite image (etc) yet and you need to pick known points in the image to create such a connection. After that process you have metadata to georeference it automatically.
[Kerry] This requirement arose from a use case that was presented at the SDW meeting in Barcelona. It is traceable from the requirement in the BP doc above. It was all about needing to use some national grid system (Greece in this case) that is not normally used for storage/retrieval of satellite imagery. I believe it does make sense -- I think it was asking for algorithms and/or an API to store and retrieve coverage data wrt an arbitrary CRS. My own opinion (following extensive discussions in SDW on CRS -- we should check the outcome as recorded in our BP deliverable): our scope should be to ensure that we allow a well-defined CRS to be included in metadata (an arbitrary one at that, though a default CRS is also appropriate pending BP decisions on that matter), but doing translations is out of scope for us (and I think OGC has done a lot of work here already?). Then, in principle, any kind of representation that includes a well-defined CRS achieves this goal. So I think the "test" below is just right. If we agree that dealing with automated translations wrt various CRSs is out of scope (and I agree with this), we should merge this requirement with the georeferencing one below as Bill suggests.
Test: verify that a coverage extract includes sufficient information to calculate the location and time of each data point within it.
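The test above amounts to: every array index in an extract must be convertible to a real-world location and time. For a rectified regular grid this needs only an origin and step per axis, declared once. A minimal sketch, with made-up axis values:

```python
# Map array indices to coordinates for a regular (rectified) grid:
# each axis is described once by an origin and a step, so no per-point
# georeferencing needs to be stored. Axis values are illustrative.
AXES = {
    "x": {"origin": -10.0, "step": 0.25},  # degrees longitude
    "y": {"origin": 60.0, "step": -0.25},  # degrees latitude, north to south
    "t": {"origin": 0, "step": 3600},      # seconds since some epoch
}

def index_to_coord(axis, i):
    a = AXES[axis]
    return a["origin"] + i * a["step"]

def locate(ti, yi, xi):
    """Location and time of the data point at array position [ti, yi, xi]."""
    return (index_to_coord("t", ti),
            index_to_coord("y", yi),
            index_to_coord("x", xi))

print(locate(2, 4, 8))  # (7200, 59.0, -8.0)
```

Non-rectified (referenceable) grids would instead need explicit coordinate arrays per axis, but the principle - index in, coordinate and time out - is the same.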
5.15 It should be possible to georeference spatial data.
Notes: is this just the same as the previous requirement? I propose merging them.
5.23 Spatial data modeling issues solved in existing models shall be considered for adoption, e.g. O&M, SoilML or the OGC coverage model.
Notes: This requirement seems too generic to be useful - but maybe just reminds us not to unnecessarily re-invent the wheel. Is there something here we need to capture and make testable?
[Kerry]: Agreed -- we should drop this. Note that it is issue-18 in UCR, and also applies to SSN -- where it should also be dropped. My suggestion is to put this on the agenda for a plenary meeting and to propose to drop it. It makes sense in overarching design-principle terms, but seems out of place as a requirement.
5.25 Multilingual support
Notes: this is not really specific to coverages, so should be covered in best practices. Will require metadata to support 'human readable annotation in multiple languages'
[Kerry]: In ontology terms this should be delivered by multiple language tags on annotation properties and possibly also labels. We should strive to do this if we have the resources. But this would be a very late-stage step in our work plan.
5.26 It should be possible to represent many different types of coverage. For instance, to classify coverage data by grid complexity: GridCoverage (GML 3.2.1), RectifiedGridCoverage, ReferenceableGridCoverage, etc.
Notes: how complete should our solution be? How do we balance completeness or flexibility with simplicity? Should we tackle the most common cases in a simple way (simple to specify, understand, use) but accept that it doesn't cover everything? Or does it need to deal with all cases?
[Maik]: I don't see why this should be a requirement at all. Different data formats serve different purposes, so there may be formats that only support regular grids, and others that only support trajectories, and then others that support a bit of both. An overview of which formats support which types would be useful however. And an easily understandable listing of the types, maybe without using the quite technical GML/CIS terms.
[Kerry]: So we are being asked to "support", ie have solutions for, all those formats (and it is not the format that matters, it is the underlying purpose to which the format refers that matters here for us, as Maik says). My own view is that we cannot possibly afford to cover even all "discrete" (from the charter) cases, but which ones? Well, that depends on what our participants are developing or noticing solutions for...
5.32 Ensure alignment of models or vocabularies for describing provenance that exist in the geospatial and Semantic Web domains. Examples are the W3C provenance ontology (PROV-O) and the OGC metadata specification (ISO-19115).
- are these compatible, or will we have to choose one or the other? or some kind of union of the two?
- for describing the properties of sensors, eg as used in EO coverage data, should we point to the work of the SSN group
[Kerry]: There is already an alignment between prov-o and iso 19115-le (which is most of the relevant stuff) here https://www.w3.org/2001/sw/wiki/PROV#ISO_19115_Lineage As the SSN group is also doing a prov-o alignment we should leverage this.
5.33 It should be possible to describe properties of the data quality, e.g. uncertainty.
Notes: in practice I can imagine that the most common way to do this is probably in human-readable documentation. In general the factors leading to uncertainty might be very complicated. There might be some cases where machine readable and quantitative ways of describing uncertainty are possible. Is that going to be something that applies to a whole extract? or to individual data points?
[Maik]: Uncertainty may apply to individual data points. See http://behemoth.nerc-essc.ac.uk/ncWMS2/Godiva3.html in the left menu click "CCI SST (Regular Grid)" -> "SST uncertainty". Per-point uncertainty is just another observed property. The tricky part is to semantically link the uncertainty parameter to the original parameter, and describe what kind of uncertainty it is. For the latter, sometimes using a URI from uncertml.org is enough, e.g. http://www.uncertml.org/statistics/standard-deviation. But this is a whole area of research and is probably out of scope / too specific.
[Kerry]: This should help us: https://www.w3.org/TR/vocab-dqv/ NOW is the time to check if it meets our needs!
5.34 It should be possible to identify and reference chunks of data, e.g. for processing, citation, provenance, cataloging.
Notes: I think we've agreed that this is our fundamental requirement to deliver an 'extract' of a larger coverage dataset or dataset collection.
- ideally, it should be possible to identify and reference an individual data point (if required - it won't always be required) - should we require our coverage extract approach to be flexible enough that it can go down to a single data point?
[Maik]: I feel a slight tension between identifying extracts at the high-level (DCAT...) dataset level vs on the lower-level coverage data format/API level. I think it depends on the use case which level you require. For provenance etc you want the high-level one, but for low-level data processing you may want the low-level one (and then don't care about RDF/DCAT etc.). And the higher level may define some fixed bigger extracts (e.g. time steps, per year) while the lower level can go down to single points.
5.43 It should be possible to describe locations in a vague, imprecise manner. For instance, to represent spatial descriptions from crowdsourced observations, such as "we saw a wildfire at the bottom of the hillside" or "we felt a light tremor while walking by Los Angeles downtown". Another related use case deals with spatial locations identified in historical texts, e.g. a battle occurred at the south west boundary of the Roman Empire.
Notes: I think it would be reasonable for our coverage extract approach to only deal with precisely described locations. A separate consideration could deal with how imprecise locations might be related to more precise ones, and with what level of certainty
[Maik]: I agree, and I think if we cover things like countries vs coordinates as location then this already goes a long way.
5.44 It should be possible to represent satellite data using the SSN model, including sensor descriptions.
Notes: let's ask Kerry!
- is this actually a requirement on the SSN model?
- for metadata describing the variables (eg satellite sensor frequency bands), we should use the SSN ontology
5.46 Standards or recommendations for spatial data on the Web should be applicable to three-dimensional data.
Notes: this could possibly be combined with 5.4 (about time references) to say we should be able to have 1, 2 or 3 spatial dimensions and a time dimension in the coverage extract format. Or do we always have 4-d data but some of the dimensions might collapse to a single value?
[Maik]: Often the vertical coordinate is unknown or irrelevant, so it may be 2D spatial + 1D temporal. But yes, a coverage data standard should support xyzt dimensions at least I think.
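One way to reconcile "always 4-d" with "often 2D spatial + 1D temporal" is to keep all four axes in the domain description but let some collapse to a single value. A small sketch of that idea (axis names and values are made up for illustration):

```python
# A coverage domain with four conceptual axes (x, y, z, t) where some
# collapse to a single value: the data stays logically 4-D but the
# collapsed axes carry one coordinate each. Values are illustrative.
DOMAIN = {
    "x": [5.0, 5.25, 5.5],
    "y": [52.0, 52.25],
    "z": [0.0],                     # collapsed: surface only
    "t": ["2015-06-01T00:00:00Z"],  # collapsed: a single time slice
}

def shape(domain):
    """Logical array shape implied by the axis lengths."""
    return tuple(len(domain[a]) for a in ("x", "y", "z", "t"))

def point_count(domain):
    """Total number of data points in the domain."""
    n = 1
    for axis_len in shape(domain):
        n *= axis_len
    return n

print(shape(DOMAIN), point_count(DOMAIN))  # (3, 2, 1, 1) 6
```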
5.51 It should be possible to represent time series of data.
Notes: covered by 5.4 I think.
5.54 It should be possible to use coverage data as input or output of computational models, e.g. geological models.
- what specific requirements might this bring? if we can do 4d coverages, is that sufficient? Any particular computational model will have its own input and output format requirements, so might need some conversion mechanism.
- also, so far we have not been specific about grids (of various sorts) versus point clouds - will address this below
Other requirements in UCR (not highlighted, but still seem relevant)
5.2 Standards or recommendations for spatial data on the Web should be compatible with existing methods of making spatial data available (like WFS, WMS, CSW, WCS).
Notes: what does 'compatible' mean here? It needs a more specific definition in order to be able to test it or use it as a design criterion.
5.3 Spatial data on the Web should be compressible (for optimization of data transfer).
Notes: well, all data is compressible to some extent, assuming it has a serialisation. Can we just assume standard HTTP compression as per https://tools.ietf.org/html/rfc2616 (i.e. Content-Encoding, gzip etc.)?
5.5 Spatial data on the Web should be crawlable, allowing data to be found and indexed by external agents.
Notes: in general for something to be crawled, there needs to be a link to it, and the crawler needs to be able to understand the response when it GETs the resource. This comes back to the question of whether data is presented as a finite list of defined extracts, or as an infinite set of URL combinations forming API calls. A crawler will only be able to crawl a set of links. Should all coverage extracts have a web search-engine crawler-friendly representation (html page with metadata embedded) as well as machine readable links? It would certainly be possible to create a crawler that retrieves and interprets coverage formats too, even if such a thing is not currently common.
5.6 CRS definition - coverage work should follow the recommended CRS definition - once it is decided what that is
Notes: fair enough.
5.8 There should be a default Coordinate Reference System (CRS) that can be assumed to be used for coordinates that otherwise have no specification of the CRS.
Notes: as per 5.6 - whatever the SDW group as a whole decides on CRS can be applied here.
5.9 It should be possible to represent data using different time models, such as geological time and non-Gregorian calendars.
Notes: defer to the Time group on how to specify times. We'll just link to it. But note that in the same way that the coverage extract format has to assume or specify a spatial CRS, it will need to assume or specify a calendar.
5.19 Spatial data on the Web should be linkable (by explicit relationships between different data in different data sets), to other spatial data and to or from other types of data.
Notes: I think this is already covered above: any coverage extract must have a URL that can be linked to and which can be used to find and retrieve the data.
- in practice, do we want separate URLs to retrieve the metadata for a coverage extract and the contents of the extract? Would be nice to be able to find out about it without having to download potentially a lot of data - and to know how big the data 'payload' will be
5.20 Standards or recommendations for spatial data on the Web should work well in machine to machine environments.
5.22 It should be possible to represent spatial extent directly bound to time, e.g. journey trajectories.
Notes: support for non-gridded coverages (point-cloud?)
5.31 It should be possible to describe the observed property represented by a coverage.
Notes: do we need to specify which vocabularies to use to do this?
[Maik]: Ideally yes, practically there are a lot of gaps and unknowns here, it's early times still.
5.38 It should be possible to attach the procedural description of a sensing method.
Notes: aspects of the SSN work will be relevant here.
5.45 Data should be streamable, a consumer should be able to do something meaningful before the end of the data message is received. This could be considered a general requirement for data on the Web, but it is recorded here because spatial data often consist of large chunks of data.
Notes: is this compatible with the compressible aspect? It should be: gzip/DEFLATE is itself a streaming format, so a response can be compressed and still consumed incrementally (eg HTTP chunked transfer combined with Content-Encoding: gzip).
[Maik]: I don't think this should be a requirement. GeoJSON is not streamable, and yet it is used everywhere successfully. If a data format comes along that supports it, fine, but it does complicate things and I don't see a point in saying that a format should be streamable otherwise it's not considered "good".
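On the interaction between 5.3 (compression) and 5.45 (streaming): the two are not mutually exclusive. A minimal sketch using Python's zlib (DEFLATE, the algorithm underlying gzip), decompressing a payload in small chunks to show that a consumer can start work before the whole message has arrived:

```python
import zlib

# Compress a payload, then decompress it incrementally in 64-byte chunks,
# showing that a consumer can use partial data before the end of the
# message is received - the same principle as HTTP chunked transfer
# combined with Content-Encoding: gzip. The payload is made-up CSV.
payload = b"lon,lat,value\n" + b"\n".join(
    b"%d,%d,%d" % (i % 360, i % 180, i) for i in range(1000)
)
compressed = zlib.compress(payload)

d = zlib.decompressobj()
recovered = b""
for i in range(0, len(compressed), 64):
    recovered += d.decompress(compressed[i:i + 64])  # partial data usable here
recovered += d.flush()

assert recovered == payload
print(len(payload), "->", len(compressed), "bytes")
```

Whether a given coverage *format* can be usefully parsed incrementally is a separate question (as Maik notes, GeoJSON cannot), but transport-level compression does not prevent streaming.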
5.50 Standards or recommendations for spatial data on the Web should support tiling (for raster and vector data). Tiling of spatial data can drastically improve the speed of data retrieval and allows having simple caches of data around the Web.
Notes: is our idea of coverage extracts sufficiently flexible to allow a tiling approach?
[Maik]: Tiling can happen on different levels. Either on the coverage level such that each tile is a complete independent coverage, or otherwise on the range level, such that the range of a coverage is tiled and split up across several resources. If you consider the first option, then tiles are just a structured set of extracts (defined in coordinate or index space), so I would say "extracts" are flexible enough to not forbid that. If you consider the second option then you probably have to talk about range extracts which we haven't defined yet (and those are likely defined in index space only).
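A tiling scheme of the first kind Maik describes - each tile an independent coverage extract - needs only a little index arithmetic. A hypothetical sketch for a fixed-size tiling of a regular lon/lat grid (all the numbers are illustrative, not a proposed scheme):

```python
# Tiles as a structured set of coverage extracts: tile (tx, ty) covers a
# bounding box computed from the grid origin, resolution, and tile size.
# All constants are illustrative.
ORIGIN = (-180.0, -90.0)  # lon, lat of grid origin
RES = 0.1                 # degrees per grid cell
TILE = 256                # grid cells per tile side

def tile_bbox(tx, ty):
    """Bounding box (west, south, east, north) of tile (tx, ty)."""
    west = ORIGIN[0] + tx * TILE * RES
    south = ORIGIN[1] + ty * TILE * RES
    return (west, south, west + TILE * RES, south + TILE * RES)

def tile_for(lon, lat):
    """Which tile contains the point (lon, lat)?"""
    tx = int((lon - ORIGIN[0]) // (TILE * RES))
    ty = int((lat - ORIGIN[1]) // (TILE * RES))
    return tx, ty

tx, ty = tile_for(3.2, 55.9)  # somewhere near Scotland
print(tx, ty, tile_bbox(tx, ty))
```

Because each tile's bbox is derivable from its indices, tile URLs can be minted deterministically and cached, which is where the speed benefit comes from.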
Additional requirements, not mentioned in UCR
- coverage data can sometimes be very large, hence the common grid-based approaches, where you define once the rules for turning the position of a data point in an array into a spatial location, meaning you don't have to attach spatial info to every point
- there are some efficient and easy to use solutions for gridded data that wouldn't work well for all coverage problems. Do we need two (or more) solutions? Perhaps a grid-based approach and also a flexible but verbose approach where space and time data is attached to every data point (or collections of variables that share the same space and time coords). If so, how much could those approaches share?
[Maik]: They could share how the observed properties (URIs...) and things like units and CRS are defined. So any global concepts that are worth sharing.
- we need to be able to deliver our extracts over the web, so making data volumes small where possible is desirable (though see above re compression) - and we have already decided that the ability to define an extract is at the heart of this spec, hence supporting various kinds of 'chunking' approaches.
- should we explicitly define a method for combining extracts into a larger one? - a 'de-chunking' algorithm?
[Maik]: About the last point, this is impossible I think without considering a specific format and extraction method. There could be extraction methods that split up polygons on borders, whereas others just include them in full even if they go beyond the border, replicating them in multiple extracts.
- do we assume that a URL for a coverage extract is an opaque identifier, or is it an API call - a pattern that a server can turn into a specification for generating the required extract? Either approach would meet the requirement that they would support linking to an extract. Might the API approach sometimes lead to impractically long URLs?
- or do we separate these issues - and assume that the identifier just identifies. A publisher could choose to provide an API to generate extracts and that API could return an identifier as well as the data, that could be used to refer to the extract in future (meaning that the publisher would have to keep track of generated identifiers or have a repeatable process for generating them from API parameters)
[Maik]: Both ways make sense.
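One 'repeatable process for generating identifiers from API parameters' (the second option above) is simply to canonicalise the parameters and hash them, so the same extract request always yields the same opaque identifier without the publisher keeping a registry. A hypothetical sketch (the URL pattern and parameter names are made up):

```python
import hashlib
import json

# Derive a stable, opaque extract identifier from API parameters by
# canonicalising them (sorted keys, fixed separators) and hashing.
# The parameter names and URL pattern are illustrative, not a proposal.
def extract_id(params):
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

params = {"bbox": [-3.3, 55.8, -3.0, 56.0], "time": "2015-06-01", "var": "sst"}
eid = extract_id(params)
print("http://example.org/extracts/" + eid)

# The same parameters in any order give the same identifier
assert extract_id({"var": "sst", "time": "2015-06-01",
                   "bbox": [-3.3, 55.8, -3.0, 56.0]}) == eid
```

This keeps identifier URLs short even when the generating API call would be long, at the cost of the identifier no longer being self-describing.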
Familiarity to web data users and web developers
- this is a non-functional requirement that is rather hard to define. But if we can follow existing commonly used design patterns where possible, it speeds up the process of learning. Which community's experience do we want to align with most? How quickly do common practices evolve?
- in a previous discussion the general feeling was that we should prioritise simplicity over full capability
- How do we try to make any recommendation as future-proof as possible? Which aspects of web-friendliness relate to current trends (eg format A is more popular than format B at the moment) and which relate to fundamentals that can be applied to future specific toolsets as they arise?
[Maik]: Hard, yes. It often depends just on how many tools/tutorials/... are available. Even if a format is more complicated this might be offset by other factors. E.g. if someone needs extreme efficiency then that person probably doesn't care that much about format complexity and may be happy to use e.g. a binary format like Google's protocol buffers. Always depends on the use case, who your end users are, whether your data is just ingested somewhere else and then offered in simpler/more complex formats... Maybe, something like, start with one of the simple JSON-based formats and if those are not enough, try a binary-based one or invent your own custom format if you really have to.
Play nicely with existing tools
- which tools?
- typical web clients:
- web browser
- software in arbitrary programming language retrieving data by HTTP
- compatibility with established GIS tools? need for conversion software?
Ease of implementation for a web publisher
- if we want people to adopt this, it has to be reasonably easy to get started
- simple essentials and fancier extensions?
- make it work at some level by just generating some files and sticking them on a web server?
Easy to carry out further processing and analysis
- the data format should make it easy for a user to 'unpack' and do more processing on
Considering these in more detail may lead to further specific requirements
Best Practice 1: Use globally unique HTTP identifiers for entity-level resources
Best Practice 2: Reuse existing (authoritative) identifiers when available
Best Practice 3: Convert or map dataset-scoped identifiers to URIs [ deep linking to a 'bit' of a coverage. - down to pixel level/grid point/obs level?]
Best Practice 4: Provide stable identifiers for Things (resources) that change over time
Best Practice 5: Provide identifiers for parts of larger information resources [note recommendation to avoid technology specific API access type URLs as persistent identifiers]
Best Practice 6: Provide a minimum set of information for your intended application
Best Practice 15: Describe sensor data processing workflows
Best Practice 16: Relate observation data to the real world
Best Practice 28: Expose entity-level data through 'convenience APIs'
Best Practice 29: APIs should be self-describing
Best Practice 30: Include search capability in your data access API
Questions to discuss
I've included a number of questions above in the context of individual requirements. Later we might want to distil or summarise the most important ones and put them in this section.
Miscellaneous notes and things not to forget but not yet put in order
'magic' label for 'latest' eg time=latest
format for metadata? JSON-LD?
vocab(s) for metadata? DCAT, PROV-O, custom...
info on what the range means - eg an integer radiance, reflectivity, or ground elevation in some coord system
- how does an integer scale map to physical values? do we specify this every time, point to one of a subset of common options, or to a document that explains it?
- allow for delivering computed/processed data - so could eg deliver a land cover classification derived from raw data
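On mapping an integer scale to physical values: the common convention (as in NetCDF's scale_factor / add_offset attributes) is a linear transform declared once in metadata. A minimal sketch - the particular numbers and the fill-value handling are illustrative:

```python
# Linear packing of physical values into integers, as commonly declared
# via scale_factor / add_offset metadata (NetCDF convention). The
# constants chosen here are illustrative.
SCALE_FACTOR = 0.01  # kelvin per integer count
ADD_OFFSET = 273.15  # kelvin at count 0
FILL_VALUE = -32768  # sentinel for missing data

def unpack(count):
    """Integer count -> physical value (or None for missing data)."""
    if count == FILL_VALUE:
        return None
    return count * SCALE_FACTOR + ADD_OFFSET

def pack(value):
    """Physical value -> integer count."""
    return round((value - ADD_OFFSET) / SCALE_FACTOR)

assert unpack(FILL_VALUE) is None
print(unpack(1540))  # about 288.55 K
```

Whether we specify such a transform inline every time, or point to a shared definition, is exactly the question raised above.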
SSL? optional but recommended
authentication - up to publisher - any standard method? (what's a 'standard' in this regard - list standard methods?)
note that a user might want a time series of data in a spatial bounding box. If raw data is stored by the publisher as a collection of precomputed files (eg images), that might mean making extracts of lots of images and bolting them together. So the simple 'just prepare and serve some files' option might have to be balanced up against compactness, time as a first-class dimension etc.
- support both precalc'd and stored extracts vs on the fly generated extracts. Could have similar/same URLs but different discoverability