
Notes from the ANU Team


Meeting on 27/07/16

Since May, we've been working on implementing the concept we previously presented to the group in a scalable manner. In particular, our proof-of-concept pipeline had an RDF materialisation step which made it impractical for large datasets (e.g. satellite coverages), since materialising an entire data cube of such observations takes a prohibitive amount of space. We have been replacing that step with a dynamic generation approach which produces triples on the fly in response to SPARQL queries.

Given the recent interest on the mailing list in using the RDF Data Cube vocabulary (QB) (large thread starts here), it might be helpful for us to explain how we handle some of the QB-related issues which have been brought up:

  • Dicing: We have not found it necessary to use QB slices, nor to extend QB with the ability to explicitly represent diced subcubes. Instead, we perform dicing in SPARQL: for example, one could find all tiles in a small region around Null Island using SELECT ?imageData ?lat ?long {?obs a qb:Observation; led:imageData ?imageData; led:location [geo:lat ?lat; geo:long ?long]. FILTER(-1 < ?lat && ?lat < 1 && -1 < ?long && ?long < 1)}. The ultimate goal of the backend we are developing is to support GeoSPARQL queries, which should provide extremely flexible dicing in the spatial domain (see the sketch after this list). For most other dimensions of interest (that is, those whose rdfs:ranges SPARQL knows how to compare), dicing is supported out of the box by SPARQL filters.
    • If required, ordering on axes can generally be implemented with a SPARQL ORDER BY. However, since the data is delivered as a cloud of points anyway (with all required metadata attached to each observation), ordering has not been a problem for the client application we've been developing.
  • Units: Units are either left implicit or attached directly to the triples in each observation, using the most suitable vocabulary we could find. Our backend does not perform any inference to convert between units.
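
As a sketch of the GeoSPARQL direction mentioned above, the same Null Island dice could eventually be written against geometries rather than raw lat/long pairs. Nothing below is implemented yet, and the led: namespace and the geometry modelling are assumptions for illustration only:

  PREFIX qb:   <http://purl.org/linked-data/cube#>
  PREFIX led:  <http://example.org/led#>    # placeholder; see our ontology repo for the real namespace
  PREFIX gsp:  <http://www.opengis.net/ont/geosparql#>
  PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

  SELECT ?imageData
  WHERE {
    ?obs a qb:Observation ;
         led:imageData ?imageData ;
         gsp:hasGeometry/gsp:asWKT ?wkt .
    # Keep observations whose footprint lies in a 2x2 degree box around Null Island
    FILTER(geof:sfWithin(?wkt,
        "POLYGON((-1 -1, 1 -1, 1 1, -1 1, -1 -1))"^^gsp:wktLiteral))
  }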

For reference, a hypothetical satellite data coverage is available in one of our GitHub repos.
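
To give a flavour of what a single observation looks like in that style, including how a unit can be attached inside each observation, here is a rough Turtle sketch. The led:band, led:resolution and unit modelling shown here are illustrative only; the ontology repo has the actual terms:

  @prefix qb:   <http://purl.org/linked-data/cube#> .
  @prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .
  @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix unit: <http://qudt.org/vocab/unit#> .
  @prefix led:  <http://example.org/led#> .    # placeholder namespace

  <#obs42> a qb:Observation ;
      led:imageData  <#tile42> ;                                # dummy link, as in the demo
      led:band       "band4" ;                                  # hypothetical band label
      led:resolution [ rdf:value 25 ; led:unit unit:Meter ] ;   # unit attached per observation
      led:location   [ geo:lat -35.28 ; geo:long 149.13 ] .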

Meeting on 18/05/16

Overview of the current pipeline:

We use a small subset of the geospatial Landsat data from the Australian datacube. In the current demo, the data is manually loaded into Jena; we are working on dynamic loading, i.e. loading images into RDF as needed. The data is converted into RDF using the ontology described at https://github.com/ANU-Linked-Earth-Data/ontology. The client app then queries this dataset using standard SPARQL and gets a JSON response; a RESTful API is planned for the future. The results are overlaid onto a Leaflet map. Currently the user can change the date being displayed. We had intended to use Jena’s built-in geospatial queries to restrict searches by space, but the query speed has made this impractical for the moment. Metadata about the selected section is available, including a “link” to the data itself (currently a dummy link); it also displays the band, location, resolution and so on, so with more data it would be possible to filter on these dimensions too.
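
For illustration, a date-restricted query of the kind the client issues might look roughly like this. The led:time property and the led: namespace are assumptions for the sketch; the ontology repo has the actual terms:

  PREFIX qb:  <http://purl.org/linked-data/cube#>
  PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
  PREFIX led: <http://example.org/led#>    # placeholder namespace

  SELECT ?imageData ?lat ?long
  WHERE {
    ?obs a qb:Observation ;
         led:imageData ?imageData ;
         led:time ?time ;                                # hypothetical temporal property
         led:location [ geo:lat ?lat ; geo:long ?long ] .
    FILTER(?time = "2016-05-18"^^xsd:date)               # the date picked in the UI
  }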

An example .ttl file is available in the GitHub repo.

Stuff we might be able to mention from our meeting with Rob:

  • The smarter the server, the better.
    • Taking a temporal slice along a spatial line can be useful for some things. For instance, Rob once defined a line down the middle of Lake Burley Griffin and then took a temporal slice to observe some effect associated with El Niño.
    • “Give me 30 years of summer above the Tropic of Capricorn” is one example of that. There, you’re using contextual information like “when is summer?” “how is summer defined above the Tropic of Capricorn?”, etc. It also hints at a requirement for flexible date schemes—being able to say “I want every second month”, “I want to see surface reflectance on the equinox each year”, etc. is handy.
    • Declarative mathematics for interpolation is another one. Apparently, being able to write “I want 30% Landsat band 4, 45% Landsat band 5 and 25% Landsat band 7” makes development a lot faster (see the sketch after this list).
    • Knowing what constitutes “red”, “green”, “near infrared”, etc. for a given satellite is also handy. That way you don’t have to come up with interpolation formulas yourself.
    • Although it wasn’t mentioned, the web map server’s smarts in this domain are quite useful: it has a whole lot of parameters for specifying colour, which helps users who might otherwise be confused by a grayscale (i.e. intensity) dataset.
    • Even if we can’t build that intelligence into the server, a lot of the information required to implement those functions is available on the web, and it’s possible that some sort of linked data approach might make it easier to write fancy queries.
  • Power users hate resampling. This conflicts with the preferences of some scientists (e.g. the Fenner School people I listened to) who just want to be able to see easy-to-interpret numbers for humanly meaningful regions/time periods, and don’t care how they’re produced.
  • Smarts are good, but they need to be implemented dynamically. If you have 4PB of Landsat data and then make a new infilled version then you’ll end up with 8PB of data. You then have to keep that 8PB forever, since eventually someone will ask for the specific version of data they used for some particular scientific finding. The inevitable result is that you fill up petabytes and petabytes of space with redundant junk.
  • Related requirement: the ability to see how data was transformed to get to the client. Being able to encode the different projections, colour choices, resampling schemes, original data, etc. might be a job for the ontology. I think others have called this “provenance”.
  • Synthetic aperture radar doesn’t really have a canonical representation. It certainly can’t be represented as ordinary images in most cases. Not sure what the upshot is here, but it might be something like “flexible ndarray-type formats need to be supported in case new instruments come along”.
  • There are more and more satellites going up every day. Often new satellites complement old ones by providing similar instruments but having different orbits or sampling rates. Being able to process huge amounts of data from different sources quickly can enable some cool real-time applications (e.g. bushfire tracking). However, developers can run into a lot of problems interpreting the data (what if there’s a bushfire down the road and you don’t know what an “ETM+” is?) and mixing/matching data between satellites (which all have different kinds of sensors, of course).
    • Rob didn’t mention it, but after our meeting with Ed I suspect that discoverability is a big deal here. Being able to search for all of the satellite data repositories which give near infrared data (for example) would be really helpful for data fusion and real-time mapping.
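
As a sketch of what the “declarative mathematics” point above could look like in our setting, SPARQL can already express simple band mixes with BIND. The led:band4/led:band5/led:band7 properties are hypothetical, for illustration only:

  PREFIX qb:  <http://purl.org/linked-data/cube#>
  PREFIX led: <http://example.org/led#>    # placeholder namespace

  SELECT ?obs ?mix
  WHERE {
    ?obs a qb:Observation ;
         led:band4 ?b4 ;
         led:band5 ?b5 ;
         led:band7 ?b7 .
    # Rob's example: 30% band 4, 45% band 5, 25% band 7
    BIND(0.30*?b4 + 0.45*?b5 + 0.25*?b7 AS ?mix)
  }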

Other things from our meeting with Ed:

  • The main value of linked data, from Ed’s PoV, is that you can embed your data in your webpage instead of needing one interface for machines and another for humans.
  • Linked data is a boon for webmasters because it allows them to boost the visibility of their sites in search engines. Not sure whether something similar could hold for coverages.
  • “Semantics” are good.
  • Searchability (crawlability? Discoverability?) is good.
  • Microdata over RDF. RESTful APIs over SPARQL. Introducing new abstractions just puts one more unwelcome step between a developer and the data they want.
  • Google (and probably other companies) doesn’t see huge demand for coverages at the moment (perhaps this explains why WMS is used over WCS on THREDDS?).
  • You definitely need to be able to work above the pixel level. That’s not possible with current standards like WMS and WCS, since those all give you huge ndarrays of data to work with.
  • “Killer app” might be minerals prospecting or something (that was an off-hand comment, and I’m not sure how seriously to take it).

Things from meeting with Fenner School of Environmental Science