Identifying coverage subsets

I notice that one of the issues discussed relates to identifiers for coverage subsets (and the relationship with queries ...). 
There is an RDA recommendation on 'Data Citation' which covers some of this area. 
It is primarily expressed as a series of requirements, rather than a solution, so may be a useful checklist here. 
See attached paper. 

Unfortunately the RDA website is down at the moment so I can't check the direct links to their outputs, but I think these will work when it is back up again: 

https://www.rd-alliance.org/group/data-citation-wg.html 
https://rd-alliance.org/system/files/documents/RDA-DC-Recommendations_150924.pdf 


-----Original Message-----
From: Phil Archer [mailto:phila@w3.org] 
Sent: Thursday, 24 March 2016 8:46 AM
To: SDW WG Public List <public-sdw-wg@w3.org>
Subject: [Minutes-Cov] 2016-03-23

The minutes of today's Coverages sub group meeting are at https://www.w3.org/2016/03/23-sdwcov-minutes and copied as text below.

We were joined on this occasion by Bernadette Loscio and Newton Calegari, 2 of the editors of the DWBP doc, to talk about subsetting.


...

    <billroberts>
    [18]https://www.w3.org/2015/spatial/wiki/Coverage_UCR_notes


      [18] https://www.w3.org/2015/spatial/wiki/Coverage_UCR_notes


    billroberts: at bottom in summary see that subsetting came out
    a lot
    ... assign an identifier to a subset of a coverage of a dataset
    ... also for provenance so you can point to how the processing
    happened
    ... the question of delivering a full coverage is a special
    case of delivering a subset -- if we address addressing and
    formatting it will be solved
    ... also some use cases for point cloud and time series --
    need to keep these in mind
    ... also note that the region of interest might be complicated,
    not just a bounding box, may be polygon or tunnel underground
    ... any comments?

    jtandy: they are the things I can recall

    phila: note the way subsetting tumbles out because we are
    struggling in dwbp to say something that is *not*
    spatially-specific
    ... dwbp does not have good use cases for this

    billroberts: also we have time subsets and variable subsets

    +q

    <Zakim> jtandy, you wanted to query predefined subsets or
    on-the-fly query

    <eparsons> jtandy

    jtandy: we had a long email thread on subsetting for BP
    ... one kind is subsetting for useful chunks to be manageable
    (a predefined set)
    ... other kind is an on-the-fly query chunk
    ... we need both
    ... rdf datacube does predefined type but not query type

    billroberts: datacube can be used for query-type but perhaps
    less flexible

    jtandy: when i assign an identifier to a subset it could be
    anything
    ... but a query type identifier is also an api, effectively

    <phila> kerry: I hate us calling it subsetting given all the
    different dimensions that we need to talk about

    kerry: does not like "subsetting"

    <phila> Discussion between phila and kerry about whether
    audience for Coverages doc is only spatial folks

    <phila> kerry: How about 'sub coverage?'

    <phila> billroberts: That makes sense to me

    <phila> phila: Doesn't like 'sub coverage'

    <scribe> ACTION: kerry to present some suggestions for renaming
    "subsetting" [recorded in
    [19]http://www.w3.org/2016/03/23-sdwcov-minutes.html#action01]

      [19] http://www.w3.org/2016/03/23-sdwcov-minutes.html#action01

    <trackbot> Created ACTION-152 - Present some suggestions for
    renaming "subsetting" [on Kerry Taylor - due 2016-03-30].

    <BernadetteLoscio> yes!

    <billroberts>
    [20]http://w3c.github.io/dwbp/bp.html#EnableDataSubsetting


      [20] http://w3c.github.io/dwbp/bp.html#EnableDataSubsetting

DWBP subsetting

    BernadetteLoscio: we have a proposal as in the irc, but it is
    difficult to test
    ... it is generic and important but there are different
    approaches
    ... e.g. apis, queries
    ... we are not sure whether we should have this as a bp or to
    just describe it
    ... what would be helpful to you and how would it be testable?

    <Zakim> jtandy, you wanted to ask if you could cover
    'subsetting' as an example operation in your API

    jtandy: when I look at subsetting I think it is one example of
    the way you could work with data... there are other BP about
    offering an API in DWBP
    ... data subsetting makes a lot of sense for slices for
    statistical, etc, but when I look more generically it is really
    just an operation you provide thru an API
    ... could be just an illustrative example
    ... but it makes a lot of sense for time series and satellite
    data (somehow differently)

    bernadette: should we also talk about subsetting for download?

    jtandy: you should also talk about the data you take away after
    downloading
    ... I would suggest when working with large datasets a typical
    use case would be an api to select parts of that dataset
    ... difficult for you to reference what we do, but I suggest
    just describe an illustrative example of a convenience API

    BernadetteLoscio: perhaps we can talk about subsetting along
    with downloads as another example

    billroberts: the problem with api/query is that it is futile to
    specify upfront what it should look like in general
    ... maybe all we can do is say "you need an API" or else we end
    up inventing yet another query language
    ... needs to be up to the data provider

    jtandy: agrees

    phila: <moved us with his absent speech>

    <phila> phila: Requirement no. 1 can assign an identifier to a
    subset of a coverage dataset

    phila: we have been saying "you just give it a uri", although a
    uri *is* an api
    ... for bulk download is it useful to say you can use the api
    and you can give it an example of its own, e.g. meteorological
    data for the last week
    ... should this go in dwbp or sdw?
    ... should dwbp do this ... your first ucr says you need to
    assign an identifier to a subset

    billroberts: yes it would be useful

    jtandy: it makes sense for dwbp to provide some advice -- if
    you have data that is too big for a web application then
    provide a mechanism to get hold of bits of it
    ... eg. using predefined slices or an API
    ... test by "here is a massive dataset -- can you work with it
    in a browser app?"

    billroberts: use cases where this emerged was wanting to attach
    some metadata to it, something that is the full set, not a
    subset

    <phila> is that helpful newton_dwbp?

    <newton_dwbp> I liked jtandy point

    billroberts: need to look again at email thread on this, any
    other comments?

    BernadetteLoscio: we like jtandy's idea and will bring to our
    dwbp discussion. thank you very much

RDF datacube action

    billroberts: which aspects of rdf datacube would be good for
    defining subsets?

    <Zakim> jtandy, you wanted to note qb:slice

    billroberts: bill will write note on pros and cons of datacube
    and mechanisms that would be helpful for subsets

    dmitrybrizhinev: ... we are a group of students working on an
    example implementation for coverages, we are worried about
    verbosity of datacube
    ... flipside is taking a subset with lots of granularity with a
    sparql query is useful but verbose
    ... i have been converting the coveragesjson to rdf but this is
    the query... is there a best of both worlds

    billroberts: please share anything written up

    jtandy: agree about way too verbose, jonblower keeps saying
    this cannot be used to carry the data, but the metadata might
    be useful

    <jtandy> [21]https://www.w3.org/TR/vocab-data-cube/#slices

      [21] https://www.w3.org/TR/vocab-data-cube/#slices


    jtandy: for describing subsets there is qb:slice and also a
    mechanism for creating arbitrary groups in the spec
    ... leaving the data in a dense array is arguably no different
    to the way we deal with geospatial stuff all the time, eg
    geometry objects in WKT or in GML
    ... because we want to treat the whole geometry as an object
    (we don't break it up), the same can apply to a dense array of
    data
    ... in the same way that geosparql can provide operations on
    data, when we are working with coverage data in a webby form we
    need to provide some additional mechanism for querying inside

    billroberts: e.g. 75th point of array needs to be accessible,
    and you need some coordinates that stick with the points...
    that kind of conciseness is needed for whole grid but when
    there are only bits it may work well
    ... datacube could work well itself for a small subset if not
    the entire grid

    jtandy: if you just want ith column and jth row ...

    <phila> kerry: were you suggesting, Bill, that the QB model
    could be used as a response format for a query over a bigger
    set

    <phila> billroberts: Not precisely, but that structure of an
    observation

    <phila> ... If you just have one data point, you need all the
    dimensional info and the metadata. Some metadata applies to the
    whole dataset, some to a specific point.

    <phila> ... If you have a grid, you don't need all the coords
    'cos you can work them out but a point cloud does need them.
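The grid versus point cloud distinction above can be illustrated with a minimal sketch. It assumes a regular axis is fully described by an origin, a step and a size; the class name, field names and sample values are hypothetical and not taken from any specification discussed in the meeting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegularAxis:
    """One axis of a regular grid (hypothetical structure for illustration)."""
    origin: float  # coordinate of the cell at index 0
    step: float    # constant spacing between neighbouring cells
    size: int      # number of cells along this axis

    def coordinate(self, index: int) -> float:
        """Work out the coordinate of a cell from its index alone."""
        if not 0 <= index < self.size:
            raise IndexError(f"index {index} outside axis of size {self.size}")
        return self.origin + index * self.step


# Assumed example grid: two short axis definitions describe every cell.
lon = RegularAxis(origin=-10.0, step=0.5, size=40)
lat = RegularAxis(origin=50.0, step=0.25, size=20)


def grid_point(i: int, j: int) -> tuple[float, float]:
    """Coordinates of grid cell (i, j), derived rather than stored."""
    return (lon.coordinate(i), lat.coordinate(j))


# A point cloud has no such rule, so each point must carry its own
# coordinates alongside its value (made-up sample values).
point_cloud = [
    {"lon": -9.73, "lat": 50.12, "value": 3.4},
    {"lon": -8.01, "lat": 51.90, "value": 2.9},
]
```

For example, `grid_point(2, 4)` is computed from the two axis definitions, whereas each entry in `point_cloud` has to store its position explicitly.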

    billroberts: the structure of an observation is very useful for
    datacube way

    jtandy: index space querying, natural coord subsetting, more
    work to do here...
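One way to relate the two kinds of subsetting mentioned here is to translate a request in natural coordinates into a range in index space. The following is only a sketch for a single regular axis with a positive step; the function name and parameters are assumptions made for illustration.

```python
import math


def index_range(origin: float, step: float, size: int,
                lo: float, hi: float) -> tuple[int, int]:
    """Return (start, stop) so that cells start..stop-1 lie within [lo, hi].

    Assumes a regular axis with step > 0, where cell k sits at
    origin + k * step. An empty selection returns start == stop.
    """
    if step <= 0:
        raise ValueError("this sketch only handles a positive step")
    # First cell at or above the lower bound, clamped to the axis.
    start = min(size, max(0, math.ceil((lo - origin) / step)))
    # One past the last cell at or below the upper bound, clamped to the axis.
    stop = min(size, math.floor((hi - origin) / step) + 1)
    return (start, max(start, stop))


# Usage: pick the cells of a 1-D series between two coordinates.
values = list(range(40))  # stand-in data, one value per cell
start, stop = index_range(origin=-10.0, step=0.5, size=40, lo=-9.2, hi=-7.9)
subset = values[start:stop]
```

The resulting index pair is the kind of request an index-based service would accept, while the coordinate bounds are what a user is more likely to ask for.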

    phila: what proportion of coverage data is on a regular grid?
    ... I am thinking of those with only 2 or 3 lines with regular
    definition and you can work the rest out
    ... in such cases a template uri could be generated that does
    identify a "slice"
    ... so we could say "if you have a regular grid pattern this is
    how you generate the uri template"
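As a rough illustration of the template idea, the sketch below builds an identifier for a rectangular slice of a regular grid from index ranges. The URI pattern, the example.org host, the coverage name and the function name are all hypothetical; the group did not agree on any particular template.

```python
# Hypothetical URI pattern: not defined by any specification.
TEMPLATE = (
    "https://example.org/coverages/{coverage}"
    "/slice/{i_start}-{i_stop}/{j_start}-{j_stop}"
)


def slice_uri(coverage: str, shape: tuple[int, int],
              i_range: tuple[int, int], j_range: tuple[int, int]) -> str:
    """Build an identifier for a slice given half-open index ranges.

    `shape` is (rows, columns) of the full grid; each range is
    (start, stop) with start inclusive and stop exclusive.
    """
    rows, cols = shape
    (i0, i1), (j0, j1) = i_range, j_range
    if not (0 <= i0 < i1 <= rows and 0 <= j0 < j1 <= cols):
        raise ValueError("slice falls outside the grid")
    return TEMPLATE.format(
        coverage=coverage,
        i_start=i0, i_stop=i1,
        j_start=j0, j_stop=j1,
    )


# Usage with an assumed 180 x 360 grid.
uri = slice_uri("sst-2016", (180, 360), (10, 20), (0, 90))
```

Because the identifier is generated from the grid definition, the full coverage is just the special case where the ranges span the whole shape.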

    <Zakim> jtandy, you wanted to respond to phila's question about
    regular grids

    jtandy: yes it is a large fraction by volume and number of
    datasets, eg satellite imagery,
    ... but there are other important cases such as in-situ
    observations by radiosondes or buoys or gliders irregular
    coverages happen more (like opendap/netcdf index-based
    subsetting)

    eparsons: agrees. my meta-question is, where is the stuff
    with a more semantic approach -- are we just reinventing the
    wheel of tools in other places?

    jtandy: phil had said data is easy, metadata is challenge. the
    metadata is the bit to get the advantage of linked data such as
    what you are measuring etc
    ... metadata as linked data is key, then something else for
    dense arrays of data

    eparsons: so let's not get worked up on data size then as there
    are other approaches

    billroberts: index array of data is good approach for some
    stuff, but we need to think harder about others

Jeremy Stole our time slot

    billroberts: jeremy stole our timeslot
    ... we had been proposing to follow the main group for time
    changes

...

Received on Wednesday, 23 March 2016 22:19:05 UTC