DXWG DCAT Working Session teleconference 05 February 2019 - Distributions

Meeting minutes

<riccardoAlbertoni> hello!

https://docs.google.com/document/d/18tFkR3PP7DECjBnQjsIf0_XD8i51-UTIBPuCMtQUXyw/edit?usp=sharing

<Zakim> AndreaPerego, you wanted to say we should scribe as for any other meeting

<alejandra> I think we need to first discuss the set of issues to discuss within the broad topic of distributions

<alejandra> I created two github projects: one with issues related to distribution definition and relationship to profiles and the second one with Makx's set of issues related to packaging and file composition of distributions

<alejandra> then I suggest we use IRC as usual and the google doc for collaborative editing of the text once we reach some conclusions

Links to the projects with relevant issues:

<SimonCox> https://github.com/w3c/dxwg/projects/8

<SimonCox> https://github.com/w3c/dxwg/projects/6

<alejandra> +1

Focus on definition first

… i.e. https://github.com/w3c/dxwg/projects/8

… which has had some recent discussion

SimonCox: Original definition had this concept of a downloadable thing/file
… introduction of dataservices tightened up the definition of distribution
… with the idea that it's a representation in REST terms

<alejandra> The issue we are discussing is: https://github.com/w3c/dxwg/issues/317

SimonCox: so what is the notion of distribution for?

<Zakim> alejandra, you wanted to mention Clement's comment about profiles and distributions

<alejandra> https://github.com/w3c/dxwg/issues/531

alejandra: Rob's original issue has an echo in Clemens's question and example
… how do profiles and different distributions for the same dataset interact?

<alejandra> https://github.com/w3c/dxwg/issues/411

<Zakim> DaveBrowning, you wanted to ask what kind of differences would be acceptable

SimonCox: Unlikely to have a very hard and fast decision - it will be domain (and publisher) dependent

AndreaPerego: Lots of variation in the industry
… there doesn't seem to be a defining rule
… informationally equivalent is too hard a rule to be applied in every situation

<PWinstanley> although we are saying that we leave it to the providers, what mechanism are we providing for them to actively say that they are informationally equivalent (or not). I'm thinking here about datasets that might be too large for people to examine in detail

AndreaPerego: give examples of how things can be done

alejandra: including any hard rule won't help

alejandra: including additional files and information looks more valuable

riccardoAlbertoni: if we had examples, then we could make it clear where informationally equivalent wouldn't be useful

<alejandra> I think we need to consider the support for associating files to distributions

<alejandra> I think we need to move away from 'informationally equivalent' goal

AndreaPerego: we should also try to understand what the definition is for - most data providers have a strong view...
… sometimes they will use non-equivalent distributions, sometimes equivalent ones. This should be informative guidance
… we can show how people use it
… there is no right and wrong
… shouldn't over-harmonize

alejandra: Suggest we avoid informationally equivalent...
… examples only talk of formats, but we need to acknowledge other ways that distributions might differ
… suggest we vote...

<Zakim> DaveBrowning, you wanted to ask about alignment with services

<alejandra> DaveBrowning: distribution concept is the access information

<alejandra> ... and there is no guarantee that what you are going to find is equivalent

<alejandra> ... you can describe it in a more precise way but you have to add all of that

Makx: Should keep the definition succinct
… it gives access to a file that gives data for the dataset

<PWinstanley> q+ to ask about machines, rather than people, and how they might 'decide' between what to select when there are choices of the same dataset

Makx: but examples are good

alejandra: distribution is the representation of the data, yes
… but the definition only gives examples that differ by format - in practice we see other differences

+1 to no more info equivalence...

<PWinstanley> Distribution is the data, there might be choices, one might be uncleaned (full of duplicates) and the other is cleaned up. Is there going to be any mechanism for a machine to know that there is the same information in the two?

… but we shouldn't talk about informational equivalence

<PWinstanley> ok

<Makx> +1 to alejandra

<alejandra> proposed: we won't require different distributions to be informationally equivalent and leave this as a judgement call for the data providers

<AndreaPerego> +1

<PWinstanley> +1

<alejandra> +1

<Makx> +1

<SimonCox> +1

Resolved: we won't require different distributions to be informationally equivalent and leave this as a judgement call for the data providers

DaveBrowning: The second part you were mentioning, alejandra, you said we should acknowledge the different examples of distributions, right?

alejandra: Yep. I can write a proposal and see if people agree.

<alejandra> proposed: to add to the dcat:Distribution definition a mention that distributions may differ in various ways, including natural language, media type or format, schematic organization, temporal and spatial resolution, level of detail ...

<riccardoAlbertoni> +1

alejandra: By copying SimonCox's text :)

<DaveBrowning> +1

<PWinstanley> +1

<alejandra> +1

<SimonCox> +1

<alejandra> the current definition is: “A specific representation of a dataset. A dataset might be available in several different forms, and these forms might comprise both different serializations or different schematic arrangements of the same data. Examples of distributions include a CSV file, a netCDF file, a JSON document, or a data-cube.”

<Makx> 0

Resolved: to add to the dcat:Distribution definition a mention that distributions may differ in various ways, including natural language, media type or format, schematic organization, temporal and spatial resolution, level of detail ...
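
The resolution above lists several ways in which distributions of one dataset may differ. As a minimal sketch only, the following Python models two distributions and reports the aspects on which they differ. The class, its field names, and the example URLs are hypothetical illustrations and are not DCAT terms or text agreed in this meeting.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class Distribution:
    """Illustrative stand-in for a dcat:Distribution (field names are hypothetical)."""
    access_url: str
    media_type: Optional[str] = None            # media type or format, e.g. "text/csv"
    language: Optional[str] = None              # natural language, e.g. "en"
    schema: Optional[str] = None                # schematic organization
    spatial_resolution_m: Optional[float] = None
    temporal_resolution: Optional[str] = None   # e.g. an ISO 8601 duration such as "P1D"


def differing_aspects(a: Distribution, b: Distribution) -> List[str]:
    """Return the descriptive fields (everything except access_url) on which a and b differ."""
    return [
        f.name
        for f in fields(Distribution)
        if f.name != "access_url" and getattr(a, f.name) != getattr(b, f.name)
    ]


# Two hypothetical distributions of the same dataset.
csv_en = Distribution(
    "https://example.org/data.csv",
    media_type="text/csv", language="en", spatial_resolution_m=10.0,
)
json_it = Distribution(
    "https://example.org/data.json",
    media_type="application/json", language="it", spatial_resolution_m=100.0,
)

print(differing_aspects(csv_en, json_it))
```

Nothing in this sketch decides whether the two distributions are informationally equivalent; in line with the earlier resolution, that remains a judgement call for the data provider.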

<alejandra> maybe let's check again: https://github.com/w3c/dxwg/issues/411

DaveBrowning: Any other aspects we should discuss?

<alejandra> and see if we can close it after addressing this?

alejandra: Have we also addressed the long discussion on informational equivalence in https://github.com/w3c/dxwg/issues/411 ?

DaveBrowning: Overall, yes, IMO the main concerns are addressed by the resolution.
… So let's move to the other aspects we mentioned.
… Let's wait to close #411 until after we have revised the DCAT spec as decided.

DaveBrowning: Should we look at the other issues in sprint 1?

alejandra: There was a lot of discussion on the profile one, so it would be sensible to see if we can reach some agreement on it.

<Makx> +1

<DaveBrowning> https://github.com/w3c/dxwg/issues/531

<SimonCox> +1

<DaveBrowning> +1

<riccardoAlbertoni> +1

<PWinstanley> +1

<alejandra> most of the discussion happened here: https://github.com/w3c/dxwg/issues/317

<alejandra> but 531 is related

[all]: They look like the same issue.

alejandra: SimonCox, you say the one to be closed should be Clemens's one, right?

SimonCox: Yes, saying that this is addressed by the decision on the other one.

alejandra: Actually, Clemens refers to informational equivalence, but about profiles. So, do we need to distinguish between profiles and distributions? Or am I misunderstanding Clemens's point?

Makx: Yes, Clemens mentions profiles, but it is actually about different distributions (in different profiles which are not informationally equivalent).

alejandra: So we can add profiles in the Google doc as one of the examples of how distributions can be different.

<riccardoAlbertoni> +1 to add "profile" on the list of the possible variations

DaveBrowning: So this looks like we can close #317

<alejandra> Moved https://github.com/w3c/dxwg/issues/531 to 'In progress'

DaveBrowning: We still have citations and distributions.

<PWinstanley> yes

alejandra: I think what we need to do is the same as we did for datasets. So, what you need to cite is a persistent identifier, and what is cited should have bibliographic metadata (such as authors, publisher, publication year).

alejandra: Can we assume that the authors of the dataset are the authors of the distribution?

<alejandra> all was considered here: https://github.com/w3c/dxwg/issues/61

<alejandra> data citation principles: https://www.force11.org/datacitationprinciples

[all]: [discussing data citation approaches]

<PWinstanley> +1 to Makx , AndreaPerego , SimonCox

<alejandra> fine by me, I agree with Andrea

<riccardoAlbertoni> +1 to leaving the citation on the dataset

<DaveBrowning> +1 to not duplicating

SimonCox: I'm concerned that we are making things (datasets & distributions) too similar intensionally.

Makx: Same concern.

<PWinstanley> keep to the profile

DaveBrowning: So we need to reply to Annette.
… Who wants to do that?

SimonCox: I can do that.

<alejandra> I can revise the whole https://github.com/w3c/dxwg/issues/411 to see if there is anything missing or it is addressed

DaveBrowning: About https://github.com/w3c/dxwg/issues/411 , what did we decide?

alejandra: I think we decided we have to review it to see if the resolution on informational equivalence is enough to address it, or there's also something else.

alejandra: I can take care of that.

<DaveBrowning> https://github.com/w3c/dxwg/projects/6

DaveBrowning: Should we look at these issues ^^ ?

Makx: Last week I gave a summary of what a resolution on this could be.
… We could go through them.

<riccardoAlbertoni> +1 to adding the two properties for compressed distributions

DaveBrowning: Which of these are not backward compatible?

Makx: There may be implementations out there that are using mediatype for that, since there was no other option.

DaveBrowning: Seems that there's no objection to your proposal, Makx.
… So please go ahead and make a proposal.

alejandra: Can you submit it as a PR?

Makx: I'll try.
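
Makx's remark that some implementations put the compression format in the media type, for lack of another option, can be illustrated with a small sketch. The minutes do not name the two proposed properties, so `compress_format` and `package_format` below are assumed, hypothetical names, and the code is only one possible reading of the idea, not the proposal itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DistributionFormat:
    """Hypothetical format description for a distribution (not agreed DCAT terms)."""
    media_type: str                         # format of the data itself
    compress_format: Optional[str] = None   # assumed name: how the file is compressed
    package_format: Optional[str] = None    # assumed name: how files are bundled together


def unpacking_steps(fmt: DistributionFormat) -> List[str]:
    """List the steps a client would take, outermost layer first."""
    steps = []
    if fmt.compress_format:
        steps.append(f"decompress ({fmt.compress_format})")
    if fmt.package_format:
        steps.append(f"unpack ({fmt.package_format})")
    steps.append(f"parse ({fmt.media_type})")
    return steps


# Overloaded media type: the client cannot tell what is inside the gzip file.
overloaded = DistributionFormat(media_type="application/gzip")

# Separate fields: the data format stays visible alongside compression and packaging.
explicit = DistributionFormat(
    media_type="text/csv",
    compress_format="application/gzip",
    package_format="application/x-tar",
)

print(unpacking_steps(overloaded))
print(unpacking_steps(explicit))
```

Under these assumptions, keeping the three aspects in separate fields lets a client see the data format without first downloading and opening the archive, which is the information lost when the media type is reused for compression.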

DaveBrowning: So, what do we do now?

alejandra: Maybe we plan the next sprint.

DaveBrowning: I think another key issue is the one about versioning.
… So, this can be a sensible discussion.

<Makx> qq+

SimonCox: We have 2 options: do a lot of work, or realise we cannot do all that work, and try to address the issues to the best extent we can.

<Makx> qq+

DaveBrowning: It may also be the case that we decide that all these should go into a guidance document.

SimonCox: Would you prepare a number of options on how we can address this?

Makx: We are actually not starting from scratch on versioning. I would support SimonCox's proposal not to do much. But we can pick up what we did before, at the beginning of the WG.
… I don't think we should define what versioning is, but rather, in case you have versioned data, these are the properties you can use.
… Also here, we are dealing with a domain-specific problem, which is addressed in many different ways.

<riccardoAlbertoni> +1 to not defining what versioning is, but indicating a simple set of terms that can be used.

DaveBrowning: Yes, I also think we shouldn't do much, just because we don't have the time.
… And to answer SimonCox's point, I'll prepare some options.

DaveBrowning: About when to meet...

<alejandra> +1 on 2 hours for the usual DCAT subgroup slot

<riccardoAlbertoni> +1 to regular slot extended

Makx: Maybe next week, normal slot.

<PWinstanley> we need to be on the plenary

<PWinstanley> so same slot is better

<riccardoAlbertoni> bye thanks

[meeting adjourned]
