[dxwg] Support for HTTP compliant datasets (#1086) from Claus Stadler via GitHub on 2019-09-19 (public-dxwg-wg@w3.org from September 2019)

From: Claus Stadler via GitHub <sysbot+gh@w3.org>
Date: Thu, 19 Sep 2019 13:14:02 +0000
To: public-dxwg-wg@w3.org
Message-ID: <issues.opened-495791791-1568898841-sysbot+gh@w3.org>

Aklakan has just created a new issue for https://github.com/w3c/dxwg:

== Support for HTTP compliant datasets ==
I understand that DCAT 2 content is frozen, so this is a feature request to be considered for a future version.

While working with DCAT data catalogs I came across a challenge:

* The link between datasets and distributions seems to be used pretty much arbitrarily in practice. For example, picking an arbitrary entry from [data.gov](https://catalog.data.gov/dataset/epa-facility-registry-service-frs-wastewater-treatment-plants), I can see a zip file, web resources and a REST endpoint. In the typical CKAN-DCAT mapping, all these resources become distributions, and my impression is that the DCAT 2 standard does (intentionally?) not impose many restrictions here.
Of course, a little semantics goes a long way, but after nearly two decades of Semantic Web, I think many people in the RDF community want to go a bit further.

And with this lax modeling, it is impossible for an application to refer to a (DCAT) **dataset** and do something smart with it.

So what is a dataset in the first place?
Section [5.1 DCAT scope](https://www.w3.org/TR/vocab-dcat-2/#dcat-scope) states:

> A dataset in DCAT is defined as a "collection of data, published or curated by a single agent, and available for access or download in one or more serializations or formats".

I would like to make the following proposal:

* **Definition** A dataset is an instance of a **data model**. Note that I use data model and abstract syntax as synonyms here.
* A distribution denotes a means of access to that specific instance of the data model.
* All distributions of a dataset should provide access to the same dataset. Hence, once a copy of the dataset has been obtained from one distribution, there is no need to fetch further distributions. At the same time, if one distribution of an RDF dataset (a dataset that is an instance of the RDF model) is a SPARQL endpoint, an application may prefer this distribution over the file download.
* A download URL points to a resource that can supply representations whose content types are among the concrete syntaxes of the abstract syntax: if you have tabular data, the concrete syntaxes are denoted by media types such as text/csv or text/tab-separated-values; if you have RDF data, they may be text/turtle, application/n-triples or application/rdf+xml.
* If resolving the download URL does not yield specific HTTP headers (e.g. only application/octet-stream, as for [DBpedia downloads](http://downloads.dbpedia.org/2016-10/core-i18n/en/genders_en.ttl.bz2)), then the content type, encoding, charset and language of the response (all expressible through standard HTTP headers) may be assumed from the distribution's DCAT description.
* A zip archive by itself is typically NOT a dataset: it is simply an archive, and thus a collection of files. Without further references to standards or metadata, no application can reason about what or where the dataset in a zip archive is. A zip archive *could* contain a DCAT description of its own content, e.g. in a dcat.ttl file in the root folder. This file could then describe all the CSV, RDF, XML or other files in the archive.
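
To make the fallback rule for unspecific HTTP headers concrete, here is a minimal Python sketch. It is only an illustration under assumptions: `effective_headers` is a hypothetical helper, the dictionary keys `media_type` and `compress_format` are made-up names that loosely mirror `dcat:mediaType` and `dcat:compressFormat`, and the values for the DBpedia file are my guess from its `.ttl.bz2` extension.

```python
# Hypothetical helper; none of these names come from DCAT or an existing library.
GENERIC_TYPES = {"", "application/octet-stream"}


def effective_headers(response_headers, distribution):
    """Fill in missing or generic HTTP metadata from a distribution description."""
    # HTTP header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in response_headers.items()}
    result = dict(headers)

    # Compare only the bare media type, without parameters such as charset.
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if content_type in GENERIC_TYPES and distribution.get("media_type"):
        result["content-type"] = distribution["media_type"]

    # Only assume an encoding when the server did not state one itself.
    if "content-encoding" not in headers and distribution.get("compress_format"):
        result["content-encoding"] = distribution["compress_format"]
    return result


# Assumed catalog entry for the DBpedia download mentioned above.
genders = {"media_type": "text/turtle", "compress_format": "bzip2"}
served = {"Content-Type": "application/octet-stream"}
print(effective_headers(served, genders))
```

Headers that the server does state specifically are left untouched; the catalog metadata is only used to fill gaps.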

Dataset descriptions that adhere to these rules can be unambiguously served according to HTTP principles, notably content negotiation, by a DCAT-based HTTP proxy.

* The HTTP proxy internally resolves the URL requested by a client to an entry among a set of DCAT catalogs.
* Based on the catalog, the server can automatically provide the appropriate HTTP headers. A **smart** server can even choose the appropriate download, **perform HTTP caching** and convert the available syntaxes and encodings to those requested (Turtle to RDF/XML, CSV to TSV or Excel, etc.).
* Note that HTTP already describes a mechanism for handling encoding (gzip, bzip2, brotli, etc.).
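
The selection step of such a proxy could look roughly like the following Python sketch. It is a simplification under assumptions: the function names and the `media_type` / `download_url` keys are hypothetical, the example URLs are placeholders, and the matching ignores the finer points of HTTP content negotiation (such as the precedence of more specific media ranges).

```python
def parse_accept(header):
    """Parse an Accept header into (media_range, q) pairs, highest q first."""
    ranges = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((fields[0].lower(), q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(ranges, key=lambda item: item[1], reverse=True)


def matches(media_range, media_type):
    """Check a media type against a range such as text/turtle, text/* or */*."""
    if media_range == "*/*":
        return True
    if media_range.endswith("/*"):
        return media_type.startswith(media_range[:-1])
    return media_range == media_type


def choose_distribution(accept_header, distributions):
    """Return the first distribution acceptable to the client, or None."""
    for media_range, q in parse_accept(accept_header):
        if q <= 0:
            continue
        for dist in distributions:
            if matches(media_range, dist["media_type"]):
                return dist
    return None


# Assumed catalog entries for one dataset; the URLs are placeholders.
distributions = [
    {"media_type": "application/rdf+xml", "download_url": "https://example.org/data.rdf"},
    {"media_type": "text/turtle", "download_url": "https://example.org/data.ttl"},
]
print(choose_distribution("text/turtle;q=0.9, application/rdf+xml;q=0.5", distributions))
```

If no distribution matches, a smart server could fall back to converting one of the available syntaxes, or answer with 406 Not Acceptable.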

As I see it, there is a strong link between how HTTP functions and how datasets - according to the strict definition - correspond to HTTP resources, which can thus be served in a standard way based on catalog metadata. In my impression, this aspect is not yet adequately considered in the DCAT spec.

Please view or discuss this issue at https://github.com/w3c/dxwg/issues/1086 using your GitHub account

Received on Thursday, 19 September 2019 13:14:04 UTC