What are the issues?
The history of computing is closely tied to the realisation that aggregated information is often more valuable than its separate parts. This led to conflicting positions: some wanted to merge data from diverse sources to distil more value, while others wanted to prevent merging in order to retain privacy or other control over processing, or to prevent inappropriate use of data felt to be unfit for general processing. Those wanting to exchange datasets bilaterally typically prepared a data interchange agreement (DIA) that explained how one party’s data model would fit with the data models of the other. The agreement would also cover licensing and other caveats about the use of the data. A DIA was often a large text document with tables, and often carried hand-written signatures to establish its authority between the parties.
This approach changed radically with the advent of the World Wide Web and the scope it provides for dataset exchange at global scale between millions of computers and their users. The open data movement was a natural progression, with both citizens and administrations keen to establish the conditions under which significant economic benefit could be obtained from the re-use of public sector information.
The research environment has followed a similar journey. Teams and institutions have discovered the benefit of being able to aggregate information, and have also been encouraged to make their datasets available as part of the research reproducibility and research transparency agendas. However, much like the usage-agreement aspect of the DIA, Data Sharing Agreements (DSAs) have been introduced, particularly in genomics and other health-related areas, where funding bodies such as the US National Institutes of Health have a set of policies with which researchers must comply.
Where is the earlier work?
The provision of guidelines for administrations on how to publish ‘open data’ was pivotal to the W3C’s development of its 2017 recommendation on how to publish data on the Web. That recommendation built on the first version of the W3C standard vocabulary for publishing data catalogs on the Web (DCAT), published three years earlier. The European Commission and national governments adopted DCAT for their catalogs. In some cases, however, they felt certain elements were missing, and they often also wanted to specify which controlled vocabularies to use. This led to the creation of ‘application profiles’, through which a data publisher could supplement the DCAT vocabulary with elements taken from vocabularies developed in other standardisation efforts and, where necessary, add further constraints. There are now many application profiles centred on DCAT, covering the data catalogs of individual national administrations or specific dataset types, such as statistical (StatDCAT) or geospatial (GeoDCAT).
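To make the idea of an application profile more concrete, the following is a minimal sketch in Python. It treats a profile as a set of mandatory properties plus a controlled vocabulary for one property. The `dcat:` and `dct:` namespace URIs are the published ones; the profile itself, the `example.org` theme URIs, and the function name `violations` are hypothetical and are not taken from any real application profile.

```python
# Published namespace URIs for DCAT and Dublin Core Terms.
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"

# Hypothetical application profile: which properties a dataset description
# must carry, and which values one property may take.
EXAMPLE_PROFILE = {
    "required": {DCT + "title", DCT + "description", DCAT + "theme"},
    "controlled": {
        DCAT + "theme": {
            "http://example.org/themes/environment",  # hypothetical vocabulary
            "http://example.org/themes/transport",
        }
    },
}


def violations(record: dict, profile: dict) -> list:
    """Return human-readable problems found when checking record against profile."""
    problems = []
    for prop in sorted(profile["required"]):
        if prop not in record:
            problems.append("missing required property " + prop)
    for prop, allowed in profile["controlled"].items():
        if prop in record and record[prop] not in allowed:
            problems.append("value of " + prop + " is not in the controlled vocabulary")
    return problems


dataset = {
    DCT + "title": "Air quality measurements",
    DCAT + "theme": "http://example.org/themes/health",
}
for problem in violations(dataset, EXAMPLE_PROFILE):
    print(problem)
```

Real application profiles are usually expressed in a constraint language rather than in application code; the dictionary here only illustrates the two kinds of constraint mentioned above.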
DCAT Version 2
In 2017 W3C realised that there would be benefit in re-examining dataset exchange on the Web as a whole, and chartered the Dataset Exchange Working Group [DXWG] to revise DCAT and to examine the role and scope of application profiles in requesting and serving data on the Web. The revision of DCAT is now in the late stages of the standards development process. The latest public Working Draft is available, and readers are encouraged to familiarise themselves with this work and to provide feedback to the public mailing list at firstname.lastname@example.org and/or as GitHub issues.
Anything else to think about?
In addition to DIAs and DSAs, another acronym associated with dataset exchange is “ETL”: the Extraction, Transformation and Loading effort that is often required when a party receives datasets to be merged that use different models or schemas. ETL is often a considerable effort, and it is necessary only because the parties use different models; it generally adds no value in itself. The ideal situation would be to avoid this essentially nugatory work.
There is already a mechanism on the Web for a client to give a server an ordered set of choices of serialisation type for the dataset it wants returned (e.g. preferably XML, if not that then CSV, and if not that then the default HTML). This “content negotiation” depends on providing the ordered list to the server, generally through the HTTP “Accept” header. The “application profiles” mentioned earlier describe the model that a dataset, such as a data catalog, must conform to in order to be valid in a certain context. There is therefore a need for a comparable mechanism through which a client can use a list of profiles to tell a web server which profile or profiles it would prefer the returned data to adhere to. Since this provides a contract between a data provider and a data consumer, the indication of profile preferences could, amongst other things, perhaps reduce the need for an ETL step in dataset exchange.
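The ordered-preference behaviour of the “Accept” header described above can be sketched as follows. This is a simplified illustration in Python rather than a complete implementation of HTTP content negotiation: it reads only the `q` weight of each entry, ignores wildcards such as `*/*`, and the function names `parse_accept` and `negotiate` are made up for this example.

```python
def parse_accept(header: str) -> list:
    """Parse an Accept-style header into (value, q) pairs, most preferred first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        value = parts[0]
        if not value:
            continue
        q = 1.0  # an entry without an explicit weight counts as q=1
        for param in parts[1:]:
            name, _, raw = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(raw)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((value, q, position))
    # Highest weight first; ties keep the order in which the client listed them.
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(value, q) for value, q, _ in prefs]


def negotiate(header: str, available: set, default: str) -> str:
    """Pick the client's most preferred media type that the server can produce."""
    for value, q in parse_accept(header):
        if q > 0 and value in available:
            return value
    return default


# The client prefers XML, then CSV, then HTML; this server cannot produce XML.
accept = "application/xml, text/csv;q=0.8, text/html;q=0.5"
print(negotiate(accept, {"text/csv", "text/html"}, "text/html"))
```

In this sketch the server falls back to its default when nothing matches; a real server might instead respond with an error status.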
Content Negotiation By Profile
The DXWG is also making strong progress in developing a recommendation for “Content Negotiation by Profile”, and the Second Public Working Draft was published for community review on 30 April 2019. Readers are encouraged to read this draft and to provide their feedback. For both specifications we welcome feedback, including positive support for the proposals being developed, to the public mailing list at email@example.com and/or as GitHub issues.
The DXWG is working on three fronts: an improved DCAT to facilitate the discovery of datasets; guidance on profiles (still in the early stages of development); and a recommendation on mechanisms that could allow a client to provide an ordered set of choices of profile or model for the datasets it wants returned from servers. Together, these are intended to provide a framework of standards and recommended designs and strategies that helps developers improve automation in discovering and merging datasets, and so deliver the increased value that people expect from data aggregation. At the same time, the framework provides a mechanism to automate the selection of models, which might reduce the ETL requirement or deliver another preferred model.
Acknowledgements: Thanks to Alejandra Gonzalez-Beltran and Lars G Svensson for helpful comments.
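The idea of a client offering an ordered set of profile choices can be sketched on the server side as follows. This Python sketch assumes, purely for illustration, that the client sends its preferences as a list of profile URIs in angle brackets with optional `q` weights, carried in a header such as `Accept-Profile`. The header name, the syntax, the `example.org` profile URIs and the function names are all assumptions here; the Working Draft should be consulted for the mechanisms actually proposed.

```python
import re

# Matches one "<uri>" entry with an optional ";q=<weight>" parameter.
_ITEM = re.compile(r"<([^>]+)>\s*(?:;\s*q\s*=\s*([0-9]*\.?[0-9]+))?")


def rank_profiles(header: str) -> list:
    """Return the profile URIs from the header, most preferred first."""
    ranked = []
    for position, match in enumerate(_ITEM.finditer(header)):
        uri, raw_q = match.group(1), match.group(2)
        q = float(raw_q) if raw_q is not None else 1.0
        if q > 0:  # a weight of zero means "not acceptable"
            ranked.append((-q, position, uri))
    ranked.sort()  # highest weight first, then the client's own order
    return [uri for _, _, uri in ranked]


def choose_profile(header: str, supported: set, default: str) -> str:
    """Pick the client's most preferred profile that the server supports."""
    for uri in rank_profiles(header):
        if uri in supported:
            return uri
    return default


# Hypothetical profile URIs: the client prefers "stat" but will accept "geo".
GEO = "http://example.org/profile/geo"
STAT = "http://example.org/profile/stat"
header = "<" + GEO + ">;q=0.9, <" + STAT + ">"
print(choose_profile(header, {GEO}, "http://example.org/profile/default"))
```

If the server can return data conforming to one of the listed profiles, the client receives a model it already understands, which is the case in which a later ETL step might be avoided.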