Use Case Citation of Scientific Datasets

From Library Linked Data



Citation of Scientific Datasets


Monica Duke

Background and Current Practice

In some scientific disciplines there is a growing trend to make supporting data available alongside the journal publication of research. This data can be stored and curated either by the journal or in discipline-specific repositories where these exist. As yet there is no de facto method for citing the data made available; in some disciplines (such as bioinformatics), for example, the community uses accession codes as identifiers to access the data in large, well-known databases. Furthermore, there are as yet no widely agreed ways to link the data to the publication that refers to it, or to the contributors of the data, beyond the assumption that the co-authors of a publication contributed to the research, without their individual roles being made clear.

The motivation behind making data citation more explicit is twofold: first, to make data easier to access, supporting re-use or verification; and second, to strengthen the link between contributor and data (or some other research contribution) in order to assign credit where it is due.

The SageCite project in particular is working with Sage Bionetworks to cite predictive models of disease, in collaboration with two publishers (Nature Genetics and PLoS). Libraries are also increasingly being identified as potential curators of data [Borgman].

Goal


  1. Data that is referenced in publications can be identified, described and accessed, and linked to the contributor.
  2. Linked data can provide guidance on how to assign identifiers, how to link data, descriptions, publications and contributors/authors through vocabularies.

Target Audience

  • Publishers
  • Data repositories
  • Libraries
  • Researchers
  • Data contributors
  • Funding bodies
  • Providers of services (e.g. identifier services, reputation systems)

Use Case Scenario

(1) Human user

A researcher (or reviewer) would like to examine the gene expression and protein-protein interaction data that were used to construct a predictive model of complex biological system behaviour. The researcher is able to access a description of the data that was used (e.g. source of the experimental data, including protocols and brands, contributor, date) and is then able to re-use the data or re-run the process. The researcher also wishes to find out whether the same contributor has made other data of the same type available, which articles this contributor has co-authored based on this data, and which other articles report on experiments that use this data.

(2) Reputation system

A university wants to build a reputation system that showcases the extent of the data sharing its researchers are involved in. The system will trace the data released by its members and display information about articles that have been published based on that data (or on derived experiments), whether by the contributors of the data or by others. For each researcher it will show a rating based on how well that data has been re-used, any comments others have made on that data, and any other ratings given to the data or to the contributor by external systems (e.g. discipline-based repositories or journal systems).
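
As a rough illustration of the reputation scenario, the sketch below counts, for each researcher, the articles that use that researcher's data, separating re-use by the contributor from re-use by others. All identifiers, records and the counting rule are hypothetical; a real system would gather this information from journal, repository and institutional systems and would likely weight it differently.

```python
from collections import defaultdict

# Hypothetical records: who contributed each dataset, and which
# datasets each article is based on.
DATASET_CONTRIBUTORS = {
    "dataset:1": {"person:alice"},
    "dataset:2": {"person:alice", "person:bob"},
}
ARTICLES = {
    "article:a": {"authors": {"person:alice"}, "uses": {"dataset:1"}},
    "article:b": {"authors": {"person:carol"}, "uses": {"dataset:1", "dataset:2"}},
    "article:c": {"authors": {"person:dave"}, "uses": {"dataset:2"}},
}


def reuse_counts(dataset_contributors, articles):
    """Return {person: {"self": n, "others": m}}, counting articles that use their data."""
    counts = defaultdict(lambda: {"self": 0, "others": 0})
    for article in articles.values():
        # Credit each contributor at most once per article,
        # even if the article uses several of their datasets.
        credited = set()
        for dataset in article["uses"]:
            credited |= dataset_contributors.get(dataset, set())
        for person in credited:
            kind = "self" if person in article["authors"] else "others"
            counts[person][kind] += 1
    return dict(counts)


scores = reuse_counts(DATASET_CONTRIBUTORS, ARTICLES)
```

Keeping "self" and "others" apart is one possible design choice: it lets the display distinguish a contributor publishing on their own data from independent re-use.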

Application of linked data for the given use case

Linked data provides guidance and technology suggestions for minting HTTP URIs that identify the data, the contributors, the publications and other systems, and for describing the relationships between them. It could allow inferences to be made over data created or held across systems that are managed in a distributed way (e.g. journal systems, university systems including university libraries, and discipline repositories). Automated systems should be able to extract data to compute measures of credit and to make connections, particularly between records that refer to the same person. Human users should be able to view descriptions of the data by navigating from publications, and then access the data.
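
To make the idea of linked identifiers concrete, here is a minimal sketch that holds a few statements as (subject, predicate, object) triples in plain Python and answers the questions from the human-user scenario. The Dublin Core Terms and CiTO namespaces are real, but choosing `contributor`, `creator` and `citesAsDataSource` for this purpose is an assumption, and every `example.org` URI is invented for illustration.

```python
# Real vocabulary namespaces; the example.org resources are made up.
DCT = "http://purl.org/dc/terms/"
CITO = "http://purl.org/spar/cito/"
EX = "http://example.org/"

TRIPLES = {
    (EX + "dataset/1", DCT + "contributor", EX + "person/alice"),
    (EX + "dataset/2", DCT + "contributor", EX + "person/alice"),
    (EX + "article/a", DCT + "creator", EX + "person/alice"),
    (EX + "article/a", CITO + "citesAsDataSource", EX + "dataset/1"),
    (EX + "article/b", DCT + "creator", EX + "person/carol"),
    (EX + "article/b", CITO + "citesAsDataSource", EX + "dataset/1"),
}


def subjects(triples, predicate, obj):
    """All subjects that have the given predicate and object."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def objects(triples, subject, predicate):
    """All objects of the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def related_to_dataset(triples, dataset):
    """Starting from one dataset URI, find its contributors, their other data, and citing articles."""
    contributors = objects(triples, dataset, DCT + "contributor")
    other_datasets = sorted(
        {d for c in contributors for d in subjects(triples, DCT + "contributor", c)}
        - {dataset}
    )
    citing_articles = subjects(triples, CITO + "citesAsDataSource", dataset)
    return {
        "contributors": contributors,
        "other_datasets": other_datasets,
        "citing_articles": citing_articles,
    }


result = related_to_dataset(TRIPLES, EX + "dataset/1")
```

In practice the triples would be published by different systems and queried with an RDF library or a SPARQL endpoint; the set of tuples here only stands in for that.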

Existing Work (optional)

  • The W3C HCLSIG
  • ORCID is an initiative to assign unique identifiers to researchers.
  • DOIs are a system of identifiers used by publishers and being promoted for identification of data for example by


  • EZID is an identifier-agnostic system.
  • GBIF has started to look into data citation.

Related Vocabularies (optional)

  • OPM
  • BIBO
  • CITO

Problems and Limitations

There are a number of challenges with this scenario. Firstly, it crosses a number of communities, so it highlights the need to interact with other groups and standard-setting bodies. This links to another challenge: the need to understand the various vocabularies being developed, to select which of them to apply and which parts are relevant, and to work out how to link between them.

For example (and this challenge is very relevant to the library community), the issue of identifying persons crosses boundaries: a person may have different roles (or identities) as an author, as a researcher employed by an institution, and as a data contributor. The ORCID initiative is intended to address researcher identification; for the scenario to be realised, this identification has to be reused by, or linked to, other systems such as those used by journals, instances of FOAF, institutional systems and data repositories. The contributor or author may be designated using different terms across vocabularies, e.g. dc:creator or dc:contributor. There is also as yet no widely accepted vocabulary to describe the roles of contributors (data producer, reviewer, author).

Another challenge is that some of the identifiers in use are criticised by some as not being 'of the web'. Other identifiers (such as the accession codes for data) are intended to work with databases; although they can be used in conjunction with an API (sometimes in a RESTful way), they may predate linked data and may not adhere to its principles of identification.
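
The person-identification problem can be illustrated with a small sketch that groups identifiers once "same person" links between systems are known. It uses a union-find structure; the identifier URIs and the links between them are hypothetical, and how such links are established and trusted is outside the scope of the sketch.

```python
def merge_identities(same_as_links):
    """Group identifiers that are connected by 'same person' links (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in same_as_links:
        parent[find(a)] = find(b)

    groups = {}
    for identifier in list(parent):
        groups.setdefault(find(identifier), set()).add(identifier)
    return list(groups.values())


# Hypothetical identifiers for people as seen by four different systems.
LINKS = [
    ("http://example.org/orcid/alice", "http://journal.example/author/42"),
    ("http://university.example/staff/alice", "http://example.org/orcid/alice"),
    ("http://repo.example/contributor/bob", "http://journal.example/author/77"),
]

identity_groups = merge_identities(LINKS)
```

Each resulting group collects the identifiers that are believed to denote one person, so that credit recorded against any of them can be attributed consistently.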

Related Use Cases and Unanticipated Uses (optional)

Library Linked Data Dimensions / Topics

User needs

  • Browse / explore / select
  • Retrieve / find
  • Identify
  • Access / obtain
  • Integrate / contextualize
  • Add information / annotate / comment

Publishing citations, e.g. best practice for citations in RDF

Use of Identifiers for and in LLD

  • Reuse or urlification of traditional identifiers
  • Namespace policies

Use of Identifiers

  • HTTP URIs, DOIs, handles, ARKs, shorteners, hash, slash, 303 redirects, PURLs
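
As one small example of "urlification" of a traditional identifier, the sketch below turns a bare DOI into an HTTP URI using the doi.org resolver. The function name, the validation rule and the DOIs in the usage lines are made up for illustration; other identifier schemes (handles, ARKs, accession codes) would each need their own resolver and rules.

```python
from urllib.parse import quote


def doi_to_http_uri(doi):
    """Turn a bare DOI (optionally prefixed with 'doi:') into an HTTP URI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Very loose sanity check: DOIs start with '10.' and contain a '/'.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URI path, keeping '/'.
    return "https://doi.org/" + quote(doi, safe="/")


plain = doi_to_http_uri("doi:10.1234/example.5678")
escaped = doi_to_http_uri("10.1234/abc<1>")
```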

Community-building, education and outreach

  • Outreach to other communities (archives, museums, publishers, the Web)

Citation (attribution) as a requirement for use

  • Some people are attempting to apply copyright licenses (e.g. Creative Commons) to data, meaning citation is not just a good idea but a legal requirement. (It is not clear this will stick, since data is usually interpreted as not protectable by copyright law.)
  • The Protocol for Implementing Open Access Data treats citation as a normative, not legal, requirement of use.

References (optional)


[Borgman] Borgman, Christine. "Research Data: Who Will Share What, with Whom, When, and Why?"

See also

Current citation tracking tools don't generally work for dataset citations.