W3C

Proposal for the Improvement of the Semantics of ORCIDs

As part of its wider service, ORCID currently provides data about individuals in RDF. This document proposes a number of small changes to this service that, it is hoped, will help improve the semantics and robustness of the data. In making this proposal, it should be emphasised that the current solution is already good: far more things are right than wrong, and the aim is to make it even better.

The proposal adheres to a number of guiding principles:

  1. any change must represent a minimal evolution, not a wholesale change;
  2. any change must be backwards compatible with the existing service.

Problem Statement

As an example, the following data is currently returned from http://orcid.org/0000-0003-0782-2704 with the Accept header set to text/turtle. Line numbers have been added for ease of reference.

 1 @prefix gn:      <http://www.geonames.org/ontology#> .
 2 @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
 3 @prefix prov:    <http://www.w3.org/ns/prov#> .
 4 @prefix foaf:    <http://xmlns.com/foaf/0.1/> .
 5 @prefix pav:     <http://purl.org/pav/> .
 6 @prefix owl:     <http://www.w3.org/2002/07/owl#> .
 7 @prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
 8 @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

 9 <http://orcid.org/0000-0003-0782-2704/>
10   a foaf:PersonalProfileDocument , foaf:OnlineAccount ;
11   rdfs:label "0000-0003-0782-2704" ;
12   pav:contributedOn "2012-12-07T14:37:24.441Z"^^xsd:dateTime ;
13   pav:createdBy <http://orcid.org/0000-0003-0782-2704> ;
14   pav:createdOn "2012-12-07T14:34:08.399Z"^^xsd:dateTime ;
15   pav:createdWith <http://orcid.org> ;
16   pav:lastUpdateOn "2015-02-16T03:21:12.933Z"^^xsd:dateTime ;
17   prov:generatedAtTime "2015-02-16T03:21:12.933Z"^^xsd:dateTime ;
18   prov:wasAttributedTo <http://orcid.org/0000-0003-0782-2704> ;
19   foaf:accountName "0000-0003-0782-2704" ;
20   foaf:accountServiceHomepage <http://orcid.org> ;
21   foaf:maker <http://orcid.org/0000-0003-0782-2704> ;
22   foaf:primaryTopic <http://orcid.org/0000-0003-0782-2704> .

23 <http://sws.geonames.org/2750405/>
24   a gn:Feature , <http://schema.org/Place> , rdfs:Resource ,  
       <http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing> ;
25   rdfs:label "Netherlands" , "Kingdom of the Netherlands" ;
26   gn:countryCode "NL" ;
27   gn:name "Netherlands" , "Kingdom of the Netherlands" .

28 <http://orcid.org/0000-0003-0782-2704>
29   a foaf:Person , prov:Person ;
30   rdfs:label "Ivan Herman" ;
31   foaf:account <http://orcid.org/0000-0003-0782-2704/> ;
32   foaf:based_near
33     [ a       gn:Feature ;
34       gn:countryCode "NL" ;
35       gn:parentCountry <http://sws.geonames.org/2750405/>
36     ] ;
37   foaf:familyName "Herman" ;
38   foaf:givenName "Ivan" ;
39   foaf:name "Ivan Herman" ;
40   foaf:page <http://www.ivan-herman.name> , 
     <http://www.w3.org/People/Ivan/> , 
     <http://www.ivan-herman.net/professional/> ;
41   foaf:plan "See http://www.ivan-herman.net/professional/CV.html" ;
     foaf:publications <http://orcid.org/0000-0003-0782-2704/> .

Example 1: Ivan Herman's data 2015-03-06
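
The request behind a listing like Example 1 can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names build_request and fetch_turtle are hypothetical, and what the service actually returns depends on its behaviour at the time of the request.

```python
from urllib.request import Request, urlopen

ORCID_URI = "http://orcid.org/0000-0003-0782-2704"


def build_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare a GET request that asks for one specific representation."""
    return Request(uri, headers={"Accept": media_type})


def fetch_turtle(uri: str) -> str:
    """Dereference the URI and return the body as text (makes a network call)."""
    with urlopen(build_request(uri), timeout=30) as response:
        return response.read().decode("utf-8")


# Inspect the prepared request without contacting the server.
request = build_request(ORCID_URI)
print(request.get_method(), request.full_url, request.header_items())
```

Calling fetch_turtle(ORCID_URI) would perform the request itself; it is left uncalled here so that the sketch runs without network access.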

It is important to note that this data includes two identifiers that differ in the presence or absence of the trailing slash, i.e. the two identifiers are:

http://orcid.org/0000-0003-0782-2704
http://orcid.org/0000-0003-0782-2704/

These are used consistently: the identifier without the trailing slash identifies the individual person, and the one with the trailing slash identifies the online account held by that person. Strictly speaking, this is perfectly correct. It is, however, dangerous, as discussed recently on a W3C mailing list. Note in particular the contributions from Stian Soiland-Reyes, who contributed to the current ORCID implementation. There are several objections to the current implementation:

  1. it is trivially easy for developers to miss the distinction and use the wrong one (or just one) in software;
  2. in an attempt to make the Web as easy to use as possible, browsers often don't display full URLs, compounding the first problem;
  3. URLs that don't end in file names very often redirect to ones with a trailing slash (whether or not this is shown in the browser address bar). http://m.bbc.co.uk/news/ is a rare example of a URL that redirects to the same URL minus the trailing slash.
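
The first objection can be illustrated with a short sketch. The function tidy below is hypothetical: it stands in for the kind of well-meant URL normalisation that client code often applies, and it is not taken from any particular library.

```python
from urllib.parse import urlsplit, urlunsplit

PERSON = "http://orcid.org/0000-0003-0782-2704"
ACCOUNT = "http://orcid.org/0000-0003-0782-2704/"


def tidy(url: str) -> str:
    """Hypothetical normaliser: drop any trailing slash from the path."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path.rstrip("/")))


# As plain strings the two identifiers are distinct ...
print(PERSON == ACCOUNT)
# ... but after normalisation the person and the account collapse into one.
print(tidy(PERSON) == tidy(ACCOUNT))
```

Once such a step sits anywhere in a processing pipeline, statements about the account and statements about the person can no longer be told apart.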

In short, everyday experience suggests that the presence or absence of a trailing slash on a URL is an insufficient and potentially hazardous method of distinguishing between a person and information associated with that person. As the recent online discussion shows, the debate about whether http://orcid.org/0000-0003-0782-2704 should identify Ivan Herman or an account held by him is unlikely to lead to consensus.

Can the discussion be avoided altogether?

Proposed Changes

ORCIDs are defined in terms of what they do, not what they represent, i.e. “... a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized.”

The proposed way forward is consistent with that definition: the semantics of an ORCID should be simply that it is an ORCID. On its own, it identifies neither the person nor their account, but dereferencing that identifier in a semantic workflow will return semantically accurate data. This includes information about the individual person, who should be identified within the data by appending the fragment #person. Similarly, the account would be identified by appending #account. There is a further subject in the example data above: the list of publications, which is neither the person nor the account and so should be identified by appending #pubList.
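
A brief sketch of why fragments suit this purpose: a client removes the fragment before making the HTTP request, so all three identifiers dereference to the same document while remaining distinct names within the data. The variable names here are illustrative only.

```python
from urllib.parse import urldefrag

ORCID_URI = "http://orcid.org/0000-0003-0782-2704"

# The three subjects proposed in the text, each named by a fragment.
subjects = {name: f"{ORCID_URI}#{name}" for name in ("person", "account", "pubList")}

for name, uri in subjects.items():
    # urldefrag splits off the fragment; the first part is what gets requested.
    document, fragment = urldefrag(uri)
    print(f"{name:8} {uri} -> request {document}")
```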

If adopted, the current implementation would change such that, again within the data,
http://orcid.org/0000-0003-0782-2704 would be replaced by
http://orcid.org/0000-0003-0782-2704#person

http://orcid.org/0000-0003-0782-2704/ would be replaced by
http://orcid.org/0000-0003-0782-2704#account

in all cases except for the value of the foaf:publications property (the last line of the example), which would become http://orcid.org/0000-0003-0782-2704#pubList.
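
The mapping can be sketched as a text substitution over the Turtle output. This is only an illustration of the three rules: the function apply_proposal is hypothetical, and a real implementation would more likely change the identifiers where the data is generated, or use an RDF library, rather than rewrite serialised text.

```python
import re

ORCID_URI = "http://orcid.org/0000-0003-0782-2704"


def apply_proposal(turtle: str, orcid: str = ORCID_URI) -> str:
    """Rewrite identifiers in Turtle text according to the proposed mapping."""
    escaped = re.escape(orcid)
    # Exception first: the value of foaf:publications becomes #pubList.
    turtle = re.sub(
        rf"(foaf:publications\s+)<{escaped}/>", rf"\1<{orcid}#pubList>", turtle
    )
    # Every remaining trailing-slash identifier names the account.
    turtle = turtle.replace(f"<{orcid}/>", f"<{orcid}#account>")
    # Identifiers without the slash name the person.
    return turtle.replace(f"<{orcid}>", f"<{orcid}#person>")


sample = """\
<http://orcid.org/0000-0003-0782-2704/>
  a foaf:OnlineAccount ;
  foaf:primaryTopic <http://orcid.org/0000-0003-0782-2704> .

<http://orcid.org/0000-0003-0782-2704>
  a foaf:Person ;
  foaf:account <http://orcid.org/0000-0003-0782-2704/> ;
  foaf:publications <http://orcid.org/0000-0003-0782-2704/> .
"""

result = apply_proposal(sample)
print(result)
```

The order of the substitutions matters: the foaf:publications exception is handled before the general trailing-slash rule, and the account rule before the person rule.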

The advantages of this solution are:

The potential disadvantage of this or any change to the current implementation is that it might adversely affect other people's systems that use the data.

If individual operators are known to use ORCID's RDF data then they should be contacted and the issues discussed. The unknown users are harder to reach but this can probably be achieved through a variety of outreach mechanisms, such as an online call for comment that can be promoted through tweets, conference talks and more.

Any change should be signalled well in advance.

Secondary Recommendations

A further improvement in the data returned when dereferencing an ORCID would be to include more of the information available to human readers. Ivan Herman's ORCID Web page shows his education and a full list of his publications, but these are not included in the machine-readable output. One way forward might be to augment the HTML page with RDFa markup, but it is likely to be easy to add this information to the published RDF data too.

Furthermore, noting the proposal to use the #pubList fragment as the value of the foaf:publications property, it would be logically consistent if the Web page that humans see when dereferencing an ORCID in a regular browser were amended to include an id of pubList on the relevant HTML element.

The current implementation uses content negotiation to return data in HTML, RDF/XML, RDF Turtle, XML and JSON (as an aside, it would be good to add JSON-LD to this list). However, the availability of this functionality could be made much more obvious. The usual method, exemplified in sites such as OpenCorporates and Ordnance Survey, is to:

  1. Configure the server such that appending the relevant file extension to the URI returns the data in that format (.html, .rdf, .ttl, .xml and .json respectively) without content negotiation.
  2. Link these alternative representations to each other. This is done from footnotes in the HTML page and in the data itself using dcterms:isFormatOf links.

For example:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

<http://orcid.org/0000-0003-0782-2704.json>
  a <http://purl.org/dc/dcmitype/Text>, foaf:Document ;
  dcterms:isFormatOf <http://orcid.org/0000-0003-0782-2704> ;
  dcterms:format "application/json" .
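
The same pattern can be generated for every format in the list. The sketch below is hypothetical: the function format_links and the FORMATS table are names introduced here, the media types are the registered ones for each extension, and each resource is typed only as foaf:Document because the text gives a fuller typing for the JSON case alone.

```python
ORCID_URI = "http://orcid.org/0000-0003-0782-2704"

# File extensions named in the text, with their registered media types.
FORMATS = {
    "html": "text/html",
    "rdf": "application/rdf+xml",
    "ttl": "text/turtle",
    "xml": "application/xml",
    "json": "application/json",
}


def format_links(uri: str) -> str:
    """Return Turtle statements tying each extension URI back to `uri`."""
    statements = []
    for extension, media_type in FORMATS.items():
        statements.append(
            f"<{uri}.{extension}>\n"
            f"  a foaf:Document ;\n"
            f"  dcterms:isFormatOf <{uri}> ;\n"
            f'  dcterms:format "{media_type}" .\n'
        )
    return "\n".join(statements)


links = format_links(ORCID_URI)
print(links)
```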

Making these changes has several advantages:

  1. It makes the data more discoverable as data.
  2. It shows developers that ORCID is offering a service for them, maximising the impact of the efforts already made.
  3. It adds credibility to that service, thereby attracting new registrations.

Conclusion

As noted at the beginning of this short document, there is a lot more right with the current implementation than wrong with it. The changes proposed are incremental in nature and, it is hoped, will increase the utility of the services offered to ORCID's primary constituents in the research community through better discoverability and better semantics.

Phil Archer
W3C Data Activity Lead
March 2015