The Metadata Task Force of the DPUB IG found, through extensive interviews with representatives of various sectors and roles within the publishing ecosystem, that there are numerous pain points for publishers with regard to metadata but that these pain points are largely not due to deficiencies in the Open Web Platform (OWP). Instead, there is a widespread lack of understanding or implementation of the technologies that the OWP already makes available for addressing most of the issues raised. However, some of the very technologies that are little used or understood in most sectors of publishing are widely used and understood in certain other sectors (e.g., scientific publishing, libraries). Priorities that have emerged are the need for better understanding of the importance of expressing identifiers as URIs; the need for much more widespread use of RDF and its various serializations throughout the publishing ecosystem; and the need to develop a truly interoperable, cross-sector specification for the conveyance of rights metadata (while remaining agnostic as to the sector-specific vocabularies for the expression of rights). This Note documents in detail the issues that were raised; provides examples of available RDF educational resources at various levels, from the very technical to non-technical and introductory; and lists important identifiers used in the publishing ecosystem, documenting which of them are expressed as URIs, and in what sectors and contexts. It recommends that, while little new technology is called for, the W3C is in a unique position to bridge today's siloed metadata practices and so help facilitate truly cross-sector exchange of interoperable metadata. This Note is thus intended to provide background and a context in which concrete work, whether by this Task Force or elsewhere within the W3C, may be undertaken.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a work in progress. No section should be considered final, and the absence of any content does not imply that such content is out of scope, or may not appear in the future. If you feel something should be covered here, tell us!

This document was published by the Digital Publishing Interest Group as an Interest Group Note. If you wish to make comments regarding this document, please send them to public-digipub@w3.org (subscribe, archives). All comments are welcome.

Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

The disclosure obligations of the Participants of this group are described in the charter.

This document is governed by the 1 August 2014 W3C Process Document.

1. Overview

Publishers use metadata in three fundamentally different ways:

While in many cases metadata is in a system or a form outside of the Web and uses technologies outside of the Open Web Platform (OWP), such as databases, repositories, authoring and formatting software, and proprietary aggregation and dissemination platforms, OWP technologies are increasingly becoming essential to all aspects of the publishing process (including modern versions of all those mentioned).

The Metadata Task Force of the W3C Digital Publishing Interest Group (DPUB IG) was formed to identify ways in which the W3C could help address problems publishers currently have with regard to metadata. In its discovery phase, the TF found the following fundamental issues to be commonly regarded as "pain points" by publishers:

Although each sector of publishing has problems with metadata in its own ways, the causes of these problems fall into two major categories:

In its initial exploration of these issues, the Metadata Task Force of the W3C Digital Publishing Interest Group found that the vast majority of difficulties that publishers of all types have in implementing metadata more effectively are in the second category. In most cases, the OWP already has features that address these issues, if used properly by publishers and implemented properly in systems that create, disseminate, and display those publications (e.g., expressing identifiers as URIs, using RDF and RDFa, etc.). In other cases, ongoing work by the W3C will likely provide solutions or essential components to solutions (e.g., the work of the Web Annotations WG is closely related to the need to address arbitrarily granular units of content).
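As a minimal sketch of what "expressing identifiers as URIs" and "using RDF" can look like in practice, the following Python builds two RDF statements about a book and serializes them as N-Triples. The ISBN URN and the ORCID URI are the examples listed in Appendix B; pairing them as a book and its creator is purely hypothetical, as are the sample title and the helper names `escape_literal` and `triple`. The choice of Dublin Core terms is an illustrative assumption, not a recommendation of the Task Force.

```python
# Illustrative sketch only. Helper names, the sample title, and the pairing of
# this ISBN with this ORCID are hypothetical; they are not taken from real data.
DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace


def escape_literal(text: str) -> str:
    """Escape the characters that must be escaped in an N-Triples string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def triple(subject_uri: str, predicate_uri: str, obj: str, obj_is_uri: bool = False) -> str:
    """Return one N-Triples line. URIs are assumed to be valid already."""
    obj_term = f"<{obj}>" if obj_is_uri else f'"{escape_literal(obj)}"'
    return f"<{subject_uri}> <{predicate_uri}> {obj_term} ."


book = "urn:isbn:978-952-10-9981-6"  # identifier expressed as a URN
statements = [
    triple(book, DCTERMS + "title", 'An "Example" Title'),
    triple(book, DCTERMS + "creator", "http://orcid.org/0000-0002-1825-0097", obj_is_uri=True),
]
print("\n".join(statements))
```

Because both the book and its creator are named by URIs rather than by bare strings, any system that receives these statements can merge them with statements from other sources about the same identifiers, which is the interoperability benefit the paragraph above describes.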

1.1 Recommendations for Further Work at W3C

The Metadata Task Force of the W3C Digital Publishing Interest Group has developed the following general and specific recommendations to the W3C with regard to the use of metadata within the OWP.

2. Background: Interviews

In order to assess the “pain points” with regard to metadata for publishers, the co-leaders of the task force, Madi Solomon of Pearson and Bill Kasdorf of Apex, conducted a number of interviews in 2014 with publishers, service providers, and representatives from related organizations. The interviews themselves are available in a separate document.

The interviewees were selected to provide insight from a variety of perspectives, and were individuals known to the interviewers to be knowledgeable and authoritative within their spheres. Ms. Solomon took a “vertical” approach, interviewing a broad range of individuals within Pearson, a large global educational publisher. Mr. Kasdorf took a “horizontal” approach, interviewing experts from diverse types of publishing (book, journal, magazine, and news) and representing diverse roles within the digital publishing ecosystem (publishers, metadata service providers, consultants, and representatives from other organizations that are addressing the issue of metadata in publishing).

The interview strategy was to conduct casual, open-ended interviews, each with a single individual, without an agenda or a prepared set of questions. The reason for this strategy was to avoid steering the discussion in particular directions. Instead, in this initial phase, the goal was to elicit what each interviewee perceived as the key issues and pain points with regard to metadata from their own point of view. Thus the interviews deliberately did not focus on what the W3C could do or on what changes could be made to the Open Web Platform to address these issues. Instead, the interviews stayed on the general level. Since many of the interviewees were not technical, framing the discussion in too technical a manner would have impeded the ability to obtain authentic responses. As expected, few of the interviewees felt able to identify specific “pain points” with regard to the OWP. They spoke instead of general issues of concern to them in their work. The hope was that with an understanding of these issues and pain points, the DPUB IG could then assess where the W3C and the OWP could potentially help address them, and could avoid addressing theoretical technical issues that might not in fact align with publishers’ priorities.

While the published interview reports cited above will provide the best understanding of both the common themes and diverse perspectives revealed by the interviews, this report attempts to summarize key observations and offer initial recommendations for subsequent strategies.

2.1 Primary Observations

If there is a single overarching lesson revealed by these interviews, it is that the issues with regard to metadata seen as priorities for publishers and their clients and partners differ significantly between publishing sectors (although they all share all of these issues to some extent).

While all of these issues (discovery via subject metadata and other metadata characterizing content and products, management of content via metadata, development of and participation in cross-publisher platforms and services via metadata, and the communication of rights via metadata) cross all sectors of publishing, it is clear from the interviews that the priorities in distinct sectors diverge significantly.

Another major theme heard in virtually all of the interviews was that metadata is “too complicated.” Book publishers, for example, recognize that ONIX is the standard way to communicate supply chain metadata; as such, it is an extremely rich, complex, and useful standard. Similarly, the BISAC standard is a rich vocabulary used in the US for subject classification; there are similar standards in most other countries or regions, and also a new global standard, Thema. While publishers recognize the value of these standards, they often characterize them as “too hard”; yet when pressed for what an individual publisher needs to communicate (to the supply chain, or about the subjects of its books), they often wind up asking for more complexity. (E.g., a U.S. publisher may want to describe a book as being about “the Battle of the Bulge,” within the topic “World War II,” which itself is in the category of military history; this can be done with BISAC but not with Thema.) The truth is that these systems are complex because what they are designed to do is complex. The desire for an “ONIX Lite” expressed by several interviewees may prove to be unrealistic, because a significantly simpler model would be significantly less expressive.

Another common theme was that in too many cases metadata may exist (or could exist, if applied to a given publication) but often “doesn’t do anything.” It is very frustrating to users if it is the case, or even if it is merely their perception, that going to the work of adding metadata is futile because systems are not seen as using it. (This is of course true of some types of metadata but not others: clearly trade publishers know how their ONIX metadata is used by the supply chain, and scholarly publishers know how their CrossRef metadata is used for citation linking.) This particularly surfaced in the context of the Pearson interviews because complex educational content is created by a vast team of participants, each of whom may have the ability to provide some aspect of metadata but most of whom have no clear understanding of how to do so, no systems that enable them to do so consistently, and no faith that if they “go to all that work,” it will actually be used for any purpose downstream.

In thinking about metadata, it is important to distinguish between metadata that is incorporated within a publication (an EPUB, a website); metadata that is separate from the publication or publications it describes (e.g., ONIX, which can continually change over time without requiring the publications it communicates metadata about to be altered); and metadata that is incorporated in systems designed to provide information about publications (e.g., a publisher’s, retailer’s, or aggregator’s website).

And finally, it should be noted that an important theme that did not emerge from the interviews was the importance of accessibility. Revealing this was one of the benefits of the interview strategy of not asking leading questions: when anybody is asked if accessibility is an important issue, they will almost always say it is. So it is particularly—and lamentably—of note that none of the interviewees mentioned accessibility as a priority issue with regard to metadata.

2.2 Important Themes

The key themes of the interviews conducted by Mr. Kasdorf are summarized in the following appendix. They are the following:

Please see the summary on the group’s wiki page for a discussion of these themes, including important comments by members of the DPUB IG.

The key themes of the interviews conducted by Ms. Solomon are summarized as follows:

Please see Ms. Solomon’s report in the interview document for a more detailed discussion of these themes.

A. Acronyms and Terms Used in the Report

B. List of Some Identifiers for the Publishing Industry

(See also BISG’s Guide to Identifiers.)

List of Identifiers Used by the Publishing Industry

| Category | Identifier | URI/URN Example (if defined) | Authority | Resolution result |
|---|---|---|---|---|
| Creator/Contributor | IPI: Interested Parties Information | | International Confederation of Societies of Authors and Composers (CISAC) | |
| Creator/Contributor | ISNI: International Standard Name Identifier | http://isni.org/isni/0000000134596488 | ISNI Registration Authority, ISO. Current registries include Bowker and Ringgold | ISNI record |
| Creator/Contributor | ORCID: Open Researcher and Contributor ID | http://orcid.org/0000-0002-1825-0097 | ORCID.org | ORCID servers |
| Creator/Contributor | DAI: Digital Author ID | | SURF | |
| Creator/Contributor | Lattes Platform | | Brazilian Government | |
| Creator/Contributor | Deutsche Biographie | http://data.deutsche-biographie.de/sfz26859 | Historischen Kommission of the Bayerischen Akademie der Wissenschaften | RDF description |
| Creator/Contributor | CERL Thesaurus | http://thesaurus.cerl.org/record/cnp01379452 | Consortium of European Research Libraries (CERL) | Resolves to human-readable HTML description; RDF serializations available |
| Creator/Contributor | VIAF: Virtual International Authority File | http://viaf.org/viaf/10179357 | OCLC | HTML or RDF serialization (content negotiation?) |
| Work | DOI: Digital Object Identifier | http://dx.doi.org/10.1007/1-4020-4466-6 | International DOI Foundation | Registered digital object |
| Work (not really) | LoC: Library of Congress Control No. (LCCN) | http://lccn.loc.gov/2003556443 | Library of Congress (LoC) | Metadata, e.g., MARC or MODS/MADS record |
| Work | ISRC: International Standard Recording Code | | National ISRC Agencies | |
| Work | ISTC: International Standard Text Code | | ISTC Agencies | |
| Product & Unit | EPC: Electronic Product Code | urn:epc:id:sgtin:978817525.0766.999999999999 | GS1 | |
| Product & Unit | GTIN: Global Trade Item Number | | GS1 | |
| Product & Unit | ISBN: International Standard Book Number | urn:isbn:978-952-10-9981-6 | International ISBN Agencies | |
| Product & Unit | ISMN: International Standard Music Number | | ISMN agencies | |
| Product & Unit | ISSN: International Standard Serial Number | | ISSN Agencies | |
| Product & Unit | EAN: International Article Number | | Bookwire | Redirection to bib manifestation (in RDF, using schema.org semantics) |
| Library (work metadata like LCCN) | OLID: Open Library ID | https://openlibrary.org/books/OL17870452M/ | Open Library (Internet Archive) | HTML, RDF, JSON |
| Library (Item) | ARK: Archival Resource Key Identifier | http://ark.cdlib.org/ark:/13030/tf5p30086k; http://hdl.handle.net/2027/uiuo.ark:/13960/t7np24670 | ARK Name Assigning Authorities, California Digital Library (CDL) | Object or description; servicing entity is part of the URI |
| Distribution | GLN: Global Location Number | | GS1 | |
| Distribution | SAN: Standard Address Number | | R.R. Bowker | |
| Distribution | SSCC: Serial Shipping Container Code | | GS1 | |
| Discovery | ISLI: International Standard Link Identifier | | ISO | Identifies link between two entities and the nature of the link; under development |
| Work | NOID: Nice Opaque Identifiers | Needs to be combined with an ARK/HDL to be expressed as a URI: http://ark.cdlib.org/ark:/13030/tf5p30086k | California Digital Library | Creates an opaque object identifier known as a NOID |
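For identifiers that have a URI or URN form in the table above, the conversion from the bare identifier is mechanical. The following Python is a minimal sketch of that conversion for ISBN, DOI, and ORCID, using only the URI prefixes that appear in the table's examples. The function names and the `to_uri` dispatcher are hypothetical, the validation is deliberately shallow (no check digits are verified), and the prefixes reflect the examples in this Note rather than any registration authority's current guidance.

```python
import re
from urllib.parse import quote

# Prefixes copied from the example URIs in the table; everything else here
# (function names, validation rules) is an illustrative assumption.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def isbn_to_uri(isbn: str) -> str:
    """Express an ISBN as a URN, keeping the hyphenation that was supplied."""
    compact = re.sub(r"[\s-]", "", isbn)
    if not re.fullmatch(r"\d{9}[\dXx]|\d{13}", compact):
        raise ValueError(f"not a 10- or 13-character ISBN: {isbn!r}")
    return "urn:isbn:" + isbn.strip()


def doi_to_uri(doi: str) -> str:
    """Express a DOI as an HTTP URI, percent-encoding unsafe characters."""
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI of the form 10.prefix/suffix: {doi!r}")
    return "http://dx.doi.org/" + quote(doi, safe="/")


def orcid_to_uri(orcid: str) -> str:
    """Express an ORCID iD as an HTTP URI."""
    if not ORCID_PATTERN.fullmatch(orcid):
        raise ValueError(f"not an ORCID iD: {orcid!r}")
    return "http://orcid.org/" + orcid


CONVERTERS = {"isbn": isbn_to_uri, "doi": doi_to_uri, "orcid": orcid_to_uri}


def to_uri(scheme: str, value: str) -> str:
    """Dispatch on the identifier scheme name (case-insensitive)."""
    try:
        converter = CONVERTERS[scheme.lower()]
    except KeyError:
        raise ValueError(f"no URI form known for scheme {scheme!r}") from None
    return converter(value)


print(to_uri("isbn", "978-952-10-9981-6"))
print(to_uri("doi", "10.1007/1-4020-4466-6"))
print(to_uri("orcid", "0000-0002-1825-0097"))
```

Identifiers with no URI example in the table (for instance GTIN or ISSN) are intentionally left out of `CONVERTERS`, so `to_uri` raises `ValueError` for them instead of guessing at a form.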

C. List of Some RDF/RDFa Outreach Documents

(See also W3C’s list on Semantic Web related books.)

List of RDF and RDFa related outreach documents.

| Title | Targeted sector | Technical level | Date created or updated | Free and open? |
|---|---|---|---|---|
| RDF 1.1 Concepts and Abstract Syntax | Tech | High | 2014-02-25 | Yes |
| RDF 1.1 Primer | General | Intermediate | 2014-06-24 | Yes |
| RDFa 1.1 Primer | Tech | Intermediate | 2014-08-22 | Yes |
| Linked Data: Evolving the Web into a Global Data Space, Tom Heath and Christian Bizer, Morgan & Claypool Publishers (2011), ISBN: 9781608454310 | Tech | High | 2011 | Yes, in HTML format |
| A Semantic Web Primer (Third Edition), Grigoris Antoniou, Paul Groth, Frank van Harmelen, Rinke Hoekstra, ISBN: 0262018284 | Tech | High | 2012 | No |
| Semantic Web for the Working Ontologist (2nd ed.), Dean Allemang and Jim Hendler, Morgan Kaufmann (2011), ISBN: 0123859654 | Tech | High | | No |
| HTML Data Guide | Tech | Intermediate | 2012-03-08 | Yes |
| Cool URIs for the Semantic Web | Tech | Intermediate | 2008-12-03 | Yes |
| Introduction to Linked Open Data (DC2013 Tutorial slide set) | Tech | Intermediate | 2013 | Yes |
| Resource Description Framework, Wikipedia | General | Somewhat technical | Varies | Yes |
| RDF101 (Cambridge Semantics) | General | Beginner | 2012 | Yes |
| Linked Data for Libraries, Archives, and Museums, Seth van Hooland and Ruben Verborgh, Facet Publishing (2014), ISBN: 9781856049641 | Librarians | Beginner | 2014 | Only partially |
| Linked Archival Metadata: A Guidebook (LiAM Project) | Archivists/Librarians/CHI/GLAMs | Beginner | 2014 | Yes |
| Introduction to: RDF | C-Suite | Beginner | 2011 | Yes |
| RDF “Just Enough” Video (IDEAlliance) | Magazines | Simple | 2014 | Yes |
| RDFa Tutorial and demo | General | Beginner/Intermediate | | Yes |
| webplatform.org | General | Beginner/Intermediate | 2012 - ongoing | Yes |

D. Footnotes

1 JATS, the Journal Article Tag Suite, and BITS, the Book Interchange Tag Suite—which share a common markup model below the article and chapter level and which have very rich metadata models and mechanisms—are the current versions of what were previously known as the “NLM DTDs,” the markup and metadata model on which virtually all publications, platforms, and services in the area of scholarly publishing are based. This is unique to scholarly publishing: in no other sector is there such universal consensus on a single markup and metadata model.