Task Forces/Metadata

From Digital Publishing Interest Group



Task force on "Metadata"

  • Leader(s): Bill Kasdorf, Apex, bkasdorf@apexcovantage.com, Madi Solomon, Pearson, madi.solomon@pearson.com

Members (Please add your name, organization, and preferred contact email)

  • Ivan Herman, W3C, ivan@w3.org
  • Tzviya Siegman, Wiley, tsiegman@wiley.com
  • Tim Clark, Mass General Hospital, tim_clark@harvard.edu
  • Tom De Nies, Ghent University - iMinds - MMLab, tom.denies@ugent.be
  • Phil Madans, Hachette, phil.madans@hbgusa.com
  • Luc Audrain, Hachette Livre, laudrain@hachette-livre.fr
  • Hajar Ghaem Sigarchian, Ghent University - iMinds - MMLab, hajar.ghaemsigarchian@ugent.be
  • Madi Solomon, Pearson, madi.solomon@pearson.com
  • Julie Morris, BISG, julie@bisg.org
  • Dave Cramer, Hachette, dave.cramer@hbgusa.com
  • Graham Bell, EDItEUR, graham@editeur.org

SCOPE

1. Identify problems re the use of metadata by publishers on the Open Web Platform

[IH:] To emphasize: some of the problems may lead to a request to W3C to start up a new (Interest or Working) Group to solve the issues, because there may not be a target group currently running. Which is perfectly fine, but starting up such a group should clearly identify the use cases, major potential beneficiaries and participants in such an endeavor.

2. Collect Use Cases

NEAR TERM GOALS

Identify problems publishers currently have relating to metadata.

Phase 1 Strategy

Madi and Bill to interview key resource people (Madi within Pearson, Bill across a broad range of publishing segments) by asking the open-ended question "What are the main problems publishers have regarding metadata?"

We are deliberately not asking "what could be improved by the W3C in the OWP to address these issues?" because that is a very hard question for people to answer. Instead, we want to simply surface the issues from a hand-picked group of knowledgeable people to see what their responses indicate. Then it will be the IG's job to determine whether those issues are appropriate to address at all via changes to the OWP, and if so, develop appropriate use cases. It may also be evident that education is needed: i.e., the OWP may already be able to address the problem but publishers just don't know how to do it.

Notes from the interviews:

[BK] Here is just a quick summary of some of the issues that came out of these interviews:

--COMPLEXITY. Many, many folks lamented that there are so many metadata vocabularies, and they are so complicated (and many of them are in constant evolution—ONIX, BISAC, PRISM, etc.) that (a) they are hard to understand, keep up with, and implement properly, and (b) lead to . . .

--INCONSISTENCY. Despite the existence of well established and widely used standards (again, e.g. ONIX, BISAC, PRISM, etc.) they are used inconsistently by both the creators and recipients of metadata. Publishers feel that no two recipients want exactly the same things from them, and recipients lament that no two publishers give them the required metadata in exactly the same way.

--SACRIFICING RICHNESS FOR SIMPLICITY. While many folks on the trade side wish they had an “ONIX Lite,” actually getting to that is not trivial because there is an inherent complexity of what they want to communicate. The clearest counter example comes from the scholarly/STM publishing side, where many folks actually think of metadata as a “solved problem.” Why does it work so well? Because CrossRef was created initially for a single purpose—to enable reference linking—and so they created a very simple spec for the metadata necessary to make that work. On the other hand, folks now expect CrossRef to do Metadata Magic (my phrase, not theirs) and they can’t because they don’t have the metadata they need for the _other_ things folks want, because they have only collected the subset of metadata they needed for their initial use case.

--ONIX vs. SUBJECT METADATA. In the book industry, there are really two distinct ways metadata gets used. Overwhelmingly the main way is via “ONIX Feeds,” the periodic batches of metadata the publishers or their service providers send out to the supply chain (retailers, aggregators, etc.). This is _separate from the book content_. In fact ONIX was initially created to distribute supply chain metadata mainly for physical books; although it has been updated with many more features related to eBooks (ONIX 3.0), there is a lot of resistance in the US to move off of the older ONIX 2.1 because from the point of view of many publishers, “it works,” and also “it’s what the supply chain is asking for.” While yes, you can embed subject metadata in an ONIX record, that is only a tiny slice of what goes in an ONIX record. On the other hand, what is lacking is a way to embed subject metadata _in the book content itself_, either at the title level or at a component level (most common need: chapters; but all agree that embedding subject metadata at a granular level in the content _should_ make it more discoverable, manageable, and useful). But . . .

--FEW BOOKS ARE ONLINE ANYWAY. One reason metadata works so well in scholarly/STM journals is that the journal content is overwhelmingly online, so being able to click on a link in a reference (and journal articles often have hundreds of references) and _get right to the desired content_ is huge. While this is starting to happen with some scholarly books, it is extremely rare in any other side of book publishing. Books are products, whether print or eBooks, that are _discovered, sold, and often delivered online_ but the book content itself is rarely online.

--DISCOVERY (aka MARKETING) IS THE PRIORITY. Book publishers _do_ want to be able to do a better job of identifying the subjects of books, chapters, and components. They already have a lot of vocabularies designed to do just that. Plus many folks express the need for simple keywords: that is, NOT a controlled vocabulary, just let the publisher or the editor or the marketer or the author put in the damn file whatever words they think will make the right people find them and buy them. One contrasting example: Kevin Hawkins pointed out that for the University of Michigan, they are actually spending LESS effort on cataloguing [from the library perspective] because for content that _is_ online, people use search engines to find things, not library catalogs. To tie those two POVs together was a point made by Thad McIlroy: for discovery, what matters, really, is _discovery via Google_. (Thad has a very practical, down-to-earth, get-real orientation.)

--IDENTIFIERS, IDENTIFIERS, IDENTIFIERS. We can talk all day long about metadata but if we ain’t got identifiers we ain’t got nuthin’. (Again, my editorial opinion.)

--AND NOW FOR SOMETHING COMPLETELY DIFFERENT: NEWS. Please read the interviews with Vincent Baby and Michael Steidl of the IPTC. You will see that the news industry has done a TON of work on metadata since 1979; they’ve been involved with the W3C and the Semantic Web all along; they really grok metadata. And they have an interesting perspective because of (a) the enormous firehose of content and images and media they need to manage, and (b) the speed with which everything has to happen. They can’t wait for some agency to issue an identifier for something; they need self-describing identifiers. It’s creative work, so RIGHTS METADATA is crucial. In the new multimedia world we are living in, metadata standards don’t align well (and the IPTC is working to help address this). They have a TON of vocabularies, standards, etc. (provided as links in the reports of my interviews with Vincent and Michael). They are even keeping up with “fashion”: despite their long commitment to XML, they realize that JSON is ascendant (while nowhere as rigorous or useful, just way easier to implement—my editorial comment, not theirs) so they are working on JSONizing standards they’ve got as XML or RDFa.


[PM] I've read all of the responses and have a few comments. I can't speak for publishers in general, of course, but as a publisher of my type of general trade content:

1. We want to make our books, authors and content more discoverable online. So I can sell more books. It is all about marketing.

2. The problem is that online discovery has never been nearly as effective as discovery in a bookstore or library. And we don't really know how to make it better. Or if we can make it better. There are a lot of other elements involved than search and SEO. We continue to experiment online. Use of Social Media has had some success but not nearly enough.

3. The problem is that the number of bookstores, both chain and independent, continues to decrease. The failure of Borders a few years ago had a devastating impact on the business. Also Book Review sections in newspapers and magazines are dwindling, as printed newspapers dwindle. The fact that consumers shop for books in bookstores, but actually buy them online from the site with the lowest prices isn't helping the picture either.

4. We want to make up for this deficit and we are hoping that the more effective use of metadata can help.

5. I agree with Fran Toolan that ONIX isn't the answer, that ONIX isn't for the Web. I spend a lot of time on ONIX and originally had thought there may be a place for it, but now I think the ONIX message is not the answer. One problem I had with some of the responses Bill posted is that ONIX and metadata were being discussed as if they were one and the same thing. The important thing is to have good, clean metadata. The transport mechanism is something completely different. We also need to keep in mind that an awful lot of metadata in the industry is still transmitted on spreadsheets.

6. What we need to agree on is who is the target audience for the metadata. And I think we are talking about the end consumer. ONIX is not right for that. ONIX is purely a B2B message. I agree with the comments that ONIX is very complex and is used for many different reasons, bibliographic data and marketing collateral for display, but also back end data like sales territories and product dimensions for warehousing. But even for the bibliographic and marketing aspects, ONIX needs to be translated for the consumer. I don't transmit BISAC Categories, I transmit BISAC Codes and depend on my trading partners to translate those codes into the literal BISAC Categories for presentation to the end consumer. And most map the BISAC to their own more targeted categories.

7. I think Publishers as a whole could use guidance on how to use metadata on their own web pages. schema.org has been mentioned a lot. Maybe the task of the task force should be to educate publishers on its use. The main question remains: how can I, as a publisher, take the metadata I create and use for my ONIX feed or my spreadsheet feed to Fran Toolan's company and use it to promote online discovery of my products and authors?
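As a rough illustration of point 7, the sketch below takes one row of the kind of metadata a publisher already holds (for an ONIX or spreadsheet feed) and emits schema.org markup that could be placed on the publisher's own product page. It assumes JSON-LD as the embedding syntax, which is only one of the options schema.org supports. The record, its field names, the ISBN, and the helper functions are hypothetical placeholders; only the schema.org property names (name, author, isbn, publisher, datePublished, keywords) come from the schema.org vocabulary.

```python
import json

# Hypothetical record, e.g. one row of a publisher's spreadsheet feed.
# Field names and values are illustrative placeholders, not real data.
record = {
    "title": "An Example Title",
    "author": "A. N. Author",
    "isbn13": "9780000000002",
    "publisher": "Example Press",
    "pub_date": "2014-02-04",
    "keywords": ["example", "metadata"],
}


def to_schema_org_book(rec):
    """Map the illustrative record onto schema.org Book properties."""
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": rec["title"],
        "author": {"@type": "Person", "name": rec["author"]},
        "isbn": rec["isbn13"],
        "publisher": {"@type": "Organization", "name": rec["publisher"]},
        "datePublished": rec["pub_date"],
        "keywords": ", ".join(rec["keywords"]),
    }


def to_script_tag(rec):
    """Wrap the JSON-LD in a script element for inclusion in a web page."""
    payload = json.dumps(to_schema_org_book(rec), indent=2)
    # Escape "</" so a value can never close the script element early;
    # "\/" is a valid JSON escape for "/", so the data is unchanged.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


print(to_script_tag(record))
```

The point of the sketch is that the publisher's existing metadata is reused as-is; only the mapping to a web-facing vocabulary is new.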

One more thought. It occurred to me that we may be overlooking an important constituency: Booksellers. Not the major chains, but the smaller chains and independent booksellers. Most of the good ones have an online presence and could probably do with better discoverability. Even though independents serve their own community, better online presence can help expand the community and hopefully help keep more in business. indiebound.org is a website run by the American Booksellers Association that tries to point potential customers to local booksellers. I Googled a title featured on the indiebound.org home page and Amazon was number one of course and Indiebound was nowhere to be found.


[GB] Agree with Phil that the conflation of metadata and ONIX is misleading. ONIX is a well-accepted way of transmitting highly-structured metadata within the book and e-book commercial supply chain - and it allows various controlled vocabularies to be used (eg to describe the subject of a book, to describe the returns status of a book, to describe the physical or digital nature of the book). In some of the discussion above we are talking about metadata embedded within the content, and in other parts of the discussion, we are talking about metadata that is separated from the content. These two cases clearly demand different syntaxes - while you CAN embed an ONIX file directly inside an EPUB 3 package, it isn't really what ONIX is for. And the third option is metadata embedded in web pages about the content (eg a catalog page in an online bookstore). So there are (at least) three separate but related use cases:

  i. metadata delivered in bulk, separate from the content or resource itself (eg as part of the commercial supply chain)
  ii. metadata delivered embedded within the content or resource it describes (eg within an EPUB, within a web page)
  iii. metadata delivered embedded within web pages describing the content or resource (eg in an online store, repository or catalog), possibly separate from the metadata displayed (for humans) on those pages

BUT

Given the reluctance of book publishers and retailers to invest more in metadata (viz lack of uptake of a work identifier like ISTC, lack of interest in a release identifier analogous to GRID, slow migration to ONIX 3.0 in countries where 2.1 was most firmly embedded…), it seems to me to be critical that we don't further burden the industry with 'yet another data format to ignore'. As Phil implies in his point 5, the important thing is to have good metadata, and it doesn't much matter how it is expressed – so long as it can be transformed from one expression to another easily and without loss of meaning. I suspect the best way around this is to retain as much of the semantics of ONIX, while thinking about a syntax that would allow that metadata to be embedded in e-publications and online content. This would avoid publishers having to manage two or three parallel and distinct sets of metadata. Separating ONIX semantics ('what do we mean by pub date, by imprint, by title?') from the XML message (which is 'merely' a convenient syntax used for transmitting the data along a data supply chain in bulk), and allowing ONIX-style data to be expressed in other syntaxes or data formats seems (to me) to be the way to go.

At the same time, the ONIX model (which separates metadata from content) does make it simpler to keep the metadata up to date. The dynamic nature of metadata, particularly in the weeks and months leading up to publication, is critical to the book trade. Perhaps only a certain subset of ONIX needs to be 'embeddable in content' where it is more problematic to update.

This approach would also allow a bookseller in receipt of 'traditional' XML-based ONIX to create web pages about the book that include embedded metadata without doing any heavyweight conversion, because the semantics would be compatible.
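The idea of keeping one set of semantics while varying the syntax can be sketched as follows. This is only an illustration under stated assumptions: the field names (title, imprint, pub_date, subject_code) and the element names are hypothetical stand-ins, not actual ONIX tags, and the two serializations are generic XML and JSON rather than any agreed embedding format. The check at the end expresses the "without loss of meaning" requirement as a round trip.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical shared semantics: the agreed meaning of each field lives
# here, independent of any syntax. These are NOT real ONIX tag names.
FIELDS = ("title", "imprint", "pub_date", "subject_code")

# Illustrative placeholder record.
record = {
    "title": "An Example Title",
    "imprint": "Example Imprint",
    "pub_date": "2014-02-04",
    "subject_code": "FIC000000",
}


def to_xml(rec):
    """Bulk supply-chain style syntax: one XML element per field."""
    root = ET.Element("product")
    for name in FIELDS:
        ET.SubElement(root, name).text = rec[name]
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Read the XML form back into the shared field names."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


def to_json(rec):
    """Web-embeddable style syntax: the same fields as a JSON object."""
    return json.dumps({name: rec[name] for name in FIELDS}, sort_keys=True)


def from_json(text):
    """Read the JSON form back into the shared field names."""
    return json.loads(text)


# Converting through either syntax must give back the same record.
assert from_xml(to_xml(record)) == record
assert from_json(to_json(record)) == record
print(to_xml(record))
print(to_json(record))
```

Because both serializers draw on the same field list, a recipient of the XML form could produce the JSON form mechanically, which is the kind of lightweight conversion described above.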

Note that metadata delivered embedded within the content does little or nothing for 'discoverability', since the books themselves are not typically available on the open web. It might enhance discoverability if major search engines were given privileged access to content available within walled gardens (eg to the libraries of subscription services), but publishers have historically been wary of this approach. For trade publishers in particular, the ability to discover a book via its metadata prior to publication is also critical.

AND YET

We need to bear in mind that ONIX does not cover the full spectrum of publishing – its domain (the commercial book and e-book trade) is broad, but still narrower than that of EPUB and arguably narrower than the remit of the DPIG.

[GB] Commenting on Fran Toolan's statement (as paraphrased in the interview report) "'ONIX is irrelevant to the W3C.' It doesn't show up anywhere on the Web." While in a strict sense this is true, much of the data visible on an Amazon catalog page is an HTML rendering of the ONIX data sent from the publisher to Amazon.

[GB] Commenting on Renee Register and Thad McIlroy's statements about ONIX being 'too complex' for normal publishers to author. This is really a user interface problem for applications. There are numerous third-party metadata management applications (the best of which integrate data management with other aspects of the publishing process), and these applications make creation of an ONIX message straightforward. They range from relatively high-cost enterprise software (from Klopotek, Publishing Technology etc) and services (eg from Firebrand etc) to low-cost applications intended for the small publisher (eg ONIXsuite, ONIXedit, Stison and many others). However, I do agree with Renee, Thad (and with Fran Toolan, who also mentioned this) that life is difficult for recipients of ONIX data (eg retailers).

BASIC ISSUES

1. Is the DPIG the right group to address these issues?

Is this issue within the scope and charter of the DPIG, and is this Task Force the proper group to address it, or does the metadata issue require a separate IG or other activity within the W3C to be addressed properly?

[TF: Add your comments below, prefixed by your name in brackets. Note that comments added to the next question will affect our ultimate answer to this question.]

[RS] From email: the distinction between cataloguing requirements for metadata versus providing recommendations as to how to convey the metadata is important to keep clear. Is there a way to fulfill these needs, rather than picking one way, often from many existing methods? Given that understanding, I think it is in scope for DPIG, and valuable to the wider community.

BK: Does the scope of the DPIG include libraries? The focus so far has been on what publishers need from the OWP. Publishers do a lot with metadata but they don't think a whole lot about cataloguing explicitly. That, instead, is mainly a concern of librarians. We need clarification whether the needs of librarians are part of the scope of DPIG. Librarians are obviously very active in other W3C activities, particularly Semantic Web activities.

[PM:] After thinking about yesterday's meeting, I don't know that we should discount Libraries here. I certainly agree with Luc's point that for publishers Metadata is very much about discovery--connecting with potential readers. This goes for the digital and physical worlds, starting with book jackets and advertisements, which are nothing if not containers of metadata, as much as an ONIX feed is. Libraries also use metadata for discovery purposes to connect with their patrons. Isn't this what cataloging is about: providing enough information for patrons to find exactly what they need? There are obviously big differences between the publishing and library worlds, but there are enough similarities in terms of metadata that can be addressed.

One of the issues publishers face is that there are a number of organizations that create and refine metadata: publishers, distributors like Baker & Taylor, Bowker, and library sources like OCLC and the Library of Congress here in the U.S. But there is very little interoperability among the participants. We provide metadata to B&T. They augment our categories, but don't tell us about it. The libraries add some very rich metadata in terms of keywords, character profiles, but access is not easy or inexpensive.

OCLC piloted a program a few years ago where they took our ONIX feed and enhanced it with more of their metadata and sent it back to us. I think they finally offered this as a commercial product. We dropped out in the early pilot stage for other reasons. But it was a good idea, very hard to implement.

This probably belongs in a use case section, but I wanted to raise the point that looking at it just from a publisher perspective may be too narrow. --Phil Madans (talk) 15:25, 4 February 2014 (UTC)

[Luc] I think this is the right place and a chance for us publishers to bring our needs for the OWP enabling better usage of our ebooks content.

1. Global ebooks discovery phase is important not only for ebookstores to display correctly our ebooks metadata before selling, but also for the reader ebooks library to be well categorized.

Just think about discipline in textbooks: today nobody in the B2C supply chain is able to say which discipline a school book is about, unless the publisher adds it to the title. We need this info to be a specific field on ebookstores for search and display.

We already have ONIX for that global purpose and as members of EDItEUR, we are working in its evolution.

2. But my major concern is about content metadata. Converting our XML files to HTML5 brings us from a semantic world to a dumb world. In our publishing companies, we have been working hard for years to move content creation toward structure and meaning, which we have achieved on a large number of subjects with XML vocabularies. We therefore need this WG to help us bring that structure and meaning to the OWP so that we can propagate it inside the text of EPUB files.

Determining the OWP recommendation to enable this is, IMO, a proper goal of this WG.

2. Are there deficiencies in the current OWP that need to be remedied?

If the answer to (1.) is yes, then are the problems and use cases identified by this TF due mainly to inadequate understanding and use of already existing capabilities of the OWP, or are there specific improvements to the OWP standards required to enable publishers to use metadata appropriately?

[TF: Add your comments below, prefixed by your name in brackets.]

[BK] One approach: For each issue raised below (and thus any resulting use cases) we should ask: "Can this be addressed without a change to the OWP?" If the answer is yes, that would imply that we don't need any of the existing components of the OWP to be modified; however, that does not necessarily imply that a new initiative within the W3C might be called for. Case in point: when my colleagues and I create XHTML-based models for publishers (as foundational models for workflow, repository, archive, etc.) we typically need to devote a
to a big "metadata header" of the sort other models (NLM/JATS/BITS, TEI, DocBook) provide as a specific feature. Moreover, we then need to use the @class attribute, e.g.