Task Forces/Metadata/Vincent Baby Interview

From Digital Publishing Interest Group

Vincent Baby, Chairman of the Board, International Press Telecommunications Council (IPTC) [Bill]

Vincent Baby is Chairman of the Board of the IPTC (International Press Telecommunications Council), the major source of metadata standards for the news industry, including the widely used IPTC Photo Metadata schemas, the XML-based NewsML, the RDF-based rNews, RightsML, and others. (Mr. Baby, a journalist by training who is currently in a product management role with Thomson Reuters, made it clear that his comments in our conversation were his own personal views; he was in no way speaking for the IPTC or Thomson Reuters.)

Here are some links to relevant IPTC resources sent by Mr. Baby: “IPTC is constantly updating core standards like photo metadata and NewsML and crafting new standards for semantic annotation on the web, embeddable rights expressions and APIs. We have also recently revived our NewsCodes Working Party which develops and maintains provider-agnostic taxonomies.”

. . .

He began by pointing out that the IPTC and W3C overlap and intersect in many ways; currently, the main link is ODRL, the Open Digital Rights Language. IPTC has done a lot of work on a rights expression language that I think could be very valuable to publishers of all sorts, most of whom are unaware of it. Recognizing that the “free text” most such models enable is ineffective, IPTC focused on developing machine-processable metadata. Their RightsML is ODRL with a vocabulary specific to the news industry.
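To illustrate the difference between free-text rights notes and machine-processable ones, here is a minimal Python sketch. It is only loosely modeled on the permission/prohibition structure of ODRL; the class names, field names, and asset identifier are assumptions made for illustration and are not the actual RightsML vocabulary or serialization.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    action: str                       # e.g. "distribute", "archive"
    target: str                       # identifier of the asset the rule covers
    regions: frozenset = frozenset()  # empty set means "no geographic limit"


@dataclass
class Policy:
    permissions: list = field(default_factory=list)
    prohibitions: list = field(default_factory=list)


def is_allowed(policy, action, target, region):
    """Return True if a permission covers the request and no prohibition does."""
    def covers(rule):
        return (rule.action == action and rule.target == target
                and (not rule.regions or region in rule.regions))

    if any(covers(rule) for rule in policy.prohibitions):
        return False
    return any(covers(rule) for rule in policy.permissions)


# Hypothetical policy: the photo may be distributed in DE and FR, never archived.
policy = Policy(
    permissions=[Rule("distribute", "urn:example:photo-1", frozenset({"DE", "FR"}))],
    prohibitions=[Rule("archive", "urn:example:photo-1")],
)

print(is_allowed(policy, "distribute", "urn:example:photo-1", "DE"))  # True
print(is_allowed(policy, "distribute", "urn:example:photo-1", "US"))  # False
print(is_allowed(policy, "archive", "urn:example:photo-1", "DE"))     # False
```

The point of the sketch is that a system can answer "may I do this?" without a human reading a rights note, which is what free text cannot offer.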

The Semantic News Community Group at the W3C “emanated from IPTC members.” However, this activity has been pretty dormant for the past two to three years. He commented that a number of IPTC members are “wary of the W3C” because of IP concerns. However, he mentioned that one of their constituent organizations, EBU (the European Broadcasting Union in Geneva), has been an active participant in the W3C.

The story of rNews is interesting in the context of the other conversations I’ve been having with people from other sides of publishing. The New York Times brought the use case to the IPTC initially: the need to preserve metadata in HTML pages. News publishers have very rich metadata in their CMSs, but it is lost in the content that goes online. What was developed was rNews, an RDFa-based model that makes news more accessible to search engines and to social media and allows better displays such as “rich snippets”.

Here’s the part of the story I found most resonant: in June 2011 they became aware of schema.org, and there was a concern that schema.org would overshadow rNews. They contacted schema.org and found that schema.org was in fact very interested in collaborating and open to the input of representative organizations such as IPTC in various domains. The result was that schema.org now incorporates a small subset of rNews properties. (This resonated with comments others have made about the need for an “ONIX Lite” or some similar ability to get subject metadata into schema.org. BTW, in my view it is probably Thema, the new international subject codes standard, not ONIX, that’s the best candidate for this.)

The result is that many rNews properties are now widely adopted in the news industry, though mainly the subset that’s in schema.org. The big players like the BBC use the full rNews schema.
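As a rough illustration of what “preserving CMS metadata in the HTML page” can look like, the Python sketch below renders a fragment that carries a few schema.org NewsArticle properties as RDFa Lite attributes. The function name, the input dictionary, and the sample values are assumptions made for this example; the full rNews schema defines many more properties than are shown here.

```python
from html import escape


def render_article(meta):
    """Render an HTML fragment carrying schema.org properties as RDFa Lite."""
    def text(key):
        # Escape CMS values so they are safe in element content and attributes.
        return escape(meta[key])

    return (
        '<article vocab="http://schema.org/" typeof="NewsArticle">\n'
        f'  <h1 property="headline">{text("headline")}</h1>\n'
        f'  <meta property="datePublished" content="{text("datePublished")}">\n'
        f'  <span property="dateline">{text("dateline")}</span>\n'
        f'  <span property="articleSection">{text("articleSection")}</span>\n'
        f'  <div property="articleBody">{text("articleBody")}</div>\n'
        '</article>'
    )


# Hypothetical record as it might come out of a CMS.
cms_record = {
    "headline": "Talks stall & markets slide",
    "datePublished": "2014-03-18",
    "dateline": "GENEVA",
    "articleSection": "Business",
    "articleBody": "Negotiators left the table without an agreement.",
}

html_out = render_article(cms_record)
print(html_out)
```

A crawler that understands RDFa can read the headline, section, and publication date straight from the page instead of guessing them from the layout.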

In discussing the challenges publishers face, Mr. Baby named two fundamental issues:

  1. Maximizing engagement with the user base
  2. The need for efficiency and economy

He pointed out that metadata can have a role in both areas.

He remarked that publishers initially focused on SEO, but that proved ineffective because just getting a user to a given web page didn’t build any lasting value or connection with that user, or knowledge about that user. Now they are more focused on social media and semantic markup.

An important aspect of the ecosystem today is the importance of rich multimedia content. Cross-media alignment between assets has become critically important. The problem is that there are different taxonomies and different levels of richness associated with different types of media assets. For example, video has great technical metadata but very poor subject metadata.

It’s not just about text anymore. It’s about text + pictures, interactivity, databases, Twitter, and on and on. This multiplicity of media formats presents a huge challenge: how do you keep track of everything?

On the subject of engagement, one thing he lamented was that embedded metadata winds up getting stripped out when it gets to Twitter, Facebook, etc.: most social networks are “cavalier about metadata—they just toss it out.” He mentioned an “Embedded Metadata Manifesto” that is hoping to counter this [Link provided below].

Another important issue: how do you measure impact? He said that “all kinds of initiatives are working on this.” Over time, publishers will want to add this information to their metadata, e.g. “this is a bestseller” or “this has been retweeted X times.” [Note: that is already intrinsic to ONIX for the book supply chain.]

He had a lot of interesting and resonant points to make in the context of efficiency. [I think a lot of what he pointed out regarding news will become increasingly relevant to most types of publishing, which are moving from a “once-and-done” model to a situation where content evolves over time.]

An area I found particularly relevant to work in other areas was his discussion of “how to automate workflows with metadata?” and “how to mark up content in a way that makes it easy to reuse chunks?” In the news industry, it is common for new versions of a story to build on previous versions, with lots of overlap. But often, when there is only a small proportion of updated content, it becomes a whole new story.

One aspect of this is the need for granular structural semantic markup: What is a quote? What is a “background paragraph”? What is a “lede”? etc. There is a need to “isolate these bits” so that the user can be presented with “just the new bits.”
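The idea of isolating typed chunks so that a reader sees “just the new bits” can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Block` type, the kind labels, and the sample sentences are invented for the example, and blocks are compared by exact match, which a real newsroom system would probably need to relax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    kind: str  # e.g. "lede", "quote", "background"
    text: str


def new_blocks(previous, current):
    """Return the blocks of `current` that did not appear in `previous`, in order."""
    seen = set(previous)
    return [block for block in current if block not in seen]


# Two hypothetical versions of the same story.
v1 = [
    Block("lede", "Talks between the two sides opened on Monday."),
    Block("background", "The dispute began three years ago."),
]
v2 = [
    Block("lede", "Talks between the two sides collapsed on Tuesday."),
    Block("quote", '"We are disappointed," a spokesperson said.'),
    Block("background", "The dispute began three years ago."),
]

for block in new_blocks(v1, v2):
    print(f"[{block.kind}] {block.text}")
```

Here the unchanged background paragraph is filtered out, and only the rewritten lede and the new quote are shown to a returning reader.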

Many people are putting their hopes in an “event-based” model where “unique events” can be identified in advance, with an ID, and then managed over time. Others, including for example the BBC and the Guardian, object that this does not work because this is not how journalists actually work. New stories pop up unexpectedly and then twig off in unexpected directions that can’t be anticipated in advance. (E.g., Ukraine rejects the deal with the EU and a few months later Crimea has been annexed by Russia, with lots happening in between that nobody would have expected.) Newsrooms typically identify the successive iterations of stories using “slugs” (keywords) that enable users (and journalists) to follow how a story evolves. The “storyline” model aims to organize these free-form slugs ex post using a dedicated ontology, thereby leveraging in the background a pre-existing workflow.
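A very reduced Python sketch of the “storyline” idea follows: stories keep the free-form slugs journalists already assign, and a mapping maintained after the fact groups them into storylines. The plain dictionary stands in for the dedicated ontology mentioned above, and every slug, identifier, date, and headline is invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical mapping built ex post: slug -> storyline identifier.
SLUG_TO_STORYLINE = {
    "TRADE-TALKS": "storyline:trade-dispute",
    "TRADE-TALKS-PROTESTS": "storyline:trade-dispute",
    "BORDER-CRISIS": "storyline:trade-dispute",
}


def build_storylines(stories, slug_map):
    """Group (slug, published, headline) tuples by storyline, oldest first."""
    storylines = defaultdict(list)
    for slug, published, headline in stories:
        # Slugs nobody has mapped yet are kept, not dropped.
        storyline_id = slug_map.get(slug, "unassigned")
        storylines[storyline_id].append((published, slug, headline))
    return {sid: sorted(items) for sid, items in storylines.items()}


# Invented sample stories, deliberately out of chronological order.
stories = [
    ("BORDER-CRISIS", date(2020, 3, 2), "Border closed after talks fail"),
    ("TRADE-TALKS", date(2020, 1, 10), "Government walks away from deal"),
    ("LOCAL-WEATHER", date(2020, 1, 11), "Storm warning issued"),
    ("TRADE-TALKS-PROTESTS", date(2020, 1, 20), "Crowds gather in capital"),
]

for storyline_id, items in build_storylines(stories, SLUG_TO_STORYLINE).items():
    print(storyline_id)
    for published, slug, headline in items:
        print(f"  {published} {slug}: {headline}")
```

The design choice worth noting is that journalists change nothing in how they work; the grouping happens in the background, which is the advantage claimed for this model over identifying events in advance.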

In this context, archives have become increasingly important: they help get more engagement from readers. This is another issue that crosses media types. He mentioned that just this week, ProPublica has posted a very interesting draft on their work on how to archive “news applications.” This is based on the need for an interactive database that is queryable on a very local level. How do you archive the stories that make up that repository to enable this to work? [Link provided below]

Another key issue in all this is the human dimension. He pointed out that “journalists are very creative people,” interested in the _substance_, not the process. They’re generally not disciplined about cataloging, labeling, etc. So “metadata becomes a struggle.” Management, on the other hand, recognizes that this metadata is crucial: it’s what makes the content useful to both users and publishers.

Having said that, there can be a “blind faith” that putting the systems and processes in place will resolve the problems: “If we fix our metadata everything will be hunky dory.” This is obviously naïve.

Finally, he raised the familiar issue of the proliferation of devices, OS’s, form factors, etc. He pointed out that during a recent weekend 55% of The Guardian’s content was viewed via mobile, where there are hundreds of different devices and form factors. Therefore, responsive design becomes crucial. A challenge: everything needs to be tested for compatibility across a broad range of browsers and devices.

He closed our conversation with an offer to provide links to a number of useful resources—which in fact he did about an hour after our conversation. Here’s what he sent:

W3C Semantic News Community Group http://www.w3.org/community/semnews/ (purely for reference)

Leading edge innovators in the news publishing industry

Structured news

Dealing with non standard media formats

Semantic web

Embedded metadata

Measuring impact