Task Forces/Metadata/Michael Steidl Interview

From Digital Publishing Interest Group

Michael Steidl, Managing Director, International Press Telecommunications Council (IPTC) [Bill]

My interview with Michael was an excellent complement to the interview I conducted with Vincent Baby, Chairman of the Board of the IPTC. Whereas Vincent is part of the volunteer leadership (he works for Thomson Reuters), Michael’s full-time responsibility is the very wide-ranging work of the IPTC, an organization that has been deeply involved in the development and use of metadata standards for 35 years. Its standards are extensively implemented globally, primarily in the context of news (including textual, image, and multimedia content).

Although his focus is of course not primarily on books, Michael began by observing that the concept of “book” is evolving due to the emergence of digital books, along with print: a “book” really becomes the intellectual content, not just a product, and even the nature of that intellectual content is changing.

Similarly, in the IPTC’s area, the question “what is news?” is evolving for many of the same reasons. It can mean “professional news,” or it can be broadened to include blogging, news created by private individuals, etc. IPTC focuses on the former: they represent “the professional creation and distribution of news,” not everything that might be new, or “every 10th tweet would be news.”

IPTC has a long history of metadata work. Their first standard with metadata was created in 1979, and it is still in use: “IPTC 7901,” which is a sibling to “ANPA 1312” in the US. (There are only minor differences: they are “95% in common.”)

IPTC also began to focus on multimedia early on, creating IIM (their Information Interchange Model) which was adopted by Adobe in 1994. This was the origin of photo metadata. They have been “deeply involved” in image and multimedia metadata since then. All IIM properties can be expressed in XMP [the eXtensible Metadata Platform, originally an Adobe spec for embedding metadata in Photoshop, Illustrator, etc. files, which is now an international standard]: IPTC Photo Metadata Standard is the metadata vocabulary, XMP is the mechanism for embedding it.
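To illustrate the split between vocabulary and embedding mechanism, here is a minimal sketch that serializes two photo metadata properties as an XMP packet body. It assumes, to the best of my understanding of the IPTC Core mapping, that Headline and City are carried in the `photoshop` namespace. The function name `build_xmp` and the sample values are invented for illustration, and the sketch only produces the XML string; it does not embed it in an image file.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by XMP packets; the property choice is an assumption.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "photoshop": "http://ns.adobe.com/photoshop/1.0/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def build_xmp(headline: str, city: str) -> str:
    """Return an XMP (RDF/XML) string holding two simple properties."""
    xmpmeta = ET.Element(f"{{{NS['x']}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{NS['rdf']}}}RDF")
    desc = ET.SubElement(rdf, f"{{{NS['rdf']}}}Description")
    desc.set(f"{{{NS['rdf']}}}about", "")
    # The vocabulary supplies the property names; XMP supplies the container.
    desc.set(f"{{{NS['photoshop']}}}Headline", headline)
    desc.set(f"{{{NS['photoshop']}}}City", city)
    return ET.tostring(xmpmeta, encoding="unicode")


if __name__ == "__main__":
    print(build_xmp("Flood recedes", "Vienna"))
```

The same property names could in principle be carried by a different container, which is the point of keeping the vocabulary separate from the embedding mechanism.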

They are also deeply involved with identifiers, recognizing that all “creative work needs an identifier.” The underlying question is “What makes a string an identifier?” In the book industry, identifiers like DOI, ISNI, and ISBN are maintained by organizations that formally _issue_ the identifiers. But there’s a big difference between news and books: a given book publisher might publish 5, 10, or (for the giants only) even 1,000 books a week, whereas a mid-sized news agency produces 1,000 items PER DAY, and a large news agency can produce 10,000 items PER DAY. Thus they can’t “hand pick an identifier”; this doesn’t scale for news. Instead, they need SELF-DESCRIBING identifiers. He pointed out that URIs and URLs have _very high relevance_ for this because they are both an identifier and a carrier of information about what is being identified [unlike identifiers such as DOI, ISNI, and ISBN, which just provide the key to _obtain_ information about what is being identified].
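A minimal sketch of what a self-describing identifier could look like: the producer mints a URI from facts it already knows (its own domain, the date, a local sequence number), and a consumer can read those facts back without consulting a registry. The URI layout, the domain `news.example.org`, and the function names are assumptions made for illustration, not an IPTC convention.

```python
from datetime import date
from urllib.parse import urlparse


def mint_item_uri(provider_domain: str, issued: date, sequence: int) -> str:
    """Build a URI that both identifies a news item and describes it."""
    return f"https://{provider_domain}/items/{issued.isoformat()}/{sequence:06d}"


def describe(uri: str) -> dict:
    """Recover provider, date, and sequence number from the URI alone."""
    parsed = urlparse(uri)
    # The path looks like "/items/<ISO date>/<sequence>".
    _, _items, day, seq = parsed.path.split("/")
    return {
        "provider": parsed.netloc,
        "issued": date.fromisoformat(day),
        "sequence": int(seq),
    }


if __name__ == "__main__":
    uri = mint_item_uri("news.example.org", date(2015, 3, 20), 42)
    print(uri)
    print(describe(uri))
```

No central body has to issue anything here: uniqueness comes from the provider's control of its own domain and counter, which is what lets the approach scale to thousands of items per day.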

Another topic he stressed was the issue of METADATA SCHEMAS. There are “lots of organizations creating schemas, and many of their properties are quite similar in terms of semantics.” The problem: there are reasonable schemas within each area of creative work, but “looking across boundaries it is hard to bring them together.” He pointed out that text, video, etc. all have different schemas. In multimedia, you are often dealing with 5, 6, 7, or 8 different metadata schemas at the same time.

The IPTC is trying not to “contribute to metadata proliferation.” Their first rule of thumb: “Is this already defined somewhere?”

RIGHTS METADATA is very important in this regard. IPTC “will not create its own rights metadata schema.” A book may use content, text, graphics, photos, videos, etc. from news agencies and so they need their metadata to be as consistent as possible. They are working with the W3C Community Groups and have been particularly involved in the development of Open Digital Rights Language (ODRL).

They are also a driver of the Linked Content Coalition (LCC), a follow-up to ACAP (Automated Content Access Protocol). The LCC is a group of 40-some organizations (including EDItEUR: both Michael and Graham Bell are directors). They are working on creating a framework for “exchanging rights information across the silos.”

Another big topic: METADATA VALUES.

What’s easy are literal values like dates. Much harder is conceptual information, e.g. “The person in this picture is Mr. So-and-so.” There is a need for a common way to describe entities (people, companies, etc.). Now, with the Semantic Web, it is much more common for each entity to have an identifier.

He pointed out that in the context of news, proprietary identifiers get created within given news organizations because of urgency: “they have to do this right away” in order to manage their information. Now there’s a need for a layer for sharing information. One potential solution is a layer on top of Wikipedia, but each Wikipedia edition is bound to a single language, and Wikipedia links point to a single _article_. Wikidata is an important initiative for extracting “entities and topics” and enabling the application of a “generic identifier” that could, e.g., provide a list of “all articles in all languages associated with this entity.”

The IPTC has done work on subject categorization (originally “Subject Codes,” now the improved “Media Topics”): a vocabulary of over 1,000 content-description terms with a “clear focus on news.” IPTC is working on mapping IPTC Media Topics to proprietary vocabularies used in the news industry and then to Wikidata, thus providing a sort of “hub” between all those proprietary schemes and Wikidata.
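The “hub” idea can be sketched in a few lines: each proprietary vocabulary is mapped once to the hub, and any pair of vocabularies can then be translated by going through it. Every identifier below (`agencyA:ECON`, `hub:economy`, `wikidata:Q-PLACEHOLDER`, and so on) is an invented placeholder, not a real Media Topic code or Wikidata ID, and the function names are hypothetical.

```python
from typing import Optional

# One mapping per vocabulary, each pointing at the shared hub term.
AGENCY_A_TO_HUB = {"agencyA:ECON": "hub:economy"}
AGENCY_B_TO_HUB = {"agencyB:fin-17": "hub:economy"}
HUB_TO_WIKIDATA = {"hub:economy": "wikidata:Q-PLACEHOLDER"}


def invert(mapping: dict) -> dict:
    """Turn a source-to-hub mapping into a hub-to-source mapping."""
    return {hub: source for source, hub in mapping.items()}


def translate(term: str, source_to_hub: dict, hub_to_target: dict) -> Optional[str]:
    """Translate a term by way of the hub; None if either step is unmapped."""
    hub_term = source_to_hub.get(term)
    if hub_term is None:
        return None
    return hub_to_target.get(hub_term)


if __name__ == "__main__":
    # Agency A's term expressed in agency B's vocabulary, then as a Wikidata ID.
    print(translate("agencyA:ECON", AGENCY_A_TO_HUB, invert(AGENCY_B_TO_HUB)))
    print(translate("agencyA:ECON", AGENCY_A_TO_HUB, HUB_TO_WIKIDATA))
```

With a hub, adding a new vocabulary means writing one mapping rather than one per existing vocabulary.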

IPTC made a formal decision in mid-March that all vocabularies will use SKOS [Simple Knowledge Organization System], which enables “matching” of identifiers at different levels. E.g., “my concept identifier relates to your concept identifier” can be declared as an exact match, a subset relationship, or a superset relationship (e.g. “Book Industry” could map as a subset concept to “Economy”).
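The three kinds of match correspond to the SKOS mapping properties `skos:exactMatch`, `skos:broadMatch` (the other concept is broader, i.e. mine is the subset), and `skos:narrowMatch` (the other concept is narrower). The sketch below stores mappings as plain triples and derives the inverse statements. The concept URIs under `vocab.example.org` are invented, and a real system would more likely use an RDF library than tuples.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# exactMatch is symmetric; broadMatch and narrowMatch are inverses of each other.
INVERSE = {
    SKOS + "exactMatch": SKOS + "exactMatch",
    SKOS + "broadMatch": SKOS + "narrowMatch",
    SKOS + "narrowMatch": SKOS + "broadMatch",
}

MINE = "http://vocab.example.org/mine/"
YOURS = "http://vocab.example.org/yours/"

# (subject concept, mapping property, object concept)
mappings = [
    # "Book Industry" is a subset of the broader concept "Economy".
    (MINE + "book-industry", SKOS + "broadMatch", YOURS + "economy"),
    (MINE + "economy", SKOS + "exactMatch", YOURS + "economy"),
]


def with_inverses(triples):
    """Return the triples plus the inverse statement each one implies."""
    result = set(triples)
    for subject, prop, obj in triples:
        result.add((obj, INVERSE[prop], subject))
    return result


if __name__ == "__main__":
    for triple in sorted(with_inverses(mappings)):
        print(triple)
```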

Finally, he talked about FORMATS: how to express a vocabulary, the syntax, etc. He observed that there are “fashions” in formats. “Ten years ago, everything had to be XML; now, XML is old-fashioned, everything has to be JSON.” [BTW I hear this all the time, this “fashion” issue!]

IPTC has a deep involvement in XML, but “with the advent of APIs, XML is too complex, JSON is much simpler.” So now IPTC “needs to reformulate in JSON.” This is an ongoing challenge for standards organizations: “following the fashions of the industry.”
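As a rough illustration of what “reformulating in JSON” involves, the sketch below reads a small XML description of a concept and emits the same information as JSON. The element and attribute names (`concept`, `name`, `id`, `lang`) are invented for this example and are not taken from any IPTC format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML representation of one vocabulary concept.
XML_SOURCE = """
<concept id="ex:economy">
  <name lang="en">Economy</name>
  <name lang="de">Wirtschaft</name>
</concept>
"""


def concept_to_dict(xml_text: str) -> dict:
    """Convert the XML concept into a plain dict suitable for JSON output."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        # Repeated <name> elements collapse into one language-keyed object.
        "names": {name.get("lang"): name.text for name in root.findall("name")},
    }


if __name__ == "__main__":
    print(json.dumps(concept_to_dict(XML_SOURCE), ensure_ascii=False, indent=2))
```

Even in this tiny case the conversion involves a design decision (repeated elements become a keyed object), which suggests why reformulating a whole family of standards is more than a mechanical translation.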