This is an archive of an inactive wiki and cannot be modified.

Multimedia Semantics on the Web: Vocabularies

The nice thing about standards is that there are so many to choose from.
Andrew S. Tanenbaum


This version
http://www.w3.org/2005/Incubator/mmsem/wiki/Vocabularies

Previous version
http://www.w3.org/2001/sw/BestPractices/MM/resources/Vocabularies.html

Incubator Report
http://www.w3.org/2005/Incubator/mmsem/XGR-vocabularies/

Editors

Contributors


Index


1. Introduction

This document is based on unfinished work performed in the context of the W3C MMSEM-XG. All information provided here serves purely as a list of examples, and inclusion on this page does not imply endorsement by the W3C membership or the Incubator Group.

TODO: Provenance. TODO: Scope, off-topic.

1.1. Declaration of Namespaces

Using N3 syntax, the namespaces used herein are the following:

 @prefix xsd: <"http://www.w3.org/2001/XMLSchema#">
 @prefix rdf: <"http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 @prefix rdfs: <"http://www.w3.org/2000/01/rdf-schema#">
 @prefix owl: <"http://www.w3.org/2002/07/owl#">
 @prefix dc: <"http://purl.org/dc/elements/1.1/">

Note: If necessary, namespaces might be defined locally (e.g. in examples).

1.2. Related Pages

See the Tools and Resources page for complementary information regarding multimedia metadata, viz. usage, availability, projects, etc.

2. Types of Multimedia Metadata

Based on (Smith et al., 2006), the vocabularies are categorised and/or described in terms of:

Discriminator/Category | Value/Item | Example (for NewsML)
Representation | non-XML (nX), XML (X), RDF (R), OWL (O) | X, R
Content Type | still-image (SI), video (V), audio (A), text (T), general purpose (G) | V, A
Workflow | premeditation, production, publish, etc. | publish
Domain | entertainment, news, sports, etc. | news
Industry | broadcast, music, publishing, etc. | broadcast

Fig. 1 Discriminators and categories for multimedia metadata standards used herein.


3. Existing Multimedia Metadata

This section briefly introduces common existing metadata standards that are of importance for the description and usage of multimedia essence.

Each subsection starts with a table containing the responsible party, the specification (if available online), and a list of discriminators/categories. Where a standardized mapping (or an alternative approach) exists for translating a vocabulary mentioned herein to RDF/OWL, a pointer to the corresponding section is provided in the formal representation field.

TODO: For each MM standard, discuss overlap/complementarity with others (was: Sec. 4: Comparison of Existing Vocabularies).

3.1. MM standards for describing Still Images

3.1.1. Visual Resource Association (VRA)

Responsible

http://www.vraweb.org/

Specification

http://www.vraweb.org/vracore3.htm

Formal Representation

VRA - RDF/OWL

Discriminators/Categories

nX

SI

publish

culture

archives

The Visual Resource Association (VRA) is an organization consisting of over 600 active members, including many American universities, galleries and art institutes. These often maintain large collections of (annotated) slides, images and other representations of works of art. The VRA has defined the VRA Core Categories to describe such collections: the VRA Core is "a set of metadata elements ... to describe works of visual culture as well as the images that document them".

Where the Dublin Core specifies a small and commonly used vocabulary for on-line resources in general, VRA Core defines a similar set targeted especially at visual resources. Dublin Core and VRA Core both refer to terms in their vocabularies as elements, and both use qualifiers to refine elements in a similar way. The more general elements of VRA Core have direct mappings to comparable fields in Dublin Core. Furthermore, both vocabularies are defined in a way that abstracts from implementation issues and underlying serialization languages.
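
As an informal sketch of how a VRA Core record could look once mapped to RDF (the vra: namespace URI and the element names are illustrative, not taken from either of the mappings discussed in section 4.1):

 @prefix vra: <http://www.vraweb.org/vracore/3.0#> .   # illustrative namespace

 <http://example.org/images/nightwatch-slide.jpg>
     vra:title "Slide of The Night Watch" ;
     vra:creator "Rembrandt van Rijn" ;
     vra:culture "Dutch" ;
     dc:relation <http://example.org/works/nightwatch> .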

3.1.2. Exchangeable image file format (Exif)

Responsible

http://www.jeita.or.jp/english/

Specification

http://www.digicamsoft.com/exif22/exif22/html/exif22_1.htm

Formal Representation

Exif - RDF/OWL

Discriminators/Categories

nX

SI

capture-distribute

generic

digital camera

One of today's commonly used metadata formats for digital images is the Exchangeable Image File Format (Exif). The standard "specifies the formats to be used for images, sound and tags in digital still cameras and in other systems handling the image and sound files recorded by digital still cameras." The so-called Exif header carries the metadata for the captured image or sound.

The metadata tags which the Exif standard provides cover metadata related to the capturing of the image and the context of the capturing. This includes metadata related to the image data structure (e.g., height, width, orientation), capturing information (e.g., rotation, exposure time, flash), recording offset (e.g., image data location, bytes per compressed strip), image data characteristics (e.g., transfer function, color space transformation), as well as general tags (e.g., image title, copyright holder, manufacturer). Nowadays, new cameras also write GPS information into the header. Lastly, we point out that metadata elements pertaining to the image are stored in the image file header and are identified by unique tags, which serve as element identifiers.
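
A sketch of how a few Exif tags could look once lifted to RDF, using the W3C-hosted namespace of the Kanzaki schema discussed in section 4.2.1 (the tag selection and values are illustrative):

 @prefix exif: <http://www.w3.org/2003/12/exif/ns#> .

 <http://example.org/photos/dsc0001.jpg>
     exif:make "Canon" ;
     exif:model "EOS 350D" ;
     exif:exposureTime "1/125" ;
     exif:fNumber "5.6" .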

3.1.3. NISO Z39.87

Responsible

http://www.niso.org/

Specification

http://www.niso.org/standards/resources/Z39-87-2006.pdf

Discriminators/Categories

X

SI

production

interoperability

image creation

Tags cover a wide spectrum of metadata: basic image parameters, image creation, imaging performance assessment, and history. This standard is intended to facilitate the development of applications to validate, manage, migrate, and otherwise process images of enduring value. Such applications are viewed as essential components of large-scale digital repositories and digital asset management systems.

The dictionary has been designed to facilitate interoperability between systems, services, and software as well as to support the long-term management of and continuing access to digital image collections.

3.1.4. DIG35

Responsible

http://www.i3a.org/

Specification

http://xml.coverpages.org/FU-Berlin-DIG35-v10-Sept00.pdf

Formal Representation

DIG35 - RDF/OWL

Discriminators/Categories

X

SI

publish

archives

consumer

The DIG35 Specification includes a "standard set of metadata for digital images" which promotes interoperability and extensibility, as well as a "uniform underlying construct to support interoperability of metadata between various digital imaging devices."

Tags cover: Basic Image Parameters (a general-purpose metadata standard); Image Creation (e.g. the camera and lens information); Content Description (who, what, when and where); History (partial information about how the image got to the present state); Intellectual Property Rights; Fundamental Metadata Types and Fields (defining the format of the fields used in all metadata blocks). Metadata is encoded using XML Schema.

Minus: DIG35 Metadata Specification Version 1.1 is not free ($35)

3.1.5. PhotoRDF

Responsible

http://www.w3.org/

Specification

http://www.w3.org/TR/photo-rdf

Formal Representation

PhotoRDF

Discriminators/Categories

R

SI

capture-distribute

personal media

photo

PhotoRDF is an attempt to standardize categories and labels for personal photo collections. The standard evolved early on, but has not been developed since 2002; the latest version is a W3C Note from 19 April 2002. The standard acts as an umbrella over several other standards that together should serve "a project for describing & retrieving (digitized) photos with (RDF) metadata". The metadata is separated into three different schemas: a Dublin Core schema, a technical schema and a content schema. As the standard aims to be short and simple, it covers only a small set of properties. The Dublin Core schema is adopted for those parts of a photo that need a description of creator, editor, title, date of publishing and so on. With regard to the technical aspects of a photo, however, it leaves us with fewer properties than the Exif standard. For the actual description of the content, the content schema defines a very small set of keywords that shall be used in the "subject" field of the Dublin Core schema.

All in all, PhotoRDF addresses well the demand for a small standard describing personal photos, both for personal media management and for publishing photos and exchanging them between different tools. It covers the different aspects of a photo, ranging from the camera information to the subject of the photo. However, the standard fails to cover some central aspects of photos that are needed for interoperability of photo tools and photo services. For example, the place or position of a photo is not addressed, nor is photographic information such as aperture. Also, the content description property is limited to a small number of keywords. The trend towards tagging had not been foreseen at the time the standard was developed.
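
A minimal sketch of a PhotoRDF-style description in N3 (the tech: namespace URI is a placeholder for the technical schema, and the property name is illustrative; the dc: terms are those the Note reuses):

 @prefix tech: <http://example.org/photordf/technical#> .   # placeholder namespace

 <http://example.org/photos/family.jpg>
     dc:creator "John Doe" ;
     dc:date "2002-04-19" ;
     dc:subject "portrait" ;       # a keyword from the content schema (illustrative)
     tech:camera "Kodak DC210" .   # illustrative technical property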

3.2. MM standards for describing Audio Content

The following subsections give an overview of music-related MM standards.

3.2.1. ID3

Responsible

http://www.id3.org/

Specification

http://www.id3.org/

Discriminators/Categories

nX

A

distribute

generic

music

ID3 is a metadata container used with the MP3 audio file format. It allows stating information about the title, artist, album, etc. of a file (embedded in the file itself). The ID3 specification tries to cover a broad range of use cases; among other things, a list of genres is defined.

3.2.2. MusicBrainz Metadata Initiative 2.1

Responsible

http://musicbrainz.org/

Specification

http://musicbrainz.org/MM/

Discriminators/Categories

R

A

production

generic

music

MusicBrainz defines an RDFS-based vocabulary in which three namespaces are defined. The core set is capable of expressing basic music-related metadata (such as artist, album, track, etc.). Instances in RDF are made available via a query language. The third namespace is reserved for future use in expressing extended music-related metadata (such as contributors, roles, lyrics, etc.).
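
A sketch of what such MusicBrainz RDF could look like (the mm: prefix follows the 2.1 namespace convention; the URIs and the exact choice of properties are illustrative):

 @prefix mm: <http://musicbrainz.org/mm/mm-2.1#> .

 <http://example.org/artist/1> a mm:Artist ;
     dc:title "Example Artist" .

 <http://example.org/track/1> a mm:Track ;
     dc:title "Example Track" ;
     dc:creator <http://example.org/artist/1> ;
     mm:duration "215000" .   # illustrative: duration in milliseconds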

3.2.3. MusicXML

Responsible

http://www.recordare.com/

Specification

http://www.recordare.com/xml.html and http://www.recordare.com/dtds/index.html

Discriminators/Categories

X

A

production

generic

music

Recordare has developed MusicXML technology to create an Internet-friendly method of publishing musical scores, enabling musicians and music fans to get more out of their online music.

MusicXML is a universal translator for common Western musical notation from the 17th century onwards. It is designed as an interchange format for notation, analysis, retrieval, and performance applications. The MusicXML format is open for use by anyone under a royalty-free license, and is supported by over 75 applications.

3.3. MM standards for describing Audio-Visual Content

3.3.1. MPEG-7

Responsible

http://www.iso.org/iso/en/prods-services/popstds/mpeg.html

Specification

http://www.chiariglione.org/MPEG/standards/mpeg-7/mpeg-7.htm

Formal Representation

MPEG-7 - RDF/OWL

Discriminators/Categories

nX, X

SI, V, A

archive-publish

generic

generic

The MPEG-7 standard, formally named "Multimedia Content Description Interface", aims to be an overall standard for describing any multimedia content. MPEG-7 standardizes so-called "description tools" for multimedia content: Descriptors (Ds), Description Schemes (DSs) and the relationships between them. Descriptors are used to represent specific features of the content, generally low-level features such as visual (e.g. texture, camera motion) or audio (e.g. melody) ones, while description schemes refer to more abstract description entities (usually a set of related descriptors). These description tools as well as their relationships are represented using the Description Definition Language (DDL), a core part of the standard. The W3C XML Schema recommendation has been adopted as the most appropriate schema language for the MPEG-7 DDL, with a few extensions (array and matrix datatypes) added in order to satisfy specific MPEG-7 requirements. MPEG-7 descriptions can be serialized as XML or in a binary format defined in the standard.

The comprehensiveness results from the fact that the standard has been designed for a broad range of applications and thus employs very general and widely applicable concepts. The standard contains a large set of tools for diverse types of annotations on different semantic levels (the set of MPEG-7 XML Schemas defines 1182 elements, 417 attributes and 377 complex types). The flexibility is very much based on the structuring tools and allows the description to be modular and on different levels of abstraction. MPEG-7 supports fine-grained description, and it provides the possibility to attach descriptors to arbitrary segments on any level of detail of the description. The possibility to extend MPEG-7 according to the conformance guidelines defined in part 7 provides further flexibility.

Two main problems arise in the practical use of MPEG-7 from its flexibility and comprehensiveness: complexity and limited interoperability. The complexity is a result of the use of generic concepts, which allow deep hierarchical structures, the high number of different descriptors and description schemes, and their flexible inner structure, i.e. the variability concerning types of descriptors and their cardinalities. This sometimes causes hesitance in using the standard. The interoperability problem is a result of the ambiguities that exist because of the flexible definition of many elements in the standard (e.g. the generic structuring tools). There can be several options to structure and organize descriptions which are similar or even identical in terms of content, and they result in conformant, yet incompatible descriptions.

The description tools are defined using DDL. Their semantics is described textually in the standard documents. Due to the wide application area, the semantics of the description tools are often very general. Several works have already pointed out the lack of formal semantics of the standard that could extend the traditional text descriptions into machine-understandable ones. These attempts, which aim to bridge the gap between the multimedia community and the Semantic Web, either for the whole standard or just one of its parts, are detailed below.

3.3.1.1. Profiles

Profiles and levels have been proposed as a means to reduce the complexity of MPEG-7 descriptions (ISO, 2005). Like in other MPEG standards, profiles are subsets of the standard that cover certain functionalities, while levels are flavours of profiles with different complexity. In MPEG-7, profiles are subsets of description tools for certain application areas; levels have not yet been used. The proposed process for defining a profile consists of three steps: the selection of description tools, the definition of constraints on these tools, and the definition of semantic constraints.

The results of tool selection and of the definition of tool constraints are formalized using the MPEG-7 DDL and result in an XML schema, like the full standard.

Several profiles have been under consideration for standardization and three profiles have been standardized (they constitute part 9 of the standard, with their XML schemas being defined in part 11):

Simple Metadata Profile (SMP)
Allows describing single instances of multimedia content or simple collections. The profile contains tools for global metadata in textual form only. The proposed Simple Bibliographic Profile is a subset of SMP. Mappings from ID3, 3GPP and EXIF to SMP have been defined.
User Description Profile (UDP)
Its functionality consists of tools for describing user preferences and usage history for the personalization of multimedia content delivery.
Core Description Profile (CDP)
Allows describing image, audio, video and audiovisual content as well as collections of multimedia content. Tools for the description of relationships between content, media information, creation information, usage information and semantic information are included. The profile does not include the visual and audio description tools defined in parts 3 and 4.

The adopted profiles will not be sufficient for a number of applications. If an application requires additional description tools, a new profile must be specified. It will thus be necessary to define further profiles for specific application areas. For interoperability it is crucial that the definitions of these profiles are published, so that conformance to a certain profile can be checked and mappings between the profiles can be defined. It has to be noted that all of the adopted profiles just define the subset of description tools to be included and some tool constraints; none of the profile definitions includes constraints on the semantics of the tools that clarify how they are to be used in the profile.

Apart from the standardized ones, a profile for the detailed description of single audiovisual content entities, called the Detailed Audiovisual Profile (DAVP), has been proposed. The profile includes many of the MDS tools, such as a wide range of structuring tools, tools for the description of media, creation and production information, textual and semantic annotation, and summarization. In contrast to the adopted profiles, DAVP includes the tools for audio and visual feature description, which was one motivation for the definition of the profile. The other motivation was to define a profile that supports interoperability between systems using MPEG-7 by avoiding possible ambiguities and clarifying the use of the description tools in the profile. The DAVP definition thus includes a set of semantic constraints, which play a crucial role in the profile definition. Due to the lack of formal semantics in DDL, these constraints are only described in textual form in the profile definition (Bailer et al., 2006).

3.3.1.2. Controlled vocabularies in MPEG-7

Annotation of content often contains references to semantic entities such as objects, events, states, places, and times. In order to ensure consistent descriptions (e.g. to make sure that persons are always referenced with the same name), some kind of controlled vocabulary should be used in these cases. MPEG-7 provides a generic mechanism for referencing terms defined in controlled vocabularies. The only requirement is that the controlled vocabulary is identified by a URI, so that a specific term in a specific controlled vocabulary can be referenced unambiguously. In the simplest case, the controlled vocabulary is just a list of possible values of a property in the content description, without any structure. The list of values can be defined in a file accessed by the application or can be taken from some external source, for example the list of countries defined in ISO 3166. The mechanism can also be used to reference terms from other external vocabularies, such as thesauri or ontologies.

Classification schemes (CSs) are an MPEG-7 description tool that allows describing a set of terms using MPEG-7 description schemes and descriptors. It allows defining hierarchies of terms and simple relations between them, and allows the term names and definitions to be multilingual. Part 5 of the standard already defines a number of classification schemes, and new ones can be added. The CSs defined in the standard serve those description tools which require or encourage the use of controlled vocabularies, such as content genres or the roles of content creators.

3.3.2. AAF

Responsible

http://www.aafassociation.org/

Specification

http://www.aafassociation.org/html/techinfo/index.html#aaf_specifications

Discriminators/Categories

nX

SI,V,A

production

content creation

broadcast

The Advanced Authoring Format (AAF) is a cross-platform file format that allows the interchange of data between multimedia authoring tools. AAF supports the encapsulation of both metadata and essence, but its primary purpose involves the description of authoring information. The object-oriented AAF object model allows for extensive timeline-based modeling of compositions (i.e., motion picture montages), including transitions between clips and the application of effects (e.g., dissolves, wipes, flipping). Hence, the application domain of AAF is within the post production phase of an audiovisual product and it can be employed in specialized video work centers. Among the structural metadata contained for clips and compositions, AAF also supports storing event-related information, e.g. time-based user annotations and remarks or specific authoring instructions.

AAF files are fully agnostic as to how essence is coded and serve as a wrapper for any kind of essence coding specification. In addition to describing the current location and characteristics of essence clips, AAF also supports descriptions of the entire derivation chain for a piece of essence, from its current state back to the original storage medium, possibly a tape (identified by tape number and timecode) or film (identified by, e.g., an edge code).

The AAF data model and essence are independent of the specifics of how AAF files are stored on disk. The most common storage specification used for AAF files is the Microsoft Structured Storage format, but other storage formats could be used (e.g., XML).

The AAF metadata specifications and object model are fully extensible (e.g., through subclasses of existing objects), and the extensions are fully contained in a metadata dictionary stored in the AAF file. Because the format's flexibility and use of proprietary extensions could hinder predictable interoperability between implementations created by different developers, the Edit Protocol was established. The Edit Protocol combines a number of best practices and constraints as to how an Edit Protocol-compatible AAF implementation must function and which subset of the AAF specification can be used in Edit Protocol-compliant AAF files.

3.3.3. MXF-DMS-1

Responsible

http://www.smpte.org/

Specification

http://store.smpte.org/category-s/3.htm (SMPTE-M377-2004, SMPTE-M380-2004)

Discriminators/Categories

nX

SI,V,A

production

content creation

broadcast

The Material Exchange Format (MXF) is a streamable file format optimized for the interchange of material for the content creation industries. MXF is a wrapper/container format intended to encapsulate and accurately describe one or more "clips" of audiovisual essence (video, sound, pictures, ...). This file format is essence-agnostic, which means it should be independent of the underlying audio and video coding specifications in the file. In order to process such a file, its header contains data about the essence, i.e. metadata. An MXF file contains enough structural header information to allow applications to interchange essence without any a priori information. The MXF metadata allows applications to know the duration of the file, what essence codecs are required, what timeline complexity is involved and other key points to allow interchange.

There exists a "Zero Divergence" doctrine, which states that any areas in which AAF and MXF overlap must be technologically identical. As such, MXF and AAF share a common data model. This means that they use the same model to represent timelines, clips, descriptions of essence, and metadata. The major difference between the two is that MXF has chosen not to include transition and layering functionality. This makes MXF the favourable file format in embedded systems, such as VTRs or cameras, where resources can be scarce. Essentially, this creates an environment in which raw essence can be created in MXF, postproduced in AAF, and then the finished content can be generated as an MXF file.

MXF uses KLV (SMPTE-336M-2001) coding throughout the file structure. KLV is a data interchange format defined by the simple data construct Key-Length-Value, where the Key identifies the data meaning, the Length gives the data length, and the Value is the data itself. This principle allows a decoder to identify each component by its key and skip any component it cannot recognize, using the length value to continue decoding data types with recognized key values. KLV coding allows any kind of information to be coded. It is essentially a machine-friendly coding construct that is data-centric and is not dependent on human language. Additionally, the KLV structure of MXF allows this file format to be streamable.

Structural metadata is the way in which MXF describes different essence types and their relationship along a timeline. The structural metadata defines the synchronization of different tracks along a timeline. It also defines picture size, picture rate, aspect ratio, audio sampling, and other essence description parameters. The MXF structural metadata is derived from the AAF data model. Next to the structural metadata described above, MXF files may contain descriptive and "dark" metadata.

MXF descriptive metadata comprises information in addition to the structure of the MXF file. Descriptive metadata is metadata created during production or the planning of production. Possible information can be about the production, the clip (e.g. which type of camera was used) or a scene (e.g. the actors in it), among others. DMS-1 (Descriptive Metadata Scheme 1) is an attempt to standardize such information within the MXF format. Furthermore, DMS-1 is able to interwork, as far as practical, with other metadata schemes such as MPEG-7, TV Anytime, P/Meta and Dublin Core.

Dark metadata is the term given to metadata that is unknown to an application. This metadata may be privately defined and generated, it may be newly added properties, or it may be standard MXF metadata not relevant to the application processing this MXF file. There are rules in the MXF standard on the use of dark metadata to prevent numerical or namespace clashes when private metadata is added to a file already containing dark metadata.

3.4. MM standards for describing multimedia presentations

3.4.1. Synchronized Multimedia Integration Language (SMIL)

Responsible

http://www.w3.org

Specification

http://www.w3.org/TR/SMIL2/metadata.html

Discriminators/Categories

X

G

publish, distribution, presentation, interaction

generic

Web, mobile applications

The Synchronized Multimedia Integration Language (SMIL) is an XML-based language that enables simple authoring of interactive audiovisual presentations. SMIL is used to describe scenes with streaming audio, streaming video, still images, text or any other media type. SMIL can be integrated with other web technologies such as XML, DOM, SVG, CSS and XHTML.

Next to media, a SMIL scene also consists of a spatial and temporal layout and supports animation and interactivity. SMIL also has a timing mechanism to control animations and for synchronization. SMIL is based on the download-and-play concept; it also has a mobile specification, SMIL Basic.

The SMIL 2.1 Metainformation module contains elements and attributes that allow the description of SMIL documents. It allows authors to describe documents with a very basic vocabulary (the meta element, inherited from SMIL 1.0), and in its recent version the specification introduces new capabilities for describing metadata using RDF.

3.4.2. Scalable Vector Graphics (SVG)

Responsible

http://www.w3.org

Specification

http://www.w3.org/TR/SVG11/metadata.html

Discriminators/Categories

X

G

publish, presentation

generic

Web, mobile applications

Scalable Vector Graphics (SVG) is a language for describing two-dimensional vector and mixed vector/raster graphics in XML. It allows for describing scenes with vector shapes (e.g., paths consisting of straight lines, curves), text, and multimedia (e.g. still images, video, audio). These objects can be grouped, transformed, styled and composited into previously rendered objects.

SVG files are compact and provide high-quality graphics on the Web, in print, and on resource-limited handheld devices. In addition, SVG supports scripting and animation, so SVG is ideal for interactive, data-driven, personalized graphics. SVG is based on the download-and-play concept. SVG has also a mobile specification, SVG Tiny, which is a subset of SVG.

Metadata which is included with SVG content is specified within the metadata elements, with contents from other XML namespaces, such as DC, RDF, etc.
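
A minimal sketch of such metadata, written in N3 for consistency with section 1.1 (inside an SVG file it would be embedded as RDF/XML within the metadata element; the resource URI and values are illustrative):

 <http://example.org/drawings/logo.svg>
     dc:title "Company Logo" ;
     dc:creator "Jane Doe" ;
     dc:format "image/svg+xml" ;
     dc:date "2006-11-01" .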

3.4.3. Flash

Responsible

http://www.adobe.com/products/flash/flashpro/

Specification

Flash Professional 8

Discriminators/Categories

X, nX

G (SI, V, A, T)

publish

content distribution, content presentation, interactivity

broadcast

Flash is a proprietary vector-based platform for describing two-dimensional graphics. It is the de facto standard for rich-media content, as Flash Players exist for most operating systems and Flash itself is used widely (e.g. for websites and games).

A Flash scene is built out of vector graphics (e.g., paths consisting of straight lines, curves), multimedia (e.g. still images, video, audio) and/or text. Flash also supports animations (declarative animations natively and interactive animations through ActionScript), interactivity, and dynamic loading of still images, videos, audio, XML and other Flash scenes.

Flash does not support streaming except for video objects. Furthermore, video objects are not processed by Flash but passed through to the codecs available on the device.

Flash also has a mobile version, Flash Lite, which supports SVG Tiny.

3.4.4. MPEG-4 BIFS

Responsible

http://www.chiariglione.org/mpeg/technologies/mp04-bifs/index.htm

Specification

ISO/IEC JTC 1/SC 29/WG 11N7608

Discriminators/Categories

X, nX

G (SI, V, A, T)

publish

content distribution, content presentation, interactivity

broadcast

Binary Format for Scenes (BIFS) is a binary format for two- or three-dimensional audiovisual content. It is based on VRML and is specified in part 11 of the MPEG-4 standard. It supports streaming and uses compression to limit the necessary bandwidth.

BIFS is an MPEG-4 scene description protocol used to compose MPEG-4 objects, describe interaction with them, and animate them.

A BIFS scene is described using a tree representation. The leaf nodes in this tree denote the different media objects (text, video, audio, et cetera) and their global properties, while the parent nodes can be seen as primitives that are used for the actual composition of the audio-visual scene (grouping and transformations).

3.4.5. MPEG LASER

Responsible

http://www.mpeg-laser.org/

Specification

ISO/IEC 14496-20:2006

Discriminators/Categories

X

G (SI, V, A, T)

publish

content distribution, content presentation, interactivity

broadcast

MPEG-4 Part 20 LASeR is, like SVG Tiny, designed to allow an efficient representation of 2-dimensional scenes describing rich-media services for constrained devices. It is created to support SVG Tiny 1.1 (and, with amendment 1, also SVG Tiny 1.2) but adds some key extensions such as streaming capabilities, efficient compression of the SVG content, and frame-accurate synchronization of the scene with the audio-visual objects.

Next to LASeR, MPEG also added the Streaming Aggregation Format (SAF) to the MPEG-4 Part 20 standard. This format allows multiplexing the LASeR stream with various elementary streams such as video, audio, and images, thus creating only one stream.

In order to support streaming and limit the bandwidth used, LASeR uses timed modifications. These modifications can add, delete, and replace objects or change their properties without having to re-send a complete scene.

3.5. MM standards for describing a specific domain/workflow

3.5.1. NewsML

Responsible

http://www.iptc.org/NAR/

Specification

Public version of the Specifications, Experimental Phase2

Discriminators/Categories

X

G (SI, V, A, T)

publish

news

News Agencies

To ease the exchange of news, the International Press Telecommunication Council (IPTC) is currently developing the NewsML2 Architecture, whose goal is to provide a single generic model for exchanging all kinds of newsworthy information, thus providing a framework for a future family of IPTC news exchange standards. This family includes NewsML, SportsML, EventsML, ProgramGuideML and a future WeatherML. All are XML-based languages used for describing not only the news content (traditional metadata), but also its management, packaging, and aspects related to the exchange itself (transportation, routing).

3.5.2. TVAnytime

Responsible

http://www.tv-anytime.org/

Specification

Metadata Specification, Phase 2

Discriminators/Categories

X

G

distribute

EPG (Electronic Program Guides)

broadcast

The TV Anytime Forum is an association of organizations which seeks to develop specifications for value-added interactive services, such as electronic program guides, in the context of digital TV broadcasting. The forum identified metadata as one of the key technologies enabling its vision and has adopted MPEG-7 as the description language. It has extended the MPEG-7 vocabulary with higher-level descriptors, such as the intended audience of a program or its broadcast conditions.

3.5.3. MPEG-21

Responsible

http://www.iso.org/iso/en/prods-services/popstds/mpeg.html

Specification

http://www.chiariglione.org/MPEG/standards/mpeg-21/mpeg-21.htm

Discriminators/Categories

nX, X

G

annotate, publish, distribute

generic

generic

The MPEG-21 standard aims at defining a framework for multimedia delivery and consumption which supports a variety of businesses engaged in the trading of digital objects. MPEG-21 is quite different from its predecessors, as it is not focused on the representation and coding of content as MPEG-1 through MPEG-7 are, but instead focuses on filling the gaps in the multimedia delivery chain. MPEG-21 was developed with the vision that it should offer users transparent and interoperable consumption and delivery of rich multimedia content. The MPEG-21 standard consists of a set of tools and builds on the previous coding and metadata standards MPEG-1, -2, -4 and -7, i.e., it links them together to produce a protectable universal package for collecting, relating, referencing and structuring multimedia content for consumption by users (the Digital Item). The vision of MPEG-21 is to enable transparent and augmented use of multimedia resources (e.g. music tracks, videos, text documents or physical objects) contained in digital items across a wide range of networks and devices.

The two central concepts of MPEG-21 are Digital Items, a fundamental unit of distribution and transaction, and the concept of Users interacting with Digital Items: A User is any entity that interacts in the MPEG-21 environment or makes use of a Digital Item, and a Digital Item is a structured digital object with a standard representation, identification and metadata within the MPEG-21 framework. This entity is also the fundamental unit of distribution and transaction within this framework.

The MPEG-21 standard consists of 18 parts (part 13, formerly known as Scalable Video Coding, has been specified as an amendment to MPEG-4 part 10; consequently, part 13 is currently no longer in use).

MPEG-21 identifies and defines the mechanisms and elements needed to support the multimedia delivery chain as described above, as well as the relationships between and the operations supported by them. Within the parts of MPEG-21, these elements are elaborated by defining the syntax and semantics of their characteristics, such as interfaces to the elements.

3.5.4. EBU P/Meta

Responsible

http://www.ebu.ch/

Specification

http://www.ebu.ch/CMSimages/en/tec_doc_t3295_v0102_tcm6-40957.pdf

Discriminators/Categories

nX, X

V+A

publish

generic

broadcast

The EBU P/Meta working group has designed this standard as a metadata vocabulary for programme exchange in the professional broadcast industry. It is not intended as an internal representation of a broadcaster's system. P/Meta has been designed as a metadata format for a business-to-business scenario, to exchange broadcast programme-related metadata between content producers, content distributors and archives. The P/Meta definition uses a three-layer model. The standard specifies the definition layer (i.e. the semantics of the description). The technology layer defines the encoding used for exchange; currently KLV (key, length, value) and XML representations are specified. The lowest layer, the data interchange layer, is out of the scope of the specification. P/Meta consists of a number of attributes (some of them with a controlled list of values), which are organized into sets, and covers several types of metadata.

3.6. Other MM(-related) standards

3.6.1. Dublin Core (DC)

Responsible

http://dublincore.org/

Specification

http://dublincore.org/documents/dcmi-terms/

Discriminators/Categories

X, R

G

publish

generic

generic

The Dublin Core Metadata Initiative (DCMI) has defined a set of elements for cross-domain information resource description. The set consists of a flat list of 15 elements describing common properties of resources, such as title, creator, etc. Dublin Core recommends using controlled vocabularies for providing the values of these elements.
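
Using the dc: namespace declared in section 1.1, a minimal Dublin Core description in N3 looks as follows (the resource URI and values are illustrative):

 <http://example.org/reports/annual-report.pdf>
     dc:title "Annual Report 2006" ;
     dc:creator "John Doe" ;
     dc:date "2006-12-01" ;
     dc:format "application/pdf" .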

3.6.2. XMP/IPTC

Responsible

http://www.adobe.com/

Specification

http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf

Discriminators/Categories

X, R

G

annotate, publish, distribute

generic

generic

The main goals of XMP are to attach more powerful metadata to media assets in order to enable better management of multimedia content, to allow better ways to search and retrieve content, and thus to improve the consumption of assets. Furthermore, XMP aims to enhance the reuse and repurposing of content and to improve interoperability between different vendors and systems.

The Adobe XMP specification standardizes the definition, creation, and processing of metadata by providing a data model, a storage model (serialization of the metadata as a stream of XML), and formal schema definitions (predefined sets of metadata property definitions that are relevant for a wide range of applications). XMP makes use of RDF in order to represent the metadata properties associated with a document.

With XMP, Adobe provides a method and format for expressing and embedding metadata in various multimedia file formats. It provides a basic data model as well as metadata schemas for storing metadata in RDF, and provides a storage mechanism and a basic set of schemas for managing multimedia content, e.g. versioning support.

The most important components of the specification are the data model and the pre-defined (and extensible) schemas.

XMP Data Model: The data model is derived from RDF and is a subset of the RDF data model. It provides metadata properties for attaching metadata to a resource. Properties have property values, which can be simple types, structures (structured properties), or arrays. Properties may also have properties of their own (property qualifiers), which may provide additional information about the property value.

XMP Schemas: Schemas consist of predefined sets of metadata property definitions. Schemas are essentially collections of statements about resources which are expressed using RDF. It is possible to define new external schemas, extend the existing ones, or add some if necessary. Some predefined schemas are included in the specification, like a Dublin Core schema, a basic rights schema and a media management schema.
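
A sketch of XMP-style metadata, rendered in N3 for consistency with section 1.1 (XMP itself serializes such statements as RDF/XML inside a packet embedded in the file; the xmp: namespace below is the XMP Basic schema, and all values are illustrative):

 @prefix xmp: <http://ns.adobe.com/xap/1.0/> .   # XMP Basic schema

 <http://example.org/photos/sunset.jpg>
     dc:title "Sunset" ;
     dc:creator "Jane Doe" ;
     xmp:CreateDate "2006-08-15T18:30:00Z" ;
     xmp:CreatorTool "Adobe Photoshop CS2" .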

There is a growing number of commercial applications that already support XMP. The International Press Telecommunications Council (IPTC) has integrated XMP in its Image Metadata specifications, and almost every Adobe application, like Photoshop or InDesign, supports XMP.

4. Formal Representation of Existing Multimedia Metadata

This section discusses known approaches for mapping existing multimedia metadata to RDF/OWL for the purpose of interoperability, reasoning, etc.

Each subsection starts with a table containing the ontology source (if available online) and a description of the formalisation.

4.1. VRA - RDF/OWL

At the time of writing, there exists no commonly accepted mapping for VRA Core to RDF/OWL. This section discusses the pros and cons of the alternative mappings.

4.1.1. Mark van Assem VRA Mapping

Mark van Assem

4.1.2. SIMILE VRA Mapping

SIMILE

4.2. Exif - RDF/OWL

Recently there have been efforts to encode the Exif metadata tags in standardized Web ontology languages. Two such approaches are presented here; they are semantically very similar, yet both are presented for completeness.

4.2.1. Kanzaki Exif RDF Schema

The Kanzaki Exif RDF Schema (Kanzaki Exif) provides an encoding of the basic Exif metadata tags in RDFS; relevant domains and ranges are utilized as well. (Kanzaki Exif) additionally provides an Exif conversion service, EXIF-to-RDF, which extracts Exif metadata from images and automatically maps it to the RDF encoding. In particular, the service takes a URL to an Exif image and extracts the embedded Exif metadata. The service then converts this metadata to the RDF schema defined in (Kanzaki Exif) and returns it to the user.

4.2.2. Norm Walsh Exif RDF Schema

The Norm Walsh Exif RDF Schema (Walsh Exif) provides another encoding of the basic Exif metadata tags in RDFS. (Walsh Exif) additionally provides JPEGRDF, a Java application that provides an API to read and manipulate Exif metadata stored in JPEG images. Currently, JPEGRDF can extract, query, and augment the Exif/RDF data stored in the file headers. In particular, the API can be used to convert existing Exif metadata in file headers to the schema defined in (Walsh Exif). The resulting RDF can then be stored in the image file header, etc. (Note that the API's functionality extends well beyond what is presented here.)

4.3. DIG35 - RDF/OWL

Ontology Source

http://multimedialab.elis.ugent.be/users/chpoppe/Ontologies/DIG35.zip

Description

N.A.

The DIG35 ontology, developed by Multimedia Lab within the context of the W3C Multimedia Semantics Incubator Group, provides an OWL schema covering the entire DIG35 specification. For the formal representation of DIG35 no other ontologies have been used. However, relations with other ontologies, such as Exif, FOAF, etc., will be created to give the DIG35 ontology a broader semantic range. The DIG35 ontology is an OWL Full ontology.

4.4. MPEG-7 - RDF/OWL

This section lists existing approaches that concern the translation of (parts of) MPEG-7 into RDF/OWL.

4.4.1. MPEG-7 Upper MDS Ontology by Hunter

Ontology Source

http://metadata.net/mpeg7

Description

(Hunter, 2001)

Chronologically the first one, this MPEG-7 ontology was initially developed in RDFS, then converted into DAML+OIL, and is now available in OWL Full. The ontology covers the upper part of the Multimedia Description Scheme (MDS) part of the MPEG-7 standard. It comprises about 60 classes and 40 properties.

4.4.2. MPEG-7 MDS Ontology by Tsinaraki

Ontology Source

http://elikonas.ced.tuc.gr/ontologies/av_semantics.zip

Description

(Tsinaraki et al., 2004)

Starting from the ontology developed by (Hunter, 2001), this MPEG-7 ontology covers the full Multimedia Description Scheme (MDS) part of the MPEG-7 standard. It contains 420 classes and 175 properties. This is an OWL DL ontology.

4.4.3. MPEG-7 Ontology by Rhizomik

Ontology Source

http://rhizomik.net/ontologies/mpeg7ontos

Description

(Garcia et al., 2005)

This MPEG-7 ontology has been produced fully automatically from the MPEG-7 standard in order to give it a formal semantics. For this purpose, a generic XSD2OWL mapping has been implemented. The definitions of the XML Schema types and elements of the ISO standard have been converted into OWL definitions according to the table given in (Garcia et al., 2005). This ontology could then serve as a top ontology, thus easing the integration of other more specific ontologies such as MusicBrainz. The authors have also proposed to automatically transform the XML data (instances of MPEG-7) into RDF triples (instances of this top ontology).

This ontology aims to cover the whole standard and is thus the most complete one (with respect to the previously mentioned ones). It finally contains 2372 classes and 975 properties. This is an OWL Full ontology, since it employs the rdf:Property construct to cope with the fact that there are properties that have both datatype and object ranges.
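
A sketch of the kind of construct that pushes the ontology into OWL Full (the prefix and property name are illustrative, not taken from the actual ontology):

 @prefix mpeg7: <http://rhizomik.net/ontologies/mpeg7ontos#> .   # illustrative prefix

 # A property whose range mixes literal values and resources can be typed
 # neither as owl:DatatypeProperty nor as owl:ObjectProperty, hence the
 # plain rdf:Property construct is used, which is outside OWL DL.
 mpeg7:hasMediaLocator a rdf:Property ;
     rdfs:comment "Range may be a plain string or a structured resource." .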

4.4.4. INA Ontology

Ontology Source

TODO: put the final ontology on a CWI site

Description

(Isaac et al., 2004), (Troncy, 2003)

This ontology is not really an MPEG-7 ontology, since it does not cover the whole standard. It is rather a core audio-visual ontology inspired by several terminologies, either standardized (like MPEG-7 and TV Anytime) or still under development (ProgramGuideML). Furthermore, this ontology benefits from the practices of the French INA institute, the British BBC and the Italian RAI channels, which have also developed complete terminologies for describing radio and TV programs.

This core ontology currently contains 1100 classes and 220 properties and is represented in OWL Full.

5. Multimedia Ontologies

This section briefly introduces (existing) multimedia ontologies.

Each subsection starts with a table containing the responsible party (project), the ontology source (if available online) and a description of the formalisation (and its purpose).

5.1. aceMedia Visual Descriptor Ontology

Responsible

http://www.acemedia.org/

Ontology Source

http://www.acemedia.org/aceMedia/files/software/m-ontomat/acemedia-visual-descriptor-ontology-v09.rdfs

Description

(Bloehdorn et al., 2005)

The Visual Descriptor Ontology (VDO), developed within the aceMedia project for semantic multimedia content analysis and reasoning, contains representations of MPEG-7 visual descriptors and models concepts and properties that describe the visual characteristics of objects. The term descriptor refers to a specific representation of a visual feature (color, shape, texture, etc.) that defines the syntax and the semantics of a specific aspect of the feature. For example, the dominant color descriptor specifies, among others, the number and value of dominant colors that are present in a region of interest and the percentage of pixels that each associated color value has. Although the construction of the VDO is tightly coupled with the specification of the MPEG-7 Visual part, several modifications were carried out in order to adapt the XML Schema provided by MPEG-7 to an ontology and to the data type representations available in RDF Schema.

5.2. Mindswap Image Region Ontology

Responsible

http://www.mindswap.org/

Ontology Source

http://www.mindswap.org/2005/owl/digital-media

Description

(Halaschek-Wiener et al., 2005)

The Mindswap digital-media ontology is an OWL ontology which models concepts and relations covering various aspects of the digital media domain. The main purpose of the ontology is to provide the expressiveness to assert what is depicted within various types of digital media, including images and videos. The ontology defines concepts including image, video, video frame and region, as well as relations such as depicts, regionOf, etc. Using these concepts and their associated properties, it is possible to assert that an image or image region depicts some instance, etc.
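
A sketch of such an assertion in N3 (the dm: prefix is assumed for the digital-media namespace; the class and property spellings are taken loosely from the description above):

 @prefix dm: <http://www.mindswap.org/2005/owl/digital-media#> .   # prefix assumed

 <http://example.org/photos/team.jpg> a dm:Image .

 <http://example.org/photos/team.jpg#region1> a dm:ImageRegion ;
     dm:regionOf <http://example.org/photos/team.jpg> ;
     dm:depicts <http://example.org/people/alice> .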

5.3. Visual Ontology for Video Retrieval

Responsible

http://www.cs.vu.nl/en/sec/bi/index-en.html

Ontology Source

http://www.cs.vu.nl/~laurah/VO/visualWordnetschema2a.rdfs

Description

(Hollink et al., 2005)

5.4. Audio Ontologies

5.4.1. Common Music Ontology

Responsible

Frédérick Giasson (Zitgist)

Specification

http://pingthesemanticweb.com/ontology/mo/

Ontology Source

http://pingthesemanticweb.com/ontology/mo/musicontology.rdfs

Discriminators/Categories

O, R

A

public, premeditation?

generic

music

The Music Ontology Specification provides the main concepts and properties for describing music (i.e. artists, albums and tracks) on the Semantic Web. It is based on (or inspired by) the MusicBrainz editorial metadata.
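
A sketch of a Music Ontology description in N3 (the mo: namespace URI is assumed, and the URIs and values are illustrative; mo:MusicArtist and mo:Track are classes named in the specification):

 @prefix mo: <http://purl.org/ontology/mo/> .     # namespace URI assumed
 @prefix foaf: <http://xmlns.com/foaf/0.1/> .

 <http://example.org/artist/1> a mo:MusicArtist ;
     foaf:name "Example Band" .

 <http://example.org/track/1> a mo:Track ;
     dc:title "Example Song" ;
     dc:creator <http://example.org/artist/1> .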

5.4.2. Kanzaki

Responsible

http://www.kanzaki.com/ns/music

Specification

http://www.kanzaki.com/ns/music

Discriminators/Categories

O, R

A

public, premeditation?

generic

music

A vocabulary to describe classical music and performances. Classes (categories) for musical work, event, instrument and performers, as well as related properties are defined.

5.4.3. Music Production

Responsible

http://moustaki.xtr3m.org

Specification

http://moustaki.xtr3m.org/musicont/

Ontology Source

http://purl.org/NET/c4dm/music.owl

Discriminators/Categories

O, R

A

public

generic

music

Description

(Abdallah et al., 2006)

A set of OWL DL ontologies designed to cover a large range of things happening during a music production process. The ontology is split into several parts (Time, Event, Music production, Editorial information).

5.4.4. Music Recommendation

Responsible

http://foafing-the-music.iua.upf.edu/

Specification

http://foafing-the-music.iua.upf.edu/ISWC2006

Ontology Source

http://foafing-the-music.iua.upf.edu/music-ontology/foafing-ontology-0.2.owl

Discriminators/Categories

O, R

A

public

generic

music

Description

(Celma, 2006)

A simple OWL DL ontology that defines basic information about artists (and their relationships) and songs. It includes some descriptors automatically extracted from the audio (beats per minute, key and mode, intensity, etc.).
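
A sketch of what such a description could look like in N3 (the fm: namespace URI and all class and property names are illustrative placeholders, not taken from the actual ontology):

 @prefix fm: <http://example.org/foafing-the-music#> .   # placeholder namespace

 <http://example.org/songs/1> a fm:Song ;       # illustrative class
     dc:title "Example Song" ;
     fm:beatsPerMinute "126" ;                  # illustrative audio descriptors
     fm:key "D" ;
     fm:mode "minor" .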

5.5. Others

6. Examples

TODO: add a visual (a still image) and audio inline example and show how it is described using the vocabularies listed above.

7. References

(Smith et al., 2006)

Metadata Standards Roundup. J. R. Smith and P. Schirling. IEEE MultiMedia, vol. 13, no. 2, pp. 84-88, Apr-Jun 2006.

(Hardman, 2005)

Canonical Processes of Media Production. L. Hardman. In Proceedings of the ACM Workshop on Multimedia for Human Communication, ACM Press, 2005.

(ISO, 2005)

Definition of MPEG-7 Description Profiling. ISO/IEC 15938-9, 2005.

(Bailer et al., 2006)

The Detailed Audiovisual Profile: Enabling Interoperability between MPEG-7 Based Systems. W. Bailer and P. Schallauer. In Proc. of the 12th International Multi-Media Modeling Conference, Beijing, China, 2006.

(Hunter, 2001)

Adding Multimedia to the Semantic Web - Building an MPEG-7 Ontology. J. Hunter. In Proc. of the 1st International Semantic Web Working Symposium (SWWS 2001), Stanford, USA, 30 July - 1 August 2001.

(Tsinaraki et al., 2004)

Interoperability Support for Ontology-based Video Retrieval Applications. C. Tsinaraki, P. Polydoros and S. Christodoulakis. In Proc. of the 3rd International Conference on Image and Video Retrieval (CIVR 2004), Dublin, Ireland, 21-23 July 2004.

(Garcia et al., 2005)

Semantic Integration and Retrieval of Multimedia Metadata. R. Garcia and O. Celma. In Proc. of the 5th International Workshop on Knowledge Markup and Semantic Annotation (SemAnnot 2005), held with ISWC 2005, Galway, Ireland, 7 November 2005.

(Isaac et al., 2004)

Designing and Using an Audio-Visual Description Core Ontology. A. Isaac and R. Troncy. In Workshop on Core Ontologies in Ontology Engineering, held in conjunction with the 14th International Conference on Knowledge Engineering and Knowledge Management (EKAW'04), Whittlebury Hall, Northamptonshire, UK, 8 October 2004.

(Troncy, 2003)

Integrating Structure and Semantics into Audio-visual Documents. R. Troncy. In Proc. of the 2nd International Semantic Web Conference (ISWC'03), LNCS 2870, pages 566-581, Sanibel Island, Florida, USA, 21-23 October 2003.

(Bloehdorn et al., 2005)

Semantic Annotation of Images and Videos for Multimedia Analysis. S. Bloehdorn, K. Petridis, C. Saathoff, N. Simou, V. Tzouvaras, Y. Avrithis, S. Handschuh, I. Kompatsiaris, S. Staab, and M. G. Strintzis. In Proc. of the 2nd European Semantic Web Conference (ESWC 2005), Heraklion, Greece, May 2005.

(Halaschek-Wiener et al., 2005)

A Flexible Approach for Managing Digital Images on the Semantic Web. C. Halaschek-Wiener, A. Schain, J. Golbeck, M. Grove, B. Parsia and J. Hendler. In Proc. of the 5th International Workshop on Knowledge Markup and Semantic Annotation (SemAnnot 2005), held with ISWC 2005, Galway, Ireland, 7 November 2005.

(Hollink et al., 2005)

Building a Visual Ontology for Video Retrieval. L. Hollink, M. Worring and G. Schreiber. In Proc. of ACM Multimedia, Singapore, November 2005.

(Kanzaki Exif)

Kanzaki EXIF-RDF Converter

(Walsh Exif)

JPEGRDF - Norm Walsh EXIF Converter

(Abdallah et al., 2006)

An ontology-based approach to information management for music analysis systems.

(Celma, 2006)

Foafing the Music: Bridging the Semantic Gap in Music Recommendation.