Image annotation on the Semantic Web

Editors' Draft $Date: 2005/10/28 11:46:30 $ $Revision: 1.4 $

This version:: http://www.w3.org/2001/sw/BestPractices/MM/image_annotation_galway.html
Latest version:: http://www.w3.org/2001/sw/BestPractices/MM/image_annotation.html
Previous version:: N/A
Editors:: Jacco van Ossenbruggen, Center for Mathematics and Computer Science (CWI); Raphaël Troncy, Center for Mathematics and Computer Science (CWI); Giorgos Stamou, IVML, National Technical University of Athens
Contributors:: Christian Halaschek-Wiener, University of Maryland; Jane Hunter, DSTC; Nikolaos Simou, IVML, National Technical University of Athens; John Smith, IBM T. J. Watson Research Center; Vassilis Tzouvaras, IVML, National Technical University of Athens
: Also see Acknowledgements.

Abstract

Many applications that involve multimedia content make use of some form of metadata that describe this content. This document provides guidelines for using Semantic Web languages and technologies in order to create, store, manipulate, interchange and process image metadata. It gives a number of use cases to exemplify the use of Semantic Web technology for image annotation, an overview of RDF and OWL vocabularies developed for this task and an overview of relevant tools.

Note that many approaches to image annotation predate Semantic Web technology. Interoperability between these earlier approaches and RDF- and OWL-based approaches is addressed in a separate document on Interoperability.

After reading this document, readers may turn to separate documents discussing individual image annotation vocabularies, tools, and other relevant resources.

Target Audience

Institutions and organizations with research and standardization activities in the area of multimedia, professional (museums, libraries, audiovisual archives, media production and broadcast industry, image and video banks) and non-professional (end-users) multimedia annotators.

Objectives

Provide use cases with examples of multimedia annotations
Collect currently used vocabularies for multimedia annotations (like Dublin Core, VRA, ...)

Status of this document

This is a public (WORKING DRAFT) Working Group Note produced by the Multimedia Annotation in the Semantic Web Task Force of the W3C Semantic Web Best Practices & Deployment Working Group, which is part of the W3C Semantic Web activity.

Discussion of this document is invited on the public mailing list public-swbp-wg@w3.org (public archives). Public comments should include "comments: [MM]" at the start of the Subject header.

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress. Other documents may supersede this document.

1. Introduction
2. Use Cases
3. Vocabularies
4. Tools
5. Example Solutions to the Use Cases
References
Acknowledgments

1. Introduction

The need for annotating digital image data is recognized in a wide variety of applications, covering both professional and personal usage of image data. At the time of writing, most work done in this area is not based on Semantic Web technology, often because it predates the Semantic Web. This document explains the advantages of using Semantic Web languages and technologies for image annotations and provides guidelines for doing so. It is organized around a number of representative use cases, and a description of Semantic Web vocabularies and tools that could be used to help accomplish the tasks mentioned in the use cases. The remainder of this introductory section first gives an overview of image annotation in general, followed by a short description of the key Semantic Web concepts that are relevant for image annotation.

1.1 Image annotation basics

Annotating images on a small scale for personal usage can be relatively simple. The reader should be aware, however, that large-scale, industrial-strength image annotation is notoriously complex. Trade-offs along several dimensions make the task difficult:

Generic vs task-specific annotation

Annotating images without having a specific goal or task in mind is often not cost effective: after the target application has been developed, it turns out that images have been annotated using the wrong type of information, or on the wrong abstraction level, etc. Redoing the annotations is then an unavoidable, but costly solution. On the other hand, annotating with only the target application in mind may also not be cost effective. The annotations may work well with that one application, but if the same metadata is to be reused in the context of other applications, it may turn out to be too specific, and unsuited for reuse in a different context. In most situations the range of applications in which the metadata will be used in the future is unknown at the time of annotation. When lacking a crystal ball, the best the annotator can do in practice is use an approach that is sufficiently specific for the application under development, while avoiding unnecessary application-specific assumptions as much as possible.

Manual versus automatic annotation and the "Semantic Gap"

In general, manual annotation can provide image descriptions at the right level of abstraction. It is, however, time consuming and thus expensive. In addition, it proves to be highly subjective: different human annotators tend to "see" different things in the same image. On the other hand, annotation based on automatic feature extraction is relatively fast and cheap, and free of human bias. It tends to result, however, in image descriptions that are too low level for many applications. The difference between the low level feature descriptions provided by image analysis tools and the high level content descriptions required by the applications is often referred to, in the literature, as the Semantic Gap. In the remainder, we will discuss use cases, vocabularies and tools for both manual and automatic image annotation.

Different vocabularies for different types of metadata

While various classifications of metadata have been described in the literature, every annotator should at least be aware of the difference between annotations describing properties of the image itself, and those describing the subject matter of the image, that is, the properties of the objects, persons or concepts depicted by the image. In the first category, typical annotations provide information about title, creator, resolution, image format, image size, copyright, year of publication, etc. Many applications use a common, predefined and relatively small vocabulary defining such properties. Examples include the Dublin Core and VRA Core vocabularies. The second category describes what is depicted by the image, which can vary widely with the type of image at hand. As a result, one sees a large variation in vocabularies used for this purpose. Typical examples vary from domain-specific vocabularies (for example, with terms that are very specific for astronomy images, or sport images, etc.) to domain-independent ones (for example, a vocabulary with terms that are sufficiently generic to describe any news photo). In addition, vocabularies tend to differ in size, granularity, formality, etc. In the remainder, we discuss both metadata categories. Note that for the first category it is not uncommon that a vocabulary only defines the properties and defers the definitions of the values of those properties to another vocabulary. This is true, for example, for both Dublin Core and VRA Core. This means that typically, in order to annotate a single image, one needs terms from multiple vocabularies.

Lack of Syntactic and Semantic Interoperability

Many different file formats and tools for image annotations are currently in use. Reusing metadata developed for one set of tools in another is often hindered by a lack of interoperability. First, different tools use different file formats, so tool A may not be able to read in the metadata provided by tool B (syntactic interoperability). Solving this problem is relatively easy when the inner structure of both file formats is known: one can then develop a conversion tool. Second, tool A may assign a different meaning to the same annotation than tool B does (semantic interoperability). Solving this problem is much harder and can be done automatically only when the semantics of the vocabulary used is explicitly defined for both tools.

1.2 Semantic Web Basics

While much of the current work in this area is not (yet) based on Semantic Web languages and technology, we believe using this has many potential advantages.

Digital images are published and exchanged more and more over the Web. It is thus increasingly important that the associated annotations can also be shared and published over the Web. The Semantic Web is designed with this goal in mind. For example, the Semantic Web (re)uses the Web's URI ([IRI?]) scheme for identifying the resources that are annotated, the annotations and the definitions of the concepts used in the annotations. This allows everyone to unambiguously publish and exchange annotations and annotation vocabulary, without the need to commit to one centralized vocabulary.
Because the Semantic Web is inherently Web-based, it is built on top of open, platform- and application-neutral languages, which reduces the syntactic interoperability problems mentioned above (see the Interoperability document for more details).
Because the Semantic Web allows for machine-readable and explicitly defined semantics, it also provides a practical solution to the semantic interoperability problems. For example, a very specific term in an annotation produced by tool A may be recognized by tool B, which requires more generic terminology, by using the explicit subsumption relationships from RDFS or OWL.
By using Semantic Web languages for annotations and for defining vocabularies, one can use a growing set of Semantic Web-based tools and software packages.
The Semantic Web is a worldwide system, and addresses some issues concerning internationalization, for example, the use of language tags to make the natural language used in metadata explicit, and to permit metadata in multiple languages simultaneously.
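
The subsumption example above can be made concrete with a small sketch. It is not taken from any particular tool: the `ex:` class names and the `SUBCLASS_OF` table are hypothetical, and a real application would read the `rdfs:subClassOf` statements from an RDFS or OWL vocabulary with an RDF library rather than hard-code them. The sketch only shows how a specific term used by one tool can match a more generic term expected by another.

```python
# Hypothetical vocabulary: each pair reads "child rdfs:subClassOf parent".
SUBCLASS_OF = {
    ("ex:Sunflower", "ex:Flower"),
    ("ex:Tulip", "ex:Flower"),
    ("ex:Flower", "ex:Plant"),
}


def superclasses(term, subclass_of):
    """Return every class that subsumes `term`, following subClassOf transitively."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in subclass_of:
            if child == current and parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def matches(annotation_term, query_term, subclass_of):
    """True if the annotation term equals the query term or is subsumed by it."""
    return (annotation_term == query_term
            or query_term in superclasses(annotation_term, subclass_of))


# Tool A annotated an image with the specific term; tool B asks for plants.
print(matches("ex:Sunflower", "ex:Plant", SUBCLASS_OF))
```

The match only works in one direction: an image annotated as a plant is not returned for a query about sunflowers, which is the behaviour one expects from subsumption.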

2. Use Cases

Image annotation is relevant in a wide range of domains, organisations and applications that cannot all be covered in a single document such as this. Instead, a number of use cases are described that are intended as a representative set of examples. These use cases will be used later to discuss the vocabularies and tools that are relevant for image annotation on the Semantic Web. Example solutions are given in Section 5.

The use cases are organized in four categories, which reflect the topics depicted by the images. These topics often determine the tools and vocabularies used in the annotation process.

2.1 World Images

This section provides two use cases with images that could potentially depict any subject: management of a personal photo collection and that of a news press photo bank. The other use cases focus on images from a specific domain.

Use case: Management of Personal Digital Photo Collections

Advances in digital technologies (cameras, computers, storage, communication, etc.) have caused a huge increase in the digital multimedia information captured, stored and distributed by personal users over the Web. Digital formats now provide the cheapest, safest and easiest way to capture, store and deliver multimedia content. Many personal users have thousands of photos (from vacations, parties, traveling, conferences, everyday life, etc.), usually stored in several resolutions on the hard disk of their computers in a simple directory structure without any metadata. Ideally, the user wants to easily access this content, view it, create presentations, use it on a homepage, deliver it over the Internet to other people, make part of it accessible to other people or even sell part of it to image banks. Too often, however, the only way to access this content is by browsing the directories, whose names usually provide the date and describe, in one or two words, the original event captured by the photos. This kind of access becomes more and more difficult as the number of photos increases every day, and the content quickly becomes practically inaccessible. See also example solution.

Use case: Press Photo Bank

TO DO [after Galway f2f] IPTC / News / Sport / Entertainment. e.g Corbis, Associated Press, Reuters

2.2 Culture Images

This section contains a single use case from the cultural heritage domain. This domain is characterized by a long tradition in describing images, with many standardized methods and vocabularies.

Use case: Cultural Heritage

A fine arts museum has asked a specialized company to produce high resolution digital scans of the most important art works in its collections. The museum's quality assurance requires the possibility to track when, where and by whom every scan was made, with what equipment, etc. The museum's internal IT department, maintaining the underlying image database, needs the size, resolution and format of every resulting image. It also needs to know the repository ID of the original work of art. The company developing the museum's website additionally requires copyright information (that varies for every scan, depending on the age of the original work of art and the collection it originates from). It also wants to give the users of the website access to the collection, not only based on the titles of the paintings and names of their painters, but also based on the topics depicted ('sunsets'), genre ('self portraits'), style ('post-impressionism'), period ('fin de siecle'), region ('west European'). See also example solution.

2.3 Media

Use cases in this section are mainly targeted at media professionals, and less at the general public. Typical requests are characterized by detailed queries, not only about the content but also about media-specific details, such as camera angle, lens settings, etc.

Use case: Media Production Services

A media production house requires several web services in order to organize and implement its projects. Usually, pre-production and production start with the search and retrieval of locations, people, images and footage, in order to speed up the process and reduce the cost of the production as much as possible. For that reason, several multimedia archives (image and video banks, location management databases, casting houses, etc.) provide this information through the Web. Every day, media producers, location managers, casting managers, etc. search these archives in order to find the appropriate resources for their projects. The quality of this search and retrieval process directly affects the quality of the service that the archives provide to their users. In order to facilitate this process, the annotation of image content should make use of Semantic Web technologies, while also following the multimedia standards in order to be interoperable with other archives. Using, for example, the tools described below, the people who archive the content in the media production chain can provide all the necessary information (administrative, structural and descriptive metadata) in a standard form (RDF, OWL) that is easily accessible to other people over the Web. Using the Semantic Web standards, the archiving, search and retrieval processes can then make use of semantic vocabularies (ontologies) that describe the structure of the content, from thematic categories down to descriptions of the main objects appearing in the content and their main visual characteristics. In this way, multimedia archives will make their content easily accessible over the Web, providing a unified framework for media production resource allocation. (solution to be done after f2f)

Use case: Television Archive

Audiovisual archive centers manage very large multimedia databases. For instance, INA, the French Audiovisual National Institute, has been archiving TV documents for 50 years and radio documents for 65 years and stores more than 1 million hours of broadcast programs. The image and sound archives kept at INA are either intended for professional use (journalists, film directors, producers, audiovisual and multimedia programmers and publishers, in France and worldwide) or communicated for research purposes (for a public of students, research workers, teachers and writers). In order to allow efficient access to the stored data, most parts of these video documents are described and indexed by their content. The global multimedia information system should therefore be sufficiently fine-grained to support very complex and precise queries. For example, a journalist or a film director might ask for an excerpt of a previously broadcast program showing the first goal of a given football player for his national team, scored with his head. The query could additionally contain more technical requirements, for example that the goal action should be available from both the front camera view and the reverse angle camera view. Finally, the client might or might not remember some general information about this football game, such as the date, the place and the final score. See also example solution.

2.4 Scientific Images

This section presents use cases from the scientific domain. Typically, images are annotated using large and complex ontologies.

Use Case: Large-scale Image Collections at NASA

Many organizations maintain extremely large-scale image collections. The National Aeronautics and Space Administration (NASA) is one example: it has hundreds of thousands of images, stored in different formats, levels of availability and resolution, and with associated descriptive information at various levels of detail and formality. Such an organization also generates thousands of images on an ongoing basis that are collected and cataloged. Thus, a mechanism is needed to catalog all the different types of image content across various domains. Information about both the image itself (e.g., its creation date, dpi, source) and about the specific content of the image is required. Additionally, the associated metadata must be maintainable and extensible so that associated relationships between images and data can evolve cumulatively. Lastly, management functionality should provide mechanisms flexible enough to enforce restrictions based on content type, ownership, authorization, etc. See also example solution.

Use case: Medical Image Annotations

TO BE DONE [by Jane Hunter, after the f2f].

3. Vocabularies for image annotation

Choosing which vocabularies to use for annotating images is a key decision in an annotation project. Typically, one needs more than a single vocabulary to cover the different relevant aspects of the images. Vocabularies Overview discusses a number of individual vocabularies that are relevant for image annotation. The remainder of this section discusses more general issues.

Many of the relevant vocabularies have been developed prior to the Semantic Web, and Vocabularies Overview lists many translations of such vocabularies to RDF or OWL. Most notably, the key International Standard in this area, the Multimedia Content Description standard, widely known as MPEG-7, is defined using XML Schema. At the time of writing, there is no commonly accepted mapping from the XML Schema definitions in the standard to RDF or OWL. Several alternative mappings, however, have been developed so far and are discussed in the overview.

Another relevant vocabulary is the VRA Core. Where the Dublin Core specifies a small and commonly used vocabulary for on-line resources in general, VRA Core defines a similar set targeted especially at visual resources. Dublin Core and VRA Core both refer to terms in their vocabularies as elements, and both use qualifiers to refine elements in a similar way. The more general elements of VRA Core have direct mappings to comparable fields in Dublin Core. Furthermore, both vocabularies are defined in a way that abstracts from implementation issues and underlying serialization languages. A key difference, however, is that for Dublin Core there exists a commonly accepted mapping to RDF, along with the associated schema. At the time of writing, this is not the case for VRA Core, and the overview discusses the pros and cons of the alternative mappings.

Many annotations on the Semantic Web are about an entire resource. For example, a <dc:title> property applies to the entire document. For images and other multimedia documents, one often needs to annotate a specific part of a resource (for example, a region in an image). Sharing the metadata that localizes a specific part of multimedia content is important, since it allows multiple annotations (potentially from multiple users) to refer to the same content.

[TO DO after f2f: Discuss and give examples of two possible solutions:]

Ideally, the target image already specifies this specific part, using a name that is addressable in the URI fragment identifier (this can be done, for example, in SVG).
Otherwise the region needs to be described in the metadata, as is done in MPEG-7.
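
The difference between the two options can be sketched as follows. This is only an illustration under stated assumptions: the example.org URIs, the `ex:regionOf` and `ex:boundingBox` properties and the blank node label are hypothetical, and the triples are plain Python tuples rather than the output of an RDF library. Only `dc:title` is a real Dublin Core element.

```python
from urllib.parse import urldefrag

# Hypothetical image URI, for illustration only.
IMAGE = "http://example.org/images/garden.svg"

# Option 1: the image itself names the part, so the region has its own URI
# (the image URI plus a fragment identifier) and can be annotated directly.
region_uri = IMAGE + "#sunflower"
option1 = [
    (region_uri, "dc:title", "Sunflower in the foreground"),
]

# Option 2: the region is described in the metadata and linked to the image.
# ex:regionOf and ex:boundingBox are made-up property names.
option2 = [
    ("_:region1", "ex:regionOf", IMAGE),
    ("_:region1", "ex:boundingBox", "120,40,200,180"),
    ("_:region1", "dc:title", "Sunflower in the foreground"),
]


def annotated_image(subject, triples):
    """Return the URI of the image that an annotation subject belongs to."""
    if subject.startswith("_:"):
        # Option 2: follow the explicit link from the region to the image.
        for s, p, o in triples:
            if s == subject and p == "ex:regionOf":
                return o
        return None
    # Option 1: strip the fragment identifier to get the image URI.
    base, _fragment = urldefrag(subject)
    return base


print(annotated_image(region_uri, option1))
print(annotated_image("_:region1", option2))
```

In the first option, any two annotators who use the same fragment URI are automatically talking about the same region. In the second, they must agree on how regions are described before their annotations can be merged.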

See Semantic Web Image Annotation Vocabularies for a discussion on the individual vocabularies.

4. Tools

Besides the hundreds of tools used for image archiving and description, there are also many tools that are used for semantic image annotation. The aim of this section is to give a brief description of these tools and their characteristics in order to provide some guidelines for their proper use. Using these characteristics as criteria, the tools can be categorized so that users who want to annotate their multimedia content can choose the tool most appropriate for their application.

Type of Content. This characteristic refers to the type of content that a tool can annotate. Usually, the raw content is an image (in one of several formats, such as jpg or tif), but there are also tools that can annotate videos.

Type of Metadata. This characteristic concerns the type of metadata that can be included in the annotation using the specific tool. Following the categorization provided by the Making of America II project (Library of Congress), the metadata can be descriptive (for description and identification of information), structural (for navigation and presentation), and administrative (for management and processing). Most of the tools can be used to provide descriptive metadata, but with some of them the user can also provide structural and administrative information.

Format of Metadata. This characteristic refers to the format of metadata that the tools can import or export. Most important for the user is the format in which the annotation is exported, since it should be interoperable with Semantic Web applications. The most common formats used by the tools are OWL and RDF.

Annotation level. Some tools give the user the opportunity to annotate an image using vocabularies, while others allow free text annotation. When ontologies are used (usually in RDF or OWL format), the level of the annotation is high, since the semantics are provided in a more formal way; otherwise it is low.

Operation Mode. This characteristic refers to the way the tool operates, that is, whether it is a stand-alone or a web-based application.

Open Source. Some of the tools are open source while others are not. It is important for users, and for potential researchers and developers in the area of multimedia annotation, to know this before deciding which tool to use.

Collaborative or individual. This characteristic refers to whether the tool can be used as an annotation framework for web-shared image databases or only for annotating the multimedia content of an individual user.

Granularity. Granularity specifies whether annotation is segment based or file based. This characteristic is important if the user wants the content to be accessible from content-based retrieval tools. In combination with a high annotation level, the user can provide visual information (characteristics such as colour, shape, texture, etc.) about the objects that appear in the image.

Threaded or unthreaded. This characteristic refers to the ability of the tool to respond or add to a previous annotation and to stagger/structure the presentation of annotations to reflect this.

Access controlled or open access. This characteristic refers to whether the user has full access to the tool free of charge. Notably, almost all of the tools examined are free of charge.

In conclusion, the appropriateness of a tool depends on the nature of the annotation that the user requires and cannot be predetermined. The annotation tools that were found on the Internet have been categorized according to the characteristics described above. They were used for different types of annotations (according to the use cases, see the following section). For a discussion on the individual tools see Semantic Web Image Annotation Tools.
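
To illustrate how these characteristics could serve as selection criteria, the following sketch records a subset of them in a small data structure and filters on them. Everything in it is an assumption made for illustration: `ToolProfile`, `select_tools`, and the two tools "ToolA" and "ToolB" are made up and do not describe any of the tools discussed in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    """Characteristics of one annotation tool (a subset of the criteria above)."""
    name: str
    content_types: frozenset    # e.g. {"image", "video"}
    metadata_types: frozenset   # "descriptive", "structural", "administrative"
    export_formats: frozenset   # e.g. {"RDF", "OWL"}
    uses_ontologies: bool       # True means a high annotation level
    web_based: bool             # operation mode: web-based vs stand-alone
    open_source: bool
    segment_based: bool         # granularity: regions vs whole files


def select_tools(tools, content_type, export_format,
                 need_regions=False, need_ontologies=False):
    """Return the names of the tools that satisfy every stated requirement."""
    selected = []
    for tool in tools:
        if content_type not in tool.content_types:
            continue
        if export_format not in tool.export_formats:
            continue
        if need_regions and not tool.segment_based:
            continue
        if need_ontologies and not tool.uses_ontologies:
            continue
        selected.append(tool.name)
    return selected


# Two made-up tools; they do not describe any real product.
TOOLS = [
    ToolProfile("ToolA", frozenset({"image"}), frozenset({"descriptive"}),
                frozenset({"RDF"}), True, False, True, True),
    ToolProfile("ToolB", frozenset({"image", "video"}),
                frozenset({"descriptive", "administrative"}),
                frozenset({"RDF", "OWL"}), False, True, False, False),
]

# A user who needs region-level annotation of images, exported as RDF.
print(select_tools(TOOLS, "image", "RDF", need_regions=True))
```

A filter of this kind narrows the candidates but does not replace judgment: as noted above, the appropriateness of a tool depends on the nature of the annotation required.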

5. Example Solutions to the Use Cases

This section describes possible solutions for the use cases presented in Section 2. These solutions are provided purely as an illustrative example and do not imply endorsement by the W3C membership or the Semantic Web Best Practices and Deployment Working Group.

TO DISCUSS DURING F2F: Should the use case solutions be included into this document or should they be kept separate so we can update them after publication of this note?

5.1 Use Case: Management of Personal Digital Photo Collections

The solution of this use case requires the use of multiple vocabularies. The subject matter of a photo belonging to a personal digital collection is very wide, including domains such as sports, entertainment, sightseeing, etc. To select appropriate vocabularies, one first has to determine what information a user needs about the image. This information can be separated into two categories: the general image characteristics and the image content. A vocabulary that covers the general image properties is [MPEG-7]. For the image content, various vocabularies can be used, depending on the content described.

An image was annotated using tools from two different categories: tools that require a domain ontology (PhotoStuff and M-OntoMat-Annotizer) and tools that do not operate with a domain ontology (flick2rdf, SWAD).

See complete use case description for example RDF code and more detailed discussion of the design decisions.

5.2 Use Case: Cultural Heritage

Many of the requirements of this use case description can be met by using the vocabulary developed by the VRA. An important distinction made by this vocabulary is that between annotations describing a work of art itself and annotations describing a (digital) image of that work. The example RDF provided uses this distinction. Note that while many fields are filled with RDF Literals, the values of many of these fields are terms taken from other controlled vocabularies. Also note that many fields do not contain information about the current situation, but about places and collections the painting has been in in the past, which provides provenance information that is important in this domain.
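
The distinction between a work and an image of that work can be sketched as follows. This is not the example RDF referred to above, and it deliberately uses Dublin Core element names (`dc:title`, `dc:creator`, `dc:source`, `dc:format`) instead of VRA Core, since there is no commonly accepted RDF mapping for VRA Core at the time of writing. The URIs and literal values are made up, and the triples are plain Python tuples.

```python
# Hypothetical URIs for a painting and for one digital scan of it.
WORK = "http://example.org/works/42"
SCAN = "http://example.org/scans/42-highres.tif"

TRIPLES = [
    # Statements about the work of art itself.
    (WORK, "dc:title", "Self-portrait"),
    (WORK, "dc:creator", "Example Painter"),
    # Statements about the digital image of that work.
    (SCAN, "dc:source", WORK),
    (SCAN, "dc:creator", "Example Scanning Company"),
    (SCAN, "dc:format", "image/tiff"),
]


def values(subject, prop, triples):
    """Return all objects of the statements with the given subject and property."""
    return [o for s, p, o in triples if s == subject and p == prop]


def painter_of_scanned_work(scan, triples):
    """Follow dc:source from the scan to the work, then read the work's creator."""
    creators = []
    for work in values(scan, "dc:source", triples):
        creators.extend(values(work, "dc:creator", triples))
    return creators


print(values(SCAN, "dc:creator", TRIPLES))
print(painter_of_scanned_work(SCAN, TRIPLES))
```

Keeping the two resources separate lets the same property answer two different questions: who made the scan (needed for quality assurance) and who painted the original (needed by website visitors).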

See complete use case description for example RDF code and more detailed discussion of the design decisions.

5.3 Use Case: Television Archive

This use case typically requires the use of multiple vocabularies. First, the image could be extracted from a broadcast TV program such as, for example, a weekly sports magazine. This program may be fully described using the vocabulary developed by the [TV Anytime] Forum. Second, let us assume that this image shows a player scoring with his head during a particular football game. The context of this game could be described using the [MPEG-7] vocabulary, while the action itself might be described by a soccer ontology. Finally, a soccer fan may notice that in this particular image the goal is actually disallowed because another player is in an active offside position. On the image, a circle highlights this badly positioned player. Again, the description could combine the MPEG-7 vocabulary, for delimiting the relevant image region, with a domain-specific ontology, for describing the action itself.

See complete use case description for example RDF code and more detailed discussion of the design decisions.

5.4 Use Case: Large-scale Image Collections at NASA

One possible solution for such image management requirements is an annotation environment that enables users to annotate information about images and/or their regions using concepts in ontologies (OWL and/or RDFS). More specifically, subject matter experts will be able to assert metadata elements about images and their specific content. Multimedia related ontologies can be used to localize and represent regions within particular images. These regions can then be related to the image via a depiction/annotation property. This functionality can be provided, for example, by the MINDSWAP digital-media ontology (to represent image regions), in conjunction with FOAF (to assert image depictions). Additionally, in order to represent the low level image features of regions, the aceMedia Visual Descriptor Ontology can be used.
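
The way regions link depictions back to their images can be sketched as follows. The `foaf:depicts` property is part of FOAF, but the `REGION_OF` property, the example.org URIs and the data are hypothetical stand-ins: this sketch does not reproduce the actual terms of the MINDSWAP digital-media ontology, and it uses plain Python tuples rather than an RDF library.

```python
FOAF_DEPICTS = "http://xmlns.com/foaf/0.1/depicts"
# Hypothetical property linking a region to the image that contains it.
REGION_OF = "http://example.org/media#regionOf"

SHUTTLE = "http://example.org/things/SpaceShuttle"
MOON = "http://example.org/things/Moon"

TRIPLES = [
    # A whole image depicts the shuttle.
    ("http://example.org/img/1.jpg", FOAF_DEPICTS, SHUTTLE),
    # Only a region of the second image depicts the shuttle.
    ("http://example.org/img/2.jpg#r1", REGION_OF, "http://example.org/img/2.jpg"),
    ("http://example.org/img/2.jpg#r1", FOAF_DEPICTS, SHUTTLE),
    ("http://example.org/img/3.jpg", FOAF_DEPICTS, MOON),
]


def images_depicting(thing, triples):
    """Return the images that depict `thing`, directly or through a region."""
    region_to_image = {s: o for s, p, o in triples if p == REGION_OF}
    result = set()
    for s, p, o in triples:
        if p == FOAF_DEPICTS and o == thing:
            # Map a region back to its image; whole images map to themselves.
            result.add(region_to_image.get(s, s))
    return sorted(result)


print(images_depicting(SHUTTLE, TRIPLES))
```

Because a region is a resource in its own right, further statements (for example, low-level visual features) can be attached to it without affecting the statements made about the image as a whole.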

Existing toolkits, such as PhotoStuff and M-OntoMat-Annotizer, currently provide graphical environments to accomplish the tasks defined above. Using such tools, users can load images, create regions around parts of the image, automatically extract low-level features of selected regions (via M-OntoMat-Annotizer), assert statements about the selected regions, etc. Additionally, the resulting annotations can be exported as RDF/XML, thus allowing them to be shared, indexed, and used by advanced annotation-based browsing and search environments.

See complete use case description for example RDF code and more detailed discussion of the design decisions.

References

[Dublin Core]: The Dublin Core Metadata Initiative, Dublin Core Metadata Element Set, Version 1.1: Reference Description.
[Hunter, 2001]: J. Hunter. Adding Multimedia to the Semantic Web — Building an MPEG-7 Ontology. In International Semantic Web Working Symposium (SWWS 2001), Stanford University, California, USA, July 30 - August 1, 2001.
[Interoperability]: Semantic Web Image Annotation Interoperability
[MPEG-7]: Information Technology - Multimedia Content Description Interface (MPEG-7). Standard No. ISO/IEC 15938:2001, International Organization for Standardization (ISO), 2001.
[Stamou, 2005]: G. Stamou and S. Kollias (eds). Multimedia Content and the Semantic Web: Methods, Standards and Tools. John Wiley & Sons Ltd, 2005.
[Troncy, 2003]: R. Troncy. Integrating Structure and Semantics into Audio-visual Documents. In Second International Semantic Web Conference (ISWC 2003), pages 566 – 581, Sanibel Island, Florida, USA, October 20-23, 2003. Springer-Verlag Heidelberg.
[TV Anytime]: TV Anytime Forum, http://www.tv-anytime.org/
[Ossenbruggen, 2004]: J. van Ossenbruggen, F. Nack, and L. Hardman. That Obscure Object of Desire: Multimedia Metadata on the Web (Part I). In: IEEE Multimedia 11(4), pp. 38-48 October-December 2004.
[Ossenbruggen, 2005]: F. Nack, J. van Ossenbruggen, and L. Hardman. That Obscure Object of Desire: Multimedia Metadata on the Web (Part II). In: IEEE Multimedia 12(1), pp. 54-63 January-March 2005.
[VRA Core]: Visual Resources Association Data Standards Committee, VRA Core Categories, Version 3.0. See http://www.w3.org/2001/sw/BestPractices/MM/vra-conversion.html for an RDFS and OWL schema of VRA Core 3.0.

Acknowledgments

The editors would like to thank the following Working Group members for their contributions to this document: Jeremy Carroll, Libby Miller, Jeff Pan, Michael Uschold and Mark van Assem.

This document is a product of the Multimedia Annotation on the Semantic Web Task Force of the Semantic Web Best Practices and Deployment Working Group.