Use Case: Photo Use Case

Index

Contents

Introduction
Motivating Examples
1. Photo annotation and selection
2. Exchanging and sharing photos
The fundamental problem: semantic content understanding
1. The role of metadata standards for photos
The multimedia semantics interoperability problem
1. Different levels and types of metadata for photos
2. Different standards for photo metadata and annotations
Towards a solution
1. Identification of interoperability needs and use cases
2. RDF Scheme for DIG35 and MPEG-7
References

1. Introduction

We currently face a market in which, in Europe for example, more than 20 billion digital photos are taken per year (GFK, 2006). At the same time, a growing number of tools, both on the desktop and on the Web, support automatic as well as manual annotation of this content. For example, many personal photo management tools extract information from the so-called EXIF (EXIF) header and add it to the photo description. These tools typically allow users to tag and describe single photos. There are also many Web tools for uploading, sharing, organizing and annotating photos. Web sites such as (Flickr, 2007) allow tagging on a large scale. Sites like (Riya, 2007) provide specific services such as face detection and face recognition for personal photo collections. Photo community sites such as (Foto Community, 2007) allow users to organize photos in categories and to rate and comment on them. Even though more and more tools are available to manage and share our photos, these tools come with different capabilities. What remains difficult is finding, sharing and reusing photo collections across the borders of tools and sites. Not only do the ways in which photos are automatically and manually annotated differ, but the resulting metadata is also described and represented according to many different standards. Managing personal photo collections starts with the semantic understanding of the photos.

2. Motivating Examples

From the perspective of an end user, let us consider the following scenario to describe what is missing and what is needed for next generation digital photo services. Ellen Scott and her family were on a nice two-week vacation in Tuscany. They enjoyed the sun on the beaches of the Mediterranean, appreciated the great culture in Florence, Siena and Pisa, and followed the traces of the Etruscans through the small villages of the Maremma. During their marvelous trip, the family took pictures of the sights, the landscapes and, of course, the family members. The digital camera they use is already equipped with a GPS receiver, so every photo is stamped not only with the time at which it was taken, but also with the geo-location where it was taken.

2.1. Photo annotation and selection

Back home, the family uploads about 1000 pictures from the camera to the computer and wants to create an album for granddad. On this computer, the family uses a nice photo management tool which extracts some basic features such as the EXIF header and also allows for entering tags and personal descriptions. Still full of memories of the nice trip, the mother of the family labels most of the photos. With a second tool, the track recorded by the GPS receiver and the photos are merged using the time stamp. As a result, each of the photos is geo-referenced with the GPS position stored in the EXIF header. However, showing all the photos would take an entire weekend. So Ellen starts to create a nice and interesting excerpt of their trip and its highlights. Her photo album software takes in the 1000 pictures and makes suggestions for the selection and the arrangement of the pictures in a photo album. For example, the album software shows her a map of Tuscany, visualizes where she has taken which photos, and groups them together, suggesting which photos would best represent each part of the vacation. For places for which the software detects highlights, the system offers to add information about the place to the album, stating, for example, that on this piazza in front of the Palazzo Vecchio there is a copy of Michelangelo's famous David statue. Depending on the selected style, the software creates a layout and distributes all images over the pages of the album, taking into account color, spatial and temporal clusters and template preference. So, in about 20 minutes Ellen has finished the album and orders a paper version as well as an online version. The paper album is delivered to her by mail three days later. It looks great, and the explanatory texts that her software has added almost automatically to the pictures are informative and help her remember the great vacation. They show the album to grandpa, and he can take his time to study their vacation and wonderful Tuscany.
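
The merging step mentioned in this scenario, matching each photo to the GPS track by time stamp, can be sketched roughly as follows. This is only a minimal illustration in Python: the track data, the function name locate and the ten-minute tolerance are assumptions made for the example and do not describe any particular tool.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Hypothetical GPS track: (time stamp, latitude, longitude), sorted by time.
track = [
    (datetime(2007, 7, 14, 10, 0, 0), 43.7696, 11.2558),
    (datetime(2007, 7, 14, 10, 5, 0), 43.7700, 11.2570),
    (datetime(2007, 7, 14, 10, 10, 0), 43.7710, 11.2590),
]


def locate(photo_time, track, max_gap=timedelta(minutes=10)):
    """Return (lat, lon) of the track point closest in time, or None.

    max_gap is an assumed tolerance: if no track point is close enough
    in time, the photo is left without a position.
    """
    times = [t for t, _, _ in track]
    i = bisect_left(times, photo_time)
    # Only the neighbours just before and just after the photo time matter.
    candidates = track[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda point: abs(point[0] - photo_time))
    if abs(best[0] - photo_time) > max_gap:
        return None
    return best[1], best[2]


# A photo taken at 10:04 is assigned the position recorded at 10:05.
position = locate(datetime(2007, 7, 14, 10, 4, 0), track)
```

A real tool would additionally have to handle time zone and clock offsets between camera and GPS receiver before writing the position into the EXIF header.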

2.2. Exchanging and sharing photos

Selecting the most impressive photos, the son of the family uploads a nice set of photos to Flickr to give his friends an impression of the great vacation. Unfortunately, all the descriptions and annotations from the personal photo management system are lost after the Web upload. Therefore, he adds a few tags of his own to the Flickr photos to describe the places, events and persons of the trip. Even the GPS track is lost, so he places the photos on the Flickr map application again to geo-reference them. One friend finds a cool picture of the Spanish Steps in Rome by night and would like to get the photo and its location from Flickr. This is difficult again, as a pure download of the photo does not retain the geo-location. When aunt Mary visits the Web album and starts looking at the photos, she would like to incorporate some of the pictures of her nieces and nephews into her own photo management software and tries to download a few onto her laptop. And again, the system imports the photos, but the precious metadata that the mother and son of the family have already annotated twice is gone.

3. The fundamental problem: semantic content understanding

What is needed is a better and more effective automatic annotation of digital photos that better reflects one's personal memory of the events captured by the photos and allows different applications to create value-added services on top of them, such as the creation of a personal photo album book. When it comes to understanding personal photos and overcoming the semantic gap, however, digital cameras leave us with files like dsc5881.jpg, a very poor reflection of the actual event. Such a file is a 2D visual snapshot of a multi-sensory personal experience. The quality of the photos is often very limited (snapshots, overexposed, blurred, ...). On the other hand, digital photos come with a large potential for semantic understanding. Photographs are always taken in context. In comparison to analog photography, digital photos provide us with explicit contextual information (time, flash, aperture, ...), and a "unique id" such as the timestamp makes it possible to merge further contextual information with the pure image content later on.

However, what we want to remember along with the photo is where it was taken, who was there with us, what can be seen in the photo, what the weather was like, whether we liked the event, and so on. In recent years, it has become clear that signal analysis alone will not be the solution. In combination with the context of the photo, such as GPS position or time stamp, some hard signal processing problems can be solved better. Context analysis has therefore gained much attention and has become important and very helpful for photo understanding (Scherp et al 2007). The following figure gives a simple example of how to combine signal analysis and context analysis to achieve a better indoor/outdoor detection of photos. In addition, and not only with the advent of the Web 2.0, the actual user has come into focus. The manual annotation effort of single users as well as collaborative effects are considered to be important for semantic photo understanding.

Figure: multimodal indoor/outdoor detection
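
To make the combination of signal analysis and context analysis more concrete, the following minimal Python sketch scores a photo as indoor or outdoor. It is purely illustrative: the feature names, the weights and the 0.5 threshold are assumptions chosen for the example and are not taken from the figure or from any published method.

```python
from dataclasses import dataclass


@dataclass
class PhotoEvidence:
    mean_brightness: float  # signal analysis: 0.0 (dark) to 1.0 (bright)
    flash_fired: bool       # context: read from the EXIF header
    hour_of_day: int        # context: derived from the capture time stamp


def outdoor_score(photo: PhotoEvidence) -> float:
    """Combine a signal feature with context hints (all weights assumed)."""
    score = photo.mean_brightness       # signal analysis alone
    if photo.flash_fired:
        score -= 0.3                    # flash suggests a dark, often indoor scene
    if 8 <= photo.hour_of_day <= 18:
        score += 0.2                    # daytime makes outdoor more plausible
    else:
        score -= 0.2
    return min(max(score, 0.0), 1.0)


def classify(photo: PhotoEvidence) -> str:
    return "outdoor" if outdoor_score(photo) >= 0.5 else "indoor"


# Brightness alone (0.45) would fall below the threshold; the daytime
# context without flash lifts the photo above it.
midday_photo = PhotoEvidence(mean_brightness=0.45, flash_fired=False, hour_of_day=13)
label = classify(midday_photo)
```

The point of the sketch is only that context from the EXIF header can shift a borderline decision that signal analysis alone would get wrong.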

3.1. The role of metadata standards for photos

The role of metadata for this usage of photo collections is manifold:
* Save the experience: The central goal is to overcome the semantic gap and to represent as much as possible of the human's impression of the moment when the photo was taken.
* Browse and find previously taken photos: Allow searching for events and persons, places, moments in time, ...
* Share photos together with their metadata with others: Give your annotated photo from Flickr or from Foto Community to your friend's application.
* Use comprehensive metadata for value-added services on the photos: Create an automatic photo collage or send a flash presentation to your aunt's TV, notify all friends that are interested in photos from certain locations, events, or persons, ...

The following figure illustrates how photos are used today and what we do with our photos, both at home and on the Web.

Figure: usage of photos

So the social life of personal photos can be summarized as:

Capturing: one or more persons capture an event, with one or several cameras with different capabilities and characteristics
Storing: one or more persons store the photos with different tools on different systems
Processing: post-editing with different tools that change the quality and maybe the metadata
Uploading: some persons make their photos available on Web (2.0) sites (Flickr); different sites offer different kinds of value-added services for the photos (Riya)
Sharing: photos are given away or made accessible via email, Web sites, print, ...
Receiving: photos from others are received via MMS, email, download, ...
Combining: photos from one's own and other sources are selected and reused for services like T-shirts, mugs, mouse pads, photo albums, collages, ...

Metadata plays a central role at all of these times and places in the social life of our photos.

4. The multimedia semantics interoperability problem

4.1. Different levels and types of metadata for photos

The problem we have here is that metadata these days is a precious asset for interesting services on top of a photo collection, but it is created and enhanced by different tools and systems and follows different standards and representations. Even though there are many tools and standards around that aim to capture and maintain this metadata, they are not necessarily interoperable. So on a technical level we have the problem of finding a common representation of metadata that is helpful and relevant for photo management, sharing and reuse. The metadata an end user typically gets in touch with is descriptive metadata that stems from the context of the photo. At the same time, more than a decade of research in multimedia analysis has produced many results for extracting valuable features from multimedia content. For photos, for example, this includes color histograms, edge detection, brightness, texture and so on. With MPEG-7, a very large standard has been developed that makes it possible to describe these features in a metadata standard and to exchange content and metadata with other applications. However, both the size of the standard and the many optional attributes in the standard have led to a situation in which MPEG-7 is used only in very specific applications and has not become a worldwide accepted standard for adding (some) metadata to a media item. Especially in the area of personal media, in the same fashion as in the tagging scenario, a small but comprehensive, shareable and exchangeable description scheme for personal media is missing.
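
As an illustration of the kind of low-level features mentioned above, the following Python sketch computes a coarse color histogram and a brightness value from a list of RGB pixels. It is a simplified example under stated assumptions: the function names and the choice of four bins per channel are made up for the illustration, and the sketch does not follow the MPEG-7 descriptor definitions.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return normalized per-channel histograms for (r, g, b) pixels in 0..255."""
    if not pixels:
        raise ValueError("at least one pixel is required")
    width = 256 // bins_per_channel
    counts = {channel: [0] * bins_per_channel for channel in "RGB"}
    for r, g, b in pixels:
        for channel, value in zip("RGB", (r, g, b)):
            # Clamp to the last bin in case 256 is not divisible by the bin count.
            index = min(value // width, bins_per_channel - 1)
            counts[channel][index] += 1
    total = len(pixels)
    return {channel: [c / total for c in bins] for channel, bins in counts.items()}


def mean_brightness(pixels):
    """Return the mean luma in 0.0..1.0, using the common Rec. 601 weights."""
    if not pixels:
        raise ValueError("at least one pixel is required")
    luma = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels)
    return luma / (255 * len(pixels))


# Four hypothetical pixels standing in for a decoded photo.
sample = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (10, 200, 30)]
histogram = color_histogram(sample)
brightness = mean_brightness(sample)
```

Values like these are exactly what is lost when a photo moves between tools that do not share a metadata representation.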

4.2. Different standards for photo metadata and annotations

What is needed is a machine-readable description that comes with each photo and allows a site to offer valuable search and selection functionality on the uploaded photos. Even though approaches for photo annotation have been proposed, they still do not address the wide range of metadata and annotations that could and should be stored with an image in a standardized fashion.

EXIF (EXIF) is a standard that comprises many photographic and capture-relevant metadata. Even though the end user might use only a few of the key-value pairs, they are relevant at least for photo editing and archiving tools, which read this kind of metadata and visualize it. So EXIF is a necessary set of metadata for photos.
Tags from Flickr and other photo Web sites and tools are metadata of low structure but high relevance for the user and the use of the photos. Added manually, they reflect the user's knowledge and understanding of the content, which cannot be replaced by any automatic semantic extraction. Therefore, a representation of these tags is needed. Depending on the source of the tags, it might be of interest to relate the tags to their origin, such as "taken from an existing vocabulary", "from a suggested set of other tags" or just "free tags".
XMP seems to be a very promising standard as it allows RDF-based metadata to be defined for photos. However, the description of the standard clearly states that it leaves the application-dependent schema/vocabulary definition to the application and only makes suggestions for a set of "generic" schemas such as EXIF and Dublin Core. So the standard could be a good "host" for a defined photo metadata description scheme in RDF, but it does not define one.
PhotoRDF (W3C, 2002) "describes a project for describing & retrieving (digitized) photos with (RDF) metadata. It describes the RDF schemas, a data-entry program for quickly entering metadata for large numbers of photos, a way to serve the photos and the metadata over HTTP, and some suggestions for search methods to retrieve photos based on their descriptions." However, the standard is split into three different schemas: Dublin Core (Dublin Core), a Technical Schema, which comprises more or less entries about author, camera and short description, and a Content Schema, which provides a set of 10 keywords. With PhotoRDF, the type and number of attributes is limited; it does not even comprise the full EXIF scheme and is also limited with regard to the content description of a photo.
The Extensible Metadata Platform or XMP (XMP) and the IPTC-NAA-Standard (IPTC) have been introduced to define how metadata (not only) of a photo can be stored with the media element itself. However, these standards come with their own sets of attributes to describe the photo or allow individual metadata templates to be defined. This is a serious obstacle to any sharing and semantic Web search. What is missing is an actual standardized vocabulary that captures which information about a photo is important and relevant to a large set of next generation digital photo services.
The Image Annotation on the Semantic Web (W3C, 2007) provides a very nice overview of existing standards such as those mentioned above. At the same time, it shows how diverse the world of annotation is. The use case for photo annotation chooses the RDF/XML syntax of RDF in order to gain interoperability. It refers to a large set of different standards and approaches that can be used for image annotation, but there is no unified view on image annotation and the metadata relevant for photos. The attempt here is to integrate existing standards. If those, however, are too many, too comprehensive, and might even have overlapping attributes, the result might not be adopted as the common photo annotation scheme on the Web. For the low-level features, for example, there is only a link to MPEG-7.
The DIG35 Initiative Group of the International Imaging Industry Association aims to "provide a standardized mechanism which allows end-users to see digital image use as being equally as easy, as convenient and as flexible as the traditional photographic methods while enabling additional benefits that are possible only with a digital format." (DIG35, 2007). The DIG35 standard aims to define a standard set of metadata for digital images that can be widely implemented across multiple image file formats. Of all the photo standards, this is the broadest one with respect to typical photo metadata, and it is already defined as an XML Schema.
MPEG-7 is far too big, even though the standard comprises metadata elements that are also relevant for a Web-wide usage of media content. The advantage of MPEG-7 is that one can define one's own description scheme and with it collect a subset of relevant feature-related metadata for a photo. However, there is no way to actually include an entire XML-based MPEG-7 description of a photo in the raw content. For the description of the content, the use case refers to three domain-specific ontologies: personal history event, location and landscape.
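
The idea of relating tags to their origin and representing them in RDF can be sketched as follows. The Python example below emits a small Turtle fragment using only string formatting. Note that the namespace http://example.org/photo# and the property and class names (ex:tag, ex:label, ex:origin, ex:FreeTag and so on) are hypothetical placeholders invented for this illustration; they are not defined by XMP, PhotoRDF or any other standard discussed here.

```python
from enum import Enum


class TagOrigin(Enum):
    """The three tag origins named in the text (identifiers are assumed)."""
    VOCABULARY = "ExistingVocabulary"
    SUGGESTED = "SuggestedTag"
    FREE = "FreeTag"


def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def tags_to_turtle(photo_uri, tags):
    """Serialize (label, TagOrigin) pairs as Turtle in a hypothetical vocabulary."""
    if not tags:
        raise ValueError("at least one tag is required")
    entries = [
        f'    ex:tag [ ex:label "{escape(label)}" ; ex:origin ex:{origin.value} ]'
        for label, origin in tags
    ]
    lines = [
        "@prefix ex: <http://example.org/photo#> .",
        "",
        f"<{photo_uri}>",
        " ;\n".join(entries) + " .",
    ]
    return "\n".join(lines)


turtle = tags_to_turtle(
    "http://example.org/photos/dsc5881.jpg",
    [("Florence", TagOrigin.VOCABULARY), ("family", TagOrigin.FREE)],
)
print(turtle)
```

A description like this could in principle be hosted by a container format such as XMP, which is exactly the division of labour the list above suggests: the container is standardized, while the vocabulary still has to be agreed upon.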

5. Towards a solution

5.1. Identification of interoperability needs and use cases

The result is clear: there is not one standardized representation and vocabulary for adding metadata to photos. Even though the different semantic Web applications and developments should be embraced, a photo annotation standard that is a patchwork of too many different specifications is not helpful. The following figure illustrates some of the different activities described in the scenario above: what people do with their photos and which local and Web tools they use for this.

Figure: interoperability

What is missing, however, for content management, search, retrieval, sharing and innovative semantic (Web 2.0) applications is a limited and simple but at the same time comprehensive vocabulary in a machine-readable, exchangeable, but not overcomplicated representation. The single standards described above each solve only part of the problem. For example, a standardization of tags is very helpful for a semantic search on photos in the Web. However, today the low(er)-level features are also lost. Even though tags are sufficient for semantic search, for a later use and exploitation of a set of photos, previously extracted and annotated lower-level features might be interesting as well. Maybe a Web site would like to offer a grouping of photos along the color distribution. Then either the site needs to extract a color histogram itself, or the photo already brings this information along in its standardized header information. Face detection software might have found the bounding boxes in the photo where a face has been detected and might also provide a face count. Then the Web site might allow users to search for photos with two or more persons in them. And so on. Even though low-level features do not seem relevant at first sight, for a detailed search, visualization and also later processing, the previously extracted metadata should be stored and available with the photo.
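
The two services mentioned above, searching for photos with two or more persons and grouping photos along their color distribution, become simple queries once the extracted features travel with the photo. The following Python sketch shows this under stated assumptions: the record layout (PhotoRecord, FaceBox, rgb_share) and the sample values are hypothetical and only stand in for whatever a standardized vocabulary would define.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FaceBox:
    """Bounding box of a detected face, in pixels (hypothetical layout)."""
    x: int
    y: int
    width: int
    height: int


@dataclass
class PhotoRecord:
    filename: str
    tags: List[str] = field(default_factory=list)
    faces: List[FaceBox] = field(default_factory=list)
    # Coarse color distribution: share of red, green and blue (assumed feature).
    rgb_share: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])


def with_min_faces(photos: List[PhotoRecord], minimum: int) -> List[PhotoRecord]:
    """Select photos whose stored face count is at least `minimum`."""
    return [photo for photo in photos if len(photo.faces) >= minimum]


def group_by_dominant_color(photos: List[PhotoRecord]) -> Dict[str, List[str]]:
    """Group file names by the channel with the largest stored share."""
    names = ("red", "green", "blue")
    groups: Dict[str, List[str]] = {}
    for photo in photos:
        dominant = names[photo.rgb_share.index(max(photo.rgb_share))]
        groups.setdefault(dominant, []).append(photo.filename)
    return groups


photos = [
    PhotoRecord("dsc5881.jpg", ["Florence"],
                [FaceBox(40, 30, 80, 80), FaceBox(200, 45, 70, 70)], [0.5, 0.3, 0.2]),
    PhotoRecord("dsc5882.jpg", ["beach"], [], [0.2, 0.3, 0.5]),
    PhotoRecord("dsc5883.jpg", ["Siena"], [FaceBox(10, 10, 60, 60)], [0.3, 0.5, 0.2]),
]

group_photos = with_min_faces(photos, 2)
color_groups = group_by_dominant_color(photos)
```

Neither query requires the Web site to re-run face detection or histogram extraction, which is the argument for keeping previously extracted metadata with the photo.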

5.2. RDF Scheme for DIG35 and MPEG-7

6. References

(GFK, 2006): GfK Group for CeWe Color. Usage behavior digital photography, 2006.

(Flickr, 2007): Flickr. Yahoo! Inc, USA. http://www.flickr.com/

(Riya, 2007): Riya Foto Search. http://www.riya.com/

(Foto Community, 2007): Foto Community. http://www.fotocommunity.com/

(DIG35, 2007): International Imaging Industry Association. DIG35. http://www.i3a.org/i_dig35.html

(W3C, 2002): Photo RDF - Describing and retrieving photos using RDF and HTTP, W3C Note 19 April 2002, http://www.w3.org/TR/photo-rdf

(Dublin Core): Dublin Core. http://dublincore.org/

(EXIF): EXIF - Exchangeable Image File Format, Japan Electronic Industry Development Association (JEIDA). Specification version 2.2.

(XMP): Adobe, Extensible Metadata Platform (XMP) http://www.adobe.com/products/xmp/index.html

(IPTC): Information Interchange Model http://www.iptc.org/IIM/

(W3C, 2007): Image Annotation on the Semantic Web. http://www.w3.org/2001/sw/BestPractices/MM/image_annotation.html#vocabularies