
Top-Down Modelling Approach


Authors: Frank Nack, Werner Bailer and Véronique Malaisé


General comment

There are classically two viewpoints on audiovisual documents in the archiving world: the video as a medium, or the video for the content it represents. In the first view the video is indexed so that parts can be reused in other contexts; in the second it is indexed for its message. The first focuses on the media level (optionally with subjective comments about feelings conveyed by the sound or the image), the second on the content level. The content level is often described in a particular schema that separates the description into fields, but these fields are not interconnected: the annotation is global for the document, be it a film, a shot or whatever document unit is selected. The fields usually represent the people seen on the screen or mentioned in the document, names (brands, companies, band names, etc.), locations and generic content description keywords. Another way of annotating is the one adopted in the structured textual annotation scheme of MPEG-7 and in the MultimediaN e-culture project: graphs representing who did what, when, where and how, elaborating on the scheme sketched by [Tam and Leung, 2001], with the different parts of the graph filled in by values coming from ontologies.

In large archives, and particularly for institutes with a very homogeneous collection, making links or graphs to connect the pieces of the annotation that belong together is very important for the precision of search. Without such explicit links, even when using keywords for the annotation, searching the archives raises the same problems as making a complex Web query without using double quotes: there is no guarantee that the elements searched for belong together, even if they occur in the same annotation. For example, if one document describes 3 heads of state meeting at a conference and two of them shaking hands, the countries of the different heads of state are likely to be mentioned, as well as their names and the action "shaking hands", but it is unlikely that this action will be connected with its actual actors. Searching for any of these heads of state shaking hands will retrieve the document.
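As an illustration only (the documents, persons and property names below are invented, not taken from any archive), the following sketch contrasts a flat keyword annotation with a small graph of triples for this example; only the graph lets a query check who actually performed the hand shake.

    # A minimal sketch contrasting a flat keyword annotation with a graph of
    # triples; all documents, persons and property names are invented.

    flat_annotation = {
        "doc1": {"persons": ["A", "B", "C"],
                 "keywords": ["conference", "shaking hands"],
                 "locations": ["Geneva"]},
    }

    graph_annotation = [                     # (subject, predicate, object) triples
        ("A", "attends", "conference"),
        ("B", "attends", "conference"),
        ("C", "attends", "conference"),
        ("A", "shakes_hands_with", "B"),
    ]

    def flat_query(doc, person, action):
        """Keyword search: both terms merely occur somewhere in the annotation."""
        ann = flat_annotation[doc]
        return person in ann["persons"] and action in ann["keywords"]

    def graph_query(person, action):
        """Graph search: the action must be explicitly linked to the person."""
        return any(s == person and p == action for s, p, _ in graph_annotation)

    print(flat_query("doc1", "C", "shaking hands"))   # True -> false positive
    print(graph_query("C", "shakes_hands_with"))      # False: C never shook hands
    print(graph_query("A", "shakes_hands_with"))      # True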


Basic Assumption for this WG

We have 3 dimensions in which we can look at the media annotation problem:

1. The media: which particular aspects of a medium need to be described to facilitate actions performed by the user? Media differ in their expressive strength (e.g. visuals are strong in their denotative power, whereas audio or haptics are better at stimulating feelings, and text is strong in paradigmatic processes). Taking into consideration what the cognitive power of a medium is might help us distil the basics to be described in order to achieve the widest coverage. Media also differ in the content dimensions they support, e.g. time, 2D space, 3D space.

2. The context describes the circumstances under which the media is accessed, e.g. presentation generation, pure search, a mobile environment, etc., and the combination/embedding with other media items, e.g. inclusion in a Web page, text/images/video clips in an EPG, etc. The relevant question here is: which information elements are necessary/relevant to achieve the correct context? In the 'mobile' scenario this means: we have to think about the essential attributes of 'location', and once that is clear we determine how those can be minimally described so that a larger variety of processes/actions can be performed. (Assumption: we do not model the processes but rather design metadata that allow the applications to handle the material appropriately; we also do not intend to model process-related metadata, e.g. processing applied to a medium.)

3. The actual tasks performed by the user: they require particular information to be performed correctly. The relevant questions are: how should whatever we design support the tasks users perform on and with media? Which tasks (e.g. search, manipulation, generation, ...) would we like to support? Do we make a distinction between general and specific tasks (general tasks are those that can work alone, such as search, whereas specific tasks are those that need others to be functional, e.g. for manipulating a video, it first needs to be found)? Finally, what are the essential terms/tags/description structures we have to come up with?

What we aim for is a minimal set of properties that allow us to describe a medium in a way that covers all 3 dimensions. The general question for each dimension is: how granular do we intend to become? This points to a further question: are we aiming for tags only, for clustered tags (the cluster is a dimension and the associated terms are the tags), or for structured descriptions, e.g. classes with attributes and relations between them, where for instance each dimension is a class and the subclasses are e.g. the media types, of which each might have attributes?
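Purely as an illustration of the last option (all class and attribute names below are assumptions, not a proposed vocabulary), such a structured description could be sketched roughly as follows, with each dimension as a class, media types as subclasses and relations expressed as references between instances.

    # Hypothetical sketch of the "structured description" option:
    # dimensions as classes, media types as subclasses, explicit relations.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Medium:                                # dimension 1: the media
        title: str
        duration_s: Optional[float] = None       # only meaningful for time-based media

    @dataclass
    class Video(Medium):
        frame_rate: float = 25.0

    @dataclass
    class Image(Medium):
        width: int = 0
        height: int = 0

    @dataclass
    class Context:                               # dimension 2: the context of access
        scenario: str                            # e.g. "mobile", "EPG", "presentation generation"
        location: Optional[str] = None

    @dataclass
    class Task:                                  # dimension 3: the task performed by the user
        name: str                                # e.g. "search", "annotate", "aggregate"
        requires: List[str] = field(default_factory=list)  # metadata attributes the task needs

    # Relations between the dimensions are plain references between instances.
    clip = Video(title="Evening news, item 3", duration_s=95.0)
    ctx = Context(scenario="mobile", location="train")
    task = Task(name="search", requires=["title", "duration_s", "location"])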


Media

We consider the following media: text (? not sure if text was meant to be covered by the MAWG mandate), image, video, audio, multi-source audiovisual content (e.g. multi-view video, surround audio), 3D models and scenes, haptic (?), olfactory (?).

For each medium we have to consider modes, such as: static or interactive, fixed or mobile, realistic or abstract. For audio also: voice or melodic.

For each medium, the type-specific metadata necessary to access media of that type need to be supported.

Example: if we have a news video and we wish to support the combination of videos into a personalized news show, some aggregation of material needs to be performed, where the personalization could address the topic (international news only), the performer (clips from a set of reporters only) or the form (short clips, handheld camera, etc.). For an EPG this is different, as different media are combined (a different sort of aggregation), e.g. text, images, audio and video. The problem here is to identify those aspects that are relevant for a useful presentation: e.g. which medium is more important for a user, which medium attracts more attention (usually the visual media that move), etc.
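A rough sketch of such an aggregation step (the clip attributes and preference fields are invented for illustration only) could look like this; the point is simply which content- and production-level metadata the personalization would need.

    # Hypothetical sketch: selecting clips for a personalized news show.
    # The attribute names (topic, reporter, form) are assumptions.

    clips = [
        {"title": "Summit opens",   "topic": "international", "reporter": "R1", "form": "short"},
        {"title": "Local election", "topic": "national",      "reporter": "R2", "form": "short"},
        {"title": "Flood report",   "topic": "international", "reporter": "R3", "form": "long"},
    ]

    preferences = {"topic": "international", "reporters": {"R1", "R3"}, "form": "short"}

    def personalized_show(clips, prefs):
        """Keep clips matching the user's topic, preferred reporters and form."""
        return [c for c in clips
                if c["topic"] == prefs["topic"]
                and c["reporter"] in prefs["reporters"]
                and c["form"] == prefs["form"]]

    for clip in personalized_show(clips, preferences):
        print(clip["title"])                  # -> Summit opens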

Related use cases: 6080 Use case "Tagging", 6081 Use case "semantic media analysis", 6141 Video, 6142 Mobile


Context

Each use case addresses a certain context, which requires particular tasks to be performed. Below is a list of example use cases that do not represent a medium or mode only:

6064: IPTV

Task | Purpose
watch | essential consumption mode => TV programs, multimedia resources (YouTube, etc.)
search | TV program, other information
filtering | recommendation
aggregation | EPG
interact | e.g. with the TV program (nm2)


6067: Audiovisual archive

Task | Purpose
watch | essential consumption mode => TV programs, multimedia resources (YouTube, etc.)
annotate | covering various content aspects (event, object, media feature, etc.) => structured, simple tags, etc.
exchange/map/merge | existing proprietary metadata
search | over various collections => person, location, event
personalizing | based on user preferences (topics, media, etc.)
aggregation | presentation
fragmentation | presentation
watch/listen/read | essential consumption modes => presentation
interact | the user can generate a presentation based on actions (e.g. starts with browsing the collection and the system generates the presentation form on the fly)


6077: Image

Task | Purpose
annotate | images in the personal archive or a public archive (Flickr), covering various content aspects (event, object, media feature, etc.) => structured, simple tags, etc.
selection/search | a single image or groups (e.g. event)
watch | essential consumption mode
sharing | inform others about images, send them, send a link, etc. (several tasks can be combined)
rating | setting values of importance (can address various attributes of the medium but also the user's preferences)


6078: Music

Task | Purpose
annotate | music in the personal archive or a public archive, covering various content aspects (composer, structure of a piece, media feature, etc.) => structured, simple tags, etc. A special feature here are the lyrics => text (different medium)
selection/search | parts or complete pieces
exchange/map/merge | existing proprietary metadata
listen | essential consumption mode
sharing | inform others about existing material
rating | setting values of importance (can address various attributes of the medium but also the user's preferences)
aggregation | based on the user's action (listening) new material might be collected (e.g. playlist)
interact | jumping within a piece, jumping from one part of a piece to another (related to media description)
inform/distribute | based on the user's action (e.g. listen) the system can collect e.g. related events in the real world (see also aggregation => aggregation is a task that combines several tasks: search, collect, present). Note: here the mobile mode (see media) is relevant


6078: News


Task | Purpose
annotate | mainly done by professionals based on a defined vocabulary => structured, simple tags, etc.
selection/search | the user searches for a particular topic
watch/listen/read | essential consumption mode, depending on the media the news is distributed in
sharing | inform others about existing material
rating | setting values of importance (can address various attributes of the medium but also the user's preferences)
aggregation | based on the user's action (news channel) new material might be collected and distributed (see RSS)
interact | jumping within a piece, jumping from one part of a piece to another (related to media description)
inform/distribute | based on the user's action (e.g. looking at a film) the system can collect related events in the real world (see also aggregation => aggregation is a task that combines several tasks: search, collect, present): e.g. an actor just dies, so news is provided as a caption

Here is also a collection of concepts that frequently appeared in the context dimension of our use cases: personal and grouped tags, spatial and temporal relations between objects in a medium and between media, media analysis, rhetoric structures, geo data, production information.


Tasks

Our use cases provide a vast number of tasks that can be performed on media; the tasks mentioned above are not the only ones. Here is the list of tasks extracted from the current use cases:

Search, tagging, adapt, personalize, filter (collaborative), reuse, exchange, map, merge, extract, maintaining, listen, read, watch, mix, generate, summarize, present, interact, query, retrieve.

The following is a list of basic tasks:

  • Annotate
  • Analyse
  • Search
  • Distribute/present
  • Watch, listen, read
  • Share (?)


While it seems promising to build on basic or generic tasks (such as those defined in the Canonical Processes papers) not bound to specific use cases, these tasks might be very different depending on media (e.g. the watch/listen tasks, although otherwise well suited to define a generic consumption task), context (e.g. annotation in end user scenarios like personal image/music collection vs. archive documentation) and other aspects (e.g. there are many types of search requiring different metadata, even when searching the same media type in the same application context).

For all those basic tasks the question remains: in comparison with the other dimensions, what are the basic attributes that need to be described?


There are a number of tasks that require task chains:

  • Adapt => search for relevant material, generate new context, present
  • Mix => analyze media, combine material, present
  • Summarize => analyze the material, extract, present
  • Inform => observe user behavior, identify related material, generate info block, distribute

These task chains (could we call them processes as well?) seem to be interesting for our purpose, as the answer to the question “which metadata items need to be passed from one task to a subsequent one to make the chain work” helps define the minimum set of attributes.
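As a rough illustration of that question (the task names follow the "Summarize" chain above, but all attribute names are invented), the following sketch walks one chain and collects the attributes that must be supplied up front because no earlier task produces them.

    # Hypothetical sketch of the "Summarize" chain (analyze => extract => present).
    # Each step declares the metadata attributes it needs and the ones it produces;
    # all attribute names are assumptions made for illustration only.

    CHAIN = [
        # (task, attributes required, attributes produced)
        ("analyze", {"media_type", "duration_s"}, {"shot_boundaries"}),
        ("extract", {"shot_boundaries"},          {"key_segments"}),
        ("present", {"key_segments", "title"},    {"summary"}),
    ]

    def minimum_attributes(chain, initial):
        """Collect every attribute that must be supplied up front, i.e. required
        by some task in the chain but not produced by an earlier one."""
        available, missing = set(initial), set()
        for task, required, produced in chain:
            missing |= required - available
            available |= produced
        return missing

    # Starting from an empty annotation, these are the attributes the metadata
    # format must be able to express for this chain to work:
    print(minimum_attributes(CHAIN, initial=set()))
    # -> media_type, duration_s, title (as a set; order may vary)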

Some of these tasks are already covered by use cases, namely 6080 Use case "Tagging", 6083 Use case "multimedia search" , 6084 Multimedia Adaptation , 6085 Use case "multimedia sharing", 6086 Multimedia Presentation , 6142 Mobile


Suggestion for an additional use case

The analysis so far is rather process oriented. It would be necessary, though, to come up with an example where the user asks for particular facts that can then be provided in any medium. It is important to figure out how much direct content access we have to support.


Related material:

  • A. M. Tam and C. H. C. Leung, "Structured natural-language descriptions for semantic content retrieval of visual materials", Journal of the American Society for Information Science and Technology, 52(11), pp. 930–937, 2001. doi:10.1002/asi.1151.abs