Use Case : Semantic Media Analysis for Intelligent Retrieval

Authors: Ioannis Pratikakis, Sofia Tsekeridou

Introduction

Seen from a multimedia retrieval perspective, Semantic Media Analysis amounts to the automatic creation of semantic indices and annotations, based on multimedia and domain ontologies, to enable intelligent, human-like multimedia retrieval. An efficient multimedia retrieval system [1] must:

be able to handle the semantics of the query;
unify multiple modalities in a homogeneous framework; and
abstract the relationship between low-level media features and high-level semantic concepts, so that the user can query in terms of these concepts rather than in terms of examples, i.e. introduce the notion of ontologies.

This Use Case aims to pinpoint problems that arise when creating semantic indices and annotations automatically in an attempt to bridge the multimedia semantic gap, and to provide corresponding solutions using Semantic Web technologies.

Multimedia data retrieval based only on low-level features, as in “querying by example” and in content-based retrieval paradigms and systems, has the advantage that the required low-level features can be computed automatically. On the other hand, such a methodology cannot respond to high-level, semantic-based queries. It loses the relation between low-level multimedia features (such as pitch or zero-crossing rate in audio, color and shape in image and video, or word frequency in text) and the high-level domain concepts that characterize the underlying knowledge in the data, which a human grasps quickly but a machine cannot. For this reason, an abstraction of high-level multimedia content descriptions and semantics is required. It has to build on what can actually be generated automatically, such as low-level features obtained by low-level processing, and on methods, tools and languages that represent the domain ontology and attain the mapping between the two. The latter is needed so that semantic indices are extracted as automatically as possible, rather than being produced manually, which is time-consuming and not always effective since it yields many subjective annotations. To avoid these limitations of manual semantic annotation of multimedia data, metadata standards and ontologies (upper, domain, etc.) have to be used and have to interoperate. Thus, a requirement emerges for multimedia semantics interoperability, so that solutions can interoperate efficiently given the distributed nature of the Web and the enormous amounts of multimedia data published there.

An example solution for the interoperability problem stated above is the MPEG-7 standard. MPEG-7, composed of various parts, defines both metadata descriptors for structural and low-level aspects of multimedia documents and high-level description schemes (Multimedia Description Schemes) for higher-level descriptions, including the semantics of multimedia data. However, it does not determine the mapping of the former to the latter for the addressed application domain. A number of publications have appeared that define an MPEG-7 core ontology (J. Hunter, C. Tzinaraki) to address such issues. What is important is that MPEG-7 provides standardised descriptors at both the low level and the high level. The value sets of those descriptors, along with a richer set of relationship definitions, could form the missing piece, together with the knowledge discovery algorithms that use them to extract semantic descriptions and indices from multimedia data in an almost automatic way. The bottom line is thus that MPEG-7 metadata descriptions need to be properly linked to domain-specific ontologies that model high-level semantics.

Furthermore, one should usually consider the multimodal nature of multimedia data and content on the Web. The same concept may be described by different means, for example by a news text as well as by an image showing a snapshot of what the news is reporting. Cross-linking between different media types or their corresponding modalities offers rich scope for inferring a semantic interpretation, so interoperability between different single-media schemes (audio ontology, text ontology, image ontology, video ontology, etc.) is an important issue. This emerges from the need to homogenise different single modalities, each of which:

can infer particular high-level semantics with a different degree of confidence (e.g. one may rely mainly on audio rather than text for inferring certain concepts),
can be supported by a world model (or ontologies) in which different relationships exist, e.g. in an image one can attribute spatial relationships, while in a video sequence spatio-temporal relationships can be attained, and
can have a different role in a cross-modality fashion, i.e. which modality triggers the other; e.g. to identify that a particular photo in a Web page depicts person X, we first extract information on the person's identity from the text and thereafter cross-validate it with the corresponding information extracted from the image.

Both of the above concerns, whether the single modality is tackled first or the cross-modality case (which essentially encapsulates the single modality), require semantic interoperability. This interoperability has to support a knowledge representation of the domain concepts and relationships, of the multimedia descriptors and of the cross-linking of both, as well as a multimedia analysis part combined with modeling, inferencing and mining algorithms directed towards automatic semantics extraction from multimedia, to further enable efficient semantic-based indexing and intelligent multimedia retrieval.

Motivating Examples

In the following, current pitfalls with respect to the desired semantic interoperability are illustrated via examples. The discussed pitfalls are not the only ones; further discussion is needed to cover the broad scope of semantic multimedia analysis and retrieval.

Example 1 - Single modality case: Lack of semantics in low-level descriptors

The linking of low-level features to high-level semantics can be obtained by the following two main trends:

using machine learning and mining techniques to infer the required mapping, based on a basic knowledge representation of the concepts of the addressed domain (usually low-to-medium level inferencing) and
using ontology-driven approaches to both guide the semantic analysis and infer high-level concepts using reasoning and logics. This trend can include the first one as well and then be further driven by medium-level semantics to more abstract domain concepts and relationships.

In both trends above, it is appropriate for granularity purposes to produce concept/event detectors, which usually incorporate a training phase applied to training feature sets for which ground truth is available (a priori knowledge of the addressed concepts or events). This phase enables optimization of the underlying artificial intelligence algorithms. Semantic interoperability cannot be achieved by only exchanging low-level features, wrapped in standardised metadata descriptors, between different users or applications, since formal semantics are lacking. In particular, a set of low-level descriptors (e.g. MPEG-7 audio descriptors) is not semantically meaningful on its own, since it has no intuitive interpretation at higher levels of knowledge. Such descriptors have nevertheless been used extensively in content-based retrieval that relies on similarity measures. Low-level descriptors are represented as vectors of numerical values, and thus they are useful for content-based multimedia retrieval rather than for a semantic multimedia retrieval process.
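
To make the point about similarity measures concrete, the following minimal sketch ranks items by the cosine similarity of their descriptor vectors to a query vector. The three-bin "color histograms", the item identifiers and the choice of cosine similarity are illustrative assumptions and are not prescribed by MPEG-7 or by this Use Case. Note that the ranking is purely numerical: nothing in it says what any image depicts.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def query_by_example(query_vector, collection):
    """Return item identifiers ordered by decreasing similarity to the query."""
    scored = [(cosine_similarity(query_vector, vector), item_id)
              for item_id, vector in collection.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored]


# Hypothetical three-bin color histograms, invented for illustration only.
collection = {
    "img_a": [0.8, 0.1, 0.1],
    "img_b": [0.1, 0.8, 0.1],
    "img_c": [0.7, 0.2, 0.1],
}
ranking = query_by_example([0.9, 0.05, 0.05], collection)
print(ranking)
```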

Furthermore, since a set of optimal low-level descriptors per target application (be it music genre recognition or speaker indexing) can be conceived only by multimedia analysis experts, this set has to be transparent to any other user. For example, although a non-expert user can understand the color and shape of a particular object, he or she is unable to give this object a suitable representation by selecting appropriate low-level descriptors. Low-level descriptors thus not only lack semantics but also restrict their direct use to people who have gained particular expertise in multimedia analysis and multimedia characteristics.

The problem raised by this Example is how low-level descriptors can be efficiently and automatically linked and turned into an exchangeable bag of semantics.

Example 2 - Multi-modality case: Fusion and interchange of semantics among media

In multimedia data and web content, cross-modality aspects are dominant. Semantic multimedia analysis and retrieval can exploit this characteristic efficiently when all modalities can be used to infer the same or related concepts or events. One aspect, again motivated by the analysis part, concerns capturing particular concepts and relationships whose automatic extraction requires the modalities to be processed in a certain order of priority. For example, to enhance the recognition of the face of a particular person in an image appearing in a Web page, which is actually a very difficult task, it seems more natural and efficient to base the initial inferencing on the textual content, in order to locate the identity (name) of the person, and thereafter to validate or enhance the results with related results from image analysis. Similar benefits of multimodal media analysis can be obtained by analysing synchronized audio-visual content in order to annotate it semantically. The trends there are:

to construct combined feature vectors from audio and visual features and feed those to machine learning algorithms to extract combined semantics;
to analyse each single modality separately towards recognizing medium-level semantics or the same concepts, and then fuse the analysis results (decision fusion), usually in a weighted or ordered manner (depending on the underlying cross-relations of the single modalities towards the same topic), to either improve the accuracy of the semantics extraction results or enrich them towards higher-level semantics.
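
The difference between the two trends can be sketched as follows. This is a minimal illustration under stated assumptions: the feature values, the per-modality confidence scores and the weights are invented, and a weighted average is only one possible way to perform decision fusion.

```python
def early_fusion(audio_features, visual_features):
    """First trend: concatenate per-modality features into one combined vector.

    The combined vector would then be fed to a single learning algorithm.
    """
    return list(audio_features) + list(visual_features)


def decision_fusion(scores, weights):
    """Second trend: weighted average of per-modality confidences for one concept.

    scores  maps a modality name to that modality's confidence in the concept.
    weights maps a modality name to its assumed importance for the concept.
    """
    total_weight = sum(weights[modality] for modality in scores)
    if total_weight == 0:
        raise ValueError("at least one modality needs a non-zero weight")
    weighted_sum = sum(weights[modality] * scores[modality] for modality in scores)
    return weighted_sum / total_weight


# Illustrative values only.
combined = early_fusion([0.2, 0.7], [0.1, 0.9, 0.4])
fused = decision_fusion({"audio": 0.9, "visual": 0.6},
                        {"audio": 2.0, "visual": 1.0})
print(combined, round(fused, 3))
```

In the first function the modalities lose their identity before any decision is made, while in the second each modality is analysed on its own and only the decisions are combined, which is what allows a modality to be weighted according to how reliable it is for the concept at hand.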

For the sake of clarity, an example scenario is described in the following which is taken from the ‘sports’ domain and more specifically from ‘athletics’.

Let's assume that we need to semantically index and annotate, as automatically as possible, the web page shown in Figure 1, which is taken from the site of the International Association of Athletics Federations [2]. The subject of this page is ‘the victory of the athlete Reiko Sosa at the Tokyo marathon’. Let's try to answer the question: what analysis steps are required if we would like to enable semantic retrieval results for the query “show me images with the athlete Reiko Sosa”?

One might notice that each image in this web page has a caption that includes very useful information about the content of the image, in particular the persons appearing in it; this reflects the structural (spatial) relations of the contents of the media-rich web page. Therefore, it is important to identify the image areas and the caption areas. Let's assume that we can detect those areas (the details of how are not relevant here). Then, we proceed with the semantics extraction from the textual content of the caption, which identifies:

Person Names = {Naoko Takahashi, Reiko Sosa},
Places = {Tokyo},
Athletics type = {Women’s Marathon} and
Activity = {runs away from} (see Figure 1, in yellow and blue color).

In the case of the semantics extraction from images, we can identify the following concepts and relationships:

In the image at the upper part of the web page, we can detect the athletes' faces and, from the spatial relationship of those faces, identify which face (athlete) takes the lead over the other. Using only the image, we cannot conclude who the athlete is.
In the image at the lower part of the web page, face detection tells us that a person is present, but we still cannot determine to whom this face belongs.

If we combine the semantics from the textual information in the captions with the semantics from the images, we give strong support to reasoning mechanisms to reach the conclusion that “we have images with the athlete Reiko Sosa”. Moreover, when there are several athletes, as in the image at the upper part of the web page, reasoning with the identified spatial relationship can determine which of the two athletes is Reiko Sosa.
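
The reasoning step described above can be sketched as a single hand-written rule. This is only an illustration: the function name, the triple format and the face identifiers are hypothetical, and the athlete names are placeholders because the exact wording of the caption is not reproduced here. The rule assumes that text analysis delivers a (subject, activity, object) triple and that image analysis delivers the detected faces ordered from leader to follower.

```python
def label_faces(caption_relation, faces_in_race_order):
    """Attach names found in the caption to faces found in the image.

    caption_relation    -- (subject, activity, object) triple from text analysis
    faces_in_race_order -- face identifiers from image analysis, leader first
    """
    subject, activity, obj = caption_relation
    if activity != "runs away from" or len(faces_in_race_order) != 2:
        return {}  # the rule does not apply, so the faces stay unlabelled
    leader, follower = faces_in_race_order
    # Assumed rule: the athlete who "runs away from" the other is the leading face.
    return {leader: subject, follower: obj}


# Placeholder names and face identifiers, for illustration only.
labels = label_faces(("Athlete A", "runs away from", "Athlete B"),
                     ["face_1", "face_2"])
print(labels)
```

Neither modality is sufficient alone: the caption supplies the names and the activity, the image supplies the spatial ordering, and only the rule that connects the two yields a labelled face.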

Figure 1 : Example of a web page about athletics

Another scenario involves multimodal analysis of audio-visual data, distributed on the web or accessed through it from video archives. It concerns the automatic semantics extraction and annotation of video scenes related to violence, for the purposes of content filtering and parental control [6]. The goal in this scenario is thus the automatic identification and semantic classification of violent content, using features extracted from the visual, auditory and textual modalities of multimedia data.

Let's consider that we are trying to automatically identify violent scenes in which two persons fight with no weapons involved. The low-level analysis will lead to different low-level descriptors for each modality. For example, for the visual modality the analysis will involve:

Shot cut detection and video segmentation.
Human body recognition and motion analysis.
Human body parts recognition (arms, legs).
Human body parts movement and tracking (e.g. “Fast horizontal hand movement”).
Interpretation of simple "visual" events/concepts based on spatial and temporal relations of identified objects (medium-level semantics).

On the other hand, the analysis of the auditory modality will involve:

Audio signal segmentation.
Segment classification in sound categories, including speech, silence, music, scream, etc. which may relate to violence events or not (medium-level semantics).

Now, by fusing the medium-level semantics and results from the single-modality analyses, taking into consideration spatio-temporal relations and behaviour patterns, we can automatically extract (infer) higher-level semantics. For example, the “punch” concept can be automatically extracted based on the initial analysis results and on the sequence or synchronicity of detected audio or visual events, such as two persons in the visual data, one moving towards the other, while a punch sound and a scream of pain are detected in the audio data.
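
A minimal sketch of such a synchronicity-based inference is given below. It assumes that the single-modality analyses have already produced labelled, time-stamped medium-level events; the event labels, the time stamps and the half-second tolerance are illustrative assumptions rather than values taken from [6].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    modality: str  # "visual" or "audio"
    label: str     # medium-level concept assigned by single-modality analysis
    start: float   # seconds from the beginning of the video
    end: float


def overlaps(a, b, tolerance=0.0):
    """True if the two events overlap in time, allowing a small tolerance."""
    return a.start <= b.end + tolerance and b.start <= a.end + tolerance


def detect_punches(events, tolerance=0.5):
    """Infer "punch" where a fast hand movement coincides with a violent sound."""
    visual = [e for e in events
              if e.modality == "visual" and e.label == "fast horizontal hand movement"]
    audio = [e for e in events
             if e.modality == "audio" and e.label in {"punch sound", "scream"}]
    return [(v.start, v.end) for v in visual
            if any(overlaps(v, a, tolerance) for a in audio)]


# Invented events, for illustration only.
events = [
    Event("visual", "fast horizontal hand movement", 12.0, 12.4),
    Event("audio", "punch sound", 12.3, 12.5),
    Event("visual", "fast horizontal hand movement", 30.0, 30.5),
    Event("audio", "music", 29.0, 35.0),
]
punches = detect_punches(events)
print(punches)
```

The second hand movement is not reported, because the only audio event that accompanies it is music; the visual evidence alone is not treated as sufficient for the higher-level concept.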

To fulfil scenarios such as the ones presented above, we have to solve the problem of how to fuse and interchange semantics from different modalities.

Possible Solutions

Example 1 (The solution)

As mentioned in Example 1, semantics extraction can be achieved via concept detectors after a training phase based on feature sets. Towards this goal, [3] recently suggested going from a low-level description to a more semantic description by extending MPEG-7 to facilitate the sharing of classifier parameters and class models. This would be done by presenting the classification process in a standardised form. A classifier description must specify what kind of data it operates on, and contain a description of the feature extraction process, the transformation that generates feature vectors and a model that associates specific feature vector values with an object class. For this, an upper ontology could be created, called a classifier ontology, which could be linked to a multimedia core ontology (e.g. the CIDOC CRM ontology), a visual descriptor ontology [4] and a domain ontology. A similar approach is followed by the method presented in [5], where classifiers are used to recognize and model music genres for efficient music retrieval, and description extensions are introduced to account for such extended functionalities.

In these respects, the current Use Case relates to some extent to the Algorithm Representation UC. However, the latter refers mainly to general-purpose processing and analysis, and not to analysis and semantics extraction based on classification and machine learning algorithms to enable intelligent retrieval.

In the proposed solution, the visual descriptor ontology consists of a superset of MPEG-7 descriptors since the existing MPEG-7 descriptors cannot always support an optimal feature set for a particular class.

A scenario that exemplifies the use of the above proposal is given in the following. Maria is an architect who wishes to retrieve available multimedia material of a particular architectural style, such as ‘Art Nouveau’, ‘Art Deco’ or ‘Modern’, from the bulk of data that she has already stored using her multimedia management software. Because of her particular interest in Art Nouveau, she plugs in the ‘Art Nouveau classifier kit’, which enables the retrieval of all images or videos that correspond to this particular style in visual form, non-visual form or a combination of both (e.g. a video exploring the house of V. Horta, a major representative of the Art Nouveau style in Brussels, which includes visual instances of the style as well as a narration about Art Nouveau history).

Necessary attributes for the classifier ontology are estimated to be:

The name and category of the Classifier
The list and types of input parameters
The output type
Limitations on data set, on value ranges for parameters, on processing time and memory requirements
Performance metrics
Guidelines of use
Links to class models per domain/application and feature sets
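
As a rough illustration, the attributes listed above could be carried by a record such as the one sketched below. This is not a proposed ontology or an MPEG-7 extension: the field names, the string-based type names, the example values and the `urn:example:` identifier are all hypothetical, and an actual classifier ontology would be expressed in an ontology language rather than as a Python class.

```python
from dataclasses import dataclass, field


@dataclass
class ClassifierDescription:
    name: str
    category: str
    input_parameters: dict  # parameter name -> expected Python type name
    output_type: str
    limitations: dict = field(default_factory=dict)          # e.g. value ranges
    performance_metrics: dict = field(default_factory=dict)
    usage_guidelines: str = ""
    class_model_links: list = field(default_factory=list)    # links to class models

    def accepts(self, parameters):
        """Check supplied parameters against the declared names and types."""
        if set(parameters) != set(self.input_parameters):
            return False
        return all(type(value).__name__ == self.input_parameters[name]
                   for name, value in parameters.items())


# Hypothetical description of the classifier kit from the scenario above.
kit = ClassifierDescription(
    name="Art Nouveau classifier kit",
    category="supervised",
    input_parameters={"image_path": "str", "threshold": "float"},
    output_type="bool",
    limitations={"threshold": (0.0, 1.0)},
    class_model_links=["urn:example:models/art-nouveau"],
)
print(kit.accepts({"image_path": "horta_house.jpg", "threshold": 0.5}))
```

A description of this kind is what would let a multimedia management tool decide, without expert involvement, whether a plugged-in classifier can be applied to the data at hand.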

In the above examples, the exchangeable bag of semantics is directly linked to an exchangeable bag of supervised classifiers.

Example 2 (The solution)

In this example, supporting reasoning mechanisms requires, apart from the ontological descriptions for each modality, a cross-modality ontological description that interconnects all possible relations from each modality and constructs rules that are cross-modality specific. It is not clear whether this can be achieved by an upper multimedia ontology (see [4]) or by a new cross-modality ontology that strives to represent knowledge about all the possible ways of combining media. It is evident, though, that the cross-modality ontology, along with the single-modality ones, relates closely to the domain ontology, i.e. to the application at hand.

Furthermore, in this new cross-modality ontology, special attention should be paid to representing the priorities/ordering among modalities for any multimodal concept (e.g. get textual semantics first in order to attach semantics to an image). This translates into the construction of sequential rules. However, there are cases where simultaneous semantic instances in different modalities may lead to a higher level of semantics, so synchronicity is also a relationship to be accounted for. Apart from the spatial, temporal or spatio-temporal relationships that need to be accounted for, there is also the issue of the importance of each modality for identifying a concept or semantic event. This may be represented by means of weights.
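
One possible way to hold ordering, synchronicity and weights in a single rule is sketched below. The class names, the convention that steps sharing an order value are simultaneous, and the weight values are assumptions made for illustration; they are not part of any existing ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityStep:
    modality: str
    weight: float  # assumed importance of this modality for the concept
    order: int     # steps sharing an order value are treated as simultaneous


@dataclass(frozen=True)
class CrossModalityRule:
    concept: str
    steps: tuple

    def schedule(self):
        """Group modalities into stages: sequential across stages, synchronous within."""
        stages = {}
        for step in self.steps:
            stages.setdefault(step.order, []).append(step.modality)
        return [sorted(stages[order]) for order in sorted(stages)]

    def normalised_weights(self):
        """Relative importance of each modality, scaled to sum to one."""
        total = sum(step.weight for step in self.steps)
        return {step.modality: step.weight / total for step in self.steps}


# Sequential rule: textual semantics first, image analysis second.
person_in_photo = CrossModalityRule(
    "person identity",
    (ModalityStep("text", 3.0, 1), ModalityStep("image", 1.0, 2)),
)
# Synchronous rule: audio and visual events are considered together.
punch = CrossModalityRule(
    "punch",
    (ModalityStep("audio", 1.0, 1), ModalityStep("visual", 1.0, 1)),
)
print(person_in_photo.schedule(), punch.schedule())
```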

The solution also involves relating the visual, audio and textual descriptor ontologies to a cross-modality ontology that showcases their inter-relations, as well as to a domain ontology representing the concepts and relations of the application at hand.

References

[1] M. Naphade and T. Huang, “Extracting semantics from audiovisual content: The final frontier in multimedia retrieval”, IEEE Transactions on Neural Networks, vol. 13, No. 4, 2002.

[2] http://www.iaaf.org

[3] M. Asbach and J-R Ohm, “Object detection and classification based on MPEG-7 descriptions – Technical study, use cases and business models”, ISO/IEC JTC1/SC29/WG11/MPEG2006/M13207, April 2006, Montreux, CH.

[4] H. Eleftherohorinou, V. Zervaki, A. Gounaris, V. Papastathis, Y. Kompatsiaris, P. Hobson, “Towards a common multimedia ontology framework (Analysis of the Contributions to Call for a Common multimedia ontology framework requirements)”, http://www.acemedia.org/aceMedia/files/multimedia_ontology/cfr/MM-Ontologies-Reqs-v1.3.pdf

[5] S. Tsekeridou, A. Kokonozi, K. Stavroglou, C. Chamzas, "MPEG-7 based Music Metadata Extensions for Traditional Greek Music Retrieval", IAPR Workshop on Multimedia Content Representation, Classification and Security, Istanbul, Turkey, September 2006

[6] T. Perperis, S. Tsekeridou, "Automatic Identification in Video Data of Dangerous to Vulnerable Groups of Users Content", presentation at SSMS2006, Halkidiki, Greece, 2006.