Thoughts from Susanne Boll
I (Susanne) am not sure whether this is the intended use case that was suggested, but here is what I would write:
More than a decade of research in multimedia analysis has produced methods to extract many different valuable features from multimedia content. For photos, for example, these include color histograms, edge detection, brightness, texture, and so on. With MPEG-7, a very large standard has been developed that allows these features to be described in a standardized metadata format, so that content and metadata can be exchanged with other applications.
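As a minimal sketch of what such a low-level feature looks like, the following toy function computes a coarse color histogram from raw RGB pixel values. It is purely illustrative (real systems use image libraries and the bin layout here is an invented simplification), but it shows the kind of automatically computable feature the text refers to:

```python
from collections import Counter

def coarse_color_histogram(pixels, bins_per_channel=4):
    """Quantize each RGB channel into `bins_per_channel` bins and
    count how many pixels fall into each (r, g, b) bin.
    A toy stand-in for the color-histogram features mentioned above."""
    step = 256 // bins_per_channel
    hist = Counter()
    for r, g, b in pixels:
        hist[(r // step, g // step, b // step)] += 1
    return hist

# A tiny "image": two reddish pixels and one bluish pixel.
pixels = [(250, 10, 10), (240, 20, 5), (10, 10, 250)]
print(coarse_color_histogram(pixels))
# → Counter({(3, 0, 0): 2, (0, 0, 3): 1})
```

Such histograms can be computed fully automatically, which is exactly why they are attractive as a baseline description, and why a shared format for exchanging them matters.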
However, both the size of the standard and its many optional attributes have led to a situation in which MPEG-7 is used only in very specific applications and has not become a widely accepted standard for attaching (some) metadata to a media item. Especially in the area of personal media, much as in the tagging scenario, a small but comprehensive, shareable, and exchangeable description scheme is missing.
Consider a family, the Millers, who travel to Italy. Equipped with a nice camera and a GPS receiver, the family takes photos of the family members as well as of the sightseeing spots they visit. At the end of the trip, the pictures are uploaded to a photo annotation tool P1, which both extracts some features and suggests annotations (tags), and also allows for personal descriptions. But Aunt Mary Meyr would like to incorporate some of the pictures of her nieces and nephews into her own photo management system P2. That system imports the photos but loses most of the annotations and metadata that the Millers have already acquired.
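The lossy import in this scenario can be sketched in a few lines. All field names, values, and P2's "known fields" below are invented for illustration; the point is only that any field outside the importing system's schema is silently dropped:

```python
# P1's hypothetical rich export for one photo (field names are invented).
p1_export = {
    "title": "Colosseum at sunset",
    "gps": (41.8902, 12.4922),
    "tags": ["Rome", "Colosseum", "family"],
    "color_histogram": [0.4, 0.3, 0.3],
    "person_regions": [{"name": "Mary", "bbox": (10, 20, 50, 80)}],
}

# P2's schema covers far less than P1's export.
P2_KNOWN_FIELDS = {"title", "tags"}

def import_into_p2(record):
    """Keep only the fields P2 understands; report what was lost."""
    kept = {k: v for k, v in record.items() if k in P2_KNOWN_FIELDS}
    lost = set(record) - P2_KNOWN_FIELDS
    return kept, lost

kept, lost = import_into_p2(p1_export)
print("kept:", sorted(kept))  # kept: ['tags', 'title']
print("lost:", sorted(lost))  # lost: ['color_histogram', 'gps', 'person_regions']
```

A small, agreed-upon description scheme for personal media would shrink the `lost` set to (nearly) empty, which is precisely what the scenario argues for.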
Original thoughts from Ioannis Pratikakis
For multimedia document retrieval, using only low-level features, as in "retrieval by example", has the advantage that the required features can be computed automatically; on the other hand, it is inadequate for answering high-level queries. For this, an abstraction to a high-level description of the multimedia content is required. In particular, the MPEG-7 standard, which provides metadata descriptors for the structural and low-level aspects of multimedia documents, needs to be properly linked to domain-specific ontologies that model high-level semantics. This makes semantic interoperability a central concern.
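The gap between low-level descriptors and high-level concepts can be made concrete with a small sketch. The feature names, thresholds, and concept labels below are invented assumptions; in a real system this bridge would be expressed in a domain ontology (e.g. in OWL) rather than hard-coded rules:

```python
# Hypothetical bridge from low-level descriptor values to domain concepts.
# Features, thresholds, and concept names are illustrative only.
def classify_scene(dominant_hue, edge_density):
    """Map two toy low-level features onto a high-level scene concept."""
    if dominant_hue == "blue" and edge_density < 0.2:
        return "Seascape"   # smooth, blue-dominated images
    if dominant_hue == "green":
        return "Landscape"  # green-dominated images
    return "Unknown"        # no rule fires

print(classify_scene("blue", 0.1))   # → Seascape
print(classify_scene("green", 0.5))  # → Landscape
```

A query like "show me seascapes" can only be answered once such a mapping exists, which is the linking step the text calls for.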
Furthermore, since cross-linking between different media types or corresponding modalities provides rich scope for inferring a semantic interpretation, interoperability between different single-media schemes is an important issue. This is because each single modality (i) can support the inference of particular high-level semantics with different degrees of confidence, (ii) can be supported by a world model (or ontologies) in which different relationships exist, e.g., in an image one can express spatial relationships, while in a video sequence spatio-temporal relationships can be captured, and (iii) can play a different role in a cross-modal setting, i.e., one modality triggers the other: for example, to identify that a particular photo on a Web page depicts person X, we first extract information from the text and then cross-validate it against the information extracted from the image.
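The triggering-and-cross-validation pattern in (iii) can be sketched as follows. The confidence values, threshold, and the averaging rule are all illustrative assumptions, not a prescribed method:

```python
# Toy sketch of cross-modal validation: text analysis triggers the image
# modality, and the two confidences are combined. Numbers, the threshold,
# and the simple averaging rule are invented for illustration.
def identify_person(text_conf, image_conf, threshold=0.5):
    """Return (label, confidence) for "photo depicts person X"."""
    if text_conf < threshold:
        return None, 0.0  # text evidence too weak; image check not triggered
    combined = (text_conf + image_conf) / 2
    if combined >= threshold:
        return "person X", combined
    return None, combined  # image evidence failed to confirm the text

print(identify_person(0.75, 0.25))  # → ('person X', 0.5)
print(identify_person(0.25, 0.75))  # → (None, 0.0): text never triggered
```

The asymmetry of the two calls shows why the trigger order matters: the same pair of confidences yields different outcomes depending on which modality leads.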