CHOICE@CATCH ranking of candidate terms for description of radio and TV programs

Radio and television (RTV) programs at the Dutch national broadcasting archive (Sound and Vision) are typically associated with contextual text descriptions (web site texts, subtitles, program guide texts, texts from the production process, etc.). Documentalists at Sound and Vision describe programs manually, using these context documents as their source. For this description task, they use the GTAA (Gemeenschappelijke Thesaurus Audiovisuele Archieven, Common Thesaurus for Audiovisual Archives), described in EucGtaaBrowser.

The CHOICE project (part of the Dutch CATCH research programme) uses natural language processing techniques to automatically extract candidate GTAA terms from the context documents. The application described here takes these candidate terms as input and ranks them on the basis of the structure of the GTAA thesaurus. For example, the fact that "Voting" and "Democratization" are related in GTAA by a two-step path (via the "Election" term and two "related-to" links) will positively influence the ranking of these terms. Ranked terms will be presented to documentalists to speed up their description work, as detailed in the use case description "Recommend metadata" on http://ems01.mpi.nl/usecases/

Currently the application (now a standalone Java application, later a SOAP web service) is called with a file containing the URIs of the candidate terms as its argument. It queries a Sesame web repository containing the SKOS version of the GTAA thesaurus to retrieve the 'term context' of each term in the input list, that is, for a given term, all terms that are directly connected to it by broader term, narrower term or related term relations. This term context is stored in a temporary local Sesame repository.
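The notion of 'term context' can be illustrated with a minimal sketch. It is not the application's actual code: it is written in Python rather than Java, and it replaces the Sesame repository with a small in-memory list of triples. The "gtaa:" identifiers, the sample triples and the function names are all hypothetical, and the sketch assumes that a link counts in both directions, since a broader/narrower pair may be stated only one way.

```python
# SKOS predicates treated as thesaurus links (broader, narrower, related term).
LINK_PREDICATES = {"skos:broader", "skos:narrower", "skos:related"}

# Hypothetical triples standing in for the SKOS version of GTAA.
SAMPLE_TRIPLES = [
    ("gtaa:Voting", "skos:related", "gtaa:Election"),
    ("gtaa:Election", "skos:related", "gtaa:Democratization"),
    ("gtaa:Election", "skos:broader", "gtaa:Politics"),
    ("gtaa:Voting", "skos:prefLabel", "Voting"),  # not a thesaurus link
]


def term_context(term, triples):
    """Return all terms directly linked to `term` by a thesaurus relation.

    Assumption: a link is followed in both directions.
    """
    context = set()
    for subject, predicate, obj in triples:
        if predicate not in LINK_PREDICATES:
            continue  # skip labels and other non-link statements
        if subject == term:
            context.add(obj)
        elif obj == term:
            context.add(subject)
    return context


def load_contexts(candidate_uris, triples):
    """Map every candidate URI from the input list to its term context."""
    return {uri: term_context(uri, triples) for uri in candidate_uris}
```

Calling load_contexts(["gtaa:Voting", "gtaa:Election"], SAMPLE_TRIPLES) yields one set of neighbouring terms per candidate, which plays the role of the temporary local store described above.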

The ranking currently assumes that candidate terms that are mutually connected by thesaurus relations (directly or indirectly) are more likely to be good descriptions than isolated candidate terms. Later on, it might be more interesting to differentiate between types of thesaurus relations, or to use more complex patterns of thesaurus relations in the ranking algorithm.
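The connectedness assumption can be sketched as follows. This is one possible reading, not the project's algorithm: each candidate is scored by the number of other candidates it can reach within a fixed number of thesaurus links, all relation types are treated alike, and ties are broken alphabetically. The link list, the two-step limit and all function names are assumptions made for the illustration.

```python
from collections import deque

# Hypothetical thesaurus links, treated as undirected and untyped.
THESAURUS_LINKS = [
    ("Voting", "Election"),
    ("Election", "Democratization"),
    ("Football", "Sport"),
]


def build_graph(links):
    """Build an undirected adjacency map from pairs of linked terms."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable_within(graph, start, max_steps):
    """Breadth-first search: terms at most `max_steps` links from `start`."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        term, distance = frontier.popleft()
        if distance == max_steps:
            continue  # do not expand beyond the step limit
        for neighbour in graph.get(term, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, distance + 1))
    seen.discard(start)
    return seen


def rank_candidates(candidates, graph, max_steps=2):
    """Rank candidates by how many other candidates lie within `max_steps`."""
    candidate_set = set(candidates)
    scores = {
        c: len(reachable_within(graph, c, max_steps) & candidate_set)
        for c in candidates
    }
    ranked = sorted(candidates, key=lambda c: (-scores[c], c))
    return ranked, scores
```

With the candidates "Football", "Voting" and "Democratization", the two election-related terms each reach one other candidate (via "Election") and are ranked above the isolated "Football". Differentiating between relation types, as suggested above, would amount to weighting the links instead of counting them equally.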

The thesaurus-based recommendation system can also be integrated with a recommendation system based on co-occurrences between terms used in existing descriptions of RTV programs.