Use cases definition

From Best Practices for Multilingual Linked Open Data Community Group

[This proposed template (largely inspired by the one used by the OntoLex W3C group) is not final, so any suggestions to improve it are welcome.]

Each use case is documented with:

  • Title - a descriptive long name
  • Identifier - a short name
  • Contributor/s - who is responsible for documenting and maintaining it
  • Description - general description and motivation
  • Example - a specific example involving specific triples and ontologies, if applicable
  • Topics - that are covered by this particular use case, from the list of topics
  • Additional remarks - any other issue that is worth mentioning

Localization Workflow

Title Localization Workflow
Identifier LOCWOR
Contributor/s Dave Lewis
Description This use case describes how linguistic linked open data could be deployed in a typical modern localisation workflow. It explores how linguistic linked data can be used to link workflow provenance metadata with the language resources generated by such workflows (i.e. translation memories and terminology). Such interlinking aims to enable smaller localisation clients and service providers to systematically curate and leverage these language resources using data-driven language technologies such as statistical machine translation and named entity recognition. This approach requires combining access control over linguistic linked data within a localisation value chain with the publication of language resources as Linked Open Data.

In this usage scenario we assume that a client is responsible for managing communication from a medical research institute in the UK. The institute publishes research findings but also uses its web site to engage actively with the press where findings are reported inaccurately, and with the general public via a message forum, in relation to questions or clarifications about the real-world and health policy impact of its work.

Step 1: Localisation Project Requirements - Some recent results have been published based on work in collaboration with partner institutes in Poland, France and Sweden. In all four countries the results have raised controversy and comment in the media and online discussions. The UK institute is contacted by its partners, who do not have the same experience in dealing with such external communications. They request that the original research, the press clarifications and the interactions on the user forum be translated into Polish, French and Swedish to help them address similar concerns in their own countries.

Step 2: LSP Subcontracting and Project Set-up - The UK institute contacts a Language Service Provider (LSP) to out-source this translation task, since the LSP has handled medical web site translation in the past. The LSP project manager realises, however, that the highly specialised nature of the original research, and the uptake of that content in both general usage and government policy discussions, involves words and phrases not familiar to their usual translators. In addition, the highly perishable nature of the discussion content requires fast translation yet commands a low price expectation from the client.

The project manager loads the job into the LSP’s subscription to a Software as a Service Translation Management System (SaaS TMS). The analytics show a poor Translation Memory (TM) match against in-house TMs, so the project manager realises that targeted use of language technology (LT) will be required.

Step 3: Terminology Management - To speed the process up, the project manager runs the content through a Named Entity Recognition (NER) system integrated into a separate SaaS Terminology Management system that he had previously trained on English medical terminology available from previous clients. As this term base had been published by that client as Linked Data, the project manager is able to search for links to some of these terms translated by third parties in the target languages. Some provenance information associated with those term translations, available due to the LSP’s membership in a terminology skills exchange brokerage cooperative, also puts the project manager in touch with a terminologist who can QA those term translations, vet the NER recommendations and fill in any missing term translations.

Step 4: Harvesting Open corpora - The project manager then does a Linked Data search (via SPARQL) for parallel text in the medical and health policy domains and finds useful volumes of bi-text from the EC European Linguistic Linked Data Portal. Downloading this linguistic linked data via the SaaS TMS and combining it with the terminology now available, he generates an initial custom Statistical Machine Translation (SMT) engine, runs the research material through it and distributes the output to small teams of translators in each target language. As this is critical technical content, a translation quality assurance (QA) pass is made over the results.
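The SPARQL search in Step 4 could look roughly like the query built by the sketch below. This is only an illustration under stated assumptions: the scenario does not say how the portal describes its datasets, so the sketch assumes DCAT-style metadata (dcat:Dataset, dcat:keyword, dct:title, dct:language with plain language-code literals). The function name build_bitext_query is hypothetical, and no endpoint is contacted.

```python
def build_bitext_query(source_lang, target_lang, keywords):
    """Build a SPARQL query string for datasets covering two languages.

    Assumption: datasets are described with DCAT and Dublin Core terms and
    language codes are plain literals. A real portal may model this differently.
    Inputs are inserted verbatim, so they must come from a trusted source.
    """
    keyword_list = ", ".join('"' + k.lower() + '"' for k in keywords)
    return "\n".join([
        "PREFIX dcat: <http://www.w3.org/ns/dcat#>",
        "PREFIX dct: <http://purl.org/dc/terms/>",
        "SELECT ?dataset ?title WHERE {",
        "  ?dataset a dcat:Dataset ;",
        "           dct:title ?title ;",
        f'           dct:language "{source_lang}", "{target_lang}" ;',
        "           dcat:keyword ?keyword .",
        f"  FILTER(LCASE(STR(?keyword)) IN ({keyword_list}))",
        "}",
    ])


# Example: English-Polish bi-text in the domains mentioned in the scenario.
query = build_bitext_query("en", "pl", ["medicine", "health policy"])
print(query)
```

The same function would be called once per target language (Polish, French, Swedish), and the resulting query text sent to whatever SPARQL endpoint the portal exposes.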

Step 5: Active Curation to retrain SMT engine - Before processing the associated press releases and posted clarifications, the manager retrains the SMT engines with selected high quality translations, and the NER engine with the output of the terminologist. Using the analytical tools the manager notices an appreciable improvement in post-editing (PE) time, which he confirms by running some automated MT metrics using the new SMT engine. A small sample of QA on the press related materials shows that the shift in terminology for the press-related content is not well supported by the existing term base or caught by the NER. However, the translators working for the LSP also receive small incentives to mark phrases that they frequently come across as being poorly translated, so after a few days of this work the resulting terms are used to retrain the NER engine, alongside another retraining of the SMT engine using selected post-edits.

Step 6: Localisation process analytics - Subsequent analysis of PE time and SMT metrics shows that both the press content and the user posts now being translated on an on-going basis are achieving good enough automated translations that a lighter post-editing regime with less experienced translators can be used for press related material and user posts, enabling reasonable profitability to be maintained on these lower value translations.

Step 7: Publication of Linguistic Linked Data generated by translation project - The client is pleased with the results as reported back by his European colleagues. In the project summary report, the client notices the attribution to open Linguistic Linked Data sources in the execution of the project. When the client asks about this, the LSP explains the benefits of leveraging open Linguistic Linked Data and encourages the client to publish the translation memory and term bases from the project, as it may encourage other LSPs and their clients to reciprocate in the future. When the client publishes this data, the LSP annotates it with process quality data which it makes available to selected LSPs with whom it partners on bigger projects.

Step 8: Third Party Interlink and Reuse of Public Linguistic Linked Data - The client’s Swedish partner is impressed with the term translations resulting from the project, which it can also access via open Linguistic Linked Data, and initiates a project with the national health board to further check and QA these translations and then to include them in a new English-Swedish term base being assembled and published by the local health authorities. The health board is increasingly willing to undertake the cost of such exercises as it sees evidence of the benefits of public term translations published as L3Data in translation projects across its many public facing functions.

Topics covered Localization information, Interlanguage links, Multilingual parallel texts, Open vs. non-open Data
Additional Remarks Taken from a scenario prepared for the FALCON STREP planned to start under the FP7 SME-DCL call in October 2013. Prepared in collaboration with CNGL at Trinity College Dublin and Dublin City University, with input from XTM International, Interverbum Technologies and SKAWA Innovation. The scenario assumes the use of the XLIFF, ITS 2.0, NIF and PROV-O interoperability specifications between the systems mentioned.

Lexicalisation of RDF Datasets

Title Lexicalisation of (Multilingual) RDF Datasets
Identifier LEXRDF
Contributor/s Elena Montiel, Gordon Dunsire, [others]
Description This use case describes how the set of common best practices proposed by this group can be used to select the most appropriate representational option to represent textual or linguistic information (in one or multiple languages) associated with RDF datasets published as Linked Data.
Example To illustrate this use case, we will use the example of the ISBD standard (International Standard for Bibliographic Description).

The ISBD initially contained English labels for classes and properties as rdfs:label properties. With the aim of reusing this dataset in the National Library of Spain (BNE in Spanish), labels for classes and properties were translated into Spanish, resulting in a multilingual vocabulary in English and Spanish. For this specific case, the publishers decided to rely on the SKOS annotation property for preferred labels (skos:prefLabel), and agreed on the use of only one preferred label per language.

However, the Spanish translation of this vocabulary revealed a problem which was not apparent in the English version, namely, that some labels were adjectives (e.g. "cartographic" in English), which in Spanish require a form change depending on whether the word they modify is masculine ("cartográfico") or feminine ("cartográfica").

Because of the agreed restriction, compounds such as "cartográfico/a" were suggested (skos:prefLabel "cartográfico/a"@es), which have some problems, such as the fact that these compounds would not naturally appear in free texts.

In this case, the association of the ISBD dataset with an external ontology-lexicon model (e.g., LexInfo, LIR, lemon) could have been suggested. This would have allowed both Spanish forms of the adjective, the masculine and the feminine, to be included by linking them to that property in the ontology by means of a LexicalEntry with two LexicalForms (masculine and feminine).
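To make the contrast concrete, the following is a minimal sketch of the "one entry, two forms" idea. It does not use the actual lemon, LexInfo or LIR vocabularies: it models the structure with plain Python classes whose names loosely follow the LexicalEntry and LexicalForm terms used above. The identifier isbd:cartographic is a hypothetical placeholder, not a real ISBD element IRI.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalForm:
    written_rep: str  # the surface string, e.g. "cartográfica"
    lang: str         # language tag, e.g. "es"
    gender: str       # "masculine" or "feminine"


@dataclass
class LexicalEntry:
    reference: str                      # ontology entity this entry lexicalises
    forms: list = field(default_factory=list)

    def form_for(self, lang, gender):
        """Return the written form matching language and gender, or None."""
        for form in self.forms:
            if form.lang == lang and form.gender == gender:
                return form.written_rep
        return None


# Hypothetical identifier for the ISBD element labelled "cartographic".
entry = LexicalEntry(
    reference="isbd:cartographic",
    forms=[
        LexicalForm("cartográfico", "es", "masculine"),
        LexicalForm("cartográfica", "es", "feminine"),
    ],
)

print(entry.form_for("es", "feminine"))
```

With this structure an application can pick the form that agrees with the noun being modified, which a single label such as "cartográfico/a" cannot express.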

We expect the best practices to help a publisher in determining under which conditions a simple lexicalisation mechanism (e.g., rdfs:label) is preferred over richer approaches (e.g., lemon lexicon) and vice versa.

Topics covered 2. Textual information

2.4 Lexicalization and linguistic information

Additional Remarks

Ontology Localisation

Title Ontology Localisation
Identifier ONTLOC
Contributor/s Elena Montiel, Guadalupe Aguado, Gordon Dunsire, [others]
Description This use case describes how the set of common best practices proposed by this group can be used to select the most appropriate representational option and available tools for the localization (or translation) of RDF datasets to multiple languages.

We understand ontology localization as the process of adapting an ontology to the needs of a particular (linguistic and cultural) community. A localized ontology can be understood as an ontology adapted to the target community and language, and used independently of the original ontology, or, most commonly, as an ontology in which the vocabulary or TBox has been translated to one or several natural languages, so that it contains terms in several languages for describing classes and properties. When extrapolating this to the linked data context, if the vocabulary publisher reuses an available vocabulary and decides to translate or localize the vocabulary terms into other languages, this could be understood as vocabulary localization and the result would be a multilingual RDF vocabulary.

Example Let us take the example of the ISBD standard, already introduced in the previous use case, namely, Lexicalisation of (Multilingual) RDF Datasets.

The ISBD standard contained labels in English as rdfs:label properties. Since the publishers' linguistic needs were restricted to the use of "only one preferred label" in Spanish, they could have simply added additional rdfs:labels in Spanish and specified the language tag (@es). However, they decided to use the SKOS annotation property for preferred labels and agreed on the use of only one preferred label per language.

This approach is known in the literature as the "Multilingual labelling approach", which means that alternative labelling information is attached to a certain data structure in the form of literals represented as properties of concepts. This approach can be followed whenever the conceptual or data structure covers the needs of the target language (and culture) to which the model is localised.
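The multilingual labelling approach, together with the "one preferred label per language" agreement described above, can be sketched as follows. This is an illustrative model only: labels are held in a plain dictionary keyed by language tag rather than in an RDF store, and the function names add_pref_label and pref_label are hypothetical.

```python
def add_pref_label(labels, lang, text):
    """Attach a preferred label, allowing at most one per language tag."""
    if lang in labels:
        raise ValueError(f"a preferred label for '{lang}' already exists")
    labels[lang] = text


def pref_label(labels, lang, fallback="en"):
    """Return the label for lang, falling back to another language if absent."""
    return labels.get(lang, labels.get(fallback))


# Labels for one vocabulary element, using the adjective discussed earlier.
labels = {}
add_pref_label(labels, "en", "cartographic")
add_pref_label(labels, "es", "cartográfico/a")

print(pref_label(labels, "es"))
print(pref_label(labels, "fr"))  # no French label, so the English one is used
```

A second Spanish label, for instance "cartográfica", is rejected by add_pref_label. That restriction is exactly what led to the compound label discussed in the previous use case.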

As for available ontology localisation tools, publishers could use LabelTranslator or the ontology translation component developed in the Monnet project (specifically tuned for the financial domain) to (semi-)automatically translate the vocabulary. In this specific case, the publishers decided to localise the ISBD standard into Spanish manually.

Topics covered 4 Ontologies and vocabularies

4.2 Localization of existing vocabularies

Additional Remarks

Crosslingual Linked Data Matching

Title Crosslingual Linked Data Matching
Identifier CLMATCH
Contributor/s Jorge Gracia, [others]
Description This use case will focus on the best practices required for the tasks of discovering, representing, storing, and consuming cross-lingual links among semantic information on the Web. The problem of cross-lingual linking is a fundamental one, since more and more legacy data sources available in different natural languages are being transformed into linked data, and they have to be linked in order to be exploited to their full potential.
Example Let us imagine that we want to represent a "translation" relation between the labels associated with two different ontology entities documented in different languages [TODO: particularise it to a concrete example]. The simplest solution (although one with many implications) is to establish an "owl:sameAs" link between the entities. Another possibility is to use other relations at the conceptual level in case of granularity discrepancies ("rdfs:subClassOf" or "skos:narrower", for instance). Further, the labels could be substituted or extended by lexical entries in an external lexicon, so that a property "translation" could be explicitly defined to link the corresponding lexical entries.

Some questions arise that a set of best practices could help to solve:

  • Under which conditions is it preferable to use "owl:sameAs" over more complex approaches, and vice versa?
  • Does it make sense to localise the standard meta-vocabularies into different languages, so that we could, for instance, associate the label "más estrecho"@es in Spanish with skos:narrower?
  • How can we handle multiple possible translations of a label (especially for multiword labels)? To this end, how can we capture the status of a suggested translation link and by whom it was created, and how could such a link come to be regarded as authoritative?
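The third question above can be made more tangible with a small sketch of a link record that carries its own status and creator. Everything here is an assumption for illustration: the status values ("suggested", "reviewed", "authoritative"), the entity identifiers with the ex-en: and ex-es: prefixes, and the creator names are hypothetical and are not taken from any existing vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationLink:
    source: str      # entity or lexical entry in the source language
    target: str      # entity or lexical entry in the target language
    relation: str    # e.g. "owl:sameAs", "skos:narrower", or "translation"
    created_by: str  # who proposed the link
    status: str      # hypothetical: "suggested", "reviewed", "authoritative"


# Hypothetical ordering of statuses, from least to most trusted.
STATUS_RANK = {"suggested": 0, "reviewed": 1, "authoritative": 2}


def best_link(links, source):
    """Return the most trusted link for a source entity, or None."""
    candidates = [link for link in links if link.source == source]
    if not candidates:
        return None
    return max(candidates, key=lambda link: STATUS_RANK[link.status])


links = [
    TranslationLink("ex-en:Bank", "ex-es:Banco", "translation",
                    "automatic-matcher", "suggested"),
    TranslationLink("ex-en:Bank", "ex-es:EntidadBancaria", "skos:narrower",
                    "terminologist-1", "reviewed"),
]

chosen = best_link(links, "ex-en:Bank")
print(chosen.target, chosen.relation, chosen.status)
```

Keeping all candidate links and ranking them at query time, rather than overwriting earlier suggestions, is one possible way to let several translations of the same label coexist while still exposing a preferred one.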


Topics covered

3.1 Interlanguage links

3.2 Use of owl:sameAs

5.1 Benchmarks

Nevertheless, all the other topics (naming and dereferencing, textual information, ontologies and vocabularies, etc.) can have an impact on the choice of the particular linking and representation techniques to be used.

Additional Remarks

Machine Translation

Title Machine Translation
Identifier MT
Contributor/s Timm Heuss
Description A Machine Translation (MT) task follows certain rules, but it also relies on world or expert knowledge. A conventional solution in the field is the creation of dedicated vendor-specific dictionaries. In contrast to this, this use case will clarify the role of multilingual Linked Open Data in Machine Translation tasks, especially in utilizing multilingual Linked Open Data as a source of world knowledge to enable or enhance a translation. Just by reusing information that is already available, datasets from DBpedia and others could possibly replace those dedicated dictionaries.

In particular, linked open data could be used as a source of new translations that are attached to a semantic concept, opening key avenues for linked open data to improve machine translation. Firstly, linked data could be used as a source of translations for out-of-vocabulary (OOV) terms in machine translation by searching for translations on the web. Secondly, linked open data could be used as a source of semantic information: by exploiting the Giant Global Graph, MT systems could disambiguate translations and potentially perform deeper semantic analysis.
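The first avenue, looking up an OOV term through labels attached to a shared concept, can be sketched as follows. The data is a tiny hand-written stand-in for multilingual labels, not an extract from DBpedia or any other dataset, and the concept identifiers with the ex: prefix as well as the function name translate_oov are hypothetical.

```python
# Toy stand-in for multilingual labels keyed by concept identifier.
TOY_LABELS = {
    "ex:AtrioventricularNode": {
        "en": "atrioventricular node",
        "de": "Atrioventrikularknoten",
    },
    "ex:WordProcessor": {
        "en": "word processor",
        "de": "Textverarbeitung",
    },
}


def translate_oov(term, source_lang, target_lang, labels):
    """Find target-language labels of concepts whose source label matches term.

    Returns a list of (concept, translation) pairs so that the caller can
    disambiguate when several concepts share the same source label.
    """
    matches = []
    for concept, by_lang in labels.items():
        source_label = by_lang.get(source_lang, "")
        if source_label.lower() == term.lower() and target_lang in by_lang:
            matches.append((concept, by_lang[target_lang]))
    return matches


print(translate_oov("Atrioventricular node", "en", "de", TOY_LABELS))
```

Returning the concept identifier alongside each candidate translation is what connects this lookup to the second avenue: an MT system could inspect the concept's other properties to decide which candidate fits the context.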

Example Technical terms

All areas of applied and theoretical science have to deal with very special phrases. Consider the field of life science / medicine, with words like "atrioventricular". In these very specific areas, a translation mechanism would benefit from a scientific community that can contribute and maintain vocabularies on its own.

Proper names

Proper names follow their own spelling and translation laws. Consider a sentence like "Pages by Apple is a word processor like Word by MS", which is usually hard to translate, because "Pages", "Apple", and "Word" are product proper names as well as English nouns. World knowledge, e.g. from DBpedia, could help improve a translation significantly.

Very individual

Compared with the previous product name example, proper names of persons are even harder. Consider Asian proper names, which have a very specific meaning; some people might even prefer an additional western-sounding forename. Multilingual Linked Open Data, such as FOAF descriptions, could make even a very individual translation possible.

Evolution of language

Natural language is constantly changing. New words (e.g. "Brangelina") and abbreviations (e.g. "LOL") emerge, and MT has to deal with them. Changing and maintaining a dedicated dictionary, however, is costly and time consuming, and might even depend on a single vendor. Distributed LOD dictionaries could open up and speed up that process considerably.

Topics covered 2 Textual information

2.1 Labels with language tag

2.4 Lexicalizations and linguistic information

5 Quality of MLOD

5.2 Big multilingual datasets

5.3 Multilingual parallel texts

6 Tools and examples of MLOD

Additional Remarks The examples for Machine Translation in this use case might also apply to other domains of Natural Language Processing (NLP). If you can translate it, you can also correct or suggest it.

Some of these issues also appear in the LOCWOR use case in the application of MT to localization workflows.

Application Localization

Title Application Localization
Contributor/s John McCrae
Description Many applications are partly or fully generated from ontology descriptions of their domain.

In particular, this often concerns applications that display large amounts of data; in this context, localization of the application is the same as localization of the data.

Example Accounting makes use of ontologies (XBRL) to capture figures used in company reports. The multilinguality of the labels used here allows balance sheets to be automatically localized into many languages.
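As a minimal sketch of this idea, the snippet below renders the same figures with labels in different languages. The concept identifiers with the ex: prefix, the labels and the amounts are made up for illustration and do not come from any actual XBRL taxonomy or company report; render_balance_sheet is a hypothetical helper.

```python
# Hypothetical concepts with multilingual labels (not real XBRL elements).
LABELS = {
    "ex:Assets": {"en": "Assets", "es": "Activos"},
    "ex:Liabilities": {"en": "Liabilities", "es": "Pasivos"},
}

# Illustrative figures only.
FIGURES = {"ex:Assets": 1000, "ex:Liabilities": 400}


def render_balance_sheet(figures, labels, lang, fallback="en"):
    """Return one display line per figure, labelled in the requested language."""
    lines = []
    for concept, amount in figures.items():
        by_lang = labels[concept]
        # Use the fallback language when no label exists for lang.
        label = by_lang.get(lang, by_lang[fallback])
        lines.append(f"{label}: {amount}")
    return lines


for line in render_balance_sheet(FIGURES, LABELS, "es"):
    print(line)
```

Because the figures are keyed by concept rather than by display text, adding a new language only requires adding labels to the ontology; the application code stays unchanged.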

Topics covered

2. Textual Information

4. Ontologies and Vocabularies

5. Quality of MLOD

Additional Remarks