LD4LT Roadmapping Workshop Dublin June 2014

From Linked Data for Language Technology Community Group

Introduction

This report summarizes the LD4LT / LIDER roadmapping workshop, held on 4 June 2014 as part of the FEISGILTT workshop, which was co-located with Localization World Dublin 2014. See the workshop program for more information.

The workshop sought feedback from the localization community on linked data and language technology. Below is a summary of each contribution and a list of key findings.

Contributions

Welcome and Introduction

Dave Lewis (TCD) and Felix Sasaki (DFKI / W3C Fellow) introduced the workshop and its goals. The localization industry so far has little experience with linked data in real use cases. The workshop brought together the right people from industry to find out: what problems could be solved by using linked data? Which use cases are needed? What hinders the adoption of linked data?

Dave and Felix also gave an introduction to the umbrella LIDER project. LIDER is building the basis for a Linguistic Linked Data cloud that can support content analytics of unstructured, multilingual, cross-media content. Localization is an important usage scenario for LIDER, hence this roadmapping workshop.

Phil Ritchie: Translation Quality and RDF

Phil Ritchie (VistaTEC) reported on experience with two technologies: ITS 2.0 and NIF. At VistaTEC, a localization workflow using quality-related ITS 2.0 information has been created. The Ocelot tool can be used to visualize such information and to correct translations. Relying on NIF, VistaTEC produced linked data representations of translations. Here the benefit of linked data lies in data integration: various kinds of information related to translation quality and other aspects can be freely integrated.
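To make the data-integration point concrete, below is a minimal sketch (not VistaTEC's actual implementation) of how a quality issue on a translated segment might be expressed with NIF and the ITS 2.0 RDF vocabulary, using Python and rdflib. All resource URIs and the example content are illustrative.

```python
# Minimal sketch: a translated segment as a NIF context, annotated with
# an ITS 2.0 localization quality issue. URIs and content are illustrative.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")

g = Graph()
g.bind("nif", NIF)
g.bind("itsrdf", ITSRDF)

# The translated segment, addressed by character offsets 0-21.
seg = URIRef("http://example.org/job42/target#char=0,21")
g.add((seg, RDF.type, NIF.Context))
g.add((seg, NIF.isString, Literal("Drücken Sie die Taste", lang="de")))
g.add((seg, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((seg, NIF.endIndex, Literal(21, datatype=XSD.nonNegativeInteger)))

# A quality issue attached to the segment, ITS 2.0 style.
g.add((seg, ITSRDF.locQualityIssueType, Literal("terminology")))
g.add((seg, ITSRDF.locQualityIssueComment,
       Literal("'Taste' does not match the approved term 'Eingabetaste'")))
g.add((seg, ITSRDF.locQualityIssueSeverity, Literal(50.0, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```

Because the segment is a URI-identified resource, terminology, provenance or review data from other sources can be attached to the same node.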

Localization has to happen faster and faster, and prices are going down. Localization companies are therefore looking for added value beyond the actual translation. The concept of intelligent content could convey such value, but creating such content requires new skills from translators and authors.

Linked data will only be used in localization if companies can do so without opening up their data: many data sources are company assets. Licensing mechanisms for linked data are therefore a key aspect of making linked data for localization happen.

Yves Savourel: Linked Data in Translation Kits

Yves Savourel (ENLASO) gave a demonstration with the Okapi toolkit, using linked data sources in a localization workflow. Linked data sources can provide several types of information to the translator: general context information, definitions, disambiguation, translation examples, etc. A usability challenge is how to present the information to the translator so that it really is helpful, speeds up the localization process and leads to better quality. The issue can be summarized as: "Too much information is no information".

A technical challenge is overlapping information. Overlap is not an issue with NIF, that is, in the RDF representation. But most localization tools work with XML data, and the standardized localization formats (XLIFF, TMX, TBX, ...) do not provide mechanisms to represent overlapping annotations.
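To illustrate why overlap is unproblematic in a stand-off RDF representation: two annotations can simply point at overlapping character ranges of the same string, something inline XML markup cannot express with well-formed nesting. The sketch below uses rdflib with NIF-style offset URIs; the example data and DBpedia links are purely illustrative.

```python
# Sketch of stand-off annotation: two annotations over overlapping
# character ranges of the same string. In inline XML this would require
# crossing elements; in RDF it is simply two resources.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")
DOC = "http://example.org/doc1#char="  # illustrative base URI

text = "New York City Council"
g = Graph()
g.bind("nif", NIF)
g.bind("itsrdf", ITSRDF)

ctx = URIRef(DOC + "0,21")
g.add((ctx, RDF.type, NIF.Context))
g.add((ctx, NIF.isString, Literal(text, lang="en")))

def span(begin, end):
    """Create one stand-off annotation resource for text[begin:end]."""
    s = URIRef(DOC + f"{begin},{end}")
    g.add((s, RDF.type, NIF.String))
    g.add((s, NIF.referenceContext, ctx))
    g.add((s, NIF.anchorOf, Literal(text[begin:end])))
    g.add((s, NIF.beginIndex, Literal(begin, datatype=XSD.nonNegativeInteger)))
    g.add((s, NIF.endIndex, Literal(end, datatype=XSD.nonNegativeInteger)))
    return s

# "New York City" and "City Council" overlap on "City" -- no problem here.
g.add((span(0, 13), ITSRDF.taIdentRef,
       URIRef("http://dbpedia.org/resource/New_York_City")))
g.add((span(9, 21), ITSRDF.taIdentRef,
       URIRef("http://dbpedia.org/resource/New_York_City_Council")))

print(g.serialize(format="turtle"))
```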

The discussion after the presentation provided additional points. If the quality of context information is unclear, it may be more of a burden than a help; the information also needs to be kept up to date. The aforementioned technical challenge (linked data not being natively supported in localization formats) could be resolved by creating standardized JSON representations on top of these formats.
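Purely as a hypothetical sketch of what such a JSON layer could look like (this is not an agreed or standardized format), stand-off annotations could be carried alongside an XLIFF segment, addressing it by identifier and character offsets so that overlapping spans remain representable:

```python
# Hypothetical JSON stand-off payload accompanying an XLIFF segment.
# Field names are invented for illustration, not a standardized format.
import json

payload = {
    "segment": {
        "id": "tu-17",                 # id of the XLIFF trans-unit
        "source": "New York City Council",
    },
    "annotations": [
        {
            "start": 0, "end": 13,     # "New York City"
            "type": "entity",
            "identRef": "http://dbpedia.org/resource/New_York_City",
            "confidence": 0.92,
        },
        {
            "start": 9, "end": 21,     # "City Council" (overlaps the span above)
            "type": "entity",
            "identRef": "http://dbpedia.org/resource/New_York_City_Council",
            "confidence": 0.83,
        },
    ],
}

print(json.dumps(payload, indent=2))
```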

David Lewis: Turning Localisation Workflow Data to Linked Data

Dave Lewis introduced the FALCON project. The aim of FALCON is to provide the basis for a "localization Web". Here, resources used during localization (e.g. terms, translations) become linkable resources. Linkable metadata in localization workflows then provides added value compared to the current "silo" approach: today, data used in localization is often stored and processed in a proprietary and non-interlinkable manner.

A localization Web can help leverage automatic language processing. For example, linked information can be used for machine translation training or for text analytics tasks. Core use cases in FALCON are listed below (a small query sketch follows the list):

  • source content internationalisation, with term extraction and translation discovery.
  • machine translation, producing consistent translations of terms and including discovery of parallel text for training.
  • translation and post-editing, including term definitions from open encyclopaedic data like Wikipedia and concordancing over a global translation memory.
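As a rough illustration of the translation discovery idea, the following sketch queries a small in-memory graph of interlinked terms for German equivalents of an English term. The data and the use of SKOS labels are invented for illustration and do not reflect FALCON's actual data model.

```python
# Rough sketch of translation discovery over a linked term graph.
# The tiny dataset and vocabulary choice (SKOS) are illustrative only.
from rdflib import Graph

data = """
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/terms/> .

ex:c1 a skos:Concept ;
    skos:prefLabel "memory"@en , "Speicher"@de ;
    skos:altLabel  "Arbeitsspeicher"@de .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Find German labels of any concept that carries the English label "memory".
q = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?de WHERE {
    ?c skos:prefLabel "memory"@en .
    ?c skos:prefLabel|skos:altLabel ?de .
    FILTER (lang(?de) = "de")
}
"""

for row in g.query(q):
    print(row.de)   # -> Speicher, Arbeitsspeicher
```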

Alan Melby: Linport and RDF

Alan Melby (Brigham Young University) reported on the Linport project. Linport is developing a format for packaging translation materials. The package will be self-contained and independent of platform and translation tool, and it will come with basic operations like splitting or merging packages.

Linport defines an XML-based format. So far the group has not looked into using RDF, and the discussion at the workshop around Linport and RDF did not lead to concrete steps in this direction.

Alan also reported on quality-related efforts, namely MQM and DQF. Harmonization efforts between the two are underway, and a joint framework would be highly desirable.

Andrejs Vasiļjevs: Terminology resources in the ecosystem of linked data services for language professionals

Andrejs Vasiljevs (Tilde) presented the TaaS project. TaaS provides a cloud-based platform for instant access to up-to-date terms, for user participation in term acquisition and sharing, and for the reuse of terminology resources. The platform allows users to automatically extract term candidates, acquire translation equivalents, and clean up user-provided resources.

TaaS also comes with import and export APIs. Export into linked data is one work area, undertaken in cooperation with the LIDER project. In the discussion after the presentation it became clear that a linked data representation of terminology information is a topic of great interest. The discussion fed into the work of the LD4LT group; see the related issue Terminology and linked data.
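As one possible shape of such an export (a sketch only, not the actual TaaS schema), a term entry could be exposed as a concept with language-tagged labels, a sourced definition, and a link to an open resource:

```python
# Sketch of a terminology entry exposed as linked data: a concept with
# labels in two languages, a sourced definition, and a link to an open
# resource. This is not the actual TaaS export schema; URIs are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, SKOS

EX = Namespace("http://example.org/termbase/")

g = Graph()
g.bind("skos", SKOS)
g.bind("dct", DCTERMS)

c = EX["concept/eu-parliament"]
g.add((c, RDF.type, SKOS.Concept))
g.add((c, SKOS.prefLabel, Literal("European Parliament", lang="en")))
g.add((c, SKOS.prefLabel, Literal("Eiropas Parlaments", lang="lv")))
g.add((c, SKOS.definition,
       Literal("Directly elected parliamentary body of the European Union.", lang="en")))
g.add((c, DCTERMS.source, URIRef("http://example.org/termbase/collection/eu-terms")))
# Link the (possibly closed) term base entry to an open resource.
g.add((c, SKOS.exactMatch, URIRef("http://dbpedia.org/resource/European_Parliament")))

print(g.serialize(format="turtle"))
```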

Ioannis Iakovidis: TermWeb and The Localisation Web

Ioannis Iakovidis (Interverbum Technology) introduced TermWeb, their SaaS terminology management solution. He described how its concept-based approach to terminology management and its implementation as a web application make integration with lexical-conceptual resources captured as linked data a natural next step in the evolution of the product.

Integration with clients' content management and localization workflows is key in deploying TermWeb. By participating in FALCON, Interverbum aims to reap the benefits of broader sharing of term bases; linking into public terminological resources, e.g. BabelNet; providing links for auditing and quality assessment of term bases; and leveraging term translations in machine translation.

Víctor Rodríguez Doncel: Towards high quality, industry-ready Linguistic Linked Licensed Data

Víctor Rodríguez Doncel (UPM) touched upon a topic that was of great interest in many discussions: licensing and linked data. For the localization community, fully open linked data may not be of high relevance. Hence, a licensing mechanism is urgently needed to foster linked data adoption.

Different licensing models have an impact on business opportunities for linked data. The Open Digital Rights Language (ODRL) provides a framework for expressing such models in a machine-readable manner. For localization, "Licensed Linguistic Linked Data (3LD)" may be of most interest: here, different licensing models can be used together, including completely open and restrictive licenses as well as completely closed datasets.
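To indicate what a machine-readable license statement could look like, here is a small ODRL-style policy attached to a dataset, again as an rdflib sketch. The policy content, the dataset and all URIs other than the ODRL namespace are invented for illustration.

```python
# Illustrative ODRL policy: permission to use and distribute a dataset,
# with a duty to attribute. Dataset, parties and policy URIs are invented.
from rdflib import BNode, Graph, Namespace
from rdflib.namespace import RDF

ODRL = Namespace("http://www.w3.org/ns/odrl/2/")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("odrl", ODRL)

policy = EX["policy/tm-license-1"]
perm = BNode()
duty = BNode()

g.add((policy, RDF.type, ODRL.Policy))
g.add((policy, ODRL.permission, perm))
g.add((perm, ODRL.target, EX["dataset/shared-translation-memory"]))
g.add((perm, ODRL.assigner, EX["org/language-service-provider-a"]))
g.add((perm, ODRL.action, ODRL.use))
g.add((perm, ODRL.action, ODRL.distribute))
g.add((perm, ODRL.duty, duty))
g.add((duty, ODRL.action, ODRL.attribute))

print(g.serialize(format="turtle"))
```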

Requirements Gathering, Use Cases and Key points of the workshop

The workshop closed with an interactive session on gathering requirements and use cases for linked data. A summary, including key points of the workshop in general, is given below.

  • Text analytics
    • There is a need for a common API to text analysis services, e.g. BabelNet, DBpedia Spotlight, Wikidata, Yahoo! Content Annotation.
    • One needs to support attribution of the source of lexical/terminological data and metadata, especially when using aggregation services such as BabelNet.
    • Resources need live updates to ensure constant improvement.
    • Users need a mechanism to feed corrections back to the annotation service and into the underlying resources.
    • JSON can be used to provide annotation metadata, e.g. as part of a common API or as a payload across different APIs.
    • One needs to be able to indicate the relevance of an annotation, e.g. confidence scores and their consistent interpretation.
    • Understanding context is key to assessing quality of annotations.
    • A stand-off annotation mechanism is needed to deal with annotation overlap. NIF could be a solution.
    • What type of CAT tool support is needed, e.g. access to definitions, access to usage examples, or use in predictive typing?
  • Licensing metadata
    • Licensing information needs to be integrated with the actual data.
    • One needs to be able to automatically combine different license terms so that they can be understood at the point of use, at the end of the value chain.
  • Localisation project metadata
    • The relationship between efforts like Linport's STS, EC's MED and RDF should be made clear.
  • Terminology information and RDF
    • There is no standard mapping of the TBX format to RDF.
    • How should terminology information be incorporated with text analysis information and a related API?
    • How should one integrate open lexicons with closed corporate term bases?
  • Bitext
    • One could expose bitext (= aligned text of a source and one or several translations) as linked data, as an alternative to TMX; a minimal sketch follows below.
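As a hypothetical sketch of that last point, one translation unit from a bitext could be exposed as a URI-identified resource with language-tagged source and target strings. The vocabulary below is invented for illustration; no standard bitext vocabulary is implied.

```python
# Hypothetical sketch: one bitext translation unit exposed as linked data
# instead of a TMX <tu> element. The ex: vocabulary is invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/bitext#")

g = Graph()
g.bind("ex", EX)

tu = EX["tu-1001"]
g.add((tu, RDF.type, EX.TranslationUnit))
g.add((tu, EX.sourceText, Literal("Press the Enter key.", lang="en")))
g.add((tu, EX.targetText, Literal("Drücken Sie die Eingabetaste.", lang="de")))
g.add((tu, EX.targetText, Literal("Appuyez sur la touche Entrée.", lang="fr")))
# Because each unit is a URI-identified resource, provenance, quality and
# terminology data can be linked to it from elsewhere in the graph.

print(g.serialize(format="turtle"))
```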