Converting TBX to RDF

From Best Practices for Multilingual Linked Open Data Community Group

Introduction

--John McCrae (talk) 12:41, 10 July 2015 (UTC) DO NOT EDIT: This has moved to https://github.com/bpmlod/report/blob/gh-pages/multilingual-terminologies/index.html

This document provides guidelines for converting terminologies represented in the TermBase eXchange (TBX) format into the Resource Description Framework (RDF). TBX is an open standard published by the Localization Industry Standards Association (LISA) and is identical to ISO 30042 (see References). The document describes the vocabularies recommended for the conversion and the structure of the resulting RDF. It builds on standard W3C vocabularies and on other vocabularies that are currently in the process of standardization. The conversion has been implemented as a software package that anyone can use (the tbx2rdf project, see References).

Selection of vocabularies

The following table provides an overview of the vocabularies used in the conversion. Most vocabularies are W3C recommendations or near standards developed by a working group.


Model                      Prefix   Model reference URL
lemon-ontolex              ontolex  http://www.w3.org/ns/lemon/ontolex#
SKOS                       skos     http://www.w3.org/2004/02/skos/core#
RDF Schema                 rdfs     http://www.w3.org/2000/01/rdf-schema#
DCAT                       dcat     http://www.w3.org/ns/dcat#
VoID                       void     http://rdfs.org/ns/void#
PROV-O: The PROV Ontology  prov     http://www.w3.org/ns/prov#
LIDER TBX Ontology         tbx      http://lider-project.eu/tbx#
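
As a rough illustration of how such prefixes are used, the following Python sketch expands a prefixed name into a full IRI. The names PREFIXES and expand are hypothetical and exist only for this example; the namespace URLs are copied from the table above.

```python
# Namespace URLs copied from the vocabulary table; the helper is illustrative only.
PREFIXES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "void": "http://rdfs.org/ns/void#",
    "prov": "http://www.w3.org/ns/prov#",
}


def expand(prefixed_name):
    """Expand a prefixed name such as 'skos:Concept' into a full IRI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local


print(expand("skos:Concept"))
```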

We have chosen the lemon-ontolex vocabulary as the backbone of the conversion of TBX into RDF. lemon-ontolex is a model for representing lexical information relative to ontologies and for linking lexicons and machine-readable dictionaries to the Semantic Web and the Linked Data cloud. The vocabulary is under discussion in the Ontology-Lexicon Community Group, which is in the process of issuing the final specification of the model.

Technical Description of the Conversion

The TBX Data Model

The following figure summarizes the TBX Data Model as a UML diagram:

TBX datamodel as UML diagram (simplified)
  • TBX Resource: A TBX resource essentially represents a collection of terminological concepts, which are represented as XML elements of type termEntry and have a unique ID. Each terminological concept is described by a set of properties, such as the subject field it belongs to.
  • Terminological Concept (term entry): represents a language-independent concept. Each terminological concept is associated with a LangSet (see below), which can be seen as a set of language-specific terms that express the terminological concept in question.
  • LangSet: A LangSet is a language-specific container for all the terms that lexicalize a terminological concept in a given language. The LangSet contains simple terms, for which no decomposition is provided (TIG), as well as complex terms, for which decomposition information is provided (NTIG).
  • Term Information Group (TIG): represents a language-specific term for which no decomposition information is provided.
  • Nesting Term Information Group (NTIG): represents a language-specific term for which decomposition information is provided.
  • TermGrp: contains information about a language-specific term, including its morphosyntactic properties; there is one TermGrp for each TIG and NTIG.
  • TermCompList: represents the decomposition of a term.
  • TermCompGrp: represents one component of a term and its morphosyntactic properties.
  • DescrGrp: describes properties of a particular term, in particular different surface forms, or describes contexts that document the usage of the term.
  • TransacGrp/Transaction: contains information about a transaction that led to the creation or modification of a term.

For a full specification of the TBX data model, please refer to the TBX DTD.

Mapping the TBX Data Model to the ontolex-lemon model

The main data elements described above have been mapped into RDF using the above mentioned vocabularies as follows:

  • TBX Resource: has no dedicated element of its own; the dataset as a whole represents the TBX resource. It is typed as a void:Dataset, to which provenance and licensing information can be attached.
  • Terminological Concept: is represented as a skos:Concept. The Simple Knowledge Organization System (SKOS) is a vocabulary for representing knowledge organization systems (KOS) such as thesauri, classification schemes, subject headings and taxonomies in RDF.

The fundamental elements of a SKOS vocabulary are concepts, defined as units of thought, ideas, meanings, or (categories of) objects and events, which underlie many knowledge organization systems. As terminologies can be seen as a special case of a knowledge organization system, using SKOS concepts to represent terminological concepts seems appropriate.

  • LangSet: A LangSet is not represented as such in the data. Instead, one ontolex:Lexicon is created for each language for which a LangSet is defined. All the terms for a given language belong to the corresponding language-specific ontolex:Lexicon.
  • TIG/NTIG: are represented as ontolex:LexicalEntry. No distinction is made between terms with decomposition and terms without decomposition; if no decomposition information is available, it is simply omitted. In that sense the representation is monotonic, as the decomposition information can be added later.
  • TermGrp: the information about the morphosyntactic properties of a term is attached to the corresponding ontolex:LexicalEntry. The string enclosed in <term> </term> is assumed to be the ontolex:canonicalForm of the lexical entry in question.
  • TermCompList: the decomposition of a term is represented using the decomp vocabulary of ontolex, creating a decomp:Component and a corresponding ontolex:LexicalEntry for each component.
  • TermCompGrp: the morphosyntactic properties of a component are attached to the lexical entry that is identified (through decomp:correspondsTo) with the component in question.
  • DescrGrp: descriptions of the term or context are mapped to appropriate properties of the lexical entry or the context.
  • TransacGrp/Transaction: a transaction that creates or modifies the term is mapped to a tbx:Transaction (a subclass of prov:Activity), to which provenance metadata is attached. The activity is related to the responsible person or agent through prov:wasAssociatedWith. Note that in PROV-O, prov:wasGeneratedBy relates an entity to the activity that generated it rather than to an agent; in the example below, the lexical entry is linked to its transaction via tbx:transaction.
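
The mapping in the list above can be summarized as a simple lookup table. The following Python sketch is only an illustration: the dictionary and function names are hypothetical, the XML element names (termEntry, langSet, tig, ntig, termCompGrp, transacGrp) are those used in the TBX example further below, and the actual converter may organize this mapping differently.

```python
# TBX element name -> RDF class, following the mapping list above.
# Note: one ontolex:Lexicon is created per language, not per langSet element.
TBX_TO_RDF_CLASS = {
    "termEntry": "skos:Concept",
    "langSet": "ontolex:Lexicon",
    "tig": "ontolex:LexicalEntry",
    "ntig": "ontolex:LexicalEntry",
    "termCompGrp": "decomp:Component",
    "transacGrp": "tbx:Transaction",
}


def rdf_class_for(tag):
    """Return the RDF class for a TBX element name, or None if it has no class of its own."""
    return TBX_TO_RDF_CLASS.get(tag)


for tag in ("termEntry", "tig", "ntig", "termGrp"):
    print(tag, "->", rdf_class_for(tag))
```

Elements such as termGrp return None here because, according to the list above, their information is attached to the lexical entry rather than represented as a separate resource.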

Transformation by example

In this section we illustrate the transformation with a real running example taken from the IATE terminology.

<martif type="TBX-Default" xml:lang="en">
  <martifHeader>
    <fileDesc>
      <sourceDesc>
        <p>This is an excerpt of a TBX file downloaded from the IATE website. Address any enquiries to iate@cdt.europa.eu.</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <p type="XCSURI">TBXXCS.xcs</p>
    </encodingDesc>
  </martifHeader>
  <text>
    <body>
      <termEntry id="IATE-84">
        <descripGrp>
          <descrip type="subjectField">1011</descrip>
        </descripGrp>
        <langSet xml:lang="en">
          <tig>
            <term>competence of the Member States</term>
            <termNote type="termType">fullForm</termNote>
            <descrip type="reliabilityCode">3</descrip>
          </tig>
        </langSet>
        <langSet xml:lang="de">
          <ntig>
            <termGrp>
              <term>Zuständigkeit der Mitgliedstaaten</term>
              <termNote type="termType">fullForm</termNote>
              <descrip type="reliabilityCode">3</descrip>
              <termCompList type="lemma">
                <termCompGrp>
                  <termComp>Zuständigkeit</termComp>
                  <termNote type="partOfSpeech">noun</termNote>
                  <termNote type="grammaticalNumber">singular</termNote>
                </termCompGrp>
                <termCompGrp>
                  <termComp>der</termComp>
                  <termNote type="partOfSpeech">other</termNote>
                </termCompGrp>
                <termCompGrp>
                  <termComp>Mitgliedstaat</termComp>
                  <termNote type="partOfSpeech">noun</termNote>
                  <termNote type="grammaticalNumber">plural</termNote>
                </termCompGrp>
              </termCompList>
              <admin type="status">approved</admin>
              <transacGrp>
                <transac type="transactionType">origination</transac>
                <transacNote type="responsibility">PC</transacNote>
                <date>2014-05-08</date>
              </transacGrp>
            </termGrp>
          </ntig>
        </langSet>
      </termEntry>
    </body>
  </text>
</martif>
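
To make the structure of the example concrete, the following Python sketch reads an abbreviated copy of it with the standard library and lists the terms found per language. This is not the converter's code (which is written in Java); the names SAMPLE and terms_by_language are hypothetical, and the sample is shortened to the elements needed here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Abbreviated copy of the example above (header, notes and components omitted).
SAMPLE = """<martif type="TBX-Default" xml:lang="en">
  <text>
    <body>
      <termEntry id="IATE-84">
        <langSet xml:lang="en">
          <tig><term>competence of the Member States</term></tig>
        </langSet>
        <langSet xml:lang="de">
          <ntig><termGrp><term>Zuständigkeit der Mitgliedstaaten</term></termGrp></ntig>
        </langSet>
      </termEntry>
    </body>
  </text>
</martif>"""


def terms_by_language(tbx_xml):
    """Return {language: [(group_type, term_text), ...]} for every langSet."""
    root = ET.fromstring(tbx_xml)
    result = {}
    for lang_set in root.iter("langSet"):  # finds langSet elements at any depth
        lang = lang_set.get(XML_LANG)
        for group in lang_set:
            if group.tag in ("tig", "ntig"):
                # <term> is a child of <tig> but sits inside <termGrp> in an <ntig>
                term = group.findtext(".//term")
                result.setdefault(lang, []).append((group.tag, term))
    return result


print(terms_by_language(SAMPLE))
```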

Transforming header information

The header information of the TBX document would lead to the following header in the RDF document:

# Prefixes used in this and the following snippets.
@prefix cc:      <http://creativecommons.org/ns#> .
@prefix :        <file:samples/simple1.rdf> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tbx:     <http://tbx2rdf.lider-project.eu/tbx#> .
@prefix gr:      <http://purl.org/goodrelations/> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix decomp:  <http://www.w3.org/ns/lemon/decomp#> .
@prefix ldr:     <http://purl.oclc.org/NET/ldr/ns#> .
@prefix odrl:    <http://www.w3.org/ns/odrl/2/> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix prov:    <http://www.w3.org/ns/prov#> .

# The dataset itself carries the information from <martifHeader>.
:       a                 tbx:MartifHeader , dcat:Dataset ;
        <http://purl.org/dc/elements/1.1/source>
                "This is an excerpt of a TBX file downloaded from the IATE website. Address any enquiries to iate@cdt.europa.eu." ;
        dct:type          "TBX-Default" ;
        tbx:encodingDesc  "<p type=\"XCSURI\">TBXXCS.xcs</p>"^^rdf:XMLLiteral ;
        tbx:sourceDesc    "<sourceDesc><p>This is an excerpt of a TBX file downloaded from the IATE website. Address any enquiries to iate@cdt.europa.eu.</p></sourceDesc>"^^rdf:XMLLiteral .

Transforming terminological concepts

The termEntry element (id IATE-84) would be represented in RDF by a skos:Concept. As explained in the mapping above, terminologies can be seen as a special case of a knowledge organization system, so using SKOS concepts to represent terminological concepts seems appropriate.

This is shown by the following RDF snippet, where the subject field of the terminological concept is specified via the property tbx:subjectField:

:IATE_84
  a  skos:Concept ;
  tbx:subjectField  "1011"^^tbx:subjectField .
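
The step from the termEntry element to this snippet can be sketched in Python as follows. The function name concept_turtle is hypothetical, and the rule of replacing the hyphen in the TBX id with an underscore is an assumption inferred only from this one example (IATE-84 becoming :IATE_84); the actual converter may derive identifiers differently.

```python
def concept_turtle(entry_id, subject_field):
    """Render a termEntry as a skos:Concept in Turtle (prefixes assumed declared)."""
    # Assumption: hyphens in the TBX id become underscores, as in IATE-84 -> IATE_84.
    local_name = entry_id.replace("-", "_")
    return (
        f":{local_name}\n"
        f"  a  skos:Concept ;\n"
        f'  tbx:subjectField  "{subject_field}"^^tbx:subjectField .\n'
    )


print(concept_turtle("IATE-84", "1011"))
```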


Transforming TIGs

Our TBX example document has two language sets, one for English and one for German. In the lemon model, a lexicon is regarded as language-specific and as comprising lexical entries for a single language. Thus, in order to represent lexical entries in different languages, one lexicon per language needs to be created. In our example, as there are terms for English and German, two lexica need to be created. These lexica contain one lexical entry each, corresponding to the terms Zuständigkeit der Mitgliedstaaten and competence of the Member States. The English entry generated from the first langSet element (xml:lang="en") would look as follows:


<http://tbx2rdf.lider-project.eu/data/iate/en>  a  ontolex:Lexicon ;
  ontolex:entry     :competence+of+the+Member+States-en ;
  ontolex:language  "en" .

:competence+of+the+Member+States-en
  a                      ontolex:LexicalEntry ;
  tbx:reliabilityCode    "3"^^xsd:string ;
  tbx:termType           tbx:fullForm ;
  ontolex:canonicalForm  :competence+of+the+Member+States-en#CanonicalForm ;
  ontolex:language       "en" ;
  ontolex:sense          :competence+of+the+Member+States-en#Sense .

:competence+of+the+Member+States-en#CanonicalForm
  ontolex:writtenRep  "competence of the Member States"@en .

:competence+of+the+Member+States-en#Sense
  ontolex:reference  :IATE_84.

Note that the entry specifies the reliability code (i.e. 3), the type of term (i.e. full form), the canonical form (i.e. competence of the Member States), and the language (i.e. en). Each lexical entry is assumed to have a LexicalSense that represents the meaning of the entry. In this case the meaning is established by reference to the terminological concept :IATE_84.

We would generate a similar entry for German, which is identified as :Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de and is an entry in the corresponding German lexicon. Note that both entries have a reference to :IATE_84 and are thus cross-lingual equivalents.
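
The entry identifiers shown in this section look like URL-encoded terms followed by a language suffix. The following Python sketch reproduces them with the standard library function urllib.parse.quote_plus. Whether the converter uses an equivalent encoding routine is an assumption made for illustration, and the function name entry_names is hypothetical.

```python
from urllib.parse import quote_plus


def entry_names(term, lang):
    """Derive the local names of a lexical entry, its canonical form and its sense."""
    # quote_plus turns spaces into '+' and percent-encodes non-ASCII characters as UTF-8.
    entry = f"{quote_plus(term)}-{lang}"
    return {
        "entry": entry,
        "form": f"{entry}#CanonicalForm",
        "sense": f"{entry}#Sense",
    }


print(entry_names("competence of the Member States", "en")["entry"])
print(entry_names("Zuständigkeit der Mitgliedstaaten", "de")["entry"])
```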

Transforming NTIGs

Components of a composite term are represented as constituents of the composite term, and each component is linked to its corresponding lexical entry by way of the decomp:correspondsTo relation. In the example below, an object Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de#ComponentList represents the decomposition of the lexical entry Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de and is linked to that entry via the property decomp:identifies. The ComponentList object is linked to its components via the property decomp:constituent. For each component, its part of speech and grammatical number (if applicable) are indicated. The decomposition of the German entry for Zuständigkeit der Mitgliedstaaten (the termCompList element in the sample TBX document) is represented in RDF as indicated below:

<http://tbx2rdf.lider-project.eu/data/iate/de>  a  ontolex:Lexicon ;
  ontolex:entry     :Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de ;
  ontolex:language  "de" .

:Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de
  a                      ontolex:LexicalEntry ;
  tbx:reliabilityCode    "3"^^tbx:reliabilityCode ;
  tbx:termType           tbx:fullForm ;
  ontolex:canonicalForm  :Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de#CanonicalForm ;
  ontolex:language       "de" ;
  ontolex:sense          :Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de#Sense .

:Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de#CanonicalForm
  ontolex:writtenRep  "Zuständigkeit der Mitgliedstaaten"@de .

:Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de#Sense
  ontolex:reference  :IATE_84 .

:Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de#ComponentList
  decomp:identifies   :Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de ;
  decomp:constituent  :component1 , :component2 , :component3 .

:component1 decomp:correspondsTo :Zust%C3%A4ndigkeit-de .
:component2 decomp:correspondsTo :der-de .
:component3 decomp:correspondsTo :Mitgliedstaaten-de .

:Zust%C3%A4ndigkeit-de
  a                      ontolex:LexicalEntry ;
  rdfs:label             "Zuständigkeit"@de ;
  tbx:grammaticalNumber  tbx:singular ;
  tbx:partOfSpeech       tbx:noun .

:der-de
  a                 ontolex:LexicalEntry ;
  rdfs:label        "der"@de ;
  tbx:partOfSpeech  tbx:other .

:Mitgliedstaaten-de
  a                      ontolex:LexicalEntry ;
  rdfs:label             "Mitgliedstaat"@de ;
  tbx:partOfSpeech       tbx:noun ;
  tbx:grammaticalNumber  tbx:plural .
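
The linking pattern used in this listing (a component list that identifies the entry, constituents, and a correspondsTo link per component) can be sketched in Python as follows. The function decomposition_triples is hypothetical, and the identifier scheme (URL-encoded string plus language suffix, components numbered from 1) is an assumption that simply mirrors the names appearing in the listing.

```python
from urllib.parse import quote_plus


def decomposition_triples(term, lang, component_strings):
    """Return (subject, predicate, object) tuples linking a term to its components."""
    entry = f":{quote_plus(term)}-{lang}"
    component_list = f"{entry}#ComponentList"
    triples = [(component_list, "decomp:identifies", entry)]
    for index, text in enumerate(component_strings, start=1):
        component = f":component{index}"
        # Each constituent points at the lexical entry of the component string.
        triples.append((component_list, "decomp:constituent", component))
        triples.append((component, "decomp:correspondsTo", f":{quote_plus(text)}-{lang}"))
    return triples


for triple in decomposition_triples(
    "Zuständigkeit der Mitgliedstaaten", "de",
    ["Zuständigkeit", "der", "Mitgliedstaaten"],
):
    print(" ".join(triple), ".")
```

The component strings are passed in explicitly because the listing names the third component entry after the surface form (Mitgliedstaaten) while its label carries the lemma (Mitgliedstaat); the sketch does not try to decide between the two.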

Transforming Transaction Information

Finally, we discuss how to represent provenance information, in particular provenance information as expressed via transaction elements in TBX. We rely on the PROV-O ontology (http://www.w3.org/TR/prov-o/) for this, as it is the W3C-recommended vocabulary to "represent and interchange provenance information generated in different systems and under different contexts". Some provenance information is given in the transacGrp element of our TBX example document, and from this we generate the following representation:

:Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de
  tbx:reliabilityCode    "3"^^tbx:reliabilityCode ;
  tbx:transaction	 :Transaction .

:Transaction
  a                       prov:Activity , tbx:Transaction ;
  tbx:transactionType     "origination"@en ;
  prov:endedAtTime        "2014-05-08"^^<http://www.w3.org/2001/XMLSchema#date> ;
  prov:wasAssociatedWith  :Agent .

:Agent
  a           prov:Agent ;
  rdfs:label  "PC" .
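
As a minimal sketch of this step, the following Python code reads the transacGrp fragment from the example and produces the corresponding triples as plain tuples. The function name transaction_triples and the default identifiers :Transaction and :Agent are illustrative (the latter two are taken from the snippet above); this is not the converter's actual code.

```python
import xml.etree.ElementTree as ET

# The transaction group from the TBX example.
TRANSAC_GRP = """<transacGrp>
  <transac type="transactionType">origination</transac>
  <transacNote type="responsibility">PC</transacNote>
  <date>2014-05-08</date>
</transacGrp>"""


def transaction_triples(fragment, entry, activity=":Transaction", agent=":Agent"):
    """Map a <transacGrp> fragment to PROV-O style triples for the given entry."""
    group = ET.fromstring(fragment)
    kind = group.findtext("transac[@type='transactionType']")
    responsible = group.findtext("transacNote[@type='responsibility']")
    date = group.findtext("date")
    return [
        (entry, "tbx:transaction", activity),
        (activity, "rdf:type", "tbx:Transaction"),
        (activity, "tbx:transactionType", f'"{kind}"@en'),
        (activity, "prov:endedAtTime", f'"{date}"^^xsd:date'),
        (activity, "prov:wasAssociatedWith", agent),
        (agent, "rdf:type", "prov:Agent"),
        (agent, "rdfs:label", f'"{responsible}"'),
    ]


for triple in transaction_triples(TRANSAC_GRP, ":Zust%C3%A4ndigkeit+der+Mitgliedstaaten-de"):
    print(" ".join(triple), ".")
```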

Proof-of-Concept

As a proof of concept for the conversion, we have converted IATE (InterActive Terminology for Europe) into RDF. We have also converted the European Migration Network glossary. The resulting data has been made available online.

Implementation

A converter has been implemented to map TBX/XML input into RDF using the vocabularies described above. It is a Java program that reads in the document and builds the DOM tree. The DOM tree is traversed, and its elements are mapped to appropriate object-oriented data structures, which are then serialized as RDF. The code is available as the GitHub project tbx2rdf.

As additional input to the program, a file can be provided that maps specific XML elements and attributes used in the TBX document to URIs representing properties. If no file is specified, the default file "default.mappings" is used. This option is only available when executing the Java program directly, not via the Web service.

A service for converting TBX to RDF is available here: http://tbx2rdf.lider-project.eu/converter

References

TBX Standard as published by LISA

tbx2rdf GitHub Project