Abstract:
This document is a review and discussion of existing work on RDF representations for thesaurus data. Seven schemas are summarised as tables and diagrams. Common themes are highlighted and discussed.
Project name:
Semantic Web Advanced Development for Europe (SWAD-Europe)
Project Number:
IST-2001-34732
Workpackage name:
8. Thesaurus Research Prototype
Workpackage description:
http://www.w3.org/2001/sw/Europe/plan/workpackages/live/esw-wp-8.html 
Deliverable title:
8.2: thesaurus_ontologies_report
This version:
http://www.w3.org/2001/sw/Europe/reports/thes/8.2/draft01.html
Latest version:
http://www.w3.org/2001/sw/Europe/reports/thes/8.2/
Status:
Draft
Authors:
Brian Matthews, CCLRC, UK
Alistair Miles, CCLRC, UK

Status of this document

This section describes the status of this document at the time of its publication. This is a draft document and may be updated, replaced, or obsoleted by other documents at any time. The latest status of this document series is maintained at the W3C.

This document is a public DRAFT for discussion. This document is an output of the research work of the Semantic Web Advanced Development for Europe Project, which is associated with the W3C Semantic Web Activity. This document is made available by W3C for discussion only. Publication of this document by W3C does not imply endorsement by W3C, including the Team and Membership.

Comments on this document are welcome and should be sent to the authors or to the public-esw-thes@w3.org list. An archive of this list is available at http://lists.w3.org/Archives/Public/public-esw-thes/.



1. Introduction

This document is a review and discussion of all work to date on RDF representations for thesaurus data. Section 2 outlines seven known schemas, using tables and diagrams to illustrate their features. Section 3 highlights the major themes, and discusses them in relation to the schemas.

This discussion raises two questions:

This document does not answer these questions, but provides a description of alternative solutions proposed to date, as a basis for further work in this area.

2. Overview of the schemas

2.1. LIMBER

Schema title: Thesaurus Interchange Format for the Semantic Web
Authors: B.M. Matthews
Project: LIMBER
Organisation: CCLRC
Date: 2001

LIMBER schema summary table (see note 1):

Class             | Property                | Range         | Instance
ThesaurusObject   |                         |               |
  Concept         |                         |               |
                  | ClassificationCode      | rdfs:Literal  |
                  | inLanguageOfC           | LanguageCode  |
                  | isIndicatedBy           | Term          |
                  |   PreferredTerm         | Term          |
                  |   UsedFor               | Term          |
                  | ConceptRelation         | Concept       |
                  |   BroaderConcept        | Concept       |
                  |   NarrowerConcept       | Concept       |
                  |   TopOfHierarchy        | Concept       |
                  |   isRelatedTo           | Concept       |
                  | ConceptEquivalence      | Concept       |
                  |   ExactEquivalent       | Concept       |
                  |   InexactEquivalent     | Concept       |
                  |   PartialEquivalent     | Concept       |
                  |   OneToManyEquivalent   | Concept       |
    TopConcept    |                         |               |
  Term            |                         |               |
  ScopeNote       |                         |               |
                  | inLanguageOfSN          | LanguageCode  |
                  | hasTypeOfScopeNote      | ScopeNoteType |
  ScopeNoteType   |                         |               |
                  |                         |               | General
                  |                         |               | Hierarchy
                  |                         |               | Translation
                  |                         |               | Editor
                  |                         |               | History
LanguageCode      |                         |               |

2.2. ILRT

Schema title: RDF Thesaurus Specification (draft)
Authors: Phil Cross, Dan Brickley, Traugott Koch
Project: Desire, Desire II
Organisation: ILRT, Netlab
Date: 2001

ILRT schema summary table (see note 1):

Class            | Property         | Range           | Instance
Concept          |                  |                 |
                 | broaderConcept   | Concept         |
                 | relatedConcept   | Concept         |
                 | indicator        | Term            |
                 | conceptCode      | rdfs:Literal    |
                 | scope            | ScopeNote       |
Term             |                  |                 |
                 | lang             | rdfs:Literal    |
                 | termUsage        | termUsageValue  |
ScopeNote        |                  |                 |
TermUsageValue   |                  |                 |
                 |                  |                 | preferred
                 |                  |                 | nonPreferred

2.3. CERES

Schema title: Thesaurus::RDF - The RDF Thesaurus Descriptor Standard
Authors: CERES
Project: CERES/NBII Thesaurus Partnership Project
Organisation: CERES, NBII
Date: 2000???

CERES schema summary table (see note 1):

Class          | Property     | Range
Term           |              |
               | HN           | rdfs:Literal
               | Source       | rdfs:Literal
               | Status       | rdfs:Literal
  Category     |              |
               | descriptor   | Descriptor
  Descriptor   |              |
               | SN           | rdfs:Literal
               | CN           | rdfs:Literal
               | CAT          | Category
               | TT           | Descriptor
               | BT           | Descriptor
               | RT           | Descriptor
               | NT           | Descriptor
               | LT           | Descriptor
               | UF           | EntryTerm
  EntryTerm    |              |
               | USE          | Descriptor

2.4. GEM

Schema title: Monolingual Thesauri Vocabulary
Authors: The GEM Consortium
Project: GEM
Organisation: ???
Date: 11-04-2001

GEM schema summary table (see note 1):

Class   | Property
        | BT
        | RT
        | NT
        | USE
        | UF
        | TT
        | HN
        | SCOPE
        | NISO-Z3919

The properties of this schema have been defined without a domain or range.


2.5. DRC

Schema title: CALL Thesaurus Ontology
Authors: TeamXML at DRC
Project: CALL Thesaurus Project
Organisation: DRC
Date: 2002-02-26

DRC schema summary table (see note 1):

Class         | Property                | Range (restriction)
Term          |                         |
              | name                    | xsd:String
              | descriptorFor           | Term
              |   preferredTermFor      | Term
              |     USE                 | Term
              |     ACK                 | Term
              | entryTermFor            | Term
              |   UF                    | Term
              |   AF                    | Term
              | BT                      | Term
              | NT                      | Term
              | RT                      | Term
  CALL-Term   |                         |

The properties of this schema have not been defined with a domain or range. DAML restrictions have been defined which restrict the allowed values for the range of each property at the term class.


2.6. FAO

Schema title: ???
Authors: ???
Project: Agrovoc Thesaurus Project???
Organisation: FAO, KAON
Date: 2001???

FAO schema summary table (see note 1):

Class        | Property     | Range
rdfs:Class   |              |
             | rt           | rdfs:Class
             | uf           | rdfs:Class
             | use          | rdfs:Class
Label        |              |
             | value        | rdfs:Literal
             | inLanguage   | rdfs:Resource
             | references   | rdfs:Class

In this schema every term is modelled as an rdfs:Class. The rdfs:subClassOf property is used to indicate hierarchical relationships between terms. 'Labels' are then declared, each of which 'references' one of the previously declared 'rdfs:Class'es (i.e. the terms).


2.7. ETB

Schema title: RDF Schema declaration for European Treasury Browser Multilingual Educational Thesaurus (ETBT) version 0.4
Authors: Tim Read
Project: ETB Thesaurus Project
Organisation: ETB, INDIRE
Date: 2001-11-15

ETB schema summary table (see note 1):

Class      | Property      | Range
ETBT       |               |
Node       |               |
           | ID            | rdfs:Literal
  Thes     |               |
           | TmonoNodes    | Tmono
  TMono    |               |
           | Lang          | rdfs:Literal
           | MTNodes       | MT
  MT       |               |
           | Name          | rdfs:Literal
           | No            | rdfs:Literal
           | MTNodeNodes   | MTNode
           | TopTerms      | MTNode
  MTNode   |               |
           | Title         | rdfs:Literal
           | SN            | rdfs:Literal
           | Type          | rdfs:Literal
           | Date          | rdfs:Literal
           | PN            | rdfs:Literal
           | HN            | rdfs:Literal
           | MTn           | MT
           | TT            | MTNode
           | RT            | MTNode
           | UF            | MTNode
           | USE           | UNode
           | BT            | RNode
           | NT            | RNode
           | RBT           | RNode
           | RNT           | RNode
           | EN            | ENode
  ENode    |               |
           | da            | MTNode
           | de            | MTNode
           | en            | MTNode
           | he            | MTNode
           | el            | MTNode
           | es            | MTNode
           | fi            | MTNode
           | fr            | MTNode
           | it            | MTNode
           | nl            | MTNode
           | pt            | MTNode
           | sv            | MTNode
  UNode    |               |
           | UType         | rdfs:Literal
           | UN            | MTNode
  RNode    |               |
           | Relation      | rdfs:Literal
           | RN            | MTNode

3. Discussion of themes

3.1. Concept-based or term-based

In the term-based view, a thesaurus is a collection of terms. Terms are the only type of entity considered. Terms may be related to other terms.

******Image: Term-based data model******

A traditional print thesaurus consists of data such as:

******Extract from a term-based thesaurus***********

In the concept-based view, a thesaurus consists of two types of entity: concepts and terms. Terms are labels for concepts. A concept is defined as a unit of thought, something which exists in the mind of a person. Relationships such as 'broader', 'narrower' and 'related' are concept-to-concept relationships; they convey conceptual information, that is, information about meaning and about the structure of the concept space being described. Term-to-term relationships convey purely linguistic (terminological, lexical) information; for example, a term can be an 'abbreviation-for' another term. Concept-to-term relationships convey information about how a concept may be indicated (labelled) by various terms. Term-to-concept relationships convey information about the concepts (meanings) implied by a term.

******Image: Concept-based data model******

In this view, people use terms (labels) to refer to (point to, indicate) concepts. If two people use the same term to point to the same idea (conceptual structure) they will understand each other.
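
As a rough illustration of the concept-based view, the sketch below models concepts and terms as two separate Python types. The class and field names (Concept, Term, preferred, used_for, broader, related) and the sample data are assumptions made for this example; they are not taken from any of the reviewed schemas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    """A lexical label; it carries no hierarchy of its own."""
    label: str
    lang: str = "en"


@dataclass
class Concept:
    """A unit of thought, indicated by one preferred term and any others."""
    preferred: Term
    used_for: List[Term] = field(default_factory=list)
    broader: List["Concept"] = field(default_factory=list)
    related: List["Concept"] = field(default_factory=list)

    def labels(self) -> List[str]:
        """All labels that indicate this concept, preferred term first."""
        return [self.preferred.label] + [t.label for t in self.used_for]


# Hypothetical sample data, not drawn from any real thesaurus.
animals = Concept(preferred=Term("animals"), used_for=[Term("fauna")])
cats = Concept(preferred=Term("cats"), used_for=[Term("felines")], broader=[animals])

print(cats.labels())
print([c.preferred.label for c in cats.broader])
```

The hierarchy links one Concept to another, while synonymy is expressed only through the terms attached to a concept. A term-based model would instead put both kinds of link directly between terms.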

Some example data in the concept-based view:

******Extract from a concept-based thesaurus***********

The ILRT and LIMBER schemas follow the concept-based model; the GEM, CERES and DRC schemas are term-based. The ETB and FAO schemas are harder to classify, and are discussed further below.

******Comparison with XML formats******

The concept-based model is arguably a more precise description of the information contained in a thesaurus. It is explicit about the fact that a thesaurus contains two distinct types of information, conceptual and lexical. It has been argued that distinguishing between these types of information improves clarity, and that failing to do so creates confusion (see for example [soergel ref]).

The term-based model translates to a more compact schema and data format. It is also the traditional approach, and more familiar to existing users of thesauri.

3.2. Facets, categories and other grouping structures

In some thesauri terms or concepts are organised into fundamental categories, or facets. For example:

********Example of faceted data***********

Other criteria may also be used to group concepts and terms. For example:

********Example of grouped data***********

The CERES and ETB schemas support categories for terms; the others do not. None of the schemas allows other grouping structures to be represented.

******Comparison with XML formats******

Applying faceted classification and other grouping methods in a thesaurus adds useful information about the structure of the conceptual space being defined. A schema which supports these structures captures this information; in a schema which does not, it is lost.

3.3. Extensible and customisable relationship sets

Although the ISO 2788:1986 and ANSI/NISO Z39.19 standards define a fixed set of term-to-term relationships, in practice these recommendations are implemented in a great variety of ways [ref???]. In some thesauri, for example, the 'broader-term' relationship strictly implies class subsumption (an is-a relationship); in others it is very fuzzy, and can imply is-a, instance-of, part-of, geographical part-of etc. In some thesauri the part-of relationship is modelled as BT/NT, in others as RT. Other thesauri use a more precisely defined set of relationships: BTG (generic), BTI (instance) and BTP (partitive). Other relationships are also found, for example RBT and RNT.

There are two points here. First, some thesauri are not precise in defining the meaning of the relationships they use, which leaves some ambiguity. If a schema does not in turn precisely define the intended meaning of its relationships, errors of meaning may be introduced. The more precisely a schema defines the meaning of a relationship, the fewer errors of ambiguity and meaning it admits.

Secondly, a schema that supports customisation or extension of its relationship set can accommodate more thesauri than one which offers only a fixed set of relationships. Given the variability found, it has been argued that such flexibility is an essential feature of a representation format, although this argument was made in relation to XML formats (see soergel spec).

The LIMBER schema supports an extensible relationship set by defining the high-level properties 'ConceptRelation', 'ConceptEquivalence' and 'isIndicatedBy'. All other relationships are subproperties of these three. Thus there is room for a certain amount of customisation and extension.
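
The subproperty mechanism can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the LIMBER implementation: the property names follow the LIMBER summary table, but 'PartOf' is a hypothetical local extension, and the triples and the helper is_subproperty are invented for the example.

```python
# Each property maps to its parent property (None for a top-level property).
# "PartOf" is a hypothetical extension added by a particular thesaurus.
SUBPROPERTY_OF = {
    "ConceptRelation": None,
    "BroaderConcept": "ConceptRelation",
    "NarrowerConcept": "ConceptRelation",
    "isRelatedTo": "ConceptRelation",
    "PartOf": "BroaderConcept",
}


def is_subproperty(prop, ancestor):
    """True if prop is ancestor itself or a direct or indirect subproperty."""
    while prop is not None:
        if prop == ancestor:
            return True
        prop = SUBPROPERTY_OF.get(prop)
    return False


# Hypothetical statements as (subject, property, object) triples.
triples = [
    ("wheel", "PartOf", "car"),
    ("car", "BroaderConcept", "vehicle"),
    ("car", "isRelatedTo", "road"),
]

# A generic tool that only knows the high-level properties still finds
# statements made with the more specific, locally defined ones.
hierarchical = [t for t in triples if is_subproperty(t[1], "BroaderConcept")]
all_relations = [t for t in triples if is_subproperty(t[1], "ConceptRelation")]

print(len(hierarchical), len(all_relations))
```

The design point is that an application written against 'ConceptRelation' keeps working when a thesaurus adds its own, more precise relationship, provided the new property is declared as a subproperty of an existing one.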

*************image: Limber extensible relationship set****************

The ETB schema, by modelling every relationship as a node in the RDF graph, suggests a mechanism which allows flexibility. If a relationship is a node, the node can have a property whose value defines the type of that relationship. However, the ETB schema does not in fact allow flexibility in this way, as the properties linking a term (MTNode) to a relationship node (RNode) all imply fixed relationship types. All the other schemas implement a fixed set of relationships, with no mechanism for extension or customisation.

*************image: node based possibility for relationship type as data value**************

**************Comparison with XML formats**************** The Soergel XML Spec implements relationship types as data values, as does the MeSH XML format. Others?????

3.4. Multilingual data

Does the schema support non-exact equivalences?

The LIMBER, ILRT, FAO and ETB schemas support multilingual data. Each of these schemas models multilingual data in a different way. The two concept-based schemas (LIMBER, ILRT) are easiest to compare. The ILRT schema allows terms to have a language property. The LIMBER schema allows concepts to have a language value. In doing so, the LIMBER schema models all concepts as being embedded in a particular language; there can be no language-independent concepts. Multilingual relationships are modelled using the equivalence properties (exact, inexact, partial, one-to-many) between concepts. In the ILRT approach, a concept is defined, then alternative terms from different languages may be attached to it. To find a language equivalent for a term, find the concept it implies, then find that concept's terms in the target language.

The strength of the LIMBER approach is that it can support non-exact mappings between concepts in different languages. This is highly desirable in several thesauri [****examples of inexact mappings******]. However, it does not allow that there may be some language-independent concepts [*******examples*******]. For example, the HPMulti thesaurus contains terms with only exact equivalents. In this case, two identical concepts must be separately defined, and an exact equivalence statement made between them. In a thesaurus with only exact equivalence relations, this is a waste of space and effort. The ILRT schema, by supporting only language-independent concepts, is more efficient in this kind of scenario. However, it cannot support the non-exact mappings found in e.g.*********e.g.*******.
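
The two approaches can be contrasted with a small Python sketch. It is only an illustration: the data, the identifiers (c1, en:river, fr:fleuve, fr:riviere) and the function names are hypothetical, and only the equivalence property name 'InexactEquivalent' comes from the LIMBER summary table.

```python
# ILRT-style: language-independent concepts; each term carries a language.
# Rows are (concept id, term label, language). All data is hypothetical.
ilrt_terms = [
    ("c1", "water", "en"),
    ("c1", "eau", "fr"),
    ("c1", "Wasser", "de"),
]


def ilrt_translate(label, target_lang):
    """Find the concept(s) a term indicates, then their terms in target_lang."""
    concepts = {cid for cid, term, _ in ilrt_terms if term == label}
    return [term for cid, term, lang in ilrt_terms
            if cid in concepts and lang == target_lang]


# LIMBER-style: every concept belongs to one language; concepts are linked by
# typed equivalence statements, so a non-exact mapping can be recorded.
# Rows are (concept, equivalence property, concept).
limber_equivalences = [
    ("en:river", "InexactEquivalent", "fr:fleuve"),
    ("en:river", "InexactEquivalent", "fr:riviere"),
]


def limber_equivalents(concept):
    """Return (equivalence property, target concept) pairs for a concept."""
    return [(prop, target) for source, prop, target in limber_equivalences
            if source == concept]


print(ilrt_translate("water", "fr"))
print(limber_equivalents("en:river"))
```

In the first structure a translation is a lookup through a shared concept, so every mapping is implicitly exact. In the second, each mapping is a statement in its own right and can say how close the equivalence is, at the cost of one concept per language.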

The FAO schema allows Labels to have a language property. Potentially, several labels from different languages could 'reference' the same term, so this is similar to the ILRT approach.

The ETB schema maps terms to equivalents in other languages via an 'ENode' node in the graph. Again, only exact equivalence relationships are allowed here.

3.5. Inter-thesaurus mapping

Does the schema allow terms/concepts from different thesauri to be mapped to each other?

Only LIMBER has an explicit mechanism for this. Here the equivalence properties used for multilingual mappings may also be used for inter-thesaurus mappings, which allows non-exact mappings. No other schema has an explicit mechanism; if mappings were made, they would have to use RT, BT and NT.

3.6. Relationships as nodes or arcs

Does the schema model the relationships between thesaurus entities as an arc in the RDF graph, or as a node?

*************Image - arc approach****************

************Image - node approach****************

The ETB schema takes the node approach. All the others model relationships as arcs.

The arc approach is the most intuitive, and also the most compact, with each relationship using a single statement (path length 1). The node approach offers the possibility of typing a relationship through the value of a property: there could be generic relationship nodes, with the type of relationship specified by the value of a property of the node (see section 3.3). It could also support extensibility through a class hierarchy of relationship node types. However, the node approach is less compact, with each relationship requiring two statements (path length 2).
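
The difference can be made concrete with plain (subject, property, object) triples in Python. This is a minimal sketch under assumptions: the names rel1, hasRelation, relationType and target are hypothetical and are not taken from the ETB schema. Note that when the relationship type is stored as a data value, a third statement is needed in addition to the two that form the path between the terms.

```python
# Arc approach: one statement per relationship (path length 1).
arc_triples = [
    ("cats", "BT", "animals"),
]

# Node approach: the relationship is itself a node, so the path from one
# term to the other has length 2, and the type can be stored as data.
node_triples = [
    ("cats", "hasRelation", "rel1"),
    ("rel1", "relationType", "BT"),
    ("rel1", "target", "animals"),
]


def broader_arc(term):
    """Broader terms under the arc approach: a single lookup."""
    return [o for s, p, o in arc_triples if s == term and p == "BT"]


def broader_node(term):
    """Broader terms under the node approach: follow term -> node -> term."""
    relations = [o for s, p, o in node_triples
                 if s == term and p == "hasRelation"]
    typed = [r for r in relations if (r, "relationType", "BT") in node_triples]
    return [o for s, p, o in node_triples if s in typed and p == "target"]


print(broader_arc("cats"))
print(broader_node("cats"))
```

Both functions return the same answer, but in the node version a new relationship type is just a new data value for relationType and needs no change to the schema.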

3.7. The sub-class approach

The FAO Agrovoc schema models all terms as rdfs:Classes. In so doing, it reuses the rdfs:subClassOf property to declare hierarchical relationships in the thesaurus.

Re-using properties from other schemas promotes interoperability. It also gives a clue as to how a thesaurus may be related or migrated to an ontology or other conceptual system. However, there is a danger of semantic inaccuracy and ambiguity. First, in many thesauri the terms/concepts are not in fact classes, so to declare them to be rdfs:Classes would be semantically inaccurate. Secondly, the rdfs:subClassOf property implies a specific meaning, that of class subsumption. However, as stated above, the hierarchical relations of many thesauri have a different meaning. Compressing all hierarchical relationships into the rdfs:subClassOf property would therefore create a great deal of inaccurate information.
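
The risk can be shown with a small Python sketch of subclass inference. The data is hypothetical: it assumes a thesaurus whose broader-term links include a part-of relationship ('wheels' BT 'cars') and recasts every such link as a subclass statement.

```python
# Hypothetical broader-term links recast as subclass statements.
# "wheels" BT "cars" is really a part-of link; "cars" BT "vehicles" is is-a.
SUBCLASS_OF = {
    "wheels": "cars",
    "cars": "vehicles",
}


def inferred_types(cls):
    """Classes an instance of cls is entailed to belong to under subclass semantics."""
    types = [cls]
    while types[-1] in SUBCLASS_OF:
        types.append(SUBCLASS_OF[types[-1]])
    return types


print(inferred_types("cars"))    # is-a link: the entailment is sound
print(inferred_types("wheels"))  # part-of link: every wheel is entailed to be a car
```

A generic reasoner cannot tell the two cases apart once both are written as subclass statements, which is the loss of precision described above.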

3.8. RDFS, DAML or OWL

Does the schema use RDFS to define the data model, or DAML, or OWL?

The DRC schema is defined as a DAML ontology. All the others are defined using RDF Schema.

Using DAML or OWL allows the definition of specific constraints on the data model, such as [*****example******]. Thus there is the possibility of more extensive validation of data. There is also the possibility of using generic DAML or OWL components to infer things such as the transitivity of a subsumption hierarchy. Using RDFS one would, for example, have to state the exact equivalence between a term and every other term in other languages. Using DAML one could declare the property as transitive, and connect the new term to any one of the exact-equivalent set. RDFS is more flexible, and this may be an advantage [*****when???******]. Its tool support is also perhaps more mature.
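
The saving from a transitive property can be sketched in Python as a reachability computation over the stated links. This is an illustration, not a DAML reasoner: the function name and data are hypothetical, and the sketch additionally assumes the equivalence property is symmetric, which the text above does not state.

```python
from collections import defaultdict


def equivalent_set(statements, start):
    """All terms reachable from start, treating the stated property as
    symmetric and transitive (the symmetry is an assumption of this sketch)."""
    neighbours = defaultdict(set)
    for a, b in statements:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen - {start}


# Hypothetical exact-equivalence statements. The new term "agua@es" is
# connected to only one member of the existing set.
stated = [
    ("water@en", "eau@fr"),
    ("eau@fr", "Wasser@de"),
    ("agua@es", "water@en"),
]

print(sorted(equivalent_set(stated, "agua@es")))
```

With n mutually equivalent terms, n - 1 stated links are enough for the full set to be inferred, whereas stating every pair explicitly requires n(n - 1)/2 statements.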

3.9. Implementation issues

Several of the schemas introduce their own typing mechanism. The ILRT schema has the 'termUsage' property, whose value specifies whether a term is preferred or not. The LIMBER schema has the 'hasTypeOfScopeNote' property to define the type of a scope note. The ETB schema has three independent typing properties, one for each of the MTNode, UNode and RNode classes.

RDF has an inherent typing mechanism. Reusing this mechanism promotes interoperability and processing by generic tools. Introducing new typing mechanisms does the opposite.

Where a schema does define such a typing property, the possibility of error can be reduced by constraining its value not to a literal but to a predefined typed resource. A criticism of the ETB schema is that it defines types in relation to literal values.
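
By way of analogy only, the difference between a literal-valued type and a value drawn from a predefined set can be sketched with a Python Enum. The names TermUsage and parse_usage are hypothetical; the two values mirror the 'preferred' and 'nonPreferred' instances in the ILRT summary table.

```python
from enum import Enum


class TermUsage(Enum):
    """A closed set of allowed values, analogous to predefined typed resources."""
    PREFERRED = "preferred"
    NON_PREFERRED = "nonPreferred"


def parse_usage(value):
    """Accept only a predefined value; anything else raises ValueError."""
    return TermUsage(value)


# A free-text literal lets a misspelling through unnoticed.
literal_usage = "prefered"

# The constrained form rejects the same misspelling.
try:
    parse_usage(literal_usage)
    accepted = True
except ValueError:
    accepted = False

print(parse_usage("nonPreferred"), accepted)
```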

4. Summary

Seven RDF schemas for thesaurus data are reviewed. The features of these schemas are discussed in relation to nine themes. Each of these themes highlights alternative methods of implementation, and the pros and cons of each alternative are discussed. The main elements of this discussion are summarised in the tables below.

Concept- or term-based?        | Concept-based                            | Term-based
Schema                         | ILRT, LIMBER, (FAO?)                     | GEM, CERES, DRC, ETB
Pros                           | Semantically precise                     | Compact, familiar
Cons                           | Not compact                              | Semantically confusing

Facets or categories?          | Yes                                      | No
Schema                         |                                          |
Pros                           | Captures more information                | Reduced, simple
Cons                           | Not compact                              | Semantically confusing

Extensible/customisable?       | Yes                                      | No
Schema                         |                                          |
Pros                           | Less ambiguity, flexibility              | Familiar, simple
Cons                           | Complex                                  | Constricting, possibility for error

Non-exact equivalence?         | Supported                                | Unsupported
Schema                         |                                          |
Pros                           | Semantically precise                     | Compact, simple
Cons                           | Complex                                  | Loss of information, precision

Inter-thesaurus mapping        | Supported                                | Unsupported
Schema                         |                                          |
Pros                           | Interoperability, re-use                 |
Cons                           |                                          | Not interoperable

Relationships nodes or arcs?   | Arcs                                     | Nodes
Schema                         | LIMBER, ILRT, CERES, GEM, FAO, DRC       | ETB
Pros                           | Compact, intuitive                       | Extensible, flexible
Cons                           | Less extensible                          | Longer, complex

Subclass approach              | Yes                                      | No
Schema                         |                                          |
Pros                           | Interoperability, re-use                 | Precise
Cons                           | Possibility for loss of precision, error | Harder to migrate

RDFS, DAML                     | RDFS                                     | DAML
Schema                         |                                          |
Pros                           | Simple, supported                        | Validation, inference
Cons                           | Less validation                          | Complex, less well supported

Tentative conclusion: a good schema will have the following attributes:

References

ILRT RDF Thesaurus Specification

A Thesaurus interchange format for the semantic web (MMW)

GEM schema

CERES schema

DRC CALL ontology

FAO Agrovoc/Kaon thesaurus

ETB Schema

Soergel critique of standards

Soergel XML thes spec

MeSH XML format

Doerr interthesaurus mapping paper

ISO 2788

ISO 5964

Notes

N1. Explanation of schema summary tables

The classes of the schema are in the left column. If a class is indented relative to the one above, it is a subclass of the above class.

The properties of the schema are in the middle column. The properties are grouped according to their domain (i.e. all properties of a class are listed in the rows following it), and the range of each property is given beside it. If a property is indented relative to the one above, it is a subproperty of the above property.

Appendices: schemas and sample data

A1. LIMBER

A2. ILRT

A3. CERES

A4. GEM

A5. DRC

A6. FAO

A7. ETB