This document is a review and discussion of existing work on RDF representations for thesaurus data. Seven schemas are summarised as tables and diagrams. Common themes are highlighted and discussed.
This section describes the status of this document at the time of its publication. This is a draft document and may be updated, replaced, or obsoleted by other documents at any time. The latest status of this document series is maintained at the W3C.
This document is a public DRAFT for discussion. This document is an output of the research work of the Semantic Web Advanced Development for Europe Project, which is associated with the W3C Semantic Web Activity. This document is made available by W3C for discussion only. Publication of this document by W3C does not imply endorsement by W3C, including the Team and Membership.
Comments on this document are welcome and should be sent to the authors or to the public-esw-thes@w3.org list. An archive of this list is available at http://lists.w3.org/Archives/Public/public-esw-thes/.
This document is a review and discussion of all work to date on RDF representations for thesaurus data. Section 2 outlines seven known schemas, using tables and diagrams to illustrate their features. Section 3 highlights the major themes, and discusses them in relation to the schemas.
This discussion raises two questions:
This document does not answer these questions, but provides a description of alternative solutions proposed to date, as a basis for further work in this area.
Schema title: | Thesaurus Interchange Format for the Semantic Web |
Authors: | B.M.Matthews |
Project: | LIMBER |
Organisation: | CCLRC |
Date: | 2001 |
LIMBER schema summary table (see note 1):
| Class | Property | Range | Instance |
|---|---|---|---|
| ThesaurusObject | | | |
| Concept | | | |
| | ClassificationCode | rdfs:Literal | |
| | inLanguageOfC | LanguageCode | |
| | isIndicatedBy | Term | |
| | PreferredTerm | Term | |
| | UsedFor | Term | |
| | ConceptRelation | Concept | |
| | BroaderConcept | Concept | |
| | NarrowerConcept | Concept | |
| | TopOfHierarchy | Concept | |
| | isRelatedTo | Concept | |
| | ConceptEquivalence | Concept | |
| | ExactEquivalent | Concept | |
| | InexactEquivalent | Concept | |
| | PartialEquivalent | Concept | |
| | OneToManyEquivalent | Concept | |
| TopConcept | | | |
| Term | | | |
| ScopeNote | | | |
| | inLanguageOfSN | LanguageCode | |
| | hasTypeOf | ScopeNoteType | |
| ScopeNoteType | | | |
| | | | General |
| | | | Hierarchy |
| | | | Translation |
| | | | Editor |
| | | | History |
| LanguageCode | | | |
Schema title: | RDF Thesaurus Specification (draft) |
Authors: | Phil Cross, Dan Brickley, Traugott Koch |
Project: | Desire, Desire II |
Organisation: | ILRT, Netlab |
Date: | 2001 |
ILRT schema summary table (see note 1):
| Class | Property | Range | Instance |
|---|---|---|---|
| Concept | | | |
| | broaderConcept | Concept | |
| | relatedConcept | Concept | |
| | indicator | Term | |
| | conceptCode | rdfs:Literal | |
| | scope | ScopeNote | |
| Term | | | |
| | lang | rdfs:Literal | |
| | termUsage | termUsageValue | |
| ScopeNote | | | |
| TermUsageValue | | | |
| | | | preferred |
| | | | nonPreferred |
Schema title: | Thesaurus::RDF - The RDF Thesaurus Descriptor Standard |
Authors: | CERES |
Project: | CERES/NBII Thesaurus Partnership Project |
Organisation: | CERES, NBII |
Date: | 2000??? |
CERES schema summary table (see note 1):
| Class | Property | Range |
|---|---|---|
| Term | | |
| | HN | rdfs:Literal |
| | Source | rdfs:Literal |
| | Status | rdfs:Literal |
| Category | | |
| | descriptor | Descriptor |
| Descriptor | | |
| | SN | rdfs:Literal |
| | CN | rdfs:Literal |
| | CAT | Category |
| | TT | Descriptor |
| | BT | Descriptor |
| | RT | Descriptor |
| | NT | Descriptor |
| | LT | Descriptor |
| | UF | EntryTerm |
| EntryTerm | | |
| | USE | Descriptor |
Schema title: | Monolingual Thesauri Vocabulary |
Authors: | The GEM Consortium |
Project: | GEM |
Organisation: | ??? |
Date: | 11-04-2001 |
GEM schema summary table (see note 1):
Class | Property |
---|---|
BT | |
RT | |
NT | |
USE | |
UF | |
TT | |
HN | |
SCOPE | |
NISO-Z3919 |
The properties of this schema have been defined without a domain or range.
Schema title: | CALL Thesaurus Ontology |
Authors: | TeamXML at DRC |
Project: | CALL Thesaurus Project |
Organisation: | DRC |
Date: | 2002-02-26 |
DRC schema summary table (see note 1):
| Class | Property | Range (restriction) |
|---|---|---|
| Term | | |
| | name | xsd:String |
| | descriptorFor | Term |
| | preferredTermFor | Term |
| | USE | Term |
| | ACK | Term |
| | entryTermFor | Term |
| | UF | Term |
| | AF | Term |
| | BT | Term |
| | NT | Term |
| | RT | Term |
| CALL-Term | | |
The properties of this schema have not been defined with a domain or range. DAML restrictions have been defined which restrict the allowed values for the range of each property at the term class.
Schema title: | ??? |
Authors: | ??? |
Project: | Agrovoc Thesaurus Project??? |
Organisation: | FAO, KAON |
Date: | 2001??? |
FAO schema summary table (see note 1):
| Class | Property | Range |
|---|---|---|
| rdfs:Class | | |
| | rt | rdfs:Class |
| | uf | rdfs:Class |
| | use | rdfs:Class |
| Label | | |
| | value | rdfs:Literal |
| | inLanguage | rdfs:Resource |
| | references | rdfs:Class |
In this schema every term is modelled as an rdfs:Class. The rdfs:subClassOf property is used to indicate hierarchical relationships between terms. 'Labels' are then declared, each of which 'references' one of the previously declared 'rdfs:Class'es (i.e. the terms).
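The structure just described can be sketched as plain (subject, predicate, object) statements in Python. This is only an illustration: the `ex:` identifiers, the sample labels and the helper functions `objects` and `labels_for` are hypothetical, and only the property names `value`, `inLanguage` and `references` come from the summary table above.

```python
# (subject, predicate, object) statements; the "ex:" names are hypothetical.
triples = [
    ("ex:animals", "rdf:type", "rdfs:Class"),
    ("ex:cats", "rdf:type", "rdfs:Class"),
    ("ex:cats", "rdfs:subClassOf", "ex:animals"),  # hierarchical relationship
    ("ex:label1", "rdf:type", "Label"),
    ("ex:label1", "value", "cats"),
    ("ex:label1", "inLanguage", "ex:en"),
    ("ex:label1", "references", "ex:cats"),
    ("ex:label2", "rdf:type", "Label"),
    ("ex:label2", "value", "chats"),
    ("ex:label2", "inLanguage", "ex:fr"),
    ("ex:label2", "references", "ex:cats"),
]

def objects(subject, predicate):
    """Return the sorted objects of statements matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

def labels_for(term, language):
    """Return the values of the Labels that reference `term` in `language`."""
    labels = [s for s, p, o in triples if p == "references" and o == term]
    return sorted(v for label in labels
                  if language in objects(label, "inLanguage")
                  for v in objects(label, "value"))
```

For instance, `labels_for("ex:cats", "ex:fr")` collects the French label attached to the term, while `objects("ex:cats", "rdfs:subClassOf")` follows the hierarchy.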
Schema title: | RDF Schema declaration for European Treasury Browser Multilingual Educational Thesaurus (ETBT) version 0.4 |
Authors: | Tim Read |
Project: | ETB Thesaurus Project |
Organisation: | ETB, INDIRE |
Date: | 2001-11-15 |
ETB schema summary table (see note 1):
| Class | Property | Range |
|---|---|---|
| ETBT | | |
| Node | | |
| | ID | rdfs:Literal |
| Thes | | |
| | TmonoNodes | Tmono |
| TMono | | |
| | Lang | rdfs:Literal |
| | MTNodes | MT |
| MT | | |
| | Name | rdfs:Literal |
| | No | rdfs:Literal |
| | MTNodeNodes | MTNode |
| | TopTerms | MTNode |
| MTNode | | |
| | Title | rdfs:Literal |
| | SN | rdfs:Literal |
| | Type | rdfs:Literal |
| | Date | rdfs:Literal |
| | PN | rdfs:Literal |
| | HN | rdfs:Literal |
| | MTn | MT |
| | TT | MTNode |
| | RT | MTNode |
| | UF | MTNode |
| | USE | UNode |
| | BT | RNode |
| | NT | RNode |
| | RBT | RNode |
| | RNT | RNode |
| | EN | ENode |
| ENode | | |
| | da | MTNode |
| | de | MTNode |
| | en | MTNode |
| | he | MTNode |
| | el | MTNode |
| | es | MTNode |
| | fi | MTNode |
| | fr | MTNode |
| | it | MTNode |
| | nl | MTNode |
| | pt | MTNode |
| | sv | MTNode |
| UNode | | |
| | UType | rdfs:Literal |
| | UN | MTNode |
| RNode | | |
| | Relation | rdfs:Literal |
| | RN | MTNode |
In the term-based view, a thesaurus is a collection of terms. Terms are the only type of entity considered. Terms may be related to other terms.
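A minimal sketch of the term-based view, assuming a thesaurus can be held as a set of (term, relation, term) statements. The sample terms and the helper `related` are hypothetical; the relation codes BT, NT, UF and USE are the usual thesaurus abbreviations.

```python
# Term-based view: terms are the only entities, and every statement
# links one term to another. All sample data is hypothetical.
statements = {
    ("cats", "BT", "mammals"),    # broader term
    ("mammals", "NT", "cats"),    # narrower term
    ("cats", "UF", "felines"),    # used for (non-preferred term)
    ("felines", "USE", "cats"),   # pointer to the preferred term
}

def related(term, relation):
    """Return the sorted terms linked from `term` by `relation`."""
    return sorted(o for s, p, o in statements if s == term and p == relation)
```

Note that preferred and non-preferred terms live in the same space here: `felines` is just another term that happens to carry a USE link.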
******Image: Term-based data model******
A traditional print thesaurus consists of data such as:
******Extract from a term-based thesaurus***********
In the concept-based view, a thesaurus consists of two types of entity: concepts and terms. Terms are labels for concepts. A concept is defined as a unit of thought, something which exists in the mind of a person. Relationships such as 'broader', 'narrower' and 'related' are concept-to-concept relationships; they convey conceptual information, that is, information about meaning and about the structure of the concept space being described. Term-to-term relationships convey purely linguistic (terminological, lexical) information; for example, a term can be an 'abbreviation-for' another term. Concept-to-term relationships convey information about how a concept may be indicated (labelled) by various terms. Term-to-concept relationships convey information about the concepts (meanings) implied by a term.
******Image: Concept-based data model******
In this view, people use terms (labels) to refer to (point to, indicate) concepts. If two people use the same term to point to the same idea (conceptual structure) they will understand each other.
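The separation of concepts from their labels can be sketched in Python as follows. This is an illustrative model only, not the data model of any of the reviewed schemas: the `Concept` class, its field names, the sample data and both helper functions are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A unit of thought, identified independently of any label."""
    id: str
    preferred: str                                     # preferred term (label)
    alternatives: list = field(default_factory=list)   # non-preferred terms
    broader: list = field(default_factory=list)        # ids of broader concepts

concepts = {
    "c1": Concept("c1", "mammals"),
    "c2": Concept("c2", "cats", alternatives=["felines"], broader=["c1"]),
}

def concepts_for(term):
    """Term-to-concept lookup: ids of the concepts a term may indicate."""
    return sorted(c.id for c in concepts.values()
                  if term == c.preferred or term in c.alternatives)

def broader_terms(term):
    """Follow concept-to-concept links, then return the preferred labels."""
    return sorted(concepts[b].preferred
                  for cid in concepts_for(term)
                  for b in concepts[cid].broader)
```

The 'broader' link is stored once, between concepts, and is reachable from any of the terms that label the narrower concept.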
Some example data in the concept-based view:
******Extract from a concept-based thesaurus***********
The ILRT and LIMBER schemas follow the concept-based model; the GEM, CERES and DRC schemas are term-based. The ETB and FAO schemas are harder to classify, and are discussed further below.
******Comparison with XML formats******
The concept-based model is possibly a more precise description of the information contained in a thesaurus. It is explicit about the fact that a thesaurus consists of two distinct types of information, conceptual and lexical. It has been argued that distinguishing between these types of information improves clarity, and that failing to do so creates confusion (see for example [soergel ref]).
The term-based model translates to a more compact schema and data format. It is also the traditional approach, and more familiar to existing users of thesauri.
In some thesauri terms or concepts are organised into fundamental categories, or facets. For example:
********Example of faceted data***********
Other criteria may also be used to group concepts and terms. For example:
********Example of grouped data***********
The CERES and ETB schemas support categories for terms; the others do not. None of the schemas allows other grouping structures to be represented.
******Comparison with XML formats******
Applying faceted classification and other grouping methods in a thesaurus introduces more useful information about the structure of the conceptual space being defined. A schema which supports these structures captures this information; in a schema which does not, this information is lost.
Although the ISO 2788:1986 and ANSI/NISO Z39.19 standards define a fixed set of term-to-term relationships, in practice these recommendations are implemented in a great variety of ways [ref???]. In some thesauri, for example, the 'broader-term' relationship strictly implies class subsumption (an is-a relationship); in others it is very fuzzy, and can imply is-a, instance-of, part-of, geographical part-of etc. In some thesauri the part-of relationship is modelled as BT/NT, in others as RT. Other thesauri use a more precisely defined set of relationships: BTG, BTI and BTP. Other relationships are also found, for example RBT and RNT.
There are two points here. First, some thesauri are not precise in defining the meaning of the relationships they use, so there is some ambiguity. If a schema does not in turn precisely define the intended meaning of its relationships, errors of meaning may be introduced. The extent to which a schema defines the meaning of a relationship therefore reflects the extent to which errors of ambiguity and meaning are eliminated.
Secondly, a schema that supports customisation or extension of its relationship set can accommodate more thesauri than one which offers only a fixed set of relationships. Given the variability found, it has been argued that such flexibility is an essential feature of a representation format, although this discussion was in relation to XML formats (see soergel spec).
The LIMBER schema supports an extensible relationship set by defining the high-level properties 'ConceptRelation', 'ConceptEquivalence' and 'isIndicatedBy'. All other relationships are subproperties of these three. Thus there is room for a certain amount of customisation and extension.
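The following Python sketch shows how sub-property declarations let a generic query pick up both the predefined relationships and a locally added one. The property names come from the LIMBER summary table, but the exact parent of each property is an assumption (the table's indentation is not reproduced here), and `PartOf`, the sample statements and all function names are hypothetical.

```python
# Assumed sub-property links; "PartOf" is a hypothetical local extension.
SUB_PROPERTY_OF = {
    "BroaderConcept": "ConceptRelation",
    "NarrowerConcept": "ConceptRelation",
    "isRelatedTo": "ConceptRelation",
    "ExactEquivalent": "ConceptEquivalence",
    "PartOf": "ConceptRelation",
}

def super_properties(prop):
    """Return `prop` followed by all of its ancestor properties."""
    chain = [prop]
    while chain[-1] in SUB_PROPERTY_OF:
        chain.append(SUB_PROPERTY_OF[chain[-1]])
    return chain

# Hypothetical data: one predefined relationship, one custom relationship.
statements = [("c2", "BroaderConcept", "c1"), ("c3", "PartOf", "c1")]

def query(prop):
    """Return (subject, object) pairs whose property is `prop` or a sub-property of it."""
    return [(s, o) for s, p, o in statements if prop in super_properties(p)]
```

An application that only knows about 'ConceptRelation' still sees the custom `PartOf` statement, which is the kind of extension without loss of interoperability that the sub-property mechanism allows.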
*************image: Limber extensible relationship set****************
The ETB schema, by modelling every relationship as a node in the RDF graph, suggests a mechanism which allows flexibility. If a relationship is a node, the node can have a property whose value defines the type of that relationship. However, the ETB schema does not in fact allow flexibility in this way, as the properties linking a Term (MTNode) to a Relationship Node (RNode) all imply fixed relationship types. All the other schemas implement a fixed set of relationships, with no mechanism for extension or customisation.
*************image: node based possibility for relationships type as data value**************
**************Comparison with XML formats****************
The Soergel XML Spec implements relationship types as data values, as does the MeSH XML format. Others???
Does the schema support non-exact equivalences?
The LIMBER, ILRT, FAO and ETB schemas support multilingual data. Each of these schemas models multilingual data in a different way. The two concept-based schemas (LIMBER, ILRT) are easiest to compare. The ILRT schema allows terms to have a language property. The LIMBER schema allows concepts to have a language value. In doing so, the LIMBER schema models all concepts as being embedded in a particular language; there can be no language-independent concepts. Multilingual relationships are modelled using the equivalence properties (exact, inexact, partial, one-to-many) between concepts. In the ILRT approach, a concept is defined, then alternative terms from different languages may be attached to it. To find a language equivalent for a term, find the concept it implies, and then find that concept's alternative terms.
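The ILRT-style lookup of a language equivalent can be sketched in Python as a two-step search: from term to concept, then from concept to terms in the target language. The tuple layout, the sample data and the function name `equivalents` are hypothetical and are not taken from the ILRT schema itself.

```python
# Each term carries a language tag and indicates a language-independent
# concept: (text, language, concept id). Sample data is hypothetical.
terms = [
    ("cat", "en", "c1"),
    ("chat", "fr", "c1"),
    ("Katze", "de", "c1"),
    ("dog", "en", "c2"),
    ("chien", "fr", "c2"),
]

def equivalents(text, lang, target_lang):
    """Find the concepts a term indicates, then their terms in target_lang."""
    concept_ids = {c for t, l, c in terms if t == text and l == lang}
    return sorted(t for t, l, c in terms
                  if c in concept_ids and l == target_lang)
```

Because the equivalence is implied by sharing a concept, no explicit term-to-term or concept-to-concept equivalence statement is needed, but only exact equivalence can be expressed this way.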
The strength of the LIMBER approach is that it can support non-exact mappings between concepts in different languages. This is highly desirable in several thesauri [****examples of inexact mappings******]. However, it does not allow for language-independent concepts [*******examples*******]. For example, the HPMulti thesaurus contains terms with only exact equivalents. In this case, two identical concepts must be separately defined, and an exact equivalence statement made between them. In a thesaurus with only exact equivalence relations, this is a waste of space and effort. The ILRT schema, by supporting only language-independent concepts, is more efficient in this kind of scenario. However, it cannot support the non-exact mappings found in e.g.*********e.g.*******.
The FAO schema allows Labels to have a language property. Potentially, several labels from different languages could 'reference' the same term, so this is similar to the ILRT approach.
The ETB schema maps terms to equivalents in other languages via an 'ENode' node in the graph. Again, only exact equivalence relationships are allowed here.
Does the schema allow terms/concepts from different thesauri to be mapped to each other?
Only LIMBER has an explicit mechanism for this. Here the equivalence properties used for multilingual mappings may also be used for inter-thesaurus mappings, which allows non-exact mappings. No other schema has an explicit mechanism for this; if mappings were made, they would have to use RT, BT and NT.
Does the schema model the relationships between thesaurus entities as an arc in the RDF graph, or as a node?
*************Image - arc approach****************
************Image - node approach****************
The ETB schema takes the node approach. All the others model relationships as arcs.
The arc approach is the most intuitive, and also the most compact, with each relationship using a single statement (path length of 1). The node approach offers the possibility of typing relationships through the value of a property: there could be generic relationship nodes, with the type of relationship specified by the value of a property of the node (see above). It could also support extensibility through a class hierarchy of relationship node types. However, the node approach is less compact, with each relationship requiring two statements (path length of 2).
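The difference in path length can be made concrete with a small Python sketch that stores the same 'broader' relationship both ways. The property names `hasRelation`, `target` and `type` are hypothetical placeholders rather than ETB's own property names, and the sample terms are invented.

```python
# Arc approach: one statement per relationship (path length 1).
arc_graph = [("cats", "BT", "mammals")]

# Node approach: the relationship is itself a node, so two statements are
# needed to connect the terms; a third records the relationship type as data.
node_graph = [
    ("cats", "hasRelation", "_:r1"),
    ("_:r1", "target", "mammals"),
    ("_:r1", "type", "BT"),
]

def broader_arc(term):
    """Single hop: read the object of the BT arc."""
    return [o for s, p, o in arc_graph if s == term and p == "BT"]

def broader_node(term):
    """Two hops: term -> relationship node -> target term."""
    relations = [o for s, p, o in node_graph
                 if s == term and p == "hasRelation"]
    typed = [r for r in relations if (r, "type", "BT") in node_graph]
    return [o for s, p, o in node_graph if s in typed and p == "target"]
```

Both functions return the same answer, but in the node version a new relationship type would only need a new `type` value rather than a new property.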
The FAO Agrovoc schema models every term as an rdfs:Class. In so doing, it reuses the rdfs:subClassOf property to declare hierarchical relationships in the thesaurus.
Reusing properties from other schemas promotes interoperability. It also gives a clue as to how a thesaurus may be related or migrated to an ontology or other conceptual system. However, there is a danger of semantic inaccuracy and ambiguity. First, in many thesauri the terms or concepts are not in fact classes, so declaring them to be of type rdfs:Class would be semantically inaccurate. Secondly, the rdfs:subClassOf property implies a specific meaning, that of class subsumption. However, as stated above, the hierarchical relations of many thesauri have a different meaning. Therefore, compressing all hierarchical relationships into the rdfs:subClassOf property would create a great deal of inaccurate information.
Does the schema use RDFS to define the data model, or DAML, or OWL?
The DRC schema is defined as a DAML ontology. All the others are defined as RDF Schema.
Using DAML or OWL allows the definition of specific constraints on the data model, such as [*****example******]. Thus there is the possibility of more extensive validation of data. There is also the possibility of using generic DAML or OWL components to infer things such as the transitivity of a subsumption hierarchy. Using RDFS one would, for example, have to state the exact equivalence between a term and every other term in the other languages. Using DAML one could declare the property as transitive, and connect the new term to any one of the exact-equivalent set. RDFS is more flexible, and this may be an advantage [*****when???******]. Also, tool support is perhaps more mature.
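The saving offered by such inference can be sketched in Python. The function below expands a few stated equivalences into every pair they imply, assuming the equivalence property is treated as both symmetric and transitive (the text above mentions only transitivity, so symmetry is an added assumption). The function name and the sample terms are hypothetical, and this is not how any particular DAML or OWL reasoner is implemented.

```python
from collections import defaultdict

def equivalence_closure(pairs):
    """Expand stated pairs into every ordered pair implied by
    symmetry and transitivity."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    closure = set()
    for start in list(neighbours):
        # Depth-first walk collects everything reachable from `start`.
        seen, stack = {start}, [start]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        closure |= {(start, other) for other in seen if other != start}
    return closure

# Three stated links are enough to connect four terms.
stated = [("cat@en", "chat@fr"), ("chat@fr", "Katze@de"), ("gato@es", "cat@en")]
inferred = equivalence_closure(stated)
```

Without inference, each of the ordered pairs in `inferred` would have to be stated explicitly; with it, a new term needs only one link to any member of the existing set.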
Several of the schemas introduce their own typing mechanism. The ILRT schema has the 'termUsage' property, whose value specifies whether a term is preferred or not. The LIMBER schema has the 'hasTypeOf' property to define the type of a scope note. The ETB schema has three independent typing properties, one for each of the MTNode, UNode and RNode classes.
RDF has an inherent typing mechanism. Reusing this mechanism promotes interoperability and processing by generic tools. Introducing new typing mechanisms does the opposite.
The possibility of error can be reduced by constraining the value of such a property not to a literal but to a predefined typed resource. A criticism of the ETB schema is that it defines types in relation to literal values.
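The benefit of predefined values over free literals can be illustrated by analogy in Python, where an enumeration plays the role of a closed set of typed resources. The class and function names are hypothetical; the two values mirror the 'preferred' and 'nonPreferred' instances listed in the ILRT summary table.

```python
from enum import Enum

class TermUsage(Enum):
    """Closed set of allowed values, analogous to predefined resources."""
    PREFERRED = "preferred"
    NON_PREFERRED = "nonPreferred"

def is_valid_usage(value):
    """True only if `value` names one of the predefined usage values."""
    try:
        TermUsage(value)
        return True
    except ValueError:
        return False
```

With a free literal, a misspelling such as "Preferred" would be stored silently; against the closed set it is rejected at the point of entry.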
Seven RDF schemas for thesaurus data are reviewed. The features of these schemas are discussed in relation to nine themes. Each of these themes highlights alternative methods of implementation, and the pros and cons of each alternative are discussed. The main elements of this discussion are summarised in the tables below.
Concept- or Term- based? | Concept-based | Term-based |
---|---|---|
Schema | ILRT, LIMBER, (FAO?) | GEM, CERES, DRC, ETB |
Pros | Semantically precise. | Compact, familiar |
Cons | Not compact | Semantically confusing |

Facets or categories? | Yes | No |
---|---|---|
Schema | | |
Pros | Captures more information | Reduced, simple |
Cons | Not compact | Semantically confusing |

Extensible/customisable? | Yes | No |
---|---|---|
Schema | | |
Pros | Less ambiguity. Flexibility. | Familiar, simple |
Cons | Complex | Constricting. Possibility for error. |

Non-exact equivalence? | Supported | Unsupported |
---|---|---|
Schema | | |
Pros | Semantically precise. | Compact, simple |
Cons | Complex | Loss of information, precision |

Inter-thesaurus mapping | Supported | Unsupported |
---|---|---|
Schema | | |
Pros | Interoperability, re-use | |
Cons | | Not interoperable |

Relationships nodes or arcs? | Arcs | Nodes |
---|---|---|
Schema | LIMBER, ILRT, CERES, GEM, FAO, DRC | ETB |
Pros | Compact, intuitive | Extensible, flexible |
Cons | Less extensible | Longer, complex |

Subclass approach | Yes | No |
---|---|---|
Schema | | |
Pros | Interoperability, re-use | Precise |
Cons | Possibility for loss of precision, error | Harder to migrate |

RDFS, DAML | RDFS | DAML |
---|---|---|
Schema | | |
Pros | Simple, supported | Validation, inference |
Cons | Less validation | Complex, less well supported |
Tentative conclusion: a good schema will have the following attributes:
ILRT RDF Thesaurus Specification
A Thesaurus interchange format for the semantic web (MMW)
GEM schema
CERES schema
DRC CALL ontology
FAO Agrovoc/Kaon thesaurus
ETB Schema
Soergel critique of standards
Soergel XML thes spec
MeSH XML format
Doerr interthesaurus mapping paper
ISO 2788
ISO 5964
The classes of the schema are in the left column. If a class is indented relative to the one above, it is a subclass of the above class.
The properties of the schema are in the middle column. The properties are grouped according to their domain (i.e. all properties of a class appear next to it). If a property is indented relative to the one above, it is a subproperty of the above property.