
This page describes some issues in the RDF representation of various patterns found in standard (ISO 2788) thesauri and some patterns found in non-standard thesauri.

Background

[AlistairMiles: this section is based on the author's extremely limited knowledge of the history of thesaurus development, and may be wholly inaccurate. For a much better account see this email from Stella Dextre-Clarke.]

Here, "standard thesaurus" refers to a thesaurus conforming with the ISO 2788:1986 standard.

A standard thesaurus is built from a set of descriptors and a set of non-descriptors. Descriptors are also known as preferred terms and non-descriptors are also known as non-preferred terms.

For example, a thesaurus might consist of the descriptors "Animals" and "Plants" and the non-descriptors "Fauna" and "Flora".

A thesaurus is fundamentally an information retrieval tool. However, thesauri were not invented for use within computerised retrieval systems. Traditionally, thesauri are used as controlled vocabularies for the creation of paper-based card catalogues.

For example, a document archive might include two works, titled "Botany for Beginners" and "Intermediate Zoology". Simple card entries for these books might look like the following (each table is meant to depict a printed card):

Title:    Botany for Beginners
Subject:  Plants

Title:    Intermediate Zoology
Subject:  Animals

Typically, card entries would also have other fields, such as author.

The purpose of these card catalogues was to enable the construction of various printed indexes. These indexes would usually be arranged alphabetically.

For example, the alphabetical title index for our example archive would look something like the following:

A
B
  Botany for Beginners [location X]
C
D
E
F
G
H
I
  Intermediate Zoology [location Y]
J
[...]

Of course, an archive might contain a lot more than two works, and so the index might also be large.

A printed title index, arranged alphabetically, would allow a person who knew the title of a work to find further information about that work, for example its physical location within the archive, relatively efficiently (i.e. without having to search through the entire list of titles).

Another type of index is a subject index. This is an alphabetical index constructed from the values entered in the "Subject" field on the cards in the catalogue.

For example, a subject index for our example archive would look something like the following:

Animals
  Intermediate Zoology [location Y]
Plants
  Botany for Beginners [location X]
[...]

This type of index is constructed to aid people searching within general subject areas, without foreknowledge of specific titles.
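The construction of a subject index from a set of cards can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names and the function names are assumptions made for the example, and the location values are placeholders.

```python
from collections import defaultdict

# Hypothetical card records mirroring the two example cards.
cards = [
    {"title": "Botany for Beginners", "subject": "Plants", "location": "X"},
    {"title": "Intermediate Zoology", "subject": "Animals", "location": "Y"},
]

def build_subject_index(cards):
    """Group (title, location) pairs under each Subject value, alphabetically."""
    index = defaultdict(list)
    for card in cards:
        index[card["subject"]].append((card["title"], card["location"]))
    # Sort the headings, and the entries beneath each heading.
    return {subject: sorted(entries) for subject, entries in sorted(index.items())}

def format_index(index):
    """Render the index in the indented layout used in the examples."""
    lines = []
    for subject, entries in index.items():
        lines.append(subject)
        for title, location in entries:
            lines.append(f"  {title} [location {location}]")
    return "\n".join(lines)

print(format_index(build_subject_index(cards)))
```

Each distinct Subject value becomes a heading, which is why uncontrolled values would scatter related works across the index.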

The task of filling out cards such as those illustrated above is known as indexing. The task of filling out appropriate values for the Subject field in a set of cards is known as subject indexing.

Naturally, different people use different terms to express the same or similar subject areas. If the values allowed in the Subject field on the cards were not controlled in some way, some people would enter "Fauna" while others would enter "Animals". This would lead to a subject index with a lot of redundancy, in which items that should be grouped together are spread throughout the index and are therefore hard to find. Even if the indexers all used the same terms, the searchers might naturally use different terms, and so never find the appropriate subjects in the index.

The thesaurus was invented as a solution to exactly this problem. As stated in the introduction to the new British thesaurus standard BS 8723 part 2, the purpose of a thesaurus is to guide both the indexer and the searcher to use the same term [descriptor] for the same concept.

USE X

So, in the construction of a thesaurus, all the possible search terms are first identified, in our case "Animals", "Fauna", "Plants" and "Flora". Synonym groups are then identified, for example "Animals" and "Fauna" being one group, "Plants" and "Flora" being another. Then, for each synonym group, one term is chosen as "preferred". This term may then be used to "describe" items in a catalogue, hence the name "descriptor". Other terms may not be used to "describe" items in a catalogue, hence the name "non-descriptor".

A thesaurus would then be printed, giving instructions as to which descriptors to use when indexing and searching. Links are made between descriptors and their corresponding non-descriptors, to help unfamiliar indexers or searchers find the appropriate descriptor. These links are typically displayed using the symbols "USE" and "UF". "UF" stands for "used for" and "USE" stands for "use". For example, our thesaurus would be printed as follows:

Animals
  UF Fauna

Fauna
  USE Animals

Flora
  USE Plants

Plants
  UF Flora

The symbol "USE" is really an instruction: if you've found "Fauna", you should USE "Animals" instead.
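Read this way, a USE reference behaves like a lookup from a non-descriptor to its descriptor. A minimal sketch, assuming the four-term example thesaurus above (the names `USE`, `DESCRIPTORS` and `resolve` are made up for the illustration):

```python
# USE references from the example thesaurus: non-descriptor -> descriptor.
USE = {"Fauna": "Animals", "Flora": "Plants"}
DESCRIPTORS = {"Animals", "Plants"}

def resolve(term):
    """Return the descriptor an indexer or searcher should use for a term."""
    if term in DESCRIPTORS:
        return term          # already a preferred term
    if term in USE:
        return USE[term]     # follow the USE instruction
    raise KeyError(f"{term!r} is not in the thesaurus")

print(resolve("Fauna"))
```

Both the indexer and the searcher apply the same lookup, so both arrive at the same heading in the subject index.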

This pattern doesn't present any problems when it comes to representation in RDF. Each descriptor is mapped to a preferred lexical label of a concept, and each non-descriptor is mapped to an alternative lexical label of a concept, for example:

ex:a rdf:type skos:Concept;
  skos:prefLabel "Animals"@en;
  skos:altLabel "Fauna"@en.

ex:b rdf:type skos:Concept;
  skos:prefLabel "Plants"@en;
  skos:altLabel "Flora"@en.

It is easy to see that the original thesaurus layout can be generated from the RDF graph given above.
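As a rough illustration of that claim, the sketch below regenerates the printed layout from a plain-Python stand-in for the graph. It does not use an RDF library; the dictionary structure and the function name are assumptions for the example, and each concept is assumed to have exactly one preferred label.

```python
# Plain-Python stand-in for the RDF graph: one dict per skos:Concept.
concepts = [
    {"id": "ex:a", "prefLabel": "Animals", "altLabels": ["Fauna"]},
    {"id": "ex:b", "prefLabel": "Plants", "altLabels": ["Flora"]},
]

def thesaurus_layout(concepts):
    """Emit UF lines under descriptors and USE lines under non-descriptors."""
    entries = {}  # heading -> list of reference lines
    for concept in concepts:
        pref = concept["prefLabel"]
        entries.setdefault(pref, [])
        for alt in sorted(concept["altLabels"]):
            entries[pref].append(f"  UF {alt}")
            entries.setdefault(alt, []).append(f"  USE {pref}")
    # Headings are printed alphabetically, separated by blank lines.
    blocks = ["\n".join([heading] + entries[heading]) for heading in sorted(entries)]
    return "\n\n".join(blocks)

print(thesaurus_layout(concepts))
```

The generator relies on the assumption that every alternative label belongs to exactly one concept, which holds for the plain USE X pattern.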

USE X + Y

Normally in a thesaurus, every non-descriptor points to one and only one descriptor, as shown in the example above. However, the ISO 2788 standard also allows instructions of the form "USE X + Y". For example:

Road safety
  USE Road transport + Safety

Road transport
  UF+ Road safety

Safety
  UF+ Road safety

This should be read as an instruction to the indexer to enter two values into the Subject field on a card, for example:

Title:    The Highway Code
Subject:  Road transport; Safety

It should also be read as an instruction to the searcher to cross-reference the subject index entries for "Road transport" and "Safety".
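For the searcher, cross-referencing amounts to intersecting the entries found under each descriptor. A small sketch, assuming a subject index held as a mapping from descriptor to a set of titles; every title other than "The Highway Code" is invented purely for the illustration.

```python
# Hypothetical subject index: descriptor -> set of titles filed under it.
subject_index = {
    "Road transport": {"The Highway Code", "Freight Logistics"},
    "Safety": {"The Highway Code", "Safety in the Laboratory"},
}

def cross_reference(index, *descriptors):
    """Return the titles filed under every one of the given descriptors."""
    sets = [index.get(descriptor, set()) for descriptor in descriptors]
    return set.intersection(*sets) if sets else set()

# "Road safety" USE "Road transport" + "Safety"
print(cross_reference(subject_index, "Road transport", "Safety"))
```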

Note the "UF+" symbol also, indicating for example that "Safety" (plus something else) is used for "Road safety".

This pattern is part of the ISO 2788 standard, and has significant usage. For these reasons, SKOS should provide a standard way for representing it in RDF. However, mapping this pattern to an RDF representation is not straightforward.

One possibility would be to map the non-descriptor to an alternative label of two concepts, for example:

ex:c rdf:type skos:Concept;
  skos:prefLabel "Road transport"@en;
  skos:altLabel "Road safety"@en.

ex:d rdf:type skos:Concept;
  skos:prefLabel "Safety"@en;
  skos:altLabel "Road safety"@en.

Although it is possible to generate the original thesaurus layout from this graph, the fact that two concepts share an alternative label may not always indicate the USE X + Y pattern. In other words, the graph above may be ambiguous: generating the thesaurus layout requires an assumption that may not be generally applicable. This becomes apparent in the section below.

USE X OR Y

Some thesauri include instructions of the form "USE X OR Y". For example:

grinding house
  UFO grinding mill
  SN A place where material is crushed.

grindery
  UFO grinding mill
  SN A place where metal objects are sharpened.

grinding mill
  USE grinding house OR grindery
  SN Use "grinding house" for a place where material is crushed and "grindery" for a place where metal objects are sharpened.

Note also the "UFO" symbol, indicating for example that "grindery" (or something else) is used for "grinding mill".

This is a real example, taken from one of the English Heritage thesauri. However, note that this pattern is not recommended in the ISO 2788 thesaurus standard nor in BS 8723.

As with the previous section, the natural thing to do when creating an RDF representation would be to map the non-descriptor "grinding mill" to an alternative label of two concepts, e.g.:

ex:e rdf:type skos:Concept;
  skos:prefLabel "grinding house"@en;
  skos:altLabel "grinding mill"@en.

ex:f rdf:type skos:Concept;
  skos:prefLabel "grindery"@en;
  skos:altLabel "grinding mill"@en.

It is possible to generate the original thesaurus layout from this graph. However, this time a different assumption has to be made in order to generate the thesaurus layout. We have used the same RDF pattern to represent two different thesaurus patterns, i.e. we have an ambiguity.
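The ambiguity can be made concrete with a short sketch. Using a plain-Python stand-in for the two graphs (the data structures and function names are assumptions for the illustration, not part of SKOS), the only information recoverable from either graph is that one alternative label is shared by two concepts. Whether to print "+" or "OR" has to be supplied from outside the graph.

```python
from collections import defaultdict

# Stand-ins for the two RDF graphs: one dict per skos:Concept.
road = [
    {"prefLabel": "Road transport", "altLabels": ["Road safety"]},
    {"prefLabel": "Safety", "altLabels": ["Road safety"]},
]
grinding = [
    {"prefLabel": "grinding house", "altLabels": ["grinding mill"]},
    {"prefLabel": "grindery", "altLabels": ["grinding mill"]},
]

def shared_alt_labels(concepts):
    """Map each altLabel carried by more than one concept to its prefLabels."""
    by_alt = defaultdict(list)
    for concept in concepts:
        for alt in concept["altLabels"]:
            by_alt[alt].append(concept["prefLabel"])
    return {alt: sorted(prefs) for alt, prefs in by_alt.items() if len(prefs) > 1}

def render_use(alt, prefs, joiner):
    """Print a USE entry; the joiner is an assumption the graph cannot supply."""
    separator = f" {joiner} "
    return f"{alt}\n  USE {separator.join(prefs)}"

# Both graphs yield the same shape: one shared label, two descriptors.
print(shared_alt_labels(road))
print(shared_alt_labels(grinding))
print(render_use("Road safety", shared_alt_labels(road)["Road safety"], "+"))
```

Nothing in the result of `shared_alt_labels` distinguishes the two cases, so any generator must hard-code one reading or the other.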


Representing these three thesaurus patterns (USE X, USE X + Y, USE X OR Y) without ambiguity is the fundamental issue that this document raises.