SkosDev/ClassificationPubGuide

From W3C Wiki

Quick Guide to Publishing a Classification Scheme on the Semantic Web

Current version:

Previous versions:

Please see the history of this wiki page for additional changes.

Editors:

  • Jakob Voss <jakob.voss@gbv.de>, Common Library Network, Germany (GBV)

Abstract

This document briefly describes how to express the content and structure of a classification scheme, and metadata about a classification scheme, in RDF using the SKOS vocabulary. The Semantic Web, which is based on the Resource Description Framework (RDF), provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. RDF allows data to be linked to and merged with other RDF data by Semantic Web applications. Publishing classification schemes in SKOS is intended to bring the many existing classification efforts together within the framework of the Semantic Web.

Status of this Document

This document is a tentative sketch of a standard for encoding classification schemes in SKOS. The guide is mostly confined to details of application, so no fundamental changes to the current SKOS Core Working Draft [SKOS Core] are needed. Open issues and limitations that may require extensions of SKOS are collected in the chapter “Open Issues and Limitations”. This document is inspired by the “Quick Guide to Publishing a Thesaurus on the Semantic Web” by Alistair Miles [THESPUB]. Comments are very welcome. For discussion of SKOS, use the public-esw-thes@w3.org mailing list. Please get in touch especially if you find unusual classification schemes with features that are not yet mentioned in this guide.

Introduction

Activities in SKOS have so far focused mainly on thesauri. Unlike thesauri, classification schemes (referred to below as “classifications”) are familiar to many people, but often only superficially. That is why some basic features of existing classifications have not yet been specified in SKOS. This guide gives advice on how to express the content and structure of a classification, and metadata about a classification, in SKOS for the Semantic Web and for interoperability. For a general introduction to SKOS see [SKOS Core]; for further development see [SKOS RS] and [SWDWG].

The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is based on the Resource Description Framework [RDF], which provides a simple data formalism for talking about things, their properties, inter-relationships, and categories (classes). For an overview of RDF, see [RDF Concepts].

To express RDF classes and properties in the examples in this guide, the following namespace qualifiers are used:
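
As a minimal sketch, the qualifiers that appear in the examples of this guide can be declared on a wrapping rdf:RDF element. The namespace URIs below are the ones commonly published for these vocabularies and are an assumption here; the examples in this guide are fragments that omit the wrapper.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed namespace URIs for the prefixes used in this guide -->
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#"
  xmlns:map="http://www.w3.org/2004/02/skos/mapping#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/">
  <!-- skos:ConceptScheme, skos:Concept and skos:Collection elements go here -->
</rdf:RDF>
```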

Expressing a Classification in RDF

Classifying objects and ideas is a basic mental process familiar to all human beings. Classification schemes have been used for thousands of years in philosophy and in libraries. Most classifications in practical use, especially in libraries, have less strict ontological requirements than formal class hierarchies. That is why classifications should not be confused with hierarchies of classes in object-oriented programming, and why rdfs:Class and rdfs:subClassOf are not sufficient to express existing classifications for knowledge organization. Classifications can also be more complex than simple mono-hierarchical structures (also known as “taxonomies”). However, most classifications share a basic structure and components that should be expressible in RDF with SKOS.

In a classification you can distinguish a macrostructure, with classification metadata and tables that contain classes, and a microstructure that describes single classes with notations, captions, notes and relations (mostly hierarchical relations and some associative relations). Additionally, classes and relations can be grouped into arrays.

Classes

Classes of classifications are expressed as concepts in SKOS. Each class must have a unique, stable URI. It is common but not required to use HTTP URLs as identifiers. There is an intense, philosophical discussion about the circumstances under which HTTP URLs may be used on the Semantic Web, and this guide cannot give neutral advice on it. The following example encodes the category “Science” in Wikipedia's system of categories [WPCAT]:

<skos:Concept
 rdf:about="http://en.wikipedia.org/wiki/category:Science">
  <skos:prefLabel>Science</skos:prefLabel>
</skos:Concept>


Captions

The main label of a class in a classification is often called its caption. Unlike thesauri and subject headings, classifications may have multiple classes with the same caption. For instance, in DDC there are many classes that share the caption “Historical, geographic, persons treatment”.

There is no final consensus on how to encode captions. You can use either the skos:altLabel property or rdfs:label. The disadvantage of skos:altLabel is that most SKOS applications will show only the notations (which are encoded with skos:prefLabel), and that you can no longer distinguish between preferred and alternative captions. The disadvantage of rdfs:label is that it is not supported by real-world applications either. If skos:prefLabel is used for a caption, it should always take priority in determining the caption of a class.

A class can have multiple captions for different languages. Permitted language tags for the xml:lang attribute are given by [XMLLANG]. A class should not have more than one caption per language. Additional alternative labels can be expressed with skos:altLabel. The “x-notation” language subtag must be used in xml:lang for notations only, never for normal captions. The following example encodes a class of the Chinese Library Classification [CLC]:


<skos:Concept>
  <skos:prefLabel xml:lang="x-notation">F</skos:prefLabel>
  <rdfs:label xml:lang="zh">经济</rdfs:label>
  <rdfs:label xml:lang="en">Economics</rdfs:label>
</skos:Concept>


Note that skos:prefLabel could also have been used in the example instead of rdfs:label, provided the labels are unambiguous. Additional labels can be added with skos:altLabel.

Notations

A notation (also known as "class number") is an artificial, unique label that represents a class in a classification. As proposed in [ISO3166SKOS], notations in SKOS should be encoded with skos:prefLabel with an xml:lang attribute value that uses the "x-notation" language subtag. You can use multiple notations per class if you assign different languages, for instance "x-notation-1" and "x-notation-2". Notations may be built according to a special grammar, for instance one that defines the permitted characters (pure notation vs. mixed notation), and they may have a mnemonic quality. Coordinate and subordinate relationships of classes may also be reflected in the structure of a notation, but relations must always be encoded explicitly in RDF with SKOS.

The following examples express a class of the 1998 ACM Computing Classification System [ACM] and a coordinated class of the Colon Classification [CC]:


<skos:Concept>
  <skos:prefLabel xml:lang="x-notation">H.3.1</skos:prefLabel>
  <skos:altLabel xml:lang="en">Content Analysis and Indexing</skos:altLabel>
</skos:Concept>

<skos:Concept>
  <skos:prefLabel xml:lang="x-notation">C3:11;5</skos:prefLabel>
  <skos:altLabel xml:lang="en">Velocity of sound in water</skos:altLabel>
</skos:Concept>
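
A class with more than one notation could be sketched as follows. The identifier, both notations and the caption are invented for illustration and do not come from any real classification; the namespace URIs are assumed.

```xml
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <!-- Hypothetical class carrying two notations, told apart by language subtag -->
  <skos:Concept rdf:about="#example-class">
    <skos:prefLabel xml:lang="x-notation-1">A1</skos:prefLabel>
    <skos:prefLabel xml:lang="x-notation-2">001</skos:prefLabel>
    <skos:altLabel xml:lang="en">Example caption</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
```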


Notes

The meaning of a concept can be guessed from its labels, usage and related terms, but detailed notes and definitions provide more information. SKOS defines the skos:note property and some sub-properties for this purpose. The question of how to uniformly express markup, and especially links, in scope notes in SKOS has not been answered yet. XHTML can probably be used for extended notes. These XHTML fragments may also contain semantic links with RDFa [RDFa]. The following example is experimental:


<skos:Concept rdf:about="http://www.ams.org/msc/76-xx.html">
  <skos:prefLabel xml:lang="x-notation">76-xx</skos:prefLabel>
  <skos:note rdf:parseType="Resource">
    <rdf:value rdf:parseType="Literal">
      <p xmlns="http://www.w3.org/1999/xhtml" xmlns:dcterms="http://purl.org/dc/terms/">
        For general continuum mechanics, see 
        <a href="http://www.ams.org/msc/74Axx.html" rel="dcterms:references">74Axx</a>,
        or other parts of
        <a href="http://www.ams.org/msc/74-xx.html" rel="dcterms:references">74-xx</a>
      </p>
    </rdf:value>
  </skos:note>
</skos:Concept>


Hierarchical Relations

To assert that one concept is broader in meaning (i.e. more general) than another, where the scope (meaning) of one falls completely within the scope of the other, use the skos:broader property. To assert the inverse, that one concept is narrower in meaning (i.e. more specific) than another, use the skos:narrower property. The properties skos:broader and skos:narrower are each other's inverse. The top-level concepts of a concept scheme (concepts with no broader concepts) should be connected from skos:ConceptScheme with the skos:hasTopConcept property.
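
A minimal sketch of these properties, using the ACM notation H.3.1 from the notations section together with its parent notations H.3 and H. The fragment identifiers (#acm1998, #acm-H and so on) are hypothetical and the namespace URIs are assumed.

```xml
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <!-- The scheme points at its top-level concepts -->
  <skos:ConceptScheme rdf:about="#acm1998">
    <skos:hasTopConcept rdf:resource="#acm-H"/>
  </skos:ConceptScheme>

  <skos:Concept rdf:about="#acm-H">
    <skos:prefLabel xml:lang="x-notation">H</skos:prefLabel>
    <skos:narrower rdf:resource="#acm-H.3"/>
  </skos:Concept>

  <!-- Both directions are stated so that no inference is required -->
  <skos:Concept rdf:about="#acm-H.3">
    <skos:prefLabel xml:lang="x-notation">H.3</skos:prefLabel>
    <skos:broader rdf:resource="#acm-H"/>
    <skos:narrower rdf:resource="#acm-H.3.1"/>
  </skos:Concept>

  <skos:Concept rdf:about="#acm-H.3.1">
    <skos:prefLabel xml:lang="x-notation">H.3.1</skos:prefLabel>
    <skos:altLabel xml:lang="en">Content Analysis and Indexing</skos:altLabel>
    <skos:broader rdf:resource="#acm-H.3"/>
  </skos:Concept>
</rdf:RDF>
```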

Symmetric Associative Relations

To assert a symmetric associative relationship between two concepts, use the skos:related property. For unidirectional associative relationships, rdfs:seeAlso can be used (see also the following section).
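
A minimal sketch of a symmetric relation, stated in both directions so that applications without inference still see both links. The identifiers #birds and #ornithology are hypothetical and the namespace URIs are assumed.

```xml
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="#birds">
    <skos:prefLabel xml:lang="en">Birds</skos:prefLabel>
    <skos:related rdf:resource="#ornithology"/>
  </skos:Concept>
  <skos:Concept rdf:about="#ornithology">
    <skos:prefLabel xml:lang="en">Ornithology</skos:prefLabel>
    <skos:related rdf:resource="#birds"/>
  </skos:Concept>
</rdf:RDF>
```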

Non-symmetric Associative Relations

In thesauri, associative relations are mostly symmetric, but in classifications there are also unidirectional links. To assert a unidirectional associative relationship between two concepts, use dcterms:references (an element refinement of dc:relation), for example (taken from the Mathematics Subject Classification [MSC]):


<skos:Concept rdf:about="http://www.ams.org/msc/76-xx.html">
  <skos:prefLabel xml:lang="x-notation">76-xx</skos:prefLabel>
  <skos:note>For general continuum mechanics, see 74Axx,
             or other parts of 74-xx</skos:note>
  <dcterms:references
   rdf:resource="http://www.ams.org/msc/74Axx.html"/>
  <dcterms:references
   rdf:resource="http://www.ams.org/msc/74-xx.html"/>
</skos:Concept>


Applications that visualize SKOS classifications should automatically link notations and captions of other concepts that occur in literal documentation, if there is a relation to the linked concept. In the example above, the notations 74Axx and 74-xx would be linked because there is a relation from the concept with notation 76-xx to both of them. This way, textual hyperlinks between concepts are possible (see also the section about notes above).

Arrays

Narrower concepts can be grouped by different characteristics into arrays with a common node label. SKOS allows grouping with the skos:Collection class and skos:member property. To assign a lexical label to a collection, the rdfs:label property can be used in the same way as for captions of classes. The following example is adapted from [WILLPOWER].

As text:

vehicles
  <vehicles by number of wheels>
    monocycles
    bicycles
    tricycles
    four-wheeled_vehicles
  <vehicles by motive power>
    mechanically powered vehicles
    human powered vehicles
    hybrid powered vehicles


In SKOS RDF/XML:


<skos:Concept rdf:about="#vehicles">
  <skos:prefLabel>vehicles</skos:prefLabel>
  <skos:narrower rdf:resource="#array1"/>
  <skos:narrower rdf:resource="#array2"/>
</skos:Concept>

<skos:Collection rdf:about="#array1">
  <rdfs:label>vehicles by number of wheels</rdfs:label>
  <skos:member rdf:resource="#monocycles"/>
  <skos:member rdf:resource="#bicycles"/>
  <skos:member rdf:resource="#tricycles"/>
  <skos:member rdf:resource="#four-wheeled_vehicles"/>
  <skos:broader rdf:resource="#vehicles"/>
</skos:Collection>

<skos:Collection rdf:about="#array2">
  <rdfs:label>vehicles by motive power</rdfs:label>
  <skos:member rdf:resource="#mechanically_powered_vehicles"/>
  <skos:member rdf:resource="#human_powered_vehicles"/>
  <skos:member rdf:resource="#hybrid_powered_vehicles"/>
  <skos:broader rdf:resource="#vehicles"/>
</skos:Collection>


Collections can also be nested. For instance, the DDC contains so-called centered entries that span several numbers. Here is an outline of classes 580 and 590, with centered entries in angle brackets:


500 Science
  <579-590 Natural history of specific kinds of organisms>
    <580-590 Plants and animals>
      580 Plants (Botany) 
      590 Animals (Zoology) 
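
In SKOS RDF/XML, the nesting in this outline could be sketched with one collection as a member of another. The fragment identifiers are hypothetical, the namespace URIs are assumed, and giving the centered entries a notation follows the convention used for tables in the next section.

```xml
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <!-- Outer centered entry: contains the inner centered entry -->
  <skos:Collection rdf:about="#ddc-579-590">
    <skos:prefLabel xml:lang="x-notation">579-590</skos:prefLabel>
    <rdfs:label xml:lang="en">Natural history of specific kinds of organisms</rdfs:label>
    <skos:member rdf:resource="#ddc-580-590"/>
  </skos:Collection>

  <!-- Inner centered entry: contains the two classes -->
  <skos:Collection rdf:about="#ddc-580-590">
    <skos:prefLabel xml:lang="x-notation">580-590</skos:prefLabel>
    <rdfs:label xml:lang="en">Plants and animals</rdfs:label>
    <skos:member rdf:resource="#ddc-580"/>
    <skos:member rdf:resource="#ddc-590"/>
  </skos:Collection>

  <skos:Concept rdf:about="#ddc-580">
    <skos:prefLabel xml:lang="x-notation">580</skos:prefLabel>
    <rdfs:label xml:lang="en">Plants (Botany)</rdfs:label>
  </skos:Concept>

  <skos:Concept rdf:about="#ddc-590">
    <skos:prefLabel xml:lang="x-notation">590</skos:prefLabel>
    <rdfs:label xml:lang="en">Animals (Zoology)</rdfs:label>
  </skos:Concept>
</rdf:RDF>
```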


Tables

Tables (or sections) group classes together under a common label. skos:Collection is used for tables. Tables may also have notations and notes, but they cannot be used as concepts for indexing. The following example contains a small subset of the International Patent Classification [IPC]:


<skos:ConceptScheme>
  <skos:hasTopConcept rdf:resource="#section-A"/>
  <skos:hasTopConcept rdf:resource="#section-B"/>
  <!-- ... -->
</skos:ConceptScheme>

<skos:Collection rdf:about="#section-A">
  <skos:prefLabel xml:lang="x-notation">A</skos:prefLabel>
  <rdfs:label>HUMAN NECESSITIES</rdfs:label>
  <skos:member rdf:resource="#section-A-first"/>
  <skos:member rdf:resource="#section-A-second"/>
  <!-- ... -->
</skos:Collection>

<skos:Collection rdf:about="#section-A-first">
  <rdfs:label>AGRICULTURE</rdfs:label>
  <skos:member rdf:resource="#ipc-A01"/>
</skos:Collection>

<skos:Concept rdf:about="#ipc-A01">
  <skos:prefLabel xml:lang="x-notation">A01</skos:prefLabel>
  <rdfs:label>AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY;
              HUNTING; TRAPPING; FISHING</rdfs:label>
</skos:Concept> 


Expressing Classification Metadata in RDF

RDF can also be used to express properties of a classification, such as its title, description, date of modification and so on. The DCMI Metadata Terms [DCMI Terms] include a number of useful properties for this purpose.
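
A minimal sketch of such metadata attached to a skos:ConceptScheme. The scheme identifier, title, description and creator are invented for illustration, and the namespace URIs are assumed.

```xml
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/">
  <!-- Hypothetical classification described with Dublin Core properties -->
  <skos:ConceptScheme rdf:about="#example-classification">
    <dc:title xml:lang="en">Example Classification</dc:title>
    <dc:description xml:lang="en">A hypothetical classification used for illustration.</dc:description>
    <dc:creator>Example Organization</dc:creator>
    <dcterms:modified>2006-07-20</dcterms:modified>
  </skos:ConceptScheme>
</rdf:RDF>
```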

Open Issues and Limitations

SKOS is still a work in progress. The following features are not supported yet, but there are proposals for discussion on how symbolic hierarchical relationships, facets, and coordination should be encoded.

Symbolic hierarchical relations

In some classifications there are symbolic (or secondary) hierarchical relations that allow multiple inheritance while keeping a main monohierarchical structure. This principle is useful, for instance, for library shelving and is similar to symbolic links in file systems. A popular example is the Open Directory Project [ODP]. For instance, in DMOZ there is the class “Health” (http://www.dmoz.org/Health/), which contains the class “Healthcare Industry” (shown as "Healthcare Industry@"), which is equivalent to “Business/Healthcare” (http://www.dmoz.org/Business/Healthcare/).

In ODP's RDF (http://www.dmoz.org/rdf.html) this is encoded in the following way:


<Topic r:id="Top/Health">
  <d:Title>Health</d:Title>
  <symbolic1
   r:resource="Healthcare_Industry:Top/Business/Healthcare"/>
</Topic>
<Alias r:id="Healthcare_Industry:Top/Business/Healthcare">
  <d:Title>Healthcare_Industry</d:Title>
  <Target r:resource="Top/Business/Healthcare"/>
</Alias>
<Topic r:id="Top/Business/Healthcare">
  <d:Title>Healthcare</d:Title>
</Topic>


Up to now there is no way to encode such structures in SKOS Core (whether such symbolic relationships are bad design is outside the scope of this guide). Here is a proposal to enhance SKOS with a new class skos:virtualConcept, combined with the map:exactMatch property from the experimental SKOS Mapping vocabulary [1]:


<skos:Concept rdf:about="Top/Health">
  <skos:prefLabel>Health</skos:prefLabel>
  <skos:narrower rdf:resource="Healthcare_Industry"/>
</skos:Concept>
<skos:virtualConcept
 rdf:about="Healthcare_Industry">
  <rdfs:label>Healthcare Industry</rdfs:label>
  <map:exactMatch rdf:resource="Top/Business/Healthcare"/>
</skos:virtualConcept>
<skos:Concept rdf:about="Top/Business/Healthcare">
  <rdfs:label>Healthcare</rdfs:label>
</skos:Concept>


In this proposal, virtual concepts are labeled placeholders for other concepts (mapping) or for coordinations of other concepts. An alternative solution would be to use normal concepts with mapping and a special skos:virtualNarrower or skos:secondaryNarrower property.
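
The alternative could be sketched as follows. skos:secondaryNarrower is the hypothetical property named above and is not part of SKOS Core; the placeholder is a normal skos:Concept that is mapped to its target. The namespace URIs are assumed.

```xml
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#"
  xmlns:map="http://www.w3.org/2004/02/skos/mapping#">
  <skos:Concept rdf:about="Top/Health">
    <skos:prefLabel>Health</skos:prefLabel>
    <!-- Proposed property: marks the link as secondary (symbolic) -->
    <skos:secondaryNarrower rdf:resource="Healthcare_Industry"/>
  </skos:Concept>
  <!-- A normal concept acting as the labeled placeholder -->
  <skos:Concept rdf:about="Healthcare_Industry">
    <rdfs:label>Healthcare Industry</rdfs:label>
    <map:exactMatch rdf:resource="Top/Business/Healthcare"/>
  </skos:Concept>
  <skos:Concept rdf:about="Top/Business/Healthcare">
    <rdfs:label>Healthcare</rdfs:label>
  </skos:Concept>
</rdf:RDF>
```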

Faceted Classification

Faceted classification gives users the ability to find items based on more than one dimension. See a proposal to extend SKOS with facets at http://isegserv.itd.rl.ac.uk/cvs-public/~checkout~/skos/drafts/appextensions.html. Here is another proposal, which enhances SKOS with a new property skos:hasFacet and a new class skos:Facet. The example models two facets from the Art and Architecture Thesaurus [AAT]:


<skos:ConceptScheme>
  <skos:hasFacet rdf:resource="AssociatedConcept"/>
  <skos:hasFacet rdf:resource="PhysicalAttributesConcept"/>
  <!-- skipped 5 other facets -->
</skos:ConceptScheme>

<skos:Facet rdf:about="AssociatedConcept">
  <skos:hasTopTerm rdf:resource="AssociatedConcepts"/>
</skos:Facet>

<skos:Facet rdf:about="PhysicalAttributesConcept">
  <skos:hasTopTerm rdf:resource="AttributesAndProperties"/>
  <skos:hasTopTerm rdf:resource="ConditionsAndEffects"/>
  <skos:hasTopTerm rdf:resource="DesignElements"/>
  <skos:hasTopTerm rdf:resource="Color"/>
</skos:Facet>


Another solution is to encode facets as Concept Schemes and collect the facets under a common concept scheme by grouping as proposed in [ISO3166SKOS].

In some Concept Schemes there are main facets and additional secondary facets that are presented less prominently in display (similar to symbolic relationships). Secondary facets are not included in this proposal.

Faceted classification plays an important role in today's web design, so it should be possible to express facets adequately in SKOS. Input from experts in faceted classification would be valuable here (for instance see [XFML]).

Coordination

Pre-coordination is a common method in indexing to build complex descriptors by combining existing concepts. Up to now there is no standard way to encode coordination in SKOS. Here is a complex example of a coordinated notation in UDC. The Universal Decimal Classification [UDC] is a widely used multilingual, general classification.

94(4+6)"18"	History of Europe and Africa in the 19th century

It consists of

  • 94 History
  • (4+6) Europe (4) and Africa (6)
  • "18" 19th century

An expression in SKOS could be


<skos:Concept>
  <skos:prefLabel xml:lang="zxx">94(4+6)"18"</skos:prefLabel>
  <dcterms:hasPart rdf:resource="#udc94"/>
  <dcterms:hasPart rdf:parseType="Resource">
    <dcterms:hasPart rdf:resource="#udc-place4"/>
    <dcterms:hasPart rdf:resource="#udc-place6"/>
  </dcterms:hasPart>
  <dcterms:hasPart rdf:resource="#udc-time18"/>
</skos:Concept>


With dcterms:hasPart (and its counterpart dcterms:isPartOf) no new class or property has to be introduced. However, the special kind of relationship between coordinated concepts cannot be expressed. Moreover, interpreting a coordinated notation remains complicated without detailed markup. An XML structure like the following would better suit the needs of user interfaces:


<coordinated literal='94(4+6)"18"'>
  <part ref="#udc94">94</part>
  <coordinated literal="(4+6)">
    <part ref="#udc-place4">4</part>
    <part ref="#udc-place6">6</part>
  </coordinated>
  <part ref="#udc-time18">"18"</part>
</coordinated>


Note that qualifiers in vocabulary control (also missing in SKOS) may also be seen as a special kind of coordination.

Subject counts

SKOS allows the subject indexing of single records to be specified with skos:subject, skos:isSubjectOf, skos:primarySubject, and skos:isPrimarySubjectOf. However, many applications also need the total number of records indexed with a specific concept. Up to now there is no way to encode these numbers (subject counts). A new property skos:hasSubjectCount could fill this gap. It could be used unqualified with a numeric value, or qualified with a link to an instance of a more detailed new class skos:SubjectCount. The following example expresses the fact that there are 10298 records indexed with "birthday" at flickr [FLICKR], first in an unqualified and then in a qualified way:


<skos:Concept rdf:about="http://www.flickr.com/photos/tags/birthday/">
  <skos:prefLabel>birthday</skos:prefLabel>
  <skos:hasSubjectCount>10298</skos:hasSubjectCount>
</skos:Concept>

<skos:Concept rdf:about="http://www.flickr.com/photos/tags/birthday/">
  <skos:prefLabel>birthday</skos:prefLabel>
  <skos:hasSubjectCount rdf:resource="#birthdayAtFlickr"/>
</skos:Concept>

<skos:SubjectCount rdf:about="#birthdayAtFlickr">
  <skos:documentCollection rdf:resource="http://www.flickr.com" /> <!-- another new property -->
  <skos:subject rdf:resource="http://www.flickr.com/photos/tags/birthday/" />
  <rdf:value>10298</rdf:value>
  <dc:date>2006-07-20</dc:date>
</skos:SubjectCount>


You could also express the results of expanded queries with qualified subject counts. For instance, there are only 6 websites directly in the "Health" category at ODP, but 64965 websites in all of its subcategories. The type of query expansion could be identified by a special vocabulary.


<skos:SubjectCount rdf:about="#healthAtODP">
  <skos:subject rdf:resource="http://www.dmoz.org/Health/"/>
  <skos:queryExpansion>tree</skos:queryExpansion> <!-- new property -->
  <rdf:value>64965</rdf:value>
  <dc:date>2006-07-20</dc:date>
</skos:SubjectCount>


References