Use Case Vocabulary Merging
Vocabulary Merging (Subject Interoperability)
Bernard Vatant; Gordon Dunsire
Background and Current Practice
Publishing library legacy data includes publishing the vocabularies that structure it, such as thesauri, classification schemes, and subject headings. Different sources use different vocabularies, which vary in structure, breadth, depth, scope, and language. Federated access to distributed data collections is currently possible only if they rely on the same vocabularies. Mapping techniques and the standards supporting them (such as the SKOS mapping properties, owl:sameAs and owl:equivalentClass) are still largely experimental, even in the linked data world.
Libraries use a variety of controlled subject vocabularies and classification schemes to index items in their collections. Although most collections will employ only a single scheme, different schemes may be chosen to index different collections within a library or in separate libraries; schemes are chosen on the basis of language, subject focus (general or specific), granularity (specificity), user expectation, and availability and support (cost, currency, completeness, tools).
For example, a typical academic library will operate separate metadata systems for the library's main collections, special collections (e.g. manuscripts, archives, audiovisual), digital collections, and one or more institutional repositories for teaching and research output; each of these systems may employ a different subject vocabulary, with little or no interoperability between terms and concepts.
Users expect to have a single point-of-search in resource discovery services focussed on their local institutional collections. Librarians have to use complex and expensive resource discovery platforms to meet user expectations.
Library communities continue to develop resource discovery services for consortia with a geographical, subject, sector (public, academic, school, special libraries), and/or domain (libraries, archives, museums) focus. Services are based on distributed searching (e.g. via Z39.50) or metadata aggregations (e.g. OCLC's WorldCat and OAIster). As a result, the number of different subject schemes encountered in such services is increasing. Trans-national consortia (e.g. Europeana) add to the complexity of the environment by including subject vocabularies in multiple languages.
Users expect a single point-of-search in consortial resource discovery services involving multiple organisations and large-scale metadata aggregations. Users also expect to be able to search for subjects using their own language and terms in an unambiguous, contextualised manner.
- Allow vocabularies that different sources have defined to organize (classify, index ...) legacy data to be used together, in order to support federated searches on distributed databases, semantic expansion of searches, etc.
- Linked data technologies provide the underlying infrastructure through semantic mapping or merging of concepts across vocabularies.
Vocabulary curators (or is it end users?)
Use Case Scenario
Alice wants to find information resources on a specific topic that she can obtain and use within a couple of hours. She uses the resource discovery service offered by her local university library, and opts to search the local collections (which she can consult after a 10-minute walk) and several digital collections which are open-access or for which she is a registered subscriber.
She enters a subject term in the search box, but gets a "no hits" message. This is annoying and mystifying, because she knows that the term gets hits in the local public library catalogue.
She enters another subject term, and gets hits, but many of them turn out to be irrelevant. She knows this is because her term (e.g. "china") can refer to different topics in different contexts, but the system does not alert her to the ambiguity, or allow her to specify the context. This is annoying and frustrating.
She wants to find resources with language-independent content, such as music or silent film, but suspects that the metadata in the European digital collection she has selected for searching is in French, German and Italian. She does not know the corresponding subject term in those languages, and cannot carry out the search.
She tries to use a tag cloud of topics presented by the service. These topics are those giving the most hits in the various collections she is searching. She notices that the same topic seems to appear more than once, that obvious synonyms are presented as separate topics, and that many of the topics are in foreign languages. She drills down into a few specific topics in the tag cloud, but realises she is getting sets of hits which often overlap, or which do not contain items which she knows to be in the local collections. She also finds out that many of the topics use North American spelling, and that terms such as "color" are treated as separate topics from those with British English spelling, such as "colour". She wonders why the subject focus so natural to the organisation of faculties within the university is treated so badly in the library.
Application of linked data for the given use case
Concepts represented in vocabularies published as linked data can be federated in various ways:
- Simple one-to-one mapping declarations embedded in concept descriptions, using the SKOS mapping properties (skos:exactMatch, skos:closeMatch, skos:broadMatch, skos:narrowMatch, skos:relatedMatch).
- One-to-many or many-to-many mappings between post-coordinated concepts using boolean compositions.
- More expressive rules similar to those used by ontology mapping tools.
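As a minimal sketch of the first option above, the following Python fragment expands a search concept across vocabularies by following one-to-one mapping links. The triples and concept identifiers are invented for illustration and are not real URIs from LCSH, RAMEAU or SWD; a real service would read them from published RDF. By default only skos:exactMatch is followed, since SKOS defines that property as symmetric and transitive. Chaining skos:closeMatch links as well is a simplification, because SKOS does not define closeMatch as transitive.

```python
from collections import deque

# Hypothetical mapping triples (subject, SKOS mapping property, object).
# The concept identifiers are invented for illustration; they are not real URIs.
MAPPINGS = [
    ("lcsh:Porcelain", "skos:exactMatch", "rameau:Porcelaine"),
    ("rameau:Porcelaine", "skos:exactMatch", "swd:Porzellan"),
    ("lcsh:Porcelain", "skos:closeMatch", "local:Chinaware"),
    ("lcsh:Porcelain", "skos:broadMatch", "lcsh:Ceramics"),
]

STRICT = frozenset({"skos:exactMatch"})
# Assumption: treating closeMatch as chainable widens recall at the cost of precision.
LOOSE = frozenset({"skos:exactMatch", "skos:closeMatch"})


def expand_concept(start, mappings, properties=STRICT):
    """Return every concept reachable from `start` via the given mapping properties."""
    # Build an undirected adjacency map, because both properties are symmetric.
    neighbours = {}
    for subject, prop, obj in mappings:
        if prop in properties:
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)
    # Breadth-first traversal from the starting concept.
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(expand_concept("swd:Porzellan", MAPPINGS)))
print(sorted(expand_concept("swd:Porzellan", MAPPINGS, LOOSE)))
```

A discovery service could then search each collection with whichever member of the expanded set belongs to that collection's own scheme. Hierarchical links such as skos:broadMatch are deliberately left out of the expansion here, because following them changes the scope of the query.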
Existing Work (optional)
- The HILT (High-Level Thesaurus) project established the need for a subject terminologies service to provide mappings between subject terms in different schemes, developed a hub-and-spoke mapping architecture using the Dewey Decimal Classification (DDC) as the hub, implemented the architecture in RDF/SKOS, developed an API to the pilot service, demonstrated the utility of the service in several operational subject indexing and resource discovery applications, and recommended the development of a distributed approach based on linked data.
- The MACS (enabling large-scale multilingual access to subjects) project created one-to-one mappings between terms in the English Library of Congress Subject Headings (LCSH), French RAMEAU, and German Schlagwortnormdatei (SWD) vocabularies. The LCSH-RAMEAU mappings have been released as linked data to accompany the linked data versions of LCSH and RAMEAU, in the framework of STITCH.
- ISO 25964 Thesauri and Interoperability with other Vocabularies. Part 2 will cover interoperability between thesauri and other vocabularies such as classification schemes, taxonomies and ontologies. It will provide guidance on mapping practice and architecture.
- Although it lies outside the library domain, UMLS offers long experience of vocabulary integration in biomedicine and health.
- Extensive work on ontology mapping; see e.g. Jérôme Euzenat et al.
- National Diet Library's mapping of NDLSH to LCSH using SKOS.
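The hub-and-spoke architecture that HILT built around DDC can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the scheme names, the terms, the class-number assignments and the `translate` helper are hypothetical and are not taken from HILT data or its API. The idea shown is only that each scheme is mapped once to the hub, and a term is translated between two spokes by passing through the hub class.

```python
# Hypothetical spoke-to-hub mappings: scheme -> {term: hub (DDC) class number}.
# Terms and class assignments are illustrative only.
TO_HUB = {
    "lcsh": {"Porcelain": "738.2", "Color": "535.6"},
    "local": {"Chinaware": "738.2", "Colour": "535.6"},
}


def translate(term, source, target, to_hub=TO_HUB):
    """Map `term` from the `source` scheme to the `target` scheme via the hub."""
    hub_class = to_hub.get(source, {}).get(term)
    if hub_class is None:
        return []  # the term has no mapping to the hub
    # Collect every target-scheme term assigned to the same hub class.
    return sorted(
        candidate
        for candidate, candidate_class in to_hub.get(target, {}).items()
        if candidate_class == hub_class
    )


print(translate("Color", "lcsh", "local"))
print(translate("Chinaware", "local", "lcsh"))
```

With a hub, n schemes need n sets of mappings rather than one set for each of the n(n-1)/2 pairs of schemes. The trade-off is that a translation can be no more precise than the hub class both terms share.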
Related Vocabularies (optional)
Problems and Limitations
- Cost of development of vocabulary mapping.
- Scalability issues for large vocabularies such as LCSH and RAMEAU (see Dunsire/Nicholson and Soergel presentations).
- Research on alignment is more focused on ontologies than on general library vocabularies.
- Maintenance of mappings as the vocabularies evolve.
- Lack of expressivity of current mapping vocabularies in the RDF stack. For example, SKOS has no support for compound concepts. (See Soergel presentation.)
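To make the last limitation concrete, here is a hedged Python sketch of what a mapping to a compound concept could look like if it were expressed as a boolean composition of target concepts. The tuple representation, the concept identifiers, the sample items and the `matches` and `search` helpers are all hypothetical. None of this is a SKOS construct, which is exactly the gap described above.

```python
# Hypothetical compound mappings: one source concept maps to a boolean
# expression over target concepts. An expression is either a concept
# identifier (str) or a tuple of the form ("AND" | "OR", operand, ...).
COMPOUND_MAPPINGS = {
    "source:ChinesePorcelain": ("AND", "target:Porcelain", "target:China"),
    "source:Colour": ("OR", "target:Color", "target:Colour"),
}

# Illustrative catalogue: item id -> set of target concepts it is indexed with.
ITEMS = {
    "item1": {"target:Porcelain", "target:China"},
    "item2": {"target:Porcelain", "target:Germany"},
    "item3": {"target:Color"},
}


def matches(expression, subjects):
    """Evaluate a boolean mapping expression against an item's subject set."""
    if isinstance(expression, str):
        return expression in subjects
    operator, *operands = expression
    results = [matches(operand, subjects) for operand in operands]
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {operator}")


def search(source_concept, mappings=COMPOUND_MAPPINGS, items=ITEMS):
    """Return the sorted ids of items that satisfy the mapped expression."""
    expression = mappings[source_concept]
    return sorted(
        item for item, subjects in items.items() if matches(expression, subjects)
    )


print(search("source:ChinesePorcelain"))
print(search("source:Colour"))
```

A one-to-one mapping property cannot state that a single pre-coordinated heading corresponds to the conjunction of two post-coordinated concepts, so a structure of this kind currently has to live outside the mapping vocabulary itself.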
Related Use Cases and Unanticipated Uses (optional)
Related use cases:
- Use Case Authority Data Enrichment
- Use Case Europeana
- Use Case Language Technology
- Use Case Subject Search
- Use Case Virtual International Authority File (VIAF)
The last of these (VIAF) appears to be a specific instance of the present use case.
- Capturing user-generated subject terms on the fly.
Library Linked Data Dimensions / Topics
Dimensions: Browse / explore / select (Users needs); Retrieve / find (Users needs); Indexes (Library systems); Authority data (Library systems); Mash-ups (Social uses); Thesauri and controlled vocabularies (Information assets); User-generated information (Information assets); make new entities accessible (Information lifecycle).
*these items are not in the initial list, suggestion for adding them
Signposting the crossroads: terminology web services and classification-based interoperability. Presentation by Gordon Dunsire and Dennis Nicholson to Classification at the crossroads: multiple directions to usability : international UDC seminar, 29-30 Oct 2009, The Hague, Netherlands.
Conceptual foundations for semantic mapping and semantic search. Presentation by Dagobert Soergel to the Cologne Conference on Interoperability and Semantics in Knowledge Organization, July 19 2010, Cologne, Germany.