Cluster VocAlign
Authors: Antoine Isaac, Michael Panzer, Marcia Zeng
Background
Libraries and other cultural institutions use a variety of value vocabularies and metadata element sets to describe items in their collections. Different vocabularies may be chosen to describe different collections, within a library or in separate libraries, based on language, subject focus, granularity (specificity), user expectation, availability and support.
For example, a typical academic library may operate separate metadata systems for the library's main collections, special collections (e.g. manuscripts, archives, audiovisual), digital collections, etc. Each of these systems may employ a different (profile of) metadata element set(s) and/or different value vocabularies, with very limited interoperability between terms and concepts at the semantic level. There is usually no explicit link across vocabularies that indicates that two concepts or metadata fields have a similar meaning, or that two name authorities relate to the same person.
This raises a crucial issue for discovering resources over different collections. Library communities develop discovery services for consortia with a geographical, subject, sector (public, academic), and/or domain (libraries, archives, museums) focus, either based on distributed searching (e.g., via Z39.50) or metadata aggregation (e.g., OCLC's WorldCat). The number of different value vocabularies and metadata element sets such services have to deal with is increasing. Trans-national consortia (e.g., Europeana) add to the complexity of the environment by including vocabularies in multiple languages.
Users expect a single point of search in such discovery services involving multiple organisations. They also expect to be able to search for subjects and navigate within collections using their own language and terms in an unambiguous, contextualised manner. Some digital collections can use universal value vocabularies of multilingual nature, like DDC. But they still require local vocabularies to be aligned to such a pivot language, even when they have different structures (e.g., classifications vs. subject headings) and different specificity levels. A similar requirement applies to the various metadata element sets used by the collections to be accessed.
Vocabulary heterogeneity also impacts back-office work: overall, managing and publishing metadata is less efficient when everyone develops and maintains their own vocabularies in isolation. On the other hand, maintaining link maps, which might be products of vocabulary alignment, can be challenging as well when participating vocabularies change without employing robust notification systems and versioning approaches. Often, an institution could benefit from the work of another for identifying new relevant concepts for one domain, maintaining standard metadata elements, authority sources, synonym lists, re-using subject indices for books, etc. Institutions specialized in one specific field, for instance, could re-use and extend parts of general vocabularies without having to start a new vocabulary from scratch, as in the AGROVOC case. Some vocabularies are used across many datasets, and are often the result of collaborative development (possibly under supervision of a central agency): e.g., LCSH, DDC, UDC, AGROVOC for value vocabularies and Dublin Core for metadata element sets. But there are still (too) many vocabularies with overlapping foci, which are developed independently, resulting in more human effort and lower system interoperability.
Topic in the Context of Linked Data
Linked data technologies provide tools to express, share and exploit semantic mapping or merging of concepts across value vocabularies, e.g., represented using SKOS, as well as elements (classes, properties) from metadata element sets, as defined in (RDFS/OWL) ontologies. On the other hand, the ease of publishing data without bothering about these semantic connections in the first place is raising problems: "Proliferation of URIs and Managing Coreference" has been identified as one of the main "semantic elephants" in the room [1]. The issue of establishing connections (especially equivalence) between entities that are semantically comparable is common to all Linked Data applications. The current number of links is an order of magnitude below the number of entities published on the LOD cloud, and only one dataset (DBpedia) serves as a semantic mapping hub for almost the entire LOD cloud.
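As a minimal sketch of what such a semantic mapping looks like on the wire, the following Python snippet serializes one SKOS mapping statement as an N-Triples line. The two concept URIs are hypothetical placeholders under example.org, and the helper name `mapping_triple` is introduced here for illustration only; the SKOS namespace and the `closeMatch` property are the standard ones.

```python
# Standard SKOS namespace; mapping properties such as closeMatch live here.
SKOS = "http://www.w3.org/2004/02/skos/core#"


def mapping_triple(source_uri, mapping_property, target_uri):
    """Return one N-Triples line linking two concept URIs with a SKOS mapping property."""
    return f"<{source_uri}> <{SKOS}{mapping_property}> <{target_uri}> ."


# Hypothetical concept URIs (assumptions, not real vocabularies).
local_concept = "http://example.org/local-thesaurus/c42"
hub_concept = "http://example.org/hub-classification/630"

line = mapping_triple(local_concept, "closeMatch", hub_concept)
print(line)
```

Publishing such statements alongside either vocabulary is one way to make the link between the two concepts explicit and reusable by other applications.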
Scenarios (Case Studies)
With a first attempt to map to Goals
- Use Case Vocabulary Merging
- SEARCH/BROWSE & FEDERATE: the case aims at allowing users to do (semantic/multilingual) search among different sources
- MAP(value): semantic mapping of the vocabularies used by the different sources is required
- Use Case Browsing And Searching In Repositories With Different Thesauri
- MAP (value): the case needs mapping between value vocabularies to translate queries from one vocabulary to another.
- SEARCH/BROWSE & FEDERATE: the case is about searching for books that match a given query, over various catalogue systems
- Use Case AGROVOC Thesaurus
- REUSE-SCHEMAS: the goal is to make AGROVOC as compatible with other linked data as possible and thus more easily consumable, facilitating REUSE-VALUE-VOCABS in *other* cases
- MAP (value): the case mentions facilitating the correspondence between vocabularies. FAO wants to map AGROVOC to EuroVOC and other vocabularies.
- SEARCH/BROWSE: one crucial aim is to find concepts for indexing (with the multilingual aspect being crucial), and possibly to support an external, distributed vocabulary maintenance environment: via the API, searching existing terms can help identify gaps and add new terms.
- Use Case Civil War Data 150
- SHARE: (share object descriptions (data) among institutions, not limited to bibliographic data) CWD150's goal is to provide tools to discover information about the American Civil War from across multiple institutions and collections (Library, Museum, Archives, Individuals, etc.)
- MAP (value): CWD150 plans to 1) create an ontology from the Civil War Soldiers and Sailors Database (dates, places, people, events, etc.) and 2) provide a crosswalk of corresponding identifiers in Freebase, DBpedia, and potentially LCSH.
- SEARCH/BROWSE: The result is a set of connections, based on strong identifiers and a taxonomy of the Civil War, that enables users to search, browse, and explore information about a particular place, regiment, battle, or person.
- DISCOVER/SUGGEST: One scenario is the visualization of the troop movements and engagements of a regiment, matched against a timeline that shows troop casualties.
- Use Case Language Technology
- various goals of language technology (which are not mandatory LLD ones). One example of language technology is named entity recognition (NER), which can also have various goals, including
- RELATE(new): NER can be used to relate an object to a specific entity (place, person) that is connected to it, exploiting a natural language description of that object.
- ENHANCE VOCABs (new): language technology can utilize available value vocabularies (e.g., authorities) or values from structured data (e.g., DBpedia, WordNet) to align terms that represent instance-like entities (current-and-new terms, cross-lingual terms) in order to enrich and update a value vocabulary.
- PUBLISH & MAP (value): the case is dependent on (multilingual) library vocabularies having been published as LD and connected to other datasets (such as DBpedia, WordNet)
- Use Case Bridging OWL and UML: It is unclear whether this case is specific to the library (or wider GLAM) context. But one can see potential applications regarding:
- MAP (metadata) and RELATE (new): UML specification can be used as a hub to align domain model components (class, attributes, association, operation, instance, etc.) in a design at a human-understandable level. Connection between OWL and UML would help alignment of these components at the machine-understandable level.
- REUSE-SCHEMAS: metadata elements can be expressed in UML then to be explored, reused, and applied in building an OWL ontology.
It is unclear whether the following cases should be in the cluster, as they don't explicitly mention "vocabulary alignment". In fact, they don't feature several (value) vocabularies that would or should be aligned.
- Use Case Subject Search
- SEARCH/BROWSE and DISCOVER/SUGGEST: the aim from a user perspective is to better fit library objects in web search engines, enabling users to access subjects and explore a subject's "landscape" by:
- NAME WITH URI & PUBLISH: assigning URIs to subjects that occur in books and publish relevant list of objects in the pages accessible from these URIs
- RELATE (aggregation): providing (1) subject representations, made of the books that are relevant for the subjects (2) object representations conceived as a way to provide various documents (medium, editions...) that are relevant to the user's information need. But the case is not explicit about what should be done here.
- REUSE-VALUE-VOCABS: re-using subjects from reference authority lists with rich information attached to them
- Use Case Component Vocabularies
- DESCRIBE: the case aims at creating bibliographic metadata
- REUSE-VALUE-VOCABS: descriptions re-use values coming from already published reference vocabularies
- RELATE (new): new descriptions are being created, which connect books to vocabulary elements
- PUBLISH & NAME WITH URI: bibliographic metadata and vocabularies are published as interconnected LD
- SEARCH/BROWSE: the vocabulary LD is used in the process during which users retrieve books
Cases from other clusters can be relevant, see Relevant cases
Extracted Use Cases
The four "general applications" for vocabulary alignment data (as elaborated in [2]) can serve as a foil for the extraction (with Voc1 and Voc2 as the vocabularies to be aligned):
- Reindexing of collections: supporting the indexing of documents with Voc2 based on existing indexing with Voc1, or vice versa.
- Concept-based search across vocabularies in heterogeneously indexed collections: supporting the retrieval of documents indexed with Voc1 for queries that use Voc2 concepts, or vice versa.
- Navigation across vocabularies: supporting the exploration of concept spaces across vocabularies, giving (exploratory) access to collection items indexed with selected concepts.
- Vocabulary merging: supporting the construction of a new vocabulary that encompasses both Voc1 and Voc2, or the integration of one vocabulary into the other (as an extension or satellite of the other vocabulary)
These can be further abstracted in 3 categories of uses:
- Enrichment and discovery related use cases. These use cases focus on collections that have applied source or target vocabularies that are part of alignment efforts.
- Vocabulary-based enrichment of collections: the usage of an alignment technique or an existing alignment (e.g., a crosswalk or link map) to add semantically related concepts from target vocabularies to documents that have been indexed or are otherwise discoverable with the source vocabulary. Reindexing is a specific instance of vocabulary-based enrichment use cases.
- Vocabulary-based discovery in and across heterogeneously indexed collections: the enhancement of recall (with improved or at least comparable precision) for queries that use terms of multiple source and target vocabularies. Query expansion is a specific instance of vocabulary-based discovery use cases.
- Exploration of topical spaces by cross-vocabulary navigation: the enabling of interactive query construction by providing guided access to aligned vocabularies, allowing the traversal of intra- and inter-vocabulary (alignment) relationships, optionally resulting in a query using terms from the source vocabulary only.
- Multilingual discovery: the employment of alignment techniques, e.g., informed by natural language processing (NLP), to establish semantic interoperability between value vocabularies in different languages. Named entity recognition (NER) as described in Use Case Language Technology is a specific instance of multilingual discovery, aiming at establishing semantic equivalence for concepts that have the same entities (persons, places, events, etc.) as extension, referent, or focus across multiple languages.
- Bridging multiple domains, disciplines, or communities of practice: the enabling of brokering or switching between domain-focused vocabularies of varying terminological specificity to enhance federated discovery in heterogeneously indexed collections or exploration of transdisciplinary topic spaces.
- Vocabulary enhancement and reuse either to extend other value vocabularies or as a basis of creation of new value vocabularies. (Often, these use cases will be prerequisites to fulfilling the discovery and enrichment use cases described above.)
- Extending a common pivot or spine vocabulary with specialized vocabularies that become local extensions of a shared upper-level core.
- Vocabulary merging: supporting the construction of a new vocabulary that encompasses both Voc1 and Voc2, or the integration of one vocabulary into the other (as an extension or satellite of the other vocabulary)
- Publication, discovery, and maintenance of tools or services of vocabulary alignment.
- Alignment-level description that enables one-stop shopping of value vocabulary alignments and/or contents provided by these vocabularies.
- Change management and versioning of alignments (e.g., crosswalks, link maps): offering update and notification services to allow applications using vocabulary alignments to keep pace with changes in source or target vocabularies, or to keep targeting a specific stable version.
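The reindexing and concept-based search applications listed above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a reference implementation: the alignment is held as a plain dictionary from Voc1 concepts to sets of Voc2 concepts, all concept identifiers are invented, and the function names `reindex` and `expand_query` are hypothetical.

```python
# Hypothetical alignment from Voc1 concepts to Voc2 concepts (invented identifiers).
ALIGNMENT = {
    "voc1:Cats": {"voc2:Felines"},
    "voc1:Dogs": {"voc2:Canines", "voc2:Pets"},
}


def reindex(record_subjects, alignment):
    """Enrich a record indexed with Voc1 by adding the aligned Voc2 concepts."""
    enriched = set(record_subjects)
    for subject in record_subjects:
        enriched |= alignment.get(subject, set())
    return enriched


def expand_query(query_concepts, alignment):
    """Expand a Voc2 query with every Voc1 concept aligned to one of its concepts."""
    query = set(query_concepts)
    expanded = set(query)
    for source, targets in alignment.items():
        if targets & query:
            expanded.add(source)
    return expanded


print(sorted(reindex({"voc1:Cats"}, ALIGNMENT)))
print(sorted(expand_query({"voc2:Pets"}, ALIGNMENT)))
```

Reindexing changes the stored metadata once, whereas query expansion leaves the collections untouched and applies the alignment at search time; which of the two fits better depends on who controls the indexed collections.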
Relevant technologies
- Ontology alignment tools from Semantic Web research [3], for example:
- R2R mapping framework for converting data from one metadata element set to another
- SILK for instance-level mapping of Linked Data resources (can be applied to value vocabularies)
- Yearly evaluation campaigns are run with some alignment tools in tracks that are relevant to GLAM [4].
- Registries:
- Vocabulary registries: the Metadata Registry allows creating and serving inter-vocabulary mappings (both for value vocabularies and metadata element sets).
- Research on alignment-oriented repositories for value vocabularies includes the following projects: FinnONTO, NeON, CATCH and Bioportal (in the biology domain).
- Sameas.org co-reference service serves semantic equivalence links harvested on the linked data cloud. General Semantic Web services like sindice.com can achieve a similar result.
- Vocabularies that can be used as "hubs" (specific design, wide coverage...)
- Vocabulary Mapping Framework enables anchoring metadata element sets together
- DDC (dewey.info) is an alignment target for many "traditional" subject heading lists.
- Persistent identifiers: Civil War Data 150 and other cases assign persistent URIs to their key entities, and rely (sometimes on a temporary basis) on external identifiers, such as DBpedia or Freebase ones.
Relevant vocabularies
- SKOS, including SKOS eXtension for Labels (SKOS-XL) and SKOS mapping properties. With SKOS the degree of mapping can be specified as exact match, close match, broad match, narrow match, related match, etc. (Use Case Vocabulary Merging).
- OWL
- UMBEL
- Many other vocabularies provide potential mapping properties; cf. slide 21 of Mike Bergman's slides at DC 2010
- EDOAL, an Alignment language used by the Ontology Alignment community
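The choice between SKOS mapping degrees has practical consequences for applications. In the SKOS Reference, skos:exactMatch is defined as transitive (and symmetric), whereas skos:closeMatch is not transitive, so only exact matches can be chained safely across vocabularies. The following Python sketch groups concepts into equivalence clusters using exact matches only. The concept identifiers and the function name `exact_match_clusters` are assumptions made for illustration.

```python
def exact_match_clusters(mappings):
    """Group concepts linked by exactMatch into clusters (simple union-find).

    mappings: iterable of (source, mapping_property, target) tuples.
    Links with any other property (e.g. closeMatch) are deliberately ignored,
    because chaining them could connect concepts that are not equivalent.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for source, prop, target in mappings:
        if prop == "exactMatch":
            parent[find(source)] = find(target)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Invented sample links between three hypothetical vocabularies a, b and c.
sample = [
    ("a:1", "exactMatch", "b:1"),
    ("b:1", "exactMatch", "c:1"),
    ("a:2", "closeMatch", "b:2"),
    ("b:2", "closeMatch", "c:2"),
]
print(exact_match_clusters(sample))
```

In this sample, a:1 and c:1 end up in the same cluster although no direct link was asserted between them, while the close matches are left unchained.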
Mapped/merged Value Vocabularies
(available through terminology services or published vocabs)
- HILT (High-Level Thesaurus) project, DDC and others - Use Case Vocabulary Merging
- MACS project: enabling large-scale multilingual access to subjects, LCSH, French RAMEAU, and German Schlagwortnormdatei (SWD) - Use Case Vocabulary Merging
- LCSH-RAMEAU mappings have been published as linked data, as well as SWD-RAMEAU and SWD-LCSH, through the respective vocabulary sites
- LIBRIS-LCSH mappings are available at the LIBRIS site
- VIAF authorities are mapped to LIBRIS and GND
- SWD is published with mappings to (German) DDC, result of CrissCross project
- Unified Medical Language System UMLS, over 100 value vocabularies - Use Case Vocabulary Merging
- National Diet Library's subject headings NDLSH mapped to LCSH with SKOS - Use Case Vocabulary Merging
- OCLC Terminology Services, LCSH, and other LC value vocabs, MeSH, Book Industry Study Group Subject Headings (BISAC®), etc. available in SKOS.
- AGROVOC is mapped to SWD, GEMET (and CAT, NAL as non-LD vocabularies)
- New York Times subject headings are mapped to Freebase, DBpedia, GeoNames
- Subject headings from the Hungarian National Library, STW, GND, etc. are being mapped to DBpedia
- The MARC Countries (available as LOD) entries include references to their equivalent ISO 3166 codes.
- The MARC List for Languages has been cross-referenced with ISO 639-1, 639-2, and 639-5, where appropriate. Additional vocabularies will be added in the future, including additional PREMIS controlled vocabularies.
Problems and limitations
Missing Vocabularies
- Current mapping vocabularies in RDF reveal some expressivity issues:
- As more types of value vocabularies are aligned, there can be a gap between the entities to be aligned and what mapping vocabularies can express. These include specific support in SKOS for:
- expressing mapping between compound concepts represented in a pre-coordinated and a post-coordinated vocabulary,
- expressing mapping between concepts from vocabularies that have different structures, for example, a classification system and a thesaurus.
- More investigation is needed on links between entities of different semantic types, e.g., Concepts vs. RWO (Real World Objects), Concepts vs. (OWL) Classes, RWO to RWO, etc.
- In addition to the type of link (mapping [property]), other information such as provenance (creation methods, e.g., automatic vs. manual mapping, degree of reliability, etc.) would probably be required to ease share and re-use of alignment results.
- There is a need for generalized mapping rules for the degrees of mapping.
- Usage of standard patterns for mapping is not yet a reality
- There are many proposals, not all yet agreed-upon.[1]
- Available semantic mapping properties have been sometimes overused or used inappropriately. For example, 'owl:sameAs' has been used to express many kinds of mappings, beyond its original formal semantics.
- Although there are new cases of using SKOS as a pattern in the vocabularies beyond KOS, such as in MADS and VIAF, there is a need for a comprehensive study of the effectiveness of such an approach, regarding the benefits for the mapping of the vocabularies.
[1] cf. slide 21 of Mike Bergman's slides at DC 2010
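The provenance requirement mentioned above (creation method, degree of reliability) could, for instance, be carried alongside each mapping statement. The following Python sketch shows one possible record shape and a simple reuse filter. All field names, the 0.9 threshold, and the sample data are assumptions for illustration; no existing mapping vocabulary is implied to define them this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRecord:
    """One mapping statement plus hypothetical provenance fields."""
    source: str        # concept in the source vocabulary
    target: str        # concept in the target vocabulary
    relation: str      # e.g. "exactMatch", "closeMatch"
    method: str        # "manual" or "automatic"
    confidence: float  # 0.0 to 1.0, as reported by the creator
    creator: str       # person, project, or tool that produced the mapping


def reusable(records, min_confidence=0.9):
    """Keep manual mappings, plus automatic ones at or above the threshold."""
    return [
        r for r in records
        if r.method == "manual" or r.confidence >= min_confidence
    ]


# Invented sample records.
records = [
    MappingRecord("a:1", "b:1", "exactMatch", "manual", 1.0, "cataloguer"),
    MappingRecord("a:2", "b:2", "closeMatch", "automatic", 0.95, "matcher-tool"),
    MappingRecord("a:3", "b:3", "closeMatch", "automatic", 0.40, "matcher-tool"),
]
print([r.source for r in reusable(records)])
```

A consumer with stricter quality needs could raise the threshold or accept manual mappings only, which is the kind of decision that is impossible when provenance is not published with the alignment.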
Data incompatibilities or lack of full compatibility
- There is a general sparseness of linkage in the LOD cloud.
- Resources to be aligned vary in their scopes and granularity levels, modeling principles (e.g., thesauri vs. classification systems), language and culture, and many other aspects.
- Quality and size of resources involved in the alignment are heterogeneous.
- Semantic enrichment may be needed for some vocabularies before vocabulary integration.
Community guidance/organization issues
- Cost of intellectual work for vocabulary mapping, esp. for complex metadata element sets or large value vocabularies.
- Diversity of users' needs regarding the alignment quality. It is difficult to reach a consensus about what is a good mapping for any project.
- There will be different applications of a vocabulary alignment. The same alignment may be performed differently in different scenarios. Users' needs are diverse and can have direct impact on the mapping practices and results.
- Patterns for vocabulary alignment can be multiple and decisions have to be made based on the assessments. For example, there are direct mapping and backbone (hub) mapping approaches. Within the same project different patterns may be combined.
- Copyright and licensing can influence mapping strategies and their implementation.
- Selection of sources is largely influenced by their availability.
- There is a need to differentiate between the licensing and use of metadata about digital assets and the licensing and use of the assets themselves.
- The ownership of the alignment data is still an unclear area. Where should one publish mappings if multiple vocabularies are involved? Who owns the original data and mapping expression?
- Reliability of data sources that are targets of alignments is critical yet difficult to discover without certain levels of investment. (See also the 'Missing Vocabularies' section.)
- Re-alignments may be needed when the vocabularies involved are updated. Participants need to be informed about other vocabularies' updating policies, workflows, and frequencies. Such updates need to be incorporated in the mapping results or routine. (See also the 'Technology availability/questions' section.)
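The difference between the direct and backbone (hub) mapping patterns mentioned above can be illustrated as follows. With n vocabularies, direct mapping requires up to n(n-1)/2 pairwise alignments, while a hub requires only one alignment per non-hub vocabulary, from which pairwise links are derived by composition. The Python sketch below derives a Voc1-to-Voc2 alignment through a hub. The identifiers and the function name `compose_via_hub` are hypothetical, and the composition is only meaningful under the assumption that both input alignments contain equivalence-like links.

```python
def compose_via_hub(source_to_hub, hub_to_target):
    """Derive source->target links by chaining two alignments through a hub.

    Both arguments map a concept to a set of aligned concepts. Source concepts
    whose hub concepts have no target counterpart are left out of the result.
    """
    derived = {}
    for source, hub_concepts in source_to_hub.items():
        targets = set()
        for hub in hub_concepts:
            targets |= hub_to_target.get(hub, set())
        if targets:
            derived[source] = targets
    return derived


# Invented sample alignments against a hypothetical hub classification.
voc1_to_hub = {"voc1:Farming": {"hub:630"}, "voc1:Poetry": {"hub:808"}}
hub_to_voc2 = {"hub:630": {"voc2:Agriculture"}}

print(compose_via_hub(voc1_to_hub, hub_to_voc2))
```

The sample also shows a typical cost of the hub pattern: voc1:Poetry receives no derived link because the second vocabulary has not been aligned to the corresponding hub concept.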
Technology availability/questions
- Alignment tools (for metadata element sets and value vocabularies)
- Research is more focused on ontologies than on general library value vocabularies
- Tools have scalability issues for large vocabularies
- Provenance of alignment information has only been lightly addressed
- Management of mappings re. vocabulary evolution
- Concept scheme evolution is challenging considering that even deprecated concepts and relationships must maintain an accessible URL
- Mappings should be updated to take into account new elements in the mapped vocabularies
- Taking over aligned vocabularies in a local repository remains an option
- Handling mapping to not-yet LD-published vocabularies
- Diversity of application environments, e.g., integration into a language technology processing pipeline, potentially crossing technology stacks (RDF & XML based Web Services)
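The management of mappings with respect to vocabulary evolution, noted above, can be supported by a simple comparison of two versions of a mapped vocabulary. The Python sketch below flags mappings whose target concept disappeared in the new version and lists newly added concepts that have no mapping yet. It assumes, for illustration only, that each version is available as a set of concept identifiers; the function name `mappings_to_review` and all identifiers are hypothetical.

```python
def mappings_to_review(mappings, old_version, new_version):
    """Compare two versions of a target vocabulary against existing mappings.

    mappings: list of (source_concept, target_concept) pairs.
    old_version, new_version: sets of concept identifiers of the target vocabulary.
    Returns (broken, unmapped_new): mappings pointing at removed concepts, and
    the sorted list of newly introduced concepts that still need to be mapped.
    """
    removed = old_version - new_version
    added = new_version - old_version
    broken = [(s, t) for s, t in mappings if t in removed]
    return broken, sorted(added)


# Invented sample data.
current_mappings = [("a:1", "t:1"), ("a:2", "t:2")]
version_1 = {"t:1", "t:2"}
version_2 = {"t:1", "t:3"}

broken, unmapped_new = mappings_to_review(current_mappings, version_1, version_2)
print(broken, unmapped_new)
```

A notification service of the kind described in the extracted use cases could publish exactly these two lists whenever a vocabulary releases a new version.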
References
[1] http://www.mail-archive.com/public-lod@w3.org/msg05298.html
[2] http://doi.ieeecomputersociety.org/10.1109/MIS.2009.26
[3] J. Euzenat, P. Shvaiko: Ontology Matching. Springer, 2007.
[4] Ontology Alignment Evaluation Initiative:
- OAEI library task: 2007, 2008, 2009
- OAEI very large crosslingual resources task: 2008, 2009, 2010
- OAEI instance matching track: 2009, 2010
- OAEI proceedings: 2007, 2008, 2009
Other projects having experimented with vocabulary alignment in a non linked data context: DESIRE, CARMEN, Renardus, AQUARELLE, LIMBER, SWAD-Europe, MSAC, KoMoHe