
Cluster VocAlign

From Library Linked Data

Authors: Antoine Isaac, Michael Panzer, Marcia Zeng

Background

Libraries and other cultural institutions use a variety of value vocabularies and metadata element sets to describe items in their collections. Different vocabularies may be chosen to describe different collections, within a library or in separate libraries, based on language, subject focus, granularity (specificity), user expectation, availability and support.

For example, a typical academic library may operate separate metadata systems for the library's main collections, special collections (e.g. manuscripts, archives, audiovisual), digital collections, etc. Each of these systems may employ a different (profile of) metadata element set(s) and/or different value vocabularies, with very limited interoperability between terms and concepts at the semantic level. There is usually no explicit link across vocabularies that indicates that two concepts or metadata fields have a similar meaning, or that two name authorities relate to the same person.

This raises a crucial issue for discovering resources over different collections. Library communities develop discovery services for consortia with a geographical, subject, sector (public, academic), and/or domain (libraries, archives, museums) focus, either based on distributed searching (e.g., via Z39.50) or metadata aggregation (e.g., OCLC's WorldCat). The number of different value vocabularies and metadata element sets such services have to deal with is increasing. Trans-national consortia (e.g., Europeana) add to the complexity of the environment by including vocabularies in multiple languages.

Users expect a single point of search in discovery services that involve multiple organisations. They also expect to be able to search for subjects and navigate within collections using their own language and terms in an unambiguous, contextualised manner. Some digital collections can use universal value vocabularies of a multilingual nature, like DDC. But they still require local vocabularies to be aligned to such a pivot language, even when they have different structures (e.g., classifications vs. subject headings) and different specificity levels. A similar requirement applies to the various metadata element sets used by the collections to be accessed.

Vocabulary heterogeneity also impacts back-office work: overall, managing and publishing metadata is less efficient when everyone develops and maintains their own vocabularies in isolation. On the other hand, maintaining link maps, which might be products of vocabulary alignment, can be challenging as well when participating vocabularies change without employing robust notification systems and versioning approaches. Often, an institution could benefit from the work of another for identifying new relevant concepts for one domain, maintaining standard metadata elements, authority sources, synonym lists, re-using subject indices for books, etc. Institutions specialized in one specific field, for instance, could re-use and extend parts of general vocabularies without having to start a new vocabulary from scratch, as in the AGROVOC case. Some vocabularies are used across many datasets, and are often the result of collaborative development (possibly under supervision of a central agency): e.g., LCSH, DDC, UDC, AGROVOC for value vocabularies and Dublin Core for metadata element sets. But there are still (too) many vocabularies with overlapping foci, which are developed independently, resulting in more human effort and lower system interoperability.

Topic in the Context of Linked Data

Linked data technologies provide tools to express, share and exploit semantic mappings or mergers of concepts across value vocabularies, e.g., represented using SKOS, as well as of elements (classes, properties) from metadata element sets, as defined in (RDFS/OWL) ontologies. On the other hand, the ease of publishing data without attending to these semantic connections in the first place raises problems: "Proliferation of URIs and Managing Coreference" has been identified as one of the main "semantic elephants" in the room [1]. The issue of establishing connections (especially equivalence) between entities that are semantically comparable is common to all Linked Data applications. The current number of links is an order of magnitude below the number of entities published on the LOD cloud, and only one dataset (DBpedia) serves as a semantic mapping hub for almost the entire LOD cloud.
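
As a minimal sketch of what such a semantic mapping can look like, the following Python snippet records two SKOS mapping links between concepts of two value vocabularies and serializes them as N-Triples. The concept URIs under example.org and the helper name to_ntriples are hypothetical and chosen for illustration only; the SKOS namespace and the property names skos:exactMatch and skos:broadMatch are the real ones.

```python
# SKOS namespace (real). The concept URIs below are made-up placeholders.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Each mapping is a (subject, predicate, object) triple of URIs.
# skos:broadMatch states that the object is broader than the subject.
mappings = [
    ("http://example.org/voc1/cats", SKOS + "exactMatch", "http://example.org/voc2/felis-catus"),
    ("http://example.org/voc1/pets", SKOS + "broadMatch", "http://example.org/voc2/animals"),
]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(mappings))
```

Keeping the links as separate statements, rather than merging the two vocabularies, lets each vocabulary keep its own URIs while the alignment is shared and maintained on its own.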

Scenarios (Case Studies)

With a first attempt to map to Goals

  • Use Case Vocabulary Merging
    • SEARCH/BROWSE & FEDERATE: the case aims at allowing users to perform (semantic/multilingual) search across different sources
    • MAP(value): semantic mapping of the vocabularies used by the different sources is required
  • Use Case AGROVOC Thesaurus
    • REUSE-SCHEMAS: the goal is to make AGROVOC as compatible with other linked data as possible and thus more easily consumable, facilitating REUSE-VALUE-VOCABS in *other* cases
    • MAP (value): the case mentions facilitating the correspondence between vocabularies. FAO wants to map AGROVOC to EuroVOC and other vocabularies.
    • SEARCH/BROWSE: one crucial aim is to find concepts for indexing (with the multilingual aspect being crucial), and possibly for an external distributed vocabulary maintenance environment: via the API, searching existing terms can help identify gaps and add new terms.
  • Use Case Civil War Data 150
    • SHARE: (share objects description (data) among institutions, not limited to bibliographic data) CWD150's goal is to provide tools to discover information about the American Civil War from across multiple institutions and collections (Library, Museum, Archives, Individuals, etc.)
    • MAP (value): CWD150 plans to 1) create an ontology from the Civil War Soldiers and Sailors Database (dates, places, people, events, etc.) and 2) provide a crosswalk of corresponding identifiers in Freebase, DBpedia, and potentially LCSH.
    • SEARCH/BROWSE: The resulting connections, based on strong identifiers and a taxonomy of the Civil War, enable users to search, browse and explore information about a particular place, regiment, battle, or person.
    • DISCOVER/SUGGEST: A scenario is the visualization of the troop movements and engagements of a regiment, correlated with a timeline showing troop casualties.
  • Use Case Language Technology
    • various goals of language technology (which are not necessarily LLD ones). One example of language technology is named entity recognition (NER), which can also have various goals, including
      • RELATE (new): NER can be used to relate an object to a specific entity (place, person) that is connected to it, exploiting a natural language description of that object.
      • ENHANCE VOCABs (new): language technology can utilize available value vocabularies (e.g., authorities) or values from structured data (e.g., DBpedia, WordNet) to align terms that represent instance-like entities (current-and-new terms, cross-lingual terms) in order to enrich and update a value vocabulary.
    • PUBLISH & MAP (value): the case is dependent on (multilingual) library vocabularies having been published as LD and connected to other datasets (such as DBpedia, WordNet)
  • Use Case Bridging OWL and UML. It is unclear whether this case is specific to the library (or wider GLAM) context. But one can see potential applications regarding:
    • MAP (metadata) and RELATE (new): UML specification can be used as a hub to align domain model components (class, attributes, association, operation, instance, etc.) in a design at a human-understandable level. Connection between OWL and UML would help alignment of these components at the machine-understandable level.
    • REUSE-SCHEMAS: metadata elements can be expressed in UML and then explored, reused, and applied in building an OWL ontology.

It is unclear whether the following cases should be in the cluster, as they do not explicitly mention "vocabulary alignment". In fact, they do not feature several (value) vocabularies that would or should be aligned.

  • Use Case Subject Search
    • SEARCH/BROWSE and DISCOVER/SUGGEST: the aim from a user perspective is to better fit library objects in web search engines, enabling users to access subjects and explore a subject's "landscape" by:
    • NAME WITH URI & PUBLISH: assigning URIs to subjects that occur in books and publish relevant list of objects in the pages accessible from these URIs
    • RELATE (aggregation): providing (1) subject representations, made of the books that are relevant for the subjects (2) object representations conceived as a way to provide various documents (medium, editions...) that are relevant to the user's information need. But the case is not explicit about what should be done here.
    • REUSE-VALUE-VOCABS: re-using subjects from reference authority lists with rich information attached to them
  • Use Case Component Vocabularies
    • DESCRIBE: the case aims at creating bibliographic metadata
    • REUSE-VALUE-VOCABS: descriptions re-use values coming from already published reference vocabularies
    • RELATE (new): new descriptions are being created, which connect books to vocabulary elements
    • PUBLISH & NAME WITH URI: bibliographic metadata and vocabularies are published as interconnected LD
    • SEARCH/BROWSE: the vocabulary LD is used in the process during which users retrieve books

Cases from other clusters can be relevant, see Relevant cases

Extracted Use Cases

The four "general applications" for vocabulary alignment data (as elaborated in [2]) can serve as a foil for the extraction (with Voc1 and Voc2 as the vocabularies to be aligned):

  1. Reindexing of collections: supporting the indexing of documents with Voc2 based on existing indexing with Voc1, or vice versa.
  2. Concept-based search across vocabularies in heterogeneously indexed collections: supporting the retrieval of documents indexed with Voc1 for queries that use Voc2 concepts, or vice versa.
  3. Navigation across vocabularies: supporting the exploration of concept spaces across vocabularies, giving (exploratory) access to collection items indexed with selected concepts.
  4. Vocabulary merging: supporting the construction of a new vocabulary that encompasses both Voc1 and Voc2, or the integration of one vocabulary into the other (as an extension or satellite of the other vocabulary)
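
The first two applications can be illustrated with a small sketch. It assumes that an alignment is available as a plain dictionary from Voc1 concepts to equivalent Voc2 concepts; all identifiers, the toy collection, and the function names reindex and search are hypothetical. Reindexing derives Voc2 index terms from existing Voc1 indexing, and concept-based search retrieves Voc1-indexed documents for a query phrased with a Voc2 concept.

```python
# Hypothetical alignment: Voc1 concept -> set of equivalent Voc2 concepts.
alignment = {
    "voc1:cats": {"voc2:felines"},
    "voc1:dogs": {"voc2:canines"},
}

# Hypothetical collection indexed with Voc1 only.
collection = {
    "doc1": {"voc1:cats"},
    "doc2": {"voc1:dogs", "voc1:cats"},
    "doc3": {"voc1:birds"},  # no mapping exists for this concept
}


def reindex(doc_terms, alignment):
    """Return the Voc2 concepts implied by a document's Voc1 index terms."""
    voc2_terms = set()
    for term in doc_terms:
        voc2_terms |= alignment.get(term, set())
    return voc2_terms


def search(voc2_concept, collection, alignment):
    """Retrieve Voc1-indexed documents for a query that uses a Voc2 concept."""
    voc1_equivalents = {v1 for v1, v2s in alignment.items() if voc2_concept in v2s}
    return sorted(doc for doc, terms in collection.items() if terms & voc1_equivalents)


print(reindex(collection["doc2"], alignment))
print(search("voc2:felines", collection, alignment))
```

Note that doc3 stays unreachable from Voc2: the benefit of both applications is bounded by the coverage of the alignment.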

These can be further abstracted in 3 categories of uses:

  • Enrichment and discovery related use cases. These use cases focus on collections that have applied source or target vocabularies that are part of alignment efforts.
    • Vocabulary-based enrichment of collections: the usage of an alignment technique or an existing alignment (e.g., a crosswalk or link map) to add semantically related concepts from target vocabularies to documents that have been indexed or are otherwise discoverable with the source vocabulary. Reindexing is a specific instance of vocabulary-based enrichment use cases.
    • Vocabulary-based discovery in and across heterogeneously indexed collections: the enhancement of recall (with improved or at least comparable precision) for queries that use terms of multiple source and target vocabularies. Query expansion is a specific instance of vocabulary-based discovery use cases.
    • Exploration of topical spaces by cross-vocabulary navigation: the enabling of interactive query construction by providing guided access to aligned vocabularies, allowing the traversal of intra- and inter-vocabulary (alignment) relationships, optionally resulting in a query using terms from the source vocabulary only.
    • Multilingual discovery: the employment of alignment techniques, e.g., informed by natural language processing (NLP), to establish semantic interoperability between value vocabularies in different languages. Named entity recognition (NER) as described in Use_Case_Language_Technology is a specific instance of multilingual discovery, aiming at establishing semantic equivalence for concepts that have the same entities (persons, places, events, etc.) as extension, referent, or focus across multiple languages.
    • Bridging multiple domains, disciplines, or communities of practice: the enabling of brokering or switching between domain-focused vocabularies of varying terminological specificity to enhance federated discovery in heterogeneously indexed collections or exploration of transdisciplinary topic spaces.
  • Vocabulary enhancement and reuse either to extend other value vocabularies or as a basis of creation of new value vocabularies. (Often, these use cases will be prerequisites to fulfilling the discovery and enrichment use cases described above.)
    • Extending a common pivot or spine vocabulary with specialized vocabularies that become local extensions of a shared upper-level core.
    • Vocabulary merging: supporting the construction of a new vocabulary that encompasses both Voc1 and Voc2, or the integration of one vocabulary into the other (as an extension or satellite of the other vocabulary)
  • Publication, discovery, and maintenance of tools or services of vocabulary alignment.
    • Alignment-level description that enables one-stop shopping of value vocabulary alignments and/or contents provided by these vocabularies.
    • Change management and versioning of alignments (e.g., crosswalks, link maps): offering update and notification services to allow applications using vocabulary alignments to keep pace with changes in source or target vocabularies, or to keep targeting a specific stable version.
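
The change management use case can be sketched as a comparison between two versions of a target vocabulary. The example below is a simplification under stated assumptions: vocabulary versions are plain sets of concept identifiers, mappings are (source, target) pairs, and all names (check_alignment, the voc1:/voc2: identifiers) are hypothetical. A real notification service would also need to handle changed labels, moved concepts, and explicit deprecation or replacement statements.

```python
# Hypothetical alignment from Voc1 to Voc2, created against version 1 of Voc2.
mappings = [("voc1:cats", "voc2:felines"), ("voc1:dogs", "voc2:canines")]

voc2_v1 = {"voc2:felines", "voc2:canines"}
voc2_v2 = {"voc2:felines", "voc2:dogs"}  # "voc2:canines" no longer exists


def check_alignment(mappings, old_targets, new_targets):
    """Report mappings made stale by a new target version, and new unmapped concepts."""
    removed = old_targets - new_targets
    added = new_targets - old_targets
    mapped_targets = {target for _, target in mappings}
    return {
        # Mappings whose target was removed between the two versions.
        "stale": [m for m in mappings if m[1] in removed],
        # Newly added target concepts that no mapping points to yet.
        "unmapped_new": sorted(c for c in added if c not in mapped_targets),
    }


report = check_alignment(mappings, voc2_v1, voc2_v2)
print(report)
```

An application that prefers stability over currency could instead pin the alignment to voc2_v1 and ignore the report until it is ready to migrate.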

Relevant technologies

  • Ontology alignment tools from Semantic Web research [3], for example:
    • R2R mapping framework for converting data from one metadata element set to another
    • SILK for instance-level mapping of Linked Data resources (can be applied to value vocabularies)
    • Yearly evaluation campaigns are run with some alignment tools in tracks that are relevant to GLAM [4].
  • Registries:
    • Vocabulary registries: the Metadata Registry allows creating and serving inter-vocabulary mappings (both for value vocabularies and metadata element sets).
    • Research on alignment-oriented repositories for value vocabularies includes the following projects: FinnONTO, NeON, CATCH and Bioportal (in the biology domain).
    • The Sameas.org co-reference service serves semantic equivalence links harvested from the linked data cloud. General Semantic Web services like sindice.com can achieve a similar result.
  • Vocabularies that can be used as "hubs" (specific design, wide coverage...)
  • Persistent identifiers: Civil War Data 150 and other cases assign persistent URIs to their key entities, and rely on using (sometimes on a temporary basis) external identifiers, such as DBpedia/Freebase ones.

Relevant vocabularies

Mapped/merged Value Vocabularies

(available through terminology services or published vocabs)

Problems and limitations

Missing Vocabularies

  • Current mapping vocabularies in RDF reveal some expressivity issues:
    • As more types of value vocabularies are aligned, there can be a gap between the entities to be aligned and what mapping vocabularies can express. Examples include the lack of specific support in SKOS for:
      • expressing mapping between compound concepts represented in a pre-coordinated and a post-coordinated vocabulary,
      • expressing mapping between concepts from vocabularies that have different structures, for example, a classification system and a thesaurus.
    • More investigation is needed on links between entities of different semantic types, e.g., Concepts vs. RWO (Real World Objects), Concepts vs. (OWL) Classes, RWO to RWO, etc.
    • In addition to the type of link (mapping [property]), other information such as provenance (creation methods, e.g., automatic vs. manual mapping, degree of reliability, etc.) would probably be required to ease the sharing and re-use of alignment results.
    • There is a need for generalized mapping rules for the degrees of mapping.
  • Usage of standard patterns for mapping is not yet a reality
    • There are many proposals, not all yet agreed-upon.[1]
    • Available semantic mapping properties have been sometimes overused or used inappropriately. For example, 'owl:sameAs' has been used to express many kinds of mappings, beyond its original formal semantics.
    • Although there are new cases of using SKOS as a pattern in the vocabularies beyond KOS, such as in MADS and VIAF, there is a need for a comprehensive study of the effectiveness of such an approach, regarding the benefits for the mapping of the vocabularies.
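
One reason the overuse of 'owl:sameAs' matters is that it formally denotes identity, which is symmetric and transitive: a single inappropriate link merges everything connected on either side of it. The sketch below illustrates this with a small union-find structure that groups resources into identity clusters. The identifiers and the function name identity_clusters are hypothetical, and the code is an illustration of the effect rather than a description of how any particular co-reference service works.

```python
def identity_clusters(links):
    """Group resources connected by symmetric, transitive identity links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)  # union the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in clusters.values())


# Two correct identity links between hypothetical datasets a, b, c and d.
good_links = [("a:Paris", "b:Paris_France"), ("c:Paris_Texas", "d:Paris_TX")]
print(identity_clusters(good_links))

# One careless link between "similar" resources collapses both clusters into one.
bad_links = good_links + [("b:Paris_France", "c:Paris_Texas")]
print(identity_clusters(bad_links))
```

Mapping properties with weaker semantics, such as skos:closeMatch, are not transitive and therefore do not propagate such errors in the same way.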

[1] cf. slide 21 of Mike Bergman's slides at DC 2010

Data incompatibilities or lack of full compatibility

  • There is a general sparseness of linkage in the LOD cloud.
  • Resources to be aligned vary in their scopes and granularity levels, modeling principles (e.g., thesauri vs. classification systems), language and culture, and many other aspects.
  • Quality and size of resources involved in the alignment are heterogeneous.
  • Semantic enrichment may be needed for some vocabularies before vocabulary integration.

Community guidance/organization issues

  • Cost of intellectual work for vocabulary mapping, esp. for complex metadata element sets or large value vocabularies.
  • Diversity of users' needs regarding the alignment quality. It is difficult to reach a consensus about what is a good mapping for any project.
    • There will be different applications of a vocabulary alignment. The same alignment may be performed differently in different scenarios. Users' needs are diverse and can have a direct impact on the mapping practices and results.
    • Patterns for vocabulary alignment can be multiple and decisions have to be made based on the assessments. For example, there are direct mapping and backbone (hub) mapping approaches. Within the same project different patterns may be combined.
  • Copyright and licensing can influence mapping strategies and their implementation.
    • Selection of sources is largely influenced by their availability.
    • There is a need to differentiate between the licensing and use of metadata about digital assets and the licensing and use of the assets themselves.
  • The ownership of the alignment data is still an unclear area. Where should one publish mappings if multiple vocabularies are involved? Who owns the original data and mapping expression?
  • Reliability of data sources that are targets of alignments is critical yet difficult to discover without certain levels of investment. (See also the 'Missing Vocabularies' section.)
  • Re-alignments may be needed when the vocabularies involved are updated. Participants need to be informed about other vocabularies' updating policies, workflows, and frequencies. Such updating needs to be incorporated in the mapping results or routine. (See also the 'Technology availability/questions' section.)
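
The difference between the direct and backbone (hub) mapping approaches mentioned above can be sketched as follows. The example assumes that mappings to and from the hub are exact equivalences held in plain dictionaries; the identifiers and the function names compose_via_hub and alignments_needed are hypothetical. Composition through a hub is only safe for equivalence links: chaining looser links (for example broader or close matches) can yield misleading derived mappings.

```python
def compose_via_hub(voc1_to_hub, hub_to_voc2):
    """Derive Voc1 -> Voc2 mappings by following exact links through a hub."""
    derived = {}
    for concept, hub_concept in voc1_to_hub.items():
        if hub_concept in hub_to_voc2:
            derived[concept] = hub_to_voc2[hub_concept]
    return derived


def alignments_needed(n_vocabularies):
    """Pairwise alignments to create for n vocabularies: (direct, hub).

    The hub is counted as one of the n vocabularies.
    """
    direct = n_vocabularies * (n_vocabularies - 1) // 2
    hub = n_vocabularies - 1
    return direct, hub


voc1_to_hub = {"voc1:cats": "hub:599.75", "voc1:dogs": "hub:599.77"}
hub_to_voc2 = {"hub:599.75": "voc2:felines"}

print(compose_via_hub(voc1_to_hub, hub_to_voc2))
print(alignments_needed(5))
```

The trade-off is visible in the output: the hub approach needs far fewer alignments as the number of vocabularies grows, but a concept without a counterpart in the hub (or in the hub-to-target mapping) cannot be reached.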

Technology availability/questions

  • Alignment tools (for metadata element sets and value vocabularies)
    • Research is more focused on ontologies than on general library value vocabularies
    • Tools have scalability issues for large vocabularies
  • Provenance of alignment information has only been lightly addressed
  • Management of mappings re. vocabulary evolution
    • Concept scheme evolution is challenging considering that even deprecated concepts and relationships must maintain an accessible URL
    • Mappings should be updated to take into account new elements in the mapped vocabularies
  • Taking over aligned vocabularies in a local repository remains an option
  • Handling mapping to not-yet LD-published vocabularies
  • Diversity of application environments, e.g., integration into a language technology processing pipeline, potentially crossing technology stacks (RDF & XML based Web Services)

References

[1] http://www.mail-archive.com/public-lod@w3.org/msg05298.html

[2] http://doi.ieeecomputersociety.org/10.1109/MIS.2009.26

[3] J. Euzenat, P. Shvaiko: Ontology Matching. Springer, 2007.

[4] Ontology Alignment Evaluation Initiative:

Other projects having experimented with vocabulary alignment in a non linked data context: DESIRE, CARMEN, Renardus, AQUARELLE, LIMBER, SWAD-Europe, MSAC, KoMoHe