Cluster VocAlign

From Library Linked Data
Revision as of 12:28, 13 January 2011 by Aisaac (Talk | contribs)

Authors: Antoine Isaac, Michael Panzer, Marcia Zeng

Background

Raw material selected from http://www.w3.org/2005/Incubator/lld/wiki/Use_Case_Vocabulary_Merging

Libraries use a variety of controlled subject vocabularies and classification schemes to index items in their collections. Although most collections will employ only a single scheme, different schemes may be chosen to index different collections within a library or in separate libraries; schemes are chosen on the basis of language, subject focus (general or specific), granularity (specificity), user expectation, and availability and support (cost, currency, completeness, tools).

For example, a typical academic library will operate separate metadata systems for the library's main collections, special collections (e.g. manuscripts, archives, audiovisual), digital collections, and one or more institutional repositories for teaching and research output; each of these systems may employ a different subject vocabulary, with little or no interoperability between terms and concepts.

Note [AI]: this is true for value vocabularies AND for metadata element sets

Search and Browse issues

Raw material selected from http://www.w3.org/2005/Incubator/lld/wiki/Use_Case_Vocabulary_Merging

Library communities continue to develop resource discovery services for consortia with a geographical, subject, sector (public, academic, school, special libraries), and/or domain (libraries, archives, museums) focus. Services are based on distributed searching (e.g. via Z39.50) or metadata aggregations (e.g. OCLC's WorldCat and OAIster). As a result, the number of different subject schemes encountered in such services is increasing. Trans-national consortia (e.g. Europeana) add to the complexity of the environment by including subject vocabularies in multiple languages.

Users expect a single point of search in consortial resource discovery services involving multiple organisations and large-scale metadata aggregations. Users also expect to be able to search for subjects using their own language and terms in an unambiguous, contextualised manner.

[MZ] This also applies to structured navigation within collections, especially in a multilingual context. Libraries and digital collections can use universal value vocabularies of a multilingual nature, such as classification systems that use language-independent notations, like DDC. For example, Dewey.info is to be used in The Multilingual Library in Oslo (Use Case Pode). In order to reach this goal, existing value vocabularies and the universal value vocabulary must be aligned, even when they have different structures (e.g., classifications vs. subject headings) and different specificity levels.

Metadata management issues

[AI] There are back-office issues too: managing and publishing data is less efficient if each institution develops its own value vocabularies and metadata element sets in isolation. Often an institution's vocabularies could benefit from work already done by another institution (identification of new relevant concepts for one field, maintenance of authority sources, synonym lists). This is especially true for an institution specialized in one specific field, which could re-use and extend parts of a more general vocabulary without having to start a new vocabulary from scratch (cf. AGROVOC case). Some vocabularies are re-used in many datasets, and are often the result of collaborative development (possibly under the supervision of a central agency): e.g., LCSH, DDC, UDC, AGROVOC for value vocabularies, or Dublin Core for metadata element sets. But there are still many vocabularies with overlapping foci that are developed independently, harming interoperability between the systems that use them.

Topic in the Context of Linked Data

Linked data technologies provide tools to express, share and exploit semantic mapping or merging of concepts across value vocabularies, e.g., represented using SKOS, as well as of elements (classes, properties) from metadata element sets, as defined in (RDFS/OWL) ontologies. On the other hand, the ease of publishing data without bothering about these semantic connections in the first place is raising problems: Proliferation of URIs and Managing Coreference has been identified as one of the main "semantic elephants" in the room [1]. The issue of establishing connections (especially equivalence) between entities that are semantically comparable is common to all Linked Data applications. The current number of links is an order of magnitude below the number of entities published on the LOD cloud, and only one dataset (DBpedia) serves as a semantic mapping hub for almost the entire LOD cloud.

[1] http://www.mail-archive.com/public-lod@w3.org/msg05298.html
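As an illustration of how such mappings can be expressed, here is a minimal sketch that serializes a few SKOS mapping statements as N-Triples using only the Python standard library. The SKOS namespace and the property names (exactMatch, broadMatch, relatedMatch) come from the SKOS Recommendation; the concept URIs under example.org are invented for the example and do not refer to any real vocabulary.

```python
# Namespace of the SKOS vocabulary (W3C Recommendation).
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical concept URIs, invented for this illustration.
VOC1 = "http://example.org/voc1/"
VOC2 = "http://example.org/voc2/"

# Each row: (source concept, SKOS mapping property, target concept).
mappings = [
    (VOC1 + "cats", "exactMatch", VOC2 + "c0042"),
    (VOC1 + "kittens", "broadMatch", VOC2 + "c0042"),
    (VOC1 + "pets", "relatedMatch", VOC2 + "c0107"),
]


def to_ntriples(rows):
    """Serialize mapping rows as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{SKOS}{p}> <{o}> ." for s, p, o in rows)


print(to_ntriples(mappings))
```

Keeping the mapping property explicit in each row matters because the different SKOS mapping properties carry different semantics: an equivalence link can be followed in both directions, whereas a broadMatch link cannot.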

Scenarios (Case Studies)

With a first attempt to map to Goals

  • Use Case Vocabulary Merging
    • SEARCH/BROWSE & FEDERATE: the case aims at allowing users to do (semantic/multilingual) search among different sources
    • MAP(value): semantic mapping of the vocabularies used by the different sources is required
  • Use Case AGROVOC Thesaurus
    • REUSE-SCHEMAS: the goal is to make AGROVOC as compatible with other linked data as possible and thus more easily consumable, facilitating REUSE-VALUE-VOCABS in *other* cases
    • MAP (value): the case mentions facilitating the correspondence between vocabularies. FAO wants to map AGROVOC to EuroVOC and other vocabularies.
    • SEARCH/BROWSE: one crucial aim is to find concepts for indexing (with the multilingual aspect being crucial), and possibly to support an external distributed vocabulary maintenance environment: via the API, searching existing terms can help identify gaps and add new terms.
  • Use Case Civil War Data 150
    • SHARE: (share object descriptions (data) among institutions, not limited to bibliographic data) CWD150's goal is to provide tools to discover information about the American Civil War from across multiple institutions and collections (Library, Museum, Archives, Individuals, etc.)
    • MAP (value): CWD150 plans to 1) create an ontology from the Civil War Soldiers and Sailors Database (dates, places, people, events, etc.) and 2) provide a crosswalk to corresponding identifiers in Freebase, DBpedia, and potentially LCSH.
    • SEARCH/BROWSE: The results are connections, based on the strong identifiers and taxonomy of the Civil War, that enable users to search, browse and explore information about a particular place, regiment, battle, or person.
    • DISCOVER/SUGGEST: One scenario is visualizing the troop movements and engagements of a regiment and matching them to a timeline to show troop casualties.
  • Use Case Language Technology
    • various goals of language technology (which are not necessarily LLD ones). One example of language technology is named entity recognition (NER), which can also have various goals, including
      • RELATE(new): NER can be used to relate an object to a specific entity (place, person) that is connected to it, exploiting a natural language description of that object.
      • ENHANCE VOCABs (new): language technology can utilize available value vocabularies (e.g., authorities) or values from structured data (e.g., DBpedia, Wordnet) to align terms that represent instance-like entities (current-and-new terms, cross-lingual terms) in order to enrich and update a value vocabulary.
    • PUBLISH & MAP (value): the case is dependent on (multilingual) library vocabularies having been published as LD and connected to other datasets (such as DBpedia, Wordnet)
  • Use Case Bridging OWL and UML: it is unclear whether this case is specific to the library (or wider GLAM) context. But one can see potential applications regarding:
    • MAP (metadata) and RELATE (new): UML specification can be used as a hub to align domain model components (class, attributes, association, operation, instance, etc.) in a design at a human-understandable level. Connection between OWL and UML would help alignment of these components at the machine-understandable level.
    • REUSE-SCHEMAS: metadata elements can be expressed in UML and then explored, reused, and applied in building an OWL ontology.

It is unclear whether the following cases should be in the cluster, as they don't explicitly mention "vocabulary alignment". In fact, they don't feature several (value) vocabularies that would or should be aligned.

  • Use Case Subject Search
    • SEARCH/BROWSE and DISCOVER/SUGGEST: the aim from a user perspective is to better fit library objects in web search engines, enabling users to access subjects and explore a subject's "landscape" by:
    • NAME WITH URI & PUBLISH: assigning URIs to subjects that occur in books and publishing relevant lists of objects in the pages accessible from these URIs
    • RELATE (aggregation): providing (1) subject representations, made of the books that are relevant for the subjects (2) object representations conceived as a way to provide various documents (medium, editions...) that are relevant to the user's information need. But the case is not explicit about what should be done here.
    • REUSE-VALUE-VOCABS: re-using subjects from reference authority lists with rich information attached to them
  • Use Case Component Vocabularies
    • DESCRIBE: the case aims at creating bibliographic metadata
    • REUSE-VALUE-VOCABS: descriptions re-use values coming from already published reference vocabularies
    • RELATE (new): new descriptions are being created, which connect books to vocabulary elements
    • PUBLISH & NAME WITH URI: bibliographic metadata and vocabularies are published as interconnected LD
    • SEARCH/BROWSE: the vocabulary LD is used in the process during which users retrieve books


Cases from other clusters can be relevant, see Relevant cases


Scenarios (Extracted Use Cases)

Starting point is 4 "general applications" for vocabulary alignment data [1] (taking Voc1 and Voc2 to be the vocabularies being aligned):

  1. Reindexing: supporting the indexing with Voc2 of books that are already indexed with Voc1, or vice versa.
  2. Concept-based search across vocabularies: supporting the retrieval of Voc1-indexed books for queries that use Voc2 concepts, or vice versa.
  3. Navigation across vocabularies: supporting the exploration of concept spaces across vocabularies, and giving (exploratory) access to collection items indexed with selected concepts.
  4. Vocabulary merging: supporting the construction of a new vocabulary that encompasses both Voc1 and Voc2, or the integration of one vocabulary into the other (so as to extend this other vocabulary)
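The first two applications above can be sketched with plain dictionaries. This is only an illustration under simplifying assumptions: the alignment contains equivalence links only, and all concept and book identifiers (voc1:..., voc2:..., book-...) are invented.

```python
from collections import defaultdict

# Hypothetical alignment: Voc2 concept -> equivalent Voc1 concepts.
alignment = {
    "voc2:felines": {"voc1:cats"},
    "voc2:canines": {"voc1:dogs", "voc1:wolves"},
}

# Hypothetical collection indexed with Voc1 concepts.
index_voc1 = {
    "book-a": {"voc1:cats"},
    "book-b": {"voc1:dogs"},
    "book-c": {"voc1:wolves", "voc1:cats"},
}


def search(voc2_concept):
    """Application 2: retrieve Voc1-indexed books for a Voc2 query concept."""
    targets = alignment.get(voc2_concept, set())
    return sorted(book for book, subjects in index_voc1.items()
                  if subjects & targets)


def reindex():
    """Application 1: derive a Voc2 index from the existing Voc1 index."""
    reverse = defaultdict(set)  # Voc1 concept -> Voc2 concepts
    for voc2, voc1_concepts in alignment.items():
        for voc1 in voc1_concepts:
            reverse[voc1].add(voc2)
    return {book: set().union(*(reverse[s] for s in subjects))
            for book, subjects in index_voc1.items()}


print(search("voc2:canines"))
print(reindex())
```

With hierarchical or partial mappings (e.g. broader/narrower matches), both functions would need a policy for when a non-equivalent link is good enough, which depends on the application.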

These can be further abstracted in 3 categories of uses:

  • Indexing and retrieval related uses. In all these cases, the alignment is used by the applications that employ value vocabularies (i.e., the metadata datasets).
    • 1. Reindexing: supporting the indexing with Voc2 of books that are already indexed with Voc1, or vice versa.
    • 2. Concept-based search across vocabularies: supporting the retrieval of Voc1-indexed books for queries that use Voc2 concepts, or vice versa.
    • 3. Navigation across vocabularies: giving (exploratory) access to collection items indexed with selected concepts.
    • Other use cases may concern browsing and navigation applications that bridge different natural languages and cultural or community terminologies.
  • Enhancement of the value vocabularies themselves, or reuse of a vocabulary (some portions or structures) to extend another value vocabulary or to contribute to the creation of another value vocabulary.
    • 4. Vocabulary merging: supporting the construction of a new vocabulary that encompasses both Voc1 and Voc2, or the integration of one vocabulary into the other (so as to extend this other vocabulary)
  • Services of vocabulary/terminology tools that enable one-stop shopping of value vocabularies and/or contents provided by these vocabularies.
    • 3. Navigation across vocabularies: supporting the exploration of concept spaces across vocabularies.

[AI] Can we fit back-end application scenario such as NLP (Use_Case_Language_Technology) here?

Relevant technologies

  • Civil War 150:
    • persistent URIs for all of the key entities
    • DBpedia/Freebase identifiers can be used as a proxy until it becomes possible to point to the National Park Service as a source.
  • Sameas.org co-reference service serves semantic equivalence links harvested on the linked data cloud, for any kind of URIs
  • The Metadata Registry allows creating and serving inter-vocabulary mappings (both for value vocabularies and metadata element sets)
  • Vocabulary Mapping Framework for anchoring metadata element sets together
  • R2R mapping framework (for converting from one metadata element set to another)
  • SILK (for instance-level mapping)
  • research on alignment-oriented value vocabulary repositories: NeOn project, CATCH, FinnONTO

[MZ]: These are added according to other sources, such as SKOS/Datasets page:

Relevant vocabularies

Mapped/merged Value Vocabularies

(available through terminology services or published vocabs)

Problems and limitations

Missing Vocabularies:

  • Expressivity issues of current mapping vocabularies in the RDF pile
    • E.g., no support for compound concepts in SKOS to express mappings between a pre-coordinated and a post-coordinated vocabulary, issue of generalized mapping rules
    • Links between entities of different semantic types (Concepts vs RWO, Concepts vs. Classes, RWO to RWO: sameAs issue)
    • Easing the sharing and re-use of alignments would probably often require representing more than the type of link (mapping), mostly provenance information
      • Who created it? Manual vs. Automatic? Which alignment strategy or tool? Is there a degree of confidence?
  • Standard alignment languages (SKOS mapping properties, OWL sameAs and equivalentClass) are still experimental:
    • Great variety of proposals, not all yet agreed-upon, cf. slide 21 of Mike Bergman's slides at DC 2010
    • Problem with usage of semantic mapping properties, e.g. owl:sameAs being overloaded to express many kinds of mappings, beyond its original formal semantics
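To make the provenance questions listed above concrete (who created a mapping, manually or automatically, with which tool, and with what degree of confidence), here is a minimal sketch of a mapping record that carries this information. The field names, the identifiers and the 0.8 threshold are assumptions made for illustration; they do not follow any existing standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    """A mapping link plus the provenance needed to share and re-use it."""
    source: str
    relation: str                       # e.g. "skos:exactMatch"
    target: str
    creator: str                        # who created it
    method: str                         # "manual" or "automatic"
    tool: Optional[str] = None          # alignment tool or strategy, if any
    confidence: Optional[float] = None  # degree of confidence, if available


def trusted(mappings, min_confidence=0.8):
    """Keep manual mappings and automatic ones at or above the threshold."""
    return [m for m in mappings
            if m.method == "manual"
            or (m.confidence is not None and m.confidence >= min_confidence)]


# Hypothetical records, invented for this example.
candidates = [
    Mapping("voc1:cats", "skos:exactMatch", "voc2:felines",
            "librarian-1", "manual"),
    Mapping("voc1:pets", "skos:closeMatch", "voc2:animals",
            "batch-job", "automatic", tool="string-matcher", confidence=0.55),
    Mapping("voc1:dogs", "skos:exactMatch", "voc2:canines",
            "batch-job", "automatic", tool="string-matcher", confidence=0.92),
]

print([m.source for m in trusted(candidates)])
```

A consumer of shared alignments could apply a different filter per application, which is one reason to publish the provenance rather than pre-filtered links.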

Data incompatibilities or lacks:

  • General sparseness of linkage in the LOD cloud
  • Resources to align vary along: scope and granularity, modeling principles (e.g. thesauri vs. classification systems), language and culture
    • E.g., what is a street in English and in Japanese (if there is one)?
    • Alignment between high quality, small resources and larger resources with heterogeneous quality, within the same application
  • Semantic enrichment may be needed for some vocabularies before KOS integration

Community guidance/organization issues:

  • Cost of (manual) vocabulary mapping, esp. for complex metadata element sets or large value vocabularies
  • Diversity of users' needs re. alignments: what is a good mapping?
    • knowing the application of a vocabulary alignment is important, esp. for value vocabularies: the same alignment performs differently in different scenarios
  • Patterns for vocabulary alignment: from pivot (BS8723's backbone approach) to many-to-many
  • Copyright and licensing can influence mapping strategies and their implementation
    • selecting sources
    • differentiating between the licensing and use of metadata about digital assets, and the assets themselves.
  • Where to publish mappings? Who owns the data?
  • Reliability of data sources that are target of alignments
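The difference between the pivot and many-to-many alignment patterns mentioned above can be illustrated with a small sketch. It assumes exact equivalence links only (composing non-equivalence links through a pivot is not generally sound), and all identifiers are invented.

```python
def compose_via_pivot(a_to_pivot, b_to_pivot):
    """Derive A -> B candidate equivalences from two alignments to a pivot."""
    pivot_to_b = {}
    for b, p in b_to_pivot.items():
        pivot_to_b.setdefault(p, set()).add(b)
    return {a: pivot_to_b[p] for a, p in a_to_pivot.items() if p in pivot_to_b}


def alignments_needed(n):
    """Pairwise alignments to maintain for n vocabularies (pivot included):
    returns (pivot pattern, many-to-many pattern)."""
    return n - 1, n * (n - 1) // 2


# Hypothetical alignments of two vocabularies to a shared pivot vocabulary.
a_to_pivot = {"a:cats": "pivot:636.8", "a:dogs": "pivot:636.7"}
b_to_pivot = {"b:katten": "pivot:636.8", "b:paarden": "pivot:636.1"}

print(compose_via_pivot(a_to_pivot, b_to_pivot))
print(alignments_needed(5))
```

The pivot pattern keeps the number of alignments linear in the number of vocabularies, at the price of losing links for concepts that have no counterpart in the pivot.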

Technology availability/questions:

  • Tools for mapping (for metadata element sets and value vocabularies)
    • Research is more focused on ontologies than library general value vocabularies
  • Scalability issues for large vocabularies
  • Management of mappings re. vocabulary evolution.
    • Concept scheme evolution is challenging considering that even deprecated concepts and relationships must maintain an accessible URL
    • Mappings should be updated to take into account new elements in the mapped vocabularies
  • Diversity of application environments
    • e.g., integration into a language technology processing pipeline, potentially crossing technology stacks (e.g. RDF > XML based Web Services)
  • Provenance of alignment information (cf. missing vocabularies)
  • Handling mapping to not-yet LD-published vocabularies
  • Taking over aligned vocabularies in a local repository
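The point above about managing mappings when vocabularies evolve can be sketched as a simple review step run after each vocabulary release. This is an illustration only: mappings are reduced to (source, target) pairs and all identifiers are invented.

```python
def review_mappings(mappings, current_concepts, deprecated):
    """Split mappings into valid and stale ones, and list source concepts
    that are current, not deprecated, and not yet mapped."""
    stale = [(s, t) for s, t in mappings if s in deprecated or t in deprecated]
    valid = [m for m in mappings if m not in stale]
    mapped = {s for s, _ in valid}
    unmapped = sorted(current_concepts - deprecated - mapped)
    return valid, stale, unmapped


# Hypothetical state after a new release of the source vocabulary.
mappings = [
    ("voc1:cats", "voc2:felines"),
    ("voc1:telegraphy", "voc2:telecom"),
]
current_concepts = {"voc1:cats", "voc1:telegraphy", "voc1:drones"}
deprecated = {"voc1:telegraphy"}

valid, stale, unmapped = review_mappings(mappings, current_concepts, deprecated)
print(valid, stale, unmapped)
```

Stale mappings are reported rather than deleted, since deprecated concepts are expected to remain accessible and existing data may still refer to them.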

References

[1] http://doi.ieeecomputersociety.org/10.1109/MIS.2009.26

  • other projects having experimented with vocabulary alignment in a non linked data context: DESIRE, CARMEN, Renardus, AQUARELLE, LIMBER, SWAD-Europe, MSAC, KoMoHe