Draft Vocabularies Datasets Section2

From Library Linked Data

Appendix A: An inventory of existing library Linked Data resources

The complexity and variety of available vocabularies, with their overlapping coverage, derivative relationships, and alignments, result in uncertainty for the re-use or linking efforts that are crucial to the success of linked library data. Many, especially among library professionals, are unfamiliar with the linked datasets and vocabularies that can be of use in the library domain because these have often been developed in the Semantic Web research community. A current and reliable bird's-eye view can help both novices seeking an overview of the library Linked Data domain and experts needing a quick look-up or refresher for a library Linked Data project.

The Incubator Group has therefore produced an inventory of useful resources for creating or consuming Linked Data in the library domain. This inventory, presented in a side deliverable @@@CITE@@@, shows that there are many areas where early adoption of Semantic Web and Linked Data principles and technology has led to the development of mature datasets and vocabularies. The inventory also points to areas where libraries and related organizations can still make key contributions. Finally, this document tries to provide the Linked Data community with an opportunity to understand the specific viewpoint, resources, and terminology used by the library community for their data, while helping Library and Information Science professionals grasp the Linked Data notions corresponding to their own traditions.

Though Linked Data technology differs from traditional library data concepts, this report classifies available resources into three non-mutually-exclusive categories that reflect library practices:

  • Datasets describing library-related resources, e.g., the British National Bibliography, the catalog of the Hungarian national library, the Open Library, CrossRef, Europeana;
  • Value vocabularies such as the Library of Congress Subject Headings, AGROVOC, the Virtual International Authority File (VIAF), Dewey Decimal Classification, and GeoNames;
  • Metadata element sets such as Dublin Core Metadata Terms, the elements of RDA: Resource Description and Access, Simple Knowledge Organization System (SKOS), and the Friend of a Friend vocabulary (FOAF).

Specific datasets re-use elements from various value vocabularies, and are structured according to the specifications for metadata element sets. For example, the British National Bibliography dataset re-uses concepts from the Library of Congress Subject Headings vocabulary, and is structured by properties from the Dublin Core element set. Instances of these categories are listed in the side deliverable along with a brief description, links to their online locations, and links to the use cases that our group has gathered from the community. A visualization is also presented to show relationships among datasets and value vocabularies (@@@Figure x@@@).
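The division of labor among the three categories can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain tuples stand in for RDF triples, the record and subject identifiers under example.org are hypothetical, and only the Dublin Core terms namespace is a real one. It does not reproduce the actual British National Bibliography data.

```python
# Plain (subject, predicate, object) tuples stand in for RDF triples.
DCTERMS = "http://purl.org/dc/terms/"        # metadata element set (real namespace)
SUBJECTS = "http://example.org/subjects/"    # hypothetical value vocabulary
RECORDS = "http://example.org/records/"      # hypothetical dataset

# The dataset describes two books. It is structured by properties taken
# from the element set and points to a concept from the value vocabulary.
dataset = [
    (RECORDS + "book1", DCTERMS + "title", "Pride and Prejudice"),
    (RECORDS + "book1", DCTERMS + "subject", SUBJECTS + "courtship"),
    (RECORDS + "book2", DCTERMS + "title", "Emma"),
    (RECORDS + "book2", DCTERMS + "subject", SUBJECTS + "courtship"),
]


def records_about(triples, concept):
    """Return the records whose dcterms:subject is the given concept."""
    return sorted(
        s for s, p, o in triples
        if p == DCTERMS + "subject" and o == concept
    )


print(records_about(dataset, SUBJECTS + "courtship"))
```

Because both records point to the same concept identifier rather than to a local text string, any other dataset that re-uses that identifier becomes discoverable through the same lookup.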

Our side deliverable @@@CITE@@@ is intended to provide broad coverage of the available datasets. However, we are well aware that this report cannot capture the full diversity of current datasets, especially given the dynamic nature of Linked Data: new resources are continuously made available, and existing ones are regularly updated. To get a representative overview, we intentionally based our work on the use cases we received. Additional coverage was provided by the experts who participated in the Incubator Group to ensure that key resources available at the time of writing were not overlooked.

To help make our report useful in the long run, we have included a number of links to tools or Web sites which we believe can provide up-to-date information after the Incubator Group has completed its work. In particular, we have set up a Library Linked Data group (http://ckan.net/group/lld) as a site to collect information on relevant library linked datasets. This site is hosted by the Comprehensive Knowledge Archive Network (CKAN) (@@@http://ckan.net@@@), a repository designed to be a central hub for descriptions of data packages, with an emphasis on those that are published as Open Data. We hope that this CKAN site will be actively maintained by the library Linked Data community after the Incubator Group has ended.

Semantic alignment

"Alignments" are links between semantically equivalent, similar, or related entities across different value vocabularies, metadata element sets, or datasets. Many semantic links across value vocabularies are already available, some of them obtained through high-quality manual work, as in the MACS or CRISSCROSS projects. Many value vocabulary publishers strive to establish and maintain links to resources semantically close to their own. VIAF, for example, merges authority records from over a dozen national and regional agencies. AGROVOC has been published with links to six other major thesauri or subject heading lists. Though quantitative evaluation was outside the scope of our effort, we hypothesize that many more such links should be created. Much work remains to be done to increase alignments among value vocabularies in the "library data cloud".

Alignments are likewise relevant for metadata element sets. As evidenced in the Linked Open Vocabularies inventory, practitioners generally follow the good practice of re-using existing element sets or building application profiles that re-use elements from multiple sets. Projects such as the Vocabulary Mapping Framework aim at supporting alignment.

The lack of institutional support for element sets can threaten the long-term persistence of their shared meanings. Moreover, some reference frameworks, notably Functional Requirements for Bibliographic Records (FRBR), have been expressed in a number of different ontologies, and these different expressions are not always explicitly aligned -- a situation that limits the semantic interoperability of datasets in which their RDF vocabularies are used. The Library Linked Data community should promote the coordinated re-use or extension of existing element sets over the creation of new sets from scratch. Aligning already existing element sets when they overlap, typically using semantic relations from the RDF Vocabulary Description Language (RDFS) and OWL Web Ontology Language, should also be encouraged. We hope that better communication among the creators and maintainers of these resources, as advocated by the LOD-LAM initiative, the Dublin Core Metadata Initiative and FOAF Project, and our own Incubator Group, will lead to more explicit conceptual connections between element sets.
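To make the effect of aligning element sets concrete, here is a minimal sketch in Python. It assumes two hypothetical properties (an "author" and a "writer" element under example.org) that are aligned to the Dublin Core creator property through rdfs:subPropertyOf and owl:equivalentProperty. Plain tuples stand in for RDF triples, and the small closure function is a simplification for illustration, not a full RDFS or OWL reasoner.

```python
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
OWL_EQUIVALENT_PROPERTY = "http://www.w3.org/2002/07/owl#equivalentProperty"
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"
EX_AUTHOR = "http://example.org/elements/author"   # hypothetical element
EX_WRITER = "http://example.org/other/writer"      # hypothetical element

# Alignment statements between the element sets.
alignments = [
    (EX_AUTHOR, RDFS_SUBPROPERTY_OF, DCTERMS_CREATOR),
    (EX_WRITER, OWL_EQUIVALENT_PROPERTY, EX_AUTHOR),
]

# Two datasets, each using its own element for the same relationship.
data = [
    ("http://example.org/records/book1", EX_AUTHOR, "http://viaf.org/viaf/102333412"),
    ("http://example.org/records/book2", EX_WRITER, "http://viaf.org/viaf/102333412"),
]


def properties_implying(target, alignments):
    """Collect every property whose statements also count as `target` statements."""
    found = {target}
    changed = True
    while changed:
        changed = False
        for s, p, o in alignments:
            # A sub-property of a known property implies the target as well.
            if p == RDFS_SUBPROPERTY_OF and o in found and s not in found:
                found.add(s)
                changed = True
            # Equivalence is symmetric, so it propagates in both directions.
            if p == OWL_EQUIVALENT_PROPERTY and (s in found) != (o in found):
                found.update((s, o))
                changed = True
    return found


creator_like = properties_implying(DCTERMS_CREATOR, alignments)
print(sorted(s for s, p, o in data if p in creator_like))
```

Without the two alignment statements, a query for the Dublin Core creator property would find neither record; with them, both records are retrieved even though neither uses that property directly.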

Datasets may also be aligned. For example, Open Library attaches OCLC numbers to its bibliographic items. Re-use is arguably less central an issue for descriptions of individual books and other library-related resources than for metadata element sets and value vocabularies; union catalogs, for example, already realize a significant level of merging of book-level data. Yet it is crucial -- indeed, one of the expected benefits of linked data applied in our domain -- that library-related datasets be published and interconnected rather than continue to exist in their own silos. Because of past practices the community is already well aware of challenges such as "deduplication" @@@Link@@@.

We also note that links are being built between library resources and resources originating in other organizations or domains. For example, VIAF aggregates authority records from various library agencies, identifies the primary entities involved, and links them, where possible, to DBpedia, which is a Linked Data extraction of Wikipedia. Here is the semantic alignment for Jane Austen: VIAF (http://viaf.org/viaf/102333412), Wikipedia (http://en.wikipedia.org/wiki/Jane_Austen), and DBpedia (http://dbpedia.org/resource/Jane_Austen). This illustrates one of the expected benefits of linked data, which is that data can be easily networked irrespective of its origins. In this way the library domain can benefit from re-using data from other fields, and at the same time library data can contribute to initiatives that did not originate in the library community.
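The Jane Austen alignment can be used to sketch how a consumer might combine statements published by different sources. The three URIs are the ones given in the text. Everything else is an assumption made for illustration: the choice of owl:sameAs as the linking predicate, the two descriptive statements, and the use of plain tuples in place of RDF triples. The sketch does not reproduce what VIAF or DBpedia actually publish.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
FOAF_PRIMARY_TOPIC_OF = "http://xmlns.com/foaf/0.1/isPrimaryTopicOf"

VIAF = "http://viaf.org/viaf/102333412"
DBPEDIA = "http://dbpedia.org/resource/Jane_Austen"
WIKIPEDIA = "http://en.wikipedia.org/wiki/Jane_Austen"

triples = [
    (VIAF, OWL_SAME_AS, DBPEDIA),                            # the alignment link
    (VIAF, SKOS_PREF_LABEL, "Austen, Jane, 1775-1817"),      # illustrative statement
    (DBPEDIA, FOAF_PRIMARY_TOPIC_OF, WIKIPEDIA),             # illustrative statement
]


def merged_description(triples, entity):
    """Gather statements about `entity` and everything linked to it by sameAs."""
    same = {entity}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            # sameAs is symmetric and transitive: grow the set from either side.
            if p == OWL_SAME_AS and (s in same) != (o in same):
                same.update((s, o))
                changed = True
    return sorted((p, o) for s, p, o in triples if s in same and p != OWL_SAME_AS)


for predicate, value in merged_description(triples, DBPEDIA):
    print(predicate, value)
```

Starting from either identifier yields the same merged description, which is the practical meaning of data being networked irrespective of its origins.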

The creation of alignments will benefit from the availability of better linking tools. Much effort has been put into computer science research areas such as Ontology Matching. This has led to implementations based, for example, on string matching and statistical techniques. These efforts have tended to focus on metadata element sets, and the resulting tools typically are not ready to be applied more generally to the (often huge) datasets and value vocabularies of the library domain. Recent generic tools for linking data include Silk - Link Discovery Framework, Google Refine, and Google Refine Reconciliation Service API. Nonetheless, the community still needs to gain experience in their use, to share results of this experience, and possibly to build tools better suited to library Linked Data.
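As a rough illustration of the string-matching approach mentioned above, the following Python sketch proposes candidate links between two small vocabularies by comparing normalized labels. It uses the standard library's difflib rather than any of the tools named in the text. The vocabularies, their identifiers, and the 0.85 threshold are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lower-case a label, drop commas, and collapse whitespace."""
    return " ".join(label.lower().replace(",", " ").split())


def candidate_links(source, target, threshold=0.85):
    """Return (source id, target id, score) for label pairs above the threshold."""
    links = []
    for s_id, s_label in source.items():
        for t_id, t_label in target.items():
            score = SequenceMatcher(
                None, normalize(s_label), normalize(t_label)
            ).ratio()
            if score >= threshold:
                links.append((s_id, t_id, round(score, 2)))
    return links


# Two hypothetical value vocabularies with slightly different label spellings.
vocab_a = {"a:1": "Cataloging", "a:2": "Linked data"}
vocab_b = {"b:1": "Cataloguing", "b:2": "Linked Data", "b:3": "Soil science"}

for link in candidate_links(vocab_a, vocab_b):
    print(link)
```

The nested loop compares every pair of labels, so the cost grows with the product of the vocabulary sizes. This is one reason such simple techniques do not carry over directly to the often huge value vocabularies and datasets of the library domain. The output is also only a list of candidates: string similarity says nothing about whether two concepts actually mean the same thing, so results of this kind would still need review.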

One final caveat: data consumers should bear in mind that -- in contrast to traditional, closed IT systems -- linked data follows an open-world assumption: data cannot generally be assumed to be complete, and, in principle, more data may become available for any given entity. The absence of a statement therefore does not mean that the statement is false. We hope that more "data linking" will happen in the library domain in line with the projects mentioned here.
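The open-world caveat can be illustrated with a small Python sketch. It assumes a hypothetical birthPlace element under example.org and uses plain tuples in place of RDF triples. The lookup returns True when a statement is present and None (unknown) otherwise; it never returns False, because missing data is not evidence of absence.

```python
VIAF = "http://viaf.org/viaf/102333412"
PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
BIRTH_PLACE = "http://example.org/elements/birthPlace"   # hypothetical element


def has_statement(triples, subject, predicate):
    """Open-world lookup: True if stated, None (unknown) if not stated."""
    if any(s == subject and p == predicate for s, p, _ in triples):
        return True
    return None


# What one source happens to say today.
local = [(VIAF, PREF_LABEL, "Austen, Jane, 1775-1817")]
print(has_statement(local, VIAF, BIRTH_PLACE))

# Another source later publishes more data about the same entity.
remote = [(VIAF, BIRTH_PLACE, "Steventon")]
print(has_statement(local + remote, VIAF, BIRTH_PLACE))
```

A closed-world system would treat the first lookup as "no birth place"; a linked data consumer should treat it as "not yet known" and remain prepared for the answer to change as more data is linked in.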