Difference between revisions of "Draft Vocabularies Datasets Section"

From Library Linked Data

Revision as of 08:54, 7 July 2011

Available Vocabularies and Datasets

@@TODO: for general TODOs on this section see the Discussion page

The success of linked library data relies on the ability of its practitioners to identify, re-use or connect to existing datasets and data models. Linked datasets and vocabularies that are essential in the library and related domains, however, have previously been unknown or unfamiliar to many.

An inventory of existing library linked data resources

The complexity and variety of available vocabularies, together with their overlapping coverage, derivative relationships and alignments, create layers of uncertainty for anyone trying to re-use or connect to them. A current and reliable bird's-eye view is therefore essential, both for novices seeking an overview of the library linked data domain and for experts needing a quick look-up or refresher for a library linked data project.

The LLD XG thus prepared a side deliverable (http://www.w3.org/2005/Incubator/lld/wiki/Vocabulary_and_Dataset) that identifies a set of useful resources for creating or consuming linked data in the library domain. These are classified into three main groups, which are not mutually exclusive, as shown in our side deliverable: metadata element sets, value vocabularies, and datasets.

  • Metadata element sets: A metadata element set is a namespace that contains terms used to describe entities. In the linked data paradigm, such element sets are generally materialized through (RDF) schemas or (OWL) ontologies, with "RDF vocabulary" occasionally being used as an umbrella term. It may help to think of metadata element sets as defining the model, as distinct from the instance data (which falls into the value vocabulary or dataset categories below). Some examples:
    • Dublin Core defines elements such as Creator and Date (but DC does not define bibliographic records that use those elements).
    • FRBR defines entities such as Work and Manifestation and elements that link and describe them.
    • MARC21 defines elements (fields) to describe bibliographic records and authorities.
    • FOAF and ORG define elements to describe people and organisations, as might be used for describing authors and publishers.
  • Value vocabularies: A value vocabulary could be thought of as a specialized dataset that focuses on the management of discrete value/label literals for use in metadata records and/or user displays. Value vocabularies commonly focus on specific areas such as topic labels, art styles, author names, etc. They are not typically used to manage complex bibliographic resources such as books, but they are appropriate for related components, such as personal names, languages, countries, codes, etc. These act as "building blocks" with which more complex metadata record structures can be built. Many libraries require specific value vocabularies for use in particular metadata elements. A value vocabulary thus represents a "controlled list" of allowed values for an element. Broad categories of value vocabularies include: thesaurus, code list, term list, classification scheme, subject heading list, taxonomy, authority file, digital gazetteer, concept scheme, and other types of knowledge organisation systems. Note, however, that value vocabularies often have HTTP URIs assigned to the label/value, which can be used in a metadata record instead of, or in addition to, the literal value. Some examples:
    • LCSH defines topics of books.
    • Art and Architecture Thesaurus defines, among other things, art styles.
    • VIAF defines authorities.
    • GeoNames defines geographical locations (e.g. cities).
  • Datasets: A dataset is a collection of structured metadata (also known as instance data) -- descriptions of things, such as books in a library. Library records consist of statements about things, where each statement consists of an element ("attribute" or "relationship") of the entity, and a "value" for that element. The elements that are used are often selected from a set of standard elements, such as Dublin Core. The values for the elements are either taken from value vocabularies such as LCSH, or are free text values. Similar notions to "dataset" include "collection" or "metadata record set". Note that in the Linked Data context, datasets do not necessarily consist of clearly identifiable "records". Some examples:
    • A record from a dataset for a given book could have a Subject element drawn from Dublin Core, and a value for Subject drawn from LCSH.
    • The same dataset may contain records for authors as first-class entities that are linked from their books, described with elements like "name" from FOAF.
    • A dataset may be self-describing, in that it contains information about itself as a distinct entity, for example with a modified date and maintainer/curator elements drawn from Dublin Core.
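
The three categories above can be illustrated with a small sketch. It assumes a dataset is represented as a plain list of (entity, element, value) statements. The Dublin Core and FOAF namespaces are the published ones, but every example.org identifier, the author name and the modified date are hypothetical placeholders invented for this illustration, and the subject URI merely stands in for a heading from a value vocabulary such as LCSH.

```python
# Namespaces of two metadata element sets mentioned above.
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical identifiers, invented for this illustration only.
BOOK = "http://example.org/book/1"
AUTHOR = "http://example.org/person/1"
DATASET = "http://example.org/dataset"
# Placeholder standing in for a value vocabulary URI (e.g. an LCSH heading).
SUBJECT = "http://example.org/subject/linked-data"

# A dataset: statements made of an entity, an element and a value.
dataset = [
    (BOOK, DCTERMS + "subject", SUBJECT),           # value from a value vocabulary
    (BOOK, DCTERMS + "creator", AUTHOR),            # link to a first-class author entity
    (AUTHOR, FOAF + "name", "Jane Example"),        # free-text literal value
    (DATASET, DCTERMS + "modified", "2011-07-07"),  # the dataset describing itself
]


def statements_about(entity, triples):
    """Return the (element, value) pairs that describe one entity."""
    return [(element, value) for subject, element, value in triples if subject == entity]


def element_sets_used(triples, namespaces):
    """Return the namespaces whose elements appear in the statements."""
    return sorted(
        {ns for _, element, _ in triples for ns in namespaces if element.startswith(ns)}
    )
```

Calling statements_about(BOOK, dataset) returns the two statements about the book, while element_sets_used(dataset, [DCTERMS, FOAF]) reports which element sets supplied the model. Note that there is no "record" object anywhere: the description of the book is simply the set of statements that share its identifier.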

Instances of these categories are listed in the side deliverable along with a brief introduction, basic description and links to their locations. For metadata element sets and value vocabularies, the use cases collected by the LLD XG are listed under each entry, which gives a clear context for their usage. For the available metadata element sets, namespaces and descriptions of their domain coverage are briefly presented. Two visualizations are also presented to help reveal the inter-relations of metadata element sets and the relationships between datasets and value vocabularies registered in CKAN.

Our side deliverable aims at broad coverage for each of these categories. However, we are well aware that our report cannot capture the entire diversity of what is out there, especially given the dynamic nature of linked data: new resources are continuously made available, and existing ones are regularly updated. To get a representative overview, we intentionally grounded our work on the use cases that our group has gathered from the community. Additional coverage has been added by the experts who participated in the LLD XG to ensure that the most visible resources available at the time of writing have not been forgotten. Finally, to help make our report useful in the longer run, we have included a number of links to tools or web spaces, which we believe can help a reader get a more continuously updated snapshot after this incubator group has ended its work. Notably, we have set up a "Library Linked Data" group in the CKAN repository to gather information on relevant library linked datasets. We hope to actively maintain this CKAN group, but for the sake of long-term success the entire community is invited to contribute.

Some observations

Coverage

The coverage of available metadata element sets and value vocabularies is encouraging. Many such resources have been released over the past couple of years, including some flagship value vocabularies already used by many libraries, such as the Library of Congress Subject Headings and the Dewey Decimal Classification. Reference metadata frameworks are also provided in a linked data-compatible form, including Dublin Core and various FRBR implementations.

The main concern regarding coverage is the relatively low availability of bibliographic datasets. When it comes to re-use, descriptions of individual books and other library-held items are slightly less important than metadata element sets and value vocabularies. And indeed, tools like union catalogues already realize a significant level of exchange of book-level data. Yet it remains crucial -- and it is truly one of the expected benefits of linked data applied in our domain -- that library-related datasets get published and interconnected, rather than continue to exist in their own silos.

Quality and support for available sources

The level of maturity or stability of available resources varies greatly. Many resources we found are the result of (ongoing) project work or of individual initiatives, and advertise themselves as mere prototypes. The abundance of such efforts is a sign of healthy activity in the library linked data domain. In fact it should come as no surprise, since the whole linked data endeavor encourages a much more agile view on data than any previous paradigm. Yet this somewhat jeopardizes the long-term availability of, and support for, library linked data resources.

From this perspective, we find it encouraging that more and more established institutions are committing resources to linked data projects, from the national libraries of Sweden and Hungary, to the Food and Agriculture Organization of the United Nations, not to mention the Library of Congress or OCLC.

Linking

Establishing connections across various datasets is a core aspect of linked data technology, and a key condition for its success. Many semantic links across value vocabularies are already available, some of them obtained through high-quality manual work, as in the MACS or CRISSCROSS projects. Many value vocabulary publishers also clearly strive to establish and maintain links to resources that are close to theirs. VIAF, for example, merges authority records from over a dozen national and regional agencies. Although quantitative evaluation was outside the scope of our effort, we hypothesize that many more such links are possible. Consumers of library linked data should be aware of the open world assumption that characterizes it, i.e., data cannot generally be assumed to be complete, and more data could always be released for any given entity.
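
The practical consequence of the open world assumption can be sketched in a few lines. This is only an illustration under stated assumptions: links are modelled as a mapping from an entity identifier to a set of target identifiers, and the function names merge_links and is_linked are hypothetical, not taken from any existing tool.

```python
from typing import Dict, Optional, Set

Links = Dict[str, Set[str]]


def merge_links(*sources: Links) -> Links:
    """Union the links published by several sources into one mapping."""
    merged: Links = {}
    for source in sources:
        for entity, targets in source.items():
            merged.setdefault(entity, set()).update(targets)
    return merged


def is_linked(links: Links, entity: str, target: str) -> Optional[bool]:
    """Return True if the link is asserted, and None (unknown) otherwise.

    Under the open world assumption a missing link is never evidence
    that the two resources are unrelated, so False is never returned.
    """
    return True if target in links.get(entity, set()) else None
```

Merging is additive: a source released later can only extend what is known about an entity, which is why a consumer should treat the absence of a link as "not yet stated" rather than as "false".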

A similar concern can be voiced regarding metadata element sets. As the LOV inventory attests, practitioners generally follow the good practice of re-using existing element sets or building "application profiles" of them, but the lack of long-term support for these element sets threatens their enduring meaning and common understanding. Further, some reference frameworks, notably FRBR, have been implemented in different RDF vocabularies, which are not always connected to one another. Such a situation lowers the semantic interoperability of the datasets expressed using these vocabularies. Here, we hope that better communication between the creators and maintainers of these resources, as encouraged by our own incubator group or the LOD-LAM initiative, will help to consolidate the conceptual connections between them.

At the level of datasets, one may observe the same phenomenon as for the previous categories. For example, Open Library has started attaching OCLC numbers to its manifestations. We note, however, that efforts are being undertaken, and that the community is already well aware of challenges such as "de-duplication".

We also observe that links are being built between library-originated resources and resources originating in other organizations or domains, DBpedia being an obvious case. Again, VIAF provides an example by taking the merged authority records and linking them to DBpedia whenever possible. This illustrates one of the expected benefits of linked data, where data can be easily networked, irrespective of its origins. The library domain can thus benefit from re-using data from other fields, while library data can itself contribute to initiatives that do not strictly fall within the library scope. In the same vein, LLD efforts could benefit from the availability of generic tools for linking data, such as the Silk Link Discovery Framework, Google Refine, or the Google Refine Reconciliation Service API. However, the community needs to gain experience using them, sharing linking results, and possibly building more tools that are better suited to the LLD environment.
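
To give a feel for what link discovery involves, the following is a deliberately simplified sketch of label-based matching between two vocabularies. It is not the algorithm or API of Silk or Google Refine. It assumes each vocabulary is a mapping from an identifier to a single label, and the names normalise and candidate_links are hypothetical. Real tools add similarity measures, blocking and human review on top of this basic idea.

```python
import unicodedata


def normalise(label):
    """Fold case, strip diacritics and collapse whitespace in a label."""
    decomposed = unicodedata.normalize("NFKD", label)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.casefold().split())


def candidate_links(source, target):
    """Pair identifiers whose normalised labels are identical.

    The result is a list of candidates for review, not of confirmed links.
    """
    index = {}
    for uri, label in target.items():
        index.setdefault(normalise(label), []).append(uri)
    return [
        (source_uri, target_uri)
        for source_uri, label in source.items()
        for target_uri in index.get(normalise(label), [])
    ]
```

Exact matching on normalised labels produces false positives for homonyms and misses variant forms of a name, which is one reason why shared linking results and tools adapted to library data matter.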