Draft Vocabularies Datasets Section
Relevant datasets and vocabularies
The success of linked library data relies on the ability of its practitioners to identify, re-use or connect to existing datasets and data models. Linked datasets and vocabularies that are essential in the library and related domains, however, have previously been unknown or unfamiliar to many.
The complexity and variety of available vocabularies, their overlapping coverage, and their derivative relationships and alignments all create layers of uncertainty for re-use or connection efforts. Therefore, a current and reliable bird's-eye view is essential both for novices seeking an overview of the library linked data domain and for experts needing a quick look-up or refresher for a library linked data project.
The LLD XG thus prepared an inventory of existing library linked data resources (side deliverable) that identifies a set of useful resources for creating or consuming linked data in the library domain. These are classified into three main groups, which are not mutually exclusive, as shown in our side deliverable: metadata element sets, value vocabularies, and datasets.
A metadata element set defines classes and attributes used to describe entities of interest. In linked data terminology, such element sets are generally made concrete through (RDF) schemas or (OWL) ontologies, with the term "RDF vocabulary" often used as an umbrella for these. Usually a metadata element set does not describe bibliographic entities; rather, it provides elements to be used by others to describe such entities. Some examples:
- Dublin Core defines elements such as Creator and Date (but DC does not define bibliographic records that use those elements).
- FRBR defines entities such as Work and Manifestation and elements that link and describe them. Resource Description and Access (RDA) defines elements for cataloging, based on the FRBR model.
- MARC21 defines elements (fields) to describe bibliographic records and authorities.
- FOAF and ORG define elements to describe people and organisations as might be used for describing authors and publishers.
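The distinction between an element set and the descriptions that use it can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Element` class, the `ELEMENT_SET` table and the `uses_only_known_elements` helper are hypothetical names introduced here, and only the two Dublin Core term URIs are taken from the actual DCMI Metadata Terms namespace.

```python
from dataclasses import dataclass

# Namespace of the DCMI Metadata Terms (Dublin Core).
DCTERMS = "http://purl.org/dc/terms/"


@dataclass(frozen=True)
class Element:
    """One element of a metadata element set (hypothetical helper class)."""
    uri: str
    label: str


# The element set only declares which elements exist;
# it contains no descriptions of books or other resources.
ELEMENT_SET = {
    "creator": Element(DCTERMS + "creator", "Creator"),
    "date": Element(DCTERMS + "date", "Date"),
}


def uses_only_known_elements(record, element_set=ELEMENT_SET):
    """Return True if every key of the record is an element of the set."""
    return all(key in element_set for key in record)
```

A description such as `{"creator": "Mark Twain"}` uses the element set without being part of it, which is the separation the paragraph above describes.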
A value vocabulary defines resources (instances of topics, art styles, authors) that are used as values of elements in metadata records. Typically a value vocabulary does not define bibliographic resources such as books but concepts related to bibliographic resources (persons, languages, countries, etc.). These concepts are "building blocks" with which metadata records can be populated. Many libraries mandate specific value vocabularies for selecting values for a particular metadata element. A value vocabulary thus represents a "controlled list" of allowed values for an element. Examples include thesauri, code lists, term lists, classification schemes, subject heading lists, taxonomies, authority files, digital gazetteers, concept schemes, and other types of knowledge organisation system. Value vocabularies often have HTTP URIs assigned to their values, and such a URI can appear in a metadata record instead of or in addition to the literal value. Some examples:
- LCSH defines topics of works (e.g., Travel).
- Art and Architecture Thesaurus defines art styles (e.g., Impressionist) among others.
- VIAF defines name authorities (e.g., Mark Twain).
- GeoNames defines geographical locations ("features"), e.g., Paris.
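The idea of a value vocabulary as a "controlled list" whose values carry URIs can be sketched as follows. This is a minimal sketch, not a description of how LCSH, AAT or any other vocabulary is actually published: the `VOCABULARY` table, the `resolve` function and all `example.org` URIs are placeholders invented for illustration.

```python
# Hypothetical controlled list: each allowed label maps to a placeholder URI.
# Real vocabularies publish their own URIs; the ones below are invented.
VOCABULARY = {
    "Travel": "http://example.org/subjects/travel",
    "Impressionist": "http://example.org/styles/impressionist",
}


def resolve(label, vocabulary=VOCABULARY):
    """Return the URI for an allowed label, or raise ValueError otherwise."""
    try:
        return vocabulary[label]
    except KeyError:
        raise ValueError(f"{label!r} is not in the controlled list") from None
```

A metadata record could then store the returned URI instead of, or in addition to, the literal label, while values outside the list are rejected.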
In our report we focus on datasets as collections of structured metadata -- descriptions of things, such as books in a library. The equivalent of a dataset in the library world is a collection of library records. Library records consist of statements about things, where each statement consists of an element ("attribute" or "relationship") of the entity and a "value" for that element. The elements that are used are usually selected from a set of standard elements, such as Dublin Core. The values for the elements are either taken from value vocabularies such as LCSH, or are free-text values. Similar notions to "dataset" include "collection" or "metadata record set". Note that in the Linked Data context, datasets do not necessarily consist of clearly identifiable "records". They are merely consistent sets of triples that can be queried or downloaded from a specific point, without a strict distinction between metadata and data. We expect this view to impact the way the library community conceives of its own data, as (i) it creates or re-uses RDF vocabularies with domain and range settings and documentation that conforms to best practices, and (ii) more application cases emerge in which "traditional" descriptive metadata is used together with other types of data. Some examples:
- a record from a dataset for a given book could have a Subject element drawn from Dublin Core, and a value for Subject drawn from LCSH.
- the same dataset may contain records for authors as first-class entities that are linked from their books, described with elements like "name" from FOAF.
- a dataset may be self-describing in that it contains information about itself as a distinct entity, for example with a modified date and maintainer/curator elements drawn from Dublin Core.
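The three examples above can be sketched as a small set of triples in Python. This is a minimal sketch under stated assumptions: the Dublin Core and FOAF namespaces are the published ones, but the `Triple` type, the `describe` helper, every `example.org` URI (including the stand-in for an LCSH subject) and the modification date are placeholders invented for illustration.

```python
from typing import NamedTuple

DCTERMS = "http://purl.org/dc/terms/"   # Dublin Core terms namespace
FOAF = "http://xmlns.com/foaf/0.1/"     # FOAF namespace
EX = "http://example.org/"              # placeholder namespace (assumption)


class Triple(NamedTuple):
    """One statement: an entity, an element, and a value."""
    subject: str
    predicate: str
    value: str


DATASET = [
    # A book whose Subject element takes a value from a value vocabulary.
    Triple(EX + "book/1", DCTERMS + "subject", EX + "lcsh/travel"),
    # The book links to its author, who is a first-class entity.
    Triple(EX + "book/1", DCTERMS + "creator", EX + "person/1"),
    Triple(EX + "person/1", FOAF + "name", "Mark Twain"),
    # The dataset describes itself (the date is a placeholder value).
    Triple(EX + "dataset", DCTERMS + "modified", "2011-01-01"),
]


def describe(subject, triples=DATASET):
    """Return all (predicate, value) pairs stated about a subject."""
    return [(t.predicate, t.value) for t in triples if t.subject == subject]
```

Calling `describe(EX + "book/1")` gathers what a traditional "record" for the book would contain, although the dataset itself is just a flat list of statements with no record boundaries.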
Instances of these categories are listed in the side deliverable along with a brief description, links to their locations, and links to the use cases that our group has gathered from the community. Two visualizations (@@TODO: maybe just one!) are also presented to help reveal the inter-relations of metadata element sets and the relationships between datasets and value vocabularies.
Our side deliverable aims at broad coverage. However, we are well aware that our report cannot capture the entire diversity of what is out there, especially given the dynamic nature of linked data: new resources are continuously made available, and existing ones are regularly updated. To get a representative overview, we intentionally grounded our work in the use cases we collected. Additional coverage has been provided by the experts who participated in the LLD XG, to ensure that the most visible resources available at the time of writing have not been forgotten. Finally, to help make our report useful in the longer run, we have included a number of links to tools or web spaces that we believe can help a reader get a more continuously updated snapshot after this incubator group has ended its work. Notably, we have set up a Library Linked Data group to gather information on relevant library linked datasets: http://ckan.net/group/lld. This group is hosted by the Comprehensive Knowledge Archive Network (CKAN), a repository designed for describing and finding data packages, most of them open. We hope to actively maintain this CKAN group, but for the sake of long-term success the entire community is invited to contribute.
@@TODO (to remove from final version of report). The previous version of this report section included an "observations" sub-section (with paragraphs on "Coverage", "Quality and support for available sources", "Linking") which has been re-named and moved to a new "current situation" section of the report. The content moved is accessible at http://www.w3.org/2005/Incubator/lld/wiki/Draft_Vocabularies_Datasets_As_Current_Situation