Warning:
This wiki has been archived and is now read-only.

BP Data Vocabularies

From Data on the Web Best Practices

Intro

This document is intended as a section of the Data on the Web Best Practices document.


Datasets often rely on a range of vocabularies for the data they contain: data values are entered or captured in a controlled way, i.e., for certain positions in a data graph (or columns in a relational table), the value used should come from a limited set of pre-existing resources: for example, object types, roles of a person, countries in a geographic area, or possible subjects for books.

Such vocabularies aim to represent a set of terms and the relationships among them, in order to ensure a level of control, standardization and interoperability in the data. They can also provide a way to easily create richer data. For example, a dataset may contain, in a single data statement, one reference to a concept that is described in several languages. This single statement allows applications to localize their display or their search depending on the language of the user. Vocabularies can also take different forms (e.g. thesaurus, taxonomy, semantic network) and be represented in different formats (e.g. RDF, OWL, JSON, CSV).
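The localization benefit described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the example.org URIs, the labels, the book record and the `localized_label` helper are hypothetical and do not come from any published vocabulary.

```python
# Hypothetical vocabulary: one concept URI carrying labels in several languages.
CONCEPTS = {
    "http://example.org/subjects/astronomy": {
        "en": "Astronomy",
        "fr": "Astronomie",
        "pt": "Astronomia",
    },
}

# Hypothetical data statement: the book refers to the concept once, by URI only.
BOOK = {
    "title": "An Example Book",
    "subject": "http://example.org/subjects/astronomy",
}


def localized_label(concept_uri, lang, fallback="en"):
    """Return the label in `lang`, else in `fallback`, else the URI itself."""
    labels = CONCEPTS.get(concept_uri, {})
    return labels.get(lang) or labels.get(fallback) or concept_uri


print(localized_label(BOOK["subject"], "fr"))  # Astronomie
print(localized_label(BOOK["subject"], "de"))  # Astronomy (falls back to English)
```

The dataset itself stores a single reference; the per-language display is derived entirely from the vocabulary, so adding a new language to the vocabulary requires no change to the data.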

This section presents best practices for data vocabularies accessible as URI sets on the Web. It addresses the audience of the Data on the Web Best Practices, which is described as follows:

This document provides guidance to those who publish data on the Web, but also to those who consume data on the Web. These best practices have been written to meet the needs of many different audiences, from developers and information management staff to scientists interested in sharing data on the Web. Every attempt has been made to make the document as readable and usable as possible while still retaining the accuracy and clarity needed in a technical specification. Readers of this document are expected to be familiar with some fundamental concepts of the architecture of the Web, such as resources and URIs, as well as open data formats. Basic knowledge about vocabularies and ontologies would be helpful to better understand some aspects of this document.

Scope / Issues

In the first round of Data on the Web Best Practices Use Cases & Requirements, some requirements and Best Practices were identified. A second round was started in order to identify new requirements and Best Practices.

This document refers to a set of best practices for publishing and consuming Data Vocabularies on the Web.


@@TODO: Decide on a format in order to add the SKOS document. One approach is to put SKOS and OWL on different levels, then define the cycle (the "how-to" sections below) for each one, and then try to provide BPs that are independent of the representation. Another approach is to provide BPs that are independent of SKOS and OWL.

Data Vocabularies

According to W3C, Vocabularies define the concepts and relationships (also referred to as “terms”) used to describe and represent an area of concern. Vocabularies are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. In practice, vocabularies can be very complex (with several thousands of terms) or very simple (describing one or two concepts only). There is no clear division between what is referred to as “vocabularies” and “ontologies”. The trend is to use the word “ontology” for more complex, and possibly quite formal collection of terms, whereas “vocabulary” is used when such strict formalism is not necessarily used or only in a very loose sense. Vocabularies are the basic building blocks for inference techniques on the Semantic Web.

W3C offers different ways for creating Vocabularies on the Web in a standard format, such as:

  • RDF and RDF Schema: The Resource Description Framework (RDF) is a general-purpose language for representing information on the Web;
  • SKOS: The Simple Knowledge Organization System (SKOS) is a common data model for sharing and linking knowledge organization systems via the Web;
  • OWL: The Web Ontology Language (OWL) is a semantic markup language for publishing and sharing ontologies on the World Wide Web;
  • RIF: The W3C Rule Interchange Format (RIF) is a format for writing declarative rules and production rules on the Web.

How to Create

Vocabularies can be of two different types:

  • Light-weight: annotation-oriented (used for information search), describing the concepts and their relationships. The use of SKOS or of RDF and RDF Schema is recommended here.
  • Heavy-weight: the body of knowledge that describes a given domain, describing the concepts and the relationships between them from both perspectives: hierarchy and aggregation/composition. In addition, such heavy-weight descriptions provide the axiomatization of semantic constraints between those concepts and relations. The use of OWL and RIF is recommended here.

Three approaches:

  • Top-layer: the whole building process, compliant with the conventional software development process
  • Middle-layer: generic constraints and guidelines which specify the major steps
  • Bottom-layer: the most fine-grained guidelines, such as those for the class identification process, design of URIs, ODP, Anti-Patterns etc.


Some of the Best Practices for the creation of vocabularies are:

  • Vocabularies should be clearly documented:
    • Why: Because documentation defines what is within the vocabulary so that it can be reused or extended; the better the documentation, the higher the likelihood that the vocabulary will be reused;
    • What:
    • Intended outcome: The vocabulary is understandable by humans;
    • Possible Approach to Implement: Through the use of Ontology Requirements Specification Documents;
    • How to Test:
    • Evidence: Relevant Use Cases: R-VocabDocum;


  • Vocabularies should be shared in an open way
    • Why:
    • What:
    • Intended outcome:
    • Possible Approach to Implement:
    • How to Test:
    • Evidence: Relevant Use Cases: R-VocabOpen;


  • Vocabularies should include versioning information
    • Why: Because it helps preserve compatibility over time; versioning information provides a way to compare different versions of the vocabulary so that the appropriate version can be reused;
    • What:
    • Intended outcome: Humans can make changes in a compatible way and machines can interoperate data consistently;
    • Possible Approach to Implement: Use a versioning policy;
    • How to Test: Apply the versioning policy;
    • Evidence: Relevant Use Cases: R-VocabVersion;
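As an illustration of how a versioning policy might be applied, the following Python sketch compares the term URIs of two versions of a vocabulary. The policy it encodes (adding terms is compatible, removing terms is not) is only one possible policy and is an assumption, as are the example.org URIs and the `compare_versions` helper.

```python
def compare_versions(old_terms, new_terms):
    """Classify the change between two versions, each given as term URIs.

    Assumed policy (hypothetical): a new version is backward compatible
    when it keeps every term of the old version.
    """
    old_terms, new_terms = set(old_terms), set(new_terms)
    return {
        "added": sorted(new_terms - old_terms),
        "removed": sorted(old_terms - new_terms),
        "backward_compatible": old_terms <= new_terms,
    }


# Hypothetical versions of a small vocabulary.
V1 = {"http://example.org/vocab#Person", "http://example.org/vocab#name"}
V2 = V1 | {"http://example.org/vocab#nickname"}  # only adds a term
V3 = {"http://example.org/vocab#Person"}         # drops terms

print(compare_versions(V1, V2)["backward_compatible"])  # True
print(compare_versions(V2, V3)["removed"])
```

A report like this could accompany each release of the vocabulary, so that data publishers can see at a glance whether existing data remains valid against the new version.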

How to Find

To make the reuse of vocabularies possible, it is important to provide ways to find them when they are available on the Web. This can be done using different approaches, such as:


Others include the LOV directory, Prefix.cc, Bioportal (biological domain) and the European Commission's Joinup platform.

How to Choose

It is very important to know how to choose the vocabulary to be reused. For guidance, see the Vocabulary Checklist.

How to Reuse

In order to provide semantic interoperability between applications, it is important to reuse well-known vocabularies wherever possible. This should be done by defining the terms to be shared meaningfully and unambiguously.

One of the best practices is:

  • Existing reference vocabularies should be reused where possible
    • Why: Because it provides a way to support interoperability;
    • What:
    • Intended outcome: Machines can automatically search and navigate on the Web;
    • Possible Approach to Implement:
    • How to Test:
    • Evidence: Relevant Use Cases: R-VocabReference;
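One possible way to check how far a dataset reuses reference vocabularies is to measure which of its property URIs fall under well-known namespaces. The Python sketch below is an assumption about how such a check could look, not a test defined by this document: the choice of namespaces (Dublin Core Terms, FOAF, SKOS), the sample property URIs and the `reuse_report` helper are illustrative.

```python
# Namespaces of some widely used vocabularies (an illustrative selection).
KNOWN_NAMESPACES = {
    "http://purl.org/dc/terms/",
    "http://xmlns.com/foaf/0.1/",
    "http://www.w3.org/2004/02/skos/core#",
}


def reuse_report(predicates, known=KNOWN_NAMESPACES):
    """Split property URIs into reused and local ones, with the reuse ratio."""
    reused = [p for p in predicates if any(p.startswith(ns) for ns in known)]
    local = [p for p in predicates if p not in reused]
    ratio = len(reused) / len(predicates) if predicates else 0.0
    return reused, local, ratio


# Hypothetical property URIs found in a dataset.
predicates = [
    "http://purl.org/dc/terms/title",
    "http://xmlns.com/foaf/0.1/name",
    "http://example.org/my#bookTitle",
]

reused, local, ratio = reuse_report(predicates)
print(f"{len(reused)} of {len(predicates)} properties are reused")
print("Candidates for replacement:", local)
```

Properties reported as local are candidates for review: here the hypothetical `bookTitle` property duplicates the meaning of the Dublin Core `title` term and could be replaced by it.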

How to Extend

It is important to extend existing vocabularies instead of creating a new one. Such an extension can be done by describing the vocabulary in another language or by enriching the vocabulary with new concepts and relationships.
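Both kinds of extension can be sketched in Python as a handful of SKOS statements serialized as N-Triples. The sketch reads "another language" as another natural language, which is an assumption; the base vocabulary, the extension namespace, the concepts and the helper functions are all hypothetical, and only the SKOS namespace and its `broader` and `prefLabel` properties are real.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/base-vocab#"    # hypothetical existing vocabulary
EXT = "http://example.org/my-extension#"   # hypothetical extension namespace


def uri(value):
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(text, lang):
    """Format a language-tagged literal (no escaping; plain text only)."""
    return f'"{text}"@{lang}'


extension = [
    # Enrichment: a new, more specific concept linked to an existing one.
    (uri(EXT + "RadioAstronomy"), uri(SKOS + "broader"), uri(BASE + "Astronomy")),
    (uri(EXT + "RadioAstronomy"), uri(SKOS + "prefLabel"), literal("Radio astronomy", "en")),
    # Translation: a label in another language for an existing concept.
    (uri(BASE + "Astronomy"), uri(SKOS + "prefLabel"), literal("Astronomie", "fr")),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(extension))
```

The new concept lives in the extension's own namespace and only points at the existing vocabulary, so the original vocabulary stays untouched and both can evolve independently.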

Editors and Contributors

  • Ig Ibert Bittencourt
  • ...