Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
Metadata element sets, taxonomies, subject heading systems, thesauri, concept systems, and ontologies are examples of vocabularies that are increasingly deployed in Semantic Web settings. Managing vocabularies for use by Semantic Web applications means identifying, documenting, and publishing vocabulary terms in ways that facilitate their citation and re-use in a wide range of applications. This document articulates some basic principles of good practice for managing an RDF vocabulary. Following these principles makes an RDF vocabulary "usable": new users can learn quickly how to use the vocabulary, and a relationship of trust is built between the user community and the vocabulary developers/maintainers. This promotes growth of a user community, which generates more feedback for the developers/maintainers, leading to further improvements in quality and usability.
This document focuses primarily on those principles of good practice where a clear recommendation can be made. Other related issues remain research topics, and therefore are outside the scope of this document. Further, there are a number of ways to address most, if not all, of the topics highlighted below. While this document does not attempt to provide an exhaustive survey of those methodologies/approaches, it is intended to provide pointers to approaches that have worked well for seasoned practitioners.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was prepared by the Semantic Web Deployment Working Group (SWD) as part of the W3C Semantic Web Activity. It attempts to respond to a number of questions directed to the Semantic Web Best Practices and Deployment Working Group, and its successor, with regard to strategies for publishing and managing vocabularies over time. The recommendations reflect the experience of members of the working group in developing and managing individual vocabularies, such as Dublin Core and the Simple Knowledge Organization System (SKOS), as well as in managing repositories of vocabularies, such as the BioPortal, a web application providing access to the Open Biomedical Ontologies (OBO) library. The principles we describe represent only the "tip of the iceberg" in terms of what may be needed to support ontology evolution over time, but cover a minimal common set of practices required to support an active user community.
This document is a W3C Editor's Draft published to solicit comments from interested parties. All comments are welcome and may be sent to public-swd-wg@w3.org; please include the text "VM principles comment" in the subject line. All messages received at this address are viewable in a public archive.
Editors' note: Anything marked in the document with an "@@" indicates something still to be done or fixed. For example, "@@TODO" indicates an outstanding task, "@@REF" indicates a reference that needs to be properly cited.
Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
An RDF vocabulary is a set of resources denoted by URIs. Informally, these resources are known as the "terms" of the vocabulary.
These resources will usually (but not necessarily) be of type rdf:Property, rdfs:Class, owl:Class, or skos:Concept.
An RDF vocabulary is created and maintained for the use of a community of people as a set of building blocks for creating RDF descriptions of things in their domain of interest. An RDF vocabulary usually implies a shared conceptualisation, and thus the notion of an "RDF vocabulary" is similar to the notion of a "web ontology" (see OWL Web Ontology Language Use Cases and Requirements [OWL-req]).
Many controlled vocabularies have been encoded in RDF, OWL, and other knowledge representation languages, and a growing number are available in the public domain. Only a fraction of these appear to have fostered significant reuse to date, however (see discussion thread starting with Semantic Web Ontology Map). While there are many issues that can limit reuse potential, a significant contributor is the lack of well-specified policies for vocabulary management, metadata, and provenance specification, depending on the application. Several of the most prominent RDF vocabularies currently in use (e.g., OWL, FOAF, Dublin Core, SKOS) have emerged from a close collaboration between a relatively small community of developers and a larger community of users. The prominence of these vocabularies may be attributed to their utility, but also to the commitment made by those responsible for developing/maintaining the vocabularies to forming, accommodating, serving, and working with a community of users.
In addition to these individual vocabularies, a number of portals are emerging as "collection points" for vocabularies designed to support users in specific domains, such as the BioPortal from NCBO (National Center for Biomedical Ontology), or specific communities, such as the Object Management Group's Business Vocabulary and Ontology Portal (coming soon). Such portals are useful not only to users searching for the vocabularies they serve, but also because of the significant metadata they provide describing those vocabularies. Increasingly, the metadata describing a particular vocabulary is becoming as important as the vocabulary itself for documentation and reuse purposes.
The goal of implementing the principles outlined in this document is to make an RDF vocabulary "usable", that is, to manage it in such a way that users can easily understand and deploy it.
@@TODO [Ed?] Need paragraph on digital preservation, per email from Dan Brickley re: social responsibility, etc.
@@TODO review paragraph 1.2 in http://esw.w3.org/topic/VocabManagementNote to see if we want to incorporate any other thinking from the original note in the introduction - definitions of vocabularies, intro to the remainder of the document, etc.
In this section of the document we present a number of principles reflecting collective experience in publishing and managing RDF vocabularies. These are topics for which there is general consensus among practitioners, and for which clear recommendations can be made. They include:
An RDF vocabulary consists of a set of URIs. "Naming" refers to the act of allocating URIs to resources (see RDF Semantics [RDFSem]).
The developers/maintainers of an RDF vocabulary should inform the potential user of the following:
These practices are among the most critical for ensuring that potential users can trust that a particular RDF vocabulary is stable, will persist for some length of time, and can be either referenced or used in applications they are building.
For example, in developing applications that leverage OWL-S ontologies, there have been times when some of the ontologies on which OWL-S depends, such as those representing the names of cities and states in the US, return 404 errors (i.e., they are unavailable, and thus the corresponding applications may fail unless they have access to a locally cached version). This appears to be due to temporary outages, but is an issue for developers nonetheless. Additionally, the standards described by certain dependent ontologies are constantly evolving, for example, language and country codes, while the referenced ontologies themselves appear to be static and aging. The end result is that users are less confident of the availability or applicability of these dependent vocabularies and thus of OWL-S itself.
In cases where access to such "utility" vocabularies is critical for many applications, there are discussions underway with authoritative organizations regarding development and management of key vocabularies. For example, the Library of Congress is the registration authority for parts of ISO 639 (language codes) and ISO 3166 (country codes), and thus the obvious choice to manage and maintain the corresponding RDF vocabularies. Until such time as "all of the vocabularies we might need" become available from an authoritative source, or for any of us considering publishing vocabularies designed specifically for reuse, the minimal set of naming conventions identified above should be the starting point.
Guidelines for choosing URI namespaces, including considerations and examples to assist in the process, are provided in Best Practice Recipes for Publishing RDF Vocabularies [Recipes]. Additional suggestions are detailed in Cool URIs for the Semantic Web [CoolURIs]. Another important reference is Identifying Metadata Elements with URIs: The CORES Resolution [CORES], which describes an agreement between a number of internationally recognized metadata standards to use URIs for element naming, and provides examples that may be helpful to practitioners.
A number of organizations are grappling with decisions regarding general URI schemes, such as the date-based scheme generally used by the W3C. Some communities, such as the Object Management Group (OMG), have found that as the number of documents and communities within the broader organization grows, dates alone may not be sufficient. In order to assist potential users in finding various artifacts on the OMG site, recent proposals suggest including the higher level specification name, date-based version information, artifact type, and so forth as part of the subordinate URI scheme. Use of a simple RDF vocabulary to support this scheme and assist in navigation, once adopted, is under consideration.
@@TODO note in wiki from F2F-01-2007: Good place to mention the domain registration problem (Alistair or Tom, can you remind us of what this specifically refers to?)
RDF vocabulary publishers should provide natural-language (i.e., human-readable) documentation about the vocabulary and its proper use. The principal aim of this documentation is to help potential users *learn* how to apply the vocabulary, and therefore to promote *consistency* in the way that the vocabulary is applied. Inconsistent usage reduces the value of a vocabulary, because the meaning associated with the vocabulary becomes ambiguous in practice.
At a bare minimum, a list of the terms should be published, along with their text definitions. Ideally, this list should be accompanied by detailed prose describing proper usage patterns and scenarios, with clear examples. Relevant metadata should include a description of the use-case(s) that formed the basis for the original vocabulary development, its intended audience and target domain, references and authoritative sources used, development and validation methodology, and other domain-dependent content that may be useful for reuse purposes.
In practice, we recommend publishing both human and machine-readable documentation, with liberal use of rdfs:seeAlso. For vocabularies that define terminology for a single reference, such as for ISO language and country codes, it may seem trivial to link back to the original source documents and registration authority (although citing the exact version of the source used may be essential for some domains). In many cases, however, vocabulary terms are drawn from a number of sources, and documentation is critical to reusability. Examples of good documentation as well as actual metadata terms that can be used in documenting vocabularies include the DCMI Metadata Terms [DublinCore] vocabulary as well as the SKOS Simple Knowledge Organization System Reference [SKOS].
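As a minimal sketch of machine-readable documentation, the following Python fragment emits N-Triples statements that attach a label, a text definition, and rdfs:seeAlso links to a term. The term URI, the documentation URL, and the function names are hypothetical and chosen only for illustration; a real vocabulary would more likely be authored with an RDF toolkit than with string formatting.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def escape_literal(text):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def document_term(term_uri, label, definition, see_also=(), lang="en"):
    """Return N-Triples lines documenting one vocabulary term.

    The label and definition become language-tagged literals; each entry
    in `see_also` becomes an rdfs:seeAlso link to further documentation.
    """
    triples = [
        f'<{term_uri}> <{RDFS}label> "{escape_literal(label)}"@{lang} .',
        f'<{term_uri}> <{RDFS}comment> "{escape_literal(definition)}"@{lang} .',
    ]
    for target in see_also:
        triples.append(f"<{term_uri}> <{RDFS}seeAlso> <{target}> .")
    return triples


# Hypothetical term and documentation page, for illustration only.
example = document_term(
    "http://example.org/vocab#Widget",
    "Widget",
    'A small manufactured item, or "gadget".',
    see_also=["http://example.org/vocab/docs/widget.html"],
)
print("\n".join(example))
```

The same function can be called once per term to produce a simple term list that is readable by both people and software.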
A recent EU activity involved in documenting and managing both the metadata and content for particular vocabularies is the University of Karlsruhe's Ontology Metadata Vocabulary (OMV) project. The OMV team has developed an extensive ontology and related repository for collecting and managing OWL ontologies, including extensions for mappings across ontologies, multilinguality, and so forth. Practitioners may find the OMV core ontology particularly useful as a basis for in situ vocabulary documentation. Another example of a well-documented resource is WordNet. A number of extensions, including a plugin for Protégé, an RDF version, an ontology and related semantic database, a Lucene index, tools that calculate semantic similarity, etc., are available from the WordNet site.
In practice, and in particular in a collaborative development and maintenance environment, it may be essential to document not only the static vocabularies themselves, but also changes being introduced dynamically. Such documentation may include contextual information, as well as requirements, issues, or other reasons leading to the change, and can facilitate discussion with respect to adoption of particular changes for public release of the vocabulary in question (among other things). It may be difficult for users to understand the reasoning behind a particular modification if there is no documentation other than the actual change itself or change proposal submitted for a subsequent version of a vocabulary. Proper change documentation should include sufficient meta-information to assist users in understanding the change, the requirements driving it, and its potential consequences.
Example collaborative development tools that enable change management include:
Further discussion on vocabulary documentation can be found in [Recipes], under Requirements.
An RDF vocabulary may be developed in private by a closed community and published without the need for consideration of future change. An RDF vocabulary may, on the other hand, be developed in a more public setting, potentially by an open community, with the content of the vocabulary being allowed to evolve indefinitely. Regardless, potential users need to know under what circumstances the vocabulary (or parts of it) may change, and the kinds of changes that may be expected.
The key concept here is "stability". When a potential user chooses a vocabulary, they are making an investment of time/money/effort that depends to a certain extent upon the stability of that vocabulary. Therefore, users need to know exactly how stable a vocabulary is in order to determine how much to invest. If a vocabulary is less than perfectly stable, the user needs to know exactly what may change, how it may change, and of course to be informed of changes when they do occur.
With that as background, it is essential that RDF vocabulary publishers provide maintenance policies for every vocabulary. These policies should articulate whether or not change is allowed and the manner in which change is managed. The publisher should also provide some facility whereby users can be informed of changes as and when they are made, and provide feedback if possible. Examples of different types of vocabularies and other artifacts that have a published maintenance policy include:
All SKOS Core terms use a www.w3.org namespace, and so inherit the commitments to URI persistence made by the W3C. The SKOS Core vocabulary maintenance policy implements the following principles:
As the semantics of a SKOS term are refined in response to deployment and testing, the term passes through the following stages: "unstable", "testing", "stable". "Unstable" is roughly analogous to an "alpha" release in software development, and "testing" is analogous to a "beta" release. Once a term is stable, it may be relied upon not to change further. A stable term may be deprecated, in which case a description of the deprecated term will be maintained indefinitely.
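The stages described above can be sketched as a small state machine. This is an illustration only: treating "deprecated" as a fourth status value, and allowing only the forward transitions listed in the table, are assumptions read into the paragraph above rather than part of the published SKOS policy.

```python
from enum import Enum


class Status(Enum):
    UNSTABLE = "unstable"      # roughly an "alpha" release
    TESTING = "testing"        # roughly a "beta" release
    STABLE = "stable"          # may be relied upon not to change further
    DEPRECATED = "deprecated"  # assumption: modelled here as its own status


# Assumed forward transitions; the policy text does not enumerate them.
ALLOWED = {
    Status.UNSTABLE: {Status.TESTING},
    Status.TESTING: {Status.STABLE},
    Status.STABLE: {Status.DEPRECATED},
    Status.DEPRECATED: set(),
}


def advance(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"cannot move a term from {current.value} to {target.value}"
        )
    return target


status = Status.UNSTABLE
status = advance(status, Status.TESTING)
status = advance(status, Status.STABLE)
print(status.value)
```

Rejecting a move from "stable" back to "unstable" is what lets users rely on a stable term not changing further.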
@@TODO Ask Alistair to check the above and revise as appropriate given latest work, context
@@TODO Ask Tom for a paragraph on Dublin Core maintenance policies
@@TODO Ask Ralph for a paragraph on OWL WG effort, including use of tracker for issues, policies for responding to reported issues, working group timeline, revision/publication planning
Version management for RDF vocabularies, and ontology evolution in particular, is an ongoing research problem. Most practitioners agree that there is no single approach that is appropriate in all situations. They also agree, however, that managing multiple versions of evolving vocabularies is essential, and that mechanisms enabling users to recognize and distinguish between multiple versions are critical. See below (Research Topics) for additional background. Having said this, simple version identification may meet the needs of many users, and has proven invaluable for developers of a number of the vocabularies cited herein. For many users, "human-readable" version information may be at least as important as machine-readable version detail, so we urge readers of this document to consider human usability as a critical factor in selecting a version identification and management approach.
In cases where a vocabulary is allowed to change, users developing systems based on that vocabulary may prefer to work to a given stationary, rather than moving, target. To support these users, the developers/maintainers of a vocabulary should:
Where the resources that are the members of a vocabulary may evolve independently, or be at differing levels of stability, the developers/maintainers may also allocate URIs to historical versions of a particular resource.
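One way to give users a stationary target is to mint a dated URI for each snapshot alongside the unversioned "current" URI. The sketch below assumes a slash namespace and a hypothetical layout of the form namespace/date/term; neither is prescribed by this document, and the function names are illustrative.

```python
from datetime import date


def current_uri(namespace, term):
    """URI of the term in the current (moving) version of the vocabulary."""
    return f"{namespace.rstrip('/')}/{term}"


def snapshot_uri(namespace, term, released):
    """URI of the term in one dated snapshot (hypothetical layout)."""
    return f"{namespace.rstrip('/')}/{released.isoformat()}/{term}"


def latest_snapshot(releases, as_of):
    """Return the most recent release date not later than `as_of`, or None."""
    candidates = [d for d in releases if d <= as_of]
    return max(candidates) if candidates else None


# Hypothetical namespace and release dates, for illustration only.
NAMESPACE = "http://example.org/vocab/"
RELEASES = [date(2007, 1, 10), date(2008, 3, 16)]

pinned = latest_snapshot(RELEASES, date(2008, 1, 1))
print(current_uri(NAMESPACE, "title"))
print(snapshot_uri(NAMESPACE, "title", pinned))
```

An application that pins itself to a snapshot URI keeps working against the same definitions even after the current version changes.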
Various methods of version identification (and related maintenance policies) have been proposed in the literature and are beginning to be used in the Semantic Web community. Some advocate the use of micro-theories that may be articulated for publishing the relationships between versions. A simple example is The Fourth Dimension Ontology, which is a very terse vocabulary for indicating change. An example from the Dublin Core term history shows how one might track changes to individual vocabulary term definitions more formally over time. Serving Snapshots proposes a solution to an issue raised against the current version of [Recipes] by suggesting an approach for serving the latest of multiple snapshots of a vocabulary.
@@TODO Might also provide an example using named graphs -- check with Jeremy for decent examples
In a recent survey on ontology version management, respondents indicated that they use (or have seen) at least two distinct maintenance styles, both involving version management systems such as CVS (see Ontology Versioning Questionnaire - Brief Report on the Results [OntVer]). The first approach involves repeated editing of a particular major version of a vocabulary (with intermediate check-in to a repository) on the way towards public release. Multiple changes may be proposed to this particular version in parallel, making systematic diffs and merging essential to arriving at a final, stable snapshot to be released. In this case, vocabulary element names may be changed (e.g., via namespaces), which may in turn raise issues for users who desire to maintain compatibility. The second approach involves releasing multiple, different versions of the same ontology publicly, in parallel (preferably distinguishing among them and defining the respective precedence relations). In this case, vocabularies are published with the same element names (modulo the changes in the versions), but at different URIs. The first approach may be preferable when a vocabulary has been relatively stable and a new release makes significant portions of the older version obsolete. On the other hand, the second approach may be more appropriate in cases when a vocabulary is rather dynamic and when the new versions do not make radical/destructive changes to the old ones (and thus maintain compatibility at least to an extent across the different versions).
There appears to be fairly widespread agreement on the need for certain basic version-annotation metadata attributes. These may include creation date, author, valid time (i.e., guaranteed up time or automatic expiry time), provenance (primarily in the form of simple IRIs), and the ability to extend this set through additional, arbitrary RDF-encoded metadata. In the survey mentioned above [OntVer], respondents suggested that basic relations between ontology versions, including successor and predecessor relations, were also important. They also identified a number of features they considered essential for usability of an RDF vocabulary versioning system, for example, support for branching (similar to CVS-like systems for software version management). Other desired capabilities included committing a new version as a successor, merging two versions into a new version, version comparison, querying of particular versions, and committing a diff as a new version.
SemVersion is one example of a vocabulary version control system supporting many of these capabilities, such as user management, version commitment, project creation, branching and merging the version tree, retrieving or setting metadata, querying, model manipulation, and so forth (cf. the Knowledge Web D2.3.3b deliverable, and an IADIS 2006 paper on the implementation). The SemVersion data architecture includes several structures for managing projects as well as vocabularies:
A SemVersion versioned model includes unique versions, each of which has a number of attributes and relations. Common attributes include time stamp, branch label, and status of acceptance. Predecessor relationships indicate the history path. This meta-information may be managed independently from the versioned artifacts themselves, creating a highly flexible and reusable management layer. As every version can be identified via a URI, one can make arbitrary statements in RDF about them, including statements reflecting version dependencies, project management related dependencies, or other relevant process-related annotations.
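The idea of version metadata kept apart from the versioned artifacts can be illustrated with a small data structure. The sketch below is not the SemVersion API: the class, the field names, and the example URIs are hypothetical, and only the attributes named above (time stamp, branch label, status of acceptance, predecessor) are modelled.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    uri: str                           # every version is identified by a URI
    timestamp: datetime
    branch: str
    status: str                        # status of acceptance, e.g. "accepted"
    predecessor: Optional[str] = None  # URI of the previous version, if any


def history_path(versions: Dict[str, Version], uri: str) -> List[str]:
    """Follow predecessor links from `uri` back to the first version."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = uri
    while current is not None:
        if current in seen:
            raise ValueError(f"predecessor cycle at {current}")
        seen.add(current)
        path.append(current)
        current = versions[current].predecessor
    return path


# Hypothetical version records, for illustration only.
V1 = "http://example.org/vocab/v1"
V2 = "http://example.org/vocab/v2"
V3 = "http://example.org/vocab/v3"
versions = {
    v.uri: v
    for v in [
        Version(V1, datetime(2007, 1, 10), "main", "accepted"),
        Version(V2, datetime(2007, 6, 2), "main", "accepted", V1),
        Version(V3, datetime(2008, 3, 16), "main", "proposed", V2),
    ]
}
print(history_path(versions, V3))
```

Because the records refer to versions only by URI, they can be stored and queried separately from the vocabulary documents themselves.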
Additional papers on vocabulary version management tools (not exhaustive) include:
While this may seem obvious to the reader, it is important that an RDF/OWL description (document) is published at the definitive namespace URI for the vocabulary. Potential users should be clearly informed as to which is the "authoritative" RDF description of an RDF vocabulary, if more than one is available. Where the resources that are the members of an RDF vocabulary are denoted by HTTP URIs, an HTTP GET request with the header field "Accept: application/rdf+xml" against that URI should return an RDF/XML serialisation of an RDF graph that includes a description of the denoted resource.
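A client-side check of this behaviour might look like the following sketch, which uses Python's standard urllib.request module. The vocabulary URI is a placeholder and the function names are illustrative; the request is only constructed here, and fetch_rdf would need to be called against a real namespace URI to perform the GET.

```python
import urllib.request

RDF_XML = "application/rdf+xml"


def build_rdf_request(uri):
    """Build a GET request that asks for an RDF/XML representation."""
    return urllib.request.Request(uri, headers={"Accept": RDF_XML}, method="GET")


def fetch_rdf(uri, timeout=10.0):
    """Dereference `uri` and return (content type, body bytes).

    A publisher following the practice above would be expected to answer
    with content type application/rdf+xml.
    """
    with urllib.request.urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()


# Placeholder URI; no network access happens until fetch_rdf is called.
request = build_rdf_request("http://example.org/vocab")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Comparing the returned content type with application/rdf+xml gives a quick test of whether a namespace URI serves the expected RDF description.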
Detailed instructions on how to go about publishing a vocabulary can be found in [Recipes].
As mentioned above, ontology version management, use of reasoning to compare multiple versions of an ontology, understanding the implications of making certain kinds of changes on reasoning and applications, and related topics remain subjects of ongoing research. Well-known papers on ontology versioning include:
Some of the research in this area is focused on resolving certain kinds of queries against a versioned vocabulary set, such as:
Two possible approaches were analysed in a study on realisations of inference among multiple versions of an ontology:
Note that the C-OWL based reasoning across multiple ontology versions can be easily transformed to the temporal logics-based approach. This is essentially done by omitting the bridge rules (i.e., the mappings associated to particular ontology versions, relating them to one or more previous versions in the mapped version space). The result is a version space exploitable by the temporal logics approach. The information on the sequence of versions has not changed, therefore we may utilise model-checking in order to evaluate certain typical types of queries. Note that since both presented approaches to multi-version reasoning build on the same underlying (description) logics-based inference, the basic queries on concept satisfiability they are able to evaluate are mutually reducible to each other. The only difference is that the C-OWL based approach provides additional query expressivity due to the bridge rules, which have to be explicitly defined. When it is not feasible to define and maintain the mappings between the versions, the temporal logics-based approach should be the choice for that particular application scenario. When the mappings can be maintained and exploited, a combination of both approaches may be useful: some types of queries can be answered by model-checking, and other types (certainly the ones involving bridge rule constructs) using the C-OWL approach. For queries that can be evaluated by both approaches, users may typically choose the more efficient approach.
Note that this brief research topic description is adapted largely from the Knowledge Web EU NoE D2.3.9 deliverable, with the kind permission of the authors. An interested reader may find more information on the theoretical principles and prototype implementations there.
See also:
This document takes its lead from earlier work initiated by members of the Semantic Web Best Practices and Deployment Working Group, who have worked, and continue to work, towards providing clear examples of best practices and rules of thumb for deploying reusable vocabularies in the Semantic Web. The editors gratefully acknowledge those who contributed to the development of specific vocabularies mentioned herein, including the Dublin Core Metadata Terms, the Friend of a Friend Ontology, and the SKOS Core Vocabulary, as well as early repositories for managing such vocabularies, including the BioPortal, SemVersion, the OMV ontology and related software, and other documents, such as the Recipes document, which were invaluable in preparing this one.
The editors would also like to thank the following additional people who contributed to this note, either initially as part of the Semantic Web Best Practices and Deployment Working Group, or more recently: Dan Brickley (W3C), Libby Miller (Asemantics), Ralph Swick (W3C)...