INPUT TO THIS DRAFT IS NOW CLOSED - EDITING CONTINUES - CONTACT tbaker@tbaker.de

This page is an experiment in using the wiki for collaborative work on a W3C Working Group document, intended ultimately for publication as a note. This work comes from the Vocabulary Management Task Force of the W3C Semantic Web Best Practices and Deployment Working Group. Edits from non-WG members are welcome, but for now please keep them modest in scope (e.g., don't break this longish document out into separate pages for us). We realise that this might be a break from wiki culture, but we are working with an additional constraint: we want to be able to get this text back out of the wiki again for W3C publication. --DanBri, Thomas Baker, W3C SW Best Practices WG.

SWBPD note on "Vocabulary Management"

A rather short note (circa 15 pages) formulating a few broad principles of best practice for managing a Semantic Web vocabulary; discussing issues on the "bleeding edge"; and providing pointers to further reading.

The draft will be open on the Wiki for input until Friday, 10 December.

Begin text of document:

Managing a Vocabulary for the Semantic Web -- Best Practice

Abstract

Metadata element sets, taxonomies, subject headings, thesauri, and ontologies are examples of vocabularies which are increasingly used in a "Semantic Web" environment. Managing vocabularies for use in Semantic Web applications means identifying, documenting, and publishing vocabulary terms in ways that facilitate their citation and re-use in a wide range of applications. This paper examines practices in the maintenance communities for representative vocabularies ranging from small and informal to large and complex. The paper formulates principles of good practice and summarizes discussion on issues for which good practice has yet to emerge.

1. Introduction

1.1. Vocabularies in the Semantic Web

The Semantic Web is an open, distributed, loosely-coupled environment with lots of languages (metadata element sets, controlled vocabularies, taxonomies, thesauri, ontologies, etc). Organizations or even individuals can define and publish vocabulary terms in an open, bottom-up, and distributed manner. This paper is addressed to people who want to create and maintain such a Vocabulary.

This paper articulates some basic principles for creating and maintaining RDF vocabularies in a Semantic-Web-friendly way. By this we mean vocabularies that can support processes of referencing, repurposing, recombining, or merging data from a diversity of sources; that are evolvable; that are extensible and mixable with other Semantic Web vocabularies; and that are declared in a way that is processable by networked machines in an emerging "semantic infrastructure". [Bernard asks: Which processes are the terms supposed to support -- indexing, vocabulary merging, data integration, search...? Do we say something about those processes or are we agnostic?]

TASK: James - One page on "vocabularies in Semantic Web" The two placeholder paragraphs above should be expanded into one short page providing a general introduction to the topic "vocabularies in the Semantic Web" -- what kinds of vocabularies are we talking about here (e.g., the typology in [PIDCOCK]) and what does it mean to use them in a "Semantic Web" environment? Rather than elaborate very much in-line, this section should point off to further reading about Semantic Web.

1.2. Method of this paper

In Section 2, this paper will formulate a few principles of good practice applicable to Semantic Web vocabularies in general. To illustrate these principles, the paper will describe practices used in several vocabularies chosen to exemplify a range from small and informal to large and complex:

FOAF: FOAF is an example of a "relatively small" vocabulary for "descriptive metadata" about people and their interests [FOAF]. Its maintenance processes are "somewhat informal".
- * TASK: DanBri and Libby - One paragraph on FOAF
Dublin Core: Dublin Core is an example of a "medium-sized" vocabulary for "descriptive metadata" about information resources [DC]. Its maintenance processes are "lightweight but not weightless" and increasingly formal as DCMI evolves from a workshop-driven movement to a stable maintenance community supported by institutional stakeholders.
- * TASK: Tom - One paragraph about Dublin Core
SKOS Core: SKOS is an example of a "medium-sized" vocabulary for describing "thesauri" and similar types of knowledge organization systems. (Not sure about maintenance issues.) The SWBPD thesaurus activity should be cited.
- * TASK: Alistair - One paragraph about SKOS

AJM>> SKOS Core is an RDF vocabulary for creating RDF descriptions of 'concept schemes', where a concept scheme is a description of a set of concepts, including a description of relationships between concepts. It has been designed as a set of basic building blocks for creating RDF descriptions of the more classical, linguistically oriented types of knowledge organisation system, such as thesauri, glossaries, subject heading schemes and classification systems. SKOS Core has been developed and is managed as part of an open collaboration, analogous to an open-source project in software development. Its maintenance is open-ended, which means it may continue to evolve indefinitely.

Placeholder for another "terminology style vocabulary such as FAO thesaurus"...

In addition, this paper cites several prior works on good practice in closely related areas:

* TASK: Ralph - Bullet point on W3C good-practice documents [2005-01-30 done first pass]
Architecture of the World Wide Web [WEBARCH] - Written by the W3C Technical Architecture Group, this document discusses the core design components of the Web. These are identification of resources, representation of resource state, and the protocols that support the interaction between agents and resources in the space. The document connects core design components, constraints, and good practices to the principles and properties they support.
Cool URIs don't change [COOLURI] - Drafted in 1998, this piece by Tim Berners-Lee addresses an all-too-common occurrence on the Web: the broken link. In the Semantic Web it is especially important that the URIs used to declare the semantics of data not be allowed to go stale. In this paper, Berners-Lee considers the reasons why URIs become abandoned and proposes solutions for choosing URIs that can survive for decades and centuries.
Web Design Issues [DESIGN] - In a set of personal notes, Tim Berners-Lee describes many of the architectural and philosophical design points underlying the Web and the Semantic Web. In "What do HTTP URIs Identify?" [HTTPURI], Berners-Lee discusses the misperception that the HTTP URI scheme can only be used to name documents.
OASIS Published Subjects: The bullet point should provide some context on Topic Maps and Semantic Web and on the PSI Recommendation [OASIS-PUBSUBJ].
* TASK: Bernard - Bullet point on OASIS Published Subjects

BV>> The OASIS Published Subjects Technical Committee was set up in 2001 to promote the use of Published Subjects by specifying recommendations, requirements and best practices for their definition, management and application. The notion of "Public Subject" was introduced by ISO 13250 Topic Maps in 1999 and further refined as "Published Subject" in XML Topic Maps (XTM) 1.0 in 2001. In June 2003 this Technical Committee issued a first recommendation called "Published Subjects: Introduction and Basic Requirements", specifying the conditions under which a URI can be defined and used as a "Published Subject Identifier", and the matching information resource as a "Published Subject Indicator". Further recommendations, yet to be delivered, are intended to focus on the nature and format of subject indicators and on the management of Published Subjects, with objectives similar to those developed in this note.

The terminology used to talk about vocabularies and their underlying linguistic models differs between user communities. Without wishing to imply that these differences are trivial, this paper uses a small set of words defined with deliberate fuzziness:

Term: A named concept. [Tom: Or - "A named physical or conceptual entity". (Because "London" is arguably not a "concept" but could be a "term".)]

AJM>> Can we not do something like ... Term: a name or identifier used to denote a physical or conceptual entity. ...?

Vocabulary: A set of terms.
URI Reference: A globally unique identifier.
Description: A set of statements about a term or vocabulary.
Declaration: A machine-processable representation of a term or vocabulary.
Vocabulary Owner: The maintainer of a term set.
Versioning: The identification of changes to a term or vocabulary.
Natural language: A grammar and vocabulary for statements that can be uttered, written, and understood by ordinary humans.
Formal language: A grammar and vocabulary for statements intended for processing by machines.

These words are qualified in the examples which follow and in the Glossary. One potential source of confusion should perhaps be acknowledged and discussed up-front: the term "namespace", which is used in a number of vocabulary communities, W3C in particular, but is (in my opinion) difficult to pin down. If we can agree to use "vocabulary" in this paper (noting the usage of "namespace" where appropriate), I would like to task someone (DanBri?) to explain the W3C use of the term "namespace".

TASK: DanBri or Libby - Describe W3C usage of the word "namespace"

RRS>> Within W3C, "namespace" has come to refer specifically to XML namespaces, a technique for modularizing vocabularies so as to avoid clashes between like names from different XML markup vocabularies. For an XML document, the intended purpose is to permit a single document to use multiple vocabularies where some vocabularies are intended to be used across many document types and are designed to be recognized by multiple software modules. An XML namespace [XMLNAMES] is a collection of XML elements and attributes identified by an IRI (an internationalized URI) -- the collection's "namespace URI". In RDF the term "namespace" refers to an XML namespace where the identifying namespace URI can be dereferenced in the Web to obtain further information about the objects defined in that namespace. That is, in the Semantic Web a namespace is expected to be self-documented at its namespace URI.

2. Principles of Good Practice

Short paragraph explaining that in this section, we formulate and illustrate principles of good practice on which we generally agree.

2.1. Identify Terms with URI References.

[Building on the Introduction (above), this point should reinforce the centrality of URIs. However, we should not get too deeply here into what constitutes a term -- rather, we should point people to Point 3.6 ("What is a term, really?") for a discussion of the terminological versus conceptual debate.]

TASK: DanBri - Define "URI Reference", elaborating in the Glossary
TASK: DanBri - Sentence or two on FOAF term URIrefs
TASK: Tom - Sentence or two on DCMI term URIrefs
TASK: Tom - A sentence on the "CORES Resolution"
TASK: Alistair - Sentence or two on SKOS term URIrefs

AJM>> All of the terms in the SKOS Core vocabulary are URI references. Each term is formed by appending a fragment identifier (such as 'prefLabel') to the SKOS Core base namespace (which is 'http://www.w3.org/2004/02/skos/core#'), so for example 'http://www.w3.org/2004/02/skos/core#prefLabel'.
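The construction described above can be sketched in a few lines of Python. This is only an illustration: the helper name `skos_term` is hypothetical, and only the base namespace and the local name 'prefLabel' are taken from the text.

```python
# Base namespace of SKOS Core, as quoted in the text above.
SKOS_CORE_NS = "http://www.w3.org/2004/02/skos/core#"


def skos_term(local_name: str) -> str:
    """Return the URI reference for a SKOS Core term (hypothetical helper).

    The term URI is simply the base namespace followed by the local name,
    which then acts as the fragment identifier.
    """
    if not local_name or "#" in local_name:
        raise ValueError("local name must be non-empty and must not contain '#'")
    return SKOS_CORE_NS + local_name


if __name__ == "__main__":
    print(skos_term("prefLabel"))
```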

TASK: Aldo - Sentence or two on Wordnet term URIrefs
TASK: DanBri - What W3C says about identifying terms
TASK: Bernard - What PSI says about identifying terms

BV>> Published Subjects requirements are not specific to subjects which are defined by terms in a Vocabulary, or concepts named by those terms, since the Topic Map definition of "subject" encompasses "anything whatsoever that can be a subject of discourse". Published Subjects target the identification of concepts, defined by terms or otherwise, more than the identification of the terms themselves. The first Published Subjects requirement is "A Published Subject Identifier must be a URI".

2.2. Articulate and publish maintenance policies for the Terms and their URI references.

A Vocabulary Owner should specify and publish any policies governing the maintenance of the terms and their URI references: e.g. institutional commitments to persistence and semantic stability. This short to medium-length section should simply describe a sample of such policies.

[It would be nice if we could agree on something of the substance of those policies, such as stability of URI references in the face of "semantically compatible" evolution, but this may be difficult to define.

See, however, 3. Relationship between Defining Specification and Namespace URI in The Art of Consensus: A Guidebook for W3C Working Group Chairs and other Collaborators. ]

TASK: DanBri - Describe maintenance policies for FOAF
TASK: Tom - Describe maintenance policies for DCMI
TASK: Alistair - Describe maintenance policies for SKOS

AJM>> All SKOS Core terms use a www.w3.org namespace, and so inherit the commitments to URI persistence made by the W3C. The SKOS Core vocabulary maintenance policy implements the following principles (see also http://www.w3.org/2004/02/skos/core/spec/):

Open-ended evolution, i.e. new terms may be added at any time in the future.
All new terms in the same namespace, i.e. all new terms deemed to be within the scope of SKOS Core will use the current SKOS Core base namespace. There will be no versioning of the base namespace.
Terms evolve in situ, i.e. each term has a 'stability' value, which indicates the extent to which the description of that particular term may change in the future. As the semantics of a term are refined in response to deployment and testing, the term passes through the following stages: 'unstable', 'testing', 'stable'. 'unstable' is roughly analogous to an 'alpha' release in software development; 'testing' is analogous to a 'beta' release. Once a term is stable, it may be relied upon not to change further. A stable term may be deprecated, in which case a description of the deprecated term will be maintained indefinitely.
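As a rough illustration of the 'stability' stages in the last principle, the progression can be modelled as a small state machine. The names `Stability`, `promote` and `description_may_change` are hypothetical, and the rule that a term advances exactly one stage at a time is an assumption; the text only gives the order of the stages.

```python
from enum import Enum


class Stability(Enum):
    """The three stability values named in the SKOS Core policy."""
    UNSTABLE = "unstable"  # roughly an 'alpha' release
    TESTING = "testing"    # roughly a 'beta' release
    STABLE = "stable"      # may be relied upon not to change further


# Assumption: a term is promoted one stage at a time, in this order.
_NEXT_STAGE = {
    Stability.UNSTABLE: Stability.TESTING,
    Stability.TESTING: Stability.STABLE,
}


def promote(current: Stability) -> Stability:
    """Return the next stage; a stable term cannot be promoted further."""
    if current is Stability.STABLE:
        raise ValueError("a stable term has no further stage")
    return _NEXT_STAGE[current]


def description_may_change(current: Stability) -> bool:
    """Only terms that are not yet stable may still have their description changed."""
    return current is not Stability.STABLE
```

Deprecation is deliberately left out of this sketch: according to the policy it applies to stable terms and does not remove their description.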
TASK: Aldo - Describe maintenance policies for Wordnet
TASK: DanBri - What W3C says about maintenance policies
TASK: Bernard - What PSI says about maintenance policies

BV>> Published Subjects recommendations so far have expressed no formal requirement about maintenance policies. They are in the scope of future deliverables.

TASK: Alistair - TAG Versioning on "semantic stability"

AJM>> (From http://www.w3.org/2001/tag/2004/webarch-20041101/) On the subject of URI persistence, section 3.5.1 states that 'once a URI has been associated with a resource, it should continue indefinitely to refer to that resource.' (From '[Editorial Draft] Extending and Versioning XML Languages Part 1 Draft TAG Finding 24 November 2004' as yet not online) In respect of semantic stability, the action of changing the meaning or semantics of existing components is described as an 'incompatible change', in contrast to 'forward-compatible change' and 'backwards-compatible change'. I.e. altering the semantics of an existing vocabulary term is guaranteed to make subsequent versions of the vocabulary incompatible with previous versions. Vocabulary authors should understand the consequences of making incompatible changes.

2.3. Identify the historical version of a Vocabulary or its Terms.

Building on the previous section, this section should look at versioning from the standpoint of identification. At what level of granularity does versioning operate? Are URI references being assigned to individual terms, to sets of terms in the abstract, or to documents or schemas of term sets? Presumably, this section should highlight W3C practice in this area (e.g., the method of distinguishing a timeless Latest Version from a date-stamped This Version and Previous Version). In the teleconference of 1 November, it was suggested that this point be expanded to "version information" in a more inclusive sense.

The URIs assigned to W3C Technical Reports are an exemplar of a version identification scheme. All W3C Technical Reports (e.g. Recommendations) can be accessed via (at least) two URIs. When a Technical Report is published, W3C makes a commitment to preserve the content of that particular version of the document at a URI that will persist [W3CPP]. This "This version" URI is given in the document header and gives the document users a way to cite exactly that version of the document for all time. In particular, a new version of a Technical Report will cite its predecessor version by including the "This version" URI of the predecessor in a "Previous version" entry in the document header of the new version. Often the "This version" URI is formed by concatenating a name used for all versions of the Technical Report with a string corresponding to the date of publication of the specific version; the "This version" URI is therefore often called the "dated URI".

When the initial version of a Technical Report is published, it is also assigned a second URI, given in the document header, called the "Latest version" URI. W3C also commits that this "Latest version" URI will persist; however, the content will change as new versions of the document are published. The content at the "Latest version" URI will always be the content of the most recent version of that Technical Report.

Versioning of Technical Reports should not be confused with versioning of specifications; e.g. the specification for HTML includes HTML 2.0, HTML 3.2, HTML 4.01, and XHTML 1.0. Each of those "versions" of HTML has its own "Latest version" URI for the corresponding Technical Report. The decision to establish a new "Latest version" URI for a specification is made by the group responsible for authoring the content and generally corresponds with non-backwards-compatible changes to the specification.
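To make the "dated URI" idea concrete, the following sketch splits a "This version" URI into the name shared by all versions and the publication date. The pattern is an assumption inferred from the [WEBARCH] URI in the reference list (year directory, status code, name, eight-digit date); it is not an official W3C grammar, and the function name `parse_dated_uri` is hypothetical.

```python
import re
from datetime import date, datetime

# Assumed layout: http://www.w3.org/TR/<year>/<STATUS>-<name>-<YYYYMMDD>/
_DATED_URI = re.compile(
    r"^http://www\.w3\.org/TR/(\d{4})/([A-Z]+)-(.+)-(\d{8})/?$"
)


def parse_dated_uri(uri: str) -> tuple[str, str, date]:
    """Return (name, status, publication date) for a dated "This version" URI."""
    match = _DATED_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised dated URI: {uri}")
    _year, status, name, stamp = match.groups()
    published = datetime.strptime(stamp, "%Y%m%d").date()
    return name, status, published


if __name__ == "__main__":
    # The "This version" URI of [WEBARCH], taken from the reference list.
    print(parse_dated_uri("http://www.w3.org/TR/2004/REC-webarch-20041215/"))
```

The name recovered this way is the part that stays constant across versions, while the date distinguishes one published version from the next.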

[We would like to have these semantics of the "This version" and "Latest version" URIs described in RDF in some way. These semantics can be simulated poorly with HTTP cache expiration metadata; however, such overloading of semantics is not good design practice. It would be especially grand to be able to learn from the W3C site -- in RDF -- whether the content of the "Latest version" was identical to the content of any given "This version" without having to actually retrieve that content.]

Related: Versioning and Extensibility in [WEBARCH].

TASK: Ralph - Longer paragraph on versioning in W3C
TASK: DanBri - Short paragraph on versioning in FOAF
TASK: Tom - Short paragraph on versioning in DCMI
TASK: Alistair - Short paragraph on versioning in SKOS

AJM>> At the current time there is no explicit versioning of either the SKOS Core vocabulary or its terms, although this would be desirable and will probably be implemented shortly in line with current DCMI practice. All changes to SKOS Core vocabulary terms and to the vocabulary itself are recorded as structured annotations via the skos:changeNote RDF property.

TASK: Aldo - Short paragraph on versioning in Wordnet
TASK: Bernard - Short paragraph on versioning in PSI

BV>> Versioning of subjects, published or not, has never been addressed explicitly by the Topic Map family of standards, since basically a subject can't be versioned: either two topics represent the same subject, or not. Published Subjects recommendations are so far agnostic about the way to handle versioning.

TASK: Alistair - What TAG says about versioning

AJM>> TAG has quite a lot to say about versioning :) Comments here are based on '[Editorial Draft] Extending and Versioning XML Languages Part 1 Draft TAG Finding 24 November 2004', as yet not online. Versioning strategies for XML languages fall into the following classes: 'none', 'backwards-compatible', 'forwards-compatible', 'flavors' and 'big bang'. (AJM: How these strategies apply to semantic web vocabularies (as opposed to XML languages) is as yet undescribed.) The best approach to take depends on the application domain, but in general it is recommended to plan for versioning from the start. If you don't plan for versioning from the start, when you do decide to adopt a plan for versioning, you may be constrained in the available approaches by decisions that you've already made (AJM: and this comment does apply equally to semantic web vocabs IMHO).

TASK: Alan - "What constitutes a change?"

2.4. Provide natural-language documentation about the Terms.

The Vocabulary Owner should describe and publish a human-readable description of the Terms -- typically, at a minimum, text definitions on a Web page. This short section should merely say what sort of Web documents are made available for the example vocabularies.

TASK: DanBri - One sentence pointing to FOAF Web documents
TASK: Tom - One sentence pointing to DCMI Web documents
TASK: Alistair - One sentence pointing to SKOS Web documents

AJM>> The SKOS Core Specification (http://www.w3.org/2004/02/skos/core/spec/) gives an overview of the current state of the SKOS Core vocabulary, producing a summary table of human-readable annotations for each term. This document could also be described as a 'namespace document'. This document is generated from the underlying RDF description of the SKOS Core vocabulary, and an HTML template, and is regenerated after any changes are made to the vocabulary. The SKOS Core Guide (http://www.w3.org/2004/02/skos/core/guide/) (still under development, officially published next week hopefully) gives instructions for proper use of the vocabulary, including additional constraints and inference rules.

TASK: Aldo - One sentence pointing to Wordnet Web documents
TASK: DanBri - One sentence pointing to W3C Web documents
TASK: Bernard - One sentence pointing to PSI Web documents

BV>> One example of Published Subject Indicators for ISO Languages can be found at http://psi.oasis-open.org/iso/639/

2.5. Declare the Terms using a formal, machine-processable schema language.

This short section should merely say what sorts of schemas the example maintenance communities publish. Policies for dereferencing and choice of schema language will be discussed in more detail in Section 3.

TASK: DanBri - Two sentences on FOAF schemas.
TASK: Tom - Two sentences on DCMI schemas.
TASK: Alistair - Two sentences on SKOS schemas.

AJM>> The SKOS Core vocabulary is published as an RDF description, with each term declared as either an RDF property or an RDFS class. Other features from RDF Schema and OWL (such as property domain/range statements, additional property types) are used as appropriate to formally express additional semantics and/or constraints.
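A minimal sketch of what such a declaration amounts to, written as a Python function that emits one N-Triples statement per term. The helper `declare` is hypothetical, and which kind ('property' or 'class') a given term has is supplied by the caller here as an assumption for illustration; the published schema remains the authority. The rdf: and rdfs: namespace URIs are the standard ones.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

# The two kinds of declaration mentioned in the text.
_TYPE_FOR_KIND = {
    "property": RDF_NS + "Property",
    "class": RDFS_NS + "Class",
}


def declare(local_name: str, kind: str) -> str:
    """Return an N-Triples statement typing a term as a property or a class."""
    if kind not in _TYPE_FOR_KIND:
        raise ValueError("kind must be 'property' or 'class'")
    subject = SKOS_NS + local_name
    return f"<{subject}> <{RDF_NS}type> <{_TYPE_FOR_KIND[kind]}> ."


if __name__ == "__main__":
    # Assumed classification, for illustration only.
    print(declare("prefLabel", "property"))
    print(declare("Concept", "class"))
```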

TASK: Aldo - Two sentences on Wordnet schemas.
TASK: DanBri - Two sentences on W3C schemas.
TASK: Bernard - Two sentences on PSI schemas.

BV>> There are no specific requirements for a PSI schema beyond the minimal recommendation: "A Published Subject Indicator may provide machine-processable metadata about itself." Long discussions in the Published Subjects TC about whether any schema or format should be recommended led to a consensus not to recommend any generic format or schema, but to provide best practices when using RDF, XTM (TBD).

3. Questions on the Bleeding Edge

Paragraph explaining that Section 3 discusses issues on which consensus currently seems more elusive. Our goal is to describe the range of positions taken.

3.1. What should the identifier of a Vocabulary or Term (i.e., its URI Reference) resolve to when someone "clicks on it" in a Web browser?

We could reword this as the problem of resolving ("dereferencing") Term URIs to human-readable descriptions or machine-processable declarations. Several years ago, Tim Berners-Lee said that "The namespace document (with the namespace URI) is a place for the language publisher to keep definitive material about a namespace. Schema languages are ideal for this." Others have disagreed with this and the question was taken up by TAG. Point 3.1 should summarize the state of discussion. If Terms are documented in multiple ways, should a Vocabulary Owner distinguish between "canonical" versus "derived" sources?

The URI of an XML namespace can be used to identify an "information resource" that provides information about the terms in the namespace. Such an information resource is called a "namespace document". [WEBARCH] declares that when the owner of a namespace URI provides a namespace document, that document is authoritative for the namespace. [WEBARCH] declares that it is good practice to provide namespace documents but leaves it to the namespace URI owner to decide, based on expected applications, the representation of a namespace document.

In the Semantic Web it is best practice to provide a namespace document of type application/rdf+xml declaring -- in machine-processable RDF -- the properties and classes within that namespace. It is good practice to also provide a namespace document of type text/html describing the same properties and classes for human readers. [@@At some time in the not-too-distant future we should cite GRDDL as a means to serve a single text/html document from which RDF can be procedurally extracted.]
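One way to serve both representations from the same namespace URI is HTTP content negotiation on the Accept header. The sketch below shows only the selection logic, with no server attached; the function name `choose_representation`, the decision to fall back to text/html, and the decision to ignore wildcards such as */* are all assumptions made for illustration.

```python
# The two representations named in the text above.
SUPPORTED = ("application/rdf+xml", "text/html")


def choose_representation(accept: str, default: str = "text/html") -> str:
    """Pick the supported media type with the highest q-value in an Accept header.

    Wildcards and unsupported types are ignored; if nothing matches,
    the default is returned (an assumption, not a mandated behaviour).
    """
    best, best_q = None, 0.0
    for item in accept.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # per HTTP, a missing q parameter means q=1
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if media_type in SUPPORTED and q > best_q:
            best, best_q = media_type, q
    return best if best is not None else default


if __name__ == "__main__":
    print(choose_representation("application/rdf+xml"))
    print(choose_representation("text/html,application/xhtml+xml;q=0.9"))
```

An RDF-aware agent that asks for application/rdf+xml would receive the machine-processable declaration, while a browser would receive the human-readable page.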

RDF specifies that a full URI for an RDF property or class is formed by appending the XML local name of the property or class to the namespace URI. Good practice is that the resulting URI also resolves to an information resource providing information about the property or class. While RDF permits a namespace URI to end in either "/" or "#" (or indeed any other URI character), the TAG's position is that an http: URI that does not contain a fragment identifier is always a reference to a document and thus an RDF property or class name should contain a fragment identifier. Therefore, we conclude that namespace names in RDF that use the http URI scheme should contain the '#' character.
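The difference between "#" and "/" namespaces can be illustrated by splitting a term URI back into its namespace URI and local name. The helper names are hypothetical. The SKOS URI comes from this document; the Dublin Core URI is the well-known identifier for the "title" element and is used here only as an example of a namespace ending in "/".

```python
from urllib.parse import urldefrag


def split_term_uri(term_uri: str) -> tuple[str, str]:
    """Split a term URI into (namespace URI, local name).

    If the URI has a fragment identifier, the namespace ends in '#';
    otherwise the URI is split after the last '/'.
    """
    document, fragment = urldefrag(term_uri)
    if fragment:
        return document + "#", fragment
    head, sep, tail = term_uri.rpartition("/")
    return head + sep, tail


def uses_hash_namespace(term_uri: str) -> bool:
    """True if the term is named with a fragment identifier."""
    return split_term_uri(term_uri)[0].endswith("#")


if __name__ == "__main__":
    print(split_term_uri("http://www.w3.org/2004/02/skos/core#prefLabel"))
    print(split_term_uri("http://purl.org/dc/elements/1.1/title"))
```

Under the position described above, only the first of these two URIs would be considered a suitable name for an RDF property.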

TASK: Ralph - Paragraph or two on W3C dereferencing policy
TASK: Bernard - Paragraph on PSI dereferencing policy

BV>> This point is at the core of PSI requirements expressed in the OASIS Published Subjects first recommendation Requirement #2 : "A Published Subject Identifier must resolve to an human-interpretable Published Subject Indicator." Requirement #3 : "A Published Subject Indicator must explicitly state the ""unique"" URI that is to be used as its Published Subject Identifier."

TASK: DanBri - Short paragraph on FOAF dereferencing policy
TASK: Tom - Short paragraph on DCMI dereferencing policy
TASK: Alistair - Short paragraph on SKOS dereferencing policy

AJM>> Because the RDF description of the SKOS Core vocabulary was published before stable human-readable documentation was available, there was initially nothing suitable for the term URIs to dereference to. At the current time, all URIs dereference to the SKOS Core homepage, as this provides up-to-date links to currently available documentation. After the official publication of the SKOS Core Specification (http://www.w3.org/2004/02/skos/core/spec/) it may be desirable to have each term URI (e.g. 'http://www.w3.org/2004/02/skos/core#prefLabel') dereference for content-type text/html to the summary of that term in the specification document (e.g. 'http://www.w3.org/2004/02/skos/core/spec/#prefLabel').
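The mapping suggested in the last sentence can be written down as a simple function from a term URI to the corresponding anchor in the specification document. The function name `html_summary_uri` is hypothetical, and the sketch shows only the mapping, not how a server would implement it: a fragment identifier is not sent to the server in an HTTP request, so the fragment would have to be re-applied on the client side.

```python
# Both URIs are taken from the text above.
TERM_NAMESPACE = "http://www.w3.org/2004/02/skos/core#"
SPEC_DOCUMENT = "http://www.w3.org/2004/02/skos/core/spec/"


def html_summary_uri(term_uri: str) -> str:
    """Map a SKOS Core term URI to the summary of that term in the specification."""
    if not term_uri.startswith(TERM_NAMESPACE):
        raise ValueError("not a SKOS Core term URI")
    local_name = term_uri[len(TERM_NAMESPACE):]
    if not local_name:
        raise ValueError("term URI has no local name")
    # The local name is reused as the fragment of the specification document.
    return SPEC_DOCUMENT + "#" + local_name


if __name__ == "__main__":
    print(html_summary_uri("http://www.w3.org/2004/02/skos/core#prefLabel"))
```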

TASK: Aldo - Short paragraph on Wordnet dereferencing policy
TASK: DanBri - Short para on RDF/A and mixed human-/machine-oriented documentation.

3.2. Which schema language should be used to declare the Vocabulary machine-processably?

Short answer: It depends what you want to say. This section should characterize the assertions made in schemas published by various communities.

TASK: DanBri - Short paragraph on what FOAF schemas assert.
TASK: Tom - Short paragraph on what DCMI schemas assert.
TASK: Aldo - Short paragraph on what Wordnet schemas assert.
TASK: DanBri - Short paragraph on what W3C schemas assert.
TASK: Bernard - Short paragraph on what PSI schemas assert.
TASK: Alistair - Short paragraph on what SKOS schemas assert.

In particular, there was a discussion in September on the SWBPD list on different approaches to modeling thesauri [THESAURUS-MODEL]. For example, one could use OWL or RDFS to represent an existing language of thesaurus relations and simply translate an existing thesaurus into those terms. Or one could fundamentally remodel the the

[COOLURI]

   Berners-Lee, T., Cool URIs don't change,
   http://www.w3.org/Provider/Style/URI.html

[DESIGN]

   Berners-Lee, T., Design Issues, Architectural and Philosophical Points,
   personal notes,
   http://www.w3.org/DesignIssues/

[HTTPURI]

   Berners-Lee, T., Design Issues, What do HTTP URIs Identify?,
   personal note,
   http://www.w3.org/DesignIssues/HTTP-URI.html

[W3CPP]

   World Wide Web Consortium, Persistence Policy,
   http://www.w3.org/Consortium/Persistence

[WEBARCH]

   Jacobs, I., Walsh, N. eds, Architecture of the World Wide Web, Volume One,
   W3C Recommendation, 15 December 2004,
   http://www.w3.org/TR/2004/REC-webarch-20041215/

[WGS84] - DanBri??

   Walsh, J. An RDF vocabulary for WGS84 geo positioning
   [Informational Internet draft], RDF Interest Group,
   http://space.frot.org/draft-geo-draft.html.

[XMLNAMES]

   Bray, T., Hollander, D., Layman, A., Tobin, R., eds,
   Namespaces in XML 1.1,
   W3C Recommendation, 4 February 2004,
   http://www.w3.org/TR/2004/REC-xml-names11-20040204/#dt-IRI