Vocabulary Management at W3C (Draft)

OBSOLETE. See Vocabulary Services instead.

Comments to ivan@w3.org and sandro@w3.org, or w3c-semweb-cg@w3.org (archive)

Latest version at: http://www.w3.org/2013/02/vrc
Old versions and diffs at: http://www.w3.org/2013/02/vrc-history/
This is $Revision: 1.24 $ $Date: 2013-06-20 08:08:51 $


One of the major stumbling blocks in deploying RDF has been the difficulty data providers have in determining which vocabularies to use. For example, a publisher of scientific papers who wants to embed document metadata in the web pages about each paper has to make an extensive search to find the possible vocabularies and gather the data to decide which among them are appropriate for this use. Many vocabularies may already exist, but they are difficult to find. There may be more than one in the same subject area, with no clear indication of which ones have a reasonable level of stability and community acceptance. Or there may be none, so that one has to be developed, in which case it is unclear how to make the community aware that the new vocabulary exists.

There have been several attempts to create vocabulary catalogs, indexes, etc., but none of them has gained general acceptance and few have remained up for very long. The latest notable attempt is LOV, created and maintained by Bernard Vatant (Mondeca) and Pierre-Yves Vandenbussche (DERI) as part of the DataLift project. Other application areas have more specific, application-dependent catalogs; e.g., the HCLS community has established application-specific "ontology portals" (vocabulary hosting and/or directory services) such as NCBO and OBO. (Note that for the purposes of this document, the terms "ontology" and "vocabulary" are synonyms.) Unfortunately, many past cataloging projects depended on a specific project or on a few individuals and, more often than not, became obsolete after a while.

Initially (1999-2003) W3C stayed out of this process, waiting to see if the community would sort out this issue by itself. We hoped to see the emergence of an open market for vocabularies, including development tools, reviews, catalogs, consultants, etc. When that did not emerge, we decided to begin offering ontology hosting (on www.w3.org) and we began the Ontaria project (with DARPA funding) to provide an ontology directory service. Implementation of these services was not completed, however, and project funding ended in 2005. After that, W3C took no active role until the emergence of schema.org and the eventual creation of the Web Schemas Task Force of the Semantic Web Interest Group. WSTF was created both to provide an open process for schema.org and as a general forum for people interested in developing vocabularies. At this point, we are contemplating taking a more active role supporting the vocabulary ecosystem.

Business model/member benefits

The W3C vocabulary management proposals set out here have emerged from our extensive discussions about the strategic direction that W3C should take in the Semantic Web and eGov Activities. They answer a clear community need and in that sense are simply something that W3C should do. Arguably, W3C should have been doing this for a long time, but that's water under the bridge. This work is required to underpin the development of the linked and open data visions. In that respect, undertaking it is something our members, actual and potential, can reasonably expect of us.

The proposals below are simple and easy to do. Each has potential to generate tremendous value for the community. We suggest waiting until the value is clearly present before putting much effort into monetization. At a few points below, we note some possible revenue sources. In addition, once this work has demonstrated its value, it may be easier to obtain grant funding to improve it.

Vocabulary Management Activities

1. Vocabulary Providers Group

Goal: provide a forum for experts to talk to each other and for newcomers to talk to the experts. This group can also help point out and coordinate areas of overlap among vocabularies, and help gather people into groups for new vocabularies.

Proposal: redirect the Web Schemas Task Force to take on this role. Rename it, avoiding the word 'schema' to help clarify it's not particularly about schema.org, but keep the mailing list name (public-vocabs@w3.org) to avoid disruption. Add another chair. Perhaps call it "Vocabulary Advice Task Force" or "Vocabulary Coordination Group".

If possible, the group should host regular discussions on general vocabulary development issues and answer questions. It might also hold regular presentations in which one group presents its vocabulary to the wider audience to get feedback (like W3C staff project reviews).

2. Domain-Specific Vocabulary Groups

Goal: provide a forum for the people involved in each vocabulary to communicate and share material. Provide a trusted archive of public comments.

Proposal: in general, use Community Groups (CGs), with their normal tools (mailing lists, wikis, etc). Use Working Groups in situations where enough W3C members want a more formal process, possibly more restricted participation, and the stamp of "W3C Recommendation" on the vocabulary. (possible revenue source)

This is already done sometimes, as with the Open Annotation CG. We can encourage more people to do this by linking it with other services, such as vocabulary hosting (item 3, below).

Note that vocabularies have somewhat different stability and interoperability characteristics than most W3C Recommended technologies, so the full Recommendation Track is often not warranted. A well-constructed and properly published vocabulary seems to be taken at least as seriously by much of the market as a W3C-Recommended one.

3. Vocabulary Hosting

Goal: Make it practical, even easy, for people to publish their vocabulary namespace document in accord with best practice, especially with regard to long-term stability.

Proposal: offer URLs starting with http://www.w3.org/ns/ to any W3C group, including Community Groups, as long as that group has an open/consensus decision process. Provide a simple Web interface for people to allocate a namespace and then update the contents of the namespace document as needed. This would be subject to reasonable terms of service, including the understanding that individuals act as editors, on behalf of the group, and that W3C has ultimate authority over the content.

The W3C Namespace Policy already allows W3C groups to claim URLs starting with http://www.w3.org/ns/, but that policy was written before Community Groups existed and its applicability to them is unclear. At the moment, requests to allocate names and requests to update the contents of a namespace document have to be handled by W3C staff.

In a second iteration, the W3C vocab hosting service could provide various tools which support or even enforce good practice in vocabulary development, such as not removing terms that people might be using. Existing Web-based tools such as WebProtege (from Stanford), Neologism (from DERI), and Knoodl (from Revelytix) should be considered.

A key advantage of W3C hosting, unlike most other options available, is that we can handle changes in personnel, business models, governments, corporate ownership, etc., through our normal group consensus processes.

Q: What if someone wants to host a vocabulary at W3C, but does not want to turn over control to the group?
A: For now: tell them to come talk to us and we'll consider it on a case-by-case basis. (possible revenue source)

Q: What if a name is allocated and never used?
A: Abandoned vocabulary names may be reclaimed, depending on evidence of their use.

Q: What about name conflicts? What if someone wants to claim "html" or "google" as a namespace name?
A: The terms of service will require that groups declare they have made reasonable effort to find other uses of the term, considered them, and concluded there is no significant likelihood of user confusion. If such confusion is reported, especially in the early days of the term being used, we may reallocate the name.

Q: Can we use https://www.w3.org/ns/ (TLS secure)?
A: That's included automatically; all of www.w3.org is simultaneously served with TLS.

Q: What about http://id.w3.org/?
A: We are considering this, as a way to better manage the load.

Q: What about http://www.w3.org/yyyy/mm/?
A: It's a possibility, if someone wants that. Does anyone?

Q: What about http://www.w3.org/ns/foo/bar (subdirectories)?
A: Yes, subdirectories will be supported. (Maybe this is just reserve-a-prefix.)

Q: What about http://foo.org/ ?
A: Possibly in the future, to accommodate vocabularies that started outside W3C or that see a need for someday moving away from W3C. (possible revenue source)
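
As a rough illustration only, the "reserve-a-prefix" reading of namespace allocation might look like the Python sketch below. The NamespaceRegistry class, its method, and the group names are hypothetical and do not describe any existing W3C service; only the http://www.w3.org/ns/ base URL comes from the proposal. The sketch assumes that a reserved name covers its subdirectories and that a second group cannot claim an overlapping name.

```python
from dataclasses import dataclass, field

NS_BASE = "http://www.w3.org/ns/"  # base URL named in the proposal


@dataclass
class NamespaceRegistry:
    """Hypothetical in-memory registry; not an actual W3C service."""
    owners: dict = field(default_factory=dict)  # reserved prefix -> group name

    def allocate(self, path, group):
        """Reserve `path` for `group` and return its namespace URL.

        Assumption: a reserved prefix such as "foo" also covers "foo/bar",
        so an overlapping request from a different group is refused.
        """
        path = path.strip("/")
        if not path:
            raise ValueError("empty namespace name")
        for prefix, holder in self.owners.items():
            overlaps = (
                path == prefix
                or path.startswith(prefix + "/")
                or prefix.startswith(path + "/")
            )
            if overlaps and holder != group:
                raise ValueError(f"{path!r} overlaps {prefix!r}, held by {holder!r}")
        self.owners.setdefault(path, group)
        return NS_BASE + path


registry = NamespaceRegistry()
print(registry.allocate("foo", "Example CG"))      # http://www.w3.org/ns/foo
print(registry.allocate("foo/bar", "Example CG"))  # http://www.w3.org/ns/foo/bar
```

A request such as registry.allocate("foo/baz", "Other CG") would raise ValueError in this sketch, because "foo" is already held by a different group. Reclaiming abandoned names and resolving reported confusion are policy decisions that the sketch does not attempt to model.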

4. Vocabulary Selection Metadata

Goal: Make sure that people selecting vocabularies have the data they need to make a good choice. Some of this data will be provided by the vocabulary providers (first-party metadata, self-reported) while some will be provided by others (third-party metadata).

Proposal: ask a group (maybe the experts group from item 1 above, or maybe a new CG) to come up with a vocabulary for this metadata, promote it, and also use it in the W3C vocabulary directory (item 5, below).

Some of this is currently in-scope for the Government Linked Data (GLD) Working Group, under Best Practices for Vocabulary Selection ("... issues of stability, security, and long-term maintenance commitment...").

Some possible items:

  1. Who/what is publishing data, and who/what is consuming data, using the vocabulary? (Might be broken down by each term in the vocabulary)
  2. What public comments have been made about this vocabulary, and how have they been addressed? Maybe separate different kinds of comments, e.g., bug reports, editorial suggestions, and in-depth professional reviews.
  3. Who has been involved in the development/maintenance of this vocabulary? How were these people selected/recruited, and what decision process was used among them? (See UK Government's 2012 consultation for a definition of open standards)
  4. Is the vocabulary encumbered by a restrictive license?
  5. Is the vocabulary encumbered by patents?
  6. Does the vocabulary complement existing vocabularies or does it duplicate/compete with them?
  7. Does the vocabulary have proper basic metadata, on authorship, provenance, etc.?
  8. Is the vocabulary actively maintained?
  9. What is the vocabulary's versioning policy? In particular, under what circumstance might a term's definition be changed or removed?
  10. Have appropriate domain experts been involved in the development process, or at least reviewed it?
  11. Is the URI structure used by the vocabulary likely to be persistent? (See, for example, Phil Archer's suggested guidelines).
  12. Is there a credible fallback should the organization currently responsible no longer be able or willing to maintain it in the public interest?
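
To make the list above more concrete, some of these items could be captured in a simple record, as in the Python sketch below. This is only an illustration: the SelectionMetadata class and all of its field names are hypothetical, it covers only a subset of the items, and it is not the metadata vocabulary that the proposed group would define. The example URI is likewise a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SelectionMetadata:
    """Hypothetical record; field names are illustrative only."""
    vocabulary_uri: str
    publishers: List[str] = field(default_factory=list)   # item 1: who publishes data
    consumers: List[str] = field(default_factory=list)    # item 1: who consumes data
    public_comments_url: Optional[str] = None              # item 2
    maintainers: List[str] = field(default_factory=list)  # item 3
    license: Optional[str] = None                          # item 4
    known_patent_claims: Optional[bool] = None             # item 5 (None = not reported)
    actively_maintained: Optional[bool] = None             # item 8
    versioning_policy: Optional[str] = None                # item 9
    fallback_maintainer: Optional[str] = None              # item 12

    def unanswered(self):
        """Names of fields for which nothing has been reported yet."""
        return [
            name for name, value in vars(self).items()
            if value is None or value == []
        ]


record = SelectionMetadata(
    vocabulary_uri="http://www.w3.org/ns/example#",  # placeholder URI
    license="CC0",
    actively_maintained=True,
)
print(record.unanswered())
```

A directory could use something like unanswered() to show which questions a vocabulary provider has not yet addressed, without treating an explicit "no" (False) as missing data.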

5. Vocabulary Directory

Goal: Provide vocabulary consumers (people publishing and in some cases consuming RDF data) with a convenient way to find the vocabularies that might work for them, along with metadata to help them choose among the options.

Proposal: in the first iteration, make a simple web page showing all the vocabularies we host, along with all the others that have been reported to us (via a web form which asks for basic metadata). For each vocabulary, provide some of the metadata from item 4, above. Depending on available resources and on feedback, grow this into a more complete "shopping site" where people can search, sort, and filter on various criteria, as well as enter their own ratings, reviews, and other metadata.
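
The search, sort, and filter behavior mentioned for later iterations can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the search_directory function, the dictionary keys, and the three sample entries are invented placeholders, not real vocabularies or a proposed design.

```python
def search_directory(entries, keyword="", hosted_only=False):
    """Filter entries by keyword and hosting location, then sort by title."""
    keyword = keyword.lower()
    matches = [
        e for e in entries
        if keyword in (e["title"] + " " + e["description"]).lower()
        and (e["hosted_at_w3c"] or not hosted_only)
    ]
    return sorted(matches, key=lambda e: e["title"].lower())


# Invented placeholder entries, as they might arrive from the submission form.
entries = [
    {"title": "Example Publications Vocabulary",
     "uri": "http://www.w3.org/ns/example-pubs#",
     "description": "Metadata for scientific papers",
     "hosted_at_w3c": True},
    {"title": "Example Annotation Terms",
     "uri": "http://example.org/annot#",
     "description": "Terms for annotating documents",
     "hosted_at_w3c": False},
    {"title": "Example Dataset Catalog",
     "uri": "http://www.w3.org/ns/example-data#",
     "description": "Describing published datasets",
     "hosted_at_w3c": True},
]

for entry in search_directory(entries, hosted_only=True):
    print(entry["title"], entry["uri"])
```

A first-iteration web page would need little more than this kind of listing; ratings, reviews, and richer criteria would be added only if feedback and resources justify them.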

Q: Do we include all known vocabularies, or only ones that seem pretty good?
A: We include any vocabulary for which someone is willing to fill out the form, or which already has embedded the basic required metadata.

Q: Do we include en masse the vocabularies known to NCBO, LOV, prefix.cc, etc?
A: Probably not in the first iteration, because without good search and filter tools, those will dwarf the others. At the start, focus on vocabularies hosted at W3C or which people specifically request to be included, via the submission form.

At some point, it may be best to move to existing software, such as LOV or Callimachus (see an article on some others).

Ivan Herman ivan@w3.org, Sandro Hawke sandro@w3.org
$Id: vrc.html,v 1.24 2013-06-20 08:08:51 phila Exp $