HTML Data Vocabularies

From W3C Wiki

Designing vocabularies is a complex craft, and this page does not cover all aspects of how to go about it. Instead, this page focuses on aspects of microdata, microformat and RDFa syntax and processing that should influence vocabulary design, and how to create vocabularies that can be used across multiple syntaxes, as part of the work of the HTML Data TF. There are several existing more general resources for vocabulary creators, such as:

TODO: More?

Extending Vocabularies

There are already many vocabularies in existence, particularly for common domains such as people, organisations, events, products, reviews, recipes and so on. Reusing these vocabularies benefits consumers because it saves design time and means they do not have to create supporting tools and materials such as validators, previewers or documentation. It also benefits publishers because it increases the likelihood that the data within their pages can be consumed by other useful tools. It is therefore good practice to extend existing vocabularies rather than creating new ones, where possible.

This section describes some of the issues that vocabulary authors who extend existing vocabularies need to be aware of.

Extending microformats

Microformats are developed using an iterative process whereby proposals for extensions are brainstormed and eventually either accepted or rejected by the microformats community. It is not appropriate to create unilateral extensions to microformats. On the other hand, publishers should use semantic classes within their HTML, whether or not they are used within current microformats. Evidence of use of semantic classes within HTML pages is one input to the microformats standardisation process.

Extending RDF vocabularies

RDF vocabularies, which are used within RDFa, use IRIs for types and properties. Any resource in RDFa can be extended by adding new types to the @typeof attribute and/or adding new properties from different vocabularies. However, it is not general practice to allow RDF vocabularies themselves to be extended with new types or properties by third parties.
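
As a minimal sketch of extending a resource in this way, the markup below gives a person both a schema.org type and an additional type from a second vocabulary. The gov prefix and the http://gov.example.org/uk/ vocabulary are assumptions used only for illustration.

```html
<!-- Sketch only: http://gov.example.org/uk/ is a hypothetical vocabulary -->
<p vocab="http://schema.org/" prefix="gov: http://gov.example.org/uk/" typeof="Person gov:MP">
  <span property="name">David Cameron</span> is the member of parliament for
  <span property="gov:constituency">Witney</span>.
</p>
```

Here @typeof lists two types and gov:constituency is a property drawn from the second vocabulary; neither vocabulary itself has to change.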

One pattern that is quite common is for one vocabulary to accept a string for a property, such as an address, and for an extension to provide more structure for that property. In this case, a useful pattern is to nest the more structured property inside the textual property within the HTML. For example:

<div property="location">
  <address property="http://example.org/address" vocab="http://example.org/" typeof="Address">
    <span property="name">The White House</span><br>
    <span property="street">1600 Pennsylvania Avenue NW</span><br>
    <span property="city">Washington</span>, <span property="state">DC</span> <span property="zip">20500</span>
  </address>
</div>

This pattern also works for properties whose values are XML literals; in this case, the XML literal will include the RDFa markup.

Extending microdata vocabularies

Microdata items can have both properties that are scoped to the type of the item and properties that have absolute URLs. The acceptability of non-URL properties is determined by the vocabulary author of the type of the item; some vocabularies may define a set of acceptable properties, others say that any properties are acceptable. In all cases, however, it's possible to add properties to items if they are named with an absolute URL. Third parties who wish to extend an existing type with new properties should check the constraints of the type being extended to work out whether it's possible to use a non-URL property or not. Note that there is always a possibility, if you do use a non-URL property name, that your extension will conflict with an extension made by someone else; properties whose names are absolute URLs do not have this issue but are more verbose when used in markup.

Microdata does not allow items to have multiple types from different vocabularies. Some vocabularies, such as schema.org, may permit third parties to freely extend existing types within that vocabulary. In this case, items should be assigned both the supertype and the extension type within the @itemtype attribute. For example, schema.org describes a method of extending its vocabulary that involves identifying an appropriate supertype or superproperty and appending a / and then the name of a subtype or subproperty. Schema.org also permits anyone to create additional non-URL properties on these new types. To extend schema.org's types with a type for a member of parliament, a vocabulary author might use the URI http://schema.org/Person/MP, and mark up their page with

<p itemscope itemtype="http://schema.org/Person http://schema.org/Person/MP">
  <span itemprop="name">David Cameron</span> is the member of parliament for <span itemprop="constituency">Witney</span>.
</p>

Here, both http://schema.org/Person and http://schema.org/Person/MP are given as types, and the non-URL constituency property is used despite it not being defined within the schema.org vocabulary.

Other microdata vocabularies do not enable third parties to extend the vocabulary. In these cases, third parties should use a URL property to specify the additional type for the item. For compatibility with RDF, we recommend using http://www.w3.org/1999/02/22-rdf-syntax-ns#type for this property, and using a full URL for the type. An alternative to the example above that didn't use the schema.org extension mechanism would be:

<p itemscope itemtype="http://schema.org/Person">
  <link itemprop="http://www.w3.org/1999/02/22-rdf-syntax-ns#type" href="http://gov.example.org/uk/MP">
  <span itemprop="name">David Cameron</span> is the member of parliament for <span itemprop="http://gov.example.org/uk/constituency">Witney</span>.
</p>

More details about the use and limitations of this technique can be found in the section about using multiple vocabularies in microdata.

The technique described for RDFa above, of nesting a property that contains more structure within a property that has less, can also be used with microdata content.
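
A microdata version of the nested address might look like the sketch below. The http://example.org/ types and properties are hypothetical, and the outer Event item is an assumption added so that the location property has an item to belong to.

```html
<!-- Sketch only: the example.org types and properties are hypothetical -->
<div itemscope itemtype="http://example.org/Event">
  <div itemprop="location">
    <address itemprop="http://example.org/address" itemscope itemtype="http://example.org/Address">
      <span itemprop="name">The White House</span><br>
      <span itemprop="street">1600 Pennsylvania Avenue NW</span><br>
      <span itemprop="city">Washington</span>, <span itemprop="state">DC</span> <span itemprop="zip">20500</span>
    </address>
  </div>
</div>
```

A consumer that only understands the textual location property gets the text content of the div, while one that understands the extension also finds the structured Address item attached to the same outer item through the URL-named property.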

Designing Vocabularies

This section looks at the particular requirements of different HTML data syntaxes on vocabularies, and how to create vocabularies that can be used across HTML data syntaxes.

Syntax-Specific Requirements

Each HTML data syntax brings with it a set of constraints on both how vocabularies are designed and their documentation.

Microformats

The microformats 2 page describes the constraints on the design of microformat vocabularies, and the microformats process describes additional procedural guidelines on how to create a new microformat.

Microdata

Microdata vocabularies must define, within a specification for that vocabulary, processing rules to be followed by consumers of that vocabulary, using the terms given by the microdata specification. These include:

  • what types the vocabulary includes
  • which types support @itemid to provide global identifiers for items
  • whether and how two items described using microdata should be considered a single item by a consumer (such as when they have the same @itemid) and if so, how two items within an HTML page should be merged
  • whether URL values that have the same value as an @itemid should be treated the same as if the item had been nested within the page
  • which non-URL properties (defined property names) are permitted on each of those types, whether there are equivalent URL properties for them, and how properties will be merged if both are used
  • how many and what types of values are allowed for each property, and what consumers should do if there are more or fewer values than required, how the values are parsed, and what happens when the values are of the wrong type
  • whether items that are the value of a property must explicitly have a type or if this can be inferred by consumers
  • what to do when an item has a property that it should not have
  • whether type and property URLs can be dereferenced
  • how consumers should recognise items belonging to the vocabulary (whether purely by @itemtype or through some other mechanism)

An example of a microdata vocabulary description is available for GoodRelations. There are also example microdata vocabularies within the WHATWG version of the microdata specification.

Microdata does not support the use of the HTML lang attribute to provide language information for textual values; if this is important, a microdata vocabulary must provide a mechanism for supplying a language separately. This can be done by:

  • having a property that indicates the language used in the data for the item; this only works if all the data uses the same language
  • defining a LanguageString type that has properties for both content and language and specifying the use of items of that type as a value for any appropriate property
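
The second option could be marked up roughly as follows. This is a sketch under stated assumptions: the http://example.org/ type URLs and the summary, content and language property names are hypothetical and would be defined by the vocabulary in question.

```html
<!-- Sketch only: the example.org types and property names are hypothetical -->
<div itemscope itemtype="http://example.org/Review">
  <p itemprop="summary" itemscope itemtype="http://example.org/LanguageString">
    <span itemprop="content">Un film magnifique</span>
    <meta itemprop="language" content="fr">
  </p>
</div>
```

Because each value is its own item, different properties (or different values of the same property) can carry different languages, at the cost of more verbose markup.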

Microdata does not support structured HTML values. Where these need to be captured, vocabularies can instead use URLs that reference fragments of HTML in the page. For example:

<link itemprop="breadcrumb" href="#breadcrumb">
<div id="breadcrumb">
  <a href="category/books.html">Books</a> &gt;
  <a href="category/books-literature.html">Literature &amp; Fiction</a> &gt;
  <a href="category/books-classics">Classics</a>
</div>

RDFa

RDFa is used to create RDF graphs, so vocabularies used within RDFa should bear in mind the constraints and conventions that commonly apply to RDF vocabularies. These include:

  • types should be named using CapitalCamelCase, and properties using lowerCamelCase
  • types and properties in the same vocabulary should share an IRI prefix (the vocabulary IRI), which should end in a # or a /; the local part of a type or property IRI, after this prefix, should be a valid NCName so that it can be used within RDF/XML serialisations
  • the IRIs used for types and properties should resolve into documentation and/or (through content negotiation) an RDFS schema or OWL ontology that describes the types and properties
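
The naming conventions above might look like this in markup. The vocabulary IRI http://example.org/vocab# and the type and property names are hypothetical, chosen only to show the casing and the shared prefix.

```html
<!-- Sketch only: http://example.org/vocab# is a hypothetical vocabulary IRI ending in # -->
<div vocab="http://example.org/vocab#" typeof="BookReview">
  <span property="reviewBody">A gripping read.</span>
  Rated <span property="ratingValue">4</span> out of 5.
</div>
```

The type expands to http://example.org/vocab#BookReview and the properties to http://example.org/vocab#reviewBody and http://example.org/vocab#ratingValue, so every local part is an NCName appended to the same prefix.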

More guidelines and patterns for modelling using RDF are available within Linked Data Patterns.

Syntax-Neutral Vocabularies

Syntax-neutral vocabularies must have variants for each syntax that meet the requirements for the syntax as described above, but the capabilities of each variant do not have to be identical.

For example, a syntax-neutral review vocabulary could specify a required reviewLanguage property to give the language of a review in microdata, but say that if microformats or RDFa were used, and this property were left unspecified, the language would be assumed from the surrounding HTML (for example, its lang attribute). Publishers who had content that included multiple languages in the review itself (which couldn't be represented using a property providing a language for the entire review) would be able to use microformats or RDFa to mark up the review.
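
A sketch of how the two variants of such a review vocabulary might differ is shown below. The http://example.org/ vocabulary, the Review type and the reviewBody property are hypothetical; only reviewLanguage comes from the example above.

```html
<!-- Sketch only: the example.org Review type and reviewBody property are hypothetical -->

<!-- Microdata variant: the language is given once, through the reviewLanguage property -->
<div itemscope itemtype="http://example.org/Review">
  <meta itemprop="reviewLanguage" content="fr">
  <p itemprop="reviewBody">Un film magnifique.</p>
</div>

<!-- RDFa variant: each value picks up its language from the lang attribute -->
<div vocab="http://example.org/" typeof="Review">
  <p property="reviewBody" lang="fr">Un film magnifique.</p>
  <p property="reviewBody" lang="en">A magnificent film.</p>
</div>
```

The RDFa variant can express a review written in two languages, which the single reviewLanguage property of the microdata variant cannot.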

There are a number of measures that make it easier for vocabularies to be used across syntaxes, and for consumers to combine data whichever syntax is used.

Naming Conventions
Adopt consistent names across syntaxes, even if the naming conventions of the syntaxes differ. For example, microformats uses lowercase-hyphenated-names whereas RDF uses lowerCamelCase; all that is needed is a clear mapping between them. Although microdata allows defined property names to contain any character except a colon (:) or a full stop (.), non-URL properties should have names that are NCNames so that they can be used in microformats and RDFa. Because NCNames may contain full stops but microdata property names may not, full stops should be avoided in these names.

Entity Identity
Microformats and microdata have a limited notion of entity identity: entities may have identifiers (in microdata, from the @itemid attribute) but these are not used within the data model to combine entities or link them together into graphs. Syntax-neutral vocabularies use the RDF concept of identity whereby entities with the same identifier are the same entity, and references to that entity's identifier serve to create a graph of entities. This should be reflected in the definition of the microdata variant of the vocabulary, which should allow @itemid on all items, and specify that consumers should combine and link items to create a graph.
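
As an illustration of the entity identity guideline, the microdata sketch below uses @itemid on one item and refers to the same URL from a property of another. The http://example.org/ types, properties and identifiers are hypothetical.

```html
<!-- Sketch only: the example.org types, properties and identifiers are hypothetical -->
<div itemscope itemtype="http://example.org/Person" itemid="http://example.org/people/alice">
  <span itemprop="name">Alice</span>
</div>

<div itemscope itemtype="http://example.org/Review">
  <span itemprop="reviewBody">A gripping read.</span>
  <link itemprop="author" href="http://example.org/people/alice">
</div>
```

Microdata itself does not connect these two items; it is the vocabulary's definition that would tell a consumer to treat the author value as a reference to the item with the matching @itemid, linking the review and the person into a graph.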

TODO: other guidelines?

Good Practices

It is good practice for vocabulary creators to collaborate with others who are consuming or publishing information in the relevant domains in order to create a vocabulary that can be used widely across an industry.

It is good practice for vocabulary creators to make available a validation tool that enables publishers who use a vocabulary to check that their HTML pages contain data that is valid against that vocabulary.

It is good practice for vocabulary creators to make available test suites that enable implementers to check the behaviour of their implementations. These test suites should cover error handling as well as the correct interpretation of valid data.