WebSchemas/ExternalEnumerations

From W3C Wiki
Revision as of 05:30, 20 April 2012 by Scor (Talk | contribs)

Draft, April 2012 from Dan. --Dan Brickley 16:02, 13 April 2012 (UTC)

  • This is a draft of a document for inclusion on schema.org as /docs/external.html
  • When it is finalised, the Wiki version will be historical only; for now it is a discussion draft only.
  • As a potential schema.org page, it is written in a tone suited to that site, rather than more typical Wiki-speak; please don't edit this away.

External Enumerations

Schema.org markup can be used alongside various existing descriptive standards. In areas such as country codes, units and measures, or job classifications, there are already established standards and authorities. This document describes how to reference such authorities using schema.org vocabulary. To do this, we define the 'http://ext.schema.org/*' mechanism, which allows externally enumerated values to be reflected into the URL structure of schema.org. We also give some example external authorities that can supply values, expressed as URLs such as 'http://ext.schema.org/wikipedia/countries/Canada'.

Schema.org is about finding a balance between having a single well-integrated vocabulary and the full diversity of the Web. To achieve this, we define here some specific integration points through which selected externally maintained vocabulary can be published as part of schema.org markup. This is in addition to the existing extension mechanisms we support, and the general ability to include whatever markup you like in your pages. The focus here is on external vocabularies which can be thought of as 'supported' (or anticipated) in some sense by schema.org.

Initiatives such as DBpedia and Wikidata provide community maintained sources of schema vocabulary and identifiers. Alongside professionally maintained vocabularies (e.g. the Library of Congress Subject Headings for topics, or the UN FAO Country Profiles for countries), there are a great many externally enumerated lists that are useful to include in schema.org descriptions.

This page describes how we do this, and how to link to these enumerations in a way that schema.org-aware systems can understand.

Goals

  • show how to make direct use of externally-maintained vocabularies and datasets
  • encourage common representation of controlled values, without adding hundreds of details into schema.org core
  • balance decentralisation with markup simplicity

There are many cases in which a more precise description can be given by drawing upon some standard list or vocabulary.

  • Countries
  • Topics and categories
  • Specific named individuals (people, events, ...)
  • Units and Measures (future work)

Using only schema.org vocabulary, you can describe many things directly. However, it is often useful to reference things that have been described more authoritatively elsewhere, and to structure your description so that the link with external data can be exploited. We start with some common examples, and show (using Microdata) how this looks in practice.

Scope

There are several ways in which external enumerations can show up in schema.org markup. We begin with the simplest.

Controlled property values

These are enumerations that can be used as values for schema.org properties. For example, schema.org has the notion of a class of things called 'Country', which is a kind of 'AdministrativeArea'. Some schema.org properties have a range of 'Country', which means they can be used to indicate some specific country. However, at schema.org we do not attempt to maintain a detailed list of countries, their identifiers or other properties. That is the role of this extension mechanism.

Schema.org has some properties, such as 'nationality' (of a Person) or 'addressCountry' (of a PostalAddress), whose values are countries. Other properties may be added in the future.

Note: current text for addressCountry is 'The country. For example, USA. You can also provide the two-letter ISO 3166-1 alpha-2 country code.' ...this uses the common schema.org idiom of allowing descriptions to contain either strings or detailed descriptions.

Sometimes a controlled property value will be written as a simple code, like 'fr'. Sometimes it will be a particular identified thing, such as the country France, identified in markup via URL.
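A consumer therefore has to cope with both forms. The following is a minimal sketch of one way to tell them apart; the function name and the labels 'url' and 'text' are illustrative assumptions, not part of schema.org.

```python
from urllib.parse import urlparse


def classify_property_value(value: str) -> str:
    """Label a property value as a URL-identified thing or a plain string.

    Hypothetical helper: the labels "url" and "text" are our own choice.
    """
    parsed = urlparse(value.strip())
    # Only absolute http(s) URLs with a host count as identified things;
    # everything else (e.g. the code 'fr') is treated as a simple string.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "text"
```

With this sketch, 'fr' is classified as 'text', while 'http://ext.schema.org/faogeo/countries/USA' is classified as 'url'.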

Types as enumerations

There are a few places in schema.org's vocabulary where we find long lists of kinds of thing, organized into a class hierarchy. For example, we have a class 'HairSalon' which is, alongside 'DaySpa' and 'HealthClub', a subclass of 'HealthAndBeautyBusiness', itself a subclass of 'LocalBusiness'. While schema.org can usefully include some common cases like this, we don't want to attempt to catalog all of human experience and activity in our schema.

Often these type enumerations don't need to add additional properties. A useful description of a 'DaySpa' can be made that only uses well-known properties of 'LocalBusiness'. This means that schema.org can document these basic shared common classes, providing extension points for other efforts to supply longer lists of types. If the markup says that something is an XYZ and also a HealthClub, schema.org search engines and other consumers can handle it as a HealthClub.
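The fallback described above can be sketched as a simple filter on the consumer side. This is an illustration only: the set of known types is a hypothetical subset, and 'http://example.org/types/XYZ' is a placeholder for an externally defined type.

```python
from typing import Iterable, List

# Hypothetical subset of the schema.org types a consumer understands.
KNOWN_TYPES = {
    "http://schema.org/LocalBusiness",
    "http://schema.org/HealthAndBeautyBusiness",
    "http://schema.org/HealthClub",
    "http://schema.org/DaySpa",
    "http://schema.org/HairSalon",
}


def understood_types(declared: Iterable[str], known=KNOWN_TYPES) -> List[str]:
    """Return the declared types the consumer recognises, in declared order.

    Unknown types are skipped rather than treated as errors, so an item
    typed as both an XYZ and a HealthClub is still handled as a HealthClub.
    """
    return [type_url for type_url in declared if type_url in known]
```

For an item declared with both 'http://example.org/types/XYZ' and 'http://schema.org/HealthClub', only the HealthClub type is returned.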

Note: the expression of multiple types in Microdata is difficult. Do we need our own 'type' property? Multiple types are supported in RDFa Lite.

We do not address here the most complex extension scenarios: types from extensions that also have associated properties defined by 3rd parties. Schema.org markup consumers should tolerate such unexpected markup, even if they don't understand it. However, schema.org publishers shouldn't assume schema.org consumers will be able to do anything useful with unknown properties.

This principle of 'partial understanding' should allow schema.org to remain easy for search engines and publishers to understand, while allowing greater expressivity and detail through richer lists of types of thing. We expect, for example, that Wikipedia entries (directly or through intermediaries like DBpedia/Wikidata or Freebase) will provide community-maintained type enumerations that can be used in schema.org markup.

Country examples

This section gives several examples of using external authorities for 'Country'. We'll use 'nationality' and 'addressCountry' to illustrate some of the practical issues.

Basic markup for 'nationality'

Let's take a cut-down version of our existing Movie example, and add nationality properties for some of the people mentioned. These features might, for example, help fine-tune a movie recommender.

First, the basic example with no nationality information:

  <div itemscope itemtype="http://schema.org/Movie">
    <h1 itemprop="name">Pirates of the Caribbean: On Stranger Tides (2011)</h1>
    <span itemprop="description">Jack Sparrow and Barbossa embark on a quest to find the elusive
             fountain of youth, only to discover that Blackbeard and his daughter are after it too.</span>
    <div itemprop="actors" itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">Johnny Depp</span>,
    </div>
    <div itemprop="actors" itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">Penelope Cruz</span>,
    </div>
    <div itemprop="actors" itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">Ian McShane</span>
    </div>
  </div>


Now, let's start adding it. First, what are the facts we're trying to add? According to Wikipedia, Johnny Depp is a US citizen. This isn't stated very explicitly, but he is listed as being born in Kentucky, USA, and his page is in the following categories (amongst others): American expatriates in France, American film actors, American film directors, American film producers, American people of French descent, American television actors, American voice actors. So, what do we add to the Johnny Depp section to indicate nationality=USA?

Here we show use of two different sources of country identifiers:


URL Design

The schema.org approach to using external enumerations involves adding a level of indirection. Property values and types use externally defined systems, but those systems are given schema.org URLs.

Each such URL has the following structure: 'http://ext.schema.org/' + authority id + '/' + scope + '/' + term

The list of authorities and their short names ('wikipedia', 'faogeo' etc.) is to be maintained by schema.org. We use short scope names, to reduce the chance of name clashes and to give us some redundancy in case of changes within the target dataset. Finally, the term is that dataset's most appropriate (stable) identifier. We leave open the possibility of allowing different keys/IDs to be used with one dataset. Scope is roughly 'entity type', but without implying an exact match to a schema.org type. There is some flexibility in deciding how to partition things between the authority id and the inner scope. For example, we could assign different 'authority IDs' to each dataset from a large organization, or we could put them within a larger common authority. This will be handled on a case by case basis for now.
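The URL structure can be illustrated with a small sketch that builds and splits such URLs. The function and class names are hypothetical, and the percent-encoding of each segment is an assumption, since this draft does not say how unusual characters in a term should be escaped.

```python
from typing import NamedTuple
from urllib.parse import quote, unquote

EXT_BASE = "http://ext.schema.org/"


class ExtTerm(NamedTuple):
    authority: str  # e.g. 'faogeo', 'wikipedia', 'loc'
    scope: str      # e.g. 'countries', 'en', 'lcsh'
    term: str       # the dataset's own identifier, e.g. 'USA'


def build_ext_url(authority: str, scope: str, term: str) -> str:
    """Join authority id, scope and term beneath ext.schema.org."""
    # Assumption: each segment is percent-encoded so it cannot contain '/'.
    segments = (quote(part, safe="") for part in (authority, scope, term))
    return EXT_BASE + "/".join(segments)


def parse_ext_url(url: str) -> ExtTerm:
    """Split an ext.schema.org URL back into its three parts."""
    if not url.startswith(EXT_BASE):
        raise ValueError("not an ext.schema.org URL: " + url)
    parts = url[len(EXT_BASE):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected authority/scope/term in: " + url)
    return ExtTerm(*(unquote(part) for part in parts))
```

For example, build_ext_url('faogeo', 'countries', 'USA') yields 'http://ext.schema.org/faogeo/countries/USA', and parse_ext_url reverses the operation.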

  • here we point instead to schema.org URIs for the entity concerned (in this case, again, United States - the country), per each authority
  • we let schema.org worry about redirecting these to the best page
  • the basic idea is that each authority gets a controlled name within the schema.org extensions system, just as each country gets a name/ID within each of the authorities:
  • the basic URL structure beneath ext.schema.org is: /authority_id/ + /scope/ + term
  • using human-readable links might make sense, depending on what the URLs eventually point to...
  <div itemscope itemtype="http://schema.org/Movie">
    <h1 itemprop="name">Pirates of the Caribbean: On Stranger Tides (2011)</h1>
    <span itemprop="description">Jack Sparrow and Barbossa embark on a quest to find the elusive
             fountain of youth, only to discover that Blackbeard and his daughter are after it too.</span>
    <div itemprop="actors" itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">Johnny Depp</span>,
      <link itemprop="nationality" href="http://ext.schema.org/wikipedia/en/United_States"/>
      <link itemprop="nationality" href="http://ext.schema.org/faogeo/countries/USA"/>
    </div>
  </div>
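
On the consuming side, markup like the example above could be read with a few lines of code. The sketch below uses only Python's standard HTML parser and is deliberately simplified: it collects link-valued properties across the whole document and ignores Microdata item nesting, so it is not a conforming Microdata parser. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class LinkPropertyCollector(HTMLParser):
    """Collect href values of <link itemprop="..."> elements by property name."""

    def __init__(self) -> None:
        super().__init__()
        self.values: Dict[str, List[str]] = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <link .../> tags are also routed here by HTMLParser.
        if tag != "link":
            return
        attributes = dict(attrs)
        prop = attributes.get("itemprop")
        href = attributes.get("href")
        if prop and href:
            self.values.setdefault(prop, []).append(href)


def link_properties(html: str) -> Dict[str, List[str]]:
    """Return {property name: [href, ...]} for all <link> properties in html."""
    collector = LinkPropertyCollector()
    collector.feed(html)
    collector.close()
    return collector.values
```

Applied to the Johnny Depp markup, the 'nationality' entry would contain both the 'wikipedia' and the 'faogeo' URLs, in document order.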

Document examples

A very common scenario is to want to describe the subject (also often called 'topic') of a document, by using a controlled vocabulary rather than arbitrary keywords. The library and bibliographic community have created many such systems over the years, and they are increasingly available as open structured data. Often such vocabularies are expressed using W3C's SKOS system, which models them as a hierarchy of linked 'concepts', each with textual (possibly multilingual) labels of various kinds. See the final report of the W3C Linked Library Data group for more details.

For schema.org, the CreativeWork class has a property 'about' whose values can come from controlled value systems. For example, the Library of Congress uses LCSH, and assigns Web identifiers to each concept there. Many other vocabularies do likewise.

Here is an example using a direct link to a controlled value page:

  <div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">Introduction to Linear Algebra</span> - <link itemprop="bookFormat" href="http://schema.org/Paperback" />Paperback
  by <a itemprop="author" href="/author/g_strang.html">Gilbert Strang</a>
  <link itemprop="about" href="http://id.loc.gov/authorities/sh85003441#concept"/>
  </div>


Here is an example using a link to an indirection page at schema.org:

  <div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">Introduction to Linear Algebra</span> - <link itemprop="bookFormat" href="http://schema.org/Paperback" />Paperback
  by <a itemprop="author" href="/author/g_strang.html">Gilbert Strang</a>
  <link itemprop="about" href="http://ext.schema.org/loc/lcsh/sh85003441"/>
  </div>

Questions and Answers

This section anticipates likely questions regarding this mechanism.

Q: Why do you have these URIs point to schema.org rather than directly off to third party sites?

A: The use of schema.org URIs emphasises that the schema.org project supports these terms for use in schema.org markup. It allows us either to redirect directly to the target page, or to add value by publishing an intermediate page that (crediting the source site) provides more context about that entry. This approach also provides for very regular markup, which is important for our target audience, and allows schema.org to track changes to target sites and to relevant Web standards (e.g. httpRange-14) without requiring markup changes from schema.org publishers. We want to make it as easy as possible to use such vocabularies without requiring publishers to worry about whether they are using the correct and currently fashionable standard for linking to e.g. some country as a Web-identified entity rather than to the page about that country (or vice-versa).

Q: "How do you decide which external vocabularies to support in this manner?"

A: We will maintain a simple directory of useful vocabularies, based on suggestions from the community, publishers and consumers of schema.org data. We prefer vocabularies that address common practical needs, and that show either broad grassroots support (e.g. Wikipedia) or relevant professional input (e.g. LCSH, UN FAO). Inclusion does not represent any kind of formal endorsement of the vocabulary content by the schema.org partners; rather it is an indication that we expect schema.org markup will be richer and more informative if the dataset is used. So, widely-valued vocabularies / datasets that are available under liberal license and through open standards at stable URIs are strong candidates for inclusion.


Q: "What if I want to link directly to the vocabulary site, so that well known URIs are included in my structured data?"

A: It is always possible to include extra links. So for example in the link above, you could additionally include markup such as

<link itemprop="about" href="http://id.loc.gov/authorities/sh85003441#concept"/>

Q: How will the schema.org server respond if we make bad links to ext.schema.org URLs?

A: Initially, we will publish simple HTTP redirects, and rely on the target site to handle 404 messages. Later we may offer more intelligent handling.

Q: Which HTTP redirection code will you use?

A: HTTP 303 is most likely, although the W3C standards community are currently debating related issues. Schema.org will follow W3C best practice on this issue.

Q: Where can we find out the identifiers used within these naming scopes defined by each authority? For example, how do we know to write "United_States" in the Wikipedia dataset (and hence DBpedia too), yet "USA" within the UN FAO dataset?

A: At launch, this involves visiting each site independently. Since these datasets are available in open and standard formats, better tool support is likely.

Initial Dataset Directory

The following well-known vocabularies are expected to be supported (by simple HTTP redirect) at launch; others will follow.

Authority | Authority_id | Summary | Topics | Named scopes
UN FAO GeoPolitical Ontology | faogeo | United Nations Food and Agriculture Organization's GeoPolitical Ontology. Dataset providing in-depth descriptions of countries and related geopolitical entities, as well as various formal codes for these. See homepage and Wikipedia entry, countrylist. | Example term: 'USA'. Related schema.org term(s): Country. | 'countries'
Wikipedia | wikipedia | Wikipedia, the famous encyclopedia. | Everything you can imagine. | Namespaces follow Wikipedia naming, so the English namespace is 'en', etc. These identifiers should also map cleanly to DBpedia's.
US Library of Congress | loc | The Library of Congress, in the United States. | Everything you can imagine. Example term: 'sh85003441' (Linear Algebra) | 'lcsh' - LCSH is the initial named scope, which includes topics for use in bibliographic subject description.
The Universal Decimal Classification | udc | The Universal Decimal Classification (UDC) is a bibliographic and library classification developed by the Belgian bibliographers Paul Otlet and Henri La Fontaine at the end of the 19th century, and is widely used in library catalogues internationally. The UDC is managed by the UDC Consortium. See also blog. The more general levels of the classification are publicly available; the full detailed schedules are subscription only. Multilingual translations are available. | Everything you can imagine. | 'topics' is the initial named scope, whose values are UDC classification codes; or 'entries' for more stable but indirect record identifiers.


URL Template registry

Associated HTTP 303 redirections (note that some of these are URI references; the #something part should be sent):

See the IETF URI Template spec for a more formal notation for such patterns.
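
As an illustration of how such a registry might drive the redirects, here is a minimal sketch. Everything in it is an assumption rather than the actual registry: the 'loc'/'lcsh' template mirrors the id.loc.gov link shown in the Questions and Answers section, the 'wikipedia'/'en' template is a guess at the obvious target, and the 404 for unregistered scopes is our own choice. Only simple '{term}' substitution is handled, not the full URI Template syntax.

```python
from typing import Optional, Tuple
from urllib.parse import quote

# Hypothetical registry: (authority id, scope) -> target URL template.
TEMPLATES = {
    # Mirrors the direct LCSH link used in the Q&A section of this page.
    ("loc", "lcsh"): "http://id.loc.gov/authorities/{term}#concept",
    # Assumed target for the English Wikipedia scope.
    ("wikipedia", "en"): "http://en.wikipedia.org/wiki/{term}",
}


def redirect_for(authority: str, scope: str, term: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location value) for an ext.schema.org lookup."""
    template = TEMPLATES.get((authority, scope))
    if template is None:
        # Assumption: unregistered authority/scope pairs get a 404.
        return 404, None
    # The term itself is not validated; a bad term is left for the
    # target site to answer with its own 404.
    location = template.replace("{term}", quote(term, safe=""))
    return 303, location
```

A lookup for 'loc', 'lcsh', 'sh85003441' would then answer with status 303 and a Location of 'http://id.loc.gov/authorities/sh85003441#concept', fragment included.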

See also