Mixing HTML Data Formats

This page examines how publishers can mix different HTML data formats within their pages, and how consumers can interpret the results, as part of the work of the HTML Data TF.

Mixing Vocabularies

How you mark up the same data with several vocabularies in one syntax depends on the syntax.

Mixing Vocabularies in microformats

As microformats are simply indicated through classes, it's possible to mix several within the same set of content. An example is the BBC Bangladesh River Journey page, which applies hAtom (hentry), hCalendar (vevent) and xFolk (xfolkentry) to the same list item:

<li class="hentry vevent xfolkentry postid-f2068841910">
  <h3 class="entry-title summary">
    <a href="http://www.flickr.com/photos/bangladeshboat/2068841910" title="The final picture (on Flickr)">The final picture</a>
  </h3>
  <div class="entry-content">
    <p class="photo">
      <a rel="bookmark" class="taggedlink url" href="http://www.flickr.com/photos/bangladeshboat/2068841910" title="The final picture (on Flickr)">
        <img src="http://farm3.static.flickr.com/2175/2068841910_1162a8086b_s.jpg" 
             alt="The final picture (on Flickr)" title="The final picture (on Flickr)" width="64" height="64" />
      </a>
    </p>
    <p class="description">As the BBC team prepare to disembark the boat, the sun sets overhead, and indeed on the trip itself.</p>
  </div>
  <ul class="meta">
    <li class="date"><abbr class="published dtstart" title="2007-11-26T02:11:51+06:00">2 days ago</abbr></li>
    <li class="location"><abbr class="geo point-22" title="+22.47157;+89.59534">Mongla, Bangladesh</abbr></li>
  </ul>
</li>
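
Because the formats are signalled only by class names, a consumer can work out which microformats share an element by inspecting its class attribute. The following is a minimal sketch, not a full parser: the ROOT_CLASSES table and the detectMicroformats function are illustrative names, and the table lists only a few root classes.

```javascript
// Root class names for a few microformats (a partial, illustrative table).
const ROOT_CLASSES = {
  hentry: "hAtom",
  vevent: "hCalendar",
  vcard: "hCard",
  xfolkentry: "xFolk",
};

// Return the microformats whose root class appears in a class attribute value.
function detectMicroformats(classAttr) {
  return classAttr
    .split(/\s+/)
    .filter((token) => Object.prototype.hasOwnProperty.call(ROOT_CLASSES, token))
    .map((token) => ROOT_CLASSES[token]);
}

// The class attribute of the list item in the example above:
detectMicroformats("hentry vevent xfolkentry postid-f2068841910");
// returns ["hAtom", "hCalendar", "xFolk"]
```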

Mixing Vocabularies in RDFa

RDFa is designed to be used with multiple vocabularies:

  • types and properties are given IRIs as names, so do not have to be disambiguated; IRIs do not have to be written out in full (see below)
  • an entity can be assigned multiple types from different vocabularies by listing them within the @typeof attribute
  • attributes that indicate properties (@property, @rel and @rev) can take multiple space-separated properties which may be from different vocabularies

Writing out IRIs in full can clutter HTML so RDFa provides four mechanisms to shorten IRIs:

  • There are several built-in prefixes which can be used for popular vocabularies. These are listed as part of the RDFa 1.1 Core initial context. Any IRI within one of these vocabularies can be abbreviated using the prefix:name notation.
  • The @prefix attribute can be used to define additional prefixes for other vocabularies.
  • The @vocab attribute defines a default vocabulary within its scope; any IRIs that begin with this vocabulary can be abbreviated to a short name (the remainder of the IRI after the vocabulary IRI).
  • Namespace declarations (xmlns:prefix attributes) can also be used to define prefixes. This mechanism is deprecated and should not be used.

Note that if you use any of the last three mechanisms, the shortened IRIs can only be understood when they are within the scope of the relevant attributes. These can be easy to mislay when people copy and paste HTML from one place to another, or as the result of template changes in a content-management system. We therefore recommend that these attributes are avoided where possible — use the built-in prefixes or full IRIs in preference — and, where they are used, placed on elements that represent entities (those with @about or @typeof attributes) and repeated on each entity element rather than being inherited from an ancestor element.
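
To see why shortened IRIs depend on their scope, it can help to look at how a consumer might expand them. The sketch below is a simplification of RDFa processing, written under stated assumptions: INITIAL_PREFIXES holds only a small excerpt of the built-in prefixes, the ex prefix and http://example.org/vocab# are made up for illustration, and expandToken and expandList are hypothetical helper names.

```javascript
// A small excerpt of the built-in prefixes; the full list is in the
// RDFa 1.1 Core initial context.
const INITIAL_PREFIXES = {
  schema: "http://schema.org/",
  foaf: "http://xmlns.com/foaf/0.1/",
};

// Expand one token from @typeof, @property, @rel or @rev.
// `scope` describes what is in scope at the element: extra prefixes from
// @prefix and the default vocabulary from @vocab (both optional).
function expandToken(token, scope = {}) {
  const prefixes = { ...INITIAL_PREFIXES, ...(scope.prefixes || {}) };
  const colon = token.indexOf(":");
  if (colon === -1) {
    // A short name only has a meaning when a @vocab is in scope.
    return scope.vocab ? scope.vocab + token : null;
  }
  const prefix = token.slice(0, colon);
  const rest = token.slice(colon + 1);
  // A known prefix marks an abbreviated IRI; anything else is kept as a full IRI.
  return Object.prototype.hasOwnProperty.call(prefixes, prefix)
    ? prefixes[prefix] + rest
    : token;
}

// Expand a space-separated attribute value, such as several types in @typeof.
function expandList(attrValue, scope) {
  return attrValue.trim().split(/\s+/).map((token) => expandToken(token, scope));
}

// Two types from different vocabularies on one entity:
expandList("schema:Event ex:Match", { prefixes: { ex: "http://example.org/vocab#" } });
// returns ["http://schema.org/Event", "http://example.org/vocab#Match"]

// The same short name with and without a @vocab in scope:
expandToken("name", { vocab: "http://schema.org/" }); // returns "http://schema.org/name"
expandToken("name"); // returns null: the property is lost
```

The last call illustrates the copy-and-paste problem: once the element is moved away from the @vocab that gave the short name its meaning, the property can no longer be resolved, whereas schema:name and the full IRI would both survive.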

Mixing Vocabularies in microdata

microdata is designed such that each piece of information in a page is assigned types from a single vocabulary, though each entity may have multiple types and have properties from other vocabularies.

Properties in microdata are either short names (in which case they are scoped to the vocabulary of the types of the entity) or URLs. A URL property has no relationship to a given short name property unless that relationship is specified within the vocabulary that defines the properties.
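
The distinction between the two kinds of property name can be sketched as follows. This is an illustrative helper rather than part of any microdata API: classifyItemprop and isAbsoluteUrl are hypothetical names, and the check relies on the standard URL constructor rejecting strings that are not absolute URLs.

```javascript
// True when the token parses as an absolute URL.
function isAbsoluteUrl(token) {
  try {
    new URL(token); // throws when the token is not an absolute URL
    return true;
  } catch (err) {
    return false;
  }
}

// Split an @itemprop value into short names (scoped to the vocabulary of the
// item's types) and URL properties (which stand on their own).
function classifyItemprop(attrValue) {
  const result = { shortNames: [], urls: [] };
  for (const token of attrValue.trim().split(/\s+/).filter(Boolean)) {
    (isAbsoluteUrl(token) ? result.urls : result.shortNames).push(token);
  }
  return result;
}

classifyItemprop("summary http://schema.org/name");
// returns { shortNames: ["summary"], urls: ["http://schema.org/name"] }
```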

You might find that you need to target two consumers who each recognise items using types from different vocabularies. For example, you might want to target schema.org and use the vEvent vocabulary with the original HTML:

<a href="nba-miami-philidelphia-game3.html">
NBA Eastern Conference First Round Playoff Tickets:
 Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)
</a>

Thu, 04/21/16
8:00 p.m.

<a href="wells-fargo-center.html">
Wells Fargo Center
</a>
Philadelphia, PA

Priced from: $35
1938 tickets left

In this case there are three options available to you. The first, if consumers support it, is to use a different syntax for one of the vocabularies. For example, the vEvent vocabulary is only supported in microdata but schema.org can be consumed from either microdata or RDFa. Mixing syntaxes within a single page is rarely a good option but in some circumstances it may be preferable to the other workarounds described here.

Mixing Vocabularies using a Type Property

Some vocabularies may define a property through which types from that vocabulary can be assigned to items that are in a different vocabulary. For example, schema.org could define a http://schema.org/type property whose value is a URL, and state that any microdata item that has a schema.org type as a value for that property is recognised as being an item of that type. In this case, the types specified in the @itemtype attribute are the primary types of the entity and those specified through the property are the secondary types.

Alongside the assertion that property URLs that begin with http://schema.org/ have the same semantics as short name properties on items with a schema.org type, this enables the schema.org vocabulary to be mixed in with an item marked up using vEvent:

Note that at time of writing schema.org does not specify a http://schema.org/type property and this example will not work.

<div itemscope itemtype="http://microformats.org/profile/hcalendar#vevent">
  <link itemprop="http://schema.org/type" href="http://schema.org/Event">
  <a itemprop="url http://schema.org/url" href="nba-miami-philidelphia-game3.html">
  NBA Eastern Conference First Round Playoff Tickets:
  <span itemprop="summary http://schema.org/name"> Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1) </span>
  </a>

  <meta itemprop="dtstart http://schema.org/startDate" content="2016-04-21T20:00">
    Thu, 04/21/16
    8:00 p.m.

  <div itemprop="location">
    <div itemprop="http://schema.org/location" itemscope itemtype="http://schema.org/Place">
      <a itemprop="url" href="wells-fargo-center.html">
      Wells Fargo Center
      </a>
      <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span itemprop="addressLocality">Philadelphia</span>,
        <span itemprop="addressRegion">PA</span>
      </div>
    </div>
  </div>

  <div itemprop="http://schema.org/offers" itemscope itemtype="http://schema.org/AggregateOffer">
    Priced from: <span itemprop="lowPrice">$35</span>
    <span itemprop="offerCount">1938</span> tickets left
  </div>
</div>

Note in particular that the vEvent location property takes text while the schema.org location property takes structured information about the location. These are combined by having an element for the property which requires structured information nested within the property that requires text.

Also note that in this example the http://schema.org/type property is only used where necessary, on the entity which needs to be marked as an event in both vocabularies. Where possible, the schema.org type for an entity is provided explicitly through the @itemtype attribute.

This method of mixing vocabularies requires vocabularies to specify how consumers should recognise items of a particular type. It is recommended that vocabulary authors define an @itemtype-equivalent property, and that, for better integration with RDF tools, this property is http://www.w3.org/1999/02/22-rdf-syntax-ns#type. (TODO: Issue about what to recommend here.)

A further disadvantage of this approach is that there is no support within the microdata API for retrieving items based on the value of a property. In the example above, it would be possible to retrieve the event using:

document.getItems('http://microformats.org/profile/hcalendar#vevent')

but not through:

document.getItems('http://schema.org/Event')

Scripts that extract microdata information using the DOM will be faster if they can use the primary types for an item, specified within the @itemtype attribute, so you should specify types accessed through scripts within @itemtype rather than through a property wherever possible.
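
A script that does need to find items by a secondary type has to examine every item itself. The sketch below assumes the items have already been extracted into objects shaped like microdata's JSON form (a type array and a properties object), and it uses the http://schema.org/type property, which remains hypothetical as noted above. The names typesOf, findItemsByType and pageItems are illustrative.

```javascript
// Hypothetical type property; schema.org does not define it.
const TYPE_PROPERTY = "http://schema.org/type";

// Primary types come from @itemtype, secondary types from the type property.
function typesOf(item) {
  const primary = item.type || [];
  const secondary = ((item.properties || {})[TYPE_PROPERTY] || []).filter(
    (value) => typeof value === "string"
  );
  return [...new Set([...primary, ...secondary])];
}

// Unlike a lookup by primary type, this has to visit every item.
function findItemsByType(items, type) {
  return items.filter((item) => typesOf(item).includes(type));
}

// The event from the example above, reduced to the relevant parts.
const pageItems = [
  {
    type: ["http://microformats.org/profile/hcalendar#vevent"],
    properties: {
      "http://schema.org/type": ["http://schema.org/Event"],
      summary: ["Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)"],
    },
  },
];

findItemsByType(pageItems, "http://schema.org/Event").length; // 1
```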

Mixing Vocabularies using Repeated Content

Another way of supporting multiple vocabularies is to have the entity represented by two (or more) microdata items on the page. To enable dragging and dropping the data from these items, they should be nested inside each other. Properties can be set on the outer element using link and meta elements, which are hidden from users, while the visible content of the page is marked up by the inner element.

<div itemscope itemtype="http://microformats.org/profile/hcalendar#vevent">
  <link itemprop="url" href="nba-miami-philidelphia-game3.html">
  <meta itemprop="summary" content="Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)">
  <meta itemprop="dtstart" content="2016-04-21T20:00">
  <meta itemprop="location" content="Wells Fargo Center, Philadelphia, PA">
  <div itemscope itemtype="http://schema.org/Event">
    <a itemprop="url" href="nba-miami-philidelphia-game3.html">
    NBA Eastern Conference First Round Playoff Tickets:
    <span itemprop="name"> Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1) </span>
    </a>

    <meta itemprop="startDate" content="2016-04-21T20:00">
      Thu, 04/21/16
      8:00 p.m.

    <div itemprop="location" itemscope itemtype="http://schema.org/Place">
      <a itemprop="url" href="wells-fargo-center.html">
      Wells Fargo Center
      </a>
      <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span itemprop="addressLocality">Philadelphia</span>,
        <span itemprop="addressRegion">PA</span>
      </div>
    </div>

    <div itemprop="offers" itemscope itemtype="http://schema.org/AggregateOffer">
      Priced from: <span itemprop="lowPrice">$35</span>
      <span itemprop="offerCount">1938</span> tickets left
    </div>
  </div>
</div>

This method does not require any special properties to be defined in the vocabularies used to mark up the page, and the two items are directly assigned the relevant type and are thus accessible to scripts through the document.getItems() method.

The disadvantages of this method are that the page contains more items than there are entities (in the above example, two items representing the same event), and it requires repetition of data within the page.
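
Repeated data can drift out of step when only one copy is updated, so a publisher might want a simple check that the two items still agree. The following is a minimal sketch under stated assumptions: each item has already been reduced to a plain object with one string value per property, the MIRRORED table lists the property pairs that carry the same value in the example above, and findMismatches is a hypothetical helper name.

```javascript
// Pairs of (vEvent property, schema.org property) that repeat the same value.
const MIRRORED = [
  ["url", "url"],
  ["summary", "name"],
  ["dtstart", "startDate"],
];

// Compare the outer (vEvent) and inner (schema.org) items and describe
// every pair of values that no longer matches.
function findMismatches(outer, inner, pairs = MIRRORED) {
  const problems = [];
  for (const [outerName, innerName] of pairs) {
    // Trim because visible text often carries surrounding whitespace.
    const a = (outer[outerName] || "").trim();
    const b = (inner[innerName] || "").trim();
    if (a !== b) {
      problems.push(`${outerName} ("${a}") differs from ${innerName} ("${b}")`);
    }
  }
  return problems;
}

const outerItem = {
  url: "nba-miami-philidelphia-game3.html",
  summary: "Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)",
  dtstart: "2016-04-21T20:00",
};
const innerItem = {
  url: "nba-miami-philidelphia-game3.html",
  name: " Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1) ",
  startDate: "2016-04-21T20:00",
};

findMismatches(outerItem, innerItem); // returns []
```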

Mixing Syntaxes

A requirement to support a large range of consumers can mean that it becomes necessary to publish using not only multiple vocabularies but multiple syntaxes.

RDFa, microformats and microdata all share the same basic entity/attribute/value model, so in many cases it is possible to mirror attributes across the syntaxes. The following example shows the same content marked up with:

  • hCalendar (microformat)
  • schema.org (RDFa)
  • vEvent (microdata)

<div class="vevent"
  itemscope itemtype="http://microformats.org/profile/hcalendar#vevent"
  vocab="http://schema.org/" typeof="Event">
  <a class="url" itemprop="url" property="url" href="nba-miami-philidelphia-game3.html">
    NBA Eastern Conference First Round Playoff Tickets:
    <span class="summary" itemprop="summary" property="name"> Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1) </span>
  </a>

  <meta itemprop="dtstart" property="startDate" content="2016-04-21T20:00:00">
  <abbr class="dtstart" title="2016-04-21T20:00:00">
    Thu, 04/21/16
    8:00 p.m.
  </abbr>

  <div class="location" itemprop="location" 
       vocab="http://schema.org/" property="location" typeof="Place">
    <a property="url" href="wells-fargo-center.html">
      Wells Fargo Center
    </a>
    <div property="address" vocab="http://schema.org/" typeof="PostalAddress">
      <span property="addressLocality">Philadelphia</span>,
      <span property="addressRegion">PA</span>
    </div>
  </div>

  <div vocab="http://schema.org/" property="offers" typeof="AggregateOffer">
    Priced from: <span property="lowPrice">$35</span>
    <span property="offerCount">1938</span> tickets left
  </div>
</div>

It is particularly important to check pages in which syntaxes are mixed together using an appropriate validator for each format.

The following guidelines may help when creating pages in which different syntaxes are mixed together.

  • microformats do not use link or meta elements within the content of the page and in some cases require particular elements to be used to encode information, such as using abbr to support the datetime-design-pattern as illustrated by the dtstart property in the example above
  • link relations required in certain microformats, particularly XFN, clash with the use of RDFa's @vocab attribute; avoid using @vocab on any ancestor of an element that contains a @rel (see issues on link relations)
  • the following equivalencies between RDFa and microdata attributes generally hold true:
    • @itemid = @resource
    • @itemtype = @typeof (+ @vocab to enable the use of short names for properties)
    • @itemprop + @itemscope = @property + an empty @typeof if there's no @itemtype
    • @itemprop otherwise = @property
  • when using RDFa, any property elements within an element with a @href will be taken as being properties of the entity identified by the URL in that @href; as long as the link doesn't have a @rel, this can be avoided by adding an empty @property to the link. If the link does have a @rel, you can either move the property elements outside the link or add a @resource attribute whose value is the same as the @resource on the entity element (this can be a local "blank node" identifier in the form _:localName).
  • RDFa vocabularies are typically stricter in the range of values that they accept for properties that take dates and times; it is best to use the syntax YYYY-MM-DD for dates, hh:mm:ss for times and YYYY-MM-DDThh:mm:ss for dateTimes to be compliant with the XML Schema dates and times which RDFa-based vocabularies will typically use
  • the @datatype attribute might be required for some RDFa vocabularies/consumers; others will coerce values into the appropriate datatype based on the property itself. However, if a property takes a structured value, the property element must have datatype="rdf:XMLLiteral" for that structure to be preserved
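
The attribute equivalences listed above can be expressed as a small mapping. This is a sketch of those rules only, not a complete converter: it handles a single type per item, it assumes the vocabulary IRI is everything up to the last "/" or "#" of the type IRI, and microdataToRdfa and splitType are hypothetical names.

```javascript
// Split a type IRI into a vocabulary IRI and a short name at the last "/" or "#".
// This split is an assumption that suits IRIs such as http://schema.org/Event.
function splitType(typeIri) {
  const cut = Math.max(typeIri.lastIndexOf("/"), typeIri.lastIndexOf("#")) + 1;
  return { vocab: typeIri.slice(0, cut), name: typeIri.slice(cut) };
}

// `attrs` maps microdata attribute names to values; a boolean attribute
// such as itemscope is represented by an empty string.
function microdataToRdfa(attrs) {
  const out = {};
  if ("itemid" in attrs) out.resource = attrs.itemid; // @itemid = @resource
  if ("itemprop" in attrs) out.property = attrs.itemprop; // @itemprop = @property
  if ("itemtype" in attrs) {
    // @itemtype = @typeof, plus @vocab so that short names can be used
    const { vocab, name } = splitType(attrs.itemtype);
    out.vocab = vocab;
    out.typeof = name;
  } else if ("itemprop" in attrs && "itemscope" in attrs) {
    out.typeof = ""; // empty @typeof: a nested entity with no declared type
  }
  return out;
}

microdataToRdfa({
  itemprop: "address",
  itemscope: "",
  itemtype: "http://schema.org/PostalAddress",
});
// returns { property: "address", vocab: "http://schema.org/", typeof: "PostalAddress" }
```

The result matches the attributes used on the address element in the mixed example above.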

Consuming Pages with Multiple Formats

In attempting to provide information to multiple consumers, publishers may use several formats within a single page. Consumers should ignore data in vocabularies that they do not recognise, and should only raise errors for unexpected properties in vocabularies that they do recognise.
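
A consumer following this advice might skip items whose types it does not know and report unexpected properties only on the items it does understand. The sketch below assumes items shaped like microdata's JSON form; the KNOWN_TYPES table describes a hypothetical consumer and is not a complete list of schema.org Event properties, and checkItem is an illustrative name.

```javascript
// Property names a hypothetical consumer understands, keyed by item type.
const KNOWN_TYPES = {
  "http://schema.org/Event": new Set(["name", "url", "startDate", "location", "offers"]),
};

// Returns null for an item in an unrecognised vocabulary (ignored silently),
// otherwise the list of unexpected property names (possibly empty).
function checkItem(item) {
  const knownType = (item.type || []).find((type) =>
    Object.prototype.hasOwnProperty.call(KNOWN_TYPES, type)
  );
  if (knownType === undefined) return null;
  const allowed = KNOWN_TYPES[knownType];
  // URL-named properties belong to other vocabularies, so they are skipped.
  return Object.keys(item.properties || {}).filter(
    (name) => !name.includes(":") && !allowed.has(name)
  );
}

checkItem({
  type: ["http://microformats.org/profile/hcalendar#vevent"],
  properties: { summary: ["Game 3"] },
}); // returns null: not a vocabulary this consumer recognises

checkItem({
  type: ["http://schema.org/Event"],
  properties: { name: ["Game 3"], colour: ["red"], "http://example.org/p": ["x"] },
}); // returns ["colour"]
```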

Consumers of HTML data may recognise several formats embedded within a given page, and even within the same part of a page. In these cases, consumers should merge the data from the different formats; in the mixed-syntax example above, a consumer should recognise that the data in vEvent, hCalendar and schema.org is about a single event rather than interpreting it as three events, and should merge property values so that the event ends up having a single URL rather than several. Different formats may provide information about different aspects of an entity to different levels of fidelity (in the same example, the schema.org RDFa provides extra details about the location of the event compared to the vEvent or hCalendar formats), and consumers should seek to use whatever gives them the most detailed information.
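
One possible way of merging is sketched below. It assumes the consumer has already decided that the records describe the same entity (for instance because they were extracted from the same element) and that each record has one value per property. The CANONICAL table, which aligns property names on the schema.org names, and the mergeEntity function are illustrative choices rather than a prescribed algorithm.

```javascript
// Map each format's property names onto one set of names (here schema.org's).
const CANONICAL = {
  hcalendar: { summary: "name", dtstart: "startDate", url: "url", location: "location" },
  vevent: { summary: "name", dtstart: "startDate", url: "url", location: "location" },
  "schema.org": { name: "name", startDate: "startDate", url: "url", location: "location" },
};

// Merge records about one entity, keeping a single value per property and
// preferring a structured value over plain text.
function mergeEntity(records) {
  const merged = {};
  for (const { format, properties } of records) {
    for (const [name, value] of Object.entries(properties)) {
      const key = (CANONICAL[format] || {})[name];
      if (!key) continue; // a property this sketch does not know how to align
      const current = merged[key];
      const richer = typeof value === "object" && typeof current !== "object";
      if (current === undefined || richer) merged[key] = value;
    }
  }
  return merged;
}

const eventRecords = [
  {
    format: "hcalendar",
    properties: {
      url: "nba-miami-philidelphia-game3.html",
      summary: "Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)",
      location: "Wells Fargo Center Philadelphia, PA",
    },
  },
  {
    format: "schema.org",
    properties: {
      url: "nba-miami-philidelphia-game3.html",
      name: "Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)",
      location: { type: "Place", addressLocality: "Philadelphia", addressRegion: "PA" },
    },
  },
];

const mergedEvent = mergeEntity(eventRecords);
// mergedEvent has one url, one name, and the structured location
```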