HTML Data Improvements

From W3C Wiki

This page holds drafts and other tracking for improvements to microdata and RDFa developed by the HTML Data TF.

Raised Bugs

The following bugs have been raised with appropriate working groups.

Additional XSD Datatypes

HTML5's time element supports timezones and weeks as well as the usual date/time/duration datatypes. These new datatypes will be hard to map into RDF or XML. This has been raised with the XML Schema Working Group (see bug 14881) but it's likely that they will reject the bug for lack of time and that W3C Notes will be required to define new xs:timezone and xs:yearWeek types to support this data.

Link Relations

This is background information and examples for RDFA-ISSUE-108.

According to RDFa Core, if someone uses a recognised term within a @rel attribute then an RDFa processor will interpret that term based on the local term mappings. If the term isn't recognised, it will be interpreted based on the local default vocabulary. In HTML+RDFa 1.1, the local term mappings are based on the IANA link relations and the local default vocabulary is undefined. Unrecognised terms are therefore ignored in HTML+RDFa unless the @vocab attribute is used to provide a local default vocabulary.
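The lookup order described above can be sketched in Python. This is only an illustration of the order as this page describes it, not a conforming RDFa processor: the function name resolve_rel_term and the three-entry mapping table are assumptions made for the example, and the real local term mappings contain many more entries.

```python
from typing import Optional

XHV = "http://www.w3.org/1999/xhtml/vocab#"

# Illustrative subset only (an assumption for this sketch); the real
# local term mappings list many more link relations.
LOCAL_TERM_MAPPINGS = {
    "alternate": XHV + "alternate",
    "license": XHV + "license",
    "stylesheet": XHV + "stylesheet",
}


def resolve_rel_term(term: str, vocab: Optional[str] = None) -> Optional[str]:
    """Return the property IRI for a bare @rel term, or None if it is ignored.

    Follows the order described in the text: recognised terms use the
    local term mappings, other terms fall back to the local default
    vocabulary (if any), and are otherwise ignored.
    """
    if term in LOCAL_TERM_MAPPINGS:
        return LOCAL_TERM_MAPPINGS[term]
    if vocab is not None:
        return vocab + term
    return None


# A recognised term maps through the local term mappings.
print(resolve_rel_term("license"))
# An unrecognised term is ignored when no @vocab is in scope...
print(resolve_rel_term("disclosure"))
# ...but becomes a property once @vocab supplies a default vocabulary.
print(resolve_rel_term("disclosure", vocab="http://security.example.org/"))
```

The third call is the case that matters for the problems listed below: the same unrecognised term silently changes from "ignored" to "a property in the publisher's vocabulary" depending on whether @vocab is in scope.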

According to HTML5, there is a set of built-in link types with specified (and complex) semantics, which is then extended using the microformats wiki as a registry.

There are also microformats, in particular XFN, which rely on the @rel attribute. XFN looks for any relations within its known list and interprets these as pointers from the author of the document in which they are found to people with whom that author has a relationship.

This state of affairs is problematic because the same document may be interpreted one way by an RDFa processor and another way by an HTML5 processor.

1. Different registries

The set of link relationships recognised by RDFa is different from that recognised by HTML. Some HTML link relations will be mapped into properties by RDFa, and some not; some relations recognised by RDFa will be ignored by HTML processors and some not. Publishers need to look at different lists to know what link relations are available.

2. Combined link relations

RDFa breaks up the value of the @rel attribute on spaces and simply considers each term that results individually. HTML5 combines alternate and stylesheet to provide a single meaning ("alternate stylesheet"). However, under RDFa processing, a document containing:

<link rel="alternate stylesheet" href="/styles/mobile.css" title="Mobile">

will create the triples:

<>
  xhv:alternate </styles/mobile.css> ;
  xhv:stylesheet </styles/mobile.css> ;
  .

when this is not the intention of the author or the meaning understood by an HTML processor.

The link relation shortcut icon is similarly treated specially by HTML processors (as a synonym for just icon).
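The difference between the two readings of a multi-token @rel value can be sketched as follows. Both functions are hypothetical simplifications written for this page: rdfa_rel_terms only performs the whitespace split, and html5_link_types only models the two special cases mentioned above (alternate stylesheet and shortcut icon), not the full HTML5 link type rules.

```python
from typing import List


def rdfa_rel_terms(rel: str) -> List[str]:
    """RDFa-style reading: every whitespace-separated token stands alone."""
    return rel.split()


def html5_link_types(rel: str) -> List[str]:
    """Simplified HTML5-style reading (an assumption for this sketch).

    Models only the two cases discussed in the text:
    - 'alternate' plus 'stylesheet' combine into one meaning;
    - 'shortcut icon' is treated as a synonym for 'icon'.
    """
    tokens = {token.lower() for token in rel.split()}
    types: List[str] = []
    if {"alternate", "stylesheet"} <= tokens:
        types.append("alternate stylesheet")
        tokens -= {"alternate", "stylesheet"}
    if {"shortcut", "icon"} <= tokens:
        tokens.discard("shortcut")
    types.extend(sorted(tokens))
    return types


print(rdfa_rel_terms("alternate stylesheet"))    # two independent terms
print(html5_link_types("alternate stylesheet"))  # one combined link type
print(html5_link_types("shortcut icon"))         # just 'icon'
```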

3. Non-document link relations

The bookmark HTML link relation specifies a relationship between the nearest ancestor section of the page and the linked resource. Assuming no other RDFa markup within the page, an RDFa processor will erroneously associate the bookmark with the HTML document as a whole instead.

The example from the spec is:

<body>
 <h1>Example of permalinks</h1>
 <div id="a">
  <h2>First example</h2>
  <p><a href="a.html" rel="bookmark">This</a> permalink applies to
  only the content from the first H2 to the second H2. The DIV isn't
  exactly that section, but it roughly corresponds to it.</p>
 </div>
 <h2>Second example</h2>
 <article id="b">
  <p><a href="b.html" rel="bookmark">This</a> permalink applies to
  the outer ARTICLE element (which could be, e.g., a blog post).</p>
  <article id="c">
   <p><a href="c.html" rel="bookmark">This</a> permalink applies to
   the inner ARTICLE element (which could be, e.g., a blog comment).</p>
  </article>
 </article>
</body>

This will create the triples:

<>
  xhv:bookmark <a.html> ;
  xhv:bookmark <b.html> ;
  xhv:bookmark <c.html> ;
  .

which is not the intention of the document. The intention is more like:

<#a> xhv:bookmark <a.html> .
<#b> xhv:bookmark <b.html> .
<#c> xhv:bookmark <c.html> .
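A rough way to arrive at the intended subjects is sketched below. It deliberately replaces the HTML5 notion of "nearest ancestor section" with a much cruder heuristic, the nearest ancestor element that carries an id, which happens to give the intended result for this example. The function name bookmark_subjects is hypothetical, and the sketch assumes markup in which every element is explicitly closed and no void elements appear.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class _BookmarkCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.open_ids: List[Optional[str]] = []  # id (or None) per open element
        self.pairs: List[Tuple[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").split()
        if tag == "a" and "bookmark" in rel_tokens:
            # Heuristic (assumption): nearest ancestor with an id is the
            # subject; fall back to the document itself ("").
            subject = next(
                ("#" + i for i in reversed(self.open_ids) if i), ""
            )
            self.pairs.append((subject, attributes.get("href")))
        self.open_ids.append(attributes.get("id"))

    def handle_endtag(self, tag):
        # Assumes well-nested markup with explicit end tags.
        if self.open_ids:
            self.open_ids.pop()


def bookmark_subjects(html: str) -> List[Tuple[str, Optional[str]]]:
    """Return (subject, bookmark target) pairs for rel="bookmark" links."""
    collector = _BookmarkCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


SAMPLE = """
<body>
 <div id="a"><p><a href="a.html" rel="bookmark">This</a></p></div>
 <article id="b">
  <p><a href="b.html" rel="bookmark">This</a></p>
  <article id="c"><p><a href="c.html" rel="bookmark">This</a></p></article>
 </article>
</body>
"""

for subject, target in bookmark_subjects(SAMPLE):
    print(f"<{subject}> xhv:bookmark <{target}> .")
```

A plain RDFa processor has no equivalent step: without @about or similar markup, all three links share the document as their subject.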


4. Alias link relations

Some link relations are aliases for others. For example, copyright is an alias for license — it means the same thing and should be treated in the same way by HTML processors. However, an RDFa processor will not normalise to the standard term.

In addition, alias link relations are not valid within HTML documents. If they are not recognised as terms by RDFa processing and a publisher were to use them (coupled with @vocab such that they did not have a prefix) then they would be creating an invalid HTML document.
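The normalisation step that an HTML processor applies, and that an RDFa processor does not, might look like the sketch below. The alias table holds only the one pair mentioned above, and the name normalise_link_relation is an assumption for the example.

```python
# Only the alias named in the text; a real table would be longer.
LINK_RELATION_ALIASES = {"copyright": "license"}


def normalise_link_relation(term: str) -> str:
    """Map an alias to its standard term; HTML link types are
    compared ASCII case-insensitively, so lower-case first."""
    key = term.lower()
    return LINK_RELATION_ALIASES.get(key, key)


print(normalise_link_relation("copyright"))  # the standard term
print(normalise_link_relation("License"))    # already standard
```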

5. Misinterpretation due to @vocab

When there is a local default vocabulary indicated by a @vocab attribute, and the term is not listed in the set that is recognised by RDFa, RDFa will interpret a link relation based on that vocabulary, while HTML and microformats semantics will not be affected. For example, say that security.example.org had a vocabulary for security disclosures about vulnerabilities in operating systems, and the RDFa in the page looked like:

<a vocab="http://security.example.org/"
   rel="disclosure" href="/vulnerability/4389">...</a>

An RDFa 1.1 processor will interpret this as:

<> <http://security.example.org/disclosure> </vulnerability/4389> .

whereas based on the extended list of link relations managed by the microformats wiki, an HTML processor will assume that the link is to a "list of patent disclosures or a particular patent disclosure itself made with respect to material for which such relation type is specified".

Similarly, if the publisher had actually meant to use the HTML definition of the link relation, they might not realise that the in-scope @vocab attribute actually changes the meaning of the normal link relation for RDFa processors in a way that they didn't intend.

There are also clashes with microformat link relations. For example:

<a vocab="http://purl.org/dc/terms/"
   rel="date" resource="http://reference.data.gov.uk/id/day/2011-11-15">15th November 2011</a>

will result in a dc:date relationship under RDFa processing, but will be interpreted under XFN as a link to someone the author of the page is dating.

This arises from the mismatch between the RDFa and HTML sets of link relations, so it is a side-effect of working from different lists of link relations. If all HTML and microformats link relations were reserved, there wouldn't be a problem; if only a subset are reserved, the clashes above can occur.

6. Misinterpretation due to RDFa context

The subject of link relations will often be different under RDFa and HTML processing. The subject of the HTML link relationships is usually (but not always) the document, whereas the subject of a property in RDFa is determined through the @about, @typeof etc attributes. An example is:

<figure about="picture.jpg">
  <img src="picture.jpg">
  <figcaption><a rel="license"
    href="http://creativecommons.org/licenses/by-nc-nd/3.0/">CC-by-nc-nd</a></figcaption>
</figure>

In this case, an RDFa processor will associate the license with the image (picture.jpg) whereas an HTML processor will associate the license with the HTML page.

Potential Bugs

The following haven't been raised as bugs, but might be.

Structured Values

Unlike microformats and RDFa, microdata doesn't have any support for values that are HTML structures. This has previously been raised as a bug which was resolved without change.

There are examples in the schema.org documentation which use HTML structures without seeming to realise that they will be ignored. The breadcrumb property is described as "A set of links that can help a user understand and navigate a website hierarchy." The sole example of it is:

<div itemprop="breadcrumb">
  <a href="category/books.html">Books</a> >
  <a href="category/books-literature.html">Literature & Fiction</a> >
  <a href="category/books-classics">Classics</a>
</div>

Microdata processing dictates that the value of the breadcrumb property in this case is the text content of the element: "Books > Literature & Fiction > Classics" (ignoring whitespace). The HTML content of this property isn't preserved by microdata processing, so it isn't actually a set of links but rather a textual description of the context of the page.
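The loss of markup can be reproduced with a small sketch that mimics taking the text content of the element carrying @itemprop, including the literal > separators. It is not a microdata processor: the function name itemprop_text is hypothetical, only the first itemprop element is considered, and the markup is assumed to have every element explicitly closed.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.depth = 0               # > 0 while inside the itemprop element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth or "itemprop" in dict(attrs):
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def itemprop_text(html: str) -> str:
    """Return the text content of the element carrying @itemprop."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


BREADCRUMB = """<div itemprop="breadcrumb">
  <a href="category/books.html">Books</a> >
  <a href="category/books-literature.html">Literature & Fiction</a> >
  <a href="category/books-classics">Classics</a>
</div>"""

value = itemprop_text(BREADCRUMB)
# Collapse whitespace for display only; the link targets are gone.
print(" ".join(value.split()))
```

None of the href values survive in the extracted value, which is the point made above: the consumer receives a string, not a set of links.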

Other examples of microdata publishers assuming that the markup within content will be carried through into the data gleaned from the page have been identified:

From http://www.goodreads.com/book/show/14770.Neuromancer

<div class="infoBoxRowItem" itemprop='awards'>
  <a href="/award/show/9-hugo-award" class="award">Hugo Award for Best Novel (1985)</a>, 
  <a href="/award/show/23-nebula-award" class="award">Nebula Award for Best Novel (1984)</a>, 
  <a href="/award/show/326-philip-k-dick-award" class="award">Philip K. Dick Award (1984)</a>, 
  <a href="/award/show/1403-john-w-campbell-memorial-award" class="award">John W. Campbell Memorial Award Nominee for Best Science Fiction Novel (1985)</a>                      
</div>

From http://www.telegraph.co.uk/foodanddrink/restaurants/8824409/The-Butchers-Arms-Woolhope-Herefordshire-restaurant-review.html

<div id="mainBodyArea" itemprop="reviewBody">
  <div class="firstPar">
    <p>
      <strong>The Butchers Arms</strong>, Woolhope, Herefordshire HR1 4RF <br>
      <strong>Contact </strong>01432 860281; food@butchersarmswoolhope.co.uk).<br>
      <strong>Price </strong>Three courses with a couple of pints, or half a bottle of 
      wine and coffee: £35-40 per head
    </p>
  </div>
  <div class="secondPar">
    <p>
      For a man whose first career was in advertising, Stephen Bull is no huge fan 
      of the hard sell. “It’s all much of a muchness, really,” 
      replied the owner of The Butchers Arms near Hereford when asked to recommend a
      couple of his dishes. “All pretty mediocre.”
    </p>
    ...
  </div>
</div>

URLs, URIs and IRIs

The HTML5 rules on URL resolution are known to be broken and will hopefully be fixed by reference to a revised IRI spec. This section documents the particular issues that arise from the current algorithm as it applies to RDFa processing.

HTML5 uses the term URL throughout. The definition that it uses for a valid URL is:

A URL is a valid URL if at least one of the following conditions holds:
  • The URL is a valid URI reference [RFC3986].
  • The URL is a valid IRI reference and it has no query component. [RFC3987]
  • The URL is a valid IRI reference and its query component contains no unescaped non-ASCII characters. [RFC3987]
  • The URL is a valid IRI reference and the character encoding of the URL's Document is UTF-8 or a UTF-16 encoding. [RFC3987]

This allows IRIs to appear within documents. An IRI whose query component contains unescaped non-ASCII characters is only valid when the character encoding of the document in which the URL is found is UTF-8 or UTF-16.

DOM attributes whose values reflect HTML attributes whose values are URLs (such as @href, @src, @itemid and so on) are then resolved through the HTML5 resolution algorithm. This turns all IRIs into URIs by percent-encoding characters that aren't allowed in URIs and performs resolution based on URI rules from RFC3986. The results at the moment are always valid URIs.

In RDFa-Core, resolution of IRIs in all cases is done through the standard IRI resolution algorithm from RFC3987. The RDF restrictions on URIs used to identify resources are documented in its abstract semantics. This currently normalises IRIs to URIs by percent-encoding non-ASCII characters, so currently the effective RDF generated from RDFa will contain URIs.

The new draft of the RDF 1.1 abstract semantics allows IRIs to be used as identifiers for resources. For future RDF 1.1 implementations, the effective RDF generated from RDFa will contain IRIs. Importantly, since these IRIs are being used as identifiers, their equivalence will be assessed through string-equivalence rather than by first normalising to URIs and then comparing.
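Why string equivalence matters can be shown with a short sketch. The helper iri_to_uri is a simplified stand-in written for this page, not the HTML5 resolution algorithm: it percent-encodes every non-ASCII character as UTF-8 and ignores the special handling that host names and query components would need. The example.org IRI is likewise made up for illustration.

```python
from urllib.parse import quote

# Reserved characters and '%' are left alone so that existing URI
# syntax and existing escapes are not re-encoded.
URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 (simplified)."""
    return quote(iri, safe=URI_SAFE, encoding="utf-8")


iri = "http://example.org/caf\u00e9"
uri = iri_to_uri(iri)

print(uri)
# Compared as plain strings, the two forms are different identifiers.
print(iri == uri)
```

If one attribute in a page yields the IRI form and another yields the percent-encoded form, a consumer comparing identifiers as strings would treat them as two different resources.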

This mismatch between HTML5 and RDF is a problem for people using RDFa 1.0 and 1.1 because the resolution of the URL attributes defined in HTML5 (@href, @src etc) differs from the resolution of the URL attributes defined by RDFa (@resource, @typeof etc). Specifically, HTML5 normalises IRIs to URIs and then resolves them according to RFC3986 (URI resolution), whereas RDFa resolves IRIs according to RFC3987 (IRI resolution) and then normalises to a URI. The two orders of operation might not produce the same results, though this has not been confirmed.

It is also a problem when people are using RDFa and microdata side-by-side (or switching between them) because the URIs they use within @itemid, @itemtype and @itemprop will not be handled in the same way as those within @about, @typeof and @property, resulting in slightly different data in the two cases.

These discrepancies will be worse when RDF 1.1 is standardised and used with HTML+RDFa, as at that point some of the identifiers generated from HTML+RDFa processing will be normalised to URIs (those in @href, @src etc attributes) while others will be IRIs (those in @resource, @typeof, @property etc attributes).