Support for RDFa in HTML4 and HTML5

W3C Working Draft 13 December 2012

Manu Sporny, Digital Bazaar, Inc.
Ben Adida, Creative Commons
Mark Birbeck
Shane McCarron, Applied Testing and Technology, Inc.
Steven Pemberton, CWI

This specification defines rules and guidelines for adapting the RDFa Core 1.1 and RDFa Lite 1.1 specifications for use in HTML5 and XHTML5. The rules defined in this specification not only apply to HTML5 documents in non-XML and XML mode, but also to HTML4 and XHTML documents interpreted through the HTML5 parsing rules.

1. Introduction

This section is non-normative.

Today's web is built predominantly for human readers. Even as machine-readable data begins to permeate the web, it is typically distributed in a separate file, with a separate format, and very limited correspondence between the human and machine versions. As a result, web browsers can provide only minimal assistance to humans in parsing and processing web pages: browsers only see presentation information. RDFa is intended to solve the problem of marking up machine-readable data in HTML documents. RDFa provides a set of HTML attributes to augment visual data with machine-readable hints. Using RDFa, authors may turn their existing human-visible text and links into machine-readable data without repeating content.

2. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words must, must not, required, should, should not, recommended, may, and optional in this specification are to be interpreted as described in [RFC2119].

2.1 Document Conformance

There are two types of document conformance criteria for HTML documents containing RDFa semantics: HTML+RDFa and HTML+RDFa Lite.

The following conformance criteria apply to any HTML document including RDFa markup:

An example of a conforming HTML+RDFa document:

<html lang="en">
    <title>Example Document</title>
    <p>This website is <a href="http://example.org/">example.org</a>.</p>

Non-XML mode HTML+RDFa 1.1 documents should be labeled with the Internet Media Type text/html as defined in section 12.1 of the HTML5 specification [HTML5].

XML mode XHTML5+RDFa 1.1 documents should be labeled with the Internet Media Type application/xhtml+xml as defined in section 12.3 of the HTML5 specification [HTML5], must not use a DOCTYPE declaration for XHTML+RDFa 1.0 or XHTML+RDFa 1.1, and should not use the version attribute.

2.2 RDFa Processor Conformance

The RDFa Processor conformance criteria are listed below, all of which are mandatory:

2.3 User Agent Conformance

A User Agent is considered to be a type of RDFa Processor when the User Agent stores or processes RDFa attributes and their values. RDFa Processor Conformance and User Agent Conformance are separate sections because one can be a valid HTML5 RDFa Processor but not a valid HTML5 User Agent (for example, by only providing a very small subset of rendering functionality).

The User Agent conformance criteria are listed below, all of which are mandatory:

3. Extensions to RDFa Core 1.1

The RDFa Core 1.1 [RDFA-CORE] specification is the base document on which this specification builds. RDFa Core 1.1 specifies the attributes and syntax, in Section 5: Attributes and Syntax, and processing model, in Section 7: Processing Model, for extracting RDF from a Web document. This section specifies changes to the attributes and processing model defined in RDFa Core 1.1 in order to support extracting RDF from HTML documents.

The requirements and rules, as specified in RDFa Core and further extended in this document, apply to all HTML5 documents. An RDFa Processor operating on both HTML and XHTML documents, specifically on their resulting DOMs or Infosets, must apply these processing rules for HTML4, HTML5 and XHTML5 serializations, DOMs and/or Infosets.

3.1 Additional RDFa Processing Rules

Documents conforming to the rules in this specification are processed according to [RDFA-CORE] with the following extensions:

The version attribute is not supported in HTML5 and is non-conforming. However, if an HTML+RDFa document contains the version attribute on the html element, a conforming RDFa Processor must examine the value of this attribute. If the value matches that of a defined version of RDFa, then the processing rules for that version must be used. If the value does not match a defined version, or there is no version attribute, then the processing rules for the most recent version of RDFa 1.1 must be used.
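The version rule above can be sketched as a small lookup. This is a minimal illustration, not a normative algorithm: the function name select_processing_rules and the set of version strings in DEFINED_VERSIONS are assumptions made for the example, and a real processor would use whatever version strings it actually recognises.

```python
# Hypothetical table of version strings this processor recognises, mapped to
# the RDFa processing rules to apply. The exact strings are an assumption.
DEFINED_VERSIONS = {
    "XHTML+RDFa 1.0": "1.0",
    "XHTML+RDFa 1.1": "1.1",
    "HTML+RDFa 1.1": "1.1",
}

# The most recent version of RDFa 1.1 known to this (hypothetical) processor.
MOST_RECENT = "1.1"


def select_processing_rules(version_attr):
    """Pick the processing rules from the html element's version attribute.

    version_attr is the attribute value, or None when the attribute is absent.
    """
    if version_attr is None:
        # No version attribute: use the most recent RDFa 1.1 rules.
        return MOST_RECENT
    # A value that matches a defined version selects that version's rules;
    # any other value falls back to the most recent RDFa 1.1 rules.
    return DEFINED_VERSIONS.get(version_attr, MOST_RECENT)
```

With this table, a document carrying version="XHTML+RDFa 1.0" is processed with the 1.0 rules, while a missing or unrecognised value selects the 1.1 rules.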

3.2 Modifying the Input Document

RDFa's tree-based processing rules, outlined in Section 7.5: Sequence of the RDFa Core 1.1 specification [RDFA-CORE], allow an input document to be automatically corrected, cleaned up, re-arranged, or modified in any way that is approved by the host language prior to processing. Element nesting issues in HTML documents should be corrected before the input document is translated into the DOM, a valid tree-based model, on which the RDFa processing rules will operate.

Any mechanism that generates a data structure equivalent to the HTML5 or XHTML5 DOM, such as the html5lib library, may be used as the mechanism to construct the tree-based model provided as input to the RDFa processing rules.

3.3 Specifying the language for a literal

RDFa Core 1.1 allows for the current language to be specified by the Host Language. In order for RDFa Processors to conform to this specification, they must use the mechanism described in The lang and xml:lang attributes section of the [HTML5] specification to determine the language of a node.

If an author is editing an HTML fragment and is unsure of the final encapsulating MIME type for the markup, it is suggested that the author specify both lang and xml:lang, with exactly the same value in both attributes.
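As a rough illustration of language determination, the sketch below walks from an element up through its ancestors and returns the first language declaration it finds, preferring xml:lang over lang on the same element. It is a simplification of the [HTML5] rules (for example, it ignores any document-level default language), it uses Python's xml.dom.minidom as a stand-in for a real HTML DOM, and the function name language_of is hypothetical.

```python
from xml.dom import minidom

XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"


def language_of(element):
    """Return the language in effect for element, or None if none is declared."""
    node = element
    # Walk the element and its ancestors; stop at the Document node.
    while node is not None and node.nodeType == minidom.Node.ELEMENT_NODE:
        # xml:lang (the lang attribute in the XML namespace) is checked first.
        if node.hasAttributeNS(XML_NAMESPACE, "lang"):
            return node.getAttributeNS(XML_NAMESPACE, "lang")
        # Otherwise fall back to the lang attribute in no namespace.
        if node.hasAttribute("lang"):
            return node.getAttribute("lang")
        node = node.parentNode
    return None
```

Specifying both attributes with the same value, as suggested above, means the result is the same whichever attribute a given toolchain happens to see.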

3.4 Invalid XMLLiteral values

When generating literals of type XMLLiteral, the processor must ensure that the output XMLLiteral is a namespace well-formed XML fragment. A namespace well-formed XML fragment has the following properties:

An RDFa Processor that transforms the XML fragment must use the Coercing an HTML DOM into an Infoset algorithm, as specified in the HTML5 specification, followed by the algorithm defined in the Serializing XHTML Fragments section of the HTML5 specification. If an error or exception occurs at any point during the transformation, the triple containing the XMLLiteral must not be generated.

Transformation to a namespace well-formed XML fragment is required because an application that consumes XMLLiteral data expects that data to be a namespace well-formed XML fragment.

The transformation requirement does not apply to input data that are text-only, such as literals that contain a datatype attribute with an empty value (""), or input data that contain only text nodes.
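The requirement that a failed transformation yields no triple can be sketched as a well-formedness check on the serialized fragment. The example below is only an approximation under stated assumptions: it relies on the namespace-aware parser behind Python's xml.dom.minidom, wraps the fragment in a hypothetical wrapper element so that fragments with several top-level nodes can be parsed, and does not implement the HTML5 coercion or serialization algorithms themselves. The function names are illustrative.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_namespace_well_formed(fragment):
    """Return True if fragment parses as namespace well-formed XML."""
    try:
        # The wrapper element is an artefact of this sketch; it lets a
        # fragment with multiple top-level nodes be parsed as one document.
        minidom.parseString("<wrapper>" + fragment + "</wrapper>")
    except ExpatError:
        # Unbalanced tags or undeclared namespace prefixes end up here.
        return False
    return True


def xmlliteral_or_none(fragment):
    """Return the fragment if it may be emitted as an XMLLiteral, else None.

    None signals that the triple containing the literal is not generated.
    """
    return fragment if is_namespace_well_formed(fragment) else None
```

A fragment that uses the fb prefix without declaring it is rejected, whereas the same fragment carrying an xmlns:fb declaration is accepted.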

An example transformation demonstrating the preservation of namespace values is provided below. The → symbol is used to denote that the line is a continuation of the previous line and is included purely for the purposes of readability:

<p xmlns:ex="http://example.org/vocab#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 Two rectangles (the example markup for them is stored in a triple):
 <svg xmlns="http://www.w3.org/2000/svg" property="ex:markup" datatype="rdf:XMLLiteral">
 →<rect width="300" height="100" style="fill:rgb(0,0,255);stroke-width:1; stroke:rgb(0,0,0)"></rect>
 →<rect width="50" height="50" style="fill:rgb(255,0,0);stroke-width:2;stroke:rgb(0,0,0)"></rect></svg>
</p>

The markup above should produce the following triple, which preserves the xmlns declaration in the markup by injecting the xmlns attribute in the rect elements:

<>
   <http://example.org/vocab#markup>
      "<rect xmlns=\"http://www.w3.org/2000/svg\" width=\"300\" 
→height=\"100\" style=\"fill:rgb(0,0,255);stroke-width:1; stroke:rgb(0,0,0)\"/>
→<rect xmlns=\"http://www.w3.org/2000/svg\" width=\"50\" 
→height=\"50\" style=\"fill:rgb(255,0,0);stroke-width:2;
→stroke:rgb(0,0,0)\"/>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral> .

Since the ex and rdf namespaces are not used in either rect element, they are not preserved in the XMLLiteral.

Similarly, compound document elements that reside in different namespaces must have their namespace declarations preserved:

<p xmlns:ex="http://example.org/vocab#"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:fb="http://www.facebook.com/2008/fbml">
 This is how you markup a user in FBML:
 <span property="ex:markup" datatype="rdf:XMLLiteral">
→<p><fb:user uid="12345">The User</fb:user></p></span>
</p>

The markup above should produce the following triple, which preserves the fb namespace in the corresponding triple:

<>
   <http://example.org/vocab#markup>
      "<p xmlns:fb=\"http://www.facebook.com/2008/fbml\">
→<fb:user uid=\"12345\">The User</fb:user>
→</p>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral> .

4. Extensions to the HTML5 Syntax

There are a few attributes that are added as extensions to the HTML5 syntax in order to fully support RDFa:

5. Backwards Compatibility

RDFa Core 1.1 deprecates the usage of xmlns: in RDFa 1.1 documents. Web page authors should not use xmlns: to express prefix mappings in RDFa 1.1 documents. Web page authors should use the prefix attribute to specify prefix mappings.

However, there are times when XHTML+RDFa 1.0 documents are served by web servers using the text/html MIME type. In these instances, the HTML5 specification asserts that the document is processed according to the non-XML mode HTML5 processing rules. In these particular cases, it is important that the prefixes declared via xmlns: are preserved for the RDFa processors to ensure backwards compatibility with RDFa 1.0 documents. The following sections describe the backwards-compatibility requirements for RDFa processor implementations.

5.1 xmlns:-Prefixed Attributes

The RDFa Core 1.1 [RDFA-CORE] specification effectively deprecates the use of the xmlns: mechanism to declare CURIE prefix mappings in favor of the prefix attribute. While utilizing xmlns: is now frowned upon, there are instances where it is unavoidable - such as publishing legacy documents as HTML5 or supporting older XHTML+RDFa 1.0 documents that rely on the xmlns: attribute.

CURIE prefix mappings specified using attributes prepended with xmlns: must be processed using the algorithm defined in section 5.4.1: Extracting URI Mappings from Infosets for Infoset-based processors, or section 5.5.1: Extracting URI Mappings via DOM Level 2 for DOM Level 2-based processors. For CURIE prefix mappings using the prefix attribute, [RDFA-CORE] Section 7.5: Sequence, step 3 must be used to process namespace values.

Since CURIE prefix mappings have been specified using xmlns:, and since HTML attribute names are case-insensitive, CURIE prefix names declared using the xmlns:attribute-name pattern xmlns:<PREFIX>="<URI>" should be specified using only lower-case characters. For example, the text "xmlns:" and the text in "<PREFIX>" should be lower-case only. This is to ensure that prefix mappings are interpreted in the same way between HTML (case-insensitive attribute names) and XHTML (case-sensitive attribute names) document types.

5.2 Conformance Criteria for xmlns:-Prefixed Attributes

Since RDFa 1.0 documents may contain attributes starting with xmlns: to specify CURIE prefixes, any attribute starting with a case-insensitive match on the text string "xmlns:" must be preserved in the DOM or other tree-like model that is passed to the RDFa Processor. For documents conforming to this specification, attributes with names that have a case-insensitive prefix matching "xmlns:" must be considered conforming. Conformance checkers should accept attribute names that have a case-insensitive prefix matching "xmlns:" as conforming. Conformance checkers should generate warnings noting that the use of xmlns: is deprecated. Conformance checkers may report the use of xmlns: as an error.

All attributes starting with a case-insensitive prefix matching "xmlns:" must conform to the production rules outlined in Namespaces in XML [XML-NAMES11], Section 3: Declaring Namespaces. Documents that contain xmlns: attributes that do not conform to Namespaces in XML must not be accepted as conforming.
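A conformance checker following the criteria above might treat each attribute name roughly as sketched here. The sketch is illustrative only: check_xmlns_attribute is a hypothetical name, the message texts are invented for the example, and the regular expression is an ASCII-only approximation of the NCName production, which in Namespaces in XML admits many more characters.

```python
import re

# Simplified, ASCII-only stand-in for the NCName production. This is an
# assumption of the sketch, not the full production from Namespaces in XML.
_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def check_xmlns_attribute(name):
    """Return (conforming, messages) for a single attribute name."""
    # Attributes that do not start with "xmlns:" (in any case) are out of scope.
    if not name.lower().startswith("xmlns:"):
        return True, []

    declared_prefix = name[len("xmlns:"):]
    if not _NCNAME.match(declared_prefix):
        # The declaration does not follow Namespaces in XML: not conforming.
        return False, ["error: malformed namespace declaration " + name]

    # Well-formed declarations are conforming, but their use is deprecated.
    messages = ["warning: xmlns: is deprecated; use the prefix attribute"]
    if name != name.lower():
        messages.append("warning: prefix declarations should be lower-case")
    return True, messages
```

This version only warns about deprecated usage; a checker that exercises the option to report xmlns: as an error would return a different status in the second branch.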

5.3 Preserving Namespaces via Coercion to Infoset

This section needs feedback from the user agent vendors to ensure that this feature does not conflict with user agent architecture and that there is no technical reason it cannot be implemented.

Because RDFa 1.0 documents may contain the xmlns: pattern to declare prefix mappings, it is important that namespace information declared in non-XML mode HTML5 documents is mapped to an Infoset correctly. In order to ensure this mapping is performed correctly, the "Coercing an HTML DOM into an infoset" rules defined in [HTML5] must be extended to include the following rule:

If the XML API is namespace-aware, the tool must ensure that ([namespace name], [local name], [normalized value]) namespace tuples are created when converting the non-XML mode DOM into an Infoset. Given a standard xmlns: definition, xmlns:foo="http://example.org/bar#", the [namespace name] is http://www.w3.org/2000/xmlns/, the [local name] is foo, and the [normalized value] is http://example.org/bar#, thus the namespace tuple would be (http://www.w3.org/2000/xmlns/, foo, http://example.org/bar#).

For example, given the following input text:

<div xmlns:com="http://purl.org/commerce#">

The div element above, when coerced from an HTML DOM into an Infoset, should contain an attribute in the [namespace attributes] list with a [namespace name] set to "http://www.w3.org/2000/xmlns/", a [local name] set to com, and a [normalized value] of "http://purl.org/commerce#".
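The construction of a namespace tuple from a single attribute can be sketched as follows. The function name namespace_tuple is hypothetical, and the sketch covers only the xmlns:-prefixed case described above, not default namespace declarations.

```python
XMLNS_NAMESPACE = "http://www.w3.org/2000/xmlns/"


def namespace_tuple(attr_name, attr_value):
    """Build ([namespace name], [local name], [normalized value]).

    Returns None for attributes that are not xmlns:-prefixed declarations.
    """
    # The match on "xmlns:" is case-insensitive, as in section 5.2.
    if not attr_name.lower().startswith("xmlns:"):
        return None
    # Everything after "xmlns:" is the [local name] of the declaration.
    local_name = attr_name[len("xmlns:"):]
    return (XMLNS_NAMESPACE, local_name, attr_value)
```

Applied to the div element above, the attribute xmlns:com yields the tuple (http://www.w3.org/2000/xmlns/, com, http://purl.org/commerce#).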

5.4 Infoset-based Processors

While the intent of the RDFa processing instructions is to provide a set of rules that are as language and toolchain agnostic as possible, for the sake of clarity, detailed methods of extracting RDFa content from processors operating on an XML Information Set are provided below.

5.4.1 Extracting URI Mappings from Infosets

Extracting URI Mappings declared via xmlns: while operating from within an Infoset-based RDFa processor can be achieved using the following algorithm:

While processing an element as described in [RDFA-CORE], Section 7.5: Sequence, Step #2:

  1. For each attribute in the [namespace attributes] list that has a [prefix] value, create an [IRI mapping] by storing the [local name] as the value to be mapped, and the [normalized value] as the value to map.
  2. For each attribute in the [attributes] list that has no value for [prefix] and a [local name] that starts with xmlns:, create an [IRI mapping] by storing the [local name] part with the xmlns: characters removed as the value to be mapped, and the [normalized value] as the value to map.

    This step is unnecessary if the Infoset coercion rules preserve namespaces specified in non-XML mode.

For example, assume that the following markup is processed by an Infoset-based RDFa processor:

<div xmlns:audio="http://purl.org/media/audio#" ...

After the markup is processed, there should exist an [IRI mapping] in the [local list of IRI mappings] that contains a mapping from audio to http://purl.org/media/audio#.
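The two steps of the Infoset algorithm can be sketched with a minimal in-memory model. The AttributeItem and ElementItem classes are hypothetical stand-ins for Infoset information items, not part of any real Infoset API. In the XML Information Set, a declaration such as xmlns:audio has a [prefix] of xmlns and a [local name] of audio, so the sketch keys the mapping on the [local name].

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AttributeItem:
    """Hypothetical stand-in for an Attribute Information Item."""
    local_name: str
    normalized_value: str
    prefix: Optional[str] = None


@dataclass
class ElementItem:
    """Hypothetical stand-in for an Element Information Item."""
    namespace_attributes: List[AttributeItem] = field(default_factory=list)
    attributes: List[AttributeItem] = field(default_factory=list)


def extract_iri_mappings(element):
    """Return a dict of prefix name -> IRI for one element."""
    mappings = {}
    # Step 1: namespace declarations preserved as [namespace attributes].
    for attr in element.namespace_attributes:
        if attr.prefix:
            mappings[attr.local_name] = attr.normalized_value
    # Step 2: plain attributes whose [local name] still carries "xmlns:",
    # as happens when non-XML mode parsing does not preserve namespaces.
    for attr in element.attributes:
        if attr.prefix is None and attr.local_name.startswith("xmlns:"):
            name = attr.local_name[len("xmlns:"):]
            mappings[name] = attr.normalized_value
    return mappings
```

Either representation of the example markup, as a namespace attribute or as a plain attribute named xmlns:audio, produces a mapping from audio to http://purl.org/media/audio#.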

5.4.2 Processing RDFa Attributes

There are a number of non-prefixed attributes that are associated with RDFa Processing in HTML5. If an XML Information Set based RDFa processor is used to process these attributes, the following algorithm should be used to detect and extract the values of the attributes.

While processing Infoset Attribute Information Items in Element Information Items as described in [RDFA-CORE], Section 7.5: Sequence, Step #4 through Step #9:

  1. For each Attribute Information Item specific to RDFa in the Infoset [attributes] list that has a [prefix] with no value, extract and use the [normalized value].

5.5 DOM Level 1 and Level 2-based Processors

This mechanism should be double-checked against all of the RDFa JavaScript implementations to ensure correctness.

Most DOM-aware RDFa Processors are capable of accessing DOM Level 1 [DOM-LEVEL-1] methods to process attributes on elements. To discover all xmlns:-specified CURIE prefix mappings, the Node.attributes NamedNodeMap can be iterated over. Each Attr.name that starts with the text string xmlns: specifies a CURIE prefix mapping. The value to be mapped is the string after the xmlns: substring in Attr.name, and the value to map is the value of Attr.value.
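The DOM Level 1 approach amounts to a single pass over the attribute list, in which the text after xmlns: in Attr.name becomes the prefix name and Attr.value becomes the IRI it maps to. The sketch below uses Python's xml.dom.minidom purely as a convenient DOM implementation; the function name prefix_mappings_dom1 is hypothetical.

```python
from xml.dom import minidom


def prefix_mappings_dom1(element):
    """Collect xmlns:-declared prefix mappings using only name and value."""
    mappings = {}
    attrs = element.attributes  # the NamedNodeMap of this element
    for i in range(attrs.length):
        attr = attrs.item(i)
        if attr.name.startswith("xmlns:"):
            # Text after "xmlns:" is the prefix; the attribute value is the IRI.
            mappings[attr.name[len("xmlns:"):]] = attr.value
    return mappings
```

For an element such as a div carrying xmlns:com="http://purl.org/commerce#" alongside ordinary attributes, only the com mapping is collected.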

The intent of the RDFa processing instructions is to provide a set of rules that are as language and toolchain agnostic as possible. If a developer chooses not to use the DOM1 environment mechanism outlined in the previous paragraph, they may use the following DOM2 [DOM-LEVEL-2-CORE] environment mechanism.

5.5.1 Extracting URI Mappings via DOM Level 2

Extracting URI Mappings declared via xmlns: while operating from within a DOM Level 2 based RDFa processor can be achieved using the following algorithm:

While processing each DOM2 [Element] as described in [RDFA-CORE], Section 7.5: Sequence, Step #2:

  1. For each [Attr] in the [Node.attributes] list that has a [namespace prefix] value of xmlns, create an [IRI mapping] by storing the [local name] as the value to be mapped, and the [Node.nodeValue] as the value to map.
  2. For each [Attr] in the [Node.attributes] list that has a [namespace prefix] value of null and a [local name] that starts with xmlns:, create an [IRI mapping] by storing the [local name] part with the xmlns: characters removed as the value to be mapped, and the [Node.nodeValue] as the value to map.

    This step is unnecessary if the XML and non-XML mode DOMs are namespace consistent.

For example, assume that the following markup is processed by a DOM2-based RDFa processor:

<div xmlns:com="http://purl.org/commerce#" ...

After the markup is processed, there should exist an [IRI mapping] in the [local list of IRI mappings] that contains a mapping from com to http://purl.org/commerce#.
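The DOM Level 2 algorithm can be sketched against Python's xml.dom.minidom, used here only as an available DOM2-style implementation. One adaptation is specific to this sketch: for attributes created without namespace information, minidom reports a localName with the text before the colon already stripped, so the second branch inspects Attr.name instead of the [local name]. The function name extract_mappings_dom2 is hypothetical.

```python
from xml.dom import minidom


def extract_mappings_dom2(element):
    """Collect xmlns:-declared prefix mappings from a DOM2-style element."""
    mappings = {}
    attrs = element.attributes
    for i in range(attrs.length):
        attr = attrs.item(i)
        if attr.prefix == "xmlns":
            # Step 1: a namespace-aware declaration; localName is the prefix.
            mappings[attr.localName] = attr.nodeValue
        elif attr.prefix is None and attr.name.startswith("xmlns:"):
            # Step 2: the DOM holds no namespace information for the attribute
            # (typical of non-XML mode parsing), so the name is split by hand.
            mappings[attr.name[len("xmlns:"):]] = attr.nodeValue
    return mappings
```

Both a namespace-aware parse of the example markup and an element whose xmlns:com attribute was set without namespace information yield the same mapping from com to http://purl.org/commerce#.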

5.5.2 Processing RDFa Attributes

There are a number of non-prefixed attributes that are associated with RDFa processing in HTML5. If a DOM2-based RDFa processor is used to process these attributes, the following algorithm should be used to detect and extract the values of the attributes.

While processing an element as described in [RDFA-CORE], Section 7.5: Sequence, Step #3 through Step #9:

  1. For each RDFa attribute in the [Node.attributes] list that has a [namespace prefix] that is null, extract and use [Node.nodeValue] as the value.
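The extraction step above can be sketched as a filter over the attribute list. The attribute names in RDFA_ATTRIBUTES are taken from RDFa Core 1.1 and are listed here as an assumption of the sketch rather than as a definitive set for HTML+RDFa. The function name rdfa_attribute_values is hypothetical, and xml.dom.minidom again stands in for a DOM2 environment.

```python
from xml.dom import minidom

# Attribute names assumed for illustration, following RDFa Core 1.1.
RDFA_ATTRIBUTES = frozenset([
    "about", "content", "datatype", "href", "inlist", "prefix", "property",
    "rel", "resource", "rev", "src", "typeof", "vocab",
])


def rdfa_attribute_values(element):
    """Return {attribute name: value} for non-prefixed RDFa attributes."""
    values = {}
    attrs = element.attributes
    for i in range(attrs.length):
        attr = attrs.item(i)
        # Only attributes with a null namespace prefix are considered.
        if attr.prefix is None and attr.localName in RDFA_ATTRIBUTES:
            values[attr.localName] = attr.nodeValue
    return values
```

Namespace declarations and unrelated attributes such as class are skipped, leaving only the RDFa attributes and their values.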

When extracting values from href, src and data, Web authors and developers should note that certain values may be transformed if accessed via the DOM versus a non-DOM processor. The rules for modification of URL values can be found in the main HTML5 specification under Section 2.6.2: Parsing URLs.
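One common instance of such a transformation is the resolution of a relative URL against the document's base URL: a DOM property typically exposes the resolved value, whereas a non-DOM processor sees the attribute text as written. The sketch below illustrates the difference with Python's urllib.parse.urljoin. It is not the HTML5 URL parsing algorithm, and the function name and the example URLs are assumptions made for the illustration.

```python
from urllib.parse import urljoin


def resolved_value(base_url, raw_attribute_value):
    """Resolve an attribute value against a base URL, as a DOM might."""
    return urljoin(base_url, raw_attribute_value)


# Hypothetical document location and raw href value.
base = "http://example.org/docs/page.html"
raw_href = "../images/logo.png"

# A non-DOM processor would see raw_href unchanged; a DOM-based processor
# reading the resolved property would see the absolute URL instead.
absolute_href = resolved_value(base, raw_href)
```

A processor that compares or emits such values needs to apply one of the two forms consistently, since the raw and resolved strings differ.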

