Choosing an HTML Data Format

From W3C Wiki
Revision as of 20:50, 27 November 2011 by Jenit (Talk | contribs)

This page gives the recommendations of the HTML Data Task Force (TF) on choosing a data format: which format to use to embed data within your HTML pages if you are a data publisher, and which format to target for consumption if you are a consumer.

Publishers

You are likely to find that the markup within your pages is simpler and easier to maintain if you only use one format (syntax and vocabulary) within each page. To decide which to use, your first consideration has to be which consumers will read the data within your web pages, and which formats they support. These may include:

  • scripting libraries
  • browsers and browser plug-ins
  • general-purpose search engines
  • vertical or domain-specific search engines
  • data reusers with whom you have agreements

Your second consideration may be the current state of the tooling to support a particular format. For example:

Are you able to publish using HTML5?
If you are using a content-management system that doesn't support adding new attributes such as @itemprop or @typeof or if your publishing guidelines require validity against an older version of HTML, then you will be constrained to using microformats. If your publishing guidelines require validity against XHTML, then you might be able to use XHTML+RDFa, depending on how precise your publishing guidelines are.
Are there development tools available?
Because HTML data is not visible within a web page, it can be hard to tell whether it has been written correctly. Consumers should provide validators that enable you to check that your data has been correctly detected and interpreted, but you may also want to consider tool support for generating the HTML data.

Once you have considered both your target consumers and the tooling support that is available, you will be in one of four situations:

  1. with a single choice of format, in which case you are good to go
  2. unable to publish HTML data that your target consumers understand, in which case you either have to lobby those consumers to add support for the format(s) you can publish in, or consider changing your toolset so that you can publish in something they understand
  3. still with a choice between a number of formats, in which case you will want to pick one (see below)
  4. having to use multiple formats at the same time to provide data to all your target consumers, in which case you will need to mix formats within your pages (see below)

Choosing a Publishing Format

This section addresses a situation where all your target consumers recognise a set of formats (each with a particular syntax and vocabulary), your toolset supports publishing in all of them, and you need to make a choice about which of these formats to use. It's assumed that you will want to choose a single format rather than mixing multiple formats, as this will mean less markup in your page and make your publishing task easier.

Syntax Considerations

The different syntaxes -- microformats, microdata and RDFa -- have different capabilities which may inform your choice.

Structured HTML values
Under appropriate conditions, RDFa and microformats will use markup within the content of an element to provide a property value; in microdata, values never retain markup. If property values within your page contain markup (for example descriptions containing emphasised text, multiple paragraphs, tables and so on), you may want to use RDFa or microformats to ensure that structure is available to consumers of your pages. In RDFa, this is done by adding datatype="rdf:XMLLiteral" to the relevant element. In microformats, the handling of the content of an element is determined by the property; in microformats-2, those that retain the HTML structure are named with an e-* prefix, such as e-content.
Language support
Microformats and RDFa use the language of the HTML elements in the page (from the lang attribute) to indicate the language of relevant values. In microdata, the vocabulary has to provide a separate mechanism to indicate a language (pending resolution of bug 14470). If you have multi-lingual information in your pages, you may find it easier to use microformats or RDFa than microdata.
CSS support
Because microformats generally use classes to mark up data within an HTML page, it is easy to use CSS to style those elements based on their type. For example, .vcard .n { font-weight: bold; } will make any person's name bold (vcard is the root class name of the hCard microformat). This is a little harder with microdata (where the selector might be something like [itemtype~="http://microformats.org/profile/hcard"] [itemprop~="n"]) or RDFa (where it might be [typeof~="foaf:Person"] [property~="foaf:name"]). If you are planning to style your page based on the data embedded within it, you may find it easier to use microformats than either microdata or RDFa; if you do style RDFa, you should plan for dependencies between your CSS documents and any prefixes used within the page.
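The difference in how markup inside a property value is treated can be illustrated with a small Python sketch. It is not a microdata, RDFa or microformats parser: the ContentCollector class and the collect function are hypothetical names, and the sketch assumes a well-formed fragment. It only shows the two kinds of value a consumer might receive for the same element content: the text-only value that microdata yields, and a markup-retaining value of the kind that rdf:XMLLiteral or an e-* property provides.

```python
from html import escape
from html.parser import HTMLParser


class ContentCollector(HTMLParser):
    """Collects an HTML fragment both as plain text and as markup.

    Hypothetical helper for illustration; assumes a well-formed fragment.
    """

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.markup_parts = []

    def handle_starttag(self, tag, attrs):
        # Keep the start tag exactly as it was written in the source.
        self.markup_parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        self.markup_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.markup_parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.text_parts.append(data)
        # Character references were decoded by the parser; re-escape them.
        self.markup_parts.append(escape(data, quote=False))


def collect(fragment):
    """Returns (text-only value, markup-retaining value) for a fragment."""
    collector = ContentCollector()
    collector.feed(fragment)
    collector.close()
    return "".join(collector.text_parts), "".join(collector.markup_parts)


text_value, markup_value = collect("A <em>very</em> good film")
print(text_value)    # text only, as a microdata consumer would see it
print(markup_value)  # markup retained, as with rdf:XMLLiteral or e-content
```

If the emphasis in a description matters to your consumers, the first value is not enough, which is the case in which RDFa or microformats are the better fit.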

TODO: Other guidelines?

Vocabulary Considerations

Vocabularies and syntaxes are closely tied together, especially in the case of microformats. Aspects of a vocabulary to bear in mind are:

  • How closely does it match with the information that you have?
  • How much support does it have? Are there tools for validating and viewing it? Is there good documentation?
  • How stable is it? Who has control to make changes to it? How frequently might those changes be made?
  • Are other consumers likely to adopt it in the future?

Usability Considerations

The usability of a particular format is likely to depend on your existing expertise and the match between the structure and content of your web pages and the required structure and content of the format. The best thing to do is to try using the format to mark up an example page from your site.

TODO: Example?

Publishing in Multiple Formats

Publishing in multiple formats can be easy. For example, it may be that different consumers expect HTML data to appear in different places within the page, such as Facebook requiring Open Graph Protocol data to appear within the head of an HTML page, while schema.org markup appears in the body of the page. Or it may be that the items that you need to mark up on the page appear in different places -- events listed in a sidebar while company details are provided in a footer, for example.

Different formats and vocabularies can be used independently in these circumstances. Consumers of the data within your pages might read additional data if it is in a syntax that they recognise (for example, a processor that recognises both RDFa and microdata will interpret all such markup in the page), but they should ignore information that is in a vocabulary that they do not understand rather than reporting an error.

Publishing can be harder when there are multiple consumers of information that require different formats. If your target consumers will all accept the same syntax, it is usually easiest to use that single syntax in your pages. However, microdata does not support multiple types for a single entity, so if your target consumers expect different vocabularies to be used for the same entities you may find it easier to mix syntaxes or to use RDFa or microformats, which do support multiple vocabularies.

Further techniques for mixing different syntaxes and vocabularies within a page are provided on a separate page.

Good Publishing Practice

Valid HTML is particularly important in pages that contain embedded markup. All methods of embedding data within HTML use the structure of the HTML to determine the meaning of the additional markup. For example, the item to which an element with an @itemprop attribute assigns a property is usually the closest ancestor element with an @itemscope attribute.
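The dependence on document structure can be sketched in Python with the standard html.parser module. This is an illustration under stated assumptions rather than a conforming microdata processor: the ItemScopeTracker class and property_owners function are hypothetical names, @itemref is ignored, and the input is assumed to be well-formed HTML in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ItemScopeTracker(HTMLParser):
    """Records, for each itemprop, the itemscope element that owns it."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one entry per open element: item number or None
        self.items = 0     # number of itemscope elements seen so far
        self.owners = []   # (property name, owning item number) pairs

    def current_item(self):
        # The closest open ancestor that carries itemscope, if any.
        for item_id in reversed(self.stack):
            if item_id is not None:
                return item_id
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element's own itemprop belongs to its ancestor's item,
        # so it is recorded before this element is pushed.
        for name in (attrs.get("itemprop") or "").split():
            self.owners.append((name, self.current_item()))
        item_id = None
        if "itemscope" in attrs:
            self.items += 1
            item_id = self.items
        if tag not in VOID:
            self.stack.append(item_id)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()


def property_owners(html):
    tracker = ItemScopeTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.owners


PAGE = """
<div itemscope itemtype="http://schema.org/Event">
  <span itemprop="name">Launch party</span>
  <div itemprop="location" itemscope itemtype="http://schema.org/Place">
    <span itemprop="name">Town hall</span>
  </div>
</div>
"""
print(property_owners(PAGE))
```

In this illustrative page the first name and the location belong to item 1 (the event), while the second name belongs to item 2 (the place). If a parser moved either span, the same attributes would describe a different item.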

In some cases, elements can be moved when HTML is parsed into a DOM. This can lead to properties unexpectedly referring to the wrong entity, and, if you are serving your documents as XHTML (with an application/xhtml+xml MIME type), it can cause discrepancies between the data gleaned by XML-based consumers and HTML-aware consumers. There are two causes for this:

  • Error correction in HTML parsing can restructure invalid HTML to make it valid; for example, non-table markup within a table is moved to before the table. This includes link and meta elements that are directly within the table element. You can avoid this restructuring by making sure that your HTML is valid, so that no error correction is needed.
  • Some older browsers may move meta and/or link elements in the body of an HTML document to within the head element, because they could not validly appear within the body in older versions of HTML. If you are targeting consumers which run within older browsers, such as scripts or plug-ins, you can avoid this restructuring by using empty span or other elements instead of link or meta; other consumers should be using an up-to-date HTML5 parser which will not do this.
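One of the cases above, link and meta elements placed directly within a table element, is easy to check for before publishing. The following Python sketch uses the standard html.parser module; the TableMetaLinter class and misplaced_in_table function are hypothetical names, and the check covers only this one pattern, not HTML validity in general. It assumes that every non-void element in the input is explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are never pushed as parents.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TableMetaLinter(HTMLParser):
    """Flags link and meta elements whose parent element is a table."""

    def __init__(self):
        super().__init__()
        self.open_elements = []   # names of the currently open elements
        self.problems = []        # (tag name, line number) pairs

    def handle_starttag(self, tag, attrs):
        parent = self.open_elements[-1] if self.open_elements else None
        if tag in ("link", "meta") and parent == "table":
            line, _column = self.getpos()
            self.problems.append((tag, line))
        if tag not in VOID:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag not in VOID and self.open_elements:
            self.open_elements.pop()


def misplaced_in_table(html):
    linter = TableMetaLinter()
    linter.feed(html)
    linter.close()
    return linter.problems


SAMPLE = """<table>
<meta itemprop="currency" content="GBP">
<tr><td itemprop="price">10</td></tr>
</table>"""
for tag, line in misplaced_in_table(SAMPLE):
    print(f"{tag} element directly inside table (line {line})")
```

A check like this complements, rather than replaces, running the page through an HTML validator.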

It is good practice to test the data that you expose within your page against a parser that will show you the data your page contains. Existing online parsers include:

microdata
  • Live Microdata (maps to JSON, vCard and iCal)
  • RDF Distiller (maps to various RDF-based formats)
  • any23.org
RDFa
  • RDF Distiller
  • check.rdfa.info
  • Python RDFa 1.0 Distiller, or its Experimental RDFa 1.1 version
  • any23.org
microformats
  • see below for parsers for specific microformats

TODO: add more

It is good practice to test the data that you expose using a tool that understands the vocabulary you are using. Consumers may provide testing tools and validators for this purpose, or you may need to check the way that vocabulary-specific tools behave with your data. Example vocabulary-aware testing tools and validators include:

hCalendar
  • Google's Rich Snippets Testing Tool
  • see also hCalendar implementations
hCard
  • hCard microformat Validator
  • Google's Rich Snippets Testing Tool
  • see also hCard implementations
hReview
  • Google's Rich Snippets Testing Tool
  • see also hReview implementations
Open Graph Protocol
  • check.rdfa.info
schema.org
  • Google's Rich Snippets Testing Tool
  • check.rdfa.info
  • Structured Data Linter
vCard
  • Live Microdata
vEvent
  • Live Microdata

The goal of publishing HTML data is to enable consumers to reuse it. To make it clear how the HTML data you publish can be reused, you should include information about the rights holder and the license under which the information is made available. There are a number of vocabularies that enable you to do this, such as schema.org, rel-license, Creative Commons and Dublin Core. Your target consumers should indicate which formats they understand when it comes to expressing licensing information and which licenses they know about, and you should choose a relevant format in the same way as you do for the core data that you are publishing.

TODO: add more

Consumers

You will find it easier to consume and combine data published using a single format (syntax and vocabulary). To decide which to consume, you should first look at what formats your target publishers are currently using. It may be that these contain sufficient information for your application.

If the publishers whom you are targeting are already publishing using multiple formats, you may want to consume from all those formats in order to maximise the data that you can collect while minimising the impact on the publishers who are providing that information. If you are consuming microdata and storing the results as RDF, you should follow a standard mapping.
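To give a feel for what turning microdata into RDF involves, here is a deliberately simplified Python sketch. It is not the standard mapping and should not be used in place of one: the function names are hypothetical, the input is assumed to be the JSON-like structure of items with "type", "id" and "properties" keys, property names that are not absolute URLs are simply appended to a base derived from the item's first type, and no distinction is made between literal and URL values.

```python
from itertools import count

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def vocabulary_base(type_url):
    """Returns the type URL up to and including its last '/' or '#'."""
    cut = max(type_url.rfind("/"), type_url.rfind("#"))
    return type_url[:cut + 1]


def item_to_triples(item, ids=None):
    """Returns (subject, triples) for one item and the items nested in it."""
    ids = ids if ids is not None else count(1)
    # Items without an identifier get a blank node label.
    subject = item.get("id") or f"_:b{next(ids)}"
    types = item.get("type", [])
    triples = [(subject, RDF_TYPE, t) for t in types]
    # Simplification: untyped items leave short property names unexpanded.
    base = vocabulary_base(types[0]) if types else ""
    for name, values in item.get("properties", {}).items():
        # Property names that are already absolute URLs are used unchanged.
        predicate = name if "://" in name else base + name
        for value in values:
            if isinstance(value, dict):
                nested_subject, nested = item_to_triples(value, ids)
                triples.append((subject, predicate, nested_subject))
                triples.extend(nested)
            else:
                triples.append((subject, predicate, value))
    return subject, triples


person = {
    "type": ["http://schema.org/Person"],
    "properties": {
        "name": ["Alice"],
        "worksFor": [{
            "type": ["http://schema.org/Organization"],
            "properties": {"name": ["W3C"]},
        }],
    },
}
for triple in item_to_triples(person)[1]:
    print(triple)
```

Even this small sketch has to make choices (how to name predicates, how to label items without identifiers), which is why following a shared, standard mapping matters when data from several consumers is to be combined.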

If current formats do not encode the information you need to the detail you need it for your application, publishers will be more likely to publish extra data for you to consume if you:

If you cannot simply extend an existing vocabulary, you will need to create your own vocabulary and choose which syntaxes to support with that vocabulary.

Choosing a Syntax to Consume

As you choose a syntax, you should take into account the following considerations.

Tooling Considerations

Applications vary widely in terms of the tooling that they need. A script that runs in a publisher's page needs easy access to data through a DOM API. A crawler that creates a store of data from a set of distributed pages requires a server-side parser and good storage and querying support.

As a consumer, you will be led by the requirements you have for your application and the experience that you have with different technology sets. It's important, however, to also consider the experience and capabilities of the publishers that are providing you with data, and which formats they will find easy to publish given their tooling. You should also consider the ease with which you can provide support tools for the format, such as validators or previewers that make it easy for publishers to tell whether they have published data correctly within their pages.

There are several specifications that can be used to provide standard mechanisms for accessing, manipulating, querying and validating data gleaned from HTML pages. However, you should check what has been implemented in your environment: it may be that there isn't an implementation that follows a standard, but there is one that provides its own API which enables you to do what you need to do.

microdata/microformats-2 data model

Microdata and microformats-2 can be mapped to the same basic (JSON) data model. Processing JSON into native programming structures, in JavaScript and other languages, is usually very easy. Vocabularies are usually described in specification prose rather than a formal language.

  • microdata DOM API — part of microdata specification (W3C Last Call Working Draft)
  • JSON Schema — schema language for JSON (IETF Internet Draft)
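As an illustration of how little code the JSON data model needs, the following Python sketch loads a document in the shape produced by the microdata specification's JSON conversion (an "items" list whose entries have "type" and "properties" keys) and walks the tree of items. The sample data and the walk function are made up for this example.

```python
import json

# Illustrative data only; not taken from a real page.
MICRODATA_JSON = """
{
  "items": [
    {
      "type": ["http://schema.org/Event"],
      "properties": {
        "name": ["Launch party"],
        "location": [
          {
            "type": ["http://schema.org/Place"],
            "properties": {"name": ["Town hall"]}
          }
        ]
      }
    }
  ]
}
"""


def walk(item, path=()):
    """Yields (path, value) for every string value in an item tree."""
    for name, values in item.get("properties", {}).items():
        for value in values:
            if isinstance(value, dict):
                # A nested item: descend, extending the property path.
                yield from walk(value, path + (name,))
            else:
                yield path + (name,), value


data = json.loads(MICRODATA_JSON)
for item in data["items"]:
    for path, value in walk(item):
        print(".".join(path), "=", value)
```

Every property holds a list of values, even when there is only one, so consumer code should iterate rather than index directly.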

RDF data model

RDFa processors extract an RDF data model and processors can also generate RDF from microdata. There are a number of standards for formally expressing RDF vocabularies and querying RDF, and drafts in progress for DOM-based manipulation of RDFa content.

  • RDFa API — W3C Working Draft
  • JSON-LD — JSON representation of RDF (Unofficial Draft)
  • SPARQL — query language for RDF (W3C Recommendation)
  • SPARQL 1.1 — W3C Working Draft
  • RDFS — vocabulary description language for RDF (W3C Recommendation)
  • OWL — ontology language for RDF (W3C Recommendation)

Data Model Considerations

Microdata uses a JSON-based data model of a tree of objects which may be identified through a URI, with properties whose values are strings. microformats-2 uses a similar JSON-based data model of a tree of objects, but they do not have identifiers and their property values may be strings, URLs, date/times or structured HTML values. RDFa uses RDF as its data model, which is a graph of objects identified by URLs with properties whose values may be other objects, lists or literal values which can be tagged with a language or any datatype. These different models have different capabilities.

Structured HTML values
Under appropriate conditions, RDFa and microformats will use markup within the content of an element to provide a property value; in microdata, values never retain markup. If you wish to consume data that may contain markup (be it structures such as multiple paragraphs, list items and tables, or inline markup such as emphases, links or ruby markup), you will need publishers to use RDFa or microformats to mark up that data. In RDFa, this is done by publishers adding datatype="rdf:XMLLiteral" to elements whose markup should be preserved. In microformats, the handling of the content of an element is determined by the property; in microformats-2, those that retain the HTML structure are named with an e-* prefix, such as e-content.
Language support
Microformats and RDFa use the language of the HTML elements in the page (from the lang attribute) to indicate the language of relevant values. In microdata, the vocabulary has to provide a separate mechanism to indicate a language (pending resolution of bug 14470). If you are consuming information about the same things from pages that use different languages, or anticipate publishers using multiple languages in their pages to describe a particular entity, you can automatically pick up the language of the content of the page if publishers use microformats or RDFa. If you consume microdata, you need to provide specific properties in your vocabulary that publishers can use to indicate the language of the content.
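Picking up the language from the page amounts to tracking the nearest lang attribute while walking the document. The Python sketch below shows the idea for elements carrying an RDFa-style property attribute. It is not an RDFa processor: the LangTracker class and values_with_language function are hypothetical names, only element text content is read, properties nested inside other properties are skipped, xml:lang is ignored, and the input is assumed to be well-formed.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are never pushed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LangTracker(HTMLParser):
    """Pairs the text of each property element with its inherited language."""

    def __init__(self):
        super().__init__()
        self.lang_stack = [None]   # effective language per open element
        self.capture = None        # the property element being read, if any
        self.values = []           # (property, text, language) triples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "lang" in attrs:
            lang = attrs["lang"] or None   # lang="" resets to unknown
        else:
            lang = self.lang_stack[-1]     # inherit from the parent
        if tag in VOID:
            return
        self.lang_stack.append(lang)
        if attrs.get("property") and self.capture is None:
            self.capture = {"name": attrs["property"], "lang": lang,
                            "text": [], "depth": len(self.lang_stack)}

    def handle_data(self, data):
        if self.capture is not None:
            self.capture["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        done = self.capture
        if done is not None and len(self.lang_stack) == done["depth"]:
            # The element that started the capture is closing.
            text = "".join(done["text"]).strip()
            self.values.append((done["name"], text, done["lang"]))
            self.capture = None
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()


def values_with_language(html):
    tracker = LangTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.values


PAGE = """<html lang="en"><body>
<div typeof="foaf:Person">
  <span property="foaf:name">Alice</span>
  <span property="foaf:title" lang="fr">Directrice</span>
</div>
</body></html>"""
print(values_with_language(PAGE))
```

With microdata there is no equivalent to inherit from: the language would have to arrive as an ordinary property value defined by your vocabulary.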

Usability Considerations

Publishing data within HTML can be a challenge for publishers, simply because the structure of the data that they publish is not immediately visible within their pages. The publishers you are targeting will have different levels of skill and experience, which may influence your choice of syntax and the way in which you design your vocabulary. If you can, you should try to work closely with a few target publishers to better understand their requirements and constraints. Experimenting with marking up a few of their existing pages will often highlight issues with both syntax and vocabulary.

Some usability issues may be addressed by restricting the set of attributes that you instruct publishers how to use, or by restricting their location to provide more consistency. For example:

  • RDFa 1.1 Lite is an authoring profile of RDFa 1.1 that is sufficient for most data publishing
  • most microdata markup does not require @itemid or @itemref
  • constraining data markup to the head of an HTML document can make it easier to author and protect it from templating changes, although it also runs the risk of getting out of sync with the content of the page, increases repetition, and is hard to use for anything but flat data structures

Profiling microdata and RDFa is useful for documentation, but consumers should still recognise and understand the full set of syntactic constructs described by the standards. This ensures that those publishers who find that they need the more advanced constructs to mark up their pages can do so, and means that publishers can use general-purpose tools and documentation rather than just those that you provide.

Good Consumption Practice

It is good practice for a consumer to provide tools that help publishers to see how the data within their pages is interpreted by the consumer and that highlight any errors in the markup, such as invalid values or missing required properties.

It is good practice for consumers to ignore markup that uses syntax or vocabularies that they do not understand. In particular, properties and types in unrecognised vocabularies should be skipped quietly rather than reported as errors.
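A minimal Python sketch of this behaviour, assuming items have already been extracted into a JSON-like structure with "type" and "properties" keys. The RECOGNISED table and the keep_recognised function are hypothetical, and the listed schema.org property names are only a small illustrative subset.

```python
# Types this hypothetical consumer understands, with the properties it uses.
RECOGNISED = {
    "http://schema.org/Event": {"name", "startDate", "location"},
}


def keep_recognised(items, vocabulary=RECOGNISED):
    """Keeps recognised types and properties; drops the rest without error."""
    kept = []
    for item in items:
        known_types = [t for t in item.get("type", []) if t in vocabulary]
        if not known_types:
            continue  # unrecognised vocabulary: skip quietly, never raise
        allowed = set().union(*(vocabulary[t] for t in known_types))
        properties = {
            name: values
            for name, values in item.get("properties", {}).items()
            if name in allowed
        }
        kept.append({"type": known_types, "properties": properties})
    return kept


extracted = [
    {"type": ["http://schema.org/Event"],
     "properties": {"name": ["Launch party"], "internalCode": ["42"]}},
    {"type": ["http://example.org/vocab/Widget"],
     "properties": {"colour": ["red"]}},
]
print(keep_recognised(extracted))
```

A validator aimed at publishers may still want to report what was skipped, but as information rather than as a failure.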

The presence of HTML data within a website does not imply that the data can be used without restriction. Publishers may license the information provided through HTML data, for example to restrict it to non-commercial use or to use only with attribution. It is good practice for a consumer to honour licenses and to indicate to publishers which formats they recognise for expressing licensing information within HTML pages, and which licenses they recognise as indicating that the data within the page is consumable. Typical vocabularies for expressing this information are schema.org, rel-license, Creative Commons or Dublin Core.

Even when the use of data is unrestricted, it is good practice for consumers to record the source of the information that they use and, when republishing that data, provide metadata about the rights holder, source and license under which the information is available, using the same vocabularies as those listed above.
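One way to make recording the source routine is to store provenance alongside every extracted item from the start. The Python sketch below is only one possible shape for such a record; the SourcedItem class, its field names and the example URLs are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SourcedItem:
    """An extracted item together with where it came from."""
    item: dict
    source_url: str
    license_url: Optional[str] = None
    rights_holder: Optional[str] = None
    retrieved: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    def republishing_metadata(self):
        """Returns the provenance fields that are actually known."""
        metadata = {"source": self.source_url}
        if self.license_url:
            metadata["license"] = self.license_url
        if self.rights_holder:
            metadata["rightsHolder"] = self.rights_holder
        return metadata


record = SourcedItem(
    item={"type": ["http://schema.org/Event"],
          "properties": {"name": ["Launch party"]}},
    source_url="http://example.org/events",
    license_url="http://creativecommons.org/licenses/by/3.0/",
)
print(record.republishing_metadata())
```

When the data is republished as HTML data, these fields would be expressed in whichever of the licensing vocabularies your own consumers recognise.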

TODO: More?