dcmi: The Dublin Core microformat

Bert Bos (W3C) bert@w3.org

26 November 2011

Abstract

This is a proposal for a new microformat for HTML, to help software automatically find bibliographic metadata about documents, in particular the metadata that conforms to the set known as the Dublin Core.

That set contains the typical properties found in a bibliography or library catalog, such as title, author, date, publisher and abstract. The DCMI (Dublin Core Metadata Initiative, the organization behind the Dublin Core) itself already defines a way to embed such metadata in HTML documents in a machine-readable way, viz., with LINK and META elements, but that method suffers from some of the well-known problems for which microformats were invented: error-prone editing, duplication of data, and invisible data.

The result of extracting the dcmi-coded data from an HTML document is thus a bibliographic record about that document.

Background

This section is not normative.


The dcmi microformat is designed for a specific use case (but can, of course, be used for other cases): a scientist who is researching the literature before writing a scientific paper. Whenever he reads an (electronic) article that he thinks he might cite in his paper, he records the bibliographic data of the article: author, title, publisher, date, etc. He probably has some reference manager (e.g., Mendeley, CiteULike or Zotero) or at least a database of references (BibTeX, Refer, etc.) and rather than copying the bibliographic data by hand, he prefers to copy and paste, or better still, to press a single button.

Some reference managers have tools that can parse certain kinds of online articles automatically. They may have built-in parsers for certain ACM publications or they read the DCMI metadata from META and LINK tags, as defined by the DCMI. As long as the article's publisher or author has provided the machine-readable metadata, the scientist indeed only has to press the button.

The dcmi microformat, once the various reference managers have learned to parse it, should make it easier for authors to add the required metadata and should also reduce the number of errors in such metadata. The result is that the scientist looking for articles to cite can even more often copy the bibliographic data with just a single click.


The dcmi microformat resembles other microformats, such as hcard and hcalendar, but also has some unique characteristics.

The syntax uses the well-known microformat patterns: CLASS attributes, REL attributes, ABBR elements, etc. Like rel-license, but unlike hcard, the dcmi microformat defines metadata of which the subject is, implicitly, the HTML document itself. Like hcard, dcmi uses a special class value, called the root class name (“vcard” for one, “dcmi” for the other) as a kind of sentinel: keywords are only recognized if they are inside an element with the root class name. But where each occurrence of the root class name “vcard” indicates the start of a separate vcard, the root class name “dcmi” only serves as sentinel and the keywords below it all contribute metadata to the same bibliographic record.

Syntax

The reference for the dcmi keywords is the official specification of the DCMI Metadata Terms. All the terms, more precisely: the part referred to in that specification as the “name” of the term, can be used as a value on a CLASS attribute (following the microformat syntax known as the class design pattern), with the exception of the terms explained further down.

If the [TBD] URL is present in the PROFILE attribute of the HEAD element, then a UA that reads the dcmi microformat MUST look for DCMI terms on any element with a class value of “dcmi” and on its descendants. If the [TBD] URL is not present, a UA that reads the dcmi microformat MAY still look for DCMI terms on any element with a class value of “dcmi” and its descendants. A UA that reads the dcmi microformat MUST NOT look for DCMI terms on other elements.

Note: This means that a UA that writes markup conforming to the dcmi microformat MUST make sure that all elements that represent DCMI terms either have the class “dcmi” or are descendants of an element with that class. Such a UA SHOULD add the [TBD] profile.
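The MUST/MAY distinction above can be sketched as follows. This is only an illustration, not part of the proposal: the names HeadProfileParser and must_parse_dcmi are hypothetical, and because the profile URL is still [TBD], the caller has to pass it in as a parameter.

```python
from html.parser import HTMLParser


class HeadProfileParser(HTMLParser):
    """Collects the URLs listed in the PROFILE attribute of the HEAD element."""

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            # PROFILE holds a whitespace-separated list of URLs.
            value = dict(attrs).get("profile") or ""
            self.profiles.extend(value.split())


def must_parse_dcmi(html, profile_url):
    """Return True if the profile is declared (looking for terms is a MUST).

    False means the UA MAY still look for terms inside "dcmi" elements.
    profile_url is supplied by the caller; the real URL is not yet defined.
    """
    parser = HeadProfileParser()
    parser.feed(html)
    parser.close()
    return profile_url in parser.profiles
```

In both cases the UA only ever looks inside elements with the class “dcmi”; the function merely tells the UA whether doing so is required or optional.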

Note: The class “dcmi” can occur multiple times and does not need to be on a common ancestor of all the DCMI terms. E.g., the following two HTML fragments represent the same data:

<body class=dcmi>
 <p><span class=creator>P. Maple</span>,
 <abbr class=date
  title=2011-12-15>15 Dec. 2011</abbr>

and:

<body>
 <p><span class="dcmi creator">P. Maple</span>,
 <abbr class="date dcmi"
  title=2011-12-15>15 Dec. 2011</abbr>

If the same dcmi term occurs multiple times, the value corresponding to that term is the concatenation of the values of all occurrences.
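A minimal sketch of how a UA might apply the sentinel rule and the concatenation rule is shown below. It rests on several assumptions that the proposal does not fix: DCMI_TERMS is only a small subset of the real term list, the class and function names are hypothetical, white space is normalized, multiple values are joined with a single space, and the parser does not implement HTML's implied end tags (an unclosed element simply stays open until an ancestor closes).

```python
from html.parser import HTMLParser

# A small subset of the DCMI term names, for illustration only.
DCMI_TERMS = {"abstract", "alternative", "contributor", "creator", "date",
              "description", "publisher", "subject"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class DcmiClassParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # open elements: (tag, in_scope, text buffers)
        self.values = {}  # term -> list of buffers, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        parent_in_scope = self.stack[-1][1] if self.stack else False
        # An element is in scope if it or an ancestor has the class "dcmi".
        in_scope = parent_in_scope or "dcmi" in classes
        collecting = []
        if in_scope:
            for term in classes:
                if term not in DCMI_TERMS:
                    continue
                if tag == "abbr" and attrs.get("title"):
                    # abbr design pattern: the TITLE attribute is the value.
                    self.values.setdefault(term, []).append([attrs["title"]])
                else:
                    buffer = []
                    self.values.setdefault(term, []).append(buffer)
                    collecting.append(buffer)
        if tag not in VOID_TAGS:
            self.stack.append((tag, in_scope, collecting))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text belongs to every open element that is collecting a term.
        for _, _, collecting in self.stack:
            for buffer in collecting:
                buffer.append(data)


def extract_terms(html, separator=" "):
    """Return a dict mapping each term found to its concatenated value."""
    parser = DcmiClassParser()
    parser.feed(html)
    parser.close()
    record = {}
    for term, buffers in parser.values.items():
        texts = [" ".join("".join(b).split()) for b in buffers]
        record[term] = separator.join(t for t in texts if t)
    return record
```

Under these assumptions, both fragments above produce the same record: “creator” is “P. Maple” and “date” is “2011-12-15”.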

The following terms are handled specially:

“date” and its sub-properties
The mark-up SHOULD use the date design pattern (i.e., the value should be in the ISO 8601 syntax for dates, YYYY-MM-DD, possibly by using the abbr design pattern).
“format,” “extent” and “medium”
These terms are not used in the dcmi microformat. I.e., if they occur, they are considered normal class values, without any meaning for the purpose of the dcmi microformat. The “format” property of a document is instead taken from the document's Internet Media Type (typically text/html or application/xhtml+xml). The “extent” property of a document is always undefined (nil). The “medium” property is also undefined, unless there is external information about the physical medium on which the document is stored (e.g., “DVD”).
“identifier” and “bibliographicCitation”
These terms are not used in the dcmi microformat. I.e., if they occur, they are considered normal class values, without any meaning for the purpose of the dcmi microformat. The “identifier” property of a document is instead taken from its URL: it is the absolute URL for the document. The “bibliographicCitation” property is always undefined.
“language”
This term is not used. Instead, the “language” property is taken from the language of the BODY element. It SHOULD be expressed as a language code defined in BCP 47 (e.g., en, en-us, or sr-Latn). Note that the language of the BODY element can be explicit (given by a LANG or xml:lang attribute), or inherited from the HTML element or even from outside the document, e.g., from HTTP headers.
“hasFormat,” “hasPart,” “hasVersion,” “isFormatOf,” “isPartOf,” “isReferencedBy,” “isReplacedBy,” “isRequiredBy,” “isVersionOf,” “references,” “replaces” and “requires”
The values of these terms SHOULD be URLs. In most cases that means they should use the rel design pattern, e.g.:
<a rel=hasVersion href="document-B">previous</a>,
although sometimes the class design pattern may be used:
<a class=hasVersion
href="http://example.org/doc-B">
http://example.org/doc-B</a>.
“license”
The license text can be part of the document or linked from it. In the latter case, the “license” keyword occurs on a REL attribute, conforming to the rel=license microformat. E.g.:
<a rel=license href="license.html">Copyright</a>
“title”
This term is not used. Instead, the DCMI title property is mapped to the TITLE element. Note: the related term “alternative” (an alternative title) is still used as a class value.
“type”
This term is not used. The DCMI property “type” is implicit and has the value “text”.
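The implicit properties in the list above (“format”, “identifier”, “language”, “title” and “type”) could be derived roughly as in the sketch below. The names ImplicitParser and implicit_properties are hypothetical, as is the choice to pass the document's absolute URL, its Content-Type header value and any language known from HTTP as arguments. Properties that are always undefined (“extent”, “bibliographicCitation”) are simply omitted, and so is “medium”, for which no external information is assumed.

```python
from html.parser import HTMLParser


class ImplicitParser(HTMLParser):
    """Finds the TITLE text and the language of the HTML and BODY elements."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.body_lang = None
        self.title_parts = []
        self._in_title = False
        self._title_done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        lang = attrs.get("lang") or attrs.get("xml:lang")
        if tag == "html":
            self.html_lang = lang
        elif tag == "body":
            self.body_lang = lang
        elif tag == "title" and not self._title_done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            # Only the first TITLE element counts.
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def implicit_properties(html, url, content_type="text/html", http_language=None):
    """Return the properties that are not taken from class or rel values."""
    parser = ImplicitParser()
    parser.feed(html)
    parser.close()
    # BODY's own language wins, then the one inherited from HTML,
    # then information from outside the document (e.g., HTTP headers).
    language = parser.body_lang or parser.html_lang or http_language
    record = {
        "identifier": url,  # the absolute URL of the document
        "format": content_type.split(";")[0].strip().lower(),
        "type": "text",     # always "text" in this microformat
        "title": " ".join("".join(parser.title_parts).split()),
    }
    if language:
        record["language"] = language
    return record
```

A full reader would merge this record with the values collected from CLASS and REL attributes inside elements with the class “dcmi”.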

Example

This section is not normative.


A document marked-up like this:

<!doctype html public '-//W3C//DTD HTML 4.01//EN'>
<html lang=en>
 <head profile=
  "http://microformats.org/profile/hcard
  [the-URL-for-the DC-microformat]">

  <title>dcmi: The Dublin Core microformat</title>

 <body class=dcmi>
  <h1>dcmi: The Dublin Core microformat</h1>

  <p class="creator vcard"><span
   class=fn>Bert Bos</span> (<span
  class=org>W3C</span>) <a class=email
  href="mailto:bert@w3.org">bert@w3.org</a>

  <p class=date><abbr class=date
   title=2011-11-26>26 November 2011</abbr>

  <p class=abstract>This is a proposal for...
  <p class=abstract>That set contains...
  [...]

encodes the following Dublin Core metadata:

Term        Value
language    en
title       dcmi: The Dublin Core microformat
creator     Bert Bos (W3C) bert@w3.org
date        2011-11-26
abstract    This is a proposal…
            That set contains…
type        text
identifier  [the URL of this document]
format      text/html

Note that the “creator” uses the hcard microformat to provide extra structure for the author in the form of a vcard (not shown in the table above).

Appendix: Singapore framework

This section is not normative.


The DCMI has written a document (called the Singapore Framework) listing five components that each specification for an “application profile” of the Dublin Core should define. For the dcmi microformat, those components are as follows:

Functional requirements
The Background section above explains the principal use case of the dcmi microformat. In short: allowing a reference manager to extract reliable DC metadata about an HTML document from that document itself, with minimal effort for the document's creator.
Domain model
The dcmi microformat applies to HTML documents.
Description Set Profile (DSP)
The dcmi microformat produces full DC metadata records, i.e., all of the DCMI terms can be given values. However, certain terms have value ranges that are restricted, as described in the Syntax section above. (E.g., the value of “type” is always “text.”)
Usage guidelines
The dcmi microformat is applied to HTML using the well-known methods of microformats, complemented with the specific rules for the dcmi microformat in the Syntax section above.
Encoding syntax guidelines
The syntax is that of HTML. Any valid encoding of HTML can be used.

References

  1. DCMI Usage Board. DCMI Metadata Terms. 11 October 2011, or any later version.
  2. Pete Johnston, Andy Powell. Expressing Dublin Core metadata using HTML/XHTML meta and link elements. 4 August 2008.
  3. Tantek Çelik. rel="license". 6 February 2005.
  4. Tantek Çelik. hCalendar 1.0. 24 November 2011.
  5. Tantek Çelik. hCard 1.0. 6 October 2011.
  6. Mikael Nilsson, Thomas Baker, Pete Johnston. The Singapore Framework for Dublin Core Application Profiles. 14 January 2008.