Embedding Tabular Metadata in HTML

Abstract

The Model for Tabular Data and Metadata on the Web describes mechanisms for extracting metadata from CSV documents starting with either a tabular data file, or a metadata description. In the case of starting with a CSV document, a procedure is followed to locate metadata describing that CSV (see Locating Metadata in [tabular-data-model]). Alternatively, processing may begin with a metadata file directly, which references the tabular data file(s). However, in some cases, it is preferred to publish datasets using HTML rather than starting with either CSV or metadata files.

In addition, tabular data is often contained within HTML in the form of HTML table elements (see [html5]). This document describes a means of identifying such tables from [tabular-metadata] and extracting annotated tabular data from HTML tables.

Note

This document does not attempt to address the full range of ways in which tabular datasets can be used within browser-based applications, e.g. related JavaScript efforts such as IndexedDB and Web Components. It is concerned primarily with providing additional information about tabular data. Discussion of deeper integration into Web-based apps is encouraged via the CSVW Community Group.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

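The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g., RDF, JSON, or XML)". This non-normative document describes extensions for discovering [tabular-metadata] within HTML documents, and for extracting annotated tables from HTML tables. The normative standards are the Model for Tabular Data and Metadata on the Web [tabular-data-model], the Metadata Vocabulary for Tabular Data [tabular-metadata], Generating JSON from Tabular Data on the Web [csv2json], and Generating RDF from Tabular Data on the Web [csv2rdf].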

This document was published by the CSV on the Web Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org. All comments are welcome.

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

1. Embedding Tabular Metadata within HTML Documents

Metadata may be exposed in an HTML document in two ways: embedded in a script element, or referenced through a link.

1.1 Embedding Metadata in a `script` Element

This section describes mechanisms similar to Embedding JSON-LD in HTML Documents (see [json-ld]) for embedding metadata within an HTML document.

HTML script elements can be used to embed data blocks in documents (see Scripting in [html5]). Metadata [tabular-metadata] describing one or more tabular data files can be embedded in HTML, which can be used as an alternative way to publish datasets.

The content should be placed in a script element with the type set to application/csvm+json. The character encoding of the embedded metadata will match the HTML document's encoding.

Example 1: Tabular Metadata embedded in HTML

<html>
  <head>
    <script type="application/csvm+json">
    {
      "@context": "http://www.w3.org/ns/csvw",
      "tables": [{
        "url": "countries.csv",
        "tableSchema": {
          "columns": [{
            "name": "countryCode",
            "titles": "countryCode",
            "datatype": "string",
            "propertyUrl": "http://www.geonames.org/ontology{#_name}"
          }, {
            "name": "latitude",
            "titles": "latitude",
            "datatype": "number"
          }, {
            "name": "longitude",
            "titles": "longitude",
            "datatype": "number"
          }, {
            "name": "name",
            "titles": "name",
            "datatype": "string"
          }],
          "aboutUrl": "countries.csv{#countryCode}",
          "propertyUrl": "http://schema.org/{_name}",
          "primaryKey": "countryCode"
        }
      }, {
        "url": "country_slice.csv",
        "tableSchema": {
          "columns": [{
            "name": "countryRef",
            "titles": "countryRef",
            "valueUrl": "countries.csv{#countryRef}"
          }, {
            "name": "year",
            "titles": "year",
            "datatype": "gYear"
          }, {
            "name": "population",
            "titles": "population",
            "datatype": "integer"
          }],
          "foreignKeys": [{
            "columnReference": "countryRef",
            "reference": {
              "resource": "countries.csv",
              "columnReference": "countryCode"
            }
          }]
        }
      }]
    }
    </script>
    ...
  </head>
  <body>
    ...
  </body>
</html>

Depending on how the HTML document is served, script content may need to be escaped. See Restrictions for contents of script elements in [html5] for more information.

Processing embedded metadata is the same as processing Overriding Metadata where the retrieved document type is text/html or application/xhtml+xml instead of a JSON document type. The base URI of the encapsulating HTML document provides a "Base URI Embedded in Content" per [RFC3986] section 5.1.1; metadata is extracted from the first script element having @type application/csvm+json. Metadata documents parsed from an HTML DOM will be a stream of character data rather than a stream of UTF-8 encoded bytes. No decoding is necessary if the HTML document has already been parsed into DOM. Each matching script data block is considered to be its own metadata document.
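The extraction step described above can be sketched in Python using only the standard library. This is a minimal illustration, not a conforming processor: the names `MetadataScriptParser` and `extract_embedded_metadata` are hypothetical, the sketch collects every matching script element (each treated as its own metadata document), and it assumes the HTML has already been decoded to a character string.

```python
import json
from html.parser import HTMLParser

CSVM_TYPE = "application/csvm+json"


class MetadataScriptParser(HTMLParser):
    """Collects the text of every script element typed application/csvm+json."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one string per matching script element
        self._buffer = None   # text chunks while inside a matching script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == CSVM_TYPE:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_embedded_metadata(html_text):
    """Return one parsed JSON object per matching script element, in document order."""
    parser = MetadataScriptParser()
    parser.feed(html_text)
    parser.close()
    return [json.loads(block) for block in parser.blocks]


# Hypothetical usage: the first element of the result corresponds to the
# first matching script element in the document.
sample = """<html><head>
<script type="application/csvm+json">
{"@context": "http://www.w3.org/ns/csvw", "url": "countries.csv"}
</script>
</head><body></body></html>"""
documents = extract_embedded_metadata(sample)
```

Because the input is already character data, no byte-level decoding happens here; a caller working from raw bytes would need to decode using the HTML document's encoding first.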

1.2 Linking to Metadata

An alternative to embedding metadata within a script element is linking to the metadata using an HTTP Link header and/or an HTML link element using the equivalent mechanism described for CSV files by Link Header in [tabular-data-model]. Linked metadata provides an alternate mechanism for referencing metadata that would otherwise be discovered by Locating Metadata as defined in [tabular-data-model]. See The link element in [html5] for more information.

Example 2: Linking to Metadata

HTTP/1.1 200 OK
Link: <metadata.json>; rel="describedby"
Content-Type: text/html

<html>
  <head>
    <link rel="describedby" type="application/csvm+json" href="metadata.json"/>
    ...
  </head>
</html>

The preceding example shows an HTTP response for an HTML document containing a link element referencing external metadata, along with an HTTP Link header referencing the same metadata.

Best Practice 1: HTML and HTTP Link references must be consistent

If using both an HTML link element and an HTTP Link header, both must reference the same metadata URI.
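As an illustration of the two linking mechanisms and the consistency requirement, the following Python sketch collects describedby targets from an HTTP Link header value and from HTML link elements, then compares them. The function and class names are hypothetical, the header parsing is deliberately simplified (it assumes no commas or semicolons inside URLs or quoted parameter values), and the base URL is an assumed example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


def links_from_header(header_value, base_url):
    """Return absolute targets of rel="describedby" entries in a Link header value."""
    targets = []
    for part in header_value.split(","):
        match = re.match(r'\s*<([^>]*)>(.*)', part)
        if not match:
            continue
        target, params = match.groups()
        rel = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params, re.IGNORECASE)
        if rel and "describedby" in rel.group(1).lower().split():
            targets.append(urljoin(base_url, target))
    return targets


class LinkElementParser(HTMLParser):
    """Collects href values of link elements whose rel includes describedby."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "describedby" in rels and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def check_link_consistency(header_value, html_text, base_url):
    """Return (header targets, HTML targets, whether the two agree)."""
    parser = LinkElementParser()
    parser.feed(html_text)
    parser.close()
    from_html = [urljoin(base_url, href) for href in parser.hrefs]
    from_header = links_from_header(header_value, base_url)
    # Only flag a mismatch when both mechanisms are used and they disagree.
    consistent = (not from_header or not from_html
                  or set(from_header) == set(from_html))
    return from_header, from_html, consistent
```

A publisher-side check could call `check_link_consistency` on a response before deployment and reject it when the third element of the result is false.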

Best Practice 2: Prefer embedded metadata

Do not both embed metadata and link to metadata: differences between the embedded representation and the linked representation can cause processing inconsistencies.

2. Extracting Tabular Data from HTML Tables

This section describes a mechanism for locating tabular data within an HTML document, extracting tabular data from an identified table element, and processing the tabular data to create annotated tables.

In addition to tabular data files, a metadata table id may reference an HTML table within an HTML document. A reference within an HTML document is described using a document-relative fragment identifier which is defined using the @id attribute on an HTML table element.

Best Practice 3: Include metadata and referenced HTML tables in a single HTML document

Self-contained HTML documents, which include both the embedded metadata and the HTML tables that metadata references, are preferred to HTML tables or CSV files defined externally.

Consideration must be given to the generation of URLs. The standard forms of both JSON [csv2json] and RDF [csv2rdf] generate URLs by appending a fragment identifier to the table URL to identify rows. Also, unless an explicit propertyUrl is defined, RDF properties are also generated using a fragment of the table URL.

Best Practice 4: Avoid automatically generated URLs

Explicitly define aboutUrl, propertyUrl, and valueUrl, where appropriate, to avoid using automatically generated URL fragments which conflict with using fragments to identify tables.
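The conflict can be made concrete with a small Python sketch. It assumes, based on the related conversion specifications rather than on this document, that a default row URL is formed by appending `#row=N` to the table URL and a default property URL by appending `#` plus the percent-encoded column name; all function names are hypothetical.

```python
from urllib.parse import quote


def default_row_url(table_url, source_row_number):
    """Assumed default: table URL plus a row fragment identifier."""
    return f"{table_url}#row={source_row_number}"


def default_property_url(table_url, column_name):
    """Assumed default: table URL plus a fragment naming the column."""
    return f"{table_url}#{quote(column_name, safe='')}"


def conflicting_generated_urls(table_url, column_names):
    """Return generated URLs that would contain more than one '#'."""
    candidates = [default_row_url(table_url, 1)]
    candidates += [default_property_url(table_url, name) for name in column_names]
    return [url for url in candidates if url.count("#") > 1]
```

When the table URL is itself a fragment reference such as `http://example.org/#countries`, every generated URL in this sketch ends up with two `#` characters, which is why explicit aboutUrl, propertyUrl, and valueUrl templates are recommended for HTML tables.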

2.1 Extracting HTML Tables

Raw tabular data may be extracted from HTML tables with use of the dialect description as with CSV tables.

Table rows are numbered starting from 1, as with CSV files.
The in scope language of the table element is used as the lang inherited property for embedded metadata.
Rows containing only th elements have their text content used as the column titles in the embedded metadata.
Rows containing td are used as row content with the text content of each td element used as the cell string value; such rows may also contain th elements which are treated as data elements.
caption elements within a table element are ignored.
th and td elements contained within thead, tbody, or tfoot elements are processed as if they were child elements of the table element.

Processing extracted tables is otherwise handled in a similar manner to CSV as defined in Parsing Tabular Data in [tabular-data-model].
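The extraction rules above can be sketched in Python with the standard library HTML parser. This is an illustrative sketch under stated assumptions: the names `TableExtractor` and `extract_table` are hypothetical, nested tables are not handled, no dialect options (such as trimming or skipping rows) are applied, and the in-scope language is not tracked.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Extracts column titles and cell string values from one table, found by id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.titles = []      # one list of titles per header row (th-only rows)
        self.rows = []        # one list of cell string values per data row
        self._inside = False  # True while within the target table
        self._row = None      # (tag, text) pairs for the current tr
        self._cell = None     # (tag, text chunks) for the current th or td

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self._inside = True
        elif self._inside and tag == "tr":
            self._row = []
        elif self._inside and tag in ("th", "td") and self._row is not None:
            self._cell = (tag, [])

    def handle_data(self, data):
        # Text outside cells (for example in caption) is ignored.
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if not self._inside:
            return
        if tag in ("th", "td") and self._cell is not None:
            self._row.append((self._cell[0], "".join(self._cell[1])))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            cells, self._row = self._row, None
            texts = [text for _, text in cells]
            if cells and all(kind == "th" for kind, _ in cells):
                self.titles.append(texts)   # header row: only th elements
            elif cells:
                self.rows.append(texts)     # data row: th cells count as data
        elif tag == "table":
            self._inside = False


def extract_table(html_text, table_id):
    """Return (titles, rows) for the table whose id attribute equals table_id."""
    extractor = TableExtractor(table_id)
    extractor.feed(html_text)
    extractor.close()
    return extractor.titles, extractor.rows
```

Because thead, tbody, and tfoot start and end tags are simply not acted on, rows inside them are processed as if they were direct children of the table element. Row numbers, starting from 1, can be derived by enumerating the returned rows.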

Note

Processors using a Document Object Model [DOM] may have their content coerced to a normalized form, including optional elements such as tbody.

Best Practice 5: Header rows precede content rows

Tables should be organized with the first rows containing only th elements to describe column headers. Subsequent rows should contain only td elements to describe table data.

Best Practice 6: Avoid use of @colspan and @rowspan attributes

The processing algorithm for tabular data does not account for differences in column counts and row counts that might be present in HTML tables using the @colspan and/or @rowspan attributes; use of these attributes should be avoided. Note that a header row containing @colspan, or a data column containing @rowspan may be ignored using appropriate dialect descriptions.
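A publisher may want to check for these attributes before relying on table extraction. The following Python sketch, with the hypothetical names `SpanDetector` and `find_spanning_cells`, reports every th or td element carrying a colspan or rowspan value other than 1; it is a simple lint and does not attempt to repair the table.

```python
from html.parser import HTMLParser


class SpanDetector(HTMLParser):
    """Records th and td elements that use colspan or rowspan with a value other than 1."""

    def __init__(self):
        super().__init__()
        self.spanning_cells = []  # (tag, attribute name, attribute value)

    def handle_starttag(self, tag, attrs):
        if tag not in ("th", "td"):
            return
        for name, value in attrs:
            if name in ("colspan", "rowspan") and (value or "1").strip() != "1":
                self.spanning_cells.append((tag, name, value))


def find_spanning_cells(html_text):
    """Return a list of (tag, attribute, value) for each spanning cell found."""
    detector = SpanDetector()
    detector.feed(html_text)
    detector.close()
    return detector.spanning_cells
```

An empty result suggests that every row can be expected to yield one cell per column, subject to the table otherwise being well formed.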

2.2 Example

The following tables are identified using #countries and #country_slice:

Countries
countryCode	latitude	longitude	name
AD	42.5	1.6	Andorra
AE	23.4	53.8	United Arab Emirates
AF	33.9	67.7	Afghanistan

Country Slice
countryRef	year	population
AF	1960	9616353
AF	1961	9799379
AF	1962	9989846

Example 3: Referenced HTML tables

<table id="countries">
  <caption>Countries</caption>
  <tr><th>countryCode</th><th>latitude</th><th>longitude</th><th>name</th></tr>
  <tr><td>AD</td><td>42.5</td><td>1.6</td><td>Andorra</td></tr>
  <tr><td>AE</td><td>23.4</td><td>53.8</td><td>United Arab Emirates</td></tr>
  <tr><td>AF</td><td>33.9</td><td>67.7</td><td>Afghanistan</td></tr>
</table>
<table id="country_slice">
  <caption>Country Slice</caption>
  <tr><th>countryRef</th><th>year</th><th>population</th></tr>
  <tr><td>AF</td><td>1960</td><td>9616353</td></tr>
  <tr><td>AF</td><td>1961</td><td>9799379</td></tr>
  <tr><td>AF</td><td>1962</td><td>9989846</td></tr>
</table>

The metadata is described here in a script element:

Example 4

Generating Minimal JSON from this document should result in the following:

Example 5: Minimal JSON output

[
  {
    "@id": "http://example.org/#countries-AD",
    "http://www.geonames.org/ontology#countryCode": "AD",
    "schema:latitude": 42.5,
    "schema:longitude": 1.6,
    "schema:name": "Andorra"
  },
  {
    "@id": "http://example.org/#countries-AE",
    "http://www.geonames.org/ontology#countryCode": "AE",
    "schema:latitude": 23.4,
    "schema:longitude": 53.8,
    "schema:name": "United Arab Emirates"
  },
  {
    "@id": "http://example.org/#countries-AF",
    "http://www.geonames.org/ontology#countryCode": "AF",
    "schema:latitude": 33.9,
    "schema:longitude": 67.7,
    "schema:name": "Afghanistan"
  },
  {
    "http://dbpedia.org/ontology/locationCountry": "http://example.org/#countries-AF",
    "http://dbpedia.org/property/urbanAreaDate": "1960",
    "http://www.geonames.org/ontology/population": 9616353
  },
  {
    "http://dbpedia.org/ontology/locationCountry": "http://example.org/#countries-AF",
    "http://dbpedia.org/property/urbanAreaDate": "1961",
    "http://www.geonames.org/ontology/population": 9799379
  },
  {
    "http://dbpedia.org/ontology/locationCountry": "http://example.org/#countries-AF",
    "http://dbpedia.org/property/urbanAreaDate": "1962",
    "http://www.geonames.org/ontology/population": 9989846
  }
]

Generating Minimal RDF from this document should result in the following:

Example 6: Minimal RDF output

@prefix geonames: <http://www.geonames.org/ontology#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/#countries-AD> schema:latitude 4.25e1;
   schema:longitude 1.6e0;
   schema:name "Andorra";
   geonames:countryCode "AD" .

<http://example.org/#countries-AE> schema:latitude 2.34e1;
   schema:longitude 5.38e1;
   schema:name "United Arab Emirates";
   geonames:countryCode "AE" .

<http://example.org/#countries-AF> schema:latitude 3.39e1;
   schema:longitude 6.77e1;
   schema:name "Afghanistan";
   geonames:countryCode "AF" .

 [
     <http://dbpedia.org/ontology/locationCountry> <http://example.org/#countries-AF>;
     <http://dbpedia.org/property/urbanAreaDate> "1962"^^xsd:gYear;
     <http://www.geonames.org/ontology/population> 9989846
 ] .

 [
     <http://dbpedia.org/ontology/locationCountry> <http://example.org/#countries-AF>;
     <http://dbpedia.org/property/urbanAreaDate> "1961"^^xsd:gYear;
     <http://www.geonames.org/ontology/population> 9799379
 ] .

 [
     <http://dbpedia.org/ontology/locationCountry> <http://example.org/#countries-AF>;
     <http://dbpedia.org/property/urbanAreaDate> "1960"^^xsd:gYear;
     <http://www.geonames.org/ontology/population> 9616353
 ] .

Embedding Tabular Metadata in HTML

W3C Working Group Note 25 February 2016

Abstract

Status of This Document

Table of Contents

1. Embedding Tabular Metadata within HTML Documents

1.1 Embedding Metadata in a `script` Element

1.2 Linking to Metadata

2. Extracting Tabular Data from HTML Tables

2.1 Extracting HTML Tables

2.2 Example

3. Extracting Tabular Data from embedded CSV

3.1 Example

A. References

A.1 Informative references
