Abstract

The Model for Tabular Data and Metadata on the Web describes mechanisms for extracting metadata from CSV documents starting with either a tabular data file, or a metadata description. In the case of starting with a CSV document, a procedure is followed to locate metadata describing that CSV (see Locating Metadata in [tabular-data-model]). Alternatively, processing may begin with a metadata file directly, which references the tabular data file(s). However, in some cases, it is preferred to publish datasets using HTML rather than starting with either CSV or metadata files.

In addition, tabular data is often contained within HTML in the form of HTML table elements (see [html5]). This document describes a means of identifying such tables from [tabular-metadata] and extracting annotated tabular data from HTML tables.

Note

This document does not attempt to address the full range of ways in which tabular datasets can be used within browser-based applications, e.g. related JavaScript efforts such as IndexedDB and Web Components. It is concerned primarily with providing additional information about tabular data. Discussion of deeper integration into Web-based apps is encouraged via the CSVW Community Group.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g., RDF, JSON, or XML)". This non-normative document describes extensions for discovering [tabular-metadata] within HTML documents, and for extracting annotated tables from HTML tables. The normative standards are [tabular-data-model], [tabular-metadata], [csv2json], and [csv2rdf].

This document was published by the CSV on the Web Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

1. Embedding Tabular Metadata within HTML Documents

Metadata may be exposed in an HTML document in a couple of different ways.

1.1 Embedding Metadata in a script Element

This section describes mechanisms similar to Embedding JSON-LD in HTML Documents (see [json-ld]) for embedding metadata within an HTML document.

HTML script elements can be used to embed data blocks in documents (see Scripting in [html5]). Metadata [tabular-metadata] describing one or more tabular data files can be embedded in HTML, which can be used as an alternative way to publish datasets.

The content should be placed in a script element with the type set to application/csvm+json. The character encoding of the embedded metadata will match the HTML document's encoding.

Example 1: Tabular Metadata embedded in HTML
<html>
  <head>
    <script type="application/csvm+json">
    {
      "@context": "http://www.w3.org/ns/csvw",
      "tables": [{
        "url": "countries.csv",
        "tableSchema": {
          "columns": [{
            "name": "countryCode",
            "titles": "countryCode",
            "datatype": "string",
            "propertyUrl": "http://www.geonames.org/ontology{#_name}"
          }, {
            "name": "latitude",
            "titles": "latitude",
            "datatype": "number"
          }, {
            "name": "longitude",
            "titles": "longitude",
            "datatype": "number"
          }, {
            "name": "name",
            "titles": "name",
            "datatype": "string"
          }],
          "aboutUrl": "countries.csv{#countryCode}",
          "propertyUrl": "http://schema.org/{_name}",
          "primaryKey": "countryCode"
        }
      }, {
        "url": "country_slice.csv",
        "tableSchema": {
          "columns": [{
            "name": "countryRef",
            "titles": "countryRef",
            "valueUrl": "countries.csv{#countryRef}"
          }, {
            "name": "year",
            "titles": "year",
            "datatype": "gYear"
          }, {
            "name": "population",
            "titles": "population",
            "datatype": "integer"
          }],
          "foreignKeys": [{
            "columnReference": "countryRef",
            "reference": {
              "resource": "countries.csv",
              "columnReference": "countryCode"
            }
          }]
        }
      }]
    }
    </script>
    ...
  </head>
  <body>
    ...
  </body>
</html>

Depending on how the HTML document is served, script content may need to be escaped. See Restrictions for contents of script elements in [html5] for more information.

Processing embedded metadata is the same as processing Overriding Metadata where the retrieved document type is text/html or application/xhtml+xml instead of a JSON document type. The base URI of the encapsulating HTML document provides a "Base URI Embedded in Content" per [RFC3986] section 5.1.1; metadata is extracted from the first script element having @type application/csvm+json. Metadata documents parsed from an HTML DOM will be a stream of character data rather than a stream of UTF-8 encoded bytes. No decoding is necessary if the HTML document has already been parsed into DOM. Each matching script data block is considered to be its own metadata document.
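The extraction step described above can be sketched in Python using only the standard library. This is an illustrative sketch, not a reference implementation: the class name MetadataScriptParser and the function extract_embedded_metadata are hypothetical, the type comparison is an exact string match, and the input is assumed to be already-decoded character data. The sketch collects every matching data block; a caller that only wants the first one can take the first list element.

```python
import json
from html.parser import HTMLParser


class MetadataScriptParser(HTMLParser):
    """Collects the text of every script element typed application/csvm+json."""

    METADATA_TYPE = "application/csvm+json"

    def __init__(self):
        super().__init__()
        self._in_metadata = False
        self._buffer = []
        self.blocks = []  # raw text of each matching data block, in order

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == self.METADATA_TYPE:
            self._in_metadata = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_metadata:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_metadata:
            self.blocks.append("".join(self._buffer))
            self._in_metadata = False


def extract_embedded_metadata(html_text):
    """Return one parsed metadata document per matching script element."""
    parser = MetadataScriptParser()
    parser.feed(html_text)
    parser.close()
    return [json.loads(block) for block in parser.blocks]
```

Relative URLs inside each returned document would still need to be resolved against the base URI of the enclosing HTML document; that step is left out of this sketch.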

1.2 Linking to Metadata

An alternative to embedding metadata within a script element is linking to the metadata using an HTTP Link header and/or an HTML link element using the equivalent mechanism described for CSV files by Link Header in [tabular-data-model]. Linked metadata provides an alternate mechanism for referencing metadata that would otherwise be discovered by Locating Metadata as defined in [tabular-data-model]. See The link element in [html5] for more information.

Example 2: Linking to Metadata
HTTP/1.1 200 OK
Link: <metadata.json>; rel="describedby"
Content-Type: text/html

<html>
  <head>
    <link rel="describedby" type="application/csvm+json" href="metadata.json"/>
    ...
  </head>
</html>

The preceding example shows an HTTP response for an HTML document containing a link element referencing external metadata, along with an HTTP Link header referencing the same metadata.

Best Practice 1: HTML and HTTP Link references must be consistent

When using both an HTML link element and an HTTP Link header, it is important that both reference the same metadata URI.
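A processor or publishing tool could check this consistency along the following lines. The sketch below is an assumption-laden illustration: the names DescribedByLinkParser, header_metadata_links, and linked_metadata are hypothetical, the Link header parsing is a simplified regular expression rather than a complete header parser, and the document URL http://example.org/data.html used in the usage note is invented for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class DescribedByLinkParser(HTMLParser):
    """Collects href values of link elements whose rel includes describedby."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "describedby" in rels and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def header_metadata_links(link_header):
    """Simplified Link header parsing; not a complete header grammar."""
    links = []
    for target, params in re.findall(r'<([^>]*)>([^,]*)', link_header):
        match = re.search(r'rel\s*=\s*"?([^";]*)"?', params, re.IGNORECASE)
        if match and "describedby" in match.group(1).lower().split():
            links.append(target)
    return links


def linked_metadata(document_url, link_header, html_text):
    """Return (URLs from the header, URLs from the HTML) as absolute URL sets."""
    parser = DescribedByLinkParser()
    parser.feed(html_text)
    parser.close()
    from_header = {urljoin(document_url, t)
                   for t in header_metadata_links(link_header)}
    from_html = {urljoin(document_url, h) for h in parser.hrefs}
    return from_header, from_html
```

Calling linked_metadata("http://example.org/data.html", header_value, html_text) and comparing the two returned sets shows whether the header and the link element resolve to the same metadata URI.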

Best Practice 2: Prefer embedded metadata

To avoid inconsistencies, do not both embed metadata and link to it: differences between the embedded representation and the linked representation can cause processors to produce different results.

2. Extracting Tabular Data from HTML Tables

This section describes a mechanism for locating tabular data within an HTML document, extracting tabular data from an identified table element, and processing the tabular data to create annotated tables.

In addition to tabular data files, a metadata table id may reference an HTML table within an HTML document. A reference within an HTML document is described using a document-relative fragment identifier which is defined using the @id attribute on an HTML table element.

Best Practice 3: Include metadata and referenced HTML tables in a single HTML document

Self-contained HTML documents, which include both the embedded metadata and the HTML tables that the metadata references, are preferred to HTML tables or CSV files defined externally.

Consideration must be given to the generation of URLs. The standard forms of both JSON [csv2json] and RDF [csv2rdf] generate URLs by appending a fragment identifier to the table URL to identify rows. Unless an explicit propertyUrl is defined, RDF properties are also generated using a fragment of the table URL.

Best Practice 4: Avoid automatically generated URLs

Explicitly define aboutUrl, propertyUrl, and valueUrl, where appropriate, to avoid using automatically generated URL fragments which conflict with using fragments to identify tables.
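The conflict can be made concrete with a small Python sketch. The helper names are hypothetical, the '#row=N' and '#name' forms follow the default style used by [csv2json] and [csv2rdf] for CSV files, and the aboutUrl template "#countries-{countryCode}" is an assumed example rather than one taken from this document. The template expansion handles only plain {name} variables and is not a full URI Template implementation.

```python
from urllib.parse import urldefrag, urljoin


def default_row_url(table_url, source_row_number):
    """Row identifier in the default style: table URL plus '#row=N'."""
    return f"{table_url}#row={source_row_number}"


def default_property_url(table_url, column_name):
    """Property identifier in the default style: table URL plus '#name'."""
    return f"{table_url}#{column_name}"


def table_url_has_fragment(table_url):
    """True when the table is itself addressed by a fragment identifier."""
    return urldefrag(table_url).fragment != ""


def explicit_about_url(base_url, template, row):
    """Simplified expansion: only plain {name} variables are substituted."""
    expanded = template
    for name, value in row.items():
        expanded = expanded.replace("{" + name + "}", value)
    return urljoin(base_url, expanded)


# A CSV table URL has no fragment, so appending one is unproblematic.
csv_row = default_row_url("http://example.org/countries.csv", 2)

# An HTML table is already addressed by a fragment, so the default style
# would append a second '#', which is what the best practice avoids.
html_row = default_row_url("http://example.org/#countries", 2)

# An explicit template yields a single, well-formed fragment instead.
explicit = explicit_about_url(
    "http://example.org/", "#countries-{countryCode}", {"countryCode": "AD"}
)
```

Here html_row contains two '#' characters, while explicit contains exactly one.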

2.1 Extracting HTML Tables

Raw tabular data may be extracted from HTML tables using a dialect description, as with CSV files.

Processing extracted tables is otherwise handled in a similar manner to CSV as defined in Parsing Tabular Data in [tabular-data-model].

Note

Processors using a Document Object Model [DOM] may have their content coerced to a normalized form that includes optional elements such as tbody.

Best Practice 5: Header rows precede content rows

Tables should be organized with the first rows containing only th elements to describe column headers. Subsequent rows should contain only td elements to describe table data.
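A table organized this way is straightforward to split into header rows and data rows. The following Python sketch assumes a table identified by its @id attribute, cells with explicit closing tags, and no reliance on a full DOM; the names TableExtractor and extract_table are hypothetical. Leading rows made up entirely of th cells are treated as header rows, and everything after them as data.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects rows of the table whose id matches, as (cell tag, text) pairs."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0      # nesting depth inside the target table
        self.rows = []      # each row is a list of (tag, text)
        self._cell = None   # (tag, text chunks) for the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Start counting at the target table; count nested tables too.
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth == 1:
            if tag == "tr":
                self.rows.append([])
            elif tag in ("th", "td") and self.rows:
                self._cell = (tag, [])

    def handle_data(self, data):
        if self.depth == 1 and self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif self.depth == 1 and tag in ("th", "td") and self._cell is not None:
            cell_tag, chunks = self._cell
            self.rows[-1].append((cell_tag, "".join(chunks).strip()))
            self._cell = None


def extract_table(html_text, table_id):
    """Return (header_rows, data_rows) as lists of lists of cell strings."""
    parser = TableExtractor(table_id)
    parser.feed(html_text)
    parser.close()
    headers, data = [], []
    for row in parser.rows:
        if not row:
            continue  # skip rows without cells
        texts = [text for _, text in row]
        if all(tag == "th" for tag, _ in row) and not data:
            headers.append(texts)
        else:
            data.append(texts)
    return headers, data
```

The resulting rows of strings can then be handled like rows parsed from a CSV file.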

Best Practice 6: Avoid use of @colspan and @rowspan attributes

The processing algorithm for tabular data does not account for differences in column counts and row counts that might be present in HTML tables using the @colspan and/or @rowspan attributes; use of these attributes should be avoided. Note that a header row containing @colspan, or a data column containing @rowspan may be ignored using appropriate dialect descriptions.
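A publisher could check a document for such attributes before release. The Python sketch below is one possible approach, with the hypothetical names SpanDetector and find_spans; it reports any th or td cell whose colspan or rowspan value is something other than 1, and does not attempt to interpret the spans.

```python
from html.parser import HTMLParser


class SpanDetector(HTMLParser):
    """Records table cells that carry a colspan or rowspan other than 1."""

    def __init__(self):
        super().__init__()
        self.findings = []  # (cell tag, attribute name, attribute value)

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            for name, value in attrs:
                # A missing or empty value is treated as the default of 1.
                if name in ("colspan", "rowspan") and (value or "1").strip() != "1":
                    self.findings.append((tag, name, value))


def find_spans(html_text):
    """Return the list of spanning cells found in the document."""
    detector = SpanDetector()
    detector.feed(html_text)
    detector.close()
    return detector.findings
```

An empty result means no cell in the document spans multiple columns or rows.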

2.2 Example

The following tables are identified using #countries and #country_slice:

Countries
countryCode  latitude  longitude  name
AD           42.5      1.6        Andorra
AE           23.4      53.8       United Arab Emirates
AF           33.9      67.7       Afghanistan

Country Slice
countryRef  year  population
AF          1960  9616353
AF          1961  9799379
AF          1962  9989846

Example 3: Referenced HTML tables
<table id="countries">
  <caption>Countries</caption>
  <tr><th>countryCode</th><th>latitude</th><th>longitude</th><th>name</th></tr>
  <tr><td>AD</td><td>42.5</td><td>1.6</td><td>Andorra</td></tr>
  <tr><td>AE</td><td>23.4</td><td>53.8</td><td>United Arab Emirates</td></tr>
  <tr><td>AF</td><td>33.9</td><td>67.7</td><td>Afghanistan</td></tr>
</table>
<table id="country_slice">
  <caption>Country Slice</caption>
  <tr><th>countryRef</th><th>year</th><th>population</th></tr>
  <tr><td>AF</td><td>1960</td><td>9616353</td></tr>
  <tr><td>AF</td><td>1961</td><td>9799379</td></tr>
  <tr><td>AF</td><td>1962</td><td>9989846</td></tr>
</table>

The metadata is described here in a script element:

Example 4

Generating Minimal JSON from this document should result in the following:

Example 5: Minimal JSON output
[
  {
    "@id": "http://example.org/#countries-AD",
    "http://www.geonames.org/ontology#countryCode": "AD",
    "schema:latitude": 42.5,
    "schema:longitude": 1.6,
    "schema:name": "Andorra"
  },
  {
    "@id": "http://example.org/#countries-AE",
    "http://www.geonames.org/ontology#countryCode": "AE",
    "schema:latitude": 23.4,
    "schema:longitude": 53.8,
    "schema:name": "United Arab Emirates"
  },
  {
    "@id": "http://example.org/#countries-AF",
    "http://www.geonames.org/ontology#countryCode": "AF",
    "schema:latitude": 33.9,
    "schema:longitude": 67.7,
    "schema:name": "Afghanistan"
  },
  {
    "http://dbpedia.org/ontology/locationCountry": "http://example.org/#countries-AF",
    "http://dbpedia.org/property/urbanAreaDate": "1960",
    "http://www.geonames.org/ontology/population": 9616353
  },
  {
    "http://dbpedia.org/ontology/locationCountry": "http://example.org/#countries-AF",
    "http://dbpedia.org/property/urbanAreaDate": "1961",
    "http://www.geonames.org/ontology/population": 9799379
  },
  {
    "http://dbpedia.org/ontology/locationCountry": "http://example.org/#countries-AF",
    "http://dbpedia.org/property/urbanAreaDate": "1962",
    "http://www.geonames.org/ontology/population": 9989846
  }
]

Generating Minimal RDF from this document should result in the following:

Example 6: Minimal RDF output
@prefix geonames: <http://www.geonames.org/ontology#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/#countries-AD> schema:latitude 4.25e1;
   schema:longitude 1.6e0;
   schema:name "Andorra";
   geonames:countryCode "AD" .

<http://example.org/#countries-AE> schema:latitude 2.34e1;
   schema:longitude 5.38e1;
   schema:name "United Arab Emirates";
   geonames:countryCode "AE" .

<http://example.org/#countries-AF> schema:latitude 3.39e1;
   schema:longitude 6.77e1;
   schema:name "Afghanistan";
   geonames:countryCode "AF" .

 [
     <http://dbpedia.org/ontology/locationCountry> <http://example.org/#countries-AF>;
     <http://dbpedia.org/property/urbanAreaDate> "1962"^^xsd:gYear;
     <http://www.geonames.org/ontology/population> 9989846
 ] .

 [
     <http://dbpedia.org/ontology/locationCountry> <http://example.org/#countries-AF>;
     <http://dbpedia.org/property/urbanAreaDate> "1961"^^xsd:gYear;
     <http://www.geonames.org/ontology/population> 9799379
 ] .

 [
     <http://dbpedia.org/ontology/locationCountry> <http://example.org/#countries-AF>;
     <http://dbpedia.org/property/urbanAreaDate> "1960"^^xsd:gYear;
     <http://www.geonames.org/ontology/population> 9616353
 ] .

3. Extracting Tabular Data from embedded CSV

This section describes a mechanism for locating tabular data within an HTML document, extracting tabular data from an identified script element, and processing the tabular data to create annotated tables.

In addition to embedded metadata, CSV data may also be embedded within HTML using a script element. The general provisions and access patterns described in section 2. Extracting Tabular Data from HTML Tables apply for embedded CSV data.

3.1 Example

The following CSV script elements are identified using #countries and #country_slice:

Example 7: Referenced CSV data
<script id="countries" type="text/csv">
countryCode,latitude,longitude,name
AD,42.5,1.6,Andorra
AE,23.4,53.8,"United Arab Emirates"
AF,33.9,67.7,Afghanistan
</script>

<script id="country_slice" type="text/csv">
countryRef,year,population
AF,1960,9616353
AF,1961,9799379
AF,1962,9989846
</script>

The metadata shown in section 2.2 Example can be used to access embedded CSV as well as HTML tables.
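Locating and parsing an embedded CSV block might look like the following Python sketch. It is illustrative only: the names CsvScriptParser and extract_embedded_csv are hypothetical, the script element is matched on an exact type of text/csv and on its @id, and stripping the whitespace around the script content before parsing is an assumption of this sketch rather than a stated rule.

```python
import csv
import io
from html.parser import HTMLParser


class CsvScriptParser(HTMLParser):
    """Captures the content of the text/csv script element with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._capturing = False
        self._chunks = []
        self.text = None  # content of the first matching script, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "script" and self.text is None
                and attributes.get("type") == "text/csv"
                and attributes.get("id") == self.script_id):
            self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._capturing:
            self.text = "".join(self._chunks)
            self._capturing = False


def extract_embedded_csv(html_text, fragment):
    """Return the rows of the CSV block identified by a '#id' fragment."""
    parser = CsvScriptParser(fragment.lstrip("#"))
    parser.feed(html_text)
    parser.close()
    if parser.text is None:
        return None
    # Assumption: surrounding whitespace is not part of the data.
    source = io.StringIO(parser.text.strip(), newline="")
    return list(csv.reader(source))
```

For instance, extract_embedded_csv(html_text, "#country_slice") would return the header row followed by the data rows of the matching block, or None when no such block exists.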

A. References

A.1 Informative references

[DOM]
Anne van Kesteren; Aryeh Gregor; Ms2ger; Alex Russell; Robin Berjon. W3C DOM4. 19 November 2015. W3C Recommendation. URL: http://www.w3.org/TR/dom/
[RFC3986]
T. Berners-Lee; R. Fielding; L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. January 2005. Internet Standard. URL: https://tools.ietf.org/html/rfc3986
[csv2json]
Jeremy Tandy; Ivan Herman. Generating JSON from Tabular Data on the Web. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/csv2json/
[csv2rdf]
Jeremy Tandy; Ivan Herman; Gregg Kellogg. Generating RDF from Tabular Data on the Web. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/csv2rdf/
[html5]
Ian Hickson; Robin Berjon; Steve Faulkner; Travis Leithead; Erika Doyle Navara; Theresa O'Connor; Silvia Pfeiffer. HTML5. 28 October 2014. W3C Recommendation. URL: http://www.w3.org/TR/html5/
[json-ld]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
[tabular-data-model]
Jeni Tennison; Gregg Kellogg. Model for Tabular Data and Metadata on the Web. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/tabular-data-model/
[tabular-metadata]
Jeni Tennison; Gregg Kellogg. Metadata Vocabulary for Tabular Data. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/tabular-metadata/