Generating JSON from Tabular Data on the Web

Abstract

This document defines the procedures and rules to be applied when mapping tabular data into JSON. Tabular data may be complemented with metadata annotations that describe its structure, the meaning of its content and how it may form part of a collection of interrelated tabular data. This document specifies the effect of this metadata on the resulting JSON.

4. Mapping Annotated Tabular Data

The procedures and rules for mapping annotated tabular data compliant with the annotated tabular data model are described below.

The metadata for annotated tabular data MAY be provided by either or both of the following sources:

the header line within the CSV file; and/or
the table description, schema and column descriptions (as defined in [tabular-metadata]) within the associated metadata document.

Mapping applications SHALL establish a column description object for each column within the annotated tabular data. The column description object contains the aggregated set of metadata properties for a given column that affect how the cell values within the associated column are expressed in the output graph. Metadata properties are sourced from the header line in the CSV file and column description in the metadata document and enriched with inherited properties from the table description and schema.

The output graph MAY include some Direct Annotations sourced from the metadata document. Where these are natural language properties, no locale information can be provided given the lack of multi-lingual support in JSON. Where an array of values for a property is provided using a language map (as defined in [json-ld]), the array of locale-specific values SHALL be included in the output graph, albeit without the inferred semantics about language codes.

To process annotated tabular data, a mapping application MUST have access to the full metadata description associated with the tabular data.

URL expansion behaviour of relative URLs SHALL be consistent with Section 6.3 IRI Expansion in [json-ld-api]. The base URL provides the URL against which relative URLs from annotated tabular data are resolved. The base URL SHALL be that of the source CSV file.
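The resolution of relative URLs against the base URL can be sketched as follows. This is a minimal illustration that uses ordinary RFC 3986 reference resolution from the Python standard library; it does not implement the full IRI Expansion algorithm of [json-ld-api] (terms and compact IRIs are ignored). The function name resolve is an assumption made for this sketch, and the base URL is the example location used in Section 4.2.

```python
from urllib.parse import urljoin

# Base URL: the location of the source CSV file (example URL from Section 4.2).
BASE_URL = "http://example.org/tree-ops.csv"


def resolve(relative_url: str, base_url: str = BASE_URL) -> str:
    """Resolve a possibly relative URL against the base URL.

    Hypothetical helper: plain RFC 3986 resolution only, not the complete
    JSON-LD IRI Expansion algorithm.
    """
    return urljoin(base_url, relative_url)


# A fragment-only reference is appended to the base URL.
fragment_url = resolve("#gid-1")

# An absolute URL is returned unchanged.
absolute_url = resolve("http://example.com/x")
```
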

Issue 91

What is default value if @base is not defined in the metadata description?

4.1 Generating JSON

4.1.1 Table-level processing

The root object of the output graph SHALL describe the annotated table.
Where provided in the table description, the metadata property @id SHALL be used to identify the root object.

The root object SHALL be identified using the name/value pair "url": "[@id]" where [@id] is the value of metadata property @id.
The root object SHALL contain a reference to the CSV-encoded distribution of the tabular data.

The distribution SHALL be described using an object with name distribution that contains the name/value pair "downloadURL": "[CSV-LOCATION]" where [CSV-LOCATION] is the absolute URL of the source CSV file.

Note
The semantics of properties distribution and downloadURL correspond with the properties dcat:distribution and dcat:downloadURL as defined in [vocab-dcat].

Issue 93
Are the abstract tabular data and the CSV that encodes it the same thing? (Is DCAT distribution appropriate?)

Where a header line is present in the CSV file then, for each column, the cell value from the column header SHALL be assigned to metadata property name within the column description object for that column.

Where the column header is null, the value assigned to name SHALL be _col=[N] where [N] is the column number.
Where present in the table description, the following metadata properties SHALL be included in the output graph within the root object:
- notes - the array of objects representing structured annotations on the tabular data SHALL be included verbatim.
  
  Note
  The Web Annotation Working Group is developing a vocabulary for expressing annotations which we anticipate referencing from this specification. Issues likely to be covered therein include: how to anchor the annotation to a target in the tabular data and/or CSV file, what form the annotations themselves may take (e.g. a simple literal annotation body, or whether additional formatting properties are required to indicate that the annotation is expressed in, say, Markdown or HTML).
  
  Issue 71
  
  Exact handling of annotations.
  
  Additionally, the mechanism to reference the annotation target (within tabular data) is still unclear - especially given the confusion on identifying row numbers (ISSUE #68 refers).
- Any Common Properties (as defined in Section 3.3 Common Properties of [tabular-metadata]).
Any of the inherited properties null, separator, format, datatype, or default defined within the table description and/or schema SHALL be added to the column description object for each column.

Where the same property is defined in both the table description and the schema, the value from the schema SHALL take precedence.
Each column description SHALL be matched to a column in the tabular data based on the order that the description is listed in the columns array of the schema.

For each column description in the metadata document, the following metadata properties SHALL be added to the relevant column description object established by the mapping application:
- name.
  
  Where metadata property name is also provided via the header line the value from the column description in the metadata document SHALL take precedence.
- predicateUrl.
- urlTemplate.
- Inherited properties null, separator, format, datatype, and default are added to the column description object, overwriting values added in the previous step where properties are duplicated.
For each column description object where metadata property predicateUrl has not been assigned within the column description, the value of predicateUrl SHALL be set as the value of metadata property name.
The root object SHALL contain an array named row containing one object for each of the rows within the tabular data.

Refer to Section 4.1.2 Row-level processing for further details.
The output graph MAY contain information about the metadata documents that were used when creating the output graph using the following name/value pair for each metadata document:

"describedBy": "[Metadata Location]"

where [Metadata Location] is the absolute URL of the metadata document.
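The steps above for establishing column description objects can be sketched as follows. This is an illustrative outline under stated assumptions rather than a reference implementation: the function name build_column_descriptions is hypothetical, column numbers are assumed to start at 1, and an empty header cell is assumed to be the null column header.

```python
INHERITED = ("null", "separator", "format", "datatype", "default")


def build_column_descriptions(header, table_description=None):
    """Aggregate metadata into one column description object per column."""
    table_description = table_description or {}
    schema = table_description.get("schema", {})

    # Inherited properties: table description first, then the schema,
    # so that the schema value takes precedence when both define a property.
    inherited = {}
    for source in (table_description, schema):
        for key in INHERITED:
            if key in source:
                inherited[key] = source[key]

    columns_meta = schema.get("columns", [])
    columns = []
    for index in range(max(len(header), len(columns_meta))):
        header_name = header[index] if index < len(header) else None
        column = dict(inherited)
        # Assumption: an empty header cell counts as null; numbering from 1.
        column["name"] = header_name if header_name else f"_col={index + 1}"

        # Column descriptions are matched by position in the columns array
        # and overwrite values taken from the header line or inherited.
        if index < len(columns_meta):
            meta = columns_meta[index]
            for key in ("name", "predicateUrl", "urlTemplate") + INHERITED:
                if key in meta:
                    column[key] = meta[key]

        # predicateUrl falls back to the value of name.
        column.setdefault("predicateUrl", column["name"])
        columns.append(column)
    return columns


header_only = build_column_descriptions(["country", "name"])
with_metadata = build_column_descriptions(
    ["GID", ""],
    {
        "datatype": "string",
        "schema": {
            "datatype": "integer",
            "columns": [{"name": "gid", "datatype": "string"}],
        },
    },
)
```

In the second call, the first column takes its name and datatype from the column description, while the second column has no header and no column description, so it is named _col=2 and inherits the schema-level datatype.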

4.1.2 Row-level processing

Each row in the tabular data is processed sequentially. The behaviour exhibited when processing a given cell within the current row is dependent on the metadata properties of the column description object for the column that that cell resides in. The effect of each metadata property is defined in Section 4.1.3 Metadata property effects on row-level mapping behaviour.

Each object in the output graph corresponding to a row in the tabular data SHALL contain one name/value pair for each column where the cell value is not null.
Where the metadata property urlTemplate is provided in the schema, each object in the output graph corresponding to a row in the tabular data SHALL be explicitly identified using the name/value pair "url": "[EXPANDED-URI-TEMPLATE]" where [EXPANDED-URI-TEMPLATE] is the value resulting from the expansion of the [uri-template] specified in the urlTemplate property.

Note

The variables in the URI Template expression relate to the name property specified for each column. During template expansion, the variables evaluate to the cell value within the row being processed that is associated with the named column.

The variable _row evaluates to the number of the row being processed.

Once the URL has been generated via the template expansion, relative URLs are resolved against the base URL to create an absolute URL.
Where the cell value is null, the name/value pair SHALL be omitted from the output graph unless a default value is specified for that column (see metadata properties null and default).
The value of name for the name/value pair SHALL be provided by the predicateUrl metadata property within the column description object.

In accordance with [json], all Unicode characters may be used. However, the following characters SHALL be escaped:
- quotation mark; " (U+0022)
- reverse solidus; \ (U+005C)
- control characters; (U+0000 through U+001F)
The value of value SHALL be provided by the associated cell value subject to the effect of the metadata properties for that column (if any are specified).

Note

Given the lack of multi-lingual support in JSON, internationalized strings are simply copied verbatim into the output graph without any additional locale information.

Also note that values of metadata property name must be unique within a table. Therefore if, say, English and French versions of the same property are provided as complementary columns in a CSV file then their name properties must be different; e.g. title_en and titre_fr. It may be possible for applications to infer a language from the name of a column - but this is entirely dependent on the syntax used by the metadata author when defining values of the name property.
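The row-level rules above can be sketched as follows. This is a simplified illustration: the function name map_row and the sample column description objects are assumptions made for the example, and only the null and default properties are considered. The standard json module is used to show that quotation marks, reverse solidus and control characters are escaped on serialization.

```python
import json


def map_row(cells, columns):
    """Build the JSON object for one row from its cell values.

    Hypothetical helper: each column description object is a dict with at
    least a predicateUrl entry, and optionally null and default entries.
    """
    row = {}
    for cell, column in zip(cells, columns):
        value = cell
        # By default an empty string is deemed to be null.
        if value == column.get("null", ""):
            value = column.get("default")
            if value is None:
                continue  # null and no default: omit the name/value pair
        row[column["predicateUrl"]] = value
    return row


columns = [
    {"name": "country", "predicateUrl": "country"},
    {"name": "name", "predicateUrl": "name", "default": "unknown"},
    {"name": "note", "predicateUrl": "note"},
]

row_object = map_row(["AD", "", ""], columns)

# json.dumps escapes ", \ and control characters such as the newline below.
escaped = json.dumps('a"b\\c\n')
```
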

4.1.3 Metadata property effects on row-level mapping behaviour

The following metadata properties modify the way that cell values are incorporated into the output graph:

null

By default, a cell value is deemed to be null if it contains an empty string. If specified, the metadata property null provides a token (string) that can be used to identify null values.

separator

Where metadata property separator is defined, the cell value SHALL be parsed into an ordered list of values, using the value of separator as the delimiter.

The list of values SHALL be expressed in the output graph as an array.
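As a minimal sketch of the separator behaviour, the function below splits a cell value into a list that serializes as a JSON array. The name split_cell and the optional convert hook are assumptions made for this illustration; the hook stands in for whatever per-member datatype conversion an implementation applies.

```python
import json


def split_cell(cell_value, separator=None, convert=None):
    """Split a cell value on the separator, if one is defined.

    Hypothetical helper: convert is an optional callable applied to each
    member of the resulting list.
    """
    if separator is None:
        return cell_value
    members = cell_value.split(separator)
    if convert is not None:
        members = [convert(member) for member in members]
    return members


tags = split_cell("tree;street;maintenance", ";")
numbers = split_cell("1;2;3", ";", int)
serialized = json.dumps({"tags": tags})
```
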

datatype and format

Where metadata property datatype is undefined, the column SHALL be inferred to hold values of datatype string.

The following datatypes are given special attention:

Datatypes with embedded syntax: xml, json and html.

These datatypes are treated as literal values; mapping applications SHOULD NOT attempt to 'unpack' the structured syntax to create sub-objects within the output graph.
Booleans: boolean.

Metadata property format MAY be provided for a boolean-typed column, providing non-standard tokens for true and false (e.g. Y|N); see Section 3.12.3 Formats for booleans in [tabular-metadata].

If a boolean type is declared, the cell value SHALL be processed as follows:
1. if the value is true, 1 or, if the format property is defined, the value of true, then the output graph SHALL include the value true;
2. else if the value is false, 0 or, if the format property is defined, the value of false, then the output graph SHALL include the value false;
3. else the output graph SHALL include the cell value verbatim.
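The three steps above can be sketched as follows. The function name map_boolean is hypothetical, and the sketch assumes that the format property has the form TRUE-TOKEN|FALSE-TOKEN, as suggested by the Y|N example.

```python
def map_boolean(cell_value, fmt=None):
    """Map a cell value in a boolean-typed column.

    Assumption: fmt, when given, is "TRUE-TOKEN|FALSE-TOKEN" (e.g. "Y|N").
    """
    true_tokens = {"true", "1"}
    false_tokens = {"false", "0"}
    if fmt:
        true_token, _, false_token = fmt.partition("|")
        true_tokens.add(true_token)
        if false_token:
            false_tokens.add(false_token)

    if cell_value in true_tokens:
        return True
    if cell_value in false_tokens:
        return False
    # Unrecognized tokens are passed through verbatim.
    return cell_value
```
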
Numbers: number, decimal, integer, nonPositiveInteger, negativeInteger, long, int, short, nonNegativeInteger, unsignedLong, unsignedInt, unsignedShort, positiveInteger, float and double.

Cell values that are asserted to be numeric shall be expressed in the output graph as numbers; double quotes will be omitted.

It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.

Metadata property format MAY be provided to describe the formatting of the cell values and so help the mapping application convert the cell value to a number format readily consumable by downstream applications.

Issue 54

Describing the formatting of numbers is currently unresolved and is likely to require information on decimal separator characters, grouping characters and possibly others such as Infinity, NaN, currency tokens, negative numbers appearing in parentheses etc.

In the interim, mapping applications are not required to undertake any reformatting and may simply pass the cell value to the output graph verbatim.
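Pending resolution of the issue above, one possible interim behaviour is sketched below: cell values that parse directly as numbers are emitted as JSON numbers, and anything else is passed through verbatim. The function name map_number is an assumption, and the parsing rules are those of Python's int and float rather than anything defined by this specification. Non-finite values are kept as strings because JSON has no representation for them.

```python
import json
import math


def map_number(cell_value):
    """Convert a cell value to a number where possible, else return it verbatim."""
    try:
        return int(cell_value)
    except ValueError:
        pass
    try:
        number = float(cell_value)
    except ValueError:
        return cell_value  # e.g. "1,234" or "50%": no reformatting attempted
    # NaN and Infinity are not valid JSON numbers, so keep the original text.
    return number if math.isfinite(number) else cell_value


serialized = json.dumps({"count": map_number("42"), "ratio": map_number("3.5")})
```
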
Dates, times and durations: date, time, datetime, dateTime and duration.

A standard syntax for dates and times is defined by [iso8601]. This format can be readily consumed by software applications. However, dates and times are often provided in a locale-specific format, or use alternate calendars and/or eras.

Metadata property format MAY be provided to describe the formatting of cell values and so help the mapping application convert the cell value to a date, time, date-time or duration format readily consumable by downstream applications.

Note
Where possible, data publishers SHOULD provide dates and times in the [iso8601] format. However, where data publishers choose to use locale-specific date and time formatting, they SHOULD also provide equivalent values in [iso8601] format (e.g. in a complementary column).

Issue 54

Describing the formatting of dates and times is currently unresolved. The favoured option is to defer the parsing of dates and times to implementations based on a picture string provided in the metadata. Unfortunately, there is no standard syntax for picture strings, therefore an array of picture strings relating to common implementations seems like the best option. For example:

"datatype": "date",
"format": {
  "picture-strings": [
    { "unicode": "dd MMM yyyy" },
    { "xpath": "[D01] [MN,*-3] [Y0001]" }
  ]
}

Where an implementation is able to interpret one of the provided picture strings, the date-time value reformatted in [iso8601] format shall be included in the output graph, else the original cell value shall be included verbatim.

In the interim, mapping applications are not required to undertake any reformatting and may simply pass the cell value to the output graph verbatim.
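The fallback behaviour described in this issue can be sketched as follows. Since the picture string syntax is unresolved, this illustration assumes a hypothetical strptime-style pattern that an implementation has derived from one of the provided picture strings; neither the function name map_date nor that pattern syntax is defined by this specification.

```python
from datetime import datetime


def map_date(cell_value, strptime_pattern=None):
    """Reformat a date cell value to ISO 8601, or return it verbatim.

    Assumption: strptime_pattern is a Python strptime pattern standing in
    for a picture string that the implementation is able to interpret.
    """
    if strptime_pattern is None:
        return cell_value
    try:
        parsed = datetime.strptime(cell_value, strptime_pattern)
    except ValueError:
        return cell_value  # value does not match the pattern: keep verbatim
    return parsed.date().isoformat()
```

Applied to the Inventory Date values used in Section 4.2 with the pattern %m/%d/%Y, the value 10/18/2010 would be emitted as 2010-10-18; without a usable pattern it is left unchanged.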

Issue 65

A list of potential date-time formatting implementations needs to be defined.

Note

Where the metadata property separator is specified (e.g. to indicate that a cell value is to be parsed into a list of values), the datatype specified by datatype SHALL be inferred to apply to the members of the resulting list.

urlTemplate

If metadata property urlTemplate is specified, the value used in the output graph SHALL be the result of the URI Template expansion, as defined in Section 3.1 Property Syntax of [tabular-metadata].

Once the URL has been generated via the template expansion, relative URLs are resolved against the base URL to create an absolute URL.
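A minimal sketch of template expansion followed by resolution against the base URL is given below. It handles only simple {variable} expressions and is not a complete [uri-template] implementation; the function name expand_url_template and its parameters are assumptions made for this illustration.

```python
import re
from urllib.parse import quote, urljoin


def expand_url_template(template, row_values, row_number, base_url):
    """Expand simple {variable} expressions and resolve the result.

    row_values maps column names to the cell values of the current row;
    the variable _row evaluates to the number of the row being processed.
    """
    variables = dict(row_values, _row=row_number)

    def substitute(match):
        value = variables.get(match.group(1))
        # Undefined variables expand to nothing; other values are
        # percent-encoded except for unreserved characters.
        return "" if value is None else quote(str(value), safe="")

    expanded = re.sub(r"\{([^{}]+)\}", substitute, template)
    return urljoin(base_url, expanded)


row_url = expand_url_template(
    "#gid-{GID}", {"GID": "1"}, 1, "http://example.org/tree-ops.csv"
)
```
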

default

If metadata property default is specified and the cell value is deemed to be null, then the value of default SHALL be used in the output graph.

4.2 Examples

Issue

These examples don't really show the edge cases - probably need to rework them

The first example illustrates how a CSV file with metadata annotations drawn only from a header line is processed. The tabular data lists countries, giving their country code and name. There are two columns, named country and name, and four rows.

country	name
AD	Andorra
AF	Afghanistan
AI	Anguilla
AL	Albania

The CSV input (published at http://example.org/country-codes-and-names.csv):

Example 3: CSV input

country,name
AD,Andorra
AF,Afghanistan
AI,Anguilla
AL,Albania

The resulting JSON output graph:

Example 4: JSON output

{
  "distribution": {
    "downloadURL": "http://example.org/country-codes-and-names.csv"
  },
  "row": [{
    "country": "AD",
    "name": "Andorra"
  },{
    "country": "AF",
    "name": "Afghanistan"
  },{
    "country": "AI",
    "name": "Anguilla"
  },{
    "country": "AL",
    "name": "Albania"
  }]
}
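The mapping shown in this first example can be reproduced with a short sketch. It covers only the case where all metadata comes from the header line; the function name csv_to_json is an assumption, empty cells are treated as null, and column numbers are assumed to start at 1.

```python
import csv
import io

CSV_URL = "http://example.org/country-codes-and-names.csv"
CSV_TEXT = """country,name
AD,Andorra
AF,Afghanistan
AI,Anguilla
AL,Albania
"""


def csv_to_json(csv_text, csv_url):
    """Map header-annotated CSV text to the structure of the output graph."""
    records = list(csv.reader(io.StringIO(csv_text)))
    header, data = records[0], records[1:]
    # Columns with a null (empty) header are named _col=[N].
    names = [name if name else f"_col={number}"
             for number, name in enumerate(header, start=1)]
    return {
        "distribution": {"downloadURL": csv_url},
        "row": [
            # Null (empty) cells are omitted from the row object.
            {name: cell for name, cell in zip(names, record) if cell != ""}
            for record in data
        ],
    }


output = csv_to_json(CSV_TEXT, CSV_URL)
```

Serializing output with json.dumps(output, indent=2) yields a document equivalent to Example 4, differing only in whitespace.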

The second example illustrates how the mapping is modified with the addition of metadata annotations in a metadata document. The CSV file is a small extract from a much larger Tree Inventory dataset from the City of Palo Alto which supports maintaining and tracking the city's public trees and urban forest. There are five columns, named GID, On Street, Species, Trim Cycle and Inventory Date, and three rows.

GID	On Street	Species	Trim Cycle	Inventory Date
1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010
2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010
3	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010

The CSV input (published at http://example.org/tree-ops.csv):

Example 5: CSV input

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

The metadata description (published at http://example.org/tree-ops.csv-metadata.json):

Example 6: Metadata description

{
  "@id": "tree-ops",
  "@context": {
    "@language": "en"
  },
  "dcat:distribution": {
    "dcat:downloadURL": "tree-ops.csv"
  },
  "dc:title": "Tree Operations",
  "dc:keywords": ["tree", "street", "maintenance"],
  "dc:publisher": [{
    "schema:name": "Example Municipality",
    "schema:web": "http://example.org"
  }],
  "dc:license": "http://opendefinition.org/licenses/cc-by/",
  "dc:modified": "2010-12-31",
  "schema": {
    "columns": [{
      "name": "GID",
      "title": [
        "GID",
        "Generic Identifier"
      ],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true,
      "unique": true
    }, {
      "name": "on-street",
      "title": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "title": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim-cycle",
      "title": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory-date",
      "title": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": "date"
    }],
    "primaryKey": "GID",
    "urlTemplate": "#gid-{GID}"
  }
}

The resulting JSON output graph:

Example 7: JSON output

{
  "url": "tree-ops",
  "distribution": {
    "downloadURL": "http://example.org/tree-ops.csv"
  },
  "dc:title": "Tree Operations",
  "dc:keywords": ["tree", "street", "maintenance"],
  "dc:publisher": [{
    "schema:name": "Example Municipality",
    "schema:web": "http://example.org"
  }],
  "dc:license": "http://opendefinition.org/licenses/cc-by/",
  "dc:modified": "2010-12-31",
  "row": [{
    "url": "http://example.org/tree-ops.csv#gid-1",
    "GID": "1",
    "on-street": "ADDISON AV",
    "species": "Celtis australis",
    "trim-cycle": "Large Tree Routine Prune",
    "inventory-date": "10/18/2010"
  },{
    "url": "http://example.org/tree-ops.csv#gid-2",
    "GID": "2",
    "on-street": "EMERSON ST",
    "species": "Liquidambar styraciflua",
    "trim-cycle": "Large Tree Routine Prune",
    "inventory-date": "6/2/2010"
  },{
    "url": "http://example.org/tree-ops.csv#gid-3",
    "GID": "3",
    "on-street": "EMERSON ST",
    "species": "Liquidambar styraciflua",
    "trim-cycle": "Large Tree Routine Prune",
    "inventory-date": "6/2/2010"
  }],
  "describedBy": "http://example.org/tree-ops.csv-metadata.json"
}

W3C First Public Working Draft 08 January 2015

Abstract

Status of This Document

Table of Contents

1. Introduction

2. Conformance

3. Mapping Core Tabular Data

3.1 Generating JSON

3.1.1 Table-level processing

3.1.2 Row-level processing

3.2 Examples

4. Mapping Annotated Tabular Data

4.1 Generating JSON

4.1.1 Table-level processing

4.1.2 Row-level processing

4.1.3 Metadata property effects on row-level mapping behaviour

4.2 Examples

5. Mapping Grouped Tabular Data

5.1 Generating JSON

5.1.1 Group-level processing

5.2 Examples

A. References

A.1 Normative references

A.2 Informative references