Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document defines the procedures and rules to be applied when mapping tabular data into JSON. Tabular data may be complemented with metadata annotations that describe its structure, the meaning of its content and how it may form part of a collection of interrelated tabular data. This document specifies the effect of this metadata on the resulting JSON.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The CSV on the Web Working Group was chartered to produce Recommendations for "Access methods for CSV Metadata", "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to satisfy the JSON variant of the mapping Recommendation.
Due to the limited resources available within the CSV on the Web Working Group, this document describes only a simple mapping—that is, where a single object is created for each row of tabular data that contains a single property per cell. The Working Group solicits input on the value of mapping a single row of tabular data into multiple inter-related objects.
This document was published by the CSV on the Web Working Group as a First Public Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 August 2014 W3C Process Document.
This document describes the processing of tabular data to create a set of nested objects referred to as the output graph. The output graph SHALL be serialized as JSON [json].
The JSON encoding is intended for web developers who need not care about the complexities of RDF [rdf11-concepts].
The Tabular Data Model [tabular-data-model] defines a core tabular data model consisting of tables, columns, rows and cells.
Tabular data may be enriched with metadata that describes its structure and the meaning of its content. These metadata annotations are described in [tabular-metadata] and may be embedded within the CSV encoding itself as a header line or provided within a separate metadata document. The resulting annotated table conforms to the annotated tabular data model.
The metadata annotations may describe how a table relates to a group of tables. Such collections conform to the grouped tabular data model.
The mapping procedure operates on the abstract tabular data model: core, annotated or grouped. This document does not discuss the processes needed to convert CSV-encoded data into tabular data form. Please refer to [tabular-data-model] for details of parsing tabular data. Further details on parsing cells within tabular data are provided in [tabular-metadata].
Adopting terminology from the Data Catalog Vocabulary [vocab-dcat], the tabular data is considered to be a dataset, whilst the CSV file within which that tabular data is encoded is considered to be a distribution of that tabular data.
Are the abstract tabular data and the CSV that encodes it the same thing? (Is DCAT distribution appropriate?)
The mapping procedure is intended to be simple, to encourage the provision of compliant mapping applications. The limitation of this simple mapping is that a single object is created for each row of tabular data, and that object contains a single property per cell.
An annotated table may include a reference to a template specification (see [tabular-metadata]) that describes how tabular data can be transformed into another format using a template-based approach. Templating facilitates far more sophisticated transformations than are possible using the simple mapping.
There is no standard template syntax; template specifications may therefore be written using existing template languages, such as Mustache.
The processing of template specifications during the mapping is yet to be determined by the Working Group and is, at least for the interim, beyond the scope of this document.
Finally, note that the mapping procedure is considered to be entirely textual. There is no requirement on compliant mapping applications to check the semantic consistency of the data during the mapping, nor validate the cell values against JSON syntax rules. Where cell values within CSV encoded content are improperly formatted, the output from the mapping is likely to include syntax errors. Downstream applications should be aware of this and take appropriate action.
Should the RDF/JSON transformation check the values?
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, SHALL, and SHOULD are to be interpreted as described in [RFC2119].
Tabular data MUST conform to the description from [tabular-data-model]. In particular note that each row MUST contain the same number of cells (although some of these cells may be empty). Given this constraint, not all CSV-encoded data can be considered to be tabular data. As such, the procedures and rules defined in this document cannot be applied to all CSV files.
This document relies on terms (e.g. group, table, column, row, cell) defined in [tabular-data-model].
The procedures and rules for mapping tabular data compliant with the core tabular data model are described below.
Core tabular data lacks any annotation, whether from a header line within the CSV file or from a separate metadata document.
The root object of the output graph SHALL describe the table.
The root object SHALL contain a reference to the CSV-encoded distribution of the tabular data.
The distribution SHALL be described using an object with name distribution that contains the name/value pair "downloadURL": "[CSV-LOCATION]" where [CSV-LOCATION] is the absolute URL of the source CSV file.
The semantics of properties distribution and downloadURL correspond with the properties dcat:distribution and dcat:downloadURL as defined in [vocab-dcat].
Are the abstract tabular data and the CSV that encodes it the same thing? (Is DCAT distribution appropriate?)
The root object SHALL contain an array named row containing one object for each of the rows within the tabular data.
What should the name of the property be that relates rows to the table? row is one option; hasRow is another.
Refer to Section 3.1.2 Row-level processing for further details.
Each row in the tabular data is processed sequentially.
Each object in the output graph corresponding to a row in the tabular data SHALL contain one name/value pair for each column where the cell value is not null (e.g. where the cell does not contain an empty string; "").
The value of name SHALL be _col=[N] where [N] is the column number, whilst the value of value SHALL be provided by the associated cell value.
What to do with conversion if no column name is given?
Where the cell value is null, the name/value pair SHALL be omitted from the output graph.
Given the absence of metadata annotations to indicate the type of data present in a given column, all cell values SHALL be treated as strings.
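The row-level rules above can be sketched in Python as follows. This is an illustration only, not a normative implementation: the function name map_core_row is hypothetical, and the cells are assumed to have already been parsed into a list of strings as described in [tabular-data-model].

```python
def map_core_row(cells):
    """Map one row of core tabular data to a JSON-style object (a dict).

    An empty string is treated as null, so its name/value pair is omitted.
    """
    obj = {}
    for number, value in enumerate(cells, start=1):
        if value == "":
            continue  # null cell: omit the pair entirely
        # No datatype annotations exist, so every value stays a string
        obj["_col={}".format(number)] = value
    return obj


row = map_core_row(["2", "Eve", "", "94"])
```

For the row 2,Eve,,94 the resulting object carries only the names _col=1, _col=2 and _col=4, because the third cell is null.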
Should the mapping output for a given row include a reference to the CSV source row?
The following example provides a numeric score for four fictional people. A row number is included for convenience. There are four columns and four rows. Given that no metadata annotations are provided, it is very difficult to ascertain the subject of the tabular data without additional insight.
1 | Jill | Smith | 50 |
2 | Eve | | 94 |
3 | Adam | Johnson | |
4 | John | Doe | 80 |
The CSV input (published at http://example.org/people-and-points.csv):
1,Jill,Smith,50
2,Eve,,94
3,Adam,Johnson,
4,John,Doe,80
The resulting JSON output graph:
{
  "distribution": {
    "downloadURL": "http://example.org/people-and-points.csv"
  },
  "row": [
    { "_col=1": "1", "_col=2": "Jill", "_col=3": "Smith", "_col=4": "50" },
    { "_col=1": "2", "_col=2": "Eve", "_col=4": "94" },
    { "_col=1": "3", "_col=2": "Adam", "_col=3": "Johnson" },
    { "_col=1": "4", "_col=2": "John", "_col=3": "Doe", "_col=4": "80" }
  ]
}
The procedures and rules for mapping annotated tabular data compliant with the annotated tabular data model are described below.
The metadata for annotated tabular data MAY be provided by either or both of the following sources: a header line within the CSV file, and a separate metadata document.
Mapping applications SHALL establish a column description object for each column within the annotated tabular data. The column description object contains the aggregated set of metadata properties for a given column that affect how the cell values within the associated column are expressed in the output graph. Metadata properties are sourced from the header line in the CSV file and column description in the metadata document and enriched with inherited properties from the table description and schema.
The output graph MAY include some Direct Annotations sourced from the metadata document. Where these are natural language properties, no locale information can be provided given the lack of multi-lingual support in JSON. Where an array of values for a property is provided using a language map (as defined in [json-ld]), the array of locale-specific values SHALL be included in the output graph, albeit without the inferred semantics about language codes.
Clearly, in order to process annotated tabular data, a mapping application MUST have access to the full metadata description associated with the tabular data.
URL expansion behaviour of relative URLs SHALL be consistent with Section 6.3 IRI Expansion in [json-ld-api]. The base URL provides the URL against which relative URLs from annotated tabular data are resolved. The base URL SHALL be that of the source CSV file.
What is default value if @base is not defined in the metadata description?
The root object of the output graph SHALL describe the annotated table.
Where provided in the table description, the metadata property @id SHALL be used to identify the root object.
The root object SHALL be identified using the name/value pair "url": "[@id]" where [@id] is the value of metadata property @id.
The root object SHALL contain a reference to the CSV-encoded distribution of the tabular data.
The distribution SHALL be described using an object with name distribution that contains the name/value pair "downloadURL": "[CSV-LOCATION]" where [CSV-LOCATION] is the absolute URL of the source CSV file.
The semantics of properties distribution and downloadURL correspond with the properties dcat:distribution and dcat:downloadURL as defined in [vocab-dcat].
Are the abstract tabular data and the CSV that encodes it the same thing? (Is DCAT distribution appropriate?)
Where a header line is present in the CSV file then, for each column, the cell value from the column header SHALL be assigned to metadata property name within the column description object for that column.
Where the column header is null, the value assigned to name SHALL be _col=[N] where [N] is the column number.
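As a minimal sketch of the header-line rule above, the following Python function seeds one column description object per column. The function name is hypothetical, and the header cells are assumed to arrive as a list of already-parsed strings in which an empty string stands for a null header.

```python
def column_descriptions_from_header(header_cells):
    """Create one column description object (a dict) per header cell."""
    descriptions = []
    for number, cell in enumerate(header_cells, start=1):
        # A null header falls back to the positional name _col=[N]
        name = cell if cell != "" else "_col={}".format(number)
        descriptions.append({"name": name})
    return descriptions


descriptions = column_descriptions_from_header(["country", "", "name"])
```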
Where present in the table description, the following metadata properties SHALL be included in the output graph within the root object:
notes - the array of objects representing structured annotations on the tabular data SHALL be included verbatim.
The Web Annotation Working Group is developing a vocabulary for expressing annotations which we anticipate referencing from this specification. Issues likely to be covered therein include: how to anchor the annotation to a target in the tabular data and/or CSV file, what form the annotations themselves may take (e.g. a simple literal annotation body, or whether additional formatting properties are required to indicate that the annotation is expressed in, say, Markdown or HTML).
Any Common Properties (as defined in Section 3.3 Common Properties of [tabular-metadata]).
Any of the inherited properties null, separator, format, datatype, or default defined within the table description and/or schema SHALL be added to the column description object for each column.
Where the same property is defined in both the table description and the schema, the value from the schema SHALL take precedence.
Each column description SHALL be matched to a column in the tabular data based on the order that the description is listed in the columns array of the schema.
For each column description in the metadata document, the following metadata properties SHALL be added to the relevant column description object established by the mapping application:
name. Where metadata property name is also provided via the header line, the value from the column description in the metadata document SHALL take precedence.
predicateUrl.
urlTemplate.
Inherited properties null, separator, format, datatype, and default are added to the column description object, overwriting values added in the previous step where properties are duplicated.
For each column description object where metadata property predicateUrl has not been assigned within the column description, the value of predicateUrl SHALL be set as the value of metadata property name.
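The aggregation steps above can be sketched as a single merge in which later sources overwrite earlier ones. This is an illustration under stated assumptions: the function name is hypothetical, the metadata is assumed to be available as plain Python dicts, and header_names is assumed to hold one name per column (taken from the header line, or _col=[N] where the header is null).

```python
INHERITED = ("null", "separator", "format", "datatype", "default")


def build_column_descriptions(header_names, table_description):
    """Aggregate metadata into one column description object per column."""
    schema = table_description.get("schema", {})
    columns = schema.get("columns", [])
    descriptions = []
    for index, header_name in enumerate(header_names):
        desc = {"name": header_name}
        # Inherited properties: table description first, schema overrides it
        for source in (table_description, schema):
            for prop in INHERITED:
                if prop in source:
                    desc[prop] = source[prop]
        # Column descriptions are matched to columns by position
        if index < len(columns):
            column = columns[index]
            for prop in ("name", "predicateUrl", "urlTemplate") + INHERITED:
                if prop in column:
                    desc[prop] = column[prop]
        # predicateUrl defaults to the (possibly overridden) name
        desc.setdefault("predicateUrl", desc["name"])
        descriptions.append(desc)
    return descriptions


# Made-up metadata, for illustration only
table_description = {
    "datatype": "string",
    "null": "n/a",
    "schema": {
        "datatype": "integer",
        "columns": [
            {"name": "GID"},
            {"name": "on-street", "datatype": "string",
             "predicateUrl": "http://example.org/street"},
        ],
    },
}
result = build_column_descriptions(["GID", "On Street", "Species"], table_description)
```

In this made-up input the third column has no column description in the metadata document, so it keeps the name from the header line and the datatype inherited from the schema.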
The root object SHALL contain an array named row containing one object for each of the rows within the tabular data.
Refer to Section 4.1.2 Row-level processing for further details.
The output graph MAY contain information about the metadata documents that were used when creating the output graph, using the following name/value pair for each metadata document: "describedBy": "[Metadata Location]" where [Metadata Location] is the absolute URL of the metadata document.
Each row in the tabular data is processed sequentially. The behaviour exhibited when processing a given cell within the current row depends on the metadata properties of the column description object for the column in which that cell resides. The effect of each metadata property is defined in Section 4.1.3 Metadata property effects on row-level mapping behaviour.
Each object in the output graph corresponding to a row in the tabular data SHALL contain one name/value pair for each column where the cell value is not null.
Where the metadata property urlTemplate is provided in the schema, each object in the output graph corresponding to a row in the tabular data SHALL be explicitly identified using the name/value pair "url": "[EXPANDED-URI-TEMPLATE]" where [EXPANDED-URI-TEMPLATE] is the value resulting from the expansion of the [uri-template] specified in the urlTemplate property.
The variables in the URI Template expression relate to the name property specified for each column.
During template expansion, the variables evaluate to the cell value within the row being processed that is associated with the named column.
The variable _row evaluates to the number of the row being processed.
Once the URL has been generated via the template expansion, relative URLs are resolved against the base URL to create an absolute URL.
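The expansion and resolution steps above can be sketched as follows. This is a deliberately limited illustration: it handles only simple {variable} expressions from [uri-template] and none of the operator forms, the function name is hypothetical, and Python's urljoin is used as a stand-in for relative URL resolution.

```python
import re
from urllib.parse import quote, urljoin


def expand_row_url(url_template, row_values, row_number, base_url):
    """Expand simple {name} variables, then resolve against the base URL.

    row_values maps each column name to the cell value in the current row.
    """
    variables = dict(row_values, _row=str(row_number))

    def substitute(match):
        # Undefined variables expand to nothing; other values are
        # percent-encoded except for unreserved characters
        return quote(variables.get(match.group(1), ""), safe="")

    relative = re.sub(r"\{([^{}]+)\}", substitute, url_template)
    return urljoin(base_url, relative)


url = expand_row_url("#gid-{GID}", {"GID": "1"}, 1,
                     "http://example.org/tree-ops.csv")
```

With the template #gid-{GID} and a base URL of http://example.org/tree-ops.csv, the first row of the tree inventory example resolves to http://example.org/tree-ops.csv#gid-1.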
Where the cell value is null, the name/value pair SHALL be omitted from the output graph unless a default value is specified for that column (see metadata properties null and default).
The value of name for the name/value pair SHALL be provided by the predicateUrl metadata property within the column description object.
In accordance with [json], all Unicode characters may be used. However, the following characters SHALL be escaped:
quotation mark, " (U+0022)
reverse solidus, \ (U+005C)
the control characters (U+0000 through U+001F)
The value of value SHALL be provided by the associated cell value subject to the effect of the metadata properties for that column (if any are specified).
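The escaping rule can be illustrated with Python's standard json module, which applies these escapes when serializing strings. The sample string is made up for illustration; ensure_ascii=False is passed so that other Unicode characters are written unescaped, which [json] permits.

```python
import json

# A made-up value containing a quotation mark, a reverse solidus,
# a control character (U+0001) and a non-ASCII letter
raw = 'Caf\u00e9 "Central" \\ annex\x01'

encoded = json.dumps(raw, ensure_ascii=False)

# Decoding the serialized form recovers the original string
decoded = json.loads(encoded)
```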
Given the lack of multi-lingual support in JSON, internationalized strings are simply copied verbatim into the output graph without any additional locale information.
Also note that values of metadata property name must be unique within a table. Therefore if, say, English and French versions of the same property are provided as complementary columns in a CSV file, then their name properties must be different; e.g. title_en and titre_fr. It may be possible for applications to infer a language from the name of a column, but this is entirely dependent on the syntax used by the metadata author when defining values of the name property.
The following metadata properties modify the way that cell values are incorporated into the output graph:
null
By default, a cell value is deemed to be null if it contains an empty string. If specified, the metadata property null provides a token (string) that can be used to identify null values.
separator
Where metadata property separator is defined, the cell value SHALL be parsed into an ordered list of values, using the value of separator as the delimiter.
The list of values SHALL be expressed in the output graph as an array.
datatype and format
Where metadata property datatype is undefined, the column SHALL be inferred to hold values of datatype string.
The following datatypes are given special attention:
Datatypes with embedded syntax: xml, json and html.
These datatypes are treated as literal values; no attempt SHOULD be made to 'unpack' the structured syntax to create sub-objects within the output graph.
Booleans: boolean.
Metadata property format MAY be provided for a boolean-typed column, providing non-standard tokens for true and false (e.g. Y|N).
See Section 3.12.3 Formats for booleans in [tabular-metadata].
If a boolean type is declared, the cell value SHALL be processed as follows:
if the cell value is true, 1 or, if the format property is defined, the token that it gives for true, then the output graph SHALL include the value true;
if the cell value is false, 0 or, if the format property is defined, the token that it gives for false, then the output graph SHALL include the value false.
Numbers: number, decimal, integer, nonPositiveInteger, negativeInteger, long, int, short, nonNegativeInteger, unsignedLong, unsignedInt, unsignedShort, positiveInteger, float and double.
Cell values that are asserted to be numeric shall be expressed in the output graph as numbers; double quotes will be omitted.
It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.
Metadata property format MAY be provided to describe the formatting of the cell values and so help the mapping application convert the cell value to a number format readily consumable by downstream applications.
Describing the formatting of numbers is currently unresolved and is likely to require information on decimal separator characters, grouping characters and possibly others such as Infinity, NaN, currency tokens, negative numbers appearing in parentheses, etc.
In the interim, mapping applications are not required to undertake any reformatting and may simply pass the cell value to the output graph verbatim.
Dates, times and durations: date, time, datetime, dateTime and duration.
A standard syntax for dates and times is defined by [iso8601]. This format can be readily consumed by software applications. However, dates and times are often provided in a locale-specific format, or use alternate calendars and/or eras.
Metadata property format MAY be provided to describe the formatting of cell values and so help the mapping application convert the cell value to a date, time, date-time or duration format readily consumable by downstream applications.
Where possible, data publishers SHOULD provide dates and times in the [iso8601] format. However, where data publishers choose to use locale-specific date and time formatting, they SHOULD also provide equivalent values in [iso8601] format (e.g. in a complementary column).
Describing the formatting of dates and times is currently unresolved. The favoured option is to defer the parsing of dates and times to implementations based on a picture string provided in the metadata. Unfortunately, there is no standard syntax for picture strings, therefore an array of picture strings relating to common implementations seems like the best option. For example:
"datatype": "date",
"format": {
  "picture-strings": [
    { "unicode": "dd MMM yyyy" },
    { "xpath": "[D01] [MN,*-3] [Y0001]" }
  ]
}
Where an implementation is able to interpret one of the provided picture strings, the date-time value reformatted in [iso8601] format shall be included in the output graph, else the original cell value shall be included verbatim.
In the interim, mapping applications are not required to undertake any reformatting and may simply pass the cell value to the output graph verbatim.
A list of potential date-time formatting implementations needs to be defined.
Where the metadata property separator is specified (e.g. to indicate that a cell value is to be parsed into a list of values), the datatype specified by datatype SHALL be inferred to apply to the members of the resulting list.
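The datatype rules above can be sketched as a single conversion function. Several points are assumptions rather than part of this specification: the function name is hypothetical, a boolean format is assumed to take the form of two tokens separated by a vertical bar (as in Y|N), and, because the syntax for date formats is unresolved, the date format is assumed here to be a Python strptime pattern rather than one of the picture strings shown above. Values that cannot be interpreted are passed through verbatim.

```python
import re
from datetime import datetime

NUMERIC_TYPES = {
    "number", "decimal", "integer", "nonPositiveInteger", "negativeInteger",
    "long", "int", "short", "nonNegativeInteger", "unsignedLong",
    "unsignedInt", "unsignedShort", "positiveInteger", "float", "double",
}
# Grammar of a JSON number, used to decide whether quotes can be dropped
JSON_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def convert_value(value, datatype="string", fmt=None):
    """Convert one cell value (a string) according to its datatype."""
    if datatype == "boolean":
        true_tokens, false_tokens = {"true", "1"}, {"false", "0"}
        tokens = fmt.split("|") if fmt else []
        if len(tokens) == 2:  # assumed "true-token|false-token" syntax
            true_tokens.add(tokens[0])
            false_tokens.add(tokens[1])
        if value in true_tokens:
            return True
        if value in false_tokens:
            return False
    elif datatype in NUMERIC_TYPES:
        match = JSON_NUMBER.fullmatch(value)
        if match:
            is_integer = match.group(2) is None and match.group(3) is None
            return int(value) if is_integer else float(value)
    elif datatype == "date" and fmt:
        try:
            # Assumption: fmt is a Python strptime pattern
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass  # unparseable date: fall through to the verbatim value
    return value  # strings, xml/json/html and anything not interpreted
```

When a separator is also specified, a caller would apply convert_value to each member of the resulting list rather than to the whole cell value.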
urlTemplate
If metadata property urlTemplate is specified, the value used in the output graph SHALL be the result of the URI Template expansion, as defined in Section 3.1 Property Syntax of [tabular-metadata].
Once the URL has been generated via the template expansion, relative URLs are resolved against the base URL to create an absolute URL.
default
If metadata property default is specified and the cell value is deemed to be null, then the value of default SHALL be used in the output graph.
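The interaction of null, default and separator can be sketched as below. The function name is hypothetical, and two behaviours are assumptions because the text above does not settle them: a null token, when given, is taken to replace the empty string as the null marker, and the null test is applied to the whole cell value before it is split.

```python
def resolve_cell(raw, null="", default=None, separator=None):
    """Return the value for the output graph, or None to omit the pair."""
    if raw == null:
        # Null cell: use the default if one is given, otherwise omit
        return default
    if separator is not None:
        # Ordered list of values, expressed as an array in the output
        return raw.split(separator)
    return raw


def map_row(cells, descriptions):
    """Build a row object from cell values and column description dicts."""
    obj = {}
    for raw, desc in zip(cells, descriptions):
        value = resolve_cell(raw, desc.get("null", ""),
                             desc.get("default"), desc.get("separator"))
        if value is not None:
            obj[desc["predicateUrl"]] = value
    return obj


# Made-up column descriptions, for illustration only
descriptions = [
    {"predicateUrl": "id"},
    {"predicateUrl": "tags", "separator": ";"},
    {"predicateUrl": "status", "null": "n/a", "default": "unknown"},
]
row = map_row(["7", "a;b;c", "n/a"], descriptions)
```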
These examples don't really show the edge cases - probably need to rework them
The first example illustrates how a CSV file with metadata annotations drawn only from a header line is processed. The tabular data lists countries, giving their country code and name. There are two columns, named country and name, and four rows.
country | name |
---|---|
AD | Andorra |
AF | Afghanistan |
AI | Anguilla |
AL | Albania |
The CSV input (published at http://example.org/country-codes-and-names.csv):
country,name
AD,Andorra
AF,Afghanistan
AI,Anguilla
AL,Albania
The resulting JSON output graph:
{
  "distribution": {
    "downloadURL": "http://example.org/country-codes-and-names.csv"
  },
  "row": [
    { "country": "AD", "name": "Andorra" },
    { "country": "AF", "name": "Afghanistan" },
    { "country": "AI", "name": "Anguilla" },
    { "country": "AL", "name": "Albania" }
  ]
}
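The first example can be reproduced with a short end-to-end sketch. It is illustrative only: the function name is hypothetical, Python's csv module stands in for the parsing defined in [tabular-data-model], and no metadata document is consulted, so every value remains a string.

```python
import csv
import io
import json


def map_table_with_header(csv_text, csv_url):
    """Map header-annotated CSV text to a JSON-style output graph."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Names come from the header line; null headers fall back to _col=[N]
    names = [cell if cell != "" else "_col={}".format(number)
             for number, cell in enumerate(header, start=1)]
    rows = []
    for cells in reader:
        # Null (empty) cells are omitted from the row object
        rows.append({name: cell for name, cell in zip(names, cells)
                     if cell != ""})
    return {"distribution": {"downloadURL": csv_url}, "row": rows}


csv_text = "country,name\nAD,Andorra\nAF,Afghanistan\nAI,Anguilla\nAL,Albania\n"
output = map_table_with_header(
    csv_text, "http://example.org/country-codes-and-names.csv")
print(json.dumps(output, indent=2))
```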
The second example illustrates how the mapping is modified with the addition of metadata annotations in a metadata document. The CSV file is a small extract from a much larger Tree Inventory dataset from the City of Palo Alto, which supports maintaining and tracking the city's public trees and urban forest. There are five columns, named GID, On Street, Species, Trim Cycle and Inventory Date, and three rows.
GID | On Street | Species | Trim Cycle | Inventory Date |
---|---|---|---|---|
1 | ADDISON AV | Celtis australis | Large Tree Routine Prune | 10/18/2010 |
2 | EMERSON ST | Liquidambar styraciflua | Large Tree Routine Prune | 6/2/2010 |
3 | EMERSON ST | Liquidambar styraciflua | Large Tree Routine Prune | 6/2/2010 |
The CSV input (published at http://example.org/tree-ops.csv):
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
The metadata description (published at http://example.org/tree-ops.csv-metadata.json):
{
  "@id": "tree-ops",
  "@context": { "@language": "en" },
  "dcat:distribution": { "dcat:downloadURL": "tree-ops.csv" },
  "dc:title": "Tree Operations",
  "dc:keywords": ["tree", "street", "maintenance"],
  "dc:publisher": [{
    "schema:name": "Example Municipality",
    "schema:web": "http://example.org"
  }],
  "dc:license": "http://opendefinition.org/licenses/cc-by/",
  "dc:modified": "2010-12-31",
  "schema": {
    "columns": [{
      "name": "GID",
      "title": [ "GID", "Generic Identifier" ],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true,
      "unique": true
    }, {
      "name": "on-street",
      "title": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "title": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim-cycle",
      "title": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory-date",
      "title": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": "date"
    }],
    "primaryKey": "GID",
    "urlTemplate": "#gid-{GID}"
  }
}
The resulting JSON output graph:
{
  "url": "tree-ops",
  "distribution": {
    "downloadURL": "http://example.org/tree-ops.csv"
  },
  "dc:title": "Tree Operations",
  "dc:keywords": ["tree", "street", "maintenance"],
  "dc:publisher": [{
    "schema:name": "Example Municipality",
    "schema:web": "http://example.org"
  }],
  "dc:license": "http://opendefinition.org/licenses/cc-by/",
  "dc:modified": "2010-12-31",
  "row": [{
    "url": "http://example.org/tree-ops.csv#gid-1",
    "GID": "1",
    "on-street": "ADDISON AV",
    "species": "Celtis australis",
    "trim-cycle": "Large Tree Routine Prune",
    "inventory-date": "10/18/2010"
  },{
    "url": "http://example.org/tree-ops.csv#gid-2",
    "GID": "2",
    "on-street": "EMERSON ST",
    "species": "Liquidambar styraciflua",
    "trim-cycle": "Large Tree Routine Prune",
    "inventory-date": "6/2/2010"
  },{
    "url": "http://example.org/tree-ops.csv#gid-3",
    "GID": "3",
    "on-street": "EMERSON ST",
    "species": "Liquidambar styraciflua",
    "trim-cycle": "Large Tree Routine Prune",
    "inventory-date": "6/2/2010"
  }],
  "describedBy": "http://example.org/tree-ops.csv-metadata.json"
}
The procedures and rules for mapping a collection of tabular data compliant with the grouped tabular data model are described below.
The metadata for a group of tables SHALL be provided by a table group description (as defined in [tabular-metadata]) within the associated metadata document.
The root object of the output graph SHALL describe the table group.
Where present in the table group description, any Common Properties (as defined in Section 3.3 Common Properties of [tabular-metadata]) SHALL be included in the output graph within the root object.
The root object SHALL contain an array named table containing one object for each of the tables listed in the resources array of the table group description.
Each table SHALL be processed sequentially according to the appropriate set of rules for mapping core or annotated tabular data. Refer to Section 3 Mapping Core Tabular Data and Section 4 Mapping Annotated Tabular Data for further details.
The JSON generated from processing the tables SHALL be incorporated into the output graph as objects within the table array.
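The assembly of the table group output can be sketched as follows. Both names are hypothetical: map_table_group is an illustrative function, and map_table stands for any callable that maps a single table description according to the rules for core or annotated tabular data. As a simplifying assumption, common properties are recognised here by their prefixed names (for example dc:title); a real implementation would follow the definition in [tabular-metadata].

```python
def map_table_group(group_description, map_table):
    """Build the root object for a group of tables."""
    root = {}
    for key, value in group_description.items():
        # Assumption: a prefixed name such as "dc:title" marks a common property
        if ":" in key and not key.startswith("@"):
            root[key] = value
    # Tables are processed sequentially, in the order given by resources
    root["table"] = [map_table(table)
                     for table in group_description.get("resources", [])]
    return root


# Made-up group description and a stub table mapper, for illustration only
group = {
    "dc:title": "Example table group",
    "null": "-",
    "resources": [{"@id": "first"}, {"@id": "second"}],
}
result = map_table_group(group, lambda table: {"url": table["@id"], "row": []})
```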
Any of the inherited properties null, separator, format, datatype, or default defined within the table group description SHALL be used to pre-populate the column description objects for each table in the group.
Where the same property is defined in the table group description, table description, schema or column description, the order of precedence (highest first) SHALL be:
column description
schema
table description
table group description
The presence of foreign-key references within the table descriptions may affect the way the data is packaged in the output graph. Reviewers are invited to comment on how grouped tabular data with foreign-key references might best be organised.
Use Case 4: Publication of public sector roles and salaries likely provides a good source of material for these examples. To be added.