This document is also available in this non-normative format: ePub
Copyright © 2016 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
CSV is one of the most popular formats for publishing data on the web. It is concise, easy to understand by both humans and computers, and aligns nicely to the tabular nature of most data.
But CSV is also a poor format for data. There is no mechanism within CSV to indicate the type of data in a particular column, or whether values in a particular column must be unique. It is therefore hard to validate and prone to errors such as missing values or differing data types within a column.
The CSV on the Web Working Group has developed standard ways to express useful metadata about CSV files and other kinds of tabular data. This primer takes you through the ways in which these standards work together, covering:
Where possible, this primer links back to the normative definitions of terms and properties in the standards. Nothing in this primer overrides those normative definitions.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g. RDF, JSON, or XML)". This non-normative document is a primer that describes how these standards work together for new readers. The normative standards are:
This document was published by the CSV on the Web Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 September 2015 W3C Process Document.
true
or false
?We'll begin with some basic concepts.
Tabular data is any data that can be arranged in a table, like the one here:
column 1 | column 2 | column 3 | |
---|---|---|---|
row 1 | cell in column 1 and row 1 | cell in column 2 and row 1 | cell in column 3 and row 1 |
row 2 | cell in column 1 and row 2 | cell in column 2 and row 2 | cell in column 3 and row 2 |
row 3 | cell in column 1 and row 3 | cell in column 2 and row 3 | cell in column 3 and row 3 |
There are lots of syntaxes for expressing tabular data on the web. You can put it in HTML tables, pass it around as Excel spreadsheets, or store it in an SQL database.
One easy way to pass around tabular data is as CSV: as comma-separated values. A CSV file writes each row on a separate line and each cell is separated from the next with a comma. The values of cells can be written with double quotes around them; this is necessary when a cell value contains a line break or a comma. So the tabular data above can be expressed in CSV as:
cell in column 1 and row 1,cell in column 2 and row 1,cell in column 3 and row 1 cell in column 1 and row 2,cell in column 2 and row 2,cell in column 3 and row 2 cell in column 1 and row 3,cell in column 2 and row 3,cell in column 3 and row 3
or, with double quotes around cell values:
"cell in column 1 and row 1","cell in column 2 and row 1","cell in column 3 and row 1" "cell in column 1 and row 2","cell in column 2 and row 2","cell in column 3 and row 2" "cell in column 1 and row 3","cell in column 2 and row 3","cell in column 3 and row 3"
CSV files usually have an additional row at the top called a header row, which gives human-readable names or titles for each of the columns. Here is a sample CSV file that contains a header row:
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude" "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
Column titles are a type of annotation on a column, not part of the data itself. For example, they aren't included when you're counting the rows of data in a table:
column 1 | column 2 | column 3 | column 4 | column 5 | column 6 | column 7 | |
---|---|---|---|---|---|---|---|
titles | country | country group | name (en) | name (fr) | name (de) | latitude | longitude |
row 1 | AT | eu | Austria | Autriche | Österreich | 47.6965545 | 13.34598005 |
row 2 | BE | eu | Belgium | Belgique | Belgien | 50.501045 | 4.47667405 |
row 3 | BG | eu | Bulgaria | Bulgarie | Bulgarien | 42.72567375 | 25.4823218 |
See also:
You can provide extra information, known as metadata, about CSV files using a JSON metadata file. If you're just providing metadata about one file, the easiest thing to do is to name the metadata file by adding -metadata.json
to the end of the name of the CSV file. For example, if your CSV file is called countries.csv
then call the metadata file countries.csv-metadata.json
.
The simplest metadata file you can create contains a single table description and looks like:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv" }
Metadata files must always include the @context
property with a value "http://www.w3.org/ns/csvw"
: this enables implementations to tell that these are CSV metadata files. The url
property points to the CSV file that the metadata file describes.
These metadata documents should be served from a web server with a media type of application/csvm+json
if possible.
The description of a table within a metadata file can include:
By default, if implementations can't find a metadata file by appending -metadata.json
to the filename of the CSV file, they'll just look for a file called csv-metadata.json
in the same directory as the CSV file.
Metadata files can also describe several CSV files at once, using a slightly different syntax:
{ "@context": "http://www.w3.org/ns/csvw", "tables": [{ "url": "countries.csv" }, { "url": "country-groups.csv" }, { "url": "unemployment.csv" }] }
Here, the tables
property holds an array of table descriptions, each with the URL of the CSV file that it's describing. The metadata file as a whole describes a group of tables. This is usually used when the tables relate to each other in some way: perhaps they're data in the same format from different periods of time, or perhaps they reference each other.
See also:
You can give information about the columns in a CSV file through a table schema. The simplest thing you can do is say what those columns are called. For example, if you have some CSV like this:
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude" "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
You can say that the table contains seven columns named as they are in this CSV file like so:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv" "tableSchema": { "columns": [{ "titles": "country" },{ "titles": "country group" },{ "titles": "name (en)" },{ "titles": "name (fr)" },{ "titles": "name (de)" },{ "titles": "latitude" },{ "titles": "longitude" }] } }
A validator can check that the CSV file holds the expected columns (both the right number of columns and columns with the expected titles).
See also:
This Note can't keep an up-to-date list of the tools that implement or otherwise use CSV on the Web. Instead, look at:
Providing metadata about CSVs can be useful simply in providing extra information to anyone who wants to work with them.
Here's an example that includes some extra descriptive documentation about a number of CSV files:
{ "@context": "http://www.w3.org/ns/csvw", "dc:title": "Unemployment in Europe (monthly)" "dc:description": "Harmonized unemployment data for European countries." "dc:creator": "Eurostat", "tables": [{ "url": "countries.csv", "dc:title": "Countries" }, { "url": "country-groups.csv", "dc:title": "Country groups" }, { "url": "unemployment.csv", "dc:title": "Unemployment (monthly)", "dc:description": "The total number of people unemployed" }] }
This example uses Dublin Core as a vocabulary for providing metadata. You can tell that's the vocabulary that's being used because the terms like dc:title
and dc:description
begin with the prefix dc
, which stands for Dublin Core.
There are several different metadata vocabularies in common use around the web. Some people use Dublin Core. Some people use schema.org. Some people use DCAT. All of these vocabularies can be used independently or together. A publisher could alternatively use:
{ "@context": "http://www.w3.org/ns/csvw", "schema:name": "Unemployment in Europe (monthly)", "schema:description": "Harmonized unemployment data for European countries." "schema:creator": { "schema:name": "Eurostat" }, "tables": [{ "url": "countries.csv", "schema:name": "Countries" }, { "url": "country-groups.csv", "schema:name": "Country groups" }, { "url": "unemployment.csv", "schema:name": "Unemployment (monthly)", "schema:description": "The total number of people unemployed" }] }
It's not clear at the moment which metadata vocabulary will give publishers the most benefits. Search engines are likely to recognise schema.org. RDF-based systems are more likely to recognise Dublin Core.
More generally, you can use prefixed properties like these on any of the objects in a metadata document. The prefixes that are recognised are those used in the RDFa 1.1 Initial Context. Other properties must be named with full URLs.
See also:
You can use metadata properties like the ones used for tables for individual columns as well. For example:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv" "tableSchema": { "columns": [{ "titles": "country", "dc:description": "The ISO two-letter code for a country, in lowercase." },{ "titles": "country group", "dc:description": "A lowercase two-letter code for a group of countries." },{ "titles": "name (en)", "dc:description": "The official name of the country in English." },{ "titles": "name (fr)", "dc:description": "The official name of the country in French." },{ "titles": "name (de)", "dc:description": "The official name of the country in German." },{ "titles": "latitude", "dc:description": "The latitude of an indicative point in the country." },{ "titles": "longitude", "dc:description": "The longitude of an indicative point in the country." }] } }
See also:
We've already seen an example where metadata supplied about a table is structured. Look at the schema:creator
here:
{ "@context": "http://www.w3.org/ns/csvw", "schema:name": "Unemployment in Europe (monthly)", "schema:description": "Harmonized unemployment data for European countries." "schema:creator": { "schema:name": "Eurostat" }, "tables": [{ "url": "countries.csv" }, { "url": "country-groups.csv" }, { "url": "unemployment.csv" }] }
The metadata supplied for a table or group of tables can have nested objects within it. You can provide arrays of values. It will be interpreted as if it is (a minimal version of) [json-ld]. Particular patterns that are useful are:
"schema:creator": { "schema:name": "Eurostat", "schema:url": "http://ec.europa.eu/eurostat", "schema:contactPoint": [{ "schema:telephone": "+44 20 300 63103", "schema:availableLanguage": "English" }, { "schema:telephone": "+49 30 223 86 029", "schema:availableLanguage": "Deutsch" }, { "schema:telephone": "+33 1 77 68 89 07" "schema:availableLanguage": "français" }] }
@type
to indicate the type of objects if there's an appropriate type to use, for example in this case Eurostat is a governmental organisation:
"schema:creator": { "schema:name": "Eurostat", "@type": "schema:GovernmentOrganization" }
@id
property to provide values that are URLs for things about which more information is available at that URL. For example, rather than providing all the information about Eurostat embedded within the metadata, you could just reference out to it:
"schema:creator": { "@id": "http://ec.europa.eu/eurostat" }
@id
along with other basic properties to make it more readable for humans.
See also:
There's no standardised facility in the CSV on the Web specifications for annotating individual cells, but there is a hook that will enable best practice about how to do that to emerge: the notes
property on a table description can contain objects that represent annotations.
The W3C Web Annotation Working Group is working on a vocabulary for annotations themselves. This vocabulary includes the concept of a target for an annotation and its body (the content of the annotation).
If you are annotating a single cell, the target needs to point to that cell. The easiest way to do that is to use fragment identifiers for CSV files as defined in [RFC7111]. These fragment identifiers reference cells based on their position within the original CSV file, with the first row in the CSV file (usually the header row) counted as 1. For example, with the CSV file:
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude" "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
the cell containing Belgique
is at #cell=3,4
. It's also possible to refer to ranges of cells with this syntax and to use *
to refer to the last row in the file. For example, to target a comment on all the locations in the CSV file you could use the fragment identifier #cell=2,6-*,7
.
To create comments, then, the notes
property can hold an array of objects that use the Web Annotation structure. For example:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv", "notes": [{ "type": "Annotation", "target": "countries.csv#cell=2,6-*,7", "body": "These locations are of representative points.", "motivation": "commenting" }, { "type": "Annotation", "target": "countries.csv#cell=3,4", "body": "Corrected.", "motivation": "editing" }] }
See also:
Validation is all about checking whether a file contains what you expect it to contain. For CSV files, this can be about:
There's lots more that you can say about the expected content of columns in a CSV file. The most obvious thing is to indicate the data type. For example, with the CSV file:
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude" "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
The first five columns are strings and the last two are numbers. You can indicate this with the datatype
property for each column:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv" "tableSchema": { "columns": [{ "titles": "country", "datatype": "string" },{ "titles": "country group", "datatype": "string" },{ "titles": "name (en)", "datatype": "string" },{ "titles": "name (fr)", "datatype": "string" },{ "titles": "name (de)", "datatype": "string" },{ "titles": "latitude", "datatype": "number" },{ "titles": "longitude", "datatype": "number" }] } }
You don't have to include "datatype": "string"
for columns that are strings — columns are assumed to hold strings if no datatype is explicitly specified.
There are a number of different datatypes supported by CSV on the Web implementations, based on the set defined in [xmlschema11-2]. The complete set is shown in the following diagram:
See also:
You can define new datatypes based on the built-in datatypes using an object as the value of the datatype
property rather than a string. For example:
"datatype": { "base": "integer", "minimum": "1", "maximum": "5" }
The base
property must be an existing datatype. The other properties on the new datatype define extra restrictions on values of the new datatype. You can give the new datatype a name and description to provide extra documentation for people using the data:
"datatype": { "dc:title": "Star Rating", "dc:description": "A star rating between 1 and 5." "base": "integer", "minimum": "1", "maximum": "5" }
See also:
In the example we've been using, the first column always contains a country code consisting of two lowercase letters. This is a new datatype based on string
. You can specify the format for string values in a column using a regular expression, like this:
{ "titles": "country", "datatype": { "dc:title": "Country Code", "dc:description": "Country codes as specified in ISO 3166.", "base": "string", "format": "[a-z]{2}" } }
It's also possible to restrict the length of a string-based datatype using the length
or minLength
and/or maxLength
properties. For example the following says that the column holding the English names of countries must have values between 3 and 128 characters long:
{ "titles": "name (en)", "datatype": { "base": "string", "minLength": "3", "maxLength": "128" } }
See also:
The size of numbers in a column can be restricted using the minimum
and maximum
properties and/or the minExclusive
and maxExclusive
properties. In our example, one column contains latitudes, which can range between -90 and +90:
{ "titles": "latitude", "datatype": { "base": "number", "minimum": "-90", "maximum": "90" } }
See also:
In the example we're using, the latitudes are provided to between six and eight decimal places and there are no leading or trailing zeros. You can use the format
property to provide a pattern that matches these numbers. In the pattern, 0
represents a required digit and #
represents an optional digit. For the latitude
, the definition looks like:
{ "titles": "latitude", "datatype": { "base": "number", "minimum": "-90", "maximum": "90", "format": "#0.000000##" } }
The format
property can also be used to indicate that values in a column should have leading zeros. For example, if a column were supposed to hold a three digit number you could use the pattern 000
.
See also:
Sometimes numbers within a CSV file won't be in a standard numeric format. For example, they might include commas as grouping characters (eg 12,345,678
) or as decimal points (eg 12,3
). In these cases, you can use the format
property with an object value.
To match numbers with grouping separators as in 12,345,678
you should specify ","
as the groupChar
for the format. The pattern
property then holds the pattern that indicates how many digits should be in each group. This example validates numbers with groups of three digits separated by commas:
"datatype": { "base": "integer", "format": { "groupChar": ",", "pattern": "#,##0" } }
To match numbers with decimal separators other than .
, as in 12,3
, you should specify ","
as the decimalChar
for the format. This example validates numbers with commas as decimal points:
"datatype": { "base": "integer", "format": { "decimalChar": "," } }
You can mix and match decimal and grouping characters and patterns. For example, in France it's standard to use commas for decimal points and spaces for grouping characters, so CSV files produced in France might contain numbers like 1 234 567,89
. These could be validated with a datatype like:
"datatype": { "base": "integer", "format": { "decimalChar": ",", "groupChar": " ", "pattern": "# ##0,0#" } }
See also:
Dates and times are treated similarly to numbers. You can use the minimum
, maximum
, minExclusive
and/or maxExclusive
properties to restrict their values.
For example, to indicate that the column should contain dates later than 1st January 2000, you can use the datatype:
"datatype": { "base": "date", "minimum": "2000-01-01" }
To indicate that the column should contain times before midday (exclusive), you can use the datatype:
"datatype": { "base": "time", "minExclusive": "12:00:00" }
The format of dates or times used for minimum or maximum values is always the ISO 8601 format: yyyy-MM-dd
for dates, HH:mm:ss.S
for times and yyyy-MM-ddTHH:mm:ss.S
for date/times.
See also:
Dates and times in CSV files often come in formats other than the standard ISO 8601 format. You can use the format
property to indicate the expected format of the date or time.
For example, to recognise dates in the usual UK format such as 31/10/2015 for 31st October 2015, you could use:
"datatype": { "base": "date", "format": "dd/MM/yyyy" }
Implementations are only required to understand a particular set of common formats for dates and times. These formats are, for dates:
yyyy-MM-dd
e.g. 2015-03-22
yyyyMMdd
e.g. 20150322
dd-MM-yyyy
e.g. 22-03-2015
d-M-yyyy
e.g. 22-3-2015
MM-dd-yyyy
e.g. 03-22-2015
M-d-yyyy
e.g. 3-22-2015
dd/MM/yyyy
e.g. 22/03/2015
d/M/yyyy
e.g. 22/3/2015
MM/dd/yyyy
e.g. 03/22/2015
M/d/yyyy
e.g. 3/22/2015
dd.MM.yyyy
e.g. 22.03.2015
d.M.yyyy
e.g. 22.3.2015
MM.dd.yyyy
e.g. 03.22.2015
M.d.yyyy
e.g. 3.22.2015
For times:
HH:mm:ss.S
with one or more trailing S
characters indicating the maximum number of fractional seconds e.g. HH:mm:ss.SSS
for 15:02:37.143
HH:mm:ss
e.g. 15:02:37
HHmmss
e.g. 150237
HH:mm
e.g. 15:02
HHmm
e.g. 1502
And for date/times:
yyyy-MM-ddTHH:mm:ss.S
with one or more trailing S
characters indicating the maximum number of fractional seconds e.g. yyyy-MM-ddTHH:mm:ss.SSS
for 2015-03-15T15:02:37.143
yyyy-MM-ddTHH:mm:ss
e.g. 2015-03-15T15:02:37
yyyy-MM-ddTHH:mm
e.g. 2015-03-15T15:02
M/d/yyyy HH:mm
for 3/22/2015 15:02
or dd.MM.yyyy HH:mm:ss
for 22.03.2015 15:02:37
None of these formats include names or abbreviations for months or days. The implementation you use might support other date and time formats as well, including specialised formats for the other date and time datatypes such as gMonthYear
. Check your implementation's documentation to see what it supports.
See also:
true
or false
?
By default, validators will recognise boolean values that are 1
or 0
or true
or false
. If a CSV file contains boolean values like T
and F
or Yes
and No
then you can create a derived boolean datatype that uses that syntax using the format
property, with the two possible values separated by a |
, for example:
"datatype": { "base": "boolean", "format": "Yes|No" }
See also:
The example CSV file we're using is this:
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude" "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
In it, the second column, country group
, contains either the value eu
or the value non-eu
. Despite there being two values, this isn't a boolean column. Instead, it's a column that has only two valid values.
There are two ways to specify that a column contains one of a list of values: using a regular expression to list the values and using a reference to a separate CSV file that contains the values.
Using a regular expression to list values works best if those values are strings, if there are only a few of them, and if they are self-explanatory such that you don't want to provide any additional information about them.
In this example, the country group
column could be specified as:
{ "titles": "country group", "datatype": { "base": "string", "format": "eu|non-eu" } }
As described in section 3.3 How do you restrict what kind of strings a column contains?, the format
property contains a regular expression. List the options separated by |
and ensure that you escape any of the characters in the options that have special meaning in regular expressions.
See also:
A more powerful method of listing the valid values in a particular column is to list those values in a separate CSV file. The CSV file can be very simple, containing just a single column that lists the valid values. In this example, we can create country-groups.csv
containing:
group eu non-eu
We can then provide definitions for both the countries.csv
and country-groups.csv
files, and state that the country group
column in countries.csv
references the group
column in country_groups.csv
. This reference from one file to another is called a foreign key.
To use a foreign key, both files must be referenced in the same metadata document, and both columns must be given names. Column names are only used inside the metadata document and you can only use (ASCII) letters, numbers, .
and _
within them. So the basic metadata document, before adding the foreign key, should look like:
{ "@context": "http://www.w3.org/ns/csvw", "tables": [{ "url": "countries.csv", "tableSchema": { "columns": [{ "titles": "country" },{ "name": "country_group", "titles": "country group" },{ "titles": "name (en)" },{ "titles": "name (fr)" },{ "titles": "name (de)" },{ "titles": "latitude" },{ "titles": "longitude" }] } }, { "url": "country-groups.csv", "tableSchema": { "columns": [{ "name": "group", "titles": "group" }] } }] }
The foreign key is defined in the schema for the countries.csv
table, as follows:
{ "@context": "http://www.w3.org/ns/csvw", "tables": [{ "url": "countries.csv", "tableSchema": { "columns": [{ "titles": "country" },{ "name": "country_group", "titles": "country group" },{ "titles": "name (en)" },{ "titles": "name (fr)" },{ "titles": "name (de)" },{ "titles": "latitude" },{ "titles": "longitude" }], "foreignKeys": [{ "columnReference": "country_group", "reference": { "resource": "country-groups.csv", "columnReference": "group" } }] } }, { "url": "country-groups.csv", "tableSchema": { "columns": [{ "name": "group", "titles": "group" }] } }] }
The foreignKeys
property can hold several foreign keys. Each contains a columnReference
to a column or list of columns in one CSV file, and a reference
which defines a column or list of columns in another CSV file.
The advantage of this method of listing the values allowed in a column is that the CSV file that contains the list of possible values can also provide additional information about those values. For example, we can provide expansions of what eu
and non-eu
mean in different languages:
group,name (en),name (fr),name (de) eu,"European Union","Union européenne","Europäische Union" non-eu,"Non EU countries","Pays hors Union européenne",Nicht-EU-Länder
See also:
Sometimes a column that contains numbers will contain special values, such as X
or NK
, when a value is unknown or redacted. If these columns are simply classified as numeric then the non-numeric values will be classed as errors.
To avoid values being classified as errors when they are being used to indicate missing values, list those values as null values using the null
property. This can take either a single string or an array of strings. For example, the latitude
column might usually be numeric but hold an X
if there is no indicative point for the country:
{ "titles": "latitude", "null": "X", "datatype": { "base": "number", "minimum": "-90", "maximum": "90" } }
The null
property can also be useful when a column contains values that are of the right type but used to indicate a missing value. It's not uncommon, for example, for publishers to use the value 99
in a column that contains integers to indicate that a value is missing.
See also:
Cells may contain lists of values with spaces, semi-colons or other characters acting as separators. For example, instead of using separate latitude
and longitude
columns, the CSV that we're looking at could contain a single latlong
column consisting of the latitude and longitude separated by a space:
"country","country group","name (en)","name (fr)","name (de)","latlong" "at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"
In this scenario, the separator
property can be used to indicate that the values in a column are lists themselves, and what separator is used between the items in the list. For example:
{ "titles": "latlong", "separator": " ", "datatype": { "base": "number", "minimum": "-180", "maximum": "180" } }
When separator
is specified, the datatype
property applies to each of the values in the list. There's no way to indicate that the values in the list have different datatypes, or set limits on the length of the list.
See also:
By default, a validator won't give any errors if a value is missing in a column. If you want to ensure that a value is provided for every row in the column, use the required
property for that column, with the value true
.
In our example, we might say that all the columns are required except the French and German names (applications being expected to default to the English name if the translation is missing):
"tableSchema": { "columns": [{ "titles": "country", "required": true },{ "titles": "country group", "required": true },{ "titles": "name (en)", "required": true },{ "titles": "name (fr)" },{ "titles": "name (de)" },{ "titles": "latitude", "required": true },{ "titles": "longitude", "required": true }] }
Setting required
to true
means that you can't have any null values in a column. If, in this example, latitude
and longitude
had null
set to X
then those columns couldn't contain an X
. It doesn't usually make sense to specify both null
and required
.
See also:
In our example, the country
column must contain unique values: each row should define a different country. To specify this, you can use a primary key to refer to the name of the column:
"tableSchema": { "columns": [{ "name": "country", "titles": "country" },{ "titles": "country group" },{ "titles": "name (en)" },{ "titles": "name (fr)" },{ "titles": "name (de)" },{ "titles": "latitude" },{ "titles": "longitude" }], "primaryKey": "country" }
Each CSV file can only have one primary key. A primary key can be made up of a number of columns that have to be unique in combination: the classic example would be ["firstName", "lastName"]
.
See also:
CSV is great for transferring data around the place, but it's often not as useful for processing that data. Other formats can be better at providing structure and meaning for data. So a means for transforming data out of CSV and into other formats is a common requirement.
You can of course transform CSV into anything you like using your favourite programming language. However, the CSV on the Web specs provide standardised ways of mapping CSV into two other formats:
These specifications make use of the metadata described in this primer during the transformation to decide what to include in the transformed output and how to include it. Processors that support these standardised transformations can be used by people who can't program.
These specifications describe what output you get if you don't supply any metadata. Given a CSV file like this:
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude" "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
the usual, minimal JSON output would be:
[{ "country": "at", "country group": "eu", "name (en)": "Austria", "name (fr)": "Autriche", "name (de)": "Österreich", "latitude": "47.6965545", "longitude": "13.34598005" }, { "country": "be", "country group": "eu", "name (en)": "Belgium", "name (fr)": "Belgique", "name (de)": "Belgien", "latitude": "50.501045", "longitude": "4.47667405" }, { "country": "bg", "country group": "eu", "name (en)": "Bulgaria", "name (fr)": "Bulgarie", "name (de)": "Bulgarien", "latitude": "42.72567375", "longitude": "25.4823218" }]
and the RDF output would be:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> . [ <#country> "at"; <#country%20group> "eu"; <#latitude> "47.6965545"; <#longitude> "13.34598005"; <#name%20%28de%29> "Österreich"; <#name%20%28en%29> "Austria"; <#name%20%28fr%29> "Autriche" ] . [ <#country> "be"; <#country%20group> "eu"; <#latitude> "50.501045"; <#longitude> "4.47667405"; <#name%20%28de%29> "Belgien"; <#name%20%28en%29> "Belgium"; <#name%20%28fr%29> "Belgique" ] . [ <#country> "bg"; <#country%20group> "eu"; <#latitude> "42.72567375"; <#longitude> "25.4823218"; <#name%20%28de%29> "Bulgarien"; <#name%20%28en%29> "Bulgaria"; <#name%20%28fr%29> "Bulgarie" ] .
The specifications define how to transform CSV into RDF. In this Primer all the examples use Turtle as the serialisation for that RDF. Implementations may generate other serialisations for RDF such as RDF/XML or JSON-LD.
See also:
The result of a transformation will include typed values based on the datatype and language of each column. So if we state that the longitude and latitude are numbers and the names are strings in the given language, as in this metadata:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv", "tableSchema": { "columns": [{ "titles": "country" },{ "titles": "country group" },{ "titles": "name (en)", "lang": "en" },{ "titles": "name (fr)", "lang": "fr" },{ "titles": "name (de)", "lang": "de" },{ "titles": "latitude", "datatype": "number" },{ "titles": "longitude", "datatype": "number" }] } }
then the JSON will look like:
[{ "country": "at", "country group": "eu", "name (en)": "Austria", "name (fr)": "Autriche", "name (de)": "Österreich", "latitude": 47.6965545, "longitude": 13.34598005 },{ "country": "be", "country group": "eu", "name (en)": "Belgium", "name (fr)": "Belgique", "name (de)": "Belgien", "latitude": 50.501045, "longitude": 4.47667405 },{ "country": "bg", "country group": "eu", "name (en)": "Bulgaria", "name (fr)": "Bulgarie", "name (de)": "Bulgarien", "latitude": 42.72567375, "longitude": 25.4823218 }]
and the RDF will look like:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix xsd: <http://www.w3.org/2001/XMLSchema#> . [ <#country> "at"; <#country%20group> "eu"; <#latitude> 4.76965545e1; <#longitude> 1.334598005e1; <#name%20%28de%29> "Österreich"@de; <#name%20%28en%29> "Austria"@en; <#name%20%28fr%29> "Autriche"@fr ] . [ <#country> "be"; <#country%20group> "eu"; <#latitude> 5.0501045e1; <#longitude> 4.47667405e0; <#name%20%28de%29> "Belgien"@de; <#name%20%28en%29> "Belgium"@en; <#name%20%28fr%29> "Belgique"@fr ] . [ <#country> "bg"; <#country%20group> "eu"; <#latitude> 4.272567375e1; <#longitude> 2.54823218e1; <#name%20%28de%29> "Bulgarien"@de; <#name%20%28en%29> "Bulgaria"@en; <#name%20%28fr%29> "Bulgarie"@fr ] .
See also:
If there's a missing value in the CSV, then usually the property will be omitted in the result as well. So if the latitude and longitude are missing for the first row of the CSV file we've been using, the equivalent JSON won't include those properties either:
{ "country": "at", "country group": "eu", "name (en)": "Austria", "name (fr)": "Autriche", "name (de)": "Österreich" }
and nor will the RDF:
[ <#country> "at"; <#country%20group> "eu"; <#name%20%28de%29> "Österreich"@de; <#name%20%28en%29> "Austria"@en; <#name%20%28fr%29> "Autriche"@fr ] .
If you want a value to appear even when the value is missing in the CSV, you can provide that value as the default
for the column. This value must be supplied as a string but will be treated exactly as if it had appeared within the CSV file. For example, if you supply a non-numeric string for a numeric column, as in:
{ "titles": "latitude", "datatype": "number", "default": "NOT KNOWN" }
The default value will be seen as an invalid value and therefore represented as a string in the output.
See also:
By default, the properties in the JSON and RDF come from the titles of the columns (the headers in the CSV file). In RDF, since the properties are URIs, the names are URL-encoded and turned into fragment identifiers in a URL based on the location of the CSV file being transformed. Hence in the previous examples there have been property names like "name (en)"
in JSON and #name%20%28en%29
in RDF.
You can override this default by supplying a name
for the column. That name will be used instead of the title of the column when creating the property. So if you have:
{ "titles": "name (en)", "name": "english_name" }
then the property will be called english_name
in the JSON output and #english_name
in the RDF output.
You can also use the propertyUrl
property to supply a prefixed name or a URL for the property. For example, to use schema:latitude
as the name for the latitude
property in both the JSON and the RDF output, you could use:
{ "titles": "latitude", "propertyUrl": "schema:latitude" "datatype": "number" }
The propertyUrl
property can be used to map several columns in the CSV file onto properties with the same name. In our example, each country has several names which are all really the same property; the schema could look like:
{ "titles": "name (en)", "propertyUrl": "schema:name" "lang": "en" },{ "titles": "name (fr)", "propertyUrl": "schema:name" "lang": "fr" },{ "titles": "name (de)", "propertyUrl": "schema:name" "lang": "de" }
With that schema, the JSON output will contain:
"schema:name": ["Belgium", "Belgique", "Belgien"]
and the RDF output will contain:
schema:name "Belgium"@en, "Belgique"@fr, "Belgien"@de
If there isn't a relevant property in one of the vocabularies that is built-in to CSV on the Web (those listed as part of the RDFa 1.1 initial context), the propertyUrl
can hold a URL template. Usually this template won't include any substitutable parts because it's generally the case that the property should be the same for the whole column. For example, you might have:
{ "titles": "country group", "propertyUrl": "http://example.org/def/countryGroup" }
In this case, the result of a transformation to JSON will contain:
"http://example.org/def/countryGroup": "eu"
and the RDF similarly:
<http://example.org/def/countryGroup> "eu"
See also:
Sometimes a column contains a value that can be programmatically mapped into a URL. In this case, the valueUrl
property contains a template for the URL that it should be mapped into.
For example, say that there were pages for each country on the web at e.g. http://example.org/country/at
. In that case, the URL for the country could be generated with the URL template http://example.org/country/{country}
. Within this template, {country}
inserts the value from the column named country
into the URL. So the metadata could contain:
{ "titles": "country", "name": "country", "valueUrl": "http://example.org/country/{country}", "propertyUrl": "schema:url" }
The JSON output would then contain:
"schema:url": "http://example.org/country/at"
and the RDF output:
schema:url <http://example.org/country/at>
If you want to preserve the original value from the column and use it to create a URL, you may want to introduce a virtual column. For example, with the latitude and longitude of each country available, you might want to provide a link to a map centered on the country within Google Maps. The URLs for these look like https://www.google.com/maps/@50.501045,4.476674,7z, and a template like https://www.google.com/maps/@{lat},{long},7z
.
To add a property that points to this URL, add a virtual column at the end of the column definitions within the schema. A virtual column definition looks just like a normal column definition but with the virtual
property set to true
:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv", "tableSchema": { "columns": [{ "titles": "country" },{ "titles": "country group" },{ "titles": "name (en)", "lang": "en" },{ "titles": "name (fr)", "lang": "fr" },{ "titles": "name (de)", "lang": "de" },{ "titles": "latitude", "name": "lat", "datatype": "number" },{ "titles": "longitude", "name": "long", "datatype": "number" },{ "virtual": true, "propertyUrl": "schema:hasMap", "valueUrl": "https://www.google.com/maps/@{lat},{long},7z" }] } }
In JSON, this will result in the output:
"schema:hasMap": "https://www.google.com/maps/@42.72567375,25.4823218,7z"
and in RDF, the output:
schema:hasMap <https://www.google.com/maps/@42.72567375,25.4823218,7z>
See also:
valueUrl
property in [tabular-metadata]
By default, the things described by each row don't have identifiers associated with them in either JSON or RDF outputs. You can add an identifier for the row by setting the aboutUrl
property. Usually that's done at the top level of the schema.
For example, say each row in countries.csv
was about a country whose identifier looked like http://example.org/country/{code}
where code
was the value within the first column of the CSV file (the country
column). The aboutUrl
could be set to the generate this URL for each row using a URL template:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv", "tableSchema": { "aboutUrl": "http://example.org/country/{code}", "columns": [{ "titles": "country", "name": "code" },{ "titles": "country group" },{ "titles": "name (en)", "lang": "en" },{ "titles": "name (fr)", "lang": "fr" },{ "titles": "name (de)", "lang": "de" },{ "titles": "latitude", "datatype": "number" },{ "titles": "longitude", "datatype": "number" }] } }
In the JSON, these identifiers are turned into @id
properties on the objects generated for each row:
[{ "@id": "http: //example.org/country/at", "code": "at", "country group": "eu", "name (en)": "Austria", "name (fr)": "Autriche", "name (de)": "Österreich" },{ "@id": "http: //example.org/country/be", "code": "be", "country group": "eu", "name (en)": "Belgium", "name (fr)": "Belgique", "name (de)": "Belgien", "latitude": 50.501045, "longitude": 4.47667405 },{ "@id": "http: //example.org/country/bg", "code": "bg", "country group": "eu", "name (en)": "Bulgaria", "name (fr)": "Bulgarie", "name (de)": "Bulgarien", "latitude": 42.72567375, "longitude": 25.4823218 }]
In the RDF, these identifiers become the identifiers for the entities that the properties relate to, rather than those being blank nodes:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix xsd: <http://www.w3.org/2001/XMLSchema#> . <http://example.org/country/at> <#code> "at"; <#country%20group> "eu"; <#latitude> 4.76965545e1; <#longitude> 1.334598005e1; <#name%20%28de%29> "Österreich"@de; <#name%20%28en%29> "Austria"@en; <#name%20%28fr%29> "Autriche"@fr . <http://example.org/country/be> <#code> "be"; <#country%20group> "eu"; <#latitude> 5.0501045e1; <#longitude> 4.47667405e0; <#name%20%28de%29> "Belgien"@de; <#name%20%28en%29> "Belgium"@en; <#name%20%28fr%29> "Belgique"@fr . <http://example.org/country/bg> <#code> "bg"; <#country%20group> "eu"; <#latitude> 4.272567375e1; <#longitude> 2.54823218e1; <#name%20%28de%29> "Bulgarien"@de; <#name%20%28en%29> "Bulgaria"@en; <#name%20%28fr%29> "Bulgarie"@fr .
See also:
aboutUrl
property in [tabular-metadata]Whether generating JSON or RDF it can be useful to indicate that each row contains data about a particular type of thing, such as a Person or a Country. There isn't usually a column within tabular data that indicates the type of the row (because it's generally the same for every row, so including it would be superfluous), so you have to add it as a virtual column.
The virtual column needs to come after the descriptions of columns actually within the data. It should have its virtual
property set to true
and its propertyUrl
set to rdf:type
to indicate that the virtual column will indicate the type of entity the row is about. The valueUrl
property can then be set to the prefixed name or URL of the type of the entity. For example, when each row represents a Country, you might use:
{ "virtual": true, "propertyUrl": "rdf:type", "valueUrl": "schema:Country" }
In the JSON this value will be transformed into the value of the @type
property on the relevant object:
"@type": "schema:Country"
In RDF output, it becomes the class for the entity:
a schema:Country
See also:
Transformations into JSON or RDF can be carried out in one of two modes. In minimal mode, which is what we've looked at so far, the output only contains data from the rows within the CSV file. In full mode, the output also contains metadata about the CSV file, including metadata from the metadata file.
For example, say that our metadata file looked like:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv", "schema:name": "Countries", "schema:description": "European countries for which comparative statistics are collected by Eurostat.", "schema:creator": { "schema:name": "Eurostat" }, "tableSchema": { ... } }
The output of a full JSON transformation would look like:
{ "tables": [{ "url":"countries.csv", "schema:name":"Countries", "schema:description":"European countries for which comparative statistics are collected by Eurostat.", "schema:creator": { "schema:name": "Eurostat" }, "row": [{ "url": "countries.csv#row=2", "rownum":1, "describes": [{ "country": "at", "country group": "eu", "name (en)": "Austria", "name (fr)": "Autriche", "name (de)": "Österreich", "latitude": 47.6965545, "longitude": 13.34598005 }] }, { "url": "countries.csv#row=3", "rownum": 2, "describes": [{ "country": "be", "country group": "eu", "name (en)": "Belgium", "name (fr)": "Belgique", "name (de)": "Belgien", "latitude": 50.501045, "longitude": 4.47667405 }] }, { "url": "countries.csv#row=4", "rownum": 3, "describes": [{ "country": "bg", "country group": "eu", "name (en)": "Bulgaria", "name (fr)": "Bulgarie", "name (de)": "Bulgarien", "latitude": 42.72567375, "longitude": 25.4823218 }] }] }] }
Similarly, the output of the full RDF would look like:
@prefix csvw: <http://www.w3.org/ns/csvw#> . @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix schema: <http://schema.org/> . @prefix xsd: <http://www.w3.org/2001/XMLSchema#> . [ a csvw:TableGroup; csvw:table [ a csvw:Table; schema:creator [ schema:name "Eurostat" ]; schema:description "European countries for which comparative statistics are collected by Eurostat."; schema:name "Countries"; csvw:row [ a csvw:Row; csvw:describes [ <#country> "at"; <#country%20group> "eu"; <#latitude> 4.76965545e1; <#longitude> 1.334598005e1; <#name%20%28de%29> "Österreich"@de; <#name%20%28en%29> "Austria"@en; <#name%20%28fr%29> "Autriche"@fr ]; csvw:rownum 1; csvw:url <#row=2> ], [ a csvw:Row; csvw:describes [ <#country> "be"; <#country%20group> "eu"; <#latitude> 5.0501045e1; <#longitude> 4.47667405e0; <#name%20%28de%29> "Belgien"@de; <#name%20%28en%29> "Belgium"@en; <#name%20%28fr%29> "Belgique"@fr ]; csvw:rownum 2; csvw:url <#row=3> ], [ a csvw:Row; csvw:describes [ <#country> "bg"; <#country%20group> "eu"; <#latitude> 4.272567375e1; <#longitude> 2.54823218e1; <#name%20%28de%29> "Bulgarien"@de; <#name%20%28en%29> "Bulgaria"@en; <#name%20%28fr%29> "Bulgarie"@fr ]; csvw:rownum 3; csvw:url <#row=4> ]; csvw:url <> ] ] .
Metadata provided about tables is interpreted based on the rules for [json-ld] which means that you can provide as much structure within that metadata as you like, including providing structured values, languages and datatypes, so that the data in the output includes what you need it to.
See also:
By default, the output from a JSON or RDF transformation will include all the data from all the columns of all the tables in the metadata document. It may be that you're not interested in some of that within the output of your transformation. In that case, you can use the suppressOutput
property in the metadata to exclude the data that you're not interested in.
For example, perhaps I'm not interested in the non-English names of countries in my output. In that case, I could suppress them as follows:
{ "titles": "name (fr)", "lang": "fr", "suppressOutput": true }
Similarly, when generating the output for a set of tables, you can suppress the output from a whole table by adding the suppressOutput
property to the description of that table:
{ "@context": "http://www.w3.org/ns/csvw", "tables": [{ "url": "countries.csv" }, { "url": "country-groups.csv", "suppressOutput": true }, { "url": "unemployment.csv" }] }
See also:
While tabular data is by necessity flat, it often holds data that is actually structured. For example, the data that we have been looking at:
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude" "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
if modelled according to the schema.org vocabulary, would look like:
[{ "@type": "schema:Country", "schema:name": ["Austria", "Autriche", "Österreich"], "schema:geo": { "@type": "schema:GeoCoordinates", "schema:latitude":47.6965545, "schema:longitude":13.34598005 } }, { "@type": "schema:Country", "schema:name": ["Belgium", "Belgique", "Belgien"], "schema:geo": { "@type": "schema:GeoCoordinates", "schema:latitude":50.501045, "schema:longitude":4.47667405 } }, { "@type": "schema:Country", "schema:name": ["Bulgaria", "Bulgarie", "Bulgarien"], "schema:geo": { "@type": "schema:GeoCoordinates", "schema:latitude":42.72567375, "schema:longitude":25.4823218 } }]
Generating JSON in this shape requires the judicious use of virtual columns, aboutUrl
and valueUrl
: if you create a column whose valueUrl
corresponds to the aboutUrl
of another column, you will create nested properties.
In this example, we can use two aboutUrl
s: one in the form http://example.org/country/{code}
for the Country and one in the form http://example.org/country/{code}#geo
for the geo-coordinates of the country. The names are properties of the former while the longitude and latitude are properties of the latter. A virtual column can add the association between the two objects, with a propertyUrl
of schema:geo
, like so:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv", "tableSchema": { "aboutUrl": "http://example.org/country/{code}", "columns": [{ "titles": "country", "name": "code", "suppressOutput": true },{ "titles": "country group", "suppressOutput": true },{ "titles": "name (en)", "lang": "en", "propertyUrl": "schema:name" },{ "titles": "name (fr)", "lang": "fr", "propertyUrl": "schema:name" },{ "titles": "name (de)", "lang": "de", "propertyUrl": "schema:name" },{ "titles": "latitude", "datatype": "number", "aboutUrl": "http://example.org/country/{code}#geo", "propertyUrl": "schema:latitude" },{ "titles": "longitude", "datatype": "number", "aboutUrl": "http://example.org/country/{code}#geo", "propertyUrl": "schema:longitude" },{ "virtual": true, "propertyUrl": "rdf:type", "valueUrl": "schema:Country" },{ "virtual": true, "propertyUrl": "schema:geo", "valueUrl": "http://example.org/country/{code}#geo" },{ "virtual": true, "aboutUrl": "http://example.org/country/{code}#geo", "propertyUrl": "rdf:type", "valueUrl": "schema:GeoCoordinates" }] } }
Note also in this example the use of suppressOutput
to remove properties that we're not interested in, and the use of virtual columns to add type information to both types of generated object.
The result of this transformation is close to what we were aiming for, though with the addition of @id
properties:
[{ "@id": "http://example.org/country/at", "@type": "schema:Country", "schema:name": ["Austria", "Autriche", "Österreich"], "schema:geo": { "@id": "http://example.org/country/at#geo", "@type": "schema:GeoCoordinates", "schema:latitude": 47.6965545, "schema:longitude": 13.34598005 } }, { "@id": "http://example.org/country/be", "@type": "schema:Country", "schema:name": ["Belgium", "Belgique", "Belgien"], "schema:geo": { "@id":"http://example.org/country/be#geo", "@type": "schema:GeoCoordinates", "schema:latitude": 50.501045, "schema:longitude": 4.47667405 } }, { "@id": "http://example.org/country/bg", "@type": "schema:Country", "schema:name": ["Bulgaria", "Bulgarie", "Bulgarien"], "schema:geo": { "@id": "http://example.org/country/bg#geo", "@type": "schema:GeoCoordinates", "schema:latitude": 42.72567375, "schema:longitude": 25.4823218 } }]
The same metadata will generate similar RDF, though the nesting structure is not so obvious because of the way RDF works:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix schema: <http://schema.org/> . @prefix xsd: <http://www.w3.org/2001/XMLSchema#> . <http://example.org/country/at> a schema:Country; schema:geo <http://example.org/country/at#geo>; schema:name "Austria"@en, "Autriche"@fr, "Österreich"@de . <http://example.org/country/be> a schema:Country; schema:geo <http://example.org/country/be#geo>; schema:name "Belgium"@en, "Belgique"@fr, "Belgien"@de . <http://example.org/country/bg> a schema:Country; schema:geo <http://example.org/country/bg#geo>; schema:name "Bulgaria"@en, "Bulgarie"@fr, "Bulgarien"@de . <http://example.org/country/at#geo> a schema:GeoCoordinates; schema:latitude 4.76965545e1; schema:longitude 1.334598005e1 . <http://example.org/country/be#geo> a schema:GeoCoordinates; schema:latitude 5.0501045e1; schema:longitude 4.47667405e0 . <http://example.org/country/bg#geo> a schema:GeoCoordinates; schema:latitude 4.272567375e1; schema:longitude 2.54823218e1 .
See also:
If you are used to using RDF you'll know that there's a big difference between having a property that has multiple values (ie multiple triples with the same subject and property) and a property that has a rdf:List
as a value. Sometimes one is appropriate, sometimes the other. The ordered
property enables you to indicate which to use for values that are sequences in the original data.
Let's use as an example the version of our data in which the latitude and longitude are in the same property:
"country","country group","name (en)","name (fr)","name (de)","latlong" "at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"
In this example, if we state that the latlong
column is space-separated like so:
{ "titles": "latlong", "separator": " ", "datatype": "number" }
we'll end up with output like this:
[ <#country> "at"; <#country%20group> "eu"; <#name%20%28en%29> "Austria"@en; <#name%20%28fr%29> "Autriche"@fr; <#name%20%28de%29> "Österreich"@de; <#latlong> 4.76965545e1, 1.334598005e1; ] .
This shows the #latlong
property having two values: 4.76965545e1
and 1.334598005e1
. These values could easily get mixed up such that the first was taken to be the longitude and the second the latitude rather than the other way around.
To avoid this mix-up occurring, set the ordered
property to true
:
{ "titles": "latlong", "separator": " ", "ordered": true, "datatype": "number" }
This will make the processor use a rdf:List
for the value of the #latlong
property instead, which in Turtle looks like:
[ <#country> "at"; <#country%20group> "eu"; <#name%20%28en%29> "Austria"@en; <#name%20%28fr%29> "Autriche"@fr; <#name%20%28de%29> "Österreich"@de; <#latlong> ( 4.76965545e1, 1.334598005e1 ); ] .
The ordered
property makes no difference to JSON output because sequences in CSV are always transformed into arrays in JSON.
See also:
As illustrated above, it is possible to transform CSV into something that looks like JSON-LD by transforming it into JSON. You can add @id
properties for identifiers using aboutUrl
and add @type
properties for types using virtual columns.
However, if you're really after JSON-LD as an output, the best route is to transform into RDF and emit that RDF as JSON-LD. This will give you more control over the context that's used to determine the properties and structure of the output.
See also:
There's no single specified way of displaying CSV tables in HTML. If you are writing code to do so, it's good practice to:
tableDirection
and textDirection
propertiesaboutUrl
and valueUrl
properties to link out to other pages from cells within the table, and the propertyUrl
to link out to information about the property a column represents; these properties could also be used to embed RDFa into the HTML table, as described in [html-rdfa]rowTitles
) visible; for screen readers, use these to label the rows so that people can navigate through the table easilySee also:
The Data Cube vocabulary [vocab-data-cube] is a vocabulary for statistical data based on SDMX (Statistical Data and Metadata eXchange). Statistical data is often expressed in tables, sometimes with one row per Observation and sometimes with each cell containing a different Observation.
Generating data in the Data Cube vocabulary requires the use of techniques that have been described above, such as adding identifiers to entities, adding structure and types through virtual columns, and using metadata to supply additional static context.
There is a fully worked out example of transforming to DataCube, using meterological data as its basis, available on the Github repository for the Working Group.
See also:
While there are only specifications for transforming CSV into JSON and RDF, there is an extension mechanism within CSV metadata to indicate other transformations that could be applied to CSV files. The transformations
property on a table description holds an array of descriptions of transformations that processors could carry out. There's no guarantees that a given processor will recognise them, but over time it might be that there begins to be recognised practices for how such transformations might work.
The transformations must have the following properties:
targetFormat
http://www.iana.org/assignments/media-types/application/xml
for XMLurl
scriptFormat
https://mustache.github.io/
for Mustache or http://www.iana.org/assignments/media-types/application/javascript
for Javascript
You can also supply a titles
property to provide a human-readable description of the output of the transformation, and a source
property to indicate that the input to the transformation isn't the original CSV or tabular data, but json
or rdf
.
For example, if I wanted to convert the data that we've been using into XML, I could create a Mustache template like this at xml-template.mustache
:
{{#tables}} <countries> {{#row}} {{#describes}} <country id="{{country}}" group="{{country_group}}"> <name xml:lang="en">{{name_en}}</name> <name xml:lang="fr">{{name_fr}}</name> <name xml:lang="de">{{name_de}}</name> <geo lat="{{latitude}}" long="{{longitude}}" /> </country> {{/describes}} {{/row}} </countries> {{/tables}}
In the metadata for the CSV file I could then include:
{ "@context": "http://www.w3.org/ns/csvw", "url": "countries.csv", "transformations": [{ "targetFormat": "http://www.iana.org/assignments/media-types/application/xml", "titles": "Simple XML version", "url": "xml-template.mustache", "scriptFormat": "https://mustache.github.io/", "source": "json" }] }
Processors that recognised the URL for Mustache could offer users the option of passing the JSON output to a Mustache processor, which would generate:
<countries> <country id="at" group="eu"> <name xml:lang="en">Austria</name> <name xml:lang="fr">Autriche</name> <name xml:lang="de">Österreich</name> <geo lat="47.6965545" long="13.34598005" /> </country> <country id="be" group="eu"> <name xml:lang="en">Belgium</name> <name xml:lang="fr">Belgique</name> <name xml:lang="de">Belgien</name> <geo lat="50.501045" long="4.47667405" /> </country> <country id="bg" group="eu"> <name xml:lang="en">Bulgaria</name> <name xml:lang="fr">Bulgarie</name> <name xml:lang="de">Bulgarien</name> <geo lat="42.72567375" long="25.4823218" /> </country> </countries>
Processors should document the kind of transformation scripts that they can recognise and how they process them.
See also:
There are a number of features in the CSV on the Web metadata documents that support scenarios encountered in CSV files that use different languages. We already discussed using varying number formats in section 3.6 How do you validate numbers that aren't in standard numeric formats? and date formats in section 3.8 How do you validate dates that aren't in standard date or time formats?. Here we'll look at how to create metadata files, schemas and CSV files that take account of and work across multiple languages.
The metadata file will often contain natural language text, such as titles and descriptions of columns and tables. Unless you specify otherwise, implementations will assume all this text is in an undefined language (und
). If you want to specify what natural language is in use within the metadata file, you have to change the way the @context
is specified. Instead of the normal value:
"@context": "http://www.w3.org/ns/csvw"
The @context
should take an array, where the first value is the usual URL as a string and the second is an object with a @language
property set to the language being used within the metadata file. This example, which has English-language titles and descriptions, illustrates:
{ "@context": [ "http://www.w3.org/ns/csvw", { "@language": "en "} ], "dc:title": "Countries" "url": "countries.csv" "tableSchema": { "columns": [{ "titles": "country", "dc:description": "The ISO two-letter code for a country, in lowercase." },{ "titles": "country group", "dc:description": "A lowercase two-letter code for a group of countries." },{ "titles": "name (en)", "dc:description": "The official name of the country in English." },{ "titles": "name (fr)", "dc:description": "The official name of the country in French." },{ "titles": "name (de)", "dc:description": "The official name of the country in German." },{ "titles": "latitude", "dc:description": "The latitude of an indicative point in the country." },{ "titles": "longitude", "dc:description": "The longitude of an indicative point in the country." }] } }
See also:
Metadata such as descriptions can be objects rather than strings. Using objects is useful when you want to provide the language for a value. In this case, the object should have two properties: a @value
property holding the natural-language string and a @lang
value indicating what language that string is in. For example:
{ "titles": "name (en)", "dc:description": { "@value": "The official name of the country in English.", "@lang": "en" } }
You can use an array to provide the same metadata in many different languages, for example:
"dc:title": [{ "@lang": "en", "@value": "Unemployment in Europe (monthly)" },{ "@lang": "de", "@value": "Arbeitslosigkeit in Europa (monatlich)" },{ "@lang": "fr", "@value": "Le Chômage en Europe (mensuel)" }]
If you don't indicate the language used for metadata, processors will assume it's the default language used in the metadata as a whole.
See also:
You can use an object as the value for the titles
property for a column to provide titles in different languages. Within the object, each property is a language and the value is the title in that language:
"titles": { "en": "Country", "de": "Land", "fr": "Pays" }
If you don't indicate the language used for the title of a column, processors will assume it's the default language used in the metadata as a whole.
See also:
Within an individual CSV file, it may be that different columns contain values that are in different languages. In the example we're using, there are three columns that each contain the name of a country, in English, French and German:
"country","country group","name (en)","name (fr)","name (de)","latlong" "at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"
Use the lang
property on the column description to indicate the language of text in that column:
{ "@context": [ "http://www.w3.org/ns/csvw", { "@language": "en "} ], "dc:title": "Countries" "url": "countries.csv" "tableSchema": { "columns": [{ "titles": "country", "dc:description": "The ISO two-letter code for a country, in lowercase." },{ "titles": "country group", "dc:description": "A lowercase two-letter code for a group of countries." },{ "titles": "name (en)", "dc:description": "The official name of the country in English.", "lang": "en" },{ "titles": "name (fr)", "dc:description": "The official name of the country in French.", "lang": "fr" },{ "titles": "name (de)", "dc:description": "The official name of the country in German.", "lang": "de" },{ "titles": "latitude", "dc:description": "The latitude of an indicative point in the country." },{ "titles": "longitude", "dc:description": "The longitude of an indicative point in the country." }] } }
There's no relationship between the language used in a metadata file and that used in the CSV file that it describes.
See also:
Implementations that display tables according to the specs should mostly be able to guess whether a table should be displayed left-to-right or right-to-left based on the content of the table. Implementations will look at the content of the cells to work out which way to display their content and will look at the content of the table as a whole to work out whether to display the first column on the right or left of the page.
If you want to override the display of a particular column then you can use the textDirection
property on a column description to explicitly be rtl
or ltr
:
{ "titles": "name (ar)", "lang": "ar", "textDirection": "rtl" }
If you want to override the display of the table overall then you can use the tableDirection
property on the description of the table, or for all tables in the group.
{ "@context": "http://www.w3.org/ns/csvw", "url": "results.csv", "tableDirection": "rtl" }
The value of the tableDirection
property is inherited to all columns in the table, so any text within this table will similarly be displayed right-to-left. This can be overridden by setting textDirection
to ltr
or auto
(in which case the direction of the text within each cell will be determined by its contents).
See also:
There is no native support for expressing the units of measure for a particular column. You can, however, use documentation to tell people who are using the data what unit of measure is used for that particular column. This can be informal within the description of the column:
{ "titles": "distance", "dc:description": "Distance (kilometres)" }
Alternatively, it can be more explicit using an existing units-of-measure property and vocabulary, such as:
{ "titles": "distance", "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": { "@id": "http://qudt.org/vocab/unit#Kilometer" } }
to which you could even add more detail if you wanted (this is replicating the canonical definition of definition of unit:Kilometer
from the Quanitites, Units, Dimensions and Data Types (QUDT) Ontologies):
{ "titles": "distance", "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": { "@id": "http://qudt.org/vocab/unit#Kilometre", "@type": [ "http://qudt.org/schema/qudt#SIUnit", "http://qudt.org/schema/qudt#DerivedUnit", "http://qudt.org/schema/qudt#LengthUnit" ], "rdfs:label": "Kilometer", "http://qudt.org/schema/qudt#abbreviation": "km", "http://qudt.org/schema/qudt#code": "1091", "http://qudt.org/schema/qudt#conversionMultiplier": 1000, "http://qudt.org/schema/qudt#conversionOffset": 0.0, "http://qudt.org/schema/qudt#symbol": "km", "skos:exactMatch": { "@id": "http://dbpedia.org/resource/Kilometre" } } }
If you are generating JSON or RDF from CSV, you may want to generate structured values that include the units of each value from the CSV file. This is a little complicated, but useful if different rows contain values that use different units. In this case, the output that you're aiming for in RDF would look something like:
[] :distance <#row-1-distance> . <#row-1-distance> schema:value 3.5 ; schema:unitCode <http://qudt.org/vocab/unit#Kilometer> ; .
and in JSON something like this:
"distance": { "@id": "#row-1-distance", "schema:value": 3.5 "schema:unitCode": "http://qudt.org/vocab/unit#Kilometer" }
You may want to use different properties than schema:value
and schema:unitCode
; if so, just use different propertyUrl
s.
You need to decide on a pattern for the URLs for the values themselves, and set the aboutUrl
for the relevant column create that URL. In this example, the URLs that look like #row-1-distance
can be generated with the pattern #row-{_row}-distance
. The propertyUrl
for the column needs to be schema:value
as the value in the column provides the value for that property. So the column description looks like:
{ "name": "distance_value", "titles": "distance", "datatype": "number", "aboutUrl": "#row-{_row}-distance", "propertyUrl": "schema:value" }
You then need to use virtual columns (descriptions of additional columns that don't exist in the source CSV) to generate the relationship between the thing whose distance is being measured and the structured value, and the additional property providing the unit for the structured value.
To generate the relationship being the thing that has the distance and the structured value, the virtual column's valueUrl
needs to hold the same URL template as you used before:
{ "name": "distance", "virtual": true, "valueUrl": "#row-{_row}-distance" }
To create the units property, you need another virtual column where the aboutUrl
of that virtual column generates the URL for the structured value, the propertyUrl
is schema:unitCode
and the valueUrl
is the URL representing the unit (in this case http://qudt.org/vocab/unit#Kilometre
):
{ "name": "distance_unit", "aboutUrl": "#row-{_row}-distance", "propertyUrl": "schema:unitCode", "valueUrl": "http://qudt.org/vocab/unit#Kilometer" }
If it's necessary to add more detail about the unit (e.g. the fact that it's a unit of length) this can be done with additional virtual columns:
{ "name": "kilometer_abbreviation", "virtual": true, "aboutUrl": "http://qudt.org/vocab/unit#Kilometer", "propertyUrl": "rdf:type", "valueUrl": "http://qudt.org/schema/qudt#LengthUnit" }
Usually, however, processors should recognise or be able to resolve the URL for the unit to understand that it's a unit of length, if this is important for onward processing.
See also:
If you are generating RDF from CSV, you may want to define a datatype for a column and then provide additional information about that datatype as properties. For example, the column description could look like:
{ "titles": "distance", "datatype": { "@id": "http://example.org/unit/kilometre", "rdfs:label": "Kilometre", "base": "number" } }
When values are generated in RDF for this column, they will be assigned the relevant datatype, for example:
[] :distance "3.5"^^<http://example.org/unit/kilometre> . <http://example.org/unit/kilometre> rdfs:label "Kilometre" .
Again, it is possible to include additional information about the unit being used as the datatype within the definition of the datatype:
{ "titles": "distance", "datatype": { "@id": "http://example.org/unit/kilometre", "@type": "http://example.org/quantity/length", "rdfs:label": "Kilometre", "base": "number", "skos:notation": "km" } }
See also:
There are many different ways of representing geospatial data within a CSV file, and no single best practice for doing so.
At the simplest level, it's possible to reference geospatial coordinates as points with separate columns for latitude, longitude and if necessary altitude (or using a different spatial reference system). This enables separate validation for the separate coordinates. The example used throughout this primer uses this setup:
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude" "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
Metadata can be used to provide specialist types for the values of these coordinates, to indicate that they are latitude and longitude (by mapping to the well-known schema:latitude
and schema:longitude
properties which specify the use of WGS84), to group the coordinates together, and to provide a link that uses the coordinates to provide a map:
{ "titles": "latitude", "name": "lat", "datatype": { "base": "number", "minimum": "-90", "maximum": "90" }, "aboutUrl": "http://example.org/country/{code}#geo", "propertyUrl": "schema:latitude" }, { "titles": "longitude", "name": "long", "datatype": { "base": "number", "minimum": "-180", "maximum": "180" }, "aboutUrl": "http://example.org/country/{code}#geo", "propertyUrl": "schema:longitude" }, { "virtual": true, "propertyUrl": "schema:geo", "valueUrl": "http://example.org/country/{code}#geo" }, { "virtual": true, "aboutUrl": "http://example.org/country/{code}#geo", "propertyUrl": "rdf:type", "valueUrl": "schema:GeoCoordinates" }, { "virtual": true, "propertyUrl": "schema:hasMap", "valueUrl": "https://www.google.com/maps/@{lat},{long},7z" }
You can put latitude and longitude into a single column, with a character separator between the numbers as shown in section 3.12 What if the cells in a column contain lists of values?. However, this makes it harder to accurately validate the individual coordinates. They also cannot be separated out into separate property values when converting to JSON or RDF. So this is a more restrictive method and best avoided.
CSV files may also need to contain geometries beyond individual points. There are no built-in formats for geometries recognised by implementations of CSV on the Web. Geometries may be expressed using GeoJSON, GML, KML or OGC Well-Known Text (WKT) representations. In each case, schemas may indicate that columns containing geometries adhere to a particular type:
json
if the format is JSON-based such as GeoJSONxml
if the format is XML-based such as GML or KMLstring
if the format is text-based such as OGC WKT
You can use the format
property to further constrain the content of columns containing these values. For example, you could use:
"datatype": { "@id": "http://geojson.org/", "base": "json", "format": "\\{ ?\"type\": ?\"Polygon\",.+\\}" }
to indicate that a column contains a Polygon in GeoJSON, or:
"datatype": { "@id": "http://www.iana.org/assignments/media-types/application/gml+xml", "base": "xml", "format": ".*\\<gml:Point xmlns:gml=\"http://www\.opengis\.net/ont/gml" srsName=\"([^\"])+\".*\\>.+\\</gml:Point\\>" }
to indicate that a column contains a GML Point, or:
"datatype": { "@id": "http://www.iana.org/assignments/media-types/application/vnd.google-earth.kml+xml", "base": "xml", "format": ".*\\<kml xmlns=\"http://www\\.opengis\\.net/kml/2.2\"\\>.+\\</kml\\>" }
to indicate that a column contains KML, or:
"datatype": { "base": "string", "format": "POLYGON \\(\\(\\d+(\\.\\d+)? \\d+(\\.\\d+)?(, \\d+(\\.\\d+)? \\d+(\\.\\d+)?)+\\)\\)" "rdfs:seeAlso": "http://www.opengeospatial.org/standards/sfa" }
to indicate that a column contains a very basic OGC WKT Polygon. (More sophisticated regular expressions could be used if properties held different types of objects such as points or lines or more complex polygons expressed in OGC WKT.)
There is no standard way of expressing which coordinate reference system (CRS) is being used within CSV metadata, but GeoJSON, GML and OGC WKT values can include information about which CRS they use. The W3C Spatial Data on the Web Working Group intend to make recommendations in this area within [sdw-bp].
See also:
It is useful to have titles for rows both for screen readers and in other displays where it may be difficult to view the complete context of a table and therefore to understand the content of a row.
The rowTitles
property within a schema provides an array of columns that provide sufficient context to label the row. These may sometimes be the same as the columns used for the primaryKey
in the table, but are more likely to be columns that contain human readable text.
For example, with the CSV:
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude" "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005" "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405" "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
The rowTitles
property could be set to reference the columns containing the name of each country, in the different languages available:
"tableSchema": { "columns": [{ "titles": "country" },{ "titles": "country group" },{ "name": "name_en", "titles": "name (en)", "lang": "en" },{ "name": "name_fr", "titles": "name (fr)", "lang": "fr" },{ "name": "name_de", "titles": "name (de)", "lang": "de" },{ "titles": "latitude" },{ "titles": "longitude" }], "rowTitles": ["name_en", "name_fr", "name_de"] }
In this case, a screen reader or other display of the table could choose to read or display only the row title that matched the user's preferred language.
In other cases, the rowTitles
property may be set to an array of columns that together provided sufficient context to understand the column (eg ["firstName", "lastName"]
).
See also:
A lot of what's called "CSV" that's published on the web isn't actually CSV. It might use something other than commas (such as tabs or semi-colons) as separators between values, or might have multiple header lines.
The specification for CSV as a format is [RFC4180]. However, this is an informational specification and and not a formal standard. Therefore, applications may deviate from it.
The metadata that's described here can be used with files that contain tabular data but that aren't CSV. You can provide guidance to processors that are trying to parse those files through the dialect
property on a table description. For example, say we were dealing with a tab-separated file that contains multiple header lines at http://example.org/data/unemployment.tsv
:
"country" "country group" "name (en)" "name (fr)" "name (de)" "latitude" "longitude" "Land" "Ländergruppe" "Name (en)" "Name (fr)" "Name (de)" "Breite" "Länge" "pays" "groupe de pays" "nom (en)" "nom (fr)" "nom (de)" "latitude" "longitude" "at" "eu" "Austria" "Autriche" "Österreich" "47.6965545" "13.34598005" "be" "eu" "Belgium" "Belgique" "Belgien" "50.501045" "4.47667405" "bg" "eu" "Bulgaria" "Bulgarie" "Bulgarien" "42.72567375" "25.4823218"
The metadata for this file could be:
{ "@context": "http://www.w3.org/ns/csvw", "url": "http://example.org/data/unemployment.tsv", "dialect": { "delimiter": "\t", "headerRowCount": 3 } }
There are a number of other properties that you can set within the dialect to cater for the large range of weird things that people do in CSV files. They are:
commentPrefix
If the file contains comment lines, set this to the character used at the start of the lines that are comments (usually that's #
).
delimiter
If the file doesn't use commas as separators between values, set this to the separator that it uses.
doubleQuote
If the file uses \
to escape double quotes within values, set this to false
.
encoding
If the encoding of the file is not UTF-8, set this to the encoding.
header
If the file doesn't have a header line, set this to false
.
headerRowCount
If the file has more than one header line, set this to the number of header lines it has.
lineTerminators
If the file uses an unusual character at the end of its lines, set this to that character.
quoteChar
If the file doesn't use double quotes ("
) around values that contain commas, set this to the character that it does use.
skipBlankRows
If the file contains blank rows that should just be ignored, set this to true
.
skipColumns
If the file has some columns at the start that don't contain useful information, set this to that number of columns.
skipInitialSpace
If values in the file sometimes start with whitespace that should be ignored, set this to true
.
skipRows
If the file has some rows at the start that don't contain useful information, set this to that number of rows. (Sometimes people put metadata at the start of a CSV file, before the header lines.)
trim
If you don't want to ignore whitespace around values, set this to false
. If you want to only ignore whitespace at the beginning of values, set it to start
and if you want to only ignore whitespace at the end of values, to end
. By default whitespace at the start and the end of a value will be stripped away.
See also:
Tables in HTML are a bit more complicated than tables in CSV files. However, it is possible to use the metadata described here and in [tabular-metadata] to described HTML tables by embedding the metadata within a script
element in the header of the HTML page. A description and example of how to do this is provided in the [csvw-html] Note.
See also:
By default, if you point a processor at a CSV file on the web, they will look for a Link
header within the response with rel="describedby"
and specifying a content type of application/csvm+json
, application/ld+json
or application/json
. If you're publishing CSV files and it's possible to set the Link
header, this is the best way of telling processors where to look for metadata files.
If processors don't see an appropriate Link
header, they will append -metadata.json
to the end of the URL of the CSV file to try to find metadata for it. If they can't find a metadata file there, they will look in the directory containing the CSV file for a file called csv-metadata.json
and use that file.
Looking for files in this way isn't appropriate on all servers. As a publisher, it might be that you have URLs for CSV files that are automatically generated using queries, such as http://example.org/data?format=csv&x=15&y=53
or that you want metadata to live in a separate subdirectory or central location.
As a publisher, if you can't or don't want to use the Link
header, you can control where processors look for metadata for your CSV files by listing the locations to look at within the /.well-known/csvm
file on your server. (This is a well-known location as defined in [RFC5785].) This file should contain a list of URL patterns which will be expanded by substituting url
for the URL of the CSV file, and then resolving against the location of the CSV file. Lines might look like:
/metadata?for={url} ?format=metadata {+url}m
With a CSV file at http://example.org/data.csv
this would lead the processor to search for metadata at:
http://example.org/metadata?for=http://example.org/data.csv http://example.org/data.csv?format=metadata http://example.org/data.csvm
See also: