Abstract

CSV is one of the most popular formats for publishing data on the web. It is concise, easy to understand by both humans and computers, and aligns nicely to the tabular nature of most data.

But CSV is also a poor format for data. There is no mechanism within CSV to indicate the type of data in a particular column, or whether values in a particular column must be unique. It is therefore hard to validate and prone to errors such as missing values or differing data types within a column.

The CSV on the Web Working Group has developed standard ways to express useful metadata about CSV files and other kinds of tabular data. This primer takes you through the ways in which these standards work together, covering:

Where possible, this primer links back to the normative definitions of terms and properties in the standards. Nothing in this primer overrides those normative definitions.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g. RDF, JSON, or XML)". This non-normative document is a primer that describes how these standards work together for new readers. The normative standards are:

This document was published by the CSV on the Web Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

Table of Contents

1. Introduction

We'll begin with some basic concepts.

1.1 What is tabular data and CSV?

Tabular data is any data that can be arranged in a table, like the one here:

column 1column 2column 3
row 1cell in column 1 and row 1cell in column 2 and row 1cell in column 3 and row 1
row 2cell in column 1 and row 2cell in column 2 and row 2cell in column 3 and row 2
row 3cell in column 1 and row 3cell in column 2 and row 3cell in column 3 and row 3

There are lots of syntaxes for expressing tabular data on the web. You can put it in HTML tables, pass it around as Excel spreadsheets, or store it in an SQL database.

One easy way to pass around tabular data is as CSV: as comma-separated values. A CSV file writes each row on a separate line and each cell is separated from the next with a comma. The values of cells can be written with double quotes around them; this is necessary when a cell value contains a line break or a comma. So the tabular data above can be expressed in CSV as:

Example 1
cell in column 1 and row 1,cell in column 2 and row 1,cell in column 3 and row 1
cell in column 1 and row 2,cell in column 2 and row 2,cell in column 3 and row 2
cell in column 1 and row 3,cell in column 2 and row 3,cell in column 3 and row 3

or, with double quotes around cell values:

Example 2
"cell in column 1 and row 1","cell in column 2 and row 1","cell in column 3 and row 1"
"cell in column 1 and row 2","cell in column 2 and row 2","cell in column 3 and row 2"
"cell in column 1 and row 3","cell in column 2 and row 3","cell in column 3 and row 3"

CSV files usually have an additional row at the top called a header row, which gives human-readable names or titles for each of the columns. Here is a sample CSV file that contains a header row:

Example 3
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

Column titles are a type of annotation on a column, not part of the data itself. For example, they aren't included when you're counting the rows of data in a table:

column 1column 2column 3column 4column 5column 6column 7
titlescountrycountry groupname (en)name (fr)name (de)latitudelongitude
row 1ATeuAustriaAutricheÖsterreich47.696554513.34598005
row 2BEeuBelgiumBelgiqueBelgien50.5010454.47667405
row 3BGeuBulgariaBulgarieBulgarien42.7256737525.4823218

See also:

1.2 How can you provide extra information about CSV data?

You can provide extra information, known as metadata, about CSV files using a JSON metadata file. If you're just providing metadata about one file, the easiest thing to do is to name the metadata file by adding -metadata.json to the end of the name of the CSV file. For example, if your CSV file is called countries.csv then call the metadata file countries.csv-metadata.json.

The simplest metadata file you can create contains a single table description and looks like:

Example 4
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv"
}

Metadata files must always include the @context property with a value "http://www.w3.org/ns/csvw": this enables implementations to tell that these are CSV metadata files. The url property points to the CSV file that the metadata file describes.

Note

These metadata documents should be served from a web server with a media type of application/csvm+json if possible.

The description of a table within a metadata file can include:

By default, if implementations can't find a metadata file by appending -metadata.json to the filename of the CSV file, they'll just look for a file called csv-metadata.json in the same directory as the CSV file.

Metadata files can also describe several CSV files at once, using a slightly different syntax:

Example 5
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "countries.csv"
  }, {
    "url": "country-groups.csv"
  }, {
    "url": "unemployment.csv"
  }]
}

Here, the tables property holds an array of table descriptions, each with the URL of the CSV file that it's describing. The metadata file as a whole describes a group of tables. This is usually used when the tables relate to each other in some way: perhaps they're data in the same format from different periods of time, or perhaps they reference each other.

See also:

1.3 How can you provide extra information about the columns in a CSV file?

You can give information about the columns in a CSV file through a table schema. The simplest thing you can do is say what those columns are called. For example, if you have some CSV like this:

Example 6
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

You can say that the table contains seven columns named as they are in this CSV file like so:

Example 7
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv"
  "tableSchema": {
    "columns": [{
      "titles": "country"
    },{
      "titles": "country group"
    },{
      "titles": "name (en)"
    },{
      "titles": "name (fr)"
    },{
      "titles": "name (de)"
    },{
      "titles": "latitude"
    },{
      "titles": "longitude"
    }]
  }
}

A validator can check that the CSV file holds the expected columns (both the right number of columns and columns with the expected titles).

See also:

1.4 What tools implement or use CSV on the Web?

This Note can't keep an up-to-date list of the tools that implement or otherwise use CSV on the Web. Instead, look at:

2. Documenting CSVs

Providing metadata about CSVs can be useful simply in providing extra information to anyone who wants to work with them.

2.1 How can you provide documentation about a CSV file?

Here's an example that includes some extra descriptive documentation about a number of CSV files:

Example 8
{
  "@context": "http://www.w3.org/ns/csvw",
  "dc:title": "Unemployment in Europe (monthly)"
  "dc:description": "Harmonized unemployment data for European countries."
  "dc:creator": "Eurostat",
  "tables": [{
    "url": "countries.csv",
    "dc:title": "Countries"
  }, {
    "url": "country-groups.csv",
    "dc:title": "Country groups"
  }, {
    "url": "unemployment.csv",
    "dc:title": "Unemployment (monthly)",
    "dc:description": "The total number of people unemployed"
  }]
}

This example uses Dublin Core as a vocabulary for providing metadata. You can tell that's the vocabulary that's being used because the terms like dc:title and dc:description begin with the prefix dc, which stands for Dublin Core.

There are several different metadata vocabularies in common use around the web. Some people use Dublin Core. Some people use schema.org. Some people use DCAT. All of these vocabularies can be used independently or together. A publisher could alternatively use:

Example 9
{
  "@context": "http://www.w3.org/ns/csvw",
  "schema:name": "Unemployment in Europe (monthly)",
  "schema:description": "Harmonized unemployment data for European countries."
  "schema:creator": { "schema:name": "Eurostat" },
  "tables": [{
    "url": "countries.csv",
    "schema:name": "Countries"
  }, {
    "url": "country-groups.csv",
    "schema:name": "Country groups"
  }, {
    "url": "unemployment.csv",
    "schema:name": "Unemployment (monthly)",
    "schema:description": "The total number of people unemployed"
  }]
}
Note

It's not clear at the moment which metadata vocabulary will give publishers the most benefits. Search engines are likely to recognise schema.org. RDF-based systems are more likely to recognise Dublin Core.

More generally, you can use prefixed properties like these on any of the objects in a metadata document. The prefixes that are recognised are those used in the RDFa 1.1 Initial Context. Other properties must be named with full URLs.

See also:

2.2 How can you provide documentation about the columns in a CSV file?

You can use metadata properties like the ones used for tables for individual columns as well. For example:

Example 10
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv"
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "dc:description": "The ISO two-letter code for a country, in lowercase."
    },{
      "titles": "country group",
      "dc:description": "A lowercase two-letter code for a group of countries."
    },{
      "titles": "name (en)",
      "dc:description": "The official name of the country in English."
    },{
      "titles": "name (fr)",
      "dc:description": "The official name of the country in French."
    },{
      "titles": "name (de)",
      "dc:description": "The official name of the country in German."
    },{
      "titles": "latitude",
      "dc:description": "The latitude of an indicative point in the country."
    },{
      "titles": "longitude",
      "dc:description": "The longitude of an indicative point in the country."
    }]
  }
}

See also:

2.3 What about when metadata like this is structured?

We've already seen an example where metadata supplied about a table is structured. Look at the schema:creator here:

Example 11
{
  "@context": "http://www.w3.org/ns/csvw",
  "schema:name": "Unemployment in Europe (monthly)",
  "schema:description": "Harmonized unemployment data for European countries."
  "schema:creator": { "schema:name": "Eurostat" },
  "tables": [{
    "url": "countries.csv"
  }, {
    "url": "country-groups.csv"
  }, {
    "url": "unemployment.csv"
  }]
}

The metadata supplied for a table or group of tables can have nested objects within it. You can provide arrays of values. It will be interpreted as if it is (a minimal version of) [json-ld]. Particular patterns that are useful are:

See also:

2.4 How should you annotate individual cells?

There's no standardised facility in the CSV on the Web specifications for annotating individual cells, but there is a hook that will enable best practice about how to do that to emerge: the notes property on a table description can contain objects that represent annotations.

The W3C Web Annotation Working Group is working on a vocabulary for annotations themselves. This vocabulary includes the concept of a target for an annotation and its body (the content of the annotation).

If you are annotating a single cell, the target needs to point to that cell. The easiest way to do that is to use fragment identifiers for CSV files as defined in [RFC7111]. These fragment identifiers reference cells based on their position within the original CSV file, with the first row in the CSV file (usually the header row) counted as 1. For example, with the CSV file:

Example 15
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

the cell containing Belgique is at #cell=3,4. It's also possible to refer to ranges of cells with this syntax and to use * to refer to the last row in the file. For example, to target a comment on all the locations in the CSV file you could use the fragment identifier #cell=2,6-*,7.

To create comments, then, the notes property can hold an array of objects that use the Web Annotation structure. For example:

Example 16
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "notes": [{
    "type": "Annotation",
    "target": "countries.csv#cell=2,6-*,7",
    "body": "These locations are of representative points.",
    "motivation": "commenting"
  }, {
    "type": "Annotation",
    "target": "countries.csv#cell=3,4",
    "body": "Corrected.",
    "motivation": "editing"
  }]
}

See also:

3. Validating CSVs

Validation is all about checking whether a file contains what you expect it to contain. For CSV files, this can be about:

3.1 How can you say what kinds of values are expected in a column?

There's lots more that you can say about the expected content of columns in a CSV file. The most obvious thing is to indicate the data type. For example, with the CSV file:

Example 17
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

The first five columns are strings and the last two are numbers. You can indicate this with the datatype property for each column:

Example 18
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv"
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "datatype": "string"
    },{
      "titles": "country group",
      "datatype": "string"
    },{
      "titles": "name (en)",
      "datatype": "string"
    },{
      "titles": "name (fr)",
      "datatype": "string"
    },{
      "titles": "name (de)",
      "datatype": "string"
    },{
      "titles": "latitude",
      "datatype": "number"
    },{
      "titles": "longitude",
      "datatype": "number"
    }]
  }
}
Note

You don't have to include "datatype": "string" for columns that are strings — columns are assumed to hold strings if no datatype is explicitly specified.

There are a number of different datatypes supported by CSV on the Web implementations, based on the set defined in [xmlschema11-2]. The complete set is shown in the following diagram:

Built-in Datatype Hierarchy diagram
Fig. 1 Diagram showing the built-in datatypes, based on [xmlschema11-2]; names in parentheses denote aliases to the [xmlschema11-2] terms (see the diagram in SVG or PNG formats)

See also:

3.2 How do you define new datatypes?

You can define new datatypes based on the built-in datatypes using an object as the value of the datatype property rather than a string. For example:

Example 19
"datatype": {
  "base": "integer",
  "minimum": "1",
  "maximum": "5"
}

The base property must be an existing datatype. The other properties on the new datatype define extra restrictions on values of the new datatype. You can give the new datatype a name and description to provide extra documentation for people using the data:

Example 20
"datatype": {
  "dc:title": "Star Rating",
  "dc:description": "A star rating between 1 and 5."
  "base": "integer",
  "minimum": "1",
  "maximum": "5"
}

See also:

3.3 How do you restrict what kind of strings a column contains?

In the example we've been using, the first column always contains a country code consisting of two lowercase letters. This is a new datatype based on string. You can specify the format for string values in a column using a regular expression, like this:

Example 21
{
  "titles": "country",
  "datatype": {
    "dc:title": "Country Code",
    "dc:description": "Country codes as specified in ISO 3166.",
    "base": "string",
    "format": "[a-z]{2}"
  }
}

It's also possible to restrict the length of a string-based datatype using the length or minLength and/or maxLength properties. For example the following says that the column holding the English names of countries must have values between 3 and 128 characters long:

Example 22
{
  "titles": "name (en)",
  "datatype": {
    "base": "string",
    "minLength": "3",
    "maxLength": "128"
  }
}

See also:

3.4 How do you restrict the size of numbers a column contains?

The size of numbers in a column can be restricted using the minimum and maximum properties and/or the minExclusive and maxExclusive properties. In our example, one column contains latitudes, which can range between -90 and +90:

Example 23
{
  "titles": "latitude",
  "datatype": {
    "base": "number",
    "minimum": "-90",
    "maximum": "90"
  }
}

See also:

3.5 How do you ensure that decimal numbers have a particular precision or leading zeros?

In the example we're using, the latitudes are provided to between six and eight decimal places and there are no leading or trailing zeros. You can use the format property to provide a pattern that matches these numbers. In the pattern, 0 represents a required digit and # represents an optional digit. For the latitude, the definition looks like:

Example 24
{
  "titles": "latitude",
  "datatype": {
    "base": "number",
    "minimum": "-90",
    "maximum": "90",
    "format": "#0.000000##"
  }
}

The format property can also be used to indicate that values in a column should have leading zeros. For example, if a column were supposed to hold a three digit number you could use the pattern 000.

See also:

3.6 How do you validate numbers that aren't in standard numeric formats?

Sometimes numbers within a CSV file won't be in a standard numeric format. For example, they might include commas as grouping characters (eg 12,345,678) or as decimal points (eg 12,3). In these cases, you can use the format property with an object value.

To match numbers with grouping separators as in 12,345,678 you should specify "," as the groupChar for the format. The pattern property then holds the pattern that indicates how many digits should be in each group. This example validates numbers with groups of three digits separated by commas:

Example 25
"datatype": {
  "base": "integer",
  "format": {
    "groupChar": ",",
    "pattern": "#,##0"
  }
}

To match numbers with decimal separators other than ., as in 12,3, you should specify "," as the decimalChar for the format. This example validates numbers with commas as decimal points:

Example 26
"datatype": {
  "base": "integer",
  "format": {
    "decimalChar": ","
  }
}

You can mix and match decimal and grouping characters and patterns. For example, in France it's standard to use commas for decimal points and spaces for grouping characters, so CSV files produced in France might contain numbers like 1 234 567,89. These could be validated with a datatype like:

Example 27
"datatype": {
  "base": "integer",
  "format": {
    "decimalChar": ",",
    "groupChar": " ",
    "pattern": "# ##0,0#"
  }
}

See also:

3.7 How do you restrict what kind of dates and times a column contains?

Dates and times are treated similarly to numbers. You can use the minimum, maximum, minExclusive and/or maxExclusive properties to restrict their values.

For example, to indicate that the column should contain dates later than 1st January 2000, you can use the datatype:

Example 28
"datatype": {
  "base": "date",
  "minimum": "2000-01-01"
}

To indicate that the column should contain times before midday (exclusive), you can use the datatype:

Example 29
"datatype": {
  "base": "time",
  "minExclusive": "12:00:00"
}
Note

The format of dates or times used for minimum or maximum values is always the ISO 8601 format: yyyy-MM-dd for dates, HH:mm:ss.S for times and yyyy-MM-ddTHH:mm:ss.S for date/times.

See also:

3.8 How do you validate dates that aren't in standard date or time formats?

Dates and times in CSV files often come in formats other than the standard ISO 8601 format. You can use the format property to indicate the expected format of the date or time.

For example, to recognise dates in the usual UK format such as 31/10/2015 for 31st October 2015, you could use:

Example 30
"datatype": {
  "base": "date",
  "format": "dd/MM/yyyy"
}

Implementations are only required to understand a particular set of common formats for dates and times. These formats are, for dates:

For times:

And for date/times:

Note

None of these formats include names or abbreviations for months or days. The implementation you use might support other date and time formats as well, including specialised formats for the other date and time datatypes such as gMonthYear. Check your implementation's documentation to see what it supports.

See also:

3.9 How do you validate boolean values that aren't true or false?

By default, validators will recognise boolean values that are 1 or 0 or true or false. If a CSV file contains boolean values like T and F or Yes and No then you can create a derived boolean datatype that uses that syntax using the format property, with the two possible values separated by a |, for example:

Example 31
"datatype": {
  "base": "boolean",
  "format": "Yes|No"
}

See also:

3.10 How do you specify a list of valid values for a column?

The example CSV file we're using is this:

Example 32
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

In it, the second column, country group, contains either the value eu or the value non-eu. Despite there being two values, this isn't a boolean column. Instead, it's a column that has only two valid values.

There are two ways to specify that a column contains one of a list of values: using a regular expression to list the values and using a reference to a separate CSV file that contains the values.

3.10.1 Using a regular expression to give a list of valid values

Using a regular expression to list values works best if those values are strings, if there are only a few of them, and if they are self-explanatory such that you don't want to provide any additional information about them.

In this example, the country group column could be specified as:

Example 33
{
  "titles": "country group",
  "datatype": {
    "base": "string",
    "format": "eu|non-eu"
  }
}

As described in section 3.3 How do you restrict what kind of strings a column contains?, the format property contains a regular expression. List the options separated by | and ensure that you escape any of the characters in the options that have special meaning in regular expressions.

See also:

3.10.2 Using a separate CSV file to give a list of valid values

A more powerful method of listing the valid values in a particular column is to list those values in a separate CSV file. The CSV file can be very simple, containing just a single column that lists the valid values. In this example, we can create country-groups.csv containing:

Example 34
group
eu
non-eu

We can then provide definitions for both the countries.csv and country-groups.csv files, and state that the country group column in countries.csv references the group column in country_groups.csv. This reference from one file to another is called a foreign key.

To use a foreign key, both files must be referenced in the same metadata document, and both columns must be given names. Column names are only used inside the metadata document and you can only use (ASCII) letters, numbers, . and _ within them. So the basic metadata document, before adding the foreign key, should look like:

Example 35
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "countries.csv",
    "tableSchema": {
      "columns": [{
        "titles": "country"
      },{
        "name": "country_group",
        "titles": "country group"
      },{
        "titles": "name (en)"
      },{
        "titles": "name (fr)"
      },{
        "titles": "name (de)"
      },{
        "titles": "latitude"
      },{
        "titles": "longitude"
      }]
    }
  }, {
    "url": "country-groups.csv",
    "tableSchema": {
      "columns": [{
        "name": "group",
        "titles": "group"
      }]
    }
  }]
}

The foreign key is defined in the schema for the countries.csv table, as follows:

Example 36
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "countries.csv",
    "tableSchema": {
      "columns": [{
        "titles": "country"
      },{
        "name": "country_group",
        "titles": "country group"
      },{
        "titles": "name (en)"
      },{
        "titles": "name (fr)"
      },{
        "titles": "name (de)"
      },{
        "titles": "latitude"
      },{
        "titles": "longitude"
      }],
      "foreignKeys": [{
        "columnReference": "country_group",
        "reference": {
          "resource": "country-groups.csv",
          "columnReference": "group"
        }
      }]
    }
  }, {
    "url": "country-groups.csv",
    "tableSchema": {
      "columns": [{
        "name": "group",
        "titles": "group"
      }]
    }
  }]
}

The foreignKeys property can hold several foreign keys. Each contains a columnReference to a column or list of columns in one CSV file, and a reference which defines a column or list of columns in another CSV file.

The advantage of this method of listing the values allowed in a column is that the CSV file that contains the list of possible values can also provide additional information about those values. For example, we can provide expansions of what eu and non-eu mean in different languages:

Example 37
group,name (en),name (fr),name (de)
eu,"European Union","Union européenne","Europäische Union"
non-eu,"Non EU countries","Pays hors Union européenne",Nicht-EU-Länder

See also:

3.11 How do you enable a column to have a mix of value types?

Sometimes a column that contains numbers will contain special values, such as X or NK, when a value is unknown or redacted. If these columns are simply classified as numeric then the non-numeric values will be classed as errors.

To avoid values being classified as errors when they are being used to indicate missing values, list those values as null values using the null property. This can take either a single string or an array of strings. For example, the latitude column might usually be numeric but hold an X if there is no indicative point for the country:

Example 38
{
  "titles": "latitude",
  "null": "X",
  "datatype": {
    "base": "number",
    "minimum": "-90",
    "maximum": "90"
  }
}
Note

The null property can also be useful when a column contains values that are of the right type but used to indicate a missing value. It's not uncommon, for example, for publishers to use the value 99 in a column that contains integers to indicate that a value is missing.

See also:

3.12 What if the cells in a column contain lists of values?

Cells may contain lists of values with spaces, semi-colons or other characters acting as separators. For example, instead of using separate latitude and longitude columns, the CSV that we're looking at could contain a single latlong column consisting of the latitude and longitude separated by a space:

Example 39
"country","country group","name (en)","name (fr)","name (de)","latlong"
"at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"

In this scenario, the separator property can be used to indicate that the values in a column are lists themselves, and what separator is used between the items in the list. For example:

Example 40
{
  "titles": "latlong",
  "separator": " ",
  "datatype": {
    "base": "number",
    "minimum": "-180",
    "maximum": "180"
  }
}

When separator is specified, the datatype property applies to each of the values in the list. There's no way to indicate that the values in the list have different datatypes, or set limits on the length of the list.

See also:

3.13 How do you ensure every row has a value for a column?

By default, a validator won't give any errors if a value is missing in a column. If you want to ensure that a value is provided for every row in the column, use the required property for that column, with the value true.

In our example, we might say that all the columns are required except the French and German names (applications being expected to default to the English name if the translation is missing):

Example 41
"tableSchema": {
  "columns": [{
    "titles": "country",
    "required": true
  },{
    "titles": "country group",
    "required": true
  },{
    "titles": "name (en)",
    "required": true
  },{
    "titles": "name (fr)"
  },{
    "titles": "name (de)"
  },{
    "titles": "latitude",
    "required": true
  },{
    "titles": "longitude",
    "required": true
  }]
}
Note

Setting required to true means that you can't have any null values in a column. If, in this example, latitude and longitude had null set to X then those columns couldn't contain an X. It doesn't usually make sense to specify both null and required.

See also:

3.14 How do you indicate all the values in a column are unique?

In our example, the country column must contain unique values: each row should define a different country. To specify this, you can use a primary key to refer to the name of the column:

Example 42
"tableSchema": {
  "columns": [{
    "name": "country",
    "titles": "country"
  },{
    "titles": "country group"
  },{
    "titles": "name (en)"
  },{
    "titles": "name (fr)"
  },{
    "titles": "name (de)"
  },{
    "titles": "latitude"
  },{
    "titles": "longitude"
  }],
  "primaryKey": "country"
}

Each CSV file can only have one primary key. A primary key can be made up of a number of columns that have to be unique in combination: the classic example would be ["firstName", "lastName"].

See also:

4. Transforming CSVs

CSV is great for transferring data around the place, but it's often not as useful for processing that data. Other formats can be better at providing structure and meaning for data. So a means for transforming data out of CSV and into other formats is a common requirement.

4.1 What can you transform CSV into?

You can of course transform CSV into anything you like using your favourite programming language. However, the CSV on the Web specs provide standardised ways of mapping CSV into two other formats:

These specifications make use of the metadata described in this primer during the transformation to decide what to include in the transformed output and how to include it. Processors that support these standardised transformations can be used by people who can't program.

These specifications describe what output you get if you don't supply any metadata. Given a CSV file like this:

Example 43
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

the usual, minimal JSON output would be:

Example 44
[{
  "country": "at",
  "country group": "eu",
  "name (en)": "Austria",
  "name (fr)": "Autriche",
  "name (de)": "Österreich",
  "latitude": "47.6965545",
  "longitude": "13.34598005"
}, {
  "country": "be",
  "country group": "eu",
  "name (en)": "Belgium",
  "name (fr)": "Belgique",
  "name (de)": "Belgien",
  "latitude": "50.501045",
  "longitude": "4.47667405"
}, {
  "country": "bg",
  "country group": "eu",
  "name (en)": "Bulgaria",
  "name (fr)": "Bulgarie",
  "name (de)": "Bulgarien",
  "latitude": "42.72567375",
  "longitude": "25.4823218"
}]

and the RDF output would be:

Example 45
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[
  <#country> "at";
  <#country%20group> "eu";
  <#latitude> "47.6965545";
  <#longitude> "13.34598005";
  <#name%20%28de%29> "Österreich";
  <#name%20%28en%29> "Austria";
  <#name%20%28fr%29> "Autriche"
] .

[
  <#country> "be";
  <#country%20group> "eu";
  <#latitude> "50.501045";
  <#longitude> "4.47667405";
  <#name%20%28de%29> "Belgien";
  <#name%20%28en%29> "Belgium";
  <#name%20%28fr%29> "Belgique"
] .

[
  <#country> "bg";
  <#country%20group> "eu";
  <#latitude> "42.72567375";
  <#longitude> "25.4823218";
  <#name%20%28de%29> "Bulgarien";
  <#name%20%28en%29> "Bulgaria";
  <#name%20%28fr%29> "Bulgarie"
] .
Note

The specifications define how to transform CSV into RDF. In this Primer all the examples use Turtle as the serialisation for that RDF. Implementations may generate other serialisations for RDF such as RDF/XML or JSON-LD.

See also:

4.2 What values get used in the output of a transformation?

The result of a transformation will include typed values based on the datatype and language of each column. So if we state that the longitude and latitude are numbers and the names are strings in the given language, as in this metadata:

Example 46
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "columns": [{
      "titles": "country"
    },{
      "titles": "country group"
    },{
      "titles": "name (en)",
      "lang": "en"
    },{
      "titles": "name (fr)",
      "lang": "fr"
    },{
      "titles": "name (de)",
      "lang": "de"
    },{
      "titles": "latitude",
      "datatype": "number"
    },{
      "titles": "longitude",
      "datatype": "number"
    }]
  }
}

then the JSON will look like:

Example 47
[{
  "country": "at",
  "country group": "eu",
  "name (en)": "Austria",
  "name (fr)": "Autriche",
  "name (de)": "Österreich",
  "latitude": 47.6965545,
  "longitude": 13.34598005
},{
  "country": "be",
  "country group": "eu",
  "name (en)": "Belgium",
  "name (fr)": "Belgique",
  "name (de)": "Belgien",
  "latitude": 50.501045,
  "longitude": 4.47667405
},{
  "country": "bg",
  "country group": "eu",
  "name (en)": "Bulgaria",
  "name (fr)": "Bulgarie",
  "name (de)": "Bulgarien",
  "latitude": 42.72567375,
  "longitude": 25.4823218
}]

and the RDF will look like:

Example 48
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[
  <#country> "at";
  <#country%20group> "eu";
  <#latitude> 4.76965545e1;
  <#longitude> 1.334598005e1;
  <#name%20%28de%29> "Österreich"@de;
  <#name%20%28en%29> "Austria"@en;
  <#name%20%28fr%29> "Autriche"@fr
] .

[
  <#country> "be";
  <#country%20group> "eu";
  <#latitude> 5.0501045e1;
  <#longitude> 4.47667405e0;
  <#name%20%28de%29> "Belgien"@de;
  <#name%20%28en%29> "Belgium"@en;
  <#name%20%28fr%29> "Belgique"@fr
] .

[
  <#country> "bg";
  <#country%20group> "eu";
  <#latitude> 4.272567375e1;
  <#longitude> 2.54823218e1;
  <#name%20%28de%29> "Bulgarien"@de;
  <#name%20%28en%29> "Bulgaria"@en;
  <#name%20%28fr%29> "Bulgarie"@fr
] .

See also:

4.3 What value gets used in the output if it's missing in the CSV?

If there's a missing value in the CSV, then usually the property will be omitted in the result as well. So if the latitude and longitude are missing for the first row of the CSV file we've been using, the equivalent JSON won't include those properties either:

Example 49
{
  "country": "at",
  "country group": "eu",
  "name (en)": "Austria",
  "name (fr)": "Autriche",
  "name (de)": "Österreich"
}

and nor will the RDF:

Example 50
[
  <#country> "at";
  <#country%20group> "eu";
  <#name%20%28de%29> "Österreich"@de;
  <#name%20%28en%29> "Austria"@en;
  <#name%20%28fr%29> "Autriche"@fr
] .

If you want a value to appear even when the value is missing in the CSV, you can provide that value as the default for the column. This value must be supplied as a string but will be treated exactly as if it had appeared within the CSV file. For example, if you supply a non-numeric string for a numeric column, as in:

Example 51
{
  "titles": "latitude",
  "datatype": "number",
  "default": "NOT KNOWN"
}

The default value will be seen as an invalid value and therefore represented as a string in the output.

See also:

4.4 How are the properties in the output of a transformation named?

By default, the properties in the JSON and RDF come from the titles of the columns (the headers in the CSV file). In RDF, since the properties are URIs, the names are URL-encoded and turned into fragment identifiers in a URL based on the location of the CSV file being transformed. Hence in the previous examples there have been property names like "name (en)" in JSON and #name%20%28en%29 in RDF.

You can override this default by supplying a name for the column. That name will be used instead of the title of the column when creating the property. So if you have:

Example 52
{
  "titles": "name (en)",
  "name": "english_name"
}

then the property will be called english_name in the JSON output and #english_name in the RDF output.

You can also use the propertyUrl property to supply a prefixed name or a URL for the property. For example, to use schema:latitude as the name for the latitude property in both the JSON and the RDF output, you could use:

Example 53
{
  "titles": "latitude",
  "propertyUrl": "schema:latitude"
  "datatype": "number"
}

The propertyUrl property can be used to map several columns in the CSV file onto properties with the same name. In our example, each country has several names which are all really the same property; the schema could look like:

Example 54
{
  "titles": "name (en)",
  "propertyUrl": "schema:name"
  "lang": "en"
},{
  "titles": "name (fr)",
  "propertyUrl": "schema:name"
  "lang": "fr"
},{
  "titles": "name (de)",
  "propertyUrl": "schema:name"
  "lang": "de"
}

With that schema, the JSON output will contain:

Example 55
"schema:name": ["Belgium", "Belgique", "Belgien"]

and the RDF output will contain:

Example 56
schema:name "Belgium"@en, "Belgique"@fr, "Belgien"@de

If there isn't a relevant property in one of the vocabularies that is built-in to CSV on the Web (those listed as part of the RDFa 1.1 initial context), the propertyUrl can hold a URL template. Usually this template won't include any substitutable parts because it's generally the case that the property should be the same for the whole column. For example, you might have:

Example 57
{
  "titles": "country group",
  "propertyUrl": "http://example.org/def/countryGroup"
}

In this case, the result of a transformation to JSON will contain:

Example 58
"http://example.org/def/countryGroup": "eu"

and the RDF similarly:

Example 59
<http://example.org/def/countryGroup> "eu"

See also:

4.5 How do you map values into URLs?

Sometimes a column contains a value that can be programmatically mapped into a URL. In this case, the valueUrl property contains a template for the URL that it should be mapped into.

For example, say that there were pages for each country on the web at e.g. http://example.org/country/at. In that case, the URL for the country could be generated with the URL template http://example.org/country/{country}. Within this template, {country} inserts the value from the column named country into the URL. So the metadata could contain:

Example 60
{
  "titles": "country",
  "name": "country",
  "valueUrl": "http://example.org/country/{country}",
  "propertyUrl": "schema:url"
}

The JSON output would then contain:

Example 61
"schema:url": "http://example.org/country/at"

and the RDF output:

Example 62
schema:url <http://example.org/country/at>

If you want to preserve the original value from the column and use it to create a URL, you may want to introduce a virtual column. For example, with the latitude and longitude of each country available, you might want to provide a link to a map centered on the country within Google Maps. The URLs for these look like https://www.google.com/maps/@50.501045,4.476674,7z, and a template like https://www.google.com/maps/@{lat},{long},7z.

To add a property that points to this URL, add a virtual column at the end of the column definitions within the schema. A virtual column definition looks just like a normal column definition but with the virtual property set to true:

Example 63
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "columns": [{
      "titles": "country"
    },{
      "titles": "country group"
    },{
      "titles": "name (en)",
      "lang": "en"
    },{
      "titles": "name (fr)",
      "lang": "fr"
    },{
      "titles": "name (de)",
      "lang": "de"
    },{
      "titles": "latitude",
      "name": "lat",
      "datatype": "number"
    },{
      "titles": "longitude",
      "name": "long",
      "datatype": "number"
    },{
      "virtual": true,
      "propertyUrl": "schema:hasMap",
      "valueUrl": "https://www.google.com/maps/@{lat},{long},7z"
    }]
  }
}

In JSON, this will result in the output:

Example 64
"schema:hasMap": "https://www.google.com/maps/@42.72567375,25.4823218,7z"

and in RDF, the output:

Example 65
schema:hasMap <https://www.google.com/maps/@42.72567375,25.4823218,7z>

See also:

4.6 How do you include an identifier for the thing described by each row?

By default, the things described by each row don't have identifiers associated with them in either JSON or RDF outputs. You can add an identifier for the row by setting the aboutUrl property. Usually that's done at the top level of the schema.

For example, say each row in countries.csv was about a country whose identifier looked like http://example.org/country/{code} where code was the value within the first column of the CSV file (the country column). The aboutUrl could be set to the generate this URL for each row using a URL template:

Example 66
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "aboutUrl": "http://example.org/country/{code}",
    "columns": [{
      "titles": "country",
      "name": "code"
    },{
      "titles": "country group"
    },{
      "titles": "name (en)",
      "lang": "en"
    },{
      "titles": "name (fr)",
      "lang": "fr"
    },{
      "titles": "name (de)",
      "lang": "de"
    },{
      "titles": "latitude",
      "datatype": "number"
    },{
      "titles": "longitude",
      "datatype": "number"
    }]
  }
}

In the JSON, these identifiers are turned into @id properties on the objects generated for each row:

Example 67
[{
  "@id": "http: //example.org/country/at",
  "code": "at",
  "country group": "eu",
  "name (en)": "Austria",
  "name (fr)": "Autriche",
  "name (de)": "Österreich"
},{
  "@id": "http: //example.org/country/be",
  "code": "be",
  "country group": "eu",
  "name (en)": "Belgium",
  "name (fr)": "Belgique",
  "name (de)": "Belgien",
  "latitude": 50.501045,
  "longitude": 4.47667405
},{
  "@id": "http: //example.org/country/bg",
  "code": "bg",
  "country group": "eu",
  "name (en)": "Bulgaria",
  "name (fr)": "Bulgarie",
  "name (de)": "Bulgarien",
  "latitude": 42.72567375,
  "longitude": 25.4823218
}]

In the RDF, these identifiers become the identifiers for the entities that the properties relate to, rather than those being blank nodes:

Example 68
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/country/at>
   <#code> "at";
   <#country%20group> "eu";
   <#latitude> 4.76965545e1;
   <#longitude> 1.334598005e1;
   <#name%20%28de%29> "Österreich"@de;
   <#name%20%28en%29> "Austria"@en;
   <#name%20%28fr%29> "Autriche"@fr .

<http://example.org/country/be>
   <#code> "be";
   <#country%20group> "eu";
   <#latitude> 5.0501045e1;
   <#longitude> 4.47667405e0;
   <#name%20%28de%29> "Belgien"@de;
   <#name%20%28en%29> "Belgium"@en;
   <#name%20%28fr%29> "Belgique"@fr .

<http://example.org/country/bg>
   <#code> "bg";
   <#country%20group> "eu";
   <#latitude> 4.272567375e1;
   <#longitude> 2.54823218e1;
   <#name%20%28de%29> "Bulgarien"@de;
   <#name%20%28en%29> "Bulgaria"@en;
   <#name%20%28fr%29> "Bulgarie"@fr .

See also:

4.7 How do you indicate the type of the thing described by each row?

Whether generating JSON or RDF it can be useful to indicate that each row contains data about a particular type of thing, such as a Person or a Country. There isn't usually a column within tabular data that indicates the type of the row (because it's generally the same for every row, so including it would be superfluous), so you have to add it as a virtual column.

The virtual column needs to come after the descriptions of columns actually within the data. It should have its virtual property set to true and its propertyUrl set to rdf:type to indicate that the virtual column will indicate the type of entity the row is about. The valueUrl property can then be set to the prefixed name or URL of the type of the entity. For example, when each row represents a Country, you might use:

Example 69
{
  "virtual": true,
  "propertyUrl": "rdf:type",
  "valueUrl": "schema:Country"
}

In the JSON this value will be transformed into the value of the @type property on the relevant object:

Example 70
"@type": "schema:Country"

In RDF output, it becomes the class for the entity:

Example 71
a schema:Country

See also:

4.8 How do you include extra metadata in the result of the transformation?

Transformations into JSON or RDF can be carried out in one of two modes. In minimal mode, which is what we've looked at so far, the output only contains data from the rows within the CSV file. In full mode, the output also contains metadata about the CSV file, including metadata from the metadata file.

For example, say that our metadata file looked like:

Example 72
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "schema:name": "Countries",
  "schema:description": "European countries for which comparative statistics are collected by Eurostat.",
  "schema:creator": { "schema:name": "Eurostat" },
  "tableSchema": {
    ...
  }
}

The output of a full JSON transformation would look like:

Example 73
{
  "tables": [{
    "url":"countries.csv",
    "schema:name":"Countries",
    "schema:description":"European countries for which comparative statistics are collected by Eurostat.",
    "schema:creator": {
      "schema:name": "Eurostat"
    },
    "row": [{
      "url": "countries.csv#row=2",
      "rownum":1,
      "describes": [{
        "country": "at",
        "country group": "eu",
        "name (en)": "Austria",
        "name (fr)": "Autriche",
        "name (de)": "Österreich",
        "latitude": 47.6965545,
        "longitude": 13.34598005
      }]
    }, {
      "url": "countries.csv#row=3",
      "rownum": 2,
      "describes": [{
        "country": "be",
        "country group": "eu",
        "name (en)": "Belgium",
        "name (fr)": "Belgique",
        "name (de)": "Belgien",
        "latitude": 50.501045,
        "longitude": 4.47667405
      }]
    }, {
      "url": "countries.csv#row=4",
      "rownum": 3,
      "describes": [{
        "country": "bg",
        "country group": "eu",
        "name (en)": "Bulgaria",
        "name (fr)": "Bulgarie",
        "name (de)": "Bulgarien",
        "latitude": 42.72567375,
        "longitude": 25.4823218
      }]
    }]
  }]
}

Similarly, the output of the full RDF would look like:

Example 74
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

 [
    a csvw:TableGroup;
    csvw:table [
      a csvw:Table;
      schema:creator [ schema:name "Eurostat" ];
      schema:description "European countries for which comparative statistics are collected by Eurostat.";
      schema:name "Countries";
      csvw:row [
        a csvw:Row;
        csvw:describes [
          <#country> "at";
          <#country%20group> "eu";
          <#latitude> 4.76965545e1;
          <#longitude> 1.334598005e1;
          <#name%20%28de%29> "Österreich"@de;
          <#name%20%28en%29> "Austria"@en;
          <#name%20%28fr%29> "Autriche"@fr
        ];
        csvw:rownum 1;
        csvw:url <#row=2>
      ],  [
        a csvw:Row;
        csvw:describes [
          <#country> "be";
          <#country%20group> "eu";
          <#latitude> 5.0501045e1;
          <#longitude> 4.47667405e0;
          <#name%20%28de%29> "Belgien"@de;
          <#name%20%28en%29> "Belgium"@en;
          <#name%20%28fr%29> "Belgique"@fr
        ];
        csvw:rownum 2;
        csvw:url <#row=3>
      ],  [
        a csvw:Row;
        csvw:describes [
          <#country> "bg";
          <#country%20group> "eu";
          <#latitude> 4.272567375e1;
          <#longitude> 2.54823218e1;
          <#name%20%28de%29> "Bulgarien"@de;
          <#name%20%28en%29> "Bulgaria"@en;
          <#name%20%28fr%29> "Bulgarie"@fr
        ];
        csvw:rownum 3;
        csvw:url <#row=4>
      ];
      csvw:url <>
    ]
 ] .

Metadata provided about tables is interpreted based on the rules for [json-ld] which means that you can provide as much structure within that metadata as you like, including providing structured values, languages and datatypes, so that the data in the output includes what you need it to.

See also:

4.9 How can you remove output from a transformation result?

By default, the output from a JSON or RDF transformation will include all the data from all the columns of all the tables in the metadata document. It may be that you're not interested in some of that within the output of your transformation. In that case, you can use the suppressOutput property in the metadata to exclude the data that you're not interested in.

For example, perhaps I'm not interested in the non-English names of countries in my output. In that case, I could suppress them as follows:

Example 75
{
  "titles": "name (fr)",
  "lang": "fr",
  "suppressOutput": true
}

Similarly, when generating the output for a set of tables, you can suppress the output from a whole table by adding the suppressOutput property to the description of that table:

Example 76
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "countries.csv"
  }, {
    "url": "country-groups.csv",
    "suppressOutput": true
  }, {
    "url": "unemployment.csv"
  }]
}

See also:

4.10 How do you transform into nested structures in JSON?

While tabular data is by necessity flat, it often holds data that is actually structured. For example, the data that we have been looking at:

Example 77
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

if modelled according to the schema.org vocabulary, would look like:

Example 78
[{
  "@type": "schema:Country",
  "schema:name": ["Austria", "Autriche", "Österreich"],
  "schema:geo": {
    "@type": "schema:GeoCoordinates",
    "schema:latitude":47.6965545,
    "schema:longitude":13.34598005
  }
}, {
  "@type": "schema:Country",
  "schema:name": ["Belgium", "Belgique", "Belgien"],
  "schema:geo": {
    "@type": "schema:GeoCoordinates",
    "schema:latitude":50.501045,
    "schema:longitude":4.47667405
  }
}, {
  "@type": "schema:Country",
  "schema:name": ["Bulgaria", "Bulgarie", "Bulgarien"],
  "schema:geo": {
    "@type": "schema:GeoCoordinates",
    "schema:latitude":42.72567375,
    "schema:longitude":25.4823218
  }
}]

Generating JSON in this shape requires the judicious use of virtual columns, aboutUrl and valueUrl: if you create a column whose valueUrl corresponds to the aboutUrl of another column, you will create nested properties.

In this example, we can use two aboutUrls: one in the form http://example.org/country/{code} for the Country and one in the form http://example.org/country/{code}#geo for the geo-coordinates of the country. The names are properties of the former while the longitude and latitude are properties of the latter. A virtual column can add the association between the two objects, with a propertyUrl of schema:geo, like so:

Example 79
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "aboutUrl": "http://example.org/country/{code}",
    "columns": [{
      "titles": "country",
      "name": "code",
      "suppressOutput": true
    },{
      "titles": "country group",
      "suppressOutput": true
    },{
      "titles": "name (en)",
      "lang": "en",
      "propertyUrl": "schema:name"
    },{
      "titles": "name (fr)",
      "lang": "fr",
      "propertyUrl": "schema:name"
    },{
      "titles": "name (de)",
      "lang": "de",
      "propertyUrl": "schema:name"
    },{
      "titles": "latitude",
      "datatype": "number",
      "aboutUrl": "http://example.org/country/{code}#geo",
      "propertyUrl": "schema:latitude"
    },{
      "titles": "longitude",
      "datatype": "number",
      "aboutUrl": "http://example.org/country/{code}#geo",
      "propertyUrl": "schema:longitude"
    },{
      "virtual": true,
      "propertyUrl": "rdf:type",
      "valueUrl": "schema:Country"
    },{
      "virtual": true,
      "propertyUrl": "schema:geo",
      "valueUrl": "http://example.org/country/{code}#geo"
    },{
      "virtual": true,
      "aboutUrl": "http://example.org/country/{code}#geo",
      "propertyUrl": "rdf:type",
      "valueUrl": "schema:GeoCoordinates"
    }]
  }
}
Note

Note also in this example the use of suppressOutput to remove properties that we're not interested in, and the use of virtual columns to add type information to both types of generated object.

The result of this transformation is close to what we were aiming for, though with the addition of @id properties:

Example 80
[{
  "@id": "http://example.org/country/at",
  "@type": "schema:Country",
  "schema:name": ["Austria", "Autriche", "Österreich"],
  "schema:geo": {
    "@id": "http://example.org/country/at#geo",
    "@type": "schema:GeoCoordinates",
    "schema:latitude": 47.6965545,
    "schema:longitude": 13.34598005
  }
}, {
  "@id": "http://example.org/country/be",
  "@type": "schema:Country",
  "schema:name": ["Belgium", "Belgique", "Belgien"],
  "schema:geo": {
    "@id":"http://example.org/country/be#geo",
    "@type": "schema:GeoCoordinates",
    "schema:latitude": 50.501045,
    "schema:longitude": 4.47667405
  }
}, {
  "@id": "http://example.org/country/bg",
  "@type": "schema:Country",
  "schema:name": ["Bulgaria", "Bulgarie", "Bulgarien"],
  "schema:geo": {
    "@id": "http://example.org/country/bg#geo",
    "@type": "schema:GeoCoordinates",
    "schema:latitude": 42.72567375,
    "schema:longitude": 25.4823218
  }
}]

The same metadata will generate similar RDF, though the nesting structure is not so obvious because of the way RDF works:

Example 81
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/country/at> a schema:Country;
   schema:geo <http://example.org/country/at#geo>;
   schema:name "Austria"@en, "Autriche"@fr, "Österreich"@de .

<http://example.org/country/be> a schema:Country;
   schema:geo <http://example.org/country/be#geo>;
   schema:name "Belgium"@en, "Belgique"@fr, "Belgien"@de .

<http://example.org/country/bg> a schema:Country;
   schema:geo <http://example.org/country/bg#geo>;
   schema:name "Bulgaria"@en, "Bulgarie"@fr, "Bulgarien"@de .

<http://example.org/country/at#geo> a schema:GeoCoordinates;
   schema:latitude 4.76965545e1;
   schema:longitude 1.334598005e1 .

<http://example.org/country/be#geo> a schema:GeoCoordinates;
   schema:latitude 5.0501045e1;
   schema:longitude 4.47667405e0 .

<http://example.org/country/bg#geo> a schema:GeoCoordinates;
   schema:latitude 4.272567375e1;
   schema:longitude 2.54823218e1 .

See also:

4.11 How do you indicate that values should be mapped to a list rather than repeating properties in RDF?

If you are used to using RDF you'll know that there's a big difference between having a property that has multiple values (ie multiple triples with the same subject and property) and a property that has a rdf:List as a value. Sometimes one is appropriate, sometimes the other. The ordered property enables you to indicate which to use for values that are sequences in the original data.

Let's use as an example the version of our data in which the latitude and longitude are in the same property:

Example 82
"country","country group","name (en)","name (fr)","name (de)","latlong"
"at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"

In this example, if we state that the latlong column is space-separated like so:

Example 83
{
  "titles": "latlong",
  "separator": " ",
  "datatype": "number"
}

we'll end up with output like this:

Example 84
[
  <#country> "at";
  <#country%20group> "eu";
  <#name%20%28en%29> "Austria"@en;
  <#name%20%28fr%29> "Autriche"@fr;
  <#name%20%28de%29> "Österreich"@de;
  <#latlong> 4.76965545e1, 1.334598005e1;
] .

This shows the #latlong property having two values: 4.76965545e1 and 1.334598005e1. These values could easily get mixed up such that the first was taken to be the longitude and the second the latitude rather than the other way around.

To avoid this mix-up occurring, set the ordered property to true:

Example 85
{
  "titles": "latlong",
  "separator": " ",
  "ordered": true,
  "datatype": "number"
}

This will make the processor use a rdf:List for the value of the #latlong property instead, which in Turtle looks like:

Example 86
[
  <#country> "at";
  <#country%20group> "eu";
  <#name%20%28en%29> "Austria"@en;
  <#name%20%28fr%29> "Autriche"@fr;
  <#name%20%28de%29> "Österreich"@de;
  <#latlong> ( 4.76965545e1, 1.334598005e1 );
] .
Note

The ordered property makes no difference to JSON output because sequences in CSV are always transformed into arrays in JSON.

See also:

4.12 How should you transform CSV into JSON-LD?

As illustrated above, it is possible to transform CSV into something that looks like JSON-LD by transforming it into JSON. You can add @id properties for identifiers using aboutUrl and add @type properties for types using virtual columns.

However, if you're really after JSON-LD as an output, the best route is to transform into RDF and emit that RDF as JSON-LD. This will give you more control over the context that's used to determine the properties and structure of the output.

See also:

4.13 How should you display CSV tables in HTML?

There's no single specified way of displaying CSV tables in HTML. If you are writing code to do so, it's good practice to:

See also:

4.14 How can you transform CSV into the DataCube vocabulary?

The Data Cube vocabulary [vocab-data-cube] is a vocabulary for statistical data based on SDMX (Statistical Data and Metadata eXchange). Statistical data is often expressed in tables, sometimes with one row per Observation and sometimes with each cell containing a different Observation.

Generating data in the Data Cube vocabulary requires the use of techniques that have been described above, such as adding identifiers to entities, adding structure and types through virtual columns, and using metadata to supply additional static context.

There is a fully worked out example of transforming to DataCube, using meterological data as its basis, available on the Github repository for the Working Group.

See also:

4.15 How can you transform CSV into other formats?

While there are only specifications for transforming CSV into JSON and RDF, there is an extension mechanism within CSV metadata to indicate other transformations that could be applied to CSV files. The transformations property on a table description holds an array of descriptions of transformations that processors could carry out. There's no guarantees that a given processor will recognise them, but over time it might be that there begins to be recognised practices for how such transformations might work.

The transformations must have the following properties:

targetFormat
gives a URL for the format that the transformation transforms into, for example http://www.iana.org/assignments/media-types/application/xml for XML
url
points to a script or template that can be used to transform the CSV into that format
scriptFormat
gives a URL for the format that the script is in, for example https://mustache.github.io/ for Mustache or http://www.iana.org/assignments/media-types/application/javascript for Javascript

You can also supply a titles property to provide a human-readable description of the output of the transformation, and a source property to indicate that the input to the transformation isn't the original CSV or tabular data, but json or rdf.

For example, if I wanted to convert the data that we've been using into XML, I could create a Mustache template like this at xml-template.mustache:

Example 87
{{#tables}}
  <countries>
    {{#row}}
      {{#describes}}
        <country id="{{country}}" group="{{country_group}}">
          <name xml:lang="en">{{name_en}}</name>
          <name xml:lang="fr">{{name_fr}}</name>
          <name xml:lang="de">{{name_de}}</name>
          <geo lat="{{latitude}}" long="{{longitude}}" />
        </country>
      {{/describes}}
    {{/row}}
  </countries>
{{/tables}}

In the metadata for the CSV file I could then include:

Example 88
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "transformations": [{
    "targetFormat": "http://www.iana.org/assignments/media-types/application/xml",
    "titles": "Simple XML version",
    "url": "xml-template.mustache",
    "scriptFormat": "https://mustache.github.io/",
    "source": "json"
  }]
}

Processors that recognised the URL for Mustache could offer users the option of passing the JSON output to a Mustache processor, which would generate:

Example 89
<countries>
  <country id="at" group="eu">
    <name xml:lang="en">Austria</name>
    <name xml:lang="fr">Autriche</name>
    <name xml:lang="de">Österreich</name>
    <geo lat="47.6965545" long="13.34598005" />
  </country>
  <country id="be" group="eu">
    <name xml:lang="en">Belgium</name>
    <name xml:lang="fr">Belgique</name>
    <name xml:lang="de">Belgien</name>
    <geo lat="50.501045" long="4.47667405" />
  </country>
  <country id="bg" group="eu">
    <name xml:lang="en">Bulgaria</name>
    <name xml:lang="fr">Bulgarie</name>
    <name xml:lang="de">Bulgarien</name>
    <geo lat="42.72567375" long="25.4823218" />
  </country>
</countries>

Processors should document the kind of transformation scripts that they can recognise and how they process them.

See also:

5. Handling language in CSVs

There are a number of features in the CSV on the Web metadata documents that support scenarios encountered in CSV files that use different languages. We already discussed using varying number formats in section 3.6 How do you validate numbers that aren't in standard numeric formats? and date formats in section 3.8 How do you validate dates that aren't in standard date or time formats?. Here we'll look at how to create metadata files, schemas and CSV files that take account of and work across multiple languages.

5.1 How do you indicate the language used by the metadata file?

The metadata file will often contain natural language text, such as titles and descriptions of columns and tables. Unless you specify otherwise, implementations will assume all this text is in an undefined language (und). If you want to specify what natural language is in use within the metadata file, you have to change the way the @context is specified. Instead of the normal value:

Example 90
"@context": "http://www.w3.org/ns/csvw"

The @context should take an array, where the first value is the usual URL as a string and the second is an object with a @language property set to the language being used within the metadata file. This example, which has English-language titles and descriptions, illustrates:

Example 91
{
  "@context": [ "http://www.w3.org/ns/csvw", { "@language": "en "} ],
  "dc:title": "Countries"
  "url": "countries.csv"
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "dc:description": "The ISO two-letter code for a country, in lowercase."
    },{
      "titles": "country group",
      "dc:description": "A lowercase two-letter code for a group of countries."
    },{
      "titles": "name (en)",
      "dc:description": "The official name of the country in English."
    },{
      "titles": "name (fr)",
      "dc:description": "The official name of the country in French."
    },{
      "titles": "name (de)",
      "dc:description": "The official name of the country in German."
    },{
      "titles": "latitude",
      "dc:description": "The latitude of an indicative point in the country."
    },{
      "titles": "longitude",
      "dc:description": "The longitude of an indicative point in the country."
    }]
  }
}

See also:

5.2 How do you provide metadata such as descriptions in different languages?

Metadata such as descriptions can be objects rather than strings. Using objects is useful when you want to provide the language for a value. In this case, the object should have two properties: a @value property holding the natural-language string and a @lang value indicating what language that string is in. For example:

Example 92
{
  "titles": "name (en)",
  "dc:description": {
    "@value": "The official name of the country in English.",
    "@lang": "en"
  }
}

You can use an array to provide the same metadata in many different languages, for example:

Example 93
"dc:title": [{
  "@lang": "en",
  "@value": "Unemployment in Europe (monthly)"
},{
  "@lang": "de",
  "@value": "Arbeitslosigkeit in Europa (monatlich)"
},{
  "@lang": "fr",
  "@value": "Le Chômage en Europe (mensuel)"
}]
Note

If you don't indicate the language used for metadata, processors will assume it's the default language used in the metadata as a whole.

See also:

5.3 How do you provide titles for columns in different languages?

You can use an object as the value for the titles property for a column to provide titles in different languages. Within the object, each property is a language and the value is the title in that language:

Example 94
"titles": {
  "en": "Country",
  "de": "Land",
  "fr": "Pays"
}
Note

If you don't indicate the language used for the title of a column, processors will assume it's the default language used in the metadata as a whole.

See also:

5.4 How do you specify the language of the values in a column?

Within an individual CSV file, it may be that different columns contain values that are in different languages. In the example we're using, there are three columns that each contain the name of a country, in English, French and German:

Example 95
"country","country group","name (en)","name (fr)","name (de)","latlong"
"at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"

Use the lang property on the column description to indicate the language of text in that column:

Example 96
{
  "@context": [ "http://www.w3.org/ns/csvw", { "@language": "en "} ],
  "dc:title": "Countries"
  "url": "countries.csv"
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "dc:description": "The ISO two-letter code for a country, in lowercase."
    },{
      "titles": "country group",
      "dc:description": "A lowercase two-letter code for a group of countries."
    },{
      "titles": "name (en)",
      "dc:description": "The official name of the country in English.",
      "lang": "en"
    },{
      "titles": "name (fr)",
      "dc:description": "The official name of the country in French.",
      "lang": "fr"
    },{
      "titles": "name (de)",
      "dc:description": "The official name of the country in German.",
      "lang": "de"
    },{
      "titles": "latitude",
      "dc:description": "The latitude of an indicative point in the country."
    },{
      "titles": "longitude",
      "dc:description": "The longitude of an indicative point in the country."
    }]
  }
}
Note

There's no relationship between the language used in a metadata file and that used in the CSV file that it describes.

See also:

5.5 How do you indicate that tables should be displayed right-to-left?

Implementations that display tables according to the specs should mostly be able to guess whether a table should be displayed left-to-right or right-to-left based on the content of the table. Implementations will look at the content of the cells to work out which way to display their content and will look at the content of the table as a whole to work out whether to display the first column on the right or left of the page.

If you want to override the display of a particular column then you can use the textDirection property on a column description to explicitly be rtl or ltr:

Example 97
{
  "titles": "name (ar)",
  "lang": "ar",
  "textDirection": "rtl"
}

If you want to override the display of the table overall then you can use the tableDirection property on the description of the table, or for all tables in the group.

Example 98
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "results.csv",
  "tableDirection": "rtl"
}

The value of the tableDirection property is inherited to all columns in the table, so any text within this table will similarly be displayed right-to-left. This can be overridden by setting textDirection to ltr or auto (in which case the direction of the text within each cell will be determined by its contents).

See also:

6. Advanced Use

6.1 How do you support units of measure?

There is no native support for expressing the units of measure for a particular column. You can, however, use documentation to tell people who are using the data what unit of measure is used for that particular column. This can be informal within the description of the column:

Example 99
{
  "titles": "distance",
  "dc:description": "Distance (kilometres)"
}

Alternatively, it can be more explicit using an existing units-of-measure property and vocabulary, such as:

Example 100
{
  "titles": "distance",
  "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": {
    "@id": "http://qudt.org/vocab/unit#Kilometer"
  }
}

to which you could even add more detail if you wanted (this is replicating the canonical definition of definition of unit:Kilometer from the Quanitites, Units, Dimensions and Data Types (QUDT) Ontologies):

Example 101
{
 "titles": "distance",
 "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": {
   "@id": "http://qudt.org/vocab/unit#Kilometre",
   "@type": [
     "http://qudt.org/schema/qudt#SIUnit",
     "http://qudt.org/schema/qudt#DerivedUnit",
     "http://qudt.org/schema/qudt#LengthUnit"
   ],
   "rdfs:label": "Kilometer",
   "http://qudt.org/schema/qudt#abbreviation": "km",
   "http://qudt.org/schema/qudt#code": "1091",
   "http://qudt.org/schema/qudt#conversionMultiplier": 1000,
   "http://qudt.org/schema/qudt#conversionOffset": 0.0,
   "http://qudt.org/schema/qudt#symbol": "km",
   "skos:exactMatch": { "@id": "http://dbpedia.org/resource/Kilometre" }
 }
}       

6.1.1 Supporting units of measure by transforming to structured values

If you are generating JSON or RDF from CSV, you may want to generate structured values that include the units of each value from the CSV file. This is a little complicated, but useful if different rows contain values that use different units. In this case, the output that you're aiming for in RDF would look something like:

Example 102
[] :distance <#row-1-distance> .

<#row-1-distance>
  schema:value 3.5 ;
  schema:unitCode <http://qudt.org/vocab/unit#Kilometer> ;
  .

and in JSON something like this:

Example 103
"distance": {
  "@id": "#row-1-distance",
  "schema:value": 3.5
  "schema:unitCode": "http://qudt.org/vocab/unit#Kilometer"
}
Note

You may want to use different properties than schema:value and schema:unitCode; if so, just use different propertyUrls.

You need to decide on a pattern for the URLs for the values themselves, and set the aboutUrl for the relevant column create that URL. In this example, the URLs that look like #row-1-distance can be generated with the pattern #row-{_row}-distance. The propertyUrl for the column needs to be schema:value as the value in the column provides the value for that property. So the column description looks like:

Example 104
{
  "name": "distance_value",
  "titles": "distance",
  "datatype": "number",
  "aboutUrl": "#row-{_row}-distance",
  "propertyUrl": "schema:value"
}

You then need to use virtual columns (descriptions of additional columns that don't exist in the source CSV) to generate the relationship between the thing whose distance is being measured and the structured value, and the additional property providing the unit for the structured value.

To generate the relationship being the thing that has the distance and the structured value, the virtual column's valueUrl needs to hold the same URL template as you used before:

Example 105
{
  "name": "distance",
  "virtual": true,
  "valueUrl": "#row-{_row}-distance"
}

To create the units property, you need another virtual column where the aboutUrl of that virtual column generates the URL for the structured value, the propertyUrl is schema:unitCode and the valueUrl is the URL representing the unit (in this case http://qudt.org/vocab/unit#Kilometre):

Example 106
{
  "name": "distance_unit",
  "aboutUrl": "#row-{_row}-distance",
  "propertyUrl": "schema:unitCode",
  "valueUrl": "http://qudt.org/vocab/unit#Kilometer"
}

If it's necessary to add more detail about the unit (e.g. the fact that it's a unit of length) this can be done with additional virtual columns:

Example 107
{
  "name": "kilometer_abbreviation",
  "virtual": true,
  "aboutUrl": "http://qudt.org/vocab/unit#Kilometer",
  "propertyUrl": "rdf:type",
  "valueUrl": "http://qudt.org/schema/qudt#LengthUnit"
}

Usually, however, processors should recognise or be able to resolve the URL for the unit to understand that it's a unit of length, if this is important for onward processing.

See also:

6.1.2 Supporting units of measure with named datatypes in RDF

If you are generating RDF from CSV, you may want to define a datatype for a column and then provide additional information about that datatype as properties. For example, the column description could look like:

Example 108
{
  "titles": "distance",
  "datatype": {
    "@id": "http://example.org/unit/kilometre",
    "rdfs:label": "Kilometre",
    "base": "number"
  }
}

When values are generated in RDF for this column, they will be assigned the relevant datatype, for example:

Example 109
[] :distance "3.5"^^<http://example.org/unit/kilometre> .

<http://example.org/unit/kilometre> rdfs:label "Kilometre" .

Again, it is possible to include additional information about the unit being used as the datatype within the definition of the datatype:

Example 110
{
  "titles": "distance",
  "datatype": {
    "@id": "http://example.org/unit/kilometre",
    "@type": "http://example.org/quantity/length",
    "rdfs:label": "Kilometre",
    "base": "number",
    "skos:notation": "km"
  }
}

See also:

6.2 How do you support geospatial data?

There are many different ways of representing geospatial data within a CSV file, and no single best practice for doing so.

At the simplest level, it's possible to reference geospatial coordinates as points with separate columns for latitude, longitude and if necessary altitude (or using a different spatial reference system). This enables separate validation for the separate coordinates. The example used throughout this primer uses this setup:

Example 111
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

Metadata can be used to provide specialist types for the values of these coordinates, to indicate that they are latitude and longitude (by mapping to the well-known schema:latitude and schema:longitude properties which specify the use of WGS84), to group the coordinates together, and to provide a link that uses the coordinates to provide a map:

Example 112
{
  "titles": "latitude",
  "name": "lat",
  "datatype": {
    "base": "number",
    "minimum": "-90",
    "maximum": "90"
  },
  "aboutUrl": "http://example.org/country/{code}#geo",
  "propertyUrl": "schema:latitude"
}, {
  "titles": "longitude",
  "name": "long",
  "datatype": {
    "base": "number",
    "minimum": "-180",
    "maximum": "180"
  },
  "aboutUrl": "http://example.org/country/{code}#geo",
  "propertyUrl": "schema:longitude"
}, {
  "virtual": true,
  "propertyUrl": "schema:geo",
  "valueUrl": "http://example.org/country/{code}#geo"
}, {
  "virtual": true,
  "aboutUrl": "http://example.org/country/{code}#geo",
  "propertyUrl": "rdf:type",
  "valueUrl": "schema:GeoCoordinates"
}, {
  "virtual": true,
  "propertyUrl": "schema:hasMap",
  "valueUrl": "https://www.google.com/maps/@{lat},{long},7z"
}
Note

You can put latitude and longitude into a single column, with a character separator between the numbers as shown in section 3.12 What if the cells in a column contain lists of values?. However, this makes it harder to accurately validate the individual coordinates. They also cannot be separated out into separate property values when converting to JSON or RDF. So this is a more restrictive method and best avoided.

CSV files may also need to contain geometries beyond individual points. There are no built-in formats for geometries recognised by implementations of CSV on the Web. Geometries may be expressed using GeoJSON, GML, KML or OGC Well-Known Text (WKT) representations. In each case, schemas may indicate that columns containing geometries adhere to a particular type:

You can use the format property to further constrain the content of columns containing these values. For example, you could use:

Example 113
"datatype": {
  "@id": "http://geojson.org/",
  "base": "json",
  "format": "\\{ ?\"type\": ?\"Polygon\",.+\\}"
}

to indicate that a column contains a Polygon in GeoJSON, or:

Example 114
"datatype": {
  "@id": "http://www.iana.org/assignments/media-types/application/gml+xml",
  "base": "xml",
  "format": ".*\\<gml:Point xmlns:gml=\"http://www\.opengis\.net/ont/gml" srsName=\"([^\"])+\".*\\>.+\\</gml:Point\\>"
}

to indicate that a column contains a GML Point, or:

Example 115
"datatype": {
  "@id": "http://www.iana.org/assignments/media-types/application/vnd.google-earth.kml+xml",
  "base": "xml",
  "format": ".*\\<kml xmlns=\"http://www\\.opengis\\.net/kml/2.2\"\\>.+\\</kml\\>"
}

to indicate that a column contains KML, or:

Example 116
"datatype": {
  "base": "string",
  "format": "POLYGON \\(\\(\\d+(\\.\\d+)? \\d+(\\.\\d+)?(, \\d+(\\.\\d+)? \\d+(\\.\\d+)?)+\\)\\)"
  "rdfs:seeAlso": "http://www.opengeospatial.org/standards/sfa"
}

to indicate that a column contains a very basic OGC WKT Polygon. (More sophisticated regular expressions could be used if properties held different types of objects such as points or lines or more complex polygons expressed in OGC WKT.)

Note

There is no standard way of expressing which coordinate reference system (CRS) is being used within CSV metadata, but GeoJSON, GML and OGC WKT values can include information about which CRS they use. The W3C Spatial Data on the Web Working Group intend to make recommendations in this area within [sdw-bp].

See also:

6.3 How can you specify a single schema for multiple CSV files?

CSV on the Web is designed to enable you to reuse the same schema when publishing multiple CSV files, even if those files are created by different organisations and therefore reside in different places. Rather than embedding a schema within the description of a table, the tableSchema property can point out to a schema held somewhere else.

For example, if you were a statistical agency and wanted municipalities to publish their unemployment figures using the same schema, you could specify the columns that you wanted included within the schema you specified at http://example.org/schema/unemployment.json:

Example 117
{
  "columns": [{
    "name": "municipality"
  }, {
    "name": "month",
    "datatype": "gYearMonth"
  }, {
    "name": "unemployment",
    "datatype": "number"
  }]
}

Experience shows that publishers of data in CSV files often use their own headings for the columns. So long as these don't change the meaning of the column, as a statistical agency you probably don't care. The titles property allows you to provide multiple alternative titles that people may use in an array:

Example 118
{
  "columns": [{
    "name": "municipality",
    "titles": [ "Municipality", "City", "Area" ]
  }, {
    "name": "month",
    "titles": [ "Month", "Period" ],
    "datatype": "gYearMonth"
  }, {
    "name": "unemployment",
    "titles": [ "Unemployment Rate", "Unemployment", "Number unemployed" ]
    "datatype": "number"
  }]
}
Note

In cases where the data is being gathered from multiple countries, it may also be useful to specify multiple possible titles in different languages, as described in section 5.3 How do you provide titles for columns in different languages?.

It will help consistency and processing of the data if the municipalities use a consistent set of codes to indicate which municipality the data relates to. As the statistical agency, you can supply the relevant codes in a CSV file, e.g. http://example.org/ref/municipalities.csv:

Example 119
code,name
0101,Absecon City
0102,Atlantic City
0103,Brigantine City
...

This file will of course have its own simple schema, http://example.org/schema/municipalities.json, that provides the datatype for the municipality codes and indicates that they are unique through a primary key:

Example 120
{
  "columns": [{
    "name": "code",
    "datatype": { "format": "\d{4}" }
  }, {
    "name": "name"
  }],
  "primaryKey": "code"
}

A foreign key in the schema supplied by the statistical authority ensures that the codes used in the unemployment data match up with the standard set supplied by the statistical authority:

Example 121
{
  "columns": [{
    "name": "municipality",
    "titles": [ "Municipality", "City", "Area" ]
  }, {
    "name": "month",
    "titles": [ "Month", "Period" ],
    "datatype": "gYearMonth"
  }, {
    "name": "unemployment",
    "titles": [ "Unemployment Rate", "Unemployment", "Number unemployed" ]
    "datatype": "number"
  }],
  "foreignKeys": [{
    "columnReference": "municipality",
    "reference": {
      "resource": "http://example.org/ref/municipalities.csv",
      "columnReference": "code"
    }
  }]
}

When a municipality describes their unemployment CSV, they will need to point to your schema for their data and to the centrally provided municipalities reference data and schema:

Example 122
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "http://local.example.org/data/unemployment.csv",
    "tableSchema": "http://example.org/schema/unemployment.json"
  }, {
    "url": "http://example.org/ref/municipalities.csv",
    "tableSchema": "http://example.org/schema/municipalities.json"
  }]
}
Note

A more complex example in which there is linking between pairs of files where the schemas are provided by a central authority is provided in the section Foreign Key Reference Between Schemas within [tabular-metadata].

See also:

6.4 How can you provide a title for a row?

It is useful to have titles for rows both for screen readers and in other displays where it may be difficult to view the complete context of a table and therefore to understand the content of a row.

The rowTitles property within a schema provides an array of columns that provide sufficient context to label the row. These may sometimes be the same as the columns used for the primaryKey in the table, but are more likely to be columns that contain human readable text.

For example, with the CSV:

Example 123
"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

The rowTitles property could be set to reference the columns containing the name of each country, in the different languages available:

Example 124
"tableSchema": {
  "columns": [{
    "titles": "country"
  },{
    "titles": "country group"
  },{
    "name": "name_en",
    "titles": "name (en)",
    "lang": "en"
  },{
    "name": "name_fr",
    "titles": "name (fr)",
    "lang": "fr"
  },{
    "name": "name_de",
    "titles": "name (de)",
    "lang": "de"
  },{
    "titles": "latitude"
  },{
    "titles": "longitude"
  }],
  "rowTitles": ["name_en", "name_fr", "name_de"]
}

In this case, a screen reader or other display of the table could choose to read or display only the row title that matched the user's preferred language.

In other cases, the rowTitles property may be set to an array of columns that together provided sufficient context to understand the column (eg ["firstName", "lastName"]).

See also:

6.5 What about CSV that isn't standard CSV?

A lot of what's called "CSV" that's published on the web isn't actually CSV. It might use something other than commas (such as tabs or semi-colons) as separators between values, or might have multiple header lines.

The specification for CSV as a format is [RFC4180]. However, this is an informational specification and and not a formal standard. Therefore, applications may deviate from it.

The metadata that's described here can be used with files that contain tabular data but that aren't CSV. You can provide guidance to processors that are trying to parse those files through the dialect property on a table description. For example, say we were dealing with a tab-separated file that contains multiple header lines at http://example.org/data/unemployment.tsv:

Example 125
"country"	"country group"	"name (en)"	"name (fr)"	"name (de)"	"latitude"	"longitude"
"Land"	"Ländergruppe"	"Name (en)"	"Name (fr)"	"Name (de)"	"Breite"	"Länge"
"pays"	"groupe de pays"	"nom (en)"	"nom (fr)"	"nom (de)"	"latitude"	"longitude"
"at"	"eu"	"Austria"	"Autriche"	"Österreich"	"47.6965545"	"13.34598005"
"be"	"eu"	"Belgium"	"Belgique"	"Belgien"	"50.501045"	"4.47667405"
"bg"	"eu"	"Bulgaria"	"Bulgarie"	"Bulgarien"	"42.72567375"	"25.4823218"

The metadata for this file could be:

Example 126
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "http://example.org/data/unemployment.tsv",
  "dialect": {
    "delimiter": "\t",
    "headerRowCount": 3
  }
}

There are a number of other properties that you can set within the dialect to cater for the large range of weird things that people do in CSV files. They are:

commentPrefix

If the file contains comment lines, set this to the character used at the start of the lines that are comments (usually that's #).

delimiter

If the file doesn't use commas as separators between values, set this to the separator that it uses.

doubleQuote

If the file uses \ to escape double quotes within values, set this to false.

encoding

If the encoding of the file is not UTF-8, set this to the encoding.

header

If the file doesn't have a header line, set this to false.

headerRowCount

If the file has more than one header line, set this to the number of header lines it has.

lineTerminators

If the file uses an unusual character at the end of its lines, set this to that character.

quoteChar

If the file doesn't use double quotes (") around values that contain commas, set this to the character that it does use.

skipBlankRows

If the file contains blank rows that should just be ignored, set this to true.

skipColumns

If the file has some columns at the start that don't contain useful information, set this to that number of columns.

skipInitialSpace

If values in the file sometimes start with whitespace that should be ignored, set this to true.

skipRows

If the file has some rows at the start that don't contain useful information, set this to that number of rows. (Sometimes people put metadata at the start of a CSV file, before the header lines.)

trim

If you don't want to ignore whitespace around values, set this to false. If you want to only ignore whitespace at the beginning of values, set it to start and if you want to only ignore whitespace at the end of values, to end. By default whitespace at the start and the end of a value will be stripped away.

See also:

6.6 What about tables in HTML?

Tables in HTML are a bit more complicated than tables in CSV files. However, it is possible to use the metadata described here and in [tabular-metadata] to described HTML tables by embedding the metadata within a script element in the header of the HTML page. A description and example of how to do this is provided in the [csvw-html] Note.

See also:

6.7 What if you want to put metadata files somewhere else?

By default, if you point a processor at a CSV file on the web, they will look for a Link header within the response with rel="describedby" and specifying a content type of application/csvm+json, application/ld+json or application/json. If you're publishing CSV files and it's possible to set the Link header, this is the best way of telling processors where to look for metadata files.

If processors don't see an appropriate Link header, they will append -metadata.json to the end of the URL of the CSV file to try to find metadata for it. If they can't find a metadata file there, they will look in the directory containing the CSV file for a file called csv-metadata.json and use that file.

Looking for files in this way isn't appropriate on all servers. As a publisher, it might be that you have URLs for CSV files that are automatically generated using queries, such as http://example.org/data?format=csv&x=15&y=53 or that you want metadata to live in a separate subdirectory or central location.

As a publisher, if you can't or don't want to use the Link header, you can control where processors look for metadata for your CSV files by listing the locations to look at within the /.well-known/csvm file on your server. (This is a well-known location as defined in [RFC5785].) This file should contain a list of URL patterns which will be expanded by substituting url for the URL of the CSV file, and then resolving against the location of the CSV file. Lines might look like:

Example 127
/metadata?for={url}
?format=metadata
{+url}m

With a CSV file at http://example.org/data.csv this would lead the processor to search for metadata at:

Example 128
http://example.org/metadata?for=http://example.org/data.csv
http://example.org/data.csv?format=metadata
http://example.org/data.csvm

See also:

A. References

A.1 Informative references

[RFC4180]
Y. Shafranovich. Common Format and MIME Type for Comma-Separated Values (CSV) Files. October 2005. Informational. URL: https://tools.ietf.org/html/rfc4180
[RFC5785]
M. Nottingham; E. Hammer-Lahav. Defining Well-Known Uniform Resource Identifiers (URIs). April 2010. Proposed Standard. URL: https://tools.ietf.org/html/rfc5785
[RFC7111]
M. Hausenblas; E. Wilde; J. Tennison. URI Fragment Identifiers for the text/csv Media Type. January 2014. Informational. URL: https://tools.ietf.org/html/rfc7111
[csv2json]
Jeremy Tandy; Ivan Herman. Generating JSON from Tabular Data on the Web. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/csv2json/
[csv2rdf]
Jeremy Tandy; Ivan Herman; Gregg Kellogg. Generating RDF from Tabular Data on the Web. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/csv2rdf/
[csvw-html]
Gregg Kellogg. Embedding Tabular Metadata in HTML. W3C Note. URL: http://www.w3.org/TR/csvw-html/
[html-rdfa]
Manu Sporny. HTML+RDFa 1.1 - Second Edition. 17 March 2015. W3C Recommendation. URL: http://www.w3.org/TR/html-rdfa/
[json-ld]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
[sdw-bp]
Jeremy Tandy; Payam Barnaghi; Linda van den Brink. Spatial Data on the Web Best Practices. 19 January 2016. W3C Working Draft. URL: http://www.w3.org/TR/sdw-bp/
[tabular-data-model]
Jeni Tennison; Gregg Kellogg. Model for Tabular Data and Metadata on the Web. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/tabular-data-model/
[tabular-metadata]
Jeni Tennison; Gregg Kellogg. Metadata Vocabulary for Tabular Data. 17 December 2015. W3C Recommendation. URL: http://www.w3.org/TR/tabular-metadata/
[vocab-data-cube]
Richard Cyganiak; Dave Reynolds. The RDF Data Cube Vocabulary. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/vocab-data-cube/
[xmlschema11-2]
David Peterson; Sandy Gao; Ashok Malhotra; Michael Sperberg-McQueen; Henry Thompson; Paul V. Biron et al. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. 5 April 2012. W3C Recommendation. URL: http://www.w3.org/TR/xmlschema11-2/