Abstract

Tabular data is routinely transferred on the web in a variety of formats, including variants on CSV, tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps. This document outlines a basic data model, or infoset, for tabular data and metadata about that tabular data that can be used as a basis for validation, display, or creating other formats. It also contains some non-normative guidance for publishing tabular data as CSV and how that maps into the tabular data model.

An annotated model of tabular data can be supplemented by separate metadata about the table. This specification defines how implementations should locate that metadata, given a file containing tabular data. The standard syntax for that metadata is defined in [ tabular-metadata ]. Note, however, that applications may have other means to create annotated tables, e.g., through application-specific APIs; this model does not depend on the specificities described in [ tabular-metadata ].

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims primarily to satisfy the first of those Recommendations (see section 5. Locating Metadata ), though it also specifies an underlying model for tabular data and is therefore a basis for the other chartered Recommendations.

The definition of CSV used in this document is based on IETF's [ RFC4180 ], which is an Informational RFC. The working group's expectation is that future suggestions to refine RFC 4180 (e.g. around encoding and line endings) will be relayed to the IETF and contribute to its discussions about moving CSV to the Standards track.

Many CSV files containing tabular data embed metadata, for example in lines before the header row of an otherwise standard CSV document. This specification does not define any formats for embedding metadata within CSV files, aside from the titles of columns in the header row, which is defined in CSV itself. We would encourage groups that define tabular data formats to also define a mapping into the annotated tabular data model defined in this document.

This document was published by the CSV on the Web Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org ( subscribe , archives ). All comments are welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .

This document is governed by the 1 August 2014 W3C Process Document .

Table of Contents

1. Introduction

Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of cells (although some of these cells may be empty), which provide values of properties of the thing described by the row. In tabular data, cells within the same column provide values for the same property of the things described by each row. This is what differentiates tabular data from other line-oriented formats.

Tabular data is routinely transferred on the web in a textual format called CSV, but the definition of CSV in practice is very loose. Some people use the term to mean any delimited text file. Others stick more closely to the most standard definition of CSV that there is, [ RFC4180 ]. Appendix A describes the various ways in which CSV is defined. This specification refers to such files, as well as tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps, as tabular data files .

In section 4. Tabular Data Models , this document defines a model for tabular data that abstracts away from the varying syntaxes that are used when exchanging tabular data. The model includes annotations, or metadata, about collections of tables, individual tables, rows, columns, and cells. These annotations are typically supplied through separate metadata files; section 5. Locating Metadata defines how these metadata files can be located, while [ tabular-metadata ] defines what they contain.

Once an annotated table has been created, it can be processed in various ways, such as display, validation, or conversion into other formats. This processing is described in section 6. Processing Tables .

This specification does not normatively define a format for exchanging tabular data. However, it does provide some best practice guidelines for publishing tabular data as CSV, in section 7. Best Practice CSV , and for parsing both this syntax and those similar to it, in section 8. Parsing Tabular Data .

2. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY , MUST , MUST NOT , SHOULD , and SHOULD NOT are to be interpreted as described in [ RFC2119 ].

This specification makes use of the compact IRI syntax; please refer to the section on Compact IRIs in [ JSON-LD ].

Within examples, this specification makes use of the following namespaces:

dc :
http://purl.org/dc/terms/
rdfs :
http://www.w3.org/2000/01/rdf-schema#
schema :
http://schema.org/
xsd :
http://www.w3.org/2001/XMLSchema#

3. Typographical conventions

The following typographic conventions are used in this specification:

markup
Markup (elements, attributes, properties), machine processable values (string, characters, media types), a property name, or a file name is in red-orange monospace font.
variable
A variable in pseudo-code or in an algorithm description is in italics.
definition
A definition of a term, to be used elsewhere in this or other specifications, is in bold and italics.
definition reference
A reference to a definition in this document is underlined and is also an active link to the definition itself.
markup definition reference
A reference to a definition in this document , when the reference itself is also markup, is underlined, in red-orange monospace font, and is also an active link to the definition itself.
external definition reference
A reference to a definition in another document is underlined, in italics, and is also an active link to the definition itself.
markup external definition reference
A reference to a definition in another document , when the reference itself is also markup, is underlined, in italics and red-orange monospace font, and is also an active link to the definition itself.
hyperlink
A hyperlink is underlined and in blue.
[reference]
A document reference (normative or informative) is enclosed in square brackets and links to the references section.
Note

Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are normative or informative depending on whether they are in a normative or informative section, respectively.

Example 1
Examples are in light khaki boxes, with a khaki left border, and with a numbered "Example" header in khaki. Examples are always informative. The content of the example is in monospace font and may be syntax colored.

4. Tabular Data Models

This section defines an annotated tabular data model : a model for tables that are annotated with metadata. Annotations provide information about the cells, rows, columns, tables, and groups of tables with which they are associated. The values of these annotations may be lists, structured objects, or atomic values. Core annotations are those that affect the behavior of processors defined in this specification, but other annotations may also be present on any of the components of the model.

Annotations may be described directly in [ tabular-metadata ], be embedded in a tabular data file , or created during the process of generating an annotated table .

Note

In this document, the term annotation refers to any metadata associated with an object in the annotated tabular data model. These are not necessarily web annotations in the sense of [ annotation-model ].

4.1 Table Groups

A group of tables comprises a set of annotated tables and a set of annotations that relate to that group of tables. The core annotations of a group of tables are:

Groups of tables MAY in addition have any number of annotations which provide information about the group of tables. Annotations on a group of tables may include:

When originating from [ tabular-metadata ], these annotations arise from common properties defined on table group descriptions within metadata documents.

4.2 Tables

An annotated table is a table that is annotated with additional metadata. The core annotations of a table are:

The table MAY in addition have any number of other annotations . Annotations on a table may include:

When originating from [ tabular-metadata ], these annotations arise from common properties defined on table descriptions within metadata documents.

4.3 Columns

A column represents a vertical arrangement of cells within a table . The core annotations of a column are:

Columns MAY in addition have any number of other annotations , such as a description. When originating from [ tabular-metadata ], these annotations arise from common properties defined on column descriptions within metadata documents.

4.4 Rows

A row represents a horizontal arrangement of cells within a table . The core annotations of a row are:

Rows MAY have any number of additional annotations . The annotations on a row provide additional metadata about the information held in the row, such as:

Neither this specification nor [ tabular-metadata ] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to rows by interpreting notes on the table.

4.5 Cells

A cell represents a cell at the intersection of a row and a column within a table . The core annotations of a cell are:

Note

The presence or absence of quotes around a value within a CSV file is a syntactic detail that is not reflected in the tabular data model. In other words, there is no distinction in the model between the second value in a,,z and the second value in a,"",z .
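This equivalence can be checked with a short sketch using Python's standard csv module (illustrative only; not part of the specification):

```python
import csv
import io

# Parse the two syntactic variants; the model makes no distinction
# between an unquoted and a quoted empty cell.
unquoted = next(csv.reader(io.StringIO('a,,z')))
quoted = next(csv.reader(io.StringIO('a,"",z')))

assert unquoted == quoted == ['a', '', 'z']
```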

Cells MAY have any number of additional annotations . The annotations on a cell provide metadata about the value held in the cell, particularly when this overrides the information provided for the column and row that the cell falls within. Annotations on a cell might be:

Neither this specification nor [ tabular-metadata ] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to cells by interpreting notes on the table.

4.6 Datatypes

Columns and cells within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell.

Datatypes are based on a subset of those defined in [ xmlschema11-2 ], with aliases. The annotated tabular data model limits cell values to have datatypes as shown on the diagram:

Built-in Datatype Hierarchy diagram
Fig. 1 Diagram showing the built-in datatypes, based on [ xmlschema11-2 ]; names in parentheses denote aliases of the [ xmlschema11-2 ] terms (the diagram is also available in SVG or PNG formats).

The core annotations of a datatype are:

Datatypes MAY have any number of additional annotations . The annotations on a datatype provide metadata about the datatype, such as a title or description. These arise from common properties defined on datatype descriptions within metadata documents, as defined in [ tabular-metadata ].

4.6.1 Length Constraints

The length , minimum length , and maximum length annotations indicate the exact, minimum, and maximum lengths for cell values .

The length of a value is determined as follows:

  • if the value is null , its length is zero
  • if the value is a string or one of its subtypes, its length is the number of characters in the value
  • if the value is of a binary type, its length is the number of bytes in the binary value
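The rules above can be sketched as a small Python function (an illustrative sketch; the function name and the use of Python types to stand for the model's string and binary datatypes are assumptions, not part of the specification):

```python
def value_length(value):
    """Length of a cell value, following the rules above (sketch)."""
    if value is None:                           # a null value
        return 0
    if isinstance(value, str):                  # string or a subtype: characters
        return len(value)
    if isinstance(value, (bytes, bytearray)):   # binary type: bytes
        return len(value)
    raise TypeError("length is not defined for this datatype")

assert value_length(None) == 0
assert value_length("café") == 4      # characters, not bytes
assert value_length(b"\x00\x01") == 2
```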

4.6.2 Value Constraints

The minimum , maximum , minimum exclusive , and maximum exclusive annotations indicate limits on cell values . These apply to numeric, date/time, and duration types.

Validation of cell values against these datatypes is as defined in [ xmlschema11-2 ].

5. Locating Metadata

As described in section 4. Tabular Data Models , tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate the metadata that provides those annotations.

In most of the methods of locating metadata described here, metadata is provided within separate documents. The syntax of these documents is defined in [ tabular-metadata ]. These documents can include things like:

  • metadata about the table, such as titles, descriptions, provenance, and links to other tables
  • metadata about columns in the table, such as labels, data types and other constraints, or flags to indicate that values in the column are unique
  • metadata about rows in the table, such as certainty or provenance
  • metadata about values in the table, such as notes or type annotations

When creating a set of annotations from metadata, if the same property is specified in two locations then information "closer" to the end user or interpreter of the document overrides other information, and information "closer" to the document itself overrides information "further" from the document. Explicitly, the order of precedence is:

  1. metadata supplied by the user of the implementation that is processing the tabular data, see section 5.1 Overriding Metadata
  2. metadata embedded within the tabular data file itself, see section 5.2 Embedded Metadata
  3. metadata in a document linked to using a Link header associated with the tabular data file , see section 5.3 Link Header
  4. file-specific metadata in a document located based on the location of the tabular data file , see section 5.4 Standard File Metadata
  5. directory-specific metadata in a document located based on the location of the tabular data file , see section 5.5 Standard Directory Metadata

Processors MUST always use overriding metadata and embedded metadata. If the overriding metadata does not originate from a metadata file (for example, if it is generated from command-line options when invoking the implementation), processors MUST attempt to locate the first metadata document from the Link header, the file-specific metadata, or the directory-specific metadata. The metadata located from these locations MUST be merged as defined in [ tabular-metadata ].

If no metadata is supplied or found, then the following metadata is used as the default metadata :

{
  "@type": "TableGroup",
  "tables": []
}
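The precedence order and the default metadata can be sketched in Python. Here `locators` stands in for hypothetical Link-header, file-specific, and directory-specific lookup functions corresponding to sections 5.3-5.5; none of these names come from the specification:

```python
DEFAULT_METADATA = {"@type": "TableGroup", "tables": []}

def locate_metadata(url, user_metadata=None, embedded_metadata=None,
                    locators=()):
    """Sketch of the precedence above: overriding and embedded metadata
    are always used, plus the FIRST document successfully located via
    the Link header, file-specific, or directory-specific mechanisms
    (represented here by the hypothetical `locators` callables)."""
    found = [m for m in (user_metadata, embedded_metadata) if m is not None]
    for locate in locators:
        metadata = locate(url)
        if metadata is not None:
            found.append(metadata)
            break  # only the first successfully located document is used
    # Fall back to the default metadata when nothing was supplied or found.
    return found if found else [DEFAULT_METADATA]
```

In a full implementation the returned documents would then be merged as defined in [ tabular-metadata ]; the merge itself is out of scope for this sketch.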

5.1 Overriding Metadata

Processors SHOULD provide users with the facility to provide their own metadata for tabular data files that they process. This might be provided:

For example, a processor might be invoked with:

Example 2 : Command-line CSV processing with column types

$ csvlint data.csv --datatypes:string,float,string,string

to enable the testing of the types of values in the columns of a CSV file, or with:

Example 3 : Command-line CSV processing with a schema

$ csvlint data.csv --schema:schema.json

to supply a schema that describes the contents of the file, against which it can be validated.

Metadata supplied in this way is called overriding, or user-supplied, metadata. Implementations SHOULD define how any options they define are mapped into the vocabulary defined in [ tabular-metadata ]. If the user selects existing metadata files, implementations MUST NOT use metadata located through the link header (as described in section 5.3 Link Header ), file-specific metadata (as described in section 5.4 Standard File Metadata ) or directory-specific metadata (as described in section 5.5 Standard Directory Metadata ).

5.2 Embedded Metadata

Most syntaxes for tabular data provide a facility for embedding some metadata within the tabular data file itself. The definition of a syntax for tabular data SHOULD include a description of how the syntax maps to an annotated data model, and in particular how any embedded metadata is mapped into the vocabulary defined in [ tabular-metadata ]. Parsing based on the default dialect for CSV, as described in section 8. Parsing Tabular Data , will extract column titles from the first row of a CSV file.

Example 4 : http://example.org/tree-ops.csv
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

The results of this can be found in section 8.2.1 Simple Example .

For another example, the following tab-delimited file contains embedded metadata where it is assumed that comments may be added using a # , and that the column types may be indicated using a #datatype annotation:

Example 5 : Tab-separated file containing embedded metadata
# publisher City of Palo Alto
# updated 12/31/2010
#name GID on_street species trim_cycle  inventory_date
#datatype string  string  string  string  date:M/D/YYYY
  GID On Street Species Trim Cycle  Inventory Date
  1 ADDISON AV  Celtis australis  Large Tree Routine Prune  10/18/2010
  2 EMERSON ST  Liquidambar styraciflua Large Tree Routine Prune  6/2/2010

A processor that recognises this format may be able to extract and make sense of this embedded metadata.
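A sketch of how such a processor might separate this assumed '#'-prefixed embedded metadata from the data rows follows (the comment syntax itself is an assumption of the example above, not something defined by this specification):

```python
def extract_embedded_metadata(lines, comment_prefix="#"):
    """Split an assumed '#'-commented tabular data file into embedded
    metadata and data lines (illustrative sketch only)."""
    metadata, data = {}, []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(comment_prefix):
            # First token after the prefix is the key, the rest is the value.
            parts = stripped.lstrip(comment_prefix).strip().split(None, 1)
            if parts:
                metadata[parts[0]] = parts[1] if len(parts) > 1 else ""
        else:
            data.append(line)
    return metadata, data
```

A recognising processor would then map keys like publisher or updated into the vocabulary defined in [ tabular-metadata ].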

5.4 Standard File Metadata

If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata , and no applicable metadata file has been discovered through a Link header, described in section 5.3 Link Header , processors MUST attempt to locate a file-specific metadata document. This method MUST NOT be used if the URL of the tabular data file contains a query component.

In this case, processors MUST construct a URL for the file-specific metadata document by appending -metadata.json to the location of the tabular data file (with any fragment component of that URL removed), and attempt to retrieve a metadata document from that URL.

For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors must attempt to locate a metadata document at http://example.org/south-west/devon.csv-metadata.json . On the other hand, if the tabular data file is at http://example.com/index.php?file=43178.csv then, because the URL of the tabular data file includes a query component, the processor must not attempt to look for a file-specific metadata document.

If the metadata file found at this location does not include a reference to the relevant tabular data file then it MUST be ignored.
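The construction of the file-specific metadata URL, including the query-component restriction, can be sketched with Python's standard urllib.parse module (`file_metadata_url` is a hypothetical helper name, not from the specification):

```python
from urllib.parse import urlsplit, urldefrag

def file_metadata_url(tabular_url):
    """URL of the file-specific metadata document, or None when the
    tabular data file's URL has a query component (sketch)."""
    if urlsplit(tabular_url).query:
        return None                       # method must not be used
    base, _fragment = urldefrag(tabular_url)  # drop any fragment
    return base + "-metadata.json"

assert (file_metadata_url("http://example.org/south-west/devon.csv")
        == "http://example.org/south-west/devon.csv-metadata.json")
assert file_metadata_url("http://example.com/index.php?file=43178.csv") is None
```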

5.5 Standard Directory Metadata

If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata , and no applicable metadata file has been discovered through a Link header, described in section 5.3 Link Header , or from the standard file metadata location, described in section 5.4 Standard File Metadata , processors MUST attempt to locate a directory-level metadata document. This method MUST NOT be used if the URL of the tabular data file contains a query component.

Processors MUST locate and retrieve the common metadata document for a directory by resolving the relative URL metadata.json against the base URL of the tabular data file and fetching the resulting URL.

For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors must attempt to locate a metadata file at http://example.org/south-west/metadata.json . On the other hand, if the tabular data file is at http://example.com/index.php?file=43178.csv then, because the URL of the tabular data file includes a query component, the processor must not attempt to look for a directory-level metadata document.

If the metadata file found at this location does not include a reference to the relevant tabular data file then it MUST be ignored.
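Resolving the directory-level metadata URL is a straightforward relative-URL resolution; a sketch using urllib.parse (`directory_metadata_url` is a hypothetical name):

```python
from urllib.parse import urljoin, urlsplit

def directory_metadata_url(tabular_url):
    """URL of the directory-level metadata document, or None when the
    tabular data file's URL has a query component (sketch)."""
    if urlsplit(tabular_url).query:
        return None                     # method must not be used
    return urljoin(tabular_url, "metadata.json")

assert (directory_metadata_url("http://example.org/south-west/devon.csv")
        == "http://example.org/south-west/metadata.json")
```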

6. Processing Tables

This section describes how particular types of applications should process tabular data and metadata files.

In many cases, an application will start processing from a metadata file. In that case, the initial metadata file is treated as overriding metadata and the application MUST NOT continue to retrieve other available metadata about each of the tabular data file s referenced by that initial metadata file, other than embedded metadata .

In other cases, applications will start from a tabular data file , such as a CSV file, and locate metadata from that file. This metadata will be used to process the file as if the processor were starting from that metadata file.

For example, if a validator is passed a locally authored metadata file spending.json , which contains:

Example 7: Metadata file referencing multiple tabular data files sharing a schema
{
  "tableSchema": "government-spending.csv",
  "tables": [{
    "url": "http://example.org/east-sussex-2015-03.csv"
  }, {
    "url": "http://example.org/east-sussex-2015-02.csv"
  }, ... 
  ]

}

the validator would validate all the listed tables, using the locally defined schema at government-spending.csv . It would also use the metadata embedded in the referenced CSV files; for example, when processing http://example.org/east-sussex-2015-03.csv , it would use embedded metadata within that file.

If a validator is passed a tabular data file http://example.org/east-sussex-2015-03.csv , the validator would use the metadata located from the CSV file: the first metadata file found through the Link headers found when retrieving that file, or at http://example.org/east-sussex-2015-03.csv-metadata.json or http://example.org/metadata.json .

Note

Starting with a metadata file can remove the need to perform additional requests to locate linked, file-specific, or directory-specific metadata.

6.1 Creating Annotated Tables

After locating metadata, the metadata is merged into a single table group description . When starting with a metadata file, this involves merging the provided metadata file with the embedded metadata for each tabular data file referenced by that file. When starting with a tabular data file, this involves locating the first metadata file as described in section 5. Locating Metadata and merging all relevant metadata into a single descriptor.

If processing starts with a tabular data file , implementations:

  1. Retrieve the tabular data file.
  2. Retrieve any metadata supplied by the user ( UM T ) (see section 5.1 Overriding Metadata ).
  3. If UM T does not describe a table or group of tables describing the tabular data file, retrieve the first metadata file ( FM ) as described in section 5. Locating Metadata :
    1. metadata referenced from a Link Header that may be returned when retrieving the tabular data file (see section 5.3 Link Header ).
    2. file-specific metadata in a document located based on the location of the tabular data file (see section 5.4 Standard File Metadata ).
    3. directory-specific metadata in a document located based on the location of the tabular data file (see section 5.5 Standard Directory Metadata ).
    4. default metadata as defined in section 5. Locating Metadata with a single tables entry where the url property is set from that of the tabular data file .
  4. Proceed as if the process starts with a metadata file using the merged metadata UM M = merge(UM T , FM) or UM T if no FM is loaded.

If the process starts with a metadata file:

  1. Retrieve the metadata file yielding the metadata UM M (which is treated as overriding metadata, see section 5.1 Overriding Metadata ).
  2. For each table in UM M in order, create one or more annotated tables :
    1. Extract the dialect description ( DD ) from UM M for the table associated with the tabular data file. If there is no such dialect description, extract the first available dialect description from a group of tables in which the tabular data file is described. Otherwise use the default dialect description.
    2. If using the default dialect description , override default values in DD based on HTTP headers found when retrieving the tabular data file:
      • If the media type from the Content-Type header is text/tsv , set delimiter to TAB in DD .
      • If the Content-Type header includes the header parameter with a value of absent , set header to false in DD .
      • If the Content-Type header includes the charset parameter, set encoding to this value in DD .
    3. Parse the tabular data file, using DD as a guide, to create a basic tabular data model ( T ) and extract embedded metadata ( EM ), for example from the header line .

      Note

      This specification provides a non-normative definition for parsing CSV-based files, including the extraction of embedded metadata , in section 8. Parsing Tabular Data . This specification does not define any syntax for embedded metadata beyond this; whatever syntax is used, it's assumed that metadata can be mapped to the vocabulary defined in [ tabular-metadata ].

    4. If a Content-Language HTTP header was found when retrieving the tabular data file, and the value provides a single language, set the lang inherited property to this value in EM .
    5. Create the merged metadata M = merge(UM M , EM) using the merge algorithm defined in [ tabular-metadata ].
    6. Use the metadata M to add annotations to the tabular data model T as described in Section 2 Annotating Tables in [ tabular-metadata ].
Note

In the case of starting with a metadata file, UM M will describe a table or group of tables , and no other metadata files will be retrieved. Thus the metadata file must provide all applicable metadata aside from that embedded within the tabular data files themselves.
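The flow above can be condensed into a sketch, with the heavy lifting delegated to hypothetical callables: `fm_locator` stands for section 5. Locating Metadata, `parser` for section 8. Parsing Tabular Data, and `merge`/`annotate` for the algorithms in [ tabular-metadata ]. All names here are illustrative assumptions:

```python
def create_annotated_table(url, um_t, fm_locator, parser, merge, annotate):
    """Sketch of creating an annotated table from a tabular data file.
    Every callable is a hypothetical stand-in, not a defined API."""
    # Step 3: if the user-supplied metadata (UM_T) does not describe this
    # tabular data file, locate the first metadata file (FM) and merge:
    # UM_M = merge(UM_T, FM).
    described = bool(um_t) and url in {t.get("url")
                                       for t in um_t.get("tables", [])}
    um_m = um_t if described else merge(um_t, fm_locator(url))
    # Step 2.3 of the metadata-file flow: parse the file (guided by its
    # dialect description) into a basic model T plus embedded metadata EM.
    table, em = parser(url)
    # M = merge(UM_M, EM); then annotate T using M.
    return annotate(table, merge(um_m, em))
```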

6.2 Parsing Cells

Unlike many other data formats, tabular data is designed to be read by humans. For that reason, it's common for data to be represented within tabular data in a human-readable way. The datatype , default , lang , null , required , and separator annotations provide the information needed to parse the string value of a cell into its (semantic) value annotation. This is used:

The process of parsing a cell creates a cell with annotations based on the original string value, parsed value and other column annotations and adds the cell to the list of cells in a row and cells in a column :

After parsing, the cell value can be:

The process of parsing the string value into a single value or a list of values is as follows:

  1. unless the datatype base is string , json , xml , html or anyAtomicType , replace all carriage return ( #xD ), line feed ( #xA ), and tab ( #x9 ) characters with space characters.
  2. unless the datatype base is string , json , xml , html , anyAtomicType , or normalizedString , strip leading and trailing whitespace from the string value and replace all instances of two or more whitespace characters with a single space character.
  3. if the normalized string is an empty string, apply the remaining steps to the string given by the column default annotation.
  4. if the column separator annotation is not null and the normalized string is an empty string, the cell value is an empty list. If the column required annotation is true , add an error to the list of errors for the cell.
  5. if the column separator annotation is not null , the cell value is a list of values; set the list annotation on the cell to true , and create the cell value by:
    1. if the normalized string is the same as any one of the values of the column null annotation, then the resulting value is null .
    2. split the normalized string at the character specified by the column separator annotation.
    3. unless the datatype base is string or anyAtomicType , strip leading and trailing whitespace from these strings.
    4. applying the remaining steps to each of the strings in turn.
  6. if the string is an empty string, apply the remaining steps to the string given by the column default annotation.
  7. if the string is the same as any one of the values of the column null annotation, then the resulting value is null . If the column separator annotation is null and the column required annotation is true , add an error to the list of errors for the cell.
  8. parse the string using the datatype format if one is specified, as described below to give a value with an associated datatype . If the datatype base is string , or there is no datatype , the value has an associated language from the column lang annotation. If there are any errors, add them to the list of errors for the cell; in this case the value has a datatype of string ; if the datatype base is string , or there is no datatype , the value has an associated language from the column lang annotation.
  9. validate the value based on the length constraints described in section 4.6.1 Length Constraints , the value constraints described in section 4.6.2 Value Constraints and the datatype format annotation if one is specified, as described below. If there are any errors, add them to the list of errors for the cell.

The final value (or values) become the value annotation on the cell .
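As a non-normative illustration, the whitespace, default, null, and separator steps above might be sketched in Python as follows. The function and parameter names here are illustrative, not from this specification, and the datatype parsing of step 8 is omitted:

```python
# A minimal, non-normative sketch of the cell-parsing steps above.
# parse_cell and its parameters are illustrative names only.

SKIP_NORMALISATION = {"string", "json", "xml", "html", "anyAtomicType"}

def parse_cell(string_value, datatype_base="string", default="",
               null_values=("",), separator=None, required=False):
    errors = []
    s = string_value
    # Step 1: replace CR, LF and tab with spaces for most datatypes.
    if datatype_base not in SKIP_NORMALISATION:
        s = s.replace("\r", " ").replace("\n", " ").replace("\t", " ")
        # Step 2: strip and collapse internal whitespace.
        if datatype_base != "normalizedString":
            s = " ".join(s.split())
    # Step 3: an empty string falls back to the column default.
    if s == "":
        s = default
    # Steps 4-5: a separator produces a list value.
    if separator is not None:
        if s == "":
            if required:
                errors.append("required cell is empty")
            return [], errors
        if s in null_values:
            return None, errors
        parts = s.split(separator)
        if datatype_base not in ("string", "anyAtomicType"):
            parts = [p.strip() for p in parts]
        return parts, errors   # each part would then be datatype-parsed
    # Step 7: null values.
    if s in null_values:
        if required:
            errors.append("required cell is null")
        return None, errors
    return s, errors           # step 8 (datatype parsing) omitted here
```

For example, `parse_cell("1 5 7.0", separator=" ")` yields the list `["1", "5", "7.0"]`, whose items would then each be parsed against the datatype.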

If there is a about URL annotation on the column , it becomes the about URL annotation on the cell , after being transformed into an absolute URL as described in URI Template Properties of [ tabular-metadata ].

If there is a property URL annotation on the column , it becomes the property URL annotation on the cell , after being transformed into an absolute URL as described in URI Template Properties of [ tabular-metadata ].

If there is a value URL annotation on the column , it becomes the value URL annotation on the cell , after being transformed into an absolute URL as described in URI Template Properties of [ tabular-metadata ]. The value URL annotation is null if the cell value is null and the column virtual annotation is false .

6.2.1 Parsing examples

This section is non-normative.

When no datatype annotation is available, the value of a cell is the same as its string value . For example, a cell with a string value of "99" would similarly have the (semantic) value "99" .

If a datatype base is provided for the cell, that is used to create a (semantic) value for the cell. For example, if the metadata contains:

Example 8

"datatype": "integer"

for the cell with the string value "99" then the value of that cell will be the integer 99 . A cell whose string value was not a valid integer (such as "one" or "1.0" ) would be assigned that string value as its (semantic) value annotation, but also have a validation error listed in its errors annotation.

Sometimes data uses special codes to indicate unknown or null values. For example, a particular column might contain a number that is expected to be between 1 and 10 , with the string 99 used in the original tabular data file to indicate a null value. The metadata for such a column would include:

Example 9
"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"null": "99"

In this case, a cell with a string value of "5" would have the (semantic) value of the integer 5 ; a cell with a string value of "99" would have the value null .

Similarly, a cell may be assigned a default value if the string value for the cell is empty. A configuration such as:

Example 10
"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"default": "5"

In this case, a cell whose string value is "" would be assigned the value of the integer 5 . A cell whose string value contains whitespace, such as a single tab character, would also be assigned the value of the integer 5 : when the datatype is something other than string or anyAtomicType , leading and trailing whitespace is stripped from string values before the remainder of the processing is carried out.

Cells can contain sequences of values. For example, a cell might have the string value "1 5 7.0" . In this case, the separator is a space character. The appropriate configuration would be:

Example 11
"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"default": "5",
"separator": " "

and this would mean that the cell's value would be an array containing two integers and a string: [1, 5, "7.0"] . The final value of the array is a string because it is not a valid integer; the cell's errors annotation will also contain a validation error.

Also, with this configuration, if the string value of the cell were "" (i.e., it was an empty cell) the value of the cell would be an empty list.

A cell value can be inserted into a URL created using a URI template property such as valueUrl . For example, if a cell with the string value "1 5 7.0" were in a column named values , defined with:

Example 12
"datatype": "decimal",
"separator": " ",
"valueUrl": "{?values}"

then after expansion of the URI template, the resulting valueUrl would be ?values=1.0,5.0,7.0 . The canonical representations of the decimal values are used within the URL.
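The expansion in this example can be approximated as follows. Here expand_query_template is a hypothetical helper that handles only the {?name} form of a URI template used above; a real implementation would use a full [RFC 6570] URI-template library:

```python
from urllib.parse import quote

# Hypothetical helper: expands only the {?name} URI-template form used in
# the example above, joining list values with commas as RFC 6570 requires
# for form-style query expansion of a composite value.
def expand_query_template(name, values):
    joined = ",".join(quote(v, safe="") for v in values)
    return "?{}={}".format(name, joined)
```

For example, `expand_query_template("values", ["1.0", "5.0", "7.0"])` returns `?values=1.0,5.0,7.0`.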

6.2.2 Formats for numeric types

It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.

If the datatype base is a numeric type, the datatype format annotation indicates the expected format for that number. Its value MUST be either a single string or an object with one or more of the properties:

decimalChar
A single character string whose value is used to represent a decimal point within the number. The default value is "." .
groupChar
A single character string whose value is used to group digits within the number. The default value is "," .
pattern
A regular expression string, in the syntax and interpreted as defined by [ ECMASCRIPT ].
Note

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

If the datatype format annotation is a single string, this is interpreted in the same way as if it were an object with a pattern property whose value is that string.

When parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:

  1. an optional + or - sign,
  2. followed by a decimal digit ( 0-9 ),
  3. followed by any number of decimal digits ( 0-9 ) and the character specified as the groupChar ,
  4. followed by an optional decimalChar followed by one or more decimal digits ( 0-9 ),
  5. followed by an optional exponent, consisting of an E followed by an optional + or - sign followed by one or more decimal digits ( 0-9 ), or
  6. followed by an optional percent ( % ) or per-mille ( ‰ ) sign.

or that are one of the special values:

  1. NaN ,
  2. INF , or
  3. -INF .

Implementations MUST add a validation error to the errors annotation for the cell if the string being parsed:

  • does not meet the numeric format defined above,
  • contains two consecutive groupChar characters,
  • does not match the regular expression defined in the pattern property, if there is one,
  • contains the decimalChar , if the datatype base is integer or one of its sub-values,
  • contains an exponent, if the datatype base is decimal or one of its sub-values, or
  • is one of the special values NaN , INF , or -INF , if the datatype base is decimal or one of its sub-values.

Implementations MUST use the sign, exponent, percent, and per-mille signs when parsing the string value of a cell to provide the value of the cell. For example, the string value "-25%" must be interpreted as -0.25 and the string value "1E6" as 1000000 .
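As a non-normative sketch, and assuming the default decimalChar and groupChar with no pattern property, the recognition and scaling described above might look like this (parse_number is an illustrative name):

```python
import re

# Non-normative sketch of the default numeric format described above.
# decimal_char and group_char correspond to the decimalChar and groupChar
# format properties; the pattern property is not handled here.
def parse_number(s, decimal_char=".", group_char=","):
    d, g = re.escape(decimal_char), re.escape(group_char)
    pattern = re.compile(
        r"^[+-]?"                       # optional sign
        r"[0-9](?:[0-9]|" + g + r")*"   # a digit, then digits or groupChar
        r"(?:" + d + r"[0-9]+)?"        # optional decimal part
        r"(?:E[+-]?[0-9]+)?"            # optional exponent
        r"(?:%|\u2030)?$"               # optional percent or per-mille sign
    )
    if s in ("NaN", "INF", "-INF"):
        return float(s.replace("INF", "inf"))
    if not pattern.match(s) or (group_char * 2) in s:
        raise ValueError("does not match the numeric format: " + s)
    scale = 1
    if s.endswith("%"):
        s, scale = s[:-1], 100
    elif s.endswith("\u2030"):
        s, scale = s[:-1], 1000
    s = s.replace(group_char, "").replace(decimal_char, ".")
    return float(s) / scale
```

With this sketch, `parse_number("-25%")` yields `-0.25` and `parse_number("1E6")` yields `1000000.0`, matching the examples above.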

6.2.3 Formats for booleans

Boolean values may be represented in many ways aside from the standard 1 and 0 or true and false .

If the datatype base for a cell is boolean , the datatype format annotation provides the true and false values expected, separated by | . For example if format is Y|N then cells must hold either Y or N , with Y meaning true and N meaning false .

The resulting cell value will be one or more boolean true or false values.
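As a non-normative sketch (parse_boolean is an illustrative name), a boolean format such as Y|N might be handled as:

```python
# Non-normative sketch: interpreting a boolean datatype format such as
# "Y|N", where the expected true and false values are separated by "|".
def parse_boolean(s, fmt="true|false"):
    true_value, false_value = fmt.split("|")
    if s == true_value:
        return True
    if s == false_value:
        return False
    raise ValueError("not a recognised boolean value: " + s)
```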

6.2.4 Formats for dates and times

Dates and times are commonly represented in tabular data in formats other than those defined in [ xmlschema11-2 ].

If the datatype base is a date or time type, the datatype format annotation indicates the expected format for that date or time.

The supported date and time formats listed here are expressed in terms of the date field symbols defined in [ UAX35 ] and MUST be interpreted by implementations as defined in that specification.

The following date formats MUST be recognised by implementations:

  • yyyy-MM-dd e.g., 2015-03-22
  • yyyyMMdd e.g., 20150322
  • dd-MM-yyyy e.g., 22-03-2015
  • d-M-yyyy e.g., 22-3-2015
  • MM-dd-yyyy e.g., 03-22-2015
  • M-d-yyyy e.g., 3-22-2015
  • dd/MM/yyyy e.g., 22/03/2015
  • d/M/yyyy e.g., 22/3/2015
  • MM/dd/yyyy e.g., 03/22/2015
  • M/d/yyyy e.g., 3/22/2015
  • dd.MM.yyyy e.g., 22.03.2015
  • d.M.yyyy e.g., 22.3.2015
  • MM.dd.yyyy e.g., 03.22.2015
  • M.d.yyyy e.g., 3.22.2015

The following time formats MUST be recognised by implementations:

  • HH:mm:ss e.g., 15:02:37
  • HHmmss e.g., 150237
  • HH:mm e.g., 15:02
  • HHmm e.g., 1502

The following date/time formats MUST be recognised by implementations:

  • yyyy-MM-ddTHH:mm:ss e.g., 2015-03-15T15:02:37
  • yyyy-MM-ddTHH:mm e.g., 2015-03-15T15:02
  • any of the date formats above, followed by a single space, followed by any of the time formats above, e.g., M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37

Implementations MUST also recognise date, time, and date/time formats that end with timezone markers consisting of between one and three x s or X s, possibly after a single space. These MUST be interpreted as follows:

  • X e.g., -08 , +0530 , or Z (minutes are optional)
  • XX e.g., -0800 , +0530 , or Z
  • XXX e.g., -08:00 , +05:30 , or Z
  • x e.g., -08 or +0530 ( Z is not permitted)
  • xx e.g., -0800 or +0530 ( Z is not permitted)
  • xxx e.g., -08:00 or +05:30 ( Z is not permitted)

For example, formats could include yyyy-MM-ddTHH:mm:ssXXX for 2015-03-15T15:02:37Z or 2015-03-15T15:02:37-05:00 , or HH:mm x for 15:02 -05 .

The cell value will be one or more date/time values extracted using the format .

Note

For simplicity, this version of this standard does not support abbreviated or full month or day names, or double digit years. Future versions of this standard may support other date and time formats, or general purpose date/time pattern strings. Authors of schemas SHOULD use appropriate regular expressions, along with the string datatype, for dates and times that use a format other than that specified here.
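A subset of the [UAX35] patterns above can be translated to strptime directives, as in this non-normative sketch. Single-letter fields like d and M, and the x/X timezone markers, need handling beyond what strptime offers and are omitted here:

```python
from datetime import datetime

# Non-normative sketch: translate the two-letter and four-letter [UAX35]
# date field symbols used above into strptime directives.
# yyyy -> %Y, MM -> %m, dd -> %d, HH -> %H, mm -> %M, ss -> %S
_SYMBOLS = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"),
            ("HH", "%H"), ("mm", "%M"), ("ss", "%S")]

def parse_datetime(value, fmt):
    for symbol, directive in _SYMBOLS:
        fmt = fmt.replace(symbol, directive)
    return datetime.strptime(value, fmt)
```

For example, `parse_datetime("22.03.2015", "dd.MM.yyyy")` yields the date 2015-03-22.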

6.2.5 Formats for durations

Durations MUST be formatted and interpreted as defined in [ xmlschema11-2 ], using the [ ISO8601 ] format -?P n Y n M n DT n H n M n S . For example, the duration P1Y1D is used for a year and a day; the duration PT2H30M for 2 hours and 30 minutes.

If the datatype base is a duration type, the datatype format annotation provides a regular expression for the string values, in the syntax and processed as defined by [ ECMASCRIPT ].

Note

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

The cell value will be one or more durations extracted using the format .

6.2.6 Formats for other types

If the datatype base is not numeric, boolean , a date/time type, or a duration type, the datatype format annotation provides a regular expression for the string values, in the syntax and processed as defined by [ ECMASCRIPT ].

Note

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

Values that are labelled as html , xml , or json SHOULD NOT be validated against those formats.

Note

Metadata creators who wish to check the syntax of HTML, XML, or JSON within tabular data should use the datatype format annotation to specify a regular expression against which such values will be tested.
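A non-normative sketch of such a test follows; matches_format is an illustrative name, and since [ECMASCRIPT] and Python regular expressions differ in some features, this assumes a basic, portable pattern:

```python
import re

# Non-normative sketch: test a string value against a datatype format
# regular expression, as described for non-numeric types above.
def matches_format(value, pattern):
    return re.fullmatch(pattern, value) is not None
```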

6.3 Displaying Tables

This section is non-normative.

When displaying tables, implementations should take into account the directionality considerations described in this section.

6.3.1 Bidirectional Tables

There are two levels of bidirectionality to consider when displaying tables: the directionality of the table (i.e., whether the columns should be arranged left-to-right or right-to-left) and the directionality of the content of individual cells.

The direction annotation on the table provides information about the desired display of the table. If direction is ltr then the first column should be displayed on the left and the last column on the right. If direction is rtl then the first column should be displayed on the right and the last column on the left.

If direction is default then tables should be displayed with attention to the bidirectionality of the content of the file. Specifically, the values of the cells in the table should be scanned breadth first: from the first cell in the first column through to the last cell in the first row, down to the last cell in the last column. If the first character in the table with a strong type as defined in [ BIDI ] indicates an RTL directionality, the table should be displayed with the first column on the right and the last column on the left. Otherwise, the table should be displayed with the first column on the left and the last column on the right. Characters such as whitespace, quotes, commas, and numbers do not have a strong type, and therefore are skipped when identifying the character that determines the directionality of the table.

Implementations should enable user preferences to override the indicated metadata about the directionality of the table.
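The scanning rule above might be sketched as follows (non-normative; table_direction is an illustrative name, and cell_values is assumed to be the cell string values in the breadth-first order described):

```python
import unicodedata

# Non-normative sketch: find the first character with a strong
# bidirectional type and report the resulting table direction.
def table_direction(cell_values):
    for value in cell_values:
        for ch in value:
            bidi = unicodedata.bidirectional(ch)
            if bidi == "L":                 # strong left-to-right
                return "ltr"
            if bidi in ("R", "AL"):         # strong right-to-left
                return "rtl"
    return "ltr"    # no strong character found; display left-to-right
```

Digits, whitespace, quotes, and commas have no strong type, so a table of purely numeric cells would be displayed left-to-right by this sketch.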

Once the directionality of the table has been determined, each cell within the table should be considered as a separate paragraph , as defined by the Unicode Bidirectional Algorithm ( UBA ) in [ BIDI ]. The default directionality for the cell is determined by looking at the text direction annotation for the cell.

Thus, as defined by the UBA , if a cell contains no characters with a strong type (if it's a number or date for example) then the way the cell is displayed should be determined by the text direction annotation of the cell. However, when the cell contains characters with a strong type (such as letters) then they should be displayed according to the Unicode Bidirectional Algorithm as described in [ BIDI ].

6.4 Validating Tables

Validators test whether given tabular data files adhere to the structure defined within a schema . Validators MUST raise errors in the situations defined in [ tabular-metadata ] and in this specification.

Processors MAY continue validating after encountering the first error.

6.5 Converting Tables

Conversions of tabular data to other formats operate over an annotated table constructed as defined in Annotating Tables in [ tabular-metadata ]. The mechanics of these conversions to other formats are defined in other specifications such as [ csv2json ] and [ csv2rdf ].

Conversion specifications MUST define a default mapping from an annotated table that lacks any annotations (i.e., that is equivalent to an un-annotated table).

Conversion specifications MUST use the propertyUrl annotation of a column as the basis for naming machine-readable fields in the target format, such as the name of the equivalent element or attribute in XML, property in JSON, or property URI in RDF.

Conversion specifications MAY use any of the annotations found on an annotated table group, table, column, row or cell, including non-core annotations, to adjust the mapping into another format.

Conversion specifications MAY define additional annotations, not defined in this specification, which are specifically used when converting to the target format of the conversion. For example, a conversion to XML might specify a http://example.org/conversion/xml/element-or-attribute property on columns that determines whether a particular column is represented through an element or an attribute in the data.

7. Best Practice CSV

This section is non-normative.

There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for expressing tabular data adhering to the annotated tabular data model in CSV. Authors are encouraged to adhere to the constraints described in this section as implementations should process such CSV files consistently.

Note

This syntax is not compliant with text/csv as defined in [ RFC4180 ] in that it permits line endings other than CRLF . Support for LF line endings is important for data formats that are used on non-Windows platforms. However, all files that adhere to [ RFC4180 ]'s definition of CSV meet the constraints described in this section.

Developing a standard for CSV is outside the scope of the Working Group. The details here aim to help shape any future standard.

7.1 Content Type

The appropriate content type for a CSV file is text/csv . For example, when a CSV file is transmitted via HTTP, the HTTP response should include a Content-Type header with the value text/csv :

Content-Type: text/csv
Issue

See below for issues relating to whether we should instead define a different content type.

7.2 Encoding

CSV files should be encoded using UTF-8. If a CSV file is not encoded using UTF-8, the encoding should be specified through the charset parameter in the Content-Type header:

Content-Type: text/csv;charset=ISO-8859-1

7.3 Line Endings

The ends of rows in a CSV file should be either CRLF ( U+000D U+000A ) or LF ( U+000A ). Line endings within escaped cells are not normalised.

Issue 45

Section 4.1.1 of RFC2046 specifies that "The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden." Should we be defining application/csv instead, to prevent having to adhere to this rule, or should we stick to the CRLF rule?

7.4 Lines

Each line of a CSV file should contain the same number of comma-separated values.

Values that contain commas, line endings, or double quotes should be escaped by having the entire value wrapped in double quotes. There should not be whitespace before or after the double quotes. Within these escaped cells, any double quotes should be escaped with two double quotes ( "" ).

7.4.1 Headers

The first line of a CSV file should contain a comma-separated list of names of columns . This is known as the header line and provides titles for the columns. There are no constraints on these titles.

If a CSV file does not include a header line, this should be specified using the header parameter of the media type:

Content-Type: text/csv;header=absent

7.5 Grammar

This grammar is a generalization of that defined in [ RFC4180 ] and is included for reference only.

The EBNF used here is defined in XML 1.0 [ EBNF-NOTATION ].

[1] csv      ::= header record+
[2] header   ::= record
[3] record   ::= fields #x0D? #x0A
[4] fields   ::= field ("," fields)*
[5] field    ::= WS* rawfield WS*
[6] rawfield ::= '"' QCHAR* '"' | SCHAR*
[7] QCHAR    ::= [^"] | '""'
[8] SCHAR    ::= [^",#x0A#x0D]
[9] WS       ::= [#x20#x09]

8. Parsing Tabular Data

This section is non-normative.

As described in section 7. Best Practice CSV , there may be many formats which an application might interpret into the tabular data model described in section 4. Tabular Data Models , including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.

Note

Standardising the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardisation of the format.

This section describes an algorithm for parsing formats that do not adhere to the constraints described in section 7. Best Practice CSV , as well as those that do, and extracting embedded metadata . The parsing algorithm uses the following flags. These may be set by metadata properties found while Locating Metadata , including through user input (see Overriding Metadata ), or through the inclusion of a dialect description within a metadata file:

comment prefix
A character that, when it appears at the beginning of a row, indicates that the row is a comment that should be associated as a rdfs:comment annotation to the table. This is set by the commentPrefix property of a dialect description . The default is null , which means no rows are treated as comments. A value other than null may mean that the source numbers of rows are different from their numbers .
delimiter
The separator between cells, set by the delimiter property of a dialect description . The default is ,.
encoding
The character encoding for the file, one of the encodings listed in [ encoding ], set by the encoding property of a dialect description . The default is utf-8 .
escape character
The character that is used to escape the quote character within escaped cells, or null , set by the doubleQuote property of a dialect description . The default is " (such that "" is used to escape " within an escaped cell).
header row count
The number of header rows (following the skipped rows) in the file, set by the header or headerRowCount property of a dialect description . The default is 1 . A value other than 0 will mean that the source numbers of rows will be different from their numbers .
line terminators
The characters that can be used at the end of a row, set by the lineTerminators property of a dialect description . The default is [CRLF, LF] .
quote character
The character that is used around escaped cells, or null , set by the quoteChar property of a dialect description . The default is " .
skip blank rows
Indicates whether to ignore wholly empty rows (i.e., rows in which all the cells are empty), set by the skipBlankRows property of a dialect description . The default is false . A value other than false may mean that the source numbers of rows are different from their numbers .
skip columns
The number of columns to skip at the beginning of each row, set by the skipColumns property of a dialect description . The default is 0 . A value other than 0 will mean that the source numbers of columns will be different from their numbers .
skip rows
The number of rows to skip at the beginning of the file, before a header row or tabular data, set by the skipRows property of a dialect description . The default is 0 . A value greater than 0 will mean that the source numbers of rows will be different from their numbers .
trim
Indicates whether to trim whitespace around cells; may be true , false , start , or end , set by the skipInitialSpace or trim property of a dialect description . The default is false .
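As a non-normative summary, the flags and their defaults might be represented as follows; the field names are illustrative, and the comments name the dialect-description properties that set each flag:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

# Non-normative sketch: the parsing flags above with their default values.
@dataclass
class Dialect:
    comment_prefix: Optional[str] = None     # commentPrefix
    delimiter: str = ","                     # delimiter
    encoding: str = "utf-8"                  # encoding
    escape_character: Optional[str] = '"'    # doubleQuote
    header_row_count: int = 1                # header / headerRowCount
    line_terminators: List[str] = field(default_factory=lambda: ["\r\n", "\n"])
    quote_character: Optional[str] = '"'     # quoteChar
    skip_blank_rows: bool = False            # skipBlankRows
    skip_columns: int = 0                    # skipColumns
    skip_rows: int = 0                       # skipRows
    trim: Union[bool, str] = False           # skipInitialSpace / trim
```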

The algorithm for using these flags to parse a document containing tabular data to create a basic annotated tabular data model and to extract embedded metadata is as follows:

  1. Create a new table T with the annotations:
  2. Create a metadata document structure M that looks like:
    {
      "@context": "http://www.w3.org/ns/csvw",
      "rdfs:comment": [],
      "tableSchema": {
        "columns": []
      }
    }
    
  3. If the URL of the tabular data file being parsed is known, set the url property on M to that URL.
  4. Set source row number to 1 .
  5. Read the file using the encoding .
  6. Repeat the following the number of times indicated by skip rows :
    1. Read a row to provide the row content .
    2. If the comment prefix is not null and the row content begins with the comment prefix , strip that prefix from the row content , and add the resulting string to the M .rdfs:comment array.
    3. Otherwise, if the row content is not an empty string, add the row content to the M .rdfs:comment array.
    4. Add 1 to the source row number .
  7. Repeat the following the number of times indicated by header row count :
    1. Read a row to provide the row content .
    2. If the comment prefix is not null and the row content begins with the comment prefix , strip that prefix from the row content , and add the resulting string to the M .rdfs:comment array.
    3. Otherwise, parse the row to provide a list of cell values , and:
      1. Remove the first skip columns number of values from the list of cell values .
      2. For each of the remaining values at index i in the list of cell values :
        1. If the value at index i in the list of cell values is an empty string or consists only of whitespace, do nothing.
        2. Otherwise, if there is no column description object at index i in M .tableSchema.columns, create a new one with a title property whose value is an array containing a single value that is the value at index i in the list of cell values .
        3. Otherwise, add the value at index i in the list of cell values to the array at M .tableSchema.columns[ i ].title .
    4. Add 1 to the source row number .
  8. Set row number to 1 .
  9. While it is possible to read another row, do the following:
    1. Set the source column number to 1 .
    2. Read a row to provide the row content .
    3. If the comment prefix is not null and the row content begins with the comment prefix , strip that prefix from the row content , and add the resulting string to the M .rdfs:comment array.
    4. Otherwise, parse the row to provide a list of cell values , and:
      1. If all of the values in the list of cell values are empty strings, and skip blank rows is true , add 1 to the source row number and move on to process the next row.
      2. Otherwise, create a new row R , with:
      3. Append R to the rows of table T .
      4. Remove the first skip columns number of values from the list of cell values and add that number to the source column number .
      5. For each of the remaining values at index i in the list of cell values (where i starts at 1 ):
        1. Identify the column C at index i within the columns of table T . If there is no such column:
          1. Create a new column C with:
          2. Append C to the columns of table T (at index i ).
        2. Create a new cell D , with:
        3. Append cell D to the cells of column C .
        4. Append cell D to the cells of row R (at index i ).
        5. Add 1 to the source column number .
    5. Add 1 to the source row number .
  10. If M .rdfs:comment is an empty array, remove the rdfs:comment property from M .
  11. Return the table T and the embedded metadata M .

To read a row to provide row content , perform the following steps:

  1. Set the row content to an empty string.
  2. Read a character and process as follows:
    1. If the character is an escape character followed by the quote character , append both characters to the row content , and move on to process the character following the quote character .
    2. Otherwise, if the character is an escape character and the escape character is not the same as the quote character , append the escape character and the character following it to the row content and move on to process the next character.
    3. Otherwise, if the character is a quote character , append the quoted value obtained by reading a quoted value to the row content and move on to process the character following the quoted value.
    4. Otherwise, if the character is, or starts a sequence that is, a line terminator , return the row content .
    5. Otherwise, append the character to the row content and move on to process the next character.
  3. If there are no more characters to read, return the row content .

To read a quoted value to provide a quoted value , perform the following steps:

  1. Set the quoted value to an empty string.
  2. Read the initial quote character and add a quote character to the quoted value .
  3. Read a character and process as follows:
    1. If the character is an escape character followed by the quote character , append both characters to the quoted value , and move on to process the character following the quote character .
    2. Otherwise, if the character is an escape character and the escape character is not the same as the quote character , append the escape character and the character following it to the quoted value and move on to process the next character.
    3. Otherwise, if the character is a quote character , return the quoted value .
    4. Otherwise, append the character to the quoted value and move on to process the next character.

To parse a row to provide a list of cell values , perform the following steps:

  1. Set the list of cell values to an empty list and the current cell value to an empty string.
  2. Set the quoted flag to false .
  3. Read a character and process as follows:
    1. If the character is an escape character followed by the quote character , append a quote character to the current cell value , and move on to process the character following the quote character .
    2. Otherwise, if the character is an escape character and the escape character is not the same as the quote character , append the character following the escape character to the current cell value and move on to process the next character.
    3. Otherwise, if the character is a quote character then:
      1. If quoted is false , set the quoted flag to true , and move on to process the next character. If the current cell value is not an empty string, raise an error.
      2. Otherwise, set quoted to false , and move on to process the next character. If the next character is not the delimiter , raise an error.
    4. Otherwise, if the character is a delimiter , then:
      1. If quoted is true , append the delimiter character to the current cell value and move on to process the next character.
      2. Otherwise, trim the current cell value , add the resulting trimmed cell value to the list of cell values and move on to process the next character.
    5. Otherwise, append the character to the current cell value and move on to process the next character.
  4. If there are no more characters to read, trim the current cell value , add the resulting trimmed cell value to the list of cell values and return the list of cell values .

To trim a cell value to provide a trimmed cell value , perform the following steps:

  1. Set the trimmed cell value to the provided cell value.
  2. If trim is true or start then remove any leading whitespace from the start of the trimmed cell value and move on to the next step.
  3. If trim is true or end then remove any trailing whitespace from the end of the trimmed cell value and move on to the next step.
  4. Return the trimmed cell value .
Note

This parsing algorithm does not account for the possibility of there being more than one area of tabular data within a single CSV file.
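The row-parsing and trimming steps above can be sketched as follows. This is a simplified, non-normative Python reading: it assumes the default flags (comma delimiter, `"` as both the quote character and the escape character) and, unlike the literal algorithm, only honours a doubled quote inside a quoted section.

```python
def trim_cell(value, trim="true"):
    # Trim a cell value: strip leading and/or trailing whitespace
    # according to the trim flag ("true", "false", "start" or "end").
    if trim in ("true", "start"):
        value = value.lstrip()
    if trim in ("true", "end"):
        value = value.rstrip()
    return value

def parse_row(line, delimiter=",", quote='"', trim="true"):
    # Parse a row into a list of cell values.
    cells, current, quoted = [], "", False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == quote and quoted and i + 1 < len(line) and line[i + 1] == quote:
            # Escape character followed by the quote character:
            # append a literal quote to the current cell value.
            current += quote
            i += 2
        elif ch == quote:
            # Toggle the quoted flag.
            quoted = not quoted
            i += 1
        elif ch == delimiter and not quoted:
            # End of cell: trim and add to the list of cell values.
            cells.append(trim_cell(current, trim))
            current = ""
            i += 1
        else:
            current += ch
            i += 1
    cells.append(trim_cell(current, trim))
    return cells
```

A conforming implementation would also need to handle an escape character that differs from the quote character, and raise errors for misplaced quotes as the algorithm requires; those paths are omitted here.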

8.1 Bidirectionality in CSV Files

This section is non-normative.

Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.
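In other words, splitting a row on commas in logical (wire) character order always yields the first column first, regardless of how a bidi-aware renderer displays the line. A minimal illustration:

```python
# Cell-to-column assignment depends only on logical (wire) character
# order, not on how a bidi-aware renderer displays the line.
row = "قنا,84.5,15.5"  # an Arabic governorate name followed by two percentages
cells = row.split(",")
first_column = cells[0]  # the text before the first comma
```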

For example, Egyptian Referendum results are available as a CSV file at https://egelections-2011.appspot.com/Referendum2012/results/csv/EG.csv . Over the wire and in non-Unicode-aware text editors, the CSV looks like:

‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌,‌ن‌س‌ب‌ة‌ ‌م‌و‌ا‌ف‌ق‌,‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌,‌ع‌د‌د‌ ‌ا‌ل‌ن‌ا‌خ‌ب‌ي‌ن‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ص‌ح‌ي‌ح‌ة‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ب‌ا‌ط‌ل‌ة‌,‌ن‌س‌ب‌ة‌ ‌ا‌ل‌م‌ش‌ا‌ر‌ك‌ة‌,‌م‌و‌ا‌ف‌ق‌,‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌
‌ا‌ل‌ق‌ل‌ي‌و‌ب‌ي‌ة‌,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070"
‌ا‌ل‌ج‌ي‌ز‌ة‌,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675"
‌ا‌ل‌ق‌ا‌ه‌ر‌ة‌,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327"
‌ق‌ن‌ا‌,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670"
...
            

Within this CSV file, the first column appears as the content of each line before the first comma and is named المحافظة (appearing at the start of each row as ‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌ in the example, which is displaying the relevant characters from left to right in the order they appear "on the wire").

The CSV translates to a table model that looks like:

Column / Row column 1 column 2 column 3 column 4 column 5 column 6 column 7 column 8 column 9
row 1 (header) column names المحافظة نسبة موافق نسبة غير موافق عدد الناخبين الأصوات الصحيحة الأصوات الباطلة نسبة المشاركة موافق غير موافق
row 2 1 القليوبية 60.0 40.0 2,639,808 853,125 15,224 32.9 512,055 341,070
row 3 2 الجيزة 66.7 33.3 4,383,701 1,493,092 24,105 34.6 995,417 497,675
row 4 3 القاهرة 43.2 56.8 6,580,478 2,254,698 36,342 34.8 974,371 1,280,327
row 5 4 قنا 84.5 15.5 1,629,713 364,509 6,743 22.8 307,839 56,670

The fragment identifier #col=3 identifies the third of the columns, named نسبة غير موافق (appearing as ‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌ in the example).

Section 6.3.1 Bidirectional Tables in [ tabular-metadata ] defines how this table model should be displayed by compliant applications, and how metadata can affect the display. The default is for the display to be determined by the content of the table. For example, if this CSV were turned into an HTML table for display in a web page, it should be displayed with the first column on the right and the last on the left, as follows:

غير موافق موافق نسبة المشاركة الأصوات الباطلة الأصوات الصحيحة عدد الناخبين نسبة غير موافق نسبة موافق المحافظة
341,070 512,055 32.9 15,224 853,125 2,639,808 40.0 60.0 القليوبية
497,675 995,417 34.6 24,105 1,493,092 4,383,701 33.3 66.7 الجيزة
1,280,327 974,371 34.8 36,342 2,254,698 6,580,478 56.8 43.2 القاهرة
56,670 307,839 22.8 6,743 364,509 1,629,713 15.5 84.5 قنا

The fragment identifier #col=3 still identifies the third of the columns, named نسبة غير موافق , which appears in the HTML display as the third column from the right and is what those who read right-to-left would think of as the third column.

Note that this display matches that shown on the original website .

An alternative approach is for the CSV to be parsed into a table model in which the columns are numbered in the reverse, for tables which are either marked as or detected to be right-to-left tables. For example, we could introduce a bidi=rtl or similar media type parameter, and use this to determine whether the first column in table generated from the CSV is the text before the first comma in each line or the text after the last comma in the line.

In the example above, if the CSV were served with bidi=rtl , or the table was detected as being a right-to-left table, then the column numbering in the model would be reversed:

Column / Row column 9 column 8 column 7 column 6 column 5 column 4 column 3 column 2 column 1
row 1 (header) المحافظة نسبة موافق نسبة غير موافق عدد الناخبين الأصوات الصحيحة الأصوات الباطلة نسبة المشاركة موافق غير موافق
row 2 القليوبية 60.0 40.0 2,639,808 853,125 15,224 32.9 512,055 341,070
row 3 الجيزة 66.7 33.3 4,383,701 1,493,092 24,105 34.6 995,417 497,675
row 4 القاهرة 43.2 56.8 6,580,478 2,254,698 36,342 34.8 974,371 1,280,327
row 5 قنا 84.5 15.5 1,629,713 364,509 6,743 22.8 307,839 56,670

This would require a change to [ RFC7111 ] but that might be required by updates to the definition of text/csv in any case. With the change, the fragment identifier #col=3 would then refer to the third column from the right, named نسبة المشاركة .

If the model were defined in this way, there would be no need to determine the order of the columns when displayed using a metadata property. Columns would always be displayed with the first column (numbered 1 in the model) on the left. The final display in HTML, for example, would be exactly as above. The only difference would be that #col=3 would refer to the third column from the left.
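Under this alternative model, the mapping between model column numbers and wire order would be a simple reversal. A sketch (the `wire_column` name and the `bidi=rtl` parameter are illustrative, not part of any registered media type):

```python
# Hypothetical mapping under the proposed bidi=rtl model: model column n
# corresponds to wire column (total - n + 1). With bidi=rtl absent,
# model and wire numbering coincide.
def wire_column(model_col, total_cols, rtl=False):
    return total_cols - model_col + 1 if rtl else model_col
```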

We note that using media type parameters is problematic because publishers might not have the ability to set them on their servers, and because they can easily get lost as a file is republished or emailed between people.

We invite comment on the best way to approach bidirectionality in CSV files.

8.2 Examples

8.2.1 Simple Example

A simple CSV file that complies with the constraints described in section 7. Best Practice CSV , at http://example.org/tree-ops.csv , might look like:

Example 13: http://example.org/tree-ops.csv
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

Parsing this file results in an annotated tabular data model of a single table T with five columns and two rows. The columns have the annotations shown in the following table:

id core annotations
table number source number cells titles
C1 T 1 1 C1.1 , C2.1 GID
C2 T 2 2 C1.2 , C2.2 On Street
C3 T 3 3 C1.3 , C2.3 Species
C4 T 4 4 C1.4 , C2.4 Trim Cycle
C5 T 5 5 C1.5 , C2.5 Inventory Date

The extracted embedded metadata , as defined in [ tabular-metadata ], would look like:

Example 14: tree-ops.csv Embedded Metadata
{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [
      {"titles": [ "GID" ]},
      {"titles": [ "On Street" ]},
      {"titles": [ "Species" ]},
      {"titles": [ "Trim Cycle" ]},
      {"titles": [ "Inventory Date" ]}
    ]
  }


}
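The embedded metadata above can be derived mechanically from the header line; a non-normative sketch:

```python
# Build the embedded metadata object from the CSV header line:
# one column description per header cell, each cell becoming a title.
header = "GID,On Street,Species,Trim Cycle,Inventory Date"

embedded = {
    "@type": "Table",
    "url": "http://example.org/tree-ops.csv",
    "tableSchema": {
        "columns": [{"titles": [title]} for title in header.split(",")]
    },
}
```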

The rows have the annotations shown in the following table:

id core annotations
table number source number cells
R1 T 1 2 C1.1 , C1.2 , C1.3 , C1.4 , C1.5
R2 T 2 3 C2.1 , C2.2 , C2.3 , C2.4 , C2.5
Note

The source number of each row is offset by one from the number of each row because in the source CSV file, the header line is the first line. It is possible to reconstruct a [ RFC7111 ] compliant reference to the first record in the original CSV file ( http://example.org/tree-ops.csv#row=2 ) using the value of the row's source number. This enables implementations to retain provenance between the table model and the original file.
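Reconstructing such a fragment reference from a row's source number is a one-liner; a sketch (the `row_fragment` helper is illustrative):

```python
# Build an RFC 7111 row fragment from a row's source number, preserving
# provenance from the table model back to the original CSV file.
def row_fragment(url, source_number):
    return f"{url}#row={source_number}"
```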

The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):

id core annotations
table column row string value value
C1.1 T C1 R1 "1" "1"
C1.2 T C2 R1 "ADDISON AV" "ADDISON AV"
C1.3 T C3 R1 "Celtis australis" "Celtis australis"
C1.4 T C4 R1 "Large Tree Routine Prune" "Large Tree Routine Prune"
C1.5 T C5 R1 "10/18/2010" "10/18/2010"
C2.1 T C1 R2 "2" "2"
C2.2 T C2 R2 "EMERSON ST" "EMERSON ST"
C2.3 T C3 R2 "Liquidambar styraciflua" "Liquidambar styraciflua"
C2.4 T C4 R2 "Large Tree Routine Prune" "Large Tree Routine Prune"
C2.5 T C5 R2 "6/2/2010" "6/2/2010"
8.2.1.1 Using Overriding Metadata

The tools that the consumer of this data uses may provide a mechanism for overriding the metadata that has been provided within the file itself. For example, they might enable the consumer to add machine-readable names to the columns, or to mark the fifth column as holding a date in the format M/D/YYYY . These facilities are implementation defined; the code for invoking a Javascript-based parser might look like:

Example 15: Javascript implementation configuration
data.parse({
  "column-names": ["GID", "on_street", "species", "trim_cycle", "inventory_date"],
  "datatypes": ["string", "string", "string", "string", "date"],
  "formats": [null,null,null,null,"M/D/YYYY"]

});

This is equivalent to a metadata file expressed in the syntax defined in [ tabular-metadata ], looking like:

Example 16: Equivalent metadata syntax
{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "datatype": "string"
    }, {
      "name": "on_street",
      "datatype": "string"
    }, {
      "name": "species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "datatype": {
        "base": "date",
        "format": "M/d/yyyy"
      }
    }]
  }

}

This would be merged with the embedded metadata found in the CSV file, providing the titles for the columns to create:

Example 17: Merged metadata
{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [{
      "name": "GID", 
      "titles": "GID", 
      "datatype": "string"
    }, {
      "name": "on_street", 
      "titles": "On Street", 
      "datatype": "string"
    }, {
      "name": "species", 
      "titles": "Species", 
      "datatype": "string"
    }, {
      "name": "trim_cycle", 
      "titles": "Trim Cycle", 
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "datatype": {
        "base": "date", 
        "format": "M/d/yyyy"
      }
    }]
  }


}

The processor can then create an annotated tabular data model that includes name annotations on the columns and datatype annotations on the cells, and creates cells whose values are of appropriate types (in the case of this Javascript implementation, the cells in the last column would be Date objects, for example).
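For the date column, the conversion from string value to typed value follows the declared format. A sketch, assuming the date format M/d/yyyy corresponds to strptime's "%m/%d/%Y" (which also tolerates unpadded months and days in Python):

```python
from datetime import date, datetime

# Convert a cell's string value into a typed date value using the
# column's declared format.
def parse_date_cell(string_value, fmt="%m/%d/%Y"):
    return datetime.strptime(string_value, fmt).date()
```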

Assuming this kind of implementation-defined parsing, the columns would then have the annotations shown in the following table:

id core annotations
table number source number cells name titles datatype
C1 T 1 1 C1.1 , C2.1 GID GID string
C2 T 2 2 C1.2 , C2.2 on_street On Street string
C3 T 3 3 C1.3 , C2.3 species Species string
C4 T 4 4 C1.4 , C2.4 trim_cycle Trim Cycle string
C5 T 5 5 C1.5 , C2.5 inventory_date Inventory Date { "base": "date", "format": "M/d/yyyy" }

The cells have the annotations shown in the following table. Because of the overrides provided by the consumer to guide the parsing, and the way the parser works, the cells in the Inventory Date column (cells C1.5 and C2.5 ) have values that are parsed dates rather than unparsed strings.

id core annotations
table column row string value value
C1.1 T C1 R1 "1" "1"
C1.2 T C2 R1 "ADDISON AV" "ADDISON AV"
C1.3 T C3 R1 "Celtis australis" "Celtis australis"
C1.4 T C4 R1 "Large Tree Routine Prune" "Large Tree Routine Prune"
C1.5 T C5 R1 "10/18/2010" 2010-10-18
C2.1 T C1 R2 "2" "2"
C2.2 T C2 R2 "EMERSON ST" "EMERSON ST"
C2.3 T C3 R2 "Liquidambar styraciflua" "Liquidambar styraciflua"
C2.4 T C4 R2 "Large Tree Routine Prune" "Large Tree Routine Prune"
C2.5 T C5 R2 "6/2/2010" 2010-06-02
8.2.1.2 Using a Metadata File

A similar set of annotations could be provided through a metadata file, located as discussed in section 5. Locating Metadata and defined in [ tabular-metadata ]. For example, this might look like:

Example 18: http://example.org/tree-ops.csv-metadata.json
{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID",
    "aboutUrl": "#gid-{GID}"
  }


}


The annotated tabular data model generated from this metadata would be more sophisticated again. The table itself would have the following annotations:

dc:title
{"@value": "Tree Operations", "@language": "en"}
dcat:keyword
[{"@value": "tree", "@language": "en"}, {"@value": "street", "@language": "en"}, {"@value": "maintenance", "@language": "en"}]
dc:publisher
[{ "schema:name": "Example Municipality", "schema:url": {"@id": "http://example.org"} }]
dc:license
{"@id": "http://opendefinition.org/licenses/cc-by/"}
dc:modified
{"@value": "2010-12-31", "@type": "date"}

The columns would have the annotations shown in the following table:

id core annotations other annotations
table number source number cells name titles datatype dc:description
C1 T 1 1 C1.1 , C2.1 GID GID , Generic Identifier string An identifier for the operation on a tree.
C2 T 2 2 C1.2 , C2.2 on_street On Street string The street that the tree is on.
C3 T 3 3 C1.3 , C2.3 species Species string The species of the tree.
C4 T 4 4 C1.4 , C2.4 trim_cycle Trim Cycle string The operation performed on the tree.
C5 T 5 5 C1.5 , C2.5 inventory_date Inventory Date { "base": "date", "format": "M/d/yyyy" } The date of the operation that was performed.

The rows have an additional primary key annotation, as shown in the following table:

id core annotations
table number source number cells primary key
R1 T 1 2 C1.1 , C1.2 , C1.3 , C1.4 , C1.5 C1.1
R2 T 2 3 C2.1 , C2.2 , C2.3 , C2.4 , C2.5 C2.1

Thanks to the provided metadata, the cells again have the annotations shown in the following table. The metadata file has provided the information to supplement the model with additional annotations but also, for the Inventory Date column (cells C1.5 and C2.5 ), to give values that are parsed dates rather than unparsed strings.

id core annotations
table column row string value value about URL
C1.1 T C1 R1 "1" "1" http://example.org/tree-ops.csv#gid-1
C1.2 T C2 R1 "ADDISON AV" "ADDISON AV" http://example.org/tree-ops.csv#gid-1
C1.3 T C3 R1 "Celtis australis" "Celtis australis" http://example.org/tree-ops.csv#gid-1
C1.4 T C4 R1 "Large Tree Routine Prune" "Large Tree Routine Prune" http://example.org/tree-ops.csv#gid-1
C1.5 T C5 R1 "10/18/2010" 2010-10-18 http://example.org/tree-ops.csv#gid-1
C2.1 T C1 R2 "2" "2" http://example.org/tree-ops.csv#gid-2
C2.2 T C2 R2 "EMERSON ST" "EMERSON ST" http://example.org/tree-ops.csv#gid-2
C2.3 T C3 R2 "Liquidambar styraciflua" "Liquidambar styraciflua" http://example.org/tree-ops.csv#gid-2
C2.4 T C4 R2 "Large Tree Routine Prune" "Large Tree Routine Prune" http://example.org/tree-ops.csv#gid-2
C2.5 T C5 R2 "6/2/2010" 2010-06-02 http://example.org/tree-ops.csv#gid-2
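The about URL of each cell is produced by expanding the schema's aboutUrl template ( #gid-{GID} ) against the row's values and resolving the result against the table URL. A sketch, in which simple {name} substitution stands in for full RFC 6570 URI Template processing:

```python
# Expand an aboutUrl template against a row's named cell values and
# resolve it against the table URL. (Real processors use RFC 6570
# URI Templates; str.format is a simplification.)
def about_url(table_url, template, row_values):
    return table_url + template.format(**row_values)
```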

8.2.2 Empty and Quoted Cells

The following slightly amended CSV file contains quoted and missing cell values:

Example 19: CSV file containing quoted and missing cell values
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010
2,,"Liquidambar styraciflua","Large Tree Routine Prune",

Parsing this file similarly results in an annotated tabular data model of a single table T with five columns and two rows. The columns and rows have exactly the same annotations as previously, but there are two null cell values for C2.2 and C2.5 . Note that the quoting of values within the CSV makes no difference to either the string value or value of the cell.

id core annotations
table column row string value value
C1.1 T C1 R1 "1" "1"
C1.2 T C2 R1 "ADDISON AV" "ADDISON AV"
C1.3 T C3 R1 "Celtis australis" "Celtis australis"
C1.4 T C4 R1 "Large Tree Routine Prune" "Large Tree Routine Prune"
C1.5 T C5 R1 "10/18/2010" "10/18/2010"
C2.1 T C1 R2 "2" "2"
C2.2 T C2 R2 "" null
C2.3 T C3 R2 "Liquidambar styraciflua" "Liquidambar styraciflua"
C2.4 T C4 R2 "Large Tree Routine Prune" "Large Tree Routine Prune"
C2.5 T C5 R2 "" null
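The null values in the table above follow from the default handling of empty string values; a minimal sketch:

```python
# With the default settings, an empty string value yields a null cell
# value; whether the empty value was quoted in the CSV makes no difference.
def cell_value(string_value):
    return None if string_value == "" else string_value
```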

8.2.3 Tabular Data Embedding Annotations

The following example illustrates some of the complexities that can be involved in parsing tabular data, how the flags described above can be used, and how new tabular data formats could be defined that embed additional annotations into the tabular data model.

In this example, the publishers of the data are using an internal convention to supply additional metadata about the tabular data embedded within the file itself. They are also using a tab as a separator rather than a comma.

Example 20: Tab-separated file containing embedded metadata
#	publisher	City of Palo Alto

#	updated	12/31/2010
#name	GID	on_street	species	trim_cycle	inventory_date
#datatype	string	string	string	string	date:M/D/YYYY
	GID	On Street	Species	Trim Cycle	Inventory Date
	1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010
	2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010
8.2.3.1 Naive Parsing

Naive parsing of the above data will assume a comma separator and thus results in a single table T with a single column and six rows. The column has the annotations shown in the following table:

id core annotations
table number source number cells titles
C1 T 1 1 C1.1 , C2.1 , C3.1 , C4.1 , C5.1 , C6.1 # publisher City of Palo Alto

The rows have the annotations shown in the following table:

id core annotations
table number source number cells
R1 T 1 2 C1.1
R2 T 2 3 C2.1
R3 T 3 4 C3.1
R4 T 4 5 C4.1
R5 T 5 6 C5.1
R6 T 6 7 C6.1

The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):

id core annotations
table column row string value value
C1.1 T C1 R1 "# updated 12/31/2010" "# updated 12/31/2010"
C2.1 T C1 R2 "#name GID on_street species trim_cycle inventory_date" "#name GID on_street species trim_cycle inventory_date"
C3.1 T C1 R3 "#datatype string string string string date:M/D/YYYY" "#datatype string string string string date:M/D/YYYY"
C4.1 T C1 R4 " GID On Street Species Trim Cycle Inventory Date" " GID On Street Species Trim Cycle Inventory Date"
C5.1 T C1 R5 " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010" " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010"
C6.1 T C1 R6 " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010" " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010"
8.2.3.2 Parsing with Flags

The consumer of the data may use the flags described above to create a more useful set of data from this file. Specifically, they could set the delimiter to a tab, skip rows to 4, skip columns to 1, and the comment prefix to # .

Setting these is done in an implementation-defined way. It could be done, for example, by sniffing the contents of the file itself, through command-line options, or by embedding a dialect description into a metadata file associated with the tabular data, which would look like:

Example 21: Dialect description
{
  "delimiter": "\t",
  "skipRows": 4,
  "skipColumns": 1,
  "commentPrefix": "#"

}
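The effect of these dialect settings can be sketched as follows; this is a simplified, non-normative reading in which skipped rows beginning with the comment prefix become comments and the first cell of each remaining row is dropped:

```python
# Apply dialect flags: delimiter, skipRows, skipColumns, commentPrefix.
def apply_dialect(lines, delimiter="\t", skip_rows=0, skip_columns=0,
                  comment_prefix="#"):
    comments, rows = [], []
    for i, line in enumerate(lines):
        if i < skip_rows:
            if line.startswith(comment_prefix):
                # A skipped row starting with the comment prefix becomes
                # a comment annotation on the table.
                comments.append(line.lstrip(comment_prefix).strip())
            continue
        rows.append(line.split(delimiter)[skip_columns:])
    return comments, rows

lines = [
    "#\tpublisher\tCity of Palo Alto",
    "#\tupdated\t12/31/2010",
    "#name\tGID\ton_street\tspecies\ttrim_cycle\tinventory_date",
    "#datatype\tstring\tstring\tstring\tstring\tdate:M/D/YYYY",
    "\tGID\tOn Street\tSpecies\tTrim Cycle\tInventory Date",
    "\t1\tADDISON AV\tCeltis australis\tLarge Tree Routine Prune\t10/18/2010",
]
comments, rows = apply_dialect(lines, skip_rows=4, skip_columns=1)
```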

With these flags in operation, parsing this file results in an annotated tabular data model of a single table T with five columns and two rows which is largely the same as that created from the original simple example described in section 8.2.1 Simple Example . There are three differences.

First, because the four skipped rows began with the comment prefix , the table itself now has four rdfs:comment annotations, with the values:

  1. publisher City of Palo Alto
  2. updated 12/31/2010
  3. name GID on_street species trim_cycle inventory_date
  4. datatype string string string string date:M/D/YYYY

Second, because the first column has been skipped, the source number of each of the columns is offset by one from the number of each column:

id core annotations
table number source number cells titles
C1 T 1 2 C1.1 , C2.1 GID
C2 T 2 3 C1.2 , C2.2 On Street
C3 T 3 4 C1.3 , C2.3 Species
C4 T 4 5 C1.4 , C2.4 Trim Cycle
C5 T 5 6 C1.5 , C2.5 Inventory Date

Finally, because four additional rows have been skipped, the source number of each of the rows is offset by five from the row number (the four skipped rows plus the single header row):

id core annotations
table number source number cells
R1 T 1 6 C1.1 , C1.2 , C1.3 , C1.4 , C1.5
R2 T 2 7 C2.1 , C2.2 , C2.3 , C2.4 , C2.5
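The source-number offsets in the two tables above amount to simple arithmetic over the dialect settings; a sketch:

```python
# Source numbers count lines/columns in the original file, so they are
# offset by the skipped and header rows (or skipped columns).
def row_source_number(row_number, skip_rows=4, header_row_count=1):
    return row_number + skip_rows + header_row_count

def column_source_number(column_number, skip_columns=1):
    return column_number + skip_columns
```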
8.2.3.3 Recognising Tabular Data Formats

The conventions used in this data (invented for the purpose of this example) are in fact intended to create an annotated tabular data model which includes named annotations on the table itself, on the columns, and on the cells. The creator of these conventions could create a specification for this particular tabular data syntax and register a media type for it. The specification would include statements like:

  • A tab delimiter is always used.
  • The first column is always ignored.
  • When the first column of a row has the value "#" , the second column is the name of an annotation on the table and the values of the remaining columns are concatenated to create the value of that annotation.
  • When the first column of a row has the value #name , the remaining cells in the row provide a name annotation for each column in the table.
  • When the first column of a row has the value #datatype , the remaining cells in the row provide datatype / format annotations for the cells within the relevant column, and these are interpreted to create the value for each cell in that column.
  • The first row where the first column is empty is a row of headers; these provide title annotations on the columns in the table.
  • The remaining rows make up the data of the table.

Parsers that immediately follows it. recognised the format could then build a more sophisticated annotated tabular data model using only the embedded information in the tabular data file . They would extract embedded metadata looking like:

Example 22: Embedded metadata in the format of the annotated tabular model
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "tree-ops.csv",
  "dc:publisher": "City of Palo Alto",
  "dc:updated": "12/31/2010",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": "GID",
      "datatype": "string"
    }, {
      "name": "on_street",
      "titles": "On Street",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "datatype": {
        "base": "date", 
        "format": "M/d/yyyy"
      }
    }]
  }


}

As before, the result would be a single table T with five columns and two rows. The table itself would have two annotations:

dc:publisher
{"@value": "City of Palo Alto"}
dc:updated
{"@value": "12/31/2010"}

The columns have the annotations shown in the following table:

id core annotations
table number source number cells name titles
C1 T 1 2 C1.1 , C2.1 GID GID
C2 T 2 3 C1.2 , C2.2 on_street On Street
C3 T 3 4 C1.3 , C2.3 species Species
C4 T 4 5 C1.4 , C2.4 trim_cycle Trim Cycle
C5 T 5 6 C1.5 , C2.5 inventory_date Inventory Date

The rows have the annotations shown in the following table, exactly as in previous examples:

id core annotations
table number source number cells
R1 T 1 6 C1.1 , C1.2 , C1.3 , C1.4 , C1.5
R2 T 2 7 C2.1 , C2.2 , C2.3 , C2.4 , C2.5

The cells have the annotations shown in the following table. Because of the way the particular tabular data format has been specified, these include additional annotations but also, for the Inventory Date column (cells C1.5 and C2.5 ), a value that is a parsed date rather than an unparsed string.

id core annotations
table column row string value value
C1.1 T C1 R1 "1" "1"
C1.2 T C2 R1 "ADDISON AV" "ADDISON AV"
C1.3 T C3 R1 "Celtis australis" "Celtis australis"
C1.4 T C4 R1 "Large Tree Routine Prune" "Large Tree Routine Prune"
C1.5 T C5 R1 "10/18/2010" 2010-10-18
C2.1 T C1 R2 "2" "2"
C2.2 T C2 R2 "EMERSON ST" "EMERSON ST"
C2.3 T C3 R2 "Liquidambar styraciflua" "Liquidambar styraciflua"
C2.4 T C4 R2 "Large Tree Routine Prune" "Large Tree Routine Prune"
C2.5 T C5 R2 "6/2/2010" 2010-06-02

8.2.4 Parsing Multiple Header Lines

The following example shows a CSV file with multiple header lines:

Example 23: CSV file with multiple header lines
Who,What,,Where,
Organisation,Sector,Subsector,Department,Municipality
#org,#sector,#subsector,#adm1,#adm2

UNICEF,Education,Teacher training,Chocó,Quidbó
UNICEF,Education,Teacher training,Chocó,Bojayá

Here, the first line contains some grouping titles, which are not particularly helpful. The two lines following it contain useful titles for the columns. Thus the appropriate configuration for a dialect description is:

Example 24: Dialect description for multiple header lines
{
  "skipRows": 1,
  "headerRowCount": 2
}
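The effect of these two dialect properties can be sketched in Python: drop the skipped rows, read the next headerRowCount rows as per-column titles, and treat the remainder as data. The slicing logic below is a simplified illustration, not a conforming parser:

```python
import csv
import io

# Dialect settings from the dialect description above.
dialect = {"skipRows": 1, "headerRowCount": 2}

data = """Who,What,,Where,
Organisation,Sector,Subsector,Department,Municipality
#org,#sector,#subsector,#adm1,#adm2
UNICEF,Education,Teacher training,Chocó,Quidbó
"""

rows = list(csv.reader(io.StringIO(data)))
rows = rows[dialect["skipRows"]:]           # drop the grouping line
header = rows[:dialect["headerRowCount"]]   # the two header lines
body = rows[dialect["headerRowCount"]:]     # the data rows

# Each column's titles are read down the header lines.
titles = list(zip(*header))
print(titles[0])  # ('Organisation', '#org')
```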

With this configuration, the table model contains five columns, each of which have two titles, summarised in the following table:

id core annotations
table number source number cells titles
C1 T 1 1 C1.1 , C2.1 Organisation , #org
C2 T 2 2 C1.2 , C2.2 Sector , #sector
C3 T 3 3 C1.3 , C2.3 Subsector , #subsector
C4 T 4 4 C1.4 , C2.4 Department , #adm1
C5 T 5 5 C1.5 , C2.5 Municipality , #adm2

As metadata, this would look like:

Example 25: Extracted metadata
{
  "tableSchema": {
    "columns": [
      { "titles": ["Organisation", "#org"] },
      { "titles": ["Sector", "#sector"] },
      { "titles": ["Subsector", "#subsector"] },
      { "titles": ["Department", "#adm1"] },
      { "titles": ["Municipality", "#adm2"] }
    ]
  }
}

A separate metadata file could contain just the second of each of these titles, for example:

Example 26: Metadata file
{
  "tableSchema": {
    "columns": [
      { "name": "org", "titles": "#org" },
      { "name": "sector", "titles": "#sector" },
      { "name": "subsector", "titles": "#subsector" },
      { "name": "adm1", "titles": "#adm1" },
      { "name": "adm2", "titles": "#adm2" }
    ]
  }
}

This enables people from multiple jurisdictions to use the same tabular data structures without having to use exactly the same titles within their documents.
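The separation of stable names from human-facing titles can be sketched as a simple lookup: a processor matches a file's header titles against each column description's titles and then addresses cells by the name. The helper below is a hypothetical illustration, not part of the metadata vocabulary:

```python
# Column descriptions as in the metadata file above (abbreviated);
# the matching logic is a simplified sketch.
columns = [
    {"name": "org", "titles": "#org"},
    {"name": "sector", "titles": "#sector"},
    {"name": "adm2", "titles": "#adm2"},
]


def name_for_title(title):
    """Return the stable column name for a header title, if any."""
    for col in columns:
        if title == col["titles"]:
            return col["name"]
    return None


print(name_for_title("#sector"))  # sector
```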

A. Existing Standards

This section is non-normative.

This appendix outlines various ways in which CSV is defined.

A.1 RFC 4180

[ RFC4180 ] defines CSV with the following ABNF grammar:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE =  %x22
LF = %x0A
CRLF = CR LF
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
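As a rough cross-check of this grammar, Python's built-in csv module can be configured to match the RFC's quoting rules: comma delimiters, DQUOTE-quoted fields, embedded quotes doubled, and CRLF inside quoted fields preserved. This is a sketch only; the stdlib parser is more permissive than the ABNF (for example, it does not enforce the TEXTDATA character range):

```python
import csv
import io

# An RFC 4180 record set: doubled quotes and a CRLF inside a
# quoted field, CRLF record separators.
data = 'a,"b ""quoted""","c\r\nnewline"\r\nd,e,f\r\n'

# newline='' keeps the embedded CRLF intact for the csv parser,
# as recommended by the csv module documentation.
rows = list(csv.reader(io.StringIO(data, newline='')))
print(rows)
# [['a', 'b "quoted"', 'c\r\nnewline'], ['d', 'e', 'f']]
```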

Of particular note here are:

A.2 Excel

Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.

Note

The following describes the behaviour of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behaviour of Excel in other situations.

A.2.1 Saved CSV

Excel generates CSV files encoded using Windows-1252 with LF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those cells that need escaping (eg because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes.

Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.

A.2.2 Opened CSV

When opening CSV files, Excel interprets CSV files saved in UTF-8 as being encoded as Windows-1252 (whether or not a BOM is present). It correctly deals with double quoted cells, except that it converts line breaks within cells into spaces. It understands CRLF as a line break. It detects dates (formatted as YYYY-MM-DD ) and formats them in the default date formatting for files.
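The mis-interpretation described above is easy to reproduce: decoding UTF-8 bytes as Windows-1252 garbles every non-ASCII character. A minimal sketch (this illustrates the effect, not Excel's actual code path):

```python
# "Chocó" saved as UTF-8 but read as Windows-1252, as happens when
# a UTF-8 CSV is opened without respecting its encoding.
utf8_bytes = "Chocó".encode("utf-8")
misread = utf8_bytes.decode("windows-1252")
print(misread)  # ChocÃ³
```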

A.2.3 Imported CSV

Excel provides more control when importing CSV files into Excel. However, it does not properly understand UTF-8 (with or without BOM ). It does however properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.

A particular quirk in the importing of CSV is that if a cell contains a line break, the final double quote that escapes the cell will be included within it.

A.2.4 Copied Tabular Data

When tabular data is copied from Excel, it is copied in a tab-delimited format, with LF line breaks.

A.3 Google Spreadsheets

A.3.1 Downloading CSV

Downloaded CSV files are encoded in UTF-8, without a BOM , and with LF line endings. Dates and numbers are formatted as they appear within the spreadsheet.

A.3.2 Importing CSV

CSV files can be imported as UTF-8 (with or without BOM ). CRLF line endings are correctly recognised. Dates are reformatted to the default date format on load.

A.4 CSV Files in a Tabular Data Package

Tabular Data Packages place the following restrictions on CSV files:

As a starting point, CSV files included in a Tabular Data Package must conform to the RFC for CSV ( RFC 4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:

  • File names MUST end with .csv .

  • Files MUST be encoded as UTF-8.

  • Files MUST have a single header row. This row MUST be the first row in the file.

  • Rows in the file MUST NOT contain more fields than are in the header row (though they may contain fewer).

  • Each file MUST have an entry in the tables array in the datapackage.json file.

  • The resource metadata MUST include a tableSchema attribute whose value MUST be a valid schema description .

  • All fields in the CSV files MUST be described in the schema description .

CSV files generated by different applications often vary in their syntax, e.g. use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a Tabular Data Package SHOULD :

  • Use “,” "," as field delimiters delimiters.
  • Use CRLF ( U+000D U+000A ) or LF ( U+000A ) as line terminators.

If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the datapackage.json descriptor MUST :

Applications processing the CSV file SHOULD use the dialect of the CSV file to guide parsing.
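Using a documented dialect to guide parsing can be sketched as follows. The property names delimiter and quoteChar follow the CSV Dialect Description Format used by Data Packages, and the mapping onto Python's csv module parameters is an assumption for illustration:

```python
import csv
import io

# A documented dialect for a hypothetical semicolon-separated,
# single-quote-escaped file.
dialect = {"delimiter": ";", "quoteChar": "'"}

data = "a;'b;c';d\r\n"
rows = list(csv.reader(
    io.StringIO(data, newline=""),
    delimiter=dialect["delimiter"],
    quotechar=dialect["quoteChar"],
))
print(rows)  # [['a', 'b;c', 'd']]
```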

Note

To replicate the findings above, test files which include non-ASCII characters, double quotes, and line breaks within cells are:

B. Acknowledgements

This section is non-normative.

At the time of publication, the following individuals had participated in the Working Group, in the order of their first name: Adam Retter, Alf Eaton, Anastasia Dimou, Andy Seaborne, Axel Polleres, Christopher Gutteridge, Dan Brickley, Davide Ceolin, Eric Stephan, Erik Mannens, Gregg Kellogg, Ivan Herman, Jeni Tennison, Jeremy Tandy, Jürgen Umbrich, Rufus Pollock, Stasinos Konstantopoulos, William Ingram, and Yakov Shafranovich.

C. Changes since the working draft of 08 January 2015

The document has undergone substantial changes since the last working draft. Below are some of the changes made:

D. References

D.1 Normative references

[BCP47]
A. Phillips; M. Davis. Tags for Identifying Languages . September 2009. IETF Best Current Practice. URL: http://tools.ietf.org/html/bcp47
[BIDI]
Mark Davis; Aharon Lanin; Andrew Glass. Unicode Bidirectional Algorithm . 5 June 2014. Unicode Standard Annex #9. URL: http://www.unicode.org/reports/tr9/
[ECMASCRIPT]
Allen Wirfs-Brock. ECMA-262 ECMAScript Language Specification, Edition 6 . Draft. URL: http://people.mozilla.org/~jorendorff/es6-draft.html
[ISO8601]
Representation of dates and times. International Organization for Standardization. 2004. ISO 8601:2004. URL: http://www.iso.org/iso/catalogue_detail?csnumber=40874
[JSON-LD]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0 . 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels . March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[RFC4180]
Y. Shafranovich. Common Format and MIME Type for Comma-Separated Values (CSV) Files . October 2005. Informational. URL: https://tools.ietf.org/html/rfc4180
[UAX35]
Mark Davis; CLDR committee members. Unicode Locale Data Markup Language (LDML) . 15 March 2013. Unicode Standard Annex #35. URL: http://www.unicode.org/reports/tr35/tr35-31/tr35.html
[encoding]
Anne van Kesteren; Joshua Bell; Addison Phillips. Encoding . 16 September 2014. W3C Candidate Recommendation. URL: http://www.w3.org/TR/encoding/
[tabular-metadata]
Jeni Tennison; Gregg Kellogg. Metadata Vocabulary for Tabular Data . W3C Working Draft. URL: http://www.w3.org/TR/2015/WD-tabular-metadata-20150416/
[xmlschema11-2]
David Peterson; Sandy Gao; Ashok Malhotra; Michael Sperberg-McQueen; Henry Thompson; Paul V. Biron et al. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes . 5 April 2012. W3C Recommendation. URL: http://www.w3.org/TR/xmlschema11-2/

D.2 Informative references

[EBNF-NOTATION]
Tim Bray; Jean Paoli; C. Michael Sperberg-McQueen; Eve Maler; François Yergeau. EBNF Notation . W3C Recommendation. URL: http://www.w3.org/TR/xml/#sec-notation
[RFC7111]
M. Hausenblas; E. Wilde; J. Tennison. URI Fragment Identifiers for the text/csv Media Type . January 2014. Informational. URL: https://tools.ietf.org/html/rfc7111
[annotation-model]
Robert Sanderson; Paolo Ciccarese. Web Annotation Data Model . 11 December 2014. W3C Working Draft. URL: http://www.w3.org/TR/annotation-model/
[csv2json]
Jeremy Tandy; Ivan Herman. Generating JSON from Tabular Data on the Web . W3C Working Draft. URL: http://www.w3.org/TR/2015/WD-csv2json-20150416/
[csv2rdf]
Jeremy Tandy; Ivan Herman; Gregg Kellogg. Generating RDF from Tabular Data on the Web . W3C Working Draft. URL: http://www.w3.org/TR/2015/WD-csv2rdf-20150416/