Model for Tabular Data and Metadata on the Web

Abstract

Tabular data is routinely transferred on the web as "CSV", but the definition of "CSV" in practice is very loose. This document outlines a basic data model or infoset for tabular data and metadata about that tabular data. It also contains some non-normative information about a best practice syntax for tabular data, for mapping into that data model, to contribute to the standardisation of CSV syntax by IETF. Various methods of locating metadata are also provided.

2. Tabular Data Models

There are different levels of data models for tabular data:

The core tabular data model described in section 2.1 Core Tabular Data Model defines the core model for a single table of basic data.
The annotated tabular data model described in section 2.2 Annotated Tabular Data Model defines a model for tables that are annotated with metadata.
The grouped tabular data model described in section 2.3 Grouped Tabular Data Model defines a model for tables that are related to each other in some way.

2.1 Core Tabular Data Model

The core tabular data model can be used to describe a table that lacks any annotations, whether those annotations are embedded within a CSV file or arise from a separate metadata document.

Data is held in a table. The properties of a table are:

columns — the list of columns in the table. A table MUST have one or more columns and the order of the columns within the list is significant and MUST be preserved by applications
rows — the list of rows in the table

A column represents a vertical arrangement of cells within a table. The properties of a column are:

table — the table that the column appears in
number — the position of the column amongst the columns for the associated table, starting from 1
cells — the list of cells in the column. A column MUST contain one cell from each row in the table. The order of the cells in the list MUST match the order of the rows in which they appear within the rows for the associated table.

A row represents a horizontal arrangement of cells within a table. The properties of a row are:

table — the table that the row appears in
number — the position of the row amongst the rows for the table, starting from 1
cells — the list of cells in the row. A row MUST contain one cell from each column in the table. The order of the cells in the list MUST match the order of the columns in which they appear within the table columns for the row's table.

A cell represents the intersection of a row and a column within a table. The properties of a cell are:

table — the table that the cell appears in
column — the column that the cell appears in; the cell MUST be in the cells for that column
row — the row that the cell appears in; the cell MUST be in the cells for that row
string value — a string that is the original syntactic representation of the value of the cell, ie how the cell appears within a CSV file; this may be an empty string
value — the semantic value of the cell; within the core tabular data model this is either a string which is the same as the string value of the cell, or null, if the string value is an empty string
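The properties listed above can be sketched as a small set of Python classes. This is only an illustration of the core model, not part of it: the class names, the string_value attribute name and the build_table helper are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class Table:
    columns: List["Column"] = field(default_factory=list)
    rows: List["Row"] = field(default_factory=list)


@dataclass(eq=False, repr=False)
class Column:
    table: Table
    number: int  # position amongst the table's columns, starting from 1
    cells: List["Cell"] = field(default_factory=list)


@dataclass(eq=False, repr=False)
class Row:
    table: Table
    number: int  # position amongst the table's rows, starting from 1
    cells: List["Cell"] = field(default_factory=list)


@dataclass(eq=False, repr=False)
class Cell:
    table: Table
    column: Column
    row: Row
    string_value: str  # may be an empty string

    @property
    def value(self) -> Optional[str]:
        # In the core model the value is the string value, or null if it is empty.
        return self.string_value if self.string_value != "" else None


def build_table(records: List[List[str]]) -> Table:
    """Hypothetical helper: build a table from equally sized lists of strings."""
    if not records or not records[0]:
        raise ValueError("a table must have one or more columns")
    table = Table()
    table.columns = [Column(table, n) for n in range(1, len(records[0]) + 1)]
    for number, record in enumerate(records, start=1):
        if len(record) != len(table.columns):
            raise ValueError("each row must contain one cell from each column")
        row = Row(table, number)
        for column, string_value in zip(table.columns, record):
            cell = Cell(table, column, row, string_value)
            row.cells.append(cell)     # cell order follows column order
            column.cells.append(cell)  # cell order follows row order
        table.rows.append(row)
    return table
```

Building the table row by row keeps both ordering constraints: the cells of a row follow the column order, and the cells of a column follow the row order.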

Issue 1

Should there be a distinction made (in the core tabular data model) between empty cells and cells whose value is an empty string? With a lack of other metadata, the only way to make the distinction would be to say there was a difference between a missing value in a CSV file (eg the second value in a,,z) and a quoted value (eg the second value in a,"",z). This seems dangerous as I don't think many clients will make a distinction between the two, so the semantics will be lost on round-tripping.

2.2 Annotated Tabular Data Model

An annotated table is a table that is annotated with additional metadata. The table MAY have any number of properties in addition to those provided in the core tabular data model described in section 2.1 Core Tabular Data Model which provide information about the table as a whole. The values of these properties may be lists, structured objects, or atomic values. Annotations on a table may include:

titles or descriptions of the table
information about the source or provenance of the data in the table
links to other tables (eg to indicate tables that include related information)

The columns within an annotated table are all annotated columns: columns which MAY have any number of properties in addition to those provided in the core tabular data model described in section 2.1 Core Tabular Data Model. The annotations on a column might provide information about how to interpret the cells in the column or information about the column as a whole. Examples of annotations might be:

a name or label for the column
the expected type of values in that column
an indication of whether the column contains unique values

The rows within an annotated table are all annotated rows: rows which MAY have any number of properties in addition to those provided in the core tabular data model described in section 2.1 Core Tabular Data Model. The annotations on a row provide additional metadata about the information held in the row, such as:

the certainty of the information in that row
information about the source or provenance of the data in that row

The cells within an annotated row are all annotated cells: cells which MAY have any number of properties in addition to those provided in the core tabular data model described in section 2.1 Core Tabular Data Model. The annotations on a cell provide metadata about the value held in the cell, particularly when this overrides the information provided for the annotated column and annotated row that the cell falls within. Annotations on a cell might be:

notes to aid the interpretation of the value
type annotations for the value
a flag that indicates the value is a null value

The value of an annotated cell MAY be of a datatype other than a string. For example, annotations might enable a processor to understand the string value of the cell as representing a number or a date.
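As a minimal sketch of this idea, the function below derives a value from a string value and a datatype annotation. The annotation names "string", "number" and "date", and the function itself, are assumptions for illustration only; this document does not yet define the permitted annotations.

```python
from datetime import date
from decimal import Decimal


def cell_value(string_value, datatype="string"):
    """Interpret a cell's string value using a (hypothetical) datatype annotation."""
    if string_value == "":
        return None  # an empty string value is treated as null, as in the core model
    if datatype == "number":
        return Decimal(string_value)
    if datatype == "date":
        return date.fromisoformat(string_value)  # expects YYYY-MM-DD
    return string_value
```

For example, cell_value("60.0", "number") gives a decimal number, while cell_value("60.0") leaves the value as a string.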

Issue 2

The permitted types of names and values of annotations need to be spelled out here.

Issue 3

It might be useful to define annotated regions as follows:

An annotated table may also contain a number of annotated regions. Regions are themselves tabular structures comprised of selected rows and columns and the cells within those rows for those columns. Annotated regions are regions that have annotations associated with them. Annotated columns and annotated rows are special types of annotated regions where the region is the entirety of a single column or single row.

But it's not currently clear that there are use cases or examples that justify it. Input is welcome on this.

2.3 Grouped Tabular Data Model

A group of tables comprises a set of tables (which may be annotated tables) and a set of annotations (properties and values) that relate to the set.

Note

Tables can be loosely related to each other simply through annotations; not all tables that are related to each other need to be grouped together. Groups of tables are useful because they can be annotated with metadata that applies to all the tables in the group.

3. Locating Metadata

As described in section 2.2 Annotated Tabular Data Model, tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate those annotations given a link to a CSV file.

In most methods of locating metadata described here, metadata is provided within a separate document. The syntax of this document is defined in the Metadata Vocabulary for Tabular Data specification. These documents can include things like:

metadata about the table, such as titles, descriptions, provenance, and links to other tables
metadata about columns in the table, such as labels, data types and other constraints, or flags to indicate values in the column are unique
metadata about rows in the table, such as certainty or provenance
metadata about values in the table, such as notes or type annotations

When creating a set of annotations from metadata, if the same property is specified in two locations then information "closer" to the document itself should override information "further" from the document. Explicitly, the order of preference is:

metadata embedded within the tabular data file itself, see section 3.1 Embedded Metadata
metadata within a package that the tabular data file belongs to, see section 3.2 Package
metadata in a document linked to using a Link header when retrieving the tabular data file, see section 3.3 Link Header
metadata in a document located through a standard path from the tabular data file, see section 3.4 Standard Path
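The order of preference above can be sketched as a merge in which sources are applied from furthest to closest, so that closer sources overwrite properties set by further ones. The sketch assumes each source has already been read into a flat dictionary of properties. The function and parameter names are hypothetical, and conflict handling (see Issue 4) is not addressed.

```python
def merge_metadata(embedded=None, package=None, link_header=None, standard_path=None):
    """Merge property dictionaries so that closer sources take precedence."""
    merged = {}
    # Apply the furthest source first; each later update overrides earlier values.
    for source in (standard_path, link_header, package, embedded):
        if source:
            merged.update(source)
    return merged
```

For instance, a title found in the file itself would replace a title found through the standard path, while other properties from the standard path document would be kept.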

Issue 4

We probably need to add some rules about conflicts as well. For example, if a metadata file says that the CSV should contain certain columns but the names of the columns are different in the CSV file itself, is this an error?

3.1 Embedded Metadata

The first line of a CSV+ file MUST be processed as a header line unless the CSV+ file is served with a header=absent parameter on the media type. Each cell in the header line that includes non-whitespace characters provides a title annotation on the column in which it appears.
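A minimal sketch of this rule, assuming the first line has already been split into cells and that the value of the header media type parameter is passed in as a string (the function name and the "present" default are assumptions):

```python
def column_titles(first_row, header="present"):
    """Map column numbers (starting from 1) to title annotations."""
    if header == "absent":
        return {}  # the first line is data, so it provides no titles
    return {
        number: cell
        for number, cell in enumerate(first_row, start=1)
        if cell.strip()  # only cells with non-whitespace characters give a title
    }
```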

Issue 5

The title annotation needs to be linked to the relevant annotation in the metadata vocabulary.

3.2 Package

Rather than providing CSV files directly on the web, they can be packaged up with a metadata file that includes annotations, and any other relevant CSV files.

Issue 6

What should that package look like? Just a zip? A multipart document? How is the metadata file within it identified? If this is allowed we should specify a new media type for the package.

See Packaging on the Web Editor's Draft for a proposed generic approach for packaging on the web that could be used for packaging CSV files and metadata, and discovering those packages.

3.3 Link Header

When retrieving a CSV file via HTTP, the response can include a Link header with rel=describedby that points to a metadata file that describes the CSV file. If, by inspection, the referenced file is not a valid metadata file then it MUST be ignored. If there is more than one valid metadata file linked to through multiple Link headers, then the metadata referenced by Link headers that appear later in the response override that referenced by earlier Link headers.
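The following sketch shows one way a client might apply these rules. It uses a deliberately simplified pattern for Link header values (it does not handle commas or semicolons inside quoted parameter values), and the is_valid check is left to the caller because the validity of a metadata file is defined elsewhere. All function names are assumptions.

```python
import re

# Simplified pattern: one <target> followed by its ";name=value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;[^;,]*)*)')
REL_RE = re.compile(r';\s*rel\s*=\s*"?([^";,]*)"?', re.IGNORECASE)


def describedby_targets(link_headers):
    """Return the targets of rel=describedby links, in response order."""
    targets = []
    for header in link_headers:
        for target, params in LINK_RE.findall(header):
            rel = REL_RE.search(params)
            if rel and "describedby" in rel.group(1).lower().split():
                targets.append(target)
    return targets


def combine_linked_metadata(documents, is_valid):
    """Combine metadata documents; later documents override earlier ones."""
    combined = {}
    for document in documents:
        if is_valid(document):  # invalid metadata files are ignored
            combined.update(document)
    return combined
```

Fetching each target and deciding whether it is a valid metadata file are outside the scope of this sketch.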

Issue 7

Is rel=describedby the right link relation to use?

3.4 Standard Path

Given a CSV file, the default location for a metadata file that describes that CSV file is set to filename.csvm in the same directory. If that file does not exist, then the application should look for a metadata file at metadata.csvm in the same directory. In both cases, if the metadata file does not explicitly point to the relevant CSV file then it MUST be ignored.
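As an illustration, the candidate locations could be computed as below. This sketch assumes that "filename.csvm" means the name of the CSV file with its extension replaced by .csvm; that reading, the function name and the example URL are assumptions rather than something this document settles.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def standard_metadata_locations(csv_url):
    """Return the metadata locations to try, in order, for a CSV file URL."""
    filename = posixpath.basename(urlsplit(csv_url).path)
    stem, _extension = posixpath.splitext(filename)
    return [
        urljoin(csv_url, stem + ".csvm"),   # assumed reading of filename.csvm
        urljoin(csv_url, "metadata.csvm"),  # fallback in the same directory
    ]
```

An application would try each location in turn and ignore any metadata file that does not explicitly point to the CSV file.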

Issue 8

This draft uses a suffix on filenames to find metadata about them, though we haven't decided what format metadata documents should be in, or even whether they should be content-negotiated.

Issue 9

Should there be a default behaviour of continuing up the path hierarchy until a metadata document is found?

Issue 10

We have discussed using a .well-known location or something within a sitemap file to provide the location of a metadata file about a given CSV file, but these are just as unlikely to be editable as a Link header, so probably don't address the use case that the standard path method addresses, of being a really simple way to provide metadata about a CSV file.

4. CSV+ Syntax

This section is non-normative.

There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for outputting tabular data adhering to the core tabular data model described in section 2.1 Core Tabular Data Model into a standard, CSV-based, syntax. Compliant applications that output this format must meet each of the constraints.

Note

We are actively working with the IETF to develop a standard for CSV, which is outside the scope of the Working Group. The details here aim to help shape that standard based on our requirements.

This section does not seek to describe how applications that input textual tabular data should interpret it, except that any data that is in the format defined here should be understood as defined here.

Note

This syntax is not compliant with text/csv as defined in [RFC4180]: it permits characters other than ASCII, and it permits line endings other than CRLF. Supporting the full set of Unicode characters by using UTF-8 and supporting LF line endings are important characteristics for data formats that are used internationally and on non-Windows platforms. However, all files that adhere to [RFC4180]'s definition of CSV are compliant CSV+ files.

4.1 Content Type

The appropriate content type for a CSV+ file is text/csv. For example, when a CSV+ file is transmitted via HTTP, the HTTP response MUST include a Content-Type header with the value text/csv:

Content-Type: text/csv

Issue 11

See below for issues relating to whether we should instead define a different content type.

4.2 Encoding

CSV+ files SHOULD be encoded using UTF-8. If a CSV+ file is not encoded using UTF-8, the encoding MUST be specified through the charset parameter in the Content-Type header:

Content-Type: text/csv;charset=ISO-8859-1

Issue 12

RFC4180 defines the default charset as US-ASCII because that was (at the time RFC4180 was written) the default charset for all text/* media types. This has been superseded by RFC6657. Section 3 of RFC6657 states "new subtypes of the "text" media type SHOULD NOT define a default "charset" value. If there is a strong reason to do so despite this advice, they SHOULD use the "UTF-8" [RFC3629] charset as the default."

Do we have a strong reason to specify a default charset? Should we be defining application/csv instead, to avoid doing unrecommended things with a text/* media type?

4.3 Line Endings

The ends of rows in a CSV+ file MUST be either CRLF (U+000D U+000A) or LF (U+000A). Line endings within escaped cells are not normalised.

Issue 13

Section 4.1.1 of RFC2046 specifies that "The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden."

Should we be defining application/csv instead, to prevent having to adhere to this rule, or should we stick to the CRLF rule?

4.4 Lines

Each line of a CSV+ file MUST contain the same number of comma-separated values.

Values that contain commas, line endings or double quotes MUST be escaped by having the entire value wrapped in double quotes. There MUST NOT be whitespace before or after the double quotes. Within these escaped cells, any double quotes MUST be escaped with two double quotes ("").
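A minimal sketch of an output routine that follows these two constraints is shown below. The function names are assumptions, and the choice of line ending is left as a parameter.

```python
def escape_value(value):
    """Wrap a value in double quotes if it contains a comma, line ending or double quote."""
    if any(ch in value for ch in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value


def write_rows(rows, line_ending="\r\n"):
    """Serialise rows of strings, checking that every line has the same number of values."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("each line must contain the same number of values")
    return "".join(
        ",".join(escape_value(value) for value in row) + line_ending
        for row in rows
    )
```

Values that do not need escaping are written as they are, so no whitespace is introduced around the double quotes.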

4.4.1 Headers

The first line of a CSV+ file SHOULD contain a comma-separated list of names of columns. This is known as the header line and provides names for the columns. There are no constraints on these names.

If a CSV+ file does not include a header line, this MUST be specified using the header parameter of the media type:

Content-Type: text/csv;header=absent
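A receiving application might read the charset and header parameters from the Content-Type header as sketched below, using the standard library's email.message module to parse the parameters. Treating a missing charset as UTF-8 and a missing header parameter as "present" reflects the rules in this section as drafted; the function name and result keys are assumptions.

```python
from email.message import Message


def csv_media_type_info(content_type):
    """Extract the media type, encoding and header flag from a Content-Type value."""
    message = Message()
    message["Content-Type"] = content_type
    return {
        "media_type": message.get_content_type(),
        "encoding": message.get_param("charset", "utf-8"),
        "has_header": message.get_param("header", "present") != "absent",
    }
```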

4.4.2 Bidirectionality in CSV+ Files

Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV+ file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.

For example, Egyptian Referendum results are available as a CSV file at https://egelections-2011.appspot.com/Referendum2012/results/csv/EG.csv. Over the wire and in non-Unicode-aware text editors, the CSV looks like:

              
‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌,‌ن‌س‌ب‌ة‌ ‌م‌و‌ا‌ف‌ق‌,‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌,‌ع‌د‌د‌ ‌ا‌ل‌ن‌ا‌خ‌ب‌ي‌ن‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ص‌ح‌ي‌ح‌ة‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ب‌ا‌ط‌ل‌ة‌,‌ن‌س‌ب‌ة‌ ‌ا‌ل‌م‌ش‌ا‌ر‌ك‌ة‌,‌م‌و‌ا‌ف‌ق‌,‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌
‌ا‌ل‌ق‌ل‌ي‌و‌ب‌ي‌ة‌,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070"
‌ا‌ل‌ج‌ي‌ز‌ة‌,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675"
‌ا‌ل‌ق‌ا‌ه‌ر‌ة‌,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327"
‌ق‌ن‌ا‌,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670"
...

Within this CSV file, the first column appears as the content of each line before the first comma and is named المحافظة (appearing at the start of each row as ‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌ in the example, which is displaying the relevant characters from left to right in the order they appear "on the wire").

The CSV translates to a table model that looks like:

Column / Row	column 1	column 2	column 3	column 4	column 5	column 6	column 7	column 8	column 9
row 1 (header)	المحافظة	نسبة موافق	نسبة غير موافق	عدد الناخبين	الأصوات الصحيحة	الأصوات الباطلة	نسبة المشاركة	موافق	غير موافق
row 2	القليوبية	60.0	40.0	2,639,808	853,125	15,224	32.9	512,055	341,070
row 3	الجيزة	66.7	33.3	4,383,701	1,493,092	24,105	34.6	995,417	497,675
row 4	القاهرة	43.2	56.8	6,580,478	2,254,698	36,342	34.8	974,371	1,280,327
row 5	قنا	84.5	15.5	1,629,713	364,509	6,743	22.8	307,839	56,670

The fragment identifier #col=3 identifies the third of the columns, named نسبة غير موافق (appearing as ‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌ in the example).
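To illustrate, the sketch below parses an abbreviated version of the example (the first three columns of the header and of one data row) with Python's csv module and selects a column by its number. The column function is hypothetical. Note that an editor may display the string literal right-to-left, but the characters are stored in the order they appear on the wire.

```python
import csv
import io

# Abbreviated example: three columns, the header line and one data line.
WIRE = "المحافظة,نسبة موافق,نسبة غير موافق\nالقاهرة,43.2,56.8\n"


def column(text, number):
    """Return the cells of the column with the given number (starting from 1)."""
    rows = csv.reader(io.StringIO(text, newline=""))
    return [row[number - 1] for row in rows]
```

Here column(WIRE, 1) starts with the text before the first comma of each line, regardless of the direction in which that text is displayed.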

[tabular-metadata] defines how this table model should be displayed by compliant applications, and how metadata can affect the display. The default is for the display to be determined by the content of the table. For example, if this CSV were turned into an HTML table for display into a web page, it should be displayed with the first column on the right and the last on the left, as follows:

غير موافق	موافق	نسبة المشاركة	الأصوات الباطلة	الأصوات الصحيحة	عدد الناخبين	نسبة غير موافق	نسبة موافق	المحافظة
341,070	512,055	32.9	15,224	853,125	2,639,808	40.0	60.0	القليوبية
497,675	995,417	34.6	24,105	1,493,092	4,383,701	33.3	66.7	الجيزة
1,280,327	974,371	34.8	36,342	2,254,698	6,580,478	56.8	43.2	القاهرة
56,670	307,839	22.8	6,743	364,509	1,629,713	15.5	84.5	قنا

The fragment identifier #col=3 still identifies the third of the columns, named نسبة غير موافق, which appears in the HTML display as the third column from the right and is what those who read right-to-left would think of as the third column.

Note that this display matches that shown on the original website.

Issue 14

An alternative approach is for the CSV to be parsed into a table model in which the columns are numbered in the reverse, for tables which are either marked as or detected to be right-to-left tables. For example, we could introduce a bidi=rtl or similar media type parameter, and use this to determine whether the first column in the table generated from the CSV is the text before the first comma in each line or the text after the last comma in the line.

In the example above, if the CSV were served with bidi=rtl, or the table was detected as being a right-to-left table, then the column numbering in the model would be reversed:

Column / Row	column 9	column 8	column 7	column 6	column 5	column 4	column 3	column 2	column 1
row 1 (header)	المحافظة	نسبة موافق	نسبة غير موافق	عدد الناخبين	الأصوات الصحيحة	الأصوات الباطلة	نسبة المشاركة	موافق	غير موافق
row 2	القليوبية	60.0	40.0	2,639,808	853,125	15,224	32.9	512,055	341,070
row 3	الجيزة	66.7	33.3	4,383,701	1,493,092	24,105	34.6	995,417	497,675
row 4	القاهرة	43.2	56.8	6,580,478	2,254,698	36,342	34.8	974,371	1,280,327
row 5	قنا	84.5	15.5	1,629,713	364,509	6,743	22.8	307,839	56,670

This would require a change to [RFC7111] but that might be required by updates to the definition of text/csv in any case. With the change, the fragment identifier #col=3 would then refer to the third column from the right, named نسبة المشاركة.

If the model were defined in this way, there would be no need to determine the order of the columns when displayed using a metadata property. Columns would always be displayed with the first column (numbered 1 in the model) on the left. The final display in HTML, for example, would be exactly as above. The only difference would be that #col=3 would refer to the third column from the left.

We note that using media type parameters is problematic because publishers might not have the ability to set them on their servers, and because they can easily get lost as a file is republished or emailed between people.

We invite comment on the best way to approach bidirectionality in CSV files.

4.5 Grammar

This grammar is a generalization of that defined in [RFC4180] and is included for reference only.

The EBNF used here is defined in XML 1.0 [EBNF-NOTATION].

[1]	csv	::=	header record+
[2]	header	::=	record
[3]	record	::=	fields #x0D? #x0A
[4]	fields	::=	field ("," fields)*
[5]	field	::=	WS* rawfield WS*
[6]	rawfield	::=	'"' QCHAR* '"' | SCHAR*
[7]	QCHAR	::=	[^"] | '""'
[8]	SCHAR	::=	[^",#x0A#x0D]
[9]	WS	::=	[#x20#x09]
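As a rough check of this grammar, it can be transcribed into a regular expression. The sketch assumes that WS, QCHAR and SCHAR are repeated zero or more times within field and rawfield. It omits WS around unquoted fields because SCHAR already matches spaces and tabs. The names FIELD, RECORD, CSV and is_csv are hypothetical.

```python
import re

# field: optional whitespace around a quoted rawfield, or a run of SCHARs.
FIELD = r'(?:[ \t]*"(?:[^"]|"")*"[ \t]*|[^",\r\n]*)'
# record: comma-separated fields ended by LF or CRLF.
RECORD = FIELD + r'(?:,' + FIELD + r')*\r?\n'
# csv: a header record followed by one or more records.
CSV = re.compile('(?:' + RECORD + '){2,}')


def is_csv(text):
    """Return True if the whole text matches the csv production."""
    return CSV.fullmatch(text) is not None
```

The grammar alone does not require every record to have the same number of fields, so this check is weaker than the constraints in section 4.4 Lines.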

Note

We should probably place further restrictions on QCHAR and SCHAR to avoid control characters. If header weren’t optional, it would be better defined as in RFC4180, but if the syntax allows it to be optional, this would make it not an LL(1) grammar, which isn’t too much of an issue.

5. Parsing Tabular Data

This section is non-normative.

As described in section 4. CSV+ Syntax, there may be many formats which an application might interpret into the tabular data model described in section 2. Tabular Data Models, including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.

Note

Standardising the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardisation of the format.

This section describes an algorithm for parsing formats other than the plain CSV+ format specified in section 4. CSV+ Syntax. It is impossible to do this in a fully automated manner, so this algorithm depends on the following flags being set externally (eg through user input):

encoding: The character encoding for the file, one of the encodings listed in [encoding]. The default is utf-8.
row terminator: The character that is used at the end of a row. The default is CRLF.
enclosure character: The character that is used around escaped cells. The default is ".
escape character: The character that is used to escape the enclosure character within escaped cells. The default is ".
skip rows: The number of rows to skip at the beginning of the file, before a header row or tabular data. The default is 0.
comment prefix: A character that, when it appears at the beginning of a skipped row, indicates a comment that should be associated as a comment annotation to the table. The default is #.
header row count: The number of header rows (following the skipped rows) in the file. The default is 1.
delimiter: The separator between cells. The default is ,.
skip columns: The number of columns to skip at the beginning of each row, before any header columns. The default is 0.
header column count: The number of header columns (following the skipped columns) in each row. The default is 0.
skip blank rows: Indicates whether to ignore wholly empty rows (ie rows in which all the cells are empty). The default is false.
trim: Indicates whether to trim whitespace around cells.

Issue 15

When parsing, should we:

always trim whitespace around cells?
always create empty cells for missing cells?

The algorithm for parsing a document containing tabular data is as follows:

Read the file using the specified encoding.
Find the rows. Each row ends with a row terminator, but values that are enclosed within the enclosure character may contain the row terminator without it indicating the end of the row. The enclosure character may be escaped using the escape character where it appears within cells.
Skip the number of rows indicated by the skip rows parameter.
Within the skipped rows, find rows that start with the comment prefix. These form comment annotations on the table.
Gather the number of header rows indicated by the header row count parameter; the remaining rows are data rows.
Split the header and data rows into cells using the delimiter. Values that are enclosed within the enclosure character may contain the delimiter. The enclosure character may be escaped using the escape character where it appears within cells.

If trim is true or start then whitespace from the start of values that are not enclosed must be removed from the value. If trim is true or end then whitespace from the end of values that are not enclosed must be removed from the value.
In each row, ignore the number of columns indicated by the skip columns parameter. Always start from the first character in the row when counting columns (see section 4.4.2 Bidirectionality in CSV+ Files).
Gather the number of header columns indicated by the header column count parameter. Always start from the first character in the row when counting columns (see section 4.4.2 Bidirectionality in CSV+ Files).
Each cell within a header row that is not in a skipped or header column is a label annotation on that column.
Each cell within a header column is an annotation on the row it appears in; if there is a header row then that provides the type of the annotation for the row, otherwise it is a label annotation.
If skip blank rows is true then ignore any rows in which all the cell values are empty strings.
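The steps above can be sketched in Python as follows. This is an illustration under several assumptions, not a reference implementation. Rows are taken to end with either CRLF or LF rather than a configurable row terminator. The text after the comment prefix is stripped of surrounding whitespace. Header rows are returned as one list of labels per column, and header columns are returned as raw lists rather than typed annotations. The function names and the shape of the result are hypothetical.

```python
def split_outside(text, separators, enclosure='"', escape='"'):
    """Split text at any separator that is not inside an enclosed value."""
    parts, start, i, enclosed = [], 0, 0, False
    while i < len(text):
        if enclosed and text.startswith(escape, i) and text.startswith(enclosure, i + 1):
            i += 2  # escaped enclosure character: stay inside the value
            continue
        if text.startswith(enclosure, i):
            enclosed = not enclosed
            i += 1
            continue
        sep = None
        if not enclosed:
            sep = next((s for s in separators if text.startswith(s, i)), None)
        if sep:
            parts.append(text[start:i])
            i += len(sep)
            start = i
        else:
            i += 1
    parts.append(text[start:])
    return parts


def read_cell(raw, enclosure, escape, trim):
    """Remove the enclosure from an enclosed value, or trim one that is not enclosed."""
    stripped = raw.strip(" \t")
    if len(stripped) >= 2 and stripped.startswith(enclosure) and stripped.endswith(enclosure):
        return stripped[1:-1].replace(escape + enclosure, enclosure)
    if trim in (True, "start"):
        raw = raw.lstrip()
    if trim in (True, "end"):
        raw = raw.rstrip()
    return raw


def parse_tabular(data, *, encoding="utf-8", delimiter=",", enclosure='"',
                  escape='"', skip_rows=0, comment_prefix="#",
                  header_row_count=1, skip_columns=0, header_column_count=0,
                  skip_blank_rows=False, trim=False):
    text = data.decode(encoding)                                   # read the file
    rows = split_outside(text, ["\r\n", "\n"], enclosure, escape)  # find the rows
    if rows and rows[-1] == "":
        rows.pop()  # the terminator of the last row leaves an empty tail
    skipped, rows = rows[:skip_rows], rows[skip_rows:]             # skip rows
    comments = [row[len(comment_prefix):].strip()
                for row in skipped if row.startswith(comment_prefix)]
    cells = [[read_cell(cell, enclosure, escape, trim)             # split and trim
              for cell in split_outside(row, [delimiter], enclosure, escape)]
             for row in rows]
    cells = [row[skip_columns:] for row in cells]                  # skip columns
    header, body = cells[:header_row_count], cells[header_row_count:]
    labels = [list(column) for column in
              zip(*(row[header_column_count:] for row in header))]
    if skip_blank_rows:
        body = [row for row in body if any(value != "" for value in row)]
    return {
        "comments": comments,
        "column_labels": labels,
        "row_headers": [row[:header_column_count] for row in body],
        "rows": [row[header_column_count:] for row in body],
    }
```

Rows are found before they are split into cells, as in the algorithm, so a comment line in the skipped rows keeps any delimiters it contains.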

A. Existing Standards

This appendix outlines various ways in which CSV is defined.

A.1 RFC 4180

[RFC4180] defines CSV with the following ABNF grammar:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE =  %x22
LF = %x0A
CRLF = CR LF
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

Of particular note here are:

The production for TEXTDATA indicates that only non-control ASCII characters are permitted within a CSV file. This restriction is routinely ignored in practice, and is impractical on the international web.
Lines should be ended with CRLF. This makes it harder to produce CSV files on Unix-based systems where the usual line ending is LF.
The header line is optional; a header parameter on the media type indicates whether the header is present or not.
Fields may be escaped by wrapping them in double quotes; any double quotes within the field must be escaped with two double quotes ("").

A.2 Excel

Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.

Note

The following describes the behaviour of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behaviour of Excel in other situations.

A.2.1 Saved CSV

Excel generates CSV files encoded using Windows-1252 with LF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those cells that need escaping (eg because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes.

Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.

A.2.2 Opened CSV

When opening CSV files, Excel interprets CSV files saved in UTF-8 as being encoded as Windows-1252 (whether or not a BOM is present). It correctly deals with double quoted cells, except that it converts line breaks within cells into spaces. It understands CRLF as a line break. It detects dates (formatted as YYYY-MM-DD) and formats them in the default date formatting for files.
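The effect of reading UTF-8 bytes as Windows-1252 can be reproduced with a short sketch. The sample text is made up for illustration and the code only imitates the decoding step, not Excel itself.

```python
original = "café,naïve"

# Encode as UTF-8 (as the file was saved), then decode as Windows-1252
# (as the file is interpreted on opening).
misread = original.encode("utf-8").decode("cp1252")

# A UTF-8 byte order mark, read the same way, becomes three visible characters.
misread_bom = "\ufeff".encode("utf-8").decode("cp1252")
```

Each non-ASCII character is replaced by two Windows-1252 characters, because it occupies two bytes in UTF-8.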

A.2.3 Imported CSV

Excel provides more control when importing CSV files into Excel. However, it does not properly understand UTF-8 (with or without BOM). It does however properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.

A particular quirk in the importing of CSV is that if a cell contains a line break, the final double quote that escapes the cell will be included within it.

A.2.4 Copied Tabular Data

When tabular data is copied from Excel, it is copied in a tab-delimited format, with LF line breaks.

A.3 Google Spreadsheets

A.3.1 Downloading CSV

Downloaded CSV files are encoded in UTF-8, without a BOM, and with LF line endings. Dates and numbers are formatted as they appear within the spreadsheet.

A.3.2 Importing CSV

CSV files can be imported as UTF-8 (with or without BOM). CRLF line endings are correctly recognised. Dates are reformatted to the default date format on load.

A.4 CSV Files in a Tabular Data Package

Tabular Data Packages place the following restrictions on CSV files:

As a starting point, CSV files included in a Tabular Data Package must conform to the RFC for CSV (4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:

File names MUST end with .csv

Files MUST be encoded as UTF-8

Files MUST have a single header row. This row MUST be the first row in the file.

Terminology: each column in the CSV file is termed a field and its name is the string in that column in the header row.

The name MUST be unique amongst fields and MUST contain at least one character

There are no further restrictions on the form of the name but it is RECOMMENDED that it contain only alphanumeric characters together with “ .-_”

Rows in the file MUST NOT contain more fields than are in the header row (though they may contain less)

Each file MUST have an entry in the resources array in the datapackage.json file

The resource metadata MUST include a schema attribute whose value MUST conform to the JSON Table Schema

All fields in the CSV files MUST be described in the schema

CSV files generated by different applications often vary in their syntax, e.g. use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a Tabular Data Package SHOULD

Use “,” as field delimiters

Use “\r\n” or “\n” as line terminators

If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the datapackage.json descriptor MUST:

Include a dialect key that conforms to that described in the CSV Dialect Description Format

Applications processing the CSV file SHOULD use the dialect of the CSV file to guide parsing.

Issue 16

More details of behaviour of other tools should go here. This should include the most popular CSV parsing/generating libraries in common programming languages. Test files which include non-ASCII characters, double quotes and line breaks within cells are:

W3C Working Draft 10 July 2014
