Copyright © 2014 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
Tabular data is routinely transferred on the web as "CSV", but the definition of "CSV" in practice is very loose. This document outlines a basic data model or infoset for tabular data and metadata about that tabular data. It also contains some non-normative information about a best practice syntax for tabular data, for mapping into that data model, to contribute to the standardisation of CSV syntax by IETF. Various methods of locating metadata are also provided.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the first of those Recommendations (see section 3. Locating Metadata), though it also specifies an underlying model and therefore starting point for the other chartered Recommendations.
This document is based on IETF's [RFC4180] which is an Informational RFC. The working group's expectation is that future suggestions to refine RFC 4180 will be relayed to the IETF (e.g. around I18N and multi-part packaging) and contribute to its discussions about moving CSV to the Standards track.
This document was published by the CSV on the Web Working Group as a First Public Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to firstname.lastname@example.org (subscribe, archives). All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of fields (although some of these fields may be empty), which provide values of properties of the thing described by the row. In tabular data, fields within the same column provide values for the same property of the thing described by the particular row. This is what differentiates tabular data from other line-oriented formats.
Tabular data is routinely transferred on the web in a textual format called "CSV", but the definition of "CSV" in practice is very loose. Some people use the term to mean any delimited text file. Others stick more closely to the most standard definition of CSV that there is, [RFC4180]. Appendix A describes the various ways in which CSV is defined.
There are different levels of data models for tabular data:
Data is held in a table.
Each table has one or more columns. The order of the columns is significant and must be preserved by applications.
Each table has one or more rows. The order of the rows is significant and must be preserved by applications. Each row contains a field for each column in the table. Some of these fields may be null fields.
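As a rough illustration of the core model described above, a table could be held in memory along the following lines. The names Table, column_count, rows and add_row are assumptions made for this sketch and are not defined by this document; None stands in for a null field.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Table:
    """Core tabular data model: a fixed number of columns and ordered rows."""
    column_count: int
    rows: List[List[Optional[str]]] = field(default_factory=list)

    def add_row(self, fields: List[Optional[str]]) -> None:
        # Each row contains a field for each column; None marks a null field.
        if len(fields) != self.column_count:
            raise ValueError("a row must contain one field per column")
        # Appending preserves row order, which the model treats as significant.
        self.rows.append(list(fields))


table = Table(column_count=2)
table.add_row(["x", "1"])
table.add_row(["y", None])  # the second field is a null field
```

Lists are used for both rows and fields because the model requires row and column order to be preserved.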
An annotated table is a table that is annotated with additional metadata. The table has a number of properties which provide additional information about the table as a whole. The values of these properties may be lists, structured objects, or atomic values. Annotations on a table may include:
The columns within an annotated table are all annotated columns. The annotations on a column provide information about how to interpret the fields in the column.
The rows within an annotated table are all annotated rows. The annotations on a row provide additional metadata about the information held in the row, such as:
The fields within an annotated row are all annotated fields. The annotations on a field provide metadata about the value held in the field, particularly when this overrides the information provided for the annotated column and annotated row that the field falls within. Annotations on a field might be:
It might be useful to define annotated regions as follows:
An annotated table may also contain a number of annotated regions. Regions are themselves tabular structures comprised of selected rows and columns and the fields within those rows for those columns. Annotated regions are regions that have annotations associated with them. Annotated columns and annotated rows are special types of annotated regions where the region is the entirety of a single column or single row.
But it's not currently clear that there are use cases or examples that justify it. Input is welcome on this.
A group of tables comprises a set of tables (which may be annotated tables) and a set of annotations (properties and values) that relate to the set.
Tables can be loosely related to each other simply through annotations; not all tables that are related to each other need to be grouped together. Groups of tables are useful because they can be annotated with metadata that applies to all the tables in the group.
As described in section 2.2 Annotated Data Model, tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate those annotations given a link to a CSV file.
In most methods of locating metadata described here, metadata is provided within a separate document. The syntax of this document is not defined here, but these documents can include things like:
When creating a set of annotations from metadata, if the same property is specified in two locations then information closer to the document itself should override information further from the document. Explicitly, the order of preference is:
Link header when retrieving the tabular data file, see section 3.4 Link Header
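The override rule could be applied along these lines once the metadata from each location has been read. This is a minimal sketch: the function name merge_annotations and the property names title and language are hypothetical, and the caller is assumed to pass the sources already sorted into the order of preference given above, highest preference first.

```python
from typing import Any, Dict, List


def merge_annotations(sources: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge metadata; `sources` is ordered from highest to lowest preference."""
    merged: Dict[str, Any] = {}
    for source in sources:
        for name, value in source.items():
            # Keep the first value seen, since it comes from the source
            # closest to the document; later sources only fill gaps.
            merged.setdefault(name, value)
    return merged


closer = {"title": "From a closer source"}
further = {"title": "From a further source", "language": "en"}
annotations = merge_annotations([closer, further])
```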
Metadata may be directly embedded within a CSV file.
What should the syntax be for embedding metadata within a CSV file? An example approach is shown in Linked CSV. If this is allowed we should specify a new media type for the syntax.
A link to the metadata to be used with a CSV may be indicated within the CSV file itself.
How should a link be embedded? Perhaps something like:
as the first line? If this is allowed we should specify a new media type for the syntax.
Rather than providing CSV files directly on the web, they can be packaged up with a metadata file that includes annotations, and any other relevant CSV files.
What should that package look like? Just a zip? A multipart document? How is the metadata file within it identified? If this is allowed we should specify a new media type for the package.
When retrieving a CSV file via HTTP, the response can include a Link header with rel=describedby that points to a metadata file that describes the CSV file.
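A client might extract the metadata location from such a header roughly as follows. This is a simplified sketch rather than a complete Link header parser: the function name find_describedby is hypothetical, and the regular expressions assume that link targets and parameter values do not themselves contain angle brackets.

```python
import re
from typing import Optional


def find_describedby(link_header: str) -> Optional[str]:
    """Return the target of the first link whose rel includes describedby."""
    # Each link has the form: <target>; param=value; param="value"
    for match in re.finditer(r'<([^>]*)>([^<]*)', link_header):
        target, params = match.group(1), match.group(2)
        rel = re.search(r';\s*rel\s*=\s*"?([^";,]*)"?', params, re.IGNORECASE)
        # A rel value may list several space-separated relation types.
        if rel and "describedby" in rel.group(1).lower().split():
            return target
    return None
```

The returned target may be a relative reference, in which case it would still need to be resolved against the URL of the CSV file.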
Is rel=describedby the right link relation to use?
When retrieving a CSV file via HTTP, the default location for a metadata file that describes that CSV file is csv-metadata in the same directory. If this metadata file does not explicitly point to the relevant CSV file then it MUST be ignored.
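The default location can be computed by resolving csv-metadata as a relative reference against the URL of the CSV file. The sketch below assumes that "same directory" means ordinary relative URL resolution; the function name default_metadata_url and the example URL are hypothetical.

```python
from urllib.parse import urljoin


def default_metadata_url(csv_url: str) -> str:
    """Resolve csv-metadata relative to the directory of the CSV file."""
    return urljoin(csv_url, "csv-metadata")


metadata_url = default_metadata_url("http://example.org/data/countries.csv")
```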
A suffix is deliberately not included here, because we haven't decided what format metadata documents should be in, or even whether they should be content-negotiated. It is probably best for the file to be in the same directory as the CSV file, so that one metadata document can describe many CSV files. Should there be a default navigation rule of continuing up the path hierarchy until a metadata document is found?
This section is non-normative.
There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for outputting tabular data adhering to the core tabular data model described in section 2.1 Core Data Model into a standard, CSV-based, syntax. Compliant applications that output this format must meet each of the constraints.
We are actively working with the IETF to develop a standard for CSV, which is outside the scope of the Working Group. The details here aim to help shape that standard based on our requirements.
This section does not seek to describe how applications that input textual tabular data should interpret it, except that any data that is in the format defined here should be understood as defined here.
This syntax is not compliant with text/csv as defined in [RFC4180]: it permits characters other than ASCII, and it permits line endings other than CRLF. Supporting the full set of Unicode characters by using UTF-8 and supporting LF line endings are important characteristics for data formats that are used internationally and on non-Windows platforms. However, all files that adhere to [RFC4180]'s definition of CSV are compliant CSV+ files.
The appropriate content type for a CSV+ file is text/csv. For example, when a CSV+ file is transmitted via HTTP, the HTTP response MUST include a Content-Type header with the value text/csv.
See below for issues relating to whether we should instead define a different content type.
CSV+ files SHOULD be encoded using UTF-8. If a CSV+ file is not encoded using UTF-8, the encoding MUST be specified through the charset parameter in the Content-Type header.
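A consumer could pick the encoding to decode with from the Content-Type header as sketched below. The function name charset_of is hypothetical, and falling back to UTF-8 when no charset parameter is present is an assumption based on the SHOULD above rather than a rule stated by this document.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "utf-8") -> str:
    """Return the lower-cased charset parameter, or `default` if absent."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset(default)


declared = charset_of("text/csv;charset=ISO-8859-1")
fallback = charset_of("text/csv")
```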
RFC4180 defines the default charset as US-ASCII because that was (at the time RFC4180 was written) the default charset for all text/* media types. This has been superseded by RFC6657.
Section 3 of RFC6657 states "new subtypes of the "text" media type SHOULD NOT define a default "charset" value. If there is a strong reason to do so despite this advice, they SHOULD use the "UTF-8" [RFC3629] charset as the default."
Do we have a strong reason to specify a default charset? Should we be defining application/csv instead, to avoid doing unrecommended things with a text/* media type?
The ends of rows in a CSV+ file MUST be either CRLF (U+000D U+000A) or LF (U+000A). Line endings within escaped fields are not normalised.
Section 4.1.1 of RFC2046 specifies that "The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden."
Should we be defining application/csv instead, to prevent having to adhere to this rule, or should we stick to the CRLF rule?
Each line of a CSV+ file MUST contain the same number of comma-separated values.
Values that contain commas, line endings or double quotes MUST be escaped by having the entire value wrapped in double quotes. There MUST NOT be whitespace before or after the double quotes. Within these escaped fields, any double quotes MUST be escaped with two double quotes ("").
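The row and escaping constraints above could be applied by a writer roughly as follows. This is a sketch under stated assumptions: the function names escape_value and format_rows are hypothetical, and the writer only escapes values that need it, which the constraints permit but do not require.

```python
from typing import Iterable, List


def escape_value(value: str) -> str:
    """Wrap a value in double quotes if it contains a comma, line ending or double quote."""
    if any(ch in value for ch in ',"\r\n'):
        # Double quotes inside an escaped value are written as two double quotes.
        return '"' + value.replace('"', '""') + '"'
    return value


def format_rows(rows: Iterable[List[str]], line_ending: str = "\r\n") -> str:
    """Serialise rows, checking that every row has the same number of values."""
    rows = [list(row) for row in rows]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("every line must contain the same number of values")
    if line_ending not in ("\r\n", "\n"):
        raise ValueError("rows must end with CRLF or LF")
    return "".join(
        ",".join(escape_value(value) for value in row) + line_ending
        for row in rows
    )


output = format_rows([["name", "quote"], ["a,b", 'say "hi"']], line_ending="\n")
```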
The first line of a CSV+ file SHOULD contain a comma-separated list of names of columns. This is known as the header line and provides names for the columns. There are no constraints on these names.
If a CSV+ file does not include a header line, this MUST be specified using the header parameter of the media type.
This grammar is a generalization of that defined in [RFC4180] and is included for reference only.
The EBNF used here is defined in XML 1.0 [EBNF-NOTATION].
We should probably place further restrictions on QCHAR and SCHAR to avoid control characters. If header weren’t optional, it would be better defined as in RFC4180, but if the syntax allows it to be optional, this would make it not an LL(1) grammar, which isn’t too much of an issue.
This section is non-normative.
As described in section 4. CSV+ Syntax, there may be many formats which an application might interpret into the tabular data model described in section 2. Tabular Data Model, including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.
Standardising the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardisation of the format.
This section describes an algorithm for parsing formats other than the plain CSV+ format specified in section 4. CSV+ Syntax. It is impossible to do this in a fully automated manner, so this algorithm depends on the following flags being set externally (eg through user input):
When parsing, should we:
The algorithm for parsing a document containing tabular data is as follows:
Split the header and data rows into fields using the delimiter. Values that are enclosed within the enclosure character may contain the delimiter. The enclosure character may be escaped using the escape character where it appears within fields.
If trim is start then whitespace from the start of values that are not enclosed must be removed from the value. If trim is end then whitespace from the end of values that are not enclosed must be removed from the value.
true then ignore any rows in which all the field values are empty strings.
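The field-splitting, trimming and blank-row steps above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm itself: the names split_fields, parse_rows and skip_blank_rows are hypothetical, a trim value of "true" is assumed to mean trimming at both ends, and the input is assumed to be already split into rows.

```python
from typing import Iterable, List


def split_fields(line: str, delimiter: str = ",", enclosure: str = '"',
                 escape: str = '"', trim: str = "false") -> List[str]:
    """Split one row into values, honouring the enclosure and escape characters."""
    fields, buf, in_enclosure, enclosed = [], [], False, False
    i = 0
    while i < len(line):
        ch = line[i]
        nxt = line[i + 1] if i + 1 < len(line) else ""
        if in_enclosure and ch == escape and nxt == enclosure:
            buf.append(enclosure)  # escaped enclosure character inside a value
            i += 1
        elif in_enclosure and ch == enclosure:
            in_enclosure = False   # end of the enclosed part
        elif in_enclosure:
            buf.append(ch)         # delimiters are kept literally here
        elif ch == enclosure:
            in_enclosure, enclosed = True, True
        elif ch == delimiter:
            fields.append(("".join(buf), enclosed))
            buf, enclosed = [], False
        else:
            buf.append(ch)
        i += 1
    fields.append(("".join(buf), enclosed))

    values = []
    for text, was_enclosed in fields:
        # Trimming only applies to values that are not enclosed.
        if not was_enclosed and trim in ("true", "start"):
            text = text.lstrip()
        if not was_enclosed and trim in ("true", "end"):
            text = text.rstrip()
        values.append(text)
    return values


def parse_rows(lines: Iterable[str], skip_blank_rows: bool = False,
               **options: str) -> List[List[str]]:
    """Parse rows, optionally ignoring rows whose values are all empty strings."""
    rows = [split_fields(line, **options) for line in lines]
    if skip_blank_rows:
        rows = [row for row in rows if any(value != "" for value in row)]
    return rows
```

For example, split_fields('a, b ,"c,d"', trim="true") yields the three values a, b and c,d, because the comma inside the enclosed value is not treated as a delimiter.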
This appendix outlines various ways in which CSV is defined.
[RFC4180] defines CSV with the following ABNF grammar:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE = %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
Of particular note here are:
TEXTDATA indicates that only non-control ASCII characters are permitted within a CSV file. This restriction is routinely ignored in practice, and is impractical on the international web.
Line endings must be CRLF. This makes it harder to produce CSV files on Unix-based systems where the usual line ending is LF.
The header parameter on the media type indicates whether the header is present or not.
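To see how restrictive the TEXTDATA production is, the character ranges from the grammar above can be turned into a regular expression. The names TEXTDATA and is_textdata in this sketch are hypothetical helpers, not part of [RFC4180].

```python
import re

# TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
# i.e. printable ASCII except the double quote (%x22) and the comma (%x2C).
TEXTDATA = re.compile(r'[\x20-\x21\x23-\x2B\x2D-\x7E]*')


def is_textdata(value: str) -> bool:
    """True if every character of `value` is allowed in a non-escaped field."""
    return TEXTDATA.fullmatch(value) is not None
```

Under this check a value such as "plain text" is accepted, while any value containing a non-ASCII character such as é is rejected.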
Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.
The following describes the behaviour of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behaviour of Excel in other situations.
Excel generates CSV files encoded using Windows-1252 with LF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those fields that need escaping (e.g. because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes.
Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.
When opening CSV files, Excel interprets CSV files saved in UTF-8 as being encoded as Windows-1252 (whether or not a BOM is present). It correctly deals with double quoted fields, except that it converts line breaks within fields into spaces. It understands CRLF as a line break. It detects dates (formatted as YYYY-MM-DD) and formats them in the default date formatting for files.
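The effect of reading UTF-8 bytes as Windows-1252 can be reproduced directly. The sketch below only illustrates the general encoding mismatch; it has not been checked against Excel itself, and the sample word is arbitrary.

```python
original = "café"

# Bytes as they would be stored in a UTF-8 encoded CSV file.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Windows-1252 turns the two-byte sequence for "é"
# into two separate characters.
misread = utf8_bytes.decode("cp1252")  # "cafÃ©"
```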
Excel provides more control when importing CSV files into Excel. However, it does not properly understand UTF-8 (with or without BOM). It does however properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.
A particular quirk in the importing of CSV is that if a field contains a line break, the final double quote that escapes the field will be included within it.
When tabular data is copied from Excel, it is copied in a tab-delimited format, with LF line breaks.
Downloaded CSV files are encoded in UTF-8, without a BOM, and with LF line endings. Dates and numbers are formatted as they appear within the spreadsheet.
CSV files can be imported as UTF-8 (with or without BOM). CRLF line endings are correctly recognised. Dates are reformatted to the default date format on load.
The Simple Data Format places the following restrictions on CSV files:
As a starting point, CSV files included in a Simple Data Format package must conform to the RFC for CSV (4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:
- File names MUST end with
- Files MUST be encoded as UTF-8
- Files MUST have a single header row. This row MUST be the first row in the file.
- Terminology: each column in the CSV file is termed a field and its name is the string in that column in the header row.
- The name MUST be unique amongst fields and MUST contain at least one character
- There are no further restrictions on the form of the name but it is RECOMMENDED that it contain only alphanumeric characters together with “ .-_”
- Rows in the file MUST NOT contain more fields than are in the header row (though they may contain less)
- Each file MUST have an entry in the resources array in the
- The resource metadata MUST include a schema attribute whose value MUST conform to the JSON Table Schema
- All fields in the CSV files MUST be described in the
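Some of the restrictions quoted above can be checked mechanically. The following sketch covers only the header and row-length rules; the function name check_sdf_csv and the wording of the messages are hypothetical, and the package-level rules (resources, schema) are out of its scope.

```python
import csv
import io
from typing import List


def check_sdf_csv(text: str) -> List[str]:
    """Return a list of problems found in the header and row lengths."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file has no header row"]
    problems = []
    header = rows[0]
    if any(name == "" for name in header):
        problems.append("field names must contain at least one character")
    if len(set(header)) != len(header):
        problems.append("field names must be unique")
    # Row numbers are 1-based and count the header row as row 1.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) > len(header):
            problems.append(f"row {number} has more fields than the header row")
    return problems
```

Rows with fewer fields than the header row are deliberately not reported, since the quoted rules allow them.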
CSV files generated by different applications often vary in their syntax, e.g. use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a Simple Data Format SHOULD
- Use “,” as field delimiters
- Use “\r\n” or “\n” as line terminators
If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the
- Include a dialect key that conforms to that described in the CSV Dialect Description Format
Applications processing the CSV file SHOULD use the dialect of the CSV file to guide parsing.
More details of behaviour of other tools should go here. This should include the most popular CSV parsing/generating libraries in common programming languages. Test files which include non-ASCII characters, double quotes and line breaks within fields are: