Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
Tabular data is routinely transferred on the web in a variety of formats, including variants on CSV, tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps. This document outlines a data model, or infoset, for tabular data and metadata about that tabular data that can be used as a basis for validation, display, or creating other formats. It also contains some non-normative guidance for publishing tabular data as CSV and how that maps into the tabular data model.
An annotated model of tabular data can be supplemented by separate metadata about the table. This specification defines how implementations should locate that metadata, given a file containing tabular data. The standard syntax for that metadata is defined in [ tabular-metadata ]. Note, however, that applications may have other means to create annotated tables, e.g., through some application specific API-s; this model does not depend on the specificities described in [ tabular-metadata ].
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the first of those Recommendations (see section 5. Locating Metadata ), though it also specifies an underlying model for tabular data and is therefore a basis for the other chartered Recommendations.
The definition of CSV used in this document is based on IETF's [ RFC4180 ], which is an Informational RFC. The working group's expectation is that future suggestions to refine RFC 4180 will be relayed to the IETF (e.g. around encoding and line endings) and contribute to its discussions about moving CSV to the Standards track.
Many files containing tabular data embed metadata, for example in lines before the header row of an otherwise standard CSV document. This specification does not define any formats for embedding metadata within CSV files, aside from the titles of columns in the header row which is defined in CSV. We would encourage groups that define tabular data formats to also define a mapping into the annotated tabular data model defined in this document.
This document was published by the CSV on the Web Working Group as a Candidate Recommendation. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). W3C publishes a Candidate Recommendation to indicate that the document is believed to be stable and to encourage implementation by the developer community. This Candidate Recommendation is expected to advance to Proposed Recommendation no earlier than 30 October 2015. All comments are welcome.
Please see the Working Group's implementation report .
Publication as a Candidate Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 1 August 2014 W3C Process Document .
Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of cells (although some of these cells may be empty), which provide values of properties of the thing described by the row. In tabular data, cells within the same column provide values for the same property of the things described by each row. This is what differentiates tabular data from other line-oriented formats.
Tabular data is routinely transferred on the web in a textual format called CSV, but the definition of CSV in practice is very loose. Some people use the term to mean any delimited text file. Others stick more closely to the most standard definition of CSV that there is, [ RFC4180 ]. Appendix A describes the various ways in which CSV is defined. This specification refers to such files, as well as tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps as tabular data files .
In section 4. Tabular Data Models , this document defines a model for tabular data that abstracts away from the varying syntaxes that are used when exchanging tabular data. The model includes annotations, or metadata, about collections of tables, individual tables, rows, columns, and cells. These annotations are typically supplied through separate metadata files; section 5. Locating Metadata defines how these metadata files can be located, while [ tabular-metadata ] defines what they contain.
Once an annotated table has been created, it can be processed in various ways, such as display, validation, or conversion into other formats. This processing is described in section 6. Processing Tables .
This specification does not normatively define a format for exchanging tabular data. However, it does provide some best practice guidelines for publishing tabular data as CSV, in section 7. Best Practice CSV , and for parsing both this syntax and those similar to it, in section 8. Parsing Tabular Data .
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY , MUST , MUST NOT , SHOULD , and SHOULD NOT are to be interpreted as described in [ RFC2119 ].
This specification makes use of the compact IRI Syntax ; please refer to the Compact IRIs from [ JSON-LD ].
Within examples, this specification makes use of the following namespaces:

csvw : http://www.w3.org/ns/csvw#
dc : http://purl.org/dc/terms/
rdf : http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs : http://www.w3.org/2000/01/rdf-schema#
schema : http://schema.org/
xsd : http://www.w3.org/2001/XMLSchema#
The following typographic conventions are used in this specification:
markup — markup (elements, attributes, properties), machine-processable values, and file names.
markup definition reference — a reference to a definition of markup within this document.
markup external definition reference — a reference to a definition of markup in an external specification.
Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are normative or informative depending on whether they are in a normative or informative section, respectively.
Examples are in light khaki boxes, with khaki left border, and with a numbered "Example" header in khaki. Examples are always informative. The content of the example is in monospace font and may be syntax colored.
This section defines an annotated tabular data model : a model for tables that are annotated with metadata. Annotations provide information about the cells, rows, columns, tables, and groups of tables with which they are associated. The values of these annotations may be lists, structured objects, or atomic values. Core annotations are those that affect the behavior of processors defined in this specification, but other annotations may also be present on any of the components of the model.
Annotations may be described directly in [ tabular-metadata ], be embedded in a tabular data file , or created during the process of generating an annotated table .
String values within the tabular data model (such as column titles or cell string values) MUST contain only Unicode characters.
In this document, the term annotation refers to any metadata associated with an object in the annotated tabular data model. These are not necessarily web annotations in the sense of [ annotation-model ].
A group of tables comprises a set of annotated tables and a set of annotations that relate to that group of tables. The core annotations of a group of tables are:
id — an identifier for this group of tables, or null if this is undefined.
Groups of tables MAY in addition have any number of annotations which provide information about the group of tables. Annotations on a group of tables may include:
When originating from [ tabular-metadata ], these annotations arise from common properties defined on table group descriptions within metadata documents.
An annotated table is a table that is annotated with additional metadata. The core annotations of a table are:
… the value of this annotation when the textDirection property is set to inherit (the default).
…, or null if this is undefined.
…, or null .
…, or null if this is undefined.
The table MAY in addition have any number of other annotations . Annotations on a table may include:
When originating from [ tabular-metadata ], these annotations arise from common properties defined on table descriptions within metadata documents.
A column represents a vertical arrangement of cells within a table . The core annotations of a column are:
…, or null .
…, or null .
… the value of this annotation when the textDirection property is set to inherit (the default).
virtual — a boolean value indicating whether the column is a virtual column; if false , the column is a real column , which exists in the source data for the table.
Several of these annotations arise from inherited properties that may be defined within metadata on table group , table or individual column descriptions.
Columns MAY in addition have any number of other annotations , such as a description. When originating from [ tabular-metadata ], these annotations arise from common properties defined on column descriptions within metadata documents.
A row represents a horizontal arrangement of cells within a table . The core annotations of a row are:
…, or null .
Rows MAY have any number of additional annotations . The annotations on a row provide additional metadata about the information held in the row, such as:
Neither this specification nor [ tabular-metadata ] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to rows by interpreting notes on the table.
A cell represents a cell at the intersection of a row and a column within a table . The core annotations of a cell are:
…, or null .
…, or null .
… the value of this annotation when the textDirection property is set to inherit (the default).
…, or null .
For example, annotations might enable a processor to understand the string value of the cell as representing a number or a date. By default, if the string value is an empty string, the semantic value of the cell is null .

…, or null .
The presence or absence of quotes around a value within a CSV file is a syntactic detail that is not reflected in the tabular data model. In other words, there is no distinction in the model between the second value in a,,z and the second value in a,"",z .
Several of these annotations arise from or are constructed based on inherited properties that may be defined within metadata on table group , table or column descriptions.
Cells MAY have any number of additional annotations . The annotations on a cell provide metadata about the value held in the cell, particularly when this overrides the information provided for the column and row that the cell falls within. Annotations on a cell might be:
Neither this specification nor [ tabular-metadata ] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to cells by interpreting notes on the table.
Columns and cell values within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell. Datatypes are based on a subset of those defined in [ xmlschema11-2 ], with aliases.
The annotated tabular data model limits cell values to have datatypes as shown on the diagram:

anyAtomicType — xsd:anyAtomicType .
xml — rdf:XMLLiteral , a sub-type of xsd:string , which indicates the value is an XML fragment.
html — rdf:HTML , a sub-type of xsd:string , which indicates the value is an HTML fragment.
json — csvw:JSON , a sub-type of xsd:string , which indicates the value is serialized JSON.
The core annotations of a datatype are:
id — …, or null if undefined; converters SHOULD use this URL when serializing values of this datatype. Processors MAY use this annotation to perform additional validation on column values using this datatype.
base — …, or null if the datatype is xsd:anyAtomicType . All values of the datatype MUST be valid values of the base datatype.

If the id of a datatype is that of a built-in datatype, the values of the other core annotations listed above MUST be consistent with the values defined in [ xmlschema11-2 ] or above. For example, if the id is xsd:integer then the base must be xsd:decimal .
Datatypes MAY have any number of additional annotations . The annotations on a datatype provide metadata about the datatype such as title or description. These arise from common properties defined on datatype descriptions within metadata documents, as defined in [ tabular-metadata ].
The id annotation may reference an XSD, OWL or other datatype definition, which is not used by this specification for validating column values, but may be useful for further processing.
The length , minimum length , and maximum length annotations indicate the exact, minimum, and maximum lengths for cell string values .
The length of a value is determined as defined in [ xmlschema11-2 ], namely as follows:
if the value is null , its length is 0;
if the value is a string or one of its subtypes, its length is the number of characters in the value;
if the value is of a binary type, its length is the number of bytes in the binary value;
if the value is a list , the constraint applies to each element of the list.
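As a non-normative illustration, the length determination and the list rule above might be sketched as follows; the helper names are hypothetical and not part of this specification:

```python
def value_length(value):
    """Length of a single value: 0 for null, otherwise its element count."""
    if value is None:
        return 0                # a null value has length 0
    return len(value)           # characters for strings, bytes for binary values

def check_length(value, exact=None, minimum=None, maximum=None):
    """Apply length / minimum length / maximum length annotations.
    For a list value, the constraint applies to each element."""
    values = value if isinstance(value, list) else [value]
    for v in values:
        n = value_length(v)
        if exact is not None and n != exact:
            return False
        if minimum is not None and n < minimum:
            return False
        if maximum is not None and n > maximum:
            return False
    return True
```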
The minimum , maximum , minimum exclusive , and maximum exclusive annotations indicate limits on cell values . These apply to numeric, date/time, and duration types. Validation of cell values against these datatypes is as defined in [ xmlschema11-2 ]. If the value is a list , the constraint applies to each element of the list.
As described in section 4. Tabular Data Models , tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate metadata that provides those annotations.
In most of the methods of locating metadata described here, metadata is provided within a single document. The syntax of such documents is defined in [ tabular-metadata ]. Metadata is located using a specific order of precedence:

1. metadata supplied by the user of the implementation that is processing the tabular data, see section 5.1 Overriding Metadata .
2. metadata in a document linked to using a Link header associated with the tabular data file , see section 5.2 Link Header .
3. metadata located through a site-wide location configuration, see section 5.3 Site-wide Location Configuration .
4. metadata embedded within the tabular data file itself, see section 5.4 Embedded Metadata .
Processors MUST use the first metadata found for processing a tabular data file by using overriding metadata, if provided. Otherwise processors MUST attempt to locate the first metadata document from the Link header or the metadata located through site-wide configuration. If no metadata is supplied or found, processors MUST use embedded metadata . If the metadata does not originate from the embedded metadata, validators MUST verify that the table group description within that metadata is compatible with that in the embedded metadata , as defined in [ tabular-metadata ].
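The precedence rules above can be sketched non-normatively as follows; the lookup functions are hypothetical stubs standing in for the mechanisms defined in the following sections:

```python
def from_link_header(url):      # stub: would inspect the HTTP Link header (5.2)
    return None

def from_site_wide(url):        # stub: would consult /.well-known/csvm (5.3)
    return None

def embedded_metadata(url):     # stub: would parse the tabular data file itself (5.4)
    return {"@type": "Table", "url": url}

def locate_metadata(url, overriding=None):
    """Return the first metadata found, in order of precedence."""
    if overriding is not None:              # 1. overriding (user-supplied) metadata
        return overriding
    for lookup in (from_link_header,        # 2. Link header
                   from_site_wide):         # 3. site-wide location configuration
        found = lookup(url)
        if found is not None:
            return found
    return embedded_metadata(url)           # 4. embedded metadata as a fallback
```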
Processors SHOULD provide users with the facility to provide their own metadata for tabular data files that they process. This might be provided:
For example, a processor might be invoked with:
$ csvlint data.csv --datatypes:string,float,string,string
to enable the testing of the types of values in the columns of a CSV file, or with:
$ csvlint data.csv --schema:schema.json
to supply a schema that describes the contents of the file, against which it can be validated.
Metadata supplied in this way is called overriding, or user-supplied, metadata. Implementations SHOULD define how any options they define are mapped into the vocabulary defined in [ tabular-metadata ]. If the user selects existing metadata files, implementations MUST NOT use metadata located through the Link header (as described in section 5.2 Link Header ) or site-wide location configuration (as described in section 5.3 Site-wide Location Configuration ).
Users should ensure that any metadata from those locations that they wish to use is explicitly incorporated into the metadata file that they use to process tabular data. Processors may provide facilities to make this easier by automatically merging metadata files from different locations, but this specification does not define how such merging is carried out.
If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata , then when retrieving a tabular data file via HTTP, processors MUST retrieve the metadata file referenced by any Link header with rel="describedby" and type="application/csvm+json" , type="application/ld+json" or type="application/json" , so long as this referenced metadata file describes the retrieved tabular data file (ie, contains a table description whose url matches the request URL). If there is more than one valid metadata file linked to through multiple Link headers, then implementations MUST use the metadata file referenced by the last Link header.
For example, when the response to requesting a tab-separated file looks like:
HTTP/1.1 200 OK
Content-Type: text/tab-separated-values
...
Link: <metadata.json>; rel="describedby"; type="application/csvm+json"
an implementation must use the referenced metadata.json to supply metadata for processing the file.
If the metadata file found at this location does not explicitly include a reference to the requested tabular data file then it MUST be ignored. URLs MUST be normalized as described in section 6.3 URL Normalization .
The Link header of the metadata file MAY include references to the CSV files it describes, using the describes relationship. For example, in the countries' metadata example , the server might return the following headers:
Link: <http://example.org/countries.csv>; rel="describes"; type="text/csv" Link: <http://example.org/country_slice.csv>; rel="describes"; type="text/csv"
However, locating the metadata SHOULD NOT depend on this mechanism.
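A non-normative sketch of selecting the metadata file from Link headers, using a deliberately minimal header parser; the function name and parsing shortcuts are illustrative only:

```python
import re

# Media types acceptable for a metadata file referenced from a Link header.
ACCEPTED_TYPES = ("application/csvm+json", "application/ld+json", "application/json")

def metadata_link(link_headers):
    """Return the target of the last acceptable Link header, or None."""
    chosen = None
    for header in link_headers:
        match = re.match(r'\s*<([^>]*)>(.*)$', header)
        if not match:
            continue
        target, params = match.group(1), match.group(2).lower()
        if 'rel="describedby"' in params and any(t in params for t in ACCEPTED_TYPES):
            chosen = target    # the last matching Link header wins
    return chosen
```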
If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata , and no applicable metadata file has been discovered through a Link header, described in section 5.2 Link Header , processors MUST attempt to locate metadata documents through site-wide configuration. In this case, processors MUST retrieve the file from the well-known URI /.well-known/csvm . (Well-known URIs are defined by [ RFC5785 ].) If no such file is located (i.e. the response results in a client error 4xx status code or a server error 5xx status code), processors MUST proceed as if this file were found with the content:
{+url}-metadata.json
csv-metadata.json
The response to retrieving /.well-known/csvm MAY be cached, subject to cache control directives. This includes caching an unsuccessful response such as a 404 Not Found .

This file MUST contain a URI template, as defined by [ URI-TEMPLATE ], on each line. Starting with the first such URI template, processors MUST:

1. Expand the URI template, with the variable url being set to the URL of the requested tabular data file (with any fragment component of that URL removed).
2. Resolve the resulting URL against the URL of the requested tabular data file.
3. Attempt to retrieve a metadata document at that URL.
4. If no metadata document is found at that location, or if the metadata found at that location does not explicitly include a reference to the relevant tabular data file, continue with the next URI template in the file.
For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors must attempt to locate the well-known file at http://example.org/.well-known/csvm . If that file contains:

{+url}.json
csvm.json
/csvm?file={url}
the processor will first look for http://example.org/south-west/devon.csv.json . If there is no metadata file in that location, it will then look for http://example.org/south-west/csvm.json . Finally, if that also fails, it will look for http://example.org/csvm?file=http://example.org/south-west/devon.csv . If no file were found at http://example.org/.well-known/csvm , the processor will use the default locations and try to retrieve metadata from http://example.org/south-west/devon.csv-metadata.json and, if unsuccessful, http://example.org/south-west/csv-metadata.json .
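The template expansion in this example can be sketched non-normatively as follows; only the two variables used by this mechanism are handled, and full [ URI-TEMPLATE ] processing is assumed to happen elsewhere:

```python
from urllib.parse import quote, urljoin

# The default templates used when /.well-known/csvm is not found.
DEFAULT_TEMPLATES = ["{+url}-metadata.json", "csv-metadata.json"]

def candidate_locations(tabular_url, templates=DEFAULT_TEMPLATES):
    """Yield candidate metadata URLs to try, in order, for a tabular data file."""
    for template in templates:
        expanded = (template
                    .replace("{+url}", tabular_url)                  # {+url}: no escaping
                    .replace("{url}", quote(tabular_url, safe="")))  # {url}: percent-escaped
        # resolve the expansion against the tabular data file's URL
        yield urljoin(tabular_url, expanded)
```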
The use of a well-known location for defining URI patterns used to locate metadata files is at risk. The Working Group solicits feedback on whether this mechanism is useful and whether it represents major implementation difficulties.
Most syntaxes for tabular data provide a facility for embedding metadata within the tabular data file itself. The definition of a syntax for tabular data SHOULD include a description of how the syntax maps to an annotated data model, and in particular how any embedded metadata is mapped into the vocabulary defined in [ tabular-metadata ]. Parsing based on the default dialect for CSV, as described in 8. Parsing Tabular Data , will extract column titles from the first row of a CSV file.

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
The results of this can be found in section 8.2.1 Simple Example . For another example, the following tab-delimited file contains embedded metadata where it is assumed that comments may be added using a # , and that the column types may be indicated using a #datatype annotation:

# publisher City of Palo Alto
# updated 12/31/2010
#name GID on_street species trim_cycle inventory_date
#datatype string string string string date:M/D/YYYY
GID On Street Species Trim Cycle Inventory Date
1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010
2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010
A processor that recognises this format may be able to extract and make sense of this embedded metadata.
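A non-normative sketch of how such a processor might extract this embedded metadata; the comment conventions handled here are those of the example above, not something this specification defines:

```python
def extract_embedded(lines):
    """Split '#'-prefixed metadata lines from tab-delimited data rows."""
    meta, rows = {}, []
    for line in lines:
        if line.startswith("#"):
            parts = line[1:].split(None, 1)   # key, then the rest of the line
            if len(parts) == 2:
                meta[parts[0]] = parts[1]
        else:
            rows.append(line.split("\t"))     # plain tab-delimited data row
    return meta, rows
```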
This section describes how particular types of applications should process tabular data and metadata files.
In many cases, an application will start processing from a metadata file. In that case, the initial metadata file is treated as overriding metadata and the application MUST NOT continue to retrieve other available metadata about each of the tabular data files referenced by that initial metadata file other than embedded metadata .
In other cases, applications will start from a tabular data file , such as a CSV file, and locate metadata from that file. This metadata will be used to process the file as if the processor were starting from that metadata file.
For example, if a validator is passed a locally authored metadata file spending.json , which contains:
{
  "tableSchema": "government-spending.csv",
  "tables": [{
    "url": "http://example.org/east-sussex-2015-03.csv"
  }, {
    "url": "http://example.org/east-sussex-2015-02.csv"
  }, ... ]
}
the validator would validate all the listed tables, using the locally defined schema at government-spending.csv . It would also use the metadata embedded in the referenced CSV files; for example, when processing http://example.org/east-sussex-2015-03.csv , it would use the embedded metadata within that file to verify that the CSV is compatible with the metadata.
If a validator is passed a tabular data file http://example.org/east-sussex-2015-03.csv , the validator would use the metadata located from the CSV file: the first metadata file found through the Link headers found when retrieving that file, or located through a site-wide location configuration.
Starting with a metadata file can remove the need to perform additional requests to locate linked metadata, or metadata retrieved through site-wide location configuration.
After locating metadata, metadata is normalized and coerced into a single table group description . When starting with a metadata file, this involves normalizing the provided metadata file and verifying that the embedded metadata for each tabular data file referenced from the metadata is compatible with the metadata. When starting with a tabular data file , this involves locating the first metadata file as described in section 5. Locating Metadata and normalizing it into a single descriptor.
If processing starts with a tabular data file , implementations:
1. Retrieve the tabular data file.
2. Retrieve the first metadata file ( FM ) as described in section 5. Locating Metadata :
   1. metadata supplied by the user of the implementation that is processing the tabular data, see section 5.1 Overriding Metadata .
   2. metadata in a document linked to using a Link header associated with the tabular data file , see section 5.2 Link Header .
   3. metadata located through a site-wide location configuration, see section 5.3 Site-wide Location Configuration .
   4. embedded metadata as defined in section 5.4 Embedded Metadata , with a single tables entry where the url property is set from that of the tabular data file .
3. Proceed as if the process starts with FM .
If the process starts with a metadata file:
1. Retrieve the metadata file yielding the metadata UM (which is treated as overriding metadata, see section 5.1 Overriding Metadata ).
2. Normalize UM using the process defined in Normalization in [ tabular-metadata ], coercing UM into a table group description , if necessary.
3. For each table ( TM ) in UM in order, create one or more annotated tables :
   1. Extract the dialect description ( DD ) from UM for the table associated with the tabular data file. If there is no such dialect description, extract the first available dialect description from a group of tables in which the tabular data file is described. Otherwise use the default dialect description.
   2. If using the default dialect description, override default values in DD based on HTTP headers found when retrieving the tabular data file:
      - If the media type from the Content-Type header is text/tab-separated-values , set delimiter to TAB in DD .
      - If the Content-Type header includes the header parameter with a value of absent , set header to false in DD .
      - If the Content-Type header includes the charset parameter, set encoding to this value in DD .
   3. Parse the tabular data file, using DD as a guide, to create a basic tabular data model ( T ) and extract embedded metadata ( EM ), for example from the header line .
This specification provides a non-normative definition for parsing CSV-based files, including the extraction of embedded metadata , in section 8. Parsing Tabular Data . This specification does not define any syntax for embedded metadata beyond this; whatever syntax is used, it's assumed that metadata can be mapped to the vocabulary defined in [ tabular-metadata ].
4. If a Content-Language HTTP header was found when retrieving the tabular data file, and the value provides a single language, set the lang inherited property to this value in TM , unless TM already has a lang inherited property.
5. Verify that TM is compatible with EM using the procedure defined in Table Compatibility in [ tabular-metadata ]; if TM is not compatible with EM validators MUST raise an error, other processors MUST generate a warning and continue processing.
6. Use TM to add annotations to the tabular data model T as described in Section 2 Annotating Tables in [ tabular-metadata ].
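The overall flow of these steps can be sketched non-normatively; every helper here is a stub standing in for the referenced algorithms in [ tabular-metadata ], and the names are illustrative only:

```python
def normalize(metadata):                 # stub: Normalization in [tabular-metadata]
    return metadata if "tables" in metadata else {"tables": [metadata]}

def parse_tabular(url, dialect):         # stub: would fetch and parse using DD
    return {"url": url, "rows": []}, {"url": url}

def compatible(tm, em):                  # stub: Table Compatibility check
    return tm["url"] == em["url"]

def annotate(table, tm):                 # stub: copy annotations from TM onto T
    return {**table, "annotations": tm}

def process(metadata_file):
    """Process a metadata file into a list of annotated tables."""
    um = normalize(metadata_file)        # UM, coerced to a table group description
    annotated = []
    for tm in um["tables"]:              # each table description TM
        dd = tm.get("dialect", {"delimiter": ","})   # dialect description DD
        t, em = parse_tabular(tm["url"], dd)         # model T and embedded metadata EM
        if not compatible(tm, em):
            raise ValueError("metadata incompatible with embedded metadata")
        annotated.append(annotate(t, tm))
    return annotated
```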
When processing a tabular data file using a metadata file, the metadata discovered as described in section 5. Locating Metadata will describe a table or group of tables , and no other metadata files will be retrieved. Thus processors MUST ensure that the metadata and tabular data file are compatible; this is typically done by extracting embedded metadata from the tabular data file and determining that the provided or discovered metadata is compatible with the embedded metadata using the procedure defined in Table Compatibility in [ tabular-metadata ].
Metadata Discovery and Compatibility involve comparing URLs. When comparing URLs, processors MUST use Syntax-Based Normalization as defined in [ RFC3986 ]. Processors MUST perform Scheme-Based Normalization for HTTP ( 80 ) and HTTPS ( 443 ) and SHOULD perform Scheme-Based Normalization for other well-known schemes.
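A non-normative sketch of this URL comparison; percent-encoding normalization and other [ RFC3986 ] details are omitted for brevity:

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports dropped under scheme-based normalization.
DEFAULT_PORTS = {"http": "80", "https": "443"}

def normalize_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                 # syntax-based: lowercase scheme
    netloc = parts.netloc.lower()                 # syntax-based: lowercase host
    host, _, port = netloc.partition(":")
    if port == DEFAULT_PORTS.get(scheme):
        netloc = host                             # scheme-based: drop default port
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, parts.fragment))

def same_url(a, b):
    """Compare two URLs after normalization."""
    return normalize_url(a) == normalize_url(b)
```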
Unlike many other data formats, tabular data is designed to be read by humans. For that reason, it's common for data to be represented within tabular data in a human-readable way. The datatype , default , lang , null , required , and separator annotations provide the information needed to parse the string value of a cell into its (semantic) value annotation. This is used:
The process of parsing a cell creates a cell with annotations based on the original string value, parsed value and other column annotations and adds the cell to the list of cells in a row and cells in a column :
After parsing, the cell value can be null , a single value with an associated optional language, or a list of such values.
The process of parsing the string value into a single value or a list of values is as follows:
string
,
json
,
xml
,
html
or
anyAtomicType
,
replace
all
carriage
return
(
#xD
),
line
feed
(
#xA
),
and
tab
(
#x9
)
characters
with
space
characters.
string
,
json
,
xml
,
html
,
anyAtomicType
,
or
normalizedString
,
strip
leading
and
trailing
whitespace
from
the
string
value
and
replace
all
instances
of
two
or
more
whitespace
characters
with
a
single
space
character.
If the column separator annotation is not null and the normalized string is an empty string, the cell value is an empty list. If the column required annotation is true, add an error to the list of errors for the cell.
If the column separator annotation is not null, the cell value is a list of values; set the list annotation on the cell to true, and create the cell value by:
null. Unless the datatype base is string or anyAtomicType, strip leading and trailing whitespace from these strings.
null. If the column separator annotation is null and the normalized string is an empty string, the cell value is null. If the column required annotation is true, add an error to the list of errors for the cell.
If the datatype base is string, or there is no datatype, the value has an associated language from the column lang annotation. If there are any errors, add them to the list of errors for the cell; in this case the value has a datatype of string; if the datatype base is string, or there is no datatype, the value has an associated language from the column lang annotation.
The final value (or values) become the value annotation on the cell .
If there is a about URL annotation on the column , it becomes the about URL annotation on the cell , after being transformed into an absolute URL as described in URI Template Properties of [ tabular-metadata ].
If there is a property URL annotation on the column , it becomes the property URL annotation on the cell , after being transformed into an absolute URL as described in URI Template Properties of [ tabular-metadata ].
If there is a value URL annotation on the column, it becomes the value URL annotation on the cell, after being transformed into an absolute URL as described in URI Template Properties of [ tabular-metadata ]. The value URL annotation is null if the cell value is null and the column virtual annotation is false.
This section is non-normative.
When no datatype annotation is available, the value of a cell is the same as its string value. For example, a cell with a string value of "99" would similarly have the (semantic) value "99".
If a datatype base is provided for the cell, that is used to create a (semantic) value for the cell. For example, if the metadata contains:
"datatype" : "integer"
for the cell with the string value "99" then the value of that cell will be the integer 99. A cell whose string value was not a valid integer (such as "one" or "1.0") would be assigned that string value as its (semantic) value annotation, but also have a validation error listed in its errors annotation.
Sometimes data uses special codes to indicate unknown or null values. For example, a particular column might contain a number that is expected to be between 1 and 10, with the string 99 used in the original tabular data file to indicate a null value. The metadata for such a column would include:
"datatype": { "base": "integer", "minimum": 1, "maximum": 10 }, "null" : "99"
In this case, a cell with a string value of "5" would have the (semantic) value of the integer 5; a cell with a string value of "99" would have the value null.
Similarly, a cell may be assigned a default value if the string value for the cell is empty. A configuration such as:
"datatype": { "base": "integer", "minimum": 1, "maximum": 10 }, "default" : "5"
In this case, a cell whose string value is "" would be assigned the value of the integer 5. A cell whose string value contains whitespace, such as a single tab character, would also be assigned the value of the integer 5: when the datatype is something other than string or anyAtomicType, leading and trailing whitespace is stripped from string values before the remainder of the processing is carried out.
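The interaction of the datatype, null, and default annotations described above can be sketched as follows. The helper name and its simplified signature are assumptions for illustration; the real algorithm also records validation errors rather than silently keeping the string.

```python
def parse_cell_value(string_value, datatype_base="string",
                     null_values=("",), default=""):
    value = string_value
    if datatype_base != "string":
        value = value.strip()              # whitespace stripped for non-string types
    if value == "":
        value = default                    # empty cells take the default value
    if value in null_values:
        return None                        # null codes map to a null value
    if datatype_base == "integer":
        try:
            return int(value)
        except ValueError:
            return value                   # invalid: value stays a string
    return value
```

With `null_values=("99",)` this reproduces the example above: `"5"` becomes the integer 5, `"99"` becomes null, and an empty or whitespace-only cell with `default="5"` becomes 5.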
Cells can contain sequences of values. For example, a cell might have the string value "1 5 7.0". In this case, the separator is a space character. The appropriate configuration would be:
"datatype": { "base": "integer", "minimum": 1, "maximum": 10 }, "default": "5", "separator" : " "
and this would mean that the cell's value would be an array containing two integers and a string: [1, 5, "7.0"]. The final value of the array is a string because it is not a valid integer; the cell's errors annotation will also contain a validation error.
Also, with this configuration, if the string value of the cell were "" (i.e. it was an empty cell) the value of the cell would be an empty list.
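The separator handling just described can be sketched like this. The function name is hypothetical and error recording is elided; items that fail to parse against the datatype simply remain strings, as in the `[1, 5, "7.0"]` example above.

```python
def parse_list_cell(string_value, separator=" ", datatype_base="integer"):
    if string_value == "":
        return []                          # an empty cell gives an empty list
    values = []
    for part in string_value.split(separator):
        part = part.strip()                # whitespace stripped around each item
        if datatype_base == "integer":
            try:
                values.append(int(part))
                continue
            except ValueError:
                pass                       # invalid items stay as strings
        values.append(part)
    return values
```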
A cell value can be inserted into a URL created using a URI template property such as valueUrl. For example, if a cell with the string value "1 5 7.0" were in a column named values, defined with:
"datatype": "decimal", "separator": " ", "valueUrl" : "{?values}"
then after expansion of the URI template, the resulting valueUrl would be ?values=1.0,5.0,7.0. The canonical representations of the decimal values are used within the URL.
By default, numeric values must be in the formats defined in [ xmlschema11-2 ]. It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.
If the datatype base is a numeric type, the datatype format annotation indicates the expected format for that number. Its value MUST be either a single string or an object with one or more of the properties:
decimalChar — a string whose value is used to represent a decimal point within the number. The default value is ".". If the supplied value is not a string, implementations MUST issue a warning and proceed as if the property had not been specified.
groupChar — a string whose value is used to group digits within the number. The default value is null. If the supplied value is not a string, implementations MUST issue a warning and proceed as if the property had not been specified.
pattern — a number format pattern as defined in [ UAX35 ]. Implementations MUST recognise number format patterns containing the symbols 0, #, the specified decimalChar (or "." if unspecified), the specified groupChar (or "," if unspecified), E, +, % and ‰. Implementations MAY additionally recognise number format patterns containing other special pattern characters defined in [ UAX35 ].
If the datatype format annotation is a single string, this is interpreted in the same way as if it were an object with a pattern property whose value is that string.
If the groupChar is specified, but no pattern is supplied, when parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:
an optional + or - sign,
followed by a decimal digit (0-9),
followed by any number of decimal digits (0-9) and the string specified as the groupChar,
followed by an optional decimalChar followed by one or more decimal digits (0-9),
followed by an optional exponent, consisting of an E followed by an optional + or - sign followed by one or more decimal digits (0-9), or
followed by an optional percent (%) or per-mille (‰) sign.
or that are one of the special values: NaN, INF, or -INF.
Implementations MAY also recognise numeric values that are in any of the standard-decimal, standard-percent or standard-scientific formats listed in the Unicode Common Locale Data Repository .
Implementations MUST add a validation error to the errors annotation for the cell , and set the cell value to a string rather than a number if the string being parsed:
contains the decimalChar, if the datatype base is integer or one of its sub-types;
contains an exponent, if the datatype base is decimal or one of its sub-types; or
is one of the special values NaN, INF, or -INF, if the datatype base is decimal or one of its sub-types.
Implementations MUST use the sign, exponent, percent, and per-mille signs when parsing the string value of a cell to provide the value of the cell. For example, the string value "-25%" must be interpreted as -0.25 and the string value "1E6" as 1000000.
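A minimal sketch of this numeric parsing, assuming a hypothetical `parse_number` helper; it handles grouping, the decimal character, exponents, and the percent and per-mille signs, but does not implement [ UAX35 ] pattern matching or error recording.

```python
def parse_number(text, group_char=",", decimal_char="."):
    factor = 1.0
    if text.endswith("%"):
        text, factor = text[:-1], 0.01     # percent sign scales by 1/100
    elif text.endswith("\u2030"):
        text, factor = text[:-1], 0.001    # per-mille sign scales by 1/1000
    # Remove grouping and map the decimal character onto ".".
    text = text.replace(group_char, "").replace(decimal_char, ".")
    number = float(text)                   # also accepts exponents like 1E6
    result = number * factor
    # Keep integral, unscaled results as integers.
    return int(result) if factor == 1.0 and result == int(result) else result
```

This reproduces the examples above: `"-25%"` parses to -0.25 and `"1E6"` to 1000000.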
Boolean values may be represented in many ways aside from the standard 1 and 0 or true and false. If the datatype base for a cell is boolean, the datatype format annotation provides the true value followed by the false value, separated by |.
For example, if format is Y|N then cells must hold either Y or N, with Y meaning true and N meaning false. If the format does not follow this syntax, implementations MUST issue a warning and proceed as if no format had been provided. The resulting cell value will be one or more boolean true or false values.
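The boolean format handling can be sketched as follows. The helper name is hypothetical; as in the prose above, an invalid format is treated as if no format had been provided, and a value matching neither side is kept as a string (error recording elided).

```python
def parse_boolean(string_value, fmt=None):
    if fmt is not None and fmt.count("|") == 1:
        true_value, false_value = fmt.split("|")
        if string_value == true_value:
            return True
        if string_value == false_value:
            return False
        return string_value                # neither value: keep the string
    # No (valid) format: fall back to the standard representations.
    return string_value in ("1", "true")
```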
By default, dates and times are assumed to be in the format defined in [ xmlschema11-2 ]. However, dates and times are commonly represented in tabular data in other formats.
If the datatype base is a date or time type, the datatype format annotation indicates the expected format for that date or time.
The supported date and time format patterns listed here are expressed in terms of the date field symbols defined in [ UAX35 ]. These format patterns MUST be recognised by implementations and MUST be interpreted as defined in that specification. Implementations MAY additionally recognise other date format patterns. Implementations MUST issue a warning if the date format pattern is invalid or not recognised and proceed as if no date format pattern had been provided.
For interoperability, authors of metadata documents SHOULD use only the formats listed in this section.
The following date format patterns MUST be recognised by implementations:
yyyy-MM-dd e.g., 2015-03-22
yyyyMMdd e.g., 20150322
dd-MM-yyyy e.g., 22-03-2015
d-M-yyyy e.g., 22-3-2015
MM-dd-yyyy e.g., 03-22-2015
M-d-yyyy e.g., 3-22-2015
dd/MM/yyyy e.g., 22/03/2015
d/M/yyyy e.g., 22/3/2015
MM/dd/yyyy e.g., 03/22/2015
M/d/yyyy e.g., 3/22/2015
dd.MM.yyyy e.g., 22.03.2015
d.M.yyyy e.g., 22.3.2015
MM.dd.yyyy e.g., 03.22.2015
M.d.yyyy e.g., 3.22.2015
The following time format patterns MUST be recognised by implementations:
HH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds, e.g., HH:mm:ss.SSS for 15:02:37.143
HH:mm:ss e.g., 15:02:37
HHmmss e.g., 150237
HH:mm e.g., 15:02
HHmm e.g., 1502
The following date/time format patterns MUST be recognised by implementations:
yyyy-MM-ddTHH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds, e.g., yyyy-MM-ddTHH:mm:ss.SSS for 2015-03-15T15:02:37.143
yyyy-MM-ddTHH:mm:ss e.g., 2015-03-15T15:02:37
yyyy-MM-ddTHH:mm e.g., 2015-03-15T15:02
any of the date formats above, followed by a single space, followed by any of the time formats above, e.g., M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37
Implementations MUST also recognise date, time, and date/time format patterns that end with timezone markers consisting of between one and three x or X characters, possibly after a single space. These MUST be interpreted as follows:
X e.g., -08, +0530, or Z (minutes are optional)
XX e.g., -0800, +0530, or Z
XXX e.g., -08:00, +05:30, or Z
x e.g., -08 or +0530 (Z is not permitted)
xx e.g., -0800 or +0530 (Z is not permitted)
xxx e.g., -08:00 or +05:30 (Z is not permitted)
For example, date format patterns could include yyyy-MM-ddTHH:mm:ssXXX for 2015-03-15T15:02:37Z or 2015-03-15T15:02:37-05:00, or HH:mm x for 15:02 -05.
The cell value will be one or more date/time values extracted using the format.
For simplicity, this version of this standard does not support abbreviated or full month or day names, or double digit years. Future versions of this standard may support other date and time formats, or general purpose date/time pattern strings. Authors of schemas SHOULD use appropriate regular expressions, along with the string datatype, for dates and times that use a format other than that specified here.
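An implementation might translate these [ UAX35 ] field symbols into its platform's date-parsing directives. The sketch below maps a handful of the patterns listed above onto Python strptime directives; the mapping table and helper are assumptions for illustration and cover only these unambiguous cases (strptime's %d and %m also accept single digits, so the short patterns are not listed separately).

```python
from datetime import datetime

# Partial, assumed mapping from UAX #35 patterns to strptime directives.
UAX35_TO_STRPTIME = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "dd/MM/yyyy": "%d/%m/%Y",
    "MM/dd/yyyy": "%m/%d/%Y",
    "HH:mm:ss": "%H:%M:%S",
    "yyyy-MM-ddTHH:mm:ss": "%Y-%m-%dT%H:%M:%S",
}

def parse_date(string_value, pattern):
    # Raises KeyError for unmapped patterns and ValueError for bad input.
    return datetime.strptime(string_value, UAX35_TO_STRPTIME[pattern])
```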
Durations MUST be formatted and interpreted as defined in [ xmlschema11-2 ], using the [ ISO8601 ] format -?PnYnMnDTnHnMnS. For example, the duration P1Y1D is used for a year and a day; the duration PT2H30M for 2 hours and 30 minutes.
If the datatype base is a duration type, the datatype format annotation provides a regular expression for the string values, with syntax and processing defined by [ ECMASCRIPT ]. If the supplied value is not a valid regular expression, implementations MUST issue a warning and proceed as if no format had been provided.
Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
The cell value will be one or more durations extracted using the format.
If the datatype base is not numeric, boolean, a date/time type, or a duration type, the datatype format annotation provides a regular expression for the string values, with syntax and processing defined by [ ECMASCRIPT ]. If the supplied value is not a valid regular expression, implementations MUST issue a warning and proceed as if no format had been provided.
Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
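The regular-expression format handling can be sketched like this. It is a simplification: Python's `re` module is close to, but not identical with, the [ ECMASCRIPT ] dialect the annotation assumes, and the helper name and its (matched, warning) return shape are illustrative.

```python
import re

def apply_format_regex(string_value, fmt):
    try:
        pattern = re.compile(fmt)
    except re.error:
        # Invalid regular expression: warn and proceed as if no format
        # had been provided, i.e. the value is accepted.
        return True, "warning: invalid regular expression"
    return (pattern.search(string_value) is not None), None
```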
Values that are labelled as html, xml, or json SHOULD NOT be validated against those formats.
Metadata creators who wish to check the syntax of HTML, XML, or JSON within tabular data should use the datatype format annotation to specify a regular expression against which such values will be tested.
This section is non-normative.
When displaying tables, implementations should:
There are two levels of bidirectionality to consider when displaying tables: the directionality of the table (i.e., whether the columns should be arranged left-to-right or right-to-left) and the directionality of the content of individual cells.
The table direction annotation on the table provides information about the desired display of the columns in the table. If table direction is ltr then the first column should be displayed on the left and the last column on the right. If table direction is rtl then the first column should be displayed on the right and the last column on the left.
If table direction is auto then tables should be displayed with attention to the bidirectionality of the content of the cells in the table.
Specifically, the values of the cells in the table should be scanned breadth first: from the first cell in the first column through to the last cell in the first row, down to the last cell in the last column. If the first character in the table with a strong type as defined in [ BIDI ] indicates a RTL directionality, the table should be displayed with the first column on the right and the last column on the left. Otherwise, the table should be displayed with the first column on the left and the last column on the right. Characters such as whitespace, quotes, commas, and numbers do not have a strong type, and therefore are skipped when identifying the character that determines the directionality of the table.
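The breadth-first scan for the first strong character can be sketched using the Unicode character database. The helper name is hypothetical; `cell_values` is assumed to already be in the scan order described above, and bidirectional categories R and AL are treated as strong RTL, L as strong LTR.

```python
import unicodedata

def table_is_rtl(cell_values):
    # Scan cell values for the first character with a strong
    # bidirectional type; R or AL means right-to-left display.
    for value in cell_values:
        for ch in value:
            bidi = unicodedata.bidirectional(ch)
            if bidi == "L":
                return False
            if bidi in ("R", "AL"):
                return True
    return False                           # no strong character: default LTR
```

Digits, whitespace, quotes, and commas have weak or neutral types (e.g. EN, WS, CS), so a table of numbers with one Arabic heading is still detected as right-to-left.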
Implementations should enable user preferences to override the indicated metadata about the directionality of the table.
Once the directionality of the table has been determined, each cell within the table should be considered as a separate paragraph, as defined by the Unicode Bidirectional Algorithm (UBA) in [ BIDI ]. The default directionality for the cell is determined by looking at the text direction annotation for the cell, as follows: if text direction is ltr then the base direction for the cell content should be set to left-to-right; if text direction is rtl then the base direction for the cell content should be set to right-to-left; if text direction is auto then the base direction for the cell content should be determined by the first character in the cell with a strong type, as defined in [ BIDI ].
If the textDirection property in metadata has the value "inherit", the text direction annotation for a cell inherits its value from the table direction annotation on the table.
When the titles of a column are displayed, these should be displayed in the direction determined by the first character in the title with a strong type as defined in [ BIDI ]. Titles for the same column in different languages may be displayed in different directions.
The labelling of columns and rows helps those who are attempting to understand the content of a table to grasp what a particular cell means. Implementations should present appropriate titles for columns, and ensure that the most important information in a row is kept apparent to the user, to aid their understanding. For example:
When labelling a column, either on the screen or aurally, implementations should use the first available of:
When labelling a row, either on the screen or aurally, implementations should use the first available of:
Validators test whether given tabular data files adhere to the structure defined within a schema. Validators MUST raise errors (and halt processing) and issue warnings (and continue processing) as defined in [ tabular-metadata ]. In addition, validators MUST raise errors but MAY continue validating in the following situations:
Conversions of tabular data to other formats operate over a annotated table constructed as defined in Annotating Tables in [ tabular-metadata ]. The mechanics of these conversions to other formats are defined in other specifications such as [ csv2json ] and [ csv2rdf ].
Conversion specifications MUST define a default mapping from an annotated table that lacks any annotations (i.e., that is equivalent to an un-annotated table).
Conversion specifications MUST use the property value of the propertyUrl of a column as the basis for naming machine-readable fields in the target format, such as the name of the equivalent element or attribute in XML, property in JSON, or property URI in RDF.
Conversion specifications MAY use any of the annotations found on an annotated table group, table, column, row or cell, including non-core annotations, to adjust the mapping into another format.
Conversion specifications MAY define additional annotations, not defined in this specification, which are specifically used when converting to the target format of the conversion. For example, a conversion to XML might specify a http://example.org/conversion/xml/element-or-attribute property on columns that determines whether a particular column is represented through an element or an attribute in the data.
This section is non-normative.
There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for expressing tabular data adhering to the annotated tabular data model in CSV. Authors are encouraged to adhere to the constraints described in this section as implementations should process such CSV files consistently.
This syntax is not compliant with text/csv as defined in [ RFC4180 ] in that it permits line endings other than CRLF. Supporting LF line endings is important for data formats that are used on non-Windows platforms. However, all files that adhere to [ RFC4180 ]'s definition of CSV meet the constraints described in this section.
Developing a standard for CSV is outside the scope of the Working Group. The details here aim to help shape any future standard.
The appropriate content type for a CSV file is text/csv. For example, when a CSV file is transmitted via HTTP, the HTTP response should include a Content-Type header with the value text/csv:
Content-Type: text/csv
CSV files should be encoded using UTF-8. If a CSV file is not encoded using UTF-8, the encoding should be specified through the charset parameter in the Content-Type header:
Content-Type: text/csv;charset=ISO-8859-1
The ends of rows in a CSV file should be CRLF (U+000D U+000A) but may be LF (U+000A). Line endings within escaped cells are not normalised.
Each line of a CSV file should contain the same number of comma-separated values.
Values that contain commas, line endings, or double quotes should be escaped by having the entire value wrapped in double quotes. There should not be whitespace before or after the double quotes. Within these escaped cells, any double quotes should be escaped with two double quotes ("").
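These escaping rules match the default minimal-quoting behaviour of, for example, Python's csv module, which can serve as a quick check of what a conforming writer produces:

```python
import csv
import io

# QUOTE_MINIMAL wraps values containing commas, line endings or double
# quotes in double quotes, and doubles any embedded double quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerow(["GID", "Note"])
writer.writerow(["1", 'said "hi", then left'])
output = buffer.getvalue()
```

Here `output` is `'GID,Note\r\n1,"said ""hi"", then left"\r\n'`: only the value needing escaping is quoted, with its embedded quotes doubled.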
The first line of a CSV file should contain a comma-separated list of names of columns . This is known as the header line and provides titles for the columns. There are no constraints on these titles.
If a CSV file does not include a header line, this should be specified using the header parameter of the media type:
Content-Type: text/csv;header=absent
This grammar is a generalization of that defined in [ RFC4180 ] and is included for reference only.
The EBNF used here is defined in XML 1.0 [ EBNF-NOTATION ].
This section is non-normative.
As described in section 7. Best Practice CSV , there may be many formats which an application might interpret into the tabular data model described in section 4. Tabular Data Models , including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.
Standardising the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardisation of the format.
This section describes an algorithm for parsing formats that do not adhere to the constraints described in section 7. Best Practice CSV , as well as those that do, and extracting embedded metadata . The parsing algorithm uses the following flags. These may be set by metadata properties found while Locating Metadata , including through user input (see Overriding Metadata ), or through the inclusion of a dialect description within a metadata file:
comment prefix — a character that, when it appears at the beginning of a row, indicates that the row is a comment that should be associated as an rdfs:comment annotation to the table. This is set by the commentPrefix property of a dialect description. The default is null, which means no rows are treated as comments. A value other than null may mean that the source numbers of rows are different from their numbers.
delimiter — the separator between cells, set by the delimiter property of a dialect description. The default is ,.
encoding — the character encoding for the file, set by the encoding property of a dialect description. The default is utf-8.
escape character — the string that is used to escape the quote character within escaped cells, or null, set by the doubleQuote property of a dialect description. The default is " (such that "" is used to escape " within an escaped cell).
header row count — the number of header rows (following the skipped rows) in the file, set by the header or headerRowCount property of a dialect description. The default is 1. A value other than 0 will mean that the source numbers of rows will be different from their numbers.
line terminators — the strings that can be used at the end of a row, set by the lineTerminators property of a dialect description. The default is [CRLF, LF].
quote character — the string that is used around escaped cells, or null, set by the quoteChar property of a dialect description. The default is ".
skip blank rows — an indicator of whether to ignore wholly empty rows, set by the skipBlankRows property of a dialect description. The default is false. A value other than false may mean that the source numbers of rows are different from their numbers.
skip columns — the number of columns to skip at the beginning of each row, set by the skipColumns property of a dialect description. The default is 0. A value other than 0 will mean that the source numbers of columns will be different from their numbers.
skip rows — the number of rows to skip at the beginning of the file, set by the skipRows property of a dialect description. The default is 0. A value greater than 0 will mean that the source numbers of rows will be different from their numbers.
trim — an indicator of whether to trim whitespace around cells; may be true, false, start, or end, set by the skipInitialSpace or trim property of a dialect description. The default is true.
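For reference, the default values of the flags above can be collected into a plain dictionary. This is purely illustrative; the key names follow the dialect description properties, not any particular implementation's API.

```python
# Default parsing flags, keyed by dialect description property name.
DEFAULT_DIALECT = {
    "commentPrefix": None,          # no rows treated as comments
    "delimiter": ",",
    "encoding": "utf-8",
    "doubleQuote": True,            # "" escapes " within escaped cells
    "headerRowCount": 1,
    "lineTerminators": ["\r\n", "\n"],
    "quoteChar": '"',
    "skipBlankRows": False,
    "skipColumns": 0,
    "skipRows": 0,
    "trim": True,
}
```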
The algorithm for using these flags to parse a document containing tabular data to create a basic annotated tabular data model and to extract embedded metadata is as follows:
null
null
ltr
auto
false
false
{ "@context": "http://www.w3.org/ns/csvw", "rdfs:comment": [], "tableSchema": { "columns": [] } }
If the URL of the tabular data file being parsed is known, set the url property on M to that URL. Set the source row number to 1.
Read the file using the encoding , as specified in [ encoding ], using the replacement error mode . If the encoding is not a Unicode encoding, use a normalizing transcoder to normalize into Unicode Normal Form C as defined in [ UAX15 ].
The replacement error mode ensures that any non-Unicode characters within the CSV file are replaced by U+FFFD, ensuring that strings within the tabular data model such as column titles and cell string values only contain valid Unicode characters.
If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array. Otherwise, if the row content is not an empty string, add the row content to the M.rdfs:comment array.
Add 1 to the source row number.
If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
If there is no column description object at that index within M.tableSchema.columns, create a new one with a title property whose value is an array containing a single value that is the value at index i in the list of cell values. Otherwise, add the value at index i in the list of cell values to the array at M.tableSchema.columns[i].title.
Add 1 to the source row number.
If the header row count is zero, create an empty column description object in M.tableSchema.columns for each column in the current row after skip columns.
1
.
1
.
If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
If the row is blank and skip blank rows is true, add 1 to the source row number and move on to process the next row.
1
):
null
false
false
string
und
false
false
null
auto
null
null
null
ltr
auto
false
null
null
null
Add 1 to the source column number. Add 1 to the source row number.
If M.rdfs:comment is an empty array, remove the rdfs:comment property from M.
To read a row to provide row content , perform the following steps:
To read a quoted value to provide a quoted value , perform the following steps:
To parse a row to provide a list of cell values , perform the following steps:
false
.
false
,
set
the
quoted
flag
to
true
,
and
move
on
to
process
the
false
,
and
move
on
to
process
the
true
,
append
the
delimiter
To conditionally trim a cell value to provide a trimmed cell value , perform the following steps:
If the trim flag is true or start, remove any leading whitespace from the start of the trimmed cell value and move on to the next step. If the trim flag is true or end, remove any trailing whitespace from the end of the trimmed cell value and move on to the next step.
This parsing algorithm does not account for the possibility of there being more than one area of tabular data within a single CSV file.
This section is non-normative.
Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.
For example, Egyptian Referendum results are available as a CSV file at https://egelections-2011.appspot.com/Referendum2012/results/csv/EG.csv . Over the wire and in non-Unicode-aware text editors, the CSV looks like:
المحافظة,نسبة موافق,نسبة غير موافق,عدد الناخبين,الأصوات الصحيحة,الأصوات الباطلة,نسبة المشاركة,موافق,غير موافق القليوبية,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070" الجيزة,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675" القاهرة,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327" قنا,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670" ...
Within this CSV file, the first column appears as the content of each line before the first comma and is named المحافظة (appearing at the start of each row as المحافظة in the example, which is displaying the relevant characters from left to right in the order they appear "on the wire").
The CSV translates to a table model that looks like:
Column / Row | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 | column 7 | column 8 | column 9 |
---|---|---|---|---|---|---|---|---|---|
column names | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق |
row 1 | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070 |
row 2 | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675 |
row 3 | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327 |
row 4 | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670 |
The fragment identifier #col=3 identifies the third of the columns, named نسبة غير موافق (appearing as نسبة غير موافق in the example).
Section 6.5.1 Bidirectional Tables defines how this table model should be displayed by compliant applications, and how metadata can affect the display. The default is for the display to be determined by the content of the table.
For example, if this CSV were turned into an HTML table for display in a web page, it should be displayed with the first column on the right and the last on the left, as follows:
غير موافق | موافق | نسبة المشاركة | الأصوات الباطلة | الأصوات الصحيحة | عدد الناخبين | نسبة غير موافق | نسبة موافق | المحافظة |
---|---|---|---|---|---|---|---|---|
341,070 | 512,055 | 32.9 | 15,224 | 853,125 | 2,639,808 | 40.0 | 60.0 | القليوبية |
497,675 | 995,417 | 34.6 | 24,105 | 1,493,092 | 4,383,701 | 33.3 | 66.7 | الجيزة |
1,280,327 | 974,371 | 34.8 | 36,342 | 2,254,698 | 6,580,478 | 56.8 | 43.2 | القاهرة |
56,670 | 307,839 | 22.8 | 6,743 | 364,509 | 1,629,713 | 15.5 | 84.5 | قنا |
The fragment identifier #col=3 still identifies the third of the columns, named نسبة غير موافق, which appears in the HTML display as the third column from the right and is what those who read right-to-left would think of as the third column.
Note that this display matches that shown on the original website .
A simple CSV file that complies with the constraints described in section 7. Best Practice CSV, at http://example.org/tree-ops.csv, might look like:
GID,On Street,Species,Trim Cycle,Inventory Date 1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010 2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
Parsing this file results in an annotated tabular data model of a single table T with five columns and two rows. The columns have the annotations shown in the following table:
id | core annotations | | | | |
---|---|---|---|---|---|
 | table | number | source number | cells | titles |
C1 | T | 1 | 1 | C1.1 , C2.1 | GID |
C2 | T | 2 | 2 | C1.2 , C2.2 | On Street |
C3 | T | 3 | 3 | C1.3 , C2.3 | Species |
C4 | T | 4 | 4 | C1.4 , C2.4 | Trim Cycle |
C5 | T | 5 | 5 | C1.5 , C2.5 | Inventory Date |
The extracted embedded metadata , as defined in [ tabular-metadata ], would look like:
{ "@type": "Table", "url": "http://example.org/tree-ops.csv", "tableSchema": { "columns": [ {"titles": [ "GID" ]}, {"titles": [ "On Street" ]}, {"titles": [ "Species" ]}, {"titles": [ "Trim Cycle" ]}, {"titles": [ "Inventory Date" ]} ] } }
The rows have the annotations shown in the following table:
id | core annotations | |||
---|---|---|---|---|
table | number | source number | cells | |
R1 | T | 1 | 2 | C1.1 , C1.2 , C1.3 , C1.4 , C1.5 |
R2 | T | 2 | 3 | C2.1 , C2.2 , C2.3 , C2.4 , C2.5 |
The source number of each row is offset by one from the number of each row because, in the source CSV file, the header line is the first line. It is possible to reconstruct a [ RFC7111 ] compliant reference to the first record in the original CSV file ( http://example.org/tree-ops.csv#row=2 ) using the value of the row's source number. This enables implementations to retain provenance between the table model and the original file.
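Reconstructing such a reference from a row's source number is a one-liner; the helper name below is illustrative, and the fragment syntax is that of [ RFC7111 ].

```python
def row_fragment(url, source_row_number):
    # Build an RFC 7111 row fragment from the row's source number,
    # which counts lines in the original file including the header.
    return f"{url}#row={source_row_number}"
```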
The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):
id | core annotations | | | | |
---|---|---|---|---|---|
 | table | column | row | string value | value |
C1.1 | T | C1 | R1 | "1" | "1" |
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" |
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" |
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010" |
C2.1 | T | C1 | R2 | "2" | "2" |
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" |
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" |
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
C2.5 | T | C5 | R2 | "6/2/2010" | "6/2/2010" |
The tools that the consumer of this data uses may provide a mechanism for overriding the metadata that has been provided within the file itself. For example, they might enable the consumer to add machine-readable names to the columns, or to mark the fifth column as holding a date in the format M/D/YYYY. These facilities are implementation defined; the code for invoking a Javascript-based parser might look like:
data.parse({ "column-names": ["GID", "on_street", "species", "trim_cycle", "inventory_date"], "datatypes": ["string", "string", "string", "string", "date"], "formats": [null,null,null,null,"M/D/YYYY"] });
This is equivalent to a metadata file expressed in the syntax defined in [ tabular-metadata ], looking like:
{ "@type": "Table", "url": "http://example.org/tree-ops.csv", "tableSchema": { "columns": [{ "name": "GID", "datatype": "string" }, { "name": "on_street", "datatype": "string" }, { "name": "species", "datatype": "string" }, { "name": "trim_cycle", "datatype": "string" }, { "name": "inventory_date", "datatype": { "base": "date", "format": "M/d/yyyy" } }] } }
This would be merged with the embedded metadata found in the CSV file, providing the titles for the columns to create:
```
{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": "GID",
      "datatype": "string"
    }, {
      "name": "on_street",
      "titles": "On Street",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "datatype": {
        "base": "date",
        "format": "M/d/yyyy"
      }
    }]
  }
}
```
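The merge described here can be pictured as a per-column overlay, in which properties supplied by the consumer are combined with the titles found in the embedded metadata. A minimal sketch, assuming a simple index-by-index combination (the `mergeColumns` helper is illustrative, not part of any defined API):

```javascript
// Embedded metadata: titles taken from the CSV header row.
const embedded = [
  { titles: "GID" },
  { titles: "On Street" },
  { titles: "Species" },
  { titles: "Trim Cycle" },
  { titles: "Inventory Date" }
];

// Consumer-supplied metadata: machine-readable names and datatypes.
const supplied = [
  { name: "GID", datatype: "string" },
  { name: "on_street", datatype: "string" },
  { name: "species", datatype: "string" },
  { name: "trim_cycle", datatype: "string" },
  { name: "inventory_date", datatype: { base: "date", format: "M/d/yyyy" } }
];

// Combine column descriptions index by index; supplied properties
// are added alongside the embedded titles.
function mergeColumns(embedded, supplied) {
  return embedded.map((col, i) => Object.assign({}, col, supplied[i]));
}

const merged = mergeColumns(embedded, supplied);
console.log(merged[4]); // carries titles, name, and datatype together
```

A real processor follows the normative merge rules in [ tabular-metadata ]; this sketch only illustrates the shape of the result.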
The processor can then create an annotated tabular data model that includes name annotations on the columns and datatype annotations on the cells, and creates cells whose values are of appropriate types (in the case of this Javascript implementation, the cells in the last column would be Date objects, for example).
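The date-parsing step can be pictured along these lines. A hand-rolled sketch for the M/D/YYYY pattern only (the `parseMDY` helper is illustrative; a real implementation would support the full set of format patterns, typically via a date library):

```javascript
// Parse a string such as "10/18/2010" using the M/D/YYYY format
// into an ISO 8601 date string; return null if it does not match.
function parseMDY(text) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(text);
  if (m === null) return null; // not a valid M/D/YYYY date
  const [month, day, year] = [Number(m[1]), Number(m[2]), Number(m[3])];
  const pad = n => String(n).padStart(2, "0");
  return `${year}-${pad(month)}-${pad(day)}`;
}

console.log(parseMDY("10/18/2010")); // → "2010-10-18"
console.log(parseMDY("6/2/2010"));   // → "2010-06-02"
```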
Assuming this kind of implementation-defined parsing, the columns would then have the annotations shown in the following table:
id | table | number | source number | cells | name | titles | datatype
---|---|---|---|---|---|---|---
C1 | T | 1 | 1 | C1.1, C2.1 | GID | GID | string
C2 | T | 2 | 2 | C1.2, C2.2 | on_street | On Street | string
C3 | T | 3 | 3 | C1.3, C2.3 | species | Species | string
C4 | T | 4 | 4 | C1.4, C2.4 | trim_cycle | Trim Cycle | string
C5 | T | 5 | 5 | C1.5, C2.5 | inventory_date | Inventory Date | { "base": "date", "format": "M/d/yyyy" }
The cells have the annotations shown in the following table. Because of the overrides provided by the consumer to guide the parsing, and the way the parser works, the cells in the Inventory Date column (cells C1.5 and C2.5) have values that are parsed dates rather than unparsed strings.
id | table | column | row | string value | value
---|---|---|---|---|---
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02
A similar set of annotations could be provided through a metadata file, located as discussed in section 5. Locating Metadata and defined in [ tabular-metadata ]. For example, this might look like:
```
{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID",
    "aboutUrl": "#gid-{GID}"
  }
}
```
The annotated tabular data model generated from this would be more sophisticated again. The table itself would have the following annotations:
annotation | value
---|---
dc:title | {"@value": "Tree Operations", "@language": "en"}
dcat:keyword | [{"@value": "tree", "@language": "en"}, {"@value": "street", "@language": "en"}, {"@value": "maintenance", "@language": "en"}]
dc:publisher | [{ "schema:name": "Example Municipality", "schema:url": {"@id": "http://example.org"} }]
dc:license | {"@id": "http://opendefinition.org/licenses/cc-by/"}
dc:modified | {"@value": "2010-12-31", "@type": "date"}
The columns would have the annotations shown in the following table:
id | table | number | source number | cells | name | titles | datatype | dc:description
---|---|---|---|---|---|---|---|---
C1 | T | 1 | 1 | C1.1, C2.1 | GID | GID, Generic Identifier | string | An identifier for the operation on a tree.
C2 | T | 2 | 2 | C1.2, C2.2 | on_street | On Street | string | The street that the tree is on.
C3 | T | 3 | 3 | C1.3, C2.3 | species | Species | string | The species of the tree.
C4 | T | 4 | 4 | C1.4, C2.4 | trim_cycle | Trim Cycle | string | The operation performed on the tree.
C5 | T | 5 | 5 | C1.5, C2.5 | inventory_date | Inventory Date | { "base": "date", "format": "M/d/yyyy" } | The date of the operation that was performed.
The rows have an additional primary key annotation, as shown in the following table:
id | table | number | source number | cells | primary key
---|---|---|---|---|---
R1 | T | 1 | 2 | C1.1, C1.2, C1.3, C1.4, C1.5 | C1.1
R2 | T | 2 | 3 | C2.1, C2.2, C2.3, C2.4, C2.5 | C2.1
Thanks to the provided metadata, the cells again have the annotations shown in the following table. The metadata file has provided the information to supplement the model with additional annotations; in addition, the cells in the Inventory Date column (cells C1.5 and C2.5) have values that are parsed dates rather than unparsed strings.
id | table | column | row | string value | value | about URL
---|---|---|---|---|---|---
C1.1 | T | C1 | R1 | "1" | "1" | http://example.org/tree-ops.csv#gid-1
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" | http://example.org/tree-ops.csv#gid-1
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" | http://example.org/tree-ops.csv#gid-1
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | http://example.org/tree-ops.csv#gid-1
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18 | http://example.org/tree-ops.csv#gid-1
C2.1 | T | C1 | R2 | "2" | "2" | http://example.org/tree-ops.csv#gid-2
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" | http://example.org/tree-ops.csv#gid-2
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" | http://example.org/tree-ops.csv#gid-2
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | http://example.org/tree-ops.csv#gid-2
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02 | http://example.org/tree-ops.csv#gid-2
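The about URL annotations in this example come from expanding the aboutUrl URI template ("#gid-{GID}") against each row's cell values and resolving the result against the table's URL. A minimal sketch, handling only simple {name} substitution rather than the full URI-template syntax (the `aboutUrl` helper is illustrative):

```javascript
// Expand a URI template such as "#gid-{GID}" against a row's cell
// values, then resolve the result against the table URL.
function aboutUrl(template, tableUrl, row) {
  const expanded = template.replace(/\{(\w+)\}/g, (_, name) => row[name]);
  return new URL(expanded, tableUrl).href;
}

const url = aboutUrl("#gid-{GID}", "http://example.org/tree-ops.csv", { GID: "1" });
console.log(url); // → "http://example.org/tree-ops.csv#gid-1"
```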
The following slightly amended CSV file contains quoted and missing cell values:
```
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010
2,,"Liquidambar styraciflua","Large Tree Routine Prune",
```
Parsing this file similarly results in an annotated tabular data model of a single table T with five columns and two rows. The columns and rows have exactly the same annotations as previously, but there are two null cell values for C2.2 and C2.5. Note that the quoting of values within the CSV makes no difference to either the string value or value of the cell.
id | table | column | row | string value | value
---|---|---|---|---|---
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010"
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "" | null
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "" | null
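The null values in the table above reflect the parsing rule just described: an empty (zero-length) cell yields a null value, while quoting makes no difference to the string value or value. A sketch of that rule, assuming quoted fields in this example contain no embedded commas (the `cellValues` helper is illustrative, not a full CSV parser):

```javascript
// Split a CSV record on commas and strip surrounding double quotes;
// empty cells become null values.
function cellValues(record) {
  return record.split(",").map(field => {
    const text = field.replace(/^"(.*)"$/, "$1"); // unquote if quoted
    return text === "" ? null : text;
  });
}

console.log(cellValues('2,,"Liquidambar styraciflua","Large Tree Routine Prune",'));
// → ["2", null, "Liquidambar styraciflua", "Large Tree Routine Prune", null]
```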
The following example illustrates some of the complexities that can be involved in parsing tabular data, how the flags described above can be used, and how new tabular data formats could be defined that embed additional annotations into the tabular data model.
In this example, the publishers of the data are using an internal convention to supply additional metadata about the tabular data embedded within the file itself. They are also using a tab as a separator rather than a comma.
```
#	publisher	City of Palo Alto
#	updated	12/31/2010
#name	GID	on_street	species	trim_cycle	inventory_date
#datatype	string	string	string	string	date:M/D/YYYY
	GID	On Street	Species	Trim Cycle	Inventory Date
	1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010
	2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010
```
Naive parsing of the above data will assume a comma separator and will thus result in a single table T with a single column and six rows. The column has the annotations shown in the following table:
id | table | number | source number | cells | titles
---|---|---|---|---|---
C1 | T | 1 | 1 | C1.1, C2.1, C3.1, C4.1, C5.1, C6.1 | # publisher City of Palo Alto
The rows have the annotations shown in the following table:
id | table | number | source number | cells
---|---|---|---|---
R1 | T | 1 | 2 | C1.1
R2 | T | 2 | 3 | C2.1
R3 | T | 3 | 4 | C3.1
R4 | T | 4 | 5 | C4.1
R5 | T | 5 | 6 | C5.1
R6 | T | 6 | 7 | C6.1
The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):
id | table | column | row | string value | value
---|---|---|---|---|---
C1.1 | T | C1 | R1 | "# updated 12/31/2010" | "# updated 12/31/2010"
C2.1 | T | C1 | R2 | "#name GID on_street species trim_cycle inventory_date" | "#name GID on_street species trim_cycle inventory_date"
C3.1 | T | C1 | R3 | "#datatype string string string string date:M/D/YYYY" | "#datatype string string string string date:M/D/YYYY"
C4.1 | T | C1 | R4 | " GID On Street Species Trim Cycle Inventory Date" | " GID On Street Species Trim Cycle Inventory Date"
C5.1 | T | C1 | R5 | " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010" | " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010"
C6.1 | T | C1 | R6 | " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010" | " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010"
The consumer of the data may use the flags described above to create a more useful set of data from this file. Specifically, they could set:
- the delimiter flag to a tab character,
- the skip rows flag to 4,
- the skip columns flag to 1, and
- the comment prefix flag to #.
Setting these is done in an implementation-defined way. It could be done, for example, by sniffing the contents of the file itself, through command-line options, or by embedding a dialect description into a metadata file associated with the tabular data, which would look like:
```
{
  "delimiter": "\t",
  "skipRows": 4,
  "skipColumns": 1,
  "commentPrefix": "#"
}
```
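The effect of these dialect flags can be sketched as a pre-processing pass over the raw lines: skipped rows that begin with the comment prefix become comment annotations, and the leading column is dropped from the remaining rows. An illustrative sketch only; a real parser would also handle quoting and line-ending variations (the `applyDialect` helper is an assumption, not a defined API):

```javascript
// Apply dialect flags to raw tabular data: a delimiter, a number of
// rows and columns to skip, and a comment prefix whose skipped rows
// become comment annotations on the table.
function applyDialect(text, { delimiter, skipRows, skipColumns, commentPrefix }) {
  const comments = [];
  const rows = [];
  const lines = text.split(/\r?\n/).filter(line => line !== "");
  lines.forEach((line, i) => {
    if (i < skipRows && line.startsWith(commentPrefix)) {
      comments.push(line.slice(commentPrefix.length).trim());
    } else if (i >= skipRows) {
      rows.push(line.split(delimiter).slice(skipColumns));
    }
  });
  return { comments, rows };
}

const text = "#\tpublisher\tCity of Palo Alto\n" +
             "#\tupdated\t12/31/2010\n" +
             "#name\tGID\ton_street\n" +
             "#datatype\tstring\tstring\n" +
             "\tGID\tOn Street\n" +
             "\t1\tADDISON AV\n";
const { comments, rows } = applyDialect(text, {
  delimiter: "\t", skipRows: 4, skipColumns: 1, commentPrefix: "#"
});
console.log(rows[0]); // → ["GID", "On Street"]
```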
With these flags in operation, parsing this file results in an annotated tabular data model of a single table T with five columns and two rows which is largely the same as that created from the original simple example described in section 8.2.1 Simple Example . There are three differences.
First, because the four skipped rows began with the comment prefix, the table itself now has four rdfs:comment annotations, with the values:

- publisher City of Palo Alto
- updated 12/31/2010
- name GID on_street species trim_cycle inventory_date
- datatype string string string string date:M/D/YYYY
Second, because the first column has been skipped, the source number of each of the columns is offset by one from the number of each column:
id | table | number | source number | cells | titles
---|---|---|---|---|---
C1 | T | 1 | 2 | C1.1, C2.1 | GID
C2 | T | 2 | 3 | C1.2, C2.2 | On Street
C3 | T | 3 | 4 | C1.3, C2.3 | Species
C4 | T | 4 | 5 | C1.4, C2.4 | Trim Cycle
C5 | T | 5 | 6 | C1.5, C2.5 | Inventory Date
Finally, because four additional rows have been skipped, the source number of each of the rows is offset by five from the row number (the four skipped rows plus the single header row):
id | table | number | source number | cells
---|---|---|---|---
R1 | T | 1 | 6 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 7 | C2.1, C2.2, C2.3, C2.4, C2.5
The conventions used in this data (invented for the purpose of this example) are in fact intended to create an annotated tabular data model which includes named annotations on the table itself, on the columns, and on the cells. The creator of these conventions could create a specification for this particular tabular data syntax and register a media type for it. The specification would include statements like:
- If the first column of a row is "#", the second column is the name of an annotation on the table and the values of the remaining columns are concatenated to create the value of that annotation.
- If the first column of a row is #name, the remaining cells in the row provide a name annotation for each column in the table.
- If the first column of a row is #datatype, the remaining cells in the row provide datatype/format annotations for the cells within the relevant column, and these are interpreted to create the value for each cell in that column.
- The first row not prefixed with # provides title annotations on the columns in the table.
Parsers that recognised the format could then build a more sophisticated annotated tabular data model using only the embedded information in the tabular data file. They would extract embedded metadata looking like:
```
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "tree-ops.csv",
  "dc:publisher": "City of Palo Alto",
  "dc:updated": "12/31/2010",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": "GID",
      "datatype": "string"
    }, {
      "name": "on_street",
      "titles": "On Street",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "datatype": {
        "base": "date",
        "format": "M/d/yyyy"
      }
    }]
  }
}
```
As before, the result would be a single table T with five columns and two rows. The table itself would have two annotations:
annotation | value
---|---
dc:publisher | {"@value": "City of Palo Alto"}
dc:updated | {"@value": "12/31/2010"}
The columns have the annotations shown in the following table:
id | table | number | source number | cells | name | titles
---|---|---|---|---|---|---
C1 | T | 1 | 2 | C1.1, C2.1 | GID | GID
C2 | T | 2 | 3 | C1.2, C2.2 | on_street | On Street
C3 | T | 3 | 4 | C1.3, C2.3 | species | Species
C4 | T | 4 | 5 | C1.4, C2.4 | trim_cycle | Trim Cycle
C5 | T | 5 | 6 | C1.5, C2.5 | inventory_date | Inventory Date
The rows have the annotations shown in the following table, exactly as in previous examples:
id | table | number | source number | cells
---|---|---|---|---
R1 | T | 1 | 6 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 7 | C2.1, C2.2, C2.3, C2.4, C2.5
The cells have the annotations shown in the following table. Because of the way the particular tabular data format has been specified, these include additional annotations; in addition, the cells in the Inventory Date column (cells C1.5 and C2.5) have values that are parsed dates rather than unparsed strings.
id | table | column | row | string value | value
---|---|---|---|---|---
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02
The following example shows a CSV file with multiple header lines:
```
Who,What,,Where,
Organization,Sector,Subsector,Department,Municipality
#org,#sector,#subsector,#adm1,#adm2
UNICEF,Education,Teacher training,Chocó,Quidbó
UNICEF,Education,Teacher training,Chocó,Bojayá
```
Here, the first line contains some grouping titles, which are not particularly helpful. The two lines following it contain useful titles for the columns. Thus the appropriate configuration for a dialect description is:
{ "skipRows": 1, "headerRowCount": 2 }
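The way the two header rows combine into per-column titles can be sketched as follows, assuming the file has already been split into rows of cells (the `columnTitles` helper is illustrative, not a defined API):

```javascript
// Build per-column titles from a dialect with skipRows and
// headerRowCount, accumulating one title per header row.
function columnTitles(rows, { skipRows, headerRowCount }) {
  const headerRows = rows.slice(skipRows, skipRows + headerRowCount);
  const titles = [];
  headerRows.forEach(row => {
    row.forEach((cell, i) => {
      (titles[i] = titles[i] || []).push(cell);
    });
  });
  return titles;
}

const rows = [
  ["Who", "What", "", "Where", ""],
  ["Organization", "Sector", "Subsector", "Department", "Municipality"],
  ["#org", "#sector", "#subsector", "#adm1", "#adm2"]
];
console.log(columnTitles(rows, { skipRows: 1, headerRowCount: 2 })[0]);
// → ["Organization", "#org"]
```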
With this configuration, the table model contains five columns, each of which has two titles, summarised in the following table:
id | table | number | source number | cells | titles
---|---|---|---|---|---
C1 | T | 1 | 1 | C1.1, C2.1 | Organization, #org
C2 | T | 2 | 2 | C1.2, C2.2 | Sector, #sector
C3 | T | 3 | 3 | C1.3, C2.3 | Subsector, #subsector
C4 | T | 4 | 4 | C1.4, C2.4 | Department, #adm1
C5 | T | 5 | 5 | C1.5, C2.5 | Municipality, #adm2
As metadata, this would look like:
```
{
  "tableSchema": {
    "columns": [
      { "titles": ["Organization", "#org"] },
      { "titles": ["Sector", "#sector"] },
      { "titles": ["Subsector", "#subsector"] },
      { "titles": ["Department", "#adm1"] },
      { "titles": ["Municipality", "#adm2"] }
    ]
  }
}
```
A separate metadata file could contain just the second of each of these titles, for example:
```
{
  "tableSchema": {
    "columns": [
      { "name": "org", "titles": "#org" },
      { "name": "sector", "titles": "#sector" },
      { "name": "subsector", "titles": "#subsector" },
      { "name": "adm1", "titles": "#adm1" },
      { "name": "adm2", "titles": "#adm2" }
    ]
  }
}
```
This enables people from multiple jurisdictions to use the same tabular data structures without having to use exactly the same titles within their documents.
This section has not yet been submitted to IANA for review, approval, and registration.
This section is non-normative.
This appendix outlines various ways in which CSV is defined.
[ RFC4180 ] defines CSV with the following ABNF grammar:
```
file        = [header CRLF] record *(CRLF record) [CRLF]
header      = name *(COMMA name)
record      = field *(COMMA field)
name        = field
field       = (escaped / non-escaped)
escaped     = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA       = %x2C
CR          = %x0D
DQUOTE      = %x22
LF          = %x0A
CRLF        = CR LF
TEXTDATA    = %x20-21 / %x23-2B / %x2D-7E
```
Of particular note here are:
- The definition of TEXTDATA indicates that only non-control ASCII characters are permitted within a CSV file. This restriction is routinely ignored in practice, and is impractical on the international web.
- Lines must end with CRLF. This makes it harder to produce CSV files on Unix-based systems where the usual line ending is LF.
- The header line is optional; the header parameter on the media type indicates whether the header is present or not.
- Double quotes within escaped fields are represented by a pair of double quotes ("").
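The escaped production in the grammar above can be implemented with a small state machine. A sketch that parses a single record and handles doubled quotes (the `parseRecord` helper is illustrative; it assumes records have already been split, so embedded line breaks are not handled):

```javascript
// Parse one RFC 4180 record into fields, honouring escaped fields in
// which a doubled quote ("") represents a literal double quote.
function parseRecord(record) {
  const fields = [];
  let field = "", inQuotes = false;
  for (let i = 0; i < record.length; i++) {
    const ch = record[i];
    if (inQuotes) {
      if (ch === '"' && record[i + 1] === '"') { field += '"'; i++; }
      else if (ch === '"') inQuotes = false;
      else field += ch;
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(field); field = "";
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}

console.log(parseRecord('1,"Celtis australis","a ""special"" tree"'));
// → ["1", "Celtis australis", 'a "special" tree']
```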
Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.
The following describes the behaviour of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behaviour of Excel in other situations.
Excel generates CSV files encoded using Windows-1252 with LF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those cells that need escaping (e.g. because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes.
Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.
When opening CSV files, Excel interprets CSV files saved in UTF-8 as being encoded as Windows-1252 (whether or not a BOM is present). It correctly deals with double quoted cells, except that it converts line breaks within cells into spaces. It understands CRLF as a line break. It detects dates (formatted as YYYY-MM-DD) and formats them in the default date formatting for files.
Excel provides more control when importing CSV files into Excel. However, it does not properly understand UTF-8 (with or without BOM ). It does however properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.
A particular quirk in the importing of CSV is that if a cell contains a line break, the final double quote that escapes the cell will be included within it.
When tabular data is copied from Excel, it is copied in a tab-delimited format, with LF line breaks.
Downloaded CSV files are encoded in UTF-8, without a BOM, and with LF line endings. Dates and numbers are formatted as they appear within the spreadsheet.

CSV files can be imported as UTF-8 (with or without BOM). CRLF line endings are correctly recognised. Dates are reformatted to the default date format on load.
Tabular Data Packages place the following restrictions on CSV files:
As a starting point, CSV files included in a Tabular Data Package must conform to the RFC for CSV (RFC 4180, Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:
- File names MUST end with `.csv`.
- Files MUST be encoded as UTF-8.
- Files MUST have a single header row. This row MUST be the first row in the file.
- Terminology: each column in the CSV file is termed a field and its `name` is the string in that column in the header row.
- The `name` MUST be unique amongst fields, MUST contain at least one character, and MUST conform to the character restrictions defined for the `name` property.
- Rows in the file MUST NOT contain more fields than are in the header row (though they may contain fewer).
- Each file MUST have an entry in the `tables` array in the `datapackage.json` file.
- The resource metadata MUST include a `tableSchema` attribute whose value MUST be a valid schema description.
- All fields in the CSV files MUST be described in the schema description.
CSV files generated by different applications often vary in their syntax, e.g. in their use of quoting characters and delimiters. To encourage conformance, CSV files in a Tabular Data Package SHOULD:
- Use "," as field delimiters.
- Use CRLF (U+000D U+000A) or LF (U+000A) as line terminators.

If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the `datapackage.json` descriptor MUST:

- Include a `dialect` key that conforms to that described in the CSV Dialect Description Format.

Applications processing the CSV file SHOULD use the `dialect` of the CSV file to guide parsing.
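Putting those requirements together, a minimal illustrative `datapackage.json` entry might look like the following. This is a sketch assembled from the requirements above, reusing the column names from the earlier tree-ops examples; the package name and file name are assumptions, and the authoritative structure is defined by the Tabular Data Package specification itself:

```
{
  "name": "tree-ops",
  "tables": [{
    "url": "tree-ops.csv",
    "tableSchema": {
      "columns": [
        { "name": "GID", "datatype": "string" },
        { "name": "on_street", "datatype": "string" },
        { "name": "species", "datatype": "string" },
        { "name": "trim_cycle", "datatype": "string" },
        { "name": "inventory_date", "datatype": { "base": "date", "format": "M/d/yyyy" } }
      ]
    },
    "dialect": { "delimiter": "," }
  }]
}
```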
To replicate the findings above, test files which include non-ASCII characters, double quotes, and line breaks within cells are:
This section is non-normative.
The document has undergone substantial changes since the last working draft. Below are some of the changes made: