CSV on the Web Working Group Teleconference

05 Mar 2014


See also: IRC log


Alf Eaton (fresco_), Stasinos Konstantopoulos (stasinos), Ivan Herman (ivan), Davide Ceolin (DavideCeolin), Jeni Tennison (JeniT), Yakov Shafranovich (yakovsh), Jürgen Umbrich (jumbrich)
Dan, Andy, Jeremy, Chris Tim, Axel, Eric


JeniT: approve previous meeting minutes?

(no comment...)

RESOLUTION: previous meeting minutes approved

Use cases

DavideCeolin: on the use cases, 5 of them were assigned to me
... jeremy already set up the requirements
... I am not sure they are really part of the others

JeniT: it would be useful to know whether there were any issues to be discussed?

DavideCeolin: one was about provenance
... but this was discussed
... but it has changed a bit
... everything related to annotation is linked to this
... the possibility to map elements to URIs
... in some spreadsheets there are codes, and it would be useful to link them to URIs
... I pointed that out as a separate requirement
... not sure whether this is covered by others

JeniT: I agree that is a requirement
... probably is not elsewhere noted

DavideCeolin: is it covered by the external definition resource?

JeniT: keep it separate for now, and we will be able to come back to this later

DavideCeolin: there is also a requirement on units of measure
... partially covered by the semantic requirements
... but I still kept it separate for now

JeniT: you mean that the semantic type might cover some part of the measure unit?

DavideCeolin: I am not an expert on that, I am not sure what the best way to cover this is

JeniT: this may be a choice of the publisher whether this is something to go into the semantic part
... thanks for that, it is really good
... our UC document is coming together
... anybody have any issues/questions?

(no comments)

syntax document

<JeniT> http://w3c.github.io/csvw/syntax/

JeniT: following on from the last call I have taken the tabular data specification into the spec
... in the data model we said every column has a name, now we say every column has an index
... the name of the column is part of the annotation, ie, the annotated data model
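
To illustrate the split described here, a minimal Python sketch follows: the core model addresses columns by index only, and the column name lives in a separate annotation layer. The class and field names (`Table`, `AnnotatedTable`, `column_annotations`) are assumptions made for this example and are not taken from the syntax draft.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Table:
    """Core model: cells are addressed purely by 1-based row and column index."""
    rows: list[list[str]]

    def cell(self, row: int, col: int) -> str:
        return self.rows[row - 1][col - 1]


@dataclass
class AnnotatedTable:
    """Annotated model: the core table plus annotations keyed by column index."""
    table: Table
    # Hypothetical annotation store: column index -> {"name": ...}
    column_annotations: dict[int, dict[str, str]] = field(default_factory=dict)

    def column_name(self, col: int) -> Optional[str]:
        # A column always has an index; it has a name only if one was annotated.
        return self.column_annotations.get(col, {}).get("name")


core = Table(rows=[["x1", "y1"], ["x2", "y2"]])
annotated = AnnotatedTable(core, {1: {"name": "code"}, 2: {"name": "label"}})
```

In this sketch a table without any annotations is still fully addressable, which is the point of moving the name out of the core model.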

<JeniT> http://w3c.github.io/csvw/syntax/#annotated-model

JeniT: the annotated model talks about the different types of annotations (tables, column level, etc)
... these are the changes I made
... the first problem is issue 1
... is the order of rows significant in a table?

fresco: in the case you are using several tables in the same place then it may be a problem

JeniT: in our model we have now one table only

yakovsh: if we follow the spreadsheet model the row order is specifically significant

JeniT: agreed

<fresco> good point that references to individual cells rely on the row and column order being maintained

stasinos: one of the ways we are discussing is an ID which then has the properties of the fields in a row
... we can then make the requirement for that to represent a strict order

yakovsh: rfc 7111 has row and column level references

<yakovsh> http://tools.ietf.org/html/rfc7111

JeniT: any reason not to have it significant?

(no reaction...)

JeniT: in that case I will add it in
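
The dependence of positional references on row order can be sketched as follows. The `select` helper is hypothetical and handles only a simplified subset of the `row=`, `col=` and `cell=` fragment forms associated with RFC 7111 (single values and a simple row range, counted from 1); the RFC itself should be consulted for the full grammar.

```python
import csv
import io


def select(fragment, rows):
    """Resolve a simplified row=/col=/cell= fragment against parsed rows."""
    kind, _, value = fragment.partition("=")
    if kind == "row":
        start, _, end = value.partition("-")
        first = int(start)
        last = int(end) if end else first
        return rows[first - 1:last]
    if kind == "col":
        index = int(value) - 1
        return [row[index] for row in rows]
    if kind == "cell":
        # Assumed order: row first, then column.
        r, c = (int(part) for part in value.split(","))
        return rows[r - 1][c - 1]
    raise ValueError(f"unsupported fragment: {fragment!r}")


text = "id,name\r\n1,alpha\r\n2,beta\r\n"
rows = list(csv.reader(io.StringIO(text, newline="")))

# Same rows, different order: the same fragment now points at other data.
reordered = [rows[0], rows[2], rows[1]]
```

Here `select("row=2", rows)` and `select("row=2", reordered)` return different rows, which is why such references only make sense if row order is treated as significant.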

JeniT: next issue: in the annotated data model I have annotating table, column, row, cells
... I also have annotated regions
... I have a suspicion I put it in because it looked like a useful generalization
... do we really need it, or should I take it out?
... any use case around that?

stasinos: one thing, i remember we were discussing a situation of a cell being, say, the sum of other cells
... in that case there is the notion of a region
... in that case we may want to talk about general regions

JeniT: that is interesting because it brings up referencing
... referencing cells in a random manner, in a way
... this is different from the current spec which talks about subtables
... I think that is a useful thing to say

stasinos: it is hard to tell, without a real use case

fresco: referencing should be better as part of a separate specification

stasinos: that is reasonable

JeniT: in rfc 7111 we have that notion

<yakovsh> should RFC 7111 be referenced in our documents? I don't see it

<fresco> referencing abstract "table" regions (i.e. data parsed from a CSV file) vs referencing parts of a CSV file

ivan: do we really want to represent spreadsheet functionalities

stasinos: it is not the reproduction of the functionalities, it is just to characterize the raw data itself

<stasinos> to be able to specify which regions have data, and which have derivatives (of any sort)
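
One way to picture the data/derivative distinction is a rectangular region carrying a role annotation. This is only a sketch under assumptions: the `Region` class and the `"data"`/`"derived"` role vocabulary are invented for illustration and were not proposed in the meeting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangular block of cells; bounds are inclusive and 1-based."""
    first_row: int
    first_col: int
    last_row: int
    last_col: int
    role: str  # hypothetical vocabulary: "data" or "derived"

    def cells(self, rows):
        """Yield the cell values covered by this region, row by row."""
        for r in range(self.first_row, self.last_row + 1):
            for c in range(self.first_col, self.last_col + 1):
                yield rows[r - 1][c - 1]


rows = [["a", "2"], ["b", "3"], ["total", "5"]]
data = Region(1, 2, 2, 2, role="data")      # the raw values in column 2
total = Region(3, 2, 3, 2, role="derived")  # the cell holding their sum

data_sum = sum(int(value) for value in data.cells(rows))
total_value = int(next(total.cells(rows)))
```

The annotation does not reproduce the spreadsheet formula; it only records that the last cell is derived, so a consumer can exclude it or check it against the data region.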

<fresco> requirement for row headings as well as column headings, to be able to say that a row is a derivative?

JeniT: what I will do is to put in a note that we talked about annotation regions, there may be some usage, but we can refer this to more use cases
... is that a reasonable way forward?



<stasinos> +1

<yakovsh> +1

<DavideCeolin> +1

fresco: the parser can draw up an index, and you can have headers and rows, ie, you may want to specify the nature of rows and columns

JeniT: we do have annotated rows and columns

<JeniT> http://w3c.github.io/csvw/syntax/#syntax

JeniT: next issue in section 3
... what I have done is to cut it down, it does not talk about how to input tabular data, but only how to output
... that is the best practice of tabular data

JeniT: we are only looking at the output for best practice
... issue 4 tries to make it as rfc compatible as possible
... if we use the mime type, that refers to the default and usual character sets
... the issue is that we would like to say that utf-8 is the default

yakovsh: I have discussed with the area directors and it may be possible to amend the draft
... if there are specific suggestions for character sets
... there is also the issue of cr and lf
... i do not know about the default character set
... it is definitely possible to have a default character set if we get guidance from W3C
... the issue currently says that the content type header must be used to set the character set
... I know that people do not change content type anyway, let alone changing the character set
... so it would be great if utf-8 would be default
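
A minimal sketch of the behaviour being asked for: honour a `charset` parameter on the Content-Type header when one is present, otherwise assume UTF-8. The UTF-8 fallback is the group's wish as discussed here, not something the existing media type registration is claimed to guarantee; the function names are illustrative.

```python
from email.message import Message


def csv_charset(content_type=None, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or the default."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset() or default


def decode_csv(body, content_type=None):
    """Decode raw CSV bytes using the declared charset, else the default."""
    return body.decode(csv_charset(content_type))
```

With this fallback, a server that sends plain `text/csv` without touching the character set still yields correctly decoded UTF-8 data.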

<JeniT> Section 4.1.1 of RFC2046 specifies that "The canonical form of any MIME "text" subtype must always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" must represent a line break. Use of CR and LF outside of line break sequences is also forbidden."

<fresco> application/csv sounds like a good idea to me

JeniT: if we want to say that it is o.k. to use LF, then we have a problem using text/csv...

fresco: the old spec was ascii but all the parsers ignored that
... but the newer parsers fall back on utf8
... ie, the specification could get away with using utf8
... most people ignore the original spec

JeniT: according to the rfc we should not call that text/csv, only application/csv
... the line ending is quite clear
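
The line-ending question can be made concrete with Python's `csv` module, which lets the writer's line terminator be chosen explicitly. This is a sketch only: it shows that both forms are easy to produce and to read back, not which of them a given media type permits.

```python
import csv
import io

records = [["id", "name"], ["1", "alpha"]]


def write_csv(rows, lineterminator):
    """Serialise rows with an explicit line terminator."""
    buffer = io.StringIO(newline="")  # keep line endings exactly as written
    csv.writer(buffer, lineterminator=lineterminator).writerows(rows)
    return buffer.getvalue()


crlf_output = write_csv(records, "\r\n")  # CRLF line breaks
lf_output = write_csv(records, "\n")      # bare LF line breaks

# The reader accepts either form and yields the same rows.
parsed_crlf = list(csv.reader(io.StringIO(crlf_output, newline="")))
parsed_lf = list(csv.reader(io.StringIO(lf_output, newline="")))
```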

yakovsh: rfc 4180 was passed with the old mime guidelines, but those changed
... it is possible to change that
... i will go back and see what is involved
... i think it can be changed
... question: is there a byte order mark if the default is utf8?

JeniT: bom is usually optional with utf8
... you do not usually have to use it
... in practice, if you use it, you get horrible characters
... I would like to avoid that
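
The "horrible characters" can be reproduced in a few lines. The sketch below uses Python's standard codecs; the sample payload is made up for the example.

```python
import codecs

payload = codecs.BOM_UTF8 + "id,name\r\n".encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as U+FEFF at the start of the first field.
naive = payload.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM if there is one...
tolerant = payload.decode("utf-8-sig")

# ...and is harmless when there is none.
no_bom = "id,name\r\n".encode("utf-8").decode("utf-8-sig")

# A consumer that wrongly assumes a single-byte encoding sees visible junk.
mojibake = payload.decode("cp1252")
```

A reader that decodes with `utf-8-sig` tolerates both cases, which fits the position that the BOM is optional and best left out on output.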

yakovsh: i will talk to the appl. working group, csv is not the only one that has this issue
... we will discuss that after the ietf meeting
... another question, rfc 4180 is an informational doc, if w3c really wants that ietf could push it through as a fast track

JeniT: yes, it would be good to have a standard for csv
... there have been other cases where the body has been done by w3c
... if we can do this that would be great

ivan: is any formal step necessary from W3C?

yakovsh: no, it should be o.k without it

<fresco> parser parameters: https://github.com/hubgit/csvw/wiki/CSV-Parser-Parameters

parsing tabular data

<JeniT> https://github.com/hubgit/csvw/wiki/CSV-Parser-Notes

fresco: there are also notes looking at the different parameters parsers use
... some of the things people have to specify
... there are 2-3 different sections
... character set, discussed
... dialects of the csv file (separators, white space should be trimmed or not, how to select particular bits of the file to be used, ie, what is a comment line, etc)
... there is also a separate set on how to transform data, that may be a separate issue
... a lot of csv parsers have these transformation fields in them
... one issue is trimming of white space
... one way is the '\', the unix way
... or the quotes, the excel way
... the quoting is in particular for output
... that is something to specify to put things in quotes only if there is a special character in the field
... that is basically it...
... I will clean it up
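
Several of the dialect parameters listed above map directly onto options of Python's `csv` module, so a short sketch may help. The comment-line handling and the sample data are assumptions for illustration; filtering line by line would not be safe for quoted fields that span lines.

```python
import csv
import io

raw = (
    "# exported by a hypothetical tool\n"
    "id; name\n"
    "1; alpha\n"
    '2; "beta; gamma"\n'
)


def read_rows(text, delimiter=";", comment_prefix="#", trim=True):
    """Parse text with explicit dialect parameters, skipping comment lines."""
    lines = (line for line in io.StringIO(text)
             if not line.startswith(comment_prefix))
    # skipinitialspace drops whitespace that directly follows a delimiter.
    return list(csv.reader(lines, delimiter=delimiter, skipinitialspace=trim))


def write_rows(rows, **fmt):
    """Serialise rows with the given csv formatting parameters."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, **fmt).writerows(rows)
    return buffer.getvalue()


rows = read_rows(raw)

# "Excel way": quote a field only when it contains a special character.
excel_style = write_rows(rows, delimiter=";", quoting=csv.QUOTE_MINIMAL)

# "Unix way": never quote, escape the special character with a backslash.
unix_style = write_rows(rows, delimiter=";", quoting=csv.QUOTE_NONE,
                        escapechar="\\")
```

Only the third row differs between the two outputs: it is quoted in one and backslash-escaped in the other, while fields without special characters are written unchanged.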

yakovsh: is there a list of applications that you looked at?

<fresco> https://github.com/hubgit/csvw/wiki/CSV-Parser-Notes

fresco: the big one is a python parser, and pandas in python
... pandas has a lot of transformation, has a multi index with several header columns/rows
... it specifies the decimal and thousands separators
... java has a nice one
... the standard one is the php csv parser, but it really only deals with the standard case
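
What a decimal/thousands separator setting does can be sketched without any parser library. The `parse_number` helper below is hypothetical; its parameter names merely echo the `decimal` and `thousands` options mentioned for pandas.

```python
from decimal import Decimal


def parse_number(text, decimal=".", thousands=""):
    """Normalise a locale-formatted number string, then parse it exactly."""
    cleaned = text.strip()
    if thousands:
        # Grouping separators carry no value; drop them first.
        cleaned = cleaned.replace(thousands, "")
    if decimal != ".":
        cleaned = cleaned.replace(decimal, ".")
    return Decimal(cleaned)


continental = parse_number("1.234,5", decimal=",", thousands=".")
anglophone = parse_number("1,234.5", thousands=",")
```

The same cell text therefore needs the separator settings to be known before it can be read as a number, which is why parsers expose them as parameters.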

<JeniT> https://docs.google.com/a/theodi.org/spreadsheet/ccc?key=0AiswT8ko8hb4dEtOR0x1WkJ3LS1LSm1HQm1YQzZuSHc&usp=sharing

<JeniT> https://github.com/theodi/csv-validation-research

JeniT: have you looked at this one?

fresco: the data package specifies only a few parameters
... there are a few more parameters in common use
... they might become useful

JeniT: how to move that into spec space?
... we could have it a standalone spec

JeniT: or roll it into the syntax spec as a separate section

fresco: on the transformation side it is interesting whether we would use these
... three different types of information like comment and white space would be part of the syntax
... the transformation may be a separate specification
... it leaves us with the region specification

JeniT: the region selection should be part of the specification

stasinos: the question of how to describe a region is another thing
... the syntax doc should have a best practice part

JeniT: my inclination is to roll it into the syntax spec, with a very separate section

<stasinos> ... and a much looser, permissive part describing what can be specified

JeniT: we need a separate spec to convert to, say, json, what we need here is how to convert that into an abstract model
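
The two-stage split described here can be sketched as two independent functions: parsing into a format-neutral model, and a separate conversion from that model to JSON. The model shape (a dict with `columns` and `rows`) is an assumption for this example, not the abstract model of the draft.

```python
import csv
import io
import json


def to_model(text):
    """Stage 1 (parsing): turn CSV text into a format-neutral table model."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    return {"columns": len(rows[0]) if rows else 0, "rows": rows}


def model_to_json(model):
    """Stage 2 (conversion): one possible serialisation of the model."""
    return json.dumps(model, sort_keys=True)


model = to_model("id,name\r\n1,alpha\r\n")
as_json = model_to_json(model)
```

Keeping the stages apart means another target format would only need a new second stage, with the parsing rules left untouched.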

JeniT: AOB?

stasinos: there is a very interesting discussion on the problem of tables transformed into RDF graphs

JeniT: we took the decision with Dan not to discuss this until we have the UC document out
... then, yes, we will get to it
... but we have to have the basic things done

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.138 (CVS log)
$Date: 2014-03-05 14:14:40 $