CSV on the Web Working Group Teleconference -- 19 Feb 2014

danbri: approve last week's minutes?

<danbri> http://www.w3.org/2014/02/12-csvw-minutes.html

danbri: no objections, approved
... same agenda as last week, if anyone else has agenda items please pipe up

AndyS: are we using the Tracker for Issues?

<danbri> goal was to use github issues

Syntax for Tabular Data on the Web

http://w3c.github.io/csvw/syntax/

Jeni: what I've tried to do in this doc is to get a v basic idea of the tabular data model, and of what a syntax for basic tabular data, i.e. for that model, based on csv would look like

with an appendix which looks at existing specs, key implementations of csv, to see what they do

we might use existing specs and implementations as constraints on what we do

there are various issues in there that we could go through

jeni: the tabular data model as i say v basic, … table w/ one or more cols, each named, one or more rows, each row has a field for each col in the table

that's the basic model
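
A minimal sketch of the basic model as described above, assuming Python. The names `Table`, `columns`, `rows` and `add_row` are illustrative and not taken from the draft, and the sample row is made up. The sketch checks only that each row has one field per column; it does not enforce "one or more rows".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Table:
    """Hypothetical container for the basic tabular data model."""

    columns: List[str]  # one or more named columns, kept in order
    rows: List[List[str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.columns:
            raise ValueError("a table needs at least one column")
        for row in self.rows:
            self._check(row)

    def _check(self, row: List[str]) -> None:
        # Each row has exactly one field for each column in the table.
        if len(row) != len(self.columns):
            raise ValueError(
                f"expected {len(self.columns)} fields, got {len(row)}"
            )

    def add_row(self, row: List[str]) -> None:
        self._check(row)
        self.rows.append(row)


table = Table(columns=["country", "year", "population"])
table.add_row(["AD", "2013", "79218"])  # made-up sample values
```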

1st issue there is whether order of cols is significant

thoughts?

andys: i'd have thought it was quite significant

often they're grouped together, for comprehensibility reasons

sometimes column naming used as a hint, … eg. with years

i think there's an intention underlying that they proceed from left to right

dan: order needed if col names missing

danbri: doc order can be administratively useful (as rdf/xml etc)

<fresco_> i'd like to see an example of data that would have problems if the columns were re-ordered

jeni: […] if not preserved, conseq would be that it would be possible for an impl to read in a file and write it out in a different order conformantly

andys: some json reprs e.g. gregg's could lose order

jeni: any mappings to other formats could have this issue

timfinin: wrt ordering, systems that try to infer semantics of tables, order is a strong heuristic

… e.g. 1st column much more likely to be a key for the table, adj of columns helps inferring relationships between columns

jeni: another good reason

resolved: order of columns should be significant
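
A small sketch of the concern raised about mappings to other formats, assuming Python's standard `csv` and `json` modules. The sample data is made up, and `sort_keys=True` merely stands in for any consumer that treats a JSON object as unordered; it is not a claim about any particular proposal.

```python
import csv
import io
import json

text = "name,2012,2013\nwidgets,10,12\n"  # made-up sample
reader = csv.reader(io.StringIO(text))
header = next(reader)
rows = list(reader)

# Order-preserving representation: a header list plus positional rows.
ordered = {"columns": header, "rows": rows}

# Order-losing representation: each row becomes a JSON object whose keys
# are written out sorted, so the original left-to-right order is gone.
lossy = [json.dumps(dict(zip(header, row)), sort_keys=True) for row in rows]
recovered = list(json.loads(lossy[0]).keys())
```

After the round trip `recovered` no longer matches `header`, which is the kind of conformant reordering the resolution rules out.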

next issue is, issue 2 -

jenit: in SQL, every column has a type associated with it, … should we assume same within our tabular data model?

or have that as a separate layer

dan: how would we answer this?

jenit: it's a design choice. for example in xml, … originally most values were not typed, then xmlschema layered on top

… compare w/ json, basics built-in

… in most data formats, we care about particular types, if you're passing around data you should care about the values

timfinin: two notions of type, … low level datatype; integer, date, … other is a semantic type

… it would be v interesting to support adding semantic types as help for someone trying to use that table

so instead of being a mere string, is a person; or a musicalartist.

<jtandy> agree with TimFinin

if we had something like that, it wouldn't be like a schema for validation, but more info for someone trying to understand what the table is

[poor audio?]

konstant: wrt data format, … can have header saying if something has an integer, float etc.

… wrt semantic typing per tim's comment, 1st of all […] what cell is supposed to mean

<TimFinin> maybe we cld have any number of header rows. One might give simple datatypes. another might give column names. another might give URIs to semantic types
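
One possible reading of the multiple-header-row idea, sketched in Python. The order of the header rows, the role names in `HEADER_ROLES`, the function `read_with_headers` and the sample data are all assumptions made for illustration; nothing here was agreed on the call.

```python
import csv
import io

# Assumed order of the header rows; a real syntax would have to declare this.
HEADER_ROLES = ["name", "datatype", "semantic_type"]


def read_with_headers(text, roles=HEADER_ROLES):
    """Read one header row per role, then return (columns, data rows)."""
    reader = csv.reader(io.StringIO(text))
    headers = {role: next(reader) for role in roles}
    columns = [
        # An empty header cell means "not stated" for that column.
        {role: headers[role][i] or None for role in roles}
        for i in range(len(headers["name"]))
    ]
    return columns, list(reader)


sample = (
    "artist,born\n"
    "string,date\n"
    "http://xmlns.com/foaf/0.1/Person,\n"
    "Alice,1990-05-17\n"  # made-up data row
)
columns, data = read_with_headers(sample)
```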

dan: q is whether we consider all csvs to have [homogenously] typed columns (we can always add that via external files)

<TimFinin> if we allow a *any* type it might help

ivan: two diff things. for me, mainly as we have all these semantic types, for me this part of the metadata we'll define. whether we assign a type, ...

for a column, it's metadata. that simplifies the treatment, management of the whole thing

<TimFinin> are column header names considered metadata?

… also a q: reality of what's out there. What do Excel, OpenOffice, etc do?

do they recognise basic datatypes, as json does, or do they turn everything into strings?

andys: they spot numbers

jenit: and dates

ivan: so there are a number of datatypes that they spot automatically

<AndyS> Locale sensitive as well. 1/4 confuses : "1 April" vs 0.25.

Eric: a huge problem in scientific arena. If you're importing into a s/sheet

and if it detects in a cell something that [happens to] look like a date

so you sometimes have to engineer around this, to protect against the spreadsheet tool guessing badly

sometimes too smart
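
To illustrate the "1/4" ambiguity mentioned above, a hedged Python sketch of two plausible guesses for the same cell. The functions `guess_as_number` and `guess_as_date` are hypothetical and do not model how any particular spreadsheet behaves; the year must be supplied because the cell does not carry one.

```python
from datetime import date
from fractions import Fraction


def guess_as_number(cell):
    """Read the cell as a number or fraction, or return None."""
    try:
        return float(Fraction(cell))
    except (ValueError, ZeroDivisionError):
        return None


def guess_as_date(cell, year, day_first=True):
    """Read 'a/b' as a day and month in the given year, or return None."""
    parts = cell.split("/")
    if len(parts) != 2 or not all(p.isdecimal() for p in parts):
        return None
    a, b = int(parts[0]), int(parts[1])
    day, month = (a, b) if day_first else (b, a)
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13
        return None


cell = "1/4"
as_number = guess_as_number(cell)                          # 0.25
as_day_first = guess_as_date(cell, 2014)                   # 1 April 2014
as_month_first = guess_as_date(cell, 2014, day_first=False)  # 4 January 2014
```

All three readings are defensible, so a tool that picks one silently can corrupt the data.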

jenit: good point

<fresco_> http://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-excel-lessons-not-learned/

<Zakim> danbri, you wanted to suggest some but not all columns MAY share a fixed type for whole column; but some cols are chaotic.

<fresco_> http://www.biomedcentral.com/1471-2105/5/80

jenit: some more issues, but i'll make a redraft based on this, … probably with a 2 layer model

<AndyS> For semantic types, avoid "DanBri"^^foaf:Person

<AndyS> or "DanBri" rdf:type foaf:Person

ivan: one comment on list (maybe jtandy); in one csv file you often have several tables; is this something we're even considering?

jenit: good point, a lot of our examples have required multiple tables in some ways associated with each other

it is useful for this doc to somehow recognise that; then we can move on to discuss how those sep tables can be expressed

e.g. in one table, zipped etc

<fresco_> tables within tables: http://dx.doi.org/10.7717/peerj.259/table-3

jtandy: the comment I made, .. often multiple CSVs are packaged as a dataset in a zipfile, each text file represents a facet of the dataset

ivan: that's friendlier than several tables in one file
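
A sketch of the zip packaging jtandy describes, assuming Python's standard `zipfile` and `csv` modules. The function name `read_csv_package`, the file names, the UTF-8 encoding and the sample contents are illustrative assumptions.

```python
import csv
import io
import zipfile


def read_csv_package(zip_bytes):
    """Return {file name: list of rows} for every .csv member of a zip."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                with archive.open(name) as raw:
                    # Assumes UTF-8; a real package would need to say so.
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    tables[name] = list(csv.reader(text))
    return tables


# Build a small in-memory package with made-up contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("stations.csv", "id,name\n1,Exeter\n")
    archive.writestr("observations.csv", "station,temp\n1,9.5\n")

package = read_csv_package(buffer.getvalue())
```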

<Zakim> AndyS, you wanted to talk about multiple tables.

andys: what i'd like to see … data syntax format pointing to a region of a csv file, …

<konstant> konstant is "stasinos" and promises to change his nick

…orig to be able to id the data parts from the presentational surround

<konstant> no prob

andys: on mult tables, … sometimes it is written, there really are two tables there, but flattened in a dump
... eg. regions + sales items packaged together

… gregg talked about this 'denormalization dumping effect'

konstant: I'm not really sure why ivan is so worried about multiple tables in the same file

… we also have cols w/ diff types, diff rows, complex interdependencies, all kinds of [other] ugliness

… if someone dumps multiple tables in one file they'll have some kind of delimiter

… there should be something that is machine describable

ivan: surely true, a minor thing, but let's say the CSV handling toolkit w/ python would break on these things, for eg.
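
A sketch of the "some kind of delimiter" case, assuming a blank line separates the tables. That delimiter, the function `split_tables` and the sample data are assumptions; real files may use something else entirely. Python's `csv.reader` yields an empty list for a blank line, which is what the sketch relies on.

```python
import csv
import io


def split_tables(text):
    """Split CSV text into tables wherever a blank line occurs."""
    tables, current = [], []
    for row in csv.reader(io.StringIO(text)):
        if row:
            current.append(row)
        elif current:
            # Blank line: close the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Made-up dump with a regions table followed by a sales table.
dump = "region,manager\nnorth,Ann\n\nitem,region,units\nbolt,north,40\n"
tables = split_tables(dump)
```

Without such a step, a plain row-by-row read returns rows of differing widths with no indication of where one table ends and the next begins.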

jtandy, ready to talk about Use Cases doc?

<jtandy> yes - quick update today

ivan: there are more complications out there than i expected, that's all!

Use Cases & Requirements

jtandy: I've created the boilerplate document

<danbri> http://lists.w3.org/Archives/Public/public-csv-wg/2014Feb/0072.html

<danbri> http://w3c.github.io/csvw/use-cases-and-requirements/

jtandy: abstract, intro etc are there
... yet to add use cases
... Alf has been working with Davide on providing more detailed examples with supporting datasets, so thanks to them
... I will work through those shortly
... email from Juan about CSV2RDF based on getting data out of relational databases
... but there's no use case for CSV publication from relational databases
... as yet

<danbri> juan: http://lists.w3.org/Archives/Public/public-csv-wg/2014Feb/0058.html

jtandy: re Gregg's CSV-LD proposal, it implies a bunch of use cases, but I'm not sure how many of those we're picking up

<danbri> gregg: http://lists.w3.org/Archives/Public/public-csv-wg/2014Feb/0000.html

jtandy: I'm looking for people to provide specific examples of data that we can stitch together with a narrative
... I'm putting one together myself around the Met Office data which we can use as an example

jtandy: so if you have got a use case, please provide a narrative & datasets for them
... there are details on the wiki under use case analysis

<danbri> https://www.w3.org/2013/csvw/wiki/Use_Cases

jtandy: I will ping people explicitly on the mailing list as I work on the use case
... I'm concerned that the use cases don't yet cover the full scope of what we want to achieve
... we need use cases to hang requirements on

danbri: are there any use cases promised but not delivered?

jtandy: I don't think so, but what we have doesn't match the scope of the requirements people are bringing up

<TimFinin> i volunteered to give a use case for using CSV to exchange data in text information extraction systems

EricStephan: I've tried contributing use cases

<DavideCeolin> I haven't shared a new version of the police data analysis use case yet but I'm close to having it done

<danbri> davide, think you'll do that before next week's call?

EricStephan: I'm seeing in scientific formats, a basic format of header, then delimited data
... I don't know how you want to organise those
... I contributed two more this morning along those lines

<DavideCeolin> danbri, yes hopefully in a couple of days max

jtandy: I haven't seen those, but I'll look at them and follow up on the list

EricStephan: also, there were some contributions around the NetCDF format
... also uncertainty quantification
... eg simulations that change one parameter
... which gives multiple CSVs
... I can elaborate around these and more complex examples if that would be helpful

jtandy: yes please
... One interesting thing is how we deal with missing values
... eg people using -999

<EricStephan> good point

jtandy: that's an example where we can make sure the syntax deals with that
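
A sketch of handling sentinel values such as -999, assuming Python. The set `MISSING_MARKERS`, the function `read_with_missing` and the sample data are hypothetical; how a file would declare its own markers is exactly the open question.

```python
import csv
import io

# Assumed markers; a real dataset would need to declare its own.
MISSING_MARKERS = {"-999", ""}


def read_with_missing(text, missing=MISSING_MARKERS):
    """Return (header, rows) with missing-value markers replaced by None."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [
        [None if cell in missing else cell for cell in row]
        for row in reader
    ]
    return header, rows


# Made-up sample: station B has no temperature reading.
header, rows = read_with_missing("station,temp\nA,12.5\nB,-999\n")
```

Without the declaration, a consumer has no way to tell a missing reading from a genuine value of -999.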

danbri: I met last week with colleagues working on Fusion Tables
... it's possible we can ask questions about the CSVs in use on the web
... eg about what line endings are used
... or whether -999 happens often
... so if you have questions like that send them my way and I'll try to answer them
... any other business?

AOB

ivan: I saw Jeremy tested in Excel

<danbri> i might not be here next week (middle of california trip)

ivan: are tests in other tools useful?

JeniT: yes please

danbri: good to have these test files
... Scribe volunteer?

<konstant> a-ha

<konstant> ok, ok

<konstant> "volunteer"

<danbri> thanks, 'volunteer'

<danbri> :)

konstant will scribe next week
