See also: IRC log
danbri: approve last week's minutes?
danbri: no objections, approved
... same agenda as last week, if anyone else has agenda items please pipe up
AndyS: are we using the Tracker for Issues?
<danbri> goal was to use github issues
Jeni: what I've tried to do in this doc is to get a v basic idea of tabular data model
and what a syntax for basic tabular data, ie. for that model, based on csv would look like
with an appendix which looks at existing specs, key implementations of csv, to see what they do
we might use existing specs and implementations as constraints on what we do
there are various issues in that we could go through
jeni: the tabular data model as i say v basic, … table w/ one or more cols, each named, one or more rows, each row has a field for each col in the table
that's the basic model
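The basic model described above could be sketched as follows. This is only an illustration under stated assumptions: the class name Table and its methods are hypothetical, not taken from the draft, and the "one or more rows" condition is not enforced here.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical in-memory form of the basic model: named columns plus rows."""

    columns: list[str]  # one or more column names, kept in document order
    rows: list[list[str]] = field(default_factory=list)  # one field per column

    def __post_init__(self) -> None:
        if not self.columns:
            raise ValueError("a table needs at least one column")
        for row in self.rows:
            self._check(row)

    def _check(self, row: list[str]) -> None:
        # Each row must have exactly one field for each column in the table.
        if len(row) != len(self.columns):
            raise ValueError(
                f"expected {len(self.columns)} fields, got {len(row)}"
            )

    def add_row(self, row: list[str]) -> None:
        self._check(row)
        self.rows.append(row)

    def field_at(self, row_index: int, column: str) -> str:
        # Look a field up by row position and column name.
        return self.rows[row_index][self.columns.index(column)]
```

Fields are kept as plain strings here, which leaves the typing question to a separate layer.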
1st issue there is whether order of cols is significant
andys: i'd have thought it was quite significant
often they're grouped together, for comprehensibility reasons
sometimes column naming used as a hint, … eg. with years
i think there's an intention underlying that they proceed from left to right
dan: order needed if col names missing
danbri: doc order can be administratively useful (as rdf/xml etc)
<fresco_> i'd like to see an example of data that would have problems if the columns were re-ordered
jeni: […] if not preserved, conseq would be that it would be possible for an impl to read in a file and write it out in a different order conformantly
andys: some json reprs e.g. gregg's could lose order
jeni: any mappings to other formats could have this issue
timfinin: wrt ordering, systems that try to infer semantics of tables, order is a strong heuristic
… e.g. 1st column much more likely to be a key for the table, adj of columns helps inferring relationships between columns
jeni: another good reason
resolved: order of columns should be significant
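A small sketch of what the resolution means for an implementation, using Python's standard csv module. The sample data and the function names are made up for illustration: reading and writing must keep the columns in place, while a tool that sorts columns by header would change a table in which order is significant.

```python
import csv
import io

SOURCE = "region,2011,2012,2013\r\nnorth,5,7,9\r\n"  # made-up sample data


def parse(text):
    """Read CSV text into a list of rows, each a list of strings."""
    return list(csv.reader(io.StringIO(text, newline="")))


def serialise(rows):
    """Write rows back out; csv.writer ends lines with \\r\\n by default."""
    out = io.StringIO(newline="")
    csv.writer(out).writerows(rows)
    return out.getvalue()


def reorder_by_header(rows):
    """Sort columns alphabetically by header, as an order-insensitive tool might."""
    order = sorted(range(len(rows[0])), key=lambda i: rows[0][i])
    return [[row[i] for i in order] for row in rows]
```

With this sample, serialise(parse(SOURCE)) gives back the original text, whereas reorder_by_header moves the "region" column, which reads like a key, from first place to last.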
next issue is, issue 2 -
jenit: in SQL, every column has a type associated with it, … should we assume same within our tabular data model?
or have that as a separate layer
dan: how would we answer this?
jenit: it's a design choice. for example in xml, … originally most values were not typed, then xmlschema layered on top
… compare w/ json, basics built-in
… in most data formats, we care about particular types, if you're passing around data you should care about the values
timfinin: two notions of type, … low level datatype; integer, date, … other is a semantic type
… it would be v interesting to support adding semantic types as help for someone trying to use that table
so instead of being a mere string, is a person; or a musicalartist.
<jtandy> agree with TimFinin
if we had something like that, it wouldn't be like a schema for validation, but more info for someone trying to understand what the table is
konstant: wrt data format, … can have header saying if something has an integer, float etc.
… wrt semantic typing per tim's comment, 1st of all […] what cell is supposed to mean
<TimFinin> maybe we cld have any number of header rows. One might give simple datatypes. another might give column names. another might give URIs to semantic types
dan: q is whether we consider all csvs to have [homogenously] typed columns (we can always add that via external files)
<TimFinin> if we allow a *any* type it might help
ivan: two diff things. for me, mainly as we have all these semantic types, this is part of the metadata we'll define. whether we assign a type for a column, it's metadata. that simplifies the treatment, management of the whole thing
<TimFinin> are column header names considered metadata?
… also a q: reality of what's out there. What do Excel, OpenOffice, etc do?
do they recognise basic data like json, or they turn everything into strings
andys: they spot numbers
jenit: and dates
ivan: so there's a number of data that they spot automatically
<AndyS> Locale sensitive as well. 1/4 confuses : "1 April" vs 0.25.
Eric: a huge problem in scientific arena. If you're importing into a s/sheet
and if it detects in a cell something that [happens to] look like a date
so you sometimes have to engineer around this, to protect against the spreadsheet tool guessing badly
sometimes too smart
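AndyS's "1/4" example can be made concrete with a short sketch that lists every reading a type-guessing tool might choose for one cell. The function name, the labels and the default year used to complete the date are arbitrary choices for illustration.

```python
from datetime import datetime
from fractions import Fraction


def interpretations(cell: str, year: int = 2014) -> dict:
    """Return every reading of `cell` that a type-guessing tool might pick."""
    found = {}
    # The same text parses as a date under both day-first and month-first locales.
    for label, fmt in (("day/month", "%d/%m/%Y"), ("month/day", "%m/%d/%Y")):
        try:
            found[label] = datetime.strptime(f"{cell}/{year}", fmt).date()
        except ValueError:
            pass
    # It may also be a plain number written as a fraction or a decimal.
    try:
        found["number"] = float(Fraction(cell))
    except (ValueError, ZeroDivisionError):
        pass
    return found
```

For "1/4" all three readings succeed (1 April, 4 January and 0.25), so the cell text alone cannot settle the type; that has to come from metadata or from the publisher.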
jenit: good point
<Zakim> danbri, you wanted to suggest some but not all columns MAY share a fixed type for whole column; but some cols are chaotic.
jenit: some more issues, but i'll make a redraft based on this, … probably with a 2 layer model
<AndyS> For semantic types, avoid "DanBri"^^foaf:Person
<AndyS> or "DanBri" rdf:type foaf:Person
ivan: one comment on list (maybe jtandy); in one csv file you often have several tables; is this something we're even considering
jenit: good point, a lot of our examples have required multiple tables in some ways associated with each other
it is useful for this doc to somehow recognise that; then we can move on to discuss how those sep tables can be expressed
e.g. in one table, zipped etc
<fresco_> tables within tables: http://dx.doi.org/10.7717/peerj.259/table-3
jtandy: the comment I made, .. often multiple CSVs are packaged as a dataset in a zipfile, each text file represents a facet of the dataset
ivan: that's friendlier than several tables in one file
<Zakim> AndyS, you wanted to talk about multiple tables.
andys: what i'd like to see … data syntax format pointing to a region of a csv file, …
<konstant> konstant is "stasinos" and promises to change his nick
…orig to be able to id the data parts from the presentational surround
<konstant> no prob
andys: on mult tables, … sometimes it is written, there really are two tables there, but flattened in a dump
... eg. regions + sales items packaged together
… gregg talked about this 'denormalization dumping effect'
konstant: I'm not really sure why ivan so worried about multiple tables in the same file
… we also have cols w/ diff types, diff rows, complex interdependencies, all kinds of [other] ugliness
… if someone dumps multiple tables in one file they'll have some kind of delimiter
… there should be something that is machine describable
ivan: surely true, a minor thing, but let's say the CSV handling toolkit w/ python would break on these things, for eg.
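To illustrate the point about delimiters, here is a minimal sketch that assumes the tables in one file are separated by blank lines. That separator, the sample data and the function name are assumptions; real files may use other conventions. Python's csv.reader on its own would simply return all the rows as one stream, with an empty list for each blank line.

```python
import csv
import io

# Made-up sample: two tables in one file, separated by a blank line.
SAMPLE = (
    "region,manager\r\n"
    "north,A\r\n"
    "\r\n"
    "item,region,units\r\n"
    "widget,north,3\r\n"
)


def split_tables(text):
    """Group rows into tables, starting a new table after each blank line."""
    tables, current = [], []
    for row in csv.reader(io.StringIO(text, newline="")):
        if not row:  # csv.reader yields [] for a blank line
            if current:
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:
        tables.append(current)
    return tables
```

The split only works because the delimiter is known in advance, which is the kind of machine-describable information konstant asks for.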
jtandy, ready to talk about Use Cases doc?
<jtandy> yes - quick update today
ivan: there are more complications out there than i expected, that's all!
jtandy: I've created the boilerplate document
jtandy: abstract, intro etc are there
... yet to add use cases
... Alf has been working with Davide on providing more detailed examples with supporting datasets, so thanks to them
... I will work through those shortly
... email from Juan about CSV2RDF based on getting data out of relational databases
... but there's no use case for CSV publication from relational databases
... as yet
jtandy: re Gregg's CSV-LD proposal, it implies a bunch of use cases, but I'm not sure how many of those we're picking up
jtandy: I'm looking for people to provide specific examples of data that we can stitch together with a narrative
... I'm putting one together myself around the Met Office data which we can use as an example
jtandy: so if you have got a use case, please provide a narrative & datasets for them
... there are details on the wiki under use case analysis
jtandy: I will ping people explicitly on the mailing list as I work on the use case
... I'm concerned that the use cases don't yet cover the full scope of what we want to achieve
... we need use cases to hang requirements on
danbri: are there any use cases promised but not delivered?
jtandy: I don't think so, but what we have doesn't match the scope of the requirements people are bringing up
<TimFinin> i volunteered to give a use case for using CSV to exchange data in text information extraction systems
EricStephan: I've tried contributing use cases
<DavideCeolin> I haven't shared a new version of the police data analysis use case yet but I'm close to having it done
<danbri> davide, think you'll do that before next week's call?
EricStephan: I'm seeing in scientific formats, a basic format of header, then delimited data
... I don't know how you want to organise those
... I contributed two more this morning along those lines
<DavideCeolin> danbri, yes hopefully in a couple of days max
jtandy: I haven't seen those, but I'll look at them and follow up on the list
EricStephan: also, there were some contributions around the NetCDF format
... also uncertainty quantification
... eg simulations that change one parameter
... which gives multiple CSVs
... I can elaborate around these and more complex examples if that would be helpful
jtandy: yes please
... One interesting thing is how we deal with missing values
... eg people using -999
<EricStephan> good point
jtandy: that's an example where we can make sure the syntax deals with that
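One way a reader could deal with sentinel values such as -999 is sketched below. The set of markers would have to come from metadata or from the publisher; the set used here, the sample data and the function name are assumptions for illustration only.

```python
import csv
import io

# Assumed markers for "no value"; real datasets would have to declare theirs.
MISSING_MARKERS = frozenset({"-999", ""})

SAMPLE = "station,temp\r\nA,12.5\r\nB,-999\r\n"  # made-up sample data


def read_with_missing(text, markers=MISSING_MARKERS):
    """Return (header, rows), with declared missing-value markers mapped to None."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    rows = [
        [None if cell.strip() in markers else cell for cell in row]
        for row in reader
    ]
    return header, rows
```

Without a declared marker, -999 is indistinguishable from a genuine measurement, so the reader cannot safely guess it.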
danbri: I met last week with colleagues working on Fusion Tables
... it's possible we can ask questions about the CSVs in use on the web
... eg about what line endings are used
... or whether -999 happens often
... so if you have questions like that send them my way and I'll try to answer them
... any other business?
ivan: I saw Jeremy tested in Excel
<danbri> i might not be here next week (middle of california trip)
ivan: are tests in other tools useful?
JeniT: yes please
danbri: good to have these test files
... Scribe volunteer?
<konstant> ok, ok
<danbri> thanks, 'volunteer'
konstant will scribe next week