CSVW weekly, Feb 12 2014

12 Feb 2014




Present: Jeremy Tandy (jtandy), Jeni Tennison (JeniT), Andy Seaborne (AndyS), Alf Eaton (fresco), Dan Brickley (danbri), Axel Polleres (AxelPolleres), Davide Ceolin (davideceolin), Stasinos Konstantopoulos (konstant)
Chair: Jeni Tennison
Scribe: Dan Brickley


<JeniT> Approval of http://www.w3.org/2014/02/05-csvw-minutes.html ?

resolved: approved previous meeting minutes: http://www.w3.org/2014/02/05-csvw-minutes.html


Use cases and requirements

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_document_outline

jtandy: I … documents which I've started

<JeniT> https://www.w3.org/2013/csvw/wiki/Analysis_of_use_cases

… both of these live on the wiki currently

drafts are in wiki to get started. I went through the use case docs from the previous WGs - SKOS, OWL, etc., and pulled out some things which I thought we ought to cover.

in terms of document headings, typical w3c parts, ToC, abstract etc. Once we get into use cases, a q:

some docs do use cases AND user stories. Are we happy just with use cases?

danbri, jenit: happy

jtandy: implication is that use cases will have both a narrative style and technical content closer together

… pulling out the things people are trying to do just with example

jenit: I'm happy with that. It can be a tricky distinction, we just need to get down to some practical examples

jtandy: ok so practical examples + a narrative. for each one of the detailed use cases in the doc we'll want a ref to the contributor, and a ref to a complete description (most likely link to our wiki)

… will want to hyperlink to specific requirements

jenit: why separate use case descriptions?

jtandy: I'd expect the actual W3C spec doc will be somewhat clipped, and likely we'll have full details in wiki

jenit: fine, so long as self-contained within the doc

jtandy: yes. for e.g., some of my use cases are complex, don't want to pollute doc

…embedded in the doc will be requirements

jtandy: I'm aiming for something like 8

<JeniT> jtandy: in some documents there are informative use cases

jenit: i'm not sure

… a particular reason to do that

either a use case provides requirements or it doesn't; if not we shouldn't care about it

jenit: (re 8…) that we shouldn't constrain the number but seems about right

danbri: i'm poking around for both Google and schema.org use cases

<jtandy> I'll try to resolve ASAP

<JeniT> https://www.w3.org/2013/csvw/wiki/Analysis_of_use_cases

Analysis of use cases

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_Cases#Publication_of_Data_by_the_UK_Land_Registry

AndyS: Re publication of data by the land registry.

UK land registry keep title on property in england and wales

diff system in scotland, diff org, regime etc.

a couple of things: price paid data, … every time there is a property or land transaction in england or wales, then it is recorded by the land registry, they have a monthly publication cycle

about 350 million triples (internally quads); driven by a process that already existed that was producing csv files

so there is a relationship between those csv files and what is now linked data

essentially a diff vs previous month

[silence] … and deletions that can happen for various admin reasons

marked by columns abcd etc, … code lists are an important aspect

just looked at as csv it is not data, but a difference on the data

which affects the meaning of columns
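
As a rough illustration of the "difference on the data" point, a minimal Python sketch of applying a monthly change file to the previous state might look like the following. The column names `id` and `status` and the codes A (add), C (change) and D (delete) are assumptions made for this example; the discussion only says that changes are marked by column codes.

```python
import csv
import io

def apply_delta(current, delta_csv, key="id", status="status"):
    """Apply a change file to a dict of rows keyed by `key`.

    Assumed (hypothetical) status codes: A = add, C = change, D = delete.
    """
    result = dict(current)
    for row in csv.DictReader(io.StringIO(delta_csv)):
        code = row.pop(status)
        if code == "D":
            # A deletion removes the record if it is present.
            result.pop(row[key], None)
        elif code in ("A", "C"):
            # Additions and changes both store the latest version of the row.
            result[row[key]] = row
        else:
            raise ValueError(f"unknown status code: {code!r}")
    return result

previous = {
    "t1": {"id": "t1", "price": "100000"},
    "t2": {"id": "t2", "price": "250000"},
}
delta = "id,price,status\nt2,255000,C\nt3,90000,A\nt1,,D\n"
updated = apply_delta(previous, delta)
```

Read on its own, the delta file looks like three ordinary rows; only the status column shows that one of them removes a record rather than describing one.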

each row has meaning given to it by parts of the process

info in it is at diff levels of authority

… not verified by land registry; price isn't checked but generally correct

jenit: what does this mean for requirements?

andys: the quality of the csv is pretty good, comes from a data warehouse, in terms of syntax it would conform to what you've called CSV Plus

they publish both with and without column headings, due to different needs

escaping and interesting chars - occasionally a problem

i don't think any char code problems, either english or welsh

from absolute syntax level, … high quality

in terms of introducing modeling (in conjunction w/ land registry), it is quite difficult to go in and say what this data means

data only goes back to '95 because structures changed then

even in today's process, there have been subtle shifts in meaning, takes some internal investigation to figure things out

even though they have a well org'd data dictionary

despite all their good practices, still needs a knowledge capture effort

<JeniT> ack

jtandy: as i was going through andys's use case for requirements doc, I tried to pull out requirements

key seemed to be: automated transform of csv into rdf, by automated i mean having a generic way to do it,

andys: the land registry did write a custom convertor

jtandy: but arguably we should be in a position where there's a generic transformation mechanism

andys: they would've been delighted if such a thing existed. they needed to do this at scale, the tools were not up to date

andys: the bulk conversion is relatively easy part,

<AxelPolleres> Naively, I guess many people here would think CSV2RDF should be just a "dialect/small modification" of the existing RDB2RDF spec, or no?

jtandy: we need a machine readable mechanism to associate rich semantics - e.g. rdf properties - with cols and rows of a csv file
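
One way to picture such a mechanism is a small mapping from column names to property IRIs that a generic converter could read. The sketch below is only an illustration under stated assumptions: the `example.org` IRIs, the column names and the function `rows_to_triples` are hypothetical, and no format for the mapping was agreed on the call.

```python
import csv
import io

# Hypothetical mapping from column names to RDF property IRIs.
COLUMN_PROPERTIES = {
    "price": "http://example.org/def/pricePaid",
    "date": "http://example.org/def/transactionDate",
}

def rows_to_triples(csv_text, id_column, base):
    """Yield (subject, predicate, object) tuples for each mapped, non-empty cell."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = base + row[id_column]
        for column, prop in COLUMN_PROPERTIES.items():
            value = row.get(column)
            if value:  # skip empty or missing cells
                yield (subject, prop, value)

data = "id,price,date\nt1,100000,2014-01-15\nt2,250000,\n"
triples = list(rows_to_triples(data, "id", "http://example.org/transaction/"))
```

The converter itself knows nothing about land registry data; everything specific lives in the mapping, which is the sense of "generic" used above.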

andys: yes some sort of way to link back and talk about what a column, or possibly even a cell, … at that point what I draw out, is that each row is not an entity in itself

<JeniT> AxelPolleres, yes, I think that's an assumption

… if you take all the transactions, one property will be mentioned many times in different rows

because each row is a transaction

so they get mentioned in many places

<JeniT> AxelPolleres, some work is needed to analyse how that might work though, by someone who knows RDB2RDF

it would be ideal to try to extract out a property entity and several rows
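
A minimal sketch of pulling a property entity out of several transaction rows, assuming (purely for illustration) that an `address` column is enough to tell properties apart:

```python
from collections import defaultdict

def group_by_property(rows, property_key="address"):
    """Group transaction rows under the property they mention."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[property_key]].append(row)
    return dict(grouped)

# Hypothetical transaction rows; one property appears twice.
rows = [
    {"address": "1 High St", "date": "1999-03-02", "price": "60000"},
    {"address": "2 Low Rd", "date": "2005-07-19", "price": "150000"},
    {"address": "1 High St", "date": "2013-11-30", "price": "210000"},
]
by_property = group_by_property(rows)
```

In practice a plain address match is fragile, so this shows only the shape of the grouping, not a reliable way to identify a property.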

jtandy: 3rd req i extracted, that each entity should be uniquely identifiable


andys: … guid for each [don't know, need to check, it's a hash of some cols]

per transaction
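
For illustration only, an identifier derived by hashing some columns could be sketched as below. Which columns the Land Registry actually hashes, and with what algorithm, was not confirmed on the call; the column choice and the use of SHA-256 here are assumptions.

```python
import hashlib

def row_identifier(row, columns):
    """Derive a stable identifier by hashing the chosen column values."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(row[c] for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

key_columns = ["address", "date", "price"]  # hypothetical choice of columns
row_a = {"address": "1 High St", "date": "2014-01-15", "price": "100000"}
row_b = {"address": "1 High St", "date": "2014-01-15", "price": "100000"}
row_c = {"address": "1 High St", "date": "2014-02-01", "price": "120000"}

id_a = row_identifier(row_a, key_columns)
id_b = row_identifier(row_b, key_columns)
id_c = row_identifier(row_c, key_columns)
```

Identical rows get the same identifier and a different transaction on the same property gets a different one, which matches the "per transaction" reading.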

internally, there are some identifiers for properties, but they're not in a position to publish those

jtandy: each row wants to talk about a transaction, which is an update on a prev transaction

jtandy: final requirement, is need to associate values in a csv file with an externally published thesaurus

andys: very much so, that's quite important

jtandy: 2-fold, a) you need to be able to ref a thesaurus/vocab, or b) you might need to expand some code as referring to some specific entity
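
Case (b) can be pictured as a lookup from a short code in a cell to the IRI of a concept in a published code list. The codes and `example.org` IRIs below are hypothetical, chosen only to show the shape of the expansion.

```python
# Hypothetical code list: short codes in a CSV column mapped to concept IRIs.
PROPERTY_TYPE_CODES = {
    "D": "http://example.org/def/property-type/detached",
    "S": "http://example.org/def/property-type/semi-detached",
    "T": "http://example.org/def/property-type/terraced",
}

def expand_code(code, code_list):
    """Return the concept IRI for a code, or None if the code is not in the list."""
    return code_list.get(code.strip().upper())

expanded = [expand_code(c, PROPERTY_TYPE_CODES) for c in ["D", "t", "X"]]
```

Returning `None` for an unknown code leaves it to the caller to decide whether that is an error or simply an unmapped value.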

(discussion of impact on the conversion workflow)

andys: if you look at it as a table of transactions

jenit: looking at analysis you've done, are there any particular things you'd like to flag for help, input etc?

hmm did that do something bad, for scribe?

jtandy: we need example (for nat. archives) we have useful discussion but a more specific example would help.

also, 2nd use cases also from adam, relational data row and formats, … [missed detail]

jtandy: 3 and 4 from jenit. For 3., I've identified that they were talking about Excel, would be useful to id a list of commonly used tools

… a particular dataset for use case 3 would be a specific csv file to illustrate

tools wiki: https://www.w3.org/2013/csvw/wiki/Tools

jtandy: no 4, no comments. no 5, one of mine - meteorological observations; no 6, andy's discussed already; pretty specific; no 7, from Alf ...

Alf's use case 7., search results from SOLR, my interpretation is that you're trying to illustrate how to deal with a larger dataset

e.g. a huge result set from a search

just using the search result as an illustration of this process

alf: yes, pretty much right

jtandy: i remember from last call + use case, that we don't want to get into designing a search protocol

… so interest is not so much the search protocol but dealing with a subset of a larger collection.

jtandy: am trying to write the use case to make that clear, following a specific narrative … example of open refine,

how things that you tried to do affect how you have to process the csv file

alf: yes

jtandy: i've written that the search topic is misleading, we're not doing a protocol, just pagination within a dataset

in terms of no.8, the police open data reliability analysis, i think that it would be useful to include a set of csv files, some of which are unreliable, ...

so see how […] categories and geo areas

davide: ok i'll do that

jtandy: the analysis is great, looking at change over time, comparability etc. Just needs some specific examples to show where it's broken.

jtandy: a q for jenit/danbri, … in a lot of use cases, people are having to do manual effort to manipulate the files. should we say explicitly in use cases, "And it took ages as I had to do … to get into matlab etc."?

jenit: pull out what's requirement for particular tools

jenit: to inform what we do

jtandy: in a perfect world it would all just work! but we're writing use cases for today

jenit: reading stuff into R, or favourite stats package, may need extra impl work to read in the files in the format we're talking about defining, to get all the extra info / context in

but we need to have an idea about the backwards compatibility story, … how much new tools can work with CSVs, new CSV etc

jtandy: no9, analysis of scientific spreadsheets, again via davide

… also see this with my scientific colleagues; i've seen people do v similar things w/ hydrology and river flow (topical topic...)

[missed the action but davide will do something]

alf's no 10, suggest merging with alf's no 9

alf: yes, suggest that

alf/davide to discuss converging them

alf: i have a q about this, http://lists.w3.org/Archives/Public/public-csv-wg/2014Feb/0048.html

… are we trying to help ppl who would normally publish excel to do csv instead, or a subset

jenit: that's a pretty fundamental question

… we should be aiming to let people express the kinds of info that ppl express in excel files

jenit: then make judgement calls about expressivity


[bg noise]

alf: for this use case i'll go thru the excel files and try to pick out what might be represented

danbri: excel functions/expressions too?

alf: you might want totals of columns

jenit: that's the kind of q that we need to pull out as a potential requirement, and say if we'll try to address it or not

jtandy: when it comes to the use case, it'll be essential to bring out, that these are things we're trying to achieve

(thought: we could/should/might say that losslessly representing original format is not a goal)

jtandy: annotating time series …

artificial shifts, e.g. when an instrument recalibrated

it would be better if you could pull out a use case

alf: [missed] merging with weather observation series

jtandy: i'm looking at integrating that with international surface temperature dataset

they're merging csv datasets from all around the world

single consolidated dataset

this would be in a sep piece of the workflow

alf: you might indicate a volcano eruption at a point in time etc

jtandy: I linked some software, …

jenit: thanks for all this work, it would be great now if we can get it into w3c draft format, let's talk offline about practicalities

jtandy: i'll try to get this done before next week

<AxelPolleres> " alf: you might want totals of columns" ... so you want to *extend* the CSV format? ... don't see this covered by the charter at the moment.

(… incl issues discussed, if people supply the details)

<konstant> NetCDF

konstant: I didn't get chance to introduce myself last week, … but wanted to mention i'll provide text on wiki, …

<AxelPolleres> ie. the consequence of this would be rather "Spreadsheets on the Web" rather than "CSV on the Web", wouldn't it?

netcdf scientific data, they have complex headers,

in between discussion

they have a header that describes the semantics of the columns, incl the ranges for the diff columns,

<konstant> http://www.unidata.ucar.edu/software/netcdf/examples/ECMWF_ERA-40_subset.cdl

for example [url above], it documents the ranges for the diff values

<AndyS> Axel - maybe it could be by metadata saying what a cell means?

then there is a data section at the end of the file.

konstant: we're cooperating on a project w/ a dutch university, who have huge amount of these files, and diff modeling software that they're using, to predict crops and crop yields

<AxelPolleres> Andy, you mean something like being able to say something like "last row contains totals"? or alike?

they combine this data with meteorological data, create new netcdf files using this modeling software

they compare predictions, data, … q is how to combine netcdf w/ other kinds of data

<AndyS> netCDF -- http://www.unidata.ucar.edu/software/netcdf/

jenit: good stuff, jtandy can take it via wiki, …

jtandy: i'm acutely aware of the netcdf efforts, ERA 40 dataset etc. Are you dealing with a specific variant?

konstant: I'm not sure, will need to investigate that

we should start giving you … examples from different decades, … I'll check with wageningen

jtandy: interesting to look at mixing this with other kinds of dataset

jenit: yes, please make use of the mailing list

<JeniT> http://w3c.github.io/csvw/syntax/

jenit: I made a start at drafting a definition of what CSV might look like

… not time to discuss in detail now, pls take a look and comment on the list

there's an appendix, i picked out Excel and others; expect more work needed there on current state of the art

I'd encourage any of you that have particular favourite tools for CSV, e.g. excel on windows which I don't have, add samples etc

so request for review of this doc and input on tools section

(good stuff jeni :)

jenit: i don't particularly want to be editor of that doc, so if you're interested in editing role please say


EDF ad-hoc meeting

AxelPolleres: I got one reply from colleague in athens

but not yet more

danbri: it doesn't need to be WG members only

AxelPolleres: I can mention in my talk

… i can promote a bit the existence of the WG

jenit: good idea, more attention, more input, more impact, .. so yes please :)


jtandy: several of us will be at linking geospatial conf, … packed agenda but we can find a few minutes there

AndyS suggests maybe meeting evening before

jenit: discuss on list


Summary of Action Items

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.138 (CVS log)
$Date: 2014-02-13 01:36:23 $