CSV on the Web Working Group Teleconference

17 Sep 2014


See also: IRC log


Yakov Shafranovich (yakovsh), Ivan Herman (Ivan), Phil Archer (phila), Jeni Tennison (JeniT), Jeremy Tandy (jtandy, jtandy_), Stasinos Konstantopoulos (stasinos), Bill Ingram (bill_ingram), Andy Seaborne (AndyS), Dan Brickley (danbri)
DanBri, phila


Analysis of use cases

jtandy: I just sent a spreadsheet of use cases

<JeniT> http://lists.w3.org/Archives/Public/public-csv-wg/2014Sep/0068.html

jtandy: Having been through the use cases... haven't had time to go through the wiki
... I have one, as does Dan and Ivan
... they've been useful to inform our conversations recently
... Fewer than half (9) specifically talk about transforming from CSV to X
... not a big number. There are 2 or 3 others that aren't explicit in the demand for transformation but recognise the need for something like this
... like the GeoJSON one
... so we have 12 UCs that need to transform CSV to something else. Just under half
... I asked myself a bunch of questions
... do we have a target output in the use case? Usually no, most don't
... which makes our assessment more difficult
... Are column names mapped to properties/QNames?

jtandy: There are examples of mapping to properties, geonames etc
... are there variables in the cells
... trying to pick out the use cases where there is a sub structure in a cell that we need to pull out

<danbri_> so sorry late (esp. as I volunteered scribe), trapped in transit

jtandy: there are very few examples of that. Those would increase the complexity of our templating question. I think we have 4 UCs with sub structure in a cell and others with a delimited list in a cell

jtandy: Most use cases don't include nesting, intermediate properties etc
... UC 4, for example, the target RDF/XML here picks up the object (such as a profession).
... In a CSV file you could just link to the cell. Need to think about cases where we're converting sets of files - how you want to aggregate those into a single target output or not
... that prob isn't a templating Q itself but it is a question
... analysing scientific spreadsheets. No complex structure, but there is a need to express units of measurements assigned to each cell.
... That might be done at metadata level (or Data Cube). So I think we can avoid that
... Multiple tables in a single file prob don't meet our criteria of what is a CSV file. Fair?

All - yes

use case 21, biodiversity... is there complex structure in the output?

I've concluded not entirely

default pairs ...

use case 23 introduces the idea of multiple columns all having the same semantic property

but the idea is that if you had up to 3 geo area codes, ... you could have one in each column

repeated values
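
A minimal Python sketch of the repeated-column idea in use case 23. The column names (geo_code_1 to geo_code_3) and the sample data are hypothetical, not taken from the use case. Several columns are read as values of one property, and empty cells are skipped.

```python
import csv
import io

# Hypothetical sample: three columns carrying the same semantic property.
SAMPLE = "name,geo_code_1,geo_code_2,geo_code_3\nAlpha,E1,E2,\nBeta,W9,,\n"


def collapse_repeated(row, prefix, target):
    """Gather every non-empty cell whose column name starts with `prefix`
    into one list stored under `target`; other columns are kept as-is."""
    values = [v for k, v in row.items() if k.startswith(prefix) and v]
    rest = {k: v for k, v in row.items() if not k.startswith(prefix)}
    rest[target] = values
    return rest


rows = [
    collapse_repeated(r, "geo_code_", "geo_code")
    for r in csv.DictReader(io.StringIO(SAMPLE))
]
```

Each row then carries a single geo_code list, so a converter can emit the same property once per value.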

use case 24, hierarchy w/ occupational listings

does require a complex structure to be created

skos broader relationships that are derived from occupational listing codes, ... transitive

could be generated via sparql construct afterwards

i.e. there are some workarounds
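
A rough Python sketch of deriving broader relationships from codes as a post-processing step. It stands in for the SPARQL CONSTRUCT workaround mentioned above and is not that query. It assumes a hypothetical dotted code scheme where the parent of "11.1.2" is "11.1"; the real occupational codes may be structured differently.

```python
def broader_pairs(codes):
    """Return (narrower, broader) pairs, assuming a code's parent is the
    same code with its last dotted segment removed (hypothetical scheme)."""
    known = set(codes)
    pairs = []
    for code in sorted(known):
        parent, sep, _ = code.rpartition(".")
        if sep and parent in known:
            pairs.append((code, parent))
    return pairs


def transitive_broader(pairs):
    """Expand direct broader links into links to every ancestor."""
    parent_of = dict(pairs)
    closure = []
    for child in parent_of:
        ancestor = parent_of[child]
        while ancestor is not None:
            closure.append((child, ancestor))
            ancestor = parent_of.get(ancestor)
    return closure


pairs = broader_pairs(["11", "11.1", "11.1.2", "15"])
closure = transitive_broader(pairs)
```

The direct pairs correspond to skos:broader; the expanded list corresponds to the transitive relationships.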

the only one needing conditional processing is occ. listings

conditional rules or flow control

jtandy: there are very few examples, none amongst use cases, where we need to manipulate value of strings to build target output

the only place I've come across doing this kind of thing before

could be hidden in use cases

is generation of certain URI structures based on literal text input

e.g. generating URI-based identifiers for the object that a row talks about

jtandy: as a quick overview, about half talking about transforms into other formats. But v few of those are complicated.

jeni: thanks, that's really useful!

very few that require even string processing as values to get stuff out

very few require text output restructuring(?)

scribe: i.e. "we need to be able to express this tabular structure" not "we need to convert it into this other structure"

<JeniT> http://lists.w3.org/Archives/Public/public-csv-wg/2014Sep/0036.html

jenit: see also this small piece of work documenting use cases

when people want to be doing a transformation

i called out 3 possibilities here

2 are re use created configurations

e.g. downloading 2nd csv file

1st ex was downloading set of csv files, wanting to import that into an sql db or similar

scribe: in such case, the person acquiring the CSVs will typically know the table structure they want to create

for the particular data import tool

2nd example, someone creating a web app displaying data from a CSV on say a map

and for that then if the people who are publishing a metadata file, ... defining a conversion into geojson

they can use that conversion for that particular display

but you can easily imagine someone wanting a different json target

e.g. a graph etc

3rd example, someone using server side software to statically generate a website

like http://jekyllrb.com/

e.g. if it's contact info, they might generate vcard, schema.org JSON, produce some html with embedded metadata

those were the examples that I thought of

diff characteristics

in particular what came through to me, it's quite rare, quite tool specific, ... may be person specific

the appropriate conversion might depend on the kind of output you're actually aiming for

(danbri: e.g. http://stackoverflow.com/questions/11088303/how-to-convert-to-d3s-json-format for D3 is common)

jtandy: the times i've wanted more complex output is ironically when we're trying to match community/standard models

in trying to get to a common way of expressing data, it gets more complex

e.g. if I wanted to use QUDT, or semantic sensor networks, ...

geojson - complexity usually is pulling out the geometry

others like vcard largely easy end of scale

rather than deeply complicated data

jenit: more comments?

phila: following up jtandy, ... re use of string functions for URI generation

<JeniT> escaping?

I had experience of trying to do that, ... basic string function of removing white space, case normalization, ...

but that's as complex as it got

phila: was simple excel spreadsheet, using awk

so turning string name of a ministry into a URI

pretty basic stuff

case normalize, and get rid of whitespace
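
A minimal Python sketch of the kind of string handling described here (the original was done with awk). The base URI is hypothetical, and replacing whitespace runs with a hyphen rather than dropping them is an assumption made for readability.

```python
import re
from urllib.parse import quote

BASE = "http://example.org/id/ministry/"  # hypothetical base URI


def name_to_uri(name, base=BASE):
    """Lower-case the name, collapse whitespace runs to '-', and
    percent-encode anything that is not safe in a URI path segment."""
    slug = re.sub(r"\s+", "-", name.strip().lower())
    return base + quote(slug, safe="-")


uri = name_to_uri("  Ministry of   Finance ")
```

The percent-encoding step also covers the escaping question raised on IRC: non-ASCII characters in a name are encoded as UTF-8 octets.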

<Zakim> AndyS, you wanted to mention URI templates.

AndyS: similar to what Phil says

We use a lot of URI templates

multiple fields into one URI

sector, area, ID all go in.

certain amount of cleaning, string manipulation, whitespace, chars we don't want, ...

beyond this, issue of validation

what to do when the data doesn't match what you need

although it's possible to handle it when it comes out the other end, ... feels wrong

but not clear cut

a desire to know when there's an issue and flag an error
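
A small Python sketch combining the two points above: several fields go into one URI, and a row that cannot fill the template is flagged instead of producing a bad URI downstream. The template, field names and error type are hypothetical, and the `{field}` syntax is a simplification, not an RFC 6570 implementation.

```python
import re
from urllib.parse import quote

TEMPLATE = "http://example.org/{sector}/{area}/{id}"  # hypothetical template


class RowError(ValueError):
    """Raised when a row cannot supply a value the template needs."""


def clean(value):
    """Trim, lower-case and replace whitespace runs with '-'."""
    return re.sub(r"\s+", "-", value.strip().lower())


def expand(template, row, line_no):
    """Fill each {field} in the template from the row, or raise RowError."""
    def fill(match):
        field = match.group(1)
        value = clean(row.get(field) or "")
        if not value:
            raise RowError(f"line {line_no}: missing value for '{field}'")
        return quote(value, safe="-")

    return re.sub(r"\{(\w+)\}", fill, template)


uri = expand(TEMPLATE, {"sector": "Health", "area": "North West", "id": "42"}, 2)
```

Raising at expansion time gives the line number of the offending row, which is harder to recover once the output has been generated.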

jenit: we def want to be able to support validation against metadata file

<Zakim> phila, you wanted to talk about validation

phila: we're close to launching a WG on RDF validation

(danbri: aka 'data shapes' I think)

phila: although this is rdf only, the two are closely related. there's a danger both groups try to punt it to the other
... other wg if its creation goes ahead as (nonbindingly) anticipated, ... could maybe be useful

jenit: downstream validation has the issues that andys identified

any more re requirements?

jenit: next- a straw poll, helping us to see where we're at w.r.t. question

<JeniT> http://lists.w3.org/Archives/Public/public-csv-wg/2014Sep/0067.html

re transformation, templating

4 basic options (see mail)

1. Providing no customisation of mappings to other formats

leaving it completely unspecified

2. Providing some kind of hooks for customised mappings

but nothing normative for what's used

3. Adopting an existing templating language, such as but not necessarily Mustache

providing a way to map data in csv into the variables used by that existing templating language

4. Going into specifying our own templating language.

(is this multiple choice? I like 2. + 3.)

<AndyS> Epimorphics --> https://github.com/epimorphics/dclib

choose one as preferred direction

danbri: I prefer (3. with Mustache as starting point) ...with (2. to allow others), and a hint of (4.) in that Mustache could be stretched a bit, and called Mustache-inspired.

jenit: Ivan, you're suggesting an investigatory period?

ivan: at least ... this is the way we interpret however we choose

Jeni: straw poll on your preference for what we do next, with the assumption that if we investigate templating lang and if it's too difficult we revise our opinion (at end of year)

jtandy: as I've been thinking about how we might call out to other templating langs, e.g. xslt, sparql constructs, other things that can do our processing, ... it wasn't clear to me how / what mechanism we might have in place to provide those hooks out for external formats

which is what 2. is talking about

can someone give a 2 minute education on (2.)?

jenit: within metadata file there is a property called mappings which has objects that give a title, a format, a ref to a template thing
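
A sketch, in Python, of what such a hook might look like and how a tool could pick a mapping by output format. Only the property name "mappings" and the idea of a title, a format and a template reference come from the discussion; the key names, media types and template paths are hypothetical.

```python
import json

# Hypothetical metadata fragment; key names other than "mappings" are assumed.
METADATA = json.loads("""
{
  "mappings": [
    {"title": "People as Turtle",
     "format": "text/turtle",
     "template": "templates/people.ttl.mustache"},
    {"title": "People as JSON",
     "format": "application/json",
     "template": "templates/people.json.mustache"}
  ]
}
""")


def find_mapping(metadata, fmt):
    """Return the first mapping whose format matches `fmt`, or None."""
    for mapping in metadata.get("mappings", []):
        if mapping.get("format") == fmt:
            return mapping
    return None


turtle_mapping = find_mapping(METADATA, "text/turtle")
```

How the parsed CSV is handed to the referenced template is left open here, which matches the "implementation defined" answer for option 2.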


3. is more like GRDDL

jtandy: how do you get the object to the template lang?

jeni: that would be implementation defined

whereas 3. we'd define exactly what that would look like

for a given language.

ivan: to come back to your option 2
... and actually even 3

do we define some sort of a simple fallback mapping

or we don't do anything whatsoever

e.g. if i want a json out of the csv file

<JeniT> “In all cases, we need to specify a default mapping to RDF/XML/JSON that is purely based on the metadata (which is also used to inform validation and display of the CSV files).”

but i do not refer to any external tool

does that mean i get nothing whatsoever?

or we have straightfwd way to extract csv in json

jeni: see above from email
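
To illustrate the fallback question, a minimal Python sketch of a template-free conversion: each row becomes a JSON object keyed by the header row. This is an assumed shape for illustration only, not the default mapping the group refers to, and all values stay as strings because no metadata is consulted.

```python
import csv
import io
import json


def default_json(csv_text):
    """Convert CSV text to a JSON array with one object per data row,
    using the first row as the property names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([dict(row) for row in reader], indent=2)


output = default_json("name,age\nAda,36\nAlan,41\n")
```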

(aside: just remembered http://www.w3.org/TR/sparql11-results-json/ as a json table format)

stasinos: ... freedom and configurability vs sensible defaults

<AndyS> 1/2/3 are the same to us ... does not create a (sufficiently big) tool economy.

stasinos: can we stipulate that it should be the case that it should be ok for all producers and should be used, but needn't be a MUST

you could choose to use something else

jenit: you'd want extensibility option, ... something with an understood level of conformance on the use of a particular templating language to get to do the conversion.

andys: need uri mapping
... i'm answering from pov of people with requirements on transforming csv to rdf

making URIs for some scheme is a v important requirement for that

jenit: I think you could provide templates within metadata file

don't need a full templating system for that

andys: quite possible. but the requirement remains.

jtandy: (4.) a simple templating language. Is an example of that the restrictions that dan explained to us re Polymer

(discussion of polymer vs mustache)

andys: mustache lets you set things up before calling templates
... so has equiv of polymer but done in a diff way

<jtandy_> 4

<yakovsh> 2

initial straw poll. TYPE INTO IRC NOW

<JeniT> 2

<AndyS> 4

<ivan> 2


<bill_ingram> 2

<stasinos> 3

<stasinos> (BUT 2 MAY 2)

chairs shared their technical opinion

Results, for the minutes: Option 1: 0, Option 2: 4, Option 3: 2, Option 4: 2

<jtandy_> phila says that (4) looks complete - but dont make it super complex

phila: ultimately what I care about is that the wg has capacity to deliver

I keep meeting people who are really looking fwd to the results of this group but don't have time to help.

andys: i'd like to reinforce what phil said. the classic open source issue here is that "someone else will do it".

if you're getting that kind of interest from outside, then it is time for the group to start broadcasting outward what the real factors are

e.g. start setting expectations

if the expectations exceed what the group's delivered

otherwise great work may go unappreciated

jenit: two have argued for specifying a templating lang; several who said hook and impl defined

2 said existing tpl lang

jtandy: my issue with 2 is that there is a big gap between how we take it out of parser, and into relevant templating lang, ...

[choppy noises]

jtandy - can you type

jtandy: re 3., mustache etc, those things may change
... change control etc

which leaves us with 4.

<phila> +1 to opposing 3 for the reasons Jeremy gives

scribe: doing a bit of work

<jtandy_> danbri says (4) only if we start from mustache

ivan: i voted 4 as i've played with that, i had this proposal, a stripped down mustache, which might be good enough

what really made me change, and i didn't sync w/ phil, ... experience is that we don't have enough people to properly do that, even that level that I did

scribe: a bigger group w/ more people, I believe 4 is doable

could be pretty small

I essentially did something that I believe covers most of the use cases

stasinos: I was thinking, if it's to be something that is simpler than an existing lang, then it kind of begs the question why to bother to
... vs refer to a specific github etc version

(aside: see http://www.w3.org/2013/09/normative-references )

stasinos: but for our own i don't think we're in position to complete it

AndyS: you asked why I voted for (4.), looking at other specs that feel close, in w3c space

if you look at something like GRDDL, it isn't a stunning success

<JeniT> GRDDL used XSLT right?

it's a real shame there isn't a full blown rdf rules language

<ivan> yes Jenni

(RIF isn't?)

(for some sense of 'blown')

<phila> IIRC GRDDL strongly suggests but doesn't require XSLT

scribe: R2RML gets some traction but not sure if will be a roaring success

andys: SPARQL amazingly overshot, but back then WGs could do that

features did creep in

reason was that there were things ppl wanted to do

there was resistance on putting things in

being driven by user needs made it hard

one poss is to say 'if that's the way we want to go, separate it out somewhere, send it out to be a CG'

a small group could work on it in a diff way, come up with a particular proposal

ivan: mustache is a v good example for the difficulties we might have

i initially used a mustache impl, csv files have their own features

mustache is text to text

ivan: ... we choose one templating lang ... don't think that is practically doable

andys: what if we start from an existing one, then kinda fix it? make it 'the w3c one'?

I'd like that

jenit: I think that is a good approach

ivan: that means cutting back a bunch of things

<jtandy_> can we

phila: e.g. i wanted geojson but the community who created it weren't so interested (was that right? --scribe)

<jtandy_> oops. can we specify a minimum set of requirements for the template lang? driven by our use cases

<phila> Something like that, yes - but it's nuanced

<ivan> trackbot, end telcon

Summary of Action Items

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.138 (CVS log)
$Date: 2014-09-17 13:14:34 $