W3C

CSV on the Web Working Group Teleconference

04 Mar 2015

Agenda

See also: IRC log

Attendees

Present
JeniT, gkellogg, ivan, jtandy, jumbrich
Regrets
Chair
Jeni
Scribe
Jeremy, Jeni

Contents


<JeniT> https://github.com/w3c/csvw/issues?q=is%3Aopen+is%3Aissue+label%3A%22Requires+telcon+discussion%2Fdecision%22+sort%3Acreated-asc

JeniT: inclined to ask jtandy for update on mapping documents - then look at issues arising

<JeniT> jtandy: I’ve got the RDF mapping document to a point where I think it’s good for review by the group

<JeniT> … there are bound to be errors in it, so inputs gratefully received

<JeniT> … I’m most of the way through the JSON equivalent, which is similar but has different terminology

<JeniT> … which should be finished in the next day or so

<JeniT> … latest version is at:

http://w3c.github.io/csvw/csv2rdf/

<JeniT> jtandy: the ToC is a lot simpler than it was previously

<JeniT> … I’ve used gkellogg’s suggestion for an algorithmic approach

<JeniT> … the inclusion of provenance is non-normative

<JeniT> … there are four examples

<JeniT> … the algorithm is in 3.2

http://w3c.github.io/csvw/csv2rdf/#generating-rdf

<JeniT> jtandy: this discusses what you do in standard & minimal mode

<JeniT> … it says, ‘at this stage create a triple…’ etc

<JeniT> … it goes through table groups, tables, rows, and cells themselves, and I think it makes sense :)
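The traversal jtandy describes (table group → tables → rows → cells, emitting triples at each level) can be sketched roughly as follows. This is an illustrative sketch only, not the normative algorithm in section 3.2; the dict shapes, node labels, and the `csv_to_triples` helper are all hypothetical.

```python
# Hedged sketch of the nested traversal described above: walk the table
# group, its tables, rows, and cells, emitting triples as we go.
# All names and structures here are illustrative, not from the spec.
def csv_to_triples(table_group):
    triples = []
    g = "_:group"  # "create a new node G for group"
    triples.append((g, "rdf:type", "csvw:TableGroup"))
    for t_i, table in enumerate(table_group["tables"]):
        t = f"_:table{t_i}"
        triples.append((g, "csvw:table", t))
        triples.append((t, "csvw:url", table["url"]))
        for r_i, row in enumerate(table["rows"], start=1):
            # RFC 7111-style row fragment identifier
            r = f"{table['url']}#row={r_i}"
            triples.append((t, "csvw:row", r))
            for col, value in row.items():
                triples.append((r, f"{table['url']}#{col}", value))
    return triples

group = {"tables": [{"url": "http://example.org/countries.csv",
                     "rows": [{"code": "AD", "name": "Andorra"}]}]}
triples = csv_to_triples(group)
```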

<JeniT> ivan: reading through, it says use the JSON-LD algorithm on any common properties

<JeniT> gkellogg: that can be updated I think, when my PR is merged

<JeniT> … the other thing is that you’re starting from the table group

<JeniT> … the process of getting metadata or any table reference

<JeniT> jtandy: we always start from a table group, create a new node G for group

<JeniT> gkellogg: there’s two different parts to processing the metadata

<JeniT> … you can start with the metadata & load the files, or start from the files & load the metadata

<JeniT> … have you thought about factoring that logic in?

<JeniT> jtandy: I’ve made the statement in the intro that I don’t care how we’ve got to the table group: whether you go from metadata to CSV or CSV to metadata, the point is when you have the table group in memory, we create RDF from that

<JeniT> … does that make sense?

<JeniT> gkellogg: that’s the way to go through the metadata

<JeniT> … the fact is that as you go through the table group and encounter CSV files you get more metadata, which may in turn require loading more metadata

<JeniT> … which requires a recursive approach, but perhaps I’m wrapped up in my own implementation

<JeniT> … whereas if you start with a table group you have a different approach

JeniT: thinks that jtandy is showing the right approach here - starting from the model ... but gkellogg's issue needs to be raised in the model document

ivan: bothered from a user perspective
... start with a CSV file and [...] get to the metadata

<JeniT> I think that should be a user option

ivan: which could mean finding table groups - and more CSV files to merge in
... not sure that this type of processing is what users want

<JeniT> and the suppressOutput flag provides for suppressing the outputs from different tables

ivan: don't think they will want _every_ CSV file included in the output

gkellogg: the way my system works is to start with the metadata and then look for the CSV
... and then open the CSV to get more metadata [...]
... fairly intuitive

JeniT: do we want to discuss this here - or online

gkellogg: online ...

JeniT: gkellogg - please can you open a new issue about which metadata gets used.

<JeniT> JeniT: jtandy, are there any issues that it would be useful to resolve, to unblock you?

<JeniT> jtandy: #286

<JeniT> https://github.com/w3c/csvw/issues/286

<JeniT> … in the public sector roles & salaries example, the idea was that some files were published centrally, and some by the departments

<JeniT> … when gkellogg and I talked about this, it appears that his implementation expected the use of relative urls and they all had to be on the same host

<JeniT> gkellogg: well, not entirely, my implementation tries to load the resources that it discovers

<JeniT> … if those URLs are example.com/ etc then it won’t be able to load those

<JeniT> … if you start from one metadata file, and something is at a fictitious location, that’s an issue because you can’t load things from there

<JeniT> jtandy: so I need to have a relative URL on example.org?

<JeniT> gkellogg: if the URL references are relative, they’re on whatever the base URL is

<JeniT> JeniT: they need to be retrievable

<JeniT> gkellogg: the namespace location has been changed since our first release

<JeniT> … I don’t know whether we want to wait to publish the namespaces until we’re done

<JeniT> ivan: I update them when we publish the documents

<JeniT> jtandy: in real life the professions csv file would be on a real host; I’ll figure out some words

<JeniT> gkellogg: we should have examples that reference other locations, but we won’t be able to run those examples without infrastructure or common test suite mechanism

<JeniT> jtandy: I’m just looking at #289, about noProv processing

<ivan> https://github.com/w3c/csvw/issues/289

<JeniT> … I thought at the F2F we decided that implementations may choose to add provenance information but it’s not our concern

<JeniT> gkellogg: it’s a concern for conformance and testing

<JeniT> … we need to be able to turn it off so that we can test the results from the implementations

<JeniT> jtandy: so I need to have something that says that no other triples are generated in ‘NoProv’ mode

<JeniT> ivan: saying something like that is non-RDFy

<JeniT> … you don’t close the world like that

<JeniT> gkellogg: I think it’s for conformance purposes, it’s really useful to be able to absolutely predict the triples that are generated

<JeniT> … RDFa didn’t do that, which means that we had to test things with SPARQL, which was difficult

<JeniT> … having that mode makes it much easier to test

<JeniT> ivan: isomorphism would not work anyway because it depends on the way I serialise it in Turtle or what order I use for my triples

<JeniT> gkellogg: no

<JeniT> … that’s not true, isomorphism specifically looks to ensure bnodes named differently can be isomorphic etc

<JeniT> ivan: in general, I understand the difficulties of testing but we should not control our specification on how testing can be made

<JeniT> JeniT: I propose that we say that implementations have to have a NoExtras mode to support testing, but that this isn’t part of the spec

+1

<ivan> +1

<gkellogg> +1

<jumbrich> +1

<JeniT> … so #289 goes onto the test suite

<JeniT> jtandy: and on #292

<JeniT> https://github.com/w3c/csvw/issues/292

<JeniT> … it talks about scripted conversions, with the source being RDF or JSON; we have two modes, and it seems that the scripted conversions might want minimal or standard as a starting point

<JeniT> ivan: the minimal mode is more for human consumption; we should always use the standard mode as a starting point for the scripting

<JeniT> … from an RDF triplestore point of view, the fact that there are more triples than in minimal shouldn’t be a problem, that’s the whole point

<JeniT> … I think it’s OK to say they get everything

<JeniT> … if they want to filter it out, do it

<JeniT> JeniT: still say standard mode for JSON, it’s easy to ignore stuff

<JeniT> jtandy: so metadata document gets updated to say it operates on standard

http://w3c.github.io/csvw/csv2rdf/#example-countries

<JeniT> jtandy: minimal mode is example 3, standard mode example 4

<JeniT> ivan: for me it looks fine

<JeniT> … two issues: if I start from a CSV file I might not want to have this table group at the top (I don’t even know what it is)

<JeniT> … the row\=2, where does that come from?

<gkellogg> I agree with ivan on not necessarily having TableGroup

<JeniT> jtandy: in Turtle the equals sign is a reserved character

<JeniT> … I have made a comment on this in the notes

<JeniT> gkellogg: in the metadata document we use a . rather than =

<JeniT> jtandy: this is a RFC7111 fragment identifier

<JeniT> ivan: at least for the example, the prefix t1 should include row=

<JeniT> … it’s not readable currently

<JeniT> <#row=2>

<gkellogg> +1

<JeniT> JeniT: I suggest setting the base and then using a relative URL

<ivan> +1

<JeniT> jtandy: shall I take out the t1, t2, t3 prefix definitions?

<JeniT> JeniT: I would prefer that, use full URLs
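As discussed, `=` is reserved in Turtle prefixed local names (hence the `row\=2` escape), so writing the full URL, or a base-relative URL like `<#row=2>`, avoids escaping. A small sketch of building the RFC 7111 row fragment as a full URL; the `row_fragment` helper name is hypothetical.

```python
from urllib.parse import urljoin

# Illustrative only: construct the RFC 7111 fragment identifier for a
# single row relative to the CSV file's base URL, as suggested above.
def row_fragment(base_url, row_number):
    """Return the full URL addressing one row of a CSV file."""
    return urljoin(base_url, f"#row={row_number}")

print(row_fragment("http://example.org/countries.csv", 2))
# → http://example.org/countries.csv#row=2
```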

<JeniT> jtandy: did we agree to have an ordered property, use a list if ordered=true?

<JeniT> … I think we agreed that at the F2F

<JeniT> https://github.com/w3c/csvw/issues/107

<JeniT> JeniT: ordered lists - we’ll add to model & metadata

<JeniT> gkellogg: the only other comment was #290 was that more of the content can be deferred to the model document

<JeniT> jtandy: we’ll discuss online

JeniT: would like to go back to the questions about unions of datatypes

<JeniT> https://github.com/w3c/csvw/issues/223

<JeniT> https://lists.w3.org/Archives/Public/public-csv-wg/2015Mar/0002.html

JeniT: juergen - please can you describe your conclusions?

jumbrich: we're now able to parse 50,000 CSV files
... the info shared on the list shows the number of columns where more than one data type is found

<JeniT> (DATE,FLOAT+)->8529

jumbrich: we have a lot of empty strings, of null values
... lots of date formats

<JeniT> (ALPHA,NUMBER+)->5636

<JeniT> (ALPHA,FLOAT+)->4581

jumbrich: often get numbers where strings should be

<Zakim> gkellogg, you wanted to discuss briefly #290 https://github.com/w3c/csvw/issues/290

ivan: i know this is difficult- but do you have a feeling for which of these examples are intentional and which are bugs?

jumbrich: difficult to say; the differences may be due to different tools
... non-conformant dates might be caused by different locales
... numbers might just be different representations of values

ivan: depending on what they are indicates whether or not we should adopt the union of data types
... it's still not clear whether there is evidence that USERS mean to specify multiple datatypes for a given column
... implication for complexity if we adopt the union of datatypes

jumbrich: agreed
... take the example of room identifiers; some might be A111, others might be 101 (just numeric)
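The room-identifier example above can be made concrete with a naive per-column type inference, which is how a mixed column like `A111` / `101` surfaces as a union of datatypes. The classification rules here are hypothetical, not from the CSVW spec or jumbrich's tooling.

```python
# Illustrative sketch: naive per-value type inference over a column of
# room identifiers, showing how mixed datatypes arise in practice.
def infer_type(value):
    if value == "":
        return "null"      # empty strings / null values, as noted above
    try:
        float(value)
        return "number"
    except ValueError:
        return "string"

column = ["A111", "101", "B202", ""]   # some alphanumeric, some numeric
types = {infer_type(v) for v in column}
print(sorted(types))
# → ['null', 'number', 'string']
```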

JeniT: propose that we don't support unions of datatypes in this version - but solicit feedback from reviewers

gkellogg: make sure we call for comment in the next version

ivan: practical issue, for our own workflow, lets try to get all the issues closed for the current version
... so that we have a clean break when publishing a new WD

<gkellogg> Perhaps use a flag on the issue?

JeniT: I think it's fine to keep open those for which we are soliciting feedback

<JeniT> PROPOSAL: We don’t support unions of datatypes in this version but solicit feedback from reviewers for the next version of our specs

+1

<gkellogg> +1

<ivan> +1

<JeniT> +1

<jumbrich> +1

<ivan> RESOLUTION: We don’t support unions of datatypes in this version but solicit feedback from reviewers for the next version of our specs

JeniT: back to the list of issues ... does gkellogg have anything?

<JeniT> https://github.com/w3c/csvw/issues?q=is%3Aopen+is%3Aissue+label%3A%22Requires+telcon+discussion%2Fdecision%22

<JeniT> https://github.com/w3c/csvw/issues/203

gkellogg: most of the ones I am responsible for are pending a PR about various json-ld issues

JeniT: I'm working my way through this ...

ivan: can we get rid of #252?

<JeniT> https://github.com/w3c/csvw/issues/252

gkellogg: and #245

ivan: I am not absolutely sure how to close #252

<JeniT> I think ivan’s proposal is to not support comments

ivan: the real question is: what is the effect of comment lines on the row numbers we use?

<JeniT> https://github.com/w3c/csvw/issues/252#issuecomment-76599710

ivan: if we want to use row numbers from the original files
... we need to parse those comment lines
... [...] that's a problem
... I didn't check RFC7111 - does this deal with comment lines?

JeniT: no - in RFC 7111 there is no such thing as a comment line

ivan: suggest that we just close this issue and just ignore comment lines?

gkellogg: what if a comment line is included in the skip-rows ... should I ignore a comment row in the skip-rows zone?
... a [...] mess :-)
... suggest we leave this as it is

JeniT: this is all non-normative anyway

ivan: but we use the row number in the normative parts of the spec ...

JeniT: but we're talking about `sourcenum` - this might be null
... specifying this is out of our control
... applications need to be aware that source num is null and not use it

ivan: I think this means that the parser will skip comment lines EVEN in the skip-rows zone in the header
... perhaps we just don't talk about comment prefixes

JeniT: there are a bunch of CSV files in the real world that have comment lines
... people will want to ignore those comment lines
... I think that's the intention
... We flag up the issues around (source) row numbers when people are publishing non-standard CSV (with comments)
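The row-numbering issue raised here can be sketched as follows: when comment-prefixed lines are ignored, the logical row number and the source line number diverge. This is an illustrative sketch only, not the spec's normative parsing algorithm; `parse_skipping_comments` and the dict keys are hypothetical.

```python
import csv

# Hedged sketch of ignoring comment-prefixed lines while tracking both
# the logical row number and the original source line number.
def parse_skipping_comments(text, comment_prefix="#"):
    rows = []
    logical = 0
    for source_num, line in enumerate(text.splitlines(), start=1):
        if line.startswith(comment_prefix):
            continue  # comment lines are ignored, per the proposal
        logical += 1
        fields = next(csv.reader([line]))
        rows.append({"row": logical, "sourcenum": source_num,
                     "fields": fields})
    return rows

data = "# a comment\ncountry,name\n# another\nAD,Andorra\n"
rows = parse_skipping_comments(data)
# the logical second row comes from source line 4, so consumers relying
# on source row numbers must account for the skipped comment lines
```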

<JeniT> PROPOSAL: we spec the handling of comment prefixes on lines so that they are ignored, and flag up the issues around row numbering that this raises

<gkellogg> +1

<ivan> +1

+1 ... I think we should cover the real world

<jumbrich> +1

JeniT: out of time

<ivan> RESOLUTION: we spec the handling of comment prefixes on lines so that they are ignored, and flag up the issues around row numbering that this raises

JeniT: thanks ... let's continue to try to close issues in GitHub. Critical mass (for closing issues) is 3 +1s

<JeniT> regrets from ivan & jtandy

Summary of Action Items

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.140 (CVS log)
$Date: 2015/03/04 16:08:02 $