14:02:44 RRSAgent has joined #csvw
14:02:44 logging to http://www.w3.org/2015/03/04-csvw-irc
14:02:46 RRSAgent, make logs public
14:02:46 Zakim has joined #csvw
14:02:48 Zakim, this will be CSVW
14:02:48 ok, trackbot; I see DATA_CSVWG()10:00AM scheduled to start in 58 minutes
14:02:49 Meeting: CSV on the Web Working Group Teleconference
14:02:49 Date: 04 March 2015
14:07:14 Chair: Jeni
14:07:18 Regrets: DanBri
14:57:53 gkellogg has joined #csvw
15:00:19 zakim, code?
15:00:19 the conference code is 2789 (tel:+1.617.761.6200 sip:zakim@voip.w3.org), gkellogg
15:01:09 jtandy has joined #csvw
15:01:40 jtandy has joined #csvw
15:01:47 DATA_CSVWG()10:00AM has now started
15:01:54 +[IPcaller]
15:01:55 (just waiting for my fire alarm to complete!)
15:02:50 +gkellogg
15:03:20 zakim, code?
15:03:20 the conference code is 2789 (tel:+1.617.761.6200 sip:zakim@voip.w3.org), ivan
15:03:50 +ivan
15:03:53 +??P10
15:04:01 -??P10
15:04:27 +??P10
15:04:37 zakim, ??P10 is me
15:04:37 +jtandy; got it
15:05:44 scribe: Jeremy Tandy
15:05:49 scribenick: jtandy
15:05:59 jumbrich has joined #csvw
15:06:18 +??P14
15:06:28 zakim, ??P14 is jumbrich
15:06:28 +jumbrich; got it
15:06:49 https://github.com/w3c/csvw/issues?q=is%3Aopen+is%3Aissue+label%3A%22Requires+telcon+discussion%2Fdecision%22+sort%3Acreated-asc
15:07:02 agenda: http://www.w3.org/2013/csvw/wiki/Meetings#Standing_Agenda
15:07:34 JeniT: inclined to ask jtandy for an update on the mapping documents - then look at issues arising
15:07:59 jtandy: I’ve got the RDF mapping document to a point where I think it’s good for review by the group
15:08:15 … there are bound to be errors in it, so inputs gratefully received
15:08:27 … I’m most of the way through the JSON equivalent, which is similar but has different terminology
15:08:34 … which should be finished in the next day or so
15:08:44 … latest version is at:
15:08:49 http://w3c.github.io/csvw/csv2rdf/
15:09:07 jtandy: the ToC is a lot simpler than it was previously
15:09:23 … I’ve used gkellogg’s suggestion for an algorithmic approach
15:09:29 … the inclusion of provenance is non-normative
15:09:34 … there are four examples
15:09:54 … the algorithm is in 3.2
15:09:55 http://w3c.github.io/csvw/csv2rdf/#generating-rdf
15:10:16 jtandy: this discusses what you do in standard & minimal mode
15:10:35 … it says, ‘at this stage create a triple…’ etc
15:10:46 … it goes through table groups, tables, rows, and the cells themselves, and I think it makes sense :)
15:11:27 ivan: reading through, it says use the JSON-LD algorithm on any common properties
15:11:35 gkellogg: that can be updated I think, when my PR is merged
15:11:50 … the other thing is that you’re starting from the table group
15:11:58 … the process of getting metadata or any table reference
15:12:13 jtandy: we always start from a table group, create a new node G for the group
15:12:27 gkellogg: there are two different parts to processing the metadata
15:12:46 … you can start with the metadata & load the files, or start from the files & load the metadata
15:12:53 … have you thought about factoring that logic in?
15:13:21 jtandy: I’ve made the statement in the intro that I don’t care how we’ve got to the table group: whether you go from metadata to CSV or CSV to metadata, the point is when you have the table group in memory, we create RDF from that
15:13:27 … does that make sense?
15:13:34 gkellogg: that’s the way to go through the metadata
15:13:53 … the fact is that as you go through the table group and encounter CSV files you get more metadata, and that requires more metadata
15:14:06 … which requires a recursive approach, but perhaps I’m wrapped up in my own implementation
15:14:31 … whereas if you start with a table group you have a different approach
15:15:28 q+
15:15:40 ack ivan
15:15:44 JeniT: thinks that jtandy is showing the right approach here - starting from the model ... but gkellogg's issue needs to be raised in the model document
15:16:20 ivan: bothered from a user perspective
15:16:51 q+
15:16:56 ... start with a CSV file and [...] get to the metadata
15:17:05 I think that should be a user option
15:17:16 ... which could mean finding table groups - and more CSV files to merge in
15:17:29 ... not sure that this type of processing is what users want
15:17:31 and the suppressOutput flag provides for suppressing the outputs from different tables
15:17:35 q+
15:17:44 ... don't think they will want _every_ CSV file included in the output
15:17:46 ack gkellogg
15:18:18 gkellogg: the way my system works - it starts with the metadata and then looks for the CSV
15:18:35 ... and then opens the CSV to get more metadata [...]
15:18:41 ... fairly intuitive
15:18:44 ack me
15:19:15 JeniT: do we want to discuss this here - or online?
15:19:21 gkellogg: online ...
15:19:47 JeniT: gkellogg - please can you open a new issue about which metadata gets used.
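[Scribe note: to illustrate the table group → table → row → cell walk jtandy describes above, a rough Turtle sketch of the shape of standard-mode output for one cell of the countries example; the csvw: property names are assumed from the draft vocabulary and may differ in detail:]

  @prefix csvw: <http://www.w3.org/ns/csvw#> .
  @base <http://example.org/countries.csv> .

  _:G a csvw:TableGroup ;          # one node for the group of tables...
      csvw:table _:T .
  _:T a csvw:Table ;               # ...one per table...
      csvw:url <http://example.org/countries.csv> ;
      csvw:row _:R .
  _:R a csvw:Row ;                 # ...one per row...
      csvw:rownum 1 ;
      csvw:url <#row=2> ;
      csvw:describes _:S .
  _:S <#countryCode> "AD" .        # ...and one triple per cell, hung off the row's subject

In minimal mode only the cell-level triples (the last line here) would be produced.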
15:20:09 JeniT: jtandy, are there any issues that it would be useful to resolve, to unblock you?
15:20:14 jtandy: #286
15:20:26 https://github.com/w3c/csvw/issues/286
15:20:46 … in the public sector roles & salaries example, the idea was that some files were published centrally, and some by the departments
15:21:07 … when gkellogg and I talked about this, it appears that his implementation expected the use of relative URLs and that they all had to be on the same host
15:21:20 gkellogg: well, not entirely, my implementation tries to load the resources that it discovers
15:21:26 -jumbrich
15:21:33 … if those URLs are example.com/ etc then it won’t be able to load those
15:21:41 +??P14
15:21:51 zakim, ??P14 is jumbrich
15:21:51 +jumbrich; got it
15:22:15 … if you start from one metadata document and something is at a fictitious location, that’s an issue because you can’t load things from there
15:22:26 jtandy: so I need to have a relative URL on example.org?
15:22:38 gkellogg: if the URL references are relative, they’re on whatever the base URL is
15:23:17 JeniT: they need to be retrievable
15:23:32 gkellogg: the namespace location has been changed since our first release
15:23:52 … I don’t know whether we want to wait to publish the namespaces until we’re done
15:24:01 ivan: I update them when we publish the documents
15:24:17 jtandy: in real life the professions CSV file would be on a real host; I’ll figure out some words
15:24:37 gkellogg: we should have examples that reference other locations, but we won’t be able to run those examples without infrastructure or a common test suite mechanism
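[Scribe note: a minimal sketch of the relative-URL point in #286, written in Turtle for brevity and assuming a hypothetical metadata location; a relative reference resolves against the document's base URL, and the result has to be retrievable for an implementation to load it:]

  @prefix csvw: <http://www.w3.org/ns/csvw#> .
  @base <http://example.org/gov.uk/metadata.json> .

  # <professions.csv> resolves against the base above to
  # <http://example.org/gov.uk/professions.csv>, which must be a real, retrievable location
  <professions.csv> a csvw:Table .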
15:26:01 jtandy: I’m just looking at #289, about noProv processing
15:26:03 https://github.com/w3c/csvw/issues/289
15:26:28 … I thought at the F2F we decided that implementations may choose to add provenance information but it’s not our concern
15:26:41 gkellogg: it’s a concern for conformance and testing
15:27:06 … we need to be able to turn it off so that we can test the results from the implementations
15:27:23 jtandy: so I need to have something that says that no other triples are generated in ‘NoProv’ mode
15:27:36 ivan: saying something like that is non-RDFy
15:27:42 … you don’t close the world like that
15:28:03 gkellogg: I think it’s for conformance purposes, it’s really useful to be able to absolutely predict the triples that are generated
15:28:19 … RDFa didn’t do that, which means that we had to test things with SPARQL, which was difficult
15:28:26 … having that mode makes it much easier to test
15:28:49 ivan: isomorphism would not work anyway because it depends on the way I serialise it in Turtle or what order I use for my triples
15:28:50 gkellogg: no
15:29:16 … that’s not true, isomorphism specifically looks to ensure that bnodes named differently can still be matched, etc
15:29:38 ivan: in general, I understand the difficulties of testing but we should not constrain our specification by how testing can be done
15:30:40 JeniT: I propose that we say that implementations have to have a NoExtras mode to support testing, but that this isn’t part of the spec
15:30:47 +1
15:30:55 +1
15:30:56 +1
15:31:09 +1
15:31:23 … so #289 goes onto the test suite
15:31:34 jtandy: and on #292
15:31:43 https://github.com/w3c/csvw/issues/292
15:32:17 … it talks about scripted conversions, and the source being RDF or JSON; we have two modes, and it seems that the scripted conversions might want minimal or standard as a starting point
15:32:35 ivan: the minimal mode is more for human consumption; we should always use the standard mode as a starting point for the scripting
15:32:51 … from an RDF triplestore point of view, the fact that there are more triples than in minimal shouldn’t be a problem, that’s the whole point
15:32:56 … I think it’s OK to say they get everything
15:33:02 … if they want to filter it out, do it
15:33:36 JeniT: still say standard mode for JSON, it’s easy to ignore stuff
15:33:46 jtandy: so the metadata document gets updated to say it operates on standard mode
15:34:25 q+ to discuss briefly #290 https://github.com/w3c/csvw/issues/290
15:34:44 http://w3c.github.io/csvw/csv2rdf/#example-countries
15:35:44 jtandy: minimal mode is example 3, standard mode example 4
15:35:52 ivan: for me it looks fine
15:36:09 … two issues: if I start from a CSV file I might not want to have this table group at the top (I don’t even know what it is)
15:36:23 … the row\=2, where does that come from?
15:36:25 I agree with ivan on not necessarily having TableGroup
15:36:34 jtandy: in Turtle the equals sign is a reserved character
15:36:40 … I have made a comment on this in the notes
15:36:49 gkellogg: in the metadata document we use a . rather than =
15:37:01 jtandy: this is an RFC 7111 fragment identifier
15:37:08 -jumbrich
15:37:22 +??P3
15:37:30 zakim, ??P3 is jumbrich
15:37:30 +jumbrich; got it
15:37:55 ivan: at least for the example, the prefix t1 should include row=
15:38:00 … it’s not readable currently
15:38:20 <#row=2>
15:38:31 +1
15:38:44 JeniT: I suggest setting the base and then using a relative URL
15:38:45 +1
15:39:28 jtandy: shall I take out the t1, t2, t3 prefix definitions?
15:39:34 JeniT: I would prefer that, use full URLs
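[Scribe note: a minimal Turtle sketch of the row\=2 point just discussed, assuming a hypothetical location for the countries file; the fragment is an RFC 7111 row reference, and setting a base avoids the prefixed-name escaping:]

  @prefix csvw: <http://www.w3.org/ns/csvw#> .
  @prefix t1: <http://example.org/countries.csv#> .

  # with a prefixed name the '=' must be escaped, which is what makes it hard to read:
  t1:row\=2 a csvw:Row .

  # with a base and a relative IRI, no escaping is needed and the same IRI is produced:
  @base <http://example.org/countries.csv> .
  <#row=2> a csvw:Row .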
15:39:46 jtandy: did we agree to have an ordered property, and to use a list if ordered=true?
15:39:52 … I think we agreed that at the F2F
15:40:03 -gkellogg
15:40:15 https://github.com/w3c/csvw/issues/107
15:40:30 +gkellogg
15:41:05 JeniT: ordered lists - we’ll add to model & metadata
15:41:47 gkellogg: the only other comment on #290 was that more of the content can be deferred to the model document
15:41:59 jtandy: we’ll discuss online
15:42:18 JeniT: would like to go back to the questions about unions of datatypes
15:42:21 https://github.com/w3c/csvw/issues/223
15:42:47 https://lists.w3.org/Archives/Public/public-csv-wg/2015Mar/0002.html
15:43:01 JeniT: juergen - please can you describe your conclusions?
15:43:23 jumbrich: we're now able to parse 50,000 CSV files
15:43:56 ... the info shared on the list shows the number of columns where more than one data type is found
15:43:57 (DATE,FLOAT+)->8529
15:44:31 jumbrich: we have a lot of empty strings and null values
15:44:40 ... lots of date formats
15:44:42 (ALPHA,NUMBER+)->5636
15:44:58 q+
15:45:02 (ALPHA,FLOAT+)->4581
15:45:04 ... often get numbers where strings should be
15:45:06 ack gkellogg
15:45:06 gkellogg, you wanted to discuss briefly #290 https://github.com/w3c/csvw/issues/290
15:45:12 ack ivan
15:45:52 ivan: I know this is difficult, but do you have a feeling for which of these examples are intentional and which are bugs?
15:46:16 jumbrich: difficult, differences may be due to different tools?
15:46:42 jumbrich: non-conformant dates might be caused by different locales
15:47:04 jumbrich: numbers might just be different representations of values
15:47:36 ivan: what they are determines whether or not we should adopt the union of datatypes
15:48:02 ... it's still not clear whether there is evidence that USERS mean to specify multiple datatypes for a given column
15:48:34 ... there are implications for complexity if we adopt the union of datatypes
15:48:43 jumbrich: agreed
15:49:22 ... take the example of room identifiers; some might be A111, others might be 101 (just numeric)
15:49:55 JeniT: propose that we don't support unions of datatypes in this version - but solicit feedback from reviewers
15:50:07 gkellogg: make sure we call for comment in the next version
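[Scribe note: a small Turtle sketch of jumbrich's room-identifier example, assuming a hypothetical rooms.csv with a roomId column; a union of datatypes would have let values from the same column carry different types, whereas under the proposal the publisher declares a single datatype (e.g. string) for the whole column:]

  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  @base <http://example.org/rooms.csv> .

  # the same column holds both kinds of value; a union datatype would allow:
  [] <#roomId> "A111" .                # plain string
  [] <#roomId> "101"^^xsd:integer .    # integer
  # without unions, both cells would simply carry the single declared datatype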
15:50:33 ivan: practical issue, for our own workflow: let's try to get all the issues closed for the current version
15:50:50 ... so that we have a clean break when publishing a new WD
15:50:53 Perhaps use a flag on the issue?
15:51:09 JeniT: I think it's fine to keep open those for which we are soliciting feedback
15:51:22 PROPOSAL: We don’t support unions of datatypes in this version but solicit feedback from reviewers for the next version of our specs
15:51:28 +1
15:51:28 +1
15:51:32 +1
15:51:36 +1
15:51:49 +1
15:52:04 RESOLUTION: We don’t support unions of datatypes in this version but solicit feedback from reviewers for the next version of our specs
15:52:39 JeniT: back to the list of issues ... does gkellogg have anything?
15:52:49 -jumbrich
15:52:54 https://github.com/w3c/csvw/issues?q=is%3Aopen+is%3Aissue+label%3A%22Requires+telcon+discussion%2Fdecision%22
15:53:01 +??P3
15:53:10 zakim, ??P3 is jumbrich
15:53:10 +jumbrich; got it
15:53:13 https://github.com/w3c/csvw/issues/203
15:53:14 gkellogg: most of the ones I am responsible for are pending a PR about various JSON-LD issues
15:53:26 JeniT: I'm working my way through this ...
15:53:37 ivan: can we get rid of #252?
15:53:38 https://github.com/w3c/csvw/issues/252
15:53:45 gkellogg: and #245
15:54:04 ivan: I am not absolutely sure how to close #252
15:54:23 I think ivan's proposal is to not support comments
15:54:33 ... the real question is: what is the effect of comment lines on the row numbers we use?
15:54:41 https://github.com/w3c/csvw/issues/252#issuecomment-76599710
15:54:50 ... if we want to use row numbers from the original files
15:54:59 ... we need to parse those comment lines
15:55:09 ... [...] that's a problem
15:55:40 ... I didn't check RFC 7111 - does it deal with comment lines?
15:55:59 JeniT: no - in RFC 7111 there is no such thing as a comment line
15:56:23 ivan: suggest that we just close this issue and ignore comment lines?
15:56:58 gkellogg: what if a comment line was included in the skip-rows ... should I ignore a comment row in the skip-row zone?
15:57:06 ... a [...] mess :-)
15:57:17 ... suggest we leave this as it is
15:57:32 JeniT: this is all non-normative anyway
15:57:53 ivan: but we use the row number in the normative parts of the spec ...
15:58:35 JeniT: but we're talking about `sourcenum` - this might be null
15:58:50 ... specifying this is out of our control
15:59:06 ... applications need to be aware that `sourcenum` may be null and not use it
15:59:54 ivan: I think this means that the parser will skip comment lines EVEN in the skip-rows zone in the header
16:00:11 ... perhaps we just don't talk about comment prefixes
16:00:34 JeniT: there are a bunch of CSV files in the real world that have comment lines
16:00:44 ... people will want to ignore those comment lines
16:00:51 ... I think that's the intention
16:01:23 ... we flag up the issues around (source) row numbers when people are publishing non-standard CSV (with comments)
16:01:35 PROPOSAL: we re-spec the handling of comment prefixes on lines so that they are ignored, and flag up the issues around row numbering that this raises
16:01:39 +1
16:01:41 +1
16:01:48 +1 ... I think we should cover the real world
16:01:48 +1
16:01:55 JeniT: out of time
16:01:58 RESOLUTION: we re-spec the handling of comment prefixes on lines so that they are ignored, and flag up the issues around row numbering that this raises
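[Scribe note: a minimal sketch of the row-numbering issue the resolution above flags up, assuming a hypothetical data.csv whose first line is a comment, second line the header, and third line the first data row; the draft's row description properties are assumed here, and the logical row number and the RFC 7111 source reference then diverge:]

  @prefix csvw: <http://www.w3.org/ns/csvw#> .
  @base <http://example.org/data.csv> .

  [] a csvw:Row ;
     csvw:rownum 1 ;      # logical number of the first data row, after the comment line is ignored
     csvw:url <#row=3> .  # source location: line 3 of the original file (comment + header precede it)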
16:02:39 JeniT: thanks ... let's continue to try to close issues in GitHub.
Critical mass (for closing issues) is 3 +1s
16:02:43 q+
16:02:57 regrets from ivan & jtandy
16:03:02 zakim, drop me
16:03:02 ivan is being disconnected
16:03:03 -ivan
16:03:05 -jtandy
16:03:06 -gkellogg
16:03:08 -JeniT
16:03:15 -jumbrich
16:03:16 DATA_CSVWG()10:00AM has ended
16:03:16 Attendees were JeniT, gkellogg, ivan, jtandy, jumbrich
16:04:30 rrsagent, draft minutes
16:04:30 I have made the request to generate http://www.w3.org/2015/03/04-csvw-minutes.html ivan
16:04:41 trackbot, end telcon
16:04:41 Zakim, list attendees
16:04:41 sorry, trackbot, I don't know what conference this is
16:04:49 RRSAgent, please draft minutes
16:04:49 I have made the request to generate http://www.w3.org/2015/03/04-csvw-minutes.html trackbot
16:04:50 RRSAgent, bye
16:04:50 I see no action items