CSV on the Web Working Group Teleconference -- 18 Feb 2015

<trackbot> Date: 18 February 2015

<jumbrich> zakim ??P2 is me

Q: how many of us are Skyping in? considering Ivan's notification that Zakim will be shut down later this year

(I'm skyping)

<gkellogg> I’m on skype

Dan, Ivan (called back to skype-in), JeniT, Gregg, …

<scribe> scribenick: danbri

jenit: 6 issues listed for discussion

<JeniT> https://lists.w3.org/Archives/Public/public-csv-wg/2015Feb/0020.html

<JeniT> https://github.com/w3c/csvw/issues/195

… some quick hopefully, others need a bit more discussion

"Effect of tableSchema on both Table and TableGroup #195"

jenit: I proposed 2nd of the suggested options

namely that the one in table completely overrides that in the table group

no fancy merging

any objections?

ivan: in spite of the normalization of the metadata?

jenit: yes

gkellogg: it's more a search path

when you're going through the cols you look at the first schema you find

you don't take into account what might also be represented in the table group

you might imagine using table group … matching cols in diff langs

that would not be supported here

jenit: or you can imagine this as part of the normalization process

e.g. copying in where missing [approximately what jeni said]

jenit: any objections to proposed closure?

[none heard - resolved]

[updated to indicate editor action status]
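
(illustrative sketch, not from the call: the "search path" reading of the #195 resolution, where a table's own tableSchema is used as-is and the table group's schema is only consulted when the table has none. The dictionary layout and the helper name `effective_schema` are assumptions made for this example, not text from the spec.)

```python
def effective_schema(table, table_group):
    """Return the schema that applies to a table.

    The table's own tableSchema wins outright; the table group's
    schema is only a fallback. Nothing is merged between the two.
    """
    if "tableSchema" in table:
        return table["tableSchema"]
    return table_group.get("tableSchema")


# Hypothetical metadata: one group-level schema, one table that overrides it.
group = {
    "tableSchema": {"columns": [{"name": "id"}, {"name": "label"}]},
    "tables": [
        {"url": "a.csv"},
        {"url": "b.csv", "tableSchema": {"columns": [{"name": "code"}]}},
    ],
}

schemas = [effective_schema(t, group) for t in group["tables"]]
```

Here a.csv picks up the group schema, while b.csv sees only its own single column; the group's "id" and "label" columns are not copied in.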

<JeniT> https://github.com/w3c/csvw/issues/226

"Support for totalDigits and fractionDigits #226"

jenit: suggestion is that we remove these

ivan: +1

<jumbrich> +1

(gkellogg +1'd in git)

resolved - moved to editor action

<DavideCeolin> +1

<JeniT> https://github.com/w3c/csvw/issues/220

<JeniT> http://w3c.github.io/csvw/metadata/#processing-tables

"Move Processing Tables section from metadata to model document #220"

jenit: this is about what processors can do with tables - displayed, converted, etc etc

suggestion is that we move that section into the Model doc

…because it is about actions over the table model.

… that means that the metadata doc is purely about how to annotate a table model / generate a table model

gkellogg: might require creating a couple of term definitions in the syntax doc

… juggling those between specs; makes a lot of sense to keep that in a single place.

jenit: yep

… probably a bit more editorial juggling to do also, cross-refs, where terms are defined etc.

+1s from gkellogg, ivan on issue

+1 from danbri here

<jumbrich> +1

<DavideCeolin> +1

resolved - moved to editorial

<JeniT> https://github.com/w3c/csvw/issues/212

jenit: the other 3 issues arose from possible use case offered in #212

… looking at real life edu data, school performance stats

… one thing interesting here is that looking at the data in depth, you see particular codes eg. SUPT, NE (= not entered) that can take the place of normal statistics

i.e. cols have typically got numeric content but can have such values as alternate content

one way of viewing those is to view them as null values

jenit: I'll summarize this set of issues as they are linked

<JeniT> https://github.com/w3c/csvw/issues/218

one approach is to say they're all kinds of nulls

see #218, … which says the value interpreted as null could be written multiple ways

… give a bit more structure, etc

<JeniT> https://github.com/w3c/csvw/issues/223

(aside https://en.wikipedia.org/wiki/Null_(SQL) … scary world)

jenit: #223 explores possibility of union values

cols are numbers or else strings like NA

<JeniT> https://github.com/w3c/csvw/issues/224

in order to support union-based types, where you'd need to list the set of datatypes that the cells needed to comply with, you would really need a different structure for datatypes

… which brings us to #224, "Reworking structure for datatypes #224".

gkellogg: another use case I've often seen: col might have different date stamps in it. Dates and Date-times intermingled. This would allow a super datatype there

maybe allowing different dates in different formats

ivan: original poster explicitly asked for data type unions, not sure it was just w.r.t. null.

jenit: maybe it is helpful then to talk about the requirement for union types and whether we want to support them

thoughts on supporting union types?

ivan: I was wondering about the opposite direction?

(jumbrich - is there anything from your study of actual CSV files to guide us here?)

jenit: any objections to restructuring the datatypes?

… basically they become their own little object, including base datatype, … e.g. decimal, … then you have extra properties

… we could imagine in future naming those

<ivan> +1

jenit: i personally think it is the right way to structure it
... you can still say 'decimal'

but you could also use a structure to set max/min etc
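
(illustrative sketch, not from the call: one way to read "you can still say 'decimal'" alongside the structured form. The property names `base`, `minimum` and `maximum` are assumptions for this example; the minutes only say the object includes a base datatype plus extra properties such as max/min.)

```python
def normalize_datatype(datatype):
    """Expand the shorthand string form into the structured object form."""
    if isinstance(datatype, str):
        return {"base": datatype}
    return dict(datatype)


def within_bounds(value, datatype):
    """Check a numeric value against the optional bounds on a datatype."""
    dt = normalize_datatype(datatype)
    if "minimum" in dt and value < dt["minimum"]:
        return False
    if "maximum" in dt and value > dt["maximum"]:
        return False
    return True


simple = normalize_datatype("decimal")
ranged = normalize_datatype({"base": "decimal", "minimum": 0, "maximum": 100})
```

Both forms end up as the same kind of object, so a processor only has to handle one shape.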

jumbrich: re dan's q, … in our study we used a simple heuristic, tried to guess cell type by using regexes, … and we found a couple of times that a col had multiple types in it. We just went with the majority.

I could try to look deeper, find what kinds of types did this

jenit: the fact that you noticed that that was happening is enough to know it's out there in real data

jumbrich: other case is decimals … maybe excel export

… maybe more interesting when strings vs numericals; or strings vs urls

I can try to look a bit into detail, what kind of different datatypes we observed in 1 col, and report back
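
(illustrative sketch, not jumbrich's actual code: a regex-based guess of each cell's type, a per-column tally, and the "go with the majority" rule described above. The patterns and type names are assumptions chosen for the example.)

```python
import re
from collections import Counter

# Hypothetical lexical patterns, tried in order; anything else is a string.
PATTERNS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("decimal", re.compile(r"[+-]?\d*\.\d+")),
    ("url", re.compile(r"https?://\S+")),
]


def guess_cell_type(cell):
    """Return the name of the first pattern that matches the whole cell."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(cell):
            return name
    return "string"


def column_types(cells):
    """Count how many cells in a column fall under each guessed type."""
    return Counter(guess_cell_type(c) for c in cells)


def majority_type(cells):
    """Pick the most frequent guessed type for a non-empty column."""
    return column_types(cells).most_common(1)[0][0]


column = ["12", "7", "NE", "30", "SUPT", "4"]
counts = column_types(column)
is_mixed = len(counts) > 1
```

A column like this one is reported as integer by majority, but the tally also shows it is mixed, which is the signal the union-type analysis would look for.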

jenit: ivan, you were asking for usecases. would it help to get that?

ivan: what if a cell can be interp'd by several of the datatypes

jenit: suggest using array order

ivan: yeahbut, … i know pathological, but imagine 2 alternatives. One is JSON, the other is a string.

How do I decide that a string is JSON

… or say XML

do I have to parse the whole thing to know?

I'm not sure we have a clear idea about these ugly edge cases

jenit: that is a separate issue about validation of xml and json and html

… this arises regardless because we say values have to be valid against whatever the datatype is

…we can go one of two ways. We say you have to go all the way and really parse it. Or else say that typing on these cols is just a hint, ...

(which has implics for union types in case that the markup isn't as valid as intended)

gkellogg: we could say that datatypes explicitly have a regex form, then that is used to match, otherwise it is the first found. That would basically get everything.

There are a couple of areas in xsd where you have confusions

datetime stamps are datetimes, etc

otherwise they are largely in diff spaces

ivan: but then, … for time being at least, a convertor to json or to rdf can be lax

… meaning that it does not do any validation. it will just believe what is there, and produce whatever is the datatype that is signalled.

if we introduce this, strictly speaking a converter cannot be lax, it'll have to make a decision.

jenit: depends on your meaning of 'lax'

… text we have currently says that an error must be generated if not valid against a particular datatype

e.g. a decimal vs the word 'foo'

…means that the value of the cell is set to the string 'foo' rather than a decimal number. The conversion

…can then do whatever with that value. It could be lax and …

or strict and raise error.

various options.

jenit: the checking of the value and the generation of the value for the cell happens regardless.

it has to, otherwise you get in real messes around parsing of dates etc.

gkellogg: say i have a set of datatypes, e.g. date, boolean, … listed, …

as a convertor i need to check lexical form to see if value matches date, or then boolean, … then if it doesn't, what do i use, the last one? string?

jenit: it (i.e. "foo" in example) is a string

… if you had datatype: boolean, and string value was foo

…then value of the cell is the string 'foo'
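
(illustrative sketch, not from the call: the behaviour jenit describes, where checking always happens, an invalid value stays as the original string with an error recorded, and the conversion step then chooses to be lax or strict. The function names and the accepted boolean spellings are assumptions for this example.)

```python
def parse_boolean(text):
    """Parse a boolean lexical form; raise ValueError for anything else."""
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"{text!r} is not a valid boolean")


def cell_value(text, parser):
    """Return (value, errors). On failure the value is the original string."""
    try:
        return parser(text), []
    except ValueError as exc:
        return text, [str(exc)]


def convert(text, parser, strict=False):
    """A converter may pass the string through (lax) or raise (strict)."""
    value, errors = cell_value(text, parser)
    if strict and errors:
        raise ValueError(errors[0])
    return value


value, errors = cell_value("foo", parse_boolean)
```

With datatype boolean and the string foo, the cell value is the string 'foo' plus one recorded error; only the strict converter turns that into a failure.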

gkellogg: if the type was xml literal, … because we don't have a defined format for detecting it, i'd just go ahead and say it was an xml literal

… could say default comes from def of that datatype

… format/pattern

<JeniT> https://github.com/w3c/csvw/issues/236 << validation of html/xml/json datatypes

we could then have same datatype diff times with diff patterns/formats

ivan: we've moved away from the q of whether we want a structure for datatypes to be an array

<JeniT> https://github.com/w3c/csvw/issues/224

<ivan> +1

jenit: let's try closing 224. any objections to structured datatypes?

<gkellogg> +1

[tumbleweed]

+1 is for it, right? :)

<gkellogg> Right

resolved -> editor action

<JeniT> https://github.com/w3c/csvw/issues/236

#236

jenit: if we have a cell that is marked as being e.g. json do we want to validate that it is actually json

similarly for xml, html, …

ivan: I agree w/ not validating

danbri: +1 for not needing to

gkellogg: validation isn't the right word

(xml-wf?)

ivan: not even WF, as an xml segment needn't have a top level element

all we could find in rdf discussion of this was some DOM function

gkellogg: just wording choice

i don't think we want/need detection on these 3

just a note to say that pattern/format can be used

to help discriminate

jenit: can you clarify?

gkellogg: I mean that if someone wanted to try to discriminate, based automatically on datatype, … could put a format in there which looked for <html

so that they could distinguish

ivan: i'd keep it simple

<JeniT> Proposal: we won’t built-in recognise/validate html/xml/json, but add a note to say that authors can add a pattern if they want

(that's a heavy rider tacked on the end)

ivan: so if there is a pattern i have to use it?

<JeniT> http://w3c.github.io/csvw/metadata/#formats-for-other-types

jenit: "format property provides a regex ...."

gkellogg: in formats for date/time the format is yyyy-mm-dd in which case that is not a regex

but there is still a pattern property

gkellogg: what are consequences of having both format and pattern?

jenit: pattern only on a format for a numeric type

not at the top level, alongside format

<JeniT> http://w3c.github.io/csvw/metadata/#formats-for-numeric-types

never clashes

gkellogg: ok

jenit: back to HTML/XML/JSON people can already use format property to constrain the value as gregg described

so the note is just a pointer to existing functionality
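
(illustrative sketch, not from the call: how an author-supplied regex in the format property could discriminate HTML cells without any built-in validation, as gregg described. The pattern itself and the choice to anchor it at the start of the cell are assumptions for this example.)

```python
import re

# Hypothetical pattern an author might supply as the format for an html column.
HTML_FORMAT = r"\s*<html\b"


def looks_like(text, fmt):
    """True if the cell text starts with something matching the format regex."""
    return re.match(fmt, text) is not None


cells = ["<html><body>hi</body></html>", '{"a": 1}']
flags = [looks_like(c, HTML_FORMAT) for c in cells]
```

Nothing is parsed here; the check is only as good as the pattern the author chose to write.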

ivan: ok

<JeniT> Proposal: we won’t built-in recognise/validate html/xml/json, but add a note to say that authors can add a pattern if they want

<gkellogg> +1

<DavideCeolin> +1

<ivan> +1

<jumbrich> +1

<JeniT> +1

resolved

https://github.com/w3c/csvw/issues/218 - Categories of null values #218

gkellogg: instead of treating these multiple values as diff version of null, could treat them as …[missed]

tokens

ivan: what is current situation?

… it looked as if null was already an array

<JeniT> https://github.com/w3c/csvw/issues/136

jenit: it is but we decided from #136 that null would become a single value

… so the doc hadn't been updated to reflect that resolution

effectively #218 reopens #136 but with more of a rationale for why you might want multiple null values

ivan: what's the merge? do we concat the arrays?

jenit: atomic

so you do not merge the arrays

if you have two metadata files, the null list from A overrides that from B

ivan: fine w/ that

… and ok w/ several null values that way

jenit: ok

<JeniT> Proposal: allow several null values, but merge in an atomic way (don’t merge arrays)

<gkellogg> +1

<jumbrich> +1

<DavideCeolin> +1

<ivan> +1

<JeniT> +1
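
(illustrative sketch, not from the call: an atomic merge where the null list from metadata file A replaces the one from B outright instead of being concatenated. Treating every top-level property atomically is a simplification for this example, and the `lang` property is only there to show that unrelated properties still carry over.)

```python
def merge_metadata(high, low):
    """Merge two metadata objects; properties in `high` replace those in `low`.

    Arrays such as the null list are treated atomically: never concatenated.
    """
    merged = dict(low)
    merged.update(high)
    return merged


def is_null(cell, metadata):
    """True if the cell text is one of the declared null values."""
    return cell in metadata.get("null", [])


a = {"null": ["NE", "SUPT"]}
b = {"null": ["", "NA"], "lang": "en"}
merged = merge_metadata(a, b)
```

After the merge only NE and SUPT count as null; NA from file B no longer does.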

<JeniT> https://github.com/w3c/csvw/issues/223

https://github.com/w3c/csvw/issues/223 - Allowing "unions" of datatypes? #223

ivan: if we move to datatypes being these objects then the q of merge arises for those as well, regardless of the union issue

jenit: true

ivan: we merge atomically or property by property

[someone said 'yes']

jenit: i agree

gkellogg: general trend is to make a small set of things which merge.

ivan: maybe adding that note to the issue?

jenit: propose that we allow arrays of datatypes to be provided and the first datatype wins in terms of labelling a particular value

<JeniT> Proposal: we allow arrays of datatypes to be provided, and the first matching datatype wins in terms of assignment of datatype to a particular value

<JeniT> (and atomic merge)
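
(illustrative sketch, not from the call: the proposal read as "walk the array in order and take the first datatype whose lexical form matches", with gregg's earlier suggestion that a datatype without a regex form simply matches. The regexes and the helper name `assign_datatype` are assumptions for this example.)

```python
import re

# Hypothetical lexical forms; datatypes not listed here (e.g. string) match anything.
LEXICAL_FORMS = {
    "integer": re.compile(r"[+-]?\d+"),
    "decimal": re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def assign_datatype(text, datatypes):
    """Return the first datatype in the array that the cell text matches."""
    for name in datatypes:
        form = LEXICAL_FORMS.get(name)
        if form is None or form.fullmatch(text):
            return name
    return None  # no datatype in the union matched


union = ["integer", "decimal", "string"]
assigned = [assign_datatype(c, union) for c in ["42", "4.2", "NA"]]
```

Array order matters: with decimal listed before integer, the cell 42 would be labelled decimal, which is the kind of edge case ivan raised.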

ivan: at this moment my vote is "If we do it, then yes that's the way we should do it" (but) I would like to see jumbrich's measures before we make a decision on this rather than rush a new feature based on 1 new use case

jenit: ok

gkellogg: leave a week?

… we need to decide

jenit: what do you think is your measure of what would be persuasive. How many cases? 10, 20? 3?

ivan: not a number

… he goes through a certain amount of usecases for scientific data. If only 5% have this feature I would go against it. Obviously if it goes up to 30% then yes.

<jumbrich> info: we should have around 80K+ documents, from which 60k we could parse.

jenit: I'd say 5% is too high a bound. We should support features that are in 1 in 20 docs

gregg/dan: agree

jumbrich: i will have a look and report on how many cols per doc we found at least 2 or 3 datatypes

then try to present them in a reasonable way

[click]

(sip client problems)

ivan: I'll be away next week

(dan away too)

jumbrich: I'll email around beforehand

ivan: fear of feature creep

<JeniT> ACTION: jumbrich to do an analysis on union types to see if they are prevalent in real data [recorded in http://www.w3.org/2015/02/18-csvw-minutes.html#action01]

<trackbot> Created ACTION-64 - Do an analysis on union types to see if they are prevalent in real data [on Jürgen Umbrich - due 2015-02-25].

jenit: fine, leave til next week
... jumbrich's github id?

<jumbrich> jumbrich is my github id

jenit: access on repo?

ivan: no

ivan to add jumbrich to our github group

jumbrich: if we write code, should it be hosted here?

e.g. to extract metadata files etc?

ivan: no need to, can host wherever

AOB?

(github admin details)

Adjourned.

KUTGW - please try to vote on proposals etc on github.

<ivan> trackbot, end telcon
