See also: IRC log
<trackbot> Date: 18 February 2015
<jumbrich> zakim ??P2 is me
Q: how many of us are Skyping in? considering Ivan's notification that Zakim will be shut down later this year
(I'm skyping)
<gkellogg> I’m on skype
Dan, Ivan (called back to skype-in), JeniT, Gregg, …
<scribe> scribenick: danbri
jenit: 6 issues listed for discussion
<JeniT> https://lists.w3.org/Archives/Public/public-csv-wg/2015Feb/0020.html
<JeniT> https://github.com/w3c/csvw/issues/195
… some quick hopefully, others need a bit more discussion
"Effect of tableSchema on both Table and TableGroup #195"
jenit: I proposed 2nd of the suggested options
namely that the one in the table completely overrides that in the table group
no fancy merging
any objections?
ivan: in spite of the normalization of the metadata?
jenit: yes
gkellogg: it's more a search path
when you're going through the cols you look at the first schema you find
you don't take into account what might also be represented in the table group
you might imagine using table group … matching cols in diff langs
that would not be supported here
jenit: or you can imagine this as part of the normalization process
e.g. copying in where missing [approximately what jeni said]
jenit: any objections to proposed closure?
[none heard - resolved]
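A minimal sketch of the "search path" reading of this resolution, assuming metadata is held as plain Python dicts. The helper name `effective_schema` and the sample column names are hypothetical; only the `tableSchema` property comes from the discussion.

```python
from typing import Optional


def effective_schema(table: dict, table_group: dict) -> Optional[dict]:
    """Return the schema that applies to a table (hypothetical helper).

    The table's own tableSchema, when present, completely replaces the one
    on the table group; the two are never merged property by property.
    """
    if "tableSchema" in table:
        return table["tableSchema"]
    # Fall back to the group-level schema only when the table has none.
    return table_group.get("tableSchema")


# Illustrative data, not taken from any real metadata file.
group = {"tableSchema": {"columns": [{"name": "id"}, {"name": "label"}]}}
table_with_schema = {"url": "a.csv", "tableSchema": {"columns": [{"name": "code"}]}}
table_without_schema = {"url": "b.csv"}

print(effective_schema(table_with_schema, group))
print(effective_schema(table_without_schema, group))
```

The first schema found wins, so a group-level column list is ignored as soon as the table declares its own.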
[updated to indicate editor action status]
<JeniT> https://github.com/w3c/csvw/issues/226
"Support for totalDigits and fractionDigits #226"
jenit: suggestion is that we remove these
ivan: +1
<jumbrich> +1
+1
(gkellogg +1'd in git)
resolved - moved to editor action
<DavideCeolin> +1
<JeniT> https://github.com/w3c/csvw/issues/220
<JeniT> http://w3c.github.io/csvw/metadata/#processing-tables
"Move Processing Tables section from metadata to model document #220"
jenit: this is about what processors can do with tables - displayed, converted, etc etc
suggestion is that we move that section into the Model doc
…because it is about actions over the table model.
… that means that the metadata doc is purely about how to annotate a table model / generate a table model
+1
gkellogg: might require creating a couple of term definitions in the syntax doc
… juggling those between specs; makes a lot of sense to keep that in a single place.
jenit: yep
… probably a bit more editorial juggling to do also, cross-refs, where terms are defined etc.
+1s from gkellogg, ivan on issue
+1 from danbri here
<jumbrich> +1
<DavideCeolin> +1
resolved - moved to editorial
<JeniT> https://github.com/w3c/csvw/issues/212
jenit: the other 3 issues arose from possible use case offered in #212
… looking at real life edu data, school performance stats
… one thing interesting here is that looking at the data in depth, you see particular codes eg. SUPT, NE (= not entered) that can take the place of normal statistics
i.e. cols have typically got numeric content but can have such values as alternate content
one way of viewing those is to view them as null values
jenit: I'll summarize this set of issues as they are linked
<JeniT> https://github.com/w3c/csvw/issues/218
one approach is to say they're all kinds of nulls
see #218, … which says the value interpreted as null could be written multiple ways
… give a bit more structure, etc
<JeniT> https://github.com/w3c/csvw/issues/223
(aside https://en.wikipedia.org/wiki/Null_(SQL) … scary world)
jenit: #223 explores possibility of union values
cols are numbers or else strings like NA
<JeniT> https://github.com/w3c/csvw/issues/224
in order to support union-based types, where you'd need to list set of datatypes that the cells needed to comply with you would really need a different structure for datatypes
… which brings us to #224, "Reworking structure for datatypes #224".
gkellogg: another use case I've often seen: col might have different date stamps in it. Dates and Date-times intermingled. This would allow a super datatype there
maybe allowing different dates in different formats
ivan: original poster explicitly asked for data type unions, not sure it was just w.r.t. null.
jenit: maybe it is helpful then to talk about the requirement for union types and whether we want to support them
thoughts on supporting union types?
ivan: I was wondering about the opposite direction?
(jumbrich - is there anything from your study of actual CSV files to guide us here?)
jenit: any objections to restructuring the datatypes?
… basically they become their own little object, including base datatype, … e.g. decimal, … then you have extra properties
… we could imagine in future naming those
<ivan> +1
jenit: i personally think it is the right way to structure it
... you can still say 'decimal'
but you could also use a structure to set max/min etc
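A rough sketch of what "you can still say 'decimal'" next to a structured form could look like. The property names `base`, `minimum` and `maximum` and the helper `normalize_datatype` are assumptions made for illustration; the discussion only mentions a base datatype plus extra properties such as max/min.

```python
def normalize_datatype(datatype):
    """Expand the shorthand string form into the object form (hypothetical)."""
    if isinstance(datatype, str):
        # 'decimal' is treated as shorthand for an object with only a base.
        return {"base": datatype}
    # Copy so that callers cannot mutate the original metadata by accident.
    return dict(datatype)


simple = normalize_datatype("decimal")
bounded = normalize_datatype({"base": "decimal", "minimum": 0, "maximum": 100})

print(simple)
print(bounded)
```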
jumbrich: re dan's q, … in our study we used a simple heuristic, tried to guess cell type by using regexes, … and we found a couple of times that a col had multiple types in it. We just went with the majority.
I could try to look deeper, find what kinds of types did this
jenit: the fact that you noticed that that was happening is enough to know it's out there in real data
jumbrich: other case is decimals … maybe excel export
… maybe more interesting when strings vs numericals; or strings vs urls
I can try to look a bit into detail, what kind of different datatypes we observed in 1 col, and report back
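The kind of heuristic described here might look roughly like the following. This is not the code used in the study; the regexes, the type names and the helpers `guess_type`, `column_types` and `majority_type` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical guessers, tried in order; anything unmatched counts as a string.
GUESSERS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("decimal", re.compile(r"[+-]?\d*\.\d+")),
    ("url", re.compile(r"https?://\S+")),
]


def guess_type(cell: str) -> str:
    """Guess a cell's type from its lexical form alone."""
    for name, pattern in GUESSERS:
        if pattern.fullmatch(cell):
            return name
    return "string"


def column_types(cells: list) -> Counter:
    """Count how many cells of a column fall under each guessed type."""
    return Counter(guess_type(cell) for cell in cells)


def majority_type(cells: list) -> str:
    """Pick the most common guessed type (expects a non-empty column)."""
    return column_types(cells).most_common(1)[0][0]


column = ["1", "2", "NA", "4"]
print(column_types(column), majority_type(column))
```

A column whose counter has more than one key is a candidate for the analysis of mixed-type columns.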
jenit: ivan, you were asking for usecases. would it help to get that?
ivan: what if a cell can be interp'd by several of the datatypes
jenit: suggest using array order
ivan: yeahbut, … i know pathological, but imagine 2 alternatives. One is JSON, the other is a string.
How do I decide that a string is JSON
… or say XML
do I have to parse the whole thing to know?
I'm not sure we have a clear idea about these ugly edge cases
jenit: that is a separate issue about validation of xml and json and html
… this arises regardless because we say values have to be valid against whatever the datatype is
…we can go one of two ways. We say you have to go all the way and really parse it. Or else say that typing on these cols is just a hint, ...
(which has implics for union types in case that the markup isn't as valid as intended)
gkellogg: we could say that datatypes explicitly have a regex form, then that is used to match, otherwise it is the first found. That would basically get everything.
There are a couple of areas in xsd where you have confusions
datetime stamps are datetimes, etc
otherwise they are largely in diff spaces
ivan: but then, … for time being at least, a convertor to json or to rdf can be lax
… meaning that it does not do any validation. it will just believe what is there, and produce whatever is the datatype that is signalled.
if we introduce this, strictly speaking a converter cannot be lax, it'll have to make a decision.
jenit: depends on your meaning of 'lax'
… text we have currently says that an error must be generated if not valid against a particular datatype
e.g. a decimal vs the word 'foo'
…means that the value of the cell is set to the string 'foo' rather than a decimal number. The conversion
…can then do whatever with that value. It could be lax and …
or strict and raise error.
various options.
jenit: the checking of the value and the generation of the value for the cell happens regardless.
it has to, otherwise you get in real messes around parsing of dates etc.
gkellogg: say i have a set of datatypes, e.g. date, boolean, … listed, …
as a convertor i need to check lexical form to see if value matches date, or then boolean, … then if it doesn't, what do i use, the last one? string?
jenit: it (i.e. "foo" in example) is a string
… if you had datatype: boolean, and string value was foo
…then value of the cell is the string 'foo'
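A sketch of the behaviour described here, under the assumption that a processor records errors next to the cell rather than stopping. The function `parse_cell` and the accepted boolean spellings are hypothetical.

```python
from decimal import Decimal, InvalidOperation


def parse_cell(raw: str, base: str):
    """Return (value, errors); on failure the value stays the raw string."""
    try:
        if base == "decimal":
            return Decimal(raw), []
        if base == "boolean":
            # Assumed lexical forms; a real processor may accept others.
            if raw in ("true", "1"):
                return True, []
            if raw in ("false", "0"):
                return False, []
            raise ValueError(raw)
        # Any other base is treated as a plain string in this sketch.
        return raw, []
    except (InvalidOperation, ValueError):
        # The error is recorded, but the cell still gets a value: the string.
        return raw, [f"{raw!r} is not a valid {base}"]


print(parse_cell("1.5", "decimal"))
print(parse_cell("foo", "decimal"))
print(parse_cell("foo", "boolean"))
```

A lax converter could ignore the error list and emit the string; a strict one could refuse to convert when the list is non-empty.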
gkellogg: if the type was xml literal, … because we don't have a defined format for detecting it, i'd just go ahead and say it was an xml literal
… could say default comes from def of that datatype
… format/pattern
<JeniT> https://github.com/w3c/csvw/issues/236 << validation of html/xml/json datatypes
we could then have same datatype diff times with diff patterns/formats
ivan: we've moved away from the q of whether we want a structure for datatypes to be an array
<JeniT> https://github.com/w3c/csvw/issues/224
<ivan> +1
jenit: let's try closing 224. any objections to structured datatypes?
<gkellogg> +1
[tumbleweed]
+1 is for it, right? :)
<gkellogg> Right
resolved -> editor action
<JeniT> https://github.com/w3c/csvw/issues/236
#236
jenit: if we have a cell that is marked as being e.g. json do we want to validate that it is actually json
similarly for xml, html, …
ivan: I agree w/ not validating
danbri: +1 for not needing to
gkellogg: validation isn't the right word
(xml-wf?)
ivan: not even WF, as an xml segment needn't have a top level element
all we could find in rdf discussion of this was some DOM function
gkellogg: just wording choice
i don't think we want/need detection on these 3
just a note to say that pattern/format can be used
to help discriminate
jenit: can you clarify?
gkellogg: I mean that if someone wanted to try to discriminate, based automatically on datatype, … could put a format in there which looked for <html
so that they could distinguish
ivan: i'd keep it simple
<JeniT> Proposal: we won’t built-in recognise/validate html/xml/json, but add a note to say that authors can add a pattern if they want
(that's a heavy rider tacked on the end)
ivan: so if there is a pattern i have to use it?
<JeniT> file:///Users/user/Documents/projects/w3ctag/csvw/metadata/index.html#formats-for-other-types
<JeniT> http://w3c.github.io/csvw/metadata/#formats-for-other-types
jenit: "format property provides a regex ...."
gkellogg: in formats for date/time the format is yyyy-mm-dd in which case that is not a regex
but there is still a pattern property
gkellogg: what are consequences of having both format and pattern?
jenit: pattern only on a format for a numeric type
not at the top level, alongside format
<JeniT> http://w3c.github.io/csvw/metadata/#formats-for-numeric-types
never clashes
gkellogg: ok
jenit: back to HTML/XML/JSON people can already use format property to constrain the value as gregg described
so the note is just a pointer to existing functionality
ivan: ok
<JeniT> Proposal: we won’t built-in recognise/validate html/xml/json, but add a note to say that authors can add a pattern if they want
+1
<gkellogg> +1
<DavideCeolin> +1
<ivan> +1
<jumbrich> +1
<JeniT> +1
resolved
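A small sketch of how an author-supplied pattern could be used to discriminate HTML content, as Gregg described. It assumes the `format` property holds a regular expression and that a match anywhere in the value is enough; both the helper `matches_format` and the `<html` pattern are illustrative.

```python
import re


def matches_format(raw: str, datatype: dict) -> bool:
    """Check a cell against an optional author-supplied regex (hypothetical)."""
    pattern = datatype.get("format")
    if pattern is None:
        # No built-in recognition of html/xml/json: accept the value as-is.
        return True
    return re.search(pattern, raw) is not None


# Illustrative datatype description; the pattern is the author's own choice.
html_type = {"base": "html", "format": r"^\s*<html"}

print(matches_format("<html><body/></html>", html_type))
print(matches_format('{"a": 1}', html_type))
print(matches_format("anything", {"base": "json"}))
```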
https://github.com/w3c/csvw/issues/218 - Categories of null values #218
gkellogg: instead of treating these multiple values as diff version of null, could treat them as …[missed]
tokens
ivan: what is current situation?
… it looked as if null was already an array
<JeniT> https://github.com/w3c/csvw/issues/136
jenit: it is but we decided from #136 that null would become a single value
… so the doc hadn't been updated to reflect that resolution
effectively #218 reopens #136 but with more of a rationale for why you might want multiple null values
ivan: what's the merge? do we concat the arrays?
jenit: atomic
so you do not merge the arrays
if you have two metadata files, the null list from A overrides the one from B
ivan: fine w/ that
… and ok w/ several null values that way
jenit: ok
<JeniT> Proposal: allow several null values, but merge in an atomic way (don’t merge arrays)
<gkellogg> +1
<jumbrich> +1
+1
<DavideCeolin> +1
<ivan> +1
<JeniT> +1
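A sketch of the atomic merge in this proposal, assuming two metadata dicts where the first takes priority. The helper names are hypothetical, the empty-string default is an assumption, and SUPT/NE are the codes mentioned earlier for the school performance data.

```python
def effective_nulls(higher: dict, lower: dict) -> list:
    """Return the list of null markers after an atomic merge (hypothetical)."""
    if "null" in higher:
        # Atomic: the higher-priority value replaces the other one wholesale.
        value = higher["null"]
    else:
        # Assumed default when neither file says anything: the empty string.
        value = lower.get("null", "")
    # Accept either a single string or an array of strings.
    return [value] if isinstance(value, str) else list(value)


def cell_value(raw: str, nulls: list):
    """Map any of the null markers to None, leave other values untouched."""
    return None if raw in nulls else raw


nulls = effective_nulls({"null": ["SUPT", "NE"]}, {"null": ["NA"]})
print(nulls, cell_value("NE", nulls), cell_value("NA", nulls))
```

Because the arrays are not concatenated, "NA" from the lower-priority file is no longer treated as null once the other file supplies its own list.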
<JeniT> https://github.com/w3c/csvw/issues/223
https://github.com/w3c/csvw/issues/223 - Allowing "unions" of datatypes? #223
ivan: if we move to datatypes being these objects then the q of merge arises for those as well, regardless of the union issue
jenit: true
ivan: we merge atomically or property by property
[someone said 'yes']
jenit: i agree
gkellogg: general trend is to make a small set of things which merge.
ivan: maybe adding that note to the issue?
jenit: propose that we allow arrays of datatypes to be provided and the first datatype wins in terms of labelling a particular value
<JeniT> Proposal: we allow arrays of datatypes to be provided, and the first matching datatype wins in terms of assignment of datatype to a particular value
<JeniT> (and atomic merge)
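A sketch of "first matching datatype wins" for the proposal above, which was not resolved on this call. The lexical-form regexes and the helper `assign_datatype` are simplified assumptions, not the XSD definitions.

```python
import re

# Simplified, assumed lexical forms; real XSD lexical spaces are richer.
LEXICAL_FORMS = {
    "decimal": re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)"),
    "boolean": re.compile(r"true|false|1|0"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "string": re.compile(r".*", re.DOTALL),
}


def assign_datatype(raw: str, datatypes: list):
    """Return the first listed datatype whose lexical form matches, else None."""
    for name in datatypes:
        if LEXICAL_FORMS[name].fullmatch(raw):
            return name
    return None


print(assign_datatype("12.5", ["decimal", "string"]))
print(assign_datatype("NA", ["decimal", "string"]))
print(assign_datatype("1", ["boolean", "decimal"]))
print(assign_datatype("1", ["decimal", "boolean"]))
```

The last two calls show why array order matters when a value such as "1" is valid under more than one listed datatype.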
ivan: at this moment my vote is "If we do it, then yes that's the way we should do it" (but) I would like to see jumbrich's measures before we make a decision on this rather than rush a new feature based on 1 new use case
jenit: ok
gkellogg: leave a week?
… we need to decide
jenit: what do you think is your measure of what would be persuasive. How many cases? 10, 20? 3?
ivan: not a number
… he goes through a certain amount of usecases for scientific data. If only 5% have this feature I would go against it. Obviously if it goes up to 30% then yes.
<jumbrich> info: we should have around 80K+ documents, from which 60k we could parse.
jenit: I'd say 5% is too high a bound. We should support features that are in 1 in 20 docs
gregg/dan: agree
jumbrich: i will have a look and report on how many cols per doc we found at least 2 or 3 datatypes
then try to present them in a reasonable way
[click]
(sip client problems)
ivan: I'll be away next week
(dan away too)
jumbrich: I'll email around beforehand
ivan: fear of feature creep
<JeniT> ACTION: jumbrich to do an analysis on union types to see if they are prevalent in real data [recorded in http://www.w3.org/2015/02/18-csvw-minutes.html#action01]
<trackbot> Created ACTION-64 - Do an analysis on union types to see if they are prevalent in real data [on Jürgen Umbrich - due 2015-02-25].
jenit: fine, leave til next week
... jumbrich's github id?
<jumbrich> jumbrich is my github id
jenit: access on repo?
ivan: no
ivan to add jumbrich to our github group
jumbrich: if we write code, should it be hosted here?
e.g. to extract metadata files etc?
ivan: no need to, can host wherever
AOB?
(github admin details)
Adjourned.
KUTGW - please try to vote on proposals etc on github.
<ivan> trackbot, end telcon