CSVW F2F Meeting, London, 2nd day

13 Feb 2015


See also: IRC log


Jürgen Umbrich, Davide Ceolin, Dan Brickley, Jeremy Tandy, Jeni Tennison, Gregg Kellogg, Ivan Herman, Phil Archer
gkellogg, danbri


<JeniT> Agenda: https://www.w3.org/2013/csvw/wiki/F2F_Agenda_2015-02

<gkellogg> scribenick: gkellogg

Foreign Keys and References

<JeniT> http://piratepad.net/URwa3CM9Vv

JeniT: how we handle having multiple interrelated tabular data files.

… An example is the public salaries use case (#4?)

<ivan> scribenick: danbri

gkellogg: roles json is basically a table group referencing two tables

… the driving metadata file

references the senior roles

and the junior roles

junior people refer to senior people

… a foreign key rel from the col 'reportsTo' to the other with col 'ref'

what we've said here … we've created property urls, and a value url

so property expands to reportsTo and value uses a URI pattern

senior roles have more col definitions

ref name grade and job

this allows you to examine the data, validator… a , ...

… looking at junior, … seeing reporting senior in 1st col, which would need to exist in the senior roles in the post-unique reference

e.g. 90238 is 3rd or 4th row

JeniT: a couple of observations

first is, two kinds of mechanisms for getting pointers between resources

one is thru primary and foreign key type mechanism

very database-oriented terminology

all primary key really says is that values in this column are unique

… each is different

foreign keys, or unique comb of values if multi-column, … must reference something that does exist in this other file

so quite a tight, validation oriented relationship

AND also we have

aboutUrl, valueUrl

these create the links in the output

for rdf and json generation

creates the urls for the things being identified in these files

could easily have an example that only had the one or had the other
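
A minimal sketch of the second mechanism, assuming hypothetical URL patterns and sample values: aboutUrl and valueUrl produce links in the output without any key declaration. The column names `ref` and `reportsTo` come from the roles example. The `expand` helper handles only simple `{name}` variables, not full URI Template syntax.

```python
import re
from urllib.parse import quote

def expand(template, row):
    """Replace each {column} with the percent-encoded cell value (simple variables only)."""
    return re.sub(r"\{(\w+)\}", lambda m: quote(str(row[m.group(1)]), safe=""), template)

# Hypothetical column description for the junior roles table.
column = {
    "name": "reportsTo",
    "aboutUrl": "http://example.org/junior/{ref}",
    "valueUrl": "http://example.org/senior/{reportsTo}",
}
row = {"ref": "J-17", "reportsTo": "90238"}

subject = expand(column["aboutUrl"], row)
value = expand(column["valueUrl"], row)
link = (subject, column["name"], value)
```

Nothing here checks that the senior role exists; the link is simply generated.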

gkellogg: in fact primary and 2ndary keys are not used in the transformation

ivan: there is a need for consistency
... whatever is described in the f.key structure

vs what we use in the valueUri

you would expect those things would be essentially identical

would they ever differ?

jenit: usually they would match but it's a rope-and-hang-yourself

gkellogg: consider two tables, ...

see http://piratepad.net/URwa3CM9Vv

we… used diff

ivan: to be clear, avoid misunderstanding, ...

… source of misunderstanding, … column names used for interlinking

you invite him-her-it

gkellogg: … what we have is a primary key in the senior but not in the junior

(jeni takes to whiteboard)

jtandy: in junior schema how do you know that it is reportsTo in the senior?

ivan: template refers always to the local table value

…has nothing to do with the reportsTo of the other table

gkellogg: better to update the example accordingly?

jenit: let's start with something super simple

ivan: e.g. i would remove the reportsTo of the senior table

jenit: it is useful to have that example

danbri: at some point it crosses over into domain datamodel validation e.g. "no reporting cycles" problem isn't our problem

<gkellogg> scribenick: gkellogg

jumbrich: this requires that columns have names?

iherman: yes, but there are defaults.

jumbrich: so, there may be a name or a title, but name is best for creating a reference.

… perhaps we should use “id” someplace in the metadata to show that this is an identifier?

jenit: I’d like to stay close to Data Package.

DavideCeolin: If I have two tables and want to say one refers to the other, I can do it using small markup in the FK specification.

jenit: we’ll go into issues.

<JeniT> https://github.com/w3c/csvw/issues/16

JeniT: Andy brought up the difference between “strong linkage” in databases, with strong validation requirements for the FK to find the reference, and the “weak linkage” in the web where something may not exist.

… He was concerned about not having to resolve URLs to validate links. When you’re in the process of generating them, they likely don’t exist anyway.

danbri: granularity was an issue as well, depending on where the data comes from.

JeniT: We have two mechanisms. With the first, you know what is coming together and you control the metadata, so you are better able to make a strong statement about validation when using such cross-references.

… We also have the “weak linking” generation on demand where there is no check. It’s up to the metadata author to know what to use.

iherman: we have to define what a validation is expected to do. In this case, we probably require only weaker validation?

JeniT: When there is a primary key then a validator must verify that all referenced data exists and that all primary keys are unique.

jumbrich: so this allows just mapping one data without necessarily mapping the other.

iherman: you don’t have to check if the values in a column using an FK are actually present in the other table. The two tables are consistent in the roles example, as they do exist.

jtandy: if you declare it as an FK you must check that it exists. if you use a valueUrl, you don’t need to check.

… Because strict validation is a “beast”, you can only use the references within a single TableGroup.
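
A minimal sketch of the strict check described here, assuming both tables of the roles example are already loaded as lists of dicts. The column names `ref` and `reportsTo` come from the example; the sample values and the `check_keys` helper are hypothetical.

```python
def check_keys(senior_rows, junior_rows):
    """Return a list of problems: duplicate primary keys and dangling foreign keys."""
    problems = []
    seen = set()
    for row in senior_rows:
        if row["ref"] in seen:
            problems.append(f"duplicate primary key: {row['ref']}")
        seen.add(row["ref"])
    for row in junior_rows:
        # A declared foreign key must reference a value that exists in the other table.
        if row["reportsTo"] not in seen:
            problems.append(f"unresolved foreign key: {row['reportsTo']}")
    return problems

senior = [{"ref": "90115"}, {"ref": "90238"}]
junior = [{"reportsTo": "90238"}, {"reportsTo": "99999"}]
problems = check_keys(senior, junior)
```

Both tables have to be at hand for this to work, which fits the restriction to a single TableGroup.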

<danbri> ( if you want some examples with multi-table keys, https://github.com/w3c/csvw/tree/gh-pages/examples/tests/scenarios/chinook )

jeniT: there is a subtlety in the examples ...

… In real life, there is a government office that says all departments need to publish senior and junior roles, and all adhering to the same schema.

… They also define a list of departments, with say name of department, and website.

… When departments publish the senior/junior roles pairs, the “dept” column will typically all be the same pointing to the identifier of a particular department, so the FK needs to reference the departments.csv file.

… The TableGroup then needs to reference the departments CSV and schema.

iherman: the person creating the description probably shouldn’t say what data not to export.

gkellogg: But, this could be specified in user-defined metadata, and is under control of the user.

jumbrich: I might want to refer to other resources without pulling them in.

JeniT: the closest thing we have is to use the same table group to describe related resources and generate the URL for a “team” in any output. You would then use that URL to reference the team, for references and identification.

jumbrich: I might have a relation table, and a couple of tables where things are used, and I might want to point to something for additional information.

… I might be able to build search on top of the metadata where I could use FK information to infer information about the various tables.

jtandy: I think that’s a normal FK relationship.

iherman: there’s also a difference between what a validator and a transformer will do.

… The FK spec is conceptually disjoint from the valueUrl and transformation. The FK is only there for validation.

jtandy: if you use PKs, that might change how you serialize.

<JeniT> https://github.com/w3c/csvw/issues/16

JeniT: FK references are for validation purposes...

danbri: what do we say about the results of being invalid? Are we creating a culture so that things can’t proceed if they’re invalid.

JeniT: a validator may work in strict and lax modes, where it fails at the first problem when strict, but just reports all issues encountered when lax.
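
A minimal sketch of the two modes, assuming a hypothetical per-cell `check` callback that returns an error message or `None`. The names `validate`, `ValidationError` and `not_empty` are illustrative only.

```python
class ValidationError(Exception):
    """Raised in strict mode at the first problem found."""

def validate(cells, check, strict=True):
    """Apply check to every cell; raise on the first problem if strict, else collect them all."""
    issues = []
    for index, cell in enumerate(cells):
        message = check(cell)
        if message is None:
            continue
        if strict:
            raise ValidationError(f"cell {index}: {message}")
        issues.append(f"cell {index}: {message}")
    return issues

def not_empty(cell):
    return "empty value" if cell == "" else None

lax_report = validate(["a", "", "b", ""], not_empty, strict=False)
```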

<JeniT> https://github.com/w3c/csvw/issues/31

jtandy: this looks out of date now, I suggest close as “expired”.

iherman: it will come back if we have a “skip” flag.

<danbri> "Should primary keys be skipped from cell level triple (or k/v pairs) generation? #31"

JeniT: if you use valueUrl, you only get ???

<JeniT> https://github.com/w3c/csvw/issues/130

jtandy: Alain has provided some alternate JSON structure that uses identifiers as properties rather than an array.

… If you didn’t define a PK, there’s not necessarily one thing that is unique, and such an index structure is available.

… We agreed that PK is for validation, but necessarily only for validation.

JeniT: this is the purpose of aboutUrl, which _may_ be associated with the PK, but not necessarily.

jtandy: the index and object works for some, but Tim Robertson seemed to object.

JeniT: I think we should only define one JSON output for ease of scope.

jtandy: so the “standard” publishing mechanism is an object per-line, and converting to an ‘indexed’ mechanism is “trivial”, and outside the scope of the spec.

… We may say that implementations could have alternate output forms.
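
A minimal sketch of that conversion, assuming the standard output is a list of one object per row and that the chosen key (here the hypothetical `ref`) is unique. `index_by` is an illustrative helper, not part of any spec.

```python
def index_by(rows, key):
    """Turn a list of row objects into a dict keyed by the value of one property."""
    indexed = {}
    for row in rows:
        remainder = {k: v for k, v in row.items() if k != key}
        indexed[row[key]] = remainder
    return indexed

rows = [
    {"ref": "90115", "name": "A"},
    {"ref": "90238", "name": "B"},
]
indexed = index_by(rows, "ref")
```

If the key is not unique, later rows silently overwrite earlier ones, which is why the indexed form depends on something like a primary key.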

<danbri> ('templating and transformation'?)

iherman: I like to have a conceptual similarity between the JSON and the RDF transformations, and for the time being they are quite similar.

<JeniT> https://github.com/w3c/csvw/issues/66

<danbri> "Composite primary keys and foreign key references #66"

jtandy: for example my PK may be based on givenname & familyname, and you’re making stuff up as you go along.

JeniT: you can use aboutUrl to combine such columns together to get what you want.

… You can’t say that one column points to two values, but you can create an aboutUrl which uses both name and a valueUrl in the other to create the same reference. It works for RDF, but not for validation.

danbri: if you had postal codes in each country, then the combination of country code and postal code will be unique.
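
A minimal sketch of a composite-key uniqueness check for that case, assuming rows are dicts with hypothetical `country` and `postcode` columns. The sample data is invented.

```python
from collections import Counter

def duplicate_keys(rows, key_columns):
    """Return the composite key tuples that occur more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]

rows = [
    {"country": "GB", "postcode": "1000"},
    {"country": "BE", "postcode": "1000"},
    {"country": "GB", "postcode": "1000"},
]
dupes = duplicate_keys(rows, ["country", "postcode"])
```

The postcode alone repeats across countries; only the pair is expected to be unique.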

jtandy: TableGroups contain resources and may contain schemas? (yes)

JeniT: because there are two different types of FK references you might make (departments example), one always points to the same resource, and the other to different values based on cell values.

URLs and metadata

<JeniT> https://github.com/w3c/csvw/issues/74

<JeniT> https://github.com/w3c/csvw/issues/74#issuecomment-72854167

<JeniT> https://github.com/w3c/csvw/issues/191

<JeniT> diverted onto https://github.com/w3c/csvw/issues/91

<ivan> https://github.com/w3c/csvw/issues/191#issuecomment-73497474

<danbri> gkellogg: in json-ld … there are rules for term expansion

<danbri> … the prefix expansion is more naturally dealt with as part of #91 than this.

<danbri> What we're doing here is saying it is a URL template property

<danbri> when you apply template, result is a string

<danbri> which in #91 will be made into an url

<danbri> jenit: fear we'll get stuck on exact wording

<danbri> … can we capture direction of the resolution

<danbri> … will ref #91

<danbri> … and editor action will be needed

<danbri> capturing basic thing, … these properties are string properties

<danbri> from piratepad, copying:

<danbri> resolved: The order of processing is as described in https://github.com/w3c/csvw/issues/191#issuecomment-73497474. These properties are string properties, the URL template is expanded first. Any resolution (ie expanding prefixes & resolving against a base URL) is done after that expansion. Editor action to make this so.
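
A minimal sketch of that ordering, under stated assumptions: the prefix table, the base URL and the templates are hypothetical, and `expand_template` supports only simple `{name}` variables rather than full URI Template syntax.

```python
import re
from urllib.parse import quote, urljoin

PREFIXES = {"schema": "http://schema.org/"}  # hypothetical prefix table

def expand_template(template, row):
    """Fill {column} placeholders with percent-encoded cell values."""
    return re.sub(r"\{(\w+)\}", lambda m: quote(str(row[m.group(1)]), safe=""), template)

def resolve(string, base):
    """Expand a known prefix, otherwise resolve the string against the base URL."""
    prefix, sep, local = string.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return urljoin(base, string)

def url_for(template, row, base):
    # Step 1: the property is a plain string; expand the template first.
    expanded = expand_template(template, row)
    # Step 2: only then expand prefixes and resolve against the base URL.
    return resolve(expanded, base)

base = "http://example.org/meta.json"
typed = url_for("schema:{kind}", {"kind": "Person"}, base)
relative = url_for("people/{id}", {"id": "7"}, base)
```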

<JeniT> https://github.com/w3c/csvw/issues/91

<danbri> "What is default value if @base is not defined in the metadata description #191"

<danbri> jenit: bunch of issues …

<danbri> how link urls which are bases are resolved

<danbri> how url templates following their templates, what base url they get, how they are then treated, what base url gets used on that

<danbri> and then whether we want to provide some level of control within the urltemplates to enable people to expand based on a different base url

<danbri> 1st - link properties

<danbri> like reference to the csv files

<danbri> those link properties should be resolved in the same way that they are resolved in json-ld

<danbri> i.e. if there is an @base in the context, use that

<danbri> otherwise metadata doc in which that link is found

<danbri> requires you to expand them prior to merging

<danbri> or keep track of where original comes from

<danbri> gkellogg: that's where language in merge now says

<danbri> before merging both A and B make any link URIs absolute relative to the base of that metadata

<danbri> ivan: isn't there also a language about merging the @base?

<danbri> gkellogg: for @base there is

<danbri> works pretty much like object merging

<danbri> ivan: but then why do we merge @base?

<danbri> gkellogg: point is that after normalizing, context isn't necessary any more

<danbri> ivan: let's make that explicit

<danbri> … conceptually every metadata file needs to be normalized before merged

<danbri> gkellogg: @base and language can disappear

<danbri> you still need the default metadata since that is how you define prefixes etc

<danbri> ivan: i don't think we do that

<danbri> jenit: they're never explicitly put in the @context

<danbri> … gregg is saying that conceptually there is such a context

<danbri> and if you are using basic json-ld processing, then implicitly we'd pull in everything from that context doc

<danbri> gkellogg: need not be just implicit

<danbri> we need to figure out what we want to do

<danbri> jenit: you were both in agreement that the @base and the @lang were redundant by the time you had gone through the normalization

<danbri> ivan: that's correct

<danbri> gkellogg: but there is a conceptual or virtual base url of the metadata

<danbri> besides an explicit @base declaration

<danbri> jenit: yes, the location of...

<danbri> gkellogg: or the 1st in a set, ...

<danbri> jenit: that, I don't, ...

<danbri> ivan: comes back to #199

<danbri> jenit: i think we agree that the link properties are resolved against the base url, maybe the @base from the context, or it may be the location of the metadata file, during normalization of the metadata file, and prior to merge.

[[[If the property is a link property the value is turned into an absolute URL using the base URL.]]]
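
A minimal sketch of that normalization step, assuming a hypothetical `normalize_links` helper that receives the metadata document's own location and, optionally, an explicit `@base` value. Treating `url` as the only link property is a simplification for illustration.

```python
from urllib.parse import urljoin

def normalize_links(metadata, location, explicit_base=None, link_properties=("url",)):
    """Make link property values absolute before any merge takes place."""
    # An explicit @base wins; otherwise the metadata document's location is the base.
    base = urljoin(location, explicit_base) if explicit_base else location
    normalized = dict(metadata)
    for name in link_properties:
        if name in normalized:
            normalized[name] = urljoin(base, normalized[name])
    return normalized

location = "http://example.org/meta/roles.json"
by_location = normalize_links({"url": "roles.csv"}, location)
by_base = normalize_links({"url": "roles.csv"}, location, explicit_base="http://data.example.com/")
```

Once every link is absolute, the base and the context are no longer needed for the merge.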

<danbri> jenit: 2nd piece of this, is what happens to these url templates

<danbri> these can't get expanded until you are actually processing data

<danbri> at which point you have your merged metadata as basis of what you are doing

<danbri> if you have lost your base url, or not got, what to resolve against becomes tricky

<danbri> also - jtandy's 1st assumption, that those would be resolved against url of the csv file

<danbri> so when you had template like #rownum=5

<danbri> then that would be ref to something within the csv file

<danbri> not relative to any of the metadata files it might be in

<danbri> which raises the usability perspective, ...

<danbri> … it might be better for the url templates to be ref'd against the csv file

<danbri> to have that as the default
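
A minimal sketch of why the choice of base matters for a template result such as `#rownum=5`, assuming hypothetical locations for the CSV file and the metadata file.

```python
from urllib.parse import urljoin

table_url = "http://example.org/data/roles.csv"      # hypothetical CSV location
metadata_url = "http://example.org/meta/roles.json"  # hypothetical metadata location

expanded = "#rownum=5"  # the string left after template expansion

against_csv = urljoin(table_url, expanded)
against_metadata = urljoin(metadata_url, expanded)
```

Resolved against the CSV file, the fragment points into the data; resolved against the metadata file, it points into a document that has no rows.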

<danbri> gkellogg: i won't stand in way, but am not enthusiastic

<danbri> … you can always avoid trouble by having absolute urls

<danbri> jtandy: we just need to be clear on what happens when not an absolute url

<danbri> ivan: raising q: is it not confusing for authors, that we have 2 diff ways of absolutising urls

<danbri> depending on whether they are link properties or templates

<danbri> … a completely diff approach would be that we don't do this under normalization

<danbri> instead use the table url just like for templates

<danbri> jenit: how do you resolve the table url? that's the link property

<danbri> gkellogg: json-ld has an url expansion algo

<danbri> … nominally each json-ld doc has a location which can override @base

<danbri> ...

<danbri> if we say it is undefined, this would be the only doc (format) i've dealt with in which you start off with a base and then lose it along the way

<danbri> ivan: talking about confusing, … that means I get a merged metadata, and the various templates in that metadata will expand differently

<danbri> … the templates will expand depending on where they come from

<danbri> gkellogg: no, there's a single base url notionally

<danbri> ivan: then i don't understand the issue

<danbri> gkellogg: I think we said it's the csv file it is expanded against

<danbri> that's what i reacted to , saying that this is weird, …

<danbri> jenit: [missed]

<danbri> discussion of detail of mess starting with the csv file vs metadata

<danbri> jtandy: key issue to my mind, uri templates only get expanded once you've done all the merging, ...

<danbri> … only at that point,

<danbri> gkellogg: only at row processing stage

<danbri> jtandy: … templates get expanded, … urls get resolved, …

<danbri> gkellogg: which we're saying is the expanded url property of the table

<danbri> jtandy: at least we always know what that is

<danbri> jtandy: to clarify, this is for the metadata doc, and by time we get to conversions, this will all have been expanded?

<danbri> [yes]

<danbri> jenit: do we in abstract table data model need url in each cell not just value

<danbri> i.e. what you'd get from value url

<danbri> gkellogg: that is the value of the cell

<danbri> jenit: no

<danbri> -> example in piratepad

<scribe> scribenick: gkellogg

iherman: just to clarify, linkproperty values can be CURIEs/PNames

<JeniT> https://github.com/w3c/csvw/issues/121

<danbri> gkellogg: discussion of expanding urls, we talked about json-ld, then asked about URL spec

<danbri> reason for that is that url spec doesn't deal with prefixes

<scribe> scribenick: danbri

ivan: spec-wise it is fine, but if i read that doc it is like some of the HTML5 specs

jenit: does it specify the behaviour that we want it to specify

… there is no other good url spec to reference

jenit: i think it is at least consistent to point to the json-ld one

ivan: that's why i asked what i asked. back then it went into a whole set of things that were v json-ld specific, with prefixes etc.

…that was my fear

… it goes into all kinds of detail on context processing

gkellogg: we are using a context, we have one defined that defines all of our terms, that is the one used when expanding these values

jenit: let's defer this, maybe discuss over lunch, ...

gkellogg: if we choose something else let's say it is intended to be consistent with json-ld iri expansion

ivan: one thing it does introduce, … and we do not, is issue of syntax for bnode identifiers

gkellogg: but we can constrain the value space...

jenit: suggest resolve as "we'll summarize the algo from json-ld spec, extract bits that are relevant, and say it is intended to be consistent with the spec

gkellogg: yes, can do that

… re bnodes i think it is intent of group to avoid using a bnode syntax where URIs can be used

ivan: maybe we need some sort of appendix

saying this is json-ld compatible, but with these-and-these restrictions

e.g. that we restricted what can go into a context

… that we have restricted yesterday the evaluation of common properties, etc.

… i.e. there are a number of places where we restrict json-ld

[general agreement]

resolved: We will summarise the expansion processing that is necessary for our purposes, and say that it is intended to be consistent with JSON-LD IRI expansion. We do have some restrictions on what IRIs can be used, eg we don't allow blank node syntax.
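
A minimal sketch of such a restricted expansion, assuming a hypothetical prefix table and base URL. It is only meant to be in the spirit of JSON-LD IRI expansion (prefix lookup, then relative resolution) and it rejects blank node syntax; it is not the algorithm from either spec.

```python
from urllib.parse import urljoin

def expand_iri(value, prefixes, base):
    """Expand a prefixed name or resolve a relative reference; blank nodes are refused."""
    if value.startswith("_:"):
        raise ValueError("blank node syntax is not allowed")
    prefix, sep, local = value.partition(":")
    # "scheme://..." is an absolute IRI, not a prefixed name.
    if sep and prefix in prefixes and not local.startswith("//"):
        return prefixes[prefix] + local
    return urljoin(base, value)

prefixes = {"dc": "http://purl.org/dc/terms/"}
base = "http://example.org/m.json"
title = expand_iri("dc:title", prefixes, base)
relative = expand_iri("x/y", prefixes, base)
```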

Conversion issues

from https://www.w3.org/2013/csvw/wiki/F2F_Agenda_2015-02#Friday_13th_February

will revisit after lunch.

Conversion Details

<JeniT> https://github.com/w3c/csvw/issues/83

Extension Conversions: #83 "Possible error in "optional properties" for Template Specifications: source #83"

jenit: this is about when we have these extension conversions, we have said we want to enable extensions to work on results of a conversion we have already defined

e.g. we have already defined json and rdf

… can we make e.g. a post-processor that sits on top of the RDF

maybe it might use SPARQL CONSTRUCT

(the use of XML in the orig issue was a typo)

this led to the q of what the source looks like for post processing

gkellogg: how does this relate to accept headers?

e.g. my impl creates an abstract graph

… Accept: can turn into a prioritized list of formats

seems like the type of thing that a tabular data processor might do

danbri: assumes an HTTP REST deployment model?

ivan: seems like an impl detail not relevant here

… more … if you want Turtle, this is the processor you can use, etc etc

options of tools or http or online tools … i dont think we should go there

ivan: only thing, what in metadata descr params need to be specifiable

gkellogg: seems reason why Accept has a prioritised list, so you get something you can handle even if not best

jenit: in my head, the source thing here was only taking 2 values

… and when you said post-processing would be delivered an rdf graph

you wouldn't be specifying

you might never serialize

danbri: not comfortable assuming all in memory / API access, unix pipe model is quite likely

jenit: you (jtandy) are assuming serialized output?

jtandy: i'm v happy saying we don't serialize

that json stays just in memory

gkellogg: i believe json in memory defined in ecma
... diff between target format and template format?

… mustache vs RDF

jenit: either you'd be operating over the rdf using a mustache template, or to create rdf/xml, would be a basic thing ...

danbri: would fancy alternate mappings always use json or rdf mappings? or sometimes raw?

jenit: can go back to the base also

fwiw this was the closest we got to a demo using R2RML : https://github.com/w3c/csvw/blob/gh-pages/examples/tests/scenarios/events/attempts/attempt-1/metadata.json


<scribe> scribenick: danbri


jenit: given that we have abouturls, property urls etc etc, i.e. pretty flexible way of making triples from a row in the table...

…what does this imply in terms of what else is needed to be flexible about that structure

or should we be constraining it


issue #66 already closed

so https://github.com/w3c/csvw/issues/66 does not need discussion


"Suppression of columns in mapping #64"

jtandy: sometimes in the stuff you want to push out through RDF or JSON conversion, you might not want all of the cols in the tabular data to appear in the output

I would just like to be able to say "don't include this column"

… seemed trivial but ppl objected

gkellogg: [missed]

… re naming, we mix hyphens and CamelCase

jenit: should be CamelCase

so "table-direction" is wrong

jtandy: so that was my requirement, it would be cool if you could do that

jenit: and properties

jtandy: Gregg's suggested optimization for skipping an entire table, it could be an inherited property

so you could say it up at the table level, schema...

gkellogg: suppressing table would handle all its cols

ivan: strictly speaking this is not the same

because if I have common properties

if i say I skip the table

if i refer back to this AM's discussion, i want to suppress the generation of everything

if i have a flag on a table, is fine

… if just a space keeper for all the cols, you would generate common properties

jtandy: you are correct

therefore we should have a suppress col

gkellogg: i don't see that

if it is on the table that is how it is interpreted

ivan: let's not conflate the interpretation of this

jenit: surely having the same property does not ...

"this suppresses the conversion output from the thing that it is on" would be a fine def, to avoid having repeated similar terms

ivan: but I might want to do what I said earlier, just common properties

resolved: We will introduce a `suppressOutput` property, on individual resources or on columns, which would mean that no output was generated from that table or from that column during a conversion. This is not an inherited property.
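
A minimal sketch of how a converter might honour `suppressOutput`, assuming a hypothetical in-memory table description with a `columns` list and rows as dicts. Because the property is not inherited, the flag is read only from the object it is set on.

```python
def convert_table(table, rows):
    """Return one output object per row, honouring suppressOutput on the table and its columns."""
    if table.get("suppressOutput"):
        return []  # nothing at all is generated from this table
    kept = [c["name"] for c in table["columns"] if not c.get("suppressOutput")]
    return [{name: row[name] for name in kept} for row in rows]

table = {
    "columns": [
        {"name": "ref"},
        {"name": "notes", "suppressOutput": True},
        {"name": "grade"},
    ]
}
rows = [{"ref": "90238", "notes": "internal", "grade": "SCS1"}]
output = convert_table(table, rows)
```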

jtandy: before we get to phantom cols, … aboutUrl on cols?

gkellogg: we resolved that aboutUrl etc are common properties

can appear in col, schema, …

ivan: there may be cells where the generated triples have a different subject

jenit: let's discuss that 1st

"whether it is useful/helpful to have different about URLs on different cols" …

jtandy: that would really help my use cases

… we need multiple entities per row

ivan: if we go there, fundamentally not against it, … the structure of the generated rdf needs rethinking

currently we make a predicate 'row' etc etc… this structure becomes meaningless

gkellogg: in average case it works out fine

way reads now, the row resource, iri is from 1st cell

jtandy: no, subject of row comes from aboutUrl in schema

jenit: purely what you generate as triples

jtandy: i believe this is an inherited property

so if you define it at schema level, …

[can't capture realtime and listen, backing off from detail]

gkellogg: some times it does have value to use row

ivan: where do i put these extra triples?

jenit: "the output" :)

jtandy: if we are processing on a row-by-row basis, we look at those across a row that share a subject, and emit them together

the issue we have got is that the entities which are talked about lose an implicit relationship to the table they are in

jenit: what kind of relationship …

<JeniT> https://github.com/w3c/csvw/issues/179#issuecomment-72072147

issue may be discussed in tracker under 'phantom col'

gkellogg: imagine a doap description of a software project, referencing a foaf description of a developer

… if there happens to be a spare column, e.g. foaf ID column out, i could put [missing detail]

ivan: i think you're conflating 2 different things

jenit: what is the proper relationship between the table in the rdf output

… vs the entities from the data

jtandy: at moment we say 'csv row'

jenit: i don't think it worked in 1st place

…tables rows are rows which describe things

e.g. a row might describe many things

so either you'd say, instead of csv:row property, you want 'describes'

isDescribedBy etc

… table describes all of the distinct subjects / entities

or you can do it by saying table contains rows, row describes entities


jenit: could be 2 rows talking about same entity

jtandy: in this case table is a kind of dataset

… mention yesterday that a table … if we defined CSVW 'Table' as a subclass of one of the dataset types e.g. dcat:Dataset

jenit: let's get to agreement on the q ivan posed, … do we want separate about urls on each column

resolved - jeni summarising

ivan: does it affect the json output?

<JeniT> PROPOSED: aboutUrl is a property that goes on individual columns; different columns can generate data about different subjects

jtandy: yes


<JeniT> +1

<gkellogg> +1

<DavideCeolin> +1

<jtandy> +1

<jumbrich> +1

<ivan> +1

<JeniT> RESOLVED: aboutUrl is a property that goes on individual columns; different columns can generate data about different subjects
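
A minimal sketch of per-column aboutUrl, assuming hypothetical event and place columns. Values from one row are grouped under whichever subject each column's template yields, so those that share a subject can be emitted together. `expand` supports only simple `{name}` variables.

```python
import re
from urllib.parse import quote

def expand(template, row):
    """Fill {column} placeholders with percent-encoded cell values (simple variables only)."""
    return re.sub(r"\{(\w+)\}", lambda m: quote(str(row[m.group(1)]), safe=""), template)

def triples_by_subject(columns, row):
    """Group the (property, value) pairs of one row under each column's subject."""
    grouped = {}
    for column in columns:
        subject = expand(column["aboutUrl"], row)
        grouped.setdefault(subject, []).append((column["name"], row[column["name"]]))
    return grouped

columns = [
    {"name": "title", "aboutUrl": "http://example.org/event/{id}"},
    {"name": "venue", "aboutUrl": "http://example.org/place/{place_id}"},
    {"name": "start", "aboutUrl": "http://example.org/event/{id}"},
]
row = {"id": "e1", "place_id": "p9", "title": "Concert",
       "venue": "Town Hall", "start": "2015-02-13"}
grouped = triples_by_subject(columns, row)
```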

<JeniT> https://github.com/w3c/csvw/issues/26

https://github.com/w3c/csvw/issues/26 Rich Column Classes / Types (@type / @datatype on column) #26

jenit: about types of the entities being described by this row

e.g. each row about a Person

gkellogg: a phantom column, of course!

jenit: ok …

… short term answer would be to make a custom property, but let's discuss phantom columns now

<JeniT> https://github.com/w3c/csvw/issues/179

https://github.com/w3c/csvw/issues/179 Do we need "phantom" columns, i.e., columns with their own and separate 'aboutUrl' value? #179

jenit: what problem does this solve?

gkellogg: problem that we have is that sometimes the information we want to have in our output, json or specifically rdf, … we might need info not exactly in the source CSV

e.g. that the rows describe People

we would therefore need a way to introduce data into the output on a row by row basis

a virtual column might allow us to do that

a table is defined by having some number of columns

if the table desc had more cols after the last real one from the csv, then notionally it would not retrieve a cell value

but we can through other means define ....

aboutUrl etc

that was what i was trying to accomplish

you go through each col, if there are more col records after last one, you go through … and if not in the csv, … you override with default properties

to get literal values
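
A minimal sketch of that idea, assuming column descriptions that extend past the last real CSV column and a hypothetical `default` property supplying the value for such virtual columns. None of these names were settled in the discussion.

```python
def cells_for_row(columns, csv_values):
    """Pair each column description with a cell value; extra columns fall back to a default."""
    cells = []
    for position, column in enumerate(columns):
        if position < len(csv_values):
            cells.append((column["name"], csv_values[position]))
        else:
            # No such cell in the CSV: a virtual column contributes its default value.
            cells.append((column["name"], column.get("default")))
    return cells

columns = [
    {"name": "name"},
    {"name": "type", "default": "Person"},  # virtual: the CSV has only one column
]
cells = cells_for_row(columns, ["Ann"])
```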

jenit: i understand the goal, there's a q as this adds extra complexity, …

… the demonstration of type, to me, is proof that it is useful

the use of columns in that way concerns me

in that … the data changes if we add more cols to the data

if we start adding more cols to the data, multiple metadata files, some have extras, then we start to get conflicts

gkellogg: we could have isVirtual property set on the col

jenit: maybe have this as a separate property on schema, beyond cols, e.g. "extras"

gkellogg: how does this look in annotated model?

jenit: q is whether to pretend that they are cells or not

gkellogg: virtual col could appear any place?

jenit: concerned about the merge

ivan: agree w/ jenit, that this somehow mis-uses something

cols are to describe cols

<scribe> new issue: (disagreement over triples per cell in case of array value)

jtandy: lots of more structured data, observations etc., you want often a more deeply nested structure

e.g. adding a virtual column could support this

get more nesting

jtandy: in a CSV file of weather observations… that is a product based view

… we might have 5 different 'observation' entities

…all share the same time

which is why humans flatten them in csv

(example in http://piratepad.net/URwa3CM9Vv )

(discussion of data cube use case)

(slices can have common properties, but then we have to tie those back to observations)

jumbrich: in data at uni, we have Org with a director with an Address

ivan: seems to work with what you have

jenit: usually to make things link together _and_ to be able to say it has a firstname, givenname etc

you can basically only get one triple per column

if you had 5 cols you get 5 triples

you get to define what the abouts and values are

gkellogg: the notion of the virtual col is to have more control

jumbrich: e.g. row1 person has a first name and a last name, …

(example in piratepad)

event example is https://github.com/w3c/csvw/blob/gh-pages/examples/tests/scenarios/events/source/events-listing.csv

expected triples: https://github.com/w3c/csvw/blob/gh-pages/examples/tests/scenarios/events/output/expected-triples.txt

gkellogg: another hacky way to do this

multiple tables

hijack diff cols in diff table mappings

jenit: yes a hack!

jumbrich: there are also these mapping languages, ...

discussion of a jenit proposal

jenit: option 1, out of scope

option 2, … most common is saying this thing described by row is an Event, Person, etc

so we could have a specialized handling for that

option 3, this stuff is v v useful, best way of doing that is to hook onto existing column based processing

just say we have phantom cols

4., we want to do this, but not use phantom cols but extra stuff within a col description

my prefs: 3 or 4, no pref between them

jenit: either way we'll solicit wider feedback

<JeniT> 2

<jtandy> 3

<ivan> 2

<gkellogg> 3/4

gkellogg: 3 easier to impl

4 more complex

strongly against 2

<jumbrich> 3 or 4 (if we have typed columns or several entities between columns, we need something more)

3 is a hack but it's easy with potentially a huge win

<DavideCeolin> 3/4

jenit: ivan and i preferred it simple, everyone else went for the extra power

… and i accept value of that, esp 3 seems preferred

… let's try it and seek feedback

gkellogg: I think it will probably work

ivan: means at least virtual cols need a name

jenit: we'll pursue investigating use of phantom cols for generating
... create a PR and we'll put in spec saying "we particularly seek feedback on this feature"

ivan: whatever we publish in a month will include phantom cols

jtandy: terminology?


rather than Phantom

<JeniT> PROPOSED: We will implement virtual columns for the next version of the spec, with an explicit request for comments.

<gkellogg> +1

<jumbrich> +1


<JeniT> +1

<ivan> +1

<DavideCeolin> +1

backup gist copy of piratepad, https://gist.github.com/danbri/30534e3c337b34520798

<gkellogg> scribenick: gkellogg

<JeniT> https://github.com/w3c/csvw/issues/58

How should class level qualified properties be transformed to JSON #58

<JeniT> PROPOSAL: In JSON output, we do not expand property names into URLs.

<ivan> +1


<danbri> +1

<DavideCeolin> +1

<ivan> RESOLUTION: In JSON output, we do not expand property names into URLs.

<JeniT> gkellogg: right now, the value of a csvw:row is a row URI, but there are now multiple entities for each row…

<JeniT> ivan: my understanding was that this was homework to work out the details

<JeniT> https://github.com/w3c/csvw/issues/117

<JeniT> https://github.com/w3c/csvw/issues/117#issuecomment-72898169

Make the datatype mapping more precise #117

JeniT: columns describe datatype such as strings, dates, numbers. We could also have XML, HTML, JSON.

… embedded XML, HTML, JSON does exist in the wild. Embedded CSV is a nightmare!

… In the generation of the RDF, if the datatype is XML, the output should be an rdf:XMLLiteral, HTML: rdf:HTML, JSON: ???

… Three options, xsd:string, csvw:JSON, process JSON as we process common properties.

<danbri> can we emit base64 as data uris, e.g.  AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO 9TXL0Y4OHwAAAABJRU5ErkJggg==

<JeniT> danbri: I think so, through the URL template, yes

<danbri> (yes, looks doable to me too, just wanted sanity check - thanks)

iherman: I spent some time to define a JSON datatype; the problem is that formally speaking, you need to define L2V for the datatype.

… there are discussions at IETF on doing this, but there is no universally accepted way to do it.

… My feeling is that we shouldn’t define such a datatype.

… A datatype means that I have a proper RDF datatype definition, which we can’t do.

JeniT: options on table: process JSON as a common property, output in RDF with xsd:string, or with csvw:JSON, where that is defined as a subClass of xsd:string

jtandy: usually, when people do this it’s to use in a GUI, and is not intended for interpretation. I don’t want to pick out embedded output.

JeniT: this leaves options 2 and 3.

<JeniT> PROPOSED resolution: datatype: json gets mapped to RDF literal with a datatype of csvw:JSON which is a subtype of xsd:string


<jumbrich> +1

<jtandy> +1

<danbri> +1

RESOLVED: datatype: json gets mapped to RDF literal with a datatype of csvw:JSON which is a subtype of xsd:string
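As a rough illustration of the resolution, the sketch below serializes a cell as an N-Triples style literal. The function name is hypothetical, and the fallback to xsd:string for every other datatype is a simplification made for this example only. The JSON is kept as an opaque string and is not parsed, in line with jtandy's point that embedded content is not meant to be picked apart.

```python
CSVW = "http://www.w3.org/ns/csvw#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def cell_literal(value, datatype):
    """Return an N-Triples style typed literal for a cell value.

    Only the "json" branch reflects the resolution; treating everything
    else as xsd:string is an assumption of this sketch.
    """
    # Escape backslashes first, then double quotes (newlines etc. are not
    # handled here, to keep the sketch short).
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if datatype == "json":
        return '"%s"^^<%sJSON>' % (escaped, CSVW)
    return '"%s"^^<%sstring>' % (escaped, XSD)


literal = cell_literal('{"a": 1}', "json")
```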

<danbri> scribenick: danbri

Overflow Time


ivan: metadata has an @id etc., considered as a json-ld thing, result is an rdf graph, where everything is hanging on the subject, whose url is this one

jenit: and your understanding is… that the @id is for the graph, … or?

ivan: it is a bunch of rdf statements, whose subject is [this url]

[general agreement so far]

ivan: from this metadata thing, we also generate a bunch of rdf statements, ..

… as we describe, which includes the rows, the things jtandy has described

my understanding is that yesterday we said that the url for this, is … this

in fact we get, for the same subject, …

in current world, … means that we attach on to the same subject, a bunch of additional triples which have nothing to do with what we want here

what i claim is that these two things should be different

we have to have an explicit statement here

that gives a home … to give a subject for what we generate from CSV

jenit: what I don't understand is why you make the assertion that things about blah there, aren't about blah there

ivan: [here this here something missed ]

gkellogg: my u/standing is that all of those properties _are_ the table

and all of those properties are properties of the table

and something similar to inference rules add triples based on interpreting the csv

where i believe ivan is coming from, and i also feel

what the metadata description is, is a description of the table

used to create the tabular data model

which is though, a different entity

therefore when we say Common Properties, and copying them over, …

… i think from Jeni's perspective, you are not copying them, you are just expressing them with some discrimination e.g. skipping notes and schema

gkellogg: whereas my view + i think ivan's, … we could go …. [missed]

<gkellogg> https://github.com/gkellogg/rdf-tabular/blob/feature/de-conflate-metadata/spec/data/tree-ops.csv-metadata.json

… to be unequivocally of the table and not the metadata

ivan: we require an explicit thing that is different

… jtandy raised this ages ago

jtandy: i was just happy to establish that things in the @table were about the table

i did not have burning need to talk about the table description itself, who wrote it, etc.

(gkellogg talks us through https://github.com/gkellogg/rdf-tabular/blob/feature/de-conflate-metadata/spec/data/tree-ops.csv-metadata.json )

… "… if we chose to create such a distinction this would be a reasonable way"

jenit: querying this, … url is probably property of the table

… and tableSchema is the schema of the table

gkellogg: now i am understanding your view a bit more

the metadata thing … is the schema

if ivan wanted to make statements about the metadata, it could be in the schema

see orig version of this file, …->

<gkellogg> https://github.com/gkellogg/rdf-tabular/blob/develop/spec/data/tree-ops.csv-metadata.json

this has url, common properties, and tableSchema

i understand if we put common props in the tableSchema they won't come out via conversions


gkellogg: we're not copying over, so much as serializing this alongside rules based on the referenced csv file

ivan: .. what the rdf gen doc does is additional, but common properties are already there

should be made v clear in the doc

for me it was absolutely not clear


jenit: if you have dc:title on tableGroup you have it for the whole set , not inherited down

gkellogg: there is no description on the table group as such

ivan: in grander scale, you talk about CSV files as being part of the Linked Data cloud or world, ...

… my view until now, the metadata creates link between that cloud and CSV files which are in some form RDF

but in fact that is not what happens

… what we describe is some sort of an inference

Conversion to RDF

looking at PROV


section 3.1.x

issue #147


jtandy: as it is useful to understand how a set of info is created, and we discussed including PROV, … this section of csv2rdf is based on a suggestion in those discussions

prov:generated <[RDF Output Location]>;

…hard to know

prov:startedAtTime [Start Time];
prov:endedAtTime [End Time];

… for activities

and it had a usage, which was a csv file, … etc.

see also 2nd example further on.

ivan: see https://github.com/w3c/csvw/issues/174

Slight modification of the provenance structure for RDF output #174

ivan: this shows eg a bit different, … you bind it to table with activity

i was looking at prov vocab and examples

… here it was generated by an activity, ...

whether that info was useful or not is a separate debate

i think that is more correct

davide: … this kind of info was what i was looking for

may not be v useful in many cases

but sometimes can help you find problems

ivan: this is what i generate now

jenit: you mention a way of capturing what metadata files were used

jtandy: you'd have a prov qualifiedUsage block

one for every metadata involved

gkellogg: except for the embedded metadata

ivan: i have here a slightly more complex one

(adding to https://github.com/w3c/csvw/issues/174 )

prov entity has a bunch of csv files

jtandy: so it is a list of entities

jenit: i don't know what the correct usage is

… here this is an activity that has two prov usages

one of which has multiple entities

jenit: even though there are multiple metadata files, ...

ivan: problem is, ...

discussion of whether optional

how to test

esp with times

gkellogg: only thing problematic for automated testing, is inclusion of timestamps

jenit: whether that is problematic or not depends on how we define those tests

gkellogg: we got a lot of rdfa impl feedback that we made testing hard

ivan: here we have 2 metadata files that exist and can be referenced

but default and user metadata, passed on,… how do we describe them

davide: i was thinking about that

gkellogg: maybe a bnode??

danbri: do we have a UC for this?

jenit: are there any specs that generate provenance automatically
... would it be terrible if left implementation defined

ivan: prov docs can be hard to read but a good primer

danbri: provenance super useful in v detailed scientific scenarios, but we can't define that … let's point them at prov

jenit: to facilitate that, fix some prov roles

csvw:csvEncodedTabularData and csvw:tabularMetadata

… we may need to think about those more

<JeniT> https://github.com/w3c/csvw/issues/174

jenit: suggesting that https://github.com/w3c/csvw/issues/174 ("Slight modification of the provenance structure for RDF output") be resolved as …

(discussion that examples are non-normative)

<JeniT> PROPOSAL: We suggest that implementations may choose to include provenance information and include an example of what it might look like.


<gkellogg> +1

jenit: the use of the prov info will really determine how much depth needed, … so am inclined to leave it impl-defined.

gkellogg: for testing, implementations should have a way to disable outputting prov

<JeniT> +1

<jumbrich> +1

<jtandy> +1

<ivan> +1

<DavideCeolin> +1

RESOLUTION: We suggest that implementations may choose to include provenance information and include an example of what it might look like.

jenit: on to prov roles

jtandy: i feel that is the right way fwd

raises q then about dcat distribution

i think important that we point to the csv where the stuff came from

<JeniT> https://github.com/w3c/csvw/issues/147

jenit: but prov roles first

Prov roles #147.

"The CSV2RDF doc uses two values for prov:hadRole: csvw:csvEncodedTabularData and csvw:tabularMetadata. This need to be defined as instances of prov:Role in the namespace. Are there other instance types we need to define? TSV, XLS, HTML?"

ivan: q is whether there are other roles

we defined yesterday validation vs generation processors

i used a ref to my own tool saying 'this is the guy that generated that'

maybe the validation is a diff role?

danbri: plugins for R2RML etc?

jenit: no, this is just for our bit

danbri: so they'd do their own prov? fine thanks

ivan: prov's way around reification is interesting

jenit: so on #147 we make it only applicable to the csv2rdf mapping, and assign Davide to the issue, commenting "discussed at f2f…" -> see https://github.com/w3c/csvw/issues/147

… davide and ivan to come up with a list of appropriate roles

https://github.com/w3c/csvw/issues/179

Is the DCAT block useful in the RDF output. #177

jtandy: to give an unambig rel between dataset and outset, i inserted idea of using a dcat:distribution statement

file vs abstract data

however we could simply use the url property

jenit: or dc:source

danbri: also in schema.org

jenit: this is one mech to do it, … there are clearly others, ...

… introducing the dcat stuff gives us some baggage that might make some people flinch

ivan: this in #177 …

… is a json transform

jtandy: I generated it. Idea is that you would, while transforming, insert a bit of json magic

gkellogg: just a json not json-ld?

no, this is the rdf transformation …

ivan: i have no dcat experience

jenit: impl is that the table is a dataset in dcat terminology

which is so flexible as to mean anything

jtandy: you could insert as a common property

jenit: i think this falls under 'it's impl defined how you might define info about the provenance of this output graph'

you could use prov or dc:source or dcat or ...

gkellogg: so goes into same non-normative section

jenit: only thing, … using dcat:distribution def falls under impl-defined, only q is whether we want there to be a CSVW URL property to be in the rdf output

danbri: can't force people to publish e.g. intranet urls

jtandy/jenit: feels more refined than just dc:source

… should csvw:url be a subproperty of dc:source


"A related resource from which the described resource is derived."

"The described resource may be derived from the related resource in whole or in part. Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system."

-1 on subproperty

jenit: Suggestion: don't add dcat:distribution, but do have, in the generated RDF output:

_:table csvw:url <tree-ops.csv> .

jtandy: in json it would just be "url": ...

jenit: proposed - any impl of dcat properties is impl defined, but that we do try to preserve the link to original file through using csvw:url


(route to http-range-14: "Is the url the id of this thing or a different thing? discuss.")

jenit: two more issues we didn't get through

lists next

<JeniT> https://github.com/w3c/csvw/issues/107

jenit: when we have a cell with a sequence, e.g. spaces, semicolons, … and the cell value then contains a sequence of values, ...

gkellogg: do we disagree? what about cells being only one triple?

jenit: what to do in these kinds of cases?

… what gets created in the rdf output?

<JeniT> https://github.com/w3c/csvw/issues/107#issuecomment-72894468

json has arrays which are always ordered

rdf output has possibilities of generating an actual rdf list, … or you generate repeated properties

jtandy: content that lists are lists

danbri: [begs for a parameter for listyness, use case of nationality]

<JeniT> PROPOSED: when a cell value is a sequence of values, it is converted to a rdf:List if ordered is true, and to multiple values for the same property if ordered is false; the default is that ordered is false


<JeniT> PROPOSED: when a column defines a separator, cell values are converted to a rdf:List if ordered is true, and to multiple values for the same property if ordered is false; the default is that ordered is false


<ivan> +0.999


<gkellogg> +1

<DavideCeolin> +1

<jtandy> +1

<jumbrich> +1

<JeniT> +1

(exit davide)

RESOLVED: when a column defines a separator, cell values are converted to a rdf:List if ordered is true, and to multiple values for the same property if ordered is false; the default is that ordered is false
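The resolution can be sketched as follows. This is an illustration under stated assumptions: the function name, the blank node labels (`_:l0`, `_:l1`, …) and the plain-tuple triple representation are hypothetical, and cell values are left as untyped strings.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def cell_triples(subject, predicate, cell, separator, ordered=False):
    """Convert one cell whose column defines a separator into triples."""
    values = cell.split(separator)
    if not ordered:
        # Default (ordered is false): repeat the property once per value.
        return [(subject, predicate, v) for v in values]
    # ordered is true: build an rdf:List as a chain of blank nodes
    # terminated by rdf:nil.
    nodes = ["_:l%d" % i for i in range(len(values))]
    triples = [(subject, predicate, nodes[0])]
    for i, v in enumerate(values):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", v))
        triples.append((nodes[i], RDF + "rest", rest))
    return triples


unordered = cell_triples("_:r1", "ex:nationality", "UK;FR", ";")
as_list = cell_triples("_:r1", "ex:nationality", "UK;FR", ";", ordered=True)
```

With the default, the nationality use case danbri raised yields two plain `ex:nationality` triples; only an explicit `ordered=True` pays the cost of the list structure.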

jtandy: who is going to update UC doc?

davide: ok, i'll…

<JeniT> https://github.com/w3c/csvw/issues/35

<JeniT> https://github.com/w3c/csvw/issues/94

<JeniT> [discussion about whether it’s possible/useful to have a default metadata document]

ivan: what we have now...

we normalize each metadata then we 2nd-normalize them

filling in missing bits like name

<JeniT> ivan: we normalise the metadata files before merge, then we merge, then we add defaults (like name)

gkellogg: that's your view

… what's in there is consistent and does not require us to locate default metadata

ivan: I think more the q of how we define it, … an editorial issue

we do same thing

… i try to put the formulation of whole thing into metadata files,...

… at end of whole process we have another phase of normalization

which seems consistent with the current system

this is an editorial issue

jenit: i think perfectly reasonable to say 'normalization, merge, … '

…'completion' (ivan/jenit)

gkellogg: places we talk about property values to make sure they're [post-completion]

jenit: for each property we say 'if missing assume x'

ivan: name, details of dialect

jenit: we could be more disciplined providing more info throughout

(reminds me of https://en.wikipedia.org/wiki/XML_Schema_(W3C)#Post-Schema-Validation_Infoset …)

jenit: editorial action is to check property definitions are applied consistently
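The "normalization, merge, completion" sequence discussed above might look like the sketch below. Everything specific in it is an assumption made for illustration: the default values, the rule that earlier sources win during merge, and the trivial stand-in for normalization.

```python
# Hypothetical defaults; the real list would come from the property
# definitions ("if missing assume x").
DEFAULTS = {"header": True, "delimiter": ","}


def normalize(metadata):
    """Stand-in for normalization: here it only drops keys set to None."""
    return {k: v for k, v in metadata.items() if v is not None}


def merge(sources):
    """Merge in priority order; earlier sources win (an assumption)."""
    merged = {}
    for source in sources:
        for key, value in source.items():
            merged.setdefault(key, value)
    return merged


def complete(metadata, defaults=DEFAULTS):
    """Completion: fill in every property that is still missing."""
    return {**defaults, **metadata}


def effective_metadata(sources):
    """Normalize each source, merge them, then apply defaults last."""
    return complete(merge([normalize(s) for s in sources]))


result = effective_metadata([
    {"delimiter": ";", "name": None},
    {"delimiter": "\t", "name": "tree-ops"},
])
```

Applying defaults only after the merge means no default metadata document has to be located, which is the point gkellogg makes.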

gkellogg: I tried this when looking at property values (in transform doc)

i think it is ok. if not, there is some editor action.

ivan: i can do this, but when? all these changes pending

jenit: process from here is … lots of editor actions

push them all through

ivan: even my implementation needs reworking after all this

gkellogg: also our test cases

jtandy: I'll always have a propertyUrl defined?


ivan: conversion docs will be cut by half

jenit: do we want to discuss "Relationship between table group, table and schema"?

jtandy: that will be resolved based on [other actions/decisions]

-topic csvw:row

Relationship in RDF output of conversion between csvw:Table and the entities generated from a row

ivan: dealing with lists is ugly

… which is why we pulled away and put in the row number

jenit: table has rows, … rows have row numbers, which describe entities, … the (possibly different/various) about URIs

('describes' or similar)

discussion of using RFC-7111 to point here with fragment IDs

gkellogg: i'm fine so long as i can turn it off

debate on whether we want to explicitly list an option

jenit: may as well be non-normative then, if optional

… related q: is it legal for the rdf conversion to include anything else it wants?

gkellogg: always should be ok, but should also be possible to turn off turnoffable things

levels of conversion-

gkellogg: including "that", rows etc

danbri: [something like named graphs, x3]

<JeniT> PROPOSAL: there are different levels of output from RDF and from JSON, which can be selected on user option. These are ‘minimal’ that produces only the data from the table, without reification triples, ‘standard’ which includes reification of tables & rows, ‘plus prov’ which includes provenance






<ivan> +1


<gkellogg> +1

<jumbrich> +1

<JeniT> +1

<JeniT> RESOLVED: there are different levels of output from RDF and from JSON, which can be selected on user option. These are ‘minimal’ that produces only the data from the table, without reification triples, ‘standard’ which includes reification of tables & rows, ‘plus prov’ which includes provenance
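One way an implementation might expose the three levels is sketched below. The enum, the function name and the choice of 'standard' as the default are assumptions for illustration; the resolution only says the level is selected by user option.

```python
from enum import Enum


class OutputLevel(Enum):
    MINIMAL = 1    # only the data from the table, no reification triples
    STANDARD = 2   # adds reification of tables and rows
    PLUS_PROV = 3  # adds provenance on top of standard


def select_triples(data, reification, provenance, level=OutputLevel.STANDARD):
    """Return the triples to emit for the requested output level."""
    out = list(data)
    if level.value >= OutputLevel.STANDARD.value:
        out += reification
    if level is OutputLevel.PLUS_PROV:
        out += provenance
    return out


minimal = select_triples(["d"], ["r"], ["p"], OutputLevel.MINIMAL)
standard = select_triples(["d"], ["r"], ["p"])
full = select_triples(["d"], ["r"], ["p"], OutputLevel.PLUS_PROV)
```

Keeping provenance in its own level also gives test suites a way to leave timestamps out of the compared output, which was the concern raised for automated testing.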

Summary of Action Items

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.140 (CVS log)
$Date: 2015-02-17 10:49:01 $