W3C

CSVW WG F2F meeting, 1st day, 2015-02-12

12 Feb 2015

Agenda

See also: IRC log

Attendees

Present
Jürgen Umbrich, Davide Ceolin, Dan Brickley, Jeremy Tandy, Jeni Tennison, Gregg Kellogg, Ivan Herman, Phil Archer
Regrets
Chair
Jeni Tennison
Scribe
Phil Archer, Dan Brickley

Contents


introductions

Charter review

jenit: orientation …

reviewing charter

http://www.w3.org/2013/05/lcsv-charter.html

discussion of xml mapping

proposal: XML mapping will not be undertaken by this WG (for lack of interest)

<JeniT> +1

<ivan> +1

<DavideCeolin> +1

<gkellogg> +1

<jumbrich> +1

+1

<jtandy> +1

resolved: XML mapping will not be undertaken by this WG (for lack of interest)

jenit: use cases + requirements may need some rounding out to publish as a Note (tidy issues etc) but is basically there

(reviewing deliverables)

Metadata vocab for CSV data ...

gregg: do these docs all need some review w.r.t. what we've done

<JeniT> http://www.w3.org/2013/05/lcsv-charter.html#deliverables

jtandy: some issue work, …

… we have an implicit understanding in the issue list but it may be hard for outsiders to follow

1. Turn the requirements into much more structured text.
2. Review both the requirements and the use cases against what we have actually done.
3. Close out the 9 issues that are in the doc, 3 of which need discussion in this meeting.

open vs closed validation

<scribe> closed is "these are the only cols in csv, error if others"

open is "unexpected columns allowed"

jtandy: we discussed phantom/virtual columns

currently you need same number of cols

jtandy: also foreign keys only working in batch of files -> is on agenda

[end of use cases discussion]

discussing Metadata doc for CSV.

is its own doc. whereas,

Access methods for CSV metadata is a portion of larger doc

Mapping mechanisms: for conversions into RDF and into JSON

this meets our 4 deliverables

jenit: our milestones … we are behind

ivan: (re process) formally speaking this milestone doc is outdated because W3C process has changed. Last Call and Candidate Rec are simplified as 1 step.
... we need to count on 6 weeks for AC vote

so there needs to be 6 weeks before rec

the last step (LCCR to PR): the distance between the two depends on us, i.e. where we are w.r.t. implementations. No formal rule.

JeniT: last call CR is supposed to be … all known issues closed resolved, ready for impl

LCCR we need two implementations (according to whatever we define)

independent impl for each "feature", although we need to be clear what we think a feature is

conversions, metadata finding / merging, …

… then validation

gregg: we also define a viewer

ivan: is it in the charter?

jenit: it is defined non-normatively

ivan: I would propose that we do not talk about viewers as a normative thing

brings on extra load

gregg: raises the question of talking about text direction as it only pertains to viewing

discussion of whether text direction is normative

jtandy: somebody who will use this output data, may use the metadata to provide additional info ...

ivan: minor issue

lets not talk about viewers in terms of normative, ...

gregg: in terms of impl, ivan's and mine, … transformation implementations, … with a small amount of validation but not point where we can pass tests

ivan: regarding formalities, …

… once doc is in PR it is owned by w3c team and not wg any more

our charter ends at end of august

if by end aug we are at PR we can finish and publish without any formal problem

jenit: hard hard deadline is end aug PR then?

ivan: ideally yes

… even with mood of much more stringent oversight, if we have a LC CR available end of [northern-hemispheric] Spring early Summer

… early June, ...

then getting a 6 months cr, ...

…should be ok. Otherwise it could be a problem.

Phil: this group is going so fast and well that would not be a problem

jenit: hard hard deadline, early june for lccr as our next big milestone
... I'd like us to get there sooner

danbri: attendance is an issue

jenit: discussing things I wanted to cover w.r.t. charter review

ivan: my feeling for realistic planning, is a new series of drafts for entire rec track set, mid-end march.

… and then, all issues this solves, it would clarify things for impl

if that goes well and there are no major issues, we can publish an LC CR mid may, end may, … effectively June.

jenit: also let's discuss w.r.t. charter, the aspect of parsing CSV files

that is explicitly not in formal scope

gregg: although we do define some dialect

ivan: non normatively

jenit: this is an area of difficulty

since anyone actively implementing needs an actual parser

so not having the definition normative actually leads to a situation with inconsistencies between implementations

from that p.o.v. it is a bad thing

… w.r.t. scoping and reduced workload it is good not to take on extra work

ivan: seems we have lost Yakov since he changed jobs

… the w3c management discussion was that "we do not standardize CSV"

… pref is for IETF to handle that layer

the discussion w/ Yakov was that it could be handled there

ivan: I have no idea what if anything is happening there

gregg: we should probably look at our dialect defs then

parameterized aspects, separators, quotes etc

jenit: most CSV files on the Web are not in fact CSV files per the IETF spec

which is the motivation for the dialect description

which is about parsing a text file into a tabular data model, rather than a proper CSV into...
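[For orientation, a dialect description of this kind gives a processor hints for turning such a file into the tabular data model; the property names are roughly those in the metadata draft, and the values here are made up:

  {
    "delimiter": "\t",
    "quoteChar": "\"",
    "doubleQuote": true,
    "encoding": "utf-8",
    "headerRowCount": 1,
    "skipRows": 2
  }

A file that parses with the default dialect needs no such description at all.]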

jenit: let's look at this when we talk more about parsing

jumbrich: we sent around report end of 2014, examined 80k CSV files, 20% had some errors via slightly modified Python lib

also the delimiter was in 80% a comma

we found tab in 3.8% of the docs

jenit: how did you choose the docs?

jumbrich: via ckan repo

jenit: you may have missed some defined as TSV then

ivan: having data is a good thing to have

gregg: do we consider tsv a dialect of csv?

[yes, ish]

jenit: the wording of the docs i've worked on tries to use 'tabular data'

<phila> from the charter: "The mission of the CSV on the Web Working Group, part of the Data Activity, is to provide technologies whereby data dependent applications on the Web can provide higher interoperability when working with datasets using the CSV (Comma-Separated Values) or similar formats."

ivan: a minor thing, … in our docs for metadata, we rely on the suffix being .csv

jenit: no we don't

gregg: no, filename .json

ivan: withdrawn, you're right.

phila: the charter mentions CSV explicitly; we use CSV as a generalization of tabular data

phila: Use case doc has lots of data, is it going to be turned into a test suite?

jtandy: it is avail for use in tests

… we should talk about usecases -> specs -> test suite to have consistent examples throughout

gregg: i have been adapting the examples as i maintain tests

… my intention w/ examples was to go back into docs

completely agree that our usecases should ...

ivan: let's be careful, the use cases themselves as test cases, … are unusable

jtandy: +1

ivan: each use case should have a repr in test cases but perhaps indirectly
...

gregg: staggering combinatorics

you test the smallest things, … different places in diff merged files, …

jenit: the model that i would suggest is that we continue to focus not on the parsing

we have dialect as a set of hints, …

… but not formal

gregg: in my testing, … might want a class of tests, merged metadata in output

ivan: we need something that says "of these scattered files these are the final aggregated metadata items"

jenit: yes

… re test suite input docs, informed by use cases but CSV files that do not require any dialect description in order to process them

jtandy: test should say "this test inspired by UC-xyz"

…. palo alto tree data. also govt salary thing.

these came right through from UCs

others less direct

ivan: unfortunately, some entries in the dialect that we can treat as you said jtandy, … e.g. separator

but others like no. of rows / cols, that you will want to ignore, that affect our processing

jenit: or _might_ affect

ivan: but there are some that might

those have to be tested

gregg: we could divide the dialect info into parts that might affect parsing

[...]

gregg: do we want to test some TSVs?

jenit: no

gregg: can we have a resolution

<JeniT> proposed: we will only use CSV files in our test suite (not TSV, not anything that needs dialect description)

jtandy: tests will be inspired by the use cases, but skipping rows, dialects etc., we'll normalize into clearer CSV

<ivan> +1

<gkellogg> +1

<DavideCeolin> +1

<jumbrich> +1

+1

<JeniT> CSV = IETF CSV with UTF-8 & LF or CRLF for line endings

gregg: things we're not processing: encoding, line terminator, quote char, double quote, delimiter, trim …

jenit: assume no trim

resolved: we will only use CSV files in our test suite (not TSV, not anything that needs dialect description)

<JeniT> +1

<jtandy> +1

<JeniT> RESOLVED: we will only use CSV files in our test suite (not TSV, not anything that needs dialect description)

ivan: returning to q, … do we know what is happening with IETF

danbri: I assume nothing

jenit: we know of nobody having picked it up after Yakov's work

people from this group could get involved

ivan: since it was mentioned in charter we will need to be ready to explain this situation.

jtandy: re charter, … the piece that we have gone for "simple mapping": that we discussed templated mapping, … haven't seen anything active on this

danbri: I circulated a draft charter ~nov but it got v little discussion, i think ok to wait

jtandy: when we publish we should refer to CG charter

… e.g. point from the conversion doc to CG [if it exists]

ivan: at this point there is no CG just a mail from dan

jtandy: we talk in the specs about template formats

gregg: something somewhere about templating?

ivan: shouldn't be

we had some discussion of what the input to templates might be

in simple mapping itself it should be silent on any complex mappings

gregg: if we say something normative it would seem that we need a test

jenit: we have extension conversions item for tomorrow
... 3 things before 10:30. 1) gregg talk about test suite 2) go through agenda incl. ivan's notes 3.) that's it.

<JeniT> gkellogg: RDF tests use graph isomorphism, JSON use deep equality

<JeniT> … tests use the output that my implementation gives as the target output

<JeniT> … I set any optional outputs (like provenance) as false, as that’s where it’s likely there will be implementation differences

<JeniT> … the metadata components of the tests are implicitly referenced but should be located by the implementation based on the location of the CSV file

<JeniT> … there’s also an option to provide user-defined metadata, and to fake the creation of the Link header

<JeniT> https://github.com/w3c/csvw/tree/gh-pages/tests

not to be confused with https://github.com/gkellogg/csvw-test

… which is a test runner

<JeniT> ivan: I’d love separate folders for each test

discussion of file names vs folders

gregg: this happens to be the test runner running off of my distiller dir

… which is v reminiscent of the RDFa tests

… probably has to start up, can run a test, pass/fail, check details to see what the input files were, what the result was

… you can run whole set and get EARL output

jenit: if you are someone making a new impl, what process to integrate with this?

gregg: for me I integrate with my impl's own testing

i download the manifest and iterate through the downloaded tests

jenit: and if you are doing an impl, … that is based on say .js, that doesn't have any RDF processing, ...

gregg: in my processing I turn it into a json-ld manifest and then run over the json

we could easily maintain a json version

jenit: can we do that please, easier …

ivan: this or other one, relies on some sort of web service to run the conversion

gregg: which is why i run it off my site

ivan: i use a jquery extension

it will save time for some but not for all

… my stuff tries to do a jquery extension

read in csv file, … display as turtle,

gregg: people have used node.js to do that

ivan: then i'll need to compare jquery promises w/ node promises, … difficult

jenit: msg then is that, if you are implementing, … you'll need to handle these manifests/tests somehow

gregg: I am happy to make sure we have a json-ld representation of the manifests
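[Sketch of what a JSON(-LD) manifest entry might look like; the property names below are modelled loosely on earlier RDF test suites and are illustrative rather than the agreed vocabulary:

  {
    "id": "manifest#test001",
    "type": "ToRdfTest",
    "name": "Simple table",
    "action": "test001.csv",
    "result": "test001.ttl"
  }
]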

danbri: would generated json-ld be acceptable to non-rdf wonks?

gregg: [shows pretty looking json]

jenit: Validation tests

gregg: nothing so far

… there is a vocabulary definition …

…does include diff types of test

a positive eval syntax, and a csv to rdf, csv to sparql, csv to json, … metadata to rdf, …

gregg: I can add more test types for validation and its variations

ivan: so the program that you used in past to generate a formal report from the EARL should just work

gregg: yes

ivan: this was used in earlier groups to generate reports

… question now is how many impl we will have

ivan: for time being we have one

… which the test results are based on

gregg: I asked AndyS, but no response yet

jumbrich: how much effort?

ivan: only part which is relatively complex is managing/merging the metadata, maybe extracting all the info for each cell that the metadata really gives you

once that mechanism is in place, generating rdf e.g. via rdflib, or json, … is relatively easy

finding/extracting metadata per cell is hardest part

jumbrich: i don't understand that aspect, i thought we start with a metadata file

ivan/jenit: that's the naive view, the spec is more complex see discussion later

jumbrich: we have a uni project around this, ...

ivan: I didn't attempt python but if uni team want to do python i am happy to help

… rdflib is a v solid basis

gregg: and ruby is pretty similar; my impl is unencumbered

jumbrich: we used java before

ivan: ideally you would use python since if andys does it he'll use java

jenit: after the break we'll look at parsing csv + tab data

after lunch we'll have 1.5h on metadata discovery and merge, … then a break

reviewing https://www.w3.org/2013/csvw/wiki/F2F_Agenda_2015-02

gregg: let's get resolutions in github, and editor actions, being clear what the action is

ivan: at end of 2 days we'll have a list of issues assigned to editorial work
... we turned into github issue freaks after tpac

:)

jenit: what we have here through this agenda are all the issues marked as needing discussion

… still open but not at resolved or editor-action level

aiming to get get through all of those

… comment in github needs to be clear on what the editor action is

BREAK, back at 10.45.

Parsing CSV & other tabular data

gregg: consider branches, pull requests etc., …

jenit: we could quickly now go through those

jtandy: i have queued up a bunch of things to do on the rdf conv doc

looking at pull #187

gregg "conseq of using about predicate url etc"

#187 not currently mergeable

- topic Tabular Data Model

jenit: need to merge in #192 … joins core and annotated data models

everyone happy? [nodding yes]

ivan: good simplifying impact

jenit: so had two kinds of level, annotated tabular data model, … and grouped, which says "there is a group of tables"

… and then that group may also have some annotations

… so the most useful is the annotated table

jenit: this is really simple, exactly what you'd expect from a tabular data model. You have a table with rows in it, cells in it, cols in it. Rows and cols have numbers. Cells belong to a particular row/column.

…. the bit that goes anywhere near controversial is distinction between string value of cell and value value of cell

… which is the parsed and processed datatyped value

for eg. if string value is empty string, cell value is null

various bits from the metadata can affect the cell value

value value can be an array

gregg: an rdf issue, a list or multiple values

jenit: that is the data model.

all of these can be annotated. Idea is that all the ops that are described

(annotations live in the metadata)

gregg: except we don't yet provide mechanisms to annotate all these things e.g. cells

jenit: well you can inherit into them

ivan: when you talk about annotation, this is not the annotations of the annotation wg

jenit: correct


… annotations, properties, attributes, … all mean essentially the same. Something will always get confused.

ivan: probably worth adding a note here that the term annotation as used in this doc is not exactly same as used in the Annotation WG

action on someone?

(admin: we're capturing all actions into github not w3c tracker)

jenit: this data model is basis for conversions

…everything in the metadata doc should talk about how it affects that model

gregg: there was an issue w/ cell values, reconciled in one of these pull requests

jenit: the bit then to discuss now is around parsing the CSV or other files INTO that model

we take the model as being the central thing through which everything passes

CSV TSV HTML tables etc etc

scribe: things like CSV but with extras e.g. HXL (see usecase)

https://data.hdx.rwlabs.org/dataset/hxl-validation-schemas

http://www.w3.org/TR/csvw-ucr/#UC-CollatingHumanitarianResponseInformation

<JeniT> https://github.com/w3c/csvw/issues/9

going through the issues…

from #9 -> http://w3c.github.io/csvw/use-cases-and-requirements/#UC-PublicationOfNationalStatistics

jenit: propose close #9, we handle being able to id places within csv files, and have other issues for notes/annotations on cells

"Locating data region(s) within a CSV file #9"

jtandy: presupposes we do not have 2 tables within one csv file
... we should be clear that we do not allow multiple tables within a single file

gregg: we distinguish CSV vs model created from it

if it was the more abstract sense, someone might create something that extracted tables, put them into a processable form

jenit: I think the metadata doc does make that distinction in most places except dialect descriptions

(except we have url ref back to src file)

ivan: we separate rfc blahblah as a diff issue?

jenit: not relevant unless we support referencing areas of a csv file in a diff way, e.g. to say 'this bit is the tabular data', …

ivan: RFC came up at TPAC for handling of Web Annotations

that's where it'll fit

ok

jtandy: the other thing this issue was about, … e.g. in usecase there were multiple header rows

now we have parsing suggestions that can hint about skipping

<JeniT> PROPOSAL: close #9 as handled by dialect description

+1


https://github.com/w3c/csvw/issues/52

"Should we check integrity of CSV files in some way #52"

jenit: in data packaging spec which our work is based, … words around integrity checking

also subresource integrity checking

e.g. refs to scripts, you can provide an integrity url

see http://www.w3.org/TR/SRI/#use-cases-examples

jenit: we could do something like that

… I suggest we don't for now

ivan: that work is fairly early still

jenit: this can be explored later

Resolved: to not handle integrity. Editor action to remove the issue reference from the document.

https://github.com/w3c/csvw/issues/182

"Fall-back value of the `name` property? #182"

also similar,

https://github.com/w3c/csvw/issues/53

"Should there be limitations on the syntax of column names? #53"

taking #53 first, ...

jenit: these names get used in syntactic contexts that are restricted e.g. in json etc

"What syntactic limitations should there be on column names to make them most useful when used as the basis of conversion into other formats, bearing in mind that different target languages such as JSON, RDF and XML have different syntactic limitations and common naming conventions."

ivan: there are two things, … one is that we do say a name has to abide to restriction of templates. we have to.

… also we say if no name, but a title, then a name must be created from title so needs some normalization

jenit: re #53, I proposed change from SHOULD to MUST re syntax constraints

you'd get an error if col had wrong names

jenit: metadata files themselves should always be validated

… checking the metadata is important

gregg: it's useful … but complex

ivan: I think we said these things are separable. The conversion doc starts with metadata that is correct, …
... I use the name, rely on getting it from the metadata, …

jenit: it is correct for the conv doc to rely on the tabular data model, … but metadata doc has to be parsed and processed

gregg: e.g. must foreign keys be consistent?

you can still generate reasonable json and rdf even if keys are wrong

jenit: then we need this as an explicit choice and resolution

… if metadata isn't right, do we ignore those properties, provide an error/warning, etc? or error out completely

gregg: in my previous rdf processors, i've always tried to generate data when i can, ...

ivan: In this case, what we do is not only for conversion, …

… the metadata once it is there can be used for all kinds of other things

… in this sense they are separable

… i am in more favour of saying 'if it is not kosher then it is in error'

jenit: what you're suggesting there is that if the impl is a convertor, it should ignore unknown properties in the metadata, whereas if it is a validator it should report error

ivan: validators certainly should cry foul and not try to find a reasonable default
... validator is a v diff thing

gregg: validating metadata vs validating csv files

… purpose of validator is checking data integ not syntactic correctness

ivan: [missed point]

gregg: processor can impl a relaxed mode

(discussing #53, see comments in github)

and #180

… #180 closed as done.

RESOLUTION: #53 " to change 'SHOULD' to 'MUST' re column name syntax restrictions. See https://github.com/w3c/csvw/issues/197. Other things are in progress."

https://github.com/w3c/csvw/issues/182 "Fall-back value of the `name` property? #182"

jtandy: a great optimization, well done

jenit: if metadata doc supplies a title and not a name, should it come from title ?
... it isn't valid for name to be missing from metadata doc

ivan: see discussion of 2 days ago; the validity is not on one of the constituent metadata docs but on the merged metadata.

that's why i wrote down the whole process

incl some default actions

at end of which you get to the final metadata

that's the one that has to be valid

it can be ok if it is missing from specific files so long as there is a process that somehow assigns the name

we might say that for names we don't have that

reason why i think this is best

we have case where we only have 1st row

gives us a bunch of titles

the freq one, e.g. california trees, where we have titles with spaces in

jtandy: and it will end up percent encoded

… when people see that they'll go create a metadata record!
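[Illustrative example of the point being made, assuming the name is derived from the title by percent-encoding characters not allowed in names: a heading of "Tree ID" would yield a generated name along the lines of "Tree%20ID", which then surfaces in URI templates and conversion output.]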

jenit: this comes into the bits around metadata merge, … maybe put off discussion until then

in the metadata files that people actually write, vs ones autogen'd from a csv file, i think we'll want to say that there is always a name for a column.

ivan: no.

… the metadata files by themselves are not complete metadata, … that is conseq of merge

ivan: what this thing describes, is what the metadata should look like, not what each src file should have

jenit: what counts for an author is what they put in the metadata doc

ivan: if they decide to write 2 or 3, which we allow by virtue of the merge, ...

gregg: they could have 1 file with schema in table group and table, …

ivan: don't make it even more complex :)

jumbrich: i could write a file, header, col name, ...

… title is for labelling a col

… if i write a meta file i might be too lazy to copy it over

ivan: i was talking about a more general issue with the merge

jenit: ok

ivan: as long as a title is a single string, no problem. But then we have the lang issue

#182

jumbrich: there may be computed/inferred processes during metadata merge

ivan: yes

… also lang is nasty.

gregg: probably worth looking back at desc of name in metadata doc, it does go through, ...

3.9.1

http://w3c.github.io/csvw/metadata/#columns

ivan: that answers my issue

jenit: where does it talk about normalizing title into name?

(looking in 3.9.1)

editorial: missing mention of normalizing

somewhere around "The http://w3c.github.io/csvw/metadata/#dfn-property-value of name is that defined within metadata, if it exists. Otherwise, it is the first value from the http://w3c.github.io/csvw/metadata/#dfn-property-value of title…"

<scribe> closed #11 by jtandy earlier

<JeniT> https://github.com/w3c/csvw/issues/11

<scribe> closed #182 -> fixed in current wording (at Feb F2F). http://w3c.github.io/csvw/metadata/#columns

"How to determine language encoding #11"

"http://w3c.github.io/csvw/use-cases-and-requirements/index.html#UC-SupportingRightToLeftDirectionality

From internationalization perspective how is the proper language encoding is determined?"

<JeniT> https://github.com/w3c/csvw/issues/193

-> https://github.com/w3c/csvw/issues/193

ivan: describes what processor has to do to get the final metadata

complication has fact that the metadata includes the dialect

which is necessary for the parsing

therefore you locate what metadata files you can

link header, file, global metadata, user metadata, …

make a parse, ...

then start it all over again, since metadata can be extracted from the file itself

conceptually re-run whole thing

ivan: what whole merge process does, is that various elements of the metadata can be piecewise defined, pulled together for final info

i think we need that

let's say we keep that

this also means in theory, that you can have the dialect info described piecewise

one part says this is the separator, another say this is the line ending

what if the dialect is considered conceptually as an atomic property

e.g. what if i have to extract those metadata files, i take the first dialect encountered, then i do the parse and have a clear process

jtandy: are you saying: you cascade through with your UMs, DMs etc. as you defined, … until you find a dialect. But if you get to the end, then the default is the last one in the chain

… then you say 'ok i can now parse my csv file'

jenit: makes complete sense to me

gregg: many people when they do a dialect description they'll say they only changed separator

jeni/jtandy: having default values

not a 'merge'

jenit: makes complete sense, in practical terms

ivan: i like that you like it, … but then this raises heretical q to me, … do we really need this merge?

… isn't the same philosophy needed for the loads of other cases where a merge is challenging

e.g. titles from 3 diff places?

jenit: let's discuss this in later session

ivan: see my comment in #193

("just a food for thought for the F2F, …")

(too big to quote without being kicked off irc)

ivan comment: "As we said, the complication comes from the fact that the merging procedure must be performed twice: once partially to retrieve, essentially, the dialect, and then with all parties involved for the final processing."

"The current model allows a "distributed" specification of the dialect."

"The separator character is defined in one, whereas the number of skipped rows in another. Is it really necessary to do that?"


jumbrich: normally you have a csv file, it has only one dialect, … you can't really go wrong

if you find a conflicting dialect, …

jumbrich: we looked at first 100 lines in our investigation

ivan: problem is you have metadata file … and other places, … can dialect be composed across sources

<Zakim> danbri, you wanted to ask about csv groups

<Zakim> jtandy, you wanted to ask about specifying the relationship between tablegroups and tables

jtandy: similar discussion around table groups to tables w.r.t. where you can put a schema

e.g. i went through the salary one

it might be better to put table desc only in one place [...]

jenit: on agenda for tomorrow

jtandy: can we do dialects then too?

ivan: that is the missing bit in my scheme

gregg: can also see it as good to avoid repeating yourself

ivan: I am happy to rewrite whole process of getting to this metadata

w.r.t. dialect, it might make things simpler

davide: is there a specific … [missed]

jenit: would still be in an order of preference. User-defined overrides linked ones, which override those in the directory, etc.

Resolved: #193 "to make dialect descriptions atomic (not merged from separate metadata files), which should simplify the process. Also needs to include factoring in the role of metadata at the table group level, and the use of the default dialect."

ivan: also metadata doc needs for merge algo to specify that dialect is sort-of atomic

gregg: which is straightforward enough change to merge

[discussion of exceptions]

resuming

Metadata discovery and merge

jenit: metadata discovery first

(jenit takes to whiteboard)

sources for metadata merge

-user

-options

jenit: first route is that we extract out metadata from the CSV format

… for simple csv it is just col headings but other formats could offer richer supply of metadata

e.g. that human rights format

gregg: e.g. comment lines being skipped

jenit: we don't say anything normative about that

after extraction, … HTTP/S link headers

jenit: when the CSV file is got, it may have a link header on it which says where to find more metadata

then default location for the file metadata, hacking the url, ...

then the one per directory
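[For concreteness, the Link-header route might look like the following on the HTTP response for the CSV file; the relation shown is illustrative, and the exact rel value and default file names are whatever the access doc specifies:

  Link: <tree-ops-metadata.json>; rel="describedby"
]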

gregg: we don't say anything much about query params

jenit: I think we say something [we should check]
... i.e. there are all these different places to define metadata

conceptually also a default set

jtandy: e.g. col1 etc?

(discussion of 'default' not being written)

ivan: e.g. default name is in some sense

jenit: all of these files get merged together to create the master metadata

… and it is this that informs that process

turning the data that is in csv into json/rdf ....

danbri: what if you had e.g. rdfa in html page

jenit: consider it 'embedded' rather than 'user'

see the locating-additional-metadata section

https://github.com/w3c/csvw/issues/42

"Locating additional metadata when originally starting from a metadata document #42"

jumbrich: example of a metadata directory

phila: if someone makes 3rd party metadata, … they may have reason to consider the original metadata inadequate

jumbrich: e.g. published externally using foreign keys

jenit: this issue #42 is about starting from the metadata file

… if you have found metadata files on the Web, …

… for somebody processing that, is equiv to having the user defined metadata as the 1st set you have

but what the processors need to do is run through all the resources described in that file, for each go off, headers/links etc

your stuff will override whatever it finds there

ivan: that step is described in process description

discussed pre-lunch

i.e. #193

jumbrich: the ordering is not strict?

ivan: that's what it means

has highest priority

ivan: it is powerful, you can shoot yourself in the foot

jtandy: 3rd party vs user ...

…. might want to have both

[general nodding]

jenit: rather, let's say that the party running it gets to choose applicable metadata

phila: maybe focus on user/3rd party and don't worry about the larger merge chain [fair paraphrase?]

ivan: what is the priority of third-party over the others?

gregg: if you're starting with the metadata, … i might have user metadata that i apply first

… then reaches out to each file, …

jeni draws some example commandlines:

convert example.csv

convert example.csv —options my.json

convert example.csv —options http://example.com/metadata.json

no metadata vs local vs 3rd party, …

convert —options: http://example.com/metadata.json

convert —options: local.json

gregg: consider 1st 3 example commandline

process is - you take the user metadata, ...

jenit: you might have lots of user metadata, not just two, ...

gregg: 0 or 1 …

and an input file, a csv or a json ..

[...]

danbri: if commandline only mentions one table, but discovered metadata talks about other tables, … which ones matter?

ivan: I think conceptually I think we have 1 user metadata, even if it came from 52 other things originally, ...
... for testing, if we start with user/3rd party, we're fine as it can be impl specific

jtandy: see UC 22, making sense of other people's data.

resolved: "conceptually there is only one user-supplied metadata file, which implementations might generate from merging multiple metadata files, some of which may be provided by third parties.https://github.com/w3c/csvw/issues/193 covers what happens there." (see github for editor action)

https://github.com/w3c/csvw/issues/154 Security considerations of modifying path for metadata#154

ivan: I don't see metadata hijacking as a security issue

<JeniT> http://w3c.github.io/csvw/metadata/#security-considerations

danbri: in a google context, it's likely we would have other routes to compose a working metadata set

ivan: do we need this whole metadata merge?

jenit: it has been discussed at length already, do you want to re-open?

ivan: we know it is horrifyingly complex

… everything related to merge is currently by far the most complex part of spec and of implementations

jenit: ok, re-opening.

jtandy: but you will need to merge somehow

ivan: for embedded metadata, sure

jenit: let's say you have created a fairly complex set of tabular data conventions

e.g. the linked csv stuff i did, e.g. multiple header rows, … eg. with equiv complexity

jtandy: you have one merge, embedded to external

gregg: my original take, … rather than taking all of the metadata files, … you take the first

jtandy: let's see if that leaves us wanting

… if you have user metadata, you wouldn't go any further

jtandy: found user and embedded?

jeni: given that you have those three classes, … do we merge them or not?

if you have user, do you ignore others including embedded

jeni: i think you have to merge them, and therefore need a merge algo, …

ivan: i understand that view and you are right, … so we have to swallow merging, … can we do something similar to what we did with the dialect.

i.e. that we are much more restrictive in what and how we merge

we get into horribly complex merging of title

where there can be language tags etc

could they be somehow atomic, like dialect?

closing #154 discussion

…jeni leaves notes in github

#154 closed.

jtandy: example of multi-lingual metadata

complement rather than replacement, with english plus french titles

jumbrich: we defined here a kind of order, … so when we do the metadata parsing, … when do we stop?

jeni: … what ivan was saying about having a more atomic merge algo

… would it be useful to walk through the merge algo that we have? yes

ivan: q is whether we can simplify it

<JeniT> http://w3c.github.io/csvw/metadata/#merging-metadata

gregg: [missed], … context language in one versus the other

(discussing 'normalise')

jtandy: so for each file, you make all URIs absolute using base, and all langs are stated explicitly based on context

… normalize each file first
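[A rough sketch of the normalisation being described, with made-up values and assuming the metadata file lives at http://example.org/: a link property such as "url": "tree-ops.csv" becomes "url": "http://example.org/tree-ops.csv", and a common property such as "dc:title": "Tree Operations" under an "@language": "en" context becomes "dc:title": {"@value": "Tree Operations", "@language": "en"}, so that merging no longer depends on each file's base URI or default language.]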

jenit: I don't think you can do that with uriTemplate properties

gregg: depends on your interpretation of uri templates

else you couldn't have a value that was a pname

ivan: algo in doc is incomplete, or there is more to it...

e.g. array properties

which vary in their merging depending on what the property is

gregg: we could clean up link properties first

e.g. multiple values

only case is DC-something

jenit: what were originally link properties are now Common Properties

jeni: agree that we can say link properties are only single URLs

<scribe> -> new issue on merge algo

gregg: has result of simplifying merge algo

jtandy: what this says is that you can't have 3 versions listed

jenit: you can do what you like with dc:hasVersion etc

gregg: if you want it fully understood as an url, you'll need to use json-ld @id notation

jtandy: later!

ivan: looking in algo, 2nd 3rd 4th steps are all exactly same, overrides

… helpful editorially, if we said these are special cases

(discussion of notes as object property)

ivan: which are the properties for which an atomic behaviour is not ok?

jenit: for cols, array of cols, you want to go into those two lists in your metadata, ...

for the title example that jtandy offered, e.g. extra titles or datatyping info

ivan: can we make it more palatable by saying which has priority?

jumbrich: for cols...?

gregg: col ref is similar

ivan: each col description has to be merged separately

jenit: col refs used for linking between files

as keys

gregg: could say it is always in form of an array

jenit: trying to make it not just simpler but intuitively right

there are actual difficulties that make it complex e.g. i18n, these are simply complicated topics, ...

but maybe there are other things we can address

gregg: object properties currently are tableSchema

can basically refer to a meta file

ivan: do we really need that?

jenit: yes. in the uk we have multiple local authorities that are publishing locations of toilets using the same schema for their csv files

and we want them to be able to reference exactly the same schema file

e.g. local auth 1 publishes theirs, local auth 2 publishes theirs, published by a central auth, …

gregg: put into normalization

<JeniT> https://github.com/w3c/csvw/issues/199

resolved: #199 "During metadata normalisation (prior to merge), object properties that are URLs (rather than objects) get normalised into the objects (JSON files) that they reference."
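[Illustrative example of that resolution, with a made-up URL: a table description containing

  "tableSchema": "http://example.org/toilets-schema.json"

would, during normalisation, have the referenced JSON file fetched and inlined, e.g.

  "tableSchema": { "columns": [ { "name": "location", "title": "Location" } ] }

so that the merge itself only ever sees objects.]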

(discussion of the complexity of language normalization)

jenit: we turn into a normalized form…

ivan: we had a kind of disagreement, diffs in opinion w/ gregg, … for me the natural language property when it is normalized, an array of individual objects conceptually

… an array

… if i view it as an array, merging becomes meaningless

if the string doesn't have a lang tag it will be undefined

jenit: why is it important to remove those undefined langs?

gregg: if you have a title established in a metadata file with a language, … and embedded without, … idea is to eliminate, ...

[aside: did we decide not to use col metadata]

gregg: we wanted to avoid seeing both title(en) and title (no lang)

jenit: I don't think it is worth the extra complexity

gregg: so we can use the regular merge algo

(detailed merge algo discussion which i'm not capturing 100%)

<phila> scribe: phila

gkellogg: Moves to the flipchart to give an example
... metadata file with @lang en and a title of 'my title'
... then we have another file with a title of 'my title' but with no lang
... result would be an object called en with array whose value is 'my title'
... and an array called und with a title of 'my title'

<danbri> gkellogg: way algo is, if values were arrays they'd be concat'd

gkellogg: […] unless they had a name as well, cols wouldn't match as they don't share a title
... ivan's thought process about this, is that you don't have an object with lang tags, …

ivan: essentially it is an _array_ of language tag strings. How I represent that internally is besides the point.

gkellogg: spec builds on json-ld, so using language maps for titles
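[The flipchart example written out: merging a title of 'my title' tagged en with an untagged 'my title' gives, as a language map,

  "title": { "en": [ "my title" ], "und": [ "my title" ] }
]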

ivan: there is an editorial action to radically simplify the merge algo

… to look at explicit cases that require special action

… otherwise fall back on atomic behaviour

e.g. one more special array property

that we can put into the normalization

which is the resources

two metadata, one with resources, one without, ...

but every metadata has resources even if it has only 1 element

gkellogg: we can spec that as table meta vs table group meta
...

ivan: merge of resources, … array property so has a special interpretation

in http://w3c.github.io/csvw/metadata/#merging-metadata

discussion of simplifying templates section

ivan: same q as with dialect, … does it really make sense for merge to go down into constituents?

jenit: alt is arrays are simply appended to each other

… in which case there may be duplication

… such that you might have template in the same language, e.g. js, to create the same format, e.g. ical, … and referencing the same url

ivan: is that a problem?

jtandy: only reason you want a template mech is that you want more control

… if you supply a templ in the user metadata, that is the one that you will want to use

ivan: whereas jeni says she would concat the arrays

jenit: in the case where i am trying to … with some nice client side CSV viewer, … in my browser, … and enable people to export out of the viewer into formats

JSON, RDF, … the standard ones; plus other options e.g. if i can process .js templates, ...

… i'll define all of those extension mappings

e.g. exportable as ical, schemadotorg, whole bunch of others, ...

jtandy: two illustrative things

… template is an array property

if you come across a template array property anywhere

you use array from a particular metadata file

jenit: in my example e.g. directory specifies ical, ...

whereas might want e.g. schema

ivan: why would you do that?

jtandy: assuming would base other targets on json output

http://w3c.github.io/csvw/metadata/#merging-metadata

gkellogg: treat same as resources

<gkellogg> When merging templateFormat, use the same procedure as for resources.

ivan: general idea for merging arrays of elements, de-duping where same url found

jenit: for some of them

3.8 … columns

gkellogg: complicated by arrays

…. you can't ever specify them out of order

jenit: if something goes wrong it gets really messy

… validation is essential, …

what needs to be there.

jenit: you can match them on index

have a … err, …

ivan: eg if titles don't intersect

jenit: maybe, but bit concerned about that

gkellogg: …rather than merge if names same or titles intersect

ivan: instead merge if titles intersect

(working an example...)

jtandy: reading this [example] you might want to raise a warning if titles are different, but an error if the names are different

ivan: if names are specified

gkellogg: that's when i used value property

it gets interpolated when you access it

(impl details)

JeniT, will you post your example into relevant github bug? (or gist.github.com maybe)

(discussing name vs title example details, …)

gkellogg: the bits about ordering don't always mean anything in rdf

ivan: the 1st one in the order gives me the generated name

JeniT: summarizing: simplify as much as possible. Summary in github, … new issue:

https://github.com/w3c/csvw/issues/200 - simplify merge

adjourned until 2.45pm.

https://www.flickr.com/photos/danbri/5893173/

dinner proposal via phila, about http://www.zizzi.co.uk/venue/index/victoria

map, https://www.google.co.uk/maps/dir/123+Buckingham+Palace+Road,+London/Zizzi,+Unit+15,+Cardinal+Place+Shopping+Centre,+Cardinal+Walk,+London+SW1E+5JE,+United+Kingdom/@51.4953152,-0.1463795,17z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x4876051f4d2d1d43:0xf2b4bf4125d47649!2m2!1d-0.1467075!2d51.4930848!1m5!1m1!1s0x487604df6bad1693:0x17917ba97b216254!2m2!1d-0.140847!2d51.497454!3e2?hl=en

<phila> It's a chain of pizza places. Reasonable prices and non-pizza options for those who wish

"We're sorry, there's no availability for your selected time, but we do have availability at these times: 5pm"

Files vs. Model

https://github.com/w3c/csvw/issues/50 Mismatch between CSV files and tables #50

jenit: … row one does not contain headers

it is first row of the data

and col 1 has the annotation attached to it, name or title etc., ...

you do not have header row in the model, rather it is metadata about the cols

whereas in the CSV file, even in the simple straightfwd csv case, 1st row is usually a header row

and numbering goes from there

see 7111 fragid spec

they/we don't make any distinction between header rows and other rows, because you can't really; so 1st line is always 1, etc.

… which gets even more complex w/ bad csv files, where 1st line is some kind of comment

then two header rows

… and all of the data is indented by something etc etc

and actual data is inset

so we then get the issue in which we have a mismatch within the model vs the original data

let's look at the specifics-

<JeniT> https://github.com/w3c/csvw/issues/32

https://www.w3.org/2013/csvw/wiki/F2F_Agenda_2015-02#14:45_-_16:15_File_vs_Model

"Where does row numbering begin? #32"

jenit: row numbering may be out of sync w/ line numbers from original csv

gregg: which makes it pretty much impossible to correlate fragid with metadata about cells

ivan: a shame...

gregg: we could change our model

jtandy: you could get there by getting a dc:source statement or similar

gregg: what's point of a row number, if it does not relate back to its source
... in which case it could be same as ...

first row we output could be row=2, etc

except that we have defined it differently

jenit: what if we in annotated datamodel had source property in model, …

… not in the conversion as before but in the actual model

… source is a ref to physical source of that document

optional property

generated by impl on parsing

when it is a csv

jenit: we have all these properties on rows/cols/cells in table

… name title etc

… we could have source property which has url, link back to place...

ivan: RFC 7111 sense?

jenit: if it is right kind of csv yes

jtandy: rows and also cols?

… would give you provenance internally

ivan: what would an annotation, Web Annotation, use, exactly?

they would use then RFC-7111 into csv file?

https://tools.ietf.org/html/rfc7111
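[For reference, RFC 7111 fragment identifiers on a text/csv resource look like this (file name illustrative):

  http://example.org/tree-ops.csv#row=3
  http://example.org/tree-ops.csv#col=2
  http://example.org/tree-ops.csv#cell=3,2
  http://example.org/tree-ops.csv#row=4-7
]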

<JeniT> danbri: if you’re identifying by URL, is your assumption that the CSV file is unchanging

jtandy: are they annotated…

[...]

jenit: if they use 7111 they must use that frag model for csv

ivan: what i am afraid of, is that … is there a need to annotate the data in the annotated data model
... do we need to have URIs that id a cell row or col in the annotated data?

jenit: that is #24

https://github.com/w3c/csvw/issues/24 "There should be a "Fragment Model" section in the Data Model document #24"

use cases?

jtandy: typically your reification, … people can only annotate what they have in front of them

… you might want to annotate a col by name

ivan: you're beginning to say we need a fragid spec per that issue

jenit: [reviewing rfc7111 on screen]

ivan: another q on similar lines, ...

… we already have the concept of a row number

… we generate that data into the rdf file

… conceptually that is the row number in the data model

jenit: issue is scheduled, #32.

<Zakim> jtandy, you wanted to ask what scheme would be used to reference part of the abstract data (e.g. for annotations)

jtandy: Given that we were talking about the alternative, the only thing they can do is use the fragids from 7111 on the csv file

… we haven't done anything else w.r.t. abstract data model

… unless we invent a new scheme

jtandy: the way w3c anno wg intend to structure the annotations, … … they'll have some kind of reference, typically a URL

ivan: formally speaking that is rdf, so that is an rdf object

so could be a blank node

jenit: so an option would be that we specify a mechanism, non-normative type of note, ...

… if in abstract data model needs to be a structured thing, url that matches resource url, row being row no. in the model, … col being the col no in the model, ...

gregg: there is no way to know the col as we do not output it

ivan: a diff thing, maybe we do, we will, ....

jtandy: often you will want to annotate a range/block of cells

danbri: could we say it'd be the frag we'd use if we did serialize out the model?

jenit: … without the header row

jtandy: we're talking about frags within … so you have to say the abstract data diff to the file, ...

…but then we refuse to id the abstract data

(urls, urns, etc )

jenit: alternatives:

… one is to do this in the abstract data model, an entity with a few properties, maybe it has a uri or not

… other is to do something around source annotation; target would reference the source annotation (tying back to the original file)

… which would/could/should be valid 7111

jtandy: if we id a primary key id ...

… key changes per rows

and col id'd based on name in metadata

that completely removes reliance on numbers

if you don't have an identifying primary key you can't use anno

gregg: some places i've used id for row as resolved aboutUrl of 1st cell ...

… could differ

might be one on row, another on cell, … as they are common properties

jtandy: we could use abouturl on schema

gregg: on table, schema. ...

ivan: or maybe don't have one

jenit: i don't think that this works

ivan: we are getting into additional reqs

you need a primary key, which is not necessarily there

i think jeni's 1st option works, we can even define normatively

… and that's it

which answers Qs

row and col info are in the data model

jenit: i want to argue for the 2nd approach there

ivan: one is not better than the other

up to the annotator to decide where the annotation goes

they might want either

jtandy: if i am going to write some metadata, … talking about a block of cells

the only thing i can do in my editors, is figure out row 128-430, cols 4 5 and 6

… my parsing application is nice and clever,

can deref the block of bits i am talking about

e.g. converted into a viz

e.g. view the csv in a nice way

then it knows that the annotation is applied to that group of cells

if you wish to output the anno, the end target, … up to the impl to provide consistency

e.g. in rdf conversion

simple csv file

4 countries listed

andorra, afghanistan, angola, albania

if you used bnodes to id each row

the annotation can use those too

for the rows

jtandy: figuring out consistency between output from conv process … leave to impl

ivan: not sure what you're getting at

jtandy: [takes to the whiteboard]

e.g. row 3-4, cols 3-4

when/if we output to rdf, do we want to include annotations?

gregg: in general it is diff

propertyurl template could be diff each row

we don't tag each cell unless use named graphs everywhere


ivan: I don't put annotation on graph, on a group of triples, … but on the data
... the subject should be something like jeni's example

(collective editing of a simple example)

<phila> scribe: phila

<scribe> scribe: danbri

(discussion of generated rdf with bnode IDs, …)

gregg: getting to phd thesis territory

ivan: use cases for this annotation work is mostly interaction / viewers

when i use a viewer i have lost contact w/ orig csv

viewer gives a nicer ui

so i must have a hook into the annotated data

davide: why must i number rows?

ivan: other info could be missing
... my proposal: doc should define this, … one specific area within the annotated data model, if you need it

[...]

jtandy: moving from final to abstract data model, and using 7111 data model is enough

… what we haven't agreed is how to spit that out in terms of json/rdf serialization

suggest re 7111 is enough to allow apps to viz the csv

ivan: so impls should parse the 7111 and transform on the fly

jenit: even keeping it as-is in output is no worse

jtandy: things get worse when you have comment lines stuffed into csv

jenit: back to my earlier suggestion to have a source row, source col, …

[danbri: source sha1 etc?]

… so that when you use an rfc-7111 on the orig csv you can map that with what's needed

ivan: if I am an author of a metadata file

and i annotate into the metadata file

error possibilities, if i use only this i.e. url frag

eg. if i have skip columns

as maybe diff cols

incredibly error prone

jenit: as an author it is much easier for me to know that it is 52 … on the orig

jtandy: someone in the imported metadata could have a skip cols which would screw things up
... you may have commented out rows

gregg: these are only processed in skipped rows

ivan: forget about everything and just use 7111 full stop

jenit: explicitly have that link back in the annotated data model

which is source row, source col, ...

ivan: fine, but what do i do with it?

jenit: in your impl when you have a Row object, it has a source property as well as a rownumber

… and cols, as well as Col having a number, … .

(debate about whether Ivan needs a column object)

(jtandy takes to whiteboard)

jtandy: my row might say example.csv row=3

as it has a header


jenit: we could add in _sourceRow

… dc:source is fine for output and conceptually

but each also has a number and a source number

ivan: we might want both of them in output

gregg: same for cols?

Resolved: #32 "Resolved at Feb F2F: in the annotated data model, rows have both a number (the number in the data model) and a source-number (the original line number from the source CSV file). These may be different. _row in URI templates references the row number in the model, and we agree to introduce a _sourcerow (or something) to be used to reference the source number (line from the CSV file)." (model doc + metadata, but not v much re mappings)
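[Illustrative only, since the exact property name is left open by the resolution: with a single header row, the first data row has row number 1 in the model but source number 2 in the file, so an aboutUrl template of

  "aboutUrl": "http://example.org/tree-ops.csv#row={_sourcerow}"

would point back at the original line, whereas one using {_row} would number from the model.]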

https://github.com/w3c/csvw/issues/68 "Simplify the definition of the tabular data model? #68"

Jeni closing this as resolved thru resolutions on #32 and #24.

resolved https://github.com/w3c/csvw/issues/24 "we aren't going to specify a fragment model for our abstract data model. However, given annotations on source number properties that we are preserving, it is possible for RFC7111 to be used."

danbri: what does source row mean for non-IETF-CSV formats e.g. html tables?

ivan: those values are dialect dependent

(reminded of http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 )

https://github.com/w3c/csvw/issues/9 Locating data region(s) within a CSV file #9

jenit: closing as a meta issue.

ivan: we'll have to answer @edsu.

jeni: that is handled in another issue

jtandy: in a metadata desc in an @type: Table, we have used 'url' property to point to src csv file

if I use @id

i'm identifying the table, e.g. could be used in rdf as subject of table, ...

… it is not the identity of the description about the table

jenit: that discussion is coming up

we have 15m

skipping to #71

https://github.com/w3c/csvw/issues/71

"Exact handling of annotations #71"

jenit: we have notes (see github issue for examples). how do they get mapped into JSON and into RDF?

… can re-use ABC example

ivan: depends on how much JSON-LD-like stuff I want to do

in json do i have to reproduce a proper rdf every time I see this stuff?

jenit: issue here … notes property … defined as an array of objects …

… fine in the json mapping as we can output it exactly as-is

but for rdf it is more complex

we could attempt to interpret it as json-ld

which would require json-ld for many impls

jtandy: in case of json, would include target being spec'd via fragIDs
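[A purely illustrative notes value of the kind being discussed, loosely shaped on the open annotation model with an RFC 7111 target; none of the property names here are agreed:

  "notes": [ {
    "type": "Annotation",
    "target": "tree-ops.csv#cell=2,3",
    "body": "this value looks suspect"
  } ]
]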

ivan: see open anno model, it can get complex

… i realise this is the same issue as something that will come up later

jenit: ok to close as same as #142

(general agreement)

ivan: do we want a json-ld interpreter as part of tooling? my answer: no

gregg: therefore we should restrict things accordingly to simple literals and URIs

Adjourned until 4.15pm.

phila, can you describe dinner booking?

resuming

Metadata / Annotation handling in conversion

https://www.w3.org/2013/csvw/wiki/F2F_Agenda_2015-02#16:30_-_18:00_Metadata_.2F_Annotation_handling_in_conversion

https://github.com/w3c/csvw/issues/10 is closed already.

https://github.com/w3c/csvw/issues/142 "Value space of Common Properties #142"

jenit: all about mapping into RDF

all: hurrah!

jenit: we have ability in the metadata to have any kinds of properties at tables, rows, all levels; Common Properties.

re-used, dublin core, schema.org, whatever you like wherever you like.

(i.e. good anarchy)

jenit: for json mappings this is fine you can just copy them over directly

… for rdf conversion, you want to interpret the values of properties in some way

… there is a spec that gives us interpretation of random JSON properties i.e. JSON-LD

… however many of us think that is too heavy a burden here

… hence q what do we do with these properties? also notes issue before break is the same.

gregg: point of clarification - early examples simply used string values assuming proper interpretation

we only have ns prefix, we don't know that dc:source is an IRI

we can expand dc to its vocab URI

(but no @context)

jtandy: re common properties being done verbatim

jenit: we have had a separate issue for what to do with expansion of urls

json devs don't care

gregg's point was that expansion is w.r.t. rdf conversion

ivan: to be precise, all the prefixes in the RDFa Initial Context are accepted

danbri; which version of it?

ivan: any impl is expected to track its current state

gregg: some Qs about how to govern that

jenit: what we've been discussing in #142 is whether there is a happy medium whereby when properties have a certain shape, syntax, … then they are interpreted in a way that is consistent with json-ld

however if they have a completely random shape, …

… […missed]

ivan: no? for common properties, the only thing we say is that the values are of the restricted shape

gregg: I provided a minimal spec at the very top [of the issue]
... if the value was an object, there might be a limited shape for the contents of that object that implementations would be expected to understand.

jenit: your initial set gives a shortlist of forms/shapes

gregg: json-ld tells you what you can do

… danger is that it is complex

… and there is possibility of specifying something that may not be handled in a way consistent w/ json-ld

gregg: I believe any impl should take these values and deal with them correctly

essentially literals and IRIs

the more complex one is object values
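(roughly the shapes under discussion, with made-up values:

"dc:title": "Country codes"
"dc:license": { "@id": "http://opendatacommons.org/licenses/pddl/" }
"dc:description": { "@value": "Liste des codes pays", "@language": "fr" }

i.e. plain string literals, IRIs given as @id objects, and @value objects with an optional @language or @type)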

jenit: the proposal in #142 ...

ivan: if you have an object shape fitting gregg's examples, that's ok

if you have a substructure which recursively matches these, maybe accept that too? I would prefer not to

… this is feature creep

gregg: this is basically a subset of json-ld

null doesn't generate anything in the jsonld

[...]

jenit: if you never recurse into objects, ...

… what happens to them when they are given as values

e.g. publisher

see also view-source:http://www.ons.gov.uk/ons/index.html

which embeds

<meta name="DC.Source" content="Office for National Statistics" />

recursion: see recursion.

(discussion of whether allowing nested substructure is a slippery slope)

<Zakim> danbri, you wanted to argue mildly for (rdfa) linked data

<JeniT> danbri: if you don’t allow recursion then schema creators have to create publisher_name, publisher_url properties etc

danbri: [sat on the fence]

jtandy: … we have in other areas tried to avoid monstrous complexity

… if what I'm hearing is that it isn't complex, … to have a set number of patterns, … then they'll expect the output RDF to have whatever was in the source metadata

ivan: I won't block on this even if I dislike it

jumbrich: CSV is very prominent on the Web, …

… often published by non-technical people

… consumers might be more into JSON

… allowing arbitrary JSON should please them

gregg: seems odd to have the JSON be more informative than the RDF view

ivan: I'm worried about the outcome

jumbrich: optimal would be json-ld?

ivan: in theory

jenit: it should be the case that if you already have a json-ld processor, you would get the same RDF out

gregg: with some syntactic restrictions

… lists

(and nil?)

gregg: lists are challenging

jenit: what's the default …?

gregg: if you have an array, they are multiple occurrences of the same property

otherwise { "@list": … } to make an RDF list, which RDF people know is painful to process.
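(for example, with a made-up property:

"dc:subject": ["codes", "countries"]

maps to:

_:table dc:subject "codes", "countries" .

whereas a { "@list": ["codes", "countries"] } value would call for an rdf:first/rdf:rest collection)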

jenit: an option ...

an error is raised if you are converting to RDF and you detect @list or @context

as part of the validation of the metadata file

but only applied if you do the conversion

@type

(conclusions being noted in #142)

ivan: would json-ld language maps be allowed here, for example?

jenit: if you found "dc:description": { "en": "blah blah", "fr": "le blah" …}

… would be processed in the normal way by our recursive algorithm; you'd get the table, dc:description and just a bnode

maps to:

_:table dc:description _:blank .

ivan: and not also _:blank en "blah blah" or something?

gregg: no, as you wouldn't recognise those as properties

jenit: would need

"dc:description": [ { "@value": "blah blah", "@language": "en" }, { …

ivan: Creeping JSON-LD-ism!

… what do I do if there is something there, an @thing that is invalid JSON-LD?

… how do we define that?

see #142 for resolutions

(discussion of whether to warn on @context)

jenit: [proposed] the spec says common properties are interpreted as JSON-LD, but implementations may choose to implement just the subset of JSON-LD defined here. … …

jumbrich: so would be a subset of json-ld, minimal

gregg: it's essentially this

(adding list of things that may be ignored and generate warnings - see #142)

jtandy: are prefixes expanded?

if they are in the predefined list or are explicitly declared

gregg: otherwise it will look like a weird URL scheme

ivan: this is true for all properties (not just @type)

all the keys

gregg: if your value for 'publisher' was in prefixed form it would just be the string

but if @type it would be expanded
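(sketch of the distinction, with made-up values:

"dc:creator": "foaf:Agent"
the value stays the literal string "foaf:Agent"

"@type": "schema:Dataset"
the @type value is expanded to http://schema.org/Dataset

only keys and @type values get prefix expansion, not ordinary string values)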

ivan: you're making a cut-down JSON-LD

jenit: yes

gregg: [something clever and confusing]

(jeni works on example)

gregg: if you do not have language or type, the result is a string without language

even if defined in context

#142 has extensive json-ld subsetting comments.

<phila> ACTION: phil to raise issue of JSON-LD subsets at the Data Activity Coordination Group [recorded in http://www.w3.org/2015/02/12-csvw-minutes.html#action01]

<trackbot> Created ACTION-63 - Raise issue of json-ld subsets at the data activity coordination group [on Phil Archer - due 2015-02-19].

see comment beginning "Discussed at Feb F2F. Finally persuaded https://github.com/iherman (sort of, grudgingly) that recursion into objects would be useful." for details.

back to https://www.w3.org/2013/csvw/wiki/F2F_Agenda_2015-02#16:30_-_18:00_Metadata_.2F_Annotation_handling_in_conversion

https://github.com/w3c/csvw/issues/97 What should the name of the property be that relates rows to the table? #97

gregg: this is just csvw:row
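(i.e. in the RDF output, with made-up node identifiers, something like:

_:table csvw:row _:row1 . )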

<scribe> closed #97

https://github.com/w3c/csvw/issues/189 Should we remove column metadata in RDF mapping #189

gregg: what happens if there are common properties defined?

ivan: tough luck

jtandy: when I wrote the RDF conversion spec, I said I'd ignore common properties on columns and the schema

didn't need those in my data output

the only thing I picked up in terms of columns

… I tried to grab the titles used for columns

and apply those to the predicates

I thought that this was helpful

but it does not have to be there.

gregg: in fact it is complex

if there is a different property for each row

jtandy: yes re complex

ivan: we get to this whole issue of what's the subject that metadata applies to

i.e. some rdf graph

so it gets hairy

gregg: what I did in my impl

… I do the common props of the table. If the schema or its descendants have common props, I [do some clever stuff]

jeni: is the issue that there is some distinction between creating rdf that defines the tabular data model, … reified, … vs one that extracts the data

danbri: some (e.g. google) will only want the bits about the real world entities (toilets, shops, etc.)

ivan: if I have the metadata, … dc:description column … "this is the metadata file I wrote yesterday...

… that will go into the RDF output

jeni: no...

… the metadata that is in the metadata file, … in the object that describes the table, ...

ivan: the metadata is also json-ld … so suddenly i get it in rdf

[looking at csv2rdf]

e.g. 7

(discussion of history of this issue)

jeni/jtandy: we agree we are making assertions about the table

… title/keywords/etc

(aside: there is no such thing as dc:keywords)

@id vs url ...

jenit: does @id default to the metadata file

gregg: it defaults to a bnode

vs assertions about URI being ""

(flipflopping around table descriptions vs tables, …)

jenit: can we close out on what we're intended to do?

we talked about the 2 levels of description

the annotated / reified level

… common properties etc

…vs how that is different from the entity descriptions generated

jtandy: when we want to describe the thing that the row describes

jenit: is it useful to have all this meta-meta stuff?

or just want the things that look like trees, etc.

jtandy: if we're not bothered about the fact that this stuff came from csv file, … then all we need is some kind of membership relation

ivan: if this is our understanding, how can i annotate the metadata itself

gregg: you could put an annotation in the table schema

ivan: whatever we put under "Table" is about the data

vs metadata

jenit: e.g. lastModified, … author, etc

jtandy: you need to make statements about the whole resource

(mention of named graphs? vs not)

ivan: we do have a hole

jenit: as long as we are clear on what we're describing i am not too bothered

gregg: we can also have common properties in table group, schema, column

jtandy: we don't have statements about columns, etc. in the output, but we do have them about the dataset and table and potentially the table group

jenit: we can say it uses this schema which might mention tableSchema etc

revisiting #189 … should we remove column metadata in the RDF mapping?

jtandy: I am not unhappy with this

jenit: the proposed resolution is that there will not be any descriptions of columns, schemas, etc. in the RDF output.

(see https://github.com/w3c/csvw/issues/189 for specifics)

https://github.com/w3c/csvw/issues/106 Should the table entity in the RDF mapping of core tabular data be explicitly identified? #106

jenit records resolution in github

Summary of Action Items

[NEW] ACTION: phil to raise issue of JSON-LD subsets at the Data Activity Coordination Group [recorded in http://www.w3.org/2015/02/12-csvw-minutes.html#action01]
 
[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.140 (CVS log)
$Date: 2015/02/13 06:27:38 $