W3C

CSV on the Web WG, F2F meeting @ TPAC, 2014-10-27

27 Oct 2014

Agenda

See also: IRC log

Attendees

Present
Dan Brickley (danbri), Jeni Tennison (JeniT), Bill Ingram (bill-ingram), Phil Archer (PhilA), Jeremy Tandy (jtandy), Chunming Hu (chunming), Eric Stephan (ericstephan), Ivan Herman (ivan), Axel Polleres (AxelPolleres), Erik Mannens (ErikMannens), Gregg Kellogg (GreggKellogg)
Guests
Hadley Beeman (hadleybeeman), Eric Prud'hommeaux (ericp), Carlos Laufer, Ben De Meester (bjdmeest), Richard Ishida, Dave Lewis, Rob Sanderson, Frederick Hirsch, Addison Phillips
Chair
danbri
Scribe
hadleybeeman, AxelPolleres, bjdmeest, phila, ericp, ericstephan

<danbri> https://docs.google.com/presentation/d/1PYx7PmaB4Ouyf_uHJZwE331Cg0R9aGPspjx6y1Z-GNg/edit?usp=sharing

danbri: [introduces the agenda]

intros

danbri: works for Google, love/hate relationship with RDF. Interested in new ways of getting data into the search engine.

jenit: at the Open Data Institute, who are interested in helping people publish/consume open data. Wants to get more consistent CSVs on the Web, for users and publishers to express all the fiddly little context bits that are necessary for reusers to understand.

bill-ingram: At the University of Illinois Urbana-Champaign. Interested in research data in the repository space, planning one now.

<danbri> hadley beeman: one of 4 co-chairs of data on web best practices wg. Day job tech advisor to govt cto in uk. Removing barriers to data re-use and publication, become more intuitive, part of everyday life, identify bottlenecks in system.

jtandy: From the UK Met Office (the national weather service and research institute). We produce tonnes of CSV data. Interested in crossing domain boundaries. I want to take CSV and annotate it in a way that it can be combined with other data. Unanticipated reuse.

laufer: Work at Web Engineering Laboratory at the Catholic University of Rio de Janeiro. Also participates in the Data on the Web BP group. Interested in lots of kinds of data.

Chunming_hu: W3C team from China, Chinese host of W3C. Research on data storage and parallel data storage. Work with lots of companies who want to know more about this kind of work, semantics and CSV.

ericstephan: Works at a lab for the US Dept of Energy (Pacific Northwest Lab). Scientists are using .xls* and CSV data. They've looked at mixing data from domains beyond original intentions for the data. Data has taken on a life of its own. I'm hands-on and real-world problem-driven in focus.

ivan: I am the staff contact for this group. I've been working on various forms of data on the web for 7 or 8 years; used to lead the Semantic Web activity. The transition to CSV was a natural one.

Eric Prud'hommeaux: I'm w3c staff, mostly working in clinical informatics and bio informatics. Worked with Sage who were trying to get their data in a more useful form, but after a while they were still using CSVs.

phila: I'm W3C staff, am a member of the group and observing. For me it's about making sure the Web is a data platform, not just a platform for exchanging other files.

axelpolleres: I'm from Vienna University of Economics and Business, and from the RDF linked data side. A year ago, we started to talk in Austria about how to publish data. We were quite surprised at how much needs to be done.

Hitoshi: I'm gathering information about W3C activities and how working groups go on and what they're focusing on. I don't have an interest in CSV, but I want to know how CSV will be used on the web.

<AxelPolleres> we talk mainly with Open Data portal providers there, such as the federal chancellery, or the Cooperation OGD Austria.

ErikMannens: AC rep for iMinds. I have a team of researchers at Ghent University working on data analytics. We are working on open data publishing. Working on RML.

BJDmeest: I'm here for the Digital Publishing and Web Annotation WGs. Interested in the semantics of data in general.

charter http://www.w3.org/2013/05/lcsv-charter.html

ivan: Finishing by the end of August 2015 is, in my view, impossible.
... we will have to ask for a charter extension and hope that PhilA will be kind enough to help

danbri: This is our contract with the wider W3C community.
... The specifics for our documents come from the numbered list in the Scope section
... Re metadata vocabulary: Tables are fantastic places to put stuff, but there is nowhere to put any other info. How much can we dare to say in this group about what the entire planet can say about their tables?

jtandy: Many people publish many CSVs together, and we want to be able to describe the relationship between them. That fits here too.

jenit: Not just describing the file, but also going into what the table contains. What kind of data, which columns it has, what they contain.

danbri: that also fits with "standard mapping mechanisms transforming CSV to other formats".

Jenit: that's a stand-in for structure that most programming languages will consume
... the idea is that if you find a CSV file on the web, you want to be able to find out about it (metadata) or you may start with a metadata file which may point to a lot of CSV files

jtandy: it may be that the metadata and data are published independently of each other. Possibly by different publishers.

danbri: Use cases. We have lots of them

ericprudhommeaux: I assume use cases are linked to requirements. How easy is it for someone who has their own use case to discover that their requirements may be addressed?

<JeniT> use cases & requirements document: http://w3c.github.io/csvw/use-cases-and-requirements/

jtandy: the document makes more effort in describing the use case. We need to flesh out the requirements and make them clearer.
... But there is a formalised linkage between the two

<Zakim> phila, you wanted to talk about UCR

ericp: A measure of success may be that someone can bring in a use case, look at the requirements and see if theirs are included already

phila: The use case document for CSVW is useful for DWBP. That group (laufer) will pull use cases from this group's document for that group's use case doc.

laufer: You are talking about a file with metadata for other CSV files, and I've seen that you've proposed a file extension. We will have other metadata files, but I'm not sure a particular extension would be useful. A general way to link metadata files to data files may be better.

jeniT: we'll be discussing that later today. But it contains 4 mechanisms for finding metadata; appending a file suffix is one of the four.
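
(As a sketch of one such mechanism — illustrative, not final spec syntax — a publisher could point from a CSV file to its metadata with an HTTP Link header, with the file-suffix convention as a fallback when no header is present:

  GET /data.csv HTTP/1.1
  Host: example.org

  HTTP/1.1 200 OK
  Content-Type: text/csv
  Link: <data.csv-metadata.json>; rel="describedby"

Here data.csv-metadata.json is a hypothetical name following the append-a-suffix convention laufer asks about.)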

ivan: Looking at the Use Cases document, to the editors: is the document done?

jtandy: I think we have a good collection of use cases. There may be others to include. D3: data driven documents — we may want to look at it.
... As we reviewed use cases earlier this year, we saw that most requirements in them had already been covered. But the requirements do need more work. They are placeholders that allow us in the group to work on them.

ericstephan: I'm not sure if we've drawn out — if we found use cases that correlated well, we combined them. That was an internal, organic process.
... It might be useful to show something like characteristics? Not a requirement.

jenit: can you give an example?

ericstephan: In science efforts, there may be an approach (imaging formats, for instance) used in an entirely different discipline.
... Is it enough to put it in requirements, or is there another outreach mechanism that would help draw people in, so they can relate to a use case?

jtandy: As an example, we had to work out which use cases covered data transformation. Not a requirement, but something they have in common. Maybe a simple lookup table at the top?

danbri: Do you have everything you need to do that?

jtandy: the ones we have are sufficiently articulated to do that. We should give them the chance to comment though.

danbri: and in terms of having their actual CSV files?

jtandy: Sometimes. Some are behind corporate firewalls. Obviously only those use cases that talk about transformation can have target XML, RDF, JSON. But examples of those help.

ericstephan: It's like saying, "Here's something that illustrates this use case, and here are some sister or related datasets from something similar."
... So you could expand from datasets from the explicit use case.

jtandy: But given the limited resources of the group, we have to balance that idea against meeting the other deliverables. Let's try to work that out this week.

danbri: My feeling is that this document is in a good place. Better than many I've seen.

GreggKellogg: (introduces himself) I'm an IE in this group. I'm a consultant. I'm one of the editors of the JSON-LD spec. I've not participated a lot on calls due to time zone challenges.

danbri: re deliverables listed in the charter. UCR?

ivan: That's what I was checking. It's 80% done?

jtandy: yes

danbri: Metadata vocabulary for tabular data. Title has changed from charter, but intention is still same.
... Access methods for CSV Metadata

jenit: This is talking about syntax around CSV, and the issues there. We have something to resolve there: we aren't the group in charge of the syntax for CSV files. It's not in our charter. And yet it's the syntax that is one of the big sticking points for making this work.
... This document therefore has a non-normative section on syntax issues, which will feed into the IETF's work on this.

ivan: This is rec track?

danbri: Yes.

jtandy: What I found useful from this document: knowing what IS tabular data. We had a use case from the medical community that was line-oriented data, but not tabular.
... This is a useful document for helping determine what we do want to talk about. And what we don't.
... I'd suggest reading this before you get coffee at the break.

jeniT: we'll be going through this in depth later today.

ivan: The editor of the IETF document is a fairly active part of this group. He's not here now.

Frederick Hirsch and David Lewis: (introductions)

phila: Do we expect the IETF spec to be updated in response to this work?

jenit: yes

danbri: are we happy with the mappings of the names in the charter to what we've done?

ivan: The titles in a charter often change.

danbri: It's not unreasonable to write down the data model for CSV before you move on.

jtandy: I don't remember having a document for access methods for metadata

<JeniT> http://w3c.github.io/csvw/syntax/#locating-metadata

danbri: it's a section of the model

<JeniT> http://w3c.github.io/csvw/csv2json/

<JeniT> http://w3c.github.io/csvw/csv2rdf/

danbri: Mapping mechanisms is the last bit. We have Generating RDF from Tabular Data on the Web ...

ivan: and Generating JSON from Tabular Data on the Web

jtandy: and we anticipate having one for XML

ivan: Yes, but there has been no interest

jenit: does anyone want to do this?
... a good mapping to XML would include xsi:type attributes to indicate the value types, which would go beyond what JSON supports.
... You could envisage a mapping to XML that turns some things in to elements and some into attributes.

ivan: But we have to be careful: if we define a mapping to XML, and we want it to be a recommendation, we need implementations, test suites, etc. Not just a cut-and-paste job.

<Zakim> phila, you wanted to talk about XML

ericP: Henry Thompson wrote a paper on normal forms of XML, turning XML into RDF. If you're going the other way you might want to see it.

phila: would it be useful to get an XML person in the room? They are here in the building.

ivan: We should talk to Liam, the XML activity lead.

phila: he's currently scribing a meeting

danbri: I spoke to him yesterday; he's suggested eXSLT.

jeniT: I was intimately involved in XSLT, but I don't remember that.
... For completeness, it would be good to have an XML mapping. Not a trivial amount of work, and we need someone within the group to take it on. If no one wants to, then we may have to rule it out of scope or issue a note with our thoughts on it.

danbri: We should take seriously that it hasn't cropped up in the use cases.

jtandy: Some mention it. But we don't have anyone keen to take a lead on the work though. Mismatch between what's being asked for and what this group can currently deliver.

danbri: I see demand for it online. Look at StackOverflow, people are asking about libraries.

ericstephan: There are a lot of scientific communities that use XML but they tend to use it more as a tag language. Not necessarily well-formed.
... I don't see a lot of interest going between CSV and XML. They're either in one or the other.

chunming: We talk about someone sharing a big CSV file on the web. Another model is that someone has a huge dataset but allows a 3rd party to access just part of it, using CSV formats. Which model?

jenit: Scope is not to specify a query language over a large dataset that produces CSV. Or an API. But instead the files themselves. But that is a good use case, as jtandy discusses.

jtandy: We do have a use case that is from PLOS, where we are requesting a subset of results where those results are being produced in CSV or JSON or XML

... We talked about looking at a bit of that CSV and decided not to. But we are including the provenance relationship between a small dataset and its parent dataset.

gkellogg: Using an HTTP header — that seems like a protocol. Ensuring that a client can parse the HTTP headers appropriately. Does that open the door?

jtandy: We were talking about using query parameters on an HTTP request in order to get rows 17-29. Not in our scope but relevant.

<AxelPolleres> FWIW, IBM had some canonical JSON to XML mapping… http://pic.dhe.ibm.com/infocenter/wsdatap/v6r0m0/index.jsp?topic=%2Fcom.ibm.dp.xm.doc%2Fjson_jsonx.html (had to dig out the link)

ivan: The various methods to access the metadata mean that even for huge datasets I can get all of it, because the metadata is small compared to the dataset itself.
... I don't know whether the mapping to JSON or to RDF can be helpful for someone to make an inverse and be able to query into the CSV.
... In RDF terms, knowing the metadata can I turn a SPARQL query back into a CSV? It's an exciting question which we won't answer here.

Frederick: Regarding the charter, I'd imagine you'd defer this until you have a strong reason to address it.
... @Jtandy: You mentioned provenance, which is relevant to Web Annotations

jtandy: we have a whole thread of discussions on benefitting from the good work of your group

ivan: we have a joint session this afternoon

danbri: for XML then....?

ivan: For planning, we should make a final decision before the end of the year. Ideally earlier, but we have to talk to Liam.
... He may say "forget it guys", but he may want us to talk to more of the community. In which case, Christmas is not an unrealistic time

danbri: I was going to propose that we not work on XML mappings
... Does anyone agree?

<danbri> ivan/phil 'let's talk to liam'

phila: Let's talk to Liam.

jenit: I propose we catch up to Liam and other XML people over the next couple of days and address this with a resolution by the end of tomorrow.

<JeniT> also http://msdn.microsoft.com/en-us/library/bb924435(v=vs.110).aspx

AxelPolleres: I put something in IRC from IBM (above), but I don't know if there is anything more broad.

ivan: Doing a standard just because it's in the charter and not checking if it's the right thing to do — sounds awkward to me.

AxelPolleres: I thought there may be something we could refer to, that exists already.

JeniT: There are ways of doing that — but I don't think any of those are what we would call standards. Where we could make normative references to them.

danbri: It would be helpful to end the week with a decision.

JeffJaffe: (introductions) CEO of W3C. Interoperable web standards, but particular interest in CSV. So much data out there, this is key.

danbri: Looking at the mapping mechanisms for CSV into other formats... ivan, can you talk about what you've done with direct mapping?

ivan: We had loads of discussion/emails on that. Not just direct mapping. My feeling is: what is realistic: a relatively simple mapping that doesn't require further language specification or syntax within the recommendation

<AxelPolleres> backchannel-question …. as for provenance … we would just hook in PROV with the ‘provenance’ metadata property, or was anything else discussed in this group? (sorry for having missed that, in case)

ivan: what we have now is a document that mimics the RDB2RDF direct mapping (ericp did that). We have metadata we can rely on, so it's a bit different.

<JeniT> AxelPolleres: that was my assumption, though how to structure it in a JSON format I’m not sure

<jtandy> @AxelPolleres: W3C PROV would seem the correct option; we're not intending to re-develop anything in this space

ivan: Last week we had a mail from jtandy with reference to an RFC for URI templates, which is a useful addition to that simple mapping.
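
(The RFC in question is RFC 6570, URI Templates. As a hedged illustration — the template and column names below are invented — a metadata file could supply a template such as

  http://example.org/observation/{station}/{date}

so that a row with station=heathrow and date=2014-10-27 yields the subject URI

  http://example.org/observation/heathrow/2014-10-27

without requiring a full mapping language.)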

<AxelPolleres> hmmmm, http://www.w3.org/Submission/2013/SUBM-prov-json-20130424/ seems to be “post-PROV-WG”

ivan: Those 2 documents exist; they need some care, especially in how the datatypes are interpreted. I think there is a separate discussion scheduled on the datatypes in the metadata.
... Most of it is stable, the core is stable. The core can be implemented because I have a proof of concept for the RDF and JSON part.
... There have been two other works that we explored. 1) We had a long discussion about using this in a more general form (Mustache?)
... Allowing a separate template to generate an RDF or JSON structure that is more complex than the line-by-line structure of a CSV file.

<jtandy> http://mustache.github.io

ivan: If we're not careful, this could become more complicated. I think we should not go this route for rec.
... Independently, 2) Anastasia's RML — the R2RML language minus the SQL-specific things that are irrelevant here.
... To my mind, it has the same issue as Mustache — and it's very RDF-specific. No structure for JSON.
... Right now, I think it's more important to produce JSON than RDF.
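
(To make the "simple mapping" concrete — an invented example, not normative output from the drafts — a CSV like

  name,age
  Alice,42

would map row by row, one property per column, to JSON along the lines of

  [ { "name": "Alice", "age": 42 } ]

which is exactly the per-column limitation jtandy raises later in this session.)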

danbri: Re terminology. I've realised that my thought of "direct mapping" was different to what ivan has meant.
... In R2RML group, mapping starts with an SQL table and creates RDF graphs, triples. Predicates aren't mapped to well known RDF namespaces.
... In this group, we have more richness.
... When we say "direct mapping", we probably mean "simple mapping". Which could map to Dublin Core, or SKOS.

ivan: I plead guilty because I've said "direct mapping" on the mailing list.

danbri: This came to light when I said Google would have no interest in this. But the simple thing is potentially very valuable.

jenit: our first session tomorrow morning is on this.

phila: Axel found a document from IBM, so I pinged Arnaud to ask if we can use it. He wasn't sure. I'll ask him for a clearer answer.

jtandy: Re the diff between "simple mapping" and "templated mapping" — in use cases, I want to represent more complicated content. That needs to go in the Simple Mapping document.
... In simple mapping, you have one property per column; with month and day properties in different columns, you can't create a date property merging them.
... If you have one triple per cell — we can say "this is as far as we can go now, but there will be a community group or separate discussion to hook in external templating stuff."

ericP: If you want to characterize the difference between simple mapping and direct mapping: take a CSV of people and addresses, turn it into a graph, rename predicates in that graph to reflect the metadata. Compare to simple mapping. If they differ in substantial ways, then...

ivan: I use the direct mapping approach.

ericp: any differentiation would be defensible.

ivan: In the case of simple mapping, there is more info than we know: info about the CSV file as a whole.

AxelPolleres: to @ivan: if it covers more but should be the same, is it a requirement that the simple mapping produces more triples?

<ericstephan> @Axel I wonder if the IBM work related to DFDL and Daffodil annotating data as XML document...

gkellogg: There are advantages to looking at RDF mappings. Serialising RDF to JSON-LD gives you a JSON result. There is a spec for doing that. Looking at simple mapping — it now does provide the RDF tools to turn the graph into something more structured using SPARQL

<Zakim> gkellogg, you wanted to comment on JSON-LD from RDF with Framing

<AxelPolleres> what I meant to say is, wouldn’t it make sense to require that the “simple CSV to RDF” mapping is a *superset* (in terms of resulting triples) of “CSV->SQL->RDB2RDF direct mapping”?

ivan: Yes. Conceptually, I was wondering about the same thing. But as an implementer only interested in JSON: this is a long and torturous road. It might be a deal-breaker.

<JeniT> +1 to ivan

ivan: Having a separate document that shows what you get in JSON and making it as close as possible to JSON-LD — as ericP said, there should be no major difference between the direct mapping and the simple mapping —
... If there are differences because JSON requires something different then we have to accept that.

gkellogg: We need to include people comfortable with these technologies.

ivan: I disagree. People who don't know anything about RDF — they just want it in JSON. There are loads of people there

hadleybeeman: I agree with that

<ericstephan> +1 Ivan

ivan: Even as an RDF person — this is a painful reality.

danbri: We have a spectrum of enthusiasm for RDF.
... We need to mush these interests together. With Schema.org and Microdata (designed to be super simple for publishers) — even those were too complex
... These developers aren't thinking in terms of triples or graphs.
... Saying RDF is the answer because you can serialise to RDF/XML — long histories of failings here. Let's not spend the next 10 years doing the same with JSON.

ErikMannens: What's wrong with profiles? Simple profiles? More extended profiles?

<phila> XML is not fading away - its use is growing. Honestly (Liam assures us)

ivan: The simple mapping to RDF is there. The definition is strictly done on the conceptual level in RDF. If someone wants to go that route and get JSON-LD, it's fine.
... If they do that, or do direct JSON, the two things should be close. But we don't talk about that. The document should be readable for someone in that context.
... The context is a good example. If you serialise the result of the RDF mapping into JSON-LD, then you will have all those things there. But if you serialise directly in JSON, you will not.

... If you want to somehow be in the RDF world, then great. But if you're not — those are noise. Irritating noise.

gkellogg: The tide seems to be moving toward well understood structured data in a lot of communities that were hostile to RDF. I don't know that we need to pander to a JSON mapping that doesn't contain some aspects of this.

danbri: we'll pick this up later

meeting goals

<ericstephan> @phila - I agree with Liam's comment, lots of legacy communities still using XML, other communities that are emerging such as High Energy Physics very interested in XML. Just not sure about the CSV XML connection.

Review our implementation types

jenit: We've looked at RDF, XML, JSON — that's one set of implementations. But I'm also interested in validators (validating a set of CSV files against the metadata to say if it's formatted correctly, has the right columns, etc.)

<JeniT> http://csvlint.io/

jeniT: (shows demo of csvlint.io )
... Validation tools are really handy. We in the UK have a push to get local government to publish data about public toilets. The people pushing it defined a schema for the data, and 400+ local authorities had to validate against that.
... That makes it easy to pull all of those datasets together into something consistent and coherent.
... Another important implementation: display of CSV. GOV.UK, data.gov.uk, github — have displays of CSV as a table. They'll often add on filtering or sorting options.
... it's important and useful to know what the data type of the column is, so you can filter it the right way.
... using jquery datatables
... www.datatables.net

<AxelPolleres> side remark… seeing csvlint.io reminds me somewhat of http://www.w3.org/2001/sw/wiki/RDF_Alerts which we did some years ago… that was RDF specific though, not sure whether any of that is useful here.

jeniT: Turn that CSV into an HTML table. You can imagine having pop-ups over the cells if they have annotations, having a metadata view, etc.

<danbri> "display" / viewers

jenit: So those are the three implementations I think of: mappers, validators, and viewers.
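
(A minimal sketch of the validator idea, in Python, assuming a simplified metadata shape with a top-level "columns" list — the real drafts nest this more deeply:

  import csv
  import json

  def validate(csv_path, metadata_path):
      # Read the expected column titles from the (simplified) metadata file.
      with open(metadata_path) as f:
          expected = [col["title"] for col in json.load(f)["columns"]]
      # Read the header row of the CSV itself.
      with open(csv_path, newline="") as f:
          header = next(csv.reader(f))
      # A csvlint-style check: report any mismatch between declared and actual columns.
      if header != expected:
          return ["expected columns %s, found %s" % (expected, header)]
      return []

A real validator would continue cell by cell, checking datatypes and other constraints.)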

<danbri> (me: import from bytes into tabular data model, … but that's more IETFish)

ivan: It's clear to me what the first two categories do for us. I'm not sure how the third category fits into the picture of checking our own work. Importing is definitely not in our charter. We are not defining the byte stream to tabular conversion — that's in the IETF spec.
... What are the implementations that we have to take seriously as part of the rec track?

<Zakim> JeniT, you wanted to talk about error messages & warnings

jeniT: It is useful to talk about the display in a non-normative fashion
... Also, in what we need to do for validators: we need to talk about errors, warnings, etc.

ivan: Do we have to define standard errors?

jeniT: I think so. Not standard wordings, but codes for them. I think it's helpful.

Richard: (introductions)
... Display will be different. Internationalization is looking at the forms in HTML, numeric formats in different languages, etc. There are problems associated with that that may be relevant here.

jenit: CSV, unlike a lot of other data, has the goal of being both machine readable and human readable. So we do have numerical formats that are locale specific. (Dates, numbers, etc.)

Richard: You may need to account for locale in the metadata.
... As HTML does, a lot is done in the browser.
... A lot of a locale is a language plus local settings.

danbri: Shall we have a joint meeting about this?

jenit: we have a session on data types later today. Useful for this.

<danbri> hadleybeeman: things like the display on html page may not be as relevant for this WG but it fits well with Data On Web Best Practices WG

<phila> hadleybeeman: The display issue may be relevant to DWBP group

<danbri> we are looking at barriers to use, if average user can't see/read/understand ...

<ericP> hadley: display may not fit this WG but it may fit well in DWBP.

jtandy: For us, in terms of display, we often want to get data into just plain JSON. "javascript goodness" can then be applied.
... In internationalization, we look at right-to-left and top-to-bottom languages too.

ivan: we have Japanese representation here. In China people are pretty agreeable to doing everything horizontally; in Japan this is not so.

<phila> Vote of thanks to Hadley for scribing first (busy) session

<daveL> Best Practices for Multilingual Linked Open Data Community Group may be willing to help with internationalisation issues

<daveL> http://www.w3.org/community/bpmlod/

Tabular metadata

JeniT: talking about metadata representation for individual tables, but also how it can be applied to columns

… title, description, date, …

<JeniT> http://w3c.github.io/csvw/metadata/#common-properties

… currently in “Metadata Vocabulary” spec sec 3.3

… This pulls in and references all Dublin Core metadata terms

… In some cases terms describe data values: an object, a natural-language string, or something with a particular date format

… Three areas to discuss.

… 1) what the list of properties should be, perhaps DCAT or schema.org instead of DC. Perhaps our own set.

danbri: if they’re a DC-based project, they may need to use DC for everything.

JeniT: sometimes it’s the consumer that cares most about vocabulary mapping, rather than the publisher.

… We need a list, as we’re expecting validators and mappers to reject properties not on the list (to avoid miss-spellings).

… 2) how are the properties defined, within the spec or outside. (Constraints on what we can point to)

… 3) How is metadata used to inform the mapping to different formats.

<AxelPolleres> do we need/want any new properties on document level anything which is not covered in DC, DCAT, PROV? Do we need to specify mappings to those?

<Zakim> AxelPolleres, you wanted to comment on provenance

AxelPolleres: there are two types of metadata, document-level and structural.

… The former is also around provenance, the second is for processing instructions.

<AxelPolleres> http://www.w3.org/TR/prov-dc/

… Also consider PROV vocabulary, there are notes on how to map PROV to DC.

… Do we need to ensure that there are mappings between the two.

JeniT: we can just pick up DC terms, or we could say use DCAT or ...

… In this case “provenance” is the DC term, not necessarily relating to a different spec.

<JeniT> ‘provenance’ isn’t in the schema.org set of terms

hadleybeeman: do we have any way of knowing what is used more between the different formats?

<Zakim> hadleybeeman, you wanted to ask about existing implementations

danbri: Google has information for microdata/rdfa/json-ld, but not from other RDF formats.

… Clearly, we’re going to see a lot of schema.org.

hadleybeeman: what would these numbers tell us if we could get them.

<AxelPolleres> is it in us to define/extend mappings between - for us useful - properties among schema.org, DC, DCAT, PROV, e.g. extending http://www.w3.org/TR/prov-dc/

ivan: Jeni said that “these terms” are the only terms you should use, which seems to be dangerous.

<Zakim> danbri, you wanted to ask about UCs and subsetting

JeniT: That restriction is about “un-prefixed” terms — it really applies only to unprefixed terms.

danbri: can we be use-case driven? DC started with 15 terms, has grown over the years.

… Can we use the use cases to pare down the set of terms we need to support.

jtandy: National Archives has some economic data which includes publisher, date, time, obvious stuff.

danbri: perhaps we can look at CSVs in repo.

bill-ingram: in the library, everyone in metadata knows what DC is, but it took a while to get there. People starting to talk about schema.org.

… Most of this relates to the software we use, for that DC is the core metadata for describing objects. It’s starting to change.

<ericstephan> +1 bill-ingram

<jtandy> from the UC doc: see http://w3c.github.io/csvw/use-cases-and-requirements/#UC-PublicationOfNationalStatistics

<AxelPolleres> FWIW, CKAN also has some metadata properties which I am not sure how far they are aligned with e.g. DC, etc., are they?

… I’m interested in schema.org, but it always ends up talking about mapping back to DC.

laufer: there may be some mandatory items.

… Some may be mandatory, others optional.

JeniT: different organizations always create their own profiles for what they expect.

ericstephan: the predominance of data is in DC. I’m sensitive to DCAT and DC, as they’re forward thinking.

… looking at requirements derived from use-cases, that would be a way to help define a core set of metadata we should be considering, or if there are obvious glaring holes.

… I am worried about getting lost in the detail, however.

jtandy: we previously agreed to a short-list of about 15 terms.

… at the end of section 3.4.2

<danbri> http://w3c.github.io/csvw/metadata/#optional-properties

… these are properties that relate to core information expected to be associated with CSVs and used in mapping.

ivan: spatial and temporal were unclear if they should be part of the core

JeniT: I think that list was “plucked out of the air”. There are so many groups who have thought about this, we shouldn’t re-do that thinking.

<Zakim> JeniT, you wanted to propose using the overlap between the various specs

jtandy: we were looking at three main things: validation, mapping and display.

… What metadata do we need to ensure that these mappings can occur? This list doesn’t capture that.

… Maybe we can off-load choice of terms to Best Practices WG. cc/hadleybeeman

hadleybeeman: we haven’t gotten into this too much yet.

… We need to talk about this more, but that kind of a division of labor makes sense.

jtandy: It doesn’t matter how you’re publishing data, these questions are universal.

… It really should be about validation of parsing.

hadleybeeman: we’re shying away from specifying specific vocabularies, as there are many different needs.

jtandy: but you probably should be able to say that there should be a license, but there are many ways to express it.

laufer: we can’t make a complete list, but we can give examples of vocabularies which can do it.

ivan: until now everything is mapped to DC. The question is should we use schema.org or DCAT instead?

… We tried to specify a very small core, but leave the details up to the users.

… This list was for the small core; it does not exclude the use of other vocabularies.

… Do we define the 5..15 terms ourselves, or leave it open to the user to decide?

<hadleybeeman> Is the question here: Are we defining a vocabulary, or pointing to existing work?

… What does it mean if we pick “language”, “title”, and “provenance”? Do we define a new core (Santa Clara Core?)

<ericP> i thought the purpose of picking the terms was to enable a modicum of validation

fjh: What are the normative assertions, and how do you test them?

… If you push too much off to the Best Practices group, you might not have something testable.

JeniT: two issues: when testing the metadata file, a validator has to generate a warning.

… The other level, is the actual use in deeper validation or mapping.

… For example, the “title” property might be used to validate column titles to be what is expected.

<Zakim> JeniT, you wanted to talk about definition through implementation

ivan: als need to check that the value given to a language mapping is a real language.

JeniT: That’s a way to distinguish between first-level terms, and other terms.

… The implication is that if you wanted to use, say, a license, you would need to use a prefixed term.
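
(A hedged sketch of that distinction — property names per the discussion, not final spec syntax:

  {
    "title": "Public toilet locations",
    "language": "en-GB",
    "dc:license": "http://opendatacommons.org/licenses/pddl/"
  }

The bare terms are the small validated set; the prefixed dc:license passes through to whatever vocabulary the prefix maps to.)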

danbri: we’re pushing on 20 years of DC work; never been too rigid. Everything’s optional.

… If this group starts to make stronger claims about DC, that might be an issue.

AxelPolleres: what are the expectations on validity?

… For some things it’s lexical, but for others it is more challenging (license, for example).

<Zakim> AxelPolleres, you wanted to comment on what are the expectations on validation

… Do we want to validate other types of things. recommendations of particular strings to use?

… Some things we can validate, other’s we can’t; doesn’t mean they’re not important.

<AxelPolleres> e.g. license IS very important to be declared.

<hadleybeeman> I note the many crossovers with DWBP WG

laufer: we need to classify types of data. Structural data?

… How important are different types of data for searchability, for example.

… What we can do is information about the structure of the data (rows and columns, datatypes, etc.)

… license, provenance, … Difficult to test these; you can only test syntactically.

jtandy: If we’re focusing the metadata vocabulary on what is necessary for validation, then we shouldn’t spend too much time worrying about it.

ericp: I’m working with the DC application profiles group, who want to make sure there are some ways of describing restrictions on publication profiles.

… We’re also starting work at the W3C on this.

danbri: the DC view is that such restrictions make sense in a particular context. For us the question is: do we decide this?

… Uses by government vs search engines may be different.

<hadleybeeman> +1 to ericstephan

ericstephan: our focus has been generic document-level. What we’re going to do is give insight into describing CSV contents; that’s what people will look for.

<Zakim> ivan, you wanted to ask whether the world would collapse if we stay with a few dublin core term

<AxelPolleres> +1 to DanBri: make things work together (addition: rather than defining something new)

ivan: I wonder if the metadata document maybe just went too far? Perhaps we should just take the bare minimum (2-3 terms) but we use DC explicitly.

… We do rely on DC once and for all, as we have a mechanism for using other vocabularies.

… What counts is the metadata for describing the structure.

… Accept DC, use DC, make it clear that you can use schema and DCAT by using prefixes or contexts.

… a validator may then check these things.

… Just a few un-prefixed terms from DC, not defined by us.

JeniT: We need to have some restrictions on the values of these terms.

jtandy: For the validation to work, we need consistent syntax. I can imagine not caring about what the @context says, because I can validate the structure. It may map to schema, or to DC.

ivan: I’m saying we don’t define the value of “language”, for example.

<Zakim> danbri, you wanted to suggest dc:locale per richard ishida's contrib

danbri: localle has been described as being important, but isn’t in our list.

bill-ingram: we’d prefer that everything be prefixed, but if un-prefixed, we’d like them to map to dc?

ivan: perhaps, but I suggest that we only allow 5 terms to be unprefixed.

hadleybeeman: scope question: If I have a dataset using DC, and you have one using DCAT, does that break the purpose of this WG? Or is it okay as long as they’re each valid?

JeniT: Just validity.

<danbri> JeniT suggests resolution "we are going to stick to a small set of properties that are used in validation or mapping"

<danbri> greg: i heard a couple things, …1 at v surface level terms are used, e.g. title as a string in json doc...

<danbri> …this doesn't necc say that it maps to dc:title

<danbri> … is everyone on board with this, or is there a feeling that it must map to DC title

ivan: we define the meaning of terms according to DC, but it may be mapped. If mapped, it must be dc:title.

… Perhaps through entailment?

<danbri> gregg: cautioning that looking at surface level of json where you also expect a mapping could be problematic

ericp: Is the WG receptive to the Best Practices coming back and saying that there may be some imposed restriction?

hadleybeeman: I’d say that that is a different place for the discussion, but might include the same people

JeniT: I think that it’s reasonable for other groups to decide practices that we should conform to.

ivan: we use JSON, and when we can, the syntax conforms to JSON-LD. The metadata can be considered as JSON-LD by an implementor if it wants

… So there might not be a context?

<AxelPolleres> ericp, how bout calling it “best practices” rather than “imposed restrictions”?

hadleybeeman: I believe the REC track process is such that we (Best Practices) can’t decide things without considering the needs of other groups.

<Zakim> JeniT, you wanted to propose language as metadata for tables, title and language for columns

JeniT: perhaps we can make a decisions?

… I’d say that “language” at document level (or group of CSV files), and “title” and “language” at the column level.

… “localle” is part of “language”.

<AxelPolleres> What’s rong with the list on http://w3c.github.io/csvw/metadata/#optional-properties ? would like to make an attempt to argue for that list.

… Title is a natural-language string at the top of a column.

… Name is like a variable name for that column, what is used in the mapping. Typically has a constrained syntax.

… @id is there for JSON-LD compatibility.

… it must be an IRI.

<danbri> "locations for toilets, e.g. @id "lat" for latitude of toilers

<danbri> gregg: that would be ok if we had a base location as you could construct an IRI

ivan: that means that “title” is fundamentally different than other properties.

<Zakim> AxelPolleres, you wanted to comment on asking CSVW vs DWBP

AxelPolleres: what’s wrong with the core list? I think the things we need are present in that list.

… It may be arbitrary, but it seems good. Better than something overly constrained.

<Zakim> jtandy, you wanted to note that additional properties like "base" will be required

<ericP> +1 to starting small (3)

jtandy: there will be other things that are necessary, but they will emerge.

danbri: A small list is easier to stand behind; a medium sized list may give the false impression that we’ve thought deeply about it.

<danbri> gkellogg: if we overly restrict use of simple strings as properties within a json file,

<danbri> …we are violating expectations of many json users who like js with dot notation (i.e. objects)

<danbri> jtandy: context could ...

<danbri> gkellogg: context could … sure, ...

<danbri> jtandy: only lang and title are the terms which we want to validate at the surface level

<AxelPolleres> maybe a good idea to not put something expressed elsewhere (DC) into our standard… (retracting my concerns from before, if we point to that as a “for instance” option to extend the meta-data vocab).

<danbri> gkellogg: also looking for bare terms that dont map

ivan: we need to be careful: what we’re talking about here are the “usual metadata terms”. 90% of the metadata file consists of terms describing the structure of the CSV file. Those are of course unqualified.

… There are other unqualified terms; we’ve been discussing 10% of the content of a metadata file.

<danbri> Observers: Please consider volunteering to scribe next session.

… Having a very restricted version as proposed by JeniT as being unqualified is fine. We can then see if “the other group” comes up with more required terms.

<danbri> jenit, want to take a poll on your proposal to bridge to lunch?

… We really need to solve structural terms. The use of licence and title, say, should be solved by the Best Practices group. Leave these open until the DWBP has something to say.

<ericstephan> +1 Ivan

<jtandy> @ivan: +1

<JeniT> PROPOSED RESOLUTION: We will define the terms ‘title’ and ‘language’ (for columns) and ‘language’ (for table groups down), provide examples using qualified terms for other metadata vocabularies, and be guided by DWBP wrt recommending other particular metadata terms to recommend

<AxelPolleres> +q one last question

<danbri> axel "what about encoding?" (utf-8 etc)

AxelPolleres: what about encoding metadata?

JeniT: described elsewhere.

<danbri> PROPOSED: We will define the terms ‘title’ and ‘language’ (for columns) and ‘language’ (for table groups down), provide examples using qualified terms for other metadata vocabularies, and be guided by DWBP wrt recommending other particular metadata terms to recommend

<AxelPolleres> +1

<ericstephan> +1

+1

<danbri> +1

<ivan> +1

<bill-ingram> +1

<JeniT> +q

<jtandy> +1

<JeniT> +1

<JeniT> -q

<hadleybeeman> +1 though I'm just observing. But this makes sense from DWBP's perspective too

<bjdmeest> +1

<chunming> +1

<AxelPolleres> encoding is covered by the syntax http://www.w3.org/TR/tabular-data-model/#encoding

<ivan> RESOLUTION: We will define the terms ‘title’ and ‘language’ (for columns) and ‘language’ (for table groups down), provide examples using qualified terms for other metadata vocabularies, and be guided by DWBP wrt recommending other particular metadata terms to recommend

<ericP-mobile> +1 (as observer)

… title and language being the only document-level metadata attributes standardised.

<JeniT> we should use BCP47 for languages

<danbri> dc http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#terms-language

Addison Phillips: suggests using language tags (BCP 47) for languages
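
(A sketch of what the resolution could look like in a metadata file — the shape is illustrative only — with BCP 47 tags at table level and per column:

  {
    "language": "en",
    "columns": [
      { "name": "naziv", "title": "Naziv", "language": "sr-Latn" }
    ]
  }

sr-Latn, Serbian in Latin script, shows how BCP 47 tags can also carry script information, as in danbri's example below.)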

Dan: if more than two codes are applicable, should we repeat the property?

Richard: several languages may appear in a doc, the intended language of the user is the top level of the meta-data, but particular cells or columns could have different language.

Ivan: there should be metadata at all levels of granularity.

<danbri> (example in mind: col in table that might be any of the croatian/bosnian/serbian lang, some in cyrillic some in latin script, but lacking formal per-cell details)

… additional information about “all languages” used/mentioned in the document?

Richard: top level is “who are the users”, second level is “rendering”.

ivan: we didn’t differentiate that so far

ericP: trivial use cases, like numeric data, have no language.

Addison Phillips: there are language tags for “no language”

datatypes

jeniT: datatypes per column, cell, etc. are a common issue, e.g. XML Schema, string value vs. semantic value

… in XML schema string values are constrained.

… to ISO format, extremely difficult for CSVs if generated locally.

… we would like to be able to map from type to particular formatting for that type.

Addison Phillips: e.g. there are other calendars besides Gregorian, this makes it much more complex

(on the example of the dates “27/10/2014” vs “27th October 2014” on the whiteboard)

<danbri> e.g. http://www.w3.org/TR/xpath-functions-30/#syntax-of-picture-string

ivan: trying to see whether there’s a standard on the “picture values”

<danbri> http://cldr.unicode.org/

Addison: refers to ICU library

(is that this one http://site.icu-project.org/ ?)

<danbri> http://www.unicode.org/cldr/charts/26/by_type/index.html

Richard: different schemas, XML Schema (referring to ISO), HTML, UNICODE…

Ivan: we are rather talking about how to “understand” certain strings as “ISO” …

<JeniT> http://www.unicode.org/cldr/charts/26/summary/en.html

Addison Phillips: Unicode defines all those here: http://www.unicode.org/cldr/charts/26/summary/en.html

Richard: you might need more than just the picture strings, e.g. ‘$’ meaning US dollars or Australian dollars or HK dollars, etc.

<danbri> http://en.wikipedia.org/wiki/ISO_4217

Addison Phillips: 3-character codes for currencies: ISO 4217

<aphillip> http://www.unicode.org/reports/tr35
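
(For the whiteboard example, in UTS #35 terms: the picture string dd/MM/yyyy matches “27/10/2014”; a pattern like d MMMM y gets as far as “27 October 2014” but needs locale data for the month name; and the ordinal in “27th October 2014” goes beyond standard patterns altogether — exactly the complexity being discussed.)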

<Zakim> AxelPolleres, you wanted to ask where to stop

<danbri> "There are things like unit ontologies too...

<danbri> .. could go arbitarily far

<danbri> … standard units, …

<danbri> … i am not clear on where this would stop

<danbri> e.g. the number of cars per 1000 people

<danbri> [see also QUDT]

ivan: we shouldn’t go beyond XSD datatypes.

Jeremy: maybe we can add in metadata a script that transforms “picture strings” into prescribed format before validation.

… we should allow people to work around.

Richard: it would be easier if you’d go with one global standard.

ivan: if you consider the data out there, that wouldn’t work, in reality, everybody uses what they want.

Addison Phillips: the range of date format variations is huge.

Ivan: is there a relatively simple picture string format we could refer to and use, which covers ~70% of cases?

… that we can refer to and otherwise, for special cases, allow preprocessing?

Addison Phillips: e.g. month abbreviations in various languages already make it complex.

Ivan: month abbreviations should be part of the locale. We should look around the usual libraries in common programming languages

… I am uneasy with saying “either use an ISO string or give me a program”

<danbri> ericstephan, … see http://www.w3.org/TR/2014/WD-tabular-data-model-20140327/#excel for special casing documentation around Excel

erik: we explicitly cover in the tabular data model doc e.g. Excel and the date formatting it uses

<hadleybeeman> Re spreadsheets, I think the Open Document Format supports dc:date, if that helps any

… that is a technology-based solution.

Richard: HTML will require you to use one standard format for dates, why not start out with that format.

JeniT: because there are masses of documents that don’t use it.

<Zakim> ericP-mobile, you wanted to ask the coverage of picture formats

Richard: the argument about prescribing utf-8 is similar.

<hadleybeeman> Here is how the ODF spec handles it: http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1416366_253892949

Addison Phillips: CLDR contains 100s of locales, not everything, but for the data out there it has decent coverage

<danbri> JeniT, how much time needed for rest of datatypes topics in agenda?

ericP: the value of being able to read existing data is not that much.

<JeniT> we should move on if we can, but it would be good to get a direction of travel

ericP: we could do several levels; the question we want to ask ourselves is where to stop.

<danbri> axel: q … if we ask ourselves how far we want to go. What makes us believe people who are not willing to convert their data into a specific format, … why will they go produce metadata to do the mappings

<danbri> jenit: it could be a 3rd party

<Zakim> AxelPolleres, you wanted to ask about the value of annotating existing data with scripts.

Ivan: metadata can be decoupled

<Zakim> danbri, you wanted to ask i18n folks how to continue this

Axel: that answers my question then.

JeniT: I’m inclined to get us just to picture strings without locale (e.g. no multi-language month abbreviations)
... that seems like a good direction to me.

Ivan: not sure; we’d make a requirement that is weaker than most of the programming language libraries out there.

Richard: currency examples not covered.

<danbri> (eg. $)

<hadleybeeman> Is currency best captured under locale, or as metadata in its own right?

ivan: datatypes are not units, currency not a good example.

<Zakim> AxelPolleres, you wanted to repeat myself.

<danbri> axel "agree currency eg is not a good one (for datatypes). The datatype of a price is number not currency. If the metadata could be decoupled from data, … we could equally well say that someone else republishes the curated data."

<danbri> … if there is so much data out there with poor datatyping, … … isn't data republishing as likely as metadata annotation?

<danbri> ivan: a matter of scale, … if you have terabytes of data, working at metadata level is easier

<Zakim> bjdmeest, you wanted to say someone could publish a parsing method

Ben: instead of republishing, can we publish a “re-publishing” method?

<danbri> Thanks aphillip, r12a :)

<JeniT> http://w3c.github.io/csvw/metadata/#datatypes

built-in datatypes

<danbri> jenit: number, binary, datetime come from JSON Table Schema, which comes from JSON Schema

ivan: we may not want to add geopoint

JeniT: propose we just ignore geopoint altogether

<JeniT> PROPOSAL: We should not support ‘geopoint’ as a datatype

discussing issues ISSUE-13 ff

<ivan> +1

<JeniT> +1

<jtandy> +1

<ericstephan> +1

<gkellogg> +0

<danbri> +1

<bill-ingram1> +1

+0

<bjdmeest> +0

<ivan> RESOLUTION: We should not support ‘geopoint’ as a datatype

<JeniT> PROPOSAL: We should not support ‘object’, ‘array’ or ‘geojson’ as datatypes

<JeniT> (this is ISSUE 14 in the document)

<ivan> +1

<danbri> (issue is purely in the doc, not in w3c tracker or github tracker)

<danbri> +1

<JeniT> +1

<bill-ingram> +1

<ericstephan> +1

(ISSUE 13 in the document should be closed)

<gkellogg> +1

+1

<bjdmeest> +1

<ivan> RESOLUTION: We should not support ‘object’, ‘array’ or ‘geojson’ as datatypes

<danbri> issue 15, the any type

<danbri> from doc, "We invite comment on whether the any type is useful."

Jeremy: we will support some kind of list types though, or parsing lists.

<danbri> gkellogg, can we deal with your point after these issues are resolved?

JeniT: ISSUE-15: we could let people declare explicitly that something is of no particular datatype.

<JeniT> PROPOSAL: It is useful to have an ‘any’ type to explicitly say that anything is allowed

<jtandy> +1

<ivan> +1

eric: what’s the difference between any type and string?

<danbri> +1

jeremy: it is made explicit.

-0

<ivan> +1

<ericstephan> +1

<danbri> +0.5

JeniT: still unsure.

<gkellogg> +0

Ivan: this is for mixed type columns

<Zakim> gkellogg, you wanted to ask about schema:Date/Time/Duration types

gregg: schema.org uses different datatypes than xml schema…

<Zakim> danbri, you wanted to ask about null

ivan: I am changing my vote back to 0

<danbri> example: birthdate, deathdate

JeniT: shall we allow empty values for particular cells, e.g. deathdate

“”^^:null

<danbri> +1

<bjdmeest> EricP: any can be solved on the application level

<scribe> scribe: bjdmeest

<danbri> time til next break: 19 mins

EricP: technically, it will be a string

<JeniT> +1

<danbri> thanks new scribe, thanks old scribe!

EricP: for RDF: top-level is string

Jeni: might keep this as an issue...

laufer: semantics is in the application, or in the datatype?

ivan: data can come from different sources

<Zakim> ivan, you wanted to a reference back to the locale issue for the minutes

<ivan> The reference to use for the date 'picture strings' is: http://www.unicode.org/reports/tr35/

<danbri> next, "We invite comment on whether there should be types for formats like XML, HTML and markdown which may appear within CSV cells"

Jeni: issue 16 is support for other kinds of content (XML, HTML, Markdown)
... how to handle those
... simple strings? specific datatypes?
... Markdown would be very useful

EricP: what if users can define types?
... to support different Markdown flavors

<Zakim> JeniT, you wanted to suggest using media types

Jeni: possibility:
... specify a media type

<Zakim> danbri, you wanted to ask difference datatype / mediatypes (aka mimetypes)

danbri: what's the difference between media-type and data-type?

ivan: technical question:
... not full xml, but only fragments, same for html
... is that something to use a datatype?
... maybe we have to define a datatype for markdown?

<danbri> (q: normative refs to markdown?)

ivan: don't standardize markdown, just add datatype

jenit: people can specify their own with a prefix
... we cannot define a markdown datatype, as there is no spec

danbri: fragments are useful, we get hyperlinks

jenit: what about JSON in CSV?

<JeniT> PROPOSAL: We should add ‘xml’ and ‘html’ datatypes

jtandy: I see a lot of people add JSON in CSV

<JeniT> +1

gkellogg: it's quite common to add an HTML table

<ivan> +1

<ericstephan> +1

<bill-ingram> +1

danbri: CSV is not really specified strictly

<gkellogg> +1

laufer: one may define its own datatype?

<JeniT> PROPOSAL: We should add ‘xml’ and ‘html’ datatypes as defined in RDF

laufer: string can have its own datatype

jtandy: define it in your own namespace

laufer: json is not a qualified datatype?

jenit: it's not on the list (yet)

<ivan> +1

<danbri> +1

<JeniT> RESOLUTION: We will add ‘xml’ and ‘html’ datatypes as defined in RDF
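
(For context: RDF 1.1 defines rdf:XMLLiteral and rdf:HTML as datatypes for XML and HTML fragment content — presumably what "as defined in RDF" refers to here — so a cell value could be carried as, e.g., "<em>note</em>"^^rdf:HTML.)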

<JeniT> PROPOSAL: We should add ‘json’ datatype with our own namespace

ivan: what is the official status of json?

jenit: there is an IETF and an ECMA spec

<danbri> (ECMA spec)

ivan: is anything stable?

<gkellogg> Common Markdown Spec: http://spec.commonmark.org/0.6/

ericP: easier to have a stable spec for json than for markdown

gkellogg: there is a community spec for (Common) Markdown

<ivan> +0

jenit: should we have a json datatype?

<danbri> +1

<ericstephan> +1

<jtandy> +1 ... I've seen it in the wild

<gkellogg> +1

<JeniT> +1

<bill-ingram> +1

ericP: the implications at parsing might be big...
... you signed up for a row with a value, possibly an array, but with JSON... it might explode

jenit: if there is embedded json: do we parse it?

<danbri> laufer, you're still on the queue. Was this a new question/topic?

gkellogg: what about JSON-LD? merge it with the graph in an RDF serialization?

ivan: JSON inside JSON? how do processors react?

jenit: same with xml serialization

ericP: my guess: serialization: escape with quotes

danbri: what is worse than 10 mb of json inside csv? 10 mb of anything inside csv

jenit: during mapping: json or xml or ... remain strings
... possibly datatype string
... set of datatypes on list right now: ok or not? less?
... I propose: let's blindly adopt the list

<danbri> suggest - add wording to SOTD section soliciting advice

EricP: R2RML and SPARQL support less

ivan: let's not touch that, leave it as it is

danbri: handed by the elders
... we want flippy floppy searchy things

ericP: you want timezone processing?
... wipe out everything with the 'g' (gYear, gMonth, etc.)

<danbri> Annotations + Notes session

Notes & Annotation

<phila> scribe: phila

<scribe> scribeNick: phila

<danbri> see http://www.w3.org/annotation/

Rob: Introduces self and Annotations work
... gives a brief history of muiltiple attempts at the smae thing coming together

<danbri> charter - http://www.w3.org/annotation/charter/

Rob: Meeting here for first f2f
... works through charter

<JeniT> http://www.openannotation.org/spec/core/

Rob: describes model in http://openannotation.org/spec/core/
... n-ary relationship between comments and targets
... want to be able to annotate a section/shape within a page, or an area of an image
... couldn't find anyone else working on this
... Also looked at simple collections of resources
... the Community Group also did some provenance work

fjh: It's a rich model
... I assume we're here to talk about annotations for CSV
... I guess you want to annotate cells in a CSV file
... or an HTML representation?
... What are the use cases?

JeniT: The typical use case I see in tabular data... like Excel files... there will often be notes on a particular cell, like a footnote etc
... and you go down to the bottom of the sheet and it has relevant info

fjh: You have a model that describes how the table is laid out?
... so we'd have to anchor into that model

JeniT: Yes. So we have the tabular data model - columns, rows and cells - and the metadata file describes that table
... it can describe individual CSVs, so you may want to annotate cells, rows, cols, areas etc
... the annotation will be in the metadata file
... not in the CSV file

ivan: I could imagine saying that the body is a pointer to something longer
... I'm not sure that the annotation is always part of the metadata file

jtandy: The target might be a cell, or a region...

fjh: Talks through a use case where a name is spelt incorrectly

jtandy: We can give you a persistent ID for that cell
... we haven't yet separated out the cell and the contents of the cell

fjh: We have notion of robust anchoring...

<JeniT> http://tools.ietf.org/html/rfc7111

jtandy: There is RFC7111 which is a frag ID scheme for CSV cells
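
For reference, RFC 7111 fragment identifiers look like this (the example.org URL is illustrative):

    http://example.org/toilets.csv#row=5         the fifth row
    http://example.org/toilets.csv#col=2         the second column
    http://example.org/toilets.csv#cell=4,2      the cell at row 4, column 2
    http://example.org/toilets.csv#cell=2,2-5,4  a rectangular range of cells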

JeniT: We have a spec for the metadata file (JSON) http://www.w3.org/TR/tabular-metadata/
... It describes a set of CSV files - columns, rows, the types of values
... but we want to include a pointer to a specific cell etc so we can include annotations

fjh: I think of annotations as discrete objects
... you see them as being transported with the CSV

gkellogg: The term annotation is overloaded
... I believe the Annotation WG use case is to allow someone to say something about that thing over there while the author retains ownership

laufer: Do you see annotations as being machine readable?

<gkellogg> In RDF, an annotation is a synonym for metadata about a document, such as provenance.

Rob: Possibly, yes, we've looked at them as RDF models as one option

<gkellogg> The other view of annotation is that of an assertion a third party makes about some other document on the web; in our case, a CSV. That makes having it in the CSV metadata document untenable

fjh: What I see is that you have this model of a metadata file that defines a schema for a CSV file, and you have the data content in the CSV file itself, and you can envision that being presented to a user
... and an application can take annotations from anywhere and overlay that
... that target needs to be defined... the fundamental question seems to be around that anchoring
... When I annotate, I somehow have to store that

(Talks about multiple annotations, created at different times)

ivan: The metadata that we define, conceptually, is in one or more separate files. The CSV data itself is not changed by the metadata
... in Gregg's model, the whole annotation procedure is completely disjoint from the CSV metadata
... you don't need to know the schema. The current RFC just plays with row and column numbers
... with the URI of the CSV and the frag ID you have the same power as HTML annotations
... but it's not robust in the meaning of the Annotations WG
... In the metadata doc right now, there is a slot (notes) - which smells like annotation
... so you may envisage a situation where you add annotations into the metadata. It's not clear what structure we have to provide to be compatible with Annotations

Rob: This sounds similar to the IDPF work

<danbri> IDPF - http://idpf.org/epub/30

Rob: In the last year, Marcus and I wrote a little spec for how to annotate an ePub doc
... zip them up so they go with the ePub from one device to another
... this looks similar?

ivan: That would mean that we have the metadata and that the annotation (collection) would be in a separate file?
... In the ePub world these are separate files in the same package

JeniT: They could be embedded in the metadata

ivan: The value of the notes property is an object
... that has a target into the CSV

JeniT: It's an array

fjh: So we're talking about JSON... I can use JSON-LD, one inside the other. We can serialise as JSON-LD...

gkellogg: You can include context in the JSON file

-> http://openannotation.org/spec/core/core.html#FragmentURIs Fragment URIs in Annotation core model

ivan: The metadata for CSV files is in JSON. It is compatible with JSON-LD for those who care
... There are 2 ways of using the annotation model. One is disjoint from the metadata world (as Gregg described)
... the other is when they're created as part of the metadata
... e.g. column x is rubbish - that goes into the metadata
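
A minimal sketch (Python, not from the meeting) of what such an embedded note might look like; the 'notes' slot is the one ivan mentions, but the 'target'/'body' property names and the context URL are assumptions:

    metadata = {
        "@context": "http://www.w3.org/ns/csvw",  # assumed context URL
        "url": "toilets.csv",
        "notes": [{
            "target": "toilets.csv#cell=4,2",  # an RFC 7111 fragment
            "body": "column x is rubbish",
        }],
    }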

<danbri> (Scribing note: please consider volunteering to take over from phil later today, somebody...)

Discussion around whether the body of an annotation must be plain text or could be more structured

-> http://openannotation.org/spec/core/publishing.html#Serialization Serialisation section of annotation publishing spec

gkellogg: So much of the info in an annotation is a duplication of what is already available.

danbri: We expect a lot of MarkDown floating around - can that be the annotation body?

Discussion of simplicity/power - as ever

fjh: I'm hearing a strong requirement for simplicity where it makes sense, without precluding more flexibility where needed.

ivan: I think "body" "blah blah" is fine

Rob: The issue with punning properties is that you don't know whether a value is a URI or a string

gkellogg: I can take something that is assumed to be a URI type... I would say that the body is expected to be a string
... if you put an object there then it will be interpreted as an object
... if you want it to be interpreted as a URI then you'd need to declare the type

ivan: And that's how JSON-LD works

Rob: The flipside is when you want to do MarkDown, HTML etc.

ivan: Then you need to give datatypes
... it's then feasible to require the author to specify what kind of value it is

Rob: And of course language
... that's why the CG went for always requiring an object

gkellogg: JSON-LD also has the language objects
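
A sketch of the distinction being discussed, using JSON-LD value objects (the datatype IRI is hypothetical):

    # the simple case: body is a plain string
    note = {"body": "needs checking"}

    # the same body as a JSON-LD value object carrying a language tag...
    note_lang = {"body": {"@value": "needs checking", "@language": "en"}}

    # ...or a datatype, e.g. for MarkDown
    note_md = {"body": {"@value": "**needs** checking",
                        "@type": "http://example.org/datatypes/markdown"}}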

fjh: I wanted to understand what you said about MarkDown

danbri: We were talking about how to indicate that the type of a cell was MarkDown
... if we're doing that for cells, maybe we should do it for annotations
... Is there a deployment scenario for annotations where the annotations appear in the body of the annotated resource?

Rob: At the moment, people have used RDFa to put annotations into the annotated thing
... but I don't think any other methods have been tried

danbri: We (Google) have a tool that will turn a CSV file into JSON
... Would it make sense ever for the cells to have the annotations in that CSV file as well

fjh: In theory but we haven't done it

gkellogg: You can have annotations that are part of an ePub. That doesn't stop someone else annotating it later
... you can always annotate a resultant HTML
... that can be annotated using the existing mechanisms

ivan: I want to understand where we are...
... especially as I am in both groups
... from the CSV POV, it's necessary for the simple case to be expressed, meaning that the body can be plain text
... and the structure discussed as possible solution is too complex (with types etc)
... switching to my Annotation hat. If it's actually JSON-LD then maybe some hidden magic can switch between string and object
... in which case the model needs to be adapted so that the value of body can be a string or an object
... can we trust the annotation WG to take care of the modification of the model and the CSV WG can carry on with either text or object

<Zakim> JeniT, you wanted to ask about whether we can just reference the annotation work and not say anything in our spec

JeniT: My ideal is that we don't specify anything other than "this is an array of annotations as defined by them over there"

fjh: I think that the 2 WGs should iterate...
... what is your timeline?

ivan: We don't know for sure. At the moment we're heading for closing by August 2015

fjh: We're hoping to get the model finalised by the end of the year
... we should be stable by then

ivan: We may have some dependencies that can mess with W3C process

fjh: I would suggest that this is not the biggest worry point

ivan: My fingers have been burned by mutual dependence before now

fjh: AIUI, the target for CSV is well specified already. And as far as representing the annotation we need to be able to do that simply
... and we need to be able to embed an annotation object within a CSV metadata file or separately
... we're in good shape with JSON-LD on that
... so this seems relatively low risk to me

Rob: The risk is allowing the body to be a literal - that's punning in RDF terms

ivan: We don't care about OWL, only RDF

Rob: I think as chairs we can commit to make it as simple as we can

fjh: Could we schedule time on one of your weekly calls?

ivan: if we change the call time, it will be just before the annotation call time

danbri: is there anything to record?

fjh: Don't put an action in tracker...

<Zakim> phila, you wanted to talk about implementations

<fjh> requirements for Annotation WG

<fjh> necessary to express the simple case simply, body: text

<fjh> embedded annotation within CSV metadata file

<fjh> using JSON-LD for embedding

<danbri> phila "meta q: we're not just talking about dependencies in terms of spec going fwd, but if the csvwg says 'hand it off to annotations', your implementations need to do that …' … i.e have you become dependent ...

<danbri> "what is the capacity in your (annotation) group in terms of implementing stuff?"

<danbri> fjh: "2 Qs - how do we get out of REC? vs For adoption in real world, will adopters need to impl the entire thing?"

phila: Worried about creating implementation dependencies

JeniT: There's parsing the metadata file and throwing errors when you find something unanticipated
... we could be loose and just say it has to be an object/array

<fjh> timeline question: when will Annotation WG have this done, CSV plans to complete by Next Aug; plan in annotation wg is to focus this year on model

JeniT: and leave it at that

<fjh> can iterate on upcoming CSV call

JeniT: There's validation of CSV files, where validation of annotations isn't relevant
... and there's the mapping to other formats - which for RDF is taken care of by JSON-LD. For JSON it's just JSON

<fjh> annotations need not be in metadata file

JeniT: so we can specify at that level and not put any extra burden on the CSV implementation - processing the annotation is separate and not our problem

phila: So can the Annotation WG commit to implementing?

fjh: No. But the notes slot is an extension point

JeniT: It can be

More discussion about where things get implemented

ivan: With this mapping of an annotation, when you do an RDF mapping for a JSON... ??
... you want the target to be transformed to the target of the transformed thing

fjh: I understand the concern about two parallel groups not being blocked by the other and the implementations not necessarily happening

<Zakim> danbri, you wanted to say it's just a piece of metadata

Consensus that if the data can be extracted we're done. UIs and workflows are not required for CR exit in either group

<ericP> ivan: in the case of annotation, you want to generate an object where the object has a property called "value" whose value is what's in the string.

<ericP> ... JSON-LD cannot really solve that issue

<ericP> gkellogg: JSON-LD allows you to express it as a string or an object

<ericP> ... we can do it at a different level, before we emit.

<ericP> danbri: could we define "simpleBody" or "bodyLiteral"?

<ericP> ivan: but then we don't use the openAnnotation model

<ericP> ... so it's really in that group's profile

<ericP> scribenick: ericP

<fjh> Thank you very much for having us for a joint session to discuss Web Annotation with respect to CSV

Syntax

<JeniT> http://w3c.github.io/csvw/syntax/

JeniT: issue 1: do we need a distinction between empty cells vs. cells with an empty string?
... i propose no

ivan: the CSV parsers that i've seen in python and JSON produce an empty string if there is an empty value
... no parsers will respect the difference.
... so unless we want to rewrite all of the parsers..

AxelPolleres: the metadata can discriminate between a 1 and a 1.0.
... the same argument says we could put this in the metadata

ivan: this is "in the absence of metadata"

AxelPolleres: so what about when there *is* metadata?

ivan: do we ever care about the difference?

jtandy: the metadata doc section on NULL talks about the token that you use to represent NULL.
... I agree with Ivan; if you definitely want NULL, you can use that token.

<JeniT> PROPOSAL: We have no difference between an empty string with quotes around it and one without quotes around it

<danbri> +1

<gkellogg> +1

<bill-ingram> +1

<JeniT> +1

<jtandy> +1

<ericstephan> +1

<ivan> +1

<AxelPolleres> +1

ericP: in RDF we'd notice the difference mphh MPHHH AGGGH!

<Hitoshi> +1

<danbri> RESOLUTION: We have no difference between an empty string with quotes around it and one without quotes around it

jtandy: the quotes around the cells are just syntax; they don't affect the value.
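
A quick check of this behaviour with Python's standard csv module (consistent with ivan's point about existing parsers):

    import csv
    import io

    # Both an unquoted and a quoted empty field parse to the empty string.
    rows = list(csv.reader(io.StringIO('a,,c\r\na,"",c\r\n')))
    assert rows[0] == rows[1] == ["a", "", "c"]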

JeniT: [issue 2]
... in the data model we have the notion of the annotations that come from the metadata (not the set of annotations as in OpenAnnotations)
... the whole set of annotations like language, datatypes, ...
... in the annotated data model, we have annotations at the table, column, row and cell.
... do we also want annotations on regions?
... i propose we leave that for later

ivan: +1

<JeniT> PROPOSAL: We do not support annotations on regions (for now)

<JeniT> +1

<gkellogg> +1

<danbri> +1

<ericstephan> +1

<AxelPolleres> 0 , we could enable rectangular regions easily based on 7111, and Excel syntax, but let’s leave it for now…

danbri: if you were passionate about it, you could create a derived CSV with ...

<jtandy> +1

<ivan> RESOLUTION: We do not support annotations on regions (for now)

jtandy: using Open Annotation, you could have one Annotation with multiple targets

<danbri> (aside: if you want to annotate regions you could make a whole new CSV package, which just had a cut out subset of the original CSV, … plus some currently unspecified provenance metadata to record which cells were 'cut out' from the source CSV)

JeniT: or one with a range spec
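
A sketch of the two options (Python; 'hasBody'/'hasTarget' come from the Open Annotation core model, while the fragments and body text are illustrative):

    # one annotation, multiple targets (jtandy's option)
    annotation = {
        "@type": "oa:Annotation",
        "hasBody": "These two cells look inconsistent",
        "hasTarget": ["toilets.csv#cell=2,2", "toilets.csv#cell=3,2"],
    }

    # one target using an RFC 7111 range (JeniT's option)
    annotation_range = dict(annotation, hasTarget="toilets.csv#cell=2,2-3,2")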

AxelPolleres: in RFC7111, you're allowed to annotate columns, rows and cells

jtandy: if you look at the cell syntax...

ivan: the metadata document doesn't use 7111

<jtandy> see http://tools.ietf.org/html/rfc7111#section-2.3

ivan: when you get to a cell, you have to merge metadata from different places.
... it's a little complicated. i don't think that adding annotations on ranges would be so useful

JeniT: [ issue 4]
... we have different mechanisms for finding docs:
... .. headers in the CSV
... .. metadata in a package
... .. pointing through a link doc
... .. locating via a standard path

<scribe> ACTION: JeniT to split out section 3.4 to discriminate between different paths to metadata files [recorded in http://www.w3.org/2014/10/27-csvw-minutes.html#action01]

<trackbot> Created ACTION-41 - Split out section 3.4 to discriminate between different paths to metadata files [on Jeni Tennison - due 2014-11-03].

JeniT: what do you do if there is more than one of the above available?
... .. if your package has a metadata file and there's a link header pointing to another, how do you combine them?
... e.g. if toilets.csv has a link header to toilets-metadata.json (included) and a metadata.json

jtandy: i believe we should merge with a precedence, and validation should warn about this

gkellogg: we haven't defined a mechanism for merging JSON-LD docs.
... (we have a way to merge @contexts)

ivan: i played with this situation.
... i found jQuery's extend to be useful.
... as jtandy proposes, it merges the contents, and the rule is that if A.x and B.x exist, take A.x, otherwise B.x
... in jQuery you can choose shallow or deep, but i think we only want deep.
... this is a clean model

<Zakim> danbri, you wanted to say additive except when same properties in which case undefined

ivan: i propose the same priority as in §3 ¶5

gkellogg: there are potential pitfalls to recursive merging: two elements that are individually ok but mutually inconsistent

jtandy: like throwing a warning when seeing the same thing in two places

ivan: this is a similar model to CSS cascading
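
A minimal sketch (Python) of the deep merge ivan describes, assuming plain JSON objects as dicts; document a has higher precedence than b:

    def merge(a, b):
        """Deep merge: take a.x if it exists, otherwise b.x; recurse where
        both sides hold objects (a jQuery-style deep extend)."""
        out = dict(b)
        for key, value in a.items():
            if isinstance(value, dict) and isinstance(b.get(key), dict):
                out[key] = merge(value, b[key])
            else:
                out[key] = value
        return out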

<Zakim> JeniT, you wanted to argue against because of the need to search for files

JeniT: there's a missing scenario, which is the recipient of the metadata overriding stuff

jtandy: local metadata should be below the row1 rule

<danbri> http://hxl.humanitarianresponse.info/

ericP: why wouldn't you want to override row1?

JeniT: in hxlstandard.org, you see that XHTML row 1 provides metadata about a column
... .. types
... .. language
... .. combining stuff
... it should be possible for HXL to override our stuff

ivan: i thought that the only metadata in a CSV is the first row being column names

JeniT: yes, for plain CSVs

<danbri> also maps into rdf triples/graphs — http://hxl.humanitarianresponse.info/docs/queries.php

JeniT: it would be possible for a different model to define a mapping to our metadata stuff

ivan: coming back to your user-level, we don't define how you get there
... so if i had a JS implementation, i could create some object.

phila: if the user chooses to define their own metadata, they can do whatever they want.

<Zakim> gkellogg, you wanted to note user-specified contexts in JSON-LD processing

phila: but if the title is in English, someone else will want it in Portuguese

gkellogg: in the JSON-LD expansion algorithm, @@1 is processed first.
... there's no current way to say that "2 levels deep in this doc, there's an X of type int and i want it to be a float"
... the concept from CSS is !important
... we could provide something like it, but we will have issues with the JSON-LD algorithms

JeniT: arguing against collecting all possible metadata docs, you have to do a bunch of optimistic GETs

<Zakim> JeniT, you wanted to complain about having to hunt for files

ivan: if we don't do that, and i look at the publisher who supplies a range of metadata files, the metadata.json becomes useless
... so i have to copy it into all of the metadata files.

JeniT: i think we're muddling up our access and metadata resolution

<Zakim> AxelPolleres, you wanted to reflect on Gregg’s comment.

JeniT: what keeps them from mechanically combining their sources to create one metadata file?

ivan: publishers aren't going to know to do that

AxelPolleres: the client could load a metadata file that points to another and finally the CSV

gkellogg: there's no import in JSON-LD, but there is in @context

<JeniT> PROPOSAL: We use an ‘import’ property in the first metadata document you find to merge in metadata from other files

laufer: do different types of metadata have the same precedence order? (e.g. license vs. title)

JeniT: i think they all have to be the same

laufer: new types of metadata will have to agree with our ordering

<ivan> +1

<danbri> -0.0

jtandy: in your UC4 example, you used a schema IRI to reference another doc

<danbri> in here somewhere - https://github.com/w3c/csvw/tree/gh-pages/examples/tests/scenarios/uc-4/attempts - ?

<Zakim> ericP, you wanted to ask if users will find links to metadata files

<ericstephan> +1

<JeniT> PROPOSAL: We use an ‘import’ property in the first metadata document found through the precedence hierarchy described in section 3 (but with inclusion of user-defined metadata); the merge is a depth first recursive inclusion

+1 (as observer)

<bill-ingram> +1

<ivan> +1

<gkellogg> +1

<JeniT> +1

<jtandy> +1

<ivan> RESOLUTION: We use an ‘import’ property in the first metadata document found through the precedence hierarchy described in section 3 (but with inclusion of user-defined metadata); the merge is a depth first recursive inclusion

<AxelPolleres> +1

ivan: 2⅞...
... we need lots of examples in the spec
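
In that spirit, a sketch of the resolved behaviour (Python), reusing the merge() sketch above; the 'import' property name and its array shape are assumptions pending the spec text, and fetch() stands in for retrieving and parsing a JSON document:

    def resolve(url, fetch):
        """Depth-first recursive inclusion: the importing document takes
        precedence over anything it imports."""
        doc = fetch(url)
        for imported in doc.get("import", []):
            doc = merge(doc, resolve(imported, fetch))
        return doc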

JeniT: [issue 3.2 packaging]
... tomorrow we have a specific issue around multiple CSV files
... but we have the general problem of how to package all this stuff

ivan: we won't get packaging on the web done before we finish.
... they can accept current stuff, e.g. zip, gzip.

gkellogg: we can address the result of unpacking

JeniT: [issue 7: link header]
... rel="describedBy"
... plus the content type (which is the metadata media type)

gkellogg: we tried various things in JSON-LD and came back to DescribedBy

<JeniT> PROPOSAL: We will use ‘describedby’ as the relevant link relation

<JeniT> +1

<danbri> +1

<ivan> +1

<jtandy> +1

<bill-ingram> +1

<bjdmeest> +1

<gkellogg> +1

+1 (as observer listening to gkellogg)

<ivan> RESOLUTION: We will use ‘describedby’ as the relevant link relation
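
For illustration, such a response header might look like this (the file name is illustrative, and the media type anticipates issue 12 below):

    Link: <toilets-metadata.json>; rel="describedby"; type="application/csv-metadata+json"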

JeniT: [issue 8: standard path]
... we have two standard paths:
... .. file-specific
... .. more generic metadata file
... pushback on .CSVM 'cause processors will understand .JSON

<danbri> .csvm vs .json

<AxelPolleres> how about .csv.json ?

JeniT: propose toilets.csv -> toilets.csv.json

<hadleybeeman> And would this be better as a best practice rather than "hacking the URI"?

<AxelPolleres> http://jsontocsvconverter.example.org?input=file1.csv.json :-)

<AxelPolleres> http://jsontocsvconverter.example.org?input=file1.json

<AxelPolleres> —> metadata http://jsontocsvconverter.example.org?input=file1.json.json ?

<Zakim> danbri, you wanted to ask re i18n

danbri: what happens in non-latin scripts at the end?

ivan: if it's all Chinese chars, we end with ".json"

<Zakim> hadleybeeman, you wanted to ask what of this is specific to CSV data?

hadleybeeman: how much of this is specific to CSV vs. other data?

<AxelPolleres> maybe better … /metadata.json

hadleybeeman: 2. JeniT said hacking URIs is unpleasant so i wonder if this should be a "best practice"

jtandy: i think this is specifically about CSV/TSV (tabular data)

ericstephan_: this is about the connection between the CSV file and the metadata file.
... if you save a file in your favorite office tool and give it a file extension, it appends ".doc" anyway

ivan: if there is a publisher that puts out a bunch of CSV data, and there's a need for the query component, they can [damn well] use the link header
... this -.json is for folks who can't control the server
... so if there's a '?' in the URI, don't look for the .json

AxelPolleres: i don't like ".json", can we have "-metadata.json"?

JeniT: the point of these simple methods of finding the metadata is that in many environments, folks have no control over http headers

<AxelPolleres> can we vote on the suffix? “-metadata.json” ?

<JeniT> PROPOSAL: We find a metadata file by adding ‘.json’ to the end of the URL of the CSV file, but only if the URL doesn’t contain a query component

<JeniT> PROPOSAL: We find a metadata file by adding ‘-metadata.json’ to the end of the URL of the CSV file, but only if the URL doesn’t contain a query component

<jtandy> +1

<danbri> +0.3

<ericstephan_> +1

<ivan> +1

<AxelPolleres> +1

<JeniT> +1

<gkellogg> +0.1

+1 (as observer, under the influence)

<ivan> RESOLUTION: We find a metadata file by adding ‘-metadata.json’ to the end of the URL of the CSV file, but only if the URL doesn’t contain a query component
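
A minimal sketch (Python) of the resolved rule; metadata_location() is a hypothetical helper name:

    from urllib.parse import urlparse

    def metadata_location(csv_url):
        """Append '-metadata.json', but only when the URL has no query component."""
        if urlparse(csv_url).query:
            return None  # fall back to the other discovery mechanisms
        return csv_url + "-metadata.json"

    assert (metadata_location("http://example.org/toilets.csv")
            == "http://example.org/toilets.csv-metadata.json")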

JeniT: [issue 9: default navigational climb]

<AxelPolleres> We should note in the doc that the link header is the preferred version, yes?

JeniT: still completely possible. i might have a metadata file at the top of a directory of csv files

<danbri> "The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://example.com/catalog/sitemap.xml can include any URLs starting with http://example.com/catalog/ but can not include URLs starting with http://example.com/images/.

<danbri> If you have the permission to change http://example.org/path/sitemap.xml, it is assumed that you also have permission to provide information for URLs with the prefix http://example.org/path/."

jtandy: i propose that we say that there's value but that we decided not to do it.

<JeniT> PROPOSAL: We do not traverse path hierarchies to locate metadata files

<jtandy> +1

<gkellogg> +1

<AxelPolleres> +1

<danbri> +1

<ericstephan_> +1

<JeniT> +1

<bill-ingram> +1

<hadleybeeman> +1 as observer

<jtandy> gkellogg suggests that this use case might be resolved using the package mechanism

laufer: CKAN has a resolution for this

JeniT: [issue 10]

<JeniT> RESOLUTION: We do not traverse path hierarchies to locate metadata files

jtandy: we have plenty of good ways to find things. if people want to add more, they have to motivate us.

danbri: we should have an informative ref to site-map

<danbri> ACTION: danbri propose a sentence informative-referencing sitemaps.org xml format [recorded in http://www.w3.org/2014/10/27-csvw-minutes.html#action02]

<trackbot> Created ACTION-42 - Propose a sentence informative-referencing sitemaps.org xml format [on Dan Brickley - due 2014-11-04].

<AxelPolleres> So, either link header or ‘[originalcsvfilename]-metadata.json’, with the former preferred, that’s it, yes?

laufer: you talk of a structure that you can access directly.
... in this case, DCAT can make this link.

<jtandy> meaning that if people want to include additional mechanisms to find the metadata file (e.g. sitemaps), then they need to provide a rational argument for doing so

<danbri> i.e. http://www.w3.org/TR/vocab-dcat/#class-distribution

laufer: you have link and distribution
... in DWBP, we plan to extend DCAT
... so i don't know if you can add this extension.

JeniT: [issue 12: separate media type]
... i think we need a specific media type, e.g. application/csv-metadata+json

ivan: doc has to be at a certain level of maturity

JeniT: i think that issue 15 (trimming whitespace) is an IETF issue

<danbri> see also http://tools.ietf.org/html/rfc4180

<phila> ADJOURNED

<AxelPolleres> example of what we did in SPARQL regarding mimetypes: http://www.w3.org/TR/rdf-sparql-json-res/#mediaType

Summary of Action Items

[NEW] ACTION: danbri propose a sentence informative-referencing sitemaps.org xml format [recorded in http://www.w3.org/2014/10/27-csvw-minutes.html#action02]
[NEW] ACTION: JeniT to split out section 3.4 to discriminate between different paths to metadata files [recorded in http://www.w3.org/2014/10/27-csvw-minutes.html#action01]
 
[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.135 (CVS log)
$Date: 2014/10/28 02:22:43 $