CSV on the Web Working Group Teleconference

05 Feb 2014



Jeremy Tandy (jtandy), Ivan Herman (Ivan), Jeni Tennison (JeniT), Eric Stephan (ericstephan), Davide Ceolin (davideceolin), Dan Brickley (danbri), Axel Polleres (AxelPolleres), Alf Eaton (fresco), Alfonso Noriega (fonsoN), Tim Finin (TimFinin)
Andy Seaborne, Ross Jones, Stasinos Konstantopoulos


<ivan> Agenda: https://www.w3.org/2013/csvw/wiki/Meeting_Agenda_2014-02-05

Minutes: http://www.w3.org/2014/01/29-csvw-minutes.html

<jtandy> none

Meeting notes accepted


QUESTION: can we confirm we're OK with the current plan to hold the F2F before TPAC?

Some events other than the F2F might also be useful opportunities to collaborate in the same room

<JeniT> note it's in less than 8 weeks time

<danbri> ACTION: axelpollerres take a lead arranging an *informal* gathering of wg members at EDF [recorded in http://www.w3.org/2014/02/05-csvw-minutes.html#action01]

Use Cases and Requirements

<jtandy> Raj Singh and I plan to meet to discuss things at the OGC TC meeting in Washington, late March ... will add to the wiki

<AxelPolleres> EDF 2014 http://2014.data-forum.eu/

<danbri> charter: http://www.w3.org/2013/05/lcsv-charter

JeniT: 3rd working draft in March 2014, we were supposed to start earlier

... we need a volunteer to deliver that document and expect everyone to contribute

<danbri> wiki materials - https://www.w3.org/2013/csvw/wiki/Use_Cases

... we need that by next week

Jeremy: I am willing to help

JeniT: Others can help as well

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_Cases#Other_W3C_use_cases_and_requirement_docs

JeniT: Dan has put together a wiki page section that references use case and requirements documents done by other groups

<danbri> wiki page contributors so far: Adam Retter, Jeni Tennison, Jeremy Tandy, Andy Seaborne, Alf Eaton, Davide Ceolin, Martine de Vos via Davide Ceolin, …

JeniT: The more concrete examples the better

Danbri: The OWL example stood out
... SKOS stood out as a real-world project

<danbri> SKOS

<jtandy> jeremy is here

<AxelPolleres> ack

<JeniT> ericstephan: we're trying to pull together use cases; how much detail do you want on the data?

<JeniT> ... I'm trying to provide data about where it's coming from, how people are using it etc

<JeniT> danbri: a high-level big picture, and then some concrete samples of data

<jtandy> lol

<JeniT> ericstephan: I'll exclude data that hasn't been published

Thank you

<JeniT> ivan: in the meantime, I'll put it up for you

Ivan: Contact me if you want to put this on the wiki

Danbri: Can we go over who has contributed?

<danbri> jeremy now, then davide, jeni, …

Jeremy: My use case is weather observation data

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_Cases#Publication_of_weather_observation_time-series_data_as_input_into_analysis_or_impact_assessment

Jeremy: It brought out specific issues and key requirements; there are no formal semantics associated with CSV

<danbri> ISSUE: there is no machine-readable mechanism available to describe how the set of files are related

Jeremy: Often the data files are partitioned into multiple files and structures
... A property that applies to each entity is summarised at the file level.
... Under the proposal section of the wiki I put a link back to JeniT's document

Danbri: Are you using packaging mechanisms?

Jeremy: Typically not; the docs are on the web next to the dataset

Danbri: Different parties are involved in the workflow; who ultimately sets the structure of the CSV?

<Zakim> JeniT, you wanted to ask if a zip would be appropriate for this data in any case

Jeremy: The structure of the csv is based on who produces the data. Need to do more digging into the users

JeniT: With the weather data you are dealing with, you might want to have various packages

Jeremy: They can get large, e.g. 100 million records

JeniT: Can you point to data and how it fits in the dataset?

Jeremy: Yes

<danbri> https://www.w3.org/2013/csvw/wiki/Use_Cases#Reliability_Analysis_of_Police_Open_Data

Davideceolin: Over time, different CSV files have represented different kinds of information, such as crime counts. Over time, different policies have led to different formats.

Davideceolin: I have already developed the tool; being able to automatically identify the elements in the CSV file would be useful

<fresco> A useful question to ask is whether the CSV table itself is appropriately formatted. For example, the HadCET data has "year + day of month" as rows and "month" as columns, whereas it would be easier to process if each row was a single day, and all the values were all in a single column.
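
[Editorial sketch] The reshaping fresco describes, from a "year + day of month" by "month" grid to one row per day, could look roughly like the following. This is a minimal sketch: the column names and values are invented for illustration and are not taken from the HadCET files, and empty cells are assumed to mark days that do not exist.

```python
import csv
import io
from datetime import date

# Hypothetical sample: rows are "year + day of month", columns are months.
# Values are invented for illustration only.
WIDE = """year,day,jan,feb,mar
2014,1,4.1,5.0,6.2
2014,30,3.9,,7.0
"""

def wide_to_long(text):
    """Yield (ISO date, value) pairs, one per day, skipping empty cells."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header; month columns follow year and day
    for row in reader:
        year, day = int(row[0]), int(row[1])
        for month_index, cell in enumerate(row[2:], start=1):
            if cell == "":
                continue  # assumed marker for a non-existent day, e.g. 30 February
            yield date(year, month_index, day).isoformat(), float(cell)

long_rows = sorted(wide_to_long(WIDE))
for day, value in long_rows:
    print(day, value)
```

With one value per row, sorting and filtering by date need no knowledge of the grid layout.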

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_Cases#Analysis_of_Scientific_Spreadsheets

<danbri> Martine de Vos

<danbri> contributed via davideceolin

Davideceolin: Spreadsheets (e.g. Excel): I haven't reported any yet, but working out the meaning of figures and numbers in a spreadsheet is difficult. We need a better understanding of the content.

<JeniT> that's what DataCube helps with

Danbri: Speaking to people working with statistics, everything was footnoted and annotated by links etc.; some of the early RDF data put things at a graph level.

<danbri> http://www.w3.org/TR/2013/PR-vocab-data-cube-20131217/

JeniT: Works with statistical data, and provides that kind of thing

<jtandy> my proposal also seeks to use RDF Data Cube ...

danbri: The distinction between describing csv today versus best practices for the future.

<JeniT> ie should we be putting something together that works with currently published CSV files

danbri: Where should we be on that spectrum?

<JeniT> or trying to get people to publish CSV differently

<danbri> tx, yes

Jeremy: People may do things the way they do because it is already useful to them, or because they don't need anything better.

<danbri> ericstephan: netcdf (community) uses to publish their datastream with

<danbri> … conventions along lines of best practices

<danbri> … i like idea of providing a means/solution ppl can move towards. not dictatorial, …

<fresco> I think it's useful to show best practises in terms of "if you publish data like this, then you can process it as easily as this"

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_Cases#Publishing_the_results_of_scientific_experiments

fresco: Replicates are always a problem in spreadsheets.

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_Cases#Visualisation_of_time_series_data_with_annotations

fresco: The other thing from a scientific experiment is knowing what data is in each cell. Knowing where the data came from would also be helpful

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_Cases#Processing_search_results_from_Solr

<jtandy> jeremy agrees with alf

fresco: Searching: you want to know from the metadata, like OpenSearch but in CSV, what the offset and the number of rows are. When publishing, you may want people to break large CSV files up into chunks.
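
[Editorial sketch] One way to picture the chunking and paging metadata mentioned here is shown below. It is only an illustration: the metadata keys ("offset", "rows", "total") are assumptions made for this sketch, not a format agreed by the group or defined by OpenSearch.

```python
import csv
import io

def chunk_csv(text, chunk_size):
    """Split CSV text into chunks that repeat the header and carry paging metadata."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    chunks = []
    for offset in range(0, len(body), chunk_size):
        part = body[offset:offset + chunk_size]
        chunks.append({
            "offset": offset,    # index of the first data row in this chunk
            "rows": len(part),   # number of data rows in this chunk
            "total": len(body),  # number of data rows in the whole file
            "header": header,
            "data": part,
        })
    return chunks

# Invented sample data.
SAMPLE = "id,title\n1,a\n2,b\n3,c\n4,d\n5,e\n"
chunks = chunk_csv(SAMPLE, 2)
for chunk in chunks:
    print(chunk["offset"], chunk["rows"], chunk["total"])
```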

<jtandy> ... need to extract subsets from a larger dataset

fresco: Publishing individual csv files

<JeniT> +1 to link relations between CSV files

fresco: Adding annotations to ??? if you want to annotate a particular cell.

<JeniT> http://tools.ietf.org/search/rfc7111
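
[Editorial sketch] RFC 7111 defines fragment identifiers for text/csv that select rows, columns or cells. The simplified resolver below handles only the single "row=" and "cell=" forms. It assumes that rows and columns are numbered from 1 and that the header line counts as row 1; the RFC itself is the authority on the exact rules. The sample data is illustrative.

```python
import csv
import io

def select_fragment(text, fragment):
    """Resolve a simplified 'row=N', 'row=N-M' or 'cell=R,C' fragment against CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    scheme, _, value = fragment.partition("=")
    if scheme == "row":
        first, _, last = value.partition("-")
        start = int(first)
        end = int(last) if last else start
        return rows[start - 1:end]  # 1-based, inclusive range (assumed convention)
    if scheme == "cell":
        row_number, col_number = (int(n) for n in value.split(","))
        return rows[row_number - 1][col_number - 1]
    raise ValueError("unsupported fragment in this sketch: " + fragment)

# Illustrative sample data.
SAMPLE = (
    "date,temperature,place\n"
    "2011-01-01,1,Galway\n"
    "2011-01-02,-1,Galway\n"
    "2011-01-03,0,Galway\n"
)
print(select_fragment(SAMPLE, "row=4"))
print(select_fragment(SAMPLE, "cell=4,1"))
```

An annotation on a particular cell could then point at a URL ending in such a fragment.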

<TimFinin> I'll try to add a usecase relevant to use of CSV files for output of text information extraction systems. A common requirement is linking extracted facts to string offsets in a document.

<JeniT> TimFinin, great, that sounds like a useful use case

jeremy: I agree. When we are querying the datasets we may not know what it is, but we want to know the logical structure of the csv

<Zakim> JeniT, you wanted to mention fragids

<danbri> [I lost audio briefly]

ivan: tsv is not covered by csv?

<fresco> character-separated values?

jeniT: We need to answer the question about delimiters

<danbri> can you hear me?

<AxelPolleres> FWIW, as for binary formats for large amounts of data RDF HDT format might be a fit here? (although RDF compression not really in the scope of this group...) http://www.w3.org/Submission/2011/SUBM-HDT-20110330/

danbri: We are drifting into protocol design.
... is there a distinction between search results and data on the web?

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_Cases#Publication_of_Statistics

<danbri> e.g. http://www.ons.gov.uk/ons/index.html

JeniT: If you work with statistics you care a lot about annotation and metadata
... People use Excel, which provides that level of added metadata

<JeniT> https://www.w3.org/2013/csvw/wiki/Use_Cases#Organogram_Data

JeniT: Organogram data. Linked data for sharing organizational structures. Easiest way of sharing government org data because everyone works with spreadsheets.

<danbri> (I like this observation, "When the CSVs are published on the web, they need to reference this centrally defined schema (rather than, say, being packaged with a copy of schema) to make sure that they are adhering to the correct format." … this was my Q to Jeremy re workflow and definitions)

JeniT: Two csv files are published together and reference each other by identifier.
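
[Editorial sketch] Two CSV files that reference each other by identifier can be resolved with a simple lookup, as sketched below. The column names (post_id, reports_to) and the sample rows are invented for illustration; they are not the actual organogram schema.

```python
import csv
import io

# Hypothetical files: junior posts point at senior posts by identifier.
SENIOR = "post_id,title\nS1,Director\nS2,Deputy Director\n"
JUNIOR = "post_id,title,reports_to\nJ1,Analyst,S2\nJ2,Clerk,S1\n"

def link_posts(senior_text, junior_text):
    """Return (junior title, senior title) pairs by following the reports_to identifier."""
    senior = {row["post_id"]: row for row in csv.DictReader(io.StringIO(senior_text))}
    linked = []
    for row in csv.DictReader(io.StringIO(junior_text)):
        manager = senior.get(row["reports_to"])  # None if the reference dangles
        linked.append((row["title"], manager["title"] if manager else None))
    return linked

linked = link_posts(SENIOR, JUNIOR)
print(linked)
```

A dangling identifier shows up as None here, which is the kind of cross-file check a shared schema would make possible.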

<jtandy> good examples

danbri: What kinds of questions should we be asking?

JeniT: I think we should pull out requirements for what the technology needs to do. What are the requirements in the particular use case? For example, there must be a way to point to a schema and not repeat it again and again.

danbri: Did you look at the RDB2RDF spec for the Organogram case?

JeniT: No

<ivan> info on the RDB to RDF mappings

JeniT: List within a cell? Can you identify that?

danbri: Most CSV is pretty boring, but it's good to note other examples

Definition of CSV

JeniT: Thought it would be useful to talk about what we mean by CSV

<JeniT> http://dataprotocols.org/csv-dialect/

JeniT: What delimiters etc. can be used, and what encodings are supported by different applications.
... Conventions, and how they are mapped into an infoset for CSV; one frequent trouble spot is where commas are used in names.

<danbri> http://en.wikipedia.org/wiki/Decimal_mark#Countries_using_Arabic_numerals_with_decimal_comma

<danbri> ack me?

JeniT: We need to pull together a definition on roughly the same timeline as the use cases.

<Zakim> danbri, you wanted to mention http://www.w3.org/wiki/WebSchemas/LookInside#Background_Research_.26_Related_Work (R, Octave, Matlab) data frames

<JeniT> also cf https://github.com/theodi/csv-validation-research

<danbri> (aside: I read some csvs use diff encoding in each row!)

AxelPolleres: It's not only delimiters, decimal points, and code conventions; there are also language differences in CSV files
... Makes integration difficult.
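
[Editorial sketch] The delimiter and decimal-mark differences raised here can be made concrete with a small example. The dialect description below is hypothetical (its key names are assumptions for this sketch, not the csv-dialect vocabulary), and the sample data is invented.

```python
import csv
import io

# Hypothetical dialect description; key names are assumptions for this sketch.
DIALECT = {"delimiter": ";", "decimal_mark": ","}

# Invented sample: semicolon-delimited, with decimal commas.
SAMPLE = "station;temp\nVienna;3,5\nGalway;-1,25\n"

def read_numbers(text, dialect, numeric_columns):
    """Parse CSV text using the dialect and convert the named columns to floats."""
    reader = csv.DictReader(io.StringIO(text), delimiter=dialect["delimiter"])
    rows = []
    for row in reader:
        for column in numeric_columns:
            # Normalise the decimal mark before converting.
            row[column] = float(row[column].replace(dialect["decimal_mark"], "."))
        rows.append(row)
    return rows

rows = read_numbers(SAMPLE, DIALECT, ["temp"])
for row in rows:
    print(row["station"], row["temp"])
```

Without the dialect information, a comma-delimited reader would split "3,5" into two fields, which is the integration problem described above.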

JeniT: This is exactly the same problem I have encountered. Excel fixes some things, but tools differ on how things should be escaped and delimited, so you get different kinds of behavior. We need to look in depth at import and export capabilities to understand the constraints

<JeniT> ericstephan: there are other tools than Excel, and other binary tabular formats than Excel

danbri: Looking for large target user communities


<AxelPolleres> FWIW, also quotes and quotes escaping are an issue on CSV "in the wild"... although it is specified in http://www.ietf.org/rfc/rfc4180.txt ... it would be nice to provide cleansing tools a la xmltidy for CSV :-)

<danbri> next scribe: danbri

<JeniT> AxelPolleres, yes, but we need to specify what to cleanse into!
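
[Editorial sketch] As one illustration of "cleansing into" a target form, the snippet below reads a CSV that escapes quotes with backslashes and rewrites it with the doubled-quote escaping and CRLF line endings described in RFC 4180. Treating backslash escaping as the input convention is an assumption for this sketch; real-world files vary, and the group has not chosen a target format.

```python
import csv
import io

def tidy_quotes(text):
    """Re-serialise backslash-escaped CSV using doubled quotes and CRLF line endings."""
    reader = csv.reader(io.StringIO(text), escapechar="\\", doublequote=False)
    out = io.StringIO()
    writer = csv.writer(out)  # defaults: doubled quotes, minimal quoting, "\r\n"
    writer.writerows(reader)
    return out.getvalue()

# Invented input using backslash-escaped quotes inside a quoted field.
MESSY = 'name,note\nA,"He said \\"hi\\""\n'
tidied = tidy_quotes(MESSY)
print(tidied)
```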

danbri: Anything else?

<JeniT> ericstephan, thanks for scribing!

<danbri> yes, thanks ericstephan!

<ivan> trackbot, end telcon

Summary of Action Items

[NEW] ACTION: axelpollerres take a lead arranging an *informal* gathering of wg members at EDF [recorded in http://www.w3.org/2014/02/05-csvw-minutes.html#action01]
[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.138 (CVS log)
$Date: 2014-02-05 14:28:41 $