CSVTemplating status

From CSV on the Web Working Group Wiki

A Summary of the CSV conversions

2014.9.1, Ivan Herman

This is just a set of notes trying to summarize the state of discussion in the Working Group. There is no formal consensus on anything, only some sort of an informal one (meaning that nobody strongly objected yet…)

The basic approach is to use some sort of a templating mechanism. The advantage is that the same approach can be used for different output formats (JSON, XML, or some RDF syntax). At the moment the syntax follows that of Mustache wherever possible, but that can change.

What follows is *my* interpretation of where we have got to, and I have made slight changes to the various examples from the mailing list. My main source is a long thread started by Jeremy.

Basic structure

What has been discussed so far is to use a template per CSV row. I.e., processing a CSV file means taking each row one by one and applying the template to it to produce the output.

Simple example and template

A canonical example used by Jeremy is:

   O*NET-SOC 2010 Code,O*NET-SOC 2010 Title,O*NET-SOC 2010 Description
   15-1199.03,Web Administrators,"Manage web environment design, deployment, development and maintenance activities. [...]"

A simple way of getting to the templates is through the metadata (I have modified Jeremy’s version a bit to adapt it to the latest version of the metadata document):

   {
       "@id"   : "2010_Occupations.csv",
       "title" : "2010 Occupations",
       ...
       "schema" : {
           "columns" : [{
               "name" : "onet-soc-2010-code",
               "title": "O*NET-SOC 2010 Code",
               ...
           }, {
               "name" : "title",
               "title": "O*NET-SOC 2010 Title",
               ...
           }]
       },
       "template": {
           "path":      "2010_Occupations-csv-to-ttl.ttl",
           "hasFormat": "text/turtle"
           ...
       }
   }

The template itself may be something as trivial as the following (for a row):

   ex:{{name}} a ex:ONETSOC-Occupation ;
       skos:notation "{{name}}" ;
       skos:prefLabel "{{title}}" ;
       ...

That is, for each row, the key of a column (its `name`) identifies the cell whose value is substituted in the template.
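
The per-row substitution can be sketched in Python as follows. This is only an illustration under assumptions of mine: that each `{{key}}` in the template is a column `name` from the metadata, and that columns are mapped to CSV headers through their `title`. The helper names (`render_row`, `convert`) and the shortened one-line template are hypothetical, not something the Working Group has agreed on.

```python
import csv
import io
import re

# Hypothetical mapping from column "title" (the CSV header) to column "name",
# as it would be read from the "schema" part of the metadata.
TITLE_TO_NAME = {
    "O*NET-SOC 2010 Code": "onet-soc-2010-code",
    "O*NET-SOC 2010 Title": "title",
}

# Matches {{key}} but not control tags such as {{#if ...}}.
PLACEHOLDER = re.compile(r"\{\{([^{}#]+)\}\}")


def render_row(template, values):
    """Replace every {{key}} in the template with the value for that key."""
    return PLACEHOLDER.sub(lambda m: values[m.group(1).strip()], template)


def convert(csv_text, template):
    """Apply the template to each data row and return the rendered outputs."""
    outputs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        values = {TITLE_TO_NAME[t]: v for t, v in row.items() if t in TITLE_TO_NAME}
        outputs.append(render_row(template, values))
    return outputs


CSV_TEXT = (
    "O*NET-SOC 2010 Code,O*NET-SOC 2010 Title\n"
    "15-1199.03,Web Administrators\n"
)
TEMPLATE = 'ex:{{onet-soc-2010-code}} skos:prefLabel "{{title}}" .'

print(convert(CSV_TEXT, TEMPLATE)[0])
# prints: ex:15-1199.03 skos:prefLabel "Web Administrators" .
```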

Variables

However, that does not really go far enough, and there is a need for some sort of variables. That means 'variable names' are set for each cell in a row, depending on whether the cell itself matches a regular expression. Taking Jeremy’s example above a bit further (I used the term 'variables' instead of 'microsyntax'):

   {
       "@id"   : "2010_Occupations.csv",
       "title" : "2010 Occupations",
       ...
       "schema" : {
           "columns" : [{
               "name" : "onet-soc-2010-code",
               "title": "O*NET-SOC 2010 Code",
               ...
               "variables" : [{
                   "name"   : "soc-major-group",
                   "regexp" : "/^(\d{2})-\d{4}\.\d{2}$/"
               }, {
                   "name"   : "soc-minor-group",
                   "regexp" : "/^\d{2}-(\d{2})\d{2}\.\d{2}$/"
               }]
           }, {
               "name" : "title",
               "title": "O*NET-SOC 2010 Title",
               ...
           }]
       },
       "template": {
           "path":      "2010_Occupations-csv-to-ttl.ttl",
           "hasFormat": "text/turtle"
           ...
       }
   }

A simple usage of variables is to refine the template:

   ex:{{name}} a ex:ONETSOC-Occupation ;
        skos:notation "{{name}}" ;
        skos:prefLabel "{{title}}" ;
        skos:broader ex:{{onet-soc-2010-code.soc-major-group}}-0000,
                     ex:{{onet-soc-2010-code.soc-major-group}}-{{onet-soc-2010-code.soc-minor-group}}00,
       ...

(Note the "namespace" like notation for the variable names, to avoid name clashes among different columns.)
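
A small Python sketch of how such variables might be computed for one cell. It assumes that a variable is set to the first capture group of its regular expression and is left undefined when the expression does not match; the notes do not settle that rule. The dot in the patterns is escaped here so that it only matches a literal '.', and `VARIABLES` and `extract_variables` are hypothetical names.

```python
import re

# Hypothetical per-column variable definitions, mirroring the "variables"
# entries in the metadata (without the surrounding "/" delimiters).
VARIABLES = {
    "onet-soc-2010-code": {
        "soc-major-group": r"^(\d{2})-\d{4}\.\d{2}$",
        "soc-minor-group": r"^\d{2}-(\d{2})\d{2}\.\d{2}$",
    }
}


def extract_variables(column, cell):
    """Return {"column.variable": captured text} for every regexp that matches."""
    found = {}
    for var_name, pattern in VARIABLES.get(column, {}).items():
        match = re.match(pattern, cell)
        if match:
            # Assumption: the variable takes the value of the first capture group.
            found[f"{column}.{var_name}"] = match.group(1)
    return found


print(extract_variables("onet-soc-2010-code", "15-1199.03"))
# prints: {'onet-soc-2010-code.soc-major-group': '15', 'onet-soc-2010-code.soc-minor-group': '11'}
```

The resulting dictionary can be merged with the plain column values before the template is applied, so that both kinds of keys are resolved the same way.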

Conditionals

The real use of variables is in 'conditionals', i.e., to control which template fires. Two approaches emerged.

Usage of several template files

This approach is based on the assumption that the template file itself has to be (mostly) valid syntax of the target, i.e., proper Turtle for a Turtle template, proper JSON for a JSON template, etc. The branching is achieved in the metadata:

   "template": [{
       "conditional-match" : "onet-soc-2010-code:soc-major-group",
       "path"              : "2010_Occupations-csv-to-ttl-1.ttl",
       "hasFormat"         : "text/turtle"
   }, {
       "conditional-match" : "onet-soc-2010-code:soc-minor-group",
       "path"              : "2010_Occupations-csv-to-ttl-2.ttl",
       "hasFormat"         : "text/turtle"
   }]

I.e., the choice of a specific template depends on whether the regular expression of the named variable is a match.
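
A possible selection step, sketched in Python. It assumes that the entries are tried in document order and that the first entry whose variable matched wins; the notes do not say what should happen when several entries match, or when none does. `select_template` is a hypothetical helper.

```python
# Hypothetical in-memory form of the "template" array from the metadata.
TEMPLATES = [
    {"conditional-match": "onet-soc-2010-code:soc-major-group",
     "path": "2010_Occupations-csv-to-ttl-1.ttl"},
    {"conditional-match": "onet-soc-2010-code:soc-minor-group",
     "path": "2010_Occupations-csv-to-ttl-2.ttl"},
]


def select_template(templates, matched_variables):
    """Return the path of the first template whose variable matched, else None."""
    for entry in templates:
        if entry["conditional-match"] in matched_variables:
            return entry["path"]
    return None  # assumption: no template fires for this row


# A row for which only the minor-group expression matched.
print(select_template(TEMPLATES, {"onet-soc-2010-code:soc-minor-group"}))
# prints: 2010_Occupations-csv-to-ttl-2.ttl
```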

Usage of a conditional syntax in the template itself

In this case the restriction on the template syntax is dropped (i.e., it is all right if the template as a whole is not valid, say, Turtle), and the conditional is added to the template itself:

   ex:{{name}} a ex:ONETSOC-Occupation ;
        skos:notation "{{name}}" ;
        skos:prefLabel "{{title}}" ;
    {{#if onet-soc-2010-code.soc-major-group}}
        skos:broader ex:{{onet-soc-2010-code.soc-major-group}}-0000,
    {{#elif onet-soc-2010-code.soc-minor-group}}
        skos:broader ex:{{onet-soc-2010-code.soc-major-group}}-{{onet-soc-2010-code.soc-minor-group}}00,
    {{#endif}}
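
To give an idea of the processing this implies, here is a rough Python sketch that resolves `{{#if}}`, `{{#elif}}` and `{{#endif}}` before the ordinary substitution takes place. It assumes that a condition is true when the named variable is defined for the current row, that only the first true branch is kept, and that conditionals are not nested. None of this has been agreed, and `apply_conditionals` is a hypothetical name.

```python
import re

# Matches {{#if name}}, {{#elif name}} and {{#endif}}.
TOKEN = re.compile(r"\{\{#(if|elif|endif)(?:\s+([^{}]+?))?\}\}")


def apply_conditionals(template, variables):
    """Keep only the text of the first branch whose variable is defined."""
    out = []
    pos = 0
    emitting = True        # True outside conditionals and inside the chosen branch
    branch_taken = False   # True once a branch of the current conditional was chosen
    for m in TOKEN.finditer(template):
        if emitting:
            out.append(template[pos:m.start()])
        pos = m.end()
        kind, name = m.group(1), m.group(2)
        if kind == "if":
            emitting = name in variables
            branch_taken = emitting
        elif kind == "elif":
            emitting = (not branch_taken) and name in variables
            branch_taken = branch_taken or emitting
        else:  # endif
            emitting = True
            branch_taken = False
    if emitting:
        out.append(template[pos:])
    return "".join(out)


TEMPLATE = "A\n{{#if x}}X\n{{#elif y}}Y\n{{#endif}}Z"
print(apply_conditionals(TEMPLATE, {"y"}))
# prints three lines: A, Y, Z
```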

Issues/notes

What should be the output for an undefined pattern

I.e., what should happen if:

   ex:{{name}} rdf:type ... 

cannot be resolved, because 'name' is not a valid key in the metadata? Should the whole input line simply be dropped?
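
One of the possible answers, dropping the row, could look like the Python sketch below. It is only meant to make the question concrete; raising an error or emitting an empty string are equally conceivable policies. `render_or_drop` and the `ex:Thing` class in the usage lines are hypothetical.

```python
import re

# Matches {{key}} but not control tags such as {{#if ...}}.
PLACEHOLDER = re.compile(r"\{\{([^{}#]+)\}\}")


def render_or_drop(template, values):
    """Render the template, or return None if any {{key}} has no value."""
    keys = [k.strip() for k in PLACEHOLDER.findall(template)]
    if any(k not in values for k in keys):
        return None  # policy chosen for this sketch: drop the whole row
    return PLACEHOLDER.sub(lambda m: values[m.group(1).strip()], template)


print(render_or_drop("ex:{{name}} rdf:type ex:Thing .", {"title": "Web Administrators"}))
# prints: None
```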

Choice among conditional approaches

The problems I see with the first approach are:

  1. it may lead to a proliferation of template files if there are several conditionals, which may make managing them a bit of a nightmare (because, e.g., there will be repeated portions of the templates)
  2. an implementation in JavaScript, which uses strongly asynchronous structures to retrieve the template files from the Web, will lead to an increased "callback hell"
  3. the implementation will have to rely on (or implement) a caching mechanism to avoid reading the templates over the wire many times over.

Global templates

We have not really discussed the issue of global templates yet, i.e., output that should be produced only once. Turtle namespace declarations are a typical case.

The pattern followed for conditionals may be reused. Either the metadata has a reference to a global template as a separate key, or the template file itself may include a statement that explicitly refers to the repeated parts, something like:

   @prefix ex: <http://www.example.org> .

    {{#repeat}}
    ex:{{name}} a ex:ONETSOC-Occupation ;
        skos:notation "{{name}}" ;
        skos:prefLabel "{{title}}" ;
    {{#if onet-soc-2010-code.soc-major-group}}
        skos:broader ex:{{onet-soc-2010-code.soc-major-group}}-0000.
    {{#elif onet-soc-2010-code.soc-minor-group}}
        skos:broader ex:{{onet-soc-2010-code.soc-major-group}}-{{onet-soc-2010-code.soc-minor-group}}00.
    {{#endif}}
    {{#endrepeat}}
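
A processor could handle such a file by splitting it once into a global part and a per-row part. The Python sketch below assumes a single `{{#repeat}}` ... `{{#endrepeat}}` pair per template and treats a template without markers as entirely per-row; both are my assumptions, and `split_global_template` and `render_all` are hypothetical names. The per-row rendering function is passed in as a parameter, so any substitution mechanism can be plugged in.

```python
import re

REPEAT = re.compile(r"\{\{#repeat\}\}(.*?)\{\{#endrepeat\}\}", re.DOTALL)


def split_global_template(template):
    """Split a template into (prologue, per-row body, epilogue)."""
    match = REPEAT.search(template)
    if match is None:
        # No explicit markers: treat the whole template as the per-row part.
        return "", template, ""
    return template[:match.start()], match.group(1), template[match.end():]


def render_all(template, rows, render_row):
    """Emit the prologue once, the body once per row, then the epilogue."""
    prologue, body, epilogue = split_global_template(template)
    return prologue + "".join(render_row(body, row) for row in rows) + epilogue


TEMPLATE = (
    "@prefix ex: <http://www.example.org> .\n"
    "{{#repeat}}ex:{{id}} a ex:Row .\n{{#endrepeat}}"
)
ROWS = [{"id": "a"}, {"id": "b"}]

# Trivial stand-in for the real per-row substitution.
print(render_all(TEMPLATE, ROWS, lambda body, row: body.replace("{{id}}", row["id"])))
```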

Datatypes

Do we want the template (or the metadata) to control the type conversion of the cell values? The problem is that the possible datatypes may depend on the target format; that may suggest that a syntax in the template is the appropriate place for it.

Template syntax

The '{{}}' syntax, used in Mustache and in these examples, is fine for XML or Turtle. However, it may be inappropriate for JSON. Do we need two different syntaxes, depending on the media type of the target?

Repository Structure

Earlier discussion on sharing examples in Git filetree: