CSV2RDF

From CSV on the Web Working Group Wiki

This page has no particular status within the working group.

Mapping of Web Tabular Data to RDF

This note describes producing RDF from CSV files. The CSV files are assumed to conform to the "Model for Tabular Data and Metadata on the Web".

That document defines three kinds of data model for tabular web data; the two relevant here are:

  • Core Data Model: defines the notion of table by defining the regular structure in the CSV file, such as having a header row and each row having the same number of columns.
  • Annotated Data Model: adds additional metadata about the file and may include the expected types of values in columns and whether a column can be used as a unique key.

@@status: currently not considering groups of tables in this note.

This note describes a process for the production of RDF from such tabular data. It works in the absence of additional metadata (Core Data Model level) and can produce more useful RDF if annotation is present, for example creating xsd:date typed literals for date information, instead of simple strings.

The format and definition of the annotation metadata is not defined in this note; it is a separate product of the working group and will apply to several different conversions, to RDF, JSON and XML.

Context -- RDB2RDF

Previous work in mapping relational databases to RDF resulted in two approaches: an automatic mapping using only database schema information (the Direct Mapping), and a mapping driven by a description of the translation (R2RML).

Design Goals

(@@needs more)

  • Output should be streamable, that is, CSV files can be processed without needing to read the whole file first.
  • Capture all details of the data model
  • Possible to perform much of the translation with text processing tools, not presuming a full RDF toolkit.

Issues

  • RFC 7111 fragment identifiers.
    Should a CSV fragment refer to something in the RDF?
    If so, rows might be <http://host/data.csv#row=1>.
  • Balance of complexity: handle it here, or assume other processing?
    • Extracting different entities from a row (different subjects within a row)
    • Entities across rows

Conversion Process

Defined here for when there is no additional annotation.

Outline

Each row is used to produce a number of RDF triples, all with the same subject. Each column has its own predicate. Row numbering information is included.

  • Read annotations, if any.
  • Process headers.
  • Process each row.
    • Decide subject RDF term
    • Process each cell to create RDF terms (literals or URIs)
    • Create triples with the row subject, column predicate and cell value.

This process is row-oriented.
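The outline above can be sketched in Python. This is purely illustrative, not a normative algorithm: the blank-node labels, the textual form of the terms, and the csv:row predicate spelling are assumptions of the sketch.

```python
import csv
import io

def csv_to_triples(text, base="http://host/data.csv"):
    """Convert CSV text to a list of (subject, predicate, object) triples,
    following the row-oriented outline: one subject per row, one
    predicate per column, plus row-number information."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                       # process the header row
    predicates = [f"<{base}#{name}>" for name in header]
    triples = []
    for row_num, row in enumerate(reader, start=1):
        subject = f"_:b{row_num}"               # allocate a new blank node
        triples.append((subject, "csv:row", str(row_num)))
        for pred, cell in zip(predicates, row):
            triples.append((subject, pred, cell))
    return triples

data = "Town,Population\nSouthton,123000\nNorthville,654000\n"
triples = csv_to_triples(data)
```

Because each row is handled independently, the function could equally yield triples one row at a time, meeting the streamability goal above.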

Process the header row

For each cell in the header row:
  Calculate the predicate for the column.
  Output information about the column (e.g. column number).

Each column name is used to produce a predicate URI. This is formed from the URL of the CSV file, a # to start the URI fragment, and the name of the column.

Where possible the exact name of the column should be used but that is not always possible. A standard conversion algorithm is needed.

  • Spaces become _
  • Other characters, such as / or %, become %-encoded.

In standard "HTML Form URL encoding" (application/x-www-form-urlencoded), spaces become + but _ may be more natural for CSV data.
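A possible encoding function in Python, mapping spaces to `_` and %-encoding other reserved characters (the exact algorithm is left open by this note):

```python
from urllib.parse import quote

def column_predicate(csv_url, column_name):
    """Form a predicate URI from the CSV file URL and a column name:
    spaces become '_', other reserved characters are %-encoded."""
    # safe="_" keeps underscores; note quote()'s default would leave '/'
    # unencoded, which is not wanted in a fragment.
    fragment = quote(column_name.replace(" ", "_"), safe="_")
    return f"{csv_url}#{fragment}"
```

For example, a column named "Annual Total" would give the fragment Annual_Total, while "miles/hour" would give miles%2Fhour.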

The Annotated Data Model recognizes that column order can be important so column number is included in the output.

Process a row

Decide subject, e.g. allocate a new blank node.
Generate row information (e.g. row number)
For each cell in the row:
  Decide on the RDF term for the value - using the annotation if present, otherwise by inspection

The cell value datatype is determined by the column annotation if present. This may include the (human) language for the column.

If no annotation is available, then the datatype of the cell is determined by looking at CSV entry:

  • integers - optional sign, then digits 0-9
  • decimal - optional sign, then digits 0-9, decimal point, digits 0-9
  • xsd:double - exponent present.
  • otherwise, it is a string.

(@@ align choices with user expectations)
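The rules above amount to simple lexical tests; they might be sketched as follows (the exact lexical forms are not fixed by this note):

```python
import re

# Lexical patterns mirroring the rules above (a sketch, not normative).
INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?[0-9]+\.[0-9]+")
DOUBLE  = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+")

def cell_datatype(value):
    """Guess an XSD datatype for an unannotated cell value."""
    if INTEGER.fullmatch(value):
        return "xsd:integer"
    if DECIMAL.fullmatch(value):
        return "xsd:decimal"
    if DOUBLE.fullmatch(value):
        return "xsd:double"
    return "xsd:string"
```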

Example

For concrete examples, we mostly use Turtle syntax, which makes it easier to see the RDF triples.

The CSV file is assumed to come from web location <http://host/data.csv>.

Town,Population
Southton,123000
Northville,654000

This is a CSV file of 2 columns, with 1 header row and 2 data rows.

In the absence of any annotations, each data row is given a different blank node.

Each row of the file generates a set of RDF triples, using the column information to determine the predicate. Triples for a given row share the same subject. In addition, row number information is added; this is kept separate from the CSV file's own column names by basing its URI generation on a different URI.

For the example data above, this would produce:

@prefix : <http://host/data.csv#> .
@prefix csv: <http://w3c/future-csv-vocab/> .

# Column information
<http://host/data.csv#metadata>
   csv:column [ csv:colName "Town" ;       csv:colPredicate :Town ;  csv:colIndex 1 ] ;
   csv:column [ csv:colName "Population" ; csv:colPredicate :Population ; csv:colIndex 2 ] ;
   .

# Data rows
[ csv:row 1 ; :Town "Southton"   ; :Population 123000 ] .
[ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .

Town names are strings and the population cells are numbers.

This data could be queried with SPARQL:

PREFIX : <http://host/data.csv#>
# Find the names of towns with a population over half a million.
SELECT ?townName {
    ?x :Town ?townName ;
       :Population ?pop .
    FILTER(?pop > 500000)
}


Without the bnode abbreviations:

@prefix : <http://host/data.csv#> .
@prefix csv: <http://w3c/future-csv-vocab/> .

<http://host/data.csv#metadata>
   csv:column _:c1 ;
   csv:column _:c2 .

_:c1 csv:colName "Town" ; 
     csv:colPredicate :Town ;
     csv:colIndex 1 .

_:c2 csv:colName "Population" ;
     csv:colPredicate :Population ;
     csv:colIndex 2 .

_:b0  csv:row 1 ;
      :Town "Southton" ;
      :Population 123000 .

_:b1  csv:row 2 ;
      :Town "Northville" ;
      :Population 654000 .

In N-Triples:

<http://host/data.csv#metadata> <http://w3c/future-csv-vocab/column> _:Bf6db97 .
<http://host/data.csv#metadata> <http://w3c/future-csv-vocab/column> _:B0dbf1b .
_:Bf6db97 <http://w3c/future-csv-vocab/colName> "Town" .
_:Bf6db97 <http://w3c/future-csv-vocab/colPredicate> <http://host/data.csv#Town> .
_:Bf6db97 <http://w3c/future-csv-vocab/colIndex> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
_:B0dbf1b <http://w3c/future-csv-vocab/colName> "Population" .
_:B0dbf1b <http://w3c/future-csv-vocab/colPredicate> <http://host/data.csv#Population> .
_:B0dbf1b <http://w3c/future-csv-vocab/colIndex> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
_:B147de0 <http://w3c/future-csv-vocab/row> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
_:B147de0 <http://host/data.csv#Town> "Southton" .
_:B147de0 <http://host/data.csv#Population> "123000"^^<http://www.w3.org/2001/XMLSchema#integer> .
_:B692bc1 <http://w3c/future-csv-vocab/row> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
_:B692bc1 <http://host/data.csv#Town> "Northville" .
_:B692bc1 <http://host/data.csv#Population> "654000"^^<http://www.w3.org/2001/XMLSchema#integer> .

Choice of syntax is not defined in this note.

These three examples are the same RDF. Blank node labels have no significance to machines.

Alternatives

RDF lists provide a different way to retain ordering information, but they are not easy to work with in RDF triple stores because native support is generally limited.

For example, the order of columns is retained by metadata in the RDF generated:

<http://host/data.csv#metadata>  csv:columns ( 
	[ csv:colName "Town" ; csv:colPredicate :Town ]
	[ csv:colName "Population" ; csv:colPredicate :Population ]
	) .

Notes

  • The CSV file may change over time - the predicate will be the same if the column names are the same even if their meaning is different. Whether this is a bug or a feature is a matter of opinion.
  • Columns names starting with a digit can lead to RDF/XML unfriendly predicates.
  • This process can be done by text processing for the case of CSV to Turtle. An RDF toolkit is not required.
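As the last note says, the conversion to Turtle can be done with plain text processing. A minimal Python sketch, assuming the placeholder csv: vocabulary used in the examples, with column names taken as-is and only plain integers distinguished from strings:

```python
import csv
import io

def csv_to_turtle(text, base="http://host/data.csv"):
    """Emit Turtle for a CSV file using only the csv module and
    string formatting -- no RDF toolkit required."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    lines = [f"@prefix : <{base}#> .",
             "@prefix csv: <http://w3c/future-csv-vocab/> ."]
    for row_num, row in enumerate(reader, start=1):
        terms = [f"csv:row {row_num}"]
        for name, cell in zip(header, row):
            # Quote the cell unless it looks like a plain integer.
            obj = cell if cell.lstrip("+-").isdigit() else f'"{cell}"'
            terms.append(f":{name} {obj}")
        lines.append("[ " + " ; ".join(terms) + " ] .")
    return "\n".join(lines)

ttl = csv_to_turtle("Town,Population\nSouthton,123000\nNorthville,654000\n")
```

Run over the example file, this reproduces the abbreviated Turtle shown above, one output line per input row, so it also satisfies the streamability goal.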

Annotations

Additional annotations on the data may be used to:

  • Check and choose the datatype for a column, e.g. make all numbers xsd:double.
  • Provide a template for generating subject URIs, rather than using blank nodes.
    One column might be used to form part of the subject URI.

Column datatyping

Subject URIs

Instead of a blank node, a generated URI could be used:

<http://host/data.csv#1>

giving data rows:

<http://host/data.csv#1> :Town "Southton"   ; :Population 123000 .
<http://host/data.csv#2> :Town "Northville" ; :Population 654000 .

Depending on the importance of row order, there might still be a need to include csv:row, so that the row number does not have to be microparsed out of the URI.
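A hypothetical helper for generating such subject URIs from the base URL and row number:

```python
def row_subject(csv_url, row_num):
    """Generate a subject URI for a data row instead of a blank node,
    using a '#<row>' fragment (one possible scheme, not normative)."""
    return f"<{csv_url}#{row_num}>"
```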

Metadata about conversion

Metadata for the conversion could (optionally) be added:

@prefix : <http://host/data.csv#> .
@prefix csv: <http://w3c/future-csv-vocab/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

<> csv:converter  <http://example.org/csv2rdf> ;
   csv:timestamp  "2014-03-19T16:48:20+00:00"^^xsd:dateTime .

<http://host/data.csv#1> csv:row 1 ; :Town "Southton"   ; :Population 123000 .
<http://host/data.csv#2> csv:row 2 ; :Town "Northville" ; :Population 654000 .

Table Groups

@@ requirements required

@@ need to create links by URI