Moving RDF to a Relational Database

Abstract

This documents a process for turning an RDF database into a relational database, sort of the opposite of what most folks are doing. The techniques are useful for normalizing data, discovering latent schemas and performing light analytics on RDF data. I use as an example, the Talis SPAAAACE Database (sparql endpoint). I used the SWObjects sparql binary to acquire data and render tables, but used no functionality outside of SPARQL and shell. The reader should be able to apply these processes to most SPARQL tools.

Notation

unix shell commands
SQL directives
# explanatory comments

Interrogations

We have to ask a few questions to get the shape of the data.

Distribution of Predicates

First, ask what is the distribution of predicates:

sparql -L application/sparql-results+xml \
 --service http://api.talis.com/stores/space/services/sparql \
 -e 'SELECT ?p (COUNT(*) AS ?c) {?s ?p ?o} GROUP BY ?p ORDER BY ?c ?p' \
 -l html
# Interpret results as sparql xml results format.
# Send query to talis SPAAACE endpoint.
# Get the predicates in order of frequency
# Dump results in a form I can paste into this document.

?p ?c
<http://purl.org/dc/terms/description> 1
<http://purl.org/dc/terms/title> 1
<http://rdfs.org/ns/void#sparqlEndpoint> 1
<http://rdfs.org/ns/void#uriRegexPattern> 1
<http://purl.org/dc/terms/source> 2
<http://purl.org/ontology/po/short_synopsis> 2
<http://xmlns.com/foaf/0.1/topic> 3
<http://xmlns.com/foaf/0.1/isPrimaryTopicOf> 4
<http://xmlns.com/foaf/0.1/primaryTopic> 4
<http://rdfs.org/ns/void#exampleResource> 6
<http://purl.org/dc/elements/1.1/title> 13
<http://purl.org/net/schemas/space/place> 36
<http://purl.org/net/schemas/space/country> 45
<http://schemas.talis.com/2005/dir/schema#etag> 58
<http://www.w3.org/2007/ont/httph#content-type> 58
<http://purl.org/net/schemas/space/missionProfile> 124
<http://purl.org/net/schemas/space/actor> 142
<http://purl.org/net/schemas/space/missionRole> 142
<http://purl.org/net/schemas/space/performed> 142
<http://purl.org/net/schemas/space/role> 142
<http://purl.org/net/schemas/space/mission> 183
<http://www.w3.org/2000/01/rdf-schema#label> 213
<http://xmlns.com/foaf/0.1/thumbnail> 299
<http://xmlns.com/foaf/0.1/depiction> 367
<http://xmlns.com/foaf/0.1/depicts> 367
<http://purl.org/net/schemas/space/nominalPower> 431
<http://purl.org/net/schemas/space/launchsite> 4984
<http://purl.org/net/schemas/space/launchvehicle> 4998
<http://purl.org/net/schemas/space/launched> 5074
<http://purl.org/net/schemas/space/mass> 5259
<http://purl.org/net/schemas/space/agency> 6120
<http://purl.org/dc/elements/1.1/description> 6551
<http://purl.org/net/schemas/space/internationalDesignator> 6551
<http://purl.org/net/schemas/space/launch> 6551
<http://purl.org/net/schemas/space/spacecraft> 6551
<http://xmlns.com/foaf/0.1/homepage> 6551
<http://xmlns.com/foaf/0.1/name> 6602
<http://purl.org/net/schemas/space/discipline> 7828
<http://purl.org/net/schemas/space/alternateName> 9372
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 12134

Distribution of Classes

How many are there of what classes?

sparql -l sparqlx \
 --service http://api.talis.com/stores/space/services/sparql \
 -e 'SELECT ?t (COUNT(*) AS ?c) {?s a ?t} GROUP BY ?t ORDER BY ?c ?s' \
 -8
# Tool-specific hortcut for application/sparql-results+xml.
# SPAAAACE...
# Get the types in order of frequency
# Dump results in the default table format,
#   with pretty UTF-8 box chars.

?t ?c
<http://rdfs.org/ns/void#Dataset> 1
<http://purl.org/ontology/po/Episode> 2
<http://purl.org/net/schemas/space/Mission> 11
<http://purl.org/net/schemas/space/Discipline> 16
<http://purl.org/net/schemas/space/LaunchSite> 36
<http://xmlns.com/foaf/0.1/Person> 51
<http://purl.org/net/schemas/space/MissionRole> 142
<http://xmlns.com/foaf/0.1/Image> 299
<http://purl.org/net/schemas/space/Launch> 5014
<http://purl.org/net/schemas/space/Spacecraft> 6562

Examples of Classes

Let's look at examples of these classes. It would have been nice to get examples with SELECT ?t (COUNT(*) AS ?c) (SAMPLE(?s) AS ?ex) ..., but the endpoint doesn't currently support it. Instead, we'll have to try each one individually on the unix command line (everybody's got one, right?).

sparql -l sparqlx \
  --service http://api.talis.com/stores/space/services/sparql \
  -e 'SELECT DISTINCT ?cutThisLine {?s a ?cutThisLine}' \
  -l csv \
| grep -v cutThisLine \
> allTypes.txt
# Force media type.
#   From SPAAAAACE...
#   Get distinct types.
#   Dump results in CSV,
# Remove the variable name.
# Stick results in a text file.

<http://purl.org/net/schemas/space/Discipline>
<http://purl.org/net/schemas/space/Launch>
<http://purl.org/net/schemas/space/LaunchSite>
<http://purl.org/net/schemas/space/Mission>
<http://purl.org/net/schemas/space/MissionRole>
<http://purl.org/net/schemas/space/Spacecraft>
<http://purl.org/ontology/po/Episode>
<http://rdfs.org/ns/void#Dataset>
<http://xmlns.com/foaf/0.1/Image>
<http://xmlns.com/foaf/0.1/Person>
( echo 'type,example' \
  && ( \
    cat allTypes.txt | while read type; do \
      echo -n $type ',' && ( \
        sparql -l sparqlx \
          --service http://api.talis.com/stores/space/services/sparql \
          -e "SELECT ?cutThisAlso { ?cutThisAlso a $type } LIMIT 1" \
          -l csv \
        | grep -v cutThisAlso \
      ); \
    done \
    | perl -pe 's/ ,/,/' \
  ) \
) > type_exs.csv
# Echo some column headings.
# and concatenate this...
#   Assign type to each type in allTypes.txt
#     echo the type and concatenate...
#       Force sparqlx format.
#       From SPAAAAACE...
#       Assign type to variable ?cutThisAlso.
#       Dump results in CSV.
#     Remove the line with the variable name.
# 
#   Finish the inner read loop
#   Trim out cat's love of spaces.
# Finish the outer group.
# Dump it all in type_exs.csv

Let's see that CSV in an HTML table...

sparql -d type_exs.csv -l html
# Dump the output in HTML for inclusion here:

?type ?example
<http://purl.org/net/schemas/space/Discipline> <http://nasa.dataincubator.org/discipline/technologyapplications>
<http://purl.org/net/schemas/space/Launch> <http://nasa.dataincubator.org/launch/1987-100>
<http://purl.org/net/schemas/space/LaunchSite> <http://nasa.dataincubator.org/launchsite/taiyuan>
<http://purl.org/net/schemas/space/Mission> <http://nasa.dataincubator.org/mission/apollo-10>
<http://purl.org/net/schemas/space/MissionRole> <http://nasa.dataincubator.org/mission/apollo-12/role/capsule-communicator/5>
<http://purl.org/net/schemas/space/Spacecraft> <http://nasa.dataincubator.org/spacecraft/1969-054A>
<http://purl.org/ontology/po/Episode> <http://www.bbc.co.uk/programmes/b007stmh#programme>
<http://rdfs.org/ns/void#Dataset> <http://nasa.dataincubator.org/>
<http://xmlns.com/foaf/0.1/Image> <http://nssdc.gsfc.nasa.gov/image/spacecraft/astp_apollo.jpg>
<http://xmlns.com/foaf/0.1/Person> <http://nasa.dataincubator.org/person/alfredmerrillworden>

Class Properties

What are the predicates from each of these example nodes? (Yes, this displays an arcs-out bias.)

sparql -d type_exs.csv \
  -e 'SELECT ?example ?p ?o
       WHERE {
         SERVICE <http://api.talis.com/stores/space/services/sparql> {
           ?example ?p ?o
         }
       }' -l html
# Read the types and examples we just extracted.
# Dump the example node and its predicates and objects...
# 
#   using the talis SPAAAAAAACE service...
# 
# 
# in HTML here (though you'd probably use sparqlt).

?example ?p ?o
<http://nasa.dataincubator.org/discipline/technologyapplications> <http://www.w3.org/2000/01/rdf-schema#label> "Technology Applications"
<http://nasa.dataincubator.org/discipline/technologyapplications> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/Discipline>
<http://nasa.dataincubator.org/launch/1987-100> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/Launch>
<http://nasa.dataincubator.org/launch/1987-100> <http://purl.org/net/schemas/space/spacecraft> <http://nasa.dataincubator.org/spacecraft/1987-100A>
<http://nasa.dataincubator.org/launch/1987-100> <http://purl.org/net/schemas/space/launchvehicle> "Proton Booster Plus Upper Stage and Escape Stages"
<http://nasa.dataincubator.org/launch/1987-100> <http://purl.org/net/schemas/space/launchsite> <http://nasa.dataincubator.org/launchsite/tyuratambaikonurcosmodrome>
<http://nasa.dataincubator.org/launch/1987-100> <http://purl.org/net/schemas/space/launched> "1987-12-10"^^<http://www.w3.org/2001/XMLSchema#date>
<http://nasa.dataincubator.org/launchsite/taiyuan> <http://www.w3.org/2000/01/rdf-schema#label> "Taiyuan, Peoples Republic of China"
<http://nasa.dataincubator.org/launchsite/taiyuan> <http://www.w3.org/2000/01/rdf-schema#label> "Taiyuan"
<http://nasa.dataincubator.org/launchsite/taiyuan> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/LaunchSite>
<http://nasa.dataincubator.org/launchsite/taiyuan> <http://purl.org/net/schemas/space/place> "Taiyuan"
<http://nasa.dataincubator.org/launchsite/taiyuan> <http://purl.org/net/schemas/space/country> "Peoples Republic of China"
<http://nasa.dataincubator.org/mission/apollo-10> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/Mission>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/dc/elements/1.1/title> "Apollo 10"
<http://nasa.dataincubator.org/mission/apollo-10> <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <http://www.bbc.co.uk/programmes/b00lg2xb#programme>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/lunar-module-pilot>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/capsule-communicator/1>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/command-module-pilot>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/backup-commander>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/capsule-communicator/3>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/backup-command-module-pilot>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/capsule-communicator/2>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/backup-lunar-module-pilot>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/commander>
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> <http://nasa.dataincubator.org/mission/apollo-10/role/capsule-communicator/0>
<http://nasa.dataincubator.org/mission/apollo-12/role/capsule-communicator/5> <http://www.w3.org/2000/01/rdf-schema#label> "Apollo 12 Capsule Communicator"
<http://nasa.dataincubator.org/mission/apollo-12/role/capsule-communicator/5> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/MissionRole>
<http://nasa.dataincubator.org/mission/apollo-12/role/capsule-communicator/5> <http://purl.org/net/schemas/space/role> <http://nasa.dataincubator.org/roles/capsule-communicator>
<http://nasa.dataincubator.org/mission/apollo-12/role/capsule-communicator/5> <http://purl.org/net/schemas/space/mission> <http://nasa.dataincubator.org/mission/apollo-12>
<http://nasa.dataincubator.org/mission/apollo-12/role/capsule-communicator/5> <http://purl.org/net/schemas/space/actor> <http://nasa.dataincubator.org/person/alfredmerrillworden>
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://xmlns.com/foaf/0.1/homepage> <http://nssdc.gsfc.nasa.gov/database/MasterCatalog?sc=1969-054A>
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://purl.org/net/schemas/space/launch> <http://nasa.dataincubator.org/launch/1969-054>
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://purl.org/net/schemas/space/discipline> <http://nasa.dataincubator.org/discipline/surveillanceandothermilitary>
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://purl.org/dc/elements/1.1/description> "Cosmos 287 was a first generation, low resolution Soviet photo surveillance satellite launched from the Baikonur cosmodrome aboard a Soyuz rocket. The film capsule was recovered after 8 days. "
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://purl.org/net/schemas/space/alternateName> "03991"
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://purl.org/net/schemas/space/agency> "U.S.S.R"
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/Spacecraft>
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://purl.org/net/schemas/space/internationalDesignator> "1969-054A"
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://purl.org/net/schemas/space/mass> "4730.0"
<http://nasa.dataincubator.org/spacecraft/1969-054A> <http://xmlns.com/foaf/0.1/name> "Cosmos 287"
<http://www.bbc.co.uk/programmes/b007stmh#programme> <http://xmlns.com/foaf/0.1/topic> <http://nasa.dataincubator.org/launchsite/capecanaveral>
<http://www.bbc.co.uk/programmes/b007stmh#programme> <http://xmlns.com/foaf/0.1/topic> <http://nasa.dataincubator.org/launchsite/capecanaveral.html>
<http://www.bbc.co.uk/programmes/b007stmh#programme> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/po/Episode>
<http://www.bbc.co.uk/programmes/b007stmh#programme> <http://purl.org/ontology/po/short_synopsis> "James May investigates what the Space Race did for all of us."
<http://www.bbc.co.uk/programmes/b007stmh#programme> <http://purl.org/dc/elements/1.1/title> "Blast Off!"
<http://nasa.dataincubator.org/> <http://rdfs.org/ns/void#exampleResource> <http://nasa.dataincubator.org/person/eugeneandrewcernan>
<http://nasa.dataincubator.org/> <http://rdfs.org/ns/void#exampleResource> <http://nasa.dataincubator.org/person/neilaldenarmstrong>
<http://nasa.dataincubator.org/> <http://purl.org/dc/terms/title> "NASA Space Flight & Astronaut data"
<http://nasa.dataincubator.org/> <http://rdfs.org/ns/void#exampleResource> <http://nasa.dataincubator.org/spacecraft/1969-059A>
<http://nasa.dataincubator.org/> <http://rdfs.org/ns/void#sparqlEndpoint> <http://api.talis.com/stores/space/services/sparql>
<http://nasa.dataincubator.org/> <http://purl.org/dc/terms/description> "\n Conversion of various NASA datasets into RDF, starting with the spacecraft data from the NSSDC master catalog\n "
<http://nasa.dataincubator.org/> <http://rdfs.org/ns/void#exampleResource> <http://nasa.dataincubator.org/spacecraft/1957-001B>
<http://nasa.dataincubator.org/> <http://purl.org/dc/terms/source> <http://history.nasa.gov/SP-4029/Apollo_00a_Cover.htm>
<http://nasa.dataincubator.org/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset>
<http://nasa.dataincubator.org/> <http://rdfs.org/ns/void#exampleResource> <http://nasa.dataincubator.org/mission/apollo-11>
<http://nasa.dataincubator.org/> <http://purl.org/dc/terms/source> <http://nssdc.gsfc.nasa.gov/nmc/>
<http://nasa.dataincubator.org/> <http://rdfs.org/ns/void#exampleResource> <http://nasa.dataincubator.org/spacecraft/1977-084A>
<http://nasa.dataincubator.org/> <http://rdfs.org/ns/void#uriRegexPattern> "http://nasa.dataincubator.org/.+"
<http://nssdc.gsfc.nasa.gov/image/spacecraft/astp_apollo.jpg> <http://xmlns.com/foaf/0.1/thumbnail> <http://nssdc.gsfc.nasa.gov/thumbnail/spacecraft/astp_apollo.gif>
<http://nssdc.gsfc.nasa.gov/image/spacecraft/astp_apollo.jpg> <http://xmlns.com/foaf/0.1/depicts> <http://nasa.dataincubator.org/spacecraft/1975-066A>
<http://nssdc.gsfc.nasa.gov/image/spacecraft/astp_apollo.jpg> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Image>
<http://nasa.dataincubator.org/person/alfredmerrillworden> <http://xmlns.com/foaf/0.1/name> "Alfred Merrill Worden"
<http://nasa.dataincubator.org/person/alfredmerrillworden> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>
<http://nasa.dataincubator.org/person/alfredmerrillworden> <http://purl.org/net/schemas/space/performed> <http://nasa.dataincubator.org/mission/apollo-12/role/backup-command-module-pilot>
<http://nasa.dataincubator.org/person/alfredmerrillworden> <http://purl.org/net/schemas/space/performed> <http://nasa.dataincubator.org/mission/apollo-15/role/command-module-pilot>
<http://nasa.dataincubator.org/person/alfredmerrillworden> <http://purl.org/net/schemas/space/performed> <http://nasa.dataincubator.org/mission/apollo-9/role/capsule-communicator/2>
<http://nasa.dataincubator.org/person/alfredmerrillworden> <http://purl.org/net/schemas/space/performed> <http://nasa.dataincubator.org/mission/apollo-12/role/capsule-communicator/5>

Relational Structure

By poking at this a bit, we get the impression that the data looks something like:

MissionRole +mission +actor +role +n Mission +id +title Role +id +label Crew +id +name SpacecraftDiscipline +craft +discipline Discipline +id +label LaunchSite +id +city +place +country Launch +id +date +site +vehicle Spacecraft +id +launch +mission +name +altname +designator +agency +mass

There are a couple guesses here, unsupported by the sample data. For instance, I assumed that a craft should reference a launch rather than the other way around. (This model allows a single launch to carry multiple craft.) Another is that the relationship between Spacecraft and Mission doesn't show up in our excerpt. More on that after the Craft/Launch relationship.

Craft/Launch Relationship

We can investigate whether there can be multiple craft per launch and visa versa. We'd like to get highest number of craft referencing a given launch:

PREFIX spac: <http://purl.org/net/schemas/space/> \
SELECT ?launch (COUNT(*) AS ?c)
  {?launch spac:spacecraft ?craft} 
GROUP BY ?launch
ORDER BY DESC(?c)
LIMIT 1
# 
# Project the launch and the number of craft,
# where ?launch has a spacecraft of some ?craft,
# counting the solutions for a given launch,
# starting with the highest number of craft,
# and reporting only the first solution.

Error messages indicate the DESC(?c) is not supported right now so we use the shell:

sparql -l sparqlx \
  --service http://api.talis.com/stores/space/services/sparql \
  -e 'PREFIX spac: <http://purl.org/net/schemas/space/> \
  SELECT ?launch (COUNT(*) AS ?c) \
    {?launch spac:spacecraft ?craft} \
  GROUP BY ?launch \
  ORDER BY ?c ?launch' \
  -l csv \
| tail -1
<http://nasa.dataincubator.org/launch/2007-012>,16
# 
#   SPAAAAACE...
#   more space
#   Project the launch and the number of craft,
#   where ?launch has a spacecraft of some ?craft,
#   counting the solutions for a given launch,
#   starting with the lowest number of craft,
#   report them all in CSV.
# Look at the last value
# which indicates that one poor launch carried 16 craft.

Now checking the maximum launches for a single craft (damn, fuse went out; don't worry, we'll get you up there, buddy):

SELECT ?craft (COUNT(*) AS ?c)
  {?craft spac:launch ?launch}
GROUP BY ?craft
ORDER BY ?c ?craft
# Project the craft and the number of launches it had,
# where ?craft has a launch of some ?launch,
# counting the solutions for a given craft,
# starting with the highest number of launches.

This yielded <http://nasa.dataincubator.org/spacecraft/WAMDII>,1 so it looks like we got lucky there.

Spacecraft and Mission Relationship

There is no evidence in the sampled triples of any relationship between Spacecraft and Missions. This is because the <http://purl.org/net/schemas/space/> appears to have both Spacecraft and MissionRoles in its domain:

PREFIX misn: <http://nasa.dataincubator.org/mission/> \
PREFIX spac: <http://purl.org/net/schemas/space/> \
SELECT * { ?s spac:mission misn:apollo-11  }
# 
# 
# All ?s with a mission of apollo-11.

?s
<http://nasa.dataincubator.org/mission/apollo-11/role/capsule-communicator/6>
<http://nasa.dataincubator.org/mission/apollo-11/role/capsule-communicator/1>
<http://nasa.dataincubator.org/spacecraft/1969-059C>
...

We can extract the latter with the pattern:

PREFIX spac: <http://purl.org/net/schemas/space/>
SELECT * { ?s a spac:Spacecraft ; spac:mission ?m }

Missing Classes

In our sampled properties, we can see that we never learned a class for items of type <http://nasa.dataincubator.org/roles/capsule-communicator>. Lets see what the database has about them:

sparql -l sparqlx \
  --service http://api.talis.com/stores/space/services/sparql \
  -e 'SELECT ?inS ?inP ?outP ?outO { \
        { ?inS ?inP <http://nasa.dataincubator.org/roles/capsule-communicator> } \
        UNION \
        { <http://nasa.dataincubator.org/roles/capsule-communicator> ?outP ?outO } \
      }' -8
# 
# SPAAAAACE...
# S and P of arcs in and P and O of arcs out.
#   Arcs into Mr. capsule-communicator.
# plus
#   Arcs out.

?inS ?inP ?outP ?outO
<http://nasa.dataincubator.org/mission/apollo-12/role/capsule-communicator/5> <http://purl.org/net/schemas/space/role> NULL NULL
<http://nasa.dataincubator.org/mission/apollo-12/role/capsule-communicator/0> <http://purl.org/net/schemas/space/role> NULL NULL
... ... always NULL always NULL

Apparently, there is no additional information about roles beyond an enumeration of them. Note, don't confuse these with MissionRoles, e.g. <http://nasa.dataincubator.org/mission/apollo-10/role/capsule-communicator/2>, which have a good number of arcs in and out:

?inS ?inP ?outP ?outO
<http://nasa.dataincubator.org/mission/apollo-10> <http://purl.org/net/schemas/space/missionRole> NULL NULL
<http://nasa.dataincubator.org/person/jackrobertlousma> <http://purl.org/net/schemas/space/performed> NULL NULL
NULL NULL <http://purl.org/net/schemas/space/actor> <http://nasa.dataincubator.org/person/jackrobertlousma>
NULL NULL <http://purl.org/net/schemas/space/mission> <http://nasa.dataincubator.org/mission/apollo-10>
NULL NULL <http://purl.org/net/schemas/space/role> <http://nasa.dataincubator.org/roles/capsule-communicator>
NULL NULL <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/MissionRole>
NULL NULL <http://www.w3.org/2000/01/rdf-schema#label> "Apollo 10 Capsule Communicator"

Creating Tables

The above results give us a bunch of ideas of how a relational database should hold the data. We won't know for sure until we populate the database, but we can guess/hope that any repeated properties will show up in our small sample. In the interest of maintaining foreign key integrity, we'll start with the leaf tables. The dependencies in the UML diagram give us few choices in the order for creating tables:

  1. Discipline
  2. LaunchSite
  3. Launch
  4. Mission
  5. Spacecraft
  6. SpacecraftDiscipline
  7. Role
  8. Crew
  9. MissionRole

Discipline

Discipline, represented in RDF with type space:Discipline, have a type and a label. Given that, all we need to do is capture a unique key, e.g. technologyapplications, and the label.

CREATE TABLE Discipline (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  label VARCHAR(50) UNIQUE,
  labelword VARCHAR(30) UNIQUE
)
# Create a Discipline table
#   with a primary key,
#   and the label text.
#   Why throw away a URL-sanitized label form?

Now we need to grab data about all of the discipline. (The data being only a labelword used to construct the URL and a full name.)

sparql -e 'PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT (SUBSTR(STR(?who), 42) AS ?labelword) ?name
   WHERE {
     SERVICE <http://api.talis.com/stores/space/services/sparql> {
       ?who a <http://purl.org/net/schemas/space/Discipline> ;
            rdfs:label ?name }
   }' -l tsv \
  > Discipline.tsv
# We'll look for RDFS labels.
# The labelword starts at the 42nd character
#   of each discipline member's URL.
# From Talis SPAAAACE database.
# There are other rdfs:labels in the database
#   so make sure we collect the right ones.
# Produce a tab-separated-values file.

We can load this tab-separated-values file directly into a database. Here, I load it into MySQL:

mysql -u spaaaceman --local-infile SPAAACE
mysql> LOAD DATA LOCAL INFILE 'Discipline.tsv'
 INTO TABLE Discipline FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (labelword, label);
Query OK, 16 rows affected, 1 warning (0.00 sec)
Records: 16  Deleted: 0  Skipped: 0  Warnings: 1
# Start MySQL client in a mood to honor LOAD DATA LOCAL.
# Load from our freshly-created Discipline.tsv.
#  Populate the new Discipline table.
#  values separated by TAB.
#  Lines end with newline.
#  Strings enclosed in ""s.
#  Skip the header with the variable names.
#  Populate the labelword and name attributes;
#    id's AUTO_INCREMENT will handle the primary key.
# Get a thumbs-up back from MySQL.

LaunchSite

LaunchSite, represented in RDF with type space:LaunchSite, have a type and a label. Given that, all we need to do is capture a unique key, e.g. technologyapplications, and the label. First, let's check our assumptions about there being a space:place and an rdfs:label for each site:

sparql -l sparqlx \
 --service http://api.talis.com/stores/space/services/sparql -e '
  PREFIX spac: <http://purl.org/net/schemas/space/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?site (COUNT(*) AS ?c)
  { ?site a spac:LaunchSite ; spac:place ?place }
  GROUP BY ?site
  ORDER BY ?c ?site' -l csv | tail -1
<http://nasa.dataincubator.org/launchsite/yavne>,1
sparql -l sparqlx \
 --service http://api.talis.com/stores/space/services/sparql -e '
  PREFIX spac: <http://purl.org/net/schemas/space/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?site (COUNT(*) AS ?c)
  { ?site a spac:LaunchSite ; rdfs:label ?label }
  GROUP BY ?site
  ORDER BY ?c ?site' -l csv | tail -1
<http://nasa.dataincubator.org/launchsite/kourou>,6
# 
# 
# 
# 
# 
# 
# 
# 
# good news, max count is 1.
# 
# 
# 
# 
# 
# 
# 
# 
# bad news, max count is 6!

We see that place names are not 1:1 on launch sites. An elegant model would have LaunchSites which refer to Places, and PlaceNames also refer to those Places. That's unnecessary because in our data, LaunchSites and Places would be 1:1. All we really need a LaunchSite table and a PlaceName table which refers to it. We can grab that data as two CSV files:

sparql -l sparqlx -e '
  PREFIX spac: <http://purl.org/net/schemas/space/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT (SUBSTR(STR(?site), 42) AS ?labelword) ?label {
    SERVICE <http://api.talis.com/stores/space/services/sparql>
    { ?site a spac:LaunchSite ; rdfs:label ?label }
  }' -l tsv > proto_PlaceName.tsv

sparql -e '
  PREFIX spac: <http://purl.org/net/schemas/space/>
  SELECT (SUBSTR(STR(?site), 42) AS ?labelword) ?place ?country
  WHERE {
  SERVICE <http://api.talis.com/stores/space/services/sparql> {
    ?site a spac:LaunchSite ;
          spac:place ?place ;
          spac:country ?country }
  }' -l tsv \
> LaunchSite.tsv
# 
# We'll look for SPAAAACE LaunchSites
# and RDFS labels.
# Perform SUBSTR() locally.
#   SPAAAAAAACE...
# There are other rdfs:labels in the database
#   so make sure we collect the right ones.

# ...

Catching Schema Errors

Let's try loading the LaunchSite CSV into SQL:

CREATE TABLE LaunchSite (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  place VARCHAR(50),
  country VARCHAR(50) UNIQUE,
  labelword VARCHAR(40) UNIQUE
);
LOAD DATA LOCAL INFILE 'LaunchSite.tsv'
 INTO TABLE LaunchSite FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (labelword, place, country);
Query OK, 18 rows affected (0.00 sec)
Records: 45  Deleted: 0  Skipped: 27  Warnings: 0
# Create a LaunchSite table
#   with a primary key,
#   and the place
#   and the country
#   and preserve the sanitized label.
# Load from our freshly-created LaunchSite.tsv.
#  Populate the new LaunchSite table.
#  values separated by TAB.
#  Lines end with newline.
#  Strings enclosed in ""s.
#  Skip the header with the variable names.
#  Populate the labelword and name attributes;
#    id's AUTO_INCREMENT will handle the primary key.

# Got a serious thumbs-down from MySQL.

Why'd we get 27 skipped rows? It turns out there is more than one country for some of the launch sites (which may undermine your notion of "site"). Again, there's an elegant model and an expeditious model. The former would involve the Places table I discussed above. The latter is, like for the PlaceNames, to simply have a CountryName table reference LaunchSite.

sparql -l sparqlx -e '
  PREFIX spac: <http://purl.org/net/schemas/space/>
  SELECT (SUBSTR(STR(?site), 42) AS ?labelword) ?country {
    SERVICE <http://api.talis.com/stores/space/services/sparql>
    { ?site a spac:LaunchSite ; spac:country ?country }
  }' -l tsv > proto_CountryName.tsv

sparql -e '
  PREFIX spac: <http://purl.org/net/schemas/space/>
  SELECT (SUBSTR(STR(?site), 42) AS ?labelword) ?place
  WHERE {
  SERVICE <http://api.talis.com/stores/space/services/sparql> {
    ?site a spac:LaunchSite ;
          spac:place ?place }
  }' -l tsv \
> LaunchSite.tsv
# 
# We'll look for SPAAAACE LaunchSites
# and RDFS labels.
# Perform SUBSTR() locally.
#   SPAAAAAAACE...
# We now omit country.

# ...

Load our now tiny LaunchSite table.

DROP TABLE LaunchSite;
CREATE TABLE LaunchSite (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  place VARCHAR(50) UNIQUE,
  labelword VARCHAR(40) UNIQUE
);
LOAD DATA LOCAL INFILE 'LaunchSite.tsv'
 INTO TABLE LaunchSite FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (labelword, place);
Query OK, 36 rows affected (0.01 sec)
Records: 36  Deleted: 0  Skipped: 0  Warnings: 0
# Our last conception of the table didn't work.
# Try again with no country
#
#
#

# Load from our latest LaunchSite.tsv.
#
#
#
#
#

# Much better.

We now need to work out a recipe for making a foreign key in the database so we can load our PlaceName and CountryName tables.

Making a foreign key

In the RDF data, there are plenty of predicates which connect objects of different types, for instance, the launch predicate connects spacecraft to launches. In the relational view, those connections are encoded as foreign keys, for instance, from Spacecraft.launch to Launch.id. These foreign keys are typically of type integer, so a Spacecraft.launch value of 29 would refer to a Launch.id of 29.

In order to emulate this sensible practice, we need to map terms like <http://nasa.dataincubator.org/launch/1969-054> to integers. There are ways to accomplish this: An easy way is to load the CSV into a proto table and then join against LaunchSite on labelword. We'll do this with the proto_PlaceName.tsv and proto_CountryName.tsv created above.

CREATE TABLE proto_PlaceName (
  label VARCHAR(50) UNIQUE,
  labelword VARCHAR(40)
);
LOAD DATA LOCAL INFILE 'proto_PlaceName.tsv'
 INTO TABLE proto_PlaceName FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (labelword, label);
Query OK, 55 rows affected (0.00 sec)
Records: 55  Deleted: 0  Skipped: 0  Warnings: 0
# 
#   there should be 1+ labels
#   for each labelwords 
#   
# Load from proto_PlaceName.tsv.   
#
#
# 
# 
# 
# Populate the labelword and name attributes;
# 
# Get a thumbs-up back from MySQL.

We can now populate the PlaceName table by joining proto_PlaceName against the existing

CREATE TABLE PlaceName (
  site INT UNSIGNED NOT NULL REFERENCES LaunchSite(id),
  label VARCHAR(50) UNIQUE
);
INSERT INTO PlaceName (site, label)
  SELECT LaunchSite.id, proto_PlaceName.label
    FROM proto_PlaceName
   INNER JOIN LaunchSite
     ON (LaunchSite.labelword=proto_PlaceName.labelword);
Query OK, 55 rows affected (0.01 sec)
Records: 55  Duplicates: 0  Warnings: 0
# 
#   
#   
#   
#   
#   
#   
#   
#   

We have to do exactly the same with proto_CountryName.tsv. Here, mostly without commentary:

CREATE TABLE proto_CountryName (
  label VARCHAR(50),
  labelword VARCHAR(40),
  UNIQUE KEY (label, labelword)
);
LOAD DATA LOCAL INFILE 'proto_CountryName.tsv'
 INTO TABLE proto_CountryName FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (labelword, label);
Query OK, 45 rows affected (0.01 sec)
Records: 45  Deleted: 0  Skipped: 0  Warnings: 0
CREATE TABLE CountryName (
  site INT UNSIGNED NOT NULL REFERENCES LaunchSite(id),
  label VARCHAR(50),
  UNIQUE KEY (site, label)
);
INSERT INTO CountryName (site, label)
  SELECT LaunchSite.id, proto_CountryName.label
    FROM proto_CountryName
   INNER JOIN LaunchSite
     ON (LaunchSite.labelword=proto_CountryName.labelword);
Query OK, 45 rows affected (0.00 sec)
Records: 45  Duplicates: 0  Warnings: 0
#
#
# multiple sites are in the same country
# so the unique encompasses combinations of label,labelword
#
#
#
# 
# 
# 
# 
#
# 
# thumbs-up from MySQL

Launch

Launch follows the proto table pattern in order to create the foreign keys to LaunchSite. First we dump a CSV from the SPAAAAAACE database:

sparql -l sparqlx -e '
  PREFIX spac: <http://purl.org/net/schemas/space/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT (SUBSTR(STR(?launch), 38) AS ?launchword)
         (STR(?when) AS ?date)
         (SUBSTR(STR(?site), 42) AS ?siteword) ?vehicDesc {
    SERVICE <http://api.talis.com/stores/space/services/sparql>
    { ?launch a spac:Launch ;
              spac:launched ?when ;
              spac:launchsite ?site ;
              spac:launchvehicle ?vehicDesc }
  }' -l tsv > proto_Launch.tsv
# 
# 
# 
# URI-safe launch name,
# date (without datatype),
# URI-safe site name, vehicle description
#
# 
# eliding spac:spacecraft because Spacecraft
# will have a reference to Launch (and cyclic
# foreign keys are a royal PITA).
# 

and load the CSV into SQL:

CREATE TABLE proto_Launch (
  launchword VARCHAR(10),
  date DATE,
  siteword VARCHAR(40),
  vehicDesc VARCHAR(80),
  UNIQUE KEY (launchword, date, siteword, vehicDesc)
);
LOAD DATA LOCAL INFILE 'proto_Launch.tsv'
 INTO TABLE proto_Launch FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (launchword, date, siteword, vehicDesc);
Query OK, 5064 rows affected, 18 warnings (0.08 sec)
Records: 5064  Deleted: 0  Skipped: 0  Warnings: 18
CREATE TABLE Launch (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  launchword VARCHAR(10),
  date DATE,
  site INT UNSIGNED NOT NULL REFERENCES LaunchSite(id),
  vehicDesc VARCHAR(80),
  UNIQUE KEY (launchword, date, site, vehicDesc)
);
INSERT INTO Launch (launchword, date, site, vehicDesc)
  SELECT proto_Launch.launchword, proto_Launch.date,
         LaunchSite.id, proto_Launch.vehicDesc
    FROM proto_Launch
   INNER JOIN LaunchSite
     ON (LaunchSite.labelword=proto_Launch.siteword);
Query OK, 5064 rows affected (0.10 sec)
Records: 5064  Duplicates: 0  Warnings: 0
#
# Some launchwords are repeated, e.g. 1963-030
# These are very likely errors.
#
# 
# This is the tightest unique supported by the data.
#
#
#
# 
# 
# 
# 
#
# There are entries in the SPAAAAACE database with
#   empty date strings.

Some duplicate launches seemed to be associated with missions like Apollo 15 which had multiple launches (from earth and then the moon). Others seemed to be errors. You can see the list with:

sparql -l sparqlx \
  --service http://api.talis.com/stores/space/services/sparql -e '
PREFIX spac: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name (STR(?d1) AS ?former) (STR(?d2) AS ?latter) {
  ?l spac:launched ?d1, ?d2
  FILTER (?d1 < ?d2)
  ?c spac:launch ?l ; foaf:name ?name
} ORDER BY ?name ?d1'

	  

Eventually, I got fed up and wiped them out:

ALTER TABLE Launch RENAME Launch_dupes;
CREATE TABLE Launch (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  launchword VARCHAR(10) UNIQUE,
  date DATE,
  site INT UNSIGNED NOT NULL REFERENCES LaunchSite(id),
  vehicDesc VARCHAR(80),
  UNIQUE KEY (launchword, date, site, vehicDesc)
);
INSERT INTO Launch (SELECT * FROM Launch_dupes GROUP BY launchword);Query OK, 4937 rows affected (0.08 sec)
Records: 4937  Duplicates: 0  Warnings: 0

	    

Mission

The Mission table provides an identifier and a title for records of type space:Mission.

sparql -e 'PREFIX dc1: <http://purl.org/dc/elements/1.1/>
  SELECT (SUBSTR(STR(?mission), 38) AS ?titleword) ?name
   WHERE {
     SERVICE <http://api.talis.com/stores/space/services/sparql> {
       ?mission a <http://purl.org/net/schemas/space/Mission> ;
                dc1:title ?name }
   }' -l tsv \
  > Mission.tsv
# We'll look for DC titles.
# 
# 
# 
# 
# 
# 

Now create and populate the Mission table.

CREATE TABLE Mission (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  title VARCHAR(50) UNIQUE,
  titleword VARCHAR(30) UNIQUE
);
LOAD DATA LOCAL INFILE 'Mission.tsv'
 INTO TABLE Mission FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (titleword, title);
Query OK, 11 rows affected (0.01 sec)
Records: 11  Deleted: 0  Skipped: 0  Warnings: 0
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 11 records!? That's right.
# This class includes only the fancy Apollo missions.

Spacecraft

Spacecraft follows the proto table pattern in order to create the foreign keys to both Launch and Mission. Include the mission arcs between Spacecraft and Mission, which are only present for Apollo missions and thus didn't happen to show up in the sample.

sparql -l sparqlx -e '
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc1:  <http://purl.org/dc/elements/1.1/>
PREFIX spac: <http://purl.org/net/schemas/space/>
SELECT (SUBSTR(STR(?s), 42) AS ?craftword)
       (SUBSTR(STR(?launch), 38) AS ?launchword)
       (IF (BOUND(?mission), SUBSTR(STR(?mission), 39), ?UNDEF) AS ?missionword)
       ?designator ?name ?agency ?mass ?craftDesc ?homepage {
SERVICE <http://api.talis.com/stores/space/services/sparql> {
  ?s a spac:Spacecraft ;
     spac:launch ?launch ;
#     spac:alternateName ?altName ;
     spac:internationalDesignator ?designator ;
     foaf:name ?name ;
     spac:agency ?agency ;
     spac:mass ?mass ;
     dc1:description ?craftDesc ;
     foaf:homepage ?homepage
  OPTIONAL { ?s spac:mission ?mission }
} }' -l tsv > proto_Spacecraft.tsv
# 
# 
# 
# 
# URI-safe spacecraft name,
# URI-safe launch name,
# ?mission may not be bound
# a bunch more details
# 
# 
# 
# skip spac:alternateName as it's 1:n
# 
# 
# 
# 
# 
# 
# only the Apollo missions are listed

and load the CSV into SQL:

CREATE TABLE proto_Spacecraft (
  craftword   VARCHAR(10) UNIQUE,
  launchword  VARCHAR(10),
  missionword VARCHAR(10),
  designator  VARCHAR(10) UNIQUE,
  name        VARCHAR(60),
  agency      VARCHAR(30),
  mass        FLOAT,
  homepage    VARCHAR(80) UNIQUE,
  craftDesc   BLOB
);
LOAD DATA LOCAL INFILE 'proto_Spacecraft.tsv'
 INTO TABLE proto_Spacecraft FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (craftword, launchword, missionword, designator,
         name, agency, mass, craftDesc, homepage);
Query OK, 4964 rows affected (0.22 sec)
Records: 4964  Deleted: 0  Skipped: 0  Warnings: 0
CREATE TABLE Spacecraft (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  craftword  VARCHAR(10) UNIQUE,
  launch     INT UNSIGNED NOT NULL REFERENCES Launch(id),
  mission    INT UNSIGNED REFERENCES Mission(id),
  designator VARCHAR(10) UNIQUE,
  name       VARCHAR(60),
  agency     VARCHAR(30),
  mass       FLOAT,
  homepage   VARCHAR(80) UNIQUE,
  craftDesc  BLOB
);
INSERT INTO Spacecraft (craftword, launch, mission, designator,
                        name, agency, mass, homepage, craftDesc)
  SELECT proto_Spacecraft.craftword, Launch.id, Mission.id,
         proto_Spacecraft.designator, proto_Spacecraft.name,
         proto_Spacecraft.agency, proto_Spacecraft.mass,
         proto_Spacecraft.homepage, proto_Spacecraft.craftDesc
    FROM proto_Spacecraft
   INNER JOIN Launch
     ON (Launch.launchword=proto_Spacecraft.launchword)
   LEFT OUTER JOIN Mission
     ON (Mission.titleword=proto_Spacecraft.missionword);
Query OK, 4954 rows affected (0.19 sec)
Records: 4954  Duplicates: 0  Warnings: 0
#
# Some craftwords are repeated, e.g. 1963-030
# These are very likely errors.
#
# 
# This is the tightest unique supported by the data.
#
#
#
# 
# 
# 
# 
#
# There are entries in the SPAAAAACE database with
#   empty date strings.

SpacecraftDiscipline

SpacecraftDiscipline follows the proto table pattern in order to create the foreign keys to both Spacecraft and Discipline.

sparql -l sparqlx -e '
PREFIX spac: <http://purl.org/net/schemas/space/>
SELECT (SUBSTR(STR(?craft), 42) AS ?craftword)
      (SUBSTR(STR(?disc), 42) AS ?discword) {
SERVICE <http://api.talis.com/stores/space/services/sparql> {
  ?craft spac:discipline ?disc
} }' -l tsv > proto_SpacecraftDiscipline.tsv

	  

and load the CSV into SQL:

CREATE TABLE proto_SpacecraftDiscipline (
  craftword   VARCHAR(10),
  discword  VARCHAR(30),
  UNIQUE KEY (craftword, discword)
);
LOAD DATA LOCAL INFILE 'proto_SpacecraftDiscipline.tsv'
 INTO TABLE proto_SpacecraftDiscipline FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (craftword, discword);
Query OK, 7828 rows affected (0.11 sec)
Records: 7828  Deleted: 0  Skipped: 0  Warnings: 0
CREATE TABLE SpacecraftDiscipline (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  spacecraft INT UNSIGNED NOT NULL REFERENCES Spacecraft(id),
  discipline INT UNSIGNED NOT NULL REFERENCES Discipline(id),
  UNIQUE KEY (spacecraft, discipline)
);
INSERT INTO SpacecraftDiscipline (spacecraft, discipline)
  SELECT Spacecraft.id, Discipline.id
    FROM proto_SpacecraftDiscipline
   INNER JOIN Spacecraft
     ON (Spacecraft.craftword=proto_SpacecraftDiscipline.craftword)
   INNER JOIN Discipline
     ON (Discipline.labelword=proto_SpacecraftDiscipline.discword);
Query OK, 5804 rows affected (0.24 sec)
Records: 5804  Duplicates: 0  Warnings: 0

	    

Crew

Persons, represented in RDF with type foaf:Person, have a type and a name, and repeated properties for the roles they filled. Given that, all we need to do is capture a primary key, e.g. alfredmerrillworden, and the name. Let's assume that the Persons we see are flight crew, and not all humans known to terrestrial space agencies.

CREATE TABLE Crew (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  crewword VARCHAR(30),
  name VARCHAR(50)
);
# Create a Crew table
#   with a primary key,
#   the labelword from the Person URL,
#   and the full name.

Now we need to grab data about all of the crew. (The data being only a labelword used to construct the URL and a full name.)

sparql -e 'PREFIX foaf: <http://xmlns.com/foaf/0.1/> \
  SELECT (SUBSTR(STR(?who), 38) AS ?labelword) ?name \
   WHERE { \
     SERVICE <http://api.talis.com/stores/space/services/sparql> { \
       ?who a foaf:Person ; \
            foaf:name ?name } \
   }' -l tsv \
  > Crew.tsv
# Factor our foaf namespace.
# The labelword starts at the 39th character
#   of each crew member's URL.
# From Talis SPAAAACE database.
# There are other foaf:names in the database
#   so make sure we collect the right ones.
# Produce a tab-separated-values file.

We can load this tab-separated-values file directly into a database.

LOAD DATA LOCAL INFILE 'Crew.tsv'
 INTO TABLE Crew FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (crewword, name);
Query OK, 51 rows affected (0.01 sec)
Records: 51  Deleted: 0  Skipped: 0  Warnings: 0
# Start MySQL client in a mood to honor LOAD DATA LOCAL.
# Load from our freshly-created Crew.tsv.
#  Populate the new Crew table.
#  values separated by TAB.
#  Lines end with newline.
#  Strings enclosed in ""s.
#  Skip the header with the variable names.
#  Populate the labelword and name attributes;
#    id's AUTO_INCREMENT will handle the primary key.
# Get a thumbs-up back from MySQL.

MissionRole

Now let's capture the roles the crew played. These are in triples of the form:

<http://nasa.dataincubator.org/person/alfredmerrillworden> 	<http://purl.org/net/schemas/space/performed> 	<http://nasa.dataincubator.org/mission/apollo-12/role/backup-command-module-pilot>

This implies that there is a person, alfredmerrillworden, a mission apollo-12 and a task, backup-command-module-pilot.

CREATE TABLE MissionRole (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  crew INT UNSIGNED NOT NULL,
  mission INT UNSIGNED NOT NULL,
  task INT UNSIGNED NOT NULL,
  mysteriousInt INT UNSIGNED
)
# Capture the roles the crew performed.
# Always add a primary key to make life easier.
#   the labelword from the Person URL,
#   a string form (pathetic, I know) of the mission,
#   the task name,
#   and some number associated with capsule-communicators.

Working around some string manipulation limitations in the remote server, we move the raw data to a local solution set:

sparql -l sparqlx \
  --service http://api.talis.com/stores/space/services/sparql \
  -e 'SELECT ?who ?role { ?who <http://purl.org/net/schemas/space/performed> ?role }' \
-8 > roles.srt
# 
# From Talis SPAAAAACE database.
# Find the roles each person performed.
# Store locally for further manipuation.

Now we can manipulate strings to our heart's content:

sparql -d roles.srt \
  -e 'SELECT \
       (STRAFTER(STR(?who),
         "http://nasa.dataincubator.org/person/") AS ?crew) \
       (STRBEFORE(STRAFTER(STR(?role),
         "http://nasa.dataincubator.org/mission/"), "/") AS ?mission) \
       (STRAFTER(STRAFTER(STR(?role),
         "http://nasa.dataincubator.org/mission/"), "/role/") AS ?task) \
  { }' -8 > crew_role_mission.srt
# Working from local roles.srt
# Re-project data burried in the URLs.
#
#   Get labelword from Person identifier.
#
#   ?mission appears after "/mission/".
#
#   ?task appears after "/role/".
# and we can store the result in a new srt file.

?crew ?mission ?task
"donnfultoneisele" "apollo-10" "backup-command-module-pilot"
"leroygordoncooperjr" "apollo-10" "backup-commander"
"edgardeanmitchell" "apollo-10" "backup-lunar-module-pilot"
"charlesmossdukejr" "apollo-10" "capsule-communicator/0"
"joehenryengle" "apollo-10" "capsule-communicator/1"
... ... ...

Now we can really dig into the constants embedded in the URLs. We'll create a proto MissionRole populated by strings instead of foreign key integers:

sparql -d crew_mission_task.srt \
  -e 'PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> \
    SELECT ?crew ?mission \
    (IF (CONTAINS(?task, "/"),
         STRBEFORE(?task, "/"),
         ?task) AS ?task2) \
    (IF (CONTAINS(?task, "/"),
         xsd:integer(STRAFTER(?task, "/")),
         -1) AS ?mysterious) \
    { }' -l tsv > proto_MissionRole.csv
# Working from local crew_mission_task.srt
# We'll need xsd to create integers.
# Get the already-baked values from crew_mission_task.srt,
#
#
# and extract the task,
#
#
# and the mysterious number.
# store the lot in a new csv file.

?crew ?mission ?task2 ?mysterious
"donnfultoneisele" "apollo-10" "backup-command-module-pilot" -1
"leroygordoncooperjr" "apollo-10" "backup-commander" -1
"edgardeanmitchell" "apollo-10" "backup-lunar-module-pilot" -1
"charlesmossdukejr" "apollo-10" "capsule-communicator" 0
"joehenryengle" "apollo-10" "capsule-communicator" 1
"jackrobertlousma" "apollo-10" "capsule-communicator" 2
... ... ... ...

We'll create the proto MissionRole table and populate it:

CREATE TABLE proto_MissionRole (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  crew VARCHAR(30),
  mission VARCHAR(10),
  task VARCHAR(30),
  mysteriousInt INT,
  UNIQUE KEY (crew, mission, task, mysteriousInt)
);
LOAD DATA LOCAL INFILE 'proto_MissionRole.tsv'
 INTO TABLE proto_MissionRole FIELDS
 TERMINATED BY '\t'
 OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
 (crew, mission, task, mysteriousInt);
Query OK, 142 rows affected (0.03 sec)
Records: 142  Deleted: 0  Skipped: 0  Warnings: 0
CREATE TABLE MissionRole (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  crew INT UNSIGNED NOT NULL REFERENCES Crew(id),
  mission INT UNSIGNED NOT NULL REFERENCES Mission(id),
  role INT UNSIGNED NOT NULL REFERENCES Role(id),
  mysteriousInt INT,
  UNIQUE KEY (crew, mission, role, mysteriousInt)
);
# Capture the roles the crew performed.
# Always add a primary key to make life easier.
#   the labelword from the Person URL,
#   a string form (pathetic, I know) of the mission,
#   the task name,
#   and some number associated with capsule-communicators.
# Load from our freshly-created Crew.tsv.
#  Populate the new Crew table.
#  values separated by TAB.
#  Lines end with newline.
#  Strings enclosed in ""s.
#  Skip the header with the variable names.
#  Populate the labelword and name attributes;
#    id's AUTO_INCREMENT will handle the primary key.
# Get a thumbs-up back from MySQL.

Note that MissionRole.role references a Role table. We can populate that in SQL with the distinct roles in proto_MissionRole:

Role

CREATE TABLE Role (
  id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  role VARCHAR(30) UNIQUE
);
INSERT INTO Role (role)
  SELECT DISTINCT task FROM proto_MissionRole;
Query OK, 7 rows affected (0.01 sec)
Records: 7  Duplicates: 0  Warnings: 0

	    

Finishing MissionRole

Now we can populate MissionRole (@@which should really be called CrewRole) by joining against Mission, Crew and Role.

INSERT INTO MissionRole (crew, mission, role, mysteriousInt)
  SELECT Mission.id, Crew.id, Role.id, proto_MissionRole.mysteriousInt
    FROM proto_MissionRole
   INNER JOIN Mission
     ON Mission.titleword=proto_MissionRole.mission
   INNER JOIN Crew
     ON Crew.crewword=proto_MissionRole.crew
   INNER JOIN Role
     ON Role.role=proto_MissionRole.task;
Query OK, 142 rows affected (0.01 sec)
Records: 142  Duplicates: 0  Warnings: 0

	    

Validation

Let's ask some data of both the SPAAAAAAACE SPARQL endpoint and the SPCE database to make sure we've mapped the data in the former into the latter medium.

XXX

First, XXX:

@@@
@@@

?p ?c
<http://purl.org/dc/terms/description> 1
<http://purl.org/dc/terms/title> 1
<http://rdfs.org/ns/void#sparqlEndpoint> 1
<http://rdfs.org/ns/void#uriRegexPattern> 1
<http://purl.org/dc/terms/source> 2
<http://purl.org/ontology/po/short_synopsis> 2
<http://xmlns.com/foaf/0.1/topic> 3
<http://xmlns.com/foaf/0.1/isPrimaryTopicOf> 4
<http://xmlns.com/foaf/0.1/primaryTopic> 4
<http://rdfs.org/ns/void#exampleResource> 6
<http://purl.org/dc/elements/1.1/title> 13
<http://purl.org/net/schemas/space/place> 36
<http://purl.org/net/schemas/space/country> 45
<http://schemas.talis.com/2005/dir/schema#etag> 58
<http://www.w3.org/2007/ont/httph#content-type> 58
<http://purl.org/net/schemas/space/missionProfile> 124
<http://purl.org/net/schemas/space/actor> 142
<http://purl.org/net/schemas/space/missionRole> 142
<http://purl.org/net/schemas/space/performed> 142
<http://purl.org/net/schemas/space/role> 142
<http://purl.org/net/schemas/space/mission> 183
<http://www.w3.org/2000/01/rdf-schema#label> 213
<http://xmlns.com/foaf/0.1/thumbnail> 299
<http://xmlns.com/foaf/0.1/depiction> 367
<http://xmlns.com/foaf/0.1/depicts> 367
<http://purl.org/net/schemas/space/nominalPower> 431
<http://purl.org/net/schemas/space/launchsite> 4984
<http://purl.org/net/schemas/space/launchvehicle> 4998
<http://purl.org/net/schemas/space/launched> 5074
<http://purl.org/net/schemas/space/mass> 5259
<http://purl.org/net/schemas/space/agency> 6120
<http://purl.org/dc/elements/1.1/description> 6551
<http://purl.org/net/schemas/space/internationalDesignator> 6551
<http://purl.org/net/schemas/space/launch> 6551
<http://purl.org/net/schemas/space/spacecraft> 6551
<http://xmlns.com/foaf/0.1/homepage> 6551
<http://xmlns.com/foaf/0.1/name> 6602
<http://purl.org/net/schemas/space/discipline> 7828
<http://purl.org/net/schemas/space/alternateName> 9372
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 12134

	  

	  


$Revision: 1.6 $ of $Date: 2012/05/28 02:29:45 $ by $Author: eric $