RubyRdfDatabaseSetup

From W3C Wiki
Jump to: navigation, search

See also: [RDFRdb]

How to get SQL backends set up for RubyRDF.

This is being used (for example) to experiment with RestaurantRecommendation in RDF/XML. Some examples below are from that work.

PostgreSQL setup

We have code for Postgre`SQL and My`SQL database access from Ruby.

OK, assuming PostgreSQL and a unix-alike environment.

We have a freshly installed (apt-get install postgresql) PostgreSQL on a Debian box.

create a user

 
danbri@snowball2:~$ sudo su - postgres
postgres@snowball2:~$ createuser danbri
Shall the new user be allowed to create databases? (y/n) y
Shall the new user be allowed to create more new users? (y/n) y
CREATE USER


Let's make the new database (name should be arbitrary, though some tests assume 'rdfweb1'):

danbri@snowball2:~$ createdb rdfweb1
CREATE DATABASE


setup tables

There are various ways we can run the initdb-pg script.

Here's we create and initialise a couple of scratch databases used for testing:

danbri@fireball:~/s-rubyrdf/pack/tests$ createdb rubyrdf1; GET http://www.w3.org/2001/12/rubyrdf/db/initdb-pg.sql | psql rubyrdf1
CREATE DATABASE
ERROR:  table "triples" does not exist
ERROR:  table "resources" does not exist
CREATE
CREATE
CREATE
CREATE
CREATE
CREATE
CREATE
CREATE
danbri@fireball:~/s-rubyrdf/pack/tests$ createdb scutter1; GET http://www.w3.org/2001/12/rubyrdf/db/initdb-pg.sql | psql scutter1
CREATE DATABASE
ERROR:  table "triples" does not exist
ERROR:  table "resources" does not exist
CREATE
CREATE
CREATE
CREATE
CREATE
CREATE
CREATE
CREATE


(the error messages are normal; ignore...)

Or use Curl instead (eg. for Mac OS X where it comes as standard):

curl http://www.w3.org/2001/12/rubyrdf/db/initdb-pg.sql | psql rdfweb1


We should now be set up for RDF storage and query.

delete howto

Deleting :


danbri@fireball:$ psql rdfweb1
Welcome to psql, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

rdfweb1=> delete from resources;
DELETE 25237
rdfweb1=> delete from triples;
DELETE 31785
rdfweb1=>


'simple' triples table for debug

This query makes a complicated rdf store look nice and simple:

rdfweb1=> select r1.value as p, r2.value as s, r3.value as o, r1.keyhash as pc ode, r2.keyhash as scode, r3.keyhash as ocode from resources r1, resources r2,

resources r3, triples t where r1.keyhash=t.predicate and r2.keyhash=t.subject

and r3.keyhash=t.object;

We can make a view table using it:


rdfweb1=> create view simple AS select r1.value as p, r2.value as s, r3.value a
s o, r1.keyhash as pcode, r2.keyhash as scode, r3.keyhash as ocode  from resour
ces r1, resources r2, resources r3, triples t where r1.keyhash=t.predicate and
r2.keyhash=t.subject and r3.keyhash=t.object;
CREATE
rdfweb1=> \d simple
             View "simple"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 p      | character varying |
 s      | character varying |
 o      | character varying |
 pcode  | integer           |
 scode  | integer           |
 ocode  | integer           |
View definition: SELECT r1.value AS p, r2.value AS s, r3.value AS o, r1.keyhash
AS pcode, r2.keyhash AS scode, r3.keyhash AS ocode FROM resources r1, resources
r2, resources r3, triples t WHERE (((r1.keyhash = t.predicate) AND (r2.keyhash =
 t.subject)) AND (r3.keyhash = t.object));


Maybe this should go in the default schema?

useful queries

Here are some things I'm finding useful. Note that one can also do a lot of work through the pure RDF interfaces, but when things are going wonky or you want extra expressivity, it is handy to be able to talk to the database as SQL. Projecting it into a simple-minded single table of predicate/subject/object makes it much easier to deal with than trying to compose queries against its actual multi-table representation, too.

Find all the predicates used:

All the predicates:


rdfweb1=> select distinct p from simple;
                        p
--------------------------------------------------
 http://chefmoz.org/rdf/elements/1.0/City
 http://chefmoz.org/rdf/elements/1.0/Country
 http://chefmoz.org/rdf/elements/1.0/Neighborhood
 http://chefmoz.org/rdf/elements/1.0/Zip
 http://purl.org/dc/elements/1.1/contributor
 http://purl.org/dc/elements/1.1/date
 http://purl.org/dc/elements/1.1/publisher
 http://purl.org/dc/elements/1.1/rights
 http://purl.org/rss/1.0/description
 http://purl.org/rss/1.0/items
 http://purl.org/rss/1.0/link
 http://purl.org/rss/1.0/modules/wiki/interwiki
 http://purl.org/rss/1.0/title
 http://www.w3.org/1999/02/22-rdf-syntax-ns#_1
 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
(15 rows)


All uses of some particular namespace:

select p,o from simple where p LIKE 'http://chefmoz.org/rdf/elements/1.0/%';

                        p                         |       o
--------------------------------------------------+----------------
 http://chefmoz.org/rdf/elements/1.0/City         | London
 http://chefmoz.org/rdf/elements/1.0/City         | London
 http://chefmoz.org/rdf/elements/1.0/City         | London
 http://chefmoz.org/rdf/elements/1.0/City         | London
 http://chefmoz.org/rdf/elements/1.0/Country      | United Kingdom
 http://chefmoz.org/rdf/elements/1.0/Country      | United Kingdom
 http://chefmoz.org/rdf/elements/1.0/Country      | United Kingdom
 http://chefmoz.org/rdf/elements/1.0/Country      | United Kingdom
 http://chefmoz.org/rdf/elements/1.0/Neighborhood | City of London
 http://chefmoz.org/rdf/elements/1.0/Neighborhood | Clerkenwell
 http://chefmoz.org/rdf/elements/1.0/Neighborhood | Southwark
 http://chefmoz.org/rdf/elements/1.0/Zip          | EC3V 0DR
 http://chefmoz.org/rdf/elements/1.0/Zip          | W1K 2RP
 http://chefmoz.org/rdf/elements/1.0/Zip          | WC1X 8JR
 http://chefmoz.org/rdf/elements/1.0/Zip          | SE1 9EF
(15 rows)


How many distinctly tagged graphs do we have? (note that libby/Inkling uses this field in ths SQL structure differently to DanBri/RubyRdf)


scutter1=> select count(distinct assertid) from triples;
 count
-------
   694
(1 row)

scutter1=> select distinct assertid from triples;

}}


Here we had a bunch of annoying rss:title properties we want to ignore...
{{{
rdfweb1=> select p,o from simple where p LIKE 'http://purl.org/rss/1.0/title' a
nd NOT s = 'http://the.earth.li/~kake/cgi-bin/cgi-wiki/TODO-link';
               p               |            o
-------------------------------+--------------------------
 http://purl.org/rss/1.0/title | Anchor Bankside, SE1 9EF
 http://purl.org/rss/1.0/title | Crosse Keys, EC3V 0DR
 http://purl.org/rss/1.0/title | Audley, W1K 2RP
 http://purl.org/rss/1.0/title | Calthorpe Arms, WC1X 8JR
(4 rows)


Offtopic: the ntriples we loaded...

(or tried to, what's up? need better debug tools!)

I grepped out the titles we got from the parser. Seems they didn't all get loaded.


<http://purl.org/rss/1.0/title> "Anchor Bankside, SE1 9EF" .
<http://purl.org/rss/1.0/title> "Calthorpe Arms, WC1X 8JR" .
<http://purl.org/rss/1.0/title> "Audley, W1K 2RP" .
<http://purl.org/rss/1.0/title> "Crosse Keys, EC3V 0DR" .
<http://purl.org/rss/1.0/title> "Cittie Of Yorke, WC1V 6BN" .
<http://purl.org/rss/1.0/title> "Counting House, EC3V 3PD" .


I'm beginning to suspect the missing two records got garbled due to the wierd SHA1 hashes we're using to map from text strings (literals and uris) to numbers.

A bit more fruitless diagnostic fiddling:


rdfweb1=> select * from resources where value like 'Cittie%';
 keyhash  |           value
----------+---------------------------
 53432275 | Cittie Of Yorke, WC1V 6BN
(1 row)


rdfweb1=> select * from simple where ocode = '53432275';
 p | s | o | pcode | scode | ocode
---+---+---+-------+-------+-------
(0 rows)

rdfweb1=> select * from simple where pcode = '53432275';
 p | s | o | pcode | scode | ocode
---+---+---+-------+-------+-------
(0 rows)

rdfweb1=> select * from simple where scode = '53432275';
 p | s | o | pcode | scode | ocode
---+---+---+-------+-------+-------
(0 rows)

  

rdfweb1=> select * from triples where object='53432275';
  subject   | predicate  |  object  |                        assertid
              | personid | isresource
------------+------------+----------+-------------------------------------------
--------------+----------+------------
 1046910663 | 2095767658 | 53432275 | uri=http://example.com/grubst:Cittie_Of_Yo
rke,_WC1V_6BN |          | f
(1 row)

rdfweb1=> select * from resources where keyhash='1046910663';
 keyhash | value
---------+-------
(0 rows)

rdfweb1=> select * from resources where keyhash='2095767658';
  keyhash   |             value
------------+-------------------------------
 2095767658 | http://purl.org/rss/1.0/title
(1 row)


DEBUG: insert into resources values ( '-721365290', 'Counting House, EC3V 3PD' )
;
danbri@fireball:~/s-rubyrdf/pack/tests/net$ grep '\-721365290' xx
DEBUG: insert into triples values ('-1529755978', '2095767658',  '-721365290', '
uri=http://example.com/grubst:Counting_House,_EC3V_3PD','' ,'f');


Wishlist

  • Libby had some useful stats queries.

consistency checker

  • Find me any numbers from 'triples' table that don't have a lookup in 'resources' (consistency concerns...). And vice-versa.

A partial check:


SELECT triples.subject FROM triples LEFT OUTER JOIN resources ON triples.subject=resources.keyhash WHERE resources.keyhash IS NULL

SELECT triples.predicate FROM triples LEFT OUTER JOIN resources ON triples.predicate=resources.keyhash WHERE resources.keyhash IS NULL

SELECT triples.object FROM triples LEFT OUTER JOIN resources ON triples.predicate=resources.keyhash WHERE resources.keyhash IS NULL


The 1st of these found some suspects:

   subject
-------------
 -1529755978
  1046910663


Aha. I think...

Two different values. What's up?!


danbri@fireball:~/s-rubyrdf/pack/tests/net$ ./tc_sqlgen.rb |grep 'http://the.ea
rth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Cittie_Of_Yorke%2C_WC1V_6BN'
RAW: Setting http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Cittie_Of_Yorke
%2C_WC1V_6BN -> -2001591609
RAW: Setting http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Cittie_Of_Yorke
%2C_WC1V_6BN -> 1046910663
RAW: Setting http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Cittie_Of_Yorke
%2C_WC1V_6BN -> -2001591609
KEY: http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Cittie_Of_Yorke%2C_WC1V
_6BN --> '-2001591609'
ADDING: insert into resources values ( '-2001591609', 'http://the.earth.li/~kake
/cgi-bin/cgi-wiki/wiki.cgi?Cittie_Of_Yorke%2C_WC1V_6BN' );


OK seem to have fixed it now.

(I'm looking foward to using the RDFQueryTestCases to get rid of some of this uncertainty...)

10 hits, 10 pubs. seems OK...

title country uri
Albion, N1 1HW United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Albion%2C_N1_1HW
Anchor Bankside, SE1 9EF United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Anchor_Bankside%2C_SE1_9EF
Andover Arms, W6 0DL United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Andover_Arms%2C_W6_0DL
Angel, N1 9LQ United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Angel%2C_N1_9LQ
Audley, W1K 2RP United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Audley%2C_W1K_2RP
Banker, EC4R 3TE United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Banker%2C_EC4R_3TE
Calthorpe Arms, WC1X 8JR United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Calthorpe_Arms%2C_WC1X_8JR
Cittie Of Yorke, WC1V 6BN United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Cittie_Of_Yorke%2C_WC1V_6BN
Counting House, EC3V 3PD United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Counting_House%2C_EC3V_3PD
Crosse Keys, EC3V 0DR United Kingdom http://the.earth.li/~kake/cgi-bin/cgi-wiki/wiki.cgi?Crosse_Keys%2C_EC3V_0DR