TaskForces/CommunityProjects/LinkingOpenData/THALIATestbed

From W3C Wiki

SWEO Community Project: Linking Open Data on the Semantic Web

THALIA testbed

This page collects information about the THALIA testbed for benchmarking tools that map relational databases to RDF.

The page is part of the Linking Open Data Project.

You can download the latest version of the THALIA benchmark now.

Introduction

THALIA (Test Harness for the Assessment of Legacy information Integration Approaches) is a publicly available testbed and benchmark for testing and evaluating integration technologies. It provides researchers and practitioners with a collection of 40 relational database tables representing university course catalogs from computer science departments around the world. The data in the testbed provide a rich source of syntactic and semantic heterogeneities, which we believe still pose the greatest technical challenges to the research community. In addition, the testbed provides a set of twelve benchmark queries as well as a scoring function for ranking the performance of an integration system.

Testbed content

The initial XML/XMLSchema/XQuery version is accessible here. Our SQL/OWL/SPARQL version is presented below.

SQL scripts

You can download either the THALIA benchmark or the THALIA testbed version of the SQL schema. The benchmark version includes the schemas and data of the challenge universities. The testbed version contains schemas and data for the courses of all available universities.

MySQL 5.0.37 (bundled with archive)

PostgreSQL 8.2.3-1 (bundled with archive)

Virtuoso 5.x (and higher)

The first public version of the THALIA benchmark schema is already available.

The full testbed version has not been built yet.

University computer science course ontology

The first public version of the ontology for courses offered by university computer science departments around the world is also already available.

Benchmark SPARQL queries

The set of twelve benchmark queries, expressed in SPARQL, is also already available.

Examples of testbed SQL data in RDF format

The benchmark data in RDF format, which the SPARQL queries require, is also already available.

THALIA testbed downloads

  • First public review version, 23 June 2007 - download it.

Some testbed examples

Arizona State University

  • Step 1. Initial HTML version of a course.


110 Principles of Programming with Java. (3)  	MORE INFO
Concepts of problem solving using Java, algorithm design, structured programming, fundamental algorithms and techniques, and computer systems concepts. Social and ethical responsibility. Lecture, lab. Prerequisite: MAT 170.


  • Step 2. Original representation in the THALIA testbed in XML format.


    <Course Title="110 Principles of Programming with                                  Java. (3)">
        <MoreInfo.URL>http://www.eas.asu.edu/~cse110</MoreInfo.URL>
        <Description>Concepts of problem solving using Java, algorithm design, structured programming, fundamental algorithms and techniques, and computer systems concepts. Social and ethical responsibility. Lecture, lab. Prerequisite: MAT 170.</Description>
    </Course>


  • Step 3. Representation of the course translated into SQL.


    CREATE TABLE asu (
        Title TEXT NOT NULL,
        Description TEXT,
        MoreInfoURL TEXT);


    INSERT INTO asu (Title,Description,MoreInfoURL)
    VALUES ('110 Principles of Programming with                                  Java. (3)'
        ,'Concepts of problem                                  solving using Java, algorithm design, structured                                  programming, fundamental algorithms and                                  techniques, and computer systems concepts.                                  Social and ethical responsibility. Lecture, lab.                                  Prerequisite: MAT 170.'
        , 'http://www.eas.asu.edu/~cse110');
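
The move from the XML of Step 2 to the table of Step 3 can be sketched in a few lines of Python. This is only an illustration, not the script used to build the testbed: the `<asu>` root element is an assumption (the sample above shows only the `<Course>` element), the description is abbreviated, and SQLite stands in for MySQL, PostgreSQL or Virtuoso.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Sample input modelled on Step 2. The <asu> wrapper element is an assumption,
# and the description is shortened for readability.
XML = """<asu>
    <Course Title="110 Principles of Programming with Java. (3)">
        <MoreInfo.URL>http://www.eas.asu.edu/~cse110</MoreInfo.URL>
        <Description>Concepts of problem solving using Java. Prerequisite: MAT 170.</Description>
    </Course>
</asu>"""


def course_rows(xml_text):
    """Yield one (Title, Description, MoreInfoURL) tuple per <Course> element."""
    root = ET.fromstring(xml_text)
    for course in root.iter("Course"):
        # Collect the child elements by tag name; missing children become None.
        fields = {child.tag: (child.text or "").strip() for child in course}
        yield (course.get("Title"), fields.get("Description"), fields.get("MoreInfo.URL"))


conn = sqlite3.connect(":memory:")
# Same columns as the CREATE TABLE statement in Step 3.
conn.execute("CREATE TABLE asu (Title TEXT NOT NULL, Description TEXT, MoreInfoURL TEXT)")
conn.executemany(
    "INSERT INTO asu (Title, Description, MoreInfoURL) VALUES (?, ?, ?)",
    course_rows(XML),
)

rows = conn.execute("SELECT Title, MoreInfoURL FROM asu").fetchall()
print(rows)
```

Note that the XML element name MoreInfo.URL has to be renamed to MoreInfoURL on the SQL side, because an unquoted SQL column name cannot contain a dot.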


  • Step 4 (the most important one). The course represented in RDF.


    <University rdf:about="http://purl.org/thalia/asu">
        <dc:title xml:lang="en">Arizona State University</dc:title>
    </University>

    <Course rdf:about="http://purl.org/thalia/asu/course/CSE110">
        <dc:title xml:lang="en">Principles of Programming with Java</dc:title>
        <dc:description xml:lang="en">Concepts of problem solving using Java, algorithm design, structured programming, fundamental  algorithms and techniques, and computer systems concepts. Social and ethical responsibility.</dc:description>
        <hasPrerequisite rdf:resource="http://purl.org/thalia/asu/course/MAT170"/>
        <skos:subject rdf:resource="http://purl.org/topic/thalia/ProgrammingLanguages"/>
        <skos:subject rdf:resource="http://purl.org/subject/thalia/AlgorithmDesign"/>
        <skos:subject rdf:resource="http://purl.org/subject/thalia/SystemArchitecture"/>
        <rdfs:seeAlso rdf:resource="http://www.eas.asu.edu/~cse110"/>
        <forUniversity rdf:resource="http://purl.org/thalia/asu"/>
    </Course>
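
Going from the flat SQL row of Step 3 to the RDF of Step 4 means pulling apart values that the source packs into one string: the course number and credits sit inside Title, and the prerequisite is buried in Description. The sketch below shows one way a mapping could do this and emit N-Triples. It is not the mapping used for the testbed. The `ONT` namespace, the `dept="CSE"` parameter (the department code does not appear in the SQL row) and both regular expressions are assumptions made for this example; the skos:subject links are omitted because they cannot be derived mechanically from the row.

```python
import re
import sqlite3

BASE = "http://purl.org/thalia/"
DC = "http://purl.org/dc/elements/1.1/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
# Hypothetical namespace for hasPrerequisite / forUniversity; the page does not name one.
ONT = "http://example.org/thalia-ontology#"

# Assumed title layout: "<number> <name>. (<credits>)"
TITLE_RE = re.compile(r"^(\d+)\s+(.+?)\.\s*\((\d+)\)$")
# Assumed prerequisite layout: "Prerequisite: <DEPT> <number>"
PREREQ_RE = re.compile(r"Prerequisite:\s*([A-Z]+)\s*(\d+)")


def uri(value):
    return f"<{value}>"


def literal(text, lang="en"):
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def row_to_triples(title, description, url, dept="CSE"):
    """Map one asu row to N-Triples lines. dept is an assumed department code."""
    title = " ".join(title.split())  # collapse the whitespace runs left in the source data
    match = TITLE_RE.match(title)
    if match is None:
        raise ValueError(f"unrecognised title format: {title!r}")
    number, name = match.group(1), match.group(2)
    course = uri(f"{BASE}asu/course/{dept}{number}")
    triples = [
        (course, uri(DC + "title"), literal(name)),
        (course, uri(RDFS + "seeAlso"), uri(url)),
        (course, uri(ONT + "forUniversity"), uri(BASE + "asu")),
    ]
    prereq = PREREQ_RE.search(description or "")
    if prereq:
        target = uri(f"{BASE}asu/course/{prereq.group(1)}{prereq.group(2)}")
        triples.append((course, uri(ONT + "hasPrerequisite"), target))
    return [" ".join(t) + " ." for t in triples]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asu (Title TEXT NOT NULL, Description TEXT, MoreInfoURL TEXT)")
conn.execute(
    "INSERT INTO asu (Title, Description, MoreInfoURL) VALUES (?, ?, ?)",
    (
        "110 Principles of Programming with      Java. (3)",
        "Concepts of problem solving using Java. Lecture, lab. Prerequisite: MAT 170.",  # abbreviated
        "http://www.eas.asu.edu/~cse110",
    ),
)

ntriples = []
for row in conn.execute("SELECT Title, Description, MoreInfoURL FROM asu"):
    ntriples.extend(row_to_triples(*row))
print("\n".join(ntriples))
```

The regular expressions fit only this one university's formatting. Other catalogs in the testbed encode the same facts differently, which is the kind of heterogeneity the benchmark is meant to exercise.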


  • Step 5. Examples of XQuery and SPARQL queries to find all courses with the string 'Data Structures' in the title.


    <XQuery>
        for $b in doc('umd.xml')/umd/Course
        where contains($b/CourseName, 'Data Structures')
        return $b
    </XQuery>


 
    <SPARQL>
        SELECT ?course
        WHERE
        {
            ?course a :Course;
                    dc:title ?title.
            FILTER (lang(?title) = "en")
            FILTER regex(?title, "Data Structures")
        }
    </SPARQL>
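
To make the semantics of the SPARQL query in Step 5 concrete, the following Python sketch evaluates the same three conditions (typed as a course, English title, title matching the pattern) over a small in-memory list of triples. The triples, the `ex:` prefix and the function name `courses_with_title` are made up for illustration and are not taken from the testbed data.

```python
import re

# Each triple is (subject, predicate, object). Literal objects are (text, lang) tuples.
# All subjects and titles below are invented sample data.
triples = [
    ("ex:umd/course/CMSC420", "rdf:type", ":Course"),
    ("ex:umd/course/CMSC420", "dc:title", ("Data Structures", "en")),
    ("ex:umd/course/CMSC106", "rdf:type", ":Course"),
    ("ex:umd/course/CMSC106", "dc:title", ("Introduction to C Programming", "en")),
    ("ex:eth/course/251", "rdf:type", ":Course"),
    ("ex:eth/course/251", "dc:title", ("Datenstrukturen und Algorithmen", "de")),
    ("ex:umd", "dc:title", ("Advanced Data Structures Lab", "en")),  # not typed as a Course
]


def courses_with_title(triples, pattern, lang="en"):
    """Return subjects typed :Course whose dc:title in `lang` matches `pattern`."""
    # Corresponds to the triple pattern "?course a :Course".
    courses = {s for s, p, o in triples if p == "rdf:type" and o == ":Course"}
    regex = re.compile(pattern)
    hits = []
    for s, p, o in triples:
        if s in courses and p == "dc:title":
            text, title_lang = o
            # Corresponds to the two FILTER clauses: language tag, then regex.
            if title_lang == lang and regex.search(text):
                hits.append(s)
    return hits


matches = courses_with_title(triples, "Data Structures")
print(matches)  # ['ex:umd/course/CMSC420']
```

As with SPARQL's regex() when no flags are given, the match here is case-sensitive and succeeds anywhere in the title, so no wildcard characters are needed around the search string.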


SQL to RDF mapping tools

There's a more comprehensive list at RdfAndSql.

Resources

Bibliography

Presentations

People Interested in the Area