Implementation of HDT

W3C Member Submission 30 March 2011

This version:
http://www.w3.org/submissions/2011/SUBM-HDT-Implementation-20110330/
Latest version:
http://www.w3.org/submissions/HDT-Implementation/
Editor:
Javier D. Fernández
Authors:
Javier D. Fernández
Miguel A. Martínez-Prieto
Claudio Gutierrez
Axel Polleres
Mario Arias
Alejandro Andrés
Guillermo Rodríguez-Cano

Abstract

This document contains a brief description of the implementation of a tool to create and interact with RDF HDT (Header-Dictionary-Triples).

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is a part of the HDT Submission which comprises five documents:

  1. Binary RDF Representation for Publication and Exchange (HDT)
  2. Extending VoID for publishing HDT
  3. RDF Schema for HDT Header Descriptions
  4. Relationship of HDT to relevant other technologies
  5. Implementation of HDT

By publishing this document, W3C acknowledges that the Submitting Members have made a formal Submission request to W3C for discussion. Publication of this document by W3C indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. This document is not the product of a chartered W3C group, but is published as potential input to the W3C Process. A W3C Team Comment has been published in conjunction with this Member Submission. Publication of acknowledged Member Submissions at the W3C site is one of the benefits of W3C Membership. Please consult the requirements associated with Member Submissions of section 3.3 of the W3C Patent Policy. Please consult the complete list of acknowledged W3C Member Submissions.

Introduction

Figure 1: A Step-by-step construction of the HDT format from a set of triples (PNG)

Figure 1 shows a conceptual description of the process of obtaining an HDT representation from an RDF graph. The first step extracts the basic RDF features needed to build the Dictionary and the underlying graph, as well as the information that will be included in the Header. The second and third steps build the Dictionary and encode the Triples, respectively. In the fourth step, the abstract notion of HDT is implemented as a practical, usable HDT ready for modular and clean publication (and management) and compact exchange.

HDT-It! 0.7 is a C++ tool that performs this process. It is free, open-source software that uses the Raptor library to provide a set of parsers and serializers between HDT and the main RDF syntaxes. It also provides a basic querying interface. The project is hosted at http://code.google.com/p/hdt-it.

HDT Creation

HDT creation is the process of converting an existing RDF document (in a given syntax) into HDT. HDT-It! first parses the given document (RDF/XML, N3, Turtle or JSON) with the Raptor library.

HDT creation is guided by a configuration file, supplied at execution time, that sets the main parameters (documented on the project site). Converting the original RDF document is a multi-phase process.

Dictionary Building

The Dictionary component is an abstract class that is instantiated through a concrete dictionary implementation. HDT-It! 0.7 provides the concrete class DictionaryPlain, which is the default dictionary implementation.

HDT-It! 0.7 uses hash and vector structures to maintain the mapping between strings and IDs. A final sorting and re-mapping operation makes the IDs follow the alphabetical order of the strings.
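
The hash-plus-vector mapping with a final sort and re-map can be sketched as follows. This is a minimal illustration and not the HDT-It! source: the class name SketchDictionary and its methods are hypothetical, a single ID space shared by all terms is assumed, IDs are assumed to start at 1, and "alphabetical" is approximated by the byte-wise ordering of std::string.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch; not the DictionaryPlain class shipped with HDT-It!.
class SketchDictionary {
public:
    // Register a term while parsing. Returns a temporary ID that reflects
    // insertion order (IDs start at 1 by assumption).
    std::size_t insert(const std::string& term) {
        auto it = ids_.find(term);
        if (it != ids_.end()) return it->second;
        terms_.push_back(term);
        ids_.emplace(term, terms_.size());
        return terms_.size();
    }

    // Final step: sort the terms and re-map every ID to the sorted position.
    // Returns a table indexed by temporary ID (slot 0 is unused).
    std::vector<std::size_t> finalize() {
        const std::vector<std::string> inserted = terms_;  // insertion order
        std::sort(terms_.begin(), terms_.end());           // byte-wise order
        for (std::size_t i = 0; i < terms_.size(); ++i) {
            ids_[terms_[i]] = i + 1;
        }
        std::vector<std::size_t> remap(inserted.size() + 1, 0);
        for (std::size_t old = 1; old <= inserted.size(); ++old) {
            remap[old] = ids_[inserted[old - 1]];
        }
        return remap;
    }

    // String -> ID through the hash structure (0 means "not present").
    std::size_t idOf(const std::string& term) const {
        auto it = ids_.find(term);
        return it == ids_.end() ? 0 : it->second;
    }

    // ID -> string through the vector structure (throws if out of range).
    const std::string& termOf(std::size_t id) const {
        return terms_.at(id - 1);
    }

private:
    std::unordered_map<std::string, std::size_t> ids_;
    std::vector<std::string> terms_;
};
```

In this sketch the hash map answers string-to-ID lookups and the vector answers ID-to-string lookups, so both directions stay cheap after finalize() has been called.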

Triples Encoding

The Triples component is an abstract class that is instantiated through a concrete triples implementation. HDT-It! 0.7 provides the Plain Triples, Compact Triples and Bitmap Triples implementations. The configuration file specifies which concrete implementation to use.

Once the dictionary is built, HDT-It! 0.7 reads the original RDF document a second time, replacing each term with its ID, building an auxiliary vector structure that represents the triples, and sorting it in the Adjacency List order (the default one, or the order specified in the configuration file). All three implementations use this structure.
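
The second pass can be pictured with the sketch below. It is an illustration under stated assumptions rather than the HDT-It! code: the names StringTriple, TripleID, Order and encodeTriples are hypothetical, the parsed triples are taken from an in-memory vector instead of a Raptor parser, and one shared ID space is assumed for subjects, predicates and objects.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only.
struct StringTriple { std::string s, p, o; };
struct TripleID { std::size_t s, p, o; };
enum class Order { SPO, SOP, PSO, POS, OSP, OPS };

using TermIds = std::unordered_map<std::string, std::size_t>;

// Arrange the three components according to the requested order, so that
// comparing the resulting arrays lexicographically yields that order.
inline std::array<std::size_t, 3> sortKey(const TripleID& t, Order ord) {
    switch (ord) {
        case Order::SPO: return {t.s, t.p, t.o};
        case Order::SOP: return {t.s, t.o, t.p};
        case Order::PSO: return {t.p, t.s, t.o};
        case Order::POS: return {t.p, t.o, t.s};
        case Order::OSP: return {t.o, t.s, t.p};
        case Order::OPS: return {t.o, t.p, t.s};
    }
    return {t.s, t.p, t.o};
}

// Replace every term with its dictionary ID, collect the ID triples in a
// vector and sort the vector (S-P-O unless another order is requested).
// ids.at() throws std::out_of_range for a term missing from the dictionary.
std::vector<TripleID> encodeTriples(const std::vector<StringTriple>& parsed,
                                    const TermIds& ids,
                                    Order ord = Order::SPO) {
    std::vector<TripleID> out;
    out.reserve(parsed.size());
    for (const StringTriple& t : parsed) {
        out.push_back({ids.at(t.s), ids.at(t.p), ids.at(t.o)});
    }
    std::sort(out.begin(), out.end(),
              [ord](const TripleID& a, const TripleID& b) {
                  return sortKey(a, ord) < sortKey(b, ord);
              });
    return out;
}
```

After sorting in S-P-O order, all triples of one subject are contiguous and, within a subject, all triples of one predicate are contiguous, which is the grouping an adjacency-list representation needs.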

HDT output

If one or more output files are given in the configuration file, HDT-It! 0.7 creates the Header, Dictionary and Triples files.

This implementation, HDT-It! 0.7, does not perform the compression phase for either the Dictionary or the Header. The user must therefore run the appropriate application (e.g. a gzip or Huffman compressor) over the generated output and update the dc:format property in the Header.

HDT in use

HDT-It! 0.7 can load an HDT from a given HDT Header. It supports several features:

This implementation, HDT-It! 0.7, does not perform the decompression phase for either the Dictionary or the Header. The user must therefore first run the appropriate application (e.g. a gzip or Huffman decompressor) over the original input.

Querying HDT

This feature is only available for Bitmap Triples, because it relies on the rank and select operations that the Bitmap indexing provides and that the Check&Find operation uses.
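
As a reminder of what rank and select compute, the sketch below implements both with a plain linear scan. It is not the HDT-It! implementation, which is not reproduced in this document: the names NaiveBitmap, Range and listRange are hypothetical, positions are 0-based, and the convention that a 1-bit marks the last element of each adjacency list is an assumption made for illustration. Practical bitmap indexes answer these operations with auxiliary counters instead of scanning.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, deliberately naive bitmap with linear-time rank and select.
class NaiveBitmap {
public:
    explicit NaiveBitmap(std::vector<bool> bits) : bits_(std::move(bits)) {}

    std::size_t size() const { return bits_.size(); }

    // rank1(pos): number of 1-bits in positions [0, pos], inclusive.
    std::size_t rank1(std::size_t pos) const {
        std::size_t count = 0;
        for (std::size_t i = 0; i < bits_.size() && i <= pos; ++i) {
            if (bits_[i]) ++count;
        }
        return count;
    }

    // select1(k): position of the k-th 1-bit (k >= 1).
    // Returns size() when the bitmap holds fewer than k 1-bits.
    std::size_t select1(std::size_t k) const {
        std::size_t seen = 0;
        for (std::size_t i = 0; i < bits_.size(); ++i) {
            if (bits_[i] && ++seen == k) return i;
        }
        return bits_.size();
    }

private:
    std::vector<bool> bits_;
};

struct Range { std::size_t first, last; };

// Positions covered by the k-th list (k >= 1), assuming a 1-bit marks the
// last element of every list.
Range listRange(const NaiveBitmap& b, std::size_t k) {
    std::size_t first = (k == 1) ? 0 : b.select1(k - 1) + 1;
    return {first, b.select1(k)};
}
```

With the bit sequence 0 1 1 0 0 1, listRange reports positions 0-1 for the first list, 2-2 for the second and 3-5 for the third, while rank1 applied to a position tells how many lists end at or before it.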

HDT-It! 0.7 accepts queries from the console or from a given file (documented on the project site). The operations can be:

The S-P-O Adjacency List order is assumed. The response patterns vary for the alternative representations: the S-O-P, P-S-O, P-O-S, O-P-S and O-S-P Adjacency Lists.


References

[Notation3 (N3)]
T. Berners-Lee. Notation 3. Available at http://www.w3.org/DesignIssues/Notation3
[RDF/JSON]
K. Alexander. RDF/JSON: A Specification for serialising RDF in JSON. In SFSW 2008.
[RDF/XML]
D. Beckett. RDF/XML Syntax Specification (Revised). W3C Recommendation 10 February 2004. Available at http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[Turtle]
D. Beckett, T. Berners-Lee. Turtle - Terse RDF Triple Language. W3C Team Submission 14 January 2008. Available at http://www.w3.org/TeamSubmission/2008/SUBM-turtle-20080114/

Acknowledgements (Informative)

HDT work is partially funded by MICINN (TIN2009-14009-C02-02), the Millennium Institute for Cell Dynamics and Biotechnology (ICDB) (Grant ICM P05-001-F), and Fondecyt 1090565 and 1110287. Javier D. Fernández is supported by a grant from the Regional Government of Castilla y León (Spain) and the European Social Fund.