Relationship of HDT to relevant other technologies

W3C Member Submission 30 March 2011

This version:: http://www.w3.org/submissions/2011/SUBM-HDT-Related-20110330/
Latest version:: http://www.w3.org/submissions/HDT-Related/
Editor:: Javier D. Fernández
Authors:: Javier D. Fernández
Miguel A. Martínez-Prieto
Claudio Gutierrez
Axel Polleres
Michael Hausenblas
Jürgen Umbrich

Copyright © 2011 DERI Galway at the National University of Ireland, Galway, Ireland, Free University of Bozen-Bolzano, The Open University, Universidad Politécnica de Madrid, Alcatel-Lucent, Cisco, OpenLink Software and Profium Ltd. All rights reserved.
This document is available under the W3C Document License. See the W3C Intellectual Rights Notice and Legal Disclaimers for additional information.

Abstract

This document contains a brief description of the relationship between RDF HDT (Header-Dictionary-Triples) and other selected relevant technologies.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications can be found in the W3C technical reports index at http://www.w3.org/TR/.

By publishing this document, W3C acknowledges that the Submitting Members have made a formal Submission request to W3C for discussion. Publication of this document by W3C indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. This document is not the product of a chartered W3C group, but is published as potential input to the W3C Process. A W3C Team Comment has been published in conjunction with this Member Submission. Publication of acknowledged Member Submissions at the W3C site is one of the benefits of W3C Membership. Please consult the requirements associated with Member Submissions of section 3.3 of the W3C Patent Policy. Please consult the complete list of acknowledged W3C Member Submissions.

VoID

The Vocabulary of Interlinked Datasets [VoID] provides a vocabulary and a set of instructions that allow the discovery and usage of linked data sets. VoID aims to bridge data publishers and data users, so that publishers can distribute their data sets (as an RDF dump, SPARQL endpoints, etc.) and users can discover and use identified data sets given certain attributes.

In the first case, a VoID data set will specify a data dump (void:dataDump) through a URI, and this URI will be the entry point of an HDT data set, i.e., the Header in which all the metadata is present.

In the second case, part of the publication metadata of the Header will make use of VoID properties. For example, the publication metadata can be pruned with VoID properties, such as void:sparqlEndpoint and void:exampleResource. Statistical metadata can also use basic VoID statistics, such as void:distinctSubjects.
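As an illustration of the first case, the following minimal Python sketch models a VoID description as plain (subject, predicate, object) tuples and looks up the void:dataDump target that would lead to the HDT Header. The example.org URIs and the helper name data_dumps are hypothetical and used only for illustration; the VoID property names are the ones mentioned above.

```python
VOID = "http://rdfs.org/ns/void#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical URIs, used only for illustration.
DATASET = "http://example.org/dataset"
HDT_HEADER = "http://example.org/dataset.hdt"

# A VoID description as a list of (subject, predicate, object) triples.
void_description = [
    (DATASET, RDF_TYPE, VOID + "Dataset"),
    (DATASET, VOID + "dataDump", HDT_HEADER),
    (DATASET, VOID + "sparqlEndpoint", "http://example.org/sparql"),
]


def data_dumps(triples, dataset):
    """Return every void:dataDump target declared for the given data set."""
    return [o for s, p, o in triples if s == dataset and p == VOID + "dataDump"]


if __name__ == "__main__":
    # A consumer would dereference this URI to reach the HDT Header.
    print(data_dumps(void_description, DATASET))
```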

Semantic Sitemaps

[Semantic Sitemaps] support efficient discovery and high-performance retrieval of semantic data sets. They are based on extending the traditional Sitemap Protocol with new XML tags for describing the presence of RDF data (and for dealing with specific RDF publishing needs).

HDT can interact with Semantic Sitemaps by considering HDT as an RDF format; thus, the Header of an HDT data set can be the final URI destination of a sitemap property such as sc:datasetURI or sc:dataDumpLocation. The Header could have some properties similar to the Semantic Sitemap ones, such as a description of the change frequency, but the protocol is different, so the interpretation could also differ.
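A minimal sketch of such a sitemap entry, built with the Python standard library, could look as follows. The sc namespace URI and the enclosing sc:dataset element are assumptions about the Semantic Sitemaps extension and should be checked against [Semantic Sitemaps]; the example.org URIs and the function name build_sitemap are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed namespace of the Semantic Sitemaps extension.
SC_NS = "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd"


def build_sitemap(dataset_uri, dump_location):
    """Return a sitemap whose data dump location points at an HDT Header."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("sc", SC_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    dataset = ET.SubElement(urlset, f"{{{SC_NS}}}dataset")
    ET.SubElement(dataset, f"{{{SC_NS}}}datasetURI").text = dataset_uri
    ET.SubElement(dataset, f"{{{SC_NS}}}dataDumpLocation").text = dump_location
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical URIs, used only for illustration.
    print(build_sitemap("http://example.org/dataset",
                        "http://example.org/dataset.hdt"))
```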

Internet Archive ARC file format

The Internet Archive ARC file format (ARC_IA) and its latest revision, the Web ARChive file format (WARC), specify a method for storing web crawls. Crawls are stored as sequences of content blocks together with some basic related information. WARC generalizes the format for better harvesting, accessing, and exchanging of resources. It also allows efficient indexing for access by URL and date.

In terms of HDT, these formats would be seen as a basic Header together with the raw data. Each document in an ARC_IA or WARC web crawl is preceded by some header information, such as the document file format and size, outward links, etc. Headers and documents (html, gif, etc.) are codified in DAT and ARC files respectively.
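The "header information followed by raw data" layout can be sketched with a simplified WARC-style record: a version line, named header fields, a blank line, and then a content block whose size is given by Content-Length. The sketch below is a minimal illustration, not a conforming WARC reader; the sample record omits fields that a real record carries, and the function name parse_warc_record is hypothetical.

```python
def parse_warc_record(data: bytes):
    """Split one WARC-style record into (version, headers, content block)."""
    head, _, rest = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # The content block is exactly Content-Length bytes long.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Simplified sample record (hypothetical content, for illustration only).
body = b"<html>hello</html>"
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

if __name__ == "__main__":
    version, headers, content = parse_warc_record(record)
    print(version, headers["WARC-Target-URI"], len(content))
```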

HDT focuses on a homogeneous RDF data set, while ARC_IA and WARC address complete and heterogeneous web crawls. Data definition in HDT is "fine-grained" in Dictionary and Triples, which can be indexed for "fine-grained" operations. ARC_IA and WARC provide a coarser granularity for the contained resources, and their design is markedly oriented towards preservation.
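The fine-grained split into Dictionary and Triples can be illustrated with a small sketch in which every RDF term is mapped to an integer ID and triples are stored as ID tuples. This is a simplification under stated assumptions: it uses a single shared ID space for all terms, which is not necessarily how an HDT Dictionary organizes its IDs, and the function names and sample terms are hypothetical.

```python
def encode(triples):
    """Map every term to an integer ID and rewrite triples as ID tuples."""
    terms = sorted({term for triple in triples for term in triple})
    dictionary = {term: i for i, term in enumerate(terms, start=1)}
    id_triples = sorted(tuple(dictionary[t] for t in triple) for triple in triples)
    return dictionary, id_triples


def decode(dictionary, id_triples):
    """Invert the dictionary and restore the original terms."""
    terms = {i: term for term, i in dictionary.items()}
    return [tuple(terms[i] for i in triple) for triple in id_triples]


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:alice", "foaf:name", '"Alice"'),
        ("ex:bob", "foaf:name", '"Bob"'),
    ]
    dictionary, id_triples = encode(sample)
    print(id_triples)
```

Because each term is stored once and triples only carry small integers, both parts can be indexed and compressed independently.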

Efficient XML Interchange (EXI) Format

The Efficient XML Interchange Format (EXI) is a compact representation for XML. It is based on efficient encodings of XML event streams using a grammar-driven approach. The stream of events is represented using variable length codes. EXI can utilize schema information to improve compactness and processing efficiency. When schemas are used, it allows efficient user-defined Datatypes.
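To give a feel for what a variable-length code is, the sketch below implements one generic scheme: an unsigned integer is split into 7-bit groups, least significant group first, and the high bit of each byte signals whether more bytes follow. It illustrates the general idea only and is not a rendering of the EXI event-code grammar; the function names are hypothetical.

```python
def encode_uint(value: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups with a continuation bit."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)  # final group, high bit clear
            return bytes(out)


def decode_uint(data: bytes) -> int:
    """Decode the first integer encoded by encode_uint from data."""
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value
        shift += 7
    raise ValueError("truncated input")


if __name__ == "__main__":
    print(encode_uint(300).hex(), decode_uint(encode_uint(300)))
```

Small values occupy a single byte, so streams dominated by small numbers become compact.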

HDT shares with EXI the aims of efficiency, flexibility and compactness. EXI streams are codified in two parts, similar to the HDT core data; streams are composed of a header (similar to the HDT Control Information) and an EXI body with the events (equivalent to the HDT body).

In contrast, EXI is not focused on publication and resource discovery (the Header component of HDT), and it is involved neither in indexing nor in querying the data.

RDF Representations

Several representations for RDF data exist to date, but none of them seems to have considered data volume as a primary goal. HDT differs from other RDF representations in that it focuses on publishing and exchanging RDF data at large scale. While current proposals try to be human-readable (with a few compacting structures such as collections and lists of elements), HDT is an efficient machine-readable serialization format. Here we discuss some issues for the best-known RDF representations.

RDF/XML

RDF/XML, due to its verbosity, is good for exchanging data, but only at small scale. It includes some compacting features such as:

Regarding HDT, the third feature is supported by the dictionary configuration, which defines common prefixes and a base URI. The other three features are outperformed by the Adjacency List implementation of triples, e.g., in Compact Triples and Bitmap Triples.
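The idea behind an adjacency-list encoding with bitmaps can be sketched as follows: ID triples are sorted by subject, predicate and object; the predicates and objects are written as two flat sequences; and two bit sequences mark where each list ends. The sketch assumes that a 1 bit marks the last element of a list and that subject IDs are consecutive starting at 1, so subjects can be left implicit. The HDT specification may use different conventions, and the function names are hypothetical.

```python
from itertools import groupby


def to_bitmap_triples(id_triples):
    """Encode (s, p, o) ID triples as two sequences plus two bitmaps."""
    seq_p, bits_p, seq_o, bits_o = [], [], [], []
    triples = sorted(set(id_triples))
    for _, by_subject in groupby(triples, key=lambda t: t[0]):
        pairs = [(p, [t[2] for t in group])
                 for p, group in groupby(by_subject, key=lambda t: t[1])]
        for i, (p, objects) in enumerate(pairs):
            seq_p.append(p)
            # 1 marks the last predicate of the current subject.
            bits_p.append(1 if i == len(pairs) - 1 else 0)
            for j, o in enumerate(objects):
                seq_o.append(o)
                # 1 marks the last object of the current (subject, predicate).
                bits_o.append(1 if j == len(objects) - 1 else 0)
    return seq_p, bits_p, seq_o, bits_o


def from_bitmap_triples(seq_p, bits_p, seq_o, bits_o):
    """Rebuild the triples; subjects are implicit and start at 1."""
    triples = []
    subject, pi = 1, 0
    for o, last_object in zip(seq_o, bits_o):
        triples.append((subject, seq_p[pi], o))
        if last_object:
            if bits_p[pi]:
                subject += 1  # the predicate list of this subject is finished
            pi += 1
    return triples


if __name__ == "__main__":
    # Hypothetical ID triples, for illustration only.
    ids = [(1, 2, 3), (1, 2, 4), (1, 5, 6), (2, 2, 3)]
    print(to_bitmap_triples(ids))
```

In this sample, four triples (twelve integers) are stored as seven integers plus seven bits, because repeated subjects and repeated subject-predicate pairs are written only once.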

Notation3 (N3)

[Notation3 (N3)] is a language originally intended as a compact and readable alternative to RDF's XML syntax. Thus, it reduces verbosity and represents RDF with a simple grammar based on the natural triples philosophy. It also allows some compacting features:

Except for the last one, all these abbreviations are also present in HDT, either in the dictionary configuration or in the triples implementation. The dictionary configuration allows common prefixes and a base URI to be defined. Shorthands are well specified in the HDT syntax. The Adjacency List implementation of triples outperforms the simple lists of N3. Blank nodes in HDT are named with the _: namespace prefix.

N3 is extended to allow greater expressiveness, e.g. with Quantification, which is not considered in HDT.

Turtle

[Turtle] is a more compact and readable alternative. It is intended to be compatible with N3, as a subset of it. Thus, it inherits its compacting features and adds further ones, e.g. abbreviations for RDF Collections ([Turtle], section 2.5).

RDF/JSON

[RDF/JSON] resembles Turtle, with the advantage of being coded in the JavaScript Object Notation (JSON), a language that is easier to parse and more widely accepted in the programming world. It is intended to be easy for humans to read and write and easy for machines to parse and generate. HDT is a more compact format focusing on machine-readable data at large scale, keeping the Header component as the entry point for both humans and machines.
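The resemblance to Turtle can be seen in how triples are grouped: first by subject, then by predicate, with a list of objects at the innermost level. The sketch below builds such a structure with the Python standard library. The "value" and "type" keys follow the commonly used RDF/JSON layout and are an assumption to be checked against [RDF/JSON]; the sample URIs and the function name to_rdf_json are hypothetical.

```python
import json


def to_rdf_json(triples):
    """Group triples by subject, then predicate, into a JSON-ready dict."""
    doc = {}
    for subject, predicate, (value, kind) in triples:
        objects = doc.setdefault(subject, {}).setdefault(predicate, [])
        objects.append({"value": value, "type": kind})
    return doc


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name",
         ("Alice", "literal")),
        ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
         ("http://example.org/bob", "uri")),
    ]
    print(json.dumps(to_rdf_json(sample), indent=2))
```

Each subject and predicate string is still repeated in full text for every group, which is where a dictionary-based format such as HDT saves space at large scale.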

References

Acknowledgements (Informative)

HDT work is partially funded by MICINN (TIN2009-14009-C02-02), Millennium Institute for Cell Dynamics and Biotechnology (ICDB) (Grant ICM P05-001-F), and Fondecyt 1090565 and 1110287. Javier D. Fernández is supported by a grant from the Regional Government of Castilla y León (Spain) and the European Social Fund.