Back in October last year, I highlighted an EU-funded big data project we’re involved in. Led by Sören Auer at the Fraunhofer Institute for Intelligent Analysis and Information Systems, Big Data Europe has built a remarkably flexible big data processing platform. It wraps a lot of well-known components like Apache Spark and HDFS in Docker containers, along with triple stores (4Store, Semagrow, Strabon) and more. Through a simple UI, you select the components you want, click, and the platform deploys as many instances of each component as you need on whatever infrastructure you choose, creating your processing pipeline.
Most interesting from a W3C perspective is the way it handles data variety – one of the 3 Vs of big data (or 4 Vs if you include veracity). Data is stored in whatever format it’s in – relational, CSV, XML, RDF, JSON – just as it is. That’s the data lake, or data swamp – choose your metaphor. What BDE does then is apply a semantic layer on top, so that you can run a SPARQL query across all the data in the platform. A virtual graph is created at query time: the SPARQL query is deconstructed, individual bits of information are pulled from whichever dataset is needed, then joined and returned as a single response. One of the lead engineers, Mohamed Nadjib Mami, explains more in this video. I heard about a similar approach being taken in a completely different context recently at the OGC’s Location Powers workshop in Delft. It has been shown to outperform the more usual approach of transforming everything into a single format at ingestion time, since you only ever query a small portion of the data. Of course this all depends on semantics, vocabularies, URIs and links.
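To make the idea concrete, here’s a minimal sketch of that query-time “virtual graph” pattern. It is not BDE’s implementation – the platform works over HDFS, relational stores and real SPARQL – just a toy illustration: two sources stay in their native formats (CSV and JSON, with made-up identifiers like Q64), each is lifted to triples only when a query asks for them, and the results are joined on a shared identifier into a single response.

```python
import csv
import io
import json

# Hypothetical raw sources, left in their native formats (the "data lake").
CSV_DATA = "id,name\nQ64,Berlin\nQ90,Paris\n"
JSON_DATA = '{"Q64": {"population": 3769000}, "Q90": {"population": 2161000}}'

def triples_from_csv(text):
    # Lift CSV rows to (subject, predicate, object) triples on demand --
    # nothing is converted at ingestion time.
    for row in csv.DictReader(io.StringIO(text)):
        yield (row["id"], "name", row["name"])

def triples_from_json(text):
    # The same lifting step for a JSON source.
    for subject, props in json.loads(text).items():
        for predicate, obj in props.items():
            yield (subject, predicate, obj)

def query(sources, predicates):
    # Build the virtual graph: pull only the triples each pattern needs
    # from each source, then join them on the shared subject identifier.
    matches = {}
    for source in sources:
        for s, p, o in source:
            if p in predicates:
                matches.setdefault(s, {})[p] = o
    # Keep only subjects for which every requested pattern matched,
    # and return them as one joined response.
    return {s: props for s, props in matches.items()
            if all(p in props for p in predicates)}

answer = query(
    [triples_from_csv(CSV_DATA), triples_from_json(JSON_DATA)],
    ["name", "population"],
)
```

Here `answer` maps each city identifier to both its name (from the CSV) and its population (from the JSON), even though neither source was ever rewritten into a common format – which is exactly the trade-off the ingestion-time approach gives up.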
If you’d like to know more about the Big Data Europe Integrator Platform, including the Semantic Data Lake, please join us for our official launch Webinar next Wednesday, take a look at the project Web site, or dive right into the GitHub repo.