
Big Data RDF Store Benchmarking Experiences

Recently we were able to present new BSBM results, testing the RDF triple stores Jena TDB, BigData, BIGOWLIM, and Virtuoso on various data sizes. These results extend the state of the art in various dimensions:

  • scale: this is the first time that RDF store benchmark results at such a large scale have been published. The largest previously published BSBM results were on 200M triples; the 150B experiments thus mark a 750x increase in scale.
  • workload: this is the first time that results on the Business Intelligence (BI) workload are published. In contrast to the Explore workload, which features short-running “transactional” queries, the BI workload consists of queries that go through possibly billions of triples, grouping and aggregating them (using the grouping and aggregation functionality that is new in SPARQL 1.1; see the sketch after this list).
  • architecture: this is the first time that RDF store technology with cluster functionality has been publicly benchmarked.
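To give a feel for the shape of such BI-style queries, here is a minimal Python sketch that fires a grouping/aggregation query at a SPARQL endpoint. It is illustrative only: the endpoint URL is a placeholder and the query is not an actual BSBM BI query, merely one in the same spirit.

    # Hedged sketch: endpoint URL and query are illustrative placeholders,
    # not BSBM artifacts. Requires the SPARQLWrapper package.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("http://localhost:8890/sparql")  # placeholder endpoint URL
    endpoint.setQuery("""
        PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
        SELECT ?producer (COUNT(?product) AS ?cnt)
        WHERE { ?product bsbm:producer ?producer . }
        GROUP BY ?producer           # GROUP BY / COUNT: SPARQL 1.1 aggregation
        ORDER BY DESC(?cnt)
        LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["producer"]["value"], row["cnt"]["value"])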

These results come more than 18 months after I issued the call for participation in large-scale benchmarking of the latest V3.1 of the Berlin SPARQL Benchmark on the LISA cluster.

The LISA cluster of the Dutch national supercomputing center SARA: powerful, but hard to use for cluster database benchmarks.

While the new results are nothing short of spectacular, in this blog I will try to answer the question: what took us so long?

There were many hurdles that stood in the way of running these experiments successfully; they can be divided into two kinds: hardware-related and software-related hurdles.

The hardware-related hurdles lie in the strictly time-shared nature of the LISA cluster, which is quite typical for centralized scientific computing environments used for e-science. To access such machinery, one has to create a fully scripted job that is put into a queue; this job subsequently gets executed after a while (minutes, hours, or days). The final results all have to be saved somewhere (typically in data files), and of course one generally also collects all kinds of logs recording performance data and error messages. After the job has run, the scientist then needs to inspect these files to find out what happened and whether the job produced the intended results, or errors.
Further, the LISA cluster has various classes of machines, but only the most modern type of node had the significant hard disk resources needed for our large-scale experiments. These nodes tended to be highly popular, and therefore only sparsely available. It would typically take 2-3 days before our job found a slot to run in the cluster. One may imagine that this non-interactive access is quite cumbersome when developing and tuning set-ups for multiple cluster database systems. It is all too easy to make a small mistake, only to find out about the syntax error three days later, after the job ran, leading to a subsequent re-try. Of course, such job scripts can be tested on a single machine beforehand, but testing the behavior of a cluster database system on a single machine is not the same as testing it on the real hardware. Further, performance tuning (experimenting with configuration file settings for the various products, reacting to the observed performance differences, and homing in on ‘good’ parameter settings with every try) is also very hard to do non-interactively.
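For concreteness, the submit-wait-inspect cycle looks roughly like the Python sketch below. It uses PBS-style qsub/qstat commands; the job script name and log file naming are illustrative assumptions, not our actual LISA scripts.

    # Sketch of the non-interactive workflow a queue-based cluster imposes.
    # "run_benchmark.sh" is a hypothetical job script.
    import subprocess, time

    job_id = subprocess.check_output(["qsub", "run_benchmark.sh"]).decode().strip()
    print("submitted", job_id, "- now we wait, possibly for days")

    # qstat exits non-zero once the job has left the queue
    while subprocess.call(["qstat", job_id], stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL) == 0:
        time.sleep(600)                 # poll the queue every 10 minutes

    # only now can we read the logs and discover whether the run succeeded
    # or a typo wasted three days (PBS writes stdout to <script>.o<jobnumber>)
    print(open("run_benchmark.sh.o" + job_id.split(".")[0]).read())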

For these reasons our efforts on the LISA cluster, which started in winter 2011 and continued over spring 2012, did not make enough progress, and the project slowly ground to a halt. This changed when CWI brought its new SCILENS cluster into production, and we started using it in fall 2012.

[ A small side-road: the SCILENS cluster. SCILENS is a new kind of hardware cluster that has been designed from the ground up to serve large-scale data management. The machines in the SCILENS cluster are organized in three levels, called ‘pebbles’, ‘bricks’, and ‘rocks’. Each successive level has fewer nodes, but the individual machines increase in computational and disk resources (and price tag). The SCILENS cluster uses cheap consumer hardware, optimized to pack as much power in as little space as possible, making use of consumer home-theater mini-PC cases (‘Shuttle boxes’), connected by a high-performance Infiniband network.

Due to the negative performance impact of network traffic during SPARQL query processing on large clusters (where joins tend to be 'communicating' joins, in which all machines need to exchange data), and because network usage increases super-linearly with more nodes, it is generally better for RDF stores to work with fewer nodes with more (RAM) resources than with many nodes with few resources. We therefore chose as our experimental platform the ‘rocks’ layer of SCILENS, which consists of sixteen 256GB-RAM machines, each with 16 cores (dual-socket Intel servers). The price tag of the eight machines involved in the experiments, including the Infiniband network infrastructure, is below 70K EUR.
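One simplified way to see the super-linear effect is to count point-to-point messages in an all-to-all reshuffle. The following is a toy model of our own, not a measurement of any of the tested systems: the message count grows quadratically with the node count while each message shrinks, so latency overhead comes to dominate.

    # Toy model (our own simplification) of a 'communicating' join reshuffle.
    def reshuffle(n_nodes, total_bytes):
        messages = n_nodes * (n_nodes - 1)             # all-to-all message pairs
        bytes_per_msg = total_bytes / (n_nodes * n_nodes)
        return messages, bytes_per_msg

    for n in (2, 4, 8, 16, 64):
        msgs, per_msg = reshuffle(n, 1e12)             # 1 TB of intermediate data
        print("%3d nodes: %5d messages, %.1f MB each" % (n, msgs, per_msg / 1e6))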

The 'bricks' and 'pebbles' layers of the SCILENS cluster are hand-built from 384 Shuttle boxes, packing CPU and ample disk resources in little space.

The SCILENS cluster contains much more I/O resources per CPU core than is usual in compute clusters. The relation between CPU power and I/O resources is captured by the Amdahl number: the number of I/O bytes per CPU core cycle the system can deliver. In the case of the SCILENS cluster this number is close to 1.0, whereas typical clusters at supercomputing facilities, such as LISA at SARA, only get to 0.2 (1 byte per 5 cycles). We do confess that, while all this I/O power is interesting, in the workloads presented so far most data is RAM-resident. One reason was that the high-performance multi-SSD I/O subsystem of the bricks layer was not yet operational at the time of testing. This provides ground for a follow-up experiment using this fast I/O layer, which we expect to accelerate the load phase and to make it possible to address even larger datasets efficiently on the same hardware. ]
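To make the Amdahl number concrete, here is a back-of-envelope sketch. The bandwidth and clock figures are illustrative assumptions, not measured specifications of either cluster:

    # Amdahl number: I/O bytes delivered per CPU core cycle.
    def amdahl_number(io_bytes_per_sec, cores, clock_hz):
        return io_bytes_per_sec / (cores * clock_hz)

    # a 16-core 2 GHz node needs ~32 GB/s of I/O bandwidth to reach 1.0:
    print(amdahl_number(32e9, cores=16, clock_hz=2e9))    # -> 1.0
    # the same node with a single 200 MB/s disk sits far below that:
    print(amdahl_number(200e6, cores=16, clock_hz=2e9))   # -> ~0.006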

The second kind of hurdle was software-related. On the one hand, this is about the ease of use and ease of configuration of the cluster versions of the tested RDF store products, and their stability. On the other hand, it is about the ability of their query optimizers to produce reasonable plans. Reasonable here means that the queries in the Explore Use Case of BSBM, which consist of “small” lookup queries, achieve logarithmic complexity with respect to data size (i.e. sub-linear performance, queries finishing in milliseconds). In the case of the Business Intelligence (BI) Use Case of BSBM, the story is quite different. Analytical queries detect trends in billions of triples by grouping and aggregating them. In the BI use case, consequently, the queries have either O(N) (i.e. linear) or O(N log N) complexity with respect to data size. However, there are many joins in these BI queries, and if the query optimizer is not able to find a good plan, one can easily get O(N*N) complexity. On large data sizes, this quadratic complexity becomes an absolute show-stopper, as such queries then run for an unreasonable amount of time, making any experiment (including a cluster experiment) infeasible. I will expand on these two difficulties in turn, after a short interlude.
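A toy illustration of the plan-quality point (our own construction, not any store's actual planner): the same two-pattern join answered with a hash join touches each triple a constant number of times, while a nested-loop plan touches N*N pairs.

    # Toy join over synthetic (subject, predicate, object) triples.
    N = 100_000
    triples  = [("product%d" % i, "producer", "vendor%d" % (i % 100)) for i in range(N)]
    triples += [("product%d" % i, "price", i) for i in range(N)]

    # good plan: hash one triple pattern, probe with the other -> O(N)
    producer_of = {s: o for (s, p, o) in triples if p == "producer"}
    results, steps = [], 0
    for (s, p, o) in triples:
        steps += 1
        if p == "price" and s in producer_of:
            results.append((producer_of[s], o))
    print("hash-join plan: %d steps, %d results" % (steps, len(results)))

    # bad plan: compare every 'producer' triple against every 'price' triple,
    # i.e. O(N*N); at this N that is 10^10 comparisons, so we only print the count:
    print("nested-loop plan would need %d steps" % (N * N))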

[ Another side-road: some thoughts on the BSBM benchmark. The BSBM V3.1 benchmark comes with a number of innovations over V3.0, in particular (i) overall Power and Throughput performance metrics, (ii) a parallel data generator (generating 150 billion triples with the original single-threaded generator would have taken three weeks), and (iii) a drill-down mode for the BI workload that makes results more stable by choosing parameters from well-defined classes with stable selectivity across benchmark runs.

In the Explore use case, Q6 has been eliminated from V3.1 onwards, as its performance was not sub-linear. In the recent experiments we also detected problems with Q5 of the Explore Use Case, whose running time on large data sizes comes to dominate overall query time. It is therefore our recommendation for the next version of BSBM to revise this query such that its performance stays sub-linear, as is the case for all other Explore Use Case queries.

Overall, from an RDF database point of view, BSBM is probably the best benchmark around, although a weak point remains the strictly regular structure of the data: BSBM can just as well (or even better) be run on SQL systems, and as such does not highlight the strengths in data integration provided by RDF. This criticism, however, holds for most RDF benchmarks around today. The LDBC consortium, described below, is currently working on creating new benchmarks that address these limitations. ]

The main benefit of moving from LISA to SCILENS was the ability to move from a scripted, queue-based system to one where we could have interactive access, i.e. using ssh. Regardless of the RDF store used, it is obvious that configuring and properly setting up a cluster system is more work and more error-prone than using a single-node system (the latter is often very easy). Further, there is the issue of software stability. In the case of Openlink Virtuoso, during much of 2012 we were working with alpha and beta versions of the V7 product that was only recently released, in May 2013.

The Virtuoso V7 product brings a big overhaul of the previous Virtuoso V6 Cluster Edition product, allowing much more complex queries (such as the BSBM BI workload) to be parallelized over a cluster. Further, and quite relevant for the cooperation of the database group of CWI in the LOD2 project, is the fact that for data storage Virtuoso V7 now uses column-store technology with aggressive compression, and its query evaluation engine has been vectorized (i.e. data is not processed a triple at a time, but a vector at a time); this vectorized column-store technology is even part of the open-source Virtuoso V7 single-node system. These technologies were pioneered by CWI in its MonetDB and Vectorwise systems, and Openlink has worked in the past three years on incorporating them in its architecture, in the move from V6 to V7. We were fully aware of the implications of working with pre-release ‘bleeding-edge’ development software, provided to us through our collaboration with Openlink in the LOD2 EU project, and it obviously made the experimentation on SCILENS an affair of iterative improvement, taking many months. The Openlink folks helped a great deal in ironing out problems, gradually increasing the performance levels to the surprising heights that were finally reported. In fact, the Virtuoso V7 results we published in the end were still obtained on a beta version dating from February 2013, and we know that the recently released version is actually still faster.
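The vectorization idea is easy to demonstrate in miniature. The following generic Python/NumPy sketch (an illustration of the principle, not Virtuoso code) shows the gap between value-at-a-time interpretation and vector-at-a-time execution when summing a column of ten million values:

    import time
    import numpy as np

    prices = np.random.rand(10_000_000)    # a ten-million-value 'column'

    t0 = time.time()
    total = 0.0
    for v in prices:                       # value-at-a-time: interpreter overhead per value
        total += v
    t_loop = time.time() - t0

    t0 = time.time()
    total_vec = prices.sum()               # vector-at-a-time: one call over the whole column
    t_vec = time.time() - t0

    print("loop: %.2fs  vectorized: %.4fs  speedup: %.0fx"
          % (t_loop, t_vec, t_loop / t_vec))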

I will now move to the second software-related difficulty, which is that cluster systems need to be tested at scale for the exercise to make sense. The only systems that we tested in the end using multiple nodes on the SCILENS cluster were BIGOWLIM and Virtuoso V7. That is, we backed off from serious attempts to run 4store and BigData on BSBM in cluster mode. The reason is that running in cluster mode tends to only make sense on large data sizes. Yet on the complex queries of the Business Intelligence Use Case of BSBM, these two systems already seem to get bad query plans that cause long runtimes on single-node systems at relatively small data sizes (10M triples). With query times growing super-linearly, running at large scale (1 billion triples or more) just does not produce queries that finish in any reasonable time, whether a cluster is used or not.

We wanted to try the cluster editions specifically on large sizes (1 billion triples and more) because there is a performance penalty in moving from a single-node system to a true cluster architecture: RDF data needs to be spread over the cluster, and the nodes need to communicate over the network during query processing. Amortizing this extra communication cost requires larger data sizes. One reason is that with more data volume being sent around, the effect of network latency becomes relatively less pronounced, and network latency (rather than network bandwidth) is the biggest danger to performance. The other reason is that on large data volumes, where the data starts to exceed the RAM of a single machine and (random) disk I/O thus comes into the picture due to swapping, the performance of single-node systems degrades. On such data sizes there is an additional benefit of using a cluster: each node is only responsible for part of the data, which then again fits into RAM, and the disk I/O penalty is avoided.

Our conclusion after these experiments is that the architectural sweet-spot for using a cluster to handle RDF, as opposed to a single-node store, starts in those situations where the data size exceeds what can be held in RAM economically on a single machine. In practice, in 2013, this means where the database footprint exceeds 256GB or so. Beyond that size, RAM prices for a single node increase strongly, and a cluster solution makes sense. In the case of the Virtuoso column store (with all its compression), depending of course on the shape of the data, this happens around the 50-billion-triple boundary.
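The arithmetic implied by that boundary is simple; as a back-of-envelope sketch using the two numbers above:

    ram_bytes = 256 * 2**30      # the 256GB single-node RAM boundary from above
    triples   = 50e9             # the triple count at which that RAM fills up
    print(ram_bytes / triples)   # -> ~5.5 bytes per triple, compressed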

This conclusion comes on top of the more general observation that RDF stores have made significant advances in architecture (cluster-readiness), functionality (Business Intelligence queries), as well as in performance and scalability. By now, we can truly conclude that Big Data projects can make use of RDF technology, and that is a win. This does not mean that database technologists are done with RDF; on the contrary. The truth is that on tasks like BSBM, relational technology still offers an order of magnitude more performance for the same work, although, as already mentioned, there are of course tasks where data is irregular (“ragged”) and does not nicely fit a single schema, where RDF has advantages over the relational model (though such scenarios are not modeled by BSBM or most existing RDF benchmarks). At CWI we think there is still a lot of room for improvement in RDF data storage and SPARQL query execution technology, and we are actively working on a project that will advance the state of the art further, to become part of future RDF support in open-source MonetDB.

Finally, with these benchmark results, the last word on RDF store performance surely has not been written. In the newly started EU project LDBC (Linked Data Benchmark Council), RDF store vendors and graph database vendors are coming together with academics to create an industrial council that sets benchmark specifications, vets results, and publishes them. The newly developed benchmarks will go beyond traditional RDF benchmarking in their coverage of features, and will test RDF technology on truly ragged data, whereas BSBM at heart is a relational database benchmark, as the data in it is highly regular. So, watch the LDBC web page and twitter feed for more news on RDF store benchmarking that is soon to come!

Acknowledgements. First off, most work on this was done by my PhD student Minh Pham Duc (“Duc”). Wherever I wrote “we”, I mean Duc and me. We would like to thank the BSBM authors Chris Bizer and Andreas Schultz for providing instructions and sharing the software/scripts at the very beginning of our benchmark experiment. We also want to thank the store vendors and implementors for helping us set up and configure their stores for the experiment: lots of thanks to Orri Erling, Ivan Mikhailov, Mitko Iliev, Hugh Williams, Alexei Kaigorodov, Zdravko Tashev, Barry Bishop, Bryan Thompson, and Mike Personick. Finally, our thanks go to Martin Kersten and Arjen de Rijke of CWI, and the entire Database Architectures research group at CWI, for making the SCILENS cluster available to us and bearing with many long-running jobs on it. The work on the BSBM Benchmark Version 3 is funded through the LOD2 project. Benchmarking efforts in our group continue in the LDBC EU project, which is going to create even more interesting benchmark definitions and metrics than those used in BSBM.

References

  1. BSBM V3.1 specification. It describes the changes to the workload and the test driver.
  2. BSBM V3.1 data generator and driver. Note that for the moment this new BIBM test driver differs from the old BSBM code still on sourceforge.
  3. BSBM V3.1 April 2013 Result Page, with Cluster and 150 billion triple results.
  4. Chris Bizer’s announcement of these results on the W3C mailing list.
  5. Preprint of paper on Experiences with Virtuoso Cluster RDF Column Store.
  6. LOD2 deliverable document D2.1.4 describing the Virtuoso V7 Cluster BSBM benchmarks in detail.
  7. CWI press release on these 150B BSBM results and the vectorized column-store technology in Virtuoso V7.
  8. Openlink press release announcing the availability of Virtuoso V7 and these 150B BSBM results.

 
