Pentaho Performance

Promising, but more information required

 

 

 

Introduction

 

Oracle is not shy when it comes to publishing OBIEE performance benchmarks; for example, their whitepaper:

 

*  Oracle Business Intelligence Suite Enterprise Edition: 50,000-User Benchmark on Sun SPARC Enterprise T5440 Server Running Oracle Solaris 10

 

provides considerable detail on architecture, user loading, number of concurrent users, think time, query response times, and the number of queries per second supported.

 

However, Pentaho seems oddly reluctant to provide a similar level of detail when it comes to the key performance metrics that would be of interest to a potential client: response times and queries per second for the reporting component and data transfer rates for the ETL component.

 

 

Performance Benchmarks

 

The most recent Pentaho overview of performance seems to be the 2014 whitepaper:

 

*  Performance and Scalability Overview

 

This document focuses on a qualitative discussion of Pentaho’s architecture; its only nod in the direction of a quantitative statement on performance is:

 

*  “Pentaho’s in-memory caching capability enables ad hoc analysis of millions of rows of data in seconds.”

 

In Pentaho’s documentation on the Mondrian OLAP engine that is used for ad hoc reporting, “Optimizing Mondrian Performance” (dating from 2007), one of the authors makes the comment:

 

*  “Our largest site has a cube with currently ~6M facts on a single low end Linux box running our application with Mondrian and Postgres (not an ideal configuration), without aggregate tables, and gets sub second response times for the user interface (JPivot).”

 

which provides an indication that sub-second response times are achievable (though no mention is made of the BA server configuration, and, in particular, the size of its data cache).

 

 

Client Use Cases

 

At the end of Pentaho’s document, there is a table entitled “Customer Examples and Use Cases”; but once again this table provides little by way of detailed information on key metrics:

 

*  The hardware specifications provided seem to refer to the source systems (the datamarts or the ETL data sources) and not to the Pentaho BA Server or Pentaho DI Server configurations.

 

*  Some PDI data loading rates are specified: 200,000 rows per second and 20 billion chat logs per month into a 10+ TB, 20-node Hadoop cluster; and 10 million records per hour, 650,000 XML documents per week (2 to 4 MB each) into a 200 GB Oracle Cloudera Hadoop cluster.

 

*  The data volumes quoted as being present in the analytic datamarts (0.5 to 5.0 TB) do, at least, demonstrate that Pentaho is being used by some organisations with substantial data processing requirements.

 

*  When it comes to the number of concurrent users, the largest installation is listed as having 3,000, which is certainly substantial; but there is no information on the key performance metrics: queries per second and query response times.

 

So, all in all, the lack of key information is rather disappointing.

 

 

Reporting Performance

 

Mondrian can be configured in various ways, but for larger use cases it would typically be configured to run on a server cluster with a 64-bit OS, using the distributed, in-memory Infinispan data cache, an open-source data cache distributed under an Apache licence.

 

The key component here is the Infinispan data cache, which has been used in the financial services sector to provide very rapid access to large data volumes (the nearest equivalent in the Oracle stable would be Oracle Coherence). Though other data caches, such as Hazelcast, offer better performance than Infinispan, the fact that an open-source product has even been considered by the financial services sector (a sector not noted for counting the pennies when it comes to platform procurement) suggests that Infinispan is not far off “best of breed” technology, and is therefore likely to be more than adequate for the vast majority of commercial applications. Note that Pentaho’s CDC CTools extension to the default functionality uses a Hazelcast-based cache and, in addition, allows cache segments to be cleared selectively and programmatically (for example, following an ETL update to the data source).

 

In Infinispan’s distributed mode, memory across a collection of Mondrian server nodes is configured to form a single data cache with a specified level of replication, so as to avoid any single point of failure. The cache is “self-assembling” in the sense that each instance will discover its fellows on a local network using UDP multicast.
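
To make this concrete, the sketch below shows how such a distributed cache is assembled programmatically (a minimal sketch, assuming a recent Infinispan release and its standard embedded Java API; the cluster and cache names are illustrative):

    import org.infinispan.Cache;
    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.configuration.global.GlobalConfiguration;
    import org.infinispan.configuration.global.GlobalConfigurationBuilder;
    import org.infinispan.manager.DefaultCacheManager;

    public class DistributedCacheDemo {
        public static void main(String[] args) {
            // Clustered configuration; node discovery defaults to JGroups
            // UDP multicast on the local network, as described above.
            GlobalConfiguration global = GlobalConfigurationBuilder.defaultClusteredBuilder()
                    .transport().clusterName("mondrian-segment-cache") // illustrative name
                    .build();

            // DIST_SYNC spreads entries across the cluster; numOwners(2) keeps
            // two replicas of each entry, so no single node is a point of failure.
            Configuration distributed = new ConfigurationBuilder()
                    .clustering().cacheMode(CacheMode.DIST_SYNC)
                    .hash().numOwners(2)
                    .build();

            try (DefaultCacheManager manager = new DefaultCacheManager(global)) {
                manager.defineConfiguration("segments", distributed);
                Cache<String, Object> cache = manager.getCache("segments");
                cache.put("aggregate-key", 42); // visible from every node in the cluster
            }
        }
    }

Any node can then serve a cached aggregate regardless of which node originally computed it, and the loss of a single node loses no cached data.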

 

A query uses data from the cache if available (either by extracting the aggregate directly or by computing it from cached data); otherwise, the cache manager must query the data from the underlying relational database (ROLAP).
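
In outline, this is the familiar cache-aside pattern; the following is a conceptual sketch only (not Mondrian’s actual internals):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    /** Conceptual cache-aside lookup: serve from cache, else fall back to ROLAP SQL. */
    public class SegmentCache {
        private final Map<String, Double> cache = new ConcurrentHashMap<>();
        private final Function<String, Double> rolapQuery; // runs SQL against the datamart

        public SegmentCache(Function<String, Double> rolapQuery) {
            this.rolapQuery = rolapQuery;
        }

        public double aggregate(String segmentKey) {
            // Cache hit: milliseconds. Cache miss: seconds, at database speed.
            return cache.computeIfAbsent(segmentKey, rolapQuery);
        }
    }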

 

If the data needed to answer a query is cached, then a sub-second query response time is a reasonable expectation due to the speed with which cache operations can be performed – performance benchmarks for Infinispan come in at around 1.3 ms for cache gets and 5.5 ms for cache puts on a six node cluster – and the time spent by Mondrian in performing calculations and in generating HTML will typically be quite small.

 

If the data needed to answer a query is not cached, then Mondrian performance will revert to what one might expect from OBIEE (say 3-10 seconds): it all depends on how quickly the database can respond, either by aggregating the required data or by extracting it in pre-aggregated form from materialized views or aggregate tables.

 

Pentaho can make use of explicitly-defined aggregate tables (similar to the aggregation functionality built into the OBIEE RPD). However, aggregate tables are expensive to create and maintain, and so it’s far better to get the database to implicitly redirect queries to aggregate structures where possible (using, for example, Oracle’s query re-write functionality or Postgres’ Rule System to redirect the query in a manner that is transparent to the requesting application).

 

So, if most of your reports can make use of pre-defined aggregates or the data fits into the memory allocated for the cache, then your users can reasonably expect sub-second response times (following the nightly ETL, you might wish to schedule key reports to run so that the cache is populated when the first users come online in the morning – as per OBIEE); if not, then you should be no worse off than you are at present with OBIEE and your current datamart.
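
Such a warm-up can be as simple as requesting the key reports over HTTP once the nightly ETL completes; a hedged sketch follows (the server name and report paths are hypothetical, and authentication is omitted):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    /** Post-ETL warm-up: run key reports so the cache is hot by morning. */
    public class CacheWarmer {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Hypothetical report URLs; substitute your BA server's real endpoints.
            List<String> reports = List.of(
                    "http://ba-server:8080/pentaho/api/repos/:public:sales.prpt/generatedContent",
                    "http://ba-server:8080/pentaho/api/repos/:public:stock.prpt/generatedContent");
            for (String url : reports) {
                HttpResponse<Void> resp = client.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.discarding());
                System.out.println(url + " -> HTTP " + resp.statusCode());
            }
        }
    }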

 

But without building a prototype, it won’t be easy to work out what cache size is required to service a given set of reports or a given percentage of the ad hoc requests generated by users.

 

 

Data Integration Performance

 

As is the norm with ETL vendors, Pentaho’s ETL transformation steps run in parallel on a streaming basis, so there is no inherent limit on the data volumes that can be processed (in practice, however, ETL cannot always be streamed by way of a single transformation, and some intermediate staging may be necessary, with different transformations running sequentially as separate jobs).

 

Pentaho Data Integration (PDI) can run on server clusters and, importantly, Pentaho claims that it scales approximately with the total number of cluster cores, not just the total number of cluster nodes. Clusters can be dynamic, and PDI can also run in the cloud (Amazon EC2) if required.
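
For orientation, a PDI transformation designed in Spoon (a .ktr file) can also be executed from Java through the Kettle API. A minimal sketch, assuming PDI’s standard org.pentaho.di classes and a hypothetical sort_and_join.ktr:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();                            // load the step plugins
            TransMeta meta = new TransMeta("sort_and_join.ktr"); // hypothetical file
            Trans trans = new Trans(meta);
            trans.execute(null);        // each step starts on its own thread
            trans.waitUntilFinished();  // rows stream between steps until drained
            if (trans.getErrors() > 0) {
                throw new IllegalStateException("Transformation failed");
            }
        }
    }

Each step runs on its own thread, with rows streaming between steps through in-memory buffers, which is what produces the parallel, streaming behaviour described above.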

 

Pentaho’s “Performance and Scalability Overview” documentation is also uninformative when it comes to quantifying Pentaho’s ETL performance. The only ETL performance benchmark published by Pentaho seems to be:

 

*  The Power of Pentaho and Hadoop in Action

 

conducted using a 129-node Cloudera Hadoop cluster deployed on Amazon EC2 machines. Pentaho demonstrated a near-constant processing rate of about one million rows per second over four data volumes, ranging from about 0.5 to 4.0 TB (about 3 to 24 billion rows). Unfortunately, ETL engine scalability with the number of cluster nodes was not tested, which rather limits the value of this benchmark.

 

However, a 2009 whitepaper from Bayon Technologies provides an interesting performance benchmark for PDI:

 

*  Pentaho Data Integration: Scaling Out Large Data Volume Processing in the Cloud or on Premise

 

This whitepaper is well worth reading for anyone considering adopting PDI.

 

Bayon used three datasets (50, 100, and 300 GB) prepared using the TPC-H data generator. Two use cases were considered: “read”, a simple parallel read from a file (1,799,989,091 records for 300 GB); and “sort”, which read two files (1,799,989,091 and 500,000 records), sorted and aggregated the first record set (down to around 500,000 records), sorted the second record set, joined the two sets and aggregated the combination, and finally output the results. The second is therefore a representative use case in which large data files must be joined and aggregated to produce a small output volume.

 

Bayon ran the use cases on EC2 clusters of 10, 20, and 40 nodes (small instances with a 32-bit image, each instance providing one EC2 Compute Unit).

 

For example, Bayon’s benchmark for the “sort” use case:

 

*  Reading 300 GB (1,799,989,091 records) from the master file, and

*  Using a 40 node cluster

 

took about 67 minutes – a data processing rate of 76.4 MB per second or 448,428 rows per second (reading from XFS filesystems on Amazon Elastic Block Storage, EBS, volumes). Bayon concluded that PDI scales in a linear manner with data volumes, and in a near-linear manner with the number of server nodes (see the whitepaper for graphs covering all the permutations).

 

The run time for the “read” use case was 95.5% of that of the “sort” use case, indicating that the ETL is, as might be expected, heavily I/O bound.

 

Bayon also makes a good point about the cost-effectiveness of EC2 for intense, short-duration tasks, noting that at 2009 prices it cost only about $6 per billion rows to run the “sort” use case.

 

In order to derive some useful metrics from the “sort” use case, we need to convert an EC2 Compute Unit into something more recognisable. While opinions differ, one EC2 Compute Unit is usually considered to be approximately equivalent to one core of a 1.0-1.2 GHz 2007 Opteron or Xeon processor. So we can conclude that, for read-intensive ETL (irrespective of any subsequent calculations, and assuming a small volume on the final write step), PDI can process about 1.9 MB (11,210 records) per second per core. The EBS volume configuration is unclear from the paper, though with a quoted 1 TB limit it probably corresponds to “EBS Magnetic”, with a 40-200 IOPS rate and a maximum throughput of 40-90 MB/sec.
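
The arithmetic behind these per-core figures is straightforward; as a quick check, using the benchmark numbers quoted above:

    /** Back-of-envelope check of the per-core rates quoted above. */
    public class SortBenchmarkMaths {
        public static void main(String[] args) {
            // Quoted cluster-level result: 300 GB in about 67 minutes on 40 nodes,
            // each providing one EC2 Compute Unit (roughly one core).
            double clusterMBperSec = 300 * 1024 / (67 * 60.0); // ~76.4 MB/s
            double clusterRowsPerSec = 448_428;                // as quoted by Bayon
            int cores = 40;

            System.out.printf("Per core: %.1f MB/s, %,.0f rows/s%n",
                    clusterMBperSec / cores,    // ~1.9 MB/s per core
                    clusterRowsPerSec / cores); // ~11,210 rows/s per core
        }
    }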

 

Note that if you are looking at an Amazon EC2 solution, the EC2 Compute Unit has since been replaced by the vCPU, equivalent to a hyperthread – which probably equates to half a core.

 

For the standard kit you are more likely to have in your machine room, the benchmark is not so easy to apply. For example, on a cluster of four HP ProLiant DL560 Gen8 E5-4610v2 2P 32GB-R servers, costing perhaps $40,000, with two processors per server and 8 cores per processor (64 cores in total), the “sort” use case might take 14 seconds to process 10 million records, assuming the same sequential read rate (see the arithmetic below). But the sequential read rate could be quite different from that available on Amazon EC2, and as the ETL processing is I/O bound (by about 95%), the run time for the “sort” use case could be substantially different.
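
The back-of-envelope extrapolation runs as follows (the per-core rate from the EC2 benchmark is assumed to carry over, which, given the I/O dependence, is precisely the weak link):

    /** Hypothetical extrapolation to the four-server DL560 cluster above. */
    public class OnPremiseEstimate {
        public static void main(String[] args) {
            int cores = 4 * 2 * 8;             // 4 servers x 2 CPUs x 8 cores = 64
            double rowsPerSecPerCore = 11_210; // per-core rate from the EC2 benchmark
            double rows = 10_000_000;

            double seconds = rows / (rowsPerSecPerCore * cores);
            System.out.printf("%,.0f rows on %d cores: ~%.0f s%n", rows, cores, seconds);
            // prints ~14 s; valid only if the sequential read rate matches EC2
        }
    }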

 

So, if you want to turn a “benchmark into a budget”, there doesn’t appear to be any alternative to evaluating Pentaho on kit similar to that which you might actually deploy. You might care to take up Pentaho’s offer:

 

*  “Contact us and we will set-up a personalized demo based on your unique use case showcasing our powerful data integration tools and rich analytics.”

 

or get a third-party, such as Bayon Technologies, to build you a proof of concept.

 

 

Summary

 

The best we can conclude from the dearth of quantitative information available on Pentaho performance is that:

 

*  Pentaho’s in-memory data cache should significantly outperform OBIEE’s disk-based cache, leading to sub-second query response times for queries that can be satisfied from the in-memory cache; and, if not, response times should be similar to those currently obtained with OBIEE.

 

*  In a clustered implementation, there is good evidence that Pentaho’s ETL will scale in a linear manner with data volumes, and in a near-linear manner with the number of server nodes. Pentaho’s ETL, being I/O bound for typical use cases, is likely to have similar performance characteristics to that offered by other ETL vendors given comparable hardware.