Large Enterprise Support
Ready to serve the multi-national?




Enterprise Use Cases


Large enterprises get very nervous when it comes to adopting a product that is not already in use by other large enterprises – should things go wrong it’s comforting to think to yourself, “Nobody ever got fired for buying X”!


For some years, Pentaho has been moving upmarket: it started off as a supplier of reporting solutions to the SME sector, and then broadened its product offering to meet the more varied demands of the enterprise, steadily encroaching on the once-sovereign territory of Oracle, IBM, SAP, and MicroStrategy.


While Pentaho may not have the same number of large installations as Oracle and the other members of the Big-4, it can match them in terms of client organisations that have large numbers of concurrent users and that process substantial data volumes using high-end proprietary analytics databases and Hadoop clusters. Based on a 2014 tabulation of client use cases:


*  Application Users: Pentaho lists two clients with about 200 concurrent users and one with 3,000. These numbers, particularly the latter, are substantial (the 3,000 concurrent users installation had in excess of 100,000 users in total).


*  Reporting Databases: Pentaho lists five clients with reporting data warehouse volumes ranging from 0.5 TB to 5 TB, stored in high-end analytic databases (Neoview, Greenplum, Vectorwise, and Teradata). Clients with these data volumes and using these databases will not be choosing a BI suite principally on the basis of minimising licence costs, which indicates that at least some large enterprises see in Pentaho a product that offers better functionality than its proprietary competitors.


*  ETL: Pentaho lists two clients using Hadoop clusters (with, for example, 10+ TB in one 20-node cluster). Load rates into the various ETL source systems include 200,000 rows per second, 20 billion chat logs per month, and 650,000 (2-4 MB) XML documents per week. It also quotes an example of ETL federation that sources data from 22 countries.
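
The load rates quoted in the ETL item use three different time units, which makes them hard to compare. As a rough sketch, they can be converted to per-second figures as follows. The 30-day month and decimal units (1 TB = 1,000,000 MB) are assumptions made for this illustration, and the variable names are hypothetical; none of them come from Pentaho's tabulation.

```python
# Convert the quoted ETL load rates to a common per-second basis.
# Assumptions (not from the source): a 30-day month, decimal units.
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(count, days):
    """Average rate per second for `count` items spread over `days` days."""
    return count / (days * SECONDS_PER_DAY)

# 20 billion chat logs per month
chat_logs_per_second = per_second(20e9, 30)

# 650,000 XML documents per week, each 2-4 MB
xml_docs_per_second = per_second(650_000, 7)
xml_tb_per_week = tuple(650_000 * mb / 1e6 for mb in (2, 4))

print(f"chat logs: about {chat_logs_per_second:,.0f} per second")
print(f"XML documents: about {xml_docs_per_second:.2f} per second")
print(f"XML volume: {xml_tb_per_week[0]:.1f} to {xml_tb_per_week[1]:.1f} TB per week")
```

On these assumptions, the chat log feed averages several thousand records per second, while the XML feed is roughly one document per second but amounts to between 1.3 and 2.6 TB per week.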


Most of the statistics quoted above are associated with unidentified clients; but for the following two identified clients we have found references to individual case studies.


Stream Global Services (Convergys)


Stream Global Services, with 37,000 employees and annual revenues of $800 million, provides business process outsourcing for major companies, including Fortune 1000 companies. It uses Pentaho ETL to consolidate data from over 100 million voice, e-mail, and chat contacts per year, sourced from 22 countries (terabytes of raw data per month); see the following 2012 case study for more details:


*  Pentaho analytics platform implementation for Stream Global Services




Sheetz


Sheetz is a US convenience store chain, with 13,600 employees and annual revenues of $5.2 billion. It uses Pentaho for its easy-to-use ad hoc analytics. Users access 2.5 billion rows of data (2.1 TB) held in a Teradata data warehouse (which includes 5 years’ worth of sales data); see the following 2011 case study for more details:


*  Pentaho analytics platform implementation for Sheetz





We haven’t found any reporting benchmarks, but Pentaho’s Mondrian in-memory OLAP engine can use the same Infinispan and Hazelcast (a modified version) data caches that are used for time-critical applications in the financial services sector. If your OLAP segments have been loaded into memory, you can therefore reasonably expect the sub-second response times that are mandatory for some use cases in larger organisations. On that basis, you can expect better performance with Pentaho than you would get with OBIEE, unless you are prepared to upgrade to Oracle Exalytics and use its embedded Hyperion OLAP engine.


In addition to its in-memory Mondrian data cache, Pentaho can source report data from the output of an ETL transformation, using Pentaho Data Services. This data can be cached in memory within the ETL server cluster during the day, which provides a useful mechanism for extending the amount of memory available for caching data and supporting high-performance reporting.


In terms of ETL performance, Pentaho can run instances of its ETL engine within a server cluster, and, in particular, it can run an instance within each node of a Hadoop cluster. In an ETL performance benchmark:


*  The Power of Pentaho and Hadoop in Action


conducted using a 129-node Cloudera Hadoop cluster, deployed on Amazon EC2 machines, Pentaho demonstrated a near-constant processing rate of about one million rows per second over four data volumes, ranging from about 0.5 to 4.0 TB (about 3 to 24 billion rows). Unfortunately, the scalability of the ETL engine with the number of cluster nodes was not tested, which rather limits the value of this benchmark.
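
To put the benchmark figures in perspective, the following sketch derives the implied row size, run time, and per-node rate from the numbers quoted above. It assumes decimal terabytes and that all 129 nodes ran an ETL instance; neither assumption is confirmed by the benchmark write-up, and the names used are hypothetical.

```python
# Figures quoted for the benchmark (all approximate).
ROWS_PER_SECOND = 1_000_000
CLUSTER_NODES = 129  # assumption: every node ran an ETL instance

# (data volume in bytes, row count) for the smallest and largest runs,
# assuming decimal terabytes (1 TB = 1e12 bytes).
RUNS = {
    "smallest": (0.5e12, 3e9),
    "largest": (4.0e12, 24e9),
}

def summarise(volume_bytes, rows):
    """Derive row size, elapsed time, and throughput for one run."""
    seconds = rows / ROWS_PER_SECOND
    return {
        "bytes_per_row": volume_bytes / rows,
        "hours": seconds / 3600,
        "mb_per_second": volume_bytes / seconds / 1e6,
    }

summary = {name: summarise(*run) for name, run in RUNS.items()}
rows_per_second_per_node = ROWS_PER_SECOND / CLUSTER_NODES

for name, figures in summary.items():
    print(name, {key: round(value, 2) for key, value in figures.items()})
print("rows/s per node:", round(rows_per_second_per_node))
```

Both runs work out to roughly 167 bytes per row, with run times of about 50 minutes for the smallest volume and a little under 7 hours for the largest. This is consistent with a constant processing rate, but says nothing about how that rate would change with a different node count.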


However, a 2009 benchmark by Bayon Technologies (see Pentaho Performance) concluded that PDI scales in a linear manner with data volumes, and in a near-linear manner with the number of server nodes.



Big Data


Compared to Oracle, Pentaho offers far better support for connecting to Big-Data sources. In terms of Hadoop, Pentaho supports the major distributions and related platforms:


*  Amazon EMR

*  Cloudera

*  Hortonworks

*  MapR

*  Spark


(and the standard Hadoop components, such as Sqoop, MapReduce, Hive, Pig, Oozie, and YARN); the major NoSQL databases, such as:


*  Cassandra / Datastax

*  CouchDB

*  HBase

*  MongoDB


and a wide range of analytic engines, such as:


*  Amazon Redshift

*  Greenplum (MPP PostgreSQL)

*  Netezza

*  Teradata

*  Vertica


See the article on Big-Data for more details.





As discussed in another article, product extensibility is a key requirement for most large enterprises. The APIs that are characteristic of Pentaho and other open-source products (and almost entirely absent from proprietary products) ensure that if a large enterprise has some non-standard requirement, then it’s very likely that it can be met. Yes, it will involve custom development that you’ll have to pay for, but you’re unlikely to be told “it just can’t be done”.