Support for Big Data
Something from nothing?






From 2010 onwards, support for Big Data has become one of Pentaho’s principal areas of expertise: data ingestion, in-cluster ETL transformations, and process management. Indeed, by placing so much emphasis on its Big Data support, Pentaho’s marketing frequently overshadows many of the product’s other features, features that would be of greater interest and value to most organisations.


Unlike Oracle, Pentaho doesn’t have a Big Data offering of its own, so it can afford to be datasource-agnostic and offer ETL connections to as many datasources as possible.


But what if the Big Data datasource you’re interested in is not on the list of supported datasources? With OBIEE, you’re stuck. But if the datasource can be accessed from Java, then it can be used with Pentaho, via either a “User Defined Java Class” step or a custom ETL plugin.
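To make the extensibility point concrete, the sketch below shows the kind of row-level logic one might embed in a “User Defined Java Class” step. The real step extends Pentaho’s own base class and pulls rows with `getRow()`/`putRow()`; this simplified, standalone version just processes an in-memory list of rows (with hypothetical field names and logic) so that the idea can be illustrated without the Pentaho libraries.

```java
// Standalone sketch of UDJC-style row processing: filter some rows out,
// transform a field in the rest. Field positions and logic are illustrative.
import java.util.ArrayList;
import java.util.List;

public class RowTransformSketch {

    // Hypothetical transform: upper-case the first field (a name) and
    // drop rows whose second field (a count) is zero.
    static List<Object[]> processRows(List<Object[]> input) {
        List<Object[]> output = new ArrayList<>();
        for (Object[] row : input) {
            long count = ((Number) row[1]).longValue();
            if (count == 0) {
                continue; // filter the row out, as a UDJC step can
            }
            Object[] out = row.clone();
            out[0] = ((String) row[0]).toUpperCase();
            output.add(out);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] { "widget", 3L });
        rows.add(new Object[] { "gadget", 0L });
        List<Object[]> result = processRows(rows);
        System.out.println(result.size());    // 1
        System.out.println(result.get(0)[0]); // WIDGET
    }
}
```

Inside Pentaho, the same filter-and-transform body would sit in the step’s `processRow` callback, with rows arriving one at a time from the preceding step rather than from a list.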





In terms of Hadoop, Pentaho supports the major distributions (and the Spark engine):


*  Amazon EMR

*  Cloudera

*  Hortonworks

*  MapR

*  Spark


and the standard Hadoop components, such as Sqoop, MapReduce, Hive, Pig, Oozie, and YARN. For an overview, see:


*  Big Data with Pentaho 6.0


Pentaho has a facility to design MapReduce jobs visually (and claims a 15x productivity gain over hand-coding them in Java), and Pentaho ETL can run inside, as well as outside, a Hadoop cluster. In a performance benchmark:


*  The Power of Pentaho and Hadoop in Action


conducted using Pentaho ETL and a 129-node Cloudera Hadoop cluster deployed on Amazon EC2 machines, Pentaho demonstrated a near-constant processing rate across four data volumes, ranging from about 0.5 to 4.0 TB (3 to 24 billion rows).



NoSQL and Analytic Engines


Pentaho also supports a wide range of NoSQL databases, such as:


*  Cassandra / Datastax

*  CouchDB

*  HBase

*  MongoDB


and a wide range of analytic engines, typically with columnar storage, such as:


*  Amazon Redshift

*  Greenplum (MPP PostgreSQL)

*  Netezza


*  Teradata

*  Vertica



JDBC Access


Any step in a Pentaho Hadoop ETL transformation can be exposed via JDBC as a dynamic “virtual table” for use by Pentaho tools, third-party tools, or Java programs. In addition, most of the datasources listed above can also be accessed directly from the Pentaho metamodel designer and, in production, from a web browser using the datasource wizard.
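A client consuming such a “virtual table” looks like any other JDBC client. The sketch below assumes a Pentaho data service exposed by a server; the host, port, and service name (`sales_service`) are placeholders, and the thin-driver URL form shown is based on Pentaho’s data-service conventions and may vary between versions.

```java
// Hedged sketch: querying a Pentaho ETL step exposed as a JDBC
// "virtual table". Requires a running Pentaho DI server and its thin
// JDBC driver on the classpath; names here are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VirtualTableClient {

    // Builds a thin-driver connection URL for a given server.
    static String buildUrl(String host, int port) {
        return "jdbc:pdi://" + host + ":" + port + "/kettle";
    }

    public static void main(String[] args) throws Exception {
        String url = buildUrl("di-server.example.com", 8080);
        // "sales_service" is a hypothetical data service name.
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM sales_service")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

Because the service is addressed through standard JDBC, the same query works unchanged from third-party reporting tools or any JVM program.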