Support for Big Data

          Something from nothing?

Introduction

 

From 2010 onwards, support for Big Data has become one of Pentaho's principal areas of expertise: data ingestion, in-cluster ETL transformations, and process management. Indeed, by placing so much emphasis on its support for Big Data, Pentaho's marketing frequently overshadows many of the product's other features, features that would be of greater interest and value to most organisations.

 

Unlike Oracle, Pentaho doesn't have a Big Data offering of its own, so it can afford to be datasource-agnostic and offer ETL connections to as many datasources as possible.

 

But what if the Big Data datasource you're interested in is not on the list of supported datasources? Well, with OBIEE you're stuck. But if the datasource can be accessed using Java, then it's possible to use it with Pentaho (using either a "User Defined Java Class" step or a custom ETL plugin).
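As a loose illustration of the "User Defined Java Class" approach, the sketch below mirrors the shape of such a step's row-processing logic as a self-contained class. `CustomSourceStep` and `fetchFromCustomSource` are hypothetical stand-ins, not a real Pentaho API; a real step would call the datasource's Java client library and emit rows through Pentaho's own row-handling calls.

```java
import java.util.ArrayList;
import java.util.List;

public class CustomSourceStep {

    // Stand-in for a datasource that is only reachable through its Java
    // client library; real code would call that library here.
    static List<String[]> fetchFromCustomSource() {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "alpha"});
        rows.add(new String[] {"2", "beta"});
        return rows;
    }

    // Mirrors the row-processing loop of a "User Defined Java Class" step:
    // read each source record and emit it as an output row.
    static List<String[]> processRows() {
        List<String[]> out = new ArrayList<>();
        for (String[] record : fetchFromCustomSource()) {
            out.add(record); // in a real step this would hand the row to Pentaho
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(processRows().size()); // prints 2
    }
}
```

The point is simply that anything reachable from Java can be wrapped in this read-and-emit pattern, which is what puts it within Pentaho's reach even when there is no dedicated connector.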

 

 

Hadoop

 

In terms of Hadoop, Pentaho supports the major distributions:

 

*  Amazon EMR

*  Cloudera

*  Hortonworks

*  MapR

*  Spark (a processing engine rather than a distribution)

 

and the standard Hadoop components, such as Sqoop, MapReduce, Hive, Pig, Oozie, and YARN. For an overview, see:

 

*  Big Data with Pentaho 6.0

 

Pentaho has a facility to visually design MapReduce jobs (and claims a 15× productivity gain over hand-coding in Java), and Pentaho ETL can run inside, as well as outside, a Hadoop cluster. In a performance benchmark:

 

*  The Power of Pentaho and Hadoop in Action

 

conducted using Pentaho ETL and a 129-node Cloudera Hadoop cluster deployed on Amazon EC2 machines, Pentaho demonstrated a near-constant processing rate across four data volumes, ranging from about 0.5 to 4.0 TB (3 to 24 billion rows).

 

 

NoSQL and Analytic Engines

 

Pentaho also supports a wide range of NoSQL databases, such as:

 

*  Cassandra / Datastax

*  CouchDB

*  HBase

*  MongoDB

 

and a wide range of analytic engines, typically with columnar storage, such as:

 

*  Amazon Redshift

*  Greenplum (MPP PostgreSQL)

*  Netezza

*  SAP HANA

*  Teradata

*  Vertica

 

 

JDBC Access

 

Any step in a Pentaho Hadoop ETL transformation can be exposed via JDBC as a dynamic "virtual table" for use by Pentaho tools, third-party tools, or Java programs. In addition, most of the datasources listed above can also be accessed directly from the Pentaho metamodel designer, and from a web browser in production via the datasource wizard.
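For third-party tools and Java programs, that JDBC access looks like any other JDBC connection. The sketch below uses only the standard `java.sql` API; the `jdbc:pdi` URL shape, host, port, webapp name, credentials, and the table name `my_virtual_table` are assumptions to be checked against your own server's configuration, and the actual query is left commented out since it requires a running server and the Pentaho driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PdiVirtualTableQuery {

    // Assumed URL shape for a Pentaho "virtual table" endpoint; verify the
    // host, port, and webapp name against your own server's configuration.
    static String buildUrl(String host, int port) {
        return "jdbc:pdi://" + host + ":" + port + "/kettle?webappname=pentaho-di";
    }

    public static void main(String[] args) {
        String url = buildUrl("localhost", 9080);
        System.out.println(url);

        // With a running server and the Pentaho JDBC driver on the classpath,
        // the virtual table is queried like any other JDBC table
        // ("my_virtual_table" is a placeholder name):
        //
        // try (Connection c = DriverManager.getConnection(url, "user", "password");
        //      Statement s = c.createStatement();
        //      ResultSet rs = s.executeQuery("SELECT * FROM my_virtual_table")) {
        //     while (rs.next()) {
        //         System.out.println(rs.getString(1));
        //     }
        // }
    }
}
```

Because the consumer sees only standard JDBC, any tool that can speak JDBC can read the transformation's output without knowing anything about the underlying Big Data source.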