Statistics, Data Mining, Modelling

Something for the mathematicians

 

 

 

Introduction

 

Data science, data mining, statistics, and predictive modelling most frequently make use of two modelling tools, the R language and the Weka toolkit:

 

*  OBIEE has no support for either R or Weka (support for R is only present in the Oracle Database Enterprise Edition).

 

*  Pentaho supports both R and Weka as transformation steps within its ETL engine.

 

 

R Project

 

R is an open-source programming language widely used by statisticians, engineers, and scientists; it is extended by over 1,600 library packages aimed at a variety of use cases. Support for R can be found in many computational and graphical applications, such as Mathematica, MATLAB, Spotfire, SPSS, STATISTICA, SAS, and Tableau.

 

However, many of the applications that support R are too lightweight to process substantial data volumes. Pentaho's placement of R within its ETL engine is well suited to handling large data volumes, since engine instances can run on the nodes of a dedicated server cluster (including on each node of a Hadoop cluster, if required).

 

Pentaho supports R via an R Script Executor step:

 

*  R Script Executor

 

which can be used to pass step input data to an R script and pass the results on to the next step in the transformation.
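The hand-off described above (rows in, script applied, rows out to the next step) can be sketched in a few lines. This is only an illustration in Python, not Pentaho's implementation: `script_step` and `add_zscore` are hypothetical names, and the real step runs an R script rather than a Python function.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, float]  # one row of step data, keyed by field name


def script_step(rows: Iterable[Row],
                script: Callable[[List[Row]], List[Row]]) -> Iterator[Row]:
    """Hypothetical stand-in for a script executor step."""
    batch = list(rows)         # collect the incoming rows into one batch
    for out in script(batch):  # hand the whole batch to the user script
        yield out              # emit each result row to the next step


def add_zscore(batch: List[Row]) -> List[Row]:
    """Example 'script': append a z-score of the 'value' field to every row."""
    values = [r["value"] for r in batch]
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [{**r, "z": (r["value"] - mean) / sd if sd else 0.0} for r in batch]


source = [{"id": 1, "value": 2.0}, {"id": 2, "value": 4.0}, {"id": 3, "value": 6.0}]
result = list(script_step(source, add_zscore))
for row in result:
    print(row)
```

The script sees the whole batch at once because a statistic such as a z-score needs every row before any result can be emitted.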

 

For a video demonstration of the use of Pentaho ETL with the Random Forest algorithm for attribute classification see:

 

*  Random Forest Classification

 

This example uses Pentaho ETL to assign the data at random to training and scoring sets and to apply the trained model to the scoring set; the results are displayed as a confusion matrix. It illustrates how succinct R is as a programming language.
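The workflow in that demonstration (random split, score, confusion matrix) can be sketched as follows. The sketch is in Python rather than R. It assumes a 70/30 split and made-up data, and it uses a trivial threshold rule as a placeholder for Random Forest, so it shows only the shape of the process, not the method used in the video.

```python
import random
from collections import Counter
from statistics import mean


def split_rows(rows, train_fraction=0.7, seed=42):
    """Randomly assign each row to the training or the scoring set."""
    rng = random.Random(seed)
    train, score = [], []
    for row in rows:
        (train if rng.random() < train_fraction else score).append(row)
    return train, score


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


# Made-up data: one numeric attribute and a two-class label.
rows = [{"x": i, "label": "high" if i >= 50 else "low"} for i in range(100)]
train, score = split_rows(rows)

# Placeholder model (an assumption, standing in for Random Forest):
# classify by the midpoint between the two class means seen in training.
low_mean = mean(r["x"] for r in train if r["label"] == "low")
high_mean = mean(r["x"] for r in train if r["label"] == "high")
threshold = (low_mean + high_mean) / 2

actual = [r["label"] for r in score]
predicted = ["high" if r["x"] >= threshold else "low" for r in score]
matrix = confusion_matrix(actual, predicted)

for (a, p), n in sorted(matrix.items()):
    print(f"actual={a:<4} predicted={p:<4} count={n}")
```

Cells where the actual and predicted labels agree are correct classifications; the remaining cells are the errors.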

 

 

Weka

 

Weka (Waikato Environment for Knowledge Analysis) is a machine learning, data analysis, and predictive analysis package, written in Java. Pentaho acquired an exclusive licence to use Weka for business intelligence in 2006. Pentaho supports a:

 

*  Weka Scoring

 

step, which uses a Weka classification or clustering model to score new data rows (note that some Predictive Model Markup Language, PMML, models are also supported). The model is created using the Weka Explorer: the training data is imported, the model is selected and run, and the output is exported as a serialized Java object. Incremental Weka models can also be trained as data passes through the ETL transformation.
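The train, export, load, and score cycle described above can be sketched by analogy. The following Python uses `pickle` where Weka uses Java serialization, and a toy nearest-centroid classifier where Weka would supply a real model; every name in it (`new_model`, `update`, `predict`) is hypothetical and none of it is Weka's API. Because `update` folds in one row at a time, it also illustrates what incremental training means.

```python
import pickle


def new_model():
    """Empty toy model: per-class running sums and row counts."""
    return {"sums": {}, "counts": {}}


def update(model, features, label):
    """Incremental training: fold one labelled row into the model."""
    sums = model["sums"].setdefault(label, [0.0] * len(features))
    for i, value in enumerate(features):
        sums[i] += value
    model["counts"][label] = model["counts"].get(label, 0) + 1


def predict(model, features):
    """Score one row: return the label whose centroid is nearest."""
    def distance(label):
        n = model["counts"][label]
        centroid = [s / n for s in model["sums"][label]]
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(model["counts"], key=distance)


# Training side: rows arrive one at a time.
model = new_model()
for features, label in [([1.0, 1.0], "a"), ([1.2, 0.8], "a"),
                        ([5.0, 5.0], "b"), ([4.8, 5.2], "b")]:
    update(model, features, label)

blob = pickle.dumps(model)     # export the trained model as a serialised object
restored = pickle.loads(blob)  # scoring side: load the exported model

new_rows = [[0.9, 1.1], [5.1, 4.9]]
scored = [(row, predict(restored, row)) for row in new_rows]
print(scored)
```

Keeping the model as plain data means the scoring side needs only the serialised bytes and the `predict` function, which mirrors the separation between building a model in one tool and scoring with it in another.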

 

Pentaho also supports a:

 

*  Weka Forecasting

 

step, which allows a time series analysis model to be used to make predictions beyond the modelling window. The model is created using a plugin to the Weka Explorer, which transforms the data into a form that a propositional machine learning algorithm can process. For more information see:

 

*  Weka Time Series Analysis and Forecasting
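One common way to put a time series into a form that a propositional learner can process is to turn each time point into a row of lagged values plus a target. The sketch below assumes that approach. It is written in Python with hypothetical names (`make_lagged_rows`, `forecast`, `extrapolate`), it is not the Weka plugin's code, and its predictor is a fixed linear extrapolation standing in for a trained model.

```python
def make_lagged_rows(series, n_lags):
    """Turn a series into (lag features, target) rows for a standard learner."""
    rows = []
    for t in range(n_lags, len(series)):
        lags = series[t - n_lags:t]  # the n_lags values preceding time t
        rows.append((lags, series[t]))
    return rows


def forecast(series, n_lags, predict, steps):
    """Predict beyond the window by feeding each prediction back in as a lag."""
    history = list(series)
    predictions = []
    for _ in range(steps):
        nxt = predict(history[-n_lags:])
        predictions.append(nxt)
        history.append(nxt)
    return predictions


def extrapolate(lags):
    """Placeholder predictor (an assumption): continue the last observed step."""
    return 2 * lags[-1] - lags[-2]


series = [10, 12, 14, 16, 18, 20]
rows = make_lagged_rows(series, n_lags=2)
print(rows[0])                                   # ([10, 12], 14)
print(forecast(series, 2, extrapolate, steps=3)) # [22, 24, 26]
```

Each row produced by `make_lagged_rows` is an ordinary attribute vector with a target value, so any regression learner could be trained on it in place of the placeholder.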