Productivity: Product Integration
Passing metadata along the stack




Impact of Corporate Size and Business Models


Pentaho is a much more integrated product than OBIEE, and its high level of integration, together with the ease with which a developer can move back and forth across the stack, leads to a very substantial increase in productivity. The differences in the levels of integration seem to come from:


*  An understanding by Pentaho’s architects of, and a focus on, the challenges that IT managers face on a daily basis when it comes to delivering new business requirements quickly (with some OBIEE enhancements, we are often left wondering whether its architects have ever used the product in a real-world application);


*  The relatively small size of Pentaho, which means that the architects of its different BI stack components interact with one another – the size of Oracle inevitably lends itself to a more “siloed” approach to product development and enhancement;


*  The ease with which own-build product components can be integrated – it would be impractical, from a financial perspective, for Oracle to substantially re-architect the disparate bought-in products that make up its BI suite.


Oracle’s business model, as with the other major BI vendors, is essentially that of being a very large-scale systems integrator: acquire products, rebrand them, and hope that the rebranding will, in and of itself, generate increased revenues; the downside is that the high costs of acquisitions do not permit much redesign beyond “look and feel” changes to the interfaces of the acquired products.


We can expect Pentaho to retain an advantage here, even as it grows in size: as long as its community development arm remains strong, new and enhanced functionality from independent developers will always be coming along, and the cost of productionizing the best of this functionality is likely to be modest.


To be more specific, let’s look at how integration differs between the two products by way of some examples.



Elimination of Redundancy


With OBIEE, ad hoc reporting (Answers & Dashboards) and pixel-perfect reporting (BI Publisher) remain for the most part (security aside) completely separate reporting applications, even where there is common functionality that in a well-integrated product would be consolidated; for example, if you want to schedule an A&D report there is one set of screens, but if you want to schedule a BI Publisher report then there is another.


Pentaho takes a much more rational approach, in that common functionality is implemented just once; for example, while there is a separate design tool for pixel-perfect reporting (Report Designer), a report once created can be published to the BA Server Repository, from where it can be scheduled in the same manner as the ad hoc reports (Analysis Reports) created using the User Console.



User-Created Metadata Models


With OBIEE, if there are no subject area items that support the report you wish to create then as a business user you’re stuck (you can create new reports in production but not new subject area items). You have to put in a request to IT, who will schedule a developer to modify a copy of the RPD offline; following testing, this copy will replace the existing RPD in production, making the new subject area items available. How long will it take? A week if you’re lucky. A month if you’re not. But when did you want it? Now!


Pentaho has a far more thought-through approach to data access. A production user (or developer) with the requisite permissions can:


*  Connect to an existing, or a new, datasource, and


*  Use a wizard to easily construct a new subject area (one that is independent of existing subject areas, so there’s no need for testing and no danger of impacting existing functionality by using that subject area in production).


So, following what amounts to no more than 5-10 minutes’ work and without recourse to IT, production reports can be built immediately using the newly created subject area.


This “do-it-yourself-in-production” functionality is particularly useful when power users and business analysts want to perform data analysis on Excel files. All IT has to do is to dump the files into a suitable folder, and then the users can “slice-n-dice” the contents to their hearts’ content.



Direct Reporting from ETL Transformations


A diagram of the comparative high-level architectures of the Pentaho and OBIEE BI stacks is shown in Figure 1 below:

      Figure 1: Pentaho-OBIEE High-Level Architectures


One of the key features in this diagram is the presence of a direct link between the Pentaho ETL engine and the reports that are displayed within dashboards in the User Console. Any step in an ETL transformation can be declared to be a Pentaho Data Service. Doing so makes the data output of the step equivalent to a database table, allowing reports to use this “virtual table” as a data source. When a report is run from the User Console, the transformation in the ETL engine is executed dynamically, and the data returned is formatted for display within the user’s browser.
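
As a rough illustration of the “virtual table” idea (not Pentaho’s actual implementation or API), the Python sketch below runs a transformation step on demand and lets a report query its output with SQL. The function names transformation_step and query_virtual_table, the table name sales_by_region, and the sample rows are all hypothetical; an in-memory SQLite database merely stands in for the data service layer.

```python
import sqlite3

def transformation_step():
    """Hypothetical ETL step: aggregates raw order rows by region."""
    raw_orders = [("North", 120.0), ("South", 80.0), ("North", 30.0)]
    totals = {}
    for region, amount in raw_orders:
        totals[region] = totals.get(region, 0.0) + amount
    return sorted(totals.items())

def query_virtual_table(sql):
    """Run the step on demand and expose its output as table 'sales_by_region'."""
    conn = sqlite3.connect(":memory:")  # stand-in for the data service layer
    try:
        conn.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")
        conn.executemany(
            "INSERT INTO sales_by_region VALUES (?, ?)", transformation_step()
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

# A report asks for regions whose total exceeds 100.
rows = query_virtual_table(
    "SELECT region, total FROM sales_by_region WHERE total > 100"
)
```

The point of the sketch is only that the report sees a table, while the rows behind it are produced by running the step at query time.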


If multiple users want to run reports that make use of the same data, it is not necessary to rerun the transformation for each report invocation: instead, the data produced by the transformation can be cached in the ETL engine’s memory for a designated amount of time, so that the report data can be sourced directly from memory, leading to fast report response times on subsequent requests for the same data.
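
The caching behaviour described above can be sketched as a simple time-limited cache. This is a minimal illustration under stated assumptions, not a description of how the Pentaho ETL engine is built: the class name TimedResultCache, the 60-second lifetime, and the sample rows are hypothetical, and a fake clock is used so that expiry can be shown without waiting.

```python
import time

class TimedResultCache:
    """Keeps a transformation's output in memory for ttl_seconds."""

    def __init__(self, run_transformation, ttl_seconds, clock=time.monotonic):
        self._run = run_transformation
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows = None
        self._loaded_at = None

    def get_rows(self):
        now = self._clock()
        expired = self._loaded_at is None or (now - self._loaded_at) >= self._ttl
        if expired:
            # Only rerun the transformation when there is no fresh copy in memory.
            self._rows = self._run()
            self._loaded_at = now
        return self._rows

runs = {"count": 0}

def run_transformation():
    """Hypothetical transformation; counts how often it is executed."""
    runs["count"] += 1
    return [("North", 150.0), ("South", 80.0)]

fake_now = [0.0]  # hypothetical clock, advanced by hand for the demonstration
cache = TimedResultCache(run_transformation, ttl_seconds=60, clock=lambda: fake_now[0])

cache.get_rows()                  # first request runs the transformation
cache.get_rows()                  # second request is served from memory
runs_within_ttl = runs["count"]

fake_now[0] = 61.0                # move the clock past the expiry time
cache.get_rows()                  # the cached copy has expired, so it runs again
runs_after_expiry = runs["count"]
```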


Having this functionality available affords many opportunities to reduce operating costs and to enhance developer productivity.


Eliminate Tactical Reporting Solutions


Because of the time it takes to build an entire BI stack (ETL, datamarts, metadata model, and reports), most organisations also feel the need to deploy a tactical reporting solution so as to be able to deliver urgent reporting requirements – using products such as QlikView and SAS.


A common characteristic of these products is that they contain a mini-ETL engine at the back-end (to combine, filter, and summarize data) together with a formatting engine at the front-end (to display the data in the user’s browser or to print a report to disk).


However, with Pentaho a tactical solution becomes redundant, as Pentaho’s reporting tools can connect directly to the ETL engine (eliminating the hefty costs of licensing additional software, of maintaining extra hardware, and of employing developers with skill sets that can’t be used in strategic BI stack development).


In addition to the advantages of cost reduction and developer skill-set consolidation, there is a further advantage in that Pentaho’s ETL engine is a high-performance enterprise engine with a very substantial collection of transformation steps, giving Pentaho a substantial advantage over the tools commonly used for tactical reporting. For example, an organisation with, say, a 20-node Hadoop cluster could run an instance of the Pentaho ETL engine within each Hadoop cluster node, so that ETL throughput would scale in proportion to the number of cluster nodes, obviating the need for a separate ETL server cluster and allowing Pentaho reports to be run directly off the Hadoop cluster.


Transform ETL Memory into a Data Cache


Tactical reporting typically involves getting the DBA to dump transactional database data to disk files, so that running tactical reports against large data volumes does not impact transactional database performance. However, with Pentaho, it is possible to run the transformations that support key reports after the nightly batch has been completed so that the transformed data needed for tactical reporting is held in the ETL engine’s memory throughout the day, transforming the ETL engine into an in-memory data cache. Reports can then consume this pre-aggregated data throughout the day, without the need to dump data to disk files and without a performance hit on the transactional servers.


Strategic Solution Code Reuse


One of the downsides of using a tactical solution to deliver urgently needed functionality is that the code is thrown away and everything is redeveloped from scratch when the strategic solution is finally rolled out. However, by using Pentaho for the tactical solution, there is a high level of code reuse: most of the ETL remains the same (except that the output is dumped into datamart tables, on top of which is built a metamodel); and most of the report design remains the same (except that report business items are now sourced from a metamodel instead of a “virtual table”).


Enhanced Data Federation


The OBIEE RPD consists of a metamodel with a very limited facility at the backend to federate data (typically it combines rows or columns from different data sources).


While in the diagram above the ETL engine is shown feeding the datamarts during the nightly batch run, it could also be extracting data from the datamarts and feeding it into the reports (or into the metamodels) during the day. So, another use for the Pentaho ETL engine is as a much more powerful alternative to the data federation functionality found in the OBIEE RPD. In addition, it can also offer an in-memory cache for the data, further improving performance.
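
To make the federation idea concrete, here is a minimal sketch in which a transformation combines rows from two separate datamarts before handing them to a report. Everything in it is an assumption made for illustration: the two in-memory SQLite databases stand in for independent datamarts, and the table names, column names, and helper functions are hypothetical rather than anything defined by Pentaho or OBIEE.

```python
import sqlite3

def make_datamart(create_sql, insert_sql, rows):
    """Create a stand-alone in-memory database acting as one datamart."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    conn.executemany(insert_sql, rows)
    return conn

sales_mart = make_datamart(
    "CREATE TABLE sales (customer_id INTEGER, amount REAL)",
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 100.0), (2, 250.0), (1, 50.0)],
)
crm_mart = make_datamart(
    "CREATE TABLE customers (customer_id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

def federate_sales_by_customer():
    """Join aggregated sales from one datamart to customer names from another."""
    names = dict(crm_mart.execute("SELECT customer_id, name FROM customers"))
    totals = sales_mart.execute(
        "SELECT customer_id, SUM(amount) FROM sales "
        "GROUP BY customer_id ORDER BY customer_id"
    )
    return [(names[customer_id], total) for customer_id, total in totals]

federated = federate_sales_by_customer()
```

Neither source database knows about the other; the join happens in the intermediate layer, which is the role the text assigns to the ETL engine.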


One of the most wasteful aspects of the typical BI stack is that the ETL servers only run during the night (and to meet the batch window the number of servers required may be substantial); however, this costly resource remains idle during the day. But with Pentaho the ETL engine can also be used productively throughout the day, storing pre-aggregated data from transformations run at the end of the nightly batch run and serving as a data federation engine to combine data from different datamarts. The net result is more performant reporting at no additional cost.


Automated Metamodel Generation


If your focus is not on getting custom reports into production, but on getting a subject area into production, one that can support both custom and ad hoc reporting, then Pentaho has another “go-faster” development paradigm that you can deploy.


Suppose you’ve got a dozen FK-related tables in a transactional database. You want to build a subject area and a set of reports around this data. How long does it take? The standard approach with OBIEE is to build a datamart, build ETL to populate the datamart, modify the existing RPD to create a new subject area, build the reports, test the entire stack (including regression testing), and then release the modified application into production.


Doing so represents a substantial piece of work, so there will be lengthy discussions with business stakeholders to decide on the scope; there will be documentation and Excel layouts to be created and signed off; there will be budgets to be approved; there will be developers to be allocated; there will be testing to be arranged; and everything else that goes with a BI project. All in all, a few months’ work at a minimum. And, when it’s all “done and dusted”, the business stakeholders may well come back and say, “Well, it’s not quite what we wanted!”


Wouldn’t it be better to create a good first cut of the ETL, the datamart, the RPD, and the reports, and then let the stakeholders have a “play” using production data: a “what-you-see-is-what-you-get” approach? Yes, but wouldn’t it take far too long and cost far too much? With OBIEE, yes. With Pentaho, no. Using Pentaho’s Agile BI, it might be little more than a day’s worth of effort for a single developer as far as a resource commitment by IT is concerned.


The speed with which full-stack prototyping can be accomplished comes down to the integrated nature of Pentaho, and in particular to the ability of its ETL design tool, Spoon, to push metadata along the stack. Seems too good to be true! Well, here’s how it might go; a developer:


*  Creates an ETL input step to push the table joins down to the production database, limiting the data returned to a few thousand rows.


*  Creates an ETL output step to write the data to a datamart table (not yet created).


*  With a few clicks uses Spoon to automatically create the datamart table.


*  With a few clicks uses Spoon to run the transformation (there is no issue running against a production database during the day when the volume of data to be retrieved is small).


*  With a few clicks uses Spoon to automatically create the metadata model (RPD).


*  With a few clicks brings up the Analysis tool (Answers & Dashboards).
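
The steps above rely on field metadata flowing from the ETL output step into the datamart table and then into the metadata model. The sketch below illustrates that principle only; it is not how Spoon works internally. The field list, the type mapping, the table name dm_orders, and both helper functions are hypothetical, and the rule that numeric fields become measures is a simplifying assumption.

```python
import sqlite3

# Hypothetical field metadata as it might leave an ETL output step.
stream_fields = [
    {"name": "customer_name", "type": "String"},
    {"name": "order_date", "type": "Date"},
    {"name": "order_amount", "type": "Number"},
]

# Assumed mapping from ETL field types to SQLite column types.
SQL_TYPES = {"String": "TEXT", "Date": "TEXT", "Number": "REAL"}

def build_create_table(table, fields):
    """Derive the datamart DDL from the stream metadata."""
    columns = ", ".join(f"{f['name']} {SQL_TYPES[f['type']]}" for f in fields)
    return f"CREATE TABLE {table} ({columns})"

def build_metadata_model(table, fields):
    """Numeric fields become measures; everything else becomes a dimension."""
    return {
        "table": table,
        "measures": [f["name"] for f in fields if f["type"] == "Number"],
        "dimensions": [f["name"] for f in fields if f["type"] != "Number"],
    }

ddl = build_create_table("dm_orders", stream_fields)
model = build_metadata_model("dm_orders", stream_fields)

# Create the table from the generated DDL and read back its column names.
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
created_columns = [row[1] for row in conn.execute("PRAGMA table_info(dm_orders)")]
```

Because the table definition and the model are both derived from the same field list, nothing has to be typed in twice, which is where the time saving comes from.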


So now, after perhaps 20 minutes’ worth of effort (everything apart from writing the SQL for the input step will take no more than about five minutes), the developer can sit down with a business stakeholder to examine the subject area items and start creating reports and charts using production data:


*  A few hours later, the business stakeholder has decided on a dozen first-cut reports / charts that might be useful.


*  The developer schedules an ETL job to run at the end of the nightly batch run to bring in a few hundred thousand rows of production data to provide a more representative data set.


*  The developer tidies up the deliverables and publishes the metadata model and reports to the production server.


The next day a dozen stakeholders, who have a particular interest in this business area, log on and start evaluating the functionality using the data uploaded at the end of the nightly batch run – perhaps, the last quarter’s worth of production data or five years’ worth of data for 5% of the organisation’s customer base.


So, for one day’s worth of effort and at a minimal cost, IT has delivered an application that already has some significant business value. Business stakeholders can now decide what more they might want: extra subject area items, reports, and charts. After a week of making the occasional small modification, the prototype has been productionized as far as its user-facing functionality is concerned.


The business stakeholders are now in a very good position to assess the true value of the functionality to the business; IT is in a very good position to cost the effort involved in a full rollout based on projected user numbers and data volumes.


Once a budget has been signed off, IT can schedule the additional work needed to productionize the back-end of the prototype. Moving from prototype to production will involve enhancement rather than starting again from scratch: the reports will not change; only the back-end of the metamodel, the datamart, and the front-end of the ETL will need to be enhanced to efficiently handle the much larger number of production users and data volumes.


And while the final build is progressing, a subset of the business stakeholders will still have access to a fully functioning application with reduced data volumes.