Hadoop:   Snake Oil   or Cure-All     
          A Comparison between Hadoop and Relational DBMS'  






Hadoop! Perhaps you’ve heard of it. If you’re an IT manager, you’ve probably heard far too much about it! You may even have been pressurised into feeling that without implementing a Hadoop-based solution you are somehow not doing your job properly – not making use of the “latest and greatest technology that the world has ever seen”!


Well, if you’re interested in evidence-based IT decision making and are reluctant to pander to the blandishments of the Big Data vendors, then we’ve dug out some real world numbers that compare Hadoop with conventional relational DBMS implementations. The truth is that Hadoop does many things very badly – very badly because, being a very simple platform, it doesn’t do them at all. Hadoop as a standalone platform has a future, has a well defined niche, but it is a small one, and it is likely to remain so.


However, Hadoop is being re-architected and combined with relational DBMS technology to produce platforms that combine the best features of both. So when a Big Data vendor uses Hadoop as a buzzword to grab your attention, it’s worth investigating. It may be Hadoop, but not as you know it!



Reliable Information on Hadoop


Where has the information that you’ve read and heard about Hadoop come from? From vendors wishing to sell you hardware and software add-ons. From the small, unrepresentative set of their clients that they have pressed into service – in exchange for product discounts. From the endless commentaries in online magazines and blogs that “retweet” what the vendors are saying. But have you ever seen any real evidence that Hadoop could deliver for the use-case you have in mind? Have you ever seen any evidence that Hadoop increases performance and reduces the cost of ownership relative to those out-of-fashion relational alternatives? Ask a vendor for some hard numbers showing how its Hadoop offering performs for the use-case you have in mind, and you may well be disappointed. Ask them for a comparison of the cost of ownership with a comparable relational offering and you’re certain to be disappointed. So what can you do?


Well, you could set up your own benchmarks and compare Hadoop with a relational alternative for the use-case you have in mind. But unless you have a very high-value project, doing so is likely to be prohibitive in terms of the effort and cost involved.


You could ask a Hadoop vendor and a relational vendor to put forward clients who have similar use-cases to yours and compare the results. Both selections are likely to be equally biased, but the biases might, in part, cancel each another out to some extent, giving you a reasonable list of pros-and-cons to mull over when making that decision on procurement.


Or, you could look for cases where researchers have done a comparison and published the results. The problem is that there are every few comparisons between Hadoop and relational databases, but we’ve dug out a few from the realms of academia.



Relational DBMS versus Hadoop


So, what is the difference between a parallel relational DBMS – such as Teradata, Oracle Exadata, Vertica, or Greenplum – and Hadoop, with its famous MapReduce algorithm.


The easier way to think about the differences between a relational DBMS and Hadoop, is that:


*  A relational database does the “heavy lifting” on data loading; Hadoop does the heavy lifting on data extraction;

*  A relational database provides many additional services beyond storing data and serving it up; Hadoop just does the basics.


To expand on these points:


*  A relational database has a schema that ensures data items are easily classified; Hadoop does not. With Hadoop, developers may need to write custom parsers, which can be expensive in terms of time, testing, and maintenance. The situation is improving as various APIs and wrappers are added to deal with common data analysis scenarios; however, you need to take some time during the procurement process to assess what custom coding might be required for the use-cases you have in mind.

*  A relational database parses data at load time; Hadoop parses data at run-time. Hadoop makes much higher demands on the CPU than a relational implementation, which is fine if calculations are lightweight, and data volumes are large; but for complex calculations Hadoop will require more cores / servers than a relational implementation, potentially pushing up costs.

*  A relational database has data integrity checks that ensure that only valid data is entered into the database; Hadoop does not. This is not an issue if you have no need to cleanse or validate data as it is loaded into the database – but think carefully; is that source data really as clean as you think?

*  A relational database has indexes and organisational mechanisms such as columnar storage that speed up access to specific data; Hadoop does not. The key issue here is whether the queries you intend to run against the database will typically be selecting all, or a substantial fraction, of the data for processing, or, as is more often the case, will be selecting a comparatively small subset of the data. If you only need to select all or most of the data, then the go-faster relational data access mechanisms won’t help, and Hadoop might be the best solution.

*  A relational database supports data compression, which reduces storage volumes and I/O bottlenecks; though Hadoop also supports block and row-level compression, tests [1] indicates that enabling it does not significantly improve performance, and can make it worse.

*  A relational database tries to keep data in memory as much as possible and tries to avoid sending data between cluster nodes, which improves performance; Hadoop materializes intermediate calculations on disk, which can lead to network bottlenecks. However, on the plus side, if a fault occurs, Hadoop can recover mid-transaction, whereas a relational DBMS must start the transaction from the beginning. So, if you’re likely to have transactions that run for days at a time, then this fault tolerance might favour Hadoop (but bear in mind that a Hadoop cluster with a large number of low-power commodity servers will fail more often than a relational one that has a small number of high-power custom servers).

*  A relational database is always available to respond to a query immediately; a Hadoop cluster requires a certain amount of time to ramp up to its maximum processing capacity. Hadoop is designed for batch processing.





Definitions of real-time vary considerably: milliseconds in the case of the data feeds to the automatic trading systems used by investment banks; seconds for various mission-critical applications in which humans monitor and respond to updated data as it appears on a bank of screens.


Hadoop won’t meet these real-time requirements due to the length of time it takes to ramp-up and get started on a job:


*  “On a cluster of 100 nodes, it takes 10 seconds from the moment that a job is submitted to the JobTracker before the first Map task begins to execute and 25 seconds until all the nodes in the cluster are executing the job. [1]”



Transactional Databases


Transactional databases are required to handle a large number of writes / updates effectively, to provide good data integrity to validate incoming data, and to provide good indexing to allow users to rapidly retrieve small subsets of the data. Hadoop fails on all three criteria.



Business Analytics


Business analytics applications involve large numbers of users running queries against the same data. The data in question has a comparatively small volume: even if the underlying raw data runs to TBs, during ETL the data is summarized and aggregated so that it occupies a much smaller volume. The number of rows that have to be read per query is greatly restricted by the use of indexes and typically ranges from hundreds to tens of thousands of rows. Users expect ad hoc queries to return data within a few seconds. Hadoop fails to meet these criteria due to its slow start-up, its poor performance on read relative to load, and, especially, on its lack of index support.



Large Data Volumes


If there’s any task at which Hadoop should shine it’s the processing of large volumes of data. Think of the web crawlers that gather millions of web pages, pages that must be analysed and ranked. However, even in this scenario a parallel relational DBMS will do better in some circumstances: it all depends on the data volumes and on how many analysis tasks are to be performed upon it.


Hadoop has been compared to a parallel DBMS for this webpage analysis scenario (“A Comparison of Approaches to Large-Scale Data Analysis”, see [1]). The researchers used a configuration with two cores per node, and loaded 600,000 randomly generated HTML pages per node – 20 GB per node. On this configuration they ran Hadoop and a parallel relational DBMS (Vertica), and then compared the loading and task execution times.


For a 10 node configuration (200 GB total) the load times were almost the same for Hadoop and Vertica. But when it came to executing various web page analysis tasks, Hadoop ran much more slowly than Vertica, slower by factors of about 126, 50, 17, and 6 for the various tasks (see Figure 1 for an example of the execution times associated with various configurations). So, when running on a hardware configuration with a large number of commodity servers, a parallel relational DBMS can significantly outperform Hadoop when processing 200 GB sized data volumes.

      Figure 1: Vertica (plus another DBMS) versus Hadoop for a Join Task (from [1])


For the 100 node configuration (2 TB total), Hadoop loaded the data about four times faster than Vertica. But when it came to executing various web page analysis tasks, Hadoop again ran much slowly than Vertica, slower by factors of about 20, 14, 13, and 3 for the various tasks. But this is a more nuanced use-case. The additional load time for Vertica above that for Hadoop was about 16,500 seconds, whereas the run times for Hadoop on the various tasks were about 4,700, 1,650, 1200, and 155 seconds. So, if the objective had been to perform many types of analysis on the data then Vertica would have won out; but if, as in this use-case, the objective is to perform a relatively small number of tasks then Hadoop wins out – the longer run times are compensated for by the shorter load time.


Overall, the researchers found Hadoop to be over seven times slower than Vertica for analysing large volumes, so when 2 TB data volumes are being used for data discovery or to evaluate various predictive analytics formulae then a parallel relational DBMS is the best choice. But, for a well-defined task that is run repeatedly on different data sets, Hadoop will perform significantly better.


The faster loading time for Hadoop comes about because it is not really “loading” data; it is simply copying data from one local file store to another. The big assumption in the example above is that data already exists spread out across the local disks of the cluster (loading consists of reading the data and distributing two replicas to other nodes in the cluster). In a real world application, data doesn’t just materialize on local cluster disks; it must be feed in from some external source, and the feeding in of this data may well lead to greatly increased loading times that might remove the loading advantage found in the example above. So, you need to determine how “fat” the pipe feeding data into the Hadoop cluster is likely to be.


In addition, the example above assumes that the data to be loaded is clean and that no integrity checking is required – Hadoop does not have a schema that can reject data that doesn’t fit. If some form of data validation or cleansing is required, then the performance advantage of Hadoop will most likely be lost.


But, for one-off or limited analysis, with a fast data feed, with no data validation requirements, and with data volumes greater than, perhaps, 0.5 TB Hadoop seems to win out, making it the best choice in terms of performance for this type of application.



Cheap Low-end Servers aren’t Cheap!


You’ve heard the refrain endlessly: “Go for a large number of cheap low-end servers”. It sounds attractive. Rather than pay a substantial sum for a commercial licence for a parallel DBMS that runs on commodity servers (say Vertica), or perhaps specialized hardware (say Oracle Exalytics), why not go for free open-source software and a large number of cheap low-end servers?


If, as the comparison indicates, a Hadoop offering typically underperforms a conventional parallel DBMS by a factor of 7 when it comes to query execution, then you might be tempted to increase the number of cluster nodes so that Hadoop competes in terms of performance with a relational DBMS – expecting that the cost of the additional servers would be offset by the cost of licensing the proprietary software / hardware.


But, considering our example above, if a conventional DBMS solution would require, 100 servers, then where are the additional 600 servers going to go? Under the CEO’s desk? What advocates of cheap low-end servers don’t say is that the one-off capital cost of a server is only a minor factor when it comes to the overall costs of maintaining a server farm. What about the cost of electricity alone: whether you prefer the $400 per annum figure quoted by Gartner or the $800 per annum figure quoted by Vertatique try multiplying it by 600. But electricity is only a small part of the total cost of housing, maintenance, security, and replacement; in A Simple Model for Determining True Total Cost of Ownership for Data Centers, the Uptime Institute quotes a total cost of ownership of $4,900 per server per annum. Multiple that by 600 and then try asking your finance director to write a cheque. The point is that whatever figures you may reasonably use in your costing formula – be it for an in-house or an outsourced deployment – a large number of low-end servers can be called many things, but never cheap! So, once you’ve taken all the costs into account the price for a commercial licence and some specialized hardware than runs on a much smaller number of servers may suddenly seem quite attractive.



Open-Source Comparison


Any meaningful comparison between a relational DBMS and Hadoop needs to make use of a parallel relational DBMS that is optimized to run on a large cluster of servers – the likes of Teradata or Oracle Exalytics. However, another interesting question is the extent to which Hadoop outperforms the open-source DBMS’ that are the used in the small business sector. Would moving to Hadoop offer a significant performance advantage over the purchase of a license from one of the commercial relational DBMS vendors?


When it comes to a basic RDBMS they don’t come any simpler than MySQL, the workhorse for most of the world’s smaller web sites. A comparison between Hive / Hadoop and MySQL (“Benchmarking Performance for Migrating a Relational Application to a Parallel Implementation”, see [2]) was performed using the Hortonworks implementation on a four node cluster with 16 hyper-threaded cores and 48 GB of memory per node (64 cores in total). The researchers used query-7 from the TPC-DS benchmark as the basis for the test.


Queries were run against a variety of different dataset sizes. For small datasets, MySQL outperformed Hadoop at loading data, while Hadoop dominated at larger volumes. Load times were similar for dataset sizes in the 70-80 MB range (about 1,000,000 rows).


In terms of execution times, Hadoop outperformed MySQL by a factor of about 2.5 for query volumes of 39 GB, 117 GB, and 390 GB (Hadoop loaded the 390 GB dataset in about 40 minutes).


Admittedly, this test is of very limited relevance in that many MySQL databases will be transactional systems that add no more than a few thousand rows per day (and may well have acquired less than 1,000,000 rows over their entire lifetime of operation); there will be very few use-cases where MySQL would be used to query 39 GB worth of data. However, what is interesting about this test is that the performance advantage of Hadoop is so small when compared against a relational DBMS that has not been built from the ground up with parallel execution in mind.





As a standalone platform, there will certainly use-cases where Hadoop will outperform a parallel relational DBMS:


*  Data volumes greater than about 0.5 TB, and

*  One-off or limited analysis of the data, and

*  No data validation / cleansing requirements, and

*  Fast data feed into the server cluster.


While this use-case will not be relevant for most organisations, for those larger organisations that have an interest in, say, sentiment analysis a standalone Hadoop implementation may well prove worth the investment.


However, Hadoop is increasingly being used as a back-end component in hybrid computing platforms, and in this setting it is likely to find much broader uses. In these hybrid platforms, Hadoop has been re-architected to overcome some of the limitations that we have outlined above. For example, CitusDB adds a Postgres front-end to overcome the limitations of Hive, bypasses MapReduce to provide real-time access to distributed data, and claims a 3-5 performance advantage over a conventional Hive / Hadoop implementation.


This merging of Hadoop and relational technology seems likely to offer the best of both worlds, so when a vendor is trying to sell you a Hadoop-based solution, the potential benefits will all depend on the context – the Hadoop-based component may be quite small and the vendor may just be using “Hadoop” as a buzzword to grab your attention.


So, to answer the question we posed at the beginning of this article. Standalone Hadoop as a panacea for your IT problems? Snake oil! Hadoop as a component in a hybrid relational implementation? No, but worth evaluating exactly what the vendor has to offer!





[1] A. Pavlo, E. Paulson, A. Rasin, et al.,“A Comparison of Approaches to Large-Scale Data Analysis”, SIGMOD '09 Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data.


[2] K. Gadiraju, K. Davis, P. Talaga, “Benchmarking Performance for Migrating a Relational Application to a Parallel Implementation”, Advances in Conceptual Modeling, Volume 8823, Springer, 2014.