Is Hadoop the right tool for the job?

I recently posted some thoughts regarding Microsoft’s Windows-compatible Hadoop implementation, HDInsight.  I was investigating it for a project that I figured would benefit from a distributed processing approach, although ultimately decided to pursue other alternatives.  It led our team to make some quite interesting discoveries about Hadoop, and some scenarios of when current distributed processing solutions are and aren’t appropriate.

Example scenario

The project in question is a large-scale data processing solution, required to process millions of varied data files daily, parsing data points from JSON, XML, HTML and more, and writing them to a storage solution.  Going back to the “Big Data” terminology, we were definitely looking at the potential for moving terabytes of data per day, at least once we scaled up, so we needed a technology that could handle this while remaining responsive, as processing time is a crucial factor.

What we quickly noticed was that we didn’t actually need the Reduce part of the functionality.  All we wanted to do was run Map jobs to identify and retrieve data points, rather than aggregate and summarise those data points.
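To make the "Map only" idea concrete, here is a minimal sketch in Python.  It is not the code from the project; the function name, the file names and the assumption that each file is a flat JSON object or a one-level XML document are all made up for illustration.  The point is simply that each file is handled on its own and nothing is aggregated afterwards.

```python
import json
import xml.etree.ElementTree as ET


def extract_data_points(filename, text):
    """Map step: turn one file's contents into a list of (field, value) pairs."""
    if filename.endswith(".json"):
        record = json.loads(text)
        # Assumes a flat JSON object; nested structures would need flattening.
        return [(key, str(value)) for key, value in record.items()]
    if filename.endswith(".xml"):
        root = ET.fromstring(text)
        # Assumes each direct child element holds one data point.
        return [(child.tag, (child.text or "").strip()) for child in root]
    return []  # Unknown formats are skipped in this sketch.


# Hypothetical in-memory inputs standing in for files on disk.
sample_files = {
    "a.json": '{"id": 1, "name": "alpha"}',
    "b.xml": "<item><id>2</id><name>beta</name></item>",
}

# Map only: every file is processed independently, with no Reduce step.
results = {
    name: extract_data_points(name, text)
    for name, text in sample_files.items()
}
```

Because no file depends on any other, the work can be spread across processes or machines, or simply run in a loop on one box, without any shuffle or aggregation stage.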

Distributed processing options

Having investigated and subsequently rejected HDInsight as a viable option for this project, we took a look at vanilla Hadoop, as well as some other distributed processing implementations and Hadoop add-ons.  Fortunately, there are a lot of very cool products out there.

Figure: Graph comparing Spark vs Hadoop.  Spark claims to run up to 100x faster than Hadoop MapReduce.

Cloudera Impala introduces its own distributed query engine, which avoids MapReduce to deliver near real-time query results.  It’s not intended as a replacement for MapReduce, however; it is meant to complement a Hadoop cluster by offering alternative query techniques for accessing data from Hive and HDFS.

To properly evaluate the performance of these products against one another, we realised we needed a baseline.  Having a great deal of MS BI experience in our team, we thought it would be fun to create this baseline using our usual go-to data processing solution: SSIS.

The more we dug into the distributed architecture, the more it seemed that we were looking for something else for our purposes, given that we had no requirement at all for a reduction function.

SSIS vs Hadoop

I won’t go into detail on this, as Links has already written up the results over on, but running our Map function on a single SSIS instance performed significantly better in each test than our Hadoop cluster.  The results we gathered seem to suggest that a distributed approach is really only the correct one when you are using both the Map AND Reduce functionality and/or working with extremely large datasets.  Indeed, the larger the dataset and the more data points involved, the more powerful and useful the Reduce functionality becomes.

There is quite simply no straightforward alternative for performing this type of operation in traditional ETL platforms such as SSIS.
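For contrast with a Map-only job, the following is a toy, single-process sketch of what the Reduce side adds.  The record layout (`category`, `amount`) and the function names are invented for the example, and real Hadoop distributes the grouping step across nodes rather than using a dictionary in memory.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (key, value) pairs from one record."""
    yield record["category"], record["amount"]


def reduce_values(key, values):
    """Reduce step: collapse all values for one key into a summary."""
    return key, sum(values)


def map_reduce(records):
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)  # "shuffle": gather values by key
    return dict(reduce_values(key, values) for key, values in grouped.items())


# Hypothetical records; in practice these would be the mapped data points.
records = [
    {"category": "a", "amount": 2},
    {"category": "b", "amount": 5},
    {"category": "a", "amount": 3},
]
totals = map_reduce(records)
```

The grouping by key is the part that has to see data from every input, which is why it benefits from a framework once the dataset no longer fits on one machine.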

I’d like to find out what the comparison is like when performing this same test with Spark vs SSIS, just to see if the in-memory implementation provides the necessary performance boost, or if it’s still better to keep Map and MapReduce in two separate places.

Is Hadoop the right tool for the job?

Bottom line: It depends on the job.

If you’re not utilising both sides of the MapReduce coin though, even when processing millions of files, then the overhead of creating and managing jobs is just not worth it.  And if you are using both Map and Reduce functionality, it may be worth considering some of the other solutions out there as an alternative to Hadoop MapReduce.
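The overhead argument can be put as a back-of-the-envelope model.  The sketch below is only an illustration: the figures are placeholders chosen to show the shape of the trade-off, not measurements from our tests, and it assumes every file is scheduled as its own task with a fixed per-task cost.

```python
def estimated_seconds(num_files, work_per_file_s, overhead_per_task_s, workers):
    """Rough wall-clock estimate when every file is scheduled as its own task."""
    return num_files * (work_per_file_s + overhead_per_task_s) / workers


def break_even_workers(work_per_file_s, overhead_per_task_s):
    """Workers a cluster needs just to match one overhead-free process."""
    return (work_per_file_s + overhead_per_task_s) / work_per_file_s


# Placeholder figures, NOT measured values.
NUM_FILES = 1_000_000
WORK_S = 0.01      # hypothetical parse time per file
OVERHEAD_S = 1.0   # hypothetical scheduling/start-up cost per task

single_process = estimated_seconds(NUM_FILES, WORK_S, 0.0, workers=1)
three_nodes = estimated_seconds(NUM_FILES, WORK_S, OVERHEAD_S, workers=3)
```

Under these assumptions, whenever the per-task overhead dwarfs the actual work per file, a small cluster loses to a single process, and adding nodes only helps once the worker count passes the break-even point.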


Hands-on with Hadoop and HDInsight

Hadoop.  Everyone and their dog is talking about it.  That and “Big Data”.  There was an excellent post on Brent Ozar’s DBA Reactions Tumblr blog recently that encapsulated it perfectly, titled “When the executives ask if we’re Hadooping”.  It’s a valid point though: Hadoop is mentioned in just about every article these days, along with the phrase “Big Data” (which I personally don’t like at all).  The consensus, at least on the surface, seems to be that Hadoop will solve everyone’s problems, process anything, oh, and bring world peace while it’s doing that.  My sarcastic tone belies a genuine interest in playing about with it though.  With so many people talking about Hadoop (in its many implementations), I was very keen to get an opportunity to try it for myself.

Fortunately, a project came along recently that seemed like it might benefit from a distributed processing approach.  So naturally, being primarily a Microsoft Business Intelligence person, I figured the best place for me to get started was to jump onto Windows Azure and try out HDInsight, Microsoft’s own Hadoop implementation (in conjunction with Hortonworks).

Testing HDInsight

Figure: Start page for creating a new HDInsight cluster.  You can create your cluster in seconds.

Getting started with HDInsight is simple.  Incredibly simple.  Just hit up and sign in to get started and request your cluster.  The good news is that it’ll be live in minutes.  The bad news is that you can only get 3 nodes to begin with, which severely limits your processing capacity for all but the simplest jobs.

This led me to discount HDInsight as a platform for this project soon after.  At the time of writing, it’s still in preview stage (so there are no extra nodes, pricing information or scale-out options obviously available).  On top of that, on the default 3 nodes we found that performance was terribly slow, and that the management of jobs and the file system was somewhat obscured by the web interface MS have added to try and simplify the experience.  Even as a predominantly .NET/Windows person, I was much more comfortable configuring jobs and manipulating HDFS directly via the command line, rather than via the web interface (that could totally just be me though).  If you use Remote Desktop to connect to your cluster, you can launch the command line from there, and also browse HDFS using the HDFS web interface by connecting to the cluster’s head node.

Figure: The HDInsight account management page.  You can manage all your jobs from the web interface.

The preview nature of the platform was definitely a killer, at least for this project, as we were looking for something we could start with immediately, with the option to quickly boost capacity if necessary.  One of the key selling points for using a distributed architecture has to be the ability to quickly and easily scale out capacity by adding more nodes to the cluster.  Add to that the fact that we found performance to be very slow, and it was clearly not the best option for our purposes (To be completely fair though, my experiences with distributed processing solutions suggest they’re not the best choice for processing extremely large numbers of files, being more suited to handling smaller numbers of extremely large files).

Unfortunately, there’s not a huge amount of documentation available, and that which is available is not complete, so be prepared to roll up your sleeves and get your hands dirty.


I’m not for a second saying don’t try HDInsight though.  As a project, it’s still in its infancy and perhaps not moving as quickly as some of the others out there.  A Windows-based Hadoop implementation is still a very positive thing, however, and while I didn’t really get on with the web UI, I’m sure others will find it fits their needs perfectly.



Pros

  • Easy to get started
  • Windows-based
  • .NET code MapReduce functions
  • Awesome SDK
  • Pretty UI


Cons

  • Slow, especially on the default 3 nodes
  • UI obscures Hadoop and HDFS functionality
  • Incomplete documentation
  • Still in preview stage

I suggest that everyone gives it a go for themselves.  As with most things in life (I was going to say in BI, but it’s equally applicable), one man’s trash is another man’s treasure, and depending on the requirements of each individual project, HDInsight may or may not be suitable.  Would I recommend it at the moment, ahead of a Linux-hosted Hadoop implementation?  No, I have to say I probably wouldn’t, but it’s good to see Hadoop hit Windows regardless, and there is definite promise in HDInsight.

It just needs to haul itself up off its hands and knees and take those first couple of tentative steps.