Is Hadoop the right tool for the job?

I recently posted some thoughts regarding Microsoft’s Windows-compatible Hadoop implementation, HDInsight.  I was investigating it for a project that I figured would benefit from a distributed processing approach, although we ultimately decided to pursue other alternatives.  Along the way, our team made some interesting discoveries about Hadoop, and about the scenarios in which current distributed processing solutions are and aren’t appropriate.

Example scenario

The project in question is a large-scale data processing solution, required to process millions of varied data files daily, parsing data points from JSON, XML, HTML and more, and writing them to a storage solution.  Going back to the “Big Data” terminology, we were definitely looking at the potential for moving terabytes of data per day, at least once we scale up, so we needed a technology that could handle this volume while remaining responsive, as processing time is a crucial factor.

What we quickly noticed was that we didn’t actually need the Reduce part of the functionality at all.  All we wanted to do was run Map jobs to identify and retrieve data points, rather than aggregating and summarising them.
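To make the “Map-only” idea concrete, here is a minimal sketch of such a job in the Hadoop Streaming style, where the mapper is just a script reading records on stdin.  The field names (`id`, `price`, `timestamp`) and the `extract_points` helper are purely illustrative, not taken from our actual pipeline:

```python
import json
import sys

def extract_points(line):
    """Pull the fields we care about from one JSON record.
    Returns a list of (key, value) pairs, or [] if the line is unparseable."""
    try:
        record = json.loads(line)
    except ValueError:
        return []  # skip malformed input rather than failing the whole job
    # Illustrative field names only; a real pipeline would have its own schema
    return [(k, record[k]) for k in ("id", "price", "timestamp") if k in record]

def main():
    # Hadoop Streaming feeds records on stdin and collects tab-separated
    # key/value pairs from stdout; with zero reducers configured, these
    # map outputs are written straight to HDFS with no shuffle or sort.
    for line in sys.stdin:
        for key, value in extract_points(line):
            print(f"{key}\t{value}")

if __name__ == "__main__":
    main()
```

With zero reducers configured on the streaming job (e.g. `-D mapreduce.job.reduces=0`), Hadoop skips the shuffle and reduce phases entirely, which is exactly the shape of workload we had.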

Distributed processing options

After investigating, and subsequently rejecting, HDInsight as a viable option for this project, we took a look at vanilla Hadoop, as well as some other distributed processing implementations and Hadoop add-ons.  Fortunately, there are a lot of very cool products out there.

[Image: graph comparing Spark vs Hadoop performance]
Spark claims to run up to 100x faster than Hadoop MapReduce

Cloudera Impala introduces its own distributed query engine, which bypasses MapReduce to deliver near real-time query results.  It’s not intended as a replacement for MapReduce, however; it is meant to complement a Hadoop cluster by offering alternative query techniques for accessing data from Hive and HDFS.

To properly evaluate the performance of these products against one another, we realised we needed a baseline.  Having a great deal of MS BI experience in our team, we thought it would be fun to create this baseline using our usual go-to data processing solution: SSIS.

The more we dug into the distributed architecture, the more it seemed we were looking for something else entirely for our purposes, given the complete absence of any requirement for a reduce function.

SSIS vs Hadoop

I won’t go into detail on this, as Links has already written up the results, but running our Map function on a single SSIS instance significantly outperformed our Hadoop cluster in every test.  The results we gathered suggest that distributed processing is really only the correct approach when you are using both the Map AND Reduce functionality and/or working with extremely large datasets.  Indeed, the larger the dataset and the more data points involved, the more powerful and useful the Reduce functionality becomes.

There is quite simply no straightforward alternative for performing this type of operation in traditional ETL platforms such as SSIS.

I’d like to see how Spark compares with SSIS on this same test, to find out whether the in-memory implementation provides the necessary performance boost, or whether it’s still better to keep Map-only and MapReduce workloads in two separate places.

Is Hadoop the right tool for the job?

Bottom line: It depends on the job.

If you’re not utilising both sides of the MapReduce coin, though, even when processing millions of files, then the overhead of creating and managing jobs is just not worth it.  And if you are using both Map and Reduce functionality, it may well be worth considering some of the other solutions out there as an alternative to Hadoop MapReduce.

Notes from building a Custom ForEach Enumerator in SSIS

I recently posted about a quandary that led to me building my own extended ForEach File Enumerator in SSIS.  All things considered, it was a reasonably straightforward experience, with most of my issues stemming from a relative unfamiliarity with Windows Forms development (I was always an ASP.NET man).  The whole process can be split into four simple steps:

  1. Create your Enumeration function
  2. Design your UI for SQL Server Data Tools (SSDT)
  3. Validate and assign input from the UI
  4. Deploy your new component

As long as your new custom component isn’t too complicated, these steps can be completed very quickly, meaning you can be up and running in only a little longer than it would take to write everything in a Script Task, and think of the re-usability!
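As a rough illustration of step 1, the heart of the enumerator is simply “build the collection the ForEach loop will walk”.  The real component is C# (a class deriving from the SSIS `ForEachEnumerator` base and overriding `GetEnumerator`), but the core logic can be sketched language-neutrally; here in Python, with all names illustrative:

```python
import os
import re

def enumerate_files(directory, pattern, recurse=False):
    """Sketch of step 1: gather the collection the ForEach loop will walk.
    Returns full paths of files whose name matches the regex `pattern`."""
    matcher = re.compile(pattern)  # step 3 would validate this compiles
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if matcher.search(name):
                matches.append(os.path.join(root, name))
        if not recurse:
            break  # top level only, mirroring the enumerator's traverse option
    return sorted(matches)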

Continue reading…
[Image: C# code for the custom enumerator, overriding the GetEnumerator method]

Extending the ForEach File Enumerator in SSIS

One of the primary reasons for using SSIS is to process multiple files: ETL (Extract, Transform, Load) their data and feed it into a destination, for example a data warehouse. Built-in components such as the For Loop Container and the ForEach Loop Container make it very simple to enumerate a collection, be it an array, a dataset, or a directory of files, and perform processing tasks for each entry. In fact, the ForEach File Enumerator option of the ForEach Loop Container even allows you to specify a wildcard mask to filter the list of files in a directory prior to processing, so you can weed out unwanted files from the loop. But what if a simple string match isn’t powerful enough to perform the filtering you need?
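As a hypothetical example of where a wildcard mask falls short: a mask can say “sales_*.csv”, but it can’t insist that the wildcard portion looks like a valid date.  A regular expression can.  The filenames below are made up purely for illustration:

```python
import fnmatch
import re

files = [
    "sales_20240101.csv",
    "sales_20249999.csv",   # matches the mask, but is not a plausible date
    "sales_notes.txt",
]

# A wildcard mask, as accepted by the built-in ForEach File Enumerator,
# lets both CSV files through:
masked = [f for f in files if fnmatch.fnmatch(f, "sales_*.csv")]

# A regex can constrain the digits to something date-shaped (yyyyMMdd),
# which no simple string mask can express:
pattern = re.compile(r"^sales_\d{4}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])\.csv$")
filtered = [f for f in files if pattern.match(f)]
```

Here `masked` keeps both CSV files, while `filtered` keeps only the genuinely date-stamped one, which is the kind of filtering that motivated the custom enumerator.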

Continue reading…

Metadata Discovery in SSIS 2012 not working with temp tables

SQL Server Integration Services is a tidy bit of kit and an absolute must-have on any SQL Server developer or administrator’s toolbelt.  However, it is as frustrating as it is sublime, with many obfuscated error messages, quirky behaviours and downright oddness.  I ran into one such quirk recently while setting up a Data Flow Task using an OLE DB Source that loaded its data from a stored procedure, specified by a package variable.  Everything was working fine until I realised I needed to do something in the stored procedure that required a temporary table.  As soon as I updated the procedure and attempted to refresh the columns in my OLE DB Source, I was hit with the following error:

Msg 11525, Level 16, State 1, Procedure My_Stored_Procedure_Name, Line 1
The metadata could not be determined because statement ‘My SQL Statement Here;’ uses a temp table.

So why does this happen, even when I’ve defined the type of the column in my procedure? Continue reading…