Installing SQL Server Data Tools – Business Intelligence for Visual Studio 2012

As if renaming the accurately titled Business Intelligence Development Studio (BIDS) to the rather ambiguous SQL Server Data Tools (SSDT) wasn’t bad enough, Microsoft’s December 2012 SSDT release brought only half the expected capabilities to Visual Studio 2012.  Yep, the December 2012 download was missing a key component: the project and item templates for developing MS BI projects in Visual Studio.  Thankfully, the newest release (5th March 2013) finally adds all of the MS BI templates to SSDT, so you can now develop SSIS packages, SSAS cubes and SSRS reports in the Visual Studio 2012 environment.

Which version?

Unfortunately, they’ve not made the whole process easy.  Searching for “SQL Server Data Tools” will likely lead you to a download which, upon installation, adds connectivity and server management tools to VS 2012 – essentially an up-to-date take on SQL Server Management Studio (SSMS) – but without the BI project templates.

The latest release (with the BI templates) is actually called:

Microsoft SQL Server Data Tools – Business Intelligence for Visual Studio 2012

So if you’re trying to get SSDT for BI development work, make sure you download the correct version.  Unfortunately, that wasn’t the end of the issues, as I had enough trouble with the installation that I felt it needed sharing.

Installation woes

Begin by downloading the correct installer for the BI-enabled version of SSDT from http://www.microsoft.com/en-us/download/details.aspx?id=36843 (782 MB).

Screenshot of the "Installation Type" page in the SQL Server 2012 SP1 setup wizard

Fig. 1: On a 64-bit SQL instance, be sure to select “Perform a new installation of SQL Server 2012”

Once you execute it, the installer will unpack and run the SQL Server 2012 SP1 setup wizard.  Don’t worry about this; remember that SSDT, like BIDS before it, is a component of SQL Server built on the Visual Studio shell, NOT an extension to Visual Studio itself.

The trick with the installation comes when you reach the Installation Type step (see Fig. 1).

If you’re running an x64-based (64-bit) SQL instance, make sure to select “Perform a new installation of SQL Server 2012” on the Installation Type page, and NOT “Add features to an existing instance”.

This is because although the SQL Server instance is 64-bit, the Visual Studio 2012 shell is 32-bit.  If you attempt to add a 32-bit component to a 64-bit instance, setup fails the Installation Rules checks and won’t allow you to proceed.

Don’t worry, though: choosing the new-installation option doesn’t actually create a new SQL instance; it just allows the installer to get past the pre-installation checks.

If you’ve got a 32-bit instance of SQL Server, it doesn’t matter which option you choose here.

Installation complete

Screenshot showing adding a new Business Intelligence project in Visual Studio 2012

The new Business Intelligence Project templates in action

Once the installation has completed (a restart may be required), you can open Visual Studio 2012 (or the new SQL Server Data Tools 2012 item on your Start menu) and get developing.  Click “New Project” in the File menu and check for the “Business Intelligence” templates to confirm that it’s worked.

I’ve yet to find any real differences between the Visual Studio 2012-based SSDT and the Visual Studio 2010-based version that shipped with SQL Server 2012.  At the moment, the main advantage of this release seems to be the improved features of Visual Studio 2012 over its 2010 counterpart, rather than any advancements in the Business Intelligence templates/tools themselves.

They might be there, however; I just haven’t come across them yet.  Let me know in the comments below if you’ve spotted any improvements over SSDT 2010 and what they are.

Is Hadoop the right tool for the job?

I recently posted some thoughts regarding Microsoft’s Windows-compatible Hadoop implementation, HDInsight.  I was investigating it for a project that I figured would benefit from a distributed processing approach, although we ultimately decided to pursue other alternatives.  The investigation led our team to some quite interesting discoveries about Hadoop, and about the scenarios in which current distributed processing solutions are and aren’t appropriate.


Example scenario

The project in question is a large-scale data processing solution, required to process millions of varied data files daily, parsing data points from JSON, XML, HTML and more, and writing them to a storage solution.  Going back to the “Big Data” terminology, we were definitely looking at the potential for moving terabytes of data per day, at least once we scale up, so we needed a technology that could handle this while remaining responsive, as processing time is a crucial factor.

What we quickly noticed was that we didn’t actually need the Reduce part of the functionality at all.  All we wanted to do was run Map jobs to identify and retrieve data points, rather than aggregate and summarise them.
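For a rough illustration of what that means in practice, Hadoop lets you declare a job map-only by setting the number of reducers to zero, so each mapper’s output is written straight to HDFS with no shuffle or reduce phase.  The sketch below uses Hadoop’s standard Java MapReduce API; the “id” field extraction and the input/output paths are hypothetical placeholders rather than our actual parsing logic.

```java
// A map-only job: with zero reducers, mapper output goes straight to HDFS.
// The "id" extraction below is a hypothetical stand-in for real parsing.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExtractJob {

    public static class ExtractMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Pull a single data point out of each input line; no aggregation
            String line = value.toString();
            int i = line.indexOf("\"id\":");
            if (i >= 0) {
                String dataPoint = line.substring(i + 5).split("[,}]")[0].trim();
                context.write(new Text(dataPoint), NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only extract");
        job.setJarByClass(ExtractJob.class);
        job.setMapperClass(ExtractMapper.class);
        job.setNumReduceTasks(0); // the crucial line: no shuffle, no reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```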


Distributed processing options

Following our investigation, and subsequent rejection, of HDInsight as a viable option for this project, we took a look at vanilla Hadoop, as well as some other distributed processing implementations and Hadoop add-ons.  Fortunately, there are a lot of very cool products out there.


Graph comparing Spark vs Hadoop

Spark claims to run up to 100x faster than Hadoop MapReduce

Cloudera Impala introduces its own distributed query engine, which avoids MapReduce to deliver near real-time query results.  It’s not intended as a replacement for MapReduce, however; it’s meant to complement a Hadoop cluster by offering alternative query techniques for accessing data from Hive and HDFS.
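From the client side, Impala looks like any other SQL endpoint.  As a hedged sketch (assuming the Hive JDBC driver is on the classpath and the cluster exposes Impala’s HiveServer2-compatible port, 21050 by default on an unsecured cluster; the host name and table are hypothetical), querying it from Java looks like this:

```java
// Hypothetical example: querying Impala through the Hive JDBC driver.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host:21050/;auth=noSasl");
             Statement stmt = conn.createStatement();
             // Ordinary SQL, answered by Impala's engine, not a MapReduce job
             ResultSet rs = stmt.executeQuery(
                     "SELECT source, COUNT(*) FROM data_points GROUP BY source")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```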

To properly evaluate the performance of these products against one another, we realised we needed a baseline.  With a great deal of MS BI experience in our team, we thought it would be fun to create this baseline using our usual go-to data processing solution: SSIS.

The more we dug into the distributed architecture, the more it seemed we were looking for something else entirely, given our complete lack of need for a reduce function.


SSIS vs Hadoop

I won’t go into detail on this, as Links has already written up the results over on thinknook.com, but running our Map function on a single SSIS instance performed significantly better in every test than our Hadoop cluster did.  The results we gathered suggest that distributed processing is really only the correct approach when you are using both the Map AND Reduce functionality and/or working with extremely large datasets.  Indeed, the larger the dataset and the more data points involved, the more powerful and useful the Reduce functionality becomes.

There is quite simply no straightforward alternative for performing this type of operation in traditional ETL platforms such as SSIS.

I’d like to find out what the comparison is like when performing this same test with Spark vs SSIS, just to see if the in-memory implementation provides the necessary performance boost, or if it’s still better to keep Map and MapReduce in two separate places.
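If we do run that comparison, the map-only pipeline translates almost directly into Spark’s Java API.  A rough, untested sketch (written against a Java 8-capable Spark build; the paths and extraction logic are hypothetical placeholders):

```java
// A map-only pipeline in Spark: no shuffle, no reduce, work kept in memory.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkExtract {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("map-only-extract");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input/*"); // hypothetical path

        // Same hypothetical "id" extraction as before, expressed as filter/map
        JavaRDD<String> dataPoints = lines
                .filter(line -> line.contains("\"id\":"))
                .map(line -> line.substring(line.indexOf("\"id\":") + 5)
                                 .split("[,}]")[0].trim());

        dataPoints.saveAsTextFile("hdfs:///data/output"); // hypothetical path
        sc.stop();
    }
}
```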


Is Hadoop the right tool for the job?

Bottom line: It depends on the job.

If you’re not utilising both sides of the MapReduce coin, though, then even when processing millions of files, the overhead of creating and managing jobs is just not worth it.  And if you are using both Map and Reduce functionality, it may well be worth considering some of the other solutions out there as an alternative to Hadoop MapReduce.

Hands-on with Hadoop and HDInsight

Hadoop.  Everyone and their dog is talking about it.  That and “Big Data”.  There was an excellent post on Brent Ozar’s DBA Reactions Tumblr blog recently that encapsulated it perfectly, titled “When the executives ask if we’re Hadooping”.  It’s a valid point, though: Hadoop is mentioned in just about every article these days, along with the phrase “Big Data” (which I personally don’t like at all).  The consensus, at least on the surface, seems to be that Hadoop will solve everyone’s problems, process anything, oh, and bring world peace while it’s doing that.  My sarcastic tone belies a genuine interest in playing about with it, though.  With so many people talking about Hadoop (in its many implementations), I was very keen to get an opportunity to try it for myself.

Fortunately, a project came along recently that seemed like it might benefit from a distributed processing approach.  So naturally, being primarily a Microsoft Business Intelligence person, I figured the best place for me to get started was to jump onto Windows Azure and try out HDInsight, Microsoft’s own Hadoop implementation (in conjunction with Hortonworks).


Testing HDInsight

Start page for creating a new HDInsight cluster

You can create your cluster in seconds

Getting started with HDInsight is simple.  Incredibly simple.  Just hit up https://www.hadooponazure.com/ and sign in to request your cluster.  The good news is that it’ll be live in minutes; the bad news is that you only get 3 nodes to begin with, which severely limits your processing capacity for anything but the simplest jobs.

This led me to discount HDInsight as a platform for this project soon after.  Aside from the fact that, at the time of writing, it’s still at the preview stage (so no extra nodes, pricing information or scale-out options are available), we found performance on the default 3 nodes terribly slow, and the management of jobs and the file system somewhat obscured by the web interface MS have added to try to simplify the experience.  Even as a predominantly .NET/Windows person, I was much more comfortable configuring jobs and manipulating HDFS directly via the command line than via the web interface (that could just be me, though).  If you use Remote Desktop to connect to your cluster, you can launch the command line from there, and also browse HDFS using the HDFS web interface by connecting to the cluster’s head node.
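For anyone similarly inclined, everything the web UI does against the file system can also be driven from code through Hadoop’s FileSystem API.  A minimal sketch (the paths are hypothetical; on a real cluster the Configuration picks up fs.defaultFS from core-site.xml):

```java
// Driving HDFS programmatically instead of through the web UI.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBrowse {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Upload a local file into the cluster
        fs.copyFromLocalFile(new Path("input.json"), new Path("/data/input.json"));

        // List the directory, the equivalent of "hadoop fs -ls /data"
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
        fs.close();
    }
}
```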

The HDInsight account management page

You can manage all your jobs from the web interface

The preview nature of the platform was definitely a killer, at least for this project, as we were looking for something we could start with immediately, with the option to quickly boost capacity if necessary.  One of the key selling points of a distributed architecture has to be the ability to quickly and easily scale out by adding more nodes to the cluster.  Add to that the fact that we found performance to be very slow, and it was clearly not the best option for our purposes.  (To be completely fair, my experience with distributed processing solutions suggests they’re not the best choice for processing extremely large numbers of files, being more suited to handling smaller numbers of extremely large files.)

Unfortunately, there’s not a huge amount of documentation available, and what there is isn’t complete, so be prepared to roll up your sleeves and get your hands dirty.


Conclusion

I’m not for a second saying don’t try HDInsight though.  As a project, it’s still in its infancy and perhaps not moving as quickly as some of the others out there.  A Windows-based Hadoop implementation is still a very positive thing however, and while I didn’t really get on with the web UI, I’m sure others will find it fits their needs perfectly.

Pros:

  • Easy to get started
  • Windows-based
  • .NET code MapReduce functions
  • Awesome SDK
  • Pretty UI

Cons:

  • Slow, especially on the default 3 nodes
  • UI obscures Hadoop and HDFS functionality
  • Incomplete documentation
  • Still in preview stage


I suggest everyone gives it a go for themselves.  As with most things in life (I was going to say in BI, but it’s equally applicable), one man’s trash is another man’s treasure, and depending on the requirements of each individual project, HDInsight may or may not be suitable.  Would I recommend it at the moment, ahead of a Linux-hosted Hadoop implementation?  No, I have to say I probably wouldn’t, but it’s good to see Hadoop hit Windows regardless, and there is definite promise in HDInsight.

It just needs to haul itself up off its hands and knees and take those first couple of tentative steps.

Notes from building a Custom ForEach Enumerator in SSIS

I recently posted about a quandary in which I found myself that led to me building my own extended ForEach File Enumerator in SSIS.  All things considered, it was a reasonably straightforward experience, with most of my issues stemming from a relative unfamiliarity with Windows Forms development (I was always an ASP.NET man).  The whole process can be split into four very simple steps:

  1. Create your enumeration function
  2. Design your UI for SQL Server Data Tools (SSDT)
  3. Validate and assign the input from the UI
  4. Deploy your new component

As long as your new custom component isn’t too complicated, these steps can be completed very quickly, meaning you can be up and running in only a little longer than it would take to write everything in a Script Task.  And think of the reusability!