Extending the ForEach File Enumerator in SSIS
One of the primary reasons for using SSIS is to process multiple files, ETL (Extract, Transform, Load) data, and feed it into a destination, for example, a data warehouse. Built-in components such as the For Loop Container and the ForEach Loop Container make it very simple to enumerate a collection and perform some processing tasks for each entry, be it an array, a dataset, or a directory containing files. In fact, the ForEach File Enumerator option of the ForEach Loop Container allows you to even specify a mask to filter the list of files in a directory prior to processing, so you can weed out unwanted files from the loop. But what if a simple string match isn’t powerful enough to perform the filtering you need?
It’s exactly this situation in which I found myself recently. I had a directory containing several hundred thousand files, all with names based on a number (i.e. 10000.txt, 10001.txt, 10002.txt) and I wanted to only process files whose name is above a certain numerical threshold. Essentially I was looking to evaluate an expression like:
[cc lang=”csharp” escaped=”true”]if ((int)filename.Replace(“.txt”,””) > threshold_number)
/* DO STUFF */
At this point, I figured I had 3 options:
- Enumerate all the files and then perform some custom logic to further filter the collection
- Create/find a custom ForEach File enumerator which could filter the files as required
- Inspect the directory prior to the loop, and pass it a list of pre-filtered filenames over which to iterate
I posed the question on StackOverflow as to whether it was possible to handle this via the existing components, but despite some excellent, in-depth feedback, the community all suggested the first route, which was essentially what I tried myself. It’s at this point that I began thinking about trying the other two approaches. All three are detailed below.
Option 1 – The Brute Force approach
First up, I decided to try the brute force approach to the problem, by simply enumerating the entire directory and then using a Script Task within my ForEach File Loop to check the conditional logic and decide whether to continue processing the file, or skip to the next iteration of the loop. Simple, no?
One major problem. When this runs over a huge collection of files (into the several hundreds of thousands), it runs like a dog (extremely slow),at least on my dev and production environments. Why does it take so long? Well, if you think about it, the ForEach File Enumerator must enumerate every single file in the directory which matches the file mask and retain this in memory as it iterates through the collection. This means we’re actually creating a FileInfo object for each of the files in the directory, but only actually using a a handful of those which match our conditional logic in the Script Task.
This seems like a hugely inefficient use of resources to me, which wouldn’t normally be a concern as it’s usually fast enough, but in situations where you’re processing large numbers of files, there must be a better approach.
Option 2 – The Custom approach
I had a dig around on the web, as I figured that something like this would be achievable if I could use a Regular Expression on the file mask. I came across Joost’s SSIS blog, where he has created a custom ForEach File enumerator which accepts a RegEx instead of a file mask expression. This is an absolutely brilliant component, and really adds some flexibility to the standard expressions available in SSIS. However, with RegEx, it’s tricky to perform an equal to/greater than comparison, as it’s intended to match characters, not values. In saying that, it is possible to play about with the positioning of each number but you’re going to end up with a hugely complex RegEx that you’re going to have to edit depending on the threshold number, which is a maintainability nightmare.
Joost van Rossum has a tonne of excellent (and FREE!) custom SSIS components on his site, so I’d definitely recommend checking it out.
Fortunately, SSIS makes it fairly easy to modify the standard ForEach Enumerator
So, I had a look into the ForEach Loop Container in SSIS and investigated creating my own one. After all, if you think about the file mask expression (exposed on the FileSpec property), it must under everything, evaluate to a boolean condition: does the file name match the mask? I figured that I could create my own ForEach File Enumerator, which could perform a more complex conditional comparison and only enumerate those files whose names match the condition.
Fortunately, SSIS makes it fairly easy to modify the standard ForEach Enumerator. All you really need to do is override the existing GetEnumerator method, returning a type that implements IEnumerator. Of course, if you’re creating a custom ForEach Enumerator, you don’t really want a hard-coded conditional expression, so you’ll probably need to take this a bit further and code up some properties and link this to a custom UI, to allow users to specify their own conditions at design time.
This post’s already getting a bit long to start a walkthrough of how I created this custom ForEach Enumerator, so I’ll write a follow up post detailing my first experience of custom SSIS component development (if you’re a Windows Forms developer, you’ll find it to be a piece of cake). In the meantime, you can download my component itself in DLL form from my downloads page. It’s pretty basic and should definitely be treated as beta software (let me know of any bugs via my Contact page), but it will:
- Compare an integer-based filename against a user-specified value.
- Allow choice of comparison (>, >=, ==, <=, <).
- Allow specification of an Object variable containing a list of integers which should be excluded from the enumerator, even if they match the condition (could be used to exclude problematic files, or ones which have already been processed).
It doesn’t currently:
- Allow comparison against an expression
- Traverse sub-folders
- Handle other types for comparison
Again, I’ll follow up with another post, detailing my creation process for this component and how it works. One thing I did find is that MSDN is surprisingly sparse on examples of custom SSIS component creation, so I had to dig about on the web for some help. Again, there’s not that much out there, but I did find some excellent advice on Dougbert’s blog that helped me avoid any major pitfalls. Hopefully, by sharing my experiences, I can help add a little more information to help the next person who fancies giving it a try.
Option 3 – The Alternate route
Alternatively, if you don’t fancy creating a custom SSIS component, you can perform your own enumeration of the directory, applying the conditional logic, and construct your filtered collection in a simple Script Task. Then, just pass the complete collection to a ForEach Loop using a ForEach Variable Enumerator. Yes, this means you’re essentially enumerating two collections (first the directory content, then a list filtered list of files/filenames for the actual ForEach Loop), but in my tests it was still much faster than Option 1. By using the DirectoryInfo.EnumerateFiles method, instead of DirectoryInfo.GetFiles, we can improve the performance even further, as this method asynchronously enumerates the directory, allowing simultaneous testing of a condition before assigning to the output. Because we can cut down the list of files returned while streaming, it offers improved performance over GetFiles, which will enumerate the entire directory first, before we can then convert each filename to an integer and check our condition.
If you’re not confident enough to try and develop your own SSIS ForEach Enumerator, or you simply don’t have the time, this should provide a viable workaround.
As with most things in SSIS, you’ve got plenty of options available to workaround this issue, and it’s likely that what works for one person may not for another, such are the plethora of hardware and software configurations out there that affect these things. I recommend trying a couple of different approaches, should you come across this situation (or one similar) to see what works best for you. But, after playing about with the different options above, I’ve now got a working SSIS package, which compares integer-based filenames from a huge directory, and does so extremely quickly.
As I mentioned above, I’ll write a follow up post detailing my experience of creating a custom ForEach File Enumerator, but in the meantime please let me know in the comments below if you’ve tried anything similar, and how you approached it.