Development

Extending the ForEach File Enumerator in SSIS

Extending the ForEach File Enumerator in SSIS

One of the primary reasons for using SSIS is to process multiple files, ETL (Extract, Transform, Load) data, and feed it into a destination, for example, a data warehouse. Built-in components such as the For Loop Container and the ForEach Loop Container make it very simple to enumerate a collection and perform some processing tasks for each entry, be it an array, a dataset, or a directory containing files. In fact, the ForEach File Enumerator option of the ForEach Loop Container allows you to even specify a mask to filter the list of files in a directory prior to processing, so you can weed out unwanted files from the loop. But what if a simple string match isn’t powerful enough to perform the filtering you need?

It’s exactly this situation in which I found myself recently. I had a directory containing several hundred thousand files, all with names based on a number (i.e. 10000.txt, 10001.txt, 10002.txt) and I wanted to only process files whose name is above a certain numerical threshold. Essentially I was looking to evaluate an expression like:

if ((int)filename.Replace(".txt","") > threshold_number)
{
  /* DO STUFF */
}

At this point, I figured I had 3 options:

  1. Enumerate all the files and then perform some custom logic to further filter the collection
  2. Create/find a custom ForEach File enumerator which could filter the files as required
  3. Inspect the directory prior to the loop, and pass it a list of pre-filtered filenames over which to iterate

I posed the question on StackOverflow as to whether it was possible to handle this via the existing components, but despite some excellent, in-depth feedback, the community all suggested the first route, which was essentially what I tried myself. It’s at this point that I began thinking about trying the other two approaches. All three are detailed below.

Option 1 - The Brute Force approach

Screenshot of a simple SSIS ForEach File loop with Script Task to further filter collection

Simple SSIS ForEach File loop with Script Task to further filter collection

First up, I decided to try the brute force approach to the problem, by simply enumerating the entire directory and then using a Script Task within my ForEach File Loop to check the conditional logic and decide whether to continue processing the file, or skip to the next iteration of the loop. Simple, no?

One major problem. When this runs over a huge collection of files (into the several hundreds of thousands), it runs like a dog (extremely slow),at least on my dev and production environments. Why does it take so long? Well, if you think about it, the ForEach File Enumerator must enumerate every single file in the directory which matches the file mask and retain this in memory as it iterates through the collection. This means we’re actually creating a FileInfo object for each of the files in the directory, but only actually using a a handful of those which match our conditional logic in the Script Task.

This seems like a hugely inefficient use of resources to me, which wouldn’t normally be a concern as it’s usually fast enough, but in situations where you’re processing large numbers of files, there must be a better approach.

Option 2 - The Custom approach

I had a dig around on the web, as I figured that something like this would be achievable if I could use a Regular Expression on the file mask. I came across Joost’s SSIS blog, where he has created a custom ForEach File enumerator which accepts a RegEx instead of a file mask expression. This is an absolutely brilliant component, and really adds some flexibility to the standard expressions available in SSIS. However, with RegEx, it’s tricky to perform an equal to/greater than comparison, as it’s intended to match characters, not values. In saying that, it is possible to play about with the positioning of each number but you’re going to end up with a hugely complex RegEx that you’re going to have to edit depending on the threshold number, which is a maintainability nightmare.
Joost van Rossum has a tonne of excellent (and FREE!) custom SSIS components on his site, so I’d definitely recommend checking it out.

Fortunately, SSIS makes it fairly easy to modify the standard ForEach Enumerator.

So, I had a look into the ForEach Loop Container in SSIS and investigated creating my own one. After all, if you think about the file mask expression (exposed on the FileSpec property), it must under everything, evaluate to a boolean condition: does the file name match the mask? I figured that I could create my own ForEach File Enumerator, which could perform a more complex conditional comparison and only enumerate those files whose names match the condition.

Fortunately, SSIS makes it fairly easy to modify the standard ForEach Enumerator. All you really need to do is override the existing GetEnumerator method, returning a type that implements IEnumerator. Of course, if you’re creating a custom ForEach Enumerator, you don’t really want a hard-coded conditional expression, so you’ll probably need to take this a bit further and code up some properties and link this to a custom UI, to allow users to specify their own conditions at design time.

using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

namespace PicnicerrorDotNet.SSIS.Enumerators
{
    [DtsForEachEnumerator(
        DisplayName = "For Each File Integer Expression Enumerator",
        Description="Custom for each file enumerator that enumerates files as the result of an integer comparison against the filename. Only for files which have an integer as a filename.",
        ForEachEnumeratorContact="https://picnicerror.net",
        UITypeName = "PicnicerrorDotNet.SSIS.Enumerators.ForEachFileIntegerExpressionEnumeratorUI,PicnicerrorDotNet.SSIS.Enumerators.FileIntegerExpressionUI,Version=1.0.0.0,Culture=Neutral,PublicKeyToken=7fef2b07d0ea7898")]
    public class ForEachFileIntegerExpressionEnumerator : ForEachEnumerator
    {
        #region "Properties"
        /// <summary>
        /// The directory which contains the files
        /// </summary>
        public string SourceDirectory
        {
            get;
            set;
        }

        /// <summary>
        /// File mask for filtering the directory contents I.E. "*.txt"
        /// </summary>
        public string FileSpec
        {
            get;
            set;
        }

        /// <summary>
        /// Flag indicating whether to perform the integer comparison or not
        /// </summary>
        public bool UseComparison
        {
            get;
            set;
        }

        /// <summary>
        /// Definition of the available conditional operators for the comparison
        /// </summary>
        public enum ConditionalOperatorType
        {
            GreaterThan,
            GreaterThanOrEqualTo,
            Equals,
            LessThanOrEqualTo,
            LessThan
        }

        /// <summary>
        /// The conditional operator to use for the comparison
        /// </summary>
        public ConditionalOperatorType ConditionalOperator
        {
            get;
            set;
        }

        /// <summary>
        /// The value to be compared against the filename integer
        /// </summary>
        public int ComparisonValue
        {
            get;
            set;
        }

        /// <summary>
        /// Flag indicating whether to remove files which appear in the exclusion list
        /// </summary>
        public bool UseExclusionList
        {
            get;
            set;
        }

        /// <summary>
        /// List of integers (i.e. filenames) which should be excluded from the final list
        /// </summary>
        public string ExclusionListName
        {
            get;
            set;
        }

        /// <summary>
        /// Which portion of the file name to retrieve
        /// </summary>
        public enum FilePathType
        {
            NameAndExtension,
            FullyQualified,
            NameOnly
        }

        /// <summary>
        /// Portion of the filename to be retrieved
        /// </summary>
        public FilePathType FilePath
        {
            get;
            set;
        }
        #endregion

        /// <summary>
        /// Override the GetEnumerator method to implement your own custom enumeration logic
        /// </summary>
        /// <param name="connections"></param>
        /// <param name="variableDispenser"></param>
        /// <param name="events"></param>
        /// <param name="log"></param>
        /// <returns></returns>
        public override object GetEnumerator(Connections connections, Microsoft.SqlServer.Dts.Runtime.VariableDispenser variableDispenser, IDTSInfoEvents events, IDTSLogging log)
        {
            //Get the directory, use getenumerator and return list of strings based on expression
            List<string> activeFiles = new List<string>();
            List<int> numbersToExclude = new List<int>();

            if (!String.IsNullOrEmpty(ExclusionListName))
            {
                //Retrieves the exclusion list variable
                Variables varList = null;
                variableDispenser.LockOneForRead(ExclusionListName, ref varList);
                numbersToExclude = (List<int>)varList[0].Value;
                varList.Unlock();
            }

            System.Threading.Tasks.Task listTask = System.Threading.Tasks.Task.Factory.StartNew(() =>
            {
                DirectoryInfo dir = new DirectoryInfo(SourceDirectory);
                foreach (FileInfo f in dir.EnumerateFiles(FileSpec))
                {
                    FileInfo file = f;
                    
                    //Grab each relevant level of the file path
                    string fileExtension = file.Extension;
                    string filePath = file.FullName;
                    string fileName = filePath.Substring(filePath.LastIndexOf("\\") + 1);
                    string fileNameOnly = fileName.Substring(0, fileName.IndexOf(fileExtension));
                    int valueToCompare = Convert.ToInt32(fileNameOnly);

                    //Check how to retrieve the filename
                    string outputFileName = filePath;
                    switch (FilePath)
                    {
                        case FilePathType.FullyQualified:
                            outputFileName = filePath;
                            break;
                        case FilePathType.NameAndExtension:
                            outputFileName = fileName;
                            break;
                        case FilePathType.NameOnly:
                            outputFileName = fileNameOnly;
                            break;
                    }

                    //Check whether we need to use the exclusion list, to further filter the results
                    if (UseExclusionList)
                    {
                        if (!numbersToExclude.Contains(valueToCompare))
                        {
                            if (UseComparison)
                            {
                                if (Match(valueToCompare))
                                    activeFiles.Add(outputFileName);
                            }
                            else
                                activeFiles.Add(outputFileName);
                        }
                    }
                    else
                    {
                        if (UseComparison)
                        {
                            if (Match(valueToCompare))
                                activeFiles.Add(outputFileName);
                        }
                        else
                            activeFiles.Add(outputFileName);
                    }
                }
            });

            System.Threading.Tasks.Task.WaitAll(new System.Threading.Tasks.Task[] { listTask });

            return activeFiles.GetEnumerator();
        }

        /// <summary>
        /// Performs the comparison operation depending on chosen operator
        /// </summary>
        /// <param name="valueToCompare"></param>
        /// <returns></returns>
        private bool Match(int valueToCompare)
        {
            switch (ConditionalOperator)
            {
                case ConditionalOperatorType.GreaterThan:
                    if (valueToCompare > ComparisonValue)
                        return true;
                    break;
                case ConditionalOperatorType.GreaterThanOrEqualTo:
                    if (valueToCompare >= ComparisonValue)
                        return true;
                    break;
                case ConditionalOperatorType.Equals:
                    if (valueToCompare == ComparisonValue)
                        return true;
                    break;
                case ConditionalOperatorType.LessThanOrEqualTo:
                    if (valueToCompare <= ComparisonValue)
                        return true;
                    break;
                case ConditionalOperatorType.LessThan:
                    if (valueToCompare < ComparisonValue)
                        return true;
                    break;
                default:
                    return true;
            }

            return false;
        }
    }
}

This post’s already getting a bit long to start a walkthrough of how I created this custom ForEach Enumerator, so I’ll write a follow up post detailing my first experience of custom SSIS component development (if you’re a Windows Forms developer, you’ll find it to be a piece of cake). In the meantime, you can download my component itself Github. It’s pretty basic and should definitely be treated as beta software (let me know of any bugs via Twitter), but it will:

  • Compare an integer-based filename against a user-specified value.
  • Allow choice of comparison (>, >=, ==, <=, <).
  • Allow specification of an Object variable containing a list of integers which should be excluded from the enumerator, even if they match the condition (could be used to exclude problematic files, or ones which have already been processed).

It doesn’t currently:

  • Allow comparison against an expression
  • Traverse sub-folders
  • Handle other types for comparison

Again, I’ll follow up with another post, detailing my creation process for this component and how it works. One thing I did find is that MSDN is surprisingly sparse on examples of custom SSIS component creation, so I had to dig about on the web for some help. Again, there’s not that much out there, but I did find some excellent advice on Dougbert’s blog that helped me avoid any major pitfalls. Hopefully, by sharing my experiences, I can help add a little more information to help the next person who fancies giving it a try.

Option 3 - The Alternate route

Screenshot of a SSIS package enumerating the collection first, then loop over object variable

Alternatively, enumerate the file collection in a Script Task, applying your conditional expression as you go, then loop over an Object variable collection

Alternatively, if you don’t fancy creating a custom SSIS component, you can perform your own enumeration of the directory, applying the conditional logic, and construct your filtered collection in a simple Script Task. Then, just pass the complete collection to a ForEach Loop using a ForEach Variable Enumerator. Yes, this means you’re essentially enumerating two collections (first the directory content, then a list filtered list of files/filenames for the actual ForEach Loop), but in my tests it was still much faster than Option 1. By using the DirectoryInfo.EnumerateFiles method, instead of DirectoryInfo.GetFiles, we can improve the performance even further, as this method asynchronously enumerates the directory, allowing simultaneous testing of a condition before assigning to the output. Because we can cut down the list of files returned while streaming, it offers improved performance over GetFiles, which will enumerate the entire directory first, before we can then convert each filename to an integer and check our condition.

If you’re not confident enough to try and develop your own SSIS ForEach Enumerator, or you simply don’t have the time, this should provide a viable workaround.

Conclusion

As with most things in SSIS, you’ve got plenty of options available to workaround this issue, and it’s likely that what works for one person may not for another, such are the plethora of hardware and software configurations out there that affect these things. I recommend trying a couple of different approaches, should you come across this situation (or one similar) to see what works best for you. But, after playing about with the different options above, I’ve now got a working SSIS package, which compares integer-based filenames from a huge directory, and does so extremely quickly.

As I mentioned above, I’ll write a follow up post detailing my experience of creating a custom ForEach File Enumerator, but in the meantime please let me know in the comments below if you’ve tried anything similar, and how you approached it.

comments powered by Disqus