Integration testing SSIS ETL packages

Integration testing SSIS ETL packages can be quite a challenge! The reason is that the package is often written against several large source databases holding gigabytes (or even terabytes) of data. The only way people have of testing the SSIS package is to run it, which takes several hours, so the feedback loop is very slow. A failed run can also leave a company without an environment for a day (or more) whilst the issue is fixed.

How can we write an automated test around our SSIS package to ensure that any changes will work before we run the package against our large production databases? The answer is a full end to end integration test that spins up a source database, inserts some data into it, spins up a destination database, runs the SSIS package and then asserts that the data is now in the destination database. Once we have this in place we can test every aspect of an SSIS package to make sure it is functioning correctly. The build can then be run on a CI server so we have the same level of confidence in our SSIS package as we do in our production code!

A short rant before I begin explaining how the build works… It continues to amaze me how many companies do not have their database in source control, so the only version of the database is the one sitting on the live instance (and often different versions of it scattered around test environments). When you sit down and think about it, that is a crazy situation. It is not that tough to put your database into source control and write a CI build around it, and doing so gives you so many benefits that it is surprising to me it is so often neglected as an afterthought.

Rant over, let's get on to how to create an integration test around an SSIS package. I have created a proof of concept that you are free to clone and use as you see fit: see SSISIntegrationTesting on GitHub. I have tried to make the readme on the repository as comprehensive as possible, so if that is sufficient feel free to dive over there and check out the code.

The code uses SQL Server LocalDB to create throwaway databases for testing. This is a really useful trick and one that I've used in my SqlJuxt F# database comparison project. The great thing about LocalDB is that it is available on any box where the SQL Server management tools are installed, so you do not even need the full version of SQL Server to get up and running with it. That makes it easy to get the build working on your build server.

The other trick to making your SSIS package testable is parameterising your connection strings. To do this open the package in Visual Studio, right click on each connection in the connection managers pane at the bottom and select "Parameterize". This adds a parameter that can be passed in to override the connection string, while defaulting to the existing connection string you have set up.

If we open up SSISIntegrationTesting.sln from the repository you will see the package1.dtsx SSIS package. This is a very simple package that copies all of the data from the products table in the source database to the products table in the destination database. Obviously in reality your SSIS job will be much more complex than this, but by solving testing for this simple base case we can build up from here.

I am a big fan of writing tests using XBehave, which lets you give each step a meaningful Given, When, Then description. On top of this I like to use builder classes to set up the data, and descriptive methods for asserting on it. In my view a test should be readable: you should be able to walk up to it and see exactly what it is doing. Too many tests have reams and reams of code, leaving you to spend quite a while working out what is going on.

From here on I think the best way to finish this article is to walk through the integration test in the project step by step. I will paste the code a section at a time and add a description below it, although as previously mentioned the code is largely self-describing. All of the code for the test is in the PackageScenarios.cs file in the SSISTests project inside SSISIntegrationTesting.sln in the GitHub repository.

"Given a source database with one row in the products table"
._(() =>
{
    sourceDatabase = _testServer.CreateNew();
    var connection = Database.OpenConnection(sourceDatabase.ConnectionString);
    connection.Products.Insert(ProductCode: 1, ShippingWeight: 2f, ShippingLength: 3f,
        ShippingWidth: 4f, ShippingHeight: 5f, UnitCost: 6f, PerOrder: 2);
});

The first step in the test sets up an empty source database and runs in our schema, which is stored in the database.sql file. Note that in a real project the schema should come from your database CI build. It then uses Simple.Data to insert a product into the products table. Simple.Data is an excellent lightweight ORM that makes it easier to write queries against our database; in this example it takes advantage of the C# dynamic type to create the insert statement for us.

"And an empty destination database with a products table"
._(() => destDatabase = _testServer.CreateNew());

Next we create another database, this time to use as our destination. Again we run in our schema contained in the database.sql file.

"When I execute the migration package against the source and dest databases"
._(() => result = PackageRunner.Run("Package1.dtsx", new
{
    Source_ConnectionString = sourceDatabase.ConnectionString.ToSsisCompatibleConnectionString(),
    Dest_ConnectionString = destDatabase.ConnectionString.ToSsisCompatibleConnectionString(),
}));

Now comes the action of testing the SSIS package itself. Notice that we are passing in the connection strings of our source and destination test databases. These override the connection strings in the package so our two test databases are used.

"Then the package should execute successfully"
._(() => result.Should().BeTrue());

I have built the package runner to return a bool indicating whether or not the package succeeded, which is sufficient for this proof of concept; if you wanted to extend it to return any specific errors that came back then you could do so. Here we just assert that the package ran successfully.
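As a reference point, here is a minimal sketch of what such a runner can look like using the SSIS runtime API (Microsoft.SqlServer.Dts.Runtime). The real PackageRunner in the repository takes an anonymous object rather than a dictionary, and the parameter-setting details here are an assumption, so treat this as a sketch and check the repository for the actual implementation:

```csharp
using System.Collections.Generic;
using Microsoft.SqlServer.Dts.Runtime;

public static class SketchPackageRunner
{
    // Loads a .dtsx package, overrides the parameterised connection
    // strings and returns true if execution succeeded.
    public static bool Run(string packagePath, IDictionary<string, string> parameters)
    {
        var app = new Application();
        Package package = app.LoadPackage(packagePath, null);

        foreach (var kvp in parameters)
        {
            // Assumes the connection strings were exposed as package
            // parameters via the "Parameterize" option in Visual Studio.
            package.Parameters[kvp.Key].Value = kvp.Value;
        }

        return package.Execute() == DTSExecResult.Success;
    }
}
```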

"And the products table in the destination database should contain the row from the source database"
._(() => destDatabase.AssertTable().Products.ContainsExactlyOneRowMatching(
    new {
        ProductCode = 1,
        ShippingWeight = 2f,
        ShippingLength = 3f,
        ShippingWidth = 4f,
        ShippingHeight = 5f,
        UnitCost = 6f,
        PerOrder = 2
    }));

Lastly we assert that the data is in fact in the destination database. This line of code looks quite magical so let me explain how it works. The AssertTable() extension method returns a dynamic, which means after the "." we can put anything we want; in this case we put "Products" as we want the products table. The dynamic object's TryGetMember override grabs the string "Products" and passes it along to the next call, ContainsExactlyOneRowMatching. Under the covers that method takes the anonymous C# object you pass in and uses Simple.Data to construct a sql query that is run against the database, selecting a single row from the products table with a where clause built from all of the fields of the anonymous object. This makes the assertion very efficient, and I think the syntax is very neat: it allows you to quickly and readably assert data in your database.
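The TryGetMember mechanics are easy to miss if you have not used System.Dynamic before, so here is a stripped-down sketch of the trick. The class and member names here are illustrative, and the real implementation in the repository hands off to Simple.Data rather than the stub shown:

```csharp
using System.Dynamic;

// The dynamic entry point: destDatabase.AssertTable().Products resolves
// the member name "Products" at runtime via TryGetMember.
public class TableAssert : DynamicObject
{
    private readonly string _connectionString;
    public TableAssert(string connectionString) { _connectionString = connectionString; }

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        // binder.Name is whatever the caller wrote after the "."
        result = new TableAssertions(_connectionString, binder.Name);
        return true;
    }
}

public class TableAssertions
{
    private readonly string _connectionString;
    private readonly string _tableName;

    public TableAssertions(string connectionString, string tableName)
    {
        _connectionString = connectionString;
        _tableName = tableName;
    }

    public void ContainsExactlyOneRowMatching(object expected)
    {
        // In the real code: hand _tableName and the anonymous object to
        // Simple.Data, which turns the properties into a WHERE clause,
        // then assert that exactly one row comes back.
    }
}
```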

Note all of the databases created by the test are destroyed in the Dispose method of the test. Go into the code if you want to see exactly how this happens.

There we have it, a full repeatable end to end SSIS integration test. I believe we have the building blocks here to create tests for more complex SSIS packages. I hope this helps you construct your own builds; feel free to use any of the code, or get in touch via the comments if you want any help or advice, or have any suggestions as to how I can make this proof of concept even better.

SqlJuxt – Defining indexes on a table

I have just finished the first implementation of table indexes: both scripting out an index on a table using the fluent builder syntax and comparing indexes across tables. When writing this feature I had some interesting design decisions to make…

My first design for the type to represent index is shown below:

type Constraint = {name: string; columns: (Column * SortDirection) list; clustering: Clustering}
type Index = {name: string; columns: (Column * SortDirection) list; clustering: Clustering; uniqueness: Uniqueness}

type Table = {schema: string; name: string; columns: Column list; primaryKey: Constraint option; indexes: Index list}

As you can see the primary key on the table is defined as a Constraint option and the indexes are defined as an Index list. When I started writing the code to use these types I noticed a lot of duplication. Then I realised that an index and a primary key are really both constraints, just with slightly different properties, the main one being that a primary key is always unique; that's what makes it a key!

I decided to extend the Constraint type by adding the uniqueness property to it. Then it was a simple job of extending the primary key methods to always set the uniqueness to unique. Now the type for a table looks like:

type Table = {schema: string; name: string; columns: Column list; primaryKey: Constraint option; indexes: Constraint list}

So a table has a list of indexes, which could of course be empty, and it may or may not have a primary key, which we represent with a Constraint option. The other advantage of modelling both primary keys and indexes with the Constraint type is that when loading up the schema for comparison we can select them out of the database all at the same time. We simply extend the query to bring back whether the constraint is unique and whether it is a primary key, so we know whether to put it on the primary key property of the table.

I did toy with the idea of having all constraints in a single list on the table called constraints. That list would have contained the primary key, if there was one, and all of the indexes for the table. I decided against that approach as it feels a bit clunky to have to go through a list to find the primary key of a table. The other reason I didn't like it is that where possible you should use the type system, and thus the compiler, to enforce correctness in your program and make illegal state unrepresentable. If I had a list of constraints on a table I would have to manually check that there was only one primary key in the list, whereas if the primary key is defined on the table as a Constraint option then there can only ever be one.

If you want to check out the full source code and delve deeper feel free to check out the SqlJuxt GitHub repository.

XBehave – compiling the test model using the Razor Engine

In the last post I left off describing how I implemented the parsing of the XUnit console runner xml in the XUnit.Reporter console app. In this post I want to talk through how you can take advantage of the excellent RazorEngine to render the model to html.

I am going to talk through the console app line by line. The first line of the app:

var reporterArgs = Args.Parse<ReporterArgs>(args);

Here we are using the excellent PowerArgs library to parse the command line arguments out into a model. I love how the API for PowerArgs has been designed. It has a slew of features which I won't go into here; for example, it supports prompting for missing arguments out of the box.
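PowerArgs maps command line switches onto a plain class via attributes. The post does not show the model, but it presumably looks something along these lines (the property names other than Html, which is used later in the app, are an assumption):

```csharp
using PowerArgs;

// Model that PowerArgs populates from the command line arguments.
public class ReporterArgs
{
    [ArgRequired, ArgDescription("Path to the xml file produced by the XUnit console runner")]
    public string Xml { get; set; }

    [ArgRequired, ArgDescription("Path to write the rendered html report to")]
    public string Html { get; set; }

    [ArgDescription("Optional title for the generated page")]
    public string PageTitle { get; set; }
}
```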

Engine.Razor.Compile(AssemblyResource.InThisAssembly("TestView.cshtml").GetText(), "testTemplate", typeof(TestPageModel));

This line uses the RazorEngine to compile the view containing my html, it gives the view the key name “testTemplate” in the RazorEngine’s internal cache. What is neat about this is that we can deploy the TestView.cshtml as an embedded resource so it becomes part of our assembly. We can then use the AssemblyResource class to grab the text from the embedded resource to pass to the razor engine.

var model = TestPageModelBuilder.Create()

We then create the TestPageModel using the TestPageModelBuilder. Using a builder here gives you very readable code. Inside the builder we use the XmlParser from earlier to parse the xml and generate the list of TestAssemblyModels. We also take the optional PageTitle argument here to place in the page title in our view template.

var output = new StringWriter();
Engine.Razor.Run("testTemplate", output, typeof(TestPageModel), model);
File.WriteAllText(reporterArgs.Html, output.ToString());

The last 3 lines create a StringWriter, which the engine uses to write the compiled template to. Calling Engine.Razor.Run runs the template we compiled earlier using the key we set, "testTemplate". After this line fires our html will have been written to the StringWriter, so all we have to do is extract it and write it out to the html file path that was passed in.

That’s all there is to it. We now have a neat way to extract the Given, When, Then gherkin syntax from our XBehave tests and export it to html in whatever shape we choose. From there you could post it to an internal wiki or email the file to someone; that could all be done automatically as part of a CI build.

If anyone has any feedback on any of the code then that is always welcomed. Please check out the XUnit.Reporter repository for all of the source code.

XBehave – Exporting your Given, Then, When to html

For a project in my day job we have been using the excellent XBehave for our integration tests. I love XBehave in that it lets you use a Given, When, Then syntax over the top of XUnit. There are a couple of issues with XBehave that I have yet to find a neat solution for (unless I am missing something, in which case please tell me). The issues are:

1) There is not a nice way to extract the Given, When, Then gherkin syntax out of the C# assembly for reporting
2) The runner treats each step in the scenario as a separate test

To solve these two problems I am writing an open source parser that takes the xml produced by the XUnit console runner and parses it into a C# model. I can then use razor to render these models as html and spit out the resultant html.

This will mean that I can take the html and post it up to a wiki, so every time a new build runs the tests it can update the wiki with the latest set of tests that are in the code and even say whether they pass or fail. This allows a business/product owner to review the tests, see which pieces of functionality are covered and which features have been completed.

To this end I have created the XUnit.Reporter github repository. This article will cover the parsing of the xml into a C# model.

A neat class that I am using inside XUnit.Reporter is the AssemblyResource class, which allows easy access to embedded assembly resources. This means that I can run the XUnit console runner for a test, take the resultant output and add it to the test assembly as an embedded resource. I can then load the text from the xml file back with the following line of code:

AssemblyResource.InAssembly(typeof(ParserScenarios).Assembly, "singlepassingscenario.xml").GetText()

To produce the test xml files I simply set up a console app, added XBehave and created a test in the state I wanted, for example a single scenario that passes. I then ran the XUnit console runner with the -xml flag set to produce the xml output, copied that output to a test file and named it accordingly.

The statistics for the assembly model and test collection model are not aligned to what I think you would want from an XBehave test. For example if you have this single XBehave test:

public class MyTest
{
    [Scenario]
    public void MyScenario()
    {
        "Given something"
            ._(() => { });

        "When something"
            ._(() => { });

        "Then something should be true"
            ._(() => { });

        "And then another thing"
            ._(() => { });
    }
}
Then the resultant xml produced by the console runner is:

<?xml version="1.0" encoding="utf-8"?>
<assembly name="C:\projects\CommandScratchpad\CommandScratchpad\bin\Debug\CommandScratchpad.EXE" environment="64-bit .NET 4.0.30319.42000 [collection-per-class, parallel (2 threads)]" test-framework="" run-date="2017-01-27" run-time="17:16:00" config-file="C:\projects\CommandScratchpad\CommandScratchpad\bin\Debug\CommandScratchpad.exe.config" total="4" passed="4" failed="0" skipped="0" time="0.161" errors="0">
<errors />
<collection total="4" passed="4" failed="0" skipped="0" name="Test collection for RandomNamespace.MyTest" time="0.010">
<test name="RandomNamespace.MyTest.MyScenario() [01] Given something" type="RandomNamespace.MyTest" method="MyScenario" time="0.0023842" result="Pass" />
<test name="RandomNamespace.MyTest.MyScenario() [02] When something" type="RandomNamespace.MyTest" method="MyScenario" time="0.0000648" result="Pass" />
<test name="RandomNamespace.MyTest.MyScenario() [03] Then something should be true" type="RandomNamespace.MyTest" method="MyScenario" time="0.0000365" result="Pass" />
<test name="RandomNamespace.MyTest.MyScenario() [04] And then another thing" type="RandomNamespace.MyTest" method="MyScenario" time="0.000032" result="Pass" />
</collection>
</assembly>

If you look carefully at the xml you will notice a number of things which are counter-intuitive. Firstly, look at the total in the assembly element: it says 4, when we only had a single test. This is because the runner considers each step to be a separate test. The same goes for the other totals and the totals in the collection element. The next thing you will notice is that the step names in the original test have had a load of junk added to the front of them.

In the parser I produce a model with the results that I would expect. So for the above xml I produce an assembly model with a total of 1 test, 1 passed, 0 failed, 0 skipped and 0 errors, which I think makes much more sense for XBehave tests.
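The gist of the correction is to group the per-step test elements back into scenarios by their class and method before counting. This is a simplified sketch of the idea, not the repository's exact code:

```csharp
using System.Linq;
using System.Xml.Linq;

public static class ScenarioCounter
{
    // Collapses xunit's per-step totals back into per-scenario totals:
    // each unique type + method pair is one XBehave scenario, and a
    // scenario only passes if every one of its steps passed.
    public static (int Total, int Passed, int Failed) Count(XDocument doc)
    {
        var scenarios = doc.Descendants("test")
            .GroupBy(t => (string)t.Attribute("type") + "." + (string)t.Attribute("method"));

        int total = scenarios.Count();
        int passed = scenarios.Count(g => g.All(t => (string)t.Attribute("result") == "Pass"));
        return (total, passed, total - passed);
    }
}
```

Run against the sample xml above, all four steps share the same type and method, so they collapse into a single passing scenario.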

Feel free to clone the repository and look through the parsing code (warning: it is a bit ugly). Next time I will be talking through the remainder of this app, which renders the test results to html using razor.

SqlJuxt – Using partial function application to improve the API

After I got all of the tests passing for creating and comparing primary keys on a table I looked back at the code and decided that it had a bit of a smell to it.  Consider the following code snippet:

let private withPrimaryKey columns table isClustered =
    let cs = getColumnsByNames columns table
    {table with primaryKey = Some {name = sprintf "PK_%s" table.name; columns = cs; isClustered = isClustered}}

let WithClusteredPrimaryKey columns table =
    withPrimaryKey columns table true

let WithNonClusteredPrimaryKey columns table =
    withPrimaryKey columns table false

The isClustered parameter is of type bool, which in itself does not feel quite right. You can see what I mean when you look at the implementation of WithClusteredPrimaryKey or WithNonClusteredPrimaryKey. The line reads “withPrimaryKey columns table false”. It is obvious what all of those parameters are except the bool at the end: what does false mean?

Clearly this would be much better if it were self-describing, which we can easily do in F# using a discriminated union, by defining one like so:
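The post does not show the definition at this point, but from the CLUSTERED and NONCLUSTERED values used below, and the clustering field on the Constraint type, it will be something like:

```fsharp
// Self-describing replacement for the bool isClustered flag
type Clustering = CLUSTERED | NONCLUSTERED
```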


The other part causing the code to smell, though perhaps less obviously, was the order of the parameters. As described in the excellent post on partial function application on the F# for fun and profit site, the order of your parameters is very important. In the builder functions shown above, table is always placed last; this means you do not need to mention it when using the script builder API, for example:

let rightTable = CreateTable "DifferentKeyTable"
                        |> WithInt "Column1"
                        |> WithInt "Column2" 
                        |> WithInt "Column3" 
                        |> WithClusteredPrimaryKey [("Column1", ASC); ("Column2", ASC)]
                        |> Build 

Notice how the table parameter does not need to be explicitly passed around. You would have to pass this if it was not the last parameter.

So we know that the order is important. If we look at the withPrimaryKey functions we can see that the isClustered parameter being last stops us from using partial application, so when we define the two methods to create a non clustered and a clustered primary key we have to explicitly pass all of the parameters.

Here is the redesigned code of the WithPrimaryKey methods on the TableBuilder API:

let WithPrimaryKey clustering columns table =
    let cs = getColumnsByNames columns table
    {table with primaryKey = Some {name = sprintf "PK_%s" table.name; columns = cs; clustering = clustering}}

let WithClusteredPrimaryKey = WithPrimaryKey CLUSTERED
let WithNonClusteredPrimaryKey = WithPrimaryKey NONCLUSTERED

Notice how much cleaner these three methods are now. By moving the clustering parameter to the start and using a discriminated union instead of a bool we have achieved two things. Firstly, the code is now self documenting as we are not passing true or false but CLUSTERED or NONCLUSTERED. Secondly, because the clustering parameter is now first we can use the magic of partial function application to define WithClusteredPrimaryKey and WithNonClusteredPrimaryKey. Those two methods simply bake in whether the key is clustered or not and leave you to fill in the rest of the parameters.

I really love how F# allows you to write beautiful code like this. I’m still learning to write functional code and am really enjoying the experience. Any feedback or comments are welcome, so please keep them coming.

If you want to check out the full source code and delve deeper feel free to check out the SqlJuxt GitHub repository.

SqlJuxt – Building primary keys on a table

There were a few interesting design decisions I had to make when designing the code to script a primary key on a table. Before I dive into them I think it is good to see the finished code, here is a test that uses the TableBuilder to create a clustered primary key on a table:

let ``should be able to build a table with a clustered primary key on multiple columns``() =
    CreateTable "RandomTableName"
        |> WithInt "MyKeyColumn"
        |> WithInt "SecondKeyColumn"
        |> WithVarchar "ThirdCol" 50
        |> WithVarchar "ForthCol" 10
        |> WithClusteredPrimaryKey [("MyKeyColumn", ASC); ("SecondKeyColumn", DESC); ("ThirdCol", DESC)]
        |> Build
        |> should equal @"CREATE TABLE [dbo].[RandomTableName]( [MyKeyColumn] [int] NOT NULL, [SecondKeyColumn] [int] NOT NULL, [ThirdCol] [varchar](50) NOT NULL, [ForthCol] [varchar](10) NOT NULL )

ALTER TABLE [dbo].[RandomTableName] ADD CONSTRAINT [PK_RandomTableName] PRIMARY KEY CLUSTERED ([MyKeyColumn] ASC, [SecondKeyColumn] DESC, [ThirdCol] DESC)"

What I like about this code is that just from reading it, it is obvious what it will do: create a table script with 4 columns and a clustered primary key on 3 of those columns.

I decided to take the minimal number of arguments needed to define the primary key: the list of column names and the sort order of each column. Originally I thought about allowing the user to specify the primary key name, but actually that just adds noise to the code. A name can be generated using the convention “PK_” + tableName. Why make the user think about a name for the primary key when all we really care about is that there is a primary key and that it has a unique name?

I love the way that in F# you can use discriminated unions to represent states, making the code easy to read and work with. In the above example I could have taken many approaches to specify the column sort order, such as using a bool to say whether the column is ascending. However, with that approach calls to the code would have read “columnName true” or “columnName false”. This feels horrible, as just from reading the code you do not know what the true or false means. By defining a discriminated union of ASC/DESC you can immediately tell what the parameter is and what it is doing.
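The union itself is a one-liner. Its definition is not shown in the post, but from how it is used in the test above it is presumably just:

```fsharp
// Self-describing sort order for primary key and index columns
type SortDirection = ASC | DESC
```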

The primary key is defined as a constraint with the following type:

type Constraint = {name: string; columns: (Column * SortDirection) list; isClustered: bool}

The Table type has then been extended to add a Constraint as a primary key using the option type, as a table may or may not have a primary key. It is nice that we can use Option to represent this rather than having to rely on null like we would in an imperative language.

The hardest part is taking the (string * SortDirection) list that the WithClusteredPrimaryKey function takes and turning it into a (Column * SortDirection) list. This is done using the following function:

let private getColumnsByNames (columnNames: (string * SortDirection) list) table =
    columnNames |> List.map(fun (c,d) ->
        let column = table.columns |> List.tryFind(fun col -> match col with
                                                              | IntColumn i when i.name = c -> true
                                                              | VarColumn v when v.name = c -> true
                                                              | _ -> false)
        match column with
        | Some col -> (col, d)
        | None -> failwithf "no column named %s exists on table %s" c table.name)

It is great in F# how we can let the types guide us. If you look at the signature of the function above then we can see that it is:

getColumnsByNames : (string * SortDirection) list -> Table -> (Column * SortDirection) list

When you look at the types you can see that there are not too many ways this function could be implemented. Using “what is in the room”, as Erik Meijer would say, we go through the columns on the table and match them up with the column names that were passed in (throwing an exception if a name is passed in that is not on the table), then return the actual column along with its sort direction.

F# is proving to be an interesting choice in writing the database comparison library. It is a totally different way of thinking but I feel that I’m starting to come to terms with thinking functionally.

If you want to check out the full source code and delve deeper feel free to check out the SqlJuxt GitHub repository.

SqlJuxt – Using disposable in F# to drop a test database

The integration tests in SqlJuxt need to do the following:

  • Create two databases
  • Set the databases up in a certain state (create tables, views etc)
  • Compare them
  • Assert the result of the comparison
  • Clean up the databases at the end (drop them)

Before I dive into how I went about making this work I think it’s good to take a look at how the finished test looks:

let ``should return identical when two tables are the same``() =
    use left = createDatabase()
    use right = createDatabase()
    let table = CreateTable "TestTable"
                    |> WithNullableInt "Column1"
                    |> Build 

    runScript left table
    runScript right table
    loadSchema left.ConnectionString
            |> compareWith right.ConnectionString
            |> should equal IsMatch

I think that test reads really well. In fact you can tell exactly what is going on in the test from the code, which is one of the key ingredients of a good test. The test creates two databases, then creates the same table on both of them. It then compares them using the SqlJuxt library and expects the databases to match. The result of the comparison is a discriminated union, which I will talk about more in an upcoming post. For now it is enough that it reads “should equal IsMatch”, which is really clear.

The astute reader will notice that nowhere in the test does it mention dropping the databases so you might wonder how this is done. The secret behind this is IDisposable. I got the idea from reading this page on let, use and do on the F# for fun and profit site. Scott talks about using the dispose to stop a timer so I thought it would be neat to use it to drop the database.

To make this work the createDatabase function returns a class that implements IDisposable. Notice that the variables left and right are declared using the ‘use’ keyword and not the ‘let’ keyword. This means that when the variables go out of scope Dispose is automatically called.

This is how the disposable database type is defined:

type DisposableDatabase(name, connectionString) =

    member this.ConnectionString = connectionString
    member this.Name = name

    interface System.IDisposable with 
        member this.Dispose() = 
            dropDatabase this.Name

The code is a little bit clunky as we have to declare a class to implement IDisposable but the cool part is that we do not have to explicitly drop the database in our test it just happens for us.

If you want to check out the full source code and delve deeper feel free to check out the SqlJuxt GitHub repository.