Testing with large datasets - can't see the wood for the trees?

Learning my own lessons

In this post, I will examine some techniques for taming large datasets, e.g. when you have more than 500K rows in a database table or view and you are applying a change that may affect those rows in a variety of ways. As with all good lessons, this one was prompted by my own recent experience of failing to follow my own advice, so I thought I'd write it down to atone for my sins and to help my failing memory!

Background

A business change is requested.  In this case, it's to turn off some futures close-out functionality, which may have far-reaching consequences for the derived P&L.  Indeed, one of the new database procedures delivered as part of the change will update the realised and unrealised P&L for a subset of these trades. Like a good tester, I read and digest the business analysis document and begin to derive my test cases.

I deduce that there are a number of different trade and system states that need to be considered and tested:

  • Trades that have expired and trades that have yet to expire
  • Trades that have been partially or fully 'closed out' against other trades and trades that have not
  • Trades that are in base currency (GBP) or non-base currency (which are being processed differently)
  • Running the process through the simulated cut-over day and for subsequent processing days

I develop a matrix of tests and a set of test data that caters for all of the test cases I have identified.
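
Incidentally, a quick sanity check on the test data itself is to group it by the state columns and confirm that every permutation in the matrix is represented. The sketch below uses a hypothetical [TEST_TRADES] table and illustrative column names; substitute your own.

-- Hypothetical coverage check: one row per combination of trade states,
-- so any missing permutation shows up as an absent combination.
SELECT EXPIRY_STATE,    -- expired vs. yet to expire
       CLOSEOUT_STATE,  -- fully / partially / not closed out
       CURRENCY,        -- base (GBP) vs. non-base
       COUNT(*) AS TRADE_COUNT
FROM [TEST_TRADES]
GROUP BY EXPIRY_STATE, CLOSEOUT_STATE, CURRENCY
ORDER BY EXPIRY_STATE, CLOSEOUT_STATE, CURRENCY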

I then install the new code and configuration, upload my test data, and run the necessary test cases one by one.

I examine the results for each test and satisfy myself that, in each scenario / test case, the behaviour is as expected.

I also ensure that all the data I expect to be unaffected by the change remains the same.

The Power of EXCEPT

Here, I'd like to share a handy technique that I like to use for checking that two datasets are equivalent. Generally, rows in database tables are unique (based on a set of key columns).  This fact can be used together with SQL EXCEPT (introduced in SQL Server 2005) to compare large datasets.

First, back up the data as generated by the process under test before the change (it'll be handy to have a backup of the control set later to further analyse the differences).

SELECT * INTO [BACKUP_TABLE] FROM [SOURCE_TABLE]

Now run the process that will overwrite your data.

Then find the rows that are in one dataset but not the other (both ways around):

SELECT * FROM [SOURCE_TABLE] EXCEPT SELECT * FROM [BACKUP_TABLE]
SELECT * FROM [BACKUP_TABLE] EXCEPT SELECT * FROM [SOURCE_TABLE]

If both of these queries return no rows (and your data has a unique key), you know that the datasets are identical.  Of course, it's likely that you are expecting a number of differences.  That's no problem: you can SELECT out the difference rows, or even store them in another backup table and run the comparisons again to ensure the only differences are those you expected.

-- Rows where a field that should have been left alone has in fact changed
SELECT FBT.*, FST.* FROM [FILTERED_BACKUP_TABLE] FBT INNER JOIN [FILTERED_SOURCE_TABLE] FST
ON FBT.KEY_COLUMN = FST.KEY_COLUMN -- you'll need to include all the key columns
AND FBT.FIELD_THAT_SHOULD_BE_SAME <> FST.FIELD_THAT_SHOULD_BE_SAME
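
Another way to handle the expected differences (a sketch, reusing the placeholder names above plus a hypothetical [EXPECTED_CHANGES] list of keys) is to capture every difference row into its own table, remove the ones you expected, and confirm that nothing is left over:

-- Capture all rows that differ between the two datasets, in either direction
SELECT * INTO [DIFF_TABLE] FROM
(
    (SELECT * FROM [SOURCE_TABLE] EXCEPT SELECT * FROM [BACKUP_TABLE])
    UNION ALL
    (SELECT * FROM [BACKUP_TABLE] EXCEPT SELECT * FROM [SOURCE_TABLE])
) AS DIFFS

-- Remove the differences you expected, e.g. the trades the new procedure was meant to update
-- (the WHERE clause is illustrative only)
DELETE FROM [DIFF_TABLE]
WHERE KEY_COLUMN IN (SELECT KEY_COLUMN FROM [EXPECTED_CHANGES])

-- Anything still left is an unexpected difference
SELECT * FROM [DIFF_TABLE]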

So what went wrong?

Being human, I made a dumb mistake.  I looked at the result of one of my test cases and marked as a 'pass' a test that was, in retrospect, an obvious 'fail'.  In spite of my blushes, I'll reveal that the row in question showed the realised and unrealised values of the trade both equal to zero (a logical impossibility).

Which got me thinking (at last)

I had the testing taxonomy all wrong.  The problem I was facing was complex, and I had instinctively embraced that complexity and focused all my attention on the detailed individual test cases.  I was staring, enchanted, at single trees and ignoring the wood.

Given that I was testing a procedure that updates realised and unrealised P&L values, rather than thinking only of all the trade permutations, why had I not gone for the obvious first step?

Make sure that, for all rows in the dataset after the update:

realised_value + unrealised_value = total_value
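
That check translates directly into SQL.  Here is a minimal sketch, assuming a hypothetical [TRADE_PNL] table and illustrative column names; any rows returned are wrong, and the row count immediately shows the scale of the problem.

-- Rows that break the invariant; the expectation is that none come back
-- (add IS NULL checks if the columns are nullable)
SELECT *
FROM [TRADE_PNL]
WHERE REALISED_VALUE + UNREALISED_VALUE <> TOTAL_VALUE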

Only after I'm satisfied that the overall size and shape of the wood is correct should I start considering the nuances of individual trees.  I would have saved myself a lot of time by focusing here before diving in.

Reverse engineering test cases

It was remarkable how quickly, faced with a bug report, my mind focused on determining the size and scope of the problem, including running the check above.  If only, when I'd been devising the original tests, I'd posed the question in reverse:

"Imagine that the code has been deployed untested and the user reports that the data is incorrect - how would you set about determining the exact nature of the problem?"

The test cases naturally fall out in the order they should be executed when you consider the problem in this way.

It's only a subtle distinction, but it sits alongside "Imagine all the ways in which this process could fail" in the tester's armoury of methods for generating test cases.