This post deals with system-level integration tests, where we test many components of the system in a deployed environment. We test the system as a user would, using a GUI, a web service or other interfaces. These tests should be portable to other environments so that we can use them as regression tests during the application's life cycle.
Cheating the data pain

For almost any integration test, data is something we have to consider. Our integration tests commonly depend on some amount of data being set up prior to the test. It might be data that your code uses, valid parameters it needs, or data it produces. Selecting and managing this data is often hard and has been a frequent pain point for projects I have been part of.
So why is test data painful? Often the models our software is built on are complex, so understanding them requires hard work. It might be easy enough to understand one test case well enough to make it work, but it is a completely different thing to gain the general understanding needed to create dozens of test cases. Another painful attribute is portability. You might own and know the development environment pretty well, and you may have some "dummy data" set up, but what if you are testing in the UAT environment? Customers will have access, and as we all know - they won't handle it gently...
So. Things are hard and painful. What happens? There are a few options - pick one...
- We skip it. Integration tests take too much time, are too expensive and have no value.
- We skip it. We have unit tests.
- We kind of skip it. We create tests only in our local environment - that will have to do!
- We think we don't skip it, but we really do. We create smaller smoke tests in the environments outside our control.
- We do it. We test all the environments, since we want our bases covered. We know that stuff happens, that any environment is a new one, and that if we don't find the bugs - customers will.
Enduring the data pain

Since I have cheated the data pain many times, I wanted to explore how I could bring some order to this mess. That's what we developers do: we organize messy things into stuff that at least we can understand.
I think there are ways to cheat that actually don't impact the quality of your tests.
So, let's get to it. Basically, you have four approaches for any data in your tests.
1. Hardcode

This is the "quick and dirty" approach. Here we assume that the same value will be available no matter what. Even if this may be true, it tends to be a quite naive approach. Moving from one environment to another, data will change. But this approach is acceptable in certain cases:
- When you are just trying to get something together for the first time
- When you are creating throw-away tests (Why would you? Even the simplest test adds value to your regression test suite!)
- When data really IS that stable (countries, languages, etc.)
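As a minimal illustration, here is a Python sketch of the hardcoded approach, using the stdlib sqlite3 module with an in-memory database; the Country table and its contents are invented for this example:

```python
import sqlite3

# Toy in-memory database standing in for a deployed environment
# (table and data are invented for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country (Code TEXT, Name TEXT)")
conn.execute("INSERT INTO Country VALUES ('SE', 'Sweden')")

# Approach 1: the test hardcodes 'SE' and assumes it exists everywhere.
HARDCODED_COUNTRY = "SE"
row = conn.execute(
    "SELECT Name FROM Country WHERE Code = ?", (HARDCODED_COUNTRY,)
).fetchone()
assert row is not None  # breaks in any environment missing 'SE'
```

The assert is exactly where such a test falls over when it moves to an environment that lacks the assumed value.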
2. Find any

This approach is a bit more ambitious, but still requires low effort. Let's assume that you need to use a Country for some reason. Your environment is not set up for every single country in the world, nor are countries static - approach 1 is out of the question. For a database scenario, we create a simple "SELECT TOP 1 FROM xxx" query to retrieve the value to use. We don't care what country we get, as long as it's valid. Only selecting the columns you need is a sound approach for many reasons; one is improved resilience against schema changes.
Note: My examples assume that your data can only be retrieved from a database but, depending on the system, you might be able to collect data via web services, REST services, etc.
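A sketch of "find any" in Python, again with the stdlib sqlite3 module; the Country schema is an assumption made up for this example, and note that SQLite spells SQL Server's TOP 1 as LIMIT 1:

```python
import sqlite3

# Illustrative schema; the Country table and its columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country (Id INTEGER, Code TEXT, Name TEXT)")
conn.executemany("INSERT INTO Country VALUES (?, ?, ?)",
                 [(1, "SE", "Sweden"), (2, "NO", "Norway")])

def find_any_country(conn):
    """Approach 2: fetch whatever valid country the environment happens
    to have. Select only the columns the test needs, which also keeps
    the query resilient to schema changes."""
    row = conn.execute("SELECT Code, Name FROM Country LIMIT 1").fetchone()
    assert row is not None, "environment has no countries at all"
    return row

code, name = find_any_country(conn)
```

The test does not care whether it gets Sweden or Norway - any valid row will do, which is precisely what makes the approach portable.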
3. Find with predicate

Here's the more ambitious cousin of option 2. This time we make the same "SELECT TOP 1..." query, but we add some WHERE clauses, since which exact entity we get matters. In the simplest scenario we might just want to make sure that the entity we use has not been "soft-deleted". Another example (sticking to the country scenario) would be that we want a country that has defined states. Again, only query against columns that you use. When these predicates become very advanced and start to grow hair, consider this:
- Will the predicate always produce a match, is the data stable enough? In all environments?
- Should you consider creating a matching entity instead, using option 4?
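Staying with the country scenario, here is a sketch of "find with predicate" in Python with sqlite3; the Deleted flag and the State table are invented for this example:

```python
import sqlite3

# Illustrative schema with a soft-delete flag and a State relation;
# all names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Country (Code TEXT, Name TEXT, Deleted INTEGER);
    CREATE TABLE State (CountryCode TEXT, Name TEXT);
    INSERT INTO Country VALUES ('SE', 'Sweden', 0);
    INSERT INTO Country VALUES ('US', 'United States', 0);
    INSERT INTO Country VALUES ('XX', 'Atlantis', 1);
    INSERT INTO State VALUES ('US', 'Texas');
""")

def find_country_with_states(conn):
    """Approach 3: the same single-row query, narrowed with WHERE
    clauses that encode the test's real preconditions: not
    soft-deleted, and at least one defined state."""
    row = conn.execute("""
        SELECT c.Code, c.Name FROM Country c
        WHERE c.Deleted = 0
          AND EXISTS (SELECT 1 FROM State s WHERE s.CountryCode = c.Code)
        LIMIT 1""").fetchone()
    assert row is not None, "no country satisfies the test's preconditions"
    return row

code, name = find_country_with_states(conn)  # only 'US' qualifies here
```

Sweden has no states and Atlantis is soft-deleted, so the predicate filters both out - which is also a reminder that the predicate only works if some environment row actually matches it.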
4. Create your own

This is the hardcore solution. If you want it done right, do it yourself! Our SELECTs now become INSERTs and we create the entity that we need. This requires the deepest level of model knowledge, since you need to know every column and table relation in order to make a valid insert.
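A sketch of "create your own", including the teardown that keeps the footprint small (Python with sqlite3; the schema and the collision-avoiding naming convention are assumptions for this example):

```python
import sqlite3
import uuid

# Illustrative schema; in a real system the insert would have to satisfy
# every constraint and foreign key, which is where the effort goes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country (Code TEXT PRIMARY KEY, Name TEXT)")

def create_test_country(conn):
    """Approach 4: insert exactly the entity the test needs. Use a
    value unlikely to collide with real data in a shared environment,
    and return it so the footprint can be cleaned up afterwards."""
    code = "Z" + uuid.uuid4().hex[:4].upper()
    conn.execute("INSERT INTO Country (Code, Name) VALUES (?, ?)",
                 (code, "Testland"))
    return code

code = create_test_country(conn)
try:
    pass  # ... run the actual integration test against `code` ...
finally:
    # Tear down: delete everything the test created.
    conn.execute("DELETE FROM Country WHERE Code = ?", (code,))
```

The try/finally cleanup is what makes this bearable in shared environments - but as the drawbacks below show, both the insert and the delete grow with the complexity of the model.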
So, if this is so great - why not use it everywhere and take the power back!? Well, there are a couple of reasons why such an approach has problems.
- Vulnerable tests
When you stick a large number of INSERT statements in your test setup, you are depending heavily on a stable database schema. Any new non-NULL column, renamed column or removed column will shred your test to pieces. And it will probably not fail in a pretty way, but in a time-consuming way that will ultimately make people question your ambitious effort.
- Non-portable tests
I am targeting system-level integration tests that use the entire system - or at least as much of it as possible. Inserting data assumes that no duplicate data already exists, which is no problem in your empty developer sandbox database. However, I am guessing that empty databases are not that common in your deployed environments... Moving your test suite closer to the production environment will therefore be impossible. There's just no way that those environments will be empty.
- Time-consuming tests
Simply put, this approach takes too long: figuring out all the references, understanding every piece of the database model even when many pieces are irrelevant to what you are testing. Time can be spent more wisely.
- Large footprint
Many inserts mean a large footprint, and cleaning it up is a large part of the data pain.
Selecting the right approach
*A model for categorizing test data and selecting a data approach*
For some time I have advocated an approach where all data is created before the test and removed after - a strict "create your own" way. This is not only stupid, but it scares co-workers away from testing. Considering the other options, and seeing data from a focal/supportive and dynamic/stable perspective, enables me to make a fresh decision for each situation instead of trying to fit every integration test into the same mold. It gives me the capability to put the effort where it is needed and the slack where it is acceptable.
In the end, I just want higher quality tests and more of them. This might be one piece of the puzzle.