In Search of An In-Memory Database
As part of the podcast I did a few weeks ago (and that was released to the wild this last Monday), I talked about doing unit tests with an in-memory database. I have a few cycles available and I thought that I would poke around and see what was available. As I was poking, I was making some mental notes about what I wanted to do with this functionality. Since my mind is not a reliable storage unit, I wanted to write them down. So here I am.
Unit Testing
My interest is in unit testing of things related to the database. The typical application that I work on contains the following elements in its stack:
- UI: User Interface and Presentation logic that invokes methods in the
- SL: Service layer that contains application-specific logic that, in turn, invokes methods in the
- DMO: Domain Model Objects containing the business logic that, in turn, invoke methods in the
- DML: Data Mapping Layer that mediates between the domain model objects and the
- DBMS: The Database which persists data and retrieves it upon request
If I am building things in the spirit of unit testing, I want to be able to run a series of unit tests against the service layer and the domain model objects with some form of mocked-up or stubbed data mapping layer or database. Specifically, I want to get the maximum amount of testing intelligence from the software under test with the least amount of overhead. Accessing the database, at least for most of the test scenarios, is unnecessary and adds a lot of execution overhead, both to execute the database commands and to perform the setup/teardown logic against the database to ensure a valid, isolated test. If the interface between the DMO and the DML is fairly clean, I could create a mocks or test stubs and convince the DMO layer that it was talking to the actual database without any of the overhead of actually involving the network or the disk.
There are a couple of problems with this approach. First, depending upon the exact nature of the DML, I might have to write a fair amount of code to simulate the DML. I am not opposed to writing code but I am not too excited about repeatedly re-writing the simulation to match an evolving database schema. Second, if I want to define my tests in terms of the resulting state of the data on the database, I end up writing even more logic to simulate the database. This introduces even more brittle code into the mix. Third, if the DML is home-grown or a third-party library, I have no way to test that code without involving the disk-based database.
One of my goals is to be able to run my entire suite of tests after any significant change is made to the code base. At worst, it would take no more than a handful of minutes to run the whole set. Ideally, it would take no more time than it would take to execute a hydraulic exchange (that is, fluid out, and more fluid in) or a quick look at email/RSS aggregator. That is, code a method/test, get the pair to work with each other, and then run the entire test suite to ensure that you did not cause problems elsewhere in the code base. Repeat every 10 or 15 minutes.
A Technological Approach
It seems to me that an in-memory database provides a lot of "bang for the buck". Since the data never leaves the memory space, it should take much less time than going to a disk-based database. Since it is a more-or-less full-up simulation of its disk-based cousin, it should require much less effort to create an experience that replicates what the disk-based database provides. Finally, it gives me an option to extend "rapid-fire" testing closer to an end-to-end exercise of the entire software stack. I probably would not use it for everything, but the faster that it ran, the more likely I would be use it.
Requirements and Other Dreams
Here are some of the features that I think that I need and some features that I do not need:
- I need a complete simulation of all of the functional behavior of the disk-based database for all basic SQL processing. This includes:
- Column behaviors including errors if I try to insert a null into a non-null column, or a value outside of a check range.
- Unique primary keys.
- Identity keys.
- Default values.
- Transactions.
- Foreign key referential integrity.
- All of the common join processing.
- Aggregate functions.
- Supporting pretty much anything else that an ORM or ORM-like DML would generate.
- I need a way to create the in-memory database schema from the actual schema. I do not (repeat do not) want to have to maintain two versions of the database schema and keep them in sync with each other. No, I do not want to do that at all!
- I would like to have a way to load data into the in-memory database from the real database. I am specifically thinking of various read-only lookup tables that are involved with referential integrity linkages. I would be happy with a initial test setup that created a base in-memory database from which the databases for the individual tests were cloned and fleshed out. If that setup pulled data once per test suite execution to create the base in-memory database, that would be acceptable. Or perhaps we could create a base in-memory database, serialize it to disk, and retrieve it at the beginning of the test suite execution.
- I need a way to store test-specific data into the in-memory database. We could make requests through the DML but I am thinking that the capability to run SQL scripts directly would be a cleaner choice.
- I need a way to retrieve post-test data to determine if the software under test did what I expected it to do. Again, the ability to run SQL commands directly would seem a cleaner choice.
- I do not care about non-unique indices.
- I do not care about the durability of the data in the in-memory database. I expect that we would create an instance just before each test and destroy/reset it right after the test. Sorry, test data, but you are so expendable.
- I do not care very much about the ability to force non-functional error conditions such as timeouts or "disk full". I would much prefer to intervene between the DMO and the DML; I think that it would be much simpler.
What Happens Next?
I have found two possible candidates: SQLite and SharpHSQL. I am going to dig into each of them to see what there is to see.
Jon Stonecash is a technology consultant and has been designing, developing, and testing various kinds of software for such a long time that he has had the opportunity to make most of the serious software development mistakes at least once. His long term interests center about databases and the aspects of the application that handle data access and business logic. He is also interested in the tools that assist the development process, particularly code generation.