In-Memory Database: Waypoint #2
Previously, on In-Memory Database, I talked about building something new that is derived from the SharpHSQL and SQLite code bases. I have made some progress.
Architecture
I have made a couple of architectural decisions:
First, The database must support a base plus difference structure. What I mean by is is that I want to be able to create a base in-memory database with tables and views, wired together with referential integrity links, and populated with data. Once created, this base should be immutable. One best practice for unit testing is that each unit test reveals/confirms the quality of a single feature. A unit test with 20 Assert statements cries out for refactoring. To avoid this messiness, what I have typically done is to create a helper function that sets up the initial environment of the test. Each of the tests invokes this helper function and then applies a single test assertion against this environment. A variation on this is to use the helper function to do a partial setup, followed by the completion of the setup and the assertion in the test method. When the setup is simple and quick, this approach is quite acceptable.
Setting up an in-memory database to support a set of unit tests that need database support is probably not simple or quick. If we start out with a "blank slate", we would have to create the schema and then populate the data. Even doing this "in memory" will take some time. I know that you should wait until a performance problem rears its ugly head before thinking about optimization, but I do not always do what I know that I should do. As a minimum, I would want to setup the database once for each test run, store it away in a shared/static variable, and re-use it from there. Another approach would be to re-build the database as a file during each build process from the schema scripts. A further step in the build process might be to populate the non-volatile data (such as lookup tables). The tests could load the database into memory and read into the shared/static variable.
Each unit test would have to have its own overlay of this base data where all of the changes triggered by the unit test would occur. This overlay layer would implement the same interface as the base layer. The overlay layer would have to be able to store changes and integrate these changes with the data in the base layer. The unit test would apply its assertions against the overlay layer. Once the test was complete, the overlay layer would be discarded, leaving the base layer untouched to support the other tests. I have not worked out the full details of this but it is most important that all of this worked with multiple tests running, each on its own thread.
Second, it is important to make this whole package extensible. I have made a decision that the first cut of the package will focus on SQL Server 2005. I have also decided that I am going to focus on the CRUD operations. That means that there will be some functions that, at least for now, will not be supported. I do not want to restrict the in-memory database forever. That means that I am going to have to give some thought to how I might be able to extend the functionality to meet whatever needs come up. This is clearly going to affect how parsing is done and how commands are executed.
Coding
I have re-built the parsing logic. I have built a separate set of classes to handle the tokenization of the input stream. These classes are specific to the T-SQL dialect and recognize all of the peculiarities of the SQL language. These classes produce a list of tokens. Each token knows what it is (that is, reserved keyword, punctuation, text, number, and so on). I have also built a separate set of parsing classes that slice and dice the tokens to recognize commands. That is, there is a separate class to handle each of the commands. I thought about going out to get some generalized parsing mechanism but there seemed to be a couple of issues that worked against doing that. First, my experience is that SQL is not a neat language when it comes to parsing; that would suggest that I would have do a lot of "work around" effort to get the generalized parser to work on SQL. Second, I wanted the parsing mechanism to be extensible; I wanted to be able to plug new things into the package as the needs came up. Learning the details of the parsing process seems to work against that goal. With all of this in mind, I went with a home-grown solution. I have tried to separate the parsing mechanism from the rest of the logic to allow for a different parsing solution to be plugged in. I will blog on what I did in a later post.
I am just starting the specifying the syntax of each command. I have taken the approach of creating separate parsing classes for each command. I go down the list of tokens. When I find a token that corresponds to a reserved command, I instantiate an the corresponding command parsing class. That class sucks in as much of the token string as it needs, typically leaving the token list positioned to the next command token. When the parsing class is done, it can produce an instance of the command class that will do the real work. I thought briefly of combining the parsing and action parts of the command in the same class. I am not doing that for a couple of reasons. First, I may want to handle different dialects of SQL in the future. Second, I want to be able to construct the action parts of the commands without ever parsing a single character. Third, it just seemed wrong to do that.
Observations
This exercise has been interesting on a number of different levels. The design aspects are obvious. What has been more interesting, at least to date, is having to learn more about SQL syntax. I have already learned two or three things that I did not know about.
Jon Stonecash is a technology consultant and has been designing, developing, and testing various kinds of software for such a long time that he has had the opportunity to make most of the serious software development mistakes at least once. His long term interests center about databases and the aspects of the application that handle data access and business logic. He is also interested in the tools that assist the development process, particularly code generation.