Monday, September 3, 2012

Iterative and test driven methodology for writing complicated SQL queries

Iterative development

A lot of software developers  are familiar with writing code iteratively. Basically, you break down the problems into modules, and then break those modules down into finer modules till you get to the point where each module is small enough. This allows the developers to take "bite-sized" pieces of the problems, and concentrate on them one by one. This makes the code easy to develop, maintain and test. Anopther advantage of this approach is that each module can be tested independently, and you build test cases for each module as you go along.

Now, this approach works naturally with procedural languages. When you break the problem into discrete steps, you can divide the problems into modules easily. The problem is when developers who have been trained to work in procedural languages like Java are asked to write SQL. SQL is a declarative language. You tell the database what tables you want the data from, and what the relationships between the tables are. You cannot think procedurally when you are building SQL. The issue is that a lot of procedural developers forget to develop iteratively when they step outside of their comfort zone. They try to build the entire SQL statement in one shot, and stop thinking about how to test the SQL. I am presenting the approach that I take to develop complicated SQL queries iteratively.

Thinking Procedurally vs Thinking declaratively

Let's say you have a table that stores employees, and another table that stores their addresses. You want to get all the addresses of a given employee. Someone who thinks procedurally will break down the problem into 2 parts:- a) first get the employee id the given employee b) next get the address records that match the employee id

A declarative thinker, OTH, will not think of the problem as a series of steps. S/he will think of entities and the relationship between the entities. She will think "An Employee has many Addresses. This is modeled by putting the employee id in the adress table. So, I need to JOIN the  Employee table with the address table to get the addresses of all employees. If I add a where clause on employee name, we get the address of the given employee" The declarative thinker is not breaking the problem in a series of steps. The declarative thinker is thinking of relationships, and how the whole thing connects together

Challenges with developing iteratively in declarative languages

Iterative development comes naturally with procedural languages. Since, part of analyzing the problem procedureally is breaking down into bite sized pieces, developing those small pieces independently is natural.  

However, when you are developing declaritively, and working on a problem that requires data from 7 different entities, and all these entities with 13 relationships between them, it's very difficult to keep everything in your head. Even if you do, it's easy to make a mistake. 

Methodology

So, how do we develop a SQL iteratively? By brekaing it down into smaller pieces! How do you break down a query that needs 7 entities and 13 relationships? By starting with a query that needs 2 entities and 1 relationship, and then growing the query.

This is the approach that I use

  1. Write down the columns that you need in your output
  2. Write down the columns that you need in your input(or rather the column that you want to filter by)
  3. Against each output column, write down the table that the data will come from
  4. Against each input column, write down the table that contains that column. Generally, the input tables and output tables have some tables in common. However, there might be cases where the set of input tables and output tables are completely distinct
  5. Pick one output table. If you have multiple outputtables, pick the one that will generally have the least amount of data, or pick the one that has most number of columns that you are going to filter by. Remember that the database is goign to perform better if it has to operate on a smaller amount of rows. So, start with the table that returns the least amount of rows first
  6. Write the  select query with just that output table. Don't worry about all the other output columns yet. Just write a select query that returns the columns from that table, and filters on the input columns. This is your seed query. You are going to build the final query on top of this
  7. Test this query out. This very important. Do not jump ahead to the other tables unless you test this out. In your database, add test data that should be returned by the query. Alos, add negative test data; data that shouldn't be returned by your query. For example if you are writing a query that gets an employee by rank, add 2 employees:- one that matches the given rank, and one that doesn't. At this point, it might seem overkill to add test data for a trivial query. However, remember that this query is your seed query. It's going to grow into the final query. Along with the seed query, you need seed test data. Any test data that you define here will be useful as your query gets more complicated
  8. If you have performance concerns, run explain plan at this point.. At this point, you should be pretty confident about what you have written. It should do one thign and it should do it well. It may not do all that you need, but it should do it well
  9. Pick the next  logically related table. If none of the tables are logically related, find the table that is closest, and the intermediate tables that will help you get to that table
  10. Grow your query to include the next logically related table. Add Joins between the tables in the query and the next set of logically related tables. Add columns to the select clause; add where/group by clauses
  11. Now, your query has grown a bit. Time to test this grown query. You already had test data for the table in your seed query. Don;t lose that data. Add data to new tables that relates to the records in the old table. Remember to add negative test data in new tables. Also, remember that you had added negative test data in the old table. Make sure you add records to new tables that relates to the negative records in old tables. Testing with this test data will ensure that you haven't forgotten a clause that will cause your query to pull data that you don't need.
  12. Run Explain Plan again if required.
  13. Repeat steps 9-11 for all tables till all tables are done
At the end of this process, you will have a kick-ass set of test data. And you have the satisfaction of knowing that your query not only works, it works in cases where the data doesn't make sense. If you are writing Java unit tests, you have ready made test data that you can use in your unit tests.

Please note:- It's very tempting to test your query by modifying the same set of test data over and over again to test different cases. Let's say you have a OR condition in your where clause:- A lot of developers insert a record that satisfies one side of the OR condition. Once, they are satisfied that the query works in this cases, they modify that record so that it satisfies the other side of the OR condition. Don't do this!! Once you have built test data, don;t lose it!. Instead add a new test record. This ensures that the test case that you tested will be tested again in the next iteration of your query.