Friday, August 24, 2012

TDD vs. Non-TDD Experiment

The Experiment

I was fortunate enough today to take part in a very small (but extremely interesting) experiment.  It was run by, and performed with, developers who have been using TDD for some time, and it attempted to compare code quality, speed of development, and quality of the solution across three different "teams":

  • Pair TDD (2 developers)
  • Pair Non-TDD (2 developers)
  • Single Non-TDD (1 developer)
One additional person took part in the experiment as the facilitator (Product Owner).  The facilitator knew the entire exercise beforehand and was familiar with various solutions to it.

The exercise was a customized version of the FizzBuzz kata, which included the base kata as the first few stories, followed by some twists designed to cause difficulties for typical solutions to the kata.  The exercise was broken up into 12 separate stories (task cards, each describing a new piece of desired functionality).

The stories were provided one-by-one, so that a team only ever knew the story it was currently working on and had no advance knowledge of any upcoming stories (and thus could not pre-design for future business requirements as if they were already known).  For example, if Team A was working on story 1, Team B was working on story 2, and Team C was working on story 1, only Team B knew what story 2 was.  Neither Team A nor Team C learned what story 2 contained until they finished story 1.

The teams worked at their own pace, so some teams were several stories ahead of others at any given time.  The time at which each team completed each story was recorded, and at the end of the experiment these times were graphed to help visualize the relative progress and speed of the different teams.

Code coverage and code quality were not directly measured during the experiment, but each team maintained and committed to a local git repository, capturing snapshots of their code at the completion of each of their stories.

(5/10/2013) Unfortunately, the git repos used by the teams could not be recovered, but the final source code generated by the teams was.  The source code for all of the teams is available on GitHub (TDD-vs-NonTDD-Experiment-20120824).

Story Cards

The story cards for this highly customized version of FizzBuzz were created by DJ Daugherty, who also served as the facilitator for the experiment.  (A rough sketch of the replacement logic appears after the list.)

  1. Generate a list of numbers between 1 and 100 inclusive 
  2. For each number in the list, if the number is a multiple of 3, replace it with the word "Fizz" 
  3. For each number in the list, if the number is a multiple of 5 (but not a multiple of 3), replace it with the word "Buzz" 
  4. For each number in the list, if the number contains a 3 (but has not already been replaced by "Fizz" or "Buzz"), replace it with the word "Fizz" 
  5. For each number in the list, if the number contains a 5 (but has not already been replaced by "Fizz" or "Buzz"), replace it with the word "Buzz" 
  6. For each number in the list, if the number is a multiple of both 3 and 5, instead of replacing it with the word "Fizz", replace it with the word "FizzBuzz" 
  7. Change the generation of the list of numbers to be 0 to 100 inclusive. If the number is 0, replace it with the word "Zero" 
  8. Print the list (with all substitutions) forward, backward, and then forward again 
  9. Repeat 8, but follow it with another set of forward, backward, and forward, except using -100 to 0 as the set of numbers 
  10. Given the following custom ranges of numbers (note that this can either be printed to console or covered in a unit test), process them through the FizzBuzz replacement logic: 
    • 2, 5, 7, 9, 12, 15 
    • -3, -2, 0, 7, 9, 11 
  11. Given a custom range of numbers (again, this can be covered in just a unit test or printed to console), process it through the FizzBuzz replacement logic: 
    • -33.33 
  12. Given a string that contains FizzBuzz replacements for the range of values from 4 to 63 inclusive, "undo" the FizzBuzz logic to generate a candidate list of numbers. All occurrences of "Fizz" should be replaced by all the numbers in the range that would have been "Fizz", all occurrences of "Buzz" should be replaced by all the numbers in the range that would have been "Buzz", and all occurrences of "FizzBuzz" should be replaced by all the numbers in the range that would have been "FizzBuzz". For example: 
    • 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz => 1, 2, [3, 6, 9], 4, [5, 10], [3, 6, 9], 7, 8, [3, 6, 9], [5, 10]
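For readers who want to see the rules in one place, here is a minimal sketch of the replacement logic (stories 1-7, 10, and 11) and the story-12 "undo" step.  This is my own illustration in Python, not code taken from any of the teams' repositories; the function names and the precedence I chose between the "multiple of" rules and the "contains a digit" rules are assumptions based on the order in which the stories were introduced, and the printing stories (8 and 9) are deliberately skipped.

```python
def fizzbuzz(value):
    """Apply the replacement rules from stories 2-7 to a single number.

    The precedence (Zero, FizzBuzz, Fizz, Buzz, then the digit rules) is an
    assumption based on the order in which the stories were introduced.
    """
    if value == 0:
        return "Zero"                        # story 7
    if value % 3 == 0 and value % 5 == 0:
        return "FizzBuzz"                    # story 6
    if value % 3 == 0:
        return "Fizz"                        # story 2
    if value % 5 == 0:
        return "Buzz"                        # story 3
    if "3" in str(value):
        return "Fizz"                        # story 4
    if "5" in str(value):
        return "Buzz"                        # story 5
    return value                             # no substitution applies


def fizzbuzz_list(numbers):
    """Stories 1, 7, 10, and 11: process an arbitrary range of numbers."""
    return [fizzbuzz(n) for n in numbers]


def undo_fizzbuzz(tokens, numbers=range(4, 64)):
    """Story 12: replace each Fizz/Buzz/FizzBuzz token with the list of all
    numbers in the given range that would have produced that word."""
    expansions = {}
    for n in numbers:
        word = fizzbuzz(n)
        if word in ("Fizz", "Buzz", "FizzBuzz"):
            expansions.setdefault(word, []).append(n)
    return [expansions.get(token, token) for token in tokens]


if __name__ == "__main__":
    print(fizzbuzz_list(range(0, 101)))               # stories 1 and 7
    print(fizzbuzz_list([2, 5, 7, 9, 12, 15]))        # story 10
    print(fizzbuzz_list([-33.33]))                    # story 11
    print(undo_fizzbuzz([1, 2, "Fizz", 4, "Buzz"]))   # story 12 (range 4-63)
```

Note how, in this sketch, story 12 simply reuses the forward logic to build its expansion table; keeping the replacement logic well factored through the earlier stories is exactly what makes that kind of reuse cheap when the later twists arrive.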

Story Completion Times: Comparison Across Teams

During the experiment, the facilitator kept track of when each team started and completed each of the stories.  The following image is a photo of the graph produced from that data during the exercise.  The Y-axis is the story number.  The X-axis is the time of day.  The large "Break" bar in the middle of the day is when we stopped for lunch.  Not all teams completed all of the stories, which is why some of the lines end early.

Conclusions

For completeness, the following are the conclusions that were originally drawn from this experiment as part of the exercise:
  • Pair TDD is not necessarily the fastest to start – more initial overhead. However, the initial overhead quickly begins to pay off as additional complexity is added. 
  • One might expect Single Non-TDD to be the fastest, but that quickly proved not to be the case. 
  • Pairing (even without TDD) showed benefits early on, as the Single Non-TDD team had nobody to bounce ideas off of. 
  • This problem was not very complex. Even so, TDD showed some benefits fairly quickly, allowing the Pair TDD team to surpass the other teams by the end of the experiment. We believe this was largely because their test coverage made refactoring easier, which the later stories required. 
  • The communication process involved in pairing can allow for more elegant and robust solutions. 
  • The developers accustomed to TDD who were placed on the Non-TDD teams felt "helpless" and sometimes found themselves with no efficient way of finding the underlying problem that was plaguing them. 
  • The gap between the TDD and Non-TDD teams would likely have widened further if regression testing had been required for sign-off.

Criticisms

While I believe that such an experiment is valuable and that the data we gathered is interesting, I do not feel it is as useful, telling, or convincing as it could have been.

The outcome was pretty much as we expected it to be:  the Pair TDD team was a little slower off the starting blocks (due to the extra overhead of setting up test cases for the relatively trivial stories early on), but ended up finishing all the tasks of the exercise before the other two teams.  Their code quality was significantly higher, as was their test coverage.

The biggest issue that I have with the outcome of the experiment (which I brought up during the retrospective, but which I felt was largely viewed as unimportant by the others) is that the code quality of the Pair Non-TDD and Single Non-TDD teams was so abysmal as to be almost unrealistic.  I cannot remember a time (before I started doing TDD) when code of that quality would have been allowed anywhere near a production system.  Developers who attempted to commit such code would have been flogged and put on display for public humiliation.  (Okay, not really treated that poorly, but they would have received a good talking to and probably some coaching to ensure that their code quality improved significantly.)

Moreover, the only team that had any test coverage at all was the Pair TDD team, which is also entirely unrealistic.  Both of the Non-TDD teams should have produced at least some "test after" unit tests, which would also have helped their velocity by making it easier to refactor their code to accommodate new functionality.
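To be concrete about what I mean by "test after" coverage: even a handful of tests written after the fact, along the lines of the hypothetical sketch below (written in Python against the fizzbuzz() illustration from earlier in this post, not against any team's actual code), would have given the Non-TDD teams a safety net for the refactoring that the later stories demanded.

```python
import unittest

# Hypothetical "test after" regression tests, written against the fizzbuzz()
# sketch shown earlier in this post (the module name is an assumption).
from fizzbuzz_sketch import fizzbuzz


class FizzBuzzRegressionTest(unittest.TestCase):
    def test_multiple_of_three_becomes_fizz(self):
        self.assertEqual(fizzbuzz(9), "Fizz")

    def test_multiple_of_five_becomes_buzz(self):
        self.assertEqual(fizzbuzz(10), "Buzz")

    def test_multiple_of_both_becomes_fizzbuzz(self):
        self.assertEqual(fizzbuzz(15), "FizzBuzz")

    def test_number_containing_three_becomes_fizz(self):
        self.assertEqual(fizzbuzz(13), "Fizz")

    def test_zero_becomes_zero(self):
        self.assertEqual(fizzbuzz(0), "Zero")


if __name__ == "__main__":
    unittest.main()
```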

What all of that means to me is that any comparisons we make about code quality as a result of the experiment are entirely unrealistic.  And because the code quality was unrealistic for those two teams, doubt cascades onto all of the other results.  Would the Pair TDD team really have finished first if the code quality of the Non-TDD teams hadn't been so abysmal that it severely hurt their velocity?

So, simply because the code quality of the two Non-TDD teams was unrealistically poor, the results of the entire experiment are cast into doubt, and they would most likely prove entirely unconvincing to anyone looking to the experiment for evidence that TDD actually does help improve code quality and speed of delivery.

Even worse, the outcome of the experiment was effectively guaranteed to match what we wanted it to be.

Improving the Experiment

Several things can be done to improve the quality of the experiment:

  • Include a "Single TDD" team type that consists of a single developer using TDD practices.
  • Have multiple teams of each type, preferably 3-4 teams of each type, for a total of 12-16 teams (about 18-24 developers).  This will provide for a much larger set of data points, making it possible to have much more confidence in the conclusions drawn from the collected data.
  • Use a more complex problem for the experiment.  The customized version of the FizzBuzz kata used was significantly more complex than the traditional FizzBuzz, but was still relatively simple.  At the same time, this needs to be a problem that the teams can realistically complete in less than a day, in order to leave adequate time for the introduction, breaks, and the closing retrospective.
  • Most importantly, make sure that each team consists of developers who wholeheartedly believe in that team's approach and actively practice it in their everyday work.  This is essential for obtaining realistic code quality from every team.  Developers who do not regularly practice their team's approach end up guessing at what a genuine practitioner would do in a given situation instead of drawing upon actual experience, and all of that guessing produces code that is less representative of what a real practitioner would have written.
  • Have more realistic "sign-off"/acceptance requirements for when a task is deemed "complete".  For example, require an actual (quick) code review and some amount of regression testing to ensure that past functionality has not been compromised.  Note that this, in conjunction with the much larger number of teams, means that additional people beyond the facilitator (Product Owner) will be needed to help with sign-off, as the facilitator would quickly become a bottleneck.

Final Words

Despite all of my criticism, this was a valuable experiment, even if its greatest value was as an experiment about how to potentially better run this experiment in the future.

Edits:
8/24/2012 - Original post
5/10/2013 - Included more details about the experiment, image of the graph showing team progress through the stories, conclusions that were initially drawn from the experiment, and added a link to the final source code produced by each team.

2 comments:

  1. Would it be possible to share the code from the experiment?

    1. Sure. I've updated the original post with a link to a Github repo containing the source code artifacts that I've been able to recover. I've also added additional information about the experiment, including the stories that were played, a graph that the facilitator created during the exercise, and the original conclusions that were drawn from the experiment during a short retrospective that the participants had immediately following the completion of the exercise.

