Friday, August 24, 2012

TDD vs. Non-TDD Experiment

The Experiment

I was fortunate enough today to take part in a very small (but extremely interesting) experiment, run by and performed with developers who have been practicing TDD for some time, that attempted to compare code quality, speed of development, and quality of the solution across three different "teams":

  • Pair TDD (2 developers)
  • Pair Non-TDD (2 developers)
  • Single Non-TDD (1 developer)
One additional person took part in the experiment as the facilitator (Product Owner).  The facilitator knew the entire exercise beforehand and was familiar with various solutions to it.

The exercise was a customized version of the FizzBuzz kata, which included the base kata as the first few stories, followed by some twists designed to cause difficulties for typical solutions to the kata.  The exercise was broken up into 12 separate stories (task cards, each describing a new piece of desired functionality).

The stories were provided one at a time, so a team only ever knew the story it was currently working on and had no advance knowledge of any upcoming stories (and thus could not pre-design for new business requirements as if they were already known).  For example, if Team A was working on story 1, Team B was working on story 2, and Team C was working on story 1, only Team B knew what story 2 was.  Neither Team A nor Team C learned what story 2 contained until they finished story 1.

The teams were allowed to work at their own pace, allowing some teams to be several stories ahead of other teams.  Times at which stories were completed were captured for each team, allowing a graph to be created at the end of the experiment to aid with visualizing the relative progress and speed of the different teams.

Code coverage and code quality were not directly measured during the experiment, but each team maintained and committed to a local git repository, capturing snapshots of their code at the completion of each of their stories.

(5/10/2013) Unfortunately, the git repos used by the teams could not be recovered, but the final source code generated by the teams was successfully recovered.  The source code for all the teams is available on GitHub (TDD-vs-NonTDD-Experiment-20120824).

Story Cards

The story cards for this highly customized version of FizzBuzz were created by DJ Daugherty, who also served as the facilitator for the experiment.  (One possible reading of the replacement rules is sketched after the list.)

  1. Generate a list of numbers between 1 and 100 inclusive 
  2. For each number in the list, if the number is a multiple of 3, replace it with the word "Fizz" 
  3. For each number in the list, if the number is a multiple of 5 (but not a multiple of 3), replace it with the word "Buzz" 
  4. For each number in the list, if the number contains a 3 (but has not already been replaced by "Fizz" or "Buzz"), replace it with the word "Fizz" 
  5. For each number in the list, if the number contains a 5 (but has not already been replaced by "Fizz" or "Buzz"), replace it with the word "Buzz" 
  6. For each number in the list, if the number is a multiple of both 3 and 5, instead of replacing it with the word "Fizz", replace it with the word "FizzBuzz" 
  7. Change the generation of the list of numbers to be 0 to 100 inclusive. If the number is 0, replace it with the word "Zero" 
  8. Print the list (with all substitutions) forward, backward, and then forward again 
  9. Repeat 8, but follow it with another set of forward, backward, and forward, except using -100 to 0 as the set of numbers 
  10. Given the following custom ranges of numbers (note that this can either be printed to console or covered in a unit test), process them through the FizzBuzz replacement logic: 
    • 2, 5, 7, 9, 12, 15 
    • -3, -2, 0, 7, 9, 11 
  11. Given a custom range of numbers (again, this can be covered in just a unit test or printed to console), process it through the FizzBuzz replacement logic: 
    • -33.33 
  12. Given a string that contains FizzBuzz replacements for the range of values from 4 to 63 inclusive, "undo" the FizzBuzz logic to generate a candidate list of numbers. All occurrences of "Fizz" should be replaced by all the numbers in the range that would have been "Fizz", all occurrences of "Buzz" should be replaced by all the numbers in the range that would have been "Buzz", and all occurrences of "FizzBuzz" should be replaced by all the numbers in the range that would have been "FizzBuzz". For example: 
    • 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz => 1, 2, [3, 6, 9], 4, [5, 10], [3, 6, 9], 7, 8, [3, 6, 9], [5, 10]
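
To make the story cards concrete, here is a minimal Python sketch of one possible reading of the replacement rules (stories 1 through 7) and the "undo" step (story 12).  The function names, the rule precedence, and the use of string containment for the "contains a 3/5" checks are assumptions made for illustration; the solutions the teams actually produced are in the GitHub repository linked above.

    def fizzbuzz(value):
        """One possible reading of stories 1-7; the rule precedence and the
        handling of negatives and decimals (stories 9-11) are assumptions."""
        if value == 0:
            return "Zero"                     # story 7
        if value % 3 == 0 and value % 5 == 0:
            return "FizzBuzz"                 # story 6
        if value % 3 == 0:
            return "Fizz"                     # story 2
        if value % 5 == 0:
            return "Buzz"                     # story 3
        if "3" in str(value):
            return "Fizz"                     # story 4 (contains a 3)
        if "5" in str(value):
            return "Buzz"                     # story 5 (contains a 5)
        return str(value)

    def unfizzbuzz(tokens, low, high):
        """Story 12: expand every "Fizz", "Buzz", or "FizzBuzz" token into all
        the numbers in [low, high] that would have produced that word."""
        expansions = {"Fizz": [], "Buzz": [], "FizzBuzz": []}
        for n in range(low, high + 1):
            word = fizzbuzz(n)
            if word in expansions:
                expansions[word].append(n)
        return [expansions.get(token, token) for token in tokens]

    if __name__ == "__main__":
        print([fizzbuzz(n) for n in range(0, 101)])           # stories 1-7
        print(unfizzbuzz(["4", "Buzz", "Fizz", "7"], 4, 63))  # story 12 shape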

Story Completion Times: Comparison Across Teams

During the experiment, the facilitator kept track of when each team started and completed each of the stories. The following image is a photo that was taken of the graph produced from the data during the exercise.  The Y-axis is the story number.  The X-axis is the time of day.  The large "Break" bar in the middle of the day is when we stopped for lunch.  Not all teams completed all of the stories, which is why some of the lines end early.

Conclusions

For completeness, the following are the conclusions that were originally drawn from this experiment as part of the exercise.
  • Pair TDD is not necessarily the fastest to start – more initial overhead. However, the initial overhead quickly begins to pay off as additional complexity is added. 
  • One might expect Single Non-TDD to be the fastest, but that quickly proved not to be the case.
  • Pairing (even without TDD) showed benefits early on, as the Single Non-TDD team had nobody to bounce ideas off of. 
  • This problem was not very complex. Even so, TDD showed benefits fairly quickly, allowing the Pair TDD team to surpass the other teams by the end of the experiment. We believe this was largely because their test coverage made refactoring easier, which the later stories required.
  • The communication process involved in pairing can allow for more elegant and robust solutions. 
  • The developers accustomed to doing TDD who were put in the position of having to work without it felt "helpless" and were sometimes in situations where they felt they had no efficient way of finding the underlying problem that was plaguing them.
  • The gap between the TDD and Non-TDD teams would have widened further if regression testing had been required for sign-off.

Criticisms

While I believe that such an experiment is valuable and believe that the data we gathered in the experiment is interesting, I do not feel it is as useful, telling, or convincing as it could have been.

The outcome was pretty much as we expected it to be:  the Pair TDD team was a little slower off the starting blocks (due to the extra overhead of setting up test cases for the relatively trivial stories early on), but ended up finishing all the tasks of the exercise before the other two teams.  Their code quality was significantly higher, as was their test coverage.

The biggest issue that I have with the outcome of the experiment (which I brought up during the retrospective, but which I felt was largely viewed as unimportant by the others) is that the code quality of the Pair Non-TDD and Single Non-TDD teams was so abysmal as to be almost unrealistic.  I cannot remember a time (before I started doing TDD) when code of that quality would have been allowed anywhere near a production system.  Developers who attempted to commit such code would have been flogged and put on display for public humiliation.  (Okay, not really, but they would have received a good talking-to and probably some coaching to ensure that their code quality improved significantly.)

Moreover, the only team that had any test coverage at all was the Pair TDD team, which is also entirely unrealistic.  Both of the Non-TDD teams should have created at least some "test after" unit tests, which would also have helped their velocity by making it easier to refactor their code to accommodate new functionality.
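
To illustrate what "test after" coverage might have looked like, here is a minimal pytest-style sketch.  It assumes a fizzbuzz(value) function shaped like the sketch shown with the story cards; the module name and the specific test cases are hypothetical rather than taken from any team's actual code.

    # Hypothetical "test after" regression tests (pytest style).  The module
    # name "fizzbuzz" and the fizzbuzz(value) function are assumptions.
    from fizzbuzz import fizzbuzz

    def test_multiple_of_three_becomes_fizz():
        assert fizzbuzz(9) == "Fizz"

    def test_multiple_of_five_becomes_buzz():
        assert fizzbuzz(10) == "Buzz"

    def test_number_containing_a_three_becomes_fizz():
        assert fizzbuzz(13) == "Fizz"

    def test_zero_becomes_the_word_zero():
        assert fizzbuzz(0) == "Zero"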

What that means to me is that any comparison we make about code quality as a result of the experiment is entirely unrealistic.  And because the code quality was unrealistic for those two teams, doubt is cast on all the other results as well.  Would the Pair TDD team really have finished first if the code quality of the Non-TDD teams hadn't been so abysmal that it adversely affected their velocity?

In short, because the code quality of the two Non-TDD teams was unrealistically poor, doubt is cast on the results of the entire experiment, and those results would most likely prove entirely unconvincing to anyone looking to the experiment for evidence that TDD actually helps improve code quality and speed of delivery.

Even worse, the outcome of the experiment was effectively guaranteed to match what we wanted it to be.

Improving the Experiment

Several things can be done to improve the quality of the experiment:

  • Include a "Single TDD" team type that consists of a single developer using TDD practices.
  • Have multiple teams of each type, preferably 3-4 teams of each type, for a total of 12-16 teams (about 18-24 developers).  This will provide for a much larger set of data points, making it possible to have much more confidence in the conclusions drawn from the collected data.
  • Use a more complex problem for the experiment.  The customized version of the FizzBuzz kata used was significantly more complex than the traditional FizzBuzz, but was still relatively simple.  At the same time, this needs to be a problem that the teams can realistically complete in less than a day, in order to leave adequate time for the introduction, breaks, and the closing retrospective.
  • Most importantly, make sure to have developers who wholeheartedly believe in the approach of the team that they are on and actively practice that approach in their everyday jobs.  This is extremely important for obtaining realistic code quality from every team.  People who do not regularly practice their team's coding approach must guess at what an actual practitioner would do in each situation instead of drawing on real experience, and all that guessing leads to code that is less representative of what an actual practitioner of that approach would have produced.
  • Have more realistic "sign-off"/acceptance requirements for when a task is deemed "complete".  For example, require an actual (quick) code review, some amount of regression testing to ensure past functionality has not been compromised, etc.  Note that this, in conjunction with the massively increased number of teams, means that additional people will be needed beyond the facilitator (Product Owner) to help with sign-off, as the facilitator will quickly become a bottleneck.

Final Words

Despite all of my criticism, this was a valuable experiment, even if its greatest value was as an experiment about how to potentially better run this experiment in the future.

Edits:
8/24/2012 - Original post
5/10/2013 - Included more details about the experiment, image of the graph showing team progress through the stories, conclusions that were initially drawn from the experiment, and added a link to the final source code produced by each team.

"Refactoring" and "Tech Debt" Are Not Dirty Words

It always bothers me when anyone (but especially Product Owners, Iteration Managers, or Project Managers) treats "refactoring" and "tech debt" as dirty words -- things that we shouldn't be doing because they are a "waste of time".  There is an important difference between "gold plating" on one hand and "refactoring" or "tech debt" on the other.

"Gold Plating" is the process of either (1) adding in features because "they might possibly be maybe needed at some point in the far distant future if some somewhat unlikely scenario were to arise as an actual business requirement"; or (2) rewriting code to a degree well past what is needed to make the code more maintainable -- rewriting code just to make it "prettier" or more "elegant" without actually adding a measurable (and useful) amount of value toward the code being more maintainable or understandable.

"Refactoring" is the process of rewriting code with the purpose of making it significantly more maintainable or understandable.

"Tech debt" is a category of code maintenance that deals with either (1) "refactoring" that is needed, but which the team did not have time to perform at the time a story was played due to time constraints; or (2) the addition of necessary functionality to allow for proper maintenance or support of the software.

The key difference between them is that "gold plating" has gone beyond what is "needed" and is dealing with "wants".  "Refactoring" and "tech debt" are only going so far as to deal with what is truly "needed" and stopping before it gets to the point of "wants".

If a developer is abusing the terms "refactoring" or "tech debt" by using them when they are really talking about "gold plating", then shame on them.  They are not only causing business value to be lost by pushing the incorrect prioritization of work, but they are also helping to reinforce the idea that "refactoring" and "tech debt" are dirty words.

We as developers are told that we should always be questioning the value of the stories being played, to help ensure that stories have been assigned the correct priority and that we provide the most value to the business as quickly as possible.  But we should be just as vigilant in monitoring our own suggestions for additional work, to ensure that we are not advocating something that is really a "want" as if it were important or critical.

Monday, August 6, 2012

But It's A Defect!

Yes, the story is a defect story, but that most definitely does not mean that it automatically has the highest priority of any story in the backlog.

Even defect stories have business value associated with them.  A fix may not even be worth its development and QA cost in man-hours, given how long it will take before the business sees a return on investment for that fix.

Yes, defect stories are defects because they mean that a previously played story is not behaving exactly as it was supposed to.  But there is also a portion of the original story that is behaving exactly as it was supposed to.  That means it is entirely possible (and somewhat probable) that the defect will affect such a small percentage of interactions with the system that the code fix will be much more expensive for the business than any work-around (manual or otherwise) that the business can implement to deal with the defect.

I am not saying that all defect stories are low priority.  The occasional defect story is exceptionally high priority.  For example, a defect that brings down an entire warehouse has a very real cost for as long as the warehouse remains down, so the faster that defect gets fixed, the better.

But, those severe defects are few and far between.  If they're not, that's a warning that something isn't right with the way you are doing things.  Perhaps you're missing a level of testing (unit, system, integration, end-to-end, etc.) or an area of your system isn't as well tested as it should be.  Or, perhaps it is an indication that a portion of the system needs to be rewritten to be able to better handle certain types of situations that were thought to be rare edge cases but turned out to be much more common.

So, yes.  I am saying that just because you found a defect, it doesn't mean that you have to fix it.  Some defects just aren't worth fixing because you'll either never get an ROI on the fix, or some other work-around for the defect will provide a faster ROI for the business.

You'll know the critical defects that need to be addressed immediately when you see them -- and so will your Product Owner.  Just like any other story, let your Product Owner prioritize your defect stories.  If they are leaning toward "all defects are critical", help nudge them back toward looking at what the real business value is in fixing the defects.

"Defect" is just a label that we're applying to help categorize stories.  Don't let the choice of word used to describe the category influence your impression of the importance of the story.  The categorization of a story is distinct from its prioritization.

Tech Stories are Business Stories Too!

"Say What?"

One thing that I seem to be hearing a lot of lately is that "Tech Stories" aren't "Business Stories" and thus shouldn't be subject to prioritization by the Product Owner.  I beg to differ.

Tech stories are just as much business stories as any other story that we play.  That they provide some value to the system being developed means that they are also providing business value; the system is being developed to serve the business.  If the system doesn't have business value, we have no business developing it.  Similarly, if a story doesn't have business value, we don't have any business working on it.  So, because the technical stories are providing business value, they should be subject to the same prioritization as any other business story.

"But, my Product Owner just doesn't understand why my Tech Stories are important!"

It's your job to help your Product Owner understand the importance of the story in terms they can relate to.

Yes, you heard that right.  You are the one with the technical expertise and knowledge, so you should be able to explain to your Product Owner why your technical story is important.  If you cannot explain to your Product Owner why it is important in terms that they can understand (i.e. in terms that relate to business value), then perhaps your technical story really isn't that important after all.

If a technical story is designed to provide better error handling around one section of the system, then it's up to you to explain to your Product Owner how better error handling around that section will provide business value.  For example, better error handling may mean less frustration for your users.  Or it may mean that Level 2 Support can more easily diagnose and address issues, resulting in fewer problems being escalated to Level 3 Support (which is likely much more costly than Level 2 Support).

"If I just tell my Product Owner that the world will end if we don't get to play this technical story, I can get him to let me work on it right away.  Why bother with trying to figure out the real business value behind it?"

Because if you aren't providing your Product Owner with an accurate impression of your technical story's true business value, then you aren't doing your job of delivering the best business value that you can, as quickly as you can.  Your technical story may provide much less business value than several of the other stories that have already been given a high priority.  It is your job to give your opinions and advice to your Product Owner to help them understand the business value of particular stories (or to occasionally nudge them away from "wants" and back toward "needs").  But it isn't your job to determine what will deliver the highest value for the business -- that is your Product Owner's job.

No matter what the technical story, there is business value within the story.  It's up to you as the developer to help your Product Owner understand the business value in it so that it can be prioritized correctly.