Fixing Failure
The software industry seems to fall victim to large-scale failures routinely. Typos cause developers to destroy production databases on their first day. AWS S3 went down because of a slight typo. The Boeing 737 MAX 8 is plagued by software issues. A Mars orbiter was lost because of a simple unit conversion error. Software projects frequently fail to meet their quality, scope, time, and budget targets.
Are developers alone in creating these massive failures? Is the secret simply to train them better? Will it help to just fire the ones who make mistakes? It may be a relief to know (perhaps with an ounce of schadenfreude) that other professions experience massive failure too, and frequently. The only difference is that they have a system to deal with it.
Medical professionals receive a decade of intense training, go through certifications, and complete hundreds of hours of hands-on guided practice. Unfortunately, it seems a slight slip of the finger on a keypad can end the lives of up to 600 people in the USA every year. What’s the outcome? Well, a paper titled “Reducing number entry errors: solving a widespread, serious problem”, published in the Journal of the Royal Society Interface, demonstrates that a simple error-catching interface, one that just beeps and makes the screen go red when two decimal points are entered, cuts errors in half. That little if-statement saves hundreds of lives. It protects patients from a tiny, random slip of the finger in the one second a doctor was distracted, but at a massive scale.
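To make that “little if-statement” concrete, here is a minimal sketch of the idea in Python. It is not the interface from the paper; the function name check_number_entry and the exact rejection rules are assumptions made up for illustration. The point is only that malformed input is blocked loudly instead of being silently reinterpreted.

```python
def check_number_entry(keys: str) -> tuple[bool, str]:
    """Return (ok, reason) for a string of keypad presses.

    Hypothetical helper: it blocks malformed entries rather than
    guessing what the person at the keypad meant to type.
    """
    if keys.count(".") > 1:
        # The case described in the article: two decimal points.
        return False, "more than one decimal point"
    if keys in ("", "."):
        return False, "no digits entered"
    if not all(ch in "0123456789." for ch in keys):
        return False, "unexpected character"
    return True, "ok"


# A real device would beep and turn the screen red; here we just print.
for entry in ["1.5", "1..5", "1.5.", ""]:
    ok, reason = check_number_entry(entry)
    print(f"{entry!r}: {'accept' if ok else 'BLOCK'} ({reason})")
```

The check does nothing to make the person typing more careful. It changes the system around them so that one kind of slip can no longer reach the patient.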
It shouldn’t be a surprise that failure will happen. Failure is probabilistic. Given enough attempts at the same task, failure will almost certainly happen. While we can’t control failure, we can change the probability distributions. Doctors don’t just look at individual doctors entering numbers; they look at the entire system of interactions and move towards systems that are unlikely to cause failure. Likewise, casinos don’t like gamblers counting cards because it changes the odds and reduces their profits. Similarly, the best software teams don’t get really good at repeating the exact same process thousands of times; they rig the system so they win more often.
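A small sketch of why repetition makes failure close to inevitable, and why shifting the odds matters. It assumes every attempt is independent and has the same chance of failing, which is a simplification, and the probabilities used are illustrative rather than measured.

```python
def chance_of_at_least_one_failure(p_fail: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries fails.

    Assumes each try fails with the same probability `p_fail`.
    """
    return 1.0 - (1.0 - p_fail) ** attempts


# Illustrative numbers only: a 1-in-1000 slip versus a process
# that has been changed so the slip is half as likely.
for p_fail in (0.001, 0.0005):
    for attempts in (100, 1_000, 10_000):
        risk = chance_of_at_least_one_failure(p_fail, attempts)
        print(f"p={p_fail}, attempts={attempts}: {risk:.1%} chance of a failure")
```

Under these assumptions, a task with a 1-in-1000 chance of going wrong has roughly a 63% chance of failing at least once over a thousand attempts. Nobody got worse at their job; the task was simply repeated often enough.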
Be proactive and plan for success
Failures are a result of mistakes in either process, knowledge, or skill. Hiring smart developers only reduces risk if they have some overlap with your organization’s process, knowledge, or skills. If they don’t know how you make software, what you make, or lack the skills to do it your way, it might not matter at all who you hire.
When lacking knowledge, a developer may need to resort to trial-and-error to find problems. The key word here is error.
Rules can be incorrectly followed, or used in cases when the rules don’t apply. Raise your hand if you’re a manager who’s unfortunately gotten exactly what you asked for, but not what you really wanted.
Skills can be the hardest to confront. Even the best of us will not see that absent semicolon sometimes. Did you know airplane pilots are more likely to have an accident near their 1000th hour than their 50th?
Rigging the system in your favour is all about changing the way work gets done. Take the example of a typical sales team. They map the conversion rates between every stage in their process (a “sales funnel”). They can A/B test new pitches to improve conversions, identify how many new leads they need to reach revenue quotas, how many leads will fail to be converted, and even project revenue for the business. They systematically try to push failure out of their process.
The equivalent of sales metrics for a development team would be predictable delivery throughput, predictable failure rates, and hard data on where the biggest improvements can be made.
Thinking about success and failure as an eventual outcome changes the way we plan. A web host that promises 99.9% uptime can feel like “That’s pretty good, they’ll probably never go down, but they might one day”. However, when modeling for a guaranteed 0.1% downtime, a much more deliberate plan is made: “We accept 8 hours of downtime this year, the support team is aware of this, and knows how we will handle it”.
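The arithmetic behind that downtime budget is short enough to sketch. The function name downtime_budget_hours is made up for this example, and a year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(uptime_percent: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime allowed by an uptime promise over a period."""
    return period_hours * (100.0 - uptime_percent) / 100.0


for uptime in (99.0, 99.9, 99.99):
    budget = downtime_budget_hours(uptime)
    print(f"{uptime}% uptime allows {budget:.2f} hours of downtime per year")
```

A 99.9% promise works out to about 8.76 hours a year, which is where a budget of roughly 8 hours comes from. Turning the percentage into hours is what makes it something a support team can actually plan around.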
“Knowing where the trap is—that's the first step in evading it.”― Frank Herbert, Dune
Be reactive and correct flaws
When we fail, it’s because we fell into a trap. A good team will make sure the trap is removed or made hard to fall into. After the AWS S3 outage, the team responsible created a “Correction of Errors” report, which explored real root causes, short-term solutions, long-term solutions, and other lessons. This sort of investigation is commonly known as “Five Whys”.
A lot of people think they know what Five Whys is, and say “just ask why five times”. This is wrong. The point is to understand the context that must exist for a situation to be possible. Asking “Why?” can easily turn into “Who is responsible?” or “It’s just how it works”. This creates friction and prevents the conversation from exploring how to make the team skew towards success. The point is to dig into the context; you shouldn’t keep asking a literal “Why?” like a five-year-old child interested in why the sky is blue. John Allspaw describes the pitfalls in his excellent post “The Infinite Hows”.
If facilitating this conversation turns into finger-pointing, try switching the questions around to focus on contexts:
What conditions needed to exist for this to be possible?
How did that happen?
What was missing?
A strange criticism of this technique is that Five Whys turns up lots of root causes. Why would there be only one cause? This is real life: it’s complex, and it’s messy. Finding multiple causes is a great thing! You can prioritize solutions by weighing how much each one will adjust the odds of failure against the cost to implement it. There is no one root cause, and I don’t know why anyone who’s experienced real life would expect otherwise. Five Whys is just a map that explains the context that failure lives in.
Moving Forward
A healthy organization must accept that even skilled experts will eventually fail. Those failures can happen in low- and high-risk situations alike. When teams work together to eliminate causes of failure, they rig the system against failure. The job of management is not to identify which individuals fail (they all will eventually); it is to work with the entire team to find where failure is probable, and then to change the processes, knowledge, or training program to avoid it.
Successful development teams exist when the context they work in sets them up for success every day. Failure will happen; it’s up to you to decide how often.
If you’re convinced, you’ll want to learn more about how to fail better. I suggest looking into “omission vs commission”, “categorization of human error”, “Stop the line”, and “fail-fast”.