The Art of Debugging: Part 0


Prologue

Why is Part 0 published after Part 1? Well, I wasn’t sure yet how to write the introduction to this series/book. Books in this category often start by explaining why the problem is worth learning, which feels a bit weird to me. If I’m reading the book, I’ve already decided it’s worthy of my attention. I’m also unlikely to be swayed by an author who has a vested interest in convincing me it’s important. While I was driving to Melbourne for the long weekend, an idea struck me: instead of telling the audience how hard the problem is (which you either know or will learn for yourself), I want to tell them why I think the problem is hard. That gives an insight into how I think about the problem at a meta level and sets the scene for what comes previously/next. Plus, what’s more meta than a blog series about debugging with out-of-order execution?


Good problem solving skills

While Large Language Models have become almost ubiquitous in software development, capable of churning out massive amounts of code, they are relatively naive at problem solving. Similarly, when most software developers first encounter bugs in production, they struggle. Despite potentially years of experience debugging logic in their own software, there’s a distinct gap in difficulty we must cross. But why?

To take it back to first principles: what makes for good problem solving? Evolution is the most basic example. Since the first single-cell organisms, evolution has been striving to solve a problem (what that problem is we’ll leave to the philosophers and biologists). It does so by making random changes and relying on survival of the fittest. Humans have managed to solve incredibly challenging problems through our ability to think abstractly and to learn without dying (mostly). Any solvable problem can be solved with sufficient time and resources; what we have optimised is making the process more efficient. So good problem solving is about efficiency, with a brute-force search being the worst case.


What makes production debugging so hard?


The simple answer is: production environments are not just complicated, they are complex. These words have specific definitions in the Cynefin framework (pronounced kuh-NEV-in), which defines five domains: simple, complicated, complex, chaotic, and confusion. For our purposes we’ll focus on the first three; the last two, while useful in some contexts (such as security incident response), require introducing topics on which I am by no means an expert (such as systems thinking).


To apply the framework to a problem-solving context, imagine you walk into your house, flip the light switch, and nothing happens; what do you do? Write out three steps if you want to.


Here’s my mental checklist:

  1. Check the lights outside are on: street lights and other people’s houses.

  2. Check whether the other lights in the house turn on, along with the TV and fridge.

  3. Finally, check the power meter to see if any of the safety switches or circuit breakers have tripped.
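In the spirit of a debugging blog, the checklist above can be sketched as a tiny decision procedure. This is a toy sketch only; the inputs are hypothetical observations you’d make yourself, not a real API:

```python
def diagnose(street_lights_on: bool, other_lights_on: bool,
             breaker_tripped: bool) -> str:
    """Walk the checklist in order and return the most likely cause.

    Each parameter is a hypothetical observation corresponding to one
    step of the checklist: look outside, try other lights/appliances,
    then inspect the power meter.
    """
    if not street_lights_on:
        # Step 1: the whole street is dark.
        return "power outage - call the power company"
    if not other_lights_on:
        # Step 2 failed, so the fault is inside the house.
        if breaker_tripped:
            # Step 3: a safety switch or circuit breaker has tripped.
            return "tripped breaker - reset it or call an electrician"
        return "house electrical fault - call an electrician"
    # Everything else works, so the fault is local to this light.
    return "broken globe - replace it"
```

For example, `diagnose(True, True, False)` returns `"broken globe - replace it"`: the rest of the street and house are fine, so the fault is local.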


The problem is simple (although how the electricity gets to your house is complex): either there’s no electricity or your light globe is broken. If there’s no electricity, it’s either a power outage or a fault in your house’s electrics. So you either replace the light globe, call an electrician, or call your power company. Now, if you are an electrician, the issue is complicated.

Complicated is the domain we are most familiar with: the relationship between cause and effect is knowable, but it takes expertise to see. While it may take many years to learn how to work effectively in this domain, it’s possible to quickly analyse failures. People spend decades fine-tuning their skills in this area, and I love watching these masters at work, because what can seem incredibly hard to a novice is made to look easy.


My first car was a 1979 Volvo I paid $300 for; obviously it broke down a lot. On one occasion the engine just conked out as I was driving and wouldn’t start. I called a roadside mechanic, who took one listen to the noise the engine was making, immediately diagnosed the problem (the timing belt had broken), and told me it wouldn’t be expensive to repair. Unlike in many modern cars, there was sufficient clearance between the pistons and the intake/exhaust valves that they wouldn’t make contact and get bent when the belt broke. He recognised the symptom (no compression) and accurately diagnosed the cause and the fix, all within 30 seconds of arriving on scene.


Our understanding of these systems, whether compilers or 1970s cars, is gained through a process of reductionism: learning each component individually and its role in the system as a whole. But when a system exhibits emergent behaviour, where its properties are more than the sum of its parts, we are in the world of complex systems. Here cause and effect can only be deduced retrospectively; they cannot be accurately predicted. That’s one of the reasons we conduct Post Incident Reviews: to help us reflect and identify the contributing factors.

So how do we approach a problem we can’t know the causes of?


Probe, sense, respond

The Cynefin framework’s authors tell us we can learn instructive patterns that help us approach “unknown unknowns” and identify the likely best next steps, but we need to continuously re-evaluate our approach and adjust course. This is what this series hopes to teach you. Rather than teaching you how to probe first, which depends on the technology you’re working with, I’ll start with how to sense. When you see a pattern of behaviour you’ve seen before, you develop an intuition: the ability to form hypotheses that lead you towards a better understanding. In my experience it’s this sixth sense that is the hardest to learn. Stay tuned for the next instalment: The Entomology of Software Bugs.
