The Art of Debugging: Part I

Prologue

This blog post is designed to be the first of a multi-part series. If I find the motivation, it will become a book.


My interest in computers really kicked off when I discovered Linux. The ability to inspect and understand what was going on, including reading the source code, gave me a much richer understanding of what my computer was doing and why things weren't working the way I wanted them to.


I have had a pretty unusual career path for a Software Developer. I started as a support engineer for proprietary monitoring software, for which I didn't get to see the source code, let alone write it. Then I became a System Administrator, debugging a lot of networking, storage, database and operating system problems, as well as more proprietary software. I, like many others, am not satisfied if I fix a problem and don't understand how. My brain has trouble letting go of that feeling. The satisfaction of understanding a bug that has stumped me and others for a long time is almost as addictive as nicotine.


The term "nerd snipe" entered my vernacular a long time ago. I love being sniped and my colleagues know it. I get a sense of excitement when I realise there's a challenge worthy of my attention, because I know I can immerse myself into a flow state that makes the hours pass easily.

Part 1: Your mental model is wrong

When debugging systems you are faced with two problems: the system isn't behaving as you expect, and your mental model of how it works isn't correct. The second is always true, but it only becomes a problem when you find yourself trying to reconcile it with conflicting information. That can be as simple as a user report, an exception backtrace or a line on a graph. Something doesn't make sense, but you're not sure why.

Recognising that your mental model is wrong gives you a framework for solving the underlying problem. Fighting the system, trying to force it to adhere to laws you have invented, adds unnecessary friction to the process. Computer systems behave in complex ways and, like scientists, we are limited by our instruments for inspecting their current state. Recognising that there are limits to your understanding of the system, and that those limits are what stand between you and a fix, opens the door to being wrong. Once you accept that your view is wrong, you can start building a more accurate view of what's going on.



Woods' Theorem: As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly.


Good engineers have a strong mental model of the layer of the system they work in and the layer below it. A Ruby on Rails developer understands the fundamentals of how Ruby, Rails and the surrounding ecosystem interact to serve a website to a user. Great engineers know what layers exist beneath those and have a grab bag of knowledge and tools for debugging them.


The OSI model is a fantastic example of this encapsulation. It's a mental model, and as such its usefulness has limits, but it gives us a way to think about how TLS, HTTP, DNS and TCP/IP work together to deliver and display a website to a user.
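
To make that layering concrete, here is a minimal Python sketch that walks the stack by hand for a single request: a DNS lookup, then a TCP connection, then a TLS session, then an HTTP exchange. The host and the raw HTTP/1.1 request are only illustrative; in practice a library handles all of this for you, which is exactly why the layers are easy to forget.

    import socket
    import ssl

    host = "example.com"  # illustrative target; any HTTPS site works

    # DNS: resolve the hostname to an address
    ip = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4][0]

    # TCP: open a reliable byte stream to that address (transport layer)
    sock = socket.create_connection((ip, 443), timeout=10)

    # TLS: wrap the TCP stream in an authenticated, encrypted session
    ctx = ssl.create_default_context()
    tls = ctx.wrap_socket(sock, server_hostname=host)

    # HTTP: speak the application protocol over the encrypted stream
    tls.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    print(tls.recv(4096).decode(errors="replace").splitlines()[0])  # e.g. HTTP/1.1 200 OK
    tls.close()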

Knowing your limitations

When I interview DevOps engineers I often ask a hypothetical “how would you debug this scenario?”. It can be difficult to describe your debugging process in the abstract, but it's very insightful to see how other people approach the problem. Usually it's something like: “given a three-tier web service (load balancer/HTTP server, application server and database), how would you go about diagnosing a user report that the site is slow?”. Generally speaking I have seen three approaches: top-down, going from the user's experience down; middle-out, starting with the load balancer or application server and seeing whether the slowness lies up or down the stack; and bottom-up, starting with the database first. The best engineers I've hired are aware of two or all three of these techniques and know each is useful; there's no right way.
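
A crude way to apply the middle-out approach is to time the same kind of request at different points in the stack and see where the latency appears. The hostnames below are hypothetical placeholders, and a real investigation would also time a representative database query, but the shape of the check is the same:

    import time
    import urllib.request

    # Hypothetical endpoints for two of the tiers; substitute your own hosts.
    TARGETS = {
        "load balancer": "https://www.example.com/health",
        "app server": "http://app-01.internal:3000/health",
    }

    def time_request(url):
        """Return the wall-clock seconds taken to fetch the URL once."""
        start = time.monotonic()
        urllib.request.urlopen(url, timeout=10).read()
        return time.monotonic() - start

    for name, url in TARGETS.items():
        try:
            print(f"{name}: {time_request(url) * 1000:.0f} ms")
        except OSError as exc:
            print(f"{name}: failed ({exc})")

If the load balancer is slow but the application server answers quickly, the problem sits in front of the app; if both are slow, keep going down towards the database.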


Our limitations ultimately determine our ability to solve the problem, and finding the boundaries of our abilities is what enables us to get better. It's easy to get frustrated when we don't know what's going wrong, but that's when we're learning the hard way, which makes the knowledge more likely to be valuable to others. Sharing what I've learnt is yet another rewarding outcome.


Making assumptions about how a system behaves is critical to this encapsulation model: it allows us to compartmentalise knowledge and focus on the details that matter most. Sometimes we make assumptions because we don't understand a component and haven't had to before; these are relatively easy to correct, because we know that we don't know. The other kind, assumptions made incorrectly on the basis of incomplete or deceptive information, are much harder to unearth. When we are debugging we need to challenge these assumptions, because it's often nuanced or outright incorrect expectations that are blinding us to an answer that would otherwise be obvious.


Recently I was debugging why an internal website was logging me out every time I performed a certain action. When I went back through the sign-in flow I would be bounced around in a loop, never able to do the task I wanted. I assumed it was a bug in the website; after talking with the team that owned it, I found out my IP address was changing constantly. I had assumed my IP changed only rarely, but after testing that assumption by running curl ipaddr.io a few times in a console I realised it wasn't true. I had turned on Cloudflare WARP and forgotten to turn it off, and WARP multi-homes the IP address it uses, round-robining between two. The system wasn't telling me why it had logged me out, which is one problem, but I could have figured that out pretty easily myself if I'd tried. Instead I assumed it was someone else's fault.
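
Testing that kind of assumption takes a minute. The curl commands I ran amount to the sketch below (it assumes, as the story suggests, a service that echoes your public IP back as plain text): if more than one address shows up, the "my IP rarely changes" assumption is dead.

    import time
    import urllib.request

    # Poll a plain-text "what is my IP" service a few times; the URL is the
    # one from the story and is assumed to return just the address as text.
    seen = set()
    for _ in range(5):
        ip = urllib.request.urlopen("http://ipaddr.io", timeout=5).read().decode().strip()
        seen.add(ip)
        time.sleep(1)

    print("distinct public IPs observed:", seen)  # more than one => assumption is wrong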

Rubberducking

One of the points at which you know you're on the right path is when you find yourself asking “how did this system ever work in the first place?”. You've turned your mental model inside out and now you can't understand, based on the data you have, how it ever worked at all. One of my favourite movies as a teenager was a psychological thriller called Pi, in which the protagonist has an internal monologue that starts with “restate my assumptions”. This has become a mantra of mine while debugging, to help root out any assumptions I've made that could blind me to the problem.


Another technique that helps here is Rubberducking (or rubber duck debugging). It involves grabbing a coworker and explaining the problem to them. They may ask questions, but mostly it's the act of explaining the problem out loud that lets you root out the answer.


It's when we say our assumptions out loud that we often discover one of two things: either we'd assumed something we hadn't verified, and in verifying it we realise we were wrong; or we'd been focussed on a single aspect of the problem in our investigation and ignored some other area. While these phenomena are similar, one is slightly more conscious than the other. When we say the assumption out loud (or just type it into a team chat) we reflect on it, which kick-starts the problem-solving process with fresh inspiration.

Knowing what you don’t know

Whether you ask a seasoned professional or a junior engineer what the problem is, you'll get the same answer: “I don't know”. But a seasoned professional knows a lot more than they are letting on; they have a grab bag of tools and techniques for finding the problem. In this blog series we will go through many of these, but the possibilities are almost infinite. Knowing enough about the problem space gives us a much better chance of success and makes the process a lot faster. When it's a new problem you don't have much experience with, it's going to be slower and more frustrating. Being aware of that frustration and finding techniques to combat it is paramount. Before diving into an issue, ask yourself a few questions:

  • What type of bug is this? In the next part I will introduce a taxonomy of bugs. For now, here's a non-exhaustive list.

    • Memory corruption

    • Logical

    • Race condition (A special and complicated type of logical bug; see the sketch after this list)

    • Performance

  • What tools do I know that can help me resolve it?

  • What techniques do I know of that I can research to help me?
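
As a taste of why race conditions earn their "special and complicated" label, here is a minimal Python sketch of one; the switch-interval tweak is only there to make the interleaving show up quickly in a short demo run. Several threads perform a read-modify-write on a shared counter, and increments are silently lost even though each thread's code looks correct in isolation.

    import sys
    import threading

    # Shrink the interpreter's thread switch interval so preemption (and the
    # race) shows up quickly even in a short demo run.
    sys.setswitchinterval(1e-6)

    counter = 0

    def worker(iterations):
        global counter
        for _ in range(iterations):
            # Read-modify-write is not atomic: another thread can run between
            # the read and the write, and one of the increments is lost.
            current = counter
            counter = current + 1

    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(f"expected {4 * 100_000}, got {counter}")  # usually a smaller number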


Being aware of your weaknesses helps you focus on the next logical step to take. When you're all out of ideas you need better techniques and tools to help you; acquiring them is a key part of the learning process and something that will pay dividends again and again. Being able to break problems down into smaller, incremental steps is a valuable skill. Sometimes the path to a solution seems tangential: to understand why the problem is occurring you have to build a reproduction, but your attempts fail because you're missing a key piece of information about why the issue happens in the first place. How do you find out what is happening in production that is different from your reproduction? Sometimes you can brute-force it; sometimes the answer just falls into your lap.


Knowing when to stop

It's easy to get tunnel vision when debugging; that single focus is often necessary to find the problem. We almost never have enough information, but sometimes data is so scarce that we're grasping at straws. We may have hypotheses, but no concrete evidence to continue effectively. You have two choices: continue trying to reproduce the issue, or wait for more data to appear. The latter can be a difficult choice: while we can do everything possible to capture the data we need, we can't know whether it will be enough. It's an uncomfortable scenario I've faced many times.

A tale of two bugs

Over a period of a few months a colleague and I debugged two memory corruption bugs that turned out to be in the same C library extension. A major challenge with memory corruption bugs is that the data you collect tells you when the corruption triggered a segmentation fault, not when the corruption occurred. A core dump tells you there is memory corruption, not why it happened.

In both instances I had been aware of the bug for several days but unable to make significant progress until the right set of circumstances let me reproduce it. The first break was a reproducible crash in our continuous integration environment; the second came when my colleague realised the crash coincided with a regular maintenance task.


As Engineers we don't like relying on luck or hope to fix our problems, but in the absence of a reproduction we sometimes have no other choice. Continuing to try to reproduce the problem by brute force can be effective, and there are tools such as fuzzers and property testing libraries that can help us. But sometimes it's not worth the effort and we just have to wait.
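
For illustration, this is roughly what the property-testing route looks like with Python's Hypothesis library. The encode/decode pair is a made-up stand-in for whatever code you suspect; the point is that the library searches the input space for you and shrinks any failure it finds down to a minimal reproduction.

    from hypothesis import given, strategies as st

    # Hypothetical codec standing in for the suspect code: it strips trailing
    # whitespace on encode, so the round-trip property fails for some inputs.
    def encode(text):
        return text.rstrip().encode("utf-8")

    def decode(data):
        return data.decode("utf-8")

    @given(st.text())
    def test_round_trip(text):
        # Property: decoding an encoded string must give the original back.
        assert decode(encode(text)) == text

    # Run with pytest (or call test_round_trip() directly); Hypothesis will
    # report a minimal failing input, such as a single whitespace character.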

