> Reproduce the issue.
> . . .
> I’m not sure that I’ve ever worked on a hard problem . . .
I agree, the author has probably not worked on hard problems.
There are many situations where a) reproducing the problem is extremely elusive, b) reproduction is easy but the cause is especially elusive, or c) organizational issues prevent access to the system, to source code, or to people willing to investigate and/or solve the issue.
Some examples:
For A, the elusive reproduction, I saw an issue where we had an executive escalation that their laptop would always blue screen shortly after boot up. Nobody could reproduce this issue. Telemetry showed nobody else had this issue. Changing hardware didn't fix it. Only this executive had the anti-Midas touch to cause the issue. Turned out the executive lived on a large parcel, and had exactly one WiFi network visible. Some code parsing that list of WiFi APs had an off-by-one error which caused a BSOD. A large portion of wireless technology (Bluetooth/Thread/WiFi/cellular) bugs fall into this category.
For B, the easy-to-reproduce but still difficult, I've seen numerous bugs that cause stack corruption, random memory corruption, or trigger a hardware flaw that freezes or resets the system. These issues are terrible to debug, because either the logs aren't available (the system comes down before the final moments), or the culprit doesn't log anything and never harms itself, only an innocent victim. Time-travel tracing is often the best solution, but it is also often unavailable. Bisecting the code changes is sometimes little help in a busy codebase, since the culprit is often far away from its victims.
Category C is also pretty common if you are integrating systems. Vendors will have closed source and be unable or unwilling to admit even the possibility of fault, help with an investigation, or commit to a fix. Partners will have ship-blocking bugs in hardware that they just can't show you or share with you, but that must nonetheless get fixed. You will often end up shipping workarounds for errors in code you don't control, or carefully instrumenting code to uncover the partner's issues.
Once you have done this, you are already over the hump. It's like being the first rider over the last mountain of a Tour de France stage: you've more or less won by doing that.
I'm not sure I even consider it a challenge if the issue is easily reproduced. You will simply grind out the solution once you have the reproduction done.
The real bugs are the ones that you cannot replicate. The kind of thing that breaks once a week on 10 continuously running machines. You can't scale that system to 1000 or more with the bug around, you'll be swamped with reports. But you also can't fix it because the conditions to reproduce it are so elusive that your logs aren't useful. You don't know if all the errors have the same root cause.
Typically the kind of thing that creates a lot of paths to check is "manual multithreading": a bunch of threads, each locking and unlocking (or forgetting to) access to shared data in a particular way. The number of interleavings of "this reads and then that writes" explodes quite fast with such code, and it explodes in a way that isn't obvious from the code. Sprinkling log output over such code can change the frequency of the errors.
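A minimal sketch of the pattern, using Python threads and a shared counter (the names here are invented for illustration). Whether the unsafe version actually loses updates on a given run depends entirely on scheduling, which is exactly why these bugs come and go:

```python
import threading

counter = 0  # shared data, touched by every thread

def worker_unsafe(n):
    # Each iteration is a read-modify-write with no lock: two threads can
    # read the same value and one increment is lost. Adding log output
    # changes the timing and can make the bug vanish.
    global counter
    for _ in range(n):
        counter += 1

lock = threading.Lock()

def worker_safe(n):
    # Same loop, but the lock serializes the read-modify-write.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def run(worker, n=100_000, threads=4):
    """Run `threads` copies of `worker` and return the final counter."""
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter
```

`run(worker_safe)` always returns exactly `threads * n`; `run(worker_unsafe)` may or may not fall short, depending on the interpreter version and scheduling, which is the point.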
If you catch yourself thinking "it's probably X", then you should try to prove yourself wrong, because if you are wrong, you are looking in the wrong place. And if you are struggling to understand why something is happening, you can safely assume that something you believe to be true is in fact not true. Invalidating that assumption is how you figure out why.
Assumptions can range from "there's a bug in a library we are using", "the problem must have been introduced recently", "the problem only happens when we do X", etc. Most of these things are fairly simple to test.
The other day I was debugging someone else's code that I had inherited. I started by looking at the obvious place in the code and adding some logging, and got nowhere. Then I decided to try to reproduce the problem in a place where that code was definitely not used, to challenge my assumption that the problem was even in that part of the code. I instantly reproduced the issue. I had wasted two hours staring at that code and trying to understand it.
In the end, it was a weird bug that only showed up when using our software in the US (or, as it turns out, anywhere in the western hemisphere). The problem wasn't the functionality I was testing but everything that used negative coordinates.
Once I narrowed it down to a simple math problem with negative longitudes, I realized the cause was a missing call to abs where we were subtracting values (subtracting a negative value means you are adding it). That function was used in four different places; each of them was broken. Easy fix, and the problem went away. Being in Europe (only positive longitudes), we had simply never properly tested that part of our software in the US. The bug had lurked there for over a year. Kind of embarrassing, really.
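The actual code isn't shown, but the failure mode described reduces to something like this (function names and the "width" interpretation are invented for illustration):

```python
def lon_delta_buggy(a, b):
    # Worked for every European test coordinate (all positive longitudes,
    # with a > b in the test data). With negative longitudes the sign can
    # flip, and downstream code expecting a non-negative delta breaks.
    return a - b

def lon_delta_fixed(a, b):
    # The one-line fix: take the absolute difference, so the result is a
    # non-negative width regardless of hemisphere or argument order.
    return abs(a - b)
```

For example, `lon_delta_buggy(-80, -75)` returns -5 where a width of 5 was expected; `lon_delta_fixed` returns 5 for both argument orders.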
Which is why randomizing your inputs in unit tests is important. We were testing with just one hard-coded coordinate. The fix included adding proper unit tests for the algorithm.
One thing that makes me sad about the pervasive use of async/await-style programming is that it usually breaks the stack in a way that makes this technique a bit useless.
Really, you have a "one-system" where you can see _ALL_ the logs? I don't believe that. This whole software thing is abstractions everywhere, and we are probably using some abstraction somewhere that isn't compatible with this fabled "one-system".
Often the most debugging takes place on the least observable systems.
Some things can be incredibly hard to debug and can depend on the craziest of things you'd never even consider possible. Like a thunderstorm causing voltage spikes that very subtly damage the equipment causing subtle failures a few months later. Sometimes that "software bug" turns out to be hardware in weird ways. Or issues like https://web.mit.edu/jemorris/humor/500-miles – every person who's debugging weird issues should read that.
Once you can actually reproduce the issue, you've often done 80-99+% of the work already.