> Reproduce the issue.
> . . .
> I’m not sure that I’ve ever worked on a hard problem . . .
I agree, the author has probably not worked on hard problems.
There are many situations where a) reproducing the problem is extremely elusive, b) reproduction is easy but the cause is especially elusive, or c) organizational issues prevent access to the system, to source code, or to people willing to investigate and/or solve the issue.
Some examples:
For A, the elusive reproduction, I saw an issue where we had an executive escalation that their laptop would always blue screen shortly after boot up. Nobody could reproduce this issue. Telemetry showed nobody else had this issue. Changing hardware didn't fix it. Only this executive had the anti-Midas touch to cause the issue. Turned out the executive lived on a large parcel, and had exactly one WiFi network visible. Some code parsing that list of WiFi APs had an off-by-one error which caused a BSOD. A large portion of wireless technology (Bluetooth/Thread/WiFi/cellular) bugs fall into this category.
For B, the easy-to-reproduce but still difficult, I've seen numerous bugs that cause stack corruption, random memory corruption, or trigger a hardware flaw that freezes or resets the system. These issues are terrible to debug, because either the logs aren't available (the system comes down before the final moments), or the culprit doesn't log anything and never harms itself, only an innocent victim. Time-travel tracing is often the best solution, but it is also often unavailable. Bisecting the code changes is sometimes little help in a busy codebase, since the culprit is often far away from its victims.
Category C is also pretty common if you are integrating systems. Vendors will have closed source and be unable or unwilling to admit even the possibility of fault, help with an investigation, or commit to a fix. Partners will have ship-blocking bugs in hardware that they just can't show you or share with you, but that must nonetheless get fixed. You will often end up shipping workarounds for errors in code you don't control, or carefully instrumenting code to uncover the partner's issues.
Once you have done this, you are already over the hump. It's like being the first rider over the last mountain of a Tour de France stage: you've more or less won by doing that.
I'm not sure I even consider it a challenge if the issue is easily reproduced. You will simply grind out the solution once you have the reproduction done.
The real bugs are the ones that you cannot replicate. The kind of thing that breaks once a week on 10 continuously running machines. You can't scale that system to 1000 or more with the bug around, you'll be swamped with reports. But you also can't fix it because the conditions to reproduce it are so elusive that your logs aren't useful. You don't know if all the errors have the same root cause.
Typically the kind of thing that creates a lot of paths to check is "manual multithreading": a bunch of threads, each locking and unlocking (or forgetting to) access to shared data in a particular way. The number of interleavings of "this reads and then that writes" explodes quite fast with such code, and it explodes in a way that isn't obvious from the code. Sprinkling log output over such code can change the frequency of the errors.
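A minimal sketch of the pattern, using Python threads and a shared counter (the names here are invented for illustration). Whether the unsafe version actually loses updates on a given run depends entirely on scheduling, which is exactly why these bugs come and go:

```python
import threading

counter = 0  # shared data, touched by every thread

def worker_unsafe(n):
    # Each iteration is a read-modify-write with no lock: two threads can
    # read the same value and one increment is lost. Adding log output
    # changes the timing and can make the bug vanish.
    global counter
    for _ in range(n):
        counter += 1

lock = threading.Lock()

def worker_safe(n):
    # Same loop, but the lock serializes the read-modify-write.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def run(worker, n=100_000, threads=4):
    """Run `threads` copies of `worker` and return the final counter."""
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter
```

`run(worker_safe)` always returns exactly `threads * n`; `run(worker_unsafe)` may or may not fall short, depending on the interpreter version and scheduling, which is the point.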
If you catch yourself thinking "it's probably X", then you should try to prove yourself wrong, because if you are wrong, you are looking in the wrong place. And if you are struggling to understand why something is happening, you can safely assume that something you believe to be true is in fact not true. Invalidating that assumption is how you figure out why.
Assumptions can range from "there's a bug in a library we are using", "the problem must have been introduced recently", "the problem only happens when we do X", etc. Most of these things are fairly simple to test.
The other day I was debugging someone else's code that I had inherited. I started by looking at the obvious place in the code and adding some logging, and got nowhere. Then I decided to try to reproduce the problem in a place where that code was definitely not used, to challenge my assumption that the problem was even in that part of the code. I instantly reproduced the issue. I had wasted two hours staring at that code and trying to understand it.
In the end, it was a weird bug that only showed up when using our software in the US (or, as it turns out, anywhere in the western hemisphere). The problem wasn't the functionality I was testing but everything that used negative coordinates.
Once I narrowed it down to a simple math problem with negative longitudes, I realized the cause was a missing call to abs where we were subtracting values (subtracting a negative value means you are adding it). That function was used in four different places; each of them was broken. Easy fix, and the problem went away. Being in Europe (only positive longitudes), we had simply never properly tested that part of our software in the US. The bug had lurked there for over a year. Kind of embarrassing, really.
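The actual code isn't shown, but the failure mode described reduces to something like this (function names and the "width" interpretation are invented for illustration):

```python
def lon_delta_buggy(a, b):
    # Worked for every European test coordinate (all positive longitudes,
    # with a > b in the test data). With negative longitudes the sign can
    # flip, and downstream code expecting a non-negative delta breaks.
    return a - b

def lon_delta_fixed(a, b):
    # The one-line fix: take the absolute difference, so the result is a
    # non-negative width regardless of hemisphere or argument order.
    return abs(a - b)
```

For example, `lon_delta_buggy(-80, -75)` returns -5 where a width of 5 was expected; `lon_delta_fixed` returns 5 for both argument orders.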
Which is why randomizing your inputs in unit tests is important. We were testing with just one hard-coded coordinate. The fix included adding proper unit tests for the algorithm.
One thing that makes me sad about the pervasive use of async/await-style programming is that it usually breaks the stack in a way that makes this technique a bit useless.
Really, you have a "one-system" where you can see _ALL_ the logs? I don't believe that. This whole software thing is abstractions everywhere, and we are probably using some abstraction somewhere that isn't compatible with this fabled "one-system".
Often the most debugging takes place on the least observable systems.
Some things can be incredibly hard to debug and can depend on the craziest of things you'd never even consider possible. Like a thunderstorm causing voltage spikes that very subtly damage the equipment causing subtle failures a few months later. Sometimes that "software bug" turns out to be hardware in weird ways. Or issues like https://web.mit.edu/jemorris/humor/500-miles – every person who's debugging weird issues should read that.
Once you can actually reproduce the issue, you've often done 80-99+% of the work already.