Spending 3 months investigating a 7-year old bug and fixing it in 1 line of code

TehShrike

I particularly liked this part:

> Knowing very little about USB audio processing, but having cut my teeth in college on 8-bit 8051 processors, I knew what kind of functions tended to be slow. I did a Ctrl+F for “%” and found a 16-bit modulo right in the audio processing code.

That feeling of saving days of work because you remember a clue from previous experience is so good.

brogrammernot

This exact type of thing is why when I switched to the dark side (product) and sat in management meetings where often non-technical folks would go “we could measure by lines of code or similar” for productivity I often pointed out how that was a bad idea.

Did I win? Of course not, it’s hard for non-technical people to fully appreciate these things and any sort of larger infrastructure work, esp for developer productivity because it goes back to well how you going to measure that ROI.

Anyways, this was fun to read and brought back good engineering memories. I’d also like to say, as it brought back a bug I chased forever, fuck you channelfactory in c#.

pelagicAustral

Some of the stuff I've struggled with the most over the years have been SQL constraints that are not documented. I remember (probably like 10 years ago), I deployed an update to an ancient Windows Forms implementation that deprecated some login and instead made use of Windows Authentication. It worked like a charm for all users, but one! Checked everything, replicated the machine, tried so many weird stuff, and in the end, what was happening is that the "Users" table had a constraint in the number of characters for the username. This username was over the limit and was not being validated... Another one was a report that was giving the wrong amount, but getting the data from database seemed to do the math right... it was the damn Money datatype, changed to decimal, done...

EvgeniyZh

I recently found a bug whose fix amounted to one-liner. It all started with random CI failure when I was working on adding some new functionality.

I've rerun test locally -- no fail. I've changed the seed to one that was used in failing run -- nothing.

I add a loop to the code to repeat text hundred times -- still nothing. I run the test in bash loop hundred time -- 3 fails. So this already hints on some internal problems. I fixed every possible source of randomness and verified that all the inputs are identical between the runs -- still fails only once in a while. I started building MWE, but the function involved in reproduction is fairly complicated. I'm left with a hundreds lines of Jax code which fails in couple of percent of cases.

I look at the output of the compiler, and it is identical between failing and successful runs. So the problem is in the compiled code. The compiled code is ~1000 lines of HLO (not much better than assembly). Unfortunately HLO tooling is both unfamiliar to me and not well fit to this case (or at least I couldn't figure it out). So I start manually bisecting the code. I'm finally left with ~30 lines of HLO. It fails even less often (1% maybe), but at least it runs fast. It also seems to fail in exactly the same way (i.e., there is single incorrect output that I've between 3 fails). Now that's something maintainers can be hoped to look at.

It turned out that matrices with same content but different layout were deduplicated, leading to, in my case, transposed matrix being replaced by non-transposed one. The hash used for storage did take layout into account so the bug appeared only if two entries ended up in the same bucket (~3% of times). The fix was an obvious one liner [1].

[1] https://github.com/openxla/xla/commit/76e7353599d914546f9b30...

creeble

Ha, coincidentally, I designed and built an 8051-based MIDI switch in the early 90’s. There weren’t that many good tools at the time, and I designed everything from the software and UI to the circuit board and rack-mount case.

I even wrote an 8051 assembler in C, but found a good tiny-C compiler for it before it went into production.

You are not a programmer unless you’ve written key-debounce code :)

(OTOH, some of the worst programmers I’ve ever had the displeasure of working with were amazing low-level code hackers. In olden times, it seems like you were either good at that level of abstraction, or you were good at a much different [“higher”] level, seldom both.)

readthenotes1

"...it was based on a USB product we had already been making for PCs for almost a decade.

This product was so old in fact that nobody knew how to compile the source code. "

I think you mean "Management was so bad, nobody knew how to compile the source code".

There are plenty of systems out there that can and and plenty that cannot be reproduced from source. The biggest difference is the card taken to do so, not the age.

pvaldes

"I also ended up needing to find a Perl script that was buried deep in some university website. I still don’t know anything about Perl, but I got it to run"

Find dusty Perl script forgotten for years. Still works

Not the first time that I hear that

winrid

Reminds me of fixing an ~11yr old bug in Enemy Territory. I had to spend a night debugging the C code only to realize the issue was in the UI config: https://github.com/etlegacy/etlegacy-deprecated/pull/100/fil...

(IIRC UI scrolled twice for every mouse movement + you couldn't select items in server browser with mouse wheel as it would skip every other one)

magwa101

Similarly, I spent 6 weeks on a kernel token-ring driver intermittent initialization issue. This required kernel restarts over and over to observe the issue. Breakpoints were useless as they hid the issue. Turns out initialization in a specific step was not synchronous and reading the status was a race condition. It tooks weeks of staring, joking around, thinking, bs'ing, then suddenly, voila. Changed the order of the code, worked.

m3kw9

These one line fix always seem like a stupid bug , but in reality most bugs are like this and the fix is in the discovery

halifaxbeard

Reminds me of a bug I fixed in yamux, simply because of how long I've had to deal with it. Bug existed for as long as yamux did. (yamux is used by hashicorp for stream muxing everywhere in their products.)

If yamux's keepalive fails/times out, and you're calling Read on a demuxed stream, it blocks forever.

https://github.com/hashicorp/yamux/pull/127

rented_mule

The worst I experienced in this direction was also on a consumer device about 15 years ago. Performance was degraded and we couldn't explain it. A team of 5 of us was assembled to figure it out.

We spent over three months on it before finding a root cause. It was over two months before we could even understand how to measure it - we were seeing parts of the automated overnight test suite run taking longer, but every night it would be different tests that were slow. A key finding was that almost everything was slow on some boots of the device and fast on other boots of the device, and there was a reboot before each test was run. Doing some manual testing showed it being close to a 50% chance of a boot leading to slowness. Now what?

I eventually got frustrated and took the brute force / mindless approach... binary search over commits. Unfortunately, that wasn't easy because our build was 45-60 minutes, and then there was a heavily manual installation process that took 10-20 minutes, followed by several reboots to see if anything was slow. And there were several thousand commits since the last known good build (the previously shipped version of the device). The build/install/testing process was not easily automated, and we were not on git, otherwise using git-bisect would have been nice. Instead, I spent weeks doing the binary search manually.

That yielded the offending commit. The problem was that it was a massive commit (tens of thousands of lines of code) from a group in another part of the company. It was a snapshot of all of their development over the course of a couple of years. The commit message, and the authors, stated that the commit was a no-op with everything behind a disabled feature flag.

So now it was onto code level binary search. Keep deleting about half of the code in the commit, in this case by chunks that are intended to be inactive. After eventually deleting all the inactive code, there were still a few dozen lines of changes in a Linux subsystem that did window compositing. Those lines of code were all quite interdependent, so it was hard to delete much and keep things functional, so now on to walking through code. At least I could use my brain again!

Using the clue that the problem was happening about half the time and given that this code was in C, I started looking for uninitialized booleans. Sure enough, there was one called something like `enable_transparency`. Disabled code was setting it to `true`, but nothing was setting it to `false` when their system was disabled. Before their commit, there was no variable - `false` was being passed into the initializer call directly. Adding `= false` to the declaration was the fix.

So, well over a year of engineering hours spent to figure out the issue. The upside is that some people on the team didn't know how to proceed, so they spent their time speeding up random things that were slow. So the device ended up being noticeably faster when we were done. But it was pretty stressful as we were closing in on our launch date with little visibility into whether we'd figure it out or not.

gred

I love that the fix was an optimization allowing the code to keep the simplifying assumption that only one MIDI event ever needs to be buffered... rather than a "cleaner" / "future-proof" design change allowing buffering of more than one MIDI event.

figassis

The number of times I bumped by head against a desk, after missing multiple deadlines and then out of nowhere having a random moment of clarity such has “this gives me X vibes, but it would be insane if this was actually the case”, and then I do a quick string search and there it is.

anytime5704

First time I’ve seen hackernews link to Lemmy.

Love to see it. That place needs more organic growth.

tommiegannert

Kudos also to the original author for not doing premature optimization, of course. It wasn't until the iPad that it was needed. However, a TODO might have been useful. ;)

iam-TJ

Based on experience gained from debugging complex issues and code over decades I have a mantra I repeat to myself and others:

"Almost every bug turns out to be a 1 that should be 0, or a 0 that should be 1"

Keeping this in mind often keeps one focused on the detail of the underlying binary values and how they are being manipulated.

omoikane

> given a fixed denominator, any 16-bit modulo can be rewritten as three 8-bit modulos

Anybody know what's the exact transformation here? I searched around and found this answer, but it doesn't work:

https://stackoverflow.com/a/10441333

thebeardisred

This was a great read if for no more reason than this quote:

> I can still recall the cacophony of what amounted to an elephant on cocaine slamming on a keyboard for hours on end.

thenoblesunfish

Why is it so bad for a second noteon to leave the original note sounding? It makes no sense for keyboards but maybe you really do want two of the same note sounding.

langsoul-com

Half the challenge of the bug isn't fixing the code. It's finding wtf is happening.

shermantanktop

This kind of bug is always an emotional rollercoaster of anticipation, discovery, disappointment, angst, self-criticality, and satisfaction.