Did I win? Of course not, it’s hard for non-technical people to fully appreciate these things and any sort of larger infrastructure work, esp for developer productivity because it goes back to well how you going to measure that ROI.
Anyways, this was fun to read and brought back good engineering memories. I’d also like to say, as it brought back a bug I chased forever, fuck you channelfactory in c#.
I've rerun test locally -- no fail. I've changed the seed to one that was used in failing run -- nothing.
I add a loop to the code to repeat text hundred times -- still nothing. I run the test in bash loop hundred time -- 3 fails. So this already hints on some internal problems. I fixed every possible source of randomness and verified that all the inputs are identical between the runs -- still fails only once in a while. I started building MWE, but the function involved in reproduction is fairly complicated. I'm left with a hundreds lines of Jax code which fails in couple of percent of cases.
I look at the output of the compiler, and it is identical between failing and successful runs. So the problem is in the compiled code. The compiled code is ~1000 lines of HLO (not much better than assembly). Unfortunately HLO tooling is both unfamiliar to me and not well fit to this case (or at least I couldn't figure it out). So I start manually bisecting the code. I'm finally left with ~30 lines of HLO. It fails even less often (1% maybe), but at least it runs fast. It also seems to fail in exactly the same way (i.e., there is single incorrect output that I've between 3 fails). Now that's something maintainers can be hoped to look at.
It turned out that matrices with same content but different layout were deduplicated, leading to, in my case, transposed matrix being replaced by non-transposed one. The hash used for storage did take layout into account so the bug appeared only if two entries ended up in the same bucket (~3% of times). The fix was an obvious one liner [1].
[1] https://github.com/openxla/xla/commit/76e7353599d914546f9b30...
I even wrote an 8051 assembler in C, but found a good tiny-C compiler for it before it went into production.
You are not a programmer unless you’ve written key-debounce code :)
(OTOH, some of the worst programmers I’ve ever had the displeasure of working with were amazing low-level code hackers. In olden times, it seems like you were either good at that level of abstraction, or you were good at a much different [“higher”] level, seldom both.)
This product was so old in fact that nobody knew how to compile the source code. "
I think you mean "Management was so bad, nobody knew how to compile the source code".
There are plenty of systems out there that can and and plenty that cannot be reproduced from source. The biggest difference is the card taken to do so, not the age.
Find dusty Perl script forgotten for years. Still works
Not the first time that I hear that
(IIRC UI scrolled twice for every mouse movement + you couldn't select items in server browser with mouse wheel as it would skip every other one)
If yamux's keepalive fails/times out, and you're calling Read on a demuxed stream, it blocks forever.
We spent over three months on it before finding a root cause. It was over two months before we could even understand how to measure it - we were seeing parts of the automated overnight test suite run taking longer, but every night it would be different tests that were slow. A key finding was that almost everything was slow on some boots of the device and fast on other boots of the device, and there was a reboot before each test was run. Doing some manual testing showed it being close to a 50% chance of a boot leading to slowness. Now what?
I eventually got frustrated and took the brute force / mindless approach... binary search over commits. Unfortunately, that wasn't easy because our build was 45-60 minutes, and then there was a heavily manual installation process that took 10-20 minutes, followed by several reboots to see if anything was slow. And there were several thousand commits since the last known good build (the previously shipped version of the device). The build/install/testing process was not easily automated, and we were not on git, otherwise using git-bisect would have been nice. Instead, I spent weeks doing the binary search manually.
That yielded the offending commit. The problem was that it was a massive commit (tens of thousands of lines of code) from a group in another part of the company. It was a snapshot of all of their development over the course of a couple of years. The commit message, and the authors, stated that the commit was a no-op with everything behind a disabled feature flag.
So now it was onto code level binary search. Keep deleting about half of the code in the commit, in this case by chunks that are intended to be inactive. After eventually deleting all the inactive code, there were still a few dozen lines of changes in a Linux subsystem that did window compositing. Those lines of code were all quite interdependent, so it was hard to delete much and keep things functional, so now on to walking through code. At least I could use my brain again!
Using the clue that the problem was happening about half the time and given that this code was in C, I started looking for uninitialized booleans. Sure enough, there was one called something like `enable_transparency`. Disabled code was setting it to `true`, but nothing was setting it to `false` when their system was disabled. Before their commit, there was no variable - `false` was being passed into the initializer call directly. Adding `= false` to the declaration was the fix.
So, well over a year of engineering hours spent to figure out the issue. The upside is that some people on the team didn't know how to proceed, so they spent their time speeding up random things that were slow. So the device ended up being noticeably faster when we were done. But it was pretty stressful as we were closing in on our launch date with little visibility into whether we'd figure it out or not.
Love to see it. That place needs more organic growth.
"Almost every bug turns out to be a 1 that should be 0, or a 0 that should be 1"
Keeping this in mind often keeps one focused on the detail of the underlying binary values and how they are being manipulated.
Anybody know what's the exact transformation here? I searched around and found this answer, but it doesn't work:
> I can still recall the cacophony of what amounted to an elephant on cocaine slamming on a keyboard for hours on end.
> Knowing very little about USB audio processing, but having cut my teeth in college on 8-bit 8051 processors, I knew what kind of functions tended to be slow. I did a Ctrl+F for “%” and found a 16-bit modulo right in the audio processing code.
That feeling of saving days of work because you remember a clue from previous experience is so good.