tartuffe78
We were developing an Android multimedia application in the bad old 2.x days, and our QA person was running into a crash recording with the camera in our app.

She walked over to my desk to show me, and the crash wouldn't happen anymore. Went back to her desk and was able to reproduce it again, walked back to my desk and suddenly it wouldn't happen anymore.

After a bit of this, I finally brought my laptop over to her desk and discovered it was an out-of-memory error. She sat by the window, and with more sunlight at her desk she was getting higher frames per second from the camera, whereas in my dark corner the camera was recording way fewer frames and using less memory.

jcparkyn
Here's a fun one that's not really code-related:

I was working on a web application with a maintenance mode feature where it would show a "system offline" message while changes were rolled out etc. We noticed there was a typo (or so we thought) in the message, so it said "system olffine" instead.

Because this message could be overridden by site admins, there was a somewhat convoluted data flow to produce the final message. I grepped the codebase for "olffine" and found nothing. Then I checked my local DB to see what message was stored - it was spelled correctly. Then I stepped through the backend code to look at the message being returned - still correct.

Eventually I was back in my web browser with the page open on the left (spelled incorrectly) and devtools on the right (spelled correctly). What the hell?

It turned out the problem was that the font we were using had made a mistake with the "ffl" ligature so it rendered as "lff". We contacted the font author and they fixed it.

asauce
Around 10 years ago, my friend and I were working on an assignment for our intro to programming course. The assignment involved controlling an LCD screen with an Arduino Mega 2560 board.

Finally, after many failed attempts and a few too many coffees we managed to complete the assignment. I left a comment in the code “// We did it!!!!”, and we called it a night.

The next day we tried to demo to our TA and suddenly our code wouldn’t upload. Tried multiple PCs, multiple Arduinos, and had multiple TAs look into our code. No idea.

Finally one brilliant TA heard our story and deleted the comment I left. Suddenly the upload worked! Turns out anytime the bootloader saw “!!!” anywhere in the code it would drop into debugging mode and cause the upload to fail. Even if it was in a comment! That bug gave me major trust issues working with that 2560 that semester haha
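
A pre-upload check for this kind of trigger could look roughly like the sketch below. It assumes, as the story describes, that the problem is three consecutive `!` bytes anywhere in whatever gets sent to the board; `find_bang_runs` and the sample bytes are hypothetical, made up purely for illustration.

```python
def find_bang_runs(image: bytes, run_length: int = 3) -> list[int]:
    """Return the offsets where `run_length` consecutive '!' bytes start.

    Assumption: the trigger is the raw byte sequence, so only bytes are
    inspected, not source-level syntax such as comments or strings.
    """
    needle = b"!" * run_length
    offsets = []
    start = image.find(needle)
    while start != -1:
        offsets.append(start)
        # Advance by one byte so overlapping runs ("!!!!") are reported too.
        start = image.find(needle, start + 1)
    return offsets


# Hypothetical bytes standing in for an uploaded image.
sample = b"\x0c\x94We did it!!!!\x00"
print(find_bang_runs(sample))  # [11, 12]
```

An empty list means no run of three `!` was found, so this particular trigger can be ruled out.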

romanhn
Once upon a time I had to track down an issue where very rarely, with no discernible pattern the web app would produce garbled PDFs. We would restart all servers, everything would be fine for a month or two, then random bad output. Turned out this happened when an admin account remotely connected to an app server, which caused a reset of default screen resolution (only for admin accounts, not regular remote connections), which messed up the PDF library that relied on a specific resolution (it was HTML-to-PDF conversion). Lived with it for a couple of years at least before tracking down the root cause (after many many failed attempts). The fact that only the server with the reset resolution caused the problem confounded the issue.
smrq
We had an instance where a literal semicolon was accidentally added to the top of every page on our e-commerce website. Nobody noticed it for a while because it was hidden underneath the nav bar thanks to z indexing. The weird part is that after we found it and deleted it (someone erroneously terminated a line with a semicolon in a Razor template), it crashed the site entirely.

In the end, I found that the semicolon was in the <head>, and the inclusion of literal text caused document.body to be non-null. A later script in the head relied on the existence of document.body, making that particular semicolon load-bearing. (Angular 1 times, y'all.)

theginger
We were having a record-breaking heat wave in the UK, and we joked, "See, even Jenkins can't cope with the heat, all the builds are failing." It was not uncommon to get transient failures, so no one looked into it; it was too hot, and in the UK we are not used to dealing with that. Eventually someone did look into it. There was some little-used functionality in our platform which queried a weather API, and someone had written a unit test that used the real live API and set the expected temperature value to be lower than the record high temperature in London. So our pipelines really had been broken by the hot weather.
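
One way to keep a test like that independent of the actual weather is to inject the API call and feed the test a canned value. This is only a minimal sketch: `describe_weather`, `fake_fetch`, and the 21.5 reading are hypothetical names and numbers, not anything from the real code base.

```python
from typing import Callable


def describe_weather(fetch_temp_c: Callable[[str], float], city: str) -> str:
    """Format a temperature obtained from an injected fetcher."""
    temp = fetch_temp_c(city)
    return f"{city}: {temp:.1f} C"


def fake_fetch(city: str) -> float:
    # Canned value: the test depends neither on the live API nor on the weather.
    return 21.5


def test_describe_weather() -> None:
    assert describe_weather(fake_fetch, "London") == "London: 21.5 C"


test_describe_weather()
```

In production the same function would be handed a fetcher that calls the real API, while the unit test keeps using the fake.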
ihalip
An executable from a nightly build kept crashing and I got tasked with fixing it. I was a novice back then and spent most of the day trying to figure out what was happening, and when looking at the disassembly I saw it was crashing at a 'hlt' instruction, which shouldn't have been there.

Next day, after another nightly build, no more crashes. I did a binary diff between the crashing version and the new one: the difference was a single bit. A bit flip on the build server.
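
A bit-level diff of that kind can be sketched in a few lines. This is a hedged illustration, not the tool that was actually used: `differing_bits` is a hypothetical helper, and the sample bytes are invented (0xF4 happens to be the x86 `hlt` opcode; the surrounding bytes are arbitrary).

```python
def differing_bits(old: bytes, new: bytes) -> list[tuple[int, int]]:
    """Return (byte_offset, bit_index) for every bit that differs.

    Bit index 0 is the least significant bit. Inputs must be the same length.
    """
    if len(old) != len(new):
        raise ValueError("inputs differ in length")
    diffs = []
    for offset, (a, b) in enumerate(zip(old, new)):
        delta = a ^ b  # set bits mark the positions that changed
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((offset, bit))
    return diffs


# Invented three-byte "binaries" that differ in exactly one bit.
crashing = bytes([0x90, 0xF4, 0xC3])
fixed = bytes([0x90, 0xF5, 0xC3])
print(differing_bits(crashing, fixed))  # [(1, 0)]
```

A result with exactly one entry is the signature of a single flipped bit rather than a real code change.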

helph67
Our health database was being used in many states but one hospital reported error messages of "Invalid patient postcode" to our Client Support. C.L. couldn't duplicate the problem so I was sent to investigate, after checking that our same version was not giving the same errors. Long story --> Short; the operator inputting the postcode data had been using 'l' instead of '1'. The font in use made it difficult to detect the differences.
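
A validation message that names the offending character would have made this much easier to spot. Below is a minimal sketch, assuming postcodes are digits only; `check_postcode` and the `LOOKALIKES` table are hypothetical and not part of the original system.

```python
# Letters that are easily mistaken for digits in many fonts (illustrative list).
LOOKALIKES = {"l": "1", "I": "1", "O": "0", "o": "0"}


def check_postcode(raw: str) -> str:
    """Return an error message for `raw`, or an empty string if it is valid.

    Assumption: a valid postcode consists of ASCII digits only.
    """
    text = raw.strip()
    if text.isascii() and text.isdigit():
        return ""
    suspects = [ch for ch in text if ch in LOOKALIKES]
    if suspects:
        hints = ", ".join(
            f"'{ch}' (did you mean '{LOOKALIKES[ch]}'?)" for ch in suspects
        )
        return f"Invalid patient postcode: found {hints}"
    return "Invalid patient postcode"


print(check_postcode("3l21"))
# Invalid patient postcode: found 'l' (did you mean '1'?)
```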
slyall
Around 2000 I worked for an ISP. We had some customers reporting that they couldn't get to some websites. Narrowed it down to customers who were:

  * Running Mac OS 9 (not Mac OS X)
  * Dialed into an Ascend NAS (not other vendors)
  * Assigned a Dynamic IP
  * Accessing ASP-based websites
These customers would get a blank page on the website.

We actually had a computer in the office we could replicate it on but gave up at that point since we didn't have something to debug the network traffic. For the single-digit number of customers we gave them static IPs or something.

cml123
Earlier this year I was looking at an NPE that started happening within a rules engine used by an insurance quoting application for discount eligibility. Ultimately the nature of error was evident - a rule was written in an unsafe way and necessary data was not present, but it wasn't immediately clear why the data wasn't being mapped properly.

The error had only been reported happening a few times in a development environment. I was able to discern that the first time it happened was the morning after an update to Spring 3. Debugging locally with code written just a few days earlier didn't trigger the error, so I knew the Spring 3 upgrades had to be related. The missing data was supposed to be derived as the result of a library call to another rules library maintained by a different team, used to derive pricing attributes from information on a request.

After a bit of debugging, I could see that the data in question should have been derived by this other rules engine, but no data of any kind was being mapped from it. No errors were logged in the scenario, and debugging was very fiddly. Notably, the error messages at different points in the debugging process differed on subsequent requests after the first request was submitted. This required restarting the application locally after each pass of the error. This made me think that some static structure was at play.

This rules engine made use of the popular Jackson package to parse YAML files containing lists of rules to be executed subject to constraints. I could see that this parsing initially worked, but failed shortly into the execution flow. No rules were being executed even though they were being scoped for execution. After a few hours of incrementally debugging the scenario, I saw the true culprit: a class from the Apache Commons library was missing at runtime. The ClassNotFoundException was silently ignored and allowed processing to continue, only resulting in an NPE for a limited number of scenarios that required this additional rules engine. The class in question should have been provided transitively from our dependency on the other rules engine maintained by another team, but migrating to Spring 3 seemed to cause some incompatibility with that dependency. Adding Apache Commons to our build config (and fixing the unsafe code) fixed the issue, but I still don't know exactly why the issue was happening. I'll probably look back at it in the near future.

FlyingAvatar
This happened over 20 years ago, but I was helping a co-worker debug an issue they were having with a Windows application written in Delphi. This was before Google was a thing, and waaay before Stack Overflow, so getting help to solve these kinds of problems was a bit more involved.

As far as the issue, if they ran the offending code in the debugger, it worked flawlessly. But it would fail every time in the production build. Usually, this would point to some kind of race condition, but the code section was innocuous. It was essentially running the Delphi equivalent of strpos on a local variable.

I was comparing the build flags between the debug version and the release version and one thing that caught my eye was the optimization flags for the compiler. Lo and behold if you brought the optimization level down two notches the bug went away.

I don't think I ended up getting into the disassembly to submit a bug report, as the optimizer was almost certainly doing something it shouldn't, but at least we found the source of the issue.

Since we didn't want to actually disable optimizations on our release build, the "permanent fix" was to re-write our own strpos-equivalent in such a way that compiler optimizations didn't break it.

sandreas
I once had a Bug in SQL Server, where a SUBSTRING(field, 1, 255) in a varchar field that was 255 in size improved the query performance about 1.5 times.

We never found out what the problem was, but we traced it down to the SUBSTRING - removing it made the query significantly SLOWER.

That was weird.

orf
We once got paged because our app was down. But… it wasn’t. Everything was green and API requests were being processed.

After some digging and a lot of luck, it turned out that BT (huge ISP in the UK) had unceremoniously added our hostname to some internal blacklist, meaning it would never resolve, specifically and only for BT ISP users.

Getting it removed was non-trivial, and it was only through complete luck that an employee had a friend who worked at BT and was able to escalate the problem internally. Without that connection we would have been screwed, as there was nothing we could find about this blacklist on the internet or how to contact them about it.

Terrifying.

Sort of a bug, I guess? Maybe with BT?

aspenmayer
Windows BSOD on boot due to clear plastic tape on motherboard. Would boot and pass memtest86+ with no errors, but would consistently fail boot to Windows until tape removed from area opposite lower PCI slots. If I remember correctly, I could also successfully boot Linux while tape was present, but I may be mistaken as it's been a few years.

This issue was also present when booting a completely different hard drive with Windows 10, which usually would work fine on other systems.

Once tape was discovered and removed, normal boot was restored without incident.

BOFHs who want a good prank take note.

ksherlock
Weirdest? Some memory trashing which just so happened to replace a branch-if-equal with a branch-if-not-equal, and that code just happened to be in a multiplication subroutine (no hardware multiplication) causing multiplication to always give a negative result, post-trashing. This was showing up in some compiler-generated offset multiplications which were optimized out unless it was a debug build, long after the memory trashing took place.
ahfeah7373
Our JSP server would crash but only when the moons of Jupiter were aligned and the bakery across the street was serving the blueberry scone (CNR with the ham & cheese)
NetworkPerson
Randomly could not send emails through Outlook. No amount of screwing with accounts or Outlook files would resolve this. Then things would magically start working again for a time. Finally pinned it down to the wireless mouse. Still don’t know why, but if that specific model of Microsoft wireless mouse was plugged in, Outlook wouldn’t send emails. Confirmed with another mouse of this model on a separate person’s computer.
digitalsushi
At the UNH-IOL in 2004 the departments all had their own /24 subnets. We were warring with one of them over their winning where we got lunch, and so we put a line of JavaScript in the webmail HTML client that would open the print page dialogue box 0.5% of the time if the request came from that subnet. Oh, I misread 'had' as 'put'.
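
The decision logic behind that prank is tiny. Here is a sketch of it under stated assumptions: the subnet below is a placeholder documentation range (the real one isn't given), and `should_open_print_dialog` is a hypothetical name; the original was a line of JavaScript, not Python.

```python
import ipaddress
import random

# Placeholder /24 (a reserved documentation range), not the real subnet.
TARGET_SUBNET = ipaddress.ip_network("192.0.2.0/24")
PRANK_RATE = 0.005  # 0.5% of requests


def should_open_print_dialog(client_ip: str, roll: float) -> bool:
    """Return True if the request comes from the target subnet and the roll hits.

    `roll` is a number in [0, 1), passed in so the decision can be tested.
    """
    in_subnet = ipaddress.ip_address(client_ip) in TARGET_SUBNET
    return in_subnet and roll < PRANK_RATE


print(should_open_print_dialog("192.0.2.17", random.random()))
```

Passing the random roll in as an argument keeps the function deterministic, which is what makes a bug like this reproducible on purpose.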
oerdier
Couple years ago on employer's Macbook a JVM profiler kept crashing midway through profiling a local webserver. No error message. After digging deep, on some forum I found the suggestion to unplug all external monitors. I did indeed have an external monitor connected and thought "No way". Yes way.