I was working on a web application with a maintenance mode feature that would show a "system offline" message while changes were rolled out. We noticed there was a typo (or so we thought) in the message, so it said "system olffine" instead.
Because this message could be overridden by site admins, there was a somewhat convoluted data flow to produce the final message. I grepped the codebase for "olffine" and found nothing. Then I checked my local DB to see what message was stored - it was spelled correctly. Then I stepped through the backend code to look at the message being returned - still correct.
Eventually I was back in my web browser with the page open on the left (spelled incorrectly) and devtools on the right (spelled correctly). What the hell?
It turned out the problem was that the font we were using had made a mistake with the "ffl" ligature so it rendered as "lff". We contacted the font author and they fixed it.
Finally, after many failed attempts and a few too many coffees we managed to complete the assignment. I left a comment in the code “// We did it!!!!”, and we called it a night.
The next day we tried to demo to our TA and suddenly our code wouldn’t upload. We tried multiple PCs and multiple Arduinos, and had multiple TAs look into our code. No idea.
Finally one brilliant TA heard our story and deleted the comment I left. Suddenly the upload worked! Turns out anytime the bootloader saw “!!!” anywhere in the code it would drop into debugging mode and cause the upload to fail. Even if it was in a comment! That bug gave me major trust issues working with that 2560 that semester haha
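If you want to check a sketch for this before uploading, something like the following would do. It's a minimal sketch that assumes the trigger is simply three consecutive exclamation marks anywhere in the text, as the story describes; `find_triple_bang` is a made-up helper name, not part of any Arduino tooling.

```python
def find_triple_bang(source):
    """Return the 1-based line numbers of lines that contain '!!!'.

    Assumption: any run of three or more '!' counts as a hit,
    whether it sits in a comment, a string literal, or code.
    """
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        if "!!!" in line:
            hits.append(line_number)
    return hits


if __name__ == "__main__":
    sketch = "void setup() {}\n// We did it!!!!\nvoid loop() {}\n"
    print(find_triple_bang(sketch))
```

It only looks at the text you hand it, so it says nothing about what actually ends up in the compiled image.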
In the end, I found that the semicolon was in the <head>, and the inclusion of literal text caused document.body to be non-null. A later script in the head relied on the existence of document.body, making that particular semicolon load-bearing. (Angular 1 times, y'all.)
Next day, after another nightly build, no more crashes. I did a binary diff between the crashing version and the new one, and the difference was a single bit. A bit flip on the build server.
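For anyone curious how you'd confirm that two builds differ by exactly one bit, here's a rough sketch. It assumes both binaries are the same length and already loaded as bytes; the function names are mine, not from any particular diff tool.

```python
def count_differing_bits(a, b):
    """Count the bits that differ between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # XOR leaves a 1 in every bit position where the two bytes disagree.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def differing_offsets(a, b):
    """Return the byte offsets at which the two inputs differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    good = bytes([0x00, 0x10, 0xFF])
    bad = bytes([0x00, 0x11, 0xFF])  # lowest bit of the second byte flipped
    print(count_differing_bits(good, bad), differing_offsets(good, bad))
```

In practice you'd read the two files with `open(path, "rb").read()` and pass the results in.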
* Running Mac OS 9 (not Mac OS X)
* dialed into an Ascend NAS (Not other vendors)
* Assigned a Dynamic IP
* Accessing ASP-based websites
they would get a blank page on the website. We actually had a computer in the office we could replicate it on, but gave up at that point since we didn't have anything to debug the network traffic with. For the single-digit number of affected customers, we gave them static IPs or something.
The error had only been reported happening a few times in a development environment. I was able to discern that the first time it happened was the morning after an update to Spring 3. Debugging locally with code written just a few days earlier didn't trigger the error, so I knew the Spring 3 upgrades had to be related. The missing data was supposed to be derived as the result of a library call to another rules library maintained by a different team, used to derive pricing attributes from information on a request.
After a bit of debugging, I could see that the data in question should have been derived by this other rules engine, but no data of any kind was being mapped from it. No errors were logged in the scenario, and debugging was very fiddly. Notably, the error messages at different points in the debugging process differed on subsequent requests after the first request was submitted. This required restarting the application locally after each pass of the error. This made me think that some static structure was at play.
This rules engine made use of the popular Jackson package to parse YAML files containing lists of rules to be executed subject to constraints. I could see that this parsing initially worked, but failed shortly into the execution flow. No rules were being executed even though they were being scoped for execution. After a few hours of incrementally debugging the scenario, I saw the true culprit: a class from the Apache Commons library was missing at runtime. The ClassNotFoundException was silently ignored and allowed processing to continue, only resulting in an NPE for a limited number of scenarios that required this additional rules engine. The class in question should have been provided transitively from our dependency on the other rules engine maintained by another team, but migrating to Spring 3 seemed to cause some incompatibility there. Adding Apache Commons to our build config (and fixing the unsafe code) fixed the issue, but I still don't know exactly why the issue was happening. I'll probably look back at it in the near future.
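The swallowed-exception pattern is easy to reproduce in any language. Here's a rough Python analogy, not the original Java: `ImportError` stands in for `ClassNotFoundException`, and both function names are hypothetical.

```python
import importlib
import logging

logger = logging.getLogger("rules")


def load_dependency_unsafe(module_name):
    """Mimics the bug: a missing dependency is swallowed and None comes back."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Nothing is logged here, so the failure only shows up later,
        # when some caller tries to use the None result.
        return None


def load_dependency(module_name):
    """Safer variant: log the problem and re-raise so it is visible immediately."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.error("required module %r is missing", module_name)
        raise


if __name__ == "__main__":
    print(load_dependency_unsafe("no_such_module_for_demo"))
```

The unsafe version is why nothing was logged: the error surfaces far from its cause, and only on the code paths that actually need the missing piece.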
As for the issue, if they ran the offending code in the debugger, it worked flawlessly. But it would fail every time in the production build. Usually, this would point to some kind of race condition, but the code section was innocuous. It was essentially running the Delphi equivalent of strpos on a local variable.
I was comparing the build flags between the debug version and the release version and one thing that caught my eye was the optimization flags for the compiler. Lo and behold if you brought the optimization level down two notches the bug went away.
I don't think I ended up getting into the disassembly to submit a bug report, as the optimizer was almost certainly doing something it shouldn't, but at least we found the source of the issue.
Since we didn't want to actually disable optimizations on our release build, the "permanent fix" was to re-write our own strpos-equivalent in such a way that compiler optimizations didn't break it.
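To give an idea of how little code is involved, a hand-rolled strpos-equivalent looks roughly like this. This is a Python illustration only; the original was Delphi and I'm not reproducing that code, and `naive_strpos` is a made-up name.

```python
def naive_strpos(haystack, needle):
    """Return the index of the first occurrence of needle in haystack, or -1.

    Plain left-to-right scan; an empty needle is treated as matching at 0.
    """
    if not needle:
        return 0
    last_start = len(haystack) - len(needle)
    for start in range(last_start + 1):
        if haystack[start:start + len(needle)] == needle:
            return start
    return -1


if __name__ == "__main__":
    print(naive_strpos("system offline", "off"))
```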
We never found out what the problem was, but we traced it down to the SUBSTRING - removing it made the query significantly SLOWER.
That was weird.
After some digging and a lot of luck, it turned out that BT (huge ISP in the UK) had unceremoniously added our hostname to some internal blacklist, meaning it would never resolve, specifically and only for BT ISP users.
Getting it removed was non-trivial, and it was only through complete luck that an employee had a friend who worked at BT and was able to escalate the problem internally. Without that connection we would have been screwed, as there was nothing we could find on the internet about this blacklist or how to contact them about it.
Terrifying.
Sort of a bug, I guess? Maybe with BT?
This issue was also present when booting a completely different hard drive with Windows 10, which usually would work fine on other systems.
Once tape was discovered and removed, normal boot was restored without incident.
BOFHs who want a good prank, take note.
She walked over to my desk to show me, and the crash wouldn't happen anymore. Went back to her desk and was able to reproduce it again, walked back to my desk and suddenly it wouldn't happen anymore.
After a bit of this, I finally brought my laptop over to her desk and discovered it was an out-of-memory error. She sat by the window, and with more sunlight at her desk she was getting higher frames per second from the camera, whereas in my dark corner the camera was recording far fewer frames and using less memory.