NightMKoder
If your Clojure pods are getting OOMKilled, you have a misconfigured JVM. The code (e.g. whether it uses eval or not) mostly doesn't matter.

If you have an actual memory leak in a JVM app, what you want to see is a java.lang.OutOfMemoryError. That means the heap is full and has no space for new objects even after a GC run.

An OOMKill means the JVM attempted to allocate memory from the OS but none is available (typically because the pod hit its cgroup limit). The kernel then immediately kills the process. The problem is that the JVM at that point thinks _it should be able to allocate memory_ - i.e. it's not trying to garbage collect old objects - it's just calling malloc for some unrelated reason. It never gets a chance to say "man, I should clear up some space cause I'm running out". The JVM doesn't know the cgroup memory limit.

So how do you convince the JVM that it really shouldn't be using that much memory? It's... complicated. The big answer is -Xmx, but there are a ton more flags that matter (-Xss, -XX:MaxMetaspaceSize, etc.). Folks think -XX:+UseContainerSupport fixes this whole thing, but it doesn't; there's no magic bullet. See https://ihor-mutel.medium.com/tracking-jvm-memory-issues-on-... for a good discussion.
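
To make the point concrete: the heap bounded by -Xmx is only one slice of what the cgroup limit has to cover. This is just a sketch reading the standard MemoryMXBean numbers (the class name is made up; run it with whatever flags you're debugging):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class Footprint {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        // Bounded by -Xmx (or -XX:MaxRAMPercentage with container support).
        long heapMax = mem.getHeapMemoryUsage().getMax();
        // Metaspace, code cache, etc. -- capped separately (-XX:MaxMetaspaceSize).
        long nonHeapUsed = mem.getNonHeapMemoryUsage().getUsed();
        System.out.printf("heap max:      %d MiB%n", heapMax / (1024 * 1024));
        System.out.printf("non-heap used: %d MiB%n", nonHeapUsed / (1024 * 1024));
        // The cgroup limit must also cover thread stacks (-Xss per thread),
        // direct byte buffers, and native allocations -- none of which show
        // up in the heap number the GC manages. That gap is where OOMKills
        // come from even when -Xmx looks "safe".
    }
}
```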

roenxi
This article showcases 2 harder-to-articulate features of Clojure:

1) Digging into Clojure library source code is unsettlingly easy. Clojure's core implementation has 2 layers - a pure Clojure layer (which is remarkably terse, readable and interesting) and a Java layer (which is more verbose). RT (Runtime) happens to be one of the main parts of the Java layer. The experience of looking into a clojure.core function and finding a 2-10 line implementation is normal.

2) Code maintenance is generally pretty easy. In this case the answer was "don't use eval" and I've had a lot of good experiences where the answer to a performance problem is similarly basic. The language tends to be responsible about using resources.

pron
> -XX:+TraceClassLoading -XX:+TraceClassUnloading

The best insight into the operation of the JVM is now obtained via a single mechanism, JFR (https://dev.java/learn/jvm/jfr/), the JDK's observability and monitoring engine. It records a whole lot of event types: https://sap.github.io/SapMachine/jfrevents/

See here for examples related to tracking memory: https://www.morling.dev/blog/tracking-java-native-memory-wit...

ayewo
Interesting article.

1. I’m having a bit of trouble parsing this paragraph:

> The reason eval loads a new classloader every time is justified as dynamically generated classes cannot be garbage collected as long as the classloader is referencing to them. In this case, single classloader evaluating all the forms and generating new classes can lead to the generated class not being garbage collected.

> To avoid this, a new classloader is being created every time, this way once the evaluation is done. The classloader will no longer be reachable and all it’s dynamically loaded class.

It sounds like the solution they adopted was to instantiate a brand new classloader each time a dynamic class is evaluated, rather than use a singleton classloader for the app’s lifetime.
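
That reading matches how class unloading works on the JVM generally: a generated class can only be collected together with its defining classloader. A toy sketch of the "fresh loader, then drop it" idea (class and variable names here are made up for illustration; `System.gc()` is only a hint, hence the polling):

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;

public class LoaderLifetime {
    public static void main(String[] args) throws Exception {
        // Stand-in for a per-eval dynamic classloader.
        ClassLoader loader = new URLClassLoader(new URL[0]);
        WeakReference<ClassLoader> ref = new WeakReference<>(loader);

        // While 'loader' is strongly reachable, every class it defined is pinned.
        loader = null; // drop the last strong reference, as finishing an eval does

        // Poll a few GC cycles rather than assume a single pass collects it.
        for (int i = 0; i < 10 && ref.get() != null; i++) {
            System.gc();
            Thread.sleep(50);
        }
        System.out.println("loader collected: " + (ref.get() == null));
        // Once the loader is unreachable, it -- and all classes it defined --
        // become eligible for unloading. A singleton loader would instead pin
        // every class generated by every eval for the life of the app.
    }
}
```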

Sarkie
9 times out of 10 it's the classloader and a newInstance call on every request.

MBlume
The article makes it sound like the system was using eval (probably on a per-request basis, not just on start-up), and also like ceasing to use eval was pretty trivial once they realized eval was the problem. I'd be curious why they were using eval and what they were able to do instead.
henning
If you can go from ~60ms p99 response times to ~45ms just from reduced garbage collection, GC has a major impact on user-perceptible performance in your application, which proves it is an extremely expensive operation that should be carefully managed. If you have a modern microservices Kubernetes blah blah bullshit setup, this fraud detection service is probably only one part of a chain of service calls that occurs during common user operations at this company. How much of the time users wait for a few hundred bytes of actual text to load on screen is spent waiting for multiple cloud instances to GC?

The only way to eliminate its massive cost is to code the way game programmers in managed languages do and not generate any garbage, in which case GC doesn't even help you very much.

What should be hard about app scalability and performance is scaling up the database and dealing with fundamental difficulties of distributed systems. What is actually hard in practice is dealing with the infinite tower of janky bullshit the Clean Code Uncle Bob people have forced us to build, which creates things like massive GC overhead that is impossible to eliminate without totally rewriting or redesigning the app.