tptacek
I've been playing with o1 on known kernel LPEs (a drum I've been beating is how good OpenAI's models are with Linux kernel stuff, which I assume is because there is such a wealth of context to pull from places like LKML) and it's been very hit-or-miss. In fact, it's actually pretty SAST†-like in its results: some handwavy general concerns, but needs extra prompting for the real stuff.

The training datasets here also seem pretty small, by comparison? "Hundreds of closed source projects we own"?

It'd be interesting to see if it works well. This is an easy product to prove: just generate a bunch of CVEs from open source code.

† SAST is enterprise security dork code for "security linter"

asadeddin
I'm Ahmad, the founder of Corgea. We're building an AI AppSec engineer to help developers automatically triage and fix insecure code. We help reduce SAST findings by ~30% with our false-positive detection and accelerate remediation by ~80%. To do this for large enterprises, we had to fine-tune a model that we can deploy securely and privately.

We're very proud of the work we recently did, and wanted to share it with the greater HN community. We'd love to hear your feedback and thoughts. Let me know if I can clarify anything in particular.
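For readers wondering what "false positive detection" on top of a SAST finding can look like mechanically, here is a minimal Python sketch. The Finding fields, the prompt wording, and the call_llm hook are illustrative assumptions on my part, not Corgea's actual pipeline:

    from dataclasses import dataclass

    @dataclass
    class Finding:
        rule_id: str     # e.g. "python.sqlalchemy.sqli" (illustrative)
        file_path: str
        line: int
        snippet: str     # the flagged code plus surrounding context

    TRIAGE_PROMPT = """\
    You are reviewing a static-analysis finding.
    Rule: {rule_id}
    File: {file_path}:{line}
    Code:
    {snippet}

    Answer with exactly one word, TRUE_POSITIVE or FALSE_POSITIVE,
    followed by a one-sentence justification."""

    def triage(finding: Finding, call_llm) -> bool:
        """Return True if the model judges the finding a real vulnerability.

        `call_llm` is any callable that takes a prompt string and returns the
        model's text response (hypothetical hook; plug in whichever client
        you actually use).
        """
        prompt = TRIAGE_PROMPT.format(
            rule_id=finding.rule_id,
            file_path=finding.file_path,
            line=finding.line,
            snippet=finding.snippet,
        )
        verdict = call_llm(prompt).strip().upper()
        return verdict.startswith("TRUE_POSITIVE")

Feeding the model the surrounding code context, rather than just the flagged line, is what makes this differ from a plain pattern-matching linter.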

Mountain_Skies
Finding SQL injection is pretty trivial for SAST tools. The difficulty is what happens next. After whatever tool finds several thousand SQLi vulns in a ColdFusion application from 2001 that hasn't been touched in over a decade, someone must be identified to take responsibility for changing the code, testing it, and deploying it. Even if the tool can change the code, no one will want to take responsibility for changes to an application that has been quietly running correctly since before most of the department joined the company, built on an ancient technology that no one has experience deploying into production. This is where so many vulns live.

Shift left and modern development patterns can catch a very large share of known vulns, so in newer applications the work becomes mostly about fixing newly discovered vulns within an active development cycle. It's the older code that's the real scary monster, and identifying the vulns is the least scary part of getting them remediated and deployed to production.

Anything that reduces false positives is good, especially if it does so without significantly reducing identified true positives, but none of that changes the fact that finding the vulns is the low-hanging fruit of the whole process.
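To make the "trivial to find" part concrete: the classic pattern a SAST SQLi rule matches is untrusted input concatenated into a query string. A made-up Python/sqlite3 example (the ColdFusion case above would look analogous), with the mechanical fix next to it:

    import sqlite3

    def get_user_bad(conn: sqlite3.Connection, username: str):
        # Flagged by essentially any SAST tool: user input concatenated into SQL.
        return conn.execute(
            "SELECT id, email FROM users WHERE name = '" + username + "'"
        ).fetchone()

    def get_user_fixed(conn: sqlite3.Connection, username: str):
        # The mechanical fix: a parameterized query. Finding and applying this
        # is the easy part; getting someone to own, test, and redeploy a
        # 20-year-old app is the hard part described above.
        return conn.execute(
            "SELECT id, email FROM users WHERE name = ?", (username,)
        ).fetchone()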

WalterBright
> an SQL injection vulnerability

I simply do not understand why the SQL API even allows injection vulnerabilities. Adam Ruppe and Steven Schweighoffer have done excellent work writing a shell API over it (in D) that makes such injections far more difficult to write inadvertently.

On airplanes, when a bad user interface leads to an accident, the user interface gets fixed. There's no reason to put up with this in programming languages, either.
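In that spirit, here is a minimal Python sketch of an API shape that makes the unsafe thing awkward to write. It is my own illustration, not the D libraries mentioned above, and a language with stronger compile-time checks can enforce this more firmly than a runtime check:

    import sqlite3
    from dataclasses import dataclass
    from typing import Any, Sequence

    @dataclass(frozen=True)
    class Query:
        """A query is only built from a fixed template plus bound values."""
        template: str
        params: Sequence[Any] = ()

    def q(template: str, *params: Any) -> Query:
        # Values never touch the SQL text; they travel separately as parameters.
        if template.count("?") != len(params):
            raise ValueError("placeholder/parameter count mismatch")
        return Query(template, params)

    def run(conn: sqlite3.Connection, query: Query):
        # Raw strings are rejected, so the tempting
        # run(conn, "... WHERE name = '" + name + "'") simply doesn't work.
        if not isinstance(query, Query):
            raise TypeError("raw SQL strings are not accepted; build a Query with q()")
        return conn.execute(query.template, tuple(query.params)).fetchall()

    # Usage:
    # rows = run(conn, q("SELECT id FROM users WHERE name = ?", username))

The point is the interface, not the language: if the only convenient path routes values through parameters, the injection-prone version stops being the path of least resistance.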

sachahjkl
Let me introduce you to the much better and more reliable world of: static analysis
xrd
I was ready to sign up after reading the article. But when I click the button at the bottom ("Ready to fix with a click?"), nothing happens. After opening dev tools, I can see the click fires a network request to a LinkedIn ad-tracker, but nothing else happens. Maybe Firefox is blocking it?
bigiain
What an awesome way of finding companies that suspect their code is insecure, and then having them give you their source code. And _charging_ them for it, presumably to make it an easier sell to CXOs: "Nah, it's not those free-software hippy communists; they're gonna make you pay through the nose for this, like a _proper_ compliance-checkbox-ticking outsourced vendor!"

I wonder if this is an NSA front? Or Palantir maybe? Or NSO?

vouaobrasil
These small incremental AI tools seem, in isolation, like helpful things for human coders. But over a period of decades, these iterations will eventually become mostly autonomous, writing code by themselves with far less human intervention than today. That could be a very dangerous thing for humanity, but most people working on this stuff don't care, because by the time it happens they will be retired on a nice piece of private property that insulates them from the suffering of those who have not yet obtained theirs.
nodeshiftcloud
We find the idea of fine-tuning an LLM to triage and fix insecure code intriguing. However, we have concerns about the limitations posed by the size of the training dataset. As @tptacek mentioned, relying on "hundreds of closed source projects" might not provide the diversity needed to effectively identify a wide range of vulnerabilities, especially in complex systems like the Linux kernel. Incorporating open-source projects could enrich the model's understanding and improve its accuracy. Additionally, benchmarking the model by attempting to rediscover CVEs in open-source code seems like a practical way to assess its real-world effectiveness. Has anyone experimented with expanding the training data or testing the model against known vulnerabilities in open-source repositories?
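One low-effort version of that benchmark: collect (CVE, repo, patched file/line) tuples from public advisories, pin local checkouts to the vulnerable commits, and measure how many the scanner rediscovers. A rough Python sketch; every name here is a placeholder for whatever harness you would actually build, not anyone's existing benchmark:

    from dataclasses import dataclass
    from typing import Callable, Iterable

    @dataclass
    class KnownVuln:
        cve_id: str      # e.g. "CVE-2021-XXXXX" (illustrative)
        repo_path: str   # local checkout pinned to the vulnerable commit
        file_path: str
        line: int        # line changed by the upstream fix

    def recall_on_known_cves(
        scan: Callable[[str], Iterable[tuple[str, int]]],
        corpus: list[KnownVuln],
        tolerance: int = 5,
    ) -> float:
        """Fraction of known vulns the scanner reports within `tolerance` lines.

        `scan(repo_path)` is any wrapper around the model or tool under test;
        it should yield (file_path, line) pairs for each finding.
        """
        hits = 0
        for vuln in corpus:
            findings = scan(vuln.repo_path)
            if any(
                path == vuln.file_path and abs(line - vuln.line) <= tolerance
                for path, line in findings
            ):
                hits += 1
        return hits / len(corpus) if corpus else 0.0

Recall against known CVEs only measures half the picture, of course; you would want to pair it with a false-positive rate measured on the patched revisions of the same repositories.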