On Vulnerability Fixes, and also LLMs

During RSA I pithily commented on LinkedIn (what's a tweet for LinkedIn called?) that the announcements of Large Language Model generated auto-fixes for static analysis findings were premature.  Part of this is that I absolutely detest the ambulance chasing BS of the security vendor space, but it's also because I don't think the companies actually understand fixing vulnerabilities in practice.  If they did, they would be a lot more precise in what they were pitching.

Ignoring the very real limitations of LLMs (because apparently everyone is - they are amazing... at very specific things right now) let's talk just about a taxonomy of code changes to fix a static analysis finding, because it is a useful rubric for where LLMs may actually be promising.  It's also a useful rubric for how we should think about the user experiences reporting these issues to developers - we tend to have a one size fits all way of reporting static analysis findings (it's a bug in ADO/JIRA/GitHub, it's a PR comment, it's a red squiggle in an IDE, it's a glossy PDF printout a super overpriced consulting firm hands out on their last day, etc.) but the findings can represent vastly different engineering investments to correct.

(note: this taxonomy is primarily for a finding that already exists in the product, rather than a newly introduced one.  For timely detections of new issues, the last two categories basically do not apply)

Trivial Fixes

These are findings that take almost no effort to correct: a SQL injection where you convert a constructed query to a parameterized statement, XSS where you slap HTML encoding on a variable, a buffer overflow solved just by converting strcpy to strcpy_s/strlcpy, etc.  The chance of regression from the change is quite low, it's localized to a very narrow set of code, the context necessary to craft the fix is minimal, and it's almost always going to be the same sort of fix for the same sort of problem.  These are the perfect scenario for LLMs - there is very likely to be a large training set showing what the change should be, the prompt text can be fairly small due to the minimal context necessary, and it isn't likely to seriously blow anything up.
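As a rough illustration of how small and localized a trivial fix is, here is a minimal sketch of the SQL injection case in Python using the standard sqlite3 module.  The users table and the function names are made up for the example; they are assumptions, not anything from a real product.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Vulnerable: attacker-controlled text is spliced into the SQL string.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Fixed: the value is bound as a parameter and never parsed as SQL.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


def make_demo_db():
    # Hypothetical in-memory table used only for this demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)",
                     [("alice",), ("bob",)])
    return conn
```

The whole fix is a one-line change inside one function, which is why the same transformation applies to nearly every instance of the finding.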

BUT, using LLMs to solve this problem is like driving a semitruck to the corner market to get milk when it's a 30 second walk.  Sure, it gets you there, but it's hardly the efficient way.  A ton of the findings in this space can be automated with simple regex replacements (see DevSkim), and if you have a full semantic engine you can cover almost all of this category of fixes both more reliably and more repeatably than LLMs will be able to.  The primary thing LLMs get you is efficiency in crafting the fix transformation, because humans have to build the other transformation options by hand (though they can use GitHub Copilot to help them be more efficient at it, which is probably the right application of LLMs right now).  There is value in that efficiency gain, as it may be the difference between there being an automation option and there not, but it is primarily an efficiency gain for the tool vendor rather than for the tool user.
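To make the "simple regex replacement" point concrete, here is a minimal sketch of that style of auto-fix.  It is not DevSkim's actual rule format or engine, just an illustration of the idea, and it leans on a big assumption: the destination is a plain identifier naming a fixed-size array in scope, so that sizeof gives the buffer size.  A real rule would need a semantic check for that.

```python
import re

# Matches calls such as: strcpy(dst, src);
# The destination must be a bare identifier so it can be reused in sizeof().
STRCPY_CALL = re.compile(
    r"\bstrcpy\s*\(\s*([A-Za-z_]\w*)\s*,\s*([^;]+?)\s*\)\s*;"
)


def autofix_strcpy(c_source: str) -> str:
    # ASSUMPTION: the destination is a fixed-size array, not a pointer,
    # otherwise sizeof() would yield the pointer size and the fix is wrong.
    return STRCPY_CALL.sub(r"strlcpy(\1, \2, sizeof(\1));", c_source)
```

Calls with a destination the pattern cannot reason about (for example p->name) are deliberately left untouched rather than guessed at.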

Moderate Fixes

These are fixes a dev can crank out in a couple hours to a day, but only after spending a bit of time understanding the right spot in the code to change.  The fix is a bit less generalizable (i.e. you can't just take a fix for one instance and use it as the obvious solution for another instance), there are often multiple fix approaches that apply to different scenarios, multiple changes may be necessary in conjunction, and specific fix options might have side effects.  A ton of memory corruption and type confusion falls into this space - the related code contributing to the issue or fix is along the same path without a ton of intersecting paths (i.e. you can change it without changing behaviors elsewhere), but the scenario is complex enough that it just doesn't lend itself to traditional auto-fix automation.  If LLMs are going to provide unique value and utility right now, it's in this category.

Load Bearing Fixes

This is a fix for a vulnerability in code that is highly interconnected with other code, especially if its behavior is interconnected with EXTERNAL code, where even small semantic drift in how the code behaves can have side effects.  Often this is only partially a characteristic of the vulnerability type, and more determined by the characteristics of the code the vulnerability is in (though it is probably feasible for a good semantic analysis engine to detect some of the scenarios that indicate that it's load bearing code).  As an analogy, if a 50 story high-rise is going to have a crack in the central support columns, the engineers are going to be much happier if that crack is on the 50th floor rather than the first floor.  Either way it's a crack in concrete, but where the crack is matters a huge amount.  These are the sorts of issues where if the code has a bug, the team goes and gets the one developer who has enough of a mental model of how that code interconnects with other code that they can take a stab at fixing it without side effects, because nobody else on the team trusts themselves no matter how trivial the issue.  (One of the impetuses for the world of Microservices and atomic code packages was to prevent load bearing code, but in practice what this mostly does is hide the load from both the engineers and static analysis.  That's how you end up in a world where in order to fix a vuln in one microservice, dozens need code changes and redeployments.)

There isn't an automation story here right now.  If humans are afraid of the code, they are 12,000x more afraid of automation making changes they don't understand to it.  The very worst thing that could happen is that they trust LLMs despite that apprehension, but they will only make the mistake once.  Perhaps one day LLMs will have such an exhaustive prompt limit that they can fully internalize the implications of the change to all interconnected code, make systemic changes simultaneously to address the side effects, and legitimately tackle fixes humans are bad at, but Nvidia doesn't yet make the GPUs powerful enough to power that world.

Functionality Changes

Some vulnerability classes can only be fixed by outright changing the code functionality.  Almost all crypto mistakes fall into this category, as do almost all deserialization vulnerabilities.  The specific code to correct the issue might be quite modest, but if that code was used against data at rest in the bad format, all of that data needs to be migrated to the new format, and if the code was used in a transport mechanism, every connecting client or calling service also needs to be changed.  These aren't "bugs" in the sense of how you would classify the amount of work to address the issue in something like an Azure DevOps or Jira board - this is whatever you would call feature work (e.g. story or epic depending on magnitude).  It's going to be at least one dev for a good chunk of time, and in some instances a whole dev team to address the problem (or multiple dev teams if it's at the interface between two different services/components/etc.).
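A small sketch of why the code change is the easy part: suppose stored records carry an integrity digest computed with a weak hash.  Swapping the algorithm is one line, but every record already at rest has to be verified under the old algorithm and rewritten under the new one.  The record layout and function names here are hypothetical, chosen only for illustration, and MD5 to SHA-256 is just one example of such a change.

```python
import hashlib
import hmac


def make_record(payload: bytes, algo: str) -> dict:
    # Hypothetical at-rest format: the algorithm name is stored with the digest.
    return {"algo": algo, "digest": hashlib.new(algo, payload).hexdigest()}


def verify(payload: bytes, record: dict) -> bool:
    # Must keep supporting the old algorithm until every record is migrated.
    expected = hashlib.new(record["algo"], payload).hexdigest()
    return hmac.compare_digest(expected, record["digest"])


def migrate(payload: bytes, record: dict, new_algo: str = "sha256") -> dict:
    # Already on the new format: nothing to do.
    if record["algo"] == new_algo:
        return record
    # Check the record under its old algorithm before rewriting it.
    if not verify(payload, record):
        raise ValueError("stored digest does not match payload")
    return make_record(payload, new_algo)
```

Even in this toy version, the reader has to understand both formats for as long as any old data exists, and anything else that reads those records has to change in step.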

These categories of issues really should be communicated quite differently to development teams, because a development lead/manager is often going to need to plan the resources for the fix.  Coordinating all of the code changes is enough of a task that it's likely going to remain easier to coordinate people making those changes than to coordinate automation (LLMs or otherwise) for quite some time.

LLMs are going to continue to evolve.  Their prompt limits will expand, and eventually surpass the amount of code context a human can efficiently reason over with accuracy.  The input data that helps form the LLMs' possible suggestions will increase.  There may be some wall after which they don't improve more, but we haven't hit it yet.  BUT, that's eventually - for now I don't think the vendors have any choice but to rush out LLM driven auto-fixes, because it will seem like a table stakes feature.  And like the rush to cover as many languages and vuln classes as possible just to compete, it's going to produce A LOT of garbage in the short term.  But maybe smart vendors will target their LLMs at the categories of auto-fixes where they can be genuinely, uniquely useful right now.

Separate from skepticism about LLM auto-fixes right now, for all of you using static analysis, another takeaway is that there is value in understanding what sort of fix different types of detections require, and varying how you report them to developers based on those characteristics.  A vulnerability that requires a functionality change will get more traction quicker going to a dev manager in whatever mechanism they prefer, than to a junior dev in whatever interface they prefer.
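If you wanted to wire that idea into your own tooling, a minimal sketch might look like the following.  The four categories come from the taxonomy above; the reporting channels are purely my illustrative assumptions (only the "functionality change goes to whoever plans the work" mapping is argued for directly in this post), so adjust them to however your teams actually work.

```python
from enum import Enum


class FixCategory(Enum):
    TRIVIAL = "trivial"
    MODERATE = "moderate"
    LOAD_BEARING = "load bearing"
    FUNCTIONALITY_CHANGE = "functionality change"


# ASSUMPTION: these channels are hypothetical examples, not recommendations
# from any particular tool or vendor.
REPORTING_CHANNEL = {
    FixCategory.TRIVIAL: "automated fix suggestion on the PR",
    FixCategory.MODERATE: "bug assigned to a developer",
    FixCategory.LOAD_BEARING: "bug flagged for the code owner",
    FixCategory.FUNCTIONALITY_CHANGE: "planning item for the dev manager",
}


def route_finding(category: FixCategory) -> str:
    # Pick the reporting channel based on the kind of fix required,
    # rather than reporting every finding the same way.
    return REPORTING_CHANNEL[category]
```

The point is less the specific mapping than the fact that the fix category, not just the severity, drives where the finding lands.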