Coding with LLMs misplaces the understanding
Until you hit a limit, LLM coding appears to work extremely well, even if you don’t know how to code. You can ask for a simple application, follow instructions, paste error messages back in for it to fix, and tell it when certain things are wrong, adding new requirements as you go. You can do all of this fairly unconsciously, without any real thought about how the system needs to work; as long as you have a direction in mind and the application is relatively simple, there’s a good chance you’ll be able to point the LLM in the right direction.
I’d like to call this approach “unconscious coding”. It’s like when you’re driving and 45 minutes pass without you realizing. You are interacting with a complex system, but instead of trying to understand the function and intention behind each component, and how the organization and aggregation of those components comes together to produce the expected behavior, you treat the system as a black box. What’s funny is that many engineers pull off a light version of this even without LLMs. Sometimes, in a medium-to-large codebase, it can be quicker to “spray and pray”: if there is a bug in the system, don’t try to understand its root cause. Instead, if you’ve seen a lot of similar bugs, you can often imagine a solution that has worked in many prior cases. You can “bandaid” things up in a way that qualifies as a fix to the behavior but merely treats the symptoms, not the underlying cause. This is rarely a good strategy, but sometimes you see it in PRs: “I’m not sure why this works, but it does.”
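To make the “bandaid” pattern concrete, here is a minimal hypothetical sketch (the functions and the “discount” field are invented for illustration, not taken from any real codebase):

```python
# Hypothetical example (names invented for illustration): a report crashes
# because some orders are missing a "discount" field.

def order_total(order: dict) -> float:
    # Original code: assumes every order carries a discount.
    return order["price"] * (1 - order["discount"])

def order_total_bandaid(order: dict) -> float:
    # The "spray and pray" fix: default the missing field and move on.
    # The crash disappears, but we never learn why the discount is missing --
    # perhaps an upstream importer silently drops it, in which case other
    # reports are quietly wrong for the same reason.
    return order["price"] * (1 - order.get("discount", 0.0))
```

The symptom (a crash) is gone; the cause (whatever upstream code fails to set the field) is untouched.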
It’s difficult and time-consuming to acquire a true understanding of an underlying system, especially when that system was created as a collaboration between many different engineers who worked on it at different times, with different intentions and assumptions. Skilled engineers can piece together contextual information and read between the lines to work out what was intentional, what was careless, what is dangerous to touch for political reasons, and what is fair game. LLMs are at a huge disadvantage here because they only have access to the context provided to them, and humans do this funny thing where they keep the most important information in their heads and never write it down. When joining a new team, you have to creatively infer the true reasons behind people’s actions. This is a huge barrier for LLMs.
It is often debated whether LLMs have the ability to ‘understand’. This is partly a semantic debate, but in the context of building entire applications from scratch and treating them as a black box, the quantity in question is the human operator’s understanding, not the LLM’s. If you treat your application code as a black box and only observe the ends of the system, you will hit limits. The most common limit I have experienced when doing this myself is poor code organization. Since I am asking the LLM for things that could be considered “product requirements”, it’s no surprise that this happens. I have found that these limits are pushed back the more I proactively think about implementation architecture, but in some ways this defeats the point of using the LLM in the first place. For those with no experience architecting code, this is likely a common mode of interaction. State-of-the-art LLMs like Claude 3.5 Sonnet have a very high but not unlimited capacity to reason about complex code structures, and the more complexity you ask one to add, the more likely any given query is to mess something up. If you feed confusing, contradictory, flawed code back into the LLM, you’ll get even more errors in the new code. In other words, errors accumulate and compound multiplicatively.
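A crude way to see why this compounds (the numbers below are my own assumption, not a measurement): treat each LLM-driven edit as having some independent chance of quietly breaking something.

```python
# Toy model of multiplicative error accumulation: if each edit leaves the
# codebase coherent with probability p, the chance everything is still
# coherent after n rounds of "just fix it" is p**n.
p = 0.95  # assumed per-edit success rate, chosen purely for illustration
for n in [5, 10, 25, 50]:
    print(f"after {n:>2} edits: {p ** n:.0%} chance nothing has quietly broken")
```

Even a generous per-edit success rate erodes quickly when every new change is built on top of the last.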
Beyond poor organization, the real limit you will hit is your own lack of understanding. Let’s assume LLMs are infinitely capable at closed calculation: as long as they are provided with the appropriate context, they can solve the problem (if it is solvable at all). That’s the key issue: not everything is possible, and the requirements you dream up are likely to have flaws. Software engineering is about defining a domain language for the problem at hand and discovering the rules of the system you dream up. If you don’t understand the underlying system you are interacting with, you are liable to do all sorts of things that won’t make sense in hindsight. There are far more ways to be wrong than to be right; there are infinitely many more false statements than true ones. As you pile on features, the number of existing constraints each new feature touches grows, and the number of combinations you have to consider can grow exponentially. If you don’t understand the rules of the system, you’ll end up in an endless game of whack-a-mole, and you won’t be able to step back and see it.
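As a rough back-of-the-envelope sketch of that growth (my illustration, not numbers from any real project): if any subset of n features can interact, the space of combinations to reason about is 2^n, and even restricting attention to pairwise interactions grows quadratically.

```python
from math import comb

# Counting interactions among n features: pairwise interactions grow as
# n*(n-1)/2, while the number of possible feature subsets grows as 2**n.
for n in [5, 10, 20, 30]:
    print(f"{n:>2} features: {comb(n, 2):>4} pairwise interactions, "
          f"{2 ** n:>13,} possible combinations")
```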
LLMs are fundamentally unreliable tools (in similar ways that humans could be considered fundamentally unreliable). To use them to produce reason and truth, you must, ironically, engage with them cynically. With humans, you have to worry less about flawed reasoning and more about misaligned incentives, although flawed reasoning is still quite a problem.
LLMs are idea-suggestion, thought-mirroring, and guessing machines, but they resemble oracles.