What you absolutely cannot vibe code right now
Tuesday, July 8, 2025, 11:00 AM, from InfoWorld
LinkedIn has become the new Twitter now that Twitter is… well, X. LinkedIn is a place of shockingly bold claims. One person claimed to be so confident in agentic development that they were going to generate their own operating system on the level of iOS or Android. Ever the iconoclast, I pointed out that there was no chance they would ever publish or install it.
Another pitchman promoted the idea that large language models (LLMs) are producing more and higher-quality pull requests (PRs) than humans, based on the number of PRs on a tool and their acceptance rate. I pointed out that this couldn't possibly be true. I wasn't motivated to write something to classify them, but I did sample about 20. It turned out that the dashboard our enthusiast was looking at was picking up mainly people's private projects, where they are basically auto-approving whatever the LLMs send (YOLO style), and a large number of the commits are LLM-style "everything that didn't need to be said" documentation. Or as one person accepting the merge put it, "Feels like lot of garbage added — but looks relavant [sic]. Merging it as baseline, will refine later (if applicable)."

Don't get me wrong, I think you should learn to use LLMs in your everyday coding process. And if any of the statistics or reported numbers are accurate, most of you already are, at least to some degree. However, I also think it is essential not to misrepresent what LLMs can currently do and what is beyond their capabilities at this point.

As mentioned in previous posts, all the current LLM-based tools are somewhat limiting and, frankly, annoying. So I'm writing my own. Honestly, I expected to be able to flat-out vibe code and generate the patch system. Surely the LLM knows how to make a system to accept patches from an LLM. It turns out that nothing could be further from the truth.

First of all, diffing and patching are one of those deceptively complex areas of computing, a lesson I had forgotten. Secondly, writing a patch system to accept patches from something that isn't very good at generating clean patches is much more complicated than writing one for something that produces patches with a clean algorithm. Generating a patch system that accepts patches from multiple models, each with its own quirks, is very challenging. It was so hard that I gave up and decided to find the best one and just copy it.

Trial and errors

The best patch system is Aider AI's. They publish benchmarks for every LLM, evaluating how well each generates one-shot patches. Their system isn't state-of-the-art; it doesn't even use tool calls. It's largely hand-rolled, hard-won Python. The obvious thing to do was to use an LLM to port this to TypeScript, enabling me to use it in my Visual Studio Code plugin. That should be simple. Aside from that part, Aider had already figured it out: it's a bunch of string utilities. There is no Pandas. There is no MATLAB. It is simply string replacement.

I also wanted to benchmark OpenAI's o3 running in Cursor against Anthropic's Claude Opus 4 running in Claude Code. I had both of them create plans and critique each other's plans. To paraphrase o3, Opus's plan was overcomplicated and destined to fail. To paraphrase Claude Opus, o3's code was too simplistic, and the approach pushed all the hard stuff to the end and was destined to fail. Both failed miserably.

In the process, I lost faith in Claude Opus's ability to notice a simple problem and created a command-line tool I called asko3 (which later became "o3Helper") so that Claude could just ask o3 before it made any more mistakes. I lost faith in Cursor's ability to keep its back end running and reply to any requests, so o3 in Cursor lost by default. Onward with the next combo: standalone Claude Opus 4 advised by standalone o3. That plan also failed miserably. o3 suggested that Opus had created a "cargo cult" implementation (its term, not mine) of what Aider's algorithm did.
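Before going further, it helps to make concrete what a patch applier even has to cope with. Below is a minimal TypeScript sketch of one common approach in LLM editing tools: search-and-replace edit blocks, applied with an exact match first and a whitespace-tolerant fallback second. This is my own illustration of the general idea, under assumed names and formats; it is not Aider's implementation and not the code that ended up in my plugin.

```typescript
// Minimal sketch of applying one search/replace edit block to a file's text.
// The block shape, the fallback strategy, and the names are illustrative assumptions.

interface EditBlock {
  search: string;   // text the model claims is currently in the file
  replace: string;  // text that should take its place
}

function applyEditBlock(source: string, edit: EditBlock): string {
  // 1. Exact match: the cheap, safe case. String.replace with a string
  //    pattern touches only the first occurrence, which is what we want.
  if (source.includes(edit.search)) {
    return source.replace(edit.search, edit.replace);
  }

  // 2. Whitespace-tolerant fallback: models often mangle indentation or
  //    trailing spaces even when the code itself is right. Compare line by
  //    line on trimmed text and splice the replacement in at the match.
  const srcLines = source.split("\n");
  const searchLines = edit.search.split("\n");
  for (let i = 0; i + searchLines.length <= srcLines.length; i++) {
    const window = srcLines.slice(i, i + searchLines.length);
    const matches = window.every(
      (line, j) => line.trim() === searchLines[j].trim()
    );
    if (matches) {
      return [
        ...srcLines.slice(0, i),
        ...edit.replace.split("\n"),
        ...srcLines.slice(i + searchLines.length),
      ].join("\n");
    }
  }

  // 3. Refuse to guess further; a failed patch should fail loudly.
  throw new Error("Edit block did not match the file contents");
}
```

Even this toy version shows where the difficulty lives: deciding what to do when the model's search text almost, but not quite, matches the file.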
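An aside on asko3: a helper like that can be tiny. The sketch below is only a guess at its general shape, a Node/TypeScript script that forwards a question to OpenAI's API and prints the reply so another agent can shell out to it. The package usage is the standard openai client, but the model name, prompt, and everything else here are assumptions, not the actual tool.

```typescript
// Hypothetical "ask o3" command-line helper, not the actual asko3/o3Helper.
// Usage: node asko3.js "Does this patch-application plan handle overlapping edits?"
// Assumes the official `openai` npm package and OPENAI_API_KEY in the environment.

import OpenAI from "openai";

async function main(): Promise<void> {
  const question = process.argv.slice(2).join(" ").trim();
  if (!question) {
    console.error("usage: asko3 <question>");
    process.exit(1);
  }

  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const completion = await client.chat.completions.create({
    model: "o3", // assumed model identifier
    messages: [{ role: "user", content: question }],
  });

  console.log(completion.choices[0]?.message?.content ?? "(no reply)");
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```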
o3 also suggested that the system I use for creating plans was part of the problem. Instead, I created a single-document plan. Then I had o3 do most of the implementation (from inside Claude Code). It bungled it completely. I had Claude ask o3 to review its design without telling it that it was its own design. It eviscerated it. Claude called the review "brutal but accurate."

Finally, I still needed my patch system to work and really didn't care to hand-code the TypeScript. I had Claude copy the comments over from Aider's implementation and create a main method that served as a unit test. Then I had Claude port each method over, one at a time (the sketch at the end of this article gives a flavor of that work). When something failed, I suggested a realignment, method by method. I reviewed each decision, and then we reviewed the entire process — success.

This was as far from vibe coding as you can be. It wasn't much faster than typing it myself. And this was just a patch algorithm. The fellow hoping to "generate an operating system" faces many more challenges.

LLMs are trained on a mountain of CRUD (create, read, update, delete) code and web apps. If that is what you are writing, then use an LLM to generate virtually all of it — there is no reason not to. If you get down into the dirty weeds of an algorithm, you can generate it in part, but you'll have to know what you're doing and constantly re-align it. It will not be simple.

Good at easy

This isn't just me saying this; it is what studies show as well. LLMs fail at hard and medium-difficulty problems where they can't stitch together well-known templates. They also have a kind of half-life and fail as problems get longer. Despite o3's (erroneous, in this case) supposition that my planning system caused the problem, that system succeeds most of the time by breaking the problem into smaller parts and forcing the LLM to align to a design without having to understand the whole context. In short, I give it small tasks it can succeed at.

However, one reason the models failed here is that, despite all the tools that have been created, there are only about 50 patch systems out there in public code. With so few examples to learn from, they inferred that unified diffs might be a good way to go (they generally aren't). For web apps, there are many, many examples. They know that field very well.

What to take from this?

- Ignore the hype. LLMs are helpful, but truly autonomous agents are not developing production-level code, at least not yet.
- LLMs do best at repetitive, well-understood areas of software development (which are also the most boring).
- LLMs fail at novel ideas or real algorithmic design. They probably won't (by themselves) succeed anywhere there aren't a lot of examples on GitHub.

What not to take from this?

- Don't conclude that LLMs are totally useless and that you must be a software craftsman who lovingly hand-codes your CSS and HTML and repetitive CRUD code like your pappy before you.
- Don't think that LLMs are useless if you are working on a hard problem. They can help; they just can't implement the whole thing for you. I didn't have to search for the name of every TypeScript string library that matched the Python libraries; the LLM did that for me. Had I started with that as the plan, it would have gone quickly.

If you're doing a CRUD app, doing something repetitive, or tackling a problem domain where there are lots of training materials out there, you can rely on the LLMs. If you're writing an operating system, then you will need to know how to write an operating system, and the LLM can type for you.
Maybe it can do it in Rust where you did it last time in C, because you know all about how to write a completely fair scheduler. If you’re a full-stack Node.js developer, you will not be (successfully) ChatGPT-ing an iOS alternative because you are mad at Apple.
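For a flavor of the grunt work the LLM did handle well in that port, mapping Python string idioms onto TypeScript, here is a small illustrative sketch in the spirit of the workflow above: comments record the Python behavior, the TypeScript equivalents live underneath, and a throwaway main doubles as the unit test. The helpers and names are invented for illustration; none of this is the actual plugin code.

```typescript
// Illustrative stand-ins for a Python-to-TypeScript port, not the real plugin code.
// Each comment records the Python idiom the TypeScript is meant to match.

// Python: text.splitlines()  (here handling \n, \r\n, and \r)
function splitLines(text: string): string[] {
  return text.split(/\r\n|\r|\n/);
}

// Python: line.rstrip()  (strip trailing whitespace only)
function rstrip(line: string): string {
  return line.replace(/\s+$/, "");
}

// Roughly Python's textwrap.dedent: drop the common leading indentation
function dedent(text: string): string {
  const lines = splitLines(text);
  const indents = lines
    .filter((line) => line.trim().length > 0)
    .map((line) => line.length - line.trimStart().length);
  const common = indents.length > 0 ? Math.min(...indents) : 0;
  return lines.map((line) => line.slice(common)).join("\n");
}

// A throwaway main that doubles as the unit test while porting method by method.
function main(): void {
  console.assert(splitLines("a\r\nb\nc").join(",") === "a,b,c", "splitLines");
  console.assert(rstrip("code();   ") === "code();", "rstrip");
  console.assert(dedent("    if x:\n      y()").startsWith("if x:"), "dedent");
  console.log("all checks passed");
}

main();
```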
https://www.infoworld.com/article/4018235/what-you-absolutely-cannot-vibe-code-right-now.html