Comparing the AI code generators
Thursday, May 8, 2025, 11:00 AM, from InfoWorld
Every developer has now pasted code into ChatGPT or watched GitHub Copilot autocomplete a function. If that’s your only exposure, it’s easy to conclude that coding with large language models (LLMs) isn’t “there yet.” In practice, model quality and specialization are moving so fast that the experience you had even eight weeks ago is already out of date. OpenAI, Anthropic, and Google have each shipped major upgrades this spring, and OpenAI quietly added an “o-series” of models aimed at reasoning.
Below is a field report from daily production use across five leading models. Treat it as a snapshot, not gospel; by the time you read this, a point release may have shuffled the rankings again.

OpenAI GPT-4.1: UI whisperer, not my main coder

OpenAI's GPT-4.1 replaces the now-retired GPT-4.5 preview, offering a cheaper, lower-latency one-million-token context and better image-to-spec generation. It's still solid at greenfield scaffolding and turning screenshots into code, but when the task is threading a fix through a mature code base, it loses track of long dependency chains and unit-test edge cases.

When to call it: Design-system mock-ups, API documentation drafts, converting UI comps into component stubs.
When to skip it: After your initial scaffold.

Anthropic Claude 3.7 Sonnet: The dependable workhorse

Anthropic's latest Sonnet model is still the model I reach for first. It strikes the best cost-to-latency balance, keeps global project context in its 200k-token window, and rarely hallucinates library names. On tough bugs, it sometimes "cheats" by adding what it calls "special case handling" to the code under test (watch for if (id === 'TEST_CASE_1 data')-style patches). Sonnet also has a habit of disabling ESLint or TypeScript checks "for speed," so keep your linter on.

Sweet spot: Iterative feature work, refactors that touch between five and 50 files, reasoning over build pipelines.
Weak spot: Anything visual, CSS fine-tuning, unit test mocks.
Tip: grep your code for the string "special case handling" (a minimal commit-time check is sketched below).
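That grep tip is easy to automate. What follows is a minimal sketch, not the author's tooling: a Node/TypeScript pre-commit check that scans the staged diff for the shortcuts described above (special-case handling, disabled ESLint rules, silenced TypeScript checks). The pattern list is an assumption; extend it for your own repo.

// check-llm-diff.ts -- scan the staged diff for the patterns this article warns about.
// A sketch only: the patterns are assumptions, not the author's tooling.
import { execSync } from "node:child_process";

const suspectPatterns: RegExp[] = [
  /special case handling/i,            // Sonnet's tell-tale comment
  /TEST_CASE/,                         // hard-coded test IDs leaking into production paths
  /eslint-disable/,                    // lint rules switched off "for speed"
  /@ts-(ignore|nocheck|expect-error)/, // type checks silenced "temporarily"
];

// Look only at lines being added in the staged diff.
const staged = execSync("git diff --cached --unified=0", { encoding: "utf8" });
const hits = staged
  .split("\n")
  .filter((line) => line.startsWith("+") && !line.startsWith("+++"))
  .filter((line) => suspectPatterns.some((p) => p.test(line)));

if (hits.length > 0) {
  console.error("Suspicious LLM-generated lines in the staged diff:");
  for (const line of hits) console.error("  " + line);
  process.exit(1); // block the commit; review the patch by hand
}

Wire something like this into a pre-commit hook (via husky or a plain .git/hooks/pre-commit script, for example) and the "temporary" shortcuts get flagged before they land.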
Google Gemini 2.5 Pro-Exp: The UI specialist with identity issues

Google's Gemini 2.5 release ships a one-million-token context (two million promised) and is currently free to use in many places (I've yet to be charged for API calls). It shines at UI work and is the fastest model I've used for code generation. The catch: if your repo uses an API that changed after its training cutoff, Gemini may argue with your "outdated" reality, sometimes putting your reality in scare quotes. It also once claimed that something in the log wasn't possible because it occurred in the "future."

Use it for: Dashboards, design-system polish, accessibility passes, quick proof-of-concept UIs.
Watch out for: Confident but wrong API calls and hallucinated libraries. Double-check any library versions it cites.

OpenAI o3: Premium problem solver, priced accordingly

OpenAI's o3 (the naming still confuses people who expect "GPT") is a research-grade reasoning engine. It chains tool calls, writes analyses, and will pore over a 300-test Jest suite without complaint. It is also gated (I had to show my passport for approval), slow, and costly. Unless you're on a FAANG-scale budget, or you're stuck on a bug you cannot resolve yourself, o3 is a luxury, not a daily driver.

OpenAI o4-mini: The debugger's scalpel

The surprise hit of April is o4-mini: a compressed o-series variant optimized for tight reasoning loops. In practice it's 3-4× faster than o3, still expensive via the OpenAI API, but available free (though throttled) in several IDEs. Where Claude stalls on mocked dependencies, o4-mini will reorganize the test harness and nail the bug. The output is terse, which is surprising for an OpenAI model (https://openai.com/index/sycophancy-in-gpt-4o/).

Great for: Gnarly generics, dependency-injection edge cases, mocking strategies that stump other models.
Less ideal for: Bulk code generation or long explanations. You'll get concise patches, not essays.

Multi-model workflow: A practical playbook

1. Explore UI ideas in ChatGPT using GPT-4.1. Drop your slide deck and ask it to generate mockups. Remind your code generator that DALL-E does some weird things with words.
2. Create your initial specification with Claude in thinking mode. Ask another LLM to critique it, then ask for an implementation plan in steps. Sometimes I ask o4-mini whether the spec is enough for an LLM to follow in a clean context.
3. Scaffold with Gemini 2.5. Drop in sketches, then gather the React or Flutter shell and the overall structure.
4. Flesh out logic with Claude 3.7. Import the shell and have Sonnet fill in the controller logic and tests.
5. Debug or finish the parts Claude missed with o4-mini. Let it redesign mocks or type stubs until the tests pass.

This "relay race" keeps each model in its lane, minimizes token burn, and lets you exploit free-tier windows without hitting rate caps. (A rough API-level sketch of the relay appears at the end of this article.)

Final skepticism (read before you ship)

LLM coding still demands human review. All four models occasionally:

Stub out failing paths instead of fixing root causes.
Over-eagerly install transitive dependencies (check your package.json).
Disable type checks or ESLint guards "temporarily."

Automated contract tests, incremental linting, and commit-time diff review remain mandatory. Treat models as interns with photographic memory: they're excellent pattern matchers, terrible at accountability. (Author's note: Ironically, o3 added this part when I asked it to proofread, but I liked it so much I kept it.)

Bottom line

If you tried GitHub Copilot in 2024 and wrote off AI coding, update your tool kit. Claude 3.7 Sonnet delivers day-to-day reliability, Gemini 2.5 nails front-end ergonomics, and o4-mini is the best pure debugger available, provided you can afford the tokens or you have a lot of patience. Mix and match. You can always step in when a real brain is required.
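For readers who want to experiment with scripting the relay race described above, here is a minimal sketch that drives the same hand-offs through the providers' public REST endpoints instead of chat UIs. Everything in it is an assumption rather than the author's setup: the model identifiers (claude-3-7-sonnet-latest, gemini-2.5-pro-exp-03-25, o4-mini), the prompts, and the environment-variable names. Check each provider's current documentation before relying on any of it.

// relay.ts -- a rough sketch of the "relay race" above, driven through public REST
// endpoints instead of chat UIs. Model identifiers, prompts, and env-var names are
// assumptions; step 1 (slide-deck mockups) is omitted because it is interactive.

async function askOpenAI(model: string, prompt: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function askClaude(prompt: string): Promise<string> {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY ?? "",
      "anthropic-version": "2023-06-01",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-7-sonnet-latest", // assumed alias; pin a dated version in practice
      max_tokens: 4096,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.content[0].text;
}

async function askGemini(prompt: string): Promise<string> {
  const model = "gemini-2.5-pro-exp-03-25"; // assumed identifier
  const url =
    `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent` +
    `?key=${process.env.GEMINI_API_KEY}`;
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
  });
  const data = await res.json();
  return data.candidates[0].content.parts[0].text;
}

// Steps 2-5 of the playbook: spec and logic with Claude, scaffold with Gemini,
// debugging with o4-mini. The prompts are placeholders, not the author's.
async function relay(featureRequest: string) {
  const spec = await askClaude(`Write a step-by-step implementation plan for: ${featureRequest}`);
  const shell = await askGemini(`Generate a React component shell for this spec:\n${spec}`);
  const logic = await askClaude(`Fill in the controller logic and unit tests for:\n${shell}`);
  const review = await askOpenAI("o4-mini", `Find and fix likely test failures in:\n${logic}`);
  return { spec, shell, logic, review };
}

relay("a settings page with per-user feature flags").then((r) => console.log(r.review));

In practice, the playbook above runs through chat UIs and IDE integrations rather than raw API calls, so treat this purely as a starting point for automation experiments.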
https://www.infoworld.com/article/3980273/comparing-the-ai-code-generators.html