Down and out with Cerebras Code
Monday, September 15, 2025, 11:00 AM, from InfoWorld
When a vendor offered 2000 tokens per second (TPS) of Qwen3-Coder-480B-A35B-Instruct (aka Qwen3 Coder) for $50 (Cerebras Code Pro) or $200 (Cerebras Code Max), I, like many, was spellbound. However, the offer was sold out almost instantaneously. When the next window opened up, I grabbed a Max plan immediately. Not shockingly, the 2k TPS claim is basically a lie.
As Adam Larson, who runs the YouTube channel GosuCoder, put it, “When you see speeds of up to 2000 tokens per second, what do you think you should get? Would you be happy with 1000, 500, 200, 100, 50, 25? Okay, at what point is this true? I’ve run a bunch of tests in different applications, hitting the API, and not once did I hit 2000 tokens per second. In fact, not once on any particular long test did I ever hit 500 tokens per second.” In his excellent review, Larson reports getting under 100 TPS “even on the small things.”

I don’t work like most developers who use large language models. My goal is autonomous code generation. I don’t really sit there and tell the LLM to “ok now write this.” Instead, I create detailed plans up front and have the model execute them. The recent spate of Claude Max limitations directly affected me. Suddenly, it wasn’t even four-hour windows of generation; it was two, and Anthropic has promised to lower my weekly and monthly intake as well.

Cerebras offered an out. Sure, Qwen3 Coder isn’t Claude Opus or even Sonnet, but I’d previously worked on adding SIMD support for Arm to Go using this model (I haven’t finished). The model is maybe Sonnet 3.7 in non-thinking mode, with some unpredictable bright moments where it sometimes outdoes Opus.

Out of Fireworks and into the fire

However, my start with Cerebras’s hosted Qwen was not the same as what I experienced (for a lot more money) on Fireworks, another provider. Initially, Cerebras’s Qwen didn’t even work in my CLI. It also didn’t seem to work in Roo Code or any other tool I knew how to use. After taking a bug report, Cerebras told me it was my code. My same CLI that worked on Fireworks, for Claude, for GPT-4.1 and GPT-5, for o3, and for Qwen hosted by Qwen/Alibaba was at fault, said Cerebras. To be fair, my log did include deceptive artifacts when Cerebras fragmented the stream, putting out stream parts as messages (which Cerebras still does on occasion). However, this has generally been their approach: don’t fix their so-called OpenAI compatibility; blame and/or adapt the client.

I took the challenge and adapted my CLI, but it took a lot of workarounds. This was a massive contrast with Fireworks. I had issues with Fireworks when it started, and when I showed them my debug output, they immediately acknowledged the problem (occasionally it would spit out corrupt, native tool calls instead of OpenAI-style output) and fixed it overnight. Cerebras repeatedly claimed their infrastructure was working perfectly and requests were all successful, in direct contradiction to most commentary on their Discord.

Feeling like I had finally cracked the nut after three weeks of on-and-off testing and adapting, I grabbed a second Cerebras Code Max account when the window opened again. This was after discovering that, for part of the time, Cerebras had charged me for a Max account but given me a Pro account. They fixed it but offered no compensation for the days my service was set to Pro, not Max, and it is difficult to prove because their analytics console is broken, in part because it reports usage in local time while the limits are enforced in UTC.

Then I did the math. One Cerebras Code Max account is limited to 120 million tokens per day at a cost equivalent to four times that of a Cerebras Code Pro account. The Pro account is 24 million tokens per day; multiply that by four and you get 96 million tokens. However, the Pro account is limited to 300k tokens per minute, compared to 400k for the Max.

Using Cerebras is a bit frustrating. For 10 to 20 seconds, it really flies, then you hit the cap on tokens per minute, and it throws 429 errors (too many requests) until the minute is up. If your coding tool is smart, it will just retry with an exponential back-off. If not, it will break the stream.
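If your tool doesn’t do this for you, the workaround is simple enough to sketch. Below is a minimal example, in Python with the requests library, of a 429-aware call against an OpenAI-compatible chat completions endpoint with exponential backoff. The base URL, model identifier, and environment variable name are my assumptions for illustration, not values Cerebras has confirmed.

    import os
    import time
    import requests

    BASE_URL = "https://api.cerebras.ai/v1"  # assumed OpenAI-compatible base URL
    MODEL = "qwen-3-coder-480b"              # hypothetical model identifier

    def chat_with_backoff(messages, max_retries=5):
        """POST a chat completion, retrying on HTTP 429 with exponential backoff."""
        headers = {"Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}"}
        payload = {"model": MODEL, "messages": messages}
        delay = 2.0  # seconds; doubles after each 429
        for _ in range(max_retries):
            resp = requests.post(f"{BASE_URL}/chat/completions",
                                 json=payload, headers=headers, timeout=120)
            if resp.status_code == 429:
                # Honor Retry-After if the server sends it; otherwise back off.
                time.sleep(float(resp.headers.get("Retry-After", delay)))
                delay *= 2
                continue
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        raise RuntimeError("still rate-limited after retries")

This is roughly what a “smart” coding tool does on your behalf; without it, the per-minute throttle shows up as broken streams rather than merely slower ones.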
So, had I bought four Pro accounts, I could have had 1,200,000 TPM in theory, a much better value than the Max account.

Other users in the Cerebras Discord channel were more upset by the limited context window. Cerebras limits Qwen3 Coder to 131k of context, a little more than half the native context the model supports. While this is a workable context size, it requires careful context management and tools that adapt to it. For perspective, Claude Code only recently began allowing larger context sizes; until recently, the max context was less than 200k.

To work with 131k, the prompts have to be thorough but compact. Tools have to prevent the model from biting off more than it can chew. Stock Roo Code is not going to be a great experience. In my opinion, 128k to 131k is the minimum viable context length for coding, and even then it is just barely feasible.

Qwen3 Coder is a very good model and the first open-weight model that is practically viable for code generation. However, it is non-thinking, which means it has trouble planning. This isn’t a Cerebras-specific issue; it is simply how this model works. Tools that provide Claude Code-like “todo lists” will perform better, but if you’re hoping to get Qwen to generate an autonomous plan or even coordinate one, the results could be disappointing.

Coding with Cerebras’s Qwen

I ran Qwen3 Coder in Cerebras Code Max to create an AI-driven autonomous generation plan and to execute it. For the test, I generated the quintessential AI-driven todo list app. I used my CLI tool LLxprt Code as integrated with the Zed IDE. The results were not terribly impressive: https://github.com/acoliver/todo-cerebras-qwen3-480. I had to create four after-the-fact realignment prompts (Qwen had forgotten to wire in things like the startup). It never actually implemented the LLM bits.

For comparison, this is the same app (https://github.com/acoliver/todo-claude) using the same process with Claude in LLxprt Code in Zed. You’ll note the plan is better, but the prompt to get to that plan was the same. With both Claude and Qwen, I have to do something I call “plan gardening.” After the model creates the plan, I open a new session to evaluate and correct it until it is in good form. After implementation, I gave Claude four realignment prompts (the outputted application still doesn’t store context for the todo chat). Claude did initially fake the LLM integration but fixed the app within four realignment prompts, the same number as Qwen, so I accepted that as equal.

I hit my daily limit on Cerebras Code when doing this. I did not hit my limit, even the four-hour one, on Claude. Cerebras also took longer to generate the todo app because of the throttles. I didn’t measure the exact time, but with Cerebras, I started mid-day and generating the app took until night. Claude took maybe an hour or two; I was busy writing this while the app was generating.

I didn’t run this test generation with Qwen3 Coder on another provider to see how well the model performs on Cerebras vs. alternatives. However, anecdotally speaking, Cerebras’s Qwen appears to be less effective than the Fireworks pay-by-token version and seems slightly inferior to the free hosted version offered by Qwen/Alibaba themselves.
This tracks with Larson’s observation of about an 8% drop in performance in his evaluation.

Cerebras promises and user response

Cerebras is still entrepreneuring this system. They created their own Model Context Protocol (MCP) server for use with Claude Code. The idea is that you use Claude to plan and Cerebras’s Qwen3 Coder to write code. Some users have reported good results, others less so.

Cerebras has also started promoting a CLI by Michael Pfaffenberger called Code Puppy. Pfaffenberger, to his credit, was the first to get his CLI to work stably with Cerebras, and he has been one of the strongest third-party advocates for the company. As he told me in a private message on Discord:

“It’s been a decent experience overall. The limits are pretty unfortunate, but I like the company a lot, so I am willing to overlook the ‘false advertising’ for now. I do not like Claude Code as a CLI very much. We’re in an era where we can vibe code our own tools right now. The fact that I can’t use my own CLI without breaking their OAuth is a huge turn-off. Cerebras doesn’t [care] what I use.”

Even Pfaffenberger was quick to point out that the experience has been “less than we hope—we need a higher context window… The limits are a speed bump that diminishes their main value proposition. For some reason, these mega chips with 900k cores seem to be limited in RAM size… but I may not fully understand the architecture. Overall, I’m going to continue using it, b/c I think it has a good future.”

Other users, such as a developer by the handle of diegonix, have been less positive and see Cerebras’s issues as a symptom of an overall industry problem:

“Companies are launching more and more AI products, but they don’t care about users. They just want to dig into investors’ pockets and burn their money. I have a Windsurf account. The owner rushed to sell it, Cognition bought it, and now Windsurf is abandoned. I have the OpenAI business plan, and it’s an AI bot that assists me with support. I’ve been waiting for two days for human support. Cerebras, you saw there, a lack of transparency and a misaligned product. Groq, poor guys, are suffering, unable to serve the base they already have. And the most emblematic case was Anthropic, which suffered for months with an intelligence problem (the models were dumb) and kept it a secret for months. In that period, there was an exodus of users due to the lack of transparency in the Pro and Max plans. What did they do? Series F round, instead of focusing on the current passionate users.”

For their part, Cerebras has been promising prompt caching, and they seem to have started rolling it out. I’m not optimistic about their implementation, because they appear to be jury-rigging it into the Chat Completions API rather than using the more appropriate Responses API from OpenAI (which supports this natively). Fireworks, in contrast, has Responses API support for some models (but no tool calling outside of MCP, which is strange).

Why would users care about prompt caching? Well, it could be faster, but execution speed isn’t the real issue; the TPM throttle is. Cerebras might also choose not to count cached tokens against your limit. However, the company has not stated that this is their intention, just that they’re working on it and that it will somehow answer the problems users have had.

Is Cerebras Code worth it?

Honestly, the verdict is still out. It took me a long time to get Cerebras Code working correctly in any tool I work in.
Others have claimed more success, but most are not trying to do full autonomous development like I am. Pfaffenberger himself is only using the Pro plan from Cerebras for non-work stuff. “If I weren’t using Anthropic models in Vertex AI at work, I would not be able to use Cerebras as my sole solution,” he told me.

For my purposes, if I use Claude to plan, I’m able to get somewhat decent results from Qwen on Cerebras. Since Cerebras un-downgraded me to the Max plan that I paid for, I haven’t hit my daily limit. But Cerebras Code Max is not faster than Claude, given the TPM limit. I guess I’m still paying for hope this month.

I think Larson said it best:

“I love this. I’ve been talking about someone providing a plan like this for a very long time. I’m just not a fan of how they’ve rolled this out—from me hitting my limit [for the day] in 41 minutes without even being able to get a single task done in an existing code base to now, when I start deep diving into what they’re promising, I’m starting to get skeptical about everything.”

The bottom line: really promising technology in this model and a really compelling subscription, but disappointing execution, terrible transparency, and perhaps even a tendency to be deceptive. I’ll probably hold on for a month or so with “cautious pessimism” and hope they change their approach and correct their offering.

In any case, I think there is a market for honesty. There is another world where Cerebras said, “Hey, we’re building something and it won’t be perfect, but we’re hoping to achieve X outcome and we’ll give you Y tokens per minute and Z per day,” and said, “Yes, we know there are problems with our compatibility, and here are the problems and here is how we’re fixing them.” There is a world where Cerebras evaluated Cerebras Code against Claude Code with Sonnet and made sure it outperformed Claude cost-wise at both the Pro and Max price points. In this alternate reality, Cerebras acknowledged these issues (and fixed their dang usage console) and simply comped anyone who had problems, while being clear about what was being improved and how it would be improved next.

The thing is, developers understand the hiccups and bugs of a developing product, and they will bear with you if you are open, honest, and treat them fairly. That is something Anthropic isn’t doing. Anyone who follows Cerebras’s playbook but treats developers fairly will likely win hearts and minds, not just users, instead of earning the perception of “a lack of transparency and a misaligned product.”

Cerebras was given an opportunity to comment but declined.
https://www.infoworld.com/article/4055909/down-and-out-with-cerebras-code.html