Cerebras Code now supports GLM 4.6 at 1000 tokens/sec(www.cerebras.ai)

189 pointsby nathabonfim592 days ago28 comments

andai2 days ago
I have been using Z.ai's (creators of GLM) "Coding Plan" with GLM-4.6. $3/month and 3x higher limits than Claude Pro, they say.
(I have both and haven't run into any limits yet, so I'm probably not a very heavy user.)
I'm quite impressed with the model. I have been using GLM-4.6 in Claude Code instead of Sonnet, and finding it fine for my use cases. (Simple scripting and web stuff.)
(Note: Z.ai's GLM doesn't seem to support web search (fails) or image recognition (hallucinates). To fix that I use Claude Code Router and hooked up those two features to Gemini (free) instead.)
I find that Sonnet produces much nicer code. I often find myself asking Sonnet to clean up GLM's code. More recently, I just got the Pro plan for Claude so I'm mostly just using Sonnet directly now. (Haven't had the rate limits yet but we'll see!)
So in my experience if you're not too fussy you can currently get "80% of Claude Code" for like $3/month, which is pretty nuts.
GLM also works well in Charm Crush, though it seems to be better optimized for Claude Code (I think they might have fine tuned it.)
---
I have tested Kimi K2 at 1000 tok/s via OpenRouter, and it's bloody amazing, so I imagine this supercharged GLM will be great too. Alas, $50!
- realo2 days ago
  I just asked glm-4.6 how to setup a z.ai api key with claude code and it kept on saying it has no idea what claude code is...
  Quite funny, actually.
  - andai2 days ago
    Ask and you shall receive!
    https://docs.z.ai/devpack/tool/claude
    tldr
    "env": { "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key", "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic" }
    Although if you want an Actually Good Experience I recommend using Claude Code Router
    https://github.com/musistudio/claude-code-router
    because it allows you to intercept the requests and forward them to other models. (e.g. GLM doesn't seem to support search or images, so I use Gemini's free tier for that.)
    (CCR just launches Claude with the base url set to a local proxy. The more adventurous reader can also set up his own proxy... :)
    brianjking2 days ago
    Does this allow me to use Claude Code as the orchestration harness with GLM 4.6 as the LLM along with other LLMs? Seems so based on your description, thanks for the link.
    andai2 days ago
    Explain like I'm 5?
    realo2 days ago
    Benchmark score
    Humans : 1
    AI : 0
- raihansaputraa day ago
  actually the web search/image needs the higher tier subscription with their mcp. k gladly paid for it, i basically don't care about usage anymore. it's great at execution. i still find it not so great at planning though. i wish i can use api sonnet/codex for planning and glm on claude code for executing easily, with codex/claude code. best thing right now is kilo code with the orchestrator but haven't tried it.
mythz2 days ago
AI moves so fast that Vibe Coding still has a negative stigma attached to it, but even after 25 years of development, I'm not able to match the productivity of getting AI to implement the features I want. It's basically getting multiple devs to set out and go do work for you where you just tell them what you want and provide iterative feedback till they implement all the features you want, in the way you want and to fix all the issues you find along the way, which they can create tests and all the automated and deployment scripts for.
This is clearly the future of Software Development, but the models are so good atm that the future is possible now. I'm still getting used to and having to rethink my entire dev workflow for maximum productivity, and whilst I wouldn't unleash AI Agents on a decade old code base, all my new Web Apps will likely end up being AI-first unless there's a very good reason why it wouldn't provide a net benefit.
- dust422 days ago
  It just depends on what you are doing. A green field react app in typescript with a CRUD API behind? The LLMs are a mind blowing assistant and 1000t/s is crazy.
  You are doing embedded development or anything else not as mainstream as web dev? LLMs are still useful but no longer mind blowing and often produce hallucinations. You need to read every line of their output. 1000t/s is crazy but no longer always in a good way.
  You are doing stuff which the LLMs haven't seen yet? You are on your own. There is quite a bit of irony in the fact that the devs of llama.cpp barely use AI - just have a look at the development of support for Qwen3-Next-80B [1].
  [1] https://github.com/ggml-org/llama.cpp/pull/16095
  - RealityVoid2 days ago
    > You are doing embedded development or anything else not as mainstream as web dev?
    Counterpoint, but also kind of reinforcing your point. It depends on the kind of embedded development. I did a small utility PCB with an ESP32, and their libs are good there is active community, they have test frameworks. LLM's did a great job there.
    On the other hand, I wanted to drive a timer and a pwm module and a DMA dma engine to generate some precise pulses. The way I chained hw was... Not typical, but it was what I needed and the hw could do it. At that, Claude failed miserably and it only led me to waste my time, so I had to spend the time to do it manually.
  - koito172 days ago
    > You are doing embedded development or anything else not as mainstream as web dev? LLMs are still useful but no longer mind blowing and often produce hallucinations.
    I experienced this with Claude 4 Sonnet and, to some extent, gpt-5-mini-high.
    When able to run tests against its output, Claude produces pretty good Rust backend and TypeScript frontend code. However, Claude became borderline unproductive once I started experimenting with uefi-rs. Other LLMs, like gpt-5-mini-high, did not fare much better, but they were at least capable of admitting lack of knowledge. In particular, GPT-5 would provide output akin to "here is some pseudocode that you may be able to adapt to your choice of UEFI bindings".
    Testing in a UEFI environment is quite difficult; the LLM can't just run `cargo test` and verify its output. Things get worse in embedded, because crates like embedded_hal made massive API changes between 0.2 and 1.0 (the latest version), and each LLM I've tried seems to only have knowledge of 0.2 releases. Also, for embedded, forget even thinking about testing harnesses (which at least exist in some form with UEFI, it's just difficult to automate the execution and output for an LLM). In this case, you cannot really trust the output of the LLM. To minimize risk of hallucination, I would try maintaining data sheets and library code in context, but at that point, it took more time to prompt an LLM than handwrite code.
    I've been writing a lot of embedded Rust over the past two weeks, and my usage of LLMs in general decreased because of that. Currently planning to resume development on some of my "easier" projects, since I have about 300 Claude prompts remaining in my Zed subscription, and I don't want them to go to waste.
    miki1232112 days ago
    This is where Rust's "if it compiles, it's probably correct" philosophy may come in handy.
    "Shifting bugs left" is even more important for LLMs than it is for humans. There are certain tests LLMs can't run, so if we can detect bugs at compile time and run the LLM in a loop until things compile, that's a significant benefit.
    nathan_compton2 days ago
    My recent experience is that llms are dogshit at rust, though, unable to correct bugs without inserting new ones, going back and forth fixing and breaking the same thing, etc.
    energy1232 days ago
    A while ago I gathered every HN comment going back a year that contains Rust and LLM and about half are positive and half are negative.
    embedding-shape17 hours ago
    Sounds like the general "LLMs are net useful or not" sentiment here too. Personally Rust+LLMs work great, and workflow is rapid for as long as you can get the LLM to run one command to say "good or bad" without too much manually work, then it can iterate until it all works. Standard advice for prompting like "Don't make tests pass by changing assertions" tends to make the experience better too, but that's not Rust specific either.
    manmal2 days ago
    Aren’t we all though?
    RealityVoid2 days ago
    > Also, for embedded, forget even thinking about testing harnesses (which at least exist in some form with UEFI, it's just difficult to automate the execution and output for an LLM).
    I think this doesn't have to be like this and we can do better for this. If LLMs keep this up, good testing infrastructure might become more important.
    koito172 days ago
    One of my expectations for the future is the development of testing tools whose output is "optimized" in some way for LLM consumption. This is already occurring with Bun's test runner, for instance.[0] They are implementing a flag in the test runner so that the output is structured and optimized for token count.
    Overall, I agree with your point. LLMs feel a lot more reliable when a codebase has thorough, easy-to-run tests. For a similar reason, I have been drifting towards strong, statically-typed languages. Both Rust and TypeScript have rich type systems that can express many kinds of runtime behavior with just types. When a compiler can make strong guarantees about a program's behavior, I assume that helps nudge the quality of LLM output a bit higher. Tests then help prevent silly regressions from occurring. I have no evidence for this besides my anecdotal experience using LLMs across several programming languages.
    In general, I've had the best experience with LLMs when there's plenty of static analysis (and tests) on the codebase. When a codebase can't be easily tested, then I get much less productivity gains from LLMs. So yeah, I'm all for improving testing infrastructure.
    [0] https://x.com/jarredsumner/status/1944948478184186366
  - lifthrasiir2 days ago
    There aren't many things that LLMs haven't really seen yet, however. I have successfully used LLMs to develop a large portion of WebAssembly 3.0 interpreter [1], which surely aren't in their training set because WebAssembly 3.0 was only released months ago. Sure, it took me tons of guidance but it was useful enough for me.
    Even llama.cpp is not a truly novel thing to LLMs, there are several performant machine learning model executors available in their training sets anyway, and I'm sure llama.cpp can benefit from LLMs if they want; they just chose not to.
    [1] https://github.com/lifthrasiir/wah/
  - a day ago
    undefined
  - almostgotcaught2 days ago
    I've said it before but no one takes it seriously: LLMs are only useful if you're building something that's already in the training set ie already commodity. In which case why are you building it???
    antonvs2 days ago
    The obvious point that you're missing is that there are literally infinite ways to assemble software systems from the pieces that an LLM is able to manipulate due to its training. With minor guidance, LLMs can put together an unlimited number of novel combinations. The idea that the entire end product has to be in the training set is trivially false.
    whiterook62 days ago
    It's not that the product you're building is a commodity. It's that the tools you're using to built it are. Why not build a landing page using HTML and CSS and tailwind? Why not use swift to make an app? Why not write an AWS lambda using JavaScript?
    mythz2 days ago
    "LLMs are only useful..."
    Is likely why no one takes you seriously, as it's a good indication you don't have much experience with them.
    2 days ago
    undefined
    energy1232 days ago
    Bell Labs should have fired all their toilet cleaners. Nothing innovative about a toilet.
    philipp-gayret2 days ago
    It's true, when I was working with LLMs on a novel idea it said sorry I can't help you with that!
    nurettin2 days ago
    Do you avoid writing anything that the programming community has ever built? How are you alive???
    becquerel2 days ago
    Because I'm getting paid to.
    dboreham2 days ago
    Historically big AI skeptic here: what you say is very not true now. LLMs aren't just regurgitating their training data per se. I've used LLMs on languages the LLM has not seen, and it performed well. I've used LLMs on code that is about as far from a React todo app as it's possible to get.
- kaspermarstal2 days ago
  We need a new term for LLMs actually solving a hard problems. When I help Claude Code solve a nasty bug it doesn’t feel like “vibing” as in “I tell the model what I want the website to look like”. It feels like sniping as in “I spot for Claude Code, telling how to adjust for wind, range, and elevation so it can hit my far away target”.
  - scosman2 days ago
    From what I recall of the original Karpathy definition, it’s only “vibe coding” if you aren’t reading the code it produces
    lukan2 days ago
    Yes, I vote for keeping that definition and not throw it all into a box. LLM assisted coding is not vibe coding.
    kaspermarstala day ago
    My point exactly, it is not vibe coding so it should not be called vibe coding. What should we call it then?
    embedding-shape17 hours ago
    LLM-assisted Development. Something that for me works in practice, vibe-coding never did, you really need to carefully review and steer constantly if things are to work out longer than just a few features.
    JimDabell2 days ago
    You’re right. It’s explicitly about not caring about the code:
    > There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
    — https://x.com/karpathy/status/1886192184808149383
    kaspermarstal2 days ago
    Cool, I did not know that. That makes perfect sense.
  - fuzzy_biscuit2 days ago
    So we're the spotter in that metaphor. I like it!
    riskable2 days ago
    "spotter coding" or perhaps "checker coding"?
    "verified vivisection development" when you're working with older code :D
  - demarq2 days ago
    - backseat engineer
    - keyboard princess
    - Robin to the Batman
    - meatstack engineer
    - artificial manager
    kaspermarstal15 hours ago
    Keyboard Princess is good, Artificial Manager is even better.
- epolanskia day ago
  > AI moves so fast that Vibe Coding still has a negative stigma attached to it
  As it should, it's about writing code on vibes, not looking at the code, it's literally the definition of the term:
  https://x.com/karpathy/status/1886192184808149383
  And when I say literally I'm including dictionaries:
  https://blog.collinsdictionary.com/language-lovers/collins-w...
- ffsm82 days ago
  The industry of "software" is so large... While I agree with web development going this route, I'm not sure about "everything else".
  You could argue that that's the bulk of all software jobs in tech, and you'd likely be correct... But depending on what your actual challenge is, LLM assistance is more of a hindrance then help. However creating a web platform without external constraints makes LLM assistance shine, that's true
  - killerstorm2 days ago
    Well, there are certainly kinds of code LLMs would struggle with, but people generally underestimate what LLMs are capable of.
    E.g. Victor Taelin is implementing ultra-advanced programming language/runtime writing almost all code using LLM now. Runtime (HVM) is based on Interaction Calculus model which was only an obscure academic curiosity until Taelin started working on it. So a hypothesis that LLMs are only capable of copying bits of code from Stack Overflow shall be dismissed.
    thesz2 days ago
    I took a look at the Taelin's work [1].
    [1] https://github.com/HigherOrderCO/HVM
    From my understanding, main problem there is a compilation into (optimal) CUDA code and CUDA runtime, not language or internal representation per se. CUDA is hard to debug, some help can be warranted.
    BTW, this HVM thing smells strange. The PAPER does not provide any description of experiments where linear parallel speedups were achieved. What were these 16K cores? What were these tasks?
    killerstorma day ago
    Taelin is experimenting with possible applications of interaction calculus. That CUDA thing was one of experiments, and it didn't quite work out.
    Currently he's working on a different thing: a code synthesis tool. AFAIK he got something better than anything else in this category, but whether it's useful is another question.
    thesza day ago
    > something better than anything else in this category
    That is a strong statement.
    [1] https://en.wikipedia.org/wiki/Id_(programming_language)
    Id [1] was run on the CM-5 (then) supercomputer and demonstrated superlinear parallel speedups on some of the tasks. That superlinear speedup was due to better cache utilization on individual nodes.
    In some of the tasks the amount of parallel execution discovered by Id90 would lead to overflow of content-addressable memory and Id90's runtime implemented throttling to reduce available parallelism to make things to be done at all.
    Does the PAPER of HVM refers to Id (Id90 to be precise)? No, it does not.
    This is serious negligence of Taelin.
    keithba2 days ago
    I’ve also experimented with using rust to create a new programming language where I vibe coded (eg never wrote myself). My opinion is that it’s quite capable with disciplined management.
    https://github.com/GoogleCloudPlatform/aether
    Note: the syntax is ugly as a trade-off to make it explicit and unambiguous for LLMs to use.
- 2 days ago
  undefined
- namanyayg2 days ago
  Exactly, Codex gpt-5-high is quite like sending smart devs. It still makes mistakes, and when it does they're extremely stupid ones, but I am now accepting the code it generates as throwable and I just reroll when it does something dumb.
- cyanydeez2 days ago
  Do you think that people will read your flowery prose and suspect you're just part of the dead Internet. We are still waiting for all these AI enhanced apps to flood the market.
divmain2 days ago
I have been an AI-coding skeptic for some time. I always acknowledged LLMs as useful for solving specific problems and making certain things possible that weren't possible before. But I've not been surprised to see AI fail to live up to the hype. And I never had a personally magical moment - an experience that shifted my perspective à la the peak end rule.
I've been using GLM 4.6 on Cerebras for the last week or so, since they began the transition, and I've been blown away.
I'm not a vibe coder; when I use AI coding tools, they're in the hot path. They save me time when whipping up a bash script and I can't remember the exact syntax, or for finding easily falsifiable answers that would otherwise take me a few minutes of reading. But, even though GLM 4.6 is not as smart as Sonnet 4.5, it is smart enough. And because it is so fast on Cerebras, I genuinely feel that it augments my own ability and productivity; the raw speed has considerably shifted the tipping point of time-savings for me.
YMMV, of course. I'm very precise with the instructions I provide. And I'm constantly interleaving my own design choices into the process - I usually have a very clear idea in my mind of what the end result should look like - so, in the end, the code ends up how I would have written it without AI. But building happens much faster.
No affiliation with Cerebras, just a happy customer. Just upgraded to the $200/mo plan - and I'll admit that I was one that scoffed when folks jumped on the original $200/mo Claude plan. I think this particular way of working with LLMs just fits well with how I think and work.
- realo2 days ago
  I was AI skeptic too a year ago , but recently i wanted a windows exe program to do the same as a complicated bash script on linux.
  i gave the bash script to claude code, which immediately started implementing something in the zig language. after a few iterations, i had zig source code that compiled in linux , produced a windows exe and perfectly mimicked the bash script.
  I know nothing about zig programming.
  - aurareturn2 days ago
    I've been maintaining my company's Go repos using Claude after our Go developer left. I don't know anything about Go.
- ramraj072 days ago
  Your post has inspired me to check them out. How do you use it, with their UI oe to power some other open source tool?
  Do you suggest that this thing is so fast its simpler now to quickly work on one thing at a time instead of the 5 background tools running in parallel which might have been a pattern we invented because these things are so slow?
  - divmain2 days ago
    I’ve been using the crush TUI primarily. I like that I have the flexibility to switch to a smarter model on occasion - for awhile I hesitated to pick up AI coding at all, simply because I didn’t want to be locked into a model that could be immediately surpassed. It’s also customizable enough with sane defaults.
  - bananapub2 days ago
    you want to be using something like opencode in a terminal, not the web ui.
    you’ll need to try it and see what the speed does to your workflow.
Flux1592 days ago
Was able to sign up for the Max plan & start using it via opencode. It does a way better job than Qwen3 Coder in my opinion. Still extremely fast, but in less than 1 hour I was able to use 7M input tokens, so with a single agent running I would be able easily to pass that 120M daily token limit. The speed difference between Claude Code is significant though - to the point where I'm not waiting for generation most of the time, I'm waiting for my tests to run.
For reference, each new request needs to send all previous messages - tool calls force new requests too. So it's essentially cumulative when you're chatting with an agent - my opencode agent's context window is only 50% used at 72k tokens, but Cerebra's tracking online shows that I've used 1M input tokens and 10k output tokens already.
- NitpickLawyer2 days ago
  > For reference, each new request needs to send all previous messages - tool calls force new requests too. So it's essentially cumulative when you're chatting with an agent - my opencode agent's context window is only 50% used at 72k tokens, but Cerebra's tracking online shows that I've used 1M input tokens and 10k output tokens already.
  This is how every "chatbot" / "agentic flow" / etc works behind the scenes. That's why I liked that "you should build an agent" post a few days ago. It gets people to really understand what's behind the curtain. It's requests all the way down, sometimes with more context added, sometimes with less (subagents & co).
  - embedding-shape21 hours ago
    Many API endpoints (and local services for that matter) does caching at this point though, with much cheaper prices for input/outputs that were found in the caching. I know Anthrophic does this, and DeepSeek I think too, at the very least.
- zaptrem2 days ago
  They don't have prefix caching? Claude and Codex have this.
  - versteegen2 days ago
    At those speeds, it's probably impossible. It would require enormous amounts of memory (which the chip simply doesn't have, there's no room for it) or rather a lot of bandwidth off-chip to storage, and again they wouldn't want to waste surface area on the wiring. Bit of a drawback of increasing density.
sheepscreek2 days ago
I used their $50 plan and with the previously offered Qwen3 coder 480B. While fast - none of the “supported” tools I tried were able to use it in a way that didn’t hit the per minute request limit in a few seconds. It was incredibly frustrating. For the record, I tried OpenCoder, VSCode, Quen Coder CLI, octofriend and a few others I don’t remember.
Fast forward to now, when GLM 4.6 has replaced Qwen3 coder in their subscription plan. My subscription was still active so I wanted to give this setup another shot. This time though, I decided to give Cline a try. I’ve got to say, I was very pleasantly surprised - it worked really well out of the box. I guess whatever Cline does behind the scenes is more conducive to Cerebra’s API. I used Claude 4.5 + Thinking for “Plan” mode and Cerebras/GLM 4.6 for “Act”.
The combo feels solid. Much better than GPT-5 Codex alone. I found codex to be very high quality but so godawful slow for long interactive coding sessions. The worst part is I cannot see what it’s “thinking” to stop it in its tracks when it’s going in the wrong direction.
In an essence, Cerebras + GLM 4.6 feels like Grok Fast 1 on steroids. Just couple it with a frontier + thinking model for planning (Claude 4.5/GPT-5/Gemini Pro 2.5).
One caveat: sometimes the Cerebras API starts choking “because of high demand” which has nothing to do with hitting subscription limits. Just an FYI.
Note: For the record, I was coding on a semi-complex Rust application tuned for low-latency mix of IO + CPU workload. The application is multi-threaded and makes extensive use of locking primitives and explicit reference counting (Arc). All models were able to handle the code really well given the constraints.
Note2: I am also evaluating Synthetic's (synthetic.new) open-source model inference subscription and I like it a lot. There's a large number of models to choose from, including gpt-oss-120 and their usage limits are very very generous. To the point that I don't think I will ever hit them.
- anonzzzies2 days ago
  I run opencode with cerebras and it goes on and on. No issues so far. It is not a codex or claude code but its fast that it allows for a much more interactive experience. Our results with Claude Code and Codex are much better quality but when, let's use the term vibing, opencode wirh cerebras is more fun.
- zixuanlimit2 days ago
  Have you tried Z.ai’s official coding plan? Any differences in performance?
andai2 days ago
I have created a "semi-interactive" AI coding workflow.
I write what I want, the LLM responds with edits, my 100 lines of Python implement them in the project. It can edit any number of files in one LLM call, which is very nice (and very cheap and fast).
I tested this with Kimi K2 on Groq (also 1000 tok/s) and very impressed.
I want to say this is the best use case for fast models -- that frictionlessness in your workflow -- though it burns tokens pretty fast working like that! (Agentic is even more nuts with how fast it burns tokens on fast models, so the $50 is actually pretty great value.)
KronisLV2 days ago
Been using Cerebras for quite a while now, previously with their Qwen3 Coder and now GLM 4.6, overall the new model feels better at tool calls and code in general. Fewer tool call failures with RooCode (should also apply to Cline and others too), but obviously still not perfect.
Currently on the 50 USD tier, very much worth the money, am kinda considering going for the 200 USD tier, BUT GPT-5 and Sonnet 4.5 and Gemini 2.5 Pro still feel needed occasionally, so it'd be stupid to go for the 200 USD tier and not use it fully and still have to pay up to around 100 USD for tokens in the other models per month. Maybe that will change in the future, when dealing with lots of changes (e.g. needing to make a component showcase of 90 components, but with enough differences between then to make codegen unviable) Cerebras is already invaluable.
Plus the performance actually makes iterating faster, to a degree where I believe that other models should also eventually run this fast. Oddly enough, the other day their 200 USD plan showed as "Sold out", maybe they're scaling up the capacity gradually. I really hope they never axe the Code plans, they literally have no competition for this mode of use. Maybe they'll also have a 100 USD plan some day, one can hope, but maybe offering just the 200 plan is better from an upsell perspective for them.
Oh also, when I spill over my daily 24M limits, please let me use the Pay2Go thing on top of that, if instead of 24M tokens some day I need 40M, I'd pay for those additional ones.
behnamoh2 days ago
If they don't quantize the model, how do they achieve these speeds? Groq also says they don't quantize models (and I want to believe them) but we literally have no way to prove they're right.
This is important because their premium $50 (as opposed to $20 on Claude Pro or ChatGPT Plus) should be justified by the speed. GLM 4.6 is fine but I don't think it's still at the GPT-5/Claude Sonnet 4.5 level, so if I'm paying $50 for it on Cerebras it should be mainly because of speed.
What kind of workflow justifies this? I'm genuinely curious.
- cschneid2 days ago
  so apparently they have custom hardware that is basically absolutely gigantic chips - across the scale of a whole wafer at a time. Presumably they keep the entire model right on chip, in effectively L3 cache or whatever. So the memory bandwidth is absurdly fast, allowing very fast inference.
  It's more expensive to get the same raw compute as a cluster of nvidia chips, but they don't have the same peak throughput.
  As far as price as a coder, I am giving a month of the $50 plan a shot. I haven't figured out how to adapt my workflow yet to faster speeds (also learning and setting up opencode).
  - bigyabai2 days ago
    For $50/month, it's a non-starter. I hope they can find a way to use all this excess bandwidth to put out a $10 equivalent to Claude Code instead of a 1000 tok/s party trick I can't use properly.
    wyrea day ago
    Cerebras offers pay-per-token. What are you asking for? Claude Code starts at $100, or $15/mtok. Cerebras is already much cheaper, but you want it to be even cheaper at $10?
    typpilol2 days ago
    I feel the same and it's also why I can't understand all these people using small local models.
    Every local model I've used and even most open source are just not good
    behnamoh2 days ago
    the only good-enough model I still use it gpt-oss-120b-mxfp4 (not 20b) and glm-4.6 at q8 (not q4).
    quantization ruins models and some models aren't that smart to begin with.
    csomar2 days ago
    GLM-4.6 is on par with Sonnet 4.5. Sometimes it is better, sometimes it is worse. Give it a shot. It's the only model that made me (almost) ditch Claude. The only problem is, Claude Code is still the best agentic program in town and search doesn't function without a proper subscription.
    mcpeepants2 days ago
    z.ai hosted GLM 4.6 works great with claude code, drops right in
    esafak2 days ago
    Have you tried opencode?
    DeathArrow2 days ago
    Have you tried Claude Code Router with GLM 4.6?
    https://github.com/musistudio/claude-code-router
    xadhominemx2 days ago
    $600 per year is a trivial cost for a professional tool
    bigyabaia day ago
    $600 per anything is Herman Miller territory, pal. I'm not paying that for a SaaS.
- threeducks2 days ago
  > but we literally have no way to prove they're right
  Of course we do. Just run a benchmark with Cerebras/Groq and compare to the results produced in a trusted environment. If the scores are equal, the model is is either unquantized, or quantized so well that we can not tell, in which case it does not matter.
  For example, here is a comparison of different providers for gpt-oss-120b, with differences of over 10 % for best and worst provider.
  https://artificialanalysis.ai/models/gpt-oss-120b/providers#...
- msp262 days ago
  Groq does quantise. Look at this benchmark from moonshotai for K2 where they compare their official implementation to third party providers.
  https://github.com/MoonshotAI/K2-Vendor-Verifier
  It's one of the lowest rated on that table.
- nine_k2 days ago
  > What kind of workflow justifies this?
  Think about waiting for compilation to complete: the difference between 5 minutes and 15 seconds is dramatic.
  Same applies to AI-based code-wrangling tasks. The preserved concentration may be well worth the $50, especially when paid by your employer.
  - behnamoh2 days ago
    they should offer a free trial so we build confidence in the model quality (e.g., to make sure it's not nerfed/quantized/limited-context/etc.).
    conception2 days ago
    A trial is literally front and center on their website.
    NitpickLawyer2 days ago
    You can usually use them with things like openrouter. Load some credits there and use the API in your preferred IDE like you'd use any provider. For some quick tests it's probably be <5$ for a few coding sessions so you can check out the capabilities and see if it's worth it for you.
    behnamoh2 days ago
    openrouter charges me $12 on a $100 credit...
- NitpickLawyer2 days ago
  > What kind of workflow justifies this? I'm genuinely curious.
  Any workflow where verification is faster / cheaper than generation. If you have a well tested piece of code and want to "refactor it to use such and such paradigme", you can run n faster model queries and pick the fastest.
  My colleagues that do frontend use faster models (not this one specifically, but they did try fast-code-1) to build components. Someone worked out a workflow w/ worktrees where the model generates n variants of a component, and displays them next to each other. A human can "at a glance" choose which one they like. And sometimes pick and choose from multiple variants (something like passing it to claude and say "keep the styling of component A but the data management of component B"), and at the end of the day is faster / cheaper than having cc do all that work.
- xadhominemx2 days ago
  It’s because the model weights and KV cache are stored in SRAM. It’s extremely expensive per token.
niklassheth2 days ago
This is more evidence that Cognition's SWE-1.5 is a GLM-4.6 finetune
- prodigycorp2 days ago
  Can you provide more context for this? (eg Was SWE-1.5 released recently? Is it considered good? Is it considered fast? Was there speculation about what the underlying model was? How does this prove that it's a GLM finetune?)
  - NitpickLawyer2 days ago
    People saw chinese characters in generations made by swe-1.5 (windsurfs model) and also in the one made by cursor. This led to suspicions that the models are finetunes of chinese models (which makes sense, as there aren't many us/eu strong coding models out there). GLM4.5/4.6 are the "strongest" coding models atm (with dsv3.2 and qwen somewhat behind) so that's where the speculation came from. Cerebras serving them at roughly the same speeds kinda adds to that story (e.g. if it'd be something heavier like dsv3 or kimik2 it would be slower).
    prodigycorp2 days ago
    Really appreciate this context. Thank you!
  - mhuffman2 days ago
    I suspect they are referencing the 950tok/s claim on Cognition's page.
    prodigycorp2 days ago
    Ah. Thx. Blogpost for others: https://cognition.ai/blog/swe-1-5
    Takeaway is that this is sonnet-ish model at 10x the speed.
- nl2 days ago
  Not at all. Any model with somewhat-similar architecture and roughly similar size should run at the same speed on Cerabras.
  It's like saying Llama 3.2 3B and Gemma 4B are fine tunes of each other because they run at similar speeds on NVidia hardware.
Alifatisk2 days ago
I don't know about GLM 4.6, some have said they are bench-maxing so I kinda lost my interest in trying them out, does it live up to its reputation? Is it that good like Sonnet 4.5?
jmpman21 hours ago
I subscribe to the $50 plan, and for the past day, my results have been laggy. Once the query actually starts, it’s as fast as normal, but time to first token wouldn’t even be the best way to describe the issue - it appears to be a problem with getting my request even into the hardware. Possibly an overloaded queue on the front end?
lordofgibbons2 days ago
At what quantization? And if it is in fact quantized below fp8, how is the performance impacted on all the various benchmarks?
- antonvs2 days ago
  They claim they don't use quantization.
  The reason for their speed is this chip: https://www.cerebras.ai/chip
RomanPushkina day ago
PSA: they're sold out ("GLM4.6 Temporarily Sold Out") as of 11/8/2025 for both $50 and $200 plan
vladgur2 days ago
Where I’m unfortunately hitting the wall with these assistants is developing a desktop app in say Java swing - Claude cannot verify that ui it produced is actually functional
- NaomiLehman2 days ago
  what is your AI dev stack? have you tried Kilo Code?
  - vladgura day ago
    Is Kilo code better than Claude at launching a native or a desktop app and understanding and acting on the desktop UI.
    Clicking buttons, reading values from fields, etc
dust422 days ago
1000 tokens/s is pretty fancy. I just wonder how sustainable the pricing is or if they are VC-fueled drug dealers trying to convert us into AI-coholics...
It is definitely fun playing with these models at these speeds. The question is just how far from real pricing is 500M tokens for $50?
Either way the LLM usage will grow for some time to come and so will grow energy usage. Good times for renewables and probably fusion and fission.
Selling shovels in a gold rush was always reliable business. Cerebras is only rated at $8.1B as of one month ago. Compared to Nvidia that seems pocket change.
elzbardico2 days ago
50 dollars month cerebras code plan, first with qwen-420, now with glm, is my secret weapon.
Stalin used to say that in war "quantity has a quality all its own". And I think that in terms of coding agents, speed is quality all its own too.
Maybe not for blind vibe coding, but if you are a developer, and is able to understand the code the agent generates and change it, the fast feedback of fast inference is a game changer. I don't care if claude is better than GLM 4.6, fast iteractions are king for me now.
It is like moving from DSL to gigabit fiber FTTH
- kristianpa day ago
  > speed is quality
  Plenty of people will agree with you. [1]
  [1] https://blog.codinghorror.com/performance-is-a-feature/ (2011)
ojosilva2 days ago
Here's a customer of the $200 max plan for 2 months. I fell in love with the Qwen3 Coder 480B model, Q3C, that was fast, twice the speed of GLM. GLM 4.6 is just meh, I mean, way faster than competitors, and practically at Sonnet 4.x level in coding and tool use, but not a life-changing difference.
Yes, Qwen3 made more mistakes than GLM, around 15% more in my quick throwaway evals, but it was a more professional model overall, more polished in some aspects, better with international languages, and being non-reasoning, ideal for a lot of tasks through the API that could be ran instantaneously. I think the Qwen line of models is a more consistent offering, with other versions of the model for 32B and VL, now a 80B one, etc. I guess the problem was that Qwen Max was closed source, signalling that Qwen may not have a way forward for Cerebras to evolve. GLM 4.6 covers precisely that hole. Not that Cerebras is a model provider of any kind, their service levels are "buggy" (right now it's been down for 1h and probably won't be fixed until California wakes up at 9am PST). So it does feel like we are not the customers, but the product, a marketing stunt for them to get visibility for their tech.
GLM feels like they (Z.ai) are just distilling whatever they can get into it. GLM switches to Chinese sometimes, or just cuts off. It does have a bit of more "intelligence" than Q3C, but not enough to say it solves the toughest problems. Regardless, for tough nuts to crack I use my Codex Plus plan.
Ex: In one of my evals, it took 15 turns to solve an issue using Cerebras Q3C. I took 12 turns with GLM, but overall GLM takes 2x the time, so instead of doing a full task from zero-to-commit in say 15 minutes, it takes 24 minutes.
In another eval (Next.js CSS editing), my task with Q3C coder was done in 1:30 minutes. GLM 4.6 took 2:24. The same task in Codex took 5:37 minutes, with maybe 1 or 2 turns. Codex DX is that of working unattended: prompt it and go do something else, there's a good chance it will get it right after 0, 1 or 2 nudges. With CC+Cerebras it's a completely different DX, given the speed it feels just like programming, but super-fast. Prompt, read the change, accept (or don't), accept, accept, accept, test it out, accept, prompt, accept, interrupt, prompt, accept, and 1:30 min later we're done.
Like I said I use Claude Code + a proxy (llmux). The coding agent makes a HUGE difference, and CC is hands-down the best agent out there.
- andai2 days ago
  Asked GLM-4.6 to introduce itself. "Hello! I'm glad you asked. I'm a large language model, trained by Google. (...)"
  It seems to have been fine-tuned on Claude Code interactions as well. Though unfortunately not much of Claude's coding style itself? (I wish!)
alyxya2 days ago
It would be nice if there was more information provided on that page. I assume this is just the output token generation speed. Is it using speculative decoding to get to 1000 tokens/sec? Is there lossy quantization being used to speed things up? I tend to think the number of tokens per second a model can generate to be relatively low on the list of things I care about, when things like model/inference quality and harness play a much bigger role in how I feel about using a coding agent.
- cschneid2 days ago
  Yes this is the output speed. Code just flashes onto the page, it's pretty impressive.
  They've claimed repeatedly in their discord that they don't quantize models.
  The speed of things does change how you interact with it I think. I had this new GLM model hooked up to opencode as the harness with their $50/mo subscription plan. It was seriously fast to answer questions, although there are still big pauses in workflow when the per-minute request cap is hit.
  I got a meaningful refactor done, maybe a touch faster than I would have in claude code + sonnet? But my human interaction with it felt like the slow part.
  - alyxya2 days ago
    The human interaction part is one of the main limitations to speed, where the more autonomous a model can be, the faster it is for me.
odie55332 days ago
I find the fast models good for rapidly iterating UI changes with voice chat. Like "add some padding above the text box" or "right align the button". But I find the fast models useless for deep coding work. But a fast model has its place. Not $50/month though. Cursor has Compose 1 and Grok Code Fast for free. Not sure what $50/month gets me that those don't. I liked the stealth supernova model a lot too.
- gardnr2 days ago
  GLM 4.6 isn't a "fast" model. It does well in benchmarks vs Sonnet 4.5.
  Cerebras makes a giant chip that runs inference at unreal speeds. I suspect they run their cloud service more as an advertising mechanism for their core business: hardware. You can hear the founder describing their journey:
  https://podcasts.apple.com/us/podcast/launching-the-fastest-...
  - 2 days ago
    undefined
- bn-l2 days ago
  Composer and grok fast are not free.
  - versteegen2 days ago
    grok-code-fast-1 is free (for a limited time) in opencode zen, has been for a while. (Originally it was billed as a test/to gather training data for xAI.) But right now GLM 4.6 is also temporarily free there (hosted by opencode themselves; they call it "big pickle", and there's no data collection), has been for weeks, and GLM 4.6 is far better (better than Haiku and not very far off Sonnet), and still very fast, so I have no use for gcf1 anymore.
  - odie55332 days ago
    They are both free in Cursor right now.
w-m2 days ago
I wanted to try GLM 4.6 through their API with Cline, before spending the $50. But I'm getting hit with API limits. And now I'm noticing a red banner "GLM4.6 Temporarily Sold Out. Check back soon." at cloud.cerebras.ai. HN hug of death, or was this there before?
renewiltord2 days ago
Unfortunately for me, the models on Cerebras weren’t as good as Claude Code. Speedy but I needed to iterate more. Codex is trustworthy and slow. Claude is better at iterating. But none of the Cerebras models at the $50 tier were worth anything for me. They would have been something if they’d just come out but we have these alternatives now.
- elzbardico2 days ago
  I don't care. I want LLMs to help with the boring stuff, the toil. It may not be as intelligent as Claude, but if it takes care of the boring stuff, and it is fast while doing it, I am happy. Use it surgically, do the top-down design, and just let it fill the blanks.
  - renewiltord2 days ago
    Give it a crack. It took a lot of iteration for it to write decent code. If you figure out differences in prompting technique, do share. I was really hoping for the speed to improve a lot of execution - because that’s genuinely the primary problem for me. Unfortunately, speed is great but quality wasn’t great for me.
    Good luck. Maybe it’ll do well in some self-directed agent loop.
2 days ago
undefined
lvl1552 days ago
I want to use Cerebaras but it’s just not production ready. I will root for them on sideline for now.
seduerr2 days ago
It’s just amazing to have a reliable model at the speed of light. Was waiting for such a great model for a long time!
hereme8882 days ago
So basically "change this in the UI" and you see it happen almost real time.
gatienboquet2 days ago
Vibe Slopping at 1000 tokens per second
- mmaunder2 days ago
  Yeah honestly having max cognitive capability is #1 for me. Faster tokens is a distant second. I think anyone working on creating valuable unique IP feels this way.
  - conception2 days ago
    This us where agents actually shine. Having a smart model write code and plan is great and then having cerebra’s do ask the command line work, write documents effectively instantly and other simple tasks does sped things up quite a bit.
lousken16 hours ago
sold out
jauntywundrkind2 days ago
I'm curious to know what the cost is to switch contexts. Pure performance is amazing, but how long does it take to get a system going, to load the model and build context? What systems cans witch contexts while keeping the model non-destructively vs when is executing destructive?
I have a lot of questions about how models are run at scale; so curious to know more. With such a massive wafer as chip as Cerebras, it feels like perhaps switching might be even more consuming. Or maybe there's some brilliant strategy to have multiple contexts all loaded that it can flip between! Inventorying & using so much ram so spread out is it's own challenge!