by definition, a single prompt wont' constitute the complexity of a software project. ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.
I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.
Id rather see performance in agent loops against human defined objectives where it can be verified to stick to defined guardrails and continue without drift till its objectives are complete.
I'd also like to see it identify bugs and potential performance increases by identifying existing code and suggesting refactors based on context it can pickup about the particular use case you are trying to create.
These are way more valuable metrics than "hey build X"
- Vibes are too subjective, I want an actual A/B test!
- An A/B test is too limited, I want a benchmark! (You are here.)
- Those benchmarks never seem to be reliable, I just go on vibes.
> A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is"
All of your suggestions are better but they're hard, so someone casually evaluating an AI isn't going to do them.
And so on and so forth. Again, I'm not saying this is impossible but I am saying that if you tried to do it, and you got the money, and you built the test, and got the human subjects clearance, and you ignored that during the process of all that at least one more frontier model would come out, you can count on HN anklebiting your "rigorous" study even so, and probably being correct about a lot of the issues it could have because it would take several iterations of this to build a reasonable protocol... at which point it would quite possibly also be obsoleted by progress again.
[1] https://blog.neurips.cc/2025/09/30/reflecting-on-the-2025-re...
There are far more opportunities that can be served when the world's intellectuals have the raw weights and can fine tune, splice, distill, and reapply.
Imagine having raw unfettered access to Fable. It can be refit to structural biology. It can be fine tuned on the repo for smaller context requirements. It can be run cheaper and air gapped.
The world wants this.
That said, maybe we just disagree on how to drive change, and that’s fine. I’ll leave it.
It's the "starting from empty slate" greenfield that's the real problem.
We used to make fun of Engineers who follow a README on a framework, test it on an empty project, and say "this framework is the best for our 10 year running production app". Greenfield mentality is always the solution to all problems and problem to all solutions.
One should still measure oneshotting, it's an important self-measurement metric - but against an established, large codebase.
* SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios https://arxiv.org/abs/2512.18470 * SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration https://arxiv.org/abs/2603.03823
Note that after the model generated a bunch of (intermediary) code, they still have to have it tested and get bugs fixed (via the agent/harness). In this "one shot" you still have agent loops against human defined objectives.
And these toy examples give some insight as to how the model performs. If the test were "here's some code written by $corp, please take these tickets and work on them" it may be a "real" example but nobody would be able to make sense of actually how "hard" it is, or how "well" the model did the job, besides the workers already familiar with the context.
At least everyone knows what a 3D game is.
But for a more practical issue, the ultimate goal of LLMs is to replace software engineers, or at least enable everybody to become a software engineer, to use a more up-beat phrasing that's no less accurate. And so an LLM's ability to reliably construct something from a poorly defined, contradictory, or otherwise flawed prompt, while accurately inferring intent is probably the first finish line.
I think however that they should have used the same harness and also repeated the experiment a few times to judge the variance in results.
Right, model intelligence defines the scope of things they can one shot
I also suspect that users naturally calibrate to a model's useful scope, gradually getting positive/negative feedback and gradually making their requests bigger/smaller than before
Similar to how ML was all the hype about 12 years ago and then it submerged again for a couple of years.
It's a relatively objective way of testing LLMs, and I think it's pretty representative of how strong models are overall.
The outcome of this test mirrors how GLM 5.2 and Opus 4.8 work for me: they're both similarly capable of fully executing a given task, but Opus tends to have a bit more "taste" in how it handles unstated details or implicit requirements.
> what you'll get is a series of assumptions made by the model
Yes, but that's why we use these models in the first place. We don't want to explicitly write down all the details because that would mean writing code. So we write a higher-level, human-language spec, and let the LLM fill in the blanks. The question is how good they are at doing that.
I agree generating millions of tokens from a handful of input tokens doesn't convey anything meaningful to me.
Guardrails/conventions should be enforced in linters, formatters, static analysis tooling; not specs/prompts.
Elixir is where I prefer to build software, so it would be creating a custom Credo rule.
https://developers.openai.com/api/docs/guides/structured-out...
Nothing else operates on the logprobs level and literally bans continuations that fail your schema.
Of course, with a software engineer at the helm - the models are going to be able to be guided to produce much better output. (Or worse, depending on the engineer!)
To really evaluate how a model is to use in real life, it should have access to tools, and be able to iterate on something, like they do when you use them in an agent harness.
None of that iteration need necessarily to have a human driving it (although if you're building something you want to be able to maintain, you probably need a human driving the design and architecture), you can just let the model do a couple of tries and give it input into how it's doing, and you get something closer to how people use these models in reality.
This is the wrong metric to target. Today's models can feel one-shot but they are so at the expense of resilient ReAct loops that brute force their way out of the mess initial prompts created.
And each iteration is expensive.
Sometimes failing fast and early is better than going for one-shot models that try to mitigate the mess they created with reasoning steps and ReAct loops.
Additionally, with "Hey build X" nobody is happy with the methodology and people rightfully complain about the set up.
Using your suggestion the methodology would require a lot of presumptions & arguments regarding why you choose it and think it relevant to people.
Either people would not "get" it quickly enough or would disagree/not be interested on the setup because its not how they use LLMs.
On another, being able to reliably tackle minor tasks with no handholding is very valuable in itself. Sometimes implementation details are important, but often, the most important thing is to Get It Done.
The top agent is for steering, but all subagents are mostly oneshot prompts
The reality is: - business rules change - ideas for improvement may arise from the initial prompt - updates to submodules/functions/configs/secrets are BLOCKERS ... etc.
One shot prompting for the expecations of complete software is seemingly more and more a show of incompetence of the use of this technology. It's like trying to make my toddler eat a ham sandwich from the peanut butter & jelly I put in front of him.
Current models aren't capable of that, but that doesn't mean it's not possible.
If you made models able to code to long spec, you would be left with the hard issue of having to write them.
Software was always that way, though.
> given the sufficiently smart compiler
For those unaware, this is a similar quote used by compiler proponents. The first full compiler was created in 1957 (+/- 70 years ago) and the "sufficiently smart compiler" never happened, hand written code from the best coders still is faster. Now, that doesn't mean that compilers didn't do the job well enough, we just accepted that 90-95% of the top speed was enough for almost everything.
To the LLM one shotting point, it took 30 (40?) years for compilers to be good enough for the mass market. Caveat early adopter and investor.
Plus what pyrale said.
The agentic engineering paradigm is just a narrative trend pushed by AI companies to get people to 10x their token consumption per prompt. It plays into people's laziness and addiction to dopamine too causing addict like behavior in people that fall prey to this trend.
If I do that, I'm literally slower then just doing the change without sufficiently specifying it to the model.
I can see how a junior dev or generally someone that's not particularly knowledgeable about the language or framework they're working with may benefit from such usage, but for experienced people there is very little value in that approach.
I say this because I've just had to face this decision this month with Copilot introducing the usage based billing. I attempted to scale back my usage, first with non-opus - output essentially became discardable as it continually hallucinated no existing fields in the responses of Apis etc... Then my scoping the changes smaller and smaller, until I ultimately gave up and reduced usage to just generating tests.
What is tested often makes no sense at all, completely implausible edge cases are tested on internals, while it doesn't create tests for the overall application using user events.
And some things in these test cases are downright ridiculous: instead of instantiating your classes, it sets up some barebones fake objects reimplementing some of the behavior of your actual class, then ignores the TypeScript errors via force cast or similar.
Then it proceeds to slap some test ids on the output, stubs components and dependencies more or less randomly, adds some assertions on test ids and calls it a day.
Apparently that's good enough for many colleagues to open a MR for that garbage.
That said, at home with SOTA models I happily hand large units of work to it, outsource much of the thinking, and get workable results. I think this is the future.
I see little value in throwing a ton of context at an llm and waiting 10-20 minutes for a coin flip on whether or not its going to produce junk. I'd rather do quick 60 second turns, get most of the way there and fix the rest myself if I have to. I'd rather honestly just not use them.
Everyone that I've ever interacted with and claims to prompt in "seconds" actually needs multiple minutes to think about the solution they want the model to implement - and then need twice as long to formulate that into a sentence which provides the model enough context to actually do that
So the more realistic estimates are "I'd rather spend the 2 minutes just implementing the minor change myself, instead of spending 1.5 minutes thinking about it, then 2.5 minutes writing the prompt and then waiting 1 minute for it to finish"
That's the main value I've been getting out of coding agents. I have them do (comparatively) simpler tasks or explorative tasks in the background while I'm in a meeting, doing code reviews, or otherwise working on something else.
"Well obviously you provided better follow-up prompts to the one that came out better."
Also nothing about human-provided plan files and guardrails preclude the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent, but how can you provide a fair test when the model's output n determines your n+1 response?
The business guy would say "hey build me this and that" and would get _something_ to show of.
An engineer will have a long conversation with a llm about the exact requirements, tech stack, tradeoffs. He would understand what is built, how is it built, and refine on the fly until he gets something sensible.
It won't be as fast as "build this", but the result will be much better and more maintainable.
For the enginering workflow, you don't need Fable. Any model better or equivqlent to Sonnet 4.6 would do. Yes, sometimes it will hallucinate, sometimes it'll be wrong, but it's our job as engineers to correct it and have full ownership of the result.
And yet, even the smartest AI in the world would give an alternative solution every time you invoke it. And you still need someone to judge what is right and what is not.
If we stopped developing LLMs the the only reasonable way to benchmark them would be to compare yheir performance with all the tricks we can build on top of them. Sine the are still developing rapidly any apples to apples comparison is worthwhile.
Of course this particular benchmark is not really single prompt but rather "agentic without steering".
Instruction following has been down for years, and while there are of course metrics that continue to improve as the frontier advances (for example, the ability to continue following the original instructions even as context grows), you can't really get that much better at performing a list of instructions as-written if the instructions are sufficiently precise enough that there's no wiggle room for interpretation (which seems to be what you are describing).
For example, one of the things that got me the most excited for Fable 5 was its ability to work for over eight hours straight on a single instruction and seemingly faithfully the entire time. That was something I observed personally after trying out the same workflow that runs for maybe two or three hours with Opus and then still needs followups. Fable needed no followups. That's a game changer for me compared to the prior state of the art.
That kind of stuff is going to end up being the most beneficial to people who are touching the edges of their knowledge or even exploring completely new areas. And that type of work is exactly the kind of work that makes agentic coding so powerful, even as much as it gets harder to judge the quality of the work when you lack the skills yourself. It's a good thing that the quality increases across the board, even for skilled practitioners.
For example, even people who know how to write inference engines or how matmul kernels work or how to optimize model architecture can't always predict just the sheer breadth of things agents can try to improve performance, and sometimes you get over some wall and reach a completely different optimum that you just wouldn't have reached in any reasonable amount of time by applying traditional knowledge even if you're an expert in the field.
That kind of stuff is amazing. And that's exactly the kind of stuff that one-shot prompting is testing for. It's kind of like testing for the model's "innovation", as much of an oxymoron that is.
Since Opus 4.6 I've seen later Anthropic models being more and more capable on one hand, but also less useful on multi turn open tasks.
It feels like with each model they are more and more prone to go "their own way" and jump into the implementation as soon as they can.
I can't but blame it on benchmarks and fine tuning around prompt-to-solution work.
Running a single one-shot prompt is not a benchmark, not is it representative of any sort of real-world usage.
Most agent usage is collaborative so you need to test things like reliability (when I delegate a task, does it complete it without making up test results for e.g.) and steerability (does it obey my instructions or does it just do what it thinks is best).
I think your test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.
Appreciate you sharing the results of your tests though!
Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks.
* Install pi and a bunch of extensions from their package repo
* Realize that all the packages (with a few exceptions) are massively overcomplicated and vibe coded
* Ask pi to rebuild a very simple version of the packages I used. So e.g. subagents - all the default subagent extensions are massively complicated with named agents, recursion, communication. I made one that stripped all that out.
* Then whenever I hit an annoyance, spin up a parallel session and fix it.
It's less work than it appears because I have ~5 extensions: hooks, subagents, background processes, a custom footer, a loop command... Maybe that's it. Within a couple of days you can have a setup pretty close to Claude Code but with a fraction of the base context use. After gradual improvements over a few weeks/months you'll have a system far better, tuned to your exact preference.
Of course, just like Linux or any other highly tunable system equally important is having the restraint to not spend all your time tuning it. I've definitely had a couple of days where I was bored with my real work and did that, but whatever, it beats browsing reddit.
As for getting long running tasks, I set a looping message every ~20m and tell the agent to strictly track progress in a session doc, then reread and continue after each compaction.
I've not come across a programming task that would take an LLM ten hours.
You make a very strong claim at the end that the hype is mostly real, and making it clear to what extent your claim holds should help the reader.
If I recall, that model had a couple issues. One was the issue of being monkeyed with, for which they gave everyone credits.
The other feature/bug, depending on your POV, was being Anthropic's least personable release, not papering over everything with self help guru therapy language.
Opus 4.6 didn't LARP. It was more direct, less fussy, less discussy, and very much less "wait, one more thing" within a couple edits after embarking on what should have been the spec, than 4.7 or 4.8 are.
When in engineer brain mode, working as as you describe (good old fashioned XP-style staff engineer pair programming with a language-savvy mentee not yet full-stack or system wise), I found the clearer I was about my goal and the better I could express it, the more often I'd get an expanded clarified response I could then iterate to steer for ever tighter cleaner more specified responses, then let it go build the whole thing without it agonizing and waffling.
The next two releases regressed on that dimension, wanting to figuratively "sit with" every decision and re-validate spiritual alignment along the way, no matter how clearly expressed.
Curiously to me, Fable seemed to hit the best of both worlds, I had the highest commit per turn with Fable, approaching 73%, where I'm usually under 17% of LOC written being good enough to commit, usually taking 9 - 11 turns to get the code where I'm comfortable with it.
Thanks to this, Fable cost more, but actually cost less, if that makes sense.
Arguably, Fable, and 4.6, played more outcome-correctness oriented than journey-experience oriented. It's easy to see how this could happen with human reinforced learning if not all judges are staff or principal engineer level, or constitution values are more Portlandia than Finlandia.
ANTHROP\C needs to balance these at the constitution level:
“We will work in a humane and thoughtful way, but production is the final judge. We will listen to people, but we will not let discussion replace decision. We will value craft, but not at the expense of usefulness. We will move fast, but not by hiding risk. We will measure outcomes, but not pretend that everything important is easy to measure.”
GLM-5.2 actually has really good intent understanding though, on par with GPT-5.5 and Opus from my experience.
- it takes it sweet time to get code rolling, not the fastest model by any means
- it strays a lot during discovery/planning but then corrects
- it's not steering friendly, as it hallucinates things that it doesn't follow later on
- its output is quite good
A sample use case: I was optimizing rendering on Swift+Zig codebase. It chocked on 5k data entries.
GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which made me frustrated so I blocked non-editing tool access and went AFK, after approx. 30 minutes I found that it used already-made benchmarks and some "conclusions" to optimize 3 choke points. Output pointed that it couldn't validate suspicions and asked for more data.
Implementation worked well, it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5 effects on same repo.
I would opt in in using it more BUT GPT usually completes same requests 5x faster.
GLM 5.2 was spark for preparing and running inside isolated containers with JJ workspaces (so that multiple can be ran in parallel).
Which provider are you using? I got a z.ai Lite Coding Plan and it's my understanding z.ai is on the slower side of providers and the Lite plan gets lower priority on top of that. In the api key console, it shows dipping below 60 tok/sec which is quite slow.
Switched the model to GLM-5.2 halfway in the middle of a troubleshooting session (didn't even bother to reprompt, just changed it in the middle of its reasoning), gave it a few minutes, problem fixed. This is with the subscription based allocation on OpenCode Go, where a problem like this would completely burn up my Opus for the current 5 hours or even the current week.
It's always a shock to me how opaque most other models are!
It also is pretty resilience to letting you inject in while it's working without going off course or while getting back on track after, which I appreciate
This is (unfortunately) by design. The proprietary models hide their reasoning traces so they can't be used for model distillation. Sometimes even when they do show reasoning, it isn't the model's real trace - IIRC, someone was able to demonstrate that Opus' reasoning is usually a summary made with Haiku behind the scenes.
It is less than 20% of the cost of Opus at API rates. 1.40/4.40 vs 5/25.
Maybe makes sense if you have z.AI's (not greatly priced) subscription plan, but it's not competitive against an OpenAI or Anthropic monthly coding subscription plan. I burned through almost $10 worth of tokens just doing an hour of work.
You get access to a whole bunch of bleeding edge open models including GLM-5.2, Kimi K2.7, DeepSeek 4 Pro, etc. Inference is run on US/SG/EU cloud providers with zero data retention policies. The $20/mo tier is very generous, in my experience.
They picked a task that heavily favors a model that can do multi-modal with images, and GLM still came within striking distance.
What I'm hearing from this article is that the next generation of open models that includes better multi-modal support are basically no-brainers for adoption.
Seems like a HUGE win for Z.ai and open models in general here.
I also use MiniMax-M3 in utility roles like explore/library tasks.
I’ve had a z.ai subscription for several months so I’m on the older pricing. I’m really not sure it would make sense to do this at current rates - I could bump my Codex plan instead.
Can someone explain to me where that time usage is coming from if not from the model operation itself?
Are the individual tool calls more complex and take more time to complete? Or is the rate of tok/s lower because the model does more compute per token?
In addition to that, some of the open weights models like GLM 5.2 or DeepSeek v4 Pro tend to be MUCH slower when generating tokens, which contributes to the perceived slowness. Although I wouldn't call models like GLM 5.2 slow by any means, e.g. it is currently one of the fastest models inside Notion today.
A better way would be to use https://github.com/openbmb/MiniCPM-V
We have come a long way, and very clearly have a long way yet to go.
https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...
I haven't been keeping up on hardware costs for state of the art LLM inference, but this remark made me ask myself how many readers of the article would actually be able to run this model on hardware they own. How much would it cost to acquire such a setup?
To be cost effective with inference providers, you have to find some way to be using it 24/7.
If they decided to collude, they could absolutely say "from now on you no longer have access to model X because you're an asshole"
The commercial inference offering are also downstream of one of those 3 projects (or trt-LLM if they're nvidia). It would impact Ollama, and fireworks, together, and everyone else.
Don't tempt fate.
GLM-5.2, severely quantised, 512GB Mac Studio, somewhere between $10k-$35k for a used M3. Or run it on a CPU with 768GB of RAM by getting an old PowerEdge with DDR4 for around $5,000.
Qwen-3.6-35b-q6, runs well on an RTX 5090 ($4000 + cost of a PC), runs medicore on an Intel Arc B70 ($1000 + cost of a PC plus lots of fiddling to get the setup to work right).
Gemma is a good candidate for the cheaper stuff, but I lack personal experience with using it locally
But, it produces solid results for a fraction of the price. Worth checking out if you have the time.
One of my goto "tests" of a new frontier models is having it rebuild a programming language from scratch. For GLM 5.2 I had it rebuild the old Rebol language in Rust:
https://github.com/mhs/rebol-clone-glm-5.2
It did a fairly good job roughing in the language for a low token cost.
Off topic, but does anyone else instantly pick up on LLMisms like this? It seems like all the models have converged on this style of writing, and improvements aren't really changing it.
Glm game was completely broken Opus game was at first glance ok but also with bugs
Different models with different cost produced different non perfect results . How is it “close” ? :)
Also on costs : glm burns more tokens on average vs opus . Gpt5.5 burns less surprisingly
Yes, in terms of API pricing, GLM 5.2 outperforms the competition. But the only people that use API billing for their coding work are large corporations, where these highly subsidized subscriptions are being fazed out.
At the same time, none of these companies will use a Chinese API for their employees.
For individuals and smaller teams, Z.ai's coding subscription is outperformed by Anthropic and OpenAI. You probably get around the same usage with Claude, but Codex definitely offers more usage for the amount you pay.
We can have a debate how much Z.ai closed the gap to GPT5.5 and Opus 4.8, but if I can freely decide between them in a world where they all cost the same, I simply wouldn't choose GLM.
So the important question becomes: How good will the offering from Z.ai get with GLM 5.3 or 6 and how much will OpenAI and Anthropic cripple their current offering in the near future.
If the world needs any more evidence of Europe's short-sightedness, it would be them running to China to spite the US (instead of creating fertile grounds for their own tech).
Employees and students used to coding with thousands of dollars worth of tokens (on a 20/100 dollar plan) will push enterprise to spend.
Having a Chinese model that is competitive won't displace this enterprise spend. But an open model hosted in the US/EU might.
The existence of GLM 5.2 puts a ceiling on how much OpenAI/Anthropic can charge for API Access.
Except there is no evidence of this at all, just people comparing API and subscription pricing. The leaked financial info for OpenAI shows inference is profitable right now, though it does not show a distinction between subscription and API revenue... but if subscription revenue was so lossy, it would hard for total inference to still be profitable.
I believe this is the reason why we can even have this debate. Without this kind of competition we would not have these subsidies.
I just think that as of today, most people will not find a good reason to switch to GLM.
It's annoying that the plans are so restrictive beyond usage limits. Understandable maybe, but annoying. In practice, only Anthropic (and maybe Google) are really restrictive though. They really scared me away with their policy of charging API rates after the fact if they consider your usage not TOS-aligned. This might be an ungrounded fear that I have, but I feel this is something they'd do so they scared me away.
As well as people using 3rd party harnesses like OpenCode.
> At the same time, none of these companies will use a Chinese API for their employees
So who are Amazon Bedrock (who serve GLM) targetting?
Individuals are presumably going with one of the cheaper US providers such as DeepInfra ($0.18/M cached input for GLM vs $0.50 for Opus) or Fireworks AI.
A company can buy a NVIDIA B300 and serve it's developers in house with unlimited tokens.
nice try but you intentionally ignored the entire Chinese market & Chinese big corporates. there are 130 Chinese companies in the fortune 500 list, with an average revenue of 80 billion USD each. do you think they are going to sign up for Claude, Codex or GLM? now consider South East Asia, Africa, Middle East, Middle Asia and South America, tell me why their large corporates won't be using GLM API billings?
your western centric view of the world is totally out of date, like it or not, 2026 is vastly different from 1996, the US no longer controls high tech whatsoever.
For people who follow open LLMs, none of these were quiet and all were the most interesting open model release for a few days/weeks. In one or two months, it will be some other model again. Now I do appreciate the real rapid improvements in open models. But there's also a ton of hype and fast-fashion around all of this.
GLM passes a meaningful threshold of reliability/utility that puts it in a different category for real work. Just like Opus really took off after passing a threshold with 4.5. It's the first open model to do that.
And there are valid reasons to run local, even if performance (quality and speed) aren't best.
From my Opus vs DS 4 Pro personal benchmarks, 16 different real-life work tasks, DS 4 has performed as well as Opus 4.8 high overall but with few drawbacks:
- on the 16 tasks, one needed several prompts to be steered back into the topic
- its review capabilities seem much worse
- DS4 had the cleanly better solution in 3 cases out of 16, with Opus "only" doing cleanly better 2 times out of 16. But still, I want to emphasize, is the worst case scenarios that imho matter the most, not the best ones, and on that front Opus outperformed.
That being said I spent less than 2$ of API working 4 days, which is more or less what I would've spent with Anthropic APIs for less than one task.
Opus is most expensive model in pay as you go model, but IMO fair comparison should include subscription price as well. For example when one has $100 Claude Max and use it up through the month, it might not be more expensive than GLM, or at least not 5x.
And z.ai themselves also have subscriptions.
I’m currently trying to figure out whether a downgrade from Max 5x to Pro in combination with one of those would save me money and if so, how much.
Edit: seems like Anthropic Pro + GLM Pro (Yearly) would let me almost halve my costs of Anthropic Max 5x. Only concerns are about GLM 5.2 not having vision support and also being kinda slower and also not being as good as Opus.
I think it's most fair to compare the plain token pricing that is used by everyone.
As a consumer, yes, it's totally fair. All that matters to me is the price I pay at the pump, not whether that price is "real" or not.
Anthropic have claimed they expect their first profitable quarter this year -- they may have bigger margins on their raw API than you realise.
I'm saying that this is not necessarily the case. They do a lot of optimisation and don't have the same price pressure to lower margins. They may not be losing as much on subscriptions as people think.
From there I collected the following US providers currently serving GLM 5.2:
- Together (https://www.together.ai/models)
- Fireworks (https://fireworks.ai/models)
- Featherless (https://featherless.ai/models)
"Build a 3D platformer game from scratch, in raw WebGL, with no game engine or 3D library" would be a zero-shot prompt.
I like it, but the lite plan ate 22% usage of my 5h reset window in a single session after 2 prompts on xhigh of GLM 5.2 [1m]
Result was satisfactory, I think stuff is decent, I'm happy to use either, wish there was a combined subscription plan where I could get both
Coupled with a local Headroom (https://github.com/headroomlabs-ai/headroom) you'll be able to use a LOT without hitting your 5h window :)
Definitely the best $ value for me considering the reasonable performance of GLM5.2.
They provide a rolling window quota, so you're never really out of quota contrary to other providers, you can adjust day to day.
Check it out if interested : https://synthetic.new/?referral=kwjqga9QYoUgpZV
---
Docs & all models : https://dev.synthetic.new/docs/api/models
So, 8000$, plus it's unavailable. 3 years of Codex/Opus subscription.
> API prices
Which are irrelevant for 200$ Codex/Opus plans that are times cheaper.
I would like to give them a try but I certainly not have the money to get a system able to run them, and I don't really want to pay more than the state of the art
My only, I guess feedback, is that it's not really clear about the price.
Would the 21.92 be the API pricing I guess?
Cost $5.39 (real billed) ~$21.92 (estimate, list pricing)
If it builds a UI and can't look at it, it's askin ls whether the app looks right.
The real time 3d fluid dynamics appear to be the tricky part, I wish I still had opus access, would love to see if it can do it.
Pro tip: You could use a multi-modal model to verify images as a subagent spawned by GLM 5.2, to get around this issue.
I am not sure where this is going to lead us but it is fun to watch.
I used it with Cerebras inference at a time when it had a good coding plan at a low price, and delivered tons of stuff using it.
Tried with 2 harnesses and it seems bad + slow
At the end of the day, the time earned is more important then the cost for big players.
The ability to spawn 10 claude agents and rush a project to outcompete someone is more important for big businesses in my imo. Also the small details that GLM missed would take significant more time to iron out, considering it already took double the time.
I do hope other (open weight) models catch up, but to act like they are anywhere close for me is a bit disingenuous.
Also, every single lab does RL on benchmarks, which is why Opus 4.6 was the last truly great assistant, after it, all models tend to drift into implementation asap.