758 points by marinesebastian 5 hours ago | 90 comments
  • XCSme an hour ago
    I just tested it on my benchmarks[0], it's GLM-5.2 level, at 2x cost, but also 2x faster.

    Weak spots (categories it fails):

        - Trivia — 0/3 - basically not much built-in knowledge
        - Combined tool-calling tasks — score 45/100, sometimes makes invalid tool calls
        - Puzzle Solving — score 77, flubs carwash-like tests
    
    [0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...
    • XCSme an hour ago
      As always, note: faster than GLM-5.2 doesn't mean too much, as GLM-5.2 is served by different providers, so the inference speed can vary drastically between providers or over time.
      • yieldcrv 31 minutes ago
        What’s everyone’s favorite GLM provider?

        z.ai doesn’t always have the most reliable AI

        but I don’t mind the party seeing my trade secrets and thoughts compared to an American corporation + the party seeing my trade secrets and thoughts. So that’s not a functional difference to me, and the Chinese one won’t reply to subpoenas so that’s a value add tbh

        So I’ll consider all, fastest tokens/sec wins

        • eli28 minutes ago
          Fireworks.ai is solid. And if you care more about speed than cost they have a "fast" variant that I think just throws more hardware at the model for about 2x the cost.
  • doctoboggan 4 hours ago
    The cost per task chart is telling me that I should _never_ use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels.
    • AquinasCoder 4 hours ago
      While I appreciate that they publish this information, it's increasingly hard to keep track of it all. I've lost the mental model of how different models at different effort levels perform and what tasks they are good at.

      In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.

      • matheusmoreira 4 minutes ago
        I always use Opus 4.8 at max effort for everything. The $20 subscription didn't have enough tokens, but the $100 had too many of them. So now I just max out Opus in order to maintain 100% weekly utilization.
      • brobdingnagians an hour ago
        I tend to run it on High and then step it up for problems where I'm noticing it struggles, bump it back down after. Sometimes I accidentally leave a session in Ultracode for a day and wonder why things are taking so long, but generally happy with the results.
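The "start at one level and step up when it struggles" workflow described here can be sketched as a small loop. This is only an illustration: `run_with_escalation`, `run`, and `accept` are hypothetical names, the ordering of effort levels is an assumption, and nothing here calls a real model API.

```python
from typing import Callable, Optional

# Effort names as mentioned in the thread; their ordering here is an assumption.
EFFORT_LEVELS = ["low", "medium", "high", "xhigh", "max"]


def run_with_escalation(
    task: str,
    run: Callable[[str, str], str],
    accept: Callable[[str], bool],
    start: str = "high",
) -> tuple[Optional[str], str]:
    """Try the task at `start`, stepping up one level each time `accept` rejects.

    `run(task, effort)` is a hypothetical stand-in for whatever calls the model.
    `accept(result)` is a hypothetical check, for example "the tests pass".
    Returns (result, effort) for the first accepted attempt, or (None, last_effort).
    """
    effort = start
    for effort in EFFORT_LEVELS[EFFORT_LEVELS.index(start):]:
        result = run(task, effort)
        if accept(result):
            return result, effort
    return None, effort


# Toy stand-in so the sketch runs without any model behind it.
def fake_run(task: str, effort: str) -> str:
    return f"{task} solved" if effort in ("xhigh", "max") else f"{task} failed"


result, used = run_with_escalation(
    "flaky-test", fake_run, lambda r: r.endswith("solved")
)
```

Each new task starts again from `start`, which mirrors the "bump it back down after" habit and avoids leaving a session at the most expensive level by accident.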
      • nolok 37 minutes ago
        Same boat as you, and my answer is "... except when I ask for an overall or checkup task that is specifically heavy or overseeing, in which case I use the maximum level", which lately meant ultracode.

        I'm not going to play around with thinking level every request because the goal is to make me save time not spend it in a different setting menu.

      • jbvlkt 2 hours ago
        Exactly this is my problem with all AI tools. I want someone else to create working tools for me so I can focus on my product. It is the same with other tools. I do not want to spend huge amounts of energy and time to set up my IDE, operating system or desk layout. I guess it is too early to have that now.
        • jerojero an hour ago
          I think that's the whole selling point of lovable?
      • sanderjd 3 hours ago
        What I want is a harness that knows how to optimize this kind of thing for me.
        • cunningfatalist 3 hours ago
          You might want to check out Amp: https://ampcode.com/
          • sanderjd an hour ago
            I appreciate the suggestion! But it isn't clear to me, from reading their marketing site, what they bring to the table from this perspective. Can you give me a more targeted pitch?
        • manojlds 3 hours ago
          Which is your own harness and your own evals for your tasks I guess
          • munk-a an hour ago
            I don't demand a customized compiler for my code even if such a compiler could outperform gcc. There is a lot of value in focusing on correctness to an extreme degree even if the outcome might be suboptimal to something more tailored - a tool with a large customer base can justify more resources going into its maintenance.
          • sanderjd 3 hours ago
            Maybe. But that sounds like a large amount of bespoke work for what seems like a common problem?
            • manojlds 2 hours ago
              I was talking about enterprise agents and then realized the question is more about coding agents.
              • sanderjd 2 hours ago
                Ah I see! Yes, I was talking about a coding harness, not an enterprise agent. I entirely agree with you that your suggestion of driving it via evals is the right thing for that use case!
      • j45 20 minutes ago
        Just because it’s hard to keep track of doesn’t mean it’s not relevant.

        Playing around with learning the differences is incredibly helpful to schedule on one’s calendar weekly for an hour or two, while saving links throughout the week to try out.

      • paulddraper an hour ago
        It's almost like you want an automatically intelligent choice of your artificial intelligence.

        Understandable frankly.

      • jacooper 4 hours ago
        Just use deepswe as a reference point.
    • 2001zhaozhao 4 hours ago
      There are two wrinkles to this:

      - For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.

      - For some tasks the sheer amount of raw input tokens is the most important. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them

      • timcobb 4 hours ago
        > This is why there was a "Sonnet only" usage bar for Max tier for the longest time.

        it's still there. I still don't totally grok why I can't use all my tokens on Sonnet if I want to... maybe that signals something?

        • laughingcurve 2 hours ago
          Distillation attacks? Volume of calls?
        • i000 2 hours ago
          They want to encourage diversifying model use.
          • radlad an hour ago
            Seems kinda weird - it's cognitive load I'd love to avoid. If I'm going to take it on, I might as well try other providers.
          • aqfamnzc an hour ago
            Why?
            • munk-a an hour ago
              It helps solicit more feedback and lets them trial different approaches. You're not just a user, you're a tester!
    • Torkel 4 hours ago
      Yeah, I was looking at the same chart and was very surprised at where the curve is relative to opus... Feels like sonnet 5 is "what if opus had an extra-low effort level"?
    • XCSme an hour ago
      Well, it is a Sonnet model, it is indeed better[0] than Sonnet 4.6 (smarter, faster, cheaper), but I don't see why you would use it as opposed to Opus 4.8 low or GLM-5.2...

      [0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

    • energy123 4 hours ago
      The arguable caveat is Sonnet may run faster (although this isn't known for sure, due to more tokens being used for the same task), so you can potentially get more done in a synchronous iterative workflow

      I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.

      • kolinko an hour ago
        From my benchmarks, sadly, it doesn't seem to be the case much. Surprisingly. I found Sonnet comparable in speed to Opus (sic), but perhaps I was testing it wrong?
        • riverbirch 2 minutes ago
          I can confirm this; I'm not seeing much of a difference in practice either.
    • partsch 30 minutes ago
      I feel like the charts have been adjusted. I am quite sure they looked different a couple of hours ago...
    • johnfn 4 hours ago
      That's just one benchmark, though. Tab to the next one and Sonnet 5 performs better as effort goes up just as you'd expect. I imagine the suggestion is that performance vs effort tradeoff is task dependent.
      • energy123 4 hours ago
        No it doesn't? It's worse than Opus across the whole shared frontier on both plots.
        • acchow 2 hours ago
          Agreed. The graphs clearly show that opus 4.8 performs strictly better at the same cost per task
          • jsnell 2 hours ago
            But they don't show "strictly better" performance at cost per task!

            The graphs show parts of the cost/performance pareto frontier occupied by Opus 4.8 and others occupied by Sonnet 5.0. If Opus 4.8 was strictly better at cost per task like you say, by definition the entire frontier would be occupied by Opus.

            So neither is Pareto-dominant over the other. In contrast, Sonnet 5.0 is Pareto-dominant over Sonnet 4.6 on those graphs.
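The dominance argument above can be checked mechanically on discrete points. A minimal sketch: the (cost, score) values below are made up to illustrate the shape of the argument and are not read off the published charts, and `Point`, `dominates`, and `pareto_frontier` are hypothetical helper names.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    effort: str
    cost: float   # dollars per task (lower is better)
    score: float  # benchmark score (higher is better)


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a.cost <= b.cost and a.score >= b.score
            and (a.cost < b.cost or a.score > b.score))


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep every point that no other point dominates, sorted by cost."""
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p.cost)


# Hypothetical numbers, chosen only for illustration.
points = [
    Point("sonnet", "low", 0.10, 60.0),
    Point("sonnet", "medium", 0.20, 70.0),
    Point("sonnet", "high", 0.40, 76.0),
    Point("sonnet", "xhigh", 0.52, 79.0),
    Point("opus", "low", 0.30, 77.0),
    Point("opus", "medium", 0.45, 80.0),
    Point("opus", "high", 0.70, 84.0),
]

frontier = pareto_frontier(points)
models_on_frontier = {p.model for p in frontier}
```

With these made-up numbers the cheap Sonnet points stay on the frontier because no Opus point is both cheaper and better, while the expensive Sonnet points drop off. That is the situation in which neither model dominates the other overall.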

            • energy123 an hour ago
              > by definition the entire frontier would be occupied by Opus.

              But the entire frontier is occupied by Opus under any reasonable interpolation scheme (piecewise linear which is what they've done, and most reasonable spline or polynomial fits would also lead to the same result) over the overlapping x values for which both are defined.

              Under that interpolation scheme, for x > ($ cost of Opus low effort), Opus is Pareto-dominant over Sonnet 5. You can see this by picking any point on Opus's interpolation and realizing that you get strictly worse by switching to Sonnet for the same x value or the same y value. Meaning if you want to pay the same $x then you get a worse y, or if you want the same y you pay more $x.
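The interpolation claim can be made concrete as follows. This is a sketch under stated assumptions: the curves are hypothetical (cost, score) pairs rather than the published data, and `interp` is an illustrative helper that refuses to extrapolate outside the measured range.

```python
from bisect import bisect_right

Curve = list[tuple[float, float]]  # (cost, score) pairs sorted by cost


def interp(curve: Curve, x: float) -> float:
    """Piecewise-linear interpolation; raises if x is outside the measured costs."""
    xs = [c for c, _ in curve]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the measured range (that would be extrapolation)")
    i = min(bisect_right(xs, x), len(xs) - 1)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical curves, chosen only for illustration.
sonnet: Curve = [(0.10, 60.0), (0.20, 70.0), (0.40, 76.0), (0.52, 79.0)]
opus: Curve = [(0.30, 77.0), (0.45, 80.0), (0.70, 84.0)]

# Compare only where both curves are defined.
lo = max(sonnet[0][0], opus[0][0])
hi = min(sonnet[-1][0], opus[-1][0])

steps = 10
grid = [lo * (1 - k / steps) + hi * (k / steps) for k in range(steps + 1)]
opus_wins_on_overlap = all(interp(opus, x) > interp(sonnet, x) for x in grid)
```

On the overlapping cost range the comparison is well defined. Below `lo` only one curve has data, and `interp` raises instead of inventing a value, which is where the two sides of this subthread part ways.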

              • jsnell an hour ago
                I really don't get what you're proposing. The cost ranges do not overlap at the low end. You can't (by definition!) interpolate outside of the range.

                If you mean extrapolate, at that point you're just making up data. The available effort levels are discrete and covered totally by the benchmarks. You can draw on the monitor with a sharpie to show an "ultra-low" effort level for Opus that scores better than Sonnet "low" at the same price, but it doesn't magic the ultra-low effort into actual existence.

                (Anyway, the blog post now has an erratum and a graph that shows substantially better relative performance for Sonnet 5.0 than the original graph.)

                • energy123 37 minutes ago
                  That's why I said "over the shared frontier" in my first post and more precisely in my second post I said "over the overlapping x values for which both are defined."

                  It was a claim that applies to a range of x-values where both curves are defined.

                  Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region. Which is what I understand to be your point?

                  • jsnell 25 minutes ago
                    The post I was replying to said "performs strictly better at the same cost per task". That claim was obviously not true: there are costs where Opus cannot do the task and Sonnet can, so Opus can't be performing strictly better at the same cost. It seems that you agree that it is not true.

                    You could make it true by artificially dropping some of the data points, but, like, why?

                    (Again, this is moot given the updated graph.)

                    > Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region.

                    Not so! It's only sound to do that at the low end of the cost axis (x) or the high end of the performance axis (y). You can't do it at the low end of the performance axis or the high end of the cost axis.

    • lucamark 3 hours ago
      You're referring to the Agentic search, but if you look at the Agentic computer use the cost is basically halved.

      However, I am also confused about market positioning. Too expensive to perform daily tasks - open source models are much cheaper - and not a frontier model to address complex real world problems.

      Rarely used Sonnet btw.

      • annzabelle an hour ago
        > Too expensive to perform daily tasks - open source models are much cheaper

        There is a real advantage, especially for businesses, in using an off the shelf solution from a corporate provider.

        Personally, the advantage of not having to set up multiple solutions from multiple sources outweighs the cost of a $20 a month subscription. Think about why a lot of consumers prefer Apple devices over Linux. There are a lot of advantages to Linux, but "never having to think about my tools" is its own advantage.

      • energy123 3 hours ago
        You're the second person that has said this but I cannot understand why you are interpreting the "Agentic computer use" graph in this manner.

        The graph shows that Opus is cheaper than Sonnet for the same performance. Unless I am suffering a cognitive blindness thing right now.

        • lucamark 3 hours ago
          Wrong! Look at it better. It shows that Opus has superior performance but at higher cost.
          • doctoboggan 3 hours ago
            No, you are misunderstanding the graph. Draw a vertical line anywhere, that is a "constant cost" line. For any given cost, Opus 4.8 has a higher performance than Sonnet 5. Only where Sonnet 5 effort is at medium or low would it make any sense to use it, as there isn't even an equivalent Opus effort level to compare to.

            Alternatively you can draw a horizontal "constant performance" line and see that Opus is cheaper for a given performance level.

          • 827a 3 hours ago
            Why are you comparing xhigh reasoning between Sonnet and Opus? Of course Sonnet xhigh is cheaper than Opus xhigh, but that isn't the point; the point is that e.g. 80% accuracy on Opus costs ~$0.45 (medium reasoning) whereas on Sonnet it costs ~$0.52 (xhigh/max reasoning).
          • brokencode 3 hours ago
            That is a bad comparison. Compare Sonnet xhigh against Opus medium, which is both better and cheaper.
          • energy123 3 hours ago
            No, that's apples and oranges. You need to compare Sonnet5's 79% with the interpolated Opus4.8's 79%.
      • girvo 43 minutes ago
        The specific market positioning is... for me to use at my big tech company job, where we aren't allowed to use GLM and similar, but have fixed caps on how much token usage we're allowed to rack up a month.
    • booi 3 hours ago
      I actually exclusively use Sonnet at low effort. It's too slow otherwise, and at higher effort levels it is strictly worse than Opus.
    • goldenarm an hour ago
      It's funny the exact same thing happened to Gemini 3.5 flash. Cheaper and more agentic model that ends up worse and more expensive than 3.5 pro low.
      • Readerium 3 minutes ago
        3.5 Pro not yet launched, you mean 3.1 pro?
    • intellijdd 4 hours ago
      I noticed that as well but with the introductory pricing, I wonder how true that is.

      It would be great to see these charts with the promotional pricing just because it’s here for about two whole months.

      I guess I could get Sonnet 5 to do it.

    • seiru 3 hours ago
      Worth noting that the default chart there is for "agentic search performance", not coding. I didn't see an effort comparison for coding specifically.
    • manojlds 3 hours ago
      Opus 4.8 high doing better and cheaper than Sonnet 5 xhigh
    • al_borland 3 hours ago
      What is a "task" in real-world terms? If it will be $15/million output tokens, and high/xhigh is somewhere in the $7.50/task range, does that mean a single task is using 500k tokens? That seems like it would start to add up fast.
      • wyre 3 hours ago
        I’ve found input tokens are around 5x more than output, so a task could be a couple million thinking tokens and then a few hundred thousand output tokens?
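The token arithmetic in this subthread can be written out directly. A small sketch: the $15 per million output tokens and the $7.50 per task come from the comments above, the 5x input ratio is the rule of thumb from the reply, and the $3 per million input price is a placeholder assumption rather than a quoted price.

```python
def output_tokens(task_cost_usd: float, output_price_per_m: float) -> float:
    """Output tokens a budget buys if the whole cost were output tokens."""
    return task_cost_usd / output_price_per_m * 1_000_000


def output_tokens_with_input(
    task_cost_usd: float,
    output_price_per_m: float,
    input_price_per_m: float,
    input_per_output: float,
) -> float:
    """Output tokens when every output token comes with `input_per_output` input tokens."""
    blended_price = output_price_per_m + input_per_output * input_price_per_m
    return task_cost_usd * 1_000_000 / blended_price


# All of the budget spent on output tokens: about 500k tokens.
all_output = output_tokens(7.50, 15.0)

# Same budget with 5x as many input tokens at an assumed $3 per million input.
with_input = output_tokens_with_input(7.50, 15.0, 3.0, 5.0)
input_tokens = with_input * 5.0
```

Under that placeholder input price the same $7.50 covers roughly 250k output tokens plus 1.25M input tokens, so the 500k figure is an upper bound on output rather than a typical task size.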
    • Natelinathan 3 hours ago
      I just re-wrote the /code-review skill Anthropic ships to use Sonnet 4.6 for some tasks as it was using Opus for simple git diff commands and similarly mechanical tasks (launched 100+ agents for one of my diffs, cmon). I wonder how Sonnet 5 will impact my usage.

      Does anyone else have any review token saving measures?

    • make315 minutes ago
      it might be worth it if speed is an issue
    • nicce 3 hours ago
      > Opus always performs better for a given cost.

      Assume it to get deprecated sooner rather than later.

    • ZeWaka 4 hours ago
      It's very interesting. Why even release a new product that underperforms at the same price level? Why not just lock it?

      I guess it's probably a lot cheaper for them to run, and it cuts costs for them. Seems disingenuous, though.

  • microtonal 4 hours ago
    Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.

    I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me positive, I have found that the more models are optimized for fully agentic development, the worse they get at assisted development and often start doing too much despite very strict/specific instructions.

    I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.

    • Brendinooo 4 hours ago
      Yeah, there's a real opportunity for one of these companies to invest time in a model that's tuned for, to use your term, agent-assisted development.

      Trouble is, everyone inside their buildings seems to believe that no one will be working like that in a year or two.

      • everforward 4 hours ago
        There’s no way to justify their valuations if they get downgraded to a pair programming tool. They need fully agentic stuff to work and replace human engineers to even come close.

        Offhand, I’m not even certain whether a model like that could justify the constant retraining we’re doing on the agentic models.

        It doesn’t make a lot of sense to spend millions or billions on training to reduce hallucinations by 0.3% if your model assumes a human is in the loop to course-correct them.

        • keeda 3 hours ago
          Some napkin math -- total global labor compensation is about 50% of the GDP, which puts it in the USD 50 - 60 Trillion range: https://ourworldindata.org/grapher/labor-share-of-gdp

          This source claims that knowledge workers alone (probably because they are paid much more) account for 35 - 50 Trillion of that: https://github.com/danielmiessler/Substrate/blob/main/Data/K...

          If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value annually. Even if the AI industry can capture a fraction of that, that is a huuuge monetization opportunity.
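The napkin math above is easy to reproduce. A sketch using only the figures quoted in the comment (a 35 to 50 trillion USD knowledge-worker compensation base and a 5% boost); the 10% capture fraction is a hypothetical placeholder, since the comment only says "a fraction".

```python
def annual_value_trillions(base_trillions: float, boost: float) -> float:
    """Value created per year if compensation is a proxy for output."""
    return base_trillions * boost


BOOST = 0.05
low = annual_value_trillions(35.0, BOOST)   # about 1.75
high = annual_value_trillions(50.0, BOOST)  # about 2.5

# Base that the quoted ~1.5T lower bound would correspond to at a 5% boost.
implied_low_base = 1.5 / BOOST              # about 30

# Hypothetical: the AI industry captures 10% of the value created.
CAPTURE = 0.10
captured_low, captured_high = low * CAPTURE, high * CAPTURE
```

With the 35 to 50T base the range comes out at roughly 1.75 to 2.5T, so the quoted ~1.5T lower bound corresponds to a base nearer 30T. The order of magnitude of the argument is unchanged either way.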

          Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.)

          Now, I don't think AGI will happen soon (or has already happened, depending on how you define it) but I do think humans will be a much smaller part of the loop and large-scale job displacement will happen once companies figure out how to properly use AI.

          At this point, the financial upside for the AI industry is extremely high but will be limited by the social turmoil that will inevitably ensue (which we're already seeing brewing in the data center backlash.)

          • e9 2 hours ago
            I want to propose alternative reality where 1.5-2.5T in value doesn't go to a handful of companies. Instead it turns out to be like restaurants where this gets distributed to lots and lots of small, local, mostly interchangeable teams. There will of course be some super star "chefs" leading the industry and setting trends and some "restaurant chain" like big businesses and supply chain for all of this.
            • bdamm 2 hours ago
              How? Training and operating models seems to naturally focus on those willing to invest quite significantly in these operations.
              • nish__ an hour ago
                If RAM prices come down, running your own models will be relatively affordable.
            • xxpor 2 hours ago
              The world is not zero sum. Value is created, not just preserved. Anthropic and OpenAI creating value does not imply that smaller guys can not also create value.
              • afavour 2 hours ago
                But marketplaces also exist and big players in a marketplace are often able to manipulate the market such that they are advantaged and small players are not able to break in.
                • mpyne an hour ago
                  This is true of every market that has ever existed, and that's not stopped small players from finding niches.
            • actionfromafar 2 hours ago
              Sysco is pretty big.
          • everforward 29 minutes ago
            > Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.)

            The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed, which for me would probably translate to something like a 3% increase in productivity. I spend a lot more time on things like getting agreement between teams, documenting approaches to things that don't exist on the wiki, etc, that LLMs are significantly less effective at. Or just can't do; no one will be happy if I send an LLM instead of me to meetings.

            I suspect a lot of roles are like that. They give a 10-30% boost to the core role function, but that core role is still only 30-50% of what you do.
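The "boost to the core function times the share of the job it covers" reasoning can be sketched two ways: the simple product used above, and an Amdahl-style version that treats the boost as a speedup of only part of the working time. Both helper names are hypothetical, and the inputs are the ranges quoted in the comment.

```python
def simple_boost(core_share: float, core_boost: float) -> float:
    """First-order estimate: overall gain = share of the job x gain on that share."""
    return core_share * core_boost


def amdahl_boost(core_share: float, core_boost: float) -> float:
    """Overall throughput gain when only `core_share` of the time gets faster."""
    new_time = (1.0 - core_share) + core_share / (1.0 + core_boost)
    return 1.0 / new_time - 1.0


# 10% faster coding, coding being 30% of the role.
low_simple = simple_boost(0.30, 0.10)
low_amdahl = amdahl_boost(0.30, 0.10)

# Optimistic end of the quoted ranges: 30% boost on 50% of the role.
high_simple = simple_boost(0.50, 0.30)
high_amdahl = amdahl_boost(0.50, 0.30)
```

The two estimates agree closely at the low end (about 3%) and diverge somewhat at the high end, where the Amdahl-style figure is lower because the untouched half of the job dominates the remaining time.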

            > that is ~1.5 - 2.5T in value annually

            That seems really large, but it's ~2-3x Walmart's yearly revenue, and OpenAI and Anthropic both have estimated valuations that compare to Walmart's market cap. And this is before we consider that they need to do it for cheaper or why would anyone bother. Realistically, potential revenue is probably half that at best.

            It's also before cutthroat pricing really kicks in. People are willing to pay for Claude right now; I still suspect that as time goes on people will start looking towards Deepseek/GLM/etc models that provide 95% of the performance at 10% of the price. That'll cut the market even further.

            The question is how much demand for knowledge work swells as prices fall, and whether that's a soft landing or a crash.

          • ricardobayes 2 hours ago
            I am deeply surprised by the silence of philosophers, sociologists, liberal arts majors, economists. Where are the think tanks who contemplate and debate the societal aspects? The tech is advancing full steam but the "other side" doesn't feel anywhere nearly ready.
            • bloppe 2 hours ago
              Idk why you're perceiving silence. Feels to me like this is the main thing people talk about nowadays.
              • scarmig 2 hours ago
                It has to do with the scope of what they're discussing. It seems extraordinarily small: e.g. what if AI increases productivity growth by 0.4%? Do data centers use too much water? Are AIs racist when reviewing resumes?

                The frontier labs, on the other hand, are thinking about replacing all human labor, ending death, and the risk of it causing human extinction. Most of the apparatus we're talking about approach it very parochially; it's almost like they're embarrassed to take the grander ideas even a little seriously, for being too nerdy/sci-fi.

                • freejazz an hour ago
                  The public would happily string up any of these CEOs if given the chance
            • bdamm 2 hours ago
              Because the "other side" is busy trying to anthropomorphise AI into solving the trolley problem, while being mostly clueless about the actual problems.

              They'll show up after the fact and whinge endlessly about how they should have been involved.

            • freejazz an hour ago
              Silence? Even the pope has come out against AI? Who hasn't? Diplo??
            • digitaltrees 2 hours ago
              Reid Blackman has written several books and has a consulting agency to guide ethical implementation of AI
          • hedora 2 hours ago
            You’re trying to apply value based pricing (infinite margin upside) to a commodity.

            Pre-bubble pricing: $1400 gets a 128GiB iGPU optimized for inference. GLM and Kimi need 800-1000GiB. Call it 1TiB. The $1400 boxes could be ganged into sets of 4-8, with a switch. Call the switch $1000.

            Each box has a TDP of 250W. 8 × 250W / 120V ≈ 16.7A, or one household circuit in the US, so no new power infrastructure is needed.

            $1400 × 8 + $1000 = $12,200. Assuming standard five-year depreciation, that’s $2440 a year. There are a billion knowledge workers alive today. So that’s $2.4T annual revenue. Average net profit margins on computer hardware are 4.3%. That works out to $105B net income, globally.
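The hardware estimate above can be reproduced step by step. Every input below is taken from the comment itself (box price, switch price, TDP, five-year depreciation, one billion workers, 4.3% margin); only the variable names are invented here.

```python
BOXES = 8
BOX_PRICE_USD = 1400
BOX_MEMORY_GIB = 128
SWITCH_PRICE_USD = 1000
BOX_TDP_WATTS = 250
MAINS_VOLTS = 120
DEPRECIATION_YEARS = 5
KNOWLEDGE_WORKERS = 1_000_000_000
HARDWARE_NET_MARGIN = 0.043

total_memory_gib = BOXES * BOX_MEMORY_GIB                  # 1024 GiB = 1 TiB
capex_usd = BOXES * BOX_PRICE_USD + SWITCH_PRICE_USD       # 12200
annual_cost_usd = capex_usd / DEPRECIATION_YEARS           # 2440.0
amps = BOXES * BOX_TDP_WATTS / MAINS_VOLTS                 # about 16.7

global_revenue_usd = annual_cost_usd * KNOWLEDGE_WORKERS   # about 2.44e12
global_net_income_usd = global_revenue_usd * HARDWARE_NET_MARGIN  # about 1.05e11
```

The figures match the comment: $12,200 per cluster, $2440 per year, about $2.4T in revenue and about $105B in net income. One caveat on the power step: roughly 16.7 A is above a 15 A breaker rating, so the "one household circuit" claim presumes a 20 A circuit.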

            So, I guess the question is whether the (currently #2) open weight models provide $1.4-2.4T less value per year than the #1 and #3 models, and, if so, whether customers can measure this, or are willing to spend 2x more and deal with censorship, data theft, intentional enshittification, sabotage, ads, product placement, etc, to get the slightly “better” model.

            Also, note that my numbers assume Moore’s law stopped for all time in 2024, but we’ve seen HW improvements since then.

          • danenania 2 hours ago
            I’d also point out that LLM inference revenue already totals more than 100B annually based on publicly reported numbers. Almost none of that is replacing knowledge workers. Almost all is increasing their productivity. So empirically what you describe is already happening to a nontrivial degree.
          • parineum 2 hours ago
            > If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value

            Minus the cost of inference, that might not be the boon you're making it out to be. I hear what people around here are spending on their api and I'm skeptical that these tools are making me that much more productive.

            Personally, for assisted development, I haven't seen much progress in a while.

        • overgard 3 hours ago
          That's a really good point. I think if there wasn't the insane amount of money involved and these were treated as tools instead, they would probably be MORE productive. I think a person working hand in hand with an AI instead of delegating is the sweet spot of making things fast while also not losing understanding or control of the system. You are absolutely right that these companies can't justify their valuations if they do that though. I just got a new mac to run models locally, and so far the results have been positive with some small hiccups. I'm thinking the future of this tech will likely be better tooling with better IDE integrations rather than "Claude plz make me a SaaS kthx"
          • everforward 23 minutes ago
            > I'm thinking the future of this tech will likely be better tooling with better IDE integrations rather than "Claude plz make me a SaaS kthx"

            I think this sort of thinking is a trap, because it presumes that all software has the same constraints.

            There's a spectrum of requirements between "chuck this over the wall at Claude, it only has to work once" and "this is a literal rocket ship, formally verify the whole thing".

            I've made some things with Claude I don't understand and don't control. It's fine, they're still useful to me. Things for the house that I wasn't going to build manually, some dashboarding stuff and scripts for work, stuff that can crash and burn and I'll be fine.

            They won't justify trillions in investment, but they are useful.

            Equally, I do agree with you on some things. Sometimes I hand-hold the LLM or forgo it entirely because I want to be 100% sure I know how something works, and can justify a decision if it causes a production outage.

            I think the future is probably multiple different tools with different goals. Better IDE integration for some uses, an entirely separate "LLM herd controller" kind of thing for when you're okay with vibe-coding, and the most interesting is something in the middle where you're more in the loop than pure vibe-coding, but don't see the full context like in an IDE. Something where it surfaces changes to key components, but hides things like test changes.

          • ah1508 an hour ago
            > while also not losing understanding

            That's a key point. Keeping knowledge and know-how inside the company is strategic. For most people GPS did not result in a better sense of direction, spellchecking did not help them write without making mistakes, and delegating translation to DeepL does not help them get better at a foreign language. I don't see the gain for an individual, a company, or a society if a technology reduces the ability to think, do stuff, understand complex problems, and work hard at something. Hiring juniors also matters: what is boring for a senior dev is useful for a junior, like the "wax on, wax off" in The Karate Kid. Then when the senior dev retires the junior is not junior anymore and the know-how is still here. I want to transfer my knowledge to a junior, not to Anthropic or Google or OpenAI.

            Ideally, working hand in hand with an AI could be like driving a motorcycle vs riding a bicycle. Both are fine, but you go much faster with a motorcycle and you don't lose any ability. But prompting a motorcycle auto-pilot by voice sounds a bit stupid and boring. Insane use of energy rarely comes into the equation, which is a bit weird. Personally it is why I am never tempted to use AI. However I see value in AI for finding weaknesses in code (the inverse of flattery), writing tests with all the edge cases based on specs since tests are often sloppy, and asking for a fresh view on a very difficult problem. I'd love to hear about the equivalent of move 37 in game 2 (AlphaGo vs Lee Sedol) in a difficult programming task. But I think that massive delegation of code writing is how you lose the knowledge and the know-how: what keeps us sharp.

            Final word: I once asked Claude for a review; the code involved a db transaction. Nothing complicated, and Claude said everything was fine. However the transaction isolation level was not set (I did it on purpose, as if I did not know about isolation levels). It did not ask me if it was my intention to keep the default level. I would have preferred challenging feedback: Why did you choose the default isolation level? Is it on purpose? Do you know that the default depends on the db? Do you know about isolation? Tell me about the business use case and I'll explain which one would be best.
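The kind of challenging feedback asked for here can be approximated with a very small check. This is only a sketch: `review_questions` is a hypothetical helper, the regex is a crude heuristic rather than a SQL parser, and the table lists the documented default isolation levels for a few common engines.

```python
import re

# Default isolation levels as documented for each engine.
DEFAULT_ISOLATION = {
    "postgresql": "READ COMMITTED",
    "mysql": "REPEATABLE READ",  # InnoDB
    "sqlserver": "READ COMMITTED",
    "oracle": "READ COMMITTED",
}

# Crude heuristic: does the SQL mention an isolation level at all?
_EXPLICIT_LEVEL = re.compile(r"\bISOLATION\s+LEVEL\b", re.IGNORECASE)


def review_questions(sql: str, engine: str) -> list[str]:
    """Return the questions a reviewer should ask about transaction isolation."""
    if _EXPLICIT_LEVEL.search(sql):
        return []
    default = DEFAULT_ISOLATION.get(engine, "an engine-specific level")
    return [
        f"No isolation level is set, so this runs at the {engine} default "
        f"({default}). Is that intentional?",
        "What is the business use case? Can concurrent transactions touch the same rows?",
    ]


implicit_sql = "BEGIN; UPDATE accounts SET balance = balance - 10 WHERE id = 1; COMMIT;"
explicit_sql = "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;"

questions = review_questions(implicit_sql, "mysql")
no_questions = review_questions(explicit_sql, "mysql")
```

The point of the table is the one made above: the same unannotated transaction runs at REPEATABLE READ on MySQL with InnoDB and at READ COMMITTED on PostgreSQL, so silence about the level is itself worth a review question. Where the `SET TRANSACTION` statement has to be placed relative to the start of the transaction varies by engine.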

          • user439282 hours ago
            I am thinking the opposite. I've been having great results with handing more and more responsibilities to the agent.

            Contrary to what some people suggest, I have not hit any maintenance or reliability dead ends. If something breaks, the agent fixes it.

            If it cannot, I have the agent instrument the code and work through the logs to check hypotheses, until the source of the issue is found.

            If even that would fail, which did not yet happen, I can still do some old fashioned digging and learning, like I always have.

            This is for native mobile app development, and the code base is around 100k LOC.

        • tskj3 hours ago
          Dario has publicly claimed each model has been profitable, even accounting for its training costs; it's just that each new model is exponentially more expensive to train than the last, so the income lags and it looks like the company is losing money overall.

          Now, we can't know if this is true unfortunately, but it's not directly contradicted by anything that's known publicly at least. I thought it was an interesting way to frame it and makes the whole situation look marginally less bad.
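
          To make the framing concrete, here is a toy calculation. Every number in it is hypothetical (a 10x training cost growth per generation, a 2x lifetime revenue multiple, revenue arriving the year after training); none of it comes from Anthropic. It only shows how each model can be profitable while the company's yearly cash flow stays negative.

```python
# All numbers are hypothetical, chosen only to illustrate the framing.
TRAIN_COST_GROWTH = 10.0   # each model costs 10x the previous one to train
REVENUE_MULTIPLE = 2.0     # each model earns back 2x its training cost
FIRST_TRAIN_COST = 1.0     # training cost of model 0, in arbitrary units


def model_profit(n: int) -> float:
    """Lifetime profit of model n on its own: revenue minus training cost."""
    cost = FIRST_TRAIN_COST * TRAIN_COST_GROWTH ** n
    return REVENUE_MULTIPLE * cost - cost


def yearly_cash_flow(years: int) -> list[float]:
    """Company cash flow per year when model n is trained in year n
    and earns its lifetime revenue in year n + 1."""
    flows = []
    for year in range(years):
        train_cost = FIRST_TRAIN_COST * TRAIN_COST_GROWTH ** year
        revenue = 0.0
        if year > 0:
            prev_cost = FIRST_TRAIN_COST * TRAIN_COST_GROWTH ** (year - 1)
            revenue = REVENUE_MULTIPLE * prev_cost
        flows.append(revenue - train_cost)
    return flows


print([model_profit(n) for n in range(3)])  # every model is in the black
print(yearly_cash_flow(3))                  # every year is in the red
```

          With these made-up parameters the losses only stop once training costs stop growing faster than the revenue multiple, which is the part nobody outside can verify.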

          • NorwegianDudean hour ago
            A common and extreme misconception is that inference is expensive and that providers are losing a lot of money. Inference is extremely lucrative and profitable.
        • sanderjd4 hours ago
          My two cents is that the way to square this circle is that the valuations should be lower and they should be spending a lot less on constant retraining.

          Unfortunately (from my perspective) it seems like the US companies are increasingly stuck in their current model. I think it's a competitive disadvantage.

          But obviously most of the real insiders seem to disagree with me, so I'm probably wrong :)

          • wyre3 hours ago
            The insiders disagree because they are benefiting greatly from the insane valuations, right?

            Chinese models are quickly commodifying frontier inference, and the US Gov is blocking public access to domestic SOTA models. Without those models, why would consumers still spend $200/month to use the best models?

            It’s such a mess and isn’t inspiring confidence as a non-investor.

            • sanderjd2 hours ago
              Are they benefiting from the insane valuations though? If the valuations deflate before the insiders are able to exit, I think that would be worse for them than a lower but sustainable valuation.

              It all comes down to whose prediction of the future is closer to correct. I think the most likely future is commodification of inference and "agent-assisted" rather than "agent-driven" workflows dominating the future of work. But insiders - who both know way more than me, and also have more skin in the game, both for better and worse - seem to really think I'm wrong about that.

              So I dunno! Could go either way!

              • wyrean hour ago
                Even if the future is agent-driven workflow, that doesn't stop the commodification of inference. A good agent-driven workflow, in my experience, is a byproduct of the harness and scaffolding around the agent.

                What insiders are you talking about? They're going to be hot towards the possibilities so they can exit to a massive windfall. I don't know why they would want to be publicly critical of these technologies that could make millions on IPO.

                • sanderjd34 minutes ago
                  I'm talking about people who work at the frontier labs who talk to the press, and what seems to be the revealed beliefs of those same people from the strategies we see their companies pursuing.

                  My point is that actually it would be worse for these people if the valuations are only high during this period - which will last awhile longer from now! - where their equity is not liquid, but crashes as the market figures out this commoditization thing.

                  But if we're wrong about how that's going to go, then this isn't a concern because there won't be any devaluation. And to me that seems to be what they honestly think is going to happen. And they know more than me (and I think they're a lot smarter than me), so this does temper my confidence in my own predictions.

        • ricardobayes2 hours ago
          At some point it's going to plateau, maybe already has. Then they will switch to FPGA/ASIC-based model-specific hardware for lower consumption. I'm pretty sure the "space data centers" won't use GPUs, they are not radiation-tolerant whereas FPGAs can be.

          https://www.cerebras.ai/blog/gemma-4-on-cerebras-the-fastest...

          • quaverquaver2 hours ago
            I would not take "space data centers" as a given! From most to least likely, these will be vaporware, vaporized-ware, rubble-ware, loss leaders.
        • JumpCrisscross4 hours ago
          > no way to justify their valuations if they get downgraded to a pair programming tool

          I think there is. Pair today doesn’t mean they’re locked into that forever.

          • ChrisLTDan hour ago
            Their valuations don't make sense as just programming tools, period. Forget about if they are still human driven.
        • EddieRingle2 hours ago
          > There’s no way to justify their valuations if they get downgraded to a pair programming tool.

          Honestly I still don't see how they justify their valuations, period. If anything they're serious liabilities.

          Open-weight models are improving and reaching "good enough" levels for more and more tasks. They're also known quantities; you know what you're getting with them and don't have to worry about the model silently (or not so silently) being switched out from under you (whether that's because Anthropic/OpenAI decides you're not worthy of their latest and greatest for one reason or another, or they switch you to a quantized model to save on compute, or they simply sunset the specific model you've been relying on).

          And if the open-weight model doesn't run on your local hardware already, there are any number of hosting providers that will handle that for you (so you're back to just paying for colocation/cloud usage instead of nebulous tokens).

          Closed models are improving as well, sure, but diminishing returns will eventually kick in (as they already have for various tasks, as I said).

          So if not their models, where does their value come from? Just simple network effects/lock-in? "Normal" users will drift to other options if they start showing more and more ads, and enterprise customers will surely be looking for opportunities to avoid lock-in and reduce risk.

          I think the last argument I've heard is that these valuations are basically a bet that Anthropic and/or OpenAI will achieve AGI that can fully replace human labor, so they'll essentially be able to sell that replacement labor to everyone. They haven't managed to pull that off, yet, however. Businesses that have tried to replace humans almost immediately realized either that the AI's capabilities were oversold or that they at least needed a human in the loop still, to some degree. And even if they do achieve AGI, that would surely become an issue of national security (they're already flirting with that today), so who's to say governments won't simply nationalize the best AI labs and either remove them from the economy entirely or perhaps even provide models as a public service to level the playing field?

          That all sounds like a giant gamble, if anything. And it's incredibly frustrating to watch as someone that's been unemployed for a year because (a) budgets are being burned on tokens and (b) LLM-generated applications are flooding hiring teams and preventing real people from being seen. (Not to mention, as someone that spends a lot of time in gaming circles, the fact that DRAM and flash storage is quickly becoming inaccessible is just an additional frustration that means people can't even find temporary relief in entertainment.) I can only hope this bubble finally implodes before I lose my house.

          • pixl9740 minutes ago
            >Open-weight models are ...

            <banned>

            Not the first one to come up with that likely outcome either. I mean, if you're being restricted from SOTA models now, how long do you expect before the FBI kicks in your door for using an 'illegal' open model?

      • pkulak3 hours ago
        And every benchmark is "build GTA-6 from nothing, as a single-page web app".
      • ricardobayes2 hours ago
        They have to, but also everyone working at 3D printing companies thought "industry 4.0" is going to completely override everything, we are going to print housing and going to print a mug at home and drink coffee out of it.

        Today's news that Amazon is hiring 11k interns. I think part of the AI story was used as a convenient excuse to get rid of some "fat" and some covid overhiring and gave companies an out to change course.

      • rconti2 hours ago
        I wonder how portable the existing models are for different use cases. As good as they are for greenfield development or working in a single or across a few tightly coupled repos, they're absolutely terrible at debugging distributed systems and make incredibly wrong yet extremely confident assertions all the time.

        I don't know if it's a matter of just requiring a tiny amount of optimization or wholesale redesign.

      • popalchemist4 hours ago
        Whether they believe it or not is immaterial. It is the end-goal they want to achieve, because then they own the means of production entirely.
        • pigpop2 hours ago
          They own the means of production for the leading models but they're far from monopolizing them since the techniques are well known. At this point it's a matter of having a head start and lots of capital to pay for the data annotation and GPU time to train them. Others are playing catch-up but they're hot on their heels, which is the biggest reason for them to continue spending like crazy to keep their leads.

          For the non-bleeding edge they have a lot of competition with more competitors showing up every day.

          The way this is playing out is not surprising, it's similar to any other technological breakthrough as it becomes commercialized. Eventually those means of production will become commoditized as well.

        • quaverquaver2 hours ago
          these are capital intensive commodity businesses. They can be plenty big - see railroads or airplanes... or refining... but that doesn't mean that most value won't be added elsewhere.
        • jatora3 hours ago
          I find these nefarious intention theories shallow. It can both be the case that the endstate is them owning the means of production without that being the intended guiding goal. Companies can chase profit without being Leninistic boogeymen.
          • WhyIsItAlwaysHN3 hours ago
            There is no nefariousness in owning all the means of production, it's the endgame of maximizing profit.

            However the result is exactly the same, concentration of power.

            • pigpop2 hours ago
              This is such a defeatist and low agency take. "means of production" are not a limited resource like gold that you have to extract from natural sources or divvy up. They are fundamentally skill and knowledge that anyone can attain and put to use, maybe not on the same scale as a well funded business but even those businesses had to start somewhere in order to grow to the size they are now. So rather than casting aspersions on them, your time would be better spent learning how you too can create some means of production and start producing value.
            • popalchemistan hour ago
              No nefariousness other than the subjugation of the majority of humanity? You're insane
              • WhyIsItAlwaysHN2 minutes ago
                What I meant is that nefariousness from people is not a prerequisite. It's a machine that wants to maximize all profit and all the evil is a natural product. If you magically put saints in charge they would be eaten and replaced by the same kind of people very quickly if the end goal remains.
          • cousinbryce3 hours ago
            Sam Altman has said some things that would make Lenin blush
      • jambalaya84 hours ago
        As I said, working ourselves out of our jobs within the span of a few years.
    • jerf3 hours ago
      I've been using Kimi K2.6 lately (don't have 2.7 available through blessed work channels yet) for tasks where I already know what it is I want to do and I want to just step through the process in pieces, and it's fine. Do I have to correct it maybe a bit more than Opus? Yeah, but the real cutoff would be between "I have to read every line" and "I can just trust it without reading every line" and for me neither model hits that mark, and I expect it to be a while yet for that. Is it as good as Opus if I want to spit ball about architecture and then convert that to code? No, but I don't have that problem all the time, and it's there if I do need it.

      And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week.

      That said, if anything I feel like I have to rein in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus.

    • nozzlegear4 hours ago
      > I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.

      I've moved completely to local models that I run with my M1 Mac Studio (64gb ram) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to Openrouter and use something like Kimi, GLM or Deepseek for a fraction of the price of Anthropic et al.

      • plasticsoprano3 hours ago
        Which quant do you use? I have a similar setup and the speed is atrocious at 4-bit.
        • nozzlegear2 hours ago
          I'm using 4-bit as well, with the MoE model. I also use the MLX versions which are optimized for Apple CPUs (from what I understand anyway, I'm just an LLM layman). According to my oMLX dashboard, I'm getting about 50 tokens per second out of this model – not blazing fast, but more than fast enough to be useful to me.

          https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-OptiQ-4...

      • kamranjon3 hours ago
        This is the way
    • m3h3 hours ago
      I think you should try an OpenAI model like GPT 5.5. It is better at following instructions and boundaries set during prompt. It feels like a more capable "agent assistant" than Claude models but without loss of intelligence.

      Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during the planning as well as review and ask a lot more questions from the agent than I've seen others doing. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me.

      • ifwinterco2 hours ago
        I prefer GPT 5.5 to Opus but both are absurdly expensive token hogs, I can't afford to use either as my main model at $work with the monthly spend cap we have.

        I use Composer (since we use Cursor) or GPT 5.3-codex as my workhorse models and only break out the big guns when I have a genuinely difficult problem to solve.

        IMO somewhat weirdly 5.3-codex might be the best overall coding model OpenAI have ever released. It's 90% as good as 5.5 and costs about 20% as much, since it's both cheaper per token and uses fewer tokens for the same task.

        I'll miss it when they inevitably deprecate it, but hopefully I can use Kimi K2.7 by then

        • m3h2 hours ago
          I didn't realize GPT 5.3 Codex was that good.

          OpenAI claims to have made their new Terra model as good as GPT 5.5, but with half the cost per intelligence. Hopefully, this will bring it closer to the price you're expecting (or even better considering GPT models have good acceptance/success rates according to benchmarks).

    • jklmnopqrstuvw4 hours ago
      From my own experience, GLM-5.2 generally costs more tokens and is much slower.
      • pimeys4 hours ago
        I use GLM 5.2 Fast from Fireworks and it's very fast. Where are you using it from?
      • microtonal4 hours ago
        Which inference provider do you use? (Admittedly, I currently use K2.7 a lot more.)
      • james2doyle4 hours ago
        Tokens and speed are a factor but does it require less back and forth to get things right? Being "fast and cheap but wrong" still has a cost that an otherwise "expensive and slow" exchange does not
    • mark_l_watson2 hours ago
      Good point, I also like to do the work myself, with an assistant under my control. I am usually really happy with DeepSeek v4 Flash that I feel just mostly does what I tell it to do, but I do switch to Pro for harder tasks.

      There are so many models, and I personally ignore benchmarks so it takes some time to try different models on my use cases. Fortunately, it is ‘good enough’ to do the work to find a few models that work for me, and just use them for a month or two before re-investing time for my own evals to possibly change models.

      People should evaluate what works for them and ignore other people and benchmarks. (Apologies if that sounds snarky.)

    • whateveracct4 hours ago
      agent-assisted development uses orders of magnitude fewer tokens than agent-driven development

      the incentives aren't there sadly

      • sanderjd4 hours ago
        Not for a business model that scales revenue by token usage. But other business models are available.
    • mohamedkoubaa4 hours ago
      I've been moving more to Composer 2.5 for the same reason. KISS principle.
      • everfrustrated30 minutes ago
        Composer 2.5 fast (via Grok) is honestly amazing. It's been implementing everything I've asked and getting it right first time. Been impressed with its front end ability.

        If this was the last model I could ever use I think I would be happy.

      • AdminAdmim4 hours ago
        Same for me, I downgraded my Cursor subscription because when I use Cursor, 90% of the time I use Composer 2.5 fast
    • mattmatheus2 hours ago
      I've been working to use the best model for the task for about 6 months and have found great success planning with the 'frontier' model but punting implementation down to a 'lesser' model. I'm using Beads-Rust (a Rust fork of GasTown's beads) as my issue tracker. So far, so good.
    • xpct4 hours ago
      I've been largely disappointed how much the Claude models ignore custom instructions, and sometimes even prompts on the chat interface. It sometimes feels like talking to a wall, or as if there was a third person in the chatroom whose messages I can't see.

      I can't help but feel this is intentional towards the 'Agentic' workflow.

      • spacephysics4 hours ago
        I think this seems purposeful, as there are two opposing forces at play:

        - Have a model that follows the user's instructions
        - Have a model that follows the system prompt instructions more

        For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual Logic unless they program it out to test, this runs afoul and we get one or the other.

        Feels like optimizing for either precision or recall, but can't have both

        • wqaatwt3 hours ago
          I suppose a solution might be going with a customizable harness like pi and merging Anthropic’s system prompt with a personalized custom one to remove all contradictions
          • arcanemachiner3 hours ago
            You still have to manage/fight with the post-training that is baked into the model itself.
      • marcindulakan hour ago
        I keep adding selected cases of CLAUDE.md instructions non-compliance reported on claude-code github to that issue https://github.com/anthropics/claude-code/issues/13689. Subjectively the amount of such cases seems lower during the past month. It may be that claude-opus-4-8 (default thinking) is a bit better at instructions following than past models.
      • manveerc4 hours ago
        Totally agreed. I sometimes wonder if they are making the model "lazy" with each iteration, it keeps getting better at avoiding work.
        • skerit4 hours ago
          This is why Fable was so good. It followed instructions and it was in no way lazy.
          • DontchaKnowit4 hours ago
            People keep making comments about Fable like this? You could only use it for, what, like a week? How is that at all enough time to evaluate? Opus 4.6 didn't suffer from these problems for a hot minute, and then when newer models were released it got worse. I think they change a ton behind the scenes and allocate compute however they want, so the model you use today may behave much differently than how it behaved yesterday
            • boc3 hours ago
              The ~72 hours I had access to Fable were by far the most productive I've had in months. Re-wrote massive parts of my codebase and caught a ton of bugs and logic issues that had silently slipped through before. I went over my subscription limit and immediately kept paying the API price to keep going. It was that good.
            • marcindulakan hour ago
              For me claude-fable-5 failed to follow the instruction following test I'm making against various models https://github.com/marcindulak/claude-fails-to-follow-claude...
            • plorkyeran3 hours ago
              It was a pretty stark difference. I had the opposite problem where it did too much and overshot what I wanted from it so I certainly assume that if it had stuck around it would have gotten tuned back a bit pretty quickly.
            • pdimitar3 hours ago
              > You could only use it for what like a week? How is that at all enough time to evaluate?

              By observing how in 4 workdays it achieved more than Opus in ~11 days. I am my team's backend lead and the Fable 5 model finally turned the tide on my overwhelming backlog. Back to Opus and I have to treat it like special-education kid multiple times a day.

            • tskj3 hours ago
              You didn't really have to use it more than a day honestly to tell what kind of shocking paradigm change it was. Man do I miss it.
            • Analemma_3 hours ago
              Heh, it's not crazy if you're here in the Bay: I know multiple people who more-or-less disappeared for days when Fable came out because they were running their benchmarks, and only emerged blinking into the sunlight when the USG banned it. That's just how things are here now, most people are normal but there are some serious LLM dope addicts out and about.
          • acters4 hours ago
            I've been seeing LLMs act lazy from the very beginning. They got a little better but smaller models really only want to have a single task given to them. Mythos at least does work. RIP
      • gs174 hours ago
        > or as if there was a third person in the chatroom whose messages I can't see.

        If you set off a classifier, that's how it looks to Claude.

        • xpct4 hours ago
          I wasn't working with anything sensitive, but it really does feel like it sometimes condenses even something low like three bullet points to two.

          IMO, they were quite good with checklists even a year ago, and tried to tick off each one.

      • storus4 hours ago
        Try to run your prompts through Claude to pinpoint any ambiguous parts that can be interpreted in multiple ways, or self-contradictory sections. I typically resolve any prompt-ignoring issues with that.
    • duxup2 hours ago
      “Hey I saw some messed up function commented out that at face value is a bad idea… so here it is again with some nonsense assumptions ….”

      I ask “where did you get that?” … too often if I’m not constantly guiding it, and even then it still goes off the rails.

    • arikrahman2 hours ago
      I have also started shifting to models more reasonable for my workflow. I've been using the Reasonix harness for Deepseek, and cache hits make the token use basically free. This is with unsubsidized models as well, using American providers.
    • bckr2 hours ago
      I suggest encoding your invariants in the harness: architectural invariants that can be mechanically checked, including which modules are approved, which dependencies, etc.
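
      One way such a mechanically checked invariant could look, as a minimal sketch: a dependency allowlist check built on Python's standard `ast` module. The `APPROVED_MODULES` set and the `unapproved_imports` helper are hypothetical names for illustration; a real harness would load the allowlist from project config and run this over every changed file.

```python
import ast

# Hypothetical allowlist; a real project would load this from config.
APPROVED_MODULES = {"json", "logging", "pathlib", "myapp"}


def unapproved_imports(source: str, approved=APPROVED_MODULES) -> list[str]:
    """Return top-level module names imported by `source` that are not approved."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` counts as a dependency on top-level package `a`.
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) stay inside the package and are skipped.
            found.add(node.module.split(".")[0])
    return sorted(found - set(approved))
```

      Wired into the agent loop as a failing check, a non-empty result gives the agent a concrete error to fix rather than an instruction it can ignore.
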
    • lacooljan hour ago
      gemma-4-e4b is very good at assistance too, and is local and fast and small (and "free")
    • a_c3 hours ago
      I actually use sonnet 4.6 for my day to day coding too. It consumes far fewer tokens and is good enough. Opus is just too token consuming for it to be useful to me.
      • bazhand3 hours ago
        Have you tried '/model opusplan'? I've had strong results mixing opus for planning with sonnet implementing.
        • a_c3 hours ago
          I haven't. Thanks for the heads up will give it a try! I use opus to comment on code design quite often though. It became a pattern that I made a skill for me to ask for second opinions https://news.ycombinator.com/item?id=48733092 Would love to hear your feedback if you don't mind!
        • vtail3 hours ago
          Fascinating! How did you learn about this?
          • bazhand2 hours ago
            It was something that was used for token efficiency. Most of the settings and use cases are quite poorly communicated but asking Claude to review the latest release changelog (https://github.com/anthropics/claude-code/blob/main/CHANGELO...) is quite useful. Combining that with @"claude-code-guide (agent)" to read its own docs for settings/configs is super helpful.

            Another quite useful approach is to use /opusplan along with /codex:rescue (https://github.com/openai/codex-plugin-cc): you get quite a strongly reviewed plan using native claude + codex without having to implement the mostly useless trust-me-bro plugins and other bs.

    • epolanski4 hours ago
      I've been saying for ages that since Opus 4.6, models are increasingly smart but less and less helpful as assistants.

      Fable was amazing as a vibecoder but as an assistant it can't resist jumping into implementation and filling chats with pointless jargon.

      It's really grim if you're looking for assistance instead of an implementor.

      GPT 5.5 Pro and Fable are gorgeous bullshitters that pretend to be right (often convincingly because they are very smart) even when they are wrong and I need tons of energy to process their information.

      I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.

      • thewebguyd4 hours ago
        By design, unfortunately. If they are just assistants, they can't sell the dream of "we're going to replace human labor completely" to the C-suite.
        • baq4 hours ago
          It isn’t a dream, it’s a reality for some of us here and it will be increasingly so for everyone else. Amazingly, USG intervening slowed the dynamic greatly (fortunately?)

          The problem is obviously who will be left. There’s a lot of scifi to catch up on.

        • epolanski4 hours ago
          I think that they are simply evaluated on prompt to solution benchmarks.
      • whstl3 hours ago
        Yep, this is why experiences and ratings of models vary so wildly.

        I recently migrated a very large web app to Tailwind and Opus kept screwing up over and over, refactoring and changing the design, the more complex the component became.

        I ended up asking Haiku to do it and it managed to do everything correctly, pretty much without intervention.

      • mullingitover3 hours ago
        > I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.

        I've taken to instructing the agent to manage the subagent, and the principal agent's sole job is to ensure the subagent follows instructions to the letter.

      • epolanskian hour ago
        Just to follow up on what I mean, this was my first interaction with Sonnet 5:

        "I just cloned this repo, investigate how to set it up, don't install anything, just collect information"

        _spews information_

        I proceed with the setup, but get a Linux specific dependency in a bash script, so I want to evaluate whether it can be rewritten...

        "There's this error on MacOS, I think it's because we need linux-utils from brew, verify whether the script can be written in bare posix"

        _proceeds installing linux-utils and all the rest_

        "Didn't I tell you to not install anything?"

        _you're absolutely right_

        F*k me..

    • trollbridge3 hours ago
      No kidding. I expect to have models to use which are optimised for different use cases.

      Sonnet as an autonomous agentic model is silly. We already have other models for that if you want something weaker and cheaper than Opus.

    • spullara2 hours ago
      if you like that, use gpt models instead.
  • Jcampuzano23 hours ago
    I'm struggling to understand why I'd ever use this instead of just using a lower effort level for opus given on many of the benchmarks listed the cost per task rises above opus at anything higher than medium effort.

    Only thing I can think of is for when someone is out of opus credits. Of course there are API billing use cases but I'd probably still just use opus on low.

    • c0m47053a few seconds ago
      Specific task based benchmarks don't reflect a lot of day to day agentic use cases in my experience. If you are working on a series of discrete tasks and can clear context after each one and move to the next, you might get that sort of efficiency from Opus low effort. I often find that when working through a real problem, iterating and discovering, context length can creep up, and that is where opus tends to get expensive.
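
      A back-of-the-envelope sketch of why creeping context gets expensive, under simplifying assumptions that are mine and not from any provider's pricing docs: every turn re-sends the full history, each turn adds a fixed number of tokens, and prompt caching (which changes real costs a lot) is ignored.

```python
def input_tokens_single_session(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every turn re-sends the whole history."""
    # Turn k sends k * tokens_per_turn tokens, so the total is the
    # arithmetic series t * (1 + 2 + ... + n) = t * n * (n + 1) / 2.
    return tokens_per_turn * turns * (turns + 1) // 2


def input_tokens_cleared(tasks: int, turns_per_task: int, tokens_per_turn: int) -> int:
    """Total input tokens when context is cleared after each task."""
    return tasks * input_tokens_single_session(turns_per_task, tokens_per_turn)


# Same 40 turns of work at a hypothetical 2,000 tokens per turn:
print(input_tokens_single_session(40, 2_000))  # one long exploratory session
print(input_tokens_cleared(8, 5, 2_000))       # eight short tasks, context cleared
```

      Input volume grows roughly with the square of the session length, so a per-task benchmark run with fresh context can look much cheaper than a long iterative session on the same model.
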
    • itopaloglu833 hours ago
      More and more I find myself trying to stop Opus from doing something stupid, and at every turn I need to tell it to stop overcomplicating things.

      I think the models are being optimized for wealth extraction from users and companies, instead of solving problems.

      I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.

      • __natty__an hour ago
        > More and more I find myself trying to stop Opus from doing something stupid, and at every turn I need to tell it to stop overcomplicating things

        Yeah, those are my thoughts as well. I feel it’s great for benchmarks and some tasks, while in others it tries to spend as many tokens as possible, overcomplicates the task, and needs a second or third round of steering that costs money. At the scale Anthropic operates, I bet it’s a huge amount of extra money just to make sure their model works.

      • post-it3 hours ago
        > I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.

        Because it reasons in one direction. First it encounters some kind of issue with 2-3 lines of Python that might make it not work, and then it goes onto plan B, which is making a library, but it doesn't circle back and compare the effort of making the library to working around whatever might make the 2-3 lines not work. Except sometimes it does, because it's inscrutable.

    • nicce3 hours ago
      Older Opus models will likely get deprecated and then over time this is the cheapest model. That is how prices are currently increased.
      • ChrisLTDan hour ago
        Yeah... Sonnet becomes the new cheap model, and some Fable class model becomes the more expensive/better one.
    • phainopepla22 hours ago
      Looking at some of the agentic coding benchmarks on the system card[0], pages 117-118, it seems that running it at low outperforms Sonnet 4.6 at any level, and is a good deal cheaper as well. So on low it could be a good workhorse for an Opus-planned task.

      [0] https://www.anthropic.com/claude-sonnet-5-system-card

    • SirMaster3 hours ago
      Maybe it's not for you? I don't pay, so I can't even use Opus... So this is an upgrade over Sonnet 4.6 for me.
    • enraged_camel3 hours ago
      Speed is a huge reason. Sometimes you just need some simple tasks to get done fast, and waiting 30-60 seconds for Opus to even start thinking can really slow things down.
      • humanymous3 hours ago
        Opus with low reasoning effort would be faster than Sonnet with high reasoning. So that won't exactly help. I think it would just be what those models are optimized to perform
  • conradkay5 hours ago
    Wow, seems worse even on price/performance than GLM 5.2, which is only 744b parameters.

    From the system card: "On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5

    As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym"

    • sixtyj4 hours ago
      I tried rewriting an article with GLM-5.2 and with Sonnet 4.6. The results were completely different, as LLMs are non-deterministic. But GLM-5.2 made a lot of subtle mistakes that needed to be corrected by hand. By contrast, Sonnet found and corrected all the mistakes in the second round.

      The situation was similar with planning and coding. GLM-5.2 seems good “on paper”, but the results in real usage were different.

      And I am not an attorney for Claude or GLM-5.2… :)

      But having used LLMs daily since Nov 2022, I have realized that all common benchmark results have to be confirmed on your own project - there is no “one model to rule them all” - you need to dig a specific model out of that haystack of thousands of models.

      Benchmarks help but they start to be similar to fuel consumption specs in car ads - real consumption is different for everybody :)

    • Retr0id4 hours ago
      Finally, a viable business strategy - sell security-oblivious code monkeys for cheap, then charge premium rates for agents capable of cleaning up the mess.
      • JacobAsmuth4 hours ago
        I think instead they should sell super hackers and get their product banned instantly and go bankrupt
    • loufe4 hours ago
      Not to single you out, parent commenter, but I really hope the quality of discourse on HN will move past these basic comparisons eventually. It seems like every thread on every model release has the exact same comments.

      "Wow, X models is Y% better or worse than Claude Z model on T benchmark"

      "That's irrelevant, they're just benchmaxing."

      "Not useable for daily coding or agentic workloads, the vibes are totally wrong."

      "It's almost as good, and costs a lot less, so I will absolutely use it."

      "I cannot imagine justifying using these, as the step change means open models lower costs do not make up for the productivity loss"

      I'm an unhappy Anthropic customer and really rooting for open models and non-gatekept intelligence, but how do we move on from this now meme-like model release discourse rigamarole. I do not know what that would be. I don't design LLMs nor benchmarks, and I genuinely appreciate that people do their best to provide information, even if non-perfect here. I'm sure most of you who actively read these comment pages on announcements must feel similarly, though, right?

      • tripleee4 hours ago
        I'm not sure what else can be said? I've found benchmarks to be a very weak signal for how good/bad the model is, but it's the #1 thing the companies highlight.

        20 minutes after the announcement there's no real useful statement that can be made about it.

      • conradkayan hour ago
        Yeah you definitely have to be skeptical regarding sentiment for open/local model capabilities, since there's bias from what people want to be true.

        I generally agree with this in spirit https://www.seangoedecke.com/are-new-models-good/ , but I think you can read Anthropic's results showing Sonnet 5 as almost strictly worse than Opus 4.8 as very credible/meaningful, and then draw comparisons from that

      • tiahura4 hours ago
        "It's totally obvious they quantized Claude Z"
  • taspeotis11 minutes ago
    > Claude Opus 4.7 and later Opus models, Claude Fable 5, Claude Mythos 5, Claude Mythos Preview, and Claude Sonnet 5 use a newer tokenizer that contributes to their improved performance on a wide range of tasks. This tokenizer produces approximately 30% more tokens for the same text. Claude Sonnet 4.6 and earlier models use the previous tokenizer.
  • Sol-4 hours ago
    Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code. After all, if it has the ability to generate safe code, it would imply that it knows something about cybersecurity, which could surely be used to hack all the banks in the world.
    • pennomi4 hours ago
      Trying to censor nudity in image generation models caused all kinds of problems with anatomy in image models. I’m sure these models will have similar issues with security.
      • raincole21 minutes ago
        Censorship on image generation models works on another level. The models can generate NSFW, but there are extra computer vision models checking if the images can be shown to the users. It's especially obvious for Grok and ChatGPT.
      • NonHyloMorph39 minutes ago
        Interesting, you find that in medieval painting, due to the authority of the catholic church.
    • deaux4 hours ago
      > Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code.

      This may be the goal.

  • simonwan hour ago
    Claude Sonnet 5 itself described its pelican as looking like a goose:

    > Illustration of a white goose riding a bicycle, with one wing extended forward to grip the handlebar, set against a plain white background with a brown ground line.

    https://simonwillison.net/2026/Jun/30/claude-sonnet-5/

  • phtrivieran hour ago
    What is the reference, unbiased, honest, reputable and trustworthy site that ranks and compares models on the couple of realistic metrics that matter ("Does it work for code", "no, I mean, for real", "how much does it cost", etc.)?
    • kccqzy13 minutes ago
      It’s not really possible unless you try. Different people use models so differently. The whole model situation has made public minute differences in personal preferences in the process of coding. Some people think carefully and strive to write code that’s as bug free as humanly possible on the first try; others write something that is only approximately correct and then iterate afterwards. The former people would align with a model that thinks for 40 minutes before producing flawless code; the latter would be driven mad by this excessive thinking. Some people like to interrupt AI as soon as they see AI making a mistake, others let AI continue and tell them about the mistake afterwards.
    • girvo32 minutes ago
      Truthfully? There isn't one. They all have flaws. Your best bet is to look at all of them, and then run a suite of evals yourself. It's rough out here!
    • bel8an hour ago
      The only metric that worked for me is running the same prompt 5x for each LLMs on my projects.

      I keep specific branches in a state where they are ready for developing new features.
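
The repeated-run approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the commenter's actual setup: `repeat_eval` and `make_stub` are hypothetical names, and the stub stands in for whatever really sends the prompt to a model and checks the result against a prepared branch.

```python
from typing import Callable, Iterable


def repeat_eval(run_once: Callable[[str], bool], prompt: str, runs: int = 5) -> float:
    """Run the same prompt `runs` times and return the fraction of passing runs."""
    passes = sum(1 for _ in range(runs) if run_once(prompt))
    return passes / runs


def make_stub(outcomes: Iterable[bool]) -> Callable[[str], bool]:
    """Hypothetical stand-in for a real model call plus a pass/fail check."""
    remaining = iter(outcomes)

    def run_once(prompt: str) -> bool:
        # A real implementation would reset the branch, call the model,
        # apply its changes, and run the project's tests here.
        return next(remaining)

    return run_once


# Made-up outcomes for illustration only.
stub = make_stub([True, False, True, True, False])
pass_rate = repeat_eval(stub, "Add pagination to the orders endpoint", runs=5)
print(f"pass rate: {pass_rate:.0%}")
```

Reporting a pass rate over several runs, rather than a single result, is what makes the comparison meaningful when outputs vary from run to run.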

  • m3h3 hours ago
    Important to note: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
    • ComplexSystems32 minutes ago
      So the post-introductory price is set such that Sonnet 5 will cost 100%-135% as much?
      • m3h28 minutes ago
        Correct. Albeit the nuance here is that a more capable model might solve problems more efficiently and faster, possibly saving you tokens.

        As with any new model, you won't know the real impact until you start using it for your workload.
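
The arithmetic behind the "100%-135%" reading can be written out as a small sketch. The 1.0-1.35 token multiplier comes from the quoted announcement; the 2/3 price ratio is an assumption built from the $2/$10 launch and $3/$15 figures quoted elsewhere in this thread, and `relative_cost` is a hypothetical helper name. It deliberately ignores the possibility that a more capable model finishes a task in fewer tokens.

```python
def relative_cost(token_multiplier: float, price_ratio: float = 1.0) -> float:
    """Cost of processing the same text, relative to the previous model.

    token_multiplier: how many more tokens the new tokenizer produces (1.0-1.35).
    price_ratio: new per-token price divided by old per-token price (assumed).
    """
    return token_multiplier * price_ratio


# Same per-token price, new tokenizer: between 1.0x and 1.35x the old cost.
same_price_low = relative_cost(1.0)
same_price_high = relative_cost(1.35)

# Assumed introductory discount (2/3 of the old price) with the worst-case multiplier.
intro_worst_case = relative_cost(1.35, price_ratio=2 / 3)

print(f"same price: {same_price_low:.2f}x to {same_price_high:.2f}x")
print(f"intro price, worst case: {intro_worst_case:.2f}x")
```

Under these assumed prices, the worst-case multiplier roughly cancels against the introductory discount, which is one way to read "roughly cost-neutral".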

    • mattas3 hours ago
      "We can raise prices in two ways: (1) raise the price per token and (2) increase the number of tokens we generate on your behalf. We promise not to do (2) maliciously. Promise."
      • conradkay3 hours ago
        I think the incentives are less bad since a good chunk of usage comes from subscription plans.

        There was a fairly major regression in Claude Code performance for some time when they changed the system prompt to try and make it less verbose (saving tokens). And if I'm not misremembering, there were a lot of complaints when they changed the default effort from high to medium.

      • squeegmeister3 hours ago
        Wouldn't it be more malicious for them not to mention this at all?
        • Alifatisk3 hours ago
          Sure, but I think doing it this way allows them to later on say they were transparent about it. Completely hiding this would make it very difficult for them to excuse when they get caught.
  • phillipcarter5 hours ago
    Seems to be another great incremental update to the workhorse, nice!

    I've been using Sonnet instead of Opus for almost all coding tasks for a while now. A little elbow grease to break down tasks and you can spend a lot less money for just about the same output quality.

    • SeanAnderson2 hours ago
      Crazy. I just changed the default for our entire org to Opus because people were continually unimpressed with Sonnet's abilities. It's fascinating to think how varied people's experiences are when interacting with LLMs and how much the outcomes depend on how people approach interacting with the models.
    • thewebguyd4 hours ago
      Yeah I think people are sleeping on the smaller/faster models like Sonnet. As long as you have a detailed plan or small, well scoped individual tasks Sonnet can implement just fine. Opus will still do better at more open ended tasks or completely "vibe coding." Or spec/plan with Opus, and have Sonnet implement.
      • conradkay3 hours ago
        I was surprised to learn that Sonnet generally has the same tokens per second as Opus
        • Computer02 hours ago
          I would indeed be more inclined to use it if the tokens per second were better, though I would then be using their more expensive Opus less. Perhaps it is a strategy.
          • conradkay2 hours ago
            They should add a Sonnet 5 fast mode at ~Opus pricing
  • satvikpendem4 hours ago
    > Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

    Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.

    And Opus 4.8 is still cheaper for a higher pass rate (much less open weight models like GLM 5.2) so not sure why I'd use Sonnet except on the low effort level for I suppose trivial tasks where I want it to work only 50% of the time judging by the graph. The pricing doesn't really make any sense.

    • secretslol4 hours ago
      "Lower ability to perform cybersecurity-related tasks" makes me super concerned it will leave my codebase like Swiss cheese for any American granny with access to Fable 5, when we non-American Brits, or rest-of-worlders, don't have access to it to clean our codebases.
      • __alexs4 hours ago
        100% this. I read these caveats in new models and all I hear is "we made sure this model has no idea about computer security." Such a weird thing to brag about.
      • doublescoop4 hours ago
        This is code for "this model can't be used to hack other systems as effectively as Opus or Mythos."
        • kube-system3 hours ago
          "dangerous cyber skills, such as developing software exploits" is very plainly referring to the same thing you are, but is more precise industry terminology rather than the loaded slang "hack".
          • doublescoop2 hours ago
            I was referring to "Lower ability to perform cybersecurity-related tasks," which is newspeak for hacking.
      • cute_boi4 hours ago
        I think they don’t understand that cybersecurity skills are what prevent bad code from making it into production.

        It’s like telling a chef to cook without a knife because knives can kill people.

        Dario and his lackeys at Anthropic aren’t visionaries.

        • norseboar4 hours ago
          I think this is more aimed at the US gov't than anything. They want to be clear that it's not very good at hacking, so that the gov't won't ban it.

          I'm sure they're well-aware that this also will make it worse at building secure systems, but the gov't isn't restricting releases based on that.

        • baq4 hours ago
          I think you misunderstood what their vision is, or rather what their possible futures are. They are many steps ahead of almost everyone, both in wargaming possibilities and the actual realized path. What doesn’t make sense to you may be the only safe option for them.
          • frabcus2 hours ago
            I've been wondering this - I don't have an intuition for Anthropic's gaming around military applications, or how this stage could play out in terms of relationship to Government controlling AI.

            Are there some Less Wrong posts or similar I should read that probably explain it?

          • tancop3 hours ago
            > What doesn’t make sense to you may be the only safe option for them

            thats true because their point of view makes no sense for us. dario is all in on lesswrong machine god theory and really believes they need to create a super intelligence before anyone else. that means doing as much as possible to slow down others progress and accelerate your own. but the fact that they believe its the only option doesnt make it true for the rest of us.

            • baq3 hours ago
              Never said otherwise, but it changes nothing. Their beliefs got them to this point on the timeline and that in itself cannot be ignored (or should I say, it should inform our priors...?) You can like or dislike them or what they do or don't do, but you must respect them regardless of that, purely because of their track record.
      • kube-system3 hours ago
        > any American granny with access to Fable 5,

        Fable is effectively not available to the general public in the US either

      • goalieca4 hours ago
        That’s not even close to true. Unless you’re vibe coding trash that a better model might catch.
        • secretslol4 hours ago
          I don't think so. During the time I was using Fable 5, I was getting it to clean security bugs that Opus 4.8 had introduced ... bugs which weren't localised to a single PHP file but were caused by cascading data flow through multiple PHP files. I'm not an expert on security but I know I wouldn't have found these myself. I knew from day one of Fable's release that it would do thorough security audits and fix loads of flaws, even offering up PoCs to help show that it fixed them, as long as I didn't explicitly ask it to do a security audit. I just said, "My codebase is a mess," and it went on for an hour doing a thorough security audit and helping plug numerous holes. This was before the "fix my code" story came out.
    • zlurker4 hours ago
      They spent months hyping up Mythos and ended up with it banned. I’d assume they want to both differentiate their products and appeal to regulators here
      • worldsavior4 hours ago
        They will release it eventually. Once they see the Chinese models are close to Mythos level they will release it before, so it will be "revolutionary".
        • jaapz4 hours ago
          It was already released. US government is the only reason it's not available to us mere mortals anymore
          • satvikpendem4 hours ago
            Due to Dario hyping it up as a world ending model. If they kept their mouths shut we'd all have it now still.
            • baq4 hours ago
              Where is gpt 5.6?
              • satvikpendeman hour ago
                If not for Dario hyping Mythos and Fable, GPT 5.6 would've released just fine on schedule as a point release without all the fear mongering. It was because Fable was banned that now the government is scrutinizing all models.
              • 081c28a923 hours ago
                Victim of the same hype generated by Dario. Now everyone has to walk on eggshells, do limited releases to trusted partners, and nerf their cybersecurity capabilities lest they get deemed “too powerful to release”.
                • M3L0NM4N2 hours ago
                  Yeah and our government is continuing to take pages from China's playbook for the last fucking decade... and not the plays that work.
          • worldsavior2 hours ago
            Obviously I meant released for public use.
      • sixothree4 hours ago
        I'm starting to think it discovered a 0-day held hidden by our government.
    • kristianc4 hours ago
      There's two classes of models now - the cybersecurity ones that none of us are getting, and the 'safe' models released for general consumption. This is letting us know which side of the divide it sits on.
      • Taek4 hours ago
        There's also Chinese models, which aren't trying to self-limit capabilities.
        • axus4 hours ago
          Surely the Chinese government will see US gov's intervention and say "Government control of business is stupid, our industry will have more independence from CCP control for the benefit of the world".
        • baq4 hours ago
          …as long as you don’t ask them about certain dates or squares.

          Also, I wouldn’t expect Mythos-class models to be allowed to be openly released by the CCP. Thinking otherwise is pure naivety.

          • girvo30 minutes ago
            Depends on the model. Step (from StepFun) will happily yap about Tiananmen to you, if you're running it locally.

            Quite a lot of these models have "safety" (lol) filters in front of them, rather than it being heavily encoded into the weights.

          • satvikpendeman hour ago
            Like the sibling said, you can fine tune if the rejections are in the weights but most often it's actually in the API harness itself; download Qwen or DeepSeek and run it locally to ask about certain dates and squares and it will happily tell you.
          • atemerev4 hours ago
            Well, the weights are open. De-CCP-ing them is a trivial task, about 40 minutes on modern hardware. So can be done for about $50.
            • bjelkeman-againan hour ago
              Any good reference for how?
              • ls61235 minutes ago
                • atemerev32 minutes ago
                  Heretic is a general abliterating framework, mostly used to remove safety alignment, not CCP alignment. Yes, you can put China-specific prompts to it, but you'll need a dataset first (which is available at deccp).

                  Also Heretic as it is does not work for GLM5.2 (at least as of 3 days ago when I tested it). You'll need some hybrid approaches.

              • atemerev37 minutes ago
                https://github.com/AUGMXNT/deccp - one example for Qwen models. For GLM 5.2, abliteration/realignment works somewhat differently, but with Claude's help, you can finish the job.

                I am planning to release the steering patch for the GLM 5.2 eliminating pro-CCP alignment in the next few days.

      • bwat494 hours ago
        this seems rather counter-productive, wouldn't a model with fewer cybersecurity capabilities be more likely to produce insecure code? Not to mention, Chinese models don't have these restrictions and can be used to exploit said insecure code.

        I suppose I shouldn't be surprised at how the trump admin is approaching AI regulation, counter-productive is really all they do

        • ihsw2 hours ago
          As contradictory as it sounds, they (Anthropic) are probably trying to dance the fine line where its public models can write secure code but cannot exploit insecure code.
    • MostlyStable4 hours ago
      Why do you think they are bragging? Anthropic has long been the company to give us by far the most in-depth information about their models, both positive and negative. I read this as them just stating a fact about this model that users would want to know.
      • organsnyder4 hours ago
        I'm absolutely certain that their marketing team has input on (if not owning) these announcements.
        • gallerdude4 hours ago
          Of course. But is it really impossible that Dario’s directive to the marketing team is “try not to make us look bad, but also be honest about our models’ capabilities, so people can stay informed”?
        • MostlyStable4 hours ago
          I find it interesting how two different directly opposed messages seem to have both been interpreted as being nothing but marketing speak.
      • MallocVoidstar4 hours ago
        The preceding sentence is

        >Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts.

        which is obviously painting that as a good thing. So reading the next sentence as "in other good news" is reasonable.

        • MostlyStable4 hours ago
          While I'm still not sure I would characterize that as bragging, you're right that that is a fair interpretation. However, another fair interpretation is something along the lines of "the downside or cost of this positive thing is the following negative thing."
      • satvikpendem4 hours ago
        Anthropic, most in-depth? That's laughable given how closed down they've been over the years. If you want in-depth, DeepSeek actually still publishes papers on their methods for anyone to implement, which has led to them being by far the most cost-efficient model provider for the performance.
        • MostlyStable4 hours ago
          I was talking about reporting on testing and capabilities. Yes, open models provide a greater amount of information about the development of the model and how to run it yourself, but I am quite confident that literally no AI company, open or closed, conducts and reports so thoroughly on testing about the capabilities of their models.
    • bluepeter4 hours ago
      Flowers for Algernon. And, sadly, expect this from now on. You saw it with OpenAI releasing Sol/Terra/Luna with a chart showing how they weren't quite as good as Mythos. It's all messaging to the USG to try to avoid/minimize arbitrary review from multiple agencies. 'Hey, it's smart, but look how stupid it is at "cyber."'
    • K0balt4 hours ago
      Restricting the models isn’t about restricting offensive capabilities. They were already very well aligned to reduce that risk.

      This recent government interference is about trying to preserve US offensive cyberwarfare and cyberespionage capabilities. It’s not about “bad actors”. It’s about defensive capabilities becoming pervasive and cheap, which would kneecap US cyberoffensive capability.

      It’s like making seatbelts illegal so that police chases can be more effective.

    • dgacmu4 hours ago
      One of the best queries I've done with an LLM recently was: Create a plan for improving the robustness and resilience of this code, particularly to untrusted inputs.

      Gemini wouldn't do a security audit. But it came up with a great set of mitigations and identified an extant XSS flaw in the process of improving robustness.

      There's an awful lot of good that can come from proactive, defensive use of LLMs. I realize there's also a lot of pain when the difficulty of exploit finding drops suddenly, but in the long term we may all benefit from the defensive side of this.

    • lanthissa4 hours ago
      so it doesn't get blocked. last time they said a model was great at cyber it didnt turn out well
    • pseudosavant2 hours ago
      So that the current US administration doesn't block broad usage of Sonnet 5 probably. They'd have to collect your ID and approve you if it was good at cybersecurity. Because such is the freedom in the U.S. right now.
    • Philpax4 hours ago
      To avoid Lutnick getting on their case again.
      • dgellow4 hours ago
        He has the opportunity to do the funniest thing ever
    • nozzlegear3 hours ago
      It seems obvious to me that they put that in there in an effort to avoid another reaming out by the long, orange dick of the US government.
    • johnfn4 hours ago
      > Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.

      What exactly do you want Anthropic to say here? "This model, the one we are about to give to the entire world for cheap, is really good at hacking"? Saying Sonnet is terrible at cybersecurity is the most reasonable thing they can say, out of a lot of bad options.

    • doctoboggan4 hours ago
      You have to pay more for that, and/or go through some USG vetting process.
    • 2001zhaozhao4 hours ago
      They are obviously trying to avoid getting Sonnet 5 blocked.
    • WithinReason4 hours ago
      That part is likely directly addressed to the US government.
    • chvid4 hours ago
      Does it mean it generates code with random security holes?
    • jayd164 hours ago
      Market segmentation?
    • re-thc4 hours ago
      > And Opus 4.8 is still cheaper for a higher pass rate

      Unless it spams as much as Opus, I doubt it. Opus 4.8 literally spams text like puke. On a longer run especially if you get cache misses here and there the bulk of the cost is all the extra context it adds.

    • drcongo4 hours ago
      What makes that a brag?
  • brunooliv3 hours ago
    I only wish for Opus 4.6 from earlier this year at a faster inference speed. Since Opus 4.6, things have been so much messier, and the overall push for more agency isn’t really panning out for agent-assisted development as much as they would like.
  • theLiminator4 hours ago
    Seems like the way to go for any smaller models is to only use the low reasoning levels, and for anything where you'd want it to reason harder, to just use a larger model.

    In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher levels of reasoning are never pareto optimal unless you're at the largest model size).
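
The Pareto argument above can be made concrete with a short sketch. The numbers below are invented purely for illustration and are not taken from any published chart; `Config` and `pareto_frontier` are hypothetical names. A configuration is dropped when some other configuration costs no more and scores at least as well, with at least one of the two strictly better.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    cost: float   # cost per task, arbitrary units (made up)
    score: float  # benchmark score, higher is better (made up)


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configurations not dominated by any other, sorted by cost."""
    frontier = []
    for c in configs:
        dominated = any(
            o.cost <= c.cost
            and o.score >= c.score
            and (o.cost < c.cost or o.score > c.score)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost)


# Illustrative values only: a small model at three effort levels, a large model at two.
configs = [
    Config("small-low", 1.0, 50.0),
    Config("small-medium", 2.0, 60.0),
    Config("small-high", 4.0, 64.0),
    Config("large-low", 3.0, 66.0),
    Config("large-high", 6.0, 75.0),
]

for c in pareto_frontier(configs):
    print(c.name, c.cost, c.score)
```

With these invented numbers, "small-high" falls off the frontier because "large-low" is both cheaper and better, which is the pattern the comment describes.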

    • adam_arthur2 hours ago
      I've found disabling reasoning entirely but adding a "reason" to the JSON response from the LLM to work significantly faster and consume many fewer tokens for narrowly scoped prompts.

      At least for Claude family models.

      e.g.

        {
          "reason": "<Describe why you picked this result>",
          "selection": "<The number of the value you selected>"
        }

      I'm sure native reasoning produces more accurate results, but for my use case the quality was about the same, and the model would reason for thousands of tokens in native reasoning vs just 1-200 with response level reasoning.

      Again, to be clear, this is for deterministic/pipeline style workflows, not agentic/coding use.
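
A pipeline consuming a response shaped like the one above might validate it along these lines. This is a sketch under assumptions: `parse_selection` is a hypothetical helper, the example response string is made up, and no model API is called here. Only the two field names come from the comment.

```python
import json


def parse_selection(raw: str, num_options: int) -> tuple[int, str]:
    """Parse a {"reason": ..., "selection": ...} response and validate the choice."""
    data = json.loads(raw)
    reason = str(data["reason"])
    # The template asks for "the number of the value", so accept "2" or 2.
    selection = int(data["selection"])
    if not 1 <= selection <= num_options:
        raise ValueError(f"selection {selection} is outside 1..{num_options}")
    return selection, reason


# Made-up model output for illustration.
example = '{"reason": "Option 2 matches the requested unit.", "selection": "2"}'
selection, reason = parse_selection(example, num_options=3)
print(selection, reason)
```

Putting "reason" before "selection" in the schema is likely the important design choice: since the model generates the response left to right, the short explanation is written before the answer it is meant to justify.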

    • docheinestages4 hours ago
      My experience with using low reasoning effort has been nothing but a waste of time. Claude often keeps guessing, not calling tools to ground itself, and basically at the end I end up wasting the same amount of tokens or just switch to Opus on xhigh. It's been a terrible experience.
    • mwigdahl4 hours ago
      Not to sound like an LLM, but that seems exactly right to me. Use it as a cheaper, high-functioning task subagent and lower reasoning for a master Opus session. As long as not every portion of your task requires maximum intelligence, you should come out ahead.
      • user439283 hours ago
        Won't any input be charged uncached, and the output of the small model charged again as uncached input to the bigger model?

        I don't know whether that comes out ahead compared to just staying with the better model in the first place.

        • mwigdahl3 hours ago
          It's a good question, but for multiturn conversations even cached context adds up quickly. My experience has been that spawning off subagents for defined tasks in a large overall plan generally makes me come out ahead.

          I'm sure folks' mileage will vary though.
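
To see why cached context can still add up over a long conversation, here is a rough back-of-the-envelope sketch. Everything numeric in it is an assumption: the 10% cached-read rate, the turn count, and the tokens per turn are placeholders, not published pricing, and `billed_input_tokens` is a hypothetical helper. It models each turn as re-reading all earlier context at the cached rate plus new tokens at the full rate.

```python
def billed_input_tokens(turns: int, tokens_per_turn: int, cached_rate: float) -> float:
    """Full-price-equivalent input tokens billed over a multi-turn conversation.

    Each turn re-reads all prior context at `cached_rate` (a fraction of the
    full input price) and adds `tokens_per_turn` new tokens at full price.
    """
    total = 0.0
    for turn in range(turns):
        prior_context = turn * tokens_per_turn
        total += prior_context * cached_rate + tokens_per_turn
    return total


# Assumed values for illustration only.
with_cache = billed_input_tokens(turns=10, tokens_per_turn=1000, cached_rate=0.1)
without_cache = billed_input_tokens(turns=10, tokens_per_turn=1000, cached_rate=1.0)
print(f"with cache: {with_cache:.0f}, without cache: {without_cache:.0f}")
```

The prior-context term grows with the square of the number of turns, so even a heavily discounted cache read eventually dominates. A subagent that starts from a short, fresh context avoids that growth, at the price of paying full rate for its own input and for its summary being fed back to the parent session.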

  • mag72695 hours ago
    When can we get a new Haiku? 4.5 came out nearly a year ago, and it's showing its age.
    • scosman4 hours ago
      Look at Qwen for that level of intelligence.
      • anthonypasq4 hours ago
        needs to be on bedrock for me to use it at work
        • 0xbadcafebee3 hours ago
          Gemma 4, Kimi K2.5, MiniMax M2.5, gpt-oss, GLM 5, Qwen3 Coder Next, DeepSeek V3.2, Devstral 2, are all available on AWS Bedrock and all are about Haiku level
          • scosman2 hours ago
            Kimi K2.5 >> Haiku. Gemma 4 32b might fit the bill.
  • johnfahey4 hours ago
    Judging from those cost-performance graphs, Sonnet doesn't make sense to run at anything higher than a medium reasoning level, since Opus 4.8 low reasoning outclasses it for the price.

    This line as a selling point is also pretty funny:

    > Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

  • wolttam5 hours ago
    I didn't think they'd actually release a model that was worse than the open-weight frontier and at a higher price-point. Wow.
    • LUmBULtERA4 hours ago
      That's yet to be determined. I think a lot of open-weight models are benchmaxxed and their usefulness for many tasks is not represented by those benchmarks.
      • enraged_camel3 hours ago
        Yes, this has been my experience. They all struggle with long-horizon tasks and eventually start going in circles.
    • s3p4 hours ago
      Why did the other reply to this get flagged as dead? It was a comment about how someone would come out saying that Sonnet 5 would be better on the pelican test and therefore it has to be good. But I guess HN loves pelican SVGs so much that you're not allowed to criticize it.
      • steveklabnik3 hours ago
        If you look at the account history, it's pretty clearly an account-level thing, not a comment-level thing.
    • 27484848485 hours ago
      [flagged]
  • matheusmoreira8 minutes ago
    Who cares about Sonnet? I want to know about Fable. Are the export restrictions really going to be permanent?
    • stingraycharles6 minutes ago
      It’s supposed to happen when Anthropic introduces identification, which I believe is planned for mid-July.
  • garo-pro4 hours ago
    Seems like the cyber detection is even on Sonnet now. https://support.claude.com/en/articles/14604842-real-time-cy...
  • mchusma5 hours ago
    This is much more interesting of a model at $2/$10 (their launch pricing) than at full price. There are many competing models at around this level of performance.

    I also like that the difference between low, medium, high, xhigh seems more spread, which is actually a good thing for people trying to tune applications. Running Sonnet 5 on low with the launch pricing makes this potentially a better fit than Haiku or open source models for some tasks. I don't think it will make sense at full price.

    • mchusma4 hours ago
      Really if they wanted a standout model that would really take the wind out of GLM's sails, they should have made this the new Haiku, priced at Haiku levels with this performance.
  • DonsDiscountGas4 hours ago
    I'd love if they would include speed (though I know there are difficulties involved). At this point the quality of Opus 4.8 is no longer my limiting factor, it's the speed, so a faster model would be great.
    • boc3 hours ago
      Have you tried Opus on fast mode?
      • DonsDiscountGas4 minutes ago
        I haven't because I'm not made of money but maybe I will
  • SkitterKherpi3 hours ago
    $3/$15 vs. $5/$25 for Opus 4.8 doesn't seem enough cheaper to be worth it. It depends on how much better it is than e.g. Mimo, but I imagine Mimo and co. are too cost-efficient in the lower tier to be overtaken by Sonnet for most tasks.
  • alvis4 hours ago
    Ironically, the key message of today's release is that Sonnet 5 is far less capable than Opus 4.8 and Mythos 5. It's a funny development in the past few weeks
  • tokengod5 hours ago
    That’s nice, but we want Fable
    • giancarlostoro5 hours ago
      The reality is that Fable will eventually be obsolete and Sonnet / Opus will surpass it. Fable did cost 2x as much as Opus, so I assume it involves a much higher cost for what it did, but I wouldn't be surprised if Fable will be obsoleted by Opus or even Sonnet sooner or later at less cost.
      • ianhawes4 hours ago
        Okay I don’t care about “eventually”, I want Fable now.
        • arcatech4 hours ago
          Have you considered getting better at coding so you can build stuff yourself instead of waiting for models you might not be able to get access to anymore?
          • giancarlostoro3 hours ago
            I'd love to meet the devs who can spin up full feature web apps in under 15 minutes with all the bells and whistles I've gotten Claude to spin up and code. I don't think the AI haters understand the level of time cutting that you can achieve with a very simple and reasonably crafted prompt.

            I'm talking back-end, with database models, classes, queries, accompanying front-end layouts, with real dynamic data, running. Stuff that takes days to weeks to spin up, with minimal errors or issues, having cut down on days or weeks of effort, you can focus on testing and making it all into better code.

            • arcatech2 hours ago
              And the trade off for that productivity is relying on a completely untrustworthy company/product that gets more expensive and uncertain by the week while your skills erode.
              • halfmatthalfcat15 minutes ago
                Companies don't care about your skillz, they care about velocity and costs. If AI helps increase velocity and decrease cost by lowering total headcount, then it's a massive win. That factors in AI "unpredictability".
          • cesarvarela3 hours ago
            This is like telling someone who wants a motorcycle that they should get better at running instead.
            • arcatech3 hours ago
              When the motorcycle manufacturers keep making each new model worse and more expensive and the government keeps trying to ban them.
    • astlouis444 hours ago
      Same
  • 827a2 hours ago
    Tbh we'll see what using it looks like, but the reasoning/cost charts do not look promising. It seems like the only useful reasoning level for Sonnet 5 is Low; medium might trade blows at price/performance with Opus, but anything beyond that Opus is Just Better.

    I struggle to understand where this model fits in. If I need a cheap model for simple stuff (like, summarizing an email); I'd go Haiku (actually, I'd go Deepseek v4 Flash, but you catch my drift). I just can't think of many tasks where I'm like "yeah let me reach for Sonnet Low Reasoning so I can save a dollar but also seriously run the risk of it failing"; I'd just reach for Opus Low.

    • brokencode2 hours ago
      Kind of crazy how bad this release actually is. I even dug around in the full system card, and every graph showed the same thing.

      Low and maybe medium will save money on simpler tasks, but after that it just isn’t worth it compared to Opus.

      I wish they would have explained in the blog post why they think anybody would ever want to use this above medium.

      Maybe it works well on things that aren’t clear in the benchmarks.

  • johnhamlin4 hours ago
    Kind of hilarious how much they’re touting that it sucks at cybersecurity like it’s a feature
  • chipgap985 hours ago
    Interesting that tasks on extra high cost almost the same as Opus 4.8 with a slightly worse performance
    • bredren5 hours ago
      This is on the browsercomp graph, right?

      In that, it seems sonnet 5 on high costs more than opus 4.8 at a lower pass rate. Am I reading this correctly?

      Edit: It looks like the key value proposition of the updated model is that it is much better than Sonnet 4.6.

      Whereas Sonnet 5 delivers great value (by browsercomp benchmarks and compared to opus) when running in low and medium.

      So: Sonnet 4.6 should ~never have been run for low, medium or high when Opus 4.8 has been available. Whoops, I think I have some skills that delegate easy stuff to Sonnet.

      ---

      I remember Anthropic pivoting everyone's default model to Opus but had not seen it put so starkly before.

      I am a bit confused by the subscription `/usage` screen. It splits out sonnet usage, and I'd presumed that would have contributed to a lower use of subscription quota.

      But if this is correct, Sonnet usage was basically like smoking unfiltered cigarettes.

      • mchusma4 hours ago
        I agree with this assessment, IMO my takeaway from this is "Generally run Sonnet on low, otherwise use Opus". It's kind of like an "extra low" setting of Opus. (depends on the application for sure).
        • bredren4 hours ago
          It would be good if Anthropic provided some kind of feedback or even toggle to auto-route requests for models being used at thinking levels that would be a better value using a different model.

          Sort of like, getting an automatic upgrade at a car rental or hotel if there is availability.

    • mcbuilder5 hours ago
      LRMs are plateauing for sure, not that there won't be gains to be had in the future, but it's not like the era of rapid progress that was the past year any more.
      • gdhkgdhkvff3 hours ago
        I agree that the rapid improvement from like 2023-24 era is over (from a perspective of going from a 3/10 to a 7/10, you can’t then go to a 11/10). There was just so much more space to grow back then.

        But isn’t Fable supposed to be another step change? I never used it, myself.

        Tbh, at this point I think top tier models are smart “enough” (I’m sure this will look antiquated in a year), and the way to give me MORE noticeable improvement is to make them much faster rather than much smarter. Or even a way to automatically and accurately pick faster models when it makes sense. I know that IDEs have Auto modes, but it’s not something that I trust right now to pick smart+fast instead of picking “maybe smart enough”+“cheaper for harness owner”.

      • roughly5 hours ago
        A great many people were predicting this would be the case a year ago and being told they were wrong and to get on the boat.
        • mcbuilder4 hours ago
          I consider myself to be in that cohort as well. :)
  • edude03an hour ago
    Let’s see how long until opus 5 comes out but to me this lends some credence to the rumour that fable/mythos was supposed to be opus 5
  • andai5 hours ago
    Opus 4.8 beats Sonnet 5 on the pareto frontier in several of their graphs (Agentic Search, Agentic Computer Use).

    In other words, for certain tasks, Opus 4.8 is cheaper than Sonnet 5, and does better than Sonnet 5.

    I've noticed this pattern on a lot of benchmarks. You can try to emulate a bigger model by ramping up the test time compute (max reasoning, more turns, model fusion etc.), but you can't reach the same quality level, and you often exceed the cost you would have paid by just using a bigger model.

    tldr: if you're doing something hard, just use a bigger model.
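
A minimal sketch of the "beats on the Pareto frontier" reading above: a run is dominated when some other run costs no more and scores no worse, with at least one of the two strictly better. The model names and every number below are made up for illustration, not taken from Anthropic's charts; they only mimic the shape described (a smaller model at high effort landing above the cost and below the score of a bigger model at low effort).

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost: float   # dollars per task (hypothetical)
    score: float  # benchmark pass rate in percent (hypothetical)


def dominates(a: Run, b: Run) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    return (a.cost <= b.cost and a.score >= b.score
            and (a.cost < b.cost or a.score > b.score))


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep only the runs that no other run dominates, sorted by cost."""
    frontier = [r for r in runs if not any(dominates(o, r) for o in runs)]
    return sorted(frontier, key=lambda r: r.cost)


# Made-up data points: "small" and "big" stand in for two model tiers.
RUNS = [
    Run("small-low", 0.50, 55.0),
    Run("small-medium", 1.00, 62.0),
    Run("small-high", 2.50, 66.0),
    Run("small-xhigh", 4.00, 68.0),
    Run("big-low", 1.50, 70.0),
    Run("big-medium", 3.00, 78.0),
]

if __name__ == "__main__":
    for run in pareto_frontier(RUNS):
        print(f"{run.name}: ${run.cost:.2f} -> {run.score:.0f}%")
```

With these invented numbers, the two highest effort levels of the smaller model drop off the frontier because the bigger model at low effort is both cheaper and better, which is the pattern the comment describes.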

    • copperx5 hours ago
      And Claude Code penalizes you for using Sonnet on the subscription plan, so there's little reason to use it.
      • bredren4 hours ago
        This is what I realized, can you provide more detail on how you've observed this? The /usage screen does not make it clear.
        • MillionOClock4 hours ago
          Not the original commenter, but personally I noticed my quota usage didn’t feel like it was being spent at a much lower rate when using Sonnet even on a relatively low thinking budget and based on a few comments here it seems I might not be the only one. Has anyone else noticed this? Wasn’t it different in the past? I thought I would be getting to use Sonnet much much more than Opus but it did not feel that way despite being on 20x plan.
      • gverrilla4 hours ago
        How so?
  • taytus14 minutes ago
    Roughly on par with GLM 5.2 at 5x the price
  • cenobyte3 hours ago
    Claude Sonnet 5 is built to be the most agentic Sonnet model yet.

    or

    The Dodge Charger is built to be the most Charger-like car yet.

  • docheinestages4 hours ago
    But does it burn tokens just like Opus? That's the feeling I have nowadays. Regardless of what model I choose, the 5-hour limit gets exhausted in the first hour or so.
    • a_c3 hours ago
      "Claude Sonnet 5 is available everywhere today at an introductory price of $2 per million input tokens and $10 per million output tokens through August 31, 2026. It then moves to standard pricing at $3 per million input tokens and $15 per million output tokens."

      "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."

      If we trust them, then it is roughly the same as sonnet 4.6
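
A rough sketch of the arithmetic behind "roughly cost-neutral", using only the figures quoted above. Two assumptions are not in the quotes: that Sonnet 4.6 is billed at the $3/$15 standard price, and that output token counts inflate by the same 1.0-1.35x factor as input. `relative_cost` is a hypothetical helper name.

```python
# Prices in dollars per million tokens, as quoted above.
INTRO_PRICE = {"input": 2.0, "output": 10.0}
STANDARD_PRICE = {"input": 3.0, "output": 15.0}

# Assumption: Sonnet 4.6 is billed at the standard $3/$15 rate.
OLD_PRICE = STANDARD_PRICE

# Tokenizer inflation range quoted above: same text -> 1.0x to 1.35x tokens.
INFLATION_RANGE = (1.0, 1.35)


def relative_cost(new_price: float, old_price: float, inflation: float) -> float:
    """Cost of the same text on the new model, relative to the old model."""
    return (new_price * inflation) / old_price


if __name__ == "__main__":
    for label, price in (("intro", INTRO_PRICE), ("standard", STANDARD_PRICE)):
        for kind in ("input", "output"):
            low, high = (relative_cost(price[kind], OLD_PRICE[kind], f)
                         for f in INFLATION_RANGE)
            print(f"{label:8s} {kind:6s}: {low:.2f}x to {high:.2f}x of the old cost")
```

Under those assumptions the same text costs about 0.67x to 0.90x of the old price during the introductory window, and 1.00x to 1.35x once standard pricing applies, so the "same or slightly cheaper" reading only holds through August 31, 2026.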

  • caste38 minutes ago
    idk, i think they just tried to compensate for the ban of fable, nothing too good
  • theplumber4 hours ago
    Is there any reason to use Sonnet instead of GLM?
    • hootz3 hours ago
      Your US company banning usage of non-american models. Other than that, no.
    • atemerev4 hours ago
      Speed. But mostly no.
  • alvis5 hours ago
    What I'm starting to hate is that each model's effort level can mean completely different power.

    Today sonnet 5's med level effort is equivalent to sonnet 4.6 low level effort :/

    • nsingh24 hours ago
      That seems to only be true for the "Agentic Search" benchmark. That benchmark in particular is a bit weird, because Sonnet 4.6 effort levels had a relatively small effect, so Sonnet 5 med is basically comparable to all effort levels of Sonnet 4.6.
  • ThouYS2 hours ago
    Why did this get the coveted "5"? I want an Opus that can compete with GPT 5.5
  • oybngan hour ago
    In my case, 4.6 degraded massively over time. 5 fails the same basic tasks that I gave 4.6 yesterday. And quite frankly this low, med, high, extra, max, turbo, ultra, ludicrous nonsense is getting tiresome
  • m3h4 hours ago
    Why is Claude Sonnet 5 allowed to be released but OpenAI Terra not? Are they not the same class of models?
  • Cu3PO424 hours ago
    Sonnet 5 is not currently available in the EU region on Bedrock, whereas previous models were and still are. I wonder if this is only due to early stages of the rollout or if this is due to recent US restrictions.

    Unfortunately that means I won't be using it at work for now.

  • rw24 hours ago
    The "cheaper models" from big AI companies are next to useless, as they don't even score as well as the open/super cheap Chinese models. Only the frontier big models like Fable and Opus have value.
  • kingjimmy4 hours ago
    interesting footnotes: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer... can map to more tokens: roughly 1.0–1.35× depending on the content type." AKA expect higher costs on Sonnet 5 vs Sonnet 4.6 for the same tasks.
    • winstonp3 hours ago
      same happened to Opus 4.7
  • OsrsNeedsf2P3 hours ago
    Great timing. I just started using Claude Sonnet for a long-term reverse engineering project[0] for a game I used to play as a kid. Cheaper tokens from a sufficiently smart model, combined with hard verification, make it a perfect combo for the task

    [0] https://github.com/dginovker/BFME-Source-Code/

  • tripleee4 hours ago
    interesting how much worse the sentiment around Anthropic is getting
    • mwigdahl4 hours ago
      Seems like a combination of multiple factors:

      "They took my shit away!" -- 3-day Fable 5 addicts (me)

      "How dare they tell Trump no?" -- US nationalist / "my country right or wrong" types

      "Great to see a closed source company fail!" -- open source boosters

      "Great to see an American company fail!" -- anti-US, and/or pro-China folks

      "Great to see a successful company fail!" -- anti-capitalists and/or sour-grapes crab bucket types

      "Serves you right for ripping off creators!" -- copyright warriors

      "They keep silently nerfing the models!" -- secret downgrade conspiracy theorists

      "Quit killing the planet!" -- anti-datacenter advocates

      • thepasch3 hours ago
        I'm personally in the "they keep releasing shameless lobbying papers disguised as thinly veiled research or essay-coded content, push anticompetitive walled-garden practices, show little else but contempt for their non-enterprise customer base, refuse to communicate about anything and choose public silence as their baseline, seemingly force their employees into vows of public silence as well, actively degrade their products across the board with their vibeslop approach with measurable impacts on customers, openly attack not only open weights models but open source software, and all while pretending they're the 'public benefit corporation' formed by a valiant group of heroes escaping from a duplicitous snake and who, even in light of their own massively duplicitous behavior as of late, should apparently be trusted to be some sort of arbiter over what this tech should get to be and how it should get to be used while they could hardly be more gleeful about how we're all going to be replaced in 6 months from now perpetually" camp.

        Which is a bit of a bummer considering they do genuinely make the best model that's most pleasant to work with in my opinion.

      • tripleee3 hours ago
        It seems to be more them losing goodwill combined with their marketing.

        I don't agree with your framing that all negativity is from crazies

        • mwigdahl3 hours ago
          I don't think all the negativity is from crazies, but big chunks of it are certainly motivated. I certainly left out numerous other categories.
          • feralcoderan hour ago
            The amount of anti-Anthropic and anti-Dario posts I've seen on reddit threads has gotten a bit ridiculous.

            It feels like your analysis is mostly spot on, it's the confluence of several motivated parties pouring effort into social media.

            Many of the posters are pro-foreign models/pro-open source, and most can't distinguish the difference between "open source" and open weight models like Qwen, Minimax, or GLM.

            Reminds me of the old "free as in beer" vs "free as in speech" debate. Free beer means you don't pay, but you don't get to see the recipe or change it. Free speech means you get the actual source and the right to study it, modify it, and redistribute it.

            Open weight models are basically the beer version. You can download the weights, run them locally, fine-tune them, quantize them, host them on your own boxes — but what you have is a finished product, not the blueprint for how it was built.

            • tripleee16 minutes ago
              Fable as released was censored to the point of being useless for many tasks. Now surprise surprise it's not even available unless you're pre-approved.

              Qwen is also censored - although since it's open weight, there are completely uncensored versions available.

              The owners of Qwen can't jack up the prices to something I'm unable to pay. They can't take it away.

              The owners of Qwen can't log and train on my data.

              Open weight models share far more in common with free speech than free beer.

              If big daddy Dario and his company are getting pushback it's not because of some motivated group trying to take them down. They brought it on themselves.

      • noumenon1111an hour ago
        Most of these are good points though with the right framing.
      • 0xbadcafebee3 hours ago
        "OpenAI models are better, cheaper, and more reliable" - rational people
  • arendtio4 hours ago
    > Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

    It seems being incompetent is a feature now...

  • primaprashant4 hours ago
    Based on both performance vs price charts, it seems using Opus 4.8 with med effort is almost a better choice than using Sonnet 5 at xhigh effort
  • SoKamil4 hours ago
    I believe that’s gonna be meta for agentic coding this year for enterprises. Cost optimized models approaching SOTA capabilities on software engineering but without cybersec training.
  • scottfits4 hours ago
    > the computer use evaluation OSWorld-Verified. Sonnet 5 (orange line) is a strict improvement over Sonnet 4.6

    cool to see, still waiting for models to get better at computer use.

  • beernet5 hours ago
    Anthropic's run on the model and product side of things is highly impressive. They got Sam A. punching the air consistently, which is well-deserved and self-inflicted above all.
    • CuriouslyC3 hours ago
      Wdym? They've been knocking it out of the park on marketing, but Claude Code is still a meme and Opus is getting trashed by GPT 5.5. Meanwhile you can't even use their "dominant" model, and anecdotal reports from when people could use Fable (when they weren't getting silently poisoned) were that it was only marginally better than GPT 5.5 in terms of SWE smarts, mostly being better in terms of pleasantness to interact with and design taste.
      • beernet3 hours ago
        > Claude Code is still a meme

        Claude Code generates more revenue than OpenAI... It appears to be a nice meme.

        • CuriouslyC3 hours ago
          Like I said, Anthropic's marketing is killing it, they've got people freely(?) shilling for them on public forums so even if they have shit developer relations and community relations and a model that's mostly worse while being more expensive, they can ride a wave of misinformation.
  • swe_dima3 hours ago
    Not sure what niche it's going to occupy: too expensive for its intelligence category.
  • jerrygoyal4 hours ago
    It's actually a huge update for building products, given most tasks are sub-agent driven where Sonnet is used, steered by Opus.
  • whh2 hours ago
    It's not Fable, but I'll take it.
  • docproof4 hours ago
    The jump in reasoning quality is noticeable. What's interesting is how it handles ambiguous instructions now — it seems to ask fewer clarifying questions and just makes a reasonable judgment call. That's a double-edged sword depending on your use case.
  • baalimago4 hours ago
    Not looking great for an upcoming IPO
    • mrcwinn4 hours ago
      You’re right, it’s looking stellar. Well beyond great. Real, and unprecedented, revenue growth will do that for a company.
      • CuriouslyC3 hours ago
        "Real and unprecedented revenue growth"

        Bro that is financial engineering, not real revenue growth. They engineered the switch to usage based pricing and a price hike timed the quarter before they wanted to go public, long enough to juice their numbers but not long enough for them not to be able to manage backlash and have to walk things back. Then they tried to extrapolate that manufactured bump to make it look like they have record shattering revenue growth.

  • mellosty4 hours ago
    Sonnet seems to be really expensive
    • mrcwinn4 hours ago
      Have you followed Anthropic at all?
  • benjiro294 hours ago
    Anybody notice that they did not include Sonnet 5 Max in the "Agentic Search results", when comparing to Opus 4.8 ...

    Based upon the "Agentic Computer usage" chart, Sonnet 5 Max was going to be off the "Agentic Search results" chart. lol ...

    In short, Sonnet 5 Low/Medium is more cost efficient if it's a task below Opus 4.8 Medium. For the rest it's expensive and you're better off using Opus 4.8.

    Why even release this model?

    • ricardobeat4 hours ago
      Because it’s a massive improvement over the previous model, and cheaper?

      You are reading too much into the graph and ignoring the threshold of usefulness for real world tasks. By that logic Sonnet 4.5 would have never been worth using.

      • benjiro294 hours ago
        Am I missing something? Because you're making my point. It's only worth it compared to Opus 4.8 if the tasks you're running require Opus 4.8 low (or a non-existent lower level).

        For the rest the gap in pricing vs efficiency is so small, that there is no point in using Sonnet. I am looking at their own cost comparisons vs efficiency...

        • ricardobeat3 hours ago
          The point is that Sonnet at medium or even low will be smart enough for most daily tasks. You’re defining “worth using” as if you always need the highest performance possible, which is what these benchmarks measure, but most work doesn’t need it. You’ll pay more to get the same result. Sonnet 4.5 is very popular as a main model currently, this is a free upgrade.

          I use Haiku a lot for agent workflows, if I can get better output at similar prices, Sonnet 5 will replace it completely.

    • bredren4 hours ago
      I'd narrow that to why even allow the harness to run `high` on this model?
  • mellosty4 hours ago
    It does not pass the "I want to wash my car, should I drive or walk"
  • smallerfish4 hours ago
    Ah that's why Opus has been so slow for the last couple of days.
  • m3kw939 minutes ago
    should have called it 4.9, it doesn't deserve the 5 moniker
  • prmph3 hours ago
    So many things to think about regarding these "benchmarks":

    - Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?

    - Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?

    - Would it be more useful to move toward a comparative rather than absolute ranking?

  • tensegrist5 hours ago
    there was a vibecoded prediction market–style page that was put up yesterday (?) that got the date exactly right i think
  • joaohaas3 hours ago
    Important to note that the cost graphs are heavily distorted. The agentic search one for example is divided into 3 'columns': $0-$2, $2-$5 and $5-$10.

    And yet, the $2-$5 section is the widest, even though it only contains a single point.

    I can't even say if this is making the product look better or not, but it sure is weird. Maybe Claude just hallucinated those splits xD

  • guelo2 hours ago
    Have they ever said what the difference is between Sonnet and Opus? Are they trained differently? Different architectures? Is Sonnet a distillation? Is it just that Sonnet has less resources for inference?

    None of the other labs are doing this kind of long lived two model series.

    • jsnell2 hours ago
      Gemini has had Pro and Flash since May 2024, across three major version numbers. The Opus and Sonnet naming is only two months older than that.
  • artursapek2 hours ago
    I run a proofreading benchmark that tests how well models can find and fix errors in English text. They get several passes in a simple agent loop. Sonnet 5 is definitely better than Sonnet 4.6, but inferior on both quality and cost to GLM 5.1, GLM 5.2, Gemini 3.1 Flash, and Gemini 3.1 Pro. https://revise.io/errata-bench
  • PeterStuer3 hours ago
    Anyone else feel like Opus 4.8 got significantly dumber over the last 2 weeks?
  • ai_fry_ur_brain2 hours ago
    Finally a model release where everyone is realising the scam. The world is healing (maybe).
  • Scroll_Swe5 hours ago
    I don't pay so I'm glad for the upgrade. I usually use Gemini, Mistral Le Chat (Vibe...) or Deepseek as they have way more generous free limits and I can basically spam forever.
  • docheinestages4 hours ago
    Is it just me or is there a huge difference between how much one can accomplish in a 5-hour window with GPT 5.5 on xhigh versus any Claude model?
    • mrcwinn4 hours ago
      I exclusively use 5.5-xhigh-fast within Codex and find it superior to Opus 4.8.
  • jchw4 hours ago
    American AI company status: We are now bragging about how bad our models are unironically.

    Okay.

  • _pdp_4 hours ago
    Too expensive?
  • gverrilla4 hours ago
    Is this the default model for non-paying users? If so, that could be an interesting move in the competition for this segment.
  • ekjhgkejhgk4 hours ago
    In effective terms they're lowering prices.
  • micromacrofoot4 hours ago
    So they repackaged Fable and added "don't scare the government" to the prompt
    • actionfromafar2 hours ago
      This is downvoted, but how can it not be a little true?
  • andrewchambers3 hours ago
    The whole fable fiasco really soured me on Anthropic. This just looks disappointing by comparison.
  • botfriendsarent39 minutes ago
    Sonnet 5 OUCH! every model is just loaded with more hurt, stolen content, BS prompts, more scare tactics, more illusions, more government lobbying, less honesty.

    Oh Claude you master of software engineering does it ever end? DO you have no bounds?

    How may we further assist you oh Claude?

  • moomin4 hours ago
    I feel like this is a bit of a disappointment. Sonnet 4 was a clear step above Opus 3.x, while this is a lot muddier.
  • kvetchingan hour ago
    GLM 5.2 is better and cheaper. Maybe they are trying to embarrass Trump by making it look like we are losing to China.
  • mesmertech5 hours ago
    Ok that's a one month clock to the next Opus model at least, so that's a silver lining to a meh model.
  • stackedinserter4 hours ago
    "Our new model is proudly dumber now!"
    • mwigdahl4 hours ago
      What? If you're comparing their models in the same size class, Sonnet 5 is Pareto-optimal over Sonnet 4.6.
      • zamadatix4 hours ago
        I think they mean per dollar in the perf/$ charts, not per marketing class. I.e. the new model is a complete Pareto failure in said perf/$ charts with the sole exception of Sonnet 5 low, which is dumb enough to not have comparison at all. Opus 4.8 delivers a better outcome per dollar, regardless what the underlying size of the models is.

        I'd generously assume this is something about the specific category of agentic task presented in the chart... but it does raise the question "then why is that category the one they chose to highlight here".

        • mwigdahl3 hours ago
          For agentic computer use Sonnet 5 low performs better than Sonnet 4.6 medium at just under half the cost, and better than Opus 4.8 low at 25% off. Their success rates are not that far off.

          Agentic search is a different story, but even there it still dominates 4.6 (as in, for everything Sonnet 4.6 can do, Sonnet 5 can do it as well or better at the same or lower cost).

          Yes, Opus 4.8 dominates Sonnet 5 over its entire range in both categories, but Opus's lower range is limited and there is a valid regime on the lower end where Sonnet 5 use makes economic sense. This is not the case for Sonnet 4.6 where Opus 4.8 dominates it completely on both charts.

          Edit -- reading your response closer I think we're saying the same things, maybe just disagreeing on whether that lower end is valuable or not.

  • Getchowned4 hours ago
    Fable soon please.
  • varispeed3 hours ago
    What is the point if it is one Trump's brain fart away from being blocked?
  • lucynight4 hours ago
    AMAZING