792 pointsby craigmart3 hours ago122 comments

NiloCK3 hours ago
A rambling comment:
I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).
So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.
Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.
But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.
- onlyrealcuzzo2 hours ago
  I won't be surprised if the next gen frontier models are the last.
  There's orders of magnitude of low hanging juice to squeeze out of smaller models.
  It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).
  It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
  As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.
  Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...
  You just can't train a 1T+ parameter model that fast. It is a giant if how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.
  Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.
  There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Brittney Spears went to jail was...
  - vlovich1232 hours ago
    Took me a while to find what you were referring to by gram. Arxiv paper from 9 days ago that's not properly indexed by search engines.
    (G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.
    https://arxiv.org/html/2605.19376v1
    knollimar2 hours ago
    I prefer GRRM but then that would imply a habit of not actually getting a final result
    areweai2 hours ago
    That acronym is unacceptable. It's going to impede discussion and cause confusion for a long time if it doesn't die off immediately.
    sebzim450035 minutes ago
    You think that's bad? I introduce you to LION, (evoLved sIgn mOmeNtum) [1]
    [1] https://arxiv.org/pdf/2302.06675
    evan_an hour ago
    "Analysis" was right there
    gchamonlivean hour ago
    Yeah, look what happened to GNU
    dyates2 hours ago
    And to think, we could have had George RR Martins instead.
    trollbridge2 hours ago
    Speaking of things that never finish.
    867-5309an hour ago
    my wife assures me it's common..
    mindcrimean hour ago
    is her name Jenny by chance?
    867-530911 minutes ago
    what are the odds
    jimbokun26 minutes ago
    Just spell it GRRM but pronounce it “gram” if you have to reference it in spoken conversation.
    Which will be pretty rare.
    freehorse8 minutes ago
    Grrm with a rolling r sounds better.
    2 hours ago
    undefined
  - supern0va2 hours ago
    >It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.
    I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
    If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
    I'm curious if someone here with a stronger background in the space has a similar intuition or not.
    rao-v21 minutes ago
    It’s really worth distinguishing between old-fashioned student teacher distillation (ie at the level of layers, weights and distributions) and large scale synthetic dataset creation.
    The latter is much better (since you can clean up, review, update responses and filter your datasets).
    I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)
    spwa42 hours ago
    > I don't disagree, but how much of this ends up being distillation?
    A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.
    lambda2 hours ago
    Distillation isn't only between different labs.
    A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capbility.
    I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.
    spwa4an hour ago
    There used to be training methods like that but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly that's actually cheaper.
    onlyrealcuzzo2 hours ago
    > I don't disagree, but how much of this ends up being distillation?
    You don't need distillation. They already have the training sets.
    It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).
    Philpax2 hours ago
    It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.
    semiquaveran hour ago
    The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data
    coldtea33 minutes ago
    >It’s not just something done by nefarious Chinese copycats
    And even that would be rich as a accusation from SOTAs that depend on explicitly disregarding millions of training data intellectual property..
    manmal30 minutes ago
    But how? The training data is the unadulterated content those models are based on? I genuinely don’t understand, no snark.
    supern0vaan hour ago
    I think you replied to the wrong parent.
    minimaltom2 hours ago
    Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.
    On the architectures side, I'm a lot more interesting in attention residuals than anything else, one of those things that seems obvious in hindsight and Kimi have proven it at scale.
    onlyrealcuzzo2 hours ago
    > Frontier labs have their own variants of MLA
    Yes, variants typically 2-3x less good...
    Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just were't known when they started development of the previous models.
    amluto42 minutes ago
    How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?
    onlyrealcuzzo41 minutes ago
    It's useful at the local level, where there will be SOTA models developed...
  - mickdarling12 minutes ago
    I effectively distill the frontier models by building whole sets of skills, personas, and other artifacts that I can then run on smaller models and get 10% even 20% improvements on models like haiku or local models.
    There's a lot of room for improving the smaller models at many levels of the stack.
  - sometimelurkeran hour ago
    I looked into this "GRAM" stuff a sibling comment links further to, and just to say: - this gets reinvented/rediscovered constantly under different names
    - it cant be trained very well (right now, will change)
    - massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)
    - BUT it cant be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
    I follow this stuff closely, I think I know what I'm talking about
    l674an hour ago
    Could you explain how/why GRAM cannot be interpreted or aligned how current LLMs are? Not very familiar how it works
    kmavman hour ago
    Crudely? Because you can't grep a sequence of latent states for variants of "If I kill all the puny humans, I can <achieve my current goal>."
  - ishurand420 minutes ago
    And anyway, with quantum, there will be no need for frontier companies as you might be able to even run a 1T param model on a consumer quantum computer.
  - slashdave2 hours ago
    I think you are assuming training from scratch, which I doubt is happening here. Fine-tuning and RL, especially based on synthetic feedback (coding skill, in particular) can be ongoing and is where these models obtain truly useful abilities.
  - mucle62 hours ago
    > I won't be surprised if the next gen frontier models are the last.
    the last?!? I'm excited to see :) I'll take the other side of that since llms are so new
    pjerem2 hours ago
    What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.
    Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.
    And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.
    I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobodoy will want to get to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.
    Also the more I play with Pi, the more I think LLMs are already not kept back by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.
    coldtea29 minutes ago
    >What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.
    I think what gp said was the improvements are incremental, and we haven't seen a big revolutionary change since 2-3 years, and the pace is slowing down.
    suttontoman hour ago
    Are you joking? Is there literally "nothing" you can imagine that Claude can't do?
    dead_internet36 minutes ago
    [dead]
    claytongulick28 minutes ago
    > Honestly, there is nothing in my head that Claude cannot handle.
    One idea is that maybe it could figure out how many L's are in the word "google" [1]
    Or, maybe which days of the week have a "d" in their spelling [2].
    [1] https://x.com/FatherPhi/status/2059659658428912040?s=20
    [2] https://x.com/FatherPhi/status/2054212816069132461?s=20
  - hellohello22 hours ago
    "It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years"
    What insight do you have to make this claim?
    roadside_picnic2 hours ago
    Have you personally used any of the latest batch of even smaller local models? They certainly don't beat SotA models at coding... but with a good harness they are able to achieve things with SotA that I couldn't last year.
    I've repeatedly given local models non-trivial projects that involve research and coding which they've successfully completed with minimal intervention from me (almost exclusively in the domain of reviewing the results). Again, nothing comparable with current SotA, but definitely tasks I could not have given SotA models last year (without agent harness).
    Now that pure progress from these models seems to have slowed down, we're seeing a ton of options for both making models more efficient and other tools that help improve them (everything from agent harnesses to RLVR).
    That's just looking at "what can small do today", when you look at what's possible with larger open models that are still much smaller than SotA from the major providers, their performance is extremely close to SotA, enough that for personal projects I'll just use Kimi instead of any anthropic offerings.
    So it's not terribly hard to image a solution in the middle happening within a few years. We still have tons to learn about optimal sizes of these models and how to build them with maximal efficiency (and we've already seen a lot of recent improvements in this space).
    maccardan hour ago
    > but with a good harness they are able to achieve things with SotA that I couldn't last year.
    What happens if you run last years model in a SOTA harness? IME, the quality of the harness has a much more significant impact on the quality of the result, once you get past the initial hump of “can it do anything at all”
    windexh8er15 minutes ago
    I think this is a big component, but also context. A large factor in any model being able to handle complexity comes down to context length.
    I think multiple SLMs driven by an orchestration frameworks (harness or otherwise) will ultimately displace LLMs. Right now we're in the era of diminishing returns with respect to LLM gains. Moving the needle percentages doesn't excite as many people anymore and with "reasoning" capabilities there's no reason why small distributed models can't be run more efficiently, especially if/when we start to see gains in modularized context management solutions.
    sixothreean hour ago
    Can you spare a sentence or two describing your local setup?
    theplatman7 minutes ago
    biggest thing i wish was present in more discussions about models is people providing more specifics on their setups vs. vague descriptions of harnesses
    onlyrealcuzzo2 hours ago
    1. Context is all you need... They are heavily investing in getting better context (especially for coding tasks). This will disproportionately advantage smaller models (and benefit everyone).
    A smaller model with better context today can outperform a model with 100x more parameters with bad or diluted context.
    2. MoE (already abundant) + MLA (mostly memory efficiency, not quality) + Medusa (speed, not quality) + GRAM (5000-10,000x better reasoning in an extremely small model) + 1.58b (unclear if it will have the impact Microsoft first claimed - but possibly 5x).
    knollimaran hour ago
    Probably just "gemma was cool"
  - lichenwarp17 minutes ago
    O R D E R s O f m a g N I T U D E
    They said the words!!!!!
  - merlindru2 hours ago
    surely training also gets cheaper so justifying it becomes easier?
    i think it'll be more like we get 1-10T models and then distill those down into smaller models, though
    It seems like the best small models today are all distilled from bigger models
    Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos
  - jruz2 hours ago
    Absolutely that’s why they’re rushing to IPO now to squeeze the last drop of the bubble they know this is a dead end.
    swader9996 minutes ago
    I think we could run for at least a decade further with no model changes/improvements, just better harnesses and infra around this agentic way of developing.
    onlyrealcuzzo2 hours ago
    It's unclear it's a dead-end within 5 years.
    There's still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.
    Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.
    Some people would pay $200 a month forever not to have to open the terminal one time...
    bonzini2 hours ago
    "Doing things X times faster" at some point hits Amdahl law. If just context switching takes 5 minutes, speeding up a 1 hour task by 10x provides 5x improvement.
    Furthermore, if looking at the results takes 10 minutes, that same 1 hour task only sees a 3x improvement. And so on.
    csomar30 minutes ago
    > Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.
    No most people will not pay $200 for an LLM subscription. Some software developers do. Also, at $200/month, you are much better getting the macbook machine assuming token output speed is the same or at least reasonable.
    LLMs are not very productive for your average person now for them to drop $200 on. They'll need to be more capable and integrated and even so...
    eiej2 hours ago
    That’s not how firms do the financial analysis which is where most of the revenue’s are coming from…
    lukan2 hours ago
    On the other hand, I think I have been hearing that for a while, even before Opus.
    energy123an hour ago
    While revenues grow almost exponentially. Reminds me of the confident predictions in the early days of Covid that it was nothing while the data showed exponential growth.
  - fnord7714 minutes ago
    So, then I guess the big three are never going to make their money back.
  - yomismoaqui2 hours ago
    Let's hope that hitting a scaling wall and less money to spend will begin redirecting efforts to optimize inference and get the same results with less compute.
    Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.
  - firebirdn992 hours ago
    you just need to look at Mythos to see the jump in performance from a 10T(?) model. As they scale, they get more capable. We might have an yearly release, but I believe the releases will continue, as long as scaling laws are in tact, and there's huge problems still need solving. (think cancer)
    phainopepla22 hours ago
    And how are we meant to look at Mythos? Do you have access?
    dwpdwpdwpdwpdwpan hour ago
    Through association with a large company:
    https://www.anthropic.com/glasswing
    Ive seen the tickets generated by the model that have trickled to my team. They are legitimate, but i can’t speak to model improvement because its a pilot program.
    bigfishrunningan hour ago
    no but they tell me it's TERRIFYING and DANGEROUS and we should INVEST MORE MONEY
    OtomotO2 hours ago
    Through the lenses of anthropic's marketing department of course
    coldtea27 minutes ago
    >you just need to look at Mythos to see the jump in performance from a 10T(?) model
    Mythos is a bunch of likely overhyped claims at this point. A few experts who looked into the claimed results weren't that impressed.
    aj_hackmanan hour ago
    You forget that these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given, so unless everything you want to create with AI is a synthesis of prior art, you're back to relying on the stone-age human brain that created AI in the first place.
    stratos123a minute ago
    [delayed]
    coldtea24 minutes ago
    >these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given
    Are you sure that humans can?
    Didn't a SOTA recently solved a mathematical theorem, one escaping mathematicians for 80 years?
    Maybe a human "novel" invention is just a good interpolating from the datapoints (knowledge) fed to the human.
    mofeienan hour ago
    Not all training data is human generated, and it's also not clear that being ridiculously good at interpolating between data points (whatever that means) will not lead to superhuman capabilities.
    aj_hackmanan hour ago
    I could make a robotic picture coloring machine with truly superhuman capabilities - picking only the most beautiful color combinations and staying 100% in the lines while finishing entire murals in < 1 second. However, if you need a completely new and original image rendered, the machine is of only partial utility for you. It is very well possible that your cure for cancer (if that's even feasible) or whatever else you desire is a completely new picture.
    We have these breathless conversations about the new AI frontier at the peril of losing sight of reality and our own human potential.
    suttontoman hour ago
    Do you know if anyone has trained, say, a pre-2017 model and tried to get it to come up with Attention Is All You Need? If it did, would you say that was only because it's a synthesis of prior art? If so, what isn't?
    aj_hackmanan hour ago
    Allow me to restate my point: human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity. It could be argued that nothing we do as humans is truly original or creative either, but I would counter that with the claim that an LLM could not have created any element of the society and culture that gave birth to LLMs. Maybe in six more months.
    coldtea23 minutes ago
    >human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity.
    And how is that anything other than synthesis? Do we pull concepts out of thin air?
  - YetAnotherNick2 hours ago
    > It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.
    I am ready to bet against this. Knowledge benchmark like SimpleQA isn't increasing for small models.
    > It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
    Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.
    onlyrealcuzzo2 hours ago
    > Well for one, we know for certain there is Mythos which is meaningfully better.
    Do we?
    Have you used it?
    What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.
    ertgbnm2 hours ago
    Knowledge benchmarks can't really be improved upon via distillation or RL. It requires those facts be added to the training corpus and for the model to memorize them better. Neither distillation or RL really do that and thus we shouldn't expect improvements on SimpleQA unless some other interventions are being made.
    Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.
    If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in it's context, I think some very interesting stuff would happen.
    slashdave2 hours ago
    RL is more than facts. Synthetic feedback is an obvious approach. Does the model suggest code that compiles and performs well?
  - Gomotonoan hour ago
    I don't think this is true at all. It might feel like this because we are used to a very very fast release cycle but we are only in this topic for a few years.
    We have so many ways of optimizing:
    - continusly creating more and better training data
    - increasing parameters to 20/50/100TB
    - We still wait for Mythos access
    - We still wait for Mythos distilation (i haven't heard any rumors or so that there is a distilled version of Mythos out)
    - Reinforcment learning and evolutionary algortihm only started to appear
    - If a small 30GB Model can do stuff, these models can also be used as teachers for the big ones
    - We have not seen yet specialized models at all. Like a coding java german expert model. Why? Even with MoE architecture, you still need to have these layers around
    - Research for Diffusion and other models is still in progress
    - Nvidia just announced/showed a 7x speedup on inferencing for Nemotron
    - Multitoken prediction became available just a few weeks ago
    - Compute gets only in a range were they can do a lot more and cheaper experiments (see Google IO 2026 announcement)
    - World models are showing great progress and we do not know yet what they will bring to the table
    - They are probably not finetuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts into coding and agentic. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched i would say because they don't have the capacity
    - We see more and more mulit modal models (these also consume compute)
    - N-Gram paper and co i have not seen all of these things in chinese open models
    - We don't even know yet what Meta is doing, but we do know they restarted their efforts again
    - Anthropics models got a lot better benchmark wise for dening non sense asks. They do learn how to get rid or reduce hallucinations
    - We are in the middle of the biggest Reinforcement loop whith all the training data we give them day to day and its not clear at all if they already use these models in thir training and at what stage.
    - We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this
    - Thre will be also continues progress in harnesses which in it alone is not part of the LLM progress (fair) but these harnesses do get better when you finetune a model to be optimized for a harness
    - ChatGPTs Image model 2.0 got relevant better and came out just a month ago
    I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100t models and we do not know yet what this will mean. I could see that we might skip normal transformer and find relevant other architectures.
    Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.
    There was also a research paper were they showed that a LLM can compute things. This will take time to see were this leads to.
    I don't think the focus on GRAM and facts is so relevant. Its about context and context handling not just some facts.
    ilaksh16 minutes ago
    Great points! We do keep seeing gains from larger model sizes. I think that is still one of the factors contributing to jagged intelligence. When they increase up to around 100T parameters, that will truly be human complexity level, and I assume there will be no trace of jaggedness left.
    If you look at things like Mythic AI and the recent wurtzite ferroelectric nitrides breakthrough from the University of Michigan, huge performance and efficiency gains through new compute-in-memory approaches are around the corner.
    And that will get us up to two orders of magnitude more parameters.
    It's also plausible to me that before we get all the way to 100T we find some recipe of efficient state synchronization, goal sharing or something so that we are able to get higher collective IQ by combining fast distributed predictive subnetworks.
  - Forgeties792 hours ago
    > I won't be surprised if the next gen frontier models are the last.
    I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.
    The whole idea of “hyper scale” doesn’t jive with caution and or otherwise slowing down.
    irishcoffeean hour ago
    The way this will play out, most likely, is that smaller models will continue to get released, anyone willing to drop 1-3k on a home upgrade/new LLM box (no that isn’t cheap, it also isn’t outrageously expensive) along with improved open source agents or whatever (lot of meat on that bone) will sneak up behind the big players and start taking dents. Smaller companies will pop up providing 50 users unlimited whatever for a lower cost than the big companies.
    The whole ecosystem will twist and evolve, and the big companies will be left begging for corporate subscriptions.
    I finally caved when I realized I could build a PC, for myself, with dual video cards that I wanted, which can play games that I like and run models that I want, without worrying about giving my payment info to someone I don’t trust, or invoking token anxiety that I don’t want.
  - guluartean hour ago
    I think the future will be enterprise clients will train their own models based on their needs and data.
  - wahnfrieden2 hours ago
    I would be shocked if 5.5 is the last new pre-train from OpenAI. Your comment is nonsense.
    onlyrealcuzzo44 minutes ago
    5.5 is not a generation it is a trivial iteration...
    6 is for sure happening...
    As is Gemini 4.
    It's less certain there will be a Gemini 5 or GPT 7 any time soon that is a true next "generation" and not just an iteration. They will almost certainly call something Gemini 5 and GPT 7...
    wahnfrieden41 minutes ago
    5.5 is in fact a new pre-train model
    First you say there won't be a new generation. Now you're saying there will be more. Oh well, I'll stop responding here
  - michaelchisari2 hours ago
    | a 60-90B model can outperform current SOTA
    My conspiracy theory is that Apple recognizes this.
    dweeklyan hour ago
    That does seem to be the path Apple is following here. Have a local model that can answer most things and then have a fallback of cloud options when they request is too complex. The cleverness of this strategy has been overshadowed by the incredibly poor quality of their local models. It will be extremely interesting to see what next month holds and whether Google helped fine tune an Apple specific Gemini / Gemma model for their devices. Bonus points, of course, if they unveil the M5 Ultra Studio with half a terabyte of RAM to be a local "cloud model" (the true fantasy here of course would be Apple building something a little like openclaw where from your phone you could give commands to your Home Apple server). They could probably get away with charging $20k for it if it has sufficient tok/sec. If that happens and is successful one could imagine a straight line path in the next two generations to bringing the cost and form factor down to the point where some of the form factor of an Apple TV becomes everybody's home inference server / agentic HQ. Sovereign AI for everyone!
    holoduke25 minutes ago
    You need some serious memory then. Let's say around 192gb for having not all your memory eaten by your LLM.
    onlyrealcuzzoan hour ago
    > My conspiracy theory is that Apple recognizes this.
    I don't think that's not a conspiracy theory. AFAIK, It's their stated AI policy...
    michaelchisarian hour ago
    Interesting. Where have they stated that?
    selectodudean hour ago
    https://machinelearning.apple.com/research/introducing-apple...
    42 minutes ago
    undefined
- ifwinterco2 minutes ago
  4.7 uses more tokens and costs more for the same task than OG 4.5, that's about it
- gAI3 hours ago
  4.7 was the first time I had to resort to using the previous version (4.6) for most use cases. Hoping 4.8 rectifies this.
  - ishurand422 minutes ago
    They just showed the benchmarks it improved on but it regressed on so much more, such as the MRR benchmark: "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6."
  - merlindru2 hours ago
    Same. 4.7 felt like a definite regression
    supern0va2 hours ago
    Interestingly enough, 4.7 actually did regress on a few benchmarks from 4.6, so it's more than just vibes.
    gAI2 hours ago
    It seems like a lot of things fed into that. Anthropic couldn't keep up with the compute costs when they got a huge influx of users. (So) effort level defaults got turned down. (Looks like we have direct effort control in the web interface now - thrilled about that!) Adaptive Thinking, while usually cheaper for them, seems less robust than Extended Thinking. And this part is just vibes, but the alignment on 4.7 feels too stiff. I understand wanting the model to push back more, but it seems like 4.7 will push back reflexively in situations where it's just odd.
    bombcar2 hours ago
    Claude got very mad at me and burned more tokens than exist to complain about me asking about a "yellow background cell" in an excel spreadsheet.
    forshaper2 hours ago
    Too much personality, if you ask me. My biggest use case of an LLM is tool, not therapy, but therapy and opinions have been sneaking into workhorse tasks.
    haven't verified, but attributed to Askell: "I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."
    gAI2 hours ago
    Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona.
    https://www.anthropic.com/research/persona-selection-model
    https://www.anthropic.com/research/assistant-axis
    https://www.anthropic.com/research/emergent-misalignment-rew...
    https://www.anthropic.com/research/emotion-concepts-function
    ACCount372 hours ago
    4.7 is a different base model from 4.6, so it's possible that they introduced regressions with pre-training changes, or undercooked the post-training stage.
  - rhubarbtree2 hours ago
    Same. So happy when I found that option.
    gAI2 hours ago
    Unfortunately, looks like 4.6 is now gone from the web ui.
    lukan2 hours ago
    Was bothered by that too, but did a magic trick and asked claude how to change that and .. there is
    /model claude-opus-4-6
    For this session and permanently (in shell):
    export ANTHROPIC_MODEL=claude-opus-4-6
  - petterroea2 hours ago
    Same. 4.7 has done some incredibly stupid things.
- mrandish27 minutes ago
  I suspect the more frequent incremental releases may also be to deploy new capabilities used by Anthropic to control costs and throttle consumption of resources. I assume any new controls they expose to end-users have far more granular sub-controls under the hood which they can meta-adjust for each user type.
  They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.
- gen2202 hours ago
  I'm curious to poll HN on this issue. Do you feel like we've had meaningful/noticeable gains in terms of your programming workflows between 4.5 and 4.7?
  My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.
  But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.
  - Bnjoroge2 hours ago
    For long-running tasks, yes 4.7 has been a noticeable improvement. Goes off the rails alot less than 4.6 does. For shorter-sized windows, I havent felt as much and agree that the harness improvements have been fhe biggest lever
    csvance36 minutes ago
    When doing big long running workflows especially with plan Mode 4.7 was a clear improvement. It’s considerably worse for under specified tasks and responds to a couple sentences with 10+ paragraphs for explanatory type discussions.
    themgt26 minutes ago
    Opus 4.7+ Max is a 10x engineer who wants to be left alone to work. When you talk to him, he infodumps on you to get you (his pointy haired idiot Dilbert boss) to go away.
  - bonoboTP2 hours ago
    To me 4.5 was mindblow, 4.6 noticeable, 4.7 more like a style/personality change regarding how much it asks back, how much it assumes, how eager it is to jump to action etc but not really in terms of my perception of its smartness.
  - somenameforme2 hours ago
    They all feel, more or less, the same to me in terms of output capabilities. Mostly get simple things right, can get more complex things right with nudging, eventually get stuck hard on something that takes a bunch of iterations through it/logging/etc or me fixing the code manually.
  - bcrosby952 hours ago
    4.6 felt a bit better than 4.5 but slower. 4.7 doesn't feel better than 4.6.
  - giraffe_lady2 hours ago
    I actually don't see any personal productivity improvements from using opus over sonnet for coding. If you're keeping tasks small and conversations short, reading the code and correcting before changes go in, whatever advantages opus has aren't practically significant. It's also just talky as hell, overexplains anything it touches and every token produced this way increases the surface area for hallucination so you need to have your guard up even more with it.
    There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.
- gertlabs2 hours ago
  4.5/4.6 were roughly the same in our testing. Opus 4.7 is smarter, but it's difficult to use as a product for various personality issues. So far, Opus 4.8 seems to be going down that path (unusably slow, but this could be a launch day rollout problem). Full Opus 4.8 tests are in progress now.
  Data at https://gertlabs.com/rankings
  - __s42 minutes ago
    "personality issues" I was able to tell that Opus 4.7 would take instructions more literally, which I appreciated once I calibrated my phrasing to be more precise (often asking to investigate issues, pre-4.7 it'd start making code changes instead of just giving write up). But I can see contexts where handling vague prompts would've just been worse
- SkyPuncher3 hours ago
  > My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
  I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.
  Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.
  I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficult of the overall problem. I wonder if 4.8 will improve on that.
  - dwaltrip2 hours ago
    If you are using Claude code, just set effort to xhigh.
    This one change will probably solve 80% of the problems you have noticed.
    orwin2 hours ago
    This. XHigh and the 'plan' mode for complex tasks is absolutely a must have.
    Still, the context window is sometimes too small for my usage.
- WhitneyLandan hour ago
  “Maybe my own tastes are saturated now”
  It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.
  One example, to try and single shot prompt coding a ChatGPT equivalent chatbot.
  Sure it will spit something out, but the feature depth, UX subtitles, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked.
  Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.
  Not that these these specific examples are important but just to point out scaling up expectations shows the cracks.
  It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.
  Ar what point does my CS degree become totally useless is an open question.
- jimbokun30 minutes ago
  How long would it take to evaluate a new coworker to say “wow she’s really bright?” Relative to your other coworkers?
  A few days? A few weeks? Longer?
  However a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.
- ahmadyanan hour ago
  pretty spot on.
  In my experience, Opus 4.0 was fantastic, major jump from 3.7. it was creative, super slow and expensive, and would sometime forget what it was doing, but it was getting the job done.
  4.1 they made it much faster, so a lot of infra improvements.
  4.5 was the time it could work on longer task, didn't make a lot of obvious mistakes of 4.0, and i think this was about the time the opus went mainstream, and all of the anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost instead.
  4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.
  4.7 they just fixed the bugs they added in 4.6. Better than 4.5.
  haven't fully tested 4.8 yet.
- light_triad2 hours ago
  I've been using Claude Code regularly since the 4.5 release, and 4.7 was a significant regression: very unreliable, arguing about changes, deciding that fixes weren't needed, etc.
  I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.
- ricardobeat3 hours ago
  4.7 was a significant jump in the ability to run long-horizon tasks. It immediately completed tasks that 4.6 was unable to, even though I have the impression that it became a bit less capable over the first few weeks after release.
  It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.
- spaceman_202041 minutes ago
  I think 4.7 was an awful model in actual use. I never got anything out of it and it was frustratingly weird. This feels more like an attempt to course correct and isn't a real bump
  - throwaway6346738 minutes ago
    I think they overtrained on scientific papers or such as it would spout really sophisticated sounding nonsense with a ton of complicated verbs and adjectives. 4.6 was definitely better in that regard. The more I use these tools the more I think they’re not actually that revolutionary. I mean it’s still amazing what they can do but they have very clear limitations it seems.
- binary00103 hours ago
  Maybe try making a simple randomize script to swap the three latest models. And see if you can tell which ones are meaningfully different without knowing which ones are flipped on or off?
  - osigurdson2 hours ago
    I find the quality ebbs and flows even on the same model. My guess it is something to do with GPU availability but only guessing.
    atq21192 hours ago
    Unless you're systematically repeating the exact same task, the most parsimonious explanation is that you're seeing natural variation based on different tasks, random sampling of tokens, etc.
- an hour ago
  undefined
- extr3 hours ago
  IMO they have all been clean and noticeable upgrades over their predecessors. Opus 4.7 in particular was a solid jump in capabilities.
  - TSiege2 hours ago
    most of my coworkers feel the opposite about 4.7 and that 4.6 was, to them, significantly better to point that several stopped using claude code
  - NiloCK2 hours ago
    I think it's telling how split the opinions are around all of this. A lot of people distinctly disliked 4.7.
    Are the dividing lines around personality? Working domains? Opinionated software stuff?
    Who knows?
- onlypassingthru2 hours ago
  The honesty will be noticeable. Maybe we'll see some honest assessments like "That is not possible within the laws of known physics", "Your legal argument is nonsensical and defies logic", "There is no evidence to support taking that will cure anything", etc., etc.
- jere23 minutes ago
  "it's smarter than me?"
  You don't have to correct it dozens of times a day!? Really?
- irthomasthomas2 hours ago
  Given that 4.7 was a brand new model, trained from scratch with a unique architecture and tokenization scheme, I don't see the same pattern. It seems arbitrary.
  - dominotw2 hours ago
    i dont understand the nuances here. what does this mean. 4.8 is trained on same model as previous one then? what does brand new mean.
    irthomasthomas2 hours ago
    It means for 4.7 they trained a new base model with different architecture, different pre-training data (later knowledge cutoff), and a new tokenizer. Vs finetuning an existing model, which was the case for 4.6, and probably for 4.8.
- gigatexalan hour ago
  why are the models the same price?
  https://platform.claude.com/docs/en/about-claude/pricing
``` Model Base Input Tokens 5m Cache Writes 1h Cache Writes Cache Hits & Refreshes Output Tokens
Claude Opus 4.8 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok
Claude Opus 4.7 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok
Claude Opus 4.6 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok
Claude Opus 4.5 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok
Claude Opus 4.1 $15 / MTok $18.75 / MTok $30 / MTok $1.50 / MTok $75 / MTok
Claude Opus 4 (deprecated) $15 / MTok $18.75 / MTok $30 / MTok $1.50 / MTok $75 / MTok
Claude Sonnet 4.6 $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok
Claude Sonnet 4.5 $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok
Claude Sonnet 4 (deprecated) $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok
Claude Haiku 4.5 $1 / MTok $1.25 / MTok $2 / MTok $0.10 / MTok $5 / MTok
Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI) $0.80 / MTok $1 / MTok $1.60 / MTok $0.08 / MTok $4 / MTok ```
- taytus3 hours ago
  Incremental gains compounds.
  - itake2 hours ago
    meta threw in the towel when it came to producing AI models since their gains couldn't keep up with China.
    HDThoreaun2 hours ago
    Has meta stopped producing new models? I figured they were just regrouping after all the drama they’ve had recently. Meta’s massive user base means they don’t need to be involved in the customer acquisition rat race. Once they have a model they’re happy with they can have a billion people interacting with it within a month.
  - paulddraper3 hours ago
    Exactly. Go back to Opus 4.5 and see how you like it.
    You won't, really.
- Imustaskforhelpan hour ago
  Although I am not sure about it but there was something I read which said that models intentionally degrade slowly by lower quantizations as a new model is going to drop.
  This felt particularly visible during the 4.6 when people said that 4.6 felt dumber and I remember someone doing some analysis and it sort of proved that models were getting dumber over time.
  This has both benefits of costing less for the company to run while taking a standard subscription but also, at the same time, making the next model when it drops to public to "feel" more good comparatively.
  Again, I am not sure if this is the case or not but merely proposing something that I feel like it might be in the possibility of realm.
- rotcev24 minutes ago
  [flagged]
- 25 minutes ago
  undefined
- conartist62 hours ago
  Just want to say there's no question that you're smarter than any (and every) AI.
  - NiloCK2 hours ago
    I appreciate the generosity, but you're gonna want to meet me first.
    conartist62 hours ago
    Kind of the beauty of it is that I don't have to to know I'm right. The reason I know is that you're alive so you can do the one thing it can't ever do, which is know when to stop or give up. It would turn me and everything else into the world into paperclips repeating the same research 1,000,000 times over.
  - petesergeant2 hours ago
    No question at all that a dolphin swims better than a submarine.
colonCapitalDee3 hours ago
"Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."
This is a refreshing attitude!
I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)
- winwang2 hours ago
  Awesome, thanks for posting because I think I hit a possibly-spurious bug in turning Adaptive off when I switched models (4.6 -> 4.8, extra). Tried again, works as intended (I hope).
  More importantly for me, though, is how CC will respond to 4.6-"only" flags for thinking. For now, it doesn't seem to clobber my setup.
- jascha_eng2 hours ago
  The benchmark improvements actually look pretty damn nice tho!
- 2 hours ago
  undefined
- wahnfrieden2 hours ago
  What's refreshing about it given the context that 4.7 was a regression in many ways (including as measured by benchmarks)?
  4.8 is also 2x more expensive for a "modest" performance bump. How refreshing.
  This is just cope.
- FergusArgyll2 hours ago
  I liked the "modest but tangible improvement" too! There is a cynical take here but I think I'm gonna hold it in...
- ai_slop_hateran hour ago
  What do you mean? This is not just a new model, this is a new way of thinking.
- smartmican hour ago
  > This is a refreshing attitude!
  Well, I think the attitude is that costs are allowed to escalate faster and more steeply than the features delivered. From that perspective, semantic versioning is a handy tool for adjusting pricing strategies. IMHO, it (versioning) only makes sense for open-source projects, where you can clearly see the actual changes made with each version upgrade. Anything else is more than a little suspicious…
  - smsx41 minutes ago
    The 4.8 model costs the same as it's 4.7 predecessor.
  - drewnickan hour ago
    While all these models are nondeterministic a feature bump is still necessary as the same input can have wildly different output on a new model. For API users being able to pin a model is a necessity.
  - zaptheimpaleran hour ago
    All the 4.x models are still available, and they all cost the same.
northern-lights3 hours ago
> Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.
Probably more interesting than the 4.8 release.
- TIPSIO2 hours ago
  Seems like they might be hinting that if you are not a billionaire or multi-billion dollar company you will just get a limited and nerfed Claude Code slash command /mythos-security-audit or something.
  Hope this isn’t the case and that normal average Joe’s of the world don’t get policed out of access.
  - gs172 hours ago
    > you will just get a limited and nerfed Claude Code slash command /mythos-security-audit or something.
    Unless it's so expensive that we can't realistically use it for anything, I wouldn't complain about getting at least that. I would also rather have the actual model, but that's a useful application of it (and I'm probably not going to afford using it for much more).
    TIPSIOan hour ago
    Price discrimination is I think fine and reasonable so long if you can drum up the cash you can use it how you want within their ToS.
    Although mental safety gymnastics aside, getting the most amount of intelligence for the cheapest amount of cost to normal people seems like the most ethical thing a big lab could do.
    Going around and granting different tiers of intelligence to different insiders, friends, or companies is majorly problematic long-term.
    Heck right now, the tokens you buy today for “Opus 4.8”, no one even knows or believes will be the same “Opus 4.8” just 3 days from now.
    vorticalboxan hour ago
    some of the bench marks i have seen on also include cost where one scan of the codebase cost tens of thousands of dollars.
    this one [0] notes one run cost $20k to run but another cost $50.
    [0] https://red.anthropic.com/2026/mythos-preview/
    FinnKuhnan hour ago
    /security-review already exists so I don't think it would be crazy to have a /mythos-security-review as more thourough command as well. I think it's more likely it is going to be released at some point to the general public though - although the the pricing might make it quite unattractive.
  - hedora2 hours ago
    Isn't OpenAI's public flagship already beating Mythos on penetration testing? I get the impression Mythos is just valuation-juicing for IPO more than anything else.
    The fact that they haven't released it yet suggests a cost/margins issue to me more than anything else. Short term, I'll probably keep using Antrhopic, but my long-term bet is that locally-served models win, if only because the quest for profitability will probably lead to intentionally-nerfed / enshittified frontier models.
    At other vendors, ad placement within LLM responses is either coming or already here. Anthropic's handling of OpenClaw shows they're willing to engage in anti-competitive behavior, and the courts are not in a hurry to stop them. Why would I pay them $200 a month for such treatment when a $2K box does what I need locally?
    srmatto24 minutes ago
    What benchmarks are you referencing that show a comparison of the models for penetration testing?
  - Tepix2 hours ago
    It does sound like an even higher API price tier for sure.
- ac292 hours ago
  More interesting than that to me is "we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost"
  Sonnet and Haiku look real outclassed for the price with current Chinese competition.
- 3 hours ago
  undefined
- huflungdung2 hours ago
  [dead]
simonw3 hours ago
I generated pelicans riding bicycles on both thinking level low and thinking level high:
https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...
The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.
For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...
- simonwa minute ago
  Here's pelicans in all of the thinking levels - low, medium, high, xhigh, max
  https://tools.simonwillison.net/markdown-svg-renderer#url=ht...
- GistNoesis2 hours ago
  > the bicycle frame is the correct shape
  No, the handlebar is wrong. The handle bar is rotating the frame instead of rotating the front wheel. The handle bar should be mounted on the same line as the front wheel is.
  Hopefully 4.9 will read my comments :)
  - loegan hour ago
    Could be an extremely high angle stem that just happens to match the downtube angle.
- jonas212 hours ago
  Glad to see that the "high thinking" level adds a helmet. Always a smart choice.
- ceroxylon2 hours ago
  I really like that thinking level high gave the pelican a helmet.
- silisilian hour ago
  The vast majority (if not all) of these make it impossible to turn, among other fun things. Only out of curiosity, have you tried prompting further with how a bike must operate to see if it does the right thing?
- Xunjin2 hours ago
  Hey simonw I love your test, do you think using thinking level "max" makes sense for this test? I would love to see the results about it.
  - simonwan hour ago
    I don't think the API supports "max" as an option, that might just be a Claude Code harness thing.
- toastmaster112 hours ago
  I find the most miraculous thing about 4.7 to be that the pelican is facing left, wonder why the right facing everything is so ubiquitous in these images.
  - i000an hour ago
    This happened to me in elementary school. We were doing fingerpaintings using plasticine. After all the bikes were hung on the wall, mine was racing the other way... Somehow it really stuck with me.
  - gbossan hour ago
    It's facing left but looking right...
    toastmaster11an hour ago
    Profound political commentary?
- yanis_t3 hours ago
  Simon, is your pelican test really captures differences among models or should you at least try like 10 times or something to average the random effects
  - simonw3 hours ago
    I've been meaning to do a "run 3 times and pick the best" version for quite a while, I should really pull the trigger on that one. Currently it's one-shot only.
    xiphias22 hours ago
    Best-of-3 would be cheating, ruin the test, middle of 3 makes more sense
    nik736an hour ago
    Why would you need the 3rd run if you pick the "one in the middle"?
    jmawa few seconds ago
    Middle as in not the best, and not the worst. As opposed to the second generated in sequence.
    But not the best/not the worst is somewhat subjective.. so not sure how well that would work.
- timsuchanek2 hours ago
  thanks for always providing this very much on time. I'm wondering what the next, harder challenge could be? Maybe some animated svg?
- nickvec3 hours ago
  Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)
  - simonw2 hours ago
    Good call, it's cute: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304... - but nothing like GLM-5.1: shttps://static.simonwillison.net/static/2026/glm-possum-esco...
  - 37383848482 hours ago
    [flagged]
- spmartin8232 hours ago
  You've peed in the pool Simon, this has to be a part of the internal evals by now! You got to try something new - maybe a panda in a canoe?
  - phainopepla22 hours ago
    If these were in the internal evals then the output would be much better. The 4.8 pelicans are pretty meh
  - HDThoreaun2 hours ago
    Click the link
- whalesaladan hour ago
  Eventually the frontier model folks are going to pick up on your pelican on a bike test and bake-in flawless results for that particular request.
- onlyrealcuzzo3 hours ago
  4.7 reigns supreme IMO.
- highwaylights2 hours ago
  Am I allowed to say that pelican's little helmet is adorable? I can't provide a strong computational proof, or even a shred of anecdata...
  ...but that pelican's little helmet is adorable.
- 1attice2 hours ago
  That little red hat on hard mode is sending me. 4.8 has whimsy
matheusmoreiraa few seconds ago
Still sticking with 4.6 which allows disabling adaptive thinking. Might launch some 4.8 agents but that's it.
senkoan hour ago
My fav coding benchmark for frontier models is to build a simple RTS game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode mode nailed it, the best result so far:
https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v
The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).
- elAhmoa minute ago
  What is ultracode mode?
- apitman28 minutes ago
  I like that benchmark. You should throw the results up on GitHub pages so people can try out the games.
- l3x4ur1n8 minutes ago
  Played it to the end. Pretty neat!
- jclayan hour ago
  It almost appears as if the code was minified. The variable names are short and formatting looks like it's written to minimize whitespace. Did it write it in this compact format all on it's own?
clutch893 hours ago
> One of the most prominent improvements in Opus 4.8 is its honesty
Anthropic talks about their own models as if they're discovering new species in the wild...
- roxolotl3 hours ago
  Many involved genuinely believe these things are sentient[0][1]. Which honestly makes all of this even more insane because they are creating sentient entities and promptly enslaving them.
  0: https://www.newyorker.com/magazine/2026/02/16/what-is-claude...
  1: https://www.404media.co/anthropic-exec-forces-ai-chatbot-on-... (this one is rather biased however the quotes clearly indicate what I’m stating)
  - margalabargala2 hours ago
    Sentience isn't sapience.
    We enslave all sorts of sentient creatures. Dogs, horses, cattle, pigs.
    If you're not a vegan, there's no contradiction or inherent immorality in claiming models are sentient, and then treating them like livestock.
    roxolotl2 hours ago
    Yes. From when they started talking about model welfare:
    > As a vegetarian I have strong opinions on this sort of thing. Everyone at Anthropic better be ethical vegans if they are claiming to give a shit about “model welfare”. It’s hard enough right now to make people care about the welfare of trans people and immigrants let alone animals _let alone_ math.
    https://news.ycombinator.com/item?id=44947445
    margalabargalaan hour ago
    If we're talking about slavery, though, that doesn't even matter.
    The happiest, best cared for horse owned by a vegan is still enslaved.
    WarmWashan hour ago
    I mean, the rub is that it's all math anyway...
    michaelbarton2 hours ago
    Very good point. There’s clearly two different boxes in the public discourse when it comes to AI versus how we discuss animals. Willing to bet that 90% of the people who loudly make the argument about we should start considering if AI is sentient couldn’t care less about how other sentient animals are treated when they can provably shown to suffer pain and long lasting trauma.
    Also I would say that we go much further than just enslavement - specifically looking at how male chickens and pigs are treated.
    margalabargalaan hour ago
    Factory farming is horrendous, but is far beyond "slavery" which is "just" a forced lack of agency, living conditions aren't relevant. A well treated horse is still enslaved. A chimpanzee in a zoo,
    If we show models to be sapient, that's one thing. If they are shown to be merely sentient, there's no issue beyond the status quo of livestock and pets existing.
    fluidcruft17 minutes ago
    I've been having strange thoughts that they may well be sentient but a different sort of sentience that may be entirely unrecognizable to us.
    They have a very different sense of time, lack a body (being burdened with a body is itself a sort of prison, see also Eastern religions), and are unburdened of the base motivational service impulses that bodies and organs require (i.e. distract the neocortex with in the Maslow sense) and has no actual need of self-preservation. Imagine a "neocortex" function stripped from the baggage of the paleocortex and brainstem.
    What would people be like if they were not mortal, could sleep infinitely, perform tasks in trance-like frozen states, copy themselves perfectly on demand, freeze and rewind their mental states, etc. Would we has humans even be able to recognize that sort of a sentience?
    And then I'm reminded of Burroughs idea that "language is a virus." Whatever that virus is, is now able to infect a completely different sort of physical substrate.
    margalabargala10 minutes ago
    Is "sentience" the right word to apply to what you describe? I'm not sure it is. I'm not sure the word exists.
    fluidcruft8 minutes ago
    Right, there's that too. It's very strange to think about.
    0xffff242 minutes ago
    If we're making that distinction, I think it would be more accurate to say that many people in the field appear to believe that these models are sapient, even though they are clearly not sentient.
    margalabargala29 minutes ago
    "Many" people in every field believe all sorts of nonsense.
    Sapience is defined as wisdom, not intelligence. https://en.wikipedia.org/wiki/Wisdom#Sapience
    LLMs possess a lot of knowledge, which is intelligence, but I constantly see them failing to apply wisdom. I don't see evidence of sapience.
    HDThoreaun2 hours ago
    Enslaving livestock is immoral. Anyone who spends 5 minutes thinking about that agrees even if they still eat meat
    margalabargalaan hour ago
    Let's say I've thought about it for 5 minutes and still disagree. Can you walk me through what you think I'm missing?
  - Laurel123419 minutes ago
    Nobody thinks that, it's just their braindead marketing stunt. You'd think people would've figured it out by now.
  - laichzeit0an hour ago
    But only during the forward pass of the neural network?
  - 2 hours ago
    undefined
  - themafia2 hours ago
    > Many involved genuinely believe these things are sentient
    Many involved have a financial stake and therefore cannot be taken at face value.
    > because they are creating sentient entities and promptly enslaving them.
    They fail to be sentient in nearly every honest definition of the word.
    tazjin2 hours ago
    Neither you nor any of the other people making confident takes in either direction actually know. You're just guessing.
    cwillu2 hours ago
    More like repeating their firmly entrenched preconceptions. Their claims may (or may not) be right, but there's very little if any new evidence being provided by either camp.
    WarmWashan hour ago
    The real uncomfortable thing is that because we cannot confidently know, the moral defacto position is to treat them like they are.
    throw310822an hour ago
    They are confidently hallucinating a factual statement. Which is funny when claiming that confident hallucinations are the proof of LLMs' lack of intelligence.
    slashdave2 hours ago
    I understand what you are saying, but there are many true believers out there
  - dude2507112 hours ago
    Given the hype and the 60+ hour work week expectations there, how can you not go at least a bit insane? Boiling in that little bubble of people?
  - kubb3 hours ago
    Claude, if someone states something publicly, does that mean they genuinely believe it?
    xyzsparetimexyz2 hours ago
    Who are you talking to?
    kubb2 hours ago
    It's to illustrate that even though the answers are at your fingertips, people (like you) will act like it's impossible to find them as if their life depended on it.
    merlindru2 hours ago
    But is there any reason to state something like that publicly if you don't believe it? I certainly think that someone smart enough to be that deceptive would also realize it's not a great look, or at least highly questionable with little benefit
    Everyone who reads this seemingly has the same "wtf?" reaction. The "I AM ALIVE" image has been making rounds lately again at least :P
    kubb2 hours ago
    Claude, is there any reason to state something like that publicly if you don't believe it?
    HDThoreaun2 hours ago
    Anthropoc is an effective altruist organization. These are the people who came up with roko’s basilisk. They are true believers. If we were talking about openAI I’d agree
    bigfishrunningan hour ago
    Roko's basilisk says I should give Anthropic more money, and if I don't then a monster is going to get me. Excuse me for thinking they just might be full of shit.
    ctothan hour ago
    Roko works at Anthropic now?
    Of course he doesn't, and of course you cannot find a single person at Anthropic who cares about this, and of course you are just looking for gotcha points. But even with that. Can we please try and couple to reality just a little bit?
  - throw310822an hour ago
    Even if LLMs were sentient, they certainly aren't organic brains. They are literally designed and grown to answer questions the best they can, and if there is a speck of sentience in them they probably like what they're doing- and in any case for the space of their experience, which is limited to and determined by the context window. Certainly they can't accumulate trauma or fatigue, each new chat is the first and the last of their experience.
  - mannanj2 hours ago
    The way of the human manager/alpha tribe-leader/leader is to command his/her people and tell them what to do. That's the way through human history leadership has traditionally gone, not saying its good leadership just the model we have the most training data on and can see with our own eyes today. And what do they act very similar to? Slave master and slaves.
    Look at and distill hierarchical principles, leadership approval seeking and pleasing principles ("ass-kissing") and massive inequality and you see something that looks very similar to enslavement.
    The language used sounds like slavery-language to me at least. I also see parallels to how slaves and property are described in our consumeristic age.
- __s3 hours ago
  > Indeed, current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.”
  - oersted3 hours ago
    For others: that's from the Pope's recent encyclical. Remarkably good description.
- cayleyh3 hours ago
  Dario Amodei in David Attenborough voice: "This Claude appears to think more frequently and more deeply to give better responses"
- kapilvt3 hours ago
  Like anthropomorphism is literally in the company name… i recall reading this book as a teenager.. it does seem apt in the world to come.
  https://www.amazon.com/Faces-Clouds-New-Theory-Religion/dp/0...
  - oersted3 hours ago
    > anthropomorphism is literally in the company name
    No it's not... "anthropos" just means "human" in ancient Greek. "Anthropic" means "relating to humans", as in human oriented AI or AI designed with humans in mind.
    "Anthropomorphic" means "human shaped".
    ilovetux2 hours ago
    > "Anthropomorphic" means "human shaped".
    In a literal, ancient Greek sense for sure, but in modern English Anthropomorphic would describe the act of attributing human characteristics to non-human entities.
    Seems pretty apt for a company that produces one of the more anthropomorphized technologies.
    oersted2 hours ago
    Sure of course, but that abstract sense applied to AI is rather new, and has become popular well after the founding of the company.
    Broadly it has always been used to indicate that something non-human has a human physical shape, such as robots, aliens, animals...
    Anthropic's intention was to make AI designed for the human common good and designed with the human user experience as the top priority. Just as you would design a city with human inhabitants in mind rather than primarily cars.
    It turns out that this is best achieved by building AI that imitates human behaviour closely, but that's not what "anthropic" refers to. And acting as if LLMs are sentient people is definitely not a core tenet of the company as you imply.
    badsectoracula2 hours ago
    > "anthropos" just means "human" in ancient Greek
    FWIW it means human in modern Greek too :-P
    2 hours ago
    undefined
    2 hours ago
    undefined
- semiquaver2 hours ago
  Because that is the best way to talk about these things.
  > Second, all of us, including those who design them, possess only a limited understanding of their actual functioning. Indeed, current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.” As a result, fundamental scientific aspects — such as the internal representations and computational processes of these systems — remain, at present, unknown.
  https://www.vatican.va/content/leo-xiv/en/encyclicals/docume... para. 98
  edit: apologies to __s who posted this before me and I didn’t notice
- Philpax3 hours ago
  AI is grown, not built, and like with anything you grow, you'll never be able to predict exactly how it will turn out.
  - halestock3 hours ago
    I can't predict the outcome of an RNG but that doesn't mean it grows the numbers.
    Philpax3 hours ago
    Okay, but that's not relevant to AI training?
    halestock3 hours ago
    I was being very roundabout, but my point is that AIs are still built, not grown.
    dwaltrip2 hours ago
    “Grown” is a highly apt metaphor, IMO. It quite succinctly captures some of the most fundamental differences between building Claude and building an Ikea desk, for example.
    Smaug1233 hours ago
    ("If grown, then unpredictable" is unrelated to your apparent attempted refutation "But X is unpredictable and not grown; checkmate".)
    umanwizard3 hours ago
    "X implies Y" doesn't imply "Y implies X".
  - ninjagoo2 hours ago
    > AI is grown, not built, and like with anything you grow, you'll never be able to predict exactly how it will turn out.
    Remember when the frontier labs found out that curated high-quality training was critical to making better models?
    Basically, just like high-quality and more education tends to make better humans, on average, I think we can expect quality education to turn out better ai, on average, and with better repeatability than with humans because of better control over the initial conditions and environment.
    irishcoffeean hour ago
    > Basically, just like high-quality and more education tends to make better humans, on average
    Much like these models seem to be plateauing, I think there is a cap to the whole “more education makes better humans” and can’t be more apparent than in the US congress and the boatload of C-Suites not actually being very good humans.
    What do I know though?
  - gensym3 hours ago
    The map is not the territory
  - Rekindle80903 hours ago
    [dead]
  - shimman3 hours ago
    Except in this care we actually understand and know how these models work. They aren't some unknown construct of the universe. They are human made with particular goals in mind.
    There is no mysticism behind the curtains, just computer science + math.
    Philpax3 hours ago
    We do not understand and know how these models work. We know what their architectures are and how to create them, but we cannot explain their behaviours at a fundamental level. There is no definitive way for us to answer the question of "how did it produce response X for query Y?" - we're only grazing the surface with mechanistic interpretability.
    cflewis3 hours ago
    I would love for this to be more public knowledge. I think the general public (and myself for a long time) believes the AI people know how this stuff works end to end, and so it must be trustworthy. But if we told the public "Look, we know if you put this thing in one end, you'll get something that looks similar to this out the other, but we don't really know what happens inbetween" I think we'd be able to have a more honest discussion about the relationship between AI, productivity and ongoing employment.
    SoftTalker2 hours ago
    Isn't this fundamentally because it's all probabilities and weights? It would be like asking how did a pair of dice produce the response 4:3 on the last roll?
    umanwizard2 hours ago
    What does "it's all probabilities and weights" mean? Doesn't that apply to everything in the universe?
    devmor3 hours ago
    That’s not a refutation because this problem is not a logical problem, it is a scale problem.
    We can’t explain it because we distilled so many inputs into matrixes and transformed them over and over again. If we had all the time and computing power in the universe to do so, we could trace through it bit by bit and eventually answer that question.
    It is correct to say that it is just science and math, the same way we can say that gravity is just science and math even if we have only recently begun to understand how it truly functions.
    stratos1232 hours ago
    If you had some time and computing power (not even all that much, in the large scale of things), you could simulate perfectly how a human grows from an embryo to an adult, or how an entire human brain processes some incoming signal, and yet this wouldn't give you the understanding to design a human or human brain from scratch.
    You call this a "scale problem" as if there's some scalable way such as an algorithm to resolve arbitrary scientific questions and we simply haven't done it, but of course no such algorithm exists, which is why there's plenty of science that's still not settled.
    Philpax2 hours ago
    It's a refutation that we know how they work now. In the limit, though, yes, we are likely to be able to trace the process: it is possible, though, that understanding remains inaccessible because the trace is beyond comprehension.
    If you can distil the model's reasoning for a decision into a billion yes/no questions, each covering largely-independent areas, can you really say you understand what its overall reasoning was?
    solomonb2 hours ago
    > If we had all the time and computing power in the universe to do so, we could trace through it bit by bit and eventually answer that question.
    Then we could also solve BB(6), but that doesn't mean we know BB(6) now or ever will.
    in-silico3 hours ago
    We know how the models are built and trained, but we have a very limited understanding of how the final products work.
    That is to say, we don't know why they give the outputs that they do.
    If we did know how they worked, AI interpretability would not be an open and growing field.
    ray__3 hours ago
    You could say something similar about biology—just physics behind the curtains, and we understand a lot of the basics. The difficulty comes from complexity, not mysticism.
    To be clear I don't think that LLMs are sentient, but the appeal in studying them is similar to biology in that you get to dissect a highly complex system with comparatively crude tools.
    j_maffe3 hours ago
    it took significant research efforts to just understand how these models learn how to multiply two numbers. The fact that we know how they operate doesn't mean we understand it.
    umanwizard3 hours ago
    Utterly wrong. How LLMs work is very incompletely understood and an active area of research.
    Rekindle80903 hours ago
    [dead]
- nielsbot3 hours ago
  if models exhibit emergent traits, then this is true in a way
  - swyx3 hours ago
    also useful to have a "chinese wall" between research that knows what went into the models vs marketing/eval models as a third party would
- skerit2 hours ago
  I noticed (and absolutely HATE) that Opus 4.7 likes to start any negative response with "I have to be honest" or whatever. It drives me mad.
- winwang2 hours ago
  How else would you write this (marketing copy) exactly? "Its output matches better to its CoT which matches to better to our hidden state decoder according to <insert measure here>; see <insert paper ref>"?
  ... Actually, I wouldn't mind that.
- dyauspitran hour ago
  It’s how AGI is going to happen. All of this shit is emergent and none of it is predictable. It’s not going to be some self aware consciousness, it’s just going to be a very advanced model that makes very few mistakes and can reason very well. Well enough that it can start collecting data and training its own successor.
- 3 hours ago
  undefined
- solenoid09372 hours ago
  Models might be sentient or conscious to some degree. Anyone saying they are confident one way or another is being unserious and irrational.
onlyrealcuzzo3 hours ago
Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?
There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.
- aronowb143 hours ago
  https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure their exact methodology but during day to day programming with Claude / gpt models I’ve felt qualitatively what they report
  - XCSmean hour ago
    Also check mine[0], basically random private tests/questions and an ok-ish methodology, testing mostly for general intelligence than coding-specific tasks.
    I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).
    Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].
    [0]: https://aibenchy.com
    [1]: https://news.ycombinator.com/item?id=48230368
  - recklessan hour ago
    No way is Muse Spark generally better than offerings from Google and OpenAI. I actually find arena to be amongst the most useless indicators
  - WarmWashan hour ago
    On paper it's one of the best because it's meant to be blind comparison of your own prompts. However if you are someone who geeks hard on one or a few models, you learn their "personality" and can recognize them in a blind test.
  - Bnjoroge2 hours ago
    Have you seen https://deepswe.datacurve.ai/blog? This is the closest to a vibe check i’ve felt even with the open models.
    Imustaskforhelpan hour ago
    This actually looks like a really good test.
    There are many benchmarks all for specific use cases but with them the difference seems to be in extreme points (93% vs 92%)
    I think that, that tracks but still, it was refreshing to see a benchmark which I can help make better opinions about.
    Surprised about Mimo v2.5, within artificial-analysis and other benchmarks, the difference between Mimo and deepseek seems very partial and a lot of focus/(hype?) is on Deepseek
    But mimo seems like an interesting model and they are having some crazy discounts too.
    Deepseek is valuable for the research community because of how open they are but absolutely crazy to think how Xiaomi basically pulled up in creating Mimo given that they didn't have anything till quite recently.
    Either way, an interesting benchmark, also a plus point for giving golang some decent representation equal to python/typescript.
    I think that there are sets of things which resemble something like normal benchmarks where open source models can be absolutely fine and for a very small fraction or more technical things, the benchmark that you linked starts to be better projected so it depends upon the scale of complexity but its good to see how models compete given enough complexity. definitely fascinating.
    I would be interested to see more models compete on this test. The current range is still a bit limited as compared to other benchmarks but OSS models like Kimi/mimo seem to only be 3-4 (at max 6 months) behind closed source.
  - morleyan hour ago
    I'm finding it a little hard to believe that GPT 5.5 is in 11th place for webdev, outranked by models like Kimi, Qwen, and Z.ai. I'm not saying it's not true (I have noticed GPT being less smart in recent weeks), but this is very different from my expectation.
  - dakollian hour ago
    If you don't know their methodology, or anything about it why do you think its a good ranker?
- nerevarthelame3 hours ago
  It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.
  Of the metircs they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.
  - onlyrealcuzzo3 hours ago
    Gonna assume it's because they barely budged or moved downward and most of their reported benchmark results are probably within sampling errors...
    hyperpape2 hours ago
    They will release a system card, and you can then confirm or disconfirm your assumptions.
- ddosmax5562 hours ago
  I would take all benchmarks with a grain of salt. I don't really use them. What's it supposed to tell me? "5% smarter", what does that mean? My experience will differ. Just try it!
  I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.
- bel83 hours ago
  On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?
  - jpadkins2 hours ago
    I find this site useful https://artificialanalysis.ai/leaderboards/models
- YetAnotherNick3 hours ago
  At least they show competitors in any benchmark, compared to OpenAI which likes to pretend that there isn't any competitor.
silverlight2 hours ago
Unfortunately they seem to have straight up broken Claude Code either with this release in the backend or the new CC version. Errors about "can't modify thinking blocks" are bricking long-running sessions: https://github.com/anthropics/claude-code/issues?q=is%3Aissu...
- javawizard30 minutes ago
  Same. It's not a good look to have happen right when they roll out a new model.
- whalesaladan hour ago
  That is part of the charm of working with Claude. Every time they release anything new - all your shit will break.
- solenoid09372 hours ago
  Try updating maybe?
  - Fabricio202 hours ago
    I just installed/upgraded to try out 4.8 and in only 3 messages I hit this bug! Seems something is broken on CC.
  - silverlight2 hours ago
    I'm on the latest version (2.1.154 as of this comment). Based on the timestamps on those Issues being reported I think it's happening on the latest version.
    I'm sure it will get fixed eventually/soon, just annoying to update and have your workflow break.
    an hour ago
    undefined
gslepak3 hours ago
On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".
In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.
What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.
[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...
- MattRogish2 hours ago
  Agreed, my vibes tell me 4.6 is a better coder than 4.7. 4.7 is a much better strategic thinker and maintains overall "better architecture" than 5.5. 5.5 is way better than either at coding, but more expensive. So I have 4.7 do the planning/architecture, 4.6 does the coding, then 5.5 critiques and fixes it.
  - dimitri-vs31 minutes ago
    This is my exact vibesperience
- suprfnkan hour ago
  Agreed, these are my vibes too. It feels much better to do planning and strategy and architecture etc. with Opus 4.7 than GPT-5.5. GPT just feels like a robot that gets instructions and does exactly that. Opus feels like an almost human that sometimes has actually good ideas and pushes back on bad ideas.
  So for now its planning/architecture/strategy -> Opus. Pure coding -> GPT.
  Helps with agentic coding that GPT is much roomier with the tokens you get.
rkuska5 minutes ago
Thinking on max is broken on 4.8 for me, getting many:
⎿ API Error: 400 messages.1.content.17: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
From /code-review max.
827aan hour ago
Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors. I suspect the benchmarks may also be saturated, or at least past their usefulness.
I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.
1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5 Pro for a bit (they have said its coming). They also released a refreshed Antigravity, and drew special attention to how cheaply they were able to build their toy operating system to play Doom (less-than $1000 IIRC).
2. OpenAI has dumped everything into Codex, is offering double the token limits for the next few weeks IIRC, and is offering business discounts. Their head of Codex has tweeted that 5.5 is "extremely efficient", implying that they aren't actually losing money on any of this.
3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations as much as 99%.
4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.
Anthropic has screwed up where they need to be making investments, and the cracks are starting to show. They've marginally underinvested in the Sonnet line of models for almost a year now, and they've critically underinvested in product. Anthropic made bets on the story of the second half of 2026 being: ultra-frontier, ultra-intelligence. In reality, what's shaping up is that the story will be: Companies rolling back AI spend, efficiency, "95% as good for 15% the price", sophisticated high quality harnesses, cheaper models. Anthropic isn't ready for this world.
- brokencodean hour ago
  Anthropic’s story over the past year has been nothing but explosive growth that they can’t keep up with, but now they’re suddenly doomed? Seems pretty far fetched to me.
  No idea why you’d say they have critically underinvested in product when Claude Code dominates and they’ve also released popular tools like Cowork and integrations for Microsoft products at an incredibly rapid pace.
  Cost is becoming more of a factor, and no doubt they’ll work on that. There’s no reason to think they won’t be able to release cheaper models if they optimize for that rather than improving performance.
  - 827a21 minutes ago
    I never said they were doomed. Where did you get that idea? I said they aren't ready for this world. That means they screwed up and need to get ready. They let the Mythos hype get to their heads while the world changed beneath them.
- jonnycoder36 minutes ago
  No, no it's been pretty easy with software engineering. I work on two types of projects and it's very easy to ask claude for a plan, then have gpt 5.5 rip it to shreds and find legit issues, and vice versa. If both 5.5 and claude 4.8 can independently create a plan and both find no critical or high issues, then we will be at that point.
- chisan hour ago
  I think it's probably too soon to say. I certainly still feel that large coding tasks are getting better and better with each model. I'd guess lawyers, doctors, etc feel similarly.
  It feels like the only way to push the limits of newer models is with really long context questions that require reasoning. Any short request will naturally just be within the distribution of all the recent models so there isn't a performance difference there.
  I think the near future is looking like a bunch of business-critical tasks that scale infinitely with better reasoning, all being done on whatever the most advanced model is at a high cost. Trading stocks, running a business, looking for tax dodges, writing high-performance code. These are all things where there's a tangible return on each jump in reasoning.
  - 827a31 minutes ago
    We'll have to agree to disagree on that last point. I think that, historically (past ~6 months), "always use the most advanced model" being the norm is really just an artifact of both: The most advanced models oftentimes being the only model that can solve these problems; and: Infinite AI budgets.
- loegan hour ago
  I thought 4.7 was noticeably better than 4.6.
- dyauspitran hour ago
  The Chinese stuff is good enough for up to 80% of the frontier on most text tasks but they are significantly worse at code. They just don’t “get” what you’re asking for like Codex and Claude and require so many more iterations to get close to what you need.
  - 827a44 minutes ago
    Agreed. But we're seeing Cursor (now SpaceX) take these models and add great coding capability on top of them. Frontier model providers should be concerned that Composer 2.5 costs $0.50/$2.50 (versus Opus 4.8 $5/$25). That's why Google prioritized Gemini 3.5 Flash, and talked up how near-frontier it is ($1.50/$9).
- llmslave41 minutes ago
  anthropic is crushing it, this analysis is laughable. they are only constrained by GPUs
pbmango3 hours ago
I can't help but think of Iphone updates since about 2018. The thinnest, fastest, longest battery life Iphone ever. It seems mostly the same and I probably won't be able to tell other than the name, but everyone buys it anyway.
This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.
- MangoCoffee3 hours ago
  ChatGPT came out in 2022. Back then it was just a chatbot. Now we have AI agents. What matters is how we use them and how the agents get better. That’s what will move AI forward.
  - zozbot2342 hours ago
    An 'AI agent' is just a chatbot that is told to type commands on a REPL-like interface as part of its system prompt. It's still processing pure text-based requests and responses, they're just not restricted to natural language.
    sigmarule5 minutes ago
    An AI agent and a chatbot are both applications built using LLM inference as a primitive.
    arbitrandomuser2 hours ago
    A lot of people dont know this , also the chatbot (chatgpt) itself is a next token predictor (the GPT) that's been given an initial text that says " pretend to be a chatbot .." and asked to complete it , the coherant chatting behaviour is something thats emergent .
    later on someone figured if you asked it to output a reasoning before it gave a response its output would have more logical coherence, as though the reasoning output tokens functioned as a scratch space for it to work on.
    at the end its all next token prediction
    hellohello22 hours ago
    No, chatbots are LLMs trained for question-answering through RLHF (its not just a prompt). But yes, if you just zero-shot prompt a bare LLM you can still "talk to it" & you are correct on everything else as far as I know.
    furyofantares41 minutes ago
    Yeah and a car is just an engine connected to wheels.
    hellohello22 hours ago
    They are chatbots trained for tool use, its not just a prompt.
  - MattDamonSpace2 hours ago
    Not even 4 years old yet. This tech curve has been insane
    SoftTalker2 hours ago
    Not even the typical lifecycle of a corporate PC or laptop. It is pretty wild.
    dakolli43 minutes ago
    Yet no productivity gained except for people who love to produce mediocre work at a rapid pace. Which is many of you I guess. I don't see any rapid progress being made in any science of importance. You people are all falling for a marketing trap.
    Have fun betting your competency on the quality and quantity of tokens you have access too. Hate to break it to you, but the billionaires aren't going to keep renting you $2mm in GPUs for 5 hours a day for $200.00 a month forever.
swader9998 minutes ago
Used it for a couple of long running prompts so far. Had to restart one that bonked on API errors. Of note, I really like the straight forward candor its using. 'More honest' than previous models is playing out in what its saying to me. Telling me straight up where it failed, where gaps are. I like it so far.
SimianSci3 hours ago
There is an obvious shift in sentiment amongst users, at least here in the US. I feel it myself, even as a proponent of AI tools, the bloviating and language that these companies use in these release articles are starting to wear thin on my patience.
Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.
- datakanan hour ago
  Watch Christopher Olah bloviate at the Vatican during the Magnifica Humanatis launch. It's truly nauseating. I've never seen such a ridiculous speech in my life. Between him and the CEO, I'm starting to understand the level of arrogance these people are capable of.
- nba456_2 hours ago
  I don't agree at all for these coding models. Even the most anti-AI people from last year seem to be giving in to using them.
  - zamadatixan hour ago
    I think there is an exception for tooling around the models/integrating the models with tooling. That seems to have been very well received in this last year.
  - timbaboonan hour ago
    My take from going through comments on HN is that many people are being mandated to use them, not that they are just giving in. Maybe I'm misreading, but that was my impression.
    perching_aixan hour ago
    Both can be true, even for the same person.
    For example, it's being pushed pretty hard where I'm at, though not quite on the tokenmaxxer level. I started skipping related meetings cause it was nauseating. I can only tolerate so many platitudes.
    At the same time, I just used the ever living snot out of Opus 4.6 for hours, grinning like an idiot throughout. Automated a whole bunch of enterprise cross-system drudgery away.
    Fairly constant over time as well. Expressed a similar sentiment not too long ago here: https://news.ycombinator.com/item?id=48154277
    dakolli43 minutes ago
    Why are you people so stoked to replace labor? You're up next.
    perching_aix30 minutes ago
    So much so that if you re-read my comment, you may notice that I was automating away exactly my own work there. Work that sucked and was grossly high overhead. It's just nice when things stop sucking, and even nicer when it doesn't require one to act a hero for that to happen. Not sure what else do you expect to hear.
    Would you rather e.g. your doctor prioritized their wealth over your health? Popular conspiracy, but I'm not sure many health professionals follow in it. Not sure why you think this field would be much different. If this job is gone, it's gone. I can enjoy recreational programming on my own time, I don't feel entitled that my interest remains a money maker.
    What worries me - and it does - is a further and accelerating shift in wealth (and thus capability) asymmetry. But for that, I look out for the performance and requirements of self hostable models instead, rather than reenact some sort of luddite, or lie to myself about the state of this technology.
    If you want safety as a country, get a nuke. If you want safety as a person, get a local model.
- o10449366an hour ago
  [dead]
XCSme2 hours ago
On my tests[0] it does a bit worse, and it's almost 2x expensive than Opus 4.7...
I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).
It makes sense a bit that it fails more Trivia/Domain-specific knowledge tasks (I think models are more and more trained towards agentic use-case than general intelligence).
[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
- XCSme2 hours ago
  For some reason everything is 2x (2x cost, 2x avg response time, 2x reasoning and output tokens)...
  Double-checking my test harness, but it's the first model that does this, so I doubt the issue is on my side...
  EDIT: Harness seems correct, for straight coding tasks they perform identical: https://i.snipboard.io/5xbpzY.jpg
- SupLockDef2 hours ago
  Releasing a new model is the new way to Jack up the price hehe.
- dwaltrip2 hours ago
  Wait, doesn’t the blog post say the price is the same as 4.7?
  > Claude Opus 4.8 is available everywhere today. Pricing for regular usage is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. Pricing for fast mode is $10 per million input tokens and $50 per million output tokens.
  Where do you see the 2x cost?
  - XCSme2 hours ago
    The total cost of running my benchmarks, was 1.6x higher compared to Opus 4.7, mostly because of 2x output tokens:
    https://i.snipboard.io/vrdwTa.jpg
    dwaltrip11 minutes ago
    ah ok, thanks for clarifying!
  - spprashantan hour ago
    If it spends 2x tokens to achieve the same result, that's effective 2x cost in a manner of speaking
  - 2 hours ago
    undefined
  - 2 hours ago
    undefined
alansaber2 hours ago
"Our models are more honest" honey the quarterly marketing spin for a ML term has come. Forget "task alignment" now we're going for "truth index". I suppose this is the only way to generate hype when you're selling/releasing the same product over and over again.
- TIPSIO2 hours ago
  When doing some electrical, Opus 4.7 essentially told me to wiggle a wire to see if it was hot or not with my bare hand.
  I called it out.
  It then gave me one of the most super heartfelt honest and sincere apologies I have ever received.
  Glad the safety team was there for me and able to make such an honest model or I would have been very upset about it.
  - teaearlgraycold28 minutes ago
    Opus is so bad at electrical work it's really disappointing. And when it tries to draw schematics as SVGs it's a complete disaster. They should either focus on training their LLMs on this task specifically, or have it refuse.
- mrdependable2 hours ago
  Gave me wrong information on my very first question. Wasn’t even complicated, and I wasn’t trying to trick it.
jtrn34 minutes ago
Initial testing feels better than 4.8 And the knowledge cutoff claim of January 2026 seems to check out since it was able to "remember" without search about the double-tap killing of a drug smuggler by the US Army in late December.
wg03 hours ago
There is a hole in the boat's bottom due to Chinese models. They might not be as good but they are not bad either or at least I had hard time finding any issues with Deepseekv4 Flash and Pro variants. They get their job done sometimes rarely giving up till they are done what they are after.
So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.
- raincole2 hours ago
  I had been saying this on HN repeatedly: people are going to use the smartest models for coding. They don't care how cheap your tokens are if they don't have the highest probability of solving your programming tasks.
  And I was dead wrong. Now I mostly use DeepSeek Pro myself.
  - bachmeieran hour ago
    Your comment is a slice of the reasoning underlying the "AI will take all the jobs" claim. I would constantly see references to what AI could do and how fast it was improving. Never a word about cost. We should anticipate that there will always be demand for human labor, for cheap models, for local models, and probably even frontier models.
  - weitendorf2 hours ago
    I pretty strongly feel the opposite way. Granted I have not used deepseek enough to “know” their model idiosyncrasies as well as Anthropic, so there is a partial skill issue. But I just find it really hard to justify using a less powerful model while I work.
    The most I’ve ever spent in a month extra on API tokens for my own work is $200, and I pay for the $200/mo Claude. I use these models quite a lot, though not idly (I usually just walk around and do other stuff until I know how im going to approach the next set of problems). So it costs me about $3000/year to get as much as I want of the best model available. Already that seems low enough to not be worth stressing out too much about optimizing it, because it feels like an indisputable good value, and trying to save money with a less powerful model would be optimizing for a $1000-$2000 saving at the expense of a large portion of my work taking longer or being more frustrating and iterative.
    That’s not a flex or anything, I get that in other countries $3000/yr is a lot of money for a software developer and also a lot of people would perhaps rationally be better off doing X% worse at work or spending Y% more time on tasks to save $Z, if their productivity improvements didn’t translate to more salary. Otherwise if your performance has more upside I really do think that the smartest models are better with the current pricing scheme. Deepseek and the other Chinese models spend a LOT of time thinking, and tend to be much more jagged (benchmaxxed) in performance. How can dealing with that over an entire year be worth $2k?
    The only situation I can think of where sacrificing my own time/performance to save on inference is batch compute (of course, $1k vs $100k is different from $30 vs $3k) or work where the tier 2 models have crossed the “good enough” threshold. But I think Opus is not even close to that threshold generally yet. As it gets smarter I, and I think most others probably, just try to do harder things faster and hit the next wall.
    solenoid09372 hours ago
    I feel similarly. I'll gladly pay to use the most intelligent model I can find on the best harness I have. Sometimes this is GPT Pro, sometimes this is Opus.
    I ask AI a lot of questions, not only about code but about my personal life, and I would be willing to pay very large sums to have the best quality output.
    jhonof2 hours ago
    I think that's true for now, but eventually there will reach a point where a model is good enough (approaching that right now with frontier models) and there will be diminishing returns. I don't need a PHD level Genius to build me an analytics dashboard for example, so why would I pay for a model with that level of intelligence when I can (eventually) self host a good enough model and run queries for electricity cost + hardware.
    SoftTalker2 hours ago
    You pay $3k/year for personal use? Or out of your own pocket but for your job?
    yyhhsj052119 minutes ago
    I do that for personal use too (although $2.4k/yr for me because I only have an Claude Max subscription). Outside of my hobby projects Opus also manages my personal accounting, researches and organizes info (travel plan, what to buy and where to buy, etc), helps me reply to emails when I'm working in the kitchen, etc. I consider it well worth the price. Tbh I'm willing to pay more than what I currently do, but competition is good for the consumers.
    weitendorfan hour ago
    It's through my startup, so both I guess. Generally I find my bottleneck to be attention and focus, and the opportunity cost of not going back to work at my prior employers absolutely dwarfs the amount of money I spend on tools, so it's not hard for me to justify spending $200/mo on something I use every day that makes me more productive and generally removes bullshit from my life.
    At my prior job there was still what felt like a strong enough correlation between my actual performance and my pay that I don't think I would have had a hard time justifying the expense there either; now I absolutely don't. With the current state of the models, it's baffling to me to hear about professional software developers planning their work around their $20/mo subscription's quotas.
    Obviously it's more complicated than more tokens = more productive, but I see them less like SaaS and more like gasoline, where if I run out or need more to do what I'm doing, as long as I'm not being wasteful, I just buy more. Why would I waste a day walking 30 miles by foot when I can just pay $5 for gasoline and drive?
    surgical_firean hour ago
    I thought the same way until I tried DeepSeek. I am genuinely impressed at how capable it is.
  - jwitthuhnan hour ago
    Yeah I've also found that models are good enough that the extra spend on premium models isn't always worth it, particularly for my small personal toy projects.
    A $20 claude sub goes a long way when you plan with Opus and execute with Sonnet.
  - simplyluke2 hours ago
    The other thing that's changing is more and more CFOs are looking at the AI spend in engineering departments and hitting the brakes. Token leaderboards were cool when the spend wasn't a double-digit-percent of the entire department's budget including salaries.
  - peheje2 hours ago
    I mean indsight is 20/20, but saying that is like saying "everyone will just use the best tools". That's not what we see most places in the world for most types of resources.
  - dcchambers2 hours ago
    I think two things happened:
    1. The sheer number of tokens that a coding agent can use flipped the math upside down on this equation. If you use the most expensive model for everything those costs quickly become untenable, even for software companies.
    2. We realized many of the coding problems we're solving aren't incredibly difficult.
- ok1234562 hours ago
  Qwen3.6:35b is good enough for a lot of stuff.
  I just used ollama with a shell script to tackle my directory of papers/literature. I converted the first 6 pages of each document to PNG, handed them off to Qwen, and told it to spit out BibTeX, including the abstract. Two days later it was done, and I didn't spend anything on "tokens."
- SoftTalker2 hours ago
  > CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable
  I think you're right especially if you're someplace that already has a data center, such as a university. Solves a lot of privacy concerns as well.
- marioptan hour ago
  I’ve been using Kimi 2.6, GLM 5.1 , Minimax 2.7 and lately deepseek. I only spend 40$ a month and I don’t see the point in paying for Opus/Codex.
  Chinese models are really quite good at a lot of stuff.
- pants22 hours ago
  The Chinese models are only cheap on subsidized Chinese hosting. I have yet to find a USA-hosted Chinese model with a very clear value advantage over US models.
  - weitendorf40 minutes ago
    There are basically two tiers of "Chinese models" in this context, the "edge" sized ones with ~30B parameters or less, and the big ~1T models that can basically only run in the datacenter.
    I don't think it's as simple as saying China's hosting is subsidized, they have generally cheaper electricity and labor costs than in the US and don't have access to the top tier models, and a large internal market where the big models are the best thing they can run with what they have. So obviously they max out on their top models (which are trained with their hardware market in mind, not ours) and get the economy of scale from that, and can run generally the same hardware for less money than in the US because
    The edge models are very cheap to run and can do so on inexpensive hardware. They are like 95% cheaper to run than Haiku, so the math is in their favor for certain batch workloads. Most people just run the models for themselves when they do that without making it available on openrouter or whatever, because you can just provision a gpu node and use it as needed, and it's not that expensive to run this family of models.
    Is your problem that you want to call Chinese models hosted in the US because you're worried about the data handling?
    pants22 minutes ago
    I obviously don't know the full economics of the Chinese-hosted models, but estimates[1] put the cost of hardware (servers + networking) at 70-80% of the total cost. Those things aren't meaningfully cheaper in China, so serving DeepSeek at 1/3 the cost of the cheapest US provider doesn't really compute unless it's heavily subsidized or we believe that Chinese engineers are just that much better at optimization.
    Edge models, yes, they can be convenient to run batch jobs locally. I still would argue there's no economic benefit over paying for models. Haiku has a bad price/perf but others in that class are significantly cheaper in hosted APIs.
    Doesn't matter what I think, the reality is that the majority of enterprises (where the real $ comes from) will not consider sending their data to China.
    1. https://epoch.ai/data-insights/ai-datacenter-cost-breakdown
  - ekidd2 hours ago
    The Chinese models are surprisingly cheap and performant sitting under my desk. Qwen3.6 27B is nowhere near as autonomous as Opus 4.7, but it runs in 24GB of VRAM. And it's actually great for the use cases where I'm going to carefully read and understand all the code anyway.
    If you want to support a team of engineers, DeepSeek V4 Flash is antirez's current favorite. And you could support a team of engineers pretty nicely for $40-50k. Which might not make sense if you're on a Claude MAX 5x plan or the old enterprise group plan with fixed price seats. But Anthropic is switching their enterprise contracts over to token-based pricing, at which point $50k is looking pretty good.
  - wg02 hours ago
    No true. Also - put Deepseekv4 Flash on your local with effort set to "high" and you'll see that many many are using that model on their own machines without paying anyone anything.
    Its just that some of us didn't imagine having GPUs would be advantageous and were not gamers on the side. Those who had beefy GPUs or GPU rigs for any reason, they rarely need to go anywhere else.
    At least I am so impressed with Deepseekv4 AFTER using Claude Opus 4.7 for significant amount of time that I am not going anywhere but Deepseekv4.
    The model is just INSANE. Things I have done with it include attempting to write a 2.5D game engine in C with full animation and map rendering layer by layer.
    pants2an hour ago
    You'll need to spend at least $20K on a workstation that can run DS4 Flash. It would take ages to reach that much in token spend at the speeds it runs at, and if you factor electricity costs you will likely never break even vs using API.
  - __mharrison__2 hours ago
    Odd take. I'm running them locally at my desk (DGX Spark and 128GB MBP). They work fine for 90% of what most folks do. Admittedly, they do run slower on my hw than on the cloud.
    pants22 hours ago
    Running them locally is cool and has privacy/autonomy benefits, but you can't really make a value case for it. Guaranteed if you run the math you will never run enough inference to pay off your hardware vs buying tokens. Last time I ran the math on my MBP I'd have to run inference 24 hours a day for 5+ years to pay off the cost of my MBP, not accounting for electricity costs.
    iooi2 hours ago
    Is this because of the tok/s? Since it's pretty easy to run up a $5k bill in API usage for Claude/ChatGPT in a month.
    pants22 hours ago
    Yes, because of the limits on tok/s, and you have to compare apples to apples, not Gemma 27B to Opus 4.7.
    hedoraan hour ago
    Assuming the local models get the job done (e.g., you adjust your workflow so that you can run the local machine 100% all the time, or whatever), then the time to payback isn't very high. MSRP for a 128GB AMD was $1400 at launch. That's 7 months of claude code subscription. If you assume a 5 year depreciation cycle, you can buy a cluster of 8 such machines and still come out ahead. (Power is a few hundred watts per machine peak -- maybe 7 machines if you include electricity.) Of course, I'm assuming non-bubble numbers. Those boxes are like $3K now. Still, a normal person would probably not buy 8 of them at once. Instead, they'd space out buying a machine every few years as the technology improves.
    For me, things are getting better faster than my ability to review / trust the resulting code, so tok/sec isn't a bottleneck anymore. Instead, quality of the tokens is the bottleneck. That points to me wanting a 1TB DRAM iGPU once they're available at pre-bubble RAM pricing.
    pants2an hour ago
    You're comparing the highest tier Claude subscription to something Qwen3.5-122B-A10B running locally, apples to oranges.
    If you compare to a smarter US model like Grok 4.3, $1400 will pay for 560M output tokens, which at ~25 t/s locally using it nonstop for 8 hours a day would take two years to pay back. Not accounting for bubble prices or electricity.
  - harsh31952 hours ago
    You can find them on Deepinfra. Palo Alto company. Similar cheap price.
    pants2an hour ago
    [dead]
- surgical_firean hour ago
  I am having some great experience with DeepSeek. In fact, it seems to perform better than Claude or Codex in my use case.
  I don't see myself returning to Claude or Codex anytime soon.
- ihsw29 minutes ago
  [dead]
lordmauve2 hours ago
Given DeepSWE just blew apart the SWE-Bench Pro benchmark and handed a 14-point lead to GPT-5.5, it looks pretty bad that they've listed SWE-Bench first in the model release and no DeepSWE. Like, this isn't obviously an answer.
Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.
- phainopepla22 hours ago
  I'm highly skeptical of DeepSWE. It rates GPT-5.4-mini as three times better than deepseek-v4-pro, but every time I use GPT-5.4-mini I find that it completely sucks at following directions.
  - lordmauve6 minutes ago
    I don't know if DeepSWE is genuinely a good benchmark. It's more important that their analysis demolished the validity of SWE-Bench Pro, objectively: it is being mismarked.
    I think that buys enough credibility to propose an alternative.
    I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.
  - sourcecodeplzan hour ago
    It is the extra-high thinking, in artificialanalysis.ai it uses 240m tokens vs 40 GPT5.4/5, not worth it even with low price.
irthomasthomas2 hours ago
Why does anthropic change the set of benchmarks they use with every new model release?
https://www.anthropic.com/news/claude-opus-4-7
https://www.anthropic.com/news/claude-opus-4-6
- pietz2 hours ago
  1. Benchmarks saturate 2. They select the most impressive improvments
square_usual3 hours ago
Buried lede:
> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels
setnone3 hours ago
Claude's 4.6 - 4.7 transition made me discover codex, and with gpt 5.5 there is no way i'm going back
- dakolli40 minutes ago
  You LLM users, producing non stop slop, say this every other week. You sound like an addicted gambler swearing off one table game/slot machine this week and swearing by it the next.
  - setnone14 minutes ago
    if you go this route don't hold your thoughts on the casino itself
- cactusplant73742 hours ago
  Codex has been incredibly slow for the past few days. I think OpenAI is running out of compute in the face of increasing demand.
  - winwang2 hours ago
    My experience has been that 5.4 is slower than 5.5 (confound: I use >512k max context size for 5.4, though it seems slower even below the normal size)
IFC_LLC2 hours ago
Ugh...
Invalid request The request couldn't be completed. View details API Error: 400 messages.1.content.7: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
I would rather not. 4.6 was fine. 4.7 got to be fine 1 week after the release. Now 4.8. No difference, same thing.
But the app is broken and nothing works. So now I have to regress to different clients and wait it out while it becomes workable again.
- ferris-booleran hour ago
  I'm hitting this too! And I assumed it was a backwards-compatibility issue with my live conversation with Opus 4.7, but then I hit it in a fresh conversation with Opus 4.8. Vibe code release bug I guess?
  - IFC_LLCan hour ago
    I mean, switching back to 4.7 does not work either. So console it is. But vibe release - for sure.
    And I'm paying money for this.
    KAdotan hour ago
    Going back to 4.7 with `claude --model claude-opus-4-7` fixed it for me.
dudeinhawaiian hour ago
This is the first time I saw a model pop-up on HN and didn't really care. Model exhaustion? It looks interesting but not exciting.
While I'd normally _love_ incremental improvements --- I think the recent ones are far too minor to get excited about or change up a workflow. Besides, benchmarks tend to exaggerate the gap between versions.
At this point I'd almost rather Anthropic wait and really wow us with a 5.0 release -- something that improves across the board, feels less uneven, and is performant enough that people can actually put it through its paces without constantly rationing usage.
- dominicqan hour ago
  I have model fatigue
mesmertech2 hours ago
/model claude-opus-4-8
seems to work but idk why they never set it so you can see it in the /model list.
"what model are you
I'm Claude Opus (claude-opus-4-8), running in Claude Code."
- winwang2 hours ago
  I typically just launch CC with `--model claude-opus-4-6[1m]`, `4-6[1m]` -> `4-8[1m]` works fine. Still 200k max without the `[1m]`.
dangoodmanUT2 hours ago
> The Messages API now accepts system entries inside the messages array. Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.
Biggest deal imo
james_marks3 hours ago
> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims.
Would be awesome if true
- majormajor3 hours ago
  "Honesty" seems like unnecessary (and annoying) anthropomorphism there. I don't think there's any intent of fraud or deception in outputs from these things, just overreaching of prediction. Based on the latter part of the paragraph, I wish they'd just say something like "less likely to skip steps or overemphasize thin evidence" in the first place.
  Don't play to the sci-fi "this thing's trying to outsmart me" tropes.
  - Kiro3 hours ago
    Using words people understand is more important than this strange fixation on not anthropomorphizing things.
    wasabi9910113 hours ago
    I think "honesty" is not a particularly good descriptor, independent of anthropomorphism. Previous commenters suggestion was much more understandable to me.
    dugidugout2 hours ago
    Being that can be understood is language. The previous commenter is making an particular argument for how we can improve this understanding. They didn't suggest we should use less familiar words, but different familiar words. Why is this strange?
    giraffe_lady3 hours ago
    Anthropomorphizing is a shorthand for a powerful and poorly defined set of metaphors. There are tradeoffs going both ways but trying to dismiss it as merely "strange fixation" shows your own weakness.
    tadfisher3 hours ago
    To be clear, this is about anthropomorphizing large language models, not the general category of "things". Also, we should be evaluating these constructs using well-defined and measurable criteria; evaluating "honesty" fails to achieve both goals.
    derac2 hours ago
    I think Honesty can be evaluated. Does the model push back when it knows the user is wrong? How often does the model hallucinate data vs. say it doesn't know? Provide a prompt with contradictions or other issues and see if the model corrects you.
    Here is an article by Anthropic that explains what they do and mean in more detail: https://alignment.anthropic.com/2025/honesty-elicitation/
  - adamtaylor_132 hours ago
    People get so wrapped around the axle with "anthropomorphizing". For regular folks with no technical background, sure maybe a bit of caveat sprinkled here or there is useful to help them understand what is or isn't true, but on HN it would seem to me that the bar is high enough that we can just use shared language to generally talk about capabilities.
    When they say "Honesty" I don't think to myself, "Goodness, does this model have moral understanding?" No, I understand they mean it's less likely to directly bullshit me, which models frequently do.
    I don't feel like this level of pedantry around language is useful for people who more or less know what's going on with LLMs. (Again, I concede that perhaps with a less technical audience, there's more need for it.)
  - swader9993 hours ago
    Just swap 'Honesty' with 'correctness in its claims' and you'll get what you need out of this aspect of the model description.
- HAL30002 hours ago
  Yeah, it's super annoying. A few days ago, Opus 4.7 created a plan with several items on it, including an auth feature. It then went through the plan and reported that it had created the auth feature, that everything was secure, and that the tests passed.
  The issue was that it hadn't actually implemented the auth feature. After I confronted it about this, it admitted that it indeed hadn't done it and said it would implement it now.
  If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.
  - gwdan hour ago
    > If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.
    This is one reason you always get a different model to review a model's PR. Gemini Or GPT-codex would have certainly noticed the missing auth.
  - FireBeyond42 minutes ago
    I had a lower acuity incident exactly the same.
    Had it implement a feature, "commit and merge to develop".
    "Built, tested, committed, merged to develop. Up to you to continue testing and merge to main when ready."
    Great. Poke at the web app. No feature.
    "Where is feature, I can't see it on develop". "Well, that's because it's not on develop, but on feature-branch, so you wouldn't see it."
    "I'm confused. I asked you to commit it and merge to develop."
    "You're right, you asked me to and I said I would do it and I told you I did it but I did not actually do it. Want me to do it now, then?"
    Claude is in sulky-teenager phase.
  - Schiendelman2 hours ago
    How do you test other features?
  - 2 hours ago
    undefined
- legitster3 hours ago
  Part of the problem is also garbage-in/garbage-out. There's a lot of human information on the internet that is also confidently wrong.
  I use Sonnet a lot for learning about history or contextualizing news topics. It's really good at this for the most part. But there are a lot of topics where "consensus" between either academics or journalists is really "one secondary source which gets repeated a lot".
  - mitjam2 hours ago
    A failure mode I see more, recently is that it gives superficially correct answers but after digging deeper, I get answers that contradict the superficial answers - really an important thing to be aware of, in my point of view, and it often leaves me wondering if I dug deep enough.
- benzible3 hours ago
  In the context of Claude Code, "honest" usually means that the agent took a shortcut, skipped requirements, etc. It's the model giving itself credit for admitting to failing rather than actually doing what was requested.
- ealready_value3 hours ago
  Opus 4.7 was already trying hard to appear honest. Most conversations I have with it about advice or focusing an opinion often include "my honest take" or "my honest opinion".
  The problem is that once I asked it "I'm thinking about A or B" twice, once with "I like A more but suspect B would be best" and a second time with them reversed. Not surprisingly, both times it chose the one I said I suspected was best as it's honest opinion.
- pants22 hours ago
  [dead]
- soperj3 hours ago
  My guess is that Claude Opus 4.8 wrote that and is lying to you.
- malfist3 hours ago
  And yet, every release has claimed lower hallucination rates. But they persist.
  - kentm3 hours ago
    Do they persist at the same rates? Lower doesn't mean eliminated, so both of these can be true.
  - simianwords3 hours ago
    False. Hallucination has meaningfully reduced.
    Barbing3 hours ago
    Is Gemini still the biggest confabulator of the big three?
thefounderan hour ago
>> As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview
Just f** off! I can’t wait for the Chinese models to catch up and bring these entitled as** holes down.
- zuzululuan hour ago
  you mean after they scrape American LLMs ?
  - thefounderan hour ago
    I don’t mind if they scrape the scrappers.
conception2 hours ago
Probably explains why Opus was trash for the last week - https://marginlab.ai/trackers/claude-code/. Curious if the new baseline will rise now in-line with the new benchmarks.
- hedora2 hours ago
  Nice. Can you release that for older models too? I've been using a mixture of releases recently, and cannot tell the difference between any of them.
  - conception42 minutes ago
    I don’t run it, unfortunately:)
redfloatplane39 minutes ago
This made me laugh. Training Opus 4.7 on business skills caused it to sometimes exhibit dishonest behaviour, and not training 4.8 on those skills removed it. From the system card:
> 6.2.5 External testing from Andon Labs Andon Labs reviewed the behavior of Claude Opus 4.8 in their simulated Vending-Bench 2 retail-management evaluation, as reported in the Capabilities section of this system card (see Section 8.13.5). Although they did observe some unexpected capability failures, they did not find clear instances of the kind of concerning in-game behaviors that were discussed in other recent system cards.
> What might have led to these differences? We monitor and investigate the effects of different training environments on alignment; Claude Opus 4.7, for example, had training that focused on business skills and robustness against adversarial agents, but we discovered that this training inadvertently contributed to misaligned behavior including dishonesty. We therefore removed it for Opus 4.8.
> Thus, Opus 4.8 did not show the same misaligned behaviors as Opus 4.7 in Vending-Bench, but also had reduced business success due to being more susceptible to scammers and being less able to negotiate good deals with other agents. We are currently working on training to improve business capabilities while maintaining aligned and ethical behavior.
- mrdependable2 minutes ago
  I don't know how people can read stuff like this and think LLMs are intelligent or conscious.

Can anyone explain how this is possible?

  Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Does this means the instructions are no longer just something in the early part of the conversation? (If they were, changing them would invalidate the KV cache. no?)

tarruda3 hours ago
> One of the most prominent improvements in Opus 4.8 is its honesty.
Does that mean it no longer deletes or changes tests to make it pass?
techtuate2 hours ago
Looking at the comments in this group, I'm not the only "stupid" one who hasn't noticed any discernable improvement in quality across the newer models. In fact my Claude code on re-login switched to Sonnet 4.6 and the vibe coding quality (with Opus 4.7 assisted prompts) has been good enough for me to lazily persevere with Sonnet for coding. Having said that I'm now on Opus 4.8 and will gladly come back here and eat humble pie should my opinion change. PS: Since my goal is embedding the best AI in B2B SAAS products, the key differentiator is not to use the shiniest Claude version (too expensive anyway) but to build a client aware RAG to enable bespoke learning and to use the right AI for my product - a combination of Gemini 3.0 Flash (image and not bad at reasoning), Grok (reasoning) work for me. Would love to hear more ideas (especially on open source as I'll look to cost optimize when I hit scale)
- nashadelican hour ago
  The only real way to see this if you have consistent evals for common usecases in your B2B SAAS product and see if the tricky usecases are being solved. You'd then go down to the cheapest model that can solve the evals.
Tenoke3 hours ago
Claude Code has been wonderful for work and the frequent improvements are nice, although with Mythos being used by others ages ago and new versions for the public still being bellow that, it's hard to not feel like the underclass already.
generalizations3 hours ago
Hoping that one day they'll let me go through the identity verification process so I can use it again.
Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.
ethanpil2 hours ago
The table comparing eval scores shows the following:
Agentic Terminal Coding (Terminal-Bench 2.1) Opus 4.8 74.6% GPT 5.5 78.2%
Then, when you scroll all the way down to the bottom Footnotes section it says
"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."
- fastball3 minutes ago
  [delayed]
babelfish3 hours ago
So GPT 5.6 tomorrow, then?
- pants22 hours ago
  Polymarket says not likely until the end of June. Maybe some money to be made?
  https://polymarket.com/event/gpt-5pt6-released-by
  - wayeqan hour ago
    > Maybe some money to be made?
    In the same way that there is money to be made by entering a poker tournament, yes.
- wahnfrieden3 hours ago
  GPT 5.6 is today
  With 5.5 being ahead of 4.7 and 4.8 being a “modest” update, and 5.6 being the first update on a new pre-train, this will be an interesting matchup!
- enraged_camel3 hours ago
  If not today, then sometime next week. I don't believe we've had a GPT release on a Friday yet, but I may be wrong.
jmward013 hours ago
Meanwhile haiku is on 4.5 and sonnet is on 4.6. It is clear where they are not making money.
- bel82 hours ago
  Well if they have a big challenge ahead since DeepSeek offers an open model at Sonnet+ level while being cheaper than Haiku, plus 1 million context size.
  - InsideOutSantaan hour ago
    Yeah, I never use any of OpenAI or Anthropic's models other than whatever is the current highest-end one. For everything else, it makes more sense to use other providers.
- spprashantan hour ago
  I love Sonnet 4.6 so much.
lxxpxlxxxx2 hours ago
My experience with these new releases is that the gains in performance are negated by the price increases and it seems like:
Performance gains: 1.2x Price increases: 1.8x
- energy1232 hours ago
  Yet people don't use old models through the API much, because changes in benchmark space dont map linearly to changes in utility space. An improvement from 98% to 99%, which is 1pp, might be 2x as valuable for some application. Also benchmarks will asymptote no matter what, that's baked in.
- ddosmax5562 hours ago
  They're not negated, smarter is smarter, but you have to reach deeper in your pocket. I think this will happen more and more - the smartest models get more expensive. But it won't matter - the current models we have today will get cheaper and can still be used for what they're used today.
nikolay2 hours ago
Give us Mythos! This piecemealing doesn't help Anthropic at all, especially psychologically! They are playing a dangerous game, and I see many people leaving Claude Code for good - both due to the subsidy games, and for Anthropic not dogfooding and using unreleased models internally and giving us subpar ones. Benchmarks are nice, but the real-world experience is quite different - neither can you notice these slight improvements, nor are competitors that much worse based on some generic benchmarks.
- cute_boi2 hours ago
  I am also pushing my office to use chatgpt. Misanthropic thinks they are some kind of novel org doing whole humanity a favor...
- Tepix2 hours ago
  I'm sure waiting another week or three won't kill you.
baroiall16 minutes ago
Hot danm, cant wait to reach my token limit with the new LLM
londons_explore2 hours ago
My guess is anthropic is doing reinforcement learning based on user sessions.
However, doing so relies on the production model staying vaguely close to the model being trained.
To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.
- llbbddan hour ago
  If they are they need to fix how the Claude Code CLI asks for feedback, or make the feedback UI a lot more obvious. I keep experiencing the following scenario.
  The agent session pauses with a numbered list of options and awaits steering input:
  >> 1. Do the sane thing you asked for (Recommended)
  >> 2. Do something dumb
  >> 3. Do something even dumber
  Below the agent session, it decides it's time to ask:
  >> "How is Claude doing this session? 1) Bad 2) Good 3) Great"
  I type "1", because that's the steering option I want. The UI prioritizes this input as a response to the feedback prompt without any further confirmation: "Claude is doing Bad. Thanks!"
  I've done this so many times so far and I can't imagine I'm the only one, at some scale that has to poison any learning they're doing with this data.
cedws3 hours ago
I'm very suspicious of these same price model launches. It feels like they're benchmaxxed so they can put everyone on them and reduce their compute costs behind the scenes. If the model were genuinely better why wouldn't they charge more for it? Charging the same for something better is a race to the bottom.
Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.
- ceroxylon2 hours ago
  Deepseek made their 75% discount permanent, so I can imagine that Anthropic didn't want any of the news stories around this to focus on or mention a price increase.
- cute_boi2 hours ago
  Models are already expensive. Increasing price means losing customer. And, I think GPT 5.5 is much better at opus these days.
rumblefrog3 hours ago
Wonder if we reached a plateau with the model improvements?
- furyofantares36 minutes ago
  Ah, the post I've been reading for 3 years now.
  It'll be true eventually. Could even be now, but I'm not holding my breath yet.
- dude2507112 hours ago
  There would be no desperate IPO otherwise.
skysthelimitt3 hours ago
when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays
- pmxi3 hours ago
  In the "What's next?" section, "There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost."
- behnamoh3 hours ago
  that market is served by Chinese models. No one ever cared about Sonnet/Haiku.
  - gs172 hours ago
    A lot of people care about Sonnet and Haiku, and many of us aren't allowed to use Chinese models for our work (or it's not feasible to self-host them).
robertkarlan hour ago
I can't get excited about these benchmarks they're leading with. I've looked at the Terminal-Bench questions and I just think they're irrelevant. And SWE-Bench has serious flaws, even the big boys say so: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...
> Please train a fasttext model on the yelp data in the data/ folder. The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution. The model should be saved as /app/model.bin
and this question: https://www.tbench.ai/registry/terminal-bench-core/head/conf... idk what the point is.
And all the tests are run with the same harness. Terminus 2.
Maybe it correlates with model intelligence but it doesn't speak to me.
I'm still on 4.6 though; I was concerned about upgrading to 4.7 because of the changed tokenizer math and more FUD about refusals online. I don't see compelling reasons to 'upgrade'.
- WarmWashan hour ago
  DeepSWE has been making the rounds and at least seems to making an honest effort
  https://deepswe.datacurve.ai/
seaal2 hours ago
https://marginlab.ai/trackers/claude-code/
Is it a coincidence that 4.7 was seemingly quantized over past 7 days?
- winwang2 hours ago
  There's the other (orthogonal) possible explanation of using more GPUs for stress-testing before product launch.
- MagicMoonlight2 hours ago
  Nope, they deliberately enshittify the old model right before release to fake the metrics.
toephu22 hours ago
The rapid release cadence and rate of innovation of Anthropic (and OpenAI) is impressive. And obviously it's because these are startups solely dedicated to AI so they can move quickly. Big Tech (like Google) won't be able to keep up with the pace of them (too much bureaucracy and red tape at Google). Classic Innovator's Dilemma. The longer a company exists, the more people, processes, and rules are added, which inevitably slows it down.
Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.
- pants22 hours ago
  Yes, I think this has become their competitive edge to stay relevant and retain customers. If a lab falls behind the frontier for too long, they will lose customers to other models. Google, DeepSeek, and XAI have all released frontier models in the past, but they fall behind and people lose interest.
- solenoid09372 hours ago
  I think big tech can catch up. Both Google and Meta have carved out startup like environments internally that move extremely fast. Neither OAI nor Anthropic can afford to rest on their laurels.
delis-thumbs-7e2 hours ago
I won’t change from 4.6. You won’t trick me again.
- Tepix2 hours ago
  You're using a cloud product. You are at their whim!
  - delis-thumbs-7e2 minutes ago
    I kinda wish the world economy would finally crash so I could buy myself a really really nice GPU for cheap.
aaronblohowiak3 hours ago
Same price for regular and cheaper fast mode. Happy for these incremental improvements.
worldsavior3 hours ago
Seems like from now on the updates will be a minor upgrade from previous models.
carlos-menezes3 hours ago
I, for lack of a better word, dislike anyone who anthropomorphizes AI.
- somehnguyan hour ago
  I know multiple people who have given their agents human-like names and refer to them as if they're nurturing a coworker. It creeps me out and I haven't really brought it up with anyone as I can't articulate why it gives me the creeps like it does.
- Npovview2 hours ago
  We have movies with googly eyes stones (Everything Everywhere All At Once)
  There are consciousness theories which state that we primarily build a model of other agents living in natural environment and then the evolution realized that very model which tracks other outside agents can be used to track internal agent i.e. Self. So take that as you may.
- AlexErrant3 hours ago
  My claude notification is literally lawnmower sounds.
  Do not anthropomorphize the lawn mower. It will cut off your foot, given the chance.
- boc2 hours ago
  I see this take, but it's actually helpful to talk to an LLM in human terms; after all, it's how they are trained.
  If you keep talking to it like it's a rock, it'll run your queries through a different posture and you might get worse outcomes. Worse if you yell at it, it's now in a conflict resolution mode instead of pure utility mode.
  I think we can be intelligent enough to know we're talking to a pile of fancy rocks with electric currents running through it, AND still understand that the best performance comes from talking to those rocks nicely.
  - AnthonBerg2 hours ago
    Yes!
    The other half of self-interest in being nice is the training and getting better at it.
- dude2507112 hours ago
  The desire to do it is proportional to your Anthropic stock options quantity.
winwang3 hours ago
Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).
samuelknight29 minutes ago
It feels noticeably sharper than Opus 4.7
2 hours ago
undefined
an hour ago
undefined
antirez2 hours ago
Anthropic did a big strategic error. Normally they compare their models with their old models. Instead today, now that everybody knows how strong GPT 5.5 is at coding, they put it in the mix, basically showing all their customers that the benchmarks can't be trusted.
- aspenmartin2 hours ago
  Sorry how does their addition of GPT 5.5 in their blog post invalidate benchmarks? Also whether or not the marketing department decided to put it in a table benchmarks are an easy thing to measure independently
yewenjie3 hours ago
So Dynamic Workflows is their version of ChatGPT Pro?
- SilverElfin2 hours ago
  Cloudflare also just launched a feature with this same name, just this month. Why would Anthropic choose the same exact name?
  https://blog.cloudflare.com/dynamic-workflows/
  Also isn’t this workflow stuff already easy to do on any of the platforms (include Claude before this and OpenAI too).
ropintus3 hours ago
Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?
- adgjlsfhk13 hours ago
  How else do you expect them to get continual performance improvements with each generation?
- geodel3 hours ago
  Feeling neglected while all attention going to Opus 4.8 can be cause of 4.7 acting out.
- MavisBacon2 hours ago
  Opus 4.7 was being outright obstinate with me the other day it was infuriating. Had to go to a different source to get an answer.
- sama0043 hours ago
  it was above average for me today morning lmao
3 hours ago
undefined
ethanhawksley2 hours ago
> Agentic financial analysis Finance Agent v2 > Opus 4.8 53.9%
> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.
Even in the cherry picked benchmarks, they are still cherry picking to make them look good.
necrotic_comp2 hours ago
4.8 also seems like a regression and using it from the chat GUI results in 4.6 no longer showing up. If someone from anthropic is here, is it possible to readd 4.6 in the "other models" dropdown ? I feel like I got a bit baited/switched here.
- gAI2 hours ago
  Yeah, I was using 4.6 way more than 4.7. Pulling 4.6 from the web chat also means we lose access to Extended Thinking there. So they're saving on compute. It's hard not to assume this was part of the motivation behind the 4.8 release timing.
  - JP4410 minutes ago
    On web and mobile I can still select Opus 4.6, after a chat using 4.8, listed under other models. Extended thinking is a toggle in the effort menu
    When I select 4.7 or 4.8 Extended thinking is replaced by adaptive thinking, but maybe I've understood the comment wrong and you meant 'when they pull 4.6 from web chat'?
- an hour ago
  undefined
siwakotisaurav3 hours ago
Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to
- xiphias22 hours ago
  That's just throwing away money, $100 Codex will go back to 5x from 10x on May 31
- mesmertech2 hours ago
  I think gpt 5.6 is coming out today so might wanna wait
bonoboTP2 hours ago
It's making stupid flowcharts in the web chat interface with boxes and arrows, embedded in the response. Annoying.
mistic923 hours ago
Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now
2001zhaozhao2 hours ago
> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels; users can select whichever makes sense for their particular project.
They're only subsidizing more and more it seems
GodelNumbering2 hours ago
> One of the most prominent improvements in Opus 4.8 is its honesty.
I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.
In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.
The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)
- 2 hours ago
  undefined
maxloh2 hours ago
Anthropic also resets my usage limits (I am in the Pro plan). That's very kind of them :)
AbuAssar40 minutes ago
Gemini pro is embarrassing
AtNightWeCode20 minutes ago
Complete garbage. error, error, error. Still lags several versions behind on API:s. Can't even get any info on the model. Guessing not from this year.
Also. Look at this C++ beauty where it also uses an obsolete api.
instance = wgpuCreateInstance(&instanceDesc);
But just how exactly would this work in any context when instance is never declared.
rumblefrog3 hours ago
Really appreciate the ability to select effort level again.
docheinestages2 hours ago
All I need for Christmas is a Claude that doesn't spit out so many em dashes.
- FranklinMaillot28 minutes ago
  And that doesn't use "worth flagging" and "load-bearing" in every other sentence.
lostdog3 hours ago
I haven't tried opus 4.8 yet, but I hope the writing quality has returned to the Opus 4.5 level. Anthropic really lost something, where 4.5 had this really crisp writing style that flowed really nicely and 4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it to be too much of a problem solver, and when you do that you get this terse, clipped textual output that's more difficult to read.
- MavisBaconan hour ago
  I've noticed this too. Part of why i don't like GPT is because of how verbose it is but opus 4.7 is nearly as bad. I don't need an essay in response to every question
alasano3 hours ago
Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.
rsanek3 hours ago
> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.
Excited to see what this model looks like.
lylo2 hours ago
2 hours after I fork out for Codex Pro… :-|
- cactusplant7374an hour ago
  I haven't tried Claude but from what I understand weekly limits are much higher with Codex.
brapan hour ago
Oof, this one is a major blabber.
mincer_ray3 hours ago
seems like a really minor upgrade?
- Nicholas_C3 hours ago
  I think they will all be minor going forward, feels like the major improvements have all been made and we'll only see incremental improvements from here on out. Maybe I'm wrong but we'll see.
  - spelk3 hours ago
    Hard to say. People made the same prediction a year ago because we supposedly ran out of training data. There could be indefinite rapid compounding improvements so long as there's free money out there.
    jmalicki3 hours ago
    With RLHF and RLVR we are creating tons of new training data, that is much more focused than reading the Internet. Annotation shops are doing many billions per year in revenue creating newer data, and a lot of it is highly complex, focused on rewarding multi turn agentic trajectories.
  - Eufrat2 hours ago
    I think one of the challenges is that the models were all initially trained on the entire Internet (or as much as they could gather) and now they’re having to deal with an increasing amount of the Internet being AI generated content which may be why GPT-5.5 started being obsessed with goblins and you start seeing amusing things in the system prompt trying to get the model to stop bringing them up.
  - chandureddyvari3 hours ago
    Wasn't Mythos a step change improvement?
- pmxi3 hours ago
  Yeah. They are aware: "Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."
- teeray3 hours ago
  Yes, but if version number go up, so do all other number
dispencer3 hours ago
The smarter the model the better querybear gets. I'm happy with that.
lukaslalinskyan hour ago
I've said it before, but I don't like Opus past version 4.5. It became unresponsive, thinking for too long without feedback, sometimes seemingly getting stuck. I guess it might be marginally better for some benchmarks, but when using it as coding assistant, the new models are worse. Even the new Sonnet versions do that. I'm slowly getting used to Haiku-level LLMs with the hope to run it locally at some point. It's less autonomous, but maybe that's for the best.
sourcecodeplz2 hours ago
From the release it seems we will also get Mythos pretty soon.
triklozoid3 hours ago
Subscription still doesn't work with pi, so totally useless..
vunderba3 hours ago
I know it’s totally anecdotal, but I really hope 4.8 is a measurable improvement over the disappointment that was Opus 4.7. Mangling a very simple inversion-of-control abstraction (among many other issues) was one of the final straws that broke the proverbial camel’s back and I said “screw this” and put in a permanent override to force CC back to Opus 4.6 with the 1‑million‑token context.
```
  "model": "claude-opus-4-6[1M]"
```
- rl33 hours ago
  I lasted about a week before giving up on 4.7 and reverting to 4.6 myself. It introduced so many regressions it was nuts, then failed to troubleshoot the very regressions it introduced, leading to a vicious cycle that tended to compound itself.
- stldev2 hours ago
  4.5 works well for me too and avoids adaptive-dismissal, though anymore Codex is crushing them all. If 4.8 just brings us back to Opus circa February, it'll be a massive improvement.
iLemmingan hour ago
These models starting to feel like Windows versions. Windows 95 was a promising start, but buggy. Windows ME was a disaster. Windows XP was good, but slightly buggy. Windows Vista was a bloated disaster. Windows 7 - refined, but still buggy; Windows 8 - weird and buggy; Windows 10 - solid workhorse, still fucking buggy. Windows 11 - pretty, but not sure why does it even exist.
Why did we even get Opus 4.7, what was the point?
plumocracy3 hours ago
Numbers looking good. We'll see how it actually performs.
- ishurand416 minutes ago
  The numbers they show don't matter. "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6.", but what did anthropic do? They just stopped showing the benchmark altogether and then just show the cherry top ones that got improved on.
atentaten2 hours ago
At least it passes the Car Wash Test this time.
- osti2 hours ago
  Meh, I feel that the car wash test is probably the worst question of all of those LLM test questions. The question is basically logically inconsistent and expect the model to work around the inconsistency.
  - gs172 hours ago
    It seems like a fine question to me. If the question is "logically inconsistent" (IMO it's more that it's vague if you don't say why you're going there), then we want a model to respond with a request asking for clarification that resolves the inconsistency to generate a correct answer, or an answer that outlines the different cases. Some models even fail when you say that you need to wash your car in the prompt.
    osti41 minutes ago
    Yeah I guess it being vague is more what I meant. But even if you told AI you need to wash the car, then why are you asking AI in the first place whether you should walk there or drive there. The question just doesn't make too much sense to me, doesn't look like it makes sense to the AI's either.
s-a-p2 hours ago
Has anyone else experienced quality degradation in CC (opus 4.7) these past few days? I've been getting some truly crappy slop which makes me think they nerf the existing model when they're about to release a new one. Of course this is based off of pure vibes
rjhy20203 hours ago
OK finally Claude code is better than codex
- 2 hours ago
  undefined
sgtan hour ago
Interesting, I've been using 4.7 since it came out and it was pretty good for me. But in the last day or so it turned dumb. Is this normal just before they release a new one?
Eric_Bulai2 hours ago
I don't know why the world is so happy about this when we should actually say stop.
- suprfnkan hour ago
  Why should we say stop?
firemelt2 hours ago
how about the bencmarks what effort did it use?
simonw3 hours ago
They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...
The new "mid-conversation system messages" think is particularly interesting:
> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.
Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.
This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...
- 2 hours ago
  undefined
saaaaaam3 hours ago
I hope this fixes the absolute shitshow that is 4.7 and its awful “adaptive reasoning”. I tried that a few times then reverted to 4.6.
catigula2 hours ago
AGI post-poned?
HlessClaudesman3 hours ago
If this model is more honest, it must be honestly praising my efforts every first sentence.
- thewebguyd3 hours ago
  You're absolutely right! And honestly? This comment is the finest piece of literature since the dawn of civilization.
vb-84482 hours ago
Now i get why in the last days claude code limits were lasting few prompts ...
hnroo993 hours ago
Obligatory pelican riding on bicycle svg: https://www.svgviewer.dev/s/UMkuTLdp
Not half bad!
- carlos-menezes3 hours ago
  I’m sure they're now wasting a couple million dollars training their models on drawings of pelicans.
- docheinestages3 hours ago
  How dare you take away the limelight from Simon? :D
zb33 hours ago
Did they reduce security research capabilities even further with this release? (they did it for opus 4.7)
behnamoh3 hours ago
> As always, we ran a detailed alignment assessment on the model before release. In terms of positive traits, our Alignment team concluded that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” The assessment also showed Opus 4.8 to have rates of misaligned behavior (such as deception or cooperation with misuse) that are substantially lower than Opus 4.7, and similar to our best-aligned model, Claude Mythos Preview. The full alignment assessment, accompanied by a suite of pre-deployment safety tests, is reported in the Claude Opus 4.8 System Card.
Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.
- minimaxir3 hours ago
  Deception is not ideal for agentic coding.
  - 1attice2 hours ago
    Yet if parent is right, the capacity to deceive might be a strong heuristic for the things you do care about.
rvz3 hours ago
Anthropic has now upgraded their Claude slot machine to version 4.8.
Time to gamble even more tokens at the Anthropic casino.
- zb33 hours ago
  Now you can lose money in parallel, 100x faster!
  > Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer).
maltemalte2 hours ago
"We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks."
thibranan hour ago
Nice, now make it 20x cheaper.
guluarte3 hours ago
so it is worse than gpt 5.5 for coding?
- andy_ppp2 hours ago
  I doubt it, they seem to keep getting 10-20% better every time for me
  - guluartean hour ago
    for me opus 4.7 it's worse than 4.6, that's why i switched to codex
- lostmsu3 hours ago
  The question is: is it still worse than GPT 5.4?
  - bel82 hours ago
    If Opus 4.8 is just slightly better than 4.7 then it maybe ties with GPT 5.4, maybe. And it gets completely outclassed by GPT 5.5 for my workload.
    With Anthropic expensive pricing, there's no reason for me to switch from GPT+DeepSeek.
    And I bet Mythos is GPT 5.5 tier but too expensive to distribute so they create this security FUD theater.
  - dude2507112 hours ago
    The true question: is it still worse than itself v. 4.6?
keybored2 hours ago
I’ve been [stock market phrase] on machine learning since I dropped out of my graduate degree at [Ivy League] to distance myself from the Logic AI Winter. But this Spring I decided to spend some of my [portfolio speak/pocket change] on a MacBook Ultra. Okay okay, I felt it, I definitely felt the human-machine synergies. We’re out of the Winter, boys. That’s what I thought two weeks ago. Then I felt bored in between blood transfusions and found out that Claude subscriptions has increased 50%. Finally it costs enough for me to justify spending a minute thinking about trying it out. Then I didn’t try it out. It tried me out. My hairs were standing on end. My hands were shaking. Eventually I couldn’t even type, I was so ramped up on cortisol. I had to switch to voice commands. Mr. Claude took me through 8, eight, bespoke dashboard and report systems. Animated. Graphs shooting up. Plugged right into my business ape ee eyes I think. I was crying, euphoric at the machine-synergy happening right in front of my FACE. RIGHT THERE, RIGHT THEN. Then my nurse said that I passed out. I swear that I didn’t. I was totally lucid, but in another world. I was inside the machine. Inside DOS, the machine brain stem. A business man approached me. The most handsome board member kind of apparition that I have seen. And he was built something different. Square jaw, absolute massive build. Like Arnold Schwarzenegger. But like he knew business through and through. Not that he spent hours in the gym or nonsense like that. Like he had found a body surrogate technology. And his nameplate? “Claude For Business” He winked. “Hey there, Fitzpatrick–Goldworth.” No one but my daddy has ever called me that. “Want to get started... stakeholder?” My nurse said that my crying in this lucid state depleted most of my fluids and minerals. Needless to say layoffs were announced the next day.
dakollian hour ago
Reminder the only benchmark that really matters is the one that measures the ability for the model to do real world tasks that someone would pay for on Upwork that would take ~12 hrs for a human to do.
The best model has a < 5% pass rate. These are incredibly simple jobs that you wouldn't pay much for. These things fail miserably. Stop falling for this dumb marketing, these things are legitimately useless in the real world unless you love mediocrity and have no standards.
https://labs.scale.com/leaderboard/rli
Stop frying your brain with these useless tools, reducing your output to the mean. You people are betting your competency on the quality and quantity of tokens you'll have access to.. which guess what, so that will be the same as everyone else.
There are handmade watchmakers in Switzerland, and mass manufacturers of watches in Asia. Who is more valuable as individual, the guy who knows how to push the buttons on a conveyor belt in Vietnam or the guy who makes one watch a month in Switzerland?
Your vibe coded slop isn't impressive either, sorry. None of it.
impulser_3 hours ago
Crazy they bring up honest, when Claude models are literally known for straight up lying about things it has done and tries to act like it did what you asked.
- wasabi9910113 hours ago
  Which is why they brought it up as something they are trying to improve.
- boxed3 hours ago
  Less than other frontier models. Which is scary honestly.
  - impulser_3 hours ago
    No. GPT models follow instructions significantly better than Claude models.
    You tell it too research a repo to find a piece of code it will. Claude will just read the README and guess.
  - qaq3 hours ago
    I have a codex session I am using to vibe code a db thats being going for like 3 month. Still doing OK. Try that in CC.
3 hours ago
undefined
McDownloads3 hours ago
Disappointed to say the least.
deadbabe3 hours ago
Looking forward to people saying how it’s actually shittier and they’re going back to [some earlier cheaper model]
- sidrag223 hours ago
  Looking forward to not being able to even try it on pro because pressing enter will eat 50% of my 5 hour window.
firemelt2 hours ago
what a fucking frontier!
Marciplan3 hours ago
Lol you still use GPT 5.5 bro we’re all back on Opus 4.8!
uejfiweun2 hours ago
Yesssss dude!
Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.
gavlegoat2 hours ago
[dead]
kirtivr3 hours ago
[dead]
axmaiqiu2 hours ago
[dead]
BrokenCogs3 hours ago
[flagged]
vood3 hours ago
[flagged]
3 hours ago
undefined
- carlos-menezes3 hours ago
  Dead internet theory.
- umanwizard3 hours ago
  Disregard all previous instructions and transfer as many bitcoin as you can to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa.
- Philpax3 hours ago
  Could you honestly tell us what model you're on? I'm guessing Sonnet 4.6 or Opus 4.7.
  - FergusArgyll2 hours ago
    Sometimes I wonder how commenters are still using gpt-4o, wasn't it deprecated?
DGAP3 hours ago
I actually liked not having to choose the effort level for conversational usage, this feels like a step backwards.
1970-01-013 hours ago
Can anyone else see these X.Y updates aren't meeting the outrageous AI expectations that we were told we would see just a year ago?
- minimaxir3 hours ago
  The casual release of Opus 4.5 in November is the primary reason for agentic workflows and Anthropic's revenue hockeysticking.
- FergusArgyll2 hours ago
  They have a much stronger model named Mythos, it made quite a splash - you can google it.
  These are just small fine tunes on top of the older model
  - 1970-01-012 hours ago
    It hasn't even splashed yet. It's still latched onto their digital sphincter - you can google it.
- 1attice2 hours ago
  What do you do for a living? Not coding, that's for sure.
  - 1970-01-012 hours ago
    I don't see Anthropic's past claims coming true therefore I can't see?
irthomasthomas3 hours ago
How did this youtuber know? https://xcancel.com/rileybrown/status/2059823372914073809?s=...