232 points by meetpateltech 3 hours ago | 28 comments
  • simonw 2 hours ago
    Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...

    Solid bird, not a great bicycle frame.

    • btown 2 hours ago
      Thank you for continuing to maintain the only benchmarking system that matters!

      Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/

      • gabiruh 30 minutes ago
        It's interesting how some features, such as green grass, a blue sky, clouds, and the sun, are ubiquitous among all of these models' responses.
    • pwython 2 hours ago
      How many pelican riding bicycle SVGs were there before this test existed? What if the training data is being polluted with all these wonky results...
      • nerdsniper 32 minutes ago
        You're correct. It's not as useful as it (ever?) was as a measure of performance...but it's fun and brings me joy.
    • _joel 2 hours ago
      Now this is the test that matters, cheers Simon.
  • Aurornis 3 hours ago
    The benchmarks are impressive, but it's comparing to last generation models (Opus 4.5 and GPT-5.2). The competitor models are new, but they would have easily had enough time to re-run the benchmarks and update the press release by now.

    Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

    • InsideOutSanta 2 hours ago
      > it's comparing to last generation models (Opus 4.5 and GPT-5.2).

      If it's anywhere close to those models, I couldn't possibly be happier. Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.

      • Aurornis 2 hours ago
        > Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.

        Before you get too excited, GLM-4.7 outperformed Opus 4.5 on some benchmarks too - https://www.cerebras.ai/blog/glm-4-7 See the LiveCodeBench comparison

        The benchmarks of the open weights models are always more impressive than the performance. Everyone is competing for attention and market share so the incentives to benchmaxx are out of control.

        • InsideOutSanta 2 hours ago
          Sure. My sole point is that calling Opus 4.5 and GPT-5.2 "last generation models" is discounting how good they are. In fact, in my experience, Opus 4.6 isn't much of an improvement over 4.5 for agentic coding.

          I'm not immediately discounting Z.ai's claims because they showed with GLM-4.7 that they can do quite a lot with very little. And Kimi K2.5 is genuinely a great model, so it's possible for Chinese open-weight models to compete with proprietary high-end American models.

          • GorbachevyChase 38 minutes ago
            From a user perspective, I would consider Opus 4.6 somewhat of a regression. You can exhaust the five-hour limit in less than half an hour, and I used up the weekly limit in just two days. The outputs did not feel significantly better than Opus 4.5, and that only feels smarter than Sonnet by degrees. This is running a single session on a Pro plan. I don’t get paid to program, so API costs matter to me. The experience was irritating enough to make me start looking for an alternative, and maybe GLM is the way to go for hobby users.
          • Aurornis 2 hours ago
            I think there are two types of people in these conversations:

            Those of us who just want to get work done don't care about comparisons to old models, we just want to know what's good right now. Issuing a press release comparing to old models when they had enough time to re-run the benchmarks and update the imagery is a calculated move where they hope readers won't notice.

            There's another type of discussion where some just want to talk about how impressive it is that a model came close to some other model. I think that's interesting, too, but less so when the models are so big that I can't run them locally anyway. It's useful for making purchasing decisions for someone trying to keep token costs as low as possible, but for actual coding work I've never found it useful to use anything other than the best available hosted models at the time.

            • ffsm8 2 hours ago
              For the record, Opus 4.6 was released less than a week ago.

              That you think corporations are anything close to quick enough to update their communications on public releases like this only shows that you've never worked in corporate.

        • miroljub 29 minutes ago
          Yeah, I'm sure closed source model vendors are doing everything within their power to dumb down benchmarks, so they can look like underdogs and play a pity game against open weight models.

          Let's have a serious discussion. Just because Claude's PR department coined the term benchmaxxing, we should not be using it unless they shell out some serious money.

    • dongobread 37 minutes ago
      What a strangely hostile statement on an open weight model. Running like 20 benchmark evaluations isn't trivial by itself, and even updating visuals and press statements can take a few days at a tech company. It's literally been 5 days since this "new generation" of models released. GPT-5.3(-codex) can't even be called via API, so it's impossible to test for some benchmarks.

      I notice the people who endlessly praise closed-source models never actually USE open weight models, or assume their drop-in prompting methods and workflow will just work for other model families. Especially true for SWEs who used Claude Code first and now think every other model is horrible because they're ONLY used to prompting Claude. It's quite scary to see how people develop this level of worship for a proprietary product that is openly distrusting of users. I am not saying this is true or not of the parent poster, but something I notice in general.

      As someone who uses GLM-4.7 a good bit, it's easily at Sonnet 4.5 tier - have not tried GLM-5 but it would be surprising if it wasn't at Opus 4.5 level given the massive parameter increase.

    • throwup238 2 hours ago
      > Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

      Agreed. I think the problem is that while they can innovate at algorithms and training efficiency, the human part of RLHF just doesn't scale and they can't afford the massive amount of custom data created and purchased by the frontier labs.

      IIRC it was the application of RLHF which solved a lot of the broken syntax generated by LLMs like unbalanced braces and I still see lots of these little problems in every open source model I try. I don't think I've seen broken syntax from the frontier models in over a year from Codex or Claude.

      • algorithm314 2 hours ago
        Can't they just run the output through a compiler to get feedback? Syntax errors seem easier to get right.
        • NitpickLawyer 2 hours ago
          The difference is in scaling. The top US labs have orders of magnitude more compute available than Chinese labs. The difference in general tasks is obvious once you use them. A year ago it was said that open models were ~6 months behind SotA, but with the new RL paradigm, I'd say the gap is growing. With less compute they have to focus on narrow tasks and resort to poor man's distillation, and that leads to models that show benchmaxxing behavior.

          That being said, this model is MIT licensed, so it's a net benefit regardless of being benchmaxxed or not.

        • rockinghigh 2 hours ago
          They do. Pretty much all agentic models call linting, compiling and testing tools as part of their flow.
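
          A minimal sketch of the kind of syntax feedback discussed above, assuming the generated code is Python and using the standard library's `ast.parse` as a stand-in for "the compiler". The function name `syntax_feedback` and its return shape are hypothetical, not any lab's or agent's actual pipeline.

```python
import ast


def syntax_feedback(source: str) -> tuple[bool, str]:
    """Return (ok, message) for a candidate snippet of Python source.

    Hypothetical helper: an agent loop or RL environment could feed the
    message back to the model instead of accepting broken output.
    """
    try:
        ast.parse(source)
    except SyntaxError as err:
        # lineno and msg are standard attributes of SyntaxError
        return False, f"line {err.lineno}: {err.msg}"
    return True, "ok"


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b:\n    return a + b\n"  # unbalanced parenthesis

print(syntax_feedback(good))
print(syntax_feedback(bad))
```

          This only catches syntax errors; linting, type checks and tests would be separate steps in the same loop.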
      • ej88 2 hours ago
        the new meta is purchasing rl environments where models can be self-corrected (e.g. a compiler will error) after sft + rlhf ran into diminishing returns. although there's still lots of demand for "real world" data for actually economically valuable tasks
    • dust42 2 hours ago
      They are all just token generators without any intelligence. There is so little difference nowadays that I think in a blind test nobody will be able to differentiate the models - whether open source or closed source.

      Today's meme was this question: "The car wash is only 50 meters from my house. I want to get my car washed, should I drive there or walk?"

      Here is Claude's answer just right now:

      "Walk! At only 50 meters (about 150 feet), it would take you less than a minute to walk there. Driving such a short distance would actually be less convenient - by the time you get in the car, start it, drive, and park, you could already be there on foot. Plus, you'd save fuel and your car would be right there waiting for you after the wash is done."

      Here is ChatGPT, also right now:

      "Since it’s only *50 meters* (about 160 feet), walking makes a lot more sense.

      Here’s why:

      * *It’s extremely close* - you’ll get there in under a minute.
      * *Driving would actually be less efficient* - by the time you start the car, move it, and park, you’re already there.
      * *Cold starts aren’t great for your engine* - very short drives don’t let the engine warm up properly.
      * *It saves fuel and emissions*, even if it’s a tiny amount.
      * You avoid the hassle of maneuvering such a short distance.

      The only time driving might make sense is if:

      * The weather is terrible (heavy rain, snow, extreme cold).
      * You have mobility issues.
      * The car wash requires you to stay in the vehicle and you don’t want to leave it there.

      Otherwise, this is a perfect “walk it over” situation."

      Please save us all that talk about frontier and SOTA and that only the closedAI models are any good and the others are all so bad and benchmaxxed. For most purposes a Toyota is just as good as a BMW or a Merc or whatever luxury brand tickles your fancy. Even worse, the latest 80B Qwen Next is not far from Opus 4.6 but runs on my laptop.

      • uncharted9 4 minutes ago
        I tried this prompt with all the major AI models that I know and have installed, and only GLM from Z.ai and Gemini 3 Flash could answer it. Even then, Gemini 3 Flash gave a slightly more nuanced answer, but every other model, like Claude Opus 4.5, ChatGPT, and Grok, failed.
      • cherryteastain 2 hours ago
        Gemini 3 Flash ("Fast" in the web app) did not have trouble with this:

        Since you presumably want the car to be clean, the answer depends entirely on how the car is getting washed.

        The Breakdown

        If it’s a Professional Car Wash: You’ll need to drive. Bringing the car to the wash is generally the prerequisite for them washing it. Plus, walking 50 meters just to ask them to come pick up a car that is visible from their front door might result in some very confused looks from the staff.

        If it’s a Self-Service Pressure Washer: Again, drive. Dragging a 1,500kg vehicle 50 meters by hand is an incredible workout, but probably not the "quick chore" you had in mind.

        If you are hiring a Mobile Detailer: Walk. Stroll over there, tell them where you're parked, and walk back home to watch TV while they do the work in your driveway.

      • j-bos an hour ago
        GLM 4.7: "You should drive.

        Since the goal is to get your car washed, the car needs to be at the car wash. If you walk, you will arrive at the car wash, but your car will still be sitting at home"

        Are you sure that question is from this year?

      • Hammershaft 27 minutes ago
        Claude 4.6 got it first try:

        "You’ll want to drive — you need the car at the car wash for them to wash it!

        Even though it’s just 50 meters, the car has to be there. Enjoy the fresh clean car on the short drive back!"

      • haute_cuisine 2 hours ago
        Doesn't seem to be the case, gpt 5.2 thinking replies: To get the car washed, the car has to be at the car wash — so unless you’re planning to push it like a shopping cart, you’ll need to drive it those 50 meters.
      • bonoboTP 2 hours ago
        It's unclear where the car is currently from your phrasing. If you add that the car is in your garage, it says you'll need to drive to get the car into the wash.
      • Scene_Cast2 2 hours ago
        I just ran this with Gemini 3 Pro, Opus 4.6, and Grok 4 (the models I personally find the smartest for my work). All three answered correctly.
        • miroljub 21 minutes ago
          They had plenty of time to update their system prompts so they won't be embarrassed.

          I've noticed that whenever such a meme comes out, you can reproduce it yourself if you check immediately, but after a few hours it's already updated.

      • king_phil 2 hours ago
        Gemini 3 Pro:

        This is a classic logistical puzzle!

        Unless you have a very unique way of carrying your vehicle, you should definitely drive.

        If you walk there, you'll arrive at the car wash, but your car will still be dirty back at your house. You need to take the car with you to get it washed.

        Would you like me to check the weather forecast for $mytown to see if it's a good day for a car wash?

        • interloxia an hour ago
          For me, various forms of Gemini respond with "Unless you are planning on carrying the car there" which I find to be just sassy enough to be amusing.
      • Aurornis 2 hours ago
        If you're asking simple riddles, you shouldn't be paying for SOTA frontier models with long context.

        This is a silly test for the big coding models.

        This is like saying "all calculators are the same, nobody needs a TI-89!" and then adding 1+2 on a pocket calculator to prove your point.

        • grey-area 2 hours ago
          No it’s like having a calculator which is unable to perform simple arithmetic, but lots of people think it is amazing and sentient and want to talk about that instead of why it can’t add 2 + 2.
      • esafak 2 hours ago
        This is a great comment. The frontier labs ought to be embarrassed they can't get this right. A human would not make the same mistake, after thinking for a second.
      • smokel 2 hours ago
        > They are all just token generators without any intelligence.

        Maybe, but "intelligence" doesn't have a clear, agreed definition. And calling them "just token generators" skips over how complex that generation actually is.

      • rockinghigh 2 hours ago
        How is this riddle relevant to a coding model?
      • zozbot234 2 hours ago
        1. Analyze the User's Request:

        Distance: 50 meters (very short).

        Goal: Get the car washed.

        Decision: Drive or walk?

        2. Evaluate the Options:

        Option A: Walk

        Pros: Healthy, environmentally friendly, no need to find parking (though it's close), easy to chat with neighbors.

        Cons: The car is left at the car wash. The user can't "walk" the car to the car wash. The car must be moved to the car wash to be washed.

        Option B: Drive

        Pros: Moves the car to the location where the service is performed. It's the only logical way to get the car washed.

        Cons: Driving 50 meters is technically bad for a cold engine (short trips), uses a negligible amount of fuel, might seem lazy.

        3. Identify the Logical Flaw/Humor:

        The user is asking if they should walk to get their car washed.

        Unless the car wash involves someone coming to their house (mobile wash) or carrying the car, the car must be at the car wash.

        Walking to the car wash leaves the car at home. Walking back from the car wash leaves the car at the car wash.

        Therefore, the user must drive the car to the car wash.

        4. Formulate the Answer:

        Direct Answer: You should drive.

        Reasoning: The car needs to be at the car wash to be washed. You cannot walk the car there.

        Nuance/Context: 50 meters is a very short distance. Driving a car such a short distance is generally not great for the engine (doesn't allow oil to circulate/warm up properly), but it is necessary in this specific case.

        Alternative (if applicable): Is there a mobile wash option? If so, walking isn't even needed. But assuming a standard car wash bay: Drive.

        5. Drafting the response (Internal Monologue/Trial Phrases):

        Draft 1: You have to drive. You can't wash the car if it's at your house and you are at the car wash.

        Draft 2: You should drive. Since the car needs to be washed, it needs to be at the car wash. Walking there doesn't help get the car clean.

        Refining for clarity and wit: Point out the obvious - the car is the object being washed. It needs to move.

        Final Polish (similar to the provided good output):

        State clearly: You should drive.

        Explain why: The car needs to be there.

        Add a practical note: Driving 50 meters is a bit silly for engine health, but unavoidable here.

        Consider the scenario of "walking back": Maybe the user meant "walk back"? No, the prompt says "drive there or walk there". The intent is transportation to the location.

        6. Constructing the final output:

        "You should definitely drive."

        "Why? Because the car wash cleans the car, so the car needs to be at the car wash. If you walk, the car stays at home."

        "Additional thought: Driving 50 meters is technically a 'cold start' short trip which isn't ideal for the engine, but it's the only way to get the service done!"

        Final Output Generation (matching the logic above).

        You should drive.

        Here is why:

        1. The car needs to be there: You are getting the car washed, so the car must be at the car wash. You cannot walk the car there by yourself.

        2. Efficiency: Driving 50 meters takes about 10 seconds. Walking 50 meters takes about a minute, but you would be leaving the car behind.

        The only exception is if you are planning to get a mobile car wash (where they come to you) or if you are dropping the car off and walking back home while they clean it. But to simply "get there," you must drive.

        (The above output is from GLM-5)

    • ionelaipatioaei 2 hours ago
      I think the only advantage that closed models have is the tooling around them (Claude Code and Codex). At this point, if forced, I could totally live with open models only.
      • evv 2 hours ago
        The tooling is totally replicated in open source. OpenCode and Letta are two notable examples, but there are surely more. I'm hacking on one in the evenings.

        OpenCode in particular has huge community support around it, possibly more than Claude Code.

        • ionelaipatioaei an hour ago
          I know, I use OpenCode daily but it still feels like it's missing something - Codex in my opinion is way better at coding, but I honestly feel like that's because OpenAI controls both the model and the harness so they're able to fine tune everything to work together much better.
        • Daviey 2 hours ago
          It's there now, `opencode models --refresh`
      • quikoa 2 hours ago
        If tooling really is an advantage why isn't it possible to use the API with a subscription and save money?
        • ionelaipatioaei an hour ago
          In my opinion it is because if you control both the model and the harness then you're able to tune everything to work together much better.
    • cmrdporcupine 3 hours ago
      I tried GLM 5 by API earlier this morning and was impressed.

      Particularly for tool use.

    • yieldcrv 2 hours ago
      come on guys, you were using Opus 4.5 literally a week ago and don't even like 4.6

      something that is at parity with Opus 4.5 can ship everything you did in the last 8 weeks, ya know... when 4.5 came out

      just remember to put all of this in perspective, most of the engineers and people here haven't even noticed any of this stuff and if they have are too stubborn or policy constrained to use it - and the open source nature of the GLM series helps the policy constrained organizations since they can theoretically run it internally or on prem.

      • Aurornis 2 hours ago
        > something that is at parity with Opus 4.5

        You're assuming the conclusion

        The previous GLM-4.7 was also supposed to be better than Sonnet and even match or beat Opus 4.5 in some benchmarks ( https://www.cerebras.ai/blog/glm-4-7 ) but in real world use it didn't perform at that level.

        You can't read the benchmarks alone any more.

  • pcwelder 3 hours ago
    It's live on OpenRouter now.

    In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.

    To those who are curious, the benchmark is just the ability of a model to follow a custom tool calling format. I ask it to do coding tasks using chat.md [1] + MCPs. And so far it's just not able to follow it at all.

    [1] https://github.com/rusiaaman/chat.md

    • manofmanysmiles 2 hours ago
      I love the idea of chat.md.

      I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.

      I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.

    • data-ottawa 2 hours ago
      Custom tool calling formats are iffy in my experience. The models are all reinforcement learned to follow specific ones, so it’s always a battle and feels to me like using the tool wrong.

      Have you had good results with the other frontier models?

    • nolist_policy 2 hours ago
      Could also be the provider that is bad. Happens way too often on OpenRouter.
      • pcwelder 2 hours ago
        I had added z-ai in allow list explicitly and verified that it's the one being used.
    • sergiotapia 2 hours ago
      Be careful with OpenRouter. They routinely host quantized versions of models via their listed providers and the models just suck because of that. Use the original providers only.
      • nullbyte 4 minutes ago
        I specifically do not use the CN/SG-based original provider simply because I don't want my personal data traveling across the Pacific. I try to stay on US providers only. OpenRouter shows you the quantization each provider uses, so you can choose a domestic one that's FP8 if you want.
  • justinparus 3 hours ago
    Been using GLM-4.7 for a couple weeks now. Anecdotally, it’s comparable to Sonnet, but requires a little bit more instruction and clarity to get things right. For bigger complex changes I still use Anthropic’s family, but for very concise and well-defined smaller tasks the price of GLM-4.7 is hard to beat.
    • rapind 36 minutes ago
      Anecdotal, but I've been locked to Sonnet for the past 6-8 months just because they always seem to introduce throttling bugs with Opus where it starts to devour tokens or falls over. Very interested once open models close the gap to about 6 months.
    • monooso an hour ago
      This aligns very closely with my experience.

      When left to its own devices, GLM-4.7 frequently tries to build the world. It's also less capable at figuring out stumbling blocks on its own without spiralling.

      For small, well-defined tasks, it's broadly comparable to Sonnet.

      Given how incredibly cheap it is, it's useful even as a secondary model.

  • cherryteastain 2 hours ago
    What is truly amazing here is the fact that they trained this entirely on Huawei Ascend chips per reporting [1]. Hence we can conclude the Chinese semiconductor-to-model tech stack is only 3 months behind the US, considering Opus 4.5 was released in November (excluding the lithography equipment, as SMIC still uses older ASML DUV machines). This is huge, especially since just a few months ago it was reported that DeepSeek was not using Huawei chips due to technical issues [2].

    US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips as a direct result of past sanctions [3]. At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.

    [1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...

    [2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch...

    [3] https://www.reuters.com/world/china/chinas-customs-agents-to...

    • mark_l_watson an hour ago
      US Treasury Secretary Bessent just publicly said that the US needs to get along and cooperate with China. His tone was so different from the past year that I listened to the video clip twice.

      Obviously for the average US taxpayer getting along with China is in our interests - not so much our economic elites.

      I use both Chinese and US models, and Mistral in Proton’s private chat. I think it makes sense for us to be flexible and not get locked in.

    • re-thc 2 hours ago
      > What is truly amazing here is the fact that they trained this entirely on Huawei Ascend chips

      Has any of these outfits ever publicly stated they used Nvidia chips? As in the non-officially obtained 1s. No.

      > US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips

      Sort of. It's all a front. On both sides. China still ALWAYS had access to Nvidia chips - whether that's the "smuggled" 1s or they run it in another country. It's not costing Nvidia much. The opening of China sales for Nvidia likewise isn't as much of a boon. It's already included.

      > At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China

      Again, it's a front. It's about news and headlines. Just like when China banned lobsters from a certain country, the only thing that happened was that they went to Hong Kong or elsewhere, got rebadged and still went in.

      • cherryteastain an hour ago
        > Has any of these outfits ever publicly stated they used Nvidia chips? As in the non-officially obtained 1s. No.

        Uh yes? Deepseek explicitly said they used H800s [1]. Those were not banned btw, at the time. Then US banned them too. Then US was like 'uhh okay maybe you can have the H200', but then China said not interested.

        [1] https://arxiv.org/pdf/2412.19437

        • re-thc an hour ago
          > Uh yes? Deepseek explicitly said they used H800s [1]. Those were not banned btw, at the time

          Then they haven't. I said the non-officially obtained 1s that they can't / won't mention i.e. those Blackwells etc...

    • seydor 2 hours ago
      We can conclude that they'll flood the world with Huawei inference chips from Temu and create worldwide AI pollution.
  • woeirua 3 hours ago
    It might be impressive on benchmarks, but there's just no way for them to break through the noise from the frontier models. At these prices they're just hemorrhaging money. I can't see a path forward for the smaller companies in this space.
    • lukev 31 minutes ago
      I expect that the reason for their existence is political rather than financial (though I have no idea how that's structured.)

      It's a big deal that open-source capability is less than a year behind frontier models.

      And I'm very, very glad it is. A world in which LLM technology is exclusive and proprietary to three companies from the same country is not a good world.

    • syntaxing 2 hours ago
      Tim Dettmers had an interesting take on this [1]. Fundamentally, the philosophy is different.

      >China’s philosophy is different. They believe model capabilities do not matter as much as application. What matters is how you use AI.

      [1] https://timdettmers.com/2025/12/10/why-agi-will-not-happen/

      • woeirua 2 hours ago
        Sorry, but that's an exceptionally unimpressive article. The crux of his thesis is:

        >The main flaw is that this idea treats intelligence as purely abstract and not grounded in physical reality. To improve any system, you need resources. And even if a superintelligence uses these resources more effectively than humans to improve itself, it is still bound by the scaling of improvements I mentioned before — linear improvements need exponential resources. Diminishing returns can be avoided by switching to more independent problems – like adding one-off features to GPUs – but these quickly hit their own diminishing returns.

        Literally everyone already knows the problems with scaling compute and data. This is not a deep insight. His assertion that we can't keep scaling GPUs is apparently not being taken seriously by _anyone_ else.

        • syntaxing 2 hours ago
          I was mentioning the article more for the economic aspect of China vs. the US in terms of AI.

          While I do understand your sentiment, it might be worth noting that the author is also the author of bitsandbytes, which is one of the first libraries with quantization methods built in and was(?) one of the most used inference engines. I’m pretty sure transformers from HF still uses it as the Python-to-CUDA framework.

        • qprofyeh 2 hours ago
          There are startups in this space getting funded as we speak: https://olix.com/blog/compute-manifesto
      • re-thc 2 hours ago
        When you have export restrictions what do you expect them to say?

        > They believe model capabilities do not matter as much as application.

        Tell me their tone when their hardware can match up.

        It doesn't matter because they can't make it matter (yet).

    • riku_iki 2 hours ago
      Maybe being in China gives them an advantage in electricity cost, which could be a big chunk of the bill.
  • esafak 3 hours ago
    I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks led me to believe (esp. with regard to instruction following), but I'm willing to give it another try.
  • mnicky an hour ago
    What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1].

    We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.

    I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?

    How do you read this?

    [1] https://imgur.com/a/EwW9H6q

    • phamilton an hour ago
      Intelligence per token doesn't seem quite right to me.

      Intelligence per <consumable> feels closer. Per dollar, or per second, or per watt.

      • mnicky 18 minutes ago
        It is possible to think of tokens as some proxy for thinking space. At least reasoning tokens work like this.

        Dollar/watt are not public and time has confounders like hardware.

  • pu_pe 3 hours ago
    Really impressive benchmarks. It was commonly stated that open source models were lagging 6 months behind state of the art, but they are likely even closer now.
  • jnd0 3 hours ago
    • cmrdporcupine 3 hours ago
      yes, plenty of good convo over there, the two should probably be merged
  • unltdpower 32 minutes ago
    I predict a new speculative market will emerge where adherents buy and sell misween coded companies.

    Betting on whether they can actually perform their sold behaviors.

    Passing around code repositories for years without ever trying to run them, factory sealed.

  • algorithm3143 hours ago
    Here is the pricing per M tokens. https://docs.z.ai/guides/overview/pricing

    Why is GLM 5 more expensive than GLM 4.7 even when using sparse attention?

    There is also a GLM 5-code model.

    • logicprog3 hours ago
      I think it's likely more expensive because they have more activated parameters, which kind of outweighs the benefits of DSA?
    • l5870uoo9y3 hours ago
      It's roughly three times cheaper than GPT-5.2-codex, which in turn reflects the difference in energy cost between US and China.
      • anthonypasq2 hours ago
        1. Electricity is at most 25% of inference cost, so even if electricity is 3x cheaper in China, that would only be about a 17% reduction in total cost (25% × 2/3).

        2. Cost is only one input into pricing, and we have no idea what the margins on inference are, so assuming current prices track costs is suspect.
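        The arithmetic in point 1 can be sketched as below. This is only an illustration: the 25% electricity share and the 3x price ratio are the figures assumed in the comment above, not measured values, and the function name is made up for this example.

```python
def total_cost_reduction(electricity_share: float, price_ratio: float) -> float:
    """Fraction by which total cost falls when only electricity gets cheaper.

    electricity_share: electricity as a fraction of total cost (0..1), assumed.
    price_ratio: how many times cheaper electricity is (3.0 = one third the price).
    """
    if not 0.0 <= electricity_share <= 1.0:
        raise ValueError("electricity_share must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    # Only the electricity slice shrinks; everything else stays the same.
    return electricity_share * (1.0 - 1.0 / price_ratio)


# Figures taken from the comment: 25% share, electricity 3x cheaper.
reduction = total_cost_reduction(0.25, 3.0)
print(f"{reduction:.1%}")  # 16.7%

# Upper bound: even free electricity cannot save more than its share.
upper_bound = total_cost_reduction(0.25, float("inf"))
print(f"{upper_bound:.1%}")  # 25.0%
```

        Under these assumptions the saving is capped at 25% of total cost, so a price that is roughly three times lower cannot be explained by electricity alone.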

      • re-thc2 hours ago
        It reflects the Nvidia tax overhead too.
  • nullbyte2 hours ago
    GLM 5 beats Kimi on SWE bench and Terminal bench. If it's anywhere near Kimi in price, this looks great.

    Edit: Input tokens are twice as expensive. That might be a deal breaker.

    • bradfaan hour ago
      GLM-5 at FP8 should be similar in hardware demands to Kimi-K2.5 (natively INT4) I think. API pricing on launch day may or may not really indicate longer term cost trends. Even Kimi-K2.5 is very new. Give it a whirl and a couple weeks to settle out to have a more fair comparison.
    • westernzevon2 hours ago
      It seems to be much better at first pass tho. We'll see how real costs stack up
    • dingnuts2 hours ago
      [dead]
  • beAroundHere3 hours ago
    I'd say that they're super confident about the GLM-5 release, since they're directly comparing it with Opus 4.5 and don't mention Sonnet 4.5 at all.

    I'm still waiting to see whether they'll launch a GLM-5 Air series, which would run on consumer hardware.

    • revolvingthrow3 hours ago
      Qwen and GLM both promise the stars in the sky every single release and the results are always firmly in the "whatever" range
    • esafak3 hours ago
      I place GLM 4.7 behind Sonnet.
  • tgtweakan hour ago
    Why are we not comparing to Opus 4.6 and GPT-5.3 Codex?

    Honestly, these companies are so hard to take seriously with these release details. If it's an open source model and you're only comparing to open source, cool.

    If you're not top in your segment, maybe show how your token cost and output speed more than make up for that.

    Purposely showing prior-gen models in your release comparison immediately discredits you in my eyes.

    • rolymathan hour ago
      I feel like you're overreacting.

      They're comparing against 5.2 xhigh, which is arguably better than 5.3. The latest from OpenAI isn't smarter; it's slightly dumber, just much faster.

  • mohas2 hours ago
    I kinda feel this benchmarking thing with Chinese models is like university Olympiads: they specifically study for those, but when the time comes for real-world work they seriously lag behind.
    • OsrsNeedsf2P2 hours ago
      I kinda feel like the goalposts are shifting. While we're not there yet, in a world where Chinese models surpass Western ones, HN will be nitpicking edge cases long after the ship sails
      • Oras2 hours ago
        I don’t think it’s undermining the effort and improvement, but the usability of these models usually isn’t what their benchmarks suggest.

        Last time there was hype about a GLM coding model, I tested it on some coding tasks and it wasn’t usable compared with Sonnet or GPT-5.

        I hope this one is different

  • goldenarm2 hours ago
    If you're tired of cross-referencing the cherry-picked benchmarks, here's the geometric mean of SWE-bench Verified & HLE-tools :

    Claude Opus 4.6: 65.5%

    GLM-5: 62.6%

    GPT-5.2: 60.3%

    Gemini 3 Pro: 59.1%
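    For anyone who wants to reproduce this kind of aggregate, here is a minimal sketch of a two-benchmark geometric mean. The input scores below are placeholders chosen for easy arithmetic, not the actual SWE-bench Verified or HLE-tools numbers behind the list above.

```python
import math


def geometric_mean(scores: list[float]) -> float:
    """Geometric mean of positive scores (e.g. percentages)."""
    if not scores:
        raise ValueError("need at least one score")
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be positive")
    # n-th root of the product of n scores.
    return math.prod(scores) ** (1.0 / len(scores))


# Placeholder scores, NOT real benchmark results.
placeholder_scores = [80.0, 45.0]
combined = geometric_mean(placeholder_scores)
arithmetic = sum(placeholder_scores) / len(placeholder_scores)

print(f"geometric:  {combined:.1f}")    # 60.0
print(f"arithmetic: {arithmetic:.1f}")  # 62.5
```

    The geometric mean sits below the arithmetic mean whenever the two scores differ, so a model that is strong on one benchmark and weak on the other is penalized compared to a balanced one.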

  • seydoran hour ago
    I wish China would start copying Demis' biotech models soon as well.
  • meffmadd3 hours ago
    It will be tough to run on our 4x H200 node… I wish they stayed around the 350B range. MLA will reduce KV cache usage but I don’t think the reduction will be significant enough.
  • karolist3 hours ago
    The number of times in the past year that competitors' benchmarks said something was close to Claude and it was even remotely close in practice: 0
    • ionelaipatioaei2 hours ago
      I honestly feel like people are brainwashed by Anthropic propaganda when it comes to Claude. I think Codex is just way better, and Kimi 2.5 (and I think GLM 5 now) are perfectly fine as Claude replacements.
      • mark_l_watsonan hour ago
        So much money is on the line for US hyperscalers that they probably pay for ‘pushes’ on social media. Maybe Chinese companies are doing the same.
        • GorbachevyChase2 minutes ago
          I would say that’s more certain than just a “probably”. I would bet that the ridiculous fear mongering about language models trying to escape their servers, blackmailing their developers, or spontaneously participating in a social network is all clandestine marketing. The technology is certainly amazing and very useful, but I don’t think any of these Terminator stories were boosted by the algorithms on their own.
  • eugene33063 hours ago
    Why don't they publish ARC-AGI results? Too expensive?
    • Bolwin3 hours ago
      ARC-AGI was never a good benchmark; it tested spatial understanding more than reasoning. I'm glad it's no longer popular.
      • falcor843 hours ago
        What do you mean? It definitely tests reasoning as well, and if anything, I expect spatial and embodied reasoning to become more important in the coming years, as AI agents will be expected to take on more real world tasks.
      • eugene33063 hours ago
        Spatial or not, ARC-AGI is the only test that correlates with my impressions from my own coding requests.
  • ExpertAdvisor013 hours ago
    They increased their prices substantially
  • woah3 hours ago
    Is this a lot cheaper to run (on their service or rented GPUs) than Claude or ChatGPT?
    • esafak3 hours ago
      • leumon3 hours ago
        although apparently only the max subscription includes glm-5
        • esafak2 hours ago
          Yes, thank you for pointing that out. It's probably a load management thing.
    • su-m4tt3 hours ago
      dramatically cheaper.
  • surrTurr2 hours ago
    we're seeing so many LLM releases that they can't even keep their benchmark comparisons updated
  • dana321an hour ago
    Just tried it; it's practically the same as GLM-4.7. It isn't as "wide" as Claude or Codex, so even on a simple prompt it misses one important detail: instead of investigating fully before starting a project, it ploughs ahead with the next best thing it thinks you asked for.
  • testuser_xyz2 hours ago
    [flagged]
    • Aurornis2 hours ago
      This account is named `testuser_xyz` and was registered seconds before posting this generic LLM-style output.

      The growing number of obvious LLM bots posting on Hacker News is getting worrisome.

      • minimaxir2 hours ago
        That is what both the flagging mechanism and the green highlight for new accounts is for.
      • testuser_xyz2 hours ago
        Hi, I am just using Hacker News as a playground to test some trash bot my customer begged me to write so that he can click on ads more effectively online. Hope you have a good day though :)
        • minimaxir2 hours ago
          That is disrespectful to do in any community forum. Don't do it again.
  • petetnt3 hours ago
    Whoa, I think GPT-5.3-Codex was a disappointment, but GLM-5 is definitely the future!
    • cmrdporcupine3 hours ago
      I find 5.3 very impressive TBH. Bigger jump than Opus 4.6.

      But this is excellent value if they offer it as part of their subscription coding plan. Paying by the token could really add up: I did about 20 minutes of work and it cost me $1.50 USD, and it's more expensive than Kimi 2.5.

      Still 1/10th the cost of Opus 4.5 or Opus 4.6 when paying by the token.

    • mnicky3 hours ago
      > I think GPT-5.3-Codex was a disappointment

      Care to elaborate more?