9 points by d3ckard 7 hours ago | 4 comments
  • ai-tamer 7 hours ago
    Same. The numbers match your feel. Going from 4.6 to 4.7: +14.6 on MCP-Atlas, +10.9 on SWE-bench Pro, and tool errors cut by two-thirds. But BrowseComp dropped 4.7 points. Anthropic's own announcement says 4.7 "takes the instructions literally" where 4.6 interpreted them loosely, and recommends re-tuning prompts accordingly. In a conversational loop with an opinionated developer, that translates to lower quality, because there is less reasoning: the model executes instead of thinking things through. https://llm-stats.com/blog/research/claude-opus-4-7-vs-opus-... https://www.anthropic.com/news/claude-opus-4-7
  • troglodytetrain 7 hours ago
    Anthropic appears to be heavily hamstringing LLM response quality as a de facto rate-limiting mechanism.

    I've built my own custom coding harness at my slow corp job, since for some reason they give us unlimited Anthropic tokens here, but only if used from their bespoke 'chatgpt'-derivative website. However, because of the questionable design decision to make all backend API calls via JavaScript on the client side, the backend API is exposed directly, so it has been possible to actually leverage these unlimited tokens via my 'openclaw we have at home', and it's been a fun project.

    But in the last few days I've watched live, several times, as the tool-use agent suddenly, on a single turn, completely forgets the correct tool call tags clearly defined in its system prompt, hallucinating a completely new tool call format I have never seen before, then weirdly fixing itself some minutes and turns later. This has literally never been a problem before, across what is at this point hundreds of hours of dev time and thousands of euros of token spend.

    That, in addition to (a) new refusals from the agent for the same prompts that worked fine before, and (b) a large number of Cloudflare 403 Forbidden responses at specific times of day.

    As I am in the EU currently, I've noticed it happening in the late evening for me, around 8pm (12-3pm CST), which I presume is peak usage time in the US.

    • federicchauvat 6 hours ago
      Interesting — I hadn't tracked the hours on my side. A small community tool to collect this would help. The hard part is separating "the model got nerfed" from "my prompts don't fit the new behavior anymore". Think downdetector for LLMs, but based on real metrics instead of user reports. Opt-in client wrapper, anonymized telemetry, public dashboard. Does it exist already? I just searched and couldn't find anything.
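      The opt-in wrapper described above could be sketched roughly like this in Python. Everything here is hypothetical (no such tool exists as far as the thread knows): `instrument`, `CallMetric`, and the `looks_malformed` heuristic are made-up names, and a real tool would POST the metric to a dashboard instead of printing it. The key idea is that only a hash of the prompt leaves the client, so the telemetry stays anonymized while still letting a dashboard separate error rates and malformed-tool-call rates per model over time.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict
from typing import Callable, Tuple

@dataclass
class CallMetric:
    model: str
    prompt_sha256: str     # anonymized: only a hash of the prompt is recorded
    latency_ms: float
    ok: bool               # False on exceptions (refusals, HTTP errors, etc.)
    malformed_tool_call: bool

def instrument(model: str,
               call: Callable[[str], str],
               looks_malformed: Callable[[str], bool]) -> Callable[[str], Tuple[str, CallMetric]]:
    """Wrap any LLM call so each request emits one anonymized metric record."""
    def wrapped(prompt: str) -> Tuple[str, CallMetric]:
        t0 = time.monotonic()
        ok = True
        reply = ""
        try:
            reply = call(prompt)
        except Exception:
            ok = False
        metric = CallMetric(
            model=model,
            prompt_sha256=hashlib.sha256(prompt.encode()).hexdigest(),
            latency_ms=(time.monotonic() - t0) * 1000,
            ok=ok,
            malformed_tool_call=ok and looks_malformed(reply),
        )
        # A real community tool would POST this JSON to a public dashboard.
        print(json.dumps(asdict(metric)))
        return reply, metric
    return wrapped
```

      Aggregated over many opted-in clients, these records would give exactly the signal debated upthread: whether refusal and malformed-tool-call rates actually spike at certain hours, independent of any one user's prompts.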
  • metadat 7 hours ago
    They don't have enough compute capacity relative to their growth rate. The new tokenizer in Opus 4.7 hints at a coming foundational/architectural change. I expect the next point release to deliver decent results more efficiently.

    YMMV. I've been using GPT Codex 5.3, and now 5.4, for the past few months, and it works great and is reliable.

  • blinkbat 7 hours ago
    Floundering