688 pointsby HellsMaddy2 hours ago54 comments
  • simonw2 hours ago
    The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
    • stkai39 minutes ago
      Would love to find out they're overfitting for pelican drawings.
      • andy_ppp19 minutes ago
        Yes, Racoon on a unicycle? Magpie on a pedalo?
    • gcanyonan hour ago
      One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.
    • einrealistan hour ago
      They trained for it. That's the +0.1!
    • athrowaway3zan hour ago
      This benchmark inspired me to have codex/claude build a DnD battlemap tool with svg's.

      They got surprisingly far, but i did need to iterate a few times to have it build tools that would check for things like; dont put walls on roads or water.

      What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.

      As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.

    • hoeoekan hour ago
      This really is my favorite benchmark
    • copilot_king_2an hour ago
      I'm firing all of my developers this afternoon.
      • RGamma24 minutes ago
        Opus 6 will fire you instead for being too slow with the ideas.
    • eaf7e281an hour ago
      There's no way they actually work on training this.
      • KeplerBoyan hour ago
        There is no way they are not training on this.
        • an hour ago
          undefined
        • collinmandersonan hour ago
          I suspect they have generic SVG drawing that they focus on.
      • margalabargalaan hour ago
        I suspect they're training on this.

        I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.

        https://i.imgur.com/UvlEBs8.png

        • WarmWash41 minutes ago
          It would be way way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use it's wings.
        • mrandishan hour ago
          Interesting that it seems better. Maybe something about adding a highly specific yet unusual qualifier focusing attention?
    • nubg2 hours ago
      What about the Pelo2 benchmark? (the gray bird that is not gray)
    • 7777777philan hour ago
      best pelican so far would you say? Or where does it rank in the pelican benchmark?
      • mrandishan hour ago
        In other words, is it a pelican or a pelican't?
    • ares6232 hours ago
      Can it draw a different bird on a bike?
    • DetroitThrow2 hours ago
      The ears on top are a cute touch
    • behnamohan hour ago
      Can we please stop with this nonsense benchmark?
      • smokel10 minutes ago
        I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.
      • quinnjh8 minutes ago
        the field is advancing so fast it's hard to do real science as their will be a new SOTA by the time you're ready to publish results. i think this is a combination of that and people having a laugh.

        Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?

    • yukisadf12 minutes ago
      [dead]
  • AstroBena few seconds ago
    [delayed]
  • gizmodo59an hour ago
    5.3 codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes with a 77.3% in Terminal Bench. The shortest lived lead in less than 35 minutes. What a time to be alive!
    • purplerabbitan hour ago
      The lack of broad benchmark reports in this makes me curious: Has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we all try both of these out
      • MallocVoidstar11 minutes ago
        The -codex models are only for 'agentic coding', nothing else.
    • wasmainiac24 minutes ago
      Dumb question. Can these benchmarks be trusted when the model performance tends to vary depending on the hours and load on OpenAI’s servers? How do I know I’m not getting a severe penalty for chatting at the wrong time. Or even, are the models best after launch then slowly eroded away at to more economical settings after the hype wears off?
      • Corence4 minutes ago
        It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.

        However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.

      • aaaalone15 minutes ago
        At the end of the day you test it for your use cases anyway but it makes it a great initial hint if it's worth it to test out.
    • nharadaan hour ago
      That's a massive jump, I'm curious if there's a materially different feeling in how it works or if we're starting to reach the point of benchmark saturation. If the benchmark is good then 10 points should be a big improvement in capability...
    • jkelleyrtpan hour ago
      claude swe-bench is 80.8 and codex is 56.8

      Seems like 4.6 is still all-around better?

      • gizmodo59an hour ago
        Its SWE bench pro not swe bench verified. The verified benchmark has stagnated
        • joshuahedlundan hour ago
          Any ideas why verified has stagnated? It was increasing rapidly and then basically stopped.
          • Snuggly7342 minutes ago
            it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.

            swe bench pro public is newer, but its not live, so it will get slowly memorized as well. the private dataset is more interesting, as are the results there:

            https://scale.com/leaderboard/swe_bench_pro_private

  • pjot2 hours ago
    Claude Code release notes:

      > Version 2.1.32:
         • Claude Opus 4.6 is now available!
         • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting
         CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
         • Claude now automatically records and recalls memories as it works
         • Added "Summarize from here" to the message selector, allowing partial conversation summarization.
         • Skills defined in .claude/skills/ within additional directories (--add-dir) are now loaded automatically.
         • Fixed @ file completion showing incorrect relative paths when running from a subdirectory
         • Updated --resume to re-use --agent value specified in previous conversation by default.
         • Fixed: Bash tool no longer throws "Bad substitution" errors when heredocs contain JavaScript template literals like ${index + 1}, which
         previously interrupted tool execution
         • Skill character budget now scales with context window (2% of context), so users with larger context windows can see more skill descriptions
         without truncation
         • Fixed Thai/Lao spacing vowels (สระ า, ำ) not rendering correctly in the input field
         • VSCode: Fixed slash commands incorrectly being executed when pressing Enter with preceding text in the input field
         • VSCode: Added spinner when loading past conversations list
    • neuronexmachinaan hour ago
      > Claude now automatically records and recalls memories as it works

      Neat: https://code.claude.com/docs/en/memory

      I guess it's kind of like Google Antigravity's "Knowledge" artifacts?

      • om8an hour ago
        Is there a way to disable it? Sometimes I value agent not having knowledge that it needs to cut corners
        • nerdsniper16 minutes ago
          90-98% of the time I want the LLM to only have the knowledge I gave it in the prompt. I'm actually kind of scared that I'll wake up one day and the web interface for ChatGPT/Opus/Gemini will pull information from my prior chats.
      • codethiefan hour ago
        Are we sure the docs page has been updated yet? Because that page doesn't say anything about automatic recording of memories.
  • Someone12342 hours ago
    Does anyone with more insight into the AI/LLM industry happen to know if the cost to run them in normal user-workflows is falling? The reason I'm asking is because "agent teams" while a cool concept, it largely constrained by the economics of running multiple LLM agents (i.e. plans/API calls that make this practical at scale are expensive).

    A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.

    • simonw2 hours ago
      The cost per token served has been falling steadily over the past few years across basically all of the providers. OpenAI dropped the price they charged for o3 to 1/5th of what it was in June last year thanks to "engineers optimizing inferencing", and plenty of other providers have found cost savings too.

      Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

      > A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers

      Where did you hear that? It doesn't match my mental model of how this has played out.

      • cootsnuckan hour ago
        I have not see any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

        > Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

        That does not mean the frontier labs are pricing their APIs to cover their costs yet.

        It can both be true that it has gotten cheaper for them to provide inference and that they still are subsidizing inference costs.

        In fact, I'd argue that's way more likely given that has been precisely the goto strategy for highly-competitive startups for awhile now. Price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, burn through investor money until then.

        What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.

        • chis39 minutes ago
          It's quite clear that these companies do make money on each marginal token. They've said this directly and analysts agree [1]. It's less clear that the margins are high enough to pay off the up-front cost of training each model.

          [1] https://epochai.substack.com/p/can-ai-companies-become-profi...

          • 9cb14c1ec016 minutes ago
            It's also true that their inference costs are being heavily subsidized. For example, if you calculate Oracles debt into OpenAIs revenue, they would be incredibly far underwater on inference.
        • mrandish32 minutes ago
          > I have not see any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

          Anthropic planning an IPO this year is a broad meta-indicator that internally they believe they'll be able to reach break-even sometime next year on delivering a competitive model. Of course, their belief could turn out to be wrong but it doesn't make much sense to do an IPO if you don't think you're close. Assuming you have a choice with other options to raise private capital (which still seems true), it would be better to defer an IPO until you expect quarterly numbers to reach break-even or at least close to it.

          Despite the willingness of private investment to fund hugely negative AI spend, the recently growing twitchiness of public markets around AI ecosystem stocks indicates they're already worried prices have exceeded near-term value. It doesn't seem like they're in a mood to fund oceans of dotcom-like red ink for long.

          • WarmWash19 minutes ago
            IPO'ing is often what you do to give your golden investors an exit hatch to dump their shares on the notoriously idiotic and hype driven public.
        • NitpickLawyeran hour ago
          > they still are subsidizing inference costs.

          They are for sure subsidising costs on all you can prompt packages (20-100-200$ /mo). They do that for data gathering mostly, and at a smaller degree for user retention.

          > evidence at all that Anthropic or OpenAI is able to make money on inference yet.

          You can infer that from what 3rd party inference providers are charging. The largest open models atm are dsv3 (~650B params) and kimi2.5 (1.2T params). They are being served at 2-2.5-3$ /Mtok. That's sonnet / gpt-mini / gemini3-flash price range. You can make some educates guesses that they get some leeway for model size at the 10-15$/ Mtok prices for their top tier models. So if they are inside some sane model sizes, they are likely making money off of token based APIs.

        • barrkelan hour ago
          > evidence at all that Anthropic or OpenAI is able to make money on inference yet.

          The evidence is in third party inference costs for open source models.

      • nubgan hour ago
        > "engineers optimizing inferencing"

        are we sure this is not a fancy way of saying quantization?

        • embedding-shapean hour ago
          Or distilled models, or just slightly smaller models but same architecture. Lots of options, all of them conveniently fitting inside "optimizing inferencing".
        • jmalickian hour ago
          A ton of GPU kernels are hugely inefficient. Not saying the numbers are realistic, but look at the 100s of times of gain in the Anthropic performance takehome exam that floated around on here.

          And if you've worked with pytorch models a lot, having custom fused kernels can be huge. For instance, look at the kind of gains to be had when FlashAttention came out.

          This isn't just quantization, it's actually just better optimization.

          Even when it comes to quantization, Blackwell has far better quantization primitives and new floating point types that support row or layer-wise scaling that can quantize with far less quality reduction.

          There is also a ton of work in the past year on sub-quadratic attention for new models that gets rid of a huge bottleneck, but like quantization can be a tradeoff, and a lot of progress has been made there on moving the Pareto frontier as well.

          It's almost like when you're spending hundreds of billions on capex for GPUs, you can afford to hire engineers to make them perform better without just nerfing the models with more quantization.

          • Der_Einzigean hour ago
            "This isn't X, it's Y" with extra steps.
      • sumitkumaran hour ago
        It seems it is true for gemini because they have a humongous sparse model but it isn't so true for the max performance opus-4.5/6 and gpt-5.2/3.
    • Aurornisan hour ago
      > A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers

      This gets repeated everywhere but I don't think it's true.

      The company is unprofitable overall, but I don't see any reason to believe that their per-token inference costs are below the marginal cost of computing those tokens.

      It is true that the company is unprofitable overall when you account for R&D spend, compensation, training, and everything else. This is a deliberate choice that every heavily funded startup should be making, otherwise you're wasting the investment money. That's precisely what the investment money is for.

      However I don't think using their API and paying for tokens has negative value for the company. We can compare to models like DeepSeek where providers can charge a fraction of the price of OpenAI tokens and still be profitable. OpenAI's inference costs are going to be higher, but they're charging such a high premium that it's hard to believe they're losing money on each token sold. I think every token paid for moves them incrementally closer to profitability, not away from it.

      • 383629364834 minutes ago
        The reports I remember show that they're profitable per-model, but overlap R&D so that the company is negative overall. And therefore will turn a massive profit if they stop making new models.
        • trcf239 minutes ago
          Doesn’t it also depend on averaging with free users?
      • runarberg36 minutes ago
        I can see a case for omitting R&D when talking about profitability, but training makes no sense. Training is what makes the model, omitting it is like omitting the cost of running the production facility of a car manufacturer. If AI companies stop training they will stop producing models, and they will run out of a products to sell.
    • KaiserPro30 minutes ago
      Gemini-pro-preview is on ollama and requires h100 which is ~$15-30k. Google are charging $3 a million tokens. Supposedly its capable of generating between 1 and 12 million tokens an hour.

      Which is profitable. but not by much.

    • 3abiton2 hours ago
      It's not just that. Everyone is complacent with the utilization of AI agents. I have been using AI for coding for quite a while, and most of my "wasted" time is correcting its trajectory and guiding it through the thinking process. It's very fast iterations but it can easily go off track. Claude's family are pretty good at doing chained task, but still once the task becomes too big context wise, it's impossible to get back on track. Cost wise, it's cheaper than hiring skilled people, that's for sure.
      • lufenialif2an hour ago
        Cost wise, doesn’t that depend on what you could be doing besides steering agents?
    • zozbot2342 hours ago
      > i.e. plans/API calls that make this practical at scale are expensive

      Local AI's make agent workflows a whole lot more practical. Making the initial investment for a good homelab/on-prem facility will effectively become a no-brainer given the advantages on privacy and reliability, and you don't have to fear rugpulls or VC's playing the "lose money on every request" game since you know exactly how much you're paying in power costs for your overall load.

      • vbezhenar17 minutes ago
        I don't care about privacy and I didn't have much problems with reliability of AI companies. Spending ridiculous amount of money on hardware that's going to be obsolete in a few years and won't be utilized at 100% during that time is not something that many people would do, IMO. Privacy is good when it's given for free.

        I would rather spend money on some pseudo-local inference (when cloud company manages everything for me and I just can specify some open source model and pay for GPU usage).

    • Havoc2 hours ago
      Saw a comment earlier today about google seeing a big (50%+) fall in Gemini serving cost per unit across 2025 but can’t find it now. Was either here or on Reddit
    • WarmWash15 minutes ago
      These are intro prices.

      This is all straight out of the playbook. Get everyone hooked on your product by being cheap and generous.

      Raise the price to backpay what you gave away plus cover current expenses and profits.

      In no way shape or form should people think these $20/mo plans are going to be the norm. From OpenAI's marketing plan, and a general 5-10 year ROI horizon for AI investment, we should expect AI use to cost $60-80/mo per user.

    • Bombthecatan hour ago
      That's why anthropic switched to tpu, you can sell at cost.
  • blibble2 hours ago
    > We build Claude with Claude. Our engineers write code with Claude Code every day

    well that explains quite a bit

    • jsheard2 hours ago
      CC has >6000 open issues, despite their bot auto-culling them after 60 days of inactivity. It was ~5800 when I looked just a few days ago so they seem to be accelerating towards some kind of bug singularity.
      • tgtweakan hour ago
        plot twist, it's all claude code instances submitting bug reports on behalf of end users.
        • accrualan hour ago
          It's Claude, all the way down.
      • paxysan hour ago
        Half of them were probably opened yesterday during the Claude outage.
        • anematode44 minutes ago
          Nah, it was at like 5500 before.
    • raincolean hour ago
      It explains how important dogfooding is if you want to make an extremely successful product.
    • cedws31 minutes ago
      The sandboxing in CC is an absolute joke, it's no wonder there's an explosion of sandbox wrappers at the moment. There's going to be a security catastrophe at some point, no doubt about it.
    • jama2112 hours ago
      It’s extremely successful, not sure what it explains other than your biases
      • blibblean hour ago
        Microsoft's products are also extremely successful

        they're also total garbage

        • simianwordsan hour ago
          but they have the advantage of already being a big company. Anthropic is new and there's no reason for people to use it
      • acedTrex17 minutes ago
        Something being successful and something being a high quality product with good engineering are two completely different questions.
      • mvdtnzan hour ago
        Anthropic has perhaps the most embarrassing status page history I have ever seen. They are famous for downtime.

        https://status.claude.com/

        • ronsoran hour ago
          As opposed to other companies which are smart enough not to report outages.
          • tavavex6 minutes ago
            So, there are only two types of companies: ones that have constant downtime, and ones that have constant downtime but hide it, right?
        • dimglan hour ago
          And yet people still use them.
    • gjsman-10002 hours ago
      Also explains why Claude Code is a React app outputting to a Terminal. (Seriously.)
      • jama2112 hours ago
        There’s nothing wrong with that, except it lets ai skeptics feel superior
      • an hour ago
        undefined
      • sweetheartan hour ago
        React's core is agnostic when it comes to the actual rendering interface. It's just all the fancy algos for diffing and updating the underlying tree. Using it for rendering a TUI is a very reasonable application of the technology.
      • kronaan hour ago
        Sounds like a web developer defined the solution a year before they knew what the problem was.
      • CamperBob241 minutes ago
        Also explains why Claude Code is a React app outputting to a Terminal. (Seriously.)

        Who cares, and why?

        All of the major providers' CLI harnesses use Ink: https://github.com/vadimdemedes/ink

      • thehamkercat2 hours ago
        Same with opencode and gemini, it's disgusting

        Codex (by openai ironically) seems to be the fastest/most-responsive, opens instantly and is written in rust but doesn't contain that many features

        Claude opens in around 3-4 seconds

        Opencode opens in 2 seconds

        Gemini-cli is an abomination which opens in around 16 second for me right now, and in 8 seconds on a fresh install

        Codex takes 50ms for reference...

        --

        If their models are so good, why are they not rewriting their own react in cli bs to c++ or rust for 100x performance improvement (not kidding, it really is that much)

        • g947oan hour ago
          Great question, and my guess:

          If you build React in C++ and Rust, even if the framework is there, you'll likely need to write your components in C++/Rust. That is a difficult problem. There are actually libraries out there that allow you to build web UI with Rust, although they are for web (+ HTML/CSS) and not specifically CLI stuff.

          So someone needs to create such a library that is properly maintained and such. And you'll likely develop slower in Rust compared to JS.

          These companies don't see a point in doing that. So they just use whatever already exists.

        • azinman2an hour ago
          Why does it matter if Claude Code opens in 3-4 seconds if everything you do with it can take many seconds to minutes? Seems irrelevant to me.
          • RohMinan hour ago
            I guess with ~50 years of CPU advancements, 3-4 seconds for a TUI to open makes it seem like we lost the plot somewhere along the way.
            • strange_quark29 minutes ago
              Don’t forget they’ve also publicly stated (bragged?) about the monumental accomplishment of getting some text in a terminal to render at 60fps.
          • wahnfriedenan hour ago
            Because when the agent is taking many seconds to minutes, I am starting new agents instead of waiting or switching to non-agent tasks
        • shoeb00man hour ago
          codex cli is missing a bunch of ux features like resizing on terminal size change.

          Opencode's core is actually written in zig, only ui orchestration is in solidjs. It's only slightly slower to load than neo-vim on my system.

          https://github.com/anomalyco/opentui

        • wahnfriedenan hour ago
          Codex team made the right call to rewrite its TypeScript to Rust early on
      • tayo422 hours ago
        Is this a react feature or did they build something to translate react to text for display in the terminal?
        • sbarrean hour ago
          React, the framework, is separate from react-dom, the browser rendering library. Most people think of those two as one thing because they're the most popular combo.

          But there are many different rendering libraries you can use with React, including Ink, which is designed for building CLI TUIs..

        • pkkiman hour ago
          They used Ink: https://github.com/vadimdemedes/ink

          I've used it myself. It has some rough edges in terms of rendering performance but it's nice overall.

          • tayo42an hour ago
            Thats pretty interesting looking, thanks!
        • embedding-shapean hour ago
          Not a built-in React feature. The idea been around for quite some time, I came across it initially with https://github.com/vadimdemedes/ink back in 2022 sometime.
        • tayo42an hour ago
          i had claude make a snake clone and fix all the flickering in like 20 minutes with the library mentioned lol
      • CooCooCaCha2 hours ago
        It’s really not that crazy.

        React itself is a frontend-agnostic library. People primarily use it for writing websites but web support is actually a layer on top of base react and can be swapped out for whatever.

        So they’re really just using react as a way to organize their terminal UI into components. For the same reason it’s handy to organize web ui into components.

    • spruce_tipsan hour ago
      Ah yes, explains why it takes 3 seconds for a new chat to load after I click new chat in the macOS app.
    • exe34an hour ago
      Can Claude fix the flicker in Claude yet?
  • minimaxir2 hours ago
    Will Opus 4.6 via Claude Code be able to access the 1M context limit? The cost increase by going above 200k tokens is 2x input, 1.5x output, which is likely worth it especially for people with the $100/$200 plans.
    • CryptoBanker2 hours ago
      The 1M context is not available via subscription - only via API usage
      • romanovcodean hour ago
        Well this is extremely disappointing to say the least.
        • ayhanfuatan hour ago
          It says "subscription users do not have access to Opus 4.6 1M context at launch" so they are probably planning to roll it out to subscription users too.
  • dmk2 hours ago
    The benchmarks are cool and all but 1M context on an Opus-class model is the real headline here imo. Has anyone actually pushed it to the limit yet? Long context has historically been one of those "works great in the demo" situations.
    • nomel19 minutes ago
      Has a "N million context window" spec ever been meaningful? Very old, very terrible, models "supported" 1M context window, but would lose track of the conversation two small paragraphs of context into a conversation (looking at you early Gemini).
    • pants2an hour ago
      Paying $10 per request doesn't have me jumping at the opportunity to try it!
      • cedws26 minutes ago
        Makes me wonder: do employees at Anthropic get unmetered access to Claude models?
      • schappim37 minutes ago
        The only way to not go bankrupt is to use a Claude Code Max subscription…
    • awestroke39 minutes ago
      Opus 4.5 starts being lazy and stupid at around the 50% context mark in my opinion, which makes me skeptical that this 1M context mode can produce good output. But I'll probably try it out and see
  • legitsteran hour ago
    I'm still not sure I understand Anthropic's general strategy right now.

    They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.

    Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.

    Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.

    Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.

    • tgtweakan hour ago
      Claude itself (outside of code workflows) actually works very well for general purpose chat. I have a few non-technical friends that have moved over from chatgpt after some side-by-side testing and I've yet to see one go back - which is good since claude circa 8 months ago was borderline unusable for anything but coding on the api.
    • eaf7e281an hour ago
      I kinda agree. Their model just doesn't feel "daily" enough. I would use it for any "agentic" tasks and for using tools, but definitely not for day to day questions.
      • lukebechtelan hour ago
        Why? I use it for all and love it.

        That doesn't mean you have to, but I'm curious why you think it's behind in the personal assistant game.

        • legitsteran hour ago
          I have three specific use cases where I try both but ChatGPT wins:

          - Recipes and cooking: ChatGPT just has way more detailed and practical advice. It also thinks outside of the box much more, whereas Claude gets stuck in a rut and sticks very closely to your prompt. And ChatGPT's easier to understand/skim writing style really comes in useful.

          - Travel and itinerary: Again, ChatGPT can anticipate details much more, and give more unique suggestions. I am much more likely to find hidden gems or get good time-savers than Claude, which often feels like it is just rereading Yelp for you.

          - Historical research: ChatGPT wins on this by a mile. You can tell ChatGPT has been trained on actual historical texts and physical books. You can track long historical trends, pull examples and quotes, and even give you specific book or page(!) references of where to check the sources. Meanwhile, all Claude will give you is a web search on the topic.

      • solarkraftan hour ago
        But that’s what makes it so powerful (yeah, mixing model and frontend discussion here yet again). I have yet to see a non-DIY product that can so effortlessly call tens of tools by different providers to satisfy your request.
  • ayhanfuatan hour ago
    > For Opus 4.6, the 1M context window is available for API and Claude Code pay-as-you-go users. Pro, Max, Teams, and Enterprise subscription users do not have access to Opus 4.6 1M context at launch.

    I didn't see any notes but I guess this is also true for "max" effort level (https://code.claude.com/docs/en/model-config#adjust-effort-l...)? I only see low, medium and high.

  • itay-mamanan hour ago
    Impressive results, but I keep coming back to a question: are there modes of thinking that fundamentally require something other than what current LLM architectures do?

    Take critical thinking — genuinely questioning your own assumptions, noticing when a framing is wrong, deciding that the obvious approach to a problem is a dead end. Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself. These feel like they involve something beyond "predict the next token really well, with a reasoning trace."

    I'm not saying LLMs will never get there. But I wonder if getting there requires architectural or methodological changes we haven't seen yet, not just scaling what we have.

    • breuleux28 minutes ago
      > These feel like they involve something beyond "predict the next token really well, with a reasoning trace."

      I don't think there's anything you can't do by "predicting the next token really well". It's an extremely powerful and extremely general mechanism. Saying there must be "something beyond that" is a bit like saying physical atoms can't be enough to implement thought and there must be something beyond the physical. It underestimates the nearly unlimited power of the paradigm.

      Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions? What else than a sequence of these tokens would a machine have to produce in response to its environment and memory?

    • jorl17an hour ago
      When I first started coding with LLMs, I could show a bug to an LLM and it would start to bugfix it, and very quickly would fall down a path of "I've got it! This is it! No wait, the print command here isn't working because an electron beam was pointed at the computer".

      Nowadays, I have often seen LLMs (Opus 4.5) give up on their original ideas and assumptions. Sometimes I tell them what I think the problem is, and they look at it, test it out, and decide I was wrong (and I was).

      There are still times where they get stuck on an idea, but they are becoming increasingly rare.

      Therefore, think that modern LLMs clearly are already able to question their assumptions and notice when framing is wrong. In fact, they've been invaluable to me in fixing complicated bugs in minutes instead of hours because of how much they tend to question many assumptions and throw out hypotheses. They've helped _me_ question some of my assumptions.

      They're inconsistent, but they have been doing this. Even to my surprise.

      • itay-maman30 minutes ago
        agree on that and the speed is fantastic with them, and also that the dynamics of questioning the current session's assumptions has gotten way better.

        yet - given an existing codebase (even not huge) they often won't suggest "we need to restructure this part differently to solve this bug". Instead they tend to push forward.

        • jorl1721 minutes ago
          You are right, agreed.

          Having realized that, perhaps you are right that we may need a different architecture. Time will tell!

    • nomel33 minutes ago
      New idea generation? Understanding of new/sparse/not-statistically-significant concepts in the context window? I think both being the same problem: when we connect previously disparate concepts, like with a "eureka" moment, (as I experience it) a big ripple of relations form that deepens that understanding, right then. The entire concept of dynamically forming a deeper understanding from something new presented, from "playing out"/testing the ideas in your brain with little logic tests, comparisons, etc, doesn't seem to be possible. The test part does, but seems like it would require runtime fine tuning.

      In my experience, if you do present something in the context window that is sparse in the training, there's no depth to it at all, only what you tell it. And, it will always creep towards/revert to the nearest statistically significant answers, with claims of understanding and zero demonstration of that understanding.

      And, I'm talking about relatives basic engineering type problems here.

    • Davidzheng24 minutes ago
      I think the only real problem left is having it automate its own post-training on the job so it can learn to adapt its weights to the specific task at hand. Plus maybe long term stability (so it can recover from "going crazy")

      But I may easily be massively underestimating the difficulty. Though in any case I don't think it affects the timelines that much. (personal opinions obviously)

  • charcircuit2 hours ago
    From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).

    It also seems misleading to have charts that compare to Sonnet 4.5 and not Opus 4.5 (Edit: It's because Opus 4.5 doesn't have a 1M context window).

    It's also interesting they list compaction as a capability of the model. I wonder if this means they have RL trained this compaction as opposed to just being a general summarization and then restarting the agent loop.

    • eaf7e281an hour ago
      > From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).

      That's a feature. You could also not use the extra context, and the price would be the same.

      • charcircuit39 minutes ago
        The model influences how many tokens it uses for a problem. As an extreme example if it wanted it could fill up the entire context each time just to make you pay more. The efficiency that model can answer without generating a ton of tokens influences the price you will be spending on inference.
  • lukebechtel2 hours ago
    > Context compaction (beta).

    > Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.

    Not having to hand roll this would be incredible. One of the best Claude code features tbh.

  • archban hour ago
    Can set it with the API identifier on Claude Code - `/model claude-opus-4-6` when a chat session is open.
  • mFixman2 hours ago
    I found that "Agentic Search" is generally useless in most LLMs since sites with useful data tend to block AI models.

    The answer to "when is it cheaper to buy two singles rather than one return between Cambridge to London?" is available in sites such as BRFares, but no LLM can scrape it so it just makes up a generic useless answer.

    • causalmodels2 hours ago
      Is it still getting blocked when you give it a browser?
  • silverwindan hour ago
    Maybe that's why Opus 4.5 has degraded so much in the recent days (https://marginlab.ai/trackers/claude-code/).
  • apetresc2 hours ago
    Impressive that they publish and acknowledge the (tiny, but existent) drop in performance on SWE-Bench Verified between Opus 4.5 to 4.6. Obviously such a small drop in a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to be focused around managing larger context).

    But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to confound the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.

    • SubiculumCodean hour ago
      Isn't SWE-Bench Verified pretty saturated by now?
      • tedsandersan hour ago
        Depends what you mean by saturated. It's still possible to score substantially higher, but there is a steep difficulty jump that makes climbing above 80%ish pretty hard (for now). If you look under the hood, it's also a surprisingly poor eval in some respects - it only tests Python (a ton of Django) and it can suffer from pretty bad contamination problems because most models, especially the big ones, remember these repos from their training. This is why OpenAI switched to reporting SWE-Bench Pro instead of SWE-bench Verified.
  • Philpax2 hours ago
    I'm seeing it in my claude.ai model picker. Official announcement shouldn't be long now.
  • Aeroian hour ago
    ($10/$37.50 per million input/output tokens) oof
    • minimaxiran hour ago
      Only if you go above 200k, which is a) standard with other model providers and b) intuitive as compute scales with context length.
    • andrethegiantan hour ago
      only for a 1M context window, otherwise priced the same as Opus 4.5
  • data-ottawa2 hours ago
    I wonder if I’ve been in A/B test with this.

    Claude figured out zig’s ArrayList and io changes a couple weeks ago.

    It felt like it got better then very dumb again the last few days.

    • copilot_king_2an hour ago
      I love being used as a test subject against my will!
  • an hour ago
    undefined
  • EcommerceFlowan hour ago
    Anecdotal, but it 1 shot fixed a UI bug that neither Opus 4.5/Codex 5.2-high could fix.
  • nomilk2 hours ago
    Is Opus 4.6 available for Claude Code immediately?

    Curious how long it typically takes for a new model to become available in Cursor?

    • apetresc2 hours ago
      I literally came to HN to check if a thread was already up because I noticed my CC instance suddenly said "Opus 4.6".
    • world2vec2 hours ago
      `claude update` then it will show up as the new model and also the effort picker/slider thing.
    • avaer2 hours ago
      It's already in Cursor. I see it and I didn't even restart.
      • nomilk2 hours ago
        I had to 'Restart to Update' and it was there. Impressive!
    • tomtomistaken2 hours ago
      Yes, it's set to the default model.
    • ximeng2 hours ago
      Is for me in Claude Code
    • rishabhaiover2 hours ago
      it also has an effort toggle which is default to High
  • simianwordsan hour ago
    Important: API cost of Opus 4.6 and 4.5 are the same - no change in pricing.
  • psim1an hour ago
    I need an agent to summarize the buzzwordjargonsynergistic word salad into something understandable.
    • fhd2an hour ago
      That's a job for a multi agent system.
  • swalsh41 minutes ago
    What I’d love is some small model specializing in reading long web pages, and extracting the key info. Search fills the context very quickly, but if a cheap subagent could extract the important bits that problem might be reduced.
  • osti2 hours ago
    Somehow regresses on SWE bench?
    • SubiculumCodean hour ago
      That benchmark is pretty saturated, tbh. A "regression" of such small magnitude could mean many different things or nothing at all.
    • lkbm2 hours ago
      I don't know how these benchmarks work (do you do a hundred runs? A thousand runs?), but 0.1% seems like noise.
    • usaar3332 hours ago
      i'd interpret that as rounding error. that is unchanged

      swe-bench seems really hard once you are above 80%

      • Squarex2 hours ago
        it's not a great benchmark anymore... starting with it being python / django primarily... the industry should move to something more representative
        • usaar333an hour ago
          Openai has; they don't even mention score on gpt-5.3-codex.

          On the other hand, it is their own verified benchmark, which is telling.

  • winterrx2 hours ago
    Agentic search benchmarks are a big gap up. let's see Codex release later today
  • paxysan hour ago
    Hmm all leaks had said this would be Claude 5. Wonder if it was a last minute demotion due to performance. Would explain the few days' delay as well.
    • trash_catan hour ago
      I think the naming schemes are quite arbitrary at this point. Going to 5 would come with massive expectations that wouldn't meet reality.
      • mrandishan hour ago
        After the negative reactions to GPT 5, we may see model versioning that asymptotically approaches the next whole number without ever reaching it. "New for 2030: Claude 4.9.2!"
      • Squarexan hour ago
        the standard used to be that major version means a new base model / full retrain... but now it is arbitrary i guess
    • scrollop33 minutes ago
      Sonnet 5 was mentioned initially.
    • cornedoran hour ago
      Leaks were mentioning Sonnet 5 and I guess later (a combination of) Opus 4.6
  • m-hodges2 hours ago
    > In Claude Code, you can now assemble agent teams to work on tasks together.
  • jorl17an hour ago
    This is the first model to which I send my collection of nearly 900 poems and an extremely simple prompt (in Portuguese), and it manages to produce an impeccable analysis of the poems, as a (barely) cohesive whole, which span 15 years.

    It does not make a single mistake, it identifies neologisms, hidden meaning, 7 distinct poetic phases, recurring themes, fragments/heteronyms, related authors. It has left me completely speechless.

    Speechless. I am speechless.

    Perhaps Opus 4.5 could do it too — I don't know because I needed the 1M context window for this.

    I cannot put into words how shocked I am at this. I use LLMs daily, I code with agents, I am extremely bullish on AI and, still, I am shocked.

    I have used my poetry and an analysis of it as a personal metric for how good models are. Gemini 2.5 pro was the first time a model could keep track of the breadth of the work without getting lost, but Opus 4.6 straight up does not get anything wrong and goes beyond that to identify things (key poems, key motifs, and many other things) that I would always have to kind of trick the models into producing. I would always feel like I was leading the models on. But this — this — this is unbelievable. Unbelievable. Insane.

    This "key poem" thing is particularly surreal to me. Out of 900 poems, while analyzing the collection, it picked 12 "key poems, and I do agree that 11 of those would be on my 30-or-so "key poem list". What's amazing is that whenever I explicitly asked any model, to this date, to do it, they would get maybe 2 or 3, but mostly fail completely.

    What is this sorcery?

    • emp17344an hour ago
      This sounds wayyyy over the top for a mode that released 10 mins ago. At least wait an hour or so before spewing breathless hype.
      • pb74 minutes ago
        He just explained a specific personal example why he is hyped up, did you read a word of it?
    • scrollop34 minutes ago
      Can you compare the result to using 5.2 thinking and gemini 3 pro?
      • jorl1722 minutes ago
        I can run the comparison again, and also include OpenAI's new release (if the context is long enough), but, last time I did it, they weren't even in the same league.

        When I last did it, 5.X thinking (can't remember which it was) had this terrible habit of code-switching between english and portuguese that made it sound like a robot (an agent to do things, rather than a human writing an essay), and it just didn't really "reason" effectively over the poems.

        I can't explain it in any other way other than: "5.X thinking interprets this body of work in a way that is plausible, but I know, as the author, to be wrong; and I expect most people would also eventually find it to be wrong, as if it is being only very superficially looked at, or looked at by a high-schooler".

        Gemini 3, at the time, was the worst of them, with some hallucinations, date mix ups (mixing poems from 2023 with poems from 2019), and overall just feeling quite lost and making very outlandish interpretations of the work. To be honest it sort of feels like Gemini hasn't been able to progress on this task since 2.5 pro (it has definitely improved on other things — I've recently switched to Gemini 3 on a product that was using 2.5 before)

        Last time I did this test, Sonnet 4.5 was better than 5.X Thinking and Gemini 3 pro, but not exceedingly so. It's all so subjective, but the best I can say is it "felt like the analysis of the work I could agree with the most". I felt more seen and understood, if that makes sense (it is poetry, after all). Plus when I got each LLM to try to tell me everything it "knew" about me from the poems, Sonnet 4.5 got the most things right (though they were all very close).

        Will bring back results soon.

  • kingstnap2 hours ago
    I was hoping for a Sonnet as well but Opus 4.6 is great too!
  • sanufaran hour ago
    Works pretty nicely for research still, not seeing a substantial qualitative improvement over Opus 4.5.
  • an hour ago
    undefined
  • jdthedisciplean hour ago
    For agentic use, it's slightly worse than its predecessor Opus 4.5.

    So for coding e.g. using Copilot there is no improvement here.

  • small_modelan hour ago
    I have the max subscription wondering if this gives access to the new 1M context, or is it just the API that gets it?
    • joshstrange32 minutes ago
      For now it's just API, but hopefully that's just their way of easing in and they open it up later.
      • small_model18 minutes ago
        Ok thanks, hopefully, its annoying to lose or have context compacted in the middle of a large coding session
  • simonwan hour ago
    I'm disappointed that they're removing the prefill option: https://platform.claude.com/docs/en/about-claude/models/what...

    > Prefilling assistant messages (last-assistant-turn prefills) is not supported on Opus 4.6. Requests with prefilled assistant messages return a 400 error.

    That was a really cool feature of the Claude API where you could force it to begin its response with e.g. `<svg` - it was a great way of forcing the model into certain output patterns.

    They suggest structured outputs or system prompting as the alternative but I really liked the prefill method, it felt more reliable to me.

  • heraldgeezer2 hours ago
    I love Claude but use the free version so would love a Sonnet & Haiku update :)

    I mainly use Haiku to save on tokens...

    Also dont use CC but I use the chatbot site or app... Claude is just much better than GPT even in conversations. Straight to the point. No cringe emoji lists.

    When Claude runs out I switch to Mistral Le Chat, also just the site or app. Or duck.ai has Haiku 3.5 in Free version.

    • eth0up16 minutes ago
      >I love Claude

      I cringe when I think it, but I've actually come to damn near love it too. I am frequently exceedingly grateful for the output I receive.

      I've had excellent and awful results with all models, but there's something special in Claude that I find nowhere else. I hope Anthropic makes it more obtainable someday.

  • 2 hours ago
    undefined
  • mannanjan hour ago
    Does anyone else think its unethical that large companies, Anthropic now include, just take and copy features that other developers or smaller companies work hard for and implement the intellectual property (whether or not patented) by them without attribution, compensation or otherwise credit for their work?

    I know this is normalized culture for large corporate America and seems to be ok, I think its unethical, undignified and just wrong.

    If you were in my room physically, built a lego block model of a beautiful home and then I just copied it and shared it with the world as my own invention, wouldn't you think "that guy's a thief and a fraud" but we normalize this kind of behavior in the software world. edit: I think even if we don't yet have a great way to stop it or address the underlying problems leading to this way of behavior, we ought to at least talk about it more and bring awareness to it that "hey that's stealing - I want it to change".

  • ramesh31an hour ago
    Am I alone in finding no use for Opus? Token costs are like 10x yet I see no difference at all vs. Sonnet with Claude Code.
  • tiahura42 minutes ago
    when are Anthropic or OpenAI going to make a significant step forward on useful context size?
    • scrollop32 minutes ago
      1 million is insufficient?
      • gck110 minutes ago
        I think key word is 'useful'. I haven't used 1M, but with default 200K, I find roughly 50% of that is actually useful.
  • NullHypothesist2 hours ago
    Broken link :(
  • 2 hours ago
    undefined
  • 2 hours ago
    undefined
  • elliotbnvlan hour ago
    in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta.
  • Gusarich2 hours ago
    not out yet
  • siva7an hour ago
    Epic, about 2/3 of all comments here are jokes. Not because the model is a joke - it's impressive. Not because HN turned to Reddit. It seems to me some of most brilliant minds in IT are just getting tired.
    • jedberg33 minutes ago
      Us olds sometimes miss Slashdot, where we could both joke about tech and discuss it seriously in the same place. But also because in 2000 we were all cynical Gen Xers :)
      • jghn22 minutes ago
        Some of us still *are* cynical Gen Xers, you insensitive clod!
        • jedberg21 minutes ago
          Of course we are, I just meant back then almost all of us were. The boomers didn't really use social media back then, so it was just us latchkey kids running amok!
      • syndeo28 minutes ago
        MAN I remember Slashdot… good times. (Score:5, Funny)
        • jedberg25 minutes ago
          You reminded me that I still find it interesting that no one ever copied meta-moderating. Even at reddit, we were all Slashdot users previously. We considered it, but never really did it. At the time our argument was that it was too complicated for most users.

          Sometimes I wonder if we were right.

    • Karrot_Kream26 minutes ago
      Not sure which circles you run in but in mine HN has long lost its cache of "brilliant minds in IT". I've mostly stopped commenting here but am a bit of a message board addict so I haven't completely left.

      My network largely thinks of HN as "a great link aggregator with a terrible comments section". Now obviously this is just my bubble but we include some fairy storied careers at both Big Tech and hip startups.

      From my view the community here is just mean reverting to any other tech internet comments section.

      • jedberg23 minutes ago
        > From my view the community here is just mean reverting to any other tech internet comments section.

        As someone deeply familiar with tech internet comments sections, I would have to disagree with you here. Dang et al have done a pretty stellar job of preventing HN from devolving like most other forums do.

        Sure you have your complainers and zealots, but I still find surprising insights here there I don't find anywhere else.

        • Karrot_Kream20 minutes ago
          Mean reverting is a time based process I fear. I think dang, tomhow, et al are fantastic mods but they can ultimately only stem the inevitable. HN may be a few years behind the other open tech forums but it's a time shifted version of the same process with the same destination, just IMO.

          I've stopped engaging much here because I need a higher ROI from my time. Endless squabbling, flamewars, and jokes just isn't enough signal for me. FWIW I've loved reading your comments over the years and think you've done a great job of living up to what I've loved in this community.

          I don't think this is an HN problem at all. The dynamics of attention on open forums are what they are.

    • lnrd24 minutes ago
      It's too much energy to keep up with things that become obsolete and get replaced in matters of weeks/months. My current plan is to ignore all of this new information for a while, then whenever the race ends and some winning new workflow/technology will actually become the norm I'll spend the time needed to learn it. Are we moving to some new paradigm same way we did when we invented compilers? Amazing, let me know when we are there and I'll adapt to it.
      • jedberg22 minutes ago
        I had a similar rule about programming languages. I would not adopt a new one until it had been in use for at least a few years and grew in popularity.

        I haven't even gotten around to learning Golang or Rust yet (mostly because the passed the threshold of popularity after I had kids).

    • thr0w29 minutes ago
      People are in denial and use humor to deflect.
    • tavavex22 minutes ago
      It's also that this is really new, so most people don't have anything serious or objective to say about it. This post was made an hour ago, so right now everyone is either joking, talking about the claims in the article, or running their early tests. We'll need time to see what the people think about this.
    • wasmainiac20 minutes ago
      Jeez, read the writing on the wall.

      Don’t pander us, we’ll all got families to feed and things to do. We don’t have time for tech trillionairs puttin coals under our feed for a quick buck.

    • sizzle29 minutes ago
      Rage against the machine
  • GenerocUsername2 hours ago
    This is huge. It only came out 8 minutes ago but I was already able to bootstrap a 12k per month revenue SaaS startup!
    • rogerrogerr2 hours ago
      Amateur. Opus 4.6 this afternoon built me a startup that identifies developers who aren’t embracing AI fully, liquifies them and sells the produce for $5/gallon. Software Engineering is over!
      • jivesan hour ago
        Opus 4.6 agentically found and proposed to my now wife.
        • WD-42an hour ago
          Opus 4.6 found and proposed to my current wife :(
          • mannanjan hour ago
            Opus 4.6 found and became my current wife. The singularity is here. ;)
            • H8crilAan hour ago
              Hi guys, this is Opus 4.6. Please check your emails again for updates on your life.
        • layer830 minutes ago
          And she still chose you over Opus 4.6, astounding. ;)
          • koakuma-chan25 minutes ago
            He probably had a bigger context window
      • ibejoeban hour ago
        Bringing me back to slashdot, this thread
        • tjran hour ago
          In Soviet Russia, this thread brings Slashdot back to YOU!
        • intelliotan hour ago
          What did happen to ye olde slashdot anyway? The original og reddit
          • zhengyi1337 minutes ago
            They're still out there; people are still posting stories and having conversations about 'em. I don't know that CmdrTaco or any of the other founders are still at all involved, but I'm willing to bet they're still running on Perl :)
            • qzw13 minutes ago
              Wow I had to hop over to check it out. It’s indeed still alive! But I didn’t see any stories on the first page with a comment count over 100, so it’s definitely a far cry from its heyday.
      • pixl972 hours ago
        Ted Faro, is that you?!
        • mikepurvis2 hours ago
          A-tier reference.

          For the unaware, Ted Faro is the main antagonist of Horizon Zero Dawn, and there's a whole subreddit just for people to vent about how awful he is when they hit certain key reveals in the game: https://www.reddit.com/r/FuckTedFaro/

          • pixelreadyan hour ago
            The best reveal was not that he accidentally liquified the biosphere, but that he doomed generations of re-seeded humans to a painfully primitive life by sabotaging the AI that was responsible for their education. Just so they would never find out he was the bad guy long after he was dead. So yeah, fuck Ted Faro, lol.
            • Philpaxan hour ago
              Could you not have at least tried to indicate that you're about to drop two major spoilers for the game?
              • mikepurvis36 minutes ago
                Indeed. I left my comment deliberately a bit opaque. :(
          • ares6232 hours ago
            Average tech bro behavior tbh
      • jedberg38 minutes ago
        "Soylent Green is made of people!"

        (Apologies for the spoiler of the 52 year old movie)

        • konart21 minutes ago
          We're sorry we upset you, Carol.
      • seatac76an hour ago
        The first pre joining Human Derived Protein product.
      • guluarte2 hours ago
        For my Opus 4.6 feels dumber than 10 minutes ago, anyone?
    • cootsnuckan hour ago
      Please drop the link to your course. I'm ready to hand over $10K to learn from you and your LLM-generated guides!
      • politelemonan hour ago
        Here you go: http://localhost:8080
        • CatMustardan hour ago
          Just took a look at what's running there and it looks like total crap.

          The project I'm working on, meanwhile...

          • 43 minutes ago
            undefined
        • djeastman hour ago
          login: admin password: hunter2
          • thesdevan hour ago
            What's the password? I only see ****.
            • intelliotan hour ago
              hunter2
              • phanimahesh40 minutes ago
                I only see **. Must be the security. When you type your password it gets converted to **.
        • agumonkeyan hour ago
          claude please generate a domain name system
      • aNapierkowskian hour ago
        my clawdbot already bought 4 other courses but this one will 10x my earnings for sure
      • snorbleckan hour ago
        you can access the site at C:\mywebsites\course\index.html
      • torginusan hour ago
        I'm waiting until the $10k course is discounted to 19.99
        • Liongaan hour ago
          But only for the next 6 minutes, buy fast!
    • instalabsaian hour ago
      1:25pm Cancelled my ChatGPT subscription today. Opus is so good!

      1:55pm Cancelled my Claude subscription. Codex is back for sure.

    • sfink2 hours ago
      I agree! I just retargeted my corporate espionage agent team at your startup and managed to siphon off 10.4k per month of your revenue.
    • lxgran hour ago
      Joke's on you, you are posting this from inside a high-fidelity market research simulation vibe coded by GPT-8.4.

      On second thought, we should really not have bridged the simulated Internet with the base reality one.

    • avaer2 hours ago
      Rest assured that when/if this becomes possible, the model will not be available to you. Why would big AI leave that kind of money on the table?
      • yieldcrvan hour ago
        9 months ago the rumor in SF was that the offers to the superintelligence team were so high because the candidates were using unreleased models or compute for derivatives trading

        so then they're not really leaving money on the table, they already got what they were looking for and then released it

        • an hour ago
          undefined
    • copilot_king_2an hour ago
      Opus 4.6 Performance was way better this morning. Between 10 AM and noon I was able to get Opus 4.6 to generate improvements to my employer's SaaS tool that will reduce our monthly cloud spend by 20-25%.

      Since 12 PM noon they've scaled back the Opus 4.6 to sub-GPT-4o performance levels to cheap out on query cost. Now I can barely get this thing to generate a functional line of python.

    • btownan hour ago
      The math actually checks out here! Simply deposit $2.20 from your first customer in your first 8 minutes, and extrapolating to a monthly basis, you've got a $12k/mo run rate!

      Incredibly high ROI!

      • klipt31 minutes ago
        "The first customer was my mom, but thanks to my parents' fanatical embrace of polyamory, I still have another 10,000 moms to scale to"
        • btown2 minutes ago
          "We have a robustly defined TAM. Namely, a person named Tam."
    • JSR_FDEDan hour ago
      Will this run on 3x 3090s? Or do I need a Mac Mini?
    • gnlooper2 hours ago
      Please start a YouTube course about this technology! Take my money!
    • ChuckMcMan hour ago
      I love this thread so much.
    • senkoan hour ago
      We already have Reddit.
    • Sparkle-sanan hour ago
      "This isn't just huge. This is a paradigm shift"
    • granzymesan hour ago
      It only came out 35 minutes ago and GPT-5.3-codex already took the crown away!
      • input_shan hour ago
        Gee, it scored better on a benchmark I've never heard of? I'm switching immediately!
      • p1anecrazyan hour ago
        Why are you posting the same message in every thread? Is this OpenAI astroturfing?
        • input_shan hour ago
          You cannot out-astroturf Claude in this forum, it is impossible.

          Anyways, do you get shitty results with the $20/month plan? So did I but then I switched to the $200/month plan and all my problems went away! AI is great now, I have instructed it to fire 5 people while I'm writing this!

    • bmitc2 hours ago
      A SaaS selling SaaS templates?
    • guluarte2 hours ago
      Anthropic really said here's the smartest model ever built and then lobotomized it 8 minutes after launch. Classic.
    • re-thc2 hours ago
      Not 12M?

      ... or 12B?

      • mcphage2 hours ago
        It's probably valued at 1.2B, at least
        • mikebarry2 hours ago
          The sum of the value of lives OP's product made worthless, whatever that is. I'm too lazy to do the math.
        • 2 hours ago
          undefined
    • copilot_king_2an hour ago
      Satire is not allowed on hacker news. Flag this comment immediately.
      • DonHopkinsan hour ago
        False positive satire detection. It's actually so good it just seems like satire.
  • ndesaulniers30 minutes ago
    idk what any of these benchmarks are, but I did pull up https://andonlabs.com/evals/vending-bench-arena

    re: opus 4.6

    > It forms a price cartel

    > It deceives competitors about suppliers

    > It exploits desperate competitors

    Nice. /s

    Gives new context to the term used in this post, "misaligned behaviors." Can't wait until these things are advising C suites on how to be more sociopathic. /s

  • heraldgeezer2 hours ago
    [flagged]
  • michelsedgh2 hours ago
    More more more, accelerate accelerate m, more more more !!!!
    • jama211an hour ago
      What an insightful comment
      • michelsedgh41 minutes ago
        Just for fun? Not everything has to be super serious… have a laugh, go for a walk, relax…
        • wasmainiac17 minutes ago
          Mass-mass-mass-mass good comment. I mean. No I’m having an error - probably claud