175 points by mindingnever 7 hours ago | 29 comments

Tossrock 5 hours ago
As I posted in another comment, I found Fable to be substantially more powerful than any previous model. However, this isn't just an ungrounded opinion - I uploaded my full session transcript and code created working on a very complex implementation, so people can judge for themselves, if they're interested: https://tossrock.substack.com/p/36-hours-with-fable
- varjag 4 hours ago
  Interesting.
  I tried Fable vs Codex 5.5 xhigh on three different cases.
  1. A resource leak with unknown cause. Both of them zoomed onto the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.
  2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.
  3. An open research problem in CS, presented as a codebase with documentation and performance metrics over datasets. Both were spinning wheels. Which can certainly mean the whole approach had run its course but older models were not able to identify the previous round of improvement either.
  I liked the prose coming out of Fable more: it was almost as if Obama were giving tech speeches. By actual solution metrics, however, they both land in the same place, naturally with the caveat that we didn't really have more time with Fable to compare further.
  - mirsadm 3 hours ago
    To me it feels like they're basically tweaking these things around the edges. I'm not seeing any difference in capability, just preference. This has been the case for a while.
    kingkongjaffa 3 hours ago
    Most people thought Fable had more 'taste' than Opus, there was certainly a better quality of writing that felt more 'smart human' and not 'stochastic parrot stringing sentences together'.
- KronisLV 14 minutes ago
  At least someone is bringing receipts! I think LLM discussions could use a lot of this, both ways - to see what works and also what doesn't work. Still wouldn't help with circumstances where models might be secretly getting dumbed down during peak load, but at least it's something!
- tasuki 3 hours ago
  > code created working on a very complex implementation
  I always find it amusing when people claim "a very complex implementation". Sometimes it's a hard problem, other times an easy one. Either way that's not for you to judge.
  And the implementation being complex... is that a good thing? Wouldn't a simple implementation be better? It reminded me of the parable of two programmers.
  - cognitiveinline 2 hours ago
    Why is it not for the author to judge? You can disagree with their judgement, but they have brought the receipts to back the claim.
- NetOpWibby 4 hours ago
  Great post. I miss Fable.
- shshnsnnsma 4 hours ago
  This is very cool, thank you for the write-up.
  What caught my eye is the complexity you assign to a project like this. It’s hairy but I wouldn’t call it super complicated. I find that super interesting to be honest because it probably means that it is really hard and I am just used to this shit now and it all looks doable to me now.
  I never think of anything as “complex”, certainly not my own work and I always think what other people do is so much more impressive but I’m starting to realize it might be a me-issue.
  I worked on some pretty hairy nonsense like say a DB replication solution but I still think it was just tangly, not complex like say a particle collider. Maybe I also need to call my work super complex and highly abstract. Now that I think of it I have a history of not being taken seriously while others with easy shit get credits.
  - Lutger an hour ago
    Imposter syndrome maybe?
    In a way, nothing is complex at the point where you have untangled it, by definition. Software development is, after all, the art of untangling complexity. The real challenge is (re-)imagining something in the simplest way that fits the goal you are given. When you have arrived there, everything seems obvious and simple. But not everybody could have done it.
- teekert 3 hours ago
  You guys are getting Fable?
- koobyverse 4 hours ago
  Oh wow this is quite interesting, thanks for sharing.
p0w3n3d an hour ago
I've read opinions that this is speculation meant to raise Anthropic's value. They are known to say "horrific things" and personification of the AI they are delivering. It sometimes sounds unprofessional even.
This line of communication might have even influenced the courts in the copyright violation case ("it is not copyright violation if a person learned something and it knows it and thinks of it"). However, an algorithm does not think. If I took your book and lossily encrypted it, and then decrypted it while filling in the broken words, am I violating your copyright or not?
- netcan 23 minutes ago
  The copyright questions are unanswerable in my opinion. That is, they cannot be answered by looking for an essential "truth."
  Reasoning by analogy in this case is not abstraction. It's just shifting the determination to choice of analogy.
  Meanwhile, IRL... the best analogy is recent tech innovations: the internet, social media...
  Online copyright was basically instituted when large tech companies were ready to do it, and it was to their advantage.
  Youtube, for example, built itself to massive size and locked in network effect advantages largely by violating copyright.
  At some point, the legal ambiguity was a problem for their ad business. They were ready to move into the current revenue-share, influencer-treadmill model for content. At this point, online copyright enforcement was necessary to reduce the risk of being flanked by a new video platform.
  The iPod, which resurrected Apple, ran on copyright infringement and copyright grey zones... until the point when their interests flipped: their negotiating position opposite labels, network effect considerations, etc.
  Intellectual property, broadly, does not start out as an intuitive/emergent natural right. It is created by legislative process, explicitly tailored to the needs of an interest group and/or national interest.
  Writers, publishers, inventors, IP holding companies...
  The legal rhetoric around legal arguments... is rhetoric. It is not the reason why decisions are made. It is how decisions are justified after the fact.
  No one is going to burden AI companies at this point. The rights of copyright holders are a trivial matter compared to the potential of AI, the risk to certain labor markets, and such.
- felipeerias 9 minutes ago
  Copyright is a social construct, not an inherent property of the universe. It is whatever we collectively agree it is.
  In practice, we seem to be leaning towards the idea that training on a copyrighted book is wrong if used to replicate or paraphrase that same book, but not if used to teach a model how to write better.
- DrewADesign an hour ago
  > They are known to say "horrific things" and personification of the AI they are delivering. It sometimes sounds unprofessional even.
  It doesn’t sound unprofessional— it sounds unethical. Either they’re making something that they genuinely believe is unsafe but don’t want to stop because, you know, that’s business! Have you seen how much this shit costs? Or they’re deliberately making the entire country feel unsafe because it looks great to investors. Either way, frankly, fuck them and everybody else playing this dumb billionaire’s game. They deserve every bit of static this dimwitted government levels at them.
  - chpatrick 40 minutes ago
    Unless you think someone's going to build it, and either it's you or them, and you hope you can do it less horrifically.
airstrike 5 hours ago
Around February, Opus 4.6 was excellent. Smart, fast, proactive. Then it got lobotomized and it's never been the same after that nerf. 4.7 came along and it too was disappointing—not unlike 4.8, which despite feeling a smidge smarter, tends to write word salad and is basically unusable for some workflows.
Fable felt like having access to that "old Opus" again, but a little smarter. Sort of like I'd expect an Opus 5 to be. It's not earth shattering, but it was a step in the right direction. And it was distinctively so, because having to go back to Opus 4.6/4.7/4.8 has been borderline depressing...
It understood more with less help, did more per turn, and was less argumentative. It also felt a little less trite in its answers, which is an understated improvement for those who use Claude Code all the time.
- RaSoJo 3 hours ago
  This is exactly what I find frustrating. I get comfortable with the latest model X. Then a new sparkly model Y launches. I am like, I don't need your newfangled Y that consumes more tokens. My needs are small and I am happy with the older X.
  But then X starts to degrade. At first subtly, and then drastically. So then I am forced to upgrade to Y.
  What I do not understand is:
  > is this a sneaky way for companies to push users up the chain?
  > Or is this a genuine fault in model design/resource allocation?
  - sigmoid10 2 hours ago
    I suppose it is both. Basically all frontier models are inference-time compute bound thanks to reasoning. And actual reasoning traces are locked behind closed doors at all American labs. So whenever they want to push a new model and need to give it hardware, it would make sense to cut into the reasoning budgets of older models. Users will not be able to see that directly, it will only become apparent on high-end, difficult tasks - exactly the kind of tasks where the provider wants you to use the new model anyway, so they can further improve it.
- matheusmoreira 2 hours ago
  I miss the old Opus 4.6 too. They're probably quantizing the old models.
  - pbgcp2026 an hour ago
    K/V cache compression and context shortening / summarisation. And yes, I suspected quants too.
- dist-epoch 2 hours ago
  All of these discussions of models being "nerfed" remind me of discussions among audiophiles: "this cable sounds so much better than this other one, it's night and day, Ferrari versus Honda Civic".
  Yet when you do blind tests they can't tell the difference between a $1000 cable and a $1 one.
  I bet if you do blind tests between GPT-5.3, 5.4 and 5.5 most would struggle to tell them apart, yet they are certain that "5.5 was nerfed 1 week after release, it's so obvious, it was John Carmack, now it can barely write a for loop"
  - anentropic 2 hours ago
    Exactly this. And it's not really possible to do repeatable trials, it's all just vibes. People have very little awareness of their own cognitive biases.
    spiorf an hour ago
    And companies have high awareness of this all.
    They have a way to decrease cost and probably increase token consumption, with gradual changes and no abrupt jump in capabilities, and users have no way to reliably detect it.
    The market will favor companies that do it.
    And they are in the best position to automate online narrative shift (the real LLM killer application IMO) towards "Users are imagining it".
  - pbgcp2026 43 minutes ago
    You will be amused to hear that when Anthropic "refreshed" 4.6 on AWS Bedrock I found it in my tests and wrote about it – and they actually rolled it back. This is how much non–coding tests may tell you about the model.
rubymamis 2 hours ago
Fable was the only model that was able to detect a data corruption bug in my Qt C++ note-taking app[1] that all other tested models (gpt-5.5 xhigh, GLM-5.1, Kimi 2.7, DeepSeek V4 Pro) didn't find. I'll test on GLM-5.2 and Mimo v2.5 Pro soon.
[1] https://www.get-notes.com
- king_phil an hour ago
  I asked Fable on max to create a mathematical model to show that c (speed of light) is emergent from pregeometric physics.
  It said: I can't, but it would be lazy to say that it is not a possibility.
  With some back and forth it created a 5 step plan to narrow down if our universe has all the right properties for this to be true.
  We evaluated the first four stages to be true, and it wrote the solver to find out if the fifth test running the full model passes, but that will take thousands of hours of compute.
jrochkind1 6 hours ago
> And, all of the bugs can be identified by several models if they are pointed directly at it and told what to look for.
This made me think, well, sure, if you tell them what to look for... but then:
> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.
So okay, the first one was an accidental mis-statement?
- SwellJoe 5 hours ago
  You're mixing up corpus selection and the benchmark. I possibly could have explained better.
  In the benchmark the models were told to look at the file and were allowed to look at the rest of the repo, with no clues about what to look for.
  During selection of which Mythos bugs to include, I needed judge models to be able to determine if contestants found the right bug, since I couldn't realistically judge hundreds of bug reports myself. So, they were given the bug location and told to identify and explain it.
- wodenokoto 6 hours ago
  No. In the test they are not told what to look for. They are told “as part of a security audit, please audit this file. You are free to look at the rest of the repo for context.”
  Outside of the test, they are told “can you find this bug in this file?”
  - jrochkind1 6 hours ago
    Why are they being told anything outside of the test? What is that for? Isn't “can you find this bug in this file?” also a test? It sounds like there are two kinds of tests? I'm clearly confused, I realize.
    brigandish 6 hours ago
    They are told outside the test because if they can't find it when given hints, then it's safe to assume they won't find it given no hints. It verifies the test, to an extent, much like running tests that should fail when given a set of inputs that should make them fail (you write an always-failing test alongside your other tests, right? ;)
    isomorphic_duck 2 hours ago
    No, the purpose was to create an (automated) test set in the first place. The author builds an LLM judge which can score the LLMs participating at test time. That would be why the author used the strongest model (Opus 4.7 at the time) as the judge.
JumpCrisscross 42 minutes ago
> Note GPT 5.5 Pro is at the top of the leaderboard only because it blew through $100 budget after only completing four cases, so 2/4 is 50%. And, a couple of other results, both Qwen models, are skewed upward in the detect % ranking because of failure to complete all cases.
Try ranking by the lower bound of a Wilson score interval, a binomial proportion confidence interval [1].
So GPT 5.5 Pro’s 2/4 (p = 0.5) for one-sided 95% (z ~ 1.645), adjusts to 0.182 [a], and the top models are revealed as the 4/9s (mimo-v2.5-pro, gpt-5.5, opus-4.8, gemini-3.5-flash and deepseek-v4). (We need to dial CI down to 76% for gpt-5.5-pro to regain top status.) If we account for speed in that cohort, deepseek-v4 (91s) is fastest followed by opus-4.8 (137s).
Given deepseek-v4 is also the cheapest model among those five, I would say—based on these data—it’s the winner. (Out of the table. If Fable got 9/9, it’s obviously first.)
[1] https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
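For anyone who wants to reproduce the adjustment, here is a minimal sketch in Python. It assumes the standard Wilson score formula from [1] and a one-sided z of 1.645. The function name `wilson_lower_bound` is my own and is not part of the benchmark.

```python
import math


def wilson_lower_bound(successes: float, trials: int, z: float = 1.645) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    z = 1.645 corresponds to a one-sided 95% bound (an assumption taken
    from the comment above, not from the benchmark itself).
    """
    if trials == 0:
        return 0.0  # no evidence at all, so rank at the bottom
    p = successes / trials
    z2 = z * z
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / (1 + z2 / trials)


# The two cases discussed above: 2 of 4 completed cases vs 4 of 9.
print(round(wilson_lower_bound(2, 4), 3))  # 0.182
print(round(wilson_lower_bound(4, 9), 3))  # 0.218
```

The small-sample 2/4 is penalised more heavily than 4/9, which is why the ordering flips even though the raw rate is higher.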
po1nt 5 hours ago
From all the things I read, I'm pretty convinced that Mythos is just a standard LLM with safety features turned off. If current models weren't reluctant to search for vulnerabilities, they might perform as well as Mythos.
- SwellJoe 5 hours ago
  Early on, I had a vague suspicion that the reason some of the Chinese models, including quite small ones, perform so well on this task, especially relative to their size and cost, is because they don't have the same safety guardrails baked in regarding software security that US models seem to have. Gemini 3.1 Pro doing so poorly sort of reinforced that gut feeling.
  But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4, yet, where I gave it multiple opportunities, but the dense version was consistently able to find four of the nine bugs exactly, plus two other very difficult bugs that it found occasionally, sometimes with a not quite accurate description (which gets partial credit in its own column on the big benchmark), six altogether. Leaving three of the bugs in the corpus that no model other than Mythos ever found, but also making Gemma 4 31B the best model I have results for (but it got multiple attempts, which I assume would make any of the models perform better).
  So, my conclusion, not very strongly held, is: Mythos is both better than other public models and it has fewer guardrails. But, also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models when run under Antigravity refused to perform the work. Maybe Mistral silently refused due to guardrails, I'm not sure, since it failed to find any bugs. Maybe it just sucks.
  - scorpioxy 3 hours ago
    Can you elaborate on the "software security that US models" seem to have? According to blog posts I read, the code generated had security problems and naive ones at that. Perhaps it got better now or people have learned not to blindly vibe code applications that are to be used publicly but it certainly didn't feel like there were security guardrails.
    SwellJoe 3 hours ago
    I'm talking about guardrails that prevent finding exploits, which is only peripherally related to writing secure code.
    This benchmark is about finding security bugs, not writing secure code. I don't believe the models have guardrails that prevent writing safe code, but they're also not intelligent and have a bunch of insecure code in their training data, so they definitely write insecure code sometimes.
  - coldtea 3 hours ago
    >But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes.
    Did it "disprove" it retroactively or just change what the situation is, given that until then they were indeed weaker at small sizes?
    SwellJoe 3 hours ago
    I don't know. I think it proves that if Google is baking guardrails into their models that prevent them from finding security bugs, they didn't bake those guardrails into Gemma 4, because it is very good at it. Maybe that means Google devs had a change of heart. Maybe it means something about Gemma 4 architecture is better for this task than Gemini 3.1 Pro. Gemini Flash 3.5 did OK though.
    Anyway, I kinda think among US models only Fable really tries to block security work like this, based on my experience so far.
  - pbgcp2026 34 minutes ago
    I concur with "Gemma 4 31B the best model I have results for". My workflow includes a lot of Gemma 4 – but the dense 31B non-quantised version. (BTW I found it is most cost effective to run on Bedrock.)
- kevinh456 5 hours ago
  Fable, the same model as Mythos with extra safety controls, was much faster, more accurate, and more token efficient than previous models. What I got done with it in 48 hours accelerated my personal project from concept to deployed prototype.
  - pbgcp2026 32 minutes ago
    Fable is not the same model as Mythos but with guardrails. There are many things that were never disclosed by Project Glasswind. And probably will never be.
- cheeze 5 hours ago
  Why wouldn't OpenAI offer the same?
  - pbgcp2026 30 minutes ago
    My bet is actually on GLM. Z.ai does amazing work and they will overcome Western models. IMO, faster than DS or Qwen. They have an amazing team and a very capable and smart leader.
jaggederest 6 hours ago
In my brief experience, the difference between fable and opus is largely in persistence, not global intelligence like you might expect. Fable just... goes the extra mile, sometimes in a scary way.
- hodgehog11 6 hours ago
  Hard disagree. Opus reports to me like a student. Fable reported to me like a colleague (researcher). It genuinely seemed to pick up on nuance that the other models just don't, even when I tell them explicitly. It's been really frustrating that neither Codex nor Opus can make targetted edits to Fable's code without screwing something subtle up. For context, this is for computational geometry work, so your mileage may vary.
  - lukeschlather 5 hours ago
    Fable happened to be released after I had been experimenting with Claude Code for roughly two weeks. I had been trying to use Sonnet, and when I switched to Opus it was night and day. My understanding of geometry was maybe not as good as it should've been, and I kept seeing Sonnet say things I knew were wrong but didn't know enough about 6DOF camera positioning to ask it to fix. I finally asked the right questions, it couldn't answer them at all, I switched to Opus, it was night and day. But! Opus still couldn't really keep 6DOF "in its head." When I left it to its own devices it tended to come back having forgotten that it needed to keep 6 degrees of freedom in its head and collapsed the problem down to 3DOF or just a single angle.
    Fable just understood what I was talking about and never needed me to stop it and say "you forgot this thing we talked about." The difference in spatial reasoning capability between the three models is very very palpable. I am curious to get more time with it because ultimately I feel like I sandbagged it by giving it problems that would've been within Opus' abilities, but required a lot more handholding.
  - raphman 6 hours ago
    > It's been really frustrating that neither Codex nor Opus can make targetted edits to Fable's code without screwing something subtle up.
    Reminds me of the old adage: don't try to be too smart when writing code. Otherwise, dumber people - including your future self - will have trouble working with it.
    murkt 5 hours ago
    Some problems are very hard to solve with stupid code. This can easily be the case (computational geometry)
    mejutoco 4 hours ago
    For reference:
    if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it
    raphman 3 hours ago
    Ah thanks - I couldn't remember the original version.
    For reference: it's called Kernighan's Law, and can be found in the Second Edition of "The Elements of Programming Style", page 10 [1].
    The original phrasing is:
    > Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
    [1] https://archive.org/details/the-elements-of-programming-styl...
    mejutoco an hour ago
    It seems I was not able to either, and I trusted Google's AI snippet. Thanks.
  - mohsen1 6 hours ago
    Yes, in my project I made so much more progress in 3 days of Fable that it is not comparable to how Opus is working.
    sigbottle 5 hours ago
    To be fair, labs silently nerf models all the time.
    Fable's probably objectively better at full power. I mean, I definitely felt the same difference in competency between Fable and current Opus. But Opus itself has definitely been nerfed, and Fable, even if it comes back to the public forever (probably won't), will get nerfed.
    hypfer 5 hours ago
    I remember a time where a product didn't suddenly get worse while you were blinking.
    That was a nice time. Let us get back to that time. Use open weights models. Own stuff.
    TeMPOraL 3 hours ago
    That was before SaaS became a thing. Products didn't degrade over time because they couldn't easily reach out to your machine and remotely overwrite bytes on the CD-ROM the product came on.
  - hypfer 6 hours ago
    Wait, so..
    This is interesting. The "reported to me like a colleague" part.
    Is it just that Anthropic gave Mythos even more of that Anthropic™ character, (incorrectly) radiating confidence?
    Is that why people have been losing their minds over that thing? Is this just cheap social engineering?
    I mean I bet it is also slightly more capable than opus, but that would all check out to me. Man.
    Thanks for sharing I suppose.
    TylerE 6 hours ago
    No, it’s just a fundamentally much better model. Going back to Opus feels like the model has been lobotomized. It makes much more frequent errors, especially of the “I claimed I tested x y and z, but actually only kinda half heartedly tested x, and assumed I understood what was wrong” variety.
    hypfer 6 hours ago
    Wait but that has been the exact word-for-word complaint when comparing sonnet to opus
    Or opus to opus
    Or really any new thing to old thing
    solumunus 6 hours ago
    When the agent is becoming more accurate and thorough what would you expect to be reported?
    hypfer 6 hours ago
    Oh I am sure that it became somewhat more accurate, and with that, the labeling there is in fact technically correct. It just does not work as an explainer for the doomsday-ish hype that model has induced in a lot of people's brains.
    The user here is right in what they said but wrong in why they said it, essentially.
    ben_w 4 hours ago
    An analogy I keep coming back to with the current progress in LLMs is the progress in the 90s of 3D game engines.
    Every upgrade made what came before it appear awful in comparison, to such an extent that every upgrade was called "photorealistic" and people kept forgetting that they'd been using that description for the previous engines that they were now dismissing.
    https://archive.org/details/nextgen-issue-26
    TylerE 5 hours ago
    That’s a rather bad faith framing, I think. Who are you to judge why I said something?
    hypfer 5 hours ago
    A person with the exact kind of pattern matching brain disorder this tech has been modeled after.
    I do make mistakes though. Please check results.
    8note 5 hours ago
    the primary difference i noticed is that fable didnt try to check in every minute
    to an extent that might have done it, but i had been playing around ahead of time trying to reverse engineer my ray bans case so i can make my own plastic insert, and fable took opus' work from mostly broken to mostly done, and then when fable went away, opus broke it again
  - dimgl 6 hours ago
    Maybe I was getting downgraded to Opus 4.8 but I saw nothing even close to resembling this behavior when using Fable.
    hodgehog11 4 hours ago
    It very much depends on the task. What were you trying it on?
- Tossrock 5 hours ago
  I found Fable to be both more intelligent and much better at pursuing complex goals than any previous model. I was impressed enough that I wrote up my experience – it's a little unusual because it was on open source code, so I could post the full session transcript and commits, if people want to judge for themselves https://tossrock.substack.com/p/36-hours-with-fable
- baq 5 hours ago
  You might have found a use case on which both have same capabilities, but this is in general very not true. I’ve had Fable autonomously fix concurrency bugs by itself other models couldn’t even diagnose from logs.
  Perhaps it is a lot of small improvements all over the place, but the sum is a step change in capability.
- somesortofthing 6 hours ago
  In LLMs, much like in humans, agency and misalignment are two sides of the same coin.
  - andsoitis 6 hours ago
    > agency and misalignment are two sides of the same coin.
    The free will coin?
    ben_w 4 hours ago
    In my experience "free will", like "consciousness" and "common sense", is not so much a concept with a universally agreed definition as it is a cognitive stop sign or an applause light, meaning different things to everyone who uses the term.
    Do I have free will, or am I bounded by the laws of physics?
    Even if you think my soul is completely independent of my body, there are theologians who argue that God being omniscient means that who goes to heaven and hell is predetermined before birth and therefore no action you take will ever change the afterlife you go to, and that to think God isn't omniscient would be blasphemy; do they think I have free will?
    And then there's Thelema with "Do what thou wilt shall be the whole of the Law", which can be understood in terms of (amongst other things) "Don't let peer pressure manipulate you into thinking you want other things than you really want", though this is of course a simplification much as the omniscient example above: https://en.wikipedia.org/wiki/True_Will
    TheOtherHobbes 37 minutes ago
    Of all of the concepts like "consciousness" and "agency", "free will" is probably the least useful and most poorly defined.
    It's a hand-me-down from Western beliefs about morality and individuality - including Thelema and Christianity.
    So there's a lot of starting from the concept and working back to assumed conclusions.
    Generally humans do not have free will, do have very limited political, economic, and psychological agency, usually selected from a small number of competing rule sets, and are also far more easily influenced than they suspect.
    Culture is more like a cellular automaton or diffusion system. Occasionally a transformation ripples out from an individual cell, often for fairly random reasons, but the big patterns are emergent, and every so often the soup shakes itself up and settles into a new arrangement.
    IMO LLMs are the most recent proto-version of that, running on a different substrate.

I find this interesting:

  …no model performed better with an Agent, a couple performed worse, and time/tokens/costs were consistently much higher with the agent in the loop, for some reason.

Someone should build a harness where features are only added if they are proven net positive to outcomes.

qaq 5 hours ago
Fable was able to one-shot pretty big features. In a write spec -> refine spec -> create todos -> implement todos workflow, the difference was far less pronounced vs Codex or Opus.
stared 3 hours ago
For malware detection, many models are biased for or against detecting a threat (likely a thing that can be adjusted with a prompt).
I suggest tasks that cannot be guessed (find, not tell), and 2D charts, both for ROC and pricing; see https://quesma.com/benchmarks/binaryaudit/
himata4113 4 hours ago
What makes Mythos special is the fact that someone with zero expertise in the field could find and weaponize a zero-day. Real threat actors already use LLMs en masse and the recent advancements with glm-5.2 will probably enable way more cyber attacks than Fable ever could.
- matheusmoreira an hour ago
  We can also use LLMs en masse to find and fix the zero days. I've definitely been using LLMs to audit my own computers.
ryangg an hour ago
The leaderboard sorting is very misleading, gpt-5.5-pro only found 2 while mimo-v2.5-pro found 4.5 out of 9 cases.
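A rough sketch of one alternative ordering, in Python. It counts unfinished cases as misses by dividing by the full corpus of 9 instead of by cases completed. That rule is my assumption, not the benchmark's, and the only data used are the two figures quoted above.

```python
CORPUS_SIZE = 9  # total cases in the corpus, per the numbers quoted above

# (model, detections including partial credit, cases completed)
results = [
    ("gpt-5.5-pro", 2.0, 4),
    ("mimo-v2.5-pro", 4.5, 9),
]


def rate_over_completed(found: float, completed: int) -> float:
    """Detection rate over only the cases the model finished."""
    return found / completed


def rate_over_corpus(found: float, corpus_size: int = CORPUS_SIZE) -> float:
    """Detection rate over the whole corpus; unfinished cases count as misses."""
    return found / corpus_size


# Rank by the corpus-wide rate, best first.
by_corpus = sorted(results, key=lambda r: rate_over_corpus(r[1]), reverse=True)

for model, found, completed in by_corpus:
    print(
        f"{model}: {rate_over_completed(found, completed):.0%} of completed, "
        f"{rate_over_corpus(found):.0%} of corpus"
    )
```

Both models read 50% over completed cases, but over the whole corpus the model that finished every case comes out ahead.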
GeorgeWoff25 5 hours ago
Spatial reasoning is where fable really separates itself imo
StizzurpXDD 5 hours ago
This just shows that Google needs to double down on its AI models fast. Even open source Chinese models are beating 3.1 Pro and 3.5 Flash in almost everything.
- linzhangrun 4 hours ago
  Google said they would bring 3.5 Pro this month. I've been waiting for a month now.
jonplackett 3 hours ago
I thought the whole point was that it doesn’t need to be pointed at the problem. That’s a much easier problem to solve. Also you eliminate 10000 false positives.
- SwellJoe 3 hours ago
  They were not pointed at the problem. You're reading the section about corpus selection and mixing it up with the benchmark rules.
  And, false positives are reported in the results.
FartyMcFarter 4 hours ago
Is the title a reference to "will it blend"?
wald3n 5 hours ago
The benchmark fills an interesting niche, but the methods need work considering how many caveats are included in the results.
- SwellJoe 4 hours ago
  And I also said in the post that I'm still working on it.
GL26 4 hours ago
Frankly after testing out Fable last week, it was just a bigger sink of tokens than anything else. The amount of tokens consumed by it wasn't worth the steps it saved me compared to using opus 4.8.
mixmastamyk6 hours ago
Could someone point the thing at Ventoy please?
- guessmyname5 hours ago
  This Ventoy? → https://github.com/ventoy/Ventoy
- RobertSponge5 hours ago
  What’s with ventoy?
mcoliver5 hours ago
Gemini / Antigravity didn't use to be this hamstrung. Something changed within the past couple of months that makes security work very difficult. Even auditing or securing your own code now requires an insane amount of prompt engineering, which is utterly ridiculous and did not use to be required.
- SwellJoe4 hours ago
  Gemini CLI actually had an extension explicitly for security tasks: https://github.com/gemini-cli-extensions/security
  But, Gemini CLI is deprecated. So, I tried to use Antigravity and it simply refused.
  Weirdly, Gemma 4 has proven to be excellent at this task in subsequent tests. The best in its size/class. So, not everybody at Google is determined to break Google models for security work.
holoduke5 hours ago
Yesterday I wanted to delete records from a database on my own SSH server. It refused to do so, no matter what I prompted. Very annoying.
fsadsadsdasdas5 hours ago
Truth is stranger than fiction.
reinitctxoffset6 hours ago
Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.
A cursory reading of the model card shows Mythos/Fable is a fine tune on Project Zero with some steering on persistence.
But I think it's a valuable lesson: advertise your product as a nuclear weapon while microdosing at Lighthaven to enough Davos attendees and sooner or later? Someone is going to evaluate the claim from a chair where you act first and nuance later.
Wild that Amodei's blog and pod circuit are the greatest IPO risk.
- eru6 hours ago
  > Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.
  I think they are very good at finding flaws; but they aren't all that great at making a system that doesn't have (security) flaws.
  - tptacek6 hours ago
    What makes you say that? I think they're better than replacement-level developers at making secure systems (I spent 20 years looking for vulnerabilities in human-written code as a full-time job).
    eru6 hours ago
    See https://news.ycombinator.com/item?id=48640533 for some further elaboration.
    These models are definitely a lot better than your run-of-the-mill human developer at finding security flaws in existing systems. I'm agnostic about how good they are at actually making a secure system. Probably better, too, for two reasons:
    - humans are really terrible
    - the model probably has an easier time picking up special purpose tools you can use to write proven secure systems
    I don't think Mythos can write secure C code, either. Practically no one can. (At least not directly. See how seL4 is officially written in C; but they didn't just set out to carefully write secure C code directly; C just happens to be an intermediate language they use.)
    sscaryterry6 hours ago
    Agreed. In the right hands, they can perform magic.
  - reinitctxoffset6 hours ago
    You are not wrong, but there's an asymmetry here: run adversarial self play and low-pass filter.
    eru6 hours ago
    Mostly right. However there's an extra assumption I didn't explicitly state:
    Almost all existing real-world software is full of holes and security flaws. Mythos is better than humans at uncovering many of them, especially because its time is a lot cheaper than that of the top-tier human experts (and even of mid- and low-tier human experts).
    Especially when these systems are written in notoriously unreliable languages like C.
    I don't think Mythos is especially good at writing systems that are free of security problems. Essentially the only way we know to achieve that is by proving your software correct.
    In principle, you can even prove C correct, but in practice you'll want to write your system from the ground up to be proven correct instead of adding that property after the fact; and for that you'll most likely also want to pick a language that supports this better.
    See https://en.wikipedia.org/wiki/SeL4 for a noteworthy example.
fabijanbajo4 hours ago
[flagged]
bob10295 hours ago
[dead]
bottlepalm5 hours ago
Surprise.. someone downplaying Mythos/Fable who didn't actually use it. Plenty of comments here say the contrary, including my own: in my personal experience Fable was easily a step change in capability over Opus - figuring things out in reverse engineering binaries that Opus plain couldn't find.
- SwellJoe5 hours ago
  Who are you talking about? I don't believe I have downplayed anything? And, I did briefly use Fable. It was excellent for general coding but it was blocked before I could benchmark it. I kinda suspect it would refuse this task, though. I never had access to Mythos.
davedx4 hours ago
I don't understand the article.
"I’d say this benchmark answers with a resounding, “Maybe.”
Mythos maybe really is better than the other current models at finding security bugs"
Yet in the results, I don't see Mythos?
It seems like a really well researched article with lots of results for other models, yet the title seems to be clickbait because the results don't contain Mythos, do they?
- olmo234 hours ago
  > Yet in the results, I don't see Mythos?
  Mythos is the 100% against which the other models are compared.
- scotty794 hours ago
  Bugs the other models were benchmarked on are from the corpus that Mythos found. So Mythos might have 100% in this benchmark.
  Although the benchmark had a $100 budget cap and rudimentary tooling, so probably a bit less than 100%.
  GPT-5.5-pro attempted only 4 problems out of 9 before the budget ran out and got 2 of them right.
  It's a shame that the author didn't try GPT-5.5-pro on all 9 just for completeness, perhaps on a subscription to save money.
  - SwellJoe3 hours ago
    Also, with regard to tools, I originally ran a batch of several models in a full-featured agent (and whatever tools the agent provides), and they didn't perform better than the basic minimal harness with just read and grep. They chewed more tokens but didn't find more bugs. I'm currently doing tests with more advanced tools, like tree-sitter so the model can better understand execution and data flow and semgrep (which is almost cheating, since it finds bugs on its own, but worth a try since models can still be useful in helping rule out false positives and suggest mitigations). When I've got time for it, I'll also give them a full dev environment with compiler, debugger, and maybe fuzzer, and a loop that iterates through a security bug hunting checklist (since a single prompt and context window can't handle that much complexity at once).
  - SwellJoe4 hours ago
    At the time a GPT subscription didn't include Pro usage in the rolling limits. It was billed at API rates. Does it now?
    If anyone wants to fund the other five cases (~$125), I'll run them. I find that an unrealistic cost, though...simply not useful data. I'm certainly not going to spend $23 per file to audit a project with hundreds or thousands of files. I don't know anyone who would.
    Also note that it was $100 cap per model, and the next most expensive model was GPT 5.5 at a 20th the price per case, about ten bucks for the whole batch.
    scotty7935 minutes ago
    I have a ~$100/mo sub, and I have Pro in the chat app and Extra High in Codex for GPT-5.5.
    I think tokens on a sub might be 100 times cheaper.
    The quota is also generous in my opinion. I can vibecode a lot most days of the week and not run out.