I tried Fable vs Codex 5.5 xhigh on three different cases.
2. A resource leak with unknown cause. Both of them zeroed in on the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.
2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.
3. An open research problem in CS, presented as a codebase with documentation and performance metrics over datasets. Both were spinning their wheels. That could certainly mean the whole approach has run its course, but older models were not able to identify the previous round of improvement either.
I liked the prose coming out of Fable more: it was almost as if Obama were giving tech speeches. By actual solution metrics, however, they both land in the same place, with the caveat that we didn't have more time with Fable to compare further.
I always find it amusing when people claim "a very complex implementation". Sometimes it's a hard problem, other times an easy one. Either way that's not for you to judge.
And the implementation being complex... is that a good thing? Wouldn't a simple implementation be better? It reminded me of the parable of two programmers.
What caught my eye is the complexity you assign to a project like this. It’s hairy but I wouldn’t call it super complicated. I find that super interesting to be honest because it probably means that it is really hard and I am just used to this shit now and it all looks doable to me now.
I never think of anything as “complex”, certainly not my own work and I always think what other people do is so much more impressive but I’m starting to realize it might be a me-issue.
I worked on some pretty hairy nonsense like say a DB replication solution but I still think it was just tangly, not complex like say a particle collider. Maybe I also need to call my work super complex and highly abstract. Now that I think of it I have a history of not being taken seriously while others with easy shit get credits.
In a way, nothing is complex at the point where you have untangled it, by definition. Software development is, after all, the art of untangling complexity. The real challenge is (re-)imagining something in the simplest way that fits the goal you are given. When you have arrived there, everything seems obvious and simple. But not everybody could have done it.
This line of communication might even have influenced the courts in copyright cases ("it is not a copyright violation if a person learned something, knows it, and thinks about it"). However, an algorithm does not think. If I took your book and lossily encrypted it, then decrypted it while filling in the broken words, would I be violating your copyright or not?
Reasoning by analogy in this case is not abstraction. It's just shifting the determination to choice of analogy.
Meanwhile, IRL... the best analogy is recent tech innovations: the internet, social media...
Online copyright was basically instituted when large tech companies were ready to do it, and it was to their advantage.
Youtube, for example, built itself to massive size and locked in network effect advantages largely by violating copyright.
At some point, the legal ambiguity was a problem for their ad business. They were ready to move into the current revenue-share, influencer-treadmill model for content. At this point, online copyright enforcement was necessary to reduce the risk of being flanked by a new video platform.
The iPod, which resurrected Apple, ran on copyright infringement and copyright grey zones... until the point when their interests flipped: their negotiating position opposite the labels, network-effect considerations, etc.
Intellectual property, broadly, does not start out as an intuitive/emergent natural right. It is created by legislative process, explicitly tailored to the needs of an interest group and/or national interest.
Writers, publishers, inventors, IP holding companies...
The legal rhetoric around legal arguments... is rhetoric. It is not the reason why decisions are made. It is how decisions are justified after the fact.
No one is going to burden AI companies at this point. The rights of copyright holders are a trivial matter compared to the potential of AI, the risk to certain labor markets, and such.
In practice, we seem to be leaning towards the idea that training on a copyrighted book is wrong if used to replicate or paraphrase that same book, but not if used to teach a model how to write better.
It doesn’t sound unprofessional— it sounds unethical. Either they’re making something that they genuinely believe is unsafe but don’t want to stop because, you know, that’s business! Have you seen how much this shit costs? Or they’re deliberately making the entire country feel unsafe because it looks great to investors. Either way, frankly, fuck them and everybody else playing this dumb billionaire’s game. They deserve every bit of static this dimwitted government levels at them.
Fable felt like having access to that "old Opus" again, but a little smarter. Sort of like I'd expect an Opus 5 to be. It's not earth shattering, but it was a step in the right direction. And it was distinctively so, because having to go back to Opus 4.6/4.7/4.8 has been borderline depressing...
It understood more with less help, did more per turn, and was less argumentative. It also felt a little less trite in its answers, which is an understated improvement for those who use Claude Code all the time.
But then X starts to degrade. At first subtly, and then drastically. So then I am forced to upgrade to Y.
What I do not understand is:
> is this a sneaky way for companies to push users up the chain?
> Or is this a genuine fault in model design/resource allocation?
Yet when you do blind tests they can't tell the difference between a $1000 cable and a $1 one.
I bet if you do blind tests between GPT-5.3, 5.4 and 5.5 most would struggle to tell them apart, yet they are certain that "5.5 was nerfed 1 week after release, it's so obvious, it was John Carmack, now it can barely write a for loop"
They have a way to decrease cost and probably increase token consumption, with gradual changes and no abrupt jump in capabilities, and users have no way to reliably detect it.
Market will advantage companies that do it.
And they are in the best position to automate online narrative shift (the real LLM killer application IMO) towards "Users are imagining it".
It said: I can't, but it would be lazy to say that it is not a possibility.
With some back and forth it created a 5 step plan to narrow down if our universe has all the right properties for this to be true.
We evaluated the first four stages to be true, and it wrote the solver to find out if the fifth test running the full model passes, but that will take thousands of hours of compute.
This made me think, well, sure, if you tell them what to look for... but then:
> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.
So okay, the first one was an accidental mis-statement?
In the benchmark the models were told to look at the file and were allowed to look at the rest of the repo, with no clues about what to look for.
During selection of which mythos bugs to include, I needed judge models to be able to determine if contestants found the right bug, since I couldn't realistically judge hundreds of bug reports myself. So, they were given the bug location and told to identify and explain it.
Outside of the test, they are told “can you find this bug in this file?”
Try the lower bound of a Wilson score interval for the binomial proportion [1].
So GPT 5.5 Pro’s 2/4 (p = 0.5), at one-sided 95% (z ~ 1.645), adjusts to 0.182 [a], and the top models are revealed as the 4/9s, which adjust to about 0.218 (mimo-v2.5-pro, gpt-5.5, opus-4.8, gemini-3.5-flash and deepseek-v4). (We need to dial the CI down to 76% for gpt-5.5-pro to regain top status.) If we account for speed in that cohort, deepseek-v4 (91s) is fastest, followed by opus-4.8 (137s).
Given deepseek-v4 is also the cheapest model among those five, I would say—based on these data—it’s the winner. (Out of the table. If Fable got 9/9, it’s obviously first.)
[1] https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
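For anyone who wants to check the adjustment above, here is a minimal sketch of the one-sided Wilson lower bound, assuming the standard formula from [1]. The function name `wilson_lower_bound` is my own, and the two inputs are just the 2/4 and 4/9 scores quoted above, not a reproduction of the benchmark's full table.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    z = 1.645 corresponds to a one-sided 95% bound (assumption: the same
    z value used in the comment above).
    """
    if trials == 0:
        return 0.0  # no attempts, no evidence of any success rate
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = p_hat + z2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    return (center - margin) / denom


# Scores quoted in the thread: 2 found out of 4 attempted, and 4 out of 9.
for label, found, attempted in [("2 of 4", 2, 4), ("4 of 9", 4, 9)]:
    print(f"{label}: {wilson_lower_bound(found, attempted):.3f}")
# 2 of 4: 0.182
# 4 of 9: 0.218
```

The point of using the lower bound rather than the raw proportion is that a small sample gets penalized: 2/4 has the higher raw rate (0.5 vs 0.444), but with only four attempts its lower bound falls below that of the 4/9 models.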
But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4, yet, where I gave it multiple opportunities, but the dense version was consistently able to find four of the nine bugs exactly, plus two other very difficult bugs that it found occasionally, sometimes with a not quite accurate description (which gets partial credit in its own column on the big benchmark), six altogether. Leaving three of the bugs in the corpus that no model other than Mythos ever found, but also making Gemma 4 31B the best model I have results for (but it got multiple attempts, which I assume would make any of the models perform better).
So, my conclusion, not very strongly held, is: Mythos is both better than other public models and it has fewer guardrails. But, also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models when run under Antigravity refused to perform the work. Maybe Mistral silently refused due to guardrails, I'm not sure, since it failed to find any bugs. Maybe it just sucks.
This benchmark is about finding security bugs, not writing secure code. I don't believe the models have guardrails that prevent writing safe code, but they're also not intelligent and have a bunch of insecure code in their training data, so they definitely write insecure code sometimes.
Did it "disprove" it retroactively, or did it just change the situation, given that until then they were indeed weaker at small sizes?
Anyway, I kinda think among US models only Fable really tries to block security work like this, based on my experience so far.
Fable just understood what I was talking about and never needed me to stop it and say "you forgot this thing we talked about." The difference in spatial reasoning capability between the three models is very very palpable. I am curious to get more time with it because ultimately I feel like I sandbagged it by giving it problems that would've been within Opus' abilities, but required a lot more handholding.
Reminds me of the old adage: don't try to be too smart when writing code. Otherwise, dumber people - including your future self - will have trouble working with it.
if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it
For reference: it's called Kernighan's Law, and can be found in the Second Edition of "The Elements of Programming Style", page 10 [1].
The original phrasing is:
> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
[1] https://archive.org/details/the-elements-of-programming-styl...
Fable's probably objectively better at full power. I mean, I definitely felt the same difference in competency between Fable and current Opus. But Opus itself has definitely been nerfed, and Fable, even if it comes back to the public forever (probably won't), will get nerfed.
That was a nice time. Let us get back to that time. Use open weights models. Own stuff.
This is interesting. The "reported to me like a colleague" part.
Is it just that Anthropic gave Mythos even more of that Anthropic™ character, (incorrectly) radiating confidence?
Is that why people have been losing their minds over that thing? Is this just cheap social engineering?
I mean I bet it is also slightly more capable than opus, but that would all check out to me. Man.
Thanks for sharing I suppose.
Or opus to opus
Or really any new thing to old thing
The user here is right in what they said but wrong in why they said it, essentially.
Every upgrade made what came before it appear awful in comparison, to such an extent that every upgrade was called "photorealistic" and people kept forgetting that they'd been using that description for the previous engines that they were now dismissing.
to an extent that might have done it, but i had been playing around ahead of time trying to reverse engineer my ray bans case so i can make my own plastic insert, and fable took opus' work from mostly broken to mostly done, and then when fable went away, opus broke it again
Perhaps it is a lot of small improvements all over the place, but the sum is a step change in capability.
The free will coin?
Do I have free will, or am I bounded by the laws of physics?
Even if you think my soul is completely independent of my body, there are theologians who argue that God being omniscient means that who goes to heaven and hell is predetermined before birth and therefore no action you take will ever change the afterlife you go to, and that to think God isn't omniscient would be blasphemy; do they think I have free will?
And then there's Thelema with "Do what thou wilt shall be the whole of the Law", which can be understood in terms of (amongst other things) "Don't let peer pressure manipulate you into thinking you want other things than you really want", though this is of course a simplification much as the omniscient example above: https://en.wikipedia.org/wiki/True_Will
It's a hand-me-down from Western beliefs about morality and individuality - including Thelema and Christianity.
So there's a lot of starting from the concept and working back to assumed conclusions.
Generally humans do not have free will, do have very limited political, economic, and psychological agency, usually selected from a small number of competing rule sets, and are also far more easily influenced than they suspect.
Culture is more like a cellular automaton or diffusion system. Occasionally a transformation ripples out from an individual cell, often for fairly random reasons, but the big patterns are emergent, and every so often the soup shakes itself up and settles into a new arrangement.
IMO LLMs are the most recent proto-version of that, running on a different substrate.
…no model performed better with an Agent, a couple performed worse, and time/tokens/costs were consistently much higher with the agent in the loop, for some reason.
Someone should build a harness where features are only added if they are proven net positive to outcomes. I suggest tasks that cannot be guessed (find, not tell). And 2D charts, both for ROC and pricing, vide https://quesma.com/benchmarks/binaryaudit/
And, false positives are reported in the results.
But, Gemini CLI is deprecated. So, I tried to use Antigravity and it simply refused.
Weirdly, Gemma 4 has proven to be excellent at this task in subsequent tests. The best in its size/class. So, not everybody at Google is determined to break Google models for security work.
A cursory reading of the model card shows Mythos/Fable is a fine tune on Project Zero with some steering on persistence.
But I think it's a valuable lesson: advertise your product as a nuclear weapon while microdosing at Lighthaven to enough Davos attendees and sooner or later? Someone is going to evaluate the claim from a chair where you act first and nuance later.
Wild that Amodei's blog and pod circuit are the greatest IPO risk.
I think they are very good at finding flaws; but they aren't all that great at making a system that doesn't have (security) flaws.
These models are definitely a lot better than your run of the mill human developer at finding security flaws in existing systems. I'm agnostic at how good they are at actually making a secure system. Probably better, too, for two reasons:
- humans are really terrible
- the model probably has an easier time picking up special purpose tools you can use to write proven secure systems
I don't think Mythos can write secure C code, either. Practically no one can. (At least not directly. See how seL4 is officially written in C; but they didn't just set out to carefully write secure C code directly; C just happens to be an intermediate language they use.)
Almost all existing real world software is full of holes and security flaws. Mythos is better than humans at uncovering many of them; especially because its time is a lot cheaper than that of the top tier human experts (and even of mid-and low-tier human experts).
Especially when these systems are written in notoriously unreliable languages like C.
I don't think Mythos is especially good at writing systems that are free of security problems. Essentially the only way we know is by proving your software correct.
In principle, you can even prove C correct, but in practice you'll want to write your system from the ground up to be proven correct instead of adding that property after the fact; and for that you'll most likely also want to pick a language that supports this better.
See https://en.wikipedia.org/wiki/SeL4 for a noteworthy example.
"I’d say this benchmark answers with a resounding, “Maybe.”
Mythos maybe really is better than the other current models at finding security bugs"
Yet in the results, I don't see Mythos?
It seems like a really well researched article with lots of results for other models, yet the title seems to be clickbait because the results don't contain Mythos, do they?
Mythos is the 100% against which the other models are compared.
Although the benchmark had a $100 budget cap and rudimentary tooling, so probably a bit less than 100%.
GPT-5.5-pro attempted only 4 problems out of 9 before the budget ran out and got 2 of them right.
It's a shame that the author didn't try GPT-5.5-pro on all 9 just for completeness, perhaps on a subscription to save money.
If anyone wants to fund the other five cases (~$125), I'll run them. I find that an unrealistic cost, though...simply not useful data. I'm certainly not going to spend $23 per file to audit a project with hundreds or thousands of files. I don't know anyone who would.
Also note that it was $100 cap per model, and the next most expensive model was GPT 5.5 at a 20th the price per case, about ten bucks for the whole batch.
I think tokens on a subscription might be 100 times cheaper.
The quota is also generous in my opinion. I can vibecode a lot most days of the week and not run out.