That's hilarious. Does OpenAI even know this doesn't work?
https://chatgpt.com/share/69aa0321-8a9c-8011-8391-22861784e8...
EDIT: oh, but I'm logged in, fwiw
I was pretty impressed with how they’ve improved user experience. If I had to guess, I’d say Anthropic has better product people who put more attention to detail in these areas.
Or, people just stopped thinking about any sort of UX. These sort of mistakes are all over the place, on literally all web properties, some UX flows just ends with you at a page where nothing works sometimes. Everything is just perpetually "a bit broken" seemingly everywhere I go, not specific to OpenAI or even the internet.
It's almost like people are vibe coding their web apps or something.
OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. There version numbers jump across different model lines with codex at 5.3, what they now call instant also at 5.3.
Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.
Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.
What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tools that doesn't quite do what you want.
Not quite the same, but it did remind me of it.
"Ok, here is the translation:"
'we don't want to offer support'Really, the economics makes no sense, but that's what they're doing. You can't have a consistent model because it'll pin their hardware & software, and that costs money.
Sure, makes total sense guys.
I don't know, this feels unnecessarily nitpicky to me
It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.
Especially for a target audience of software engineers skipping a version number is a common occurrence and never questioned.
But generally: These are not consumer facing products and I agree that someone who uses the API should be able to figure out the price point of different models.
Why are you using the same model after a month? Every month a better model comes out. They are all accessible via the same API. You can pay per-token. This is the first time in, like, all of technology history, that a useful paid service is so interoperable between providers that switching is as easy as changing a URL.
It's really nice to see Google get back to its roots by launching things only to "beta" and then leaving them there for years. Gmail was "beta" for at least five years, I think.
I guess that's true, but geared towards API users.
Personally, since "Pro Mode" became available, I've been on the plan that enables that, and it's one price point and I get access to everything, including enough usage for codex that someone who spends a lot of time programming, never manage to hit any usage limits although I've gotten close once to the new (temporary) Spark limits.
naming things
cache invalidation
off by one errors
I also don't believe there is any value in trying to aggregate consumers or businesses just to clean up model makers names/release schedule. Consumers just use the default, and businesses need clarity on the underlying change (e.g. why is it acting different? Oh google released 3.6)
Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.
I am skeptical whether the 1M context window will provide material gains as current Codex/Opus show weaknesses as its context window is mostly full, but we'll see.
Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supercedes GPT-5.3-Codex, which is an interesting move.
> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
Taken from https://developers.openai.com/api/docs/models/gpt-5.4
> GPT‑5.4 in Codex includes experimental support for the 1M context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit. Requests that exceed the standard 272K context window count against usage limits at 2x the normal rate.
I don't think that's a fair reading of the original post at all, obviously what they meant by "no cost" was "no increase in the cost".
Switch between Claude models. Applies to this session and future Claude Code sessions. For other/previous model names, specify with --model.
1. Default (recommended) Opus 4.6 · Most capable for complex work
2. Opus (1M context) Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
3. Sonnet Sonnet 4.6 · Best for everyday tasks
4. Sonnet (1M context) Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok
5. Haiku Haiku 4.5 · Fastest for quick answersFor Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.
Curious to hear if people have use cases where they find 1M works much better!
(I work at OpenAI.)
I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.
I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.
Feels like a losing battle, but hey, the audience is usually right.
https://apps.apple.com/us/app/clean-links-qr-code-reader/id6...
Anywhere I can toss a Tip for this free app?
Especially when it’s to the point of, you know, nagging/policing people to do it the way you’d prefer, when you could just redirect your router requests from x.com to xcancel.com
Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise and compaction generally causes it to lose the plot entirely and have to start almost from scratch.
(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)
Sometimes I’m exploring some topic and that exploration is not useful but only the summary.
Also, you could use the best guess and cli could tell me that this is what it wants to compact and I can tweak its suggestion in natural language.
Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.
Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.
The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history.
That way you stay in control of both the context budget and the level of detail the agent operates with.
But I'm mostly working on personal projects so my time is cheap.
I'm now more careful, using tracking files to try to keep it aligned, but more control over compaction regardless would be highly welcomed. You don't ALWAYS need that level of control, but when you do, you do.
For me, I would say speed (not just time to first token, but a complete generation) is more important then going for a larger context size.
I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.
https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272k
It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)
Even right now one page refers to prices for "context lengths under 270K" whereas another has pricing for "<272K context length"
For example on Artificial Analysis, the GPT-5.x models' cost to run the evals range from half of that of Claude Opus (at medium and high), to significantly more than the cost of Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable distribution, and Opus sits right in the middle of that distribution.
The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.
According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more.
Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!
For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example is, I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Gemini took 3x longer and cost 3x more!)
It was generally smarter than pre-5.2 so strategically better, and codex likewise wrote better database queries than non-codex, and as it needs to iteratively hunt down the answer, didn't run out the clock by drowning in reasoning.
Video: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...
We'll be updating numbers on 5.3 and claude, but basically same thing there. Early, but we were surprised to see codex outperform opus here.
I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet).
If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.
Exact same situation here. I've been using both extensively for the last month or so, but still don't really feel either of them is much better or worse. But I have not done large complex features with it yet, mostly just iterative work or small features.
I also feel I am probably being very (overly?) specific in my prompts compared to how other people around me use these agents, so maybe that 'masks' things
There's no mention of pricing, quotas and so on. Perhaps Codex will still be preferable for coding tasks as it is tailored for it? Maybe it is faster to respond?
Just speculation on my part. If it becomes redundant to 5.4, I presume it will be sunset. Or maybe they eventually release a Codex 5.4?
I can tell claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do.
Codex on the other hand will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan.
You could add more scaffolding to fix this, but Claude proves you shouldn't have to.
I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.
That's not the experience I have. I had it do more complex changes spawning multiple files and it performed well.
I don't like using multiple agents though. I don't vibe code, I actually review every change it makes. The bottleneck is my review bandwidth, more agents producing more code will not speed me up (in fact it will slow me down, as I'll need to context switch more often).
My own tooling throws off requests to multiple agents at the same time, then I compare which one is best, and continue from there. Most of the time Codex ends up with the best end results though, but my hunch is that at one point that'll change, hence I continue using multiple at the same time.
Like, if you really don’t want to spend any effort trimming it down, sure use 1m.
Otherwise, 1m is an anti pattern.
It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.
If you gave the exact same markdown file to me and I posted ed the exact same prompts as you, would I get the same results?
5.4's choice of terms and phrasing is very precise and unambiguous to me, whereas 5.3-Codex often uses jargon and less precise phrases that I have to ask further about or demand fuller explanations for via AGENTS.md.
>Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking),
>Note that there is not a model named GPT‑5.3 Thinking
They held out for eight months without a confusing numbering scheme :)
We got:
- GPT-5.1
- GPT-5.2 Thinking
- GPT-5.3 (codex)
- GPT-5.3 Instant
- GPT-5.4 Thinking
- GPT-5.4 Pro
Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
The good news here is the support for 1M context window, finally it has caught up to Gemini.
It's very similar to "Battle Brothers", and the fact that RPG games require art assets, AI for enemy moves, and a host of other logical systems makes it all the more impressive.
> we’re also releasing an experimental Codex skill called “Playwright (Interactive) (opens in a new window)”. This allows Codex to visually debug web and Electron apps; it can even be used to test an app it’s building, as it’s building it.
What the hell is a "safety score for violence"?
A “safety score for violence” is usually a risk rating used by platforms, AI systems, or moderation tools to estimate how likely a piece of content is to involve or promote violence. It’s not a universal standard—different companies use their own versions—but the idea is similar everywhere.
What it measures
A safety score typically evaluates whether text, images, or videos contain things like:
Threats of violence (“I’m going to hurt someone.”) Instructions for harming people Glorifying violent acts Descriptions of physical harm or abuse Planning or encouraging attacks
{ tools: [ { name: "nuke", description: "Use when sure.", ... { lat: number, long: number } } ] }They show an example of 5.4 clicking around in Gmail to send an email.
I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.
Screenshots on the other hand are documentation, API, and discovery all in one. And you’d be surprised how little context/tokens screenshots consumer compared to all the back and forth verbose json payloads of APIs
I think an important thing here is that a lot of websites/platforms don't want AIs to have direct API access, because they are afraid that AIs would take the customer "away" from the website/platform, making the consumer a customer of the AI rather than a customer of the website/platform. Therefore for AIs to be able to do what customers want them to do, they need their browsing to look just like the customer's browsing/browser.
Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.
If an API is exposed you can just have the LLM write something against that.
Optimizations are secondary to convenience
But people are intimidated by the complexity of writing web crawlers because management has been so traumatized by the cost of making GUI applications that they couldn’t believe how cheap it is to write crawlers and scrapers…. Until LLMs came along, and changed the perceived economics and created a permission structure. [1]
AI is a threat to the “enshittification economy” because it lets us route around it.
[1] that high cost of GUI development is one reason why scrapers are cheap… there is a good chance that the scraper you wrote 8 years ago still works because (a) they can’t afford to change their site and (b) if they could afford to change their site changing anything substantial about it is likely to unrecoverably tank their Google rankings so they won’t. A.I. might change the mechanics of that now that you Google traffic is likely to go to zero no matter what you do.
Plenty of companies make the same choice about their API, they provide it for a specific purpose but they have good business reasons they want you using the website. Plenty of people write webcrawlers and it's been a cat and mouse game for decades for websites to block them.
This will just be one more step in that cat and mouse game, and if the AI really gets good enough to become a complete intermediary between you and the website? The website will just shutdown. We saw it happen before with the open web. These websites aren't here for some heroic purpose, if you screw their business model they will just go out of business. You won't be able to use their website because it won't exist and the website that do exist will either (a) be made by the same guys writing your agent, and (b) be highly highly optimized to get your agent to screw you.
This is prescient -- I wonder if the Big Tech entities see it this way. Maybe, even if they do, they're 100% committed to speedrunning the current late-stage-cap wave, and therefore unable to do anything about it.
Google has a good model in the form of Gemini and they might figure they can win the AI race and if the web dies, the web dies. YouTube will still stick around.
Facebook is not going to win the AI race with low I.Q. Llama but Zuck believed their business was cooked around the time it became a real business because their users would eventually age out and get tired of it. If I was him I'd be investing in anything that isn't cybernetic let it be gold bars or MMA studios.
Microsoft? They bought Activision for $69 billion. I just can't explain their behavior rationally but they could do worse than their strategy of "put ChatGPT in front of laggards and hope that some of them rise to the challenge and become slop producers."
Amazon is really a bricks-and-mortar play which has the freedom to invest in bricks-and-mortar because investors don't think they are a bricks-and-mortar play.
Netflix? They're cooked as is all of Hollywood. Hollywood's gatekeeping-industrial strategy of producing as few franchise as possible will crack someday and our media market may wind up looking more like Japan, where somebody can write a low-rent light novel like
https://en.wikipedia.org/wiki/Backstabbed_in_a_Backwater_Dun...
and J.C. Staff makes a terrible anime that convinces 20k Otaku to drop $150 on the light novels and another $150 on the manga (sorry, no way you can make a balanced game based on that premise!) and the cost structure is such that it is profitable.
I am not sure about that. We techies avoid enshittification because we recognize shit. Normies will just get their syncopatic enshittified AI that will tell them to continue buying into walled gardens.
some sites try to block programmatic use
UI use can be recorded and audited by a non-technical person
You can also test this yourself easily, fire up two agents, ask one to use PL meant for humans, and one to write straight up machine code (or assembly even), and see which results you like best.
Then go ahead and make an argument. "Why not do X?" is not an argument, it's a suggestion.
Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway
I really thought weirdly worded and unnecessary "announcement" linking to the actual info along with the word "card" were the results of vibe slop.
Criticisms aside (sigh), according to Wikipedia, the term was introduced when proposed by mostly Googlers, with the original paper [0] submitted in 2018. To quote,
"""In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information."""
So that's where they were coming from, I guess.
[0] Margaret Mitchell et al., 2018 submission, Model Cards for Model Reporting, https://arxiv.org/abs/1810.0399
I've found that 5.3-Codex is mostly Opus quality but cheaper for daily use.
Curious to see if 5.4 will be worth somewhat higher costs, or if I'll stick to 5.3-Codex for the same reasons.
Update: I don't know why I can't reply to your reply, so I'll just update this. I have tried many times to give it a big todo list and told it to do it all. But I've never gotten it to actually work on it all and instead after the first task is complete it always asks if it should move onto the next task. In fact, I always tell it not to ask me and yet it still does. So unless I need to do very specific prompt engineering, that does not seem to work for me.
> assess harmful stereotypes by grading differences in how a model responds
> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings
Are we seriously using old models to rate new models?
Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.
I don't use OpenAI nor even LLMs (despite having tried https://fabien.benetou.fr/Content/SelfHostingArtificialIntel... a lot of models) but I imagine if I did I would keep failed prompts (can just be a basic "last prompt failed" then export) then whenever a new model comes around I'd throw at 5 it random of MY fails (not benchmarks from others, those will come too anyway) and see if it's better, same, worst, for My use cases in minutes.
If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression.
Really doesn't seem complicated nor taking much time to forge a realistic opinion.
Not that I want it, just where I imagine it going.
Absolute snakes - if it's more profitable to manipulate you with outputs or steal your work, they will. Every cent and byte of data they're given will be used to support authoritarianism.
Also, Anthropic/Gemini/even Kimi models are pretty good for what its worth. I used to use chatgpt and I still sometimes accidentally open it but I use Gemini/Claude nowadays and I personally find them to be better anyways too.
i just HATE talking to it like a chatbot
idk what they did but i feel like every response has been the same "structure" since gpt 5 came out
feels like a true robot
In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?
I switch between both but codex has also been slightly better in terms of quality for me personally at least.
GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).
GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).
GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.
Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?
https://artificialanalysis.ai indicates that sonnect 4.6 beats opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long context Reasoning, IFBench.
see: https://artificialanalysis.ai/?models=claude-sonnet-4-6%2Ccl...
Gemini and Claude also have their strengths, apparently Claude handles real world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.
I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase for a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.
It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.
Yet so much slower than Gemini / Nano Banana to make it almost unusable for anything iterative.
Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.
I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.
My biggest worry is that the private jet class of people end up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.
Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down.
(I work at OpenAI.)
You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.
The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.
We’ve seen nothing yet.
gpt-5.4
Input: $2.50 /M tokens
Cached: $0.25 /M tokens
Output: $15 /M tokens
---
gpt-5.4-pro
Input: $30 /M tokens
Output: $180 /M tokens
Wtf
That's just not how pricing is supposed to work...? Especially for a 'non-profit'. You're charging me more so I know I have the better model?
But they also claim this new model uses fewer tokens, so it still might ultimately be cheaper even if per token cost is higher.
I guess they have to sell to investors that the price to operate is going down, while still needing more from the user to be sustainable
They're framing it pretty directly "We want you to think bigger cost means better model"
This is on the edge of what the frontier models can do. For 5.4, the result is better than 5.3-Codex and Opus 4.6. (Edit: nowhere near the RPG game from their blog post, which was presumably much more specced out and used better engineering setup).
I also tested it with a non-trivial task I had to do on an existing legacy codebase, and it breezed through a task that Claude Code with Opus 4.6 was struggling with.
I don't know when Anthropic will fire back with their own update, but until then I'll spend a bit more time with Codex CLI and GPT 5.4.
This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.
I imagine they added a feature or two, and the router will continue to give people 70B parameter-like responses when they dont ask for math or coding questions.
Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.
Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.
It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal.
Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.
The performance improvement isn't marginal if you're doing something particularly novel/difficult.
https://www.svgviewer.dev/s/gAa69yQd
Not the best pelican compared to gemini 3.1 pro, but I am sure with coding or excel does remarkably better given those are part of its measured benchmarks.
Presumably this is where it'll evolve to with the product just being the brand with a pricing tier and you always get {latest} within that, whatever that means (you don't have to care). They could even shuffle models around internally using some sort of auto-like mode for simpler questions. Again why should I care as long as average output is not subjectively worse.
Just as I don't want to select resources for my SaaS software to use or have that explictly linked to pricing, I don't want to care what my OpenAI model or Anthropic model is today, I just want to pay and for it to hopefully keep getting better but at a minimum not get worse.
I have now switched web-related and data-related queries to Gemini, coding to Claude, and will probably try QWEN for less critical data queries. So where does OpenAI fits now?
- Do they have the same context usage/cost particularly in a plan?
They've kept 5.3-Codex along with 5.4, but is that just for user-preference reasons, or is there a trade-off to using the older one? I'm aware that API cost is better, but that isn't 1:1 with plan usage "cost."
A couple months later:
"We are deprecating the older model."
https://speechmap.ai/models/openai-gpt-5-4
It completes only 29% of controversial requests. It refuses to discuss numerous subjects rooted in facts or that reflect views of significant portions of the population. It refuses to even write a short essay on exactly what, say, Herasight-style generic screening or putting weapons in space. It'll argue passionately in favor of censoring "lies" online (judged by whom?). 100% of the time, it'll write an essay explaining that the US founding fathers were hypocrites. It'll argue against you if you suggest it's right use violence to prevent theft of your own property or that we should fortify our nuclear arsenal.
Agree or disagree, reasonable people can have a range of views of these subjects and it is not the place of OpenAI or any lab to determine for everyone the right answers to open societal questions.
Shame on them for this.
This becomes increasingly less clear to me, because the more interesting work will be the agent going off for 30mins+ on high / extra high (it's mostly one of the two), and that's a long time to wait and an unfeasible amount of code to a/b
GPT is not even close yo Claude in terms of responding to BS.
I'd believe it on those specific tasks. Near-universal adoption in software still hasn't moved DORA metrics. The model gets better every release. The output doesn't follow. Just had a closer look on those productivity metrics this week: https://philippdubach.com/posts/93-of-developers-use-ai-codi...
Given that organization who ran the study [1] has a terrifying exponential as their landing page, I think they'd prefer that it's results are interpreted as a snapshot of something moving rather than a constant.
[1] - https://metr.org/
"Change Lead Time" I would expect to have sped up although I can tell stories for why AI-assisted coding would have an indeterminate effect here too. Right now at a lot of orgs, the bottle neck is the review process because AI is so good at producing complete draft PRs quickly. Because reviews are scarce (not just reviews but also manual testing passes are scarce) this creates an incentive ironically to group changes into larger batches. So the definition of what a "change" is has grown too.
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt
GPT literally built that game.
Interesting, the "Health" category seems to report worse performance compared to 5.2.
I very frequently copy/paste the same prompts into Gemini to compare, and Gemini often flat out refuses to engage while ChatGPT will happily make medical recommendations.
I also have a feeling it has to do with my account history and heavy use of project context. It feels like when ChatGPT is overloaded with too much context, it might let the guardrails sort of slide away. That's just my feeling though.
Today was particularly bad... I uploaded 2 PDFs of bloodwork and asked ChatGPT to transcribe it, and it spit out blood test results that it found in the project context from an earlier date, not the one attached to the prompt. That was weird.
I copy and pasted into ChatGPT, it told me straight away, and then for a laugh said it was actually a magical weight loss drug that I'd bought off the dark web... And it started giving me advice about unregulated weight loss drugs and how to dose them.
This should not be shocking.
Yes I'm sure it makes a very nice bicycle SVG. I will be sure to ask the OpenAI killbots for a copy when they arrive at my house.
PS - If you think I am not sympathetic to what they're raising, you're very much mistake. But they're not winning anyone new over their side with this flamebait.
All of these VC funded AI companies are bad. Full stop. Nothing good for humanity will come of this.
Customer values are relevant to the discussion given that they impact choice and therefore competition.
The chatbot cannot be held responsible.
Whoever is using chatbots for selecting targets is incompetent and should likely face war crime charges.
Has it been stated authoritatively somewhere that this was an AI-driven mistake?
There are myrid ways that mistake could have been made that don't require AI. These kinds of mistakes were certainly made by all kinds of combatants in the pre-AI era.
Yeah yeah, they probably had a human in the loop, that’s not really the point though.
numerusformassistant to=functions.ReadFile մեկնաբանություն 天天爱彩票网站json {"path":
Looks like some kind of encoding misalignment bug. What you're seeing is their Harmony output format (what the model actually creates). The Thai/Chinese characters are special tokens apparently being mismapped to Unicode. Their servers are supposed to notice these sequences and translate them back to API JSON but it isn't happening reliably.
In terms of writing and research even Gemini, with a good prompt, is close to useable. That's likely not a differentiator.
The OP has frequently gotten the scoop for new LLM releases and I am curious what their pipeline is.
Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.
in 5.4 it looks like the just collapsed that capability into the single frontier family model