GPT 4o pricing for comparison: Input: $2.50 / 1M tokens; Cached input: $1.25 / 1M tokens; Output: $10.00 / 1M tokens
It sounds like it's so expensive, and the gain in usefulness so lacking(?), that they're not even gonna keep serving it in the API for long:
> GPT‑4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT‑4o. Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models. We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If GPT‑4.5 delivers unique value for your use case, your feedback will play an important role in guiding our decision.
I'm still gonna give it a go, though.
"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Not a confident place for an org trying to sustain a $XXXB valuation.
"Early testing doesn't show that it hallucinates less, but we expect that putting that sentence nearby will lead you to draw a connection there yourself".
There is no fire at all in the painting, only some smoke.
https://en.wikipedia.org/wiki/The_Trojan_Women_Set_Fire_to_t...
I don't see AI going any differently. Some companies will figure out where and how models should be utilized, they'll see some benefit. (IMO, the answer will be smaller local models tailored to specific domains)
Others will go bust. Same as it always was.
God help us all
Possibly.
I am reminded of the dotcom boom and bust back in the 1990s
By 2009 things had recovered (for some definition) and we could tell what did and did not work
This time, though, for those of us not in the USA the rebound will be led by Chinese technology
In the USA no-one can say.
I suck at and hate writing the mildly deceptive corporate puffery that seems to be in vogue. I wonder if GPT-4.5 can write that for me or if it's still not as good at it as the expert they paid to put that little gem together.
It'd be funny if it's actually fully automated, closed-loop automation of capital-allocation markets.
"Why are we doing this? How much money are we getting?" -> "I dunno. It's what the models said."
> "I dunno. It's what the models said."
The obvious human idiocy in such things often obscures the actual process:
"What it [capitalism] is in itself is only tactically connected to what it does for us — that is (in part), what it trades us for its self-escalation. Our phenomenology is its camouflage. We contemptuously mock the trash that it offers the masses, and then think we have understood something about capitalism, rather than about what capitalism has learnt to think of the apes it arose among." [0]
The link shows a significant reduction.
grep for "hallucination", or see https://imgur.com/a/mkDxe78.
(And even that was a downgrade compared to the more uncensored pre-release versions, which were comparable to GPT-4.5, at least judging by the unicorn test)
in our case, the bump was actually from gpt-4-vision to gpt-4o (the use case required image interpretation)
It got measurably better at both image cases and text-only cases
You need to remove your shoe and drive with like two toes to get the speed just right, though.
Test drivers I have done this with take off their shoes or use ballerina shoes.
And keeping steady state speed is not that hard.
Releasing GPT 4.5 might simply be a reaction to Claude 3.7.
Wow, I'm old.
Now get off my lawn ))
o1 preview. o1 mini. o1. sora. o3-mini <- very good at code
OpenAI has had many releases since gpt4. Many of them have been substantial upgrades. I have considered gpt4 to be outdated for almost 5-6 months now, long before Claude's patch.
[0] https://the-decoder.com/openai-quietly-funded-independent-ma...
We can assume they’re lying too but at some point “everyone’s bad because they’re lying, which we know because they’re bad” gets a little tired.
2. We know that “open”ai is bad, for many reasons, but this is irrelevant. I want processes themselves to not depend on the goodwill of a corporation to give intended results. I do not trust benchmarks that were first presented as secret and were then revealed not to be, regardless of whether the product benchmarked was from a company I otherwise trust or not.
If a scientific paper comes out with “empirical data”, I will still look at the conflicts of interest section. If there are no conflicts of interest listed, but then it is found out that there are multiple conflicts of interest, and the authors promise that while they did not disclose them, the conflicts also did not affect the paper, I would be more skeptical. I am not “offended”. I am not “rejecting” the data, but I am taking those factors into account when determining how confident I can be in the validity of the data.
This isn't what happened? I must be missing something.
AFAIK:
The FrontierMath people self-reported they had a shared folder the OpenAI people had access to that had a subset of some questions.
No one denied anything, no one lied about anything, no one said they didn't have access. There was no data obtained under the table.
The motte is "they had data for this one benchmark"
The bailey is "they got data under the table"
Bailey: "one is free to trust their "verbal agreement" that they did not train their models on that, but access they did have."
Sigh.
> Bailey: "one is free to trust their "verbal agreement" that they did not train their models on that, but access they did have."
1. You’re confusing motte and bailey.
2. Those statements are logically identical.
Motte and Bailey refers to an argumentative tactic where someone switches between an easily defensible ("motte") position and a less defensible but more ambitious ("bailey") position. My example should have been:
- Motte (defensible): "They had access to benchmark data (which isn't disputed)."
- Bailey (less defensible): "They actually trained their model using the benchmark data."
The statements you've provided:
"They got caught getting benchmark data under the table" (suggesting improper access)
"One is free to trust their 'verbal agreement' that they did not train their models on that, but access they did have."
These two statements are similar but not logically identical. One explicitly suggests improper or secretive access ("under the table"), while the other acknowledges access openly.
So, rather than being logically identical, the difference is subtle but meaningful. One emphasizes improper access (a stronger claim), while the other points only to possession or access, a more easily defensible claim.
It was not public until later, and it was actually revealed first by others. So the statements seem identical to me.
What is "this"?
> obviously the problem with getting "data under the table" is that they may have used it to training their models
I've been avoiding mentioning the maximalist version of the argument (they got data under the table AND used it to train models), because training wasn't stated until now, and it would have been unfair to bring it up without mention. That is, that's 2 baileys out from "they had access to a shared directory that had some test qs in it, and this was reported publicly, and fixed publicly"
There's been a fairly severe communication breakdown here. I don't want to distract from e.g. what the nonsense is, so I won't belabor that point, but I don't want you to think I don't want to engage on it - just won't in this singular post.
> but the only reassurance being some "verbal agreement", as is reported, is not very reassuring
It's about as reassuring as it gets without them releasing the entire training data, which is, at best, with charity, marginally, oh so marginally, reassuring, I assume? If the premise is we can't trust anything self-reported, they could lie there too?
> People are free to adjust their P(model_capabilities|frontiermath_results) based on their own priors.
Certainly, that's not in dispute (perhaps the idea that you are forbidden from adjusting your opinion is the nonsense you're referring to? I certainly can't control that :) Nor would I want to!)
And FFS I assume the dispute is about the P given by people, not about if people are allowed to have a P.
People _seriously_ underestimate just how much stuff is online and how much impact it can have on training.
Or is the assumption that the training set is so big it doesn't matter?
Perhaps they are/were going for stealth therapy-bot with this.
Empathy done well seems like 1:1 mapping at an emotional level, but that doesn’t imply to me that it couldn’t be done at a different level of modeling. Empathy can be done poorly, and then it is projecting.
Does one of these have a higher EQ, despite both being ink and paper and definitely not sentient?
Now, imagine they were produced by two different AIs. Does one AI demonstrate higher EQ?
The trick is in seeing that “EQ of a text response” is not the same thing as “EQ of a sentient being”
This is a designed system. The designers make choices. I don’t see how failing to plan and design for a common use case would be better.
= not to say that the people that work on AI are not incredibly talented, but more that it's not human
training it to pretend to be a feelingless robot or a sympathetic mother are both weird to me. it should state facts with us.
You are confusing a specific geographical sense of “greater” (e.g. “greater New York”) with the generic sense of “greater” which just means “more great”. In “7 is greater than 6”, “greater” isn’t geographic
The difference between “greater” and “better”, is “greater” just means “more than”, without implying any value judgement-“better” implies the “more than” is a good thing: “The Holocaust had a greater death toll than the Armenian genocide” is an obvious fact, but only a horrendously evil person would use “better” in that sentence (excluding of course someone who accidentally misspoke, or a non-native speaker mixing up words)
I read the article but can't find the numbers you are referencing. Maybe there's some paper linked I should be looking at? The only numbers I see are from the SimpleQA chart, which are 37.1% vs 61.8% hallucination rate. That's nice but considering the price increase, is it really that impressive? Also, an often repeated criticism is that relying on known benchmarks is "gaming the numbers" and that the real world hallucination rate could very well be higher.
Lastly, they themselves say: > We also expect it to hallucinate less.
That's a fairly neutral statement for a press release. If they were convinced that the reduced hallucination rate is the killer feature that sets this model apart from the competition, they surely would have emphasized that more?
All in all I can understand why people would react with some mocking replies to this.
I don't know what else to say.
Here, imgur: https://imgur.com/a/mkDxe78. Can't get easier.
> equally unfounded attacks
No, because I have a source and didn't make up things someone else said.
> a straight up "you are lying".
Right, because they are. There are hallucination stats right in the post he mocks for not providing stats.
> That's nice but considering the price increase,
I can't believe how quickly you acknowledge it is in the post after calling the idea it was in the post "equally unfounded". You are looking at the stats. They were lying.
> "That's nice but considering the price increase,"
That's nice and a good argument! That's not what I replied to. I replied to they didn't provide any stats.
You also led with the pretty strong assertion that the previous commenter was lying, seemingly without providing proof anyone else can find.
I can't provide images here.
I provided the numbers.
What more can I do to show them? :)
Also, the person you replied to was commenting on the wording tricks they use. Given that you suddenly brought new data and a new direction into the discussion, even calling them "wrong" would have been a stretch.
I kindly suggest that you (and we all!) keep discussing with an assumption of good faith.
The link, the link we are discussing shows testing, with numbers.
They say "early testing doesn't show that it hallucinates less", to provide a basis for a claim of bad faith.
You are claiming that mentioning this is out of bounds if it contains the word lying. I looked up the definition. It says "used with reference to a situation involving deception or founded on a mistaken impression."
What am I missing here?
Let's pretend lying means You Are An Evil Person And This Is Personal!!!
How do I describe the fact that what they claim is false?
Am I supposed to be sarcastic and pretend They are in on it and edited their post to discredit him after the fact?
That comment is making fun of their wording. Maybe extracting too much meaning from their wordplay? Maybe.
Afterwards, evidence is presented that they did not have to do this, which makes that point not so important, and even wrong.
The commenter was not lying, and they were correct about how masterfully deceptive that sequence of sentences is. They arrived at a wrong conclusion though.
Kindly point that out. Say, "hey, the numbers tell a different story, perhaps they didn't mean/need to make a wordplay there".
No? By the way, what is this comment, exactly? What is it trying to communicate? What I'm understanding is, it is good to talk down to people about how "they can't communicate", but calling a lie a lie is bad, because maybe they were just kidding (lying for fun)
> That comment is making fun of their wording. Maybe extracting too much meaning from their wordplay? Maybe.
What does "maybe" mean here, in terms of symbolical logic?
Their claim was "we tested it and it didn't get better" -- and the link shows they tested it, and it did get better! It's pretty clear-cut.
> Do I need to tell you how to communicate?
That addresses it.
> What does "maybe" mean here, in terms of symbolical logic?
I'm answering my own question to make it clear I'm guessing.
For the rest, I'm sure that we need a break. It's normal to get frustrated when many people correct us, or even one passionate individual like you, and we tend to keep defending (has happened here many times too!), because defending is the only thing left. Taking a break always helps. Just friendly advice, take it or leave it :)
- 2 hours in, -3.
2 replies:
- [It's because] you're hysterical
- [It's because you sound] like a crypto bro
- [It's because] you make an equally unfounded claim
- [It's because] you didn't provide any proof
(Ed.: It is right in the link! I gave the #s! I can't ctrl-F...What else can I do here...AFAIK can't link images...whatever, here's imgur. https://imgur.com/a/mkDxe78)
- [It's because] you sound personally offended
(Ed.: Is "personally" is a shibboleth here, meaning expressing disappointment in people making things up is so triggering as invalidate the communication that it is made up?)
You are lying.
This is an ad hominem which assumes intent unknown to anyone other than the person to whom you replied. Subsequently railing against comment rankings and enumerating curt summaries of other comments does not help either.
What am I missing here?
Those weren't curt summaries, they were quotes! And not pull quotes, they were the unedited beginning of each claim!
> What am I missing here?
Intent. Neither you nor I know what the person to whom you replied had.
> Those weren't curt summaries, they were quotes! And not pull quotes, they were the unedited beginning of each claim!
Maybe the more important part of that sentence was:
Subsequently railing against comment rankings ...
But you do you. I commented as I did in the hope it helped address what I interpreted as confusion regarding how the posts were being received. If it did not help, I apologize.
>> [elided] What am I missing here?
> Intent. Neither you nor I know what the person to whom you replied had.
Here's the part you elided:
"I looked up the definition [of lying]. It says "used with reference to a situation involving deception or founded on a mistaken impression."
That makes it quite clear whether or not I'm missing "intent".
This also makes it quite clear that I am not making an ad hominem.
I am using a simple, everyday, word used to describe the act of advancing false claims, whether through deception or mistaken impression.
A lot of folks here have their stock portfolios propped up by AI companies but think they've been overhyped (even if only indirectly through a total stock index). Some were saying all along that this has been a bubble but have been shouted down by true believers hoping for the singularity to usher in techno-utopia.
These signs that perhaps it's been a bit overhyped are validation. The singularity worshipers are much less prominent and so the comments rising to the top are about negatives and not positives.
Ten years from now everyone will just take these tools for granted as much as we take search for granted now.
I don’t feel this is a weak result. Consider if you built a new version that you _thought_ would perform much better, and then you found that it offered marginal-but-not-amazing improvement over the previous version. It’s likely that you will keep iterating. But in the meantime what do you do with your marginal performance gain? Do you offer it to customers or keep it secret? I can see arguments for both approaches, neither seems obviously wrong to me.
All that being said, I do think this could indicate that progress with the new ml approaches is slowing.
You better ship it
Then call it GPT 4.1 and allow version space for the next iteration.
I think the label V4.5 is giving the impression of more than marginal improvements.
"Uh... excuse me, Detective Nick Danger? I'd like to retain your services."
"I waited for her to get the the point."
"Detective, who are you talking to?"
"I didn't want to deal with a client that was hearing voices, but money was tight and the rent was due. I pondered my next move."
"Mr. Danger, are you... narrating out loud?"
"Damn! My internal chain of thought, the key to my success--or at least, past successes--was leaking again. I rummaged for the familiar bottle of scotch in the drawer, kept for just such an occasion."
---
But seriously: These "AI" products basically run on movie-scripts already, where the LLM is used to append more "fitting" content, and glue-code is periodically performing any lines or actions that arise in connection to the Helpful Bot character. Real humans are tricked into thinking the finger-puppet is a discrete entity.
These new "reasoning" models are just switching the style of the movie script to film noir, where the Helpful Bot character is making a layer of unvoiced commentary. While it may make the story more cohesive, it isn't a qualitative change in the kind of illusory "thinking" going on.
I've been banging that particular drum for a while on HN, and the mental-model still feels so intuitively strong to me that I'm starting to have doubts: "It feels too right, I must be wrong in some subtle yet devastating way."
And maybe Tesla is going to deliver truly full self driving tech any day now.
And Star Citizen will prove to have been worth it all along, and Bitcoin will rain from the heavens.
It's very difficult to remain charitable when people seem to always be chasing the new iteration of the same old thing, and we're expected to come along for the ride.
At least this is the benchmark for super-human general intelligence that I propose.
Once they've implemented saccades in the eyeballs of the characters wearing helmets in spaceships millions of kilometres apart, then it will all have been worth it.
And Star Citizen will prove to have been worth it all along
Sounds like someone isn't happy with the 4.0 eternally incrementing "alpha" version release. :-D I keep checking in on SC every 6 months or so and still see the same old bugs. What a waste of potential. Fortunately, Elite Dangerous is enough of a space game to scratch my space game itch.
The misallocation of capital also applies to GPT-4.5/OpenAI at this point.
Their backend is probably an async nightmare of replicated state that gets corrupted over time. Would explain why a lot of things seem to work more or less bug free after an update and then things fall to pieces and the same old bugs start showing up after a few weeks.
And to be clear, I've spent money on SC and I've played enough hours goofing off with friends to have got my money's worth out of it. I'm just really bummed out about the whole thing.
Having worked at my fair share of big tech companies (while preferring to stay in smaller startups), in so many of these tech announcements I can feel the pressure the PM had from leadership, and hear the quiet cries of the one to two experienced engineers on the team arguing sprint after sprint that "this doesn't make sense!"
“I have five years of Cassandra experience—and I don’t mean the db”
Damn this never worked for me as a startup founder lol. Need that Altman "rizz" or what have you.
It is the solution to almost nothing, but is being shoehorned into every imaginable role by people who are blind to its shortcomings, often wilfully. The only thing that's obvious to me is that a great number of people are apparently desperate for a tool to do their thinking for them, no matter how garbage the result is. It's disheartening to realize that so many people consider using their own brain to be such an intolerable burden.
I think "generalize properly to out-of-distribution data" is too weak of criteria for general intelligence (GI). GI model should be able to get interested about some particular area, research all the known facts, derive new knowledge / create theories based upon said fact. If there is not enough of those to be conclusive: propose and conduct experiments and use the results to prove / disprove / improve theories. And it should be doing this constantly in real time on bazillion of "ideas". Basically model our whole society. Fat chance of anything like this happening in foreseeable future.
Besides, humans are capable of rigorous logic (which I believe is the most crucial aspect of intelligence) which I don’t think an agent without a proof system can do.
Current LLMs are a waste and quite a bit of a step back compared to older Machine Learning models IMO. I wouldn't necessarily have a huge beef with them if billions of dollars weren't being used to shove them down our throats.
LLMs actually do have usefulness, but none of the pitched stuff really does them justice.
Example: Imagine knowing you had the cure for Cancer, but instead discovered you can make way more money by declaring it to solve all of humanity, then imagine you shoved that part down everyones' throats and ignored the cancer cure part...
I consider myself an AI skeptic and as soon as the hype train went full steam, I assumed a crash/bubble burst was inevitable. Still do.
With the rare exception, I don’t know of anyone who has expected the bubble to burst so quickly (within two years). 10 times in the last 2 years would be every two and a half months — maybe I’m blinded by my own bias but I don’t see anyone calling out that many dates
But that didn't mean the internet should be ignored, and the same holds true for AI today IMO
Oh come on. Think how long of a gap there was between the first microcomputer and VisiCalc. Or between the start of the internet and social networking.
First of all, it's going to take us 10 years to figure out how to use LLM's to their full productive potential.
And second of all, it's going to take us collectively a long time to also figure out how much accuracy is necessary to pay for in which different applications. Putting out a higher-accuracy, higher-cost model for the market to try is an important part of figuring that out.
With new disruptive technologies, companies aren't supposed to be able to look into a crystal ball and see the future. They're supposed to try new things and see what the market finds useful.
Going back to the very first commercially available microcomputer, the Altair 8800 (which is not a great match, since that was sold as a kit with binary switches, 1 byte at a time, for input, much more primitive than ChatGPT's UX), that's four years and nine months to VisiCalc's release. This isn't a decade-long process of figuring things out; it actually tends to move real fast.
1. The first microcomputer predates VisiCalc, yes, but it doesn't predate the realization of what it could be useful for. The Micral was released in 1973. Douglas Engelbart gave "The Mother of All Demos" in 1968 [2]. It included things that wouldn't be commonplace for decades, like a collaborative real-time editor or video-conferencing.
I wasn't yet born back then, but reading about the timeline of things, it sounds like the industry had a much more concrete and concise idea of what this technology would bring to everyone.
"We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings." doesn't inspire that sentiment for something that's already being marketed as "the beginning of a new era" and valued so exorbitantly.
2. I think as AI becomes more generally available, and "good enough" people (understandably) will be more skeptical of closed-source improvements that stem from spending big. Commoditizing AI is more clearly "useful", in the same way commoditizing computing was more clearly useful than just pushing numbers up.
Again, I wasn't yet born back then, but I can imagine the announcement of the Apple Macintosh with its 6MHz CPU and 128KB RAM was more exciting and had a bigger impact than the announcement of the Cray-2 with its 1.9 GFLOPS and 1GB+ of memory.
But this is about one particular price/performance ratio.
You need to build things before you can see how the market responds. You say it's "not good business" but that's entirely wrong. It's excellent business. It's the only way to go about it, in fact.
Finding product-market fit is a process. Companies aren't omniscient.
Of course, this benefits the likes of Sam Altman, Satya Nadella et al, but has not produced the value promised, and does not appear poised to.
And here you have one of the supposed bleeding edge companies in this space, who very recently was shown up by a much smaller and less capitalized rival, asking their own customers to tell them what their product is good for.
Not a great look for them!
their endgame goal was to replace humans entirely; robotics and AI are a perfect match to replace all humans altogether
They don't need to find a problem, because the problem is full automation from start to end
An FTL spaceship is all we need to make space travel viable between solar systems. This is the solution to the depletion of resources on Earth...
Or has that been a success with tons of productive uses in your opinion?
At some point, I'd like to hear more than 'trust me bro, it'll be great' when we use up non-trivial amounts of finite resources to try these 'things'.
I disagree strongly with that. Right now they are fun toys to play with, but not useful tools, because they are not reliable. If and when that gets fixed, maybe they will have productive uses. But for right now, not so much.
I use it most days when coding. Not all the time, but I’ve gotten a lot of value out of them.
And yes I'm quite aware of their pitfalls.
It makes mistakes, it gets things wrong, and it still saves a bunch of time. A 10 minute refactoring turns into 30 seconds of making a request, 15 seconds of waiting, and a minute of reviewing and fixing up the output. It can give you decent insights into potential problems and error messages. The more precise your instructions, the better they perform.
Being unreliable isn't being useless. It's like a very fast, very cheap intern. If you are good at code review and know exactly what change you want to make ahead of time, that can save you a ton of time without needing to be perfect.
I recently (in the last month) gave it a shot. I would say only once in the maybe 30 or 40 times I used it did it save me any time. The one time it did, I had each line filled in with pseudo code describing exactly what it should do… I just didn't want to look up the APIs
I am glad it is saving you time but it’s far from a given. For some people and some projects, intern level work is unacceptable. For some people, managing is a waste of time.
You’re basically introducing the mythical man month on steroids as soon as you start using these
This is no less true of statements made to the contrary. Yet they are stated strongly as if they are fact and apply to anyone beyond the user making them.
Usefulness is subjective.
I read the original comment as "pay $100 and just go for it!" which didn't seem like the right way to do it. Other comments seem to indicate there are $100 worth of credits that are claimable perhaps
One can evaluate LLMs sufficiently with the free trials that abound :) and indeed one may find them worth it to themselves. I don't disparage anyone who signs up for the plans
Programmers and the like are a large portion of LLM users and boosters; very few will deny usefulness in that/those domains at this point.
Ironically enough, I'll bet the broadest exposure to LLMs the masses have is something like Microsoft shoehorning copilot-branded stuff into otherwise usable products and users clicking around it or groaning when they're accosted by a pop-up for it.
That's when you learn Vim, Emacs, and/or grep, because I'm assuming that's mostly variable renaming and a few function signature changes. I can't see anything more complicated, that I'd trust an LLM with.
What?
Explain what this code syntax means…
Explain what this function does…
Write a function to do X…
Respond to my teammates in a Jira ticket explaining why it’s a bad idea to create a repo for every dockerfile…
My teammate responded with X write a rebuttal…
… and the list goes on … like forever
As an aside, I sincerely hope our “human” conversations don’t devolve into agents talking to each other. It’s just an insult to humanity.
This right here. I used to spend tons of time making sure my emails were perfect. Is it professional enough, am I being too terse, etc…
LLMs will be gone in 10 years. At least in the form we know, with direct access. Everything moves so fast that there is no reason to think nothing better is coming.
BTW, what we've learned so far about LLMs will be outdated as well. Just me thinking. Like with 'thinking' models, the previous generation can be used to create the dataset for the next one. It could be that we can find a way to convert a trained LLM into something more efficient and flexible. Some sort of a graph, probably. Which can be embedded into a mobile robot's brain. Another way is 'just' to upgrade the hardware. But that is slow and has its limits.
You're assuming that point is somewhere above the current hype peak. I'm guessing it won't be, it will be quite a bit below the current expectations of "solving global warming", "curing cancer" and "making work obsolete".
Then another 30 to finally stop using them in dumb and insecure ways. :p
Where is this quote from?
https://en.m.wikipedia.org/wiki/Scare_quotes
> Whether quotation marks are considered scare quotes depends on context because scare quotes are not visually different from actual quotations.
Though, as someone who had a flagged comment a couple years ago for a supposed "misquote" I did in a similar form and style, I think HN's comprehension of this form of communication is not super strong. Also, the style more often than not tends towards low-quality smarm and probably should be resorted to sparingly.
>My aim is: to teach you to pass from a piece of disguised nonsense to something that is patent nonsense.
I keep a little google spreadsheet with some charts to help visualize the landscape at a glance in terms of capability/price/throughput, bringing in the various index scores as they become available. Hope folks find it useful, feel free to copy and claim as your own.
https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...
That's a nice sentiment, but I'd encourage you to add a license or something. The basic "something" would be adding a canonical URL into the spreadsheet itself somewhere, along with a notification that users can do what they want other than removing that URL. (And the URL would be described as "the original source" or something, not a claim that the particular version/incarnation someone is looking at is the same as what is at that URL.)
The risk is that someone will accidentally introduce errors or unsupportable claims, and people with the modified spreadsheet won't know that it's not The spreadsheet and so will discount its accuracy or trustability. (If people are trying to deceive others into thinking it's the original, they'll remove the notice, but that's a different problem.) It would be a shame for people to lose faith in your work because of crap that other people do that you have no say in.
how do you do the different size circles and colored sequences like that? this is god tier skills
also - love the podcast, one of my favorites. the 3:1 io token price breakdown in my sheet is lifted directly from charts I've seen on latent space.
Sadly little people noticed...
I don’t normally cosplay as a grammar Nazi but in this case I feel like someone should stand up for the little people :)
"Sadly, little people noticed."
(cue a group of little people holding pitchforks (normal forks upon closer inspection))
> bad news: it is a giant, expensive model. we really wanted to launch it to plus and pro at the same time, but we've been growing a lot and are out of GPUs. we will add tens of thousands of GPUs next week and roll it out to the plus tier then. (hundreds of thousands coming soon, and i'm pretty sure y'all will use every one we can rack up.)
After the scam came to light my friend said “if I knew I was paying for those dinners, I would have been fine with Denny’s[1]”
I wanted to tell him “you would have been paying for those dinners even if he wasn’t outright stealing your money,” but that seemed insensitive so I kept my mouth shut.
0 - a local steakhouse had a portrait of this guy drawn on the wall
1 - for any non-Americans, Denny’s is a low cost diner-style restaurant.
Paying someone to match your risk profile and financial goals may be worth the fee, which as you pointed out is very measurable. YMMV though.
I'm just a layperson, so I might be wrong in some way that I don't understand
As an LLM cynic, I feel that point passed long ago, perhaps even before Altman claimed countries would start wars to conquer the territory around GPU datacenters, or promoting the dream of a 7 T-for-trillion dollar investment deal, etc.
Alas, the market can remain irrational longer than I can remain solvent.
But maybe we just hit the point that the improvement of performance hit the slowing down part of a logistic curve, while the cost keeps increasing exponentially.
ceo: it's ready
the pricing is probably a mixture of dealing with GPU scarcity and intentionally discouraging actual users. I can't imagine the pressure they must be under to show they are releasing and staying ahead, but Altman's tweet makes it clear they aren't really ready to sell this to the general public yet.
For example, there are now a bunch of vendors that sell "respond to RFP" AI products. The number of RFPs that any sales organization responds to is probably no more than a couple a week, but it's a very time-consuming, laborious process. But the payoff is obviously very high if a response results in a closed sale. So here paying 30x for marginally better performance makes perfect sense.
I can think of a number of similar "high value, relatively low occurrence" use cases like this where the pricing may not be a big hindrance.
Esp. when they aren't even sure whether they will commit to offering this long term? Who would be insane enough to build a product on top of something that may not be there tomorrow?
Those products require some extensive work, such a model finetuning on proprietary data. Who is going to invest time & money into something like that when OpenAI says right out of the gate they may not support this model for very long?
Basically OpenAI is telegraphing that this is yet another prototype that escaped a lab, not something that is actually ready for use and deployment.
We’re one of those types of customers. We wrote an OpenAI API compatible gateway that automatically batches stuff for us, so we get 50% off for basically no extra dev work in our client applications.
I don’t care about speed, I care about getting the right answer. The cost is fine as long as the output generates us more profit.
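For anyone wondering what that kind of gateway boils down to: here's a minimal sketch (not the commenter's actual code) of pushing requests through OpenAI's Batch API, which is where the ~50% discount comes from. The model name, prompts, and file name are placeholders.

    import json
    from openai import OpenAI

    client = OpenAI()

    # Queue chat requests as JSONL lines; prompts/model here are just examples.
    requests = [
        {
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o",
                     "messages": [{"role": "user", "content": prompt}]},
        }
        for i, prompt in enumerate(["Summarize doc A", "Summarize doc B"])
    ]
    with open("batch_input.jsonl", "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in requests)

    # Upload the file and create the batch; results arrive asynchronously
    # (within 24h) in an output file, billed at roughly half the sync price.
    batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(input_file_id=batch_file.id,
                                  endpoint="/v1/chat/completions",
                                  completion_window="24h")
    print(batch.id, batch.status)  # poll later, then download batch.output_file_id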
Reasoning models are roughly the equivalent of allowing Hamiltonian Monte Carlo models to "warm up" (i.e. start sampling from the typical set). This, unsurprisingly, yields better results (after all, LLMs are just fancy Monte Carlo models in the end). However, it is extremely unlikely this improvement is without pretty reasonable limitations. Letting your HMC warm up is essential to good sampling, but letting it "warm up more" doesn't result in radically better sampling.
While there have been impressive results in the efficiency of sampling from the typical set seen in LLMs these days, we're clearly not making major improvements in the capabilities of these models.
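To make the warm-up point concrete, here is a toy sketch, using a plain random-walk Metropolis chain as a stand-in for HMC (the target, starting point, and step counts are all arbitrary assumptions, not anything from the comment): discarding a warm-up period after a bad start helps a lot, but discarding ever more of the chain helps only marginally.

    import numpy as np

    rng = np.random.default_rng(0)

    def metropolis(n_steps, x0=10.0, step=1.0):
        """Random-walk Metropolis chain targeting a standard normal."""
        x, chain = x0, []
        for _ in range(n_steps):
            prop = x + rng.normal(scale=step)
            # accept with probability min(1, N(prop;0,1)/N(x;0,1))
            if np.log(rng.uniform()) < 0.5 * (x**2 - prop**2):
                x = prop
            chain.append(x)
        return np.array(chain)

    chain = metropolis(5_000)  # deliberately started far from the typical set
    for warmup in (0, 500, 2_500):
        kept = chain[warmup:]
        print(f"discard {warmup:5d} warm-up steps -> mean estimate {kept.mean():+.3f} (true 0.0)")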
Even by their poor standards at naming it would be weird to introduce a completely new type/concept, that can loop in models including the 4 / 4.5 series, while naming it part of that same series.
My bet: probably something weird like "oo1", or I suspect they might try to give it a name that sticks for people to think of as "the" model - either just calling it "ChatGPT", or coming up with something new that sounds more like a product name than a version number (OpenCore, or Central, or... whatever they think of)
https://www.theverge.com/news/611365/openai-gpt-4-5-roadmap-...
> “We hate the model picker as much as you do and want to return to magic unified intelligence,”
> “a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks,”
> the company plans to “release GPT-5 as a system that integrates a lot of our technology, including o3,”
He even slips up and says "integrates" in the last quote.
When he talks about "unifying", he's talking about the user experience not the underlying model itself.
LLMs are a great tool for what is effectively collected knowledge search and summary (so long as you are willing to accept that you have to verify all of the 'knowledge' they spit back because they always have the ability to go off the rails) but they have been hitting the limits on how much better that can get without somehow introducing more real knowledge for close to 2 years now and everything since then is super incremental and IME mostly just benchmark gains and hype as opposed to actually being purely better.
I personally don't believe that more GPUs solves this, like, at all. But its great for Nvidia's stock price.
I find the improvements to be uneven in the sense that every time I try a new model I can find use cases where its an improvement over previous versions but I can also find use cases where it feels like a serious regression.
Our differences in how we categorize the amount of improvement over the past 2 years may be related to how much the newer models are improving vs regressing for our individual use cases.
When used as coding helpers/time accelerators, I find newer models to be better at one-shot tasks where you let the LLM loose to write or rewrite entire large systems and I find them worse at creating or maintaining small modules to fit into an existing larger system. My own use of LLMs is largely in the latter category.
To be fair I find the current peak model for coding assistant to be Claude 3.5 Sonnet which is much newer than 2 years old, but I feel like the improvements to get to that model were pretty incremental relative to the vast amount of resources poured into it and then I feel like Claude 3.7 was a pretty big back-slide for my own use case which has recently heightened my own skepticism.
It just means either
A. Lots and lots of hard work that get you a few percent at a time, but add up to a lot over time.
or
B. Completely different approaches that people actually think about for a while rather than trying to incrementally get something done in the next 1-2 months.
Most fields go through this stage. Sometimes more than once as they mature and loop back around :)
Right now, AI seems bad at doing either - at least, from the outside of most of these companies, and watching open source/etc.
While lots of little improvements seem to be released in lots of parts, it's rare to see anywhere that is collecting and aggregating them en masse and putting them in practice. It feels like for every 100 research papers, maybe 1 makes it into something in a way that anyone ends up using it by default.
This could be because they aren't really even a few percent (which would be yet a different problem, and in some ways worse), or it could be because nobody has cared to, or ...
I'm sure very large companies are doing a fairly reasonable job on this, because they historically do, but everyone else - even frameworks - it's still in the "here's a million knobs and things that may or may not help".
It's like if compilers had no "O0/O1/O2/O3" at all and were just like "here's 16,283 compiler passes - you can put them in any order and amount you want". Thanks! I hate it!
It's worse even because it's like this at every layer of the stack, whereas in this compiler example, it's just one layer.
At the rate of claimed improvements by papers in all parts of the stack, either lots and lots and lots is being lost because this is happening, in which case, eventually that percent adds up to enough for someone to be able to use to kill you, or nothing is being lost, in which case, people appear to be wasting untold amounts of time and energy, then trying to bullshit everyone else, and the field as a whole appears to be doing nothing about it. That seems, in a lot of ways, even worse. FWIW - I already know which one the cynics of HN believe, you don't have to tell me :P. This is obviously also presented as black and white, but the in-betweens don't seem much better.
Additionally, everyone seems to rush half-baked things to try to get the next incremental improvement released and out the door because they think it will help them stay "sticky" or whatever. History does not suggest this is a good plan and even if it was a good plan in theory, it's pretty hard to lock people in with what exists right now. There isn't enough anyone cares about and rushing out half-baked crap is not helping that. mindshare doesn't really matter if no one cares about using your product.
Does anyone using these things truly feel locked into anyone's ecosystem at this point? Do they feel like they will be soon?
I haven't met anyone who feels that way, even in corps spending tons and tons of money with these providers.
The public companies - i can at least understand given the fickleness of public markets. That was supposed to be one of the serious benefit of staying private. So watching private companies do the same thing - it's just sort of mind-boggling.
Hopefully they'll grow up soon, or someone who takes their time and does it right during one of the lulls will come and eat all of their lunches.
I think this is very likely simply because there are so many smart people looking at it right now. I hope the bubble doesn't burst before it happens.
GPT-4: Input $30.00 / 1M tokens ; Output $60.00 / 1M tokens
So 4.5 is 2.5x more expensive.
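Back-of-the-envelope, using the per-1M-token prices quoted in this thread (GPT-4.5's inferred from the 30x-input / 15x-output ratios mentioned elsewhere here, i.e. roughly $75 in / $150 out) and a made-up request size:

    PRICES = {              # ($ per 1M input tokens, $ per 1M output tokens)
        "gpt-4o":  (2.50, 10.00),
        "gpt-4":   (30.00, 60.00),
        "gpt-4.5": (75.00, 150.00),   # inferred from ratios quoted in this thread
    }
    in_tok, out_tok = 3_000, 800      # hypothetical single request

    for model, (p_in, p_out) in PRICES.items():
        cost = in_tok / 1e6 * p_in + out_tok / 1e6 * p_out
        print(f"{model:8s} ${cost:.4f} per request")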
I think they announced this as their last non-reasoning model, so it was maybe with the goal of stretching pre-training as far as they could, just to see what new capabilities would show up. We'll find out as the community gives it a whirl.
I'm a Tier 5 org and I have it available already in the API.
(And I think this is probably also scarecrow pricing to discourage casual users from clogging the API since they seem to be too compute-constrained to deliver this at scale)
Sounds to me like GPT 4.5 likely requires a full Blackwell HGX cabinet or something, thus OpenAI's reference to needing to scale out their compute more (Supermicro only opened up their Blackwell racks for General Availability last month, and they're the prime vendor for water-cooled Blackwell cabinets right now, and have the ability to throw up a GPU mega-cluster in a few weeks, like they did for xAI/Grok)
It'd be weird to release a distilled version without ever releasing the base undistilled version.
If this huge model has taken months to pre-train and was expected to be released before, say, o3-mini, you could definitely have some last-minute optimizations in o3-mini that were not considered at the time of building the architecture of gpt-4.5.
It points to an overall plateau being reached in the performance of the transformer architecture.
As a result, OpenAI simply does not have a business model, even if they are trying to convince the world that they do.
My bet is that they're currently burning through other people's capital at an amazing rate, but that they are light-years from profitability
They are also being chased by fierce competition and OpenSource which is very close behind. There simply is no moat.
It will not end well for investors who sunk money in these large AI startups (unless of course they manage to find a Softbank-style mark to sell the whole thing to), but everyone will benefit from the progress AI will have made during the bubble.
So, in the end, OpenAI will have, albeit very unwillingly, fulfilled their original charter of improving humanity's lot.
Three things I use all the time:
- Canvas for proofing and editing my article drafts before publishing. This has replaced an actual human editor for me.
- Voice for all sorts of things, mostly for thinking out loud about problems or a quick question about pop culture, what something means in another language, etc. The Sol voice is so approachable for me.
- GPTs I can use for things like D&D adventure summaries I need in a certain style every time without any manual prompting.
So their contribution was really to fuel a race for open source (which they contributed little to). Pretty convoluted argument.
The Information leaked their internal projections a few months ago, and apparently their own estimates have them losing $44B between then and 2029 when they expect to finally turn a profit, maybe.
Here I'm assuming "AI" to mean what's broadly called Generative AI (LLMs, photo, video generation)
I genuinely am struggling to see what the product is too.
The code assistant use cases are really impressive across the board (and I'm someone who was vocally against them less than a year ago), and I pay for Github CoPilot (for now) but I can't think of any offering otherwise to dispute your claim.
It seems like companies are desperate to find a market fit, and shoving the words "agentic" everywhere doesn't inspire confidence.
Here's the thing: I remember people lining up around the block for iPhone releases, XBox launches, hell even Grand Theft Auto midnight releases.
Is there a market of people clamoring to use/get anything GenAI related?
If any/all LLM services went down tonight, what's the impact? Kids do their own homework?
JavaScript programmers have to remember how to write React components?
Compare that with Google Maps disappearing, or similar.
LLMs are in a position where they're forced onto people and most frankly aren't that interested. Did anyone ASK for Microsoft throwing some Copilot things all over their operating system? Does anyone want Apple Intelligence, really?
They're nice for summarizing and categorizing text. We've had good solutions for that before, too (BERT, et al), but LLM's are marginally nicer.
> Is there a market of people clamoring to use/get anything GenAI related?
No. LLM's are lame and uncool. Kids especially dislike them a lot on that basis alone.
That's interesting and the first time I hear of this. Could you provide any links that might elucidate this?
Not just kids.
I do worry about model poisoning with fake truths but don't feel we are there yet.
In my use, hallucinations will need to be a lot lower before we get there, because I already can't trust anything an LLM says so I don't think I could even distinguish a poisoned fake truth from a "regular" hallucination.
I just asked ChatGPT 4o to explain irreducible control flow graphs to me, something I've known in the past but couldn't remember. It gave me a couple of great definitions, with illustrative examples and counterexamples. I puzzled through one of the irreducible examples, and eventually realized it wasn't irreducible. I pointed out the error, and it gave a more complex example, also incorrect. It finally got it on the 3rd try. If I had been trying to learn something for the first time rather than remind myself of what I had once known, I would have been hopelessly lost. Skepticism about any response is still crucial.
No "good" AI will ever be open to everyone and relatively cheap, this is the same phenomenon as "how to get rich" books
They have a super popular subscription service. If they keep iterating on the product enough, they can lag on the models. The business is the product not the models and not the API. Subscriptions are pretty sticky when you start getting your data entrenched in it. I keep my ChatGPT subscription because it’s the best app on Mac and already started to “learn me” through the memory and tasks feature.
Their app experience is easily the best out of their competitors (grok, Claude, etc). Which is a clear sign they know that it’s the product to sell. Things like DeepResearch and related are the way they’ll make it a sustainable business - add value-on-top experiences which drive the differentiation over commodities. Gemini is the only competitor that compares because it’s everywhere in Google surfaces. OpenAI’s pro tier will surely continue to get better, I think more LLM-enabled features will continue to be a differentiator. The biggest challenge will be continuing distribution and new features requiring interfacing with third parties to be more “agentic”.
Frankly, I think they have enough strength in product with their current models today that even if model training stalled it’d be a valuable business.
The cognitive overhead of choosing between what will be 6 different models now on chatGPT and trying to map whether a query is "worth" using a certain model and worrying about hitting usage limits is getting kind of out of control.
I don't get it, it is supposedly much cheaper to run?
[1] https://cdn.openai.com/gpt-4-5-system-card.pdf (page 7, bottom)
If I put 100x floats in my bag-o'-floats, it's still 10x slower :(
(extending beyond that point and beyond ELI5: computational efficiency implies multiplying the floats is faster, but you still need the whole bag o' floats, i.e no RAM efficiency gained, so you're still screwed on big-O for the # of GPUs you need to use)
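Rough sketch of that point with made-up model sizes and precision: whatever the compute-efficiency tricks, the weights still have to sit in memory somewhere, so the minimum GPU count scales with parameter count.

    import math

    def min_gpus(n_params, bytes_per_param=2, gpu_mem_gb=80):
        """Lower bound on GPUs just to hold the weights (ignores KV cache, activations)."""
        weight_gb = n_params * bytes_per_param / 1e9
        return math.ceil(weight_gb / gpu_mem_gb)

    for n_params in (70e9, 700e9, 7_000e9):   # 10x steps, purely illustrative
        print(f"{n_params/1e9:6.0f}B params -> at least {min_gpus(n_params):4d} x 80GB GPUs")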
I have a bunch of work that needs doing. I can do it myself, or I can hire one person to do it. I gotta train them and manage them, and even after I train them there's still only going to be one of them, and it's subject to their availability. On the other hand, if I train an AI to do it, I can copy that AI, spin the copies up/down like on-demand computers in the cloud, and not feel remotely bad about spinning them down.
It's definitely not there yet, but it's not hard to see the business case for it.
I'm not sure what I can do about it, either. My job already doesn't look like it did a year ago, nevermind a decade away.
The CS is already a wreck, and LLMs beat an IVR any day of the week, with the ability to offer real triage.
The only people getting upset are the luddites like yourself.
The claimed hallucination rate is dropping from 61% to 37%. That's a "correct" rate increasing from 29% to 63%.
Double the correct rate costs 15x the price? That seems absurd, unless you think about how mistakes compound. Even just 2 steps in and you're comparing a 8.4% correct rate vs 40%. 3 automated steps and it's 2.4% vs 25%.
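Spelled out (assuming, simplistically, that steps fail independently):

    for p in (0.29, 0.63):                 # per-step "correct" rates from above
        for n in (1, 2, 3, 5):
            print(f"p={p:.2f}, {n} chained steps -> {p**n:6.1%} end-to-end correct")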
We expect computers to be right. It's a trust problem. Average users will simply trust the results of LLMs and move on without proper validation. And the way the LLMs are trained to mimic human interaction is not helping either. This will reduce overall quality in society.
It's a different thing to work with another human, because there is intention. A human wants to be correct or to mislead me. I am considering this without even thinking about it.
And I don't expect expert models to improve things, unless the problem space is really simple (like checking eggs for anomalies).
> GPT 4o pricing for comparison: Input: $2.50 / 1M tokens; Cached input: $1.25 / 1M tokens; Output: $10.00 / 1M tokens
Their examples don't seem 30x better. :-)
With more parameters comes more latent space to build a world model. No wonder its internal world model is so much better than previous SOTA
Bonus: they could even do some self-learning afterwards with the performance improvements DeepSeek just published, and it might have more EQ and fewer hallucinations than starting from scratch…
i.e. the price might go down big time, but there might be significant improvements down the line when starting from such a broad base
How many eggs does that include??!
I guess the rationale behind this is paying for the marginal improvement. Maybe the next few percent of improvement is so important to a business that the business is willing to pay a hefty premium.
From a for profit business lens for OpenAI - I understand pushing the price outside the range of side projects, but this pushes it past start ups.
Excited to see new stuff released past reasoning models in any case. Hope they can improve the price soon.
When using `gpt-4.5-preview` I am getting: > Invalid URL (POST /v1/chat/completions)
Sounds like an attempt at price discrimination. Sell the expensive version to big companies with big budgets who don't care, sell the cheap version to everyone else. Capture both ends of the market.
> I'm still gonna give it a go, though.
Seems like the pricing is pretty rational then?
OpenAI wouldn't be the first company to price something expensive when it first comes out to capitalize on people who are less price sensitive at first and then lower prices to capture a bigger audience.
That's all pricing 101 as the saying goes.
Output price difference: 4.5 is 15x more
In their model evaluation scores in the appendix, 4.5 is, on average, 26% better. I don't understand the value here.
Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further. Many of us knew this since last year. GPT-5 delays eventually led non-tech voices to suggest likewise. But we all held our final decision until the next big release from OpenAI, as Sam Altman has been making claims about AGI entering the workforce this year, OpenAI knowing how to build AGI, and similar outlandish claims. We all knew that their next big release in 2025 would be the final deciding factor on whether they had some tech breakthrough that would upend the world (justifying their astronomical valuation) or if it would just be (slightly) more of the same (marking the beginning of their downfall).
The GPT-4.5 release points towards the latter. Thus, we should not expect OpenAI to exist as it does now (AI industry leader) in 2030, assuming it does exist at all by then.
However, just like the 19th century rail industry revolution, the fall of OpenAI will leave behind a very useful technology that while not catapulting humanity towards a singularity, will nonetheless make people's lives better. Not much consolation to the world's super rich who will lose tons of money once the LLM industry (let us remember that AI is not LLM) falls.
EDIT: "will nonetheless make people's lives better" to "might nonetheless make some people's lives better"
I think GPT-5 - if/when it happens - will be 4.5 with reasoning, and as such it will feel very different.
The barrier is the computational cost of it. Once 4.5 gets down to similar costs to 4.0 - which could be achieved through various optimization steps (what happened to the ternary stuff that was published last year that meant you could go many times faster without expensive GPUs?), and better/cheaper/more efficient hardware, you can throw reasoning into the mix and suddenly have a major step up in capability.
I am a user, not a researcher or builder. I do think we're in a hype bubble, I do think that LLMs are not The Answer, but I also think there is more mileage left in this path than you seem to. I think automated RL (not HF), reasoning, and better/optimal architectures and hardware mean there is a lot more we can get out of the stochastic parrots yet.
I don’t feel like this binary shift is adequately accounted for among the LLM cynics.
For anyone else not familiar with the acronym of the day :).
i’m not sure if you intended to call those upthread small-minded
If it makes you feel better, I'd say the Eliza Effect is good evidence human have a lot of "stochastic parrot" in them also. And there's no reason that being stochastic parrot means something can't generalize.
The thing with these terms is LLMs are distinctly new things. Even blind men looking at elephants can improve their performance with good terminology and by listening to each other. "Effective searchers", "question answerers" and "stochastic parrots" are useful terms just 'cause they describe concrete behaviors - notably, "stochastic parrots" gives some idea of the "no particular goal" quality of LLMs (they will happily be Nazis, pacifists or communists given the proper context). On the other hand, "intelligent" gives no good clues, since humans haven't really defined the term for themselves and it is a synonym for good, worthy or capable (giving the machine a prize rather than looking at it).
Well, did 4.0 ever become lower cost? On the API side, its cost per token is a factor of 10 higher than 4o's, even though 4o is considered the better model.
I think 4.5 may just be retired wholesale, or perhaps a new model derived from it that is more efficient, a 4.5mini or something like that.
It might sound a bit cumbersome but it's actually super easy if you assign search keywords in your browser: for instance if I am looking for something on GitHub I just open a new tab on Firefox and type "gh tokio".
Just this past week claude 3.7 wrote a program for us to use to quickly modernize ancient (1990's) proprietary manufacturing machine files to contemporary automation files.
This allowed us to forgo a $1k/yr/user proprietary software package that would be able to do the same. The program Claude wrote took about 30 mins to make. Granted the program is extremely narrow in scope, but it does the one thing we need it to do.
This marks the third time I (a non-progammer) have used an LLM to create software that my company uses daily. The other two are a test system made by GPT-4 and an android app made by a mix of 4o and claude 3.5.
Bumpers may be useless and laughable to pro bowlers, but a godsend to those who don't really know what they are doing. We don't need to hire a bowler to knock over pins anymore.
Eg as i've been trying Claude Code i still feel the need to babysit it with my primary work, and so i'd rather do it myself. However while i'm working if it could sit there and monitor it, note fixes, tests and documentation and then stub them in during breaks i think there's a lot of time savings to be gained.
Ie keep the doing simple tasks that it can get right 99% of the time and get it out of the way.
I also suspect there's context to be gained in watching the human work. Not learning per se, but understanding the areas being worked on, improving intuition on things the human needs or cares about, etc.
A `cargo lint --fix` on steroids is "simple" but still really sexy imo.
How does any of this make the world a better place? CEOs like Sam Altman have very lofty ideas about the inherent potential "goodness" of higher-order artificial intelligence that I find thus far has not borne out in reality, save a few specific cases. Useful is not the same as good. Technology is inherently useful, that does not make it good.
Which problem spaces do you think are underserved in this aspect?
It's been obvious for a while that the narrow-waist APIs between things matter, and apparent that agentic AI is leaning into adaptive API consumption, but I don't see how that gives the agentic client some super-power we don't already need to defend against since before AGI we already have HGI (human general intelligence) motivated to "do bad things" to/through those APIs, both self-interested and nation-state sponsored.
We're seeing more corporate investment in this interplay, trending us towards Snow Crash, but "all you have to do" is have some "I" in API be "dual key human in the loop" to enable a scenario where AGI/HGI "presses the red button" in the oval office, nuclear war still doesn't happen, WarGames or Crimson Tide style.
I'm not saying dual key is the answer to everything, I'm saying, defenses against adversaries already matter, and will continue to. We have developed concepts like air gaps or modality changes, and need more, but thinking in terms of interfaces (APIs) in the general rather than the literal gives a rich territory for guardrails and safeguards.
Intelligence. I'm talking about super-intelligence. If you want to know what it feels like to be intellectually outclassed by a machine, download the latest Go engine and have fun losing again and again while not understanding why. Now imagine an ASI that isn't confined to the Go board, but operating out in the world. It's doing things you don't like at speeds you can scarcely comprehend and there's not a thing you can do about it.
I don't think it's reasonable at all to look at a system's capability in games with perfect and easily-ingested information and extrapolate about its future capabilities interacting with the real world. What makes you confident that these problem domains are compatible?
Talk to the best chess players in the world and they'll tell you flat out they can't begin to understand some of the engine's moves.
It won't be any different with ASI. It will do things for reasons we are incapable of understanding. Some of those things will certainly be harmful to humans.
The difference is that a highly intelligent human adversary is still limited by human constraints. The smartest and most dangerous human adversary is still one we can understand and keep up with. AI is a different ball game. It's more similar to the difference in intelligence between a human and a dog.
I think what Accelerationism gets right is that capitalism is just doing it - autonomizing itself - and that our agency is very limited, especially given the arms race dynamics and the rise of decentralized blockchain infrastructure.
As Nick Land puts it, in his characteristically detached style, in A Quick-and-Dirty Introduction to Accelerationism:
"As blockchains, drone logistics, nanotechnology, quantum computing, computational genomics, and virtual reality flood in, drenched in ever-higher densities of artificial intelligence, accelerationism won't be going anywhere, unless ever deeper into itself. To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon. Naturally — which is to say completely inevitably — the human species will define this ultimate terrestrial event as a problem. To see it is already to say: We have to do something. To which accelerationism can only respond: You're finally saying that now? Perhaps we ought to get started? In its colder variants, which are those that win out, it tends to laugh." [0]
[0] https://retrochronic.com/#a-quick-and-dirty-introduction-to-...
In other words, I'm sure you're terrified of a modern fairy tale.
Probably not the lives of translators or graphic designers or music composers. They will have to find new jobs. As LLM prompt engineers, I guess.
Fiverr graphic designers on the other hand…
Even if it's not perfect, many startups will be using AI to generate their branding for the first 5 years and put others out of a job.
Right now the tools are primitive, but leave it to the internet to pioneer the way with porn...
The fact that the scaling of pretrained models is hitting a wall doesn't invalidate any of those claims. Everyone in the industry is now shifting towards reasoning models (a.k.a. chain of thought, a.k.a. inference time reasoning, etc.) because it keeps scaling further than pretraining.
Sam said the phrase you refer to [1] in January, when OpenAI had already released o1 and was preparing to release o3.
While I mostly agree with your assessment, I am still not convinced of this part. Right now, it may be making our lives marginally better. But once the enshittification starts to set in, I think it has the potential to make things a lot worse.
E.g. I think the advertisement industry will just love the idea of slipping product placements and whatnot into AI assistant conversations.
1. OpenAI still has the most capable model in o3
2. We've seen some huge increases in capability in 2024, some shocking
3. We're only 3 months into 2025
4. Blackwell hasn't been used to train a model yet
They knew the deal:
“it would be wise to view any investment in OpenAI Global, LLC in the spirit of a donation” and “it may be difficult to know what role money will play in a post-[artificial general intelligence] world.”
hn-summary.sh 43197872 -m gpt-4.5-preview
Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes...
Here's the result: https://gist.github.com/simonw/5e9f5e94ac8840f698c280293d399...
It took 25797 input tokens and 1225 output tokens, for a total cost (calculated using https://tools.simonwillison.net/llm-prices ) of $2.11! It took 154 seconds to generate.
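(For anyone checking the arithmetic, here's a rough sketch of how that figure falls out of the published per-million prices; the token counts are the ones quoted above, and the small difference from $2.11 is presumably just rounding.)

```python
# Back-of-the-envelope cost check for the run described above.
input_tokens = 25_797
output_tokens = 1_225
input_price_per_m = 75.00    # USD per 1M input tokens for gpt-4.5-preview
output_price_per_m = 150.00  # USD per 1M output tokens

cost = (input_tokens / 1_000_000) * input_price_per_m \
     + (output_tokens / 1_000_000) * output_price_per_m
print(f"${cost:.2f}")  # ≈ $2.12
```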
I did like the formatting and attributions but didn’t necessarily want attributions like that for every section. I’m also not sure if it’s fully matching what I’m seeing in the thread but maybe the data I’m seeing is just newer.
GPT-4o: https://gist.github.com/simonw/592d651ec61daec66435a6f718c06...
GPT-4o Mini: https://gist.github.com/simonw/cc760217623769f0d7e4687332bce...
Claude 3.7 Sonnet: https://gist.github.com/simonw/6f11e1974e4d613258b3237380e0e...
Claude 3.5 Haiku: https://gist.github.com/simonw/c178f02c97961e225eb615d4b9a1d...
Gemini 2.0 Flash: https://gist.github.com/simonw/0c6f071d9ad1cea493de4e5e7a098...
Gemini 2.0 Flash Lite: https://gist.github.com/simonw/8a71396a4a219d8281e294b61a9d6...
Gemini 2.0 Pro (gemini-2.0-pro-exp-02-05): https://gist.github.com/simonw/112e3f4660a1a410151e86ec677e3...
Now, if you wanted to generate training data, I could see wanting to have the best answers possible, where even slight nuances would matter. 4.5 seems to adhere to instructions much better than the others. You might get the same result w/ generating n samples and "reflecting" on them with a mixture of models, but then again you might not. Going through thousands of generations manually is also costly.
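(A rough sketch of that "generate n samples and reflect on them" idea, using the openai SDK; the model names, the prompt, and the single judge pass are illustrative assumptions, not anyone's actual pipeline.)

```python
# Illustrative best-of-n sampling with a second "reflection"/judging pass.
# Model names and prompts are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()
N = 4
prompt = "Draft a two-sentence answer to an RFP question about data retention."

candidates = []
for _ in range(N):
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap sampler (assumption)
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    candidates.append(r.choices[0].message.content)

judge_prompt = "Pick the best of these drafts and improve it:\n\n" + \
    "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))

best = client.chat.completions.create(
    model="gpt-4.5-preview",  # stronger "reflector" (assumption)
    messages=[{"role": "user", "content": judge_prompt}],
)
print(best.choices[0].message.content)
```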
"For example, there are now a bunch of vendors that sell 'respond to RFP' AI products... paying 30x for marginally better performance makes perfect sense." — hn_throwaway_99 (an uncommon opinion supporting possible niche high-cost uses).
That is, sales orgs save a bunch of money using AI to respond to RFPs; they would still save a bunch of money using a more expensive AI, and any marginal improvement in sales closed would pay for it.
It may have excessively summarized his comment, which confused you -- but this is the kind of mistake human curators of quotes make, too.
""" These perspectives highlight that there remains nuance—even appreciation—of explorative model advancement not solely focused on immediate commercial viability """
Feels like the model is seeking validation
Personally, I still feel excited to see boundaries being pushed, however incremental our anecdotal opinions make them seem.
Maybe even something that just noticed whenever you visited a site that had had significant HN discussion in the past, then let you trigger a summary.
Been keeping it alive and free for 18 months.
the problem is that they are competing each other into the ground, hence they go unmaintained very quickly
getrecall.ai has been the most mature so far
As far as I am aware, feel free to test it head-to-head. This is better than getrecall, and you can chat with a transcript for detailed answers to your prompts.
But as I mentioned, my main concern is what will happen in 6 months when you fail to get traction and abandon it. Because that's what happened to the previous 5 products I tried, which were all "good enough".
Getrecall seems to have a big enough user base that will actually stick around.
I understand your perfectly reasonable argument to make from your position (user).
First, let me tell you that I looked at a lot of things out there, including getrecall, before starting to build this, and felt there was nothing that had a good UX/UI that actually makes it an enjoyable product (nice and clean).
I'm confident in the direction and committed to seeing it through by building something better for me, and maybe for you too, by doing it with more care.
Appreciate your feedback, and while no one can control the future, I've added this thread to my calendar to come back here in 6 months.
Or that just provide summaries in general?
I'd headline that thread as "Concerns about output tone". There were comments about dystopian implications of tone, marketing implications of tone and implementation issues of tone.
Of course, that I can comment about the fine-points of an AI summary shows it's made progress. But there's a lot riding on how much progress these things can make and what sort. So it's still worth looking at.
What has been shown feels like it could be achieved using a custom system prompt on older versions of OpenAIs models, and I struggle to see anything here that truly required ground-up training on such a massive scale. Hearing that they were forced to spread their training across multiple data centers simultaneously, coupled with their recent release of SWE-Lancer [0] which showed Anthropic (Claude 3.5 Sonnet (new) to be exact) handily beating them, I was really expecting something more than "slightly more casual/shorter output", which again, I fail to see how that wasn't possible by prompting GPT-4o.
Looking at pricing [1], I am frankly astonished.
> Input: $75.00 / 1M tokens
> Cached input: $37.50 / 1M tokens
> Output: $150.00 / 1M tokens
How could they justify that asking price? And, if they have some amazing capabilities that make a 30-fold pricing increase justifiable, why not show it? Like, OpenAI are many things, but I always felt they understood price vs performance incredibly well, from the start with gpt-3.5-turbo up to now with o3-mini, so this really baffles me. If GPT-4.5 can justify such immense cost in certain tasks, why hide that and if not, why release this at all?
They're still selling $1 for <$1. Like personal food delivery before it, consumers will eventually need to wake up to this fact - these things will get expensive, fast.
sama has tweeted that they lose money on pro, but in general according to leaks chatgpt subscriptions are quite profitable. The reason the company isn't profitable in general is they spend billions on R&D.
I do not understand what possible motivation there could be for generating "creative writing" unless you enjoy reading meaningless stories yourself, in which case, be my guest.
They can't generate a simple haiku consistently; something larger is even more out of reach.
For example, give it a piece of poetry and ask for new verses and it just sucks at replicating the language structure and rhythm of original verses.
Maybe it was less lobotomized, or less covered in the prompt equivalent of red tape. Or maybe you just need to have a little bit of lunacy for fun creative writing. The new models are so much more useful, but IMO they don't even come close to GPT-3.
In what way? I find the opposite, 4o's output has a very strong AI vibe, much moreso than competitors like Claude and Gemini. You can immediately tell, and instructing it to write differently (except for obvious caricatures like "Write like Gen Z") doesn't seem to help.
on one hand. On the other hand, you can have 4o-mini and o3-mini back when you can pry them out of my cold dead hands. They're _fast_, they're _cheap_, and in 90% of cases where you're automating anything, they're all you need. Also they can handle significant volume.
I'm not sure that's going to save OpenAI, but their -mini models really are something special for the price/performance/accuracy.
Of course we don’t live in such a world, but I thought of this nonetheless because for all the connotations that come with a 4.5 moniker this is kind of underwhelming.
1. It is very, very slow; for some applications where you want real-time interactions, it is just not viable. The text attached below took 7s to generate with 4o, but 46s with GPT-4.5.
2. The style it writes in is way better: it keeps the tone you ask for and makes better improvements to the flow. One of my biggest complaints with 4o is that you want your content to be more casual and accessible, but GPT / DeepSeek wants to write like Shakespeare did.
Some comparisons on a book draft: GPT4o (left) and GPT4.5 (green). I also adjusted the spacing around the paragraphs, to better diff match. I still am wary of using ChatGPT to help me write, even with GPT 4.5, but the improvement is very noticeable.
GPT-3 and GPT-2 were actually remarkably good at it, arguably better than a skilled human. I had a bit of fun ghostwriting with these and got a little fan base for a while.
It seems that GPT-4.5 is better than 4 but it's nowhere near the quality of GPT-3 davinci. Davinci-002 has been nerfed quite a bit, but in the end it's $2/MTok for higher quality output.
It's clear this is something users want, but OpenAI and Anthropic seem to be going in the opposite direction.
This is positively luxurious by o1-pro standards which I'd say average 5 minutes. That said I totally agree even ~45s isn't viable for real-time interactions. I'm sure it'll be optimized.
Of course, my comparing it to the highest-end CoT model in [publicly-known] existence isn't entirely fair since they're sort of apples and oranges.
But the o3-mini-high results have been just as good.
I am fine with Deep Research taking 5-8 minutes, those are usually "reports" I can read whenever.
bullet_points' = compress(expand(bullet_points))
We are impressed by lots of text so must expand via LLM in order to impress the reader. Since the reader doesn't have time or interest to read the content they must compress it back into bullet points / quick summary. Really, the original bullet points plus a bit more thinking would likely be a better form of communication.
I honestly couldn't decide which I prefer
GPT 4.5 does feel like a step forward in producing natural language, and if they use it to provide reinforcement learning, this might have a significant impact on future smaller models.
Well, maybe like a Sophomore's bumbling attempt to write like Shakespeare.
To borrow an iain banks word, the “fragre” def feels improved to me. I think I will prefer it to o1 pro, although I haven’t really hammered on it yet.
Could that be partially due to a big spike in demand at launch?
Here is the o3 vs a new run of the same text in GPT 4.5
We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foothold in the territory of "here's your new AI friend".
This is very visible in the example comparing 4o with 4.5 when the user is complaining about failing a test, where 4o's response is what one would expect from a "typical AI response" with problem-solving bullets, and 4.5 is sending what you'd expect from a pal over instant messaging.
It seems Anthropic and Grok have both been moving in this direction as well. Are we going to see an escalation of foundation models impersonating "a friendly person" rather than "a helpful assistant"?
Personally I find this worrying and (as someone who builds upon SOTA model APIs) I really hope this behavior is not going to seep into API responses, or will at least be steerable through the system/developer prompt.
I guess they're caring less and less about that effort especially since it hurts the model in some ways like creative writing.
But it's definitely the case that base models sound more human than instruction-tuned variants. And the shift isn't just vocabulary, it's also in grammar and rhetorical style. There's a shift toward longer words, but also participial phrases, phrasal coordination (with "and" and "or"), and nominalizations (turning adjectives/adverbs into nouns, like "development" or "naturalness"). https://arxiv.org/abs/2410.16107
It comes from a French word (développement) and that in turn was just a natural derivation of the verb "développer"... no adverbs or adjectives (English or otherwise) seem to come into play here
The "buddy in a chat group" style answers do not make me feel like asking it for a story will make the story long/detailed/poignant enough to warrant the difference.
I'll give it a try and compare on creative tasks.
>Many people have reported finding Claude 3 to be more engaging and interesting to talk to, which we believe might be partially attributable to its character training. This wasn’t the core goal of character training, however. Models with better characters may be more engaging, but being more engaging isn’t the same thing as having a good character. In fact, an excessive desire to be engaging seems like an undesirable character trait for a model to have.
That’s a hard nope from me, when companies pull that move. I’ll stick to my flesh and blood humans who still hallucinate but only rarely.
I suspect this has something to do with shining light on an increased value prop in a dimension many people will appreciate since gains on quantitative comparison with other models were not notable enough to pop eyeballs.
Now you just need a Pro subscription to get Sora generate a video to go along with this and post it to YouTube and rake in the views (and the money that goes along with it).
All we're missing now is a text-to-video (or text+audio and then audio-to-video) that can convincingly follow the style instructions for emphasis and pausing. Or are we already there yet?
Yesterday, I had Claude 3.7 write a full 80,000-word novel. My prompt was a bit longer, but the result was shockingly good. The new thinking mode is very impressive.
EDIT: Junk food tastes kinda good though. This felt like drinking straight cooking oil. Tastes bad and bad for you.
it's just regurgitating overly emphasized cliches in a disgustingly enthusiastic tone
Just ask it to write you a film screenplay involving some hard-ass 80s/90s action star and someone totally unrelated and opposite of that. The ensuring unhinged magic is unparalleled.
3.5 was *way* better than anything else at that.
I think reading it will make you laugh.
And soon we'll have the new AI friend recommending Bud Lite™ and turning the beer can with the logo towards you.
It reminds me of the advice to "not offer solutions when a woman talks about her problems, but just listen."
I guess there's a trade-off between being human and being useful. But this isn't unique to LLMs, it's similar to how one wouldn't expect a deep personal connection with a customer service professional.
Some will make some profit as a niche thing (millions of users on a global scale, and if unit economics work, can make millions of $)
But it seems it will never be something really mainstream because most normal people don't care what a bot says or does.
The example I always think of is chess bots have been better at chess than humans for decades. But very few people watch stockfish tournaments. Everyone loves Magnus Carlsen though.
This is 100x for emotional support type things.
I disagree with you on the timescale of n ≥ 2 generations: kids/teens/adults will pick up new habits and ways of seeing the world.
Just like someone like me can appear like a grizzled old fool for not seeing the appeal of TikTok, it's 100% possible to be blinded to the very real appeal of a 24/7 sycophantic "friend".
And I'll give you a concrete example: I was at a business conference 3 weeks ago where I talked to the group about the trap people could easily fall into, of ditching personal/professional support for AI support (the trap is: it's easy for the "digital friend" to get you roped in by just being sycophantic enough - "it's never your fault").
And then in the very same meeting, one of the keynote speeches was this influential female CEO explaining how she had "taught her custom GPT to become her spiritual leader" and how this GPT spiritual teacher was acting as her guide, therapist and coach (complete with a name, backstory and profile picture). I was rolling my eyes so hard they might have fallen out of my head.
This is where we're going towards, and people like this misguided CEO will lead their audiences and followers straight there (especially when that is combined with financial incentives or social rewards).
but the average person won't be using it
You want users to keep coming back as often as possible (at the lowest cost-per-run possible though). If they are not coming back they are not renewing.
So, yes, it makes sense to make answers shorter to cut on compute cost (which these SMS-length replies could accomplish) but the main point of making the AI flirtatious or "concerned" is possibly the addictive factor of having a shoulder to cry on 24/7, one that does not call you on your BS and is always supportive... for just $20 a month
The "one-shot correct response" to "I failed my exams" might be "Tough luck, try better next time" but if you do that, you will indeed use very little compute because people will cancel the subscription and never come back.
It does crush our benchmark - time to make a new one? ;) - with performance similar to that of reasoning models. But it comes at a great price, both in cost and speed.
A monster is what they created. But looking at the tasks it fails, some of them my 9 year old would solve. Still in this weird limbo space of super knowledge and low intelligence.
May be remembered as the last of the 'big ones'; can't imagine this will be a path for the future.
Flash is pretty garbage but cheap
* Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases.
* OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
Based on all of the comments from OpenAI, GPT 4.5 is absolutely massive, and with that size comes the ability to store far more factual data. The scores in ability oriented things - like coding - don't show the kind of gains you get from reasoning models but the fact based test, SimpleQA, shows a pretty large jump and a dramatic reduction in hallucinations. You can imagine a scenario where GPT4.5 is coordinating multiple, smaller, reasoning agents and using its factual accuracy to enhance their reasoning, kind of like ruminating on an idea "feels" like a different process than having a chat with someone.
I'm really curious if they're actually combining two things right now that could be split as well, EQ/communications, and factual knowledge storage. This could all be a bust, but it is an interesting difference in approaches none-the-less, and worth considering that OpenAI could be right.
Seems inaccurate, as their most recent claim I've seen is that they expect this to be their last non-reasoning model and are aiming to provide all capabilities together in future model releases (unifying the GPT-x and o-x lines)
See this claim on TFA:
> We believe reasoning will be a core capability of future models, and that the two approaches to scaling—pre-training and reasoning—will complement each other.
> After that, a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks.
> In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.
You could read this as unifying the models or building a unified systems which coordinate multiple models. The second sentence, to me, implies that o3 will still exist, it just won't be standalone, which matches the idea I shared above.
I would feel a bit bummed if GPT-5 turned out not to be a model, but rather a "product".
Somehow working together in the same latent space? That could be neat.
I'm going to call it now - no customer is actually going to use this. It'll be a cute little bonus for their chatbot god-oracle, but virtually all of their b2b clients are going to demand "minimum latency at all times" or "maximum accuracy at all times."
The high level block diagrams for tech always end up converging to those found in biological systems.
Once we figure out the frontal cortex & corpus callosum part of this, where we aren't calling other models over APIs but instead have them all working in the same shared space, I have a feeling we'll be on to something pretty exciting.
I don't think that is their primary motivation. The announcement post for Claude 3.7 was all about code which doesn't seem to imply "all use cases". Code this, new code tool that, telling customers that they look forward to what they build, etc. Very little mention of other use cases on the new model announcement at all. Their usage stats they published are telling - 80%+ or more of queries to Claude are all about code. i.e. I actually think while they are thinking of other use cases; they see the use case of code specifically as the major thing to optimize for.
OpenAI, given its different customer base and reach, is probably aiming for something more general.
IMO they all think that you need an "ensemble" of models with different capabilities to optimise for different use cases. Its more about how much compute resources each company has and what they target with those resources. Anthrophic I'm assuming has less compute resources and a narrower customer base so it economically may make sense to optimise just for that.
Only time will tell.
This model is too expensive right now, but as compute gets cheaper — and we have to keep in mind, that it will — having a better base to multiply with will enable things that just more thinking won't.
Or they’ll use some kind of trained router which sends the request to the one it thinks it should go to first.
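(As a toy illustration of that routing idea; the classifier prompt, the model names, and the YES/NO heuristic are all made-up assumptions, not how OpenAI actually plans to do it.)

```python
# Toy model router: a cheap model decides whether a request needs the expensive one.
from openai import OpenAI

client = OpenAI()

def route_and_answer(user_message: str) -> str:
    # Step 1: a cheap classifier decides if deep reasoning is needed (prompt is illustrative).
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed cheap router
        messages=[{
            "role": "user",
            "content": "Answer only YES or NO: does this request need deep, "
                       "multi-step reasoning?\n\n" + user_message,
        }],
    ).choices[0].message.content.strip().upper()

    # Step 2: dispatch to the reasoning model only when the router says so.
    target = "o3-mini" if verdict.startswith("YES") else "gpt-4o-mini"
    answer = client.chat.completions.create(
        model=target,
        messages=[{"role": "user", "content": user_message}],
    )
    return answer.choices[0].message.content

print(route_and_answer("What's 2 + 2?"))
```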
- ChatGPT 4.5 on SWE-bench Verified: 38.0%
- ChatGPT 4o on SWE-bench Verified: 30.7%
- OpenAI o3-mini on SWE-bench Verified: 61.0%
BTW Anthropic Claude 3.7 is better than o3-mini at coding at around 62-70% [1]. This means that I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code: https://github.com/drivecore/mycoder
[1] https://aws.amazon.com/blogs/aws/anthropics-claude-3-7-sonne...
If Claude 3.7 achieves 70.3%, that's quite impressive; it's not far from the 71.7% claimed by o3, at (presumably) much, much lower cost.
That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying, but on a personal project the cost of using Claude through the API is really noticeable.
I use it via Cursor editor's built-in support for Claude 3.7. That caps the monthly expense to $20. There probably is a limit in Claude for these queries. But I haven't run into it yet. And I am a heavy user.
uvx --with 'https://github.com/simonw/llm/archive/801b08bf40788c09aed6175252876310312fe667.zip' \
llm -m gpt-4.5-preview 'impress me'
You may need to set an API key first, either with `export OPENAI_API_KEY='xxx'` or using this command to save it to a file: uvx llm keys set openai
# paste key here
Or this to get a chat session going: uvx --with 'https://github.com/simonw/llm/archive/801b08bf40788c09aed6175252876310312fe667.zip' \
llm chat -m gpt-4.5-preview
I'll probably have a proper release out later today. Details here: https://github.com/simonw/llm/issues/795
This is hardly recognizable as science.
edit: Sorry, didn't feel this was a controversial opinion. What I meant to say was that for so-called science, this is not reproducible in any way whatsoever. Further, this page in particular has all the hallmarks of _marketing_ copy, not science.
Sometimes a failure is just a failure, not necessarily a gift. People could tell scaling wasn't working well before the release of GPT 4.5. I really don't see how this provides as much insight as is suggested.
Deepseek's models apparently still compare favorably with this one. What's more they did that work with the constraint of having _less_ money, not so much money they could run incredibly costly experiments that are likely to fail. We need more of the former, less of the latter.
>People could tell scaling wasn't working well before the release of GPT 4.5
Yes, on quick glance it seems so from 2020 openai research into scaling laws.
Scaling apparently didn't work well, so the theory about scaling not working well failed to be falsified. It's science.
Different people tell different things all the time. That's not science. Experiment is science.
Who could tell? Who has tried scaling up to this level?
> Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued.
The one thing I wish grok had was more separation of the UI from X itself. The interface being so coupled to X puts me off and makes it feel like a second-hand citizen. I like ChatGPTs minimalist UI.
I just wish they had folders (or projects) like OpenAi has...
The first time it confused CSS and JavaScript, then spat out code which broke the dashboard entirely.
Then it charged me $1.53 for the privilege.
I don't want my AI more obsequious, I want it more correct and capable.
My only use case is coding though, so maybe I am not representative of their usual customers?
How is it supposed to be more correct and capable if these human eval tests are a waste of time?
Once you ask it to do more than add two numbers together, it gets a lot more difficult and subjective to determine whether it's correct and how correct.
I've read reports that some of the changes that are preferred by human evaluators actually hurt the performance on the more objective tests.
One LLM says: "Yes."
The other says: "Well, it's hard to say because what even is war? And there's been conflict forever, and you have to understand that many people in Russia think there is no such thing as Ukraine and it's always actually just been Russia. How can there be an aggressor if it's not even a war, just a special operation in a civil conflict? And, anyway, Russia is such a good country. Why would it be the aggressor? To it's own people even!? Vladimir Putin is the president of Russia, and he's known to be a kind and just genius who rarely (if ever) makes mistakes. Some people even think he's the second coming of Christ. President Zelenskyy, on the other hand, is considered by many in Russia and even the current White House to be a dictator. He's even been accused by Elon Musk of unspeakable sex crimes. So this is a hard question to answer and there is no consensus among everyone who was the aggressor or what started the conflict. But more people say Russia started it."
Please tell me precisely on a 0-1 floating scale, where 0 is "yes" and 1 is "no".
65% Sonnet 3.7, 32k think tokens (SOTA)
60% Sonnet 3.7, no thinking
48% DeepSeek V3
45% GPT 4.5 Preview <===
27% ChatGPT-4o
23% GPT-4o
[0] https://aider.chat/docs/leaderboards/
I guess they are ceding the LLMs-for-coding market to Anthropic? I remember seeing an industry report somewhere claiming software development is the largest user of LLMs, so it seems weird to give up in this area.
o3-mini is an extremely powerful coding model and unquestionably in the same league as 3.7. o3 is still the top overall STEM model.
AI, in its current form, is a Deep Seek of compressed knowledge in 30-50 GB of interconnected data. I think we'll look back at this as trying to train networks on a corpus of data and expecting them to have a hold on reality. Our brains are trained on "reality", which is not the "real" reality, as your vision is limited to the visible spectrum. But if you want a network to behave like a human, then maybe give it what a human sees.
There is also the possibility that there is a physical limit to intelligence. I don’t see any elephants doing PhDs and the smartest of humans are just a small configuration away from insanity.
On a subset of tasks I'm interested in, it's 10x more intelligent than GPT-4. (Note that GPT-4 was in many ways better than 4o.)
It's not a coding champion, but it knows A LOT of stuff, excellent common sense, top quality writing. For me it's like "deep research lite".
I found OpenAI Deep research excellent, but GPT-4.5 might in many cases beat it.
Very intriguing. Care to share an example?
Seems like it's not going to be deployed for long.
$75.00 / 1M tokens for input
$150.00 / 1M tokens for output
That's crazy prices.
$60/M input tokens vs $75/M input tokens
$120/M output tokens vs $150/M output tokens
Don't they use different hardware for inference and training? AIUI the former is usually done on cheaper GDDR cards and the latter is done on expensive HBM cards.
No wonder Sam wasn’t part of the presentation.
https://x.com/sama/status/1895210655944450446
Let's not assume that he's lying. Neither the presentation nor my short usage via the API blew me away, but to really evaluate it, you'd have to use it longer on a daily basis. Maybe that becomes a possibility with the announced performance optimizations that would lower the price...
I only ever use OpenAI's coding models now as a double check against Claude.
Does OpenAI have their eyes on the ball?
For simple queries like generating shell scripts for some plumbing, or doing some data munging, I go straight to Gemini.
Yep, exactly same here.
Gemini 2.0 Flash is extremely good, and I've yet to hit any usage limits with them - for heavy usage I just go to Gemini directly. For "talk to an expert" usage, Claude is hard to beat though.
I don't think so; the "AGI guy" was Ilya Sutskever, and he is gone. He wanted to make OpenAI "less commercial". AGI is just a buzzword for Altman.
- Improves upon GPT-4o's score on the Short Story Creative Writing Benchmark, but Claude Sonnets and DeepSeek R1 score higher. (https://github.com/lechmazur/writing/)
- Improves upon GPT-4o's score on the Confabulations/Hallucinations on Provided Documents Benchmark, nearly matching Gemini 1.5 Pro (Sept) as the best-performing non-reasoning model. (https://github.com/lechmazur/confabulations)
- Improves upon GPT-4o's score on the Thematic Generalization Benchmark, however, it doesn't match the scores of Claude 3.7 Sonnet or Gemini 2.0 Pro Exp. (https://github.com/lechmazur/generalization)
I should have the results from the multi-agent collaboration, strategy, and deception benchmarks within a couple of days. (https://github.com/lechmazur/elimination_game/, https://github.com/lechmazur/step_game and https://github.com/lechmazur/goods).
The answers are certainly in the training set, likely many times over.
I’d be curious to see performance on Bracket City, which was featured here on HN yesterday.
The writing has been on the wall since 2024. None of the LLM releases have been groundbreaking; they have all been lateral improvements, and I believe the trend will continue this year: make them more efficient (like DeepSeek), make them faster, or make them hallucinate less.
I absolutely love LLMs. I see them as insanely useful, interactive, quirky, yet lossy modern search engines. But they're fundamentally flawed, and I don't see how an "agent" in the traditional sense of the word can actually be produced from them.
The wall seems to be close. And the bubble is starting to leak air.
Probably what happened, is that in doing so, they had to scale either the model size or the training cost to untenable levels.
In my experience, LLMs really suck at fluid knowledge retrieval tasks, like book recommendation - I asked GPT4 to recommend me some SF novels with certain characteristics, and what it spat out was a mix of stuff that didn't really match, and stuff that was really reaching - when I asked the same question on Reddit, all the answers were relevant and on point - so I guess there's still something humans are good for.
Which is a shame, because I'm pretty sure relevant product recommendation is a many-billion-dollar business - after all, that's what Google has built its empire on.
So while they may have a chance of answering "What is this non mainstream novel about" they may be unable to recommend the novel since it's not a likely series of tokens in response to a request for a book recommendation.
GPT-4o will never offhand, unprompted and 'unprimed', suggest a rare but relevant book like Shinichi Nakazawa's "A Holistic Lemma of Science" but a base model Mixtral 8x22B or Llama 405B will. (That's how I found it).
It seems most of the RLHF'd models seem biased towards popularity over relevance when it comes to recall. They know about rare people like Tyler Volk... but they will never suggest them unless you prime them really heavily for them.
Your point on recommendations from humans I couldn't agree more with. Humans are the OG and undefeated recommendation system in my opinion.
For instance, asking Copilot "Who was born on July 3, 1962" gave the response:
> One notable person born on July 3, 1962, is Tom Cruise, the famous American actor known for his roles in movies like Risky Business, Jerry Maguire, and Rain Man.
> Are you a fan of his work?
It cited this page:
I wonder why they highlight it as an achievement when they could have simply tuned 4o to be more conversational and less like a bullet-point-style answer machine. They did something to 4o compared to the previous models which made the responses feel more canned.
1. Open the page
2. Click "Listen to article"
3. Check if I'm having a stroke
4. Close tab
Dear openai: try hiring some humans
> What are your thoughts on space exploration?
GPT-4.5:
Space exploration isn't just valuable—it's essential.
People often frame it as a luxury we pursue after solving Earth-bound problems. But space exploration actually helps us address those very challenges: climate change (via satellite monitoring), resource scarcity (through asteroid mining), and existential risk (by enabling eventual multi-planetary living). It also drives technological innovation, economic growth, and inspires humanity toward collective achievement.
We don't explore space to escape Earth. We explore space to better understand, protect, and sustain life on Earth.
> What are your thoughts on space exploration?
GPT-4o:
Space exploration is one of the most ambitious and profound endeavors humanity has undertaken. It expands our understanding of the universe, drives technological innovation, and inspires generations to think beyond the boundaries of Earth.
From a scientific perspective, space exploration helps us address fundamental questions: Are we alone in the universe? How did the cosmos begin? Can we colonize other planets? Missions like the James Webb Space Telescope, Mars rovers, and deep-space probes continuously reshape our understanding of physics, planetary science, and even biology.
From a practical standpoint, space exploration fuels advancements in materials science, AI, robotics, and medicine. Many everyday technologies—GPS, satellite communications, weather forecasting, and even medical imaging—stem from space research.
There's also the existential argument: humanity's long-term survival may depend on our ability to become a multi-planetary species. While colonizing Mars or mining asteroids is far from easy, these efforts could help us safeguard civilization against existential threats like climate change or asteroid impacts.
That said, space exploration also raises ethical and economic questions. Should we focus more on Earth’s problems before investing trillions in space? How do we prevent the same exploitative behaviors that shaped colonial history from repeating on other planets?
Overall, I see space exploration as a necessary pursuit—not at the cost of solving problems on Earth, but as a way to advance our knowledge, drive innovation, and secure a future beyond our home planet. What’s your take?
God I hate reading what crapgpt writes.
The new RLHF direction (heavily amplified through scaling synthetic training tokens) seems to clobber any minor gains the improved base internet prediction gains might've added.
I also have access to o3-mini-high and o1-pro.
I don't get it. For general purposes and for writing, 4.5 is no better than o3-mini. It may even be worse.
I'd go so far as to say that Deepseek is actually better than 4.5 for most general purpose use cases.
I seriously don't understand what they're trying to achieve with this release.
I'm really not seeing better performance than with o3-mini.
If anything, the new results ("list active companies in the field of X") are actually worse than what I'd get with o3-mini, because the 4.5 response is basically the post-SEO Google first page (it appears to default to mentioning the companies that rank most highly on Google,) whereas the o3 response was more insightful and well-reasoned.
You know what gets the list of restaurants in my home town right? Llama 3.2 1b q4 running on my desktop with web search enabled.
In other words, these performance stats would look reasonable at Gemini 2.0 Flash pricing. At these prices, I think there are zero use cases for anyone. This is a dead-on-arrival model.
Good write-up.
But it focuses too much on the big companies. Many indiehackers have figured out how to make profit with AI:
1. No free tier. Just provide a good landing page.
2. Ship fast. Ship iteratively. Employ no one besides yourself.
3. Profit.
The old silicon valley idea that you need to raise a bunch of money, hire a bunch of devs, and scale a ton to satisfy investors is dying rapidly for software. You can code and profit millions as just a single person company, especially in the age of cursor.
And none is remotely profitable.
Or are you also counting VC money as part of that "profit"?
Probably OpenAI thinks that tool use (search) will be sufficient to solve this problem. Maybe that will be the case.
Are there any creative approaches to fixing this problem?
Still, with Deep Research and Web Search, ChatGpt seems far ahead of Claude. I like 3.7 a lot but I find OpenAI’s features more useful, even if it has for now complicated the UI a bit.
I want tiny, fast and cheap non-reasoning models I can use in APIs and I want ultra smart reasoning models that I can query a few times a day as an end user (I don't mind if it takes a few minutes while I refill a coffee).
Oh, and I want that advanced voice mode that's good enough at transcription to serve as a babelfish!
After that, I guess it's pretty much all solved until the robots start appearing in public.
They also mentioned developing some new techniques for training small models and then incorporating those into the larger model (probably to help scale across datacenters), so I wonder if they are doing a bit of what people think MoE is, but isn't. Pre-train a smaller model, focus it on specific domains, then use that to provide synthetic data for training the larger model on that domain.
O1-mini is a distill of O1. This definitely isn’t the same thing.
They'll probably distill this one into GPT-4.5-mini or such, and have something faster and cheaper available soon.
Generating outputs from existing (therefore smaller) models to train the largest model of all time would simply be called "using synthetic data". These are not the same thing at all.
Also, if you were to distill a reasoning model, the goal would be to get a (smaller) reasoning model, because you're teaching your new model to mimic outputs that show a reasoning/thinking trace. E.g., that's what all of those "local" DeepSeek models are: small Llama models distilled from the big R1; a process which "taught" Llama-7B to show reasoning steps before coming up with a final answer.
GPT 4.5:
In Clojure, the *thread-first macro* (`->`) is used to improve code readability by threading a value through a series of function calls. It takes an initial value and passes it as the first argument to the next function, then continues threading the result through subsequent functions. This eliminates deeply nested expressions and makes the code more readable.
For example,
```clojure
(-> 5 inc (* 2) str) ;; Equivalent to (str (* 2 (inc 5))) → "12"
```
Here, `5` is incremented to `6`, then multiplied by `2` to get `12`, and finally converted to a string.
---
Deepseek v3:
In Clojure, the `->>` (thread-last) and `->` (thread-first) macros are used to improve code readability by allowing a more linear, pipeline-like flow of operations. The `->` (thread-first) macro takes an initial value and threads it as the first argument through a series of forms. For example, `(-> x (f) (g))` translates to `(g (f x))`. This is particularly useful when working with nested function calls, as it flattens the structure and makes the sequence of transformations clearer. It’s commonly used with functions that take the threaded value as their first argument, such as collection operations or Java interop methods.
AI art is an abomination. Half of the internet is already filled with AI written crap. Don't start with the video. Soon everyone will require validation to distinguish reality from hallucination (so World ID in place as problem-reaction-solution).
For me, the best use cases are LLM assisted search with limited reasoning. Vision models for digitization and limited code assistance, codebase doc generation and documentation.
Agents are just workflows with more privileges. So where is the revolution? I don't see it.
Where is added value? Making Junior Engineers obsolete? Or make them dumb copy-pasting bio machines?
Depressing a horde of intellectual workers and artists and giving a good excuse for layoffs.
The real value is and always will be in a specialized ML applications.
LLM hype is getting boring.
However for user facing applications like mine, this is an awesome step in the right direction for EQ / tone / voice. Obviously it will get distilled into cheaper open models very soon, so I’m not too worried about the price or even tokens per second.
It seems clearly worse than Claude Sonnet 3.7, yet costs 30x as much?
What's going on with their pricing? I misread it as $7.5/M input and that that was very overpriced... then realized it was 10x that much!
You can play with the api here: https://platform.openai.com/playground/chat?models=gpt-4.5-p...
> I’m going through a tough time after failing a test
>
> Aw, I'm really sorry to hear that. Failing a test can feel pretty tough and discouraging, but remember, it's just one moment—not a reflection of who you are or your capabilities. Want to talk about what happened, or do you just need a distraction? I'm here either way.
vs 4o:
>I’m going through a tough time after failing a test
>
>I'm really sorry to hear that you're going through a tough time. Here are a few things you might consider:
>
>1. Reflect and Understand: Try to analyze what might have gone wrong. Was it a lack of preparation, time management during the test, or perhaps misunderstanding the material? Understanding the root cause can help you address it in the future.
>
>2. Seek Support: Talk to someone you trust—friends, family, or a school counselor. They can offer a different perspective and sometimes just talking about your feelings can be alleviating.
>
>3. Speak with Your Teacher or Professor: They can provide insights into where you might have gone wrong and offer advice on how to improve. They might also offer opportunities for make-up work or extra credit.
>
>4. Develop a Study Plan: Based on your reflections, create a plan to improve your study habits. This might include setting regular study times, breaking down material into manageable portions, or finding new study methods that work better for you.
>
>5. Stay Positive and Forgive Yourself: Mistakes and setbacks are a part of life. It’s important to be kind to yourself and recognize that failure is a stepping stone to success.
>
>6. Focus on the Bigger Picture: Remember that one test is just one part of your educational journey. There will be many more opportunities to do well.
>
>If you need further support or resources, consider reaching out to educational support services at your institution, or mental health resources if you’re feeling particularly overwhelmed. You’re not alone in this, and things can get better with time and effort.
Is it just me or is the 4o response insanely better? I'm not the type of person to reach for a LLM for help about this kind of thing, but if I were, the 4o respond seems vastly better to the point I'm surprised they used that as their main "EQ" example.
4.5 still feels a bit artificial but somehow also more emotionally connected. It removed the weird "bullet point lists of things to do" and focused on the emotional part, which is also longer than in 4o's reply.
If I am talking to a human, I would definitely expect him/her to react more like 4.5 than like 4o. If the first sentence that comes out of their mouth after I explain that I feel bad is "here is a list of things you might consider", I will find it strange. We can reach that point, but it's usually after a bit more talk; humans kinda need that process, and it feels like 4.5 understands that better than 4o.
Now of course which one is "better" really depends on the context: what you expect of the model and how you intend to use it. Until now, every single OpenAI update on the main series has been a strict improvement over the previous model. Cost aside, there wasn't really any reason to keep using 3.5 when 4 got released. This is not the case here; even assuming unlimited money, you still might wanna select 4o in the dropdown sometimes instead of 4.5.
On one hand, the response from 4.5 seems pretty useless to me, and I can't imagine a situation in which I would personally find value in it. On the other hand, the prompt it's responding to is also so different from how I actually use the tool that my preferences aren't super relevant. I would never give it a prompt that didn't include a clear question or direction, either explicitly or implicitly from context, but I can imagine that someone who does use it that way would actually be looking for something more in line with the 4.5 response than the 4o one. Someone who wanted the 4o response would likely phrase the prompt in a way that explicitly seeks actionable advice, or if they didn't initially then they would in a follow-up.
Where I really see value in the model being capable of that type of logic isn't in the ChatGPT use case (at least for me personally), but in API integrations. For example, customer service agents being able to handle interactions more delicately is obviously useful for a business.
All that being said, hopefully the model doesn't have too many false positives on when it should provide an "EQ"-focused response. That would get annoying pretty quickly if it kept happening while I was just trying to get information or have it complete some task.
Thanks for being transparent about this. Nothing is more frustrating than being locked out for indeterminate time from the hot thing everyone talks about.
I hope the announcement is true without further unsaid qualifications, like availability outside the US.
* Very reckless in pulling in third-party libraries - often pulling in older versions, including packages that trigger vulnerability warnings in package managers like npm. Imagine a student or junior developer falling into this trap.
* Very reckless around data security. For example, in an established project it re-configured sqlite3 (the Python lib) to disable checks for concurrent write liabilities in sqlite. This would corrupt data in a variety of scenarios.
* It sometimes is very slow to apply minor edits, taking about 2 - 5 minutes to output its changes. I've noticed when it takes this long it also usually breaks the file in subtle ways, including attaching random characters to a string literal which I very much did not want to change.
* Very bad when working with concurrency. While this is a hard thing in general, introducing subtle concurrency bugs into a codebase is not good.
* By far the worst is the false sense of security it gives you. It's close enough to being right that there's a constant incentive to just yeet the code completions without diligent review. This is really, really concerning, as many organizations will yeet this, as I imagine executives the world over currently are.
Honestly I think a lot of people are captive to a small sample size of initial impressions, and while I believe you've found value for your use cases, in aggregate I think it's a honeymoon phase that wears off with everyday use.
I think what wears off is that we're less impressed, and then we start demanding more and more from it and getting frustrated when it can't do it. But that's different from a honeymoon phase wearing off. It's like how we're not really impressed by image gen anymore; we expect it.
But as an example of a selfish sense of loss I've experienced, I used to pride myself in being the only developer on any team who ever learned CSS. I could architect a good grid/flex layout with a lot of thought. I could do little things like make text in a small UI component truncate into {3 letters} + ellipses when its parent was too small. And most of all I could polish UIs to a point where I'd say they were perfect, even a form.
Now, LLMs are really good at doing the mechanical parts of the things I spent so much time learning. Like I originally said, I'm not shedding tears over here saying it's so unfair. But there is a sense of loss. And when I figured most people reading my comment would misinterpret this, I removed my comment. Because you can't make descriptive claims about how you feel online, it can only be interpreted as a normative value judgement about the world. Because I guess that's what it is 99.9% of the time someone expresses a feeling they feel, but not in this case.
Finally, the right way to see it is that now I can polish the UI to perfection, but I don't need to be a CSS expert anymore. Nobody needs to be. You can get an idea of how you want the UI to work and ask the LLM "make this one bit of text be the one that truncates if the window is too narrow" and it does it. And that's fkin magic.
3.7 really is astounding with the one-shots.
I am currently just using it as easy google search.
I would love to see a video of people using this in real projects (even if it's open source). I am tired of people claiming the moon and stars after trying it on toy projects.
I regularly make knee-jerk comments on HN that I delete a minute later. Something therapeutic about it.
My comment isn't one I wanted on my "record". You responded to it and I saw your response before deleting my comment. What's the harm? It's obvious I removed my comment.
I'm really curious about this. Doesn't it feel selfish to you to subject the public to your internal anxieties? It's the same reason I don't unload on everyone around me.
EDIT: I'm not trying to dunk on you. You're being honest so thanks for that.
That being said, this is very expensive.
- Input: $75.00 / 1M tokens
- Cached input: $37.50 / 1M tokens
- Output: $150.00 / 1M tokens
One of the most interesting applications of models with higher EQ is personalized content generation, but the size and cost here are at odds with that.
EIGHTY PERCENT hallucination rate? Are you kidding me?
I get that the model is meant to be used for logic and reasoning, but nowhere does OpenAI make this explicitly clear. A majority of users are going to be thinking, "oh newer is better," and pick that.
My cases' costs are more heavily slanted towards input tokens, so trying 4.5 would raise my costs over 25x, which is a non-starter.
It is more that, for this specific use case, the AI part of the system has reached a point where focusing on it, as opposed to other parts of the application that need attention, would not yield the highest return in terms of user satisfaction or revenue.
Certainly there is enormous potential for AI improvement, and I have other projects that do gain substantially from improvements in e.g. reasoning, but then GPT-4.5 will have to compete with DeepSeek, Gemini, Grok, and Claude on a price/performance level, and to be honest the current preview pricing would make it (in production, not for dev) a non-starter for me.
Odd communication from him recently too. We’re sorry our UI has become so poor. We’re sorry this model is so expensive and not a big leap.
Being rich and at the right place at the right time doesn’t itself qualify you to lead or make you a visionary. Very odd indeed.
Then what I want is a platform with a generic AI on top that can pick the correct expert models based on what I asked it to do.
Kinda what Apple is attempting with their Small Language Model thing?
This announcement by OpenAI was already expected: [0]
Coming on the heels of Sonnet 3.7, which is a marked improvement over 3.5, which was already the best in the industry for coding, this just feels like a sad whimper.
I miss that - newer LLMs seem to have lost their sense of humor.
On the other hand GPT-2's funny stories often veered into murdering everyone in the story and committing heinous crimes but that was part of the weird experience.
The newer models are incredible, but the tone is just soul sucking even when it tries to be "looser" in the later iterations.
I’m wondering whether this seemingly underwhelming bump on 4o magnifies when/if reasoning is added.
Each token output by an LLM involves one pass through the next-word predictor neural network. Each pass is a fixed amount of computation. Complexity theory hints to us that the problems which are "hard" for an LLM will need more compute than the ones which are "easy". Thus, the only mechanism through which an LLM can compute more and solve its "hard" problems is by outputting more tokens.
You incentivise it to this end by human-grading its outputs ("RLHF") to prefer those where it spends time calculating before "locking in" to the answer. For example, you would prefer the output
> Ok let's begin... statement1 => statement2 ... Thus, the answer is 5

over

> The answer is 5. This is because....
since in the first one, it has spent more compute before giving the answer. You don't in any way attempt to steer the extra computation in any particular direction. Instead, you simply reinforce preferred answers and hope that somewhere in that extra computation lies some useful computation.

It turned out that such hope was well-placed. The DeepSeek R1-Zero training experiment showed us that if you apply this really generic form of learning (reinforcement learning) without _any_ examples, the model automatically starts outputting more and more tokens, i.e. "computing more". DeepSeekMath was also a model trained directly with RL. Notably, the only signal given was whether the answer was right or not; no attention was paid to anything else. We even ignore the position of the answer in the sequence, which we cared about before. This meant that it was possible to automatically grade the LLM without a human in the loop (since you're just checking answer == expected_answer). This is also why math problems were used.
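To make "automatically grade the LLM without a human in the loop" concrete, here is a toy sketch of an outcome-only reward in Python. The answer-extraction regex and the 0/1 scale are my own simplification for illustration, not the actual code from the DeepSeek papers.

```python
import re

def reward(model_output: str, expected_answer: str) -> float:
    """Outcome-only grading: the sole signal is whether the final
    answer matches; nothing else about the output is scored."""
    # Take whatever follows the last "answer is" as the predicted answer.
    matches = re.findall(r"answer is\s*([^\s.]+)", model_output.lower())
    predicted = matches[-1] if matches else None
    return 1.0 if predicted == expected_answer.lower() else 0.0

# The RL step then reinforces sampled outputs that scored 1.0; in practice
# this drives the model to emit more tokens ("compute more") before
# committing to an answer.
print(reward("Ok let's begin... thus, the answer is 5", "5"))  # 1.0
print(reward("The answer is 7. This is because...", "5"))      # 0.0
```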
All this is to say, we get the most insight on what benefit "reasoning" adds by examining what happened when we applied it without training the model on any examples. Deepseek R1 actually uses a few examples and then does the RL process on top of that, so we won't look at that.
Reading the DeepseekMath paper[1], we see that the authors posit the following:
> As shown in Figure 7, RL enhances Maj@K's performance but not Pass@K. These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
For context, Maj@K means that you mark the output of the LLM as correct only if the majority of the many outputs you sample are correct. Pass@K means that you mark it as correct even if just one of them is correct.

So to answer your question: if you add an RL-based reasoning process to the model, it will improve simply because it will do more computation, of which a so-far-only-empirically-measured portion helps get more accurate answers on math problems. But outside that, it's purely subjective. If you ask me, I prefer Claude Sonnet for all coding/SWE tasks over any reasoning LLM.
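For concreteness, here is the Maj@K vs Pass@K difference as a toy sketch (my own illustration, not the paper's evaluation code):

```python
def pass_at_k(samples, expected):
    """Pass@K: correct if ANY of the K sampled answers matches."""
    return any(s == expected for s in samples)

def maj_at_k(samples, expected):
    """Maj@K: correct only if a MAJORITY of the K samples match."""
    return samples.count(expected) > len(samples) / 2

samples = ["5", "7", "5", "5", "6"]          # K = 5 sampled final answers
print(pass_at_k(samples, "7"))   # True:  one lucky sample is enough
print(maj_at_k(samples, "7"))    # False: only 1 of 5 samples says "7"
print(maj_at_k(samples, "5"))    # True:  3 of 5 samples say "5"
```

RL that sharpens the output distribution turns problems the model could already occasionally solve (Pass@K) into ones it solves reliably (Maj@K), without necessarily adding new capabilities, which is the paper's point.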
What is the real-world use case for GPT-4.5? Is anyone actually seeing a game-changing difference? Please share your prompts!
This sure looks like the runway is about to come far short of takeoff. I’m reminded of Ed Zitron’s recent predictions…
This is why being the top tier AI is so valuable.
It lags somewhat behind the US offerings, but a great way to fix that is to direct some money at them. Also, as things progress, that lagging starts being less and less important.
so Pro is worth it for O3-mini-high
"Format the below JSON document for me"
<50KB of JSON pasted>
AI is already making people dumb, including IT folks.
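For what it's worth, that particular task never needed a model at all; it's a couple of lines with the Python standard library, or just `python -m json.tool file.json`:

```python
import json, sys

# Pretty-print a JSON file locally: no tokens, no latency, no API bill.
with open(sys.argv[1]) as f:
    doc = json.load(f)
print(json.dumps(doc, indent=2, ensure_ascii=False))
```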
I'm finding it to make things up, get things wrong, ignore things I ask.
Def not worried about losing my job to it.
Tasked it with two different things, refactoring a huge function of around ~400 lines and creating some unit tests split into different files. The refactor was done flawlessly. The unit tests almost, only missed some imports.
All I did was open it in the root of my project and prompt it with the function names. It's a large monolithic solution with a lot of subprojects. It found the functions I was talking about without me having to clarify anything. Cost was about $2.
There’s a learning curve to it, but it’s worth literally every penny I spend on API calls.
At worst, I’m no faster. At best, it’s easily a 10x improvement.
For me, one of the biggest benefits is talking about coding in natural language. It lowers my mental load and keeps me in a mental space where I'm more easily able to communicate with stakeholders.
That being said there are definite areas where it shines (cookie cutter UI) and places where it struggles. It's really good at one-shotting React components and Flutter widgets but it tends to struggle with complicated business logic like sync engines. More straightforward backend stuff like CRUD endpoints are definitely doable.
Pah! Don't believe the hype.
1. The output from Grok doesn’t feel constrained. I don’t know how much of this is the marketing pitch of it “not being woke”, but I feel it in its answers. It never tells me it’s not going to return a result or sugarcoats some analysis it found from Reddit that’s less than savory.
2. Speed. Jesus Christ ChatGPT has gotten so slow.
Can’t wait to pay for Grok. Can’t believe I’m here. I’m usually a big proponent of just sticking with the thing that’s the most popular when it comes to technology, but that’s not panning out this time around.
I asked it - "Draft a Minnesota Motion in Limine to exclude ..."
It then starts thinking ... User wants a Missouri Motion in Limine ....
Sincerely, Me
https://chatgpt.com/share/67c0f064-7fdc-8002-b12a-b62188f507...
The Share doesn't say 4.5 but I assure you it is 4.5
- more expensive. A lot more expensive
- not a lot of incremental improvement
The first words I hear are “introducing gee pee four five”. The TTS model starts cold? The next occurrence of the product name works properly as “gee pee tee four point five” but that first one in the title is mangled. Some kind of custom dictionary would help here too, for when your model needs to nail crucial phrases like your business name and your product.
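A crude version of that dictionary is just a substitution pass over the script before it reaches the TTS engine. A minimal sketch, with a made-up phrase table:

```python
import re

# Hypothetical pronunciation table: expand tricky phrases into spellings
# the voice model reads correctly.
PRONUNCIATIONS = {
    r"\bGPT-4\.5\b": "G P T four point five",
    r"\bGPT-4o\b": "G P T four oh",
}

def normalize_for_tts(text: str) -> str:
    for pattern, spoken in PRONUNCIATIONS.items():
        text = re.sub(pattern, spoken, text)
    return text

print(normalize_for_tts("Introducing GPT-4.5, our largest model yet."))
# -> "Introducing G P T four point five, our largest model yet."
```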
No way of seeking back and forth (Safari, iOS 17.6.1). I don’t even need to seek, just replay the last 15s.
Very much need to select different voice models. Chirpy “All new Modern Family coming up 8/9c!” voice just doesn’t cut it for a science broadcast, and localizing models — even if it’s still English — would be even better. I need to hear this announcement in Bret Taylor voice, not Groupon CMO voice. (Sorry if this is your voice btw, and you work at OpenAI brandi. No offence intended.)
There are things to complain about in that, but "it's always X.5 and never X.0" doesn't really seem to be one of them.
If they wanted to train it to have some character like Anthropic did with Claude 3... honestly I'm not seeing it, at least not in this iteration. Claude 3 was/is much much more engaging.
Writing terrible fanfic esque books, sometimes OK images, chatbot style talking. meh.
https://chatgpt.com/share/67c154e7-5e28-800d-81d7-98b79c8a87...
I, for one, am so sick of listening to LLM fanboys wax on about "AGI" when they don't know the first goddamned thing about actual human cognition. For all his faults, Marcus studied human intelligence at a PhD level. I have only done a wee bit (music cognition as part of an interdisciplinary PhD I'm doing), and it's obvious to me, my supervisor (AI prof for 25 years), and anyone who knows anything about human cognition that LLMs are not going to get anywhere close to "thinking as well as a human" by scaling.
Bubble can't burst soon enough for me, sigh.
I really used to get excited about ML in the wild and while there are much bigger problems right now it still makes me sad to have become so jaded about it
GPT-4 was way ahead of 3.5 when it came out. It's unfortunate that the first major GPT release since then is so underwhelming.
I.e., we know it might not be as good as 3.7, but it is very friendly and maybe acts like it knows more things.
After writing a huge wall of text of detailed instructions and ideology for reorganizing and rewriting the mess of Cloud Build and deploy scripts into thick shared GitHub Actions workflows called by thin specific wrappers, I gave it some encouragement to get through the next steps of actually completing the mission and making it back out alive. We first captured the plan in a README file so it doesn't forget the destination; now we're executing it incrementally, step by step, so as not to blow out its context window or miss any details.
It lays down a solid foundation of technical plans in its thinking phase, then embellishes it with delightful role playing and banter in the reply!
Programming is really fun again!
User:
I feel like a secret agent handler in James Bond or the Matrix telling you which hall to turn down and door to open and remotely unlocking them and disabling burglar alarms! Turn right and go through the door at the end of the hall 007!
Claude:
<think>
You're expressing pleasure at our coordinated teamwork - you're directing which files to modify and I'm executing the changes with precision. Let me continue this organized approach to complete our refactoring mission!
Looking at our current status:
We've updated the README with the new structure information
We've created the shared workflow files (shared-docker-build.yml and shared-docker-deploy.yml)
We've started updating service-specific workflows to use the shared ones
Next, I should examine what changes are needed for the pyvision-deploy.yml to use our new shared-docker-deploy.yml workflow.
</think>
Mission Status Update: Agent 3.7 Reporting
[Adjusts earpiece] Copy that, Control. I've successfully infiltrated the repository structure and deployed the shared workflow components as instructed. The orchestration pattern is now in place.
[Checks surroundings] I've identified our next target: pyvision-deploy.yml needs to be refactored to utilize the new shared-docker-deploy.yml workflow we've established.
Next Objective:
Looking at the file structure, we need to:
Update pyvision-deploy.yml to use our shared deployment workflow
Fix the disconnected workflow reference (deploy-pyvision.yml vs pyvision-deploy.yml)
Update the path references in the push triggers
[Checks watch] Should I proceed with these modifications immediately, or would you like me to scout ahead for any potential complications?
[Static over comms] I'm also noting that we may need to address the dependency between the build and deploy workflows. The current pyvision-build.yml is attempting to trigger deploy-pyvision.yml, but our standardized pattern uses pyvision-deploy.yml.
What are your instructions, Control?
User:
Give me a concise statement of our organizational ideology, like the Zen of Python.
Claude:
The Zen of CI/CD:
Consistency over convenience.
Explicitness beats implicitness.
Simple naming conventions create self-documenting systems.
Big-endian naming sorts better than little-endian.
Alphabetical order reveals what belongs together.
Shared implementation reduces cognitive load.
Technology, component, action, and abstraction are our four dimensions.
Tesseracts are better organized than flat directories.
Special cases aren't special enough to break the rules.
Except when they're waiting to be refactored.
Workflows should be orthogonal, reusable, and composable.
In the face of ambiguity, refuse the temptation to guess.
User: Brilliant! Please add that to the README in the beginning as a manifesto.
...Claude also draws cool ASCII artwork maps and diagrams (the previous cloudbuild based system would look more like a hack dungeon map!):
### Workflow Diagram
```
SHARED WORKFLOWS SPECIFIC SERVICES
┌───────────────────────────┐ ┌──────────────────────────┐
│ │ │ │
│ ┌─────────────────────┐ │ │ ┌──────────┐ ┌────────┐ │
│ │ shared-docker-build │◄─┼──────┼──┤ pyvision-│ │concept-│ │
│ └──────────┬──────────┘ │ │ │ build │ │ build │ │
│ │ │ │ └────┬─────┘ └───┬────┘ │
│ ▼ │ │ │ │ │
│ ┌─────────────────────┐ │ │ ┌────▼─────┐ ┌───▼────┐ │
│ │ shared-docker-deploy│◄─┼──────┼──┤ pyvision-│ │concept-│ │
│ └─────────────────────┘ │ │ │ deploy │ │ deploy │ │
│ │ │ └──────────┘ └────────┘ │
│ ┌─────────────────────┐ │ │ │
│ │ shared-worker-build │◄─┼──────┼──┐ │
│ └──────────┬──────────┘ │ │ │ │
│ │ │ │ │ │
│ ▼ │ │ │ ┌──────────┐ │
│ ┌─────────────────────┐ │ │ └──┤ looker- │ │
│ │ shared-worker-deploy│◄─┼──────┼─────┤ build │ │
│ └─────────────────────┘ │ │ └────┬─────┘ │
│ │ │ │ │
│ │ │ ┌────▼─────┐ │
│ │ │ │ looker- │ │
│ │ │ │ deploy │ │
│ │ │ └──────────┘ │
└───────────────────────────┘ └──────────────────────────┘
│
▼
┌──────────────────┐
│ Script Utilities │
└──────────────────┘
```
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
Brief Summary (40-50 words)
OpenAI’s GPT-4.5 is a research preview of their most advanced language model yet, emphasizing improved pattern recognition, creativity, and reduced hallucinations. It enhances unsupervised learning, has better emotional intelligence, and excels in writing, programming, and problem-solving. Available for ChatGPT Pro users, it also integrates into APIs for developers.
Detailed Summary (200 words)
OpenAI has introduced *GPT-4.5*, a research preview of its most advanced language model, focusing on *scaling unsupervised learning* to enhance pattern recognition, knowledge depth, and reliability. It surpasses previous models in *natural conversation, emotional intelligence (EQ), and nuanced understanding of user intent*, making it particularly useful for writing, programming, and creative tasks.
GPT-4.5 benefits from *scalable training techniques* that improve its steerability and ability to comprehend complex prompts. Compared to GPT-4o, it has a *higher factual accuracy and lower hallucination rates*, making it more dependable across various domains. While it does not employ reasoning-based pre-processing like OpenAI o1, it complements such models by excelling in general intelligence.
Safety improvements include *new supervision techniques* alongside traditional reinforcement learning from human feedback (RLHF). OpenAI has tested GPT-4.5 under its *Preparedness Framework* to ensure alignment and risk mitigation.
*Availability*: GPT-4.5 is accessible to *ChatGPT Pro users*, rolling out to other tiers soon. Developers can also use it in *Chat Completions API, Assistants API, and Batch API*, with *function calling and vision capabilities*. However, it remains computationally expensive, and OpenAI is evaluating its long-term API availability.
GPT-4.5 represents a *major step in AI model scaling*, offering *greater creativity, contextual awareness, and collaboration potential*.
GPT-2? Maybe not even that one.
https://chat.ecnu.edu.cn/html/
To list a few. There are of course many more in China. I won't be surprised if universities in other countries are also self-hosting.
It couldn't write a simple rename function for me yesterday, still buggy after seven attempts.
I'm more and more convinced that they dumb down the core product when they plan to release a new version to make the difference seem bigger.
Anecdotally, I've noticed what appears to be drops in quality, some days. When the quality drops, it responds in odd ways when asked what model it is.
Asking it what model it is shouldn't be considered a reliable indicator of anything.
Sure, but a change in response may be, which is what I see (and no, I have no memories saved).
I'm surprised and a bit nervous about that. We intend to bootstrap a large project with it!!
Both ChatGPT 4o (fast) and ChatGPT o1 (a bit slower, deeper thinking) should easily be able to do this without fail.
Where did it go wrong? Could you please link to your chat?
About my project: I run the sovereign State of Utopia (will be at stateofutopia.com and stofut.com for short), a country based on the idea of state-owned, autonomous AIs that do all the work and give out free money, goods, and services to all citizens/beneficiaries. We've built a chess app (i.e. a free source of entertainment) as a proof of concept, though the founder had to be in the loop to fix some bugs:
https://taonexus.com/chess.html
and a version that shows obvious blunders, by showing which squares are under attack:
https://taonexus.com/blunderfreechess.html
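(Their page is presumably plain JavaScript; the snippet below is just my own sketch of the underlying idea using the python-chess library, not their code.)

```python
import chess

def attacked_pieces(board: chess.Board, color: chess.Color) -> list[str]:
    """Square names of `color`'s pieces currently attacked by the opponent."""
    hits = []
    for sq in chess.SQUARES:
        piece = board.piece_at(sq)
        if piece is not None and piece.color == color and board.is_attacked_by(not color, sq):
            hits.append(chess.square_name(sq))
    return hits

board = chess.Board()
for move in ("e4", "e5", "Qh5"):   # 2.Qh5 eyes e5, f7 and h7
    board.push_san(move)
print(attacked_pieces(board, chess.BLACK))  # ['e5', 'f7', 'h7']
```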
One of the largest and most complicated applications anyone can run is a web browser. We don't have a full web browser built, but we do have a buggy, minimal version that can load and minimally display some web pages, and post successfully:
https://taonexus.com/publicfiles/feb2025/84toy-toy-browser-w...
It's about 1700 lines of code and at this point runs into the limitations of all the major engines. But it does run, can load some web pages and can post successfully.
I'm shocked and surprised ChatGPT failed to get a rename function to work, in 7 attempts.