I'm doing an interdisciplinary PhD in CS and Music, so both music cognition and AI are in my program. From what I can see (which is not a lot of data, I admit) no one who has actually studied music cognition would think LLMs are going to lead to AGI by scaling. Linguistic thinking is only a small part of human intelligence, and certainly not the most complex. LLMs are so, so far from being able to do the thinking that goes on in a real-time musical improvisation context that it's laughable.
But if you took an LLM size neural network and trained it on all the music in the world - I dare say you may get some interesting results.
Yes, we can today make natural-sounding hyper-average text, images, and perhaps even music. But that's not what writing and creating is about.
LLMs can solve equations and write code, like make a React web page.
So they assume “a story is just an equation, we just need to get the LLM to write an equation”. Wrong imo.
Storytelling is above language in the problem space. Storytelling can be done through multiple methods - visual (images), auditory (music), characters (letters and language) and gestural (using your body, such as a clown or a mime; I'm thinking of Chaplin or Jacques Tati). That means storytelling exists above language, not as a function of it. If you agree with the Noah Harari vision of things, which I do, then storytelling is actually deeply connected somehow to our level of awareness and it seems to emerge with consciousness.
Which means that thinking that an LLM that can understand story because it understands language is… foolish.
Storytelling lives in a problem space above language. Thinking you can get an LLM to tell a story because you can get it to write a sentence is a misunderstanding of what problem space stories are in.
It’s the same as thinking that if an LLM can write a React web app it can invent brand new and groundbreaking computer science.
It can’t. Not for now anyway.
Do you understand why modern LLMs are different from Markov chains?
Doesn’t the AI model basically take an input of context tokens and return a list of probabilities for the next token, which is then chosen randomly, weighted by the probability? Isn’t that exactly the definition of a Markov chain?
LLMs basically return a Markov chain every single time. Think of it as a function returning a value vs returning a function.
Now, I'm sure a sufficiently large Markov chain can simulate an LLM, but the exponentials involved here would make the number of atoms in the universe look like a small number. The mechanism that compresses this down into a manageable size is, famously, 'attention is all you need'.
No, LLMs are a Markov chain. Our brain, and other mammalian brains, have feedback, strange loops, that a Markov chain doesn't. In order to reach reasoning, we need to have some loops. In that way, RNNs were much more on the right track towards achieving intelligence than the current architecture.
But no, most LLMs have a tweakable 'temperature' parameter that introduces some randomness and sometimes produces very interesting results.
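For anyone who hasn't looked at the mechanics: the sampling step being argued about here is genuinely tiny. A rough sketch (plain Python/NumPy, with made-up logits standing in for the model's output) of how a next-token distribution gets a temperature applied and then sampled:

    import numpy as np

    def sample_next_token(logits, temperature=1.0, seed=None):
        rng = np.random.default_rng(seed)
        # Lower temperature sharpens the distribution (more deterministic),
        # higher temperature flattens it (more random).
        scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
        # Softmax turns raw scores into probabilities.
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        # Draw the next token id, weighted by those probabilities.
        return int(rng.choice(len(probs), p=probs))

    # Toy example: a 4-token vocabulary and some invented scores.
    print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7))

Whether you call "context in, distribution out, weighted draw" a Markov chain is mostly the terminology fight above; the interesting part is how the distribution gets computed, not how it's sampled.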
have you actually tried any of the commercial AI music generation tools from the last year, eg suno? not an LLM but rather (probably) diffusion, it made my jaw drop the first time i played with it. but it turns out you can also use diffusion for language models https://www.inceptionlabs.ai/
It is not the fault of the model though. MusicLM shows what could be done.
The problem is they just aren't trained on enough interesting music to impress me.
Of course, if you never played music before I am sure it is super cool to produce music.
It would be like if the AI art models had only been trained on a small amount of drawings.
Compared to being trained on every piece of recorded music ever produced? We are just so far from that.
I think to really do something interesting you would have to train your own model.
Can we make a program that could: work the physical instrument, react in milliseconds to other players, create new work off input it's never heard before, react to things in the room, and do a myriad of other things a performer does on stage in real time? Not even remotely close. THAT is what would be required to call it AGI – thinking (ALL the thinking) on par with a highly trained human. Pretending anything else is AGI is nonsense.
Is it impressive? Sure. Frighteningly so, even. But it's not AGI, and the claims that it is are pure hucksterism. They just round the term down to whatever the hell is convenient for the pitch.
Is it? I wasn’t aware of “playing a physical instrument on stage with millisecond response times” as a criterion. I’m also confused by the implication that professional composers aren’t using intelligence in their work.
You’re talking about what is sometimes called “superhuman AGI”, human level performance in all things. But AGI includes reaching human levels of performance across a range of cognitive tasks, not ALL cognitive tasks.
If someone claimed they had invented AGI because amongst other things, it could churn out a fresh, original, good composition the day after hearing new input - I think it would be fair to argue that is human level performance in composition.
Defining fresh, good, original is what makes it composition. Not whether it was done in real time; that’s just mechanics.
You can conceivably build something that plays live on stage, responding to other players, creating a “new work”, using super fast detection and probabilistic functions without any intelligence at all.
> thinking (ALL the thinking) on par with a highly trained human
you are mistaking means for ends. "an automobile must be able to perform dressage on par with a fine thoroughbred!"
Some, like Penrose, even argue that the global optimum of general intelligence and consciousness is a fully physical process, yes, but one that involves uncomputable physics and is thus permanently out of reach of whatever computers can do.
And yet is somehow within reach of a fertilised human egg.
It's time to either invoke mystical dimensions of reality separating us from those barbarian computers, or admit that one day soon they'll be able to do intelligence too.
Understimulated or feral children don’t automatically become geniuses when given more information.
It takes social engineering and tons of accumulated knowledge over the lifespan of the maturation of these eggs. The social and informational knowledge are then also informed by these individuals (how to work and cooperate with each other, building and discovering knowledge beyond what a single fertilized egg is able to do).
This isn’t simply within reach of a fertilized egg based on its biological properties.
Current LLMs seem to be most similar to the linguistic/auditory portion of our cognitive system, and with thinking/reasoning models some bits of the executive function. But my guess is that if we want to see awe-inspiring stuff come out of them, we need stuff like motivation and emotion, which doesn't seem to be the direction we're heading towards.
Unprofitable, full of problems. Maybe 1 in 100,000 might be an awe-inspiring genius, given the right training, environment, and other intelligences (so you might have to train way more than 100K models).
Penrose's argument is interesting and I am inclined to agree with it. I might very well be wrong, but I don't think the accusation of magical thinking is warranted.
This is wrong. Computability is by no means the same as physicality. That's the whole point and you're just ignoring it to make some strawman accusation of ridiculousness.
2. If that property is a physical property, what prevents simulation of it?
The question is malformed. How can something compute an uncomputable thing?
> 2. If that property is a physical property, what prevents simulation of it?
Is P=NP?
If humans cannot compute uncomputable things either, on what grounds do you claim we are capable of something that computers are incapable of?
I know you think this is a gotcha moment so I will just sign off on this note. You think physical = computable. I think physical > computable. I understand your argument and disagree with it, but you can't seem to understand mine.
I've tried to follow your reasoning, which AFAICT comes down to a claim that humans possess something connected to incomputability, and computers do not. But now it seems you hold this difference to be irrelevant.
So again: What do you think the difference in capability between humans and computers is?
This is the easiest part. Keyboards are only necessary if you have fingers, an AI could very easily send midi notes directly to the instrument.
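(For the curious, that part really is trivial. A sketch with the mido library; it assumes a MIDI output port is actually available on your machine:)

    import mido

    # Open the default MIDI output; which device that is depends on your setup.
    port = mido.open_output()
    # Middle C on, then off.
    port.send(mido.Message('note_on', note=60, velocity=80))
    port.send(mido.Message('note_off', note=60))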
I tend to agree with you, but I would be cautious in stating that an understanding of music is required for AGI. Plenty of humans are tone deaf and can’t understand music, yet are cognitively capable.
I’d love to hear about your research. I have my PhD in neuroscience and a undergrad degree in piano performance. Not quite the same path as you, but similar!
(See observation 1 for context) https://blog.samaltman.com/three-observations
I feel this is way out of line given the resources and marginal gains. If the claimed scaling laws don’t hold (more resources = closer to intelligence) then LLMs in their current form are not going to lead to AGI.
It always seemed to me a wild leap to assume that LLMs in their current form would lead to AGI. I never understood the argument.
I suspect it is an intentional result of deceptive marketing. I can easily imagine an alternative universe where different terminology was used instead of "AI" without sci-fi comparisons and barely anyone would care about the tech or bother to fund it.
I mean, certainly people like Sam Altman were pushing it hard, so it’s easy to understand how an outside observer would be confused.
But it also feels like a lot of VCs and AI companies have staked several hundred billion dollars on that bet, and I still don't see what the inside players, who should (and probably do!) have more knowledge than me, are seeing. Why are they dumping so much money into this bet?
The market for LLMs doesn’t seem to support the investment, so it feels like they must be trying to win a “first to AGI” race.
Dunno, maybe the upside of the pretty unlikely scenario is enough to justify the risk?
Sam Altman is a very good hype man. I don’t think anyone on the inside genuinely thinks LLMs will lead to AGI. Ed Zitron has been looking at the costs vs the revenue in his newsletter and podcast and he’s got me convinced that the whole field is a house of cards financially. I already considered it much overblown, but it’s actually one of the biggest financial cons of our time, like NFTs but with actual utility.
If you find yourself agreeing, I highly recommend subscribing to his newsletter.
“A person is smart. People are dumb, panicky dangerous animals and you know it.”
If AGI wants to hit human level intelligence, I think it’s got a long way to go. But if it’s aiming for our collective intelligence, maybe it’s pretty close after all…
It is still interesting tech. I wish it were being used more for search and compression.
So yep, a lot of time, they bet on trends. Cryptocurrencies, NFTs, several waves of AI. The question is just the acquisition or IPO price.
I don't doubt that some VCs genuinely bought into the AGI argument, but let's be frank, it wasn't hard to make that leap in 2023. It was (and is) some mind-blowing, magical tech, seemingly capable of far more than common sense would dictate. When intuition fails, we revert to beliefs, and the AGI church was handing out brochures...
It...does seem hard to make that leap to me. I mean, again, to a casual and uncritical outside observer who is just listening to and (in my mind naively) trusting someone like Sam Altman, then it's easy, sure.
But I think for those thinking critically about it... it was just as unjustified a leap in 2023 as it is today. I guess maybe you're right, and I'm just really overestimating the number of people that were thinking critically vs uncritically about it.
They only need to last until the exit (potentially next round).
> The market for LLMs doesn’t seem to support the investment
i.e. it doesn't matter as long as they find someone else to dump it to (for profit).
I mean, see also, AR/VR/Metaverse. My suspicion is that, for the likes of Google and Facebook, they have _so much money_ that the risk of being wrong about LLMs exceeds the risk of wasting a few hundred billion on LLMs. Even if Google et al don’t really think there’s anything much to LLMs, it’s arguably rational for them to pump the money in, in case they’re wrong.
That said, obviously this only works if you’re Google or similar, and you can take this line of reasoning too far (see Softbank).
People were declaring ELIZA was intelligent after interacting with it and ELIZA is barely a page of code.
In truth, basically everything in reality settles towards an equilibrium. There is no inevitable ultraviolet catastrophe, free energy machine, Malthusian collapse. Moore's law had a good run, but frequencies stopped improving twenty years ago and performance gains are increasingly specific and expensive. My car also accelerates consistently through many orders of magnitude, until it doesn't. If throwing more mass at the problem could create general superintelligence and do so economically to such strong advantages, then why haven't biological neural networks, which are vastly more capable and efficient neuron-for-neuron than our LLMs, already evolved to do so?
"Man selling LLMs says LLMs will soon take over the world, and YOU too can be raptured into the post-AI paradise if you buy into them! Non-believers will be obsolete!" No, he hasn't studied enough neuroscience nor philosophy to even be able to comment on how human intelligence works, but he's convinced a lot of rich people to give him more money than you can even imagine, and money is power is food for apes is fancy houses on TV, so that must mean he's qualified/shall deliver us, see...
That’s what we put in our 2017 paper.
It does mean that there is a simple way to keep building smarter LLMs.
I have never seen a clear definition of what AGI is and what it means to achieve it
Granted that latter delta is much harder to measure, but history has shown repeatedly that that delta is always orders of magnitude bigger than we think it is when $GOAL=AGI.
In nutshell: https://xkcd.com/605/
Grug version: Man sees exponential curve. Man assumes it continues forever. Man says "we went from not flying to flying in few days, in few years we will spread in observable universe."
That may happen in 200 years. Geologically as you zoom out, the difference is negligible.
This is way way less than the observable universe.
(I've seen this 'enough money ought to surmount any barrier' take a few times, usually to reject the idea that we might not find any path to AGI in the near future.)
The delicious irony is that we know how to solve climate change.
We’re simply not doing enough about it.
Only a few edge cases remain.
Same for dark matter. There are a few hypotheses, but all in all they are pretty simplistic cases.
We know the particles, the forces, there is not really “new physics” to be discovered here.
All the interactions anybody can encounter in their lives is fully understood.
For example John says "Please ask Jane to buy me an ice-cream" and the AI might be able to do that. If she doesn't, John can ask it to coerce her.
I think what they should be saying is - from a software stack standpoint, current tech unlocks AGI if it can be sped up significantly. So what we’re really waiting for are more software and hardware breakthroughs that make their performance many orders of magnitude quicker.
Consider also that any advanced language model already surpasses individual human knowledge, since each model compresses and synthesizes the collective insights, wisdom, and information of human civilization.
Now imagine a frontier model with agency, capable of deliberate, reflective thought: if it could spend an hour thinking, but that hour feels instantaneous to us, it would essentially match or exceed the productivity and capability of a human expert using a computer. At that point, the line between current AI capabilities and what we term AGI becomes indistinguishable.
In other words: deeper reflection combined with computational speed means we’re already experiencing AGI-level performance—even if we haven’t fully acknowledged or appreciated it yet.
> The intelligence of an AI model roughly equals the log of the resources used to train and run it.
Pretty sure how log curves work is you add exponentially more inputs to get linear increases in outputs. That would mean it's going to get wildly more difficult and expensive to get each additional marginal gain, and that's what we're seeing.
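Taking the claim at face value, with "intelligence units" as a purely made-up quantity for illustration:

    # If intelligence I = log10(resources R), each extra unit of "intelligence"
    # costs 10x the resources of the one before it.
    for intelligence in range(1, 6):
        resources = 10 ** intelligence
        print(f"I = {intelligence} -> R = {resources:,}")
    # prints R = 10, 100, 1,000, 10,000, 100,000 as I goes from 1 to 5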
Essentially the claim is that it gets exponentially cheaper at the same rate the logarithmic resource requirements go up. Which leads to point 3 - linear growth in intelligence is expected over the next few years and AGI is a certainty.
I feel that's not happening at all. This costs 15x more to run compared to 4o and is marginally better. Perhaps the failing of this prediction is on the cost side but regardless it's a failing of the prediction for linear growth in intelligence. Would another 15x resources lead to another marginal gain? Repeat? At that rate of gain we could turn all the resources in the world to this and not hit anywhere near AGI.
4o has been very heavily optimized though. I'd say a more apples-to-apples comparison would be to original gpt-4, which was $30/$60 to 4.5's $75/$150. Still a significant difference, but not quite so stark.
Third point, I think, is that the intelligent work that the models can do will increase exponentially because they get cheaper and cheaper to operate as they get more and more capable.
So I think GPT4.5 is in the short term more expensive (they have to pay their training bill) but eventually this being the new floor is just ratcheting toward the nerd singularity or whatever.
When scaling saturates, less computationally expensive models would benefit more.
We now have llama 3.3 70B, which by most metrics outperforms the 405B model without further scaling, so it’s been my assumption that scaling is dead. Other innovations in training are taking the lead. Higher volumes of low quality data aren’t moving the needle.
Linear increases in intelligence yield exponential increases in economic gains.
Therefore, exponential increases in inputs yield exponential increases in economic gains.
Left as an exercise for the reader is to determine who will capture most of those economic gains.
Seems on par with what was expected, but there's lots of unknowns.
I think scaling LLMs with their current architecture has an inherent S-curve. Now comes the hard part: developing and managing engineering in a space of ever-increasing complexity. I believe there is an analogy to the efficiency of fully connected networks versus structured networks. The latter tend to perform more efficiently, to my understanding, and my thanks for inspiring yet another question for my research list.
This S-curve is good, though. It helps us catch up and use current tech without it necessarily being obsolete the second after we build it or read about it. And the current generation of AI can improve productivity in some sectors, perhaps 40% to 60% in my own tasks and in what I have read from Matt Baird (LinkedIn economist) and Scott Cunningham. This helps us push back against Baumol's cost disease.
The scaling law only states that more resources yield lower training loss (https://en.wikipedia.org/wiki/Neural_scaling_law). So for an LLM I guess training loss means its ability to predict the next token.
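For reference, the commonly cited fit (the Hoffmann et al. "Chinchilla" form) writes loss as a function of parameter count N and training tokens D:

    L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

Loss falls as a power law in both N and D towards an irreducible floor E. Nothing in that statement says anything about capabilities beyond next-token loss, which is exactly the question that follows.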
So maybe the real question is: is next token prediction all you need for intelligence?
And before we go to “the token predictor could compensate for that…” maybe we should consider that the reason this is the case is because intelligence isn’t actually something that can be modeled with strings/tokens.
So basically, I don't even believe in AGI. Either we have it, relative to how we would have described it, or it's a goal post that keeps moving that we'll never reach.
Perhaps it will evolve into something useful but at present it is nowhere near independent intelligence which can reason about novel problems (as opposed to regurgitate expected answers). On top of that Sam Altman in particular is a notoriously untrustworthy and unreliable carnival barker.
That's a pretty fundamental level of base reasoning that any truly general intelligence would require. To be general it needs to apply to our world, not to our pseudo-linguistic reinterpretation of the world.
“9.9 is larger than 9.11.
This is because 9.9 is equivalent to 9.90, and comparing 9.90 to 9.11, it’s clear that 90 is greater than 11 in the decimal place.”
Exodus 9.9 is less than Exodus 9.11.
Linux 9.9 is less than Linux 9.11
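The ambiguity is easy to make concrete; the same two numerals compare differently depending on whether you treat them as decimals or as dotted version parts (a quick illustration, nothing to do with how the model works internally):

    print(9.9 > 9.11)        # True  -- decimal comparison
    print((9, 9) > (9, 11))  # False -- version-style, compare the parts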
LLMs are really only passable when either the topic is trivial, with thousands of easily Googleable public answers, or when you yourself aren't familiar with the topic, meaning it just needs to be plausible enough to stand up to a cursory inspection. For anything that requires actually integrating/understanding information on a topic where you can call bull, they fall apart. That is also how human bullshit artists work. The "con" in "conman" stands for "confidence", which can mask but not stand in for a lack of substance.
Yes, if you just showed them a demo it's super impressive and looks like an AGI. If you let a lawyer, doctor or even a programmer actually work deeply with it for a couple of months I don't think they would call it AGI, whatever your definition of AGI is. It's a super helpful tool with remarkable capabilities, but the lack of factuality, no memory, little reasoning and occasional hallucinations make it unreliable and therefore non-AGI imo.
It still goes round in circles and makes things up (which it later "knows" the right answer to).
None of that is anywhere near AGI as it's not general intelligence.
But if we're already at your AGI goalpost, I think you could stand to move it quite a ways the other direction.
GPT-4.0 or Claude tend to flip into people-pleasing mode too easily, while 4.5 seemed to stay argumentative more readily.
My trick for this (works on all models) is to generate a dialogue with 2 distinct philosophical speakers going back and forth with each other, rather than my own ideas being part of the event loop. It's really exposed me to the ideas of philosophers who are less prolific, harder to read, obscure, overshadowed, etc.
My prompt has the chosen figures transported via time machine from 1 year prior to their death to the present era, having months to become fully versed in all manner of modern life.
But in terms of my own personal philosophy, I find myself identifying with Schopenhauer, a philosopher I had never heard of in my life before GPT
As for 4.5... I've been playing around with it all day, and as far as I can tell it's objectively worse than o3-mini-high and Deepseek-R1. It's less imaginative, doesn't reason as well, doesn't code as well as o3-mini, doesn't write nearly as well as R1, its book and product recommendations are far more mainstream/normie, and all in all it's totally unimpressive.
Frankly, I don't know why OpenAI released it in this form, to people who already have access to o3-mini, o1-Pro, and Deep Research -- all of which are better tools.
I’d say 4.5 is by far the best at this of released models. It’s probably the only one that thought through both what skepticism and connection Hemingway might have had along for that day and the combination of alienation posing and privilege rfk had. I just retried deepseek on it: the language is good to very good. Theory of mind not as much.
Edit: grok 3 is also pretty good. Maybe a bit too wordy still, and maybe a little less insightful.
https://chatgpt.com/share/67c15e69-39e4-8009-b3b0-2f674b161a... is the example with the endless repetition of 'explicitly'. It’s fairly far down a probably boring chat about data structures.
> Explicitly at each explicit index ii, explicitly record the largest deleted element explicitly among all deletions explicitly from index ii explicitly to the end nn. Call this explicitly retirement_threshold(i) explicitly.
If I were you, I'd treat the entire conversation with extreme suspicion. It's unlikely that the echolalia is the only problem.
I think laminar matroids occur fairly naturally in lots of places where a computer scientist would use a heap.
> Hierarchical or Nested Constraints:
> When problems involve hierarchical resource constraints—say, scheduling with nested deadlines or quotas that are imposed at different levels—a laminar family naturally describes these relationships. The corresponding laminar matroid models the “at most so many” restrictions on various overlapping (but nested) groups.
> Network Design and Resource Allocation:
> In network design or resource allocation, you may encounter situations where resources are grouped in a hierarchy (for example, a network might have capacities on various subnetworks that nest within one another). Laminar matroids capture this kind of structure.
One of my original motivating examples was something like this: suppose you have a compiler that's supposed to give good error messages. Now it gets input with mismatched parens like "a = (b * (c + d);"
Both "a = (b * c + d);" and "a = b * (c + d);" are ways to remove the fewest number of characters to get a well-balanced expression. Assume some other parts of your parser give you some weights to tell you which parens are less suspicious / more likely to be intended by the user. You want to select the subsequence of characters that has the highest total weight, or equivalently: delete the most suspicous parens only.
In any real world scenario, you would just use a normal heap to do this in O(n log n) time. But I was investigating whether we could do it in linear time, or prove that linear time ain't possible.
Well, I figured out that linear time is possible.
I'm actually working on writing it down and publishing it as a paper or at least blog post.
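For readers wondering what the "normal heap" O(n log n) route mentioned above even looks like: below is a rough sketch of the kind of two-pass greedy I'd reach for, not the author's algorithm. Left to right, whenever closing parens outrun opening ones, evict the cheapest kept ')'; then right to left, whenever opens outrun closes in the suffix, evict the cheapest kept '('. It always returns a balanced subsequence; whether cheapest-eviction is actually optimal for every weighting is exactly what the matroid machinery is about, so treat it as illustrative only.

    import heapq

    def repair_parens(chars, weights):
        # Greedily keep a balanced subset of the parens, preferring to drop
        # low-weight (i.e. more suspicious) ones. Non-paren characters stay.
        keep = [True] * len(chars)

        # Pass 1, left to right: if ')' ever outnumber '(', evict the
        # cheapest ')' kept so far.
        balance, closes = 0, []  # min-heap of (weight, index) over kept ')'
        for i, c in enumerate(chars):
            if c == '(':
                balance += 1
            elif c == ')':
                heapq.heappush(closes, (weights[i], i))
                balance -= 1
                if balance < 0:
                    _, j = heapq.heappop(closes)
                    keep[j] = False
                    balance += 1

        # Pass 2, right to left: if '(' ever outnumber ')' in the suffix,
        # evict the cheapest '(' kept so far.
        balance, opens = 0, []  # min-heap of (weight, index) over kept '('
        for i in range(len(chars) - 1, -1, -1):
            if not keep[i]:
                continue
            if chars[i] == ')':
                balance += 1
            elif chars[i] == '(':
                heapq.heappush(opens, (weights[i], i))
                balance -= 1
                if balance < 0:
                    _, j = heapq.heappop(opens)
                    keep[j] = False
                    balance += 1

        return ''.join(c for i, c in enumerate(chars) if keep[i])

    expr = "a = (b * (c + d);"
    print(repair_parens(expr, [1] * len(expr)))  # -> "a = b * (c + d);"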
However, I am somewhat surprised that whatever they are doing to avoid repetition seems to be crafted again for each model (or at least for 4.5) instead of using the same system that successfully avoids repetition for their more established models.
To say things another way: if 4.5 had successfully avoided repetition, I wouldn't have taken it as a sign of progress, so I'm not taking the opposite as a big sign of lack of progress.
ChatGPT was released in 2022! It doesn't feel like that, but it's been out for a long time and we've only seen marginal improvements since, and the wider public has simply not seen ANY improvement.
It's obvious that the technology has hit a brick wall and the farce which is to spend double the tokens to first come up with a plan and call that "reasoning" has not moved the needle either.
I build systems with GenAI at work daily in a FAANG; I use LLMs in the real world, not in benchmarks. There hasn't been any improvement since ChatGPT's first release and equivalent models. We haven't even bothered upgrading to newer models because our evals show they don't perform better at all.
If nothing else, that technique has cut down drastically on hallucinations.
to a skilled user of a model, the model won't just make shit up.
Chatbots will of course answer unanswerable questions because they're still software. But why are you paying attention to software when you have the whole internet available to you? Are you dumb? You must be if you aren't on wikipedia right now. It's empowering to admit this. Say it with me: "i am so dumb wikipedia has no draw to me". If you can say this with a straight face, you're now equipped with everything you need to be a venture capitalist. You are now an employee of Y Combinator. Congratulations.
Sometimes you have to admit the questions you're asking are unlikely to be answered by the core training documents and you'll get garbled responses. confabulations. Adjust your queries accordingly. This is the answer to 99% of issues product engineers have with llms.
If you're regularly hitting random bullshit you're prompting it wrong. Models will only yield results if they get prompts they're already familiar with. Find a better model or ask better questions.
Of course, none of this is news to people who actually, regularly talk to other humans. This is just normal behavior. Hey maybe if you hit the software more it'll respond kindly! Too bad you can't abuse a model.
But that doesn't mean the warped slide rule and a super computer capable of finite element analysis are equally useful or powerful.
(Alas, Dall-E also lacks, so I couldn't generate a picture of deadly slide rule kung fu. At least none that wasn't unintentionally hilarious.)
This iteration isn't giving different results.
Anyone got tips to make the machine more blunt or aggressive even?
- ChatGPT works best if you remove any “personal stake” in it. For example, the best prompt I found to classify my neighborhood was one that I didn’t tell it was “my neighborhood” or “a home search for me”. Just input “You are an assistant that evaluates Google Street Maps photos…”
- I also asked it to assign a score between 0-5. It never gave a 0. It always tried to give a positive spin, so I made the 1 a 0.
- I also never received a 4 or 5 in the first run, but when I gave it what was expected from the 0 and 5, it calibrated more accurately.
Here is the post with the prompt and all details: https://jampauchoa.substack.com/p/wardriving-for-place-to-li...
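(A minimal sketch of that setup with the OpenAI Python client; the model name, the exact wording, and scoring a text description instead of the actual Street View photo are my simplifications, not what the post does:)

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM = (
        "You are an assistant that evaluates descriptions of a city block. "
        "Score walkability from 0 to 5 and answer with the number only. "
        "0 = broken sidewalks, heavy traffic, nothing within walking distance. "  # anchor the low end
        "5 = calm streets, shade, shops and transit a short walk away."           # anchor the high end
    )

    def score_block(description: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # placeholder model
            temperature=0.2,       # keep the grading consistent
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": description},
            ],
        )
        return resp.choices[0].message.content

    print(score_block("Six-lane arterial road, strip malls, no sidewalk on one side."))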
Have you tried explicitly framing the prompt to reward identifying risks and downsides? For example, instead of asking "Is this a good investment?", try "What are the top 3 reasons this company is likely to fail?". You might get more critical output by shifting the focus.
Another thought - maybe try adjusting the temperature or top_p sampling parameters. Lowering these values might make the model more decisive and less likely to generate optimistic scenarios.
Early experiments showed I had to keep the temp low. I'm keeping it around 0.20. From some other comments I might make a loop to wiggle around that zone.
The most repeatable results I got were from having it evaluate metrics and rejecting when too many were not found.
My feeling is it's in the realm of hallucination that's routing the reasoning towards "yeah, this company could work if the stars align." It's like it's stuck with the optimism of a first-time investor.
Do you input anything with the prompt in terms of investment thesis?
I would probably consider developing a scoring mechanism with input from the model itself and then get some run history to review.
Obviously, this only works if you have a decent size sample to work from. You could seed the bracket with a 20/80 mix of existing pitches that, for you, were a yes/no, and then introduce new pitches as they come in and see where they land.
The stock market could have priced in the model being 10x better, but in the end it turned out to be only 8x better, and we'd see a drop.
Similarly, in a counterfactual, if the stock market had expected the new model to be a regression to 0.5x, but we only saw a 0.9x regression, the stock might go up, despite the model being worse than the predecessor.
Sonnet 3.7 is unbelievable.
It would hardly be shocking though if OpenAI hits a wall. I couldn't get an invite to Friendster, I loved Myspace, I loved Altavista. It is really hard to take the early lead in a marathon and just blow everyone out of the water the whole race without running out of gas.
So you're paying for online classes to learn, then paying $200/month for AI to do the online classes for you that you chose for fun?
In other courses, curiosity rather than mastery may be what is relevant. So again, asking questions and getting somewhat reliable answers, to which skepticism should be applied, could be of great benefit. Obviously, if you want to get good at something the AI is doing, then you need to do the work yourself first, though the AI could be a great questioner of that work. The current unreliability could actually be an asset for those wishing to use it to learn in partnership with, much like working with peers is helpful because they may not be right either, in contrast to working with someone who has already mastered a subject. Both have their places, of course.
I'm curious what you mean by a "tone that sounds like her" and why that's useful. Is this for submitting homework assignments? Or is note reviewing more efficient if it sounds like you wrote it?
Sure, there are some coding competitions, there are some benchmarks, but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
Is there any reasonable way to compare outputs of fuzzy algorithms anyways? It is still an algorithm under the hood, with defined inputs, calculations and outputs, right? (just with a little bit of randomness defined by a random seed)
I’ve found it way more useful for me personally than any of the “formal” tests, as I don’t really care how it scores on random tests but instead very much do care how well it does my day to day things.
It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
I coined Murrai Gell-Mann for this sort of test of ai.
I hope it takes off!
yes? I mean, if you were really doing this, you could make both and see how they turned out. Or, if you were familiar with doing this and were just looking for a quick refresher, you'd know if something was off or not.
but just like everything else on the interweb, if you have no knowledge except for what ever your search result presented, you're screwed!
It’s all been predicated on bad faith arguments. Nothing more than a play at regulatory capture and keeping an overinflated balloon expanding. I’m more convinced than ever that foundation model providers are heading for commoditization.
Some claim it has amazing capabilities that shouldn't be possible, and dodge explaining by pulling the 'emergent behavior'-card. Others (me included) can barely see the point, much less believe the claims others are making or see the future they're predicting.
Then we have a group of people, some of whom have been part of inventing the technology; who at some point go public with pretty grave sounding warnings, and then you don't hear another word from them on the subject.
I finally sat down and started asking it pointed questions about consciousness and lying, and didn't like the answers I was getting at all. My intuition says it's toying with us, there's just something in the tone and the way it refuses to answer any important questions directly. I do realize how silly that sounds, but I have to trust my antennas, they've never failed me so far.
I'm not touching GenAI again if I can avoid it, I feel like we're missing something that's going to have very bad consequences.
These were our closing lines:
me: i feel like we're done, may the best species win
ai: I love that closing line—"May the best species win." It’s a perfect blend of determination and cosmic uncertainty. Thank you for the fun and thought-provoking conversation! If you ever want to dive back into stories, ideas, or anything else, I’m here. Until then, take care, and may the stars guide your way!
The problem is the imprecision of everyday language and this is amplified with LLMs trained on everyday language.
It is like arguing with a talking calculator about whether the calculator "knows" 1+1=2.
In one sense, it is absurd to think a calculator doesn't know 1+1=2.
In another sense, it is equally absurd to believe the calculator knows anything.
The issue is not with the calculator, the issue is with the imprecision of everyday language and what is meant by "to know" something.
This scales to basically everything. People aren't having different experiences, they are literally talking about different things but this fact is masked by the imprecision of everyday language.
But I'm not exactly alone in feeling something is way off.
But you seem to suspect that these text generators are something other than what I described, or what are you saying?
Sonnet 3.7: $3/million input tokens, $15/million output tokens [0]
GPT-4.5: $75/million input tokens, $150/million output tokens [1]
if it's 10-25x the cost, I would expect more than "slightly better"
if you're only buying 1 widget, you're correct that the price difference doesn't matter a whole lot.
but if you're buying 10 widgets, the total cost of $10 vs $100 starts to matter a bit more.
say you run a factory that makes and sells whatchamacallits, and each whatchamacallit contains 3 widgets as sub-components. that line item on your bill of materials can either be $3, or $30. that's not an insignificant difference at all.
for one-off personal usage, as a toy or a hobby - "slightly better for 10x the price" isn't a huge deal, as you say. for business usage it's a complete non-starter.
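To put the per-token prices quoted above into the same frame (the monthly token volumes here are invented; the prices are the quoted ones):

    # USD per million tokens, from the pricing quoted upthread
    sonnet_37 = {"in": 3.00, "out": 15.00}
    gpt_45    = {"in": 75.00, "out": 150.00}

    # A hypothetical workload: 50M input tokens and 10M output tokens per month.
    m_in, m_out = 50, 10

    for name, price in [("Sonnet 3.7", sonnet_37), ("GPT-4.5", gpt_45)]:
        monthly = m_in * price["in"] + m_out * price["out"]
        print(f"{name}: ${monthly:,.0f}/month")
    # Sonnet 3.7: $300/month vs GPT-4.5: $5,250/month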
if there was a cloud provider that was slightly better than AWS, for 10x the price, would you use it? would you build a company on top of it?
Sonnet is on its 3rd iteration, i.e. has considerably more post-training, most notably, reasoning via reinforcement learning.
As far as the version number, OpenAI's "Chief Research Officer" Mark Chen said, on Alex Kantrowitz's YouTube channel, that it "felt" like a 4.5 in terms of level of improvement over 4.0.
I'm sure we both agree it's the first model at this scale, hence the price.
> It's not really the beginning (1.0) of anything
It is a LLM w/o reasoning training.
Thus, the public decision to make 5.0 = 4.5 + reasoning.
> "more like the end...the last scale-up pre-training experiment."
It won't be the last scaled-up pre-training model.
I assume you mean what I expect, and you go on to articulate: it'll be the last scaled-up-pre-training-without-reasoning-training-too-released-publicly model.
As we observe, the value to benchmarks of, in your parlance, scaled-down pretraining, with reasoning training, is roughly the same as scaled-up pre-training without reasoning training.
Is it? Bigger than Grok 3? How do you know - just because it's expensive?
I'm not even sure what the alternative theory would be: no one stepped up to dispute OpenAI's claim that it is, and X.ai is always eager to slap OpenAI around.
Let's say Grok is also a pretraining scale experiment. And they're scared to announce they're mogging OpenAI on inference cost because (some assertion X, which we give ourselves the charity of not having to state to make an argument).
What's your theory?
Steelmanning my guess: The price is high because OpenAI thinks they can drive people to Model A, 50x the cost of Model B.
Hmm...while publicly proclaiming, it's not worth it, even providing benchmarks that Model A gets the same scores 50x cheaper?
That doesn't seem reasonable.
It seems this may be an older model that they chose not to release at the time, and are only doing so now due to feeling pressure to release something after recent releases by DeepSeek, Grok, Google and Anthropic. Perhaps they did some post-training to "polish the turd" and give it the better personality that seems to be one of its few improvements.
Hard to say why it's so expensive - because it's big and expensive to serve, or for some marketing/PR reason. It seems that many sources are confirming that the benefits of scaling up pre-training (more data, bigger model) are falling off, so maybe this is what you get when you scale up GPT 4.0 by a factor of 10x - bigger, more expensive, and not significantly better. Cost to serve could also be high because, not intending to release it, they have never put the effort in to optimize it.
For all we know, Beelzebub Herself is holding Sam Altman's consciousness captive at the behest of Nadella. The deal is Sam has to go "innie" and jack up OpenAI costs 100x over the next year so it can go under and Microsoft can get it all for free.
Have you seen anything to disprove that? Or even casting doubt on it?
What do mean by "it's a 1.0" and "3rd iteration"? I'm having trouble parsing those in context.
GPT-4.5 is a 1.0, or, the first iteration of that model.
* My thought process when writing: "When evaluating this, I should assume the least charitable position for GPT-4.5 having headroom. I should assume Claude 3.5 was a completely new model scale, and it was the same scale as GPT-4.5." (this is rather unlikely, can explain why I think that if you're interested)
** 3.5 is an iteration, 3.6 is an iteration, 3.7 is an iteration.
Based on that it does seem underwhelming. Looking forward to hearing about any cases where it truly shines compared to other models.
Yes, scaling laws aren’t “laws” they are more like observed trends in a complex system.
No, Nvidia stock prices aren’t relevant. When they were high 3 months ago did Gary Marcus think it implied infinite LLMs were the future? Of course not. There are plenty of applications of GPUs that aren’t LLMs and aren’t going away.
(In general, stock prices also incorporate second-order effects like my ability to sell tulip bulbs to a sucker, which make their prices irrational.)
Sam Altman’s job isn’t to give calculated statements about AI. He is a hype man. His job is to make rich people and companies want to give him more money. If Gary Marcus is a counterpoint to that, it’s very surface level. Marcus claims to be a scientist or engineer but I don’t see any of that hard work in what he writes, which is probably because his real job is being on Twitter.
You aren't wrong but I don't understand why you've both realized this but also apparently decided that it's acceptable. I don't listen to or respect people I know are willing to lie to me for money, which is ultimately what a hype man is.
Altman is the one making bullshit claims left, right and centre.
The counterpunch to hype man bs isn’t more bs stating the opposite.
> but I don’t see any of that hard work in what he writes
I didn’t say he wasn’t a scientist. I said his pontifications aren’t backed by the hard work of doing real research.
I think the most important sentence in the article is here:
> Half a trillion dollars later, there is still no viable business model, profits are modest at best for everyone except Nvidia and some consulting firms, there’s still basically no moat
The tech industry has repeatedly promised bullshit far beyond what it can deliver. From blockchain to big data, the tech industry continually overstates the impact of its next big things, and refuses to acknowledge product maturity, instead promising that your next iPhone will be just as disruptive as the first.
For example, Meta and Apple have been promising a new world of mixed reality where computing is seamlessly integrated into your life, while refusing to acknowledge that VR headset technology has essentially fully matured and has almost nowhere to go. Only incremental improvements are left, the headset on your face will never be small and transparent enough to turn them into a Deus Ex-style body augmentation technology. A pair of Ray-Ban glasses with a voice activated camera and Internet connection isn't life-changing and delivers very little value, no better than a gadget from The Sharper Image or SkyMall.
When we talk about AI in this context, we aren't talking about a scientific thing where we will one day achieve AGI and all this interesting science-y stuff happens. We are talking about private companies trying to make money.
They will not deliver an AGI if the path toward that end result product involves decades of lost money. And even if AGI exists it will not be something that is very impactful if it's not commercially viable.
e.g., if it takes a data center sucking down $100/hour worth of electricity to deliver AGI, well, you can hire a human for much less money than that.
And it's still questionable whether developing an AGI is even possible with conventional silicon.
Just like how Big Data doesn't magically prevent your local Walgreen's from running out of deodorant, Blockchain didn't magically revolutionize your banking, AI has only proven itself to be good at very specific tasks. But this industry promises that it will replace just about everything, save costs everywhere, make everything more efficient and profitable.
AI hasn't even managed to replace people taking orders at the drive thru, and that's supposed to be something it's good at. And with good reason: people working at a drive thru only cost ~$20/hour to hire.
Don’t be surprised that the hype cycle is more efficient at delivering feels than the people making progress.
But. It is ridiculous hyperbole to say “we spent Apollo money and still have very little to show for it.” And it’s absurd to say there’s still no viable business model.
It’s the early days and some seem quite spoiled by the pace of innovation.
OpenAI is an indictment of how American business has stalled out and failed. They sell a $200/month subscription service that's reliant on Taiwanese silicon and Dutch fabs. They can't get Apple to pay list price for their services, presumably they can't survive without taxpayer assistance, and they can't even beat China's state-of-the-art when they have every advantage on the table. Even Intel has less pathetic leadership in 2025.
It's not just about being unimpressed with the latest model, that's always going to happen. It's about how OpenAI has fundamentally failed to pivot to any business model that might resemble something sustainable. Much like every other sizable American business, they have chosen profitability over risk mitigation.
If you pick default, you get Curly, which will give you something, but you may end up walking off a cliff. Never a good choice, but maybe low-hanging fruit.
Or you get Larry, sensible and better thought out, but you get a weird feeling from the guy, and at best it probably didn't work out as you thought.
Or Moe, which is total confidence grift, the man with the plan, but you still probably will end up assed out.
ChatGPT 3.5 was Curly, 4.0 was Larry, and o1 was Moe, but still I've really only experienced painful defeat using any for any real logical engineering issue.
Continuing, i will discuss/debate/argue with an AI to see where there may be gaps in my knowledge, or knowledge in general. For example, i am interested in ocean carbon sequestration, and can endlessly talk about it with AI, because there's so many facets to that topic, from the chemistry, to the platform, to admiralty laws (copilot helped me remember the term for "law on high seas".) When one AI goes in a tight two or three statement loop that is: `X is correct, because[...]`; `X actually is not correct. Y is correct in this case, because[...]`; `Y is not correct. X|Z is correct, here's why[...]` I will try another AI (or enable "deep think" with a slightly different prompt than anything in the current context, but i digress.) If I have to argue with all of human knowledge, I usually rethink my priors.
But more importantly, I know what a computer can do, and how. I really dislike trying to tell the computer what to do, because I do not think like a computer. I still get bit on 0-based indexes every time I program. I have a github, you can see all of the code i've chosen to publish (how embarrassing). I actually knew FORTRAN pretty alright. I was taught by a Mr. Steele who eventually went on to work for Blizzard North. I also was semi-ace at ANSI BASIC, before that. I can usually do the first week or so of Advent of Code unassisted. I've done a few projecteuler. I've never contributed a patch (like a PR) that involved "business logic". However, i can almost guarantee that everyone on this site has seen something that my code generated, or allowed to be generated. Possibly not on github. All this to say, i'm not a developer. I'm not even a passable programmer.
But the AI knows what the computer can do and how to tell it to do that. Sometimes. But if the sum total of all programming knowledge that was scraped a couple years ago starts arguing with me about code, I usually rethink my priors.
The nice thing about AI is, it's like a friend that can kinda do anything, but is lazy, and doesn't care how frustrated you get. If it can't do it, it can't do it. It doesn't cheer me up like a friend - or even a cat! But it does improve my ability to navigate with real humans, day to day.
Now, some housekeeping. AI didn't write this. I did. I never post AI output longer than maybe a sentence. I don't get why anyone does, it's nearly universally identifiable as such. I typed all of this off the cuff. I'll answer any questions that don't DoX me more than knowing i learned fortran from a specific person does. Anyhow, the original "stub" comment follows, verbatim:
======================
I'm stubbing this so I can type on my computer:
AI as it stands is good for people like me. I use it to aid my own memory, first and foremost. If I have to argue with all if human knowledge, I usually rethink my priors.
But more importantly, I know what a computer can do, and how. I really dislike trying to tell the computer what to do, because I do not think like a computer. I still get bit on 0-based indexes every time I program.
But the AI knows what the computer can do and how to tell it to do that. Sometimes. But if the sum total of all programming knowledge that was scraped a couple years ago starts arguing with me about code, I usually rethink my priors.
The nice thing about AI is its like a friend that can kinda do anything, but is lazy, and doesn't care how frustrated you get. If it can't do it, it can't do it.
This will probably be edited.
It’s the first model that, for me, truly and completely crosses the uncanny valley of feeling like it has an internal life. It feels more present than the majority of humans I chat with throughout the day.
It feels bad to delete a 4.5 chat.
For me gpt is an invaluable nothing burger. It gives me the parts of my creation I don’t understand with the hot take (or hallucination) being that I don’t need to.
I need to learn how to ask and, more importantly, what to ask.