Suddenly we can talk to computers in plain language, they can solve a broad range of technical and non-technical problems, they get significantly better every year… it’s hard for me to imagine more promising evidence that AGI is on the horizon besides actually achieving AGI.
Further, I don't see strong evidence of "regular" intelligence. LLMs are like calculators for text, they have a lot of practical utility, but they don't understand anything, their output is the result of rote mechanical steps that could be executed by hand in principle. I've been using SOTA LLMs daily for years and to this day they still reliably produce nonsense and get things confidently wrong that an intelligent person generally wouldn't. Of course, intelligent people make mistakes, but if they start to hallucinate we immediately lose trust in them. Most people use LLMs in a touch and go manner, and the impressive statistical power fools us into believing we're interacting with something akin to a being, but the facade quickly breaks down the longer you try to engage with it in a manner where coherency matters.
With all that stated, I'm not saying AGI isn't possible, but I don't see language models as a path to AGI no matter how much better we can get them to model language.
The sounds heard waiting for a train. The experience of going on a date or hiking or swimming. The tastes and sensation of biting into some fruit. They each hold such a rich multi-modal experience that feels impossible to replicate in a model at at this time. An LLM can describe it but it cannot experience it.
Not that such experience is required to do what an AGI might be asked to do, but how could something reach AGI or higher without that level of experiential detail?
As they currently exist, they are essentially a novel and extremely sophisticated method to search, derive, and understand data. In fact almost all of the data ever recorded. To OPs point, every new LLM startup is just trying to build a bigger and more sophisticated way to search, derive, and understand data. It's not clear to me that bigger and more compute intensive methods will create an LLM that cares about anything.
What about this doesn’t sound supremely valuable?
LLMs fundamentally predict next token based on its training set. It's static and not learning by itself (context window doesn't count since improvements are not long term or generalizable).
AGI, I do believe is reachable, and may already be partway there. "General" only needs to be better than the average human, not the best humanity has to offer.
I, firstly, don't think it's fair to say "they get significantly better every year" after 2 years. I honestly think GPT 3/3.5 (maybe optimized for cost or speed but not retrained) would have been adequate for a significant chunk of the general purpose tasks that are asked of LLMs. I think most of the other gains we've seen since are related to fine-tuning on intentional application-specific tasks. IMO the only real "significant" improvement is the native multi-modal models.
That said, I think that RL+thinking and long-context models will start to present enough combined incremental improvements towards my next points to be even more useful and capable at a wider variety of tasks, but there is (in their current usable forms like chat bot apps and public APIs) a fundamental limit on their implementation preventing them from being AGI.
I think that models, even the "thinking" ones, fail to truly perform novel rationalization around general-purpose task solving. We have a ton of evidence that they're useful for a ton of tasks when it's trained in. Even going a tiny bit in a unique direction drops the model output quality a ton. Available thinking models today are really good at math, because they were RLed into solving math like a school child - rote practice. But that doesn't mean they're capable of even applying the same logical tools (they should have learned along) the way to novel mathematic and logic questions.
Another impediment to being treated even as an application-specific "mini AGI" is their naïveté and hallucinations. This makes their use as agents suspect for anything important. They can't distinguish or even output a true confidence on what they "don't know", and this blind confidence is a setback. Humans are known to say incorrect things confidently, but they're also known to reflect on their lack of knowledge and recognize their limits. Humans have real memory to be able (imprecisely) associate an event with their learning, to aid confidence in recall. Similarly, LLMs trusting nature on input (eg. prompt injection attacks) prevent them from "intelligently" acting in the real world even when they're not hallucinating. Tools like "DeepResearch" are really useful, and impressive improvements on traditional human searching for processing the vastness of the internet. BUT the model can't genuinely distinguish between good and bad sources, and often can't intelligently reflect on the patterns and social context of the sources they sell.
I can totally see a world where an LLM can output a confidence metric which is used to drive the tokens, and potentially suppress output, and I can totally see a world where long context and thinking (w/ RL) gives it enough reflection on everything to question to function even more autonomously. But I remain skeptical that it will be able to "think" and rationalize deeply enough to be a "super intelligence" on tasks it wasn't taught.
Let me say, as someone with no connection to OpenAI, that Deep Research mode is an absolute game-changer for my use and well worth the money. Obviously, YMMV, but for really does do deep research and organize it all in an excellent way.
I haven't yet noticed an error in its output when researching general subjects. It doesn't solve the problem of not necessarily using code examples from the relevant version of a Rust library, so it has a good ways to go for Rust coding. But I am very, very impressed with its usefulness for general research.
My point is only, if you haven't tried it, don't assume that there isn't already a game-changer out there for many uses.