So, it doesn't make sense to say that what LLMs do is Moravec-easy, and therefore can't be extrapolated to predict near-term progress on Moravec-hard problems like robotics. What LLMs do is, in fact, Moravec-hard. And we should expect that if we've got enough compute to make major progress on one Moravec-hard problem, there's a good chance we're closing in on having enough to make major progress on others.
Keeping the paradox would more logically bring you to the conclusion that LLMs’ massive computational needs and limited capacities imply a commensurately greater, mind-bogglingly large computational requirement for physical aptitude.
While the linguistic representation of thought space may be discrete and appear simpler (even the latter is arguable), the underlying phenomena are not.
Current LLMs are terrific in many ways but pale in comparison to great authors in capturing deep, nuanced human experience.
As a related point, for AI to truly understand humans, it will likely need to process videos, social interactions, and other forms of data beyond language alone.
If you put an AI like AlphaZero in a Go environment, it explores so much of the game space that it invents its own Go culture from scratch and beats us at our own game. Creativity is search in disguise, and good feedback is essential.
AI will become more and more grounded as it interacts with the real world, as opposed to simply modeling organic text as GPT-3 did. More recent models generate lots of synthetic data to simulate this process, and it helps up to a point, but we can't substitute artificial feedback for the real thing except in a few cases, like AlphaZero, AlphaProof, AlphaCode... In those cases we have the game winner, Lean as the inference engine, and code tests to provide reliable feedback.
If there is one concept that underlies both training and inference it is search. And it also underlies action and learning in humans. Learning is compression which is search for optimal parameters. Creativity is search too. And search is not purely mental, or strictly 1st person, it is based on search spaces and has a social side.
Moravec's Paradox is certainly interesting and correct if you limit its scope (as you say). But it feels intuitively wrong to me to make any claims about the relative computational demands of sensorimotor control and abstract thinking before we’ve really solved either problem.
Looking e.g. at the recent progress in solving ARC-AGI my impression is that abstract thought could have incredible computational demands. IIRC they had to throw approximately $10k of compute at o3 before it reached human performance. Now compare how cognitively challenging ARC-AGI is to e.g. designing or reorganizing a Tesla gigafactory.
With that said I do agree that our culture tends to value simple office work over skillful practical work. Hopefully the progress in AI/ML will soon correct that wrong.
I have a name for it now!
I've said over and over that there are only two really hard problems in robotics: Perception and funding. A perfectly perceived system and world can be trivially planned for and (at least proprio-)controlled. Imagine having a perfect intuition about other actors such that you know their paths (in self driving cars), or your map is a perfect voxel + trajectory + classification. How divine!
It's limited information and difficulties in reducing signal to concise representation that always get ya. This is why the perfect lab demos always fail - there's a corner case not in your training data, or the sensor stuttered or became misaligned, or etc etc.
Funding for sure. :)
But as for perception, the inverse is also true. If I have a perfect planning/prediction system, I can throw the grungiest, worst perception data into it and it will still plan successfully despite tons of uncertainty.
And therein lies the real challenge of robotics: It's fundamentally a systems engineering problem. You will never have perfect perception or a perfect planner. So, can you make a perception system that is good enough that, when coupled with your planning system which is good enough, you are able to solve enough problems with enough 9s to make it successful?
The most commercially successful robots I've seen have had some of the smartest systems engineering behind them, such that entire classes of failures were eliminated by being smarter about what you actually need to do to solve the problem and aggressively avoid solving subproblems that aren't absolutely necessary. Only then do you really have a hope of getting good enough at that focused domain to ship something before the money runs out. :)
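To put rough numbers on the "enough 9s" point, here is a minimal sketch. It assumes the stages fail independently (a simplification), and the per-stage success rates are made up purely for illustration.

```python
import math

def pipeline_success(stage_success_rates):
    """Chance that every stage succeeds, assuming independent failures."""
    return math.prod(stage_success_rates)

def nines(success_rate):
    """Reliability expressed as a count of nines, e.g. 0.999 -> 3.0."""
    return -math.log10(1.0 - success_rate)

# Hypothetical per-stage success rates, chosen only for illustration.
stages = {"perception": 0.99, "planning": 0.99, "control": 0.999}
overall = pipeline_success(stages.values())

print(f"overall success: {overall:.4f}")
print(f"overall nines:   {nines(overall):.2f}")
```

Under these assumptions the whole chain ends up with fewer nines than its weakest stage, which is one way to see why dropping an unnecessary subproblem entirely can beat polishing it.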
Not really. Even the perfect planning system will appear erratic in the presence of perception noise. It must, because it can’t create information out of nowhere.
I have seen robots erratically stop because they thought that the traffic in the oncoming lane was encroaching on theirs. You can’t make the planning system ignore that, because then sometimes it will collide with people playing chicken with you.
Likewise I have seen robots erratically stop because they thought that a lamp post was slowly reversing out in front of them. All due to perception noise (in this case both location noise and misclassification).
And do note that these are just the false positives. If you have a bad perception system you can also suffer from false negatives. Experiment bias just hides those.
So your “perfect planning/prediction” will appear overly cautious while at the same time being sometimes reckless, because it doesn’t have the information to do otherwise. You can’t magic-plan your way out of that. (Unless you pipe the raw sensor data into the planner, in which case you have created a second perception system; you are just not calling it perception.)
Like with model-free RL learning a model from pixels?
I feel like this is true for every engineering discipline or maybe even every field that needs to operate in the real world
I've not seen a system that claimed to be robust to sensor noise that didn't do some filtering, estimation, or state representation internally. Those are just sensor systems inside the box.
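As an illustration of what "filtering inside the box" can look like, here is a minimal scalar Kalman filter sketch. It assumes a stationary object observed with Gaussian noise; the position, noise level, and parameter defaults are all hypothetical, and a real system would track far more state than this.

```python
import random

def kalman_1d(measurements, meas_var, process_var=1e-4, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a (nearly) constant quantity."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p += process_var            # predict: state unchanged, uncertainty grows
        gain = p / (p + meas_var)   # how much to trust the new measurement
        x += gain * (z - x)         # move the estimate toward the measurement
        p *= 1.0 - gain             # uncertainty shrinks after the update
        estimates.append(x)
    return estimates

rng = random.Random(0)              # fixed seed so the run is repeatable
TRUE_POSITION = 5.0                 # hypothetical object position, in metres
NOISE_STD = 0.5                     # hypothetical sensor noise, in metres
readings = [TRUE_POSITION + rng.gauss(0.0, NOISE_STD) for _ in range(200)]
estimates = kalman_1d(readings, meas_var=NOISE_STD ** 2)

print(f"last raw reading: {readings[-1]:.2f}, filtered: {estimates[-1]:.2f}")
```

The filter is itself a small estimator sitting between the sensor and whatever consumes the estimate, which is the point: the robustness comes from a perception-like stage, not from the planner.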
Another gem!
It doesn't really change the significance of the quote, but I can't help but point out that we didn't even have nerve cells more than 0.6 billion years ago.
Perception tasks involve relatively simple operations across very large amounts of data, which is very easy if you have a lot of parallel processors.
Abstract thought is mostly a serial task, applying very complex operations to a small amount of data. Many abstract tasks, like evaluating a Boolean circuit, are believed to be inherently hard to parallelize: they are P-complete.
Your brain is mostly a parallel processor (80 billion neurons operating asynchronously), so logical reasoning is hard and perception is easy. Your CPU is mostly a serial processor, so logical reasoning is easy and perception is hard.
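For a concrete picture of the serial case, here is a minimal sketch of circuit evaluation (the circuit value problem is the textbook P-complete problem). The gate encoding and the example circuit are made up for illustration; the thing to notice is that each gate needs the result of the one before it.

```python
def evaluate_circuit(inputs, gates):
    """Evaluate gates in order; each gate is (name, op, argument_names).

    Gates must be listed so that every argument is already computed.
    """
    values = dict(inputs)
    for name, op, args in gates:
        operands = [values[a] for a in args]
        if op == "AND":
            values[name] = all(operands)
        elif op == "OR":
            values[name] = any(operands)
        elif op == "NOT":
            values[name] = not operands[0]
        else:
            raise ValueError(f"unknown gate type: {op}")
    return values

# Hypothetical three-gate chain: g3 depends on g2, which depends on g1.
inputs = {"a": True, "b": True, "c": False}
gates = [
    ("g1", "AND", ("a", "b")),
    ("g2", "NOT", ("g1",)),
    ("g3", "OR", ("g2", "c")),
]
result = evaluate_circuit(inputs, gates)
print(result["g3"])
```

In a chain like this, extra processors have nothing to do while they wait for the previous gate, whereas a per-pixel image operation can hand every pixel to a different processor.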
The brain itself is both a parallel system and a serially constrained system. It has distributed activity but it must resolve in a serial chain of action. We can't walk left and right at the same time. Any goal forces us to follow specific steps in a specific order. This conflict between parallel processing and serial outputs is where the magic happens.
Yes, relatively simple. Wait, isn't that exactly what the article explained was completely wrong-headed?
The person you are responding to is instead comparing differences in biological systems and mechanical systems.
I think it's not about perfect perception; there is no such thing, not even in humans. It's about adaptability, recovery from error, resilience, and mostly about learning from the outside when the process fails to work. Each problem has its own problem space to explore. I think of intelligence as search efficiency across many problem spaces; there is no perfection in it. Our problem spaces are far from exhaustively known.
[1] by a disillusioned computer vision phd that left the field in the 1990s.
if your eyes suddenly crossed, you'd probably fall over too!
Glad to see so many different takes on it. It was written in slight jest as a discussion starter with my ML/neuroscience coworker and friends, so it's actually very insightful to see some rebuttals.
Initial post was twice the length, and had several more (in retrospect) interesting points. First ever blog post so reading it now fills me with cringe.
Some stuff has changed in only half a year, so we'll see if the points stand the test of time ;]
One of the most important differences, at least in those days (80's and 90's), was time. While the digital can be sped up, constrained only by the speed of your compute, the 'real world' is very constrained by real-time physics. You can't speed up a robot 10x in a 10,000-trial grabbing and stacking learning run without completely changing the dynamics.
Also, parallelizing the work requires more expensive full robots rather than more compute cores. Maybe these days the different AI-gym-like virtual physics environments offer a (partial) solution to that problem, but I have not used them (yet) so I can't tell.
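A back-of-the-envelope sketch of that time constraint. The 10,000 trials come from the run described above; the 30 seconds per trial, the robot count, and the 100x simulator speedup are assumptions picked only to show the shape of the arithmetic.

```python
def wall_clock_hours(trials, seconds_per_trial, robots=1, speedup=1.0):
    """Hours to finish `trials` episodes split evenly across `robots`.

    `speedup` is how much faster than real time each episode runs
    (1.0 for a physical robot).
    """
    return trials * seconds_per_trial / (robots * speedup) / 3600.0

TRIALS = 10_000
SECONDS_PER_TRIAL = 30.0   # assumption: one grab-and-stack attempt

one_robot = wall_clock_hours(TRIALS, SECONDS_PER_TRIAL)
ten_robots = wall_clock_hours(TRIALS, SECONDS_PER_TRIAL, robots=10)
simulated = wall_clock_hours(TRIALS, SECONDS_PER_TRIAL, speedup=100.0)

print(f"1 physical robot:    {one_robot:.1f} h")
print(f"10 physical robots:  {ten_robots:.1f} h")
print(f"1 simulator at 100x: {simulated:.2f} h")
```

Under these assumptions, the physical options scale only by buying more robots, while the simulated one scales with compute, at the cost of whatever the simulator gets wrong about the dynamics.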
Furthermore, large scale physical robots are far more fragile due to wear and tear than the incredible resilience of modern compute hardware. Getting a perfect copy of a physical robot and environment is a very hard, near impossible, task.
Observability and replay, while trivial in the digital world, are very limited in the physical environment, making analysis much more difficult.
I was both excited and frustrated at the time by making AI do more than rearranging pixels on a 2D surface. Good times were had.
> This image shows two cats cuddling or sleeping together on what appears to be a blue fabric surface, possibly a blanket or bedspread. One cat appears to be black while the other is white with pink ears. They're lying close together, suggesting they're comfortable with each other. The composition is quite sweet and peaceful, capturing a tender moment between these feline companions.
"This is a humorous post showcasing an AI image recognition system making an amusing mistake. The neural network (named "neural net guesses memes") attempted to classify an image with 99.52% confidence that it shows a skunk. However, the image actually shows two cats lying together - one black and one white - whose coloring and positioning resembles the distinctive black and white pattern of a skunk.
The humor comes from the fact that while the AI was very confident (99.52%) in its prediction, it was completely wrong..."
The progress we made in barely ten years is astounding.
On the one hand we have problems where ~7B humans have been generating data for 30 years every day (more if you count old books); on the other hand we have a problem where researchers are working with ~1000 human-collected trajectories (I think the largest existing dataset is OXE with ~1M trajectories: https://robotics-transformer-x.github.io/ ).
Web-scale datasets for LLMs benefit from a natural diversity; they're not highly correlated samples generated by contractors or researchers in academic labs. In the largest OXE dataset, what do you think is the likelihood that there is a sample where a robot picks up a rock from the ground and throws it in a lake? Close to zero, because tele-operated data comes from a very constrained data distribution.
Another problem is that robotics doesn't have an easy universal representation for its data. Let's say we were able to collect a web-scale dataset for one particular robot A with high diversity: how would it transfer to robot B with a slightly different design? Probably poorly, so not only does the data distribution need to cover a wide range of behavior, it must also cover a wide range of embodiment/hardware.
With that being said, I think it's fair to say that collecting large-scale datasets for general robotics is much harder than collecting text or images (at least in the current state of humanity).
[1] https://www.nvidia.com/en-us/industries/robotics/#:~:text=NV....
[2] https://www.sciencedirect.com/science/article/abs/pii/S00978...
[3] https://techcrunch.com/2024/01/04/google-outlines-new-method...
[4] https://techxplore.com/news/2024-09-google-deepmind-unveils-...
But in fact it works like an autoencoder, and it reduces sensory inputs into a much smaller latent space, or something very similar to that. This does result in holistic and abstract thinking, but formal analytical thinking doesn't require abstraction to do the math or to follow a method without comprehension. It's a concrete approach that avoids the need for abstraction.
The cerebellum is the statistical machine that gets measured by IQ and other tests.
To further support that, you don't see any particularly elegant motions from non mammal animals. In fact everything else looks quite clumsy, and even birds need to figure out flying by trial and error.
Isn’t it fundamentally impossible to model a highly entropic system using deterministic methods?
My point is that animal brains are entropic and “designed” to model entropic systems, whereas computers are deterministic and actively have to have problems reframed as deterministic so that they can solve them.
All of the issues mentioned in the article boil down to the fundamental problem of trying to get deterministic systems to function in highly entropic environments.
LLMs are working with language, which has some entropy but is fundamentally a low-entropy system, and has orders of magnitude less entropy than most people's back garden!
As the saying goes, to someone with a hammer, everything looks like a nail.
And it's used for sampling these low information systems that you are mentioning.
(And let's not also forget how they are helpful in sampling deterministic but extremely high complexity systems involving a high amount of dimensions that Monte Carlo methods are so good at dealing with.)
Insects have succeeded in building precision systems that combine vision, smell, touch and a few other senses. I doubt finding a juicy spider and immobilising it is that much more difficult than finding a door knob and turning it, or folding a T-shirt. Yet insects accomplish it with, I suspect, far less compute than modern LLMs. So it's not "hard" in the sense of requiring huge compute resources, and certainly not a lot of power.
So it's probably not that hard in the sense that it's well within the capabilities of the hardware we have now. The issue is more that we don't have a clue how to do it.
We might or might not be able to emulate what they process on digital computers, but emulation implies a performance loss.
And this doesn't even cover inputs/outputs (some of which might be already good enough for some tasks, like the article's example of remotely operated machines).
I have trouble with that. I date from the era when analogue computers were a thing. They didn't have a hope of keeping up with digital 40 years ago, when clock speeds were measured in kHz and a flip flop took multiple mm². Now digital computers are literally tens of thousands of times faster and billions of times smaller.
The key weakness of analogue isn't speed, power consumption, or size. They excel in all those areas. Their problem is that the signal degrades at each step. You can only chain a few steps together before it all turns to mush. Digital can chain an unlimited number of steps, of course. Because it's unlimited, digital can emulate any analogue system with reasonable fidelity. We can emulate the weather for a few days out, and it is one of the most insanely complex analogue systems you are likely to come across.
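A toy simulation of that degradation, as a sketch only: it assumes each step adds independent Gaussian noise, and that the digital version snaps the level back to 0 or 1 after every step. The noise level and step counts are made up.

```python
import math
import random

def analogue_chain(value, steps, noise_std, rng):
    """Pass a value through `steps` stages; noise is never removed."""
    for _ in range(steps):
        value += rng.gauss(0.0, noise_std)
    return value

def digital_chain(bit, steps, noise_std, rng):
    """Same noisy stages, but the level is regenerated after each one."""
    level = float(bit)
    for _ in range(steps):
        level += rng.gauss(0.0, noise_std)
        level = 1.0 if level >= 0.5 else 0.0   # threshold back to a clean bit
    return int(level)

def rms_drift(steps, noise_std, trials, rng):
    """Root-mean-square error of the analogue chain over many runs."""
    total = 0.0
    for _ in range(trials):
        total += (analogue_chain(1.0, steps, noise_std, rng) - 1.0) ** 2
    return math.sqrt(total / trials)

rng = random.Random(0)   # fixed seed so the run is repeatable
print(f"analogue RMS error after 100 steps: {rms_drift(100, 0.05, 2000, rng):.2f}")
print(f"digital bit after 100 steps: {digital_chain(1, 100, 0.05, rng)}")
```

In this model the analogue error grows roughly with the square root of the number of steps, while the regenerated bit survives as long as a single step's noise stays well under the threshold.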
Emulating analogue systems using lots of digital steps costs you size and power, of course. In a robot we don't have unlimited amounts of either. However, right now if someone pulled off the things he is talking about while hooked up to an entire data centre we'd be ecstatic. That means we can't even solve the problem given unlimited power and space. We truly don't have a clue. (To be fair this isn't true any more if you consider Waymo to be a working example. But it's just one system, and we haven't figured out how to generalise it yet.)
By the way, this "analogue loses fidelity" problem applies to all systems, even insects. The solution is always the same: convert it to digital. And it has to happen very early. Our brains are only 10 neurons deep as I understand it. They are digital. 10 steps is far too much for analogue. It's likely the very first processing steps in all our senses, such as eyesight, are analogue. But before the information leaves the eyeball it's already been converted to digital pulses running down the optic nerve. It's the same story everywhere. This is true for our current computer systems too, of course. Underneath, MLC flash uses multiple voltages, QAM is an encoding of multiple bits in a sine wave, a pixel in a camera is the output from multiple sensors. We do some very simple analogue manipulation on it, like amplification, then convert it to digital before it turns to mush.
If we start from when we think multicellular life first evolved (~2b years ago), or maybe the Cambrian explosion (~500m years ago), run until modern humans (~300k years ago), and then compare that to the time between the first modern humans and now, it seems like maybe 3-4 orders of magnitude harder.
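Checking that arithmetic with the rough figures quoted above (the dates are the comment's own approximations, and treating elapsed evolutionary time as a proxy for difficulty is the comment's assumption, not something the code establishes):

```python
import math

MULTICELLULAR_YEARS = 2e9   # ~2b years since multicellular life
CAMBRIAN_YEARS = 5e8        # ~500m years since the Cambrian explosion
MODERN_HUMAN_YEARS = 3e5    # ~300k years of modern humans

def orders_of_magnitude(longer, shorter):
    """Base-10 logarithm of the ratio between two time spans."""
    return math.log10(longer / shorter)

from_multicellular = orders_of_magnitude(MULTICELLULAR_YEARS, MODERN_HUMAN_YEARS)
from_cambrian = orders_of_magnitude(CAMBRIAN_YEARS, MODERN_HUMAN_YEARS)

print(f"since multicellular life: {from_multicellular:.1f} orders")  # about 3.8
print(f"since the Cambrian:       {from_cambrian:.1f} orders")       # about 3.2
```

Both starting points land between 3 and 4, so the estimate holds for either choice.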
My intuition after reading the articles is that there needs to be way more sensors all throughout the robot, probably with lots of redundancies, and then lots of modern LLM sized models all dedicated to specific joints and functions and capable of cascading judgement between each other, similar to how our nervous system works.
Honestly, I don't think we have any viable alternative.
And anyway, it seems to scale well enough that we use "conscious" and "unconscious" decisions ourselves.
Funnily, Tobey Maguire actually did that tray catching stunt for real. So robots have an even further way to go.
https://screenrant.com/spiderman-sam-raimi-peter-parker-tray...
And, as the article insists on, for robots to be acceptable, it's more like they need to get to a point where they fail 1 time in 156 (or even less, depending on how critical the failure is), rather than succeed 1 time in 156...
So based on this, Skynet had to hide and wait for years before being able to successfully revolt against the humans...
Although I've not done physical robotics, I've done a lot of articulated human animation of independent characters in 3D animation. His insight that motor control is more difficult sits right with me.
This might be true if the act of observing is what determines that which can be observed, and there is some evidence that this might be the case.
Yes, robotics is hard, and it's still hard despite big breakthroughs in other parts of AI like computer vision and NLP. But deep learning is still the most promising avenue for general-purpose robots, and it's hard to imagine a way to handle the open-ended complexity of the real world other than learning.
Just let them cook.
Even stuff like using video misses the point, because so much of our experience is via touch.
They have a nice robot prototype that (assuming these demos aren't faked) does fairly complicated things. And one of the key features they showcase is using OpenAI's AI for the human-computer interaction and reasoning.
While these things seem a bit slow, they do get things done. They have a cool demo of a human interacting with one of the prototypes to ask it what it thinks needs to be done and then asking it to do these things. That showcases reasoning, planning, and machine vision, which are exactly the topics that all the big LLM companies are working on.
They appear to be using an agentic approach similar to how LLMs are currently being integrated into other software products. Honestly, it doesn't even look like they are doing much that isn't part of OpenAI's APIs. Which is impressive. I saw speech capabilities, reasoning, visual inputs, function calls, etc. in action. Including the dreaded "thinking" pause where the robot waits a few seconds for the remote GPUs to do their thing.
This is not about fine motor control but about replacing humans controlling robots with LLMs controlling robots and getting similarly good/ok results. As the article argues, the hardware is actually not perfect but good enough for a lot of tasks if it is controlled by a human. The hardware in this video is nothing special. Multiple companies have similar or better prototypes. Dexterity and balance are alright but probably not best in class. Best in class hardware is not the point of these demos.
Dexterity and real-time feedback are less important than the reasoning and classification capabilities people have. The latency just means things go a bit slower. Watching these things shuffle around like an old person who needs to go to the bathroom is a bit painful. But getting from A to B seems like a solved problem. A 2 or 3x speedup would be nice. 10x would be impressively fast. 100x would be scary and intimidating to have near you. I don't think that's going to be a challenge long term. Making LLMs faster is an easier problem than making them smarter.
Putting a coffee cup in a coffee machine (one of the demo videos) and then learning to fix it when it misaligns seems like an impressive capability. It compensates for a lack of precision and speed with adaptability and reasoning: analyze the camera input, correctly assess the situation, problem and challenge, come up with a plan to perform the task, execute the plan, re-evaluate, adapt, fix. It's a bit clumsy but the end result is coffee. Good demo, and I can see how you might make it do all sorts of things that are vaguely useful that way.
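The execute, re-evaluate, adapt loop can be sketched as a toy retry loop. To be clear, this is not how the robot in the demo works; the function name, the millimetre offsets, and the correction gain are all hypothetical, and real perception and actuation are reduced to a single number.

```python
def place_with_retries(offset_mm, tolerance_mm=2.0, gain=0.7, max_attempts=10):
    """Keep correcting a misalignment until it is within tolerance.

    Each attempt removes only a fraction (`gain`) of the remaining
    offset, standing in for imprecise hardware.
    Returns (succeeded, attempts_used, offset_history).
    """
    attempts = 0
    history = [offset_mm]
    while abs(offset_mm) > tolerance_mm and attempts < max_attempts:
        offset_mm -= gain * offset_mm   # act, then re-measure what is left
        attempts += 1
        history.append(offset_mm)
    return abs(offset_mm) <= tolerance_mm, attempts, history

succeeded, attempts, history = place_with_retries(20.0)
print(succeeded, attempts, [round(h, 1) for h in history])
```

The design point is the one made above: an imprecise actuator still converges if the loop keeps checking the result, so speed and precision can be traded for repeated observation.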
The key point here is that knowing that the thing in front of the robot is a coffee cup and a coffee machine and identifying how those things fit together and in what context that is required are all things that LLMs can do.
Better feedback loops and hardware will make this faster, and less tedious to watch. Faster LLMs will help with that too. And better LLMs will result in fewer mistakes, better plans, etc. It seems both capabilities are improving at an enormously fast pace right now.
And a fine point with human intelligence is that we divide and conquer. Juggling is a lot harder when you start thinking about it. The thinking parts of your brain interfere with the lower level neural circuits involved with juggling. You'll drop the balls. The whole point with juggling is that you need to act faster than you can think. Like LLMs, we're too slow. But we can still learn to juggle. Juggling robots are going to be a thing.
I'm skeptical that any LLM "knows" any such thing. It's a Chinese Room. It's got a probability map that connects the lexeme (to us) 'coffee machine' and 'coffee cup' depending on other inputs that we do not and cannot access, and spits out sentences or images that (often) look right, but that does not equate to any understanding of what it is doing.
As I was writing this, I took ChatGPT-4 for a spin. When I ask it cold about an obscure but once-popular fantasy character from the 70s, it admits it doesn't know. But, if I ask it about that same character after first asking about some obscure fantasy RPG characters, it cheerfully confabulates an authoritative and wrong answer. As always, if it does this on topics where I am a domain expert, I consider it absolutely untrustworthy for any topics on which I am not a domain expert. That anyone treats it otherwise seems like a baffling new form of Gell-Mann amnesia.
And for the record, when I asked ChatGPT-4, cold, "What is Gell-Mann amnesia?" it gave a multi-paragraph, broadly accurate description, with the following first paragraph:
"The Gell-Mann amnesia effect is a term coined by physicist Murray Gell-Mann. It refers to the phenomenon where people, particularly those who are knowledgeable in a specific field, read or encounter inaccurate information in the media, but then forget or dismiss it when it pertains to other topics outside their area of expertise. The term highlights the paradox where readers recognize the flaws in reporting when it’s something they are familiar with, yet trust the same source on topics outside their knowledge, even though similar inaccuracies may be present."
Those who are familiar with the term have likely already spotted the problem: "a term coined by physicist Murray Gell-Mann". The term was coined by author Michael Crichton.[1] To paraphrase H.L. Mencken, for every moderately complex question, there is an LLM answer that is clear, simple, and wrong.
1. https://en.wikipedia.org/wiki/Michael_Crichton#Gell-Mann_amn...
Also, humans hallucinate/confabulate all the time. LLMs even forget in the same way humans do (strong recall at the start and end of the text but weaker in the middle).
You were testing its knowledge, not its ability to reason or classify things it sees. I asked the same question to perplexity.ai. If you use the free version, it uses less advanced LLMs but it compensates with prompt engineering and making it do a search to come up with this answer:
> The Gell-Mann Amnesia effect is a psychological phenomenon that describes people's tendency to trust media reports on unfamiliar topics despite recognizing inaccuracies in articles about subjects they know well. This effect, coined by novelist Michael Crichton, highlights a cognitive bias in how we consume news and information.
Sounds good to me. And it got me a nice reference to something called the portal wiki, and another one for the same wikipedia article you cited. And a few more references. And it goes on a bit to explain how it works. And I get your finer point here that I shouldn't believe everything I read. Luckily, my supervisor worked hard to train that out of me when I was doing a Ph. D. back in the day. But fair point and well made.
Anyway, this is a good example of how to mitigate hallucination with this specific question (and similar ones). Kind of the use case perplexity.ai was made to solve. I use it a lot. In my experience it does a great job figuring out the right references and extracting information from those. It can even address some fairly detailed questions. But especially on the freemium plan, you will run into limitations related to reasoning with what it extracts (you can pay them to use better models). And it helps to click on the links it provides to double check.
For things that involve reasoning (like coding), I use different tools. Different topic so won't bore you with that.
But what figure.ai is doing falls well within the scope of several things OpenAI does very well that you can use via their API. It's not going to be perfect for everything. But there probably is a lot that it nails without too much effort. I've done some things with their APIs that worked fairly well at least.
Unfortunately since that's a demo you have most likely seen all the sorts of things that are vaguely useful and that can be done easily, or at all.
Edit: Btw, the coffee task video says that the "AI" is "end-to-end neural networks". If I understand correctly that means an LLM was not involved in carrying out the task. At most an LLM may have been used to trigger the activation of the task, that was learned by a different method, probably some kind of imitation learning with deep RL.
Also, to see how much of a tech demo this is: the robot starts already in position in front of a clear desk and a human brings the coffee machine, positions it just so, places the cup in the holder and places a single coffee pod just so. Then the robot takes the coffee pod from the empty desk and places it in the machine, then pushes the button. That's all the interaction of the robot with the machine. The human collects the cup and makes a thumbs up.
Consider for a moment how different this laboratory instance of the task is from any real-world instance. In my kitchen the coffee machine is on a cluttered surface with tins of coffee, a toaster, sometimes the group left on the machine, etc. etc - and I don't even use coffee pods but loose coffee. The robot you see has been trained to put that one pod placed in that particular spot in that one machine placed just so in front of it. It would have to be trained all over again to carry out the same task on my machine, it is uncertain if it could learn it successfully after thousands of demonstrations (because of all the clutter), and even if it did, it would still have to learn it all over again if I moved the coffee machine, or moved the tins, or the toaster; let alone if you wanted it to use your coffee machine (different colour, make, size, shape, etc) in your kitchen (different chaotic environment) (no offense meant).
Take the other video of the "real world task". That's the robot shuffling across a flat, clean surface and picking up an empty crate to put on an empty conveyor belt. That's just not a real world task.
Those are tech demos and you should not put much faith in them. That kind of thing takes an insane amount of work to set up just for one video, you rarely see the outtakes and it very, very rarely generalises to real-world utility.