I was surprised that most VLMs cannot reliably tell if a character is facing left or right; they will confidently lie no matter what you do (even Gemini 3 cannot do it reliably). I guess it's just not in the training data.
That said, Qwen3VL models are smaller/faster and better "spatially grounded" in pixel space, because pixel coordinates are encoded in the tokens. So you can use them for detecting things in the scene, and where they are (which you can project to 3D space if you are running a sim). But they are not good reasoning models, so don't ask them to think.
That means the best pipeline I've found at the moment is to tack a dumb detection prepass on before your action reasoning. This basically turns 3d sims into 1d text sims operating on labels -- which is something that LLMs are good at.
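Roughly what I mean by the prepass, as a toy sketch (detect_objects here is a stand-in for whatever grounded VLM/detector you run; all labels and numbers are made up):

    # Sketch of the detection prepass: a grounded VLM/detector runs first,
    # and the reasoning model only ever sees a plain-text list of labels.
    # detect_objects() is a stand-in (think Qwen3VL returning pixel boxes);
    # every value below is invented.

    def detect_objects(frame):
        return [
            {"label": "red cube", "box": (312, 140, 360, 190), "depth_m": 1.2},
            {"label": "doorway",  "box": (50, 20, 210, 400),   "depth_m": 3.5},
        ]

    def to_text_scene(detections, frame_width=640):
        # Collapse the 3D scene into the 1D label space LLMs handle well.
        lines = []
        for d in detections:
            x0, _, x1, _ = d["box"]
            side = "left" if (x0 + x1) / 2 < frame_width / 2 else "right"
            lines.append(f"{d['label']}: {side} of center, ~{d['depth_m']} m away")
        return "\n".join(lines)

    scene = to_text_scene(detect_objects(frame=None))
    prompt = f"Scene:\n{scene}\n\nWhich direction should the drone move next?"
    print(prompt)  # this string, not the pixels, goes to the reasoning model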
Not perfectly, there's a lot of abuse of gravity or the lack thereof, but yeah. Neuro has also piloted a robot dog in the past.
The fact that a language model can "reason" (in the LLM-slang meaning of the term) about 3D space is an interesting property.
If you give a text description of a scene and ask a robot to perform a peg-in-hole task, modern models are able to solve it fairly easily based on movement primitives. I implemented this on a UR robot arm back in 2023.
The next logical step is, instead of having the model output text (code representing movement primitives), to have it output tokens in action space. This is what models like pi0 are doing.
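To make the "text as movement primitives" idea concrete, here's a rough sketch of the pattern (move_to and insert_peg are hypothetical primitives, not any real robot API, and the "LLM output" is hand-written):

    # Rough sketch: the LLM emits a short plan over hand-written movement
    # primitives, and the robot side only ever executes those verbs.
    # move_to / insert_peg are hypothetical, not a real robot API.

    def move_to(x, y, z):
        print(f"moving gripper to ({x}, {y}, {z})")

    def insert_peg(depth_mm):
        print(f"inserting peg to {depth_mm} mm")

    PRIMITIVES = {"move_to": move_to, "insert_peg": insert_peg}

    # Imagine this came back from the model, given a text description like
    # "peg at (0.4, 0.1, 0.2), hole at (0.6, 0.3, 0.2)".
    llm_plan = [
        ("move_to", (0.4, 0.1, 0.2)),
        ("move_to", (0.6, 0.3, 0.2)),
        ("insert_peg", (25,)),
    ]

    for name, args in llm_plan:
        PRIMITIVES[name](*args)  # action-space models like pi0 skip this text step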
The latter part is interesting. I'm not sure how well one of those would perform once they are working, but my naive gut feeling is that splitting the language part and the driving part into two delegates is cleaner, safer, faster and more predictable.
Since this is a limited and continuous domain, it's a far better one for neural training than natural language. I guess this notion that a language model should be used for 3D motion control is a real indicator of the level of thought going into some of these applications.
The first thought I had was those security guard robots that are popping up all over the place. If they were drones instead, and an LLM talked to people asking them to do/not do things, that would be an improvement.
Or a waiter drone that takes your order in a restaurant, flies to the kitchen, picks up a sealed and secured food container, flies it back to the table, opens it, and leaves. It would monitor for gestures and voice commands to respond to diners and get their feedback, handle abuse, take the food back if it isn't satisfactory, etc.
This is the type of stuff we used to see in futuristic movies. It's almost possible now. Glad to see this kind of tinkering.
You describe why it would be useful to have an LLM in a drone to interact with it but do not explain why it is the very same LLM that should be doing the flying.
1. a drone that you can talk to and that flies on its own
2. a drone where the flying is controlled by an LLM
(2) is a specific instance of the larger concept of (1).
You make an argument that (1) should be addressed, which no one is denying in this thread - people are arguing that (2) is a bad way to do (1).
Your previous comment, in answer to someone arguing about (2), argued that (1) is great, which no one denies in this thread, and which is a different discussion, about what products are desirable rather than how to build said product.
LLMs are a higher-level construct than PID loops. With something like an autopilot I can give the controller a command like 'Go from A to B', and chain constructs like this to accomplish a task.
With an LLM I can give the drone/LLM system a complex command that I'd never be able to encode for a controller alone. "Fly a grid over my neighborhood, document the location of and take pictures of every flower garden".
And if an LLM is just a 'text generator', then it's a pretty damned spectacular one, as it can take free-form input and turn it into a set of useful commands.
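Something like this toy translation layer is what I have in mind; the command schema (SURVEY_GRID, ON_DETECT, etc.) is entirely made up, and the "LLM response" is hand-written here:

    # Toy version of the "free-form text in, useful commands out" layer.
    # The command schema is invented, and llm_plan is what you'd hope the
    # model returns when asked for JSON; the flight stack does the flying.

    import json

    request = ("Fly a grid over my neighborhood, document the location of "
               "and take pictures of every flower garden.")
    print("user:", request)

    llm_plan = json.loads("""
    [
      {"cmd": "SURVEY_GRID", "area": "home_neighborhood", "altitude_m": 40},
      {"cmd": "ON_DETECT", "label": "flower garden",
       "do": ["LOG_LOCATION", "PHOTO"]},
      {"cmd": "RETURN_HOME"}
    ]
    """)

    for step in llm_plan:
        # Each structured command is handed to the conventional controller,
        # which owns the actual control loops.
        print("dispatch:", step["cmd"], step)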
Let me put it this way: what OP built is an airplane in which the pilot doesn't have a control stick, but they have a keyboard, and they type commands into the airplane to fly it. It's a silly, unnecessary step to involve language.
Now, what you're describing is a language problem, namely orchestration, and that is better suited to an LLM.
Give the LLM agent write access to a text file to take notes and it can actually learn (rough sketch below). Not really reliable, but some seem to get useful results. They ain't just text generators anymore.
(but I agree that it does not seem the smartest way to control a plane with a keyboard)
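Something like this minimal sketch is all the notes-file idea amounts to (file name and tool shapes are arbitrary):

    # Minimal sketch of the "notes file" idea: two tools the agent can call.
    # The file name and tool shapes are arbitrary.

    NOTES_PATH = "agent_notes.txt"

    def write_note(text):
        with open(NOTES_PATH, "a") as f:
            f.write(text.rstrip() + "\n")
        return "noted"

    def read_notes():
        try:
            with open(NOTES_PATH) as f:
                return f.read()
        except FileNotFoundError:
            return "(no notes yet)"

    # The "learning" is just earlier runs leaving text behind that gets
    # stuffed back into the next prompt.
    write_note("Gate 3 was blocked last run; approach from the east.")
    print(read_notes())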
I don't think so. But with an AI agent it can.
Sure, they still don't have real understanding, but calling this technology mere text generators in 2026 seems a bit out of the loop.
That part isn’t handled by an LLM
> voice generation,
That part isn’t handled by an LLM
> video generation
That part isn’t handled by an LLM
I think most of us understood that reproducing what existing autopilot can do was not the goal. My inexpensive DJI quadcopter has impressive abilities in this area as well. But I cannot give it a mission in natural language and expect it to execute it. Not even close.
You don't want an LLM to drive a car
There is more to "AI" than LLMs
https://waymo.com/research/emma/
https://waymo.com/blog/2024/10/introducing-emma
https://waymo.com/blog/2025/12/demonstrably-safe-ai-for-auto...
Charitably, I guess you can question why you would ever want to use text to command a machine in the world (simulated or not).
But I don't see how it's the wrong tool given the goal.
> SOTA typically refers to achieving the best performance
Multimodal transformers are the best way to turn plain-text instructions into embodied world behavior. Nothing to do with being 'trendy'. A Vision Language Action model would probably have done much better, but really the only difference between that and the models trialed above is training data. Same technology.
We are on HACKER news. Using tools outside their intended scope is the ethos of a hacker.
https://github.com/kxzk/snapbench/blob/main/llm_drone/src/ma...
I've been working on integrating GPT-5.2 into Unity. It's fantastic at scripting but completely worthless at managing transforms for scene objects. Even with elaborate planning phases it's going to make a complete jackass of itself in world space every time.
LLMs are also wildly unsuitable for real-time control problems, and they never will be suitable. A PID controller or dedicated pathfinding tool driven by the LLM will give a radically superior result.
We use a state machine (LangGraph) to manage the intent and decision tree, but delegate the actual transform math to deterministic code. You really want the model deciding the strategy and a standard solver handling the vectors; otherwise you're just burning tokens to crash into walls.
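Not our actual code, but the shape of the split looks roughly like this (the PID gains, targets and "strategies" dict are all invented):

    # Shape of the split: the model picks a discrete strategy, deterministic
    # code owns the math. Gains, targets and the strategies dict are invented.

    class PID:
        def __init__(self, kp, ki, kd):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.integral = 0.0
            self.prev_error = None

        def step(self, error, dt):
            self.integral += error * dt
            deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * deriv

    strategies = {"approach_door": 3.0, "hold_position": 5.0}  # target distances (m)
    chosen = "approach_door"   # the only thing the LLM decides

    pid = PID(kp=0.8, ki=0.1, kd=0.05)
    distance = 5.0
    for _ in range(50):
        error = distance - strategies[chosen]
        velocity = pid.step(error, dt=0.1)  # deterministic, runs every tick
        distance -= velocity * 0.1          # toy plant model
    print(f"settled near {distance:.2f} m") # no tokens burned inside the loop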
This looks like a pretty fun project and in my rough estimation a fun hacker project.
Vision models do a pretty decent job with spatial reasoning. It’s not there yet but you’re dismissing some interesting work going on.
It would not surprise me at all if self-driving models are adopting a lot of the model architecture from LLMs/generative AI, and even invoke actual LLMs in moments where they would otherwise have needed human intervention.
Imagine there's a decision engine at the core of a self-driving model, and it gets a classification result of what to do next. Suddenly it gets 3 options back with 33.33% weight attached to each of them and very low confidence about which is the best choice. Maybe that's the kind of scenario that used to make self-driving refuse to choose and defer to human intervention. If it could first defer judgement to an LLM, which could say "that's just a goat crossing the road, INVOKE: HONK_HORN", you can imagine how that might be useful. LLMs are clearly proving to be universal reasoning agents, and it's getting tiring to hear people continuously try to reduce them to "next word predictors."
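A toy version of that fallback might look like this (the threshold, option names and ask_llm stub are all invented):

    # Toy version of the deferral: the planner acts alone when confident and
    # escalates ambiguous states to a slower LLM call instead of a human.

    def ask_llm(scene_description, options):
        # Stand-in for a real model call over the scene description.
        return "HONK_HORN"

    def decide(option_scores, scene_description, threshold=0.6):
        best, score = max(option_scores.items(), key=lambda kv: kv[1])
        if score >= threshold:
            return best  # fast path: the normal planner wins
        # Three options at ~33% each lands here.
        return ask_llm(scene_description, list(option_scores))

    print(decide({"BRAKE": 0.34, "SWERVE": 0.33, "HONK_HORN": 0.33},
                 "animal standing in the road, not moving"))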
He answers your question
The failure mode is not just missed objects, it is state aliasing. Two physically different scenes can map to the same label set, especially with occlusion, depth ambiguity, or near boundary conditions. In control tasks like drone navigation, that can produce confident but wrong actions because the planner has no access to the underlying geometry or sensor noise. Error compounds over time since each step re-anchors on an already simplified state.
Are you carrying forward any notion of uncertainty or temporal tracking from the vision stage, or is each step a stateless label snapshot fed to the reasoning model?
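A tiny illustration of the aliasing point, with invented numbers: two geometrically different scenes collapse to the same label snapshot once depth is thrown away.

    # Two different scenes, identical label snapshot once depth is discarded.
    # All numbers are invented; they only exist to make the point.

    def to_label_snapshot(scene):
        # Keep only object labels and coarse left/right position.
        return sorted((obj["label"], "left" if obj["x"] < 0 else "right")
                      for obj in scene)

    scene_a = [{"label": "pillar", "x": -0.4, "depth_m": 1.0},   # pillar close
               {"label": "target", "x":  0.6, "depth_m": 4.0}]
    scene_b = [{"label": "pillar", "x": -2.0, "depth_m": 8.0},   # pillar far
               {"label": "target", "x":  0.1, "depth_m": 1.5}]

    # True: the planner sees the same state, but the safe action differs.
    print(to_label_snapshot(scene_a) == to_label_snapshot(scene_b))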
There were some experiments with embodied LLMs on the front page recently (e.g. basic robot body + task) and SOTA models struggled with that too. And of course they would - what training data is there for embodying a random device with arbitrary controls and feedback? They have to lean on the "general" aspects of their intelligence which is still improving.
With dedicated embodiment training and an even tighter/faster feedback loop, I don't see why an LLM couldn't successfully pilot a drone. I'm sure some will still fall off the rails, but software guardrails could help by preventing certain maneuvers.
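Such a guardrail layer could be as dumb as this sketch (the limits and command shape are invented):

    # Sketch of a guardrail layer: whatever the model asks for gets clamped
    # or vetoed by dumb checks before it reaches the flight controller.

    MAX_SPEED_MS = 3.0
    MIN_ALTITUDE_M = 2.0

    def guard(command, current_altitude_m):
        vx, vy, vz = command["velocity"]
        speed = (vx**2 + vy**2) ** 0.5
        if speed > MAX_SPEED_MS:            # clamp horizontal speed
            vx, vy = vx * MAX_SPEED_MS / speed, vy * MAX_SPEED_MS / speed
        if vz < 0 and current_altitude_m <= MIN_ALTITUDE_M:
            vz = 0.0                        # refuse to descend below the floor
        return {"velocity": (vx, vy, vz)}

    print(guard({"velocity": (6.0, 0.0, -1.0)}, current_altitude_m=1.8))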
I am sure this is already being worked on in Russia, Ukraine and the Netherlands. A lot can go wrong with autonomous flying. One could load the VLM onto a high-end Android phone on the drone and have dual control.
> Only one could do it.
If I understood the chart correctly, even the successful one only found 1/6 of the creatures across multiple runs.
Without comparison to some null hypothesis (a random policy), this article is hogwash.
For some problems, randomness outperforms incompetent reasoning
Gemini Pro, like the other models, didn't even find a single creature.
But that said, I think the author missed something. LLMs aren't great at this type of reasoning/state task, but they are good at writing programs. Instead of asking the LLM to search with a drone, it would be very interesting to know how they would perform if you asked them to write a program to search with a drone.
This is more aligned with the strengths of LLMs, so I could see this as having more success.
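For example, you'd hope the model writes something like a boring lawnmower sweep rather than deciding each move token by token (everything here is invented, just to show the shape):

    # The kind of program you'd hope the model writes: a plain lawnmower
    # sweep over the area, checking the detector at each waypoint, instead
    # of deciding every move token by token. All names here are invented.

    def lawnmower_waypoints(width, height, spacing):
        points, y, going_right = [], 0, True
        while y <= height:
            xs = list(range(0, width + 1, spacing))
            points += [(x, y) for x in (xs if going_right else reversed(xs))]
            y += spacing
            going_right = not going_right
        return points

    def search(detect_creature, width=60, height=60, spacing=10):
        found = []
        for wp in lawnmower_waypoints(width, height, spacing):
            # fly_to(wp) would go here on a real drone / in the sim
            if detect_creature(wp):
                found.append(wp)
        return found

    # Fake detector so the example runs.
    print(search(lambda wp: wp == (30, 20)))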