I was surprised that most VLMs cannot reliably tell if a character is facing left or right; they will confidently lie no matter what you do (even Gemini 3 cannot do it reliably). I guess it's just not in the training data.
That said, Qwen3VL models are smaller/faster and better "spatially grounded" in pixel space, because pixel coordinates are encoded in the tokens. So you can use them for detecting things in the scene, and where they are (which you can project to 3D space if you are running a sim). But they are not good reasoning models, so don't ask them to think.
That means the best pipeline I've found at the moment is to tack a dumb detection prepass on before your action reasoning. This basically turns 3d sims into 1d text sims operating on labels -- which is something that LLMs are good at.
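Roughly what I mean by the prepass, as a toy sketch (detect_objects here is a stand-in for whatever grounded VLM/detector you run; all labels and numbers are made up):

    # Sketch of the detection prepass: a grounded VLM/detector runs first,
    # and the reasoning model only ever sees a plain-text list of labels.
    # detect_objects() is a stand-in (think Qwen3VL returning pixel boxes);
    # every value below is invented.

    def detect_objects(frame):
        return [
            {"label": "red cube", "box": (312, 140, 360, 190), "depth_m": 1.2},
            {"label": "doorway",  "box": (50, 20, 210, 400),   "depth_m": 3.5},
        ]

    def to_text_scene(detections, frame_width=640):
        # Collapse the 3D scene into the 1D label space LLMs handle well.
        lines = []
        for d in detections:
            x0, _, x1, _ = d["box"]
            side = "left" if (x0 + x1) / 2 < frame_width / 2 else "right"
            lines.append(f"{d['label']}: {side} of center, ~{d['depth_m']} m away")
        return "\n".join(lines)

    scene = to_text_scene(detect_objects(frame=None))
    prompt = f"Scene:\n{scene}\n\nWhich direction should the drone move next?"
    print(prompt)  # this string, not the pixels, goes to the reasoning model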
Not perfectly, there's a lot of abuse of gravity or the lack thereof, but yeah. Neuro has also piloted a robot dog in the past.
The fact that a language model can "reason" (in the LLM-slang meaning of the term) about 3D space is an interesting property.
If you give a text description of a scene and ask a robot to perform a peg-in-hole task, modern models are able to solve it fairly easily based on movement primitives. I implemented this on a UR robot arm back in 2023.
The next logical step is, instead of having the model output text (code representing movement primitives), to have it output tokens in action space. This is what models like pi0 are doing.
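To make the "text as movement primitives" idea concrete, here's a rough sketch of the pattern (move_to and insert_peg are hypothetical primitives, not any real robot API, and the "LLM output" is hand-written):

    # Rough sketch: the LLM emits a short plan over hand-written movement
    # primitives, and the robot side only ever executes those verbs.
    # move_to / insert_peg are hypothetical, not a real robot API.

    def move_to(x, y, z):
        print(f"moving gripper to ({x}, {y}, {z})")

    def insert_peg(depth_mm):
        print(f"inserting peg to {depth_mm} mm")

    PRIMITIVES = {"move_to": move_to, "insert_peg": insert_peg}

    # Imagine this came back from the model, given a text description like
    # "peg at (0.4, 0.1, 0.2), hole at (0.6, 0.3, 0.2)".
    llm_plan = [
        ("move_to", (0.4, 0.1, 0.2)),
        ("move_to", (0.6, 0.3, 0.2)),
        ("insert_peg", (25,)),
    ]

    for name, args in llm_plan:
        PRIMITIVES[name](*args)  # action-space models like pi0 skip this text step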
The latter part is interesting. I'm not sure how well one of those would perform once they are working, but my naive gut feeling is that splitting the language part and the driving part into two delegates is cleaner, safer, faster and more predictable.
Since this is a limited and continuous domain, it's a far better one for neural training than natural language. I guess this notion that a language model should be used for 3D motion control is a real indicator of the level of thought going into some of these applications.
The first thought I had was those security guard robots that are popping up all over the place. If they were drones instead, and an LLM talked to people asking them to do/not do things, that would be an improvement.
Or a waiter drone that takes your order in a restaurant, flies to the kitchen, picks up a sealed and secured food container, flies it back to the table, opens it, and leaves. It would monitor for gestures and voice commands to respond to diners and get their feedback, handle abuse, take the food back if it isn't satisfactory, etc.
This is the type of stuff we used to see in futuristic movies. It's almost possible now. Glad to see this kind of tinkering.
You describe why it would be useful to have an LLM in a drone to interact with it but do not explain why it is the very same LLM that should be doing the flying.
1. a drone that you can talk to and that flies on its own
2. a drone where the flying is controlled by an LLM
(2) is a specific instance of the larger concept of (1).
You make an argument that (1) should be addressed, which no one is denying in this thread - people are arguing that (2) is a bad way to do (1).
Your previous comment, in answer to someone arguing about (2), argued that (1) is great, which no one denies in this thread, and which is a different discussion, about what products are desirable rather than how to build said product.
LLMs are a higher-level construct than PID loops. With something like an autopilot I can give the controller a command like 'Go from A to B', and chain constructs like this to accomplish a task.
With an LLM I can give the drone/LLM system a complex command that I'd never be able to encode for a controller alone. "Fly a grid over my neighborhood, document the location of and take pictures of every flower garden".
And if an LLM is just a 'text generator', then it's a pretty damned spectacular one, as it can take free-form input and turn it into a set of useful commands.
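Something like this toy translation layer is what I have in mind; the command schema (SURVEY_GRID, ON_DETECT, etc.) is entirely made up, and the "LLM response" is hand-written here:

    # Toy version of the "free-form text in, useful commands out" layer.
    # The command schema is invented, and llm_plan is what you'd hope the
    # model returns when asked for JSON; the flight stack does the flying.

    import json

    request = ("Fly a grid over my neighborhood, document the location of "
               "and take pictures of every flower garden.")
    print("user:", request)

    llm_plan = json.loads("""
    [
      {"cmd": "SURVEY_GRID", "area": "home_neighborhood", "altitude_m": 40},
      {"cmd": "ON_DETECT", "label": "flower garden",
       "do": ["LOG_LOCATION", "PHOTO"]},
      {"cmd": "RETURN_HOME"}
    ]
    """)

    for step in llm_plan:
        # Each structured command is handed to the conventional controller,
        # which owns the actual control loops.
        print("dispatch:", step["cmd"], step)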
Let me put it this way: what OP built is an airplane in which the pilot doesn't have a control stick, but they have a keyboard, and they type commands into the airplane to fly it. It's a silly, unnecessary step to involve language.
Now, what you're describing is a language problem, namely orchestration, and that is better suited to an LLM.
Give the LLM agent write access to a text file to take notes and it can actually learn (rough sketch below). Not really reliable, but some seem to get useful results. They ain't just text generators anymore.
(but I agree that it does not seem the smartest way to control a plane with a keyboard)
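Something like this minimal sketch is all the notes-file idea amounts to (file name and tool shapes are arbitrary):

    # Minimal sketch of the "notes file" idea: two tools the agent can call.
    # The file name and tool shapes are arbitrary.

    NOTES_PATH = "agent_notes.txt"

    def write_note(text):
        with open(NOTES_PATH, "a") as f:
            f.write(text.rstrip() + "\n")
        return "noted"

    def read_notes():
        try:
            with open(NOTES_PATH) as f:
                return f.read()
        except FileNotFoundError:
            return "(no notes yet)"

    # The "learning" is just earlier runs leaving text behind that gets
    # stuffed back into the next prompt.
    write_note("Gate 3 was blocked last run; approach from the east.")
    print(read_notes())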
I don't think so. But with an AI agent it can.
Sure, they still don't have real understanding, but calling this technology mere text generators in 2026 seems a bit out of the loop.
That part isn’t handled by an LLM
> voice generation,
That part isn’t handled by an LLM
> video generation
That part isn’t handled by an LLM
I think most of us understood that reproducing what existing autopilot can do was not the goal. My inexpensive DJI quadcopter has impressive abilities in this area as well. But I cannot give it a mission in natural language and expect it to execute it. Not even close.
You don't want an LLM to drive a car
There is more to "AI" than LLMs
https://waymo.com/research/emma/
https://waymo.com/blog/2024/10/introducing-emma
https://waymo.com/blog/2025/12/demonstrably-safe-ai-for-auto...
Charitably, I guess you can question why you would ever want to use text to command a machine in the world (simulated or not).
But I don't see how it's the wrong tool given the goal.
> SOTA typically refers to achieving the best performance
Multimodal transformers are the best way to turn plain-text instructions into embodied world behavior. Nothing to do with being 'trendy'. A Vision Language Action model would probably have done much better, but really the only difference between that and the models trialed above is training data. Same technology.
We are on HACKER news. Using tools outside their intended scope is the ethos of a hacker.
https://github.com/kxzk/snapbench/blob/main/llm_drone/src/ma...
I've been working on integrating GPT-5.2 into Unity. It's fantastic at scripting but completely worthless at managing transforms for scene objects. Even with elaborate planning phases it's going to make a complete jackass of itself in world space every time.
LLMs are also wildly unsuitable for real-time control problems, and they never will be suitable. A PID controller or dedicated pathfinding tool driven by the LLM will give a radically superior result.
We use a state machine (LangGraph) to manage the intent and decision tree, but delegate the actual transform math to deterministic code. You really want the model deciding the strategy and a standard solver handling the vectors; otherwise you're just burning tokens to crash into walls.
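Not our actual code, but the shape of the split looks roughly like this (the PID gains, targets and "strategies" dict are all invented):

    # Shape of the split: the model picks a discrete strategy, deterministic
    # code owns the math. Gains, targets and the strategies dict are invented.

    class PID:
        def __init__(self, kp, ki, kd):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.integral = 0.0
            self.prev_error = None

        def step(self, error, dt):
            self.integral += error * dt
            deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * deriv

    strategies = {"approach_door": 3.0, "hold_position": 5.0}  # target distances (m)
    chosen = "approach_door"   # the only thing the LLM decides

    pid = PID(kp=0.8, ki=0.1, kd=0.05)
    distance = 5.0
    for _ in range(50):
        error = distance - strategies[chosen]
        velocity = pid.step(error, dt=0.1)  # deterministic, runs every tick
        distance -= velocity * 0.1          # toy plant model
    print(f"settled near {distance:.2f} m") # no tokens burned inside the loop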
This looks like a pretty fun project and in my rough estimation a fun hacker project.
Vision models do a pretty decent job with spatial reasoning. It’s not there yet but you’re dismissing some interesting work going on.
It would not surprise me at all if self-driving models are adopting a lot of the model architecture from LLMs/generative AI, and even invoke actual LLMs in moments where they would otherwise have needed human intervention.
Imagine there's a decision engine at the core of a self-driving model, and it gets a classification result of what to do next. Suddenly it gets 3 options back with 33.33% weight attached to each of them and very low confidence about which is the best choice. Maybe that's the kind of scenario that used to make self-driving refuse to choose and defer to human intervention. If it could first defer judgement to an LLM, which could say "that's just a goat crossing the road, INVOKE: HONK_HORN", you can imagine how that might be useful. LLMs are clearly proving to be universal reasoning agents, and it's getting tiring to hear people continuously try to reduce them to "next word predictors."
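A toy version of that fallback might look like this (the threshold, option names and ask_llm stub are all invented):

    # Toy version of the deferral: the planner acts alone when confident and
    # escalates ambiguous states to a slower LLM call instead of a human.

    def ask_llm(scene_description, options):
        # Stand-in for a real model call over the scene description.
        return "HONK_HORN"

    def decide(option_scores, scene_description, threshold=0.6):
        best, score = max(option_scores.items(), key=lambda kv: kv[1])
        if score >= threshold:
            return best  # fast path: the normal planner wins
        # Three options at ~33% each lands here.
        return ask_llm(scene_description, list(option_scores))

    print(decide({"BRAKE": 0.34, "SWERVE": 0.33, "HONK_HORN": 0.33},
                 "animal standing in the road, not moving"))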
He answers your question
The failure mode is not just missed objects, it is state aliasing. Two physically different scenes can map to the same label set, especially with occlusion, depth ambiguity, or near boundary conditions. In control tasks like drone navigation, that can produce confident but wrong actions because the planner has no access to the underlying geometry or sensor noise. Error compounds over time since each step re-anchors on an already simplified state.
Are you carrying forward any notion of uncertainty or temporal tracking from the vision stage, or is each step a stateless label snapshot fed to the reasoning model?
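A tiny illustration of the aliasing point, with invented numbers: two geometrically different scenes collapse to the same label snapshot once depth is thrown away.

    # Two different scenes, identical label snapshot once depth is discarded.
    # All numbers are invented; they only exist to make the point.

    def to_label_snapshot(scene):
        # Keep only object labels and coarse left/right position.
        return sorted((obj["label"], "left" if obj["x"] < 0 else "right")
                      for obj in scene)

    scene_a = [{"label": "pillar", "x": -0.4, "depth_m": 1.0},   # pillar close
               {"label": "target", "x":  0.6, "depth_m": 4.0}]
    scene_b = [{"label": "pillar", "x": -2.0, "depth_m": 8.0},   # pillar far
               {"label": "target", "x":  0.1, "depth_m": 1.5}]

    # True: the planner sees the same state, but the safe action differs.
    print(to_label_snapshot(scene_a) == to_label_snapshot(scene_b))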
There were some experiments with embodied LLMs on the front page recently (e.g. basic robot body + task) and SOTA models struggled with that too. And of course they would - what training data is there for embodying a random device with arbitrary controls and feedback? They have to lean on the "general" aspects of their intelligence which is still improving.
With dedicated embodiment training and an even tighter/faster feedback loop, I don't see why an LLM couldn't successfully pilot a drone. I'm sure some will still fall off the rails, but software guardrails could help by preventing certain maneuvers.
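Such a guardrail layer could be as dumb as this sketch (the limits and command shape are invented):

    # Sketch of a guardrail layer: whatever the model asks for gets clamped
    # or vetoed by dumb checks before it reaches the flight controller.

    MAX_SPEED_MS = 3.0
    MIN_ALTITUDE_M = 2.0

    def guard(command, current_altitude_m):
        vx, vy, vz = command["velocity"]
        speed = (vx**2 + vy**2) ** 0.5
        if speed > MAX_SPEED_MS:            # clamp horizontal speed
            vx, vy = vx * MAX_SPEED_MS / speed, vy * MAX_SPEED_MS / speed
        if vz < 0 and current_altitude_m <= MIN_ALTITUDE_M:
            vz = 0.0                        # refuse to descend below the floor
        return {"velocity": (vx, vy, vz)}

    print(guard({"velocity": (6.0, 0.0, -1.0)}, current_altitude_m=1.8))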
I am sure this is already being worked on in Russia, Ukraine and the Netherlands. A lot can go wrong with autonomous flying. One could load the VLM onto a high-end Android phone on the drone and have dual control.
> Only one could do it.
If I understood the chart correctly, even the successful one only found 1/6 of the creatures across multiple runs.
Without comparison to some null hypothesis (a random policy), this article is hogwash.
For some problems, randomness outperforms incompetent reasoning
Gemini Pro, like the other models, didn't even find a single creature.
But that said, I think the author missed something. LLMs aren't great at this type of reasoning/state task, but they are good at writing programs. Instead of asking the LLM to search with a drone, it would be very interesting to know how they would perform if you asked them to write a program to search with a drone.
This is more aligned with the strengths of LLMs, so I could see this as having more success.
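For example, you'd hope the model writes something like a boring lawnmower sweep rather than deciding each move token by token (everything here is invented, just to show the shape):

    # The kind of program you'd hope the model writes: a plain lawnmower
    # sweep over the area, checking the detector at each waypoint, instead
    # of deciding every move token by token. All names here are invented.

    def lawnmower_waypoints(width, height, spacing):
        points, y, going_right = [], 0, True
        while y <= height:
            xs = list(range(0, width + 1, spacing))
            points += [(x, y) for x in (xs if going_right else reversed(xs))]
            y += spacing
            going_right = not going_right
        return points

    def search(detect_creature, width=60, height=60, spacing=10):
        found = []
        for wp in lawnmower_waypoints(width, height, spacing):
            # fly_to(wp) would go here on a real drone / in the sim
            if detect_creature(wp):
                found.append(wp)
        return found

    # Fake detector so the example runs.
    print(search(lambda wp: wp == (30, 20)))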