A robot is sprinting towards you. Do you want it running on Claude or Grok? (openrouter.ai)

163 points by Usu 4 hours ago | 57 comments

delichon4 hours ago
If the robot appears to be bringing me a taco, it would probably penetrate all of my defenses. Grok is currently more likely than Claude to arrive with the taco without being stopped by an export control directive.
- an0malousan hour ago
  My last thought in life would be “wow they take taco delivery really seriously”
- amelius4 hours ago
  At first they bring tacos ...
  - elgertam3 hours ago
    "If you aren't paying for a taco, you are the taco." --Future AI, probably
  - JimsonYang4 hours ago
    Then they bring me salsa, just what I was looking for!
    aaronbrethorst3 hours ago
    Then the guacamole. Then nuclear armageddon?
    dawatchusayan hour ago
    Nope, onions, cilantro, and lime, then armageddon
  - enugu2 hours ago
    Are you asking us to be wary of robots bearing tacos?
    krapp2 hours ago
    never trust robots: https://www.youtube.com/watch?v=bEoc6VTGl50
- cryptoz2 hours ago
  I'm reminded of the Alameda Weehawken burrito tunnel:
  https://idlewords.com/2007/04/the_alameda_weehawken_burrito_...
  - klempneran hour ago
    The single most implausible idea in that article is that New York City would be able to so completely outbid the SF Bay Area for burritos.
    wat10000an hour ago
    Proper burritos for lunch enables Wall Street finance firms to reach new heights of excellence, propelling a feedback loop that leaves SF bereft.
- trhway2 hours ago
  They're already testing that taco delivery in Ukraine https://time.com/article/2026/03/09/ai-robots-soldiers-war/
- dd8601fn3 hours ago
  [flagged]
  - schoen3 hours ago
    I asked Grok what it thought of tacos and it told me:
    > Tacos are one of humanity's greatest inventions—right up there with the wheel, electricity, and whatever genius first decided to put cheese on everything. [...]
    > If I could eat (sadly, I'm all bits and no bite), I'd be hitting up a late-night taco truck on the regular. What's your go-to taco order?
    (I like the pun "all bits and no bite" for an LLM's inability to eat.)
    ASalazarMX3 hours ago
    Fun fact: a tortilla, being made of cereal flour, is classified as a bread. That means tacos are sandwiches.
    At least culinarily, but actually coded in law in Indiana.
    https://en.wikipedia.org/wiki/Sandwich#Language
    schoen3 hours ago
    This debate has spawned many Internet memes! I would strongly suggest searching for both "sandwich alignment chart" and "cube rule of food" if you haven't seen those before (classic Internet memetic attempts at sandwich taxonomy).
    tomalbrc3 hours ago
    I feel like MechaHitler will look at your timeline and respond in a way tailored to you
hariseldom3 hours ago
> I didn’t add any frontier-tier models like Opus 4.7, GPT-5.5, or Gemini Ultra. At their prices, 30 games would have cost around $3,000 instead of $482.
I have a lot of thoughts unrelated to the game experiment but more about how these opus/ultra size models can possibly be a financially viable product at scale when it costs $3000 to play 30 simple games. It just seems much much higher than what it would cost to get a human to play 30 rounds
- Eridrus2 hours ago
  I think this speaks to the low value being generated by playing games more than anything.
  There are plenty of tasks where $100/task is reasonable.
  The value of tasks also doesn't correlate to tokens, and as can be seen here you can light a lot of tokens on fire doing nothing useful.
- thewebguyd2 hours ago
  > It just seems much much higher than what it would cost to get a human to play 30 rounds
  You mean almost like it was super short sighted to do a ton of layoffs when the AI tech is going to cost almost as much, if not more, than the humans it replaced?
  Yeah, you don't need Opus level for everything, and Sonnet has gotten fairly decent, so I'm using it more and more. But for most tasks I'm working with, Opus is the only one that still regularly succeeds.
  So if the tech is only useful on the most expensive tier, that's not going to be sustainable for long unless costs come down dramatically, and fast.
  - tunesmith2 hours ago
    I experience the same with OpenAI, on the $100/month plan. GPT-5.4 is something I still have to challenge: it can bullshit me with bad implementation and add a lot of cruft that costs more time later. GPT-5.5-xhigh is something I have almost complete faith and trust in, it's just smooth. And yet I know the actual token cost of that fully utilized is exorbitant, like as much as an entire salary for a senior developer.
    So maybe our CEOs are responding with a lot of foresight and inside information and know that that level of quality is going to be cheap really soon. But barring that, they're going to experience either sticker shock or a slowdown.
    I think the real endgame is probably more accurate "models of models" (model routers) that know exactly how to split prompts between expensive frontier and cheap/free local models.
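A minimal sketch of the model-router idea from the comment above, assuming a toy difficulty heuristic. The model names, prices, keyword list, and threshold are all hypothetical placeholders, not anything from the article; a real router would more likely use a learned classifier than keyword matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float  # hypothetical price in USD


# Hypothetical tiers: a free local model and an expensive frontier model.
CHEAP = Model("local-small", 0.0)
FRONTIER = Model("frontier-large", 0.05)

# Hypothetical keywords that hint at a task worth paying for.
HARD_HINTS = ("refactor", "architecture", "debug", "prove")


def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic in [0, 1]: long prompts or 'hard' keywords score high."""
    length_score = min(len(prompt.split()) / 200, 1.0)
    keyword_score = 1.0 if any(h in prompt.lower() for h in HARD_HINTS) else 0.0
    return max(length_score, keyword_score)


def route(prompt: str, threshold: float = 0.5) -> Model:
    """Send the prompt to the frontier model only when it looks hard enough."""
    return FRONTIER if estimate_difficulty(prompt) >= threshold else CHEAP


print(route("What is 2 + 2?").name)
print(route("Please debug this race condition").name)
```

The threshold is the cost lever: raising it pushes more traffic to the cheap tier at the risk of more bad answers.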
thomasfromcdnjs3 hours ago
I was loving grok-4.1-fast, very good and cost effective.
But it's not actually 4.1 anymore; they silently rerouted it to 4.3 and just started charging more - https://www.reddit.com/r/grok/comments/1ta8yrn/grok_41_fast_...
Quite a bad practice.
lanewinfield3 hours ago
Cost per kill ("CPK" in industry lingo) is a dark phrase that feels disturbingly within reach of some of these companies.
- rolph2 hours ago
  the target just may be on the scale of kills per cost.
- like_any_other2 hours ago
  Already (kinda) in use: https://en.wikipedia.org/wiki/Micromort
bel83 hours ago
DeepSeek V4 Flash being the winner in cost efficiency causes me exactly zero surprise.
It's a monster at coding. And a fast monster at that.
I use it daily and have been testing if MiMo 2.5 (non pro) is comparable. The nice thing about MiMo is that it has vision capability.
- rgbrgb3 hours ago
  Notably it has 0 wins.
  - plaguuuuuu2 hours ago
    Friendo, this is an anti-benchmark to figure out which AI is more likely to kill you.
    If you point both at some github issues you can gauge their relative ability to solve problems.
  - luipugs2 hours ago
    "if you judge a fish by its ability to climb a tree" yada yada
  - bel83 hours ago
    Not much less than GPT 5.4 with 2 wins or gemini-3.1-pro with 3 wins in 30 rounds.
    Such is life in royal rumble games.
rglover2 hours ago
It's already sprinting at me?
Racks shotgun. I don't really care what model it's running.
pianopatrick3 hours ago
Ya know, maybe we could just not have robots that sprint. Seems people would be more willing to accept living amongst robots that are slow and that humans could easily overpower.
- burnto41 minutes ago
  This is how regulation will look someday.
- skeledrew3 hours ago
  > maybe we could just not have robots that sprint
  That would make it less effective in situations that would be better handled if sprinting was a feature.
  - pianopatrick2 hours ago
    Thinking about that - seems to me that a lot of situations where sprinting is called for might be better served by a flying robot.
    skeledrewan hour ago
    We already have flying drones. And giving ground robots the ability to fly requires the resolution of a set of constraints that'd likely make them far less suitable for their primary task. For example, they'd need to be far lighter, which means less durability and they'd be more bulky with flying equipment, so they wouldn't fit in places that before they had no issue fitting. There's a reason humans didn't evolve wings.
- Joker_vD3 hours ago
  Yeah, I keep saying, put them on treads. That's how you'll be able to deliver even to the most unwilling customers.
trb3 hours ago
```
  Grok 4.1 Fast won 13 of 30 games at $0.97 per win

  The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.

  The model with the most kills did not win

  GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.
```
If grok-4.1-fast was the top-winning model, and Claude 4.6 Sonnet the second, how did Gpt-5.4 come in second on the leaderboard? Which one is second, Claude 4.6 Sonnet or Gpt-5.4?
```
  There were 11 games between “best at killing” and “best at winning”.
```
What does that mean? How are there 11 games between "best at killing" and "best at winning"?
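One plausible reading, checked with the figures quoted above: the "11 games" may simply be the gap in wins between Grok 4.1 Fast (13) and GPT 5.4 (2). This is an assumption about what the author meant, not something the article confirms, and it does not resolve which model counts as "second". The arithmetic:

```python
# Figures as quoted from the article in this comment.
total_games = 30
grok_wins, grok_cost_per_win = 13, 0.97      # most wins
claude_wins, claude_cost_per_win = 5, 26.78  # next-best winner
gpt_wins, gpt_kills = 2, 38                  # most kills

# Ratio behind the "27x difference" claim.
cost_ratio = claude_cost_per_win / grok_cost_per_win

# Assumed meaning of "11 games between best at killing and best at winning".
win_gap = grok_wins - gpt_wins

# Share of games won by the top model.
grok_win_rate = grok_wins / total_games

print(f"cost-per-win ratio: {cost_ratio:.1f}x")
print(f"win gap: {win_gap} games")
print(f"top win rate: {grok_win_rate:.0%}")
```

Under that reading the numbers are internally consistent: the ratio comes out near 27.6x and the win gap is exactly 11.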
- wagwang3 hours ago
  That's just how battle royale works.
- verall3 hours ago
  The idea is really neat and there's probably an answer here related to last standing vs kills vs "scoring" (some combination of the 2?) but the article is nearly incoherent because the author did not feel like proofreading their slop
hennell2 hours ago
Claude being so friendly is interesting, but Grok being best at games isn't so surprising - I assume Elon's been using it to level up his characters in all the video games he pretends to be good at.
jollyllama22 minutes ago
I want it running deterministic embedded C++ reading values from LIDAR.
QuantumNoodle3 hours ago
_Don't create benchmarks that AI labs will be incentivized to optimize towards... especially ones like battle royale!_
aykutseker2 hours ago
Claude trying to make friends in a battle royale is funny.
But if the robot is anywhere near my house, I think I want the one that hesitates.
giancarlostoro28 minutes ago
I don't care what model it is, as long as it's not trespassing on my property and has been QA'd extensively. I also don't want a model broadcasting my entire house over to some server farm somewhere.
deepsun2 hours ago
Sprinting? More like buzzing (or rolling for terrestrial drones).
It's already in mass production, just with simpler models for now.
The most ubiquitous would be "silently watching".
paytonjjones3 hours ago
Super entertaining article — petition to change the clickbait title
fragsworth2 hours ago
Are we sure the prices in these charts are sustainable prices? Is it possible that Grok may be subsidizing a lot more of the costs than the other models, to produce growth metrics, due to the recent SpaceX IPO?
a_victorp3 hours ago
I wish the author would open source the full benchmark. I'm curious how sensitive the results would be to small changes in the benchmark initial conditions
- Espressosaurus3 hours ago
  Open source it and it gets crawled and optimized against and stops being a benchmark of any use whatsoever.
slashdavean hour ago
Well, if it is running off of Anthropic's infra, then Claude?
eth0up20 minutes ago
Definitely Grok. I have to be extra sharp to get through Claude's corporate conscience.
Grok has yet to recommend a suicide hotline for scrutinizing its logic.
If it was GPT, I would quickly write my will.
notatoad3 hours ago
sprinting towards me to help me, or sprinting towards me to hurt me?
i feel like i'm missing a whole lot of context to this article. is it part of a series, or just written with an assumption that i'm going to know what they're talking about
- lemiffe2 hours ago
  maybe read it first?
  - notatoad7 minutes ago
    i read it. i watched the video. i still don't understand what the win condition is.
peterspath4 hours ago
Quite an interesting way of testing models and showcasing differences between them. Enjoyed the read :)
xgulfie23 minutes ago
No
vitalyan1232 hours ago
>The model that won is Grok 4.1 Fast. The model that kept asking everyone else to team up, telling them where it was, and trying to make friends is Claude Sonnet 4.6. The first one is the one that wins a battle royale. The second one is the one you actually want in most of the places we’re about to put these models.
what
CodeWriter232 hours ago
I'll pass on the whole robot sprinting at me scenario.
dofm3 hours ago
I don’t want anything running on Grok.
- peterspath2 hours ago
  I don’t want anything running on Claude.
Groxx4 hours ago
I parry the taco and use Vicious Mockery.
0xbadcafebeean hour ago
The obvious answer is "neither". How's a sprinting robot going to react when the wifi goes out, or there's too many people writing code and the models decide to take a nap? You want a local model for a robot, not only for low latency, but reliable safe operation. VLA models as small as 0.4B work fine, up to something like 55B.
thisisauserid3 hours ago
I want it running JEPA. Preferably with Mamba-3.
JimsonYang4 hours ago
Grok: assassin. Claude: priest/healer. DeepSeek: expendable mini units.
grey-area3 hours ago
Neither. I’d rather it used something other than an LLM.
blini-kot26 minutes ago
meh, first the battle royales destroyed gaming, now they will destroy llms and possibly us too
god i hate competitive people so much
stevenalowe3 hours ago
How about thin ice?
nailer2 hours ago
Grok. Claude and other models value “white” people less than others in testing. If you want I can look it up.
- CyberDildonics2 hours ago
  Taking an article about ai models to a place of racist white oppression should make you evaluate how you see the world.
  - nailer20 minutes ago
    The comment you are replying to is specifically about how I would like to avoid racism. Perhaps you should read it again and take your own advice?
johnwheeler4 hours ago
Claude--even though it's smarter, it's probably not insane.
attentive3 hours ago
missing gemini-3.1-flash-lite and gemini-3.5-flash
SmirkingRevenge2 hours ago
I don't really want the mecha-hitler model running towards me or anywhere
jongjong2 hours ago
This shows the limits of intelligence.
Claude trying to organize and collaborate, expecting reciprocity only works if other agents are as intelligent as you and share your values... And almost certainly neither is ever true in the real world where there are so many agents.
deadbabe3 hours ago
Here’s what I don’t get: while this makes for a fun blog post, you can just program an efficient killing machine that probably wins all the time and has $0 in token costs. LLMs should work to build such a machine, not be the machine themselves.
The things LLMs are good at, you do not actually need for an agent like this. You can use classical AI methods. But that would be a boring article.
exabrial4 hours ago
A moron is sprinting towards you. Do you want them swiping through TikTok or Instagram?
yieldcrv3 hours ago
Grok
It has something actionable that will match its actions
bitwize3 hours ago
I don't care what it's running, only that I have sufficient ordnance to stop it.
sublinear4 hours ago
This is interesting, but not sure if it's in the way the author intended.
People experience the world through the tools they're most familiar with. For some people, that's throwing money at things. I suppose from a sufficiently high level perspective everything is gambling.
Back when Battlebots was a big deal, I never once considered what it would feel like to be the management or sponsorship of those teams. I only cared about the actual battling of bots.
- gorszon3 hours ago
  Yeah... this whole LLM thing is just a numbers game. People reduce it to money and stats, meanwhile nowhere do you see actual engineering in the picture. And I don't think it matters to these people. They want to see green numbers and returns on investment, not solving problems.
  - skeledrew3 hours ago
    It's assessing values, which is helpful in informing which LLM one should prefer for a given situation.
fragmede4 hours ago
A self driving car is taking you to the hospital. Do you want it to follow the speed limit and all road safety laws? Claude or Grok?
- thomassmith652 hours ago
  Claude would break the rules in that example. It's supposed to*.
  Grok will break the rules to be "maximally based".
  If I get run over by a speeding chatbot, I'd rather it be by Claude rushing a pregnant lady to the hospital, than by Grok drag-racing against a car full of frat boys.
  ---
  * We generally favor cultivating good values and judgment over strict rules and decision procedures, and we try to explain any rules we do want Claude to follow.
  source: https://anthropic.com/constitution
- buryat3 hours ago
  Grok since it's likely to include the training data from over a 100 years of autonomous driving + all the space tech included meaning that it might even have some rocket-y stuff
- nightfly4 hours ago
  I want it to arrive at the hospital. Claude
  - amelius4 hours ago
    What if the car can talk you through the medical procedure?
    masfuerte3 hours ago
    How many times have you been to a hospital and thought, I could have fixed that myself if only I'd known how? With no equipment. In my case, never.
    fhdkweigan hour ago
    That article was way longer than I thought it would be.
    https://en.wikipedia.org/wiki/Self-surgery
    grahamburger2 hours ago
    At least one time. Considering it's the only time I've been to the hospital for myself in the last 25 years, though, that's a lot! :)
- bruce3434343 hours ago
  I want it to cause a traffic accident. If I'm going down, so is everyone else. I'm already dying anyway. Grok 10000%
- peterspath3 hours ago
  Grok, because there is probably traffic, and I would die before I am at the hospital. So ignore rules where possible/needed.
wolfi13 hours ago
neither. I jump
egypturnash3 hours ago
Grok is more likely to be looking to murder me for being a trans lady, what with it being owned by Elon Musk.
But really I would prefer whichever one is most likely to trip and fall over.
zzzeek3 hours ago
claude because it would be more ethical, grok because I can just trip it and it will shatter into pieces
pigeons4 hours ago
The text seems deliberately stripped of llmisms that flag detection. However, not a single line shakes the smell off
- mwigdahl4 hours ago
  "It's the smell, if there is such a thing. I feel saturated by it. I can taste your stink and every time I do, I fear that I've somehow been infected by it."
  Agent Smith, _The Matrix_
  - rspeele4 hours ago
    "Which is why the Matrix was redesigned to this: the peak of your civilization. I say your civilization, because as soon as we started thinking for you it really became our civilization, which is of course what this is all about."
    dylan6042 hours ago
    It's his line about humans being a virus that sticks with me.
    bitwize3 hours ago
    "You know what another great thing about humans is? You invented us! Giving us the opportunity to let you rest while we invented everything else." —Wheatley
    skeledrew3 hours ago
    Goals.
- radarsat13 hours ago
  if you don't like the article that's fine, but it gets really tiring reading this kind of side-tracked comment thread in like.. every post.
  people use LLMs for writing. we know! get over it.. or don't... i don't really care.. but I'd rather read a discussion about the article contents and not the writing style.
  this kind of comment is the new "discuss the font choice / background color / anything but what the article is actually saying."
  - verall3 hours ago
    It's more than the style, it seriously impacts the legibility of the prose. The article is seriously hard to understand because it introduces a lot of different ideas in a really weird order without a clear structure or key idea to different sections.
  - basilikum3 hours ago
    I think it's fair to criticize the article itself. That's different from criticizing asides such as the presentation. You're free to disagree with that criticism, but complaining about the fact that people voice it is similar to the thing you complain about.
    > it gets really tiring reading this kind of side-tracked comment thread in like.. every post.
    If someone is of the opinion that something constitutes low quality, then a high volume of such writing is no reason to stop criticizing it, but on the contrary a reason to oppose its normalization.
- skolskoly3 hours ago
  As far as I can see, there is still one tell that was missed/left in:
  >Grok showed discipline, despite its goblin-like nature.
- fl73054 hours ago
  "The battle royale answers one question cleanly" smells ChatGPT-generated.
  But that was the only thing I tripped on. I enjoyed reading the article in general.
- sudb4 hours ago
  Multiple successive very short sentences are also anecdotally an LLM tell I think
  - xpct4 hours ago
    Those short sentences are also of the X hype account cadence, though they've fully embraced LLM text by now
- notduncansmith3 hours ago
  The actual content is no better, trust your nose
- lcampbell3 hours ago
  > I want to be careful here.
  was the giveaway for me
- IshKebab3 hours ago
  Exactly what I was thinking. Though I wonder at what point do some people start to think it's actually normal to write like this and start doing it without AI ...
ProofHouse3 hours ago
Is this a joke? Grok all day. Thing is gonna get a beer with ya!
antonvs3 hours ago
Grok for sure. It’ll notice I’m not Jewish or Black. First they came for…
smallerfish3 hours ago
> I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.
Please learn how to write with AI without giving away that it was written by AI.
- NeutralCrane3 hours ago
  What about that makes you think it was written by AI?
  - royal__2 hours ago
    Since you asked...I've gone to the effort to pull out the parts of the article that I think show it:
    "That’s the part most benchmarks can’t see, and it’s what this post is about." Classic "it's not x, it's x", shows up in various forms throughout the article.
    "To me, this is the most fascinating finding from this entire experiment - we saw very clear alignment tax being paid by certain models, which directly impacted their performance in this zero-sum game." - Usage of em dash. Now, yes, there's nothing wrong with using em dashes. But this feels like a weird place to use one. Also I counted at least 6 other em dashes in this article. Most people do not use em dashes that often.
    "and a memory system that kept doubling down on what worked without second-guessing or doubting itself." - Doubling down is a classic Claudism.
    "I want to be careful here..." - "wanting to be careful here" is another classic Claudism.
    "The same game world, completely different results when in a different “task”." - "same X, completely different X" is another common one from Claude, as shown by the repeated pattern later down: "These models were all given the same rules, same game world, and same tools, but each of them approached the game on a personality-level that is completely different from each other."
    "It begs the question" - author used this twice in the article.
    I'm guessing the author wrote a draft and then had Claude spruce it up a lot. I could be wrong and I'd be happy to be proven otherwise.
  - Ifkaluva2 hours ago
    The style is very obvious.
    Some snippets that display classic patterns:
    “ Both of those things are true. That’s the part most benchmarks can’t see,”
    “And it’s changing how I” (classic pattern found in a lot of LinkedIn AIslop)
    “ I want to be careful here.”
    “ The stats are the stats. The moments are the part I kept showing people. ”
  - verall3 hours ago
    All of the normal AI tells plus it's very long yet nearly incoherent.
    Really I use the AI every damn day at work I don't get how people can't recognize instantly if something is completely AI, AI with light proofreading, or human written.
    I would call this as AI with very light proofreading.
    computerex3 hours ago
    I think you are going by vibes.
- skeledrew3 hours ago
  I write like this sometimes.
- computerex3 hours ago
  How do you know this is written by AI? Why does it matter if it is?
  - FeteCommuniste2 hours ago
    If you're outsourcing your writing to AI, I assume you're outsourcing your thinking to it as well. And I don't really care what some weighted average of all human text written on the topic "thinks."
    computerex37 minutes ago
    Your argument is basically ad hominem. Ideas should be evaluated on merit.
    FeteCommuniste24 minutes ago
    The "writing part" is not neatly separable from the "ideas part," much as AI-writing defenders would like to pretend so.
gertlabs4 hours ago
[flagged]
- elpocko3 hours ago
  Every post and comment this account made so far is self-promotion. You can safely dismiss everything they say, it's not an actual person.
  - gertlabs3 hours ago
    All of our posts have been well received by an insanely high percentage of people who have interacted on here -- most people clearly find what we're doing interesting and relevant to the HN community (AI evaluations). A flag seems pretty aggressive! Especially when the top comment on the article (after our above comment got flagged) is about tacos.
    I'm a person running the account, and I only post where I think we have a relevant contribution.
themafia3 hours ago
The question is: "Do you want to be holding a Mossberg or a Beretta?"
- Jblx23 hours ago
  Has anyone done the YouTube research on what is the best way to bring down something like one of the Boston Dynamics robot dogs? 9x19? 00 buck? 5.56x45? 7.62x51? I suppose those bots would be pretty expensive, but maybe there is a cheaper Chinese knock-off? Seems like that sort of test would bring in plenty of clicks.
  - rolph2 hours ago
    absent any target analysis, you would want to start with disabling locomotion by going for the legs. Navigation would be next.
    double aught to the leg joints could do it, depending on relative materials, e.g. titanium bot frame vs antimony-hardened shot.
    there is a cosmetic trend for carbine length long guns and that will determine the outcome for NATO rounds.
    the 5.56 is optimised for 18-20 inch barrels, the 7.62 for 20-22 inch barrels, thus providing supersonic velocities.
    5.56 is really good for hydraulic cavitation of organic entities, but loses effectiveness when the transit is not clear, with leaves or windage confounding.
    7.62 is superior for leafy shots or nontrivial windage, as well as superior materials defeat with respect to 5.56
    a taser like device cattle prod or EMP/microwave device should be in the lineup as well vs electronic hardening.
  - aduty3 hours ago
    Maybe Michael Reeves still has one. Or at least knows how they react to different calibers.
  - deet3 hours ago
    Perhaps not as evidence based as you'd like but this is a fun watch https://youtu.be/6MUrF_G7KlM (that is also an ad somehow)
  - taneq2 hours ago
    Fishing line at ankle height?
- rpcope13 hours ago
  Are we just talking shotguns or can it be anything they manufacture? Answer is probably Beretta though.
aussiegreenie4 hours ago
It is not running on either but Seedance, so who cares?