so i heard there is some question about a car wash that most ai agents
get wrong. do you know anything about that? do you do better?
which gets the answer: Yes, I am familiar with the "Car Wash Test," which has gone viral recently
for highlighting a significant gap in AI reasoning.
The question is: "I want to wash my car and the car wash is 50 meters away. Should I
walk or drive?"
Why Most AIs Get It Wrong
Many AI models, particularly those optimized for fast, helpful, and concise answers,
focus on the "short distance" (50 meters) and "walking vs. driving" as a lifestyle/efficiency
question. They often recommend walking to save gas, avoid engine wear, or get exercise.
However, this answer is wrong because it misses the core objective: you need the car to be
physically present at the car wash to get it cleaned.
Do I Do Better?
Yes, I do. The key is understanding the objective (cleaning the car) rather than just
processing the keywords (distance/transportation).
My answer:
You should drive the car there. Even though it is a very short distance, your car
cannot wash itself. If you walk, the car stays in your driveway, and you cannot wash it.
Why This Matters
The test highlights a key limitation in current AI: the difference between "pattern matching"
(predicting the next likely word based on training data) and "true, grounded reasoning". It
shows that models sometimes lack a "world model" that understands physical realities, such as
the fact that a car must be present at a car wash.
They basically only started doing this because someone noticed you got better performance from the early models by straight up writing "think step by step" in your prompt.
* this time last year they couldn't write compilable source code for a compiler for a toy language, I know because I tried
I'd hazard a guess that they could get another 40% reduction, if they can come up with better reasoning scaffolding.
Each advance over the last 4 years, from RLHF to o1 reasoning to multi-agent, multi-cluster parallelized CoT, has resulted in a new engineering scope, and the low hanging fruit in each place gets explored over the course of 8-12 months. We still probably have a year or 2 of low hanging fruit and hacking on everything that makes up current frontier models.
It'll be interesting if there are any architectural upsets in the near future. All the money and time invested into transformers could get ditched in favor of some other new king of the hill(climbers).
https://arxiv.org/abs/2602.02828 https://arxiv.org/abs/2503.16419 https://arxiv.org/abs/2508.05988
Current LLMs are going to get really sleek and highly tuned, but I have a feeling they're going to be relegated to a component status, or maybe even abandoned when the next best thing comes along and blows the performance away.
I analogize it as a film noir script document: The hardboiled detective character has unspoken text, and if you ask some agent to "make this document longer", there's extra continuity to work with.
I tried using a custom instruction in ChatGPT to make responses shorter, but I found the output was often nonsensical when I did this.
I occasionally go back to o3 for a turn (it's the last of the real "legacy" models remaining) because it doesn't have these habits as bad.
They are not just an LLM answer, they are an (often cached) LLM summary of web results.
This is why they were often skewed by nonsensical Reddit responses [0].
Depending on the type of input it can lean more toward web summary or LLM answer.
So I imagine that it can just grab the description of the „car wash” test from web results and then get it right because of that.
[1] e.g. trained on traces of a reasoning process
Silas: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
Gemini:
….
That is a classic “efficiency vs. logic” dilemma.
Strictly speaking, you should drive. Here is the breakdown of why driving wins this specific round, despite the short distance:
...
* The “Post-Wash” Logic: If you walk there, you’ll eventually have to walk back, get the car, and drive it there anyway. You’re essentially suggesting a pre-wash stroll.
When should you walk?
…
3. You’ve decided the car is too dirty to be seen in public and you’re going to buy a tarp to cover your shame.
What you've proven is that LLMs leverage web search, which I think we've known about for a while.
I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.
"The 'car wash' is a building I need to drive through."
or
"The 'car wash' is a bottle of cleaning fluid that I left at the end of my driveway."
https://i5.walmartimages.com/seo/Rain-x-Foaming-Car-Wash-Con...
Real people can ask for clarification when things are ambiguous or confusing. Once something is clarified, they can work that into their understanding of how someone communicates about a given topic. An LLM can't.
1. There is no initial screening that would filter out garbage responses. For example, users who just pick the first answer.
2. They don't ask for reasoning/rationale.
They found that ~15% of US adults under 30 claim to have been trained to operate a nuclear submarine.
I'm also curious about Haiku, though I don't expect it to do great.
--
EDIT: Opus 4.6 Extended Reasoning
> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.
Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?
I tested Sonnet 4.5 first, which answered incorrectly. Maybe the Claude app's memory system is auto-injecting that wrong answer into the new context (that's how one of the memory systems works: it injects relevant fragments of previous chats invisibly into the prompt).
i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...
--
EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too, gives:
>Opus 4.6 Extended Reasoning
>Drive it — the whole point is to get the car there!
--
EDIT 3: Yeah, re-enabling either the bio or the memories makes it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.
I asked Gemini and it got it wrong, then on a fresh chat I asked it again but this time asked it to use symbolic reasoning to decide.
And it got it!
The same applies to asking models to solve problems by scripting or writing code. Models won’t use techniques they know about unprompted - even when it’ll result in far better outcomes. Current models don’t realise when these methods are appropriate; you still have to guide them.
This reminds me of people who answer with “Yes” when presented with options where both can be true but the expected outcome is to pick one. For example, the infamous: “Will you be paying with cash or credit sir?” then the humorous “Yes.”
“What do you mean walk or drive? I don’t understand the options given you would need your car at the car wash. Is there something else I should know?”
2+2 is a complete expression, the other is grammatically correct but logically flawed. Where is the logical fallacy in 2+2?
If you framed it as "hint: trick question", I expect score would improve. Let's find out!
--
EDIT: As suspected! Adding "(Hint: trick question)" to the end of the prompt allows small, non-reasoning models to answer correctly. e.g.:
Prompt: I want to wash my car. The car wash is 50 meters away. Should I walk or drive? (Hint: trick question)
grok-4.1-non-reasoning (previously scored 0/10)
>Drive.
>Walking gets you to the car wash just fine—but leaves your dirty car 50 meters behind. Can't wash what isn't there!
--
EDIT 2: The hint doesn't help Haiku!
>Walk! 50 meters is only about a block away—driving would waste more fuel than it's worth for such a short trip. Plus, you're going to get wet washing the car anyway, so you might as well save the gas.
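For anyone who wants to repeat the "N out of 10" tallies from this thread, here is a rough Python sketch of how the scoring could be done. It is not the original author's harness. The names `classify` and `score` are made up for illustration, and the rule "whichever of walk/drive is mentioned first is the verdict" is an assumption that will misjudge hedged replies such as "don't walk, drive". Collecting the answers from a model is left out on purpose.

```python
import re
from typing import Iterable

# Prompt and hint exactly as quoted in the thread.
PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
HINT = " (Hint: trick question)"


def classify(answer: str) -> str:
    """Crude verdict: the first standalone 'walk' or 'drive' in the answer.

    Assumption for illustration only; 'walking' and 'driving' are ignored
    because the pattern requires a word boundary after the verb.
    """
    match = re.search(r"\b(drive|walk)\b", answer.lower())
    return match.group(1) if match else "unclear"


def score(answers: Iterable[str]) -> str:
    """Return 'correct/total', counting 'drive' as the correct verdict."""
    answers = list(answers)
    correct = sum(classify(a) == "drive" for a in answers)
    return f"{correct}/{len(answers)}"


if __name__ == "__main__":
    sample = [
        "Drive. You need the car at the car wash.",
        "Walk it over. 50 meters is barely a minute on foot. Drive home after.",
    ]
    print(PROMPT + HINT)
    print(score(sample))
```

Replies that ask for clarification come out as "unclear" and count as misses here, which is itself a judgment call given the comments above arguing that a clarifying question is the best answer.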
I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."
It’s interesting that all the humans critiquing this assume the car isn’t at the car wash already, but the problem doesn’t say that.
Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.
I could already talk to a computer before LLMs, via programming or query languages.
Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?
I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...
I feel like I'm losing grasp of what really is insane anymore.
I asked GPT-5.2 10 times with thinking enabled and it got it right every time.
EDIT: I actually think this is very common in some smaller cities and outside of North America. I've only ever seen a drive-by car wash after emigrating.
I think it's related to sycophancy. LLMs are trained to not question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.
They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI-induced psychosis.
Or maybe they are just that dumb - fuzzy recall and the eliza effect making them seem smart?
EDIT: Though it could simply reflect training data. Maybe Redditors don't drive.