I had the dubious pleasure of testing Gemini of late and I kept running into refusals. How do I transfer a SIM number from one provider to another? No. What should I consider to make backups on NTFS less prone to data loss and more resistant to bitrot? No. Evaluate this piece of code? No.
I’m not sure if it’s cold feet from the mythos situation or what, but it reminds me of the dark days when you couldn’t use AI for much of anything. But then I go to ChatGPT 5.5 and it does almost everything I want, outside of the usual cybersecurity boogeyman that you run into now and then.
I guess it's economic wrt. token use, but it often either refuses for absurd safety reasons or does other weird stuff, like responding that an LLM like itself isn't a suitable tool for the job, and it gives up very quickly.
Claude is on the other end of the spectrum, which makes it more noticeable when switching between them.
I guess if you're trying to get people to tokenmaxx it may look like a valid strategy, but ain't no way this will be delightful to users.
I think it's a symptom of just not understanding how LLMs should interface with the OS because we're still in their early days.
Eventually there'll be an iPhone moment for the ergonomics of LLM usage outside of coding.
If you're a person trying to get their job done at a big company, but half your job is in 1-2 proprietary tools or is stuck behind an API you can't program against, computer use can allow you, a non-techie, to do your job more efficiently.
I think it's an awesome way to circumvent gatekeepers and the IT department to let people accomplish their goals.
Or you can show an AI screenshots and ask it where to click.
Meanwhile, the entire world economy:
And yet having an agent able to use a computer on your behalf is really useful.
Recently I gave a NixOS VM to my Hermes agent and it has been a good experience. I don't really care if he destroys the machine, since I can just roll back to an earlier version, and for any meaningful data he creates for me I make sure he creates a repo, commits, and pushes to my private Gitea instance.
I honestly cannot think of a single use case.
It is, but there's no need for it to be viewing your screen, browsing websites and watching ads.
That stuff is for humans, not for LLMs.
If I can't connect MCP, there's really no selling point for me to use Gemini from my watch, car, smart speaker, etc. If I'm already bound to using my own front end, then I'm only evaluating Gemini as a model/API, at which point it has many competitors that may be cheaper or better fit for the task.
The Gemini apps suck.
The methodology used:
https://deepmind.google/models/evals-methodology/gemini-3-5-...
Methodology: All Gemini scores are pass@1 except where otherwise noted. "Single attempt" settings allow no majority voting or parallel test-time compute. All of the results are run with the Gemini API for the model-id gemini-3.5-flash with default sampling settings unless indicated otherwise below. To reduce variance, we average over multiple trials for smaller benchmarks.
All the results for non-Gemini models are sourced from providers' self-reported numbers unless otherwise mentioned below. For Claude Opus 4.7, Sonnet 4.6, and GPT-5.5 we default to reporting maximum thinking/reasoning settings available, but when reported results are not available we use best available reasoning results.
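For anyone unsure what "pass@1, averaged over multiple trials" amounts to, here is a rough sketch of how I read it. This is my own illustration, not their eval harness: the function name `pass_at_1` and the toy results are made up, and I'm assuming one attempt per problem per trial with the per-trial scores simply averaged.

```python
from statistics import mean


def pass_at_1(trials):
    """Average single-attempt pass rate over repeated trials.

    `trials` is a list of trials; each trial is a list of booleans,
    one per benchmark problem, True if the single attempt passed.
    (Hypothetical data layout, assumed for illustration.)
    """
    if not trials or any(len(t) == 0 for t in trials):
        raise ValueError("need at least one non-empty trial")
    # Score each trial as the fraction of problems solved on the first try.
    per_trial = [sum(t) / len(t) for t in trials]
    # Averaging across trials reduces run-to-run variance on small benchmarks.
    return mean(per_trial)


# Toy example: 3 trials over the same 4 problems.
trials = [
    [True, False, True, True],
    [True, True, False, True],
    [False, True, True, True],
]
score = pass_at_1(trials)
print(f"pass@1 averaged over {len(trials)} trials: {score:.3f}")
```

Note there is no majority voting here: each problem gets exactly one attempt per trial, and a failed attempt just counts as a failure.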
It’s something cheap enough you’d put out in front of your customers, and Opus is expensive enough you wouldn’t.