I've found that letting the agent write its own optimized script for common tasks can really help with this. Claude is now forbidden from using `gradlew` directly and can only use a helper script we made. It cleans, recompiles, publishes locally, tests, ... all with a few extra flags. And when a test fails, the stack trace is printed.
Before this, Claude had to do A TON of different calls, all messing up the context. And when tests failed, it started to read Gradle's generated HTML/XML files, which damaged the context immensely, since they contain a bunch of inline JavaScript.
And I've also been implementing this "LLM=true"-like behaviour in most of my applications. When an LLM is using it, logging is less verbose, it's also deduplicated so it doesn't show the same line a hundred times, ...
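A minimal sketch of what that "LLM=true" behaviour could look like (the env var name and log lines are hypothetical, not from any real tool): when the variable is set, consecutive duplicate log lines are collapsed into one line with a repeat count.

```shell
#!/bin/sh
# Hypothetical sketch: when LLM=true, collapse consecutive duplicate log
# lines into a single line with a repeat count, instead of repeating them.
dedupe() {
  # uniq -c prefixes each run of identical lines with its count;
  # awk strips the count back off and appends "(xN)" only when N > 1.
  uniq -c | awk '{n=$1; $1=""; sub(/^ /,""); if (n>1) print $0 " (x" n ")"; else print $0}'
}

log() {
  if [ "$LLM" = "true" ]; then dedupe; else cat; fi
}

LLM=true
printf '%s\n' "WARN retrying request" "WARN retrying request" "INFO done" | log
# prints:
#   WARN retrying request (x2)
#   INFO done
```

The same `log` filter passes output through untouched for humans, so nothing changes in the default interactive case.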
> He sees something goes wrong, but now he cut off the stacktraces by using tail, so he tries again using a bigger tail. Not satisfied with what he sees HE TRIES AGAIN with a bigger tail, and … you see the problem. It’s like a dog chasing its own tail.
I've had the same issue. Claude was running the 5+ minute test suite MULTIPLE TIMES in succession, just with a different `| grep something` tacked on at the end. Now, the scripts I made always log the entire (simplified) output and just print the path to the temporary file. This works so much better.
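The pattern above can be sketched in a few lines of shell (this is a hypothetical illustration, not the actual script from the comment): run the slow command once, capture everything to a temp file, and show only a short tail plus the path so the agent can grep the file instead of re-running the suite.

```shell
#!/bin/sh
# Hypothetical wrapper: run the real (slow) command once, log everything,
# print only a short tail plus the log path for later searching.
run_logged() {
  logfile=$(mktemp /tmp/test-run.XXXXXX)
  "$@" >"$logfile" 2>&1   # "$@" is the real command, e.g. ./gradlew test
  status=$?
  echo "--- last 5 lines ---"
  tail -n 5 "$logfile"
  echo "Full output: $logfile (exit $status)"
  return $status
}

# Stand-in for the real test suite:
run_logged sh -c 'echo compiling; echo testing; echo "tests passed"'
```

The agent then gets one short, stable summary per run, and can `grep`/`sed` the logged file for anything else without re-executing a 5-minute suite.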
I think my question at this point is what about this is specific to LLMs. Humans should not be forced to wade through reams of garbage output either.
There are attempts at effectively doing something similar with analysis passes over the context - roughly what things like auto-compaction are doing - but I'm sure anyone who has used the current generation of those tools will tell you they're very much imperfect.
For example, I quickly get bored looking through long logfiles for anomalies but an LLM can highlight those super quickly.
Beware I'm a complete AI layman. All this is from background reading of popular articles. It may well be wrong. It's definitely out of date.
It has to do with how the attention heads work. Attention heads (the idea originated in the "Attention Is All You Need" paper, arguably the single most important AI paper to date) direct the LLM to the most relevant parts of the conversation. If you want a human analogue, it's your attention heads that are tracking the interesting points in a conversation.
The original attention heads output a relevance score for every pair of words in the context window. Thus in "Time flies like an arrow", it's the attention heads that spot the word "Time" is very relevant to "arrow", but not "flies". The implication of this is an attention head does O(N*N) work. It does not scale well to large context windows.
Nonetheless, you see claims of "large" context windows in the LLM vendors' marketing. (Large is in quotes, because even a 1M context window begins to feel very cramped in a write/test/fix loop.) But a 1M context window would require an attention head with a 1-trillion-element matrix. That isn't feasible. The industry even has a name for the size of the window they give in their marketing: the effective context window. Internally they have another metric that measures the real amount of compute they throw at attention: the physical context window. The bridge between the two is some proprietary magic that discards tokens in the context window that are likely to be irrelevant. In my experience, that bridge is pretty good at doing that, where "pretty good" is by human standards.
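The arithmetic behind that trillion-element claim is straightforward (a back-of-envelope sketch, assuming fp16 scores and naive materialization of the score matrix):

```latex
\underbrace{N^2}_{\text{scores per head}}\Big|_{N = 10^6} = 10^{12}
\quad\Rightarrow\quad
10^{12} \times 2\,\text{bytes} \approx 2\,\text{TB per head, per layer}
```

Kernels in the style of FlashAttention avoid ever materializing the full matrix, but the $O(N^2)$ compute cost remains.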
But eventually (quickly, actually, in my experience), you fill even the marketed size of the context window, because it remembers every word said, in the order it was said. If the LLM reads code it has written in order to debug it, that code appears twice in the context window. All compiler and test output ends up there too. Once the context window fills up, they take drastic action, because it's like letting malloc fail: even reporting a malloc failure is hard, because the reporting usually needs more malloc. Anthropic calls it compacting. It throws away 90% of your tokens. It turns your helpful LLM into a goldfish with dementia. It is nowhere near as good as a human at remembering what happened. Not even close.
And because these issues are often sporadic, doing all this would be an unwanted sidequest, so humans grit their teeth and wade through the garbage manually each time.
With LLMs, the cost is effectively 0 compared to a human, so it doesn't matter. Have them write the script. In fact, because it benefits the LLM by reducing context pollution, which increases their accuracy, such measures should be actively identified and put in place.
Pi coding agent does this by default with all outputs, but Claude (all versions tested, including Opus 4.6) just completely ignores this capability. Even when the tool output explicitly tells the agent that the full output is saved in a particular file, Claude reruns the command.
Then when the script runs the output is put into a file, and the LLM can search that. Works like a charm.
~/agent-shims/mvn:
#!/bin/bash
echo "Usage of 'mvn' is forbidden. Use build.sh or run-tests.sh" >&2
exit 1
That way it is prevented from using the wrong tools, and can self-correct when it tries.

Not just context windows. Lots of that crap is completely useless for humans too. It's not a rare occurrence for warnings to be hidden in so much irrelevant output that they sit there for years before someone notices.
I don't disagree. In my opinion, the default log level for CLI applications should be WARN, showing errors and warnings. -q should turn this OFF (alternatively, -q for ERROR, and -qq for OFF), -v means INFO, -vv DEBUG, -vvv TRACE. For servers and daemons, the default should probably be INFO, but that's debatable.
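That convention is easy to implement; here's a minimal sketch (the function name and numeric level mapping are my own illustration, not an established standard): each `-v` bumps the level up from the WARN default, each `-q` bumps it down.

```shell
#!/bin/sh
# Hypothetical sketch of the -q/-v convention described above:
# default WARN; -q => ERROR, -qq => OFF; -v => INFO, -vv => DEBUG, -vvv => TRACE.
verbosity_level() {
  level=2  # 0=OFF 1=ERROR 2=WARN 3=INFO 4=DEBUG 5=TRACE
  for arg in "$@"; do
    case "$arg" in
      -v)   level=$((level + 1)) ;;
      -vv)  level=$((level + 2)) ;;
      -vvv) level=$((level + 3)) ;;
      -q)   level=$((level - 1)) ;;
      -qq)  level=$((level - 2)) ;;
    esac
  done
  [ "$level" -lt 0 ] && level=0
  [ "$level" -gt 5 ] && level=5
  set -- OFF ERROR WARN INFO DEBUG TRACE
  shift "$level"
  echo "$1"
}

verbosity_level          # prints WARN
verbosity_level -vv      # prints DEBUG
verbosity_level -qq      # prints OFF
```

A daemon would just start the counter at 3 (INFO) instead of 2 and keep the same flag semantics.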
I guess it comes down to a choice of printing out only relevant information. I hate noisy crap, like LaTeX.
BATCH=yes (default is no)
--batch (default is --no-batch)
for the unusual case when you do want the `route print` on a BGP router to actually dump 8 gigabytes of text over the next 2 minutes. Maybe it's fine if the default output for anything generously applies summarization, such as "X, Y, Z ...and 9 thousand+ similar entries".

Having two separate command names (one for human/LLM, one for batch) sucks.
Having `-h` for human, like ls or df do, sucks slightly less, but it is still a backward-compatibility hack which leads to `alias` proliferation and makes human lives worse.
As someone who loves coding pet projects but is not a software engineer by profession, I find the paradigm of maintaining all these config files and environment variables exhausting, and there seem to be more and more of them for any non-trivial projects.
Not only do I find it hard to remember which is which or to locate any specific setting, their mechanisms often feel mysterious too: I often have to manually test them to see if they actually work or how exactly. This is not the case for actual code, where I can understand the logic just by reading it, since it has a clearer flow.
And I just can’t make myself blindly copy other people's config/env files without knowing what each switch is doing. This makes building projects, and especially copying or imitating other people's projects, a frustrating experience.
How do you deal with this better, my fellow professionals?
I guess this happens when you're too deep in a topic and forget that eventually the overhead of maintaining the tooling outweighs the benefits. It's a curse of our profession. We build and automate things, so we naturally want to build and automate tooling for doing the things we do.
Inventing GraphQL and React and making your own PHP compiler are absolutely insane and obviously wrong decisions — for everyone who isn't Facebook. With Facebook's revenue and Facebook's army of resume-obsessed PHP monkeys, they strike me as elegant technological solutions to otherwise intractable organizational issues. Insane, but highly profitable and fast-moving. Outside of that context, using React should be addressing clear pain points, not a dogmatic default.
We're seeing some active pushback on it now online, but so much damage has been done. Embracing progressive complexity in web apps/sites would leave the majority barebones, with minimal if any JavaScript.
Facebook solutions for Facebook problems. Most of us can be deeply happy our 99 problems don’t include theirs, and live a simpler easier life.
I know this is very 20th century, but it helps a lot to understand how everything fits together and to remember what each tool does in a complex stack.
Documentation is not always perfect or complete, but it makes it much easier to find parameters in config files and know which ones to tweak.
And when the documentation falls short, the old adage applies: "Use the source, Luke."
The only boilerplate files you need in a JS repo root are gitignore, package.json, package-lock.json and optionally tsconfig if you're using TS.
A Node.js project shouldn't require a build step, and most websites can get away with a single build.js that calls your bundler (esbuild) and copies some static files to dist/.
Then don’t.
> How do you deal with this better, my fellow professionals?
By not doing it.
Look, it’s your project. Why are you frustrating yourself? What you do is you set up your environment, your configuration, what you need/understand/prefer and that’s it. You’ll find out what those are as you go along. If you need, document each line as you add it. Don’t complicate it.
They do an excellent job of reading documentation and searching, to pick out and filter the config you might actually care about.
After decades of maintaining them myself, this was a huge breath of fresh air for me.
It could depend on what you're doing, but if it's not for work the config hell is probably optional.
https://nigeltao.github.io/blog/2021/json-with-commas-commen...
There are a lot of implementations of all of these, such as https://github.com/tailscale/hujson
- it's weird and unfamiliar, most people prefer plain JSON
- there are too many competing standards to choose from
- most existing tools just use plain JSON (sometimes with support for non-standard features, like tsconfig allowing trailing commas, but usually poorly documented and unreliable)
Much easier just to make the leap to .ts files, which are ergonomically better in almost every way anyway.
It's making what is likely to be a permanent change to fix a temporary problem.
I think the thing that would have value in the long term is an option to be concise, accurate, and unambiguous.
This isn't something that should be considered only for LLMs. Sometimes humans want readability to understand something quickly, and adding context helps a great deal there; but sometimes accuracy and unambiguity are paramount (like when doing an audit), and if you're dealing with a batch of similar things, the same repeated context adds nothing and limits how much you can see at once.
So there can be a benefit when a human can request output like this to read directly. On top of this is the broad range of output-processing tools that we have (some people still awk).
So yes, this is needed, but LLMs will probably not need it in a few years. The other uses will remain.
Secondly, a helper to capture output and cache it — frankly, a tool (or just options to the regular shell/bash tools) to cache output and allow filtered retrieval of it. More than the context and tokens, the frustration I have with the patterns shown is that the agent will often re-execute time-consuming tasks just to retrieve a different set of lines from the output.
A lot of the time it might even be best to run the tool with verbose output, but it'd be nice if tools had a more uniform way of giving output that was easier to systematically filter to essentials on first run (while caching the rest).
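A crude sketch of that cache-and-retrieve idea (the `cached` helper and its cache layout are hypothetical, just to illustrate the shape): key the cache on the command line, run the command once, and serve later line-range or grep retrievals from the cached file.

```shell
#!/bin/sh
# Hypothetical output cache: run a command once, keyed on its command line,
# then answer later "retrievals" from the cached file instead of re-running.
CACHE_DIR=${CACHE_DIR:-/tmp/cmd-cache}
mkdir -p "$CACHE_DIR"

cached() {
  key=$(printf '%s' "$*" | cksum | awk '{print $1}')  # cheap stable key
  file="$CACHE_DIR/$key"
  if [ ! -f "$file" ]; then
    "$@" >"$file" 2>&1   # the one (slow) real execution
  fi
  echo "$file"
}

f=$(cached sh -c 'seq 1 100')   # stand-in for a slow command
sed -n '40,42p' "$f"            # retrieve lines 40-42 without re-running
# prints:
#   40
#   41
#   42
```

An agent asking for "a different set of lines" then costs one `sed` over a file, not another multi-minute run.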
Any special accommodations you make for LLMs are either a) also good for humans, or b) more trouble than they're worth.
It would be nice for both LLMs and humans to have a tool that hides verbose tool output, but still lets you go back and inspect it if there's a problem. Although in practice as a human I just minimise the terminal and ignore the spam until it finishes. Maybe LLMs just need their own equivalent of that, rather than always being hooked up directly to the stdout firehose.
A lot of the time this behaviour is probably right, but it's annoyingly hard to steer it to handle this correctly. I've had it do this even with make targets where the Makefile itself makes clear that the dependency tracking means it could trust the cached (in a file) results if it just ran make <target>. Instead, I regularly find it reading the Makefile and running the commands manually, working around the dependency management.
I wrote a small game for my dev team to experience what it’s like interacting through these painful interfaces over the summer www.youareanagent.app
Jump to the agentic coding level or the mcp level to experience true frustration (call it empathy). I also wrote up a lot more thinking here www.robkopel.me/field-notes/ax-agent-experience/
I've seen projects with an empty README and a very extensive CLAUDE.md (or equivalent).
This is understandable logic, but at a systemic level it's not how things always go. Increasing efficiency can lead to increased consumption overall. You might save 50% in energy for your workload, but maybe now you can run it 3 times as much, or maybe 3 times more people will use it, because it's cheaper. The result might be a 50% INCREASE in energy consumed.
So yes, lower LLM costs would probably lead to even more LLM usage and greater energy expenditure, but then again, so does having a moving economy, and all that comes with that.
I think it’s much simpler & easier to just build this into agents than trying to modify every tool ever created to be less verbose. Just guard agents from it user-side. Let users control what they want to see and pass into context.
When the first transformers that could do more than poetry or rough translation appeared, everybody noticed their flaws, but I observed that even a dumb LLM (or one smart enough to be dangerous?) could be useful in regularizing parameter conventions. I would ask an LLM how to do this or that, and it would "helpfully" generate non-functional command invocations that otherwise appeared very conformant — to the point that sometimes my opinion was that, even though the invocation was wrong given the current calling convention for a specific tool, the tool would actually be improved if it accepted that human-machine ABI or calling convention.
Now let us take the example of man vs info. I am not proposing to let AI decide we should all settle on man, nor do I propose to let AI decide we should all use info instead; but with AI we could have the documentation made whole in the missing half, and then it's up to the user whether they prefer man or info to fetch the documentation of that tool.
Similarly for calling conventions: we could ask LLMs to analyze command calling conventions and parameter styles and then find one or more canonical ways to communicate them, perhaps consulting an environment variable to figure out which calling convention the user declares.
https://x.com/ProfRobAnderson/status/2019078989348774129
> Indeed hallucinated cases are "better law." Drawing on Ronald Dworkin's theory of law as integrity, which posits that ideal legal decisions must "fit" existing precedents while advancing principled justice, this article argues that these hallucinations represent emergent normative ideals. AI models, trained on vast corpora of real case law, synthesize patterns to produce rulings that optimally align with underlying legal principles, filling gaps in the doctrinal landscape. Rather than errors, they embody the "cases that should exist," reflecting a Hercules-like judge's holistic interpretation.
https://www.stainless.com/blog/stainless-cli-generator-your-...
Just the (small) probability of this being true might be enough for the big players to not consider creating that var. (Although, if it's easy enough to unset it, then maybe not an issue).
It wouldn't even need to send the full output to make a decision, it could just send "npm run build output 500 lines and succeeded, do we need to read the output?" and based on the rest of the conversation the LLM can respond yes or no.
Removes all the fluff around commands that agents use frequently.
> Error: API rate limit exceeded for app ID 7cc6c241b6e6762bf384. If you reach out to GitHub Support for help, please include the request ID E9FC:7BEBA:6CDB3B4:6485458:699EE247 and timestamp 2026-02-25 11:51:35 UTC. For more on scraping GitHub and how it may affect your rights, please review our Terms of Service (https://docs.github.com/en/site-policy/github-terms/github-t...).
I see some good research being done on how to allow LLMs to manage their own context. Most importantly, to remove things from their context but still allow subsequent search/retrieval.
The best friend isn't a dog, but the family that you build. Wife/Husband/kids. Those are going to be your best friends for life.
It's worth noting that just by setting the right tone of voice, choosing the right words, and instructing it to be concise and surgical in what it says and writes, things change drastically - like night and day.
It then starts obeying, CRITICALs are barely needed anymore and the docs it produces are tidy and pretty.
I have a solution to all this of course but why should I tell anyone.
Also, I just restart when the context window starts filling up. Small focused changes work better anyway, IMO, than single god-prompts that try to do everything but eventually exceed context and capability...
Please don't overload that term with trendy LLM products. You can use the full name.
Both CC and cc refer to the C compiler, in slightly different ways.
The OS knows (it has to because it set up the pipeline), and the process can find out through a system call, exposed in C as `isatty`: https://www.man7.org/linux/man-pages/man3/isatty.3.html
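In shell scripts the same check is exposed as `test -t FD` (a POSIX builtin); here's a tiny sketch where stdin is deliberately a pipe, so the check reports non-interactive:

```shell
#!/bin/sh
# The shell-level counterpart of isatty(3): `test -t FD` returns true
# when file descriptor FD is attached to a terminal.
check_stdin() {
  if [ -t 0 ]; then
    echo "stdin is a TTY"
  else
    echo "stdin is a pipe or redirect"
  fi
}

echo "some input" | check_stdin   # prints "stdin is a pipe or redirect"
```

This is exactly why `ls | cat` produces one-entry-per-line output while plain `ls` columnizes: the program queries the descriptor, not an env var.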
> This solution seems like it would fit the problem from the article?
Might not be a great idea. The world is probably already full of build-tool pipelines that expect to process the normal terminal output (maybe with colours stripped). Environment variables like `CI` exist for a reason.
https://man.openbsd.org/man3/ttyname.3
I believe most standard libraries have a version.
if tty -s; then
echo "Standard input is a TTY (interactive mode)."
else
echo "Standard input is not a TTY (e.g., piped or redirected)."
fi
Now I wonder how _isatty_ itself detects whether a file descriptor is associated with a terminal!
https://github.com/openbsd/src/blob/master/lib/libc/gen/isat...
https://github.com/openbsd/src/blob/master/sys/sys/fcntl.h
https://github.com/openbsd/src/blob/ba496e5267528b649ec87212...
https://github.com/openbsd/src/blob/ba496e5267528b649ec87212...
This avoids having to update everything to support LLM=true and keeps your current context window free of noise.
There :)
Of course you can combine both approaches for even greater gains. But Claude Code and like five alternatives gaining an efficient tool-calling paradigm where console output is interpreted by Haiku instead of Opus seems like a much quicker win than adding an LLM env flag to every cli tool under the sun
Edit: Just remembered that sometimes, I see claude running the build step in two terminals, side-by-side at nearly the same time :D
Why was the output so verbose in the first place then?
LLMs (Claude Code in particular) will explicitly create token-intensive steps, plans, and responses - "just to be sure", "need to check", "verify no leftovers" - will do git diff even though not asked to, create Python scripts for simple tasks, etc. Absolutely no cache (except the memory feature, which is meh) nor indexing whatsoever.
The Pro plan for 20 bucks per month is essentially worthless, and because of this we are entering a new era - the era of the $100+ monthly single subscription being normal and natural.
I'm on the Pro plan. If I run out of tokens, which has only happened 2 or 3 times in months of use, I just work on something else that Claude can't do, or ...write the code myself.
You do have to keep a close eye on it, but I would be doing that anyway given that if it goes haywire it's wasting my time as well as tokens. I'd rather spend an extra minute writing a clearer prompt telling it exactly what I want it to do, than waste time on a slot machine.
Most of these things can be avoided with a customized CLAUDE.md.
P.S. CLAUDE.md is sometimes useful, but it's yet another token drain, especially since it tends to keep growing.
Another thing that helps is using plan mode first, since you can more or less see how it's going to proceed and steer it beforehand.
So much content about furnishing the Markdown and whatnot for your bots. But content is content?
In general I think good DevEx needs to be dialed to 11 for successful agentic coding. Clean software architecture and interfaces, good docs, etc. are all extremely valuable for LLMs because any bit of confusion, weird patterns or inconsistency can be learned by a human over time as a "quirk" of the code base. But for LLMs that don't have memory they are utterly confusing and will lead the agent down the wrong path eventually.
I also like a discussion in this thread: using custom tools to reduce the frequency of tool calls in general, that is, write tool wrappers specific for your applications or agents.
I get the article's overall point, but if we're looking to optimise processing and reduce costs, then 'only using agents for things that benefit from using agents' seems like an immediate win.
You don't need an agent for simple, well-understood commands. Use them for things where the complexity/cost is worth it.
That's an extremely simple solution. I don't see the point in this LLM=true bullshit.