You can work around a lot of the memory issues for large and complex tasks just by making the agent keep work logs. Critical context to keep throughout large pieces of work includes decisions, conversations, investigations, plans, and implementations - a normal developer should be tracking these, and it's sensible to have the agent track them too, in a way that survives compaction.
None, that's what I'm trying to say. My favorite is just storing project context locally in docs that agents can discover on their own or that I can point to if needed. This doesn't require me to upload sensitive code or information to anonymous people's side projects, and it has an equivalent amount of hard evidence for efficacy (zero), but at least it has my own anecdotal evidence of helping and doesn't invite additional security risk.
People go way overboard with MCPs and armies of subagents built on wishes and unproven memory systems because no one really knows for sure how to get past the spot we all hit where the agentic project that was progressing perfectly hits a sharp downtrend in progress. Doesn't mean it's time to send our data to strangers.
All of these systems are for managing context.
You can generally tell which ones are actually doing something by whether they use skills with actual programs in them.
Because then, you're actually attaching some sort of feature to the system.
Otherwise, you're just feeding in different prompts and steps, which can add some value, but okay, it doesn't take much to do that.
Like adding image generation to Claude Code with Google's Nano Banana via a Python script that does it.
That's actually adding something claude code doesn't have, instead of just saying "You are an expert in blah"
Another is one Claude Code ships with: ripgrep.
Those are actual features. It's adding deterministic programs that the llm calls when it needs something.
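For concreteness, here's a minimal sketch of what that kind of script could look like, assuming the google-genai Python SDK and an API key in the environment; the model name and file handling are illustrative, not the exact script I use:

```python
# Hedged sketch: a small script Claude Code can shell out to for image generation.
# Assumes the google-genai SDK and an API key in the environment; the model name
# is illustrative and may differ from whatever "nano banana" is called today.
import sys
from google import genai

def generate_image(prompt: str, out_path: str) -> None:
    client = genai.Client()  # picks up the API key from the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash-image",  # swap in whatever image model you use
        contents=prompt,
    )
    # Save the first inline image payload the model returns.
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            with open(out_path, "wb") as f:
                f.write(part.inline_data.data)
            return
    raise RuntimeError("no image data returned")

if __name__ == "__main__":
    generate_image(sys.argv[1], sys.argv[2])
```

The point is that the LLM never improvises the API call; it just runs the script and gets a file path back.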
So a better question to ask is: do you have any ideas for an objective way to measure the performance of agentic coding tools, so we can truly determine what improves performance and what doesn't?
I would hope that internal to OpenAI and Anthropic they use something similar to the harness/test cases they use for training their full models to determine if changes to claude code result in better performance.
I’m not sure where the ‘despite’ comes in. Experts and vets have opinions and this is probably the best online forum to express them. Lots of experts and vets also dislike extremely popular unrelated tools like VB, Windows, “no-code” systems, and Google web search… it’s not a personality flaw. It doesn’t automatically mean they’re right, either, but ‘expert’ and ‘vet’ are earned statuses, and that means something. We’ve seen trends come and go and empires rise and fall, and been repeatedly showered in the related hype/PR/FUD. Not reflexively embracing everything that some critical mass of other people like is totally fine.
Just their thought-management git system works pretty well for me, TBH. https://www.humanlayer.dev/
Otherwise, for the ability to search back through history, a simple git log/diff or (rip)grep/jq combo over the session directory is valuable. Simple example of mine: https://github.com/backnotprop/rg_history
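For anyone who'd rather not chain shell tools, a rough Python equivalent (the session path and record fields are assumptions about Claude Code's current on-disk format, so treat this as a sketch):

```python
# Rough Python version of the same idea: grep past Claude Code sessions for a term.
# Assumes transcripts live as JSONL under ~/.claude/projects/; the path and the
# record fields read below are assumptions, not a documented contract.
import json
import sys
from pathlib import Path

def search_sessions(term: str):
    for session in Path.home().glob(".claude/projects/*/*.jsonl"):
        for line in session.read_text(errors="ignore").splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue
            if term.lower() in json.dumps(record).lower():
                yield session.name, record.get("type"), record.get("timestamp")

if __name__ == "__main__":
    for hit in search_sessions(sys.argv[1]):
        print(*hit, sep="\t")
```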
I feel that way too. I have a lot of these things.
But the reality is, it doesn't really happen that often in my actual experience. Everyone is very slow, as a whole, to understand what these things mean, so for now you get quite a bit of lead time just with an improved, customized system of your own.
https://backnotprop.com/blog/50-first-dates-with-mr-meeseeks...
So... it's tough. I think memory abstractions are generally a mistake and generally not needed; however, I also think compacting has gotten so bad recently that they're required until Claude Code releases a version with improved compacting.
But I don't do memory abstraction like this at all. I use skills to manage plans, and the plans are the memory abstraction.
But that is more than memory. That is also about having a detailed set of things that must occur.
I think planning is a critical part of the process. I just built https://github.com/backnotprop/plannotator for a simple UX enhancement
Before planning mode I used to write plans to a folder with descriptive file names. A simple ls was a nice memory refresher for the agent.
I am working alone. So I am instead having plans automatically update. Same conception, but without a human in the mix.
But I am utilizing skills heavily here. I also have a python script which manages how the LLM calls the plans so it's all deterministic. It happens the same way every time.
That's my big push right now. Every single thing I do, I try to make as much of it as deterministic as possible.
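As a hypothetical illustration of what I mean by deterministic (not my actual script; the plans directory and output format are made up for the example), a plan loader can be as dumb as this, and that's the point:

```python
# Hypothetical plan loader a skill can shell out to. Same inputs, same output,
# no model judgment about which file to read. Directory and format are made up.
import sys
from pathlib import Path

PLANS_DIR = Path("docs/plans")  # illustrative location for plan markdown files

def active_plan() -> str:
    """Return an index of all plans plus the newest plan verbatim."""
    plans = sorted(PLANS_DIR.glob("*.md"), key=lambda p: p.stat().st_mtime)
    if not plans:
        return "No plans found."
    index = "\n".join(f"- {p.name}" for p in plans)
    newest = plans[-1]
    return f"# Plan index\n{index}\n\n# Active plan: {newest.name}\n{newest.read_text()}"

if __name__ == "__main__":
    sys.stdout.write(active_plan())
```

The skill just tells the model to run the script; what comes back is decided by the script, the same way every time.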
My approach is literally just a top-level, local, git version controlled memory system with 3 commands:
- /handoff - End of session, capture into an inbox.md
- /sync - Route inbox.md to custom organised markdown files
- /engineering (or /projects, /tasks, /research) - Load context into next session
I didn't want a database or an MCP server or embeddings or auto-indexing when I can build something frictionless that works with git and markdown.
Repo: https://github.com/ossa-ma/double (just published it publicly, but it's about the idea imo)
I will typically make multiple /handoff's per day as I use Claude Code, whereas I typically use /sync at the end of the day to organise them all at once.
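The commands themselves are just markdown prompts, but the layout underneath is simple enough to sketch in a few lines of Python (the memory/ directory, file names, and topics are assumptions for illustration, not necessarily the repo's exact structure):

```python
# Illustration only: what /handoff and /sync boil down to on disk.
# The memory/ directory, file names, and topic list are assumed for the example.
from datetime import date
from pathlib import Path

MEMORY = Path("memory")                  # git-tracked folder of markdown files
INBOX = MEMORY / "inbox.md"
TOPICS = {"engineering", "projects", "tasks", "research"}

def handoff(summary: str) -> None:
    """Append a dated end-of-session capture to the inbox (the /handoff step)."""
    MEMORY.mkdir(exist_ok=True)
    with INBOX.open("a") as f:
        f.write(f"\n## {date.today().isoformat()}\n{summary}\n")

def sync(topic: str, text: str) -> None:
    """Route a chunk of the inbox into its topic file (the /sync step)."""
    if topic not in TOPICS:
        raise ValueError(f"unknown topic: {topic}")
    with (MEMORY / f"{topic}.md").open("a") as f:
        f.write(f"\n{text}\n")
```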
I think at this point in time, we both have it right.
We use Cursor where I work and I find it a good medium for still being in control and knowing what is happening with all of the changes being reviewed in an IDE. Claude feels more like a black box, and one with so many options that it's just overwhelming, yet I continue to try and figure out the best way to use it for my personal projects.
Claude code suffers from initial decision fatigue in my opinion.
Agree with the other comments: pretty much running vanilla everything and only the Playwright MCP (IMO way better than the native chrome integration) and ccstatusline (for fun). Subagents can be as simple as saying "do X task(s) with subagent(s)". Skills are just self @-ing markdown files.
Two of the most important things are 1) maintaining a short (<250 lines) CLAUDE.md and 2) having a /scratch directory where the agent can write one-off scripts to do whatever it needs to.
This helps it organize temporary things it does like debugging scripts and lets it (or me) reference/build on them later, without filling the context window. Nothing fancy, just a bit of organization that collects in a repo (Git ignored)
I've TL'd and PM'd as well as IC'd. Now my IC work feels a lot more like a cross between being a TL and being a senior with a handful of exuberant and reasonably competent juniors. Lots of reviewing, but still having to get into the weeds quickly and then get out of their way.
Things that need special settings now won’t in the future and vice versa.
It’s not worth investing a bunch of time into learning features and prompting tricks that will be obsoleted soon
They do get better, but not enough to change any of the configuration I have.
But you are correct, there is a real possibility that the time invested will be obsolete at some point.
For sure, the work toward MCPs is basically obsolete via skills. These things happen.
how would that be a "skill"? just wrap the mcp in a cli?
FWIW this may be a skill issue, pun intended, but I can't seem to get Claude to trigger skills, whereas it reaches for MCPs more... I wonder if I'm missing something. I'm plenty productive in Claude though.
So a Skill is just a smaller granularity level of that concept. It's just one of the individual things an MCP can do.
This is about context management at some level. When you need to do a single thing within that full list of potential things, you don't need the instructions about a ton of other unrelated things in the context.
So it's just not that deep. It would be having a Python script or whatever that the skill calls, which returns the runtime dependencies and gives them back to the LLM so it can refactor without blindly grepping.
Does that make sense?
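A hedged sketch of the kind of script such a skill might call, assuming a Python codebase (the import matching is deliberately naive and purely illustrative):

```python
# Sketch: report which files import a given module, so the model gets an exact
# list of dependents instead of grep guesses. Naive matching, for illustration.
import ast
import sys
from pathlib import Path

def dependents(module: str, root: str = ".") -> list[str]:
    hits = []
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text())
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            names = []
            if isinstance(node, ast.Import):
                names = [a.name for a in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            if any(n == module or n.startswith(module + ".") for n in names):
                hits.append(str(path))
                break
    return hits

if __name__ == "__main__":
    print("\n".join(dependents(sys.argv[1])))
```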
It's always interesting reading other people's approaches, because I just find them all so very different than my experience.
I need Agents, and Skills to perform well.
I agree that this level of fine-tuning feels overwhelming and might leave you doubting whether you're utilizing Claude to its optimum. The beauty is that fine-tuning and macro usage don't interfere when you stay in your lane.
For example, I don't use the planning agent anymore; instead I incorporated that process into the normal agents, much to the project's advantage. Consistency is key. Anthropic did the right thing.
Codex is quite a different beast and comes from the opposite direction, so to speak.
I use both, Codex and Claude Opus especially, in my daily work and have found them complementary, not mutually exclusive. It's like two different evangelists who are on par, exercising with different tools to achieve a goal that both share.
It's also deeply interesting because it's essentially unsolved space. It's the same excitement as the beginning of the internet.
None of us know what the answers will be.
1. Current directory ./CLAUDE.md
2. User directory ~/.claude/CLAUDE.md
I stick general preferences in what it calls "user memory" and stick project-specific preferences in the working directory. I've been trying to write blogs explaining it recently, but I don't think I'm very good at making it sound interesting to people.
What can I explain that you would be interested in?
Here was my latest attempt today.
https://vexjoy.com/posts/everything-that-can-be-deterministi...
Here is what I don't get: it's trivial to do this. Mine is of course customized to me and what I do.
The idea is to communicate the ideas, so you can use them in your own setup.
It's trivial to put, for example, my do router blog post into Claude Code and generate one customized for you.
So what does it matter to see my exact version?
These are the type of things I don't get. If I give you my details, it's less approachable for sure.
The most approachable thing I could do would be to release individual skills.
Like I have skills for generating images with google nano banana. That would be approachable and easy.
But it doesn't communicate the why. I'm trying to communicate the why.
When you've tried 10 ways of doing it but they all end up in a "feed the error back into the LLM and see what it suggests next" loop, you aren't that motivated to put much effort into trying out an 11th.
The current state of things is extremely useful for a lot of things already.
I'm not sure if the problems you run into with using LLMs will be solved if you do it my way. My problems are solved doing it my way. If I heard more about your problems, I would have a specific answer to them.
These are the solutions to where I have run into issues.
For sure, but my solutions are not feed the error back into the LLM. My solutions are varied, but as the blog shows, they are move as much as possible into scripts, and deterministic solutions, and keep the LLM to the smallest possible scope.
The current state of things is extremely useful for a subset of things. That subset of things feels small to me. But it may be every thing a certain person wants to do exists in that subset of things.
It just depends. We're all doing radically different things, and trying very different things.
I certainly understand and appreciate your perspective.
My basic problem is: "first-run" LLM agent output frequently does one or more of the following: fails to compile/run, fails existing test coverage, or fails manual verification. The first two steps have been pretty well automated by agents: inspect output, try to fix, re-run. IME this works really well for things like Python, less-well for things like certain Rust edge cases around lifetimes and such, or goroutine coordination, which require a different sort of reasoning than "typical" procedural programming.
But let's assume that the agents get even better at figuring out the deal with the more specialized languages/features and are able to iterate w/o interaction to fix things.
If the first-pass output still has issues, I still have concerns. They aren't "I'm not going to use these tools" concerns, because I also sometimes write bugs, and they can write the vast majority of code faster than I can.
But they are "I'm not gonna vibe-code my day job" concerns because the existence of trivially-catchable issues suggests that there's likely harder-to-catch issues that will need manual review to make sure (a) test coverage is sufficient, (b) the mental model being implemented is correct, (c) the outside world is interacted with correctly. And I still find bugs in these areas that I have to fix manually.
This all adds up to "these tools save me 20-30% of my time" (the first-draft coding) vs "these agents save me 90% of my time."
So I'm kinda at a plateau for a few months where it'll be hard to convince me to try new things to try to close that 20-30% -> 90% number.
The real issue is I don’t know the issues ahead of time. So each experience is an iteration stopping things I didn’t know would happen.
Thankfully, I’m not trying to sell anyone anything. I don’t even want people to use what I use. I only want people to understand the why of what I do, and how it adds me value.
I think it’s important to understand this thing we use as best we can.
The personal value you can get, is entirely up to your tolerance for it.
I just enjoy the process
For large codebases (my own has 500k lines and my company has a few tens of millions) you need something better like RPI.
If nothing else just being able to understand code questions basically instantly should give you a large speed up, even without any fancy stuff.
In some sense, computers and digital things have now just become a part of reality, blending in by force.
But the things I am doing might not be the things you are doing.
If you want proof, I intend to release a game to the App Store and steam soon. At that point you can judge if it built a thing adequately.
I hope you're just one of the ones who figured it out early and all the hype isn't fake bullshit. I'd much rather be proven wrong than for humanity to have wasted all this time and resources.
I think of this stuff as trivial to understand from my point of view. I am trying to share that.
I have nothing to sell, I don’t want anyone to use my exact setup.
I just want to communicate the value as I see it, and be understood.
The vast majority of it all is complete bullshit, so of course I am not offended that I may sound like 1000 other people trying to get you to download my awesome Claude Code Plugins repo.
Except I’m not actually providing one lol
the docs if you are curious: https://www.ensue-network.ai/docs
Consider more when you're 50+ hours in and understand what more you want.
The PMs were right all along!
But imagine how hard it would be if these kids had short term memory only and they would not know what to focus on except what you tell them to. You literally have to tell them "Here is A-Z pay attention to 'X' only and go do your thing". Add in other managers for this party like a caterer, clowns, your spouse and they also have to tell them that and remember, communicate what other managers have done. No one has solved for this, really.
This is what it felt like in 2025 to code with LLMs on non-trivial projects, with somewhat of an improvement as the year went by. But I am not sure much progress was made in fixing the process part of the problem.
The project is still in alpha, so you could shape what we build next - what do you need to see, or what gets you comfortable sending proprietary code to other external services?
Honestly? It just has to be local.
At work, we have contracts with OpenAI, Anthropic, and Google with isolated/private hosting requirements, coupled with internal, custom, private API endpoints that enforce our enterprise constraints. Those endpoints perform extensive logging of everything, and reject calls that contain even small portions of code if it's identified as belonging to a secret/critical project.
There's just no way we're going to negotiate, pay for, and build something like that for every possible small AI tooling vendor.
And at home, I feed AI a ton of personal/private information, even when just writing software for my own use. I also give the AI relatively wide latitude to vibe-code and execute things. The level of trust I need in external services that insert themselves in that loop is very high. I'm just not going to insert a hard dependency on an external service like this -- and that's putting aside the whole "could disappear / raise prices / enshittify at any time" aspect of relying on a cloud provider.
I’m never stopped and Claude always remembers what we’re doing.
This pattern has been highly productive for 8 months.
Quite a few of you have mentioned that you store a lot of your working context across sessions in some md file - what are you actually storing? What data do you actually go back to and refer to as you're building?
1a directly from Anthropic on agentic coding and Claude Code best practices.
"Create CLAUDE.md files"
https://www.anthropic.com/engineering/claude-code-best-pract...
It works great. You can put anything you want in there. Coding style, architecture guidelines, project explanation.
Anything the agent needs to know to work properly with your code base. Similar to an onboarding document.
Tools (Claude Code CLI, extensions) will pick them up hierarchically too if you want to get more specific about one subdirectory in your project.
AGENTS.md is similar for other AI agents (OpenAI Codex is one). It doesn't even have to be those - you can just @ the filename at the start of the chat and that information goes in the context.
The naming scheme just allows for it to be automatic.
Combined with a good AGENTS.md, it seems to be working really well.
If you're using them though, we no longer have the problem of Claude forgetting things.
Agents are an md file with instructions.
Skills are an md file with instructions.
Commands are.. you get the point.
We're just dealing with instructions. CLAUDE.md is handled by Claude Code. It is often almost entirely forgotten when the context fills.
Okay, what is an agent? An agent is basically a CLAUDE.md file, but you make it extremely granular. So it only has instructions for, let's say, TypeScript.
We're all just doing context management here. We're trying to make sure our instructions that matter stay.
To do that, we have to remove all other instructions from the picture.
When you're doing TypeScript, you only know TypeScript things.
Okay, what's a skill? A skill is doing a single thing with TypeScript. Why? So that the context is even smaller.
Instead of the agent having every single instruction you need about TypeScript, you put them in skills so they only get put into context when that thing is needed.
But skills are also where you connect deterministic programs. For example, I have a skill for creating images in nano banana.
So when the Typescript Agent needs to create an image, it calls the skill, that calls the python script, to create images in nano banana.
We're managing all the context to only be available when it's needed, keeping all other instructions out.
Does that help?
Though I have found that a repo-level CLAUDE.md that is updated every time Claude makes a mistake, plus using --restore to select a previous relevant session, works well.
There is no way for Anthropic to optimize Claude code or the underlying models for these custom setups. So it’s probably better to stick with the patterns Anthropic engineers use internally.
And also - I genuinely worry about vendor lock-in, do you?
I run it in automatic mode with decent namespacing, so thoughts, notes, and whole conversations just accumulate in a structured way. As I work, it stores the session and builds small semantic, entity-based hypergraphs of what I was thinking about.
Later I’ll come back and ask things like:
what was I actually trying to fix here?
what research threads exist already?
where did my reasoning drift?
Sometimes I’ll even ask Claude to reflect on its own reasoning in a past session and point out where it was being reactive or missed connections.
Claude itself can just update the claude.md file with whatever you might have forgot to put in there.
You can stick it in git and it lives with the project.
I work primarily in Python and maintain extensive coding conventions there - patterns allowed/forbidden, preferred libs, error handling, etc. Custom slash commands like `/use-recommended-python` (loads my curated libs: pendulum over datetime, httpx over requests) and `/find-reinvented-the-wheel` to catch when Claude ignored existing utilities.
My use case: multiple smaller Python projects (similar to steipete's workflow https://github.com/steipete), so cross-project consistency matters more than single-codebase context.
Yes, ~15k tokens for CLAUDE.md + rules. I sacrifice context for consistency. Worth it.
Also baked in my dev philosophy: Carmack-style - make it work first, then fast. Otherwise Claude over-optimizes prematurely.
These memory abstractions are too complicated for me and too inconsistent in practice. I'd rather maintain a living document I control and constantly refine.
I’ll give this a go though and let you know!
Or, over continuing the same session and compacting?
Each time an LLM looks at my project, it's like a newcomer has arrived. If it keeps repeating mistakes, it's because my project sucks.
It's a unique opportunity. You can have lots of repeated feedback from "infinite newcomers" to a project, each of their failures an opportunity to make things clearer. Better docs (for humans, no machine-specific hacks), better conventions, better examples, more intuitive code.
That, in my opinion, is how markdown (for machines only and not humans) will fall. There will be a breed of projects that thrives with minimal machine-specific context.
For example, if my project uses MIDI, I'm much better doing some specialized tools and examples that introduce MIDI to newcomers (machines and humans alike) than writing extensive "skill documents" that explain what MIDI is and how it works.
Think like a human does. Do you prefer being introduced to a codebase by reading lots of verbose docs, or by having some ready-to-run examples that can get you going right away? We humans also forget, or ignore, or keep redundant context sources away (for a good reason).
Why did you need to use AI to write this post?
I'm sold.
With that said, I can't think of a way that this would work. How does this work? I took a very quick look, and it's not obvious at first glance.
The whole problem is, the AI is short on context, it has limited memory. Of course, you can store lots of memory elsewhere, but how do you solve the problem of having the AI not know what's in the memory as it goes from step to step? How does it sort of find the relevant memory at the time that that relevance is most active?
Could you just walk through the sort of conceptual mechanism of action of this thing?
1. Embeds the current request.
2. Runs a semantic + timestamp-weighted search over your past sessions. Returns only the top N items that look relevant to this request.
3. Those get injected into the prompt as context (like extra system/user messages), so Claude sees just enough to stay oriented without blowing context limits.
Think of it like: Attention over your historical work, more so than brute force recall. Context on demand basically giving you an infinite context window. Bookmark + semantic grep + temporal rank. It doesn’t “know everything all the time.” It just knows how to ask its own past: “What from memory might matter for this?”
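A conceptual sketch of that retrieval loop (not the project's actual code; the embedding vectors, memory format, and half-life are stand-ins):

```python
# Conceptual sketch of the retrieval described above: score past items by cosine
# similarity to the current request, decayed by age ("semantic + timestamp-weighted"),
# then inject only the top N. Inputs are assumed to be pre-embedded vectors.
import math
import time

def score(query_vec, item_vec, item_ts, half_life_days=30.0):
    dot = sum(a * b for a, b in zip(query_vec, item_vec))
    norm = math.sqrt(sum(a * a for a in query_vec)) * math.sqrt(sum(b * b for b in item_vec))
    cosine = dot / norm if norm else 0.0
    age_days = (time.time() - item_ts) / 86400
    recency = 0.5 ** (age_days / half_life_days)   # weight halves every half_life_days
    return cosine * recency

def retrieve(query_vec, memory, n=5):
    """memory: list of (vector, unix_timestamp, text) tuples from past sessions."""
    ranked = sorted(memory, key=lambda m: score(query_vec, m[0], m[1]), reverse=True)
    return [text for _, _, text in ranked[:n]]  # only the top N get injected as context
```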
When you try it, I’d love to hear where the mechanism breaks for you.
Then Claude uses the MCP tools according to the SKILL definition: https://github.com/mutable-state-inc/ensue-skill/blob/main/s...
I think of it like a file tree with proper namespacing, and keep abstract concepts in separate directories. So like my food preferences will be in /preferences/sandos. Or you can even do things like /system-design preferences and then load them into a relevant conversation for next time.
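A toy sketch of that namespacing idea, with made-up paths and in-memory storage just to show the lookup shape:

```python
# Toy sketch: notes keyed by path-like strings, loaded by prefix when a
# conversation needs that slice. Paths and storage here are purely illustrative.
from collections import defaultdict

memory: dict[str, list[str]] = defaultdict(list)

def remember(path: str, note: str) -> None:
    memory[path].append(note)

def load(prefix: str) -> list[str]:
    """Pull every note under a namespace, e.g. load('/preferences')."""
    return [note for path, notes in memory.items() if path.startswith(prefix) for note in notes]

remember("/preferences/sandos", "no pickles")
remember("/system-design/preferences", "prefer boring, proven tech")
print(load("/preferences"))
```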