156 points by skeptrune | 9 days ago | 25 comments
  • danielbln9 days ago
    What I don't like about this approach is that it mainly improves the chances of zero-shotting a feature, but I need to ping-pong with the LLM to iterate on the code/approach. Not sure how to parallelize that; I'm not gonna keep the mental model of 4+ iterations of code in my head and iterate on all of them.

    For visual UI iteration this seems amazing given the right tooling, as the author states.

    I could see it maybe being useful for TDD. Let four agents run on a test file and implement until it passes. Restrict it to 50 iterations per agent; the first one that passes the test terminates the other in-progress sessions. Rinse and repeat.
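
    Roughly, as a sketch (the agent command and the pytest call here are stand-ins for whatever tooling you actually use, not any particular tool):

      # Race N agents on the same failing test; the first green run wins.
      import concurrent.futures, subprocess

      AGENT_CMD = ["my-agent", "--prompt", "make tests/test_feature.py pass"]  # hypothetical CLI
      MAX_ITERS = 50

      def attempt(workdir):
          for _ in range(MAX_ITERS):
              subprocess.run(AGENT_CMD, cwd=workdir)                 # one agent iteration
              tests = subprocess.run(["pytest", "-q"], cwd=workdir)  # did it pass yet?
              if tests.returncode == 0:
                  return workdir
          return None

      with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
          futures = [pool.submit(attempt, f"../wt-{i}") for i in range(4)]
          for done in concurrent.futures.as_completed(futures):
              if done.result():
                  print("winner:", done.result())
                  # Cancels attempts that haven't started; killing in-flight ones
                  # would need their process handles.
                  pool.shutdown(cancel_futures=True)
                  break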

    • diggan9 days ago
      > but I require a ping pong with the LLM to iterate on the code/approach

      I've never gotten good results from any LLM when doing anything more than one-shots. I basically have a copy-pastable prompt, and if the first answer is wrong, I update the prompt and begin from scratch. Usually I add in some "macro" magic too, to automatically run shell commands and whatnot.
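
      Rough sketch of the macro part (the $(...) placeholder syntax and the commands are made up for illustration, not any particular tool):

        import re, subprocess

        PROMPT = """Fix the failing test below. Do not touch unrelated files.

        Test output:
        $(python -m pytest -x -q 2>&1 | tail -n 40)

        Relevant file:
        $(cat src/feature.py)
        """

        def expand(prompt):
            # Replace each $(cmd) with that command's output, like shell substitution.
            def run(match):
                cmd = match.group(1)
                return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
            return re.sub(r"\$\((.+?)\)", run, prompt)

        print(expand(PROMPT))  # paste the result into a fresh session for each attempt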

      It seems like they lose "touch" with what's important so quickly, and manage to steer themselves further away if anything incorrect ends up anywhere in the context. Which, thinking about how they work, sort of makes sense.

      • foolswisdom9 days ago
        That doesn't take away from the OP's point (and OP didn't specify what ping-ponging looks like; it could be the same as what you're describing): you are still iterating based on the results, and updating the prompt based on issues you see in the result. It grates on a human to switch back and forth between those attempts.
        • scroogey9 days ago
          But if you're "starting from scratch", then what would be the problem? If none of the results match what you want, you reiterate on your prompt and start from scratch. If one of them is suitable you take it. If there's no iterating on the code with the agents, then this really wouldn't add much mental overhead? You just have to glance over more results.
    • landl0rd9 days ago
      I usually see that results are worse after ping-pong. If the one-shot doesn't do it, it's better to "re-roll". A context window full of crap poisons the model's ability to do better and stay on target.
    • babyshake9 days ago
      I guess one way it might be able to work is with a manager agent, who delegates to IC agents to try different attempts. The manager reviews their work and understands the differences in what they are doing, and can communicate with you about it and then to the ICs doing the work. So you are like a client who has a point of contact at an engineering org who internally is managing how the project is being completed.
    • skeptrune9 days ago
      > From the post: There is no easy way to send the same prompt to multiple agents at once. For instance, if all agents are stuck on the same misunderstanding of the requirements, I have to copy-paste the clarification into each session.

      It's not just about zero-shotting. You should be able to ping-pong back and forth with all of the parallel agents at the same time. Every prompt is a dice roll, so you may as well roll as many as possible.

      • layoric9 days ago
        > Every prompt is a dice roll, so you may as well roll as many as possible.

        Same vibe as the Datacenter Scale xkcd -> https://xkcd.com/1737/

    • Flemlo9 days ago
      I write docs often, and what works wonders with LLMs is good docs: a README, an architectural doc, etc.

      It helps me plan things well and helps the LLM work a lot better.

      • mooreds9 days ago
        Bonus! Future you and other devs working in the system will benefit from docs as well.
    • vFunct9 days ago
      Yah, it's not really usable for iteration. I don't parallelize this way; I parallelize based on functions. Different agents for different functions.

      Meanwhile, a huge problem in parallelization is maintaining memory-banks, like https://docs.cline.bot/prompting/cline-memory-bank

  • gct9 days ago
    No wonder software is so slow today when we're this profligate. "let's run four AIs and ignore 3/4 of them!" ugh.
    • bitpush9 days ago
      There are trade-offs at all layers of the stack:

      Your CPU is decoding instructions optimistically (and sometimes even executing them)

      Your app is caching results just in case

      The edge server has stuff stashed for you (and others..)

      The list goes on and on...

  • lmeyerov9 days ago
    I like to think about maximizing throughput while minimizing attention: both matter, and the proposal here is expensive on my attention. Optimizing per-task latency matters less than enabling longer non-interactive runs.

    For parallelism, I'm finding it more productive to have multiple distinct tasks that I multitask on and guide each to completion. Along the way I improve the repo docs and tools so the AI is more self-sufficient the next time, so my energy goes more to enabling longer runs.

    Ex: One worker improving all docs. I can come back, give feedback, and redo all of them. If I'm going to mess with optimizing agent flows, it'd be to make the repo style guide clearer to the AI. In theory I can divide the docs into sections and manually run sections in parallel, or ask for multiple parallel versions of it all for comparison... but that's a lot of overhead. Instead, I can fork the repo and work on another, non-docs issue in parallel. An individual task is slow, but I get more tasks done, and with less human effort.

    I'd like tools to automate fork/join parallelism for divide-and-conquer plans, and that feels inevitable. For now, they do fairly linear CoT, and it's easier for me to do distinct tasks than to worry about coordinating.

  • sureglymop9 days ago
    I love how the one non-broken toggle still wasn't great. Now you can save time while wasting your time ;)
    • skeptrune9 days ago
      It was better than starting from scratch though. Imo, getting a functional wireframe for $0.40 is a good deal.
  • CraigJPerry9 days ago
    I'm going in a different direction on this. Worktrees don't solve the problem for me; this is stuck in 1 agent = 1 task mode. I want a swarm of agents on 1 task.

    There are a couple of agent personas I go back to over and over (researcher, designer, critic, implementer, summariser). For most tasks I reuse 90%+ of the same prompt, but the implementer has variants: one primed with an llms.txt (see answerdotai) for a given library I want to use, another configured to use Gemini (I prefer its Tailwind capabilities) rather than Claude, etc.
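
    Concretely, the reuse looks something like this for me (a sketch; the prompt text and model names are just placeholders):

      # Reusable persona prompts; the implementer has per-library / per-model variants.
      PERSONAS = {
          "researcher": "Survey prior art and constraints for: {task}",
          "designer":   "Propose an approach and interface for: {task}",
          "critic":     "List risks, gaps, and missing cases in: {task}",
          "summariser": "Summarise decisions and open questions from: {task}",
      }

      IMPLEMENTER_VARIANTS = {
          "default":  {"model": "claude", "prompt": "Implement: {task}"},
          "llms-txt": {"model": "claude", "prompt": "Implement using the attached llms.txt: {task}"},
          "tailwind": {"model": "gemini", "prompt": "Implement (Tailwind-heavy UI): {task}"},
      }

      print(PERSONAS["critic"].format(task="add dark mode to the settings page"))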

    To organise these reusable agents I'm currently test-driving langroid; each agent contributes via a sub-task.

    It's not perfect yet though.

    • skeptrune9 days ago
      I think you misread. The point I make is that it's many agents = 1 task.

      Since the probability of an LLM succeeding at any given task is sub-100%, you should run multiple copies of the same LLM with the same prompted task in parallel.
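
      The arithmetic behind that: if a single attempt succeeds with probability p, at least one of n attempts succeeds with probability 1 - (1 - p)^n, assuming independence (attempts sharing the same prompt and model are correlated, so the real gain is smaller):

        # Chance that at least one of n independent attempts succeeds.
        def at_least_one(p, n):
            return 1 - (1 - p) ** n

        for n in (1, 2, 4, 8):
            print(n, round(at_least_one(0.5, n), 3))  # 0.5, 0.75, 0.938, 0.996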

      • yakbarber9 days ago
        I think OP means they should be collaborating. In the poster's proposed solution, each agent is independent. But you could reduce the human attention required by having multiple rounds of evaluation and feedback from other agents before it gets to the human.
  • dgunay9 days ago
    This looks like a much more sophisticated version of my setup. I had Aider vibe code me a script that just manages cloning a repo into a subfolder, optionally with some kind of identifying suffix on it, and then wrote a tiny script to automate calling that script, `cd`ing into the directory, and then running codex-cli on it. The resulting workflow: I open a new terminal, type `vibe --suffix=<suffix> <prompt>`, and then I can go do something else.
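
    For anyone curious about the shape of it, something like this (a sketch, not the actual script; the final agent invocation is a placeholder you'd swap for however you actually run codex-cli):

      # vibe.py -- clone the repo into a suffixed sibling folder and hand it to an agent.
      import argparse, pathlib, subprocess

      parser = argparse.ArgumentParser()
      parser.add_argument("prompt")
      parser.add_argument("--suffix", default="scratch")
      args = parser.parse_args()

      workdir = pathlib.Path(f"../myrepo-{args.suffix}")            # hypothetical layout
      subprocess.run(["git", "clone", ".", str(workdir)], check=True)

      # Placeholder invocation; adjust to whatever your agent CLI actually accepts.
      subprocess.run(["codex", args.prompt], cwd=workdir)
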
    • 8200_unit9 days ago
      Could you share your scripts?
      • dgunay4 days ago
        IMO you're better off just asking Aider to write one tailored to your specific use cases. Anonymizing the code so that I could post this gist is actually the first time I've read most of it, and it's really bad code. But in case none of that deters you, here you go: https://gist.github.com/dgunay/4a07db199ca154614c2193718da60...
  • peterkelly9 days ago
    If AI coding agents were actually any good, you could preface your prompt with "attempt the following task four times in parallel" and that would be it.
  • hboon9 days ago
    Coincidentally, I just posted this[1] earlier today where I made a simple change against a 210 LOC vs 1379 LOC file, comparing parameters: LOC, filename vs URL path for that webpage, playwright verification.

    My question is: how does the author get things done with $0.10? My simple example with the smaller file costs $2 each time.

    [1] https://x.com/hboon/status/1927939888000946198

  • thepablohansen9 days ago
    This resonates- my engineering workflow has started shifting from highly focused, long periods of building out a feature to one that has much more context switching, review, and testing.
  • hombre_fatal9 days ago
    I find that my bottleneck with LLMs on a real project is reviewing their code, QAing it, and, if it's novel code, integrating it into my own mental model of how the code works so that I can deliberately extend it in a maintainable way.

    The latter is so expensive that I still write most code myself, or I'll integrate LLM code into the codebase myself.

    I've used parallel Claude Code agents to do chore work for me.

    But I'd be curious about examples of tasks that people find best for OP's level of parallelization.

  • sagarpatil9 days ago
    I avoid using my own API keys, especially for Sonnet 4 or Opus, because LLMs can rack up unexpected costs. Instead, I use Augment Code’s remote agents and Google’s Jules, which charge per message rather than by usage. This setup is ideal for me since I prefer not to run the model locally while I’m actively working on the codebase.
  • uludag9 days ago
    I completely see the benefit of this strategy. Defaulting to something like this would seem to inflate costs, though, as a tradeoff for time, and I know certain LLM usage can be pretty pricey. I hope something like this doesn't become the default, as I can see parallelization becoming a dark pattern for those making money off of token usage.
    • ramoz9 days ago
      I don't think it's a great representation of the utility of worktrees, or even of efficient, practical use of agents.
      • vFunct9 days ago
        It pretty much is though. This is exactly what you'd do if you had 100 different employees.
        • ramoz9 days ago
          I wouldn't ask for 100 different versions of the same feature from each of them.

          1 agent is supposed to be powerful with proper context and engineering design decisions in mind - whether UI or backend.

          Asking 3 different agents to do the same engineering task reeks of inefficient or ineffective development patterns with agents.

          • TeMPOraL9 days ago
            > I wouldn't ask for 100 different versions of the same feature from each of them.

            You wouldn't because human labor is too expensive to make it worthwhile. You would if it were too cheap to meter.

            We actually do that at the scale of society - that's market competition in a nutshell. Lots of people building variants of the same things, then battling it out on the market. Yes, it's wasteful as hell (something too rarely talked about), but we don't have a better practical alternative at this point, so there's some merit to the general idea.

            (Also the same principle applies to all life - both in terms of how it evolves, and how parts of living organisms work internally. Actively maintained equilibria abound.)

            • maxbond9 days ago
              > Also the same principle applies to all life

              Actively maintained equilibria abound, but this is not typically the mechanism. Different species in adjacent niches aren't better or worse versions of the same organism to be evaluated and either selected or discarded. It's more typical for them to adopt a strategy of ecological segmentation so that they can all have their needs met. Every few years moths migrate to my state to reproduce - and they do so before our local moths have woken up for the season, and leave around the time they do, so that they aren't in competition. Birds that feed from the same trees will eat from different parts of the tree and mate at different times, so that their peak energy consumption doesn't line up. What would the benefit be in driving each other to extinction?

              Evolution doesn't make value judgments; it doesn't know which species is better or worse, and it doesn't know how future climatic shifts will change the fitness landscape. Segmentation is both easier and a hedge against future climatic shifts.

              Engineering works under a very different logic where the goal is optimal performance in a controlled environment for an acceptable service life, not satisfactory performance with extremely high robustness in the face of unknown changes into the perpetual future. When we rank different systems and select the most optimal, we are designing a system that is extremely brittle on geologic timescales. Abandon a structure and it will quickly fall apart. But we don't care because we're not operating at geologic timescales and we expect to be around to redesign systems as their environment changes to make them unsuitable.

              Similarly, the reproduction of labor/capacity in markets you described could be viewed as trading efficiency for robustness instead of as waste. Eg, heavily optimized supply chains are great for costs, but can have trouble adapting to global pandemics, wars in inconvenient places, or ships getting stuck in the wrong canal.

          • vFunct9 days ago
            I actually don’t use them that way. I use 100 different agents on 100 different worktrees to develop 100 different apps for the overall project.
            • ramoz9 days ago
              That’s what I’m advocating for. That’s not what was demonstrated in the blog
              • tough8 days ago
                In frontend, exploratory random work might have some value if you don't know what you need.

                Both seem like valid uses of this synthetic intelligence to me.

          • tough9 days ago
            What if you have 100 lint errors that you can parallelize fixing across 100 small local 1B LLMs?
            • ramoz9 days ago
              This is exactly what I would do.
              • ukuina9 days ago
                Without agent collaboration, you'll need a whole tree of agents just to resolve the merge conflicts.
                • tough8 days ago
                  Usually the orchestrator or planner that spawns the sub-agents is the "collaboration" protocol, as it has visibility into all the others and can start/kill new ones as it sees fit and coordinate appropriately. But yeah.
    • greymalik9 days ago
      This is discussed in TFA. The absolute costs are negligible, particularly in comparison to the time saved.
    • arguflow9 days ago
      I think the most annoying part is when a coding agent takes a particularly long time to produce something AND has bad output. It is such a time sink / sunk cost.
  • maximilianroos9 days ago
    I posted some notes from a full setup I've built for myself with worktrees: https://github.com/anthropics/claude-code/issues/1052

    I haven't productized it though; uzi looks great!

  • mrbonner9 days ago
    I wonder why we should spend so much effort on this vs., say, using checkpoints in Cline. You could restore the task and files to a previous state and try a different prompt/plan. And, as a bonus, you have all of the previous context available.
  • bjackman8 days ago
    Hmm this strategy only makes sense if you can trivially evaluate each agent's results, which I haven't found to be the case.

    I expect a common case would be: one agent wrote code that does the thing I want. One agent wrote code that isn't unmaintainable garbage. These are not the same agent. So now you have to combine the two solutions which is quite a lot of work.

  • hboon9 days ago
    I run 2 "developers" and myself. Also with tmux, in 3 windows, but I just clone the repo for the 2 developers and then manually pull into my copy when I think it's done. I see various people/sites mentioning git worktrees. I know what it is, but how is it better?
    • davely8 days ago
      git worktrees optimize how data is shared across multiple directories, so you're not cloning and duplicating a bunch of code on your machine; you're just referencing data from the original .git folder.
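
      A minimal sketch of the setup, one linked worktree (and branch) per agent (paths and branch names are just examples; `git worktree add <path> -b <branch>` on the command line does the same thing):

        # Create one linked worktree per agent; all share the same .git object store.
        import subprocess

        for i in range(4):
            subprocess.run(
                ["git", "worktree", "add", f"../agent-{i}", "-b", f"agent-{i}"],
                check=True,
            )
        # Clean up later with `git worktree remove ../agent-0` or `git worktree prune`.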

      The only downsides I’ve seen:

      1. For JS projects at least, you still need to npm install or yarn install for all packages. This can take a bit of time.

      2. Potentially breaks if you have bespoke monorepo tooling that your infra team won’t accept updates for, why is this example so specific, I don’t know just accept my PR please. But I digress.

      • hboon8 days ago
        Can I say it's just for disk (and perhaps speed when pulling/pushing) optimisation? But as long as that's not a bottleneck, there is no difference?

        (because git clone is much cleaner and feels less risky especially with agents)

      • mdaniel8 days ago
        > 1. For JS projects at least, you still need to npm install or yarn install for all packages. This can take a bit of time.

        I believe that's the problem pnpm is trying to solve (err, not the time part, I can't swear it's wall clock faster, but the "hey hard links are a thing" part <https://pnpm.io/faq#:~:text=pnpm%20creates%20hard%20links%20...> )

  • asadm9 days ago
    Ooh, I was exploring this path; aider is so slow. Thanks for validating it.
    • arguflow9 days ago
      Is aider supposed to do worktrees by default?
      • asadm9 days ago
        I don't think so?
  • dangoodmanUT9 days ago
    Thank you for not writing this in Python.
  • vercantez9 days ago
    Very cool! Actually practical use of scaling parallel test time compute. I've been using worktrees + agents to work on separate features but never considered allocating N agents per task.
  • crawshaw9 days ago
    This is a good strategy we (sketch.dev) experimented with a bit, but in the end we went with containers because it gives the LLM more freedom to, e.g. `apt-get install jq` and other tools.
  • asadm9 days ago
    I don't see the uzi code on GitHub.
  • juancn9 days ago
    Now you can do 4X more code reviews!
    • morkalork9 days ago
      What's the issue, everyone loves doing code review right?
    • arguflow9 days ago
      Why review the code? Most of the time, all you want is a good starting point.
      • oparin109 days ago
        If all you need is a good starting point, why not just use a framework or library?

        Popular libraries/frameworks that have been around for years and have hundreds of real engineers contributing, documenting issues, and fixing bugs are pretty much guaranteed to have code that is orders of magnitude better than something that can contain subtle bugs and that they will have to maintain themselves if something breaks.

        In this very same post, the user mentions building a component library called Astrobits. Following the link they posted for the library’s website, we find that the goal is to have a "neo-brutalist" pixelated 8-bit look using Astro as the main frontend framework.

        This goal would be easily accomplished by just using a library like ShadCN, which also supports Astro[1], and has you install components by vendoring their fully accessibility-optimized components into your own codebase. They could then change the styles to match the desired look.

        Even better, they could simply use the existing 8-bit styled ShadCN components[2] that already follow their UI design goal.

        [1] - https://ui.shadcn.com/docs/installation/astro [2] - https://www.8bitcn.com/

        • skeptrune9 days ago
          I think AI makes personal software possible in a way that it wasn't before. Without LLMs, I would have never had the time to build a component library at all and would have probably used 8bitcn (looks awesome btw) and added the neo-brutalist shadows I wanted.

          However, while my gripes with ShadCN for Astro are minor (lots of deps + the required client:load template directive), even small friction points are enough that I'm willing to quickly build my own project. AI makes it barely any work, especially when I lower the variance using parallelization.

        • arguflow9 days ago
          Frameworks and libraries are useful to keep the code style the same.

          Using multiple agents helps when the end goal isn't known, especially if there is no end-state UI design in mind. I've been using a similar method for Shopify Polaris[1]; putting the building blocks together (and combing through docs to find the correct blocks) is still a massive chore.

          [1] - https://polaris-react.shopify.com/getting-started

        • eikenberry9 days ago
          > If all you need is a good starting point, why not just use a framework or library?

          A good starting point fixes the blank page problem. Frameworks or libraries don't address this problem.

    • vidyootsenthil9 days ago
      or also 4x productivity!
      • juancn9 days ago
        Coding has never been the bottleneck for me; it's all the other crap that takes time.
  • diogolsq9 days ago
    The fact that you consider this “saving time” might show that you are not being diligent with your code.

    So what if the BDD is done?

    Read >> Write.

    As the final step, you should be cleaning up AI slop, bloated code, useless dependencies, unnecessary tests—or the lack thereof—security gaps, etc. Either way, any fine-tuning should happen in the final step.

    This review stage will still eat up any time gains.

    Is the AI result good enough? If not, fix it yourself, or feed it back to the AI to fix.

    • hbogert8 days ago
      I agree with you, however:

      > Read >> Write.

      is what a lot of AI agent zealots see as no longer a thing. I have had multiple people tell me that AI will just have to get better at reading sloppy code and that humans should no longer have to be able to do it.

      • diogolsq7 days ago
        Those zealots, gosh.

        Nonsense. When AI gets stuck in a suboptimal topology, it’s the human who nudges it out.

        How will you maintain code if you’re not even able to read it?

        Using AI to read introduces noise with every iteration. Did you see that trend of people asking AI to reproduce someone's photo over 70 times? The noise adds up, and the original meaning shifts.

        AI is non-deterministic.

  • mickey4757788 days ago
    [flagged]
    • nathancspencer8 days ago
      Not sure the author intended the parallelisation to be used to respond to HN posts