It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.
In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.
Try it yourself. Ask Claude for something you don't really understand. Then learn that thing, get a fresh instance of Claude and try again. This time it will work much better, because your knowledge and experience will be naturally embedded in the prompt you write.
In my experience the differences are mostly in how the code produced by LLM is prompted and what context is given to the agent. Developers who have experience delegating their work are more likely to prevent downstream problems from happening immediately and complain their colleagues cannot prompt as efficiently without a lot of hand holding. And those who rarely or never delegated their work are invariably going to miss crucial context details and rate the output they get lower.
this makes me feel better about the amount of disdain I've been feeling about the output from these llms. sometimes it pops out exactly what I need but I can never count on it to not go off the rails and require a lot of manual editing.
I think especially a number of us more junior programmers lack in this regard, and don't see a clear way of improving this skill beyond just using LLMs more and learning with time?
Also document some best practices in AGENT.md or whatever it's called in your app.
Eg
* All imports must be added on top of the file, NEVER inside the function.
* Do not swallow exceptions unless the scenario calls for fault tolerance.
* All functions need to have type annotations for parameters and return types.
And so on.
I almost always define the class-level design myself. In some sense I use the LLM to fill in the blanks. The design is still mine.
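As a rough illustration of what the AGENT.md-style rules listed above ask for, here is a small TypeScript sketch. The rules as written read as Python-flavoured, and `parsePort` and `parsePortOr` are made-up names for this example, not taken from any project mentioned in the thread.

```typescript
// Rule 1: any imports would sit here, at the top of the file, never inside a function.
// (This sketch happens to need none.)

// Rule 3: parameters and return type are annotated explicitly.
function parsePort(raw: string): number {
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    // Rule 2: surface the failure instead of swallowing it.
    throw new RangeError(`invalid port: ${JSON.stringify(raw)}`);
  }
  return port;
}

// Rule 2, the exception: a caller that genuinely wants fault tolerance
// opts in explicitly and names the fallback it is prepared to accept.
function parsePortOr(raw: string, fallback: number): number {
  try {
    return parsePort(raw);
  } catch (err: unknown) {
    if (err instanceof RangeError) {
      return fallback;
    }
    throw err; // anything unexpected still propagates
  }
}
```

The fault-tolerant variant only catches the one error it knows how to handle, which is probably the spirit of "unless the scenario calls for fault tolerance".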
EDIT: My bad, the code eventually calls into dedicated functions from database.ts, so those 200 lines are mostly just validation and error handling.
Example, Agent.ts, line 93, function createManageKnowledgeTool() [1]. I would have expected something like the following and not almost 200 lines of code implementing everything in place. This also uses two stores of some sort - memory and scratchpad - and they are also not abstracted out, upsert and delete deal with both kinds directly.
    switch (action)
    {
        case "help":
            return handleHelpAction(arguments);
        case "upsert":
            return handleUpsertAction(arguments);
        case "delete":
            return handleDeleteAction(arguments);
        default:
            return handleUnknownAction(arguments);
    }
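One way to read the suggestion above, sketched in runnable TypeScript: a thin dispatcher plus a single store interface, so that upsert and delete no longer care whether they touch memory or scratchpad. Every name here (`KnowledgeStore`, `InMemoryStore`, `ToolArgs`, `handleAction`) is hypothetical and is not the code in the linked repository.

```typescript
// Hypothetical sketch: none of these names come from the linked repository.
type StoreKind = "memory" | "scratchpad";

// A single interface hides which kind of store is being touched.
interface KnowledgeStore {
  upsert(key: string, value: string): void;
  remove(key: string): boolean;
}

class InMemoryStore implements KnowledgeStore {
  private readonly items = new Map<string, string>();
  upsert(key: string, value: string): void {
    this.items.set(key, value);
  }
  remove(key: string): boolean {
    return this.items.delete(key);
  }
}

const stores: Record<StoreKind, KnowledgeStore> = {
  memory: new InMemoryStore(),
  scratchpad: new InMemoryStore(),
};

interface ToolArgs {
  action: string;
  store: StoreKind;
  key: string;
  value?: string;
}

function handleUpsertAction(args: ToolArgs): string {
  if (args.value === undefined) {
    return "error: upsert requires a value";
  }
  stores[args.store].upsert(args.key, args.value);
  return `upserted ${args.key} in ${args.store}`;
}

function handleDeleteAction(args: ToolArgs): string {
  const existed = stores[args.store].remove(args.key);
  return existed ? `deleted ${args.key} from ${args.store}` : `error: ${args.key} not found`;
}

// The tool body shrinks to dispatch; each handler owns its own checks.
function handleAction(args: ToolArgs): string {
  switch (args.action) {
    case "help":
      return "actions: help, upsert, delete";
    case "upsert":
      return handleUpsertAction(args);
    case "delete":
      return handleDeleteAction(args);
    default:
      return `error: unknown action ${args.action}`;
  }
}
```

Whether this is better than 200 lines of in-place validation is a judgment call; the sketch only shows that the two stores can sit behind one interface.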
[1] https://github.com/skorokithakis/stavrobot/blob/master/src/a...
The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.
We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra etc) discuss a request and distill a solution. They all had access to the source code of the project to work on.
Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.
The council of agents got to a very good result, consuming about $12, mostly using Opus 4.6.
To our surprise, going straight with a single prompt in Claude Code got to a similarly good result, faster, consuming $0.30 and mostly using Haiku.
This surely deserves more investigation, but our assumption / hypothesis so far is that coordination and communication between agents has a remarkable cost.
Should this be the case, I personally would not be surprised:
- The reason we humans do job separation is that we have an inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge needed to be good architects, good business analysts and good security experts. Apparently, that's not a problem for an LLM. So job separation is probably not as necessary a pattern as it is for humans.
- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of processes, to the point that processes turn into bureaucracy. In IT companies, many problems sit at the interface between groups, because of low-bandwidth communication and the inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself far better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.
Aider did an "architect-editor" split where architect is just a "programmer" who doesn't bother about formatting the changes as diff, then a weak model converts them into diffs and they got better results with it. This is nothing like human teams though.
Then you execute it with a clean context.
Clean context is needed for maximum performance while not remembering implementation dead ends you already discarded
Sample size of one, but I found it helps guard against the model drifting off. My different agents have different permissions. The worker cannot edit the plan. The QA or planner can't modify the code. This is something I sometimes catch codex doing: modifying unrelated stuff while working.
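A minimal sketch of per-role write permissions like the ones described above. The role names, the directory layout and the `canWrite` helper are all assumptions made for illustration; a real harness would also need to normalise paths before checking them.

```typescript
// Hypothetical role names and path rules, chosen only to mirror the comment.
type AgentRole = "planner" | "worker" | "qa";

// Which path prefixes each role may write to; everything else is read-only.
const writablePrefixes: Record<AgentRole, readonly string[]> = {
  planner: ["plan/"],
  worker: ["src/", "tests/"],
  qa: ["reviews/"],
};

function canWrite(role: AgentRole, path: string): boolean {
  return writablePrefixes[role].some((prefix) => path.startsWith(prefix));
}

// A harness would call this before applying any edit an agent proposes.
function assertCanWrite(role: AgentRole, path: string): void {
  if (!canWrite(role, path)) {
    throw new Error(`${role} is not allowed to modify ${path}`);
  }
}
```

With this layout the worker can touch `src/` but not `plan/`, and neither the planner nor QA can touch the code, which is the separation the comment relies on.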
Admittedly I was using gsdv2; I've never had this issue with codex and claude. Sure, some RL hacking such as silent defaults or overly defensive code for no reason. Nothing that seemed basically actively malicious such as the above though. Still, gsdv2 is a 1-agent scaffolding pipeline.
I think the issue is that these 1-agent pipelines are "YOU MUST PLAN IMPLEMENT VERIFY EVERYTHING YOURSELF!" and extremely aggressive language like that. I think that kind of language coerces the agent to do actively malicious hacks, especially if the pipeline itself doesn't see "I am blocked, shifting tasks" as a valid outcome.
1-agent pipelines are like a horrible horrible DFS. I still somewhat function when I'm in DFS mode, but that's because I have longer memory than a goldfish.
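To make the "I am blocked, shifting tasks" idea concrete, here is a hedged sketch of a queue where being blocked is a legitimate result and the task is simply deferred. `TaskOutcome`, `PipelineTask` and `runQueue` are invented names; this is not how gsdv2 or any other named tool works.

```typescript
// Hypothetical types: "blocked" is a first-class result, not a failure to hide.
type TaskOutcome =
  | { kind: "done" }
  | { kind: "blocked"; reason: string };

interface PipelineTask {
  id: string;
  attempt: () => TaskOutcome;
}

// Blocked tasks go to the back of the queue instead of being forced through,
// so the run is not a pure depth-first dive on one stuck task.
function runQueue(
  tasks: PipelineTask[],
  maxPasses: number,
): { done: string[]; stillBlocked: string[] } {
  const done: string[] = [];
  let queue = [...tasks];
  for (let pass = 0; pass < maxPasses && queue.length > 0; pass++) {
    const deferred: PipelineTask[] = [];
    for (const task of queue) {
      const outcome = task.attempt();
      if (outcome.kind === "done") {
        done.push(task.id);
      } else {
        deferred.push(task); // "I am blocked, shifting tasks"
      }
    }
    queue = deferred;
  }
  return { done, stillBlocked: queue.map((task) => task.id) };
}
```

Anything still blocked after the last pass is reported rather than hacked around, which leaves the decision to a human or a different agent.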
If you’re exploring an idea or iterating, the roles can help break it down and understand your own requirements. Personally I do that “away” from the code though.
What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.
In my experience, evidence for the efficacy of software engineering practices falls into two categories:
- the intuitions of developers, based in their experiences.
- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.
Evidence for this LLM pattern is the same. Some developers have an intuition it works better.
Also, lines of code is not a completely meaningless metric. What one should measure is lines of code that are not verified by the compiler. E.g., in C++ you cannot have unbalanced brackets or use an incorrectly typed value, but you may still have an off-by-one error.
Given all that, you can measure customer facing defect density and compare different tools, whether they are programming languages, IDEs or LLM-supported workflow.
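The comparison proposed above boils down to simple arithmetic: customer-facing defects divided by code size. A small sketch follows, with entirely invented numbers and a hypothetical `WorkflowStats` shape; it uses plain lines of code, since counting only the lines the compiler does not verify is much harder to do mechanically.

```typescript
// All figures below are invented purely to show the arithmetic.
interface WorkflowStats {
  label: string;
  customerFacingDefects: number;
  linesOfCode: number;
}

// Defects per thousand lines of code (KLOC).
function defectDensity(stats: WorkflowStats): number {
  if (stats.linesOfCode <= 0) {
    throw new RangeError("linesOfCode must be positive");
  }
  return (stats.customerFacingDefects * 1000) / stats.linesOfCode;
}

const handWritten: WorkflowStats = {
  label: "hand-written",
  customerFacingDefects: 12,
  linesOfCode: 40000,
};
const llmAssisted: WorkflowStats = {
  label: "LLM-assisted",
  customerFacingDefects: 30,
  linesOfCode: 120000,
};
```

With these made-up inputs, `defectDensity(handWritten)` is 0.3 and `defectDensity(llmAssisted)` is 0.25 defects per KLOC. The numbers only mean something if the two codebases are otherwise comparable.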
Comparing lines of code can be meaningful, mostly if you can keep a lot of other things constant, like coding style, developer experience, domain, tech stack. There are many style differences between LLM and human generated code, so that I expect 1000 lines of LLM code do a lot less than 1000 lines of human code, even in the exact same codebase.
At the same time I can see a more linear approach doing something similar. When I ask for an implementation plan, that is functionally not all that different from an architect agent, even if it's not wrapped in such a persona.
Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.
Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.
What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."
"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.
Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.
So to me it makes sense to have models with different architecture/data/post training refine each other's answers. I have no idea whether adding the personas would be expected to make a difference though.
Context & how LLMs work requires this.
From my experience no frontier model produces bug free & error free code with the first pass, no matter how much planning you do beforehand.
With 3 tiers, you spend your token & context budget in full in 3 phases. Plan, implement, review.
If the feature is complex, multiple rounds of reviews, from scratch.
It works.
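A sketch of the plan, implement, review loop described above, under the assumption that each phase is a separate call that sees only the previous phase's artifact. `callModel` is a stand-in for whatever API or CLI is actually used, and the `"LGTM"` verdict string is an invented convention.

```typescript
// Hypothetical harness: `callModel` stands in for whatever API or CLI you use.
type ModelCall = (phase: "plan" | "implement" | "review", input: string) => string;

interface ReviewedResult {
  plan: string;
  code: string;
  rounds: number;
  approved: boolean;
}

// Each call receives only the artifact from the previous phase, never the
// conversation that produced it, so every phase starts with a clean context.
function planImplementReview(
  request: string,
  callModel: ModelCall,
  maxRounds: number,
): ReviewedResult {
  const plan = callModel("plan", request);
  let code = callModel("implement", plan);
  for (let round = 1; round <= maxRounds; round++) {
    const verdict = callModel("review", code);
    if (verdict === "LGTM") {
      return { plan, code, rounds: round, approved: true };
    }
    // Hand the plan, the previous attempt and the findings to a fresh implementer.
    code = callModel("implement", `${plan}\n---\n${code}\n---\n${verdict}`);
  }
  return { plan, code, rounds: maxRounds, approved: false };
}
```

The `maxRounds` cap is there because a review loop with no budget can spin forever on a reviewer that never approves.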
There's a 63-page paper with mathematical proof if you're really into this.
https://arxiv.org/html/2601.03220v1
My takeaway: AI learns from real-world texts, and real-world corpora tend to have a role split of architect/developer/reviewer.
> There's a 63 page paper with mathematical proof if you really into this.
> https://arxiv.org/html/2601.03220v1
I'm confused. The linked paper is not primarily a mathematics paper, and to the extent that it is, proves nothing remotely like the question that was asked.
I am not an expert, but by my understanding, the paper proves that a computationally bounded "observer" may fail to extract all the structure present in the model in one computation. In other words, you can't always one-shot perfect code.
However, arranging many pipelines of role "observers" may gradually get you there.
Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.
Well I was until the session limit for a week kicked in.
I think the author admits that it doesn't, doesn't realise it and just goes on:
--- start quote ---
On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet
--- end quote ---
Maybe you should write and share your own article to counter this one.
> I'd like to add email support to this bot. Let's think through how we would do this.
and I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing).
Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage of being "polite", for lack of a better word, and I feel like I'm saving on seconds and tokens :)
In the back of my head I know the chatbot is trained on conversations and I want it to reflect a professional and clear tone.
But I usually keep it more simple in most cases. Your example:
> I'd like to add email support to this bot. Let's think through how we would do this.
I would likely write as:
> if i wanted to add email support, how would you go about it
or
> concise steps/plan to add email support, kiss
But when I'm in a brainstorm/search/rubber-duck mode, then I write more as if it was a real conversation.
Keeping everything generally "human readable" also has the advantage of being easier for me to review later if needed.
As you said, that "other person" might be me too. Same reason I comment code. There's another person reading it, most likely that other person is "me, but next week and with zero memory of this".
We do like anthropomorphising the machines, but I try to think they enjoy it...
What even is thinking and reasoning if these models aren't doing it?
Among many other factors, perhaps the most key differentiator for me that prevents me describing these as thinking, is proactivity.
LLMs are never pro-active.
( No, prompting them on a loop is not pro-activity ).
Human brains are so proactive that given zero stimuli they will hallucinate.
As for reasoning, they simply do not. They do a wonderful facsimile of reasoning, one that's especially useful for producing computer code. But they do not reason, and it is a mistake to treat them as if they can.
But what would proactivity in an LLM look like, if prompting in a loop doesn't count?
An LLM experiences reality in terms of the flow of the token stream. Each iteration of the LLM has 1 more token in the input context and the LLM has a quantum of experience while computing the output distribution for the new context.
A human experiences reality in terms of the flow of time.
We are not able to be proactive outside the flow of time, because it takes time for our brains to operate, and similarly LLMs are not able to be proactive outside the flow of tokens, because it takes tokens for the neural networks to operate.
The flow of time is so fundamental to how we work that we would not even have any way to be aware of any goings-on that happen "between" time steps even if there were any. The only reason LLMs know that there is anything going on in the time between tokens is because they're trained on text which says so.
Also an LLM will hallucinate on zero input quite happily if you keep sampling it and feeding it the generated tokens.
These days, the user prompt is just a tiny part of the context it has, so it probably matters less or not at all.
I still do it though, much like I try to include relevant technical terminology to try to nudge its search into the right areas of vector space. (Which is the part of the vector space built from more advanced discourse in the training material.)
Edit: wording
So no evidence.
Sure seems like this could be the case with the structure of the prompt, but what about capitalizing the first letter of a sentence, adding commas, tag questions, etc.? They seem like semantics that will not play any role in the end.
These are text completion engines.
Punctuation and capitalization are found in polite discussion and textbooks, and so you'd expect those tokens to ever so slightly push the model in that direction.
Lack of capitalization pushes towards text messages and irc perhaps.
We cannot reason about these things in the same way we can reason about using search engines, these things are truly ridiculous black boxes.
Might very well be the case, I wonder if there's some actual research on this by people that have some access to the internals of these black boxes.
In my world view, an LLM is far closer to a fridge than the androids of the movies, let alone human beings. So it's about as pointless being polite to it as is greeting your fridge when you walk into the kitchen.
But I know that others feel different, treating the ability to generate coherent responses as indication of the "divine spark".
Note, why would the author write "Email will arrive from a webhook, yes." instead of "yy webhook"? In the second case I wouldn't be impolite either, I might reply like this in an IM to a colleague I work with every day.
For the vast majority of people, using capital letters and saying please doesn't consume energy, it just is. There's a thousand things in your day that consume more energy like a shitty 9AM daily.
It's also actually more trouble to formulate abbreviated sentences than normal ones, at least for literate adults who can type reasonably well.
> literate adults who can type reasonably well
For me the difference is around 20 wpm in writing speed if I just write out my stream of thoughts vs when I care about typos and capitalizing words - I find real value in this.
Also consider the insanity of intentionally feeding bullshit into an information engine and expecting good things to come out the other end. The fact that they often perform well despite the ugliness is a miracle, but I wouldn't depend on it.
Further, an LLM being inherently sycophantic leads to it mimicking me, so if I talk to it in a stupid or abusive (which is just another form of stupidity, in my eyes) manner, it will behave stupidly. Or, that's what I'd expect. I've not researched this in a focused way, but I've seen examples where people get LLMs to be very unintelligent by prompting riddles or intelligence tests in highly-stylized speech. I wanted to say "highly-stupid speech", but "stylized" is probably more accurate, e.g.: `YOOOO CHATGEEEPEEETEEE!!!!!!1111 wasup I gots to asks you DIS.......`. Maybe someone can prove me wrong.
> having a dry tone and cutting the unnecessary parts
That's how I try to communicate in professional settings (AI included). Our approaches might not be that different.
the models consistently spew slop when one does it, I have no idea where positive reinforcement for that behavior is coming from
My "thinker" agent will ask questions, explore, and refine. It will write a feature page in notion, and split the implementation into tasks in a kanban board, for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.
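The board flow described above can be pictured as a small state machine in which each actor may only make certain moves. A sketch follows; the column names, the actor names and the rule that QA sends flagged tickets back to the executor are assumptions, not details from the comment.

```typescript
// Hypothetical column names, loosely following the flow described above.
type TicketColumn = "todo" | "in_progress" | "qa" | "human_review" | "done";
type BoardActor = "executor" | "qa" | "human";

// Which moves each actor may make; anything not listed is rejected.
const allowedMoves: Record<BoardActor, ReadonlyArray<readonly [TicketColumn, TicketColumn]>> = {
  executor: [["todo", "in_progress"], ["in_progress", "qa"]],
  qa: [["qa", "in_progress"], ["qa", "human_review"]], // flag it, or pass it on
  human: [["human_review", "done"], ["human_review", "todo"]],
};

function moveTicket(actor: BoardActor, from: TicketColumn, to: TicketColumn): TicketColumn {
  const ok = allowedMoves[actor].some(([a, b]) => a === from && b === to);
  if (!ok) {
    throw new Error(`${actor} may not move a ticket from ${from} to ${to}`);
  }
  return to;
}
```

In this sketch only the human can move a ticket to `done`, which keeps the final review step from being skipped by an agent.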
I really love it. All of our other documentation lives in notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.
Reviewing is simpler too. I can pick the ticket in the human review column, read the requirements again, check the QA comments, and then look at the code. Had a lot of fun playing with it yesterday, and I shared it here:
Some people say LLM assisted coding will cost a lot of developers' jobs, but posts like this imply it'll cost (solve?) a lot of management / overhead too.
Mind you I've always thought project managers are kinda wasteful, as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria. But unfortunately that's not the reality and it's often up to the developers themselves to create the tasks and fill them in, then execute them. Which of course begs the question, why do we still have a PM?
(the above is anecdotal and not a universal experience I'm sure. I hope.)
> as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria
That's interesting. In every team I worked in, I always fought really hard against anyone but developers being able to write tickets on the board.
I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.
Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA
1. https://github.com/humanlayer/advanced-context-engineering-f...
The key change I've found is really around orchestration - as TFA says, you don't run the prompt yourself. The orchestrator runs the whole thing. It gets you to talk to the architect/planner, then the output of that plan is sent to another agent, automatically. In his case he's using an architect, a developer, and some reviewers. I've been using a Superpowers-based [0] orchestration system, which runs a brainstorm, then a design plan, then an implementation plan, then some devs, then some reviewers, and loops back to the implementation plan to check progress and correctness.
It's actually fun. I've been coding for 40+ years now, and I'm enjoying this :)
what we found: split on domain of side effects, not on task complexity. a "researcher" agent that only reads and a "writer" agent that only publishes can share context freely because only one of them has irreversible actions. mixing read + write in one agent makes restart-safety much harder to reason about.
the other practical thing: separate agents with separate context windows helps a lot when you have parts of the graph that are genuinely parallel. a single large agent serializes work it could parallelize, and the latency compounds across the whole pipeline.
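One hedged way to express "split on domain of side effects" in code is to tag every tool as read or write and derive restart-safety from the tags. The tool names and the `isRestartSafe` helper below are invented for illustration.

```typescript
// Hypothetical tool registry; the names are illustrative only.
type SideEffect = "read" | "write";

interface AgentTool {
  id: string;
  effect: SideEffect;
}

interface AgentSpec {
  id: string;
  tools: AgentTool[];
}

// An agent whose tools only read can be killed and restarted at any point:
// rerunning it cannot publish, send, or delete anything twice.
function isRestartSafe(agent: AgentSpec): boolean {
  return agent.tools.every((tool) => tool.effect === "read");
}

const researcher: AgentSpec = {
  id: "researcher",
  tools: [
    { id: "web_search", effect: "read" },
    { id: "read_file", effect: "read" },
  ],
};

const writer: AgentSpec = {
  id: "writer",
  tools: [{ id: "publish_post", effect: "write" }],
};
```

Here `isRestartSafe(researcher)` is true and `isRestartSafe(writer)` is false, so only the writer needs careful handling around retries.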
Still a case for it:
1. Isolated contexts per role (CS vs. engineering) — agents don't bleed into each other
2. Hard permission boundaries per agent
3. Local models (Qwen) for cheap routine tasks
Multi-agent loses at debugging. But the structure has value.
I've not tested it with architecting a full system, but assuming it isn't good at it today... it's only a matter of time. Then what is our use?
But there are a substantial number of cases where this isn't true. The nitty gritty is then the important part and it's impossible to make the whole thing work well without being intimate with the code.
So I never fully bought into the clean separation of development, engineering and architecture.
Architecture is fine for big, complex projects. Having everything planned out beforehand keeps costs down and ensures the customer will not come with late changes. But if costs are expected to be low and there's no customer, architecture is overkill. Skipping it is like making a movie without following the script line by line (watch Godard in Nouvelle Vague), or building it by yourself or with a non-architect. 2x faster, 10x cheaper. You immediately see an inflexible, over-architected project.
You can do fine by restricting the agent with proper docs, proper tests and linters.
In short: LLMs will eventually be able to architect software. But it’s still just a tool
Wait, I thought product and C level people are so busy all the time that they can’t fart without a calendar invite, but now you say they have time to completely replace whole org of engineers?
But for building the right thing? Doubtful.
Most of a great engineer’s work isn’t writing code, but interrogating what people think their problems are, to find what the actual problems are.
In short: problem solving, not writing code.
What a load of crap.
All you're doing is describing a different job role.
What you're talking about is BA work, and a subset of engineers are great at it, but most are just ok.
You're claiming a part of the job that was secondary, and not required, is now the whole job.
The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s my point.
Is that why most prestigious jobs grilled you like a devil on algos/system design?
> The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s just nonsense. It’s like saying “delivering product was always the most important thing, not drinking water”.
The commercial solutions probably don't work because they don't use the best SOTA models and/or sully the context with all kinds of guardrails and role-playing nonsense, but if you just open a new chat window in your LLM of choice (set to the highest thinking paid-tier model), it gives you truly excellent therapist advice.
In fact in many ways the LLM therapist is actually better than the human, because e.g. you can dump a huge, detailed rant in the chat and it will actually listen to (read) every word you said.
It is easy to convince and trivial to make obsequious.
That is not what a therapist does. There’s a reason they spend thousands of hours in training; that is not an exaggeration.
Humans are complex. An LLM cannot parse that level of complexity.
The tools and reframing that LLMs have given me (Gemini 3.0/3.1 Pro) have been extremely effective and have genuinely improved my life. These things don't even cross the threshold to be worth the effort to find and speak to an actual therapist.
Do you think I could use an AI therapist to become a more effective and much improved serial killer?
An LLM cannot parse the complexity of your situation. Period. It is literally incapable of doing that, because it does not have any idea what it is like to be human.
Therapy is not an objective science; it is, in many ways, subjective, and the therapeutic relationship is by far the most important part.
I am not saying LLMs are not useful for helping people parse their emotions or understand themselves better. But that is not therapy, in the same way that using an app built for CBT is not, in and of itself, therapy. It is one tool in a therapist’s toolbox, and will not be the right tool for all patients.
That doesn’t mean it isn’t helpful.
But an LLM is not a therapist. The fact that you can trivially convince it to believe things that are absolutely untrue is precisely why, for one simple example.
Training LLMs we can do.
Though it might be important for the patient to believe that the therapist is empathizing, so that may give AI therapy an inherent disadvantage (depending on the patient's view of AI).
EDIT: seems like you made the same point in a child comment.
But he still sees a therapist, regularly, because they are not the same and do not serve the same purpose. :)
You will have to find new economic utility. That's the reality of technological progress - it's just that the tech and white collar industries didn't think it could come for them!
A skill that becomes obsoleted is useless, obviously. There's still room for artisanal/handcrafted wares today, amidst the industrial scale productions, so I would assume similar levels for coding.
The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.
Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.
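A small sketch of what "anchoring" a decision in a timestamped markdown file might look like. The naming scheme, the section headings and both function names are assumptions; the comment does not say how the files are actually named or laid out.

```typescript
// Hypothetical naming scheme: ISO date prefix plus a slug, e.g. for docs/decisions/.
function decisionFileName(title: string, when: Date): string {
  const day = when.toISOString().slice(0, 10); // YYYY-MM-DD, in UTC
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return `${day}-${slug}.md`;
}

// Hypothetical file layout: one decision plus whatever is still open.
function decisionFileBody(title: string, decision: string, openQuestions: string[]): string {
  const questions = openQuestions.map((q) => `- ${q}`).join("\n");
  return `# ${title}\n\n## Decision\n${decision}\n\n## Open questions\n${questions}\n`;
}
```

For example, `decisionFileName("Use SQLite for the queue?", new Date("2025-03-01T12:00:00Z"))` returns `2025-03-01-use-sqlite-for-the-queue.md`, so the files sort chronologically in a directory listing.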
A minor difference between myself and the author, is that I don't rely on specific sub-agents (beyond what the agentic harness has built-in for e.g. file exploration).
I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).
One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick-off a codex/claude code session. This can also be helpful for hard to reason about bugs, but I've only done that a handful of times at this point (i.e. funky dynamic SVG-based animation snafu).
Would you please expand on this? Do you make the LLM append their responses to a Markdown file, prefixed by their timestamps, basically preserving the whole context in a file? Or do you make the LLM update some reference files in order to keep a "condensed" context? Thank you.
Each level in the hierarchy is empirically ~5X smaller than the level below. This, plus sharding the design docs by component, helps Claude navigate the project and make consistent decisions across sessions.
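As a back-of-the-envelope illustration of the ~5X figure above: if the bottom level is some number of tokens, each level above it is roughly a fifth of the one below. The starting size in the usage note and the `levelSizes` helper are invented for this example.

```typescript
// Rough estimate only: the ~5x ratio is the commenter's empirical figure,
// and any starting size passed in is the caller's own guess.
function levelSizes(bottomTokens: number, levels: number, ratio: number = 5): number[] {
  const sizes: number[] = [];
  let current = bottomTokens;
  for (let level = 0; level < levels; level++) {
    sizes.push(Math.round(current));
    current /= ratio;
  }
  return sizes;
}
```

For instance, `levelSizes(250000, 3)` returns `[250000, 50000, 10000]`: with an assumed 250,000 tokens at the bottom, the top-level document would be small enough to read in full at the start of every session.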
My workflow for adding a feature goes something like this:
1. I iterate with Claude on updating the requirements doc to capture the desired final state of the system from the user's perspective.
2. Once that's done, a different instance of Claude reads the requirements and the design docs and updates the latter to address all the requirements listed in the former. This is done interactively with me in the loop to guide and to resolve ambiguity.
3. Once the technical design is agreed, Claude writes a test plan, usually almost entirely autonomously. The test plan is part of each design doc and is updated as the design evolves.
3a. (Optionally) another Claude instance reviews the design for soundness, completeness, consistency with itself and with the requirements. I review the findings and tell it what to fix and what to ignore.
4. Claude brings unit tests in line with what the test plan says, adding/updating/removing tests but not touching code under test.
4a. (Optionally) the tests are reviewed by another instance of Claude for bugs and inconsistencies with the test plan or the style guide.
5. Claude implements the feature.
5a. (Optionally) another instance reviews the implementation.
For complex changes, I'm quite disciplined about having each step carried out in a different session so that all communications are done via checked-in artifacts and not through context. For simple changes, I often don't bother and/or skip the reviews.
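The "communicate only via checked-in artifacts" discipline can be checked mechanically. The sketch below lists the workflow steps with the files each session reads and writes, then reports any input that no earlier step produced. The file paths and step names are invented; only the ordering follows the numbered steps above.

```typescript
// Hypothetical step and file names; the point is that each session's only
// inputs are files already checked in by an earlier step.
interface SessionStep {
  id: string;
  reads: string[];
  writes: string[];
}

const featureWorkflow: SessionStep[] = [
  { id: "requirements", reads: [], writes: ["docs/requirements.md"] },
  { id: "design", reads: ["docs/requirements.md"], writes: ["docs/design.md"] },
  // The test plan lives inside the design doc, so this step rewrites it.
  { id: "test-plan", reads: ["docs/design.md"], writes: ["docs/design.md"] },
  { id: "unit-tests", reads: ["docs/design.md"], writes: ["tests/"] },
  { id: "implementation", reads: ["docs/design.md", "tests/"], writes: ["src/"] },
];

// Returns the inputs that no earlier step produced, i.e. places where a
// session would be relying on context instead of a checked-in artifact.
function missingInputs(steps: SessionStep[]): string[] {
  const available = new Set<string>();
  const missing: string[] = [];
  for (const step of steps) {
    for (const input of step.reads) {
      if (!available.has(input)) {
        missing.push(`${step.id} needs ${input}`);
      }
    }
    for (const output of step.writes) {
      available.add(output);
    }
  }
  return missing;
}
```

An empty result from `missingInputs(featureWorkflow)` means every session can start from a clean context with nothing but the repository.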
From time to time, I run standalone garbage collection and consistency checks, where I get Claude to look for dead code, low-value tests, stale parts of the design, duplication, requirements-design-tests-code drift etc. I find it particularly valuable to look for opportunities to make things simpler or even just smaller (fewer tokens/less work to maintain).
Occasionally, I find that I need to instruct Claude to write a benchmark and use it with a profiler to optimise something. I check these in but generally don't bother documenting them. In my case they tend to be one-off things and not part of some regression test suite. Maybe I should just abandon them & re-create if they're ever needed again.
I also have a (very short) coding style guide. It only includes things that Claude consistently gets wrong or does in ways that are not to my liking.
You tell an LLM to create something, and then use another LLM to review it. It might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.
The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
- Karpathy 2025
There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.
Not all AI-assisted programming is vibe coding. If you're paying attention to the code that's being produced you can guide it towards being just as high quality (or even higher quality) than code you would have written by hand.
I like AI-assisted programming, but if I fail to even read the code produced, then I might as well treat it like a no-code system. I can understand the high-levels of how no-code works, but as soon as it breaks, it might as well be a black box. And this only gets worse as the codebase spans into the tens of thousands of lines without me having read any of it.
The (imperfect) analogy I'm working on is a baker who bakes cakes. A nearby grocery store starts making any cake they want, on demand, so the baker decides to quit baking cakes and buy them from the store. The baker calls the store anytime they want a new cake, and just tells them exactly what they want. How long can that baker call themself a "baker"? How long before they forget how to even bake a cake, and all they can do is get cakes from the grocer?
It's insane that this quote is coming from one of the leading figures in this field. And everyone's... OK that software development has been reduced to chance and brute force?
Also, even if agents could do everything, the societal obstacles to change are extensive (sometimes for very good, sometimes for bad reasons), so I’m expecting it to take another year or two for serious change to occur.
Don't most companies use AI in software development today?
And yes, I know that some companies are not doing that because of privacy and reliability concerns or whatever. With many of them it's a bit of a funny argument considering even large banks managed to adopt agentic AI tools. Short of government and military kind of stuff, everybody can use it today.
Could someone chime in and give their opinion on what are the pros and cons of either approach?
My editor supports both modes (emacs). I have the editor integration features (diff support etc) turned off and just use emacs to manage 5+ shells that each have a CLI agent (one of Claude, opencode, amp free) running in them.
If I want to go deep into a prompt then I’ll write a markdown file and iterate on it with a CLI.
Whether I use Antigravity, VS Code with Claude Code CLI, GitHub Copilot IDE plugins, or the Codex app, they all do similar things.
Although I'd say Codex and Claude Code currently often feel significantly better to me, in terms of what they can achieve and how I work with them.
(I have seen obra/superpowers mentioned in the comments, but that’s already too complex and with a UI focus)
https://github.com/marcosloic/notion-agent-hive
Ultimately, it's just a bunch of markdown files that live in an `/agents` folder, with some meta-information that will depend on the harness you use.
So much power in our hands, and soon another Facebook will appear built entirely by LLMs. What a fucking waste of time and money.
It’s getting tiring.
I'm glad it works for the author, I just don't believe that "each change being as reliable as the first one" is true.
> I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.
I agree that knowing the syntax is less important now, but I don't see how the latter claim has changed with the advent of LLMs at all?
> On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.
I think the author is contradicting himself here. Programs written by an LLM in a domain he is not knowledgeable about are a mess. Programs written by an LLM in a domain he is knowledgeable about are not a mess. He claims the latter is mostly true because LLMs are so good???
My take after spending ~2 weeks working with Claude full time writing Rust:
- Very good for language level concepts: syntax, how features work, how features compose, what the limitations are, correcting my wrong usage of all of the above, educating me on these things
- Very good as an assistant to talk things through, point out gaps in the design, suggest different ways to architect a solution, suggest libraries etc.
- Good at generating code that looks great at first glance, but has many unexplained assumptions and gaps
- Despite lack of access to the compiler (Opus 4.6 via Web), most of the time code compiles or there are trivially fixable issues before it gets to compile
- Has a hard to explain fixation on doing things a certain way, e.g. always wants to use panics on errors (`panic!`, `unreachable!`, `.expect` etc) or wants to do type erasure with `Box<dyn Any>` as if that was the most idiomatic and desirable way of doing things
- I ended up getting some stuff done, but it was very frustrating and intellectually draining
- The only way I see to get things done to a good standard is to continuously push the model to go deeper and deeper regarding very specific things. "Get x done" and variations of that idea will inevitably lead to stuff that looks nice, but doesn't work.
So... imo it is a new generation compiler + code gen tool that understands human language. It's pretty great and at the same time it tires me in ways I find hard to explain. If professional programming going forward means just talking to a model all day every day, I would probably look for other career options.
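To make the panic and type-erasure point from the list above concrete, here is a minimal sketch of the alternative I would probably push the model towards: returning `Result` instead of calling `.expect`, and using a closed enum instead of `Box<dyn Any>`. Everything in it (`parse_port`, `ParseError`, `Value`, `describe`) is a hypothetical name made up for illustration, not code from the project I was working on.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical value type: a closed enum instead of `Box<dyn Any>`,
/// so the compiler checks that every variant is handled.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ParseError {
    Empty,
    BadInt(ParseIntError),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::Empty => write!(f, "input was empty"),
            ParseError::BadInt(e) => write!(f, "invalid integer: {e}"),
        }
    }
}

impl std::error::Error for ParseError {}

impl From<ParseIntError> for ParseError {
    fn from(e: ParseIntError) -> Self {
        ParseError::BadInt(e)
    }
}

/// Parses a port number and reports failure to the caller
/// rather than panicking via `.expect()`.
fn parse_port(input: &str) -> Result<u16, ParseError> {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return Err(ParseError::Empty);
    }
    // `?` converts ParseIntError into ParseError through the From impl.
    Ok(trimmed.parse::<u16>()?)
}

/// Exhaustive match: adding a new variant to `Value` is a compile error
/// here until it is handled, which a downcast on `dyn Any` cannot give you.
fn describe(value: &Value) -> String {
    match value {
        Value::Int(n) => format!("int {n}"),
        Value::Text(s) => format!("text {s:?}"),
    }
}

fn main() {
    match parse_port("8080") {
        Ok(port) => println!("port {port}"),
        Err(e) => eprintln!("error: {e}"),
    }
    println!("{}", describe(&Value::Int(3)));
    println!("{}", describe(&Value::Text("hi".to_string())));
}
```

Whether this is the right trade-off depends on the codebase; a panic is still reasonable for genuinely unreachable states. The point is only that the caller gets to decide, and that is the kind of specific instruction I had to keep repeating.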