Build a Basic AI Agent from Scratch: Long Task Planning (medium.com)

135 points by ruxudev 3 days ago | 14 comments

athrowaway3z a day ago
I've tried most forms of planning - from the basic AGENTS.md guide to keeping ./dev/ plan files, todo list tools, sqlite db with both minimal and extensive tracking, etc.
None of them have been worth it. A year ago the models needed to be reminded. Today they can follow a plan from text alone. This is my experience from working on a project alone - in teams ... i actually think the same lesson holds in the new AI paradigm.
My current scheme is basically this - in order of the task's complexity:
- Tell an agent to do something
- Tell an agent to make a plan then tell it to execute on it.
- Tell an agent to make a plan, write to a file, have a subagent review it, then execute it.
- Do the above, but instead tell the agent they're in a supervise mode and to have subagents implement as many phases and roll over with a handoff.md while they, as the supervisor agent, keep driving the task to completion.
The latter two i have under a sigil so they're prepared prompts i can inject with a few keystrokes.
If i feel very fancy i'll tell them to update the plan with a checklist and add checkboxes, but it just doesn't pay enough to have 'init-prompt' level planning feature or tools if in the same context you already have files/read/write.
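For readers who want to see what the "plan file with checkboxes" variant amounts to, here is a minimal sketch. It assumes the plan is an ordinary Markdown file with `- [ ]` / `- [x]` items; the function names `next_unchecked` and `mark_done` are hypothetical and not taken from the article or from any tool mentioned in this thread.

```python
import re

# Matches Markdown checklist items such as "- [ ] write tests" or "* [x] done".
CHECKBOX = re.compile(r"^(\s*[-*] \[)( |x|X)(\] )(.*)$")


def next_unchecked(plan_text: str):
    """Return (line_index, task_text) for the first unchecked item, or None."""
    for i, line in enumerate(plan_text.splitlines()):
        m = CHECKBOX.match(line)
        if m and m.group(2) == " ":
            return i, m.group(4)
    return None


def mark_done(plan_text: str, line_index: int) -> str:
    """Return the plan text with the checkbox on line_index ticked."""
    lines = plan_text.splitlines()
    m = CHECKBOX.match(lines[line_index])
    if m is None:
        raise ValueError(f"line {line_index} is not a checklist item")
    lines[line_index] = f"{m.group(1)}x{m.group(3)}{m.group(4)}"
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    plan = "# Plan\n- [x] read code\n- [ ] write tests\n- [ ] refactor\n"
    index, task = next_unchecked(plan)
    print("next task:", task)
    print(mark_done(plan, index))
```

This is about all the machinery the checklist needs, which fits the point above: an agent that can already read and write files can do the same thing with plain text edits.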
- bensyverson a day ago
  A while back I created a tool called Jobs [0] to help with this workflow. The pattern is:
  1. Have a conversation with a smart model (Opus/Fable) about what you're building. Go back & forth until you've ironed out the important architectural choices (deciding what to build).
  2. Ask the model to write up its plan in a Markdown doc, including a structured plan in YAML format (telling it to consult `job schema`).
  3. Clear the context and tell a leaner model (Sonnet/Opus) to read the plan doc and then pick up the task via `job status`.
  From there, the CLI helps the agent take the next step. I designed the `job` CLI through extensive iteration with agents, conducting user-centered design with the agents to make it as smooth and intuitive to them as possible.
  When context gets full, you can pause, clear, and pick right back up. Using Jobs (or other tools like it), you can take on large, ambitious plans and keep the agents on-task the entire time.
  [0]: https://github.com/bensyverson/jobs
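The general idea of a structured plan that a fresh context can pick up can be sketched as follows. This is not the `job` CLI or its schema (neither is reproduced here); the JSON layout, the field names `id`, `needs`, `done`, and the functions `ready_tasks` and `status_prompt` are all assumptions made for illustration. JSON is used instead of YAML only to keep the example within the standard library.

```python
import json

# Hypothetical plan document; the layout is an assumption, not a real tool's schema.
PLAN_JSON = """
{
  "tasks": [
    {"id": "schema", "title": "Design the database schema", "needs": [], "done": true},
    {"id": "api", "title": "Implement the API layer", "needs": ["schema"], "done": false},
    {"id": "ui", "title": "Build the UI", "needs": ["api"], "done": false},
    {"id": "docs", "title": "Write the docs", "needs": ["schema"], "done": false}
  ]
}
"""


def ready_tasks(plan: dict) -> list:
    """Return unfinished tasks whose dependencies are all finished."""
    done = {t["id"] for t in plan["tasks"] if t["done"]}
    return [
        t for t in plan["tasks"]
        if not t["done"] and all(dep in done for dep in t["needs"])
    ]


def status_prompt(plan: dict) -> str:
    """Build a short status line that a fresh context could start from."""
    total = len(plan["tasks"])
    finished = sum(1 for t in plan["tasks"] if t["done"])
    ready = ready_tasks(plan)
    if not ready:
        return f"{finished}/{total} tasks done. Nothing is ready to start."
    nxt = ready[0]
    return f"{finished}/{total} tasks done. Next: [{nxt['id']}] {nxt['title']}"


if __name__ == "__main__":
    print(status_prompt(json.loads(PLAN_JSON)))
```

Because the status line is derived from the plan file alone, clearing the context loses nothing that the next session needs to find its place.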
- syspec a day ago
  Did you read the article?
  It's not about enhancing Claude. This article is about creating your own agent, and giving it the ability to create plans and task lists for itself.
  The way Claude Code creates plans and task lists for itself.
  The article is about creating that in your own harness for things not using claude code, like say a custom LLM integration in your own web app.
  - athrowaway3z 10 hours ago
    I'm not talking about enhancing claude either. The article opens saying they already implemented read/write for files. My comment is saying putting tools for planning in an agent's context is less useful than you might think.
- manishsharan a day ago
  Please don't take offense to this very dumb question:
  Why can't you do the planning? Figure out what needs to be done, break it down into small tasks and then ask the agent to execute those small tasks?
  When we executed projects in the past, this is what I would do as a lead: figure out the overall software architecture and delegate the tasks to developers.
  This way I always knew how the system worked and could extend it as needed. I am not in a development role anymore but I am trying to understand why we are delegating planning and software architecture to coding agents?
  - nl a day ago
    The kinds of detailed (and excellent) plans Opus or Fable can generate on our large code base would take me maybe 1-2 days to work through and they do it in 10-20 minutes.
    Maybe I spend 2-4 hours reviewing it, checking things with colleagues etc.
    Then I press "go" and maybe an hour later I have a tested system ready for manual review.
    Its plans are at least as good as any I've seen. Their weakness is if there are unstated assumptions I have about how things need to be done, so most of my time is now getting those assumptions stated properly and then reviewing.
    Why wouldn't I use this? It's the best tool I've used in my 30 years of professional programming.
    DenisM a day ago
    Did you manage to setup a discussion with the agent to reveal such assumptions? Sometimes the shave wrong unstated assumptions when contradicted by evidence, but if we’re talking about a plan for the future the evidence is thin.
    nl 15 hours ago
    > Did you manage to setup a discussion with the agent to reveal such assumptions?
    This is what the plan review is for.
    Usually it will have something like "modify abc.ts to update the widget number in the wnx collection" and I'm "hang on - why does that need to be updated when XYZ" and that subsequent discussion will reveal assumptions that are not shared.
    deadbabe a day ago
    Cognitive debt
    nl 15 hours ago
    Non sequitur
  - athrowaway3z 9 hours ago
    I think we'd be talking past each other in terms of what "planning" means, but i wrote this anyways:
    You're wrong about what i mean with delegating architecture to coding agent.
    I'll let the coding agent take the first shot at it, already having in my mind a decent idea about how i'd do that. Worst case it's wrong and i need to correct it; more than half the time it comes up with the same sort of design, and sometimes it comes up with a better alternative.
    Additionally, the same pattern of "sometimes wrong, mostly good, sometimes better" also plays out wrt naming things. I thought i was decent at naming things, but an LLM is literally built on turning 'concepts' in a vector space into words.
    And in a very real way the names it's choosing will 'compress' the ideas so that the next time an LLM reads it, it is more likely to understand.
    For this to work though you need your complete system accessible and well structured.
    You say "I always knew how the system worked and could extend it as needed". If an AI can't learn how your system works then that's a problem with the system setup, not the AI. An AI can find its way in the linux kernel or chromium source code just fine.
    If you're in a role where you only spend time planning / architecture, then i assume things are pretty gnarly to begin with. The thing i can only guess at - and which is on a spectrum - how much of our role exists to support the weight of accidental vs essential complexity.
    i.e. can the engineers not do the planning because: they're not that good, or it's very broad things that need to expertly interplay with each other, or because the org has a mountain of buried bodies.
    In my experience some of the more fanatic AI people are blind to the mountain of buried bodies covering a lot of essential complexity, but others can be blind to how well AI works when you can just shoot off a prompt to unbury a body and actually reduce the debt.
    But in one sentence:
    > Why can't you do the planning ?
    This way lets me do more planning - planning is basically all i do now.
  - panarky a day ago
    I could do the planning but I don't, for the same reason that I could write the source code but I don't, for the same reason that I could write the machine code but I don't.
  - evilturnip a day ago
    This is more or less what I do. Then again, I work on small parts of the codebase at a time, so maybe the autonomous agent works better when you're doing larger refactors over large codebases.
    Even in that situation, I think I would still only feel comfortable approaching the task as I would do it without AI, and using the AI to accelerate the parts that would be time-consuming. E.g. finding where/how feature X is implemented, how it would affect the overall system if I were to change it this way, etc.
  - nnnnico a day ago
    whatever you delegated in the past probably also required planning by the engineer that went down and got it done, most planning done by agents is at this same level: the agent explores the codebase, understands where to touch, tradeoffs, code-level architecture, and asks the user for more context or balances with assumptions and other patterns already present in code
  - noodletheworld a day ago
    People get defensive when you ask this, because they think you’re saying they’re being lazy.
    …but it’s more than just that (in most cases; I am just lazy sometimes); but fundamentally there’s a limit to how much complexity people can comprehend.
    We are good at working at high level abstractions, modules with clear apis that can be strung together into some kind of feature.
    You don’t need to look inside the black box of the module if you trust the implementer; I’ve never opened up the internals of a calendar to be like “how does this work?”. I just don’t care. It’s a calendar. I use the api.
    I think most people are using these tools in this way; very few people are having an agent write a plan, then a sub agent review it, no human in the loop. Those are for prototypes and for yolo cowboys using open claw and playing with their phones instead of working; we have a few at work, but their PRs are regularly rejected as slop.
    …but, realistically; many people aren’t software architects. They may not even know coding patterns, forget architecture patterns.
    Having an agent spit out generic software architecture is probably better than what they were producing before.
    Writing a module / feature using generic architecture and planning is probably better than random code spaghetti right?
    It’s easy to lament the loss of craft here, but at the end of the day, the models today do an ok job of this. The models of tomorrow will probably be better at it than many people.
    Architecture is easy compared to actually implementing things. You just wave your hands from your ivory tower and say “more event sourcing”.
    evilturnip a day ago
    "Having an agent spit out generic software architecture is probably better than what they were producing before."
    If they were a poor programmer/architect, I don't think the AI would make the end result any better. It would amplify their lack of skill. Sure, the low-level code might be more airtight and idiomatic, but that's not even where poor skill really manifests itself. It's at the higher level of thinking in terms of the system and understanding the proper context of the business/technology, etc.
    noodletheworld 16 hours ago
    This simply isn't true anymore.
    High level generic advice from agents is often, in my experience, significantly better, unmodified, than doing nothing.
    Obviously it's better to do it properly, but you know… opus 4.8 is a pretty great model.
    You might be surprised at the quality of the planning, architecture and task breakdown that a simple prompt with some context hints can give you.
    …at the end of the day, if I’m working with someone and they give me 6/10 plans based on AI instead of stupid/10 plans they dreamed up, or 0/10 plans they didn't even bother (or were in too much of a hurry) to write; I’ll take it.
    Tragedy of the commons? /shrug
    You gotta be pragmatic. It turns subpar contributors into useful contributors.
charles_f a day ago
> In my case, I asked it to migrate my static site from using Eleventy to Hugo
This blog is on medium so I guess the migration went sideways!
Joke aside, nice series of tutorials, don't let the haters get to you. I think with the current token panic it might get handy soon
jdw64 a day ago
I don't understand why people criticize this post. When you run a homepage or a blog, it's unavoidable to write script style code. Even if the quality is a bit low, that's the limit within a tutorial. Because if you go into actual design, things like boundaries, policies, error handling, and so on require a lot of prior knowledge. So when certain knowledge is needed, you can only post something as a simple runnable script.
For example, if I were building real software, I would design everything from policy to error logging policies and so on. But when writing a blog post, it's just simplified into a short runnable script.
Havoc a day ago
What’s with all the aggression here. Not very hn
- int3trap a day ago
  1. People don't like medium, rightly so.
  2. The content is lower quality.
  - jdw64 a day ago
    I find it hard to agree with the point that the content quality is low. Of course, that design does have some issues. But it is still valuable and worth reading.
    The strengths are that the design forces Chain of Thought as a memory buffer and the TODO list in an FSM style. I think those are fine. The recovery strategy is also pretty good.
    However, the problem is that the business logic does not run as Python code but lives inside the prompt. And it does not support parallel execution. But as a single run script, it is helpful enough for understanding the concept.
    Of course, if I were to do the code properly, I would use a separate storage instead of in memory, and more carefully verify tool constraints and the actual scope limitations of the tools. But still, I think this is helpful enough.
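As a rough illustration of the "TODO list in an FSM style" point, and of moving that logic out of the prompt and into Python, here is a minimal sketch. The state names, the `ALLOWED` transition table and the `TodoItem` class are assumptions for illustration; the article's own implementation may differ.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    FAILED = "failed"


# Transition table enforced in code rather than described in the prompt.
ALLOWED = {
    State.PENDING: {State.IN_PROGRESS},
    State.IN_PROGRESS: {State.DONE, State.FAILED},
    State.FAILED: {State.IN_PROGRESS},  # a failed item may be retried
    State.DONE: set(),                  # terminal state
}


@dataclass
class TodoItem:
    title: str
    state: State = State.PENDING
    attempts: int = 0

    def transition(self, new_state: State) -> None:
        """Move to new_state, rejecting any transition not in ALLOWED."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        if new_state is State.IN_PROGRESS:
            self.attempts += 1  # count every start, including retries
        self.state = new_state
```

With this shape, a tool call from the model that tries to mark a pending item as done gets a `ValueError` back instead of silently corrupting the list, so the harness does not depend on the model following the rules stated in the prompt.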
    hilariously a day ago
    The recovery strategy in my mind would be what to do in case of a crash, which would just wipe out all the context here (scratch pad, todo list, etc) - it doesn't seem very recoverable.
    jdw64 a day ago
    This is the difficult part of programming debates. What you mentioned is about the TODO list disappearing immediately when Python shuts down, right? What I was talking about is the point where the LLM retries when something goes wrong due to a mistake in the previous task. Actually, that's why I included the sentence 'If I were to do the code properly, I would use a separate storage instead of in memory.' I guess I unintentionally caused some confusion.
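To make the "separate storage instead of in memory" idea concrete, here is a minimal sketch of crash-tolerant state for an agent. It assumes the todo list and scratch pad fit in one JSON file; the file layout and the names `save_state` and `load_state` are hypothetical. This covers recovery after the process dies, which is a different concern from the LLM retrying a failed step.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state atomically: write a temp file, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic on the same filesystem, so a crash leaves
        # either the old file or the new one, never a half-written file.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_state(path: Path) -> dict:
    """Load saved state, or return an empty state on first run."""
    if not path.exists():
        return {"todo": [], "scratchpad": ""}
    return json.loads(path.read_text(encoding="utf-8"))
```

Calling `save_state` after every tool call or todo update means a restart can call `load_state` and continue from the last completed step.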
    hilariously a day ago
    Yeah, I clicked through and saw that prompting but I would consider that more of a retry mechanism and wanted to clarify.
    jdw64 a day ago
    You are right. I am not being critical of you. I just wanted to say that I wrote my comment in a somewhat confusing way. English is not my native language, so it might have come across as a bit harsh.
    hilariously a day ago
    No harshness detected, and yeah, even when everyone is speaking the same language the jargon is always hyper specific.
  - ramon156 a day ago
    I agree with 1, same for substack. bearblog seems cool tho
    I don't think the content is low quality, though.
  - cmrdporcupine a day ago
    Seems odd that it would get upvoted to the front page then in the first place?
b800h a day ago
Why do people use Medium?
- jdw64 a day ago
  At least Medium's algorithm shows it to users within Medium. A personal homepage doesn't get picked up well by SEO, and unless it becomes famous, you can't see any comments from people. Just like my homepage (makonea.com) that no one visits.
- antonvs a day ago
  Because it gives them a way to post articles for free? What should they use instead, your highness?
  Why do people post comments like this?
  - msdz a day ago
    What a strange comment.
    The original post is also available at the poster’s own blog [1], so the question is a very valid one. Clearly, “posting articles for free” is a hurdle already cleared by the author.
    [1] https://www.ruxu.dev/articles/ai/build-an-ai-agent-planning/
    antonvs a day ago
    Someone else explained it better, if you genuinely don't understand:
    https://news.ycombinator.com/item?id=48489337
    msdz an hour ago
    That is in fact a better explanation due to bringing up different reasons (zero cost to host as you mentioned, vs. network/visibility out-of-the-box in the linked comment).
andai a day ago
What's the point of the scratch pad? Isn't the same data already in the context? Or does it help because contexts are lossy and bias towards the start and end?
Similar question with the to-do list. Do they actually help task completion? Is there any research on that? I think they're less helpful with more recent models, but maybe they still help with smaller ones?
The system prompt asking it to make a plan before starting work does sound helpful though. (Of course it would also be great to see numbers there :)
- ELRayano 20 hours ago
  Hi, there are a few benefits to using a scratchpad or any other external platform:
  - Be agnostic of the LLM you use. Tomorrow, if the prices of the LLM you use explode, you can switch to another LLM by pointing it at the scratchpad repository you have. Then, depending on the level of verbosity you kept in the scratchpad (or other store), you'll avoid losing time re-explaining everything to the new LLM.
  - You can avoid the "summarized" effect produced by context compaction events. This effect makes accurate, and so potentially important, information a bit more blurry (numbers turned into adjectives, etc.). A scratchpad, Obsidian, or any other external solution you might imagine would act as "case fact blocks", which are a recommended way to mitigate this effect and keep the accurate information available. You can imagine a system where you ask your LLM to read some files from your external storage after each compaction, for example with a hook or anything else.
  Regarding the todo list, from my pov, it's just a basic principle of work segmentation and accuracy with some traceability. You are better off when you can divide your work into chunks that can be followed individually rather than as one huge block of work. That can also be used within the "ralph wiggum" loop pattern that might help the LLM to get a goal and thus iterate until goal completion. There are a few articles explaining the concept if that interests you.
  Hope it helps a bit !
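The "case fact blocks" idea above can be sketched roughly like this: exact values are appended to a file outside the context window and rendered back verbatim after a compaction. The JSONL layout, the function names and the example keys are hypothetical, and how the block gets re-injected (a hook, a tool call, a system message) depends on the harness and is not shown.

```python
import json
from pathlib import Path


def record_fact(path: Path, key: str, value: str) -> None:
    """Append one exact fact to the scratchpad file as a JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")


def load_facts(path: Path) -> dict:
    """Read all facts; a later entry for the same key overrides an earlier one."""
    facts = {}
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                entry = json.loads(line)
                facts[entry["key"]] = entry["value"]
    return facts


def facts_block(path: Path) -> str:
    """Render the facts as a text block to re-inject after a compaction."""
    lines = [f"- {key}: {value}" for key, value in load_facts(path).items()]
    return "Case facts (verbatim, do not paraphrase):\n" + "\n".join(lines)
```

Since the file stores the literal values, a summary that turned a number into an adjective cannot change what the model sees on the next read, and the same file can be handed to a different model.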
chattermate a day ago
[flagged]
eugmai86 a day ago
[dead]
swordlucky666 a day ago
[dead]
volume_tech a day ago
[flagged]
niggischiggi a day ago
[flagged]
- paulluuk a day ago
  That seems pretty harsh. How do new frontend frameworks, GPU shaders or another article about how great Rust is (which it is) help fight climate change or child starvation?
  - trollbridge a day ago
    Since the migration from setuptools -> poetry -> uv -> full Rust, I think my computer burns up less energy (not to mention all the CI/CD pipelines) from running slow tools over and over. So that's a win for Rust there.
  - reactordev a day ago
    The point they were making sarcastically is that this doesn’t.
  - bcjdjsndon a day ago
    > great Rust is (which it is)
    They just took undefined behaviour and called it unsafe. They've not really solved anything. Even their own std lib has security bugs in unsafe code.
    And their only ever retort is "there are thousands of these bugs a day in c code"... Let's wait until rust gets used seriously in the systems and embedded space first, no point comparing c to minnows like rust when it comes to total cves.
    purpleflashing a day ago
    Security and safety are two different things.
    bcjdjsndon a day ago
    Clutching at straws, as is typical of a rustacean
- jack_pp a day ago
  how many datacenters / computers are there running millions of hours of computer games? Why is escaping reality and damaging climate with compute better than using LLMs?
  - CTDOCodebases a day ago
    Because it's more fun?
    Being serious this is a silly line of reasoning. Maybe they are both bad? It's like asking why is it bad to light a forest fire when there is a forest fire already burning.
    I take issue with the cognitive dissonance too though. HN became very hostile to Bitcoin but took no issue with people gaming on PCs and consoles that were consuming more and more electricity each year. Now everyone is silent on all fronts because LLMs make their job easy and give them something new and interesting to play with.
    jack_pp a day ago
    Exactly, people like to cherry pick. Do we need facebook, do we need instagram? Do we need on-demand always available 4k streaming?
    The singularity argument makes more sense than the environmental one.
    Short of all of us living like the amish, we're hypocrites for pointing out one "waste" over another.
    CTDOCodebases 16 hours ago
    Yeah I don’t know if I come to the same conclusion as you.
    There is no probabilistic way to determine if the singularity will be a positive or negative for humanity. Climate change is a net negative for humanity.
    Regardless it’s clear what course we are on.
- pixel_popping a day ago
  Yes, the world does need more.
- antonvs a day ago
  Go find another website to spew your nonsense on.
mxkopy a day ago
Jesus the terminology is so fucked… compare the contents of this blog post with any RL paper containing the words “long term planning”…
aafaqzahid a day ago
Are people using medium in 2026?
elxr a day ago
Code tutorial on medium (whose formatting is absolutely not meant for this)?
Please stop posting.
- preommr a day ago
  elaborate?
  It's using code blocks that have language highlighting, and the appropriate whitespacing.
  What's the problem?