One use case I imagine is key here is background/async agents, OpenAI Codex/Jules style. It's great if I can run them durably with Pickaxe (btw I believe I've read somewhere in the Temporal docs or some webinar that Codex was built on that ;)), but how do I get a real-time, resumable message stream back to the client? The user might reload the page or return after 15 minutes, etc. I haven't been able to think of an elegant way to model this in a distributed system.
agent->client streaming is on the very short-term roadmap (order of weeks), but we haven't broadly rolled it out since it's not 100% ready for prime time.
we do already have wait-for-event support for client->agent eventing [1] in this release!
My use case: Cursor for open-source terminal-based coding agents.
Depending on execution order, the tool is either called or a cached value is returned. That way local state can be replayed, which is why the "no side effects" rule is in place.
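Roughly, the mechanic looks like this (a minimal sketch of the replay idea only, not Pickaxe's actual internals; `durableCall`, `searchTool`, and `summarize` are made-up names):

```typescript
// Minimal sketch of the replay idea: each durable call is keyed by its
// position in the run, so on replay the cached result is returned instead
// of re-executing the side effect.
type EventLog = Map<number, unknown>;

async function durableCall<T>(
  log: EventLog,
  step: number,
  effect: () => Promise<T>,
): Promise<T> {
  if (log.has(step)) return log.get(step) as T; // replay from history
  const result = await effect();
  log.set(step, result); // a real engine persists this durably
  return result;
}

// Hypothetical tools, stubbed for the example.
const searchTool = async (q: string) => [`doc about ${q}`];
const summarize = async (docs: string[]) => docs.join("; ");

// The agent body must be deterministic between retries: every side effect
// (tool calls, LLM calls, clock reads, randomness) goes through durableCall.
async function agentRun(log: EventLog) {
  const docs = await durableCall(log, 0, () => searchTool("pickaxe"));
  return durableCall(log, 1, () => summarize(docs));
}
```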
I like it. Just, what's the recommended way to have a chat assistant agent with multiple tools? The message history would need to be passed to the very top-level agent.run call, wouldn't it?
we'll be continuously improving the docs on this project, but since pickaxe is built on hatchet it supports concurrency [1]. so for a chat use case, you can pass the chat history to the top-level agent but propagate cancellation to other message runs in the session, to handle the case where the user sends a few messages in a row. we'll work an example into the patterns section for this (rough sketch of the pattern below)!
[1] https://docs.hatchet.run/home/concurrency#cancel-in-progress
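for illustration, here's the keyed cancel-in-progress pattern expressed as plain application code (all names here are made up; with hatchet, the concurrency strategy in [1] does this cancellation for you instead of this hand-rolled bookkeeping):

```typescript
// all names below are hypothetical; this only shows the shape of the pattern
type ChatMessage = { role: "user" | "assistant"; content: string };

declare const chatAgent: {
  run: (
    input: { history: ChatMessage[] },
    opts: { signal: AbortSignal },
  ) => Promise<string>;
};

// at most one in-flight run per chat session
const inFlight = new Map<string, AbortController>();

async function onUserMessage(sessionId: string, history: ChatMessage[]) {
  // if the user sends another message, cancel the previous run for this session
  inFlight.get(sessionId)?.abort();

  const controller = new AbortController();
  inFlight.set(sessionId, controller);
  try {
    // the full chat history goes to the top-level agent run
    return await chatAgent.run({ history }, { signal: controller.signal });
  } finally {
    if (inFlight.get(sessionId) === controller) inFlight.delete(sessionId);
  }
}
```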
The reason I ask is that I've had a lot of success using different models for different tasks, constructing the system prompt specifically for each task, and also choosing between the "default" long assistant/tool_call/user/(repeat) message history vs. constantly pruning it (bad for caching but sometimes good for performance). And it would be nice to know a library like this allows experimenting with these strategies.
under the hood we're using the vercel ai sdk to make tool calls, so this is easily extended [1]. this is the only "opinionated" api for calling llm apis that's "bundled" within the sdk, and we were torn on how to expose it for this exact reason, but since it's so common we decided to include it.
some things we were considering are overloading `defaultLanguageModel` with a map for different use cases, or allowing users to "eject" the tool picker and customize it as needed. i've opened a discussion [2] to track this (rough sketch of the map idea below).
[1] https://github.com/hatchet-dev/pickaxe/blob/main/sdk/src/cli...
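to make that concrete, here's roughly what a per-use-case model map could look like when calling the ai sdk directly (a sketch assuming the ai sdk v4 tool api; the `models` map and `readFile` tool are made up, and this isn't a shipped pickaxe api):

```typescript
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

// hypothetical per-use-case model map, roughly the `defaultLanguageModel`
// overload idea from above
const models = {
  routing: openai("gpt-4o-mini"),
  coding: anthropic("claude-3-5-sonnet-20241022"),
};

async function main() {
  const result = await generateText({
    model: models.coding,
    tools: {
      readFile: tool({
        description: "Read a file from the workspace",
        parameters: z.object({ path: z.string() }),
        execute: async ({ path }) => `// contents of ${path} (stubbed)`,
      }),
    },
    maxSteps: 3, // allow a tool call followed by a final answer
    prompt: "Summarize src/index.ts",
  });
  console.log(result.text);
}

main();
```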
Due to how fast AI providers are iterating on their APIs, many features arrive in the AI SDK weeks or months later (support for OpenAI computer use has been pending forever, for example).
I like the current API where you can wait for an event. Similar to that, it would be great to have an API for streaming and receiving messages where everything else is left to the developer, so they could use the AI SDK and stream the final response manually.
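Something like this, where the library hands me the message history/events and I handle the streaming myself with the AI SDK (a rough sketch assuming a Next.js-style route handler):

```typescript
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai("gpt-4o-mini"), // example model choice
    messages,
  });

  // forward the token stream straight to the client; surviving a page reload
  // would still need the resumable stream support discussed above
  return result.toTextStreamResponse();
}
```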
We've heard pretty often that durable execution is difficult to wrap your head around, and we've also seen more of our users (including experienced engineers) relying on Cursor and Claude Code while building. So one of the experiments we've been running uses our MCP server to ensure that agent code written by LLMs is durable, by having the coding agents follow best practices while generating code: https://pickaxe.hatchet.run/development/developing-agents#pi...
Our MCP server is super lightweight and basically just tells the LLM to read the docs here: https://pickaxe.hatchet.run/mcp/mcp-instructions.md (along with some tool calls for scaffolding)
I have no idea if this is useful or not, but we were able to get Claude to generate complex agents that followed durable execution best practices (no side effects or non-determinism between retries), which we viewed as a good sign.
(No connection to pickaxe.co other than using the platform)
- https://www.anthropic.com/engineering/building-effective-age...
- https://github.com/humanlayer/12-factor-agents
That's also why we implemented pretty much all relevant patterns in the docs (e.g. https://pickaxe.hatchet.run/patterns/prompt-chaining).
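For reference, prompt chaining in its simplest form is just sequential LLM calls with a gate between steps. Here's a plain, non-durable sketch using the Vercel AI SDK; the linked pattern wraps each step as a retryable task instead:

```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// each step's output feeds the next, with a simple quality gate in between
async function writePost(topic: string) {
  const model = openai("gpt-4o-mini"); // example model choice

  const outline = await generateText({
    model,
    prompt: `Write a short outline for a post about ${topic}.`,
  });

  // gate: bail out early if the outline is unusable
  if (outline.text.length < 40) throw new Error("outline too thin, retry");

  const draft = await generateText({
    model,
    prompt: `Expand this outline into a post:\n${outline.text}`,
  });

  return draft.text;
}
```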
If there's an example or pattern that you'd like to see, let me know and we can get it released.
we're leaning away from being a framework in favor of being a library, specifically because we're seeing teams looking to implement their own business logic for most core agentic capabilities, where things like concurrency, fairness, or resource contention become problematic (think many agents reading 1000s of documents in parallel).
unlike most frameworks, we built the orchestrator, hatchet, first and have been working on it for over a year, and we're basing these patterns on what we've seen our most successful companies already doing.
in short - pickaxe brings the orchestration and the best practices, but you're free to implement things to your own requirements.
At a high level, agents in Pickaxe are just functions that execute durably, and you write the function for their control loop. With agent-kit, agents execute in a fully "autonomous" mode where they automatically pick the next tool. In our experience that isn't how agents should be architected (you generally want them to be more constrained than that, even for somewhat autonomous agents).
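Concretely, the difference is whether you write the loop yourself. A rough sketch of the constrained style (hypothetical helpers, not the literal Pickaxe API):

```typescript
// the function decides the order and the allowed tools at each step,
// instead of letting the model pick the next tool freely
type Ticket = { id: string; body: string };

declare function classify(body: string): Promise<"bug" | "question">;
declare function searchDocs(query: string): Promise<string[]>;
declare function draftReply(body: string, context: string[]): Promise<string>;

async function supportAgent(ticket: Ticket) {
  // step 1: always classify first; the model can't skip this
  const kind = await classify(ticket.body);

  // step 2: only questions get a docs lookup; bugs go straight to a reply
  const context = kind === "question" ? await searchDocs(ticket.body) : [];

  // step 3: draft the reply with whatever context we gathered
  return draftReply(ticket.body, context);
}
```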
Also to compare Inngest vs Hatchet (the underlying execution engines) more directly:
- Hatchet is built for stateful container-based runtimes like Kubernetes, Fly.io, Railway, etc. Inngest is a better choice if you're deploying your agent into a serverless environment like Vercel.
- We've invested quite a bit more in self-hosting (https://docs.hatchet.run/self-hosting), open source (MIT licensed), and benchmarking (https://docs.hatchet.run/self-hosting/benchmarking).
Can also compare specific features if there's something you're curious about, though the feature sets overlap quite a bit.
We spent a long time optimizing the single-task FIFO use case, which is what we typically benchmark against. Performance for that pattern is I/O-bound at > 10k tasks/s, which is a good sign (we just need better disks). So a pure durable-execution workload should run very performantly.
We're focused on improving multi-task and concurrency use-cases now. Our benchmarking setup recently added support for those patterns. More on this soon!