Also, do you see Boxes supporting OpenCode and self-hosted/local models in the future? If the rented machines have enough RAM and GPU access, it seems like there could be an interesting path toward a model-agnostic platform rather than being tied to the frontier labs.
Right now, we use:
- Kimi K2.5 for easy fixes, asking about the code, various agentic commands (e.g., summarizing Loom videos for Slack messages)
- Opus 4.8, Sonnet, or Kimi for planning (we find GPT-5.5 to have too terse outputs for plans)
- Kimi K2.5, Composer 2.5, GPT-5.4 mini, etc. for faster implementation (i.e. we don't have to wait around for the slower tokens-per-second generation on Sonnet, etc.)
If we could only use Opus, Sonnet, and Haiku, I'd definitely be looking to switch harnesses.
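For anyone wondering what "model-agnostic" routing like the list above could look like in code, here is a minimal sketch. The task category names and the `pick_model` helper are hypothetical, not part of any harness's API, and it assumes the "Kimi" listed for planning means Kimi K2.5.

```python
# Hypothetical routing table: task kind -> models in order of preference.
# Model labels come from the list above; category names are made up.
ROUTES = {
    "easy_fix": ["Kimi K2.5"],
    "code_question": ["Kimi K2.5"],
    "planning": ["Opus 4.8", "Sonnet", "Kimi K2.5"],
    "fast_implementation": ["Kimi K2.5", "Composer 2.5", "GPT-5.4 mini"],
}


def pick_model(task_kind, available):
    """Return the first preferred model for task_kind that is in `available`."""
    for model in ROUTES.get(task_kind, []):
        if model in available:
            return model
    raise LookupError(f"no available model for task kind {task_kind!r}")


if __name__ == "__main__":
    # With only two models reachable, planning falls through to Sonnet.
    print(pick_model("planning", {"Sonnet", "Kimi K2.5"}))
```

The point is that the preference order lives in data, so swapping providers means editing a table rather than switching harnesses.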
I imagine other players will build cloud support in their own apps, but even now there's a lot of distraction for them. Everyone is still trying to support local execution, which looks really different from cloud. A lot of the labs are taking their coding-focused teams and throwing non-coding on their plates as well (the same app for non-engineers slinging Google Sheets).
We think getting the cloud experience right for software engineers (as well as companies, with their own hosting/development needs) is going to be really hard, and the problem needs a team fully focused on that. We also think that companies are rightly nervous about putting all their eggs in one basket -- their long term development environment should be harness and model agnostic.
RE OpenCode + self-hosted/local models: definitely. There's nothing holding us back from supporting these since the boxes are just Linux machines. But we wanted to start with the most popular harnesses first and go from there.
I’ve been working on an [OSS TUI](https://github.com/prettysmartdev/awman) for managing agent execution and workflows in containers (locally or remotely) and would love to collaborate if you’re interested.
FWIW, I'm working on Nemesis8: https://github.com/DeepBlueDynamics/nemesis8 if you want to team up. I'm kordless at gmail or kord at deepbluedynamics
The fun thing is that in some way it's a bit inaccurate. We auto port-forward ports from the remote machines to your localhost, so you can still just go to localhost:3000 or whatever, and it goes to whatever machine you have selected in the desktop app. We'll soon give you a browser in the mobile app too, so you can hit "localhost" on mobile.
I do provide cloud support for some things like embeddings and crawling, but you can run it locally if you want. The only thing closed source is the memory system, but it still runs locally if you want it.
- slow and outdated vms
- horrible/no way to standardize environments for my team
- no way to bring our own compute to help resolve these issues ^
Default is 4 vCPU / 8 GB memory but it's configurable at the team/project level (can go higher).
> Is there a way to define the template that's used, so I can say to a new team member, log in to boxes.dev and all the repos and tools are already there for you?
Yes we're moving in this direction! For the current public version each person sets up their box and then agent threads start on a snapshot of that box. But for companies, what you laid out is 100% the vision and coming soon. No more eng onboarding, and maybe even give non-technical folks a default dev environment where they can spawn agents and prototype.
> And where do you get the machines, can we bring our own?
Right now we're using MicroVMs with E2B as our infra provider, but for companies we're exploring how to support bringing your own. Happy to chat if interested!
Or, open source it and let us run it on our own VPS and keep your expensive cloud for those who want to pay. As it stands, I would never consider it.
It's nowhere near as advanced as boxes.dev but it's built on the premise of running on any cloud. Indeed I have it running on two different bare metal server providers and I'm about to add a third (Azure) as I'm using my day job as my first customer.
Can I grab your contact details and schedule a demo?
You could use a VPS, but spinning boxes up and down on inactivity takes a long time, and making changes to the template for new machines is less trivial there. If you're only paying for 1 VPS box, then you lose the "multiple independent machines" benefit, and I imagine things start to get more expensive even in the VPS world when you have 10 of them running at the same time (one per thread).
Then again, I'm just the guy running his mouth, and you guys are the ones actually doing the work :)
BTW, looks very polished and thought-through, I may have to still give it a try!
And thank you!
I am building a self-hosted tool (OpenClaw-like) to solve the same problem (running agents 24/7 and accessing them from mobile), which I think is the main alternative approach to cloud tools. I'm glad that other people have recognized the problem.
We currently use worktrees btw. We have a port allocation system that sends ports to the agent automatically, which suffices for smoke testing web projects in parallel but requires some configuration. We've also found that asking agents to find a free port works as well. There's no way to get security-relevant isolation without a containerized system, but everything else can be worked around, and IMO more easily than the setup required to make a project ready for VM/container development.
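To make the "find a free port" idea concrete, here is a minimal sketch using only the standard library. It is not the port allocation system described above; `allocate_ports` and the worktree names are hypothetical. Binding to port 0 asks the OS for an unused port, and holding every socket open until all ports are assigned keeps the OS from handing out the same one twice.

```python
import socket


def allocate_ports(worktrees, host="127.0.0.1"):
    """Return {worktree name: free TCP port}, one distinct port per worktree."""
    socks = []
    try:
        ports = {}
        for name in worktrees:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            socks.append(sock)
            sock.bind((host, 0))  # port 0 = let the OS pick an unused port
            ports[name] = sock.getsockname()[1]
        return ports
    finally:
        # Release the ports so each worktree's dev server can bind its own.
        for sock in socks:
            sock.close()


if __name__ == "__main__":
    # Hypothetical worktree names; pass each port to the agent or dev server.
    for name, port in allocate_ports(["feature-a", "feature-b"]).items():
        print(f"{name}: http://localhost:{port}")
```

There is a small race here: another process could grab a port between the `close()` and the dev server starting, so treat it as good enough for smoke testing rather than a guarantee.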
RE: setup required to make a project ready for VM deployment, not sure how complex your app is, but we've found that coding agents do a pretty good job at finding your dependencies locally, installing them on the remote, and ensuring your app runs on the remote end. If you have a few minutes, try out our auto-setup. Most people haven't had to lift a finger to get their apps running in VMs.
If your CTO didn't spend the past year making an orchestration tool and a baby, is he even qualified?
I have a vibe-coded orchestrator that I use to manage my claude and codex sessions across multiple machines, can also spin up sprites from fly.
https://github.com/tinkerer/propanes
warning: it is probably totally unsuitable for anyone else to use except for me
The main idea is a widget that you embed in your apps that lets you select elements, paste screenshots, and prompt what to change. This workflow is very productive for me. I would encourage everyone to add element selection to their orchestrators' prompt composers. If you watch the Looms on the README, note that my CLAUDE.MD calls me a Meat Computer and reminds me to hydrate.
I have a native Tauri version that lets you select UI elements through the macOS accessibility API too.
The session service uses tmux so you can open a native terminal via ssh and tmux attach. I've added a ton of features that are in varying degrees of half-baked: the "brainstorm" mode allows you to do microphone transcription while interacting with the DOM and it will suggest tickets automatically. I've also been working on "bd2sdd" which is supposed to take your strings of user inputs and transform them into a spec, presumably because I also desired regressions. There are Wiggums (which aren't relevant anymore with /goal), "FAFO swarms" (fan-out, aggregate, filter, optimize) which I use to reverse engineer other pieces of software, and PowWow for codex and claude to work together.
I stole the structured views and remote session control from my friend's Agent Portal project txcl.io which is more fully-baked and narrower scope than propanes.
The ticketing system / tmux / structured views has been slowly evolving into multi-agent chat with a primary "Chief of Staff." It integrated pretty nicely into Slack.
- I run hermes on the box and it has some scheduled cron jobs.
- I gave it an account on a custom Git forge. It cannot commit without my direct permission, though it can blow the setup up in other ways lol.
- I interact by assigning it issues and talking through Discord.
Our bet is that a lot of people will want something prebuilt, and that the last-mile UX for making a good coding workspace (including code review, etc) is actually nontrivial, especially at companies.
- A dedicated app where you can scroll through your thread/chat history and start a new thread/fork/VM just by typing a new message, along with access to persistent terminals organized by thread/machine. Push notifications as well when your threads are done. Sort of doable via termux/tmux/ssh/etc.
- It takes a little while to get git worktrees set up well to have multiple threads running in parallel. You have to make sure each worktree starts your app on a different port, for example. But some folks are able to get it in a good place through some manual setup work.
- We started hitting resource limits running 5 full copies of our app on 1 laptop (so each agent can test its work separately), but again, if you have a beefy enough machine this might not be a problem.
- We auto-handle port forwarding for you on desktop (and on mobile soon too). Again, you can finagle something like this with tailscale, but it's a pain in the butt to manually track which thread maps to which port on the same machine. We have some magic where if you select a thread in the desktop app, we automatically remap localhost:3000 (or any other port running there) to that thread's machine, so you can just reload your browser locally to test.
These are a few examples. From building this ourselves, we're pretty convinced that you need some sort of UI to do remote development in a super clean way that feels like localhost. But if you're willing to put in the work, you can probably get relatively close yourself!
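As a rough illustration of the "select a thread, localhost:3000 follows it" behavior described above, here is a minimal sketch of just the bookkeeping. It is not how Boxes implements it; `PortRouter`, the thread ids, and the hostnames are all hypothetical. A real forwarder would also listen on the local port and pipe bytes to the resolved upstream, which is omitted here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class PortRouter:
    """Tracks which remote machine local ports should currently point at."""

    machines: Dict[str, str] = field(default_factory=dict)  # thread id -> host
    selected: Optional[str] = None

    def register(self, thread_id: str, host: str) -> None:
        """Record the machine that backs a thread."""
        self.machines[thread_id] = host

    def select(self, thread_id: str) -> None:
        """Switch the active thread; later lookups resolve to its machine."""
        if thread_id not in self.machines:
            raise KeyError(f"unknown thread {thread_id!r}")
        self.selected = thread_id

    def upstream(self, local_port: int) -> Tuple[str, int]:
        """Resolve localhost:<local_port> to (host, port) on the active machine."""
        if self.selected is None:
            raise RuntimeError("no thread selected")
        return (self.machines[self.selected], local_port)


if __name__ == "__main__":
    router = PortRouter()
    router.register("thread-1", "box-1.example.internal")  # hypothetical hosts
    router.register("thread-2", "box-2.example.internal")
    router.select("thread-2")
    print(router.upstream(3000))
```

Because the local port number never changes, the browser tab stays on localhost:3000 and only the upstream target moves when you select a different thread.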
These can go for many hours from all the manual testing and debugging. Quality really depends on how much you spec things out beforehand, and how you define the test plan / "success" gates. If the agent can't even run the app to test it then things can definitely go off the rails!
E.g., code debugging
With boxes.dev I've started pushing agents harder to run the full app and test their work end to end, and send me screenshots as proof. This takes time, sometimes up to 30-40 minutes, but the result is much more likely to be bug-free at the end of the day.
Why would I want this and not the other way around?
I'm a bit frustrated that they restrained EU users from downloading their app, but I guess they just want to avoid dealing with GDPR, which is fair for an early startup!
Re: web searches -- we're running a full Linux kernel and the agent runs on the machine itself, so we can't sleep mid-run. Conceptually, moving the agent off-box and sleeping during web searches etc. would be interesting, but in our experience coding agents run enough stuff on the machine itself (rg, bash, playwright, etc.) that there wouldn't be much savings.
We recommend you auth with only development credentials (or use something like two-factor confirmation if you have more sensitive things you want to confirm before the agent accesses them), but it's still early for us and we're continuing to refine this as we go. For companies, we're down to brainstorm how they'd ideally like this to work for them. And over the long term we'll support hosting this in your own cloud.
Curious if you have a take on how you'd like this to work from a UX standpoint.