3 points by miyamotomusashi a month ago | 2 comments
  • aikitty a month ago
    Really interesting point about the setup tax. I hadn’t thought about how much the ephemeral nature of cloud instances kills you on iterative workflows.

    Have you looked at GPU marketplaces like io.net that offer much cheaper instances than AWS? You get both benefits: no setup tax between runs and lower costs. The trade-off is that you may be paying during idle time between experiments, but if you're iterating frequently the math should still work out heavily in your favor.
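
    To make the "math should work out" claim concrete, here's a toy break-even sketch per experiment. Every rate and duration in it is a made-up placeholder, not a quote from AWS or io.net:

      # Ephemeral cloud vs. persistent marketplace instance, cost per run.
      # All numbers below are hypothetical placeholders.
      aws_rate = 4.10     # $/hr, on-demand cloud instance (illustrative)
      market_rate = 1.60  # $/hr, marketplace equivalent (illustrative)
      setup_tax = 0.5     # hrs lost per run re-downloading weights, rebuilding env
      run_time = 1.0      # hrs of actual compute per experiment
      idle_gap = 2.0      # hrs of idle time between experiments

      # Ephemeral: you pay the setup tax at the cloud rate on every run.
      ephemeral = aws_rate * (setup_tax + run_time)     # $6.15/run
      # Persistent: no setup tax, but you pay through the idle gap.
      persistent = market_rate * (run_time + idle_gap)  # $4.80/run
      print(f"ephemeral ${ephemeral:.2f}/run vs persistent ${persistent:.2f}/run")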

    Curious if you’ve modelled that vs your distributed swarm approach. It might be an easier path to cost and time savings without having to architect the distributed setup yourself.

    • miyamotomusashi a month ago
      This is a great point. I've benchmarked io.net and vast.ai extensively. You're right that they solve the setup tax (persistent instances) and the cost (cheaper hourly rates). But they hit a different hard limit: the VRAM wall.

      The Problem: To run a 70B model at fp16 (2 bytes per parameter), you need around 140GB of VRAM for the weights alone.
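
      The arithmetic behind that number, as a quick sketch (weights only; the KV cache and activations push the real requirement higher):

        import math

        # Weights-only VRAM for a dense 70B model at fp16 (2 bytes/param).
        params = 70e9
        bytes_per_param = 2   # fp16/bf16; int8 would be 1, 4-bit roughly 0.5
        weights_gb = params * bytes_per_param / 1e9   # 140.0 GB

        # How many 24GB consumer cards (e.g. RTX 4090) that implies:
        cards = math.ceil(weights_gb / 24)            # 6
        print(f"~{weights_gb:.0f} GB -> {cards} x 24GB cards")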

      On io.net/Vast: You can't find a single cheap consumer card with that much memory (RTX 4090s cap out at 24GB). You're forced to rent expensive enterprise chips (A100s) or manually orchestrate a multi-node cluster yourself, which brings back the DevOps headache.

      On the Swarm: We handle that multi-node orchestration automatically. We stitch together 6x cheap 4090s to create one "Virtual GPU" with enough VRAM.
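
      For intuition, here's what that stitching looks like at single-node scale using HF transformers' device_map="auto", which shards a model's layers across every visible GPU so their combined VRAM acts like one big card. This is just an off-the-shelf, one-box approximation of the idea, not the swarm's cross-node implementation, and the model id is only an example:

        # Requires: pip install torch transformers accelerate
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_id = "meta-llama/Llama-3.1-70B-Instruct"  # example checkpoint

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,  # 2 bytes/param -> ~140GB in total
            device_map="auto",          # shard layers across all visible GPUs
        )

        inputs = tokenizer("The VRAM wall is", return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=32)
        print(tokenizer.decode(out[0], skip_special_tokens=True))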

      So if your model fits on one card, io.net wins. If it doesn't (like 70B+ models), that's where the swarm architecture becomes necessary.