Curious how you’re thinking about isolation here. Is there any hard guarantee on a 'slice' of the GPU, or is it mostly just handled by the vLLM scheduler?
> When you join a cohort, your card is saved but not charged until the cohort fills. Stripe holds your card information — we never store it. Once the cohort fills, you are charged and receive an API key for the duration of the cohort.
Have any cohorts filled yet?
I’m interested in joining one, but only if it’s reasonable to assume that the cohort will fill within the next 7 days or so. (In a little over a week I’m attending an LLM-centered hackathon where we can use either AWS LLM credits provided by the organizer or providers of our own choosing. I’d rather use yours, or my own hardware running vLLM, than the LLM offerings and APIs from AWS.)
I’d be pretty annoyed if I join a cohort and then it takes like 3 months before the cohort has filled and I can begin to use it. By then I will probably have forgotten all about it and not have time to make use of the API key I am paying you for.
That said, we're planning to add a 7-day window: if a cohort doesn't fill within 7 days of your reservation, it cancels automatically and your card is released. We don't want anyone's payment method sitting in limbo indefinitely.
If everyone in a pool uses it during the ~same periods and sleeps during the ~same periods, then the node would oscillate between contention and idle -- every day. This seems largely avoidable.
(Or, darker: Maybe the contention/idle dichotomy is a feature, not a bug. After all, when one has control of $14k/month of hardware that is sitting idle reliably-enough for significant periods every day, then one becomes incentivized to devise a way to sell that idle time for other purposes.)
I question whether they actually understand LLMs at scale.
How large is a full context window in MiB and how long does it take to load the buffer? I.e. how many seconds should I expect my worst case wait time to take until I get my first token?
TTFT is under 2 seconds average. Worst case is 10-30s.
Yes, I was thinking about context buffers, which I assume are not small in large models. That has to be loaded into VRAM, right?
If I keep sending large context buffers, will that hog the batches?
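For a rough sense of scale on the context-buffer question, here is a back-of-the-envelope sketch of KV-cache size per sequence. It assumes standard multi-head or grouped-query attention; the function name and the 70B-class configuration are hypothetical, picked only for illustration, and are not the service's actual setup. DeepSeek's models use a compressed attention variant (MLA), so this formula would overstate their cache.

```python
def kv_cache_mib(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size in MiB for one sequence under standard (MHA/GQA) attention."""
    # One key vector and one value vector per layer, per KV head, per token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 2**20


# Hypothetical 70B-class configuration with 16-bit values (illustration only).
example = kv_cache_mib(tokens=128_000, layers=80, kv_heads=8, head_dim=128)
print(f"{example:,.0f} MiB")  # 40,000 MiB
```

Note that the cache is normally built on the GPU during prefill rather than uploaded, so the wait before the first token is mostly prefill compute, which grows with prompt length.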
Not the original author, but batching is one very important trick for making inference efficient. You can reasonably run tens to low hundreds of requests in parallel (depending on model size and GPU size) with very little performance overhead.
"Running 24x7" is what people want to do with openclaw.
2. What if I try to hog all resources of a node by running some large data processing and making multiple queries in parallel? What if I try to resell the access by charging per token?
Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but haven't found a satisfactory answer to my question number 2, so I decided that it is infeasible in practice.
Also, the mobile version is a bit broken, but it's a good idea. Good luck!
For $40, I'd get 20 tok/s * 2.6M seconds per month = 52M tokens of DeepSeek v3.2 per month if I run it 24/7, which is not realistic for most workloads.
On OpenRouter [1], $40 buys 105M tokens from the same model, which is more than 52M tokens, and I can freely choose when to use them.
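The comparison above can be sketched as a small calculation. It uses only the figures quoted in the comment ($40/month, 20 tok/s, 2.6M seconds per month, 105M tokens for $40 on OpenRouter); the function names and the `duty_cycle` parameter are hypothetical additions for illustration.

```python
SECONDS_PER_MONTH = 2.6e6  # figure used in the comment (about 30 days)


def flat_rate_tokens(tok_per_s, duty_cycle=1.0, seconds=SECONDS_PER_MONTH):
    """Tokens generated in a month at a fixed speed, for a given fraction of time in use."""
    return tok_per_s * seconds * duty_cycle


def price_per_million(monthly_price, tokens):
    """Effective dollars per million tokens."""
    return monthly_price / (tokens / 1e6)


full_time = flat_rate_tokens(20)                       # 24/7 usage
eight_hours = flat_rate_tokens(20, duty_cycle=8 / 24)  # 8 hours a day

print(f"24/7:       {full_time / 1e6:.0f}M tokens, ${price_per_million(40, full_time):.2f}/M")
print(f"8 h/day:    {eight_hours / 1e6:.1f}M tokens, ${price_per_million(40, eight_hours):.2f}/M")
print(f"OpenRouter: 105M tokens, ${price_per_million(40, 105e6):.2f}/M")
```

Under these numbers the flat rate works out to about $0.77 per million tokens at 24/7 usage versus about $0.38 on the per-token plan, so the slot cannot break even at 20 tok/s even when it is never idle.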
That's roughly 1,000 words/min if you were typing. If 1,000 words/min is too slow for your use case, then perhaps $5/m is just not for you.
I kinda like the idea of paying $5/m for unlimited usage at the specified speed.
It beats a 10x higher speed that hits daily restrictions in about 2 hours, and weekly restrictions in 3 days.
I mean, my local 122b only does 20t/s, so it's usable for background stuff. But not for anything interactive, IME.
What are you running that local 122b on? I mean, this looks attractive to me for $5/m running unlimited at 20t/s-25t/s, but if I could buy hardware to get that running locally, I don't mind doing so.
> deepseek-v3.2-685b, $40/mo/slot for ~20 tok/s, 465 slots total
> 465 users × 20 tok/s = 9,300 tok/s needed
> The node peaks at ~3,000 tok/s total. So at full capacity they can really only serve:
> 3,000 ÷ 20 = 150 concurrent users at 20 tok/s
> That's only 32% of the cohort being active simultaneously.
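The quoted arithmetic can be reproduced with a short sketch. It assumes the node's aggregate throughput is a fixed ~3,000 tok/s split evenly among active users, which is a simplification (real batched throughput varies with load); the function names are hypothetical.

```python
def max_concurrent(node_tok_s, per_user_tok_s):
    """How many users can be served at the advertised per-user speed."""
    return node_tok_s // per_user_tok_s


def per_user_rate(node_tok_s, active_users):
    """Per-user speed if throughput is shared evenly among active users."""
    return node_tok_s / max(active_users, 1)


SLOTS, NODE_TOK_S, ADVERTISED_TOK_S = 465, 3000, 20  # figures from the quote

served = max_concurrent(NODE_TOK_S, ADVERTISED_TOK_S)
print(f"demand if all active: {SLOTS * ADVERTISED_TOK_S} tok/s")
print(f"served at full speed: {served} users ({served / SLOTS:.0%} of cohort)")
print(f"speed if all active:  {per_user_rate(NODE_TOK_S, SLOTS):.2f} tok/s each")
```

With everyone active at once, the even split comes to about 6.45 tok/s per user under these assumptions.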
I personally would like something like this but with "regular" GPU access. Some people still use them for something other than LLMs ^^.
I can sign up for a cohort today, but there's not even a hint of how long it will take the cohort to fill up. The most subscribed cohort is only at 42% (and dropping), so maybe days to weeks? That's a long time to wait if you have a use case to satisfy.
And then the cohort expires, and I have to sign up for another one and play the waiting game again? Nobody wants that level of unreliability.
Also, don't say "15-25 tok/s". That is a min-max figure, but your FAQ says that this is actually a maximum. It makes no sense to measure a maximum as a range, and you state no minimum so I can only assume that it is 0 tok/s. If all users in the cohort use it simultaneously, the best they're getting is something like 1.5 tok/s (probably less), which is abysmal.
You mention "optimization", but I have no idea what that means. It certainly doesn't mean imposing token limits, because your FAQ says that won't happen. If more than 25 users are using the cohort simultaneously, it is a physical impossibility to improve performance to the levels you advertise without sacrificing something else, like switching to a smaller model, which would essentially be fraud, or adding more GPUs which will bankrupt you at these margins. With 465 users per cohort, a large chunk of whom will be using tools like OpenClaw, nobody will ever see the performance you are offering.
The issue here is you are trying to offer affordable AI GPU nodes without operating at a loss. The entire AI industry is operating at a loss right now because of how expensive this all is. This strategy literally won't work right now unless you start courting VCs to invest tens to hundreds of millions of dollars so you can get this off the ground by operating at a loss until hopefully you turn a profit at some point in the future, but at that point developers will probably be able to run these models at home without your help.
For filling up the cohorts, I agree and we're launching for a week to gather feedback.
Split an "it needs to run in a datacenter because its hardware requirements are so large" AI/LLM across multiple people who each want shared access to that particular model.
Sort of like the Real Estate equivalent of subletting, or splitting a larger space into smaller spaces and subletting each one...
Or, like the Web Host equivalent of splitting a single server into multiple virtual machines for shared hosting by multiple other parties, or what-have-you...
I could definitely see marketplaces similar to this popping up in the future!
It seems like it should make AI cheaper for everyone... that is, "democratize AI"... in a "more/better/faster/cheaper" way than AI has been democratized to date...
Anyway, it's a brilliant idea!
Wishing you a lot of luck with this endeavor!