49 points by zyoralabs 7 hours ago | 3 comments
  • 7777777phil 43 minutes ago
    A 32B model in 19.3GB is really cool imo. Memory and cold start are what gate production deployments.
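
    Back-of-envelope for where that footprint lands (my arithmetic; assuming the 19.3GB is the on-GPU weight memory, which the post doesn't spell out):

      # Rough check: 32B params in 19.3 GB implies ~4.8 bits per parameter,
      # consistent with a 4-bit quantization plus overhead. (My assumption;
      # the quantization scheme isn't stated.)
      params = 32e9
      size_bytes = 19.3e9
      print(f"{size_bytes * 8 / params:.2f} bits/param")  # -> ~4.8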

    I did a piece (1) on how Netflix and Spotify worked this out a while ago: cheap classical methods handle 90%+ of their recommendation requests, and LLMs only get called when the payoff justifies the cost.

    (1) https://philippdubach.com/posts/why-netflix-and-spotify-can-...
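
    A minimal sketch of that tiered pattern (all names and thresholds here are hypothetical, not from the article):

      # Tiered routing sketch: a cheap classical recommender serves most
      # traffic; the LLM path is taken only when the predicted payoff
      # justifies its cost. Functions and threshold are illustrative.
      LLM_VALUE_THRESHOLD = 0.9  # tune so ~90% of requests stay classical

      def classical_recommend(user_id: str) -> list[str]:
          return ["item_a", "item_b"]  # e.g., matrix factorization / kNN

      def llm_recommend(user_id: str) -> list[str]:
          return ["item_x"]  # expensive generative re-rank

      def route(user_id: str, predicted_payoff: float) -> list[str]:
          if predicted_payoff > LLM_VALUE_THRESHOLD:
              return llm_recommend(user_id)  # rare, high-value path
          return classical_recommend(user_id)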

  • reconnecting an hour ago
    • 7777777phil 43 minutes ago
      Sorry, this post has been removed by the moderators of r/LocalLLaMA.

      Classic reddit..

  • medi_naseri 6 hours ago
    This is so freaking awesome. I am working on a project trying to run 10 models on two GPUs; loading/offloading is the only solution I have in mind.
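
    Roughly what I mean, as a minimal sketch (assumes PyTorch-style .to() device moves; hypothetical, not tied to any particular serving stack):

      # LRU pool sketch: keep at most `max_resident` models on GPU, offload
      # the least-recently-used one to CPU when a new model is requested.
      from collections import OrderedDict
      import torch

      class ModelPool:
          def __init__(self, max_resident: int = 2):
              self.max_resident = max_resident
              self.resident: "OrderedDict[str, torch.nn.Module]" = OrderedDict()

          def get(self, name: str, load_fn) -> torch.nn.Module:
              if name in self.resident:
                  self.resident.move_to_end(name)  # mark recently used
                  return self.resident[name]
              if len(self.resident) >= self.max_resident:
                  _, coldest = self.resident.popitem(last=False)
                  coldest.to("cpu")                # offload coldest model
                  torch.cuda.empty_cache()         # release the freed VRAM
              model = load_fn(name).to("cuda")     # cold start onto GPU
              self.resident[name] = model
              return model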

    Will try getting this deployed.

    Are the advertised cold start timings for a condition where there is no other model loaded on the GPUs?