49 points by zyoralabs 7 hours ago | 3 comments
  • 7777777phil 43 minutes ago
    A 32B model in 19.3GB is really cool imo. Memory and cold start are what gate production deployments.
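
    Back-of-envelope for where that footprint lands (my arithmetic; assuming the 19.3GB is the on-GPU weight memory, which the post doesn't spell out):

      # Rough check: 32B params in 19.3 GB implies ~4.8 bits per parameter,
      # consistent with a 4-bit quantization plus overhead. (My assumption;
      # the quantization scheme isn't stated.)
      params = 32e9
      size_bytes = 19.3e9
      print(f"{size_bytes * 8 / params:.2f} bits/param")  # -> ~4.8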

    I did a piece (1) on how Netflix and Spotify worked this out a while ago: cheap classical methods handle 90%+ of their recommendation requests, and LLMs only get called when the payoff justifies the cost.

    (1) https://philippdubach.com/posts/why-netflix-and-spotify-can-...
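
    A minimal sketch of that tiered pattern (all names and thresholds here are hypothetical, not from the article):

      # Tiered routing sketch: a cheap classical recommender serves most
      # traffic; the LLM path is taken only when the predicted payoff
      # justifies its cost. Functions and threshold are illustrative.
      LLM_VALUE_THRESHOLD = 0.9  # tune so ~90% of requests stay classical

      def classical_recommend(user_id: str) -> list[str]:
          return ["item_a", "item_b"]  # e.g., matrix factorization / kNN

      def llm_recommend(user_id: str) -> list[str]:
          return ["item_x"]  # expensive generative re-rank

      def route(user_id: str, predicted_payoff: float) -> list[str]:
          if predicted_payoff > LLM_VALUE_THRESHOLD:
              return llm_recommend(user_id)  # rare, high-value path
          return classical_recommend(user_id)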

  • reconnecting an hour ago
    • 7777777phil 43 minutes ago
      Sorry, this post has been removed by the moderators of r/LocalLLaMA.

      Classic reddit..

  • medi_naseri 6 hours ago
    This is so freaking awesome. I am working on a project trying to run 10 models on two GPUs; loading/offloading is the only solution I have in mind.
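
    Roughly what I mean, as a minimal sketch (assumes PyTorch-style .to() device moves; hypothetical, not tied to any particular serving stack):

      # LRU pool sketch: keep at most `max_resident` models on GPU, offload
      # the least-recently-used one to CPU when a new model is requested.
      from collections import OrderedDict
      import torch

      class ModelPool:
          def __init__(self, max_resident: int = 2):
              self.max_resident = max_resident
              self.resident: "OrderedDict[str, torch.nn.Module]" = OrderedDict()

          def get(self, name: str, load_fn) -> torch.nn.Module:
              if name in self.resident:
                  self.resident.move_to_end(name)  # mark recently used
                  return self.resident[name]
              if len(self.resident) >= self.max_resident:
                  _, coldest = self.resident.popitem(last=False)
                  coldest.to("cpu")                # offload coldest model
                  torch.cuda.empty_cache()         # release the freed VRAM
              model = load_fn(name).to("cuda")     # cold start onto GPU
              self.resident[name] = model
              return model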

    Will try getting this deployed.

    Are the advertised cold start timings for a condition where there is no other model loaded on the GPUs?