I did a piece (1) on how Netflix and Spotify worked this out a while ago: cheap classical methods handle 90%+ of their recommendation requests, and LLMs only get called when the payoff justifies it.
(1) https://philippdubach.com/posts/why-netflix-and-spotify-can-...
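A rough sketch of what that kind of routing can look like, purely illustrative and not taken from either company: the names `Candidate`, `route`, `cheap_ranker`, `llm_ranker` and the 0.6 `confidence_floor` are all hypothetical. The cheap ranker always runs first, and the expensive one is only called when the cheap one is not confident.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    item_id: str
    score: float  # confidence from the cheap model, assumed to lie in [0, 1]


def route(
    user_id: str,
    cheap_ranker: Callable[[str], List[Candidate]],
    llm_ranker: Callable[[str, List[Candidate]], List[Candidate]],
    confidence_floor: float = 0.6,  # hypothetical threshold, tune per product
) -> Tuple[str, List[Candidate]]:
    """Serve from the cheap ranker unless its best score is too low."""
    candidates = cheap_ranker(user_id)
    best = max((c.score for c in candidates), default=0.0)

    if best >= confidence_floor:
        # Common case: the classical result is good enough, no LLM call.
        ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
        return "classical", ranked

    # Rare case: hand the cheap candidates to the expensive re-ranker.
    return "llm", llm_ranker(user_id, candidates)
```

In practice the threshold would presumably be set from the cost of an LLM call against the measured lift, which is the "payoff justifies it" part.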
Classic Reddit.
Will try getting this deployed.
Are the advertised cold start timings for a condition where no other model is loaded on the GPUs?