EDIT: -- OK, it's legit; here is an example of it put to use by the makers of the Dolphin open-source series of fine-tunes:
> Here I implement in nano-vllm, efficient sample-K logit extraction, as described in "Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs" by Anshumann et al. Sampling occurs on the GPU, the non-sampled logits do not get copied out of GPU space. I tried to implement this in @vllm_project, but it was a bit too heavy for me to figure out.
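In other words: pick K token ids per position on the GPU and copy out only those K (index, logit) pairs, never the full vocab-sized tensor. Here is a minimal PyTorch sketch of that idea (not the nano-vllm code; the function name and the default K are made up):

```python
import torch

@torch.no_grad()
def sample_k_logits(logits: torch.Tensor, k: int = 12):
    """Keep K sampled logits per position; the full tensor never leaves the GPU.

    logits: [num_tokens, vocab_size], already resident on the GPU.
    Returns (indices, values), each [num_tokens, k], copied to the CPU.
    """
    probs = torch.softmax(logits.float(), dim=-1)
    # Draw K distinct token ids per position from the model's own distribution.
    idx = torch.multinomial(probs, num_samples=k, replacement=False)
    vals = torch.gather(logits, dim=-1, index=idx)
    # Only these small [num_tokens, k] tensors cross the PCIe bus.
    return idx.cpu(), vals.cpu()
```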
But having the CUDA packages four times in different layers is questionable! [3]
Then again, as a college mate of mine used to say, "Don't change it. It works."
--
[1]: https://hub.docker.com/r/vllm/vllm-openai/tags
[2]: https://github.com/vllm-project/vllm/issues/13306
[3]: These kinds of workarounds tend to accumulate and never get revisited:
- https://github.com/vllm-project/vllm/commit/b07d741661570ef1...
- https://github.com/vllm-project/vllm/commit/68d37809b9b52f4d... (this one in particular probably accounts for +3 GB)
If we can get this level of performance in 1.2k lines, what if we go the other way: split the model across devices or even machines, stream token-by-token, but keep the prefix cache consistent across hops? Can we design inference engines that think in terms of modular attention scopes instead of monolithic graphs? Is it even possible?
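Purely speculative sketch of the kind of interface that might express "modular attention scopes": each hop owns a contiguous slice of layers plus its own KV cache, and the cache key is derived from the token prefix itself so it stays stable across hops. None of these names exist in vLLM or nano-vllm.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class PrefixKey:
    """Content-derived cache key: the same token prefix maps to the same key on every hop."""
    token_ids: tuple[int, ...]

    def extend(self, token_id: int) -> "PrefixKey":
        return PrefixKey(self.token_ids + (token_id,))

class AttentionScope(Protocol):
    """A contiguous slice of layers on one device (or machine), with its own KV cache."""
    def step(self, hidden, key: PrefixKey): ...

@dataclass
class HoppedEngine:
    scopes: list[AttentionScope]  # ordered hops: device 0 -> device 1 -> ...

    def decode_one(self, hidden, key: PrefixKey):
        # Stream a single token's activations hop by hop. Each hop looks up its
        # local KV cache with the same content-derived PrefixKey, which is what
        # keeps prefix caching consistent across devices and machines.
        for scope in self.scopes:
            hidden = scope.step(hidden, key)
        return hidden
```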
vLLM is optimized to serve many requests at one time.
If you were to fine-tune a model and wanted to serve it to many users, you would use vLLM, not llama.cpp.
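For concreteness, here is what "many requests at one time" looks like with vLLM's offline batch API; the model id is a placeholder, substitute your own fine-tuned checkpoint:

```python
from vllm import LLM, SamplingParams

# Placeholder model id; point this at your fine-tuned checkpoint.
llm = LLM(model="my-org/my-finetuned-model")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Hundreds of prompts are scheduled together (continuous batching),
# which is where vLLM's throughput advantage over llama.cpp comes from.
prompts = [f"Summarize ticket #{i}:" for i in range(256)]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```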