EDIT: -- OK, it's legit; here is an example of it put to use by the makers of the Dolphin open-source series of fine-tunes:
> Here I implement in nano-vllm, efficient sample-K logit extraction, as described in "Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs" by Anshumann et al. Sampling occurs on the GPU, the non-sampled logits do not get copied out of GPU space. I tried to implement this in @vllm_project, but it was a bit too heavy for me to figure out.
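In other words: pick K token ids per position on the GPU and copy out only those K (index, logit) pairs, never the full vocab-sized tensor. Here is a minimal PyTorch sketch of that idea (not the nano-vllm code; the function name and the default K are made up):

```python
import torch

@torch.no_grad()
def sample_k_logits(logits: torch.Tensor, k: int = 12):
    """Keep K sampled logits per position; the full tensor never leaves the GPU.

    logits: [num_tokens, vocab_size], already resident on the GPU.
    Returns (indices, values), each [num_tokens, k], copied to the CPU.
    """
    probs = torch.softmax(logits.float(), dim=-1)
    # Draw K distinct token ids per position from the model's own distribution.
    idx = torch.multinomial(probs, num_samples=k, replacement=False)
    vals = torch.gather(logits, dim=-1, index=idx)
    # Only these small [num_tokens, k] tensors cross the PCIe bus.
    return idx.cpu(), vals.cpu()
```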
But having the CUDA packages four times in different layers is questionable! [3]
Then again, as a college mate of mine used to say, "Don't change it. It works."
--
[1]: https://hub.docker.com/r/vllm/vllm-openai/tags
[2]: https://github.com/vllm-project/vllm/issues/13306
[3]: These kinds of workarounds tend to accumulate and never get revisited:
- https://github.com/vllm-project/vllm/commit/b07d741661570ef1...
- https://github.com/vllm-project/vllm/commit/68d37809b9b52f4d... (this one in particular probably accounts for +3 GB)
If we can get this level of performance in 1.2k lines, what if we go the other way: split the model across devices or even machines, stream token-by-token, but keep the prefix cache consistent across hops? Can we design inference engines that think in terms of modular attention scopes instead of monolithic graphs? Is it even possible?
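Purely speculative sketch of the kind of interface that might express "modular attention scopes": each hop owns a contiguous slice of layers plus its own KV cache, and the cache key is derived from the token prefix itself so it stays stable across hops. None of these names exist in vLLM or nano-vllm.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class PrefixKey:
    """Content-derived cache key: the same token prefix maps to the same key on every hop."""
    token_ids: tuple[int, ...]

    def extend(self, token_id: int) -> "PrefixKey":
        return PrefixKey(self.token_ids + (token_id,))

class AttentionScope(Protocol):
    """A contiguous slice of layers on one device (or machine), with its own KV cache."""
    def step(self, hidden, key: PrefixKey): ...

@dataclass
class HoppedEngine:
    scopes: list[AttentionScope]  # ordered hops: device 0 -> device 1 -> ...

    def decode_one(self, hidden, key: PrefixKey):
        # Stream a single token's activations hop by hop. Each hop looks up its
        # local KV cache with the same content-derived PrefixKey, which is what
        # keeps prefix caching consistent across devices and machines.
        for scope in self.scopes:
            hidden = scope.step(hidden, key)
        return hidden
```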
vLLM is optimized to serve many requests at one time.
If you were to fine-tune a model and wanted to serve it to many users, you would use vLLM, not llama.cpp.
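For concreteness, here is what "many requests at one time" looks like with vLLM's offline batch API; the model id is a placeholder, substitute your own fine-tuned checkpoint:

```python
from vllm import LLM, SamplingParams

# Placeholder model id; point this at your fine-tuned checkpoint.
llm = LLM(model="my-org/my-finetuned-model")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Hundreds of prompts are scheduled together (continuous batching),
# which is where vLLM's throughput advantage over llama.cpp comes from.
prompts = [f"Summarize ticket #{i}:" for i in range(256)]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```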