cd /tmp
git clone https://github.com/samuel-vitorino/lm.rs
cd lm.rs
RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat
curl -L -o tokenizer.bin 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/tokenizer.bin?download=true'
curl -L -o llama3.2-1b-it-q80.lmrs 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/llama3.2-1b-it-q80.lmrs?download=true'
./target/release/chat --model llama3.2-1b-it-q80.lmrs
The fact that a 1.2GB download can do as well as this is honestly astonishing to me - but it's going to look laughably poor in comparison to something like GPT-4o - which I'm guessing is measured in the 100s of GBs.
You can try out Llama 3.2 1B yourself directly in your browser (it will fetch about 1GB of data) at https://chat.webllm.ai/
gpt4 tends to be more accurate than 4o for me.
4o is a bit too lobotomized for my taste. If you try to engage in conversation, nearly every answer after the first starts with "You're absolutely right". Bro, I don't know if I'm right, that's why I'm asking a question!
It's somehow better in _some_ scenarios but I feel like it's also objectively worse in others so it ends up being a wash. It paradoxically looks bad relative to GPT-4 but also makes GPT-4 feel worse when you go back to it...
o1-preview has been growing on me despite its answers also being very formulaic (relative to the OG GPT-3.5 and GPT-4 models which had more "freedom" in how they answered)
If you have a sloppy implementation which somehow accumulates a lot of error in its floating point math, you will get worse results.
It's rarely talked about, but it's a real thing. Floating point addition and multiplication are non-associative, and the order of operations affects both correctness and performance. Developers might (unknowingly) trade correctness for performance. And it matters a lot more in the low precision modes we operate in today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0, and you won't get close to the best approximation if you do it in a naive loop.
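You can see this for yourself with a minimal sketch (this assumes the third-party `half` crate, version 2, for an fp16 type; it is not part of lm.rs):

    use half::f16; // half = "2" in Cargo.toml

    fn main() {
        let ones = vec![f16::ONE; 9_999];

        // Naive fp16 accumulation: fp16 has a 10-bit mantissa, so above 2048 the spacing
        // between representable values is 2. Adding 1.0 rounds back down (ties to even)
        // and the running sum gets stuck at 2048.
        let naive: f16 = ones.iter().fold(f16::ZERO, |acc, &x| acc + x);

        // Accumulating in f32 and rounding once at the end gives the best fp16 answer,
        // which is 10000.0 (9999 itself is not representable in fp16).
        let widened: f32 = ones.iter().map(|&x| f32::from(x)).sum();
        let rounded = f16::from_f32(widened);

        println!("naive fp16 sum:      {naive}");   // 2048
        println!("f32 sum, then round: {rounded}"); // 10000
    }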
I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits added to be a valid fp32.
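A quick illustration of why: dropping the low 16 bits of an fp32 gives you a bf16, and appending 16 zero bits brings it back (a sketch using plain truncation; real conversions usually round to nearest):

    // Keep the sign bit, the full 8-bit exponent, and the top 7 mantissa bits.
    fn f32_to_bf16_bits(x: f32) -> u16 {
        (x.to_bits() >> 16) as u16
    }

    // Append 16 zero bits and the result is a valid fp32 again.
    fn bf16_bits_to_f32(b: u16) -> f32 {
        f32::from_bits((b as u32) << 16)
    }

    fn main() {
        let x = 3.1415927_f32;
        let roundtrip = bf16_bits_to_f32(f32_to_bf16_bits(x));
        println!("{x} -> {roundtrip}"); // 3.1415927 -> 3.140625
    }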
./target/release/chat --model llama3.2-1b-it-q80.lmrs --show-metrics
That's how you see how many tokens/s you get. I got: Speed: 26.41 tok/s
Full output: https://gist.github.com/simonw/6f25fca5c664b84fdd4b72b091854...

I could see myself using some of the source code in the classroom to explain how transformers "really" work; code is more concrete/detailed than all those pictures of attention heads etc.
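For that kind of classroom walkthrough, the heart of it fits in one function; here is an illustrative single-head causal attention in plain Rust (my own sketch, not code lifted from lm.rs):

    // q, k, v each hold seq_len rows of dim floats, flattened row-major.
    fn attention(q: &[f32], k: &[f32], v: &[f32], seq_len: usize, dim: usize) -> Vec<f32> {
        let scale = 1.0 / (dim as f32).sqrt();
        let mut out = vec![0.0f32; seq_len * dim];

        for i in 0..seq_len {
            // Dot-product scores of query i against every earlier key (causal mask).
            let mut scores: Vec<f32> = (0..=i)
                .map(|j| {
                    (0..dim)
                        .map(|d| q[i * dim + d] * k[j * dim + d])
                        .sum::<f32>()
                        * scale
                })
                .collect();

            // Softmax over the scores (subtract the max for numerical stability).
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let sum: f32 = scores.iter_mut().map(|s| { *s = (*s - max).exp(); *s }).sum();

            // Output row i is the softmax-weighted sum of the value rows.
            for (j, s) in scores.iter().enumerate() {
                let w = s / sum;
                for d in 0..dim {
                    out[i * dim + d] += w * v[j * dim + d];
                }
            }
        }
        out
    }

    fn main() {
        let (seq_len, dim) = (3, 4);
        let x: Vec<f32> = (0..seq_len * dim).map(|i| i as f32 * 0.1).collect();
        let y = attention(&x, &x, &x, seq_len, dim); // toy self-attention
        println!("{:?}", &y[..dim]);
    }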
Two points of minor criticism/suggestions for improvement:
- libraries should not print to stdout, as that output may destroy application output (imagine I want to use the library in a text editor to offer style checking). So best to write to a string buffer owned by a logging instance associated with an lm.rs object.
- Is it possible to do all this without "unsafe" without twisting one's arm? I see there are uses of "unsafe" e.g. to force data alignment in the model reader.
Again, thanks and very impressive!
It's best to call a user callback. That way logs can be, for example, displayed in a GUI.
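For example, something along these lines (a hypothetical sketch of the callback approach; the `Chat`/`Logger` names and signatures are made up, not the actual lm.rs API):

    // The library accepts an optional logging callback instead of printing to stdout.
    pub type Logger<'a> = &'a mut dyn FnMut(&str);

    pub struct Chat<'a> {
        logger: Option<Logger<'a>>,
    }

    impl<'a> Chat<'a> {
        pub fn new(logger: Option<Logger<'a>>) -> Self {
            Self { logger }
        }

        fn log(&mut self, msg: &str) {
            if let Some(cb) = self.logger.as_mut() {
                cb(msg); // the caller decides: stdout, a GUI widget, a file, nothing...
            }
        }

        pub fn generate(&mut self, prompt: &str) {
            self.log(&format!("generating for a prompt of {} chars", prompt.len()));
            // ... run inference ...
        }
    }

    fn main() {
        // Here the "application" routes log lines to stderr with a prefix;
        // a GUI would append them to a text widget instead.
        let mut sink = |m: &str| eprintln!("[lm.rs] {m}");
        let mut chat = Chat::new(Some(&mut sink));
        chat.generate("Tell me about pelicans");
    }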
FYI I have a whole bunch of rust tools[0] for loading models and other LLM tasks. For example auto selecting the largest quant based on memory available, extracting a tokenizer from a gguf, prompting, etc. You could use this to remove some of the python dependencies you have.
Currently it's built to support llama.cpp, but this is pretty neat too. Any plans to support grammars?
Implementing LLM inference should/could really become the new "hello world!" for serious programmers out there :)
When I saw "no dependency" I thought maybe it could be no_std (llama.c is relatively lightweight in this regard). But it's definitely not `no_std` and in fact seems like it has several dependencies. Perhaps all of them are rust dependencies?
From the git history, it looks like the username of the person who posted this here is someone who has contributed to the project but isn't the primary author. If they could elaborate on what exactly they mean by saying this has "zero dependencies", that might be helpful.
That's only if you want to convert the model yourself; you don't need that if you use the converted weights on the author's huggingface page (in the “prepared-models” table of the README).
> From the git history, it looks like the username of the person who posted this here is someone who has contributed to the project but isn't the primary author.
Yup that's correct, so far I've only authored the dioxus GUI app.
> If they could elaborate on what exactly they mean by saying this has "zero dependencies", that might be helpful.
See my other response: https://news.ycombinator.com/item?id=41812665
What I wanted to express is that it doesn't have any pytorch or Cuda or onnx or whatever deep learning dependency and that all the logic is self contained.
To be totally transparent it has 5 Rust dependencies by default, two of them should be feature gated for the chat (chrono and clap), and then there are 3 utility crates that are used to get a little bit more performance out of the hardware (`rayon` for easier parallelization, `wide` for helping with SIMD, and `memmap2` for memory mapping of the model file).
As in, you don't need to have a C compiler, Python, or dynamic libraries. "Pure Rust" would be a better way to describe it.
And there's more: rayon and wide are only needed for performance, so we could trivially put them behind a feature flag, get to zero dependencies, and have the library work in a no_std context. But it would be so slow it would have no use at all, so I don't really think that makes sense to do except in order to win an argument…
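To sketch what that could look like (the "parallel" feature name and the `rmsnorm` helper here are illustrative, not lm.rs's actual code or flags):

    // Cargo.toml (hypothetical):
    //   [dependencies]
    //   rayon = { version = "1", optional = true }
    //
    //   [features]
    //   default = ["parallel"]
    //   parallel = ["dep:rayon"]
    #[cfg(feature = "parallel")]
    use rayon::prelude::*;

    // RMSNorm-style hot loop: parallel when the feature is on, plain loop otherwise.
    pub fn rmsnorm(out: &mut [f32], x: &[f32], weight: &[f32], eps: f32) {
        let ss = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
        let scale = 1.0 / (ss + eps).sqrt();

        #[cfg(feature = "parallel")]
        out.par_iter_mut()
            .zip(x.par_iter().zip(weight.par_iter()))
            .for_each(|(o, (xi, wi))| *o = wi * (scale * xi));

        #[cfg(not(feature = "parallel"))]
        for (o, (xi, wi)) in out.iter_mut().zip(x.iter().zip(weight)) {
            *o = wi * (scale * xi);
        }
    }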
It's worth noting that Rust's std library is really small, and you therefore need more dependencies in Rust than in some other languages like Python. There are some "blessed" crates though, like the ones maintained by the rust-lang team themselves (https://crates.io/teams/github:rust-lang:libs and https://crates.io/teams/github:rust-lang-nursery:libs). Also, when you add a dependency like Tokio, Axum, or Polars, these are often ecosystems of crates rather than singular crates.
Tl;dr: Good package managers end up encouraging micro-dependencies and dependency bloat because these things are now painless. Cargo is one of these good package managers.
EDIT: nobody is helped by 46 regex libraries, none of which implements Unicode fully, for example (not an example taken from the Rust community).
A big standard library has downsides of its own:
- it's inevitably going to accumulate mistakes/obsolete/deprecated stuff over time, because there can be only one version of it, and it needs to be backwards compatible.
- it makes porting the language to new platforms harder, since there's more stuff promised to work as standard.
- to reduce risk of having the above problems, stdlib usually sticks to basic lowest-common-denominator APIs, lagging behind the state of the art, creating a dilemma between using standard impl vs better but 3rd party impls (and large programs end up with both)
- with a one-size-fits-all library it's easy to add bloat from unnecessary features. Not all programs want to embed megabytes of Unicode metadata for a regex.
The goal of having common trustworthy code can be achieved in many other ways, such as having (de-facto) standard individual dependencies to choose from. Packages that aren't built-in can be versioned independently, and included only when necessary.
You're counting optional dependencies used in the binaries, which isn't fair (obviously the GUI app or the backend of the web UI are going to have dependencies!). But yes, 3 dependencies isn't literally zero dependencies.
For Rust you have the llama.cpp wrappers like llm_client (mine), and the candle-based projects mistral.rs and Kalosm.
Although my project does try to provide a mistral.rs implementation, I haven't fully migrated from llama.cpp. A full Rust implementation would be nice for quick install times (among other reasons). Right now my crate has to clone and build llama.cpp. It's automated for Mac, PC, and Linux, but it adds about a minute of build time.
> if I have a good GPU, I should look for alternatives.
If you actually want to run it, even just on the CPU, you should look for an alternative (and the alternative is called llama.cpp). This is more of an educational resource about how things work when you remove all the layers of complexity in the ecosystem.
LLMs are somewhat magical in how effective they can be, but in terms of code they're really simple.
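At its core, inference is little more than a loop around a forward pass; here is a toy sketch (the `Model` type, its `forward` method, and the dummy logits are all made up for illustration, not lm.rs's API):

    struct Model {
        vocab_size: usize,
    }

    impl Model {
        // Stand-in forward pass: a real one runs the transformer layers for `token`
        // at position `pos` and returns one logit per vocabulary entry.
        fn forward(&mut self, token: u32, pos: usize) -> Vec<f32> {
            (0..self.vocab_size)
                .map(|i| ((i as u32 ^ token) as f32 * 0.01 + pos as f32 * 0.001).sin())
                .collect()
        }
    }

    // A real loop would first feed every prompt token to fill the KV cache;
    // here we just extend the sequence greedily.
    fn generate(model: &mut Model, prompt: &[u32], max_new: usize) -> Vec<u32> {
        let mut tokens = prompt.to_vec();
        for _ in 0..max_new {
            let pos = tokens.len() - 1;
            let logits = model.forward(*tokens.last().unwrap(), pos);
            // Greedy decoding: take the highest-logit token (real samplers add
            // temperature, top-p, etc.).
            let next = logits
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i as u32)
                .unwrap();
            tokens.push(next);
        }
        tokens
    }

    fn main() {
        let mut model = Model { vocab_size: 32 };
        println!("{:?}", generate(&mut model, &[1, 2, 3], 8));
    }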
It's already using Dioxus (neat). I wonder if WASM could be put on the roadmap.
If this could run a lightweight LLM like RWKV in the browser, then the browser unlocks a whole class of new capabilities without calling any SaaS APIs.
https://github.com/maedoc/rwkv.js
It uses rwkv.cpp compiled with Emscripten, but I didn't quite figure out the tokenizer part yet (I only spent about an hour on it).
Nevertheless, I'm pretty sure the 1.6B RWKV-6 model would be totally usable offline, browser-only. It's not capable enough for general chat, but for RAG etc. it could be quite enough.
The library itself should be able to compile to WASM with very little change: rayon and wide (the only mandatory dependencies) support wasm out of the box, and you can get rid of memmap2 by replacing the `Mmap` type in transformer.rs with `&[u8]`.
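Roughly like this (a sketch of the idea with an assumed `parse_model` stand-in; the real lm.rs signatures differ): if the library only ever sees a `&[u8]`, native builds can keep mmap while a wasm build supplies the bytes some other way (e.g. a fetch).

    use std::fs::File;

    // Stand-in for the real model parsing; the point is that it only needs a byte slice.
    fn parse_model(bytes: &[u8]) -> usize {
        bytes.len()
    }

    fn main() -> std::io::Result<()> {
        let path = "llama3.2-1b-it-q80.lmrs";

        // Native path: memory-map the file with memmap2.
        // Safety: the file must not be truncated or modified while it is mapped.
        let file = File::open(path)?;
        let mapped = unsafe { memmap2::Mmap::map(&file)? };
        parse_model(&mapped); // &Mmap derefs to &[u8]

        // Portable path: read the whole file into an owned buffer instead.
        let owned: Vec<u8> = std::fs::read(path)?;
        parse_model(&owned);

        Ok(())
    }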
That being said, RWKV is a completely different architecture, so it would need to be reimplemented entirely and is not likely to ever be part of the roadmap (I'm not the main author so I can't say for sure, but I really doubt it).
It's good enough to get pretty good performance for little effort, but I don't think it would win a benchmark race either.