cd /tmp
git clone https://github.com/samuel-vitorino/lm.rs
cd lm.rs
RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat
curl -L -o tokenizer.bin 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/tokenizer.bin?download=true'
curl -L -o llama3.2-1b-it-q80.lmrs 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/llama3.2-1b-it-q80.lmrs?download=true'
./target/release/chat --model llama3.2-1b-it-q80.lmrs
The fact that a 1.2GB download can do as well as this is honestly astonishing to me - but it's going to look laughably poor in comparison to something like GPT-4o - which I'm guessing is measured in the 100s of GBs.
You can try out Llama 3.2 1B yourself directly in your browser (it will fetch about 1GB of data) at https://chat.webllm.ai/
gpt4 tends to be more accurate than 4o for me.
4o is a bit too lobotomized for my taste. If you try to engage in conversation, nearly every answer after the first starts with "You're absolutely right". Bro, I don't know if I'm right, that's why I'm asking a question!
It's somehow better in _some_ scenarios but I feel like it's also objectively worse in others so it ends up being a wash. It paradoxically looks bad relative to GPT-4 but also makes GPT-4 feel worse when you go back to it...
o1-preview has been growing on me despite its answers also being very formulaic (relative to the OG GPT-3.5 and GPT-4 models which had more "freedom" in how they answered)
If you have a sloppy implementation which somehow accumulates a lot of error in its floating point math, you will get worse results.
It's rarely talked about, but it's a real thing. Floating point addition and multiplication are non-associative, and the order of operations affects both correctness and performance. Developers might (unknowingly) trade correctness for performance. And it matters a lot more in the low precision modes we operate in today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0, and you won't get close to the best approximation if you do it in a naive loop.
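You can see this for yourself with a minimal sketch (this assumes the third-party `half` crate, version 2, for an fp16 type; it is not part of lm.rs):

    use half::f16; // half = "2" in Cargo.toml

    fn main() {
        let ones = vec![f16::ONE; 9_999];

        // Naive fp16 accumulation: fp16 has a 10-bit mantissa, so above 2048 the spacing
        // between representable values is 2. Adding 1.0 rounds back down (ties to even)
        // and the running sum gets stuck at 2048.
        let naive: f16 = ones.iter().fold(f16::ZERO, |acc, &x| acc + x);

        // Accumulating in f32 and rounding once at the end gives the best fp16 answer,
        // which is 10000.0 (9999 itself is not representable in fp16).
        let widened: f32 = ones.iter().map(|&x| f32::from(x)).sum();
        let rounded = f16::from_f32(widened);

        println!("naive fp16 sum:      {naive}");   // 2048
        println!("f32 sum, then round: {rounded}"); // 10000
    }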
I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits added to be a valid fp32.
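A quick illustration of why: dropping the low 16 bits of an fp32 gives you a bf16, and appending 16 zero bits brings it back (a sketch using plain truncation; real conversions usually round to nearest):

    // Keep the sign bit, the full 8-bit exponent, and the top 7 mantissa bits.
    fn f32_to_bf16_bits(x: f32) -> u16 {
        (x.to_bits() >> 16) as u16
    }

    // Append 16 zero bits and the result is a valid fp32 again.
    fn bf16_bits_to_f32(b: u16) -> f32 {
        f32::from_bits((b as u32) << 16)
    }

    fn main() {
        let x = 3.1415927_f32;
        let roundtrip = bf16_bits_to_f32(f32_to_bf16_bits(x));
        println!("{x} -> {roundtrip}"); // 3.1415927 -> 3.140625
    }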
./target/release/chat --model llama3.2-1b-it-q80.lmrs --show-metrics
That's how you see how many tokens/s you get. I got: Speed: 26.41 tok/s
Full output: https://gist.github.com/simonw/6f25fca5c664b84fdd4b72b091854...

I could see myself using some of the source code in the classroom to explain how transformers "really" work; code is more concrete/detailed than all those pictures of attention heads etc.
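For that kind of classroom walkthrough, the heart of it fits in one function; here is an illustrative single-head causal attention in plain Rust (my own sketch, not code lifted from lm.rs):

    // q, k, v each hold seq_len rows of dim floats, flattened row-major.
    fn attention(q: &[f32], k: &[f32], v: &[f32], seq_len: usize, dim: usize) -> Vec<f32> {
        let scale = 1.0 / (dim as f32).sqrt();
        let mut out = vec![0.0f32; seq_len * dim];

        for i in 0..seq_len {
            // Dot-product scores of query i against every earlier key (causal mask).
            let mut scores: Vec<f32> = (0..=i)
                .map(|j| {
                    (0..dim)
                        .map(|d| q[i * dim + d] * k[j * dim + d])
                        .sum::<f32>()
                        * scale
                })
                .collect();

            // Softmax over the scores (subtract the max for numerical stability).
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let sum: f32 = scores.iter_mut().map(|s| { *s = (*s - max).exp(); *s }).sum();

            // Output row i is the softmax-weighted sum of the value rows.
            for (j, s) in scores.iter().enumerate() {
                let w = s / sum;
                for d in 0..dim {
                    out[i * dim + d] += w * v[j * dim + d];
                }
            }
        }
        out
    }

    fn main() {
        let (seq_len, dim) = (3, 4);
        let x: Vec<f32> = (0..seq_len * dim).map(|i| i as f32 * 0.1).collect();
        let y = attention(&x, &x, &x, seq_len, dim); // toy self-attention
        println!("{:?}", &y[..dim]);
    }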
Two points of minor criticism/suggestions for improvement:
- libraries should not print to stdout, as that output may destroy application output (imagine I want to use the library in a text editor to offer style checking). So best to write to a string buffer owned by a logging instance associated with an lm.rs object.
- Is it possible to do all this without "unsafe" without twisting one's arm? I see there are uses of "unsafe" e.g. to force data alignment in the model reader.
Again, thanks and very impressive!
It's best to call a user callback. That way logs can be, for example, displayed in a GUI.
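For example, something along these lines (a hypothetical sketch of the callback approach; the `Chat`/`Logger` names and signatures are made up, not the actual lm.rs API):

    // The library accepts an optional logging callback instead of printing to stdout.
    pub type Logger<'a> = &'a mut dyn FnMut(&str);

    pub struct Chat<'a> {
        logger: Option<Logger<'a>>,
    }

    impl<'a> Chat<'a> {
        pub fn new(logger: Option<Logger<'a>>) -> Self {
            Self { logger }
        }

        fn log(&mut self, msg: &str) {
            if let Some(cb) = self.logger.as_mut() {
                cb(msg); // the caller decides: stdout, a GUI widget, a file, nothing...
            }
        }

        pub fn generate(&mut self, prompt: &str) {
            self.log(&format!("generating for a prompt of {} chars", prompt.len()));
            // ... run inference ...
        }
    }

    fn main() {
        // Here the "application" routes log lines to stderr with a prefix;
        // a GUI would append them to a text widget instead.
        let mut sink = |m: &str| eprintln!("[lm.rs] {m}");
        let mut chat = Chat::new(Some(&mut sink));
        chat.generate("Tell me about pelicans");
    }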
FYI I have a whole bunch of rust tools[0] for loading models and other LLM tasks. For example auto selecting the largest quant based on memory available, extracting a tokenizer from a gguf, prompting, etc. You could use this to remove some of the python dependencies you have.
Currently it's built to support llama.cpp, but this is pretty neat too. Any plans to support grammars?
Implementing LLM inference should/could really become the new "hello world!" for serious programmers out there :)
When I saw "no dependency" I thought maybe it could be no_std (llama.c is relatively lightweight in this regard). But it's definitely not `no_std` and in fact seems like it has several dependencies. Perhaps all of them are rust dependencies?
From the git history, it looks like the username of the person who posted this here is someone who has contributed to the project but isn't the primary author. If they could elaborate on what exactly they mean by saying this has "zero dependencies", that might be helpful.
That's only if you want to convert the model yourself; you don't need that if you use the converted weights on the author's huggingface page (in the “prepared-models” table of the README).
> From the git history, it looks like the username of the person who posted this here is someone who has contributed to the project but isn't the primary author.
Yup that's correct, so far I've only authored the dioxus GUI app.
> If they could elaborate on what exactly they mean by saying this has "zero dependencies", that might be helpful.
See my other response: https://news.ycombinator.com/item?id=41812665
What I wanted to express is that it doesn't have any pytorch or Cuda or onnx or whatever deep learning dependency and that all the logic is self contained.
To be totally transparent it has 5 Rust dependencies by default, two of them should be feature gated for the chat (chrono and clap), and then there are 3 utility crates that are used to get a little bit more performance out of the hardware (`rayon` for easier parallelization, `wide` for helping with SIMD, and `memmap2` for memory mapping of the model file).
As in, you don't need to have a C compiler, Python, or dynamic libraries. "Pure Rust" would be a better way to describe it.
And there's more: rayon and wide are only needed for performance, so we could trivially put them behind a feature flag, get to zero dependencies, and have the library work in a no_std context. But it would be so slow it would have no use at all, so I don't really think that makes sense to do except in order to win an argument…
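To sketch what that could look like (the "parallel" feature name and the `rmsnorm` helper here are illustrative, not lm.rs's actual code or flags):

    // Cargo.toml (hypothetical):
    //   [dependencies]
    //   rayon = { version = "1", optional = true }
    //
    //   [features]
    //   default = ["parallel"]
    //   parallel = ["dep:rayon"]
    #[cfg(feature = "parallel")]
    use rayon::prelude::*;

    // RMSNorm-style hot loop: parallel when the feature is on, plain loop otherwise.
    pub fn rmsnorm(out: &mut [f32], x: &[f32], weight: &[f32], eps: f32) {
        let ss = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
        let scale = 1.0 / (ss + eps).sqrt();

        #[cfg(feature = "parallel")]
        out.par_iter_mut()
            .zip(x.par_iter().zip(weight.par_iter()))
            .for_each(|(o, (xi, wi))| *o = wi * (scale * xi));

        #[cfg(not(feature = "parallel"))]
        for (o, (xi, wi)) in out.iter_mut().zip(x.iter().zip(weight)) {
            *o = wi * (scale * xi);
        }
    }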
It's worth noting that Rust's std library is really small, and you therefore need more dependencies in Rust than in some other languages like Python. There are some "blessed" crates though, like the ones maintained by the rust-lang team themselves (https://crates.io/teams/github:rust-lang:libs and https://crates.io/teams/github:rust-lang-nursery:libs). Also, when you add a dependency like Tokio, Axum, or Polars, these are often ecosystems of crates rather than singular crates.
Tl;dr: Good package managers end up encouraging micro-dependencies and dependency bloat because these things are now painless. Cargo is one of these good package managers.
EDIT: nobody is helped by 46 regex libraries, none of which implements Unicode fully, for example (not an example taken from the Rust community).
A big standard library has downsides of its own:
- it's inevitably going to accumulate mistakes/obsolete/deprecated stuff over time, because there can be only one version of it, and it needs to be backwards compatible.
- it makes porting the language to new platforms harder, since there's more stuff promised to work as standard.
- to reduce risk of having the above problems, stdlib usually sticks to basic lowest-common-denominator APIs, lagging behind the state of the art, creating a dilemma between using standard impl vs better but 3rd party impls (and large programs end up with both)
- with a one-size-fits-all library it's easy to add bloat from unnecessary features. Not all programs want to embed megabytes of Unicode metadata for a regex.
The goal of having common trustworthy code can be achieved in many other ways, such as having (de-facto) standard individual dependencies to choose from. Packages that aren't built-in can be versioned independently, and included only when necessary.
You're counting optional dependencies used in the binaries, which isn't fair (obviously the GUI app or the backend of the web UI are going to have dependencies!). But yes, 3 dependencies isn't literally zero dependencies.
For Rust you have the llama.cpp wrappers like llm_client (mine), and the candle-based projects mistral.rs and Kalosm.
Although my project does try to provide a mistral.rs implementation, I haven't fully migrated from llama.cpp. A full Rust implementation would be nice for quick install times (among other reasons). Right now my crate has to clone and build llama.cpp. It's automated for Mac, PC, and Linux, but it adds about a minute of build time.
> if I have a good GPU, I should look for alternatives.
If you actually want to run it, even just on the CPU, you should look for an alternative (and the alternative is called llama.cpp). This is more of an educational resource about how things work when you remove all the layers of complexity in the ecosystem.
LLMs are somewhat magical in how effective they can be, but in terms of code they're really simple.
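At its core, inference is little more than a loop around a forward pass; here is a toy sketch (the `Model` type, its `forward` method, and the dummy logits are all made up for illustration, not lm.rs's API):

    struct Model {
        vocab_size: usize,
    }

    impl Model {
        // Stand-in forward pass: a real one runs the transformer layers for `token`
        // at position `pos` and returns one logit per vocabulary entry.
        fn forward(&mut self, token: u32, pos: usize) -> Vec<f32> {
            (0..self.vocab_size)
                .map(|i| ((i as u32 ^ token) as f32 * 0.01 + pos as f32 * 0.001).sin())
                .collect()
        }
    }

    // A real loop would first feed every prompt token to fill the KV cache;
    // here we just extend the sequence greedily.
    fn generate(model: &mut Model, prompt: &[u32], max_new: usize) -> Vec<u32> {
        let mut tokens = prompt.to_vec();
        for _ in 0..max_new {
            let pos = tokens.len() - 1;
            let logits = model.forward(*tokens.last().unwrap(), pos);
            // Greedy decoding: take the highest-logit token (real samplers add
            // temperature, top-p, etc.).
            let next = logits
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i as u32)
                .unwrap();
            tokens.push(next);
        }
        tokens
    }

    fn main() {
        let mut model = Model { vocab_size: 32 };
        println!("{:?}", generate(&mut model, &[1, 2, 3], 8));
    }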
It's already using Dioxus (neat). I wonder if WASM could be put on the roadmap.
If this could run a lightweight LLM like RWKV in the browser, then the browser unlocks a whole class of new capabilities without calling any SaaS APIs.
https://github.com/maedoc/rwkv.js
It uses rwkv.cpp compiled with Emscripten, but I didn't quite figure out the tokenizer part yet (I only spent about an hour on it).
Nevertheless, I'm pretty sure the 1.6B RWKV-6 model would be totally usable offline, browser-only. It's not capable enough for general chat, but for RAG etc. it could be quite enough.
The library itself should be able to compile to WASM with very little change: rayon and wide (the only mandatory dependencies) support wasm out of the box, and you can get rid of memmap2 by replacing the `Mmap` type in transformer.rs with `&[u8]`.
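Roughly like this (a sketch of the idea with an assumed `parse_model` stand-in; the real lm.rs signatures differ): if the library only ever sees a `&[u8]`, native builds can keep mmap while a wasm build supplies the bytes some other way (e.g. a fetch).

    use std::fs::File;

    // Stand-in for the real model parsing; the point is that it only needs a byte slice.
    fn parse_model(bytes: &[u8]) -> usize {
        bytes.len()
    }

    fn main() -> std::io::Result<()> {
        let path = "llama3.2-1b-it-q80.lmrs";

        // Native path: memory-map the file with memmap2.
        // Safety: the file must not be truncated or modified while it is mapped.
        let file = File::open(path)?;
        let mapped = unsafe { memmap2::Mmap::map(&file)? };
        parse_model(&mapped); // &Mmap derefs to &[u8]

        // Portable path: read the whole file into an owned buffer instead.
        let owned: Vec<u8> = std::fs::read(path)?;
        parse_model(&owned);

        Ok(())
    }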
That being said, RWKV is a completely different architecture, so it would need to be reimplemented entirely and is not likely to ever be part of the roadmap (I'm not the main author so I can't say for sure, but I really doubt it).
It's good enough to get pretty good performance for little effort, but I don't think it would win a benchmark race either.