201 points by littlestymaar 11 hours ago | 11 comments
  • simonw 8 hours ago
    This is impressive. I just ran the 1.2G llama3.2-1b-it-q80.lmrs on a M2 64GB MacBook and it felt speedy and used 1000% of CPU across 13 threads (according to Activity Monitor).

        cd /tmp
        git clone https://github.com/samuel-vitorino/lm.rs
        cd lm.rs
        RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat
        curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/tokenizer.bin?download=true'
        curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/llama3.2-1b-it-q80.lmrs?download=true'
        ./target/release/chat --model llama3.2-1b-it-q80.lmrs
    • amelius 8 hours ago
      Not sure how to formulate this, but what does this mean in terms of how "smart" it is compared to the latest ChatGPT version?
      • simonw 8 hours ago
        The model I'm running here is Llama 3.2 1B, the smallest on-device model I've tried that has given me good results.

        The fact that a 1.2GB download can do as well as this is honestly astonishing to me - but it's going to be laughably poor in comparison to something like GPT-4o - which I'm guessing is measured in the 100s of GBs.

        You can try out Llama 3.2 1B yourself directly in your browser (it will fetch about 1GB of data) at https://chat.webllm.ai/

        • iknowstuff 3 hours ago
          anyone else think 4o is kinda garbage compared to the older gpt4? as well as o1-preview and probably o1-mini.

          gpt4 tends to be more accurate than 4o for me.

          • airstrike an hour ago
            I sort of do, especially against OG GPT-4 (before turbo)

            4o is a bit too lobotomized for my taste. If you try to engage in conversation, nearly every answer after the first starts with "You're absolutely right". Bro, I don't know if I'm right, that's why I'm asking a question!

            It's somehow better in _some_ scenarios but I feel like it's also objectively worse in others so it ends up being a wash. It paradoxically looks bad relative to GPT-4 but also makes GPT-4 feel worse when you go back to it...

            o1-preview has been growing on me despite its answers also being very formulaic (relative to the OG GPT-3.5 and GPT-4 models which had more "freedom" in how they answered)

      • littlestymaar 8 hours ago
        The implementation has no control over “how smart” the model is, and when it comes to Llama 1B, it's not very smart by current standards (but it would still have blown everyone's mind just a few years back).
        • KeplerBoy 7 hours ago
          The implementation absolutely can influence the outputs.

          If you have a sloppy implementation which somehow accumulates a lot of error in its floating point math, you will get worse results.

          It's rarely talked about, but it's a real thing. Floating point addition and multiplication are non-associative, so the order of operations affects both correctness and performance. Developers might (unknowingly) trade performance for correctness, and it matters a lot more at the low precisions we operate at today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0, and you won't get close to the best approximation if you do it in a naive loop.
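
          For example, something along these lines shows the effect (a sketch, not from lm.rs; it assumes the `half` crate for an fp16 type):

              use half::f16;

              fn main() {
                  let ones = vec![f16::ONE; 9_999];

                  // Naive left-to-right accumulation in fp16: once the running sum
                  // reaches 2048, adding 1.0 rounds back down to 2048 and the sum plateaus.
                  let mut naive = f16::ZERO;
                  for &x in &ones {
                      naive = f16::from_f32(naive.to_f32() + x.to_f32()); // round after every add
                  }

                  // Pairwise (tree) summation keeps the partial sums small, so far less
                  // rounding error accumulates.
                  fn pairwise(xs: &[f16]) -> f16 {
                      match xs.len() {
                          0 => f16::ZERO,
                          1 => xs[0],
                          n => {
                              let (lo, hi) = xs.split_at(n / 2);
                              f16::from_f32(pairwise(lo).to_f32() + pairwise(hi).to_f32())
                          }
                      }
                  }

                  println!("naive:    {}", naive);           // 2048
                  println!("pairwise: {}", pairwise(&ones)); // 10000, the nearest f16 to 9,999
              }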

          • sroussey 7 hours ago
            How well does bf16 work in comparison?
            • KeplerBoy 6 hours ago
              Even worse, I'd say, since it has fewer bits for the fraction. At least in the example I was mentioning, where you run into precision limits, not range limits.

              I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits added to be a valid fp32.
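
              A quick illustration of that last point (a sketch, not from the thread): widening a bf16 bit pattern to fp32 is just a 16-bit left shift of the raw bits.

                  fn bf16_bits_to_f32(bits: u16) -> f32 {
                      f32::from_bits((bits as u32) << 16)
                  }

                  fn main() {
                      // 0x3F80 is 1.0 in bf16 (same sign/exponent layout as the top half of an f32).
                      assert_eq!(bf16_bits_to_f32(0x3F80), 1.0_f32);
                  }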

          • jiggawatts 5 hours ago
            I thought all current implementations accumulate into fp32 instead of accumulating in fp16.
            • KeplerBoy 4 hours ago
              I haven't looked at all implementations, but the hardware (tensor cores as well as CUDA cores) allows you to accumulate at fp16 precision.
          • littlestymaar 7 hours ago
            TIL, thanks.
    • littlestymaar 8 hours ago
      Could you try with

          ./target/release/chat --model llama3.2-1b-it-q80.lmrs --show-metrics
      
      to know how many tokens/s you get?
  • jll29 6 hours ago
    This is beautifully written, thanks for sharing.

    I could see myself using some of the source code in the classroom to explain how transformers "really" work; code is more concrete/detailed than all those pictures of attention heads etc.

    Two points of minor criticism/suggestions for improvement:

    - libraries should not print to stdout, as that output may destroy application output (imagine I want to use the library in a text editor to offer style checking). So it's best to write to a string buffer owned by a logging class instance associated with an lm.rs object.

    - Is it possible to do all this without "unsafe", without twisting one's arm? I see there are uses of "unsafe" e.g. to force data alignment in the model reader.

    Again, thanks and very impressive!

    • LoganDark 3 hours ago
      > best to write to a string buffer

      It's best to call a user callback. That way logs can be, for example, displayed in a GUI.
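
      Something along these lines, as a hypothetical sketch (the names are made up, not the actual lm.rs API):

          pub struct Model {
              on_log: Option<Box<dyn Fn(&str) + Send + Sync>>,
          }

          impl Model {
              /// The caller decides where log lines go: stderr, a GUI widget, a file, ...
              pub fn set_log_callback(&mut self, cb: impl Fn(&str) + Send + Sync + 'static) {
                  self.on_log = Some(Box::new(cb));
              }

              fn log(&self, msg: &str) {
                  if let Some(cb) = &self.on_log {
                      cb(msg);
                  }
              }
          }

      That also covers the string-buffer case: the callback can simply append to a String.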

  • J_Shelby_J 8 hours ago
    Neat.

    FYI I have a whole bunch of Rust tools[0] for loading models and other LLM tasks: for example, auto-selecting the largest quant based on available memory, extracting a tokenizer from a GGUF, prompting, etc. You could use this to remove some of the Python dependencies you have.
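
    As a rough sketch of the quant-selection idea (illustrative only, not llm_client's actual API): pick the largest quantization whose file fits within a memory budget.

        fn pick_largest_quant(quants: &[(String, u64)], budget_bytes: u64) -> Option<&(String, u64)> {
            quants
                .iter()
                .filter(|(_, size)| *size <= budget_bytes)
                .max_by_key(|(_, size)| *size)
        }

        fn main() {
            // (quant name, file size in bytes) -- made-up example numbers
            let quants = vec![
                ("Q8_0".to_string(), 1_300_000_000u64),
                ("Q4_K_M".to_string(), 800_000_000),
                ("Q2_K".to_string(), 500_000_000),
            ];
            let budget = 1_000_000_000; // e.g. free RAM minus some headroom
            println!("{:?}", pick_largest_quant(&quants, budget)); // Some(("Q4_K_M", 800000000))
        }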

    Currently they're built to support llama.cpp, but this is pretty neat too. Any plans to support grammars?

    [0] https://github.com/ShelbyJenkins/llm_client

  • gip 8 hours ago
    Great! Did something similar some time ago [0] but the performance was underwhelming compared to C/C++ code running on CPU (which points to my lack of understanding of how to make Rust fast). Would be nice to have some benchmarks of the different Rust implementations.

    Implementing LLM inference should/could really become the new "hello world!" for serious programmers out there :)

    [0] https://github.com/gip/yllama.rs

  • wyldfire 8 hours ago
    The title is less clear than it could be IMO.

    When I saw "no dependency" I thought maybe it could be no_std (llama.c is relatively lightweight in this regard). But it's definitely not `no_std` and in fact seems like it has several dependencies. Perhaps all of them are Rust dependencies?

    • saghm 8 hours ago
      The readme seems to indicate that it expects pytorch alongside several other Python dependencies in a requirements.txt file (which is the only place I can find any form of the word "dependency" on the page). I'm very confused by the characterization in the title here given that it doesn't seem to be claimed at all by the project itself (which simply has the subtitle "Minimal LLM inference in Rust").

      From the git history, it looks like the username of the person who posted this here is someone who has contributed to the project but isn't the primary author. If they could elaborate on what exactly they mean by saying this has "zero dependencies", that might be helpful.

      • littlestymaar 8 hours ago
        > The readme seems to indicate that it expects pytorch alongside several other Python dependencies in a requirements.txt file

        That's only if you want to convert the model yourself; you don't need that if you use the converted weights on the author's Hugging Face page (in the “prepared-models” table of the README).

        > From the git history, it looks like the username of the person who posted this here is someone who has contributed to the project but isn't the primary author.

        Yup, that's correct; so far I've only authored the Dioxus GUI app.

        > If they could elaborate on what exactly they mean by saying this has "zero dependencies", that might be helpful.

        See my other response: https://news.ycombinator.com/item?id=41812665

    • ctz 7 hours ago
      The original may have made sense, e.g. "no hardware dependency" or "no GPU dependency". Unfortunately HN deletes words from titles with no rhyme or reason, and no transparency.
    • littlestymaar 8 hours ago
      Titles are hard.

      What I wanted to express is that it doesn't have any PyTorch or CUDA or ONNX or whatever deep learning dependency, and that all the logic is self-contained.

      To be totally transparent, it has 5 Rust dependencies by default: two of them should be feature-gated for the chat (chrono and clap), and then there are 3 utility crates that are used to get a little more performance out of the hardware (`rayon` for easier parallelization, `wide` for helping with SIMD, and `memmap2` for memory-mapping the model file).

      • J_Shelby_J 7 hours ago
        Yeah, it's hard not to be overly verbose. “No massive dependencies with long build times and deep abstractions!” is not as catchy.
        • 0x457 6 hours ago
          No dependencies in this case (and pretty much any Rust project) means: to build it you need rustc+cargo, and to use it you just need the resulting binary.

          As in you don't need to have a C compiler, Python, or dynamic libraries. "Pure Rust" would be a better way to describe it.

          • littlestymaar 6 hours ago
            It's a little bit more than pure Rust: to build the library there are basically only two dependencies (rayon and wide), which bring only 14 transitive dependencies (anyone who's built even a simple Rust program knows that this is a very small number).

            And there's more: rayon and wide are only needed for performance. We could trivially put them behind a feature flag, end up with zero dependencies, and have the library work in a no-std context, but it would be so slow it would have no use at all, so I don't really think that makes sense to do except to win an argument…
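
            For what it's worth, such a gate could look roughly like this (the "parallel" feature name is made up here, with rayon as an optional dependency):

                // With the hypothetical "parallel" feature enabled, use rayon...
                #[cfg(feature = "parallel")]
                fn dot(a: &[f32], b: &[f32]) -> f32 {
                    use rayon::prelude::*;
                    a.par_iter().zip(b.par_iter()).map(|(x, y)| x * y).sum()
                }

                // ...otherwise fall back to a plain serial loop.
                #[cfg(not(feature = "parallel"))]
                fn dot(a: &[f32], b: &[f32]) -> f32 {
                    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
                }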

    • vitaminka 8 hours ago
      is rust cargo basically like npm at this point? like how on earth does sixteen dependencies mean no dependencies lol
      • tormeh 6 hours ago
        Yes, basically. If you're a dependency maximalist (never write any code that can be replaced by a dependency), you can easily end up with a thousand dependencies. I don't like things being that way, but others do.

        It's worth noting that Rust's std library is really small, and you therefore need more dependencies in Rust than in some other languages like Python. There are some "blessed" crates though, like the ones maintained by the rust-lang team themselves (https://crates.io/teams/github:rust-lang:libs and https://crates.io/teams/github:rust-lang-nursery:libs). Also, when you add a dependency like Tokio, Axum, or Polars, these are often ecosystems of crates rather than singular crates.

        Tl;dr: Good package managers end up encouraging micro-dependencies and dependency bloat because these things are now painless. Cargo is one of these good package managers.

        • jll29 5 hours ago
          How about designing a "proper" standard library for Rust (comparable to Java's or Common Lisp's), to ensure a richer experience, avoiding dependency explosions, and also to ensure things are written in a uniform interface style? Is that something the Rust folks are considering or actively working on?

          EDIT: nobody is helped by 46 regex libraries, none of which implements Unicode fully, for example (not an example taken from the Rust community).

          • pornel 3 hours ago
            The particular mode of distribution of code as a traditional standard library has downsides:

            - it's inevitably going to accumulate mistakes/obsolete/deprecated stuff over time, because there can be only one version of it, and it needs to be backwards compatible.

            - it makes porting the language to new platforms harder, since there's more stuff promised to work as standard.

            - to reduce the risk of the above problems, a stdlib usually sticks to basic lowest-common-denominator APIs, lagging behind the state of the art, which creates a dilemma between using the standard impl vs. better but third-party impls (and large programs end up with both)

            - with a one-size-fits-all library it's easy to add bloat from unnecessary features. Not all programs want to embed megabytes of Unicode metadata for a regex.

            The goal of having common trustworthy code can be achieved in many other ways, such as having (de-facto) standard individual dependencies to choose from. Packages that aren't built-in can be versioned independently, and included only when necessary.

          • tormeh 4 hours ago
            Just use the rust-lang org's regex crate. It's fascinating that you managed to pick one of like 3 high-level use-cases that are covered by official rust-lang crates.
      • littlestymaar 8 hours ago
        > like how on earth does sixteen dependencies mean no dependencies lol

        You're counting optional dependencies used in the binaries, which isn't fair (obviously the GUI app or the backend of the web UI are going to have dependencies!). But yes, 3 dependencies isn't literally no dependencies.

  • dcreater 2 hours ago
    What's the value of this compared to llama.cpp?
    • kvakkefly 2 hours ago
      Cleaner codebase because of fewer features!
  • kvakkefly 2 hours ago
    Would love to see a wasm version of this!
  • lucgagan 8 hours ago
    Correct me if I am wrong, but these implementations are all CPU-bound? I.e., if I have a good GPU, I should look for alternatives.
    • bt1a 8 hours ago
      You are correct. This project is "on the CPU", so it will not utilize your GPU for computation. If you would like to try out a Rust framework that does support GPUs, Candle https://github.com/huggingface/candle/tree/main may be worth exploring.
    • J_Shelby_J 8 hours ago
      Yes. Depending on the GPU, a 10-20x difference.

      For Rust you have the llama.cpp wrappers like llm_client (mine), and the Candle-based projects mistral.rs and Kalosm.

      Although my project does try to provide a mistral.rs implementation, I haven't fully migrated from llama.cpp. A full Rust implementation would be nice for quick install times (among other reasons). Right now my crate has to clone and build. It's automated for Mac, PC, and Linux, but it adds about a minute of build time.

    • littlestymaar 8 hours ago
      It's all implemented on the CPU, yes; there's no GPU acceleration whatsoever (at the moment at least).

      > if I have a good GPU, I should look for alternatives.

      If you actually want to run it, even just on the CPU, you should look for an alternative (and the alternative is called llama.cpp); this is more of an educational resource about how things work when you remove all the layers of complexity in the ecosystem.

      LLMs are somewhat magical in how effective they can be, but in terms of code they're really simple.

  • echelon 9 hours ago
    This is really cool.

    It's already using Dioxus (neat). I wonder if WASM could be put on the roadmap.

    If this could run a lightweight LLM like RWKV in the browser, then the browser unlocks a whole class of new capabilities without calling any SaaS APIs.

    • marmaduke 8 hours ago
      I was poking at this a bit here

      https://github.com/maedoc/rwkv.js

      using rwkv.cpp compiled with Emscripten, but I didn't quite figure out the tokenizers part (yet; I only spent about an hour on it).

      Nevertheless I am pretty sure the 1.6B RWKV-6 would be totally usable offline, browser-only. It's not capable enough for general chat, but for RAG etc. it could be quite enough.

    • littlestymaar 6 hours ago
      > I wonder if WASM could be put on the roadmap.

      The library itself should be able to compile to WASM with very little change: rayon and wide, the only mandatory dependencies, support WASM out of the box, and memmap2 could be dropped by replacing the `Mmap` type in transformer.rs with `&[u8]`.
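
      Roughly this kind of swap (illustrative names, not the actual lm.rs types): anything that currently borrows from the `Mmap` can borrow from an owned buffer instead, and on WASM the bytes would come from a fetch rather than the filesystem.

          use std::fs;

          struct Transformer<'a> {
              weights: &'a [u8], // was: backed by memmap2::Mmap
          }

          fn main() -> std::io::Result<()> {
              // The bytes can come from anywhere: fs::read on native, a fetch into a Vec<u8> on WASM.
              let data: Vec<u8> = fs::read("llama3.2-1b-it-q80.lmrs")?;
              let _t = Transformer { weights: &data };
              Ok(())
          }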

      That being said, RWKV is a completely different architecture, so it would have to be reimplemented entirely and is not likely to ever be part of the roadmap (I'm not the main author so I can't say for sure, but I really doubt it).

  • fuddle 9 hours ago
    Nice work! It would be great to see some benchmarks comparing it to llm.c.
    • littlestymaar 5 hours ago
      I doubt it would compare favorably at the moment; I don't think it's particularly well optimized, besides using rayon for CPU parallelism and wide for a bit of SIMD.

      It's good enough to get pretty good performance for little effort, but I don't think it would win a benchmark race either.

  • marques576 6 hours ago
    Such a talented guy!