> I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.
Then this is great.
If your goal is
> Run and explore Llama models locally with minimal dependencies on CPU
then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
The model storage doesn't bother me but I also use Docker so I'm used to having a lot of tool-managed data to deal with. YMMV.
Edit: Removed question about GPU support.
Even I haven’t thought about this so deeply, even though I am also very concerned about honoring other people’s work and making sure that licenses are followed.
I have some command line tools for example that I’ve written in Rust that depend on various libraries. But because I distribute my software in source form mostly, I haven’t really paid attention to how a command-line tool which is distributed as a compiled binary would make sure to include attribution and copies of the licenses of its dependencies.
And so the main place where I’ve given more thought to those concerns is, for example, full-blown GUI apps. There they usually have an About menu that includes info about their dependencies. The other place where I’ve thought about it is commercial electronics that make use of open source software in their firmware. Those physical products sometimes include printed documents alongside the product where attributions and license texts are found, and sometimes, if the product has a display or a display output, they have a menu you can find somewhere with that sort of info.
I know that, for example, Debian is very good at being thorough with details about licenses, but I’ve never looked at what they do with command line tools that compile third-party code into them. Like, do Debian package maintainers then for example dig up copies of the licenses from the source and dependencies and put them somewhere in /usr/share/ as plain text files? Or do the .deb files themselves contain license text copies you can view but which are not installed onto the system? Or do they work with software authors to add a flag that will show licenses? Or something else?
> such great performance that I've mostly given up on GPU for LLMs
I mean I used to run ollama on GPU, but llamafile was approximately the same performance on just CPU so I switched. Now that might just be because my GPU is weak by current standards, but that is in fact the comparison I was making.
Edit: Though to be clear, ollama would easily be my second pick; it also has minimal dependencies and is super easy to run locally.
Looks like there’s a typo, Windows is mentioned twice.
First time that I have a "it just works" experience with LLMs on my computer. Amazing. Thanks for the recommendation!
Loved llamafile and used it to build the first version of https://recurse.chat/, but live compilation using the Xcode Command Line Tools is a no-go for Mac App Store builds (which run in the Mac App Sandbox). llama.cpp doesn't need compiling on the user's machine, fwiw.
  148 tokens predicted, 159 ms per token, 6.27 tokens per second
It's impressive to realize how little code is needed to run these models at all.
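For anyone wanting to sanity-check a timing line like the one above, here is a minimal sketch of the arithmetic. The function names are made up for illustration. Note that the printed per-token latency is rounded, so recomputing from 159 ms gives a slightly different rate than the logged 6.27, which was presumably derived from the unrounded timings.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to a throughput."""
    return 1000.0 / ms_per_token


def total_seconds(n_tokens: int, ms_per_token: float) -> float:
    """Approximate wall-clock time spent generating n_tokens."""
    return n_tokens * ms_per_token / 1000.0


# Figures taken from the timing line above.
rate = tokens_per_second(159)
elapsed = total_seconds(148, 159)
print(f"{rate:.2f} tokens/s, about {elapsed:.1f} s for 148 tokens")
```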
Seems like torchchat is exactly what the author was looking for.
> And the 8B model typically gets killed by the OS for using too much memory.
Torchchat also provides some quantization options so you can reduce the model size to fit into memory.
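To see why quantization helps here, a rough back-of-the-envelope sketch. It assumes a round 8e9 parameters and counts only the weights, ignoring the KV cache, activations, and any per-block overhead that real quantization formats add. The helper name is made up and this is not torchchat's API.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 8e9  # assumption: a round "8B", not an exact parameter count

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label:>9}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights drop from roughly 30 GiB at fp32 to under 4 GiB at 4 bits, which is the difference between getting killed by the OS and fitting on a typical laptop.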
This just imports the Llama reference implementation and patches the device FYI.
There are more robust implementations out there.