Hang on, why is https://blob.localscore.ai/localscore-0.9.2 380MB? I remember llamafile being only a few megabytes. From https://github.com/Mozilla-Ocho/llamafile/releases, looks like it steadily grew from adding support for GPUs on more platforms, up to 28.5MiB¹ in 0.8.12, and then rocketed up to 230MiB in 0.8.13:
> The llamafile executable size is increased from 30mb to 200mb by this release. This is caused by https://github.com/ggml-org/llama.cpp/issues/7156. We're already employing some workarounds to minimize the impact of upstream development contributions on binary size, and we're aiming to find more in the near future.
Ah, of course, CUDA. Honestly I might be more surprised that it’s only this big. That monstrosity will happily consume a dozen gigabytes of disk space.
llamafile-0.9.0 was still 231MiB, then llamafile-0.9.1 was 391MiB, now llamafile-0.9.2 is 293MiB. Fluctuating all over the place, but growing a lot. And localscore-0.9.2 is 363MiB. Why 70MiB extra on top of llamafile-0.9.2? I’m curious, but not curious enough to investigate concretely.
Well, this became a grumble about bloat, but I’d still like to know whether it would be feasible to ship a smaller localscore that would synthesise a suitable model, according to the size required, at runtime.
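For concreteness, here's roughly what I mean: a sketch that uses llama.cpp's gguf Python package (gguf-py — an assumption on my part, not anything localscore does today) to write random-weight tensors until a requested file size is reached. The tensor names are placeholders, and a file llama.cpp would actually load needs tokenizer and architecture metadata on top of this, so whether localscore could benchmark such a file as-is is an open question.

```python
# Rough sketch, assuming llama.cpp's gguf-py package: emit a GGUF file of
# roughly `target_bytes` of random float32 weights. Tensor names are
# placeholders; real metadata (tokenizer, hyperparameters) is omitted.
import numpy as np
from gguf import GGUFWriter

def write_synthetic_gguf(path: str, target_bytes: int, cols: int = 4096) -> None:
    writer = GGUFWriter(path, "llama")
    remaining = target_bytes
    i = 0
    while remaining > 0:
        # float32 rows of `cols` elements; cap each tensor at 16384 rows (~256 MiB)
        rows = min(max(remaining // (cols * 4), 1), 16384)
        tensor = np.random.rand(rows, cols).astype(np.float32)
        writer.add_tensor(f"synthetic.{i}.weight", tensor)
        remaining -= tensor.nbytes
        i += 1
    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()

# e.g. fabricate an ~8 GiB "model" instead of downloading one
write_synthetic_gguf("synthetic-8gib.gguf", 8 * 1024**3)
```

Random weights produce gibberish output, but the matrix multiplications don't care what the numbers are, so prompt-processing and generation throughput should be representative, which is all a benchmark needs.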
—⁂—
¹ Eww, GitHub is using the “MB” suffix for its file sizes, but they’re actually mebibytes (2²⁰ bytes, 1048576 bytes, MiB). I thought we’d basically settled on returning the M/mega- prefix to SI with its traditional 10⁶ definition, at least for file sizes, ten or fifteen years ago. (Incidentally, that also explains the apparent mismatch above: 363MiB ≈ 380 decimal megabytes, so the 380MB blob and the 363MiB release are presumably the same file.)
1. User selects a model, its size, and a target token output speed and latency. The website generates a list of hardware components that should meet those performance requirements (a rough sketch of this kind of lookup is below).
2. User selects hardware components and the website generates a list of models that perform well on that hardware.
3. Monetize through affiliate links on the components to fund the project. Think something like PCPartPicker.
I know there's going to be some variability in the benchmarks due to the software stack, but it should give AI enthusiasts an educated perspective on what hardware is relevant for their use case.
Happy to test larger models that utilize the memory capacity if helpful.
Notes:
- Ollama integration would be nice.
- Is there anonymous federated score sharing? That way, users could approximate a model's performance before downloading it.
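A rough sketch of the hardware-matching lookup from idea 1 above, against a hypothetical CSV dump of LocalScore results — the column names (accelerator, model_params_b, tok_per_sec, ttft_ms) are made up for illustration, not LocalScore's actual schema:

```python
# Hypothetical: pick accelerators whose published runs for a given model size
# meet a target generation speed and time-to-first-token.
import csv

def hardware_for(results_csv: str, params_b: float,
                 min_tok_per_sec: float, max_ttft_ms: float) -> list[str]:
    matches = set()
    with open(results_csv, newline="") as f:
        for row in csv.DictReader(f):
            if (float(row["model_params_b"]) == params_b
                    and float(row["tok_per_sec"]) >= min_tok_per_sec
                    and float(row["ttft_ms"]) <= max_ttft_ms):
                matches.add(row["accelerator"])
    return sorted(matches)

# e.g. what can run an 8B model at 30+ tok/s with under half a second to first token?
print(hardware_for("localscore_results.csv", 8, 30, 500))
```

Idea 2 is the same query inverted: filter by accelerator and return the model sizes that clear the thresholds.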
A couple of ideas... I would like to benchmark a remote headless server, as well as different methods of running the LLM (vLLM vs TGI vs llama.cpp ...) on my local machine, and in this case llamafile is quite limiting. Connecting over an OpenAI-like API instead would be great!
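vLLM, TGI, llama.cpp's llama-server, and Ollama all expose an OpenAI-style streaming chat-completions endpoint, so a benchmark mode along these lines could cover them all. Rough sketch only: the base URL and model name are placeholders, and counting streamed chunks only approximates the token count.

```python
# Rough sketch: measure time-to-first-token and generation speed over an
# OpenAI-compatible /v1/chat/completions streaming endpoint.
import json, time, requests

def bench(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> None:
    t0 = time.perf_counter()
    first = None
    chunks = 0
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={"model": model,
              "messages": [{"role": "user", "content": prompt}],
              "max_tokens": max_tokens,
              "stream": True},
        stream=True, timeout=600,
    )
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if choices and choices[0].get("delta", {}).get("content"):
            if first is None:
                first = time.perf_counter()
            chunks += 1  # one streamed chunk is roughly one generated token
    if first is None:
        raise RuntimeError("no tokens received")
    elapsed = time.perf_counter() - first
    print(f"time to first token: {first - t0:.2f}s, ~{chunks / elapsed:.1f} tok/s")

bench("http://localhost:8000", "my-local-model", "Explain what a GGUF file is.")
```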
Stoked to have this dataset out in the open. I submitted a bunch of tests for some models I'm experimenting with on my M4 Pro. Rather paltry scores compared to having a dedicated GPU but I'm excited that running a 24B model locally is actually feasible at this point.
I’ll make sure to run and contribute my benchmark to this once my RAM comes in.
Clicking on a GPU gives a nice, simple visualization. I was thinking you could maybe make that type of visual representation intuitively accessible right on the landing page.
cpubenchmark.net could be an example of a technique for drawing the site visitor into the paradigm.