Here's a fun way to start interacting with it (this loads and runs Llama 3.2 3B in a terminal chat UI):
uv run --isolated --with mlx-lm python -m mlx_lm.chat
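If you'd rather drive it from Python than use the chat UI, a rough sketch with the mlx-lm API looks like this (the 4-bit Llama 3.2 3B repo name is my assumption; any mlx-community model on Hugging Face should work):

    # Sketch of the mlx-lm Python API (pip install mlx-lm).
    # The model name is an assumption -- swap in any mlx-community repo.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

    # Run the request through the model's chat template first,
    # so the instruct model sees a properly formatted turn.
    messages = [{"role": "user", "content": "Explain MLX in one paragraph."}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    # verbose=True streams the output and prints tokens/sec stats.
    text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)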
This is typical of what happens any time I try to run something written in Python. It may be easier than setting up an NVIDIA GPU, but that's a low bar.
Python is awesome in many ways, one of my favourite languages, but unless you are happy with venv manipulation (or live in Conda), it's often a nightmare that ends up worse than DLL-hell.
Python is by no means alone in this or particularly egregious. Having been a heavy Perl developer in the 2000s, I was part of the problem. I didn't understand why other people had so much trouble doing things that seemed simple to me, because I was eating, breathing, and sleeping Perl. I knew how to prolong the intervals between breaking my installation, and how to troubleshoot and repair it, but there was no reason why anyone who wanted to deploy, or even develop on, my code base should have needed that encyclopedic knowledge.
This is why, for all their faults, I count containers as the biggest revolution in the software industry, at least for us "backend" folks.
What's your tokens/sec rate (and on which device)?
I don't have a token/second figure to hand but it's fast enough that I'm not frustrated by it.
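If you want an actual number, here's a rough way to get one with mlx-lm (verbose=True prints the library's own prompt/generation tok/sec figures; the model name is just an example):

    # Crude tokens/sec check with mlx-lm; the manual timing below is
    # a cross-check on the figures verbose=True already prints.
    import time
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

    start = time.time()
    text = generate(model, tokenizer, prompt="Count to twenty.",
                    max_tokens=200, verbose=True)
    elapsed = time.time() - start

    print(f"~{len(tokenizer.encode(text)) / elapsed:.1f} tok/sec overall")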
MLX looks really nice from the demo-level playing around with it I've done, but I usually stick to JAX so, you know, I can actually deploy it on a server without trying to find someone who racks Macs.
For 4-bit deepseek-r1-distill-llama-70b on a MacBook Pro M4 Max with the MLX version in LM Studio: 10.2 tok/sec on power and 4.2 tok/sec on battery / low power.
For 4-bit gemma-3-27b-it-qat I get: 26.37 tok/sec on power and 9.7 tok/sec on battery / low power.
It'd be nice to know all the possible power tweaks to get the value higher and get additional insight on how llm's work and interact with the cpu and memory.
What have you used those models for, and how would you rate them in those tasks?
Which benchmarks are you working off of, exactly? Unless you're memory-bottlenecked, neither raster nor compute workloads on M4 are more energy efficient than Nvidia's 50-series silicon: https://browser.geekbench.com/opencl-benchmarks
Apple M3 Ultra - 131247 - 200W [1]
Looks like it might be 2.8x faster in the benchmark, but it ends up using at least 3.25x more power; that works out to roughly 2.8 / 3.25 ≈ 0.86x the performance per watt.
[1] https://www.tweaktown.com/news/103891/apples-new-m3-ultra-ru...
Any other resources like that you could share?
Also, what kind of models do you run with mlx and what do you use them for?
Lately I've been pretty happy with gemma3:12b for a wide range of things (generating stories, some light coding, image recognition). Sometimes I've been surprised by qwen2.5-coder:32b. And I'm really impressed by the speed and versatility, at such a tiny size, of qwen2.5:0.5b (playing with fine-tuning it to see if I can get it to generate some decent conversations roleplaying as a character).
I mainly use MLX for LLMs (with https://github.com/ml-explore/mlx-lm and my own https://github.com/simonw/llm-mlx which wraps that), vision LLMs (via https://github.com/Blaizzy/mlx-vlm) and running Whisper (https://github.com/ml-explore/mlx-examples/tree/main/whisper)
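The Whisper port is also on PyPI as mlx-whisper; a minimal sketch of using it (the audio path and model repo are placeholders):

    # Minimal sketch of the MLX Whisper port (pip install mlx-whisper).
    # Audio path and model repo are placeholders.
    import mlx_whisper

    result = mlx_whisper.transcribe(
        "meeting.mp3",
        path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    )
    print(result["text"])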
I haven't tried mlx-audio yet (which can synthesize speech) but it looks interesting too: https://github.com/Blaizzy/mlx-audio
The two best people to follow for MLX stuff are Apple's Awni Hannun - https://twitter.com/awnihannun and https://github.com/awni - and community member Prince Canuma who's responsible for both mlx-vlm and mlx-audio: https://twitter.com/Prince_Canuma and https://github.com/Blaizzy
But as a measure of what you can achieve with a course like this: does anyone know what the max tok/s vs iPhone model plot looks like, and how MLX changes that plot?