They claim their ANE-optimized models achieve "up to 10 times faster and 14 times lower peak memory consumption compared to baseline implementations."
AFAIK, neither MLX nor llama.cpp supports the ANE, though llama.cpp has started exploring the idea [0].
What's odd is that MLX is made by Apple, and yet even they can't support the ANE, given its closed-source API! [1]
[0]: https://github.com/ggml-org/llama.cpp/issues/10453
[1]: https://github.com/ml-explore/mlx/issues/18#issuecomment-184...
More extensive information at https://github.com/tinygrad/tinygrad/tree/master/extra/accel... (from the tinygrad folks; note that this is also somewhat outdated) seems to basically confirm the above.
(The jury is still out for M3/M4 which currently have no Asahi support - thus, no current prospects for driving the ANE bare-metal. Note however that the M3/Pro/Max ANE reported performance numbers are quite close to the M2 version, so there may not be a real improvement there either. M3 Ultra and especially the M4 series may be a different story.)
I would say though that this likely excludes them from being useful for training purposes.
At that point, the ANE loses because you have to split the model into chunks and only one fits at a time.
Chunking is actually beneficial as long as all the chunks can fit into the ANE’s cache. It speeds up compilation for large network graphs, and cached loads have negligible cost. On M1 the cache limit is 3-4GB, but it is higher on M2+.
I had also assumed that loading a chunk from the cache was not free, because I’d seen cache eviction on my M1, but it’s good to know that it’s no longer as big of a limitation.
Also, I’m a big fan of your work! I played around with your ModernBERT CoreML port a while back.
Maybe cache is the wrong word. This is a limit to how much can be mmap'd for the ANE at once. It's not too hard to hit on M1 if your model is in the GB range. Chunking the model into smaller pieces makes it more likely to "fit", but if it doesn't fit you have to unmap/remap in each forward pass which will be noticeable.
Awesome to hear about ModernBERT! Big fan of your work as well :)
Which is still painfully slow. CoreML is not a real ML platform.
To answer the question though: I think this would be used when you're building an app that wants to run a small AI model while keeping the GPU free for graphics-related work, which I'm guessing is why Apple put these in their hardware in the first place.
Here is an interesting comparison between the two from a whisper.cpp thread - ignoring startup times - the CPU+ANE seems about on par with CPU+GPU: https://github.com/ggml-org/whisper.cpp/pull/566#issuecommen...
Yes, hammering the GPU too hard can affect the display server, but no, switching to the CPU is not a good alternative.
LLM performance is twice that of an RTX 5090
https://creativestrategies.com/mac-studio-m3-ultra-ai-workst...
Your tests are wrong: you used MLX for the Mac Studio (optimized for Apple Silicon), but you didn't use vLLM for the 5090. There's no way a machine with half the 5090's bandwidth delivers twice the tok/s.
Also, the GP was mostly testing models that fit on both the 5090 and the Mac Studio.
Apple isn't serious about AI and needs to figure out its AI story. Every other big tech company is doing something about it.
I wouldn't say Apple isn't serious about AI; they had the forethought to build a shared-memory architecture with the enormous memory bandwidth these workloads need, while also designing neural cores specifically for the small on-device models future apps will need.
I'd say Apple is currently ahead of NVIDIA in sheer memory available, which is kind of crucial for training and inference on large models, at least right now. NVIDIA seems to be purposefully limiting the memory available in their consumer cards, which is pretty short-sighted I think.
https://creativestrategies.com/mac-studio-m3-ultra-ai-workst...
The RTX 5090 only has 32 GB of VRAM. The M3 Ultra has up to 512 GB of unified memory with 819 GB/s of bandwidth. It can run models that will not fit on an RTX card.
EDIT: The benchmark may not be properly utilizing the 5090. But the M3 Ultra is way more capable than an entry-level RTX card at LLM inference.
Nvidia makes an incredible product, but Apple's different market-segmentation strategy might make it a real player in the long run.
16x the RAM of RTX 5090.
There are two versions of the M3 Ultra:
- 28-core CPU, 60-core GPU
- 32-core CPU, 80-core GPU
Both have a 32-core Neural Engine.
blog: https://machinelearning.apple.com/research/vision-transforme...
I wrote about it here [0], but the gist is you can have a fixed-size cache and slide it in chunks with each inference. Not as efficient as a cache that grows by one each time, of course.
[0]: https://stephenpanaro.com/blog/inside-apples-2023-transforme...
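For the curious, the general idea looks roughly like this (a minimal NumPy sketch of the concept, not the code from the post; sizes are made up):

    import numpy as np

    # Fixed-size cache so the ANE always sees static shapes; when it fills up,
    # shed the oldest CHUNK entries in one shift instead of growing by one.
    CACHE_LEN, CHUNK, HEAD_DIM = 512, 64, 64
    k_cache = np.zeros((CACHE_LEN, HEAD_DIM), dtype=np.float16)
    v_cache = np.zeros((CACHE_LEN, HEAD_DIM), dtype=np.float16)
    valid = 0  # number of populated slots

    def append_kv(new_k: np.ndarray, new_v: np.ndarray) -> int:
        """Store one token's K/V; slide the cache left by CHUNK when full."""
        global valid
        if valid == CACHE_LEN:
            k_cache[:-CHUNK] = k_cache[CHUNK:]
            v_cache[:-CHUNK] = v_cache[CHUNK:]
            valid -= CHUNK
        k_cache[valid] = new_k
        v_cache[valid] = new_v
        valid += 1
        return valid  # attention should only look at the first `valid` rows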
It took multiple tries to get the model to convert to the mlpackage format at all, and then a lot of experimenting to get it to run on the ANE instead of the GPU, only to discover that constant reshaping was killing any performance benefit (either you use a fixed multiplication size or you don't bother). Even at a fixed size and using the attention mask, its operations were slower than saturating the GPU with large batches.
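To illustrate the "fixed multiplication size" point, a conversion along these lines pins the input shape so Core ML never has to re-specialize (a toy model and placeholder names, not my actual conversion code):

    import torch
    import coremltools as ct

    # Toy stand-in for the real network (placeholder, just so this runs end to end).
    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.proj = torch.nn.Linear(64, 64)
        def forward(self, x):
            return self.proj(x)

    example = torch.zeros(1, 512, 64)  # fixed (batch, seq, hidden) shape
    traced = torch.jit.trace(TinyModel().eval(), example)

    mlmodel = ct.convert(
        traced,
        # Fixed shape: no ct.RangeDim / EnumeratedShapes, so no reshaping at runtime.
        inputs=[ct.TensorType(name="hidden", shape=(1, 512, 64))],
        minimum_deployment_target=ct.target.iOS17,
        compute_units=ct.ComputeUnit.CPU_AND_NE,  # keep it off the GPU
    )
    mlmodel.save("tiny.mlpackage")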
I discovered an issue where using the newer iOS 18 deployment target would cause the model conversion to break, and filed an issue on their GitHub, including an example repository for easy replication. I got a response quickly, but almost a year later, the bug is still unfixed.
Even when George Hotz attempted to hack it so it could be used without Apple's really bad and unmaintained CoreML library, he gave up because it was impossible without breaking some pretty core OS features (certificate signing, IIRC).
The ANE/CoreML team is just not serious about making their hardware usable at all. Even Apple's internal MLX team can't crack that nut.
There are a couple of ways to interface: DirectML from MS, and Intel's native API (they provide OpenVINO model conversion for converting normal Python ML models). I've tried ONNX Runtime conversions for both backends with little success. Additionally, the OpenVINO model conversion seems to break the model if the model is small enough.
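For reference, selecting those backends from ONNX Runtime looks roughly like this (the model path is a placeholder, and I'm assuming a float32 input just for illustration):

    import numpy as np
    import onnxruntime as ort

    # Prefer DirectML, then OpenVINO, then plain CPU, keeping only what this
    # particular ORT build actually ships with.
    preferred = ["DmlExecutionProvider", "OpenVINOExecutionProvider", "CPUExecutionProvider"]
    providers = [p for p in preferred if p in ort.get_available_providers()]

    session = ort.InferenceSession("model.onnx", providers=providers)  # path is a placeholder
    inp = session.get_inputs()[0]
    # Substitute 1 for any dynamic dimensions just to build a dummy input.
    x = np.zeros([d if isinstance(d, int) else 1 for d in inp.shape], dtype=np.float32)
    outputs = session.run(None, {inp.name: x})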
OpenVINO Model Server seems pretty polished and has OpenAI-compatible endpoints.
I pre-ordered the Snapdragon X dev kit from Qualcomm, but they ended up delivering only a few units, only to cancel the whole program. The whole thing turned out to be a hot-mess-express saga. THAT computer was going to be my Debian rig.
There was a guy using it for live video transformations, and it almost caused the phones to “melt”. [2]
[1] https://machinelearning.apple.com/research/neural-engine-tra...
[1] https://vengineer.hatenablog.com/entry/2024/10/13/080000
But the point was about area efficiency.
For laptops, 2x the GPU cores would make more sense; for phones/tablets, energy efficiency is everything.
GPU + dedicated AI hardware is virtually always the wrong approach compared to GPU + tensor cores.
https://discuss.pytorch.org/t/apple-neural-engine-ane-instea...
It seems intuitive that if they design hardware very specifically for these applications (beyond just fast matmuls on a GPU), they could squeeze out more performance.
It's about performance/power ratios.
It looked like even ANEMLL provides only limited low-level access for directing processing specifically toward the Apple Neural Engine, because Core ML still acts as the orchestrator. Instead, flags during conversion of a PyTorch or TensorFlow model can specify ANE-optimized operations, quantization, and parameters hinting at compute targets or optimization strategies. For example, restricting the compute units to CPU and Neural Engine (`MLModelConfiguration.computeUnits = .cpuAndNeuralEngine` at load time, or the equivalent compute-units flag at conversion time) disfavors the GPU cores.
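From Python, something like this restricts an already-converted model to the CPU and Neural Engine (the model path and input name here are assumptions, just to show the knob):

    import numpy as np
    import coremltools as ct

    # Load the converted package with GPU execution disallowed.
    mlmodel = ct.models.MLModel(
        "model.mlpackage",  # placeholder path
        compute_units=ct.ComputeUnit.CPU_AND_NE,  # CPU + Neural Engine only
    )
    out = mlmodel.predict({"input_ids": np.zeros((1, 512), dtype=np.int32)})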
Anyway, I didn't actually experiment with this, but at the time I thought maybe there could be a strategy of building a speculative decoding setup, with a small ANE-compatible model acting as the draft model, paired with a larger target model running on the GPU cores, the idea being that the ANE's low latency and high efficiency could accelerate results.
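To make the idea concrete, here's a toy sketch of the greedy draft-and-verify loop (the two "models" are trivial stand-ins, not an actual ANE/GPU pairing; this only shows the control flow):

    from typing import Callable, List

    def speculative_decode(
        prompt: List[int],
        draft_next: Callable[[List[int]], int],           # cheap draft model (e.g. the ANE one)
        target_greedy: Callable[[List[int]], List[int]],  # big model: greedy next token after every prefix
        k: int = 4,
        max_new: int = 16,
    ) -> List[int]:
        seq = list(prompt)
        while len(seq) - len(prompt) < max_new:
            # 1. Draft k tokens with the cheap model, one at a time.
            draft = []
            for _ in range(k):
                draft.append(draft_next(seq + draft))
            # 2. One "forward pass" of the target over seq + draft gives its greedy
            #    prediction after every prefix; accept drafts until they disagree.
            preds = target_greedy(seq + draft)
            accepted = []
            for i, d in enumerate(draft):
                t = preds[len(seq) + i - 1]  # target's choice after seq + draft[:i]
                accepted.append(t)           # keep the target's token either way
                if t != d:
                    break                    # first disagreement: stop accepting drafts
            seq.extend(accepted)
        return seq

    # Trivial stand-in "models" so this runs end to end (they always agree here).
    draft = lambda toks: (toks[-1] + 1) % 100
    target = lambda toks: [(t + 1) % 100 for t in toks]
    print(speculative_decode([1, 2, 3], draft, target))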
However, I would be interested to hear the perspective of people who actually know something about the subject.
Prompt: "Tell me a long story about the origins of 42 being the answer."
anemll: 9.3 tok/sec, ~500MB of memory used.
mlx 8bit: 31.33 tok/sec, ~8.5GB of memory used.
mlx bf16: 27.17 tok/sec, ~15.7GB of memory used.
Memory results are from Activity Monitor across any potentially involved processes, but I feel like I might be missing something here...
[0] https://huggingface.co/anemll/anemll-DeepSeekR1-8B-ctx1024_0...
[1] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...
[2] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...
LM Studio is an easy way to use both mlx and llama.cpp
anemll [0]: ~9.3 tok/sec
mlx [1]: ~50 tok/sec
gguf (llama.cpp b5219) [2]: ~41 tok/sec
[0] https://huggingface.co/anemll/anemll-DeepSeekR1-8B-ctx1024_0...
[1] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...
[2] (8bit) https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...
The tricks are more around optimizing for the hardware capabilities/constraints. For instance:
- conv2d is faster than linear (see Apple's post [0]), so you rewrite the model for that (example from the repo [1]; a quick sketch follows after the links)
- inputs/outputs are static shapes, so KV cache requires some creativity (I wrote about that here [2])
- compute is float16 (not bfloat16) so occasionally you have to avoid activation overflows
[0]: https://machinelearning.apple.com/research/neural-engine-tra...
[1]: https://github.com/Anemll/Anemll/blob/4bfa0b08183a437e759798...
[2]: https://stephenpanaro.com/blog/kv-cache-for-neural-engine
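Here's a quick sketch of that first trick, the conv2d-for-linear swap, with made-up dimensions; it just shows the reshape involved and checks that the two layers compute the same thing (my own illustration, not code from the repo):

    import torch
    import torch.nn as nn

    B, C_IN, C_OUT, SEQ = 1, 768, 3072, 128

    linear = nn.Linear(C_IN, C_OUT)
    conv = nn.Conv2d(C_IN, C_OUT, kernel_size=1)
    # Reuse the same weights so we can verify the two are equivalent.
    conv.weight.data = linear.weight.data.view(C_OUT, C_IN, 1, 1).clone()
    conv.bias.data = linear.bias.data.clone()

    x = torch.randn(B, SEQ, C_IN)               # usual (B, S, C) layout
    y_linear = linear(x)                        # (B, S, C_OUT)

    x_ane = x.transpose(1, 2).unsqueeze(2)      # -> (B, C_IN, 1, S), the ANE-friendly layout
    y_conv = conv(x_ane)                        # -> (B, C_OUT, 1, S)
    y_conv = y_conv.squeeze(2).transpose(1, 2)  # back to (B, S, C_OUT)

    print(torch.allclose(y_linear, y_conv, atol=1e-5))  # True, up to float error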
Also, the ANE models are limited to 512 tokens of context, so it's unlikely you'd use these in production yet.
$2,000 vs. $3,500 isn't well under half either.
The model does have some limitations (e.g., the need for QAT for 4-bit quantization, and the lack of a C++ runner to execute the model), but parts of the model are promising.
If interested in further discussion, join the conversation on the ExecuTorch discord channel: https://discord.gg/xHxqsD5b
mlx is much faster, but anemll appeared to use only 500MB of memory compared to the 8GB mlx used.
To some degree, that's an unavoidable consequence of how long it takes to design and ship specialized hardware with a supporting software stack. By contrast, ML research is moving way faster because they hardly ever ship anything product-like; it's a good day when the installation instructions for some ML thing only include three steps that amount to "download more Python packages".
And the lack of cross-vendor standardization for APIs and model formats is also at least partly a consequence of various NPUs evolving from very different starting points and original use cases. For example, Intel's NPUs are derived from Movidius, so they were originally designed for computer vision, and it's not at all a surprise that making them do LLMs might be an uphill battle. AMD's NPU comes from Xilinx IP, so their software mess is entirely expected. Apple and Qualcomm NPUs presumably are still designed primarily to serve smartphone use cases, which didn't include LLMs until after today's chips were designed.
It'll be very interesting to see how this space matures over the next several years, and whether the niche of specialized low-power NPUs survives in PCs or if NVIDIA's approach of only using the GPU wins out. A lot of that depends on whether anybody comes up with a true killer app for local on-device AI.
GPUs are gaining their own kinds of specialized blocks, such as matrix/tensor compute units, or BVH acceleration for ray tracing (which may or may not turn out to be useful for other stuff). So I'm not sure that there's any real distinction from that POV: a specialized low-power unit in an iGPU is going to be practically indistinguishable from an NPU, except that it will probably be easier to target from existing GPU APIs.
Possibly, depending on how low the power actually is. We can't really tell from NVIDIA's tensor cores, because waking up an NVIDIA discrete GPU at all has a higher power cost than running an NPU. Intel's iGPUs have matrix units, but I'm not sure if they can match their NPU on power or performance.
Edit: I changed llama.cpp to whisper.cpp - I didn’t realize that llama.cpp doesn’t have a coreml option like whisper.cpp does.
Currently, it is used, for example, through the Vision framework for OCR tasks (for instance, when previewing an image in macOS, OCR runs in the background on the ANE). Additionally, it is utilized by certain Apple Intelligence features that run locally (e.g., when I asked Writing Tools to rewrite this comment, I saw a spike in ANE usage).
It can also be used for diffusion image models (through Core ML; diffusers has a nice frontend for that), but my understanding is that it is primarily for "light" ML tasks within an application rather than for running larger models (that's also possible, but they'll probably run slower than on the GPU).
See my other comments. anemll appears to use less memory.
[0] https://huggingface.co/anemll/anemll-llama-3.2-1B-iOSv2.0
While it’s certainly nowhere near the memory bandwidth, 80 Gbps is on par with most high-end but still affordable machine-to-machine connections. Then add the fact that you can have hundreds of gigabytes of shared RAM on each machine.