The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: even with enough unified RAM and the increased AI processing speed of the new chip, the memory bandwidth stays the same and becomes the limit.
So there must be a min-max performance ratio between memory bandwidth and the size of the memory pool in relation to the processing power.
From my napkin math, the M3 Ultra TFLOPS figure is still relatively low (around 43 FP16 TFLOPS?), but it should be more than enough to handle bs=1 token generation (which should need way <10 FLOPs/byte for inference). Now as far as its prefill/prompt processing speed goes... well, that's another matter.
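A rough sketch of that napkin math, assuming the ~43 FP16 TFLOPS and ~819 GB/s figures quoted in this thread (so treat the exact numbers as assumptions): bs=1 decode only needs on the order of 1 FLOP per byte of weights streamed, far below what the compute side could sustain, which is why generation is bandwidth-bound and prefill is the harder case.

    # Napkin math; compute and bandwidth figures are the ones quoted in this thread (assumed).
    compute_flops = 43e12        # FP16 FLOP/s
    bandwidth_bytes = 819e9      # bytes/s

    machine_balance = compute_flops / bandwidth_bytes   # ~52 FLOPs available per byte moved

    # bs=1 decode: every active weight is read once per token, ~2 FLOPs (mul+add) per weight.
    # At FP16 (2 bytes/weight) that's ~1 FLOP needed per byte moved.
    decode_intensity = 2 / 2

    print(machine_balance, decode_intensity)  # ~52.5 vs 1.0 -> decode is memory-bound, compute is plenty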
Fact is, Mac Pros in the Intel days supported 1.5TB of RAM in some configurations[1], and that was the expectation of their high-end customer base six years ago. They needed to address the gap for those customers, so they would have shipped such a product regardless. Local LLMs are the cherry on top. DeepSeek in particular almost certainly had nothing to do with it. They will still need to double the RAM supported by their SoC to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.
RIP Jay Miner who watched his unified memory daughters Agnus, Denise and Paula be slowly murdered by Jack Tramiel's vengeance against Irving Gould. [Why couldn't the shareholders have stormed their boardroom 180 days before the company ran out of cash, installed interim management who, in turn, would have brought back the megalomaniac Founder that would, until his dying breath, keep spreading their cash to the super brilliant geniuses that made all the magic chips happen and then turn the resulting empire over to ops people to make their workplace so uncomfortable they all retire early and live happily ever after on tropical islands and snowy mountain tops?]
That said, I doubt it was explicitly for R1, but rather based on where the industry was a few years ago, when GPT-3's 175B was SOTA but the industry was still looking larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.
Which came out in what, mid January? Yeah, there's no chance Apple (or anyone) has built a new chip in the last 45 days.
It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)
But the decision to come to market with a 512GB sku may have changed from not making sense to “people will buy this”.
This was just a coincidence.
It's not completely out of the question that the 512GB version of the M3 Ultra was built for the internal Apple silicon servers powering Private Cloud Compute, and not intended for consumer release, until a compelling use case suddenly arrived.
I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.
Literally impossible.
I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.
But I agree they are not using m3 ultra for that. It wouldn’t make any sense.
0. https://www.theregister.com/AMP/2024/06/11/apple_built_ai_cl...
See photo without heatspreader here: https://wccftech.com/apple-m2-ultra-soc-delidded-package-siz...
Did not mean to shit on anyone's parade, but it's a trap for novices, with the caveat that you reportedly can't buy a GB10 until "May 2025" and the expectation that it will be severely supply constrained. For some (overfunded startups running on AI monkey code? Youtube Influencers?), that timing is an unacceptable risk, so I do expect these things to fly off the shelves and then hit eBay this Summer.
This makes sense. They started gluing M* chips together to make Mac Studios three years ago, which must have been in anticipation of DeepSeek R1 4-bit
https://support.apple.com/en-us/102839
I assume it is similar.
"Memory bandwidth usage should be limited to the 37B active parameters."
Can someone do a deep dive on the above quote? I understand that having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.
Context window?
How much of the model can actively be processed, given the memory bandwidth, despite it being fully loaded into memory?
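Roughly, yes. A sketch of the arithmetic (bandwidth and quantization figures are assumptions pulled from this thread): a dense model has to stream every weight for each generated token, but an MoE model like R1 only routes each token through ~37B of its 671B parameters, so all 671B must sit in RAM while only the active slice taxes the bandwidth per token.

    # Sketch: why only the ~37B active params set the per-token bandwidth bill.
    bandwidth = 819e9            # bytes/s, M3 Ultra figure quoted in this thread (assumed)
    total_params = 671e9         # all of these must fit in memory
    active_params = 37e9         # MoE: params actually touched per token
    bytes_per_param = 0.5        # 4-bit quantization (assumed)

    ram_needed = total_params * bytes_per_param / 1e9    # ~336 GB resident
    bytes_per_token = active_params * bytes_per_param    # ~18.5 GB streamed per token
    tokens_per_s = bandwidth / bytes_per_token           # ~44 tok/s upper bound, ignoring KV cache and overhead

    print(ram_needed, tokens_per_s)

That upper bound lines up with the ~40 tokens/s estimates people are throwing around elsewhere in the thread.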
I remember right after OpenAI announced GPT-3, I had a conversation with someone where we tried to predict how long it would be before GPT-3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B-parameter GPT-3 at 16-bit precision, and I think that's pretty cool.
This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.
Unless you have a specific configuration, depreciation isn't much better than an equivalently priced PC. In fact, my experience is that the long tail value of the PC is better if you picked something that was high-end.
When engineers tell me they want to run models themselves in the cloud, I tell them they are free to play with it, but that isn't a project going onto the roadmap. OpenAI/Anthropic and others are much cheaper in terms of tokens per dollar thanks to economies of scale.
There is still value in running your own models for privacy reasons, however, and that's why I pay attention to efforts to reduce the cost of running models locally or in your own cloud provider.
Anything in between suffers.
This is the big question to have answered. Many people claim Apple can now reliably be used as an ML workstation, but from the benchmark numbers I've seen, the models may fit in memory, yet the tok/sec performance is so slow that it doesn't feel worth it compared to running on NVIDIA hardware.
Although it'd be expensive as hell to get 512GB of VRAM with NVIDIA today, maybe moves like this from Apple could push down the prices at least a little bit.
Also, it would be neat if you could say which AI services you were subscribed to; there is a huge difference between a paid Claude subscription and the OpenAI Pro subscription, for example, both in terms of cost and the quality of responses.
What model do you find fast enough and smart enough?
Anyways, what ram config, and what model are you using?
For the self-attention mechanism, memory bandwidth requirements scale ~quadratically with the sequence length.
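One way to see where that comes from (a sketch with made-up dimensions, not any real model's config): with a KV cache, each new token has to re-read the keys and values of every earlier token, so per-token traffic grows linearly with context and the total over a full sequence grows roughly quadratically.

    # Sketch: KV-cache traffic vs. sequence length (hypothetical dims, not a real model config).
    n_layers, n_kv_heads, head_dim = 60, 8, 128
    bytes_per_elem = 2                                   # FP16 cache (assumed)

    kv_bytes_per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem  # K and V

    def traffic(seq_len):
        per_token = seq_len * kv_bytes_per_token                                 # read whole cache for one new token
        whole_sequence = sum(t * kv_bytes_per_token for t in range(seq_len))     # ~quadratic in seq_len
        return per_token, whole_sequence

    print(traffic(8_192))
    print(traffic(32_768))  # 4x the context -> ~4x per-token traffic, ~16x total traffic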
You should read the HN posting guidelines if you want to understand why. Although I guess in this case it's mostly someone's fat-thumbed downvote.
If they didn't increase the memory bandwidth, then 512GB will enable longer context lengths and that's about it, right? No speedups.
For any speedups you may need some new variant of FlashAttention-3, or something along similar lines, purpose-built for Apple GPUs.
(and the 512GB version is $4,000 more rather than $10,000 - that's still worth mocking, but it's nowhere near as much)
But here it looks more of a bottleneck from my (admittedly naive) understanding.
On dual-socket (SP5) Epyc I believe the memory bandwidth is somewhat greater than this Apple product's too... and at Apple's price points you can have about twice the RAM.
Presumably the Apple solution is more power efficient.
I would have assumed you’d want to save the best process/node for processing, and could use a less expensive processes for RAM.
funny that people think this is so new, when CRAY had Global Heap eons ago...
We walked in and, getting our bearings, came upon the CRAY office. WTF?!
I tried the doors, locked - and it was clearly empty... but damn did I want to steal their office door signage.
And even with 3D, integrated GPUs have existed for years.
So if you wanted to give it a second ram pool you would have to add an entire second memory interface just for the on-die GPU.
Now all you’ve done is make it more complicated, slower because now you have to move things between the two pools, and gained what exactly?
I think it was a very clear and obvious decision to make. It's an outgrowth of how the base chips were designed, and it turned out to be extremely handy for some things. Plus, since all their modern devices now work this way, that probably simplifies the software.
I'm not saying it's genius foresight, but it certainly worked out rather well. There's nothing stopping them from supporting discrete GPUs too if they wanted to. They just clearly don't.
As for practicality, which mainstream applications would benefit from this much memory paired with nice but relatively mid compute? At this price point ($14K for a fully specced system), would you prefer it over e.g. a couple of NVIDIA Project DIGITS units (assuming those arrive on time and for around the announced $3K price point)?
Your source is a Reddit post in which they try to match the size to existing chips, without realizing that it's very likely NVIDIA is using custom memory here produced by Micron, just as Apple uses custom memory chips.
So the M3 preference will depend on whether a niche can significantly benefit from a monolithic lower-compute, high-memory setup vs. a higher-compute but distributed one.
And the storage solution still makes no sense of course, a machine like this should start at 4TB for $0 extra, 8TB for $500 more, and 16TB for $1000 more. Not start at a useless 1TB, with the 8TB version costing an extra $2400 and 16TB a truly idiotic $4600. If Sabrent can make and sell 8TB m.2 NVMe drives for $1000, SoC storage should set you back half that, not over double that.
price premium probably, but chip lithography errors (thus, yields) at the huge memory density might be partially driving up the cost for huge memory.
Apple's not having TSMC fab a massive die full of memory. They're buying a bunch of small dies of commodity memory and putting them in a package with a pair of large compute dies. How many of those small commodity memory dies they use has nothing to do with yield.
Apple's own product shots have shown this. Here's a bunch of links that clearly show the memory as separate. In lots of these photos you can make out the serial or model numbers on the modules and look up the manufacturer directly :)
- Side-by-side teardown of M1 Pro vs M2 Pro laptop motherboards showing separate ram chips with discussion on how apple is moving to different type of ram configurations: https://www.ifixit.com/News/71442/tearing-down-the-14-macboo...
- M2 teardown with the chip + ram highlighted: https://www.macrumors.com/2022/07/18/macbook-air-m2-chip-tea...
- Photo of the A12 with separate ram chips on a single "package": https://en.wikipedia.org/wiki/Apple_A12X
- M1 Ultra with heat spreader removed, clearly showing 3rd-party RAM chips on-package: https://iphone-mania.jp/news-487859/
Apple absolutely loves to gouge for upgrades, but the chips in this have got to be expensive. I almost wonder if the absolute base model of this machine has noticeably lower margins than a normal Apple product because of that. But they expect/know that most everyone who buys one is going to spec it up.
All joking aside, I don't think Apples are that expensive compared to similar high-end gear. I don't think there is any other compact desktop computer with half a terabyte of RAM accessible to the GPU.
Fortunately it seems like AMD is finally catching on and working towards producing a viable competitor to the M series chips.
The compute and memory bandwidth of the M3 Ultra is more in-line with what you'd get from a Xeon or Epyc/Threadripper CPU on a server motherboard; it's just that the x86 "way" of doing things is usually to attach a GPU for way more horsepower rather than squeezing it out of the CPU.
This will be good for local LLM inference, but not so much for training.
When I was much younger, I got to work on compilers at Cray Computer Corp., which was trying to bring the Cray-3 to market. (This was basically a 16-CPU Cray-2 implemented with GaAs parts; it never worked reliably.)
Back then, HPC performance was measured in mere megaflops. And although the Cray-2 had peak performance of nearly 500MF/s/CPU, it was really hard to attain, since its memory bandwidth was just 250M words/s/CPU (2GB/s/CPU); so you had to have lots of operand re-use to not be memory-bound. The Cray-3 would have had more bandwidth, but it was split between loads and stores, so it was still quite a ways away from the competing Cray X-MP/Y-MP/C-90 architecture, which could load two words per clock, store one, and complete an add and a multiply.
So I asked why the Cray-3 didn't have more read bandwidth to/from memory, and got a lesson from the answer that has stuck. You could actually see how much physical hardware in that machine was devoted to the CPU/memory interconnect, since the case was transparent -- there was a thick nest of tiny blue & white twisted wire pairs between the modules, and the stacks of chips on each CPU devoted to the memory system were a large proportion of the total. So the memory and the interconnect constituted a surprising (to me) majority of the machine. Having more floating-point performance in the CPUs than the memory could sustain meant that the memory system was oversubscribed, and that meant that more of the machine was kept fully utilized. (Or would have been, had it worked...)
In short, don't measure HPC systems with just flops. Measure the effective bandwidth over large data, and make sure that the flops are high enough to keep it utilized.
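To make that lesson concrete with the figures given above (a sketch, treating the round numbers as given): the Cray-2's balance of ~500 MFLOP/s against ~250M words/s per CPU means you need at least two flops of re-use per word loaded just to stay off the memory wall.

    # Sketch of the balance point implied by the Cray-2 figures above (per CPU).
    peak_flops = 500e6           # ~500 MFLOP/s peak
    words_per_s = 250e6          # ~250 M words/s memory bandwidth

    # Operand re-use needed to keep the FP units busy:
    flops_per_word_needed = peak_flops / words_per_s       # = 2.0

    # A plain vector update like x[i] = a*x[i] + y[i] does 2 flops per 3 words moved
    # (~0.67 flops/word), so it can only reach about a third of peak on this balance.
    attainable = min(peak_flops, words_per_s * (2 / 3))
    print(flops_per_word_needed, attainable)                # 2.0, ~1.67e8 FLOP/s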
Do you have a blog?
Looking at Nvidia's spec sheet, an H100 SXM can do 989 tf32 teraflops (or 67 non-tensor core fp32 teraflops?) and 3.35 TB/s memory (HBM) bandwidth, so ... similar problem?
There's a wide spectrum of potential requirements between memory capacity, memory bandwidth, compute speed, compute complexity, and compute parallelism. In the past, a few GB was adequate for tasks that we assigned to the GPU, you had enough storage bandwidth to load the relevant scene into memory and generate framebuffers, but now we're running different workloads. Conversely, a big database server might want its entire contents to be resident in many sticks of ECC DIMMs for the CPU, but only needed a couple dozen x86-64 threads. And if your workload has many terabytes or petabytes of content to work with, there are network file systems with entirely different bandwidth targets for entire racks of individual machines to access that data at far slower rates.
There's a lot of latency between the needs of programmers and the development and shipping of hardware to satisfy those needs, I'm just happy we have a new option on that spectrum somewhere in the middle of traditional CPUs and traditional GPUs.
As you say, if Nvidia made a 512 GB card it would cost $150k, but this costs an order of magnitude less than that. Even high-end consumer cards like a 5090 have 16x less memory than this does (average enthusiasts on desktops have maybe 8 GB) and just over double the bandwidth (1.7 TB/s).
Also, nit pick FTA:
> Starting at 96GB, it can be configured up to 512GB, or over half a terabyte.
512 GB is exactly half of a terabyte, which is 1024 GB. It's too late for hard drives - the marketing departments have redefined storage to use multipliers of 1000 and invented "tebibytes" - but in memory we still work with powers of two. Please.
For inference setups, my point is that instead of paying $10000-$15000 for this Mac you could build an x86 system for <$5K (Epyc processor, 512GB-768GB RAM in 8-12 channels, server mobo) that does the same thing.
The "+$4000" for 512GB on the Apple configurator would be "+$1000" outside the Apple world.
That requires an otherwise equivalent PC to exist. I haven’t seen anyone name a PC with a half-TB of unified memory in this thread.
Yeah it’s $4k. Yeah that’s nuts. But it’s the only game in town like that. If the replacement is a $40k setup from Nvidia or whatever that’s a bargain.
We do. Common people don't. It's easier to write "over half a terabyte" than explain (again) to millions of people what the power of two is.
https://www.storagereview.com/wp-content/uploads/2025/01/Sto...
You will however get half of the bandwidth and a lot more latency if you have to go through multiple systems.
I would be interested to see how feasible hybrid approaches would be, e.g. connect each pair up directly via ConnectX and then connect the sets together via Ethernet.
The AMD Epyc build is severely bandwidth and compute constrained.
~40 tokens/s on M3 Ultra 512GB by my calculation.
1. M3 Ultra 512GB
2. AMD Epyc (which gen? AVX-512 and DDR5 might make a difference in both performance and cost; Gen 4 or Gen 5 get 8 or 9 t/s: https://github.com/ggml-org/llama.cpp/discussions/11733)
3. AMD Epyc + 4090 or 5090 running KTransformers (over 10 t/s decode? https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...)
If the M3 can run 24/7 without overheating it's a great deal to run agents. Especially considering that it should run only using 350W... so roughly $50/mo in electricity costs.
I'd assume this thing peaks at 350W (or whatever) but idles at around 40w tops?
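Quick sanity check on the $50/mo figure (the electricity price is an assumption, not from the thread, and varies a lot by region):

    # Rough electricity cost check; price per kWh is an assumption.
    watts_peak, watts_idle = 350, 40
    hours_per_month = 24 * 30
    price_per_kwh = 0.20                       # assumed

    cost = lambda watts: watts / 1000 * hours_per_month * price_per_kwh
    print(cost(watts_peak), cost(watts_idle))  # ~$50/mo flat out, ~$6/mo at idle

So the ~$50/mo is roughly the worst case of running it flat out 24/7; an idle-heavy duty cycle would be far cheaper.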
I don't know how to calculate tokens/s for H100s linked together. ChatGPT might help you though. :)
If this is remotely accurate though it's still at least an order of magnitude more convenient than the M3 Ultra, even after factoring in all the other costs associated with the infrastructure.
The pricing isn't as insane as you'd think: going from 96GB to 256GB is 1,500, which isn't 'cheap', but it could be worse.
All in, 5,500 gets you an Ultra with 256GB of memory, 28 CPU cores, 60 GPU cores, and 10Gb networking. I think you'd be hard pushed to build a server for less.
The M3 strikes a very particular middle ground for AI of lots of RAM but a significantly slower GPU which nothing else matches, but that also isn't inherently the right balance either. And for any other workloads, it's quite expensive.
I have an M3 Max (the non-binned one), and I feel like 64GB or 96GB is within the realm of enabling LLMs that run reasonably fast on it (it is also a laptop, so I can do things on planes or trips). I thought about the Ultra: with 128GB on a top-line M3 Ultra, the models that you could fit into memory would run fairly fast. With 512GB, you could run the bigger models, but not very quickly, so maybe not much point (at least for my use cases).
I think it actually is perfect for local inference in a way that that build, or any other PC build in this price range, wouldn't be.
Apple’s Pro segment has been video editors since the 90s.
None of those will be true for any PC/Nvidia build.
It's hard to put a price on quality of life.
If you are going to argue that the OS or even below that the hardware could be compromised to still enable exfiltration, that is true, but it is a whole different ballgame from using an external SaaS no matter what the service guarantees.
A long time ago Apple had a rackmount server called Xserve, but there’s no sign that they’re interested in updating that for the AI age.
> there’s no sign that they’re interested in updating that for the AI age.
And I’ve had every previous Mac tower design since 1999: G4, G5, the excellent dual Xeon, the horrible black trash can… But Apple Silicon delivers so much punch in the Studio form factor, the old school Pro has become very niche.
Edit - looks like the new M3 Ultra is only available in Mac Studio anyway? So the existence of the Pro is moot here.
The 2013 Mac Pro was stuck forever with its original choice of Intel CPU and AMD GPU. And it was unfortunately prone to overheating due to these same components.
The cooling solution wasn’t designed for huge GPUs. So it couldn’t really be upgraded in ways most people wanted.
In fact, a very famous podcaster is still holding on to his.
Not sure if they count as niche or not.
From https://www.apple.com/newsroom/2025/02/apple-will-spend-more...
As part of its new U.S. investments, Apple will work with manufacturing partners to begin production of servers in Houston later this year. A 250,000-square-foot server manufacturing facility, slated to open in 2026, will create thousands of jobs.
You can run the full Deepseek 671b q8 model at 40 tokens/s. Q4 model at 80 tokens/s. 37B active params at a time because R1 is MoE.
Linking two of these together lets you run a model (R1) more capable than GPT-4o at a comfortable speed at home. That was simply fantasy a year ago.
Is it though?
Wikipedia says [1] an M3 Max can do 14 TFLOPS of FP32, so an M3 Ultra ought to do 28 TFLOPS. nVidia claims [2] a Blackwell GPU does 80 TFLOPs of FP32. So M3 Ultra is 1/3 the speed of a Blackwell.
Calling that "a vanishingly small fraction" seems like a bit of an exaggeration.
I mean, by that metric, a single Blackwell GPU only has "a vanishingly small fraction" of the memory of an M3 Ultra. And the M3 Ultra is only burning "a vanishingly small fraction" of a Blackwell's electrical power.
nVidia likes throwing around numbers like "20 petaFLOPs" for FP4, but that's not real floating point... it's just 1990's-vintage uLaw/aLaw integer math.
[1] https://en.wikipedia.org/wiki/Apple_silicon#Comparison_of_M-...
[2] https://resources.nvidia.com/en-us-blackwell-architecture/da...
Edit: Further, most (all?) of the TFLOPs numbers you see on nVidia datasheets for "Tensor FLOPs" have a little asterisk next to them saying they are "effective" TFLOPs using the sparsity feature, where half the elements of the matrix multiplication are zeroed.
Blackwell and modern AI chips are built for FP16. The B100 has 1,750 TFLOPS of FP16; the M3 Ultra has ~80 TFLOPS of FP16, or about 4% that of a B100.
I'm super interested in the clustering capability. At launch people said they were only getting like 11Gbps from their TB4 drive arrays, which was really way less than expected.
Apple does kind of advertise that each TB port has its own controller. Which gives me hope that whatever one port can do, six ports can do 6x better.
AMD's Strix Halo victory feels much more shallow today. Eventually 48GB or 64GB sticks will probably expand Strix Halo to 192 and then 256GB. But Strix Halo is super, super IO-starved, with basically a desktop's worth of IO and no easy way to do host-to-host, and Apple absolutely understands that the usefulness of a chip is bounded by what it can connect to. 6x TB5, if even half true, will be utterly outstanding.
It's been so, so cool to see Non-Transparent Bridging atop Thunderbolt, so one host can act like a device. Since it's PCIe, that hypothetically would allow amazing RDMA over TB. USB4 mandates host-to-host networking, but I have no idea how it is implemented, and I suspect it's nowhere near as close to the metal.
It was "yet another mac-oriented startup" but I had them get me an Alienware laptop because I could get one with a 1070 mobile card that meant I could train on my laptop whereas the data sci's had to do everything on our DGX-1. [2]
Today it is the other way around, the Mac Studio looks like the best AI development workstation you can get.
[1] I was really partial to a character-level CNN model we had
[2] CEO presented next to Jensen Huang at a NVIDIA conference, his favorite word was "incredible". I thought it was "incredible" when I heard they got bought by Nike, but it was true.
https://arstechnica.com/gadgets/2013/10/os-x-10-9-brings-fas...
Thunderbolt is PCIe-based, and I could imagine it being extended to do what https://en.wikipedia.org/wiki/Compute_Express_Link and https://en.wikipedia.org/wiki/InfiniBand do.
That said, 512GB of unified RAM with access to the NPU is absolutely a game changer. My guess is that Apple developed this chip for their internal AI efforts and are now at the point where they are releasing it publicly for others to use. They really need a 2U rack form factor for this though.
This hardware is really being held back by the operating system at this point.
The CPUs have zero competition in terms of speed, memory bandwidth. Still blown away no other company has been able to produce Arm server chips that can compete.
I’d be curious to know if this changes that. It’d take a lot more than doubling cores to take out the very high power AMD parts, but this might squeeze them a bit.
Interestingly, AMD has also been investing heavily in unified RAM. I wonder if they have / plan an SoC that competes 1:1 with this. (Most of the parts I’m referring to are set up for discrete graphics.)
Source: https://www.notebookcheck.net/AMD-Ryzen-AI-Max-395-Analysis-...
Cinebench 2024 results.
Somewhere on the internet there is a tdp wattage vs performance x-y plot. There’s a pareto optimal region where all the apple and amd parts live. Apple owns low tdp, AMD owns high tdp. They duke it out in the middle. Intel is nowhere close to the line.
I’d guess someone has made one that includes datacenter ARM, but I’ve never seen it.
This?
https://www.videocardbenchmark.net/power_performance.html#sc...
Workstations (like the Mac Studio) have traditionally been a space where "enthusiast"-grade consumer parts (think Threadripper) and actual server parts competed. The owner of a workstation didn't usually care about their machine's TDP; they just cared that it could chew through their workloads as quickly as possible. But, unlike an actual server, workstations didn't need the super-high core count required for multitenant parallelism; and would go idle for long stretches — thus benefitting (though not requiring) more-efficient power management that could drive down baseline TDP.
Anyway, I don't think it's comparable really. This thing comes with a fat GPU, NPU, and unified memory. Threadripper is just a CPU.
Right.
It is coming up because we're in a thread about using them as server CPUs. (c.f. "colo", "2U" in OP and OP's child), and the person you're replying to is making the same point you are
For years now, people will comment "these are the best chips, I'd replace all chips with them."
Then someone points out perf/watt is not perf.
Then someone else points out some M-series is much faster than a random CPU.
And someone else points out that the random CPU is not a top performing CPU.
And someone else points out M-series are optimized for perf/watt and it'd suck if it wasn't.
I love my MacBook, the M-series has no competitors in the case it's designed for.
I'd just prefer, at this point, that we can skip long threads rehashing it.
It's a great chip. It's not the fastest, and it's better for that. We want perf/watt in our mobile devices. There's fundamental, well-understood, engineering tradeoffs that imply being great at that necessitates the existence of faster processors.
It's a great chip. It's not the fastest,
It has the world's fastest single thread. Both Passmark and Geekbench are aggregates of a variety of tasks. If you dig into the individual tests that constitute these aggregate scores, you will find different platforms perform better, or worse, on certain tests than others. I would wager that, for many applications, only a subset of these tasks are relevant to the performance of the application, yet such benchmark suites distil away all nuance into a single value.
Here is a personal anecdote. I have tried running CASTEP (built from source), a density functional theory calculator, on both an M1 Max MacBook Pro [0], and on a Ryzen 7840HS Lenovo laptop [1]. A cursory glance at those Geekbench results linked might make you expect that the performance is roughly equivalent, but the Ryzen outperforms the Mac by about 4x, a huge difference.
What happens if we try and dig into any particular benchmark to explain this? If you click on any particular benchmark in the Geekbench search lists, you will see they test things like "File Compression", "HTML5 Browser", "Clang". Which of these maps most closely to the sorts of instructions used in CASTEP? Your guess is as good as mine.
If anything, I would say Passmark is quite a bit less abstract about this. Looking at the Mac [2] and Ryzen [3] Passmark results, you can see the Ryzen outperforms the Mac by about 2x on "extended instructions", which appear to involve some matrix math, and also about 2x on "integer math". The Mac, meanwhile, appears to be extremely good at finding prime numbers, at over 3x the speed of the Ryzen. Presumably the Ryzen's balance of instruction performance is more useful for DFT calculations than the Mac's, which perhaps is weaker in areas that might matter for this application, but stronger in areas that might matter for others.
Of course, optimization is likely a component of this. How much effort is put into the OpenBLAS, MPI, etc, implementations on aarch64 darwin vs. x86-64 linux? This is a good question. It is, however, mostly irrelevant to the end consumer, who wishes to consume this software for use in their further research, rather than dig into high-performance computing library optimization.
[0] https://browser.geekbench.com/search?q=m1+max
[1] https://browser.geekbench.com/search?q=7840hs
[2] https://www.cpubenchmark.net/cpu.php?cpu=Apple+M1+Max+10+Cor...
[3] https://www.cpubenchmark.net/cpu.php?cpu=AMD+Ryzen+7+PRO+784...
https://medium.com/silicon-reimagined/performance-delivered-...
Passmark is an outdated benchmark that isn't updated to use ARM instructions.
I think PassMark is more honest as well, because it just gives scores for calculation throughput instead of specific tasks. It more closely matches what experience you will get if you have a varied load.
But since it's Apple we are talking about, their users just want to think they have the best and that's all that matters.
https://www.cpubenchmark.net/cpu_test_info.html
Right from the top it's amateurish stuff: their idea of an integer benchmark to measure "raw" CPU throughput (whatever that means) is to make a bunch of random ints and add/subtract/multiply/divide them.
Very few programs do a high volume of either integer multiply or divide. And when they do, they generally aren't doing it on random numbers. This is the kind of thing which gives synthetic benchmarks their highly deserved bad rep. It might be even worse than Dhrystone MIPs, and believe me, in benchmark nerd circles, that is a fucking diss.
If you look up Geekbench's docs, you'll find that it's all about real-world compute tasks. For example, one of the int tests in their suite is to compile a reference program with the Clang compiler. Compilers are a reasonably good litmus test of integer performance; they heavily stress the CPU features most responsible for high integer performance in this day and age. (Branch prediction, memory prefetching, out-of-order execution, speculation, that kind of thing.)
You claimed that PassMark reflects "complex" software, and Geekbench doesn't. However, I would be willing to bet that Clang alone is far more complex than all of PassMark's CPU benchmarks put together, whether you measure by SLOC or program structure.
Note that none of this has anything to do with Mac vs PC. Passmark is simply a bad benchmark that should not be used, period. That said, there are a bunch of warning signs that PassMark's ports to everything outside its native x86 Windows are probably quite sloppy, so it's even less useful for crossplatform comparisons.
There are nice things that Apple has, but as you can see there is significant reality warping going on.
Why does it persist?
The M4 Max had great (I would argue the best at time of release) single-core results on Geekbench.
That is a different claim from "the M4 line has the top single-thread performance in the world."
I'm curious:
You're signalling both that you understand the fundamental tradeoff ("Apple doesn't make server-grade CPUs") and that you are talking about something else (follow-up with M4 family has top single-thread performance)
What drives that? What's the other thing you're hoping to communicate?
If you are worried that if you leave it at "Apple doesn't make server-grade CPUs", that people will think M4s aren't as great as they are, this is a technical-enough audience, I think we'll understand :) It doesn't come across as denigrating the M-series, but as understanding a fundamental, physically-based, tradeoff.
At least judging by the mounts, they want them to be used that way, even though the CPU might not fit with the de facto industry label for "server-grade".
The only use case I can think of is for audio workstations, where people have lots of rack mount equipment, so you can have everything including the computer in the rack. But even for that use case it's quite big.
Anyway the Apple config in the article costs about 5x more than a comparable low end AMD server with 512GB of ram, but adds an NPU. AMD has NPUs in lower end stuff; not sure about this TDP range.
I'm not sure how those benchmarks translate to common real world use cases.
It reminds me of the 1990s when my old school was using Sun machines based on the 68k series and later SPARC and we were blown away with the toaster-sized HP PA RISC machine that was used for student work for all the CS classes.
Then Linux came out and it was clear the 386 trashed them all in terms of value and as we got the 486 and 586 and further generations, the Intel architecture trashed them in every respect.
The story then was that Intel was making more parts than anybody else so nobody else could afford to keep up the investment.
The same is happening with parts for phones and TSMC's manufacturing dominance -- and today with chiplets you can build up things like the M3 Ultra out of smaller parts.
Then, one day, we built a 5 machine amd athlon xp linux cluster for $2000 ($400/machine) that beat all the unix and windows server hardware by at least 10x on $/perf.
It’s nice that we have more than one viable cpu vendor these days, though it seems like there’s only one viable fab company.
I think it goes something like:
- 2106x/EV4: 34-bit physical, 43-bit virtual
- 21164/EV5: 40-bit physical, 43-bit virtual
- 21264/EV6: 44-bit physical, 48-bit virtual
The EV6 is a bit quirky, as it is 43-bit by default but can use 48 bits when I_CTL<VA_48> or VA_CTL<VA_48> is set (the distinction between the registers is for each access type, i.e. instruction fetch versus data load/store). The 21364/EV7 likely has the same characteristics as the EV6, but the hardware reference manual seems to have been lost to time...
Digital struggled with the microprocessor transition because they didn't want to kill their cash cow minicomputers with microcomputer-based replacements. They went with the 64-bit Alpha because they wanted to rule the high end in the CMOS age. And they did, for a little while. But the mass market caught up.
VMS is the only OS (that I know of) that uses all 4 processor privilege modes.
Side note: The 21064 has such bizarre IPR mappings, the read values have lots of bits scrambled around compared to their write counterparts. This is likely a hardware design decision affecting the programmer's model, if I had to guess.
One thing I remember about Alpha though was how bad the output from gcc was. Then DEC released a version of their own compilers that was command line compatible with gcc. That changed everything for open source stuff.
For what we needed, five 32 bit address spaces was enough DRAM. The individual CPU parts were way more than 20% as fast, and the 100Mbit switch was good enough.
(The data basically fit in ram, so network transport time to load a machine was bounded by 4GiB / 8MiB / sec = 500 seconds. Also, the hard disks weren’t much faster than network back then.)
Maybe not at the same power consumption, but I'm sure mid-range Xeons and EPYCs mop the floor with the M3 Ultra in CPU performance. What the M3 Ultra has that nobody else comes close is a decent GPU near a pool of half a terabyte of RAM.
Having things consistently work is much cheaper than down days caused by your ancient equipment. Apple’s SSDs will make it to 5 years no problem - and more likely, 10-15 years.
Fast forward 2 years: The $50-$250K machines have a 100% drive failure rate, and some poor bastard has to fly from data center to data center to swap the $60 drive for a $120 one, then re-rack and re-image each machine.
Anyway, soldering a decent SSD to the motherboard board would actually improve reliability at all those places.
If they were soldered onto those systems you talk about, all those would have had to be replaced instead of just having the drive swapped out and re-imaged.
On the other hand, you can do without display support if you’re only using it as a server. And I think USB Ethernet dongles might work for the time being?
My understanding is there are dozens of people working on it.
Apple’s whole m.o. is to take FOSS software, repackage it and sell it. They don’t want people using it directly.
Our business "only" sees about 1,000-25,000 req/min, our message brokers transmit MAX 25k msg/s. Easily handled by a rack of 10 servers for redundancy.
We are not Google and we don't pretend to be, so we don't care about power, as the difference is a few dollars a month.
FYI Apple runs Linux in their DC, so no Apple hardware in their own servers.
This is silly. Given the performance per watt, the M series would be great in a data center. As you all know, electricity for running the servers and cooling for the servers are the two biggest ongoing costs for a data center; the M series requires less power and runs more efficiently than the average Intel or AMD-based server.
> FYI Apple runs Linux in their DC, so no Apple hardware in their own servers.
That's certainly no longer the case. Apple announced their Private Cloud Compute [1] initiative—Apple designed servers running Apple Silicon to support Apple Intelligence functions that can't run on-device.
BTW, Apple just announced a $500 billion investment [2] in US-based manufacturing, including a 250,000 square foot facility to make servers. Yes, these will obviously be for their Private Cloud Compute servers… but it doesn't have to be only for that purpose.
From the press release:
As part of its new U.S. investments, Apple will work with manufacturing partners to begin production of servers in Houston later this year. A 250,000-square-foot server manufacturing facility, slated to open in 2026, will create thousands of jobs.
Previously manufactured outside the U.S., the servers that will soon be assembled in Houston play a key role in powering Apple Intelligence, and are the foundation of Private Cloud Compute, which combines powerful AI processing with the most advanced security architecture ever deployed at scale for AI cloud computing. The servers bring together years of R&D by Apple engineers, and deliver the industry-leading security and performance of Apple silicon to the data center.
Teams at Apple designed the servers to be incredibly energy efficient, reducing the energy demands of Apple data centers — which already run on 100 percent renewable energy. As Apple brings Apple Intelligence to customers across the U.S., it also plans to continue expanding data center capacity in North Carolina, Iowa, Oregon, Arizona, and Nevada.
[1]: https://security.apple.com/blog/private-cloud-compute/
[2]: https://www.apple.com/newsroom/2025/02/apple-will-spend-more...
It really is. Even if they themselves won't bring back their old XServe OS variant, I'd really appreciate it if they at least partnered with a Linux or BSD (good callout, ryao) dev to bring a server OS to the hardware stack. The consumer OS, while still better (to my subjective tastes) than Windows, is increasingly hampered by bloat and cruft that make it untenable for production server workloads, at least to my subjective standards.
A server OS that just treats the underlying hardware like a hypervisor would, making the various components attachable or shareable to VMs and Containers on top, would make these things incredibly valuable in smaller datacenters or Edge use cases. Having an on-prem NPU with that much RAM would be a godsend for local AI acceleration among a shared userbase on the LAN.
I’m continually surprised Apple doesn’t just donate something like 0.1% of their software development budget to proton and the asahi projects. It’d give them a big chunk of the gaming and server markets pretty much overnight.
I guess they’re too busy adding dark patterns that re-enable siri and apple intelligence instead.
Don't get me wrong, I wouldn't use that feature (I prefer self-hosting it all myself), but for folks like my family members, it'd be a killer addition to the lineup that makes my life supporting them much easier.
Ubiquiti is decently priced, especially for niche hardware, unlike Apple. Fundamentally they diverge on the way to do things...
HomeKit networking existed in Eero briefly. I put that in a lot of casual Apple homes. Seemed like missed oppty for Apple to let Amazon buy Eero, a more "spiritual successor" to the Airports.
With all my love and respect for "Apple rumors" writers; this was always "I read five blogposts about CPU design and now I'm an expert!" territory.
The speculation was based on the M3 Max die shots not having the interposer visible, which... implies basically nothing about whether that _could have_ been supported in an M3 Ultra configuration, as evidenced by the announcement today.
No M3 has thunderbolt 5.
This is a new chip with M3 marketing. I’d expect this from Intel, not Apple.
The press-release even calls TB5 out: >Each Thunderbolt 5 port is supported by its own custom-designed controller directly on the chip.
Given that they're doing the same on A-series chips (A18 Pro with 10Gbps USB-C; A18 with USB 2.0); I imagine it's just relatively simple to swap the I/O blocks around and they're doing this for cost and/or product segmentation reasons.
Which, at this point, why not just use M4 as a base?
I imagine that making those chips is quite a bit more involved than just taking the files for M3 Max, and copy-pasting them twice into a new project.
I imagine it just takes more time to design/verify/produce them; especially given they're not selling very many of them, so they're probably not super-high-priority projects.
Or the sort of thing you put onto a successor when you had your fingers crossed that the spec and hardware would finalize in time for your product launch but the fucking committee went into paralysis again at the last moment and now your product has to ship 4 months before you can put TB 5 hardware on shelves. So you put your TB4 circuitry on a chip that has the bandwidth to handle TB5 and you wait for the sequel.
Apple could either create 2U rack hardware and support Linux (and I mean Apple supporting it, not hobbyists), or ship a headless build of Darwin that could run on that hardware. But in the latter case, we probably wouldn't have much software available (though I am sure people would eventually start porting software to it; there are already MacPorts and Homebrew, and I am sure they could be adapted to eventually run on that platform).
But Apple is also not interested in that market, so this will probably never happen.
they're just a tiny company with shareholders who are really tired of never earning back their investments. give 'em a break. I mean they're still so small that they must protect themselves by requiring that macs be used for publishing iPhone and iPad applications.
> they're still so small that they must protect themselves by requiring that macs be used for publishing iPhone and iPad applications.
They're not talking about Apple's silicon as a target, but as a development platform.
I mean, that's usually how it works though. When IBM launched the PS/2, they didn't support anything other than PC-DOS and OS/2, Microsoft had to make MS-DOS work for it (I mean... they did get support from IBM, but not really), the 386BSD and Linux communities brought the engineering effort without IBM's involvement.
When Apple was making Motorola Macs, they may have given Be a little help, but didn't support any other OSes that appeared. Same with PowerPC.
All of the support for alternative OSes has always come from the community, whether that's volunteers or a commercial interest with cash to burn. Why should that change for Apple silicon?
Apple briefly supported a Linux distribution on PowerPC Macs: https://en.wikipedia.org/wiki/MkLinux.
M1 Max - 24 to 32 GPU cores
M2 Max - 30 to 38 GPU cores
M3 Max - 30 to 40 GPU cores
M4 Max - 32 to 40 GPU cores
I also looked up the announcement dates for the Max and the Ultra variant in each generation.
M1 Max - October 18, 2021
M1 Ultra - March 8, 2022
M2 Max - January 17, 2023
M2 Ultra - June 5, 2023
M3 Max - October 30, 2023
M3 Ultra - March 12, 2025
M4 Max - October 30, 2024
> My guess is that Apple developed this chip for their internal AI efforts
As good a guess as any, given the additional delay between the M3 Max and Ultra being made available to the public.
The theory that the M3 Ultra was being produced, but diverted for internal use makes as much sense as any theory I've seen.
It makes at least as much sense as the "TSMC had difficulty producing enough defect free M3 Max chips" theory.
For SMBs or Edge deployments where redundancy isn't as critical or budgets aren't as large, this is an incredibly compelling offering...if Apple actually had a competent server OS to layer on top of that hardware, which it does not.
If they did, though...whew, I'd be quaking in my boots if I were the usual Enterprise hardware vendors. That's a damn frightening piece of competition.
That nvidia GPU setup will actually have the compute grunt to make use of the RAM, though, which this M3 Ultra probably realistically doesn't. After all, if the only thing that mattered was RAM then the 2TB you can shove into an Epyc or Xeon would already be dominating the AI industry. But they aren't, because it isn't. It certainly hits at a unique combination of things, but whether or not that's maximally useful for the money is a completely different story.
Write me an AWS CloudFormation file that does the following:
* Deploys an Amazon Kubernetes Cluster
* Deploys Busybox in the namespace "Test1", including creating that Namespace
* Deploys a second Busybox in the namespace "Test3", including creating that Namespace
* Creates a PVC for 60GB of storage
The M1 Pro laptop with 16GB of Unified Memory:
* 21.28 seconds for "Thinking"
* 0.22s to the first token
* 18.65 tokens/second over 1484 tokens in its responses
* 1m:23s from sending the input to completion of the output
The 10900k CPU, with 64GB of RAM and a full-fat RTX 3090 GPU in it:
* 10.88 seconds for "thinking"
* 0.04s to first token
* 58.02 tokens/second over 1905 tokens in its responses
* 0m:34s from sending the input to completion of the output
Same model, same loader, different architectures and resources. This is why a lot of the AI crowd are on Macs: their chip designs, especially the Neural Engine and GPUs, allow quite competent edge inference while sipping comparative thimbles of energy. It's why if I were all-in on LLMs or leveraged them for work more often (which I intend to, given how I'm currently selling my generalist expertise to potential employers), I'd be seriously eyeballing these little Mac Studios for their local inference capabilities.

Sure, the desktop machine performs better, as would a datacenter server jam-packed full of Blackwell GPUs, but that's not what's exciting about Apple's implementation. It's the efficiency of it all, being able to handle modern models on comparatively "weaker" hardware most folks would dismiss outright. That's the point I was trying to make.
Also Apple isn't unique in having an NPU in a laptop. Fucking everyone does at this point.
The point is that, in terms of practical usage, the M3 Ultra is uniquely competent and highly affordable in a sea of enterprise technology that is decidedly not. I tried to demonstrate why I'm excited about it by pointing out the similar performance of a battery-powered, four-year-old laptop and a quite gargantuan gaming PC that's pulling over 500W from the wall, as an example of what several years of additional refinements and improvements to the architecture was expected to bring.
The point is that it's affordable, more flexible in deployment, and more efficient than similarly-specced datacenter servers specifically designed for inference. For the cost of a single decked-out Dell or HP rackmount server, I can have five of these Mac Studios with M3 Ultra chips - and without the need for substantial cooling, noise isolation, or other datacenter necessities. If the marketing copy is even in the same ballpark as actual performance, that's easily enough inference to serve an office of fifty to a hundred people or more, depending on latency tolerances; if you don't mind "queuing" work (like CurrentCo does with their internal Agents), one of those is likely enough for a hundred users.
That's the excitement. That's the point. It's not the fastest, it's not the cheapest, it's just the most balanced.
I have Apple hardware but it sucks for anything AI, buying it for that purpose is just extremely dumb, just like buying Macs for engineering CADs or things of the sort.
If you are buying Macs and it's not for media production related reasons you are doing something wrong.
I continue to be in awe of the lengths some people will go just to fling insults and shake out some salt. We're, what, ten layers deep? With all the context above, the best you have to contribute to the discussion are baseless accusations and ageist insults?
Your finite time would have been better spent on literally anything else, than actively seeking out a comment just to throw subjective, unsubstantiated shade around. C'mon, be better.
It all depends on the workload you want to run.
In Intel's case, there's ample coverage of the company's lack of direction and complacency on existing hardware, even as their competitors ate away at their moat, year after year. AMD with their EPYC chips taking datacenter share, Apple moving to in-house silicon for their entire product line, Qualcomm and Microsoft partnering with ongoing exploration of ARM solutions. A lack of competency in leadership over that time period has annihilated their lead in an industry they used to single-handedly dictate, and it's unlikely they'll recover that anytime soon. So in a sense, Intel cannot make a similar product, in a timely manner, that competes in this segment.
As for AMD, it's a bit more complicated. They're seeing pleasant success in their CPU lineup, and have all but thrown in the towel on higher-end GPUs. The industry has broadly rallied around CUDA instead of OpenCL or other alternatives, especially in the datacenter, and AMD realizes it's a fool's errand to try and compete directly there when it's a monopoly in practice. Instead of squandering capital to compete, they can just continue succeeding and working on their own moat in the areas they specialize in - mid-range GPUs for work and gaming, CPUs targeting consumers and datacenters, and APUs finding their way into game consoles, handhelds, and other consumer devices or Edge compute systems.
And that's just getting into the specifics of those two companies. The reality is that any vendor who hasn't already unveiled their own chips or accelerators is coming in at what's perceived to be the "top" of the bubble or market. They'd lack the capital or moat to really build themselves up as a proper competitor, and are more likely to just be acquired in the current regulatory environment (or lack thereof) for a quick payout to shareholders. There's a reason why the persistent rumor of Qualcomm purchasing part or whole of Intel just won't die: the x86 market is rather stagnant, churning out mediocre improvements YoY at growing pricepoints, while ARM and RISC chips continue to innovate on modern manufacturing processes and chip designs. The growth is not in x86, but a juggernaut like Qualcomm would be an ideal buyer for a "dying" or "completed" business like Intel's, where the only thing left to do is constantly iterate for diminishing returns.
Lowest latency of DDR5-6400 on normal PC starting at 60ns+
Lowest latency of VRAM on GeForce RTX 4090 starting at 14 ns
Lowest latency of Apple M1 memory starting at 5 ns; it's more like L3 cache.
And on Apple M chip, this ultrafast memory is available for CPU, GPU and NPU.
https://www.anandtech.com/show/17024/apple-m1-max-performanc... https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4...
There is also "parity" in other products like a MacBook Pro from £1,599 / $1,599 or an iPhone 16 from £799 / $799. £9,699 / $9,499 is worse than that!
You know that £9,699 are over $12k, right?
It's worth adding the M3 Ultra has 819GB/s memory bandwidth [1]. For comparison the RTX 5090 is 1800GB/s [2]. That's still less but the M4 Mac Minis have 120-300GB/s and this will limit token throughput so 819GB/s is a vast improvement.
For $9500 you can buy a M3 Ultra Mac Studio with 512GB of unified memory. I think that has massive potential.
[1]: https://www.apple.com/mac-studio/specs/
[2]: https://www.nvidia.com/en-us/geforce/graphics-cards/50-serie...
https://digitalspaceport.com/how-to-run-deepseek-r1-671b-ful...
between 3.5 and 4.25 TPS (tokens per second) on the Q4 671b full model.
3.5 - 4.25 tokens/s. You're torturing yourself, especially with a reasoning model. This will run it at 40 tokens/s based on a rough calculation. Q4 quant. 37B active parameters.
5x higher price for 10x higher performance.
If you've ever used git, svn, or an IDE side by side on corporate Windows versus Apple I don't know why you would ever go back.
The custom build would work great though, even more so in a server room, and it also shows by comparison how excessively Apple prices its components.
> If you've ever used git, svn, or an IDE side by side
I still reach for Windows, even though it's a dogshit OS. I would rather use WSL to write and deploy a single app, as opposed to doing my work in a Linux VM or (god forbid) writing and debugging multiple versions just to support my development runtime. If I'm going to use an ad-encumbered commercial service-slop OS, I might as well pick the one that doesn't actively block my work.
I just want a break from macOS. I'll be buying a ThinkPad and will probably never come back. This isn't me moaning; I understand it's their market, but if their hardware supported Linux (especially dual booting) or Docker natively, I'd probably be buying Apple for the next decade, and now I just won't be.
Since the M series of ARM processors didn’t come out until 2020, that would make a lot of sense.
The HP Elitebook was on Ubuntu's list of compatible tested laptops and came in hundreds of dollars less than a Thinkpad. Most of the comparably priced on sale T14's I could find were all crap Intel spec'd ones.
Months in, I don't regret it at all, and Linux support has been fantastic even for a fairly new Ryzen chip without the latest kernel. (I stick to LTS releases of most distros.) Shoving in 4TB of NVMe storage and 96GB of DDR5, should I feel the need to upgrade, would still put me only around $1,300 invested in this machine.
I just want decent enough power and no thermal throttling if I do have to hammer it. I make music so the extra ram and space for sample libraries is a big benefit and why I had to keep external SSD's around with my Macs.
My Macbook Air needed a usb fan ziptied to the laptop stand to not throttle at times.
>it seems like rather than trying to give a mac a reasonable go of it as opposed to whatever else, you were trying to instead explore a fundamental difference in how you value technology products
I re-evaluate how I feel about technology pretty often, and it's caused some shifts for sure. My side hobby is ARM/RISC-V low-power computing, and Apple's move to ARM tickled that hyper-efficiency side of my brain, but ultimately failed to keep me interested because of all the downsides upgrade/repairability-wise.
I'll miss the battery life of the M1 chips, and I'm going to have to re-learn how to type (CTRL instead of ALT, fn rarely being on the left, I use fn+left instead of CTRL A in terminals) but otherwise, I think I'm done.
Not that I'm discouraging you from switching or anything. If Linux is what you want/need, there's definitely better laptops to be had than a Macbook for that purpose. It's just that weird incompatibilities and having to fight with the operating system on random issues is, at least in my experience, normal when using a linux laptop. Even my T480 which has overall excellent compatibility isn't trouble-free.
Apple Silicon chips are arguably more compatible with Asahi Linux [1], but that's largely in thanks to the hard work of Marcan, who's stepped down as project lead from the project [2].
Overall I still think the right choice is to find a laptop better suited to running Linux; it just requires more careful consideration than people think. Framework laptops, which seem well suited since they mesh ideologically with Linux users, can be a pain to set up as well.
[2] https://marcan.st/2025/02/resigning-as-asahi-linux-project-l...
I'm sure with tinkering I could eventually get it working, but I'm well past the point of wanting to tinker with hardware and drivers to get Linux working.
I love Apple but they love to speak in half truths in product launches. Are they saying the M3 Ultra is their first Thunderbolt 5 computer? I don't recall seeing any previous announcements.
(NB: I've been long on AAPL since $7 a share but I'm also allergic to bullshit)
I assume that there's a community of developers focusing on leveraging this hardware instead of complaining about the operating system.
> Apple’s custom-built UltraFusion packaging technology uses an embedded silicon interposer that connects two M3 Max dies across more than 10,000 signals, providing over 2.5TB/s of low-latency interprocessor bandwidth, and making M3 Ultra appear as a single chip to software.
The comment was that the press had reported that the interposer wasn't available. This obviously uses some form of interposer, so the question is if the press missed it, or Apple has something new.
It sounds like they're using TSMC's new LSI (Local Si Interconnect) technology, which is their version of Intel's EMIB. It's essentially small islands of silicon, just around the inter-chip connections, embedded within the organic substrate. This gives the advantages of silicon interconnect, without the cost and size restrictions of a silicon interposer. It would not be visible from just looking at the package.
https://www.anandtech.com/show/16031/tsmcs-version-of-emib-l...
https://semianalysis.com/2022/01/06/advanced-packaging-part-...
Please elucidate.
^ has a lot of elaborations on this subject
The apps are developed by different teams. MacOS apps are containerized. Saying macOS's performance is hindered by Notes.app is like saying that Windows is hindered by Paint.exe. Notes.app is just a default[0]
[0]: though, I dislike saying this because I always feel like I need to mention that even Notes links against a hilarious amount of private APIs that could easily be exposed to other developers but... aren't.
Now try the same with notes on a mac. Notes mangles the punctuation and zsh is not bash.
I love diversity in websites, and apps for that matter, but this isn't diversity, it is the uncanny valley between bespoke graphic design and homogeneity.
Say what you want about SwiftUI, but it makes consistent, good looking apps. Unless something has changed, GTK is a usability disaster.
And that's before I get into how much both X11 and wayland suck equally.
There's so much I miss about Linux, but there's so much I don't
There is no such thing. Tell me, which combination of the 15+ virtual environments, dependency management and Python version managers would you use? And how would you prevent "project collision" (where one Python project bumps into another one and one just stops working)? Example: SSL library differences across projects is a notorious culprit.
Python is garbage and I don't understand why people put up with this crap unless you seriously only run ONE SINGLE Python project at a time and do not care what else silently breaks. Having to run every Python app in its own Docker image (which is the only real solution to this, if you don't want to learn Nix, which you really should, because it is better thanks to determinism... but entails its own set of issues) is not a reasonable compromise.
Was so glad when the Elixir guys came out with this recently, to at least be able to use Python, but in a very controlled, not-insane way: https://dashbit.co/blog/running-python-in-elixir-its-fine
What am I missing?
Also, typically when people say things like
> Tell me, which combination of the 15+ virtual environments, dependency management and Python version managers
It means they have been trapped in a cycle of thinking "just one more tool will surely solve my problem", instead of realising that the tools _are_ the problem, and if you just use the official methods (virtualenv and pip from a stock python install), things mostly just work.
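For anyone unfamiliar, that whole "official" workflow fits in a few lines. A minimal sketch, assuming a requirements.txt sits in the project root and POSIX-style paths (main.py is just a hypothetical entry point; on Windows the scripts live under .venv\Scripts instead):

    # Stock venv module plus pip, one environment per project, no third-party managers.
    import subprocess
    import venv

    venv.EnvBuilder(with_pip=True).create(".venv")  # environment lives next to the code
    subprocess.run([".venv/bin/pip", "install", "-r", "requirements.txt"], check=True)
    subprocess.run([".venv/bin/python", "main.py"], check=True)  # hypothetical entry point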
that's not good enough. If I'm in the business of writing Python code, I (ideally) don't want to _also_ be in the business of working around Python design deficiencies. Either solve the problem definitively, or do not try to solve the problem at all, because the middle road just leads to endless headaches for people WHILE ALSO disincentivizing a better solution.
Node has better dependency management than Python, and that's really saying something.
The thing is, most people who are writing python code are not in the business of writing python code. They're students, scientists, people with the word "business" or "analyst" in their title. They have bigger fish to fry than learning a different language ecosystem.
It took 30 years to get them to switch from excel to python. I think it's unrealistic to expect that they're going to switch from python any time soon. So for better or worse, these are problems that we have to solve.
There are reasons to want something more featureful than plain pip. Even without them, pip+virtualenv has been completely usable for, what, 15 years now?
Here's a question- If you don't touch a project in 1 year, do you expect it to still work, or not? If your answer is the latter, then we simply won't see eye-to-eye on this.
That's funny; about 10 years ago I started my career in a startup that had Python business logic running under Erlang (via a custom connector), which handled supervision and task distribution, and it looked insane to me at the time.
Even today I think it can be useful but is very hard to maintain, and containers are a good enough way to handle python.
I disagree. My take on that is that they are an ugly enough way to handle Python. And, among other problems, don't permit you to easily mess with the code (one of many reasons why this is ugly). Need access to something stateful from the container app? That's another PITA.
And if you’re genuinely asking, everything’s converging toward uv. If you pick only one, use that and be done with it.
Neither fixed anything. They just make it slightly less painful to deal with python scripts’ constant bitrot.
They also make python uniquely difficult to dockerize.
> They also make python uniquely difficult to dockerize.
RUN pip install uv && uv sync
Tada, done. No, seriously, that's the whole invocation.
Do you actually try new Python projects out with git-clone, or do you just use the same 3 Python projects for years at a time (all regularly)?
That might explain the difference in experiences
(Not saying Apple should bundle that, but it's the best current answer to running many different Python projects without using something like Docker)
I avoid writing python, so I’m usually the “other people” in that sentence.
That's optimistic. What if the system Python gets upgraded? For some reason, Python libraries tend to be super picky about the Python versions they support (not just Python 2 vs 3).
2. If you mean MDM, there are several good options. Screen sharing and SSH are build in.
3. In what sense?
4. `uv python install whatever` is infinitely better than upgrading on the OS vendor’s schedule.
5. What does that affect?
That's using a Linux VM. The idea people are asking about is native process isolation. Yes you'd have to rebuild Docker containers based on some sort of (small) macOS base layer and Homebrew/Macports, but hey. Being able to even run nodejs or php with its thousands of files natively would be a gamechanger in performance.
The same way Windows users run them. In a linux VM.
You don't get real on-hardware containerization.
It is definitely odd that Macs have no native container support, though, especially when you learn that Windows does.
Honestly I don't know what XNU/Darwin is good for. It doesn't do anything especially well compared to *BSD, Linux, and NT.
What would robust python support oob look like?
Honest question: why do you want this in MacOS? Do you understand what docker does? (it's fundamentally a linux technology, unless you are asking for user namespaces and chroot w/o SIP on MacOS, but that doesn't make sense since the app sandbox exists).
MacOS doesn't have the fundamental ecosystem problems that beget the need for docker.
If the answer is "I want to run docker containers because I have them" then use orbstack or run linux through the virtualization framework (not Docker desktop). It's remarkably fast.
I have a small rackmounted rendering farm using Mac minis, which outperform everything in the Intel world, even systems an order of magnitude more expensive.
I have run macOS on my personal and development computers for over a decade, and I have used Linux on the server side since its inception.
My experience: running server-side macOS is such a PITA it's not even funny. It may even pretend it has ssh while in fact the ssh server is only available on good days and only after Remote Desktop logged in at least once. Launchd makes you wanna crave systemd. etc, etc.
So, about docker. I would absolutely love to run my app in a containerized environment on a Mac in order to not touch the main OS.
Of course, I had a LOM/KVM and redundant networking etc. They were substantially more reliable than the Dell equipment that I used in my day job for sure.
Software-wise it behaves very differently from what you'd expect. For example, macOS won't let you in over SSH until you log in via Remote Desktop. You'll get "connection closed" immediately.
Or sometimes it will.
And that depends not on the count of connection attempts or anything you can do locally but rather on the boot process somehow. Sometimes it boots in a way that permits ssh, sometimes not. The same computer, the same OS.
Then after you log in via screen sharing and log out, macOS will let you in over ssh. For a few days. And then it will again force you to log in via the GUI. Or maybe not. I have no idea what causes it.
I have trouble reading macOS logs or understanding them. It spews a few log messages per second even when idle, and if you grep for ssh these messages contain zero actionable data, just things like "unsuccessful attempt" or similar.
Another complaint is that launchd reports the same "I/O error" on absolutely all error situations, from syntax error in plist to corrupt binary. Makes development and debugging of launchagents very fun.
In Linux, it means something very specific: a user/mount/pid/network namespace, overlayfs to provide a rootfs, chroot to pivot to the new root to do your work, and port forwarding between the host/guest systems.
On MacOS I don't know what containerization means short of virtualization. But you have virtualization on MacOS already, so why not use that?
Also I want to run the latest OS with all security patches on the host while having a stable and known macOS version in a container given how developer-hostile Apple is.
Or you can package your app as a .app and not worry about it, there's no "pollution" when everything is bundled.
Anyone wanting to run and manage their own suite of Macs to build multiple massive iOS and Mac apps at scale, for dozens or hundreds or thousands of developers deploying their changes.
xcodebuild is by far the most obvious "needs native for max perf" but there are a few other tools that require macOS. But obviously if you have multiple repos and apps, you might require many different versions of the same tools to build everything.
Sounds like a perfect use case for native containers.
Former Transmission user here.
I realise you didn't ask, but you might find some improvements in qBittorrent.
Transmission is just a small, floating window with your downloads. Click for more. It fits in the macOS vibe. But I'm a person that fully adopted the original macOS "way of working" - kicked the full-screen habit I had in windows and never felt better.
Can I ask, why would you go FROM Transmission to qBittorrent?
In my case: some torrents wouldn't find known-good seeds in Transmission but worked fine in qBittorrent; there's reasonable (but not perfect) support for libtorrent 2.0 in qBittorrent; my download speeds and overall responsiveness are anecdotally better in qBittorrent; and I make use of some of the nitty-gritty settings in qBittorrent.
And let's be clear, it wasn't the app that had problems, the Apple Remote Desktop connection to the machine failed when the speeds got above 40MB/s and the network interface stopped working around 80MB/s.
I think Transmission works perfectly fine. I've been using it for 10+ years with no issues at all on Linux.
I forgot to mention this is a Mac mini/Intel (2018).
what internal AI efforts?
Apple Intelligence is bonkers, and the Apple MLX framework remains a hobby project for Apple.
It's their spin on the Google strategy of providing services to their enterprise GCP customers. I think we'll see more out of them long term.
If so, this is hardly a hobby project.
It may not be effective, but there is serious cash behind this.
They are using OpenAI for now, but in a couple of months they will own the full value chain.
Apple is very good at marketing, a lot less at delivering actual value, especially if it's not about hardware.
1. Small models running locally with well-established tool interfaces (“app intents”)
2. Large models running in a bespoke cloud that can securely and quickly load all relevant tokens from a device before running inference
No AI lab is even close to what Apple is trying to deliver in the next ~12 months
https://www.bestbuy.com/site/apple-macbook-pro-14-laptop-m3-...
They also own distribution to the wealthiest and most influential people in the world
Don’t get lost in recency bias
They now bump it to 512GB, along with an insane price tag of $9,499 for the 512GB Mac Studio. I am pretty sure this is some AI gold rush.
Can you do absolutely everything? No. But most models will run or retrain fine now without CUDA. This premise keeps getting recycled from the past, even as that past has grown ever more distant.
I don't know if you've heard, but NVIDIA is about to add a monthly payment for additional CUDA features and I'm almost certain that many big companies will be happy to pay for them.
> But most models will run or retrain fine now without CUDA.
This is correct for some small startups, not big companies.
The example I always give is FFT libraries - if you compare cuFFT to rocFFT. rocFFT only just released support for distributed transforms in December 2024, something you've been able to do since CUDA Toolkit v8.0, released in 2017. It's like this across the whole AMD toolkit, they're so far behind CUDA it's kind of laughable.
The higher end Nvidia workstation boxes won't run well on normal 20-amp plugs. So you need to move them to a computer room (whoops, ripped those out already) or spend months getting dedicated circuits run to office spaces.
That said, if your load is going to be a continuous load drawing 80% of the rated amperage, it really should be a NEMA 5-20 plug and receptacle, the one where one of the prongs is horizontal instead of vertical. Swapping out the receptacle for one that accepts a NEMA 5-20P plug is like $5.
If you are going to actually run such a load on a 20A circuit with multiple receptacles, you will want to make sure you're not plugging anything substantial into any of the other receptacles on that circuit. A couple LED lights are fine. A microwave or kettle, not so much.
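A quick sanity check of that 80% continuous-load rule, assuming a 120 V circuit and the usual "continuous means 3+ hours" definition:

    # Derate the breaker rating to 80% for continuous loads, then convert to watts.
    VOLTS = 120
    for breaker_amps in (15, 20):
        usable_watts = breaker_amps * 0.8 * VOLTS
        print(breaker_amps, usable_watts)  # 15 A -> 1440 W, 20 A -> 1920 W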
This is not true. Standard builds (a majority) still use 15-amp circuits where 20-amp is not required by NEC.
I'm curious to learn how AI shops are actually doing model development, if anyone has experience there. What I imagined was: it's all in the "cloud" (or their own infra), and the local machine doesn't matter. If it did matter, the Nvidia software stack is too important, especially given that a 512GB M3 Ultra config costs $10,000+.
Where this hardware shines is inference (aka developing products on top of the models themselves)
It's half that of a max-spec Mac Studio, but it's also half the price with eight times faster memory. Realistically, which open source LLMs does 512GB over 256GB of memory unlock? My understanding is that the true bleeding-edge ones like R1 won't even handle 512GB well, especially with the anemic memory speed.
Re memory speed, digits will be at 273GB/s while the Mac Studio is at 819GB/s
Not to mention the Mac has six 120Gb/s Thunderbolt 5 ports and can easily be used for video editing, app development, etc.
I can't imagine the M3 Ultra doing well on a model that loads into ~500G, but they should be a blast on 70b models (well, twice as fast as my M3 Max at least) or even a heavily quantized 400b model.
Not exactly though.
This can have 512GB unified memory, 2x M4 Max can only have 128GB total (64GB each).
A 4-bit quantization of Llama-3.1 405b, for example, should fit nicely.
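Napkin math for that claim, assuming roughly 0.5 bytes per weight at 4-bit plus a hand-wavy 20% allowance for KV cache and runtime overhead:

    params = 405e9
    weights_gb = params * 0.5 / 1e9
    print(weights_gb)        # ~203 GB of weights
    print(weights_gb * 1.2)  # ~243 GB with overhead, comfortably under 512 GB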
It's still far away from an H100 though.
In the media composing world they use huge orchestral templates with hundreds and hundreds of tracks with millions of samples loaded into memory.
I don't think anyone commercially offers nearly this much unified memory or NPU/GPUs with anything near 512GB of memory.
The people who need the crazy resource can tie it to some need that costs more. You’d spend like $10k running a machine with similar capabilities in AWS in a month.
How much would it cost to get up to 512gb?
When running LLMs on Docker with an Apple M3 or M4 chip, they will operate in CPU mode regardless of the chip's class, as Docker only supports Nvidia and Radeon GPUs.
If you're developing LLMs on Docker, consider getting a Framework laptop with an Nvidia or Radeon GPU instead.
Source: I develop an AI agent framework that runs LLMs inside Docker on an M3 Max (https://kdeps.com).
(sorry, should have specified that the NPU and GPU cores need to access that ram and have reasonable performance). I specified it above, but people didn't read that :-)
CUDA has had managed memory for a long time now. You absolutely can address the entire host memory from your GPU. It will fetch it, if it's needed. Not fast, but addressable.
Also, 32GB DDR5 RDIMMs are ~$200, so that's ~$5K for 24 right there. Then you need 2x CPUs at ~$1K each for the cheapest, and then a motherboard, which is another ~$1K. So for $8K (more, given you need a case, power supply, and cooling!), you get a system with about half the memory bandwidth, much higher power consumption, and a much larger footprint.
You do not need 2 CPUs. If however you use 2 CPUs, then the memory bandwidth doubles, to 1152 GB/s, exceeding Apple by 40% in memory bandwidth. The cost of the memory would be about the same, by using 16 GB modules, but the MB would be more expensive and the second CPU would add to the price.
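For reference, the theoretical peak behind that 1152 GB/s figure, assuming 12 DDR5-6000 channels per socket (64-bit each) and software that actually keeps traffic NUMA-local:

    channels, bytes_per_channel, transfers_per_sec = 12, 8, 6000e6
    per_socket = channels * bytes_per_channel * transfers_per_sec / 1e9
    print(per_socket)      # 576 GB/s per socket
    print(per_socket * 2)  # 1152 GB/s across two sockets, if the software cooperates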
The memory bandwidth does not double, I believe. See this random issue for a graph that has single/dual socket measurements, there is essentially no difference: https://github.com/abetlen/llama-cpp-python/issues/1098
Perhaps this is incorrect now, but I also know with 2x 4090s you don’t get higher tokens per second than 1x 4090 with llama.cpp, just more memory capacity.
(All if this only applies to llama.cpp, I have no experience with other software and how memory bandwidth may scale across sockets)
With a badly organized program, the performance can be limited not by the memory bandwidth, which is always exactly double for a dual-socket system, but by the transfers on the inter-socket links.
Moreover, your link is about older Intel Xeon Sapphire Rapids CPUs, with inferior memory interfaces and with more quirks in memory optimization.
But where is your data? For llama.cpp? For whatever dual socket CPU system you want. That’s all I am claiming.
https://github.com/ggml-org/llama.cpp/discussions/11733
about the scaling of llama.cpp and DeepSeek on some dual-socket AMD systems.
While it was rather tricky, after many experiments they have obtained an almost double speed on two sockets, especially on AMD Turin.
However, if you look at the actual benchmark data, that must be much lower than what is really possible, because their test AMD Turin system (named there P1) had only two thirds of the memory channels populated, i.e. performance limited by memory bandwidth could be increased by 50%, and they had 16-core CPUs, so performance limited by computation could be increased around 10 times.
A single 192 core Epyc is 11k by itself, so I’d probably go for the simpler integrated M3 ultra solution…
Time to first token, context length, and tokens/s are significantly inferior on CPUs when dealing with larger models even if the bandwidth is the same.
When used for ML/AI applications, a consumer GPU has much better performance per dollar.
Nevertheless, when it is desired to use much more memory than in a desktop GPU, a dual-socket server can have higher memory bandwidth than most desktop GPUs, i.e. more than an RTX 4090, and a computational capability that for FP32 could exceed an RTX 4080, but it would be slower for low-precision data where the NVIDIA tensor cores can be used.
INT8, INT4, FP8 and soon FP4
Both CPUs (with the BF16 instructions and with the VNNI instructions for INT8 inference) and the GPUs have a higher throughput for lower precision data types than for FP32, but the exact acceleration factors are hard to find.
The Intel server CPUs have the advantage vs. AMD that they also have the AMX matrix instructions, which are intended to compete for inference applications with the NVIDIA tensor cores, but the Intel CPUs are much more expensive for a number of cores big enough to be competitive with GPUs.
Thinking about it, you can get a decent 256GB on consumer platforms now too, but the speed will be a bit crap and you'd need to make sure the platform fully supports ECC UDIMMs.
Like another poster said, 768 GB of ECC RDIMM DDR5-6000 costs around $5000.
Any program whose performance is limited by memory bandwidth, as it can be frequently the case for inference, will run significantly faster in such an EPYC server than in the Apple system, even when running on the CPU.
Even for computationally-limited programs, the difference between server CPUs and consumer GPUs is not great. One Epyc CPU may have about the same number of FP32 execution units as an RTX 4070, while running at a higher clock frequency (but it lacks the tensor units of an NVIDIA GPU, which can greatly accelerate the execution, where applicable).
> Any program whose performance is limited by memory bandwidth, as it can be frequently the case for inference, will run significantly faster in such an EPYC server than in the Apple system, even when running on the CPU.
Source on this? CPUs would be very compute constrained.
However, Apple does not say anything about the GPU clock frequency, which I assume is significantly lower than NVIDIA's.
In comparison, a dual-socket AMD Turin can have up to 12288 FP32 execution units, i.e. 20% more than an Apple GPU.
Moreover, the clock frequency of the AMD CPU must be much higher than that of the Apple GPU, so the AMD system may be at least twice as fast as the Apple M3 Ultra GPU for some graphics applications.
I do not know what facilities exist in the Apple GPU for accelerating the computations with low-precision data types, like the tensor cores of NVIDIA GPUs.
While for graphic applications big server CPUs are actually less compute constrained than almost all consumer GPUs (except RTX 4090/5090), the GPUs can be faster for ML/AI applications that use low-precision data types, but this is not at all certain for the Apple GPU.
Even if the Apple GPU happens to be faster for some low-precision data type, the difference cannot be great.
However a server that would beat the Apple M3 Ultra GPU computationally would cost much more than $10k, because it would need CPUs with many cores.
If the goal is only to have a system with 50% more memory and 40% more memory bandwidth than the Apple system, that can be done at a $10k price.
While such a system would become compute constrained more often than an Apple GPU, it would still beat it every time when the memory would be the bottleneck.
I have just compared the FP32 computational capabilities, i.e. what is used for graphics, between the Apple M3 Ultra GPU and AMD server CPUs, because these numbers are easily available and they demonstrate the size relationships between them.
Both GPUs and server CPUs have greater throughputs for lower precision data (CPUs have instructions for BF16 and INT8 inference), but the exact acceleration factors are hard to find and it is more difficult to estimate the speeds without access to such systems for running benchmarks.
I'd like to see some proper benchmarking on this though, but it looks like the Apple systems might just be extremely good value if you want to run the large DeepSeek model.
says who? NVIDIA has essentially entrenched themselves thanks to CUDA
Additionally, I would assume this is a very low-volume product, so it being on N3B isn't a dealbreaker. At the same time, these chips must be very expensive to make, so tying them with luxury-priced RAM makes some kind of sense.
Makes it even more puzzling what they are doing with the M2 Mac Pro.
[0] https://www.numerama.com/tech/1919213-m4-max-et-m3-ultra-let...
[1] More context on Macrumors: https://www.macrumors.com/2025/03/05/apple-confirms-m4-max-l...
And anyway, I think the M2 Mac Pro was Apple asking customers "hey, can you do anything interesting with these PCIe slots? because we can't think of anything outside of connectivity expansion really"
RIP Mac Pro unless they redesign Apple Silicon to allow for upgradeable GPUs.
Either that or kill the Mac Pro altogether, the current iteration is such a half-assed design and blatantly terrible value compared to the Studio that it feels like an end-of-the-road product just meant to tide PCIe users over until they can migrate everything to Thunderbolt.
They recycled a design meant to accommodate multiple beefy GPUs even though GPUs are no longer supported, so most of the cooling and power delivery is vestigial. Plus the PCIe expansion was quietly downgraded, Apple Silicon doesn't have a ton of PCIe lanes so the slots are heavily oversubscribed with PCIe switches.
I just find it interesting that you can currently buy a M2 Ultra Mac Pro that is weaker than the Mac Studio (for a comparable config) at a higher price. I guess it "remains a product in their lineup" and we'll hear more about it later.
Additionally: If they wanted to scrap it down the road, why would they do this now?
Maybe they can bring back the trash can.
Indeed, and tbh it really commits even more to the non-expandability that the Trashcan's designers seemed to be going for. After all, the Trashcan at least had replaceable RAM and storage. The Mac Studio has proprietary storage modules for no reason aside from Apple's convenience/profits (and of course the 'integrated' RAM which I'll charitably assume was done for altruistic reasons because of how it's "shared.")
The difference is that today users are accepting modern Macs where they rejected the Trashcan. I think it's because Apple's practices have become more widespread anyway*, and certain parts of the strategy like the RAM thing at least have upsides. That, and the thermals are better because the Trashcan's thermal design was not fit for purpose.
* I was trying to fix a friend's nice Lenovo laptop recently -- it turned out to just have some bad RAM, but when we opened it up we found it was soldered :(
Compared to Nvidia's Project DIGITS, which is supposed to cost $3K and be available "soon", you can get a spec-matching 128GB & 4TB version of this Mac for about $4700, and the difference is that you can actually get it in a week and it will run macOS (no idea how much performance difference to expect).
I can't wait to see someone testing the full DeepSeek model on this, maybe this would be the first little companion AI device that you can fully own and can do whatever you like with it, hassle-free.
at 819 GB per second bandwidth, the experience would be terrible
A back of the napkin calculation: 819GB/s / 37GB/tok = 22 tokens/sec.
Realistically, you’ll have to run quantized to fit inside of the 512GB limit, so it could be more like 22GB of data transfer per token, which would yield 37 tokens per second as the theoretical limit.
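The same napkin math in code form, assuming the weights are read once per generated token and ignoring compute and KV-cache traffic:

    bandwidth_gb_s = 819
    print(bandwidth_gb_s / 37)  # ~22 tok/s ceiling at ~1 byte per active weight
    print(bandwidth_gb_s / 22)  # ~37 tok/s ceiling at a ~4.5-bit quantization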
It is likely going to be very usable. As other people have pointed out, the Mac Studio is also not the only option at this price point… but it is neat that it is an option.
Also, people figured a way to run these things in parallel easily. The device is pretty small, I think for someone who wouldn't mind the price tag stacking 2-3 of those wouldn't be that bad.
525GB/s to 1000GB/s will double the TPS at best, which is still quite low for large LLMs.
The hardware has evolved faster than software at Apple. It’s usually the opposite with most tech companies where hardware is unable to keep up with software.
[1] Asus just announced the world’s first Thunderbolt 5 eGPU:
https://www.theverge.com/24336135/asus-thunderbolt-5-externa...
(2) “No one” is developing games for Linux either, but the Steam Deck works great. Why? Wine, which you can run on macOS too.
When I connect to my Mac Studio via Macbook I can select that mode, then change the Displays setting to Dynamic Resolution and then my 'thin client':
- Is fullscreen using the entire 16:10 Macbook screen
- Gets 60 fps low latency performance (including on actual games)
- Transfers audio, I can attend meetings in this mode
- Blanks the host Mac Studio screen
All things that were impossible via VNC - RDP is much better but this new High Performance Screen Share is even more powerful.
The thin lightweight laptop that remotes into a loaded machine has always been my idea of high mobility instead of suffering a laptop running everything locally. This works via LTE as well with some firewall setup.
If you sit around expecting selflessness from Apple you will waste an enormous amount of time, trust me.
Not even OCSP?
But if you're being pedantic, I meant Apple SaaS requiring monthly payments or any other form of using something from Apple where I give them money outside the purchase of their hardware.
If you're talking background services as part of macOS, then you're being intentionally obtuse to the point and you know it
Maybe that's why they ship with insultingly-small SSDs by default, so that as people's photo libraries, Desktop and Documents folders fill up, Apple can "fix your problem" for you by selling you the iCloud/Apple One plan to offload most of the stuff to only live in iCloud.
Either they spend the $400 up front to go two notches up on the SSD upgrade, to match what a reasonable device would come with, or they spend that $400 at $10 a month over the likely 40-month lifetime of the computer. Apple wins either way.
This forum loses track of the world outside this echo chamber.
That said, attracting creative users also adds value to the platform by creating demand for creative software for macOS, which keeps existing packages for macOS maintained and brings new ones on board every so often.
Whatever the profit margin on a Mac Studio is these days, surely improving non-consumer options becomes profitable at some point if you start selling them by the thousands to data centers.
It does. Support costs. How do you prove it's a hardware failure or software? What should they do? Say it "unofficially" supports Linux? People would still try to get support. Eventually they'd have to test it themselves etc.
Has been. That's the key word: past tense. Maybe that's the point; they gave up on it, acknowledging the extra costs and issues.
Image recognition, OCR, AR and more are applications of the NPU that didn't exist at all on older iPhones because they would have been too intensive for the chips and batteries.
You're confusing this with what features/enhancements new generations of NPUs bring, which nobody else was talking about. Everyone else in the conversation is comparing pre- and post-NPU.
That said, there are efforts being made to use the NPU. See: https://github.com/Anemll/Anemll - you can now run small models directly on your Apple Silicon Mac's NPU.
It doesn't give better performance but it's massively more power efficient than using the GPU.
The Neural Engine is useful for a bunch of Apple features, but seems weirdly useless for any LLM stuff... been wondering if they'd address it on any of these upcoming products. AI is so hype right now it seems odd that they have specialised processor that doesn't get used for the kind of AI people are doing. I can see in the latest release:
> Mac Studio is a powerhouse for AI, capable of running large language models (LLMs) with over 600 billion parameters entirely in memory, thanks to its advanced GPU
https://www.apple.com/newsroom/2025/03/apple-unveils-new-mac...
i.e. LLMs still run on the GPU not the NPU
Not sure how much storage to get. I was floating the idea of getting less storage, and hooking it up to a TB5 NAS array of 2.5” SSDs, 10-20tb for models + datasets + my media library would be nice. Any recommendations for the best enclosure for that?
I also want to build the thing you want. There are no multi-SSD M.2 TB5 bays. I made one that holds 4 drives (16TB) at TB3, and even there the underlying drives are far faster than the cable.
My stuff is in OWC Express 4M2.
This is my understanding (probably incorrect in some places)
1. NVIDIA's big advantage is that they design the hardware (chips) and software (CUDA). But Apple also designs the hardware (chips) and software (Metal and MacOS).
2. CUDA has native support by AI libraries like PyTorch and Tensorflow, so works extra well during training and inference. It seems Metal is well supported by PyTorch, but not well supported by Tensorflow.
3. NVIDIA uses Linux rather than MacOS, making it easier in general to rack servers.
In terms of hardware, Apple designs their GPUs for graphics workloads, whereas Nvidia has a decades-old lead on optimizing for general-purpose compute. They've gotten really good at pipelining and keeping their raster performance competitive while also accelerating AI and ML. Meanwhile, Apple is directing most of their performance to just the raster stuff. They could pivot to an Nvidia-style design, but that would be pretty unprecedented (even if a seemingly correct decision).
And then there's CUDA. It's not really appropriate to compare it to Metal, both in feature scope and ease of use. CUDA has expansive support for AI/ML primitives and deeply integrated tensor/SM compute. Metal does boast some compute features, but you're expected to write most of the support yourself in the form of compute shaders. This is a pretty radical departure from the pre-rolled, almost "cargo cult" CUDA mentality.
The Linux shtick matters a tiny bit, but it's mostly a matter of convenience. If Apple hardware started getting competitive, there would be people considering the hardware regardless of the OS it runs.
Isn't Apple also focusing on the AI stuff? How has it not already made that decision? What would prevent Apple from making that decision?
> Metal does boast some compute features, but you're expected to write most of the support yourself in the form of compute shaders. This is a pretty radical departure from the pre-rolled, almost "cargo cult" CUDA mentality.
Can you give an example of where Metal wants you to write something yourself whereas CUDA is pre-rolled?
Yes, but not with their GPU architecture. Apple's big bet was on low-power NPU hardware, assuming the compute cost of inference would go down as the field progressed. This was the wrong bet - LLMs and other AIs have scaled up better than they scaled down.
> How has it not already made that decision? What would prevent Apple from making that decision?
I mean, for one, Apple is famously stubborn. They're the last ones to admit they're wrong whenever they make a mistake, presumably admitting that the NPU is wasted silicon would be a mea-culpa for their AI stance. It's also easier to wait for a new generation of Apple Silicon to overhaul the architecture, rather than driving a generational split as soon as the problem is identified.
As for what's preventing them, I don't think there's anything insurmountable. But logically it might not make sense to adopt Nvidia's strategy even if it's better. Apple can't necessarily block Nvidia from buying the same nodes they get from TSMC, so they'd have to out-design Nvidia if they wanted to compete on their merits. Even then, since Apple doesn't support OpenCL, it's not guaranteed that they would replace CUDA. It would just be another proprietary runtime for vendors to choose from.
> Can you give an example of where Metal wants you to write something yourself whereas CUDA is pre-rolled?
Not exhaustively, no. Some of them are performance-optimized kernels like cuSPARSE, some others are primitive sets like cuDNN, and others yet are graph and signal processing libraries with built-out support for industrial applications.
To Apple's credit, they've definitely started hardware-accelerating the important stuff like FFT and ray tracing. But Nvidia still has a decade of lead time that Apple spent shopping around with AMD for other solutions. The head-start CUDA has is so great that I don't think Apple can seriously respond unless the executives light a fire under their ass to make some changes. It will be an "immovable rock versus an unstoppable force" decision for Apple's board of directors.
I'd say the biggest problem with the NPU is that you can only use it from Core ML. Even MLX can't access it!
As you say the big world-changing LLMs are scaling up, not down. At the same time (at least so far) LLM usage is intermittent - we want to consume thousands of tokens in seconds, but a couple of times a minute. That's a client-server timesharing model for as long as the compute and memory demand can't fit on a laptop.
I bought a refurbished M3 Max to run LLMs (it can only go up to 70b with 4-bit quant), and it is only slightly slower than the more expensive M4 Max.
M1: November 10, 2020
M1 Pro: October 18, 2021
M1 Max: October 18, 2021
M1 Ultra: March 8, 2022
-------------------------
M2: June 6, 2022
M2 Pro: January 17, 2023
M2 Max: January 17, 2023
M2 Ultra: June 5, 2023
-------------------------
M3: October 30, 2023
M3 Pro: October 30, 2023
M3 Max: October 30, 2023
-------------------------
M4: May 7, 2024
M4 Pro: October 30, 2024
M4 Max: October 30, 2024
-------------------------
M3 Ultra: March 5, 2025
The M3 Ultra might perform as well as the M4 Max - I haven't seen benchmarks yet - but the newer series is in the higher end devices which is what most people expect.
I feel like I should be able to spend all my money to both get the fastest single core performance AND all the cores and available memory, but Apple has decided that we need to downgrade to "go wide". Annoying.
I'm a major Apple skeptic myself, but hasn't there always been a tradeoff between "fastest single core" vs "lots of cores" (and thus best multicore)?
For instance, I remember when you could buy an iMac with an i9 or whatever, with a higher clock speed and faster single core, or you could buy an iMac Pro with a Xeon with more cores, but the iMac (non-Pro) would beat it in a single core benchmark. Note: Though I used Macs as the example due to the simple product lines, I thought this was pretty much universal among all modern computers.
Not in the Apple Silicon line. The M2 Ultra has the same single core performance as the M2 Max and Pro. No benchmarks for the M3 Ultra yet but I'm guessing the same vs M3 Max and Pro.
I'm not sure if this is me not maintaining it properly (e.g. dust blocking the fans), but I've always got this sense that Apple throttles their older devices in some indirect ways. I experience it the most with iPhones; my old iPhone is pretty slow doing basic things despite nothing really changing on it (just the OS updating?)
So my only concern with this is - how many years until it's slow enough to annoy you into buying a new one?
"Just the OS updating" is not insignificant. Software developers, in general, are not known for making sure latest versions of their software run smoothly on older hardware.
Also, performance on iPhones is throttled when your battery is very old. There was a whole class-action lawsuit about it.
Conspicuously, Apple just so happened to pick the one that encouraged people to upgrade the entire phone. You know, an entire phone that is otherwise functional without arbitrary restrictions by the OEM.
https://support.apple.com/iphone/repair/battery-replacement
There is no “DRM” in their battery.
If you had the wherewithal, you could do it yourself
https://www.ifixit.com/Guide/iPhone+6s+Battery+Replacement/5...
The units weren't "faulty"; all batteries degrade over time.
For M3 and M4 machines, hardware support is pretty derelict: https://asahilinux.org/docs/M3-Series-Feature-Support/
https://asahilinux.org/docs/M1-Series-Feature-Support/#table...
I assume anything that doesn't have "linux-asahi" is not supported -- or any WIP is not supported.
Wish I had the skills to help them. Targeting just one architecture, I think Asahi has more chances of success.
(Maybe I'm missing something here.)
Do you expect this will be able to handle AI workloads well?
All I’ve heard for the past two years is how important a beefy GPU is. Curious if that holds true here too.
The model weights (billions of parameters) must be loaded into memory before you can use them.
Think of it like this: Even with a very fast chef (powerful CPU/GPU), if your kitchen counter (VRAM) is too small to lay out all the ingredients, cooking becomes inefficient or impossible.
Processing power still matters for speed once everything fits in memory, but it's secondary to having enough VRAM in the first place.
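A very rough rule of thumb, nothing more: weight memory is parameter count times bytes per weight, and KV cache plus activations come on top of that.

    def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
        # Billions of parameters times bytes per weight gives GB of weights directly.
        return params_billion * bits_per_weight / 8

    print(weight_memory_gb(70, 4))   # ~35 GB: fits on a 64 GB machine
    print(weight_memory_gb(671, 4))  # ~336 GB: needs something like 512 GB of unified memory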
My guess is that these chips could be compute-bound though given how little compute capacity they have.
I agree that compute is likely to become the bottleneck for these new Apple chips, given they only have like ~0.1% the number of flops
FWIW I went the route of "bunch of GPUs in a desktop case" because I felt having the compute oomph was worth it.
Not in the case of language models, which are typically bound by memory size rather than bandwidth.
The thing with these Apple chips is that they have unified memory, where CPU and GPU use the same memory chips, which means that you can load huge models into RAM (no longer VRAM, because that doesn't exist on those devices). And while Apple's integrated GPU isn't as powerful as an Nvidia GPU, it is powerful enough for non-professional workloads and has the huge benefit of access to lots of memory.
https://www.primeline-solutions.com/de/nvidia-h100-nvl-94gb-...
These are unified memory. The M3 Ultra with 512gb has as much VRAM as sixteen 5090.
The M4 pro was able to answer the test prompt twice--once on battery and once on mains power--before the AMD box was able to finish processing.
The M4's prompt parsing took significantly longer, but token generation was significantly faster.
Having the memory to the cores that matter makes a big difference.
> VRAM is what takes a model from "can not run at all" to "can run" (even if slowly), hence the emphasis.
Is false. Regardless of how much VRAM you have, if the criteria is "can run even if slowly", all machines can run all models because you have swap. It's unusably slow but that's not what OP was claiming the difference is.
My memory is wrong, it was the 32b. I'm running the 70b against a similar prompt and the 5950X is probably going to take over an hour for what the M4 managed in about 7 minutes.
edit: an hour later and the 5950 isn't even done thinking yet. Token generation is generously around 1 token/s.
edit edit: final statistics. M4 Pro managing 4 tokens/s prompt eval, 4.8 tokens/s token generation. 5950X managing 150 tokens/s prompt eval, and 1 token/s generation.
Perceptually I can live with the M4's performance. It's a set prompt, do something else, come back sort of thing. The 5950/RTX3080's is too slow to be even remotely usable with the 70b parameter model.
Otherwise, you don't even need a computer. Pen and paper is plenty.
For all practical purposes, VRAM is a limiting factor.
1. What are various average joe (as opposed to researchers, etc.) use cases for running powerful AI models locally vs. just using cloud AI. Privacy of course is a benefit, but it by itself may not justify upgrades for an average user. Or are we expecting that new innovation will lead to much more proliferation of AI and use cases that will make running locally more feasible?
2. With the amount of memory used jumping up, would there be a significant growth for companies making memories? If so, which ones would be the best positioned?
Thanks.
A local model will do anything you ask it to, as far as it "knows" about it. It doesn't need to please investors or be afraid of bad press.
LM Studio + a group of select models from huggingface and you can do whatever you want.
For generic coding assistance and knowledge, online services are still better quality.
Apple seems to be using LPDDR, but HBM will also likely be a key tech. SK Hynix and Samsung are the most reputable for both.
>> Apple seems to be using LPDDR, but HBM will also likely be a key tech. SK Hynix and Samsung are the most reputable for both.
So not much Micron? Any US based stocks to invest in? :-)
I think a great use case for this would be in a company that doesn't want all of their employees sending LLM queries about what they're working on outside the company. Buy one or two of these and give everybody a client to connect to it and hey presto you've got a secure private LLM everybody in the company can use while keeping data private.
With a local model, I could toss anything in there. Database query outputs, private keys, stuff like that. This’ll probably become more relevant as we give LLM’s broader use over certain systems.
Like right now I still mostly just type or paste stuff into ChatGPT. But what about when I have a little database copilot that needs to read query results, and maybe even run its own subset of queries like schema checks? Or some open source computer-use type thingy needs to click around in all sorts of places I don’t want openAI going, like my .env or my bash profile? That’s the kinda thing I’d only use a local model for
I'm in Hong Kong, I can't even subscribe to OpenAI or Claude directly, though granted this doesn't so much apply to the already "open" models
That just may be dependent on how much trust you have on the providers you use. Or do you do your own electricity generation?
I tend to do the same thing. I do not consider myself as a good representative of an average user though.
I do not have a good sense of how well quality scales with narrow MoEs but even if we get something like Llama 3.3 70b in quality at only 8b active parameters people could do a ton locally.
Gamers don't generally use a mac because of the lack of games and I'm guessing those who are really into LLMs use Linux for the flexibility. Video editing can be done on much cheaper hardware.
Very rich LLM enthusiasts who want to try out a Mac?
You can get a good experience on a Windows or Linux machine with DaVinci Resolve, but that’s mostly because of the way better GPUs like the 4090/RTX series you’ve got at your disposal.
If it is equivalent, then the machine pays for itself in 300 hours. That's incredible value.
[1] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...
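Spelling out the break-even arithmetic; the hourly rate here is just the hypothetical one implied by the 300-hour figure above, not a quoted cloud price:

    mac_price = 9499
    breakeven_hours = 300
    implied_hourly_rate = mac_price / breakeven_hours
    print(implied_hourly_rate)  # ~$31.7/hour of equivalent cloud compute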
I wonder if the plan is to only release Ultras for odd number generations.
Hah, I see what they did there.
Well, duh, it would be a shame if you made a step backwards, wouldn't it? I hate that stupid phrase...
It feels like one should be able to build a good machine for $3-4k, if not less, with six 16GB mid-level gaming GPUs.
Hot take: You can tie yourself into six knots trying to spin a yarn about why the M3 Ultra spec is super awesome for some AI use-case, meanwhile you could buy a Mac Mini and like 200 million GPT-4o tokens for the cost of this machine that can't even run R1.
$9499
Whatever happened to competition in computing?
Computing hardware competition used to be cut throat, drop dead, knife fight, last man standing brutally competitive. Now it's just a massive gold rush cash grab.
Could cost half of that and it would still be uninteresting for my use cases.
For AI, on-demand cloud processing is magnitudes better in speed and software compatibility anyway.
For example, I'll happily feed my entire directory of private notes/diary entries into an LLM running offline on my laptop. I would never do that with someone else's LLM running in the cloud.
I'm curious what instruction sets may have been included with the M3 chip that the other two lack for AI.
So far the candidates seem to be NVIDIA DIGITS, the Framework Desktop, and the 64GB M1 / 128GB M2/M3 Studio/Ultra machines.
The GPU market isn't competitive enough for the amount of VRAM needed. I was hoping for a Battlemage GPU model with 24GB that would be reasonably priced and available.
For the Framework Desktop and similar devices, I think a second generation will be significantly better than what's currently on offer today. Rationale below...
For a max spec processor with ram at $2,000, this seems like a decent deal given today's market. However, this might age very fast for three reasons.
Reason 1: LPDDR6 may debut in the next year or two; this could bring massive improvements to memory bandwidth and capacity for soldered-on memory.
LPDDR6 vs LPDDR5:
- Data bus width: 24 bits vs 16 bits
- Burst length: 24 bits vs 15 bits
- Memory bandwidth: up to 38.4 GB/s vs up to 6.7 GB/s
- CAMM RAM may or may not maintain signal integrity as memory bandwidth increases. Until I see it implemented for an AI use case in a cost-effective manner, I am skeptical.
Reason 2: It's a laptop chip with limited PCIe lanes and a reduced power envelope. Theoretically, a desktop chip could have better performance, more lanes, and be socketable (although I don't think I've seen a socketed CPU with soldered RAM).
Reason 3: In addition, what does hardware look like being repurposed in the future compared to alternatives?
- Unlike desktop or server counterparts, which can have higher CPU core counts and more PCIe/IO expansion, this processor and its motherboard are limited for repurposing later down the line as a server to self-host software other than AI. I suppose it could be turned into an overkill NAS with ZFS and a single HBA controller card in a new case.
- Buying into the framework desktop is pretty limited based on the form factor. Next generation might be able to include a 16x slot fully populated, a 10G nic. That seems about it if they're going to maintain the backward compatibility philosophy given the case form factor.
Did they say why there’s not an m4 ultra?
Soldered?
These machines have a 512 bit interface, so presumably even worse.
It's getting that bandwidth by going very wide across very many channels, rather than trying to push a gigantic amount of bandwidth through only a few channels.
- The AI Max+ 395 is a 256 bit bus ("4 channels") of 8000 MHz instead of 128 bits ("2 channels") of 16000 MHz because you can't practically get past 9000 MHz in a consumer device, even if you solder the RAM, at the moment. Max capacity 128 GB.
- 5th Gen Epyc is a 768 bit bus ("12 channels") of 6000 MHz because that lets you use a standard socketed setup. Max capacity 6 TB.
- M3 Ultra is a 1024 bit bus ("16 channels") of "~6266 MHz" as it's 2x the M3 Max (which is 512 bits wide) and we know the final bandwidth is ~800 GB/s. Max capacity 512 GB.
Note: "Channels" is in quotes because the number of bits per channel isn't actually the same per platform (and DDR5 is actually 2x32 bit channels per DIMM instead of 1x64 per DIMM like older DDR... this kind of shit is why just looking at the actual bit width is easier :p).
So really the frequencies aren't that different even though these are completely different products across completely different segments. The overwhelming factor is bus width (channels) and the rest is more or less design choice noise from the perspective of raw performance.
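Plugging those bus widths and transfer rates into the usual formula, treating MT/s as "MHz" the same way as above (real-world numbers land a bit lower):

    def peak_gb_s(bus_bits, mt_s):
        # bytes per transfer times transfers per second
        return bus_bits / 8 * mt_s * 1e6 / 1e9

    print(peak_gb_s(256, 8000))   # AI Max+ 395     -> 256 GB/s
    print(peak_gb_s(768, 6000))   # 1x 5th-gen Epyc -> 576 GB/s
    print(peak_gb_s(1024, 6266))  # M3 Ultra        -> ~802 GB/s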
I'd love to have more points of comparison available, but Strix Halo is the most analogous chip to an M-series chip on the market right now from a memory point of view, so it's hard to really know anything.
I very much hope CAMM2 or something else can be made to work with a Strix-like setup in the future, but I have my doubts.
The memory bus is the same as for modules, it's just very short. The higher end SoCs have more memory bandwidth because the bus is wider (i.e. more modules in parallel).
You could blame DDR5 (who thought having a speed negotiation that can go over a minute at boot is a good idea?), but I blame the obsession with thin and the ability to overcharge your customers.
> I've never seen a GPU with replaceable RAM
I still have one :) It's an ISA Trident TVGA 8900 that I personally upgraded from 512k VRAM to one full megabyte!!!
There's a good reason it's soldered, i.e. the wide memory interface and huge bandwidth mean that the extra trace lengths needed for an upgradable RAM slot would screw up the memory timings too much, but there's no need to make false claims like saying it's on-die.
Existing ones possibly but why not build something that lets you snap-in a BGA package just like we snap in CPUs on full sized PC mainboards?
It's the same reason nobody sells GPUs that have user upgradable non-soldered GDDR VRAM modules.
Figure out a way to make it unified without also soldering it, and you'll be a billionaire.
Or are you just grinding a tired, 20-year-old axe.
The issue is availability of chips, and most likely you have to know which components to change so the new memory is recognised. For instance, that could be changing a resistor to a different value or bridging certain pads.
What if both are an issue?
I thought it was a few weeks ago when the M4 Max came out.
Does anyone have a ballpark number for how many tokens per second we can get with this?
Is anyone other than a vanishingly small number of hard core hobbyists going to upgrade from an M4 to an M4 Ultra?
I expect that the 2 biggest buyers of M4 Ultra will be people who want to run LLMs locally, and people who want the highest performance machine they can get (professionals), but are wedded to mac-only software.
It is reasonable to say many folks in the field prefer to work on mac hardware.
Why? Do they have too many M3 chips in stock?
what's the point of 512GB RAM for LLMs on this Mac Studio if the speed is painfully slow?
it's as if Apple doesn't want to compete with Nvidia... this is really disappointing in a Mac Studio. FYI: M2 Ultra already has 800GB/s bandwidth
NVIDIA RTX 4080: ~717 GB/s
AMD Radeon RX 7900 XTX: ~960 GB/s
AMD Radeon RX 7900 XT: ~800 GB/s
How's that slow exactly ?
You can have 10000000Gb/s and without enough VRAM it's useless.
Nvidia RTX 4090 (Ada Lovelace)
FP32: Approximately 82.6 TFLOPS
FP16: When using its 4th‑generation Tensor Cores in FP16 mode with FP32 accumulation, it can deliver roughly 165.2 TFLOPS (in non‑tensor mode, the FP16 rate is similar to FP32).
FP8: The Ada architecture introduces support for an FP8 format; using this mode (again with FP32 accumulation), the RTX 4090 can achieve roughly 330.3 TFLOPS (or about 660.6 TOPS, depending on how you count operations).
Apple M1 Ultra (The previous‑generation top‑end Apple chip)
FP32: Around 15.9 TFLOPS (as reported in various benchmarks)
FP16: By similar scaling, FP16 performance would be roughly double that value—approximately 31.8 TFLOPS (again, an estimate based on common patterns in Apple’s GPU designs)
FP8: Like the M3 family, the M1 Ultra does not support a dedicated FP8 precision mode.
So a $2000 Nvidia 4090 gives you about 5x the FLOPS, but with far less high speed RAM (24GB vs. 512GB from Apple in the new M3 Ultra). The RAM bandwidth on the Nvidia card is over 1TBps, compared with 800GBps for Apple Silicon.
Apple is catching up here and I am very keen for them to continue doing so! Anything that knocks Nvidia down a notch is good for humanity.
I don't love Nvidia a whole lot, but I can't understand where this sentiment comes from. Apple abandoned their partnership with Nvidia, tried to support their own CUDA alternative with blackjack and hookers (OpenCL), abandoned that, and began rolling out a proprietary replacement.
CUDA sucks for the average Joe, but Apple abandoned any chance of taking the high road when they cut ties with Khronos. Apple doesn't want better AI infrastructure for humanity; they envy the control Nvidia wields and want it for themselves. Metal versus CUDA is the type of competition where no matter who wins, humanity loses. Bring back OpenCL, then we'll talk about net positives again.
M3 Max GPU benchmarks around 14 TFLOPs, so the Ultra should score around 28 TFLOPs.
Double the numbers for FP16.
VRAM is not really the limiting factor for serious actors in this space.
> what's the point of 512GB RAM for LLMs on this Mac Studio if the speed is painfully slow?
You can fit the entire Deepseek 671B q4 into this computer and get 41 tokens/s because it's an MoE model.

> "40 tokens/s by my calculations" to "40 tokens/s" to "41 tokens/s"

Is there a die involved in "your calculations"?
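For what it's worth, here's a rough sketch of where a 40-ish tokens/s figure could come from, assuming DeepSeek's widely reported ~37B active parameters per token and a 4-bit quant (both assumptions on my part), and treating bs=1 decode as purely memory-bandwidth bound:

    # Napkin math: decode is roughly bandwidth-bound, so
    # tokens/s ~= memory bandwidth / bytes read per token.
    active_params   = 37e9    # assumed active (MoE) parameters per token
    bytes_per_param = 0.5     # q4 quantization
    bandwidth       = 800e9   # M3 Ultra memory bandwidth, bytes/s

    bytes_per_token = active_params * bytes_per_param   # ~18.5 GB per token
    print(bandwidth / bytes_per_token)                   # ~43 tokens/s, an upper bound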
Doesn't matter. All theorized because no one has publicly tested one.
If you configure a Threadripper workstation at Puget Systems, memory price seems to be ~$6/GB. Except if you use 128 GB modules, which are almost $10/GB. You can get 768 GB for a Threadripper Pro cheaper than 512 GB for a Threadripper, but the base cost of a Pro system is much higher.
Meanwhile, this thing has a faster CPU, GPU, and 512GB of 800GB/s VRAM for $9,500.
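Roughly, using the per-GB figures above (list prices move around, so treat this as a sketch of the RAM cost alone):

    # RAM cost only, from the ~$/GB numbers quoted above
    threadripper_512     = 512 * 10   # needs 128 GB modules at ~$10/GB -> ~$5,120
    threadripper_pro_768 = 768 * 6    # ~$6/GB modules -> ~$4,608
    print(threadripper_512, threadripper_pro_768)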
Even for its intended AI audience, the ISA additions in M4 brought significant uplift.
Are they waiting to put M4 Ultra into the Mac Pro?
With an M3 Ultra going into the Mac Studio, Apple could differentiate from the Mac Pro, which could then get the M4 Ultra. Right now, the Mac Studio and Mac Pro oddly both have the M2 Ultra and same overall performance.
https://x.com/markgurman/status/1896972586069942738
SI has no business in memory size nomenclature, as it is not derived from fundamental physical units. The whole klownbyte change was pushed through by hard drive marketers in the 1990s.
What does it mean to "address memory in powers of two" ? There are certainly machines with non-power-of-two memory quantities; 96 GiB is common for example.
> The whole klownbyte change was pushed through by hard drive marketers in 1990s.
The metric prefixes based on powers of 10 have been around since the 1790s.
I challenge you to show me any SKU from any memory manufacturer that has a power of 10 capacity. Or a CPU whose address space is a power of 10. This is an unavoidable artefact of using a binary address bus.
> The metric prefixes based on powers of 10 have been around since the 1790s.
And the Babylonians used powers of 60; what gives?
If Donald Knuth and Gordon Bell say we use base-2 for RAM, that’s good enough for me.
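For anyone keeping score, the gap between the two conventions is not trivial at this size:

    # 512 "gigabytes": binary (GiB) vs. decimal (GB, SI-style)
    gib = 512 * 2**30        # 549,755,813,888 bytes
    gb  = 512 * 10**9        # 512,000,000,000 bytes
    print(gib - gb)          # ~37.8 billion bytes of disagreement
    print(gib / 10**9)       # 512 GiB is ~549.8 decimal GB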
In terms of software, recent NVIDIA and AMD research has focused on fast evaluation of small ~4 layer MLPs using FP8 weights for things like denoising, upscaling, radiance caching, and texture and material BRDF compression/decompression.
NVIDIA has just put out some new graphics API extensions and samples/demos for loading a chunk of neural net weights and performing inference from within a shader.
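If it helps make that concrete, here's a toy sketch of the kind of tiny per-layer MLP those papers evaluate (NumPy, with crude int8 quantization standing in for FP8; the layer widths are made up) - illustrative only, not NVIDIA's actual implementation:

    import numpy as np

    # Toy ~4-layer MLP of the sort used for radiance caching / texture
    # decompression; int8 weights here are just a stand-in for FP8.
    rng = np.random.default_rng(0)
    dims = [16, 32, 32, 32, 3]   # made-up layer widths

    def quantize(w):
        scale = np.abs(w).max() / 127.0
        return np.round(w / scale).astype(np.int8), scale

    layers = [quantize(0.1 * rng.standard_normal((d_in, d_out)).astype(np.float32))
              for d_in, d_out in zip(dims[:-1], dims[1:])]

    def mlp(x):
        for n, (q, s) in enumerate(layers):
            x = x @ (q.astype(np.float32) * s)   # dequantize on the fly
            if n < len(layers) - 1:
                x = np.maximum(x, 0.0)           # ReLU
        return x

    print(mlp(rng.standard_normal((4, dims[0])).astype(np.float32)).shape)  # (4, 3)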
Can you elaborate on how the TOPS value is inflated? What GPU would be the equivalent of the Jetson AGX Orin?
Hard to drop that much cash on an outdated chip.
For comparison, a single consumer card like the RTX 5090 has only 32 GB of memory, but 1,792 GB/s of memory bandwidth and 3,593 TOPS of compute.
The use cases will be limited. While you can't run a 600B model directly like Apple says (because you need more memory for that), you can run a quantized version, but it will be very slow unless it's an MoE architecture.
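The capacity side of that is easy to sanity-check (parameter count from the public DeepSeek release; the rest is back-of-envelope):

    # Why the full-precision model doesn't fit but a 4-bit quant does
    params = 671e9
    print(params * 2 / 1e9)    # FP16: ~1342 GB -> way over 512 GB
    print(params * 0.5 / 1e9)  # q4:   ~336 GB -> fits, with room for KV cache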
The compute figure you're talking about on the M3 Ultra is the neural engine's, not including the GPU.
I expect the GPU here will be behind a 5090 for compute but not by the unrelated numbers you’re quoting. After all, the 5090 alone is multiple times the wattage of this SoC.
Most AI training and inference (including generative AI) is bound by large-scale matrix MACs. That's why Nvidia fills their devices with enormous numbers of tensor cores and Apple, Qualcomm, et al. are adding NPUs, filling largely the same gap. Only Nvidia's are not just a magnitude-plus more performant, they're also massively more flexible (in types and applications) and usable for training and inference, while Apple's is only useful for a limited set of inference tasks (due to architecture and type limits).
Apple can put the effort in and make something actually competitive with Nvidia, but this isn't it.
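For anyone unfamiliar, the "matrix MAC" in question is just a fused multiply-accumulate over tiles, something like this (illustrative NumPy, obviously not how the silicon does it):

    import numpy as np

    # D = A @ B + C is the primitive that tensor cores and NPUs are built around
    A = np.random.rand(64, 64).astype(np.float16)
    B = np.random.rand(64, 64).astype(np.float16)
    C = np.random.rand(64, 64).astype(np.float16)
    D = A @ B + C   # one tile of the fused multiply-accumulate
    print(D.shape, D.dtype)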
Apple won’t compete with NVIDIA, I’m not arguing that. But your opening line will only make sense if you can back up the numbers and the GPU performance is lower than the ANE TOPS.
However the M2 Ultra GPU is estimated, with every bit of compute power working together, at about 26 TOPS.
The only similar number I can find is for TFLOPS vs TOPS
Again I’m not saying the GPU will be comparable to an NVIDIA one, but that the comparison point isn’t sensible in the comments I originally replied to.
FWIW, normalizing the wattages (or even underclocking the GPU) will still give you an Nvidia advantage most days. Apple's GPU designs are closer to AMD's designs than Nvidia's, which means they omit a lot of AI accelerators to focus on a less-LLM-relevant raster performance figure.
Yes, the GPU is faster than the NPU. But Apple's GPU designs haven't traditionally put their competitors out of a job.
5090 is 575W without the CPU.
You’d have to cut the Nvidia to a quarter and then find a comparable CPU to normalize the wattage for an actual comparison.
I agree that Apple GPUs aren’t putting the dedicated GPU companies in danger on the benchmarks, but they’re also not really targeting it? They’re in completely different zones on too many fronts to really compare.
> but they’re also not really targeting it?
That's fine, but it's not an excuse to ignore the power/performance ratio.
Give me a comparable system build where the NVIDIA GPU + any CPU of your choice is running at the same wattage as an M2 Ultra, and outperforms it on average. You’d get 150W for the GPU and 150W for the CPU.
Again, you can’t really compare the two. They’re inherently different systems unless you only care about singular metrics.
If not, what is the TOPS of the GPU, and why isn't Apple talking about it if there is more performance hidden somewhere? Apple states 18 TOPS for the M3 Max. And why do you think Apple added the neural engine, if not to accelerate compute?
The power draw is quite a bit higher, but it's still much more efficient as the performance is much higher.
If you squint, yeah they look the same, but so does the microcontroller on the GPU and a full blown CPU. They’re fundamentally different purposes, architectures and scale of use.
The ANE can't even really be used directly; Apple heavily restricts its use to the CoreML APIs, and only for inference. It's only usable for smaller, lightweight models.
If you're comparing to the tensor cores, you really need to compare against the GPU, which is what gets used by Apple's ML frameworks such as MLX for training etc.
It will still be behind the NVIDIA GPU, but not by anywhere near the same numbers.
They're both built to do the most common computation in AI (both training and inference), which is multiply and accumulate of matrices - A * B + C. The ANE is far more limited because they decided to spend a lot less silicon space on it, focusing on low-power inference of quantized models. It is fantastically useful for a lot of on-device things like a lot of the photo features (e.g. subject detection, text extraction, etc).
And yes, you need to use CoreML to access it because it's so limited. In the future Apple will absolutely, with 100% certainty, make an ANE that is as flexible and powerful as tensor cores, and they force you through CoreML because it will then automatically switch to using it (right now you submit a job to CoreML and for many models it will opt to use the CPU/GPU instead, or a combination thereof; it's an elegant, forward-thinking implementation). Their AI performance and credibility will greatly improve when they do.
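To the CoreML point: in practice the closest you get to "using the ANE" looks something like this (Python via coremltools; the model path is a placeholder, and the compute-units setting is a request, not a guarantee):

    import coremltools as ct

    # You don't target the ANE directly; you hand Core ML a model and a
    # preference, and it decides per-layer where things actually run.
    model = ct.models.MLModel(
        "TinyClassifier.mlpackage",               # placeholder model file
        compute_units=ct.ComputeUnit.CPU_AND_NE,  # request CPU + Neural Engine
    )
    # out = model.predict({"image": some_input})  # input keys depend on the model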
>you really need to compare against the GPU
From a raw performance perspective, the ANE is capable of more matrix multiply/accumulates than the GPU is on Apple Silicon, it's just limited to types and contexts that make it unsuitable for training, or even for many inference tasks.
My numbers are correct: the M3 Ultra has around 1% of the TOPS performance of an RTX 5090.
Comparing against the GPU would look even worse for apple. Do you think Apple added the neural engine just for fun? This is exactly what the neural engine is there for.
Try and use the ANE in the same way you would use the tensor cores. Hint: you can’t, because the hardware and software will actively block you.
They're meant for fundamentally different use cases and power loads. Even Apple's own ML frameworks do not use the ANE for anything except inference.
That's going to be the NPU specifically. Pretty much nothing on the LLM front seems to use NPUs at this stage (Copilot Snapdragon laptops aside), so I'm not sure the low number is a problem.
It's nice that these devices have loads of memory, but they don't have remotely the necessary level of compute to be competitive in the AI space. As a fun thing to run a local LLM as a hobbyist, sure, but this presents zero threat to nvidia.
Apple hardware is irrelevant in the AI space, outside of making YouTube "I ran a quantized LLM on my 128GB Mac Mini" type content for clicks, and this release doesn't change that.
Looks like a great desktop chip though.
It would be nice if Nvidia could start giving their less expensive offerings more memory, though they're currently in the realm Intel was in 15 years ago, thinking that their biggest competition is themselves.
It will be interesting when somebody upgrades the RAM on the 5090 like they did with the 4090s.
Pretty sure they’re comparing Nvidia’s gpu to Apple’s npu.
You have a fundamental flaw in your understanding of how both chips work. Not using the tensor cores would be slower, and the same goes for Apple's neural engine. The numbers quoted are for the hardware each has implemented for maximum performance on this task.
The AMD Ryzen Threadripper PRO 3995WX was released over four years ago and supports 2TB (64c/128t).
> Take your workstation's performance to the next level with the AMD Ryzen Threadripper PRO 3995WX 2.7 GHz 64-Core sWRX8 Processor. Built using the 7nm Zen Core architecture with the sWRX8 socket, this processor is designed to deliver exceptional performance for professionals such as artists, architects, engineers, and data scientists. Featuring 64 cores and 128 threads with a 2.7 GHz base clock frequency, a 4.2 GHz boost frequency, and 256MB of L3 cache, this processor significantly reduces rendering times for 8K videos, high-resolution photos, and 3D models. The Ryzen Threadripper PRO supports up to 128 PCI Express 4.0 lanes for high-speed throughput to compatible devices. It also supports up to 2TB of eight-channel ECC DDR4 memory at 3200 MHz to help efficiently run and multitask demanding applications.
So unified memory means that the memory is accessible to the GPU and the CPU in a shared pool. AMD does not have that.
[1] https://www.amd.com/en/products/processors/laptop/ryzen/ai-3...
[1] https://www.amd.com/en/products/accelerators/instinct/mi300/...
8 channels at 3200 MT/s (1600 MHz) is only 204.8 GB/sec; less than a quarter of what the M3 Ultra can do. It's also not GPU-addressable, meaning it's not actually unified memory at all.
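For reference, that 204.8 GB/s falls straight out of the channel math:

    # Eight 64-bit DDR4-3200 channels
    channels = 8
    transfers_per_s = 3200e6   # 3200 MT/s (1600 MHz clock, double data rate)
    bytes_per_transfer = 8     # 64-bit channel width
    print(channels * transfers_per_s * bytes_per_transfer / 1e9)   # 204.8 GB/s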