My RTX 5090 can’t keep up with Apple Silicon on the biggest local LLMs, and I hate to admit it


I spent a long time building the gaming PC I wanted, iterating on it over the last ten years, and finally landed on a machine a younger me could only have dreamed of. It has an Nvidia RTX 5090 and an AMD Ryzen 7 9800X3D, and it handles every game I throw at it without breaking a sweat. I also use it for plenty of compute-heavy local work: machine learning, data analysis, and development.

As local LLMs have taken off, I’ve been playing around with them to see what they can do. I run them every day now, and while I expected the RTX 5090 to be a beast capable of running them at impossible speeds, I realized one thing very quickly: it’s fast, but speed isn’t everything.

Don’t get me wrong: a 30B-class model like Qwen3 is phenomenal and fits the RTX 5090’s 32GB of VRAM nicely. But there are other, more interesting models I’d like to test, and they’re much bigger than anything I can squeeze into a 32GB pool. I’ve come to realize that Apple Silicon is probably the best mainstream way to get into big local LLMs right now, because the architecture benefits this workload in a way I didn’t expect when Apple first launched its unified memory architecture in 2020.

It’s important to note that I’m not saying you need to go out and buy an Apple Silicon machine for local AI, nor am I saying it’s the only way to work with local AI. But it’s pretty funny that Apple somewhat casually settled on a memory architecture that positions it as a better alternative to the world’s best consumer GPUs for this very specific purpose. Apple has also started building more open tooling for this world with MLX, its machine learning framework for Apple Silicon. It’s no CUDA equivalent in maturity or coverage, and many local LLM tools still use Metal directly, but it shows Apple is aware that unified memory is becoming one of its biggest AI strengths.
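To make the unified memory point concrete, here’s a minimal MLX sketch (my own illustration, assuming the mlx Python package is installed; it isn’t from Apple’s docs). The same arrays are visible to both the CPU and the GPU, and you pick where each operation runs per call instead of copying tensors between separate pools:

    # Minimal sketch: unified memory in MLX. There is no "copy to device" step;
    # arrays live in one pool, and the stream argument picks where an op runs.
    import mlx.core as mx

    a = mx.random.normal((4096, 4096))
    b = mx.random.normal((4096, 4096))

    c = mx.matmul(a, b, stream=mx.gpu)   # matmul on the GPU
    total = mx.sum(c, stream=mx.cpu)     # reduction on the CPU, same array, no transfer

    mx.eval(total)                       # MLX is lazy; eval forces the computation
    print(total.item())

For an LLM runtime, the same property means the weights exist only once, and the GPU can read all of them, whether that’s 64GB or 512GB.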

32 GB is not as high a ceiling as it sounds

Memory bandwidth doesn’t matter if the model doesn’t fit


The 5090 ships with 32GB of GDDR7 on a 512-bit bus, good for about 1.79TB/s of memory bandwidth. That’s the most VRAM Nvidia has ever put on a consumer card and the fastest memory it has shipped to gamers. It’s incredibly fast on small models: quantized 7B and 13B models generate faster than I can read the output, and even a 30B model at 4-bit quantization sits in VRAM with room to spare.

But all that bandwidth only matters if the model actually fits. If the weights, KV cache, and context buffers don’t fit into 32GB, speed drops off a cliff. The model spills into system RAM, and everything suddenly bottlenecks on what DDR5 over PCIe can manage. Squeezing a quantized Llama 3.3 70B onto a 5090 is possible with care, at Q3 and with a small context window, but you’ll have to work for it.
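As a rough back-of-the-envelope check (my own sketch, not a tool from the article), you can estimate whether a model fits before downloading anything: weights take roughly the parameter count times bits per weight divided by eight, plus a KV cache that scales with context length. Using Llama 3 70B’s published shape (80 layers, 8 KV heads, head dimension 128) and about 3.5 bits per weight for a Q3-class quant:

    # Rough fit estimate; real runtimes add overhead for activations and buffers.
    def weights_gb(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
        # keys and values, hence the factor of 2
        return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    total = weights_gb(70, 3.5) + kv_cache_gb(80, 8, 128, 4096)
    print(f"~{total:.1f} GB")   # ~32 GB: right at the edge of the 5090

Even at a modest 4K context, that lands right at the card’s limit, which is exactly why it takes so much care to make it work.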

Step up to something like Qwen3-Coder-Next, which takes up around 85GB of memory in FP8, and the 5090 is no longer even in the same conversation. It’s a mixture-of-experts model with only about 3B active parameters per token, but the weights still have to live somewhere, and 85GB is never going to fit in 32GB. You can offload some of the expert layers to system RAM, which certainly helps, but it will still be slower. The reason that kind of offloading can stay usable at all is the same reason Apple’s unified memory works so well here: generation is far lighter on bandwidth when only a small slice of the model is active for each token than when every weight is touched for every token.
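A quick bit of arithmetic (my own numbers, assuming decode speed is limited purely by how many weight bytes get read per token, and a rough 90GB/s for dual-channel DDR5) shows why the MoE structure is what keeps offloading survivable:

    ddr5_gbps = 90     # assumed dual-channel DDR5 bandwidth, GB/s
    active_gb = 3      # ~3B active params at FP8, roughly 3 GB read per token
    full_gb = 85       # the whole FP8 model, what a dense pass would read

    print(f"MoE offload ceiling: ~{ddr5_gbps / active_gb:.0f} tokens/s")   # ~30
    print(f"Dense equivalent:    ~{ddr5_gbps / full_gb:.1f} tokens/s")     # ~1.1

Reading three gigabytes per token out of system RAM is workable; reading eighty-five would be a slideshow.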

Apple’s M-series chips don’t separate VRAM from system RAM. The CPU and GPU address the same single memory pool, and local LLM runtimes can use that pool without shuffling weights over PCIe. On a Mac Studio maxed out with the M3 Ultra, that means up to 512GB is directly usable by the GPU. There’s no PCIe round-trip or migration between pools, and that stays true even at the more consumer-friendly end of the range. A MacBook Pro with the M4 Max goes up to 128GB at 546GB/s in a laptop, four times the addressable memory of the 5090. A Mac mini with the M4 Pro offers 64GB, double the 5090, in a tiny machine.

You can even find M1 Max machines with 64GB of RAM on the used market for around $1,000, which can be very reasonable depending on what you’re buying it for, especially if local LLMs are a side interest rather than the main goal. And given the 5090’s MSRP is $2,000 (and street prices are a lot higher than that right now), a MacBook Pro or Mac Studio with twice the RAM can cost you less, and that’s an entire computer for the price of one GPU. More on that a little later, though.

At the very top end, the gap isn’t just lopsided, it’s absurd. The full DeepSeek R1 671B model weighs in at about 405GB after 4-bit quantization. No 5090 can make that work. Even a quad-5090 rig can’t hold it in VRAM. Apple’s 512GB M3 Ultra Mac Studio, however, fits it at Q4 and draws around 160-180W during token generation, less than half the TDP of a single 5090.

Slower than the 5090, faster than impossible

A model that works is better than a model that doesn’t work at all


The 5090’s speed advantage on models that fit is real. The M3 Ultra offers 819GB/s of memory bandwidth versus the 5090’s 1.79TB/s, and that 819GB/s is the highest of any Apple Silicon chip. On models that fit entirely in the 5090’s VRAM, especially under CUDA-optimized runtimes, you can see roughly double the M3 Ultra’s token generation speed, depending on quantization, backend, and context length. For interactive work that needs to feel snappy, the 5090 wins.

Prompt processing widens the gap further, since Apple Silicon’s prefill is significantly slower than CUDA at long context. Apple’s M5 generation improves on that, but time to first token on a 30,000-token prompt still feels noticeably worse on a Mac, even if generation speed afterward is fine. In other words, short prompts with long outputs feel great, but dumping an entire codebase into context is noticeably slower.

However, the comparison flips in Apple’s favor the moment a model doesn’t fit in 32GB. R1 is a mixture-of-experts model, with only about 37B parameters active per token, which is how an 819GB/s machine can serve a 671B model at usable speeds. The bandwidth pressure looks closer to that of a 37B dense model than a 671B one; a truly dense model of this size would crawl. With that caveat, DeepSeek R1 on the M3 Ultra runs at roughly 15-20 tokens per second. That’s slower than most people would like for a reasoning model that burns plenty of tokens on thinking, but the model runs and is usable. Considering the 5090 can’t run this model at all, that’s a pretty good trade-off.
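The same bandwidth arithmetic as before (again my own sketch, treating decode as purely bandwidth-bound) shows why those numbers are plausible rather than magic:

    bandwidth_gbps = 819            # M3 Ultra memory bandwidth
    active_params = 37e9            # R1's active parameters per token
    bytes_per_param = 4.5 / 8       # rough average for a Q4-class quant

    ceiling = bandwidth_gbps * 1e9 / (active_params * bytes_per_param)
    print(f"~{ceiling:.0f} tokens/s theoretical ceiling")   # ~39 tokens/s

Real-world 15-20 tokens per second sits comfortably under that ceiling, which is what you’d expect once KV cache reads, attention, and runtime overhead pile on top of the raw weight traffic.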

For small and medium models, the 5090 is faster, and it’s what I reach for. For anything genuinely big, the Mac is usually the only one of my two machines that can run it at all. The question stops being which one is faster and becomes which one can actually do the thing I’m trying to do.

The price becomes less ridiculous the longer you look at it

Cheaper than some of the most advanced clusters

The 512GB Mac Studio isn’t cheap. Before you even add a keyboard, that configuration runs about $9,500, three or four times what a decent gaming PC would cost you. Honestly, at that point a big chunk of what you’re paying for is the RAM.

It’s worth looking at the middle ground, though. A pair of 5090s gets you to 64GB. A pair of used 3090s gets you to 48GB for less. A single RTX Pro 6000 Blackwell reaches 96GB on one card. Any of these comfortably clears the 30B to 70B class and, depending on quantization and context, can stretch to the 100B-ish range, and at that level they genuinely compete with a mid-range Mac. But PCIe hops between cards add latency that hurts long-context generation, and multi-GPU orchestration is its own software project to keep running. And a quad-5090 rig tops out at 128GB, at several times the cost of the entire Mac Studio, and 128GB is still not 405GB. Unified memory wins on price per gigabyte at the top end, not in the middle.
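To put the price-per-gigabyte point in numbers, here’s a quick comparison using only the figures already mentioned (a rough sketch; street prices move constantly):

    # Price per GB of model-addressable memory, using the prices quoted above.
    options = {
        "RTX 5090 (32 GB, $2,000 MSRP)":         2000 / 32,
        "Used M1 Max Mac (64 GB, ~$1,000)":      1000 / 64,
        "Mac Studio M3 Ultra (512 GB, ~$9,500)": 9500 / 512,
    }
    for name, per_gb in options.items():
        print(f"{name}: ~${per_gb:.0f}/GB")
    # ~$63/GB, ~$16/GB, and ~$19/GB respectively

At the capacity extremes, unified memory is cheap per gigabyte; in the 48-96GB middle, the multi-GPU options stay competitive.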

For the 400GB-plus class, Nvidia’s answer isn’t a stack of ordinary consumer cards but a multi-accelerator server with enough A100/H100/H200-class memory to keep the model resident, plus the power, cooling, chassis, and interconnect complexity that comes with it. Pricing for that kind of setup starts in the high five figures and climbs confidently into six. The Mac, for all its eye-watering RAM upgrade pricing, is the cheap option at this level.

At the more reasonable end, the comparison gets sharper. A MacBook Pro with the M4 Max, 128GB of memory, and a terabyte of storage costs about the same as a well-specced gaming PC built around a 5090. The PC takes the speed crown for games and small models. The MacBook Pro runs everything in the 30B to 100B range, which covers most of the interesting models worth running locally.

No need to go out and buy a Mac

This is a niche hobby


None of this is an argument for retiring the gaming PC, and none of it means you should go out and buy a Mac to work with local AI. Local AI is still a fairly niche hobby, but it’s interesting to see how Apple Silicon’s architecture has accidentally positioned it as a genuine alternative to the best consumer-grade Nvidia GPUs for this one purpose.

The RTX 5090 is still a great card, as are many others below it and across generations. But for the specific task of running large local LLMs, Apple landed on the right architecture almost by accident: a design built for power-efficient laptops turned out to be exactly the right shape for a workload nobody was thinking about when it was drawn up. Unified memory at this scale is something Nvidia has yet to offer consumers. GB10-based systems like Nvidia’s DGX Spark and the ThinkStation PGX, along with AMD’s Strix Halo, are early forays into high-capacity unified memory, but they sit well below Apple’s 512GB ceiling and offer less memory bandwidth than the M3 Ultra.

For most of what I bought the 5090 for, it’s still the obvious choice. My workloads aren’t just local LLMs, and for the machine learning and deep learning projects I run, CUDA is still incredibly valuable. But specifically for big local LLMs? The gap is wider than I expected. Apple Silicon handles them better than my high-end gaming PC, and I honestly can’t quite believe it.



