NVIDIA has been focused on the high end, pushing ever more expensive cards at enterprises, datacenters and professionals. So what is a local LLM enthusiast supposed to do for their own home lab? Scour the second-hand market for used GPUs that often sell above MSRP, or buy new from scalped, unreliable stock of equally expensive cards.
The Mac Mini M4, released last month with 16GB of RAM in the base configuration, has quietly become the most disruptive piece of hardware for local LLM inference, effectively killing the low-end NVIDIA hardware that has always been power hungry and starved for VRAM. While everyone has been obsessing over expensive RTX 4090s and professional cards, Apple spotted an opening, doubled the entry-level memory, and dropped a compact powerhouse that will reshape how we think about AI workloads. That alone is remarkable for Apple, where most RAM upgrades cost as much as a whole PC. The entry-level Mac Mini has also effectively made mini PCs below 1000 Euros (outside of the cheap N100 boxes) obsolete. Most of those Windows machines come with a host of problems: loud fans, bad design, malware-infested drivers, horrible build quality. But then again, when they work, everything is good.
I have been using an M1 Mac for quite a while for local LLM inference, and the unified memory has let me load models that would be unthinkable on my NVIDIA GPU. Cost for cost, upgrading on the NVIDIA side rarely makes sense. The problem with most consumer NVIDIA cards has always been the same: not enough VRAM. Want to run a decent model larger than 7B? Good luck with that 8GB card. You'll be spilling into system memory constantly, making inference painfully slow. Meanwhile, the Mac Mini M4 treats its 16GB of unified memory as one giant pool that both the CPU and GPU can access. No VRAM bottlenecks, no memory-copying overhead, just smooth inference.
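To make that concrete, here is a minimal sketch of what "everything on the GPU" looks like in practice, assuming llama-cpp-python installed with Metal support and a quantized GGUF model already on disk; the model path is a placeholder, not a specific recommendation.

```python
# Minimal sketch: running a quantized 7B model on the GPU via Metal,
# assuming llama-cpp-python was built with Metal enabled
# (e.g. CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload every layer; unified memory means no separate VRAM pool to fit into
    n_ctx=4096,       # context window
)

out = llm("Explain unified memory in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```

On an 8GB NVIDIA card the same model would force you to split layers between GPU and system RAM; here the whole thing just sits in the shared pool.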
But here's the thing that really makes the Mac Mini special: it's not just an inference machine, it's a full PC. For the same price as a decent graphics card, you're getting an entire system that can handle your daily computing, run development environments, serve as a media center, or do basically anything else you throw at it. Try doing that with a bare RTX 4060 Ti 16GB, which costs about the same as a base Mac Mini, still needs a whole PC built around it, and draws far more power at idle and even more at load. Considering the lifespan of most Macs is at least five years, my NVIDIA GPU will be obsolete for many of these tasks long before the Mini is.
The real sweet spot, though, is when you start thinking about bigger models. Network a bunch of M4 Mac Minis together over Thunderbolt and you've got yourself an inference cluster that's both powerful and cost effective. NVIDIA, meanwhile, has killed NVLink on consumer cards and reserves that kind of clustering for its highest-end enterprise hardware. Each Mac Mini pulls maybe 20-30 watts under load, compared to a full gaming PC with a decent GPU that easily hits 200+ watts. You can leave these Minis running 24/7 without worrying about your electricity bill, especially here in the EU. For anyone doing serious inference work, especially if you're running models continuously, the operational costs add up fast; a rough comparison is sketched below.
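A quick back-of-the-envelope calculation using the draw figures above and an assumed EU electricity price of 0.30 EUR/kWh (your tariff will differ) shows how wide the gap gets over a year of 24/7 operation:

```python
# Yearly 24/7 electricity cost at a constant power draw.
HOURS_PER_YEAR = 24 * 365
PRICE_EUR_PER_KWH = 0.30  # assumed EU average; adjust for your tariff

def yearly_cost(watts: float) -> float:
    """Yearly electricity cost in EUR for a constant draw of `watts`."""
    kwh = watts / 1000 * HOURS_PER_YEAR
    return kwh * PRICE_EUR_PER_KWH

print(f"Mac Mini  @  30 W: {yearly_cost(30):6.0f} EUR/year")   # ~79 EUR
print(f"Gaming PC @ 200 W: {yearly_cost(200):6.0f} EUR/year")  # ~526 EUR
```

That is a difference of roughly 450 Euros per year per machine, before you even factor in the GPU's idle draw.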
And then there's MLX, Apple's machine learning framework designed specifically for their silicon. It's built from the ground up to take advantage of the unified memory architecture and the integrated GPU. Models running on MLX are remarkably efficient on Apple hardware, often outperforming what you'd expect from the raw specs.
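Getting started is only a few lines. This is a minimal sketch assuming the mlx-lm package (`pip install mlx-lm`); the model repo is just an example 4-bit conversion from the mlx-community organization on Hugging Face, and any MLX-format model works the same way.

```python
# Minimal sketch of LLM inference with MLX via the mlx-lm package.
from mlx_lm import load, generate

# Example 4-bit community conversion; swap in any MLX-format model.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Why does unified memory help local LLM inference?",
    max_tokens=200,
    verbose=True,  # prints generation stats so you can see the throughput yourself
)
```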
As things stand at this point in 2024, there is simply no competition for the Mac Mini for local inference.