Apple Silicon (Mac) Workload Matrix

Which model & which job on each Mac, by unified-memory tier. MLX (mlx_lm.server 0.31.3), one model resident at a time. Coding cross-judged by Claude Haiku-4.5 (generator != judge). Bench-only. Updated 2026-05-26.
Headline: one model wins the whole Apple fleet that can hold it — Qwen3.6-35B-A3B 4-bit MLX (MoE): the daily driver on both the 32 GB M5 and the 64 GB M1 Max (~50-54 t/s, 97% tool-call, 0.93-0.95 coding, in 14-21 GB). The cheaper 32 GB M5 performs within 7% of the 64 GB M1 Max on it — the extra RAM only matters to hold a dense 70B, which isn't worth running (see below).

Dense vs MoE — the decision rule

Default to a sparse MoE for every workload, on all bandwidth-bound hardware (Apple Silicon, GB10, Strix). On the 64 GB M1 Max, head-to-head, same quant:
AxisDense Llama-70B (4-bit)MoE Qwen3.6-35B-A3B (4-bit)
Decode6.54 t/s54.3 t/s  (8.3× faster)
Coding (Haiku-judged)0.8990.952
Memory footprint41 GB14.2 GB
Tool-call accuracyn/a (no MLX template)97%
Bigger-dense lost on speed, quality, and memory. Reach for a dense model only when: (1) no MoE exists at the capability you need; (2) you need low-latency short turns from a non-reasoning model (a reasoning-vs-direct issue, better solved with a non-reasoning MoE or capped reasoning effort); or (3) broad single-shot world-knowledge prose (theoretical — the 70B still lost here). Decode on a 3B-active MoE is not bandwidth-bound, which is why a cheaper, lower-bandwidth Mac is nearly as fast; the M1 Max's ~400 GB/s edge only cashes in on dense ≥30B.

16 GB — M4 Mac mini

ModelQuantSizeFits no-swapDecode t/sCodingTool-call
Llama-3.1-8B-Instruct BEST FITMLX 4-bit5 GByes18.60.84386.4%
Qwen2.5-14B-InstructMLX 4-bit7.7 GBtight10.30.915n/a*
gpt-oss-20bMXFP410 GBedge34-390.69†n/a*
Qwen3.6-35B-A3BMLX 3-bit14.5 GBOOM
Ceiling: 8B-comfortable / 14B-capable coder. A useful single-purpose air-gapped edge box (coder + 86% scheduler), but not a comfortable Sam host — gpt-oss-20b only barely fits and 86.4% < the 95.5% production tool-call bar. Below the appliance floor for a real 30-35B MoE.

32 GB — M5  (the value sweet spot)

ModelQuantSizeDecode t/sTTFTCodingTool-call
Qwen3.6-35B-A3B DAILY DRIVERMLX 4-bit MoE21 GB50.50.2-0.5s0.93197.0%
Qwen2.5-7B-InstructMLX 8-bit8.1 GB15.40.67s0.889n/a*
Llama-3.1-8B-InstructMLX 4-bit4.8 GB26.30.69s0.82986.4%
gpt-oss-20bMXFP411 GB46.61.21s0.694†n/a*
Ceiling ~20-24 GB → a 30-32B-4bit or 14B-8bit fits; 70B does not. The 35B MoE collapses the old 3-way split (best coder / best tool-caller / fastest) into one model that wins all three — and runs within 7% of the 64 GB M1 Max while winning every compute-side metric (TTFT, prefill, 3× faster per task).

64 GB — M1 Max

ModelQuantSizeDecode t/sCodingTool-call
Qwen3.6-35B-A3B DAILY DRIVERMLX 4-bit MoE14.2 GB54.30.95297.0%
Llama-3.3-70B-InstructMLX 4-bit dense41 GB6.540.899n/a
gpt-oss-120bMXFP459 GBOOM
Ceiling: a 70B-4bit (~41 GB) loads and serves single-stream only (KV cache on top OOMs it) — and the MoE beats it anyway. A 120B-class model (~60 GB) does not run on 64 GB unified — it exceeds the macOS GPU wired cap and hard-OOMs on first inference.

The ceiling rule (macOS GPU wired cap ≈ 67-75% of RAM)

Unified RAMGPU budgetLargest comfortable modelHard ceiling
16 GB (M4)~10-11 GB8B-4bit (14B tight)~10 GB / no 20B+ headroom on a shared box
32 GB (M5)~20-24 GB30-35B-A3B-4bit MoEno 70B
64 GB (M1 Max)~43-48 GB35B MoE (or 70B-4bit single-stream)no 120B
A concurrent, multi-tenant LLM server is still GB10 / H200 / RTX-PRO-6000 territory — not a Mac. But a Mac (32 GB+) is a legitimate single-stream appliance for the production 35B-A3B MoE.
* mlx_lm.server 0.31.3 silently drops native tool_calls for Qwen2.5 / gpt-oss (a ~27% correct-declines-only floor) — measure those via Ollama. Qwen3.6's 97% is real (verified structured calls); Llama-3.1-8B's 86.4% is recovered from its <|python_tag|> text and is quant-stable across machines.   gpt-oss-20b coding is depressed by reasoning-channel leak (a parsing artifact), not capability.
Bench-only: nothing here changes production (Sam tool-caller stays gpt-oss-20b). Source: PTG llm-benchmark harness, mac_fleet_workload_matrix_2026-05-26.md.