Apple Silicon (Mac) Workload Matrix
Which model & which job on each Mac, by unified-memory tier. MLX (mlx_lm.server 0.31.3), one model resident at a time. Coding cross-judged by Claude Haiku-4.5 (generator != judge). Bench-only. Updated 2026-05-26.
Headline: one model wins on every Apple machine that can hold it: Qwen3.6-35B-A3B 4-bit MLX (MoE). It is the daily driver on both the 32 GB M5 and the 64 GB M1 Max (~50-54 t/s decode, 97% tool-call accuracy, 0.93-0.95 coding score, in 14-21 GB). On this model the cheaper 32 GB M5 decodes within 7% of the 64 GB M1 Max (50.5 vs 54.3 t/s); the extra RAM only matters for holding a dense 70B, which isn't worth running (see below).
Dense vs MoE — the decision rule
Default to a sparse MoE for every workload, on all bandwidth-bound hardware (Apple Silicon, GB10, Strix). On the 64 GB M1 Max, head-to-head, same quant:
| Axis | Dense Llama-70B (4-bit) | MoE Qwen3.6-35B-A3B (4-bit) |
| --- | --- | --- |
| Decode | 6.54 t/s | 54.3 t/s (8.3× faster) |
| Coding (Haiku-judged) | 0.899 | 0.952 |
| Memory footprint | 41 GB | 14.2 GB |
| Tool-call accuracy | n/a (no MLX template) | 97% |
The larger dense model lost on speed, quality, and memory. Reach for a dense model only when: (1) no MoE exists at the capability you need; (2) you need low-latency short turns from a non-reasoning model (really a reasoning-vs-direct issue, better solved with a non-reasoning MoE or capped reasoning effort); or (3) you need broad single-shot world-knowledge prose (a theoretical case: the 70B still lost here). Decode on a 3B-active MoE reads only the active experts' weights for each token, so it is not bandwidth-bound, which is why a cheaper, lower-bandwidth Mac is nearly as fast; the M1 Max's ~400 GB/s edge only pays off on dense models of 30B and up.
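The bandwidth argument can be checked with back-of-envelope arithmetic. The sketch below is illustrative only and is not part of the benchmark harness: it assumes decode speed is capped by memory bandwidth divided by the bytes of weights read per token, takes the ~400 GB/s figure quoted above, and assumes roughly 0.5 bytes per parameter at 4-bit (quantization overhead ignored). The function name is hypothetical.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if each token streams its weights once."""
    return bandwidth_gb_s / gb_read_per_token


M1_MAX_BW = 400.0  # GB/s, approximate figure quoted in the text

# Dense 70B at 4-bit: the full 41 GB footprint is read for every token.
dense_ceiling = decode_ceiling_tps(M1_MAX_BW, 41.0)

# MoE with ~3B active parameters at 4-bit: assume ~0.5 bytes per parameter,
# so about 1.5 GB of weights are touched per token (an assumption).
moe_active_gb = 3.0 * 0.5
moe_ceiling = decode_ceiling_tps(M1_MAX_BW, moe_active_gb)

# Measured decode rates from the head-to-head table.
dense_measured, moe_measured = 6.54, 54.3

print(f"dense: ceiling {dense_ceiling:.1f} t/s, measured {dense_measured} t/s "
      f"({dense_measured / dense_ceiling:.0%} of ceiling)")
print(f"MoE:   ceiling {moe_ceiling:.1f} t/s, measured {moe_measured} t/s "
      f"({moe_measured / moe_ceiling:.0%} of ceiling)")
print(f"measured speedup: {moe_measured / dense_measured:.1f}x")
```

Under these assumptions the dense model runs at roughly two thirds of its bandwidth ceiling, while the MoE sits far below its own ceiling, which is consistent with the MoE being limited by something other than memory bandwidth.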
16 GB — M4 Mac mini
| Model | Quant | Size | Fits no-swap | Decode t/s | Coding | Tool-call |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct (BEST FIT) | MLX 4-bit | 5 GB | yes | 18.6 | 0.843 | 86.4% |
| Qwen2.5-14B-Instruct | MLX 4-bit | 7.7 GB | tight | 10.3 | 0.915 | n/a* |
| gpt-oss-20b | MXFP4 | 10 GB | edge | 34-39 | 0.69† | n/a* |
| Qwen3.6-35B-A3B | MLX 3-bit | 14.5 GB | OOM | — | — | — |
Ceiling: comfortable with an 8B, capable of a 14B coder. A useful single-purpose air-gapped edge box (coder + 86% scheduler), but not a comfortable Sam host: gpt-oss-20b only barely fits, and Llama-3.1-8B's 86.4% tool-call accuracy is below the 95.5% production bar. This tier is below the appliance floor for a real 30-35B MoE.
32 GB — M5 (the value sweet spot)
| Model | Quant | Size | Decode t/s | TTFT | Coding | Tool-call |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B (DAILY DRIVER) | MLX 4-bit MoE | 21 GB | 50.5 | 0.2-0.5s | 0.931 | 97.0% |
| Qwen2.5-7B-Instruct | MLX 8-bit | 8.1 GB | 15.4 | 0.67s | 0.889 | n/a* |
| Llama-3.1-8B-Instruct | MLX 4-bit | 4.8 GB | 26.3 | 0.69s | 0.829 | 86.4% |
| gpt-oss-20b | MXFP4 | 11 GB | 46.6 | 1.21s | 0.694† | n/a* |
Ceiling: ~20-24 GB, so a 30-32B model at 4-bit or a 14B at 8-bit fits; a 70B does not. The 35B MoE collapses the old 3-way split (best coder / best tool-caller / fastest) into one model that wins all three. It decodes within 7% of the 64 GB M1 Max, and the M5 wins every compute-side metric (TTFT, prefill, 3× faster per task).
64 GB — M1 Max
| Model | Quant | Size | Decode t/s | Coding | Tool-call |
| --- | --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B (DAILY DRIVER) | MLX 4-bit MoE | 14.2 GB | 54.3 | 0.952 | 97.0% |
| Llama-3.3-70B-Instruct | MLX 4-bit dense | 41 GB | 6.54 | 0.899 | n/a |
| gpt-oss-120b | MXFP4 | 59 GB | OOM | — | — |
Ceiling: a 70B-4bit (~41 GB) loads but serves a single stream only, because additional KV cache on top of the weights pushes it into OOM; the MoE beats it anyway. A 120B-class model (~60 GB) does not run on 64 GB of unified memory: it exceeds the macOS GPU wired cap and hard-OOMs on first inference.
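To see why KV cache leaves so little room beside a 41 GB dense model, here is a rough estimate. It is a sketch, not a measurement: the layer count, KV-head count, and head dimension are assumed from the publicly documented Llama 3 70B architecture (80 layers, 8 KV heads, head dimension 128), the cache is assumed to be unquantized fp16, and the ~43-48 GB GPU budget is the 64 GB figure from this document. Runtime overhead and activations are ignored.

```python
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for one stream of `tokens` context."""
    # Keys and values (factor 2) for every layer, KV head, and head dimension.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9


WEIGHTS_GB = 41.0                       # 70B-4bit footprint from the table
BUDGET_LOW_GB, BUDGET_HIGH_GB = 43.0, 48.0  # 64 GB tier GPU budget range

CONTEXT_TOKENS = 8192  # hypothetical context length per stream

for streams in (1, 2, 4):
    total = WEIGHTS_GB + streams * kv_cache_gb(CONTEXT_TOKENS)
    if total > BUDGET_HIGH_GB:
        status = "over the budget"
    elif total > BUDGET_LOW_GB:
        status = "inside the budget range, no headroom"
    else:
        status = "fits"
    print(f"{streams} stream(s): {total:.1f} GB -> {status}")
```

Under these assumptions a single 8K-token stream already lands inside the budget range, and four concurrent streams exceed its upper end.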
The ceiling rule (macOS GPU wired cap ≈ 67-75% of RAM)
| Unified RAM | GPU budget | Largest comfortable model | Hard ceiling |
| --- | --- | --- | --- |
| 16 GB (M4) | ~10-11 GB | 8B-4bit (14B tight) | ~10 GB / no 20B+ headroom on a shared box |
| 32 GB (M5) | ~20-24 GB | 30-35B-A3B-4bit MoE | no 70B |
| 64 GB (M1 Max) | ~43-48 GB | 35B MoE (or 70B-4bit single-stream) | no 120B |
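The ceiling rule can be expressed as a quick weights-only check. This is a minimal sketch, assuming the wired cap is 67-75% of unified RAM as stated in the heading above; the function names and the three-way verdict are illustrative, not part of the harness. Because it ignores KV cache and other processes on the machine, it is optimistic for models near the cap.

```python
def gpu_budget_gb(ram_gb: float, low: float = 0.67, high: float = 0.75):
    """Return the (low, high) GPU budget range for a given unified RAM size."""
    return ram_gb * low, ram_gb * high


def verdict(model_gb: float, ram_gb: float) -> str:
    """Weights-only check against the wired cap (ignores KV cache)."""
    low, high = gpu_budget_gb(ram_gb)
    if model_gb <= low:
        return "fits"
    if model_gb <= high:
        return "edge"
    return "over cap"


# Model sizes taken from the per-tier tables in this document.
cases = [
    ("Qwen3.6-35B-A3B 3-bit", 14.5, 16),
    ("Qwen3.6-35B-A3B 4-bit", 21.0, 32),
    ("Llama-3.3-70B 4-bit", 41.0, 64),
    ("gpt-oss-120b MXFP4", 59.0, 64),
]
for name, size_gb, ram_gb in cases:
    print(f"{name} on {ram_gb} GB: {verdict(size_gb, ram_gb)}")
```

The two "over cap" results match the OOM rows in the 16 GB and 64 GB tables.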
A concurrent, multi-tenant LLM server is still GB10 / H200 / RTX-PRO-6000 territory — not a Mac. But a Mac (32 GB+) is a legitimate single-stream appliance for the production 35B-A3B MoE.
* mlx_lm.server 0.31.3 silently drops native tool_calls for Qwen2.5 / gpt-oss, leaving a ~27% floor that comes only from correct declines, so measure those models via Ollama instead. Qwen3.6's 97% is real (verified structured calls); Llama-3.1-8B's 86.4% is recovered from its <|python_tag|> text and is quant-stable across machines.
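The difference between a structured call and one recovered from text can be sketched as follows. This is an illustration only and not the harness's actual scoring code: it assumes an OpenAI-style assistant message dict (with `tool_calls[].function.name` and a JSON-string `arguments`), and it assumes the Llama text form is `<|python_tag|>` followed by a JSON object with `name` and `parameters` keys. The function name and the four labels are hypothetical.

```python
import json
from typing import Optional, Tuple

PYTHON_TAG = "<|python_tag|>"


def classify_tool_call(message: dict) -> Tuple[str, Optional[dict]]:
    """Classify one assistant message as structured / recovered / unparsed / none."""
    calls = message.get("tool_calls") or []
    if calls:
        # Native structured call: arguments arrive as a JSON string.
        fn = calls[0]["function"]
        return "structured", {"name": fn["name"],
                              "arguments": json.loads(fn["arguments"])}

    content = message.get("content") or ""
    if PYTHON_TAG in content:
        # Text-form call: try to recover the JSON payload after the tag.
        payload = content.split(PYTHON_TAG, 1)[1].strip()
        try:
            obj = json.loads(payload)
        except json.JSONDecodeError:
            return "unparsed", None
        if isinstance(obj, dict) and "name" in obj:
            return "recovered", {"name": obj["name"],
                                 "arguments": obj.get("parameters", {})}
        return "unparsed", None

    # No call at all: correct only when the prompt required a decline.
    return "none", None


structured = {"tool_calls": [{"function": {"name": "get_time",
                                           "arguments": "{\"tz\": \"UTC\"}"}}]}
text_form = {"content": "<|python_tag|>{\"name\": \"get_time\", "
                        "\"parameters\": {\"tz\": \"UTC\"}}"}
print(classify_tool_call(structured)[0])
print(classify_tool_call(text_form)[0])
print(classify_tool_call({"content": "I cannot help with that."})[0])
```

A server that drops native `tool_calls` and emits no recoverable text form would make every response fall into the last branch, which is how a declines-only floor can arise.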
† gpt-oss-20b coding is depressed by reasoning-channel leak (a parsing artifact), not capability.
Bench-only: nothing here changes production (Sam tool-caller stays gpt-oss-20b). Source: PTG llm-benchmark harness, mac_fleet_workload_matrix_2026-05-26.md.