Apple Silicon (Mac) Workload Matrix

Which model & which job on each Mac, by unified-memory tier. MLX (mlx_lm.server 0.31.3), one model resident at a time. Coding cross-judged by Claude Haiku-4.5 (generator != judge). Bench-only. Updated 2026-05-26.

← Throughput dashboard Model leaderboards

Headline: one model wins the whole Apple fleet that can hold it — Qwen3.6-35B-A3B 4-bit MLX (MoE): the daily driver on both the 32 GB M5 and the 64 GB M1 Max (~50-54 t/s, 97% tool-call, 0.93-0.95 coding, in 14-21 GB). The cheaper 32 GB M5 performs within 7% of the 64 GB M1 Max on it — the extra RAM only matters to hold a dense 70B, which isn't worth running (see below).

Dense vs MoE — the decision rule

Default to a sparse MoE for every workload, on all bandwidth-bound hardware (Apple Silicon, GB10, Strix). On the 64 GB M1 Max, head-to-head, same quant:

Axis	Dense Llama-70B (4-bit)	MoE Qwen3.6-35B-A3B (4-bit)
Decode	6.54 t/s	54.3 t/s (8.3× faster)
Coding (Haiku-judged)	0.899	0.952
Memory footprint	41 GB	14.2 GB
Tool-call accuracy	n/a (no MLX template)	97%

Bigger-dense lost on speed, quality, and memory. Reach for a dense model only when: (1) no MoE exists at the capability you need; (2) you need low-latency short turns from a non-reasoning model (a reasoning-vs-direct issue, better solved with a non-reasoning MoE or capped reasoning effort); or (3) broad single-shot world-knowledge prose (theoretical — the 70B still lost here). Decode on a 3B-active MoE is not bandwidth-bound, which is why a cheaper, lower-bandwidth Mac is nearly as fast; the M1 Max's ~400 GB/s edge only cashes in on dense ≥30B.

16 GB — M4 Mac mini

Model	Quant	Size	Fits no-swap	Decode t/s	Coding	Tool-call
Llama-3.1-8B-Instruct BEST FIT	MLX 4-bit	5 GB	yes	18.6	0.843	86.4%
Qwen2.5-14B-Instruct	MLX 4-bit	7.7 GB	tight	10.3	0.915	n/a*
gpt-oss-20b	MXFP4	10 GB	edge	34-39	0.69†	n/a*
Qwen3.6-35B-A3B	MLX 3-bit	14.5 GB	OOM	—	—	—

Ceiling: 8B-comfortable / 14B-capable coder. A useful single-purpose air-gapped edge box (coder + 86% scheduler), but not a comfortable Sam host — gpt-oss-20b only barely fits and 86.4% < the 95.5% production tool-call bar. Below the appliance floor for a real 30-35B MoE.

32 GB — M5 (the value sweet spot)

Model	Quant	Size	Decode t/s	TTFT	Coding	Tool-call
Qwen3.6-35B-A3B DAILY DRIVER	MLX 4-bit MoE	21 GB	50.5	0.2-0.5s	0.931	97.0%
Qwen2.5-7B-Instruct	MLX 8-bit	8.1 GB	15.4	0.67s	0.889	n/a*
Llama-3.1-8B-Instruct	MLX 4-bit	4.8 GB	26.3	0.69s	0.829	86.4%
gpt-oss-20b	MXFP4	11 GB	46.6	1.21s	0.694†	n/a*

Ceiling ~20-24 GB → a 30-32B-4bit or 14B-8bit fits; 70B does not. The 35B MoE collapses the old 3-way split (best coder / best tool-caller / fastest) into one model that wins all three — and runs within 7% of the 64 GB M1 Max while winning every compute-side metric (TTFT, prefill, 3× faster per task).

64 GB — M1 Max

Model	Quant	Size	Decode t/s	Coding	Tool-call
Qwen3.6-35B-A3B DAILY DRIVER	MLX 4-bit MoE	14.2 GB	54.3	0.952	97.0%
Llama-3.3-70B-Instruct	MLX 4-bit dense	41 GB	6.54	0.899	n/a
gpt-oss-120b	MXFP4	59 GB	OOM	—	—

Ceiling: a 70B-4bit (~41 GB) loads and serves single-stream only (KV cache on top OOMs it) — and the MoE beats it anyway. A 120B-class model (~60 GB) does not run on 64 GB unified — it exceeds the macOS GPU wired cap and hard-OOMs on first inference.

The ceiling rule (macOS GPU wired cap ≈ 67-75% of RAM)

Unified RAM	GPU budget	Largest comfortable model	Hard ceiling
16 GB (M4)	~10-11 GB	8B-4bit (14B tight)	~10 GB / no 20B+ headroom on a shared box
32 GB (M5)	~20-24 GB	30-35B-A3B-4bit MoE	no 70B
64 GB (M1 Max)	~43-48 GB	35B MoE (or 70B-4bit single-stream)	no 120B

A concurrent, multi-tenant LLM server is still GB10 / H200 / RTX-PRO-6000 territory — not a Mac. But a Mac (32 GB+) is a legitimate single-stream appliance for the production 35B-A3B MoE.

* mlx_lm.server 0.31.3 silently drops native tool_calls for Qwen2.5 / gpt-oss (a ~27% correct-declines-only floor) — measure those via Ollama. Qwen3.6's 97% is real (verified structured calls); Llama-3.1-8B's 86.4% is recovered from its <|python_tag|> text and is quant-stable across machines. † gpt-oss-20b coding is depressed by reasoning-channel leak (a parsing artifact), not capability.
Bench-only: nothing here changes production (Sam tool-caller stays gpt-oss-20b). Source: PTG llm-benchmark harness, mac_fleet_workload_matrix_2026-05-26.md.