PTG Fleet Model Leaderboards

Cross-judged by Claude Haiku-4.5 / GPT-4.1 (generator != judge per Session-13 lock-in). Last updated: 2026-06-19 21:59 EDT. Auto-generated. Per-row Measured column = data-collection date.

Throughput dashboard / Apple Silicon (Mac) matrix

Headline (overnight bench-off, measured 2026-05-27): a sweep of every new open-weight model mostly validated the current fleet. Qwen3.6-35B-A3B remains both the coding leader and a research peer of Claude Opus, DeepSeek-V4-Flash leads blog, and gpt-oss-20b remains the voice tool-caller. Two Mistral exceptions (Blackwell re-test, 2026-05-27): Mistral-Small-4-119B is a viable differentiated resident for agentic / cited-RAG / compliance work (0.957 coding, 92.4% zero-hallucination tool-calls, native citations and vision on one checkpoint), and Ministral-3-8B is the new air-gap edge-appliance leader (beats Granite-4.1-8B). All Mistral models need vLLM --tool-call-parser mistral to tool-call correctly.
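
A minimal sketch of the Mistral launch requirement above, assuming the usual vLLM pairing of --tool-call-parser with --enable-auto-tool-choice. The helper name vllm_serve_args and the model id are hypothetical; the sketch only builds the command line and does not start a server.

```python
import shlex


def vllm_serve_args(model_id: str, parser: str = "mistral") -> list[str]:
    """Build an argv list for `vllm serve` with tool calling enabled."""
    return [
        "vllm", "serve", model_id,
        "--enable-auto-tool-choice",    # allow the server to emit tool calls
        "--tool-call-parser", parser,   # parser named in the headline above
    ]


# Hypothetical model id, for illustration only.
args = vllm_serve_args("example-org/example-mistral-model")
print(shlex.join(args))
```

Building the argv as a list keeps the flag and its value adjacent and avoids shell-quoting mistakes if the list is later handed to subprocess.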

Coding (22 PTG ops/SEO/debug tasks, mean case score, task-appropriate temp=0.0)

Model	Score	Pass	Latency	Temp	Measured
Claude Opus-4.7 BEST	0.991	22/22	2.7s	default	2026-05-25
Qwen3.6-27B BF16 (dense)	0.990	20/22	66.4s	default	2026-06-08
gpt-oss:20b	0.989	22/22	2.5s	default	2026-05-28
Qwen3.6-35B-A3B	0.986	22/22	0.6s	default	2026-05-25
North-Mini-Code-1.0 (FP8, Cohere)	0.986	22/22	3.5s	default	2026-06-19
Gemma-4-31B	0.982	22/22	16.6s	default	2026-05-26
nex-n2-pro	0.980	22/22	18.1s	default	2026-06-10
Gemma-4-12B	0.977	22/22	15.8s	default	2026-06-03
gpt-oss-20b	0.975	22/22	11.8s	default	2026-05-28
MiniMax-M2.7 (NVFP4)	0.974	21/22	50.1s	default	2026-05-27
GLM-4.7-Flash	0.973	22/22	8.9s	default	2026-05-28
Qwen3.6-35B-A3B (BF16)	0.973	22/22	22.7s	default	2026-06-08
DeepSeek-V4-Flash	0.970	21/22	0.9s	default	2026-05-29
Gemma-4-26B-A4B (BF16)	0.970	22/22	1.0s	default	2026-06-09
Mistral-Medium-3.5-128B	0.968	22/22	4.6s	default	2026-05-26
gpt-oss-120b	0.964	21/22	8.4s	default	2026-05-28
claude-haiku-4-5-20251001	0.959	21/22	1.2s	default	2026-05-29
gpt-4.1	0.957	21/22	1.4s	default	2026-05-25
Qwen3.6-27B (dense)	0.957	21/22	34.0s	default	2026-05-26
Mistral-Small-4-119B	0.957	21/22	1.4s	default	2026-05-27
GLM-4.5-Air	0.956	17/22	105.3s	default	2026-05-28
Qwen3-Coder-Next-80B	0.955	21/22	4.8s	default	2026-05-27
Gemma-4-31B (BF16)	0.955	21/22	4.9s	default	2026-06-09
Qwen3-Coder-30B	0.952	22/22	0.8s	default	2026-06-09
openai/gpt-oss-20b	0.952	21/22	1.8s	default	2026-05-26
qwen3-coder:30b	0.952	22/22	0.8s	default	2026-05-27
Gemma-4-12B (BF16)	0.952	21/22	2.5s	default	2026-06-09
Gemma-4-e4b (BF16)	0.949	22/22	1.8s	default	2026-06-09
command-a-plus	0.939	21/22	18.1s	default	2026-05-26
Gemma-4-e2b	0.935	21/22	12.0s	default	2026-05-26
Qwen3.5-Opus-distill (27B)	0.934	21/22	13.1s	default	2026-05-26
Devstral-Small-2-24B	0.932	20/22	6.7s	default	2026-05-28
Gemma-4-e4b	0.931	20/22	13.6s	default	2026-05-26
Falcon-H1R-7B	0.926	19/22	84.2s	default	2026-05-27
Mistral-Small-24B	0.914	20/22	1.1s	default	2026-05-27
Mixtral-8x22B	0.909	21/22	1.9s	default	2026-05-26
Granite-4.1-8B	0.903	19/22	0.8s	default	2026-06-09
hf.co/tiiuae/Falcon-H1R-7B-GGUF:Q4_K_M	0.885	19/22	49.9s	default	2026-05-27
q25-gptq	0.885	19/22	2.5s	default	2026-05-28
Apriel-1.6-15B	0.880	19/22	110.6s	default	2026-05-27
Gemma-4-26B-A4B	0.877	18/22	18.4s	default	2026-05-26
Ministral-3-8B	0.872	18/22	2.9s	default	2026-05-27
hf.co/unsloth/Ministral-3-8B-Instruct-2512-GGUF:Q4_K_M	0.858	18/22	0.8s	default	2026-05-27
qwen2.5:7b-instruct	0.847	18/22	2.0s	default	2026-05-28
l31-awq	0.847	19/22	3.4s	default	2026-05-29
LFM2.5-8B-A1B (Liquid)	0.846	17/22	2.0s	default	2026-06-03
q25-awq	0.840	18/22	2.5s	default	2026-05-28
llama3.1:8b	0.820	17/22	2.7s	default	2026-05-29
l31-gptq	0.794	15/22	2.9s	default	2026-05-29
Mixtral-8x7B	0.773	17/22	0.7s	default	2026-05-26
MiniCPM5-1B	0.644	10/22	6.2s	default	2026-05-27
Heretic-9B	0.000	0/22	-	default	2026-05-27

Incumbent fleet coder Qwen3.6-35B-A3B now co-leads. North-Mini-Code-1.0 (Cohere, Apache-2.0, 30B-total/3B-active MoE) matches the very top on PTG coding: 0.980 ± 0.004 (N=5), statistically tied with Nex-N2-Pro and edging Qwen3.6-35B-FP8 (0.975). It also loop-amplifies on agentic tasks (cap=1 0.833 → cap=5 0.917, fabrication 2→1), unlike the non-amplifiers below. Its FP8 build fits one card, is fast and makes clean tool calls, so it is a genuine alternative fleet coder (Session-22). It is not a research/RAG model (0.821), and its blogs fall short of 3,000 words.

Gemma-4-12B (0.977) ties the 31B/Qwen3.6 on single-shot coding but is single-shot only: it does NOT loop-amplify (cap=1 = cap=5 = 0.917 on agentic tasks) and fabricates "done" under loop pressure. Use it as a fast coder, not an agent. LFM2.5-8B-A1B is an on-device model (edge-tier 0.846, 3.41% tool-call hallucination), not a fleet upgrade.

Claude Opus-4.7 reference = 1.00. Scores cross-judged by GPT-4.1, Haiku-4.5, or validated-equivalent DeepSeek-V4-Flash. Temperature column: stamped value where present; "default" = pre-Session-13 runs (eval_coding.py default 0.0).

Research & reasoning (27-case, none-context, task-appropriate temp=0.0-0.3)

Model	Score	Pass	Latency	Temp	Measured
Gemma-4-31B (BF16) BEST	0.963	26/27	11.7s	default	2026-06-09
Gemma-4-26B-A4B (BF16)	0.963	26/27	2.2s	default	2026-06-09
nex-n2-pro	0.963	26/27	3.2s	default	2026-06-10
Qwen3.6-35B-A3B	0.926	25/27	20.4s	default	2026-05-26
DeepSeek-V4-Flash	0.926	25/27	15.1s	default	2026-05-27
Claude Opus-4.7	0.926	25/27	9.5s	default	2026-05-25
Qwen3-Coder-Next-80B	0.926	25/27	16.0s	default	2026-05-27
Qwen3.6-27B BF16 (dense)	0.926	25/27	161.2s	default	2026-06-09
Gemma-4-12B (BF16)	0.926	25/27	5.6s	default	2026-06-09
Gemma-4-e4b (BF16)	0.926	25/27	2.7s	default	2026-06-09
Granite-4.1-8B	0.889	24/27	2.6s	default	2026-06-09
Qwen3.6-27B (dense)	0.889	24/27	91.2s	default	2026-05-26
Qwen3.6-35B-A3B (BF16)	0.889	24/27	24.2s	default	2026-06-08
Gemma-4-e2b	0.852	23/27	12.2s	default	2026-05-26
Gemma-4-31B	0.852	23/27	43.2s	default	2026-05-27
Qwen3.5-Opus-distill (27B)	0.852	23/27	270.5s	default	2026-05-27
Ministral-3-8B	0.852	23/27	11.5s	default	2026-05-27
GLM-4.5-Air	0.815	22/27	121.8s	default	2026-05-28
Qwen3-Coder-30B	0.815	22/27	1.7s	default	2026-06-09
Gemma-4-e4b	0.778	21/27	20.0s	default	2026-05-26
North-Mini-Code-1.0 (FP8, Cohere)	0.778	21/27	6.9s	default	2026-06-19
Mistral-Small-4-119B	0.741	20/27	11.0s	default	2026-05-26
Mistral-Small-24B	0.741	20/27	4.2s	default	2026-06-09
Gemma-4-26B-A4B	0.556	15/27	46.4s	default	2026-05-26
Heretic-9B	0.000	0/27	-	default	2026-05-27

Re-baselined on a single cloud judge (Haiku-4.5 or GPT-4.1) per Session-13 lock-in. Qwen3.6-35B-A3B, DeepSeek-V4-Flash and Claude Opus-4.7 tie at 0.926 — fleet reasoning is at Opus parity. The Claude-Opus reasoning-distill (Qwen3.5-Opus, 0.852) did NOT beat native dense.

Blog writing (10-criteria, combined = 0.5 structural + 0.5 judge, best-per-model temp)

Model	Score	Words	Judge	Gen speed	Temp	Measured
DeepSeek-V4-Flash BEST	1.000	3774	-	64s	0.3	2026-05-28
Qwen3.6-27B BF16 (dense)	1.000	3829	10/10	477s 28 t/s	default	2026-06-08
nex-n2-pro	1.000	4411	10/10	95s 91 t/s	default	2026-06-10
gpt-oss-120b	0.945	1938	-	132s	0.0	2026-05-28
Gemma-4-31B (BF16)	0.945	2390	10/10	169s 24 t/s	default	2026-06-09
Gemma-4-12B (BF16)	0.945	2291	10/10	78s 53 t/s	default	2026-06-09
North-Mini-Code-1.0 (FP8, Cohere)	0.945	2269	10/10	43s 139 t/s	default	2026-06-19
GLM-4.7-Flash	0.945	2640	-	59s	0.0	2026-05-28
Qwen3.6-35B-A3B (BF16)	0.944	3293	9/10	49s 171 t/s	default	2026-06-08
Qwen3.6-35B-A3B (FP8)	0.889	2597	-	38s	0.0	2026-05-28
Gemma-4-26B-A4B	0.889	2280	9/10	128s 44 t/s	default	2026-05-26
Gemma-4-31B	0.889	2063	9/10	553s 7 t/s	default	2026-05-27
Qwen3-Coder-Next-80B	0.889	3109	9/10	342s 21 t/s	default	2026-05-27
Gemma-4-26B-A4B (BF16)	0.889	2128	10/10	26s 147 t/s	default	2026-06-09
Gemma-4-e4b (BF16)	0.889	2676	10/10	40s 122 t/s	default	2026-06-09
Gemma-4-e2b	0.833	2864	9/10	80s 78 t/s	default	2026-05-26
command-a-plus	0.833	1953	9/10	103s 49 t/s	default	2026-05-26
Qwen3.5-Opus-distill (27B)	0.833	4197	9/10	240s 37 t/s	default	2026-05-26
Claude Opus-4.7	0.805	2342	-	107s	default (~0.0)	2026-05-28
Qwen3.6-35B-A3B	0.778	2977	7/10	30s 200 t/s	default	2026-05-25
gpt-oss-20b	0.778	1793	9/10	34s 243 t/s	default	2026-05-25
GLM-4-9B	0.778	898	7/10	12s 181 t/s	default	2026-05-25
Mixtral-8x22B	0.778	931	7/10	41s 60 t/s	default	2026-05-26
Granite-4.1-8B	0.778	868	9/10	13s 169 t/s	default	2026-06-09
Mistral-Small-24B	0.722	1303	8/10	228s 13 t/s	default	2026-05-25
Gemma-4-e4b	0.722	2434	7/10	105s 47 t/s	default	2026-05-26
Mistral-Small-4-119B	0.722	3262	8/10	281s 28 t/s	default	2026-05-26
Mistral-Medium-3.5-128B	0.611	1469	7/10	450s 18 t/s	default	2026-05-26
GLM-4.5-Air	0.500	4541	3/10	1360s 6 t/s	default	2026-05-25
nemotron-3-nano:30b	0.445	5618	2/10	156s 51 t/s	default	2026-05-25
Mixtral-8x7B	0.222	79	2/10	1s 137 t/s	default	2026-05-26
Qwen3-Coder-30B	0.111	0	1/10	11s 0 t/s	default	2026-06-09

Task: a 3,000+ word SEO CMMC blog from a fixed prompt. Each row reports the model's best-scoring temperature when the Session-14 Bench-2 sweep covered it; legacy rows keep their original single temperature. Autoblog runs overnight in batch, so quality rather than generation speed decides.

Blog temperature sweep (Session-14 Bench-2, judge=GPT-4.1, N=2 reruns per cell)

Model	Temp	Mean score	Range (min-max)	Mean words	Mean gen
Claude Opus-4.7	default (~0.0)	0.805	0.778-0.833	2342	107.2s
Claude Opus-4.7	default (~0.3)	0.805	0.778-0.833	2342	107.2s
Claude Opus-4.7	default (~0.7)	0.805	0.778-0.833	2342	107.2s
Claude Opus-4.7	default (~1.0)	0.805	0.778-0.833	2342	107.2s
DeepSeek-V4-Flash	0.0	0.972	0.944-1.000	4194	78.4s
DeepSeek-V4-Flash	0.3 BEST	1.000	1.000-1.000	3774	64.2s
DeepSeek-V4-Flash	0.7	1.000	1.000-1.000	3732	65.0s
DeepSeek-V4-Flash	1.0	0.972	0.944-1.000	4025	72.4s
GLM-4.7-Flash	0.0 BEST	0.945	0.944-0.945	2640	58.8s
GLM-4.7-Flash	0.3	0.805	0.722-0.889	2210	53.1s
GLM-4.7-Flash	0.7	0.861	0.833-0.889	2748	57.9s
GLM-4.7-Flash	1.0	0.889	0.889-0.889	2072	49.5s
Qwen3.6-35B-A3B (FP8)	0.0 BEST	0.889	0.889-0.889	2597	37.5s
Qwen3.6-35B-A3B (FP8)	0.3	0.806	0.667-0.945	3204	39.8s
Qwen3.6-35B-A3B (FP8)	0.7	0.584	0.445-0.722	2517	44.2s
Qwen3.6-35B-A3B (FP8)	1.0	0.611	0.611-0.611	2549	41.5s
gpt-oss-120b	0.0 BEST	0.945	0.945-0.945	1938	131.8s
gpt-oss-120b	0.3	0.889	0.889-0.889	2145	137.4s
gpt-oss-120b	0.7	0.945	0.945-0.945	2144	127.8s
gpt-oss-120b	1.0	0.917	0.889-0.945	1710	124.2s

Session-14 Bench-2 temperature sweep measured 2026-05-28 13:48 EDT. Reasoning models (DeepSeek-V4-Flash, Qwen3.6-35B-A3B) may treat temperature as a hint during their reasoning phase; a flat row across temps is itself a finding. Claude Opus-4.7 rejects the temperature parameter; its rows are copied from a single default-temperature run.
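
A sketch of how a BEST temperature could be picked from the sweep above. Two assumptions: with N=2 the published min and max are taken to be the two rerun scores, and ties on the mean go to the lower temperature (consistent with the BEST markers in the table, but not confirmed as the harness rule). SWEEP and best_temperature are illustrative names, not harness code.

```python
from statistics import mean

# (model, temperature) -> the two rerun scores, reconstructed from the
# "Range (min-max)" column under the N=2 assumption stated above.
SWEEP = {
    ("DeepSeek-V4-Flash", 0.0): [0.944, 1.000],
    ("DeepSeek-V4-Flash", 0.3): [1.000, 1.000],
    ("DeepSeek-V4-Flash", 0.7): [1.000, 1.000],
    ("DeepSeek-V4-Flash", 1.0): [0.944, 1.000],
    ("GLM-4.7-Flash", 0.0): [0.944, 0.945],
    ("GLM-4.7-Flash", 0.3): [0.722, 0.889],
    ("GLM-4.7-Flash", 0.7): [0.833, 0.889],
    ("GLM-4.7-Flash", 1.0): [0.889, 0.889],
    ("gpt-oss-120b", 0.0): [0.945, 0.945],
    ("gpt-oss-120b", 0.3): [0.889, 0.889],
    ("gpt-oss-120b", 0.7): [0.945, 0.945],
    ("gpt-oss-120b", 1.0): [0.889, 0.945],
}


def best_temperature(sweep: dict, model: str) -> float:
    """Return the temperature with the highest mean score for `model`."""
    cells = {t: mean(scores) for (m, t), scores in sweep.items() if m == model}
    # Highest mean wins; on a tie, prefer the lower temperature (assumption).
    return max(cells, key=lambda t: (cells[t], -t))


for name in ("DeepSeek-V4-Flash", "GLM-4.7-Flash", "gpt-oss-120b"):
    print(name, best_temperature(SWEEP, name))
```

With only two reruns per cell, a tie on the mean is common, so the tie-break rule matters as much as the mean itself.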

Questions answered

Which models do we keep resident and route to? Coding → Qwen3.6-35B-A3B (0.975, GPT-4.1). Research → Qwen3.6-35B (speed) / DeepSeek-V4-Flash (long-context), both Opus-parity. Blog → DeepSeek-V4-Flash / GLM-4.7-Flash. Voice tool-call → gpt-oss-20b. Edge appliance → Granite-4.1-8B (+ Gemma-4-e4b for tool-driving). Reserve Claude Opus for the top few percent.
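
The routing answer above can be sketched as a small lookup. The task keys, the long_context and escalate flags, and the route function are hypothetical names for illustration; only the model assignments come from the text.

```python
# Default resident model per task, as listed in the answer above.
ROUTES = {
    "coding": "Qwen3.6-35B-A3B",
    "research": "Qwen3.6-35B-A3B",
    "blog": "DeepSeek-V4-Flash",
    "voice_tool_call": "gpt-oss-20b",
    "edge_appliance": "Granite-4.1-8B",
}

# Reserved for the hardest few percent of requests.
ESCALATION_MODEL = "Claude Opus-4.7"


def route(task: str, long_context: bool = False, escalate: bool = False) -> str:
    """Pick a model for a task; the flags are illustrative assumptions."""
    if escalate:
        return ESCALATION_MODEL
    if task == "research" and long_context:
        # Research splits by need: speed vs long context.
        return "DeepSeek-V4-Flash"
    return ROUTES[task]


print(route("research"))                     # speed path
print(route("research", long_context=True))  # long-context path
```

Secondary choices named above (GLM-4.7-Flash for blog, Gemma-4-e4b for tool-driving on the edge) are left out of the sketch to keep it a single default per task.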

Do self-improving loops help small models? A loop is a capability amplifier, not an equalizer: Qwen3.6-35B goes 0.917→1.0 with iterations; small models (Gemma-4-e4b, GLM-4.7-Flash) stay flat at 0.917. The done-gate makes a small model honest (no silent fabrication), not capable.

Self-improving loop vs a general agent (pi.dev) on a 4B? Gemma-4-e4b scored 0.917 in a done-gated loop vs 0.0 in pi.dev (it fabricated all 12 tasks). Air-gap appliances should pair a small model with a programmatic verifier loop, never a general agent framework.
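
A minimal sketch of a done-gated loop with an iteration cap, as discussed in the two answers above. It is not the PTG harness: the stub model, the JSON task and all function names are assumptions for illustration. The point it shows is that a candidate is accepted only when a programmatic verifier passes, never on the model's own "done" claim.

```python
import json
from typing import Callable, Optional, Tuple

Generator = Callable[[Optional[str]], str]
Verifier = Callable[[str], Tuple[bool, str]]


def done_gated_loop(generate: Generator, verify: Verifier,
                    cap: int = 5) -> Tuple[Optional[str], int]:
    """Retry up to `cap` times; return (accepted candidate or None, attempts)."""
    feedback: Optional[str] = None
    for attempt in range(1, cap + 1):
        candidate = generate(feedback)          # model sees the last failure
        passed, feedback = verify(candidate)    # the gate: code, not the model
        if passed:
            return candidate, attempt
    return None, cap                            # honest failure, no fabrication


def verify_config(candidate: str) -> Tuple[bool, str]:
    """Toy verifier: the output must be JSON with an integer 'port' field."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(data, dict) or not isinstance(data.get("port"), int):
        return False, "missing integer field 'port'"
    return True, "ok"


def make_stub_model() -> Generator:
    """Stand-in for a model: claims 'done' first, fixes it on the retry."""
    replies = iter(['{"status": "done"}', '{"status": "done", "port": 8443}'])

    def generate(feedback: Optional[str]) -> str:
        return next(replies)

    return generate


print(done_gated_loop(make_stub_model(), verify_config, cap=1))
print(done_gated_loop(make_stub_model(), verify_config, cap=5))
```

With cap=1 the unverified "done" is rejected and the loop reports failure. With cap=5 the second attempt passes the gate. Whether more iterations actually help depends on the model, which is the amplifier-not-equalizer finding above.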

Are there other open models worth adding? A live Feb–May 2026 scan found none that beat the incumbents; Qwen3.7/Qwen4, DeepSeek-R2, Phi-5 and Grok-3 weights are unreleased or hosted-only.

What is the best model for a single RTX PRO 6000 96GB (Blackwell) card, and is a 35B a waste of it? No displacer. Qwen3.6-35B-A3B (3B active) is the best all-rounder that fits one card: it wins research and cited-RAG outright and leads coding (0.975, GPT-4.1). The models large enough to “fill” the card (gpt-oss-120b 0.964, Mistral-Small-4-119B 0.957) are slower and weaker on the role’s core axes. A low-active MoE is the correct shape for a 96GB concurrency server: comparable NVFP4 models scale to ~2,000 t/s aggregate at c=32 on this card. Spare VRAM is best spent on KV/concurrency, or on NVFP4 (same quality at half the VRAM, freeing room to co-locate a second model), not on a bigger-but-worse model. Mistral-Small-4-119B is the lone alternative, and only if the card is redefined as a cited-RAG / vision / compliance resident.
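
Back-of-envelope arithmetic for the VRAM and concurrency claims above. Assumptions: the "half the VRAM" baseline is FP8 (8 bits per weight), NVFP4 is counted as 4 bits per weight with scale metadata ignored, the 35B total is read off the model name, and KV cache and activations are not modeled.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


CARD_GB = 96            # RTX PRO 6000 Blackwell, per the question above
TOTAL_PARAMS_B = 35     # total parameters in billions (from the model name)

fp8_gb = weight_gb(TOTAL_PARAMS_B, 8)      # 35.0
nvfp4_gb = weight_gb(TOTAL_PARAMS_B, 4)    # 17.5
headroom_gb = CARD_GB - nvfp4_gb           # 78.5 left for KV cache or a second model

# ~2,000 t/s aggregate at c=32 works out to this much per concurrent stream.
per_stream_tps = 2000 / 32                 # 62.5

print(fp8_gb, nvfp4_gb, headroom_gb, per_stream_tps)
```

These are weight-only estimates; real headroom is smaller once runtime overhead is counted.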

Is a purpose-built Rust inference engine (Atlas) faster than our tuned vLLM on Blackwell? (measured 2026-06-07) No. On identical GB10 (DGX-Spark-class) hardware and the same Qwen3.6-35B-A3B-NVFP4 model, our tuned vLLM (NVIDIA MTP recipe) ran 116–119 tok/s steady-state vs Atlas’s 88.9; Atlas’s advertised “130–133 tok/s” and “3.1× faster than vLLM” did not reproduce (the 3.1× is vs an untuned vLLM). Atlas serving is quality-preserving — blog 0.944 (ties our blog leader) and 6/6 on a coding spot-check — and ships an ~8×-smaller (2.98 GB) no-Python single binary. That makes it a candidate packaging vehicle for an air-gapped compliance appliance, not a throughput upgrade. Its multi-node expert-parallel mode is not yet shipping (runtime is single-node only).

Does an agentic multi-hop retriever beat single-shot RAG for compliance Q&A? (measured 2026-06-07) On hard multi-hop CMMC / NIST 800-171 questions, an RL-trained search agent (Harness-1, 21B, gpt-oss-20b base) found every gold control (retrieval recall 1.000) where single-shot dense top-8 reached only 0.881: it recovers the deep 2nd/3rd-hop controls that single-shot retrieval drops at production cutoffs. But its curated answer (0.929) only matched single-shot top-15 (the curation step, not the search, is the bottleneck) and cost ~1,000× the latency, so the value is exhaustive batch retrieval (audit / SSP gap analysis), not interactive RAG. Control-id deduplication remains the cheap universal lever: it lifts both single-shot (0.786→0.881) and the agent (0.905→1.000).
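
A toy sketch of why control-id deduplication lifts recall at a fixed cutoff. The ranked hits, chunk labels and gold set are made-up illustration data, and both function names are hypothetical. The assumed mechanism is that several chunks from the same control crowd the top-k, so deduplicating first leaves room for other controls.

```python
from typing import List, Set, Tuple

Hit = Tuple[str, str]  # (control id, chunk text)


def dedupe_by_control(ranked_hits: List[Hit]) -> List[Hit]:
    """Keep only the highest-ranked chunk per control id, preserving order."""
    seen: Set[str] = set()
    unique: List[Hit] = []
    for control_id, chunk in ranked_hits:
        if control_id not in seen:
            seen.add(control_id)
            unique.append((control_id, chunk))
    return unique


def recall_at_k(ranked_hits: List[Hit], gold: Set[str], k: int) -> float:
    """Fraction of gold control ids present in the top-k hits."""
    retrieved = {control_id for control_id, _ in ranked_hits[:k]}
    return len(retrieved & gold) / len(gold)


# Illustrative ranking: three chunks of the same control crowd the top of the list.
hits = [
    ("3.1.1", "chunk a"), ("3.1.1", "chunk b"), ("3.1.2", "chunk c"),
    ("3.1.1", "chunk d"), ("3.5.3", "chunk e"),
]
gold = {"3.1.1", "3.1.2", "3.5.3"}

print(recall_at_k(hits, gold, k=3))                     # duplicates waste slots
print(recall_at_k(dedupe_by_control(hits), gold, k=3))  # every slot is a new control
```

Deduplication costs one pass over the ranked list, which fits the "cheap universal lever" description: it applies to any retriever that returns control-tagged chunks.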

Methodology: PTG llm-benchmark harness; cross-judged by Claude Haiku-4.5 (pre-Session-13) or GPT-4.1 (Session-13 onwards). Temperatures: coding / tool-call / adversarial leaderboards filter to temp=0.0; blog uses each model's best temperature from the Session-14 Bench-2 sweep where covered. Excluded: misconfigured/errored runs and hardware-specific Mac coding runs (see Apple Silicon matrix).
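
A sketch of the generator != judge rule, assuming GPT-4.1 is tried first and Haiku-4.5 is the fallback when GPT-4.1 is the generator. The judge ids, the prefix match and pick_judge are illustrative assumptions, not the harness implementation.

```python
from typing import Sequence

# Assumed judge pool, in order of preference.
JUDGES = ("gpt-4.1", "claude-haiku-4-5")


def pick_judge(generator: str, judges: Sequence[str] = JUDGES) -> str:
    """Return the first judge that is not the generator itself."""
    gen = generator.lower()
    for judge in judges:
        # Prefix match so dated ids such as "claude-haiku-4-5-20251001" count.
        if not gen.startswith(judge):
            return judge
    raise ValueError(f"no eligible judge for generator {generator!r}")


print(pick_judge("Qwen3.6-35B-A3B"))
print(pick_judge("gpt-4.1"))
```

The rule keeps a model from grading its own output, which is why the two cloud judges in the tables are each scored by the other.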