PTG Fleet Model Leaderboards
Cross-judged by Claude Haiku-4.5 / GPT-4.1 (generator != judge per Session-13 lock-in).
Last updated: 2026-06-19 21:59 EDT. Auto-generated. Each row's Measured column gives its data-collection date.
Headline (overnight bench-off, measured 2026-05-27): a sweep of every new open-weight model
mostly validated the current fleet: Qwen3.6-35B-A3B remains both the coding leader
and a research peer of Claude Opus, DeepSeek-V4-Flash leads blog, and gpt-oss-20b remains the
voice tool-caller. Two Mistral exceptions (Blackwell re-test, 2026-05-27): Mistral-Small-4-119B
is a viable differentiated resident for agentic, cited-RAG and compliance work (0.957 coding, 92.4% zero-hallucination tool calls,
native citations and vision in one checkpoint), and Ministral-3-8B is the new
air-gap edge-appliance leader (beats Granite-4.1-8B). All Mistral models need the vLLM flag
--tool-call-parser mistral to tool-call correctly.
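As a minimal sketch of the serving flag mentioned above, the following builds a `vllm serve` command line with the Mistral tool-call parser. The model ID is a placeholder, the helper name is hypothetical, and pairing the parser with `--enable-auto-tool-choice` and `--tokenizer-mode mistral` is an assumption drawn from vLLM's general tool-calling setup, not something this report specifies.

```python
import shlex


def vllm_mistral_args(model_id: str, port: int = 8000) -> list[str]:
    """Build argv for serving a Mistral-family model with tool calling enabled."""
    return [
        "vllm", "serve", model_id,
        "--port", str(port),
        "--enable-auto-tool-choice",         # assumption: needed for automatic tool calls
        "--tool-call-parser", "mistral",     # the flag named in the headline
        "--tokenizer-mode", "mistral",       # assumption: Mistral-native tokenizer
    ]


# "mistralai/<model-id>" is a placeholder, not a real checkpoint name.
args = vllm_mistral_args("mistralai/<model-id>")
print(shlex.join(args))
```

The list form can be passed directly to `subprocess.run` without a shell, which avoids quoting problems in model paths.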
Coding (22 PTG ops/SEO/debug tasks, mean case score, task-appropriate temp=0.0)
| Model | Score | Pass | Latency | Temp | Measured |
|---|---|---|---|---|---|
| Claude Opus-4.7 BEST | 0.991 | 22/22 | 2.7s | default | 2026-05-25 |
| Qwen3.6-27B BF16 (dense) | 0.990 | 20/22 | 66.4s | default | 2026-06-08 |
| gpt-oss:20b | 0.989 | 22/22 | 2.5s | default | 2026-05-28 |
| Qwen3.6-35B-A3B | 0.986 | 22/22 | 0.6s | default | 2026-05-25 |
| North-Mini-Code-1.0 (FP8, Cohere) | 0.986 | 22/22 | 3.5s | default | 2026-06-19 |
| Gemma-4-31B | 0.982 | 22/22 | 16.6s | default | 2026-05-26 |
| nex-n2-pro | 0.980 | 22/22 | 18.1s | default | 2026-06-10 |
| Gemma-4-12B | 0.977 | 22/22 | 15.8s | default | 2026-06-03 |
| gpt-oss-20b | 0.975 | 22/22 | 11.8s | default | 2026-05-28 |
| MiniMax-M2.7 (NVFP4) | 0.974 | 21/22 | 50.1s | default | 2026-05-27 |
| GLM-4.7-Flash | 0.973 | 22/22 | 8.9s | default | 2026-05-28 |
| Qwen3.6-35B-A3B (BF16) | 0.973 | 22/22 | 22.7s | default | 2026-06-08 |
| DeepSeek-V4-Flash | 0.970 | 21/22 | 0.9s | default | 2026-05-29 |
| Gemma-4-26B-A4B (BF16) | 0.970 | 22/22 | 1.0s | default | 2026-06-09 |
| Mistral-Medium-3.5-128B | 0.968 | 22/22 | 4.6s | default | 2026-05-26 |
| gpt-oss-120b | 0.964 | 21/22 | 8.4s | default | 2026-05-28 |
| claude-haiku-4-5-20251001 | 0.959 | 21/22 | 1.2s | default | 2026-05-29 |
| gpt-4.1 | 0.957 | 21/22 | 1.4s | default | 2026-05-25 |
| Qwen3.6-27B (dense) | 0.957 | 21/22 | 34.0s | default | 2026-05-26 |
| Mistral-Small-4-119B | 0.957 | 21/22 | 1.4s | default | 2026-05-27 |
| GLM-4.5-Air | 0.956 | 17/22 | 105.3s | default | 2026-05-28 |
| Qwen3-Coder-Next-80B | 0.955 | 21/22 | 4.8s | default | 2026-05-27 |
| Gemma-4-31B (BF16) | 0.955 | 21/22 | 4.9s | default | 2026-06-09 |
| Qwen3-Coder-30B | 0.952 | 22/22 | 0.8s | default | 2026-06-09 |
| openai/gpt-oss-20b | 0.952 | 21/22 | 1.8s | default | 2026-05-26 |
| qwen3-coder:30b | 0.952 | 22/22 | 0.8s | default | 2026-05-27 |
| Gemma-4-12B (BF16) | 0.952 | 21/22 | 2.5s | default | 2026-06-09 |
| Gemma-4-e4b (BF16) | 0.949 | 22/22 | 1.8s | default | 2026-06-09 |
| command-a-plus | 0.939 | 21/22 | 18.1s | default | 2026-05-26 |
| Gemma-4-e2b | 0.935 | 21/22 | 12.0s | default | 2026-05-26 |
| Qwen3.5-Opus-distill (27B) | 0.934 | 21/22 | 13.1s | default | 2026-05-26 |
| Devstral-Small-2-24B | 0.932 | 20/22 | 6.7s | default | 2026-05-28 |
| Gemma-4-e4b | 0.931 | 20/22 | 13.6s | default | 2026-05-26 |
| Falcon-H1R-7B | 0.926 | 19/22 | 84.2s | default | 2026-05-27 |
| Mistral-Small-24B | 0.914 | 20/22 | 1.1s | default | 2026-05-27 |
| Mixtral-8x22B | 0.909 | 21/22 | 1.9s | default | 2026-05-26 |
| Granite-4.1-8B | 0.903 | 19/22 | 0.8s | default | 2026-06-09 |
| hf.co/tiiuae/Falcon-H1R-7B-GGUF:Q4_K_M | 0.885 | 19/22 | 49.9s | default | 2026-05-27 |
| q25-gptq | 0.885 | 19/22 | 2.5s | default | 2026-05-28 |
| Apriel-1.6-15B | 0.880 | 19/22 | 110.6s | default | 2026-05-27 |
| Gemma-4-26B-A4B | 0.877 | 18/22 | 18.4s | default | 2026-05-26 |
| Ministral-3-8B | 0.872 | 18/22 | 2.9s | default | 2026-05-27 |
| hf.co/unsloth/Ministral-3-8B-Instruct-2512-GGUF:Q4_K_M | 0.858 | 18/22 | 0.8s | default | 2026-05-27 |
| qwen2.5:7b-instruct | 0.847 | 18/22 | 2.0s | default | 2026-05-28 |
| l31-awq | 0.847 | 19/22 | 3.4s | default | 2026-05-29 |
| LFM2.5-8B-A1B (Liquid) | 0.846 | 17/22 | 2.0s | default | 2026-06-03 |
| q25-awq | 0.840 | 18/22 | 2.5s | default | 2026-05-28 |
| llama3.1:8b | 0.820 | 17/22 | 2.7s | default | 2026-05-29 |
| l31-gptq | 0.794 | 15/22 | 2.9s | default | 2026-05-29 |
| Mixtral-8x7B | 0.773 | 17/22 | 0.7s | default | 2026-05-26 |
| MiniCPM5-1B | 0.644 | 10/22 | 6.2s | default | 2026-05-27 |
| Heretic-9B | 0.000 | 0/22 | - | default | 2026-05-27 |
Incumbent fleet coder Qwen3.6-35B-A3B now co-leads. North-Mini-Code-1.0
(Cohere, Apache-2.0, 30B-total/3B-active MoE) matches the very top on PTG coding at 0.980 ± 0.004 (N=5),
statistically tied with Nex-N2-Pro and edging Qwen3.6-35B-FP8 (0.975). It also loop-amplifies on agentic
tasks (cap=1 0.833 → cap=5 0.917, fabrication 2→1), unlike the non-amplifiers below. Its FP8 build fits one card,
runs fast and calls tools cleanly, making it a genuine alternative fleet coder (Session-22). It is not a research/RAG model (0.821), and its blog output falls short of 3,000 words.
Gemma-4-12B (0.977) ties the 31B/Qwen3.6 on single-shot coding but is single-shot only:
it does NOT loop-amplify (cap=1 and cap=5 both score 0.917 on agentic tasks) and it fabricates "done" under loop pressure.
Use it as a fast coder, not an agent. LFM2.5-8B-A1B is an on-device model (edge-tier 0.846, 3.41%
tool-call hallucination), not a fleet upgrade.
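A figure such as "0.980 ± 0.004 (N=5)" can be reproduced from repeated runs along the following lines. This is a sketch only: it assumes the ± term is the sample standard deviation (the report does not say whether it is a standard deviation or a standard error), and the run scores below are hypothetical placeholders, not the measured runs.

```python
from statistics import mean, stdev


def summarize_runs(run_scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) across repeated benchmark runs."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(run_scores), stdev(run_scores)


# Hypothetical run scores for illustration only; not the measured N=5 runs.
example_runs = [0.976, 0.980, 0.984, 0.980, 0.980]
run_mean, run_sd = summarize_runs(example_runs)
print(f"{run_mean:.3f} ± {run_sd:.3f} (N={len(example_runs)})")
```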
Claude Opus-4.7 reference = 1.00. Scores cross-judged by GPT-4.1, Haiku-4.5, or validated-equivalent
DeepSeek-V4-Flash. Temperature column: stamped value where present;
"default" = pre-Session-13 runs (eval_coding.py default 0.0).
Research & reasoning (27-case, none-context, task-appropriate temp=0.0-0.3)
| Model | Score | Pass | Latency | Temp | Measured |
|---|---|---|---|---|---|
| Gemma-4-31B (BF16) BEST | 0.963 | 26/27 | 11.7s | default | 2026-06-09 |
| Gemma-4-26B-A4B (BF16) | 0.963 | 26/27 | 2.2s | default | 2026-06-09 |
| nex-n2-pro | 0.963 | 26/27 | 3.2s | default | 2026-06-10 |
| Qwen3.6-35B-A3B | 0.926 | 25/27 | 20.4s | default | 2026-05-26 |
| DeepSeek-V4-Flash | 0.926 | 25/27 | 15.1s | default | 2026-05-27 |
| Claude Opus-4.7 | 0.926 | 25/27 | 9.5s | default | 2026-05-25 |
| Qwen3-Coder-Next-80B | 0.926 | 25/27 | 16.0s | default | 2026-05-27 |
| Qwen3.6-27B BF16 (dense) | 0.926 | 25/27 | 161.2s | default | 2026-06-09 |
| Gemma-4-12B (BF16) | 0.926 | 25/27 | 5.6s | default | 2026-06-09 |
| Gemma-4-e4b (BF16) | 0.926 | 25/27 | 2.7s | default | 2026-06-09 |
| Granite-4.1-8B | 0.889 | 24/27 | 2.6s | default | 2026-06-09 |
| Qwen3.6-27B (dense) | 0.889 | 24/27 | 91.2s | default | 2026-05-26 |
| Qwen3.6-35B-A3B (BF16) | 0.889 | 24/27 | 24.2s | default | 2026-06-08 |
| Gemma-4-e2b | 0.852 | 23/27 | 12.2s | default | 2026-05-26 |
| Gemma-4-31B | 0.852 | 23/27 | 43.2s | default | 2026-05-27 |
| Qwen3.5-Opus-distill (27B) | 0.852 | 23/27 | 270.5s | default | 2026-05-27 |
| Ministral-3-8B | 0.852 | 23/27 | 11.5s | default | 2026-05-27 |
| GLM-4.5-Air | 0.815 | 22/27 | 121.8s | default | 2026-05-28 |
| Qwen3-Coder-30B | 0.815 | 22/27 | 1.7s | default | 2026-06-09 |
| Gemma-4-e4b | 0.778 | 21/27 | 20.0s | default | 2026-05-26 |
| North-Mini-Code-1.0 (FP8, Cohere) | 0.778 | 21/27 | 6.9s | default | 2026-06-19 |
| Mistral-Small-4-119B | 0.741 | 20/27 | 11.0s | default | 2026-05-26 |
| Mistral-Small-24B | 0.741 | 20/27 | 4.2s | default | 2026-06-09 |
| Gemma-4-26B-A4B | 0.556 | 15/27 | 46.4s | default | 2026-05-26 |
| Heretic-9B | 0.000 | 0/27 | - | default | 2026-05-27 |
Re-baselined on a single cloud judge (Haiku-4.5 or GPT-4.1) per Session-13 lock-in.
Qwen3.6-35B-A3B, DeepSeek-V4-Flash and Claude Opus-4.7 tie at 0.926 — fleet reasoning is at
Opus parity. The Claude-Opus reasoning-distill (Qwen3.5-Opus, 0.852) did NOT beat native dense.
Blog writing (10-criteria, combined = 0.5 structural + 0.5 judge, best-per-model temp)
| Model | Score | Words | Judge | Gen speed | Temp | Measured |
|---|---|---|---|---|---|---|
| DeepSeek-V4-Flash BEST | 1.000 | 3774 | - | 64s | 0.3 | 2026-05-28 |
| Qwen3.6-27B BF16 (dense) | 1.000 | 3829 | 10/10 | 477s 28 t/s | default | 2026-06-08 |
| nex-n2-pro | 1.000 | 4411 | 10/10 | 95s 91 t/s | default | 2026-06-10 |
| gpt-oss-120b | 0.945 | 1938 | - | 132s | 0.0 | 2026-05-28 |
| Gemma-4-31B (BF16) | 0.945 | 2390 | 10/10 | 169s 24 t/s | default | 2026-06-09 |
| Gemma-4-12B (BF16) | 0.945 | 2291 | 10/10 | 78s 53 t/s | default | 2026-06-09 |
| North-Mini-Code-1.0 (FP8, Cohere) | 0.945 | 2269 | 10/10 | 43s 139 t/s | default | 2026-06-19 |
| GLM-4.7-Flash | 0.945 | 2640 | - | 59s | 0.0 | 2026-05-28 |
| Qwen3.6-35B-A3B (BF16) | 0.944 | 3293 | 9/10 | 49s 171 t/s | default | 2026-06-08 |
| Qwen3.6-35B-A3B (FP8) | 0.889 | 2597 | - | 38s | 0.0 | 2026-05-28 |
| Gemma-4-26B-A4B | 0.889 | 2280 | 9/10 | 128s 44 t/s | default | 2026-05-26 |
| Gemma-4-31B | 0.889 | 2063 | 9/10 | 553s 7 t/s | default | 2026-05-27 |
| Qwen3-Coder-Next-80B | 0.889 | 3109 | 9/10 | 342s 21 t/s | default | 2026-05-27 |
| Gemma-4-26B-A4B (BF16) | 0.889 | 2128 | 10/10 | 26s 147 t/s | default | 2026-06-09 |
| Gemma-4-e4b (BF16) | 0.889 | 2676 | 10/10 | 40s 122 t/s | default | 2026-06-09 |
| Gemma-4-e2b | 0.833 | 2864 | 9/10 | 80s 78 t/s | default | 2026-05-26 |
| command-a-plus | 0.833 | 1953 | 9/10 | 103s 49 t/s | default | 2026-05-26 |
| Qwen3.5-Opus-distill (27B) | 0.833 | 4197 | 9/10 | 240s 37 t/s | default | 2026-05-26 |
| Claude Opus-4.7 | 0.805 | 2342 | - | 107s | default (~0.0) | 2026-05-28 |
| Qwen3.6-35B-A3B | 0.778 | 2977 | 7/10 | 30s 200 t/s | default | 2026-05-25 |
| gpt-oss-20b | 0.778 | 1793 | 9/10 | 34s 243 t/s | default | 2026-05-25 |
| GLM-4-9B | 0.778 | 898 | 7/10 | 12s 181 t/s | default | 2026-05-25 |
| Mixtral-8x22B | 0.778 | 931 | 7/10 | 41s 60 t/s | default | 2026-05-26 |
| Granite-4.1-8B | 0.778 | 868 | 9/10 | 13s 169 t/s | default | 2026-06-09 |
| Mistral-Small-24B | 0.722 | 1303 | 8/10 | 228s 13 t/s | default | 2026-05-25 |
| Gemma-4-e4b | 0.722 | 2434 | 7/10 | 105s 47 t/s | default | 2026-05-26 |
| Mistral-Small-4-119B | 0.722 | 3262 | 8/10 | 281s 28 t/s | default | 2026-05-26 |
| Mistral-Medium-3.5-128B | 0.611 | 1469 | 7/10 | 450s 18 t/s | default | 2026-05-26 |
| GLM-4.5-Air | 0.500 | 4541 | 3/10 | 1360s 6 t/s | default | 2026-05-25 |
| nemotron-3-nano:30b | 0.445 | 5618 | 2/10 | 156s 51 t/s | default | 2026-05-25 |
| Mixtral-8x7B | 0.222 | 79 | 2/10 | 1s 137 t/s | default | 2026-05-26 |
| Qwen3-Coder-30B | 0.111 | 0 | 1/10 | 11s 0 t/s | default | 2026-06-09 |
Task: a 3,000+ word SEO blog post on CMMC, generated from a fixed prompt. Each row reports the model's
best-scoring temperature when the Session-14 Bench-2 sweep covered it; legacy rows keep
their original single temperature. Autoblog runs overnight in batch, so quality, not generation speed, decides the ranking.
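The section heading gives the blog score as combined = 0.5 structural + 0.5 judge. A minimal sketch of that blend follows, assuming both components are already normalized to [0, 1]; how the harness derives each component is not stated in this report, so the function name and the example inputs are hypothetical.

```python
def combined_blog_score(structural: float, judge: float) -> float:
    """Equal-weight blend of a structural score and a judge score, both in [0, 1]."""
    for name, value in (("structural", structural), ("judge", judge)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 0.5 * structural + 0.5 * judge


# Hypothetical inputs: a perfect structural score and a judge fraction of 9/10.
print(round(combined_blog_score(1.0, 9 / 10), 3))
```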
Blog temperature sweep (Session-14 Bench-2, judge=GPT-4.1, N=2 reruns per cell)
| Model | Temp | Mean score | Range (min-max) | Mean words | Mean gen |
|---|---|---|---|---|---|
| Claude Opus-4.7 | default (~0.0) | 0.805 | 0.778-0.833 | 2342 | 107.2s |
| Claude Opus-4.7 | default (~0.3) | 0.805 | 0.778-0.833 | 2342 | 107.2s |
| Claude Opus-4.7 | default (~0.7) | 0.805 | 0.778-0.833 | 2342 | 107.2s |
| Claude Opus-4.7 | default (~1.0) | 0.805 | 0.778-0.833 | 2342 | 107.2s |
| DeepSeek-V4-Flash | 0.0 | 0.972 | 0.944-1.000 | 4194 | 78.4s |
| DeepSeek-V4-Flash | 0.3 BEST | 1.000 | 1.000-1.000 | 3774 | 64.2s |
| DeepSeek-V4-Flash | 0.7 | 1.000 | 1.000-1.000 | 3732 | 65.0s |
| DeepSeek-V4-Flash | 1.0 | 0.972 | 0.944-1.000 | 4025 | 72.4s |
| GLM-4.7-Flash | 0.0 BEST | 0.945 | 0.944-0.945 | 2640 | 58.8s |
| GLM-4.7-Flash | 0.3 | 0.805 | 0.722-0.889 | 2210 | 53.1s |
| GLM-4.7-Flash | 0.7 | 0.861 | 0.833-0.889 | 2748 | 57.9s |
| GLM-4.7-Flash | 1.0 | 0.889 | 0.889-0.889 | 2072 | 49.5s |
| Qwen3.6-35B-A3B (FP8) | 0.0 BEST | 0.889 | 0.889-0.889 | 2597 | 37.5s |
| Qwen3.6-35B-A3B (FP8) | 0.3 | 0.806 | 0.667-0.945 | 3204 | 39.8s |
| Qwen3.6-35B-A3B (FP8) | 0.7 | 0.584 | 0.445-0.722 | 2517 | 44.2s |
| Qwen3.6-35B-A3B (FP8) | 1.0 | 0.611 | 0.611-0.611 | 2549 | 41.5s |
| gpt-oss-120b | 0.0 BEST | 0.945 | 0.945-0.945 | 1938 | 131.8s |
| gpt-oss-120b | 0.3 | 0.889 | 0.889-0.889 | 2145 | 137.4s |
| gpt-oss-120b | 0.7 | 0.945 | 0.945-0.945 | 2144 | 127.8s |
| gpt-oss-120b | 1.0 | 0.917 | 0.889-0.945 | 1710 | 124.2s |
Session-14 Bench-2 temperature sweep measured 2026-05-28 13:48 EDT. Reasoning models (DeepSeek-V4-Flash, Qwen3.6-35B-A3B) may treat temperature
as a hint during their reasoning phase; a flat row across temps is itself a finding. Claude Opus-4.7
rejects the temperature parameter; its rows are copied from a single default-temperature run.
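The sweep table's mean, range and BEST marker can be derived per model roughly as follows. With N=2 reruns per cell, the two scores are the min and max of the Range column, so the DeepSeek-V4-Flash cells below come straight from the table. Breaking ties in favour of the lower temperature is an assumption; it is consistent with the BEST markers shown but is not documented here.

```python
from statistics import mean


def summarize_sweep(cells: dict[float, list[float]]) -> dict[float, tuple[float, float, float]]:
    """Map each temperature to (mean, min, max) over its reruns."""
    return {temp: (mean(scores), min(scores), max(scores)) for temp, scores in cells.items()}


def best_temperature(cells: dict[float, list[float]]) -> float:
    """Highest mean score wins; ties go to the lower temperature (assumption)."""
    summary = summarize_sweep(cells)
    return min(summary, key=lambda temp: (-summary[temp][0], temp))


# DeepSeek-V4-Flash cells reconstructed from the Range column (N=2 per cell).
deepseek = {
    0.0: [0.944, 1.000],
    0.3: [1.000, 1.000],
    0.7: [1.000, 1.000],
    1.0: [0.944, 1.000],
}
print(best_temperature(deepseek))
```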
Questions answered
Which models do we keep resident and route to? Coding → Qwen3.6-35B-A3B (0.975, GPT-4.1).
Research → Qwen3.6-35B (speed) / DeepSeek-V4-Flash (long-context), both Opus-parity. Blog →
DeepSeek-V4-Flash / GLM-4.7-Flash. Voice tool-call → gpt-oss-20b. Edge appliance → Granite-4.1-8B
(+ Gemma-4-e4b for tool-driving). Reserve Claude Opus for the top few percent.
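The routing answer above can be expressed as a small lookup table. This is a sketch, not the fleet's actual router: the task-type keys, the `route` function and the `escalate` switch are hypothetical names, and only the model assignments come from the answer.

```python
# Task-type keys are hypothetical; model assignments follow the answer above.
ROUTES = {
    "coding": "Qwen3.6-35B-A3B",
    "research": "Qwen3.6-35B-A3B",
    "research-long-context": "DeepSeek-V4-Flash",
    "blog": "DeepSeek-V4-Flash",
    "voice-tool-call": "gpt-oss-20b",
    "edge": "Granite-4.1-8B",
}
ESCALATION_MODEL = "Claude Opus-4.7"  # reserved for the hardest few percent of requests


def route(task_type: str, escalate: bool = False) -> str:
    """Return the resident model for a task type, or the escalation model on request."""
    if escalate:
        return ESCALATION_MODEL
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route configured for task type {task_type!r}") from None


print(route("blog"))
```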
Do self-improving loops help small models? A loop is a capability amplifier, not an
equalizer: Qwen3.6-35B goes 0.917→1.0 with iterations; small models (Gemma-4-e4b, GLM-4.7-Flash)
stay flat at 0.917. The done-gate makes a small model honest (no silent fabrication), not capable.
Does a self-improving loop beat a general agent (pi.dev) on a 4B model? Yes. Gemma-4-e4b scored 0.917 in a
done-gated loop vs 0.0 in pi.dev (it fabricated all 12 tasks). Air-gap appliances should pair a small
model with a programmatic verifier loop, never a general agent framework.
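The done-gated loop described above can be sketched as a capped retry loop in which only a programmatic verifier may declare success. The function and parameter names are hypothetical and the stub model is a toy; the real harness is not shown in this report.

```python
from typing import Callable, Optional


def done_gated_loop(
    propose: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    cap: int = 5,
) -> tuple[Optional[str], int]:
    """Run up to `cap` attempts and accept an answer only when the verifier passes it.

    `propose(task, feedback)` returns a candidate answer. `verify(candidate)`
    returns None on success or an error message to feed back to the model.
    Returns (answer, attempts); answer is None if no attempt passed, so the
    model's own claim of "done" is never taken on trust.
    """
    feedback: Optional[str] = None
    for attempt in range(1, cap + 1):
        candidate = propose(task, feedback)
        feedback = verify(candidate)
        if feedback is None:
            return candidate, attempt
    return None, cap


def stub_model(task: str, feedback: Optional[str]) -> str:
    """Toy stand-in for a model: wrong on the first try, right once it gets feedback."""
    return "4" if feedback else "5"


def checker(candidate: str) -> Optional[str]:
    """Toy programmatic verifier for the task '2 + 2'."""
    return None if candidate == "4" else "expected the sum of 2 and 2"


print(done_gated_loop(stub_model, checker, "2 + 2", cap=5))
```

With `cap=1` the same stub returns no accepted answer, which mirrors the cap=1 versus cap=5 comparison used in the report.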
Are there other open models worth adding? A live Feb–May 2026 scan found none that beat
the incumbents; Qwen3.7/Qwen4, DeepSeek-R2, Phi-5 and Grok-3 weights are unreleased or hosted-only.
What is the best model for a single RTX PRO 6000 96GB (Blackwell) card, and is a 35B a waste of it?
Nothing displaces the incumbent, and a 35B is not a waste. Qwen3.6-35B-A3B (3B active) is the best all-rounder that fits one card: it wins research and
cited-RAG outright and leads coding (0.975, GPT-4.1). The models large enough to “fill” the card
(gpt-oss-120b 0.964, Mistral-Small-4-119B 0.957) are slower and weaker on the role’s core axes. A
low-active MoE is the correct shape for a 96GB concurrency server: comparable NVFP4 models scale to ~2,000
t/s aggregate at c=32 on this card. Spare VRAM is best spent on KV/concurrency, or on NVFP4 (same quality at half
the VRAM, freeing room to co-locate a second model), not on a bigger-but-worse model. Mistral-Small-4-119B is the
lone alternative, and only if the card is redefined as a cited-RAG / vision / compliance resident.
Is a purpose-built Rust inference engine (Atlas) faster than our tuned vLLM on Blackwell? (measured 2026-06-07)
No. On identical GB10 (DGX-Spark-class) hardware and the same Qwen3.6-35B-A3B-NVFP4 model, our tuned vLLM
(NVIDIA MTP recipe) ran 116–119 tok/s steady-state vs Atlas’s 88.9; Atlas’s advertised
“130–133 tok/s” and “3.1× faster than vLLM” did not reproduce (the 3.1× is vs an
untuned vLLM). Atlas serving is quality-preserving — blog 0.944 (ties our blog leader) and
6/6 on a coding spot-check — and ships an ~8×-smaller (2.98 GB) no-Python single binary. That makes it a
candidate packaging vehicle for an air-gapped compliance appliance, not a throughput upgrade. Its multi-node
expert-parallel mode is not yet shipping (runtime is single-node only).
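For scale, the throughput figures quoted above work out as follows. The arithmetic uses only the numbers in the answer; the variable and function names are illustrative.

```python
def speedup(ours_tok_s: float, theirs_tok_s: float) -> float:
    """Ratio of two steady-state throughputs measured on the same hardware and model."""
    return ours_tok_s / theirs_tok_s


VLLM_TOK_S = (116.0, 119.0)  # tuned vLLM, steady-state range from the report
ATLAS_TOK_S = 88.9           # Atlas, as measured in the report

low, high = (speedup(v, ATLAS_TOK_S) for v in VLLM_TOK_S)
print(f"tuned vLLM is {low:.2f}x to {high:.2f}x Atlas")
```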
Does an agentic multi-hop retriever beat single-shot RAG for compliance Q&A? (measured 2026-06-07)
On hard multi-hop CMMC / NIST 800-171 questions, an RL-trained search agent (Harness-1, 21B, gpt-oss-20b base)
found every gold control (retrieval recall 1.000) where single-shot dense top-8 reached only 0.881 —
it recovers the deep 2nd/3rd-hop controls single-shot drops at production cutoffs. But its curated answer (0.929)
only matched single-shot top-15 (the curation step, not the search, is the bottleneck) and cost ~1,000× the
latency — so the value is exhaustive batch retrieval (audit / SSP gap analysis), not interactive RAG.
Control-id deduplication remains the cheap universal lever: it lifts both single-shot (0.786→0.881) and the
agent (0.905→1.000).
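Control-id deduplication and retrieval recall, as discussed above, can be sketched like this. The hit list, chunk ids and gold set are hypothetical examples in NIST 800-171 numbering style, and the function names are not taken from the harness.

```python
def dedupe_by_control(ranked_hits: list[tuple[str, str]], top_k: int) -> list[str]:
    """Keep the highest-ranked chunk per control id, then cut to top_k distinct controls.

    `ranked_hits` is a best-first list of (control_id, chunk_id) pairs.
    """
    if top_k <= 0:
        return []
    kept: list[str] = []
    for control_id, _chunk_id in ranked_hits:
        if control_id not in kept:
            kept.append(control_id)
        if len(kept) == top_k:
            break
    return kept


def recall(retrieved: list[str], gold: set[str]) -> float:
    """Fraction of gold control ids present in the retrieved list."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(gold & set(retrieved)) / len(gold)


# Hypothetical ranking: two chunks of the same control crowd out a deeper-hop control.
hits = [("3.1.1", "c1"), ("3.1.1", "c2"), ("3.5.3", "c3"), ("3.13.8", "c4")]
gold = {"3.1.1", "3.5.3", "3.13.8"}

naive = [control_id for control_id, _ in hits[:3]]  # plain top-3 by chunk
deduped = dedupe_by_control(hits, top_k=3)           # top-3 by distinct control
print(recall(naive, gold), recall(deduped, gold))
```

The toy case shows the mechanism: duplicates of one control consume slots before the cutoff, so removing them lets deeper-hop controls through at the same top-k.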
Methodology: PTG llm-benchmark harness; cross-judged by Claude Haiku-4.5 (pre-Session-13)
or GPT-4.1 (Session-13 onwards). Temperatures: coding / tool-call / adversarial leaderboards filter to
temp=0.0; blog uses each model's best temperature from the Session-14 Bench-2 sweep where covered.
Excluded: misconfigured/errored runs and hardware-specific Mac coding runs (see Apple Silicon matrix).
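The generator != judge rule can be sketched as a judge picker that avoids scoring a model with a judge from its own family. Inferring the family from the model name, the `family` and `pick_judge` helpers, and the choice of GPT-4.1 as the preferred judge are all assumptions for illustration; the harness's actual selection logic is not shown in this report.

```python
# Judge name -> model family. The two judges are those named in the methodology.
JUDGES = {"gpt-4.1": "openai", "claude-haiku-4.5": "anthropic"}


def family(model_name: str) -> str:
    """Hypothetical family lookup based on substrings of the model name."""
    name = model_name.lower()
    if "claude" in name:
        return "anthropic"
    if "gpt" in name:
        return "openai"
    return "other"


def pick_judge(generator: str, preferred: str = "gpt-4.1") -> str:
    """Return the preferred judge unless it shares a family with the generator."""
    generator_family = family(generator)
    if JUDGES[preferred] != generator_family:
        return preferred
    for judge, judge_family in JUDGES.items():
        if judge_family != generator_family:
            return judge
    raise ValueError(f"no eligible judge for generator {generator!r}")


print(pick_judge("gpt-oss-20b"))
```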