PTG Fleet Model Leaderboards

Cross-judged by Claude Haiku-4.5 / GPT-4.1 (generator != judge per Session-13 lock-in). Last updated: 2026-06-19 21:59 EDT. Auto-generated. Per-row Measured column = data-collection date.
Headline (overnight bench-off, measured 2026-05-27): a sweep of every new open-weight model mostly validated the current fleet set: Qwen3.6-35B-A3B stays both the coding leader and a research peer of Claude Opus, DeepSeek-V4-Flash leads blog, and gpt-oss-20b stays the voice tool-caller. Two Mistral exceptions (Blackwell re-test, 2026-05-27): Mistral-Small-4-119B is a viable differentiated resident for agentic / cited-RAG / compliance work (0.957 coding + 92.4% zero-hallucination tool-calls + native citations + vision on one checkpoint), and Ministral-3-8B is the new air-gap edge-appliance leader (beats Granite-4.1-8B). All Mistral models need vLLM --tool-call-parser mistral to tool-call correctly.
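As a rough illustration of the Mistral tool-call note above, the sketch below builds a `vllm serve` command line with the Mistral parser selected. The model id and port are placeholders, not fleet settings, and pairing the parser with `--enable-auto-tool-choice` is an assumption about how the server is launched rather than something this page states.

```python
import shlex


def vllm_serve_command(model_id: str, port: int = 8000) -> list[str]:
    """Build an argv list for `vllm serve` with Mistral tool-call parsing."""
    return [
        "vllm", "serve", model_id,
        "--port", str(port),
        # Assumption: automatic tool choice is enabled alongside the parser.
        "--enable-auto-tool-choice",
        # The flag named in the headline note.
        "--tool-call-parser", "mistral",
    ]


if __name__ == "__main__":
    # Placeholder model id; substitute the checkpoint actually deployed.
    argv = vllm_serve_command("mistralai/Ministral-3-8B-Instruct-2512")
    print(shlex.join(argv))
```

Printing the joined command rather than executing it keeps the sketch safe to run on a machine without vLLM installed.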

Coding (22 PTG ops/SEO/debug tasks, mean case score, task-appropriate temp=0.0)

Model | Score | Pass | Latency | Temp | Measured
Claude Opus-4.7 BEST | 0.991 | 22/22 | 2.7s | default | 2026-05-25
Qwen3.6-27B BF16 (dense) | 0.990 | 20/22 | 66.4s | default | 2026-06-08
gpt-oss:20b | 0.989 | 22/22 | 2.5s | default | 2026-05-28
Qwen3.6-35B-A3B | 0.986 | 22/22 | 0.6s | default | 2026-05-25
North-Mini-Code-1.0 (FP8, Cohere) | 0.986 | 22/22 | 3.5s | default | 2026-06-19
Gemma-4-31B | 0.982 | 22/22 | 16.6s | default | 2026-05-26
nex-n2-pro | 0.980 | 22/22 | 18.1s | default | 2026-06-10
Gemma-4-12B | 0.977 | 22/22 | 15.8s | default | 2026-06-03
gpt-oss-20b | 0.975 | 22/22 | 11.8s | default | 2026-05-28
MiniMax-M2.7 (NVFP4) | 0.974 | 21/22 | 50.1s | default | 2026-05-27
GLM-4.7-Flash | 0.973 | 22/22 | 8.9s | default | 2026-05-28
Qwen3.6-35B-A3B (BF16) | 0.973 | 22/22 | 22.7s | default | 2026-06-08
DeepSeek-V4-Flash | 0.970 | 21/22 | 0.9s | default | 2026-05-29
Gemma-4-26B-A4B (BF16) | 0.970 | 22/22 | 1.0s | default | 2026-06-09
Mistral-Medium-3.5-128B | 0.968 | 22/22 | 4.6s | default | 2026-05-26
gpt-oss-120b | 0.964 | 21/22 | 8.4s | default | 2026-05-28
claude-haiku-4-5-20251001 | 0.959 | 21/22 | 1.2s | default | 2026-05-29
gpt-4.1 | 0.957 | 21/22 | 1.4s | default | 2026-05-25
Qwen3.6-27B (dense) | 0.957 | 21/22 | 34.0s | default | 2026-05-26
Mistral-Small-4-119B | 0.957 | 21/22 | 1.4s | default | 2026-05-27
GLM-4.5-Air | 0.956 | 17/22 | 105.3s | default | 2026-05-28
Qwen3-Coder-Next-80B | 0.955 | 21/22 | 4.8s | default | 2026-05-27
Gemma-4-31B (BF16) | 0.955 | 21/22 | 4.9s | default | 2026-06-09
Qwen3-Coder-30B | 0.952 | 22/22 | 0.8s | default | 2026-06-09
openai/gpt-oss-20b | 0.952 | 21/22 | 1.8s | default | 2026-05-26
qwen3-coder:30b | 0.952 | 22/22 | 0.8s | default | 2026-05-27
Gemma-4-12B (BF16) | 0.952 | 21/22 | 2.5s | default | 2026-06-09
Gemma-4-e4b (BF16) | 0.949 | 22/22 | 1.8s | default | 2026-06-09
command-a-plus | 0.939 | 21/22 | 18.1s | default | 2026-05-26
Gemma-4-e2b | 0.935 | 21/22 | 12.0s | default | 2026-05-26
Qwen3.5-Opus-distill (27B) | 0.934 | 21/22 | 13.1s | default | 2026-05-26
Devstral-Small-2-24B | 0.932 | 20/22 | 6.7s | default | 2026-05-28
Gemma-4-e4b | 0.931 | 20/22 | 13.6s | default | 2026-05-26
Falcon-H1R-7B | 0.926 | 19/22 | 84.2s | default | 2026-05-27
Mistral-Small-24B | 0.914 | 20/22 | 1.1s | default | 2026-05-27
Mixtral-8x22B | 0.909 | 21/22 | 1.9s | default | 2026-05-26
Granite-4.1-8B | 0.903 | 19/22 | 0.8s | default | 2026-06-09
hf.co/tiiuae/Falcon-H1R-7B-GGUF:Q4_K_M | 0.885 | 19/22 | 49.9s | default | 2026-05-27
q25-gptq | 0.885 | 19/22 | 2.5s | default | 2026-05-28
Apriel-1.6-15B | 0.880 | 19/22 | 110.6s | default | 2026-05-27
Gemma-4-26B-A4B | 0.877 | 18/22 | 18.4s | default | 2026-05-26
Ministral-3-8B | 0.872 | 18/22 | 2.9s | default | 2026-05-27
hf.co/unsloth/Ministral-3-8B-Instruct-2512-GGUF:Q4_K_M | 0.858 | 18/22 | 0.8s | default | 2026-05-27
qwen2.5:7b-instruct | 0.847 | 18/22 | 2.0s | default | 2026-05-28
l31-awq | 0.847 | 19/22 | 3.4s | default | 2026-05-29
LFM2.5-8B-A1B (Liquid) | 0.846 | 17/22 | 2.0s | default | 2026-06-03
q25-awq | 0.840 | 18/22 | 2.5s | default | 2026-05-28
llama3.1:8b | 0.820 | 17/22 | 2.7s | default | 2026-05-29
l31-gptq | 0.794 | 15/22 | 2.9s | default | 2026-05-29
Mixtral-8x7B | 0.773 | 17/22 | 0.7s | default | 2026-05-26
MiniCPM5-1B | 0.644 | 10/22 | 6.2s | default | 2026-05-27
Heretic-9B | 0.000 | 0/22 | - | default | 2026-05-27
Incumbent fleet coder Qwen3.6-35B-A3B now co-leads. North-Mini-Code-1.0 (Cohere, Apache-2.0, 30B-total/3B-active MoE) matches the very top on PTG coding: 0.980 ± 0.004 (N=5), statistically tied with Nex-N2-Pro and edging Qwen3.6-35B-FP8 (0.975). It also loop-amplifies on agentic tasks (cap=1 0.833 → cap=5 0.917, fabrication 2→1), unlike the non-amplifiers below. Its FP8 build fits one card, runs fast, and tool-calls cleanly, which makes it a genuine alternative fleet coder (Session-22). It is not a research/RAG model (0.821), and its blogs fall short of 3,000 words. Gemma-4-12B (0.977) ties the 31B/Qwen3.6 on single-shot coding but is single-shot only: it does NOT loop-amplify (cap=1 and cap=5 both score 0.917 on agentic tasks) and fabricates "done" under loop pressure, so use it as a fast coder, not an agent. LFM2.5-8B-A1B is an on-device model (edge-tier 0.846, 3.41% tool-call hallucination), not a fleet upgrade. Claude Opus-4.7 reference = 1.00. Scores cross-judged by GPT-4.1, Haiku-4.5, or validated-equivalent DeepSeek-V4-Flash. Temperature column: stamped value where present; "default" = pre-Session-13 runs (eval_coding.py default 0.0).

Research & reasoning (27-case, none-context, task-appropriate temp=0.0-0.3)

Model | Score | Pass | Latency | Temp | Measured
Gemma-4-31B (BF16) BEST | 0.963 | 26/27 | 11.7s | default | 2026-06-09
Gemma-4-26B-A4B (BF16) | 0.963 | 26/27 | 2.2s | default | 2026-06-09
nex-n2-pro | 0.963 | 26/27 | 3.2s | default | 2026-06-10
Qwen3.6-35B-A3B | 0.926 | 25/27 | 20.4s | default | 2026-05-26
DeepSeek-V4-Flash | 0.926 | 25/27 | 15.1s | default | 2026-05-27
Claude Opus-4.7 | 0.926 | 25/27 | 9.5s | default | 2026-05-25
Qwen3-Coder-Next-80B | 0.926 | 25/27 | 16.0s | default | 2026-05-27
Qwen3.6-27B BF16 (dense) | 0.926 | 25/27 | 161.2s | default | 2026-06-09
Gemma-4-12B (BF16) | 0.926 | 25/27 | 5.6s | default | 2026-06-09
Gemma-4-e4b (BF16) | 0.926 | 25/27 | 2.7s | default | 2026-06-09
Granite-4.1-8B | 0.889 | 24/27 | 2.6s | default | 2026-06-09
Qwen3.6-27B (dense) | 0.889 | 24/27 | 91.2s | default | 2026-05-26
Qwen3.6-35B-A3B (BF16) | 0.889 | 24/27 | 24.2s | default | 2026-06-08
Gemma-4-e2b | 0.852 | 23/27 | 12.2s | default | 2026-05-26
Gemma-4-31B | 0.852 | 23/27 | 43.2s | default | 2026-05-27
Qwen3.5-Opus-distill (27B) | 0.852 | 23/27 | 270.5s | default | 2026-05-27
Ministral-3-8B | 0.852 | 23/27 | 11.5s | default | 2026-05-27
GLM-4.5-Air | 0.815 | 22/27 | 121.8s | default | 2026-05-28
Qwen3-Coder-30B | 0.815 | 22/27 | 1.7s | default | 2026-06-09
Gemma-4-e4b | 0.778 | 21/27 | 20.0s | default | 2026-05-26
North-Mini-Code-1.0 (FP8, Cohere) | 0.778 | 21/27 | 6.9s | default | 2026-06-19
Mistral-Small-4-119B | 0.741 | 20/27 | 11.0s | default | 2026-05-26
Mistral-Small-24B | 0.741 | 20/27 | 4.2s | default | 2026-06-09
Gemma-4-26B-A4B | 0.556 | 15/27 | 46.4s | default | 2026-05-26
Heretic-9B | 0.000 | 0/27 | - | default | 2026-05-27
Re-baselined on a single cloud judge (Haiku-4.5 or GPT-4.1) per Session-13 lock-in. Qwen3.6-35B-A3B, DeepSeek-V4-Flash and Claude Opus-4.7 tie at 0.926 — fleet reasoning is at Opus parity. The Claude-Opus reasoning-distill (Qwen3.5-Opus, 0.852) did NOT beat native dense.

Blog writing (10-criteria, combined = 0.5 structural + 0.5 judge, best-per-model temp)

Model | Score | Words | Judge | Gen speed | Temp | Measured
DeepSeek-V4-Flash BEST | 1.000 | 3774 | - | 64s | 0.3 | 2026-05-28
Qwen3.6-27B BF16 (dense) | 1.000 | 3829 | 10/10 | 477s 28 t/s | default | 2026-06-08
nex-n2-pro | 1.000 | 4411 | 10/10 | 95s 91 t/s | default | 2026-06-10
gpt-oss-120b | 0.945 | 1938 | - | 132s | 0.0 | 2026-05-28
Gemma-4-31B (BF16) | 0.945 | 2390 | 10/10 | 169s 24 t/s | default | 2026-06-09
Gemma-4-12B (BF16) | 0.945 | 2291 | 10/10 | 78s 53 t/s | default | 2026-06-09
North-Mini-Code-1.0 (FP8, Cohere) | 0.945 | 2269 | 10/10 | 43s 139 t/s | default | 2026-06-19
GLM-4.7-Flash | 0.945 | 2640 | - | 59s | 0.0 | 2026-05-28
Qwen3.6-35B-A3B (BF16) | 0.944 | 3293 | 9/10 | 49s 171 t/s | default | 2026-06-08
Qwen3.6-35B-A3B (FP8) | 0.889 | 2597 | - | 38s | 0.0 | 2026-05-28
Gemma-4-26B-A4B | 0.889 | 2280 | 9/10 | 128s 44 t/s | default | 2026-05-26
Gemma-4-31B | 0.889 | 2063 | 9/10 | 553s 7 t/s | default | 2026-05-27
Qwen3-Coder-Next-80B | 0.889 | 3109 | 9/10 | 342s 21 t/s | default | 2026-05-27
Gemma-4-26B-A4B (BF16) | 0.889 | 2128 | 10/10 | 26s 147 t/s | default | 2026-06-09
Gemma-4-e4b (BF16) | 0.889 | 2676 | 10/10 | 40s 122 t/s | default | 2026-06-09
Gemma-4-e2b | 0.833 | 2864 | 9/10 | 80s 78 t/s | default | 2026-05-26
command-a-plus | 0.833 | 1953 | 9/10 | 103s 49 t/s | default | 2026-05-26
Qwen3.5-Opus-distill (27B) | 0.833 | 4197 | 9/10 | 240s 37 t/s | default | 2026-05-26
Claude Opus-4.7 | 0.805 | 2342 | - | 107s | default (~0.0) | 2026-05-28
Qwen3.6-35B-A3B | 0.778 | 2977 | 7/10 | 30s 200 t/s | default | 2026-05-25
gpt-oss-20b | 0.778 | 1793 | 9/10 | 34s 243 t/s | default | 2026-05-25
GLM-4-9B | 0.778 | 898 | 7/10 | 12s 181 t/s | default | 2026-05-25
Mixtral-8x22B | 0.778 | 931 | 7/10 | 41s 60 t/s | default | 2026-05-26
Granite-4.1-8B | 0.778 | 868 | 9/10 | 13s 169 t/s | default | 2026-06-09
Mistral-Small-24B | 0.722 | 1303 | 8/10 | 228s 13 t/s | default | 2026-05-25
Gemma-4-e4b | 0.722 | 2434 | 7/10 | 105s 47 t/s | default | 2026-05-26
Mistral-Small-4-119B | 0.722 | 3262 | 8/10 | 281s 28 t/s | default | 2026-05-26
Mistral-Medium-3.5-128B | 0.611 | 1469 | 7/10 | 450s 18 t/s | default | 2026-05-26
GLM-4.5-Air | 0.500 | 4541 | 3/10 | 1360s 6 t/s | default | 2026-05-25
nemotron-3-nano:30b | 0.445 | 5618 | 2/10 | 156s 51 t/s | default | 2026-05-25
Mixtral-8x7B | 0.222 | 79 | 2/10 | 1s 137 t/s | default | 2026-05-26
Qwen3-Coder-30B | 0.111 | 0 | 1/10 | 11s 0 t/s | default | 2026-06-09
3,000+ word SEO CMMC blog from a fixed prompt. Each row reports the model's best-scoring temperature when the Session-14 Bench-2 sweep covered it; legacy rows keep their original single temperature. Autoblog runs overnight in batch, so quality decides.
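The section heading gives the blog score as combined = 0.5 structural + 0.5 judge. A minimal sketch of that weighting follows, assuming both components are already normalized to the range 0 to 1. How the harness derives each component (for example from the 10 criteria or the Judge column) is not stated here, so the inputs below are hypothetical values, not rows from the table.

```python
def combined_blog_score(structural: float, judge: float) -> float:
    """Equal-weight blend of a structural score and a judge score.

    Assumption: both inputs are already normalized to [0, 1].
    """
    for name, value in (("structural", structural), ("judge", judge)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return 0.5 * structural + 0.5 * judge


if __name__ == "__main__":
    # Hypothetical inputs: strong structure, perfect judge score.
    print(round(combined_blog_score(0.9, 1.0), 3))  # 0.95
```

With equal weights, a model cannot reach 1.000 on prose quality alone; it also has to satisfy the structural checks.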

Blog temperature sweep (Session-14 Bench-2, judge=GPT-4.1, N=2 reruns per cell)

Model | Temp | Mean score | Range (min-max) | Mean words | Mean gen
Claude Opus-4.7 | default (~0.0) | 0.805 | 0.778-0.833 | 2342 | 107.2s
Claude Opus-4.7 | default (~0.3) | 0.805 | 0.778-0.833 | 2342 | 107.2s
Claude Opus-4.7 | default (~0.7) | 0.805 | 0.778-0.833 | 2342 | 107.2s
Claude Opus-4.7 | default (~1.0) | 0.805 | 0.778-0.833 | 2342 | 107.2s
DeepSeek-V4-Flash | 0.0 | 0.972 | 0.944-1.000 | 4194 | 78.4s
DeepSeek-V4-Flash | 0.3 BEST | 1.000 | 1.000-1.000 | 3774 | 64.2s
DeepSeek-V4-Flash | 0.7 | 1.000 | 1.000-1.000 | 3732 | 65.0s
DeepSeek-V4-Flash | 1.0 | 0.972 | 0.944-1.000 | 4025 | 72.4s
GLM-4.7-Flash | 0.0 BEST | 0.945 | 0.944-0.945 | 2640 | 58.8s
GLM-4.7-Flash | 0.3 | 0.805 | 0.722-0.889 | 2210 | 53.1s
GLM-4.7-Flash | 0.7 | 0.861 | 0.833-0.889 | 2748 | 57.9s
GLM-4.7-Flash | 1.0 | 0.889 | 0.889-0.889 | 2072 | 49.5s
Qwen3.6-35B-A3B (FP8) | 0.0 BEST | 0.889 | 0.889-0.889 | 2597 | 37.5s
Qwen3.6-35B-A3B (FP8) | 0.3 | 0.806 | 0.667-0.945 | 3204 | 39.8s
Qwen3.6-35B-A3B (FP8) | 0.7 | 0.584 | 0.445-0.722 | 2517 | 44.2s
Qwen3.6-35B-A3B (FP8) | 1.0 | 0.611 | 0.611-0.611 | 2549 | 41.5s
gpt-oss-120b | 0.0 BEST | 0.945 | 0.945-0.945 | 1938 | 131.8s
gpt-oss-120b | 0.3 | 0.889 | 0.889-0.889 | 2145 | 137.4s
gpt-oss-120b | 0.7 | 0.945 | 0.945-0.945 | 2144 | 127.8s
gpt-oss-120b | 1.0 | 0.917 | 0.889-0.945 | 1710 | 124.2s
Session-14 Bench-2 temperature sweep measured 2026-05-28 13:48 EDT. Reasoning models (DeepSeek-V4-Flash, Qwen3.6-35B-A3B) may treat temperature as a hint during their reasoning phase; a flat row across temps is itself a finding. Claude Opus-4.7 rejects the temperature parameter; its rows are copied from a single default-temperature run.
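The sketch below shows one way the sweep's per-cell mean, range, and BEST marker could be computed. The per-run scores are inferred from the table's min-max ranges, which for N=2 are simply the two runs. The tie-break (lowest temperature wins when means are equal) is an assumption; it matches the BEST markers in the sweep table but is not documented on this page.

```python
from statistics import mean

# (model, temperature) -> per-run scores, inferred from the min-max ranges.
RUNS = {
    ("DeepSeek-V4-Flash", 0.0): [0.944, 1.000],
    ("DeepSeek-V4-Flash", 0.3): [1.000, 1.000],
    ("DeepSeek-V4-Flash", 0.7): [1.000, 1.000],
    ("DeepSeek-V4-Flash", 1.0): [0.944, 1.000],
    ("gpt-oss-120b", 0.0): [0.945, 0.945],
    ("gpt-oss-120b", 0.3): [0.889, 0.889],
    ("gpt-oss-120b", 0.7): [0.945, 0.945],
    ("gpt-oss-120b", 1.0): [0.889, 0.945],
}


def summarize(runs):
    """Collapse each cell's reruns into mean, min, and max."""
    return {
        key: {"mean": mean(scores), "min": min(scores), "max": max(scores)}
        for key, scores in runs.items()
    }


def best_temperature(runs, model):
    """Pick the temperature with the highest mean score for one model.

    Assumption: ties on the mean go to the lowest temperature.
    """
    cells = [(temp, mean(s)) for (m, temp), s in runs.items() if m == model]
    if not cells:
        raise KeyError(f"no sweep cells for {model}")
    return min(cells, key=lambda cell: (-cell[1], cell[0]))[0]


if __name__ == "__main__":
    for model in ("DeepSeek-V4-Flash", "gpt-oss-120b"):
        print(model, "best temp:", best_temperature(RUNS, model))
```

With only two reruns per cell, a 0.03 gap between means can come from a single run, so the range column matters as much as the mean when reading the BEST marker.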

Questions answered

Which models do we keep resident and route to? Coding → Qwen3.6-35B-A3B (0.975, GPT-4.1). Research → Qwen3.6-35B (speed) / DeepSeek-V4-Flash (long-context), both Opus-parity. Blog → DeepSeek-V4-Flash / GLM-4.7-Flash. Voice tool-call → gpt-oss-20b. Edge appliance → Granite-4.1-8B (+ Gemma-4-e4b for tool-driving). Reserve Claude Opus for the top few percent.
Do self-improving loops help small models? A loop is a capability amplifier, not an equalizer: Qwen3.6-35B goes 0.917→1.0 with iterations; small models (Gemma-4-e4b, GLM-4.7-Flash) stay flat at 0.917. The done-gate makes a small model honest (no silent fabrication), not capable.
Self-improving loop vs a general agent (pi.dev) on a 4B? Gemma-4-e4b scored 0.917 in a done-gated loop vs 0.0 in pi.dev (it fabricated all 12 tasks). Air-gap appliances should pair a small model with a programmatic verifier loop, never a general agent framework.
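A done-gated loop of the kind described in the two answers above can be sketched as follows. This is a minimal illustration, not the PTG harness: `generate` and `verify` are hypothetical callables, and `cap` plays the role of the cap=1 / cap=5 iteration limit. The key property is that a task is only reported as done when a programmatic verifier accepts the output, so the model cannot declare success on its own.

```python
from typing import Callable, Optional


def done_gated_loop(
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    task: str,
    cap: int = 5,
) -> tuple[Optional[str], int]:
    """Retry generation until the verifier passes or the cap is reached.

    Returns (accepted_output, attempts). The output is None when no attempt
    passed, so a failure is never reported as "done".
    """
    feedback: Optional[str] = None
    for attempt in range(1, cap + 1):
        candidate = generate(task, feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, attempt
    return None, cap


# Toy stand-ins for a model and a verifier (hypothetical).
def toy_generate(task: str, feedback: Optional[str]) -> str:
    # Answers wrongly at first, then corrects itself once it sees feedback.
    return "4" if feedback else "5"


def toy_verify(candidate: str) -> tuple[bool, str]:
    ok = candidate == "4"
    return ok, "" if ok else "expected 2 + 2 to equal 4"


if __name__ == "__main__":
    print(done_gated_loop(toy_generate, toy_verify, "2 + 2", cap=1))  # (None, 1)
    print(done_gated_loop(toy_generate, toy_verify, "2 + 2", cap=5))  # ('4', 2)
```

The toy run mirrors the amplifier pattern: the extra iterations help only because the generator can use the verifier's feedback. A generator that ignores feedback would return `None` at any cap, which is the honest-but-not-capable outcome described for small models.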
Are there other open models worth adding? A live Feb–May 2026 scan found none that beat the incumbents; Qwen3.7/Qwen4, DeepSeek-R2, Phi-5 and Grok-3 weights are unreleased or hosted-only.
What is the best model for a single RTX PRO 6000 96GB (Blackwell) card, and is a 35B a waste of it? No displacer. Qwen3.6-35B-A3B (3B active) is the best all-rounder that fits one card: it wins research and cited-RAG outright and leads coding (0.975, GPT-4.1). The models large enough to “fill” the card (gpt-oss-120b 0.964, Mistral-Small-4-119B 0.957) are slower and weaker on the role’s core axes. A low-active MoE is the correct shape for a 96GB concurrency server: comparable NVFP4 models scale to ~2,000 t/s aggregate at c=32 on this card. Spare VRAM is best spent on KV/concurrency, or on NVFP4 (same quality at half the VRAM, freeing room to co-locate a second model), not on a bigger-but-worse model. Mistral-Small-4-119B is the lone alternative, and only if the card is redefined as a cited-RAG / vision / compliance resident.
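The VRAM argument above can be checked with back-of-envelope arithmetic. The sketch below estimates weights-only footprints for a 35B-parameter model on a 96 GB card. It assumes roughly 4.5 effective bits per weight for NVFP4 (4-bit values plus one 8-bit scale per 16-weight block) and ignores KV cache, activations, and any layers left unquantized, so the numbers are indicative only.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


CARD_GB = 96.0    # single RTX PRO 6000 card, per the question above
PARAMS_B = 35.0   # nominal total parameters; all MoE experts stay resident

# Bits per weight; the NVFP4 figure is an assumption (see lead-in).
FORMATS = {"BF16": 16.0, "FP8": 8.0, "NVFP4": 4.5}

if __name__ == "__main__":
    for name, bits in FORMATS.items():
        used = weight_footprint_gb(PARAMS_B, bits)
        headroom = CARD_GB - used
        print(f"{name:6s} weights ~{used:5.1f} GB, headroom ~{headroom:5.1f} GB")
```

Under these assumptions the NVFP4 weights take a little over half of the FP8 footprint, which is the headroom the answer proposes spending on KV cache, concurrency, or a co-located second model.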
Is a purpose-built Rust inference engine (Atlas) faster than our tuned vLLM on Blackwell? (measured 2026-06-07) No. On identical GB10 (DGX-Spark-class) hardware and the same Qwen3.6-35B-A3B-NVFP4 model, our tuned vLLM (NVIDIA MTP recipe) ran 116–119 tok/s steady-state vs Atlas’s 88.9; Atlas’s advertised “130–133 tok/s” and “3.1× faster than vLLM” did not reproduce (the 3.1× is vs an untuned vLLM). Atlas serving is quality-preserving — blog 0.944 (ties our blog leader) and 6/6 on a coding spot-check — and ships an ~8×-smaller (2.98 GB) no-Python single binary. That makes it a candidate packaging vehicle for an air-gapped compliance appliance, not a throughput upgrade. Its multi-node expert-parallel mode is not yet shipping (runtime is single-node only).
Does an agentic multi-hop retriever beat single-shot RAG for compliance Q&A? (measured 2026-06-07) On hard multi-hop CMMC / NIST 800-171 questions, an RL-trained search agent (Harness-1, 21B, gpt-oss-20b base) found every gold control (retrieval recall 1.000) where single-shot dense top-8 reached only 0.881 — it recovers the deep 2nd/3rd-hop controls single-shot drops at production cutoffs. But its curated answer (0.929) only matched single-shot top-15 (the curation step, not the search, is the bottleneck) and cost ~1,000× the latency — so the value is exhaustive batch retrieval (audit / SSP gap analysis), not interactive RAG. Control-id deduplication remains the cheap universal lever: it lifts both single-shot (0.786→0.881) and the agent (0.905→1.000).
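To illustrate why control-id deduplication lifts recall at a fixed cutoff, the sketch below compares a plain top-k cut with a deduplicated one on toy data. The ranked list, chunk texts, and gold set are made up for illustration; only the mechanism (duplicate chunks of one control crowding a deeper control out of the top k) reflects the answer above.

```python
def top_k(ranked, k):
    """Plain cutoff: keep the first k (control_id, chunk) pairs."""
    return ranked[:k]


def top_k_deduped(ranked, k):
    """Keep the first chunk per control id, then cut at k distinct controls."""
    seen = set()
    kept = []
    for control_id, chunk in ranked:
        if control_id in seen:
            continue
        seen.add(control_id)
        kept.append((control_id, chunk))
        if len(kept) == k:
            break
    return kept


def recall(retrieved, gold_ids):
    """Fraction of gold control ids present in the retrieved set."""
    gold = set(gold_ids)
    found = {control_id for control_id, _ in retrieved}
    return len(gold & found) / len(gold)


# Toy ranking: two chunks of the same control push a third control past k=3.
RANKED = [
    ("3.1.1", "chunk A"),
    ("3.1.1", "chunk B"),
    ("3.1.2", "chunk C"),
    ("3.1.5", "chunk D"),
]
GOLD = ["3.1.1", "3.1.2", "3.1.5"]

if __name__ == "__main__":
    print("plain top-3 recall:  ", round(recall(top_k(RANKED, 3), GOLD), 3))
    print("deduped top-3 recall:", round(recall(top_k_deduped(RANKED, 3), GOLD), 3))
```

Deduplication costs one pass over the ranked list and no extra model calls, which is why it applies equally to single-shot retrieval and to the agent's accumulated results.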
Methodology: PTG llm-benchmark harness; cross-judged by Claude Haiku-4.5 (pre-Session-13) or GPT-4.1 (Session-13 onwards). Temperatures: coding / tool-call / adversarial leaderboards filter to temp=0.0; blog uses each model's best temperature from the Session-14 Bench-2 sweep where covered. Excluded: misconfigured/errored runs and hardware-specific Mac coding runs (see Apple Silicon matrix).
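The generator != judge rule could be enforced with a small selection helper like the one below. This is a hypothetical sketch, not the harness's actual logic: the judge ids, the prefix match used to recognize dated model ids, and the fallback order are all assumptions.

```python
# The two cloud judges named in the methodology (ids are illustrative).
JUDGES = ("gpt-4.1", "claude-haiku-4-5")


def pick_judge(generator: str, preferred: str = "gpt-4.1") -> str:
    """Return the preferred judge unless it is the generator itself.

    Assumption: a dated id such as "claude-haiku-4-5-20251001" counts as the
    same model as "claude-haiku-4-5" (crude prefix match).
    """
    if preferred not in JUDGES:
        raise ValueError(f"unknown judge: {preferred}")
    other = next(judge for judge in JUDGES if judge != preferred)
    if generator.lower().startswith(preferred):
        return other
    return preferred


if __name__ == "__main__":
    print(pick_judge("Qwen3.6-35B-A3B"))  # gpt-4.1
    print(pick_judge("gpt-4.1"))          # claude-haiku-4-5
    print(pick_judge("claude-haiku-4-5-20251001", preferred="claude-haiku-4-5"))  # gpt-4.1
```

Passing `preferred="claude-haiku-4-5"` would correspond to the pre-Session-13 default described in the methodology, and the default argument to Session-13 onwards.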