There has been a quiet assumption underneath every conversation about private AI: that running a model on your own hardware means accepting second-tier quality. The best models live in the cloud, the thinking goes, and what you can host yourself is a compromise you make for confidentiality. Our benchmark program keeps finding evidence against that assumption, and the newest result is the strongest yet. The single highest-quality open-weight model we have ever measured, a model we previously had to shelve because it generated too slowly to be usable, now runs at production speed on hardware a firm can own outright.
This matters most for the organizations we serve at Petronella Technology Group, Inc.: law firms holding privileged client files, defense contractors handling Controlled Unclassified Information (CUI), and healthcare and financial organizations bound by HIPAA and GLBA. For all of them, the cloud question is not about preference. It is about whether the data can leave the building at all.
The quality ceiling problem, in plain terms
Open-weight AI models come in two broad shapes. Mixture-of-experts models activate only a small slice of their parameters for each token they generate, which makes them fast, and they are what most self-hosted deployments run today. Dense models activate everything, every token. Dense models of a given size tend to score at or near the top on quality, but that same everything-at-once design makes them punishing to serve: every token requires reading the entire model out of GPU memory.
In our June 2026 benchmark sweep, a dense 27 billion parameter open model posted the best generation quality we had measured from any model we can host: a coding score of 0.990 from an independent GPT-4.1 cross-judge, alongside perfect long-form writing scores. And then we shelved it. On a workstation-class GPU with 96 GB of memory, the same card class many businesses use for local AI, it generated roughly 28 tokens per second. That is too slow for interactive work and too slow to serve a team. The best model we had ever tested was, in practice, unusable. We wrote exactly that in our internal notes: highest quality measured, wrong serving shape, quality unreachable in production.
What changed: memory bandwidth, not magic
The bottleneck for dense models is not compute. It is memory bandwidth, the speed at which the GPU can read its own memory. Because every generated token requires a full read of the weights, a dense model's single-user generation speed scales almost linearly with that bandwidth. Data-center GPUs built around HBM (high-bandwidth memory), such as the NVIDIA H200 class we tested, move data at roughly 4.8 TB/s, several times faster than workstation-class cards.
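The bandwidth argument can be turned into a back-of-the-envelope estimate. The sketch below is illustrative only and is not part of our benchmark harness. It assumes single-user decoding reads every weight exactly once per token, takes BF16 as 2 bytes per parameter, and ignores KV-cache traffic, activations, kernel overhead, and any decoding optimizations in the serving stack, so measured speeds can land on either side of it. The function name is hypothetical.

```python
def estimate_decode_tokens_per_second(bandwidth_bytes_per_s, param_count, bytes_per_param):
    """First-order estimate: each token costs one full read of the weights."""
    weight_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# 27 billion parameters in BF16 (2 bytes each) on a GPU moving roughly 4.8 TB/s.
h200_estimate = estimate_decode_tokens_per_second(4.8e12, 27e9, 2)
print(f"{h200_estimate:.0f} tokens/s")
```

Treat the result as an order-of-magnitude check rather than a prediction. Its value is in showing why doubling bandwidth roughly doubles dense-model speed while adding compute does not.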
So we re-tested the same 27 billion parameter model, unquantized, in full BF16 precision, on an H200-class GPU server in our own data center. The results:
- 92 tokens per second for a single user, up from roughly 28 on the workstation-class card. That is roughly a 3.3x speedup from hardware alone, with zero change to the model.
- 96 tokens per second with the model's step-by-step reasoning mode enabled.
- Roughly 675 tokens per second aggregate when serving 8 concurrent requests, which is team-scale throughput.
- Quality held. On our 22-case coding evaluation the served model passed 22 of 22 cases with a mean score of 0.984, in line with the 0.990 from the June sweep and graded by the same independent GPT-4.1 cross-judge we use for every published number.
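The arithmetic behind those throughput figures is easy to check. A short sketch using only the rounded numbers reported above; the variable names are illustrative, and the per-request figure assumes the aggregate is split evenly across requests:

```python
workstation_tps = 28.0    # roughly, single user, workstation-class card
h200_tps = 92.0           # single user, H200-class server
aggregate_tps = 675.0     # roughly, across 8 concurrent requests
concurrent_requests = 8

speedup = h200_tps / workstation_tps                   # hardware-only gain
per_request_tps = aggregate_tps / concurrent_requests  # assumes an even split
concurrency_gain = aggregate_tps / h200_tps            # total output vs. one user

print(f"speedup: {speedup:.1f}x")
print(f"per-request: {per_request_tps:.0f} tokens/s")
print(f"aggregate vs single user: {concurrency_gain:.1f}x")
```

Under that even-split assumption, each of the 8 requests still sees roughly 84 tokens per second, so sharing the server costs an individual user little while total output rises about 7.3 times.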
To keep ourselves honest, the caveats: the 0.984 coding score is a single run (N=1), and these serving results have not been committed to our public leaderboard, where we hold numbers to repeated-run standards. We publish the rigor label with the result, every time. That discipline is the same one we apply to compliance evidence, and it is why we are comfortable putting our name on these figures.
Why "unquantized" is the word that should get your attention
Most self-hosted AI deployments shrink models through quantization, trading numerical precision for memory savings. Done well, the loss is small; our own testing has repeatedly found 8-bit formats to be essentially lossless. But for organizations that must defend their AI outputs, in a courtroom, in an audit, in a CMMC assessment, "essentially" carries weight. Every layer of compression is one more thing to validate and one more question from an assessor.
The configuration we measured here removes the question entirely. BF16 is the model's native training precision. What we served is bit-for-bit the model the researchers published and benchmarked, with no compression applied. For regulated work, that is the cleanest possible provenance story: the exact published artifact, running on hardware you control, behind your own firewall.
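One way to back a provenance claim like this with evidence is to hash the weight files on disk and compare them against digests recorded from the publisher. A minimal sketch, assuming SHA-256 digests are available for each file; the helper names are hypothetical, and this does not describe any specific tooling:

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large weight shards never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(expected_digests):
    """Return the paths whose on-disk digest differs from the expected one.

    expected_digests maps a file path to its published hex digest
    (a hypothetical input format chosen for this illustration).
    """
    mismatches = []
    for path, expected in expected_digests.items():
        if sha256_of_file(path) != expected.lower():
            mismatches.append(str(path))
    return mismatches
```

An empty list from `verify_weights` means every file matched its recorded digest, which is the kind of artifact an assessor can check independently.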
What this means for a law firm or a defense contractor
Put the pieces together and the picture is specific enough to plan around:
- The quality argument for the cloud is shrinking. Our earlier testing found a self-hosted coding model resolving real software issues at rates close to commercial cloud offerings. This result extends that finding upward: the very best open-weight quality we can measure is now servable in-house, not just the fast-but-lighter tiers.
- Privileged data never leaves. An on-premises model means client files, case strategy, CUI, and PHI are processed inside your security boundary. There is no third-party terms-of-service question, no data-retention ambiguity, and no discovery surprise. For attorneys, this is the difference between "the vendor says it is private" and "it never leaves hardware we control."
- The hardware is a capital line item, not a moonshot. An H200-class server is a serious purchase, but it is a knowable, depreciable asset that serves an entire team, and the measurement above shows one such server holding team-scale concurrency. Firms already spend comparably on document management and eDiscovery platforms.
- Compliance frameworks favor this architecture. CMMC and NIST SP 800-171 are fundamentally about knowing where CUI lives and who can touch it. An air-gapped or enclave-hosted model with static, auditable weights is dramatically easier to scope than a cloud AI service. We have written about this architecture in our compliance practice, and it is the foundation of the private AI enclaves we build.
The honest limits
We publish what the data supports and no more, so three limits belong in this article. First, these are our measurements on our hardware; your workload, context lengths, and concurrency will move the numbers. Second, as noted, the quality re-test is a single run (N=1) and the figures have not been committed to our public leaderboard. Third, raw generation speed is not the only serving consideration: long-document workloads stress memory in other ways, and a production deployment needs monitoring, access control, and update discipline around the model, which is precisely the operational layer where most self-hosted AI projects fail without experienced help.
None of those limits changes the headline finding. The gap between "best model we can host" and "best model we can actually serve" just closed, and it closed because of a hardware class that any well-advised firm can procure.
How we test, and why we show our work
Every number above comes from the same benchmark harness we run across our AI fleet: a 22-case coding evaluation spanning shell operations, Python automation, configuration editing, SQL, debugging, and refactoring, scored by both deterministic checks and an independent cross-judge. We never let a model grade its own family; our published scores are graded by GPT-4.1, a model from a different vendor than anything we test, because our own audits found same-family judging inflates scores substantially. We benchmark on hardware we own, we keep the raw result files, and we label every number with its rigor level.
We do this because our clients cannot act on marketing claims. A managing partner deciding whether privileged files can touch an AI system, or a contracts executive deciding what belongs in a CUI enclave, needs measured evidence from someone who will still be accountable after the purchase. That is the standard we hold our AI practice to, and it is the same evidence-first posture we bring to cybersecurity engagements.
A practical path forward
If your organization is weighing private AI, the sequence we recommend is the one we use ourselves:
- Classify the data first. Privileged, CUI, PHI, and trade-secret material defines the boundary. If any of it would flow through the AI, the AI belongs inside your perimeter.
- Size the hardware to the model shape. Fast mixture-of-experts models run well on workstation-class GPUs; top-quality dense models want HBM-class bandwidth. The right answer is often one of each, and measured benchmarks, not vendor decks, should drive the choice.
- Demand native-precision provenance where it matters. For defensible workloads, prefer the published model artifact at its native precision, validated against a cross-judged benchmark, over an aggressively compressed variant.
- Build the enclave, not just the server. Access control, logging, network isolation, and an update process are what turn a GPU box into an auditable system that passes a CMMC assessment.
Frequently asked questions
Is an open-weight model safe to run on confidential documents?
Safer than the alternatives, when deployed correctly. Model weights are static files; on their own they cannot transmit your data anywhere. The risks live in the serving stack and the network around it, which is why we deploy models inside hardened, monitored enclaves rather than as bare servers.
Do we need the newest data-center GPU to get useful private AI?
No. Fast mixture-of-experts models deliver strong quality on workstation-class hardware today. The finding in this article is about the top of the quality range: if you want the best-scoring open model we have measured, served at production speed and native precision, HBM-class hardware is what unlocks it.
What does 92 tokens per second feel like in practice?
Faster than most people read. For interactive use it feels immediate, and at roughly 675 tokens per second aggregate under 8 concurrent requests, one server covers a working team rather than a single power user.
Why does unquantized precision matter for compliance?
Because it removes a validation burden. A quantized model is a modified artifact you must justify; a native-precision model is the exact published artifact the research community benchmarked. When an assessor or opposing counsel asks what your AI actually was, the shorter answer wins.
How do we know these numbers are not cherry-picked?
They are cross-judged by a model from an unrelated vendor (GPT-4.1), backed by retained raw result files, and labeled with their rigor level, including the N=1 caveat and the fact that they have not been committed to our public leaderboard. We publish the limits alongside the wins.
Can our existing IT team run this?
The serving software is mature, but the security envelope, compliance scoping, and operational discipline are where projects succeed or fail. Most internal teams benefit from a partner who has already built and broken these systems.
Talk to the team that measured it
Petronella Technology Group, Inc. designs, builds, and operates private AI enclaves for law firms, defense contractors, and regulated businesses, backed by more than two decades of cybersecurity and compliance practice and by the benchmark program quoted throughout this article. If your firm wants top-tier AI capability without surrendering custody of privileged or controlled data, call us at 919-348-4912 or start with our AI services overview. We will show you the measurements first.