The U.S. Government Is Now Testing AI Models Before They Ship

TLDR

NIST's Center for AI Standards and Innovation signed agreements with Google DeepMind, Microsoft, and xAI to evaluate frontier models before public release.
The expansion raises the number of companies participating in pre-deployment testing from two to five, building on earlier voluntary agreements with OpenAI and Anthropic.
The catalyst was Anthropic's Claude Mythos, which discovered thousands of previously unknown software vulnerabilities, forcing a reckoning over what the next generation of models can do in the wrong hands.

On May 5, 2026, the National Institute of Standards and Technology announced that Google DeepMind, Microsoft, and xAI will submit unreleased frontier AI models for government evaluation before those models reach the public. The agreements, signed through NIST's Center for AI Standards and Innovation (CAISI), represent the most significant expansion of federal AI oversight since the program began in 2024 with OpenAI and Anthropic as its only participants.

The timing is not coincidental. In April, Anthropic released Claude Mythos Preview to a limited set of partners through Project Glasswing, a cybersecurity initiative aimed at identifying critical software vulnerabilities. What Mythos found was sobering. The model autonomously discovered thousands of zero-day vulnerabilities across every major operating system and every major web browser. One of those was a 17-year-old remote code execution flaw in FreeBSD that could grant root access to any machine running NFS. Mythos found it, exploited it, and documented it without human guidance.

That demonstration changed the calculus in Washington. If a model released to vetted security partners could do that, the question became obvious: what happens when a model with similar capabilities becomes broadly available, or when a foreign adversary develops one? The White House had been weighing a formal review process for months. Mythos turned the discussion into action.

CAISI's testing program is focused on what NIST calls "demonstrable risks." Evaluators are primarily concerned with whether frontier models can be used to launch cyberattacks on American infrastructure, assist in the development of chemical or biological weapons, or corrupt the training data of other AI systems. Developers submit models with reduced or removed safeguards so the evaluators can probe the raw capabilities underneath. Interagency experts from the TRAINS Taskforce, a group convened by CAISI that draws from defense, intelligence, and civilian agencies, participate in the evaluations and feed findings back to the developers.

The voluntary nature of the agreements is worth noting. No law compels Google, Microsoft, or xAI to participate. The Trump administration has been broadly skeptical of prescriptive AI regulation, preferring industry cooperation to legislative mandates. These agreements fit that pattern. Companies submit to testing because the alternative, a regulatory framework imposed by Congress, is less attractive. Voluntary cooperation gives them a seat at the table and some control over the process.

There is also a competitive dimension. OpenAI and Anthropic have been submitting models to CAISI for nearly two years. For Google, Microsoft, and xAI, staying outside the program risked looking like they had something to hide, particularly as the public conversation around AI safety intensified after Glasswing. Joining the program signals responsibility without conceding regulatory authority.

What the program does not yet address is enforcement. CAISI can flag risks, but it cannot block a release. If an evaluation reveals a dangerous capability, the decision to delay or modify a model rests entirely with the developer. There is no public disclosure requirement. There is no penalty for ignoring a finding. The entire system rests on the assumption that developers will act on what the evaluators surface. That arrangement works as long as every participant acts in good faith. Whether it holds as the competitive pressure to ship faster intensifies is an open question.

The Glasswing precedent suggests what good faith looks like in practice. Anthropic chose to release Mythos first to a limited group of security partners, including AWS, Apple, Cisco, CrowdStrike, Google, Microsoft, and Palo Alto Networks, rather than making it broadly available. The goal was to let defenders patch critical systems before attackers could exploit similar capabilities. That kind of staged, coordinated release is exactly what CAISI's program is designed to encourage. But Anthropic made that choice voluntarily, without any regulatory requirement to do so.

The scope of the testing itself is expanding. Early evaluations focused narrowly on whether models could generate instructions for biological or chemical weapons. The addition of cybersecurity to the testing framework reflects how quickly the threat landscape has shifted. A year ago, the primary concern was that AI might help a novice build something dangerous. Today, the concern is that AI can independently find and exploit vulnerabilities in systems that millions of people rely on. That is a qualitatively different risk, and it requires evaluation methods that did not exist when the program launched.

The Pentagon's parallel track adds complexity. On May 1, the Defense Department finalized classified AI agreements with eight companies, including OpenAI, Google, Microsoft, and Nvidia, while excluding Anthropic over its refusal to grant unrestricted model access for autonomous weapons applications. The same government that is testing models for safety risks is simultaneously acquiring them for military use. The company most associated with safety constraints is the one being shut out of defense contracts.

This tension will define AI governance for the next several years. Washington is building two systems in parallel: one that evaluates AI for potential harm before release, and another that deploys AI for national security purposes with fewer restrictions. Those systems will eventually conflict. When they do, the agreements signed this week will be the starting point for whatever comes next.

For the AI industry, the practical effect is clear. Pre-deployment government testing is no longer an outlier practice limited to two companies. It is becoming a norm that five of the leading frontier model developers in the United States (OpenAI, Anthropic, Google DeepMind, Microsoft, and xAI) have now accepted. The era in which a company could build a frontier model and release it without any external review is closing, not by law, but by expectation.

Santage is committed to independent, transparent journalism. This article is produced in accordance with Santage's Editorial Standards and aims to provide accurate and timely information. Readers are encouraged to verify information independently.