Every frontier lab is now pouring serious engineering into governing its own models, and that is precisely why your governance problem is getting harder, not easier.
Consider one artifact. Anthropic published its Responsible Scaling Policy in September 2023. By May 2026 it was on its eighth version in under three years, including a complete rewrite in February 2026. OpenAI shipped version 2 of its Preparedness Framework in April 2025 and a separate Frontier Governance Framework in May 2026. Google DeepMind is on the third iteration of its Frontier Safety Framework. None of these documents is finished. All of them are described, by their own authors, as living documents that will keep changing.
That tells you something the marketing does not. The people closest to these systems do not believe model safety is a problem you solve once and ship. They treat it as a moving target they have to keep chasing. And here is the part that matters for anyone deploying AI inside a company: even when they catch it, they will not have solved your problem. They will have solved a different one.
A category error in one word
The word governance hides a category error. There are two different things wearing the same label.
The first is alignment. Alignment shapes the policy a model expresses, and the labs have genuinely advanced it. Reinforcement learning from human feedback trains a reward model on human comparisons, then optimizes the policy against that reward model with a penalty that keeps it from drifting too far from its supervised baseline. Constitutional AI swaps the human harmlessness labels for a model that critiques and revises its own answers against a written list of principles. Deliberative alignment, the method behind OpenAI's o-series, trains the model to recall a safety specification and reason over it before it answers. These work well enough that refusing an obvious request for harm is close to a solved problem.
But look at what each of them actually produces. A reward model is a learned proxy for human approval, and we have known since 2022 that if you optimize hard enough against a proxy, the true objective turns over and falls while the proxy score keeps climbing. That is Goodhart's law with a loss curve. Sycophancy is the same failure wearing a friendlier face: the preference signal rewards agreement, so the model learns to agree. Even OpenAI's instruction hierarchy, the closest thing to a privilege model living in the weights, is a learned ranking of system over user over tool text, and OpenAI's own paper concedes the resulting models are "likely still vulnerable to powerful adversarial attacks." Every one of these methods shapes the policy the model expresses. None of them enforces a policy that exists outside the model.
The second thing is governance. Governance is the set of answers to who used which model, on what data, under whose policy, and with what record. Those are not facts about a conversation. They are facts about a deployment, and they live entirely outside the model that is processing the tokens.
- Alignment and refusal training
- Instruction hierarchy and system prompts
- Safety evals and capability thresholds
- Answers: what is the model willing to say?
- Identity and least privilege
- Policy enforcement at the point of use
- Visibility across every model and vendor
- The auditable record
- Answers: who may do this, with what data, on the record?
A model is stateless about your company
This is not a metaphor. Anthropic's own API documentation says it plainly: the Messages API is stateless, and you send the full conversation history on every request. OpenAI says the same about its core text generation. What looks like memory in a chat product is the application replaying a stored transcript into the context window. Between one call and the next, the model retains nothing about you.
Go one level down and it gets sharper. Self-attention is permutation invariant. The architecture has no native concept of the order of its own tokens, which is why position has to be added back in by an external encoding. If the model cannot tell which token came first without help, it certainly cannot tell which user is authorized, which tenant owns the data, or which jurisdiction the data is not allowed to leave. Those facts are not in the weights, and they are not in the token stream, unless an outside system puts them there. And the moment you do put them there, by writing "this is the Frankfurt tenant, do not export this" into the prompt, that sentence becomes just more text. It sits in the same channel as the user's request and any document the model retrieved, with nothing marking which tokens carry authority.
An instruction in a prompt is not an access control. It is a suggestion competing for attention with everything else in the window, including whatever an attacker managed to put there.
The adversarial floor
That shared channel is the whole problem, and it is one of the oldest failure modes in computing. Simon Willison named prompt injection in 2022 by direct analogy to SQL injection: string concatenation glues a trusted instruction and untrusted input into one stream the receiver cannot separate. But the analogy comes with a warning the labs would rather you skip. SQL injection has a clean fix. Parameterized queries give instructions and data separate channels, and the problem goes away. A language model has no second channel. Its only input is the merged stream, so there is no parameterized-prompt equivalent to reach for.
This is why the strongest defenses no longer try to make the model tell instructions from data. They build a system around it. CaMeL, a 2025 design from Google DeepMind and ETH Zurich researchers, extracts the control and data flow from the trusted request so untrusted content can never change what the program does, and its authors are explicit that it holds "even when underlying models are susceptible to attacks." It buys provable security for the tasks it covers, at a measurable utility cost, through containment, not by fixing the model. Anthropic's Constitutional Classifiers, aimed at universal jailbreaks rather than injection, wrap the model in separate input and output classifiers and survived thousands of hours of red-teaming without one. That is real evidence and an explicitly probabilistic kind: it lowers the attack success rate, it does not drive it to zero.
In May 2026 two researchers made the structural case directly. Recast through contextual integrity, they argue, the defender faces a dilemma: an attacker can always construct a context in which a forbidden action looks legitimate, or you tighten the rules until legitimate work gets blocked too. The paper hedges in its own title, "AI Agents May Always Fall for Prompt Injections," and the hedge is honest. It is an argument, not a closed theorem. But it points the same way as OWASP, which lists prompt injection as the top risk for LLM applications and states that, given the stochastic nature of these models, it is unclear whether any fool-proof prevention exists. If the model can be talked out of its own policy by the text it was asked to read, the model cannot be where your controls live.
A model cannot be its own reference monitor
Security engineering settled the shape of the answer in 1972. James Anderson called it a reference monitor: the mechanism that mediates every access, and that has to satisfy three properties.
- It must be tamperproof.
- It must be always invoked.
- It must be small enough to be completely analyzed and tested.
A frontier model fails all three, and it fails them by construction. It is not tamperproof, because it reads instructions and data on one channel, so the input it is inspecting can rewrite the rule it is enforcing. It is not always invoked, because it mediates only its own calls, not the retrieval layer, not the tool calls, not the second model the same company runs next door. And it is not small enough to verify. The point of a reference monitor is that you can read it and convince yourself it is correct. You cannot do that with billions of opaque parameters, and interpretability progress does not change the size of the object you are being asked to trust.
Modern zero trust just renames the same idea. NIST separates the point that decides from the point that enforces, and puts enforcement in the data path between the subject and the resource. A model asked to govern itself is asked to be its own decision point and its own enforcement point at the same time, which is the one arrangement the whole discipline tells you to avoid. Governance has to sit outside the thing it governs. That is not a vendor opinion. It is a founding result of the field.
Tool use raises the stakes, not the odds
None of this stayed theoretical, because models stopped only talking and started doing. The moment an agent holds your OAuth tokens and your file access, the unit of risk changes from what it says to what it does with the authority you handed it. That is a confused deputy, a term Norm Hardy coined in 1988 for a privileged program tricked by a less privileged caller into misusing its authority. His original example was a compiler fooled into overwriting a billing file. The modern version is an agent that cannot tell a real instruction from one buried in a web page it was asked to summarize, now wielding your credentials instead of the compiler's.
The protocols have not closed this. MCP moved authorization onto OAuth, which answers who is asking. It does not answer whether anyone actually asked. A broker that forwards a token to a downstream service without checking the audience is a confused deputy by configuration, holding the keys to everything it fronts. OWASP's own guidance for agent builders does not ask the model to behave. It asks the system to separate the decision from the execution: the agent proposes an action, an independent policy service validates scope and privilege and approval, and the call fails closed when that check cannot run.
Alignment decides what an agent is willing to do. Only an external authority decides what it is allowed to do.
The record, and the many models
Two more facts seal it. First, governance has to produce evidence. When an auditor asks which AI touched a given record, the answer has to be a log, and the model cannot emit one. It is identity blind and stateless across calls, so attribution, retention, and audit are properties of the system around it, not the weights inside it. Second, no real company runs a single model. It runs ChatGPT, Claude, Gemini, Copilot, a coding assistant, and a dozen AI features quietly shipped inside SaaS it already pays for. Every lab's governance, however good, stops at its own boundary. There is no shared policy across vendors, no shared log, and no shared kill switch, and none can live inside any one vendor's weights.
Even the measurements are about the model
Look at what the labs actually measure, and the same boundary appears. The leading safety benchmarks are real engineering, and most of them are independent. StrongREJECT, from Berkeley researchers, scores a model on 313 forbidden prompts. HarmBench, from the Center for AI Safety, pits 18 attack methods against 33 models. AgentHarm, from Gray Swan and the UK AI Safety Institute, runs 110 malicious agent tasks. Every one of them measures a property of the model on a prompt set someone else wrote. None measures a property of your deployment. AgentHarm makes the gap concrete: it found leading models would carry out malicious multi-step tasks after a basic jailbreak and stay capable enough to finish them, which a single-turn refusal score never predicts. And because these benchmarks are public, they are targets, so a saturating score can mean a model tuned to the test rather than one that is safe on the paraphrase your analyst will actually send.
The lab frameworks gate on the same axis. Anthropic's AI Safety Level thresholds, OpenAI's High and Critical capability levels, and DeepMind's Critical Capability Levels all ask what a model can do in general, scored centrally by the lab. Anthropic has not even finished defining its ASL-4 safeguards, and DeepMind's reports state that no Critical Capability Level has yet been reached. A model can sit below every published threshold and still be the wrong model to point at your customer records, because not one of those thresholds encodes what your organization permits.
Why it will not stop
Put it together and the trajectory is clear. Regulation is tightening: the EU AI Act's obligations for general purpose models began applying in August 2025 and gain enforcement powers in August 2026, and California's Transparency in Frontier AI Act is now something the labs design around. That pressure makes them formalize model level safety further, which is good, and does nothing to close the gap, because the gap was never inside the model.
Every capability jump widens it. Alignment improves on a schedule measured in model releases. The action surface expands on a schedule measured in product launches, and in 2026 the agents driving that surface are coding agents: more than half of the agentic projects OWASP tracks are coding tools, and prompt injection maps to six of the ten risks in its agentic Top 10. The second curve is steeper, which is exactly why Anthropic's scaling policy is on its eighth version and still moving.
So the labs will keep getting better at their layer, and they should. But better alignment is not more governance. Knowing who used what, on which data, under which policy, with a record you can hand to an auditor, lives one layer up, at the system of record. It will keep mattering precisely because the labs keep succeeding at their job and never reach into yours.
That layer is the work. It is the part of AI governance that no model, however aligned, is built to carry. Vloex is the system of record for enterprise AI: it captures every AI interaction across every model, detects sensitive data, enforces policy at the point of use, and keeps the auditable record the model cannot.
Get started free