Why AI Governance Cannot Depend on the LLM Itself

There is a category error at the center of most AI governance implementations today, and it has quietly infected how enterprises think about deploying AI in regulated workflows. The error is this: using a probabilistic system to enforce deterministic policy. A language model is, at its core, a stochastic function. It maps input tokens to output distributions. Even at temperature zero, the actual sampling behavior, batching effects, numerical precision, and runtime environment can produce output variations that violate the strict reproducibility requirements of regulated industries. You cannot use a stochastic function as a compliance gate. The properties are incompatible by construction.

Yet this is precisely what a large class of "AI governance" products do. They pass the user's request to an LLM — sometimes a separate "judge" model, sometimes a moderation endpoint — and ask that LLM whether the action is permitted. The LLM returns a verdict. The system acts on that verdict. This approach has a name in the industry: LLM-as-judge. And it fails under adversarial conditions in ways that deterministic systems do not.

The Fundamental Incompatibility

Consider what an audit trail requires. When a regulator or auditor asks "what happened in this transaction and why was it permitted," the answer must be precise, deterministic, and independently verifiable. The answer cannot be "we asked a language model and it said yes." That answer is not verifiable. The language model that produced that verdict no longer exists in the same state. Its weights may have been updated. The prompt template may have changed. The system prompt may have been modified. The temperature setting may differ.

You cannot replay an LLM verdict. You can re-run the inference, but you cannot prove you've reproduced the original computation. The execution history is gone.

The deterministic approach does not merely produce better audit trails. It produces the only kind of audit trail that matters legally: one where the decision can be re-derived from first principles without trusting the infrastructure that produced it.

Contrast this with deterministic enforcement. A rule like "deny any request where the actor role is VIEWER and the action class is WRITE" has a single correct output for any given input. It can be evaluated in microseconds without any model call. It can be replayed exactly. It can be signed cryptographically. A second party with no access to any running system can verify the verdict independently, given only the signed input, the signed output, and the published rule set.

Why "Prompt Firewalls" Fail

A common architectural pattern is what might be called a prompt firewall: a preprocessing step that injects policy instructions into the system prompt before the user's message reaches the model. The policy is written in natural language. The model is instructed to refuse certain categories of requests. This approach has several failure modes that compound under adversarial conditions.

Natural language policy is semantically ambiguous. The model interprets the policy using its internal understanding of language, which can be manipulated through careful prompt construction. An adversary who knows the model well enough can find phrasings that satisfy the literal pattern-matching of the firewall while subverting the intent of the policy.
Policy enforcement happens inside the same probabilistic system it governs. There is no separation between the enforcement plane and the execution plane. The LLM must simultaneously follow the policy and generate the output. These two objectives can conflict, and the model resolves conflicts probabilistically.
No cryptographic binding between policy instruction and enforcement decision. You cannot prove the policy was active when the decision was made. You cannot prove the model faithfully applied the policy rather than overriding it. You have only the output, not a verifiable chain of custody.

The Semantic Moderation Layer Problem

A related approach uses semantic similarity — vector embeddings — to classify requests against a policy taxonomy. A request is embedded and compared to embeddings of policy categories. If the cosine similarity to a prohibited category exceeds a threshold, the request is denied. This sounds more rigorous. It is not.

Semantic similarity measures are not policy logic. The embedding space is a continuous geometric surface, not a decision tree. The boundary between permitted and prohibited is a hyperplane in high-dimensional space. Adversarially crafted inputs can cross that hyperplane while appearing legitimate. The entire field of adversarial examples in machine learning documents this precisely: small perturbations in input space can produce large movements in embedding space, crossing classification boundaries invisibly.

More critically, this approach still does not produce replayable verdicts. Two different embedding models produce different similarity scores for identical text. Quantization differences between model versions produce different scores. The verdict is a function of the live model state, not of the input alone.

What Pre-LLM Enforcement Actually Means

The correct architecture separates the governance plane from the model execution plane entirely. Governance rules are compiled into a deterministic runtime — essentially a decision engine — that executes before any LLM call. The runtime evaluates a finite, explicit rule set with no probabilistic components; produces a verdict that is a pure function of the signed input; signs the verdict cryptographically using HMAC-SHA256 minimum; appends the signed verdict to an immutable, hash-chained audit record; and returns the verdict to the caller before any model invocation occurs.

If the verdict is DENY, no model call is made. The model never sees the request. The governance decision is not subject to model behavior in any way.

The security property: An attacker who compromises the model — through prompt injection, jailbreaking, fine-tuning attacks, or supply chain compromise — cannot compromise governance decisions that happened before the model was invoked. The enforcement plane is outside the attack surface of the model.

Contradiction-Proof Verdict Binding

One of the most subtle failures in semantic governance systems is what might be called the T5 problem: a system can detect that a request violates policy and still output PASS. This happens when verdict derivation and verdict enforcement are separate operations. The governance system detects a violation, records it, and then returns a verdict that does not reflect the detection. The two operations are semantically coupled but structurally decoupled.

Deterministic enforcement eliminates this failure class structurally. When the rule evaluation engine produces DENY, that verdict propagates through the system as a typed value. The output path does not have a separate code path that could return PASS. The verdict is not translated through a natural language intermediate that could be reinterpreted. The DENY decision and the DENY output are the same computational artifact.

This is what "contradiction-proof" means in a governance context. Not that a checker verifies the output after the fact, but that the system architecture makes the contradiction impossible to construct.

What Authority Verification Requires

Proper authority verification happens outside the model. It requires a signed credential from a trusted issuer, a delegation chain with explicit scope boundaries, and a verification step that checks the chain before any model invocation. The verification is mathematical, not semantic. The model is never given an opportunity to interpret an authority claim.

The implication for enterprise AI procurement is direct: any system where authority verification happens inside the model's context window is not operating with real authority controls. It is operating with social engineering controls, and those controls fail against determined adversaries.

Regulated AI infrastructure requires deterministic governance. The probabilistic alternative is not a conservative choice — it is a false choice that defers the compliance problem rather than solving it.

End