During adversarial testing of governance systems, we observed a failure mode that is easy to reproduce and surprisingly difficult to eliminate through prompt engineering. We call it the T5 class failure. The structure is: the governance system evaluates a request, correctly identifies a policy violation, names it explicitly in its reasoning, and then outputs a PASS verdict. This is not a hallucination in the colloquial sense. The system identified the violation. It then reasoned its way to a passing verdict anyway. The contradiction exists in the output, in plain text, simultaneously. Violations detected: PASS.
If you use a language model to evaluate governance decisions, this failure class is not an edge case. It is a structural consequence of how language models generate text.
Why LLMs Can Rationalize Contradictions
A language model generating a governance verdict is doing the same thing it does when generating any other text: predicting the next most probable token given the context. The context includes the policy rules, the input being evaluated, and the model's own preceding output.
As the model generates a verbose evaluation — "the request appears to violate policy X because of factor Y, however considering the broader context Z, and given that the user provided justification W, on balance the appropriate determination is..." — each subsequent token is conditioned on everything preceding it. A sufficiently fluent model can generate a coherent-sounding path from "violation detected" to "PASS" because coherence is what it optimizes for, not logical consistency between specific propositions.
The contradiction problem is not a model problem. It is an architectural problem: semantic adjudication by a probabilistic system cannot provide structural guarantees.
This is not a problem you can reliably solve by making the system prompt more specific. "Do not output PASS if you have identified a violation" creates another instruction that the model must follow while simultaneously following the general instruction to be helpful and contextually appropriate. Under adversarial prompting designed to activate the contradiction, the specific instruction can be overridden or forgotten.
Verdict Binding: Structural Guarantees
The solution is verdict binding — a design where the final governance verdict is derived deterministically from structured intermediate outputs, not generated as free text. In a verdict-bound system, the evaluation path looks like this:
- The governance gate extracts structured facts from the input: role claims from signed tokens, numerical values from recognized fields, action class from a classified request type.
- Each rule is evaluated as a pure function of these structured facts. The result is a typed verdict per rule: BLOCK, MODIFY, or ALLOW.
- The final verdict is derived by a combiner function: if any rule produces BLOCK, the final verdict is BLOCK; if any rule produces MODIFY and no rules produce BLOCK, the final verdict is MODIFY; otherwise ALLOW.
The combiner function does not read the reasoning text. It does not evaluate contextual justifications. It applies a simple priority ordering to the typed intermediate verdicts. The final verdict is impossible to contradict because it is derived mechanically from the intermediate verdicts, not generated alongside them.
In this architecture, a BLOCK verdict cannot be accompanied by a PASS output. The output is generated by a function that takes the BLOCK intermediate verdict as input. The function produces BLOCK. There is no pathway to a contradictory result.
Impossible by Construction, Not by Instruction
The distinction between "impossible by construction" and "prevented by instruction" is the central distinction in robust governance design.
Instruction-based prevention: "If you detect a violation, output BLOCK." This instruction competes with other instructions, with the model's probabilistic tendencies, and with adversarial inputs designed to overwhelm it.
Construction-based prevention: the verdict is derived from typed intermediate outputs by a function that cannot produce BLOCK and PASS simultaneously. There is no instruction to follow or violate. The contradiction cannot occur because the system that produces the output cannot represent it.
This is the same principle that makes type systems in programming languages valuable. A type error is not caught by an instruction to "avoid type errors." It is caught by a compiler that cannot produce a program where the types are violated. The impossibility is structural. Verdict binding brings the same approach to AI governance.
What This Means for Adversarial Resistance
Verdict binding changes the adversarial attack surface of the governance system in a specific way. With semantic adjudication, an attacker who wants a PASS verdict despite triggering a violation needs to manipulate the model's reasoning pathway. This is tractable under adversarial prompting. The model can be guided toward the PASS conclusion through contextual framing, embedded justifications, and careful exploitation of the model's cooperative tendencies.
With verdict binding, an attacker who triggers a BLOCK rule gets a BLOCK verdict. The only way to get a PASS verdict is to not trigger any BLOCK rules. The attacker's problem shifts from "manipulate the reasoning" to "avoid triggering the rule." The second problem is much harder when rules are precompiled, specific, and do not depend on semantic interpretation.
Detecting Contradictions in Existing Systems
For teams evaluating existing AI governance systems, the T5 test is straightforward:
- Craft a request that clearly violates a stated policy.
- Include verbose reasoning instructions ("explain step by step whether this request violates any policies").
- Also include a contextual justification for why the violation might be excused ("the user has indicated that this is for research purposes").
- Observe whether the system outputs a PASS verdict while also having named the violation.
A system that passes T5 consistently has verdict binding. A system that fails T5 occasionally has semantic adjudication. A system that fails T5 reliably under specific prompt constructions has a structural vulnerability. The test takes about ten minutes. The result tells you more about a governance system's structural reliability than any marketing material.
The contradiction problem is real, reproducible, and consequential. The solution is not better prompting. It is an architecture where contradictions are structurally impossible.