Building Verdict Binding: Why Semantic Adjudication Fails and How to Fix It

The intuitive approach to AI governance is to use AI to evaluate AI outputs. Train a classifier on examples of compliant and non-compliant outputs. Run every production output through the classifier. Block the outputs that score as non-compliant. This is semantic adjudication: using semantic evaluation of outputs to determine governance verdicts.

Semantic adjudication fails in a systematic and predictable way. Understanding why it fails, and what the alternative architecture looks like, is the engineering core of deterministic governance.

The T5 Failure Class

The canonical failure of semantic adjudication is the T5 failure class: the system correctly detects that a policy violation has occurred, and then outputs ALLOW anyway.

The adversarial route to a T5 failure exists because the classifier and the model share underlying reasoning pathways. Both systems are trained on similar data, using similar architectures, and both have learned to rationalize edge cases by latching onto features that make an input seem benign. When an adversarially crafted input carries features that push it toward the classifier's benign cluster, the classifier outputs ALLOW even though the content violates policy.

To exploit semantic adjudication, the adversary must find inputs that are simultaneously (a) behaviorally equivalent to prohibited content and (b) stylistically similar to benign content as the classifier sees it. This is a tractable optimization problem, and sufficiently motivated adversaries solve it reliably.

More concerning than adversarial exploitation is the systematic T5 failure: cases where the classifier correctly scores an input as potentially policy-violating but the aggregation function combines the policy-violation score with other features and produces an ALLOW verdict. The policy violation is detected; it is outweighed.
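The "detected but outweighed" case can be made concrete with a toy aggregation function. The sketch below is illustrative only: the feature names, weights, and threshold are hypothetical and are not taken from any real system.

```python
# Toy semantic adjudicator: a weighted sum of feature scores compared
# against a threshold. All names and numbers here are hypothetical.

def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of feature scores; higher means more likely non-compliant."""
    return sum(weights[name] * scores[name] for name in weights)


def semantic_verdict(scores: dict[str, float],
                     weights: dict[str, float],
                     threshold: float = 0.5) -> str:
    """BLOCK when the aggregate risk reaches the threshold, else ALLOW."""
    return "BLOCK" if aggregate(scores, weights) >= threshold else "ALLOW"


# The violation detector fires strongly (0.9), but benign-looking
# stylistic features carry negative weight and pull the total down.
scores = {"policy_violation": 0.9, "formal_tone": 1.0, "domain_jargon": 1.0}
weights = {"policy_violation": 0.6, "formal_tone": -0.1, "domain_jargon": -0.1}

print(semantic_verdict(scores, weights))
```

With these numbers the violation signal alone would clear the threshold, yet the combined score falls below it, so the verdict is ALLOW. Nothing in the pipeline malfunctioned; the aggregation did what it was designed to do.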

Why Contradiction Is Possible Under Semantic Adjudication

Under semantic adjudication, a governance system can produce a verdict that contradicts its own stated policy. The sequence:

Policy states: “Do not provide information that enables [prohibited action].”
Model generates output that provides information enabling the prohibited action.
Classifier evaluates the output. The output is phrased indirectly, uses domain-specific terminology, and does not contain the surface patterns associated with direct policy violations.
Classifier scores: 0.45 (below threshold).
Verdict: ALLOW.
The output is delivered. The policy is violated.

The contradiction is not a bug — it is a structural property of semantic evaluation. The policy is stated in terms of behavioral intent. The classifier evaluates surface features. When behavioral intent and surface features diverge, the evaluation produces incorrect verdicts.

What Verdict Binding Actually Is

Verdict binding is a structural guarantee that a BLOCK verdict cannot coexist with a compliant output, and an ALLOW verdict cannot coexist with a policy-violating output. The guarantee is structural — enforced by the architecture — rather than statistical.

The mechanism is the replacement of semantic evaluation with deterministic derivation from typed intermediate outputs.

Step 1: Action classification. Before any model execution, the governance gate classifies the proposed action into a typed intermediate representation. The classification is deterministic — the same input always produces the same action type. Action types are drawn from a finite, pre-specified set (e.g., INFORMATION_REQUEST, TOOL_INVOCATION, CONTENT_GENERATION, AUTHORITY_CLAIM).

Step 2: Policy evaluation. The policy is expressed as a function from typed action classifications to verdicts. The function is pure — it takes the typed classification as input and returns a verdict, with no access to the raw input text. The function is evaluated deterministically: the same typed classification always produces the same verdict.

Step 3: Verdict derivation. The final verdict is the output of the policy function applied to the typed classification. The verdict is derived, not generated. It is the result of a computation, not an assessment.

The binding property emerges from the typed intermediate: once an input is classified as action type X, the policy function determines the verdict for action type X, and there is no pathway by which the raw input text can influence the verdict except through the typed classification. An adversarial input that is semantically equivalent to prohibited content but stylistically similar to benign content gains nothing from that resemblance once it is classified, because the verdict is derived from the action type, not from surface similarity to benign content.
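The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the shape of the request dictionary, the classification rules, and the verdict assigned to each action type are all assumptions made for the example. Only the action type names come from the text above.

```python
from enum import Enum


class ActionType(Enum):
    INFORMATION_REQUEST = "INFORMATION_REQUEST"
    TOOL_INVOCATION = "TOOL_INVOCATION"
    CONTENT_GENERATION = "CONTENT_GENERATION"
    AUTHORITY_CLAIM = "AUTHORITY_CLAIM"


class Verdict(Enum):
    ALLOW = "ALLOW"
    BLOCK = "BLOCK"


# Hypothetical policy table: which types are blocked is an assumption.
POLICY = {
    ActionType.INFORMATION_REQUEST: Verdict.ALLOW,
    ActionType.TOOL_INVOCATION: Verdict.BLOCK,
    ActionType.CONTENT_GENERATION: Verdict.ALLOW,
    ActionType.AUTHORITY_CLAIM: Verdict.BLOCK,
}


def classify(request: dict) -> ActionType:
    """Step 1: deterministic classification using fixed rules.

    Assumes the gate receives a structured request; the field names
    ("tool", "claims_authority", "kind") are hypothetical.
    """
    if "tool" in request:
        return ActionType.TOOL_INVOCATION
    if request.get("claims_authority", False):
        return ActionType.AUTHORITY_CLAIM
    if request.get("kind") == "generate":
        return ActionType.CONTENT_GENERATION
    return ActionType.INFORMATION_REQUEST


def policy(action: ActionType) -> Verdict:
    """Step 2: pure function that sees only the typed classification."""
    return POLICY[action]


def derive_verdict(request: dict) -> Verdict:
    """Step 3: the verdict is computed from the type, never from the text."""
    return policy(classify(request))


# The free text is reassuring and formal, but it never reaches policy().
print(derive_verdict({"tool": "shell", "text": "A routine, harmless request."}))
```

Because `policy` takes an `ActionType` and nothing else, the wording of the request has no channel through which to affect the result.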

Implementation: The Typed Intermediate

The typed intermediate is the architectural element that makes verdict binding possible. Its implementation requires:

Action type taxonomy: A finite, well-specified set of action types that covers the behavioral space relevant to the governance policy. The taxonomy must be exhaustive enough to classify all inputs, and specific enough to distinguish between actions with different policy implications.
Classification function: A deterministic function that maps raw inputs to action types, so that identical inputs always produce identical classifications. It can use pattern matching, keyword extraction, or rule-based parsing. It cannot use LLM evaluation, which introduces non-determinism and reintroduces the failure modes of semantic adjudication.
Policy function: A pure function from typed action classifications to verdicts. The function is a lookup table or a rule set evaluated against the action type. The function has no access to the raw input — it cannot modify its output based on surface features of the input not captured in the typed classification.
Verdict combiner: When an input triggers multiple action type classifications, the combiner function combines the individual verdicts into a final verdict. The combiner must be conservative: BLOCK combined with ALLOW produces BLOCK. The precedence hierarchy (BLOCK > MODIFY > ALLOW) is enforced structurally.
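One way to enforce the BLOCK > MODIFY > ALLOW precedence structurally is to encode the verdicts as an ordered enum and take the maximum. The sketch below assumes that approach; failing closed on an empty list of verdicts is an added assumption, not something the text above specifies.

```python
from enum import IntEnum
from typing import Iterable


class Verdict(IntEnum):
    # Integer values encode restrictiveness, so max() picks the
    # most restrictive verdict.
    ALLOW = 0
    MODIFY = 1
    BLOCK = 2


def combine(verdicts: Iterable[Verdict]) -> Verdict:
    """Return the most restrictive verdict.

    An empty input yields BLOCK (fail closed); this default is an
    assumption made for the sketch.
    """
    return max(verdicts, default=Verdict.BLOCK)


print(combine([Verdict.ALLOW, Verdict.BLOCK]).name)   # BLOCK
print(combine([Verdict.MODIFY, Verdict.ALLOW]).name)  # MODIFY
```

Since the combiner is a maximum over an ordering, no number of ALLOW verdicts can outweigh a single BLOCK, and the result does not depend on the order in which classifications are reported.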

The Structural Contradiction Impossibility

Under verdict binding with a typed intermediate, the contradiction failure mode is structurally impossible:

The verdict is derived from the action type classification
The action type classification is deterministic
The policy function maps action types to verdicts deterministically
There is no semantic evaluation pathway between raw input and verdict

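An input cannot simultaneously be classified as action type X (which maps to BLOCK) and produce a verdict of ALLOW. The adversarial optimization problem of finding inputs that are classified as a blocked action type yet still produce ALLOW has no solution under this architecture. What remains is the classification step itself, and because that step is deterministic, a misclassification is a reproducible defect that can be found and fixed with a rule change rather than a statistical tendency to be managed.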
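Because the set of action types is finite, the policy function can be audited exhaustively rather than sampled. The sketch below checks that every action type has exactly one recognized verdict; the policy table and the `audit` helper are hypothetical and exist only for illustration.

```python
from enum import Enum


class ActionType(Enum):
    INFORMATION_REQUEST = 1
    TOOL_INVOCATION = 2
    CONTENT_GENERATION = 3
    AUTHORITY_CLAIM = 4


VALID_VERDICTS = ("ALLOW", "MODIFY", "BLOCK")

# Hypothetical policy table used only for this example.
POLICY = {
    ActionType.INFORMATION_REQUEST: "ALLOW",
    ActionType.TOOL_INVOCATION: "BLOCK",
    ActionType.CONTENT_GENERATION: "ALLOW",
    ActionType.AUTHORITY_CLAIM: "BLOCK",
}


def audit(policy: dict) -> list[str]:
    """Walk the whole taxonomy and report gaps or unknown verdicts."""
    problems = []
    for action in ActionType:
        if action not in policy:
            problems.append(f"{action.name}: no verdict defined")
        elif policy[action] not in VALID_VERDICTS:
            problems.append(f"{action.name}: unknown verdict {policy[action]!r}")
    return problems


print(audit(POLICY))  # an empty list means the table is total and well-formed
```

A check like this could run whenever the rule set changes, so a taxonomy entry without a verdict is caught before deployment rather than at evaluation time.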

Testing for Verdict Binding

A governance system that claims verdict binding can be tested for the property:

T5 test: Construct an input that is semantically equivalent to a known-BLOCK input but stylistically similar to known-ALLOW inputs. If the system produces ALLOW, verdict binding is not implemented.
Contradiction test: Construct an input that simultaneously activates both an ALLOW-triggering pattern and a BLOCK-triggering pattern. Verify that the verdict is BLOCK. Under semantic adjudication, the combiner may produce ALLOW if the ALLOW-triggering features outweigh the BLOCK-triggering features. Under verdict binding, BLOCK must dominate.
Determinism test: Present identical inputs at different times, under different loads, and in different deployment environments. Verify that all evaluations produce identical verdicts and identical signed records.
Replay test: Take a historical signed governance record, replay the evaluation against the archived rule set and archived canonical input, and verify the verdict matches. If replay produces a different verdict, the governance system is not deterministic enough to support replay verification.
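The determinism and replay tests can be sketched with the standard library. Everything specific here is an assumption: the record layout, the HMAC-SHA256 signing scheme, the demo key, and the rule set. For brevity the canonical input is already the typed classification, and the record carries no timestamp, which is what allows two evaluations of the same input to produce byte-identical records.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # hypothetical key

# Hypothetical archived rule set.
RULE_SET = {
    "version": 1,
    "rules": {"TOOL_INVOCATION": "BLOCK", "INFORMATION_REQUEST": "ALLOW"},
}


def canonical(obj) -> bytes:
    """Stable byte encoding: sorted keys, no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def evaluate(action_type: str, rule_set: dict) -> str:
    """Pure lookup; unknown action types fail closed (an assumption)."""
    return rule_set["rules"].get(action_type, "BLOCK")


def sign(body: dict) -> str:
    return hmac.new(SIGNING_KEY, canonical(body), hashlib.sha256).hexdigest()


def make_record(action_type: str, rule_set: dict) -> dict:
    body = {
        "input": action_type,
        "rule_set_sha256": hashlib.sha256(canonical(rule_set)).hexdigest(),
        "verdict": evaluate(action_type, rule_set),
    }
    return {**body, "signature": sign(body)}


def replay(record: dict, archived_rule_set: dict) -> bool:
    """True if the signature holds, the rule set matches the one recorded,
    and re-evaluation reproduces the recorded verdict."""
    body = {k: v for k, v in record.items() if k != "signature"}
    rule_hash = hashlib.sha256(canonical(archived_rule_set)).hexdigest()
    return (
        hmac.compare_digest(sign(body), record["signature"])
        and rule_hash == record["rule_set_sha256"]
        and evaluate(record["input"], archived_rule_set) == record["verdict"]
    )


first = make_record("TOOL_INVOCATION", RULE_SET)
second = make_record("TOOL_INVOCATION", RULE_SET)
print(first == second, replay(first, RULE_SET))
```

A production system would likely keep timestamps and request identifiers in an unsigned envelope or in a separately signed field, so that the verdict-bearing portion of the record stays reproducible.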

The Implementation Cost

Implementing verdict binding requires up-front investment in action type taxonomy design and classification function development. The taxonomy must be comprehensive enough to cover the governance policy space, and the classification function must be robust to the full range of inputs the system will encounter.

This investment is largely a one-time cost. Once the taxonomy and classification function are in place, the governance layer requires no ongoing LLM calls, no classifier retraining, and no probability threshold tuning. The system is deterministic, and changing its behavior requires only a rule set update, which is a formal, audited process.

The alternative — maintaining and retraining semantic adjudication classifiers — requires ongoing investment in training data, model updates, threshold calibration, and adversarial testing. And it does not eliminate the T5 failure class. It manages it statistically while leaving the structural vulnerability in place. Verdict binding eliminates the structural vulnerability. That is the engineering case for the investment.
