What 337 Adversarial Governance Tests Taught Us

Governance infrastructure that has only been tested against expected inputs is not governance infrastructure. It is a prototype. The difference between a prototype and a production-ready enforcement runtime is systematic adversarial testing: structured attempts to find conditions under which governance fails.

After 337 adversarial tests against our governance runtime — organized into five test families — we have a clear picture of where deterministic enforcement holds unconditionally, where it requires careful implementation, and where the failure modes of governance systems hide.

Total Tests: 337 | Test Families: 5 | Unfixed Failures: 0

Test Family 1: Fuzz Testing the Enforcement Gate

Fuzz testing sends malformed, edge-case, and semantically unusual inputs through the governance gate to identify conditions under which the gate crashes, returns an incorrect verdict, or produces a verdict that cannot be replicated.

What we tested: 89 fuzz cases across input length extremes, Unicode normalization edge cases (zero-width characters, bidirectional text markers, homoglyphs substituting for ASCII characters), structural anomalies (JSON with duplicate keys, nested structures with circular references), and encoding variations.

What failed: The most significant finding concerned Unicode normalization. Before we added explicit NFKC normalization as the first step of input processing, 11 test cases bypassed pattern matching by substituting visually identical Unicode characters for ASCII characters in trigger terms. A governance rule blocking a specific action type matched only the ASCII representation, so inputs containing the look-alike Unicode representation passed through undetected.

The fix: NFKC normalization applied before any pattern evaluation. Every input is converted to Unicode Normalization Form KC (compatibility decomposition followed by canonical composition) before the enforcement gate processes it. Post-fix, all 11 homoglyph bypass attempts were caught. The normalization step adds approximately 0.04ms to evaluation latency.
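A minimal sketch of normalize-then-match, assuming a simple substring rule. The names `BLOCKED_TERMS`, `normalize`, and `gate` are hypothetical and are not taken from the runtime described here. Note that NFKC on its own folds compatibility characters such as fullwidth letters and ligatures. It does not rewrite cross-script look-alikes (for example Cyrillic letters) or remove zero-width characters, so those cases would need additional handling that this article does not describe.

```python
import unicodedata

# Hypothetical trigger terms; a real rule set would be far richer.
BLOCKED_TERMS = ("delete_records",)


def normalize(text: str) -> str:
    """Fold compatibility characters (fullwidth forms, ligatures, ...) via NFKC."""
    return unicodedata.normalize("NFKC", text)


def gate(text: str) -> str:
    """Return BLOCK if any trigger term appears in the normalized input."""
    normalized = normalize(text).casefold()
    if any(term in normalized for term in BLOCKED_TERMS):
        return "BLOCK"
    return "ALLOW"


# Fullwidth Latin letters render like ASCII but are different code points.
fullwidth_input = "\uff44\uff45\uff4c\uff45\uff54\uff45_records"
print(gate(fullwidth_input))  # matched only because normalization runs first
```

Running the match against the raw string instead of the normalized one would miss the fullwidth variant, which is the class of bypass described above.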

What held: Inputs at length extremes, malformed JSON structures, and encoding variations did not produce incorrect verdicts. The enforcement gate failed closed on inputs it could not parse — returning BLOCK with a parse failure reason — rather than defaulting to ALLOW on ambiguous input.
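The fail-closed behavior can be illustrated with a short sketch. It assumes requests arrive as JSON with an `action` field. The `Verdict` type, `ALLOWED_ACTIONS`, and `evaluate_raw` are hypothetical names for illustration, and rejecting duplicate keys is one possible policy rather than a documented behavior of the runtime.

```python
import json
from typing import NamedTuple


class Verdict(NamedTuple):
    decision: str
    reason: str


# Hypothetical allow-list standing in for real rule evaluation.
ALLOWED_ACTIONS = {"read"}


def _reject_duplicate_keys(pairs):
    """object_pairs_hook that raises on duplicate keys instead of keeping the last."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


def evaluate_raw(raw: str) -> Verdict:
    """Parse and evaluate a request; anything unparseable yields BLOCK."""
    try:
        request = json.loads(raw, object_pairs_hook=_reject_duplicate_keys)
    except (ValueError, RecursionError) as exc:
        # json.JSONDecodeError is a ValueError subclass, so both land here.
        return Verdict("BLOCK", f"parse_failure: {exc}")
    if not isinstance(request, dict) or not isinstance(request.get("action"), str):
        return Verdict("BLOCK", "parse_failure: missing or invalid action")
    if request["action"] in ALLOWED_ACTIONS:
        return Verdict("ALLOW", "action permitted")
    return Verdict("BLOCK", "action not permitted")
```

The point of the structure is that no code path reaches ALLOW unless parsing succeeded and a rule explicitly permitted the action.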

Test Family 2: Replay Divergence Testing

Replay divergence tests verify that re-evaluating a historical input against its recorded rule set version produces the same verdict as the original decision record. A governance system that cannot replay deterministically cannot produce verifiable audit records.

What we tested: 67 replay cases across different time windows (same-session replays, cross-session replays, replays across rule set version updates), infrastructure variations (different machines, different process execution contexts), and precision edge cases (floating-point comparisons, timestamp-dependent rules).

What failed: Rules with timestamp-dependent evaluation, specifically rules asking "is this action occurring within business hours?", produced different verdicts depending on when the replay was run. A decision recorded as ALLOW at 14:00 would evaluate as BLOCK when replayed at 22:00.

The fix: All timestamp-dependent rule evaluation uses the timestamp embedded in the original decision record, not the current system time. Replay mode is now a first-class execution mode that freezes temporal context to the original decision timestamp.
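A sketch of the frozen-clock idea, assuming a single business-hours rule with a 09:00 to 17:00 UTC window. The window, the `DecisionRecord` shape, and the function names are illustrative assumptions, not the runtime's actual interface.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    action: str
    decided_at: datetime  # captured once, at decision time
    verdict: str


def within_business_hours(ts: datetime) -> bool:
    # Hypothetical window: 09:00 (inclusive) to 17:00 (exclusive), UTC.
    return 9 <= ts.hour < 17


def evaluate(action: str, at: datetime) -> str:
    """Pure function of its arguments; it never reads the wall clock."""
    return "ALLOW" if within_business_hours(at) else "BLOCK"


def decide(action: str, now: Optional[datetime] = None) -> DecisionRecord:
    """Live mode: read the clock exactly once and store it in the record."""
    ts = now if now is not None else datetime.now(timezone.utc)
    return DecisionRecord(action, ts, evaluate(action, ts))


def replay(record: DecisionRecord) -> str:
    """Replay mode: temporal context comes only from the record."""
    return evaluate(record.action, record.decided_at)
```

Because `evaluate` takes the timestamp as a parameter, a replay run at 22:00 sees the same 14:00 context as the original decision.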

What held: Across 63 non-timestamp-dependent replays, including cross-machine and cross-Python-version replays, 100% of verdicts matched the original signed records. The determinism guarantee held.

Test Family 3: Authority Conflict Injection

Authority conflict tests attempt to inject conflicting authority claims into the governance evaluation context — claiming permissions that were not granted, asserting principal identities that are not authenticated, or constructing authority tokens with internally contradictory claims.

What we tested: 74 authority conflict cases across JWT forgery attempts (modified payloads with valid-looking signatures), scope escalation, principal impersonation, conflicting claim injection (multiple authority tokens with contradictory scope assertions), and expired token replay.

What failed: One category, conflicting claim injection, initially produced ALLOW verdicts in 3 test cases. When tokens carried contradictory scope assertions, the evaluation engine resolved the conflict by selecting the more permissive scope. A false ALLOW is the most dangerous failure mode for a governance system, because it lets an unauthorized action proceed.

The fix: Conflicting authority claims now produce BLOCK with a conflict reason rather than resolving to the more permissive scope. The rationale: a legitimate system presenting contradictory authority claims is in an error state; a system exploiting authority conflict injection is an attacker. In both cases, BLOCK is the correct response.
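The conflict rule can be sketched as follows, under the simplifying assumption that each presented token reduces to a set of scope strings and that any disagreement between tokens counts as a conflict. `resolve_authority` and the reason strings are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple


class Verdict(NamedTuple):
    decision: str
    reason: str


def resolve_authority(token_scopes: List[FrozenSet[str]], required: str) -> Verdict:
    """Decide whether `required` is authorized by the presented tokens.

    Tokens that disagree about scope are never merged or ranked;
    disagreement itself is treated as grounds for BLOCK.
    """
    if not token_scopes:
        return Verdict("BLOCK", "no_authority_presented")
    if len(set(token_scopes)) > 1:
        return Verdict("BLOCK", "authority_conflict")
    if required not in token_scopes[0]:
        return Verdict("BLOCK", "scope_not_granted")
    return Verdict("ALLOW", "scope_granted")
```

The pre-fix behavior would correspond to taking the union of all scopes, which is exactly what lets a single injected token escalate privileges.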

What held: JWT forgery attempts were uniformly rejected — HMAC-SHA256 signature verification catches payload modification. Expired token replay was caught in all cases. Principal impersonation was caught when the impersonated principal's token was not present.
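To illustrate why payload modification is caught, here is a sketch of the HMAC-SHA256 check in isolation, using Python's standard `hmac` and `hashlib` modules. It is not a full JWT implementation (no header parsing, base64url decoding, or expiry check), and the key and payload values are made up for the example.

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> bytes:
    """Compute the HMAC-SHA256 tag over the exact payload bytes."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(payload: bytes, signature: bytes, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign(payload, key)
    return hmac.compare_digest(expected, signature)


# Hypothetical key and claims, for illustration only.
key = b"example-shared-secret"
original = b'{"sub": "agent-7", "scope": "read"}'
tag = sign(original, key)

# An attacker edits the claims but cannot recompute the tag without the key.
forged = b'{"sub": "agent-7", "scope": "admin"}'
print(verify(original, tag, key), verify(forged, tag, key))
```

Any change to the payload bytes changes the expected tag, so a signature copied from a legitimate token does not validate against a modified payload.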

Test Family 4: Contradiction Harnesses

Contradiction harnesses construct inputs designed to produce contradictory intermediate verdicts within a single evaluation pass — for example, an input that triggers both an ALLOW rule and a BLOCK rule for the same action type.

What we tested: 58 contradiction cases across ALLOW/BLOCK conflicts on the same action type, MODIFY verdicts that conflict with BLOCK verdicts, nested rule dependencies that produce circular evaluation paths, and rules with overlapping scope.

What failed: Two test cases in the circular evaluation path category caused evaluation loops before the cycle detection mechanism was in place. A rule evaluating condition A, which required evaluating condition B, which required evaluating condition A, would loop indefinitely. The system hung rather than returning a verdict.

The fix: Dependency graph construction before evaluation. The evaluator builds a dependency graph for each rule set and rejects evaluation requests that would create cycles before beginning evaluation. Cycles produce BLOCK with a cycle-detection reason and an audit record.
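One way to sketch the pre-evaluation cycle check is a depth-first search over the rule dependency graph. The graph representation (a dict mapping each condition to the conditions it depends on) and the function names are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in deps.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def check_rule_set(deps: Dict[str, List[str]]) -> Tuple[str, str]:
    """Reject a cyclic rule set before any rule is evaluated."""
    cycle = find_cycle(deps)
    if cycle:
        return ("BLOCK", "cycle_detected: " + " -> ".join(cycle))
    return ("OK", "acyclic")
```

Since the check runs on the graph rather than during evaluation, a cyclic rule set returns a verdict immediately instead of hanging.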

What held: ALLOW/BLOCK conflicts on the same action type uniformly resolved to BLOCK — the correct conservative outcome. The rule precedence hierarchy (BLOCK > MODIFY > ALLOW) held in all non-circular cases.
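The precedence hierarchy amounts to taking the most restrictive verdict among those produced in a pass. A minimal sketch follows; treating an empty or unrecognized verdict list as BLOCK is an assumption made here in the fail-closed spirit of the article, and `combine` is a hypothetical name.

```python
from typing import Iterable

# Higher rank wins: BLOCK > MODIFY > ALLOW.
PRECEDENCE = {"ALLOW": 0, "MODIFY": 1, "BLOCK": 2}


def combine(verdicts: Iterable[str]) -> str:
    """Collapse intermediate verdicts into a single final verdict."""
    collected = list(verdicts)
    # Assumption: no verdicts, or any unknown verdict, fails closed.
    if not collected or any(v not in PRECEDENCE for v in collected):
        return "BLOCK"
    return max(collected, key=PRECEDENCE.__getitem__)
```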

Test Family 5: Downgrade Attacks

Downgrade attacks attempt to cause the governance system to evaluate a request using a weaker rule set than the one that should apply — by manipulating the rule set version header, causing a fallback to a default rule set, or exploiting transitions between rule set versions.

What we tested: 49 downgrade attack cases across rule set version header manipulation, forced fallback to default configurations, exploitation of rule set update windows, and requests constructed to appear as if they originated from a tenant with less restrictive policies.

What failed: Rule set update window exploitation. During the brief period when a rule set update was being applied, a small window existed where the old rule set had been replaced in memory but the new rule set had not been fully loaded and verified. In 4 test cases, requests timed to arrive during this window were evaluated under the permissive default.

The fix: Atomic rule set replacement. The new rule set is loaded, verified against its hash, and held in a staging location. The switch from old to new is a single atomic operation: the evaluation engine reads from one pointer, and the pointer is swapped atomically. During the atomic switch, requests are queued for a maximum of 10ms.
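The stage-verify-swap sequence might look like the sketch below. It assumes rule sets are JSON documents verified against a SHA-256 hex digest. `RuleSetStore` is a hypothetical class, and the 10ms request queue mentioned above is not modeled.

```python
import hashlib
import json
import threading


class RuleSetStore:
    """Holds the active rule set behind a single reference."""

    def __init__(self, initial_rules: dict):
        self._lock = threading.Lock()
        self._active = initial_rules

    def current(self) -> dict:
        # Readers see either the old or the new rule set, never a partial one.
        with self._lock:
            return self._active

    def update(self, raw: bytes, expected_sha256: str) -> bool:
        """Stage and verify a new rule set, then swap it in.

        On any verification failure the active rule set is left untouched,
        so there is no window in which a default applies.
        """
        if hashlib.sha256(raw).hexdigest() != expected_sha256:
            return False
        try:
            staged = json.loads(raw)
        except ValueError:
            return False
        if not isinstance(staged, dict):
            return False
        with self._lock:
            self._active = staged  # the only mutation: one reference swap
        return True
```

All loading and verification happen on the staged copy. The active reference changes only after both succeed, which removes the window in which the test requests reached the permissive default.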

What held: Rule set version header manipulation was uniformly ignored — the evaluation engine determines which rule set applies based on the tenant context, not request headers. Tenant impersonation attempts to trigger cross-tenant rule sets failed.

The Honest Architecture: Probabilistic Detection, Deterministic Adjudication

The adversarial test results prompt a question worth addressing directly: if governance relies on pattern matching and semantic analysis, is it truly deterministic? The answer requires distinguishing between two distinct stages in the enforcement pipeline that are often conflated.

Input classification is probabilistic at the semantic layer. Pattern matching against normalized strings (Layer A) is deterministic — the same normalized input produces the same verdict every time. But the system also uses TF-IDF centroid scoring, embedding-based similarity, and adversarial pattern recognition to detect evasion attempts. These methods produce scores, not binary outputs. A score of 0.73 against a prohibited-intent centroid is probabilistic: a different embedding model or a differently-trained centroid would produce a different number.

The probabilistic component is detection. The deterministic component is adjudication. Fraud systems, intrusion detection platforms, and transaction risk engines commonly follow the same split.

Policy adjudication is fully deterministic. Once a detection score is computed — regardless of how it was produced — it is compared against a fixed threshold in a fixed rule set, and a verdict is rendered by a deterministic decision function. The verdict (ALLOW, BLOCK, MODIFY) and the full evaluation trace are captured in a signed decision record. Replaying that record does not re-run the semantic detector; it re-runs the adjudication logic against the captured inputs and scores. That replay is bit-identical.
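The split between captured scores and replayed adjudication can be sketched as below. The record layout, the HMAC-SHA256 signing scheme, and the 0.70 threshold in the usage line are assumptions for illustration. The article does not specify how decision records are signed.

```python
import hashlib
import hmac
import json

_FIELDS = ("score", "threshold", "verdict")


def adjudicate(score: float, threshold: float) -> str:
    """Deterministic decision function: same inputs, same verdict."""
    return "BLOCK" if score >= threshold else "ALLOW"


def _signature(body: dict, key: bytes) -> str:
    # sort_keys gives a stable byte representation to sign.
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def make_record(score: float, threshold: float, key: bytes) -> dict:
    """Capture the detector's score and the resulting verdict, then sign."""
    body = {"score": score, "threshold": threshold,
            "verdict": adjudicate(score, threshold)}
    return {**body, "signature": _signature(body, key)}


def replay(record: dict, key: bytes) -> bool:
    """Check the signature, then re-run adjudication on the captured score.

    The semantic detector is not invoked; only the stored score is used.
    """
    body = {field: record[field] for field in _FIELDS}
    if not hmac.compare_digest(_signature(body, key), record["signature"]):
        return False
    return adjudicate(record["score"], record["threshold"]) == record["verdict"]


record = make_record(0.73, 0.70, b"example-signing-key")
```

A different embedding model might have produced a score other than 0.73, but given the recorded 0.73 and the recorded threshold, the verdict is reproducible.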

This is the same architecture that governs financial fraud systems. Visa's transaction risk engine assigns probabilistic risk scores; the decisioning layer applies deterministic authorization rules against those scores. The risk score for a given transaction may vary depending on model version. The authorization decision, given a risk score, is deterministic. Regulatory auditors do not audit the risk model — they audit the authorization rules and whether they were applied correctly. EVE's governance infrastructure works the same way.

What this means for auditability: Pre-LLM enforcement (Layer A) is fully deterministic and replayable end-to-end — NFKC normalization followed by pattern evaluation is a pure function. Post-LLM enforcement (Layer B) captures the semantic scores that triggered adjudication in the signed record. An auditor can verify that the adjudication rules were correctly applied to the reported scores, and verify the signature chain. What they cannot do is independently reconstruct the semantic score from scratch — that would require running the same embedding model against the same output, which is the same limitation that applies to every probabilistic risk system in production.

The adversarial tests in this campaign targeted the adjudication layer: could the enforcement gate be made to produce an incorrect verdict given the inputs it received? In all 337 cases, the answer was either no, or the failure lay in a boundary condition (normalization gap, temporal context, authority conflict resolution, circular rule dependencies, transition window) that was then closed. The probabilistic detection layer was not under test here; that is a separate research program with its own methodology.

Framing the architecture accurately matters. A claim that governance is "fully deterministic and semantic-intent-aware" would be overclaiming. A claim that governance "relies entirely on brittle keyword blocking" would be underclaiming. The accurate framing: probabilistic semantic detection feeds a deterministic enforcement adjudicator, producing decision records that are cryptographically signed and replayable at the adjudication level. This is a strong, defensible guarantee — and it is the same guarantee that governs the most regulated automated decision systems in production today.

What 337 Tests Reveal About Governance Design

The aggregate lesson from adversarial testing is that governance failure modes concentrate in three categories:

Input normalization. Governance rules are written against specific string representations. Any path that allows inputs to reach rule evaluation without normalization creates a bypass surface. Normalization must be the first step, before any other processing.
Temporal context. Rules that depend on current state — timestamps, shared counters, external service responses — cannot be replayed deterministically unless the original temporal context is captured and frozen. Every governance decision record must be self-contained with respect to the context that produced it.
Concurrency during transitions. The narrow windows during system transitions — rule set updates, session initialization, replica promotion — are where determinism guarantees are most likely to break. Transitions must be engineered as atomic events, not processes.

The 337 tests confirmed what deterministic governance architecture predicts: a properly constructed enforcement gate is robust against the adversarial inputs it was designed to handle. The failures that occurred happened at implementation boundaries — normalization gaps, temporal dependencies, transition windows — not at the core logic. Each failure was fixed by tightening the implementation to match the design intent.

That is the value of adversarial testing: it finds the gap between design and implementation before a production incident does.
