The decision evidence infrastructure market has become crowded with vendors, and the terminology has become correspondingly muddled. Products that perform statistical output sampling call themselves enforcement platforms. Tools that run secondary LLM calls to evaluate primary LLM outputs claim to provide deterministic controls. Dashboards that visualize compliance metrics are positioned as policy enforcement solutions.
For a CTO evaluating vendors in this space, the practical challenge is separating genuine enforcement capability from monitoring infrastructure rebranded with compliance language. The stakes are high: a governance platform that fails during a regulatory examination exposes the organization to findings that cannot be addressed retroactively. A platform that adds unacceptable latency will face operational pressure to be disabled. A platform that cannot be customized to your specific regulatory context will require significant manual review to compensate for its gaps.
These eight questions are designed to elicit the information that vendor sales materials typically obscure. Ask them in every evaluation. Compare the answers against the scoring rubric at the end of this article.
Question 1: Is Policy Evaluation Deterministic?
Ask the vendor: "If I send the same request twice, with the same policy version deployed, will I get the same evaluation result both times, every time?"
A deterministic enforcement platform will answer yes without qualification. The evaluation logic is implemented in code — pattern matching, structured rule evaluation, explicit data field checks — not in natural language processed by a model. Same input, same policy state, same result. Always.
A non-deterministic platform will hedge. "Our model achieves 99.2% consistency." "We run multiple samples and take the majority vote." "Temperature is set to 0.0 for evaluation calls." None of these answers describe determinism. A temperature-0 LLM call is not guaranteed to be deterministic: floating-point addition is not associative, so differences in parallel execution order on GPU hardware can produce variation even at temperature 0. And a 99.2% consistency rate means 8 of every 1,000 evaluations produce a different result for the same input. For a high-volume financial services deployment processing several hundred thousand evaluations per day, that represents thousands of inconsistent evaluations per day.
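To make the 99.2% figure concrete, the arithmetic can be sketched as follows. The daily volumes are illustrative assumptions, not figures from any vendor; only the 0.992 consistency rate comes from the example answer above.

```python
def inconsistent_per_day(daily_evaluations: int, consistency_rate: float) -> float:
    """Expected number of evaluations per day whose result could differ on replay."""
    return daily_evaluations * (1.0 - consistency_rate)


# Hypothetical daily volumes; 0.992 is the consistency figure quoted above.
for volume in (10_000, 250_000, 1_000_000):
    print(volume, round(inconsistent_per_day(volume, 0.992)))
```

At 250,000 evaluations per day the expected count is roughly 2,000, which is where "thousands per day" starts to apply.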
Why it matters for compliance: Regulatory audit trails require that a specific evaluation result can be attributed to a specific policy state at a specific time. If the evaluation engine is probabilistic, the audit record contains a result that cannot be reliably reproduced. When a regulator asks "show me the evaluation logic that produced this decision," a probabilistic engine cannot provide a verifiable answer.
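As an illustration of what "evaluation logic implemented in code" can look like, here is a minimal sketch of a deterministic rule evaluator. The rule names, the threshold, and the request fields are hypothetical; the point is only that the result depends on nothing except the request and the rule set.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    check: Callable[[dict], bool]  # returns True when the rule passes


# Hypothetical rules for illustration only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

RULES = (
    Rule("max_amount", lambda req: req.get("amount", 0) <= 100_000),
    Rule("no_ssn_in_text", lambda req: SSN_PATTERN.search(req.get("text", "")) is None),
)


def evaluate(request: dict, rules=RULES) -> dict:
    """Evaluate every rule in a fixed order: no randomness, clocks, or model calls."""
    results = {rule.rule_id: rule.check(request) for rule in rules}
    disposition = "ALLOWED" if all(results.values()) else "BLOCKED"
    return {"disposition": disposition, "rule_results": results}


request = {"amount": 50_000, "text": "Applicant SSN is 123-45-6789"}
first = evaluate(request)
second = evaluate(request)
print(first == second, first["disposition"])
```

Because the per-rule results are returned alongside the disposition, the same structure can feed a rule-level audit record.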
"Our AI evaluator achieves very high consistency." Consistency is not determinism. Press for a direct yes-or-no answer: will the same input, under the same policy version, produce the same result every time?
Question 2: Can You Show Me the Audit Record for a Specific Decision?
Ask the vendor to demonstrate retrieval of a complete audit record for a single decision — from a specific timestamp, with the specific inputs, the specific policy rules evaluated, and the specific disposition returned.
A platform with genuine audit capability will produce a structured record that includes: a unique decision identifier, a timestamp, the policy version that was active at the time of evaluation, the specific rules that were evaluated and their individual results, the final disposition (ALLOWED, BLOCKED, or MODIFIED), and a cryptographic integrity check that allows verification that the record has not been altered since it was created.
A platform with weak audit capability will produce a log entry with a timestamp, an input summary, and an output summary. This satisfies a basic logging requirement but does not produce the tamper-evident, rule-level audit record that regulators examining high-risk AI systems will require. Under EU AI Act Article 12, CFPB supervisory examination, and OCC model risk guidance, a basic log entry is not sufficient.
Ask specifically: "Is the audit record cryptographically signed? Can you demonstrate that the record cannot be altered without detection?" HMAC-SHA256 signing with hash-chaining between records is a widely used approach for tamper-evident audit trails. If the vendor cannot demonstrate this or an equivalent mechanism, the audit records are legally fragile.
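The mechanism is simple enough to sketch with the Python standard library. The record fields, the hard-coded key, and the function names below are assumptions for illustration; a production system would keep the key in a KMS or HSM and would define its own record schema.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key-only"  # assumption: real keys would live in a KMS or HSM
GENESIS = "0" * 64             # placeholder "previous hash" for the first record


def _canonical(payload: dict) -> bytes:
    # Stable serialization so the same record always hashes to the same value.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _hash_and_sign(body: dict) -> tuple:
    record_hash = hashlib.sha256(_canonical(body)).hexdigest()
    signature = hmac.new(SECRET_KEY, record_hash.encode("ascii"), hashlib.sha256).hexdigest()
    return record_hash, signature


def append_record(chain: list, decision: dict) -> dict:
    """Link the new record to the previous record's hash, then sign it."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS
    body = {"decision": decision, "prev_hash": prev_hash}
    record_hash, signature = _hash_and_sign(body)
    record = {**body, "record_hash": record_hash, "signature": signature}
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every hash and signature; any edit or reordering fails the check."""
    prev_hash = GENESIS
    for record in chain:
        body = {"decision": record["decision"], "prev_hash": record["prev_hash"]}
        expected_hash, expected_sig = _hash_and_sign(body)
        if record["prev_hash"] != prev_hash or record["record_hash"] != expected_hash:
            return False
        if not hmac.compare_digest(record["signature"], expected_sig):
            return False
        prev_hash = record["record_hash"]
    return True


chain = []
append_record(chain, {"decision_id": "d-001", "policy_version": "lending_v1", "disposition": "ALLOWED"})
append_record(chain, {"decision_id": "d-002", "policy_version": "lending_v1", "disposition": "BLOCKED"})
print(verify_chain(chain))
```

Note that HMAC relies on a shared secret, so only a party holding the key can verify the signatures. Whether that is acceptable for third-party verification is worth raising with the vendor.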
Question 3: What Is the P99 Latency for a Policy Evaluation?
Ask for production latency numbers at the 50th, 95th, and 99th percentile for a policy evaluation call. Ask for these numbers under realistic load — 500 requests per second, or whatever your production volume projection is.
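If you want to check the vendor's numbers yourself, percentile latency is straightforward to compute from raw timings. The sketch below uses the nearest-rank method and a stand-in evaluation function (hypothetical). It times calls sequentially in a single thread, so it does not reproduce a 500 requests-per-second load test.

```python
import math
import time


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ms(evaluate, requests) -> list:
    """Time each call with a monotonic high-resolution clock, in milliseconds."""
    latencies = []
    for request in requests:
        start = time.perf_counter()
        evaluate(request)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def stand_in_evaluate(request: dict) -> bool:
    # Hypothetical placeholder for a real policy evaluation call.
    return request["amount"] <= 100_000


latencies = measure_latencies_ms(stand_in_evaluate, [{"amount": n} for n in range(10_000)])
report = {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
print(report)
```

Always ask for the raw sample count behind a P99 claim: with only a few hundred samples, the 99th percentile is determined by a handful of requests.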
The threshold that determines whether a governance platform is operationally viable is approximately 2–5ms at P99. If the enforcement layer adds more than 5ms of latency to AI responses at the 99th percentile, it will generate user experience complaints at scale, face operational pressure to be disabled or bypassed, and become a target for exception policies that undermine its coverage. A governance layer that can be disabled under load is not a governance control — it is a governance aspiration.
Platforms that use secondary LLM calls for evaluation will typically report P50 latency of 200–500ms and P99 latency of 1–3 seconds. This is technically acceptable for some use cases but creates a strong operational incentive to bypass the governance layer for time-sensitive requests. Platforms with deterministic evaluation engines — code-based rule evaluation, no model calls — can achieve P99 under 1ms.
A governance platform with high latency will eventually be bypassed for "critical" or "time-sensitive" requests. Each bypass creates a gap in audit coverage and a precedent for future bypasses. The governance control becomes a guideline rather than an enforcement mechanism. Design governance infrastructure to have low enough latency that bypass is never the operationally rational choice.
Question 4: How Are Policy Rules Defined and Updated?
Ask the vendor: "If the CFPB issues new guidance on adverse action notice requirements next month, how do I update my policies? What is the deployment process? How long does it take?"
A platform with mature policy management will have: versioned policy packs with explicit version identifiers; a testing harness that allows policy rules to be validated against a set of test cases before deployment; a deployment process that updates policy atomically across all nodes without downtime; and a rollback capability that restores a previous policy version if problems are discovered post-deployment.
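The lifecycle described above (versioned policies, a test gate before deployment, and rollback) can be sketched in a few lines. The class and policy functions are hypothetical, and the sketch runs in a single process, so it does not model atomic rollout across multiple nodes.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyRegistry:
    versions: dict = field(default_factory=dict)        # version id -> policy function
    activation_log: list = field(default_factory=list)  # append-only, newest last

    def deploy(self, version: str, policy, test_cases: list) -> None:
        """Activate a version only if every (request, expected) test case passes."""
        failures = [req for req, expected in test_cases if policy(req) != expected]
        if failures:
            raise ValueError(f"{version}: {len(failures)} test case(s) failed; not deployed")
        self.versions[version] = policy
        self.activation_log.append(version)

    def rollback(self) -> str:
        """Re-activate the version that was active immediately before the current one."""
        if len(self.activation_log) < 2:
            raise RuntimeError("no earlier version to roll back to")
        previous = self.activation_log[-2]
        self.activation_log.append(previous)  # the log keeps the full history
        return previous

    def evaluate(self, request: dict) -> tuple:
        version = self.activation_log[-1]
        return version, self.versions[version](request)


def lending_v1(req: dict) -> str:
    return "ALLOWED" if req["amount"] <= 50_000 else "BLOCKED"


def lending_v2(req: dict) -> str:
    return "ALLOWED" if req["amount"] <= 100_000 else "BLOCKED"


TESTS = [({"amount": 10_000}, "ALLOWED"), ({"amount": 500_000}, "BLOCKED")]

registry = PolicyRegistry()
registry.deploy("lending_v1", lending_v1, TESTS)
registry.deploy("lending_v2", lending_v2, TESTS)
print(registry.evaluate({"amount": 75_000}))
```

Returning the version identifier with every result is what lets an audit record name the exact policy state that produced a decision.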
A platform with immature policy management will have hard-coded rules that require vendor involvement to change, no testing environment, deployment processes that cause downtime, and no version history that allows the active policy at a historical point in time to be reconstructed.
Also ask: "Can I define rules specific to my regulatory context, or am I limited to your prebuilt rule sets?" Prebuilt rule sets for lending, healthcare, or financial services are valuable starting points, but every organization has context-specific policy requirements. A governance platform that cannot be extended with custom rules will require a parallel manual review program to cover the gaps.
Question 5: Which Specific Regulations Does Your Policy Coverage Address?
Ask for a specific mapping from the vendor's policy rules to the regulatory requirements they satisfy. Not a marketing claim — a document that names the specific regulatory article or guidance section and explains how the corresponding policy rule satisfies it.
For US financial services, this should cover ECOA §701/Reg B adverse action notice requirements; FCRA accuracy and dispute obligations for credit decisions; HMDA data reporting requirements for covered transactions; BSA/AML customer identification obligations; and the model risk management guidance issued jointly as OCC Bulletin 2011-12 and Federal Reserve SR 11-7, including its model validation requirements.
For healthcare, it should cover HIPAA §164.312 technical safeguard requirements including access controls and audit controls; FDA SaMD guidance for AI systems that influence clinical decisions; and state clinical practice regulations for AI-assisted clinical recommendations.
A vendor that cannot produce this mapping is claiming regulatory coverage they have not actually engineered. A vendor that produces the mapping confidently — citing specific sections and explaining how each rule satisfies each requirement — has done the regulatory analysis that their platform claims to provide.
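One way to review such a mapping is to ask for it in machine-readable form and check it for gaps. The sketch below assumes a simple rule-to-citation dictionary; the rule identifiers are hypothetical, and the citations are the ones named earlier in this section.

```python
# Hypothetical rule identifiers mapped to the regulatory citations they claim to satisfy.
RULE_TO_REGULATION = {
    "adverse_action_notice_present": ["ECOA §701 / Reg B: adverse action notice"],
    "audit_log_written": ["HIPAA §164.312: audit controls"],
    "tone_check": [],  # no citation supplied by the vendor
}


def unmapped_rules(mapping: dict) -> list:
    """Return rules that claim regulatory coverage but cite nothing specific."""
    return sorted(rule for rule, citations in mapping.items() if not citations)


print(unmapped_rules(RULE_TO_REGULATION))
```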
Question 6: How Does the SDK Integrate With Our Existing Stack?
Ask the vendor to walk through the integration path for your specific tech stack. This is not a theoretical question — ask them to show you working code for an integration with your primary LLM provider (Anthropic, OpenAI, Azure OpenAI) and your primary deployment environment (Kubernetes, AWS Lambda, Azure Functions).
The integration should require minimal changes to existing application code. A governance platform that requires restructuring the application's LLM calling convention, or that requires routing all traffic through a proxy, introduces significant migration cost and operational risk. The ideal integration pattern is a single SDK method call that wraps the existing LLM call, adding governance evaluation transparently.
Ask specifically about: async support (critical for high-throughput applications); connection pooling (governance calls should reuse connections, not create new ones per request); timeout handling (what happens if the governance layer is unavailable? fail-open or fail-closed?); and retry behavior (how does the SDK handle transient failures in the governance service?).
The fail-open vs. fail-closed question deserves particular attention. A governance platform that fails open — allows AI requests to proceed without evaluation when the governance service is unavailable — provides no guarantee of coverage during outages. A platform that fails closed — blocks AI requests when governance is unavailable — provides coverage guarantees but requires high availability of the governance service itself. Ask the vendor what their SLA is for the governance API, and what the default behavior is during downtime.
```python
# Ideal integration pattern: minimal code change
result = coreguard.evaluate(
    policy_set="lending_v1",
    action={"type": "loan_decision", "amount": 50000},
    context={"applicant_id": applicant_id},
)
# result.decision: ALLOWED | BLOCKED | MODIFIED
# result.certificate: signed audit record
```
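The fail-open versus fail-closed choice, together with retry behavior, can be sketched independently of any particular SDK. Everything here is hypothetical: `governed_evaluate`, its parameters, and the simulated governance calls are illustrations, not a vendor API.

```python
import time


def governed_evaluate(call_governance, request, *, retries=2, backoff_s=0.05, fail_closed=True):
    """Call the governance service with bounded retries, then apply the failure policy."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return call_governance(request)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between attempts
    if fail_closed:
        # Coverage is preserved: nothing proceeds without an evaluation.
        return {"decision": "BLOCKED", "reason": f"governance unavailable: {last_error}"}
    # Availability is preserved, but this request leaves a gap in audit coverage.
    return {"decision": "ALLOWED", "reason": "governance unavailable; request unevaluated"}


def always_down(request):
    raise TimeoutError("governance service timed out")


def make_flaky(failures_before_success: int):
    """Simulate a service that fails a fixed number of times, then recovers."""
    state = {"remaining": failures_before_success}

    def call(request):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("transient failure")
        return {"decision": "ALLOWED", "reason": "evaluated"}

    return call


closed = governed_evaluate(always_down, {"type": "loan_decision"}, backoff_s=0.0)
opened = governed_evaluate(always_down, {"type": "loan_decision"}, backoff_s=0.0, fail_closed=False)
print(closed["decision"], opened["decision"])
```

Whichever default the vendor ships, confirm that the failure path itself is recorded, so that an outage shows up in the audit trail rather than as missing records.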
Question 7: What Is the Pricing Model at Scale?
Ask for a complete pricing breakdown at three volume levels: 10,000 evaluations per month (pilot scale), 1 million evaluations per month (initial production scale), and 100 million evaluations per month (mature enterprise scale). Ask whether pricing is per-evaluation, per-user, per-seat, or flat-fee, and what the total cost implications are for each model at each volume level.
Per-evaluation pricing is the most transparent for organizations that need to forecast compliance costs. It aligns vendor incentives with customer success — the vendor wins when the customer processes more transactions — and it scales predictably with business growth. The risk is that per-evaluation pricing can become expensive at very high volumes if the per-unit rate is not structured with volume discounts.
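To compare quotes at the three volume levels, it helps to model the tier structure explicitly. The rates and tier boundaries below are invented for illustration and do not reflect any vendor's pricing; the sketch assumes graduated tiers, where each rate applies only to the evaluations that fall inside its band.

```python
from decimal import Decimal

# Hypothetical graduated tiers: (upper bound of tier, price per evaluation in USD).
TIERS = [
    (1_000_000, Decimal("0.0010")),   # first 1M evaluations
    (10_000_000, Decimal("0.0005")),  # evaluations above 1M, up to 10M
    (None, Decimal("0.0002")),        # everything above 10M
]


def monthly_cost(evaluations: int, tiers=TIERS) -> Decimal:
    """Sum the cost of each tier's share of the monthly volume."""
    cost = Decimal("0")
    lower = 0
    for upper, rate in tiers:
        if evaluations <= lower:
            break
        top = evaluations if upper is None else min(evaluations, upper)
        cost += (top - lower) * rate
        lower = top
    return cost


for volume in (10_000, 1_000_000, 100_000_000):
    print(volume, monthly_cost(volume))
```

Under these assumed rates the three volumes cost 10, 1,000, and 23,500 USD per month. Ask the vendor whether its tiers are graduated like this or whether crossing a threshold reprices the entire volume, since the two models diverge sharply at scale.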
Per-seat or per-user pricing is common among SaaS governance tools but creates misaligned incentives for AI deployments. The number of "seats" in an AI governance context is ambiguous — does it count the number of developers integrating the SDK, or the number of end users whose AI interactions are governed? Per-seat pricing rarely produces predictable costs for high-volume AI deployments.
Also ask about data residency and egress pricing. If your decision evidence infrastructure is a cloud service and your application data must leave your cloud region for evaluation, you will incur egress costs and potentially create data residency compliance issues. Ask whether the governance service can be deployed within your existing cloud environment as a self-hosted option.
Question 8: What Happens During an Actual Regulatory Examination?
Ask the vendor: "If we receive a regulatory examination request for our AI decision records from the last 18 months, what can you provide, and in what format?"
This question cuts through positioning language and reveals whether the vendor has actually thought through the end-to-end compliance use case or is selling governance tooling to organizations that will discover the audit trail gaps when it matters most.
A mature governance platform should be able to provide: a complete archive of signed decision certificates for any specified time range, exportable in both machine-readable (JSON) and human-readable formats; a chain of custody document demonstrating that the archive is complete and unaltered; a policy version history that allows reconstruction of the exact policy state active at any historical evaluation; and support documentation explaining how the evaluation logic satisfies each applicable regulatory requirement.
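Reconstructing the policy state for a historical decision amounts to a lookup against the policy activation history. A minimal sketch, assuming the vendor can export a time-ordered activation log (the dates and the v2 and v3 identifiers are hypothetical):

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical activation log, sorted by activation time (UTC).
ACTIVATIONS = [
    (datetime(2024, 1, 1, tzinfo=timezone.utc), "lending_v1"),
    (datetime(2024, 6, 15, tzinfo=timezone.utc), "lending_v2"),
    (datetime(2025, 2, 1, tzinfo=timezone.utc), "lending_v3"),
]


def policy_version_at(when: datetime, activations=ACTIVATIONS) -> str:
    """Return the version whose activation is the latest one at or before `when`."""
    times = [activated for activated, _ in activations]
    index = bisect_right(times, when)
    if index == 0:
        raise LookupError("no policy was active at that time")
    return activations[index - 1][1]


print(policy_version_at(datetime(2024, 7, 1, tzinfo=timezone.utc)))
```

During an examination, the version returned by a lookup like this should match the policy version stored in the signed decision record for the same timestamp. A mismatch indicates a gap in either the version history or the audit trail.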
Ask also whether the vendor has worked through an actual regulatory examination with any of their customers. Vendors that have will have developed examination support procedures and documentation templates. Vendors that have not will be learning the process at the same time as their customers — which is not the support posture you want during an examination.
The 8-Question Scorecard
Use this rubric to score vendors consistently across your evaluation process. Weight questions by their importance to your specific regulatory context.
| # | Question | Strong Answer (2 pts) | Acceptable Answer (1 pt) | Weak Answer (0 pts) |
|---|---|---|---|---|
| Q1 | Determinism | Unconditional yes; evaluation is code-based | Near-deterministic; T=0 LLM calls | Hedges; cites consistency % |
| Q2 | Auditability | HMAC-signed, hash-chained, rule-level records | Per-decision records; no cryptographic signing | Log entries only; no rule-level detail |
| Q3 | Latency | P99 < 2ms under production load | P99 < 50ms; acceptable for most use cases | P99 ≥ 50ms; typical of secondary LLM calls |
| Q4 | Policy customization | Self-service versioned policies; test harness; atomic deploy | Custom rules via vendor PS engagement | Fixed rule sets; no customization |
| Q5 | Regulatory coverage | Specific rule-to-regulation mapping document | General regulatory domain coverage claimed | No mapping; marketing claims only |
| Q6 | SDK integration | Live demo; async; fail-closed option; <5 line integration | Working SDK; proxy pattern required | No SDK; REST API integration required |
| Q7 | Pricing | Per-evaluation with volume tiers; self-hosted option | Per-evaluation; cloud-only | Per-seat; opaque volume pricing |
| Q8 | Examination support | Has supported examinations; procedures documented | Exportable archives; no examination experience | No formal examination support process |
A score of 14–16 indicates a mature enforcement platform appropriate for regulated industry deployment. A score of 10–13 indicates a capable platform with gaps that may require supplementary controls. A score below 10 indicates a monitoring or observability tool that should not be positioned as a compliance enforcement control.
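For teams comparing several vendors, the rubric can be applied mechanically. The sketch below sums the per-question points and maps the total to the three bands above. The `weighted_total` helper is an assumption about how weighting might be applied: it rescales a weighted score back to the 16-point scale so the same bands remain usable.

```python
QUESTIONS = ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8")


def total_score(scores: dict) -> int:
    """Sum the 0/1/2 points for all eight questions."""
    for question in QUESTIONS:
        if scores.get(question) not in (0, 1, 2):
            raise ValueError(f"{question} must be scored 0, 1, or 2")
    return sum(scores[q] for q in QUESTIONS)


def band(total: float) -> str:
    if total >= 14:
        return "mature enforcement platform"
    if total >= 10:
        return "capable platform with gaps"
    return "monitoring or observability tool"


def weighted_total(scores: dict, weights: dict) -> float:
    """Assumed weighting scheme: weighted average of points, rescaled to 0-16."""
    total_score(scores)  # reuse the validation
    weight_sum = sum(weights[q] for q in QUESTIONS)
    return 16 * sum(weights[q] * scores[q] for q in QUESTIONS) / (2 * weight_sum)


# Hypothetical vendor scores.
vendor = {"Q1": 2, "Q2": 2, "Q3": 1, "Q4": 2, "Q5": 1, "Q6": 2, "Q7": 1, "Q8": 1}
print(total_score(vendor), band(total_score(vendor)))
```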
Red Flags to Watch For
Beyond the specific question answers, certain patterns in vendor presentations signal weaknesses that the questions above may not surface directly:
Demos that only show the dashboard, never the API. A governance platform that cannot be demonstrated at the API level — show me a policy evaluation call and the resulting certificate — is a visualization tool, not an enforcement tool. The enforcement capability must exist at the API level to be integrated into production request paths.
Case studies that describe monitoring, not enforcement. If every customer case study describes "detecting policy violations" and "surfacing compliance insights," the platform's primary function is monitoring. Look for case studies that describe "blocking non-compliant requests" and "generating verifiable audit trails for regulatory examination." The language of the case studies reveals the primary use case.
Coverage claims based on system-prompt policy injection. Some platforms claim policy enforcement by inserting policy rules into the primary LLM's system prompt, then evaluating whether the LLM's output adheres to those rules. This approach is neither deterministic nor auditable: the same policy-laden prompt can produce different evaluations depending on context and temperature. It is also easily bypassed by sophisticated users who can manipulate the conversation context.
Regulatory compliance claimed without legal review. Ask whether the vendor's regulatory coverage claims have been reviewed by outside legal counsel specializing in the relevant regulatory domains. A vendor that claims CFPB compliance coverage without having engaged CFPB-specialized counsel has made a marketing claim, not a legal determination. This matters when you rely on that coverage claim during an examination.
For a technical comparison of specific platforms currently on the market, see our article CoreGuard vs. Guardrails AI, Rebuff, and NVIDIA NeMo. For context on specific regulatory requirements, see our coverage of CFPB AI lending guidance, SR 11-7 model risk management, and the EU AI Act enforcement requirements. To understand how CoreGuard addresses these evaluation criteria specifically, review the technical documentation and the live demo.