AI Model Risk Management: Bridging SR 11-7 Principles to LLM Deployment

SR 11-7, the Federal Reserve's supervisory guidance on model risk management, issued jointly with the OCC, remains the foundational framework for model governance at US financial institutions. It was written in 2011 for an environment dominated by traditional statistical models: logistic regression credit scorecards, market risk VaR models, and liquidity stress testing frameworks. The guidance establishes principles for model development, validation, and governance that have served the industry well for over a decade.

The deployment of large language models in banking workflows creates a specific problem: SR 11-7's framework assumes a model architecture that LLMs do not fit. The guidance's core concepts — defined inputs and outputs, statistical performance metrics, backtesting against historical data, parameter stability — were engineered for models with bounded, interpretable behavior. LLMs are not bounded in the same sense, are not interpretable in the same sense, and do not backtest in the same sense.

None of this exempts LLMs from SR 11-7; the guidance applies fully. It does mean that implementing SR 11-7's requirements for LLM deployments calls for architectural choices that the guidance's authors did not anticipate. This article identifies the five principal gaps and explains how a pre-execution enforcement layer, a deterministic policy check applied to model outputs before they are used in any consequential decision, closes each one.

Why SR 11-7 Was Written for Statistical Models

SR 11-7 defines a "model" as "a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates." The definition continues: the model "consists of three components: an information input component, which delivers assumptions and data to the model; a processing component, which transforms inputs into estimates; and a reporting component, which translates the estimates into useful business information."

This three-component definition maps cleanly onto traditional quantitative models. A credit scorecard takes structured application data as input, applies a logistic regression or similar transformation, and produces a numerical score. The score can be backtested against historical outcomes. The model's parameters can be examined. The validation team can construct a challenger model, compare performance statistics, and make a defensible determination about whether the production model performs adequately.

An LLM does not fit this architecture. Its "inputs" may be unstructured natural language that varies in length, format, and semantic content. Its "processing component" is a multi-billion-parameter neural network whose internal computations are not interpretable in a meaningful sense. Its "outputs" are probability distributions over tokens, which are then sampled to produce text, rather than deterministic numerical estimates.
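The sampling step is what separates an LLM from a deterministic scorecard. The following is a minimal sketch, assuming a hypothetical three-token vocabulary and made-up logits (a real model has a vocabulary of tens of thousands of tokens and computes its logits internally). It shows how the same input can yield different text from one run to the next:

```python
import math
import random

# Hypothetical vocabulary and logits, for illustration only.
VOCAB = ["approve", "decline", "refer"]
LOGITS = [2.0, 1.5, 0.5]


def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab, logits, rng, temperature=1.0):
    """Draw one token at random according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


def sample_many(n, seed):
    """Sample n tokens from the same fixed logits using a seeded generator."""
    rng = random.Random(seed)
    return [sample_token(VOCAB, LOGITS, rng) for _ in range(n)]


if __name__ == "__main__":
    print(softmax(LOGITS))
    print(sample_many(10, seed=0))
```

A scorecard given the same application data returns the same score every time. Here the logits never change, yet repeated draws produce a mix of tokens, which is why point-in-time output comparison is a weaker validation tool for LLMs.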

The practical implication is that the standard validation toolkit — statistical performance metrics, backtesting, parameter stability analysis, challenger model comparison — cannot be applied to LLMs in the same way. Model risk managers who attempt to apply SR 11-7 as written to LLM deployments will encounter five specific gaps.

The Five Gaps: SR 11-7 Applied to LLMs

Gap 1

Conceptual Soundness Validation

SR 11-7 requires that validators assess the "conceptual soundness" of model design, including the "quality of model design and construction" and whether the "model theory and methods are appropriate for the intended use." For a statistical model, conceptual soundness means evaluating whether the chosen functional form is appropriate, whether the model's assumptions are valid, and whether the training methodology is sound. For an LLM procured from a foundation model provider, the validator has no access to training data, training methodology, or architectural decisions. Conceptual soundness validation of a closed-source LLM is structurally impossible using traditional methods.

Enforcement Layer Closure

Pre-execution enforcement shifts the validation focus from the model's internal behavior to the model's output behavior within a defined policy envelope. Instead of establishing that the model is conceptually sound, validators verify that the enforcement layer's policy rules correctly identify and block outputs outside the acceptable envelope. Because the policy rules are deterministic code rather than a neural network, the same input always yields the same result, and the rules can be tested systematically against a comprehensive test suite before deployment.
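To make the testability point concrete, here is a minimal sketch of one deterministic rule and a small test suite for it. The rule name, the term list, and the test cases are all hypothetical, and keyword matching is a deliberately crude stand-in for whatever logic a production policy pack would use:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleResult:
    rule_id: str
    passed: bool
    reason: str


# Hypothetical, incomplete term list used only for this illustration.
PROHIBITED_BASIS_TERMS = ("race", "religion", "national origin", "marital status")


def check_prohibited_basis(output_text: str) -> RuleResult:
    """Fail any output that mentions a listed term as a whole word or phrase."""
    lowered = output_text.lower()
    for term in PROHIBITED_BASIS_TERMS:
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            return RuleResult("prohibited_basis", False, f"mentions '{term}'")
    return RuleResult("prohibited_basis", True, "no listed term found")


# Each case pairs an LLM output with the expected pass/fail result.
TEST_SUITE = [
    ("Debt-to-income ratio exceeds the program limit.", True),
    ("Applicant's marital status suggests instability.", False),
    # 'race' embedded in other words must not trigger the rule.
    ("Embrace the new terrace financing offer.", True),
]


def run_suite():
    """Return the cases whose actual result differs from the expected one."""
    failures = []
    for text, expected in TEST_SUITE:
        result = check_prohibited_basis(text)
        if result.passed != expected:
            failures.append((text, expected, result))
    return failures


if __name__ == "__main__":
    print("failures:", run_suite())
```

Because the rule is a pure function of its input, a validator can rerun the suite at any time and expect identical results, which is not true of the model being governed.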

Gap 2

Outcome Analysis and Backtesting

SR 11-7 requires ongoing monitoring of model performance against outcomes, including backtesting where applicable. For a credit model, this means comparing model predictions against actual credit performance: did the borrowers the model predicted as high-risk actually default at higher rates? For an LLM used in lending assistance, the "outcome" is typically a decision made by a human loan officer who was informed by the LLM's output. The causal chain from LLM output to lending outcome is difficult to establish, and the outcome itself may not be observable for months or years after the interaction.

Enforcement Layer Closure

Per-decision audit records from the enforcement layer create a linkage between LLM interactions and downstream outcomes. By preserving a structured record of every LLM-assisted decision — including the LLM's output, the enforcement layer's evaluation, and the human officer's final decision — institutions can construct a dataset that links AI assistance to lending outcomes. This enables a form of outcome analysis: did decisions made with AI assistance perform differently than decisions made without? The audit record is the data foundation for this analysis.
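The record structure and the comparison it enables might look like the following sketch. The field names, disposition values, and sample records are assumptions made for illustration, not a schema taken from any particular product:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DecisionRecord:
    decision_id: str
    ai_assisted: bool
    llm_output: Optional[str]               # None when no LLM was involved
    enforcement_disposition: Optional[str]  # e.g. "allow" or "block"
    officer_decision: str                   # "approve" or "decline"
    defaulted: Optional[bool] = None        # None until the outcome is observed


def default_rate(records: Sequence[DecisionRecord], ai_assisted: bool) -> Optional[float]:
    """Default rate among approved loans with an observed outcome."""
    observed = [
        r for r in records
        if r.ai_assisted == ai_assisted
        and r.officer_decision == "approve"
        and r.defaulted is not None
    ]
    if not observed:
        return None
    return sum(1 for r in observed if r.defaulted) / len(observed)


# Hypothetical records for illustration only.
RECORDS = [
    DecisionRecord("d1", True, "Income verified; DTI 31%.", "allow", "approve", False),
    DecisionRecord("d2", True, "Thin credit file noted.", "allow", "approve", True),
    DecisionRecord("d3", True, "Stable employment history.", "allow", "approve", None),
    DecisionRecord("d4", True, "Collateral shortfall noted.", "allow", "decline", None),
    DecisionRecord("d5", False, None, None, "approve", True),
    DecisionRecord("d6", False, None, None, "approve", False),
    DecisionRecord("d7", False, None, None, "approve", False),
    DecisionRecord("d8", False, None, None, "approve", False),
]

if __name__ == "__main__":
    print("AI-assisted:", default_rate(RECORDS, True))
    print("Unassisted:", default_rate(RECORDS, False))
```

A raw difference between the two rates is a starting point rather than a causal finding: the assisted and unassisted populations may differ in ways the record does not capture, and records whose outcome is not yet observable are excluded from both rates.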

Gap 3

Model Stability and Parameter Drift

SR 11-7 requires monitoring for model stability, including detection of parameter drift or behavioral changes that might affect model performance. Traditional statistical models have parameters that can be examined at a point in time and compared across periods. LLMs deployed via API may be updated by the provider at any time, sometimes without notification. A foundation model provider updating a model from GPT-4-turbo to GPT-4o may change the model's behavior in ways that affect fair lending outcomes, risk assessment accuracy, or document processing reliability — without the deploying institution being aware of the change until it observes differences in outputs.

Enforcement Layer Closure

The enforcement layer provides a behavioral stability monitor by tracking policy evaluation results over time. If a foundation model update causes the model to produce outputs that trigger policy rules at a different rate — more adverse action notices failing the specificity check, more requests blocked for prohibited basis concerns — the shift is detectable in the enforcement layer's telemetry. This is not direct observation of parameter drift, but it is a behavioral change detector that does not require access to model internals.
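One simple way to turn trigger-rate telemetry into an alert is a two-proportion z-test between a baseline window and the current window. This is a sketch of that swapped-in statistical technique, not a description of how any specific enforcement product monitors drift; the counts and the threshold of 3.0 are assumptions:

```python
import math


def two_proportion_z(triggers_a, total_a, triggers_b, total_b):
    """z-statistic for the difference in trigger rate between two windows."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each window needs at least one evaluation")
    p_a = triggers_a / total_a
    p_b = triggers_b / total_b
    pooled = (triggers_a + triggers_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0  # both windows triggered on none or all evaluations
    return (p_b - p_a) / se


def drift_alert(baseline, current, z_threshold=3.0):
    """True when the trigger rate moved by more than the threshold allows.

    baseline and current are (trigger_count, evaluation_count) pairs.
    """
    z = two_proportion_z(baseline[0], baseline[1], current[0], current[1])
    return abs(z) >= z_threshold


if __name__ == "__main__":
    # Hypothetical counts for one rule, e.g. a specificity check.
    print(drift_alert((50, 1000), (55, 1000)))   # small change in rate
    print(drift_alert((50, 1000), (110, 1000)))  # rate more than doubles
```

In practice the test would run per rule and per window, and the threshold would be tuned so that ordinary variation in request mix does not page the model risk team.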

Gap 4

Independent Validation

SR 11-7 requires independent validation of models by personnel who are separate from those who developed or use the model. For a statistical model developed in-house, this means a separate validation team reviews the model's documentation, tests its performance, and assesses its fitness for purpose. For an LLM procured from an external provider, "independent validation" faces a structural problem: the validation team has no access to the model weights, training data, or development decisions. They can validate the institution's use of the model, but not the model itself.

Enforcement Layer Closure

Independent validation of the enforcement layer — which is deterministic, fully documented, and internally accessible — satisfies SR 11-7's validation requirement for the portion of the governance architecture that the institution controls. Validators can review every policy rule in the enforcement layer, test each rule against a comprehensive test suite, verify that the rules map correctly to regulatory requirements, and produce a complete validation document for examination. The enforcement layer is the component that is fully validatable; the foundation model's provenance risk is managed through vendor due diligence, contractual provisions, and the behavioral monitoring described above.

Gap 5

Audit Trail Requirements

SR 11-7 requires comprehensive documentation of model use, including records that allow reconstruction of how a model was used in any specific decision. For traditional models, this typically means preserving the input data, the model version, and the output score for each scored record. For LLMs, the equivalent requirement — preserving the full prompt, the model version, and the full response — creates significant data volume challenges, and the preserved records do not by themselves constitute a governance audit trail. The raw LLM exchange does not show what governance checks were applied, which rules were triggered, or whether the output was modified before use.

Enforcement Layer Closure

The enforcement layer's signed decision certificates constitute the governance audit trail that SR 11-7's documentation requirements demand. Each certificate preserves the policy version evaluated, the specific rules applied, the individual rule results, and the final disposition — providing the structured governance record that examiners need to reconstruct the governance state at any historical decision point. Raw LLM transcripts may be archived separately for completeness, but the enforcement layer certificates are the primary governance documentation.
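The summary table in this article describes the certificates as HMAC-signed. A minimal sketch of signing and verifying such a certificate with Python's standard library follows. The payload field names, the example values, and the hard-coded demo key are assumptions; a real deployment would draw the key from a managed secret store:

```python
import hashlib
import hmac
import json


def _canonical_bytes(payload: dict) -> bytes:
    """Serialize with sorted keys so the same payload always signs the same."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_certificate(payload: dict, key: bytes) -> dict:
    """Attach an HMAC-SHA256 signature to a decision payload."""
    signature = hmac.new(key, _canonical_bytes(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_certificate(cert: dict, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(key, _canonical_bytes(cert["payload"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, cert["signature"])


# Hypothetical payload mirroring the fields named in the paragraph above.
EXAMPLE_PAYLOAD = {
    "decision_id": "d-0001",
    "policy_version": "lending_v1",
    "rules_applied": ["prohibited_basis", "adverse_action_specificity"],
    "rule_results": {"prohibited_basis": "pass", "adverse_action_specificity": "fail"},
    "disposition": "block",
}

if __name__ == "__main__":
    demo_key = b"demo-key-not-for-production"
    cert = sign_certificate(EXAMPLE_PAYLOAD, demo_key)
    print(verify_certificate(cert, demo_key))
```

An HMAC proves integrity only to parties that hold the shared key. If examiners or auditors need to verify certificates without being able to forge them, an asymmetric signature scheme would be the more appropriate design choice.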

Gap Analysis Summary Table

SR 11-7 Requirement	Gap for LLMs	Enforcement Layer Coverage	Residual Risk
Conceptual soundness validation	HIGH — No access to model internals	Validate enforcement layer rules (fully testable)	MEDIUM — Vendor due diligence required
Outcome analysis / backtesting	MEDIUM — Causal chain ambiguity	Per-decision audit record enables outcome linkage	LOW — With consistent audit record collection
Model stability monitoring	HIGH — Provider may update model silently	Policy trigger rate monitoring detects behavioral changes	MEDIUM — Not equivalent to parameter monitoring
Independent validation	HIGH — Cannot validate closed-source model	Full validation of enforcement layer is possible	MEDIUM — Foundation model provenance risk remains
Audit trail / documentation	MEDIUM — Raw transcripts lack governance context	HMAC-signed certificates provide governance audit records	LOW — Certificate chain fully satisfies documentation

Model Validation Documentation for LLM Deployments

For institutions that must produce SR 11-7-compliant model validation documentation for LLM deployments, the enforcement layer architecture changes what the documentation needs to demonstrate. Rather than documenting traditional statistical validation results, the validation package for an LLM deployment governed by an enforcement layer should include:

Use case specification. A precise description of the business function the LLM performs, the data it accesses, the outputs it produces, and the human decisions it informs. The validation team needs to understand the intended purpose to evaluate whether the governance controls are adequate for that purpose.
Policy pack documentation. A complete description of the enforcement layer's policy rules, including the regulatory requirements each rule satisfies, the logic of each rule, and the test cases used to validate each rule. This is the conceptual soundness documentation for the governance architecture.
Policy test coverage report. Documentation of the test harness used to validate policy behavior, including the test cases, the expected results, the actual results, and the coverage metrics. Regulators will ask for evidence that policy rules were tested before deployment.
Behavioral monitoring plan. A specification of the metrics monitored to detect behavioral drift in the foundation model, the thresholds at which alerts are generated, and the escalation process when thresholds are crossed. This is the substitute for traditional model stability monitoring.
Vendor due diligence package. Documentation of the institution's assessment of the foundation model provider's practices, including training data practices, safety evaluation procedures, change notification protocols, and BAA or equivalent agreements as applicable.
Limitations and restrictions. A clear statement of what the LLM may and may not do within the institution's workflows, including the specific functions governed by the enforcement layer and any functions excluded from the governance scope.
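The policy test coverage report in the list above can be generated mechanically from the test harness results. The following sketch assumes a hypothetical result format of (rule_id, expected, actual) tuples and made-up rule names:

```python
from collections import Counter


def coverage_report(rule_ids, test_results):
    """Summarize test coverage and failures per policy rule.

    rule_ids: list of every rule in the policy pack.
    test_results: iterable of (rule_id, expected, actual) tuples.
    """
    cases = Counter()
    passed = Counter()
    for rule_id, expected, actual in test_results:
        cases[rule_id] += 1
        if expected == actual:
            passed[rule_id] += 1
    untested = sorted(r for r in rule_ids if cases[r] == 0)
    failing = sorted(r for r in rule_ids if passed[r] < cases[r])
    tested = len(rule_ids) - len(untested)
    return {
        "rule_coverage": tested / len(rule_ids) if rule_ids else 0.0,
        "untested_rules": untested,
        "failing_rules": failing,
        "cases_per_rule": {r: cases[r] for r in rule_ids},
    }


# Hypothetical rules and results for illustration only.
RULES = ["prohibited_basis", "adverse_action_specificity", "pii_redaction"]
RESULTS = [
    ("prohibited_basis", "block", "block"),
    ("prohibited_basis", "allow", "allow"),
    ("adverse_action_specificity", "block", "allow"),
]

if __name__ == "__main__":
    print(coverage_report(RULES, RESULTS))
```

Listing untested and failing rules separately matters for examination purposes: a rule with no test cases is a documentation gap, while a rule with a failing case is a defect that should block deployment.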

The Examination-Ready Posture

Federal Reserve and OCC examiners reviewing model risk management programs at institutions using LLMs will be looking for evidence that the institution understands the limitations of SR 11-7 as applied to LLMs and has developed compensating controls where traditional methods fall short.

The compensating control argument that resonates with examiners is the enforcement layer architecture: because we cannot fully validate the foundation model's internal behavior, we have built a governance layer that constrains what the model can do before its outputs are used in any consequential decision. The governance layer is fully deterministic, fully testable, and fully documented. Every consequential decision informed by the LLM carries a signed certificate documenting the governance evaluation.

This is not a claim that LLMs have been validated in the traditional sense. It is a claim that the institution has implemented risk management controls that satisfy SR 11-7's substantive requirements — risk reduction, outcome documentation, ongoing monitoring, and audit trails — through mechanisms appropriate to LLM technology.

Examiners who are technically sophisticated will recognize the architectural soundness of this approach. Examiners who are not will typically accept it when accompanied by comprehensive documentation and evidence of actual enforcement. The enforcement layer's signed certificate archive is the evidence of actual enforcement — a concrete demonstration that governance was applied to every decision, not just to those reviewed by a compliance analyst.

For related regulatory guidance, see our coverage of CFPB AI lending requirements, EU AI Act Article 9 enforcement requirements, and the full SR 11-7 enforcement layer analysis. For technical integration details of the EVE CoreGuard enforcement platform, including the lending_v1 policy pack that addresses ECOA, Reg B, and HMDA requirements, see the documentation.
