
Regulation and Governance: FDA, Annex 22, and Validating a Model

📍 Where we are: Part VII · ML/AI in Industry Today — Chapter 27. The case-studies chapter graded who deployed what and found the top-right quadrant — independently verified, in commercial GMP — empty, with the Purolea warning letter sitting at the field's edge as a warning sign. This chapter turns from who deployed it to what is allowed and how you prove it: the rules, the documents that contain them, and a worked lifecycle that takes one model from intended use to a defensible, validated, monitored deployment.

The previous chapters built models and then asked how good the evidence was. This one asks the harder question that every real deployment eventually faces: not "does the model work?" but "are you allowed to use it, and can you prove it does what you claim, to a regulator who can shut your line down?" That is a different discipline from machine learning, and it is the one that decides whether a clever soft sensor ever touches a GMP batch. The regulatory landscape for AI in drug manufacturing went from almost empty in 2022 to concrete in 2025–2026: a draft EU annex written specifically for AI, an ISPE validation playbook for it, an FDA risk-based credibility framework, and — the moment the abstractions became real — the first warning letter to cite AI. This chapter is the map of all of it, and it ends where the MLOps chapter left a thread hanging: a concrete, document-by-document validation of our running example's glucose soft sensor under that map.

The simple version

Think of a new bridge. Before anyone drives on it, an engineer must produce a dossier — calculations, materials certificates, load tests — and an independent authority signs that the bridge is fit for its stated purpose. The thicker the traffic the bridge will carry, the more evidence the authority demands. AI in a drug plant works the same way. You declare exactly what the model is for (its "intended use"), you judge how much harm a wrong answer would do (its "risk"), and you assemble evidence proportional to that risk showing the model is trustworthy for that purpose. Then you lock the model so it cannot quietly change, watch it for drift, and only ever change it on purpose with paperwork. The regulators' core message in 2026 is short: a model may advise, the more it influences a critical decision the more you must prove, and a human — not the model — makes the call and signs. The firm that let an AI sign instead got the first warning letter.

What this chapter covers

  • FDA's 2023 discussion paper Artificial Intelligence in Drug Manufacturing and the broader 2025 AI-in-drug-development guidance, with the 7-step risk-based credibility framework that scales evidence to a model's "context of use."
  • The draft EU GMP Annex 22 in depth: locked (static, deterministic) versus adaptive models, the predetermined change control plan (PCCP), and the explicit line excluding generative, continuously-learning, and adaptive AI from critical GMP decisions.
  • The supporting regulatory frame: ICH and PIC/S posture, the ISPE GAMP AI Guide, Computer Software Assurance (CSA), and ALCOA+ data integrity — the four pillars a model-validation file actually stands on.
  • A worked validation lifecycle for an ML soft sensor under GMP, end to end: intended use → risk → data and credibility evidence → locking → monitoring → change control.
  • The Purolea cGMP warning letter (2 April 2026), the first to cite AI, as the enforcement anchor that makes every abstraction above concrete.

FDA: a discussion paper, a framework, and risk proportionality

The FDA's posture on AI in manufacturing begins not with a rule but with a question. In 2023 the agency published a discussion paper, Artificial Intelligence in Drug Manufacturing, which is exactly what it says: a structured set of questions, not binding requirements [1]. It is the most important regulatory document for this field precisely because of what it does not do — it does not prescribe answers, it surfaces the issues a manufacturer must think through. Three of its themes run through everything else in this chapter. First, the GMP framework was not written with AI in mind, so applying it requires interpretation: an ML model is neither obviously "equipment" nor obviously "a computerized system" nor obviously "a method," yet it has properties of all three. Second, managing the data that trains and feeds a model is itself a GMP concern — provenance, representativeness, integrity. Third, models change, and the agency asks directly how a manufacturer will validate and re-validate a model over its life. The paper is a regulator thinking out loud, and reading it is the single best way to understand what the FDA will eventually expect.

The structural idea underneath the FDA's thinking is risk proportionality, and it has a concrete form in the agency's broader work on computational models. FDA's 2025 draft guidance on AI to support regulatory decision-making for drugs establishes a 7-step, risk-based credibility-assessment framework: you (1) state the question of interest the model addresses, (2) define the model's context of use — exactly what the model output will and will not be used for, (3) assess model risk as the combination of how much the decision relies on the model and the consequence of the decision being wrong, then (4) develop a credibility-assessment plan whose rigor matches that risk, (5) execute it, (6) document the credibility-assessment results, and (7) determine model adequacy for the context of use — iterating if the evidence falls short [2]. The genius of the framework is that the same seven steps govern a trivial spreadsheet and a CQA-defining neural network; what differs is the depth of evidence step 4 demands, set by the risk judged in step 3. A model that merely flags batches for human review needs less proof than one whose output releases a lot. This is the regulatory translation of the entire book's two-axis discipline: context of use is intended-use scope, and model risk is the multiplier on how much evidence you owe.
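Step 3 can be sketched in a few lines. This is a minimal illustration that assumes a three-level ordinal scale for both inputs; the level names, the product score, and the cut-offs are hypothetical choices for the example, not values taken from the FDA draft guidance.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def model_risk(influence: Level, consequence: Level) -> str:
    """Combine reliance on the model with the consequence of a wrong decision.

    The product score and the cut-offs are illustrative assumptions.
    """
    score = int(influence) * int(consequence)  # ranges over 1..9
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


# Advisory soft sensor: a human gates the decision, a mis-timed feed is recoverable.
print(model_risk(Level.MEDIUM, Level.MEDIUM))  # medium
# Hypothetical autonomous lot release: full reliance, severe consequence.
print(model_risk(Level.HIGH, Level.HIGH))      # high
```

The point of writing it down is that the risk level becomes an explicit, reviewable input to the credibility plan in step 4 rather than an unstated judgment.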

Crucially, the FDA documents are a draft guidance and a discussion paper. They count as independent sources in the narrow sense that they are public regulatory documents, but they are explicitly not rules: they tell you how the agency will think, not what you must do. For an instrument that draws hard lines, you cross the Atlantic.

The draft EU GMP Annex 22: where the line gets drawn

The EU drew the first manufacturing-specific AI line in regulation. The draft EU GMP Annex 22, "Artificial Intelligence," was released for consultation in mid-2025 as part of EudraLex Volume 4, jointly with PIC/S, and it is the first GMP text written specifically for AI [3]. Where the FDA asks questions, Annex 22 makes commitments — and they are sharp enough that every team deploying ML in a regulated plant must know them.

The annex's central distinction is between static (locked) models and dynamic (adaptive) models. A static model is frozen: its parameters do not change once deployed, so the same input yields the same output for the model's whole validated life. A dynamic model continues to learn in production, updating itself on new data. Annex 22's headline rule follows directly: for critical GMP applications it permits only static, deterministic models, and it excludes from critical use the categories that cannot offer that determinism — dynamic continuously-learning models, probabilistic models whose output is not reproducible, and generative AI and large language models [3][4]. The reasoning is the one the MLOps chapter built toward: GMP validation rests on reproducibility, and a model that changes itself, or that answers differently on two runs, cannot be validated by a one-time test. So the annex does not ban these models — it confines them to non-critical uses, and reserves critical decisions for models that hold still.
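The "same input yields the same output" property is easy to probe in code. The sketch below is a hypothetical check, not a procedure taken from the annex: `locked_model` and `sampling_model` are toy stand-ins, and passing the check is a necessary sign of determinism rather than proof of it.

```python
import random
from typing import Callable, Sequence


def is_reproducible(predict: Callable[[Sequence[float]], float],
                    x: Sequence[float], runs: int = 5) -> bool:
    """Return True if repeated calls on the same input give identical output."""
    first = predict(x)
    return all(predict(x) == first for _ in range(runs - 1))


def locked_model(x: Sequence[float]) -> float:
    # Fixed coefficients: a stand-in for a frozen, deterministic model.
    weights = [0.5, -0.2, 1.1]
    return sum(w * v for w, v in zip(weights, x))


def sampling_model(x: Sequence[float]) -> float:
    # Fresh random noise on every call: a stand-in for non-reproducible output.
    return locked_model(x) + random.gauss(0.0, 0.1)


spectrum = [1.0, 2.0, 3.0]
print(is_reproducible(locked_model, spectrum))    # True
print(is_reproducible(sampling_model, spectrum))  # False (with overwhelming probability)
```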

"Critical" here carries its GMP weight: an application is critical when its output affects product quality, patient safety, or data integrity. Our glucose soft sensor advising a feed sits near that line; a model that released a lot on its own would be squarely over it. The annex pairs the static-model requirement with a set of expectations that read like a validation table of contents: a documented intended purpose, a risk assessment, data-governance controls over training and input data, test-data independence (the data used to test a model must be genuinely held out, not seen in training — the annex is explicit that overlapping test and training data invalidates the evidence), human oversight appropriate to the risk, and acceptance criteria fixed before testing. And it codifies the lifecycle mechanism: any change to a deployed model runs through change control, with the predetermined change control plan (PCCP) as the instrument that lets a planned retrain proceed without a fresh regulatory negotiation.

The PCCP is worth stating precisely because it is the bridge across the validation-versus-learning gap. A PCCP is a pre-approved, written specification of how a model may change in the future — which data it will be retrained on, which parts of the algorithm stay fixed, what acceptance criteria the new version must meet, and what the rollback plan is. With an approved PCCP, a retrain that stays inside the envelope the plan describes is a documented, planned event rather than an unforeseen change. The model still cannot learn in place; learning happens between locked versions, each one a discrete validated object, and the PCCP pre-approves the shape of the path between them. This is the regulatory form of the locked-then-relearn pattern: lock a model, run it unchanged, detect drift, retrain off-line into a new candidate, validate it, and promote it through change control — never an in-place silent edit.
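One way to picture the envelope is as data plus a check. The sketch below is a simplified, hypothetical structure: the field names, the example values, and the single R² criterion are assumptions for illustration, and a real PCCP is a controlled document rather than a dataclass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangePlan:
    """Hypothetical, simplified PCCP: the envelope a retrain must stay inside."""
    fixed_algorithm: str             # the part of the algorithm that stays fixed
    allowed_data_sources: frozenset  # data a retrain may draw on
    r2_gate: float                   # held-out R2 a new version must exceed
    rollback_version: str            # last known-good version to fall back to


@dataclass(frozen=True)
class RetrainCandidate:
    algorithm: str
    data_sources: frozenset
    held_out_r2: float


def envelope_violations(plan: ChangePlan, cand: RetrainCandidate) -> list:
    """Return the reasons a candidate falls outside the plan (empty list = inside)."""
    problems = []
    if cand.algorithm != plan.fixed_algorithm:
        problems.append("algorithm differs from the one the plan fixes")
    extra = cand.data_sources - plan.allowed_data_sources
    if extra:
        problems.append(f"data sources outside the plan: {sorted(extra)}")
    if cand.held_out_r2 <= plan.r2_gate:
        problems.append("held-out R2 does not exceed the plan's gate")
    return problems


plan = ChangePlan("PLS", frozenset({"raman_probe_A"}), 0.85, "v3")
inside = RetrainCandidate("PLS", frozenset({"raman_probe_A"}), 0.90)
outside = RetrainCandidate("neural_net", frozenset({"raman_probe_B"}), 0.90)
print(envelope_violations(plan, inside))   # []
print(len(envelope_violations(plan, outside)))  # 2
```

A candidate with no violations is a planned change; anything else falls outside the pre-approved path and would need its own assessment.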

One honesty note the book keeps making: Annex 22 is a draft. Consultation ran through 2025, finalization is expected around mid-2026, and the specific exclusions are provisional. Cite it as a draft, and watch for the final text — but its direction is unambiguous, and it already governs how careful firms design AI today.

The supporting frame: ICH, PIC/S, GAMP, CSA, and ALCOA+

Annex 22 and the FDA framework do not stand alone; they sit on a stack of established quality regulation that an AI deployment inherits whole. Four pieces of that stack do real work in a model-validation file.

ICH and PIC/S set the international backdrop. The product-quality lifecycle is already governed by ICH Q8–Q12 — quality by design, risk management, the pharmaceutical quality system, and lifecycle management — and an ML model that touches a critical process parameter or a CQA operates inside that design space, not outside it [5]. PIC/S, the inspectorate cooperation scheme, co-developed Annex 22 with the EU, which means the line the annex draws will be applied by inspectors across dozens of jurisdictions, not just the EU — a draft with unusually long reach.

The ISPE GAMP AI Guide is the practitioner's translation layer. Published in 2025 alongside the established GAMP 5 (2nd edition) computerized-system-validation framework, it extends GAMP's risk-based, lifecycle, V-model thinking to AI/ML, and introduces a structure often summarized as "seven control layers" for LLM-based and AI systems — covering data, model, deployment, monitoring, and human-oversight controls [6]. Where Annex 22 says what must hold, GAMP says how to demonstrate it: how to write the intended-use specification, how to scale testing to risk, how to document the supplier's contribution when the model comes from a vendor.

Computer Software Assurance (CSA) is the methodological shift that makes any of this affordable. The FDA finalized its CSA guidance (the latest update February 2026), and its message is a deliberate course-correction away from documentation-for-its-own-sake toward critical thinking and risk-based testing: spend assurance effort where a failure would actually harm product or patient, and use the least burdensome evidence sufficient for the risk [7]. For ML this is liberating — it means you do not test every code path of scikit-learn; you test that your model, on your data, meets its stated acceptance criteria for its intended use, and you scale the rest to risk. CSA is the reason a model-validation file can be evidence-rich and paperwork-lean at once.

ALCOA+ is the data-integrity spine underneath the evidence. Every datum that trains, tests, or feeds a model must be Attributable, Legible, Contemporaneous, Original, Accurate — plus Complete, Consistent, Enduring, and Available [8]. For ML this is not a side concern; it is foundational, because a model is only as trustworthy as the data behind it. A training set whose provenance cannot be attributed, or whose values were silently edited, poisons every downstream credibility claim. The open-source book's ALCOA+ chapter shows how to build these properties into the data pipeline by construction; here they become the precondition that lets the credibility evidence mean anything at all.
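A content hash is the simplest guard against the "silently edited" failure. The sketch below covers only a slice of ALCOA+ (who registered the file, when, and whether its bytes have changed since); the `ProvenanceRecord` fields and function names are hypothetical, and a real GxP system would add audit trails and access control.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class ProvenanceRecord:
    path: str
    sha256: str
    recorded_by: str  # Attributable
    recorded_at: str  # Contemporaneous (UTC, ISO 8601)


def file_sha256(path: Path) -> str:
    """Hash the file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register(path: Path, user: str) -> ProvenanceRecord:
    """Record who pinned the dataset, when, and its content hash."""
    stamp = datetime.now(timezone.utc).isoformat()
    return ProvenanceRecord(str(path), file_sha256(path), user, stamp)


def is_unchanged(record: ProvenanceRecord) -> bool:
    """True if the file on disk still matches the hash taken at registration."""
    return file_sha256(Path(record.path)) == record.sha256


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "assays.csv"
    data.write_text("hour,glucose\n0,4.1\n")
    rec = register(data, "analyst_01")
    print(is_unchanged(rec))   # True
    data.write_text("hour,glucose\n0,4.2\n")  # a silent edit
    print(is_unchanged(rec))   # False
```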

A worked validation lifecycle: the glucose soft sensor under GMP

Abstractions earn their keep only when applied. Take the running example's Raman glucose soft sensor (soft_sensor_pls.py), a PLS model predicting glucose from a Raman spectrum so the culture can be fed without waiting for the slow bench assay, and walk it through the full lifecycle the documents above demand. This is the chapter's spine: six steps, from intended use to change control, that apply the FDA framework's risk-proportionate logic and Annex 22's controls to one model.

1. Intended use (context of use). State exactly what the model is for, and — just as important — what it is not for. The model predicts in-process glucose concentration from a Raman spectrum, to advise the timing and size of a bolus feed during the production bioreactor phase. Its output is advisory: a human operator reviews it against the control strategy and decides the feed. It does not release material, does not define a CQA, and is not used outside the validated glucose range or for any product other than mAb-A. That single paragraph is the most consequential in the file, because everything else scales from it.

2. Risk assessment. Judge model risk as reliance × consequence. The soft sensor influences a feed decision but a human gates it, and a wrong glucose reading would, at worst, mis-time a feed — recoverable, monitored, and not directly release-defining. So this is a medium-risk, advisory application: more scrutiny than a dashboard, far less than an autonomous controller. Under Annex 22 it stays on the permitted side of the line because it is locked and human-gated; were it to close the loop on a CQA, it would cross into the territory the annex reserves for static models under the heaviest evidence — or excludes outright if it learned in production.

3. Data and credibility evidence. Assemble proof proportional to the risk. The model is trained on a pinned, ALCOA+ dataset (the Raman spectra and paired offline assays), with the dataset's sha256 recorded so "which data trained this?" is answerable forever. Test data is genuinely held out — Annex 22's independence requirement — and the acceptance criterion is fixed before testing: Raman→titer/glucose R² above 0.85 on held-out hours. The credibility evidence is the held-out metric measured against that pre-stated gate, plus the operating range the model is qualified over and the residual behavior at the edges. This is where the suite's harness does real work as a software-assurance analogue, below.

4. Locking. Freeze the validated object. The weights, the preprocessing, the fitted scaler, the feature contract, and the operating range are all version-pinned and unchangeable in place. The model does not learn on the fly. "Locked" is made literal: a model-version record binds the artifact to its dataset hash, its split and seed, and its frozen hyperparameters, so the deployed model is provably the exact one that was qualified.

5. Monitoring. Watch the locked model for drift, using the two detectors from the MLOps chapter: a label-free PSI on the input distribution (the leading indicator that fires when a new lot or a fouling probe moves the spectra) and an I-MR residual control chart against the sparse offline assay (the lagging, ground-truth indicator that proves the answers have gone wrong). Monitoring is not optional polish; it is the operational half of "validated forever."

6. Change control. Govern every change through the PCCP. A retraining trigger — a sustained PSI breach and a residual out-of-control signal, plus a calendar backstop, plus automatic re-qualification on any hardware change (a probe swap, a resin-lot change, a scale move) — opens the change-control path. The retrain produces a new version, validated against the PCCP's acceptance criteria, promoted only through a four-eyes gate (a second qualified person signs), with rollback to the last known-good version always one promotion away because old versions are never deleted.
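Steps 5 and 6 can be written as a rule. The sketch below is one possible reading of the trigger described above, in which the drift pair, the calendar backstop, and a hardware change each open change control on their own. The PSI threshold of 0.2 is a common rule of thumb, and the values for `sustained` and `max_days` are placeholders; none of these numbers come from the source. The I-MR limits use the standard individuals-chart constant 2.66.

```python
import math
from bisect import bisect_right
from statistics import mean


def _proportions(values, edges, eps):
    counts = [0] * (len(edges) - 1)
    for v in values:
        k = bisect_right(edges, v) - 1
        k = min(max(k, 0), len(counts) - 1)  # clamp out-of-range values into end bins
        counts[k] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two non-empty samples over fixed bin edges."""
    e = _proportions(expected, edges, eps)
    a = _proportions(actual, edges, eps)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def imr_limits(reference_residuals):
    """Individuals-chart limits: centre +/- 2.66 * average moving range."""
    centre = mean(reference_residuals)
    moving_ranges = [abs(b - a)
                     for a, b in zip(reference_residuals, reference_residuals[1:])]
    half_width = 2.66 * mean(moving_ranges)
    return centre - half_width, centre + half_width


def retrain_triggered(psi_history, residual, limits, days_since_validation,
                      hardware_changed, psi_threshold=0.2, sustained=3, max_days=180):
    """Open change control on sustained input drift plus a residual signal,
    or on the calendar backstop, or on any hardware change."""
    recent = psi_history[-sustained:]
    input_drift = len(recent) == sustained and all(p > psi_threshold for p in recent)
    low, high = limits
    out_of_control = not (low <= residual <= high)
    return ((input_drift and out_of_control)
            or days_since_validation >= max_days
            or hardware_changed)


limits = imr_limits([0.0, 0.1, -0.1, 0.0])
print(retrain_triggered([0.3, 0.3, 0.3], 1.0, limits, 10, False))  # True
print(retrain_triggered([0.3, 0.3, 0.3], 0.0, limits, 10, False))  # False
```

In this reading a PSI breach without a residual signal does not trigger a retrain by itself, which matches the leading-versus-lagging roles the two detectors play.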

Hero diagram of the GMP model-validation lifecycle drawn as a closed loop in six labelled stations around a central locked-model node. Station one, intended use, an indigo card stating the soft sensor advises a feed and does not release a lot. Station two, risk assessment, an amber card scoring model risk as reliance times consequence and reading medium, advisory, human-gated. Station three, data and credibility evidence, a cyan card showing a pinned ALCOA-plus dataset with a sha256 hash, a held-out test split marked independent, and a pre-stated acceptance gate R-squared above 0.85. Station four, locking, a green card freezing weights, scaler, feature contract, and operating range into a version-pinned model. Station five, monitoring, two small monitor glyphs, a label-free PSI input-drift detector and an I-MR residual control chart fed by the sparse offline assay. Station six, change control, a rose card with a PCCP box, a four-eyes promotion gate, and a rollback arrow back to the previous locked version. A central node reads locked validated model, advisory only, with a human-decides badge. Around the rim a thin band names the governing documents, FDA credibility framework, draft Annex 22, ISPE GAMP AI, CSA, ALCOA-plus, and a footer notes the model is never edited in place. The critical-decision zone, where the model would release a lot, is drawn outside the loop behind a red Annex 22 boundary line and left empty.
The model-validation lifecycle as a governed loop: intended use scopes it, risk sets how much evidence it owes, credibility evidence is measured against a pre-stated gate on a pinned ALCOA+ dataset, the validated model is locked, two drift detectors watch it, and every change runs through a PCCP and a four-eyes gate with rollback always available. The whole loop sits inside the documents that govern it, with the critical-decision zone beyond the Annex 22 line deliberately empty. Original diagram by the authors, created with AI assistance.

That lifecycle is not a slide; it is a file you can run. The suite's run_all.py harness is the chapter's contribution, and it is deliberately not another model. It is the analogue, in code, of the FDA credibility framework and a CSA-style assurance check: a model is trustworthy not because it ran, but because evidence was produced and checked against a pre-stated acceptance criterion, on a pinned dataset, by a reproducible procedure. Every module in the suite ends with an assert over a held-out metric — the script-level equivalent of an acceptance criterion in a validation protocol — and the harness runs each one, records whether the gate held, and pins the sha256 of the datasets it was fitted on. Running python run_all.py prints a validation summary, not a benchmark:

model-credibility evidence harness — Book 5 suite
dataset root: examples/platform/datasets
acceptance gate per model is the module's own assert

PASS developability.py ch05 [(synthetic)]
gate: ranking AUROC > 0.70 on held-out clones
PASS soft_sensor_pls.py ch11 [raman_spectra.parquet:9f3c1a7b2e44]
gate: Raman->titer R2 > 0.85 on held-out hours
PASS hybrid_model.py ch11 [fedbatch_state.parquet:6b81d0f4a3c2]
gate: hybrid beats pure-ML extrapolation
PASS mspc.py ch18 [hplc_results.csv:2d7e90a1c5f8, batches.csv:aa14c7]
gate: MSPC flags ONLY the OOS batch; SPE points at HCP
PASS vision_avi.py ch17 [(synthetic)]
gate: defect classifier recall above the gate

credibility summary: 9/9 models cleared their acceptance gate on the pinned datasets
every executed model produced evidence that cleared its stated gate.
NOTE: passing the gate is necessary, not sufficient — GMP credibility
also needs intended-use scope, change control, and human oversight.

Read that last NOTE the way an inspector would. The harness proves the evidence exists and the gate was cleared on pinned data — the necessary, machine-checkable core of step 3. But it ends by naming what code cannot supply: the intended-use scope, the change control, the human oversight. Those are properties of a validated system and procedure, not of a script — which is exactly the line CSA and Annex 22 draw, and exactly why a passing test suite is the beginning of a validation file, not the end of one.

Anatomy of a model-validation dossier (with its PCCP)

The unit of governance in this chapter is not a prediction; it is the dossier — the document set that, taken together, makes a model deployable under GMP. Like every artifact in this series, its value is in what travels alongside the weights. Dissect one the way a quality reviewer or an inspector would, section by section.

Anatomy identity card of one GMP model-validation dossier for the glucose soft sensor, with seven stacked sections. An indigo header names the dossier, glucose_softsensor v4 validation file, and its status, approved for advisory use. A scope section holds the intended-use statement, advises a bolus feed, does not release a lot, mAb-A only, within the validated glucose range, with the context-of-use boundary drawn explicitly. A risk section holds the model-risk score as reliance times consequence reading medium, advisory, human-gated, mapped to the evidence depth it requires. A credibility-evidence section holds the pinned training dataset by sha256, the independent held-out test split, the pre-stated acceptance gate R-squared above 0.85, and the measured held-out result against it, each marked frozen-before-testing. A controls section lists the ALCOA-plus data-governance attestation, the locked-model attestation freezing weights and scaler, and the human-oversight description. A PCCP section, drawn as a nested sub-card, holds the allowed-change envelope, the retraining trigger written as a rule, the new-version acceptance criteria, and the rollback plan, marked pre-approved. A governance section holds the four-eyes approval signatures with timestamps, the change-history log, and the next-revalidation date, all marked GxP-controlled with e-signatures under Part 11 and Annex 11. A violet relationships panel links the dossier governs the locked model version, bound-to the PCCP, evidenced-by the run_all harness output, monitored-by the PSI and residual detectors, and approved-by the quality unit. A caption notes the PCCP is the section that lets the model evolve along a path proven safe in advance. 
One model-validation dossier, fully unpacked: the intended-use scope that bounds everything, the risk score that sets the evidence depth, the credibility evidence measured against a gate frozen before testing on a pinned ALCOA+ dataset, the data-governance and locked-model controls, the nested PCCP that pre-approves how the model may change, and the GxP-controlled approval signatures and revalidation date — the difference between a model file and a deployable validated object. Original diagram by the authors, created with AI assistance.

Read the dossier top to bottom and the chapter is laid out as sections. The scope section is the intended-use statement — the context of use — and it does the most work, because every other section's depth is set by it. The risk section records model risk as reliance × consequence and maps it to the evidence the file must carry; get this wrong and you either over-document a dashboard or under-prove a controller. The credibility-evidence section is the heart: the training dataset pinned by sha256, the independent held-out split (Annex 22's test-data independence), the acceptance gate frozen before testing, and the measured result against it — the run_all.py evidence in formal dress. The controls section holds the ALCOA+ data-governance attestation, the locked-model attestation, and the human-oversight description. The PCCP is a nested sub-card precisely because it is the document's cleverest part: it pre-approves the allowed-change envelope, the written retraining trigger, the new-version acceptance criteria, and the rollback plan, so the model can evolve along a path proven safe in advance. And the governance section carries the four-eyes signatures, the change history, and the next revalidation date — the GxP-controlled core that makes the whole thing auditable and the human accountability unmistakable. A model file has weights; a dossier has all of this, which is why only the dossier can make a decision about a medicine.
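The independent held-out split in the credibility-evidence section can be checked mechanically. The sketch below assumes that independence is enforced at the batch level, so that no batch contributes rows to both sides; that grouping choice, and the function names, are assumptions for illustration rather than wording from the annex.

```python
def split_by_batch(batch_ids, test_batches):
    """Partition row indices so every row of a batch lands on one side only."""
    held_out = set(test_batches)
    train_idx = [i for i, b in enumerate(batch_ids) if b not in held_out]
    test_idx = [i for i, b in enumerate(batch_ids) if b in held_out]
    return train_idx, test_idx


def assert_independent(batch_ids, train_idx, test_idx):
    """Raise if a row or a batch appears on both sides of the split."""
    if set(train_idx) & set(test_idx):
        raise ValueError("row-level overlap between train and test")
    shared = {batch_ids[i] for i in train_idx} & {batch_ids[i] for i in test_idx}
    if shared:
        raise ValueError(f"batches on both sides of the split: {sorted(shared)}")


batch_ids = ["B1", "B1", "B2", "B3", "B3"]
train_idx, test_idx = split_by_batch(batch_ids, ["B3"])
assert_independent(batch_ids, train_idx, test_idx)  # passes silently
print(train_idx, test_idx)  # [0, 1, 2] [3, 4]
```

A random row-level split of the same data would put hours from batch B3 on both sides, which is the kind of leak the second check is there to catch.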

The unsolved part: keeping a learning model "in validation"

Be honest about the contradiction the documents manage but do not dissolve. GMP validation means prove the system does what it should, lock it, and prove again before any change. Machine learning means improve by changing in response to new data. A model that keeps learning is, by definition, a system that keeps changing — the one thing validation forbids without re-qualification. The PCCP is the best instrument the field has, and it is genuinely clever: it pre-approves the shape of allowed change so a retrain is a planned event, not a new negotiation. But it does not make a continuously-learning model validatable; it makes a sequence of locked models governable. Learning still happens between versions, never within one, and the draft Annex 22 codifies exactly that limit by excluding adaptive models from critical use.
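"A sequence of locked models" has a natural shape in code: an append-only registry in which each version is immutable and promotion needs two distinct approvers. The sketch below is a hypothetical, in-memory illustration; the class and field names are invented for the example, the dataset hash prefix is borrowed from the harness output shown earlier, and a real system would keep this in a controlled, audited store.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedVersion:
    version: str
    artifact_sha256: str
    dataset_sha256: str
    approvers: tuple


class Registry:
    """Append-only list of locked versions; exactly one of them is live."""

    def __init__(self):
        self._versions = []  # entries are never deleted, so rollback stays possible
        self._live = None    # index of the version currently in use

    def promote(self, version, artifact: bytes, dataset_sha256, approvers):
        if len(set(approvers)) < 2:
            raise PermissionError("promotion needs two distinct approvers")
        digest = hashlib.sha256(artifact).hexdigest()
        self._versions.append(
            LockedVersion(version, digest, dataset_sha256, tuple(approvers)))
        self._live = len(self._versions) - 1

    def rollback(self, version: str):
        """Make an earlier, still-stored version live again."""
        for i, v in enumerate(self._versions):
            if v.version == version:
                self._live = i
                return
        raise KeyError(version)

    @property
    def live(self) -> LockedVersion:
        if self._live is None:
            raise LookupError("no version has been promoted")
        return self._versions[self._live]

    def matches_live(self, deployed: bytes) -> bool:
        """True if the deployed bytes are exactly the artifact that was qualified."""
        return hashlib.sha256(deployed).hexdigest() == self.live.artifact_sha256


reg = Registry()
reg.promote("v3", b"weights-v3", "9f3c1a7b2e44", ["analyst", "qa_reviewer"])
reg.promote("v4", b"weights-v4", "9f3c1a7b2e44", ["analyst", "qa_reviewer"])
print(reg.live.version)                  # v4
print(reg.matches_live(b"weights-v4x"))  # False: an in-place edit is detectable
reg.rollback("v3")
print(reg.live.version)                  # v3
```

Nothing in the registry lets a version change after it is appended, which is the property the paragraph above describes: the model evolves only by adding a new locked entry.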

The deeper unsolved residue is the one the MLOps chapter named and this chapter inherits: the only true ground-truth drift detector is lagging by construction, because the offline reference that would expose a model's error arrives once or twice a day. Between the moment concept drift begins and the moment enough sparse assays accumulate to prove it, a validated model that has started to mislead looks identical to one that is working. The PCCP tells you what to do when you detect drift; it cannot shorten the time to detect it. So a model "in validation" under GMP is really a model held in a disciplined suspicion — locked, monitored, periodically reconciled against slow truth, and assumed wrong until the data proves otherwise. The validation paradox is managed by paperwork and lifecycle, not resolved by them, and a regulator who understands this will ask not "is your model perfect?" but "what is your evidence, what is your trigger, and who signs?" The honest state of the art is that those three questions have good answers and the question behind them — how to trust a learning system unattended — does not yet.
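The size of that blind window is simple arithmetic. The sketch below makes strong simplifying assumptions, stated in the docstring, and the parameter `points_needed` (how many out-of-limit assays the procedure requires before calling a signal) is a hypothetical knob, not something the source specifies.

```python
def worst_case_detection_hours(assay_interval_h: float, points_needed: int) -> float:
    """Longest wait from drift onset to the assay that completes the signal.

    Assumes a step change large enough that every later assay falls outside
    the control limits, and that the drift starts just after an assay was taken.
    """
    if assay_interval_h <= 0 or points_needed < 1:
        raise ValueError("interval must be positive and points_needed at least 1")
    return assay_interval_h * points_needed


# Assay twice a day, act on the first out-of-limit point.
print(worst_case_detection_hours(12.0, 1))  # 12.0
# Assay once a day, require two out-of-limit points before acting.
print(worst_case_detection_hours(24.0, 2))  # 48.0
```

A slow drift that takes several assays to breach the limits would only lengthen these numbers, so they are a floor on the worst case under the stated assumptions.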

What this chapter adds to the model suite

This chapter's contribution is examples/platform/ml/run_all.py, and it is deliberately the suite's governance artifact rather than another model — the code analogue of the FDA credibility framework and a CSA-style assurance check. It runs every model module in the suite as a subprocess, treats each module's terminal assert over a held-out metric as that model's acceptance gate, records whether the gate held, and pins the sha256 of the dataset(s) the model was fitted on — emitting a single machine-checkable ledger: which model, which gate, did the evidence clear it, on which frozen data. It coordinates with, and does not duplicate, the case-studies ledger (which grades external deployments by maturity and evidence tier) and the drift detectors (which monitor a deployed model): the harness sits at validation time, asserting that credibility evidence was produced against a pre-stated gate on pinned data — the necessary, checkable core of step 3 of the lifecycle. Its closing NOTE is the chapter's thesis in code: passing the gate is necessary, not sufficient; GMP credibility also needs intended-use scope, change control, and human oversight, none of which a script can supply. The harness makes the evidence auditable; the dossier and the human make it deployable.
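The mechanism described here fits in a few lines. The following is a minimal sketch of the idea, not the suite's actual run_all.py: the function names, the ledger fields, and the throwaway demo modules are all hypothetical. It relies on one fact about Python, that a failed `assert` in a script ends the process with a non-zero exit code.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_prefix(path: Path, n: int = 12) -> str:
    """Short content hash used to pin which data a model was fitted on."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:n]


def run_gate(module: Path, datasets: list) -> dict:
    """Run one module as a subprocess; a failed assert exits non-zero, read as FAIL."""
    proc = subprocess.run([sys.executable, str(module)],
                          capture_output=True, text=True)
    return {
        "module": module.name,
        "passed": proc.returncode == 0,
        "datasets": {d.name: sha256_prefix(d) for d in datasets},
    }


def demo() -> list:
    """Build two throwaway modules, one clearing its gate and one not."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "spectra.csv"
        data.write_text("hour,glucose\n0,4.1\n")
        good = root / "good_model.py"
        good.write_text("r2 = 0.91\nassert r2 > 0.85, 'gate not cleared'\n")
        bad = root / "bad_model.py"
        bad.write_text("r2 = 0.60\nassert r2 > 0.85, 'gate not cleared'\n")
        return [run_gate(good, [data]), run_gate(bad, [data])]


if __name__ == "__main__":
    for row in demo():
        status = "PASS" if row["passed"] else "FAIL"
        print(status, row["module"], row["datasets"])
```

One caveat for anyone adapting the pattern: Python strips `assert` statements when run with the `-O` flag, so a harness like this must run its modules without optimization, or the modules should raise an explicit exception instead.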

Why it matters

Every model in this book is a liability until it is governed. A soft sensor that drifts and mis-feeds a culture, an MSPC monitor that waves through a batch it should flag, a generative copilot that fabricates a root cause — each is a real risk to product and patient, and the only thing that converts a clever model into a defensible deployment is the discipline this chapter maps: a scoped intended use, evidence proportional to risk, a locked validated object, monitoring that never expires, and a human at every critical gate. That discipline is not bureaucratic friction layered on top of good engineering; under CSA and Annex 22 it is good engineering, focused exactly where a wrong answer would do harm. The firms that internalize it can deploy AI in GMP with confidence; the firm that did not — that let an AI generate the records governing how a batch is made and released, with no quality unit reading them — drew the first AI warning letter. The gap between those two outcomes is not the quality of the model. It is the presence or absence of the dossier, the lock, the monitor, and the signature.

In the real world

The regulatory frame went from sparse to concrete in two years, and in 2026 it is genuinely usable. On the guidance side: the FDA's 2023 discussion paper Artificial Intelligence in Drug Manufacturing frames the questions, its 2025 AI-in-drug-development draft supplies the 7-step risk-based credibility framework, the ISPE GAMP AI Guide (2025) translates it into a validation playbook with its seven control layers, and the finalized Computer Software Assurance guidance (latest update February 2026) makes the whole effort risk-based and least-burdensome rather than documentation-heavy [1][2][6][7]. On the binding side: the draft EU GMP Annex 22, in EU/PIC/S consultation through 2025 with finalization expected around mid-2026, is the first manufacturing-specific AI rule — permitting only static, deterministic models for critical GMP and excluding adaptive, probabilistic, and generative AI from critical use, with the PCCP as the change-control instrument [3][4]. Underneath both, ICH Q8–Q12 and ALCOA+ are the inherited quality and data-integrity spine an AI deployment cannot opt out of [5][8]. And the enforcement anchor is now real, not hypothetical: on 2 April 2026 the FDA issued its first AI-citing cGMP warning letter, to Purolea, a firm that used AI agents to generate specifications, SOPs, and master production records without quality-unit review [9][10]. The violation was not "you used AI" — it was that an AI produced GMP-controlling documents and no human quality unit reviewed them, the exact missing four-eyes gate this chapter's whole lifecycle is built to supply. Read the guidance, the draft rule, and the one enforcement action together and the message is one sentence: a model may advise, the evidence must match the risk, and a qualified human decides and signs.

Key terms

  • Intended use / context of use — the precise statement of what a model's output will and will not be used for; the scope from which every other validation requirement is sized.
  • Model risk — reliance on the model × consequence of the model being wrong; the risk score that sets how much credibility evidence a model owes (FDA 7-step framework).
  • Credibility-assessment framework (FDA 7-step) — question of interest → context of use → model risk → credibility plan → execution → results → adequacy; the risk-proportionate structure that governs trivial and high-stakes models alike.
  • Static (locked) model — a model frozen after validation so the same input yields the same output for its whole life; the only kind Annex 22 permits for critical GMP use.
  • Dynamic / adaptive model — a model that continues to learn in production; excluded from critical GMP use under draft Annex 22 because it cannot be validated by a one-time test.
  • Draft Annex 22 — the draft EU/PIC/S GMP annex on AI: permits only static, deterministic models for critical use, excludes adaptive, probabilistic, and generative AI, and requires intended purpose, risk assessment, data governance, test-data independence, human oversight, and change control.
  • Predetermined change control plan (PCCP) — a pre-approved written specification of how a model may change (data, frozen algorithm, acceptance criteria, rollback), so a retrain inside the envelope is a planned event rather than a new regulatory negotiation.
  • Locked-then-relearn — the only pattern permitted for critical applications: lock a model, run it, detect drift, retrain off-line into a new validated version, promote through change control; learning happens between versions, never within one.
  • ISPE GAMP AI Guide — the 2025 practitioner playbook extending GAMP 5's risk-based, lifecycle validation thinking to AI/ML, with its "seven control layers" for data, model, deployment, monitoring, and oversight.
  • Computer Software Assurance (CSA) — the FDA's risk-based, critical-thinking approach to software assurance: least-burdensome evidence focused where a failure would harm product or patient.
  • ALCOA+ — Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available; the data-integrity properties every datum that trains, tests, or feeds a model must hold.
  • Test-data independence — Annex 22's requirement that test data be genuinely held out from training; overlapping test and training data invalidates a model's credibility evidence.
  • Four-eyes gate — the requirement that a second qualified person review and sign the promotion of a model version; the control whose absence drew the Purolea warning letter.
  • Purolea warning letter — the FDA's first AI-citing cGMP warning letter (2 April 2026), against a firm that used AI to generate GMP records without quality-unit review; the enforcement anchor.

Where this leads

The rules are mapped, the lifecycle is worked, and the one enforcement action shows exactly where the line falls: a model advises, a human decides, and the evidence must match the risk. That is the governed present. The final substantive chapter, The Frontier: Foundation Models, Autonomous Labs, and Agentic AI, looks the other way — at the self-driving bioreactors, federated learning, bioprocess foundation models, and agentic AI that promise to move the line, and asks honestly how far each has actually traveled from a controlled demonstration toward routine commercial GMP use. The governance chapter tells you the conditions under which a real deployment becomes defensible; the frontier chapter tells you which of tomorrow's capabilities can meet those conditions, and which are, for now, on the wrong side of both the evidence and the rule.