MLOps and Lifecycle: Drift, Retraining, and the Validation Paradox

📍 Where we are: Part VI · The Whole System — Chapter 22. The previous chapter built the dominant paradigm — hybrid models and digital twins that fuse physics with data. This chapter answers the question every deployed model eventually forces: once it is running in a GMP plant, how do you keep it true, and how do you keep it validated while it keeps learning?

Every chapter before this one trained a model and showed it working: a Raman soft sensor recovering titer, a harvest-endpoint optimizer, an MSPC monitor, a vision system auto-releasing vials. They all share an unstated assumption — that a model which passed validation on Tuesday is still correct on Friday. It is not. A bioprocess model decays faster and more silently than almost any model in industry, because the thing it predicts is alive, the ground truth that would expose its error arrives once or twice a day, and the hardware underneath it changes with every probe swap and every scale-up. This is the chapter about the part nobody demos: the long, unglamorous afterlife of a model in production.

It is also where machine learning collides head-on with the central rule of pharmaceutical manufacturing. A GMP (Good Manufacturing Practice — the legally enforced quality regime for making medicines) process must be validated — proven, locked, and held under change control so that what you make tomorrow is what you proved yesterday. A machine-learning model, by its nature, learns — it changes as it sees new data. Those two requirements are in direct tension, and the resolution of that tension is the single most important idea in deploying ML under GMP. We will build the drift detectors that make a learning model governable, then confront the validation-versus-learning paradox honestly.

The simple version

A bridge inspector does not declare a bridge "safe forever" the day it opens. They re-inspect it on a schedule, because steel rusts, traffic grows, and the river scours the footings. A deployed model is the same: it was safe on the data it was built for, but the world it watches keeps moving. The sensor fouls, the supplier changes, the cells behave a little differently this campaign. MLOps is the inspection regime for models: instruments that watch for the model going stale (drift detection), a written rule for when to rebuild it (retraining triggers), and a way to swap back to the old one if the new one is worse (rollback). The hard part is the one way a model is unlike a bridge: the regulator insists it not change on its own between inspections. So we lock it and only ever change it on purpose, with paperwork; you may not quietly patch it in place even when you can see it failing.

What this chapter covers

Why models drift in bioprocess specifically — probe fouling, raw-material lot shifts, scale moves, and the living system's own run-to-run wander, mapped to the three mathematical kinds of drift.
Detecting drift two ways — a leading indicator that needs no labels (Population Stability Index on the input distribution) and a lagging indicator that needs the slow offline reference (a control chart on the prediction residual), with the math worked through.
The retraining lifecycle under GMP — locked models, the Predetermined Change Control Plan (PCCP), retraining triggers, the four-eyes promotion gate, champion/challenger shadow deployment, and rollback.
A runnable drift detector and a runnable governed retrain loop — examples/platform/ml/drift.py (an online-vs-offline glucose residual chart plus cross-batch PSI) and examples/platform/ml/lifecycle_retrain.py (champion/challenger promotion with held-out revalidation, a version bump, a dataset hash, and a CPV log entry), both grounded in the committed datasets.
The validation-versus-learning paradox — how regulators (FDA, the draft EU/PIC/S GMP Annex 22) reconcile "validated" with "learning," and why the honest answer today is locked-then-relearn, not continuously-learning.

Why models drift — and why bioprocess is the worst case

A model is a frozen claim about a relationship: given these inputs, the answer is this. Drift is what happens when the relationship the world actually obeys diverges from the one the model froze. It comes in three mathematically distinct flavors, and naming them matters because each is caught by a different instrument [1].

Covariate shift (input drift): the distribution of the inputs P(X) moves, even if the underlying physics P(Y|X) is unchanged. (Read P(X) as the spread of input values the model typically sees — what range they take and how often — and P(Y|X) as the answer Y you expect for a given input X; the bar means "given.") A Raman probe slowly fouls and its baseline rises; a new raw-material lot shifts the spectral background; a new campaign runs a fraction warmer. The model is now extrapolating outside the input region it was calibrated on, which is exactly where a data-driven model is least trustworthy. Crucially, covariate shift is visible without any labels — you can see the inputs have moved before you know whether the answers are wrong.

Concept drift (relationship drift): P(Y|X) itself changes — the same inputs now imply a different answer. A cell line subtly adapts over passages; a media reformulation changes how glucose maps to growth; a new feed strategy alters the kinetics the model learned. This is the dangerous kind, because the inputs can look perfectly normal while the predictions quietly go wrong. Concept drift can in general only be caught with ground truth — the slow offline reference — which is why it is always a lagging discovery.

Label/prior shift: the distribution of the outcome moves (the process is run at a new titer target, a new product enters the same line). This reshapes what "normal" output looks like and can make a well-calibrated model's predictions sit systematically off-center.

The taxonomy is not pedantry. The joint distribution always splits as P(X,Y) = P(Y|X)·P(X): "how often each input occurs" (P(X)) times "what answer that input gives" (P(Y|X)). Covariate shift moves P(X); concept drift moves P(Y|X); prior shift moves the outcome marginal P(Y), which is the leading factor of the mirror-image split P(X,Y) = P(X|Y)·P(Y). That factorization is the whole reason two detectors are unavoidable rather than one: a monitor that watches inputs (P(X)) is structurally blind to a change in the mapping (P(Y|X)), and vice versa. No single statistic on the input stream can catch pure concept drift, because by definition pure concept drift leaves the input distribution unchanged.
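A toy simulation can make the two-detector argument concrete. The sketch below is illustrative only: the data are synthetic, and the names simulate, frozen_model, input_shift, and residual_rms are hypothetical, not part of the book's example suite. It freezes a model Y = 2X, exposes it to a covariate shift and to a concept drift, and reads each case through an input-only statistic and a ground-truth residual.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, x_mean, slope):
    """Draw inputs X ~ N(x_mean, 1) and answers Y = slope * X + small noise."""
    x = rng.normal(x_mean, 1.0, n)
    y = slope * x + rng.normal(0.0, 0.1, n)
    return x, y

def frozen_model(x):
    """The locked model: it froze the training-time relationship Y = 2 X."""
    return 2.0 * x

x_ref, y_ref = simulate(5000, x_mean=0.0, slope=2.0)   # training conditions
x_cov, y_cov = simulate(5000, x_mean=1.5, slope=2.0)   # covariate shift: P(X) moves
x_con, y_con = simulate(5000, x_mean=0.0, slope=2.6)   # concept drift: P(Y|X) moves

def input_shift(x):
    """Label-free view: how far the input mean moved, in reference std units."""
    return float(abs(x.mean() - x_ref.mean()) / x_ref.std())

def residual_rms(x, y):
    """Ground-truth view: RMS error of the frozen model against the true answers."""
    return float(np.sqrt(np.mean((frozen_model(x) - y) ** 2)))

for name, x, y in [("reference", x_ref, y_ref),
                   ("covariate", x_cov, y_cov),
                   ("concept", x_con, y_con)]:
    print(f"{name:10s} input_shift={input_shift(x):.2f}  residual_rms={residual_rms(x, y):.2f}")
```

In this toy the frozen model happens to be exactly right at every input, so the covariate-shifted run moves the input statistic but leaves the residual small; a real data-driven model extrapolating outside its calibration range would usually not be so lucky. The concept-drift run is the mirror image: the input statistic stays near zero while the residual grows, which is the case only ground truth can expose.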

Bioprocess suffers all three more acutely than almost any other domain, for reasons this book has hammered since Chapter 1 and Chapter 2:

The system is alive and never identical twice. Run-to-run biological variability means the input distribution genuinely moves batch to batch, so covariate shift is the baseline, not the exception. Over a product's life the cell line itself can adapt, with productivity, glycosylation, or growth shifting with passage number, which is exactly why the dossier fixes a limit of in-vitro cell age (ICH Q5D). A static model is implicitly bounded by that same passage window; a model still trusted beyond it is extrapolating in time as well as in input space, and that is slow concept drift no static model survives.
Ground truth is sparse and slow. The offline reference assay returns roughly once or twice a day — the cold-start, sparse-reference reality. Concept drift is therefore detectable only at those sparse grounding points, days after it begins.
The hardware moves under the model. A Raman calibration is bound to its exact probe: swap the probe and the published study saw roughly a 20% cell-density prediction error from instrument-to-instrument differences alone, halved only by an explicit calibration-transfer step. A scale move from development to manufacturing is a wholesale covariate shift.
The models decay fast and there are few batches to learn from. Small data means a freshly retrained model is itself fragile, so retraining is not a free reset; it is a new validation burden each time.

This is the deep reason pure-ML stalls in biomanufacturing and hybrid modeling wins: a physics backbone drifts far more slowly than a pure black box, because mass balances do not care which raw-material lot you bought. But even a hybrid model's data-driven part drifts, so drift detection is mandatory regardless of model class.

Two detectors: a leading indicator and a lagging one

The asymmetry between dense predictions and sparse ground truth dictates the whole monitoring design. You cannot wait for the offline assay to tell you the model is wrong — by then a day of batches has been mis-controlled. So a real MLOps loop runs two detectors with complementary timing.

PSI: the leading, label-free input-drift detector

The Population Stability Index (PSI) asks a question that needs no ground truth at all: has the distribution of the model's inputs moved away from what it was trained on? It bins the reference (training) distribution into deciles (ten equal-population slices of the training distribution), then measures how much probability mass the new data has shifted between those bins. Summed over the bins, it is:

PSI = sum_over_bins( (a_i - e_i) * ln(a_i / e_i) )

where e_i is the fraction of the expected (reference) population in bin i and a_i the fraction of the actual (new) population. Two implementation details decide whether the number means anything. First, the bins are pinned on the reference quantiles — deciles of the training distribution, so each reference bin holds about 10% of the expected mass — not on equal-width intervals; equal-width bins on a skewed feature dump most of the mass into one bin and make PSI nearly uninformative. Second, any empty bin would send ln(a_i / e_i) to plus or minus infinity, so each fraction is floored at a small epsilon (the example uses 1e-6) before the log.

At heart PSI is a single number measuring how far the new histogram has moved from the training one: it grows when mass piles up where little used to be, and is zero only when the two histograms match exactly. Formally, the term (a_i - e_i)·ln(a_i / e_i) is the Kullback–Leibler divergence summand symmetrized in both directions, so PSI equals KL(a‖e) + KL(e‖a), the symmetrized relative entropy (Jeffreys divergence) between the two histograms. (You do not need that name to use PSI; it just identifies the quantity for readers who want it.) Each bin's contribution is non-negative, because (a_i - e_i) and ln(a_i / e_i) always share a sign, so bins that gain mass and bins that lose it both add to the total. The result is dimensionless and directionless. The identity is exact for the binned histograms, which is also why PSI's value, and its rule-of-thumb cutoffs, depend on the binning and cannot be read as a calibrated test statistic.
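A short worked example may help tie the formula to the Jeffreys-divergence identity. The bin fractions below are hypothetical, chosen only so the arithmetic is easy to follow; they are not taken from the book's datasets.

```python
import numpy as np

# Hypothetical bin fractions for illustration (not from the book's datasets).
e = np.array([0.10] * 10)                      # reference deciles: 10% each by construction
a = np.array([0.05, 0.05, 0.08, 0.10, 0.10,
              0.10, 0.12, 0.12, 0.13, 0.15])   # new data: mass moved toward the upper bins

terms = (a - e) * np.log(a / e)                # per-bin contribution, never negative
psi_value = float(terms.sum())

kl_ae = float(np.sum(a * np.log(a / e)))       # KL(a || e)
kl_ea = float(np.sum(e * np.log(e / a)))       # KL(e || a)

print("per-bin terms:", np.round(terms, 4))
print("PSI:", round(psi_value, 4))
print("KL(a||e) + KL(e||a):", round(kl_ae + kl_ea, 4))
```

The two printed totals agree, and the bins that lost mass (the two at 0.05) contribute as much as the bins that gained it. With these particular fractions the total lands just above 0.1, in the "moderate" band of the usual rule of thumb.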

The industry rule of thumb is blunt and durable: PSI below 0.1 is stable, 0.1 to 0.25 is a moderate shift worth watching, and 0.25 or above is a significant shift that should trigger investigation [1]. Those cutoffs come from decades of credit-risk scorecard monitoring; they are a convention, not a hypothesis test. There is no Type I / Type II error rate attached to them, and on a feature with naturally wide run-to-run spread a healthy batch can exceed 0.25 with nothing wrong. That is exactly why the example below shows several perfectly good sibling batches "failing" the generic threshold: the rule of thumb is a starting point to be recalibrated against the process's own normal variability, not a law. PSI's value is its timing: because it watches only the inputs, it fires the moment a new lot or a fouling probe moves the feature distribution, before the offline assay can confirm any error. It is the smoke detector that goes off before the fire is visible.

A control chart on the residual: the lagging, ground-truth detector

PSI is blind to concept drift by construction — see the P(X,Y) factorization above: the inputs can be perfectly in-distribution while the relationship has changed. For that you need the residual: the difference between the model's prediction and the offline reference, computed at each sparse grounding point. The trick is to treat that residual stream as a process to be controlled, and apply the exact SPC machinery from Book 3: build an I-MR (Individuals and Moving Range) control chart on the residuals of a known-good reference batch, then flag any live residual that falls outside the data-derived control limits. A control chart, at bottom, is just a plot of a quantity with statistically-derived upper and lower limits drawn on it: points inside the limits are ordinary noise, a point outside is a signal worth investigating.

The arithmetic is the standard individuals-and-moving-range construction, and it matters that the spread is estimated the SPC way rather than as a plain standard deviation. The center line is the mean residual of the clean reference run. The spread comes from the moving range — the absolute difference between each consecutive pair of residuals, MR_i = |r_i − r_{i−1}| — whose average estimates the short-term, within-run noise: sigma = mean(MR) / d2, with the control-chart constant d2 = 1.128 for a moving range of n=2 — d2 is the standard tabulated factor that converts the average range of a group of points back into an unbiased estimate of the underlying noise sigma, and 1.128 is its value for the group size of two consecutive points (n=2) that a moving range uses. The control limits are then the familiar three-sigma band, center ± 3·sigma. Using the moving-range estimate rather than the overall standard deviation is deliberate: a slow drift inflates the overall standard deviation (which would widen the limits and hide the very drift you are hunting), but it barely touches the point-to-point moving range, so the limits stay tight and the drifting points break through them. A residual that drifts off-center, or a run of residuals marching one way, is the model going stale — the same Western Electric run rules that catch a drifting process catch a drifting model.
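The construction above can be sketched in a few lines. This is a minimal illustration on synthetic residuals, not the chart code in drift.py; the helper name imr_limits is hypothetical. It also shows the point about the spread estimate: a slow ramp inflates the plain standard deviation but barely moves the moving-range sigma.

```python
import numpy as np

D2 = 1.128  # tabulated constant for a moving range of two consecutive points

def imr_limits(residuals):
    """Return (center, sigma, lcl, ucl) for an individuals chart."""
    r = np.asarray(residuals, float)
    center = float(r.mean())
    moving_range = np.abs(np.diff(r))           # |r_i - r_{i-1}|
    sigma = float(moving_range.mean() / D2)     # short-term sigma estimate
    return center, sigma, center - 3 * sigma, center + 3 * sigma

rng = np.random.default_rng(1)
noise = rng.normal(0.0, 0.1, 60)                # synthetic in-control residuals
drifting = noise + np.linspace(0.0, 1.0, 60)    # the same noise plus a slow ramp

_, sigma_mr, _, _ = imr_limits(drifting)
sigma_overall = float(drifting.std(ddof=1))
print(f"moving-range sigma = {sigma_mr:.3f}, overall std = {sigma_overall:.3f}")
```

In production the limits would come from a clean reference run and be applied to the live stream, so the drifting series here is used only to compare the two spread estimates.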

This is the direct mechanization of the open challenge the Book 2 soft-sensor chapter named: the residual is a lagging indicator, true only once enough slow reference data accumulates. PSI and the residual chart are designed to be read together — PSI says "the inputs moved, suspect the model," the residual chart says "the answers are now provably wrong." When PSI fires and the residual chart later confirms, you have both the early warning and the evidence a change-control record demands.

A runnable drift detector grounded in the campaign

The example module examples/platform/ml/drift.py implements both detectors over the committed datasets, with no services. The online glucose feed is the simulator's minute-cadence state in fedbatch_state.parquet; the offline grounding samples are the roughly twice-daily bench assays in offline_assays.csv. The residual detector pairs each sparse offline assay with the nearest online reading, then charts the residual stream — and to prove the chart catches something real, it injects a slow probe-fouling bias onto the online sensor after day 7 (the same day the simulator seeds its temperature excursion).

# examples/platform/ml/drift.py
import numpy as np
import pandas as pd

D2 = 1.128  # I-MR control-chart constant for the moving range of n=2


def ground_residuals(batch_id="BATCH-2026-001", fouling_bias=0.0, fouling_after_day=7.0):
    """Pair each offline assay with the nearest online reading and form the residual.

    fouling_bias injects a slow probe-fouling drift (g/L) onto the online sensor
    after fouling_after_day so the detector has something to catch."""
    # online_glucose / offline_glucose are loader helpers not shown in this excerpt.
    online = online_glucose(batch_id)
    if fouling_bias:
        # Batch age in days, measured from the first online timestamp.
        age_day = (online.ts - online.ts.iloc[0]).dt.total_seconds() / 86400.0
        # Linear ramp: 0 before fouling_after_day, rising to 1 at day 14, then held.
        ramp = np.clip((age_day - fouling_after_day) / (14.0 - fouling_after_day), 0, 1)
        online = online.assign(glucose_online=online.glucose_online + fouling_bias * ramp)
    offline = offline_glucose(batch_id)
    # merge_asof needs both frames sorted on the key; "nearest" matches in either direction.
    paired = pd.merge_asof(offline.sort_values("ts"), online.sort_values("ts"),
                           on="ts", direction="nearest")
    paired["residual"] = paired.glucose_online - paired.glucose_offline  # online minus offline
    return paired.reset_index(drop=True)


def psi(expected, actual, bins=10):  # (abridged from drift.py for the page; epsilon-clip fused inline)
    """Population Stability Index: bins fixed on reference quantiles, sum (a-e)*ln(a/e)."""
    e, a = np.asarray(expected, float), np.asarray(actual, float)
    edges = np.unique(np.quantile(e, np.linspace(0, 1, bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.clip(np.histogram(e, edges)[0] / len(e), 1e-6, None)
    a_pct = np.clip(np.histogram(a, edges)[0] / len(a), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

Two design choices in this code are the whole lesson. The residual is formed by pd.merge_asof(..., direction="nearest") — each sparse offline assay is paired to the nearest-in-time online reading, which is the only honest way to compare a once-or-twice-a-day bench value against a minute-cadence sensor. And the control limits are set from a clean reference run (no fouling) but applied to a fouled live run, so the chart is asked the real production question: given what "normal disagreement" looked like on a known-good batch, is this run still inside it?
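To see the nearest-in-time pairing in isolation, here is a small sketch on synthetic data. The timestamps and values are invented for illustration, and the tolerance argument is an addition not present in the excerpt above: it is one way to refuse a pairing when no online reading is close enough, with the 5-minute window an arbitrary assumption.

```python
import pandas as pd

# Synthetic minute-cadence online signal for one day (illustrative values only).
online = pd.DataFrame({"ts": pd.date_range("2026-01-01", periods=24 * 60, freq="min")})
online["glucose_online"] = 5.0 - 0.001 * online.index.to_numpy()   # slowly falling reading

# Two synthetic offline bench assays that fall between online samples.
offline = pd.DataFrame({
    "ts": pd.to_datetime(["2026-01-01 08:00:20", "2026-01-01 20:00:50"]),
    "glucose_offline": [4.50, 3.85],
})

# merge_asof needs both keys sorted and of the same dtype, so pin the resolution.
online["ts"] = online["ts"].astype("datetime64[ns]")
offline["ts"] = offline["ts"].astype("datetime64[ns]")

paired = pd.merge_asof(
    offline.sort_values("ts"), online.sort_values("ts"),
    on="ts", direction="nearest",
    tolerance=pd.Timedelta("5min"),   # assumption: reject matches more than 5 minutes apart
)
paired["residual"] = paired.glucose_online - paired.glucose_offline
print(paired)
```

The first assay pairs with the 08:00 reading (20 seconds earlier) and the second with the 20:01 reading (10 seconds later), which is the either-direction behavior that "nearest" provides and that "backward" or "forward" alone would not.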

Running python drift.py prints exactly this against the real datasets:

online-vs-offline glucose residual I-MR chart (reference = clean run):
  center=-0.0125 g/L  UCL=0.409  LCL=-0.4341  sigma=0.1405
  live (probe fouling +1.2 g/L after day 7): 10/28 points out of control, max |residual|=1.037 g/L -> drift_detected=True

cross-batch glucose PSI vs golden batch BATCH-2026-001:
  BATCH-2026-006  PSI=1.5403  SHIFT
  BATCH-2026-004  PSI=0.3439  SHIFT
  BATCH-2026-003  PSI=0.3191  SHIFT
  BATCH-2026-005  PSI=0.256   SHIFT
  BATCH-2026-002  PSI=0.1135  moderate
  BATCH-2026-001  PSI=0.0     stable

ASSERT ok: the residual chart catches the injected probe-fouling drift.

Read both halves the way an MLOps engineer would. The residual chart sets its limits from the clean reference run: a center near zero (-0.0125 g/L) with control limits at roughly +0.41 and -0.43 g/L, the natural spread of the online-versus-offline disagreement when nothing is wrong. That band is center ± 3·0.1405, i.e. three moving-range sigmas on either side of the center line, which is why a normal sensor wobble of a tenth of a g/L stays comfortably inside it. Replay the same batch with a probe-fouling drift of +1.2 g/L ramping in after day 7 and 10 of 28 residual points fall outside those limits, with a maximum residual over 1 g/L: the chart catches the fouling unambiguously. By construction this is the lagging detector: it fires only at the sparse offline grounding points, days into the drift. Note that early points (before the ramp engages) sit inside the limits while late points march out, exactly the "run marching one way" signature of a model going stale.
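The "run marching one way" signature can be checked mechanically before any single point breaches a limit. The sketch below implements one common form of a Western Electric run rule, eight consecutive points on the same side of the center line. The helper names and the residual values are hypothetical; only the center and limits are borrowed, rounded, from the printed chart above.

```python
import numpy as np

def longest_one_sided_run(residuals, center):
    """Length of the longest run of consecutive points strictly on one side of center."""
    signs = np.sign(np.asarray(residuals, float) - center)
    longest = current = 0
    previous = 0.0
    for s in signs:
        if s != 0 and s == previous:
            current += 1          # run continues on the same side
        elif s != 0:
            current = 1           # run restarts on the other side
        else:
            current = 0           # a point exactly on the center line breaks the run
        previous = s
        longest = max(longest, current)
    return longest

def run_rule_violated(residuals, center, run_length=8):
    """True when at least run_length consecutive points sit on one side of center."""
    return longest_one_sided_run(residuals, center) >= run_length

CENTER, LCL, UCL = -0.0125, -0.434, 0.409      # rounded from the printed reference chart
live = [0.05, -0.02, 0.08, 0.12, 0.15, 0.21, 0.24, 0.30, 0.33, 0.38]   # invented residuals

inside_limits = all(LCL <= r <= UCL for r in live)
print("all points inside 3-sigma limits:", inside_limits)
print("run rule violated:", run_rule_violated(live, CENTER))
```

Every invented point sits inside the three-sigma band, yet the run rule fires, which is why run rules are usually read alongside the limits rather than instead of them.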

The cross-batch PSI is the leading detector, and its numbers are honest about a subtlety. Against the golden batch BATCH-2026-001, every sibling shows some glucose-distribution shift — BATCH-2026-006 at PSI 1.54 is a dramatic SHIFT, BATCH-2026-002 at 0.11 is only moderate, and the golden batch against itself is exactly 0.0 (the histograms are identical, so every bin contributes zero — a useful sanity check that the metric is implemented right). This is not a bug; it is the living-system reality from earlier in the chapter made numerical. Run-to-run biological variability means the input distribution genuinely moves batch to batch, so a naive PSI alarm would fire constantly. The lesson the example teaches is precisely that a drift threshold must be tuned to the process's own normal variability, not borrowed from a generic ML rulebook — the 0.25 rule of thumb is a starting point, and a bioprocess team must learn what PSI its own healthy campaigns produce before they can set an alarm that means something. (The glucose values here are the dataset's real offline numbers; the +1.2 g/L fouling bias is an injected illustrative perturbation to exercise the detector, clearly labeled in the code. The positive offset is a deliberately simple stand-in: real probe fouling can bias the reading in either direction and often distorts the response slope rather than merely adding a constant offset — which is exactly why the chart watches for a sustained one-way march in the residual rather than assuming a known sign.)
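One way to act on that lesson is to derive the alarm level from the PSI values of campaigns already known to be healthy. The sketch below is a hedged illustration on synthetic batches: the psi function mirrors the abridged one shown earlier, while the run-to-run wander, the choice of the 95th percentile, and the "suspect" batch are all assumptions made for the example, not recommendations.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """PSI with bins pinned on the reference quantiles and fractions floored at eps."""
    e, a = np.asarray(expected, float), np.asarray(actual, float)
    edges = np.unique(np.quantile(e, np.linspace(0, 1, bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.clip(np.histogram(e, edges)[0] / len(e), eps, None)
    a_pct = np.clip(np.histogram(a, edges)[0] / len(a), eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(7)
golden = rng.normal(5.0, 1.0, 2000)            # synthetic golden-batch feature values

# Synthetic healthy campaigns: each batch mean wanders a little around the golden mean.
healthy = [rng.normal(5.0 + rng.normal(0.0, 0.3), 1.0, 2000) for _ in range(20)]
healthy_psi = np.array([psi(golden, batch) for batch in healthy])

# Assumption: alarm at the 95th percentile of what healthy campaigns produce.
alarm = float(np.quantile(healthy_psi, 0.95))

suspect = rng.normal(7.0, 1.0, 2000)           # synthetic batch with a large input shift
print(f"healthy PSI range: {healthy_psi.min():.3f} to {healthy_psi.max():.3f}")
print(f"process-specific alarm: {alarm:.3f}")
print(f"suspect batch PSI: {psi(golden, suspect):.3f}")
```

With only a handful of real healthy campaigns a percentile is a rough estimate, so in practice the alarm level would itself be documented and revisited as more batches accumulate.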

The lifecycle of a model that must stay both true and validated: a locked model serves predictions while a label-free PSI monitor and a ground-truth residual chart watch for drift; only a written trigger opens the change-control return path, which retrains, revalidates, and promotes a new locked version through a four-eyes gate — with rollback to the prior version always available, and never an in-place silent edit. Original diagram by the authors, created with AI assistance.

The retraining lifecycle: locked models and the PCCP

Detecting drift is half the job. The other half is what you are allowed to do about it, and here MLOps in pharma diverges sharply from MLOps everywhere else. In a consumer setting, drift detection often triggers automatic retraining — the model continuously updates itself on fresh data, and nobody signs anything. Under GMP that pattern is, today, effectively prohibited for anything touching a critical quality attribute (CQA) — a measurable product property that must stay inside a safe range for the batch to release, defined in Book 1's QC and release chapter and glossed in this book's opening chapter. The reason is the validation principle: you must be able to prove that the model making decisions about a medicine is the exact model you qualified.

Before the mechanism, name the regulatory program the whole loop feeds into. Continued Process Verification (CPV) is a defined term: per the FDA's 2011 Process Validation guidance it is "Stage 3" of process validation — the ongoing program of collecting and analyzing product and process data throughout commercial production to give continual assurance that the process stays in a validated state of control (ICH Q12 governs the lifecycle change management built on top of it). A deployed model's drift monitoring and governed retraining are not a separate ML ritual bolted onto the plant; they feed this existing CPV program directly. The PSI and residual detectors above are CPV data sources, and a model-revalidation is a change-control event logged into CPV — the same Stage-3 record that tracks every other source of process drift. Framing it this way is what makes ML monitoring auditable: it lands in a program the inspector already knows and already expects to see.

The resolution the industry and regulators have converged on is the locked model. A model in GMP production is frozen — its weights, its preprocessing, its scaler, its operating range, all version-pinned and unchangeable in place. It does not learn on the fly. When drift detection says the model has gone stale, that does not silently update anything; it raises a flag that enters the change-control process, exactly as any other change to a validated system would. Retraining produces a new model version, which must be revalidated and formally promoted before it can replace the old one. This is the same discipline the Book 2 governance chapter named: a retrained model is a new validated object, not an edit.

The mechanism that makes this workable — that lets you change a model on purpose without a full novel validation each time — is the Predetermined Change Control Plan (PCCP). A PCCP is a pre-approved, written specification of how a model may change in the future. Concretely it pins down three things in advance: the data-management plan (which data the model will be retrained on, how it is curated, labeled, and quality-gated), the modification protocol (which algorithm, architecture, and hyperparameters stay fixed, what is allowed to change, and the exact retrain-and-test procedure), and the impact assessment plus acceptance criteria (the metric thresholds the retrained model must meet and the rollback plan if it does not). With an approved PCCP in place, a retrain that stays inside the envelope the plan describes can be executed as a planned, documented event rather than treated as an unforeseen change requiring fresh regulatory engagement. The PCCP is the bridge across the validation-versus-learning gap: it lets a model evolve along a path you proved in advance was safe. The same construct exists on both sides of the Atlantic — the draft Annex 22 (drafted by EU regulators together with PIC/S — the Pharmaceutical Inspection Co-operation Scheme, the international body that harmonizes GMP inspection) names a predetermined change-control approach for locked-model updates, and the parallel US instrument is the FDA's PCCP for AI-enabled device software.

A retraining trigger should be a written rule, not a judgment call, and the two detectors above feed it directly. A defensible trigger combines them with an AND, not an OR, to suppress false alarms: a sustained PSI breach (the inputs have moved and stayed moved across several batches, not one outlier) and a residual chart out-of-control signal (the answers are provably wrong against ground truth). Alongside that sits an event trigger for any known hardware change: a probe swap, a resin lot change, or a scale move is an automatic re-qualification event, not something that waits for drift to appear. Retraining is therefore trigger-driven, not interval-driven: the primary trigger is the drift detector tripping, not a clock. The calendar backstop is real but anchored to a rhythm that already exists, the plant's periodic CPV review cycle, in practice the Annual Product Quality Review (APQR; 21 CFR 211.180(e), EU GMP Ch. 1), rather than an arbitrary "every N months." A model that never visibly drifts is still re-examined when CPV is, so the cadence rides a regulatory rhythm the quality system already runs, not a number someone made up. Requiring both detectors before retraining is the deliberate guard against the living-system PSI noise the example exposed: a single PSI spike on a healthy batch should never, on its own, force a revalidation.

When a trigger fires, the loop runs: retrain on the curated new data, validate against the PCCP's acceptance criteria, present both the old and new model's metrics, and promote only through a four-eyes gate, in which a second qualified person signs the promotion, exactly as a batch record is reviewed by exception before release. And because the old version is never deleted, rollback is always one promotion away: if the new model underperforms in production, you revert to the last known-good locked version while you investigate.
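The written trigger rule described above can be sketched as a small pure function, so the logic is reviewable line by line. Everything named here (TriggerInputs, retrain_trigger, the 0.25 alarm level, the three-batch persistence window) is a hypothetical placeholder; a real plan would take those values from the approved PCCP.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TriggerInputs:
    recent_psi: list                  # PSI of the most recent batches, oldest first
    residual_out_of_control: bool     # latest verdict of the residual control chart
    hardware_change: bool             # probe swap, resin lot change, scale move, ...

def retrain_trigger(x, psi_alarm=0.25, sustained_batches=3):
    """Return (fire, reason). psi_alarm and sustained_batches are illustrative defaults."""
    if x.hardware_change:
        return True, "event: known hardware change"
    recent = x.recent_psi[-sustained_batches:]
    sustained = len(recent) == sustained_batches and all(p >= psi_alarm for p in recent)
    if sustained and x.residual_out_of_control:
        return True, "drift: sustained PSI breach AND residual chart out of control"
    return False, "no trigger"

cases = {
    "single PSI spike": TriggerInputs([0.05, 0.08, 0.40], False, False),
    "PSI sustained, residual fine": TriggerInputs([0.30, 0.35, 0.40], False, False),
    "both detectors": TriggerInputs([0.30, 0.35, 0.40], True, False),
    "probe swap": TriggerInputs([0.02, 0.03, 0.02], False, True),
}
for name, inputs in cases.items():
    print(f"{name:30s} -> {retrain_trigger(inputs)}")
```

The function only decides whether to open a change-control record; it does not retrain or promote anything, which keeps the decision to act with the people who sign.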

Champion/challenger: how you prove the new model is better

A subtlety hides inside "retrain and promote": how do you know the candidate is actually better, before you trust it with a medicine? You cannot validate it only on a held-out slice of historical data, because the whole reason you are retraining is that the world has moved past that data. The answer borrowed from the rest of MLOps — and the one compatible with GMP locking — is champion/challenger shadow deployment.

The current locked model is the champion; the freshly retrained candidate is the challenger. The challenger runs in shadow: it receives the same live inputs and produces predictions, but those predictions touch nothing — no controller, no batch decision, no release. The plant continues to be run by the champion. The challenger's predictions are simply logged alongside the champion's and, as the sparse offline reference data lands, both models' residuals are scored against the same ground truth on the same batches. After enough batches to be meaningful (and in small-data bioprocess that means weeks to months, not hours), you have a like-for-like comparison: did the challenger's residual control chart stay in control where the champion's drifted out? Is its R² against the offline assay higher on the new campaigns? Only then does the challenger go to the four-eyes gate, carrying head-to-head evidence rather than a self-reported metric.

Shadow mode is also the GMP-safe way to introduce any model into a critical setting: run it advisory and shadow for a qualification period, accumulate the evidence that it agrees with the established method, and promote it to active only once the comparison is documented. It is the same logic the book's release and MSPC models live under — they advise; the validated assay decides — generalized to model replacement. Crucially, none of this is continuous learning: the challenger is itself a locked candidate, trained off-line and frozen before it is shadowed. Learning still happens only between locked versions, never inside one.
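The head-to-head comparison at the heart of shadow mode reduces to scoring two logged prediction streams against the same reference values. The sketch below uses invented numbers and a hypothetical helper, shadow_scorecard; the control limits are rounded from the reference chart printed by drift.py.

```python
import numpy as np

def shadow_scorecard(reference, champion_pred, challenger_pred, lcl, ucl):
    """Score both locked models against the same offline reference values."""
    ref = np.asarray(reference, float)
    card = {}
    for name, pred in (("champion", champion_pred), ("challenger", challenger_pred)):
        resid = np.asarray(pred, float) - ref          # prediction minus ground truth
        card[name] = {
            "max_abs_residual": float(np.max(np.abs(resid))),
            "points_out_of_control": int(np.sum((resid < lcl) | (resid > ucl))),
        }
    return card

# Invented shadow log: four offline groundings and what each model predicted at them.
reference  = [4.00, 3.60, 3.10, 2.80]
champion   = [4.10, 4.00, 3.80, 3.70]    # drifting upward against the reference
challenger = [4.05, 3.55, 3.20, 2.75]    # shadow predictions; they control nothing

card = shadow_scorecard(reference, champion, challenger, lcl=-0.434, ucl=0.409)
for name, scores in card.items():
    print(name, scores)
```

Four points are far too few to promote on; the scorecard would be accumulated over the weeks-to-months qualification period before it goes to the four-eyes gate.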

The governed retrain loop, made runnable

Champion/challenger and retrain-then-revalidate do not have to stay prose: the suite makes the whole governed loop runnable in examples/platform/ml/lifecycle_retrain.py. The loop is the offline analogue of shadow deployment: instead of accumulating weeks of live shadow predictions, it stands the champion and challenger side by side on a strictly later held-out window the retrain never saw. That is the same like-for-like, same-ground-truth comparison, compressed into one campaign so it can run in seconds. It picks up exactly where drift.py leaves off: it continues the probe-fouling scenario on BATCH-2026-001, takes the same I-MR residual chart as its drift trigger, retrains a challenger calibration on recent post-drift groundings, revalidates it on a held-out later window against the acceptance gate, and promotes the challenger to champion only if it both clears revalidation and beats the incumbent, stamping a version bump (v1.0.0 to v1.1.0), the before/after metrics, and the dataset hash. Running python lifecycle_retrain.py prints:

governed retrain after drift on BATCH-2026-001 (probe fouling +1.2 g/L after day 7)
  drift detector (drift.py I-MR residual chart): 10/28 points out of control, max |residual|=1.037 g/L -> drift_detected=True
  retrain window days 7-11 (n=8 groundings)   held-out revalidation window days 11-14 (n=6)
    champion  v1.0.0 (trust probe)      : held-out max|residual| = 1.04 g/L  -> out of control
    challenger v1.1.0 (a=-1.06, b=1.09)   : held-out max|residual| = 0.13 g/L  -> in control
    decision: PROMOTE challenger (clears revalidation AND beats champion); record dataset hash aba381af160e
    CPV note: this model-revalidation event is logged into the plant's Stage-3 Continued Process Verification program; retrain happened under change control, not silently.
    authority: ADVISORY only -- the loop proposes the versioned candidate + its evidence; a human and the offline panel keep authority (locked-model + PCCP posture).

Read it as the four-eyes reviewer would. The challenger is revalidated on a held-out later window (days 11 to 14, data the retrain never saw), which is the only honest test, because the whole reason to retrain is that the world moved past the data the champion was built on. On that held-out window the max residual drops from the champion's 1.04 g/L (out of control) to the challenger's 0.13 g/L (back in control), so promotion is earned, not assumed: the candidate cleared the acceptance gate and beat the incumbent on the same later window. One caveat the four-eyes reviewer would add: six held-out grounding points are near the small-data floor, so a real promotion would repeat this head-to-head over several batches, the weeks-to-months cadence named above, before trusting it. The loop demonstrates the decision logic; the evidence bar in production is more batches, not one clean window. Had the challenger failed either test, the loop would have kept the champion deployed and still logged the event: a failed retrain is itself a CPV record, not a quiet non-event. And champion/challenger is now genuinely executed: the loop emits a new validated model version carrying its own before/after evidence and dataset hash, the PCCP-style "frozen at validation" record from the previous section, produced rather than merely described. The loop's authority is deliberately bounded: it proposes the versioned candidate and the evidence; a human and the offline panel keep the decision, exactly the locked-model-plus-PCCP posture the rest of the chapter insists on.
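The decision logic the reviewer is reading can be sketched independently of the real module. This is not lifecycle_retrain.py: the data are synthetic, the fouled probe is modeled as a simple slope-and-offset distortion, the linear recalibration and the 0.43 g/L acceptance limit are assumptions for illustration, and the record fields are hypothetical.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(3)
truth = np.linspace(4.0, 2.0, 14)                        # synthetic offline reference, g/L
online = 0.9 * truth + 1.0 + rng.normal(0.0, 0.02, 14)   # synthetic fouled probe reading

train, held_out = slice(0, 8), slice(8, 14)              # held-out window is strictly later
b, a = np.polyfit(online[train], truth[train], 1)        # challenger: truth ~ a + b * online

def max_abs_residual(pred, ref):
    """Largest absolute disagreement between predictions and the offline reference."""
    return float(np.max(np.abs(pred - ref)))

champion_err = max_abs_residual(online[held_out], truth[held_out])           # trust the probe
challenger_err = max_abs_residual(a + b * online[held_out], truth[held_out])

ACCEPT = 0.43   # assumed acceptance limit in g/L, for illustration only
promote = challenger_err <= ACCEPT and challenger_err < champion_err

# Hash the exact arrays used, so the record is bound to one immutable dataset.
dataset_hash = hashlib.sha256(np.concatenate([online, truth]).tobytes()).hexdigest()[:12]

record = {
    "champion": "v1.0.0",
    "candidate": "v1.1.0",
    "held_out_max_abs_residual": {"champion": round(champion_err, 3),
                                  "challenger": round(challenger_err, 3)},
    "decision": "PROMOTE (advisory)" if promote else "KEEP champion",
    "dataset_sha256_prefix": dataset_hash,
}
print(record)   # logged whether or not the candidate is promoted
```

The record is written in both branches, which is the point made above: a failed retrain still leaves an auditable trace, and the promotion itself remains a human signature rather than a side effect of the script.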

Anatomy of a model-version record

A model in GMP production is not a .pkl file on a disk — it is a governed, versioned record, and like every artifact in this series its value is in what travels alongside the weights. When a drift trigger fires and a retrain is proposed, the reviewer who signs the promotion is reading this record field by field. Dissect a model-version record the way an auditor would.

One model version, fully unpacked: the build provenance that pins it to an exact dataset hash, scaler, and hyperparameters; the green validation core with its acceptance criteria, operating range, and advisory scope; the amber live-lifecycle block carrying the model's current drift status, written retraining trigger, and next revalidation date; the governance block with the PCCP it is bound to, the four-eyes promotion signatures, and the rollback pointer; and the lineage tying it to the data it learned from, the version it superseded, and the detectors that watch it — the difference between a model file and a validated object. Original diagram by the authors, created with AI assistance.

Read the card field by field; each field answers a question an auditor will actually ask.

The build block — provenance. Feature contract: the exact named inputs, their units, and their expected ranges, so a serving-time schema mismatch is caught rather than silently mispredicted. Training dataset, pinned by sha256: the cryptographic hash from the dataset manifest binds the version to one immutable dataset, so "which data trained this?" stops being a guess and "was the data tampered with since?" becomes verifiable. Held-out split and random seed: the two things that make the reported metrics reproducible — without the seed, "R² = 0.99" is unrepeatable folklore. Fitted scaler: the mean and standard deviation (or PLS loadings) learned at training time, which must travel with the weights — feed a model raw values it expects standardized and every prediction is silently, confidently wrong, with no error thrown. Frozen hyperparameters: the number of latent variables, regularization, architecture — pinned, which is what makes "locked" literal rather than aspirational.

The green core — what validation produced. R² and RMSE against written acceptance criteria (a metric with no pre-stated threshold is not a validation, it is a vibe). The qualified operating range — the input region over which the model was proven, outside which it is by definition extrapolating and untrusted. And the intended-use scope, the single most important field on the card: advisory, human decides — never autonomous control of a CQA. That one line is the difference between this model and the unreviewed AI that drew the first warning letter.

The amber lifecycle block — what makes this record uniquely a learning artifact. A frozen software module has no such block; a learning artifact must carry its own current health. Live drift status, two fields: this version's current PSI against the golden batch and its residual-chart out-of-control count, the running output of the two detectors above. Retraining trigger, written as a rule: the exact PSI-and-residual-and-calendar-and-hardware condition that opens change control — printed on the record so the model carries its own re-qualification logic. Next scheduled revalidation date: the calendar backstop, so a model that never visibly drifts is still re-proven on schedule rather than trusted forever.
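A retraining trigger "written as a rule" might look like the sketch below. It assumes the combination stated in this chapter's key terms (a sustained PSI breach AND a residual out-of-control signal, plus a calendar backstop, plus hardware-change events); the class and function names, the three-reading definition of "sustained", and the one-point residual threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DriftStatus:
    psi_history: tuple       # most recent PSI readings, oldest first
    residual_ooc_count: int  # out-of-control points on the residual chart
    next_revalidation: date  # calendar backstop
    hardware_changed: bool   # e.g. a probe swap since the last validation


def retrain_trigger(status, today, psi_limit=0.25, sustained=3, ooc_limit=1):
    """Return the list of reasons to open change control (empty means none)."""
    reasons = []
    recent = status.psi_history[-sustained:]
    psi_breach = len(recent) == sustained and all(p >= psi_limit for p in recent)
    residual_signal = status.residual_ooc_count >= ooc_limit
    if psi_breach and residual_signal:
        reasons.append("sustained PSI breach AND residual out-of-control signal")
    if today >= status.next_revalidation:
        reasons.append("calendar backstop reached")
    if status.hardware_changed:
        reasons.append("hardware change event")
    return reasons


# Hypothetical status values for illustration.
status = DriftStatus(psi_history=(0.08, 0.27, 0.31, 0.29), residual_ooc_count=2,
                     next_revalidation=date(2026, 9, 1), hardware_changed=False)
reasons = retrain_trigger(status, today=date(2026, 6, 15))
print(reasons or "no trigger: keep the locked model running")
```

Returning reasons rather than a bare boolean means the change-control record can state why the return path was opened.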

The governance block — who is accountable. The PCCP reference the model is bound to (the pre-approved envelope this retrain stayed inside). The four-eyes promotion signatures with timestamps — the second qualified reviewer, the control whose absence is the entire point of the Purolea letter. The rollback pointer to the previous locked version (here, v3), naming exactly what production reverts to if the new version underperforms.

The violet relationships panel — lineage as a graph. This version trainedOn the curated dataset, supersedes v3, predictsFor BATCH-2026-001, is monitoredBy the PSI and residual detectors, governedBy the PCCP, and rollsBackTo v3. A model file has weights; a model version record has all of this — which is why only the latter can make a decision about a medicine.
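To make the card concrete, here is a minimal sketch of a model-version record as an immutable Python object. It carries only a subset of the fields described above, and every value (the payload, the scaler numbers, the range, the PCCP identifier, the reviewer names) is a hypothetical placeholder; it is not the schema of any particular registry.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a locked version cannot be edited in place
class ModelVersionRecord:
    version: str
    dataset_sha256: str      # pins the exact training dataset
    random_seed: int
    scaler_mean: tuple       # fitted scaler travels with the weights
    scaler_std: tuple
    operating_range: tuple   # (low, high) for one input, for brevity
    intended_use: str
    pccp_reference: str
    signatures: tuple        # four-eyes promotion signatures
    rolls_back_to: str

    def verify_dataset(self, payload: bytes) -> bool:
        """True only if the bytes are the dataset this version was trained on."""
        return hashlib.sha256(payload).hexdigest() == self.dataset_sha256

    def in_operating_range(self, x: float) -> bool:
        low, high = self.operating_range
        return low <= x <= high


# Hypothetical dataset bytes and field values, for illustration only.
payload = b"batch_id,glucose_g_per_l\nBATCH-2026-001,4.2\n"
record = ModelVersionRecord(
    version="v4",
    dataset_sha256=hashlib.sha256(payload).hexdigest(),
    random_seed=42,
    scaler_mean=(4.0,),
    scaler_std=(1.5,),
    operating_range=(0.5, 8.0),
    intended_use="advisory, human decides",
    pccp_reference="PCCP-EXAMPLE-001",
    signatures=("author", "second reviewer"),
    rolls_back_to="v3",
)
print(record.verify_dataset(payload), record.in_operating_range(12.0))
```

Freezing the dataclass is a small way to make "never edited in place" literal: a change means constructing a new version, not mutating this one.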

The record as a real graph, not a metaphor

Those relationship labels are not a drawing convention; they are the same typed edges Book 4 makes executable. A knowledge graph (a web of facts in which each thing is a node and each named, typed link between two things is an edge) lets the model-version record live inside the campaign's graph instead of beside it in a registry table. Written as RDF (the subject-predicate-object triple data model) with OWL (the logic layer that lets a relation carry rules), the panel becomes literal triples — bp:GlucoseSoftSensor-v4 bp:trainedOn bp:TrainSet-2026Q2, bp:GlucoseSoftSensor-v4 bp:supersedes bp:GlucoseSoftSensor-v3, bp:GlucoseSoftSensor-v4 bp:predictsFor bp:BATCH-2026-001 — that an inspector can query rather than read off a slide. The model version is then governed by exactly the ontology lifecycle the companion book builds: its IRI is version-pinned, its supersedes edge is a deprecation under change control, and the grounding ontology's own version is part of the model's provenance, not a detail beneath it.
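As a rough illustration of "query rather than read off a slide", the triples can be held as plain Python tuples and pattern-matched. This is a stand-in, not an RDF implementation: a real deployment would use a triple store and SPARQL, and the match helper below is a hypothetical name. The identifiers are the ones used in the text.

```python
# Each fact is one (subject, predicate, object) triple.
triples = {
    ("bp:GlucoseSoftSensor-v4", "bp:trainedOn", "bp:TrainSet-2026Q2"),
    ("bp:GlucoseSoftSensor-v4", "bp:supersedes", "bp:GlucoseSoftSensor-v3"),
    ("bp:GlucoseSoftSensor-v4", "bp:predictsFor", "bp:BATCH-2026-001"),
    ("bp:GlucoseSoftSensor-v4", "bp:rollsBackTo", "bp:GlucoseSoftSensor-v3"),
}


def match(subject=None, predicate=None, obj=None):
    """Return sorted triples matching the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# "Which data trained v4?" and "What points at v3?" as pattern queries.
trained_on = [o for _, _, o in match("bp:GlucoseSoftSensor-v4", "bp:trainedOn")]
points_at_v3 = [p for _, p, _ in match(obj="bp:GlucoseSoftSensor-v3")]
print(trained_on, points_at_v3)
```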

Three properties of that graph do real work the moment the record is machine-readable rather than human-prose:

A feature pulled by its IRI, not a fragile column name. The feature-contract row above is brittle if a feature is "the column named glucose_online" — a rename or a re-ordered export silently mis-feeds the model. Grounded, each feature is a semantically identified, unit-bearing quantity: the glucose input is the value on bp:glucoseGperL of the lot, carrying a UCUM unit code (g/L) and a BFO typing that keeps the measurement (a quality of the material) distinct from the run that produced it. A model wired to the IRI cannot be fooled by a column reshuffle, and the same identity is what lets a historian's bare float — BR101.Feed.PV = 0.40 with no units — become a feature a model can trust across systems. That cross-system identity is exactly the semantic-interoperability work the data book does: the feature's lineage anchors it to an ISA-95-style equipment and batch hierarchy (the standard model of plant assets and production records), the live signal arrives over OPC UA (the standard industrial-data transport), and the batch record it is grounded against is a B2MML document (the XML serialization of the ISA-95 / ISA-88 batch model) — so the feature contract is not a private spreadsheet convention but a row pinned to the same standards the historian, MES, and LIMS already speak.
The same SHACL shape that gates a release gates the training set. A model has no native notion of "complete": handed a lot whose HMW result silently never loaded, it imputes or predicts around the hole and reports a confident number. So conformance becomes an admission gate — before a lot's CQA panel is turned into a feature row, Book 4's release-gate bp:ReleaseShape (a SHACL shape: a closed-world rule that every required result be present, singular, typed, and in range) certifies the row is well-formed. A subgraph that fails its shapes is precisely the hollow or mislabeled training set that teaches a model a confident error; SHACL catches it before the model does. The release gate is, read this way, the labeling contract for the PASS/OOS label a release predictor learns.
The lineage edge is the cross-validation grouping key. Because bp:derivedFrom is declared owl:TransitiveProperty, every lot that shares a working-cell-bank ancestor is reachable by one query — and those lots are not independent rows. A naive random train/test split leaks sibling samples of the same lot across the fold and flatters the score; splitting on the shared derivedFrom ancestry instead — a grouped / leave-one-batch-out cross-validation — reports a number the model would actually earn on an unseen campaign. The graph makes that grouping mechanical: the derivedFrom edges are the grouping key, where a flat table leaves it to a hopeful convention. The same transitive edge that scopes a recall scopes a fair model evaluation.
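The lineage-as-grouping-key idea can be sketched without a graph library: walk each lot's derivedFrom chain to its root ancestor, then hold out one ancestor group at a time. The lot and cell-bank identifiers below are hypothetical, and the child-to-parent dictionary stands in for the transitive bp:derivedFrom query a real graph would answer.

```python
# child -> parent edges, standing in for bp:derivedFrom (hypothetical IDs)
derived_from = {
    "LOT-A1": "WCB-A",
    "LOT-A2": "WCB-A",
    "LOT-A2-DP": "LOT-A2",
    "LOT-B1": "WCB-B",
    "LOT-C1": "WCB-C",
}


def root_ancestor(node):
    """Follow derivedFrom edges to the top of the lineage."""
    seen = set()
    while node in derived_from:
        if node in seen:
            raise ValueError(f"cycle in lineage at {node}")
        seen.add(node)
        node = derived_from[node]
    return node


def leave_one_group_out(sample_lots):
    """Yield (held_out_group, train_indices, test_indices) per ancestor group."""
    groups = [root_ancestor(lot) for lot in sample_lots]
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


sample_lots = ["LOT-A1", "LOT-A2", "LOT-A2-DP", "LOT-B1", "LOT-C1"]
splits = list(leave_one_group_out(sample_lots))
for held_out, train, test in splits:
    print(held_out, "train:", train, "test:", test)
```

Because every sample sharing an ancestor lands on the same side of the split, sibling rows cannot leak across the fold. A random split over the same five rows offers no such guarantee.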

This is also why the validated graph, not the model's fluency, is the ground truth a GraphRAG assistant is grounded against: ask such a model "was DP-004 released?" and the honest answer — no, it tripped its HMW limit — is one the graph derives and certifies (its transitive closure and SHACL shapes machine-checked) while the model merely generates. A model that contradicts a reasoned graph is, in that contradiction, the more likely wrong of the two. The ontology does for the retrieval boundary what an applicability-domain check does for a soft sensor: a query returning no conforming subgraph is the graph analogue of an out-of-distribution flag — a refusal to answer rather than a confident guess on unfamiliar ground.

The validation paradox: how do you validate something that learns?

Here is the contradiction stated plainly. GMP validation means: prove the system does what it should, then lock it, and prove again before any change. Machine learning means: improve by changing in response to new data. A model that keeps learning is, by definition, a system that keeps changing — which is precisely the thing validation forbids without re-qualification. You cannot validate a moving target with a one-time test, and biomanufacturing's whole quality edifice is built on one-time validation plus change control.

The honest current resolution is not "validate a continuously-learning model"; that remains unsolved and, for critical decisions, unwanted. The resolution is to break the learning into discrete, governed steps: lock a model, run it unchanged, detect when it has drifted, retrain offline into a new candidate, validate the candidate, and promote it through change control. Learning happens between locked versions, never within one; this is the pattern this book calls locked-then-relearn. The PCCP is what makes that cadence efficient rather than glacial: it pre-approves the shape of the change so each retrain is a planned event, not a new regulatory negotiation. The continuously-learning ideal is not banned because regulators fear progress; it is set aside because no one has yet shown how to prove, at any instant, that a model which rewrote itself a minute ago is still safe, and "prove it" is the non-negotiable core of GMP.
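One way to picture locked-then-relearn in code is a registry in which versions are immutable, promotion requires validation plus two distinct approvers, and rollback only moves a pointer. The sketch below is a toy under those assumptions; LockedModel, ModelRegistry, and the approver names are hypothetical and do not describe any specific registry product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedModel:
    version: str
    coefficients: tuple  # frozen weights; never edited in place


class ModelRegistry:
    def __init__(self, initial):
        self._versions = {initial.version: initial}  # old versions are never deleted
        self._history = [initial.version]            # order of promotions

    @property
    def deployed(self):
        return self._versions[self._history[-1]]

    def promote(self, candidate, validated, approvers):
        """Learning happens between versions: a new entry, never an edit."""
        if candidate.version in self._versions:
            raise ValueError("version already exists; versions are immutable")
        if not validated:
            raise PermissionError("candidate has not passed revalidation")
        if len(set(approvers)) < 2:
            raise PermissionError("four-eyes gate needs two distinct approvers")
        self._versions[candidate.version] = candidate
        self._history.append(candidate.version)

    def rollback(self):
        """Revert to the previous locked version; the bad one stays on record."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.deployed


registry = ModelRegistry(LockedModel("v1.0.0", (2.0, 0.1)))
registry.promote(LockedModel("v1.1.0", (1.9, 0.2)), validated=True,
                 approvers=("author", "second reviewer"))
print(registry.deployed.version)    # the promoted version is now serving
print(registry.rollback().version)  # production reverts to the prior lock
```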

The regulators have drawn this line with increasing sharpness, and the evidence tiers matter:

The FDA's 2023 discussion paper Artificial Intelligence in Drug Manufacturing (a public regulatory document, but explicitly a discussion paper, not a binding rule) raises exactly these questions — managing training data, validating and re-validating models, applying risk-based scrutiny so a model touching a critical decision faces more — without prescribing settled answers [2]. FDA's broader 7-step model-credibility framework (from its credibility-of-computational-models guidance, aligned to ASME V&V 40) gives the risk-proportionate structure: define the question of interest, determine the context of use, assess model risk, develop a credibility plan, execute it, document it, and assess adequacy for that context of use — so the higher the consequence of a model being wrong, the more credibility evidence it must carry [3]. The framework is why a Raman soft sensor advising a feed and a model auto-releasing vials are held to wildly different evidence bars even though both are "ML."
The draft Annex 22 (regulatory document, in consultation) draws the hardest line of all: it requires AI models used in critical GMP applications to be static (locked) — parameters fixed during use — and explicitly excludes self-learning, generative, and adaptive AI from those uses, demanding a predetermined change-control approach for any model update [4]. In other words, the draft codifies locked-then-relearn as the only acceptable pattern for critical applications — the continuously-learning model is, for now, regulatorily off the table where it touches quality. The mirror-image consequence is also worth stating: a model that is not critical (advisory, human-in-the-loop, far from a CQA) faces a far lighter touch, which is exactly why the deployed reality clusters there.
The ISPE GAMP guidance on AI extends the established computerized-system-validation thinking toward models — risk-based, lifecycle, ongoing-evidence — translating "validate your software" into "validate your model and monitor it forever" [5]. The "forever" is the operative word: GAMP makes ongoing monitoring a validation requirement, not a nicety, which is what turns the drift detectors of this chapter from good engineering into compliance evidence.

The consequence of getting this wrong is no longer hypothetical. The Purolea cGMP warning letter (April 2026) is the first FDA warning letter to cite AI: a firm used AI agents to generate specifications, SOPs, and master production records without quality-unit review — exactly the missing four-eyes gate this chapter insists on [6]. The failure was not that AI was used; it was that an unvalidated, unreviewed model was allowed to produce GMP records autonomously, cited under the quality-unit-responsibilities regulation. The lifecycle discipline here — locked models, written triggers, human promotion gates, rollback — is precisely the set of controls whose absence drew that letter.

The unsolved part: detecting concept drift before the reference lands

Be honest about the hard residual problem this chapter does not solve. The residual control chart, our only true ground-truth detector, is lagging by construction. Between the moment concept drift begins and the moment enough sparse offline assays accumulate to prove it, a model that has started to mislead looks identical to one that is working. The factorization from the start of the chapter is the reason this gap is structural, not an implementation weakness: concept drift moves P(Y|X) while leaving P(X) untouched, so PSI — which measures only the movement of P(X) — is mathematically guaranteed to stay silent. PSI partly covers the other gap (input drift without labels), but it is blind to the dangerous case by definition.
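The structural blindness is easy to reproduce on synthetic data. In the sketch below, the inputs keep the same distribution while the input-to-output relationship changes: PSI stays near zero and the residual is large. A plain input shift, by contrast, moves PSI. All numbers are synthetic assumptions (a linear toy model, Gaussian inputs); the psi helper follows the reference-quantile binning described in this chapter but is not the code of drift.py.

```python
import numpy as np


def psi(reference, new, n_bins=10, eps=1e-6):
    """Population Stability Index with bins set on reference quantiles."""
    interior = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    p = np.bincount(np.searchsorted(interior, reference), minlength=n_bins) / len(reference)
    q = np.bincount(np.searchsorted(interior, new), minlength=n_bins) / len(new)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)  # avoid log(0)
    return float(np.sum((q - p) * np.log(q / p)))


def predict(x):
    """The locked toy model: it froze the relationship y = 2x."""
    return 2.0 * x


rng = np.random.default_rng(0)
x_ref = rng.normal(5.0, 1.0, 2000)        # inputs at validation time

# Concept drift: same P(X), but the world now obeys y = 1.6x.
x_concept = rng.normal(5.0, 1.0, 2000)
y_concept = 1.6 * x_concept + rng.normal(0.0, 0.1, 2000)

# Covariate shift: P(X) itself moves by one standard deviation.
x_covariate = rng.normal(6.0, 1.0, 2000)

psi_concept = psi(x_ref, x_concept)
psi_covariate = psi(x_ref, x_covariate)
mean_residual = float(np.mean(predict(x_concept) - y_concept))

print(f"PSI under concept drift:   {psi_concept:.3f}")
print(f"PSI under covariate shift: {psi_covariate:.3f}")
print(f"mean residual under concept drift: {mean_residual:.2f}")
```

Only the last line needs ground truth. That is the gap: the one number that exposes the concept drift is the one that arrives late.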

The genuinely dangerous scenario — a cell line that has subtly adapted so that the same spectrum now implies a different titer — produces no PSI alarm and no residual alarm until the slow assay finally disagrees. There is, today, no reliable way to detect concept drift in real time without ground truth, and ground truth in bioprocess is sparse and slow on purpose. Worse, the sparsity caps how fast the residual chart can ever react: at one or two offline assays a day, the moving-range estimate that sets the limits is built from a handful of points, so the chart needs several consecutive out-of-control assays — days — before the signal is statistically convincing rather than a single noisy bench result. You cannot tighten that without making the bench assay cheaper or faster, which is a wet-lab problem, not an algorithm.
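A sketch of the individuals (I-MR) residual chart makes the sparsity problem tangible: limits come from the average moving range of a short clean run, and a convincing signal needs a run of consecutive out-of-control points. The residual values and the three-in-a-row rule below are hypothetical; the 1.128 divisor is the standard constant for moving ranges of two consecutive points.

```python
import numpy as np

D2 = 1.128  # standard constant converting a span-2 average moving range to sigma


def imr_limits(clean_residuals):
    """Individuals-chart limits from the moving-range sigma of a clean run."""
    r = np.asarray(clean_residuals, dtype=float)
    mr_bar = np.mean(np.abs(np.diff(r)))  # average moving range
    sigma = mr_bar / D2
    center = r.mean()
    return center - 3 * sigma, center, center + 3 * sigma


def out_of_control(residuals, limits):
    lcl, _, ucl = limits
    r = np.asarray(residuals, dtype=float)
    return (r < lcl) | (r > ucl)


def sustained_signal(flags, k):
    """True if at least k consecutive points are out of control."""
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run >= k:
            return True
    return False


# Hypothetical residuals (g/L): eight clean assays, then five later ones.
clean = [0.05, -0.03, 0.02, -0.06, 0.04, -0.01, 0.03, -0.04]
later = [0.02, 0.10, 0.25, 0.31, 0.28]

limits = imr_limits(clean)
flags = out_of_control(later, limits)
print("limits:", tuple(round(v, 3) for v in limits))
print("flags:", flags.tolist(), "sustained:", sustained_signal(flags, k=3))
```

At one or two assays a day, the three consecutive flags in this toy run represent roughly two to three days of elapsed time before the signal counts as sustained.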

The partial mitigations are exactly the ones this book keeps returning to, and it is important that none is a full answer — they shrink the blind window, they do not close it:

Hybrid models drift more slowly and flag physically implausible outputs (a negative concentration, a mass balance that does not close), narrowing the window in which silent drift can hide because the physics backbone refuses to follow the data into nonsense.
Uncertainty-aware models — a prediction interval that widens when the input is far from training data — turn extrapolation into a visible warning rather than a confident wrong number; a model that says "I am unsure here" is one a human can catch even before the assay lands. This is the closest the field has to a label-free concept-drift hint, but it detects only the input-far-from-training case, not a relationship change at familiar inputs.
Transfer learning and Bayesian priors let a model re-anchor on a handful of new reference points faster, shortening the time-to-detect and the time-to-retrain once the slow data does arrive.
Multivariate model diagnostics (Hotelling's T-squared and the Q-residual / SPE): a PLS or PCA soft sensor fits the data to a low-dimensional "plane" of the patterns it learned. Hotelling's T-squared measures how far a new spectrum's projection sits from the center of that plane; the squared prediction error (SPE, also called the Q-residual) measures how far the spectrum sits off the plane. SPE rises the moment a spectrum carries structure the calibration never saw, so a large value is a real-time, label-free alarm that the model is being asked to extrapolate. It is the multivariate generalization of PSI, sensitive to correlation breaks a per-feature PSI is blind to, and it is already what SIMCA-online and ProMV watch. Like the uncertainty band, it catches input-far-from-model, not a relationship change at familiar inputs.
Surrogate or proxy labels: a faster, cheaper, noisier online measurement that correlates with the slow gold-standard assay can give a more frequent (if less trustworthy) residual stream to chart, buying earlier suspicion at the cost of more false alarms. For instance, an online capacitance (permittivity) probe gives a minute-cadence, noisier proxy for the once-or-twice-daily offline viable-cell-density count, letting the residual chart run more often between the slow gold-standard groundings; off-gas OUR/CER (the oxygen-uptake and CO2-evolution rates measured from the exhaust gas) is a second such metabolism proxy. This is an active research direction, not a deployed solution.
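Of the mitigations above, the SPE / T-squared diagnostic is the easiest to sketch. The NumPy example below fits a two-component PCA to synthetic "spectra", then scores a familiar sample and one carrying structure the calibration never saw. The data, the two-component choice, and the empirical 99th-percentile limit are assumptions for illustration; commercial tools compute their limits differently.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return the mean, loadings, and per-component score variance."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    loadings = vt[:n_components].T                      # (features, components)
    score_var = s[:n_components] ** 2 / (len(X) - 1)
    return mean, loadings, score_var


def t2_and_spe(x, mean, loadings, score_var):
    """Hotelling's T-squared (within the plane) and SPE (off the plane)."""
    xc = x - mean
    scores = xc @ loadings
    reconstruction = scores @ loadings.T
    t2 = float(np.sum(scores ** 2 / score_var))
    spe = float(np.sum((xc - reconstruction) ** 2))
    return t2, spe


rng = np.random.default_rng(1)
factors = rng.normal(size=(2, 20))                      # two latent patterns
calib = rng.normal(size=(200, 2)) @ factors + 0.05 * rng.normal(size=(200, 20))
mean, loadings, score_var = fit_pca(calib, n_components=2)

# Empirical SPE limit from the calibration set (an illustrative choice).
calib_spe = np.array([t2_and_spe(row, mean, loadings, score_var)[1] for row in calib])
spe_limit = float(np.quantile(calib_spe, 0.99))

familiar = rng.normal(size=2) @ factors + 0.05 * rng.normal(size=20)
new_direction = rng.normal(size=20)
new_direction /= np.linalg.norm(new_direction)
novel = familiar + 1.0 * new_direction                  # structure never calibrated

_, spe_familiar = t2_and_spe(familiar, mean, loadings, score_var)
_, spe_novel = t2_and_spe(novel, mean, loadings, score_var)
print(f"limit {spe_limit:.3f}  familiar {spe_familiar:.3f}  novel {spe_novel:.3f}")
```

No label is used anywhere in the scoring, which is why this works in real time. It also shows the limit of the method: a sample on the plane with a changed titer relationship would score as familiar.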

But the structural truth stands: until offline reference data is cheaper or faster, a deployed bioprocess model must be distrusted on a schedule — monitored, periodically reconciled, retrained under change control, with the standing assumption that it is wrong until the slow data proves otherwise. The validation paradox is managed, not dissolved.

What this chapter adds to the model suite

This chapter contributes two modules to the Book 5 example suite. examples/platform/ml/drift.py is the detection half: it consumes fedbatch_state.parquet and offline_assays.csv, runs the two complementary detectors (an I-MR residual control chart and cross-batch PSI against the golden BATCH-2026-001), and its headline result is the chart catching an injected probe-fouling bias (10 of 28 points out of control). examples/platform/ml/lifecycle_retrain.py is the governance half: it picks up where drift.py's trigger leaves off and runs the full governed retrain loop on the same datasets, with champion/challenger scoring, held-out revalidation, the v1.0.0-to-v1.1.0 promotion on that held-out evidence, a dataset hash, and a CPV/PCCP log entry, so the lifecycle this chapter describes is executed end to end, not just narrated. The detection half says whether to act; the governance half says how. Both coordinate with, and deliberately do not duplicate, the soft-sensor module and MSPC module they watch: those make predictions; these watch whether the predictions can still be believed. The glucose values are the real committed dataset numbers; the fouling bias is a clearly labeled injected perturbation.

Why it matters

Every model in the previous twenty-one chapters is worthless the day it goes stale and nobody notices. A soft sensor that has drifted high quietly mis-feeds a culture; an MSPC monitor calibrated on last year's cell line waves through a batch it should flag; a vision system whose lighting changed starts passing defects. Drift is not an edge case in bioprocess — it is the default trajectory of a static model watching a living, changing process on moving hardware. The MLOps discipline in this chapter is what converts a model from a one-time achievement into a sustainable instrument: detection that fires before and after the damage, a written rule for when to rebuild, a locked-then-relearn lifecycle that satisfies the regulator, and a rollback that means a bad retrain cannot become a bad batch. And it is what keeps a learning model legal: the difference between a defensible, validated, monitored model and the unreviewed-AI failure that drew the industry's first AI warning letter is exactly the lifecycle drawn here.

In the real world

The production reality is sobering and worth stating without spin. The ISPE Pharma 4.0 survey finds that AI/ML has the most pilots and the fewest scaled implementations of any digital technology in biomanufacturing, and that production deployments cluster in monitoring, predictive maintenance, vision inspection, and human-in-the-loop documentation — not autonomous control of CQAs [7]. The MLOps gap is a large part of why: building a model is easy, but operating one for years under GMP — drift monitoring, governed retraining, validation evidence that never expires — is hard, expensive, and not something you can pip install.

What is genuinely production-grade today is the monitoring layer this chapter rests on. Multivariate statistical process monitoring with PCA/PLS — Sartorius SIMCA-online and AspenTech ProMV — is deployed (production) for continued process verification and golden-batch monitoring, and a drifting model shows up in exactly those tools as a process that has left its envelope (vendor-self-reported as to specific installations, but the method itself is long-established and independently published). MLflow and similar registries give the technical spine of versioning, dataset-hash binding, and run-to-model lineage that the model-version record above depends on — but, as the open-source analytics chapter is careful to say, MLflow tracks; it does not validate. The audit trail on who promoted a version, the e-signature on the validation report, the change-control procedure governing retraining, and the documented intended use are properties of a validated system and procedure, not of a tool. The registry can store the model-version record's fields; it cannot supply the four-eyes signature or the PCCP — those are human and procedural, and that is the gap that stalls most programs.

The honest summary: drift detection and residual monitoring are real, mathematically simple, and ready to deploy — the drift.py detectors here are a few dozen lines of NumPy. The PCCP and locked-model lifecycle is the regulatorily-blessed path, codified in the draft Annex 22 and anticipated by the FDA discussion paper, but it is paperwork and procedure as much as software, and it is where most organizations' ML ambitions slow to a walk. Continuously-learning models touching CQAs are not deployed in GMP today and, under the current draft regulations, are not permitted to be. The frontier is not a self-improving controller; it is a locked, monitored, periodically-relearned model with a human at the promotion gate.

Key terms

MLOps — the operational discipline of deploying, monitoring, retraining, and governing models in production; under GMP, dominated by the validation lifecycle rather than by continuous deployment.
Model drift — the gradual divergence between the relationship a model froze and the one the world now obeys; the default trajectory of a static model watching a living process.
Critical quality attribute (CQA) — a measurable product property that must stay inside a safe range for the batch to release (e.g. host-cell protein, SEC monomer); whether a model touches a CQA decides how tightly it must be governed.
Covariate shift (input drift) — the input distribution P(X) moves while the physics is unchanged; detectable without labels (e.g. by PSI).
Concept drift — the relationship P(Y|X) itself changes, so the same inputs imply a different answer; detectable only with ground truth, hence always a lagging discovery.
Population Stability Index (PSI) — a label-free input-drift metric, the symmetrized relative-entropy (Jeffreys) distance between the reference and new input histograms, binned on reference quantiles; rule of thumb below 0.1 stable, 0.1 to 0.25 moderate, 0.25 and above significant — a convention, not a hypothesis test.
Residual control chart — an I-MR control chart applied to the prediction-minus-reference residual stream, with limits set from the moving-range sigma of a clean reference run; flags model drift the same way SPC flags process drift; a lagging, ground-truth detector.
Locked model — a model frozen in production (weights, preprocessing, scaler, operating range), version-pinned and never edited in place; the only pattern the draft Annex 22 permits for critical applications.
Continued Process Verification (CPV) — "Stage 3" of process validation in the FDA's 2011 Process Validation guidance: the ongoing program of collecting and analyzing product and process data throughout commercial production to give continual assurance the process stays in a validated state of control; a deployed model's drift monitoring and governed retraining feed this existing program, and a model-revalidation is a change-control event logged into CPV rather than a separate ML ritual.
Predetermined Change Control Plan (PCCP) — a pre-approved written specification of how a model may be retrained and updated (data-management plan, modification protocol, acceptance criteria, rollback), so a retrain inside the envelope is a planned event rather than a new regulatory negotiation.
Retraining trigger — the written rule (combining a sustained PSI breach AND a residual out-of-control signal, plus a calendar backstop, plus hardware-change events) that opens the change-control return path.
Champion/challenger — running a retrained candidate in shadow on live inputs alongside the deployed model so the two can be scored head-to-head against the same ground truth before promotion; the candidate touches nothing until it wins.
Four-eyes promotion gate — the requirement that a second qualified person review and sign the promotion of a new model version, the missing control whose absence drew the first AI cGMP warning letter.
Rollback — reverting to the last known-good locked model version when a retrain underperforms in production; always available because old versions are never deleted.
Model-version record as a graph — expressing the version's lineage as typed RDF/OWL edges (bp:trainedOn, bp:supersedes, bp:predictsFor, bp:monitoredBy, bp:governedBy, bp:rollsBackTo) inside the campaign's knowledge graph, so the record is queryable and governed by the same ontology lifecycle rather than read off a slide.
Feature by IRI — wiring a model input to a semantically identified, unit-bearing quantity (its ontology IRI, with a UCUM unit and BFO typing) rather than to a fragile column name, anchored to the ISA-95 / OPC UA / B2MML standards the plant systems already speak, so a rename or re-export cannot silently mis-feed the model.
Conformance as admission gate — running the SHACL release shape (bp:ReleaseShape) over a lot's panel before it becomes a training row, so a hollow or mislabeled subgraph is rejected before the model learns a confident error; the closed-world rule is the labeling contract for a PASS/OOS predictor.
Grouped / leave-one-batch-out cross-validation — splitting model evaluation on the transitive bp:derivedFrom lineage (the shared working-cell-bank key) rather than at random, so sibling samples of one lot never leak across the fold and the reported score is honest; the graph's lineage edges supply the grouping key.
Validation-versus-learning paradox — the structural tension between GMP's demand for a locked, proven system and ML's nature of changing as it learns; managed today by locked-then-relearn, not by validating a continuously-learning model.

Where this leads

A model that stays true and stays validated is one part of running a real plant; the rest of the system has its own learning to do. The next chapter, Manufacturing Operations: Predictive Maintenance, Yield, and Scheduling, steps back from the molecule to the factory around it — learning when a pump will fail before it does, forecasting yield across a campaign, and scheduling a multi-product plant — the operational ML that keeps the lights on and the suites full, governed by the same lifecycle discipline this chapter just built.
