QC and Release: MSPC, Real-Time Release, and Predicting the OOS

📍 Where we are: Part V · Fill-Finish & Release, Learned — Chapter 18. Formulation and fill-finish turned DS-001 into filled, inspected vials of DP-001 and met the strongest production ML case of all, deep-learning vision. But no unit ships until the lot is released — and this chapter is the gate every batch must pass.

The product exists. It is in the vials. What it does not yet have is permission to leave the building. Release is the moment a quality unit (the independent quality-assurance organization that owns batch release) looks at a batch's full evidence package — the release assays that make up its Certificate of Analysis, the deviations raised during manufacture, the environmental monitoring record — and decides whether every critical quality attribute (CQA — a must-pass property the product is required to meet) landed inside its specification. In our running campaign five siblings pass and one, BATCH-2026-004, does not: its host-cell-protein (HCP — a residual process impurity, the leftover protein from the production cells that must be cleared to a safe trace) result reads 128.0 ng/mg (nanograms of impurity per milligram of product) against a 100 ng/mg not-more-than (NMT) limit. That is an out-of-specification (OOS) result, and it must be investigated before any disposition (the formal decision to release, reject, or quarantine the lot). The 0 in the dataset's 0–100 band is a lower-bound placeholder, not a target, since HCP is never truly zero by any sensitive ELISA (the standard antibody-based lab test that measures a specific protein). HCP, host-cell DNA, endotoxin (a fever-inducing toxin from bacterial cell walls), and bioburden (the count of viable microbes in the product) are all one-sided NMT upper limits of this kind — impurities and contaminants you want as low as possible. Book 1 defines each of these release assays in its analytical and formulation chapters. This chapter is about where machine learning genuinely helps at that gate — and, just as importantly, where it does not.

The honest headline is that release is the most conservative corner of the whole process. A model can advise a release decision; it almost never makes one. So the learning that matters here is monitoring, not autonomy: catching a batch that has drifted out of the family of good batches, predicting a likely failure early enough to act, and turning the dense, multivariate quality record into a single defensible verdict. Multivariate statistical process monitoring (MSPM, also known as multivariate statistical process control, MSPC) is the production-grade tool, and it is the spine of this chapter.

The simple version

Imagine a customs officer who has waved through ten thousand normal travelers. They are not checking each one against a giant rulebook; they have learned the shape of normal — the posture, the pace, the paperwork — and they notice when someone does not fit it, even if no single thing is technically wrong. Multivariate monitoring is that officer for a batch: it learns the joint shape of every quality result on good batches, and flags the one that does not fit the family, then points at exactly which attribute broke the pattern. Predicting the OOS is the officer who, from cheap early signals, guesses who will fail inspection before they reach the desk — useful, but never a substitute for the inspection itself.

What this chapter covers

Multivariate SPM (MSPC): PCA on the release panel, Hotelling's T² (in-plane distance) and SPE / Q (off-plane residual), and why the two statistics catch different failures.
Golden-batch and multiway PCA: monitoring a whole trajectory, not one number, by unfolding the batch × variable × time cube.
Real-time release testing (RTRT): what it promises, why it is common in small-molecule continuous manufacturing, and why it is genuinely scarce for biologics.
Predicting the OOS: a calibrated, interpretable release-outcome classifier from in-process features — and why the threshold, not the accuracy, is where the quality unit's risk tolerance lives.
The anatomy of one MSPC verdict — how a flag carries its model, statistics, limits, contribution, confirming assay, and disposition — and the GMP (Good Manufacturing Practice — the binding quality regulations all drug manufacturing runs under) framing that keeps every release model advisory. (MSPC's SPE statistic is this chapter's anomaly detector: fit on the good-batch family, it grades each new batch as novel-or-not against that family.)

MSPC: learning the shape of a good batch

Univariate SPC (statistical process control — watching one measured quantity over time against control limits) — the I-MR chart and Cpk (a run-chart of individual values and a process-capability index, both of which Book 3's analytics chapter builds in code) — charts one attribute at a time ("univariate" = one variable; "multivariate" = many at once). It is necessary and it is not sufficient. A batch can have every individual result comfortably inside its own specification and still be subtly, dangerously abnormal, because the relationship between attributes has moved. Two batches with identical monomer purity (the fraction of product that is the intact single antibody, rather than clumped-together aggregates or fragments) can differ entirely in how their charge variants (slightly modified forms of the molecule that carry different electrical charge), aggregates (clumped-together molecules), and impurities co-vary. That joint structure is exactly what univariate SPC throws away and what multivariate monitoring reads back out. The deeper reason univariate charts fail at scale is the multiple-comparison problem: run eleven independent 3σ charts on one batch and, even if every attribute is truly in control, the probability that at least one chart false-alarms is 1 − 0.9973^11 ≈ 2.9% per batch (a single 3-sigma chart leaves a 1 − 0.9973 = 0.27% false-alarm chance in control, so eleven of them compound) — and conversely, a batch can stay inside all eleven individual limits while sitting in a region of the joint space that no good batch ever occupied. MSPC replaces eleven separate limits with one joint limit on the correlated whole.
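The compounding in that paragraph is easy to check numerically. The sketch below assumes the charts are independent and that each uses two-sided 3σ limits on a normally distributed attribute; the function name is illustrative and is not part of the suite.

```python
import math

def familywise_false_alarm(n_charts: int, sigma: float = 3.0) -> float:
    """P(at least one false alarm) across n independent, in-control charts."""
    # Two-sided normal coverage inside +/- sigma; about 0.9973 at 3 sigma.
    p_inside = math.erf(sigma / math.sqrt(2.0))
    return 1.0 - p_inside ** n_charts

for n in (1, 11):
    print(f"{n:>2} chart(s): {familywise_false_alarm(n):.2%}")
```

One chart gives roughly 0.27% and eleven give roughly 2.9%, matching the figures in the text. Real release attributes are correlated, so the independence assumption is only a first approximation.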

The PCA model plane

The workhorse is Principal Component Analysis (PCA). Intuitively, PCA finds a few new combined axes — each a weighted blend of the original attributes — that capture the directions good batches actually vary along; projecting a batch onto those few axes is the low-dimensional "shape of normal" everything below measures against. The rest of this section is the formal version of that one idea, and a reader who is not comfortable with matrix algebra can skim the symbols and keep the intuition. Take the release panel as a matrix X of shape n × k: one row per batch (n batches), one column per quality attribute (k attributes). Standardize each column to zero mean and unit variance — z_ij = (x_ij − μ_j) / s_j — so a part-per-million impurity and a percentage purity are on comparable footing; without this step the largest-magnitude attribute would dominate the variance and PCA would simply track it. Call the standardized matrix Z. PCA decomposes Z by singular value decomposition, Z = U S Vᵀ, and keeps the first a columns of V as the loadings P (a k × a matrix whose columns are orthonormal directions of greatest joint variation). Equivalently, P holds the leading eigenvectors of the correlation matrix R = Zᵀ Z / (n − 1), and the eigenvalues λ₁ ≥ λ₂ ≥ … (the diagonal of S²/(n−1)) are the variance each component explains. Those first a components — a is the number of components you keep — define a low-dimensional model plane, "the shape of a normal batch." In practice a is chosen so the kept components explain most of the joint variance (a scree / variance-explained criterion); on our deliberately tiny five-batch panel we fix a = 2, which is all the data can responsibly support.

Each batch projects onto that plane as a row of scores t = z P (a point in the a-dimensional plane), and is approximately reconstructed as ẑ = t Pᵀ = z P Pᵀ. The reconstruction is exact only if a = k; with a < k, the part of z the plane cannot represent is the residual e = z − ẑ. Those two pieces — the in-plane projection and the off-plane residual — are precisely the two things the monitoring statistics measure.
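For readers who prefer to see the algebra run, here is a minimal sketch on synthetic data (a made-up 30 × 4 panel, not the book's dataset). It checks that the SVD route and the correlation-matrix route give the same eigenvalues, and that the residual is orthogonal to the model plane. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a release panel: 30 batches x 4 attributes, 2 latent drivers.
latent = rng.normal(size=(30, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(30, 4))

# Standardize each column: z_ij = (x_ij - mu_j) / s_j
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n, k = Z.shape
a = 2  # number of components kept

# Route 1: SVD of Z. Loadings are the first a right singular vectors.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:a].T                       # k x a, orthonormal columns
eig_svd = S ** 2 / (n - 1)         # variance explained by each component

# Route 2: eigenvalues of the correlation matrix R = Z^T Z / (n - 1).
R = Z.T @ Z / (n - 1)
eig_R = np.sort(np.linalg.eigvalsh(R))[::-1]

# Scores, reconstruction, and residual for one batch.
z = Z[0]
t = z @ P                          # scores (a point in the model plane)
z_hat = t @ P.T                    # reconstruction from the plane
e = z - z_hat                      # off-plane residual

print("eigenvalues agree:", np.allclose(eig_svd, eig_R))
print("eigenvalues sum to k:", np.isclose(eig_svd.sum(), k))
print("residual orthogonal to plane:", np.allclose(e @ P, 0.0, atol=1e-10))
print("Pythagoras holds:", np.isclose(z @ z, z_hat @ z_hat + e @ e))
```

The last check is the reason the two monitoring statistics can be read separately: the squared length of a standardized batch splits exactly into an in-plane part and an off-plane part.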

Two statistics, two kinds of failure

What makes MSPC a monitoring method rather than a clustering exercise is one disciplined move: fit the model on the good-batch family only, then score every batch — including new, unknown ones — against it. The model encodes what good looks like; a new batch is graded by how far it sits from that learned normal. Two complementary distances do the grading, and they are orthogonal by construction — one lives in the plane, the other off it:

Hotelling's T² measures how far a batch sits from the center of the good-batch cloud inside the model plane. It is the scores scaled by their own variance: T² = Σ_{r=1..a} t_r² / λ_r, the squared Mahalanobis distance in score space (a distance that stretches each direction by how much good batches vary in it, so an unusual move along a normally-stable direction counts more than the same move along a naturally-variable one). The per-component division by λ_r matters — a 0.1-unit move along a tight, low-variance component counts far more than the same move along a loose, high-variance one, because the former is genuinely unusual for a good batch. A high T² means "an extreme but still recognizable member of the family" — unusual along directions the good batches did vary in.
SPE (squared prediction error), also called the Q-statistic, measures the distance off the plane: SPE = Σ_{j=1..k} e_j² = ‖z − ẑ‖², the squared length of the residual. A high SPE means "something we have never seen" — a new correlation, a novel impurity, a behavior no good batch ever showed, because that behavior projects into the dimensions the model discarded as noise. This is the statistic that catches genuinely new failure modes, and it is usually the one that fires first.
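A toy calculation makes the division by λ_r and the in-plane versus off-plane split concrete. The eigenvalues and loadings below are hypothetical, chosen so the arithmetic can be done by hand: the plane is simply the first two coordinate axes of a three-attribute space.

```python
import numpy as np

eig = np.array([4.0, 0.25])     # hypothetical variances: one loose, one tight component
P = np.eye(3)[:, :2]            # toy loadings: the plane is spanned by axes 0 and 1

def t2_spe_one(z: np.ndarray):
    """T2 and SPE for a single standardized observation z."""
    t = z @ P                            # scores
    t2 = float(np.sum(t ** 2 / eig))     # in-plane distance, scaled per component
    e = z - t @ P.T                      # residual off the plane
    return t2, float(np.sum(e ** 2))

moves = {
    "1 unit along loose component": np.array([1.0, 0.0, 0.0]),
    "1 unit along tight component": np.array([0.0, 1.0, 0.0]),
    "1 unit off the plane":         np.array([0.0, 0.0, 1.0]),
}
for label, z in moves.items():
    t2, spe = t2_spe_one(z)
    print(f"{label:30s} T2={t2:.2f}  SPE={spe:.2f}")
# 1 unit along loose component   T2=0.25  SPE=0.00
# 1 unit along tight component   T2=4.00  SPE=0.00
# 1 unit off the plane           T2=0.00  SPE=1.00
```

The same one-unit move costs sixteen times more T² along the tight component than along the loose one, and a move in the discarded direction is invisible to T² while SPE registers it in full.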

Where the control limits come from

A statistic is only actionable against a limit. The two have different distributional theory. The T² limit follows the F-distribution: for a components fit on n batches, at significance α, T²_lim = a(n−1)(n+1) / (n(n−a)) · F_{α; a, n−a}. (For n ≫ a this collapses to the familiar χ²_{α; a}, but the small-sample F form is honest about how few batches we have.) The SPE limit has no clean closed form; the rigorous route is the Box / Jackson–Mudholkar weighted-χ² approximation SPE_lim = g·χ²_{α; h} with g = ν/(2m) and h = 2m²/ν from the mean m and variance ν of the good-batch SPE. In our deliberately tiny example we use the simpler, transparent mean + 3σ of the good-batch SPE — it is a five-sample approximation, not a validated limit, and we say so. Either way, the limit is learned from the good family, never assumed. Both forms rest on a near-normality assumption — the F-form T² limit on multivariate-normal scores, the weighted-χ² SPE limit on an approximately normal residual — that small biologic datasets rarely give you enough batches to verify; the unsolved part returns to how heavy tails and a handful of batches leave these limits under-determined.
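The two SPE limits can be put side by side in a few lines. This is a sketch, not the suite's code: the function names are illustrative, the input values are the rounded good-batch SPE figures from the mspc.py output printed later in this chapter, and the use of the sample standard deviation (ddof=1) is an assumption about how that script computes its limit.

```python
import numpy as np
from scipy import stats

def spe_limit_mean3sd(spe_good) -> float:
    """Transparent limit: mean + 3 sample standard deviations of the good-batch SPE."""
    spe_good = np.asarray(spe_good, dtype=float)
    return float(spe_good.mean() + 3.0 * spe_good.std(ddof=1))

def spe_limit_box(spe_good, alpha: float = 0.05):
    """Weighted chi-square limit g * chi2_{alpha; h}, moment-matched to the good-batch SPE."""
    spe_good = np.asarray(spe_good, dtype=float)
    m = spe_good.mean()              # mean of the good-batch SPE
    v = spe_good.var(ddof=1)         # variance of the good-batch SPE
    g = v / (2.0 * m)                # scale factor
    h = 2.0 * m ** 2 / v             # effective degrees of freedom (need not be an integer)
    return float(g * stats.chi2.ppf(1.0 - alpha, h)), float(g), float(h)

# Rounded good-batch SPE values as printed by mspc.py (five PASS batches).
spe_good = [2.76, 2.73, 0.56, 1.17, 2.42]

lim_3sd = spe_limit_mean3sd(spe_good)
lim_box, g, h = spe_limit_box(spe_good)
print(f"mean + 3 sd limit     : {lim_3sd:.2f}")
print(f"weighted chi2 limit   : {lim_box:.2f}  (g={g:.3f}, h={h:.2f})")
```

Because g·h equals the sample mean by construction, the weighted-χ² form matches the first two moments of the good-batch SPE. With five values either limit is an estimate from very little data, which is the caution the paragraph above already states.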

The two together are the heart of every commercial batch-monitoring suite. Sartorius SIMCA and SIMCA-online, and AspenTech ProMV, are productized PCA and PLS (Partial Least Squares — PCA's supervised cousin, which relates the attributes to a target outcome) monitoring with exactly these T² and SPE charts plus contribution plots; they are (production) tools used for continued process verification (CPV — the ongoing, lifelong monitoring that a validated process is required to maintain), golden-batch monitoring, and fault detection across commercial biopharma [1]. Amgen's Juncos site has publicly described SIMCA-based OPLS models running on commercial GMP harvest and in-process data (production, first-party/self-reported) [2]. What we build below is the open core those suites wrap.

Evidence

MSPC with PCA/PLS is the strongest (production) multivariate-monitoring case in biomanufacturing, and the evidence is solid: it rests on a decades-old, peer-reviewed methodological literature (Nomikos and MacGregor's multiway PCA for batch monitoring, 1994–1995) [3] and on two independently sold, widely deployed commercial platforms (Sartorius SIMCA, AspenTech ProMV) [1]. The caution is scale: a credible MSPC model is fit on tens to hundreds of historical batches, not the five we have here. Our example is the right method on a deliberately tiny dataset — a teaching model, not a validated monitor.

MSPC on the release panel: flagging BATCH-2026-004

The most concrete version of MSPC is also the one that catches our OOS batch. We read the real release panel from hplc_results.csv, pivot it to one row per batch and one column per attribute, drop the bioburden column (it is a constant zero across the campaign, so it carries no variance and would make standardization undefined — dividing by s_j = 0), and fit PCA on the five PASS siblings. Ten attributes remain (the eleventh, bioburden, is the one univariate SPC would still chart but PCA must discard for lack of variance — the same eleven-versus-ten panel viewed two ways): aggregation measured by SEC (size-exclusion chromatography, which separates molecules by size into monomer, HMW = high-molecular-weight aggregates, and LMW = low-molecular-weight fragments), charge heterogeneity by CEX (cation-exchange chromatography, which separates the main, acidic, and basic charge variants), HCP, residual Protein A (trace of the purification reagent leached from the capture column), host-cell DNA, and endotoxin. Then we score all six batches — the held-out OOS sibling included — against that good-batch model with two components.
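The reason the constant column has to go is visible in a few lines. The sketch below uses a toy panel with made-up numbers (these are not values from hplc_results.csv), and apart from HCP_ng_per_mg the column names are hypothetical.

```python
import numpy as np

# Toy panel, made-up values: 4 batches x 3 attributes. The last column is a constant zero.
attrs = ["monomer_pct", "HCP_ng_per_mg", "bioburden_cfu"]
X = np.array([
    [98.9, 41.0, 0.0],
    [99.1, 38.0, 0.0],
    [98.7, 45.0, 0.0],
    [99.0, 40.0, 0.0],
])

sd = X.std(axis=0, ddof=1)
keep = sd > 0                                  # a zero-variance column cannot be standardized
kept_attrs = [name for name, ok in zip(attrs, keep) if ok]
dropped = [name for name, ok in zip(attrs, keep) if not ok]

X_kept = X[:, keep]
Z = (X_kept - X_kept.mean(axis=0)) / sd[keep]  # safe: no division by zero remains

print("kept   :", kept_attrs)
print("dropped:", dropped)
```

Dropping the column from the PCA does not remove it from release: a constant-zero bioburden is still checked against its own specification, it simply has nothing to contribute to a model of joint variation.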

The result is unambiguous. BATCH-2026-004's SEC, CEX, residual Protein A, DNA, and endotoxin results are all individually in spec; only its HCP is out. A univariate chart on monomer purity would see nothing. MSPC sees the batch leave the plane entirely, because no good batch ever paired this HCP level with these otherwise-normal results — the offending point's projection onto the plane lands near the good-batch cluster (its T² is unremarkable), but the point itself sits far off the plane (its SPE explodes). There is a mechanism behind that off-plane behavior, not a coincidence: HCP is near-orthogonal to (uncorrelated with — it moves independently of) the SEC/CEX product-quality block by construction, because HCP is governed by downstream impurity clearance in capture and polishing (the purification stages after the bioreactor: capture is the first column that grabs the antibody, polishing the later columns that remove residual impurities), while monomer purity and charge variants reflect the product molecule itself. The two co-vary little batch-to-batch, so a clearance failure lands in a direction the product-quality attributes never spanned — off the good-batch plane for a mechanistic reason, not by statistical accident.

MSPC catches what univariate SPC cannot: every individual result is in spec, T² stays under its limit for every batch, but the SPE statistic — distance off the good-batch plane — flags BATCH-2026-004 alone, and the contribution plot points the investigator straight at HCP. Original diagram by the authors, created with AI assistance.

The implementation lives in examples/platform/ml/mspc.py. It is pure NumPy and SciPy over the committed dataset, so it runs with no services and CI asserts the OOS batch is the only one flagged. Run it from examples/platform/ml/ inside the suite's virtual environment — .venv/bin/python mspc.py (one-time setup: python3 -m venv .venv && .venv/bin/pip install -r requirements.txt, per that directory's README); the bare python mspc.py shorthand below assumes that environment is active. Two functions carry the math — the fit, and the pair of statistics:

# examples/platform/ml/mspc.py
import numpy as np

def fit_pca(X: np.ndarray, n_components: int = 2):
    """Mean-centre + unit-scale on the GOOD batches, then PCA by SVD."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    Z = (X - mu) / sd
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                       # loadings (k x a)
    eig = (S[:n_components] ** 2) / (len(X) - 1)  # variance per component
    return {"mu": mu, "sd": sd, "P": P, "eig": eig,
            "n_components": n_components, "n": len(X)}

def t2_spe(model: dict, X: np.ndarray):
    """Hotelling's T2 (in-plane) and SPE/Q (off-plane) for each row."""
    Z = (X - model["mu"]) / model["sd"]
    T = Z @ model["P"]                            # scores
    t2 = np.sum(T ** 2 / model["eig"], axis=1)    # in-model-plane distance
    Z_hat = T @ model["P"].T                      # reconstruction
    spe = np.sum((Z - Z_hat) ** 2, axis=1)        # off-plane residual
    return t2, spe

The T² limit comes straight from the F-distribution form above, in five lines that make the small-sample scaling explicit:

# examples/platform/ml/mspc.py
from scipy import stats

def t2_limit(model: dict, alpha: float = 0.05) -> float:
    """Small-sample Hotelling T2 limit via the F-distribution."""
    n, a = model["n"], model["n_components"]
    f = stats.f.ppf(1 - alpha, a, n - a)
    return a * (n - 1) * (n + 1) / (n * (n - a)) * f
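As a quick check on the formula, the sketch below evaluates it for the case used in this chapter (n = 5 good batches, a = 2 components) and for a hypothetical history of 5,000 batches, where it should approach the χ² limit. The standalone function t2_limit_f is illustrative and takes n and a directly rather than a model dictionary.

```python
from scipy import stats

def t2_limit_f(n: int, a: int, alpha: float = 0.05) -> float:
    """Hotelling T2 limit for a new observation, small-sample F form."""
    f = stats.f.ppf(1 - alpha, a, n - a)
    return a * (n - 1) * (n + 1) / (n * (n - a)) * f

small = t2_limit_f(n=5, a=2)            # the five-batch case in this chapter
large = t2_limit_f(n=5000, a=2)         # hypothetical large history
chi2_lim = stats.chi2.ppf(0.95, 2)      # large-sample chi-square limit

print(f"n=5    : {small:.2f}")
print(f"n=5000 : {large:.2f}")
print(f"chi2   : {chi2_lim:.2f}")
```

The five-batch value reproduces the 30.57 that mspc.py reports, and it is about five times the large-sample limit. That gap is the price of estimating the score covariance from so few batches.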

The script fits on the five PASS batches, scores all six, and decomposes the flagged batch's SPE back onto the original attributes — the contribution analysis that tells an investigator which variable broke the pattern. The decomposition is the cleanest part of the whole method: because SPE = Σ_j e_j², each attribute's share of the residual is just e_j² / Σ e_j², so naming the culprit is one argmax over the per-variable squared residuals:

# examples/platform/ml/mspc.py
contrib = (Z[oos_i] - Z_hat[oos_i]) ** 2          # per-attribute squared residual e_j^2 (its SPE contribution)
top = ATTRS[int(np.argmax(contrib))]              # the variable that broke the pattern

Running python mspc.py prints exactly this:

MSPC on the release panel (PCA fit on 5 PASS batches, 2 components):
  T2 limit (alpha=0.05) = 30.57   SPE limit (good-batch mean+3sd) = 4.95
  BATCH-2026-001: T2=  0.95  SPE=    2.76  release=PASS
  BATCH-2026-002: T2=  0.85  SPE=    2.73  release=PASS
  BATCH-2026-003: T2=  2.73  SPE=    0.56  release=PASS
  BATCH-2026-004: T2=  8.36  SPE=  356.59  release=OOS  <-- FLAGGED
  BATCH-2026-005: T2=  2.18  SPE=    1.17  release=PASS
  BATCH-2026-006: T2=  1.29  SPE=    2.42  release=PASS

SPE contribution for BATCH-2026-004: top driver = HCP_ng_per_mg (83% of the residual)
ASSERT ok: MSPC flags only BATCH-2026-004, and SPE points at HCP.

Read it the way a quality engineer would. Every batch's T² is well under its limit of 30.57 — even BATCH-2026-004, at 8.36, is recognizable in-plane, because its purity and charge results look normal and the model can still place it near the cluster. But its SPE of 356.59 is roughly two orders of magnitude (about a hundredfold) above the next-largest good-batch SPE (2.76) and far past the 4.95 limit: its residual has exploded. That gap — T² quiet, SPE screaming — is the textbook signature of a new failure mode, a novel direction the model discarded as noise and PCA was never trained to span. The contribution analysis then attributes 83% of the residual to HCP_ng_per_mg — exactly the attribute that is OOS — with the remaining 17% smeared thinly across the other nine, which is the residue you expect when a single variable dominates. MSPC did not just say "this batch is bad"; it said "this batch is bad in a way no good batch ever was, and here is where to look." That diagnostic — not the binary flag — is the value an investigator acts on, and it is precisely what a univariate panel of eleven separate charts cannot hand you, because each chart has, by construction, thrown away the cross-attribute structure that makes the residual interpretable.
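The same signature can be reproduced without the book's dataset. The sketch below is self-contained and purely synthetic: five correlated "product-quality" attributes driven by two latent factors, plus one independent "impurity" attribute standing in for HCP. The cohort size of 200, the 8-standard-deviation shift, and every name in it are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, a = 200, 2

# Synthetic good-batch family: columns 0-4 share two latent drivers, column 5 is independent.
latent = rng.normal(size=(n, 2))
W = np.array([[1.0, 0.8, 0.6, 0.0, 0.0],
              [0.0, 0.0, 0.5, 1.0, 0.9]])
quality_block = latent @ W + 0.2 * rng.normal(size=(n, 5))
impurity = rng.normal(size=(n, 1))
X_good = np.hstack([quality_block, impurity])

# Fit: standardize on the good batches, then PCA by SVD.
mu, sd = X_good.mean(axis=0), X_good.std(axis=0, ddof=1)
Z = (X_good - mu) / sd
_, S, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:a].T
eig = S[:a] ** 2 / (n - 1)

def score(x: np.ndarray):
    """Return T2, SPE, and per-attribute squared residuals for one batch."""
    z = (x - mu) / sd
    t = z @ P
    e = z - t @ P.T
    return float(np.sum(t ** 2 / eig)), float(np.sum(e ** 2)), e ** 2

# Limits learned from the good family only.
t2_lim = a * (n - 1) * (n + 1) / (n * (n - a)) * stats.f.ppf(0.95, a, n - a)
spe_good = np.array([score(x)[1] for x in X_good])
spe_lim = spe_good.mean() + 3.0 * spe_good.std(ddof=1)

# New batch: average in every attribute except the impurity, pushed 8 sd high.
x_new = mu.copy()
x_new[5] += 8.0 * sd[5]
t2, spe, contrib = score(x_new)
share = contrib / contrib.sum()
top = int(np.argmax(share))

print(f"T2  = {t2:6.2f}  (limit {t2_lim:.2f})")
print(f"SPE = {spe:6.2f}  (limit {spe_lim:.2f})")
print(f"top SPE contributor: attribute {top} ({share[top]:.0%} of the residual)")
```

Because the impurity column moves independently of the two latent drivers, the kept components barely load on it, so its excursion lands almost entirely in the residual. That is the synthetic analogue of HCP sitting outside the SEC/CEX block.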

The trajectory view: golden batch and multiway PCA

The release panel is the endpoint view — one row of finished results per batch. But a batch is also a trajectory: the production bioreactor ran for fourteen days, every tag (a "tag" is one continuously logged sensor channel — temperature, pH, dissolved oxygen, and so on) moving together over time. MSPC on the endpoint catches a batch that finished out of family; MSPC on the trajectory catches a batch that is leaving the family while it is still running, hours before it fails.

The method is multiway PCA (MPCA), the foundational batch-monitoring trick from Nomikos and MacGregor [3]. A batch dataset is three-dimensional: a cube X of shape I × J × K — I batches × J variables × K time points. Ordinary PCA expects a flat matrix, so you unfold the cube. The standard batch-wise unfold slices it so each batch becomes one long row, with every variable-at-every-timepoint laid side by side as columns — the cube becomes an I × (J·K) matrix. A fourteen-day batch of seven tags at hourly cadence (K = 14 × 24 = 336) unfolds into one row of 7 × 336 = 2,352 columns. Then the loop is mechanical:

Unfold the good historical batches into the I × (J·K) batch-wise matrix (one row per completed batch).
Fit PCA on that matrix — the model now encodes "the shape of a normal trajectory," variable correlations and their evolution in time, not just a normal endpoint.
Score a new batch against the model to get T² and SPE at each point in batch time, so the deviation is time-resolved rather than a single end-of-run number.
Diagnose any excursion with the contribution plot, decomposing the SPE at the alarming timepoint to see which variable, at which phase, drove it.
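The unfold in the first step is a single transpose and reshape. The sketch below uses toy sizes and assumes a time-major column order (all J variables at the first time point, then all J at the second, and so on), which is the ordering implied by the column index J·(t−1)+j used later in this section; other orderings are equally valid as long as they are applied consistently.

```python
import numpy as np

I, J, K = 3, 2, 4                                   # toy sizes: batches, variables, time points
cube = np.arange(I * J * K, dtype=float).reshape(I, J, K)   # cube[i, j, k]

# Batch-wise unfold: one row per batch, columns ordered time-major.
unfolded = cube.transpose(0, 2, 1).reshape(I, K * J)

def column_of(j: int, k: int, n_vars: int = J) -> int:
    """0-based column that holds variable j at time point k."""
    return n_vars * k + j

print(unfolded.shape)                               # (3, 8)
i, j, k = 1, 1, 2
print(unfolded[i, column_of(j, k)] == cube[i, j, k])  # True
```

In 1-based notation the same mapping reads column J·(t−1)+j for variable j at time point t.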

A subtlety the formula hides: a batch in progress has no future. To score it at hour h you must fill the J·(K − h) unmeasured columns (the K − h future time points of each of the J variables), and the classic Nomikos–MacGregor device is to assume the current deviation persists (or to use a more elaborate missing-data projection). This is why real-time MPCA limits are wider early in the run and tighten as the batch fills in — the honest uncertainty of judging a trajectory before it has finished.
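A minimal sketch of the "current deviation persists" fill follows. It is a simplification: the function name and toy trajectory are illustrative, and it works in whatever units it is given, whereas in practice the fill is usually applied to the mean-centred, scaled variables.

```python
import numpy as np

def fill_future(partial: np.ndarray, mean_traj: np.ndarray) -> np.ndarray:
    """Complete a running batch by assuming its current deviation persists.

    partial   : (h, J) observations so far, with 1 <= h <= K
    mean_traj : (K, J) mean trajectory of the good batches
    returns   : (K, J) observed rows followed by mean + current deviation
    """
    partial = np.asarray(partial, dtype=float)
    mean_traj = np.asarray(mean_traj, dtype=float)
    h, K = partial.shape[0], mean_traj.shape[0]
    if not 1 <= h <= K:
        raise ValueError("need at least one observed time point and at most K")
    full = mean_traj.copy()
    full[:h] = partial                                  # keep what was measured
    deviation_now = partial[-1] - mean_traj[h - 1]      # per-variable deviation at the latest point
    full[h:] = mean_traj[h:] + deviation_now            # carry it forward to the end of the run
    return full

# Toy run: K = 6 time points, J = 2 variables, batch observed for 3 time points.
mean_traj = np.arange(12, dtype=float).reshape(6, 2)
partial = mean_traj[:3] + np.array([0.5, -1.0])
full = fill_future(partial, mean_traj)
print(full[3:] - mean_traj[3:])    # each future row carries the deviation [0.5, -1.0]
```

The completed array can then be unfolded and scored like a finished batch. The fewer rows that were actually measured, the more of the score rests on this assumption.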

It is worth naming the canonical method precisely, because it is the spine of every commercial trajectory monitor. Nomikos and MacGregor's multiway PCA (MPCA) and its supervised sibling multiway PLS (MPLS) are the batch-process monitoring methods: both unfold the (time × variable) trajectory and grade it against a golden-batch envelope learned from good runs, MPCA on the process tags alone and MPLS regressing those unfolded tags onto a final quality outcome. The as-the-batch-progresses form — sometimes called evolving MPCA — is the one that runs during the batch rather than after it, and it is exactly here that the missing-future-data problem above must be handled: at hour h the model is scoring a row that is mostly empty, so the limits widen and the deviation estimate carries the uncertainty of a trajectory not yet finished. When we say "trajectory MSPC," "golden-batch monitoring," or what batch_mvda.py implements, this Nomikos–MacGregor MPCA/MPLS family is the method being named.

This is the golden-batch idea generalized: instead of a mean ± 3σ envelope on a single tag (which Book 3 builds in code), the model holds the joint envelope of all tags and their correlations across batch time. The day-7 temperature excursion our simulator seeds shows up as a localized SPE spike at that timepoint, and the contribution plot at that moment points at the temperature tag — not at the endpoint, where the assay does not yet exist. The genuine pay-off is early warning: an abnormal batch is caught as a trajectory, before any endpoint assay exists, which is the whole reason commercial suites ship multiway PCA rather than only endpoint monitoring.

The cost is data and alignment. A trajectory model needs many complete, aligned good batches, and batches are rarely the same length or perfectly time-aligned. Variable batch length and phase shifts mean column J·(t−1)+5 of one batch is not the same process moment as in another, so a warping or landmark-alignment step must precede the unfold, often against an indicator variable such as cumulative feed or integral viable-cell density (IVCD) rather than wall-clock time. (Viable-cell density is the count of living cells per unit volume, and its integral over time is the running total of cell-hours the culture has accumulated.) IVCD is the natural landmark for a 14-day CHO fed-batch: two batches at the same IVCD are at the same metabolic age regardless of how many calendar days each took to get there, and IVCD already appears as a feature elsewhere in the suite. That alignment burden, and the dozens-of-batches minimum, are why endpoint MSPC (which we can run on six rows) is far more common in early campaigns than full multiway trajectory monitoring (which wants a mature process history). Both are the same statistics; they differ only in what they monitor and how much history they demand.
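Indicator-variable alignment can be sketched with plain interpolation. Everything below is synthetic and illustrative: two batches share the same tag-versus-IVCD relationship but accumulate IVCD at different paces, the IVCD scale of 0 to 100 is arbitrary, and the helper name is hypothetical. It is a simpler alternative to dynamic time warping, not a substitute for it.

```python
import numpy as np

def align_to_indicator(indicator, tag, grid):
    """Resample one tag onto a shared indicator grid by linear interpolation."""
    indicator = np.asarray(indicator, dtype=float)
    if np.any(np.diff(indicator) <= 0):
        raise ValueError("indicator variable must be strictly increasing")
    return np.interp(grid, indicator, np.asarray(tag, dtype=float))

# Two synthetic batches sampled hourly: a 14-day run and a slower 16-day run.
t_a = np.arange(0, 14 * 24 + 1)
t_b = np.arange(0, 16 * 24 + 1)
ivcd_a = 100.0 * (t_a / t_a[-1])            # steady accumulation
ivcd_b = 100.0 * (t_b / t_b[-1]) ** 2       # slow start, then catches up

# The tag depends on metabolic age (IVCD), not on the clock.
tag_a = np.log1p(ivcd_a)
tag_b = np.log1p(ivcd_b)

grid = np.linspace(0.0, 100.0, 50)          # shared IVCD landmarks
aligned_a = align_to_indicator(ivcd_a, tag_a, grid)
aligned_b = align_to_indicator(ivcd_b, tag_b, grid)

same_hour_gap = abs(tag_a[168] - tag_b[168])            # compared at the same wall-clock hour
aligned_gap = float(np.max(np.abs(aligned_a - aligned_b)))
print(f"gap at the same hour : {same_hour_gap:.2f}")
print(f"max gap after align  : {aligned_gap:.3f}")
```

Compared at the same hour the two batches look different; compared at the same IVCD they coincide up to interpolation error. After this step every batch has the same number of columns, which is what the unfold requires.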

The suite now makes this whole story runnable in examples/platform/ml/batch_mvda.py. It draws a cohort of batch growth-curve trajectories, DTW-aligns them (dynamic time warping — it stretches and compresses each batch's time axis so the same process landmarks line up across batches before they are compared, the warping step the prose argued for), batch-wise unfolds the batch × variable × time cube into one long row per batch, fits multiway PCA on the good-batch family, and flags the stressed (starved) batches by SPE with the same contribution analysis the endpoint model uses. Running it prints:

trajectory MSPC: DTW-aligned, batch-wise unfolded (14 batches x 4 vars x 15 days = 60 columns), MPCA on 12 PASS batches
  SPE limit (good-batch mean+3sd) = 48.50
  batch 12: SPE=  356.95  label=STRESS  <-- FLAGGED
  batch 13: SPE=  375.47  label=STRESS  <-- FLAGGED
  worst batch 13 SPE contribution: top driver = glc at day 3 (19% of the residual)
ASSERT ok: DTW+unfold+MPCA flags the stressed batches as trajectory outliers, and SPE attributes the deviation to a process variable over time.

Read it the way the prose predicted. The two starved batches land far outside the good-batch family on the trajectory model — SPE 356.95 and 375.47 against a 48.50 limit — and the contribution analysis points at glucose (glc in the output above) around day 3, the starvation signature, exactly the early-trajectory warning the section argued for, now demonstrated rather than asserted. This is the trajectory complement to the endpoint MSPC of mspc.py: where the endpoint model catches the finished-batch HCP excursion after the run, trajectory-MSPC catches the deviation as it unfolds, while the batch is still in the bioreactor and there is still time to act.

Real-time release testing: the honest scarcity

If a model can predict a CQA accurately enough, the tantalizing prospect is real-time release testing (RTRT): replace an end-product laboratory test with an in-process measurement (or a model over in-process measurements) so the result is available at the moment manufacturing finishes, not days later. RTRT is a recognized regulatory pathway — ICH Q8(R2) defines it as evaluating and ensuring product quality "based on process data, which typically include a valid combination of measured material attributes and process controls" — and it is the destination the whole PAT (process analytical technology — measuring quality in-line during the process rather than only in the lab afterward) and Quality-by-Design (QbD — building quality into the process by design rather than testing it in at the end) program points at [4]. (ICH Q8(R2) is one of the international regulatory guidelines that govern pharmaceutical development.) The point is not to weaken the specification; the model-plus-process-control must demonstrate equivalence to the test it replaces, validated as a permanent part of the control strategy.

For small molecules, RTRT is real and shipping. The clearest example is Janssen's Prezista (darunavir) continuous-manufacturing line, where NIR-based RTRT models replace conventional end-product testing for attributes like content uniformity (production) — the first FDA approval converting a marketed product from batch to continuous manufacturing, which collapsed the release timeline from roughly two weeks to about one day [5]. Continuous oral-solid-dose manufacturing makes RTRT natural: the CQAs (blend uniformity, dissolution, assay) map cleanly onto NIR spectra (near-infrared spectroscopy — a fast optical measurement whose curve, a spectrum, fingerprints the sample's chemical makeup) through a chemometric PLS calibration (chemometrics is the practice of extracting chemical quantities from such spectra with statistical models like PLS), the process is fast and at steady state, and the chemistry is small, well-defined, and stable.

For biologics, RTRT is genuinely scarce, and it is worth being honest about why. A monoclonal antibody's release panel is not one number — it is the eleven-attribute battery in hplc_results.csv: aggregation by SEC, charge heterogeneity by CEX, host-cell protein, residual Protein A, host-cell DNA, endotoxin, bioburden. Several of these are safety attributes (HCP, DNA, endotoxin, sterility) with no fast in-line surrogate, and the molecule's micro-heterogeneity (the many slightly-different molecular forms a biologic exists as — glycosylation is the variable sugar chains attached to the antibody, charge variants are charge-shifted forms, fragmentation is broken pieces) is exactly what makes biologics biologics — it does not collapse into a single spectral reading the way a small molecule's content does. A soft sensor (a model that infers a hard-to-measure value from easy in-line signals instead of measuring it directly) can predict titer (the concentration of product the culture has made) in real time with high accuracy (the Raman PLS model reaches R² near 0.99 on our data), but titer is a quantity, not a quality attribute; predicting aggregation or HCP from an in-line probe with the accuracy a release decision demands is a different and far harder problem, because the very low-concentration impurities that matter most for safety sit near or below the detection floor of a fast in-line measurement. The sterility test in particular has no real-time substitute that a regulator will accept for biologic release today — a rapid-microbiological-method may shorten it, but it does not eliminate the incubation a release still rests on.

The economic prize is large and is sometimes quoted aggressively. Ferring has estimated that RTRT could cut its cost of goods by on the order of 25% — but that is a development-stage internal estimate for a specific continuous-manufacturing prototype program, not an industry-established figure, and it should be read as illustrative (development-stage estimate, illustrative) [6]. The honest state of the art for biologics is partial RTRT — using in-process models to release some attributes faster (or to support a parametric release of an attribute with a defensible surrogate) while the safety panel still runs conventionally — and even that is rare and hard-won. The genuinely emerging footholds are at the edges of the panel: endotoxin by a validated recombinant Factor C or kinetic LAL assay, and bioburden by rapid microbiological methods (growth-based or ATP-bioluminescence), can both compress to hours rather than days. But the protein-impurity attributes — HCP above all — remain the hard wall, with no fast in-line surrogate at release-grade accuracy. The frontier is not "no lab test"; it is "fewer, faster, model-supported tests for the attributes that have a defensible in-line surrogate."

Predicting the OOS before the assay runs

MSPC catches a batch that is already out of family. The earlier, harder question is: from the in-process summary a batch has accrued by the end of culture — before the slow release panel runs — can we predict whether it will pass or go OOS? This is release prediction in its honest, advisory form. The point is not to skip the assay; it is to flag a likely failure early enough to investigate, segregate, or schedule a re-test, and to do it with a calibrated, interpretable model whose decision threshold the quality unit controls.

There is a data problem that the chapter confronts head-on: the shipped campaign has exactly one OOS batch. You cannot train or honestly evaluate a classifier on a single positive — a single point gives no variance, no cross-validation fold can hold it out and still see a positive in training, and any metric is undefined or degenerate. So examples/platform/ml/release_predict.py draws a cohort of 120 batches from the mechanistic fed-batch model (model_fedbatch.py), perturbing the biology per run (growth rate, specific productivity, feed rates, lactate inhibition), with two seeded failure mechanisms chosen to make the lesson faithful rather than flattering:

A stress pathway — underfeeding drives high lactate and ammonia and low viability, and cell lysis raises HCP. This failure does leave an in-process metabolic trace, so a model can learn it from the culture summary alone.
A contamination pathway — a sharp HCP add-on. Real microbial contamination also stresses the culture (lower viability, higher lactate), so it leaves a weak metabolic footprint — much weaker than the stress pathway. This is deliberate: the OOS you most need to catch is often the one with the faintest upstream signal. The point is feature blind spots, not that contamination is undetectable — in a real plant microbial contamination is caught first by in-process bioburden and sterility testing and by abrupt pH, dissolved-oxygen, or oxygen-uptake-rate excursions, none of which appear in the end-of-culture metabolic summary this model is deliberately restricted to.

A batch is labeled OOS by the same rule the real assay applies — HCP > 100 ng/mg or monomer < 95% — so the simulated label is mechanistically tied to the in-process state, not pasted on. The classifier is a StandardScaler (which rescales every feature to a common spread first) plus an L2-regularized LogisticRegression with balanced class weights — logistic regression is a simple model that outputs a probability of the positive class; "L2-regularized" means a penalty (whose strength is the knob C) discourages over-large coefficients to curb overfitting, and "balanced class weights" up-weight the rare failures so the model does not ignore them. It is chosen for calibration and interpretability, not raw power, because a release-adjacent model must justify itself: its log-odds are linear in the standardized features, so each coefficient reads directly as "how strongly does this signal push toward OOS." The positive class is failure, and the evaluation is nested cross-validation. Plain cross-validation splits the data into several equal parts ("folds"), trains on all but one and scores on the held-out fold, rotating through so every batch is scored once on data it did not train on. Nested CV does this twice over: the L2 strength C is tuned in an inner loop on each training split, and the outer fold — which never saw C selected — scores the model, so the reported number is not optimistically biased by the tuning. The same nested loop produces the held-out probabilities the threshold and calibration analyses below depend on:

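# examples/platform/ml/release_predict.py
# Imports this excerpt needs. SEED is the module-level random seed, defined elsewhere in the file.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
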
FEATURES = [
    "final_titer_g_L", "peak_VCD_e6_per_mL", "peak_lactate_g_L",
    "final_lactate_g_L", "final_ammonia_mM", "end_viability_pct",
    "integral_VCD",
]

def evaluate(df: pd.DataFrame, seed: int = SEED) -> dict:
    """Nested CV over independent runs: tune C inside, score on the outer fold."""
    X = df[FEATURES].to_numpy(float)
    y = (df["release"] == "OOS").astype(int).to_numpy()   # positive = failure
    pipe = make_pipeline(StandardScaler(),
                         LogisticRegression(max_iter=2000, class_weight="balanced"))
    grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0]}
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + 1)

    # optimistic: tune AND read off the same inner folds
    gs = GridSearchCV(pipe, grid, scoring="roc_auc", cv=inner).fit(X, y)
    auroc_naive = float(gs.best_score_)

    # honest: outer folds never participate in selecting C
    proba = cross_val_predict(
        GridSearchCV(pipe, grid, scoring="roc_auc", cv=inner),
        X, y, cv=outer, method="predict_proba")[:, 1]
    return {"y": y, "proba": proba,
            "auroc": round(float(roc_auc_score(y, proba)), 3),
            "auprc": round(float(average_precision_score(y, proba)), 3),
            "auroc_naive": round(auroc_naive, 3),
            "optimism": round(auroc_naive - float(roc_auc_score(y, proba)), 3),
            "prevalence": round(float(y.mean()), 3)}

The threshold sweep is its own small function, because the whole pedagogical point is that the threshold — not the fitted model — is the decision. It just counts the confusion matrix at a probability cutoff, with the missed OOS (a false negative) called out as the costly error:

# examples/platform/ml/release_predict.py
def confusion_at(y, proba, thr: float) -> dict:
    pred = (proba >= thr).astype(int)
    fn = int(((pred == 0) & (y == 1)).sum())   # a MISSED OOS -- the costly error
    fp = int(((pred == 1) & (y == 0)).sum())   # a false alarm -- an extra investigation
    ...
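
The listing above elides the rest of the function, and this section does not show it. As a separate, self-contained illustration of what counting one operating point involves, here is a minimal sketch. The function name `operating_point` and the toy arrays are hypothetical and are not taken from release_predict.py.

```python
import numpy as np

def operating_point(y, proba, thr):
    """Count the confusion matrix at one probability cutoff (positive = OOS)."""
    y = np.asarray(y, dtype=int)
    pred = (np.asarray(proba, dtype=float) >= thr).astype(int)
    tp = int(((pred == 1) & (y == 1)).sum())   # OOS batches correctly flagged
    fn = int(((pred == 0) & (y == 1)).sum())   # missed OOS
    fp = int(((pred == 1) & (y == 0)).sum())   # false alarms
    tn = int(((pred == 0) & (y == 0)).sum())   # good batches left alone
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"thr": thr, "tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "recall": recall, "precision": precision}

# Toy labels and probabilities (illustrative only, not the simulated cohort).
y_toy = np.array([1, 1, 0, 0, 0, 0])
p_toy = np.array([0.9, 0.2, 0.6, 0.35, 0.1, 0.05])
for thr in (0.50, 0.30):
    print(operating_point(y_toy, p_toy, thr))
```

On the toy data the lower cutoff adds a false alarm without catching the second positive. The same arithmetic applies to any set of counts: 8 true positives with 6 false alarms gives precision 8 / 14 = 0.571.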

Running python release_predict.py prints (all numbers illustrative, from the simulated cohort). Every metric in the block below — AUROC, AUPRC, the 95% CI, the no-skill baseline, recall, and precision — is unpacked in the two subsections that follow; skim the numbers now, then read them properly below:

release-prediction cohort: 120 simulated batches, 10 OOS (8%)  # illustrative

logistic release predictor (NESTED 5x5 CV; C tuned in the inner loop):
  AUROC = 0.923 (95% CI (0.781, 1.0))   AUPRC = 0.805 (95% CI (0.521, 1.0))   (prevalence 0.083 = the no-skill AUPRC baseline)   # illustrative
  naive (tuned-and-read-off the same folds) AUROC = 0.968  -> optimism removed by nesting = +0.045  (selected C=100.0)
  calibration of the probabilities: Brier = 0.0524  ECE = 0.0769  (lower is better-calibrated)

  operating points (positive = predicted OOS):
    thr=0.50: recall=0.8 precision=0.571 | missed OOS (FN)=2  false alarms (FP)=6
    thr=0.30: recall=0.8 precision=0.444 | missed OOS (FN)=2  false alarms (FP)=10

  standardized logistic coefficients (log-odds of OOS):
    end_viability_pct        -2.85
    peak_lactate_g_L         +2.74
    final_ammonia_mM         -0.89
    integral_VCD             -0.89
    final_lactate_g_L        -0.80
    peak_VCD_e6_per_mL       +0.67
    final_titer_g_L          -0.58

ASSERT ok: in-process features predict the release outcome (illustrative).

Three lessons matter more than the headline AUROC of 0.923. First, AUPRC, not AUROC, is the honest metric on a rare event. Two quantities recur below: recall is the share of true OOS batches the model actually flags (catch rate), and precision is the share of its flags that are real OOS (how trustworthy an alarm is). AUROC and AUPRC each summarize a whole curve into a single area-under-the-curve number between 0 and 1, higher being better: AUROC is the area under recall plotted against false-positive rate, AUPRC the area under precision plotted against recall. With OOS prevalence (the fraction of batches that are actually OOS) near 8%, a no-skill model, one with no predictive ability, scores 0.083 on AUPRC (the prevalence itself is the baseline). Our 0.805 is genuinely better than chance, but the gap between an impressive-looking AUROC of 0.923 and a soberer AUPRC of 0.805 is exactly the flattery the ROC curve provides on imbalanced data: ROC's x-axis (false-positive rate) is forgiving when negatives vastly outnumber positives, whereas the precision-recall curve confronts the model with how many of its alarms are false.

Second, and most important, the decision is a threshold, not the model. The two operating points show the real tradeoff a quality unit owns, and they show it in a way the small positive count makes unusually stark. Both thresholds catch the same failures: recall stays 0.8 (two of the ten OOS still missed) whether the cutoff is 0.50 or 0.30; dropping the threshold only widens the net for false alarms, from 6 to 10, while precision slips from 0.571 to 0.444. Recall can only move in 0.1 steps here because each missed OOS is a tenth of the tiny positive class, and between these two cutoffs it does not move at all. The quantization is itself the small-data lesson: with so few positives, no smooth recall-precision tradeoff exists to tune. The cost asymmetry is what justifies erring low anyway. A missed OOS is a false release of a potentially unsafe lot, while a false alarm is only an extra investigation, so a quality unit will deliberately run a conservative threshold, accepting the extra false alarms to push missed failures toward zero.

Third, where that threshold sits is a risk-tolerance decision, made by humans under change control, and it is the single most consequential number in the whole system. The model's accuracy is the easy part; the threshold is where the responsibility lives.
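
One way to make the cost asymmetry concrete is to price the two errors and sweep the cutoff. The sketch below is an illustration, not part of release_predict.py: the function `pick_threshold`, the 50:1 cost ratio, and the toy probabilities are all assumptions chosen to show the mechanism.

```python
import numpy as np

def pick_threshold(y, proba, cost_fn, cost_fp, grid=None):
    """Return the cutoff that minimizes cost_fn * FN + cost_fp * FP.

    Ties go to the lowest (most conservative) cutoff, because the grid is
    scanned in ascending order and only a strictly lower cost replaces it.
    """
    y = np.asarray(y, dtype=int)
    proba = np.asarray(proba, dtype=float)
    grid = np.linspace(0.05, 0.95, 19) if grid is None else np.asarray(grid, dtype=float)
    best = None
    for thr in grid:
        pred = proba >= thr
        fn = int((~pred & (y == 1)).sum())   # missed OOS
        fp = int((pred & (y == 0)).sum())    # false alarms
        cost = cost_fn * fn + cost_fp * fp
        if best is None or cost < best["cost"]:
            best = {"thr": float(thr), "fn": fn, "fp": fp, "cost": cost}
    return best

# Toy cohort: 3 OOS, 7 PASS (illustrative only).
y_toy = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
p_toy = np.array([0.91, 0.63, 0.27, 0.57, 0.42, 0.31, 0.22, 0.12, 0.11, 0.04])

symmetric = pick_threshold(y_toy, p_toy, cost_fn=1, cost_fp=1)
asymmetric = pick_threshold(y_toy, p_toy, cost_fn=50, cost_fp=1)  # hypothetical ratio
print("equal costs :", symmetric)
print("50:1 costs  :", asymmetric)
```

With equal costs the sweep settles near 0.60 and accepts one missed OOS. Pricing a miss at fifty false alarms pulls the cutoff down to about 0.25, trading three extra investigations for zero misses. The ratio itself is the quality unit's call, not something the data can supply.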

The coefficients close the honesty loop, and because the features are standardized they are directly comparable in magnitude. Low end_viability_pct (−2.85) and high peak_lactate_g_L (+2.74) are by far the strongest predictors of OOS, exactly the stress-pathway signature, which is why this model catches stress-driven failures well; the remaining coefficients are about a third their size or less (the next largest magnitude is 0.89). It catches contamination-driven failures poorly, because by design that mechanism leaves only a faint metabolic trace that those same features barely register. A release predictor is real, but it is strongest on the failures that announce themselves upstream and weakest on the ones that do not, and the ones that do not are often the ones you most need to catch.
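
Why standardization makes the coefficients comparable can be seen on toy data. The sketch below is not the chapter's cohort: the three features borrow names from FEATURES, but their values, the generating equation, and the sample size are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
names = ["end_viability_pct", "peak_lactate_g_L", "final_titer_g_L"]

# Toy features on very different raw scales.
viability = rng.normal(90.0, 4.0, n)
lactate = rng.normal(2.0, 0.5, n)
titer = rng.normal(5.0, 0.5, n)            # unrelated to the label by construction

# Assumed generating rule: low viability and high lactate raise the odds of OOS.
logit = -2.0 - 1.5 * (viability - 90.0) / 4.0 + 1.5 * (lactate - 2.0) / 0.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
X = np.column_stack([viability, lactate, titer])

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(max_iter=2000, class_weight="balanced"))
pipe.fit(X, y)

# Coefficients are per standard deviation of each feature, so magnitudes compare.
coef = pipe.named_steps["logisticregression"].coef_[0]
ranked = sorted(zip(names, coef), key=lambda nc: -abs(nc[1]))
for name, c in ranked:
    print(f"{name:<20s} {c:+.2f}")
```

Because the scaler sits inside the pipeline, each coefficient is the change in log-odds per one standard deviation of its feature, which is what allows a percent and a g/L to be ranked on the same list.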

Why the AUROC is honest: nested CV and calibration

The output block above already carries the numbers that make it trustworthy (naive 0.968, honest nested 0.923, and the calibration pair), but they deserve unpacking, because they check the same model for two failures a release-adjacent model must not hide. The first is selection bias from tuning. The logistic model has a hyperparameter, the L2 strength C, and if you choose it by cross-validation and then report the same cross-validated score, that score is optimistically biased: the folds that picked C have already seen the data they are graded on. That optimistic number is the naive 0.968. Nested cross-validation removes it by putting the tuning in an inner loop on each training split, so the outer fold scores a model whose C was chosen without ever touching it: the honest 0.923. Nesting removes roughly 0.045 of selection optimism here; read that gap as an estimate of the optimism rather than a clean decomposition, because the two AUROCs are also computed slightly differently (the naive one is the mean of the inner folds' scores, the honest one a single score over pooled out-of-fold predictions), so a little of the difference is that bookkeeping rather than tuning bias. The bootstrap 95% confidence interval (0.781 to 1.0 on AUROC, 0.521 to 1.0 on AUPRC) is wide, and that width is not a defect of the method but an honest report of evidence: with prevalence near 8% the whole cohort holds only ten OOS batches, each scored out of sample, so the interval reaches up to 1.0 and a point estimate alone would overstate what we know.
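
The bootstrap interval can be sketched in a few lines. This section does not show how release_predict.py computes its interval, so the version below is an assumption: a plain percentile bootstrap over batches, with a hypothetical helper `bootstrap_ci` and toy scores that only mimic the cohort's 10-in-120 imbalance.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y, proba, metric=roc_auc_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a ranking metric on held-out predictions."""
    y = np.asarray(y)
    proba = np.asarray(proba)
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))      # resample batches with replacement
        if y[idx].min() == y[idx].max():
            continue                                # one class only: metric undefined
        stats.append(metric(y[idx], proba[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Toy scores: 10 positives among 120, positives shifted upward (illustrative only).
rng = np.random.default_rng(1)
y_toy = np.array([1] * 10 + [0] * 110)
score_toy = np.concatenate([rng.normal(2.0, 1.0, 10), rng.normal(0.0, 1.0, 110)])

point = roc_auc_score(y_toy, score_toy)
lo, hi = bootstrap_ci(y_toy, score_toy)
print(f"AUROC = {point:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

Passing `average_precision_score` as `metric` gives the matching AUPRC interval. With only ten positives each resample redraws a handful of failures, which is why the interval stays wide however many negatives there are.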

The second problem is that AUROC only ranks: it tells you whether OOS batches tend to score higher than PASS batches, but a quality unit does not read a rank, it reads a probability and sets a threshold against it. Calibration asks the orthogonal question — when the model says 0.7, does the batch actually fail about 70% of the time? The Brier score (mean squared error of the probabilities) and ECE (expected calibration error, the average gap between predicted and observed frequency across probability bins) measure exactly that; both are lower-is-better. A model can have a high AUROC and badly miscalibrated probabilities, which would make any fixed release threshold mean something different from what the quality unit intended. Here the calibration is the reassuring part: a Brier of 0.0524 and an ECE of 0.0769 mean the predicted probabilities are close enough to observed frequencies that a quality unit can set a release threshold on the probability itself and have it mean roughly what it says — which is exactly the property the threshold discussion above depends on.
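
Both calibration measures are short enough to write out. The sketch below follows the definitions given above; the choice of ten equal-width bins for ECE is an assumption, since binning conventions vary and the module's choice is not shown here.

```python
import numpy as np

def brier(y, proba):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    y = np.asarray(y, dtype=float)
    proba = np.asarray(proba, dtype=float)
    return float(np.mean((proba - y) ** 2))

def ece(y, proba, n_bins=10):
    """Expected calibration error over equal-width probability bins."""
    y = np.asarray(y, dtype=float)
    proba = np.asarray(proba, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    which = np.digitize(proba, edges[1:-1])         # bin index 0 .. n_bins - 1
    total = 0.0
    for b in range(n_bins):
        mask = which == b
        if mask.any():
            gap = abs(proba[mask].mean() - y[mask].mean())
            total += mask.mean() * gap              # weight by share of batches in bin
    return float(total)

# Toy check: every prediction is off by 0.15 from the observed frequency.
y_toy = np.array([0, 0, 1, 1])
p_toy = np.array([0.15, 0.15, 0.85, 0.85])
print(brier(y_toy, p_toy), ece(y_toy, p_toy))
```

On the toy input the Brier score is 0.15 squared, 0.0225, and the ECE is 0.15, because both occupied bins miss their observed frequency by that amount.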

Anatomy of an MSPC verdict

A release decision is not a number; it is a structured verdict, and what travels alongside the flag is what makes it actionable and defensible. When MSPC scores BATCH-2026-004, it does not just emit "FAIL"; it emits a record that ties the verdict to the model that produced it, the two statistics and their limits, the contribution that names the culprit, and the disposition it triggers.

One MSPC verdict is a whole record: the standardized attribute row, the two statistics against their limits (T² in-control, SPE alarming), the contribution that names HCP as the driver, the confirming offline assay, and the human disposition the model advises but never makes. Original diagram by the authors, created with AI assistance.

Read the card top to bottom and the chapter's argument is laid out as fields. The input is the standardized row of ten release attributes — the z vector for this batch, mean-centered and unit-scaled against the good-batch statistics, so every field is already expressed in "standard deviations away from a normal batch." The core holds the two statistics with their limits: T² = 8.36 against 30.57 (in-control, the batch is recognizable in-plane because its purity and charge scores fall inside the cloud) and SPE = 356.59 against 4.95 (alarming, the batch is off the plane because its residual e is enormous). The contribution panel ranks the attributes by their share e_j² / Σ e_j² of the SPE residual, with HCP_ng_per_mg at 83% — the diagnostic that turns a flag into a lead, because it converts "off the plane" into "off the plane because of HCP." The reconciliation row carries the confirming offline assay, 128.0 ng/mg against the 0–100 spec, the ground truth that grades the model's alarm — and, critically, the legal record of record: the model advised, the assay decided. And the relationships panel records where the verdict came from and where it goes: fit_on the good-batch family (the five PASS siblings, traceable to a specific model version), opens an OOS investigation, may_feed a CAPA (Corrective and Preventive Action — the formal fix-and-prevent workflow a confirmed deviation triggers), and — the field that matters most — advises a human disposition rather than making one. That last field is not decoration; it is the regulatory boundary. The model flags and explains; a person in the quality unit authorized to release the lot decides. Everything else on the card exists so that human can decide fast and defensibly: the statistic that fired, the limit it broke, the variable that drove it, and the assay that confirms it. 
This typed-relationship record is the informal version of a formal model: Book 4 turns the same release gate into a closed-world SHACL validation in The Release Gate and SHACL.
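
The three numeric fields on the card (T², SPE, and the contribution shares) follow directly from the definitions in the key terms: t = z P, e = z − z P Pᵀ, T² = Σ t_r²/λ_r, SPE = ‖e‖², and share e_j²/Σ e_j². The sketch below applies them to synthetic data. It is not mspc.py: the function names, the toy attribute names other than HCP_ng_per_mg, and the two-component choice are assumptions.

```python
import numpy as np

def fit_pca(X_good, r):
    """Standardize against the good batches and keep r principal components."""
    mu = X_good.mean(axis=0)
    s = X_good.std(axis=0, ddof=1)
    Z = (X_good - mu) / s
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:r].T                                  # loadings, shape (J, r)
    lam = S[:r] ** 2 / (len(X_good) - 1)          # variance of each score
    return mu, s, P, lam

def score_batch(x, mu, s, P, lam):
    z = (x - mu) / s                              # standardized attribute row
    t = z @ P                                     # scores: position in the plane
    e = z - t @ P.T                               # residual: what the plane misses
    t2 = float(np.sum(t ** 2 / lam))
    spe = float(e @ e)
    contrib = e ** 2 / spe                        # each attribute's share of SPE
    return t2, spe, contrib

# Synthetic good batches: two correlated attribute pairs plus one independent one.
rng = np.random.default_rng(0)
n = 200
a, b = rng.normal(size=n), rng.normal(size=n)
names = ["attr_a1", "attr_a2", "attr_b1", "attr_b2", "HCP_ng_per_mg"]
X_good = np.column_stack([
    a + 0.2 * rng.normal(size=n), a + 0.2 * rng.normal(size=n),
    b + 0.2 * rng.normal(size=n), b + 0.2 * rng.normal(size=n),
    rng.normal(size=n),
])

mu, s, P, lam = fit_pca(X_good, r=2)
x_new = mu + s * np.array([0.5, 0.5, -0.3, -0.3, 10.0])   # ordinary except last attribute
t2, spe, contrib = score_batch(x_new, mu, s, P, lam)
print(f"T2 = {t2:.2f}  SPE = {spe:.2f}")
for name, c in sorted(zip(names, contrib), key=lambda nc: -nc[1]):
    print(f"  {name:<15s} {c:6.1%}")
```

The toy batch reproduces the pattern on the card in miniature: unremarkable in-plane, far off-plane, with the residual concentrated in one attribute. The control limits are omitted here because they depend on the training-set size.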

The release gate is also the model's input contract

The anatomy card's typed fields are not a presentation convenience; they are the seam where a knowledge graph makes this whole chapter trustworthy, and it is worth being explicit about the three places where that seam bears load.

First, the same SHACL shape that gates the release decision also guarantees the model's inputs are complete and in range. bp:ReleaseShape in Book 4 asserts, closed-world, that a released lot carries exactly one in-spec value for every required CQA — monomer, HMW, CEX-main, HCP, protein concentration — with a missing or duplicated result a failure now rather than an open question (see the release gate and SHACL). MSPC's standardization step z = (x − μ) / s quietly assumes exactly that: one numeric value per attribute, present, and on a sane scale. A LIMS integration that drops the HCP row or files a repeat under a second IRI does not crash the model — it produces a confident wrong score on a malformed z vector, the most dangerous failure of all. Running the release shape over a lot's graph before it reaches t2_spe() turns "is every monitored attribute present and singular?" from a hope into a validated precondition. The release gate and the model's input contract are the same closed-world shape, read twice.
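
The precondition itself is simple enough to express without a graph. The sketch below is a plain-Python stand-in for the idea, not the SHACL shape from Book 4: the function `check_panel`, the attribute keys, and every toy value except the 128.0 HCP result are hypothetical, and the in-range half of the check is left out because only some of the spec limits appear in this section.

```python
import math
from collections import Counter

# Required CQAs as listed in the text; the key spellings are hypothetical.
REQUIRED = ("monomer", "HMW", "CEX_main", "HCP", "protein_conc")

def check_panel(rows, required=REQUIRED):
    """rows: iterable of (attribute, value). Returns a list of problems; empty means OK."""
    rows = list(rows)
    counts = Counter(attr for attr, _ in rows)
    problems = []
    for attr in required:
        if counts[attr] == 0:
            problems.append(f"missing: {attr}")
        elif counts[attr] > 1:
            problems.append(f"duplicated: {attr} ({counts[attr]} values)")
    for attr, value in rows:
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            problems.append(f"non-numeric or non-finite: {attr}={value!r}")
    return problems

# Toy panel; only the HCP value comes from the chapter.
complete = [("monomer", 98.1), ("HMW", 1.2), ("CEX_main", 66.0),
            ("HCP", 128.0), ("protein_conc", 50.3)]
dropped = [row for row in complete if row[0] != "HCP"]
repeated = complete + [("HCP", 96.0)]

for label, panel in (("complete", complete), ("dropped", dropped), ("repeated", repeated)):
    print(label, check_panel(panel))
```

A scoring function would refuse to run on any panel for which the returned list is non-empty. Note that the check passes the 128.0 HCP row: completeness is a precondition for scoring, and being out of specification is what the score is there to detect.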

Second, a feature pulled by its ontology IRI survives a column rename that breaks a fragile lookup. The contribution plot names HCP_ng_per_mg; in the graph that attribute is the typed, unit-bearing property bp:hcpPpm hung on the lot, the same IRI across the ELN, LIMS, and MES that the data shadow is scattered over. Sourcing the monitored panel by IRI rather than by a brittle df["HCP"] is what lets the monitor mean the same thing after a system upgrade renames a column — the semantically-grounded-feature discipline the models-and-validation chapter argues for, applied at the release gate. And because QUDT/UCUM units travel with each value, an HCP reported as a per-total-protein ratio can never be silently compared against a monomer percentage — a class of error no datatype catches but the unit IRI does.

Third, the lineage edge is the grouping key that keeps the trajectory cohort honest. When batch_mvda.py fits multiway PCA on the good-batch family, "which batches are independent for validation?" is exactly the bp:derivedFrom question — the one transitive spine that roots every lot, hop by hop, in the frozen cell bank WCB-CHO-001 (relations and genealogy). Two trajectories that look independent by name but descend from the same seed culture are not independent, and a leave-one-batch-out split that ignores that leaks the held-out batch into the fit — so the same edge that scopes a recall also scopes a fold. The continuant/occurrent cut from the upper ontology keeps the bookkeeping straight underneath: a release result (a continuant — a value that persists) is a different kind of thing from the culture run (an occurrent — a process that happened) it was measured during, so the graph cannot quietly confuse the measurement with the run and let a per-batch label leak in as a per-timepoint feature.
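
In scikit-learn terms, a lineage-aware split means handing the splitter a group label per batch. The sketch below uses `LeaveOneGroupOut`; the batch IDs and seed-culture labels are invented, and in practice the groups would come from walking bp:derivedFrom in the graph rather than from a hand-written array.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical cohort: six batches descending from three seed cultures.
batch_ids = np.array(["B01", "B02", "B03", "B04", "B05", "B06"])
seed_culture = np.array(["SEED-A", "SEED-A", "SEED-B", "SEED-B", "SEED-C", "SEED-C"])
X = np.zeros((len(batch_ids), 3))          # placeholder features; only the split matters

# Each fold holds out every batch that shares a seed culture, not one batch at a time.
splits = list(LeaveOneGroupOut().split(X, groups=seed_culture))
for train_idx, test_idx in splits:
    print("train:", list(batch_ids[train_idx]), "| held out:", list(batch_ids[test_idx]))
```

A naive leave-one-batch-out split would hold out B01 while training on its sibling B02. Grouping by lineage removes that leak at the cost of fewer, larger folds.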

The regulatory framing closes the loop. The verdict card's advised-versus-decided boundary is what keeps every release model CSA-light rather than carrying the validation burden of a decision-making system: under the FDA's Computer Software Assurance (CSA) reframing of CSV (computerized-system validation), an advisory monitor whose output a qualified human always confirms carries far less validation weight than a system that dispositions the lot itself. Each field on the card is an ALCOA+ data-integrity attribute made concrete (the release_mspc_pca model version is attributable, the 128.0 ng/mg confirming assay is the original record of record that the model only advised on, and the typed fit_on edge makes the verdict traceable to the exact good-batch family it was scored against), so the record is Part 11 / Annex 11 audit-trailable by construction. The model that may change under a Predetermined Change Control Plan lives behind that same boundary; the graph is where the audit trail of which version advised which lot is queryable rather than buried in filenames.

The unsolved part: detecting drift with almost no failures

The hardest open problem in release ML is the same one that haunts every model in this book, but sharpest here: you cannot validate a rare-event detector on the events you almost never see. A release-outcome classifier or an MSPC monitor is meant to catch OOS batches, and a well-run process produces OOS batches very rarely — by design. So the positive class is tiny, the confidence intervals on any failure-detection metric are enormous, and a model can look excellent for a year simply because no real failure tested it. The arithmetic is unforgiving: if a process runs one OOS in fifty batches, then even a perfect year of fifty batches contributes a single positive to your evidence, and the 95% confidence interval on a detector's recall estimated from one true positive spans nearly the whole [0, 1] interval. You are validating a smoke detector in a building that almost never catches fire.
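
The width of that interval is easy to check. The sketch below uses the exact (Clopper-Pearson) binomial interval, which is one standard choice among several; the helper name `clopper_pearson` is ours.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial CI for k successes out of n trials."""
    lo = 0.0 if k == 0 else float(beta.ppf(alpha / 2, k, n - k + 1))
    hi = 1.0 if k == n else float(beta.ppf(1 - alpha / 2, k + 1, n - k))
    return lo, hi

# Recall estimated from a single OOS batch that the detector happened to catch.
print("1 of 1 caught :", clopper_pearson(1, 1))
# The same single batch, missed.
print("0 of 1 caught :", clopper_pearson(0, 1))
# Ten OOS batches with eight caught, the counts at the chapter's operating points.
print("8 of 10 caught:", clopper_pearson(8, 10))
```

One caught failure supports a recall anywhere from 0.025 to 1.0. Even ten positives with eight caught leave an interval of roughly 0.44 to 0.97, which is the same small-sample problem the bootstrap intervals showed on AUROC.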

This bites in two ways. First, the model decays silently. A new raw-material lot, an aging probe, a process tweak, or a slow shift in the cell line moves the world away from the training data, and the MSPC limits or classifier calibration drift — but you only discover the drift when a failure slips through or a wave of false alarms erupts, both of which are expensive and late. The drift is, by construction, a lagging indicator: the very rarity that makes the failure worth catching is what blinds you to your detector going stale. Second, the limits themselves are under-determined. Our SPE limit was estimated from five good batches; even a real campaign of dozens gives wide intervals, and the F-distribution T² limit assumes a multivariate-normality that small biologic datasets rarely honor — the scores are often skewed, and a handful of batches cannot tell a heavy-tailed distribution from a Gaussian one. The statistics are principled; the limits are approximations whose uncertainty is rarely reported alongside the alarm, so a "SPE = 356.59 vs limit 4.95" reads as crisp certainty when the limit itself carries a fat error bar.
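
How much the T² limit leans on the number of training batches can be shown with one commonly used small-sample form, r(n − 1)(n + 1) / (n(n − r)) · F(1 − α; r, n − r). That this is the form mspc.py uses is an assumption: the module's formula is not shown in this section. With n = 5, r = 2, and α = 0.05 it evaluates to about 30.57, which matches the limit printed on the verdict card.

```python
from scipy.stats import f

def t2_limit(n, r, alpha=0.05):
    """Small-sample Hotelling T2 limit for n training batches and r components."""
    scale = r * (n - 1) * (n + 1) / (n * (n - r))
    return float(scale * f.ppf(1 - alpha, r, n - r))

# Same two-component model, growing training family.
for n in (5, 10, 30, 100, 1000):
    print(f"n = {n:>4d}  T2 limit = {t2_limit(n, r=2):6.2f}")
```

The limit falls steeply as batches accumulate and settles near 6 for large n. A verdict scored against five batches is therefore judged against a bar several times looser than the one a mature campaign would set, and the statistic alone does not reveal that.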

There is no clean fix, only disciplines that help, and each one trades a known weakness for a partial guard. Monitor the inputs — do incoming batches' attributes still look like the training distribution? — as a leading proxy for drift, since input drift (a population-stability shift on a raw signal) precedes failure-detection drift and is observable on every batch, not just the failing ones. Re-qualify the limits periodically as history accumulates, tightening the F and SPE approximations toward their large-sample forms. Hold a conservative threshold that accepts false alarms to suppress misses, since the cost asymmetry favors it. And keep physics- or knowledge-based plausibility checks that flag impossible outputs (a negative concentration, an HCP below assay noise) no statistical model would catch on its own. But none of these substitutes for the ground truth a real failure provides, and the field knows it. The FDA's 2023 discussion paper names exactly this — monitoring and re-validating models whose performance can decay silently after deployment — as an open question for AI under cGMP (current Good Manufacturing Practice — the up-to-date form of the GMP regulations every release decision lives under), without prescribing a settled answer [7]. Until OOS events are common enough to test a detector continuously (which no one wants), a release-monitoring model must be distrusted on a schedule: validated narrowly, monitored on its inputs, and re-qualified as evidence accrues, with the standing assumption that it is decaying until proven otherwise.
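
Input monitoring is the one discipline here that can run on every batch, so it is worth a sketch. The version below computes a population stability index (PSI) for one feature against its training distribution. The binning scheme, the helper name `psi`, and the toy numbers are assumptions, and any alert level would be a choice made under change control.

```python
import numpy as np

def psi(reference, current, n_bins=5, eps=1e-6):
    """Population stability index of `current` against `reference` for one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from reference quantiles, so each bin starts equally full.
    inner = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(inner, reference, side="right"), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner, current, side="right"), minlength=n_bins)
    ref_frac = np.clip(ref_counts / len(reference), eps, None)   # avoid log(0)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Toy feature: training-era values, then an unchanged window and a shifted one.
reference = np.linspace(0.0, 1.0, 100)
same = reference.copy()
shifted = reference + 0.5

print("unchanged:", psi(reference, same))
print("shifted  :", psi(reference, shifted))
```

An unchanged window scores exactly zero and the shifted one scores far above it. The index needs no OOS labels at all, which is what makes it usable as a leading signal between failures.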

What this chapter adds to the model suite

This chapter contributes three runnable modules to examples/platform/ml/, all pure scikit-learn / NumPy / SciPy over the committed datasets, all ending in hard CI assertions so the book's claims cannot silently rot:

mspc.py — multivariate SPC on the real release panel from hplc_results.csv. It fits PCA on the five PASS siblings, scores all six batches with Hotelling's T² and SPE, decomposes the flagged batch's SPE with a contribution analysis, and asserts that only BATCH-2026-004 is flagged and that the SPE points at HCP. It is the release-side companion to the upstream soft sensors: instead of predicting one quantity, it asks the monitoring question — does this finished batch look like the family of good batches?
batch_mvda.py — trajectory MSPC. It DTW-aligns a cohort of batch trajectories, batch-wise unfolds the batch × variable × time cube, fits multiway PCA on the good batches, and flags the stressed (starved) batches by SPE with a contribution analysis that points at glucose around day 3. It makes the chapter's golden-batch / multiway-PCA story runnable, and asserts that DTW + unfold + MPCA flags the stressed batches and SPE attributes the deviation to a process variable over time — the early-trajectory counterpart to mspc.py's endpoint view.
release_predict.py — a calibrated, interpretable release-outcome classifier. Because the shipped campaign has only one OOS, it draws a 120-batch cohort from the mechanistic fed-batch model with two seeded failure mechanisms (stress and contamination), evaluates with nested cross-validation (tuning C in an inner loop, scoring on the outer fold), reports AUROC and AUPRC with bootstrap confidence intervals, measures probability calibration (Brier, ECE), and prints two operating points to make the threshold tradeoff explicit. It is deliberately not the plant-yield model of batch_outcome.py (Chapter 23); its subject is the release decision and the cost asymmetry around it.

All three run the same way, and the reproducibility contract is deliberately strict, because a book claim that cannot be re-run is a claim that will rot. From examples/platform/ml/, one-time setup pins the environment — python3 -m venv .venv && .venv/bin/pip install -r requirements.txt resolves every dependency to a pinned version, so the SVD and the F-distribution limit produce the same 30.57, 4.95, and 356.59 on any machine — and then .venv/bin/python mspc.py (or batch_mvda.py, or release_predict.py) reproduces the verbatim output blocks above. Every module fixes its random seed, so the 120-batch cohort and the nested-CV folds are identical run to run, and run_all.py records each committed dataset's MANIFEST.sha256 content hash, pinning every printed number to exactly which bytes produced it. The whole suite is open-source under the repository's license, pure scikit-learn / NumPy / SciPy with no service to stand up, which is the point the open-source analytics chapter makes about the commercial monitors: the T²/SPE arithmetic SIMCA and ProMV productize is the same numpy.linalg.svd you can read, run, and audit here.
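
The content-hash half of that contract is small. The sketch below shows the general technique with the standard library only. How run_all.py and MANIFEST.sha256 actually store and compare digests is not shown in this section, so the helper names and the temporary file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large datasets need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex):
    """True when the file's bytes still match the recorded digest."""
    return sha256_of(path) == expected_hex.lower()

with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "toy_dataset.csv"      # stand-in for a committed dataset
    toy.write_bytes(b"abc")
    digest = sha256_of(toy)                   # what a manifest would record
    unchanged_ok = verify(toy, digest)
    toy.write_bytes(b"abd")                   # a single changed byte
    changed_ok = verify(toy, digest)

print(digest)
print("unchanged:", unchanged_ok, "| after edit:", changed_ok)
```

A one-byte edit to a dataset changes the digest, so a printed number can always be tied to the exact bytes that produced it.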

Together they cover the two release-ML questions that actually matter: is this finished batch in family? (MSPC, endpoint and trajectory) and will this batch fail before the assay tells us? (release prediction).

Why it matters

Release is the last gate, and it is the one where being wrong is most expensive in both directions: ship a bad lot and patients are at risk; reject a good lot and a multi-million-dollar batch is destroyed. MSPC earns its keep here precisely because it does not pretend to autonomy. It compresses a dozen-attribute quality record into two interpretable statistics, can catch a batch that has drifted out of family even when every individual result is in spec, and points the investigator at the responsible attribute. Our example shows the flag-and-explain half of that, flagging BATCH-2026-004 on SPE and naming HCP in a single verdict, where a wall of separate univariate charts would have shown all green lights but one quiet red. Release prediction adds a second layer of warning, moving the alarm earlier in time, but only for failures that announce themselves upstream, and only as advice. The throughline is the same one the whole book keeps reaching: at the gate that matters most, ML is a powerful monitor and explainer, and the decision stays with a human.

In the real world

Commercial biologic plants run MSPC on validated suites — most often Sartorius SIMCA / SIMCA-online or AspenTech ProMV — for continued process verification, golden-batch monitoring, and fault detection (production) [1]. Amgen's Juncos site has publicly described SIMCA-based OPLS models on commercial harvest and in-process data, one of the more concrete first-party MSPC deployments on record (production, self-reported) [2]. The underlying multiway-PCA method is the peer-reviewed Nomikos–MacGregor framework [3], and the open-source core — numpy.linalg.svd plus the T²/SPE arithmetic this chapter shows — is the same math those suites productize, wrapped in validated GUIs, audit trails, batch-time score overlays, and contribution plots.

RTRT is the more sobering reality. Where it is real — small-molecule continuous manufacturing such as Janssen's Prezista NIR-based release (production) [5] — it is a genuine, regulator-accepted achievement that collapsed a two-week release to about a day. For biologics it remains scarce and partial: titer is soft-sensed routinely, but the safety panel (HCP, DNA, endotoxin, sterility) has no in-line surrogate a regulator will accept for release today, and the often-quoted cost savings (Ferring's ~25% COGS) are development-stage estimates, not established outcomes (illustrative) [6]. The broad-industry picture matches: the ISPE Pharma 4.0 surveys consistently find ML clustered in monitoring — exactly the MSPC and anomaly-detection work of this chapter — and almost never in autonomous release decisions [8]. The honest verdict for release ML is that monitoring is mature and production-grade, prediction is useful and advisory, and autonomous real-time release for biologics is still over the horizon.

Key terms

MSPC / MSPM — multivariate statistical process control/monitoring; modeling many correlated quality attributes (and their relationships) at once to flag a batch that is out of family.
PCA (Principal Component Analysis) — the decomposition Z = U S Vᵀ that compresses many correlated attributes into a few orthogonal principal components defining the "shape of a good batch"; the loadings P are the leading eigenvectors of the correlation matrix.
PLS (Partial Least Squares) — PCA's supervised cousin: instead of just compressing the attributes, it finds the directions in them that best predict a target outcome (titer, a CQA); the parent method behind MPLS below.
Scores / loadings / residual — a batch's coordinates in the model plane (t = z P), the directions that define the plane (P), and the part of the batch the plane cannot represent (e = z − z P Pᵀ).
Hotelling's T² — the in-plane distance from the center of the good-batch cloud, Σ t_r²/λ_r; a high value means an extreme but recognizable batch. Limit from the small-sample F-distribution.
SPE / Q-statistic — the off-plane squared prediction error ‖e‖²; a high value means genuinely novel behavior no good batch ever showed. Usually the statistic that fires first.
Contribution plot — the decomposition of an out-of-bounds T² or SPE back onto the original attributes (for SPE, each attribute's share e_j²/Σe_j²), naming which variable drove the excursion.
Golden batch / multiway PCA (MPCA), multiway PLS (MPLS) — Nomikos and MacGregor's batch-process trajectory monitoring; the I × J × K batch × variable × time cube unfolded to an I × (J·K) matrix so a whole run, not just its endpoint, is scored against the normal (golden-batch) envelope. MPCA monitors the process tags; MPLS regresses them onto a quality outcome. The evolving form scores the batch as it progresses, filling the unmeasured future and widening the limits early in the run.
RTRT (real-time release testing) — replacing an end-product test with an in-process measurement or model so the result is available when manufacturing finishes; common in small-molecule CM, scarce for biologics.
OOS (out-of-specification) — a result outside its acceptance criterion, requiring investigation before disposition; our example is BATCH-2026-004's HCP at 128.0 ng/mg against a 0–100 spec.
Release prediction — an advisory classifier estimating from in-process features whether a batch will pass or go OOS, before the slow release panel runs.
AUPRC vs AUROC — on a rare event, the precision-recall curve (with the prevalence as its no-skill baseline) is the honest performance metric; ROC flatters an imbalanced classifier.
Nested cross-validation — tuning a hyperparameter in an inner CV loop while estimating performance on the outer fold, so the reported score is not optimistically biased by selection; here it removed +0.045 of AUROC optimism (naive 0.968 versus honest 0.923).
Calibration (Brier / ECE) — whether a predicted probability matches the observed failure frequency (a predicted 0.7 should fail about 70% of the time); the Brier score and expected calibration error (ECE) measure it, lower being better, and it is what makes a probability usable for setting a release threshold.
Decision threshold — the probability cutoff that converts a model's score into an accept/flag decision; where the quality unit's risk tolerance — and the missed-OOS versus false-alarm tradeoff — actually lives.
Input contract (SHACL release shape) — the closed-world shape that gates the release decision, doubling as the guarantee that a monitor's inputs are complete, singular, and in range; running it before scoring turns "is every monitored attribute present?" into a validated precondition rather than an assumption that MSPC's standardization quietly makes.
Semantically-grounded feature — a monitored attribute pulled by its ontology IRI (bp:hcpPpm, unit-bearing) rather than a brittle column name, so it means the same thing across the ELN, LIMS, and MES and survives a system rename; the discipline that lets the release gate and the model share one vocabulary.
CSA / ALCOA+ / Part 11 — Computer Software Assurance (the risk-based reframing of CSV that puts an advisory, human-confirmed monitor in a far lighter validation tier than a deciding system), the ALCOA+ data-integrity attributes the verdict card makes concrete (attributable model version, original confirming assay, traceable fit_on family), and the 21 CFR Part 11 / EU Annex 11 audit-trail rules the typed record satisfies by construction.
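
The first three terms above reduce to a few lines of arithmetic, which a minimal sketch can make concrete. The function names, the attribute labels, and every number below are illustrative assumptions, not values from the chapter's dataset or its fitted model; the sketch assumes the PCA scores, eigenvalues, and residual vector for one batch have already been computed on standardized attributes.

```python
from typing import Sequence


def hotelling_t2(scores: Sequence[float], eigenvalues: Sequence[float]) -> float:
    """In-plane distance: sum over retained components of t_r^2 / lambda_r."""
    if len(scores) != len(eigenvalues):
        raise ValueError("need one eigenvalue per retained score")
    return sum(t * t / lam for t, lam in zip(scores, eigenvalues))


def spe(residual: Sequence[float]) -> float:
    """Off-plane squared prediction error: squared norm of the residual e."""
    return sum(e * e for e in residual)


def spe_contributions(residual: Sequence[float]) -> list:
    """Each attribute's share of SPE, e_j^2 / sum(e_j^2); shares sum to 1."""
    total = spe(residual)
    if total == 0.0:
        # A batch lying exactly on the model plane has nothing to attribute.
        return [0.0 for _ in residual]
    return [e * e / total for e in residual]


# Hypothetical inputs for one batch (assumed, not from the chapter's data).
attributes = ["endotoxin", "bioburden", "hcp", "host_cell_dna"]
scores = [1.0, -2.0]                # t_1, t_2 on two retained components
eigenvalues = [4.0, 1.0]            # lambda_1, lambda_2 (score variances)
residual = [0.5, -0.5, 3.0, 0.0]    # e_j per standardized attribute

t2 = hotelling_t2(scores, eigenvalues)   # 1/4 + 4/1 = 4.25
q = spe(residual)                        # 0.25 + 0.25 + 9.0 + 0.0 = 9.5
shares = spe_contributions(residual)
driver = attributes[max(range(len(shares)), key=shares.__getitem__)]

print(f"T2={t2:.2f}  SPE={q:.2f}  largest SPE contributor: {driver}")
```

In this toy case the third attribute carries almost all of the residual, so a contribution plot would name it as the driver of the SPE excursion. Whether either statistic actually alarms depends on the control limits, which this sketch deliberately leaves out.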

Where this leads

The lot is released; the good units are counted and approved to ship. But before a single vial leaves the warehouse it must be labeled, cartoned, and given a unique serial that lets it be traced through the supply chain. The next chapter, Packaging and Serialization: Vision, Track-and-Trace, and Anomalies, follows the released product onto the packaging line — where machine vision reads labels and codes, track-and-trace data builds the product's downstream genealogy, and anomaly detection guards against the diversion and counterfeiting that serialization exists to stop.
