The Production Bioreactor: Soft Sensors, Closed-Loop Control, and the Digital Twin

📍 Where we are: Part III · Upstream, Learned — Chapter 11. The seed train handed a healthy, on-schedule inoculum across the SEED-001 → BATCH-2026-001 gate. Now the cells go into the production bioreactor BR-101, where the antibody is actually made — and where the soft-sensing sketched at seed scale becomes the most mature machine learning anywhere in upstream manufacturing.

This is the chapter the whole upstream points at. Upstream is the cell-growing half of the process: everything from the vial through the production bioreactor where the antibody is made, as opposed to downstream, the purification that comes after (Book 1 draws the split in full). For roughly two weeks, a few thousand litres of CHO (Chinese Hamster Ovary) cells, the standard mammalian host for antibody manufacturing, grow, switch to producing antibody, and pour out a torrent of sensor data: temperature, pH, and dissolved oxygen every few seconds and, on the best-instrumented tanks, a full spectrum every minute from an in-line Raman probe (one inserted directly through the vessel wall, measuring the broth continuously without pulling a sample). The production bioreactor is the single most data-rich step in the entire process, and it is also where machine learning has earned its only genuinely production-grade foothold in biomanufacturing. (This chapter grades every capability on a three-rung maturity ladder: (production), routinely deployed in commercial GMP manufacturing; (pilot), demonstrated at scale but not yet the validated default; (research), promising in the literature but not yet in a plant.) The production-grade capability here is the spectroscopic soft sensor, a model that reads titer and metabolite concentrations off a Raman spectrum in real time, with no offline sample and no waiting for the lab.

But the bioreactor is also where the field's honesty is tested. The same probe that predicts titer beautifully predicts viable cell density poorly — a gap that fifteen years of work has not closed. Closed-loop control, where a model not only measures glucose but decides the feed and actuates the pump, exists in pilots but barely in GMP (Good Manufacturing Practice — the legally-enforced quality system commercial drug batches must run under). And the most ambitious object in the chapter, the digital twin — a first-principles model of the run, corrected by a learned residual — is real, useful, and still mostly a development-and-pilot tool, not a thing that runs your commercial batch unattended. This chapter builds the soft sensor, builds the twin, and is exact about where each one stops.

The simple version

Imagine a master brewer who can taste a fermenting tank and tell you, instantly, how much alcohol it has made and how much sugar is left — no lab, no waiting. That tasting is a soft sensor: a cheap, instant signal (here, the light a laser scatters off the broth — its Raman spectrum) read by a trained model to estimate something you would otherwise have to measure slowly in a lab. The brewer is brilliant at "how much alcohol" because alcohol has a clear taste, but bad at "exactly how many yeast cells are alive" because that has almost no taste at all — which is precisely the production bioreactor's situation. And the very best brewer doesn't just taste; she acts — adds a little sugar when the tank runs low — which is closed-loop control. The digital twin is her mental model of how the whole fermentation will unfold, corrected every time reality surprises her.

What this chapter covers

The soft-sensing task — what in-line Raman, NIR, and dielectric spectroscopy actually measure, and the menu of targets (titer, glucose, lactate, glutamine, ammonia, VCD) ranked by how learnable each one is.
The PLS chemometrics pipeline — the real preprocessing (SNV, Savitzky-Golay derivatives) and the latent-variable regression that is the 40-year workhorse of spectroscopic PAT, with the math stated plainly.
Why VCD is the weak spot — the molecular reason titer and metabolites predict well while total/viable cell density does not, and why this is the field's persistent unsolved problem.
Closed-loop glucose control — the measure-decide-actuate loop that turns a soft sensor into an actuator, and the regulatory line it has not yet crossed at scale.
The hybrid digital twin — a mechanistic fed-batch backbone corrected by an ML residual, and ML-surrogate MPC for feeding, with two runnable modules.
The GMP reality — what is genuinely production-grade, what is still pilot, and what is research-only, each attributed correctly.

The soft-sensor task: reading chemistry off light

A soft sensor (also called a virtual sensor or inferential sensor) is a model that estimates a hard-to-measure quantity from easy-to-measure ones. In the production bioreactor the hard-to-measure quantities are the ones that matter most: antibody titer, the glucose and glutamine the cells are eating, the lactate and ammonia they excrete, and the viable cell density (VCD). The classical way to get them is to pull a sample twice a day and run it on a bench analyzer. That is the cold-start cadence the whole book keeps colliding with, where cold-start means you must learn from very little labelled truth: a flood of cheap online signal, a trickle of expensive offline truth. The soft sensor's job is to interpolate the truth continuously from the signal.

Why the bioreactor and not some other step? Three things converge here that converge nowhere else. The value is highest — titer (how much antibody has accumulated, in g/L) and metabolites (the small molecules cells consume and excrete) are the levers on yield and on glycosylation (the pattern of sugar chains the cell attaches to the antibody, which shapes how the drug behaves in a patient), the CQA (Critical Quality Attribute — a property that must stay in spec for the drug to be safe and effective) you fight for. The signal is richest — a stirred, well-mixed, aqueous broth is the near-ideal sample chamber for an immersion probe, and the tank runs for two weeks, long enough to build a real calibration trajectory. And the reference is most painful to get — every offline metabolite count costs an aseptic sample pull, an analyzer run, and a half-day of lab time, so the economic case for replacing it with a model is overwhelming. The combination is why spectroscopic soft sensing matured here first and is still ahead of every other unit operation in the plant.

Several in-line probe technologies dominate, and they see different things:

Raman spectroscopy shines a monochromatic laser into the broth and measures the light that scatters back at shifted wavelengths. Most photons scatter elastically (Rayleigh scattering, no energy change); a tiny fraction scatter inelastically, having exchanged a quantum of energy with a molecular vibration, and it is that faint inelastic Raman scatter, perhaps one photon in ten million, that carries the chemistry. Each shift is a Raman band, indexed by wavenumber in cm⁻¹ (just the unit that labels where along the spectrum a band sits), and each band corresponds to a specific molecular vibration, so a Raman spectrum is a chemical fingerprint of everything dissolved in the medium. Our datasets carry exactly this: raman_spectra.parquet holds 701 intensity channels (wn_400 … wn_1800) read alongside the kinetic state. Raman is the workhorse of modern bioprocess PAT (Process Analytical Technology, the umbrella term for measuring quality in real time during the process rather than only in a lab afterward) because water scatters Raman weakly, so the aqueous broth does not drown the signal. The same water absorbs NIR strongly and handicaps it here: one molecule seen two ways, and the single most important reason Raman beats NIR in a cell culture.
NIR (near-infrared) spectroscopy measures absorption of near-infrared light by overtone and combination vibrations. It is cheaper and faster than Raman but water absorbs NIR strongly, so it is better suited to some metabolites and to drying/lyophilization (freeze-drying, a later fill-finish step) than to a cell-dense broth.
Dielectric (capacitance) spectroscopy applies a radio-frequency field and measures the permittivity of the suspension. Intact cell membranes act as tiny capacitors, so capacitance is roughly proportional to the viable biovolume — which makes dielectric the one probe with a genuine, direct line to VCD, the very quantity Raman struggles with. We return to this below; it is the crux of the chapter's unsolved part.
2D fluorescence (excitation–emission matrix, EEM) spectroscopy sweeps the excitation wavelength while recording the emission spectrum, building an excitation × emission grid that captures the native fluorescence of certain ring-shaped (aromatic) building blocks of proteins and of cofactors — small helper molecules cells use in metabolism (e.g., tryptophan, an amino acid, and NAD(P)H, a cell's energy-carrier), which glow under light and so report on biomass and metabolic state. That grid gives a soft-sensor input for biomass and some metabolites, complementary to Raman because it reads different chemistry — and it is also a known source of the fluorescence baseline that Raman preprocessing has to remove, so the two probes are as much foils as partners.

The targets are not equally learnable, and ranking them honestly is the first real lesson. (To keep them straight: glucose and glutamine are food the cells consume; lactate and ammonia are waste they excrete; titer is the antibody product itself; and VCD is the count of living cells.)

Target	Direct molecular Raman band?	Soft-sensing quality	Why
Glucose	yes (C-O, C-H stretches)	excellent	strong, specific bands; the canonical Raman target
Lactate	yes	excellent	strong band; rises monotonically, easy to track
Glutamine / glutamate	yes	good	clear bands, lower concentration
Ammonia	weak/indirect	fair	low concentration, weak signature; often inferred via correlation
Titer (mAb)	yes (protein amide bands)	very good	accumulating protein gives a growing, specific signal
VCD / total cell density	no direct signature	poor — the weak spot	cells scatter light but have no clean Raman band for "count"

The pattern in that last column is the whole chapter in miniature: the things with a direct molecular band in the spectrum predict well, and the thing that is a count of objects rather than a concentration of a molecule predicts badly. Hold that thought; it is the reason VCD soft-sensing is the field's persistent open problem.

The PLS chemometrics pipeline

A raw Raman spectrum is not model-ready, and the model that turns it into a titer is not a neural network — it is Partial Least Squares (PLS) regression, the 40-year-old workhorse of chemometrics, and in the small-data regime of bioprocess it is genuinely hard to beat [1]. ("Small-data" despite the torrent of spectra: what is scarce is not signal rows but batches and the expensive offline truth labels that calibrate against them — you may have thousands of spectra but only a few dozen genuine bench measurements, and only a handful of independent runs.) The pipeline has two halves: preprocessing, then latent-variable regression.

Preprocessing removes the physics that has nothing to do with chemistry. A spectrum carries baseline drift (slow curvature from fluorescence and the instrument) and multiplicative scatter (the whole spectrum scaled up or down by probe fouling, bubbles, or a changing path length). Two standard, decades-old steps handle these, exactly as the data chapter introduced:

Standard Normal Variate (SNV): for each spectrum independently, subtract its mean and divide by its standard deviation. For a single spectrum x of p channels, the transformed channel is x'ᵢ = (xᵢ − x̄) / s, where x̄ and s are that one spectrum's own mean and standard deviation. This removes multiplicative scatter and per-spectrum offset, so two spectra of the same broth taken through a slightly fouled versus a clean window become comparable. Because SNV is computed per-spectrum (row-wise) from that row's own statistics, it carries no information across rows and so cannot leak across the train/test boundary — a rare preprocessing step that is genuinely fit-free.
Savitzky-Golay derivatives: fit a low-order polynomial in a sliding window of 2m+1 channels by least squares, then read off the value (or a derivative) of that polynomial at the window centre. A first derivative kills a constant offset and a second derivative kills a linear baseline slope, while the polynomial fit smooths noise and sharpens overlapping peaks. Window length and polynomial order are real hyperparameters: too wide a window smears peaks, too narrow amplifies noise. A second-order polynomial in a roughly five-to-eleven-channel window with a first or second derivative is the textbook bioprocess default. The companion module runs a 15-channel window with a first derivative, a hair wider than that range because the 701-channel grid (2 cm⁻¹ per channel across 400–1800 cm⁻¹) is finely sampled, so 15 channels span only ~30 cm⁻¹. SNV-then-derivative is the canonical pairing: SNV normalizes the scale, the derivative flattens the baseline.
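The two preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not the book's dataio.snv_savgol (whose internals are not shown in this chapter): the function names snv and snv_savgol_sketch and the toy Gaussian band are assumptions made for the example, while the window, polynomial order, and derivative follow the values quoted in the text.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(X):
    """Standard Normal Variate: centre and scale each spectrum (row) by its own statistics."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def snv_savgol_sketch(X, window=15, polyorder=2, deriv=1):
    """SNV, then a Savitzky-Golay derivative taken along the wavenumber axis."""
    return savgol_filter(snv(X), window_length=window, polyorder=polyorder,
                         deriv=deriv, axis=1)

# Toy check: one synthetic band seen through two "windows" that differ only
# in gain (multiplicative scatter) and offset.
wn = np.arange(400, 1801, 2)                        # 701 channels, 2 cm^-1 apart
band = np.exp(-0.5 * ((wn - 1250) / 12.0) ** 2)     # hypothetical band near 1250 cm^-1
clean = band + 0.2
fouled = 0.6 * clean + 3.0                          # same chemistry, different gain and offset
Xp = snv_savgol_sketch(np.vstack([clean, fouled]))
same_after_preprocessing = np.allclose(Xp[0], Xp[1])
```

Because SNV uses only each row's own mean and standard deviation, the clean and fouled versions of the same spectrum come out identical after preprocessing, which is the sense in which the step needs no fitting.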

Latent-variable regression then solves the problem that makes ordinary regression useless here. You have 701 wavenumber channels and, at any moment, perhaps a few hundred training points: far more features than examples, and the features are massively collinear (neighbouring wavenumbers move together because a single vibrational band spans many adjacent channels). Ordinary least squares would fail outright: with more columns (channels) than rows (examples), the matrix it must invert, XᵀX, is singular, so the coefficients have no unique solution, and any solution a solver does return fits the calibration noise exactly and generalizes to nothing. PLS sidesteps this by finding a handful of latent variables, linear combinations of the 701 wavenumbers chosen to maximize covariance with the target, then regressing the target on those few components instead of the raw channels. Where PCA (Principal Component Analysis, the standard technique for finding the directions a dataset varies along most) finds directions of maximum variance in X, PLS finds directions of maximum covariance between X and y (the directions that move together with the target), which is why it predicts better with fewer components: it spends its degrees of freedom on the part of the spectrum that actually moves with titer, not on the loudest source of spectral variance (which, in a cell culture, is often the rising turbidity that has nothing to do with the target). A typical bioprocess Raman model uses roughly six to a dozen latent variables; our baseline lands just below that range, at five, chosen by cross-validation (below) rather than fixed in advance.
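The singularity claim is easy to check numerically. The sketch below uses random numbers in place of spectra (an assumption; only the shape matters here) with the 235-row, 701-channel dimensions of the chapter's training split, and computes the rank of the matrix ordinary least squares would need to invert.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 235, 701                      # calibration rows vs. wavenumber channels
X = rng.normal(size=(n, p))          # stand-in for preprocessed spectra (synthetic)
X -= X.mean(axis=0)                  # mean-centre, as the regression would

gram = X.T @ X                       # the p x p matrix OLS must invert
rank = np.linalg.matrix_rank(gram)   # cannot exceed n - 1 after centring
print(f"X^T X is {gram.shape[0]} x {gram.shape[1]} but has rank {rank}")
```

A 701 x 701 matrix whose rank is capped by the number of rows has no inverse, so the normal equations admit infinitely many coefficient vectors. PLS avoids the inversion entirely by regressing on a few latent scores.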

Stated a little more precisely, PLS decomposes the spectral matrix X (timepoints × wavenumbers) and the target vector y into a shared set of k scores T and loadings P, q:

X ≈ T·Pᵀ        (the spectra reconstructed from k latent scores)
y ≈ T·qᵀ        (the target predicted from the same k scores)

The scores T are the projections of the spectra onto k weight vectors W = [w₁ … w_k] (T = X·W for the first component; each later column is computed from the spectra after the earlier components have been subtracted out), and the algorithm builds those weights one at a time: the first direction w₁ maximizes the covariance cov(X·w, y); each later w maximizes the same covariance subject to its scores being orthogonal to all the scores already extracted, so no two latent variables explain the same thing twice. Because X and y share the same scores T, every latent direction is forced to be relevant to the target; that shared decomposition is the whole trick. The number of components k is the one real hyperparameter of the regression step, and it is chosen by cross-validation, ideally grouped by batch rather than row-wise (see below): too few components and the model underfits the chemistry; too many and it starts fitting batch-specific noise, the classic overfitting knee where the cross-validated error stops falling and turns back up. The final model collapses to a single regression-coefficient vector b (one weight per wavenumber, plus an intercept) that is linear in the spectrum, which is a feature, not a limitation. A linear chemometric model is interpretable (you can plot b against wavenumber and literally see which bands the titer prediction leans on, and confirm they sit where amide or C-H vibrations should) and validatable in a way a deep network is not, which is exactly why PLS, not a neural net, is what actually ships in GMP soft sensors.
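To make the decomposition concrete, here is a minimal NumPy sketch of single-target PLS using the classical NIPALS deflation scheme. It is an illustration under stated assumptions, not the solver the chapter's module calls (that is scikit-learn's PLSRegression): the function name pls1_nipals and the synthetic three-factor data are invented for the example.

```python
import numpy as np

def pls1_nipals(X, y, k):
    """Single-target PLS by NIPALS on mean-centred data.

    Returns weights W (p x k), scores T (n x k), X loadings P (p x k),
    y loadings q (k,), and the collapsed coefficient vector b (p,).
    """
    Xr = X - X.mean(axis=0)
    yr = y - y.mean()
    W, T, P, q = [], [], [], []
    for _ in range(k):
        w_a = Xr.T @ yr                     # direction of maximum covariance with y
        w_a = w_a / np.linalg.norm(w_a)
        t_a = Xr @ w_a                      # scores of the current (deflated) spectra
        tt = t_a @ t_a
        p_a = Xr.T @ t_a / tt               # X loading for this component
        q_a = yr @ t_a / tt                 # y loading for this component
        Xr = Xr - np.outer(t_a, p_a)        # deflate: later scores come out orthogonal
        yr = yr - q_a * t_a
        W.append(w_a); T.append(t_a); P.append(p_a); q.append(q_a)
    W, T, P, q = np.array(W).T, np.array(T).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)     # b = W (P^T W)^-1 q, one weight per channel
    return W, T, P, q, b

# Synthetic data: three hidden factors drive both the "spectra" and the target.
rng = np.random.default_rng(1)
n, p, k = 60, 200, 3
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, p)) + 0.01 * rng.normal(size=(n, p))
y = latent @ np.array([1.0, -0.5, 0.25]) + 0.01 * rng.normal(size=n)

W, T, P, q, b = pls1_nipals(X, y, k)
gram = T.T @ T                              # off-diagonals vanish: scores are orthogonal
y_hat = (X - X.mean(axis=0)) @ b + y.mean() # prediction from the single vector b
```

Two things are worth inspecting in the result: gram is diagonal to numerical precision (no two latent variables explain the same thing twice), and predicting through the scores, T @ q, gives the same centred values as predicting through the single coefficient vector b.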

The full pipeline, then, is: SNV → Savitzky-Golay derivative → mean-center → PLS with k latent variables → predicted concentration. Every parameter that learns from the data (the scaler's means, the PLS loadings and weights) must be fit on the training split only — fit them before you split and you have leaked. And the whole calibration is bound to its probe: change the laser, the flow cell, or the cell line and the model is, for regulatory purposes, a new procedure until re-validated — the analytical-procedure-lifecycle expectation that ICH Q2(R2) and Q14 (finalized 2023) now set out for validating, and re-validating, a multivariate procedure like a Raman+PLS soft sensor. (ICH is the international body whose guidelines harmonize drug-quality expectations across regulators; Q2(R2) and Q14 are its guidelines on validating analytical methods — here, a measurement procedure built on a model.) The bench reference assay never goes away, because it is what re-grounds the calibration whenever the hardware moves.

The flagship upstream node, learned: an in-line Raman probe feeds a SNV/Savitzky-Golay/PLS chemometrics pipeline that reads titer and metabolites off the spectrum in real time and anchors them at sparse bench counts; a dielectric probe reaches toward the stubborn VCD target; a glucose-driven closed-loop controller actuates the feed every thirty minutes; and a hybrid mechanistic-plus-ML twin forecasts the whole 14-day run — production-grade for metabolites and titer, still an open frontier for VCD and autonomous control. Original diagram by the authors, created with AI assistance.

A PLS soft sensor in code, and the leakage trap that flatters it

The chapter's first runnable artifact is examples/platform/ml/soft_sensor_pls.py. It builds the PLS titer soft sensor over the simulator's Raman spectra — the chemometric baseline that any fancier model has to beat. It applies the real SNV + Savitzky-Golay front end, lets an inner cross-validation pick how many latent components to keep, and maps them onto antibody titer — then names the bands it leaned on and gates each new spectrum as in- or out-of-domain:

# examples/platform/ml/soft_sensor_pls.py — PLS titer soft sensor from in-line Raman
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

import dataio                                              # shared loaders + chemometric front end
TARGET = "titer_g_L"

def select_n_components(Z, y, k_max=15, seed=0):
    # inner 5-fold CV; pick the smallest k within one SE of the best (one-SE rule)
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    means, ses = [], []
    for k in range(2, k_max + 1):
        s = cross_val_score(PLSRegression(n_components=k), Z, y, cv=cv, scoring="r2")
        means.append(float(s.mean())); ses.append(float(s.std(ddof=1) / np.sqrt(len(s))))
    means = np.array(means); best = int(means.argmax())
    chosen = int(np.argmax(means >= means[best] - ses[best]))    # smallest k clearing the 1-SE band
    return chosen + 2, round(means[chosen], 4)

def applicability_domain(pls, Z_train):
    # per-prediction in-/out-of-domain gate from Hotelling T2 (in-plane) + SPE (off-plane)
    T = pls.x_scores_; var = T.var(axis=0, ddof=1)
    spe_tr = np.sum((Z_train - pls.inverse_transform(T)) ** 2, axis=1)
    t2_lim, spe_lim = np.quantile(np.sum(T ** 2 / var, axis=1), 0.99), np.quantile(spe_tr, 0.99)
    def score(Z):
        Ts = pls.transform(Z); spe = np.sum((Z - pls.inverse_transform(Ts)) ** 2, axis=1)
        t2 = np.sum(Ts ** 2 / var, axis=1)
        return (t2 <= t2_lim) & (spe <= spe_lim)
    return score

def train_pls(test_size=0.3, seed=2026):
    X, y, wn = dataio.raman_xy(label=TARGET)
    Xp = dataio.snv_savgol(X)                              # SNV + Savitzky-Golay 1st derivative
    Xtr, Xte, ytr, yte = train_test_split(Xp, y, test_size=test_size, random_state=seed)
    scaler = StandardScaler().fit(Xtr)                     # mean-centre after SNV/SavGol, TRAIN only
    Ztr, Zte = scaler.transform(Xtr), scaler.transform(Xte)
    n_comp, cv_r2 = select_n_components(Ztr, ytr)          # inner-CV, one-SE rule (NOT hard-coded)
    pls = PLSRegression(n_components=n_comp).fit(Ztr, ytr)
    pred = pls.predict(Zte).ravel()
    vip = dataio.vip_scores(pls)                           # which wavenumbers the model leans on
    top = [(int(wn[i].split("_")[1]), round(float(vip[i]), 2)) for i in np.argsort(vip)[::-1][:6]]
    gate = applicability_domain(pls, Ztr)
    in_clean = gate(Zte)                                   # genuine held-out spectra
    corrupt = X[np.random.default_rng(seed).integers(0, len(X))].copy()
    j = int(np.argmin(np.abs(np.array([int(c.split("_")[1]) for c in wn]) - 1650)))
    corrupt[j:j+8] += 6.0 * corrupt.std()                 # cosmic-ray-style artifact at ~1650 cm-1
    out = gate(scaler.transform(dataio.snv_savgol(corrupt[None, :])))
    return {"n_components": n_comp, "cv_r2": cv_r2, "n_params": int(np.prod(pls.coef_.shape) + 1),
            "r2": round(float(r2_score(yte, pred)), 4),
            "rmse_g_L": round(float(np.sqrt(mean_squared_error(yte, pred))), 4),
            "vip_top": top, "vip_n_above_1": int((vip > 1.0).sum()),
            "clean_in_domain": round(float(in_clean.mean()), 3), "flags_corrupted": bool(not out[0])}

if __name__ == "__main__":
    m = train_pls()
    # prints R2/RMSE, the inner-CV n_components, the top VIP bands, and the AD gate result
    assert m["r2"] > 0.85, f"PLS R2 too low ({m['r2']}): dataset not predictive"
    assert m["vip_n_above_1"] > 0, "VIP should identify the bands the model uses"
    assert m["flags_corrupted"], "applicability domain must flag a corrupted spectrum"
    assert m["clean_in_domain"] >= 0.90, "genuine held-out spectra should be in-domain"

The run prints:

PLS soft sensor (titer from SNV+SavGol Raman): R2=0.9944 RMSE=0.127 g/L
  n_components=5 (inner 5-fold CV, one-SE rule; inner-CV R2=0.9905), 701 wavenumbers, 235 train / 101 test, 702 coefficients
  VIP > 1 on 389 wavenumbers; top bands: 1274cm-1 (VIP 1.41), 1276cm-1 (VIP 1.4), 1272cm-1 (VIP 1.4), 1270cm-1 (VIP 1.39), 1266cm-1 (VIP 1.38), 1268cm-1 (VIP 1.38)
  applicability domain (train 99th pct): T2<=17.164, SPE<=503.542 | held-out spectra in-domain=99%, corrupted spectrum flagged=True
ASSERT ok: R2 > 0.85, VIP names the bands, and the AD gate flags an out-of-domain spectrum while passing genuine ones.

Two numbers in that output are worth pausing on. The model is 702 coefficients (one per wavenumber plus an intercept), yet only five of them are free in the chemometric sense, because PLS forces the whole 701-channel weight vector to lie in a five-dimensional latent subspace. That five is not hard-coded: it is chosen by an inner 5-fold cross-validation under the one-standard-error rule, a deliberate bias toward simplicity. Among models whose cross-validated score is within one standard error of the best, you pick the smallest, because a simpler model overfits less and generalizes more reliably. So the "handful of components" claim is earned, not asserted. That is the parameter economy that makes PLS robust on a few hundred points: it reports a coefficient for every channel for interpretability, but it only ever fit five directions.

The companion module soft_sensor_deep.py makes the contrast explicit. A 1D-CNN with 5,713 parameters scores R2=0.9924, practically indistinguishable from PLS's R2=0.9944, while using roughly eight times the parameters. On this data the deep model does not beat the chemometric baseline (PLS slightly edges it), which is exactly why PLS is what ships.

Two more lines of the output are worth reading, because they make claims the chapter has so far only promised. The VIP scores now name the bands the model leans on: VIP > 1 on 389 wavenumbers, with the top scores clustered at 1274cm-1, 1276cm-1, 1272cm-1, 1270cm-1, 1266cm-1, and 1268cm-1 — all in the protein amide III region (roughly 1230–1300 cm⁻¹) — amide bands are the signature of the protein backbone, and the antibody titer is a protein, so these are exactly the bands titer should write into the spectrum. The titer model is therefore demonstrably right for the right reason: it is leaning on protein bands, not on baseline drift or turbidity. And the applicability-domain gate — Hotelling's T2 (how extreme a spectrum is inside the PLS latent plane) and SPE (Squared Prediction Error — how much of the spectrum lies off that plane, i.e. unexplained by the model) computed on the PLS latent scores, with limits set at the training 99th percentile (T2<=17.164, SPE<=503.542) — passes 99% of the genuine held-out spectra while flagging an injected corrupted spectrum (corrupted spectrum flagged=True) — the injected fault is a sharp spike of the kind a Raman detector really does pick up when a cosmic ray strikes it, a known glitch the gate must catch. That is the out-of-distribution self-check the anatomy card promises, now operational in code rather than only on the diagram. The gate catches spectra that are novel in shape (high SPE) or extreme in the calibrated directions (high T2); it does not catch a spectrum that drifts quietly along a calibrated direction — a slow baseline shift, say — which is exactly the fouling-drift failure mode, and it surfaces only at the bench-reference reconciliation, which is why the residual monitor, not the AD gate, is the real decay backstop.

Read like a process engineer, that R² is suspiciously good, and the reason it is so good is the single most important caveat in applied bioprocess ML. The default train_test_split shuffles rows, scattering near-identical within-batch neighbours across the train/test boundary, so the model is interpolating between near-duplicate points rather than generalizing to a new batch. Spectra taken one hour apart in a slow 14-day run are almost the same vector; split them randomly and the test set is full of near-copies of training rows. This is exactly the leakage trap the data chapter builds the entire dataio.py module to prevent. The committed raman_spectra.parquet is a single golden batch (BATCH-2026-001), so the reproducible honest split is forward-in-time (temporal) rather than batch-grouped: train on the first 70% of the run and test on the later hours, which forces the sensor to extrapolate past the titer range it was calibrated on. The shipped module examples/platform/ml/soft_sensor_split_demo.py (run by run_all.py) does exactly that, printing both numbers side by side:

# soft_sensor_split_demo.py on the single golden batch (verbatim)
[ROW-WISE random split]        235 train / 101 test   R2 =  0.9927   <- LEAKED, do not trust
[TEMPORAL forward-in-time]     held-out later hours    R2 = -0.6325   <- honest, RMSE 1.6153 g/L

The honest split collapses to a negative R² — the model does worse than just predicting the mean titer — because a single batch's calibration cannot extrapolate past the range it has seen; the random split flatters it by 1.625 R². A true held-out-batch split (calibrate on several runs, test a genuinely unseen one) is the production ideal and needs the multi-batch Raman that transfer.py and drift.py cover later; it is not realizable on one golden batch. Both kinds of honest split share the lesson: the row-wise number is high for the wrong reason. The temporal (or, with more data, batch-grouped) number is the one you could put before a reviewer. The lesson is not "Raman predicts titer" (it does); it is that the validation discipline is what separates a number you can defend from a fantasy. The single-batch collapse here is an extrapolation artifact, not a verdict on the method: real Raman calibrations are built across many batches spanning the full titer range, and titer-from-Raman is strongly predictive once that range is covered — which is exactly why it is the one production-grade ML deployment in upstream.

One more chemometrics decision shapes how this model behaves in practice: one global model, or several local ones? A single global PLS calibrated across the whole run is simple and is what most deployments start with, but a fed-batch culture is really two regimes — an exponential growth phase and a stationary production phase — whose spectra and titer relationships differ, so a global model can be a compromise that fits neither phase well. The alternative is a small bank of local models (a just-in-time or phase-segmented approach): pick the calibration appropriate to the current phase, or weight nearby calibration points more heavily. Local models often predict better but multiply the validation burden — each one is a procedure that must be qualified and maintained, and the logic that switches between them is itself a thing that can fail. The pragmatic production answer is usually a global model with enough latent variables to span both phases, re-validated whenever the process changes; the local-model gains are real but are spent in development, not bought for free.

Why VCD is the hard one

Now the chapter's central technical argument. Run the same PLS pipeline against VCD_e6_per_mL instead of titer and, under an honest split, the R² falls off a cliff. The reason is not a software bug or a tuning failure — it is physics, and it is worth stating precisely because it explains a fifteen-year unsolved problem.

Titer predicts well because antibody is a molecule with a Raman signature. As the cells secrete protein, the concentration of a specific chemical species rises in the broth, and that species has characteristic Raman bands (the protein amide I band near 1650 cm⁻¹ and amide III near 1250 cm⁻¹ from the peptide backbone, plus aromatic side-chain bands from phenylalanine and tyrosine). More antibody means a stronger, specific signal at known wavenumbers — a direct, causal, molecular link between the quantity and the spectrum. Glucose, lactate, and glutamine are the same story: each is a dissolved molecule with its own bands, so its concentration writes itself directly into the spectrum.

VCD predicts badly because "viable cell density" is a count, not a concentration. A cell is not a molecule with a Raman band; it is a complex object suspended in the medium. Cells do affect a Raman measurement — they scatter and absorb light, they raise turbidity, and the intracellular biochemistry contributes a diffuse background — but there is no clean band that means "two million cells per millilitre." Whatever VCD signal Raman carries is indirect and confounded: it rides on turbidity (which also changes with debris and bubbles), on the bulk biochemical composition (which shifts as cells grow, then die), and on correlations with the metabolites that do have bands — VCD and glucose-consumption rise together early, so a model can "predict" VCD by quietly reading glucose. A model can latch onto those correlations and look decent on the calibration batches, but the correlations are not stable run-to-run — viability shifts, the dead fraction climbs, the medium lot changes, the death phase decouples the count from the metabolites — so a Raman VCD model transfers poorly and decays fast. It is predicting a count through a proxy, and the proxy keeps moving. And note which count: whatever biomass signal Raman or turbidity carries leans toward total cell or biovolume, because the dead and lysing fraction still scatters light — so it is the viability discrimination, separating live cells from dead, that Raman fundamentally cannot do, which is exactly why VCD (not total cell density) is the hardest of all.

This is why dielectric (capacitance) spectroscopy matters: it has a genuine physical line to the answer Raman lacks. Capacitance measures the polarizability of intact cell membranes under a radio-frequency field, and that polarizable biovolume is roughly proportional to viable cell density — a direct, causal link, not a confounded correlation. But even capacitance has its own version of the same trap: its signal is viable biovolume, not cell count, so a population that swells, shrinks, or changes membrane properties as it ages will read differently for the same nominal VCD, and as cells lyse and lose membrane integrity the capacitance signal fades faster than the count does. So the honest engineering answer to "how do I soft-sense VCD?" is usually "don't rely on Raman for it — add a capacitance probe, and even then expect it to degrade as the dead fraction rises and the membrane signal blurs." VCD soft-sensing remains the recognized weak spot of spectroscopic monitoring: it is the one routinely-needed upstream quantity that no single in-line probe reads cleanly, and closing the gap is an active research problem rather than a solved one.

There is a subtle, important consequence of this for every Raman model, not just the VCD one. Because cells scatter and absorb light, a rising cell density attenuates and distorts the whole spectrum — a turbidity effect that the chemistry-bearing bands of glucose or titer ride on top of. SNV and the derivative preprocessing are partly there to remove this confound, but they never remove it completely, which means a metabolite model calibrated at low cell density can drift as the culture becomes dense and turbid. The VCD weak spot, in other words, is not an isolated failure of one target; it is the same scattering physics that quietly threatens the accuracy of the metabolite models the chapter just praised — one more reason the bench reference and a batch-grouped validation are not optional.
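
As a minimal sketch of why SNV helps with this confound but cannot remove it completely, the example below applies SNV to a synthetic spectrum. Everything in it is an illustration value rather than chapter data: the single Gaussian band, the gain and offset, and the sloped baseline are assumptions. SNV (subtract each spectrum's own mean, divide by its own standard deviation) cancels a uniform gain and offset exactly, while a wavenumber-dependent tilt passes straight through it.

```python
import numpy as np


def snv(spectrum: np.ndarray) -> np.ndarray:
    """Standard normal variate: centre and scale ONE spectrum by its own mean and std."""
    return (spectrum - spectrum.mean()) / spectrum.std()


# Synthetic 701-channel spectrum with a single Gaussian band (illustrative values only).
wn = np.linspace(400.0, 1800.0, 701)
clean = 1.0 + np.exp(-0.5 * ((wn - 1125.0) / 15.0) ** 2)

# Uniform attenuation plus offset: the part of a turbidity effect SNV removes exactly.
scaled = 0.6 * clean + 0.3
# A wavenumber-dependent tilt: the part that survives SNV.
tilted = clean + 0.0005 * (wn - wn.mean())

err_scaled = float(np.max(np.abs(snv(scaled) - snv(clean))))
err_tilted = float(np.max(np.abs(snv(tilted) - snv(clean))))
print(f"max deviation after SNV, gain + offset: {err_scaled:.2e}")
print(f"max deviation after SNV, tilt:          {err_tilted:.2e}")
```

The first deviation is at floating-point noise level and the second is not. A derivative step targets exactly this kind of slowly varying baseline, which is one reason the two are paired, but real scattering is neither a pure gain nor a pure tilt, so some of it reaches the model either way.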

The deeper lesson generalizes past this one variable: a soft sensor is only as good as the physical link between the signal and the target. Where that link is a direct molecular band, ML is production-grade. Where it is an indirect, drifting proxy, ML produces a number that looks fine in calibration and betrays you on the next lot. Knowing which regime you are in — and saying so — is the difference between a trustworthy soft sensor and a confident liar.

Closed-loop glucose control: from measuring to acting

A soft sensor that only reports is a monitor. The moment it drives an actuator, it becomes control — and the most mature example in the bioreactor is closed-loop glucose feedback control. The idea is to hold glucose in a tight target band by measuring it in-line and adjusting the feed automatically, instead of bolus-feeding (delivering a single discrete dose of feed) on a fixed schedule. The loop runs continuously:

Measure. The Raman soft sensor estimates current glucose concentration (say, every minute) and tags it BR101.Glucose.PV — the data chapter's dotted unit.measurement.role tag, here bioreactor BR101's glucose process value (PV, the live measured reading) — alongside the rest of the online state.
Decide. A controller compares the estimate to the setpoint and computes a correction — how much feed to add over the next interval to bring glucose back to target without overshooting.
Actuate. The controller commands the feed pump (the BR101.GlucoseFeed actuator) to deliver that volume.
Wait and repeat. The loop typically actuates on roughly a 30-minute cadence — long enough for the feed to mix and the cells to respond, short enough to hold a tight band, and long enough that a single noisy spectrum cannot lurch the pump.

Holding glucose low and constant has real process value: it suppresses the overflow metabolism that produces lactate and ammonia, can shift glycosylation (the antibody's sugar-chain pattern, a quality attribute) toward the desired distribution, and reduces the osmolality (the total dissolved-solute concentration the cells live in) swings that fixed bolus feeding causes. It also does more than trim a byproduct: under controlled-low or glucose-limited feeding CHO cultures often flip from net lactate production to lactate consumption — the well-known metabolic "lactate shift" (when glucose is scarce the cells stop the overflow metabolism that burns glucose fast and dumps the excess carbon as lactate, and instead oxidize the lactate they made earlier, which eases the pH-buffering load and improves culture health) — which is one of the most reliable reasons a plant wants a tight glucose band in the first place. The control law itself can be simple or model-based. The simplest workable version is a discrete feedforward-plus-feedback law: estimate the cells' current glucose consumption rate from the recent trajectory (the feedforward term that supplies what the cells are about to eat), and add a feedback correction proportional to the gap between the measured glucose and its setpoint:

feed(t) = consumption_estimate(t) · Δt  +  Kp · (glucose_setpoint − glucose_hat(t))

where glucose_hat(t) is the Raman soft sensor's estimate and Kp is a tuned gain, a single dial that sets how hard the loop reacts to the error. The two terms divide the labour cleanly. The feedforward term does most of the work: it tracks the rising biomass and supplies the bulk demand before glucose has a chance to drop. The feedback term cleans up the error the soft sensor and the consumption model leave behind, the small residual mismatch a pure feedforward schedule would accumulate. Tune Kp too high and the loop chases sensor noise and oscillates; too low and it sags below setpoint whenever demand jumps. More advanced demonstrations replace consumption_estimate with a small predictive model that forecasts demand over the next interval, a one-step-ahead MPC (Model Predictive Control: a controller that decides each move by simulating the process a short distance into the future and picking the action with the best forecast), and use a deep-learning soft sensor to supply glucose_hat.
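
The law above can be sketched in a few lines. This is a toy under stated assumptions, not a plant controller: the plant is a single noise-free glucose balance, the uptake rates, gain, and starting concentration are invented, feed is expressed as a concentration change (g/L), and the clamp at zero is an added assumption that a pump can only add glucose.

```python
def feed_command(glucose_hat: float, setpoint: float, consumption_rate: float,
                 dt_h: float, kp: float) -> float:
    """One interval of the feedforward-plus-feedback law, as g glucose per L of culture."""
    feedforward = consumption_rate * dt_h        # expected uptake over the interval
    feedback = kp * (setpoint - glucose_hat)     # proportional correction of the current error
    return max(0.0, feedforward + feedback)      # assumption: a pump cannot remove glucose


def simulate(n_steps: int = 100, dt_h: float = 0.5, setpoint: float = 4.0,
             true_rate: float = 0.20, estimated_rate: float = 0.18,
             kp: float = 0.5) -> list[float]:
    """Toy plant: glucose falls by the true uptake and rises by the commanded feed."""
    glucose, trace = 6.0, []                     # start well above setpoint
    for _ in range(n_steps):
        # The sensor is treated as perfect here, so glucose_hat equals the true value.
        feed = feed_command(glucose, setpoint, estimated_rate, dt_h, kp)
        glucose = glucose - true_rate * dt_h + feed
        trace.append(glucose)
    return trace


trace = simulate()
print(f"glucose after {len(trace)} intervals: {trace[-1]:.3f} g/L")
```

With the consumption estimate 10% low, the loop settles at 3.98 g/L rather than 4.00: the fixed point is setpoint − (true_rate − estimated_rate)·Δt / Kp. That small standing offset is the residual mismatch the feedback term is left to absorb, and a proportional-only correction shrinks it without eliminating it.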

The suite now ships exactly such a controller in runnable form — examples/platform/ml/mpc_loop.py, an advisory receding-horizon MPC of the glucose feed. At each control step (every 12 h — a coarser step than the ~30-minute production cadence above, set by the simulator's sampling; the loop logic is identical) it reads a noisy soft-sensor glucose estimate (the true plant glucose plus seeded measurement noise — the controller never sees the truth), rolls a mechanistic digital twin forward a short horizon under each of a small grid of candidate feed boluses, and proposes the feed that holds predicted glucose closest to a 4.0 g/L setpoint, applying only the first move before re-planning. Against the open-loop fixed-bolus baseline it tightens glucose tracking sharply:

advisory soft-sensor MPC of glucose feed (setpoint = 4.0 g/L; control every 12 h)
  open-loop (fixed boluses)   : glucose tracking RMSE = 3.37 g/L, final titer = 5.77 g/L
  closed-loop (MPC, advisory) : glucose tracking RMSE = 1.09 g/L, final titer = 4.01 g/L
  note: the controller PROPOSES; in GMP a human / qualified automation with a safe fallback retains authority.
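
The receding-horizon step described above can be sketched independently of the module. This is not the code in mpc_loop.py: the one-state "twin" (a bolus followed by constant uptake), the candidate grid, the noise level, and every function name are assumptions made for illustration. It shows only the pattern: score each candidate feed by rolling the model forward, apply the first move, then re-plan from a fresh noisy estimate.

```python
import random


def predict_glucose(glucose: float, feed: float, uptake: float, horizon: int) -> list[float]:
    """Toy twin: add the bolus now, then subtract a constant uptake per step (floored at 0)."""
    g, path = glucose + feed, []
    for _ in range(horizon):
        g = max(0.0, g - uptake)
        path.append(g)
    return path


def propose_feed(glucose_hat: float, setpoint: float, uptake: float,
                 candidates: list[float], horizon: int) -> float:
    """Return the candidate bolus whose predicted path stays closest to setpoint."""
    def cost(feed: float) -> float:
        path = predict_glucose(glucose_hat, feed, uptake, horizon)
        return sum((g - setpoint) ** 2 for g in path)
    return min(candidates, key=cost)


def run_advisory_loop(n_steps: int = 28, setpoint: float = 4.0, uptake: float = 0.5,
                      noise_sd: float = 0.2, seed: int = 2026) -> list[float]:
    """Re-plan every step from a noisy estimate; the controller never sees the true value."""
    rng = random.Random(seed)
    candidates = [0.25 * i for i in range(9)]            # 0.0 .. 2.0 g/L, illustrative grid
    glucose, trace = setpoint, []
    for _ in range(n_steps):
        glucose_hat = glucose + rng.gauss(0.0, noise_sd)  # seeded measurement noise
        feed = propose_feed(glucose_hat, setpoint, uptake, candidates, horizon=2)
        glucose = max(0.0, glucose + feed - uptake)       # only the first move is applied
        trace.append(glucose)
    return trace


trace = run_advisory_loop()
print(f"min {min(trace):.2f} g/L, max {max(trace):.2f} g/L over {len(trace)} steps")
```

The grid search is deliberately crude. It keeps the sketch free of an optimizer dependency and makes every proposal easy to audit, at the cost of resolution no finer than the grid spacing.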

The framing is the same one this section has insisted on: the controller proposes a feed; it does not seize the pump. In a real GMP plant a human, or a qualified automation layer with a safe fallback, retains authority. The module's own printed note and its asserts keep that honest, checking only that the advisory loop tracks glucose better than open-loop and did not crash the culture, not that it is fit to run a batch unattended. Read the second column of the printout as well: tracking RMSE falls by roughly two-thirds (3.37 to 1.09 g/L), but final titer in this simulation is lower under the advisory loop (4.01 vs 5.77 g/L). The loop's objective is setpoint tracking, not titer, so the output is not evidence that tighter glucose control raises yield. The loop's hardest engineering is not the control law but the failure modes: a fouled probe makes glucose_hat drift, and a drifting estimate wired straight to a pump can systematically over- or under-feed for hours before the twice-daily bench sample catches it. Production loops therefore wrap the controller in interlocks:

Rate and volume bounds — the feed per interval is clamped to a physiologically plausible band, so no single command can dump or starve the tank.
Estimate sanity checks — glucose_hat is cross-checked against the slower online state (a sudden jump with no matching change in pH or oxygen-uptake is treated as a sensor fault, not a real swing) and against the probe's own spectral-quality flag.
Safe-fallback — if the probe quality flag goes bad or the estimate fails its sanity check, the loop reverts to a conservative scheduled feed and raises an alarm, so a bad estimate fails safe rather than driving the batch off a cliff.
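
The three interlocks can be sketched as a thin wrapper around whatever the controller proposes. The names, thresholds, and fallback dose below are hypothetical placeholders, and the corroborated flag stands in for the cross-check against pH and oxygen uptake, which a real loop would compute rather than receive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedDecision:
    feed: float      # g/L to deliver this interval
    source: str      # "controller" or "fallback"
    alarm: bool      # True when the loop has reverted to the scheduled feed


def guarded_feed(proposed: float, glucose_hat: float, previous_hat: float,
                 probe_ok: bool, corroborated: bool,
                 max_feed: float = 1.0, max_jump: float = 1.5,
                 fallback_feed: float = 0.3) -> FeedDecision:
    """Apply estimate sanity checks, then rate bounds; fail safe on any doubt.

    All thresholds are illustrative. `corroborated` means the slower online state
    (pH, oxygen uptake) moved consistently with the change in the glucose estimate.
    """
    suspicious_jump = abs(glucose_hat - previous_hat) > max_jump and not corroborated
    if not probe_ok or suspicious_jump:
        # Safe fallback: conservative scheduled feed plus an alarm.
        return FeedDecision(fallback_feed, "fallback", True)
    # Rate and volume bounds: clamp the proposal to a plausible band.
    return FeedDecision(min(max(proposed, 0.0), max_feed), "controller", False)


print(guarded_feed(5.0, 4.0, 4.1, probe_ok=True, corroborated=True))
print(guarded_feed(0.5, 7.0, 4.0, probe_ok=True, corroborated=False))
```

The sanity checks run before the clamp on purpose: an implausible estimate should trigger the fallback rather than be bounded and passed on as if it were a controller decision.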

The honest framing is about what crosses into GMP. Closed-loop glucose control via Raman plus deep learning has been demonstrated at scale — Amgen has shown it — but it sits at (pilot) maturity: a demonstrated capability, not yet the default way commercial batches are fed under a locked, validated control loop. The regulatory reason is the recurring theme of upstream ML: a model that autonomously moves an input affecting a CQA is held to a far higher bar than a model that advises a human. A soft sensor that displays glucose to an operator who decides the feed is monitoring; a soft sensor wired straight to the pump is control, and control of a parameter that affects product quality invites the full weight of process validation (the documented proof that a process reliably makes in-spec drug), change control, and the "locked model" expectation — that a deployed model is frozen, versioned, and not silently retrained in production — that the FDA's 2023 discussion paper and the draft EU/PIC/S GMP Annex 22 both set out. The capability is real and demonstrated; the scaled, routine, commercial-GMP deployment is the frontier, not the norm.

The hybrid digital twin: physics carries the trend, ML corrects the curvature

The most ambitious object in the chapter is the digital twin of the run — a model that does not just read the current state off a probe but forecasts the whole trajectory, so you can ask "if I feed like this, where does titer land on day 14?" A pure black-box network would need far more batches than a process ever has. A pure first-principles model captures the trend but misses the systematic ways this cell line, in this medium, deviates from the textbook. The winning pattern in bioprocess is neither: it is the hybrid (gray-box) model, a first-principles backbone corrected by a small ML residual.

A full mechanistic fed-batch model is a set of coupled ordinary differential equations (ODEs — equations that state the rate of change of each quantity, e.g. how fast cell count or glucose moves at each instant; "coupled" because each rate depends on the others) — viable cells grow and die, glucose and glutamine are consumed, lactate and ammonia are produced, and titer accumulates — each governed by kinetic parameters (a maximum growth rate μ_max, Monod half-saturation constants — Monod is the standard rule that a nutrient drives growth strongly when plentiful and tapers off as it runs low — yield coefficients tying substrate consumed to biomass, and a specific productivity qP, how fast one cell secretes antibody). Schematically:

dXv/dt = (μ − kd)·Xv          μ = μ_max · [Glc/(K_glc + Glc)] · ...   (growth − death)
dGlc/dt = −(1/Y_xglc)·μ·Xv − m·Xv  + feed(t)        (consumed ∝ growth + maintenance)
dP/dt  = qP·Xv                                       (titer accumulates ∝ viable cells)
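
To see how these three equations behave together, here is a minimal forward-Euler integration. Every number in it is an illustrative assumption rather than a fitted value: the kinetic parameters, the constant feed rate, the initial state, and the step size are invented, the growth rate uses only the glucose Monod term (the "..." terms are omitted), and the floor at zero glucose is a numerical guard, not part of the model.

```python
def simulate_fed_batch(days: float = 14.0, dt: float = 0.01) -> list[tuple[float, float, float, float]]:
    """Integrate Xv (1e6 cells/mL), Glc (g/L) and P (g/L) with forward Euler; dt in days."""
    mu_max, k_glc, kd = 0.6, 0.5, 0.05      # 1/day, g/L, 1/day (assumed)
    y_xglc, m, qp = 2.0, 0.1, 0.04          # yield, maintenance, specific productivity (assumed)
    feed_rate = 1.0                         # constant glucose feed, g/L/day (assumed)
    xv, glc, p = 0.3, 6.0, 0.0              # initial state (assumed)
    rows = [(0.0, xv, glc, p)]
    n_steps = int(round(days / dt))
    for i in range(1, n_steps + 1):
        mu = mu_max * glc / (k_glc + glc)                    # Monod growth rate
        d_xv = (mu - kd) * xv                                # growth minus death
        d_glc = -(mu / y_xglc) * xv - m * xv + feed_rate     # uptake, maintenance, feed
        d_p = qp * xv                                        # titer accumulates with viable cells
        xv += d_xv * dt
        glc = max(0.0, glc + d_glc * dt)                     # numerical guard, not physics
        p += d_p * dt
        rows.append((i * dt, xv, glc, p))
    return rows


rows = simulate_fed_batch()
t_end, xv_end, glc_end, p_end = rows[-1]
print(f"day {t_end:.0f}: Xv={xv_end:.2f}e6/mL  Glc={glc_end:.2f} g/L  P={p_end:.2f} g/L")
```

Forward Euler is used only to keep the sketch dependency-free; a stiff or more detailed model would normally go through a proper ODE solver.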

That backbone is the first-principles part of the twin, and it is what makes the model extrapolate sensibly into conditions it was not trained on: the Monod terms encode that growth slows as a nutrient runs out, knowledge no black box gets for free. The simplest useful slice of it, the one the runnable module implements, is the titer relation alone. Integrating dP/dt = qP·Xv over the run with qP held constant gives P = qP × IVCD, where IVCD is the integral of viable cell density over time. In words, total antibody made is (how many productive cells there were) × (how long they were productive) × (how fast each one secretes). That one constant, fit by least squares through the origin, already explains most of the variance. But the constant-qP assumption is wrong in detail. Specific productivity in CHO is commonly growth-rate-dependent, tending to climb as the specific growth rate falls, so it rises in stationary phase as the cells redirect resources from growth to secretion. The deliberate temperature shift many platforms use exploits the same physiology: cooling the culture a few degrees slows growth and pushes the cells into a slow-growth, high-secretion production phase. So a small neural network is trained on only the residual (true titer minus the mechanistic prediction), reading the process state. The physics carries the trend; the network corrects the curvature the constant can't capture. This is a parallel gray-box: ŷ = mechanistic(state) + NN(state).
The alternative, a serial gray-box, instead has the network supply a parameter, say a time-varying qP(state), that the mechanistic equations then integrate. The serial form keeps the output consistent with the physics (as long as the learned qP is constrained to be non-negative, titer can only rise and can never go negative) but is harder to fit because the network's error passes through the ODE solver, while the parallel form is the easiest to fit and reason about. Both arrangements are used in production-scale twins.
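
The structural difference between the two arrangements shows up even in a toy that is separate from the chapter's module. In the sketch below, the softplus expression is a stand-in for a trained network, and the cell-density and growth-rate series, the coefficients, and the residual values are all invented. The serial path integrates a qP that is positive by construction; the parallel path adds an unconstrained correction on top of qP × IVCD.

```python
import math


def softplus(z: float) -> float:
    """Smooth, strictly positive function; used here to keep the learned qP above zero."""
    return math.log1p(math.exp(z))


def serial_titer(xv: list[float], mu: list[float], dt: float,
                 a: float = -3.0, b: float = 2.0) -> list[float]:
    """Serial gray-box: a stand-in 'network' supplies qP(state), the physics integrates it."""
    titer, path = 0.0, []
    for xv_i, mu_i in zip(xv, mu):
        qp_i = softplus(a - b * mu_i)      # qP rises as growth rate falls (assumed form)
        titer += qp_i * xv_i * dt          # dP/dt = qP * Xv, rectangle rule
        path.append(titer)
    return path


def parallel_titer(xv: list[float], dt: float, qp_const: float,
                   residual: list[float]) -> list[float]:
    """Parallel gray-box: constant-qP backbone plus an additive, unconstrained correction."""
    ivcd, path = 0.0, []
    for xv_i, r_i in zip(xv, residual):
        ivcd += xv_i * dt
        path.append(qp_const * ivcd + r_i)
    return path


xv = [0.5, 1.0, 2.0, 4.0, 6.0, 6.0, 5.0]       # 1e6 cells/mL, one value per day (invented)
mu = [0.6, 0.6, 0.5, 0.3, 0.1, 0.0, 0.0]       # 1/day (invented)
serial = serial_titer(xv, mu, dt=1.0)
parallel = parallel_titer(xv, dt=1.0, qp_const=0.04,
                          residual=[-0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print("serial  :", [round(v, 3) for v in serial])
print("parallel:", [round(v, 3) for v in parallel])
```

The serial trajectory can only rise. The parallel one dips below zero at the first point because nothing stops the additive correction from outweighing the backbone, which is the price of its simpler fit.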

The chapter's second runnable artifact, examples/platform/ml/hybrid_model.py, builds exactly this parallel gray-box and pits it against a pure-NN baseline:

# examples/platform/ml/hybrid_model.py — mechanistic IVCD backbone + NN residual
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

DATA = Path(__file__).resolve().parents[2] / "datasets"
TARGET = "titer_g_L"
FEATS = ["Xv_e6_per_mL", "glucose_g_L", "lactate_g_L", "glutamine_mM",
         "ammonia_mM", "t_day", "viability_pct"]

def load_state():
    df = pd.read_parquet(DATA / "fedbatch_state.parquet")
    return df.iloc[::60].reset_index(drop=True)          # minute -> hourly (336 rows)

def ivcd(df):                                            # cumulative integral of viable cells
    t, xv = df["t_day"].to_numpy(), df["Xv_e6_per_mL"].to_numpy()
    return np.cumsum(xv * np.diff(t, prepend=t[0]))

def train_hybrid(test_size: float = 0.3, seed: int = 2026):
    df = load_state(); y = df[TARGET].to_numpy(); iv = ivcd(df); X = df[FEATS].to_numpy()
    tr, te = train_test_split(np.arange(len(df)), test_size=test_size, random_state=seed)
    qp = float(np.sum(iv[tr] * y[tr]) / np.sum(iv[tr] ** 2))      # qP through the origin
    mech = qp * iv                                               # mechanistic backbone
    scaler = StandardScaler().fit(X[tr])
    nn = MLPRegressor((32, 16), max_iter=5000, alpha=1e-3, random_state=seed)
    nn.fit(scaler.transform(X[tr]), (y - mech)[tr])             # NN learns ONLY the residual
    hybrid = mech + nn.predict(scaler.transform(X))
    pure = (MLPRegressor((32, 16), max_iter=5000, alpha=1e-3, random_state=seed)
            .fit(scaler.transform(X[tr]), y[tr]).predict(scaler.transform(X)))   # pure-NN baseline
    sc = lambda p: (round(float(r2_score(y[te], p[te])), 4),
                    round(float(np.sqrt(mean_squared_error(y[te], p[te]))), 4))
    n_params = sum(w.size for w in nn.coefs_) + sum(b.size for b in nn.intercepts_)  # both MLPs share this architecture
    return {"qP": round(qp, 5), "n_train": len(tr), "n_test": len(te), "n_params": n_params,
            "mech": dict(zip(("r2", "rmse"), sc(mech))),
            "hybrid": dict(zip(("r2", "rmse"), sc(hybrid))),
            "pure_nn": dict(zip(("r2", "rmse"), sc(pure)))}

if __name__ == "__main__":
    m = train_hybrid()
    print(f"Hybrid titer model on BATCH-2026-001 state ({m['n_train']} train / {m['n_test']} test, "
          f"qP={m['qP']} g per 1e6 cell-day/mL):")
    print(f"  mechanistic only  R2={m['mech']['r2']:.4f}  RMSE={m['mech']['rmse']:.4f} g/L")
    print(f"  pure NN           R2={m['pure_nn']['r2']:.4f}  RMSE={m['pure_nn']['rmse']:.4f} g/L  ({m['n_params']} params)")
    print(f"  HYBRID (mech+NN)  R2={m['hybrid']['r2']:.4f}  RMSE={m['hybrid']['rmse']:.4f} g/L")
    assert m["hybrid"]["rmse"] <= m["mech"]["rmse"], "hybrid should beat mechanistic-only"
    print("ASSERT ok: the residual network lowers RMSE below the mechanistic backbone.")

The run prints:

Hybrid titer model on BATCH-2026-001 state (235 train / 101 test, qP=0.04049 g per 1e6 cell-day/mL):
  mechanistic only  R2=0.9865  RMSE=0.1983 g/L
  pure NN           R2=0.9995  RMSE=0.0370 g/L  (801 params)
  HYBRID (mech+NN)  R2=0.9998  RMSE=0.0228 g/L
ASSERT ok: the residual network lowers RMSE below the mechanistic backbone.

Read the three lines as the argument for hybrid modeling, and read them with the split in mind. The mechanistic-only backbone, a single fitted constant qP=0.04049 g/L per 10⁶ cell-days/mL (about 40 pg per cell per day), already reaches R2=0.9865 on held-out hours, which is why first principles are such a powerful prior in bioprocess: one physically meaningful number explains the bulk of a 14-day titer curve. The pure NN, with 801 parameters and no physics, drives RMSE down to 0.0370 g/L on this clean, single-batch data, but it has no trend to fall back on, only the correlations it fit, so it should be expected to degrade faster off-distribution. The hybrid scores best of the three, R2=0.9998 and RMSE=0.0228 g/L, because its network only had to learn the small stationary-phase curvature on top of a backbone that was already nearly right; the residual target is tiny and easy to fit. All three numbers carry the caveat the PLS section taught: this is a row-wise split of one batch, so every score here is interpolation within a single run, not next-batch performance. The hybrid 0.9998 has exactly the same leakage problem as the PLS 0.9944, and none of these numbers is what you would put before a reviewer for a new batch. On this single golden batch all three score high; the hybrid's real advantage is expected to show on a new batch, where the mechanistic trend still holds and the pure NN's fitted correlations may not. That is the point of the pattern in the cold-start, few-batch regime that defines bioprocess: the physics covers what is already known, the labels you have are spent on the part you actually don't know, and the network's job shrinks to a small correction it can learn from a handful of runs.

Two further uses of the twin are worth naming. First, ML-surrogate MPC for feeding: once you have a fast hybrid forward model, you can wrap it in model predictive control — at each step, simulate the next several hours under each candidate feed profile, pick the one that best hits the titer or glucose objective subject to the lactate and osmolality constraints, actuate the first move, and re-plan at the next step. The hybrid model is the cheap, differentiable surrogate that makes this real-time optimization tractable where a full mechanistic kinetic-CFD simulation would be far too slow to evaluate hundreds of candidate profiles per control interval. Second, soft-sensing the unmeasured states: the twin can estimate quantities you cannot probe at all (specific productivity, the true viable fraction, the dead-cell load) by reconciling the mechanistic state with whatever measurements arrive — a learned analogue of a Kalman filter (the classic recursive estimator that blends a model's prediction with each new noisy measurement to track a hidden state), where the physics provides the prediction and the sparse bench references provide the correction.
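
The predict-then-correct idea behind the second use can be sketched with the classic scalar Kalman filter the paragraph alludes to. This is the textbook estimator, not a learned one, and not code from the suite: the model increments, the bench values, and the two variances are invented, and the hidden state is a single number.

```python
def kalman_track(model_increments: list[float], bench: dict[int, float],
                 q: float = 0.01, r: float = 0.04,
                 x0: float = 0.0, p0: float = 1.0) -> list[tuple[float, float]]:
    """Track one hidden state; returns (estimate, variance) after each step.

    model_increments: the model's predicted change per step (the 'physics').
    bench: sparse reference measurements keyed by step index.
    q, r: assumed process and measurement variances.
    """
    x, p, out = x0, p0, []
    for k, inc in enumerate(model_increments):
        x, p = x + inc, p + q                  # predict: uncertainty grows every step
        if k in bench:                         # correct only when a reference arrives
            gain = p / (p + r)
            x = x + gain * (bench[k] - x)
            p = (1.0 - gain) * p
        out.append((x, p))
    return out


# The model under-predicts slightly; bench samples arrive at steps 11 and 23.
est = kalman_track([0.1] * 24, bench={11: 1.5, 23: 3.0})
for k in (10, 11, 12, 23):
    print(f"step {k:2d}: estimate={est[k][0]:.3f}  variance={est[k][1]:.4f}")
```

Between references the variance climbs by q per step, and each bench sample pulls both the estimate and the variance back down. That sawtooth is the quantitative form of a sensor that knows how far it is from its last reference.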

Anatomy of one soft-sensor prediction

A soft-sensor reading is never a bare number. Like every artifact in this series, its value is in what travels with it — the spectrum that produced it, the preprocessing and model version behind it, the uncertainty around it, and the bench reference that will eventually grade it. Pull one prediction apart and the whole chapter is laid out as fields.

One soft-sensor prediction, fully unpacked: the raw and preprocessed Raman spectrum and aligned online state that fed it, the confident titer and metabolite estimates with bands and the latent variables behind them, the deliberately wide VCD estimate that flags the weak spot, the bench reference that will grade it, and the relationships — calibration set, model version, the controller it can drive, and the human-in-the-loop line for any CQA-affecting action — that make it governable. Original diagram by the authors, created with AI assistance.

Read the card field by field:

Header — identity and provenance. model: soft_sensor_pls v1, subject: BR-101, batch: BATCH-2026-001, t = batch-hour 168. Naming the model version is not decoration: the calibration is bound to a specific probe, flow cell, and cell line, so the version pins which validated procedure produced this number. A reading with no model version is an orphan.
Input — the cheap, fast signal. The raw 701-channel spectrum (wn_400 … wn_1800) as a sparkline, its SNV-then-Savitzky-Golay-preprocessed twin beside it, and the aligned online state (BR101.Temp.PV = 36.5 °C, BR101.pH.PV = 7.04, dissolved oxygen). Both the raw and the preprocessed spectrum appear so that a reviewer can see what the preprocessing removed (the baseline curvature and scatter) and confirm it did not erase a real peak.
Green core — the confident prediction. Titer in g/L with a confidence band, then glucose, lactate, glutamine, and ammonia each with their own band — the targets that have direct molecular bands. Every estimate is stamped with the five PLS latent variables that produced it, so the number traces back through five scores to the spectrum; this is the interpretability that a deep net would not give you.
Amber weak-spot — honesty as a field. The VCD estimate is carried with a deliberately wide uncertainty band and an explicit note: Raman has no direct band for cell count; this rides on a confounded proxy — prefer the capacitance probe. The chapter's central caveat is not a footnote here; it is a structural field on the card, so no downstream consumer can read VCD with the same trust as titer.
Reconciliation — the only real grade. The twice-daily bench reference titer and metabolites, and the residual of each prediction against that reference. This is the lagging signal drift detection has to live with: between two bench samples, a drifting sensor looks identical to a healthy one, so the residual row is where model decay eventually surfaces.
Violet relationships — what makes it governable. Links to the calibration batches the model was bound to, the dataset hash, the model and preprocessing version, the closed-loop glucose controller the glucose estimate can drive, and the human-in-the-loop boundary that any action affecting a CQA must not cross silently.

Read top to bottom, the card is the chapter compressed: a cheap signal, a confident core, an honestly-flagged weak spot, a lagging grade, and the governance edges that decide what the number is allowed to do.
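
One way to make the card concrete is as a typed record. The sketch below is a hypothetical schema, not the suite's: the class and field names are invented, the header values come from the card described above, and every numeric estimate, latent score, and the hash are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Estimate:
    value: float
    lower: float
    upper: float
    unit: str
    note: str = ""

    def relative_width(self) -> float:
        """Band width as a fraction of the estimate; a crude trust indicator."""
        return (self.upper - self.lower) / abs(self.value)


@dataclass(frozen=True)
class SoftSensorPrediction:
    model: str
    model_version: str
    subject: str
    batch: str
    batch_hour: int
    estimates: dict[str, Estimate]
    latent_scores: tuple[float, ...]
    bench_reference: dict[str, float] = field(default_factory=dict)
    dataset_hash: str = ""
    requires_human_approval: bool = True     # the human-in-the-loop boundary

    def residuals(self) -> dict[str, float]:
        """Prediction minus bench reference, for every target that has a reference."""
        return {k: self.estimates[k].value - ref
                for k, ref in self.bench_reference.items() if k in self.estimates}


card = SoftSensorPrediction(
    model="soft_sensor_pls", model_version="v1", subject="BR-101",
    batch="BATCH-2026-001", batch_hour=168,
    estimates={  # placeholder numbers
        "titer": Estimate(2.10, 1.95, 2.25, "g/L"),
        "vcd": Estimate(12.0, 7.0, 17.0, "1e6 cells/mL",
                        note="confounded proxy; prefer the capacitance probe"),
    },
    latent_scores=(0.8, -0.3, 0.1, 0.05, -0.02),
    bench_reference={"titer": 2.05},
    dataset_hash="sha256:<placeholder>",
)
print(card.residuals())
print({k: round(e.relative_width(), 2) for k, e in card.estimates.items()})
```

Making the record frozen mirrors the contemporaneous-record idea: a prediction is written once, and the later bench grade belongs in a new record rather than in an edit to the old one.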

What makes the soft sensor trustworthy: the semantic layer beneath it

The card's violet edges are not decoration — they are the difference between a number an auditor can defend and a number that is an orphan. Each edge is, formally, a piece of a knowledge graph, and grounding the soft sensor in an ontology is what makes its features stable, its training data complete, and its validation honest. Four threads run straight out of the books beside this one.

A feature pulled by its meaning, not its column name. The model's input — BR101.Glucose.PV — is the dotted unit.measurement.role tag the data chapter defined, but its real anchor is one level deeper. That tag resolves to a position in the ISA-95 equipment hierarchy (the IEC 62264 object model that says what a unit, an equipment element, and a process value formally are) and reaches the model over OPC UA, the vendor-neutral protocol the data shadow describes for moving tagged readings — though, as that chapter notes, no OPC UA Companion Specification for a mammalian-cell bioreactor exists yet, so the semantics of BR101.Glucose.PV still vary plant to plant until an ontology pins them. A feature pulled by its ontology IRI (Internationalized Resource Identifier — a global, web-style name) rather than a fragile spreadsheet column survives a historian rename, a vendor swap, or a site transfer; a feature pulled by string match silently feeds the model a different quantity the day someone renames a tag. The semantic-interoperability discipline — map every source tag once to a canonical, ontology-grounded node — is exactly what lets the same glucose feature mean the same thing across the historian, the MES, and the LIMS that supply this model its labels.

SHACL as the training-data completeness gate. The same closed-world shape that gates a lot at release can gate the model's inputs. The release gate is a SHACL sh:NodeShape (Shapes Constraint Language — the W3C standard for validating that graph data has the required structure) that refuses a lot record missing a required result or carrying one out of range. Point the same kind of shape at a training row and it answers the question OWL cannot — is a required field missing? — for every spectrum: every calibration point must carry its bound bench reference, its units (anchored to a unit ontology like QUDT so Celsius and Kelvin are concepts a reasoner relates, not spellings), its in-range process state, and its dataset hash, or it is refused before it ever reaches the fit. A SHACL-validated training set is the upstream guarantee that the R2=0.9944 was earned on complete, in-range data rather than on a table with silent holes — the model-governance card the open-source analytics chapter draws as a sh:NodeShape in spirit, now applied to the data the model learned from.
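
As a plain-Python stand-in for what such a shape checks (this is not SHACL, and it validates dictionaries rather than graph data), the sketch below refuses a training row that is missing a required field or carries an out-of-range process value. The field names and ranges are hypothetical.

```python
REQUIRED_FIELDS = ("spectrum_id", "bench_titer_g_L", "titer_unit",
                   "temp_C", "pH", "dataset_hash")            # hypothetical field names
ALLOWED_RANGES = {"temp_C": (30.0, 40.0), "pH": (6.5, 7.5)}   # illustrative limits


def violations(row: dict) -> list[str]:
    """Closed-world check: anything required and absent is a violation, not an unknown."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS
                if row.get(name) in (None, "")]
    for name, (low, high) in ALLOWED_RANGES.items():
        value = row.get(name)
        if isinstance(value, (int, float)) and not (low <= value <= high):
            problems.append(f"{name} out of range")
    return problems


def gate(rows: list[dict]) -> list[dict]:
    """Keep only rows with no violations; refused rows never reach the fit."""
    return [row for row in rows if not violations(row)]


good = {"spectrum_id": "S-0001", "bench_titer_g_L": 2.05, "titer_unit": "g/L",
        "temp_C": 36.5, "pH": 7.04, "dataset_hash": "sha256:<placeholder>"}
no_reference = {**good, "bench_titer_g_L": None}
too_hot = {**good, "temp_C": 41.0}
print(violations(no_reference), violations(too_hot), len(gate([good, no_reference, too_hot])))
```

A real gate would express the same constraints declaratively so they can be versioned and audited alongside the data; the Python form only shows which questions get asked of each row.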

Lineage edges as the grouping key for honest validation. The leakage trap the PLS section dwelt on has a semantic fix. A prov:wasDerivedFrom / bp:derivedFrom edge — PROV-O, the W3C provenance vocabulary that the genealogy chapter builds the digital thread from — records which batch each spectrum came from as a graph edge you can walk. That edge is precisely the grouping key for leave-one-batch-out cross-validation: grouping by the lineage IRI, not by a guessed column, is what guarantees no row of a test batch leaks into training. When transfer.py and drift.py reach the multi-batch regime, the honest split is a SPARQL traversal of the provenance graph, not a string heuristic — the ontology is what makes the batch-grouped number the one you can put before a reviewer.
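
Once each spectrum carries the IRI of the batch it was derived from, leave-one-batch-out splitting is a few lines. The sketch below assumes the lineage lookup has already been done and arrives as one IRI per row; the IRIs are made-up examples. scikit-learn's LeaveOneGroupOut provides the same kind of split; the hand-rolled generator is used here only to keep the grouping logic visible.

```python
from collections import defaultdict
from typing import Iterator


def leave_one_batch_out(batch_iris: list[str]) -> Iterator[tuple[str, list[int], list[int]]]:
    """Yield (held_out_batch, train_rows, test_rows); no batch appears on both sides."""
    rows_by_batch: dict[str, list[int]] = defaultdict(list)
    for row_index, iri in enumerate(batch_iris):
        rows_by_batch[iri].append(row_index)
    for held_out in sorted(rows_by_batch):
        test_rows = rows_by_batch[held_out]
        train_rows = [i for iri, rows in rows_by_batch.items()
                      if iri != held_out for i in rows]
        yield held_out, train_rows, test_rows


# One lineage IRI per spectrum (hypothetical identifiers).
lineage = ["urn:example:batch/A", "urn:example:batch/A", "urn:example:batch/B",
           "urn:example:batch/C", "urn:example:batch/B", "urn:example:batch/C"]
folds = list(leave_one_batch_out(lineage))
for held_out, train_rows, test_rows in folds:
    print(held_out, "train:", train_rows, "test:", test_rows)
```

Grouping on the identifier rather than on row position is what keeps the split honest when rows from several batches are interleaved, as they are in this example.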

BFO keeps a measurement distinct from the run. Finally, the upper-ontology distinction the classes-and-taxonomy chapter draws from BFO (Basic Formal Ontology) keeps the card's fields from ever collapsing into each other: a soft-sensor prediction at batch-hour 168 is a continuant (an entity that persists and bears qualities — the titer estimate exists as a value), while the 14-day fed-batch run that produced it is an occurrent (a process that happens and is over). Conflating the two — treating a measurement as if it were the run — is the modeling error that makes a graph answer "what was DP-004 derived from?" with confident nonsense; keeping them typed is what lets a GraphRAG LLM be grounded against the graph rather than hallucinate, the lesson the ontologies-and-AI chapter builds in full. The soft sensor's number is only as trustworthy as the semantics of the things it points at.

The soft sensor as a governed analytical procedure

Because the soft sensor's output can feed an in-process control decision, it is not a dashboard widget but a regulated analytical procedure, and it inherits the full data-integrity apparatus. Every prediction must be ALCOA+ (Attributable, Legible, Contemporaneous, Original, Accurate — plus Complete, Consistent, Enduring, Available): attributable to the model version and probe that produced it, contemporaneous because the spectrum is timestamped as it is read, and complete because the SHACL gate above refuses a record with a missing field. The model itself is an electronic record under 21 CFR Part 11 (and its EU counterpart Annex 11, modernized in a 2025 draft to address networked, multi-system integrity), so the locked, versioned calibration carries an audit trail of who deployed it and when.

The validation lens has shifted with it. The old CSV (Computerized System Validation) mindset would script every screen of the soft-sensor application; the FDA's CSA (Computer Software Assurance) successor asks instead for critical-thinking, risk-based assurance — heavy testing where the model touches a CQA (the glucose estimate that can drive the feed), light testing on a read-only display. A soft sensor wired straight to the pump is high-risk and earns the full process-validation and change-control weight; the same model advising an operator earns far less. That risk gradient is the same line the MLOps chapter draws, and it is why the EU/PIC/S draft Annex 22 singles out self-learning systems for the locked-model expectation: a model that silently retrained itself in production would break the contemporaneous, versioned record this whole apparatus depends on.

The unsolved part: the VCD weak spot and the model that decays

Two honest difficulties sit under this chapter, and both are about the limits of a soft sensor rather than its successes.

The first is the VCD weak spot itself, argued above and worth restating as a standing engineering fact. The single most operationally useful upstream quantity, how many living cells are in the tank, is the one no in-line probe reads cleanly. Raman gives a confounded proxy that drifts run-to-run because it is reading turbidity and metabolite correlations, not a cell count. Capacitance gives a genuine but viability-sensitive biovolume signal that blurs as the dead fraction climbs and membranes lose integrity. Crucially, the two probes fail in different directions: Raman's proxy can hold while capacitance fades, or vice versa, so even fusing them does not produce a clean count, only a better-hedged estimate. Image-based and other approaches remain research. A plant that needs VCD continuously is, today, stitching together an imperfect capacitance reading, a Raman correlation, and the twice-daily count, and accepting that the live VCD trace is the least trustworthy line on the dashboard. This is not a failure of effort; it is a genuine, fifteen-year, still-open problem, and any product claim of "Raman VCD soft sensing" deserves to be met with two questions: under what split was it validated, and how does it transfer to a new batch?

The second is model decay, the consequence of the cold-start cadence every soft sensor inherits. A Raman calibration is bound to its probe, its cell line, and its medium; the moment any of those moves — a new medium lot with a slightly different baseline, a fouled flow cell that scales the spectrum, a probe swapped at maintenance, a clone change that shifts productivity — the calibration begins to drift, and because the bench reference only arrives twice a day, drift is detected late. A soft sensor that began over-reading titer at breakfast is not provably wrong until the evening sample returns, and between those two points it looks identical to a healthy one — same smooth trace, same plausible numbers, same green band. The same physics that makes the soft sensor valuable — it interpolates the sparse truth — is exactly what makes its failures slow to surface, because the only ground truth that can contradict it is the sparse truth it was built to replace. This is why production soft sensors are governed objects with a re-calibration trigger, a residual monitor that watches the prediction-versus-reference gap, and a locked-model change-control plan, not fire-and-forget regressions; the MLOps chapter builds that lifecycle out in full. The honest production soft sensor is one that reports its uncertainty, widens its band as it gets further from the last reference, and hands every consequential decision back to a human at the points the regulation requires — a sensor that knows how stale it is.
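
Two of the governance pieces named above, a band that widens with staleness and a residual monitor that triggers re-calibration, can be sketched in a few lines. The linear widening rule, the limit, and the two-in-a-row trigger are all assumptions chosen for illustration; a validated procedure would set them from the method's own performance data.

```python
def band_half_width(base: float, hours_since_reference: float,
                    growth_per_hour: float) -> float:
    """Uncertainty half-width that widens linearly with time since the last bench sample."""
    return base + growth_per_hour * max(0.0, hours_since_reference)


def recalibration_due(residuals: list[float], limit: float, consecutive: int = 2) -> bool:
    """Flag when the last `consecutive` residuals all exceed the limit on the same side.

    residuals: prediction minus bench reference, oldest first.
    Requiring the same sign separates a systematic bias from scatter around zero.
    """
    recent = residuals[-consecutive:]
    if len(recent) < consecutive:
        return False
    return all(r > limit for r in recent) or all(r < -limit for r in recent)


# Band right after a bench sample versus six hours later (illustrative numbers, g/L).
print(band_half_width(0.10, 0.0, 0.02), band_half_width(0.10, 6.0, 0.02))

# Twice-daily residuals: a sensor that starts over-reading and stays there.
print(recalibration_due([0.02, 0.30, 0.35], limit=0.25))
```

With only two references a day, a two-in-a-row rule cannot fire until a full day after drift begins. That delay is the cost of guarding against a single bad bench sample, and it is why the band has to carry the uncertainty in the meantime.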

What this chapter adds to the model suite

This chapter contributes three modules to the Book 5 example suite — the most of any single chapter, which befits the flagship:

examples/platform/ml/soft_sensor_pls.py — the PLS titer soft sensor from in-line Raman. SNV + Savitzky-Golay preprocessing is actually applied (via dataio.snv_savgol, not a plain scaler on raw intensities); n_components is selected by an inner-CV one-SE rule rather than hard-coded (it lands at 5); a VIP computation names the amide III bands the model leans on; and a per-prediction applicability-domain gate (Hotelling T2 and SPE on the latent scores) flags a corrupted spectrum. All of this runs over the 701 wavenumbers of raman_spectra.parquet (702 reported coefficients, five free directions), with a CI assertion that the dataset is genuinely predictive of titer (R2 > 0.85; the run lands at R2=0.9944, RMSE=0.127 g/L). It is the chemometric baseline every fancier model must beat, and its sibling soft_sensor_deep.py confirms that a 5,713-parameter CNN does not beat it. A second sibling, soft_sensor_split_demo.py, demonstrates the row-wise-versus-temporal leakage contrast on the single golden batch (R2=0.9927 leaked vs R2=-0.6325 honest) using the shared dataio split; true batch-grouped generalization needs the multi-batch data of transfer.py/drift.py.
examples/platform/ml/hybrid_model.py — the gray-box titer twin: a mechanistic IVCD × qP backbone (qP=0.04049, R2=0.9865) corrected by an MLP residual over the process state in fedbatch_state.parquet, benchmarked against an 801-parameter pure-NN baseline, with a CI assertion that the hybrid lowers RMSE below the mechanistic backbone (the hybrid lands at R2=0.9998, RMSE=0.0228 g/L). It is the runnable core of the digital-twin section and the pattern the seed-train readiness model and downstream chromatography models both reuse.
examples/platform/ml/mpc_loop.py — the advisory receding-horizon MPC of the glucose feed: at each control step it reads a noisy soft-sensor glucose estimate, rolls a mechanistic digital twin forward a short horizon under a grid of candidate feeds, and proposes the bolus that best holds the 4.0 g/L setpoint, with a CI assertion that the advisory loop tracks glucose better than open-loop without crashing the culture (closed-loop tracking RMSE 1.09 vs open-loop 3.37 g/L). It is the runnable core of the closed-loop control section, and it keeps the chapter's framing honest: the controller proposes, a human or qualified automation retains authority.
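
The advisory receding-horizon pattern in the last item can be sketched on a toy plant. This is not the code in mpc_loop.py: the one-line twin, the uptake values, the noise level, the horizon, and the candidate grid are all hypothetical, chosen only to show the read, roll forward, propose cycle around the 4.0 g/L setpoint.

```python
import random

SETPOINT = 4.0  # g/L glucose setpoint named in the text


def twin_step(glucose, feed, uptake=0.6):
    """Hypothetical one-step twin: fixed uptake per step, bolus adds directly (g/L)."""
    return max(glucose - uptake, 0.0) + feed


def propose_feed(glucose_estimate, candidates, horizon=3):
    """Return the candidate feed whose rolled-forward trajectory stays closest to setpoint."""
    best_feed, best_cost = None, float("inf")
    for feed in candidates:
        glucose, cost = glucose_estimate, 0.0
        for _ in range(horizon):
            glucose = twin_step(glucose, feed)
            cost += (glucose - SETPOINT) ** 2
        if cost < best_cost:
            best_feed, best_cost = feed, cost
    return best_feed


def rmse(trace):
    """Root-mean-square deviation of a glucose trace from the setpoint."""
    return (sum((g - SETPOINT) ** 2 for g in trace) / len(trace)) ** 0.5


def simulate(advisory, steps=20, seed=0):
    """Run a toy plant whose true uptake rises over the run, unlike the twin's."""
    rng = random.Random(seed)
    grid = [round(0.1 * i, 1) for i in range(21)]  # candidate boluses, 0.0 to 2.0 g/L
    glucose, trace = SETPOINT, []
    for k in range(steps):
        if advisory:
            estimate = glucose + rng.gauss(0.0, 0.2)  # noisy soft-sensor reading
            feed = propose_feed(estimate, grid)       # a proposal; approval sits elsewhere
        else:
            feed = 0.6                                # fixed open-loop schedule
        true_uptake = 0.3 + 0.05 * k                  # deliberate plant/twin mismatch
        glucose = max(glucose - true_uptake, 0.0) + feed
        trace.append(glucose)
    return rmse(trace)


closed_rmse = simulate(advisory=True)
open_rmse = simulate(advisory=False)
```

Only the first move of each horizon is ever proposed, and the whole optimisation is repeated at the next reading. That re-planning is what lets the loop absorb the mismatch between the twin's assumed uptake and the plant's real one.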

Together they make the production bioreactor a node the model suite can actually run on, and they encode the chapter's central truths in executable form: the soft sensor is real (the PLS assertion holds), the hybrid beats both pure physics and pure ML (the residual assertion holds), and the advisory controller tracks glucose far closer than open-loop while only ever proposing the feed (the MPC assertion holds).

Why it matters

The production bioreactor is where machine learning in biomanufacturing is most real and most tested. A spectroscopic soft sensor turns the once-or-twice-a-day metabolite count into a continuous trace, so an operator sees glucose and lactate and titer climbing in real time instead of guessing between bench samples — and that is the one ML capability in upstream that is genuinely (production) today. Closed-loop glucose control turns that trace into action, holding the feed tight to suppress lactate and stabilize quality — real, demonstrated, and still mostly (pilot). The hybrid digital twin forecasts the whole run, so a process can be designed and steered on a model instead of on trial batches — powerful, and still mostly a development-and-pilot tool. And the VCD weak spot is the standing reminder that a soft sensor is only as good as the physical link beneath it: where that link is a molecular band, ML is trustworthy; where it is a drifting proxy, ML is a confident guess. Get the soft sensor right and the most data-rich step in manufacturing becomes its most observable one; over-trust it — especially on VCD — and you have built a dashboard that lies slowly. The whole upstream has been building toward this tank; this is where its data finally becomes knowledge, and where the honesty about ML's limits matters most.

In the real world

In-line Raman with SNV/derivative preprocessing and PLS chemometrics, soft-sensing glucose, lactate, and titer in CHO culture, is (production) practice — the strongest, most mature ML deployment anywhere in upstream biomanufacturing, established in the literature for over a decade [1][2]. The platforms are real and named: Sartorius's SIMCA and SIMCA-online for the chemometrics, and BioPAT Spectro for the in-line spectroscopy. The strongest first-party deployment anchor is Amgen's facility at Juncos, Puerto Rico, where SIMCA OPLS models predict harvest titer and other in-process attributes inside commercial GMP drug-substance manufacturing — a deployed (production) MVDA (multivariate data analysis — the PLS/OPLS family this chapter built) model at the upstream-downstream boundary, reported first-party by Amgen engineers together with a Sartorius vendor case study (vendor-self-reported / self-authored evidence tier), not independently audited [3]. The model class there is OPLS (orthogonal PLS), a PLS variant that separates the target-predictive variation from the orthogonal "everything else," which makes the regression vector even easier to interpret for a reviewer — the same chemometric lineage as the baseline this chapter built, hardened for GMP. SIMCA and SIMCA-online are closed, commercial tools, and the chapter's own modules deliberately reproduce their core in open source: the PLSRegression, T2/SPE applicability domain, and VIP that ship here are the same chemometrics those suites wrap, built on scikit-learn with fixed seeds and a run_all.py so the whole pipeline is reproducible without a six-figure license — the open-source path the analytics chapter takes to do CPV-grade soft sensing without a commercial statistics suite. What the commercial suites add is not better math but validated packaging, support, and a GMP-qualified audit trail.
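
To show how little machinery the T2/SPE applicability domain needs, here is a hand-rolled sketch with a single latent variable in pure Python. It is not the chapter's module, which uses scikit-learn's PLSRegression with five components. The synthetic five-channel "spectra", the helper names fit_pls1, diagnose, and in_domain, and the use of training maxima as limits (in place of the usual F- or chi-square-based control limits) are all assumptions made to keep the example self-contained.

```python
import random


def fit_pls1(X, y):
    """Fit a one-latent-variable PLS1 model (a single NIPALS step) on centred data."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [v - y_mean for v in y]
    # Weights: the unit-length direction in X with maximum covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    w_norm = sum(v * v for v in w) ** 0.5
    w = [v / w_norm for v in w]
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]               # scores
    tt = sum(v * v for v in t)
    p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # loadings
    q = sum(a * b for a, b in zip(yc, t)) / tt                           # y-loading
    model = {"x_mean": x_mean, "y_mean": y_mean, "w": w, "p": p, "q": q,
             "score_var": tt / (n - 1)}
    # Stand-in limits: the largest values seen in training.
    train = [diagnose(model, row) for row in X]
    model["t2_limit"] = max(d["t2"] for d in train)
    model["spe_limit"] = max(d["spe"] for d in train)
    return model


def diagnose(model, x):
    """Return the prediction, Hotelling T2, and SPE for one spectrum."""
    xc = [a - b for a, b in zip(x, model["x_mean"])]
    t = sum(a * b for a, b in zip(xc, model["w"]))
    residual = [a - t * b for a, b in zip(xc, model["p"])]  # part off the latent axis
    return {"y_hat": model["y_mean"] + model["q"] * t,
            "t2": t * t / model["score_var"],
            "spe": sum(r * r for r in residual)}


def in_domain(model, x):
    """Gate: trust the prediction only if both diagnostics sit inside training limits."""
    d = diagnose(model, x)
    return d["t2"] <= model["t2_limit"] and d["spe"] <= model["spe_limit"]


rng = random.Random(1)
peak = [0.0, 1.0, 2.0, 1.0, 0.0]                 # hypothetical band shape
titers = [0.5 * k for k in range(1, 11)]         # synthetic targets, 0.5 to 5.0
spectra = [[c * h + rng.gauss(0.0, 0.01) for h in peak] for c in titers]
model = fit_pls1(spectra, titers)

clean = spectra[4]
spiked = [v + (5.0 if j == 4 else 0.0) for j, v in enumerate(clean)]  # off-model spike
```

The spiked spectrum barely moves the score, so its prediction still looks plausible and T2 stays quiet. It is SPE, the energy the model cannot explain, that trips the gate, which is why the two statistics are checked together.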

Closed-loop glucose control via Raman plus deep learning has been demonstrated at Amgen, and it sits at (pilot): a real demonstrated capability, peer-reviewed, but not the routine way commercial batches are fed under a locked validated loop [4]. Hybrid mechanistic-plus-ML twins are (pilot) too. Sartorius's process-twin work is the canonical example: it pairs a parsimonious dynamic flux-balance backbone (PC-dFBA, a compact first-principles model of how carbon and nutrients flow through the cell's metabolism over time) with a neural-network VCD/state estimate. That is the same physics-carries-the-trend, ML-corrects-the-residual pattern this chapter's hybrid_model.py implements, useful for design and prediction at development scale, not a commercial autopilot [5]. Two misattributions recur often enough to correct by name. First, the Boehringer Ingelheim 16-attribute in-line Raman work on Protein A capture chromatography (downstream, not in the bioreactor) used a k-nearest-neighbours model, not deep learning, and should be cited as the KNN multi-attribute demonstration it is [6]. Second, National Resilience's widely cited "+50% titer" perfusion result (perfusion is the continuous-culture alternative to fed-batch, constantly exchanging fresh medium while removing spent) is a PAT-plus-manual-feed-optimization story from a vendor press release, not an ML deployment, and must never be presented as machine learning lifting titer [7]. And the standing caveat behind all of it: VCD soft-sensing remains the weak spot. No in-line probe reads viable cell density as cleanly as Raman reads titer, which is why dielectric capacitance, not Raman, is the probe of choice when VCD is the target, and why the live cell-density trace is still the least trustworthy line on the upstream dashboard.

The throughline that the FDA's 2023 discussion paper and the ISPE Pharma 4.0 survey keep finding holds here exactly: AI/ML in this part of the plant is strongest as human-in-the-loop monitoring and soft sensing, thinner as autonomous closed-loop control of a CQA [8].

Key terms

CQA (Critical Quality Attribute) — a measurable product property (e.g. glycosylation, aggregate level) that must stay within limits for the drug to be safe and effective; autonomously moving an input that affects a CQA is what raises the regulatory bar on closed-loop control.
GMP (Good Manufacturing Practice) — the legally-enforced quality system commercial drug batches must be made under; the line a capability must cross to be "production-grade."
PAT (Process Analytical Technology) — measuring and controlling quality in real time during the process (e.g. in-line Raman) rather than only in a lab afterward.
Soft sensor (virtual / inferential sensor) — a model that estimates a hard-to-measure quantity (titer, metabolites, VCD) from easy-to-measure online signals (a spectrum, the online state), continuously and without an offline sample.
Raman spectroscopy — laser-scatter spectroscopy in which a faint inelastically-scattered fraction of light carries molecular-vibration bands; the workhorse in-line probe for bioprocess because water scatters it weakly. Our datasets carry 701 channels, wn_400 … wn_1800.
NIR spectroscopy — near-infrared absorption spectroscopy; cheaper and faster than Raman but absorbed strongly by water, limiting it in a cell-dense broth.
Dielectric / capacitance spectroscopy — radio-frequency permittivity sensing of intact cell membranes; roughly proportional to viable biovolume, giving it the one direct physical line to VCD that Raman lacks.
2D fluorescence (EEM) — excitation–emission-matrix spectroscopy that measures the native fluorescence of aromatic residues and cofactors (tryptophan, NAD(P)H) across an excitation × emission grid; a soft-sensor input for biomass and some metabolites, complementary to Raman and itself a source of the fluorescence baseline Raman preprocessing removes.
PLS (Partial Least Squares) regression — the chemometric workhorse: it compresses many collinear wavenumbers into a few latent variables chosen to maximize covariance with the target via a shared score matrix for X and y, then regresses on those; beats OLS and (usually) deep nets in the small-data Raman regime.
OPLS (orthogonal PLS) — a PLS variant that splits target-predictive variation from orthogonal variation, easing interpretation; the class behind Amgen's Juncos harvest-titer models.
Latent variable — a linear combination of wavenumbers; the few components (here five) PLS keeps instead of the raw 701 channels.
SNV (Standard Normal Variate) — per-spectrum centring and scaling (x' = (x − x̄)/s) that removes multiplicative scatter; leak-free because it is computed row by row from that row's own statistics.
Savitzky-Golay derivative — sliding-window least-squares polynomial fit returning a smoothed value or derivative; a first/second derivative removes spectral baseline and sharpens peaks.
Applicability domain (AD) — a per-prediction in-/out-of-domain gate built from Hotelling's T2 (how extreme the spectrum is inside the PLS latent plane) and SPE (how much of it lies off that plane); it flags a new spectrum the model should not be trusted on, the OOD self-check soft_sensor_pls.py runs against a corrupted spectrum.
VIP (Variable Importance in Projection) — a per-wavenumber importance score from a fitted PLS model; wavenumbers with VIP > 1 are the bands the prediction leans on, here clustering in the amide III region for the titer model.
Closed-loop control — a measure-decide-actuate loop (here, glucose every ~30 min) in which the soft sensor's estimate drives an actuator (the feed pump), not just a display; production loops wrap it in rate bounds, estimate sanity checks, and a scheduled-feed fallback.
Hybrid (gray-box) model — a first-principles backbone (titer = qP × IVCD) corrected by a small ML residual; physics carries the trend, ML corrects the curvature, succeeding with far fewer parameters than a pure black box. Parallel form adds the network's output to the physics; serial form has the network supply a physics parameter.
Digital twin — a model of the run that forecasts its trajectory, used to design, predict, and steer the process; here a hybrid forward model that can also drive ML-surrogate MPC for feeding.
IVCD — the integral of viable cell density over the run; total productive cell-hours, the mechanistic backbone's input for titer.
VCD weak spot — the persistent, unsolved difficulty of soft-sensing viable cell density: a count, not a molecule, so it has no clean Raman band and rides on a drifting proxy.
Model decay — the drift of a probe-bound calibration as the cell line, medium, or hardware moves; detected late because the bench reference arrives only twice a day.
Semantic feature grounding — pulling a model input by its ontology IRI and its ISA-95 / OPC UA position rather than a fragile column name, so the feature survives a tag rename, a vendor swap, or a site transfer and still means the same quantity.
SHACL training-data gate — the same closed-world sh:NodeShape that gates a lot at release, pointed at a training row to refuse any spectrum missing its bench reference, units, in-range state, or dataset hash before it reaches the fit.
bp:derivedFrom / PROV-O lineage — the W3C provenance edge recording which batch each spectrum came from; the grouping key that makes leave-one-batch-out cross-validation honest, walked as a SPARQL traversal rather than guessed from a column.
BFO continuant vs occurrent — the upper-ontology distinction (Basic Formal Ontology) that keeps a soft-sensor prediction (a continuant, a value that persists) typed apart from the fed-batch run that produced it (an occurrent, a process that is over), so a graph — and a GraphRAG LLM grounded on it — cannot confuse the measurement with the run.
ALCOA+ / CSA — the data-integrity attributes (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available) every prediction must carry, and the risk-based Computer Software Assurance successor to CSV that scales validation effort to where the model touches a CQA; the soft sensor is an electronic record under 21 CFR Part 11 / EU Annex 11.
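
Three of the terms above (SNV, IVCD, and the hybrid model's mechanistic backbone) reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names, the sample numbers, and the choice of the sample standard deviation in SNV are assumptions, and the qP value is a placeholder, not the fitted value the chapter reports.

```python
import statistics


def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum by its own statistics."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)  # sample sd (n - 1); an assumption of this sketch
    return [(x - mean) / sd for x in spectrum]


def ivcd(times, vcd):
    """Trapezoidal integral of viable cell density over time."""
    return sum((t1 - t0) * (v0 + v1) / 2.0
               for t0, t1, v0, v1 in zip(times, times[1:], vcd, vcd[1:]))


def backbone_titer(times, vcd, qp):
    """Mechanistic backbone from the glossary: titer = qP x IVCD."""
    return qp * ivcd(times, vcd)


raw = [10.0, 12.0, 15.0, 11.0]
scattered = [2.0 * x + 3.0 for x in raw]       # same shape, different gain and offset
same_shape = all(abs(a - b) < 1e-9 for a, b in zip(snv(raw), snv(scattered)))

days = [0.0, 1.0, 2.0, 3.0]
cells = [1.0, 2.0, 4.0, 8.0]                   # hypothetical VCD values
total = ivcd(days, cells)                      # 1.5 + 3.0 + 6.0 = 10.5
titer = backbone_titer(days, cells, qp=0.04)   # hypothetical qP, units assumed consistent
```

Because snv uses only the row it is given, it can be applied before any train/test split without leaking information, which is the property the SNV entry calls leak-free.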

Where this leads

The product is made and continuously watched; the tank is full of antibody suspended in a culture that is starting to die. The next chapter, Harvest and Clarification: Predicting the Endpoint, takes the soft sensors we just built — titer, VCD, viability — and turns them on the one decision the bioreactor models barely touched: when is this batch done? It learns the harvest endpoint as a constrained optimization, the turbidity of the feed you are about to clarify, and the filter you will need to clarify it — the hinge where everything upstream becomes a consequence downstream.
