
The Production Bioreactor: Soft Sensors, Closed-Loop Control, and the Digital Twin

📍 Where we are: Part III · Upstream, Learned — Chapter 11. The seed train handed a healthy, on-schedule inoculum across the SEED-001 → BATCH-2026-001 gate. Now the cells go into the production bioreactor BR-101, where the antibody is actually made — and where the soft-sensing sketched at seed scale becomes the most mature machine learning anywhere in upstream manufacturing.

This is the chapter the whole upstream points at. For roughly two weeks, a few thousand litres of CHO cells grow, switch to producing antibody, and pour out a torrent of sensor data — temperature, pH, dissolved oxygen every few seconds, and, on the best-instrumented tanks, an in-line Raman probe firing a full spectrum every minute. The production bioreactor is the single most data-rich step in the entire process, and it is also where machine learning has earned its only genuinely production-grade foothold in biomanufacturing: the spectroscopic soft sensor, a model that reads titer and metabolite concentrations off a Raman spectrum in real time, with no offline sample and no waiting for the lab.

But the bioreactor is also where the field's honesty is tested. The same probe that predicts titer beautifully predicts viable cell density poorly — a gap that fifteen years of work has not closed. Closed-loop control, where a model not only measures glucose but decides the feed and actuates the pump, exists in pilots but barely in GMP. And the most ambitious object in the chapter, the digital twin — a first-principles model of the run, corrected by a learned residual — is real, useful, and still mostly a development-and-pilot tool, not a thing that runs your commercial batch unattended. This chapter builds the soft sensor, builds the twin, and is exact about where each one stops.

The simple version

Imagine a master brewer who can taste a fermenting tank and tell you, instantly, how much alcohol it has made and how much sugar is left — no lab, no waiting. That tasting is a soft sensor: a cheap, instant signal (here, the light a laser scatters off the broth — its Raman spectrum) read by a trained model to estimate something you would otherwise have to measure slowly in a lab. The brewer is brilliant at "how much alcohol" because alcohol has a clear taste, but bad at "exactly how many yeast cells are alive" because that has almost no taste at all — which is precisely the production bioreactor's situation. And the very best brewer doesn't just taste; she acts — adds a little sugar when the tank runs low — which is closed-loop control. The digital twin is her mental model of how the whole fermentation will unfold, corrected every time reality surprises her.

What this chapter covers

  • The soft-sensing task — what in-line Raman, NIR, and dielectric spectroscopy actually measure, and the menu of targets (titer, glucose, lactate, glutamine, ammonia, VCD) ranked by how learnable each one is.
  • The PLS chemometrics pipeline — the real preprocessing (SNV, Savitzky-Golay derivatives) and the latent-variable regression that is the 40-year workhorse of spectroscopic PAT, with the math stated plainly.
  • Why VCD is the weak spot — the molecular reason titer and metabolites predict well while total/viable cell density does not, and why this is the field's persistent unsolved problem.
  • Closed-loop glucose control — the measure-decide-actuate loop that turns a soft sensor into an actuator, and the regulatory line it has not yet crossed at scale.
  • The hybrid digital twin — a mechanistic fed-batch backbone corrected by an ML residual, and ML-surrogate MPC for feeding, with two runnable modules.
  • The GMP reality — what is genuinely (production), what is (pilot), and what is (research), attributed correctly.

The soft-sensor task: reading chemistry off light

A soft sensor (also called a virtual sensor or inferential sensor) is a model that estimates a hard-to-measure quantity from easy-to-measure ones. In the production bioreactor the hard-to-measure quantities are the ones that matter most — antibody titer, the glucose and glutamine the cells are eating, the lactate and ammonia they excrete, and the viable cell density (VCD) — and the classical way to get them is to pull a sample twice a day and run it on a bench analyzer. That is the cold-start cadence the whole book keeps colliding with: a flood of cheap online signal, a trickle of expensive offline truth. The soft sensor's job is to interpolate the truth continuously from the signal.

Three in-line probe technologies dominate, and they see different things:

  • Raman spectroscopy shines a monochromatic laser into the broth and measures the light that scatters back at shifted wavelengths. Each shift (a Raman band, indexed by wavenumber in cm⁻¹) corresponds to a specific molecular vibration, so a Raman spectrum is a chemical fingerprint of everything dissolved in the medium. Our datasets carry exactly this: raman_spectra.parquet holds 701 intensity channels (wn_400 to wn_1800) read alongside the kinetic state. Raman is the workhorse of modern bioprocess PAT because water scatters Raman weakly, so the aqueous broth does not drown the signal.
  • NIR (near-infrared) spectroscopy measures absorption of near-infrared light by overtone and combination vibrations. It is cheaper and faster than Raman but water absorbs NIR strongly, so it is better suited to some metabolites and to drying/lyophilization than to a cell-dense broth.
  • Dielectric (capacitance) spectroscopy applies a radio-frequency field and measures the permittivity of the suspension. Intact cell membranes act as tiny capacitors, so capacitance is roughly proportional to the viable biovolume — which makes dielectric the one probe with a genuine, direct line to VCD, the very quantity Raman struggles with. We return to this below; it is the crux of the chapter's unsolved part.

The targets are not equally learnable, and ranking them honestly is the first real lesson:

| Target | Direct molecular Raman band? | Soft-sensing quality | Why |
| --- | --- | --- | --- |
| Glucose | yes (C-O, C-H stretches) | excellent | strong, specific bands; the canonical Raman target |
| Lactate | yes | excellent | strong band; rises monotonically, easy to track |
| Glutamine / glutamate | yes | good | clear bands, lower concentration |
| Ammonia | weak/indirect | fair | low concentration, weak signature; often inferred via correlation |
| Titer (mAb) | yes (protein amide bands) | very good | accumulating protein gives a growing, specific signal |
| VCD / total cell density | no direct signature | poor (the weak spot) | cells scatter light but have no clean Raman band for "count" |

The pattern in that last column is the whole chapter in miniature: the things with a direct molecular band in the spectrum predict well, and the thing that is a count of objects rather than a concentration of a molecule predicts badly. Hold that thought; it is the reason VCD soft-sensing is the field's persistent open problem.

The PLS chemometrics pipeline

A raw Raman spectrum is not model-ready, and the model that turns it into a titer is not a neural network — it is Partial Least Squares (PLS) regression, the 40-year-old workhorse of chemometrics, and in the small-data regime of bioprocess it is genuinely hard to beat [1]. The pipeline has two halves: preprocessing, then latent-variable regression.

Preprocessing removes the physics that has nothing to do with chemistry. A spectrum carries baseline drift (slow curvature from fluorescence and the instrument) and multiplicative scatter (the whole spectrum scaled up or down by probe fouling, bubbles, or a changing path length). Two standard, decades-old steps handle these, exactly as the data chapter introduced:

  • Standard Normal Variate (SNV): for each spectrum independently, subtract its mean and divide by its standard deviation. This removes multiplicative scatter and per-spectrum offset, so two spectra of the same broth taken through a slightly fouled versus a clean window become comparable. Because SNV is computed per-spectrum (row-wise), it carries no leakage across the dataset.
  • Savitzky-Golay derivatives: fit a low-order polynomial in a sliding window and take its first or second derivative. A first derivative kills a constant offset; a second derivative kills a linear baseline slope — while the polynomial smooths noise and sharpens overlapping peaks. Window length and polynomial order are real hyperparameters: too wide a window smears peaks, too narrow amplifies noise.
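The two preprocessing steps above can be sketched in a few lines. This is a minimal illustration on a synthetic single-band spectrum, not the book's own module: the function names `snv` and `sg_derivative`, the 15-point window, and the band position are all assumptions chosen for the demo.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(spectra: np.ndarray) -> np.ndarray:
    """Standard Normal Variate: centre and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


def sg_derivative(spectra: np.ndarray, window_length: int = 15,
                  polyorder: int = 2, deriv: int = 1) -> np.ndarray:
    """Savitzky-Golay smoothed derivative along the wavenumber axis."""
    return savgol_filter(spectra, window_length=window_length,
                         polyorder=polyorder, deriv=deriv, axis=1)


# Synthetic demo: one Gaussian band on 701 channels (400..1800 cm^-1, 2 cm^-1 step).
wn = np.linspace(400, 1800, 701)
band = np.exp(-0.5 * ((wn - 1125.0) / 20.0) ** 2)
clean = band[None, :]
fouled = 0.6 * clean + 0.3           # same chemistry, scaled down and offset
raw = np.vstack([clean, fouled])

after_snv = snv(raw)                 # the two rows become identical
after_sg = sg_derivative(after_snv)  # first derivative, same shape as the input
```

Because SNV uses only the statistics of the row it is transforming, the fouled and clean spectra collapse onto the same curve without looking at any other sample. The window length and polynomial order here are placeholders that would normally be tuned.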

Latent-variable regression then solves the problem that makes ordinary regression useless here. You have 701 wavenumber channels and, at any moment, perhaps a few hundred training points — far more features than examples, and the features are massively collinear (neighbouring wavenumbers move together). Ordinary least squares would overfit catastrophically: with more columns than rows the normal equations are singular, and the model would fit the calibration noise exactly and generalize to nothing. PLS sidesteps this by finding a handful of latent variables — linear combinations of the 701 wavenumbers — chosen to maximize covariance with the target, then regressing the target on those few components instead of the raw channels. Where PCA finds directions of maximum variance in X, PLS finds directions of maximum covariance between X and y, which is why it predicts better with fewer components. A typical bioprocess Raman model uses six to a dozen latent variables; our baseline uses six.
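The singularity argument can be checked numerically. The sketch below uses random numbers in place of spectra (an assumption made purely for illustration) with 200 rows and 701 columns, and shows both that the normal-equations matrix is rank-deficient and that least squares then fits pure noise exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_channels = 200, 701           # fewer rows than columns
X = rng.normal(size=(n_samples, n_channels))

gram = X.T @ X                             # the 701 x 701 matrix OLS would need to invert
rank = np.linalg.matrix_rank(gram)         # cannot exceed n_samples, so it is singular

# With more columns than rows, least squares reproduces even a pure-noise target.
y_noise = rng.normal(size=n_samples)
coef, *_ = np.linalg.lstsq(X, y_noise, rcond=None)
train_resid = np.abs(X @ coef - y_noise).max()
```

A training residual of essentially zero on a target that contains no signal is the overfitting the paragraph describes; PLS avoids it by regressing on a handful of latent scores instead of all 701 channels.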

Stated a little more precisely, PLS decomposes the spectral matrix X (timepoints × wavenumbers) and the target vector y into a shared set of k scores T and loadings P, W, q:

X ≈ T P^T (the spectra reconstructed from k latent scores)
y ≈ T q (the target predicted from the same k scores)

where each successive latent direction w is chosen to maximize the covariance cov(Xw, y) subject to orthogonality with the directions already extracted. The number of components k is the one real hyperparameter, and it is chosen by cross-validation grouped by batch (never row-wise — see below): too few components and the model underfits the chemistry; too many and it starts fitting batch-specific noise, the classic overfitting knee. The final model is linear in the spectrum, which is a feature, not a limitation — a linear chemometric model is interpretable (you can plot the regression coefficient against wavenumber and see which bands the titer prediction leans on) and validatable in a way a deep network is not, which is exactly why PLS, not a neural net, is what actually ships in GMP soft sensors.
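To make the decomposition concrete, here is a from-scratch sketch of single-target PLS using the classical NIPALS recursion. It is an illustration of the idea, not the library implementation the chapter's module calls; the function name `pls1_nipals` and the synthetic data are assumptions.

```python
import numpy as np


def pls1_nipals(X, y, n_components):
    """Single-target PLS by NIPALS on centred data.

    Returns scores T, weights W, X-loadings P (as columns) and y-loadings q.
    """
    X = X - X.mean(axis=0)
    y = y - y.mean()
    T, W, P, q = [], [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w = w / np.linalg.norm(w)      # unit direction maximising cov(Xw, y)
        t = X @ w                      # latent scores for this component
        tt = t @ t
        p = X.T @ t / tt               # X loadings
        qk = y @ t / tt                # y loading
        X = X - np.outer(t, p)         # deflate: remove what this component explained
        y = y - qk * t
        T.append(t)
        W.append(w)
        P.append(p)
        q.append(qk)
    return np.array(T).T, np.array(W).T, np.array(P).T, np.array(q)


# Synthetic demo: 60 samples, 50 collinear-free channels, two informative ones.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 50))
y_demo = 2.0 * X_demo[:, 3] - X_demo[:, 10] + 0.05 * rng.normal(size=60)

T, W, P, q = pls1_nipals(X_demo, y_demo, n_components=4)
y_fit = y_demo.mean() + T @ q          # in-sample reconstruction y ≈ T q
```

The deflation step is what enforces the orthogonality constraint: each new direction is computed on what is left after the earlier components have been removed, so the score vectors come out mutually orthogonal.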

The full pipeline, then, is: SNV → Savitzky-Golay derivative → mean-center → PLS with k latent variables → predicted concentration. Every parameter that learns from the data (the scaler's means, the PLS loadings) must be fit on the training split only — fit them before you split and you have leaked. And the whole calibration is bound to its probe: change the laser, the flow cell, or the cell line and the model is, for regulatory purposes, a new procedure until re-validated. The bench reference assay never goes away, because it is what re-grounds the calibration whenever the hardware moves.

Figure: The flagship upstream node, learned. An in-line Raman probe feeds a SNV/Savitzky-Golay/PLS chemometrics pipeline that reads titer and metabolites off the spectrum in real time and anchors them at sparse bench counts; a dielectric probe reaches toward the stubborn VCD target; a glucose-driven closed-loop controller actuates the feed every thirty minutes; and a hybrid mechanistic-plus-ML twin forecasts the whole 14-day run. Metabolites and titer are production-grade; VCD and autonomous control are still an open frontier. Original diagram by the authors, created with AI assistance.

A PLS soft sensor in code, and the leakage trap that flatters it

The chapter's first runnable artifact is examples/platform/ml/soft_sensor_pls.py. It builds the PLS titer soft sensor over the simulator's Raman spectra — the chemometric baseline that any fancier model has to beat. It compresses the 701 collinear wavenumbers into six latent components and maps them onto antibody titer:

# examples/platform/ml/soft_sensor_pls.py: PLS titer soft sensor from in-line Raman
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

DATA = Path(__file__).resolve().parents[2] / "datasets"
TARGET = "titer_g_L"

def load_xy():
    df = pd.read_parquet(DATA / "raman_spectra.parquet")
    wn = [c for c in df.columns if c.startswith("wn_")]  # 701 channels wn_400 .. wn_1800
    return df[wn].to_numpy(), df[TARGET].to_numpy(), wn

def train_pls(n_components: int = 6, test_size: float = 0.3, seed: int = 2026):
    X, y, wn = load_xy()
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size, random_state=seed)
    scaler = StandardScaler().fit(Xtr)  # fit on TRAIN only: no scaler leakage
    pls = PLSRegression(n_components=n_components).fit(scaler.transform(Xtr), ytr)
    pred = pls.predict(scaler.transform(Xte)).ravel()
    return {"model": "PLS", "n_components": n_components, "n_wavenumbers": len(wn),
            "n_train": len(ytr), "n_test": len(yte),
            "r2": round(float(r2_score(yte, pred)), 4),
            "rmse_g_L": round(float(np.sqrt(mean_squared_error(yte, pred))), 4)}

if __name__ == "__main__":
    m = train_pls()
    print(f"PLS soft sensor (titer from Raman): R2={m['r2']} RMSE={m['rmse_g_L']} g/L "
          f"({m['n_components']} comps, {m['n_wavenumbers']} wavenumbers, "
          f"{m['n_train']} train / {m['n_test']} test)")
    assert m["r2"] > 0.85, f"PLS R2 too low ({m['r2']}): dataset not predictive"
    print("ASSERT ok: R2 > 0.85 — the Raman dataset is genuinely predictive of titer.")

A representative run prints (illustrative — the assertion guards the floor, not the exact figure):

PLS soft sensor (titer from Raman): R2=0.987 RMSE=0.213 g/L (6 comps, 701 wavenumbers, 235 train / 101 test)
ASSERT ok: R2 > 0.85 — the Raman dataset is genuinely predictive of titer.

Read like a process engineer, that R² is suspiciously good — and the reason it is so good is the single most important caveat in applied bioprocess ML. The default train_test_split shuffles rows, scattering near-identical within-batch neighbours across the train/test boundary, so the model is interpolating between near-duplicate points rather than generalizing to a new batch. This is exactly the leakage trap the data chapter builds the entire dataio.py module to prevent. The honest experiment is to split by batch and report the gap:

# soft_sensor_pls.py, run under both splitting schemes (illustrative)
[ROW-WISE random split] train 235 / test 101 R2 = 0.987 <- LEAKED, do not trust
[BATCH-GROUPED split] test = held-out batches R2 = 0.951 <- honest, next-batch performance

Both numbers are high because the simulator's spectra carry the titer signal cleanly, and real Raman is messier — but the row-wise number is high for the wrong reason. The batch-grouped number is the one you could put before a reviewer, because the test batches were genuinely never seen during the fit. The lesson is not "Raman predicts titer" (it does); it is that the validation discipline is what separates a number you can defend from a fantasy. Titer-from-Raman is strongly predictive even under the honest split — which is exactly why it is the one production-grade ML deployment in upstream.

One more chemometrics decision shapes how this model behaves in practice: one global model, or several local ones? A single global PLS calibrated across the whole run is simple and is what most deployments start with, but a fed-batch culture is really two regimes — an exponential growth phase and a stationary production phase — whose spectra and titer relationships differ, so a global model can be a compromise that fits neither phase well. The alternative is a small bank of local models (a just-in-time or phase-segmented approach): pick the calibration appropriate to the current phase, or weight nearby calibration points more heavily. Local models often predict better but multiply the validation burden — each one is a procedure that must be qualified and maintained, and the logic that switches between them is itself a thing that can fail. The pragmatic production answer is usually a global model with enough latent variables to span both phases, re-validated whenever the process changes; the local-model gains are real but are spent in development, not bought for free.

Why VCD is the hard one

Now the chapter's central technical argument. Run the same PLS pipeline against VCD_e6_per_mL instead of titer and, under an honest batch-grouped split, the R² falls off a cliff. The reason is not a software bug or a tuning failure — it is physics, and it is worth stating precisely because it explains a fifteen-year unsolved problem.

Titer predicts well because antibody is a molecule with a Raman signature. As the cells secrete protein, the concentration of a specific chemical species rises in the broth, and that species has characteristic Raman bands (the protein amide I and amide III backbone vibrations, plus aromatic side-chain bands). More antibody means a stronger, specific signal at known wavenumbers — a direct, causal, molecular link between the quantity and the spectrum. Glucose, lactate, and glutamine are the same story: each is a dissolved molecule with its own bands, so its concentration writes itself directly into the spectrum.

VCD predicts badly because "viable cell density" is a count, not a concentration. A cell is not a molecule with a Raman band; it is a complex object suspended in the medium. Cells do affect a Raman measurement — they scatter and absorb light, they raise turbidity, and the intracellular biochemistry contributes a diffuse background — but there is no clean band that means "two million cells per millilitre." Whatever VCD signal Raman carries is indirect and confounded: it rides on turbidity (which also changes with debris and bubbles), on the bulk biochemical composition (which shifts as cells grow, then die), and on correlations with the metabolites that do have bands. A model can latch onto those correlations and look decent on the calibration batches, but the correlations are not stable run-to-run — viability shifts, the dead fraction climbs, the medium lot changes — so a Raman VCD model transfers poorly and decays fast. It is predicting a count through a proxy, and the proxy keeps moving.

This is why dielectric (capacitance) spectroscopy matters: it has a genuine physical line to the answer Raman lacks. Capacitance measures the polarizability of intact cell membranes under a radio-frequency field, and that polarizable biovolume is roughly proportional to viable cell density — a direct, causal link, not a confounded correlation. But even capacitance has its own version of the same trap: its signal is viable biovolume, not cell count, so a population that swells, shrinks, or changes membrane properties as it ages will read differently for the same nominal VCD, and as cells lyse and lose membrane integrity the capacitance signal fades faster than the count does. So the honest engineering answer to "how do I soft-sense VCD?" is usually "don't rely on Raman for it — add a capacitance probe, and even then expect it to degrade as the dead fraction rises and the membrane signal blurs." VCD soft-sensing remains the recognized weak spot of spectroscopic monitoring: it is the one routinely-needed upstream quantity that no single in-line probe reads cleanly, and closing the gap is an active research problem rather than a solved one.

There is a subtle, important consequence of this for every Raman model, not just the VCD one. Because cells scatter and absorb light, a rising cell density attenuates and distorts the whole spectrum — a turbidity effect that the chemistry-bearing bands of glucose or titer ride on top of. SNV and the derivative preprocessing are partly there to remove this confound, but they never remove it completely, which means a metabolite model calibrated at low cell density can drift as the culture becomes dense and turbid. The VCD weak spot, in other words, is not an isolated failure of one target; it is the same scattering physics that quietly threatens the accuracy of the metabolite models the chapter just praised — one more reason the bench reference and a batch-grouped validation are not optional.

The deeper lesson generalizes past this one variable: a soft sensor is only as good as the physical link between the signal and the target. Where that link is a direct molecular band, ML is production-grade. Where it is an indirect, drifting proxy, ML produces a number that looks fine in calibration and betrays you on the next lot. Knowing which regime you are in — and saying so — is the difference between a trustworthy soft sensor and a confident liar.

Closed-loop glucose control: from measuring to acting

A soft sensor that only reports is a monitor. The moment it drives an actuator, it becomes control — and the most mature example in the bioreactor is closed-loop glucose feedback control. The idea is to hold glucose in a tight target band by measuring it in-line and adjusting the feed automatically, instead of bolus-feeding on a fixed schedule. The loop runs continuously:

  1. Measure. The Raman soft sensor estimates current glucose concentration (say, every minute).
  2. Decide. A controller compares the estimate to the setpoint and computes a correction — how much feed to add over the next interval to bring glucose back to target without overshooting.
  3. Actuate. The controller commands the feed pump (the BR101.GlucoseFeed actuator) to deliver that volume.
  4. Wait and repeat. The loop typically actuates on roughly a 30-minute cadence — long enough for the feed to mix and the cells to respond, short enough to hold a tight band.

Holding glucose low and constant has real process value: it suppresses the overflow metabolism that produces lactate and ammonia, can shift glycosylation, and reduces the osmolality swings that fixed bolus feeding causes. The control law itself can be simple or model-based. The simplest workable version is a discrete feedforward-plus-feedback law: estimate the cells' current glucose consumption rate from the recent trajectory (the feedforward term that supplies what the cells are about to eat), and add a feedback correction proportional to the gap between the measured glucose and its setpoint:

feed(t) = consumption_estimate(t) · Δt + Kp · (glucose_setpoint − glucose_hat(t))

where glucose_hat(t) is the Raman soft sensor's estimate and Kp is a tuned gain. The feedforward term does most of the work — it tracks the rising biomass — and the feedback term cleans up the error the soft sensor and the consumption model leave behind. More advanced demonstrations replace consumption_estimate with a small predictive model (a one-step-ahead MPC) and use a deep-learning soft sensor to supply glucose_hat. The loop's hardest engineering is not the control law but the failure modes: a fouled probe makes glucose_hat drift, and a drifting estimate wired straight to a pump can systematically over- or under-feed for hours before the twice-daily bench sample catches it. Production loops therefore wrap the controller in interlocks — bounds on the feed rate, a sanity check of the soft-sensor estimate against the online state, and a fallback to scheduled feeding if the probe quality flag goes bad — so that a bad estimate fails safe rather than driving the batch off a cliff.
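The control law and its interlocks can be sketched as a single function. Everything numeric below is a hypothetical placeholder (the feed cap, the plausibility window, the gain, and the units of grams of glucose per interval), and `FeedLimits` and `feed_command` are illustrative names, not part of any plant interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedLimits:
    max_feed_g: float = 50.0        # hypothetical cap on glucose per interval
    glucose_min_g_L: float = 0.0    # plausibility window for the soft-sensor estimate
    glucose_max_g_L: float = 15.0


def feed_command(glucose_hat, setpoint, consumption_rate_g_per_h, dt_h, kp,
                 probe_ok, scheduled_feed_g, limits=FeedLimits()):
    """Return (feed_g, mode) for one control interval.

    feed = consumption_estimate * dt + Kp * (setpoint - glucose_hat),
    clamped to [0, max_feed_g], with a fallback to the scheduled feed when
    the probe flag is bad or the estimate is implausible (NaN fails this check).
    """
    plausible = limits.glucose_min_g_L <= glucose_hat <= limits.glucose_max_g_L
    if not (probe_ok and plausible):
        return scheduled_feed_g, "fallback_scheduled"
    feedforward = consumption_rate_g_per_h * dt_h    # what the cells are about to eat
    feedback = kp * (setpoint - glucose_hat)         # correction toward the setpoint
    raw = feedforward + feedback
    clamped = min(max(raw, 0.0), limits.max_feed_g)  # pumps cannot run backwards
    return clamped, "closed_loop"


# One 30-minute interval with glucose slightly below target.
normal = feed_command(1.5, 2.0, consumption_rate_g_per_h=4.0, dt_h=0.5, kp=10.0,
                      probe_ok=True, scheduled_feed_g=3.0)
# The same interval with the probe quality flag down.
degraded = feed_command(1.5, 2.0, consumption_rate_g_per_h=4.0, dt_h=0.5, kp=10.0,
                        probe_ok=False, scheduled_feed_g=3.0)
```

Putting the plausibility check and the clamp in the same function as the control law keeps the fail-safe path on every call, so a bad estimate can never reach the pump unfiltered.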

The honest framing is about what crosses into GMP. Closed-loop glucose control via Raman plus deep learning has been demonstrated at scale — Amgen has shown it — but it sits at (pilot) maturity: a demonstrated capability, not yet the default way commercial batches are fed under a locked, validated control loop. The regulatory reason is the recurring theme of upstream ML: a model that autonomously moves an input affecting a CQA is held to a far higher bar than a model that advises a human. A soft sensor that displays glucose to an operator who decides the feed is monitoring; a soft sensor wired straight to the pump is control, and control of a parameter that affects product quality invites the full weight of process validation, change control, and the "locked model" expectation that the FDA's 2023 discussion paper and the draft EU Annex 22 both set out. The capability is real and demonstrated; the scaled, routine, commercial-GMP deployment is the frontier, not the norm.

The hybrid digital twin: physics carries the trend, ML corrects the curvature

The most ambitious object in the chapter is the digital twin of the run — a model that does not just read the current state off a probe but forecasts the whole trajectory, so you can ask "if I feed like this, where does titer land on day 14?" A pure black-box network would need far more batches than a process ever has. A pure first-principles model captures the trend but misses the systematic ways this cell line, in this medium, deviates from the textbook. The winning pattern in bioprocess is neither: it is the hybrid (gray-box) model, a first-principles backbone corrected by a small ML residual.

A full mechanistic fed-batch model is a set of coupled ordinary differential equations — viable cells grow and die, glucose and glutamine are consumed, lactate and ammonia are produced, and titer accumulates — each governed by kinetic parameters (a maximum growth rate, Monod half-saturation constants, yield coefficients, a specific productivity). Schematically:

dXv/dt = (μ − kd)·Xv (glucose and glutamine consumed ∝ Xv)
dP/dt = qP·Xv (lactate and ammonia produced ∝ Xv)

That backbone is the first-principles part of the twin, and it is what makes the model extrapolate sensibly into conditions it was not trained on. The simplest useful slice of it, and the one the runnable module implements, is the non-growth-associated Luedeking-Piret relation for titer alone: secreted titer P tracks the integral of viable cell density (IVCD) times a single specific-productivity constant qP. In words, total antibody made is (how many productive cells there were) × (how long they were productive) × (how fast each one secretes). That one constant, fit by least squares through the origin, already explains most of the variance. But the constant-qP assumption is wrong in detail: specific productivity rises in stationary phase as growth slows and the cells redirect resources to secretion. So a small neural network is trained on only the residual (true titer minus the mechanistic prediction), reading the process state. The physics carries the trend; the network corrects the curvature the constant can't capture. This is a parallel gray-box: ŷ = mechanistic(state) + NN(state). (The alternative, a serial gray-box, instead has the network supply a parameter, say a time-varying qP(state), that the mechanistic equations then integrate; both arrangements are used, and the parallel form is the easiest to fit and reason about.)
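Written out, the titer slice and its fit look as follows. This is a worked restatement under two stated assumptions: titer starts at zero, and qP is constant over the run. The symbol s(t) for the process state is introduced here for illustration only.

```latex
\begin{aligned}
\frac{dP}{dt} &= q_P\,X_v(t)
  \;\Longrightarrow\;
  P(t) = q_P \int_0^{t} X_v(\tau)\,d\tau = q_P\,\mathrm{IVCD}(t), \\[4pt]
\hat q_P &= \arg\min_{q}\sum_i \bigl(P_i - q\,\mathrm{IVCD}_i\bigr)^2
  = \frac{\sum_i \mathrm{IVCD}_i\,P_i}{\sum_i \mathrm{IVCD}_i^{2}}, \\[4pt]
\hat P_{\text{parallel}}(t) &= \hat q_P\,\mathrm{IVCD}(t) + \mathrm{NN}\bigl(s(t)\bigr),
\qquad
\hat P_{\text{serial}}(t) = \int_0^{t} q_P\bigl(s(\tau)\bigr)\,X_v(\tau)\,d\tau .
\end{aligned}
```

The second line is the closed-form least-squares slope for a line through the origin, obtained by setting the derivative of the squared error with respect to q to zero.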

The chapter's second runnable artifact, examples/platform/ml/hybrid_model.py, builds exactly this and pits it against a pure-NN baseline:

# examples/platform/ml/hybrid_model.py: mechanistic IVCD backbone + NN residual
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

DATA = Path(__file__).resolve().parents[2] / "datasets"
TARGET = "titer_g_L"
FEATS = ["Xv_e6_per_mL", "glucose_g_L", "lactate_g_L", "glutamine_mM",
         "ammonia_mM", "t_day", "viability_pct"]

def load_state():
    df = pd.read_parquet(DATA / "fedbatch_state.parquet")
    return df.iloc[::60].reset_index(drop=True)  # minute -> hourly (336 rows)

def ivcd(df):  # cumulative integral of viable cells
    t, xv = df["t_day"].to_numpy(), df["Xv_e6_per_mL"].to_numpy()
    return np.cumsum(xv * np.diff(t, prepend=t[0]))

def train_hybrid(test_size: float = 0.3, seed: int = 2026):
    df = load_state()
    y = df[TARGET].to_numpy()
    iv = ivcd(df)
    X = df[FEATS].to_numpy()
    tr, te = train_test_split(np.arange(len(df)), test_size=test_size, random_state=seed)
    qp = float(np.sum(iv[tr] * y[tr]) / np.sum(iv[tr] ** 2))  # qP through the origin
    mech = qp * iv  # mechanistic backbone
    scaler = StandardScaler().fit(X[tr])
    nn = MLPRegressor((32, 16), max_iter=5000, alpha=1e-3, random_state=seed)
    nn.fit(scaler.transform(X[tr]), (y - mech)[tr])  # NN learns ONLY the residual
    hybrid = mech + nn.predict(scaler.transform(X))
    pure = (MLPRegressor((32, 16), max_iter=5000, alpha=1e-3, random_state=seed)
            .fit(scaler.transform(X[tr]), y[tr])
            .predict(scaler.transform(X)))  # pure-NN baseline
    sc = lambda p: (round(float(r2_score(y[te], p[te])), 4),
                    round(float(np.sqrt(mean_squared_error(y[te], p[te]))), 4))
    return {"qP": round(qp, 5),
            "mech": dict(zip(("r2", "rmse"), sc(mech))),
            "hybrid": dict(zip(("r2", "rmse"), sc(hybrid))),
            "pure_nn": dict(zip(("r2", "rmse"), sc(pure)))}

if __name__ == "__main__":
    m = train_hybrid()
    print(f"Hybrid titer model on BATCH-2026-001 state (qP={m['qP']} g per 1e6 cell-day/mL):")
    print(f" mechanistic only R2={m['mech']['r2']:.4f} RMSE={m['mech']['rmse']:.4f} g/L")
    print(f" pure NN R2={m['pure_nn']['r2']:.4f} RMSE={m['pure_nn']['rmse']:.4f} g/L")
    print(f" HYBRID (mech+NN) R2={m['hybrid']['r2']:.4f} RMSE={m['hybrid']['rmse']:.4f} g/L")
    assert m["hybrid"]["rmse"] <= m["mech"]["rmse"], "hybrid should beat mechanistic-only"
    print("ASSERT ok: the residual network lowers RMSE below the mechanistic backbone.")

A representative run prints (illustrative — the assertion guards that the hybrid beats the backbone, not the exact figures):

Hybrid titer model on BATCH-2026-001 state (qP=0.00231 g per 1e6 cell-day/mL):
mechanistic only R2=0.965 RMSE=0.349 g/L
pure NN R2=0.972 RMSE=0.318 g/L
HYBRID (mech+NN) R2=0.991 RMSE=0.178 g/L
ASSERT ok: the residual network lowers RMSE below the mechanistic backbone.

Read the three lines as the argument for hybrid modeling. The mechanistic-only backbone — a single fitted constant — already explains the bulk of the titer curve, which is why first principles are such a powerful prior in bioprocess. The pure NN, with hundreds of parameters and no physics, barely edges it out, and would fall apart faster off-distribution because it has no trend to fall back on. The hybrid wins clearly and wins cheaply: the network only had to learn the small stationary-phase curvature, so it succeeds with far fewer effective parameters than a black box would need. In the cold-start, few-batch regime that defines bioprocess, that parameter economy is the whole point — the physics spends the labels you have on the part you actually don't know.

Two further uses of the twin are worth naming. First, ML-surrogate MPC for feeding: once you have a fast hybrid forward model, you can wrap it in model predictive control — at each step, simulate the next several hours under each candidate feed profile, pick the one that best hits the titer or glucose objective, actuate, and re-plan. The hybrid model is the cheap, differentiable surrogate that makes this real-time optimization tractable where a full mechanistic CFD/kinetic simulation would be too slow. Second, soft-sensing the unmeasured states: the twin can estimate quantities you cannot probe at all (specific productivity, the true viable fraction) by reconciling the mechanistic state with whatever measurements arrive — a learned analogue of a Kalman filter's state estimate.
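The simulate, pick, actuate, re-plan loop described above can be sketched in a few lines. This is a minimal receding-horizon example under loud assumptions: `surrogate_step` is a hypothetical toy glucose balance standing in for the hybrid forward model, the uptake rate and candidate feed rates are invented for illustration, and the search is a plain enumeration of constant feed rates rather than a gradient-based optimizer.

```python
from typing import Callable, List, Sequence


def surrogate_step(glucose: float, feed_rate: float, dt_h: float = 0.5) -> float:
    """Hypothetical stand-in for the hybrid forward model: one glucose balance step.

    Assumed toy dynamics (not from the chapter): constant uptake of 0.20 g/L/h,
    the feed adds `feed_rate` g/L/h, and glucose cannot go negative.
    """
    uptake = 0.20
    return max(0.0, glucose + (feed_rate - uptake) * dt_h)


def plan_feed(glucose: float, setpoint: float, candidates: Sequence[float],
              horizon_steps: int,
              step: Callable[[float, float], float] = surrogate_step) -> float:
    """Simulate each candidate feed rate over the horizon; return the best tracker."""
    best_rate, best_cost = candidates[0], float("inf")
    for rate in candidates:
        g, cost = glucose, 0.0
        for _ in range(horizon_steps):
            g = step(g, rate)
            cost += (g - setpoint) ** 2  # squared tracking error over the horizon
        if cost < best_cost:
            best_rate, best_cost = rate, cost
    return best_rate


def run_mpc(glucose: float, setpoint: float, candidates: Sequence[float],
            horizon_steps: int, n_steps: int) -> List[float]:
    """Receding horizon: plan, apply only the first move, then re-plan."""
    trace = [glucose]
    for _ in range(n_steps):
        rate = plan_feed(glucose, setpoint, candidates, horizon_steps)
        glucose = surrogate_step(glucose, rate)  # the plant is the toy model here
        trace.append(glucose)
    return trace


if __name__ == "__main__":
    trace = run_mpc(glucose=1.0, setpoint=2.0,
                    candidates=[0.0, 0.1, 0.2, 0.3, 0.5, 1.0],
                    horizon_steps=6, n_steps=16)
    print([round(g, 2) for g in trace])
```

With a coarse candidate grid the loop settles near the setpoint rather than exactly on it. A real controller would search a finer or continuous feed space and would re-plan against the measured state, not the simulated one.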

Anatomy of one soft-sensor prediction

A soft-sensor reading is never a bare number. Like every artifact in this series, its value is in what travels with it — the spectrum that produced it, the preprocessing and model version behind it, the uncertainty around it, and the bench reference that will eventually grade it. Pull one prediction apart and the whole chapter is laid out as fields.

Figure (described): an anatomy identity card of one soft-sensor prediction for BR-101 on BATCH-2026-001 at batch-hour 168. An indigo header names the model soft_sensor_pls v1 and the subject vessel BR-101. An input block shows the raw 701-channel Raman spectrum (wn_400 to wn_1800) as a sparkline, its SNV-and-Savitzky-Golay-preprocessed twin beside it, and the aligned online state: temperature 36.5 °C, pH 7.04, dissolved oxygen. A green core block holds the predicted titer in g/L with a confidence band and the predicted glucose, lactate, glutamine and ammonia, each with a band, all stamped with the six PLS latent variables that produced them. An amber weak-spot block holds the VCD estimate with a deliberately wide uncertainty band and a note that Raman has no direct band for cell count, so the estimate rides on a confounded proxy and a capacitance probe is preferred. A reconciliation block carries the twice-daily bench reference for titer and metabolites and the residual against prediction. A violet relationships panel links the prediction to its calibration batches, the dataset hash, the model and preprocessing version, the closed-loop glucose controller it can feed, and the human-in-the-loop boundary for any action affecting a CQA. A caption notes that titer and metabolites are confident while VCD is flagged uncertain by design.

Caption: One soft-sensor prediction, fully unpacked: the raw and preprocessed Raman spectrum and aligned online state that fed it, the confident titer and metabolite estimates with bands and the latent variables behind them, the deliberately wide VCD estimate that flags the weak spot, the bench reference that will grade it, and the relationships (calibration set, model version, the controller it can drive, and the human-in-the-loop line for any CQA-affecting action) that make it governable. Original diagram by the authors, created with AI assistance.

Read the card top to bottom and the argument is complete. The input block is the cheap, fast signal — a 701-channel spectrum, its preprocessed twin, and the aligned online state. The green core is the prediction proper: titer and the metabolites that have direct molecular bands, each with a confidence band and stamped with the six latent variables that produced them, so a reviewer can trace the number back to the spectrum. The amber weak-spot block is the chapter's honesty made into a field: the VCD estimate is carried with a deliberately wide band and an explicit note that Raman has no direct band for it, so no downstream consumer over-trusts it. The reconciliation rows hold the twice-daily bench reference and the residual — the only honest grade the soft sensor ever gets, and the lagging signal that drift detection has to live with. And the violet relationships panel records governance: the calibration batches the model was bound to, the dataset and model version, the closed-loop controller the glucose estimate can drive, and the human-in-the-loop line that any action affecting a CQA must not cross silently.
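The card's blocks map naturally onto a record type. The sketch below is one possible shape, not the series' actual schema: the class names, field names, and the sign convention for the residual (bench reference minus prediction) are assumptions made here, and the titer and VCD numbers are placeholders for illustration rather than values from the run.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Estimate:
    """A predicted value with its uncertainty band."""
    value: float
    lower: float
    upper: float
    flagged_uncertain: bool = False  # True for the amber weak-spot block

    @property
    def band_width(self) -> float:
        return self.upper - self.lower


@dataclass
class SoftSensorPrediction:
    """Hypothetical record for one soft-sensor prediction and its provenance."""
    vessel: str
    batch_id: str
    batch_hour: float
    model_name: str
    model_version: str
    n_latent_variables: int
    online_state: Dict[str, float]
    estimates: Dict[str, Estimate]
    bench_reference: Dict[str, float] = field(default_factory=dict)
    requires_human_review: bool = True  # human-in-the-loop line for CQA actions

    def residual(self, analyte: str) -> Optional[float]:
        """Bench reference minus prediction; None until the bench sample returns."""
        if analyte not in self.bench_reference:
            return None
        return self.bench_reference[analyte] - self.estimates[analyte].value


card = SoftSensorPrediction(
    vessel="BR-101", batch_id="BATCH-2026-001", batch_hour=168.0,
    model_name="soft_sensor_pls", model_version="v1", n_latent_variables=6,
    online_state={"temperature_c": 36.5, "ph": 7.04},
    estimates={
        # Placeholder numbers for illustration only, not values from the run.
        "titer_g_per_l": Estimate(3.10, 2.95, 3.25),
        "vcd_1e6_per_ml": Estimate(14.0, 9.0, 19.0, flagged_uncertain=True),
    },
)
```

Keeping the residual as a method that returns `None` until the bench value arrives mirrors the lag described above: the prediction exists for hours before anything can grade it.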

The unsolved part: the VCD weak spot and the model that decays

Two honest difficulties sit under this chapter, and both are about the limits of a soft sensor rather than its successes.

The first is the VCD weak spot itself, argued above. The single most operationally useful upstream quantity, how many living cells are in the tank, is the one no in-line probe reads cleanly. Raman gives a confounded proxy that drifts run-to-run; capacitance gives a genuine but viability-sensitive biovolume signal that blurs as the dead fraction climbs; image-based and other approaches remain research. A plant that needs VCD continuously is, today, stitching together an imperfect capacitance reading, a Raman correlation, and the twice-daily count, and accepting that the live VCD trace is the least trustworthy line on the dashboard. This is not a failure of effort; it is a genuine, fifteen-year, still-open problem, and any product claim of "Raman VCD soft sensing" deserves to be read with two questions in mind: under what split was it validated, and how does it transfer?

The second is model decay, the consequence of the cold-start cadence every soft sensor inherits. A Raman calibration is bound to its probe, its cell line, and its medium; the moment any of those moves — a new medium lot, a fouled flow cell, a probe swapped at maintenance, a clone change — the calibration begins to drift, and because the bench reference only arrives twice a day, drift is detected late. A soft sensor that began over-reading titer at breakfast is not provably wrong until the evening sample returns, and between those two points it looks identical to a healthy one. The same physics that makes the soft sensor valuable — it interpolates the sparse truth — is what makes its failures slow to surface. This is why production soft sensors are governed objects with a re-calibration trigger, a residual monitor, and a locked-model change-control plan, not fire-and-forget regressions; the MLOps chapter builds that lifecycle out in full. The honest production soft sensor is one that reports its uncertainty, widens its band away from the last reference, and hands every consequential decision back to a human at the points the regulation requires.
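Two of the behaviours named above, a band that widens away from the last reference and a residual monitor that triggers re-calibration, are simple to sketch. Everything numeric here is a hypothetical placeholder: the base half-width, the linear growth rate, the residual limit, and the "two consecutive same-sign exceedances" rule are illustrative choices, not the chapter's validated settings.

```python
from typing import Sequence


def band_half_width(hours_since_reference: float,
                    base: float = 0.10,
                    growth_per_hour: float = 0.02) -> float:
    """Uncertainty half-width (g/L) that grows linearly since the last bench sample.

    Assumed linear growth; a real deployment would derive this from residual history.
    """
    return base + growth_per_hour * max(0.0, hours_since_reference)


def needs_recalibration(residuals: Sequence[float],
                        limit: float = 0.30,
                        consecutive: int = 2) -> bool:
    """True when the most recent bench residuals all exceed `limit` with one sign.

    A same-sign run suggests systematic drift rather than random scatter.
    """
    recent = list(residuals)[-consecutive:]
    if len(recent) < consecutive:
        return False  # not enough bench samples to judge
    all_high = all(r > limit for r in recent)
    all_low = all(r < -limit for r in recent)
    return all_high or all_low


if __name__ == "__main__":
    for hours in (0, 6, 12):
        print(hours, "h since reference -> +/-", round(band_half_width(hours), 2), "g/L")
    print("recalibrate:", needs_recalibration([0.05, -0.10, 0.35, 0.40]))
```

Because the residual only updates when a bench sample returns, the trigger can fire no sooner than the second bad sample, which is the detection lag the paragraph above describes.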

What this chapter adds to the model suite

This chapter contributes two modules to the Book 5 example suite — the most of any single chapter, which befits the flagship:

  • examples/platform/ml/soft_sensor_pls.py — the PLS titer soft sensor from in-line Raman: SNV-ready preprocessing, a six-latent-variable PLS regression over the 701 wavenumbers of raman_spectra.parquet, and a CI assertion that the dataset is genuinely predictive of titer (R² > 0.85). It is the chemometric baseline every fancier model must beat, and the place the chapter demonstrates the row-wise-versus-batch-grouped leakage contrast using the shared dataio split.
  • examples/platform/ml/hybrid_model.py — the gray-box titer twin: a mechanistic IVCD × qP backbone corrected by an MLP residual over the process state in fedbatch_state.parquet, benchmarked against a pure-NN baseline, with a CI assertion that the hybrid lowers RMSE below the mechanistic backbone. It is the runnable core of the digital-twin section and the pattern the seed-train readiness model and downstream chromatography models both reuse.
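The row-wise-versus-batch-grouped contrast mentioned in the first bullet comes down to whether spectra from one batch can land on both sides of the split. The sketch below illustrates that with plain Python; it is not the shared dataio implementation, and the function names and the synthetic batch labels are hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple


def row_wise_split(batch_ids: Sequence[str], test_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle individual rows: spectra from one batch can land on both sides."""
    idx = list(range(len(batch_ids)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(idx) * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def batch_grouped_split(batch_ids: Sequence[str], test_fraction: float = 0.2,
                        seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle whole batches: every spectrum of a batch stays on one side."""
    batches = sorted(set(batch_ids))
    random.Random(seed).shuffle(batches)
    n_test = max(1, round(len(batches) * test_fraction))
    test_batches = set(batches[:n_test])
    train = [i for i, b in enumerate(batch_ids) if b not in test_batches]
    test = [i for i, b in enumerate(batch_ids) if b in test_batches]
    return train, test


def shared_batches(batch_ids: Sequence[str], train: Sequence[int],
                   test: Sequence[int]) -> Set[str]:
    """Batches that appear in both train and test, i.e. the leakage."""
    return {batch_ids[i] for i in train} & {batch_ids[i] for i in test}


if __name__ == "__main__":
    # Synthetic labels: 8 hypothetical batches with 50 spectra each.
    ids = [f"B{b:02d}" for b in range(8) for _ in range(50)]
    tr, te = row_wise_split(ids)
    print("row-wise leaked batches:", len(shared_batches(ids, tr, te)))
    tr, te = batch_grouped_split(ids)
    print("grouped leaked batches:", len(shared_batches(ids, tr, te)))
```

A score computed on the row-wise split measures interpolation within batches the model has already seen; only the grouped split estimates performance on a new batch, which is the case a deployed soft sensor actually faces.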

Together they make the production bioreactor a node the model suite can actually run on, and they encode the chapter's two central truths in executable form: the soft sensor is real (the PLS assertion holds), and the hybrid beats the pure-physics backbone (the residual assertion holds). The hybrid's edge over the pure NN shows in the printed comparison but is not itself asserted in CI.

Why it matters

The production bioreactor is where machine learning in biomanufacturing is most real and most tested. A spectroscopic soft sensor turns the once-or-twice-a-day metabolite count into a continuous trace, so an operator sees glucose and lactate and titer climbing in real time instead of guessing between bench samples — and that is the one ML capability in upstream that is genuinely (production) today. Closed-loop glucose control turns that trace into action, holding the feed tight to suppress lactate and stabilize quality — real, demonstrated, and still mostly (pilot). The hybrid digital twin forecasts the whole run, so a process can be designed and steered on a model instead of on trial batches — powerful, and still mostly a development-and-pilot tool. And the VCD weak spot is the standing reminder that a soft sensor is only as good as the physical link beneath it: where that link is a molecular band, ML is trustworthy; where it is a drifting proxy, ML is a confident guess. Get the soft sensor right and the most data-rich step in manufacturing becomes its most observable one; over-trust it — especially on VCD — and you have built a dashboard that lies slowly. The whole upstream has been building toward this tank; this is where its data finally becomes knowledge, and where the honesty about ML's limits matters most.

In the real world

In-line Raman with SNV/derivative preprocessing and PLS chemometrics, soft-sensing glucose, lactate, and titer in CHO culture, is (production) practice — the strongest, most mature ML deployment anywhere in upstream biomanufacturing, established in the literature for over a decade [1][2]. The platforms are real and named: Sartorius's SIMCA and SIMCA-online for the chemometrics, and BioPAT Spectro for the in-line spectroscopy. The strongest first-party deployment anchor is Amgen's facility at Juncos, Puerto Rico, where SIMCA OPLS models predict harvest titer and other in-process attributes inside commercial GMP drug-substance manufacturing — a deployed (production) MVDA model at the upstream-downstream boundary, reported first-party by Amgen engineers together with a Sartorius vendor case study (vendor-self-reported / self-authored evidence tier), not independently audited [3].

Closed-loop glucose control via Raman plus deep learning has been demonstrated at Amgen, and it sits at (pilot): a real demonstrated capability, peer-reviewed, but not the routine way commercial batches are fed under a locked validated loop [4]. Hybrid mechanistic-plus-ML twins are (pilot) too: Sartorius's process-twin work pairing a parsimonious dynamic flux-balance backbone with a neural-network VCD/state estimate is the canonical example, useful for design and prediction at development scale, not a commercial autopilot [5]. Two attributions the field gets wrong often enough to correct by name: the Boehringer Ingelheim 16-attribute in-line Raman work used a k-nearest-neighbours model, not deep learning, and should be cited as the KNN multi-attribute demonstration it is [6]; and National Resilience's widely cited "+50% titer" perfusion result is a PAT-plus-manual-feed-optimization story from a vendor press release, not an ML deployment, and must never be presented as machine learning lifting titer [7]. The standing caveat behind all of it is that VCD soft-sensing remains the weak spot: no in-line probe reads viable cell density as cleanly as Raman reads titer, which is why dielectric capacitance, not Raman, is the probe of choice when VCD is the target, and why the live cell-density trace is still the least trustworthy line on the upstream dashboard. The throughline that the FDA's 2023 discussion paper and the ISPE Pharma 4.0 survey keep finding holds here exactly: AI/ML in this part of the plant is strongest as human-in-the-loop monitoring and soft sensing, thinner as autonomous closed-loop control of a CQA [8].

Key terms

  • Soft sensor (virtual / inferential sensor) — a model that estimates a hard-to-measure quantity (titer, metabolites, VCD) from easy-to-measure online signals (a spectrum, the online state), continuously and without an offline sample.
  • Raman spectroscopy — laser-scatter spectroscopy whose bands fingerprint specific molecular vibrations; the workhorse in-line probe for bioprocess because water scatters it weakly. Our datasets carry 701 channels, wn_400 to wn_1800.
  • NIR spectroscopy — near-infrared absorption spectroscopy; cheaper and faster than Raman but absorbed strongly by water, limiting it in a cell-dense broth.
  • Dielectric / capacitance spectroscopy — radio-frequency permittivity sensing of intact cell membranes; roughly proportional to viable biovolume, giving it the one direct physical line to VCD that Raman lacks.
  • PLS (Partial Least Squares) regression — the chemometric workhorse: it compresses many collinear wavenumbers into a few latent variables chosen to maximize covariance with the target, then regresses on those; beats OLS and (usually) deep nets in the small-data Raman regime.
  • Latent variable — a linear combination of wavenumbers; the few components PLS keeps instead of the raw 701 channels.
  • SNV (Standard Normal Variate) — per-spectrum centring and scaling that removes multiplicative scatter; leak-free because it is computed row by row.
  • Savitzky-Golay derivative — sliding-window polynomial smoothing-plus-derivative that removes spectral baseline and sharpens peaks.
  • Closed-loop control — a measure-decide-actuate loop (here, glucose every ~30 min) in which the soft sensor's estimate drives an actuator (the feed pump), not just a display.
  • Hybrid (gray-box) model — a first-principles backbone (titer = qP × IVCD) corrected by a small ML residual; physics carries the trend, ML corrects the curvature, succeeding with far fewer parameters than a pure black box.
  • Digital twin — a model of the run that forecasts its trajectory, used to design, predict, and steer the process; here a hybrid forward model that can also drive ML-surrogate MPC for feeding.
  • IVCD — the integral of viable cell density over the run; total productive cell-hours, the mechanistic backbone's input for titer.
  • VCD weak spot — the persistent, unsolved difficulty of soft-sensing viable cell density: a count, not a molecule, so it has no clean Raman band and rides on a drifting proxy.
  • Model decay — the drift of a probe-bound calibration as the cell line, medium, or hardware moves; detected late because the bench reference arrives only twice a day.
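Three of the terms above (SNV, IVCD, and the mechanistic backbone) reduce to a few lines of arithmetic. The sketch below uses only the standard library and makes its assumptions explicit: SNV uses the sample standard deviation, IVCD is integrated with the trapezoid rule, and the function names and unit conventions are chosen here for illustration rather than taken from the example suite.

```python
from statistics import mean, stdev
from typing import List, Sequence


def snv(spectrum: Sequence[float]) -> List[float]:
    """Standard Normal Variate: centre and scale ONE spectrum by its own statistics.

    Nothing is learned from other rows, so applying it before a train/test
    split cannot leak information between them.
    """
    mu = mean(spectrum)
    sd = stdev(spectrum)  # sample standard deviation (assumed convention)
    if sd == 0:
        raise ValueError("flat spectrum: SNV is undefined")
    return [(x - mu) / sd for x in spectrum]


def ivcd(times_days: Sequence[float], vcd: Sequence[float]) -> List[float]:
    """Cumulative integral of viable cell density over time (trapezoid rule)."""
    total, out = 0.0, [0.0]
    for i in range(1, len(times_days)):
        dt = times_days[i] - times_days[i - 1]
        total += 0.5 * (vcd[i] + vcd[i - 1]) * dt
        out.append(total)
    return out


def mechanistic_titer(q_p: float, ivcd_values: Sequence[float]) -> List[float]:
    """Backbone only: titer = qP x IVCD, with no learned residual."""
    return [q_p * v for v in ivcd_values]


if __name__ == "__main__":
    raw = [1.0, 2.0, 4.0, 3.0]
    scattered = [2.0 * x + 3.0 for x in raw]  # same shape, different gain and offset
    print(snv(raw))
    print(snv(scattered))  # matches the line above up to floating-point error
    print(mechanistic_titer(0.5, ivcd([0.0, 1.0, 2.0], [2.0, 4.0, 6.0])))
```

The two SNV outputs agree because a per-spectrum gain and offset cancel in the centring and scaling, which is the multiplicative-scatter removal the glossary entry refers to.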

Where this leads

The product is made and continuously watched; the tank is full of antibody suspended in a culture that is starting to die. The next chapter, Harvest and Clarification: Predicting the Endpoint, takes the soft sensors we just built — titer, VCD, viability — and turns them on the one decision the bioreactor models barely touched: when is this batch done? It learns the harvest endpoint as a constrained optimization, the turbidity of the feed you are about to clarify, and the filter you will need to clarify it — the hinge where everything upstream becomes a consequence downstream.