Process Analytics: SPC, MVDA & Soft Sensors

📍 Where we are: Part VII, "Insight & Verdict." Every layer we built (capture, historian, batch context, semantics, trust) exists so that this chapter can finally pay off. Clean, contextualized data gets turned into charts that catch drift, models that fingerprint a batch, and a soft-sensor that predicts a quality attribute in real time.

The simple version

Think of the historian as a kitchen full of perfectly labelled ingredients. Up to now we have been stocking the pantry: wiring sensors, tagging streams, storing them with their batch and units attached. Analytics is the cooking. A control chart is the recipe card that says "this dish always comes out between 64 and 74; if it doesn't, stop and look." A soft-sensor is the experienced cook who tastes the sauce and tells you the salt level without sending it to a lab. Both only work because the ingredients were clean and labelled. Garbage in, garbage out, but we spent twenty-five chapters making sure it wasn't garbage.

What this chapter covers

This is the moment the data stack earns its keep. We take the deterministic datasets the simulator produced (SIM_SEED=2026) and run real, tested analytics over them, the kind a process engineer and a quality unit actually use:

  • Statistical process control (SPC): an I-MR control chart and a process-capability (Cpk) number for a release attribute across the campaign, plus a golden-batch envelope for an online tag.
  • Multivariate data analysis (MVDA): why a single batch is better described by all its variables at once, and how PCA/PLS fingerprint a trajectory.
  • A titer soft-sensor: a Partial Least Squares (PLS) model that predicts antibody titer from in-line Raman spectra, trained and validated with scikit-learn, and honest about where it breaks.
  • Model governance: the real bar for using a machine-learning model in a GMP decision, and why MLflow is the start, not the finish.

The two scripts at the heart of the chapter, examples/analytics/spc.py and examples/analytics/soft_sensor.py, are pure NumPy/pandas/scikit-learn over the committed datasets, so they run standalone with no services at all: python analytics/spc.py and python analytics/soft_sensor.py (the latter also wired as make soft-sensor). CI asserts that their outputs are stable. The served, MLflow-logged path (the soft-sensor running behind an API and logging every run for governance) lives in the full repo's analytics profile (model tracking alongside the historian); the standalone scripts here are the same models with the plumbing stripped away.

SPC: the chart that catches drift

Statistical process control is the oldest idea in this book and still the most useful. The premise: a stable process has common-cause variation (random wobble around a mean), and you can compute, from the data itself, the band that common-cause variation should stay inside. Anything outside that band is a special cause worth investigating. You are not setting the limits from the specification; you are letting the process tell you what "normal" looks like, then watching for "not normal."

For a release attribute measured once per batch, the right tool is an I-MR (individuals / moving-range) chart. You have one number per batch, so you estimate the spread from the moving range (the absolute difference between consecutive batches) rather than from a within-subgroup standard deviation. The control-chart constant d2 = 1.128 converts the average moving range of pairs into a sigma estimate.

Here is the core of examples/analytics/spc.py. It is hand-rolled on purpose: there is no maintained, permissively licensed pure-Python SPC library we were willing to pin, so the chapter shows the arithmetic plainly, using the classical control-chart and capability statistics that the open-source statistics ecosystem has long documented for process work:

# examples/analytics/spc.py
D2 = 1.128  # control-chart constant for moving range of n=2

def imr_limits(values: np.ndarray) -> dict:
    """Individuals (I) and moving-range (MR) control limits."""
    v = np.asarray(values, dtype=float)
    mr = np.abs(np.diff(v))      # moving range: |x_i - x_(i-1)|
    mr_bar = mr.mean()
    sigma = mr_bar / D2          # sigma estimated from the average moving range
    center = v.mean()
    return {
        "center": round(float(center), 4),
        "ucl": round(float(center + 3 * sigma), 4),
        "lcl": round(float(center - 3 * sigma), 4),
        "sigma": round(float(sigma), 5),
        "mr_bar": round(float(mr_bar), 5),
    }
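Written out as equations, the limits that imr_limits computes are the following. This is only a restatement of the code above; the symbols (x_i for the per-batch values, n for the number of batches) are our notation for this sketch, not names from the repo:

```latex
\[
\begin{aligned}
\overline{MR} &= \frac{1}{n-1}\sum_{i=2}^{n} \lvert x_i - x_{i-1} \rvert,
&\qquad
\hat{\sigma} &= \frac{\overline{MR}}{d_2}, \quad d_2 = 1.128, \\[4pt]
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i,
&\qquad
\mathrm{UCL},\ \mathrm{LCL} &= \bar{x} \pm 3\,\hat{\sigma}.
\end{aligned}
\]
```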

Capability is the second half of the story. A process can be perfectly in control (stable, predictable) and still incapable (its natural spread doesn't fit inside the spec). Cpk measures the distance from the process mean to the nearer specification limit, in units of 3-sigma, so Cpk ≥ 1.33 is the conventional "comfortably capable" bar, and 1.0 means the spread only just fits:

# examples/analytics/spc.py
def cpk(values: np.ndarray, lsl: float, usl: float) -> float:
    v = np.asarray(values, dtype=float)
    mu, sd = v.mean(), v.std(ddof=1)   # overall sample standard deviation
    if sd == 0:
        return float("inf")
    return round(float(min((usl - mu) / (3 * sd), (mu - lsl) / (3 * sd))), 3)
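As a formula, the cpk function above evaluates the expression below. The notation is ours (LSL and USL for the lower and upper specification limits, s for the sample standard deviation with n - 1 in the denominator):

```latex
\[
C_{pk} = \min\!\left( \frac{\mathrm{USL} - \bar{x}}{3s},\ \frac{\bar{x} - \mathrm{LSL}}{3s} \right),
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\]
```

Note that the function uses the overall sample standard deviation s, not the moving-range estimate used for the control limits, so the two sigmas in this chapter's output are not expected to match. Some texts reserve the name Ppk for the overall-sigma version.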

We run this against datasets/hplc_results.csv, the simulated Certificate-of-Analysis table, with one release row per batch per assay. The release_spc function pulls the cation-exchange main-peak purity (CEX_main_pct) across the six campaign batches:

batch_id,test,value,unit,spec_low,spec_high,result
BATCH-2026-001,CEX_main_pct,70.686,%,60.0,80.0,PASS
BATCH-2026-002,CEX_main_pct,69.085,%,60.0,80.0,PASS
BATCH-2026-003,CEX_main_pct,70.404,%,60.0,80.0,PASS
BATCH-2026-004,CEX_main_pct,67.879,%,60.0,80.0,PASS
BATCH-2026-005,CEX_main_pct,66.699,%,60.0,80.0,PASS
BATCH-2026-006,CEX_main_pct,69.171,%,60.0,80.0,PASS

Running python analytics/spc.py prints exactly this:

I-MR control chart for CEX_main_pct (n=6): {'center': 68.9873, 'ucl': 73.8262, 'lcl': 64.1485, 'sigma': 1.61294, 'mr_bar': 1.8194}
spec [60.0, 80.0] Cpk=1.984

Read it like a quality engineer would. The process centres at 68.99% main peak, with control limits at 64.1% and 73.8%; those limits came from the batch-to-batch variation, not the spec. The specification is wider (60–80%), so the process comfortably fits inside it: Cpk = 1.98, well above the 1.33 bar. This is precisely the Continued Process Verification (CPV) activity that FDA's process-validation lifecycle expects in Stage 3: ongoing statistical trending of routine production data to show the process stays in a state of control [1]. Serving this chart straight off the historian is the open-source way to do CPV without a six-figure statistics suite.
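The printed numbers can be re-derived by hand from the six CSV rows. The sketch below repeats the arithmetic of imr_limits and cpk inline, with the values and spec limits copied from the table above. The out_of_control check at the end is our own illustrative addition, not a function in spc.py:

```python
import numpy as np

D2 = 1.128  # control-chart constant for a moving range of n=2

# CEX_main_pct values copied from the six hplc_results.csv rows shown above
cex = np.array([70.686, 69.085, 70.404, 67.879, 66.699, 69.171])
LSL, USL = 60.0, 80.0  # spec_low / spec_high from the same rows

mr = np.abs(np.diff(cex))        # five consecutive-pair differences
sigma_mr = mr.mean() / D2        # moving-range sigma estimate
center = cex.mean()
ucl = center + 3 * sigma_mr
lcl = center - 3 * sigma_mr

# Capability uses the overall sample standard deviation (ddof=1), as cpk() does
sd = cex.std(ddof=1)
cpk_value = min((USL - center) / (3 * sd), (center - LSL) / (3 * sd))

# Illustrative addition (not in spc.py): positions of points outside the limits
out_of_control = np.flatnonzero((cex > ucl) | (cex < lcl)).tolist()

print(f"center={center:.4f} ucl={ucl:.4f} lcl={lcl:.4f} cpk={cpk_value:.3f}")
print("out-of-control batches:", out_of_control)
```

With these six values no batch falls outside the limits, which is what "in control" means on an individuals chart.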

The golden batch: a control chart over time

A single number per batch is the release view. The process view is a trajectory: a tag like reactor temperature evolving over fourteen days. The golden-batch envelope is an SPC chart turned sideways: instead of one limit for the whole batch, you compute a mean ± 3-sigma band at each point in batch time, so a live batch can be overlaid and you can see the moment it leaves the herd.

golden_envelope in the same file resamples one online tag to an hourly cadence and builds the band:

# examples/analytics/spc.py
def golden_envelope(tag: str = "BR101.Temp.PV", freq: str = "1h") -> pd.DataFrame:
    """Mean +/- 3 sigma envelope over batch time for one online tag (golden batch)."""
    ts = pd.read_parquet(DATA / "fedbatch_timeseries.parquet")
    s = (
        ts[ts.tag == tag]
        .set_index("ts")["value"]
        .resample(freq)
        .agg(["mean", "std"])
        .dropna()
    )
    s["upper"] = s["mean"] + 3 * s["std"].fillna(0)
    s["lower"] = s["mean"] - 3 * s["std"].fillna(0)
    return s.reset_index()

The same run reports:

golden-batch Temp envelope: 336 hourly points, mean range 36.49-37.01 degC

336 hourly points (fourteen days) with the temperature mean tracking between 36.49 and 37.01 °C: exactly the tight band a PID-controlled bioreactor should hold. The day-7 0.5 °C cooling excursion the simulator seeds shows up as a dip that pulls the mean down to the band's lower edge (36.49 °C). Overlay a new batch on that band in Grafana and an operator sees a deviation forming hours before it trips an alarm.
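To make the overlay idea concrete, here is a minimal sketch of checking a live batch against an envelope built across several historical batches. Everything in it is an assumption for illustration: the data are synthetic, the column names batch_id and elapsed_h are hypothetical, and the band is computed per elapsed hour across batches, which is not how golden_envelope resamples its single tag.

```python
import numpy as np
import pandas as pd

hours = np.arange(48)                                    # elapsed batch time, h
profile = 37.0 + 0.02 * np.sin(2 * np.pi * hours / 24)   # nominal temperature, degC
offsets = [-0.10, -0.05, 0.0, 0.05, 0.10]                # made-up batch-to-batch shifts

# Synthetic history: five batches following the same profile with small offsets
hist = pd.concat(
    [pd.DataFrame({"batch_id": f"B{i}", "elapsed_h": hours, "value": profile + off})
     for i, off in enumerate(offsets)],
    ignore_index=True,
)

# Envelope: mean +/- 3 sigma across batches at each elapsed hour
env = hist.groupby("elapsed_h")["value"].agg(["mean", "std"])
env["upper"] = env["mean"] + 3 * env["std"]
env["lower"] = env["mean"] - 3 * env["std"]

# A live batch that tracks the profile, then runs 0.5 degC cold from hour 30
live = pd.Series(profile.copy(), index=hours)
live.iloc[30:] -= 0.5

outside = (live > env["upper"]) | (live < env["lower"])
first_excursion_h = int(outside.idxmax()) if outside.any() else None
print("first hour outside the envelope:", first_excursion_h)
```

The same comparison is what a dashboard overlay does visually; computing it explicitly gives a timestamp that can be logged or alerted on.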

A two-panel process-analytics figure: on the left an I-MR control chart of CEX main-peak purity across six campaign batches with the centre line and 3-sigma control limits inside the wider specification band; on the right a Raman-to-titer PLS soft-sensor scatter of predicted versus measured titer hugging the 45-degree line, annotated with R-squared 0.99.

Left: the I-MR chart and Cpk for a release attribute: stable, capable, well inside spec. Right: the PLS soft-sensor recovering titer from simulated Raman spectra, predicted-vs-measured points hugging the identity line. Two faces of the same idea: let clean, contextualized data say whether the process is behaving.

Original diagram by the authors, created with AI assistance.

MVDA: one batch is a cloud, not a number

SPC watches one variable at a time. But a bioreactor batch is dozens of variables moving together (temperature, pH, dissolved oxygen, glucose, lactate, viable cell density, titer), and the correlations between them carry the real signal. Two batches can each have every individual tag inside its own control chart and still be subtly different, because the relationship between glucose and lactate drifted. That is what multivariate data analysis (MVDA) sees and univariate SPC misses.

The two workhorses are PCA and PLS. Principal Component Analysis (PCA) compresses many correlated tags into a few latent components, so a whole batch trajectory becomes a path through a low-dimensional space, and an abnormal batch literally veers off the normal path. The foundational method for monitoring batch processes this way is multiway PCA: you unfold the three-dimensional (batch × variable × time) data into a matrix, fit the model on good historical batches, and then score a new batch's deviation as Hotelling's T² and squared prediction error [2]. When a new batch's score plot leaves the confidence ellipse, a contribution plot tells you which variables pushed it out, which is the diagnostic the operator actually wants.
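The unfold, fit, and score steps can be sketched in a few lines of scikit-learn. This is an illustration on synthetic data, not code from the repo: the array shapes, the helper name t2_and_spe, and the "odd" batch are all made up, and the comparison against the training maximum stands in for proper statistical limits.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2026)

# Synthetic "good" history: 30 batches x 4 variables x 10 time points,
# driven by 2 hidden factors so the unfolded columns are strongly correlated.
n_batches, n_vars, n_times = 30, 4, 10
loadings = rng.normal(size=(2, n_vars * n_times))
factors = rng.normal(size=(n_batches, 2))
flat = factors @ loadings + 0.1 * rng.normal(size=(n_batches, n_vars * n_times))
cube = flat.reshape(n_batches, n_vars, n_times)      # batch x variable x time

# Batch-wise unfolding: one row per batch, variable-time pairs as columns
unfolded = cube.reshape(n_batches, n_vars * n_times)

scaler = StandardScaler().fit(unfolded)
pca = PCA(n_components=2).fit(scaler.transform(unfolded))

def t2_and_spe(rows):
    """Hotelling's T^2 and squared prediction error for unfolded batch rows."""
    z = scaler.transform(rows)
    scores = pca.transform(z)
    t2 = np.sum(scores ** 2 / pca.explained_variance_, axis=1)
    residual = z - pca.inverse_transform(scores)
    spe = np.sum(residual ** 2, axis=1)
    return t2, spe

t2_good, spe_good = t2_and_spe(unfolded)

# An "odd" batch: every value within 2 SD of its own column mean,
# but with the correlation structure between columns broken.
noise = np.clip(rng.normal(size=(1, unfolded.shape[1])), -2, 2)
odd = unfolded.mean(axis=0) + unfolded.std(axis=0) * noise
t2_odd, spe_odd = t2_and_spe(odd)

print("max SPE among good batches:", float(spe_good.max()))
print("SPE of the odd batch:      ", float(spe_odd[0]))
```

The odd batch would pass a univariate check on every column, yet its SPE should stand far above anything in the training set. That is the multivariate point made above, shown with numbers.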

Partial Least Squares (PLS) is PCA's supervised cousin: it finds the latent components that best predict an outcome (titer, a quality attribute) rather than just explaining variance. PLS is the basic tool of chemometrics precisely because spectroscopic and process data are wide, collinear, and noisy, which is exactly the regime where ordinary regression falls apart and PLS thrives [3]. In scikit-learn these are sklearn.decomposition.PCA and sklearn.cross_decomposition.PLSRegression, a few lines each over a contextualized batch table. The soft-sensor below is the PLS engine in its most concrete form, and the same PLSRegression API that fingerprints a batch trajectory is what predicts titer from a spectrum.

A titer soft-sensor: predicting quality from a spectrum

Here is the chapter's centrepiece. A soft-sensor (or virtual sensor) is a model that infers a hard-to-measure quantity from easy-to-measure ones. Titer (how many grams of antibody per litre you have made) normally needs an offline assay that takes hours. But an in-line Raman probe produces a spectrum every few minutes, and that spectrum carries a faint fingerprint of every molecule in the broth, including the product. If a model can learn that fingerprint, you get titer in real time. That is the essence of Process Analytical Technology (PAT): building quality assurance on in-process measurement instead of waiting for end-product testing [4].

The dataset is datasets/raman_spectra.parquet: one row per hourly timepoint, 701 intensity columns (wn_400 … wn_1800, the Raman shift in cm⁻¹) plus reference labels the simulator carried along from the same kinetic state, so the spectra are genuinely informative about concentration:

ts batch_id glucose_g_L lactate_g_L glutamine_mM VCD_e6_per_mL titer_g_L
2026-01-05 00:00:00+00:00 BATCH-2026-001 6.000 0.200 4.000 0.300 0.000
2026-01-05 01:00:00+00:00 BATCH-2026-001 5.998 0.201 3.999 0.306 0.000
2026-01-05 02:00:00+00:00 BATCH-2026-001 5.995 0.202 3.998 0.312 0.001

The model is in examples/analytics/soft_sensor.py. Loading is trivial: every column starting wn_ is a feature, titer_g_L is the target:

# examples/analytics/soft_sensor.py
def load_xy():
    df = pd.read_parquet(DATA / "raman_spectra.parquet")
    wn = [c for c in df.columns if c.startswith("wn_")]   # spectral feature columns
    X = df[wn].to_numpy()
    y = df[TARGET].to_numpy()
    return X, y, wn

Training is a textbook chemometrics pipeline: hold out a slice, standardize the spectra so no single wavenumber dominates by sheer magnitude, fit a PLS regression with a handful of latent components, and score on the held-out set. We use scikit-learn, the canonical open-source machine-learning library for exactly this kind of work [5]:

# examples/analytics/soft_sensor.py
def train(n_components: int = 6, test_size: float = 0.3, seed: int = 2026):
    X, y, wn = load_xy()
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size, random_state=seed)
    scaler = StandardScaler().fit(Xtr)   # fit the scaler on the training split only
    pls = PLSRegression(n_components=n_components)
    pls.fit(scaler.transform(Xtr), ytr)
    pred = pls.predict(scaler.transform(Xte)).ravel()
    r2 = r2_score(yte, pred)
    rmse = float(np.sqrt(mean_squared_error(yte, pred)))
    return {"n_components": n_components, "n_wavenumbers": len(wn),
            "n_train": len(ytr), "n_test": len(yte),
            "r2": round(float(r2), 4), "rmse_g_L": round(rmse, 4)}

python analytics/soft_sensor.py (or make soft-sensor) prints:

PLS soft-sensor (titer from Raman): R2=0.9923 RMSE=0.1498 g/L (6 comps, 701 wavenumbers, 235 train / 101 test)
ASSERT ok: R2 > 0.85 - the Raman dataset is genuinely predictive of titer.

Six latent components, distilled from 701 wavenumbers, recover titer with R² = 0.99 and an RMSE of 0.15 g/L on data the model never saw. The script ends with a hard assertion, assert m["r2"] > 0.85, so the book's claim cannot silently rot: if a future change to the simulator broke the signal, CI would fail loudly. This mirrors published reality, where data-driven models forecast mAb titer and metabolites in CHO fed-batch culture toward digital-twin use [6], and where Raman + PLS glucose-feedback control has actually been shown to lift titer by roughly a quarter in CHO bioreactors [7].

Where the soft-sensor breaks, and why honesty is the point

R² = 0.99 is a suspiciously good number, and the chapter's docstring says so out loud: the simulated spectra carry the titer signal cleanly. Real Raman is harder. A bioprocess soft-sensor must survive variable batch length, multiple process phases (lag, exponential, stationary, death, each with different spectral behaviour), fouling and bubbles on the probe, and outright sensor faults. A model trained on one set of conditions degrades quietly when any of those shift [8]. One spectrum genuinely needs many models: a calibration that works in stationary phase may be useless during the exponential ramp.

The takeaway is a design rule, not a disclaimer. A soft-sensor is only trustworthy with drift monitoring beside it: track the spectral inputs against the training distribution and the prediction residuals against the occasional offline reference, and raise a flag when either wanders. The serving path in the full repo logs every prediction so this is auditable; the model is never allowed to make a GMP decision unattended.
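The two checks named above can be sketched as follows. This is a hedged illustration, not the repo's serving code: the function names, the thresholds Z_LIMIT and BIAS_LIMIT, and all the numbers are hypothetical, and a per-wavenumber z-score is only one simple way to compare an input against the training distribution.

```python
import numpy as np

Z_LIMIT = 5.0      # hypothetical limit on the worst per-wavenumber z-score
BIAS_LIMIT = 0.3   # hypothetical limit on mean residual vs offline reference, g/L

def input_drift_score(X_train, x_new):
    """Largest absolute z-score of a new spectrum against the training spectra."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    sd = np.where(sd == 0, 1.0, sd)          # guard against constant columns
    return float(np.max(np.abs((x_new - mu) / sd)))

def residual_bias(pred, reference):
    """Mean of prediction minus offline reference over the available samples."""
    return float(np.mean(np.asarray(pred, dtype=float) - np.asarray(reference, dtype=float)))

rng = np.random.default_rng(2026)
X_train = rng.normal(size=(200, 20))         # synthetic training spectra

x_ok = X_train[0]                            # a spectrum like the training data
x_fouled = X_train.mean(axis=0) + 6 * X_train.std(axis=0, ddof=1)   # baseline lifted

reference = np.array([1.0, 2.0, 3.0])        # occasional offline assay results, g/L
pred_ok = np.array([1.0, 2.1, 2.9])
pred_biased = reference + 0.5                # predictions reading 0.5 g/L high

for name, x, pred in [("ok", x_ok, pred_ok), ("drifted", x_fouled, pred_biased)]:
    flag_input = input_drift_score(X_train, x) > Z_LIMIT
    flag_resid = abs(residual_bias(pred, reference)) > BIAS_LIMIT
    print(name, "input flag:", flag_input, "residual flag:", flag_resid)
```

Either flag on its own is a reason to pull an offline sample and review the model, which is the "never unattended" rule in operational form.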

Model governance: the GxP bar for ML

A model that predicts titer for a dashboard is one thing. A model whose output feeds a release or in-process control decision is a regulated analytical procedure, and the bar climbs sharply. ICH Q14 is explicit that model-based analytical procedures (NIR, Raman, multivariate calibrations) carry a lifecycle obligation: documented calibration, formal validation against a reference method, defined operating ranges, and ongoing performance monitoring with a re-validation trigger when the model drifts [9]. A soft-sensor is not "trained once and trusted forever"; it is a controlled procedure with a maintenance burden.

That is where open-source tooling helps and where it stops. MLflow gives you the technical spine of governance: experiment tracking (every run's parameters, metrics, and the exact dataset hash), a model registry with versions and stage aliases, and run-to-model lineage so you can answer "which model version, trained on which data, produced this prediction on this batch?" [10]. The soft-sensor logs its Rยฒ and RMSE to MLflow in the full repo for exactly this reason.
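The "exact dataset hash" part of that lineage is easy to make concrete with the standard library alone. The sketch below is an assumption-laden illustration: the helper names and the record layout are hypothetical, the demo file is a three-byte stand-in for a real dataset, and how the full repo actually hands such values to MLflow is not shown in this chapter. In a tracked run, values like these could be passed to MLflow's mlflow.log_params and mlflow.log_metrics calls.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def lineage_record(dataset: Path, params: dict, metrics: dict) -> dict:
    """Bundle what is needed to answer 'which data and settings produced this model?'."""
    return {
        "dataset_name": dataset.name,
        "dataset_sha256": sha256_of_file(dataset),
        "params": params,
        "metrics": metrics,
    }

# Demo on a tiny stand-in file; the params and metrics echo the run printed earlier
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_dataset.bin"
    demo.write_bytes(b"abc")
    record = lineage_record(
        demo,
        params={"n_components": 6, "test_size": 0.3, "seed": 2026},
        metrics={"r2": 0.9923, "rmse_g_L": 0.1498},
    )

print(json.dumps(record, indent=2))
```

If the dataset changes by a single byte the hash changes, so a model version can always be tied back to the exact file it was trained on.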

But MLflow tracks; it does not validate. The honest GxP last mile is the same one this whole book keeps arriving at: the audit trail on who promoted a model version, the e-signature on the validation report, the change-control procedure that governs re-training, and the documented intended use that scopes what the model is allowed to decide. Those are properties of a validated system and procedure, not a pip install. MLflow gets you the lineage; the validated lifecycle around it is the work.

Why it matters

Every prior chapter was infrastructure; this one is the reason the infrastructure exists. A control chart that catches a drifting purity attribute before it fails spec, a golden-batch envelope that flags a deviating reactor hours early, a soft-sensor that turns a four-hour assay into a real-time number: these are what let a plant move from testing quality into the product at the end toward building quality assurance on process understanding throughout [4]. And it only works because the data arriving here is clean, contextualized, and attributable. The most sophisticated PLS model in the world is worthless on data you cannot trust; the most trustworthy data is wasted if no one turns it into a decision. Analytics is where those two halves meet.

In the real world

In a commercial mAb plant the SPC and CPV layer is often a validated statistics suite (JMP, Minitab, or a Discoverant-style process-intelligence platform) and the MVDA is frequently SIMCA (Sartorius/Umetrics), the tools that productized multiway PCA/PLS for batch monitoring. Soft-sensors ride on PAT data-management systems wired to in-line Raman analyzers (Endress+Hauser/Kaiser, Tornado). Our open-source stack does not pretend to replace a validated PAT system that controls a feed pump; it does the analysis, the prototyping, and the contextualized historian feed alongside them, and it does the SPC/CPV trending genuinely well.

A few concrete anchors:

  • NIIMBL, the U.S. public-private Manufacturing USA institute for biopharmaceutical innovation, funds exactly this kind of PAT and real-time-monitoring work. Its SABRE facility (the NIIMBL / University of Delaware pilot-scale cGMP, or current Good Manufacturing Practice, facility that broke ground in April 2024) is being built to demonstrate next-generation, in-line-analytics-rich processing at pilot scale. SABRE is a facility, not a data program, but it is where soft-sensor-driven control of the kind sketched here is meant to be exercised.
  • The continuous variant raises the analytics stakes. A perfusion / 3MCC line runs for weeks at near-steady state, so SPC moves from per-batch points to time-windowed trending, and a soft-sensor's drift becomes a daily operational concern rather than a per-batch one. That is exactly why the drift-monitoring discipline above is non-negotiable in intensified processing.

The honest OSS-vs-commercial verdict for this layer: open source wins more of this layer than any other in the book. pandas/NumPy/SciPy, scikit-learn [5], and statsmodels [11] are the same algorithms the commercial suites wrap; a Jupyter notebook plus MLflow is a credible, reproducible analytics environment, and serving an SPC chart or a soft-sensor prediction into Grafana costs nothing but engineering. Where pure OSS stops: it is not a validated, vendor-accountable PAT control system, there is no maintained Part 11-grade SPC GUI for a non-coding analyst, and the model-governance lifecycle (validation evidence, signed approvals, locked change control) is procedure you build, not software you download. Get the analytics in open source; buy or build the validated wrapper for the decisions that touch release.

Key terms

  • SPC (statistical process control): using the process's own variation to set control limits and flag special-cause deviations.
  • I-MR chart: individuals / moving-range control chart for one value per batch; spread estimated from consecutive-pair differences via d2 = 1.128.
  • Control limits vs specification: limits come from the data (what the process does); spec comes from the product requirement (what it must do). They are not the same band.
  • Cpk: process capability index; distance from the mean to the nearer spec limit in 3-sigma units; ≥ 1.33 is comfortably capable.
  • Golden-batch envelope: a mean ± 3-sigma band computed at each point in batch time, for overlaying a live batch on the historical norm.
  • MVDA: multivariate data analysis; modelling many correlated variables (and their relationships) at once.
  • PCA / multiway PCA: Principal Component Analysis; multiway PCA unfolds batch × variable × time data to monitor whole trajectories.
  • PLS: Partial Least Squares regression; supervised latent-variable modelling for wide, collinear data like spectra.
  • Contribution plot: the MVDA diagnostic that attributes an out-of-control multivariate score to specific variables.
  • Soft-sensor (virtual sensor): a model inferring a hard-to-measure quantity (titer) from easy-to-measure inputs (Raman spectra).
  • Raman spectrum: an in-line optical measurement; here 701 intensity points over wavenumber, carrying a molecular fingerprint of the broth.
  • CPV (Continued Process Verification): Stage 3 of the process-validation lifecycle; ongoing statistical trending of routine production data.
  • PAT: Process Analytical Technology; real-time in-process measurement driving quality decisions.
  • Drift monitoring: tracking model inputs and residuals against the training distribution to detect when a model has gone stale.

Where this leads

We now have the full toolkit: capture, historian, context, semantics, trust, and the analytics that turn it all into insight. The final build chapter, Capstone: One Batch, End to End, puts every layer in a single line: a complete simulated fed-batch CHO + Protein A run driven from sensor through ingestion, historian, contextualization, signed records, audit trail, and these very analytics, ending in one reviewable, release-ready dataset and dashboard. It is the proof that everything you built actually interconnects.