Process Analytics: SPC, MVDA & Soft Sensors

📍 Where we are: Part VII, "Insight & Verdict." Every layer we built — capture, historian, batch context, semantics, trust — exists so that this chapter can finally pay off. Clean, contextualized data gets turned into charts that catch drift, models that fingerprint a batch, and a soft-sensor that predicts a quality attribute in real time.

The simple version

Think of the historian as a kitchen full of perfectly labelled ingredients. Up to now we have been stocking the pantry — wiring sensors, tagging streams, storing them with their batch and units attached. Analytics is the cooking. A control chart is the recipe card that says "this dish always comes out between 64 and 74 — if it doesn't, stop and look." A soft-sensor is the experienced cook who tastes the sauce and tells you the salt level without sending it to a lab. Both only work because the ingredients were clean and labelled. Garbage in, garbage out — but we spent twenty-five chapters making sure it wasn't garbage.

What this chapter covers

This is the moment the data stack earns its keep. We take the deterministic datasets the simulator produced (SIM_SEED=2026) and run real, tested analytics over them — the kind a process engineer and a quality unit (the independent QA/QC organization that decides whether a batch may ship) actually use:

Statistical process control (SPC): an I-MR control chart and a process-capability (Cpk) number for a release attribute across the campaign, plus a golden-batch envelope for an online tag.
Multivariate data analysis (MVDA): why a single batch is better described by all its variables at once, and how PCA/PLS fingerprint a trajectory.
A titer soft-sensor: a Partial Least Squares (PLS) model that predicts antibody titer from in-line Raman spectra, trained and validated with scikit-learn — and honest about where it breaks.
Model governance: the real bar for using a machine-learning model in a GMP decision, and why MLflow is the start, not the finish.

The two scripts at the heart of the chapter, examples/analytics/spc.py and examples/analytics/soft_sensor.py, are pure NumPy/Pandas/scikit-learn over the committed datasets, so they run standalone with no services at all. After make venv (which sets up a virtual environment, an isolated per-project Python install), run make soft-sensor (or sim/.venv/bin/python analytics/soft_sensor.py, since the scikit-learn dependency lives in that venv) and sim/.venv/bin/python analytics/spc.py. CI (continuous integration, the automated test run on every change) asserts that the soft-sensor's R² stays above 0.85 and that the SPC chart stays a sane, capable control chart (lcl < center < ucl, Cpk > 1.0). The served path, in which the soft-sensor runs behind an API and logs every run to MLflow (the open-source experiment-tracking and model-registry tool) for governance, is sketched as the production target: an MLflow tracking server logging each run alongside the historian. The standalone scripts here are the same models with that serving plumbing stripped away. (The committed analytics Compose profile ships only a metrics store, VictoriaMetrics; the MLflow serving path is described, not yet wired in the example repo.)

SPC: the chart that catches drift

Statistical process control is the oldest idea in this book and still the most useful. The premise: a stable process has common-cause variation — random wobble around a mean — and you can compute, from the data itself, the band that common-cause variation should stay inside. Anything outside that band is a special cause worth investigating. You are not setting the limits from the specification; you are letting the process tell you what "normal" looks like, then watching for "not normal."

For a release attribute measured once per batch, the right tool is an I-MR (individuals / moving-range) chart. You have one number per batch, so you estimate the spread from the moving range — the absolute difference between consecutive batches — rather than from a within-subgroup standard deviation (a "subgroup" being a small batch of several measurements taken together, like five parts off a line; here each batch yields a single number, so there is no subgroup to take a spread within). The control-chart constant d2 = 1.128 converts the average moving range of pairs into a sigma estimate (sigma being the standard deviation — a measure of how much the values scatter around their mean). The constant is a fixed, tabulated value: for moving ranges of pairs (n=2), the average range of a normal process runs about 1.128 times its standard deviation, so dividing the average moving range by 1.128 backs out an estimate of sigma.
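
The 1.128 figure can be checked rather than taken on trust. For two independent normal values the expected absolute difference is 2σ/√π, which gives d2 = 2/√π for pairs, and a quick simulation lands in the same place. The sketch below is illustrative only: the seed, sample size and the simulated process mean and sigma are arbitrary choices, not values from the example repo.

```python
import math
import numpy as np

# Closed form for n=2: E|X1 - X2| = 2*sigma/sqrt(pi), so d2 = 2/sqrt(pi).
d2_exact = 2.0 / math.sqrt(math.pi)

# Monte-Carlo check on a simulated stable process (arbitrary seed and size).
rng = np.random.default_rng(0)
true_sigma = 1.5
x = rng.normal(loc=70.0, scale=true_sigma, size=200_000)

mr_bar = np.abs(np.diff(x)).mean()   # average moving range of consecutive pairs
d2_simulated = mr_bar / true_sigma   # should land close to 1.128
sigma_hat = mr_bar / 1.128           # the I-MR sigma estimate, close to true_sigma

print(f"d2 exact={d2_exact:.4f}  simulated={d2_simulated:.4f}  sigma_hat={sigma_hat:.3f}")
```

With hundreds of thousands of points the estimate is tight. With six batches it is not, which is worth keeping in mind when reading the limits later in the chapter.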

Here is the core of examples/analytics/spc.py. It is hand-rolled on purpose: there is no maintained, permissively licensed pure-Python SPC library we were willing to pin, so the chapter shows the arithmetic plainly — the classical control-chart and capability statistics that the open-source statistics ecosystem has long documented for process work:

# examples/analytics/spc.py
import numpy as np

D2 = 1.128  # control-chart constant d2 for moving ranges of n=2

def imr_limits(values: np.ndarray) -> dict:
    """Individuals (I) and moving-range (MR) control limits."""
    v = np.asarray(values, dtype=float)
    mr = np.abs(np.diff(v))
    mr_bar = mr.mean()
    sigma = mr_bar / D2
    center = v.mean()
    return {
        "center": round(float(center), 4),
        "ucl": round(float(center + 3 * sigma), 4),
        "lcl": round(float(center - 3 * sigma), 4),
        "sigma": round(float(sigma), 5),
        "mr_bar": round(float(mr_bar), 5),
    }

Capability is the second half of the story. A process can be perfectly in control (stable, predictable) and still incapable (its natural spread doesn't fit inside the spec). Cpk measures the distance from the process mean to the nearer specification limit, expressed in multiples of 3-sigma. (A stable process scatters about its mean so that very nearly all output — about 99.7% — falls within ±3 sigma, i.e. a natural width of 6 sigma, 3 sigma on each side; capability then asks how many of those 3-sigma half-widths fit between the mean and the nearer spec limit.) So Cpk = 1.0 means that nearer limit sits exactly 3-sigma away — the spread only just fits — and Cpk ≥ 1.33 (the nearer limit a comfortable 4-sigma out) is the conventional "comfortably capable" bar, leaving margin for the process to wobble without making out-of-spec product:

# examples/analytics/spc.py
def cpk(values: np.ndarray, lsl: float, usl: float) -> float:
    v = np.asarray(values, dtype=float)
    mu, sd = v.mean(), v.std(ddof=1)
    if sd == 0:
        return float("inf")
    return round(float(min((usl - mu) / (3 * sd), (mu - lsl) / (3 * sd))), 3)
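
To put the 1.0 and 1.33 thresholds in perspective, a Cpk value can be turned into an expected fraction of output beyond the nearer spec limit. This is a minimal sketch and it leans on two assumptions that real data only approximate: the attribute is normally distributed, and the mean and sigma are known exactly. The helper name tail_beyond_nearer_limit is made up for this illustration and is not part of spc.py.

```python
import math

def tail_beyond_nearer_limit(cpk_value: float) -> float:
    """One-sided normal tail beyond the nearer spec limit.

    By definition the nearer limit sits 3 * Cpk standard deviations from the mean.
    """
    z = 3.0 * cpk_value
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for c in (1.0, 1.33, 1.984):
    ppm = tail_beyond_nearer_limit(c) * 1e6
    print(f"Cpk={c}: about {ppm:.3g} ppm beyond the nearer limit")
```

The tail shrinks very quickly as Cpk rises, which is why the step from 1.0 to 1.33 buys so much margin. A Cpk estimated from only six batches carries wide uncertainty, so the small-tail figures are best read as orders of magnitude, not predictions.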

We run this against datasets/hplc_results.csv, the simulated Certificate-of-Analysis (CofA) table: the formal per-batch quality record that lists each release test, its result, and the specification it must meet, with one release row per batch per assay. The release_spc function pulls the cation-exchange main-peak charge-variant percent (CEX_main_pct) across the six campaign batches. An antibody is not one perfectly uniform molecule: small chemical differences (a deamidation here, a clipped C-terminal lysine there) shift a fraction of the population to a slightly different electrical charge, so the product is really a mixture of charge variants, the intended main form plus more-acidic and more-basic siblings. Cation-exchange chromatography (CEX) separates that mixture by charge: each charge variant emerges at a different time and registers as a peak on the detector trace, so the dominant form is literally the "main peak." CEX_main_pct = 70% means 70% of the antibody is the main charge form; it is the dominant charge variant, not a purity measure. Purity and aggregate content are reported by a different assay, size-exclusion chromatography (SEC, SEC_monomer_pct), alongside the impurity assays. The six CEX_main_pct rows are:

batch_id,test,value,unit,spec_low,spec_high,result
BATCH-2026-001,CEX_main_pct,70.686,%,60.0,80.0,PASS
BATCH-2026-002,CEX_main_pct,69.085,%,60.0,80.0,PASS
BATCH-2026-003,CEX_main_pct,70.404,%,60.0,80.0,PASS
BATCH-2026-004,CEX_main_pct,67.879,%,60.0,80.0,PASS
BATCH-2026-005,CEX_main_pct,66.699,%,60.0,80.0,PASS
BATCH-2026-006,CEX_main_pct,69.171,%,60.0,80.0,PASS

Running python analytics/spc.py prints exactly this:

I-MR control chart for CEX_main_pct (n=6): {'center': 68.9873, 'ucl': 73.8262, 'lcl': 64.1485, 'sigma': 1.61294, 'mr_bar': 1.8194}
  spec [60.0, 80.0]  Cpk=1.984

Read it like a quality engineer would. The process centres at 68.99% main peak, with control limits at 64.1% and 73.8% — those limits came from the batch-to-batch variation, not the spec. The specification is wider (60–80%), so the process comfortably fits inside it: Cpk = 1.98, well above the 1.33 bar. This is precisely the Continued Process Verification (CPV) activity that FDA's process-validation lifecycle expects in Stage 3 (the lifecycle runs Stage 1 process design, Stage 2 process qualification, then Stage 3 ongoing verification of routine commercial production) — ongoing statistical trending of routine production data to show the process stays in a state of control [1]. Serving this chart straight off the historian is the open-source way to do CPV without a six-figure statistics suite.

Anatomy of an I-MR / Cpk SPC record

That one printed line is not a chart yet — it is a small record, and a quality engineer reads it field by field before they trust the verdict. The dict release_spc("CEX_main_pct") returns is the SPC record for this attribute, and every field has a job. Pull it apart the way a reviewer would:

The output struct of release_spc as an identity card: the spread chain (mr_bar → d2 → sigma) that builds the limits, the capability number Cpk, and — the field pair that matters most — the data-derived control limits set against the product specification.

Original diagram by the authors, created with AI assistance.

center = 68.9873 — v.mean() over the six batch values; the line every other field is measured against.
mr_bar = 1.8194 — the mean of the five consecutive-pair moving ranges. With one number per batch there is no within-subgroup spread to use, so the moving range stands in for it.
d2 = 1.128 — the control-chart bias constant for pairs (n=2); it is the single hard-coded number in the file (D2 = 1.128), and it converts the average moving range into a sigma: sigma = mr_bar / d2.
sigma = 1.61294 — 1.8194 / 1.128. This is not the ordinary standard deviation of the six values; it is the moving-range estimate, deliberately, because that is what an I-MR chart uses.
ucl / lcl = 73.8262 / 64.1485 — center ± 3·sigma. These are the control limits, and they exist only because of the chain above.
Cpk = 1.984 — computed by the separate cpk() function, which uses the ordinary sample standard deviation (std(ddof=1) — the ddof=1 asks NumPy for the sample standard deviation, dividing by n−1 rather than n, the standard correction when estimating spread from a finite sample) and the spec, not the moving-range sigma. The two halves of the record answer two different questions, and they are allowed to use different spread estimates.
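
Every field in that record can be re-derived from the six CEX_main_pct values in the CSV rows shown earlier. The sketch below repeats the arithmetic outside spc.py, with the values typed in by hand instead of read from the file; it is a cross-check of the chain, not the repo's code.

```python
import numpy as np

# The six CEX_main_pct release values, copied from the hplc_results.csv rows above.
values = np.array([70.686, 69.085, 70.404, 67.879, 66.699, 69.171])
lsl, usl = 60.0, 80.0
D2 = 1.128

center = values.mean()
mr = np.abs(np.diff(values))     # five consecutive-pair moving ranges
mr_bar = mr.mean()
sigma_mr = mr_bar / D2           # moving-range sigma, used for the control limits
ucl, lcl = center + 3 * sigma_mr, center - 3 * sigma_mr

sd_sample = values.std(ddof=1)   # ordinary sample standard deviation, used for Cpk
cpk = min((usl - center) / (3 * sd_sample), (center - lsl) / (3 * sd_sample))

print(f"center={center:.4f} mr_bar={mr_bar:.4f} sigma_mr={sigma_mr:.5f}")
print(f"lcl={lcl:.4f} ucl={ucl:.4f} sd_sample={sd_sample:.3f} Cpk={cpk:.3f}")
```

The two spread estimates differ on the same six numbers (about 1.61 from the moving range, about 1.51 from the sample standard deviation), which is the concrete reason the control limits and Cpk should not be expected to agree on "sigma".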

Control limits are not specifications

The single most misread field pair in all of SPC sits in that record, and the figure draws it as two stacked bands on purpose. The specification [60.0, 80.0] is what the product must do — it is fixed by the CofA acceptance criteria, lives in spec_low/spec_high in hplc_results.csv, and would be identical even if every batch came out at 61%. The control limits [64.1485, 73.8262] are what the process actually does — they are computed from the batch-to-batch variation and would tighten or widen if the process changed, regardless of the spec. Confusing the two is how plants either chase phantom problems (treating a wide spec as the alarm band) or miss real drift (assuming "in spec" means "in control"). They are different bands answering different questions, and a process can be inside one while leaving the other.

That distinction is also why a real control chart applies more than the one out-of-band test. A single point outside 64.15–73.83 is the obvious special cause, but a process can drift dangerously while every point is still inside the limits — nine points hugging one side of the centre line, or six points marching steadily upward. Those patterns are caught by the classic Western Electric / Nelson run rules, listed in the figure's side panel — a representative five of the eight Nelson rules (the five most common, skipping the rarer Rules 4, 7 and 8): one point beyond 3-sigma (Rule 1), nine on one side of centre (Rule 2), six in a steady trend (Rule 3), and the 2-of-3-beyond-2-sigma and 4-of-5-beyond-1-sigma zone tests (Rules 5–6). (A control chart is mentally divided into one-, two-, and three-sigma "zones" on each side of the centre line; the zone tests count how many recent points fall in the outer zones.) Our six campaign batches pass every rule, so the verdict "in a state of control" is earned, not assumed. spc.py ships only the limits and Cpk; the run-rule layer is the obvious next function, and it is the difference between a chart that catches a blown batch and one that catches a drifting process.
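
The run-rule layer can be sketched in a few lines. What follows is one possible shape for it, not code from spc.py: the function name run_rules, the rule labels, and the choice to report the index at which each rule fires are all assumptions made for illustration. It covers the five rules named above (1, 2, 3, 5 and 6).

```python
import numpy as np

def run_rules(values, center, sigma):
    """Return, per rule, the indices at which that rule fires (empty list = pass)."""
    v = np.asarray(values, dtype=float)
    z = (v - center) / sigma
    n = len(v)

    def k_of_m_beyond(k, m, thresh):
        # Fires at index `end` when at least k of the last m points lie above
        # +thresh, or at least k of them lie below -thresh (same side).
        hits = []
        for end in range(m - 1, n):
            w = z[end - m + 1 : end + 1]
            if (w > thresh).sum() >= k or (w < -thresh).sum() >= k:
                hits.append(end)
        return hits

    # Rule 3: six points in a row steadily rising or falling = five same-sign steps.
    steps = np.sign(np.diff(v))
    trend = [end for end in range(5, n) if abs(steps[end - 5 : end].sum()) == 5]

    return {
        "rule1_beyond_3sigma": [i for i in range(n) if abs(z[i]) > 3],
        "rule2_nine_one_side": k_of_m_beyond(9, 9, 0.0),
        "rule3_six_trending": trend,
        "rule5_2of3_beyond_2sigma": k_of_m_beyond(2, 3, 2.0),
        "rule6_4of5_beyond_1sigma": k_of_m_beyond(4, 5, 1.0),
    }

cex = [70.686, 69.085, 70.404, 67.879, 66.699, 69.171]
result = run_rules(cex, center=68.9873, sigma=1.61294)
print(result)   # every list is empty for the six campaign batches
```

With only six points, Rule 2 cannot fire at all (it needs nine), so "passes every rule" on a short campaign is a weaker statement than it will be once the chart has a few dozen batches behind it.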

The golden batch: a control chart over time

A single number per batch is the release view. The process view is a trajectory — a tag like reactor temperature evolving over fourteen days. The golden-batch envelope is an SPC chart turned sideways: instead of one limit for the whole batch, you compute a mean ± 3-sigma band at each point in batch time, so a live batch can be overlaid and you can see the moment it leaves the herd.

golden_envelope in the same file resamples one online tag to an hourly cadence and builds the band:

# examples/analytics/spc.py
def golden_envelope(tag: str = "BR101.Temp.PV", freq: str = "1h") -> pd.DataFrame:
    """Mean +/- 3 sigma envelope over batch time for one online tag (golden batch)."""
    ts = pd.read_parquet(DATA / "fedbatch_timeseries.parquet")
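    # Hourly mean and std of the tag; dropna() discards any hour with a NaN mean or std
    # (an empty hour, or an hour with a single sample).
    s = ts[ts.tag == tag].set_index("ts")["value"].resample(freq).agg(["mean", "std"]).dropna()
    # After dropna() no NaN std remains, so fillna(0) is only a defensive no-op.
    s["upper"] = s["mean"] + 3 * s["std"].fillna(0)
    s["lower"] = s["mean"] - 3 * s["std"].fillna(0)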
    return s.reset_index()

The same run reports:

golden-batch Temp envelope: 336 hourly points, mean range 36.49-37.01 degC

That is 336 hourly points (fourteen days), with the hourly temperature mean tracking between 36.49 and 37.01 °C: exactly the tight band a PID-controlled bioreactor should hold. (PID, proportional-integral-derivative, is the standard feedback controller that nudges heating and cooling to keep a measured value on setpoint.) The simulator deliberately seeds a 0.5 °C cooling excursion on day 7 (a brief unintended departure from setpoint) so the chapter has a known deviation to detect, and it shows up as the dip that pulls the hourly mean down to the bottom of that range (36.49 °C). That dip is an unplanned setpoint excursion. It is distinct from the deliberate production-phase temperature downshift (often to ~31–33 °C) that some CHO (Chinese Hamster Ovary, the standard mammalian host cell line for antibody drugs) processes run on schedule to slow cell growth and favour product formation; an envelope built from good batches would already contain that downshift and must not alarm on it. One honest caveat about the band itself: the demo here is a single seeded batch (BATCH-2026-001), so the spread is the within-window variation of one trajectory and only illustrates the method; a true production envelope runs the same mean ± 3-sigma computation across a library of good batches at each batch-time index. Overlay a new batch on that band in Grafana and an operator sees a deviation forming hours before it trips an alarm.
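
The across-batch version of the envelope can be sketched on synthetic data. Everything below is made up for illustration: the eight "good" batches, the 37.0 °C setpoint, the noise level and the seeded dip are assumptions, and the batches are taken to be already aligned on a common hourly batch-time index, which real batches usually are not without extra work.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2026)
hours = np.arange(336)   # 14 days at hourly cadence

# Hypothetical library: 8 good batches holding a 37.0 degC setpoint with small noise.
library = pd.DataFrame(
    37.0 + rng.normal(0.0, 0.05, size=(len(hours), 8)),
    index=hours,
    columns=[f"GOOD-{i:02d}" for i in range(8)],
)

# Envelope across batches at each batch-time index (row-wise statistics).
env = pd.DataFrame({
    "mean": library.mean(axis=1),
    "std": library.std(axis=1, ddof=1),
})
env["upper"] = env["mean"] + 3 * env["std"]
env["lower"] = env["mean"] - 3 * env["std"]

# A new batch with a seeded 0.5 degC dip over hours 160 to 170 (label slice, inclusive).
new_batch = pd.Series(37.0 + rng.normal(0.0, 0.05, size=len(hours)), index=hours)
new_batch.loc[160:170] -= 0.5

outside = new_batch[(new_batch > env["upper"]) | (new_batch < env["lower"])]
print(f"{len(outside)} of {len(new_batch)} hours fall outside the envelope")
```

A standard deviation estimated from eight batches is noisy, so a band this thin will also flag the occasional normal hour. That is an argument for a larger library and for pairing the envelope with run rules, not for widening the band by hand.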

A two-panel process-analytics figure: on the left an I-MR control chart of CEX main-peak (charge variant) across six campaign batches with the centre line and 3-sigma control limits inside the wider specification band; on the right a Raman-to-titer PLS soft-sensor scatter of predicted versus measured titer hugging the 45-degree line, annotated with R-squared 0.99.

Left: the I-MR chart and Cpk for a release attribute — stable, capable, well inside spec. Right: the PLS soft-sensor recovering titer from simulated Raman spectra, predicted-vs-measured points hugging the identity line. Two faces of the same idea — let clean, contextualized data say whether the process is behaving.

Original diagram by the authors, created with AI assistance.

MVDA: one batch is a cloud, not a number

SPC watches one variable at a time. But a bioreactor batch is dozens of variables moving together — temperature, pH, dissolved oxygen, glucose, lactate, viable cell density, titer — and the correlations between them carry the real signal. Two batches can each have every individual tag inside its own control chart and still be subtly different, because the relationship between glucose and lactate drifted. That is what multivariate data analysis (MVDA) sees and univariate SPC misses.

The two workhorses are PCA and PLS. Principal Component Analysis (PCA) compresses many correlated tags into a few latent components: new synthetic axes, each a weighted blend of the original tags, that capture most of the variation in far fewer numbers. A whole batch trajectory then becomes a path through a low-dimensional space, and an abnormal batch literally veers off the normal path. The foundational method for monitoring batch processes this way is multiway PCA: you unfold the three-dimensional (batch × variable × time) data into a matrix, fit the model on good historical batches, and then score a new batch's deviation as Hotelling's T² and squared prediction error (two multivariate scores defined in the next subsection) [5]. When a new batch's point on the score plot leaves the confidence ellipse, a contribution plot tells you which variables pushed it out, which is the diagnostic the operator actually wants.

Partial Least Squares (PLS) is PCA's supervised cousin: it finds the latent components that best predict an outcome (titer, a quality attribute) rather than just explaining variance. PLS is the basic tool of chemometrics precisely because spectroscopic and process data are wide, collinear, and noisy — exactly the regime where ordinary regression falls apart and PLS thrives [4]. In scikit-learn these are sklearn.decomposition.PCA and sklearn.cross_decomposition.PLSRegression — a few lines each over a contextualized batch table; the soft-sensor below is the PLS engine in its most concrete form, and the same PLSRegression API that fingerprints a batch trajectory is what predicts titer from a spectrum.

Reading a score plot: Hotelling's T-squared and the contribution plot

The output of a fitted PCA model is not a prediction; it is a map. Each batch (or each timepoint within a batch) becomes a point in the low-dimensional space of the first few components — the score plot. Good historical batches cluster into a cloud, and a 95% confidence ellipse drawn around that cloud is the multivariate equivalent of a control chart's limits: it is sized to enclose 95% of normal batches (so a normal batch only rarely falls outside by chance), and a new batch whose score lands inside the ellipse looks like the others, while one that lands outside has a problem.

Two statistics turn that picture into numbers a monitor can alarm on. Hotelling's T² measures how far a batch sits from the centre of the cloud inside the model plane — it is the multivariate distance that the confidence ellipse encodes, so a high T² means "this batch is an extreme but recognisable member of the family." The squared prediction error (SPE, or Q-statistic) measures the distance off the plane — the part of the batch the model cannot explain at all, which flags genuinely novel behaviour (a sensor fault, a new contaminant) that no good batch ever showed. T² catches "too much of something we know"; SPE catches "something we have never seen."

When either statistic trips, the question is immediately why — and the answer is the contribution plot. It decomposes a single out-of-bounds score back onto the original variables, ranking which ones pushed the batch out: a spike on the lactate contribution and a dip on glucose says the metabolic relationship drifted, not that any single tag left its own control chart. This is the diagnostic an operator actually acts on, and it is exactly what univariate SPC cannot give you — by definition it has thrown away the cross-variable structure that the contribution plot reads back out.
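
The three quantities above are short computations once a PCA model is fitted. The sketch below uses a synthetic four-tag table in which one hidden "metabolic state" drives glucose, lactate and pH together. The data, the tag list and the helper name t2_spe are all invented for illustration, and no alarm limits are derived here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
tags = ["glucose", "lactate", "pH", "DO"]

# Hypothetical "good" data: one hidden state pushes glucose down, lactate up
# and pH down; DO varies independently.
state = rng.normal(size=300)
X_good = np.column_stack([
    -1.0 * state + 0.1 * rng.normal(size=300),
    +1.0 * state + 0.1 * rng.normal(size=300),
    -0.5 * state + 0.1 * rng.normal(size=300),
    rng.normal(size=300),
])

scaler = StandardScaler().fit(X_good)
pca = PCA(n_components=2).fit(scaler.transform(X_good))

def t2_spe(rows):
    """Hotelling's T2, SPE, and per-variable SPE contributions for each row."""
    z = scaler.transform(np.atleast_2d(rows))
    scores = pca.transform(z)
    t2 = (scores ** 2 / pca.explained_variance_).sum(axis=1)  # distance inside the model plane
    resid = z - pca.inverse_transform(scores)                  # part the model cannot explain
    contrib = resid ** 2                                       # SPE split by original variable
    return t2, contrib.sum(axis=1), contrib

_, spe_good, _ = t2_spe(X_good)

# Glucose AND lactate both high: each value is individually plausible,
# but together they break the learned relationship.
odd = np.array([2.0, 2.0, 0.0, 0.0])
t2_odd, spe_odd, contrib_odd = t2_spe(odd)

print(f"T2={t2_odd[0]:.2f}  SPE={spe_odd[0]:.2f}  median SPE of good data={np.median(spe_good):.3f}")
print("largest SPE contribution:", tags[int(contrib_odd[0].argmax())])
```

In this construction the odd point should score a modest T² but an SPE far above anything in the good data, with the contribution landing on glucose or lactate: the "something we have never seen" case, diagnosed down to the tags involved.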

Multiway PCA: unfolding batch × variable × time

A batch dataset is three-dimensional: batches × variables × time. PCA expects a flat matrix, so the foundational trick of multiway PCA [5] is the unfold: slice the cube so that each batch becomes one long row, with every variable-at-every-timepoint laid out side by side as columns (a 14-day batch of 7 tags at hourly cadence unfolds to one row of 2,352 columns: 336 hourly points × 7 tags). The procedure is then mechanical and worth naming as a fixed walkthrough:

Unfold the good historical batches into the batch-wise matrix (one row per completed batch).
Fit a PCA model on that matrix — the model now encodes "the shape of a normal trajectory," not just a normal value.
Score a new batch against the model to get its T² and SPE at each point in batch time, so the deviation is time-resolved.
Diagnose any excursion with the contribution plot to see which variable, at which phase, drove it.

The pay-off is that an abnormal batch is caught as a trajectory: it can have every individual tag inside its own I-MR limits and still veer off the normal path because the correlation structure moved — the exact failure univariate SPC is blind to. Commercial batch-monitoring suites (SIMCA and the like) are productized multiway PCA; the unfold-fit-score-diagnose loop above is the open-source core they wrap, and it is a few dozen lines of NumPy plus sklearn.decomposition.PCA.
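
The unfold-fit-score part of that loop can be sketched on a synthetic cube. The twelve batches, the sine-shaped trajectories and the noise level below are assumptions chosen only to give the arrays the right shape. The sketch also scores a completed batch only; scoring a batch that is still running needs a rule for the columns not yet observed, which is left out here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_batches, n_time, n_tags = 12, 336, 7   # hypothetical library of good batches

# Synthetic cube: a shared smooth trajectory per tag plus batch-to-batch noise.
t = np.linspace(0.0, 1.0, n_time)
shape = np.stack([np.sin((k + 1) * np.pi * t) for k in range(n_tags)], axis=1)  # (time, tag)
cube = shape[None, :, :] + 0.1 * rng.normal(size=(n_batches, n_time, n_tags))

# Batch-wise unfold: one row per batch, columns ordered time-major
# (all tags at hour 0, then all tags at hour 1, and so on).
unfolded = cube.reshape(n_batches, n_time * n_tags)
print(unfolded.shape)   # (12, 2352)

# Centre and scale each column so the model describes deviation from the mean trajectory.
mean, std = unfolded.mean(axis=0), unfolded.std(axis=0, ddof=1)
pca = PCA(n_components=3).fit((unfolded - mean) / std)

# Score a new, already completed batch against the model.
new_batch = shape + 0.1 * rng.normal(size=(n_time, n_tags))
z_new = (new_batch.reshape(1, -1) - mean) / std
scores = pca.transform(z_new)
t2 = float((scores ** 2 / pca.explained_variance_).sum())
spe = float(((z_new - pca.inverse_transform(scores)) ** 2).sum())
print(f"new batch: T2={t2:.2f}  SPE={spe:.1f}")
```

Because column j of the unfolded row is tag j % 7 at hour j // 7, a per-column SPE contribution maps straight back to "which variable, at which phase". With a dozen batches and thousands of columns the model is heavily over-parameterized, so alarm limits need to come from held-out good batches, not from the training rows.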

A titer soft-sensor: predicting quality from a spectrum

Here is the chapter's centrepiece. A soft-sensor (or virtual sensor) is a model that infers a hard-to-measure quantity from easy-to-measure ones. Titer, the grams of antibody per litre you have made, normally needs an offline assay that takes hours. But an in-line Raman probe produces a spectrum every few minutes. Raman spectroscopy shines a laser into the broth (the cell-culture fluid in the bioreactor) and reads the tiny shifts in scattered light caused by molecular vibrations; each kind of chemical bond shifts the light by a characteristic amount, so the spectrum carries a faint fingerprint of every molecule in the broth, including the product. If a model can learn that fingerprint, you get titer in real time. That is the essence of Process Analytical Technology (PAT): building quality assurance on in-process measurement instead of waiting for end-product testing [2]. Raman is not the only in-line probe that fits this mould: near-infrared (NIR) is another common vibrational-spectroscopy PAT measurement, and it gets exactly the same chemometric treatment, a PLS calibration mapping a wide, collinear spectrum onto a concentration. The soft-sensor pattern below is therefore technique-agnostic; swap a Raman matrix for an NIR one and the scikit-learn pipeline is unchanged.

The dataset is datasets/raman_spectra.parquet: one row per hourly timepoint, 701 intensity columns (wn_400 … wn_1800, one per wavenumber — the position along the Raman shift axis, measured in cm⁻¹, that labels which molecular vibration each column reads — stepped in 2-unit increments from 400 to 1800) plus reference labels the simulator carried along from the same kinetic state, so the spectra are genuinely informative about concentration. One scope note up front, because it matters for everything below: this demo Raman dataset is a single batch (BATCH-2026-001) — 336 hourly spectra of one trajectory — unlike the six-batch release table (BATCH-2026-001 … 006) the SPC section trended. The single-batch limitation, and what it means for the suspiciously high R², is revisited under "Where the soft-sensor breaks" below. The reference columns are the offline measurements the simulator logged alongside each spectrum: glucose and lactate in g/L, glutamine in mM (millimolar, a concentration unit), VCD_e6_per_mL the viable cell density in millions of cells per mL, and titer_g_L the antibody concentration in g/L that the soft-sensor learns to predict:

ts                          batch_id        glucose_g_L  lactate_g_L  glutamine_mM  VCD_e6_per_mL  titer_g_L
2026-01-05 00:00:00+00:00   BATCH-2026-001       6.000        0.200         4.000          0.300      0.000
2026-01-05 01:00:00+00:00   BATCH-2026-001       5.998        0.201         3.999          0.306      0.000
2026-01-05 02:00:00+00:00   BATCH-2026-001       5.995        0.202         3.998          0.312      0.001

The model is in examples/analytics/soft_sensor.py. Loading is trivial — every column starting wn_ is a feature, titer_g_L is the target:

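# examples/analytics/soft_sensor.py
import pandas as pd

# DATA (the datasets directory) and TARGET ("titer_g_L") are module-level names in the script.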
def load_xy():
    df = pd.read_parquet(DATA / "raman_spectra.parquet")
    wn = [c for c in df.columns if c.startswith("wn_")]
    X = df[wn].to_numpy()
    y = df[TARGET].to_numpy()
    return X, y, wn

Training is a textbook chemometrics pipeline: hold out a slice, standardize the spectra (rescale every wavenumber column to the same mean-zero, unit-spread footing so a naturally large-valued column does not dominate the fit just because its raw numbers are bigger), fit a PLS regression with a handful of latent components, and score on the held-out set. We use scikit-learn, the canonical open-source machine-learning library for exactly this kind of work [9]:

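# examples/analytics/soft_sensor.py
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
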
def train(n_components: int = 6, test_size: float = 0.3, seed: int = 2026):
    X, y, wn = load_xy()
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size, random_state=seed)
    scaler = StandardScaler().fit(Xtr)
    pls = PLSRegression(n_components=n_components)
    pls.fit(scaler.transform(Xtr), ytr)
    pred = pls.predict(scaler.transform(Xte)).ravel()
    r2 = r2_score(yte, pred)
    rmse = float(np.sqrt(mean_squared_error(yte, pred)))
    return {"n_components": n_components, "n_wavenumbers": len(wn),
            "n_train": len(ytr), "n_test": len(yte),
            "r2": round(float(r2), 4), "rmse_g_L": round(rmse, 4)}

python analytics/soft_sensor.py (or make soft-sensor) prints:

PLS soft-sensor (titer from Raman): R2=0.9923 RMSE=0.1498 g/L (6 comps, 701 wavenumbers, 235 train / 101 test)
ASSERT ok: R2 > 0.85 — the Raman dataset is genuinely predictive of titer.

Six latent components, distilled from 701 wavenumbers, recover titer with R² = 0.99 and an RMSE of 0.15 g/L on data the model never saw. Two standard scores summarise a regression: R² (the fraction of the variation in titer the model explains, where 1.0 is a perfect fit) and RMSE (root-mean-square error — the typical size of a prediction miss, in the same g/L units as the target). So R² = 0.99 says the model captures almost all of the titer signal, and RMSE = 0.15 g/L says a typical prediction is off by about 0.15 g/L. The script ends with a hard assertion — assert m["r2"] > 0.85 — so the book's claim cannot silently rot: if a future change to the simulator broke the signal, CI would fail loudly. This mirrors published reality, where data-driven models forecast mAb titer and metabolites in CHO fed-batch culture toward digital-twin use [7], and where Raman + PLS glucose-feedback control has actually been shown to lift product output by roughly a quarter in CHO bioreactors — in Gibbons 2023 that ~25% gain was for one of two cell lines, achieved by holding glucose steadier to prolong viability rather than by feeding harder, so the exact number is cell-line-dependent [8].
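
Both scores are short formulas, and writing them out shows what each one is sensitive to. The sketch below computes them by hand on six made-up titer values; the numbers are invented for illustration and have nothing to do with the Raman dataset.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot; RMSE = sqrt(mean squared residual)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # what the model missed
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total spread of the measured values
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return float(r2), float(rmse)

# Made-up titers (g/L) and predictions, for illustration only.
measured = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.6, 0.9, 2.1, 2.8, 4.1, 5.2]

r2, rmse = r2_and_rmse(measured, predicted)
print(f"R2={r2:.4f}  RMSE={rmse:.4f} g/L")   # R2=0.9921  RMSE=0.1414 g/L
```

Because SS_tot grows with the spread of the measured titers, a test set that spans a whole batch from near zero to harvest makes a high R² easier to reach. RMSE, which stays in g/L, is the number to compare against the accuracy the decision actually needs.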

Anatomy of a soft-sensor model record

That printed line — R2=0.9923 RMSE=0.1498 g/L (6 comps, 701 wavenumbers, 235 train / 101 test) — is the model's whole story compressed into one string, and that is exactly the problem. A print() is not a governed artifact. The fields it carries (and a few the script does not yet print but a registry must) are precisely what turn a pls.predict() call into a controlled analytical procedure you could defend in an audit. Dissect them as a single identity card:

The soft-sensor as a governed record rather than a console line: the inputs and hyperparameters, the dataset pinned by its MANIFEST.sha256 hash, the fitted scaler, the held-out metrics (green), and — what an audit actually asks for — the operating range, intended-use scope, and MLflow lineage (violet).

Original diagram by the authors, created with AI assistance.

target = titer_g_L, inputs = wn_400 … wn_1800 — the procedure's contract: 701 spectral columns in, one concentration out. load_xy() defines this by selecting every wn_-prefixed column, so the input shape is data-derived, not hand-listed.
n_components = 6, n_wavenumbers = 701 — the model's only real knob and the input width it operates on. Six latent variables are the entire model complexity; everything else is fitted.
training data = raman_spectra.parquet, sha256 = 4d7f12c4…cbc998c — the field that makes the record governable. A sha256 hash is a short fingerprint computed from the file's exact bytes — change a single value and the fingerprint changes completely — so it uniquely identifies this dataset. The dataset hash is real: it is the line for raman_spectra.parquet in datasets/MANIFEST.sha256. Bind that hash to the model version ("pin" the data) and "which data trained this model?" stops being a guess.
held-out split = 235 train / 101 test, seed 2026 — test_size=0.3 with random_state=2026. The seed is what makes the 235/101 split — and therefore the metrics — reproducible run after run; drop it and the numbers wander.
StandardScaler: mean_[0] ≈ 42.47, scale_[0] ≈ 1.44 — the scaler is fitted on the training slice only (StandardScaler().fit(Xtr)) and then applied to the test slice. Those per-wavenumber means and scales are part of the model: serve the model, you must serve the scaler, or every prediction is silently wrong.
validation: R² = 0.9923, RMSE = 0.1498 g/L — the held-out metrics, the green core. These are the numbers the CI assertion guards (r2 > 0.85), so the record cannot quietly degrade without the build going red.
operating range, intended use, lineage — the violet panel holds what the print() omits and a registry must add: the titer span the model was calibrated over (0 to ~5.72 g/L in this dataset), the scope (advisory titer, not an unattended release decision), and the MLflow lineage run → model → registry stage (one training run produces a saved model, which is then promoted through registry stages such as Staging and Production). These are not code; they are the governed metadata that ICH Q14 (the international regulatory guideline on analytical-procedure development) expects around a model-based procedure, and the gap between the printed string and this full card is the whole subject of the governance section below.

Where these records come from

Both identity cards close a loop that runs the length of the trilogy. The CEX main-peak number the SPC chart trends is a physical measurement — it is generated on the bench in Book 1's analytical assays and QC release, where each batch earns its Certificate of Analysis; the titer the soft-sensor predicts is the protein accumulating in the production bioreactor itself. Book 2 then turns each into a governed datapoint and poses the open question this chapter answers in code: the I-MR / Cpk and CPV record behind the SPC card, and the soft-sensor model record behind the PLS card. This chapter is where that data-management challenge becomes running scikit-learn over a hashed dataset.

Both records are graph nodes: a triple, a SHACL gate, a competency question

Neither identity card has to stay a Python dict. Each is a small bundle of facts about one subject, which is exactly the shape of an RDF node — the subject-predicate-object triple model the semantics chapter builds the digital thread from. The soft-sensor record becomes a handful of triples on one IRI (Internationalized Resource Identifier — a global, web-style name, the graph's equivalent of a primary key), reusing the bp: vocabulary that chapter aligns to the Industrial Ontologies Foundry:

# Illustrative — the model record as RDF, the same bp: vocabulary the semantics chapter loads.
bp:titer_pls_v3  a            bp:SoftSensorModel ;
                 bp:predicts  bp:titer_g_L ;
                 bp:trainedOn bp:raman_spectra_parquet ;     # the dataset node, pinned by hash
                 bp:nComponents       6 ;
                 bp:r2                "0.9923"^^xsd:float ;
                 bp:operatingRangeMax "5.72"^^xsd:float ;
                 bp:intendedUse       "advisory titer, not an unattended release decision" ;
                 prov:wasDerivedFrom  bp:raman_spectra_parquet .   # PROV-O lineage edge

bp:raman_spectra_parquet  bp:sha256 "4d7f12c4…cbc998c" .     # the MANIFEST.sha256 line, now a triple

That prov:wasDerivedFrom edge is PROV-O, the W3C provenance vocabulary — it makes "which dataset trained this model?" a graph edge you can walk, the same derivedFrom traversal the genealogy SPARQL uses on lot lineage, and the machine-readable twin of the MLflow run → model → registry chain the anatomy card draws. The SPC record maps the same way: bp:CEX_main_pct_spc bp:cpk 1.984 ; bp:center 68.9873 are just more triples on the attribute.

Once the record is a node, the release-gate discipline the SHACL chapter applies to a lot applies to a model. A governed soft-sensor must carry exactly one operating range, a non-empty intended-use scope, and a bound dataset hash — and "is a required field missing?" is precisely the closed-world question OWL cannot answer but SHACL can [3]. The model-governance card is therefore a sh:NodeShape in spirit, the soft-sensor equivalent of bp:ReleaseShape:

# Illustrative — a model-governance gate, mirroring bp:ReleaseShape from the ontology book.
bp:GovernedModelShape a sh:NodeShape ;
    sh:targetClass bp:SoftSensorModel ;
    sh:property [ sh:path bp:trainedOn ;        sh:minCount 1 ;
                  sh:message "Model is not pinned to a dataset." ] ;
    sh:property [ sh:path bp:operatingRangeMax ; sh:minCount 1 ; sh:maxCount 1 ;
                  sh:message "Model must declare exactly one operating range." ] ;
    sh:property [ sh:path bp:intendedUse ;       sh:minCount 1 ;
                  sh:message "Model has no documented intended use." ] .

And the audit question becomes a one-line SPARQL competency question — the analytics analogue of the ontology book's release CQs: "list every served model whose dataset hash no longer matches the current MANIFEST.sha256" is SELECT ?m WHERE { ?m bp:trainedOn ?d . ?d bp:sha256 ?h . FILTER(?h != $current) }. That is the drift-and-staleness question of the next section, asked of the graph instead of a spreadsheet — and it is why pinning the hash on the anatomy card was never cosmetic.
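The same staleness question can be asked without a graph store. The sketch below is a plain-Python analogue of the competency question, under stated assumptions: the registry is a hypothetical dict of model name to pinned dataset hash, and the temporary file stands in for the real dataset. Only hashlib, tempfile, and pathlib from the standard library are used.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks so large files are fine."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_models(registry: dict[str, str], current_hash: str) -> list[str]:
    """Names of models whose pinned dataset hash differs from the current one."""
    return sorted(name for name, pinned in registry.items() if pinned != current_hash)


# Demo with a throwaway file standing in for the dataset (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset.bin"
    dataset.write_bytes(b"version 1")
    v1 = sha256_of(dataset)
    registry = {"model_a": v1, "model_b": v1}   # both trained on version 1

    dataset.write_bytes(b"version 2")           # the dataset changes underneath
    registry["model_b"] = sha256_of(dataset)    # only model_b is re-trained
    stale = stale_models(registry, sha256_of(dataset))

print(stale)
```

The comparison is exact string inequality, mirroring the FILTER in the query: any byte-level change to the dataset marks every model still pinned to the old hash.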

Where the soft-sensor breaks — and why honesty is the point

R² = 0.99 is a suspiciously good number, and the chapter's docstring says so out loud: the simulated spectra carry the titer signal cleanly. There is a second, more subtle reason the number flatters the model. The dataset is a single batch's 336 hourly spectra, and train() uses a random train/test split — so each held-out test point sits about an hour between two of its own training neighbours on one smooth trajectory. The model is mostly interpolating between near-identical rows, which is easier than truly predicting an unseen batch. An honest calibration validation holds out whole batches (a leave-one-batch-out split, grouped by batch_id), which this single-batch demo cannot do; the R² here is best read as "the signal is genuinely there," not "this is the field accuracy." Real Raman is harder still. A bioprocess soft-sensor must survive variable batch length, multiple process phases (a culture grows through a lag phase where it adapts, an exponential phase of rapid division, a stationary phase where growth plateaus, and a death phase — each with different spectral behaviour), fouling and bubbles on the probe, and outright sensor faults — and a model trained on one set of conditions degrades quietly when any of those shift [6]. One spectrum genuinely needs many models: a calibration that works in stationary phase may be useless during the exponential ramp.

The takeaway is a design rule, not a disclaimer. A soft-sensor is only trustworthy with drift monitoring beside it: track the spectral inputs against the training distribution and the prediction residuals against the occasional offline reference, and raise a flag when either wanders. The served path is designed to log every prediction so this is auditable; the model is never allowed to make a GMP decision unattended.

Validate the way the model will be used: grouped CV, applicability domain, two kinds of drift

The random split above is the single biggest source of optimism, and the fix is a discipline the ML book treats as non-negotiable: validate the way the model will be used. A served soft-sensor predicts on a batch it has never seen, so the honest estimate of its field error is grouped, leave-one-batch-out cross-validation — split by batch_id so no row from a test batch ever leaks into training, fit on the rest, score on the held-out batch, and rotate. The gap between a random-split R² and a leave-one-batch-out R² is the leakage the random split hid; the same trap (and the grouped-CV cure) is exactly what the ML book's data and validation chapters warn about for any time-correlated bioprocess series. This single-batch demo cannot run it, which is precisely why its R² is read as "the signal is there," not "this is the accuracy."

Cross-validation tells you how good the model is inside the data it has seen; it says nothing about a spectrum unlike any of them. That second question is the applicability domain (AD) — the region of input space where the calibration is entitled to be believed — and the MVDA section already built the tool for it. A new spectrum's Hotelling's T² and SPE/Q against the training PCA model are the AD test: a low T² and low SPE mean "this spectrum looks like the calibration set, trust the prediction," while a high SPE means "this is off the model plane — novel chemistry, a fouled probe, a bubble — flag the number, do not act on it." A soft-sensor that ships an AD check beside every prediction is the difference between a model that fails loudly out-of-domain and one that extrapolates a confident, wrong titer. (PLS hands you a third cue for free: a wildly out-of-range prediction, or one whose spectral residual spikes, is the same signal read from the regression side.)
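The applicability-domain test can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a two-component PCA fitted by SVD on synthetic training data; the sizes, the noise level, and the use of empirical 99th-percentile limits are assumptions (production tools often use distribution-based limits instead).

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_channels, n_comp = 200, 20, 2   # hypothetical sizes

# Training spectra lie near a 2-D plane in 20-D space, plus small noise.
loadings_true = rng.normal(size=(n_comp, n_channels))
X_train = (
    rng.normal(size=(n_train, n_comp)) @ loadings_true
    + rng.normal(0.0, 0.05, (n_train, n_channels))
)

mean = X_train.mean(axis=0)
_, S, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
P = Vt[:n_comp].T                                # loadings (channels x components)
score_var = (S[:n_comp] ** 2) / (n_train - 1)    # variance of each score


def t2_and_spe(x):
    """Hotelling's T² (inside the plane) and SPE/Q (off the plane) for one spectrum."""
    centred = x - mean
    scores = centred @ P
    residual = centred - scores @ P.T
    return float(np.sum(scores**2 / score_var)), float(residual @ residual)


# Empirical 99th-percentile limits from the training set (an assumption).
train_stats = np.array([t2_and_spe(row) for row in X_train])
t2_limit, spe_limit = np.percentile(train_stats, 99, axis=0)

in_domain = rng.normal(size=n_comp) @ loadings_true
off_plane = in_domain + 2.0 * rng.normal(size=n_channels)   # novel direction added

t2_in, spe_in = t2_and_spe(in_domain)
t2_off, spe_off = t2_and_spe(off_plane)
print(f"SPE limit {spe_limit:.3f}: in-domain {spe_in:.4f}, off-plane {spe_off:.1f}")
```

A served prediction would carry both statistics alongside the titer value, so a consumer can see whether either limit was exceeded before acting on the number.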

Finally, keep two drifts apart, because they trigger different actions. Model drift is the model going stale against an unchanged process — a fouling probe, a calibration that no longer matches the instrument — and the cure is re-calibration or re-training. Process drift is the process itself moving — the charge-variant trend the I-MR chart catches, a real metabolic shift — and the cure is a CAPA on the process, not the model. The danger is mistaking one for the other: re-training a model to chase a genuine process excursion hides the deviation the SPC chart exists to surface. The two monitors must run side by side — the residual control chart watching the model, the I-MR and golden-batch envelope watching the process — which is the MLOps lifecycle the ML book builds, here grounded in the very same datasets.
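The model-side monitor can reuse the same I-MR construction as the process chart, applied to residuals instead of the attribute itself. The sketch below assumes a residual is computed as prediction minus offline reference whenever a paired assay result exists; the residual values and function names are hypothetical, and d2 = 1.128 is the usual constant for moving ranges of consecutive pairs.

```python
from statistics import mean

D2 = 1.128  # I-MR constant for moving ranges of consecutive pairs


def imr_limits(baseline):
    """Lower limit, centre line, and upper limit for an individuals chart."""
    centre = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    sigma = mean(moving_ranges) / D2
    return centre - 3 * sigma, centre, centre + 3 * sigma


def out_of_control(values, lcl, ucl):
    """Indices of points outside the control limits."""
    return [i for i, v in enumerate(values) if not lcl <= v <= ucl]


# Hypothetical baseline residuals (predicted minus offline reference titer, g/L).
baseline_residuals = [0.02, -0.05, 0.04, -0.01, 0.03, -0.04, 0.01, 0.00, -0.02, 0.05]
lcl, centre, ucl = imr_limits(baseline_residuals)

# New paired samples: the last two show a growing bias, as a fouling probe might.
new_residuals = [0.01, -0.03, 0.02, 0.25, 0.40]
flags = out_of_control(new_residuals, lcl, ucl)
print(f"limits [{lcl:.3f}, {ucl:.3f}], flagged indices {flags}")
```

A flag here says only that the model and the reference disagree more than they used to; deciding whether the cause is the model or the process still needs the process-side charts.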

The two monitors also fire at different times, and that gap is the whole point. The leading detector is label-free — a PSI input-distribution check (and the same Hotelling's T² / SPE applicability-domain test) on the incoming Raman and tags against the training distribution — so it raises a flag the moment the inputs move, before any quality result returns. The lagging detector is the residual control chart, which cannot confirm anything until the slow offline assay comes back with a reference value. Only then can you ask whether the input shift was a real process excursion (flag the deviation, CAPA on the process) or a covariate shift in the inputs alone (re-train under change control):

Two drift detectors on one batch-time axis: the label-free input monitor fires hours before the residual monitor, which can only confirm once the slow offline assay returns — and the verdict routes process drift to a CAPA and covariate shift to a re-train under change control.

Original diagram by the authors, created with AI assistance.
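The label-free input check can be sketched as a Population Stability Index on one input feature, for example a single Raman channel or a PCA score. This is a minimal NumPy illustration: the decile binning, the epsilon guard, and the synthetic distributions are assumptions. The 0.1 and 0.25 thresholds mentioned in the usage note are a widely used rule of thumb, not values from this chapter.

```python
import numpy as np


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference` for one feature."""
    # Interior bin edges are quantiles of the reference (training) distribution;
    # values beyond the training range fall into the first or last bin.
    interior = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(interior, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(interior, current), minlength=n_bins)
    ref_frac = np.clip(ref_counts / len(reference), eps, None)  # avoid log(0)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(3)
training = rng.normal(0.0, 1.0, 5000)   # feature values seen during calibration
same = rng.normal(0.0, 1.0, 5000)       # new data from the same distribution
shifted = rng.normal(1.0, 1.0, 5000)    # new data with a one-sigma mean shift

psi_same = psi(training, same)
psi_shifted = psi(training, shifted)
print(f"PSI same = {psi_same:.4f}, PSI shifted = {psi_shifted:.3f}")
```

Under the common convention, a PSI below about 0.1 is read as stable and above about 0.25 as a material shift. Because no reference titer is needed, the check can run on every incoming spectrum.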

When the calibration moves: field degradation across probes and scales

The single most important field on the anatomy card above — the dataset hash — is also the one that quietly invalidates everything else. A PLS calibration is bound to the exact spectrometer, probe, and process conditions it was trained on, and the literature is blunt about what happens when any of those change. In a controlled bioprocess study, Pétillot and colleagues put two probes of the same Raman analyzer into the same CHO culture at the same time — removing all biological variability — and the model prediction error on cell density between the two probes was about 20%, purely from instrument-to-instrument differences; a calibration-transfer step (Kennard-Stone piecewise direct standardization) was needed to halve that to roughly 10% [12]. That is the soft-sensor equivalent of the connectivity chapter's "92% of OPC UA deployments shipped insecure" datapoint: a hard number showing the default state is not trustworthy. Swap the probe, move from a 3 L development reactor to a 2,000 L manufacturing tank, or let a window foul, and a model that scored R² = 0.99 on its own data can degrade well past its acceptance criterion without a single line of code changing.

This is why the anatomy card pins the dataset by hash and declares an operating range, and why the governance section treats re-validation as mandatory rather than optional. A soft-sensor that moves to new hardware is, for regulatory purposes, a new analytical procedure until it is re-qualified — calibration transfer is not a free lunch, it is documented work with its own evidence. The honest reading of our R² = 0.99 is therefore narrow: it is true for this probe, this process, this dataset hash, and the moment any of those move, the drift monitor — not the original metric — is what tells you whether the number can still be believed.

Model governance: the GxP bar for ML

A model that predicts titer for a dashboard is one thing — but once a model touches a regulated decision it inherits the GxP bar (GxP is the umbrella for the regulated Good-Practice standards — Good Manufacturing Practice and its siblings). A model whose output feeds a release decision (the formal call to ship or reject a finished batch) or an in-process control decision is a regulated analytical procedure, and the bar climbs sharply. ICH Q14 is explicit that model-based analytical procedures — NIR, Raman, multivariate calibrations — carry a lifecycle obligation: documented calibration, formal validation against a reference method, defined operating ranges, and ongoing performance monitoring with a re-validation trigger when the model drifts [3]. A soft-sensor is not "trained once and trusted forever"; it is a controlled procedure with a maintenance burden.

That is where open-source tooling helps and where it stops. MLflow gives you the technical spine of governance: experiment tracking (every run's parameters, metrics, and the exact dataset hash), a model registry with versions and stage aliases, and run-to-model lineage so you can answer "which model version, trained on which data, produced this prediction on this batch?" [11]. The served path would log the soft-sensor's R² and RMSE to MLflow for exactly this reason.

But MLflow tracks; it does not validate. The honest GxP last mile is the same one this whole book keeps arriving at: the audit trail on who promoted a model version, the e-signature on the validation report, the change-control procedure that governs re-training, and the documented intended use that scopes what the model is allowed to decide. Those are properties of a validated system and procedure, not a pip install. MLflow gets you the lineage; the validated lifecycle around it is the work.

Why it matters

Every prior chapter was infrastructure; this one is the reason the infrastructure exists. A control chart that catches a drifting charge-variant attribute before it fails spec, a golden-batch envelope that flags a deviating reactor hours early, a soft-sensor that turns a four-hour assay into a real-time number: these are what let a plant move from testing quality into the product at the end to assuring quality through process understanding throughout [2]. And it only works because the data arriving here is clean, contextualized, and attributable. The most sophisticated PLS model in the world is worthless on data you cannot trust; the most trustworthy data is wasted if no one turns it into a decision. Analytics is where those two halves meet.

In the real world

In a commercial mAb plant the SPC and CPV layer is often a validated statistics suite (JMP, Minitab, or a Discoverant-style process-intelligence platform) and the MVDA is frequently SIMCA (Sartorius/Umetrics). These are the tools that productized multiway PCA/PLS for batch monitoring. Soft-sensors ride on PAT data-management and orchestration platforms, most commonly synTQ (Optimal Industrial Technologies) or Siemens SIPAT, wired to in-line Raman analyzers. Representative process-Raman instruments, named here as common industry examples, include the Endress+Hauser Kaiser Raman Rxn2/Rxn4 process analyzers, with Renishaw Virsa and Tornado Spectral Systems as alternatives. Our open-source stack does not pretend to replace a validated PAT system that controls a feed pump; it does the analysis, the prototyping, and the contextualized historian feed alongside them, and it does the SPC/CPV trending genuinely well.

A few concrete anchors:

The N-GLYcanyzer testbed is a concrete instance of this PAT vision — an automated in-line system that, instead of inferring a quality attribute from a Raman fingerprint, actually measures two of them on-line: it runs a chromatographic separation to report the antibody's glycan profile (the sugar chains attached to it, a key quality attribute) and a Protein A titer (a quick affinity measurement of how much antibody is present), demonstrated on a CHO-cell biosimilar of the antibody trastuzumab — the same real-time, in-process-measurement idea this chapter's soft-sensor sketches.
The continuous variant raises the analytics stakes. A perfusion / 3MCC (multi-column continuous capture) line runs for weeks at near-steady state, so SPC moves from per-batch points to time-windowed trending, and a soft-sensor's drift becomes a daily operational concern rather than a per-batch one — which is exactly why the drift-monitoring discipline above is non-negotiable in intensified processing.

The honest OSS-vs-commercial verdict for this layer: open source wins more of this layer than any other in the book. pandas/NumPy/SciPy, scikit-learn [9], and statsmodels [10] are the same algorithms the commercial suites wrap; a Jupyter notebook plus MLflow is a credible, reproducible analytics environment, and serving an SPC chart or a soft-sensor prediction into Grafana costs nothing but engineering. Where pure OSS stops: it is not a validated, vendor-accountable PAT control system, there is no maintained Part 11-grade SPC GUI for a non-coding analyst, and the model-governance lifecycle (validation evidence, signed approvals, locked change control) is procedure you build, not software you download. Get the analytics in open source; buy or build the validated wrapper for the decisions that touch release.

Key terms

SPC (statistical process control) — using the process's own variation to set control limits and flag special-cause deviations.
I-MR chart — individuals / moving-range control chart for one value per batch; spread estimated from consecutive-pair differences via d2 = 1.128.
Control limits vs specification — limits come from the data (what the process does); spec comes from the product requirement (what it must do). They are not the same band.
Cpk — process capability index: distance from the mean to the nearer spec limit in 3-sigma units; ≥ 1.33 is comfortably capable.
Golden-batch envelope — a mean ± 3-sigma band computed at each point in batch time, for overlaying a live batch on the historical norm.
MVDA — multivariate data analysis; modelling many correlated variables (and their relationships) at once.
PCA / multiway PCA — Principal Component Analysis; multiway PCA unfolds batch × variable × time data to monitor whole trajectories.
PLS — Partial Least Squares regression; supervised latent-variable modelling for wide, collinear data like spectra.
Hotelling's T² / SPE (Q) — the two multivariate-monitoring statistics: T² is the distance inside the model plane (an extreme but recognisable batch), SPE/Q the distance off the plane (genuinely novel behaviour).
Special-cause run rules (Western Electric / Nelson) — pattern tests beyond the single 3-sigma point — runs, trends, and zone tests — that flag a drifting process while every point is still inside the control limits.
Contribution plot — the MVDA diagnostic that attributes an out-of-control multivariate score to specific variables.
Soft-sensor (virtual sensor) — a model inferring a hard-to-measure quantity (titer) from easy-to-measure inputs (Raman spectra).
Raman spectrum — an in-line optical measurement; here 701 intensity points over wavenumber, carrying a molecular fingerprint of the broth.
NIR (near-infrared) — a sibling in-line vibrational-spectroscopy PAT measurement that takes the same PLS/chemometric treatment as Raman, so the soft-sensor pattern is technique-agnostic.
CPV (Continued Process Verification) — Stage 3 of the process-validation lifecycle; ongoing statistical trending of routine production data.
PAT — Process Analytical Technology; real-time in-process measurement driving quality decisions.
Latent component (latent variable) — a compressed, synthetic axis (a weighted blend of many correlated inputs) that PCA/PLS use in place of the raw columns; the soft-sensor works in 6 of these instead of all 701 wavenumbers.
R² / RMSE — model-fit scores: R² is the fraction of variance the model explains (1.0 = perfect); RMSE is the typical prediction error in the target's own units (here g/L). See the ML book for the full treatment.
ICH Q14 — the international regulatory guideline on analytical-procedure development; it makes model-based procedures a lifecycle obligation (validation, operating ranges, ongoing monitoring, re-validation triggers).
GxP — umbrella for the regulated Good-Practice standards (Good Manufacturing Practice and its siblings); the bar a model must clear once it touches a regulated decision.
Drift monitoring — tracking model inputs and residuals against the training distribution to detect when a model has gone stale.
Leave-one-batch-out (grouped) cross-validation — validating by holding out whole batches (split on batch_id) so no row of a test batch leaks into training; the honest estimate of a soft-sensor's error on an unseen batch.
Applicability domain (AD) — the region of input space where a calibration is entitled to be believed; tested per-spectrum by its Hotelling's T² and SPE against the training PCA model.
Model drift vs process drift — model drift is the model going stale against an unchanged process (cure: re-calibrate/re-train); process drift is the process itself moving (cure: a CAPA on the process). Re-training to chase a real process excursion hides the deviation SPC exists to surface.
RDF triple / SHACL / PROV-O (for an analytics record) — the SPC and soft-sensor records can be RDF triples on one IRI, gated by a SHACL NodeShape (the model-governance analogue of bp:ReleaseShape), with prov:wasDerivedFrom recording the dataset lineage — see the semantics and release-gate chapters.

Where this leads

We now have the full toolkit — capture, historian, context, semantics, trust, and the analytics that turn it all into insight. The final build chapter, Capstone: One Batch, End to End, puts every layer in a single line: a complete simulated fed-batch CHO + Protein A run driven from sensor through ingestion, historian, contextualization, signed records, audit trail, and these very analytics, ending in one reviewable, release-ready dataset and dashboard — the proof that everything you built actually interconnects.
