Analytical Methods: Chemometrics, Deep Spectroscopy, and Automated Chromatograms

📍 Where we are: Part II · Discovery & Development, Learned — Chapter 8. The last chapter used Bayesian optimization to choose a process in a fraction of the runs a factorial grid needs. But every one of those runs only counts if you can measure what it produced. This chapter turns the learning lens on the analytical lab — the instruments that decide whether BATCH-2026-001 is 98.611 percent monomer (intact, correctly assembled antibody) or out of specification (OOS — a release result outside its acceptance limit, which blocks release until a formal investigation closes it), and that flagged BATCH-2026-004 as out of specification on host-cell protein (a process impurity carried over from the cells) at 128 ng/mg against a 100 ng/mg limit. (Release testing and these acceptance limits are Book 1's QC and release chapter.)

The analytical lab is where a process becomes a number. A bioreactor can run beautifully, but until an assay reports a titer, a purity, a glycan profile, the run is a rumor. And the analytical lab is the part of biomanufacturing that has been quietly doing machine learning the longest — for forty years, under a name that predates the hype: chemometrics. A Raman probe does not emit a glucose concentration; it emits a thousand intensity channels (one measured light intensity at each point along the spectrum), and a model turns those channels into a number. That model is machine learning, whether or not anyone calls it that.

This chapter walks the lab's instruments through the learning lens and is careful about a distinction the field often blurs: the gap between the established chemometrics that runs in production today (PLS, PCA — linear, interpretable, validated) and the deep learning that the literature is excited about (CNNs, autoencoders, transformers on spectra) but that has barely reached the GMP lab (Good Manufacturing Practice — the legally-enforced quality system commercial drug batches run under, in which every method and model must be validated, reproducible, and auditable). We build that contrast as runnable code — a 1D convolutional network against a PLS baseline on the same Raman spectra — and the result is the most honest thing in the chapter: on small, clean spectra, the deep model does not win.

The simple version

A wine taster does not run a mass spectrometer on every glass. They smell and taste a few cheap, fast signals and infer grape, region, and vintage from patterns learned over years. Chemometrics is that learned inference for chemistry: a spectrum (the "smell") goes in, a concentration or a quality verdict comes out. For most of the analytical lab, a simple, experienced taster — a linear PLS model — is hard to beat, because there are only a few hundred glasses to learn from. The deep-learning newcomer needs a cellar of millions before its extra cleverness pays off, and the lab rarely has one. The art is knowing which problem actually needs the newcomer.

What this chapter covers

We start with chemometrics proper — PLS and PCA, the linear latent-variable methods that are the analytical lab's production ML, written out as the actual decomposition and the preprocessing pipeline (standard normal variate, Savitzky-Golay derivatives) that wraps it — then trace the move toward deep learning for spectroscopy (Raman, NIR, FTIR, mass spectrometry), being precise about the 1D-CNN architecture and where it helps and where it is a solution looking for a problem. We cover automated chromatogram and peak integration for HPLC, CE, and SEC, where the incumbent is deterministic signal processing and ML is arriving at the edges; glycan, charge-variant, and PTM characterization, the high-dimensional structural assays; and anomaly detection in analytical and stability data. We build a 1D-CNN versus PLS head-to-head on the simulator's Raman dataset, and we issue the correction that anchors the chapter: the celebrated Boehringer Ingelheim 16-attribute Raman work used k-nearest-neighbors, not deep learning, and citing it as evidence of a "deep-learning Raman wave" is wrong.

Chemometrics: the lab's production machine learning

Chemometrics is the application of statistical and mathematical methods to chemical data — and in the biopharma lab it means, overwhelmingly, two algorithms: Partial Least Squares (PLS) regression and Principal Component Analysis (PCA). Both are latent-variable methods. A spectrum is a wide, redundant object: a Raman scan in our dataset is 701 intensity channels, but those channels are massively correlated — neighboring wavenumbers (the cm⁻¹ positions along the spectrum, each tied to one kind of molecular vibration) move together, and the chemistry lives in a handful of underlying factors, not 701 independent dimensions. PCA finds the directions of greatest variance and projects the spectrum onto a few principal components; PLS finds the directions that best predict a target (titer, glucose, a charge-variant fraction) and regresses on those few latent variables.

This is not a footnote in the history of bioprocess ML; it is the production deployment. Throughout this chapter, a parenthesized maturity tag rates how far a method has actually travelled: (production) means routine in commercial GMP use, (pilot) means demonstrated at scale but not yet routine, and (research) means published but lab-bound. In-line Raman plus PLS for glucose, lactate, metabolites, and titer (how much antibody has accumulated, in g/L) is a (production) PAT (Process Analytical Technology, measuring quality in real time during the process) technology in commercial CHO (Chinese-hamster-ovary cell) culture, including closed-loop glucose control [1][2]. The honest asymmetry is that what the loop actually holds to a set-point is glucose (a high-concentration feed nutrient that Raman reads cleanly, well above the detection floor), not a quality attribute, which sits far closer to the measurement floor and is harder to read reliably. The multivariate statistical process control (MSPC) that watches whole batches (Sartorius SIMCA / SIMCA-online and AspenTech ProMV) is PCA and PLS at its core, and it is (production) for continued process verification, golden-batch monitoring, and fault detection [3][4]. When this book's other volumes build a soft sensor, they build a PLS model: the data book's soft-sensor chapter and the open-source analytics chapter both center on PLSRegression. This chapter does not duplicate that; it asks the next question (can deep learning beat it?) and answers honestly.

Evidence

In-line Raman + PLS chemometric soft sensors for glucose/lactate/metabolites/titer, including closed-loop glucose control in CHO culture, are (production) and backed by independent peer-reviewed work [1][2]. MSPC with PCA/PLS (SIMCA, ProMV) is the (production) backbone of multivariate batch monitoring [3][4]. These are the strongest analytical-lab ML exemplars in the book, and they are linear methods, a fact the rest of the chapter keeps returning to.

The math: what PCA and PLS actually compute

Strip the marketing and both methods are a matrix factorization. Stack the calibration spectra into a matrix X of shape (n samples × 701 wavenumbers) and mean-center each column. PCA factors it as X ≈ T Pᵀ, where the scores T (n × k) place each spectrum in a low-dimensional space and the loadings P (701 × k) are the spectral shapes: the directions of maximum variance, found as the leading eigenvectors of XᵀX (for centered X, the covariance matrix up to a 1/(n − 1) factor). Keeping k components (often three to seven) reconstructs almost all the variance from a handful of axes. PCA never looks at the target; it is purely a description of how the spectra vary, which is exactly why it is the right tool for monitoring: it builds a model of "normal" spectral shape with no knowledge of the answer.
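The factorization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the series' own code: the function name pca_fit and the synthetic three-factor "spectra" are assumptions made for the example, and the SVD is used because its right singular vectors are the eigenvectors of XᵀX for centered X.

```python
import numpy as np

def pca_fit(X, k):
    """Return (mean, scores T, loadings P, explained-variance fraction) for k components."""
    mean = X.mean(axis=0)
    Xc = X - mean                      # column mean-centering
    # SVD of the centered data: the rows of Vt are the eigenvectors of Xc.T @ Xc
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                       # loadings, shape (n_channels, k)
    T = Xc @ P                         # scores, shape (n_samples, k)
    explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
    return mean, T, P, explained

# Synthetic stand-in for spectra (an assumption, not the book's dataset):
# three hidden factors spread over 701 channels, plus a little noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(120, 3))
shapes = rng.normal(size=(3, 701))
X = factors @ shapes + 0.01 * rng.normal(size=(120, 701))

mean, T, P, explained = pca_fit(X, k=3)
X_hat = mean + T @ P.T                 # rank-3 reconstruction, X ≈ T Pᵀ plus the mean
print(T.shape, P.shape, round(explained, 4))
```

Because the synthetic data really has three underlying factors, three components recover nearly all of the variance, which is the same redundancy argument the text makes for 701 correlated channels.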

PLS is the supervised cousin. It factors X and the target y together, choosing latent directions w that maximize the covariance between the spectral projection X w and y — not the variance of X alone. The NIPALS algorithm extracts one latent variable at a time: find the weight vector w that maximizes cov(Xw, y), compute the score t = Xw, regress both X and y on t to get loadings, deflate (subtract the explained part) from both, and repeat for the next component. After k components the model collapses to a single vector of regression coefficients b of length 701 (plus an intercept) such that ŷ = X b + b₀. PLS chooses k not by assertion but by inner cross-validation under the one-standard-error rule (pick the simplest model — fewest components — whose cross-validated error is within one standard error of the best, which guards against over-fitting the component count); on this dataset it picks five. That is why our PLS reports 702 coefficients — one per wavenumber plus the intercept — yet has only five real degrees of freedom: the 702 numbers are a rank-5 object in disguise. Ordinary least squares cannot do this: with 701 columns and a few hundred rows, XᵀX is rank-deficient and non-invertible, the fit is unstable, and the coefficients explode. PLS sidesteps the inversion entirely by working in the five-dimensional latent space, which is precisely why it survives the small-data regime the foundations chapter described.
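The NIPALS steps described above can be written out directly. The following is a minimal single-target (PLS1) sketch in NumPy under stated assumptions: the function name pls1_nipals and the synthetic data are illustrative, and a production model would normally come from a validated library rather than hand-rolled code.

```python
import numpy as np

def pls1_nipals(X, y, k):
    """PLS1 via NIPALS. Returns coefficients b (one per column) and intercept b0."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(k):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)         # unit weight vector maximizing cov(Xw, y)
        t = Xk @ w                     # score for this latent variable
        tt = t @ t
        p = Xk.T @ t / tt              # X loading (regress X on t)
        c = yk @ t / tt                # y loading (regress y on t)
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - c * t                # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)   # collapse k components into one vector
    b0 = y_mean - x_mean @ b
    return b, b0

# Synthetic stand-in (an assumption): 80 spectra, 701 channels, 3 hidden factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(80, 3))
X = factors @ rng.normal(size=(3, 701)) + 0.01 * rng.normal(size=(80, 701))
y = factors @ np.array([1.0, -0.5, 2.0]) + 3.0

b, b0 = pls1_nipals(X, y, k=3)
y_hat = X @ b + b0
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(b.shape, round(float(r2), 4))
```

Note that the fit never inverts the 701 × 701 matrix XᵀX; the only linear solve is on the k × k matrix PᵀW, which is why the method stays stable with far more columns than rows.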

The committed baseline does more than fit those five components — it earns and audits them. Beyond selecting the component count by inner 5-fold cross-validation (the one-standard-error rule lands on five), it reports which wavenumbers it leans on via VIP (Variable Importance in Projection), flagging 389 bands above 1 and peaking near 1274 cm⁻¹ (from the committed simulator run, so illustrative), and it gates every prediction through a Hotelling-T²/SPE applicability domain (both distances explained under anomaly detection below) set at the training 99th percentile — which passes genuine held-out spectra (99 percent in-domain) and flags a deliberately corrupted one. That is the difference between a model that is merely accurate and one that is right for the right reason and knows when it is out of its depth — the self-distrust the anatomy card later makes concrete.
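The one-standard-error rule mentioned above is simple enough to show in isolation. This is a hedged sketch: the function name one_se_choice is hypothetical, and the per-fold error table is made up purely to show the rule preferring a simpler model than the raw minimum; it is not the committed run's cross-validation output.

```python
import numpy as np

def one_se_choice(cv_errors):
    """cv_errors has shape (n_candidates, n_folds), row i holding fold errors for k = i + 1.
    Returns the smallest k whose mean error is within one standard error of the best mean."""
    cv_errors = np.asarray(cv_errors, dtype=float)
    mean = cv_errors.mean(axis=1)
    se = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])
    best = int(np.argmin(mean))
    threshold = mean[best] + se[best]
    return int(np.argmax(mean <= threshold)) + 1   # first candidate under the threshold

# Illustrative 5-fold RMSE values for k = 1..6 (invented numbers).
errors = [
    [0.50, 0.52, 0.48, 0.51, 0.49],
    [0.30, 0.33, 0.29, 0.31, 0.32],
    [0.20, 0.23, 0.19, 0.21, 0.22],
    [0.19, 0.22, 0.18, 0.21, 0.20],
    [0.185, 0.225, 0.175, 0.21, 0.195],
    [0.19, 0.23, 0.18, 0.21, 0.20],
]
k_min = int(np.argmin(np.mean(errors, axis=1))) + 1
k_1se = one_se_choice(errors)
print(k_min, k_1se)
```

On this invented table the raw minimum sits at five components while the rule settles on four, because the fourth model's mean error is already inside one standard error of the best.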

Preprocessing: SNV and Savitzky-Golay, the half of chemometrics that is not regression

A raw Raman or NIR spectrum is contaminated before the chemistry is even visible. A sloping fluorescence baseline rides under the Raman peaks; light scattering scales the whole spectrum up or down depending on particle density and probe coupling; instrument drift shifts the floor. None of these carry concentration. The standard pipeline removes them with two named operations, applied before PLS ever sees the data.

Standard Normal Variate (SNV) is a per-spectrum normalization: for each spectrum, subtract its own mean across all 701 channels and divide by its own standard deviation. The effect is to put every spectrum on the same scale, cancelling the multiplicative scatter and additive offset that vary sample to sample — it is row-wise z-scoring of the spectrum, and it is the workhorse scatter correction.
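As a minimal sketch of the operation just described (the function name snv and the toy spectrum are assumptions for illustration), SNV is a row-wise z-score, and its key property is that a positive scale factor and an additive offset applied to a whole spectrum disappear:

```python
import numpy as np

def snv(spectra):
    """Row-wise Standard Normal Variate: center and scale each spectrum by its own stats."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Toy spectrum: one Gaussian peak on 701 channels (illustrative only).
x = np.arange(701, dtype=float)
base = np.exp(-0.5 * ((x - 350.0) / 15.0) ** 2)

# The same chemistry seen through different scatter: scaled by 2 and lifted by 5.
scattered = 2.0 * base + 5.0

corrected = snv(np.vstack([base, scattered]))
print(np.allclose(corrected[0], corrected[1]))
```

The two rows come out identical after correction, which is the sense in which SNV cancels multiplicative scatter and additive offset.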

Savitzky-Golay derivatives attack the baseline. The Savitzky-Golay filter fits a low-order polynomial (typically order 2) to a sliding window of points (say 11 or 15 wide) by least squares and reads off the smoothed value or its derivative at the window center. Taking the first derivative removes a constant baseline offset; the second derivative removes a sloping (linear) baseline and sharpens overlapping peaks into resolvable troughs — at the cost of amplifying high-frequency noise, which is why the polynomial smoothing is built into the same pass. The combination is so standard that "SNV plus second-derivative Savitzky-Golay" is the default opening move on a vibrational spectrum, and choosing the window width and derivative order is where decades of domain knowledge are encoded. Chemometrics is half preprocessing and half regression, and a CNN's seductive pitch is that it can learn this preprocessing — a claim the next section tests and the chapter's code refutes for small data.
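The baseline-removal claims above can be checked numerically with SciPy's savgol_filter. This is a small sketch under assumptions: the toy peak, the offset, the slope, and the window of 15 with polynomial order 2 are illustrative choices, not the series' tuned settings.

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.arange(701, dtype=float)
peak = np.exp(-0.5 * ((x - 350.0) / 12.0) ** 2)   # the "chemistry"
offset = 0.8                                      # constant baseline
slope = 0.002 * x + 0.3                           # sloping (linear) baseline

# First derivative: a constant offset vanishes.
d1_clean = savgol_filter(peak, window_length=15, polyorder=2, deriv=1)
d1_offset = savgol_filter(peak + offset, window_length=15, polyorder=2, deriv=1)

# Second derivative: a linear baseline vanishes too.
d2_clean = savgol_filter(peak, window_length=15, polyorder=2, deriv=2)
d2_sloped = savgol_filter(peak + slope, window_length=15, polyorder=2, deriv=2)

print(np.allclose(d1_clean, d1_offset, atol=1e-8))
print(np.allclose(d2_clean, d2_sloped, atol=1e-8))
print(d2_clean[350] < 0)   # the peak apex shows up as a negative trough
```

Both comparisons hold because the filter is linear and its local quadratic fit reproduces constants and straight lines exactly, so their derivative of the matching order is zero.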

Deep learning for spectroscopy: the wave, and the honest caveat

The research literature is full of deep learning on spectra: 1D convolutional neural networks (1D-CNNs) that learn spectral features instead of hand-designing the preprocessing, variational autoencoders (VAEs) for monitoring, transformers for mass spectra. The pitch is seductive: a CNN can learn its own baseline correction and feature extraction, so you skip the chemometric preprocessing craft and let the network find the signal. Amgen and others have benchmarked CNNs and VAEs against PLS for Raman CQA (Critical Quality Attribute, a product property that must stay in spec for the drug to be safe and effective) prediction, including charge-variant monitoring, at the (pilot) tier [5][6].

How a 1D-CNN reads a spectrum

To see why a CNN could learn the preprocessing, look at what it actually does. A spectrum of 701 channels is treated as a one-dimensional signal with a single input channel — the analogue of a grayscale image, but in one dimension along the wavenumber axis. A 1D convolution slides a small kernel (say 15 wavenumbers wide) across the spectrum, computing at each position a weighted sum of the local intensities; with eight such kernels the first layer produces eight feature maps, each highlighting a different local spectral motif (a peak shoulder, a baseline slope, a doublet). Crucially, a derivative is a convolution with a fixed kernel, and a Savitzky-Golay smooth is a convolution with a fixed kernel — so a convolutional layer with learnable weights can, in principle, learn the preprocessing that chemometricians hand-specify. A ReLU nonlinearity then zeroes the negative responses, letting the network represent things a single linear projection cannot. Pooling downsamples: max-pooling over a width of 4 keeps the strongest response in each window (translation tolerance against small peak shifts), and adaptive average pooling collapses the variable-length feature map to a fixed size so a dense head can read it. Stacking two such conv-pool blocks builds features of features — local motifs in the first block, combinations of motifs in the second — before a small fully-connected regression head maps the pooled features to a single titer number. This is genuine representation learning, and on a large, nonlinear problem it would shine. The question is whether a GMP spectral calibration is that problem.
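The claim that a Savitzky-Golay derivative is a convolution with a fixed kernel can be verified directly. The sketch below is illustrative and uses an invented toy spectrum: it pulls the 15-tap second-derivative kernel from SciPy and applies it with a plain convolution, which is the same arithmetic a kernel_size=15 convolutional layer performs with learned weights. (Deep-learning frameworks typically compute cross-correlation rather than flipped convolution; for this symmetric kernel the two coincide.)

```python
import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter

window, order = 15, 2
# Fixed 15-tap kernel that computes a smoothed second derivative.
kernel = savgol_coeffs(window, order, deriv=2, use="conv")

# Toy spectrum: two peaks plus noise (illustrative only).
rng = np.random.default_rng(0)
x = np.arange(701, dtype=float)
spectrum = (
    np.exp(-0.5 * ((x - 300.0) / 10.0) ** 2)
    + 0.5 * np.exp(-0.5 * ((x - 420.0) / 18.0) ** 2)
    + 0.01 * rng.normal(size=x.size)
)

by_conv = np.convolve(spectrum, kernel, mode="valid")      # plain sliding weighted sum
by_filter = savgol_filter(spectrum, window, order, deriv=2)

half = window // 2
# Away from the edges the two are the same numbers.
print(np.allclose(by_conv, by_filter[half:-half]))
```

The comparison skips the first and last seven points because savgol_filter handles the edges with a separate polynomial fit, whereas the "valid" convolution simply does not produce them.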

The honest caveat

Here is the caveat the hype skips, and it is the spine of this chapter. Deep learning earns its keep when data is abundant and the function is complex and nonlinear. Spectroscopy in the GMP lab is the opposite. The relationship between a vibrational spectrum and a concentration is close to linear in the working range: Beer-Lambert for the absorption methods (NIR, FTIR), where absorbance (the negative log of the transmitted-to-incident intensity ratio) is proportional to concentration, and a directly proportional scattering intensity for Raman. And the data is scarce: a calibration set of dozens to a few hundred spectra, not millions. A model with thousands of parameters has nothing to bite on; it overfits the calibration set and, worse, it is a black box a regulator must be persuaded to trust. The pattern across careful benchmarks is consistent: on small, clean spectral datasets, PLS is a baseline that deep learning ties at best and complicates always [7]. What edge deep learning does show on spectra typically appears only under heavy spectral data augmentation (synthetic shift/scatter/slope perturbations of the training spectra); a well-optimized iterative-PLS stays competitive without it on small clean sets [7]. Deep learning's real openings are narrower (heavily nonlinear matrices, calibration transfer between instruments, or fusing spectra with images), not the everyday Raman-to-titer map.

The correction: the BI 16-attribute Raman work was KNN, not deep learning

One paper is cited so often, and so often miscited, that it deserves its own paragraph. Boehringer Ingelheim demonstrated computational Raman with robotic calibration predicting 16 quality attributes during Protein A chromatography in real time — a genuinely impressive (pilot) result, monitoring product quality roughly every half-minute [8]. It is repeatedly invoked as proof of a "deep-learning Raman wave." It is not a deep-learning paper. The model that performed best overall and on aggregates was k-nearest-neighbors (KNN) — a classical, non-parametric, distance-based method with no neural network anywhere in it, paired with Butterworth-filter preprocessing. (On the fragments metric a CNN and SVR slightly edged KNN, but KNN won on aggregates and on overall error, and KNN is the headline method.) KNN does not even fit parameters: it stores the calibration spectra and, at prediction time, averages the targets of the few nearest stored spectra in distance — the antithesis of a deep network. Citing this work as evidence that deep learning has conquered Raman PAT is a factual error, and one this book will not make. If anything, the BI result reinforces the chapter's thesis: on real bioprocess spectra, a simple, classical learner beat the deep ones.
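To make the "no fitted parameters" point concrete, here is a minimal k-nearest-neighbors regressor in NumPy. It is a generic illustration of the technique, not a reconstruction of the Boehringer Ingelheim pipeline: the function name knn_predict, the Euclidean distance, the unweighted average, and the toy data are all assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Predict by averaging the targets of the k nearest stored samples (Euclidean)."""
    # Pairwise distances, shape (n_new, n_train)
    d = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]        # indices of the k closest rows
    return y_train[nearest].mean(axis=1)

# "Training" is just storing the calibration set (toy one-channel example).
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

pred = knn_predict(X_train, y_train, np.array([[1.1]]), k=3)
print(pred)
```

There is no optimizer and no weight vector anywhere in this code; the model is the stored data plus a distance rule, which is what makes it the antithesis of a deep network.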

Evidence

The Boehringer Ingelheim 16-attribute real-time Raman work [8] is (pilot), peer-reviewed, and used KNN as its best overall model — not a CNN or any deep network. It is the strongest evidence in this chapter, and it is evidence for classical chemometrics, not against it. Deep-learning-versus-PLS Raman benchmarks (Amgen and others) are (pilot)/(research) and report deep learning matching, not decisively beating, PLS on charge-variant and CQA prediction [5][6][7].

A 1D-CNN versus PLS, on the same spectra

The cleanest way to make this argument is to run it. The series simulator emits an in-line Raman dataset, examples/datasets/raman_spectra.parquet: 336 hourly timepoints from the golden batch BATCH-2026-001, each a 701-channel spectrum (wn_400 through wn_1800, the Raman shift in cm⁻¹) paired with the kinetic-state reference labels (the lab-measured ground-truth values the model is trained to predict) — glucose, lactate, glutamine, VCD (viable cell density), and titer_g_L. We build two soft sensors for titer on identical train/test splits: a PLS regression whose component count is chosen by inner cross-validation — five on this dataset (the chemometric baseline) — and a compact 1D-CNN that treats the spectrum as a one-channel signal and learns its own filters over wavenumber. Then we let them race.

The 1D-CNN is deliberately small — two convolution blocks and a tiny regression head — because the dataset is small. Even so it carries thousands of parameters where PLS carries a few hundred coefficients constrained to five degrees of freedom. The architecture is the one dissected above, made concrete:

# examples/platform/ml/soft_sensor_deep.py — a compact 1D-CNN over the Raman spectrum.
import torch.nn as nn

class SpectraCNN(nn.Module):
    """A compact 1D-CNN: two conv blocks over wavenumber, then a small head."""
    def __init__(self, n_wavenumbers: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=15, padding=7), nn.ReLU(),   # 8 learned filters, width 15
            nn.MaxPool1d(4),                                          # tolerate small peak shifts
            nn.Conv1d(8, 16, kernel_size=11, padding=5), nn.ReLU(),   # features of features
            nn.AdaptiveAvgPool1d(8),                                  # collapse to fixed length
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 8, 32), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(32, 1),                        # regress one titer number
        )

    def forward(self, x):                 # x: (batch, 1, n_wavenumbers)
        return self.head(self.features(x))
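The parameter count of this architecture can be checked by hand, without PyTorch. The sketch below is plain arithmetic under the usual conventions (each convolution and linear layer has one bias per output); the helper names conv1d_params and linear_params are made up for the example. Pooling, ReLU, Dropout, and Flatten carry no trainable parameters.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights (c_out * c_in * kernel_size) plus one bias per output channel."""
    return c_out * (c_in * kernel_size + 1)

def linear_params(n_in, n_out):
    """Weights (n_out * n_in) plus one bias per output unit."""
    return n_out * (n_in + 1)

layers = {
    "conv1 (1 -> 8, k=15)": conv1d_params(1, 8, 15),
    "conv2 (8 -> 16, k=11)": conv1d_params(8, 16, 11),
    "linear1 (16*8 -> 32)": linear_params(16 * 8, 32),
    "linear2 (32 -> 1)": linear_params(32, 1),
}
total = sum(layers.values())

for name, n in layers.items():
    print(f"{name}: {n}")
print("total:", total)
```

Most of the budget sits in the first dense layer rather than in the convolutions, and because adaptive pooling fixes the feature length at 8, the total does not depend on the number of input wavenumbers.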

Training is the standard supervised loop, and it is worth naming the pieces because they are the same ones a regulator will ask about. The spectra are standardized with a scaler fit on the training split only (StandardScaler().fit(Xtr)) and then applied unchanged to the test split; fitting the scaler on the full set would be a form of data leakage, letting test statistics bleed into training. The optimizer is Adam at learning rate 1e-3 with weight_decay=1e-4 (L2 regularization, the network's defense against overfitting 5,713 parameters to a few hundred rows), the loss is mean-squared error, and the loop runs 300 epochs under a fixed seed (torch.manual_seed(2026)) so the result is repeatable on a given hardware and library stack. The PLS side is the same chemometric pipeline the rest of the series uses: apply the SNV and Savitzky-Golay front end, let inner cross-validation pick the component count (n_components=select_n_components(Xtr, ytr), five here under the one-SE rule), fit PLSRegression, and score on the held-out slice, all wrapped so the two models share one train_test_split(..., random_state=2026). The head-to-head driver runs both and prints the comparison:

# examples/platform/ml/soft_sensor_deep.py — the head-to-head, run with a fixed seed.
from soft_sensor_pls import train_pls          # PLSRegression baseline

pls = train_pls()                              # inner-CV-selected PLS (5 comps) over 701 wavenumbers
cnn = train_cnn(epochs=300)                    # the SpectraCNN above
print(f"  PLS    : R2={pls['r2']}  RMSE={pls['rmse_g_L']} g/L  ({pls['n_params']} coefficients)")
print(f"  1D-CNN : R2={cnn['r2']}  RMSE={cnn['rmse_g_L']} g/L  ({cnn['n_params']} parameters)")
print(f"  PLS uses {cnn['n_params'] / pls['n_params']:.0f}x fewer params and is not beaten on R2.")

Running python soft_sensor_deep.py prints (verbatim, SIM_SEED=2026, deterministic across runs):

Head-to-head: titer from 701-wavenumber Raman, golden batch BATCH-2026-001
  PLS    : R2=0.9944  RMSE=0.127 g/L  (702 coefficients)
  1D-CNN : R2=0.9924  RMSE=0.1488 g/L  (5713 parameters)
  PLS uses 8x fewer params and is not beaten on R2.

Read that result the way an honest analyst should. PLS lands at R² = 0.9944 (RMSE 0.127 g/L) and the 1D-CNN at R² = 0.9924 (RMSE 0.149 g/L). On 101 test points, with about one-eighth of the parameters, the linear model is not merely tied; it is marginally ahead. The deep model's 8x parameter count bought nothing: it cost a little accuracy, and it added a GPU-shaped training loop and a model far harder to validate and explain to a regulator. This is the small-data ceiling the learning-problem chapter named, made concrete on the simulator's spectra: when the relationship is near-linear and the data is scarce, the linear model is not a compromise; it is the right answer. One caveat keeps the comparison honest: this head-to-head is a within-batch interpolation test. All 336 rows come from the single golden run BATCH-2026-001, split by a random hold-out, so it measures whether a sensor tracks titer during a run, not whether it transfers across batches (the cross-batch question lives in transfer.py and drift.py). That is why both R² values sit so high: the simulator's spectra carry the titer signal cleanly within one run. On real Raman data both models can be expected to drop, but the ranking (PLS competitive with, or ahead of, deep learning) is exactly what the peer-reviewed benchmarks report [7].

The chapter's thesis as a picture: a PLS lane (five inner-CV-selected latent components, 702 coefficients) and a 1D-CNN lane (5713 parameters) both recover titer from the same Raman spectrum, and PLS edges the deep model on held-out accuracy — so the network spends 8x the parameters and still loses, the small-data ceiling made visible. Original diagram by the authors, created with AI assistance.

Automated chromatograms and peak integration

Spectroscopy is the lab's continuous, in-line frontier; chromatography is its discrete, offline backbone. HPLC, SEC (size-exclusion, for monomer/aggregate), CEX (cation-exchange, for charge variants), and CE (capillary electrophoresis) all produce a chromatogram — a trace of detector signal against time, with peaks whose area is the measurement. BATCH-2026-001's release record is a stack of these: SEC reports 98.611 percent monomer, 1.287 percent HMW (high-molecular-weight aggregate), 0.439 percent LMW (low-molecular-weight fragment); CEX reports 70.686 percent main, 21.551 percent acidic, 10.452 percent basic — every one of those numbers is a peak area, integrated from a curve, and a few percent of misallocated area can move a charge-variant or aggregate fraction across its limit. (Not every release number is a chromatogram peak: the host-cell-protein result that put BATCH-2026-004 out of specification — 128 ng/mg against a 100 ng/mg limit — comes from an ELISA immunoassay, not an integrated peak, a reminder that the peak-integration problem governs the aggregation and charge-heterogeneity assays specifically, while residual Protein A and host-cell DNA are likewise read by immunoassay and qPCR.)

Peak integration is the act of deciding where each peak starts, where it ends, and how the baseline runs beneath it, and it is far more consequential than it sounds. Two analysts integrating the same wobbly baseline can disagree on whether a small shoulder is one peak or two, and that disagreement can move an attribute across a spec limit. The incumbent here is not ML. Mainstream chromatography data systems (CDS), such as Waters Empower with its ApexTrack algorithm, use deterministic signal processing: ApexTrack detects peaks by the second derivative of the signal (an apex is where curvature is most negative, the same Savitzky-Golay derivative idea reappearing), then constructs baselines by threshold and inflection-point rules. That determinism is a regulatory feature, not a bug, because the same chromatogram, with the same method parameters, always integrates to the same area; the result is reproducible and auditable by construction [9]. Where ML is arriving is at the edges: deep-learning peak integration (a Merck KGaA / Bosch CNN architecture, (research)) aims to learn an analyst's judgment on hard, noisy, co-eluting peaks where the deterministic algorithm needs manual rework [10], and CDS vendors are adding ML-based anomaly flagging that calls a human to peaks worth reviewing rather than silently re-integrating them.
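The second-derivative idea can be illustrated on a synthetic trace. To be clear about scope: this is a simplified sketch of curvature-based apex detection followed by fixed-window trapezoidal integration, not Waters' ApexTrack algorithm, whose baseline construction rules are not reproduced here. The peak positions, heights, threshold, and the 1-minute half-window are all invented for the example.

```python
import numpy as np
from scipy.signal import find_peaks, savgol_filter

t = np.linspace(0.0, 20.0, 2001)          # retention time in minutes
dt = t[1] - t[0]

def gaussian(t, center, sigma, height):
    return height * np.exp(-0.5 * ((t - center) / sigma) ** 2)

# Synthetic trace: a small early peak and a large main peak on a flat baseline.
signal = gaussian(t, 7.0, 0.3, 2.0) + gaussian(t, 10.0, 0.3, 100.0)

# Smoothed second derivative; an apex is where curvature is most negative.
d2 = savgol_filter(signal, window_length=21, polyorder=3, deriv=2, delta=dt)
apex_idx, _ = find_peaks(-d2, height=5.0)  # threshold rejects near-zero wiggles

def area(lo, hi):
    """Trapezoidal area of the trace between two retention times."""
    m = (t >= lo) & (t <= hi)
    tt, yy = t[m], signal[m]
    return float(np.sum(0.5 * (yy[1:] + yy[:-1]) * np.diff(tt)))

half_width = 1.0                           # fixed integration window (assumption)
areas = [area(t[i] - half_width, t[i] + half_width) for i in apex_idx]
fractions = np.array(areas) / sum(areas)

print(t[apex_idx])
print(np.round(100 * fractions, 2))        # percent of total integrated area
```

Everything here is deterministic: rerunning the script on the same trace with the same parameters gives the same apexes and the same areas, which is the auditability property the text attributes to the incumbent approach.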

Evidence

Mainstream chromatography peak integration (Waters Empower ApexTrack) remains deterministic signal processing and is (production) — the determinism is what makes it auditable [9]. Deep-learning peak integration (Merck KGaA / Bosch universal CNN architecture) is (research) — demonstrated, not GMP-routine [10]. The honest reading: ML is augmenting chromatography review (flagging hard peaks for human eyes), not replacing the validated integrator that decides a release number.

The reason the deterministic incumbent is so sticky is the same theme again: a chromatographic release number feeds a batch-disposition decision, so the integration must be reproducible, explainable, and locked under change control, which are exactly the properties a deterministic algorithm has by construction and a learned one must earn. The Merck KGaA / Bosch CNN work confronts this directly, proposing a GxP human-in-the-loop framework in which the network proposes integrations and a qualified analyst confirms them, rather than letting the model emit release numbers unsupervised. The draft EU/PIC/S GMP Annex 22 (the draft GMP rule on AI) draws a sharp line: it excludes adaptive and generative AI from critical GMP tasks and requires locked models with a predetermined change control plan [11]. An integrator whose behavior shifts as it learns is precisely what that line is about.

Glycan, charge-variant, and PTM characterization

The hardest analytical assays are the structural ones, where a chromatogram or mass spectrum carries not one number but a whole profile of product variants. Glycosylation (the sugar chains decorating the antibody's Fc) is a CQA because it tunes effector function and clearance: core afucosylation sharply raises FcγRIIIa binding and ADCC potency, while high-mannose and galactosylation/sialylation shift half-life and complement activity. In short, the sugar pattern tunes how strongly the antibody recruits immune killing and how fast the body clears it, so a glycan map is both a forest of peaks and a forest of consequences. (The physical glycobiology behind these terms is laid out in Book 1's analytical and formulation chapter.) Charge variants (the acidic and basic species CEX resolves, 21.551 and 10.452 percent on BATCH-2026-001) and post-translational modifications (PTMs) like deamidation and oxidation are read by mass spectrometry, increasingly through the Multi-Attribute Method (MAM): LC-MS that reports many product-quality attributes from one run, with automated New Peak Detection (NPD) to catch any new species the process did not make before [12].

This is where the dimensionality genuinely invites learning. A mass spectrum is high-dimensional and a glycopeptide fragmentation pattern is complex, so deep learning has real (research)-tier traction: BERT-style transformer models predicting glycopeptide tandem mass spectra to power glycoproteomics — a problem with the two properties the Raman-to-titer map lacks, namely rich data and a genuinely complex, nonlinear structure-to-spectrum mapping [13] — and deep-learning-assisted glycan database search. But note the maturity gap. MAM's production New Peak Detection is largely algorithmic/feature-based (peak-matching and intensity-threshold rules against a reference run), not deep learning, with ML being layered on rather than replacing the validated core [12]; the deep-learning glycan work lives in discovery and characterization, not in GMP lot release. The pattern holds: where data is rich and the structure is genuinely complex (mass spectra), deep learning has a foothold; where a number gates a release (the integrated charge-variant fraction on BATCH-2026-001's CofA), the validated classical method still rules.

Anomaly detection in analytical and stability data

Not every analytical learning task is a soft sensor. A large class is anomaly detection — flagging the assay result, the chromatogram, or the stability trend that does not look like the others, without a labeled "this is bad" training set, because real failures are too rare to learn from directly. This is unsupervised learning: fit a model of "normal" on the good runs, then score how far each new run sits from that normal. In MSPC this is exactly the Hotelling's T² and squared-prediction-error (SPE/Q) machinery from the open-source analytics chapter — a PCA model of normal batches, with a new batch's distance off the model plane flagging genuinely novel behavior. The two distances answer two different questions: T² measures how far a sample sits from the calibration cloud inside the model's plane (an extreme but recognizable value, like an unusually high glucose reading on a known trajectory), while SPE/Q measures the residual off the plane (a spectrum whose shape the model cannot reconstruct — a genuinely novel event the calibration never saw). A point can be normal on one and alarming on the other. Newer (research) work pushes toward LSTM autoencoders and isolation forests for the same job [6].
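The two distances can be computed from a PCA model in a few lines. The sketch below is illustrative: the synthetic "normal" data, the two-component model, and the two constructed test points are assumptions, and the limits are taken as the training 99th percentile as a simple empirical choice rather than any particular vendor's formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n, channels, k = 200, 50, 2

# Synthetic "normal" runs: two latent factors plus small noise (illustrative only).
shapes = rng.normal(size=(k, channels))
X = rng.normal(size=(n, k)) @ shapes + 0.05 * rng.normal(size=(n, channels))

mean = X.mean(axis=0)
_, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
P = Vt[:k].T                              # loadings spanning the model plane
score_var = S[:k] ** 2 / (n - 1)          # variance of each score on training data

def t2_spe(x):
    """Hotelling's T2 (distance inside the plane) and SPE/Q (squared residual off it)."""
    xc = x - mean
    scores = xc @ P
    resid = xc - scores @ P.T
    return float(np.sum(scores ** 2 / score_var)), float(resid @ resid)

train = np.array([t2_spe(row) for row in X])
t2_limit, spe_limit = np.percentile(train, 99, axis=0)

# Case A: extreme but recognizable, pushed 6 standard deviations along component 1.
in_plane = mean + 6.0 * np.sqrt(score_var[0]) * P[:, 0]
# Case B: a novel shape, a vector made orthogonal to the model plane.
v = rng.normal(size=channels)
v -= P @ (P.T @ v)
off_plane = mean + 5.0 * v / np.linalg.norm(v)

a_t2, a_spe = t2_spe(in_plane)
b_t2, b_spe = t2_spe(off_plane)
print("A:", a_t2 > t2_limit, a_spe > spe_limit)
print("B:", b_t2 > t2_limit, b_spe > spe_limit)
```

Case A trips only the T² limit and case B trips only the SPE limit, which is the "normal on one, alarming on the other" behavior described above.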

Stability and shelf-life data is a distinct, and instructive, sub-case. Predicting how a CQA will drift over months of storage is a small-data, extrapolation-heavy problem — exactly the regime where pure black-box ML is dangerous and a Bayesian hierarchical model (Merck & Co.'s HPV-vaccine shelf-life work, (research)) or a mechanistic kinetic model (Arrhenius-style degradation kinetics) wins, because the structure carries the extrapolation the sparse data cannot [14]. A Bayesian hierarchical model pools information across lots — each lot gets its own degradation rate, but the rates are drawn from a shared population distribution, so a lot with only a few timepoints borrows strength from the others rather than extrapolating off a noisy two-point line. It is the hybrid-modeling argument the whole book makes, applied to the timeline of a vial sitting in a fridge.
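As a small worked example of why mechanistic structure carries extrapolation, the Arrhenius relation k(T) = A·exp(−Ea / (R·T)) fixes how a degradation rate scales with temperature once the activation energy is known. The numbers below are hypothetical: the activation energy of 80 kJ/mol and the pre-exponential factor are placeholders chosen for illustration, not values for any real product.

```python
import math

R = 8.314                      # gas constant, J/(mol*K)

def arrhenius_rate(A, Ea, temp_c):
    """Rate constant k = A * exp(-Ea / (R * T)), with T in kelvin."""
    return A * math.exp(-Ea / (R * (temp_c + 273.15)))

Ea = 80_000.0                  # hypothetical activation energy, J/mol
A = 1.0e12                     # hypothetical pre-exponential factor (cancels in the ratio)

k_25 = arrhenius_rate(A, Ea, 25.0)   # accelerated condition
k_5 = arrhenius_rate(A, Ea, 5.0)     # refrigerated storage
ratio = k_25 / k_5

print(f"degradation at 25 C runs about {ratio:.1f}x faster than at 5 C")
```

Under these assumed values, a few weeks of accelerated data constrain months of refrigerated behavior through a single physically meaningful parameter, which is the kind of leverage a black-box fit to sparse timepoints does not have.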

Anatomy of a chemometric soft-sensor prediction record

When the PLS Raman model fires on a live spectrum, it does not emit a bare 3.8. Like every artifact in this series, the value is only as trustworthy as what travels with it — and for a chemometric prediction that means the spectrum, the preprocessing, the latent projection, the uncertainty, and the eventual reference assay that will grade it. Unpack one prediction the way a reviewer would.

One chemometric prediction is a whole record: the spectrum that fed it, the preprocessing and five latent scores that transformed it, the estimate paired with its two MSPC distances (Hotelling's T² for in-model extremeness, SPE/Q for off-model novelty), the input-quality status that distrusts a fouled probe, and the delayed reference assay that will eventually reveal the residual. Original diagram by the authors, created with AI assistance.

Read the card field by field, top to bottom. The record falls into four bands: identity, the core estimate, reconciliation, and relationships.

Header — model: raman_titer_pls, version v5c, dataset_hash. The model identity is a version plus the hash of the exact calibration set it was fit on. Two predictions are comparable only if they came from the same locked model; the hash lets an auditor prove which PLSRegression(n_components=5) produced a number, and it is the anchor ICH Q14's (the international analytical-procedure-lifecycle guideline) lifecycle expectations attach a change record to.
timestamp. When the spectrum was acquired — distinct from when the reference assay will land, which matters because the reference is delayed (the residual cannot be computed until it arrives).
input_spectrum (raw, 701 intensities). The cheap, fast signal as the probe emitted it, with its fluorescence baseline and scatter intact. Stored raw because every downstream transform must be reproducible from it.
preprocessed (baseline- and scatter-corrected). The same spectrum after SNV and Savitzky-Golay — the form PLS actually consumed. Stored separately so a reviewer can see exactly what the model saw, not infer it.
latent_scores (5 values). The spectrum's coordinates in the five-dimensional PLS latent space (the t scores), compressing 701 channels to the five numbers that carry the model's view of this sample. T² is computed from these scores directly; SPE/Q is computed from whatever part of the preprocessed spectrum the scores fail to reconstruct.
coefficients (frozen). The 702-number regression vector, frozen at calibration. Frozen is the operative word: under draft Annex 22 a critical-GMP model must be locked, so these coefficients do not move between predictions.
Core — titer_predicted_g_L. The estimate itself, e.g. 3.8 g/L. A bare soft sensor stops here; a governed one does not.
Core — hotelling_t2. The in-model distance: how far this spectrum's latent scores sit from the center of the calibration cloud, against a control limit. A high T² says "extreme, but a kind of sample I recognize."
Core — spe_q. The off-model residual: the part of the spectrum the five components could not reconstruct, against its own limit. A high SPE/Q says "this spectrum has a shape I was never trained on" — a fouled window, a bubble, a new contaminant. T² and SPE together let the model say not just "the titer is 3.8" but "and this spectrum looks like the ones I was trained on" — the self-distrust the data book argued a soft sensor must carry.
Reconciliation — reference_value. The delayed offline assay (an HPLC titer) that is the official number; the soft sensor is decision support until this lands.
Reconciliation — residual. Prediction minus reference, computed once the reference arrives — the running record of whether the model is still in calibration.
Reconciliation — status. The input-quality verdict (e.g. ok, probe_fouled, bubble) that lets the system distrust its own input rather than trust a prediction made on a corrupted spectrum.
Relationships — trained_on, validated_against, feeds, revalidates_when. Links to the calibration set, the reference method the model was qualified against, the MSPC chart this record populates, and the trigger (probe swap, scale change, drift threshold) that forces re-qualification — the four edges that make this record a node in an instance graph rather than an orphan row, the kind Book 4 models.

The card is the difference between a number and a governed analytical result — the same governance the open-source soft-sensor record and ICH Q14's model-lifecycle expectations demand.
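
One way to hold the card in code is a frozen dataclass, sketched below. The field names follow the card; the class name, the method names, the limits, and every value in the example are hypothetical. The coefficient vector is left out here on the assumption that version plus dataset hash already identifies it.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional, Tuple

@dataclass(frozen=True)   # frozen: a written record is never mutated in place
class SoftSensorPrediction:
    # Identity
    model: str
    version: str
    dataset_hash: str
    timestamp: datetime
    input_spectrum: Tuple[float, ...]
    preprocessed: Tuple[float, ...]
    latent_scores: Tuple[float, ...]
    # Core estimate
    titer_predicted_g_L: float
    hotelling_t2: float
    spe_q: float
    # Reconciliation
    status: str = "ok"
    reference_value: Optional[float] = None
    # Relationships (plain strings here; identifiers in a real graph)
    trained_on: str = ""
    validated_against: str = ""
    feeds: str = ""
    revalidates_when: Tuple[str, ...] = ()

    @property
    def residual(self) -> Optional[float]:
        """Prediction minus reference; undefined until the assay lands."""
        if self.reference_value is None:
            return None
        return self.titer_predicted_g_L - self.reference_value

    def is_trustworthy(self, t2_limit: float, spe_limit: float) -> bool:
        """Input is clean and the spectrum sits inside the calibration domain."""
        return (self.status == "ok"
                and self.hotelling_t2 <= t2_limit
                and self.spe_q <= spe_limit)

record = SoftSensorPrediction(
    model="raman_titer_pls", version="v5c",
    dataset_hash="<hash of calibration set>",                 # placeholder
    timestamp=datetime(2026, 1, 15, 8, 30, tzinfo=timezone.utc),  # made up
    input_spectrum=(0.0,) * 701, preprocessed=(0.0,) * 701,   # placeholders
    latent_scores=(0.0,) * 5,
    titer_predicted_g_L=3.8, hotelling_t2=4.2, spe_q=0.03,    # made-up distances
)
graded = replace(record, reference_value=3.7)   # the delayed reference arrives
fouled = replace(record, status="probe_fouled")
print(record.residual, graded.residual, fouled.is_trustworthy(10.0, 0.1))
```

Attaching the reference with `replace` produces a new record rather than editing the old one, which keeps the pre-reconciliation state available to an auditor.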

What makes the record trustworthy: the ontology under the row

Read the card once more and notice that every field on it is a claim about a typed thing — a spectrum, a latent score, a reference assay, a calibration set — and the discipline that pins down what each of those things is, and how they connect, is the bioprocess ontology the companion ontology book builds. That grounding is not decoration; it is what makes a chemometric model FAIR (Findable, Accessible, Interoperable, Reusable) and auditable in the first place, and it earns the difference between a brittle column name and a fact.

A feature pulled by its IRI, not its column header. When the PLS model's latent_scores row is fed forward — into the MSPC chart, a release predictor, or a retrieval-augmented LLM — it travels as a quantity bound to a global IRI (Internationalized Resource Identifier — a globally unique web name for a thing), not as the string "wn_1274" in a CSV. The ontology book's identifiers chapter makes the case: BR-101 is a primary key that means one thing in this plant and something else in the next, while bp:BATCH-2026-001 resolves to one batch everywhere. A feature pulled by IRI does not break when a vendor renames a tag or a wavenumber bin shifts; a feature pulled by header silently mismatches. And the value never travels bare — a titer of 3.8 carries a QUDT unit (g/L) and an xsd:float datatype, so a downstream model cannot confuse grams-per-litre with the percent of a charge variant.
The same SHACL shape that gates a release gates the training set. A model has no native notion of "complete": handed a spectrum row whose reference assay silently never loaded, it will impute or predict around the hole and report a confident number. The release-gate-and-SHACL chapter supplies the guard — the closed-world bp:ReleaseShape that certifies every required CQA is present, singular, typed, and in range before a lot is trusted. Run that same shape as an admission gate on a calibration or training row and a hollow or mislabeled spectrum is caught in front of a reviewer, not learned as a clean label. SHACL guarantees a well-formed row, not a true one — the right HMW value on the wrong vial still passes — which is exactly the completeness-not-correctness limit that keeps a CQA-touching soft sensor advisory under locked-model change control.
BFO keeps the measurement distinct from the run. The card's spectrum is a measurement of a sample; the bioreactor culture it was drawn from is a process. The classes-and-taxonomy chapter draws that as the continuant/occurrent cut — a Raman result (a quality borne by a material) is never the fermentation (an occurrent that happens and is over), and BR-101 the vessel is never BATCH-2026-001 the material it held. A loader that conflates them attaches the wrong lineage to the wrong node, which is precisely how a sibling spectrum leaks into the wrong batch's record.
The lineage edge is the cross-validation grouping key. This chapter is blunt that the head-to-head is a within-batch test: all 336 rows come from BATCH-2026-001. The honest cross-batch evaluation needs the rows grouped by their shared ancestry, and the relations-and-genealogy bp:derivedFrom spine (a PROV-O-style provenance edge declared owl:TransitiveProperty) is that grouping key: group rows by the working cell bank they share, never split one group across folds, and a grouped / leave-one-batch-out cross-validation stops sibling spectra of one run from leaking across the fold and flattering the score. The same transitive edge that scopes a recall scopes a fair model evaluation. This is why the within-batch R² = 0.9944 is honest only as interpolation, and why transfer.py and drift.py carry the cross-batch question.
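
The grouping rule in the last item is small enough to sketch in plain Python. The lineage keys below are hypothetical stand-ins for whatever shared ancestor the bp:derivedFrom edge resolves to; scikit-learn's `LeaveOneGroupOut` and `GroupKFold` splitters do the same job inside a pipeline.

```python
from collections import defaultdict

def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per group."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    for held_out, test_indices in by_group.items():
        train_indices = [
            i for group, members in by_group.items()
            if group != held_out
            for i in members
        ]
        yield held_out, train_indices, test_indices

# Hypothetical lineage key per spectrum row (one entry per row of the dataset).
row_lineage = ["bp:WCB-A"] * 4 + ["bp:WCB-B"] * 3 + ["bp:WCB-C"] * 3

folds = list(leave_one_group_out(row_lineage))
for held_out, train_idx, test_idx in folds:
    print(f"hold out {held_out}: train on {len(train_idx)} rows, test on {len(test_idx)}")
```

A random split of the same ten rows would almost certainly put siblings from one lineage on both sides of a fold, which is the leak the grouped split is there to prevent.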

The lesson is the one the whole series keeps returning to: a soft sensor is only as trustworthy as the typed, identified, lineage-anchored record it travels in — and the ontology is what turns the anatomy card from a tidy table into a node a reviewer, a query, and a model can all stand on.
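
The admission-gate idea above can be illustrated with a plain-Python stand-in. This is not SHACL and not bp:ReleaseShape itself: the shape dictionary, field names, units, and ranges are hypothetical, and the function only mimics the "present, singular, typed, in range" checks a closed-world shape would express.

```python
# Hypothetical shape: field -> (type, unit, (min, max)).
SHAPE = {
    "titer_reference_g_L": (float, "g/L", (0.0, 20.0)),
    "hmw_percent": (float, "%", (0.0, 100.0)),
}

def admission_violations(row):
    """row maps field -> list of (value, unit) observations. Returns violations."""
    violations = []
    for name, (kind, unit, (low, high)) in SHAPE.items():
        observations = row.get(name, [])
        if len(observations) != 1:                       # present and singular
            violations.append(f"{name}: expected 1 value, found {len(observations)}")
            continue
        value, row_unit = observations[0]
        if not isinstance(value, kind):                  # typed
            violations.append(f"{name}: expected {kind.__name__}")
        elif not low <= value <= high:                   # in range
            violations.append(f"{name}: {value} outside [{low}, {high}]")
        if row_unit != unit:                             # unit matches the shape
            violations.append(f"{name}: unit {row_unit!r}, expected {unit!r}")
    for name in row:                                     # closed world
        if name not in SHAPE:
            violations.append(f"{name}: not allowed by the shape")
    return violations

complete = {"titer_reference_g_L": [(3.7, "g/L")], "hmw_percent": [(1.2, "%")]}
hollow = {"hmw_percent": [(1.2, "%")]}           # reference assay never loaded
wrong_vial = {"titer_reference_g_L": [(4.9, "g/L")], "hmw_percent": [(0.6, "%")]}

for label, row in [("complete", complete), ("hollow", hollow), ("wrong vial", wrong_vial)]:
    print(label, admission_violations(row))
```

The hollow row is stopped, but the wrong-vial row passes untouched: the gate certifies form, not truth, which is the completeness-not-correctness limit described above.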

The unsolved part: calibration transfer and the cost of being instrument-bound

The honest open problem in analytical ML is not accuracy — it is portability. A chemometric model is bound to the exact spectrometer, probe, fiber, and process conditions it was calibrated on. The bind is physical: two probes are never optically identical, so the same chemistry produces subtly different spectra on different hardware, and a model that learned the first probe's idiosyncrasies reads the second one's spectra wrong. Swap the probe, move from a 3 L development reactor to a 2,000 L manufacturing tank, or let a window foul, and a model that scored R² = 0.99 can degrade past its acceptance criterion without a single line of code changing. The literature is blunt about the magnitude: in a controlled study, two probes of the same Raman analyzer in the same culture at the same time disagreed by roughly 20 percent on cell density purely from instrument-to-instrument differences, and a calibration-transfer step was needed to halve that to roughly 10 percent [15].

The standard fix has a name and a math. Calibration transfer maps one instrument's spectra onto another's so the original model still applies. Piecewise direct standardization (PDS), the method in that study, paired with Kennard-Stone sample selection, fits, for each wavenumber on the primary instrument, a small local regression from a window of nearby wavenumbers on the secondary instrument, and assembles those local fits into a banded transformation matrix that rewrites secondary spectra into the primary's frame. It remains an active, unsolved research front because every new transfer needs its own standardization samples and the mapping degrades as conditions drift, and it is why a soft sensor moved to new hardware is, for regulatory purposes, a new analytical procedure until it is re-qualified.
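
A minimal NumPy sketch of the PDS idea follows, under loud assumptions: the instrument pair is simulated and noise-free (the secondary reads one channel late and 10 percent low), the standardization samples are drawn at random rather than chosen by Kennard-Stone, and the local regressions use ordinary least squares where practical implementations usually regularize with PCR or PLS and add an offset term.

```python
import numpy as np

def fit_pds(primary, secondary, half_window):
    """Build F so that secondary @ F approximates primary.

    Both inputs are (n_samples, n_channels) spectra of the SAME
    standardization samples, measured on each instrument.
    """
    n_channels = primary.shape[1]
    F = np.zeros((n_channels, n_channels))
    for i in range(n_channels):
        lo, hi = max(0, i - half_window), min(n_channels, i + half_window + 1)
        # Local regression: primary channel i from a window of secondary channels.
        coef, *_ = np.linalg.lstsq(secondary[:, lo:hi], primary[:, i], rcond=None)
        F[lo:hi, i] = coef
    return F

def simulate(n_samples, rng):
    """Hypothetical instrument pair built from three Gaussian bands."""
    channels = np.arange(100)
    peaks = np.stack([np.exp(-0.5 * ((channels - c) / 5.0) ** 2) for c in (20, 45, 70)])
    primary = rng.uniform(0.5, 2.0, size=(n_samples, 3)) @ peaks
    secondary = 0.9 * np.roll(primary, 1, axis=1)   # shifted one channel, 10% low
    return primary, secondary

rng = np.random.default_rng(2026)
std_primary, std_secondary = simulate(15, rng)      # standardization samples
F = fit_pds(std_primary, std_secondary, half_window=2)

new_primary, new_secondary = simulate(10, rng)      # unseen samples
rmse_before = float(np.sqrt(np.mean((new_secondary - new_primary) ** 2)))
rmse_after = float(np.sqrt(np.mean((new_secondary @ F - new_primary) ** 2)))
print(f"RMSE vs primary, before transfer: {rmse_before:.4f}")
print(f"RMSE vs primary, after transfer : {rmse_after:.2e}")
```

The near-perfect recovery here is an artifact of the clean simulation. With real probes the mapping is noisy, changes with conditions, and needs fresh standardization samples, which is why the problem stays open.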

This is also, paradoxically, one of the places deep learning may genuinely help: a network that learns instrument-invariant features could transfer where a fixed PLS calibration cannot — the abundant-data, complex-mapping regime where deep learning's edge is real rather than imagined. But that promise is (research), not (production), and it runs straight into the validation wall — a model that adapts to a new instrument is, by draft Annex 22's logic, exactly the kind of adaptive system excluded from critical GMP tasks [11]. The unsolved part is therefore a genuine tension: the technical fix (a model that travels) and the regulatory posture (a model that is locked) point in opposite directions, and no one has fully reconciled them. The next chapter takes exactly this tension as its subject.

What this chapter adds to the model suite

This chapter contributes the head-to-head soft-sensor pair to examples/platform/ml/:

soft_sensor_pls.py — the chemometric baseline: a PLSRegression over the 701-wavenumber Raman dataset whose component count is selected by inner cross-validation (five here, under the one-SE rule), reporting R² = 0.9944 and RMSE = 0.127 g/L on a held-out slice, plus VIP band importances and a Hotelling-T²/SPE applicability-domain gate, with the same train_test_split(random_state=2026) the series uses so the comparison is fair.
soft_sensor_deep.py — the SpectraCNN, a compact PyTorch 1D-CNN (two conv blocks, max-pool and adaptive-average-pool, a 32-unit dense head with dropout), plus the driver that trains both models on identical splits with a fixed seed and prints the head-to-head. It reports R² = 0.9924 and RMSE = 0.1488 g/L from 5,713 parameters — 8x more than PLS, and still edged out on accuracy.
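
The one-SE rule mentioned for the PLS component count can be sketched in a few lines. The cross-validation numbers below are invented for illustration and are not the chapter's actual results; the function name is hypothetical.

```python
def one_se_choice(cv_results):
    """Pick the simplest model within one standard error of the best.

    cv_results maps n_components -> (mean CV RMSE, standard error of that mean).
    Returns (best_n, chosen_n).
    """
    best_n = min(cv_results, key=lambda n: cv_results[n][0])
    best_mean, best_se = cv_results[best_n]
    threshold = best_mean + best_se
    chosen_n = min(n for n, (mean, _) in cv_results.items() if mean <= threshold)
    return best_n, chosen_n

# Invented inner-CV results: (mean RMSE in g/L, standard error).
cv_results = {
    1: (0.610, 0.040), 2: (0.410, 0.030), 3: (0.250, 0.020), 4: (0.160, 0.012),
    5: (0.131, 0.009), 6: (0.128, 0.009), 7: (0.125, 0.008), 8: (0.127, 0.009),
}

best_n, chosen_n = one_se_choice(cv_results)
print(f"lowest CV error at {best_n} components; one-SE rule selects {chosen_n}")
```

In this made-up table the error bottoms out at seven components, but five already sits within one standard error of that minimum, so the rule prefers the smaller model. That preference for the simplest adequate calibration is the same instinct the chapter's head-to-head argues for.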

Together they are the runnable proof of the chapter's thesis, and they are deliberately small and dependency-light (scikit-learn plus PyTorch over the committed parquet) so the result is reproducible on a laptop with no GPU. They also seed the production-bioreactor chapter, where the same PLS soft sensor moves from a lab benchmark to a live, closed-loop control element.

Why it matters

The analytical lab is where biomanufacturing's ML is both oldest and most over-claimed. Oldest, because chemometrics — PLS and PCA — has run in production for forty years and is the genuine, validated, (production) success story of bioprocess learning. Over-claimed, because the deep-learning literature implies a wave that has not reached the GMP lab, and miscitations (the BI work as "deep learning") prop up that impression. Getting this chapter right means holding two truths at once: machine learning in the analytical lab is real and deployed — and it is mostly linear. A team that reaches for a transformer to read a Raman spectrum, when a five-component PLS matches or beats it on a fraction of the parameters and a fraction of the validation burden, has mistaken novelty for progress. The honest verdict from the runnable code is the whole lesson: on small, clean spectra, the simple model wins, and knowing which analytical problem actually needs the deep one is the real expertise.

In the real world

The (production) analytical ML you would actually find in a commercial mAb lab clusters tightly. In-line Raman + PLS soft sensors run on process analyzers (Endress+Hauser Kaiser Raman Rxn heads) for glucose, lactate, and titer, with closed-loop glucose control the most mature deployment [1][2]. MSPC runs on Sartorius SIMCA / SIMCA-online and AspenTech ProMV for batch monitoring and fault detection [3][4]. Chromatography data systems (Waters Empower with ApexTrack) integrate peaks deterministically and are adding ML anomaly flagging at the edges [9]. MAM (Thermo Chromeleon + BioPharma Finder, Genedata, Protein Metrics) does automated new-peak detection, largely algorithmically, with ML being layered on [12].

The (pilot)/(research) frontier is where the excitement lives and the maturity drops: the Boehringer Ingelheim 16-attribute real-time Raman work (KNN, (pilot)) [8]; CNN/VAE-versus-PLS benchmarks for charge variants (Amgen and others, (pilot)) [5][6]; deep-learning peak integration (Merck KGaA / Bosch, (research)) [10]; BERT-style glycopeptide MS prediction (research) [13]; and Bayesian/kinetic stability prediction (Merck & Co., Sanofi, (research)) [14]. The (production)-versus-frontier line maps almost perfectly onto the linear-versus-deep line — which is the single most useful thing to know walking into an analytical-ML vendor meeting.

Key terms

Chemometrics — statistical/mathematical methods applied to chemical data; in biopharma, overwhelmingly PLS and PCA. The analytical lab's production machine learning.
PLS (Partial Least Squares) — supervised latent-variable regression that compresses a wide, collinear spectrum into a few components chosen to predict the target (maximize covariance with y), then regresses on them; collapses to a coefficient vector with few real degrees of freedom. The chemometrics workhorse.
PCA (Principal Component Analysis) — unsupervised latent-variable method factoring the spectra into scores and loadings along directions of greatest variance; the core of MSPC batch monitoring, and target-blind by design.
SNV / Savitzky-Golay — the standard spectral preprocessing: SNV row-normalizes each spectrum to cancel scatter and offset; Savitzky-Golay fits a sliding polynomial whose first/second derivative removes the baseline and sharpens peaks. The half of chemometrics that is not regression.
1D-CNN — a one-dimensional convolutional neural network that treats a spectrum as a signal, learning local filters (a derivative is a fixed convolution), ReLU nonlinearities, and pooling; the deep-learning challenger that, on small clean spectra, is edged out by PLS for far more parameters.
MSPC (multivariate statistical process control) — PCA/PLS monitoring of whole batches via Hotelling's T² (distance inside the model) and SPE/Q (distance off it); (production) in SIMCA and ProMV.
Hotelling's T² / SPE (Q) — the two MSPC distances: T² flags an extreme but recognizable sample inside the model plane; SPE/Q flags a genuinely novel one whose shape the model cannot reconstruct.
Peak integration — deciding where a chromatographic peak starts, ends, and how the baseline runs beneath it; the act that turns a curve into a release number. Deterministic in production (Empower ApexTrack, second-derivative apex detection); ML at the edges.
MAM (Multi-Attribute Method) — LC-MS reporting many product-quality attributes from one run, with automated New Peak Detection; production NPD is largely algorithmic, with ML layered on.
Calibration transfer / PDS — mapping a chemometric model from one instrument to another (piecewise direct standardization rewrites secondary spectra into the primary's frame); the unsolved portability problem, and the reason a model on new hardware is a new analytical procedure.
Glycan / charge variant / PTM — structural CQAs (sugar chains, acidic/basic species, deamidation/oxidation) read by chromatography and mass spectrometry; where deep learning has a research foothold.
KNN (k-nearest-neighbors) — the classical, non-parametric method (store the calibration spectra, average the targets of the nearest few) that was the best model in the Boehringer Ingelheim 16-attribute Raman work — which is therefore not a deep-learning result.
Semantic grounding (IRI / QUDT) — pulling a feature by its global identifier and unit-typed value from the bioprocess ontology, not by a fragile CSV column header; the discipline that makes a chemometric model FAIR and keeps a renamed tag or shifted wavenumber bin from silently corrupting a feature row.
Admission gate (SHACL on the training row) — running the closed-world release shape (bp:ReleaseShape) over a calibration or training row before a model learns from it, so a hollow or mislabeled spectrum is caught in front of a reviewer; it certifies completeness, not correctness, which is why a CQA-touching soft sensor stays advisory.
Grouped / leave-one-batch-out cross-validation — splitting evaluation on the shared bp:derivedFrom lineage (the working-cell-bank key) rather than at random, so sibling spectra of one run never leak across the fold; the same transitive provenance edge that scopes a recall scopes a fair score, and the reason this chapter's within-batch R² is honest only as interpolation.

Where this leads

A deep model that matches PLS on the lab bench (or, as here, is narrowly edged out by it) is one thing; a model that must survive the jump from a 3 L development reactor to a 2,000 L manufacturing tank is another, and the calibration-transfer problem this chapter ended on is exactly that jump in miniature. The next chapter, Tech Transfer and Scale-Up: Models That Travel Between Scales, takes the question head-on: what does it mean for a learned model (a soft sensor, a process model, a control policy) to travel from the scale it was trained on to the scale it must run at, and which modeling choices make a model portable instead of instrument-bound.

