
Analytical Methods: Chemometrics, Deep Spectroscopy, and Automated Chromatograms

📍 Where we are: Part II · Discovery & Development, Learned — Chapter 8. The last chapter used Bayesian optimization to choose a process in a fraction of the runs a factorial grid needs. But every one of those runs only counts if you can measure what it produced. This chapter turns the learning lens on the analytical lab — the instruments that decide whether BATCH-2026-001 is 98.611 percent monomer or out of spec.

The analytical lab is where a process becomes a number. A bioreactor can run beautifully, but until an assay reports a titer, a purity, a glycan profile, the run is a rumor. And the analytical lab is the part of biomanufacturing that has been quietly doing machine learning the longest — for forty years, under a name that predates the hype: chemometrics. A Raman probe does not emit a glucose concentration; it emits a thousand intensity channels, and a model turns those channels into a number. That model is machine learning, whether or not anyone calls it that.

This chapter walks the lab's instruments through the learning lens and is careful about a distinction the field often blurs: the gap between the established chemometrics that runs in production today (PLS, PCA — linear, interpretable, validated) and the deep learning that the literature is excited about (CNNs, autoencoders, transformers on spectra) but that has barely reached the GMP lab. We build that contrast as runnable code — a 1D convolutional network against a PLS baseline on the same Raman spectra — and the result is the most honest thing in the chapter: on small, clean spectra, the deep model does not win.

The simple version

A wine taster does not run a mass spectrometer on every glass. They smell and taste a few cheap, fast signals and infer grape, region, and vintage from patterns learned over years. Chemometrics is that learned inference for chemistry: a spectrum (the "smell") goes in, a concentration or a quality verdict comes out. For most of the analytical lab, a simple, experienced taster — a linear PLS model — is hard to beat, because there are only a few hundred glasses to learn from. The deep-learning newcomer needs a cellar of millions before its extra cleverness pays off, and the lab rarely has one. The art is knowing which problem actually needs the newcomer.

What this chapter covers

We start with chemometrics proper — PLS and PCA, the linear latent-variable methods that are the analytical lab's production ML — then trace the move toward deep learning for spectroscopy (Raman, NIR, FTIR, mass spectrometry), being precise about where it helps and where it is a solution looking for a problem. We cover automated chromatogram and peak integration for HPLC, CE, and SEC, where the incumbent is deterministic and ML is arriving at the edges; glycan, charge-variant, and PTM characterization, the high-dimensional structural assays; and anomaly detection in analytical and stability data. We build a 1D-CNN versus PLS head-to-head on the simulator's Raman dataset, and we issue the correction that anchors the chapter: the celebrated Boehringer Ingelheim 16-attribute Raman work used k-nearest-neighbors, not deep learning, and citing it as evidence of a "deep-learning Raman wave" is wrong.

Chemometrics: the lab's production machine learning

Chemometrics is the application of statistical and mathematical methods to chemical data — and in the biopharma lab it means, overwhelmingly, two algorithms: Partial Least Squares (PLS) regression and Principal Component Analysis (PCA). Both are latent-variable methods. A spectrum is a wide, redundant object: a Raman scan in our dataset is 701 intensity channels, but those channels are massively correlated — neighboring wavenumbers move together, and the chemistry lives in a handful of underlying factors, not 701 independent dimensions. PCA finds the directions of greatest variance and projects the spectrum onto a few principal components; PLS finds the directions that best predict a target (titer, glucose, a charge-variant fraction) and regresses on those few latent variables.
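The difference between the two projections can be seen on a toy example. The sketch below is an illustration on synthetic spectra, not the series' dataset: the band positions, amplitudes, and the names `analyte` and `interferent` are all made up for the demonstration. It builds spectra in which a large interfering band dominates the variance while a small band carries the target, then compares the first PCA direction (greatest variance) with the first PLS weight vector (greatest covariance with the target, proportional to Xᵀy).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_channels = 120, 701
shift = np.linspace(400.0, 1800.0, n_channels)  # Raman shift axis, cm^-1


def band(center, width):
    """Gaussian band on the Raman shift axis."""
    return np.exp(-0.5 * ((shift - center) / width) ** 2)


# Two hidden factors drive every channel. The interferent is made exactly
# uncorrelated with the analyte so the comparison below is clean.
analyte = rng.normal(size=n_samples)
analyte -= analyte.mean()
interferent = rng.normal(size=n_samples)
interferent -= interferent.mean()
interferent -= analyte * (interferent @ analyte) / (analyte @ analyte)

spectra = (
    10.0 * np.outer(interferent, band(700.0, 60.0))    # large, irrelevant variance
    + 1.0 * np.outer(analyte, band(1200.0, 15.0))      # small, relevant variance
    + 0.01 * rng.normal(size=(n_samples, n_channels))  # detector noise
)
centered = spectra - spectra.mean(axis=0)

# PCA: the first loading is the top right singular vector (greatest variance).
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pca_scores = centered @ vt[0]

# PLS: the first weight vector is proportional to X^T y (greatest covariance with y).
pls_weight = centered.T @ analyte
pls_weight /= np.linalg.norm(pls_weight)
pls_scores = centered @ pls_weight


def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])


pca_corr = abs_corr(pca_scores, analyte)
pls_corr = abs_corr(pls_scores, analyte)
print(f"|corr(PC1 score, analyte)|  = {pca_corr:.3f}")
print(f"|corr(PLS1 score, analyte)| = {pls_corr:.3f}")
```

In this constructed case the first principal component tracks the interferent and is nearly blind to the analyte, while the first PLS component tracks the analyte. Real spectra are rarely this clean, but the asymmetry is the reason PCA is the monitoring tool and PLS the regression tool.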

This is not a footnote in the history of bioprocess ML; it is the production deployment. In-line Raman plus PLS for glucose, lactate, metabolites, and titer is a (production) PAT technology in commercial CHO culture, including closed-loop glucose control [1][2]. The multivariate statistical process control (MSPC) that watches whole batches, in Sartorius SIMCA / SIMCA-online and AspenTech ProMV, is PCA and PLS at its core, and it is (production) for continued process verification, golden-batch monitoring, and fault detection [3][4]. When this book's other volumes build a soft sensor, they build a PLS model: the data book's soft-sensor chapter and the open-source analytics chapter both center on PLSRegression. This chapter does not duplicate that; it asks the next question (can deep learning beat it?) and answers honestly.

Evidence

In-line Raman + PLS chemometric soft sensors for glucose/lactate/metabolites/titer, including closed-loop glucose control in CHO culture, are (production) and independently peer-reviewed [1][2]. MSPC with PCA/PLS (SIMCA, ProMV) is the (production) backbone of multivariate batch monitoring [3][4]. These are the strongest analytical-lab ML exemplars in the book, and they are linear methods, a fact the rest of the chapter keeps returning to.

Why PLS, specifically, dominates the lab

Ordinary least-squares regression falls apart on spectra: with 701 columns and a few hundred rows, the design matrix is rank-deficient and the fit is unstable. PLS sidesteps this by compressing the 701 channels into, say, six latent components chosen to co-vary with the target, then regressing on those six. The result is a model with hundreds of coefficients but only six real degrees of freedom — which is exactly why it survives the small-data regime the foundations chapter described. The preprocessing matters as much as the regression: a raw Raman spectrum carries a sloping fluorescence baseline and scatter effects that have nothing to do with concentration, so the standard pipeline applies a baseline correction (asymmetric least squares, polynomial, or a Savitzky-Golay derivative) and a scatter normalization (standard normal variate) before PLS ever sees the data. Chemometrics is half preprocessing and half regression, and the preprocessing is where decades of domain knowledge are encoded.
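The preprocessing half can be sketched in a few lines. The example below is a minimal illustration, assuming a plain polynomial baseline (one of the options named above; asymmetric least squares and Savitzky-Golay derivatives are not shown) followed by standard normal variate. The function names and the synthetic spectra are hypothetical, not taken from the series code.

```python
import numpy as np


def remove_polynomial_baseline(spectra, degree=2):
    """Subtract a least-squares polynomial fitted along the channel axis of each spectrum."""
    n_channels = spectra.shape[1]
    x = np.linspace(-1.0, 1.0, n_channels)   # scaled axis keeps the fit well conditioned
    design = np.vander(x, degree + 1)         # columns: x^degree ... x^0
    coeffs, *_ = np.linalg.lstsq(design, spectra.T, rcond=None)
    return spectra - (design @ coeffs).T


def standard_normal_variate(spectra):
    """Centre and scale each spectrum (row) by its own mean and standard deviation."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


# Synthetic check: one chemical peak seen three times, each time with a
# different multiplicative gain (scatter) and a different quadratic baseline.
x = np.linspace(-1.0, 1.0, 701)
peak = np.exp(-0.5 * ((x - 0.2) / 0.02) ** 2)
gains = np.array([[0.5], [1.0], [2.0]])
baselines = (
    np.array([[3.0], [5.0], [8.0]])
    + np.array([[1.0], [2.0], [4.0]]) * x
    + 0.5 * x**2
)
raw = gains * peak + baselines

processed = standard_normal_variate(remove_polynomial_baseline(raw))
print(np.abs(processed[0] - processed[2]).max())  # tiny: the three rows now coincide
```

After both steps the three spectra are numerically identical, which is the point: baseline and scatter differences that carry no concentration information are removed before the regression sees the data. A global polynomial also absorbs a little of the peak itself, one reason practitioners often prefer the asymmetric or derivative-based alternatives.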

Deep learning for spectroscopy: the wave, and the honest caveat

The research literature is full of deep learning on spectra — 1D convolutional neural networks (1D-CNNs) that learn spectral features instead of hand-designing the preprocessing, variational autoencoders (VAEs) for monitoring, transformers for mass spectra. The pitch is seductive: a CNN can learn its own baseline correction and feature extraction, so you skip the chemometric preprocessing craft and let the network find the signal. Amgen and others have benchmarked CNNs and VAEs against PLS for Raman CQA prediction, including charge-variant monitoring, at the (pilot) tier [5][6].

But here is the caveat the hype skips, and it is the spine of this chapter. Deep learning earns its keep when data is abundant and the function is complex and nonlinear. Spectroscopy in the GMP lab is the opposite: the relationship between a Raman spectrum and a concentration is close to linear (Beer-Lambert-like, in the working range), and the data is scarce — a calibration set is dozens to a few hundred spectra, not millions. A model with thousands of parameters has nothing to bite on; it overfits the calibration set and, worse, it is a black box a regulator must be persuaded to trust. The pattern across careful benchmarks is consistent: on small, clean spectral datasets, PLS is a baseline that deep learning ties at best and complicates always [7]. Deep learning's real openings are narrower — heavily nonlinear matrices, calibration transfer between instruments, or fusing spectra with images — not the everyday Raman-to-titer map.

The correction: the BI 16-attribute Raman work was KNN, not deep learning

One paper is cited so often, and so often miscited, that it deserves its own paragraph. Boehringer Ingelheim demonstrated computational Raman with robotic calibration predicting 16 quality attributes during Protein A chromatography in real time: a genuinely impressive (pilot) result, monitoring product quality roughly every 38 seconds [8]. It is repeatedly invoked as proof of a "deep-learning Raman wave." It is not a deep-learning paper. The model that performed best overall and on aggregates was k-nearest-neighbors (KNN), a classical, non-parametric, distance-based method with no neural network anywhere in it. (On the fragments metric a CNN and SVR slightly edged KNN, but KNN won on aggregates and on overall error, and KNN is the headline method.) Citing this work as evidence that deep learning has conquered Raman PAT is a factual error, and one this book will not make. If anything, the BI result reinforces the chapter's thesis: on real bioprocess spectra, a simple, classical learner beat the deep ones.

Evidence

The Boehringer Ingelheim 16-attribute real-time Raman work [8] is (pilot), peer-reviewed, and used KNN as its best overall model — not a CNN or any deep network. It is the strongest evidence in this chapter, and it is evidence for classical chemometrics, not against it. Deep-learning-versus-PLS Raman benchmarks (Amgen and others) are (pilot)/(research) and report deep learning matching, not decisively beating, PLS on charge-variant and CQA prediction [5][6][7].

A 1D-CNN versus PLS, on the same spectra

The cleanest way to make this argument is to run it. The series simulator emits an in-line Raman dataset, examples/datasets/raman_spectra.parquet: 336 hourly timepoints from the golden batch BATCH-2026-001, each a 701-channel spectrum (wn_400 through wn_1800, the Raman shift in cm⁻¹) paired with the kinetic-state reference labels — glucose, lactate, glutamine, VCD, and titer_g_L. We build two soft sensors for titer on identical train/test splits: a PLS regression with six latent components (the chemometric baseline), and a compact 1D-CNN that treats the spectrum as a one-channel signal and learns its own filters over wavenumber. Then we let them race.

The 1D-CNN is deliberately small — two convolution blocks and a tiny regression head — because the dataset is small. Even so it carries thousands of parameters where PLS carries a few hundred coefficients constrained to six degrees of freedom.

# examples/platform/ml/soft_sensor_deep.py: a compact 1D-CNN over the Raman spectrum.
import torch.nn as nn


class SpectraCNN(nn.Module):
    """Two conv blocks over wavenumber, then a small regression head."""

    def __init__(self, n_wavenumbers: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=15, padding=7), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(8, 16, kernel_size=11, padding=5), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),  # fixes the feature length at 8 for any input width
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 8, 32), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(32, 1),
        )

    def forward(self, x):  # x: (batch, 1, n_wavenumbers)
        return self.head(self.features(x))
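The parameter count this chapter reports for the network (5713) can be reproduced by hand from the layer shapes, without installing PyTorch. The helper names below are hypothetical; the arithmetic uses the standard convolution and linear-layer formulas (weights plus one bias per output).

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features


def conv1d_length(length, kernel_size, padding, stride=1):
    """Output length of a 1D convolution with the given padding and stride."""
    return (length + 2 * padding - kernel_size) // stride + 1


# Trace the sequence length through the feature extractor.
n_wavenumbers = 701
length = conv1d_length(n_wavenumbers, kernel_size=15, padding=7)  # "same" padding: 701
length = length // 4                                               # MaxPool1d(4): 175
length = conv1d_length(length, kernel_size=11, padding=5)          # "same" padding: 175
flattened = 16 * 8                                                 # AdaptiveAvgPool1d(8) output

layer_params = {
    "conv1": conv1d_params(1, 8, 15),
    "conv2": conv1d_params(8, 16, 11),
    "fc1": linear_params(flattened, 32),
    "fc2": linear_params(32, 1),
}
total = sum(layer_params.values())
pls_coefficients = n_wavenumbers + 1  # one coefficient per channel plus an intercept

print(layer_params)
print(total, pls_coefficients, round(total / pls_coefficients, 2))
```

Most of the parameters sit in the first dense layer rather than in the convolutions, so the adaptive pooling to length 8 is what keeps the network this small.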

The PLS side is the same chemometric pipeline the rest of the series uses — standardize the spectra so no single wavenumber dominates by magnitude, fit PLSRegression(n_components=6), score on the held-out slice — wrapped so the two models share one train_test_split(..., random_state=2026). The head-to-head driver runs both and prints the comparison:

# examples/platform/ml/soft_sensor_deep.py — the head-to-head, run with a fixed seed.
from soft_sensor_pls import train_pls # PLSRegression baseline

pls = train_pls() # 6-component PLS over 701 wavenumbers
cnn = train_cnn(epochs=300) # the SpectraCNN above
print(f" PLS : R2={pls['r2']} RMSE={pls['rmse_g_L']} g/L ({pls['n_params']} coefficients)")
print(f" 1D-CNN : R2={cnn['r2']} RMSE={cnn['rmse_g_L']} g/L ({cnn['n_params']} parameters)")
print(f" PLS uses {cnn['n_params'] / pls['n_params']:.0f}x fewer params and is not beaten on R2.")

Running python soft_sensor_deep.py prints (verbatim, SIM_SEED=2026, deterministic across runs):

Head-to-head: titer from 701-wavenumber Raman, golden batch BATCH-2026-001
PLS : R2=0.9923 RMSE=0.1498 g/L (702 coefficients)
1D-CNN : R2=0.9924 RMSE=0.1488 g/L (5713 parameters)
PLS uses 8x fewer params and is not beaten on R2.

Read that result the way an honest analyst should. The 1D-CNN, with 8x more parameters, lands at R² = 0.9924 against PLS's R² = 0.9923: a difference of one ten-thousandth, statistical noise on 101 test points. The RMSEs (0.1488 versus 0.1498 g/L) are indistinguishable. The deep model did not fail; it simply did not buy anything, while costing eight times the parameters, a GPU-shaped training loop, and a model far harder to validate and explain to a regulator. This is the small-data ceiling the learning-problem chapter named, made concrete on the simulator's spectra: when the relationship is near-linear and the data is scarce, the linear model is not a compromise; it is the right answer. (The numbers are suspiciously high because the simulator's spectra carry the titer signal cleanly; on real Raman both models drop, but the ranking, with PLS competitive with deep learning, is exactly what the peer-reviewed benchmarks report [7].)
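One way to check whether a gap between two models is inside the noise is a paired bootstrap over the held-out points. The sketch below is an assumption about how such a check could look, not something the series code is stated to do; the function name and the synthetic residuals are illustrative only. Because both models are scored on the same test points, each resample draws the same indices for both.

```python
import numpy as np


def rmse(errors):
    return float(np.sqrt(np.mean(np.square(errors))))


def paired_bootstrap_rmse_diff(errors_a, errors_b, n_resamples=2000, seed=0):
    """95% bootstrap interval for RMSE(a) - RMSE(b), resampling test points jointly."""
    errors_a = np.asarray(errors_a, dtype=float)
    errors_b = np.asarray(errors_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = errors_a.size
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # same resampled points for both models
        diffs[i] = rmse(errors_a[idx]) - rmse(errors_b[idx])
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(low), float(high)


# Synthetic residuals for 101 held-out points (illustrative, not the chapter's data).
rng = np.random.default_rng(1)
shared = rng.normal(0.0, 0.15, size=101)            # error both models have in common
model_a = shared + rng.normal(0.0, 0.01, size=101)  # two near-identical models
model_b = shared + rng.normal(0.0, 0.01, size=101)

interval = paired_bootstrap_rmse_diff(model_a, model_b)
print(f"95% interval for RMSE(a) - RMSE(b): {interval[0]:+.4f} to {interval[1]:+.4f} g/L")
```

If the interval straddles zero, the data cannot distinguish the two models, and the choice falls to parameter count and validation burden.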

[Figure: two soft sensors for titer from the same 701-channel Raman spectrum. Left, a PLS lane compressing the spectrum into six latent components and a linear regression (702 coefficients, six degrees of freedom); right, a 1D-CNN lane with two convolution-and-pool blocks and a small dense head (5713 parameters); both converge on a predicted-versus-measured scatter hugging the identity line. Result banner: PLS R² 0.9923, RMSE 0.1498 g/L; 1D-CNN R² 0.9924, RMSE 0.1488 g/L.]

The chapter's thesis as a picture: a PLS lane (six latent components, 702 coefficients) and a 1D-CNN lane (5713 parameters) both recover titer from the same Raman spectrum and land at the same held-out accuracy, so the deep model spends 8x the parameters for a difference lost in the noise, the small-data ceiling made visible. Original diagram by the authors, created with AI assistance.

Automated chromatograms and peak integration

Spectroscopy is the lab's continuous, in-line frontier; chromatography is its discrete, offline backbone. HPLC, SEC (size-exclusion, for monomer/aggregate), CEX (cation-exchange, for charge variants), and CE (capillary electrophoresis) all produce a chromatogram — a trace of detector signal against time, with peaks whose area is the measurement. BATCH-2026-001's release record is a stack of these: SEC reports 98.611 percent monomer, 1.287 percent HMW, 0.439 percent LMW; CEX reports 70.686 percent main, 21.551 percent acidic, 10.452 percent basic — every one of those numbers is a peak area, integrated from a curve.

Peak integration is the act of deciding where each peak starts, ends, and how the baseline runs beneath it — and it is shockingly consequential. Two analysts integrating the same wobbly baseline can disagree on whether a small shoulder is one peak or two, and a few percent of integrated area can move an attribute across a spec limit. The incumbent here is not ML: mainstream chromatography data systems — Waters Empower with its ApexTrack algorithm — use deterministic, signal-processing integration (apex detection, threshold-based baseline construction), and that determinism is a regulatory feature, not a bug, because the same chromatogram always integrates the same way [9]. Where ML is arriving is at the edges: deep-learning peak integration (a Merck KGaA / Bosch CNN architecture, (research)) aims to learn an analyst's judgment on hard, noisy, co-eluting peaks where the deterministic algorithm needs manual rework [10], and CDS vendors are adding ML-based anomaly flagging that calls a human to peaks worth reviewing rather than silently re-integrating them.
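How much an integration rule can move a result is easy to show on a synthetic trace. The sketch below is a deliberately naive threshold-based integrator written for illustration; it is not ApexTrack or any vendor algorithm, and the peak positions, heights, and thresholds are invented. Each contiguous run of signal above the threshold is treated as one peak, with a straight baseline drawn between the run's endpoints.

```python
import numpy as np


def integrate_peaks(time, signal, threshold):
    """Return percent area per peak; a peak is a contiguous run above threshold."""
    above = signal > threshold
    edges = np.diff(np.concatenate(([0], above.astype(int), [0])))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1) - 1  # inclusive end index of each run
    areas = []
    for start, stop in zip(starts, stops):
        if stop - start < 2:
            continue  # too short to integrate
        t = time[start:stop + 1]
        y = signal[start:stop + 1]
        baseline = np.interp(t, [t[0], t[-1]], [y[0], y[-1]])
        corrected = y - baseline
        # Trapezoid rule on the baseline-corrected trace.
        areas.append(float(np.sum((corrected[1:] + corrected[:-1]) * np.diff(t)) / 2.0))
    areas = np.array(areas)
    return 100.0 * areas / areas.sum()


# Synthetic SEC-like trace: a small early peak (HMW) and a large main peak.
time = np.linspace(0.0, 20.0, 2001)  # minutes
sigma = 0.2


def gaussian(center, height):
    return height * np.exp(-0.5 * ((time - center) / sigma) ** 2)


signal = gaussian(8.0, 0.02) + gaussian(10.0, 1.0)
true_hmw_percent = 100.0 * 0.02 / 1.02  # both peaks share one width

strict = integrate_peaks(time, signal, threshold=0.005)
loose = integrate_peaks(time, signal, threshold=0.001)
print(f"true HMW {true_hmw_percent:.2f}%, threshold 0.005 -> {strict[0]:.2f}%, "
      f"threshold 0.001 -> {loose[0]:.2f}%")
```

The same noise-free trace yields visibly different HMW percentages under the two thresholds, and both undershoot the true value because the small peak loses proportionally more of its tails. A fixed, documented rule at least makes that bias reproducible, which is the property the deterministic incumbents trade on.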

Evidence

Mainstream chromatography peak integration (Waters Empower ApexTrack) remains deterministic signal processing and is (production) — the determinism is what makes it auditable [9]. Deep-learning peak integration (Merck KGaA / Bosch universal CNN architecture) is (research) — demonstrated, not GMP-routine [10]. The honest reading: ML is augmenting chromatography review (flagging hard peaks for human eyes), not replacing the validated integrator that decides a release number.

The reason the deterministic incumbent is so sticky is the same theme again: a chromatographic release number feeds a batch-disposition decision, so the integration must be reproducible, explainable, and locked under change control — exactly the properties a deterministic algorithm has by construction and a learned one must earn. EU draft Annex 22 draws a sharp line excluding adaptive/generative AI from critical GMP tasks and requires locked models with a predetermined change control plan [11]; an integrator whose behavior shifts as it learns is precisely what that line is about.

Glycan, charge-variant, and PTM characterization

The hardest analytical assays are the structural ones, where a chromatogram or mass spectrum carries not one number but a whole profile of product variants. Glycosylation (the sugar chains decorating the antibody) is a CQA because it affects efficacy and half-life, and a glycan map is a forest of peaks. Charge variants (the acidic and basic species CEX resolves) and post-translational modifications (PTMs) like deamidation and oxidation are read by mass spectrometry, increasingly through the Multi-Attribute Method (MAM): LC-MS that reports many product-quality attributes from one run, with automated New Peak Detection (NPD) to catch any new species the process did not make before [12].

This is where the dimensionality genuinely invites learning. A mass spectrum is high-dimensional and a glycopeptide fragmentation pattern is complex, so deep learning has real (research)-tier traction: BERT-style models predicting glycopeptide tandem mass spectra to power glycoproteomics [13], and deep-learning-assisted glycan database search. But note the maturity gap. MAM's production New Peak Detection is largely algorithmic/feature-based, not deep learning, with ML being layered on rather than replacing the validated core [12]; the deep-learning glycan work lives in discovery and characterization, not in GMP lot release. The pattern holds: where data is rich and the structure is genuinely complex (mass spectra), deep learning has a foothold; where a number gates a release (the integrated charge-variant fraction on BATCH-2026-001's CofA), the validated classical method still rules.

Anomaly detection in analytical and stability data

Not every analytical learning task is a soft sensor. A large class is anomaly detection — flagging the assay result, the chromatogram, or the stability trend that does not look like the others, without a labeled "this is bad" training set, because real failures are too rare to learn from directly. This is unsupervised learning: fit a model of "normal" on the good runs, then score how far each new run sits from that normal. In MSPC this is exactly the Hotelling's T² and squared-prediction-error (SPE/Q) machinery from the open-source analytics chapter — a PCA model of normal batches, with a new batch's distance off the model plane flagging genuinely novel behavior. Newer (research) work pushes toward LSTM autoencoders and isolation forests for the same job [6].
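The two MSPC distances are short to compute once a PCA model of normal behavior exists. The sketch below is a minimal illustration on synthetic data; the class name `PcaMonitor` is hypothetical, and control limits (usually derived from F and chi-square approximations) are omitted. Hotelling's T² is the variance-scaled squared distance within the model plane, and SPE/Q is the squared length of the residual off it.

```python
import numpy as np


class PcaMonitor:
    """PCA model of 'normal' runs with Hotelling's T^2 and SPE/Q scoring."""

    def __init__(self, normal_data, n_components):
        self.mean = normal_data.mean(axis=0)
        centered = normal_data - self.mean
        _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
        self.loadings = vt[:n_components]  # shape (k, p), orthonormal rows
        self.score_variance = singular_values[:n_components] ** 2 / (len(normal_data) - 1)

    def distances(self, sample):
        centered = sample - self.mean
        scores = centered @ self.loadings.T
        t2 = float(np.sum(scores**2 / self.score_variance))  # distance inside the plane
        residual = centered - scores @ self.loadings
        spe = float(residual @ residual)                     # distance off the plane
        return t2, spe


# Synthetic 'normal' runs: two latent factors in 50 measured variables.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(50, 2)))
latent = rng.normal(size=(200, 2)) * np.array([3.0, 2.0])
normal_runs = latent @ basis.T + 0.05 * rng.normal(size=(200, 50))
monitor = PcaMonitor(normal_runs, n_components=2)

# An extreme but in-family sample: six standard deviations along the first component.
extreme = monitor.mean + 6.0 * np.sqrt(monitor.score_variance[0]) * monitor.loadings[0]

# A novel sample: displaced in a direction the model has never seen.
off_plane = rng.normal(size=50)
off_plane -= monitor.loadings.T @ (monitor.loadings @ off_plane)
off_plane /= np.linalg.norm(off_plane)
novel = monitor.mean + 3.0 * off_plane

t2_extreme, spe_extreme = monitor.distances(extreme)
t2_novel, spe_novel = monitor.distances(novel)
print(f"extreme: T2={t2_extreme:.2f} SPE={spe_extreme:.2e}")
print(f"novel:   T2={t2_novel:.2e} SPE={spe_novel:.2f}")
```

The extreme sample lights up T² and leaves SPE at zero; the novel sample does the reverse. Monitoring both is what separates "unusually large but familiar" from "not like anything in the calibration set".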

Stability and shelf-life data is a distinct, and instructive, sub-case. Predicting how a CQA will drift over months of storage is a small-data, extrapolation-heavy problem — exactly the regime where pure black-box ML is dangerous and a Bayesian hierarchical model (Merck & Co.'s HPV-vaccine shelf-life work, (research)) or a mechanistic kinetic model (Sanofi's degradation modeling) wins, because the structure carries the extrapolation the sparse data cannot [14]. It is the hybrid-modeling argument the whole book makes, applied to the timeline of a vial sitting in a fridge.
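The role of structure in extrapolation can be shown with the simplest kinetic assumption available. The sketch below assumes zero-order growth of an aggregate attribute (a straight line in time) and extrapolates to a specification limit. The function name, the time points, the rate, and the limit are all invented for illustration; the cited Bayesian hierarchical and kinetic models are considerably richer.

```python
import numpy as np


def shelf_life_months(months, values, spec_limit):
    """Fit value = intercept + rate * t and return the time at which the limit is reached."""
    rate, intercept = np.polyfit(months, values, deg=1)
    if rate <= 0:
        return float("inf")  # no upward drift, so the limit is never reached
    return float((spec_limit - intercept) / rate)


# Synthetic, noise-free stability pulls (illustrative numbers only).
months = np.array([0.0, 3.0, 6.0, 9.0, 12.0])
hmw_percent = 1.0 + 0.1 * months  # assumed zero-order growth of 0.1 % per month
estimate = shelf_life_months(months, hmw_percent, spec_limit=3.0)
print(f"projected time to reach the limit: {estimate:.1f} months")
```

Five points cannot teach a flexible model what happens at month 20, but a two-parameter kinetic form can carry the trend that far, provided the assumed mechanism actually holds.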

Anatomy of a chemometric soft-sensor prediction record

When the PLS Raman model fires on a live spectrum, it does not emit a bare 3.8. Like every artifact in this series, the value is only as trustworthy as what travels with it — and for a chemometric prediction that means the spectrum, the preprocessing, the latent projection, the uncertainty, and the eventual reference assay that will grade it. Unpack one prediction the way a reviewer would.

[Figure: identity card for one chemometric soft-sensor prediction record from the model raman_titer_pls v6c, bound to its dataset hash. Input rows: timestamp, raw 701-channel input spectrum, baseline- and scatter-corrected preprocessed spectrum, six PLS latent-variable scores, frozen regression coefficients. Green core: predicted titer with Hotelling's T² and SPE/Q distances. Reconciliation rows: delayed offline reference assay, residual, input-quality status. Violet relationships panel: training calibration set, reference method, MSPC chart fed, re-validation trigger.]

One chemometric prediction is a whole record: the spectrum that fed it, the preprocessing and six latent scores that transformed it, the estimate paired with its two MSPC distances (Hotelling's T² for in-model extremeness, SPE/Q for off-model novelty), the input-quality status that distrusts a fouled probe, and the delayed reference assay that will eventually reveal the residual. Original diagram by the authors, created with AI assistance.

Read top to bottom. The input rows are the cheap, fast signal: a timestamp, the raw input_spectrum of 701 intensities, the preprocessed (baseline- and scatter-corrected) version, the six latent_scores PLS extracts, and the frozen coefficients that map them to titer. The green core is the prediction with two distances that a bare soft sensor lacks: titer_predicted_g_L, the Hotelling's T² that measures how far this spectrum sits from the calibration cloud inside the model (an extreme but recognizable sample), and the SPE/Q that measures distance off the model plane (a genuinely novel spectrum the calibration never saw — a fouled window, a new contaminant). T² and SPE together are what let a chemometric model say not just "the titer is 3.8" but "and this spectrum looks like the ones I was trained on" — the self-distrust that the data book argued a soft sensor must carry. The reconciliation rows hold the offline reference_value, the residual, and an input-quality status; the violet relationships panel records the calibration set the model trained_on, the reference method it is validated_against, the MSPC chart it feeds, and the revalidates_when trigger. The card is the difference between a number and a governed analytical result — the same governance the open-source soft-sensor record and ICH Q14's model-lifecycle expectations demand.
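As a data structure, the record might look like the sketch below. Field names that appear in the text (input_spectrum, preprocessed, latent_scores, coefficients, titer_predicted_g_L, reference_value, residual, trained_on, validated_against, revalidates_when) are kept; everything else, including the class name, `hotelling_t2`, `spe_q`, `input_quality`, the limit values, the reference value, and the sign convention for the residual, is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence


@dataclass
class SoftSensorPrediction:
    model_id: str
    timestamp: datetime
    input_spectrum: Sequence[float]
    preprocessed: Sequence[float]
    latent_scores: Sequence[float]
    coefficients: Sequence[float]
    titer_predicted_g_L: float
    hotelling_t2: float                      # in-model distance (assumed field name)
    spe_q: float                             # off-model distance (assumed field name)
    input_quality: str = "ok"                # e.g. "ok" or "fouled_probe" (assumed values)
    reference_value: Optional[float] = None  # filled in when the offline assay arrives
    residual: Optional[float] = None
    trained_on: str = ""
    validated_against: str = ""
    revalidates_when: str = ""

    def is_in_family(self, t2_limit: float, spe_limit: float) -> bool:
        """True only if the probe is healthy and both distances are within limits."""
        return (
            self.input_quality == "ok"
            and self.hotelling_t2 <= t2_limit
            and self.spe_q <= spe_limit
        )

    def reconcile(self, reference_value: float) -> float:
        """Attach the delayed reference assay; residual is reference minus prediction."""
        self.reference_value = reference_value
        self.residual = reference_value - self.titer_predicted_g_L
        return self.residual


record = SoftSensorPrediction(
    model_id="raman_titer_pls",
    timestamp=datetime(2026, 1, 1, tzinfo=timezone.utc),
    input_spectrum=[0.0] * 701,
    preprocessed=[0.0] * 701,
    latent_scores=[0.0] * 6,
    coefficients=[0.0] * 701,
    titer_predicted_g_L=3.8,
    hotelling_t2=4.2,
    spe_q=0.1,
)
trusted = record.is_in_family(t2_limit=12.0, spe_limit=0.5)  # hypothetical limits
residual = record.reconcile(3.9)                             # hypothetical assay value
print(trusted, round(residual, 3))
```

Keeping the trust check and the reconciliation on the record itself means a downstream consumer cannot read the titer without also having the distances and the input-quality status in hand.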

The unsolved part: calibration transfer and the cost of being instrument-bound

The honest open problem in analytical ML is not accuracy — it is portability. A chemometric model is bound to the exact spectrometer, probe, fiber, and process conditions it was calibrated on. Swap the probe, move from a 3 L development reactor to a 2,000 L manufacturing tank, or let a window foul, and a model that scored R² = 0.99 can degrade past its acceptance criterion without a single line of code changing. The literature is blunt about the magnitude: in a controlled study, two probes of the same Raman analyzer in the same culture at the same time disagreed by roughly 20 percent on cell density purely from instrument-to-instrument differences, and a calibration-transfer step was needed to halve that [15].

This is why calibration transfer — methods like piecewise direct standardization that map one instrument's spectra onto another's — is an active, unsolved research front, and why a soft sensor moved to new hardware is, for regulatory purposes, a new analytical procedure until it is re-qualified. It is also, paradoxically, one of the places deep learning may genuinely help: a network that learns instrument-invariant features could transfer where a PLS calibration cannot. But that promise is (research), not (production), and it runs straight into the validation wall — a model that adapts to a new instrument is, by EU draft Annex 22's logic, exactly the kind of adaptive system excluded from critical GMP tasks [11]. The unsolved part is therefore a genuine tension: the technical fix (a model that travels) and the regulatory posture (a model that is locked) point in opposite directions, and no one has fully reconciled them.
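The mechanics of a standardization step can be sketched with direct standardization, the simpler global relative of the piecewise method named above; this is a stand-in, not piecewise direct standardization itself. A small set of transfer samples measured on both instruments is used to fit a matrix F so that spectra from the secondary instrument, multiplied by F, resemble what the primary would have recorded. The function name, the rank-limited pseudo-inverse, and the synthetic instruments (a 6 cm⁻¹ shift and a 20 percent gain loss) are all assumptions for the demonstration.

```python
import numpy as np


def fit_direct_standardization(secondary, primary, n_factors):
    """Return F with secondary @ F ~ primary, via a rank-limited pseudo-inverse."""
    u, s, vt = np.linalg.svd(secondary, full_matrices=False)
    u, s, vt = u[:, :n_factors], s[:n_factors], vt[:n_factors]
    return vt.T @ ((u.T @ primary) / s[:, None])


rng = np.random.default_rng(0)
shift = np.linspace(400.0, 1800.0, 701)


def pure_spectra(offset, gain):
    """Three Gaussian bands; offset and gain mimic instrument differences."""
    centers = np.array([600.0, 1000.0, 1400.0]) + offset
    return gain * np.exp(-0.5 * ((shift[None, :] - centers[:, None]) / 20.0) ** 2)


def measure(concentrations, pure):
    noise = 1e-3 * rng.normal(size=(concentrations.shape[0], shift.size))
    return concentrations @ pure + noise


primary_pure = pure_spectra(offset=0.0, gain=1.0)
secondary_pure = pure_spectra(offset=6.0, gain=0.8)

# Ten transfer samples measured on both instruments.
transfer_conc = rng.uniform(0.5, 2.0, size=(10, 3))
F = fit_direct_standardization(
    measure(transfer_conc, secondary_pure),
    measure(transfer_conc, primary_pure),
    n_factors=3,
)

# New samples, never seen during the transfer fit.
new_conc = rng.uniform(0.5, 2.0, size=(25, 3))
new_primary = measure(new_conc, primary_pure)
new_secondary = measure(new_conc, secondary_pure)


def rms(a):
    return float(np.sqrt(np.mean(a**2)))


error_before = rms(new_secondary - new_primary)
error_after = rms(new_secondary @ F - new_primary)
print(f"RMS spectral mismatch before: {error_before:.4f}, after: {error_after:.4f}")
```

The correction works here because the new samples lie in the same three-factor space as the transfer set. A new interferent, a fouled window, or a nonlinear instrument difference breaks that assumption, which is where the open research problem begins.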

What this chapter adds to the model suite

This chapter contributes the head-to-head soft-sensor pair to examples/platform/ml/:

  • soft_sensor_pls.py — the chemometric baseline: a six-component PLSRegression over the 701-wavenumber Raman dataset, reporting R² = 0.9923 and RMSE = 0.1498 g/L on a held-out slice, with the same train_test_split(random_state=2026) the series uses so the comparison is fair.
  • soft_sensor_deep.py — the SpectraCNN, a compact PyTorch 1D-CNN, plus the driver that runs both models on identical splits and prints the head-to-head. It reports R² = 0.9924 and RMSE = 0.1488 g/L from 5,713 parameters — 8x more than PLS, for a difference lost in the noise.

Together they are the runnable proof of the chapter's thesis, and they are deliberately small and dependency-light (scikit-learn plus PyTorch over the committed parquet) so the result is reproducible on a laptop with no GPU. They also seed the production-bioreactor chapter, where the same PLS soft sensor moves from a lab benchmark to a live, closed-loop control element.

Why it matters

The analytical lab is where biomanufacturing's ML is both oldest and most over-claimed. Oldest, because chemometrics — PLS and PCA — has run in production for forty years and is the genuine, validated, (production) success story of bioprocess learning. Over-claimed, because the deep-learning literature implies a wave that has not reached the GMP lab, and miscitations (the BI work as "deep learning") prop up that impression. Getting this chapter right means holding two truths at once: machine learning in the analytical lab is real and deployed — and it is mostly linear. A team that reaches for a transformer to read a Raman spectrum, when a six-component PLS ties it on a fraction of the parameters and a fraction of the validation burden, has mistaken novelty for progress. The honest verdict from the runnable code is the whole lesson: on small, clean spectra, the simple model wins, and knowing which analytical problem actually needs the deep one is the real expertise.

In the real world

The (production) analytical ML you would actually find in a commercial mAb lab clusters tightly. In-line Raman + PLS soft sensors run on process analyzers (Endress+Hauser Kaiser Raman Rxn heads) for glucose, lactate, and titer, with closed-loop glucose control the most mature deployment [1][2]. MSPC runs on Sartorius SIMCA / SIMCA-online and AspenTech ProMV for batch monitoring and fault detection [3][4]. Chromatography data systems (Waters Empower) integrate peaks deterministically and are adding ML anomaly flagging at the edges [9]. MAM (Thermo Chromeleon + BioPharma Finder, Genedata, Protein Metrics) does automated new-peak detection, largely algorithmically, with ML being layered on [12].

The (pilot)/(research) frontier is where the excitement lives and the maturity drops: the Boehringer Ingelheim 16-attribute real-time Raman work (KNN, (pilot)) [8]; CNN/VAE-versus-PLS benchmarks for charge variants (Amgen and others, (pilot)) [5][6]; deep-learning peak integration (Merck KGaA / Bosch, (research)) [10]; BERT-style glycopeptide MS prediction (research) [13]; and Bayesian/kinetic stability prediction (Merck & Co., Sanofi, (research)) [14]. The (production)-versus-frontier line maps almost perfectly onto the linear-versus-deep line — which is the single most useful thing to know walking into an analytical-ML vendor meeting.

Key terms

  • Chemometrics — statistical/mathematical methods applied to chemical data; in biopharma, overwhelmingly PLS and PCA. The analytical lab's production machine learning.
  • PLS (Partial Least Squares) — supervised latent-variable regression that compresses a wide, collinear spectrum into a few components chosen to predict the target; the chemometrics workhorse.
  • PCA (Principal Component Analysis) — unsupervised latent-variable method finding directions of greatest variance; the core of MSPC batch monitoring.
  • 1D-CNN — a one-dimensional convolutional neural network that treats a spectrum as a signal and learns its own filters; the deep-learning challenger that, on small clean spectra, ties PLS for far more parameters.
  • MSPC (multivariate statistical process control) — PCA/PLS monitoring of whole batches via Hotelling's T² (distance inside the model) and SPE/Q (distance off it); (production) in SIMCA and ProMV.
  • Hotelling's T² / SPE (Q) — the two MSPC distances: T² flags an extreme but recognizable sample; SPE/Q flags a genuinely novel one the calibration never saw.
  • Peak integration — deciding where a chromatographic peak starts, ends, and how the baseline runs beneath it; the act that turns a curve into a release number. Deterministic in production (Empower ApexTrack); ML at the edges.
  • MAM (Multi-Attribute Method) — LC-MS reporting many product-quality attributes from one run, with automated New Peak Detection; production NPD is largely algorithmic, with ML layered on.
  • Calibration transfer — mapping a chemometric model from one instrument to another; the unsolved portability problem, and the reason a model on new hardware is a new analytical procedure.
  • Glycan / charge variant / PTM — structural CQAs (sugar chains, acidic/basic species, deamidation/oxidation) read by chromatography and mass spectrometry; where deep learning has a research foothold.
  • KNN (k-nearest-neighbors) — the classical, non-parametric method that was the best model in the Boehringer Ingelheim 16-attribute Raman work — which is therefore not a deep-learning result.

Where this leads

A model that ties PLS on the lab bench is one thing; a model that must survive the jump from a 3 L development reactor to a 2,000 L manufacturing tank is another, and the calibration-transfer problem this chapter ended on is exactly that jump in miniature. The next chapter, Tech Transfer and Scale-Up: Models That Travel Between Scales, takes the question head-on: what does it mean for a learned model — a soft sensor, a process model, a control policy — to travel from the scale it was trained on to the scale it must run at, and which modeling choices make a model portable instead of instrument-bound.