
Capture Chromatography: Hybrid Models and Real-Time Pooling

📍 Where we are: Part IV · Downstream, Learned — Chapter 13. The last chapter predicted the harvest endpoint and handed us clarified harvest — a few hundred litres of dilute, antibody-bearing fluid still carrying host-cell protein, DNA, and media. Now that fluid meets its first column. Capture is where the product is concentrated a hundredfold and most of the impurities fall away, and it is also the one downstream step where computational models are already routine in production — though, as we will see, the mature ones are mechanistic, not machine-learned.

Clarified harvest from CLAR-001 flows onto a Protein A column, and in a single bind-and-elute cycle the antibody is pulled out of a dilute, complex broth and released as a small, concentrated, low-pH pool — the capture pool PApool-001 that the rest of downstream will polish. This is the most consequential single step in purification: it sets the impurity floor everything else inherits, and the decision of exactly which slice of the elution peak to keep — the pooling decision — trades yield against purity in real time, on a signal that is changing second by second.

It is tempting to assume that because chromatography is physics (adsorption, mass transfer, fluid flowing through a packed bed), the learning here would be the showcase of the book. It is not. The showcase is mechanistic modeling: first-principles column simulators that are genuinely deployed in commercial process development. Machine learning sits beside them, doing the jobs mechanistic models do poorly: reading a messy in-line trace, timing a pool, predicting a yield from cheap online signals. This chapter draws that line carefully, then builds the ML layer on a simulated Protein A chromatogram.

The simple version

Imagine pouring muddy water through a sponge that grabs only the gold flecks. First you load — the water runs through, the sponge fills with gold, and the moment it starts to overflow (gold escaping out the bottom) is the breakthrough. Then you rinse, then you squeeze the sponge with a special solvent so the gold pours out in a sharp, concentrated stream. Catching exactly the right part of that stream — start the bucket too early and you catch rinse water, stop too late and you catch junk — is the pooling decision. A mechanistic model is a physics simulation of how the sponge fills and releases; the machine-learning layer is the operator watching the live stream, calling out "phase change," "start collecting now," "stop now," and "you'll recover about 43 grams."

What this chapter covers

We start by being honest about the landscape: mechanistic chromatography modeling — the general rate model plus steric mass action, as shipped in Cytiva's GoSilico and the open-source CADET — is the mature, production-grade computational tool here, and it is not ML. We then place learning where it genuinely helps: hybrid mechanistic-plus-ML models that let data fill the parameters physics cannot pin down; breakthrough and dynamic-binding-capacity (DBC) prediction that decides how much to load; real-time pooling from in-line UV and conductivity; and phase classification and yield prediction for automated chromatogram review. The runnable artifact, examples/platform/ml/chromatography.py, trains a phase classifier and a pooling-plus-recovery model on the simulator's Protein A capture chromatogram — examples/datasets/protein_a_chromatogram.csv — and we dissect one pooling decision end to end.

The capture step, as a set of learnable decisions

A Protein A cycle is a fixed sequence of phases, each with a different job, and each a different shape on the UV-conductivity-pH trace. In our simulated cycle they are equilibration (3 column volumes, condition the resin), load (8 CV, the clarified harvest flows on and the antibody binds), wash (4 CV, rinse loosely-bound impurity away), elution (5 CV, drop the pH to about 3.3 and the antibody releases as a sharp peak), strip (3 CV, low-pH clean), and CIP (3 CV, caustic clean-in-place). At 0.5 CV per minute and one sample per second, that is 3120 samples of a four-channel signal per cycle.
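The cycle arithmetic above is easy to check by hand, and a few lines of Python make it explicit. This is a minimal sketch that uses only the phase lengths, flow rate, and sampling rate stated in the paragraph; the dictionary layout and the helper name `phase_boundaries` are illustrative assumptions, not part of the chapter's module.

```python
# Phase lengths in column volumes (CV), as listed in the paragraph above.
PHASES_CV = {
    "Equilibration": 3.0,
    "Load": 8.0,
    "Wash": 4.0,
    "Elution": 5.0,
    "Strip": 3.0,
    "CIP": 3.0,
}
FLOW_CV_PER_MIN = 0.5   # stated flow rate
SAMPLE_RATE_HZ = 1.0    # one sample per second


def phase_boundaries(phases_cv):
    """Return {phase: (start_cv, end_cv)} in cycle order (hypothetical helper)."""
    bounds, start = {}, 0.0
    for name, length in phases_cv.items():
        bounds[name] = (start, start + length)
        start += length
    return bounds


total_cv = sum(PHASES_CV.values())
duration_s = total_cv / FLOW_CV_PER_MIN * 60.0
n_samples = int(round(duration_s * SAMPLE_RATE_HZ))
bounds = phase_boundaries(PHASES_CV)

print(f"total {total_cv:.0f} CV, {duration_s / 60:.0f} min, {n_samples} samples")
# total 26 CV, 52 min, 3120 samples
print(f"elution spans {bounds['Elution'][0]:.1f}-{bounds['Elution'][1]:.1f} CV")
# elution spans 15.0-20.0 CV
```

The elution phase therefore opens at 15.0 CV, which is where the pool start reported later in the chapter falls.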

Inside that sequence sit three decisions that learning can sharpen:

  • How much to load. Load too little and you waste column capacity and run more cycles than you need; load too much and the column saturates, antibody breaks through the bottom unbound, and you lose product. The limit is the dynamic binding capacity — the mass the column can hold at the operating flow rate before breakthrough — and it is not a constant: it drifts down as the resin ages over its cycle life.
  • When each phase begins and ends. Automated review and any real-time logic needs to know which phase the trace is in right now, from the signal alone — a classification problem.
  • Exactly which slice of the elution peak to keep. The pooling decision: pick a start and a stop on the rising and falling edges of the product peak. This is where yield meets purity, and it is made live, on a moving signal.

Each of these is a place where a model reads a cheap online signal and produces a decision or a number. Each is also, crucially, a place where the consequence is bounded and reviewable — you can audit the pool a model chose against the assay of what it actually collected — which is exactly why downstream ML has moved into plants faster than the autonomous-control fantasies upstream.

The mature tool is mechanistic, not ML — say so plainly

Before any learning, the strongest computational model of a chromatography column is a set of partial differential equations. The general rate model describes the antibody's convection and dispersion down the bed, its diffusion into the porous resin beads, and its binding kinetics; the steric mass action (SMA) isotherm describes how the antibody competes with salt counter-ions for binding sites during a salt or pH gradient. Calibrated against a handful of small-scale runs, such a model becomes a mechanistic digital twin that predicts elution profiles, breakthrough curves, and pooling windows across conditions it was never run at. Because it encodes real physics, it extrapolates far more dependably than a purely fitted model, though only as far as its assumptions and calibration hold [1].
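To make the isotherm half of that description concrete, here is a minimal sketch of the single-protein SMA equilibrium, solved by bisection. It assumes the standard equilibrium form, in which the bound concentration q satisfies q = k_eq * c * ((Lambda - (nu + sigma) * q) / c_salt)^nu. The function name and every parameter value are illustrative assumptions, not calibrated values and not the GoSilico or CADET implementation; a full column model would couple this to the transport equations.

```python
def sma_equilibrium_q(c_protein, c_salt, k_eq, nu, sigma, ionic_capacity, iterations=200):
    """Bound protein concentration q at SMA equilibrium for one protein.

    Solves q = k_eq * c_protein * ((ionic_capacity - (nu + sigma) * q) / c_salt) ** nu
    by bisection. The bracketed term is the salt-occupied capacity that is
    neither exchanged (nu) nor sterically shielded (sigma) by bound protein.
    """
    if c_protein <= 0.0:
        return 0.0

    def residual(q):
        # Clamp at zero so rounding can never raise a negative base to a power.
        free_sites = max(ionic_capacity - (nu + sigma) * q, 0.0)
        return k_eq * c_protein * (free_sites / c_salt) ** nu - q

    # residual is positive at q_lo, negative at q_hi, and decreasing in between.
    q_lo, q_hi = 0.0, ionic_capacity / (nu + sigma)
    for _ in range(iterations):
        q_mid = 0.5 * (q_lo + q_hi)
        if residual(q_mid) > 0.0:
            q_lo = q_mid
        else:
            q_hi = q_mid
    return 0.5 * (q_lo + q_hi)


# Illustrative, uncalibrated parameters (assumptions, not values from the chapter).
PARAMS = dict(k_eq=1e-2, nu=5.0, sigma=50.0, ionic_capacity=1200.0)

q_binding = sma_equilibrium_q(c_protein=1e-3, c_salt=50.0, **PARAMS)
q_eluting = sma_equilibrium_q(c_protein=1e-3, c_salt=500.0, **PARAMS)
print(f"bound at low salt: {q_binding:.3f}, bound at high salt: {q_eluting:.6f}")
```

Raising the salt tenfold collapses the bound amount by several orders of magnitude in this toy setting, which is the competition for sites the isotherm is meant to capture.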

This is a production technology. The canonical commercial implementation is GoSilico (ChromX/DSPX), acquired by Cytiva in 2021; the open-source CADET from Forschungszentrum Jülich is the academic and increasingly industrial workhorse [1][2]. Industrial users who have disclosed their work have built mechanistic ion-exchange and mixed-mode models for real molecules. It is essential to attribute this correctly: GoSilico and CADET are mechanistic, not machine learning. Treating them as "AI chromatography" is a category error that recurs in vendor copy and that this book refuses to repeat. The math is solved physics, not a fitted black box.

Evidence

Mechanistic chromatography modeling (general rate model + SMA) is the most mature deployed computational technique in downstream processing and is mechanistic, not ML — Cytiva GoSilico (production, in CMC process development) and open-source CADET are the exemplars [1][2] (peer-reviewed and vendor documentation). Any vendor headline number — for example "+5 percentage points yield" or per-molecule cost savings — is vendor-self-reported and must carry that label; the modeling capability is well established, the specific savings are not independently audited.

Where learning actually helps: the hybrid model

A mechanistic model is only as good as its parameters, and some of those parameters are genuinely hard to measure: the precise binding kinetics of this antibody on this resin lot, the way capacity fades with resin age, the messy dependence of the isotherm on a feed whose composition shifts run to run. This is exactly the gap Book 2's hybrid-modeling chapter identified, and the remedy is the one the Book 5 hybrid chapter treats as the dominant paradigm: keep the mechanistic backbone that you trust, and use a small machine-learning component only for the part the physics cannot write down.

In capture, the hybrid takes a few shapes. A serial hybrid lets a neural network estimate hard-to-measure isotherm or kinetic parameters from feed properties, then feeds those into the mechanistic simulator — the physics still does the simulation, the data just supplies its inputs. A parallel hybrid runs the mechanistic model and adds a learned residual that corrects its systematic errors. Either way the division of labour is the small-data win: the network has far less to learn because the physics carries the trend, so it can succeed on the handful of runs a downstream campaign actually produces [3].
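The two shapes can be contrasted in a deliberately tiny sketch. Everything here is an assumption made for illustration: a Langmuir expression stands in for the mechanistic simulator, a least-squares line stands in for the learned component (the chapter describes a neural network), and the "observations" are generated from a made-up linear capacity fade. Because the toy truth is linear by construction, both hybrids recover it exactly; real data would not be this kind.

```python
def langmuir_bound(c_feed, q_max, k_affinity):
    """Mechanistic stand-in: Langmuir equilibrium loading."""
    return q_max * k_affinity * c_feed / (1.0 + k_affinity * c_feed)


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; the stand-in for the learned part."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


K_AFFINITY, C_FEED, Q_MAX_FRESH = 2.0, 1.0, 60.0  # made-up toy constants


def true_bound(cycles):
    """Toy ground truth: capacity fades linearly with resin cycle count."""
    return langmuir_bound(C_FEED, Q_MAX_FRESH - 0.05 * cycles, K_AFFINITY)


train_cycles = [0.0, 50.0, 100.0, 150.0]
observed = [true_bound(n) for n in train_cycles]

# Serial hybrid: learn the hard-to-measure parameter, then let physics simulate.
scale = (1.0 + K_AFFINITY * C_FEED) / (K_AFFINITY * C_FEED)
a_q, b_q = fit_line(train_cycles, [obs * scale for obs in observed])


def serial_hybrid(cycles):
    return langmuir_bound(C_FEED, a_q + b_q * cycles, K_AFFINITY)


# Parallel hybrid: keep the fresh-resin model fixed and learn its residual.
mech_fresh = langmuir_bound(C_FEED, Q_MAX_FRESH, K_AFFINITY)
a_r, b_r = fit_line(train_cycles, [obs - mech_fresh for obs in observed])


def parallel_hybrid(cycles):
    return mech_fresh + a_r + b_r * cycles


truth = true_bound(200.0)
for name, pred in [("mechanistic only", mech_fresh),
                   ("serial hybrid", serial_hybrid(200.0)),
                   ("parallel hybrid", parallel_hybrid(200.0))]:
    print(f"{name:>16}: {pred:.2f} (truth {truth:.2f})")
```

In both variants the learned piece only has to fit a two-parameter line from four runs, which is the small-data advantage the paragraph describes.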

The evidence here is real but should be read with care. A peer-reviewed pilot study optimized a commercial anion-exchange polishing step by screening 30 input factors against quality and yield across 400-plus commercial lots, refining the screen with mechanistic models, then running tens of thousands of in silico optimizations — reporting roughly 12 percent higher yield and about a third lower high-molecular-weight impurity [4]. That is genuine hybrid work, but the improvement figures are self-reported by the authoring manufacturer on its own single-step process via in-silico optimization, not closed-loop control, and have not been independently reproduced. Likewise, physics-informed neural networks (PINNs) have been used to accelerate mechanistic models enough for real-time model-predictive control of continuous multi-column capture — cutting an offline fit from roughly 2600 seconds to about 110, and online evaluation to a handful of seconds — which is a research result on a hard problem, not a deployed plant control loop [5].

Breakthrough and dynamic binding capacity: deciding how much to load

The load decision turns on the breakthrough curve — the rising UV signal at the outlet of the column as the resin fills and antibody begins to escape unbound. The dynamic binding capacity is conventionally read at a fixed breakthrough level (often 10 percent of the feed concentration, "DBC10"), and the loading target is set to a safe fraction below it so that no product is lost. The trouble is that DBC is not fixed: it falls as the resin ages across its validated cycle life, and it shifts with flow rate, temperature, and feed titer.

This is a clean prediction task. Given the resin's cycle count, the load flow rate, the feed titer (itself often a soft-sensed quantity from the bioreactor), and a few early points of the live breakthrough trace, a model can predict where breakthrough will cross the threshold and recommend a load volume. Two flavours appear in the literature. The first is mechanistic-first: fit the general rate model and read DBC off the simulated breakthrough curve — accurate, and the production default. The second is data-driven monitoring: track features of the chromatographic profile across cycles to detect resin aging before yield visibly drops. On-line PAT with PCA and batch-level modeling has been shown in a pilot to detect Protein A resin aging some 20-25 cycles before observable yield decline, with a proposed strategy to extend resin life — a modeled benefit, not yet a validated GMP outcome [6]. The honest framing is that DBC prediction is where mechanistic and ML approaches cooperate: physics for the curve, data for the slow drift the physics does not model.
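Reading DBC10 off a breakthrough curve is mostly interpolation plus one mass-balance line. The sketch below assumes the conventional estimate (feed concentration times the volume loaded at 10 percent breakthrough, less the hold-up volume, per unit resin volume) and runs it on a synthetic logistic curve. The curve, the feed titer, the hold-up volume, and the 80 percent safety factor are all made-up numbers for illustration, and the function names are hypothetical.

```python
import math


def volume_at_breakthrough(volumes_ml, c_over_c0, level=0.10):
    """Linearly interpolate the load volume where outlet c/c0 first reaches `level`."""
    for i in range(1, len(volumes_ml)):
        y0, y1 = c_over_c0[i - 1], c_over_c0[i]
        if y0 < level <= y1:
            v0, v1 = volumes_ml[i - 1], volumes_ml[i]
            return v0 + (level - y0) * (v1 - v0) / (y1 - y0)
    return None  # breakthrough level never reached in this trace


def dbc_at_level(v_break_ml, feed_mg_per_ml, holdup_ml, column_ml):
    """Conventional DBC estimate: feed mass applied up to breakthrough, per mL resin."""
    return feed_mg_per_ml * (v_break_ml - holdup_ml) / column_ml


# Synthetic logistic breakthrough curve (all numbers are made up).
V50_ML, WIDTH_ML = 12000.0, 500.0
volumes = [10.0 * i for i in range(1501)]  # 0 to 15 L in 10 mL steps
curve = [1.0 / (1.0 + math.exp(-(v - V50_ML) / WIDTH_ML)) for v in volumes]

v10 = volume_at_breakthrough(volumes, curve, level=0.10)
dbc10 = dbc_at_level(v10, feed_mg_per_ml=5.0, holdup_ml=1000.0, column_ml=1000.0)
load_target = 0.8 * dbc10  # hypothetical 80 percent safety factor

print(f"V10 = {v10:.0f} mL, DBC10 = {dbc10:.1f} mg/mL resin, load target = {load_target:.1f}")
```

A drift monitor would repeat this calculation cycle after cycle and track the trend in `dbc10`, rather than trusting a single fresh-resin value.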

Real-time pooling: a guard band on a moving signal

The pooling decision is the one most people imagine when they picture "AI controlling a column," so it is worth being precise about what is actually deployed. Real-time pooling is, at its core, a thresholding rule on a chromatographic signal — collect the eluate while a monitored signal stays inside a guard band, and divert it otherwise. In the simplest and most common case the signal is UV280 and the rule is "start collecting when UV rises above a cutoff on the leading edge, stop when it falls back below on the trailing edge." More sophisticated versions add conductivity or pH to catch the boundary between product and a salt or pH front, or pool to a target mass rather than a fixed window.
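The simplest form of that rule fits in a short streaming function. This sketch assumes samples arrive one at a time as (volume in CV, UV280 in mAU, in-elution flag) tuples and collects the first contiguous run that sits at or above the cutoff. The function name, the tuple layout, and the synthetic triangular peak are illustrative assumptions; the 100 mAU cutoff is the one the chapter uses.

```python
def pool_window(samples, threshold_mau=100.0):
    """Return (start_cv, stop_cv) of the first contiguous run inside the guard band.

    `samples` yields (volume_cv, uv_mau, in_elution). Collection is gated on the
    phase flag, so a UV bump outside elution never opens the pool.
    """
    start = stop = None
    for volume_cv, uv_mau, in_elution in samples:
        inside = in_elution and uv_mau >= threshold_mau
        if inside:
            if start is None:
                start = volume_cv  # leading edge crossed the cutoff
            stop = volume_cv
        elif start is not None:
            break  # trailing edge fell below the cutoff: divert again
    return None if start is None else (start, stop)


def synthetic_trace():
    """Made-up trace: a wash bump, then a triangular elution peak centred at 16 CV."""
    for tick in range(140, 181):  # 14.0 to 18.0 CV in 0.1 CV steps
        volume_cv = tick / 10
        uv = max(0.0, 500.0 - 400.0 * abs(volume_cv - 16.0))
        if 142 <= tick <= 144:
            uv = 150.0  # bump above the cutoff, but during wash
        yield volume_cv, uv, tick >= 150  # elution assumed to begin at 15.0 CV


window = pool_window(synthetic_trace())
print(window)  # (15.0, 17.0)
```

The wash bump at 14.2 to 14.4 CV exceeds the cutoff yet is ignored, because the phase gate is closed at that point in the cycle.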

Where does learning enter? Not, in production, as an autonomous agent free to redraw the pool however it likes — that is exactly the kind of adaptive control of a quality attribute that EU draft Annex 22 draws a sharp line against. Learning enters as the thing that sets and validates the cutoff, and that predicts the consequence. A model trained on historical cycles, each with its release assays, can learn the UV (or conductivity, or pH) cutoff that best trades recovery against an impurity specification — and then that cutoff is locked and runs as a fixed rule, with the model's role being design-time and monitoring, not live autonomy. The most ambitious research in this space pushes further: convolutional-network-guided Raman has been used to make charge-variant pooling decisions on a cation-exchange polishing step with reported R-squared between 0.94 and 0.99 — but that is a polishing-chromatography result on a different separation, and it is research, not routine [7].

For capture, the workhorse is humbler and that is the point: a locked UV guard band, a model that chose and validated it, and a soft sensor that predicts the recovered mass so the operator knows what to expect before the assay comes back. Let us build exactly that.

[Figure: Hero diagram of learned capture chromatography. A horizontal Protein A cycle trace runs left to right, showing UV280 as the main curve over six color-banded phases (equilibration, load with a small breakthrough sigmoid rising at its end, wash decaying to baseline, a tall sharp elution peak, strip, and CIP), with conductivity and pH drawn as fainter lines below. A phase-classifier band beneath the trace colors each one-second sample by its predicted phase. Over the elution peak, a horizontal guard-band line at 100 mAU has two vertical markers where the rising and falling edges cross it, labeled pool start and pool stop, and the collected slice is shaded. A side panel on the right shows the mass-balance recovery model: loaded mass into a column-capacity ceiling into bound mass times a learned recovery fraction into predicted eluted mass. A banner across the top notes that the mechanistic twin sits beside this learned layer: physics for the column, ML for read-time-predict.]

Capture, learned: a phase classifier labels every second of the trace, a locked UV guard band over the elution peak picks the pool start and stop on the rising and falling edges, and a mass-balance-anchored soft sensor predicts the recovered mass. This is the machine-learning layer that sits beside, not instead of, the mechanistic column twin. Original diagram by the authors, created with AI assistance.

Building it: phase classification plus pooling and recovery

The runnable module frames two tasks on one simulated chromatogram. First, a phase classifier: label each one-second sample as one of the six phases from the three live signals (UV280, conductivity, pH) plus light temporal context. A bare triplet confuses equilibration with wash (both low UV, neutral pH, similar conductivity); what separates them is where you are in the cycle and which way UV is moving, so we add the running column-volume position and a 30-second rolling slope and mean. That position feature is honest in real time, since the cumulative volume delivered is always known, but it is also a strong hint, so the chapter is candid below about not letting the model simply learn the clock.

Second, pooling and recovery: inside the model-predicted elution phase, apply the locked 100 mAU UV guard band to pick the pool start and stop in column volumes, then a mass-balance-anchored soft sensor predicts the recovered mass. The recovery model encodes the physics ceiling — eluted mass cannot exceed what the column bound, which is the lesser of the mass loaded and the capacity (DBC times column volume) — and lets the learned recovery fraction fill the gap. This is a miniature hybrid: physics sets the ceiling, data fills the fraction.
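The recovery soft sensor described above reduces to a minimum and a multiplication. The sketch below uses the golden-cycle numbers reported later in the chapter (47.0 g loaded, DBC of 58 g per L on a 1 L column, recovery fraction 0.92). The function name `predict_recovered_mass` is a hypothetical stand-in and may not match the signature of the module's own `recovery_model()`; the overloaded case is an invented second input to show the ceiling binding.

```python
def predict_recovered_mass(loaded_g, dbc_g_per_l, column_volume_l, recovery_fraction):
    """Mass-balance-anchored recovery estimate (hypothetical helper).

    Physics ceiling: the column cannot bind more than it was given, nor more
    than its capacity. The learned part is only the recovery fraction.
    """
    capacity_g = dbc_g_per_l * column_volume_l
    bound_g = min(loaded_g, capacity_g)
    return bound_g * recovery_fraction


golden = predict_recovered_mass(47.0, 58.0, 1.0, 0.92)
overloaded = predict_recovered_mass(70.0, 58.0, 1.0, 0.92)  # invented overload case

print(f"golden cycle: {golden:.1f} g")      # 43.2 g
print(f"overloaded:   {overloaded:.1f} g")  # 53.4 g, capped by the 58.0 g capacity
```

In the overloaded case the extra 12 g of load does not appear in the prediction at all, which is the behaviour a purely fitted regression would have no reason to respect.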

# examples/platform/ml/chromatography.py — phase classifier + pooling/recovery.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

POOL_THRESHOLD_mAU = 100.0 # locked UV guard band for collection
CV_ML = 1000.0 # 1 L Protein A column

def featurize(df):
    f = pd.DataFrame()
    f["UV280"], f["cond"], f["pH"] = df["UV280_mAU"], df["conductivity_mS_cm"], df["pH"]
    f["cv_position"] = df["volume_CV"]  # where in the cycle (CV elapsed)
    # 30-sample windows are 30 seconds at one sample per second
    f["UV_slope"] = df["UV280_mAU"].diff().rolling(30, min_periods=1).mean().fillna(0.0)
    f["UV_roll"] = df["UV280_mAU"].rolling(30, min_periods=1).mean()
    f["pH_roll"] = df["pH"].rolling(30, min_periods=1).mean()
    f["cond_roll"] = df["conductivity_mS_cm"].rolling(30, min_periods=1).mean()
    return f

df = pd.read_csv("examples/datasets/protein_a_chromatogram.csv").sort_values("time_s")
X, y = featurize(df).to_numpy(), df["phase"].to_numpy()
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=2026, stratify=y)
clf = GradientBoostingClassifier(n_estimators=120, max_depth=3, random_state=2026).fit(Xtr, ytr)
pred = clf.predict(Xte)
print(f"accuracy={accuracy_score(yte, pred):.4f} macro-F1={f1_score(yte, pred, average='macro'):.4f}")

# real-time pooling: collect inside predicted Elution while UV stays above the guard band
phase_hat = clf.predict(featurize(df).to_numpy())
elute = df[(phase_hat == "Elution") & (df["UV280_mAU"] >= POOL_THRESHOLD_mAU)]
start_cv, stop_cv = float(elute.volume_CV.iloc[0]), float(elute.volume_CV.iloc[-1])
print(f"pool {start_cv}-{stop_cv} CV = {(stop_cv - start_cv) * CV_ML:.1f} mL")

Running it on the simulator's golden capture cycle gives the verified output:

Phase classifier (GBT on UV/cond/pH + context): accuracy=0.9989 macro-F1=0.9989 (2184 train / 936 test samples)
Real-time pooling (UV280 >= 100 mAU in predicted Elution): collect 15.0-16.92 CV, pool 1916.7 mL
Recovery soft sensor: loaded 47.0 g, capacity 58.0 g, bound 47.0 g x 0.92 = 43.2 g (measured 43.3 g)
ASSERT ok: phase classifier recovers the chromatogram structure (accuracy > 0.9).

The classifier recovers the six-phase structure nearly perfectly; the pool window (collect from 15.0 to 16.92 CV for a 1916.7 mL pool) matches the cycle summary in protein_a_summary.csv exactly; and the recovery soft sensor predicts 43.2 g against a measured 43.3 g. The near-perfect accuracy is partly because the cycle-position feature is so informative on a single, well-separated cycle, and partly because a random 70/30 split puts neighbouring one-second samples of the same cycle on both sides of the train/test boundary. On a real multi-cycle, multi-resin-lot dataset the score would be lower and the classifier would have to lean harder on the signal shape, a caveat the section on the unsolved part returns to.

Anatomy of one pooling decision

A pooling decision is small — a start, a stop, a pool — but it carries the whole logic of the capture step, and it is worth unpacking one. The record below is the decision the module made on PApool-001, and every field is either a live signal, a learned/locked rule, or a physics-anchored prediction, with nothing invented.

[Figure: Identity card unpacking one Protein A pooling decision for PApool-001. A header row names the cycle and the source chromatogram. A live-signals block lists UV280, conductivity, and pH as the three streamed channels at one sample per second. A predicted-phase row shows the gradient-boosted classifier output Elution with its 0.9989 accuracy on held-out samples. A rule block shows the locked UV280 guard band of 100 mAU, with leading-edge cross at pool start 15.0 CV, trailing-edge cross at pool stop 16.92 CV, and pool volume 1916.7 mL. A physics-ceiling block shows mass loaded 47.0 g, column capacity of DBC 58 g per L times 1 L equals 58.0 g, and bound mass equal to the minimum, 47.0 g. A learned-fraction row shows recovery 0.92 giving predicted eluted mass 43.2 g against measured 43.3 g. A lineage row shows PApool-001 derivedFrom CLAR-001. A governance footnote states that the cutoff is locked at design time and the model is monitoring, not autonomous, per Annex 22.]

One pooling decision, fully unpacked: three live signals feed a phase classifier that says "Elution," a locked UV guard band picks start and stop on the peak edges, a mass-balance ceiling caps the recoverable mass, and a learned fraction predicts the yield. The lineage from clarified harvest to capture pool is carried in a single reviewable record. Original diagram by the authors, created with AI assistance.

Read top to bottom, the record is the chapter in miniature. The live signals are the only inputs available the instant the decision is made — UV280, conductivity, pH, sampled each second. The predicted phase is the classifier's call, and it gates everything: the guard band only applies inside elution, so a UV bump during wash never triggers collection. The rule is the locked 100 mAU cutoff; the start (15.0 CV) and stop (16.92 CV) are simply where the rising and falling edges of the peak cross it. The physics ceiling is the mass balance — 47.0 g loaded against a 58.0 g column capacity means the column bound all 47.0 g it was given — and the learned fraction (0.92) turns that into a 43.2 g prediction that the eventual assay (43.3 g) confirms. The whole record is auditable after the fact, which is precisely why this pattern passes GMP review where an autonomous controller would not.

The unsolved part: capacity drift, resin lifetime, and the locked-model paradox

The hard, unsolved problem in capture is time — specifically, that the column is not the same column on cycle 200 as on cycle 1. Protein A resin is expensive and is validated for a finite cycle life; over that life its dynamic binding capacity slowly falls, its pressure rises, and its impurity clearance can drift. A model that perfectly times the pool today will, if frozen, slowly mistime it as the resin ages and the peak shape shifts — the model decay problem this book keeps meeting, here driven by a physical asset wearing out rather than by a biological process drifting.

This collides head-on with the regulatory reality. The very fix a data scientist wants (let the model adapt to the aging resin) is the thing draft Annex 22 most explicitly rules out for a critical GMP function: it requires locked models and a predetermined change-control plan, not online adaptation. So the field is left with an awkward middle ground: detect the drift (the resin-aging monitors are good at this), but respond to it through governed retraining and validated cleaning/re-equilibration cycles, not through a model quietly rewriting its own pooling rule. Doing that well, which means knowing when a model has decayed enough to warrant a controlled retrain and proving the new one is at least as safe, is the open MLOps problem of downstream, and it is genuinely unsolved at scale.

A second unsolved piece is the one our own demo exposed: the phase classifier leaned heavily on cycle position, and a single clean cycle made that trivially easy. Real capture data is multi-cycle, multi-resin-lot, multi-scale, and the genuinely hard version of phase detection — robust to a column that channels, a feed that is off-spec, a pump that stutters — is harder than 0.9989 accuracy on one golden run suggests. Distinguishing learning the signal from learning the clock is the difference between a classifier that generalizes and one that memorizes the schedule.
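One cheap way to probe for clock-learning is a schedule-perturbation check: score a deliberately position-only baseline on a cycle whose schedule has shifted, then compare the trained classifier against it on the same cycle. The sketch below builds only the baseline. The nominal schedule is the chapter's; the 2 CV load extension, the 0.1 CV tick size, and all function names are assumptions made for illustration.

```python
# Nominal schedule from the chapter, in whole column volumes.
NOMINAL_CV = [("Equilibration", 3), ("Load", 8), ("Wash", 4),
              ("Elution", 5), ("Strip", 3), ("CIP", 3)]


def label_ticks(schedule_cv, ticks_per_cv=10):
    """True phase label for every 0.1 CV tick of a cycle run on `schedule_cv`."""
    labels = []
    for phase, length_cv in schedule_cv:
        labels.extend([phase] * (length_cv * ticks_per_cv))
    return labels


def clock_only_classifier(tick, nominal_labels):
    """Predict the phase purely from position, as a memorized schedule would."""
    return nominal_labels[min(tick, len(nominal_labels) - 1)]


def clock_accuracy(true_labels, nominal_labels):
    hits = sum(clock_only_classifier(tick, nominal_labels) == label
               for tick, label in enumerate(true_labels))
    return hits / len(true_labels)


nominal = label_ticks(NOMINAL_CV)
# Hypothetical perturbation: the load phase runs 2 CV longer than planned.
extended_load = label_ticks([(p, n + 2 if p == "Load" else n) for p, n in NOMINAL_CV])

print(f"clock-only, nominal cycle:  {clock_accuracy(nominal, nominal):.2f}")        # 1.00
print(f"clock-only, extended load:  {clock_accuracy(extended_load, nominal):.2f}")  # 0.71
```

If a trained classifier's accuracy on the perturbed cycle falls toward the clock-only figure, it has mostly memorized the schedule; if it stays high, it is reading the signal.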

What this chapter adds to the model suite

This chapter contributes examples/platform/ml/chromatography.py to Book 5's example suite — the Protein A phase classifier plus the pooling-and-recovery model. It provides:

  • train_phase_classifier(): a gradient-boosted six-class phase classifier over UV280, conductivity, pH, and temporal context, validated on held-out samples of the simulated chromatogram (accuracy 0.9989, macro-F1 0.9989).
  • predict_pool(): the real-time pooling rule, which applies the locked UV guard band inside the model-predicted elution phase and returns start/stop column volumes and a pool volume that match the cycle summary.
  • recovery_model(): the mass-balance-anchored recovery soft sensor, a physics ceiling (the lesser of loaded mass and DBC times column volume) times a learned recovery fraction, predicting 43.2 g against a measured 43.3 g.

It reads examples/datasets/protein_a_chromatogram.csv (3120 one-second samples) and protein_a_summary.csv (the cycle's pooling decision and recovery), and it deliberately complements rather than duplicates the upstream soft sensors — soft_sensor_pls.py and hybrid_model.py — by working on a downstream unit operation with a categorical (phase) task alongside the regression.

Why it matters

Capture sets the impurity floor for the entire purification train, and the pooling decision it makes — live, on a moving signal — is one of the few places in manufacturing where a model's choice has an immediate, measurable, auditable consequence on product quality and yield. Getting the learning layer right means three concrete things: a phase classifier that automates chromatogram review so a human reviews exceptions, not every trace; a pooling rule that is learned and validated but locked, trading yield against purity on evidence rather than on a fixed historical window; and a recovery soft sensor that tells the operator the yield before the assay does. Get the boundary right — mechanistic physics for the column, machine learning for read-time-predict, neither pretending to be the other — and capture becomes the model citizen of downstream ML: real, deployed-adjacent, and honest about its limits. Blur the boundary, call a solved-physics simulator "AI," or let a model adapt its own CQA-affecting rule, and you have either over-claimed or stepped over the line the regulators have drawn brightest.

In the real world

Mechanistic chromatography modeling is production technology in CMC process development: Cytiva GoSilico and the open-source CADET are used to model elution and breakthrough for real molecules, and disclosed industrial groups have built mechanistic ion-exchange and mixed-mode models in their pipelines [1][2] — but, again, this is mechanistic, not ML, and vendor savings headlines are vendor-self-reported. On the learning side, the deployed-adjacent reality is monitoring and prediction rather than autonomous control: ML-based Raman has been used to predict many quality attributes in-line during Protein A capture (a pilot at a major manufacturer, using K-nearest-neighbours regression — not a deep-learning model, a point worth correcting wherever it is miscited) [8]; on-line PAT plus PCA has flagged resin aging cycles ahead of yield loss (pilot) [6]; and PINN-accelerated mechanistic models have reached real-time speeds for model-predictive control of continuous capture (research) [5]. The pattern matches the ISPE Pharma 4.0 picture the whole book reports: downstream ML clusters in monitoring, prediction, and human-in-the-loop review — not in autonomous control of a critical quality attribute. The open-source analytics chapter shows the same shape of model running in code, and Book 1's capture chromatography and Book 4's downstream ontology describe the same physical step through their own lenses.

Key terms

  • Capture (Protein A) chromatography — the first downstream step, a bind-and-elute cycle that concentrates the antibody a hundredfold and removes most impurities, yielding the capture pool PApool-001.
  • Phase — one of the six segments of a capture cycle (equilibration, load, wash, elution, strip, CIP), each with a distinct UV-conductivity-pH signature; classified per-sample from the live signals.
  • Breakthrough — antibody escaping the column unbound when the resin saturates during load; the rising outlet UV that marks it bounds how much can be loaded.
  • Dynamic binding capacity (DBC) — the mass the column can hold at the operating flow before a fixed breakthrough level; not a constant — it falls as the resin ages.
  • Pooling decision — choosing the start and stop of eluate collection, typically as a guard band on UV280 (and sometimes conductivity or pH), trading recovery against purity; in production the cutoff is learned and validated but locked, not adapted live.
  • Mechanistic chromatography model — a first-principles column simulator (general rate model + steric mass action; GoSilico, CADET); the mature production tool here and not machine learning.
  • Hybrid (gray-box) chromatography model — a mechanistic backbone whose hard-to-measure parameters or residuals are supplied by a learned component; the small-data-friendly middle path.
  • Recovery soft sensor — a prediction of eluted mass from a physics ceiling (the lesser of loaded mass and DBC times column volume) times a learned recovery fraction.

Where this leads

The capture pool PApool-001 is concentrated, low-pH, and far purer than the harvest — but "far purer" is not "safe." Before the product can advance, the process must prove it can clear viruses by a validated margin, and that proof is a number with a hard floor. The next chapter, Viral Safety: Learning Log-Reduction and Orthogonal Clearance, takes up the log-reduction-value problem — how learning predicts and supports viral clearance across orthogonal steps, and why this is the most safety-critical place in the whole book to be careful about what a model is allowed to claim.