Downstream Capture: Chromatography & Filtration Skids

📍 Where we are: Part II, "Capturing the Process." The bioreactor handed off a tank of cells and antibody; now we follow the molecule through the purification skids, and learn to capture data that is less about steady trends and more about decisions.

The simple version

Upstream data is like a long flight: altitude and speed drift slowly for hours, and you mostly want the average. Downstream data is like the landing. Everything that matters happens in a few sharp minutes (the flare, the touchdown, the brakes), and the question is never "what was the average?" but "did we do the right thing at the right moment, and can we prove it?" A chromatography run is a sequence of short, named steps, and the valuable record is which window of liquid we decided to keep.

What this chapter covers​

After the bioreactor, the harvested broth is still mostly water, cell debris, and host-cell junk with a little antibody dissolved in it. Downstream purification is the series of skids that turns that into pure drug substance: Protein A capture, viral inactivation and filtration, polishing chromatography, and ultrafiltration/diafiltration (UF/DF). This chapter shows how to capture that data with open-source tools:

  • Why downstream traces are phase-rich and decision-bearing, and what signals each skid produces.
  • Reading UV, conductivity, pH, pressure, and flow off the skid PLCs over OPC UA.
  • Segmenting a run into ISA-88 operations and phases, normalized in column volumes (CV).
  • Recording the two GMP-critical decisions: the pooling window and hold times / integrity tests, written to PostgreSQL as events.operation_event rows.
  • The intensified multi-column continuous capture (3MCC) variant, and where pure OSS stops being enough.

We will run real, tested code from the companion repo against the deterministic chromatogram the simulator produces (SIM_SEED=2026), and look at the exact numbers that come out.

Downstream is a state machine, not a trend line​

Upstream, you log a tag every few seconds for two weeks and the story is in the slow curve. Downstream, a single Protein A cycle lasts about an hour, and inside that hour the column passes through a fixed sequence of steps: equilibrate, load, wash, elute, strip, clean. Each step has a totally different "normal." During load, UV at 280 nm sits near baseline; during elution, it spikes to thousands of milli-absorbance-units (mAU) as concentrated antibody comes off the column; during clean-in-place, conductivity jumps because you are pushing sodium hydroxide through.

This is exactly what ISA-88 / IEC 61512 was written for: a batch is structured as a procedure → unit procedure → operation → phase hierarchy [1]. For our purposes the useful unit is the phase (equilibration, load, wash, elution), and the job of data capture is to slice the continuous sensor trace into those named windows so every later query can ask "what was the UV during elution?" instead of "what was the UV at 09:14:32?"

The signals themselves come off the skid's PLC over OPC UA (IEC 62541) [2], the same self-describing, timestamped, quality-flagged transport we used for the bioreactor. A chromatography skid typically exposes UV280 (sometimes UV at multiple wavelengths), conductivity, pH, inlet and outlet pressure, flow rate, and the current step number the skid's own controller is running. We capture all of it; the engineering work is turning it into decisions.

The simulated Protein A cycle​

You cannot put a real €100k chromatography skid on a laptop, so the companion repo ships a deterministic simulator that emits a physically plausible cycle. The point of the simulator is honesty: every number in this chapter comes from a file you can regenerate byte-for-byte. Protein A affinity capture is the platform first step for essentially every CHO-derived monoclonal antibody [3], so it is the right thing to model.

From examples/sim/bioproc_sim/protein_a.py, the cycle is defined as a list of phases measured in column volumes, the natural, scale-independent clock of chromatography (one CV here is one litre of resin bed, run at 0.5 CV/min):

# examples/sim/bioproc_sim/protein_a.py
CV_ML = 1000.0 # column volume (mL); 1 L Protein A column
CV_PER_MIN = 0.5 # 0.5 CV/min -> 1 CV = 2 min

# phase -> (duration in CV)
PHASES = [
    ("Equilibration", 3.0),
    ("Load", 8.0),  # load to ~80% of dynamic binding capacity to avoid breakthrough loss
    ("Wash", 4.0),
    ("Elution", 5.0),
    ("Strip", 3.0),
    ("CIP", 3.0),
]
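Because every phase is declared in column volumes, the CV and wall-clock boundaries of the whole cycle follow from a running sum. The sketch below is not from the repo; `phase_schedule` is a hypothetical helper that only reuses the `PHASES` durations and `CV_PER_MIN` shown above.

```python
from itertools import accumulate

CV_PER_MIN = 0.5  # value from the excerpt above: 1 CV = 2 min
PHASES = [
    ("Equilibration", 3.0), ("Load", 8.0), ("Wash", 4.0),
    ("Elution", 5.0), ("Strip", 3.0), ("CIP", 3.0),
]


def phase_schedule(phases, cv_per_min):
    """Hypothetical helper: (name, start_CV, end_CV, start_min, end_min) per phase."""
    ends = list(accumulate(duration for _, duration in phases))
    starts = [0.0] + ends[:-1]
    return [
        (name, s, e, s / cv_per_min, e / cv_per_min)
        for (name, _), s, e in zip(phases, starts, ends)
    ]


schedule = phase_schedule(PHASES, CV_PER_MIN)
for row in schedule:
    print("%-14s CV %5.1f -> %5.1f   min %5.1f -> %5.1f" % row)
```

Elution therefore occupies CV 15 to 20, which is minutes 30 to 40 of a 52-minute cycle; at roughly one sample per second that is consistent with a trace a little over 3,000 rows long.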

The simulator builds the UV/conductivity/pH traces phase by phase. Two pieces of real chromatography physics are worth pointing at. During load, UV rises as a breakthrough sigmoid near the end: the column is filling toward its dynamic binding capacity, and if you keep loading, product starts washing straight through unbound. During elution, a low-pH step releases the antibody as a sharp, slightly tailing peak:

# examples/sim/bioproc_sim/protein_a.py: the elution peak
lo, hi = seg("Elution")
emask = (cv >= lo) & (cv < hi)
x = cv[emask] - (lo + 0.8)
peak = 1850.0 * np.exp(-(x ** 2) / (2 * 0.45 ** 2)) * (1 + 0.5 * (x > 0) * np.exp(-x / 0.9))
uv[emask] = 4.0 + peak
ph[emask] = 3.3 + 0.4 * np.exp(-((cv[emask] - lo) / 1.5))
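To see what that expression produces on its own, the following sketch evaluates the same formula without noise. The sampling grid (120 samples per CV, i.e. 1 Hz at 0.5 CV/min) and the absence of a noise term are assumptions; the simulator's actual grid construction and noise model are not shown in the excerpt.

```python
import numpy as np

lo = 15.0                                   # Elution starts at CV 15 (3 + 8 + 4)
cv = np.arange(lo, lo + 5.0, 1.0 / 120.0)   # assumed 1 Hz grid: 120 samples per CV
x = cv - (lo + 0.8)                         # peak is centred 0.8 CV into the phase

gauss = 1850.0 * np.exp(-(x ** 2) / (2 * 0.45 ** 2))
tail = 1 + 0.5 * (x > 0) * np.exp(-x / 0.9)  # extra weight only on the down-slope
uv = 4.0 + gauss * tail                      # 4 mAU baseline, no noise in this sketch
ph = 3.3 + 0.4 * np.exp(-((cv - lo) / 1.5))

i = int(np.argmax(uv))
peak_cv, peak_uv, peak_ph = float(cv[i]), float(uv[i]), float(ph[i])
print(f"peak near CV {peak_cv:.2f}: {peak_uv:.0f} mAU at pH {peak_ph:.2f}")
```

The tail factor multiplies the Gaussian only for `x > 0`, so the noise-free maximum sits at or just after CV 15.8 and is well above the bare 1850 mAU amplitude.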

Run it (python -m bioproc_sim.protein_a) and you get a ~1 Hz trace plus a one-row summary. Here are the first committed rows of datasets/protein_a_chromatogram.csv, in the long, tidy shape the historian likes:

ts,time_s,volume_CV,UV280_mAU,conductivity_mS_cm,pH,phase,batch_id
2026-01-19 08:00:00+00:00,0,0.0,4.3,4.926,7.207,Equilibration,BATCH-2026-001
2026-01-19 08:00:01+00:00,1,0.0083,5.36,4.958,7.181,Equilibration,BATCH-2026-001
2026-01-19 08:00:02+00:00,2,0.0167,2.64,4.997,7.204,Equilibration,BATCH-2026-001

The whole cycle is 3,120 rows. The interesting part, the elution peak, tops out at 2,769.6 mAU around CV 15.8, at pH 3.5. That single number is what the rest of the chapter exists to act on.

A Protein A capture chromatogram annotated by ISA-88 phase, with UV280, conductivity and pH traces and the shaded pooling window between the up-slope and down-slope 100 mAU thresholds on the elution peak.

A single Protein A bind-and-elute cycle. UV280 (blue) stays near baseline through equilibration, load and wash, then erupts during elution; pH (green) drops to ~3.3 to release the antibody; conductivity (orange) spikes during CIP. The shaded band is the pooling window: the slice of eluate we actually keep.

Original diagram by the authors, created with AI assistance.

Slicing the run into ISA-88 phases​

In a real plant the skid controller already knows what step it is on, and it stamps each sample with a step/phase label. The repo's simulator does the same: every row in the trace carries a phase column. What you still need is to turn that dense per-sample label into the handful of contiguous windows the batch record cares about. The repo therefore ships a small, boring, robust collapser that does exactly that: it walks the trace and starts a new phase window wherever the label changes.

(Signal-only reconstruction, meaning deriving the phase windows from the raw UV/conductivity/pH trace when no step label exists, as you might face on an older skid or in merged data, is a genuinely harder problem and is out of scope here. The code below assumes a per-sample step label is available; it does not infer phases from the physics.)

From examples/chapters/10-downstream-chromatography/phase_detect.py:

# examples/chapters/10-downstream-chromatography/phase_detect.py
import pandas as pd


def detect_phases(trace: pd.DataFrame) -> pd.DataFrame:
    """Collapse the per-sample phase labels into contiguous phase windows."""
    t = trace.copy()
    # a new phase starts wherever the label changes
    t["grp"] = (t["phase"] != t["phase"].shift()).cumsum()
    rows = []
    for _, g in t.groupby("grp"):
        rows.append({
            "phase": g["phase"].iloc[0],
            "start_ts": g["ts"].iloc[0],
            "end_ts": g["ts"].iloc[-1],
            "start_CV": round(float(g["volume_CV"].iloc[0]), 3),
            "end_CV": round(float(g["volume_CV"].iloc[-1]), 3),
            "max_UV_mAU": round(float(g["UV280_mAU"].max()), 1),
        })
    return pd.DataFrame(rows)

The (label != label.shift()).cumsum() trick is the whole idea: each time the phase name differs from the previous row, the cumulative sum ticks up by one, giving every contiguous run of identical labels a unique group id you can groupby. Running python chapters/10-downstream-chromatography/phase_detect.py against the committed trace prints exactly this:

phase          start_CV  end_CV  max_UV_mAU
Equilibration     0.000   2.992         7.3
Load              3.000  10.992        64.0
Wash             11.000  14.992        35.2
Elution          15.000  19.992      2769.6
Strip            20.000  22.992        45.9
CIP              23.000  25.992         8.1

Six phases, each with a clean CV window and its peak UV. Notice the diagnostic value already falling out: the Load phase's max UV of 64 mAU is the breakthrough shoulder, a quiet early warning that the column is approaching saturation. In production you would alarm on that. (The chapter's pytest, test_ch10_phase_detection_and_pooling, asserts this collapse lands on exactly these six phases in order.)
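The shift-and-cumsum idiom is easiest to trust on a toy series. This sketch uses made-up labels (not the committed trace) and deliberately repeats "Load" in two separate runs, to show why a run id is needed rather than a plain groupby on the phase name.

```python
import pandas as pd

# Illustrative labels only; "Load" appears in two non-adjacent runs
labels = pd.Series(["Load", "Load", "Wash", "Wash", "Wash", "Load", "Elution"])

changed = labels != labels.shift()   # True on the first row of every run
run_id = changed.cumsum()            # increments once per run
runs = labels.groupby(run_id).agg(["first", "count"])

print(run_id.tolist())
print(runs)
```

Grouping on the label itself would merge the two "Load" runs into one window; grouping on `run_id` keeps them apart, which is what a cycle that revisits a step requires.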

The decision that matters: the pooling window​

Detecting phases is bookkeeping. The GMP-critical decision is pooling: of all the liquid coming off during elution, which slice do you collect into the product pool and which do you send to waste? Collect too early and you carry over impurities; collect too late and you dilute the pool or lose yield. This is process analytical technology (PAT) in its purest form: real-time UV measurement driving an in-process control decision rather than waiting for an offline assay [4]. The academic literature is explicit that on-line analytics now drive column-pooling decisions against product-quality attributes, which must fall inside validated ranges [5].

The classic, robust rule is UV-threshold pooling: start collecting when UV crosses a threshold on the up-slope, stop when it falls back through a threshold on the down-slope. Newer plants use inline UV/Vis spectral deconvolution to pool on concentration (and even on impurity content) rather than raw absorbance, computing step yield far more accurately than offline peak-area integration [6]. The repo implements the simple, defensible threshold version:

# examples/chapters/10-downstream-chromatography/phase_detect.py
POOL_THRESHOLD_MAU = 100.0 # start/stop collecting the eluate at 100 mAU

def pooling_decision(trace: pd.DataFrame) -> dict:
    """Collect the elution peak between up-slope and down-slope UV thresholds."""
    elute = trace[trace["phase"] == "Elution"]
    above = elute[elute["UV280_mAU"] >= POOL_THRESHOLD_MAU]
    if above.empty:
        return {"pooled": False}
    start_cv = float(above["volume_CV"].iloc[0])
    stop_cv = float(above["volume_CV"].iloc[-1])
    return {
        "pooled": True,
        "pool_start_CV": round(start_cv, 3),
        "pool_stop_CV": round(stop_cv, 3),
        "pool_CV": round(stop_cv - start_cv, 3),
        "threshold_mAU": POOL_THRESHOLD_MAU,
        "peak_UV_mAU": round(float(elute["UV280_mAU"].max()), 1),
    }

On our run that returns:

pooling: {'pooled': True, 'pool_start_CV': 15.0, 'pool_stop_CV': 16.917,
'pool_CV': 1.917, 'threshold_mAU': 100.0, 'peak_UV_mAU': 2769.6}

We keep a 1.917 CV slice, about 1.9 L of eluate, between CV 15.0 and CV 16.92. That single record answers the inspector's question. The summary the simulator writes (datasets/protein_a_summary.csv) closes the mass balance around it:

batch_id,step,column,cv_mL,load_titer_g_L,load_volume_L,mass_loaded_g,pool_start_CV,pool_stop_CV,pool_volume_mL,DBC_g_per_L,recovery_frac,eluted_mass_g,eluate_titer_g_L
BATCH-2026-001,ProteinA_capture,MabSelect-PA01,1000.0,5.88,8.0,47.0,15.0,16.92,1916.7,58.0,0.92,43.3,22.58

One thing to know if you run the code yourself: the committed datasets/protein_a_summary.csv is produced by the campaign run (make data), which feeds the fed-batch line's final titer of ~5.88 g/L into simulate(load_titer_g_L=5.88). The bare module command python -m bioproc_sim.protein_a uses the function's own default of load_titer_g_L=5.5 and so prints a slightly different summary row (mass_loaded 44.0 g, eluted 40.5 g, eluate_titer 21.12 g/L). The chromatogram trace and the pooling numbers (peak 2,769.6 mAU, pool 15.0 → 16.917) are identical either way, because the load titer only scales the mass balance, not the UV trace.

47.0 g of antibody loaded, a dynamic binding capacity of 58 g/L, 92% step recovery, and 43.3 g eluted at 22.58 g/L: concentrated nearly 4× (22.58 / 5.88 ≈ 3.8×) versus the load. The mass balance in the simulator is deliberately honest: you can only elute what the column actually bound, so eluted mass is capped at min(mass_loaded, DBC × CV), with CV in litres, before applying recovery (see eluted_g = bound_g * recovery in protein_a.py). A pooling decision that implied recovering more than you loaded would be a bug. The chapter's pytest (test_ch10_phase_detection_and_pooling) guards the decision end of this: it asserts the run pools and that the product peak clears 1000 mAU.
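The arithmetic behind that summary row can be reproduced in a few lines. This is a sketch of the mass balance as the text describes it, not the simulator's code: `mass_balance` and its parameter names are hypothetical, and the inputs are the figures quoted from protein_a_summary.csv.

```python
def mass_balance(load_titer_g_l, load_volume_l, dbc_g_per_l, cv_l,
                 recovery_frac, pool_volume_l):
    """Hypothetical helper: cap bound mass at DBC x CV, then apply step recovery."""
    mass_loaded = load_titer_g_l * load_volume_l
    bound = min(mass_loaded, dbc_g_per_l * cv_l)   # the column cannot bind more than DBC x CV
    eluted = bound * recovery_frac
    eluate_titer = eluted / pool_volume_l
    return {
        "mass_loaded_g": round(mass_loaded, 1),
        "eluted_mass_g": round(eluted, 1),
        "eluate_titer_g_L": round(eluate_titer, 2),
        "concentration_factor": round(eluate_titer / load_titer_g_l, 1),
    }


# Campaign run (5.88 g/L) and bare-module default (5.5 g/L); pool volume 1916.7 mL
campaign = mass_balance(5.88, 8.0, 58.0, 1.0, 0.92, 1.9167)
default = mass_balance(5.5, 8.0, 58.0, 1.0, 0.92, 1.9167)
# An illustrative overload: 80 g offered to a column that can bind 58 g
overloaded = mass_balance(10.0, 8.0, 58.0, 1.0, 0.92, 1.9167)

print(campaign)
print(default)
print(overloaded)
```

The overloaded case is the one the cap exists for: eluted mass is limited by what the resin bound, not by what was pumped onto it.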

Writing it down: operation events in PostgreSQL​

Phases and the pooling decision are useless unless they are stored next to the batch they belong to. The repo's relational backbone (PostgreSQL 17, the timescale/timescaledb:2.17.2-pg17 image pinned in examples/platform/compose/compose.yaml) has one table built for this: the bridge between the time-series stream and the batch context.

From examples/platform/db/30-lab-events.sql:

-- examples/platform/db/30-lab-events.sql
CREATE TABLE events.operation_event (
    event_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    batch_id    text REFERENCES s88.batch,
    unit_id     text REFERENCES s88.unit,
    event_type  text NOT NULL,  -- phase_start | phase_end | pool | hold | excursion
    phase       text,
    start_ts    timestamptz NOT NULL,
    end_ts      timestamptz,
    attributes  jsonb NOT NULL DEFAULT '{}'
);
CREATE INDEX ON events.operation_event (batch_id, start_ts);

The shape is deliberate. Structured columns (batch_id, unit_id, phase, the time window) carry the things you always filter and join on; the open-ended attributes jsonb carries the per-event payload that varies by event type: the pooling window for an elution, the integrity-test result for a filter, the duration for a hold. The phase detector emits one row per phase and attaches the pool window to the elution row:

# examples/chapters/10-downstream-chromatography/phase_detect.py
phases["event_type"] = "phase"
# attach the pool window to the Elution row
phases["attributes"] = phases["phase"].map(
    lambda p: pool if p == "Elution" else {})

So the Elution row that lands in Postgres looks like this, with structured context plus a self-describing JSON payload that a reviewer (or a SPARQL query later in the book) can read without a schema migration:

{
  "batch_id": "BATCH-2026-001",
  "unit_id": "PA01",
  "event_type": "phase",
  "phase": "Elution",
  "start_ts": "2026-01-19T08:30:00Z",
  "end_ts": "2026-01-19T08:39:59Z",
  "attributes": {
    "pooled": true, "pool_start_CV": 15.0, "pool_stop_CV": 16.917,
    "pool_CV": 1.917, "threshold_mAU": 100.0, "peak_UV_mAU": 2769.6
  }
}

This is the heart of why we use PostgreSQL here rather than the historian [7]: the historian holds millions of raw samples; this table holds the handful of interpreted, decision-bearing records that the batch record and the audit trail actually care about.
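How such a row might reach the table is sketched below. The excerpt does not show the repo's actual writer, so `INSERT_SQL` and `event_params` are hypothetical; the sketch only builds a parameterized statement and its parameter tuple with the standard library, assuming a DB-API driver that uses `%s` placeholders (psycopg is one such driver).

```python
import json

# Hypothetical statement; column names follow the CREATE TABLE shown above
INSERT_SQL = (
    "INSERT INTO events.operation_event "
    "(batch_id, unit_id, event_type, phase, start_ts, end_ts, attributes) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s::jsonb)"
)


def event_params(row: dict) -> tuple:
    """Order one event dict into the parameter tuple INSERT_SQL expects."""
    return (
        row["batch_id"], row["unit_id"], row["event_type"], row.get("phase"),
        row["start_ts"], row.get("end_ts"),
        json.dumps(row.get("attributes", {})),   # serialized for the jsonb cast
    )


elution = {
    "batch_id": "BATCH-2026-001", "unit_id": "PA01",
    "event_type": "phase", "phase": "Elution",
    "start_ts": "2026-01-19T08:30:00Z", "end_ts": "2026-01-19T08:39:59Z",
    "attributes": {"pooled": True, "pool_start_CV": 15.0, "pool_stop_CV": 16.917,
                   "pool_CV": 1.917, "threshold_mAU": 100.0, "peak_UV_mAU": 2769.6},
}
params = event_params(elution)
# With a live connection this would be: cursor.execute(INSERT_SQL, params)
print(params)
```

Passing values as parameters rather than formatting them into the SQL string keeps the decision payload out of the statement text, which matters for both injection safety and audit readability.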

Hold times, integrity tests, and the rest of the train​

Protein A is only the first skid. The same operation_event pattern records the rest of the downstream train, and the event_type values (phase | pool | hold | excursion) are built for it:

  • Viral inactivation is a low-pH hold: the eluate sits at pH ~3.5 for a validated minimum time. That hold also has a defined maximum hold time before the next step, and both the minimum and maximum are GMP-critical, must be validated, and must be recorded [8]. A hold event with start_ts, end_ts, and an attributes payload of {"target_min": 60, "actual_min": 64, "pH": 3.5} is the whole evidence trail.
  • Filtration (viral filtration, sterile filtration) produces a pre-use/post-use integrity test whose pass/fail is a discrete, recorded event: {"test": "bubble_point", "psi": 51, "spec_min_psi": 45, "result": "pass"}. Pressure and flow trends from the skid go to the historian; the verdict goes here.
  • Polishing (cation/anion exchange) is another bind-and-elute or flow-through chromatography step; the same phase-detection and pooling code applies, just with different thresholds.
  • UF/DF concentrates and buffer-exchanges the drug substance; the recorded decisions are the concentration-factor and diavolume targets met.

The honest note: a maximum hold-time breach is exactly the kind of excursion event that must surface to quality. Capturing it is easy; the workflow that routes, investigates, and dispositions it is the regulated part, and we build that in Part V.
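The hold and integrity-test payloads described in the list above can be derived mechanically from the measured values. The sketch below is illustrative only: `classify_hold` and `bubble_point_event` are hypothetical helpers, and the 120-minute maximum is an assumed figure, since the text gives only the 60-minute target.

```python
def classify_hold(actual_min, target_min, max_min, ph):
    """Hypothetical helper: return (event_type, attributes) for a low-pH hold."""
    attrs = {"target_min": target_min, "max_min": max_min,
             "actual_min": actual_min, "pH": ph}
    if actual_min < target_min:
        attrs["reason"] = "below_minimum"
        return "excursion", attrs
    if actual_min > max_min:
        attrs["reason"] = "above_maximum"
        return "excursion", attrs
    return "hold", attrs


def bubble_point_event(psi, spec_min_psi):
    """Hypothetical helper: attributes for a filter integrity test."""
    return {"test": "bubble_point", "psi": psi, "spec_min_psi": spec_min_psi,
            "result": "pass" if psi >= spec_min_psi else "fail"}


# max_min=120 is an assumption for illustration, not a value from the chapter
ok_hold = classify_hold(actual_min=64, target_min=60, max_min=120, ph=3.5)
late_hold = classify_hold(actual_min=130, target_min=60, max_min=120, ph=3.5)
filter_test = bubble_point_event(psi=51, spec_min_psi=45)
print(ok_hold, late_hold, filter_test, sep="\n")
```

Either tuple maps directly onto the event_type and attributes columns of operation_event; the routing and disposition of an excursion remain outside this code.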

The intensified variant: multi-column continuous capture​

The whole chapter so far describes a batch column: one column, one cycle, idle between cycles. The modern, intensified alternative is multi-column continuous capture (3MCC / PCC): three or four small Protein A columns plumbed so that while one is eluting, the next is loading and capturing the breakthrough from the first. This is what the perfusion bioreactor (the book's continuous variant) feeds into, and it dramatically improves resin utilization because you can deliberately load past breakthrough, knowing a downstream column catches the leak.

For data capture, 3MCC changes one thing fundamentally: instead of one clean phase sequence, you have several columns each in a different phase at the same instant, with the controller orchestrating switch events between them. The operation_event model handles this without change: every row carries its own unit_id, so three concurrent phase timelines (one per column, e.g. PA01, PA02, PA03) simply become three streams of rows under the same batch_id. (The seed only provisions the single-column unit PA01 in s88.unit; because operation_event.unit_id is a foreign key to s88.unit, a real 3MCC run would first need PA02 and PA03 added there before those rows could insert.) The pooling logic moves up a level: you pool a cumulative product stream across column switches rather than one peak. The repo's fed-batch path uses the single-column simulator above; the perfusion sidebar reuses the identical detector per column.
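Running the detector per column amounts to treating a change of unit_id as a window boundary in addition to a change of label. The sketch below is a hypothetical variant, not the sidebar's code: `windows_per_column`, the integer timestamps, and the staggered toy phases are all invented for illustration.

```python
import pandas as pd


def windows_per_column(trace: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant: contiguous phase windows per unit_id."""
    t = trace.sort_values(["unit_id", "ts"]).copy()
    # a new window starts when the label changes OR the column changes
    new_run = (t["phase"] != t["phase"].shift()) | (t["unit_id"] != t["unit_id"].shift())
    t["grp"] = new_run.cumsum()
    return (
        t.groupby("grp")
         .agg(unit_id=("unit_id", "first"), phase=("phase", "first"),
              start_ts=("ts", "first"), end_ts=("ts", "last"))
         .reset_index(drop=True)
    )


# Toy interleaved samples: three columns, each in a different phase at most instants
staggered = {
    "PA01": ["Load", "Load", "Elution", "Elution"],
    "PA02": ["Elution", "Elution", "Load", "Load"],
    "PA03": ["Load", "Elution", "Elution", "Load"],
}
rows = [{"ts": ts, "unit_id": unit, "phase": phases[ts]}
        for ts in range(4) for unit, phases in staggered.items()]
windows = windows_per_column(pd.DataFrame(rows))
print(windows)
```

Sorting by unit and time first matters because live samples from several columns arrive interleaved; without it, every row would look like a label change.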

Why it matters​

Downstream is where the molecule becomes the drug, and where most of the yield and most of the risk live. A pooling decision made two CV too late can fail a host-cell-protein spec; a missed hold-time can scrap a batch worth a quarter-million dollars. The reason we go to the trouble of segmenting traces into ISA-88 phases and writing structured event rows is that those rows are what a batch-record review, an OOS investigation, and an inspector all read. A wall of raw UV samples proves nothing on its own; the sentence "the operator pooled between 15.0 and 16.9 CV, inside the validated 14.5–17.5 CV window", reconstructed automatically from the trace, is the proof.
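Reconstructing that sentence from a stored pooling record is a one-comparison check. This is a minimal sketch with a hypothetical helper, `pool_within_validated`; the 14.5 to 17.5 CV limits are the illustrative figures from the sentence above, not a validated range from the repo.

```python
def pool_within_validated(pool_start_cv, pool_stop_cv, valid_lo_cv, valid_hi_cv):
    """Hypothetical helper: verdict plus a reviewer-readable sentence."""
    inside = valid_lo_cv <= pool_start_cv and pool_stop_cv <= valid_hi_cv
    verdict = "inside" if inside else "OUTSIDE"
    sentence = (
        f"pooled between {pool_start_cv:.1f} and {pool_stop_cv:.1f} CV, "
        f"{verdict} the validated {valid_lo_cv:.1f} to {valid_hi_cv:.1f} CV window"
    )
    return inside, sentence


ok, sentence = pool_within_validated(15.0, 16.917, 14.5, 17.5)
late, late_sentence = pool_within_validated(15.0, 17.8, 14.5, 17.5)
print(sentence)
print(late_sentence)
```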

In the real world​

In a commercial mAb plant the chromatography skids are typically Cytiva ÄKTA process systems or equivalent, and the data of record lives in a vendor chromatography data system (CDS) and the MES batch record. Our OSS stack does not replace those; it contextualizes and historizes alongside them. That boundary is a recurring honesty theme of this book.

A few real-world anchors:

  • NIIMBL, the U.S. public-private Manufacturing USA institute for biopharmaceutical innovation, funds exactly this kind of continuous and intensified processing work; its SABRE facility (the NIIMBL / University of Delaware pilot-scale cGMP, or current Good Manufacturing Practice, facility that broke ground in April 2024) is being built to demonstrate next-generation processes including continuous capture. SABRE is a facility, not a data program, but it is where the 3MCC-style processes this chapter sketches are meant to run at pilot scale.
  • Data formats: vendor CDS exports lock chromatograms into proprietary files. Where you need to interchange a chromatogram (into a data lake, between sites, into a regulatory submission), the vendor-neutral ASTM ANDI/NetCDF chromatography format (ASTM E1947) is the long-standing standard, and exporting .cdf rather than a vendor blob is the FAIR-friendly choice [9].
  • Standards bite here. Annex 11 (EU GMP for computerised systems) requires that a system recording GMP-relevant decisions and changes, like a pooling decision, generate a reviewable, time-stamped audit trail [10]. PIC/S PI 041-1 reinforces that those pooling and hold-time records must be attributable, contemporaneous, and complete (the ALCOA+ attributes), and that it is the data flow and risk, not just whether the IT works, that governs how you capture decision data [11].

The honest OSS-vs-commercial verdict for this layer: the open-source stack does the capture and contextualization beautifully. Python + the phase detector turns a raw trace into ISA-88 events; PostgreSQL stores them with full structure; Node-RED with community finite-state-machine nodes can run the same segmentation as a live flow at the edge, triggering pool and hold events as they happen [12]. That gets you most of the way. What pure OSS does not give you out of the box is the validated, Part 11-grade e-signature on the pooling decision, the locked-down change control on the pooling thresholds, or vendor accountability for the CDS that physically commanded the fraction-collector valve. Those are the GxP last mile: the audit trail, signing, and validation we build (honestly, and with their limits) in Part V.
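The live, at-the-edge form of UV-threshold pooling is a small state machine fed one sample at a time. The sketch below expresses it in Python rather than as a Node-RED flow; `StreamingPooler` is hypothetical and not part of the repo, and the sample values are synthetic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamingPooler:
    """Hypothetical live version of UV-threshold pooling (not in the repo)."""
    threshold_mau: float = 100.0
    start_cv: Optional[float] = None
    stop_cv: Optional[float] = None

    def update(self, cv: float, uv_mau: float) -> str:
        """Feed one sample; return where the eluate should go right now."""
        if self.start_cv is None:                 # state: waiting for the up-slope
            if uv_mau >= self.threshold_mau:
                self.start_cv = cv
                return "collect"
            return "waste"
        if self.stop_cv is None:                  # state: collecting
            if uv_mau >= self.threshold_mau:
                return "collect"
            self.stop_cv = cv                     # first sample back below threshold
        return "waste"                            # state: done, never reopens


# Synthetic samples (CV, mAU), including a late bump above threshold
samples = [(15.0, 5.0), (15.2, 60.0), (15.4, 400.0), (15.6, 1200.0),
           (15.8, 300.0), (16.0, 80.0), (16.2, 150.0)]
pooler = StreamingPooler()
routing = [pooler.update(cv, uv) for cv, uv in samples]
print(routing, pooler.start_cv, pooler.stop_cv)
```

This differs from the batch `pooling_decision` in one respect worth knowing: the batch version takes the last sample above threshold anywhere in the elution phase, so it would include the late bump, whereas the streaming version closes at the first down-crossing and stays closed. A single noisy dip would also close it early, so a production flow would likely add debouncing, which this sketch omits.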

Key terms​

  • Chromatography skid: the automated system (pumps, valves, UV/conductivity/pH detectors, fraction collector) that runs a packed column through its phases.
  • Protein A capture: the platform affinity step that selectively binds the antibody's Fc region, giving high purity in one step.
  • Column volume (CV): the volume of the packed resin bed; the scale-independent clock for chromatography (here 1 CV = 1 L = 2 min).
  • Phase / operation (ISA-88): a named step within a procedure (equilibration, load, wash, elution, strip, CIP); the unit of segmentation.
  • Breakthrough: product washing through the column unbound once it nears its binding capacity; visible as a rising UV shoulder during load.
  • Dynamic binding capacity (DBC): how much product a litre of resin can bind under flow before breakthrough; here ~58 g/L.
  • Pooling window: the start/stop volumes between which eluate is collected into the product pool; the chapter's central GMP-critical decision.
  • Hold time: the validated minimum (and maximum) time a pool sits at a step, e.g. low-pH viral inactivation.
  • Integrity test: a pass/fail check (e.g. bubble point) on a filter before/after use, recorded as a discrete event.
  • 3MCC / PCC: multi-column / periodic counter-current continuous capture; the intensified alternative to a single batch column.
  • PAT: Process Analytical Technology; using real-time measurement (here inline UV) to make in-process quality decisions.

Where this leads​

We have captured the process from sensor to skid: every in-line tag, every phase, every pooling decision. But the molecule's quality is ultimately judged off-line, by instruments: HPLC for purity, assays for host-cell protein, a balance for concentration. The next chapter, The Analytical Lab: Instruments, LIMS & ELN, leaves the production floor for the QC lab and shows how to capture that data, the offline results that confirm whether the decisions we just recorded actually produced a drug worth releasing.