Downstream Capture: Chromatography & Filtration Skids

📍 Where we are: Part II, "Capturing the Process." The bioreactor handed off a tank of cells and antibody; now we follow the molecule through the purification skids — and learn to capture data that is less about steady trends and more about decisions.

The simple version

Upstream data is like a long flight: altitude and speed drift slowly for hours, and you mostly want the average. Downstream data is like the landing. Everything that matters happens in a few sharp minutes — the flare, the touchdown, the brakes — and the question is never "what was the average?" but "did we do the right thing at the right moment, and can we prove it?" A chromatography run is a sequence of short, named steps, and the valuable record is which window of liquid we decided to keep.

What this chapter covers

After the bioreactor, the harvested broth is still mostly water, cell debris, and host-cell protein (the CHO cells' own proteins, co-produced alongside the antibody), with a little antibody dissolved in it. Downstream purification is the series of skids that turns that broth into pure drug substance. Each skid is a self-contained, automated chromatography or filtration machine (pumps, valves, and detectors on a frame). The train runs Protein A capture, viral inactivation and filtration, polishing chromatography (a second chromatography step that strips out the last trace impurities after capture), and ultrafiltration/diafiltration (UF/DF). This chapter shows how to capture that data with open-source tools:

Why downstream traces are phase-rich and decision-bearing, and what signals each skid produces.
Reading UV, conductivity, pH, pressure, and flow off the skid PLCs over OPC UA.
Segmenting a run into ISA-88 operations and phases, normalized in column volumes (CV).
Recording the two GMP-critical decisions, the pooling window and hold times / integrity tests, written to PostgreSQL (the open-source relational database) as events.operation_event rows. (GMP is Good Manufacturing Practice, the legally enforceable rules for making medicines safely and reproducibly.)
The intensified multi-column continuous capture (3MCC) variant, and where pure OSS stops being enough.

We will run real, tested code from the companion repo against the deterministic chromatogram the simulator produces (SIM_SEED=2026), and look at the exact numbers that come out.

Downstream is a state machine, not a trend line

Upstream, you log a tag every few seconds for two weeks and the story is in the slow curve. Downstream, a single Protein A cycle lasts about an hour, and inside that hour the column passes through a fixed sequence of steps: equilibrate, load, wash, elute, strip, clean. Each step has a totally different "normal." During load, UV at 280 nm sits near baseline; during elution, it spikes to thousands of milli-absorbance-units (mAU) as concentrated antibody comes off the column. (UV absorbance at 280 nm is the workhorse protein detector — proteins absorb ultraviolet light at that wavelength, so a high UV reading means a lot of antibody is flowing past.) During clean-in-place (CIP), conductivity — a measure of how many dissolved ions are in the liquid — jumps because you are pushing sodium hydroxide, a strong base full of ions, through the column.

This is exactly what ISA-88 / IEC 61512 was written for: a batch is structured as a procedure → unit procedure → operation → phase hierarchy [1] — from the whole recipe (procedure) down through what happens on one piece of equipment (unit procedure) and a major processing activity (operation) to the smallest named step (phase). For our purposes the useful unit is the phase — equilibration, load, wash, elution — and the job of data capture is to slice the continuous sensor trace into those named windows so every later query can ask "what was the UV during elution?" instead of "what was the UV at 09:14:32?"

The signals themselves come off the skid's PLC (programmable logic controller — the industrial computer that runs the equipment) over OPC UA (IEC 62541) [2], the industrial data protocol — the same self-describing, timestamped, quality-flagged transport we used for the bioreactor. A chromatography skid typically exposes UV280 (sometimes UV at multiple wavelengths), conductivity, pH, inlet and outlet pressure, flow rate, and the current step number the skid's own controller is running. We capture all of it; the engineering work is turning it into decisions.

The simulated Protein A cycle

You cannot put a real €100k chromatography skid on a laptop, so the companion repo ships a deterministic simulator that emits a physically plausible cycle. The point of the simulator is honesty: every number in this chapter comes from a file you can regenerate byte-for-byte. Protein A affinity capture is the platform first step for essentially every CHO-derived monoclonal antibody [3], so it is the right thing to model.

From examples/sim/bioproc_sim/protein_a.py, the cycle is defined as a list of phases measured in column volumes — the natural, scale-independent clock of chromatography (one CV here is one litre of resin bed — the column packed with tiny porous beads the antibody sticks to — run at 0.5 CV/min, i.e. half a column volume of liquid pushed through per minute):

# examples/sim/bioproc_sim/protein_a.py
CV_ML = 1000.0          # column volume (mL); 1 L Protein A column
CV_PER_MIN = 0.5        # 0.5 CV/min -> 1 CV = 2 min

# phase -> (duration in CV)
PHASES = [
    ("Equilibration", 3.0),
    ("Load", 8.0),       # load to ~80% of dynamic binding capacity to avoid breakthrough loss
    ("Wash", 4.0),
    ("Elution", 5.0),
    ("Strip", 3.0),
    ("CIP", 3.0),
]
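As a quick check on the arithmetic, the phase list can be turned into cumulative CV windows and a total run length. This is a minimal sketch, not code from the repo: `phase_windows` and `SAMPLE_HZ` are names introduced here for illustration, and the 1 Hz sample rate is taken from the committed CSV described later in the chapter.

```python
from itertools import accumulate

CV_PER_MIN = 0.5   # from protein_a.py: 0.5 CV/min, so 1 CV = 2 min
SAMPLE_HZ = 1      # assumption: one sample per second, as in the committed CSV

PHASES = [
    ("Equilibration", 3.0),
    ("Load", 8.0),
    ("Wash", 4.0),
    ("Elution", 5.0),
    ("Strip", 3.0),
    ("CIP", 3.0),
]


def phase_windows(phases):
    """Return {phase: (start_CV, end_CV)} from consecutive durations in CV."""
    ends = list(accumulate(duration for _, duration in phases))
    starts = [0.0] + ends[:-1]
    return {name: (s, e) for (name, _), s, e in zip(phases, starts, ends)}


windows = phase_windows(PHASES)
total_cv = sum(duration for _, duration in PHASES)
total_min = total_cv / CV_PER_MIN
total_samples = int(total_min * 60 * SAMPLE_HZ)

for name, (start, end) in windows.items():
    minutes = (end - start) / CV_PER_MIN
    print(f"{name:<14}{start:5.1f} to {end:5.1f} CV  ({minutes:.0f} min)")
print(total_cv, total_min, total_samples)
```

Under these assumptions the cycle is 26 CV, which at 0.5 CV/min is 52 minutes, or 3,120 one-second samples, with elution occupying 15 to 20 CV.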

The simulator builds the UV/conductivity/pH traces phase by phase. Two pieces of real chromatography physics are worth pointing out. During load, UV rises as a breakthrough sigmoid near the end: the column is filling toward its dynamic binding capacity (DBC, the amount of antibody a litre of resin can hold under flow before it saturates), and if you keep loading, product starts washing straight through unbound (the breakthrough loss the code comment warns about). During elution, a low-pH buffer breaks the Protein A–Fc bond that held the antibody on the column (the Fc is the constant stem of the Y-shaped antibody, the part Protein A grabs), releasing it as a sharp, slightly tailing peak:

# examples/sim/bioproc_sim/protein_a.py — the elution peak
lo, hi = seg("Elution")
emask = (cv >= lo) & (cv < hi)
x = cv[emask] - (lo + 0.8)
peak = 1850.0 * np.exp(-(x ** 2) / (2 * 0.45 ** 2)) * (1 + 0.5 * (x > 0) * np.exp(-x / 0.9))
uv[emask] = 4.0 + peak
ph[emask] = 3.3 + 0.4 * np.exp(-((cv[emask] - lo) / 1.5))

Run it (python -m bioproc_sim.protein_a) and you get a ~1 Hz trace plus a one-row summary. Here are the first committed rows of datasets/protein_a_chromatogram.csv — long, tidy, exactly the shape the historian likes (a real chromatography system's native result file is a richer, proprietary record; this open CSV is the tidy form you would export from it):

ts,time_s,volume_CV,UV280_mAU,conductivity_mS_cm,pH,phase,batch_id
2026-01-19 08:00:00+00:00,0,0.0,4.3,4.926,7.207,Equilibration,BATCH-2026-001
2026-01-19 08:00:01+00:00,1,0.0083,5.36,4.958,7.181,Equilibration,BATCH-2026-001
2026-01-19 08:00:02+00:00,2,0.0167,2.64,4.997,7.204,Equilibration,BATCH-2026-001

The whole cycle is 3,120 rows. The interesting one — the elution peak — tops out at 2,769.6 mAU around CV 15.8, at pH 3.5. (pH falls across the elution as a low-pH gradient — roughly 3.7 down to 3.3 — and is about 3.5 at the peak apex, which is why you will see ~3.3 and ~3.5 quoted for the same step.) That single number is what the rest of the chapter exists to act on.
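To see where the apex and the "pH 3.5 at the peak" figure come from, the elution formulas can be evaluated without noise. The following is a sketch under stated assumptions: it re-implements the two expressions from the excerpt in plain Python (the simulator itself uses NumPy arrays and adds noise, so the committed trace will differ slightly), and the grid of 120 samples per CV is inferred from 1 Hz sampling at 0.5 CV/min. The function names are introduced here for illustration.

```python
import math

ELUTION_START_CV = 15.0
ELUTION_END_CV = 20.0
SAMPLES_PER_CV = 120   # assumption: 1 Hz sampling at 0.5 CV/min


def elution_uv(cv):
    """Noise-free UV280 (mAU) during elution, per the excerpt's formula."""
    x = cv - (ELUTION_START_CV + 0.8)
    gauss = math.exp(-(x ** 2) / (2 * 0.45 ** 2))
    tail = 1 + 0.5 * (x > 0) * math.exp(-x / 0.9)   # tailing only after the apex
    return 4.0 + 1850.0 * gauss * tail


def elution_ph(cv):
    """Noise-free pH during elution: decays from about 3.7 toward 3.3."""
    return 3.3 + 0.4 * math.exp(-(cv - ELUTION_START_CV) / 1.5)


grid = [
    k / SAMPLES_PER_CV
    for k in range(int(ELUTION_START_CV * SAMPLES_PER_CV),
                   int(ELUTION_END_CV * SAMPLES_PER_CV))
]
apex_cv = max(grid, key=elution_uv)
apex_uv = elution_uv(apex_cv)
apex_ph = elution_ph(apex_cv)

print(f"apex near CV {apex_cv:.2f}, UV {apex_uv:.0f} mAU, pH {apex_ph:.2f}")
print(f"pH at elution start: {elution_ph(ELUTION_START_CV):.2f}")
```

The noise-free apex lands just after CV 15.8 at roughly 2,770 to 2,780 mAU, and the pH there is about 3.5, consistent with the committed figure of 2,769.6 mAU once detector noise is added.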

A Protein A capture chromatogram annotated by ISA-88 phase, with UV280, conductivity and pH traces and the shaded pooling window between the up-slope and down-slope 100 mAU thresholds on the elution peak.

A single Protein A bind-and-elute cycle. UV280 (blue) stays near baseline through equilibration, load and wash, then erupts during elution; pH (green) drops to ~3.3 to release the antibody; conductivity (orange) spikes during CIP. The shaded band is the pooling window — the slice of eluate we actually keep.

Original diagram by the authors, created with AI assistance.

A chromatogram is a curve, not a row

It is worth being precise about what kind of thing a chromatogram actually is, because it shapes how you can store it. The tidy CSV above is a convenient flattening, but the underlying object is not a list of rows — it is a curve: absorbance as a continuous function of elution volume, and on a multi-wavelength detector it is several curves at once (UV at each wavelength versus volume), i.e. a small n-dimensional array. Most of the genuine information lives in the shape of that curve — the leading edge, the peak apex, the tailing shoulder — not in any single sample.

That matters for two reasons this book keeps returning to. First, the BMIC ontology is deliberately prescriptive. (An ontology is a shared, machine-readable vocabulary that fixes the meaning of manufacturing terms and their relationships; the knowledge-graph chapter builds one, and Book 4, Ontologies for Biopharmaceutical Manufacturing, is devoted to them. BMIC is published by the IOF (Industrial Ontologies Foundry) with OAGi/NIIMBL and was released 2026-02.) It models what the run is supposed to do: the recipe, the specification, and the critical process parameters and critical quality attributes (the CPPs and CQAs, the dials you control and the qualities you must hit). It has no class for the measured curve itself. So you cannot "store the chromatogram in BMIC"; BMIC tells you the elution step's CPP/CQA targets, and the trace lives elsewhere. Second, if you want to keep the full trace as a faithful original, not just the handful of derived numbers (peak height, pool window) we extract here, a flat JSON or CSV is the wrong container. The vendor-neutral homes for the whole array are the same two this book covers in The Analytical Lab: Instruments, LIMS & ELN: Allotrope's HDF5-based ADF, whose n-dimensional Data Cube is built precisely for spectra, chromatograms and curves, or AnIML, whose SeriesSet carries the array as open XML. Allotrope ASM is JSON, and JSON is for the result (a scalar like the 2,769.6 mAU peak or the 1.917 CV pool), not for the dense curve behind it. (That chapter explains ADF, ASM and AnIML in full; here the only point is that the decision-bearing numbers we write to PostgreSQL and the archival curve are two different jobs, served by two different stores.)

Slicing the run into ISA-88 phases

Left-to-right flow of the six Protein A phases — Equilibrate, Load, Wash, Elution, Strip, CIP — each annotated with its column-volume window, with a green connector dropping from the Elution box to the marked pooling window at 15.0 to 16.917 CV.

The phase state machine: load, wash, elute

The cycle is not a free-running trend; it is a fixed sequence the skid controller marches through, and that sequence is the spine of everything we capture. Read left to right: Equilibrate primes the bed (0–3 CV), Load binds the antibody while UV climbs toward the breakthrough shoulder (3–11 CV), Wash flushes loosely-bound impurities (11–15 CV), Elution drops the pH and releases the product as the sharp peak (15–20 CV), then Strip and CIP regenerate the column (20–26 CV). Each transition is a discrete event the controller stamps, and — crucially — each box in that sequence becomes exactly one events.operation_event row. Only one of the six carries a decision: the Elution row, which gets the pooling window attached to it. The other five are pure bookkeeping that locate the decision in time and identity. This is why the model is a state machine rather than a trend line: the question is never "what is the value now?" but "which named state are we in, and what did we decide while we were in it?"

In a real plant the skid controller already knows what step it is on, and it stamps each sample with a step/phase label. The repo's simulator does the same: every row in the trace carries a phase column. What you still need is to turn that dense per-sample label into the handful of contiguous windows the batch record cares about — so the repo ships a small, boring, robust collapser that does exactly that: walk the trace, and a new phase window starts wherever the label changes.

(Signal-only reconstruction — deriving the phase windows from the raw UV/conductivity/pH trace when no step label exists, as you might face on an older skid or in merged data — is a genuinely harder problem and is out of scope here. The code below assumes a per-sample step label is available; it does not infer phases from the physics.)

From examples/chapters/10-downstream-chromatography/phase_detect.py:

# examples/chapters/10-downstream-chromatography/phase_detect.py
import pandas as pd


def detect_phases(trace: pd.DataFrame) -> pd.DataFrame:
    """Collapse the per-sample phase labels into contiguous phase windows."""
    t = trace.copy()
    # a new phase starts wherever the label changes
    t["grp"] = (t["phase"] != t["phase"].shift()).cumsum()
    rows = []
    for _, g in t.groupby("grp"):
        rows.append({
            "phase": g["phase"].iloc[0],
            "start_ts": g["ts"].iloc[0],
            "end_ts": g["ts"].iloc[-1],
            "start_CV": round(float(g["volume_CV"].iloc[0]), 3),
            "end_CV": round(float(g["volume_CV"].iloc[-1]), 3),
            "max_UV_mAU": round(float(g["UV280_mAU"].max()), 1),
        })
    return pd.DataFrame(rows)

The (label != label.shift()).cumsum() trick is the whole idea: each time the phase name differs from the previous row, the cumulative sum ticks up by one, giving every contiguous run of identical labels a unique group id you can groupby. Running python chapters/10-downstream-chromatography/phase_detect.py against the committed trace prints exactly this:

        phase  start_CV  end_CV  max_UV_mAU
Equilibration     0.000   2.992         7.3
         Load     3.000  10.992        64.0
         Wash    11.000  14.992        35.2
      Elution    15.000  19.992      2769.6
        Strip    20.000  22.992        45.9
          CIP    23.000  25.992         8.1

Six phases, each with a clean CV window and its peak UV. Notice the diagnostic value already falling out: the Load phase's max UV of 64 mAU is the breakthrough shoulder — a quiet early-warning that the column is approaching saturation. Because the run stops loading at ~81% of DBC (47 g loaded onto a column whose capacity is 58 g — its 1 L bed at 58 g/L; the code comment rounds this to ~80%), the breakthrough leak stays small — but push past that shoulder and unbound antibody flows straight to waste, which is the whole reason load is capped below capacity. In production you would alarm on that. (The chapter's pytest, test_ch10_phase_detection_and_pooling, asserts this collapse lands on exactly these six phases in order.)
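The label-change collapse can also be written without pandas, which makes the semantics easy to see. The sketch below is a dependency-free equivalent of the idea, not the repo's code: it uses `itertools.groupby`, which (like the shift-and-cumsum trick) groups only consecutive equal labels. The miniature trace and the `collapse` helper are hypothetical, with values chosen only to resemble the chapter's numbers.

```python
from itertools import groupby

# Hypothetical miniature trace: (volume_CV, UV280_mAU, phase)
trace = [
    (0.0, 4.3, "Equilibration"),
    (0.5, 5.1, "Equilibration"),
    (1.0, 6.0, "Load"),
    (1.5, 64.0, "Load"),
    (2.0, 35.2, "Wash"),
    (2.5, 400.0, "Elution"),
    (3.0, 2769.6, "Elution"),
]


def collapse(rows):
    """Collapse per-sample phase labels into contiguous phase windows."""
    windows = []
    # groupby starts a new group every time the key changes between
    # neighbouring rows, which is the same rule as (label != label.shift()).
    for phase, group in groupby(rows, key=lambda row: row[2]):
        samples = list(group)
        windows.append({
            "phase": phase,
            "start_CV": samples[0][0],
            "end_CV": samples[-1][0],
            "max_UV_mAU": max(uv for _, uv, _ in samples),
        })
    return windows


for window in collapse(trace):
    print(window)
```

Because grouping is by adjacency rather than by label value, a phase name that recurs later in a run (a second wash, for example) would produce a second window instead of being merged into the first.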

The decision that matters: the pooling window

Detecting phases is bookkeeping. The GMP-critical decision is pooling: of all the liquid coming off during elution, which slice do you collect into the product pool and which do you send to waste? Collect too early and you carry over impurities; collect too late and you dilute the pool or lose yield. This is process analytical technology (PAT) in its purest form — real-time UV measurement driving an in-process control decision rather than waiting for an offline assay [4], and the academic literature is explicit that on-line analytics now drive column-pooling decisions against product-quality attributes, which must fall inside validated ranges (limits proven and documented in advance, before the process is run for real, to reliably give acceptable product) [5].

The classic, robust rule is UV-threshold pooling: start collecting when UV crosses a threshold on the up-slope, stop when it falls back through a threshold on the down-slope. Newer plants use inline UV/Vis multivariate calibration (often called spectral deconvolution) to pool on concentration (and even on impurity content) rather than raw absorbance, computing step yield far more accurately than offline peak-area integration [6] — a learning-on-spectra method Book 5 builds on this same chromatogram in Analytical Methods and Capture Chromatography. The repo implements the simple, defensible threshold version:

# examples/chapters/10-downstream-chromatography/phase_detect.py
POOL_THRESHOLD_MAU = 100.0   # start/stop collecting the eluate at 100 mAU

def pooling_decision(trace: pd.DataFrame) -> dict:
    """Collect the elution peak between up-slope and down-slope UV thresholds."""
    elute = trace[trace["phase"] == "Elution"]
    above = elute[elute["UV280_mAU"] >= POOL_THRESHOLD_MAU]
    if above.empty:
        return {"pooled": False}
    start_cv = float(above["volume_CV"].iloc[0])
    stop_cv = float(above["volume_CV"].iloc[-1])
    return {
        "pooled": True,
        "pool_start_CV": round(start_cv, 3),
        "pool_stop_CV": round(stop_cv, 3),
        "pool_CV": round(stop_cv - start_cv, 3),
        "threshold_mAU": POOL_THRESHOLD_MAU,
        "peak_UV_mAU": round(float(elute["UV280_mAU"].max()), 1),
    }

On our run that returns:

pooling: {'pooled': True, 'pool_start_CV': 15.0, 'pool_stop_CV': 16.917,
          'pool_CV': 1.917, 'threshold_mAU': 100.0, 'peak_UV_mAU': 2769.6}

We keep a 1.917 CV slice, about 1.9 L of eluate, between CV 15.0 and CV 16.917. (The summary CSV and the detector describe the same window at different precision: the CSV rounds pool_stop_CV to two decimals, 16.92, where the detector reports 16.917, and the CSV's pool_volume_mL of 1916.7 is the same 1.917 CV window expressed in millilitres at 1000 mL per CV.) That single record answers the inspector's question. The summary the simulator writes (datasets/protein_a_summary.csv) closes the mass balance around it:

batch_id,step,column,cv_mL,load_titer_g_L,load_volume_L,mass_loaded_g,pool_start_CV,pool_stop_CV,pool_volume_mL,DBC_g_per_L,recovery_frac,eluted_mass_g,eluate_titer_g_L
BATCH-2026-001,ProteinA_capture,MabSelect-PA01,1000.0,5.88,8.0,47.0,15.0,16.92,1916.7,58.0,0.92,43.3,22.58

One thing to know if you run the code yourself: the committed datasets/protein_a_summary.csv is produced by the campaign run (make data), which feeds the fed-batch line's final titer of ~5.88 g/L into simulate(load_titer_g_L=5.88). The bare module command python -m bioproc_sim.protein_a uses the function's own default of load_titer_g_L=5.5 and so prints a slightly different summary row (mass_loaded 44.0 g, eluted 40.5 g, eluate_titer 21.12 g/L). The chromatogram trace and the pooling numbers — peak 2,769.6 mAU, pool 15.0 → 16.917 — are identical either way, because the load titer only scales the mass balance, not the UV trace.

47.0 g of antibody loaded, a dynamic binding capacity of 58 g/L, 92% step recovery, and 43.3 g eluted at 22.58 g/L — concentrated nearly 4× (22.58 / 5.88 ≈ 3.8×) versus the load. The mass balance in the simulator is deliberately honest: you can only elute what the column actually bound, so eluted mass is capped at min(mass_loaded, DBC × CV) before applying recovery (see eluted_g = bound_g * recovery in protein_a.py). A pooling decision that implied recovering more than you loaded would be a bug. The chapter's pytest (test_ch10_phase_detection_and_pooling) guards the decision end of this: it asserts the run pools and that the product peak clears 1000 mAU.
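The mass balance described above is short enough to reproduce by hand. The following sketch recomputes the summary row from its inputs; the function name and dictionary keys are introduced here for illustration and are not the simulator's own, and the pool volume is taken from the summary CSV's pool_volume_mL of 1916.7.

```python
def mass_balance(load_titer_g_L, load_volume_L, dbc_g_per_L, cv_L,
                 recovery_frac, pool_volume_L):
    """Recompute the capture-step mass balance from its inputs."""
    mass_loaded = load_titer_g_L * load_volume_L
    capacity = dbc_g_per_L * cv_L
    bound = min(mass_loaded, capacity)   # cannot bind more than the column holds
    eluted = bound * recovery_frac       # recovery applies to bound mass only
    eluate_titer = eluted / pool_volume_L
    return {
        "mass_loaded_g": mass_loaded,
        "load_fraction_of_dbc": mass_loaded / capacity,
        "bound_g": bound,
        "eluted_mass_g": eluted,
        "eluate_titer_g_L": eluate_titer,
        "concentration_factor": eluate_titer / load_titer_g_L,
    }


campaign = mass_balance(5.88, 8.0, 58.0, 1.0, 0.92, 1.9167)
default = mass_balance(5.5, 8.0, 58.0, 1.0, 0.92, 1.9167)

for label, row in (("campaign", campaign), ("default", default)):
    print(label,
          round(row["mass_loaded_g"], 1),
          round(row["eluted_mass_g"], 1),
          round(row["eluate_titer_g_L"], 2))
```

With the campaign titer of 5.88 g/L this gives 47.0 g loaded (about 81% of capacity), 43.3 g eluted, and 22.58 g/L in the pool; with the module default of 5.5 g/L it gives 44.0 g, 40.5 g, and 21.12 g/L, matching the two summary rows quoted in the text.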

When pooling goes wrong: the field record

The reason we labour over the attributes payload is that a pooling decision is precisely where downstream batches are lost, and the record is the only thing that survives the argument afterwards. Two failure modes are worth naming.

The first is a window in the wrong place. Pool too early and you fold the leading-edge impurities (aggregates, meaning clumped-together antibody molecules, and host-cell protein) into the product; pool too late and you dilute the pool and shed yield. Because on-line analytics now drive these column-pooling decisions against product-quality attributes that must fall inside validated ranges [5], the defensible record is not just "we pooled" but "we pooled between pool_start_CV 15.0 and pool_stop_CV 16.917, inside the validated band." A raw-absorbance threshold can also be fooled when the peak shape drifts batch to batch, which is exactly why newer plants pool on inline-measured concentration rather than raw mAU, computing step yield far more accurately than offline peak-area integration [6]. Either way, the stored threshold_mAU and peak_UV_mAU are what let a reviewer reconstruct why the window fell where it did.

The second is a record that cannot be trusted even if the window was right. PIC/S PI 041-1 (the international GMP inspectorates' data-integrity guidance) is blunt that pooling and hold-time records must be attributable, contemporaneous, and complete — written as the decision happened, by an identifiable actor, with nothing missing [11]. An operation_event row written live by the phase detector, stamped with start_ts/end_ts from the trace and bound by foreign key to BATCH-2026-001 and PA01, satisfies contemporaneous and attributable by construction; a number a technician copies into a spreadsheet the next morning satisfies neither. And because a maximum hold-time or out-of-band pooling event is an in-process control with a defined limit, a breach is a recordable in-process-control failure, not a footnote [8] — the excursion event type exists for exactly this, and Part V is where it gets routed to quality.
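The "inside the validated band" test from the first failure mode can be sketched as a small function over the pooling dictionary. This is an illustration rather than repo code: the band of 14.5 to 17.5 CV is the figure this chapter quotes for the pooling window, while the function name, the payload keys under "reason" and "band_CV", and the choice to treat a run that did not pool as an excursion are all assumptions made here.

```python
VALIDATED_BAND_CV = (14.5, 17.5)   # assumption: the band quoted in this chapter


def pool_excursion(pool, band=VALIDATED_BAND_CV):
    """Return an excursion payload if the pool leaves the band, else None."""
    if not pool.get("pooled"):
        # Assumption: a run that never crossed the threshold is itself recordable.
        return {"reason": "no_pool", "band_CV": list(band)}
    low, high = band
    start, stop = pool["pool_start_CV"], pool["pool_stop_CV"]
    if start < low or stop > high:
        return {
            "reason": "pool_outside_validated_band",
            "pool_start_CV": start,
            "pool_stop_CV": stop,
            "band_CV": list(band),
        }
    return None


good = {"pooled": True, "pool_start_CV": 15.0, "pool_stop_CV": 16.917}
late = {"pooled": True, "pool_start_CV": 15.0, "pool_stop_CV": 18.2}  # hypothetical

print(pool_excursion(good))
print(pool_excursion(late))
```

A non-None result is what would become the attributes payload of an excursion row; the run in this chapter returns None because 15.0 to 16.917 CV sits inside the band.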

Writing it down: operation events in PostgreSQL

Phases and the pooling decision are useless unless they are stored next to the batch they belong to. The repo's relational backbone (PostgreSQL 17, the timescale/timescaledb:2.17.2-pg17 image pinned in examples/platform/compose/compose.yaml) has one table built for this — the bridge between the time-series stream and the batch context.

From examples/platform/db/30-lab-events.sql:

-- examples/platform/db/30-lab-events.sql
CREATE TABLE events.operation_event (
    event_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    batch_id   text REFERENCES s88.batch,
    unit_id    text REFERENCES s88.unit,
    event_type text NOT NULL,                  -- phase | pool | hold | excursion
    phase      text,
    start_ts   timestamptz NOT NULL,
    end_ts     timestamptz,
    attributes jsonb NOT NULL DEFAULT '{}'
);
CREATE INDEX ON events.operation_event (batch_id, start_ts);

The shape is deliberate. Structured columns (batch_id, unit_id, phase, the time window) carry the things you always filter and join on; the open-ended attributes jsonb carries the per-event payload that varies by event type — the pooling window for an elution, the integrity-test result for a filter, the duration for a hold. The phase detector emits one row per phase and attaches the pool window to the elution row:

# examples/chapters/10-downstream-chromatography/phase_detect.py
phases["event_type"] = "phase"
# attach the pool window to the Elution row
phases["attributes"] = phases["phase"].map(
    lambda p: pool if p == "Elution" else {})

So the Elution row that lands in Postgres looks like this — structured context plus a self-describing JSON payload that a reviewer (or a SPARQL query later in the book) can read without a schema migration:

{
  "batch_id": "BATCH-2026-001",
  "unit_id": "PA01",
  "event_type": "phase",
  "phase": "Elution",
  "start_ts": "2026-01-19T08:30:00Z",
  "end_ts":   "2026-01-19T08:39:59Z",
  "attributes": {
    "pooled": true, "pool_start_CV": 15.0, "pool_stop_CV": 16.917,
    "pool_CV": 1.917, "threshold_mAU": 100.0, "peak_UV_mAU": 2769.6
  }
}

This is the heart of why we use PostgreSQL here rather than the historian [7], the time-series database that stores the raw, high-frequency sensor stream. The historian holds millions of raw samples; this table holds the handful of interpreted, decision-bearing records that the batch record and the audit trail actually care about.
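One way to get that Elution record into the table is a parameterized INSERT with the payload serialized to JSON. The sketch below only builds the statement and its parameters, so it runs without a database; `INSERT_SQL` and `to_params` are names introduced here, and the `%s` placeholder style assumes a DB-API driver such as psycopg, where the pair would be passed to `cursor.execute(INSERT_SQL, params)`. How the repo actually loads rows is not shown in this chapter.

```python
import json

# event_id is omitted on purpose: the database generates it.
INSERT_SQL = (
    "INSERT INTO events.operation_event "
    "(batch_id, unit_id, event_type, phase, start_ts, end_ts, attributes) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s::jsonb)"
)


def to_params(event):
    """Map one event dict to the positional parameters of INSERT_SQL."""
    return (
        event["batch_id"],
        event["unit_id"],
        event["event_type"],
        event.get("phase"),                       # nullable column
        event["start_ts"],
        event.get("end_ts"),                      # nullable: phase may still be open
        json.dumps(event.get("attributes", {})),  # empty object for non-Elution rows
    )


event = {
    "batch_id": "BATCH-2026-001",
    "unit_id": "PA01",
    "event_type": "phase",
    "phase": "Elution",
    "start_ts": "2026-01-19T08:30:00Z",
    "end_ts": "2026-01-19T08:39:59Z",
    "attributes": {
        "pooled": True, "pool_start_CV": 15.0, "pool_stop_CV": 16.917,
        "pool_CV": 1.917, "threshold_mAU": 100.0, "peak_UV_mAU": 2769.6,
    },
}

params = to_params(event)
print(INSERT_SQL)
print(params)
```

Passing values as parameters rather than formatting them into the SQL string keeps the payload out of the statement text, and the two foreign keys mean the insert fails if the batch or unit does not already exist.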

Anatomy of an operation_event: one pooling decision as a row

It is worth slowing down and dissecting that Elution row field by field, because every column on it is load-bearing — and the split between the structured columns and the attributes jsonb is the whole design.

Identity card dissecting one events.operation_event row — the structured columns event_id, batch_id, unit_id, event_type, phase, start_ts and end_ts above a highlighted attributes jsonb block carrying the pooling payload, with a violet panel decoding the two foreign-key edges to s88.batch and s88.unit.

One row of events.operation_event: the Elution pooling decision. The seven structured columns carry the things you always join and filter on; the green attributes jsonb block carries the per-event payload — here the pooling window — that varies by event type.

Original diagram by the authors, created with AI assistance.

Walking the columns from the schema in 30-lab-events.sql:

event_id — bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY. The database mints it; you never supply it. It is the stable handle an audit entry or a downstream join points at.
batch_id — text REFERENCES s88.batch. A foreign key into the GMP batch record. BATCH-2026-001 is the released golden batch (lot L26001) seeded in seed_cho_line.sql; the FK is what guarantees you can never log a pooling decision against a batch that does not exist.
unit_id — text REFERENCES s88.unit. The other foreign key, into the equipment hierarchy. PA01 resolves to the Protein A Capture Skid (a Cytiva ÄKTA process unit in the DOWNSTREAM area). This is the column that makes 3MCC work: three columns are just three unit_id values under one batch_id.
event_type — text NOT NULL. The discriminator: phase | pool | hold | excursion. It tells a reader (and a query) which shape of payload to expect in attributes. Here it is phase.
phase — text (nullable). The ISA-88 step name, here Elution, held in its own structured column rather than inside the payload, so you can filter WHERE phase = 'Elution' without parsing JSON.
start_ts / end_ts — timestamptz. The window. start_ts is NOT NULL (an event must have a beginning); end_ts is nullable so an open, still-running phase can be written before it closes.
attributes — jsonb NOT NULL DEFAULT '{}'. The open-ended payload. On the five non-Elution rows it is the empty object {}; on the Elution row it carries the entire pooling decision: pooled, threshold_mAU, peak_UV_mAU, pool_start_CV, pool_stop_CV, pool_CV. That is the green block on the card — the one field that turns this from bookkeeping into evidence.

The lesson the card teaches is the one this book repeats: put what you always filter on in columns, and what varies by event in jsonb. You will always join pooling events to their batch and unit and slice them by phase and time — so those are columns, indexed ((batch_id, start_ts)). You will read the pooling window but rarely filter the database on pool_CV — so it lives in jsonb, where a new event type (a hold, a filter integrity test) can bring its own keys without a schema migration.

The same row as a triple, a shape, and a competency question

The Elution row is a relational fact, but the knowledge-graph chapter shows the same fact lifted into RDF (the Resource Description Framework, the graph data model whose unit of fact is a subject-predicate-object triple), where the foreign keys become edges you can walk. The pooling decision is then not a jsonb blob hanging off one table but a small set of triples about the elution event, and that reframing is what lets a downstream investigation query across systems. Four triples carry the load: one ties the event to its unit, two record the window, and one ties it to its batch:

# the pooling decision as RDF triples (bp: is the local bioprocess vocabulary)
bp:ev-PA01-elution-001 bp:onUnit    bp:PA01 ;
                       bp:pooledFrom 15.0 ;
                       bp:pooledTo   16.917 ;
                       bp:partOfBatch bp:BATCH-2026-001 .

Because the graph stores bp:partOfBatch and bp:onUnit as edges rather than implied joins, the SPARQL property paths of the semantics chapter walk straight from a failed drug-substance lot back through this pooling event to the batch and column that produced it — the digital-thread traversal that Book 4, Ontologies for Biopharmaceutical Manufacturing, makes the spine of lot genealogy.

The closed-world half — did the elution event actually record a pooling window? — is exactly what a SHACL shape (the Shapes Constraint Language — a W3C standard that validates a graph against required structure) checks, and it is the gate Book 4's release-gate chapter builds the whole release decision from. An open-world reasoner treats a missing pooledFrom as merely unknown; SHACL treats it as a failure, now — which is precisely the discipline a GMP record needs, because a pooling event with no window is not an open question but an incomplete batch record:

# an elution event must carry exactly one pooling window (closed-world)
bp:PoolingShape a sh:NodeShape ;
    sh:targetClass bp:ElutionEvent ;
    sh:property [ sh:path bp:pooledFrom ; sh:minCount 1 ; sh:maxCount 1 ;
                  sh:message "Elution event is missing its pooling-window start." ] ;
    sh:property [ sh:path bp:pooledTo ;   sh:minCount 1 ; sh:maxCount 1 ;
                  sh:message "Elution event is missing its pooling-window stop." ] .

That shape is the graph-native counterpart of the database's NOT NULL on attributes and the chapter's test_ch10_phase_detection_and_pooling assertion: three artifacts (a column constraint, a unit test, and a SHACL shape) guarding the same rule at three layers. The column constraint guarantees a payload exists, the test asserts the run pooled, and the shape demands a complete window. The question a reviewer actually asks of the graph (which batches pooled outside the validated band?) is a competency question, the ontology engineer's term for a query the vocabulary must be able to answer, used as an acceptance test. It is answerable by a one-line SPARQL ASK or SELECT that filters bp:pooledFrom/bp:pooledTo against the 14.5–17.5 CV window. Book 4 turns 23 such questions into runnable PASS/FAIL checks; the pooling window is one more fact those checks can stand on. The same bp:onUnit edge is what makes 3MCC (below) a graph problem rather than a schema change: three columns are three bp:PA01/bp:PA02/bp:PA03 nodes, and a bp:derivedFrom edge from the cumulative pool to each column's contribution keeps the genealogy intact across the switches.
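The closed-world check and the competency question can both be illustrated without a triple store. The sketch below is a plain-Python stand-in, not rdflib, pySHACL, or SPARQL: triples are tuples in a set, the `rdf:type` triples are added here because the shape targets bp:ElutionEvent, and the second event and batch (the `-002` names) are hypothetical, included only to show a failing case.

```python
TRIPLES = {
    ("bp:ev-PA01-elution-001", "rdf:type", "bp:ElutionEvent"),
    ("bp:ev-PA01-elution-001", "bp:onUnit", "bp:PA01"),
    ("bp:ev-PA01-elution-001", "bp:pooledFrom", 15.0),
    ("bp:ev-PA01-elution-001", "bp:pooledTo", 16.917),
    ("bp:ev-PA01-elution-001", "bp:partOfBatch", "bp:BATCH-2026-001"),
    # Hypothetical incomplete event: window start recorded, stop missing.
    ("bp:ev-PA01-elution-002", "rdf:type", "bp:ElutionEvent"),
    ("bp:ev-PA01-elution-002", "bp:pooledFrom", 14.2),
    ("bp:ev-PA01-elution-002", "bp:partOfBatch", "bp:BATCH-2026-002"),
}


def objects(triples, subject, predicate):
    """All objects of (subject, predicate, ?o)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def elution_events(triples):
    return sorted(s for s, p, o in triples
                  if p == "rdf:type" and o == "bp:ElutionEvent")


def validate_pooling(triples):
    """Closed-world check: exactly one pooledFrom and one pooledTo per event."""
    violations = []
    for event in elution_events(triples):
        for predicate in ("bp:pooledFrom", "bp:pooledTo"):
            count = len(objects(triples, event, predicate))
            if count != 1:
                violations.append((event, predicate, count))
    return violations


def batches_outside_band(triples, low=14.5, high=17.5):
    """Competency question: which batches pooled outside the validated band?"""
    flagged = set()
    for event in elution_events(triples):
        starts = objects(triples, event, "bp:pooledFrom")
        stops = objects(triples, event, "bp:pooledTo")
        if any(v < low for v in starts) or any(v > high for v in stops):
            flagged.update(objects(triples, event, "bp:partOfBatch"))
    return sorted(flagged)


print(validate_pooling(TRIPLES))
print(batches_outside_band(TRIPLES))
```

The missing pooledTo on the second event is reported as a violation with a count of zero, which is the closed-world reading: absence is a failure, not an unknown. The band query then flags the hypothetical second batch because its window starts at 14.2 CV, below the 14.5 CV limit.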

Where this row comes from in the trilogy

This operation_event row is the open-source implementation of a physical decision made on the plant floor. The column and chromatogram it describes are the capture and polishing steps Book 1 builds in Capture Chromatography and Polishing Chromatography; the same event_type discriminator later records the hold and filtration steps from harvest, viral inactivation, and viral filtration. The decision-bearing UV trace itself (from the chromatography data station whose pooling window we just wrote down) is the data-point Book 2 frames in A Tour of Where Process Data Is Born. Book 1 is the physical step, Book 2 is the data-point and its open challenge, and this chapter is the code and the row that close the loop.

Hold times, integrity tests, and the rest of the train

Protein A is only the first skid. The same operation_event pattern records the rest of the downstream train, and the event_type discriminator (phase | pool | hold | excursion, stored as plain text) is built for it:

Viral inactivation is a low-pH hold — the eluate is titrated (adjusted by adding acid) to a validated low-pH set point (~3.5 here) and held for a validated minimum time, because low pH inactivates any enveloped viruses that could have ridden along from the cell culture, giving the train a dedicated viral-safety barrier. That hold has a defined maximum hold time before the next step, and both the minimum and maximum are GMP-critical, must be validated, and must be recorded [8]. A hold event with start_ts, end_ts, and an attributes payload of {"target_min": 60, "actual_min": 64, "pH": 3.5} is the whole evidence trail.
Filtration (viral filtration, sterile filtration) is bracketed by pre-use and post-use integrity tests, and each test's pass/fail verdict is a discrete, recorded event: {"test": "bubble_point", "psi": 51, "spec_min_psi": 45, "result": "pass"}. Pressure and flow trends from the skid go to the historian; the verdict goes here.
Polishing (cation/anion exchange) is another bind-and-elute or flow-through chromatography step — the same phase-detection and pooling code applies, just with different thresholds.
UF/DF concentrates and buffer-exchanges the drug substance; the recorded decisions are whether the concentration-factor and diavolume targets were met.

The honest note: a maximum hold-time breach is exactly the kind of excursion event that must surface to quality. Capturing it is easy; the workflow that routes, investigates, and dispositions it is the regulated part, and we build that in Part V.
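A minimal sketch of how a low-pH hold might be turned into either a hold or an excursion event is shown here. The function name, the reason strings, and the 120-minute maximum are assumptions for illustration (the chapter gives only the 60-minute target, the 64-minute actual, and pH 3.5), and the sketch measures the maximum over the same interval as the minimum, which a real process description may define differently.

```python
from datetime import datetime, timedelta, timezone

MAX_HOLD_MIN = 120  # hypothetical validated maximum; not a number from the chapter


def classify_hold(start_ts, end_ts, target_min, max_min, ph):
    """Return (event_type, attributes) for one low-pH hold."""
    actual_min = (end_ts - start_ts).total_seconds() / 60.0
    attributes = {
        "target_min": target_min,
        "actual_min": round(actual_min, 1),
        "pH": ph,
    }
    if actual_min < target_min:
        attributes["reason"] = "below_minimum_hold"  # hypothetical label
        return "excursion", attributes
    if actual_min > max_min:
        attributes["reason"] = "above_maximum_hold"  # hypothetical label
        return "excursion", attributes
    return "hold", attributes


start = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
ok = classify_hold(start, start + timedelta(minutes=64), 60, MAX_HOLD_MIN, 3.5)
late = classify_hold(start, start + timedelta(minutes=130), 60, MAX_HOLD_MIN, 3.5)
print(ok)
print(late)
```

Whether a breach replaces the hold row or adds a second excursion row alongside it is a design choice this sketch does not settle; it returns a single event type to keep the example short.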

When the threshold becomes a learned model

The 100 mAU rule is a fixed constant, and a fixed constant needs no validation beyond "is it the approved number?". The moment a plant pools on inline-measured concentration instead — the spectral-deconvolution path the field is moving toward — the threshold is replaced by a learned model (a partial-least-squares or similar calibration mapping the inline UV/Vis spectrum to a concentration), and that model inherits the entire validation apparatus Book 5, Machine Learning & AI for Biomanufacturing, is devoted to. It is worth being precise about what changes, because the change is easy to underestimate.

First, how you prove it works. A concentration calibration cannot be scored by splitting rows of one chromatogram at random: neighbouring samples on a single elution peak are almost identical, so a within-run split lets the model interpolate between near-duplicate points and reports a flatteringly high accuracy that will not survive a new batch. The honest test holds out whole batches — a grouped, leave-one-batch-out cross-validation — so the score is measured on a column run the model never saw, which is the only number a reviewer should trust. This is the same leakage trap and the same batch-grouped discipline Book 5's data chapter and models-and-validation chapter build the whole soft-sensor argument on, and it is why this chapter's repo deliberately ships the fixed-threshold version: a constant has no folds to leak across.
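The batch-grouped split itself is small enough to sketch in plain Python. The batch labels and sample counts below are invented, and the helper name is hypothetical; the point is only that every fold holds out all samples of one batch, so no run contributes to both sides of the split.

```python
from collections import defaultdict


def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_idx, test_idx), one fold per batch."""
    by_batch = defaultdict(list)
    for i, batch in enumerate(batch_ids):
        by_batch[batch].append(i)
    for held_out in sorted(by_batch):
        test_idx = by_batch[held_out]
        train_idx = [
            i
            for batch, idx in by_batch.items()
            if batch != held_out
            for i in idx
        ]
        yield held_out, train_idx, test_idx


# Hypothetical: one batch label per spectrum sampled along each elution peak.
batch_ids = ["B1"] * 4 + ["B2"] * 3 + ["B3"] * 5
folds = list(leave_one_batch_out(batch_ids))

for held_out, train_idx, test_idx in folds:
    # The held-out batch must never appear among the training rows.
    assert held_out not in {batch_ids[i] for i in train_idx}
    print(held_out, len(train_idx), len(test_idx))
```

With scikit-learn available, `sklearn.model_selection.LeaveOneGroupOut` produces the same kind of split when the batch labels are passed as `groups`.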

Second, knowing when not to trust it. A learned pooling model is only valid inside the region of peak shapes, buffers, and resin ages it was calibrated on; a fouled resin or an off-spec load can push a run outside that envelope, where the model extrapolates silently and confidently. The guard is an applicability-domain check — a Hotelling T² and squared-prediction-error test, the release chapter's gate — that flags a spectrum lying outside the training envelope before its concentration is acted on, so the fraction-collector valve never moves on an out-of-distribution number. The fixed threshold has a crude analogue (the stored peak_UV_mAU lets a reviewer see a peak that is the wrong height), but a learned model needs the formal gate.
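To make the gate concrete, here is a deliberately small sketch of a T² and SPE check on a one-component PCA model, written in plain Python. Everything in it is an assumption for illustration: the three-channel synthetic "spectra", the single component, and above all the limits, which are simply the largest values seen in training rather than the statistically derived control limits a validated gate would use.

```python
import math


def fit_one_component(X, iters=50):
    """Mean-centre X and find its leading loading vector by power iteration."""
    n, m = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - mean[j] for j in range(m)] for row in X]
    p = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        t = [sum(a * b for a, b in zip(row, p)) for row in Xc]
        q = [sum(t[i] * Xc[i][j] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in q))
        p = [v / norm for v in q]
    t = [sum(a * b for a, b in zip(row, p)) for row in Xc]
    score_var = sum(v * v for v in t) / (n - 1)
    return mean, p, score_var


def t2_and_spe(x, model):
    """Hotelling T² (distance inside the model) and SPE (distance off it)."""
    mean, p, score_var = model
    xc = [a - b for a, b in zip(x, mean)]
    t = sum(a * b for a, b in zip(xc, p))
    spe = sum((a - t * b) ** 2 for a, b in zip(xc, p))
    return t * t / score_var, spe


# Synthetic training spectra: one fixed shape scaled by concentration, plus
# a small deterministic wiggle standing in for measurement noise.
SHAPE = [1.0, 0.5, 0.2]
train = [
    [0.1 * i * s + 0.01 * math.sin(7 * i + 3 * j) for j, s in enumerate(SHAPE)]
    for i in range(1, 11)
]
model = fit_one_component(train)
train_stats = [t2_and_spe(x, model) for x in train]
T2_LIMIT = max(t2 for t2, _ in train_stats)    # crude stand-in for a real limit
SPE_LIMIT = max(spe for _, spe in train_stats)  # crude stand-in for a real limit


def in_domain(x):
    t2, spe = t2_and_spe(x, model)
    return t2 <= T2_LIMIT and spe <= SPE_LIMIT


typical = [0.55 * s for s in SHAPE]         # mid-range, right shape
wrong_shape = [0.1, 0.25, 0.5]              # shape the model never saw
too_concentrated = [3.0 * s for s in SHAPE]  # right shape, far beyond training
for name, x in [("typical", typical), ("wrong_shape", wrong_shape),
                ("too_concentrated", too_concentrated)]:
    print(name, in_domain(x))
```

The two statistics catch different failures: the wrong-shape spectrum is flagged by SPE because it lies off the model plane, while the over-concentrated one is flagged by T² because it lies far along it.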

Third, watching it age. The peak this model reads drifts for purely physical reasons: resin fouls over its lifetime, a new raw-material lot shifts the baseline, a scale change moves the hydrodynamics. That is covariate shift (the input distribution moving) on top of the genuine process drift the analytics chapter already charts with SPC. The two are distinct and both matter: a control chart on the pool titer catches the process wandering, while a population-stability index on the inline spectra catches the model's inputs wandering, often earlier and without waiting for the offline assay. This is the leading-versus-lagging detector pair Book 5's MLOps chapter builds. And because a GMP model must be locked (frozen under change control, so the model making today's pooling decision is the one that was validated), its lineage belongs in the record next to the pooling event: which model version, trained on which dataset hash, decided this window. The attributes jsonb is the natural home for that pin ({"model": "pool-conc-pls@v2", "dataset_sha256": "…"}), turning the pooling row into a node in the auditable model-lineage graph the models-and-validation chapter and Book 4's instance graph describe. The fixed threshold needs none of this; the learned model needs all of it, which is the honest reason most plants still pool on the constant and reach for the model only where the accuracy genuinely pays for the validation burden.
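A population-stability index is simple enough to sketch directly. The version below bins one scalar feature (say, a baseline absorbance summarised from each spectrum, which is an assumption about how the spectra are reduced) using equal-width bins taken from the reference data, and the data values are synthetic. The 0.25 alarm level mentioned in the comment is a commonly quoted rule of thumb, not a number from this chapter.

```python
import math


def psi(reference, current, n_bins=5, eps=1e-6):
    """Population-stability index of `current` against `reference`.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the edge bins. Assumes the reference values
    are not all identical (otherwise the bin width would be zero).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            k = int((v - lo) / width)
            counts[min(max(k, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0)
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic feature values: a reference population and a shifted one.
reference = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in reference]

psi_same = psi(reference, reference)
psi_shifted = psi(reference, shifted)
print(psi_same, psi_shifted)  # a PSI above roughly 0.25 is often read as a major shift
```

In practice the index would be tracked per feature over a rolling window of recent runs, next to the SPC chart on the pool titer.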

The intensified variant: multi-column continuous capture

The whole chapter so far describes a batch column: one column, one cycle, idle between cycles. The modern, intensified alternative is multi-column continuous capture (3MCC / PCC): three or four small Protein A columns plumbed so that while one is eluting, the next is loading and capturing the breakthrough from the first. This is what the perfusion bioreactor (the continuous bioreactor mode, which harvests product steadily rather than in a single harvest at the end of a fed-batch run) feeds into, and it dramatically improves resin utilization (how much of each column's binding capacity you actually use) because you can deliberately load past breakthrough, knowing a downstream column captures the product that leaks through.

For data capture, 3MCC changes one thing fundamentally: instead of one clean phase sequence, you have several columns each in a different phase at the same instant, with the controller orchestrating switch events between them. The operation_event model handles this without change — every row carries its own unit_id, so three concurrent phase timelines (one per column, e.g. PA01, PA02, PA03) simply become three streams of rows under the same batch_id. (The seed only provisions the single-column unit PA01 in s88.unit; because operation_event.unit_id is a foreign key to s88.unit, a real 3MCC run would first need PA02 and PA03 added there before those rows could insert.) The pooling logic moves up a level: you pool a cumulative product stream across column switches rather than one peak. The repo's fed-batch path uses the single-column simulator above; in a 3MCC run the same single-column detector would simply be applied per column.
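The "three streams of rows under one batch_id" idea can be sketched with plain dicts. The rows, the batch ID, and the minute-based start_min/end_min fields are invented for illustration (real rows would carry timestamps), and the staggered schedule is far simpler than a real 3MCC switching sequence.

```python
from collections import defaultdict

# Hypothetical phase rows for one batch; three columns staggered by 10 minutes.
rows = []
for unit_id, offset in [("PA01", 0), ("PA02", 10), ("PA03", 20)]:
    for k, phase in enumerate(["Load", "Wash", "Elution"]):
        rows.append({
            "batch_id": "B-042",
            "unit_id": unit_id,
            "phase": phase,
            "start_min": offset + 10 * k,
            "end_min": offset + 10 * (k + 1),
        })


def timelines_by_unit(rows):
    """Group rows into one time-ordered phase list per column."""
    timelines = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["start_min"]):
        timelines[row["unit_id"]].append(row["phase"])
    return dict(timelines)


def phases_at(rows, t_min):
    """Which phase is each column in at minute t_min?"""
    return {
        r["unit_id"]: r["phase"]
        for r in rows
        if r["start_min"] <= t_min < r["end_min"]
    }


timelines = timelines_by_unit(rows)
snapshot = phases_at(rows, 25)
print(timelines)
print(snapshot)
```

The snapshot is the part a single-column model never has to answer: at one instant, each column reports a different phase, and nothing about the row structure had to change to represent it.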

Why it matters

Downstream is where the molecule becomes the drug, and where most of the yield and most of the risk live. A pooling decision made two CV too late can fail a host-cell-protein spec; a missed hold-time can scrap a batch worth a quarter-million dollars. The reason we go to the trouble of segmenting traces into ISA-88 phases and writing structured event rows is that those rows are what a batch-record review, an OOS (out-of-specification — a result outside its accepted limit) investigation, and an inspector all read. A wall of raw UV samples proves nothing on its own; the sentence "the operator pooled between 15.0 and 16.9 CV, inside the validated 14.5–17.5 CV window" — reconstructed automatically from the trace — is the proof.

In the real world

In a commercial mAb plant the chromatography skids are typically Cytiva ÄKTA process systems or equivalent, and the data of record lives in a vendor chromatography data system (CDS) and the MES (Manufacturing Execution System) batch record. Our OSS stack does not replace those — it contextualizes and historizes alongside them. That boundary is a recurring honesty theme of this book.

A few real-world anchors:

Data formats: vendor CDS exports lock chromatograms into proprietary files. The representative example most readers will meet is Cytiva UNICORN, the control-and-CDS software that drives ÄKTA systems (one widely-deployed platform among several): a UNICORN run lands in a native .res result file, and newer UNICORN versions store the result as a zip archive of XML rather than a single binary blob. Either way it is a proprietary container, so it is not something Benchling's allotropy (or any text/CSV/Excel canonicalizer) can turn into Allotrope ASM JSON — that path is for textual exports, not for the .res/zip-of-XML record itself. Where you need to interchange a chromatogram — into a data lake, between sites, into a regulatory submission — the vendor-neutral ASTM ANDI/NetCDF chromatography format (ASTM E1947) is the long-standing standard, and exporting .cdf rather than a vendor blob is the FAIR-friendly choice [9], [13].
Standards bite here. Annex 11 (EU GMP for computerised systems) requires that a system recording GMP-relevant decisions and changes — like a pooling decision — generate a reviewable, time-stamped audit trail [10]. PIC/S PI 041-1 reinforces that those pooling and hold-time records must be attributable, contemporaneous, and complete (the ALCOA+ data-integrity attributes — Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available), and that it is the data flow and risk, not just whether the IT works, that governs how you capture decision data [11].
Scale-up changes the numbers, not the record. Chromatography's CV clock is what makes a process scale-independent on paper: a method developed on a 1 mL column at 0.5 CV/min transfers to a commercial-scale column of tens of litres by holding CV, residence time, and linear flow velocity (bed-height-per-minute) constant while bed diameter grows. But the physics does not fully cooperate — wall effects, packing uniformity, and pressure-flow limits all shift with diameter, so the pooling thresholds and DBC qualified at small scale must be confirmed at manufacturing scale, not assumed. That confirmation is the regulated work of tech transfer, and a new skid is only fit to run a GMP batch once it has passed IQ/OQ/PQ — Installation Qualification (the equipment is installed and connected as specified), Operational Qualification (it runs across its operating ranges, e.g. the gradient pump and UV detector behave to spec), and Performance Qualification (it makes in-spec product on the actual process). The operation_event row is indifferent to scale — PA01 could be a 1 L development column or a 50 L commercial one — but the thresholds and limits its attributes enforce are scale-specific and carry a qualification pedigree, which is why the validated 14.5–17.5 CV band is a number tied to a particular column, resin lot, and qualified skid, not a universal constant.
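The scale-up arithmetic in the last item can be worked through in a few lines. The column geometries here (a 20 cm bed at 8 cm and 56 cm diameter) are hypothetical, chosen only because they come out near the 1 L and 50 L columns mentioned above; the 2-minute residence time is the chapter's 1 CV = 2 min.

```python
import math


def column_volume_l(diameter_cm, bed_height_cm):
    """Packed-bed volume in litres (1 L = 1000 cm^3)."""
    return math.pi * (diameter_cm / 2) ** 2 * bed_height_cm / 1000.0


def flow_l_per_min(cv_l, residence_time_min):
    """Volumetric flow that passes one CV in one residence time."""
    return cv_l / residence_time_min


def linear_velocity_cm_per_h(flow_l_min, diameter_cm):
    """Superficial linear velocity: volumetric flow over cross-sectional area."""
    area_cm2 = math.pi * (diameter_cm / 2) ** 2
    return flow_l_min * 1000.0 / area_cm2 * 60.0


BED_HEIGHT_CM = 20.0      # hypothetical, held constant across scales
RESIDENCE_TIME_MIN = 2.0  # 1 CV = 2 min, as in the chapter

results = {}
for name, diameter_cm in [("development", 8.0), ("commercial", 56.0)]:
    cv = column_volume_l(diameter_cm, BED_HEIGHT_CM)
    flow = flow_l_per_min(cv, RESIDENCE_TIME_MIN)
    velocity = linear_velocity_cm_per_h(flow, diameter_cm)
    results[name] = (cv, flow, velocity)
    print(f"{name}: CV={cv:.2f} L, flow={flow:.2f} L/min, u={velocity:.0f} cm/h")
```

With bed height and residence time fixed, the volumetric flow grows with the square of the diameter while the linear velocity stays the same, which is exactly why the method transfers on paper and why the pressure-flow and packing effects still have to be confirmed on the real column.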

The honest OSS-vs-commercial verdict for this layer: the open-source stack does the capture and contextualization beautifully. Python + the phase detector turns a raw trace into ISA-88 events; PostgreSQL stores them with full structure; Node-RED with community finite-state-machine nodes can run the same segmentation as a live flow at the edge, triggering pool and hold events as they happen [12]. That gets you most of the way. What pure OSS does not give you out of the box is the validated, Part 11-grade e-signature (21 CFR Part 11 — the US FDA rule on electronic records and signatures) on the pooling decision, the locked-down change control on the pooling thresholds, or vendor accountability for the CDS that physically commanded the fraction-collector valve. Those are the GxP (the family of regulated good-practice rules, GMP among them) last mile — the audit trail, signing, and validation we build (honestly, and with their limits) in Part V.

Key terms

Chromatography skid — the automated system (pumps, valves, UV/conductivity/pH detectors, fraction collector) that runs a packed column through its phases.
Protein A capture — the platform affinity step that selectively binds the antibody's Fc region (its constant stem), giving high purity in one step.
Column volume (CV) — the volume of the packed resin bed; the scale-independent clock for chromatography (here 1 CV = 1 L, pumped in 2 min).
Phase / operation (ISA-88) — a named step within a procedure (equilibration, load, wash, elution, strip, CIP); the unit of segmentation.
Breakthrough — product washing through the column unbound once it nears its binding capacity; visible as a rising UV shoulder during load.
Dynamic binding capacity (DBC) — how much product a litre of resin can bind under flow before breakthrough; here ~58 g/L.
Pooling window — the start/stop volumes between which eluate is collected into the product pool; the chapter's central GMP-critical decision.
Hold time — the validated minimum (and maximum) time a pool sits at a step, e.g. low-pH viral inactivation.
Integrity test — a pass/fail check (e.g. bubble point) on a filter before/after use, recorded as a discrete event.
3MCC / PCC — multi-column / periodic counter-current continuous capture; the intensified alternative to a single batch column.
PAT — Process Analytical Technology; using real-time measurement (here inline UV) to make in-process quality decisions.
operation_event — the PostgreSQL table that bridges the time-series stream and the batch record; one row per recorded event (an ISA-88 phase, or a pool, hold, or excursion event), with structured columns (batch_id, unit_id, phase, the time window) plus an open-ended attributes jsonb payload.
attributes (jsonb) — the per-event JSON payload on an operation_event row; empty on bookkeeping rows, carrying the pooling window on the Elution row, a hold result on a hold row, and so on, without a schema migration.
Triple / SHACL shape — RDF's unit of fact (subject-predicate-object), and the closed-world constraint that checks a graph has the structure it must (here, that an elution event records its pooling window); the graph-native restatement of the row's NOT NULL and unit test.
Competency question — a query the ontology must be able to answer, used as an acceptance test; "which batches pooled outside the validated band?" is one such question over the pooling triples.
Grouped (leave-one-batch-out) cross-validation — holding out whole batches, not random rows, so a learned pooling-by-concentration model is scored on a column run it never saw; a within-run split leaks near-duplicate samples and flatters the score.
Applicability domain — the region of peak shapes, buffers, and resin ages a learned model is valid over; a Hotelling T²/SPE gate flags an out-of-envelope spectrum before its concentration drives the valve.
Covariate shift vs process drift — the model's inputs moving (resin fouling, a new lot's baseline) versus the process itself wandering; caught by a population-stability index on the spectra and an SPC chart on the pool titer respectively.
IQ/OQ/PQ — Installation, Operational, and Performance Qualification; the staged proof that a chromatography skid is installed, runs across its ranges, and makes in-spec product before it may run a GMP batch — the tech-transfer gate behind a scale-specific pooling band.

Where this leads

We have captured the process from sensor to skid — every in-line tag, every phase, every pooling decision. But the molecule's quality is ultimately judged off-line, by instruments: HPLC for purity, assays for host-cell protein, a balance for concentration. The next chapter, The Analytical Lab: Instruments, LIMS & ELN, leaves the production floor for the QC lab and shows how to capture that data — the offline results that confirm whether the decisions we just recorded actually produced a drug worth releasing.
