Harvest and Clarification: Predicting the Endpoint

📍 Where we are: Part III · Upstream, Learned — Chapter 12. The production bioreactor ran for two weeks and made the antibody; now the broth is full of cells we must get rid of. This chapter learns the moment to stop the run and the consequences of how it ended.

The previous chapter modeled the inside of the tank: soft sensors reading titer and viable cell density from Raman, closed-loop glucose control, a digital twin forecasting the trajectory. Everything pointed at one decision the models barely touched — when is this batch done? Harvest is the hinge between upstream and downstream, and it is one of the most underrated machine-learning targets in the whole process: a batch held a day too long keeps making antibody but also keeps dying, spilling intracellular junk — host-cell protein (HCP — the cells' own proteins, a contaminant regulators cap), DNA, lipases (fat-splitting enzymes that can attack the drug) — that the entire downstream train then has to claw back out. Harvest a day early and you leave titer on the table. The endpoint is an optimization, not a calendar.

Then comes clarification: separating the antibody-bearing fluid from the cells and debris, usually a centrifuge followed by a train of depth filters, so the next step (Protein A capture — the first purification column, which grabs the antibody and lets impurities flow past) sees clean liquid instead of a slurry. How well clarification goes — how big a filter you need, whether it clogs mid-run — is almost entirely decided upstream, by the cell density and viability the harvest decision produced. This chapter learns three coupled things: the harvest endpoint, the turbidity of what you are about to clarify, and the filter you will need to clarify it. They are the same problem viewed from three angles, and the running example's clarified pool, CLAR-001, is what they collectively produce.

The simple version

Think of straining a pot of stock. If you let it cook too long, the bones start to break down and cloud the broth — straining gets slow and the filter clogs. Stop too early and you lose flavor. A good cook learns the moment to pull it, and learns to size the strainer to how cloudy the batch looks. A harvest model is that learned instinct: it watches the cheap signals — cell count, how many cells are still alive, the spectrum — and predicts both the right moment to stop and how hard the straining will be, so the kitchen has the right-sized strainer ready before the pot comes off the heat.

What this chapter covers

Framing the harvest decision as a learning problem — what the target really is (it is not "day 14"), and why upstream signals at the endpoint are the features.
Turbidity soft sensing — predicting harvest-feed cloudiness (NTU) from VCD, viability, and spectra, the proxy for "how hard will this be to clarify."
Filter sizing and Vmax prediction — turning a small-scale Vmax/throughput test into a model that sizes the depth-filter area for the full batch, and predicting clog risk.
The upstream→clarification link — regressing clarification performance and downstream HCP burden on the VCD and viability the harvest produced; the CLAR-001 node.
A runnable model module, examples/platform/ml/harvest_endpoint.py, and the anatomy of one harvest-decision record.
The honest open problem: the endpoint label is censored and the clarification feedback is slow.

The harvest decision is an optimization, not a date

A naive plant harvests on a fixed day: "we always pull on day 14." A learning plant treats the endpoint as the argmax (the day t that makes the objective largest) of an objective that trades the amount of product against its quality cost. Concretely, let day t of the run carry a viable cell density VCD(t) (how many live cells per mL of broth), a viability via(t) (the share of all cells, live and dead, that are still alive, as a percent), and a titer titer(t) (the concentration of antibody product in the broth, in g/L). Harvesting later raises titer(t) (cells keep secreting) but lowers via(t), and falling viability is the single best leading indicator of the debris and HCP that clarification and capture must remove. The harvest objective is, loosely:

J(t) = recoverable_titer(t)  −  λ · downstream_burden(t)

where downstream_burden(t) rises as viability falls and lysed-cell content climbs, and λ encodes how expensive that burden is to remove. The endpoint is t* = argmax_t J(t) subject to hard constraints — a viability floor (many fed-batch platforms — cultures fed a concentrated nutrient stream over the run rather than topped up continuously — will not harvest below roughly 70% viability, though the exact floor is product- and platform-specific and is really a downstream-impurity-burden choice rather than a fixed biological limit — a perfusion or intensified process can run a different target), a turbidity ceiling the clarification train can handle, and the bioburden (the microbial-contamination load a sterile process must keep in check) and scheduling realities of a GMP suite (the regulated, controlled manufacturing space — GMP = Good Manufacturing Practice). Machine learning enters because almost every term in J(t) is a quantity we can only measure slowly or partially: titer(t) lags hours behind in the lab, via(t) comes from a twice-a-day offline count, and downstream_burden(t) is not measured at harvest at all — it only reveals itself days later in the capture pool's HCP result. So the harvest model is really a stack of soft sensors feeding a constrained decision.

It is worth being explicit about what kind of optimization this is, because the structure shapes every modeling choice that follows. J(t) is evaluated only over the discrete sample times you actually have — twice-a-day offline counts, not a continuum — so the argmax is a search over a dozen-or-so candidate endpoints, not a calculus problem. Each candidate carries a predicted burden, not a measured one, so the objective is a function of model outputs, and its curvature near the optimum is shallow: the difference between harvesting at sample 26 and sample 28 might be a few tenths of a g/L of titer against a steep rise in dead-cell load. That shallow, noisy peak is exactly the regime where a hard feasibility constraint dominates the smooth objective — the viability floor, not the titer curve, is usually what picks the day. A learning plant therefore spends more of its modeling effort getting the constraint surrogates right (will viability cross 70%? will turbidity exceed the train's ceiling?) than polishing the value term, because the constraints are what bind.
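The discrete, constraint-dominated search described above is small enough to write out. What follows is a minimal sketch, not the book's module: the Candidate class, the pick_endpoint function, the four candidate samples, their burden values, and the 90 NTU default ceiling are all invented for illustration. Only the 70% viability floor echoes the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One candidate endpoint: a sample time with model-predicted quantities."""
    label: str
    titer_g_l: float       # predicted recoverable titer
    viability_pct: float   # measured or predicted viability
    turbidity_ntu: float   # predicted harvest-feed turbidity
    burden: float          # predicted downstream burden, arbitrary units


def pick_endpoint(candidates: Sequence[Candidate], lam: float,
                  viability_floor: float = 70.0,
                  turbidity_ceiling: float = 90.0) -> Optional[Candidate]:
    """Return the feasible candidate that maximizes J = titer - lam * burden."""
    feasible = [c for c in candidates
                if c.viability_pct >= viability_floor
                and c.turbidity_ntu <= turbidity_ceiling]
    if not feasible:
        return None  # no sample satisfies the hard constraints
    return max(feasible, key=lambda c: c.titer_g_l - lam * c.burden)


# Hypothetical late-run samples (made-up numbers, not BATCH-2026-001 data).
cands = [
    Candidate("s1", titer_g_l=3.9, viability_pct=82.0, turbidity_ntu=50.0, burden=1.0),
    Candidate("s2", titer_g_l=4.6, viability_pct=75.0, turbidity_ntu=62.0, burden=1.6),
    Candidate("s3", titer_g_l=5.1, viability_pct=71.0, turbidity_ntu=74.0, burden=2.4),
    Candidate("s4", titer_g_l=5.4, viability_pct=67.0, turbidity_ntu=85.0, burden=3.5),
]

constrained = pick_endpoint(cands, lam=0.2)
unconstrained = pick_endpoint(cands, lam=0.2, viability_floor=0.0)
print(constrained.label, unconstrained.label)
```

With a small λ the smooth objective alone would take the last sample, and it is the viability floor that moves the choice one sample earlier. Raising λ to 1.0 in this toy data moves the choice earlier still, which is the sense in which λ prices the downstream burden.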

In our running example, BATCH-2026-001's offline panel tells the story the model has to read. Over the last three days the viable cell density plateaus and then turns down while viability slides — the panel runs from a feasible sample at 2026-01-17 18:00 (21.63e6 cells/mL, 76.8% viability, 4.221 g/L titer) through 2026-01-18 06:00 (19.80e6 cells/mL, 72.9%, 5.475 g/L) to the final recorded sample at 2026-01-18 18:00, which is at 19.66e6 cells/mL at 68.0% viability with a titer of 5.877 g/L [1]. The titer is still climbing — almost 1.7 g/L gained over those last 24 hours — but the viability has now slid below the 70% floor most platforms respect, and the rising ammonia (from 9.06 to 10.57 mM across the same window) is the metabolic-stress signature of a culture that is starting to lyse (cells bursting open and spilling their contents, which is where the extra ammonia comes from). That tension — more product, worse feed — is the harvest decision, and it is exactly what a model has to weigh.

Turbidity soft sensing: predicting how hard clarification will be

The cleanest single proxy for "how hard will this batch be to clarify" is the turbidity of the harvest feed, measured in nephelometric turbidity units (NTU). High turbidity means lots of cells and debris in suspension, which means a centrifuge working harder and depth filters that clog faster. Turbidity at harvest is driven by total cell mass and, critically, by the dead and lysing fraction — which is why it correlates so strongly with falling viability. If you can predict harvest-feed turbidity a day ahead from the signals you already have, you can pre-stage the right clarification setup instead of discovering the problem when the filter pressure spikes.

Strictly, the depth filters do not see this raw broth turbidity: a disc-stack centrifuge first removes the bulk cell mass, and the filters are sized off the centrate turbidity (the centrate is the clarified liquid the centrifuge spins off) that survives it. The dead-cell-load argument carries through both steps, though, because a low-viability harvest both raises raw turbidity and produces smaller, more fragmented debris that the centrifuge clears less efficiently, so the centrate it hands the filters is dirtier on both counts.

A soft sensor is a model that predicts a quantity that is slow or expensive to measure directly (here, turbidity) from cheap signals you already have (VCD, viability, the spectrum). The soft-sensor framing here is identical to the titer soft sensor but with a different target. The features are the cheap, available signals at and near the endpoint: VCD, viability, the integral of viable cell density over the run (the IVCD, a measure of total productive cell-hours), lactate and ammonia (metabolic stress markers that track lysis), and — where an in-line probe exists — the Raman or capacitance spectrum (Raman reads a light-scattering fingerprint of what is dissolved in the broth; a capacitance probe senses live-cell membranes electrically) that already feeds the upstream soft sensors. The target is the offline turbidity of the harvest pool.

The reason turbidity tracks viability so tightly is physical, not statistical, and it is worth pinning down because it tells you which feature to weight. A live CHO cell is roughly 12–18 µm and intact; it scatters light, but it is a clean, deformable sphere that a depth filter's graded matrix retains gently. A dead cell is a different object: its membrane has lysed, releasing sub-micron debris, DNA strands, and aggregated host-cell protein that both scatter strongly (raising NTU) and foul a filter by plugging its finer pores rather than caking on its surface. So the right load variable is not VCD alone but the dead-cell density, which is the total cell mass times the dead fraction (1 − viability). The example module encodes exactly this: it builds a dead_load = (100 − end_viability_pct) · peak_VCD / 100 term and lets turbidity rise with it (plus an ammonia contribution for the lysis already underway). That is a deliberately interpretable surrogate, not a black box — turbidity climbs with the product of cell mass and the complement of viability, which is the textbook driver.

Because that relationship is monotone but mildly nonlinear (turbidity climbs sharply once viability drops past a knee), one might reach for a gradient-boosted tree. The example module deliberately does not, and the choice is the small-data lesson made concrete: on the simulated 60-batch cohort it fits a RidgeCV, an L2-regularized linear model that selects its penalty α by cross-validation over a log-spaced grid. A boosted tree, with hundreds of split decisions, can memorize the run-to-run noise of a few dozen batches and report a flattering training score that collapses on a genuinely new run. A regularized linear model is the right capacity for the data you actually have. This is the same bias-variance argument the data chapter makes in the abstract; here it is a single line of code that picks Ridge over trees, and it is the more defensible model precisely because its few, penalized coefficients leave it far less room to overfit a tiny cohort.
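The capacity argument can be checked on toy data. The sketch below is an assumption-laden illustration, not the example module: the 60-row synthetic cohort, its coefficients, and its noise level are invented, and the boosted model uses scikit-learn's default settings. It reports train and test R² for both models so the gap between them is visible.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 60-batch cohort: 5 features, linear signal plus noise.
# The coefficients and the noise level are invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(0.0, 1.0, 60)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.35, random_state=0)

models = {
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 25)),
    "boosted": GradientBoostingRegressor(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    scores[name] = (r2_score(ytr, model.predict(Xtr)),
                    r2_score(yte, model.predict(Xte)))
    print(f"{name:8s} train R2 = {scores[name][0]:.3f}  test R2 = {scores[name][1]:.3f}")
```

The number to watch is the gap between each model's train and test score. On a cohort this small the boosted model's training score sits close to 1 regardless of how it generalizes, which is why a training score alone says little. The exact values depend on the seed and on how linear the real relationship is.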

The feature engineering matters as much as the model. The module standardizes its features with a StandardScaler fit on the training split only — a small but real leakage discipline, because scaling on the full set leaks the test distribution's mean and variance into training. The five features it uses — peak VCD, end-of-culture viability, final lactate, final ammonia, and the integral VCD — are all summary statistics of a whole run, computed once per batch, which is the correct unit of analysis: clarification is a batch-level outcome, so the feature row and the label live at the batch level. In this module there is exactly one summary row per batch, so the train/test split is batch-level by construction and per-sample leakage is impossible here — there are no within-batch rows to scatter across the split. That makes this module the wrong place to show the sample-split trap; the place to see it bite is soft_sensor_pls, where every batch genuinely contributes many hourly spectra, so splitting by sample would let the same batch appear in both train and test and inflate every metric — the cardinal sin that module demonstrates live.
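For the many-rows-per-batch case, where the sample-split trap does apply, a group-aware splitter keeps every batch on one side of the split. This is a minimal sketch on made-up data: the 12 batches, 8 rows each, and the random feature matrix are hypothetical stand-ins for hourly spectra, and it is not the soft_sensor_pls module itself.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

# Hypothetical layout: 12 batches, 8 rows (for example hourly spectra) per batch.
n_batches, rows_per_batch = 12, 8
batch_ids = np.repeat(np.arange(n_batches), rows_per_batch)
X = np.random.default_rng(0).normal(size=(len(batch_ids), 4))

# Split by batch, not by row, so no batch appears on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=batch_ids))

train_batches = set(batch_ids[train_idx].tolist())
test_batches = set(batch_ids[test_idx].tolist())

# Fit the scaler on training rows only, then apply it to both splits.
scaler = StandardScaler().fit(X[train_idx])
X_train, X_test = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
print(sorted(test_batches))
```

A plain row-level train_test_split on the same array would almost always place rows from one batch on both sides, which is exactly the leakage the paragraph describes.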

It is worth being precise about what "turbidity" buys you that titer does not. Titer answers how much product; turbidity answers how dirty the feed. The harvest decision needs both, because the optimum is where the marginal product gained by waiting is no longer worth the marginal dirtiness incurred — and dirtiness is what the entire downstream train pays for. A turbidity soft sensor turns that abstract trade-off into a number the day before you have to act.

There is also an actuator on the other side of the prediction: high-turbidity harvests can be pretreated before the depth filter — acid or caprylic-acid precipitation, divalent-cation or polyelectrolyte flocculation (pDADMAC, chitosan), or stimulus-responsive polymers — which knock down debris, DNA, and HCP and raise the effective Vmax. So a turbidity soft sensor that flags a hard harvest a day ahead does not only size a bigger filter; it can trigger a pretreatment step staged in advance.

Filter sizing and Vmax: turning a 50 mL test into a full-batch decision

Once you have decided to harvest, you must clarify, and the practical question is brutally concrete: how much filter area do I need so the depth filters do not clog before the batch is through? Buy too little area and the filter blinds mid-run, pressure climbs, and you lose product or stop the line; buy too much and you have wasted single-use consumables that cost real money at scale. This is the classic filter sizing problem, and it has a well-established small-scale ritual that machine learning extends rather than replaces.

The ritual is the Vmax test (and its cousin, the constant-flow Pmax test). You run a sample of the real harvest feed, tens of milliliters, through a small disc of the actual depth-filter media and record the cumulative volume filtered versus time. Under the classical gradual pore-blocking model, the data linearize: plotting t/V against t gives a straight line whose slope is 1/Vmax, where Vmax is the maximum volume the filter can ever process per unit area before it fully blinds. You then size the full-scale filter so the batch volume stays comfortably under the area-scaled Vmax with a safety factor. It is elegant, cheap, and standard.

The math is worth stating in full because it clarifies what any model on top of it is actually predicting. Membrane fouling is described by a family of blocking laws, each a special case of a single differential equation d²t/dV² = k (dt/dV)^n, where the exponent n selects the mechanism: n = 2 is complete pore blocking, n = 1.5 standard (internal) pore constriction, n = 1 intermediate blocking, and n = 0 cake filtration. The Vmax method is the closed-form solution of the n = 1.5 constriction law at constant pressure, which integrates to

t / V(t)  =  (1 / Qi) + (t / Vmax)

where V(t) is cumulative filtrate volume per unit area, Qi is the initial flux (the starting flow rate per unit of filter area), and Vmax is the asymptotic capacity. A linear fit of t/V against t recovers 1/Vmax as the slope and 1/Qi as the intercept — two numbers from one cheap run. You then size area as A = (safety · batch_volume) / Vmax, where the safety factor (typically 1.3–1.5×) buys headroom against the day-to-day variability the single test cannot see.
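The linear fit and the sizing rule are a few lines of NumPy. The sketch below assumes noise-free synthetic data generated from the same equation; the function names, the 200 L/m² capacity, the 20 L/m²/min initial flux, and the 2000 L batch volume are invented for illustration.

```python
import numpy as np


def fit_vmax(t_min, v_l_per_m2):
    """Least-squares fit of t/V = 1/Qi + t/Vmax; returns (Vmax, Qi)."""
    t = np.asarray(t_min, dtype=float)
    v = np.asarray(v_l_per_m2, dtype=float)
    slope, intercept = np.polyfit(t, t / v, 1)  # slope = 1/Vmax, intercept = 1/Qi
    return 1.0 / slope, 1.0 / intercept


def filter_area_m2(batch_volume_l, vmax_l_per_m2, safety=1.5):
    """Size area as A = safety * batch_volume / Vmax."""
    return safety * batch_volume_l / vmax_l_per_m2


# Synthetic bench run built from the model itself (hypothetical parameters).
true_vmax, true_qi = 200.0, 20.0          # L/m2 and L/m2/min
t = np.linspace(1.0, 30.0, 30)            # minutes, starting after t = 0
v = t / (1.0 / true_qi + t / true_vmax)   # cumulative volume per unit area

vmax, qi = fit_vmax(t, v)
area = filter_area_m2(batch_volume_l=2000.0, vmax_l_per_m2=vmax)
print(f"Vmax = {vmax:.1f} L/m2, Qi = {qi:.1f} L/m2/min, area = {area:.1f} m2")
```

Real bench data carry noise and often bend away from the line late in the run, so in practice the fit window and the residuals deserve a look before the slope is trusted.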

Where does learning come in? Two places. First, the Vmax test itself is a small extrapolation that the gradual-pore-blocking model can get wrong when fouling is not gradual — real harvest feeds often show a mix of pore constriction and cake buildup, so the true curve bends away from the single n = 1.5 line late in the run. A model fit on the full t/V-versus-t curve, or a short physics-informed model that blends the blocking laws (estimating an effective n rather than assuming 1.5), extrapolates the clog point more reliably than reading one slope from the early-time linear region. Second, and more valuable, you can skip ahead of the bench test entirely: a model that predicts Vmax directly from the upstream state — the VCD, viability, and turbidity at harvest — lets you size the filter before the harvest feed even exists, the day the harvest model says you are about to pull the batch. The mapping is Vmax ≈ f(VCD, viability, turbidity, …) trained across past batches and their bench Vmax results, so the bench test grounds the labels and the model generalizes them to the next batch's upstream conditions. The clog-load proxy the example module learns is the front half of exactly this chain: predict the turbidity, and the turbidity (a strong inverse of Vmax) tells you the area to stage. The Vmax test stays as the confirmatory measurement; the model is what lets you order the right filter area in advance.
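One way to estimate an effective n rather than assume 1.5 is to regress log(d²t/dV²) on log(dt/dV), since the blocking-law equation is a straight line in those coordinates. The sketch below is a hedged illustration on noise-free synthetic data generated from the n = 1.5 law; the function name, the trim parameter, and the test parameters are hypothetical. On real bench curves the numerical second derivative amplifies noise, so the data would need smoothing first.

```python
import numpy as np


def effective_blocking_exponent(v, t, trim=3):
    """Estimate n and k in d2t/dV2 = k * (dt/dV)**n from a t-versus-V curve.

    Uses finite differences, then a straight-line fit in log-log space.
    `trim` drops edge points where the finite differences are least accurate.
    """
    v = np.asarray(v, dtype=float)
    t = np.asarray(t, dtype=float)
    dt_dv = np.gradient(t, v)
    d2t_dv2 = np.gradient(dt_dv, v)
    inner = slice(trim, -trim)
    n, log_k = np.polyfit(np.log(dt_dv[inner]), np.log(d2t_dv2[inner]), 1)
    return n, np.exp(log_k)


# Synthetic curve from the constant-pressure n = 1.5 law (hypothetical parameters):
# t/V = 1/Qi + t/Vmax  rearranges to  t = V / (Qi * (1 - V/Vmax)).
vmax, qi = 200.0, 20.0
v = np.linspace(0.05 * vmax, 0.8 * vmax, 200)
t = v / (qi * (1.0 - v / vmax))

n_eff, k_eff = effective_blocking_exponent(v, t)
print(f"effective n = {n_eff:.2f}")
```

An estimate that drifts toward 0 late in the run would point at cake buildup, while one that stays near 1.5 supports reading Vmax from the single line.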

The upstream→clarification link: where clarification performance is really decided

The deepest idea in this chapter is that clarification performance is overwhelmingly a function of upstream conditions, not of the clarification equipment. The cell density and especially the viability at harvest set the debris load, the turbidity, the Vmax, and — the part that bites days later — the HCP burden the capture step inherits. This is why a chapter on the first downstream step belongs in the Upstream, Learned part: the lever is upstream, even though the consequence is downstream.

Concretely, you can regress each clarification and downstream-burden outcome on the harvest-state features:

Clarified turbidity / centrate quality — how clean the centrifuge centrate is, as a function of feed VCD and viability.
Depth-filter Vmax / capacity — area you will need, as above.
Step yield / recovery — product lost to the filter cake and centrifuge underflow.
Inherited HCP — the host-cell protein the capture pool starts with, which a low-viability harvest inflates because lysed cells dump their contents into the feed.

Depth filters do more than strain: their charged cellulose/diatomaceous matrix also adsorbs soluble HCP, DNA, and endotoxin. A debris-heavy, low-viability harvest therefore hits clarification twice — it fouls the filter faster and saturates the adsorptive capacity that would otherwise have trimmed the HCP the capture step inherits.

That last link is the one that closes the loop to the QC and release chapter. In our campaign the released-batch HCP values are mostly well inside the 100 ng/mg limit (ng/mg = nanograms of host-cell protein per milligram of antibody product; the 100 ceiling is the release specification): BATCH-2026-001 finishes at 28.203 ng/mg, and the other passing batches sit between 14.953 and 26.093 ng/mg, but the out-of-spec sibling BATCH-2026-004 fails HCP at 128.0 ng/mg [1]. HCP is a downstream-and-release attribute, but its origin is frequently an upstream one: an over-extended harvest of a low-viability culture floods the feed with host-cell protein that no single purification step fully removes. A model that links harvest viability to inherited HCP is, in effect, an early-warning system for exactly the failure mode BATCH-2026-004 represents, which is why this is not an academic correlation but a deviation-prevention tool. (We are careful here: the dataset records BATCH-2026-004's HCP failure as the real OOS; we do not invent a causal harvest value for it, only the modeling principle that low-viability harvests raise inherited HCP risk.)

The simulated cohort the example module trains on makes that principle audit-able rather than asserted. Each of its 60 runs is drawn from the same mechanistic fed-batch model the rest of the suite uses, with the biologically meaningful knobs perturbed per batch (growth rate, productivity, feed amounts) and an 18% chance of a contamination event; the cohort then applies the same release spec as the shipped datasets — HCP ≤ 100 ng/mg and monomer ≥ 95% (the fraction of antibody that is the intact single molecule rather than clumped-together aggregate) — to label each run PASS or OOS (out-of-spec, the same failure the "out-of-spec sibling" above names). So the OOS minority in the cohort is generated by the same lysis dynamics that drive turbidity — which is exactly why a turbidity soft sensor reads as a leading indicator of HCP risk, even though the two are measured days apart and by different assays.
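The labeling rule the cohort applies is simple enough to state as code. This is a sketch of the rule as described, not the fixtures.py implementation: the function and constant names are hypothetical, and the 98.0% monomer value used in the calls is a placeholder because the text does not give monomer results for the named batches.

```python
HCP_LIMIT_NG_PER_MG = 100.0   # release spec: HCP at or below this value
MONOMER_FLOOR_PCT = 95.0      # release spec: monomer at or above this value


def release_label(hcp_ng_per_mg: float, monomer_pct: float) -> str:
    """Label a run PASS only if both release attributes are in spec."""
    in_spec = (hcp_ng_per_mg <= HCP_LIMIT_NG_PER_MG
               and monomer_pct >= MONOMER_FLOOR_PCT)
    return "PASS" if in_spec else "OOS"


# HCP values are the ones quoted in the text; monomer 98.0 is a placeholder.
print(release_label(28.203, 98.0))
print(release_label(128.0, 98.0))
```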

The honest framing is risk stratification, not control. You are not going to autonomously move a critical quality attribute (CQA) — a measured property that defines whether the drug is acceptable, like the HCP result — with this model; you are going to flag, before clarification, that this particular batch's harvest profile looks like the batches that later struggled at HCP — and put a human and a tightened in-process check on it. That is exactly the human-in-the-loop posture the regulatory consensus endorses for ML touching quality decisions.

That posture has a concrete compliance shape, and it is worth naming because it is what makes the difference between a model a regulator will accept and one they will not. A harvest model that informs a GMP disposition is computerized-system-validated, but under the modern CSA (Computer Software Assurance, the risk-based successor the FDA's 2022 draft guidance frames over the old document-heavy CSV, computerized system validation) the validation effort scales to risk: an advisory, human-decides harvest sensor earns lighter scripted testing than the closed-loop controller it explicitly is not. The record itself must satisfy 21 CFR Part 11 electronic-records rules: the recommended endpoint, the inputs that justified it, and the human sign-off are attributable, time-stamped, and audit-trailed. That is just ALCOA+ (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available) read onto an ML artifact: the anatomy card later in this chapter, with every field provenance-tagged and the binding constraint written down, is an ALCOA+-shaped record by construction. And the model itself is locked and versioned (harvest_endpoint v1 on the card), so a retrain is a change-controlled event, not a silent weight update, exactly the locked-validated-model expectation the FDA's 2023 Artificial Intelligence in Drug Manufacturing discussion paper and the draft EU/PIC/S GMP Annex 22 both carry. The compliance machinery is not overhead bolted on after the science; it is the same auditability the manufacturing-science reviewer was already demanding of the whole card.

The first downstream step, learned: a harvest-endpoint optimization, a turbidity soft sensor, and a Vmax filter-sizing model are three views of one problem — predicting, from the upstream state at harvest, both when to stop and how hard the clarification will be — and together they produce the clarified pool CLAR-001 and an early HCP-risk flag for the batches that look like the OOS sibling. Original diagram by the authors, created with AI assistance.

A runnable model: harvest_endpoint.py

The example module examples/platform/ml/harvest_endpoint.py is the runnable core behind this chapter. It frames the harvest-to-clarification handoff as a single regression: end-of-culture state → a harvest turbidity / filter-load proxy, so a plant can size depth filters and flag a hard harvest before it clogs a clarification train. The features come from the simulator-backed 60-run cohort built by fixtures.py (loaded through the shared dataio.load_cohort_summary() helper); the turbidity target is a transparent synthetic function of cell lysis — dead-cell load plus the lysis-driven ammonia, with deliberately invented coefficients (the 20, 3.2, 1.5, and the noise term are made-up simulation constants, not measured physical values) — which is why the module is tagged (simulated cohort). It runs standalone with no services.

# examples/platform/ml/harvest_endpoint.py
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import dataio


def main() -> dict:
    df = dataio.load_cohort_summary()
    rng = np.random.default_rng(2026)
    # turbidity rises with dead-cell burden (low end viability, high peak VCD) and final ammonia.
    dead_load = (100 - df["end_viability_pct"]) * df["peak_VCD_e6_per_mL"] / 100.0
    turbidity = 20 + 3.2 * dead_load + 1.5 * df["final_ammonia_mM"] + rng.normal(0, 2.5, len(df))
    feats = ["peak_VCD_e6_per_mL", "end_viability_pct", "final_lactate_g_L",
             "final_ammonia_mM", "integral_VCD"]
    X = df[feats].to_numpy(float)
    y = turbidity.to_numpy(float)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.35, random_state=0)
    sc = StandardScaler().fit(Xtr)
    # a regularized linear model is the right capacity for ~40 training runs; a
    # boosted-tree overfits this small a cohort (the small-data lesson, in code).
    model = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(sc.transform(Xtr), ytr)
    r2 = r2_score(yte, model.predict(sc.transform(Xte)))
    print("Harvest turbidity / filter-load model (simulated cohort, n=%d)" % len(df))
    print(f"  end-of-culture state -> harvest turbidity proxy: R2 = {r2:.3f}")
    print(f"  turbidity range {y.min():.0f}-{y.max():.0f} NTU-proxy")
    assert r2 > 0.70, "harvest-load model should clear R2 0.70"
    return {"r2": float(r2)}


if __name__ == "__main__":
    main()

Running python platform/ml/harvest_endpoint.py prints the following verbatim:

Harvest turbidity / filter-load model (simulated cohort, n=60)
  end-of-culture state -> harvest turbidity proxy: R2 = 0.875
  turbidity range 45-90 NTU-proxy

Read those three lines carefully, because each makes one of the chapter's points concrete. The cohort is n=60 simulated runs — enough to train and honestly evaluate, where the six named batches are not. The held-out R² = 0.875 means end-of-culture state explains roughly 88% of the variance in harvest-feed turbidity on runs the model never saw — a genuinely useful soft sensor, and crucially this is a batch-held-out score (the 35% test fraction is whole runs), so it is an estimate of performance on a new batch, not the inflated number a row-split would produce. The 45–90 NTU-proxy range is the spread the clarification train must absorb — the floor sits at 45, not the formula's bare 20 base, because even the cleanest run adds its dead-cell and ammonia terms on top: a low-turbidity harvest at the bottom of that band sizes a small filter, while one near 90 demands a larger area or risks a mid-run blind. The assert r2 > 0.70 line is a guardrail baked into the module — if a future change to the cohort or features pushed the score below the floor where this sensor is worth deploying, the example fails loudly rather than silently shipping a weak model.

The harvest objective — the constrained argmax that turns BATCH-2026-001's trajectory into an endpoint — is the decision this soft sensor feeds, and the running example's own panel shows what it must resolve. The last recorded sample sits at 68.0% viability, below the 70% floor, so a constrained argmax does not chase the richest titer; it backs off to the previous feasible sample at 76.8% viability and 4.221 g/L. The plant that "always pulls last" would harvest a richer but dirtier feed — higher titer, higher turbidity, more inherited HCP — while the constrained decision trades a little product for a feed the downstream train can actually handle. That difference, multiplied across a campaign, is the value of learning the endpoint instead of reading it off a calendar.

Anatomy of one harvest-decision record

A harvest decision, like every artifact in this series, is not a bare timestamp — it is a structured record that ties the recommended endpoint to the state that justified it, the predictions that fed it, the constraints it respected, and the downstream consequence it forecasts. Dissect it the way a manufacturing-science reviewer would before signing off on a harvest.

One harvest decision, fully unpacked: the upstream state that fed it, the constrained endpoint it chose (backing off the sub-70-percent last sample), the turbidity and Vmax it predicted, the viability floor that vetoed the richer feed, the inherited-HCP risk it forecasts, and the lineage tying it to BATCH-2026-001 and the clarified pool CLAR-001 it produces — with the honest note that the true best endpoint can never be observed. Original diagram by the authors, created with AI assistance.

Read top to bottom and the chapter is laid out as fields. The input block is the harvest state, and the discipline here is that every field is tagged by how it arrived — provenance is part of the record, not an afterthought. VCD and viability come from the twice-a-day offline count (an automated cell counter, hours of latency between samples); titer comes from the lab HPLC, even later; lactate and ammonia come from the offline metabolite panel; the IVCD is derived (a trapezoidal integral of VCD over the run), so it inherits the count's provenance and uncertainty. Where an in-line Raman or capacitance probe exists, a parallel near-real-time estimate of VCD sits alongside the offline number, and the record keeps both so a reviewer can see the soft sensor agreeing (or not) with the reference. A field with no provenance tag is a field a reviewer cannot trust.

That provenance discipline is not bespoke to this chapter — it is the harvest model's slice of the plant's data shadow, and it lands on the same shared standards the data book builds on. The harvest-state row is anchored to one ISA-95 (the IEC 62264 enterprise-to-plant object model) material-lot and equipment identity, so BR-101 and BATCH-2026-001 mean the same thing to the historian, the MES, and the model — the semantic-interoperability layer that resolves a historian's BR101.VCD.PV tag, a MES batch field, and a LIMS result to one agreed property rather than three lookalike columns. The in-line spectra arrive over OPC UA from the bioreactor, the as-executed batch values are carried in B2MML (the XML serialization of ISA-95) out of the MES, and each feature carries a metadata schema — unit, instrument, timestamp format, and the offline/in-line/derived tag above — so the model never standardizes a Fahrenheit field against a Celsius one or files a repeat result under a second key. A training row whose lineage and units are governed this way is a row that can be re-derived and re-audited years later; a flat CSV scraped from a screen is not.

The green core is the recommendation — the endpoint time, the feasible titer it implies (4.221 g/L at the 76.8%-viability sample, not the 5.877 g/L of the vetoed last sample), and the predicted turbidity (NTU), Vmax (L/m²), and filter area (m²) that pre-stage clarification. The predicted turbidity is the one number the runnable module actually produces; the Vmax and area are the downstream chain that turbidity feeds, marked illustrative because the module learns the load proxy, not the bench-calibrated Vmax. The constraints row is what makes the record auditable: the 70% viability floor that vetoed the last sample is written down, not implied, so a reviewer can see why the model did not chase the higher titer, alongside the turbidity ceiling the train can absorb (the top of that 45–90 band, beyond which area must grow). Writing the binding constraint into the record is the difference between an advisory the reviewer can interrogate and a black-box number they must take on faith.

The forecast row carries the inherited-HCP risk band that links forward to the release HCP result and the OOS-sibling pattern. It is a band, not a point, because the harvest model's HCP forecast is the slowest-confirmed and least-certain output it makes. The violet relationships panel records lineage: this decision derivedFrom BATCH-2026-001, produces CLAR-001, feeds Protein A capture, reconciledWith the bench Vmax test and the offline turbidity that grade it, and retrains_when the residual (the prediction error against those references) drifts. These are the record's own relation labels; on the Book 4 graph they map to derivedFrom and its hasOutput/hasInput process edges. The record is therefore the same CLAR-001 node, modeled in Book 4's ontology, here carrying not just lineage but the predictions and constraints that produced it. A manufacturing-science reviewer signs off not on the recommended day alone but on the whole card: state with provenance, prediction with uncertainty, constraint that bound the choice, and forecast with its honest band.

The pseudo-predicates are real edges — and they are what make the model trustworthy

It is worth being precise that those relation labels are not prose decoration: each corresponds to a typed edge that already exists in Book 4's ontology, and pinning the harvest model to them is what turns a fragile column-name pipeline into a semantically grounded, auditable one. Three of the ontology's mechanisms do concrete work here.

First, the feature row is pulled by IRI, not by string. The model's five features — peak VCD, end-of-culture viability, final lactate, final ammonia, integral VCD — are not "whatever column happens to be named end_viability_pct"; each is a bp: datatype property hung on a bp:Material node (the object/datatype split that distinguishes a relationship you can walk from a measurement you can read), carrying its own unit and IRI so a viability in percent is never silently fed where a fraction was meant. The BFO continuant/occurrent cut is load-bearing in the same way: the batch (a continuant material) is a different node from the cell-culture run (an occurrent), so the harvest model's per-batch feature row attaches to the material it actually describes and can never accidentally average a quality over a process. A feature pulled by its ontology IRI survives a column rename that would silently break a hard-coded join.
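A small sketch of what pulling a feature by identifier and unit, rather than by column name, might look like. The IRI string, the `Measurement` type, and the conversion table are all hypothetical stand-ins; the real ontology's property names and unit vocabulary are not shown in this chapter.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Measurement:
    iri: str      # property identifier (hypothetical value below)
    value: float
    unit: str


# Hypothetical conversion table: (from_unit, to_unit) -> multiplier.
_FACTORS: Dict[Tuple[str, str], float] = {
    ("percent", "fraction"): 0.01,
    ("fraction", "percent"): 100.0,
}


def read_feature(row: Dict[str, Measurement], iri: str, want_unit: str) -> float:
    """Fetch a feature by IRI and return it in the requested unit."""
    m = row[iri]  # raises KeyError if the property is absent
    if m.unit == want_unit:
        return m.value
    try:
        return m.value * _FACTORS[(m.unit, want_unit)]
    except KeyError:
        # No known conversion: refuse rather than pass the number through.
        raise ValueError(f"cannot convert {m.unit} to {want_unit} for {iri}")


VIABILITY = "bp:endOfCultureViability"  # hypothetical IRI
row = {VIABILITY: Measurement(VIABILITY, 76.8, "percent")}
as_fraction = read_feature(row, VIABILITY, "fraction")
```

Because the lookup key is the identifier and the unit travels with the value, renaming a spreadsheet column changes nothing, and an unconvertible unit fails loudly instead of feeding a percent where a fraction was meant.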

Second, the lineage spine is the grouping key for honest cross-validation. The same transitive bp:derivedFrom walk that scopes a recall is what makes a 60-run cohort splittable correctly: two clarified pools that share a BATCH-2026-001 ancestor are not independent rows, so a model scored by an ordinary random split inflates every metric on near-duplicate siblings. The defensible split is a grouped, leave-one-batch-out cross-validation that holds out an entire derivedFrom-connected lineage at a time — exactly the batch-grouped discipline the data chapter insists on, here handed to the model for free because the genealogy already records which rows descend from one campaign. The example module's whole-batch test split is the single-summary-row degenerate case of that grouping; on a multi-sample feed it is the spine, not a row index, that defines the held-out unit.
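The grouped split described above can be sketched without any library. This is an illustrative stand-in, assuming each row already carries the identifier of its root batch; the function name is hypothetical, and every batch ID other than BATCH-2026-001 is made up for the example. scikit-learn's `LeaveOneGroupOut` implements the same idea.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_lineage_out(
    groups: Sequence[Hashable],
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx), holding out one whole lineage per fold."""
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for held_out, test_idx in by_group.items():
        # Every row sharing the held-out ancestor leaves the training set.
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield train_idx, test_idx


# Root-batch ancestor of each row; rows 0-1 and rows 3-4 are siblings.
lineage = [
    "BATCH-2026-001", "BATCH-2026-001",
    "BATCH-2026-002",                     # hypothetical ID
    "BATCH-2026-003", "BATCH-2026-003",   # hypothetical ID
]
folds = list(leave_one_lineage_out(lineage))
```

With five rows and three lineages there are three folds, not five: the held-out unit is the batch ancestor, so siblings never sit on opposite sides of a split.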

Third, the same SHACL shape that gates the release gates the training set. The closed-world bp:ReleaseShape from the release gate refuses a lot whose CQA panel is incomplete or out of range; pointed at a candidate training subgraph, that identical shape refuses a dataset whose harvest-state rows are missing a derivedFrom parent, a unit-bearing value, or a verdict. This is the catch an open-world reasoner cannot make — a missing viability reads as merely unknown to OWL but as a hard gap to SHACL — and it matters precisely because a model trained on a graph with silent gaps will, like the fluent-but-ungrounded LLM the AI chapter warns about, narrate a turbidity it never actually saw. Conformance-before-training is how a reasoned, shape-validated graph becomes the ground truth the model is checked against, not the other way round — and it is the same shape, run by the same harness, that already certifies the release. The harvest model is trustworthy not because Ridge is clever but because every input it learns from passed the gate that the product itself must pass.
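The closed-world behavior can be illustrated in plain Python. To be clear about the assumptions: this is not SHACL and not the bp:ReleaseShape itself, only a sketch of the same refusal logic, and the field names and row IDs are hypothetical.

```python
from typing import Dict, List, Mapping, Sequence

# Hypothetical field names standing in for the shape's required properties.
REQUIRED = ("derived_from", "value", "unit", "verdict")


def missing_fields(row: Mapping[str, object]) -> List[str]:
    """Closed-world check: an absent or empty field is a violation."""
    return [f for f in REQUIRED if row.get(f) in (None, "")]


def gate_training_set(rows: Sequence[Mapping[str, object]]) -> Dict[str, List[str]]:
    """Map row id -> missing fields; an empty report means the set conforms."""
    report: Dict[str, List[str]] = {}
    for row in rows:
        gaps = missing_fields(row)
        if gaps:
            report[str(row.get("id", "<no id>"))] = gaps
    return report


rows = [
    {"id": "row-1", "derived_from": "BATCH-2026-001",
     "value": 76.8, "unit": "percent", "verdict": "pass"},
    {"id": "row-2", "derived_from": "BATCH-2026-001",
     "value": 76.8, "unit": None, "verdict": "pass"},  # unit missing
]
report = gate_training_set(rows)
if report:
    print("refuse training set:", report)
```

The contrast with an open-world reading is in `missing_fields`: a value that is not there counts as a failure, so the dataset is refused before any model is fitted to it.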

The unsolved part: the censored label and the slow feedback

Be honest about why harvest ML is harder than it looks. The first difficulty is that the true best endpoint is never observed. For any real batch you harvest at exactly one time and see exactly one outcome; you never get to see what would have happened had you waited a day or pulled a day early. The label t* is censored — counterfactual, not measured — so you cannot simply do supervised regression onto "the right day." This is not the gentle censoring of survival analysis, where you at least know an event happened after some time; here you observe one point on one curve per batch and must infer the shape of a trade-off you can never trace out. Teams work around this with mechanistic forward models (simulate the trajectory under each candidate endpoint and optimize over the simulation — which is exactly what the simulated cohort here stands in for), with surrogate labels (regress onto the downstream outcomes you can measure, like inherited HCP and step yield, and let the objective imply the endpoint), or with the kind of constrained argmax shown above. None of these is the same as observing ground truth, and all of them inherit the small-data ceiling: a real process has a few dozen batches, each harvested once, so the model is learning the endpoint from a handful of single-point observations — the very reason the example module manufactures a 60-run cohort and the very reason it picks a regularized linear model over a tree.

The second difficulty is feedback latency. The harvest decision's most important consequence — the inherited HCP, the realized step yield, the filter behavior at scale — is not known at harvest. It arrives days later, after capture, after the assays. So the residual that would tell you the harvest model is drifting is one of the slowest in the entire process, slower even than the titer soft sensor's hours-late reference. Between the harvest and the verdict, a harvest model that has begun to mislead looks exactly like one that is working — the absence of an alarm is not evidence of correctness, only evidence that the truth has not landed yet. This is the sparse-reference, slow-feedback regime at its most extreme, and it is why harvest models in practice are advisory: they recommend, a human and the platform constraints decide, and the model is re-graded only when the downstream truth finally lands. A practical consequence is that the fast surrogate — the turbidity soft sensor, which can be checked against an offline NTU reading the same day — does double duty as an early drift sentinel for the slow chain, because if the model's turbidity prediction starts missing the same-day reference, its HCP forecast is almost certainly drifting too.
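A minimal sketch of the early drift sentinel idea, assuming each harvest yields one predicted turbidity and one same-day offline NTU reading. The class name, the window length, the tolerance, and every number in the example are placeholders, not values from the chapter's module.

```python
from collections import deque
from typing import Deque


class TurbiditySentinel:
    """Flag drift when the rolling mean residual exceeds a tolerance."""

    def __init__(self, window: int, tolerance_ntu: float) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self._residuals: Deque[float] = deque(maxlen=window)
        self._tolerance = tolerance_ntu

    def update(self, predicted_ntu: float, offline_ntu: float) -> bool:
        """Record one same-day check; return True if drift is flagged."""
        self._residuals.append(predicted_ntu - offline_ntu)
        if len(self._residuals) < self._residuals.maxlen:
            return False  # not enough references yet to judge
        mean_bias = sum(self._residuals) / len(self._residuals)
        return abs(mean_bias) > self._tolerance


sentinel = TurbiditySentinel(window=3, tolerance_ntu=5.0)
# (predicted NTU, offline NTU) pairs; illustrative numbers only.
checks = [(60, 61), (62, 60), (58, 59), (70, 60), (72, 61), (75, 62)]
flags = [sentinel.update(p, o) for p, o in checks]
```

A rolling mean of the signed residual catches a sustained bias but not symmetric scatter; a real monitor would likely pair it with a spread statistic. The sentinel stays silent until the window fills, which mirrors the text's warning that no alarm is not the same as being correct.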

The third, quieter difficulty is transfer. A Vmax model and a turbidity soft sensor are bound to the specific cell line, media, depth-filter media, and centrifuge they were trained on, exactly as a Raman calibration is bound to its probe. Change the filter grade or scale from a 50 mL disc to a full lenticular stack and the model is, for regulatory purposes, a new procedure until re-qualified — and the change is not cosmetic: a different depth-filter grade has a different pore-size gradient and therefore a different fouling mechanism, which moves the effective blocking exponent n and invalidates a Vmax surrogate trained on the old grade. The bench Vmax test never goes away precisely because it is the confirmatory measurement that re-grounds the model whenever the hardware moves; it is the anchor that lets a learned shortcut stay honest across a change-control boundary.
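The confirmatory bench measurement can be sketched as a straight-line fit. Under the n = 1.5 (pore-constriction) law, constant-pressure data follow t/V = t/Vmax + 1/Q0, so Vmax is the reciprocal of the fitted slope. The function name and the synthetic data are assumptions made for illustration: the readings are generated from the law itself with placeholder values, not taken from any real filter.

```python
from typing import List, Sequence, Tuple


def fit_vmax(t_min: Sequence[float],
             v_l_per_m2: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of t/V = t/Vmax + 1/Q0; returns (Vmax, Q0)."""
    if len(t_min) != len(v_l_per_m2) or len(t_min) < 2:
        raise ValueError("need at least two paired (t, V) readings")
    xs: List[float] = list(t_min)
    ys: List[float] = [t / v for t, v in zip(t_min, v_l_per_m2)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0 or intercept <= 0:
        raise ValueError("data do not follow the pore-constriction form")
    return 1.0 / slope, 1.0 / intercept


# Synthetic bench run generated from the law (placeholder parameters).
TRUE_VMAX = 200.0   # L/m^2
TRUE_Q0 = 10.0      # L/m^2/min, initial flux
times = [5.0, 10.0, 20.0, 30.0, 45.0, 60.0]
volumes = [1.0 / (1.0 / (TRUE_Q0 * t) + 1.0 / TRUE_VMAX) for t in times]
vmax, q0 = fit_vmax(times, volumes)
```

If a new filter grade fouls by a different mechanism, the t/V-versus-t plot stops being a straight line, which is one practical way the bench test exposes a surrogate model that no longer applies.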

What this chapter adds to the model suite

This chapter contributes examples/platform/ml/harvest_endpoint.py to the Book 5 example suite: a standalone module that loads the simulator-backed 60-run cohort, builds a transparent dead-cell-load-plus-ammonia turbidity proxy as its target, splits by whole batches, standardizes on the training split only, and fits a cross-validated RidgeCV regression from end-of-culture state to harvest-feed turbidity — clearing a built-in R² ≥ 0.70 floor at a held-out R² = 0.875. It coordinates with — and deliberately does not duplicate — the upstream soft-sensor module (titer/VCD from Raman) and the SPC/MVDA reference scripts: those predict what is in the tank; this predicts how hard it will be to clarify once you empty it. The model choice — regularized linear over boosted tree — is itself a teaching point: it is the small-data, defensible capacity for a few-dozen-batch process, and the assertion guard makes that floor explicit. The turbidity target and the Vmax/area chain it feeds are clearly labeled illustrative (simulated cohort); the release HCP figures the chapter cites — 28.203 ng/mg for the golden batch, 128.0 ng/mg for the OOS sibling — are the real committed dataset values.

Why it matters

Harvest is the single decision where upstream and downstream meet, and it is decided with the least information of any step — at the moment you must act, the most important consequences are still days away. Get it right and the entire downstream train sees a clean, predictable feed: the centrifuge runs smoothly, the depth filters do not blind, capture starts from low HCP, and the release assays pass with margin. Get it wrong — hold a dying culture too long — and you spend the rest of the process, and a chunk of your yield, clawing back the host-cell protein and debris a single over-extended harvest dumped into the feed. Learning the endpoint, the turbidity, and the filter size turns the most consequential blind decision in manufacturing into a defensible, advisory, evidence-backed one. It will not autonomously move a CQA; it will keep you from harvesting your way into the next OOS.

In the real world

The strongest production-grade anchor for learning at harvest is Amgen's deployment at Juncos, Puerto Rico (production), where SIMCA OPLS models predict harvest titer and other in-process attributes inside commercial GMP drug-substance manufacturing. Amgen engineers report that the harvest-titer model eliminated roughly six hours of harvest idle time and around ten hours of idle time between chromatography columns by letting operators act on a model prediction rather than wait for the lab — a concrete, deployed case of an MVDA model at the upstream-downstream boundary [2]. The honest caveat travels with the headline: this is a first-party, self-reported account from Amgen engineers together with a Sartorius vendor case study (vendor-self-reported / peer-reviewed-self-authored evidence tier), not an independently audited result, and the specific hour-savings cannot be externally confirmed.

The turbidity-and-Vmax half of the chapter is, today, more pilot and engineering practice than productized ML. Depth-filter sizing via the Vmax/Pmax bench tests is universal production practice, and the gradual-pore-blocking math behind it (the n = 1.5 constriction law) is textbook filtration theory; the machine-learning extension, predicting Vmax or turbidity directly from upstream VCD/viability so the filter is sized before harvest, is an applied (pilot/research) idea that process-development groups use internally more than vendors sell. There is a published basis for the underlying correlation: clarification and depth-filter capacity studies in the bioprocess literature consistently report that harvest cell density and viability are the dominant predictors of filter throughput and turbidity load (peer-reviewed-independent), which is what makes the upstream-state regression credible rather than speculative. The broader regulatory and consensus picture frames where this sits: the ISPE Pharma 4.0 reality is that ML in biomanufacturing clusters in monitoring and human-in-the-loop decision support, not autonomous control of CQAs, and a harvest-endpoint or HCP-risk model is squarely in the advisory, human-decides category that the FDA's 2023 Artificial Intelligence in Drug Manufacturing discussion paper and the draft EU/PIC/S GMP Annex 22 both expect: a locked, validated model supporting a human decision, never silently moving a quality attribute on its own [3][4]. The honest summary: harvest-titer prediction is real and deployed by at least one major manufacturer; turbidity and Vmax learning are credible, physics-anchored applications still mostly inside process-development groups; and none of it autonomously decides when to harvest.

Key terms

Titer — the concentration of antibody product in the culture broth, in g/L; the amount of product the harvest decision is trying to keep, rising as the culture ages.
Viability — the fraction of cells in the culture still alive, as a percent; falling viability is the leading indicator of debris and HCP.
VCD (viable cell density) — the count of live cells per volume (e.g. 21.63e6 = 21.63 million cells/mL).
Soft sensor — a model that predicts a quantity that is slow or expensive to measure (here, turbidity) from cheap signals you already have (VCD, viability, spectra).
CQA (Critical Quality Attribute) — a measured property that defines whether the drug is acceptable (e.g. HCP ≤ 100 ng/mg); the thing an advisory ML model must never autonomously move.
Harvest endpoint — the chosen time to stop the culture and begin clarification; an optimization trading recoverable titer against the downstream burden a dying culture creates, subject to a viability floor.
Harvest objective J(t) — the (illustrative) function being maximized: recoverable titer minus a weighted downstream-burden penalty, evaluated only where the viability constraint holds, over the discrete sample times available.
Viability floor — the hard constraint (commonly around 70%) below which a platform will not harvest; the constraint that vetoes the richest-but-dirtiest endpoint.
Turbidity (NTU) — nephelometric turbidity of the harvest feed; the single best proxy for how hard clarification will be, driven by total cell mass and the dead/lysing fraction.
Dead-cell load — total cell mass times the dead fraction (1 − viability); the physical driver of both turbidity and filter fouling, and the core feature term in the example module.
Clarification — the first downstream step, separating antibody-bearing fluid from cells and debris (centrifuge plus depth filters), producing the clarified pool (CLAR-001).
Depth filter — graded-porosity filter media that traps cells and debris; sized so it does not blind before the batch is through.
Blocking laws — the family of fouling models (d²t/dV² = k (dt/dV)^n) whose exponent n selects complete blocking (2), pore constriction (1.5), intermediate (1), or cake (0) filtration.
Vmax — the maximum volume a filter membrane can process per unit area before fully blinding; read as the reciprocal of the slope of a t/V-versus-t linearization of the n = 1.5 law, or predicted from upstream state.
Vmax / Pmax test — the small-scale bench ritual that measures filter capacity on a sample of real harvest feed; the confirmatory measurement that re-grounds any Vmax model after a hardware change.
IVCD — the integral of viable cell density over the run; total productive cell-hours, a derived feature for both titer and clarification models.
Inherited HCP — the host-cell protein the capture step starts with; inflated by a low-viability harvest, linking the upstream endpoint to a downstream release attribute.
Censored label — the harvest endpoint truth that can never be observed because each batch is harvested exactly once and you see one point on one curve; why harvest ML cannot be plain supervised regression.
Leave-one-batch-out cross-validation — a grouped split that holds out an entire derivedFrom-connected lineage at a time, so siblings sharing a batch ancestor cannot leak across train and test; the genealogy spine is its grouping key.
SHACL training-set gate — the same closed-world shape that gates lot release, pointed at a candidate training subgraph to refuse rows missing a lineage parent, a unit-bearing value, or a verdict; conformance-before-training.
CSA / CSV — Computer Software Assurance, the risk-based validation approach (FDA 2022 draft) that scales testing to risk, succeeding the document-heavy Computer System Validation; an advisory harvest model earns lighter assurance than a closed-loop controller.
ALCOA+ / Part 11 — the data-integrity attributes (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available) and the 21 CFR Part 11 electronic-records rules a GMP-relevant harvest record must satisfy.
Locked, versioned model — a model frozen and version-named (harvest_endpoint v1) so a retrain is a change-controlled event, the form the FDA AI discussion paper and draft EU/PIC/S Annex 22 expect for ML touching quality decisions.

Where this leads

Clarification handed downstream a clean, defined pool — CLAR-001, deriving from BATCH-2026-001 and feeding the first purification column. The next chapter, Capture Chromatography: Hybrid Models and Real-Time Pooling, enters the Protein A step, where the feed's HCP burden — set by the harvest decision we just made — meets hybrid mechanistic-plus-ML chromatography models and the real-time pooling decisions that turn a UV trace into a defined capture pool.
