Process Development: Bayesian Optimization Beats the Factorial Grid

📍 Where we are: Part II · Discovery & Development, Learned — Chapter 7. The last chapter ranked the clones and handed us a winner — a CHO line (Chinese-Hamster-Ovary cells, the standard antibody-producing host) that makes mAb-A (this series' running monoclonal antibody). Now that line needs a process: a medium, a feed, a temperature, a pH. Classical development walks a factorial grid of experiments to find good settings. This chapter is about the learning method that walks far fewer of them, and walks them on purpose.

A clone is not a process. The cell line from WCB-CHO-001 can in principle make a great deal of antibody, but only inside a narrow envelope of conditions it has never been told. What feed schedule? What glucose setpoint? What temperature, and should it shift mid-run? Process development (PD) is the search for that envelope, and historically the search has been a design of experiments (DoE) — a structured grid of bioreactor runs, often a full or fractional factorial or a central-composite response-surface design, that maps a response surface across a handful of factors. The grid is principled and auditable, and it is also expensive: a single Ambr or bench run costs weeks of a scientist's time and a parallel reactor slot, and the grid's size explodes combinatorially with every factor you add.

This chapter argues that the single strongest, most defensible machine-learning application in PD is not a black-box predictor of titer but a search strategy: Bayesian optimization (BO) built on a Gaussian process (GP). BO does not try to map the whole response surface. It builds a probabilistic belief about where the optimum is, and at each step it spends the next precious run where that belief says it will learn the most. Against the same goal, BO routinely reaches a better optimum in a fraction of the runs a factorial grid needs — and unlike a grid, it gets smarter as it goes. The runnable artifact in this chapter makes that quantitative: BO matches a 25-run factorial grid's optimum in 15 runs on the shared simulator.

The simple version

Imagine hunting for the deepest point in a dark, hilly lake with only a few sonar pings to spend. A factorial DoE drops the pings on a fixed grid — evenly spaced, decided in advance, blind to what it finds. Bayesian optimization drops one ping, looks at the depth, builds a guess of where the lake floor probably dips, and aims the next ping at the most promising spot — balancing "go where it already looks deep" against "go somewhere we know nothing about." After a dozen pings it has usually found a deeper hole than the grid found with fifty, because every ping after the first was chosen on purpose. The Gaussian process is the running guess of the lake floor; the acquisition function is the rule for where to ping next.

What this chapter covers

We frame PD as sequential black-box optimization under a tight experimental budget, then build it up piece by piece: the search space of feed and process parameters; the Gaussian process as a surrogate model that returns a mean and an uncertainty everywhere, written out as the posterior equations; the acquisition function (Expected Improvement in closed form, plus UCB and PI) that turns that uncertainty into the next experiment; the multi-objective case where titer fights product quality and the answer is a Pareto front, not a point; the ML-assisted design space and how BO connects to Quality by Design (QbD) and ICH Q8; the high-throughput Ambr automation that makes the loop physically runnable; and an honest accounting of where autonomous self-driving labs actually stand. The runnable artifact, examples/platform/ml/bayesopt_doe.py, optimizes a glucose/glutamine feed policy against the fed-batch simulator and watches BO overtake a factorial grid within a dozen-odd simulated runs.

The task: sequential optimization under a brutal budget

Most of this book's learning tasks are prediction — given features, estimate a number. PD's central task is different. It is optimization: find the input setting x* that maximizes an objective f(x) — say, day-14 titer — where f is an expensive, noisy, black-box function you can only learn by running an experiment. You cannot differentiate f; you cannot evaluate it a million times; each evaluation is a two-week bioreactor run that costs reagents, a reactor slot, and an analyst's time. This is precisely the regime BO was invented for, and it maps onto bioprocess PD almost without distortion: a 2025/2026 tutorial review in Biotechnology and Bioengineering establishes BO as the method for sequential, expensive, noisy black-box optimization of bioprocesses, reaching competitive optima in far fewer runs than classical DoE and naming Expected Improvement as the default acquisition [1].

Frame it concretely against our running example. The clone is fixed (WCB-CHO-001 → mAb-A). The decision variables are a small vector of process settings — for our simulator, a feed policy: how much glucose (g/L per bolus — one discrete feed dose added at once) and glutamine (mM — millimolar — per bolus) to add on feed days. The objective is the run's day-14 titer (with quality constraints we will add shortly). The budget is the number of reactor runs we are willing to spend before locking a process — in real PD often ten to forty, in our simulated demonstration a couple of dozen. The question BO answers is not "what is the entire response surface" but "where do I run next so that, after my budget is gone, the best run I have seen is as good as possible."

Why does a factorial DoE struggle here? Three reasons, each structural rather than incidental. First, the curse of dimensionality: a full factorial at three levels across five factors is 3^5 = 243 runs, which no PD team can afford, so they fall back to fractional designs that confound (alias) effects. A main effect (one factor's own standalone influence on titer) becomes mathematically indistinguishable from an interaction (the joint influence of two factors acting together), because shrinking the grid gives both the same column in the design matrix, so a measured change could be caused by either and the data cannot tell you which. Second, the grid is non-adaptive: every point is chosen before the first result arrives, so a run that lands in an obviously hopeless corner is wasted, and there is no mechanism to redirect the remaining budget toward a region that looked promising at run three. Third, a DoE typically fits a low-order polynomial response surface (linear plus two-way interactions plus quadratic curvature) that cannot represent the sharp ridges, plateaus, and cliffs that real bioprocess optima sit on; the polynomial is smooth where the biology is not. BO replaces the polynomial with a Gaussian process that bends to the data, and replaces the fixed grid with a feedback loop.
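The combinatorial blow-up is easy to check numerically. The sketch below is only arithmetic: it counts and enumerates full-factorial runs for a given number of levels and factors. The function names are illustrative assumptions, not part of the chapter's artifact.

```python
from itertools import product

def full_factorial_runs(levels: int, factors: int) -> int:
    """Run count of a full factorial: every level of every factor."""
    return levels ** factors

def full_factorial_design(levels: int, factors: int):
    """Enumerate the design as tuples of coded level indices."""
    return list(product(range(levels), repeat=factors))

# Three levels per factor: each added factor triples the run count.
for k in range(2, 7):
    print(f"{k} factors at 3 levels: {full_factorial_runs(3, k)} runs")
```

At two factors and five levels this gives the 25-run grid used as the baseline later in the chapter; at five factors and three levels it gives the 243 runs quoted above.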

Evidence

The claim that BO reaches a competitive optimum in materially fewer experiments than classical DoE is supported across recent peer-reviewed bioprocess work — thermodynamics-aware BO validated in Ambr15 micro-bioreactors for CHO media design, compared head-to-head against a space-filling design [2]; an iterative BO framework for cell-culture media development that reached improved compositions using 3–30x fewer experiments than estimated for standard DoE [3]; and the 2025/2026 tutorial review of BO in bioprocess engineering [1] (all research, peer-reviewed-independent). Vendor "fewer experiments" headlines (e.g. 40–80%) are vendor-self-reported and are flagged as such later in this chapter; do not conflate the two tiers.

The Gaussian process: a belief about the surface, with honest error bars

The heart of BO is the surrogate model — a cheap stand-in for the expensive function. The standard choice is a Gaussian process, and the reason is its single most useful property: a GP returns, at every point in the search space, not just a predicted mean but a calibrated uncertainty. Where you have run experiments, the uncertainty collapses toward the measurement noise; far from any data, it balloons toward the prior. That uncertainty field is what lets BO decide where exploring is worth it — a polynomial fit gives you a residual variance, but not an honest "I have never seen this corner" signal.

Formally, a GP places a distribution over functions: any finite collection of points has a joint Gaussian distribution governed by a mean function m(x) (usually taken as zero or a constant after centering) and a covariance or kernel function k(x, x') that encodes how similar two settings' outcomes should be. Collect the n observed settings into a matrix X, their (noisy) titers into a vector y, let K be the n×n matrix with entries k(x_i, x_j), let k_* be the vector of kernel values between a new candidate x and the observed points, and let σ_n² be the observation-noise variance. Conditioning the joint Gaussian on the runs observed so far gives closed-form posterior equations:

mean      mu(x)    = k_*^T (K + sigma_n^2 I)^-1 y
variance  sigma2(x) = k(x, x) - k_*^T (K + sigma_n^2 I)^-1 k_*

the posterior mean mu(x) is the GP's best guess of the titer at x — a weighted average of observed titers, weighted by kernel similarity;
the posterior variance sigma2(x) is how unsure it is there — it shrinks near observed points (where k_* is large) and grows far from all of them.
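The two posterior equations translate almost line for line into NumPy. The following is a minimal sketch, not the chapter's artifact: it assumes a zero-mean prior, a one-dimensional search space, and a unit-amplitude squared-exponential kernel chosen only for brevity; the data values are made up for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of settings."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(X, y, X_new, noise_var=1e-6, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at the points X_new."""
    K = rbf_kernel(X, X, length_scale) + noise_var * np.eye(len(X))
    k_star = rbf_kernel(X, X_new, length_scale)   # column j is k_* for candidate j
    mu = k_star.T @ np.linalg.solve(K, y)         # k_*^T (K + sigma_n^2 I)^-1 y
    prior_var = np.ones(len(X_new))               # k(x, x) = 1 for this kernel
    var = prior_var - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mu, np.maximum(var, 0.0)               # clip tiny negative round-off

X = np.array([1.0, 2.0, 3.0])       # illustrative observed settings
y = np.array([0.5, 1.2, 0.8])       # illustrative (centered) responses
X_new = np.array([2.0, 10.0])       # one observed point, one far from all data
mu, var = gp_posterior(X, y, X_new)
print("mean:", mu, "variance:", var)
```

At the observed setting the variance collapses toward the noise level and the mean reproduces the measurement; at the distant setting the variance returns to the prior value and the mean falls back to zero.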

The workhorse kernel is the Matérn kernel (the squared-exponential / RBF kernel is its limiting smooth case as ν → ∞). Matérn with smoothness parameter ν = 2.5 (i.e. 5/2, the form the code uses) is the bioprocess default: it says nearby settings give similar titers, is differentiable enough to be optimized but not so smooth it pretends the surface is infinitely gentle, and carries a per-dimension length scale ℓ_d that controls how fast similarity decays along each axis. A short length scale on the glucose-feed axis means titer is sensitive to glucose; a long one means that knob barely matters. The length scales, the kernel amplitude, and the noise term σ_n² are not guessed; they are fit by maximizing the marginal likelihood (the "evidence") of the observed runs. That is a low-dimensional optimization over a handful of hyperparameters, which is why a GP with only a dozen data points can still be well-calibrated: it has very few parameters to learn, exactly the small-data discipline the learning-problem chapter argued for. A GP is, in effect, the hybrid-modeling instinct of Book 2 applied to the search rather than the prediction: the kernel is a smoothness prior, so the data has less to do.
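To make the role of the per-dimension length scale concrete, here is a small sketch of the Matérn ν = 5/2 kernel written out by hand. The length-scale values and the feed settings are hypothetical, picked only to show a "sensitive" axis next to an "insensitive" one; they are not fitted values from the artifact.

```python
import math

def matern52(x, x_prime, length_scales, amplitude=1.0):
    """Matern nu=5/2 kernel with one length scale per dimension."""
    # Scaled distance: each axis is divided by its own length scale.
    r = math.sqrt(sum(((a - b) / ell) ** 2
                      for a, b, ell in zip(x, x_prime, length_scales)))
    s = math.sqrt(5.0) * r
    return amplitude * (1.0 + s + s * s / 3.0) * math.exp(-s)

# Hypothetical length scales: short on glucose (axis 0), long on glutamine (axis 1).
ell = (0.5, 5.0)
base = (2.0, 1.5)

k_same = matern52(base, base, ell)
k_glc = matern52(base, (2.5, 1.5), ell)   # move 0.5 along the glucose axis
k_gln = matern52(base, (2.0, 2.0), ell)   # move 0.5 along the glutamine axis
print(f"same point: {k_same:.3f}, glucose step: {k_glc:.3f}, glutamine step: {k_gln:.3f}")
```

The same 0.5-unit step costs far more similarity along the short-length-scale axis, which is how a fitted set of length scales can be read as a rough sensitivity ranking.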

The acquisition function: turning uncertainty into the next run

A surrogate alone does not pick experiments. The acquisition function does. It scores every candidate setting by how useful running it would be, then BO runs the maximizer of that score. Every acquisition function is some balance of two impulses: exploitation (run where the posterior mean is already high) and exploration (run where the posterior variance is high, because you might be missing a better region). Tilt all the way to exploitation and BO climbs the nearest hill and stops at a local optimum; tilt all the way to exploration and it scatters like a space-filling grid and never converges. The art is the balance, and a good acquisition function gets it without a hand-tuned knob.

The most common acquisition function is Expected Improvement (EI). Let f_best be the best titer observed so far (the incumbent). EI scores a candidate by the expectation, under the GP posterior, of how much it would improve on f_best:

EI(x) = E[ max(f(x) - f_best, 0) ]

Because the posterior at x is Gaussian with mean mu(x) and standard deviation sigma(x), this expectation has a closed form built from the standard normal PDF phi and CDF Phi. With z = (mu(x) - f_best - xi) / sigma(x) and a small exploration margin xi:

EI(x) = (mu(x) - f_best - xi) * Phi(z) + sigma(x) * phi(z)     if sigma(x) > 0
EI(x) = 0                                                      if sigma(x) = 0

(The runnable artifact uses the standard default exploration margin xi = 0, so its EI reduces to (mu - f_best)*Phi(z) + sigma*phi(z) — the form in the code block below.) The two terms are the whole story. The first, (mu - f_best) * Phi(z), is the exploitation term — it is large when the mean is well above the incumbent. The second, sigma * phi(z), is the exploration term — it is large when uncertainty is high, even where the mean is mediocre, because the upside tail of a wide Gaussian pokes above f_best. A point that is confidently mediocre (mu below f_best, sigma small) scores near zero on both terms, and BO never wastes a run there. Two other common choices are Upper Confidence Bound (UCB), mu(x) + kappa * sigma(x), which makes the exploration weight an explicit knob kappa you can anneal over the campaign, and Probability of Improvement (PI), Phi((mu(x) - f_best) / sigma(x)), which is greedier (it cares only that you beat the incumbent, not by how much) and so is less used. EI is the default because it needs no tuning knob and behaves well out of the box [1].
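The three acquisition rules can be compared side by side on a handful of candidates. This is a standard-library sketch under stated assumptions: the posterior means and standard deviations are made-up numbers, kappa = 2 is an arbitrary illustrative choice, and the function names are not taken from the artifact.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """Closed-form EI; zero when the posterior has no spread."""
    if sigma <= 0.0:
        return 0.0
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm_cdf(z) + sigma * norm_pdf(z)

def probability_of_improvement(mu, sigma, f_best):
    if sigma <= 0.0:
        return 1.0 if mu > f_best else 0.0
    return norm_cdf((mu - f_best) / sigma)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    return mu + kappa * sigma

f_best = 6.0  # incumbent titer, g/L (illustrative)
candidates = {
    "confident, mediocre": (5.0, 0.05),
    "uncertain, mediocre": (5.0, 1.0),
    "confident, slightly better": (6.2, 0.1),
}
for name, (mu, sigma) in candidates.items():
    print(f"{name:28s} EI={expected_improvement(mu, sigma, f_best):.4f} "
          f"PI={probability_of_improvement(mu, sigma, f_best):.4f} "
          f"UCB={upper_confidence_bound(mu, sigma):.2f}")
```

With these numbers EI and PI both favor the confident, slightly better candidate, while UCB at kappa = 2 favors the uncertain one, and the confidently mediocre candidate scores essentially zero on EI. The disagreement between rules is the exploration knob made visible.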

The loop, then, is five lines of logic: (1) fit the GP to all runs so far and refit its hyperparameters by marginal likelihood; (2) maximize EI over the search space — itself a cheap inner optimization, since the GP and EI are analytic — to pick the next setting; (3) run that experiment — in our demo, simulate the fed-batch; (4) append the (setting, titer) pair to the history; (5) repeat until the budget is gone. The expensive step is (3); everything else is milliseconds. That asymmetry — cheap to decide, costly to evaluate — is the whole reason BO exists, and the reason it is profligate with computation (thousands of EI evaluations per iteration) to save a single reactor run.

Bayesian optimization as a feedback loop: a Gaussian process turns a few expensive runs into a belief surface with honest error bars, Expected Improvement spends the next run where that belief says the payoff is largest, and the new result sharpens the surface — so the points cluster near the optimum instead of tiling a grid. Original diagram by the authors, created with AI assistance.

Building it on the simulator: BO versus a factorial DoE

The runnable artifact makes the argument concrete by optimizing a feed policy against the 14-day fed-batch CHO simulator the whole series shares. The simulator integrates Monod-limited growth with lactate inhibition of the growth rate, an age- and ammonia-driven death phase, and bolus feeds across the run; its day-14 titer is the objective. We expose a two-parameter feed search space — a glucose-feed amount (g/L per bolus) and a glutamine-feed amount (mM per bolus) — and ask BO to maximize final titer. The objective is one full fed-batch run of the parametrized mechanistic model model_fedbatch.py; each call to objective(x) is one "experiment."

The experiment is a head-to-head: a 5×5 factorial grid of feed conditions (25 runs) versus GP-BO with Expected Improvement, seeded with 5 random runs and then 10 acquisition-chosen runs (15 runs total). We run both against the identical deterministic objective over the identical feed bounds, so the comparison is fair; the fixed seed (0) pins BO's five random seed runs and the GP's hyperparameter-optimizer restarts, making the whole campaign exactly reproducible. The harness asserts that BO reaches the grid optimum within 0.1 g/L while using strictly fewer runs, so the demonstration fails loudly if BO ever stops keeping pace with the grid.

# examples/platform/ml/bayesopt_doe.py — GP-BO over the fed-batch feed policy.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
import model_fedbatch as mfb

# search space: glucose-feed (g/L per bolus), glutamine-feed (mM per bolus)
BOUNDS = np.array([[0.8, 3.6], [0.5, 2.8]])

def objective(x) -> float:
    """One fed-batch run -> final titer (the expensive evaluation)."""
    p = mfb.Params(feed_glc=float(x[0]), feed_gln=float(x[1]))
    return mfb.simulate(p, seed=0).summary["final_titer_g_L"]

def _ei(mu, sigma, best):                       # Expected Improvement, closed form
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(n_init=5, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(n_init, 2))   # 1. seed runs
    y = np.array([objective(x) for x in X])
    gx, gy = np.meshgrid(np.linspace(*BOUNDS[0], 40), np.linspace(*BOUNDS[1], 40))
    cand = np.column_stack([gx.ravel(), gy.ravel()])               # dense EI mesh
    kernel = ConstantKernel(1.0) * Matern(length_scale=[1.0, 1.0], nu=2.5) + WhiteKernel(1e-3)
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                      n_restarts_optimizer=2, random_state=seed)
        gp.fit(X, y)                                               # 2. fit surrogate
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[_ei(mu, sigma, y.max()).argmax()]            # 3. maximize EI
        X = np.vstack([X, x_next])                                 # 4. run + append
        y = np.append(y, objective(x_next))
    return X, y

def grid_doe(n=5):                              # 5x5 factorial baseline = 25 runs
    gx, gy = np.meshgrid(np.linspace(*BOUNDS[0], n), np.linspace(*BOUNDS[1], n))
    pts = np.column_stack([gx.ravel(), gy.ravel()])
    return pts, np.array([objective(x) for x in pts])

Running it prints the pattern the chapter is about (verbatim from RUN_OUTPUTS.txt; numbers are the simulator's, not a real plant's):

Single-objective BO of feed policy (objective = final titer g/L)
  factorial DoE   : best 6.246 g/L in 25 runs
  Bayesian opt    : best 6.269 g/L in 15 runs at feed_glc=3.17 g/L, feed_gln=2.80 mM
  BO matched/beat the grid optimum with 10 fewer runs.

The headline is not the absolute titer but the shape: BO matched and slightly beat the 25-run factorial grid's best (6.269 vs 6.246 g/L) using 15 runs — 10 fewer — and it located the optimum in the feed-aggressive corner (feed_glc=3.17 g/L, feed_gln=2.80 mM), near the high end of both bounds, exactly where a sensitivity analysis of the mechanistic model would point. After the 5-run random seed, every BO run was placed on purpose, and the placements migrated toward that corner as the surrogate learned where the ridge was. The two assertions in main() — that BO comes within 0.1 g/L of the grid and uses strictly fewer runs — turn the chapter's economic claim into a regression test: if a library upgrade ever degraded the optimizer, the demo would fail rather than quietly mislead. That is the entire argument for BO in PD, reproduced in code you can run.

That the optimum sits in a corner is an artifact of where we drew the feed bounds: real CHO fed-batch is non-monotonic in feed — push glucose or glutamine too far and accumulating lactate, ammonia, and osmolality (the dissolved-salt-and-solute load the concentrated feeds raise) inhibit growth and erode quality, so the true optimum is usually an interior ridge, not an edge (precisely the curved shape a low-order DoE polynomial smooths over and an adaptive GP bends to find). One further honest caveat: the simulator's objective is deterministic and noise-free, so the GP's WhiteKernel noise term collapses toward zero here — that isolates the search-efficiency argument cleanly, but it is kinder than real PD, where the noise term does real work soaking up assay and run-to-run jitter the GP must not mistake for a ridge.
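The regression check described above can be sketched in a few lines. The artifact's main() is not reproduced in this chapter, so the function below is a hypothetical reconstruction of the two assertions from their prose description; the name check_bo_beats_grid and its signature are assumptions. The numbers fed to it are the ones printed in the run output.

```python
def check_bo_beats_grid(bo_best, bo_runs, grid_best, grid_runs, tol=0.1):
    """Fail loudly unless BO is within tol of the grid and used fewer runs."""
    assert bo_best >= grid_best - tol, (
        f"BO best {bo_best:.3f} g/L is more than {tol} g/L below grid {grid_best:.3f}")
    assert bo_runs < grid_runs, (
        f"BO used {bo_runs} runs, not fewer than the grid's {grid_runs}")
    return grid_runs - bo_runs

# Values from the printed run output: BO 6.269 g/L in 15 runs, grid 6.246 g/L in 25.
saved = check_bo_beats_grid(bo_best=6.269, bo_runs=15, grid_best=6.246, grid_runs=25)
print(f"BO matched the grid optimum with {saved} fewer runs.")
```

Expressing the claim as assertions rather than a printed comparison is what lets a continuous-integration job catch a regression in the optimizer.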

Multi-objective BO: when titer fights quality

Real PD never optimizes titer alone. Aggressive titer-chasing can do two distinct kinds of damage: (a) it can directly erode critical quality attributes (CQAs), the measured product-quality properties the release panel will judge, such as glycosylation, aggregate (high-molecular-weight species), or charge variants; and (b) it can stress the culture into lysis (cells bursting open) that raises a process impurity like host-cell protein (HCP), a purification burden driven by dead-cell fraction and ammonia rather than by the feed-to-aggregate pathway. The golden batch (the reference run whose quality the process aims to reproduce) BATCH-2026-001 sets the bar at 98.611% monomer (the share of product that is intact, un-aggregated antibody); a process that chases titer into a corner where monomer or HCP falls out of spec is not a better process, it is a recall waiting to happen. It was the second pathway that sank BATCH-2026-004, which went out of specification (OOS) on host-cell protein at 128 ng/mg against a spec maximum of 100, the kind of quality failure an aggressive titer-only optimizer is structurally prone to walk into. The honest objective is therefore a vector: maximize titer, keep monomer purity high, and hold a glycoform in spec. These objectives conflict, so there is no single best point; there is a Pareto front, the set of settings where you cannot improve one objective without sacrificing another.

Multi-objective Bayesian optimization (MOBO) searches for that front directly. The clean construction fits a separate GP per objective (one for titer, one for monomer %, one per glycoform) and replaces single-objective EI with a multi-objective acquisition function — most commonly Expected Hypervolume Improvement (EHVI), which scores a candidate by how much it would grow the hypervolume of objective-space dominated by the current Pareto front (the volume between the front and a reference point). An alternative is to run per-objective GPs and search their predicted means with an evolutionary front-finder such as NSGA-II. The output is not a recommendation but a menu: a Pareto set of process settings, each a different defensible trade-off between yield and quality, from which the team picks according to the target product profile.
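Pareto dominance and hypervolume are simple to state in code for the two-objective case. The sketch below is a deterministic illustration on toy points, not simulator output, and it computes the plain hypervolume gain of one known candidate; EHVI is the expectation of that gain under the GP posterior, which this sketch does not attempt. Function names and the reference point are assumptions.

```python
def pareto_front(points):
    """Keep the points no other point dominates (both objectives maximized)."""
    unique = sorted(set(points), reverse=True)          # first objective descending
    return [p for p in unique
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in unique)]

def hypervolume_2d(front, reference):
    """Area dominated by the front, bounded below by the reference point."""
    area, prev_second = 0.0, reference[1]
    for first, second in sorted(front, reverse=True):   # second objective rises
        area += (first - reference[0]) * (second - prev_second)
        prev_second = second
    return area

# Toy (objective 1, objective 2) pairs for illustration only.
runs = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0), (2.0, 1.0)]
ref = (0.0, 0.0)

front = pareto_front(runs)
hv_before = hypervolume_2d(front, ref)

candidate = (2.5, 2.5)                                  # a hypothetical new run
hv_after = hypervolume_2d(pareto_front(runs + [candidate]), ref)
hv_gain = hv_after - hv_before
print(front, hv_before, hv_after, hv_gain)
```

A candidate that lands behind the existing front would produce a gain of zero, which is why a hypervolume-based acquisition spends runs on pushing the front outward or filling its gaps.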

Evidence

A 2025 ETH Zurich + Novo Nordisk study used multi-objective BO (per-objective GPs plus an NSGA-II front search, built on the open-source ProcessOptimizer) to develop a monoclonal-antibody formulation, identifying highly optimized conditions in 33 experiments and improving the diffusion-interaction parameter kD from 9.1 to 48.6 mL/g — research, peer-reviewed-independent [4]. For the culture side, a 2025 study explicitly targets ML-guided CHO bioprocess and media optimization for improved titer and glycosylation (titer prediction R^2 ~0.93, glycan metrics R^2 ~0.79–0.95), with an active-learning step proposing a composition that cut mannosylation by 10% while raising titer — the titer-versus-quality trade-off this section describes — research, peer-reviewed-independent [5]. MOBO for titer-versus-glycosylation is demonstrated, not yet routine GMP practice.

This is also where BO stops being "just an optimizer" and starts being a design-space tool — which is what makes it palatable to the quality organization.

Two objectives at once: the Pareto front, made runnable

The conflict above is not rhetorical — it is geometry baked into the simulator. Pushing titer means a glucose-rich feed, which drives more lactate and ammonia, which the mechanistic model turns into lower monomer (more aggregate); a feed policy that wins on titer loses on quality. MOBO with EHVI maps that trade-off directly: the suite fits one GP per objective (titer, monomer %), maximizes a Monte-Carlo EHVI over a dense candidate mesh (here 20×20, coarser than the single-objective 40×40 because each EHVI evaluation Monte-Carlo-samples the joint posterior), and races the result against a 5×5 factorial grid scored by the same hypervolume:

Multi-objective BO (maximize titer AND monomer %) via EHVI
  factorial DoE   : hypervolume 2.338 in 25 runs
  EHVI Bayesian   : hypervolume 2.338 in 16 runs (13 Pareto-optimal feed policies)
  Pareto front (titer g/L, monomer %):
    titer=6.25  monomer=98.09
    titer=6.03  monomer=98.23
    titer=5.99  monomer=98.26
    titer=5.92  monomer=98.27
    titer=5.68  monomer=98.36
    titer=5.48  monomer=98.40
    titer=5.42  monomer=98.41
    titer=5.26  monomer=98.49
    titer=5.15  monomer=98.51
    titer=5.00  monomer=98.59
    titer=4.69  monomer=98.64
    titer=4.68  monomer=98.64
    titer=4.65  monomer=98.64
  EHVI covered >= the grid's trade-off surface with 9 fewer runs.

Read it honestly. EHVI matched the factorial grid's trade-off surface (hypervolume 2.338 either way) with 9 fewer runs (16 versus 25), and it recovered 13 Pareto-optimal feed policies that span the whole tension: from a high-titer, lower-monomer policy (titer 6.25, monomer 98.09) at one end to a high-monomer, lower-titer one (titer 4.65, monomer 98.64) at the other, with eleven compromises in between. Most of the front in fact sits below the golden batch's 98.611% monomer bar: only the three policies at 98.64% monomer clear it, and the next best (titer 5.00, monomer 98.59) falls just short. The rest of the front is therefore defensible as trade-off points to study, not as ship-ready processes. Notice what the model does not do: it never names a best policy. The choice among the thirteen is the process scientist's, because only they hold the quality target. A program that must clear the golden batch's 98.611% monomer is forced toward the high-monomer end of the front and pays for it in titer; a program with quality headroom can move the other way. The model maps the trade-off; it does not pick the compromise.
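The scientist's selection step can be made explicit. The sketch below filters the printed Pareto front by a monomer floor and reports the highest-titer policy that survives. Treating the golden batch's 98.611% as a hard acceptance threshold is an assumption made for illustration, and the printed values are rounded to two decimals, so this is a reading aid rather than a release decision.

```python
# (titer g/L, monomer %) pairs copied from the printed Pareto front.
FRONT = [
    (6.25, 98.09), (6.03, 98.23), (5.99, 98.26), (5.92, 98.27), (5.68, 98.36),
    (5.48, 98.40), (5.42, 98.41), (5.26, 98.49), (5.15, 98.51), (5.00, 98.59),
    (4.69, 98.64), (4.68, 98.64), (4.65, 98.64),
]
MONOMER_BAR = 98.611   # golden-batch monomer %, used here as an assumed floor

eligible = [(titer, monomer) for titer, monomer in FRONT if monomer >= MONOMER_BAR]
best_eligible = max(eligible)                      # highest titer among survivors
titer_cost = max(FRONT)[0] - best_eligible[0]      # titer given up for quality

print(f"{len(eligible)} of {len(FRONT)} policies clear {MONOMER_BAR}% monomer")
print(f"best eligible policy: titer={best_eligible[0]} g/L, monomer={best_eligible[1]}%")
print(f"titer given up versus the titer-maximal policy: {titer_cost:.2f} g/L")
```

Changing MONOMER_BAR is the whole decision: the front stays fixed and only the eligible subset moves.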

From a point to a design space: BO meets QbD and ICH Q8

A regulator does not want your single best setting; under Quality by Design (QbD), codified in ICH Q8(R2), they want a design space, defined verbatim as "the multidimensional combination and interaction of input variables (e.g., material attributes) and process parameters that have been demonstrated to provide assurance of quality," inside which you may move without filing a change [6]. Classically a design space is carved out of a DoE's response surface by drawing a contour where the predicted response clears spec. The learning version replaces that deterministic surface with a probabilistic model and reports the design space as the region where the posterior predictive probability of simultaneously meeting all CQAs exceeds a chosen threshold. This is a Bayesian probabilistic design space, the construction Peterson introduced in 2008 [7] and that Bano and colleagues extended with PLS latent-variable models and a posterior-predictive criterion in 2018 [8].

The GP that BO builds is such a model. Its posterior gives, at every setting, a predicted mean and a calibrated uncertainty for each CQA — exactly the ingredients to compute "the probability this setting passes." Concretely, with a per-CQA GP you can evaluate P(all CQAs in spec | x) by integrating each Gaussian posterior over its acceptance interval and (under independence, or with a joint model under correlation) multiplying — then threshold at, say, 0.95 to draw the design-space boundary with uncertainty made explicit rather than hidden inside a polynomial's residuals. So the same machinery that found the optimum can be reused to describe the safe region around it. That dual use is why BO sits so well with QbD: it is an optimizer for the development scientist and a design-space generator for the regulatory filing, from one fitted model. The connection runs straight into Book 4's ontology, where bp:FeedRate bp:affectsQuality bp:MonomerPct-CQA is exactly the type-level edge a probabilistic design space quantifies — BO supplies the evidence and the numeric ranges that edge needs.
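The "probability this setting passes" computation is short enough to write out. This sketch assumes independent Gaussian posteriors per CQA and uses made-up posterior means and standard deviations at a single candidate setting. The HCP ceiling of 100 ng/mg is the spec maximum quoted earlier; using 98.611% as a monomer floor is an assumption for illustration.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_in_interval(mu, sigma, lower=-math.inf, upper=math.inf):
    """Probability that a Gaussian posterior N(mu, sigma^2) lies in [lower, upper]."""
    return norm_cdf((upper - mu) / sigma) - norm_cdf((lower - mu) / sigma)

# Hypothetical per-CQA posteriors at one setting x: (mean, std, lower, upper).
posterior_at_x = {
    "monomer_pct": (98.70, 0.05, 98.611, math.inf),
    "hcp_ng_mg": (80.0, 10.0, -math.inf, 100.0),
}

p_each = {cqa: prob_in_interval(*spec) for cqa, spec in posterior_at_x.items()}
p_joint = math.prod(p_each.values())      # valid only under independence
THRESHOLD = 0.95
in_design_space = p_joint >= THRESHOLD

for cqa, p in p_each.items():
    print(f"P({cqa} in spec) = {p:.4f}")
print(f"P(all CQAs in spec) = {p_joint:.4f}, inside design space: {in_design_space}")
```

In this example each CQA passes on its own with probability above 0.95, yet the joint probability falls below the threshold. That is the practical difference between a joint probabilistic criterion and overlaying per-response contours.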

A caution belongs here. A GP's uncertainty is only as honest as its kernel and its data. With a dozen runs across even a handful of dimensions, the posterior far from the data is a prior assumption wearing the costume of a measurement, and a probabilistic design space drawn from it can look more authoritative than it is. The validation chapter and the FDA's model-credibility framework both insist the design space's claims be checked against held-out confirmation runs before they govern anything — the model proposes the region; confirmation runs ratify it. And a design space demonstrated at micro-bioreactor scale is not automatically a manufacturing-scale design space: scale-dependent gradients (mixing, kLa, CO2 stripping, shear) can move the edges of failure, so confirmation runs must include the intended scale — not only held-out points at development scale (ICH Q8/Q11 expect a design space to account for scale and equipment).

What grounds the loop: the ontology under the GP's training data

The bp:FeedRate bp:affectsQuality bp:MonomerPct-CQA edge is the visible tip of a deeper dependency: a BO loop is only as trustworthy as the rows it folds into its surrogate, and an ontology (a formal, machine-readable vocabulary of the domain's types and relations) plus the knowledge graph built on it is what makes those rows trustworthy and FAIR (Findable, Accessible, Interoperable, Reusable). The dependency runs four ways, each grounded in Book 4.

First, semantically-grounded features. The GP's objective(x) returns a titer, and a richer loop's constraint check reads a CQA — but in a real plant those numbers arrive from a LIMS and a historian under fragile, plant-local column names. Pulling a feature by its ontology IRI (Internationalized Resource Identifier — the global, unambiguous name of a thing in the graph), bp:monomerPct hung on the lot rather than a string like temp_reactor, is what lets the same BO code run against two sites' data without a silent column-name mismatch corrupting the surrogate. The identifiers-and-units chapter is why a feed read in g/L per bolus is never accidentally averaged with one in mM.

Second, SHACL gates the training set, not only the release lot. The same closed-world SHACL (Shapes Constraint Language) shape that decides whether a lot may ship — every required CQA present, singular, and in range — pointed at the candidate runs the GP is about to ingest, refuses a dataset with a dropped result or a missing unit before a single row reaches gp.fit. A garbage row poisons a GP faster than a polynomial (the surrogate trusts it as a hard observation), so the release-gate shape doing double duty as a training-data gate is exactly the guard this loop needs.
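The gating idea can be illustrated without a triple store. The following is a plain-Python analog of a closed-world shape check, not SHACL itself: the field names, units table, and numeric ranges are hypothetical, and a real deployment would validate the graph with a SHACL engine rather than dictionaries.

```python
# Hypothetical required results with plausibility ranges, and expected units.
REQUIRED_RESULTS = {"titer_g_L": (0.0, 20.0), "monomer_pct": (0.0, 100.0)}
EXPECTED_UNITS = {"feed_glc": "g/L", "feed_gln": "mM"}

def validate_row(row):
    """Return a list of problems; an empty list means the row may be ingested."""
    problems = []
    for field, (low, high) in REQUIRED_RESULTS.items():
        value = row.get(field)
        if value is None:
            problems.append(f"missing {field}")
        elif not low <= value <= high:
            problems.append(f"{field}={value} outside [{low}, {high}]")
    for field, unit in EXPECTED_UNITS.items():
        if row.get(f"{field}_unit") != unit:
            problems.append(f"{field} unit is not {unit}")
    return problems

def gate_training_set(rows):
    """Split candidate runs into rows safe to fit on and rejected run ids."""
    accepted, rejected = [], {}
    for row in rows:
        problems = validate_row(row)
        if problems:
            rejected[row["run_id"]] = problems
        else:
            accepted.append(row)
    return accepted, rejected

rows = [
    {"run_id": "R1", "feed_glc": 2.0, "feed_glc_unit": "g/L", "feed_gln": 1.5,
     "feed_gln_unit": "mM", "titer_g_L": 5.4, "monomer_pct": 98.4},
    {"run_id": "R2", "feed_glc": 3.0, "feed_glc_unit": "g/L", "feed_gln": 2.0,
     "feed_gln_unit": "mM", "titer_g_L": 6.0, "monomer_pct": None},   # dropped result
    {"run_id": "R3", "feed_glc": 1.0, "feed_glc_unit": "g/L", "feed_gln": 1.0,
     "titer_g_L": 4.8, "monomer_pct": 98.6},                          # missing unit
]
accepted, rejected = gate_training_set(rows)
print([r["run_id"] for r in accepted], rejected)
```

Only the accepted rows would be passed on to the surrogate fit; the rejected ones carry their reasons with them so the gap is visible rather than silent.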

Third, lineage is the grouping key. When BO warm-starts from a related molecule's or related scale's campaign (the transfer-learning move below), the runs it pools are not independent: two results sharing a BATCH-2026-001 ancestor are near-duplicate siblings, and a surrogate validated on a row-random split looks brilliant and generalizes to nothing. The bp:derivedFrom spine — a PROV-O (prov:wasDerivedFrom) lineage edge in the graph — is the grouping key for a leave-one-batch-out validation that holds out an entire campaign at a time, the only honest way to ask "would this prior have helped on a genuinely new batch."
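Grouping by lineage amounts to building folds from the ancestor field instead of from row indices. Here is a minimal standard-library sketch; the derived_from key stands in for the bp:derivedFrom edge, and the third ancestor id is a hypothetical placeholder alongside the two batches named in this chapter.

```python
from collections import defaultdict

def leave_one_batch_out(rows, group_key="derived_from"):
    """Yield (held_out_ancestor, train_indices, test_indices), one fold per ancestor."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[group_key]].append(index)
    for held_out, test_idx in groups.items():
        train_idx = [i for i, row in enumerate(rows) if row[group_key] != held_out]
        yield held_out, train_idx, test_idx

rows = [
    {"run_id": "R1", "derived_from": "BATCH-2026-001"},
    {"run_id": "R2", "derived_from": "BATCH-2026-001"},   # sibling of R1
    {"run_id": "R3", "derived_from": "BATCH-2026-004"},
    {"run_id": "R4", "derived_from": "HYPOTHETICAL-CAMPAIGN-B"},
    {"run_id": "R5", "derived_from": "HYPOTHETICAL-CAMPAIGN-B"},
]

folds = list(leave_one_batch_out(rows))
for held_out, train_idx, test_idx in folds:
    print(f"hold out {held_out}: train rows {train_idx}, test rows {test_idx}")
```

Siblings always land on the same side of the split, so a surrogate cannot score well simply by recognizing a near-duplicate of a run it was trained on.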

Fourth, BFO keeps the measurement distinct from the run. A titer reading (a continuant — a thing that persists and bears qualities) is a different node from the fed-batch run that produced it (an occurrent — a process that happens and is over) and from BR-101 the vessel. Collapse "the result" into "the run" and a surrogate trained across a reused reactor inherits one run's facts into another — the same modeling error the upper-spine continuant/occurrent split exists to forbid. None of this is decoration: the reasoned, shape-validated graph is the ground truth the model is checked against, not the other way round — and a probabilistic design space is only as defensible as the typed, unit-checked, lineage-grouped data the GP was allowed to see.

Anatomy of one Bayesian-optimization iteration

The series signature is to take one record apart. For BO the record is not a batch or a prediction — it is one iteration of the loop, the unit that turns the entire run history into a single chosen experiment. Unpack it and the chapter's logic is laid out as fields.

One BO iteration, fully unpacked: the run history in, a fitted Gaussian process with its kernel and length scales, the Expected-Improvement score over candidates, the chosen next experiment with its predicted mean and uncertainty and its constraint check, and the provenance — seed, acquisition type, budget — that makes the decision auditable and reproducible. Original diagram by the authors, created with AI assistance.

Read the card top to bottom; each field is a line of the algorithm made inspectable.

Header — what iteration this is. Iteration 8 of the feed-policy campaign for mAb-A: the index matters because BO's behavior is non-stationary — early iterations explore, late ones exploit, and the same setting proposed at iteration 2 versus iteration 12 means different things.
Inputs — the entire history. Every (setting, titer) pair run so far, not just the latest. BO is stateful in a way a one-shot predictor is not: the next decision is conditioned on all prior results through the GP posterior. Drop a row and the surrogate, the EI surface, and the chosen experiment all change.
Surrogate — the fitted GP. The kernel family (Matérn ν = 2.5), the fitted per-dimension length scales (a short length scale on the glucose-feed axis means titer is sensitive there — the model has learned which knobs matter, and this is directly readable as a sensitivity report), the kernel amplitude, and the noise term (WhiteKernel) that keeps the GP from interpolating sensor jitter and assay noise as if it were signal. These hyperparameters are refit by marginal likelihood every iteration, so the field is a snapshot, not a constant.
Acquisition — the EI computation. The incumbent best f_best, the EI surface evaluated over the dense candidate mesh (40×40 = 1,600 points in the demo; thousands in practice), and the argmax — the single setting EI says is most worth running, reported with its predicted mean and uncertainty so a scientist can see why it was chosen: high mean (exploitation), high variance (exploration), or both. A reviewer who sees a high-variance, modest-mean pick knows BO is exploring; a high-mean, low-variance pick is BO closing in.
Decision — the experiment to run. The concrete feed-policy values (feed_glc, feed_gln, and in a richer search a temperature-shift flag), plus — in a constrained variant — a constraint check that the candidate keeps lactate in band, so a production BO cannot recommend a setting the real process would reject. The demo's optimizer is unconstrained (an EI argmax with no quality fence), but this is exactly where the fence belongs — so the optimizer is not free to "win" by starving the culture.
Provenance — what makes it auditable. The random seed, the acquisition type (EI), the budget consumed (runs used / budget), and a link to the simulator (or LIMS/historian) run that will return the new titer. This is the discipline the MLOps chapter demands of any model that touches development decisions: the iteration is exactly reproducible — same seed, same history, same pick — and every field can be replayed in an audit.
Footer — the fold. The returned titer appends one row to the history and the loop repeats from the surrogate step.

The record is the loop made inspectable: nothing about the next experiment is arbitrary, and every field can be traced.
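
The card can be sketched as code. What follows is a minimal illustration using scikit-learn and SciPy, not the chapter's bayesopt_doe.py: the `BOIteration` record, `run_iteration`, the `toy_titer` response, the [0, 1] scaling of the two feed axes, and the starting hyperparameters are all assumptions made for the example.

```python
from dataclasses import dataclass

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel


@dataclass(frozen=True)
class BOIteration:
    index: int            # header: which iteration this is
    n_history: int        # inputs: number of rows the GP was fit on
    length_scales: tuple  # surrogate: fitted per-dimension length scales
    f_best: float         # acquisition: incumbent best
    next_setting: tuple   # decision: (feed_glc, feed_gln), scaled to [0, 1]
    pred_mean: float
    pred_std: float
    ei: float
    seed: int             # provenance
    acquisition: str = "EI"


def expected_improvement(mu, sigma, f_best):
    """Closed-form Expected Improvement for a maximization problem."""
    sigma = np.maximum(sigma, 1e-12)  # avoid dividing by zero
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)


def run_iteration(X, y, index, seed=0, mesh=40):
    kernel = (ConstantKernel(1.0) * Matern(length_scale=[0.3, 0.3], nu=2.5)
              + WhiteKernel(noise_level=1e-2))
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                  n_restarts_optimizer=2, random_state=seed)
    gp.fit(X, y)  # hyperparameters are refit by marginal likelihood

    axis = np.linspace(0.0, 1.0, mesh)
    cand = np.array([(a, b) for a in axis for b in axis])  # mesh x mesh candidates
    mu, sigma = gp.predict(cand, return_std=True)
    f_best = float(np.max(y))
    ei = expected_improvement(mu, sigma, f_best)
    k = int(np.argmax(ei))

    matern = gp.kernel_.k1.k2  # kernel layout: (Constant * Matern) + White
    return BOIteration(
        index=index, n_history=len(y),
        length_scales=tuple(float(v) for v in np.atleast_1d(matern.length_scale)),
        f_best=f_best,
        next_setting=(float(cand[k, 0]), float(cand[k, 1])),
        pred_mean=float(mu[k]), pred_std=float(sigma[k]), ei=float(ei[k]), seed=seed,
    )


def toy_titer(X):
    """Hypothetical smooth response standing in for the simulator."""
    return 6.0 - 4.0 * (X[:, 0] - 0.6) ** 2 - 3.0 * (X[:, 1] - 0.4) ** 2


rng = np.random.default_rng(0)
X_hist = rng.uniform(0.0, 1.0, size=(8, 2))
y_hist = toy_titer(X_hist) + rng.normal(0.0, 0.05, size=8)
record = run_iteration(X_hist, y_hist, index=8)
print(record)
```

Because the record is a frozen dataclass built from the seed and the history alone, rerunning `run_iteration` on the same inputs in the same environment should return an equal record, which is the replay property the provenance field is there to support.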

High-throughput automation: the Ambr makes the loop physically runnable

BO's loop is only as fast as its slowest step — running the experiment. A loop that waits two weeks per single bench run, one at a time, is mathematically elegant and operationally hopeless: 15 sequential runs at two weeks each is over half a year. What makes BO practical in PD is high-throughput micro-bioreactor automation, above all the Ambr (automated micro-bioreactor) systems — the Ambr 15 (24–48 vessels at roughly 10–15 mL) and the Ambr 250 (24–48 vessels at roughly 100–250 mL) — with automated liquid handling, sampling, and feeding. Now a BO "iteration" can propose a batch of settings — 24 at once — run them in parallel, and fold all 24 results back into the surrogate, collapsing a year of sequential runs into a few parallel rounds. Batch (parallel) BO acquisition functions, the q-extensions q-EI and q-EHVI, are designed for exactly this: pick q diverse, jointly informative settings rather than q copies of the single EI maximizer, so the parallel slots are not wasted on near-duplicate runs — the joint acquisition explicitly penalizes redundancy among the batch.
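
The redundancy problem is easy to see in a toy version. The sketch below is deliberately not q-EI or q-EHVI, which score the batch jointly. It is a cruder greedy heuristic (rank by single-point score, skip anything too close to a pick already made) that only illustrates why a batch should not be q copies of the maximizer. The function name, the `min_dist` rule, and the example values are hypothetical.

```python
import math


def greedy_diverse_batch(candidates, scores, q, min_dist):
    """Pick up to q high-scoring candidates that are at least min_dist apart.

    candidates: list of coordinate tuples, already scaled to comparable ranges
    scores: acquisition value per candidate (for example, single-point EI)
    """
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if all(math.dist(candidates[i], candidates[j]) >= min_dist for j in chosen):
            chosen.append(i)
        if len(chosen) == q:
            break
    return [candidates[i] for i in chosen]


# Three near-duplicate candidates around the maximizer, plus two distant ones.
candidates = [(0.50, 0.50), (0.51, 0.50), (0.52, 0.49), (0.10, 0.90), (0.90, 0.10)]
scores = [1.00, 0.99, 0.98, 0.40, 0.30]

print(greedy_diverse_batch(candidates, scores, q=3, min_dist=0.2))
# The two near-duplicates of (0.50, 0.50) are skipped in favor of distant settings.
```

A naive top-3 by score would have spent all three vessels within 0.02 of each other; the joint acquisitions achieve the same spread in a principled way, by accounting for how much each pick reduces the value of its neighbors.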

This is the configuration the strongest real demonstrations use. Thermodynamics-aware BO for media design was validated in Ambr 15 micro-bioreactors against a space-filling baseline [2]; a recent self-driving perfusion-development study (DataHow, Sartorius, and Merck KGaA / Ares Trading) ran Bayesian optimal experimental design with a cognitive digital twin (a hybrid mechanistic-plus-data model with step-wise Gaussian-process surrogates) across 24 parallel Ambr 250 mini-bioreactors over a 27-day perfusion cultivation, transferring learning between cell lines [9]. The pairing is the point: BO supplies the brains (where to run next), the Ambr supplies the hands (running 24 at once), and the historian and contextualization from Book 2 and Book 3 supply the memory, so that each run's result lands as a clean, attributable row the surrogate can trust (garbage rows poison the GP faster than they poison a polynomial).

That memory is only trustworthy if it is standardized. Each Ambr run's setpoints and results land against an ISA-95 (IEC 62264) equipment-and-material model, so the surrogate knows which vessel and what kind of measurement produced a number; they arrive over OPC UA (the vendor-neutral protocol that carries tagged readings between instrument and historian); and they are exchanged as B2MML (Business-to-Manufacturing Markup Language, the XML form of ISA-95) batch records, with PROV-O provenance edges (prov:wasGeneratedBy for the run, prov:wasAttributedTo for the analyst) tying each result back to what and who produced it. The BO history is then auditable, not just a pile of (setting, titer) pairs; Book 2's data-shadow and semantic-interoperability chapters are where this plumbing is built. The honest caveat is the one Book 2 flags: an OPC UA Companion Specification standardizing the information model for a mammalian-cell bioreactor does not yet exist, so the semantics still vary plant to plant, which is precisely why the ontology layer above does the reconciling.

Retrofitting in-line Raman onto Ambr (Sartorius's BioPAT Spectro, with a standardized optical interface for model transfer to BIOSTAT STR) closes part of the measurement gap, so the objective is read sooner rather than waiting for offline assays [10]; the integration and efficiency claims are vendor-self-reported.

The unsolved part: the cold start, transferability, and trusting the closed loop

BO's weakness is the same small-data ceiling that haunts the whole book, sharpened at the start of the loop. The cold start is acute: with zero prior runs the GP is pure prior, and its first handful of suggestions are barely better than space-filling guesses — in our own demo BO spends 5 seed runs before the acquisition function has anything to act on. BO earns its advantage only after the surrogate has data to bend, so on a budget of three or four runs it can lose to a clever fixed design. A poor kernel choice (wrong smoothness, wrong length-scale prior) or a badly scaled search space (mixing a 0–1 flag with a 0–100 setpoint without normalization) can leave it worse than a grid for the first several iterations, and the failure is silent — the GP will happily report confident nonsense. The standard mitigations are informed priors (seed the GP with a mechanistic model's predictions — the hybrid move — so the prior mean is not flat) and transfer learning (warm-start from a related molecule's or related scale's campaign). But here lurks the field's documented hard problem: run-to-run variability in living systems severely compromises transferability, so a surrogate learned on one cell line or one scale may actively mislead on the next, and a confident warm-started prior can be worse than no prior at all [1].
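
The scaling failure is simple arithmetic, sketched below with the standard library only. The two-factor search space, the `BOUNDS` table, and the helper names are hypothetical. The point is that a distance-based kernel sees raw coordinates, so an unscaled 0–100 setpoint swamps a 0–1 flag.

```python
import math

# Hypothetical search space: a 0/1 temperature-shift flag and a 0-100 setpoint.
BOUNDS = {"temp_shift": (0.0, 1.0), "setpoint": (0.0, 100.0)}


def raw_vec(setting):
    """Coordinates in original units, in BOUNDS order."""
    return tuple(setting[k] for k in BOUNDS)


def to_unit(setting):
    """Min-max scale every factor to [0, 1] so distances are comparable across axes."""
    return tuple((setting[k] - lo) / (hi - lo) for k, (lo, hi) in BOUNDS.items())


def from_unit(u):
    """Map a point in the unit cube back to original units."""
    return {k: lo + v * (hi - lo) for v, (k, (lo, hi)) in zip(u, BOUNDS.items())}


a = {"temp_shift": 0.0, "setpoint": 50.0}
b = {"temp_shift": 1.0, "setpoint": 50.0}  # flag flipped: the largest possible change on that axis
c = {"temp_shift": 0.0, "setpoint": 60.0}  # setpoint nudged by a tenth of its range

print("raw:   ", math.dist(raw_vec(a), raw_vec(b)), math.dist(raw_vec(a), raw_vec(c)))
print("scaled:", math.dist(to_unit(a), to_unit(b)), math.dist(to_unit(a), to_unit(c)))
```

In raw units the small setpoint nudge looks ten times farther away than flipping the flag; after scaling the ordering reverses, which is the geometry a single shared length-scale prior implicitly assumes.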

The deeper unsolved part is trusting a closed loop in a regulated setting. A BO campaign that merely proposes experiments for a scientist to approve is comfortably human-in-the-loop: a person sits between the acquisition function and the reactor and can veto a pick. The frontier is a self-driving lab that chooses, runs, and acts on experiments without a human between iterations, and that is exactly the autonomy the regulatory line is drawn against. The peer-reviewed self-driving demonstrations are explicit that they are pilot/research at PD scale (3–15 L). That is a development bench scale, not the 2,000 L-and-up production reactors a commercial process runs in, where the scale-dependent gradients flagged above (mixing, kLa, CO2 stripping, shear) can move the result, and their own authors stress the gap between robotic capability and device autonomy [9][11]. The WuXi Biologics ISLFCC autonomous-lab result is a +26.8% average titer across three CHO clones, with late-phase lactate held low, versus traditional three-stage empirical development; it is peer-reviewed but PD-scale (3 L and 15 L) and not GMP [11]. Read the 26.8% figure as an illustrative single-company headline (self-reported and unreplicated), not an industry benchmark. Draft EU/PIC/S GMP Annex 22 sharpens the line. It is the draft AI annex to the EU Good Manufacturing Practice (GMP) guide (the legally binding rules for commercial drug production), co-developed with the international PIC/S inspectorate scheme, and it covers only static, deterministic AI models in critical GMP applications; it explicitly states that models which adapt their performance during use (continuously-learning / adaptive models) are not covered and should not be used in critical GMP roles [12]. A BO loop is, by construction, continuously learning, which means it is fine in development but cannot drive a commercial process without first locking the model and operating under a predetermined change-control plan.

The honest status: BO is a production-grade development accelerator and a strong design-space tool; the autonomous lab that runs PD end to end without a human is a vivid, real, but still pilot capability.

What this chapter adds to the model suite

This chapter contributes the optimization workhorse of Book 5's example suite:

examples/platform/ml/bayesopt_doe.py — a Gaussian-process Bayesian optimizer over the fed-batch simulator's glucose/glutamine feed policy, with a closed-form Expected-Improvement acquisition maximized over a dense candidate mesh, a Matérn(ν=2.5)-plus-WhiteKernel surrogate refit each iteration, and a head-to-head 5×5 factorial-grid baseline. It demonstrates, on the shared simulator, BO matching and beating a 25-run factorial grid's best (6.269 vs 6.246 g/L) in 15 runs — 10 fewer — and asserts that result so the demo fails loudly if BO ever stops winning. The module is structured so the single-objective EI can be swapped for a multi-objective EHVI acquisition (titer versus a quality proxy) and so the GP posterior can be reused to sketch a probabilistic design space, wiring it forward to the QC/release and hybrid-model chapters.

It sits beside the suite's predictive models (the deep soft sensor, the clone-ranking model) as the one that decides what to run, not what will happen — the suite's only search algorithm.

Reproducibility here is a full chain, not just a seed. The fixed seed (0) pins BO's randomly drawn seed runs (the initial experiments), but the whole result is replayable only because the suite's run_all.py runs every module under a pinned environment (a versions.lock of the exact scikit-learn / SciPy / NumPy versions). A single library upgrade can move a GP's hyperparameter fit and silently shift which corner EI selects, so a reproducible BO claim that does not pin its libraries is not reproducible. With the lock in place, the RUN_OUTPUTS.txt figures above (6.269 g/L in 15 runs, hypervolume 2.338) are reproducible byte-for-byte, and the assert that BO beats the grid is a regression test the locked environment makes meaningful.
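
A pre-flight check of that kind can be sketched with the standard library. The format of the suite's versions.lock is not shown here, so the sketch assumes plain `name==version` lines; `parse_lock`, `lock_mismatches`, and the version numbers in the usage example are hypothetical placeholders, not the suite's actual pins.

```python
from importlib import metadata


def parse_lock(text):
    """Parse 'name==version' lines; blank lines and '#' comments are skipped."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"unpinned entry: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def lock_mismatches(pins, lookup=installed_version):
    """Return {package: (pinned, found)} for every package that differs from the lock."""
    return {name: (want, got) for name, want in pins.items()
            if (got := lookup(name)) != want}


# Placeholder pins, for illustration only.
pins = parse_lock("scikit-learn==1.0.0\nnumpy==2.0.0  # placeholder\n")
print(lock_mismatches(pins))  # any entry printed here means the environment has drifted
```

Running such a check before the BO module, and refusing to proceed on any mismatch, is what turns the seed into a reproducibility claim instead of a hope. The `lookup` parameter exists so the comparison can be exercised against a fake environment in a unit test.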

Why it matters

Process development is where the cost of a biologic's manufacturing process is largely fixed, and where experiments are most expensive relative to their information content. A method that finds a better process in 15 runs instead of 25 — or in dozens instead of hundreds — is not a marginal convenience; it shortens timelines, frees parallel-reactor capacity, and — because the surrogate it leaves behind is a probabilistic model of the whole region, not a single point — it hands the regulatory filing a quantified, uncertainty-aware design space for free. BO is also the rare ML application in this book whose value does not depend on a flood of data: it is designed for the small-data, expensive-experiment regime that defeats black-box prediction, which is precisely why it is the strongest, most defensible learning method in PD and one of the few that crosses cleanly from research into routine development use.

In the real world

The production-grade reality is hybrid modeling plus experiment design, and the named players are consistent. DataHow — an independent ETH Zurich spin-off, not a Sartorius subsidiary — sells DataHowLab hybrid models with transfer learning and reports 30–60% (up to 80%) fewer experiments; its flagship Bristol Myers Squibb PD case (48 runs at 5 L, 12 CPPs — critical process parameters, the controllable inputs like feed and temperature — and 18 CQAs) has a peer-reviewed companion headlining roughly 33% better product-quality prediction with about half the data — and the peer-reviewed figures (33%, ~half) are the ones to cite, not the vendor page's 22%/3x [13] (pilot/research; vendor efficiency headlines are vendor-self-reported). Sartorius ships the Umetrics MODDE (DoE) / SIMCA (multivariate modeling) stack and the Ambr / BioPAT Spectro hardware that the loop runs on; its autonomous self-optimizing layer remains aspirational [10]. On the academic frontier, the ETH Zurich + Novo Nordisk multi-objective formulation work and the DataHow / Sartorius / Merck self-driving perfusion study are the clearest peer-reviewed BO demonstrations [4][9], while WuXi Biologics' ISLFCC is the most ambitious autonomous-lab claim (peer-reviewed, single-company, self-reported, PD-scale) [11]. The cross-industry maturity signal is sobering: the 7th ISPE Pharma 4.0 survey found AI/ML to have the most pilots and the fewest scaled implementations, with the share of projects stuck in pilot stagnant [14] — BO-driven PD is one of the few corners where the pilots are genuinely turning into routine practice.

Key terms

Bayesian optimization (BO) — a sequential strategy for maximizing an expensive, noisy, black-box function in few evaluations, by fitting a probabilistic surrogate and choosing each next experiment to maximize an acquisition score.
Gaussian process (GP) — the usual surrogate model; a distribution over functions that returns, at every point, a posterior mean mu(x) and a calibrated variance sigma2(x) in closed form, governed by a kernel and fit by marginal likelihood.
Kernel (covariance function) — the GP's similarity prior (e.g. Matérn ν=2.5); its fitted per-dimension length scales reveal which process parameters matter, and a white-noise term keeps it from interpolating measurement jitter.
Acquisition function — the rule that scores candidate experiments by usefulness, balancing exploitation (high mean) against exploration (high uncertainty); Expected Improvement (EI), UCB (mu + kappa*sigma), and Probability of Improvement are the common choices.
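Expected Improvement (EI) — the default acquisition; the expected amount by which a candidate would beat the best result seen so far, in closed form EI(x) = (mu - f_best)*Phi(z) + sigma*phi(z) with z = (mu - f_best)/sigma, where mu and sigma are the GP posterior mean and standard deviation at the candidate and Phi and phi are the standard normal CDF and PDF.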
Design of experiments (DoE) — the classical, non-adaptive grid of experiments (factorial, fractional-factorial, central-composite, Latin-hypercube) that BO competes against and usually beats on budget.
Multi-objective BO (MOBO) — BO over several conflicting objectives (titer versus quality), returning a Pareto front of trade-offs rather than a single point; typically uses Expected Hypervolume Improvement (EHVI) or per-objective GPs with an NSGA-II front search.
Expected Hypervolume Improvement (EHVI) — the multi-objective acquisition; scores a candidate by how much it is expected to enlarge the hypervolume the current Pareto front dominates (relative to a reference point) — the area-of-trade-off-surface analogue of single-objective Expected Improvement.
Pareto front — the set of settings where no objective can be improved without worsening another; the honest output when goals conflict.
Design space / QbD / ICH Q8 — the demonstrated region of conditions assured to give acceptable quality; a probabilistic design space reports it as the region where the modeled posterior probability of meeting all CQAs exceeds a threshold.
Ambr (automated micro-bioreactor) — automated 24/48-parallel micro-bioreactor systems (the Ambr 15 at roughly 10–15 mL, the Ambr 250 at roughly 100–250 mL) that make BO's experiment loop physically fast enough to run; enables batch BO (propose q diverse settings at once via q-EI / q-EHVI).
Cold start — BO's weak early phase, when the surrogate has too little data to be better than space-filling; mitigated by mechanistic or transfer-learning priors, though run-to-run variability limits transfer.
Self-driving lab — an autonomous loop that chooses, runs, and acts on experiments without a human between iterations; demonstrated at PD scale (pilot/research), excluded from critical GMP by draft Annex 22's bar on continuously-learning models.
Semantically-grounded feature — a model input pulled by its ontology IRI (e.g. bp:monomerPct), not a fragile plant-local column name, so the same BO code runs against two sites' data without a silent mismatch corrupting the surrogate.
SHACL-gated training set — using the same closed-world release-gate shape that decides whether a lot may ship to refuse a BO training subgraph with a missing result, unit, or lineage parent before any row reaches gp.fit.
Lineage grouping key (bp:derivedFrom / PROV-O) — the genealogy spine that makes a warm-started or pooled BO surrogate honestly validatable, by holding out an entire derivedFrom-connected campaign (leave-one-batch-out) rather than splitting near-duplicate sibling runs at random.
ISA-95 / OPC UA / B2MML — the equipment-and-material model (IEC 62264), vendor-neutral transport protocol, and XML batch-record form that land each Ambr run's result as a standardized, contextualized row the surrogate can trust; an OPC UA mammalian-bioreactor Companion Specification does not yet exist, so the ontology layer still reconciles plant-to-plant semantics.
Pinned environment (versions.lock) — the reproducibility complement to a fixed seed; locking the exact scikit-learn / SciPy / NumPy versions so a library upgrade cannot silently move the GP fit and change which setting EI selects.

Where this leads

We have a process — a feed, a temperature, a design space — found in a fraction of the runs a grid would need. But every BO iteration leaned on being able to measure the objective: titer, glycosylation, purity. Those measurements come from the analytical lab, and the next chapter asks how learning transforms the instruments themselves. Analytical Methods: Chemometrics, Deep Spectroscopy, and Automated Chromatograms turns to the soft sensors and spectral models that read the very objectives this chapter optimized against — from PLS chemometrics to deep spectroscopy and automated chromatogram interpretation — the measurement layer the entire learning enterprise is built on.
