Tech Transfer and Scale-Up: Models That Travel Between Scales

📍 Where we are: Part II · Discovery & Development, Learned — Chapter 9. The analytical methods chapter built the chemometric and deep-spectroscopy models that read a spectrum or a chromatogram. This chapter asks the uncomfortable follow-up: do those models still work when you move them to a different probe, a different reactor, a different scale?

Every model in this book so far has been trained and validated on data from one place — one probe, one Ambr (Sartorius's automated mini-bioreactor system, here the 15 mL format) bank, one development suite, one set of raw-material lots. That is exactly where models are cheapest to build and where the data is densest. It is almost never where the medicine is actually made. The whole point of process development is to hand a recipe to manufacturing, and a model that rode along with that recipe has to make the same journey: from 15 mL to 2,000 L, from a development Raman head to the plant's installed analyzer, from the team that built it to the team that has to trust it on a commercial batch. This chapter is about what happens to a learned model on that journey, and it is mostly a chapter about failure modes — because the default outcome of moving a model is that it gets quietly, sometimes catastrophically, worse.

The honest framing is this: transferability is the binding constraint on machine learning in biomanufacturing, more than algorithm choice or even data volume. A bioprocess model is not a law of nature; it is a fit to a particular measurement apparatus observing a particular living system under particular conditions. Change the apparatus or the conditions and the fit no longer holds — and the model, being just arithmetic, has no way to tell you it has stopped being right.

The simple version

Imagine you learn to judge a soup's saltiness by tasting it in your own kitchen, with your own spoon, on your own tongue. You get very good at it. Then someone hands you a different spoon, in a restaurant kitchen, with the soup made at fifty times the volume in a vat instead of a pot, and asks you to call the salt level to the gram. Your skill does not simply transfer. The spoon tastes different, the big vat is not stirred the same way, and your calibrated tongue was tuned to your soup. A bioprocess model is in exactly this position when it moves from the lab to the plant. This chapter is about the two ways to cope: re-tune your tongue to the new spoon (calibration transfer), or build a model that understands why soup gets salty (mechanistic hybrid) so it travels better in the first place. And there is a catch the rest of the chapter keeps returning to: the restaurant quietly changes its recipe a little every night, so however well you re-tune your tongue today, tomorrow's soup is a slightly different soup, which is why no single tuning stays right for long.

What this chapter covers

The transferability problem framed precisely — why a model trained at small scale degrades at large scale, decomposed into the three shifts that cause it (instrument, scale, and biology) and stated as the formal covariate-shift / concept-drift violation underneath.
Calibration transfer / domain shift for spectra — the concrete machinery (Direct and Piecewise Direct Standardization, transfer standards, Kennard-Stone selection) that re-aligns a spectral model across probes and scales, written out as math, with real published numbers.
Hybrid mechanistic scale-down models — using first-principles physics so the part that must be learned is small enough to survive the jump, and why the scale-aware structure is what travels.
Predicting scale-up risk — turning scale-up from a guess into a model whose inputs are engineering parameters (kLa, mixing time, shear, CO₂ stripping).
Run-to-run variability — the living-system noise floor that limits all of the above, and why it makes transfer validation mandatory rather than optional.
A runnable calibration-transfer demonstration over the real Raman dataset, contributed as examples/platform/ml/transfer.py, with its verbatim output.

The transferability problem, decomposed

When people say "the model degraded at scale," they are usually compressing three distinct failures into one phrase. Pulling them apart is the whole game, because each has a different fix.

Shift 1 — the instrument (the measurement moved). A Raman or NIR (near-infrared) soft sensor — an optical probe that shines light into the broth and reads the returning spectrum, a fingerprint of intensity versus wavelength that the model maps to a concentration — does not see glucose; it sees photons that a particular probe, on a particular spectrometer, with particular collection optics and laser power, turned into intensities. Two probes of the same model, looking at the same broth, do not produce identical spectra: laser power drifts, the focus differs, the optical window fouls at a different rate, and the wavenumber axis can be off by a fraction of a channel. The chemistry is identical; the number the model reads is not. This is the most measurable shift, and the most fixable.

Shift 2 — the scale (the physics moved). A 15 mL Ambr vessel and a 2,000 L stirred tank are not the same reactor scaled by a constant. Mixing time grows, oxygen transfer (kLa, the volumetric oxygen mass-transfer coefficient) changes, dissolved CO₂ accumulates because the big tank strips it less efficiently, hydrostatic pressure shifts gas solubility, and the shear gradients near the impeller ("shear" being the fluid stress the spinning impeller imposes on the cells) become far steeper and more localized than anything seen in a mini-bioreactor [1]. The cells genuinely behave differently because their environment is different. A model that learned "this glucose trajectory implies this titer" at small scale (titer being the product-antibody concentration in the broth, in grams per litre, the thing the process exists to maximise) is now watching cells living a different life.

Shift 3 — the biology (the system moved). Even at one scale, on one probe, the next batch is not the previous batch. A new working cell bank vial, a new media lot, a seasonal shift in raw materials, an aged inoculum — all move the process. This is run-to-run variability, and unlike the first two it is not a one-time offset you can calibrate away; it is a standing noise floor that every transfer must clear. We return to it at the end because it caps everything before it.

The formal statement: which distribution moved

The three shifts are worth stating in the vocabulary that makes the fixes follow. A supervised model learns a map f: X → y — here X is the spectrum the probe produces and y is the titer we want — by minimizing error on samples drawn from a training joint distribution P_train(X, y), which factorizes as P(y|X)·P(X). In plain words: P(X) is how the input spectra are distributed (what the readings tend to look like), and P(y|X) is the rule that turns a given spectrum into a titer (what a reading means); "factorizes" just means any joint behavior splits into those two pieces. To tie it back to the soup: P(X) is the new spoon, and P(y|X) is why soup gets salty. Calibration transfer can fix a change in the first; only new physics or new data fixes a change in the second. The model is only guaranteed to be right at deployment if P_deploy(X, y) = P_train(X, y). Transfer breaks that equality in two distinct ways, and the distinction decides what you can do about it:

Covariate shift: P(X) moves while P(y|X) stays fixed. The spectra look different, but a given true titer still produces the same underlying signal — the probe or scale only re-coded it. This is Shift 1 and the measurable part of Shift 2, and it is recoverable by re-aligning X (calibration transfer), because the relationship the model learned is still valid once the inputs are mapped back onto the basis it expects.
Concept drift: P(y|X) itself moves — the same observed inputs now imply a different output, because the cells respond differently. This is the deep part of Shift 2 and all of Shift 3. No amount of input re-alignment fixes it, because the learned relationship is now simply wrong; you need new structure (physics) or new data.

Classical supervised learning assumes both are fixed between training and deployment. Bioprocess scale-up violates that assumption on purpose, by design, every single time — and the reason calibration transfer is "the tractable part" is precisely that the instrument shift is almost pure covariate shift, while the biology shift is concept drift that no standardization can chase.
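The distinction can be made concrete with a one-variable toy. This is a minimal sketch, not code from the chapter's repository: the linear rule (y = 2s + 1), the instrument re-coding (x' = 1.1s + 5), the drifted rule (y = 1.5s + 1), and every variable name are made-up assumptions chosen only to separate the two cases.

```python
import numpy as np

rng = np.random.default_rng(0)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Source domain (all numbers hypothetical): the instrument reports the latent
# signal s directly, and the true rule is y = 2*s + 1.
s_train = rng.uniform(0.0, 5.0, size=200)
y_train = 2.0 * s_train + 1.0 + rng.normal(0.0, 0.05, size=200)
slope, intercept = np.polyfit(s_train, y_train, deg=1)   # the frozen model

def frozen_model(x):
    return slope * x + intercept

s_new = rng.uniform(0.0, 5.0, size=200)

# Case 1: the input is re-coded by a new instrument, the rule is unchanged.
x_recoded = 1.1 * s_new + 5.0
y_same_rule = 2.0 * s_new + 1.0
naive_err = rmse(y_same_rule, frozen_model(x_recoded))

# Five paired standards are enough to learn the affine map back to the old basis.
s_std = np.linspace(0.0, 5.0, 5)
a, b = np.polyfit(1.1 * s_std + 5.0, s_std, deg=1)
aligned_err = rmse(y_same_rule, frozen_model(a * x_recoded + b))

# Case 2: the rule itself drifts. Re-aligning the input cannot repair this.
y_new_rule = 1.5 * s_new + 1.0
drift_err = rmse(y_new_rule, frozen_model(s_new))
drift_after_align = rmse(y_new_rule, frozen_model(a * x_recoded + b))

print(f"re-coded input, naive        : RMSE={naive_err:.3f}")
print(f"re-coded input, re-aligned   : RMSE={aligned_err:.3f}")
print(f"drifted rule, perfect input  : RMSE={drift_err:.3f}")
print(f"drifted rule, re-aligned     : RMSE={drift_after_align:.3f}")
```

Under these assumptions the re-coded case is recovered once the inputs are mapped back, while the drifted rule leaves an error that no input re-alignment removes, which is the practical meaning of the two bullet points above.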

Calibration transfer: re-tuning the model to the new instrument

The most tractable shift — and the one with the most mature toolkit — is the instrument shift. The field that owns it is calibration transfer (sometimes model transfer or spectral standardization), borrowed wholesale from analytical chemistry, where the same problem is decades old: a NIR calibration built on one spectrometer must run on another without re-collecting thousands of reference samples.

The cleanest evidence that this is a real, large effect comes from a controlled bioprocess study. Pétillot and colleagues immersed two probes of the same Raman analyzer into the same CHO (Chinese Hamster Ovary — the workhorse cell line used to make most antibody drugs) culture at the same time — deliberately removing all biological and scale variability, leaving only the instrument difference. The cross-probe prediction error on cell density was roughly 20%, purely from probe-to-probe optics. After a calibration-transfer step — Kennard-Stone sampling plus Piecewise Direct Standardization — that error roughly halved, to about 10% [2] (peer-reviewed-independent, research). Read that twice: identical chemistry, identical timing, identical analyzer model, and a frozen calibration was 20% wrong from the spoon alone.

How calibration transfer actually works, with the math

The core idea is a function that maps spectra from the new (target) instrument back onto the basis the old (source) model expects, learned from a small set of transfer standards — the same physical samples measured on both instruments. Write the paired standards as matrices X_s (source, m standards × p wavenumbers) and X_t (target, same shape). The goal is a transform F (or a family of local transforms) such that X_t · F ≈ X_s, after which any new target spectrum x_t is corrected to x_t · F before the frozen model sees it. The model never changes; only its input is re-aligned.

Direct Standardization (DS) fits a single global transform. Treating each row as a spectrum, it solves min_F ‖X_s − X_t·F‖² in the least-squares sense, giving F = X_t⁺ · X_s (with X_t⁺ the pseudoinverse). F is a full p × p matrix, so every target wavenumber is allowed to influence every source wavenumber. Powerful, but with only a couple dozen standards and 701 wavenumbers the problem is wildly underdetermined — F overfits the standards and generalizes badly. DS needs heavy regularization or many standards to be safe.
Piecewise Direct Standardization (PDS) is the workhorse, and it earns that by respecting the physics of the instrument. A probe distortion at one Raman shift mostly perturbs nearby shifts, not the whole spectrum, so PDS fits a local map per wavenumber instead of one global one. For each target wavenumber j it takes a small window of target intensities [j−h, … , j+h] (window width w = 2h+1) and regresses the source intensity at j on that window plus an intercept: x_s[:, j] ≈ [X_t[:, j−h:j+h+1] | 1] · b_j, solved by ordinary least squares for the short coefficient vector b_j. The full transform is then a banded matrix (one short column of coefficients per j), which has O(p·w) free parameters instead of O(p²): for w = 11 and p = 701 that is roughly 7,700 window coefficients (about 8,400 once the 701 intercepts are counted) rather than ~491,000. Each local regression has only w + 1 = 12 unknowns, so two dozen standards overdetermine it. That collapse in parameter count is exactly why PDS survives on a handful of standards while DS does not [2]. PDS remains the default because it is linear, banded, and cheap in paired standards; where the probe distortion is nonlinear or paired standards are unavailable, the field reaches instead for standard-free / instrument-standardization methods (such as Spectral Space Transformation) or domain adaptation, at the cost of more data or more assumptions.
Transfer-standard selection decides which samples to measure twice — the expensive part, since each standard must be run on both instruments. Kennard-Stone picks a maximally spread-out subset: start from the two most distant spectra (largest pairwise Euclidean distance), then repeatedly add the sample whose minimum distance to the already-chosen set is largest. The result spans the spectral space rather than clustering, so the local maps are fit across the whole range of intensities the model will meet — which is why Kennard-Stone pairs naturally with PDS.
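The selection rule and the DS underdetermination described above can be sketched together. This is an illustrative sketch on made-up spectra (one Gaussian peak, 60 channels, 12 standards), not the chapter's pipeline; `kennard_stone` and `fit_ds` are hypothetical helper names, and the gain, baseline, and noise levels are assumptions.

```python
import numpy as np

def kennard_stone(X, k):
    """Indices of k rows of X picked by Kennard-Stone max-min selection."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of rows")
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)  # squared distances
    i, j = np.unravel_index(np.argmax(d2), d2.shape)   # the two most distant spectra
    chosen = [int(i), int(j)]
    min_d2 = np.minimum(d2[i], d2[j])                  # distance of each row to the chosen set
    while len(chosen) < k:
        min_d2[chosen] = -1.0                          # never re-pick a chosen row
        nxt = int(np.argmax(min_d2))                   # farthest from everything chosen so far
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, d2[nxt])
    return chosen

def fit_ds(X_src_std, X_tgt_std):
    """Direct Standardization: F = pinv(X_t) @ X_s, so that X_t @ F approximates X_s."""
    return np.linalg.pinv(X_tgt_std) @ X_src_std

# Synthetic paired spectra (assumed shapes and distortions, for illustration only).
rng = np.random.default_rng(0)
n, p, m = 100, 60, 12
conc = rng.uniform(0.0, 5.0, size=(n, 1))
peak = np.exp(-0.5 * ((np.arange(p) - 30) / 5.0) ** 2)
X_src = 100.0 * conc * peak + rng.normal(0.0, 1.0, size=(n, p))
X_tgt = 1.05 * X_src + np.linspace(0.0, 20.0, p) + rng.normal(0.0, 1.0, size=(n, p))

std_idx = kennard_stone(X_src, m)
F = fit_ds(X_src[std_idx], X_tgt[std_idx])
rest = np.setdiff1d(np.arange(n), std_idx)
fit_resid = np.abs(X_tgt[std_idx] @ F - X_src[std_idx]).max()
holdout_resid = np.abs(X_tgt[rest] @ F - X_src[rest]).max()

print(f"DS matrix has {F.size} entries, fit from {m} standards")
print(f"max residual on the standards : {fit_resid:.2e}")
print(f"max residual on held-out rows : {holdout_resid:.2e}")
```

With fewer standards than channels, the global map reproduces the standards essentially exactly while leaving a visible residual on rows it never saw. That gap is the overfitting the DS paragraph warns about, and it is what the banded PDS structure is meant to avoid.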

The discipline this imposes is the heart of the chapter: a model that moves to a new probe is, for regulatory purposes, a new analytical procedure until it is re-qualified. ICH Q14 (the International Council for Harmonisation's analytical-procedure-development guideline) treats a multivariate calibration as a procedure with a lifecycle — defined operating range, documented validation, ongoing performance monitoring — and calibration transfer is documented work with its own evidence, not a free lunch [3] (regulatory). Concretely, the transfer must show that the corrected procedure still meets accuracy and precision over the declared range, that the transfer standards span that range, and that a change-control record links the source procedure to the target one. The open-source analytics chapter draws the same line from the soft-sensor side; here we make the transfer step itself runnable.

Calibration transfer in code, on the real Raman dataset

examples/platform/ml/transfer.py makes the whole arc concrete on the book's own simulated spectra in datasets/raman_spectra.parquet: the 336 hourly Raman spectra of BATCH-2026-001, 701 wavenumbers (wn_400 … wn_1800), paired with the kinetic titer_g_L reference. We train a PLS soft sensor at "small scale" (Partial Least Squares, a regression that compresses the 701 correlated wavenumbers into a few "latent variables," combined directions of the spectrum that best predict titer; here 6 of them, on standard-scaled inputs). We then synthesize a different-probe / larger-scale spectral response from four canonical instrument distortions, none of which touches the underlying titer: a wavenumber-dependent multiplicative gain (±6%, modeling laser power / collection-optics differences), a sloped additive baseline rising to 180 intensity units (fouling, focus, fluorescence background), extra read noise, and a one-channel axis drift. We apply the frozen model naively, watch it break, and then recover it with PDS using 24 evenly-spaced transfer standards. The synthetic shift is clearly labelled illustrative (we have one batch on one simulated probe); the pipeline is real scikit-learn and NumPy.

# examples/platform/ml/transfer.py: calibration transfer across a (synthetic) probe shift.
import numpy as np

def apply_domain_shift(X, rng):
    """Synthesize a different-probe / larger-scale spectral response (illustrative).
    Four physically-motivated, instrument-driven distortions (gain, baseline,
    read noise, axis drift), NOT a chemistry change."""
    n_wn = X.shape[1]
    gain = 1.0 + 0.06 * np.sin(np.linspace(0, 3 * np.pi, n_wn))   # +/-6% wavenumber-dependent gain
    baseline = np.linspace(0.0, 180.0, n_wn)                       # sloped additive background
    noise = rng.normal(0.0, 2.0, size=X.shape)                     # extra read noise
    Xs = X * gain + baseline + noise
    return np.roll(Xs, 1, axis=1)                                  # 1-channel axis drift

def fit_pds(Xs_std, Xt_std, win=11):
    """Piecewise Direct Standardization from paired transfer standards.
    For each wavenumber j, a local least-squares map predicts the SOURCE
    intensity at j from a window of TARGET intensities around j (+ intercept)."""
    n_wn = Xs_std.shape[1]
    half = win // 2
    maps = []
    for j in range(n_wn):
        lo, hi = max(0, j - half), min(n_wn, j + half + 1)
        A = np.hstack([Xt_std[:, lo:hi], np.ones((Xt_std.shape[0], 1))])  # window + intercept
        coef, *_ = np.linalg.lstsq(A, Xs_std[:, j], rcond=None)
        maps.append((lo, hi, coef))
    return maps

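# 1. fit PLS on the source ("small-scale") spectra; 2. score in-domain;
# 3. apply naively to the shifted ("large-scale") spectra; 4. correct with PDS on
#    24 evenly-spaced transfer standards; 5. re-score.
# Excerpt only: X (source spectra), X_tgt (the shifted copy of X) and apply_pds
# (which applies the fitted local maps row by row) are defined elsewhere in the file.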
idx    = np.linspace(0, len(X) - 1, 24).astype(int)          # transfer standards
maps   = fit_pds(X[idx], X_tgt[idx])                         # learn local maps on standards...
X_corr = apply_pds(maps, X_tgt)                              # ...then correct ALL target rows

Running python platform/ml/transfer.py prints exactly this (verbatim run output):

Calibration transfer demo (illustrative shift) over 336 spectra x 701 wavenumbers, 24 transfer standards
  source  (in-domain) : R2=0.9995  RMSE=0.0356 g/L
  naive transfer      : R2=-577.7308  RMSE=39.1968 g/L   <- model breaks on the new probe
  PDS-corrected       : R2=0.9908  RMSE=0.1567 g/L   <- recovered after calibration transfer
ASSERT ok: PDS calibration transfer reduces cross-probe RMSE.

The story is stark and it is the whole point. In-domain, the model is near-perfect (R² = 0.9995, RMSE 0.0356 g/L), though note this is a resubstitution fit, scored on the rows it was trained on, on the single batch BATCH-2026-001, so it sits a touch above the held-out R² = 0.9944 the analytical-methods chapter reports for the same in-domain PLS; with one batch a leak-free split is impossible, so what we measure here is the relative collapse and recovery, not a generalization claim. Move it to the new probe with no transfer step and it does not merely degrade; it produces physically absurd predictions: an RMSE (root-mean-square error, the model's typical prediction miss, expressed in the titer's own units, g/L) of 39 g/L against a titer that never exceeds 5.72 g/L. A negative R² is worse than useless. Because R² = 1 − SSE/SST, R² = 0 is the score of a model that ignores the spectrum and just guesses the dataset mean every time, so the −577.73 here means the naive model's squared error is roughly 580 times that of even that trivial baseline. Apply PDS with two dozen transfer standards and it snaps back to R² = 0.9908, RMSE = 0.157 g/L, which is usable again. Note what the standards bought: 24 paired samples out of 336, about 7%, rescued a model that was otherwise unusable. The naive number is exaggerated by the size of our synthetic shift, but its direction is exactly Pétillot's finding: a frozen spectral model on a new instrument is untrustworthy until it is transferred, and a small, well-chosen set of paired standards can rescue it. The CI-style assertion at the end (naive RMSE > corrected RMSE) is what keeps the chapter's claim from silently rotting if the dataset or the pipeline changes.
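For readers without the dataset, the same break-and-recover arc can be sketched on fully synthetic spectra with NumPy alone. Several things here are assumptions rather than the repository's code: the two-peak toy spectra, the plain least-squares model standing in for PLS, and in particular the body of `apply_pds`, which the excerpt names but does not show. `apply_domain_shift` and `fit_pds` follow the excerpt.

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def r2(y, y_hat):
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))

def apply_domain_shift(X, rng):
    """Gain, sloped baseline, read noise and a one-channel roll, as in the excerpt."""
    n_wn = X.shape[1]
    gain = 1.0 + 0.06 * np.sin(np.linspace(0, 3 * np.pi, n_wn))
    baseline = np.linspace(0.0, 180.0, n_wn)
    noise = rng.normal(0.0, 2.0, size=X.shape)
    return np.roll(X * gain + baseline + noise, 1, axis=1)

def fit_pds(Xs_std, Xt_std, win=11):
    """One local least-squares map (target window + intercept -> source channel) per channel."""
    n_wn = Xs_std.shape[1]
    half = win // 2
    maps = []
    for j in range(n_wn):
        lo, hi = max(0, j - half), min(n_wn, j + half + 1)
        A = np.hstack([Xt_std[:, lo:hi], np.ones((Xt_std.shape[0], 1))])
        coef, *_ = np.linalg.lstsq(A, Xs_std[:, j], rcond=None)
        maps.append((lo, hi, coef))
    return maps

def apply_pds(maps, Xt):
    """ASSUMED counterpart of fit_pds: rebuild each source channel from its target window."""
    ones = np.ones((Xt.shape[0], 1))
    return np.column_stack([np.hstack([Xt[:, lo:hi], ones]) @ coef for lo, hi, coef in maps])

# Toy "spectra": a titer peak, an unrelated interferent peak, a flat background.
rng = np.random.default_rng(42)
n, p = 200, 60
chan = np.arange(p)
titer = np.linspace(0.2, 5.0, n)
other = rng.uniform(0.0, 1.0, size=n)
peak_titer = np.exp(-0.5 * ((chan - 40) / 4.0) ** 2)
peak_other = np.exp(-0.5 * ((chan - 18) / 6.0) ** 2)
X = (100.0 * np.outer(titer, peak_titer) + 60.0 * np.outer(other, peak_other)
     + 50.0 + rng.normal(0.0, 0.5, size=(n, p)))

# Frozen linear model fit on the source spectra (a stand-in for the PLS soft sensor).
w, *_ = np.linalg.lstsq(np.hstack([X, np.ones((n, 1))]), titer, rcond=None)

def predict(S):
    return np.hstack([S, np.ones((S.shape[0], 1))]) @ w

X_tgt = apply_domain_shift(X, rng)
idx = np.linspace(0, n - 1, 24).astype(int)      # evenly spaced transfer standards
X_corr = apply_pds(fit_pds(X[idx], X_tgt[idx]), X_tgt)

rmse_src, rmse_naive, rmse_pds = (rmse(titer, predict(S)) for S in (X, X_tgt, X_corr))
for name, S in [("source", X), ("naive transfer", X_tgt), ("PDS-corrected", X_corr)]:
    print(f"{name:15s}: R2={r2(titer, predict(S)):.4f}  RMSE={rmse(titer, predict(S)):.4f} g/L")
```

The absolute numbers from this toy are not comparable to the chapter's run; what it is meant to reproduce is the ordering, with the naive error far above both the in-domain error and the PDS-corrected one.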

The transferability problem and its two cures: a model trained at small scale meets three stacked shifts when it crosses to large scale; calibration transfer re-aligns the measurement using a few paired standards, while a hybrid mechanistic backbone shrinks what has to be learned so the model travels in the first place. Original diagram by the authors, created with AI assistance.

Transfer across scales, not just probes

Calibration transfer was invented for instrument-to-instrument moves, but the same machinery is what the industry now uses for scale transfer, because the spectral response of a probe in a 2,000 L tank differs from the same probe in an Ambr for the same covariate-shift reasons (path length, turbulence, gassing at the window, temperature), even before the biology moves. Sartorius reported moving Raman PAT (process analytical technology) models from Ambr 250 mini-bioreactors up to 10 L stirred tanks while retaining usable predictive performance: a CHO mAb (monoclonal antibody) PLS model calibrated on twelve Ambr 250 vessels was transferred to two 10 L Biostat STR cultivations and validated there, predicting glucose at R² ≈ 0.84 (RMSEP ≈ 0.29 g/L, the same RMSE, just measured on held-out prediction samples rather than training rows), by transferring the calibration rather than rebuilding it from scratch [4] (vendor/trade-press-self-reported, pilot). Note the ceiling on this evidence, though: the published transfers stop at roughly 10 L, a bench scale where mixing times, dissolved-CO₂ accumulation, and impeller shear are still mild. The leap the anatomy card draws, Ambr to 2,000 L, is where those scale shifts bite hardest, and the open literature largely has not followed the calibration there; the machinery transfers, but the demonstration, honestly, has not yet, which is consistent with this book's "scaled implementations are scarce" thesis. The economic logic for building at small scale is overwhelming: building a calibration de novo at 2,000 L means sacrificing commercial-scale batches to generate paired reference data, at six figures per run. Transferring a model built cheaply at small scale (twelve parallel mini-bioreactor runs cost a fraction of one production batch) and validating it with a handful of large-scale standards is the only affordable path.

Scale also introduces measurement problems that have no small-scale analogue, and these are not covariate shift — they are new physical phenomena the standardization map was never designed to chase. Merck's perfusion-Raman platform had to solve fluorescence interference that only appears at the very high cell densities and long run times of intensified culture: a strong, time-varying fluorescence background overwhelms the weak Raman signal — a spectral artifact the development model never saw because development never ran that dense or that long. Merck's fix was not a smarter transfer map but a measurement change — moving the probe to the cell-free permeate stream so it reads a clean spectrum — before the transferred chemometric model could be trusted [5] (peer-reviewed, pilot). That is the difference between covariate shift you can standardize and a new physical phenomenon you must first handle: calibration transfer fixes the former; the latter needs a hardware or modeling change, and no quantity of transfer standards substitutes for it.

Hybrid scale-down models: travelling light by knowing physics

Calibration transfer fixes the instrument shift. It does almost nothing for the scale and biology shifts, because those are not measurement artifacts — they are concept drift, the cells really are behaving differently. The deeper answer to "make a model that travels" is to make less of the model learnable in the first place.

This is the hybrid (gray-box) idea from the data-management ML chapter and the analytical-methods chapter, now aimed squarely at transferability. A scale-down model is a small-scale model intended to predict large-scale behavior; a hybrid scale-down model keeps a mechanistic backbone (mass balances, Monod-type growth kinetics, gas-transfer relations) that is scale-aware by construction, and uses a learned component only for the residual the physics cannot capture. The split matters because of which parameters travel: the kinetic constants (maximum specific growth rate, Monod half-saturation constants, yield coefficients, inhibition thresholds) are properties of the cell line and are the same biology at any scale; what changes is the environment the balances are solved in. Put plainly: the cell line's kinetic constants are the same at 15 mL and 2,000 L; the growth the cells actually achieve differs because the tank around them changes. Engineering correlations describe how kLa, mixing time, and CO₂ stripping change with vessel geometry and operating point, for instance kLa scaling with power-per-volume and superficial gas velocity. Feed those scale-dependent terms into the mechanistic part and the model already "knows" that the big tank mixes slower and strips CO₂ worse: knowledge the data never had to supply, and therefore knowledge that does not have to be re-learned from precious large-scale batches.
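Written out, a minimal backbone of this kind might look as follows. This is a sketch in batch form (no feed or dilution terms), and the power-law shape of the kLa correlation is one commonly used empirical form, assumed here for illustration; K, α and β are vessel- and medium-specific fitted constants, not values from this chapter.

```latex
\begin{aligned}
\frac{dX}{dt} &= \mu(S)\,X, \qquad \mu(S) = \mu_{\max}\,\frac{S}{K_S + S}
  && \text{(cell-line kinetics: scale-independent)} \\
\frac{dS}{dt} &= -\frac{\mu(S)\,X}{Y_{X/S}}
  && \text{(substrate balance)} \\
\frac{dC_{O_2}}{dt} &= k_L a\,\bigl(C^{*}_{O_2} - C_{O_2}\bigr) - q_{O_2}\,X
  && \text{(oxygen balance: scale enters through } k_L a\text{)} \\
k_L a &= K \left(\frac{P}{V}\right)^{\alpha} v_s^{\,\beta}
  && \text{(engineering correlation: scale-dependent)}
\end{aligned}
```

Here X is biomass, S substrate, C_{O_2} dissolved oxygen with saturation value C^{*}_{O_2}, P/V power per volume and v_s superficial gas velocity. Only the last line changes between vessels; μ_max, K_S, Y_{X/S} and q_{O_2} are carried over unchanged, which is the sense in which the structure travels.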

The data-efficiency argument from earlier chapters becomes a transfer argument here, and it is the claim the whole hybrid case rests on. Because the mechanistic part contributes scale-aware structure for free, the learned residual is small, so it can be fit on the handful of batches available at the new scale — and, crucially, it generalizes more safely outside the conditions it saw, because the physics holds the extrapolation in check [6]. The peer-reviewed basis for this is direct: Rogers and colleagues showed that embedding a kinetic backbone into a learned model improves extrapolation and generalization from small datasets, with the optimal model structure tied to the underlying process mechanism — the data-efficiency-becomes-transfer argument made experimentally. A pure black box trained at 15 mL has learned a correlation that is only true at 15 mL; a hybrid has learned biology plus a small correction, and the biology travels. The mechanistic fed-batch simulator behind this book's datasets (examples/sim/bioproc_sim/fed_batch.py) is exactly this kind of backbone: Monod growth, death, lactate inhibition, bolus feeds, and a day-7 temperature excursion, with the kinetic constants exposed so a learned residual can ride on top — meaning the same simulator that generated the running example is also the template for the scale-aware part of a hybrid that would transfer it.
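A toy version of the extrapolation claim is sketched below, under heavy assumptions: a batch Monod backbone with made-up constants, a "plant" that runs exactly 10% below the backbone (so the residual is a single scale factor by construction), and a cubic polynomial in time standing in for a black-box model. None of this comes from fed_batch.py or from the cited study.

```python
import numpy as np

def monod_biomass(t_grid, mu_max=0.6, k_s=0.5, y_xs=0.5, x0=0.5, s0=20.0):
    """Explicit-Euler Monod batch growth; returns biomass (g/L) on t_grid (days).
    All kinetic constants are hypothetical placeholders."""
    x, s = x0, s0
    biomass = [x]
    for dt in np.diff(t_grid):
        mu = mu_max * s / (k_s + s)
        grown = min(mu * x * dt, y_xs * s)   # cannot grow on substrate that is not there
        x += grown
        s -= grown / y_xs
        biomass.append(x)
    return np.array(biomass)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 1001)
backbone = monod_biomass(t)
observed = 0.9 * backbone + rng.normal(0.0, 0.05, size=t.size)   # assumed 10% deficit

seen = t <= 4.0          # both models are fit on the first four days only
unseen = ~seen           # and judged on days 4 to 10

# Hybrid: keep the backbone, learn one residual scale factor by least squares.
k_hat = float(backbone[seen] @ observed[seen] / (backbone[seen] @ backbone[seen]))
hybrid_pred = k_hat * backbone

# Black box: a cubic in time, with no knowledge of substrate exhaustion.
blackbox_pred = np.polyval(np.polyfit(t[seen], observed[seen], deg=3), t)

hybrid_rmse = rmse(observed[unseen], hybrid_pred[unseen])
blackbox_rmse = rmse(observed[unseen], blackbox_pred[unseen])
print(f"learned residual factor k = {k_hat:.3f}")
print(f"extrapolation RMSE, hybrid    : {hybrid_rmse:.3f} g/L")
print(f"extrapolation RMSE, black box : {blackbox_rmse:.3f} g/L")
```

The comparison is deliberately favorable to the hybrid, since the true deficit has exactly the form the residual can express. The point it illustrates is narrower: the backbone already contains the plateau, so one learned number suffices, whereas the polynomial has to guess the plateau from data that never showed it.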

Sartorius has demonstrated this pattern at the pre-commercial scale — a hybrid framework combining a genome-scale dynamic flux-balance backbone (PC-dFBA), an ODE-based kinetic model, and a learned viable-cell-density neural-network component to build predictive digital twins of CHO culture, validated on 23 fed-batch cultivations and explicitly aimed at the transfer/scale-up problem [7] (vendor/self-reported, preprint pilot). DataHow, an independent ETH Zurich spin-off (founded 2017, not a Sartorius subsidiary), markets hybrid models with transfer learning as a way to cut the number of experiments a process needs — vendor figures in the 30–60% range, which should be read as vendor/self-reported rather than established fact, though the underlying methods (hybrid Gaussian-process models with entity-embedding vectors that transfer knowledge across cell lines, and hybrid-plus-intensified-DoE for reduced experimental burden) are peer-reviewed [8]. The recurring theme is the same across vendor and academic work: the hybrids that travel are the ones where the data-driven part is kept small and the physics carries the scale.

Predicting scale-up risk before you scale

The most ambitious use of learning here is not to transfer a soft sensor at all — it is to predict, before committing to a large-scale run, whether the process will scale successfully. The features are engineering scale-up parameters: power per volume, impeller tip speed, kLa, mixing time, CO₂ stripping rate, shear exposure, sometimes summarized from computational-fluid-dynamics simulations of the specific vessel. The label is the scale-up outcome — did growth, titer, and the critical quality attributes (CQAs — the measurable product properties, such as glycosylation or aggregation, that must stay within defined limits for a batch to be releasable; an ICH quality concept) hold at scale, or did they shift?
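Several of those features follow from standard stirred-tank relations and can be computed directly from geometry and operating point. The sketch below assumes the turbulent-regime ungassed power draw P = Np·ρ·N³·D⁵, a placeholder power number, a water-like density, and entirely hypothetical vessel dimensions; kLa and mixing time are left out because they need vessel-specific correlations or CFD.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatingPoint:
    volume_m3: float
    tank_diameter_m: float
    impeller_diameter_m: float
    stir_speed_rps: float            # revolutions per second
    gas_flow_m3_per_s: float
    power_number: float = 1.5        # ASSUMED placeholder; depends on impeller type
    density_kg_m3: float = 1000.0    # ASSUMED water-like broth

def scale_up_features(op: OperatingPoint) -> dict:
    """Engineering features a scale-up risk model could take as inputs."""
    tip_speed = math.pi * op.stir_speed_rps * op.impeller_diameter_m
    power_w = (op.power_number * op.density_kg_m3
               * op.stir_speed_rps ** 3 * op.impeller_diameter_m ** 5)
    cross_section_m2 = math.pi * op.tank_diameter_m ** 2 / 4.0
    return {
        "tip_speed_m_s": tip_speed,
        "power_per_volume_W_m3": power_w / op.volume_m3,
        "superficial_gas_velocity_m_s": op.gas_flow_m3_per_s / cross_section_m2,
        "vvm": op.gas_flow_m3_per_s * 60.0 / op.volume_m3,   # gas volumes per liquid volume per minute
    }

# Hypothetical vessels; none of these dimensions come from the chapter.
bench = OperatingPoint(volume_m3=0.010, tank_diameter_m=0.20,
                       impeller_diameter_m=0.08, stir_speed_rps=5.0,
                       gas_flow_m3_per_s=1.0e-5)
plant = OperatingPoint(volume_m3=2.0, tank_diameter_m=1.2,
                       impeller_diameter_m=0.5, stir_speed_rps=2.0,
                       gas_flow_m3_per_s=0.002)

for name, op in [("bench", bench), ("plant", plant)]:
    feats = scale_up_features(op)
    print(name, {k: round(v, 4) for k, v in feats.items()})
```

Holding one of these features constant across scales necessarily lets the others move, so a risk model sees a different feature vector at every scale. That is why the outcome label, not the features, is the scarce part of the dataset.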

A 2024 perspective review in Engineering in Life Sciences surveyed exactly this: ML applied to bioreactor scale-up, treating the jump as a prediction problem informed by CFD and engineering correlations rather than a wet-lab gamble [1] (peer-reviewed-independent, research). The honest status is research, not production, and the reasons are structural rather than fixable with effort. First, the datasets are tiny: every large-scale run is precious, so the number of labelled scale-up outcomes any one organization holds is in the tens, not thousands. Second, the outcome is multivariate and the failure modes are not interchangeable — a process can scale fine for titer and growth but fail for glycosylation or aggregation, so a single "did it scale" label hides the attribute that actually breaks. And the attribute most likely to break is rarely titer: it is a quality attribute tied to the gradients that only appear at scale — N-glycosylation and aggregation are notably sensitive to the elevated dissolved CO₂, longer mixing times, and local pH and feed inhomogeneity of a large tank. A scale-up risk model watching only growth and titer is watching exactly the attributes least likely to fail. Third, and most insidiously, the same transferability problem recurses one level up: a scale-up risk model built on one molecule's history may not transfer to the next molecule, because the relationship between engineering parameters and CQA outcome is itself molecule-specific. The model that predicts transfer risk is itself subject to transfer risk. But the framing is valuable even before the models are deployable, because it forces the question "what physically changes at scale?" to be answered in features a model can use — which is the same question hybrid modeling answers in equations.

For our running example, this is the question that sits between the process-development chapter's Bayesian-optimized design space and the production bioreactor: the conditions that produced the golden BATCH-2026-001 (monomer purity 98.611% by SEC, size-exclusion chromatography) at development scale carry a predicted risk when transferred to the manufacturing tank. The out-of-specification (OOS) sibling BATCH-2026-004 failed on host-cell protein (HCP) at 128 ng/mg, over the 100 ng/mg specification limit a batch must stay under to be released, and not on any spectrally-monitored attribute, which makes it the exact illustration of the multivariate-failure point above. A scale-up risk model watching titer and growth would have called BATCH-2026-004 a success; the failure surfaced in a downstream attribute the upstream model never watched. That gap between "the attribute the model predicts" and "the attribute that fails release" is why scale-up risk prediction has to be multivariate to be honest.

Anatomy of a transferred model record

A model that crosses a scale boundary should carry a transfer record, and like every artifact in this series its value is in what travels alongside the numbers. A bare "R² = 0.99" is exactly the claim that calibration transfer exists to puncture: it is true for one probe, one scale, and one dataset hash, and silently false the moment any of those move. The transferred-model record makes the provenance of the jump auditable: it is the document a quality unit reads to decide whether the jump was earned, and the document an inspector reads to decide whether it was defensible.

One transfer is a whole record: the source domain it was trained in (pinned by dataset hash and in-domain metric), the target domain it must work in, the naive-transfer metric that proves it cannot be moved untouched, the calibration-transfer method and standards that rescued it, and the governance fields — requalification, operating range, intended use, lineage — that make the jump defensible. Original diagram by the authors, created with AI assistance.
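As a data structure, such a record might be sketched like this. The class and field names are hypothetical, not a schema from the book's repository; the values are the ones this chapter reports, the dataset hash is left elided as in the text, and intended use and lineage are left unset because the chapter does not give them here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Metric:
    r2: float
    rmse_g_per_l: float

@dataclass(frozen=True)
class TransferRecord:
    source_instrument: str
    source_scale: str
    training_dataset: str
    training_dataset_hash: str
    in_domain_fit: Metric                 # resubstitution, not a held-out score
    target_instrument: str
    target_scale: str
    naive_transfer: Metric                # recorded BEFORE any correction
    transfer_method: str
    n_standards: int
    standard_selection: str
    post_transfer: Metric
    requalification_status: str
    operating_range_g_per_l: Tuple[float, float]
    intended_use: Optional[str] = None    # named in the record, value not given here
    lineage: Optional[str] = None         # named in the record, value not given here

    def transfer_was_necessary(self) -> bool:
        """True when the naive metric is worse than the post-transfer one."""
        return self.naive_transfer.rmse_g_per_l > self.post_transfer.rmse_g_per_l

    def in_operating_range(self, titer_g_per_l: float) -> bool:
        lo, hi = self.operating_range_g_per_l
        return lo <= titer_g_per_l <= hi

record = TransferRecord(
    source_instrument="development Raman probe",
    source_scale="Ambr 250",
    training_dataset="raman_spectra.parquet",
    training_dataset_hash="sha256:…",     # elided in the text; kept elided here
    in_domain_fit=Metric(r2=0.9995, rmse_g_per_l=0.0356),
    target_instrument="plant installed probe",
    target_scale="2,000 L",
    naive_transfer=Metric(r2=-577.73, rmse_g_per_l=39.20),
    transfer_method="Piecewise Direct Standardization",
    n_standards=24,
    standard_selection="evenly spaced (Kennard-Stone-style)",
    post_transfer=Metric(r2=0.9908, rmse_g_per_l=0.1567),
    requalification_status="new analytical procedure under ICH Q14",
    operating_range_g_per_l=(0.0, 5.72),
)

print("transfer step was necessary:", record.transfer_was_necessary())
print("6.0 g/L covered by the evidence:", record.in_operating_range(6.0))
```

Making the record immutable and keeping the naive metric as a required field is one way to encode the rule that the uncorrected result must be written down before the correction is.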

Taken field by field, the card lays out the chapter's whole argument as a record.

Source instrument & scale (development Raman probe, Ambr 250). The domain the model was trained in. These are not metadata trivia: they define the validity envelope. The model is a fit to this probe and this scale, and every field below exists because that fit does not automatically extend past them.
Training dataset, pinned (raman_spectra.parquet, sha256:…). The exact bytes the calibration learned from, hashed via MANIFEST.sha256. This is what makes "the model" reproducible and what an auditor diffs against if the result is ever questioned — a model without its pinned training set is an unfalsifiable claim.
In-domain fit, resubstitution (R² = 0.9995, RMSE = 0.0356 g/L). The same in-domain PLS the analytical-methods chapter builds, and the field most likely to be mistaken for success. Read it carefully: this is a resubstitution fit — PLS scored on the very rows it was fit to, on the one available batch BATCH-2026-001 — which is exactly why it is higher than the analytical-methods chapter's held-out figure for the same model (R² = 0.9944, a random within-batch hold-out). With a single batch a leak-free batch-grouped split is impossible, so here we measure the relative collapse and recovery rather than claim a generalization number. It is true — and only a local truth, valid inside the source domain and nowhere else.
Target instrument & scale (plant installed probe, 2,000 L). The reality the model must now work in. The mismatch between this row and the source rows is the transfer the rest of the record has to earn.
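Naive-transfer metrics (R² = −577.73, RMSE = 39.20 g/L), flagged red. The proof of work: the model applied as-is to the target domain, recorded before any correction. R² goes negative whenever a model's squared error exceeds that of simply predicting the mean reference value, so −577.73 means the frozen model's squared error is roughly 579 times that of a constant guess. This field exists to forbid the silent copy-paste: it documents, in numbers, that the model could not be moved untouched. A transfer record that omits the naive metric is hiding the only evidence that the transfer step was necessary.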
Calibration-transfer method (Piecewise Direct Standardization) and standards (24, evenly spaced, Kennard-Stone-style). The intervention that earned the jump, named precisely enough to repeat. The standards here are picked evenly across acquisition time. That is a Kennard-Stone-style spread, not the greedy max-min spectral selection the algorithm strictly prescribes; with a single, smoothly evolving batch the two coincide closely, but the honest label is "evenly spaced." The standard count and selection method are part of the validation evidence under ICH Q14, because they bound how well the correction is expected to hold.
Post-transfer metrics (R² = 0.9908, RMSE = 0.1567 g/L), green. The recovered performance, re-validated in the target domain. The distance from the red field to this green one is exactly what the transfer step bought.
Requalification status (new analytical procedure under ICH Q14). The governance verdict: a transferred model is not the same procedure at a new location, it is a new procedure that has been re-validated [3]. This single field is what converts "we moved the model" from an IT operation into a controlled change.
Operating range (titer 0 to 5.72 g/L). The interval over which the re-validation actually holds — derived from the dataset, not asserted. Predictions outside it are extrapolation and are not covered by the transfer evidence; this is the field that says where the green metric stops being a promise.
Intended-use scope (advisory, not unattended release). The decision authority granted to the model. Under FDA's risk-based credibility framing [13] — and, in the same direction, the draft EU/PIC/S GMP Annex 22 (regulatory, draft), which the honest-verdict chapter treats in full — a transferred model touching a CQA is high-influence and high-consequence, so it advises a human rather than releasing a batch on its own.
Change-control / lineage (source model version → transfer event → target model version). The audit trail linking the source model, the transfer event, and the resulting target model, so the jump can be reconstructed and re-done when the biology moves — which, as the next section argues, it inevitably will.
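As a rough illustration, the card above can be held as a typed record that refuses to validate when the naive-transfer metric is missing. This is a minimal sketch, not the series' actual schema: the class names, field names, and the problems() and covers() helpers are assumptions introduced here. The metric values are the ones quoted on the card, and the elided hash and the lineage placeholders are left exactly as the card gives them.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Metric:
    r2: float
    rmse_g_per_l: float


@dataclass(frozen=True)
class TransferRecord:  # hypothetical schema, for illustration only
    source_domain: str
    training_dataset: str
    in_domain: Metric
    target_domain: str
    naive_transfer: Optional[Metric]  # recorded BEFORE any correction
    method: str
    n_standards: int
    post_transfer: Metric
    requalification: str
    operating_range_g_per_l: Tuple[float, float]
    intended_use: str
    lineage: Tuple[str, str, str]

    def problems(self) -> List[str]:
        """Return reasons the record is not defensible (empty list = OK)."""
        found = []
        if self.naive_transfer is None:
            found.append("naive-transfer metric missing")
        elif self.post_transfer.rmse_g_per_l >= self.naive_transfer.rmse_g_per_l:
            found.append("transfer did not reduce RMSE")
        low, high = self.operating_range_g_per_l
        if low >= high:
            found.append("operating range is empty")
        return found

    def covers(self, titer_g_per_l: float) -> bool:
        """True if a prediction lies inside the re-validated range."""
        low, high = self.operating_range_g_per_l
        return low <= titer_g_per_l <= high


record = TransferRecord(
    source_domain="development Raman probe, Ambr 250",
    training_dataset="raman_spectra.parquet, sha256:…",  # hash elided in the source
    in_domain=Metric(r2=0.9995, rmse_g_per_l=0.0356),
    target_domain="plant installed probe, 2,000 L",
    naive_transfer=Metric(r2=-577.73, rmse_g_per_l=39.20),
    method="Piecewise Direct Standardization",
    n_standards=24,
    post_transfer=Metric(r2=0.9908, rmse_g_per_l=0.1567),
    requalification="new analytical procedure under ICH Q14",
    operating_range_g_per_l=(0.0, 5.72),
    intended_use="advisory, not unattended release",
    lineage=("source model version", "transfer event", "target model version"),
)

print(record.problems())   # an empty list means no structural objection
print(record.covers(6.0))  # predictions outside the range are extrapolation
rmse_ratio = record.naive_transfer.rmse_g_per_l / record.post_transfer.rmse_g_per_l
```

Dropping the naive metric with replace(record, naive_transfer=None) makes problems() non-empty, which is the code-level version of forbidding the silent copy-paste. On the card's numbers the transfer step cuts RMSE by a factor of about 250.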

The gap between the first metric and the last is the entire subject of this chapter, drawn as one record; the governance panel is what makes that gap defensible rather than merely survived.

Why the transfer record is a semantic record, not a spreadsheet

Every field on that card is sound only if the things it names are the same things the rest of the plant means by them — and that identity is exactly what an ontology supplies, which is why the record's defensibility rests on the knowledge-graph work of Books 3 and 4 rather than on a tidy table. Four threads from the model-validation discipline run straight through the transfer case, and naming them is what raises this chapter from "we moved a model" to "we moved a model and can prove the move was earned."

The lineage edge that scopes the transfer is the grouping key that validates it. The card's source model version → transfer event → target model version row is not a free-text breadcrumb; it is a typed bp:derivedFrom / PROV-O provenance edge, the same transitive spine the genealogy chapter uses to root every lot in WCB-CHO-001. That matters operationally because the transfer's own validation has the same leakage trap the data chapter warns about: when paired standards from one campaign are used both to fit and to check a transfer map, near-duplicate siblings sharing a bp:derivedFrom ancestor must not straddle the fit/check cut. Walking the lineage IRI as a SPARQL traversal is what makes a leave-one-batch-out check of the transferred model honest: the genealogy graph defines what an independent group is, where a coincidental batch_id column only hopes.
The same SHACL shape that gates a release gates the transfer standards. A transferred model is only as trustworthy as the paired standards that fit it, and "is a required result present, singular, and in range?" is precisely the closed-world question no SELECT query can pose. Book 4's release-gate bp:ReleaseShape (a SHACL shape — a rule that every required field be present, singular, typed, and in declared range), pointed at a transfer-standard row instead of a release lot, refuses a standard that is missing its reference titer or carries an out-of-range value before it can poison the PDS map — the same admission-gate discipline the capture-chromatography cycle uses on its training rows. The operating-range field on the card (titer 0 to 5.72 g/L) is then not an assertion but a sh:minInclusive/sh:maxInclusive derived from the validated data.
A feature pulled by its IRI survives the very probe swap this chapter is about. The whole transferability problem is that a column named glucose_online or wn_700 silently means a different measurement once the probe or scale changes. Wire the model input instead to its semantically identified, unit-bearing quantity — the value on bp:glucoseGperL of the lot, carrying a UCUM unit code and a BFO typing that keeps the measurement (a continuant quality) distinct from the run (an occurrent) that produced it — and the feature contract cannot be quietly mis-fed by a re-ordered export on the new analyzer. The same identity is what lets a historian's bare float become a feature a model trusts across the source and target systems, anchored to the ISA-95 equipment/batch hierarchy the data shadow is scattered across, arriving over OPC UA and grounded against a B2MML batch record — the same manufacturing-data standards the plant's MES, LIMS, and historian already speak, so the transfer record is one row pinned to shared vocabulary rather than a private convention.
ALCOA+ is what makes the transfer record admissible, not merely tidy. Under 21 CFR Part 11 and EU GMP Annex 11, every value the transfer learned from — each paired standard, each naive and corrected metric — must be Attributable, Legible, Contemporaneous, Original, and Accurate (the ALCOA core, plus Complete, Consistent, Enduring, Available). The card's sha256-pinned training set is the Original/Complete leg, the change-control lineage is the Attributable/Enduring leg, and the red naive-transfer metric recorded before correction is Contemporaneous evidence that the jump was necessary. A transfer record that omits the naive number, or that cannot be traced to its signed source procedure, fails ALCOA+ regardless of how good the recovered R² looks — which is exactly why the requalification field reads "new analytical procedure under ICH Q14" and not "copied the model over."
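The first two threads can be sketched in plain Python, without a triple store. This is an analogue under stated assumptions, not the series' SPARQL or SHACL: the standard and batch identifiers, the DERIVED_FROM dictionary, the reference_titer_g_per_l key, and the function names are all hypothetical, and the chapter's real dataset holds a single batch, so the two batches here exist only to show the split. The range 0 to 5.72 g/L is the card's operating range.

```python
from collections import defaultdict

# Hypothetical derivedFrom edges (child -> parent); IDs are illustrative only.
DERIVED_FROM = {
    "STD-001": "BATCH-A",
    "STD-002": "BATCH-A",
    "STD-003": "BATCH-B",
    "STD-004": "BATCH-B",
    "BATCH-A": "WCB-CHO-001",
    "BATCH-B": "WCB-CHO-001",
}
BATCHES = {"BATCH-A", "BATCH-B"}


def batch_ancestor(node, edges=DERIVED_FROM, batches=BATCHES):
    """Walk derivedFrom edges upward until a batch node is reached."""
    start, seen = node, set()
    while node not in batches:
        if node in seen or node not in edges:
            raise ValueError(f"no batch ancestor for {start!r}")
        seen.add(node)
        node = edges[node]
    return node


def leave_one_batch_out(standards):
    """Yield (held_out_batch, fit_ids, check_ids); no batch straddles the cut."""
    groups = defaultdict(list)
    for std_id in standards:
        groups[batch_ancestor(std_id)].append(std_id)
    for held_out in sorted(groups):
        fit = [s for batch in sorted(groups) if batch != held_out
               for s in groups[batch]]
        yield held_out, fit, groups[held_out]


def admit(row, low=0.0, high=5.72):
    """Closed-world check: reference titer present, singular, and in range."""
    values = row.get("reference_titer_g_per_l", [])
    return len(values) == 1 and low <= values[0] <= high


rows = {
    "STD-001": {"reference_titer_g_per_l": [1.2]},
    "STD-002": {"reference_titer_g_per_l": []},     # missing: refused
    "STD-003": {"reference_titer_g_per_l": [3.4]},
    "STD-004": {"reference_titer_g_per_l": [9.9]},  # out of range: refused
}

admitted = [std_id for std_id, row in rows.items() if admit(row)]
splits = list(leave_one_batch_out(admitted))
```

The gate runs before the split on purpose: a standard that fails admission never reaches the fit, and the grouping key comes from the lineage walk rather than from whatever a batch column happens to say.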

Read together, these turn the transfer card from a status report into a node in the FAIR graph the rest of the series builds: the same record an auditor diffs, a GraphRAG assistant grounds an answer against ("what probe and scale was titer_pls validated on, and under which transfer event?"), and the next campaign re-walks when the biology moves. The ontology is not decoration on the transfer; it is what keeps the binder of facts honest enough to defend the jump.

The unsolved part: run-to-run variability defeats clean transfer

It would be dishonest to present calibration transfer and hybrid modeling as a solved pipeline, because the third shift — biology — does not hold still long enough to be transferred to. Calibration transfer assumes there is a stable target domain to map onto: a probe with consistent optics, a process at a fixed operating point. Bioprocesses violate that. The next batch has a different working-cell-bank vial, a different media lot, a slightly different inoculum age; raw materials drift seasonally; the same 2,000 L tank produces a different "normal" this campaign than last. This is run-to-run variability, and recalling the formal split from the start of the chapter, it is concept drift, not covariate shift — P(y|X) itself wanders. That is precisely the failure no input-alignment map can repair, because the relationship the model encodes is what moved.
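The distinction can be made concrete with a toy one-variable calibration. The sketch below is illustrative only: the gain 1.5, the offset 0.3, and the slopes 2.0 and 2.4 are invented numbers, and the alignment is assumed to be known exactly, which is the best case for any input-alignment map.

```python
def rmse(pred, truth):
    """Root-mean-square error between two equal-length sequences."""
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5


def model(x):
    """Frozen source calibration: titer = 2.0 * signal."""
    return 2.0 * x


signal = [0.5, 1.0, 1.5, 2.0, 2.5]
titer = [2.0 * s for s in signal]

# Covariate shift: the new probe reports the same signal with a gain and offset.
target_reading = [1.5 * s + 0.3 for s in signal]
aligned = [(r - 0.3) / 1.5 for r in target_reading]  # ideal input alignment

covariate_naive = rmse([model(r) for r in target_reading], titer)
covariate_aligned = rmse([model(a) for a in aligned], titer)

# Concept drift: the biology moved, so the same signal now means more titer.
drifted_titer = [2.4 * s for s in signal]
concept_aligned = rmse([model(a) for a in aligned], drifted_titer)

print(covariate_naive, covariate_aligned, concept_aligned)
```

Perfect input alignment drives the covariate-shift error to zero, yet leaves the concept-drift error untouched, because the mistake is now in the slope the model encodes rather than in the signal it receives.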

The consequence is brutal for the clean story. The transfer standards you measure today to align the plant probe describe the process as it was today; if the next media lot shifts the spectral background or the cells' metabolism, your standards are already stale — they pinned a target domain that no longer exists. The R² = 0.9908 the demo recovers is conditional on the target domain being the domain you transferred to — and in a living system that domain keeps wandering. This is also why model decay is fast in bioprocess relative to other ML domains: a soft sensor can be in spec at qualification and out of spec two campaigns later with no code change, purely because the biology moved underneath it. The sparse-reference regime from the soft-sensor lifecycle makes the decay nearly invisible while it happens: the offline assay that could detect it returns only once or twice a day, so a model can drift for hours — through an entire shift — looking perfectly healthy on a screen, with nothing to contradict it until the next reference sample lands. The very sparsity that motivated building the soft sensor is what hides the soft sensor's own failure.

This is the deep reason the field's strongest verdict is hybrid modeling plus transfer/Bayesian priors, not pure ML. Run-to-run variability is precisely the regime where physics-anchored models and informative priors earn their keep: they constrain the model to plausible behavior between sparse reference points, and they degrade gracefully rather than catastrophically when the biology shifts — a hybrid whose backbone is a mass balance cannot predict a negative titer or a 39 g/L excursion the way a naive black box did on the new probe, because the physics forbids it. The 2026 review of transfer learning in bioprocess engineering reaches the same honest place: transfer learning is the near-term workaround for the small-data, high-variability reality, it offers a strategy-selection table by task and source-target similarity, and — the load-bearing admission — a true bioprocess "foundation model" that would make transfer trivial does not yet exist as a usable system [9] (peer-reviewed-independent, research). Until it does, every model that crosses a scale boundary must be distrusted on a schedule: re-qualified at the new scale, monitored for drift against the sparse reference, and re-transferred when the biology moves. That monitoring is not hand-waving — it is the concrete machinery the MLOps drift-detection module (drift.py) already builds: an I/MR control chart on the online-versus-offline residual stream catches a probe-fouling bias, and a Population Stability Index check on the inputs flags when live spectra leave the distribution the transfer standards pinned, which is the signal to re-transfer. The transfer record from the previous section is not a one-time artifact; it is a row in a ledger that has to be re-written each time the target domain wanders far enough to matter.
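The two checks named above are small enough to sketch with the standard library. This is not the series' drift.py, whose contents are not shown here: the function names, the residual and input values, and the bin cut points are all hypothetical. The 2.66 multiplier is the standard individuals-chart constant (3 divided by d2 = 1.128 for moving ranges of two points), and the PSI alarm levels of roughly 0.1 and 0.25 are a common rule of thumb rather than anything the chapter specifies.

```python
import bisect
import math


def imr_limits(residuals):
    """Centre line and control limits of an individuals (I) chart."""
    moving_ranges = [abs(b - a) for a, b in zip(residuals, residuals[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    centre = sum(residuals) / len(residuals)
    half_width = 2.66 * mr_bar  # 3 / d2, with d2 = 1.128
    return centre - half_width, centre, centre + half_width


def bin_shares(values, cut_points, floor=1e-6):
    """Share of values per bin, floored so the log in PSI stays finite."""
    counts = [0] * (len(cut_points) + 1)
    for v in values:
        counts[bisect.bisect_right(cut_points, v)] += 1
    return [max(c / len(values), floor) for c in counts]


def psi(reference, live, cut_points):
    """Population Stability Index between a reference and a live sample."""
    ref = bin_shares(reference, cut_points)
    cur = bin_shares(live, cut_points)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical online-minus-offline residuals (g/L) from qualification.
qualification = [0.0, 0.1, -0.1, 0.05, -0.05]
low, centre, high = imr_limits(qualification)
new_residual = 0.5  # e.g. a fouled probe reading high
out_of_control = not (low <= new_residual <= high)

# Hypothetical one-number summary of the spectral input, e.g. one band intensity.
reference_inputs = [1, 2, 3, 4, 5, 6, 7, 8]
live_inputs = [7, 7, 8, 8, 8, 7, 8, 7]
input_psi = psi(reference_inputs, live_inputs, cut_points=[2.5, 4.5, 6.5])

print(out_of_control, input_psi > 0.25)
```

The two checks watch different things: the residual chart can only fire when a reference assay lands, while the input PSI can fire on every spectrum, which is what makes it useful in the sparse-reference regime.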

What this chapter adds to the model suite

This chapter contributes one module to examples/platform/ml/:

transfer.py — a runnable calibration-transfer / domain-shift demonstration. It trains a PLS titer soft sensor (6 latent variables, standard-scaled) on the real raman_spectra.parquet, synthesizes a different-probe / larger-scale spectral response (multiplicative gain, baseline tilt, read noise, axis drift — illustrative, since only one real batch on one probe is available), shows the naive frozen model collapsing on the shifted spectra (R² = −577.73), and recovers it with Piecewise Direct Standardization fit from 24 Kennard-Stone-style transfer standards (R² = 0.9908). It exposes fit_pds / apply_pds as reusable functions and ends with a CI-style assertion (run with python platform/ml/transfer.py) that fails if the transfer step stops reducing cross-probe RMSE, so the chapter's central claim cannot silently rot. transfer.py ships alongside the suite's 21 gated models rather than inside the run_all.py ledger, but it carries the same kind of self-check assert, so it is held to the same non-rot standard.
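For readers who want the shape of the idea without opening the module, here is a compact stand-in. It is a sketch under loud assumptions and not the contents of transfer.py: the spectra are fully synthetic (one Gaussian band, a gain of 1.3, a linear baseline tilt), a ridge regression replaces the 6-latent-variable PLS, the functions carry a _sketch suffix to mark them as hypothetical, and NumPy is assumed to be installed. Only the overall pattern follows the description above: fit a windowed map on 24 evenly spaced paired standards, then assert that the map reduces cross-probe RMSE.

```python
import numpy as np


def fit_pds_sketch(X_source, X_target, half_width=2):
    """Fit a banded map so that X_target @ F + b approximates X_source."""
    n, p = X_source.shape
    F, b = np.zeros((p, p)), np.zeros(p)
    for j in range(p):
        lo, hi = max(0, j - half_width), min(p, j + half_width + 1)
        A = np.hstack([X_target[:, lo:hi], np.ones((n, 1))])  # window + offset
        coef, *_ = np.linalg.lstsq(A, X_source[:, j], rcond=None)
        F[lo:hi, j], b[j] = coef[:-1], coef[-1]
    return F, b


def apply_pds_sketch(X_target, F, b):
    """Map target-probe spectra onto the source-probe basis."""
    return X_target @ F + b


def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))


# Synthetic paired spectra: the same samples seen by two different probes.
rng = np.random.default_rng(0)
n, p = 60, 40
titer = np.linspace(0.0, 5.0, n)  # one smoothly evolving batch
peak = np.exp(-0.5 * ((np.arange(p) - 20) / 4.0) ** 2)
X_src = np.outer(titer, peak) + 0.005 * rng.standard_normal((n, p))
tilt = 0.5 + 0.02 * np.arange(p)
X_tgt = 1.3 * np.outer(titer, peak) + tilt + 0.005 * rng.standard_normal((n, p))

# Frozen source calibration: ridge regression standing in for the PLS model.
x_mean, y_mean = X_src.mean(axis=0), titer.mean()
Xc = X_src - x_mean
w = np.linalg.solve(Xc.T @ Xc + 1.0 * np.eye(p), Xc.T @ (titer - y_mean))


def predict(X):
    return (X - x_mean) @ w + y_mean


# 24 standards spread evenly over acquisition order; score on the rest.
std = np.linspace(0, n - 1, 24).round().astype(int)
held_out = np.setdiff1d(np.arange(n), std)
F, b = fit_pds_sketch(X_src[std], X_tgt[std])

rmse_naive = rmse(predict(X_tgt[held_out]), titer[held_out])
rmse_pds = rmse(predict(apply_pds_sketch(X_tgt[held_out], F, b)), titer[held_out])
assert rmse_pds < rmse_naive, "transfer step stopped reducing cross-probe RMSE"
```

Because each column of F is fit only from a five-channel window, the map is banded, which is why PDS gets by on far fewer standards than a full p × p Direct Standardization matrix would need. The closing assert plays the same role as the module's CI-style check, scaled down to a toy.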

It builds directly on examples/analytics/soft_sensor.py (the in-domain PLS Raman→titer model) — transfer.py is that model meeting a domain shift — and feeds forward to the upstream soft-sensing chapters, where the transferred model is what actually runs on the production bioreactor.

Why it matters

Tech transfer is where a process either becomes a product or stalls, and increasingly the models built during development are part of what must transfer. If a soft sensor that took six months to calibrate at small scale cannot move to the plant, the development investment is stranded: the manufacturing team falls back to the slow offline assay and the real-time control the model promised never arrives. Calibration transfer turns that from a rebuild into a re-alignment with a handful of standards; hybrid scale-down modeling builds models that travel because they understand the physics that scale changes; and scale-up risk prediction starts to turn the most expensive gamble in manufacturing into something a model can inform. The throughline, as always, is honesty about the limit: a model is a fit to a place and a moment, and moving it is real work with real evidence, not a copy operation. Get the transfer discipline right and development knowledge reaches the plant; skip it and the most sophisticated model in the building is wrong in a way no one notices until a batch is wrong too.

In the real world

The production-grade pattern today is transfer the calibration, monitor it forever. Sartorius's SIMCA / SIMCA-online MVDA stack and BioPAT Spectro retrofit Raman into Ambr and Biostat STR vessels precisely so a model built at small scale can be moved up the scale ladder under monitoring [10] (vendor, production for monitoring). Amgen's Juncos, Puerto Rico site runs SIMCA OPLS harvest-titer models in commercial GMP (Good Manufacturing Practice, the regulated quality system manufacturing runs under), with the measured assay remaining the official result. The deployment is validated under 21 CFR Part 11, the FDA rule governing electronic records and signatures, and reports R² = 0.91 / Q² = 0.85, where Q² is the cross-validated cousin of R²: how well the model predicts data it was not fit on. This is a first-party, self-reported account of in-process models operating at commercial scale, which is exactly a model that survived the transfer to manufacturing and is held to advisory authority once there [11] (production, first-party self-reported). The published calibration-transfer studies that quantify the problem, Pétillot's two-probe experiment and the Sartorius Ambr-to-10 L transfer, are the strongest evidence that the effect is large and the fix is real [2][4].

The regulatory frame is catching up, and it lands squarely on the transferred model. FDA's 2023 discussion paper Artificial Intelligence in Drug Manufacturing names model maintenance and re-validation across deployment changes as an open question for AI under cGMP — which is precisely the cross-scale, cross-probe move this chapter is about [12] (regulatory). Its 2025 draft guidance on AI to support regulatory decision-making frames model credibility as a function of model influence and decision consequence in a seven-step, risk-based assessment — a transferred model touching a CQA decision sits high on both axes, which is why the anatomy card's intended-use field reads "advisory, not unattended release" [13] (regulatory). The ISPE 7th Pharma 4.0 survey's blunt summary applies here as much as anywhere: AI/ML has the most pilots and the fewest scaled implementations, and the production clusters are monitoring and human-in-the-loop, not autonomous cross-scale control [14]. Cross-scale model transfer is real and useful; cross-scale autonomous model-driven control is still mostly a slide.

Key terms

Transferability — the degree to which a model trained in one domain (probe, scale, process) still performs in another; the binding constraint on bioprocess ML.
Calibration transfer (model transfer) — re-aligning a spectral model to a new instrument or scale using a small set of paired samples, so the frozen model can be reused without full recalibration.
Transfer standards — samples measured on both the source and target instruments, used to learn the calibration-transfer map; the expensive, run-it-twice part of any transfer.
Direct Standardization (DS) / Piecewise Direct Standardization (PDS) — the global vs. local linear transforms that map target spectra onto the source basis; DS fits one full p × p matrix and overfits with few standards, while PDS fits one local windowed map per wavenumber (a banded matrix) and needs far fewer.
Kennard-Stone selection — an algorithm that greedily picks a maximally spread-out subset of samples as transfer standards, so they span the spectral space rather than cluster.
Covariate shift / concept drift — covariate shift is P(X) moving while P(y|X) holds (recoverable by re-aligning inputs, e.g. calibration transfer); concept drift is P(y|X) itself moving (not recoverable by input alignment — needs new physics or new data).
Scale-down model — a small-scale model intended to predict large-scale behavior; hybrid scale-down keeps a scale-aware mechanistic backbone so the learned residual is small enough to travel.
kLa / mixing time / CO₂ stripping — the engineering parameters that change with vessel scale and drive the scale shift; the features of a scale-up risk model and the scale-dependent terms fed into a hybrid backbone.
Run-to-run variability — the standing batch-to-batch noise floor from living-system and raw-material variation; a moving target (concept drift), not a one-time offset, so it caps transfer and drives fast model decay.
Requalification — treating a transferred model as a new analytical procedure under ICH Q14, with re-validation over a declared operating range, change-control lineage, and ongoing monitoring.
bp:derivedFrom / PROV-O lineage — the typed, transitive provenance edge that records the source → transfer event → target jump and roots every lot in the cell bank; it is also the grouping key that makes a leave-one-batch-out check of the transferred model honest, walked as a SPARQL traversal rather than trusting a batch_id column.
SHACL admission gate (bp:ReleaseShape) — the closed-world release shape that requires every result present, singular, typed, and in range; pointed at transfer-standard rows it refuses an incomplete or out-of-range standard before it can poison the calibration-transfer map.
Feature by IRI / ALCOA+ — wiring a transferred model's input to its semantically identified, unit-bearing quantity (a bp: IRI with a UCUM unit and BFO typing, anchored to ISA-95 / OPC UA / B2MML) so a probe or scale swap cannot silently mis-feed it; and recording the transfer under ALCOA+ (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete/Consistent/Enduring/Available) so the jump is admissible, not merely tidy.

Where this leads

The model has been built, validated, and — when it must move — transferred and re-qualified. Now it goes to work upstream. The next chapter, Seed Train: Soft Sensing the Inoculum and Predicting Contamination Risk, opens Part III by applying these learned models to the first production-scale step: estimating the state of the inoculum from sparse signals and predicting the contamination risk that can end a campaign before the production bioreactor is even filled — the first place a transferred model has to earn its trust on a real run.
