Data, the Fuel: Readiness, Features, and the Cold-Start Reality

📍 Where we are: Part I · Foundations of Learning in Bioprocess — Chapter 2. The last chapter argued that bioprocess breaks the data-science rulebook because the data is small, slow, and alive. This chapter goes one level down: before any model can be trained, the data has to exist in a usable shape — and in most plants it does not. Data readiness, not algorithm choice, is barrier number one.

Every survey of machine learning in biomanufacturing reaches the same unglamorous conclusion: the bottleneck is rarely the model. It is the data — its accessibility, its shape, its provenance, and above all its scarcity at the exact moments learning needs it most. A team can have scikit-learn, PyTorch, a GPU cluster, and a stack of review papers, and still be unable to train anything useful, because the batch records live half on paper, the historian and the LIMS (the lab's Laboratory Information Management System, which holds sample and assay results) do not agree on a batch ID, and the only ground-truth measurement of the thing they want to predict arrives twice a day from a bench assay. This chapter is about turning a real, messy bioprocess data estate into the fuel a model can actually burn.

We will be concrete throughout. The running example is the same one the whole series uses: the golden run BATCH-2026-001 (a "golden" batch is a clean, on-target reference run) and its siblings; the Raman spectra (Raman spectroscopy reads a chemical fingerprint of the broth from scattered laser light) and offline assays (bench lab measurements) that the simulator produced; and the release results (the end-of-batch quality tests that decide whether a drug batch may be shipped) in examples/datasets/hplc_results.csv (HPLC is a standard lab separation method used to measure purity and product concentration). These are realistic synthetic data, generated by a physics-based model rather than measured on a real plant. By the end you will have a defensible answer to the single most consequential question in applied bioprocess ML: how do you split your data so that the score you report is the score you will actually get in production?

The simple version

Imagine you want to teach someone to judge whether a cake is done by smell alone. You bake a hundred cakes, sniff each one at many moments, and write down "done / not done" — but the only true answer comes from a thermometer you can use twice per cake. That is bioprocess data: a flood of cheap, fast signals (the smells, every minute) and a trickle of slow, expensive truth (the thermometer, twice a batch). Two mistakes ruin the lesson. First, if your hundred cakes are scattered across sticky notes, a notebook, and three different ovens that label time differently, you can't even line the smells up with the thermometer readings — that's the data-readiness problem. Second, if you let the student practice on whiffs from the same cake they'll be tested on, they'll ace the test and fail on the next cake — that's data leakage, and the fix is to keep whole cakes — whole batches — entirely on one side of the line.

What this chapter covers

Data readiness as barrier number one: silos, non-FAIR data, the paper/digital hybrid, and what "AI-ready" actually demands.
Where the fuel lives: the historian / data shadow (a time-series database of the dense online signals) and contextualization (the metadata that makes a number mean something).
Feature engineering for batches: time alignment and warping, resampling, batch-wise unfolding for MPCA, and spectral preprocessing (baseline correction, SNV, and Savitzky-Golay derivatives).
The leakage taxonomy: row vs batch vs temporal split, target leakage, and the scaling-before-split trap — the single most common and most damaging family of mistakes in bioprocess ML.
Anatomy of one training example: the single contextualized, leak-aware unit every model in the book consumes, unpacked field by field.
The cold-start cadence: offline reference measured once or twice a day, and why "more data" does not always mean "more information."
Data integrity (ALCOA+ — a regulatory data-trust standard): why a model trained on records that cannot be trusted is itself untrustworthy under GxP (the family of "Good Practice" regulations — chiefly GMP, Good Manufacturing Practice — that govern how medicines are made).
The module this chapter contributes: examples/platform/ml/dataio.py, the shared loaders and batch-aware split helpers every later chapter imports.

Data readiness is barrier number one

When practitioners are surveyed about what blocks AI in pharmaceutical manufacturing, the answer is overwhelmingly data, not models. A widely cited 2024 industry study found that roughly 70% of organizations struggle to access the data they need for AI because it is locked in silos, and only about 39% use standardized formats or ontologies (machine-readable shared vocabularies; see Book 4) to describe it [1] (vendor/self-reported). The picture the bioprocess literature paints is consistent: data stays "siloed, fragmented, and underutilized," made worse by hybrid paper-and-digital records where a batch's truth is split between a validated MES (Manufacturing Execution System, the software system of record for a batch's steps and parameters), a scientist's notebook, an instrument's local drive, and a PDF [2] (research/consensus).

Three failure modes recur, and each is an engineering problem long before it is a modeling one:

Silos. The historian holds the online tags; the LIMS holds the assays; the MES holds the batch record; the chromatography skid holds its own chromatograms. Each is internally fine and mutually unreachable. A model needs all four joined on one batch identity, and the join is often the hardest engineering in the project — not because joining tables is hard, but because the four systems disagree on what a batch is. The historian keys on equipment and a continuous timestamp; the LIMS keys on a sample ID that an analyst typed in; the MES keys on a campaign-and-lot string; the skid logs a run number that resets to one every power cycle. Reconciling those four notions of identity is the silent majority of a bioprocess ML project's effort.
Non-FAIR data. Even when you can reach a dataset, it may not be Findable, Accessible, Interoperable, or Reusable (the four FAIR principles [3]). A spreadsheet named final_FINAL_v3.xlsx with units only in a human-readable header, no schema, and no link to the batch it describes is technically accessible and practically useless to an automated pipeline. Interoperable is the one most often missing in practice: three plants record dissolved oxygen as DO, pO2, and dissolved_oxygen_pct, in %, % air saturation, and mmHg, with no machine-readable statement of which is which. A model trained on one cannot consume the others, and a human has to translate every column by hand, which does not scale and does not survive an audit.
The paper/digital hybrid. A measurement written on a form and later keyed into a system loses its native timestamp resolution, its instrument metadata, and frequently its uncertainty. You cannot reconstruct what was never digitally captured. A glucose reading transcribed as "4.2 g/L, 14:30" has lost the analyzer serial number, the calibration state, the exact second, and the operator who ran it — every attribute a model would use to weight or distrust it, and every attribute a regulator will ask for. The hybrid record is not merely incomplete; it is irreversibly incomplete.
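The interoperability gap can be made concrete with a small, explicit alias table. The sketch below is illustrative only: `ALIASES` and `harmonize` are hypothetical names, the pairing of each header with a unit is an assumption (the sites' real conventions are not stated here), and the numbers are invented. It deliberately performs no unit conversion; it only records what each column claims to be and refuses anything it has no declared meaning for.

```python
import pandas as pd

# Hypothetical alias table: each site-specific header maps to one canonical
# name plus the unit that site is ASSUMED to record. Nothing is guessed at runtime.
ALIASES = {
    "DO": ("dissolved_oxygen", "%"),
    "pO2": ("dissolved_oxygen", "mmHg"),
    "dissolved_oxygen_pct": ("dissolved_oxygen", "% air saturation"),
}

def harmonize(df, passthrough=("batch_id", "timestamp")):
    """Rename known aliases and record their units; fail loudly on unknown columns."""
    unknown = [c for c in df.columns if c not in ALIASES and c not in passthrough]
    if unknown:
        raise KeyError(f"no declared meaning for columns: {unknown}")
    known = [c for c in df.columns if c in ALIASES]
    out = df.rename(columns={c: ALIASES[c][0] for c in known})
    # Units travel with the frame as metadata instead of living in a header comment.
    out.attrs["units"] = {ALIASES[c][0]: ALIASES[c][1] for c in known}
    return out

# Invented one-row frames standing in for two sites' exports.
site_a = pd.DataFrame({"batch_id": ["BATCH-2026-001"], "DO": [40.0]})
site_b = pd.DataFrame({"batch_id": ["BATCH-2026-002"], "pO2": [60.0]})
a, b = harmonize(site_a), harmonize(site_b)
print(a.attrs["units"], b.attrs["units"])
```

After harmonization the two frames share a column name but still carry different units, which is the point: the mismatch is now visible to a machine and can be converted or rejected deliberately rather than discovered in production.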

"AI-ready data" is the proposed remedy, and the industry's own framing — BioPhorum's "data as a product" workstream is the prominent example — treats a dataset as something with an owner, a schema, a quality contract, and a consumer, rather than a byproduct of running a plant [2] (industry consensus). The deep point for this book is that Books 2, 3, and 4 of this series are, between them, the data-readiness solution. Book 2 builds the governed data point and the historian; Book 3 builds the open-source stack that captures, contextualizes, and stores it; Book 4 models the whole spine as a knowledge graph so a machine can find and join it. Book 5 assumes that work is done — and this chapter is where we cash the assumption in. The honest framing, repeated by every survey in the field, is that this readiness work is 80% of the labor and 0% of the headline: it never appears in the paper that reports the R² (a goodness-of-fit score, usually between 0 and 1, where 1 is a perfect fit), but the R² is fiction without it.

Where the fuel lives: the historian and contextualization

Bioprocess ML draws on two kinds of data that could not be more different in shape, and confusing them is a common early error.

The dense online stream lives in the historian — the time-series database that records every probe on every vessel, sampled every few seconds, forever. In Book 2 this is the data shadow: the digital trace that runs alongside the physical batch. Historians (the AVEVA PI System (formerly OSIsoft PI) archive, or open-source equivalents) are built for exactly this load: append-mostly, billions of timestamped tag values, with compression that exploits the fact that a steady probe barely moves between samples. In our datasets the data shadow is fedbatch_timeseries.parquet and the minute-cadence fedbatch_state.parquet (the golden batch alone is 20,160 rows of internal state — temperature, pH, dissolved oxygen, agitation, feed events — over 14 days) plus the in-line raman_spectra.parquet, 701 intensity channels (wn_400 … wn_1800) read alongside the kinetic state. These signals are cheap and fast: thousands of points per batch per tag. They are the model's inputs, the features.

The in-line probe menu is broader than Raman alone. Near-infrared (NIR) reads overtone and combination bands; dielectric (capacitance) spectroscopy tracks viable biomass through the polarizable cell membrane; and 2D fluorescence (excitation–emission matrix, EEM) spectroscopy scans a grid of excitation × emission wavelengths and reads the native fluorescence of aromatic amino acids and cofactors (tryptophan, NAD(P)H, flavins), yielding a multi-way data cube usable as a soft-sensor input for biomass and some metabolites. EEM is complementary to Raman because it senses a different physics. The fluorescence it measures is also what appears in a Raman spectrum as the broad baseline that preprocessing (the baseline-correction and derivative steps below) has to remove, so one probe's signal is the other's nuisance.

The sparse offline reference lives in the LIMS and the bench: offline_assays.csv (viable cell density (VCD), glucose, lactate, glutamine, ammonia, osmolality, titer (the concentration of antibody product in the broth, in g/L, and the chapter's primary prediction target), offline pH) and the release results in hplc_results.csv. These are expensive and slow: in our simulated campaign each batch has exactly 28 offline samples over 14 days (two per day) and a single end-of-process release Certificate of Analysis. They are the model's targets, the ground truth, and there are vanishingly few of them. The golden run BATCH-2026-001 releases at monomer 98.611% (the fraction of product that is intact, correctly assembled antibody; higher is better) with host-cell protein (HCP) within specification. Its OOS (out-of-specification) sibling BATCH-2026-004 fails its release test at HCP = 128 ng/mg against a spec maximum of 100 ng/mg; HCP is a process impurity, and the spec maximum is the regulatory limit it must stay under. These are among the critical quality attributes (CQAs) a batch must pass to be released. (Release testing itself is the subject of Book 1's QC and release chapter and this book's own release predictor.) That single end-of-batch label is, for a batch-level model, the entire thing you are trying to predict from 14 days of dense input.

A raw historian value, though, is not yet a feature. The number 36.51 means nothing until you know it is BR101.Temp.PV (the dotted tag name reads as bioreactor BR101, the Temp measurement, its live process value PV) in degrees Celsius, with Good quality, at a specific timestamp, belonging to BATCH-2026-001 during its production phase. Attaching that identity is contextualization — the discipline Book 3 makes literal in its contextualization layer, and which Book 2 traces through the lifecycle of a data point. Contextualization is what makes the cross-system join possible: it is the reason a Raman spectrum row and an offline titer row and a release verdict can all be keyed to the same batch_id and the same point in batch time.

Two pieces of context matter most for learning:

The batch_id turns four silos into one joinable dataset and — as the next section makes brutally clear — is also the field that decides train versus test.
The quality flag (Good / Uncertain / Bad, the OPC UA StatusCode trio) is the difference between a probe reading and a probe artifact. A model that trains on a Bad-quality stretch is learning sensor failure, not biology. A second kind of stretch must also be excluded even though the instrument reports nothing wrong: a DO probe re-zeroed against air during a single-use calibration reads "fine" with a Good status, and only the phase/event context (a calibration or maintenance event) tells you to mask it. Both an instrument-asserted Bad code and a calibration or maintenance window are excluded from training, but you find them through different fields: the status code versus the phase context.

Every loader in this book's example suite assumes contextualized data — a batch_id column and a comparable timestamp on every table — because that assumption is exactly what the instruments-and-sensors and contextualization chapters earn.
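The two exclusion paths described above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions: the column names (`quality`, `batch_hour`, `start_h`, `end_h`), the tag `BR101.DO.PV`, the helper `usable_mask`, and every value are hypothetical, not the schema of the book's datasets.

```python
import pandas as pd

# Hypothetical contextualized slice of one tag; values and windows are invented.
points = pd.DataFrame({
    "batch_id": ["BATCH-2026-001"] * 5,
    "tag": ["BR101.DO.PV"] * 5,
    "batch_hour": [10.0, 11.0, 12.0, 13.0, 14.0],
    "value": [41.0, 40.5, 100.0, 0.0, 39.8],
    "quality": ["Good", "Good", "Good", "Bad", "Good"],
})
# Event context lives in a different table than the status code:
# here, one calibration window expressed in batch hours.
events = pd.DataFrame({"event": ["calibration"], "start_h": [11.5], "end_h": [12.5]})

def usable_mask(points, events):
    """True where a point is Good-quality AND outside every masked event window."""
    good = points["quality"].eq("Good")
    in_event = pd.Series(False, index=points.index)
    for ev in events.itertuples():
        in_event |= points["batch_hour"].between(ev.start_h, ev.end_h)
    return good & ~in_event

train_ready = points[usable_mask(points, events)]
print(train_ready[["batch_hour", "value", "quality"]])
```

The reading at hour 12 carries a Good status and is removed only because of the calibration window; the reading at hour 13 is removed only because of its status code. Neither field alone would have caught both.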

The ontology underneath: a feature is a typed edge, not a column name

Contextualization solves the join; an ontology (a machine-readable, shared vocabulary that fixes what each thing is and how things relate — Book 4) is what keeps the join meaning the same thing across plants, instruments, and years. The interoperability failure above — one plant's DO, another's pO2, a third's dissolved_oxygen_pct — is precisely the problem a knowledge graph dissolves: instead of a fragile string column name, the feature is pulled by its IRI (Internationalized Resource Identifier, a web-style global name), so dissolved oxygen on bioreactor BR101 is the single typed predicate bp:dissolvedOxygenPct regardless of which historian tag string a given site happened to type. Book 3 makes this literal — BATCH-2026-001 bp:monomerPct 98.611 is a stored triple, and the predicate is an IRI drawn from the ontology, not a header someone might rename next quarter. A feature pulled by IRI survives an instrument swap, a site transfer, and a column rename that would silently break a pipeline keyed on df["DO"].

That semantic grounding pays off in three ways a column-name pipeline cannot match:

The group key is a lineage edge, not a label. The batch_id that decides train-versus-test is, in the graph, the head of a bp:derivedFrom chain — the transitive lineage spine that roots every drug-substance lot back through its capture pool, its bioreactor batch, and its seed train to one frozen cell-bank vial. Because that edge is declared transitive, "every example that shares a production ancestor" is a graph query, not a brittle string match — which is exactly the grouping a leave-one-batch-out split needs. Two lots that look independent by name but descend from the same seed culture are not independent for validation, and the lineage graph is what surfaces it.
A measurement is kept distinct from the run it measures. Book 4 places every class under the continuant/occurrent cut of its upper ontology: a titer value (a continuant — a thing that persists) is a different kind of entity from the culture run (an occurrent — a process that unfolds in time) it was measured during. That distinction is not philosophy; it is the type discipline that stops a pipeline from accidentally treating a per-run label as a per-timepoint feature — the very target-leakage trap the next section names.
The release shape is a training-data completeness check. The same SHACL release gate (Shapes Constraint Language — a language that validates a graph has the required structure) that decides whether a lot may ship — every required CQA present (sh:minCount 1), singular, and in range — is, read from the model's side, a guarantee that a training example's inputs and label are complete and in-range before the model is allowed to learn from them. A closed-world shape that fails when the monomer result is missing is the same shape that should fail a training row whose target never arrived. Validating the training set against the release shape turns "is this data fit to learn from?" from a hope into a gate that runs. And it does run: the companion ontology's examples/platform/ontology/train_gate_demo.py loads the very same bp:ReleaseShape and the committed release records, then prints the admit/reject verdict — it admits the released lots (DS-001, DP-001) and rejects the out-of-spec sibling (DP-004, tripping bp:hmwPct at 2.41 % against the 2.0 % limit) — the same shape Book 4's validate.py gates release with, here read as a training-admission decision, so the claim is checkable rather than asserted.
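The lineage-as-group-key idea can be illustrated without a graph store. The sketch below uses a plain dictionary as a stand-in for bp:derivedFrom edges; the edges themselves, the `POOL-` and `SEED-` identifiers, and the helper names are invented for illustration. It groups at the seed-train level, which is an assumption: grouping at the very root would collapse every lot into one group if they all descend from the same cell-bank vial.

```python
# Hypothetical derivedFrom edges (child -> parent). Only the batch and DS
# identifiers echo the text; the pools and seed trains are invented.
DERIVED_FROM = {
    "DS-001": "POOL-001",
    "POOL-001": "BATCH-2026-001",
    "BATCH-2026-001": "SEED-A",
    "DS-002": "POOL-002",
    "POOL-002": "BATCH-2026-002",
    "BATCH-2026-002": "SEED-A",   # shares a seed train with batch 001
    "BATCH-2026-003": "SEED-B",
}

def ancestors(node, edges):
    """Follow derivedFrom transitively; returns the chain from parent up to the root."""
    chain, seen = [], {node}
    while node in edges:
        node = edges[node]
        if node in seen:
            raise ValueError(f"cycle in lineage at {node}")
        seen.add(node)
        chain.append(node)
    return chain

def lineage_group(node, edges, level_prefix="SEED-"):
    """Group key = the first ancestor (or the node itself) at the chosen lineage level."""
    for candidate in [node] + ancestors(node, edges):
        if candidate.startswith(level_prefix):
            return candidate
    return node  # no ancestor at that level: the node forms its own group

lots = ["DS-001", "DS-002", "BATCH-2026-003"]
groups = {lot: lineage_group(lot, DERIVED_FROM) for lot in lots}
print(groups)
```

DS-001 and DS-002 look independent by name but land in the same group, so a grouped split keyed on this value would keep them on the same side of the train/test line.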

A model whose features are semantically grounded, whose groups are lineage edges, and whose rows are SHACL-validated is a model whose inputs are FAIR and trustworthy by construction — which is the deeper reason the ontology work of Book 4 is a prerequisite to, not a decoration on, the ML of Book 5.

A single feature, fully grounded: an IRI-pulled value carrying its unit and a continuant/occurrent BFO tag, a transitive bp:derivedFrom lineage edge that serves as the group key, and a SHACL sh:minCount 1 completeness gate — all assembled into one row of the training matrix before the model is allowed to learn from it. Original diagram by the authors, created with AI assistance.

Turning batches into features

A trained model wants a rectangular matrix: rows are examples, columns are features, and one column is the target. A bioprocess gives you nothing of the kind. It gives you irregular, multi-rate, multi-table time-series tied to a few sparse labels. Four steps bridge the gap, and each carries a leakage trap we will return to.

Alignment: putting everything on one clock

The online tags, the Raman probe, and the bench assays each run on their own clock and their own cadence. The first job is alignment: expressing every measurement in a common time frame — usually batch time (hours since inoculation — the moment the cells are added to start the batch) rather than wall-clock time, so that day 7 of one batch lines up with day 7 of another. Alignment is also where you join across silos: an offline titer taken at batch-hour 168 must be matched to the Raman spectrum and the online state at that same batch-hour, so the model learns "this spectrum corresponds to this titer." Get the join wrong by even a few hours and you are teaching the model a lie — and in a cell culture where titer climbs steeply in the back half, a four-hour mismatch is a real error in grams per liter, baked silently into the label.
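A minimal sketch of this alignment step, assuming pandas and a contextualized `batch_id` and `timestamp` on every table. The helper names (`to_batch_hours`, `pair_labels`), the column names, the inoculation time, the half-hour tolerance, and all values are hypothetical. It converts wall-clock time to batch hours, then attaches to each measured assay the nearest spectrum from the same batch, dropping assays that have no spectrum close enough.

```python
import pandas as pd

def to_batch_hours(df, inoculation, ts_col="timestamp"):
    """Add batch_hour = hours elapsed since that batch's inoculation time."""
    t0 = df["batch_id"].map(inoculation)
    return df.assign(batch_hour=(df[ts_col] - t0).dt.total_seconds() / 3600.0)

def pair_labels(assays, spectra, tolerance_h=0.5):
    """Attach to each assay the nearest spectrum of the SAME batch within
    tolerance_h hours; assays with no close-enough spectrum are dropped."""
    left = assays.sort_values("batch_hour")
    right = spectra.sort_values("batch_hour")
    right = right.assign(spectrum_hour=right["batch_hour"])  # keep the matched time
    paired = pd.merge_asof(left, right, on="batch_hour", by="batch_id",
                           direction="nearest", tolerance=tolerance_h)
    return paired.dropna(subset=["spectrum_hour"])

# Invented toy data: three hourly spectra and two bench assays.
t0 = pd.Timestamp("2026-01-01 00:00")
inoculation = {"BATCH-2026-001": t0}
spectra = pd.DataFrame({
    "batch_id": "BATCH-2026-001",
    "timestamp": [t0 + pd.Timedelta(hours=h) for h in (167, 168, 169)],
    "wn_400": [0.11, 0.12, 0.13],
})
assays = pd.DataFrame({
    "batch_id": "BATCH-2026-001",
    "timestamp": [t0 + pd.Timedelta(hours=168, minutes=10), t0 + pd.Timedelta(hours=200)],
    "titer_g_per_l": [3.0, 3.6],
})

spectra_bh = to_batch_hours(spectra, inoculation).drop(columns="timestamp")
assays_bh = to_batch_hours(assays, inoculation).drop(columns="timestamp")
paired = pair_labels(assays_bh, spectra_bh)
print(paired)
```

The explicit tolerance is the design choice worth noticing: the assay at batch-hour 200 is dropped rather than silently matched to a spectrum taken 31 hours earlier, which is exactly the mislabeled row the paragraph above warns about.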

Batch time alone is not always enough, because two batches can reach the same biological state at different elapsed hours — a slow seed lags, a fast clone races ahead. When you need to compare trajectories (for batch-level monitoring, below), naive hour-by-hour alignment misaligns the phases: cell cultures grow through named stages (a slow lag phase, a fast-growing exponential phase, then a stationary phase), and if one batch's lag phase overlaps another's exponential phase, the comparison is meaningless. Dynamic time warping (DTW) is the classical fix: it finds a nonlinear stretch of one batch's time axis onto another's that minimizes the total distance between the warped trajectories, so the lag-to-exponential transition of every batch lines up with every other's regardless of when in wall-clock or even batch-clock time it happened. DTW is standard in batch-process monitoring precisely because batch duration and phase timing are themselves sources of variation you usually want to remove before comparing shapes. In practice the maturation coordinate — the axis along which a batch "matures" — is often not warped time at all but an integral measure such as integral viable cell density (IVCD/IVCC): the running total of viable cells accumulated over time, obtained by integrating the offline_assays VCD column across the batch. Titer and metabolite state move more reproducibly against this cumulative cell-exposure axis than against the wall clock; DTW is the tool when no such physical maturation variable is at hand. The cost is a hyperparameter — how much warping to allow — and a caution: warp too freely and you can align a bad batch onto a good one and erase the very deviation you were trying to catch.
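Both ideas in this passage fit in a few lines of NumPy. What follows is a textbook-style sketch rather than a production implementation: `dtw_distance` is the classic dynamic-programming recurrence with an optional band on how far the warp may stray from the diagonal (the "how much warping to allow" hyperparameter), and `ivcd` is a cumulative trapezoidal integral. The function names and the toy trajectories are hypothetical.

```python
import numpy as np

def dtw_distance(a, b, window=None):
    """Dynamic-programming DTW cost between two 1-D trajectories.
    window = max allowed |i - j| (a Sakoe-Chiba band); None = unconstrained."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def ivcd(hours, vcd):
    """Cumulative trapezoidal integral of VCD over time (same length as the input)."""
    hours, vcd = np.asarray(hours, float), np.asarray(vcd, float)
    steps = np.diff(hours) * (vcd[1:] + vcd[:-1]) / 2.0
    return np.concatenate([[0.0], np.cumsum(steps)])

# Invented toy trajectories: the same rise, one delayed by two samples.
fast = np.array([0, 0, 1, 2, 3, 3, 3, 3], float)
slow = np.array([0, 0, 0, 0, 1, 2, 3, 3], float)

pointwise = np.abs(fast - slow).sum()          # naive hour-by-hour comparison
warped = dtw_distance(fast, slow)              # unconstrained warp
banded = dtw_distance(fast, slow, window=1)    # warp limited to one sample
print(pointwise, warped, banded)
```

Unconstrained warping reports the two trajectories as identical in shape, while the naive comparison reports a large difference that is purely timing. The banded version sits in between, which is the trade-off the text describes: the tighter the band, the less a genuinely deviant batch can be warped into looking normal.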

Resampling: choosing a cadence

Once aligned, signals must be brought to a shared cadence by resampling — averaging or interpolating onto a fixed grid (every 10 minutes, or every hour). The book's datasets ship a 10-minute grid (fedbatch_timeseries_10min.csv) for exactly this reason. Resampling is a judgment, not a formality: too fine and you carry noise and inflate the data volume without adding information; too coarse and you blur the transients (a feed bolus, the day-7 temperature excursion the simulator seeds) that often carry the signal. The honest default is to resample no finer than the slowest thing you actually need to see.

Resampling hides a subtlety that bites real pipelines. When you downsample a 1-second tag to a 10-minute grid, you must aggregate (mean, last, max), and the choice matters: a max over the window catches a transient spike that a mean would dilute away. When you upsample a sparse offline assay onto the dense grid to pair it with spectra, you are interpolating a label — and forward-filling or linearly interpolating a titer between two bench samples invents ground truth that was never measured. That invented label is a quiet form of target leakage if the interpolation reaches across information the model should not have. The safe pattern is to resample features freely and to pair labels only at the timepoints where they were actually measured, never interpolated.
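The aggregation choice and the "pair labels only where measured" pattern can be shown together. This is a hedged sketch on invented data: the column names and values are hypothetical, and the right-closed, right-labelled bins are one reasonable choice (each grid point then summarizes only the minutes up to and including it), not the book's prescribed convention.

```python
import pandas as pd

# Invented 1-minute tag with a single one-minute spike.
idx = pd.date_range("2026-01-01 00:00", periods=20, freq="min")
tag = pd.Series(37.0, index=idx)
tag.iloc[7] = 39.0

# Right-closed, right-labelled bins: the value stamped 00:10 covers (00:00, 00:10].
binned = tag.resample("10min", closed="right", label="right")
features = binned.mean().rename("temp_mean").to_frame().join(binned.max().rename("temp_max"))

# One measured label (invented value). It is attached to the latest completed
# bin at or before its own timestamp; no label is interpolated onto other bins.
labels = pd.DataFrame({"titer_g_per_l": [1.2]}, index=[idx[12]])
labels["bin"] = labels.index.floor("10min")
paired = labels.join(features, on="bin")
print(features)
print(paired)
```

The mean dilutes the spike to 37.2 while the max preserves 39.0, and the output has exactly one labelled row because exactly one label was measured. The remaining feature rows stay unlabelled rather than receiving an interpolated titer.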

Batch-wise unfolding: a whole batch as one row

For predicting a batch-level outcome — final titer, a release CQA, a pass/fail — you need each batch to be one row. The classical move, inherited from multiway PCA (MPCA) and multiway PLS (the foundation of batch monitoring, covered in Book 3's analytics chapter), is the unfold: take the three-dimensional data cube of (batch × variable × time) and flatten each batch into one long row, laying every variable-at-every-timepoint side by side as columns. This is batch-wise unfolding in the Nomikos-MacGregor sense — the cube I × J × K (batches × variables × timepoints) collapses to a matrix I × (J·K), one row per batch, so that ordinary PCA or PLS can run on it. A 14-day batch of seven tags at hourly cadence unfolds to a single row of roughly 2,300 columns. Now "predict the CQA from the trajectory" is an ordinary supervised problem — with the brutal catch that you have one row per batch, so six campaign batches give you six training rows. This is the small-data wall from Chapter 1 in its rawest form, and it is why unfolded batch models lean so hard on dimensionality reduction (methods like PLS and PCA that compress thousands of correlated columns into a handful of underlying "latent" factors) rather than deep networks: when the number of columns J·K vastly exceeds the number of rows I, an ordinary model has far more knobs than examples and simply memorizes the few rows it sees, so the only thing that generalizes is a low-rank latent model — one that explains the data with a small number of shared factors instead of one weight per column. (Batch-wise unfolding also assumes every batch is the same length, which is exactly why DTW or another trajectory alignment usually has to come first.)
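The unfold itself is a single reshape. The sketch below uses the sizes from the text (six batches, seven tags, 336 hourly timepoints) with random placeholder values rather than real process data; the variable-major column order and the column names are conventions chosen for illustration.

```python
import numpy as np

I, J, K = 6, 7, 336                   # batches, tags, hourly timepoints over 14 days
rng = np.random.default_rng(0)
cube = rng.normal(size=(I, J, K))     # placeholder values, not process data

# Batch-wise unfolding: one row per batch. With this reshape the columns run
# variable-major: all K timepoints of tag 0, then all K timepoints of tag 1, ...
X = cube.reshape(I, J * K)
columns = [f"tag{j}_h{k}" for j in range(J) for k in range(K)]

# After mean-centering, six rows can span at most five independent directions,
# no matter how many columns there are.
Xc = X - X.mean(axis=0)
rank = np.linalg.matrix_rank(Xc)
print(X.shape, rank)
```

The shape is 6 × 2,352 and the centered matrix has rank 5: however wide the unfolded row, the data can support only a handful of latent factors, which is the arithmetic behind the preference for low-rank models here.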

For predicting a within-batch quantity that changes over time (titer right now from the Raman spectrum right now) you do not unfold. Each aligned timepoint is its own row. This is how Book 3's MVDA (multivariate data analysis) titer soft sensor gets hundreds of training examples from a handful of batches; a soft sensor is a software model that infers a slow, expensive measurement (here, titer) in real time from cheap online signals (here, the Raman spectrum), and this book's own dedicated treatment is the production-bioreactor chapter. Our raman_spectra.parquet is exactly this shape: 336 spectra from the golden batch BATCH-2026-001 (≈one per simulated hour across its 14-day run), each a 701-channel row paired with the reference titer at that instant. Because this file is a single batch, its honest validation is the forward-in-time temporal_split, not a batch-grouped one; batch_split is reserved for the multi-batch cohort. Which framing you choose (one row per batch, or one row per timepoint) dictates everything downstream, including what a group is when you split, which is the subject of the next section.

Spectral preprocessing: baseline, SNV, and derivatives

A raw Raman or near-infrared spectrum is not model-ready. It carries baseline drift (slow curvature from fluorescence and the instrument), multiplicative scatter (the whole spectrum scaled up or down by probe fouling, bubbles, or path-length changes), and high-frequency detector noise that have nothing to do with the chemistry you want. Chemometrics has a small, decades-old toolkit of standard preprocessing steps, and almost every production spectroscopic soft sensor uses some combination of them [8]:

Baseline correction removes the slow, broad background — most often fluorescence in Raman — that rides under the sharp chemical peaks. Methods range from fitting and subtracting a low-order polynomial to asymmetric least-squares (AsLS) baselines. The aim is to leave only the peaks the chemistry actually produces; without it, a drifting fluorescence background can swamp the very bands the model needs.
Standard Normal Variate (SNV). For each spectrum independently, subtract its own mean and divide by its own standard deviation:
```
x_snv = (x - x.mean()) / x.std()        # x: one spectrum, a NumPy array of its 701 channels
```
This removes multiplicative scatter and per-spectrum offset, so two spectra of the same broth taken through a slightly fouled versus a clean window become comparable. The single most important property of SNV for this chapter: it is computed per spectrum, over its own channels — it never looks at any other spectrum — so it is leak-free by construction. You can apply it before or after the train/test split and get the identical numbers, because each row's transform depends only on that row. (Multiplicative scatter correction, MSC, achieves a similar goal but references a mean spectrum — which makes it leaky if that mean is computed over the whole dataset including the test rows. Prefer SNV, or fit the MSC reference on train only.)
Savitzky-Golay smoothing and derivatives. Fit a low-order polynomial in a sliding window and either evaluate it (smoothing) or take its first or second derivative. This removes baseline (a first derivative kills a constant offset; a second derivative kills a linear slope) while smoothing noise, and it sharpens overlapping peaks. The window length and polynomial order are real hyperparameters — too wide a window smears peaks, too narrow amplifies noise — and like all hyperparameters they must be chosen on the training folds, never tuned against the test set.

The mathematics is modest but the discipline is not. Preprocessing is part of the model, and — as we are about to see for scaling — its parameters must be fit on the training data only. SNV and Savitzky-Golay are per-spectrum so they are safe in any order; but any preprocessing that learns from the dataset as a whole (mean-centering across spectra, PCA denoising, a global StandardScaler, an MSC reference spectrum) is a leakage trap if you fit it before you split. This distinction — per-row transforms are safe, cross-row transforms must be split-aware — is the bridge to the cardinal sin.
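The per-row property is easy to check numerically. The sketch below assumes NumPy and SciPy and uses synthetic spectra: a single Gaussian peak on a 701-channel axis (a 2 cm⁻¹ step from 400 to 1800 is inferred from the channel count and is an assumption), plus a copy with a different gain and offset as a crude stand-in for scatter and baseline shift. The helper names and the window settings are illustrative, not tuned values.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard Normal Variate: center and scale each row by its own statistics."""
    spectra = np.asarray(spectra, float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def sg_derivative(spectra, window_length=15, polyorder=2, deriv=1):
    """Savitzky-Golay derivative taken along the channel axis of each spectrum."""
    return savgol_filter(spectra, window_length, polyorder, deriv=deriv, axis=1)

# Synthetic spectra: one peak, then the same peak with gain 1.7 and offset 0.4.
wn = np.arange(400, 1801, 2)                       # 701 channels (assumed spacing)
peak = np.exp(-0.5 * ((wn - 1000) / 15.0) ** 2)
spectra = np.vstack([peak, 1.7 * peak + 0.4])

z = snv(spectra)               # gain and offset removed: both rows coincide
d1 = sg_derivative(spectra)    # offset removed, gain still present

# Per-row means leak-free: transforming a row alone gives the same numbers
# as transforming it together with any other rows.
alone = snv(spectra[:1])
print(np.allclose(z[0], z[1]), np.allclose(alone, z[:1]))
```

After SNV the two rows are numerically identical, and the first-derivative rows differ only by the gain factor, which shows why the two steps are often combined. The last check is the one that matters for splitting: a row's transform does not change when other rows are added or removed.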

The leakage taxonomy: row vs batch vs temporal, and the traps in between

Here is the single most important section in the chapter. Data leakage is any path by which information from the test set reaches the model during training, making the reported score better than the score you will get in production. In bioprocess it takes a handful of distinct forms, and the field's own reviews single it out — alongside the small-data regime — as one of the two technical reasons reported successes so often fail to reproduce or transfer [4] (research/consensus). Name them, because each has a different fix.

1. The row-wise split (the cardinal sin). The most common, most seductive, and most damaging mistake is to take your aligned table of timepoints, shuffle the rows, and split 70/30 into train and test. The reported R² will be spectacular — and meaningless. The reason is temporal and within-batch autocorrelation. Two Raman spectra taken an hour apart in the same batch are almost identical: same cells, same media lot, same probe, same fouling, nearly the same titer. If one of them is in the training set and the other in the test set, the model has effectively seen the answer — it is interpolating between two near-duplicate points, not generalizing. A row-wise random split scatters near-twins across the train/test boundary in every batch, so the test set is not independent of the training set. You measure memorization and call it skill. The fix is grouped splitting: every row from a given batch goes entirely to train or entirely to test — never both. Then the test batches are genuinely unseen processes, and the score is an honest estimate of "how will this do on the next batch?" — the only question that matters in manufacturing. The same logic forces a batch-aware form of cross-validation. (In cross-validation you rotate which slice of the data is held out for testing — each held-out slice is a fold — and average the scores across the rotations, which uses scarce data far more efficiently than a single one-off split.) Here that means GroupKFold (the folds are grouped by batch, so no batch is split across the line) and a leave-one-batch-out scheme when batches are precious. When you also tune a hyperparameter, the honest estimate needs nested cross-validation — tuning in an inner loop and scoring on an untouched outer fold, demonstrated in the release predictor — because selecting on the same folds you then report is optimistically biased. 
One subtlety hides inside even a correct grouped split: any transform whose window slides along the time axis (a Savitzky-Golay or other smoothing filter run over consecutive timepoints rather than over one spectrum's channels, a DTW alignment) must not span the train/test cut, and replicate spectra from an averaged or multi-probe acquisition must travel to the same side as their group. Otherwise a test row's near-neighbour leaks across a boundary you thought was clean (a known chemometrics-validation failure mode).

2. Temporal leakage (training on the future). Even within a single batch — where you genuinely have only one batch and a grouped split is impossible — a random split leaks, because the model can train on batch-hour 200 and test on batch-hour 100, learning the future to predict the past. A soft sensor that will run forward in real time must be validated forward in time: train on the early part of the timeline, test on the later tail, so the model is forced to extrapolate forward exactly as it will in production. That is the temporal_split in the shared loader. Temporal leakage also hides in features: any feature that secretly encodes a future value (a "time-to-end-of-batch" column, a cumulative sum that runs to the end, a smoothing window that reaches forward) trains the model on information it will not have at inference.

3. Target leakage. A feature that is a proxy for the answer, or that would not exist at prediction time, inflates the score and then vanishes in production. The classic bioprocess version: predicting final titer from a feature that was itself derived using the final titer (a normalized yield, a per-cell productivity computed with the end-point), or predicting a release OOS from an in-process feature that is actually a downstream rework flag. The interpolated-label trap from the resampling section is target leakage too: interpolating or back-filling a titer label across a window leaks the next bench measurement into the rows before it. (A forward fill carries only the previous measurement, so it does not leak the future, although the label it assigns grows stale.) The fix is a hard rule, that every feature must be computable from data available strictly before the prediction moment, and a habit of asking, for each suspiciously strong feature, "would I really have this at inference time?"
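
The interpolated-label trap can be made concrete with a toy example. The bench values below are invented, and label_at is a hypothetical helper: it labels a spectrum taken between two bench samples, once by interpolation and once by carrying the last known result, and the check changes only the later bench value.

```python
import numpy as np

# Bench titers at batch-hours 0, 12, 24 (illustrative values, not from the datasets).
sample_hours = np.array([0.0, 12.0, 24.0])
titers = np.array([0.5, 1.0, 1.8])

def label_at(hour, values, how):
    """Label assigned to a spectrum taken at `hour` under two filling rules."""
    if how == "interpolate":          # blends in the NEXT bench result
        return float(np.interp(hour, sample_hours, values))
    if how == "last_known":           # uses only results already measured
        idx = np.searchsorted(sample_hours, hour, side="right") - 1
        return float(values[idx])
    raise ValueError(how)

# Change only the future measurement (hour 12) and re-label the spectrum at hour 6.
future_changed = titers.copy()
future_changed[1] = 3.0

interp_moves = label_at(6.0, titers, "interpolate") != label_at(6.0, future_changed, "interpolate")
last_known_moves = label_at(6.0, titers, "last_known") != label_at(6.0, future_changed, "last_known")
print(interp_moves, last_known_moves)
```

Neither rule creates new ground truth. The most conservative choice is to train only on rows paired with a real bench measurement.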

4. Scaling (and preprocessing) before the split. The quietest trap of all: fit a StandardScaler, a PCA, or any cross-row transform on the whole dataset, then split. The scaler's mean and standard deviation now encode the test rows, so the test set has leaked into the training transform — the score is optimistic and the leak is invisible in every code review. The fix is mechanical and absolute: fit every learned transform on the training fold only, then apply it to the test fold. In scikit-learn this is what a Pipeline is for — it makes "fit on train, transform test" the default and makes the leaky alternative awkward to write. Per-row transforms (SNV, Savitzky-Golay) are exempt because they never look across rows; everything else must live inside the cross-validation loop, fit fold by fold.

The trap is worse than it looks because it hides. A leaky split passes every code review, produces a beautiful predicted-versus-measured scatter, and survives until the model meets batch number seven in production and falls apart — at which point the validation evidence in your filing is wrong. Getting the split right is the cheapest insurance in the entire pipeline, and examples/platform/ml/dataio.py makes the leak-free choice the default so that no later chapter can fall into the trap by accident.

The defining choice of bioprocess ML, drawn twice: a row-wise random split (left) scatters near-duplicate within-batch neighbours across the train/test boundary and reports a fantasy score, while a batch-grouped split (right) keeps whole batches on one side, so the held-out batches are truly unseen and the score is honest — set against the cold-start reality that ground truth arrives only once or twice a day. Original diagram by the authors, created with AI assistance.

The shared loader and a leak-free split, in code

This chapter contributes examples/platform/ml/dataio.py — the foundation module every later chapter imports. It loads the committed datasets, joins them on batch identity, and — crucially — provides the three split helpers that span the leakage taxonomy: a deliberately leaky random_split kept only to demonstrate the inflated metric, an honest within-batch temporal_split for the single-batch soft sensor, and the whole-batch batch_split that is the right answer whenever you have more than one batch. The excerpt below is the heart of the module — load_raman() returns the full frame with its batch_id column, so a caller holds the group key alongside the X/y arrays that raman_xy() hands back, and the three splits sit side by side so the contrast is unmissable.

# examples/platform/ml/dataio.py — shared loaders + leakage-aware splits
from pathlib import Path
import numpy as np
import pandas as pd

DATASETS = Path(__file__).resolve().parents[2] / "datasets"   # examples/datasets (dataio.py sits in examples/platform/ml)


def load_raman() -> pd.DataFrame:
    """Golden-batch in-line Raman: 336 spectra x 701 wavenumbers + reference labels."""
    return pd.read_parquet(DATASETS / "raman_spectra.parquet")


def raman_xy(df: pd.DataFrame | None = None, label: str = "titer_g_L"):
    """Return (X spectra, y label, wavenumber column names)."""
    df = load_raman() if df is None else df
    wn = [c for c in df.columns if c.startswith("wn_")]
    return df[wn].to_numpy(float), df[label].to_numpy(float), wn


# --- leakage-aware splits ---------------------------------------------------
def temporal_split(df: pd.DataFrame, ts_col: str = "ts", frac_train: float = 0.7):
    """First `frac_train` of the timeline trains; the later tail tests
    (honest for a SINGLE batch — the model must extrapolate forward in time)."""
    order = df.sort_values(ts_col)
    cut = int(len(order) * frac_train)
    return order.iloc[:cut], order.iloc[cut:]


def random_split(df: pd.DataFrame, frac_train: float = 0.7, seed: int = 2026):
    """A deliberately leaky row-split, kept ONLY to demonstrate the inflated metric."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    cut = int(len(df) * frac_train)
    return df.iloc[idx[:cut]], df.iloc[idx[cut:]]


def batch_split(df: pd.DataFrame, batch_col: str, test_batches):
    """Hold out WHOLE batches — the only split that estimates performance on a
    genuinely unseen run. Every row of a test batch is wholly on the test side."""
    test_batches = set(test_batches)
    is_test = df[batch_col].isin(test_batches)
    return df[~is_test], df[is_test]

The three functions are a working illustration of the taxonomy above: random_split is the cardinal sin in code form (and is kept only so the demo can show how badly it lies), temporal_split is the answer when you genuinely have one batch and must extrapolate forward, and batch_split is the answer the moment you have more than one. The split-aware preprocessing discipline lives where it belongs — inside each model's scikit-learn Pipeline, fit fold by fold — so the loader never has to learn anything from the data it merely serves.
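
One cheap guard is to check which batches sit on both sides of a split. The sketch below is self-contained: it rebuilds the row-wise and whole-batch recipes inline on a made-up frame rather than importing dataio, and shared_batches is a hypothetical helper, not part of the module.

```python
import numpy as np
import pandas as pd

def shared_batches(train: pd.DataFrame, test: pd.DataFrame, batch_col: str = "batch_id") -> set:
    """Batches present on BOTH sides; an honest cross-batch split returns an empty set."""
    return set(train[batch_col]) & set(test[batch_col])

# Toy frame standing in for a multi-batch table (4 batches x 6 rows, invented names).
df = pd.DataFrame({
    "batch_id": np.repeat(["BATCH-A", "BATCH-B", "BATCH-C", "BATCH-D"], 6),
    "ts": np.tile(np.arange(6), 4),
    "titer_g_L": np.linspace(0.1, 2.4, 24),
})

# Row-wise shuffle, same recipe as random_split: 16 rows train, 8 rows test.
idx = np.random.default_rng(2026).permutation(len(df))
cut = int(len(df) * 0.7)
row_train, row_test = df.iloc[idx[:cut]], df.iloc[idx[cut:]]

# Whole-batch hold-out, same recipe as batch_split.
is_test = df["batch_id"].isin({"BATCH-D"})
grp_train, grp_test = df[~is_test], df[is_test]

row_shared = shared_batches(row_train, row_test)   # some batch straddles the cut
grp_shared = shared_batches(grp_train, grp_test)   # BATCH-D is wholly held out
print(row_shared, grp_shared)
```

With 8 test rows and 6 rows per batch, the row-wise test set cannot be made of whole batches, so at least one batch always straddles the boundary in this toy setup.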

Running the live soft sensor on the golden batch reports a very high score — and the discipline is in reading why:

$ python -m soft_sensor_pls
PLS soft sensor (titer from SNV+SavGol Raman): R2=0.9944 RMSE=0.127 g/L
  n_components=5 (inner 5-fold CV, one-SE rule; inner-CV R2=0.9905), 701 wavenumbers, 235 train / 101 test, 702 coefficients
  VIP > 1 on 389 wavenumbers; top bands: 1274cm-1 (VIP 1.41), 1276cm-1 (VIP 1.4), 1272cm-1 (VIP 1.4) ...
  applicability domain (train 99th pct): T2<=17.164, SPE<=503.542 | held-out spectra in-domain=99%, corrupted spectrum flagged=True
ASSERT ok: R2 > 0.85, VIP names the bands, and the AD gate flags an out-of-domain spectrum while passing genuine ones.

The module fits the same SNV-plus-Savitzky-Golay preprocessing this chapter teaches and lets an inner 5-fold cross-validation pick n_components (the number of latent factors the PLS model keeps) by a one-standard-error rule, which prefers the simplest model within one standard error of the best (here 5) rather than a hardcoded count. It then holds out a random slice of the golden batch's hourly spectra, so the 0.9944 is an interpolation score: how well the sensor tracks titer within one run, between bench samples. (The companion RMSE=0.127 g/L is the root mean square error, the model's typical miss expressed in the target's own units, here grams per litre. VIP is variable importance in projection, a per-wavenumber score that flags which Raman bands drive the PLS prediction; bands with VIP above 1 are the influential ones.) That number is high for the right reason on this internally consistent simulated batch, because the spectra carry the titer signal cleanly and the SNV transform is per-spectrum and leak-free, but it is emphatically not a measure of cross-batch generalization. The honest cross-batch question (would this calibration survive a new batch?) is exactly what the leakage taxonomy above warns about, and what the multi-batch calibrations of tech transfer and the drift chapter actually test. In real Raman, the gap between a within-batch interpolation score and genuine cross-batch performance is typically large, and a model that looked excellent within one batch has been known to fail outright on the first new one.
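
The one-standard-error rule is simple enough to sketch directly. The function below is an illustrative version, not necessarily how soft_sensor_pls implements it, and the inner-CV errors are invented numbers.

```python
import numpy as np

def one_se_choice(fold_errors: np.ndarray) -> int:
    """Smallest model whose mean CV error is within one standard error of the best.
    `fold_errors` has shape (n_candidates, n_folds); row i is the per-fold error
    for i+1 components. Returns the chosen component count."""
    mean = fold_errors.mean(axis=1)
    se = fold_errors.std(axis=1, ddof=1) / np.sqrt(fold_errors.shape[1])
    best = int(np.argmin(mean))
    threshold = mean[best] + se[best]
    return int(np.argmax(mean <= threshold)) + 1   # first candidate under the threshold

# Made-up inner-CV RMSE for 1..6 components over 5 folds.
cv_rmse = np.array([
    [0.90, 0.95, 0.85, 0.92, 0.88],
    [0.50, 0.55, 0.45, 0.52, 0.48],
    [0.30, 0.34, 0.26, 0.32, 0.28],
    [0.21, 0.25, 0.17, 0.23, 0.19],
    [0.20, 0.24, 0.16, 0.22, 0.18],   # lowest mean error
    [0.21, 0.24, 0.18, 0.22, 0.20],
])
chosen = one_se_choice(cv_rmse)
print(chosen)
```

In this toy table the lowest mean error is at 5 components, but 4 components is within one standard error of it, so the rule picks the simpler model.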

The applicability-domain line of the output (the one just above the final ASSERT) reports a safety gate, the applicability domain (AD): it asks whether a new spectrum even resembles the ones the model was trained on. Two distances measure that: T² (Hotelling's T-squared, how far the spectrum sits from the centre of the training data inside the model's latent factors) and SPE (squared prediction error, how much of the spectrum the model cannot explain at all). If either exceeds the threshold the training set established (here the 99th percentile, T2<=17.164, SPE<=503.542), the input is flagged as out-of-domain and the sensor declines to trust its own prediction. That is exactly why the deliberately corrupted spectrum in the check is flagged=True while genuine held-out spectra pass.
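
The two distances can be sketched in a few lines of NumPy. As an assumption for brevity, this version uses a PCA latent space (via SVD) as a stand-in for the PLS scores the module uses; the toy spectra, the helper names, and the corruption are all invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "spectra": 200 training rows, 50 channels, driven by 2 latent factors.
loadings_true = rng.normal(size=(2, 50))
X_train = rng.normal(size=(200, 2)) @ loadings_true + 0.05 * rng.normal(size=(200, 50))

# Fit a 2-component PCA on the training data only.
mu = X_train.mean(axis=0)
_, s, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
P = Vt[:2].T                                    # loadings, shape (50, 2)
score_var = (s[:2] ** 2) / (len(X_train) - 1)   # variance of each training score

def t2_spe(X):
    """Hotelling's T2 (distance inside the latent space) and SPE (residual outside it)."""
    Xc = np.atleast_2d(X) - mu
    T = Xc @ P
    t2 = np.sum(T ** 2 / score_var, axis=1)
    spe = np.sum((Xc - T @ P.T) ** 2, axis=1)
    return t2, spe

# Thresholds: 99th percentile of each distance on the training set.
t2_train, spe_train = t2_spe(X_train)
t2_lim, spe_lim = np.percentile(t2_train, 99), np.percentile(spe_train, 99)

def in_domain(X):
    t2, spe = t2_spe(X)
    return (t2 <= t2_lim) & (spe <= spe_lim)

corrupted = X_train[0] + 5.0 * rng.normal(size=50)   # heavy noise on every channel
train_pass_rate = in_domain(X_train).mean()
corrupted_ok = bool(in_domain(corrupted)[0])
print(train_pass_rate, corrupted_ok)
```

T² catches a spectrum that is unusual but still explainable by the latent factors; SPE catches one the factors cannot reconstruct at all, which is what the noisy example triggers.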

Anatomy of one training example

Every model in this book consumes the same atomic unit: a single, contextualized, leak-aware training example. Pull one apart and the whole chapter is laid out as fields. The example below is one aligned timepoint from the golden run — a Raman spectrum, the online state at that instant, the batch-time index, the offline reference titer it is paired to, and the group key that keeps it leak-free.

One training example, fully unpacked: the dense cheap features (Raman plus aligned online state), the SNV-preprocessed spectrum, the sparse expensive target from the LIMS, the batch-id group key that alone decides which side of the split it falls on, and the provenance and ALCOA+ attributes that make it trustworthy enough to learn from under GxP. Original diagram by the authors, created with AI assistance.

Read the card top to bottom and the argument is complete.

The features are the cheap, fast signal: a 701-channel spectrum and the aligned online state (36.51 °C, pH 7.04, dissolved oxygen, agitation), available continuously at hundreds to thousands of rows per batch (336 spectra and 20,160 minute-cadence state rows in the golden batch). They cost almost nothing and you can always collect more.
The SNV-preprocessed row is the model-ready transform of the spectrum, computed per-spectrum so it carries no leakage: this exact row would be identical whether it landed in train or test, which is the property that makes spectral preprocessing safe outside the split.
The green target is the slow, expensive truth: an offline titer that exists only because a sample was pulled at batch-hour 168 and run on a bench — one of just 28 such samples in the whole 14-day batch. This is the scarce resource the whole chapter is organized around.
The cyan group key — batch_id = BATCH-2026-001 — looks like mere metadata but is the most powerful field on the card: it alone decides whether this example trains the model or tests it, and getting it wrong is the row-wise leakage above. It is the field every split helper in dataio.py keys on.
The violet provenance block is what makes the example admissible at all under GxP: the historian tag IRIs (globally unique, web-style identifiers — Book 4's identifiers-and-units) and LIMS sample_id that tie it to its sources, the dataset hash from MANIFEST.sha256 (a fingerprint of the exact bytes) that pins exactly which data this is, and the ALCOA+ attributes (attributable, contemporaneous, original) that make it data integrity rather than just data.

The card also makes the cold-start asymmetry visible at a glance: every field in the feature block recurs hundreds to thousands of times per batch, while the single green target recurs 28 times. The next section names why that asymmetry, not the algorithm, is the constraint that shapes the rest of the book.
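
The claim that an SNV-preprocessed row is identical on either side of the split can be checked directly. This is a minimal sketch on random toy spectra; the sample standard deviation (ddof=1) is one common convention and is an assumption here.

```python
import numpy as np

def snv(spectra: np.ndarray) -> np.ndarray:
    """Standard Normal Variate: centre and scale EACH spectrum by its own mean and std."""
    spectra = np.atleast_2d(spectra).astype(float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=5.0, size=(6, 701))   # 6 toy spectra x 701 channels

# Transforming a spectrum alone or alongside other rows gives the same result,
# so it does not matter where its neighbours land in the split.
same = np.allclose(snv(X[0]), snv(X)[0])

# Contrast: column-wise standardisation depends on which rows are present.
col_all = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
col_sub = (X[:3] - X[:3].mean(axis=0)) / X[:3].std(axis=0, ddof=1)
depends_on_rows = not np.allclose(col_all[0], col_sub[0])
print(same, depends_on_rows)
```

The column-wise version is the kind of learned transform that has to live inside the cross-validation loop; the row-wise one does not.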

The cold-start reality: dense features, sparse truth

Now we name the constraint that shapes every later chapter. In our datasets the asymmetry is exact: the Raman probe and online tags produce hundreds to thousands of rows per batch (the golden batch alone is 20,160 rows of minute-cadence state and 336 spectra), while the offline reference — the only ground truth for titer, metabolites, and viability — is sampled twice a day, 28 times across a 14-day batch, and the release CQAs exactly once, at the end. The bioprocess ML literature calls this the cold-start problem: living systems are observed sparsely and at low dimension by the offline reference, run-to-run variability "severely compromises transferability," and models decay fast in production [4] (research/consensus).

That run-to-run variability is not just narrative. The named mechanistic sources, namely the raw-material (media) lot, the cell passage number, and the operator or shift, are now carried as recorded per-batch metadata in the simulator-backed cohort, so a model can either condition on them (treat the lot as a feature) or stratify by them (a leave-one-lot-out evaluation, or a lot fixed effect). The seed-train lineage itself is one of those sources. The genealogy WCB-CHO-001 → SEED-001 → the production batch starts from the working cell bank (WCB) of the CHO (Chinese hamster ovary) cell line, the industry-standard mammalian host that produces the antibody here, and expands it through a seed train of successive subcultures into the production batch. That lineage carries inoculum age and passage history per batch (the passage number is how many times the cells have been subcultured; the fixture below encodes it as passage 7–39). Inoculum state is a major variability source for the CHO line, distinct from media lot, and so another stratum the validation can hold out or the model can condition on. The fixtures the later chapters draw from make the offsets concrete:

cohort: 60 batches; OOS=10 (17%); titer 2.60-10.48 g/L; HCP 19-123 ng/mg
  mean titer by raw-material lot: LOT-A 5.63, LOT-B 4.56, LOT-C 5.39  | passage 7-39

A raw-material lot visibly shifts mean titer (LOT-B at 4.56 g/L against LOT-A 5.63 and LOT-C 5.39) — exactly the kind of systematic offset a media-lot change introduces, and exactly the kind of input-distribution shift — a change in the statistics of the model's inputs from those it was trained on — that a drift detector's PSI (Population Stability Index, a standard measure of how far a distribution has moved) is built to catch. Recording the lot turns a hidden confounder into either a feature the model can use or a stratum the validation can hold out.
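
A Population Stability Index can be sketched in a few lines. This is an illustrative version, not the drift chapter's detector: it bins by quantiles of the reference sample, and the two toy populations differ in mean by about one unit, loosely echoing the LOT-A versus LOT-B offset. The spread of 1.0 and the sample sizes are assumptions.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), where e and a are the bin
    fractions of the reference and new samples. Bins are reference quantiles;
    eps guards against empty bins."""
    inner = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    e = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_bins) / len(expected)
    a = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_bins) / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
reference = rng.normal(5.63, 1.0, size=2000)     # toy reference population
same_process = rng.normal(5.63, 1.0, size=2000)  # a fresh draw from the same process
shifted = rng.normal(4.56, 1.0, size=2000)       # mean moved by about one unit

psi_same = psi(reference, same_process)
psi_shift = psi(reference, shifted)
print(round(psi_same, 3), round(psi_shift, 3))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a material shift, though any threshold should be set against the process's own history rather than taken on faith.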

The largest such shift is not a lot change at all but a scale or site change. A calibration trained on a bench or pilot bioreactor faces a different probe geometry, a different mixing and mass-transfer regime, and often a different historian and tag dictionary when the process is transferred to a manufacturing-scale vessel — so a soft sensor that was honest at one scale can be systematically biased at the next, even with the cell line and media held fixed. This is technology transfer and scale-up, and the bioprocess version of the cold-start problem returns with it: every new scale or site is, in effect, a cohort of brand-new batches whose distribution the model has never seen. The tech-transfer chapter treats this as its own validation question (would this calibration survive the move?), and the practical guard is the same lineage and metadata discipline as for media lots — record the scale, vessel, and site as per-batch metadata so the validation can hold a whole scale out, exactly as a leave-one-lot-out evaluation holds out a media lot. Two further manufacturing facts shape what the data can even contain: the signals only exist because the instruments and the vessel passed equipment qualification (the IQ/OQ/PQ evidence that a probe and a bioreactor install, operate, and perform as specified — a calibrated probe is the precondition for a trustworthy Good quality flag), and the single release CQA per batch only exists at the end of a batch-release workflow in which Quality reviews the record and dispositions the lot. The model's scarcest label is therefore not just expensive to measure — it is the output of a regulated human decision, which is one more reason it arrives slowly and cannot be manufactured on demand.

This has three hard consequences that the rest of the book keeps colliding with:

Targets are the scarce resource, not features. You can always collect more spectra; you cannot easily collect more titers. Every method that wins in bioprocess — hybrid models, transfer learning, Bayesian priors, semi-supervised learning — is, at bottom, a way to spend fewer expensive labels. This is the through-line that explains why pure end-to-end deep learning so often stalls here [4]: a network with thousands of weights has dozens of labels to learn them from.
Drift is detected late. Because ground truth arrives twice a day, a soft sensor that began drifting at breakfast is not provably wrong until the evening sample comes back. The drift flag is, by construction, a lagging indicator — a problem Book 2's soft-sensor chapter names and Chapter 22 of this book has to engineer around. The whole point of a soft sensor is to fill the gap between offline samples, which means the very interval where it is most useful is the interval where you cannot yet tell whether to trust it.
Data volume is not information. A million Raman points from one batch is still one batch's worth of information about how a process behaves run-to-run. Resampling a tag from 1-second to 1-minute cadence loses almost nothing, because neighbouring seconds are near-duplicates. The quantity that actually limits a batch-level model is the number of independent batches, and that grows by ones, slowly, at the cost of weeks and a fortune each. Confusing rows with information is the same error as the row-wise split, wearing different clothes — it is why a model can report 0.99 on 20,000 correlated rows and still know almost nothing.
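
The "rows are not information" point can be put into rough numbers. One back-of-envelope, assuming the rows behave like a first-order autoregressive series with lag-1 correlation rho, is the effective sample size n(1 - rho)/(1 + rho). Both the AR(1) model and the value of rho below are assumptions for illustration, not measurements from the datasets.

```python
def effective_n(n_rows: int, rho: float) -> float:
    """Approximate number of independent observations in an AR(1) series of
    n_rows points with lag-1 autocorrelation rho (large-n approximation)."""
    return n_rows * (1.0 - rho) / (1.0 + rho)

rows_per_batch = 14 * 24 * 60          # minute cadence over a 14-day batch
n_eff = effective_n(rows_per_batch, rho=0.999)   # rho is an illustrative guess
print(rows_per_batch, round(n_eff, 1))
```

Under that assumed correlation, some twenty thousand rows carry the information of about ten independent points, and all of them still describe a single batch.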

Data integrity: you cannot learn from data you cannot trust

There is a final readiness gate that sits above all the engineering: under GxP (the "Good Practice" regulations, chiefly GMP, that govern how medicines are made), a model is only as trustworthy as the data it learned from, and "trustworthy" has a formal definition. ALCOA+ is the data-integrity standard regulators expect — data must be Attributable, Legible, Contemporaneous, Original, and Accurate, plus Complete, Consistent, Enduring, and Available [5]. Every one of those attributes is a precondition for honest learning, not a paperwork afterthought:

Attributable and Original — you must know which instrument, which batch, which operator produced each value, or you cannot reconstruct the provenance a regulator (or a debugging engineer) will ask for. This is the provenance block on the anatomy card; it is also what lets you exclude a known-bad probe stretch instead of silently training on sensor failure.
Contemporaneous — a value recorded at the time it occurred preserves the alignment that feature engineering depends on; a value back-keyed from a paper form has already lost its true clock, and the alignment step above will quietly join it to the wrong spectrum.
Complete and Consistent — silently dropped rows or quietly reconciled outliers bias a model in ways no validation set will catch, because the bias is baked into both train and test. A "cleaned" dataset where someone deleted the inconvenient excursions is a dataset that has learned to ignore exactly the events you most need to predict.

The regulatory frameworks Book 5 returns to repeatedly — the FDA's 2023 discussion paper and 2025 credibility framework, the draft EU/PIC/S GMP Annex 22, and the ISPE GAMP AI guide — all treat data integrity and data management as the first pillar of a credible model, before any discussion of algorithms or validation [6][7]. The consistent expectation is a model locked at validation with a predetermined change-control plan, riding on ALCOA+ data throughout its lifecycle [7]. The practical reading for this chapter: a soft sensor trained on a hybrid paper/digital dataset with no attributable provenance is not merely lower-quality — it is, for a GMP decision, inadmissible, no matter how good its R² looks. This is why the example suite's run harness (run_all.py) pins each dataset by its MANIFEST.sha256 content hash and records it alongside every model's result: a model that cannot say exactly which bytes it trained on cannot meet Original or Complete, and a model that cannot meet them cannot be put in front of an inspector.
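
Pinning a dataset by content hash needs only the standard library. The sketch below is not the run_all.py implementation, which this chapter does not show. It assumes a manifest with one "digest, two spaces, filename" pair per line (the layout the sha256sum tool writes), and the demo file and its contents are invented.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest: Path) -> dict:
    """Map each listed filename to True if its current hash matches the manifest.
    Assumed format: '<hex digest>  <relative filename>' per line."""
    results = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        results[name] = sha256_of(manifest.parent / name) == digest
    return results

# Self-contained demo in a temporary directory (file name and contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "toy_results.csv"
    data.write_bytes(b"batch_id,titer_g_L\nBATCH-X,5.0\n")
    manifest = Path(tmp) / "MANIFEST.sha256"
    manifest.write_text(f"{sha256_of(data)}  {data.name}\n")
    before = verify_manifest(manifest)
    data.write_bytes(b"batch_id,titer_g_L\nBATCH-X,5.1\n")   # one edited character
    after = verify_manifest(manifest)
print(before, after)
```

A single changed character flips the check, which is the property that lets a model result be tied to exactly the bytes it was trained on.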

What makes that provenance legally sufficient, not just technically tidy, is the electronic-records machinery the rest of the plant already runs on. Under 21 CFR Part 11 in the US and EU GMP Annex 11 in Europe — the rules that govern electronic records and signatures — every value the model learned from must carry an attributable electronic signature (the named human, the timestamp, and the meaning of the signature: authored, reviewed, approved) and an immutable audit trail (a secure, time-stamped log of who changed what, when, and why). This is not a parallel obligation to the ontology work — it is the same obligation in another layer: the release shape's sh:minCount 1 on bp:approvedBy is the SHACL spelling of "an unsigned release is no release," and the MANIFEST.sha256 hash is the content-integrity counterpart of the audit trail's tamper-evidence. A training set assembled from records with a reviewed audit trail and a content hash satisfies Original, Complete, and Attributable at once; a dataset silently exported from a spreadsheet does not, and no amount of model accuracy rescues it.

Validating that machinery has itself shifted in posture, and the shift matters for ML. The move from CSV (computerized system validation — exhaustive, document-everything proof) to CSA (computer software assurance — risk-based critical thinking about real risk, the stance of FDA's 2025 guidance) is exactly the lens an ML pipeline needs: concentrate the heavy validation effort — full audit-trail review, dual sign-off, formal qualification — on the high-criticality path (the soft sensor whose output could influence a release decision) and apply a lighter, sampling-based check to the convenience dashboard built on the same data. Criticality is a property of what the prediction decides, not of the algorithm, which is why a titer soft sensor feeding a control loop and an exploratory clustering of the same spectra sit at opposite ends of the validation effort. The same risk-proportionate logic the plant applies to validating systems is the logic an ML lifecycle applies to validating models — and it is what keeps "validate everything to the same depth" (which never finishes) from becoming "validate nothing rigorously" (which never ships).

What this chapter adds to the model suite

This chapter contributes the foundation module of Book 5's example suite:

examples/platform/ml/dataio.py — the shared data layer. It loads the committed datasets (raman_spectra.parquet, fedbatch_state.parquet, offline_assays.csv, hplc_results.csv) into model-ready matrices via raman_xy() while load_raman() keeps the batch_id group key on the returned frame; it provides the full leakage taxonomy as code — the deliberately leaky random_split, the within-batch temporal_split, and the whole-batch batch_split — while the run harness (run_all.py) records each dataset's MANIFEST.sha256 content hash so every downstream model's result is pinned to exactly which bytes it trained on. Every later chapter — the soft sensor, the clone ranker, the OOS predictor — imports dataio rather than re-reading parquet, so the leak-free split is the default, never an opt-in. The module turns "don't leak" from a discipline you must remember into the only path the API makes easy.

Why it matters

Every later chapter in this book — every soft sensor, every clone ranker, every hybrid twin, every OOS predictor — stands on the work of this one. A model is a function from data to a decision, and if the data is siloed it cannot be assembled, if it is non-FAIR it cannot be joined, if it is split row-wise its reported skill is fiction, and if it fails ALCOA+ it is inadmissible. The most common reason a promising bioprocess ML project dies is not that the chosen algorithm was wrong; it is that the data was never ready, or the validation was never honest. Getting the fuel right — readiness, contextualization, leak-free features, and a clear-eyed view of the cold-start cadence — is the unglamorous majority of the work and the part that decides whether anything downstream is real.

In the real world

The data-readiness barrier is the most consistently reported finding in the field. Surveys put silo access and non-standardized formats at the top of the blocker list (vendor/self-reported and industry consensus) [1][2], and vendor "AI-ready data" layers — TetraScience's Tetra OS and similar fabrics — exist precisely to monetize the gap (vendor/self-reported). Production-grade spectroscopic soft sensing, the strongest upstream ML deployment, depends entirely on the preprocessing-and-calibration discipline described here: in-line Raman with baseline correction, SNV/Savitzky-Golay derivative preprocessing, and PLS chemometrics is established practice for glucose, lactate, and titer, up to closed-loop glucose control in CHO culture [8] (production). And the splitting discipline is not academic hygiene — the bioprocess ML reviews single out data leakage from improper validation and the small-data / cold-start regime as the two technical reasons reported successes so often fail to reproduce or transfer [4] (research/consensus). The unglamorous verdict from every quarter is the same: fix the data first, and report a number you can defend.

Key terms

Data readiness — the state in which data is accessible, joined, well-described, and trustworthy enough to train a model; barrier number one in bioprocess ML.
FAIR — Findable, Accessible, Interoperable, Reusable: the four principles for data that machines can locate and use.
Historian / data shadow — the time-series database of dense online signals (temperature, pH, DO, Raman); the model's cheap, fast features.
Offline reference — the slow, expensive bench/LIMS measurement (titer, metabolites, release CQAs); the model's scarce ground truth.
Soft sensor — a software model that predicts a slow, expensive measurement (e.g. titer) in real time from cheap online signals (e.g. a Raman spectrum); the central model object of this book.
Titer — the concentration of the product (the monoclonal antibody) in the broth, in g/L; the chapter's primary prediction target.
Raman / NIR spectroscopy — in-line optical probes that read a chemical fingerprint (intensity versus wavenumber) of the broth; the dense, cheap feature streams.
R² (coefficient of determination) — a goodness-of-fit score, usually between 0 and 1 (1 = perfect fit); the chapter's main regression metric, and the number a leaky split inflates.
Contextualization — attaching identity (batch, tag, unit, timestamp, quality flag) to a raw value so it can be joined across systems and become a feature.
Alignment / resampling — expressing all signals on a common clock (usually batch time) and a common cadence so they can be joined and fed to a model.
Dynamic time warping (DTW) — nonlinear stretching of one batch's time axis onto another's so that biological phases (lag, exponential) line up before trajectories are compared.
Batch-wise unfolding (MPCA) — flattening the (batch × variable × time) cube into one row per whole batch, so PCA/PLS can run on it for batch-level prediction.
2D fluorescence (EEM) — excitation–emission matrix spectroscopy; scans excitation × emission wavelengths to read native fluorescence (tryptophan, NAD(P)H, flavins) as a multi-way soft-sensor input, complementary to Raman and the source of the fluorescence baseline Raman preprocessing removes.
Baseline correction — removing the slow background (e.g. fluorescence in Raman) that rides under the chemical peaks.
SNV (Standard Normal Variate) — per-spectrum centring and scaling that removes multiplicative scatter; leak-free because it is computed row by row.
Savitzky-Golay derivative — sliding-window polynomial smoothing-plus-derivative that removes spectral baseline and sharpens peaks.
Data leakage — information from the test set reaching the model during training; the taxonomy is the row-wise split, temporal leakage, target leakage, and scaling/preprocessing fit before the split.
Cross-validation — rotating which slice (fold) of the data is held out for testing and averaging the scores, to use scarce data efficiently; in bioprocess the folds must be grouped by batch.
Batch-grouped split — putting every row of a batch wholly into train or test, so held-out batches are truly unseen; the correct split, and the GroupKFold / leave-one-batch-out family of cross-validation.
Cold start — the once-or-twice-a-day cadence of offline reference data that limits how fast a model can learn and how late drift is detected.
ALCOA+ — Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available: the data-integrity standard a learnable dataset must meet under GxP.
GxP / GMP — the family of "Good Practice" regulations (GMP = Good Manufacturing Practice) that govern how medicines are made; data and models used for GMP decisions must meet ALCOA+ data integrity.
Ontology / IRI — a machine-readable shared vocabulary that fixes what each thing is and how things relate; a feature pulled by its IRI (a global, web-style name) is robust where a string column name is fragile.
SHACL — the Shapes Constraint Language; the closed-world release-gate shape that requires every CQA present and in range is also a training-data completeness check.
Continuant / occurrent — the upper-ontology cut that keeps a measured value (a thing that persists) distinct from the run (a process in time) it was measured during; the type discipline that blocks treating a per-run label as a per-timepoint feature.
21 CFR Part 11 / EU Annex 11 — the electronic-records-and-signatures rules requiring an attributable signature and an immutable audit trail on every value a GMP-relevant model learns from.
CSV → CSA — the shift from exhaustive computerized-system validation to risk-based computer software assurance; concentrate validation effort on the high-criticality model path, sample the rest.
Technology transfer / scale-up — moving a process and its calibration to a new scale or site; the largest distribution shift a soft sensor faces, and a cohort of unseen batches the validation should hold out.

Where this leads

We have the fuel: data that is ready, contextualized, turned into leak-free features, and honestly split. The next chapter, Models and Validation: From PLS to Transformers, Under GxP, takes that fuel and chooses an engine — walking the ladder from partial least squares through tree ensembles to neural networks, asking at each rung how much data it needs and how it must be validated to be trusted with a decision about a medicine. The honest split we built here is exactly the harness that makes those validation numbers mean something.
