
Data, the Fuel: Readiness, Features, and the Cold-Start Reality

📍 Where we are: Part I · Foundations of Learning in Bioprocess — Chapter 2. The last chapter argued that bioprocess breaks the data-science rulebook because the data is small, slow, and alive. This chapter goes one level down: before any model can be trained, the data has to exist in a usable shape — and in most plants it does not. Data readiness, not algorithm choice, is barrier number one.

Every survey of machine learning in biomanufacturing reaches the same unglamorous conclusion: the bottleneck is rarely the model. It is the data — its accessibility, its shape, its provenance, and above all its scarcity at the exact moments learning needs it most. A team can have scikit-learn, PyTorch, a GPU cluster, and a stack of review papers, and still be unable to train anything useful, because the batch records live half on paper, the historian and the LIMS do not agree on a batch ID, and the only ground-truth measurement of the thing they want to predict arrives twice a day from a bench assay. This chapter is about turning a real, messy bioprocess data estate into the fuel a model can actually burn.

We will be concrete throughout. The running example is the same one the whole series uses — the golden run BATCH-2026-001 and its siblings, the Raman spectra and offline assays the simulator produced, and the release results in examples/datasets/hplc_results.csv. By the end you will have a defensible answer to the single most consequential question in applied bioprocess ML: how do you split your data so that the score you report is the score you will actually get in production?

The simple version

Imagine you want to teach someone to judge whether a cake is done by smell alone. You bake a hundred cakes, sniff each one at many moments, and write down "done / not done" — but the only true answer comes from a thermometer you can use twice per cake. That is bioprocess data: a flood of cheap, fast signals (the smells, every minute) and a trickle of slow, expensive truth (the thermometer, twice a batch). Two mistakes ruin the lesson. First, if your hundred cakes are scattered across sticky notes, a notebook, and three different ovens that label time differently, you can't even line the smells up with the thermometer readings — that's the data-readiness problem. Second, if you let the student practice on whiffs from the same cake they'll be tested on, they'll ace the test and fail on the next cake — that's data leakage, and the fix is to keep whole cakes — whole batches — entirely on one side of the line.

What this chapter covers

  • Data readiness as barrier number one: silos, non-FAIR data, the paper/digital hybrid, and what "AI-ready" actually demands.
  • Where the fuel lives: the historian / data shadow (the dense online signals) and contextualization (the metadata that makes a number mean something).
  • Feature engineering for batches: time alignment, resampling, batch-wise unfolding, and spectral preprocessing (SNV and Savitzky-Golay derivatives).
  • Splitting without leakage: why you split by batch, never by row — the single most common and most damaging mistake in bioprocess ML.
  • The cold-start cadence: offline reference measured once or twice a day, and why "more data" does not always mean "more information."
  • Data integrity (ALCOA+): why a model trained on records that cannot be trusted is itself untrustworthy under GxP.
  • The module this chapter contributes: examples/platform/ml/dataio.py, the shared loaders and batch-aware split helpers every later chapter imports.

Data readiness is barrier number one

When practitioners are surveyed about what blocks AI in pharmaceutical manufacturing, the answer is overwhelmingly data, not models. A widely cited 2024 industry study found that roughly 70% of organizations struggle to access the data they need for AI because it is locked in silos, and only about 39% use standardized formats or ontologies to describe it [1] (vendor/self-reported). The picture the bioprocess literature paints is consistent: data stays "siloed, fragmented, and underutilized," made worse by hybrid paper-and-digital records where a batch's truth is split between a validated MES, a scientist's notebook, an instrument's local drive, and a PDF (research/consensus) [2].

Three failure modes recur:

  • Silos. The historian holds the online tags; the LIMS holds the assays; the MES holds the batch record; the chromatography skid holds its own chromatograms. Each is internally fine and mutually unreachable. A model needs all four joined on one batch identity, and the join is often the hardest engineering in the project.
  • Non-FAIR data. Even when you can reach a dataset, it may not be Findable, Accessible, Interoperable, or Reusable — the four FAIR principles [3]. A spreadsheet named final_FINAL_v3.xlsx with units only in a human-readable header, no schema, and no link to the batch it describes is technically accessible and practically useless to an automated pipeline.
  • The paper/digital hybrid. A measurement written on a form and later keyed into a system loses its native timestamp resolution, its instrument metadata, and frequently its uncertainty. You cannot reconstruct what was never digitally captured.

"AI-ready data" is the proposed remedy, and the industry's own framing — BioPhorum's "data as a product" workstream is the prominent example — treats a dataset as something with an owner, a schema, a quality contract, and a consumer, rather than a byproduct of running a plant [2] (industry consensus). The deep point for this book is that Books 2, 3, and 4 of this series are, between them, the data-readiness solution. Book 2 builds the governed data point and the historian; Book 3 builds the open-source stack that captures, contextualizes, and stores it; Book 4 models the whole spine as a knowledge graph so a machine can find and join it. Book 5 assumes that work is done — and this chapter is where we cash the assumption in.

Where the fuel lives: the historian and contextualization

Bioprocess ML draws on two kinds of data that could not be more different in shape, and confusing them is a common early error.

The dense online stream lives in the historian: the time-series database that records every probe on every vessel, sampled every few seconds, forever. In Book 2 this is the data shadow, the digital trace that runs alongside the physical batch. In our datasets it is fedbatch_timeseries.parquet and the minute-cadence fedbatch_state.parquet (temperature, pH, dissolved oxygen, agitation, feed events), plus the in-line raman_spectra.parquet: 701 intensity channels (wn_400 to wn_1800) read alongside the kinetic state. These signals are cheap and fast: thousands of points per batch per tag. They are the model's inputs, the features.

The sparse offline reference lives in the LIMS and the bench: offline_assays.csv (viable cell density, glucose, lactate, glutamine, ammonia, osmolality, titer, offline pH) and the release results in hplc_results.csv. These are expensive and slow: in our simulated campaign each batch has exactly 28 offline samples over 14 days — two per day — and a single end-of-process release Certificate of Analysis. They are the model's targets, the ground truth, and there are vanishingly few of them.

A raw historian value, though, is not yet a feature. The number 36.51 means nothing until you know it is BR101.Temp.PV in degrees Celsius, with Good quality, at a specific timestamp, belonging to BATCH-2026-001 during its production phase. Attaching that identity is contextualization — the discipline Book 3 makes literal in its contextualization layer, and which Book 2 traces through the lifecycle of a data point. Contextualization is what makes the cross-system join possible: it is the reason a Raman spectrum row and an offline titer row and a release verdict can all be keyed to the same batch_id and the same point in batch time. Without it, you have four silos; with it, you have a dataset. Every loader in this book's example suite assumes contextualized data — a batch_id column and a comparable timestamp on every table — because that assumption is exactly what the instruments-and-sensors and contextualization chapters earn.
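The contextualization assumption can be checked mechanically before any join is attempted. The sketch below is a minimal illustration and not part of dataio.py: the helper name check_contextualized, the batch_hour column name, and the toy tables with their values are all hypothetical; only the batch_id key comes from the text.

```python
from __future__ import annotations

import pandas as pd

# batch_id is the key named in the chapter; batch_hour is an assumed column name.
REQUIRED_KEYS = ("batch_id", "batch_hour")


def check_contextualized(df: pd.DataFrame, name: str) -> list[str]:
    """Return human-readable problems; an empty list means the table is joinable."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in df.columns:
            problems.append(f"{name}: missing column '{key}'")
        elif df[key].isna().any():
            problems.append(f"{name}: null values in '{key}'")
    return problems


# Toy tables standing in for a historian extract and a bare LIMS export.
online = pd.DataFrame({
    "batch_id": ["BATCH-2026-001"] * 2,
    "batch_hour": [167.9, 168.0],
    "temp_C": [36.50, 36.51],
})
lims = pd.DataFrame({"sample": ["S-1"], "titer_g_L": [2.0]})  # no batch context

report = check_contextualized(online, "online") + check_contextualized(lims, "lims")
print(report)
```

The design choice is to return a list of problems rather than raise on the first one, so a readiness report can show every missing key across every table in one pass.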

Turning batches into features

A trained model wants a rectangular matrix: rows are examples, columns are features, and one column is the target. A bioprocess gives you nothing of the kind. It gives you irregular, multi-rate, multi-table time-series tied to a few sparse labels. Four steps bridge the gap.

Alignment: putting everything on one clock

The online tags, the Raman probe, and the bench assays each run on their own clock and their own cadence. The first job is alignment: expressing every measurement in a common time frame — usually batch time (hours since inoculation) rather than wall-clock time, so that day 7 of one batch lines up with day 7 of another. Alignment is also where you join across silos: an offline titer taken at batch-hour 168 must be matched to the Raman spectrum and the online state at that same batch-hour, so the model learns "this spectrum corresponds to this titer." Get the join wrong by even a few hours and you are teaching the model a lie.
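One common way to express this join in pandas is an as-of merge with an explicit tolerance, so that an assay with no spectrum close enough in batch time stays unpaired instead of silently borrowing a distant one. This is a sketch under stated assumptions: the tables, the batch_hour and spectrum_id columns, the titer values, and the half-hour tolerance are illustrative, not taken from the committed datasets.

```python
import pandas as pd

# Toy Raman index: four spectra per batch, keyed by batch time in hours.
raman = pd.DataFrame({
    "batch_id": ["BATCH-2026-001"] * 4 + ["BATCH-2026-002"] * 4,
    "batch_hour": [160.0, 166.0, 168.1, 174.0] * 2,
    "spectrum_id": list(range(8)),
})
raman["raman_hour"] = raman["batch_hour"]  # keep the spectrum's own clock visible

# Toy offline assays (values are placeholders, not real results).
assays = pd.DataFrame({
    "batch_id": ["BATCH-2026-001", "BATCH-2026-002", "BATCH-2026-002"],
    "batch_hour": [168.0, 168.0, 180.0],
    "titer_g_L": [2.1, 1.9, 2.3],
})

# merge_asof needs both frames sorted on the "on" key; "by" keeps the match
# inside one batch, and "tolerance" refuses pairs more than 0.5 h apart.
paired = pd.merge_asof(
    assays.sort_values("batch_hour"),
    raman.sort_values("batch_hour"),
    on="batch_hour",
    by="batch_id",
    direction="nearest",
    tolerance=0.5,
)

# The assay at hour 180 has no spectrum within tolerance, so it is dropped
# rather than mislabelled.
labelled = paired.dropna(subset=["spectrum_id"])
print(labelled[["batch_id", "batch_hour", "raman_hour", "titer_g_L"]])
```

Keeping the tolerance explicit turns "get the join wrong by a few hours" from a silent error into a visible count of unpaired assays.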

Resampling: choosing a cadence

Once aligned, signals must be brought to a shared cadence by resampling: averaging or interpolating onto a fixed grid (every 10 minutes, or every hour). The book's datasets ship a 10-minute grid (fedbatch_timeseries_10min.csv) for exactly this reason. Resampling is a judgment, not a formality: too fine and you carry noise and inflate the data volume without adding information; too coarse and you blur the transients (a feed bolus, the day-7 temperature excursion the simulator seeds) that often carry the signal. The honest default is the coarsest grid that still resolves the fastest event you actually need to see.
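The trade-off is easy to see on a toy signal. The sketch below builds a hypothetical one-minute temperature tag with a five-minute, one-degree excursion and averages it onto a 10-minute and a 60-minute grid; the tag name, values, and excursion are invented for illustration.

```python
import pandas as pd

# Hypothetical 1-minute tag for two hours of batch time, flat at 36.5 degrees C.
idx = pd.timedelta_range(start="0min", periods=120, freq="1min")
temp = pd.Series(36.5, index=idx, name="temp_C")
temp.iloc[60:65] = 37.5  # a 5-minute, +1.0 degree excursion starting at minute 60

# Mean-aggregate onto coarser grids.
grid_10min = temp.resample("10min").mean()
grid_60min = temp.resample("60min").mean()

# Peak deviation from baseline that survives at each cadence.
for label, series in [("1 min", temp), ("10 min", grid_10min), ("60 min", grid_60min)]:
    print(f"{label:>6}: {len(series):3d} rows, peak +{series.max() - 36.5:.3f} C")
```

At 10 minutes the excursion is still visible at half its height; at 60 minutes it shrinks to under a tenth of a degree and would be hard to tell from noise.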

Batch-wise unfolding: a whole batch as one row

For predicting a batch-level outcome (final titer, a release CQA, a pass/fail) you need each batch to be one row. The classical move, inherited from multiway PCA (the foundation of batch monitoring, covered in Book 3's analytics chapter), is the unfold: take the three-dimensional cube of (batch × variable × time) and flatten each batch into one long row, laying every variable-at-every-timepoint side by side as columns. A 14-day batch of seven tags at hourly cadence unfolds to a single row of 7 × 336 = 2,352 columns. Now "predict the CQA from the trajectory" is an ordinary supervised problem, with the brutal catch that you have one row per batch, so six campaign batches give you six training rows. This is the small-data wall from Chapter 1 in its rawest form, and it is why unfolded batch models lean so hard on dimensionality reduction (PLS, PCA) rather than deep networks.

For predicting a within-batch quantity that changes over time — titer right now from the Raman spectrum right now — you do not unfold. Each aligned timepoint is its own row, which is how the titer soft sensor gets hundreds of training examples from a handful of batches. Which framing you choose dictates everything downstream, including how you must split.
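Both framings are reshapes of the same cube, which a few lines of NumPy make concrete. This is a sketch on random numbers, assuming six batches, seven tags, and an hourly grid over 14 days; the column naming scheme is hypothetical.

```python
import numpy as np

n_batches, n_tags, n_hours = 6, 7, 14 * 24  # 336 hourly points per batch
rng = np.random.default_rng(0)
cube = rng.normal(size=(n_batches, n_tags, n_hours))  # (batch, variable, time)

# Batch-level framing: unfold each batch into one long row.
# C-order reshape puts tag v at hour t in column v * n_hours + t.
X_batch = cube.reshape(n_batches, n_tags * n_hours)
columns = [f"tag{v}_h{t:03d}" for v in range(n_tags) for t in range(n_hours)]

# Within-batch framing: one row per (batch, hour), one column per tag,
# plus the group key that a leak-free split will need.
X_rows = cube.transpose(0, 2, 1).reshape(n_batches * n_hours, n_tags)
groups = np.repeat(np.arange(n_batches), n_hours)

print(X_batch.shape)  # (6, 2352)
print(X_rows.shape)   # (2016, 7)
```

The first matrix has six examples and 2,352 features; the second has 2,016 examples and seven features, but still only six independent batches, which is why the groups array travels with it.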

Spectral preprocessing: SNV and derivatives

A raw Raman or near-infrared spectrum is not model-ready. It carries baseline drift (slow curvature from fluorescence and the instrument) and multiplicative scatter (the whole spectrum scaled up or down by probe fouling, bubbles, or path-length changes) that have nothing to do with the chemistry you want. Chemometrics has two standard, decades-old preprocessing steps, and almost every production spectroscopic soft sensor uses some combination of them:

  • Standard Normal Variate (SNV): for each spectrum independently, subtract its mean and divide by its standard deviation. This removes multiplicative scatter and per-spectrum offset, so two spectra of the same broth taken through a slightly fouled versus clean window become comparable.
  • Savitzky-Golay derivatives: fit a low-order polynomial in a sliding window and take its first or second derivative. This removes baseline (a first derivative kills a constant offset; a second derivative kills a linear slope) while smoothing noise, and it sharpens overlapping peaks. The window length and polynomial order are real hyperparameters — too wide a window smears peaks, too narrow amplifies noise.

The mathematics is modest but the discipline is not. Preprocessing is part of the model, and — as we are about to see for scaling — its parameters must be fit on the training data only. SNV is per-spectrum so it is safe, but any preprocessing that learns from the dataset as a whole (mean-centering across spectra, PCA denoising, a global StandardScaler) is a leakage trap if you fit it before you split.
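Both steps can be demonstrated on a synthetic spectrum. The sketch below assumes a 2 cm⁻¹ channel spacing (an inference from 701 channels spanning 400 to 1800, not stated in the text), a single Gaussian peak, and arbitrary scatter and baseline values; it uses scipy.signal.savgol_filter for the derivative, with a window of 15 points and polynomial order 2 chosen only for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(X: np.ndarray) -> np.ndarray:
    """Standard Normal Variate: centre and scale each spectrum (row) on its own."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / np.where(sd == 0, 1.0, sd)


# Assumed axis: 400..1800 in steps of 2 gives 701 channels.
wn = np.arange(400, 1801, 2, dtype=float)
peak = np.exp(-0.5 * ((wn - 1000.0) / 15.0) ** 2)  # one synthetic band

# Multiplicative scatter plus a constant offset (e.g. a fouled window).
clean = peak
fouled = 1.8 * peak + 0.4
X_snv = snv(np.vstack([clean, fouled]))
print("SNV makes the two spectra identical:", np.allclose(X_snv[0], X_snv[1]))

# A sloping baseline is removed by a second-derivative Savitzky-Golay filter,
# because a second derivative annihilates any constant-plus-linear term.
baseline = 0.4 + 0.0003 * (wn - 400.0)
d2_clean = savgol_filter(peak, window_length=15, polyorder=2, deriv=2)
d2_drift = savgol_filter(peak + baseline, window_length=15, polyorder=2, deriv=2)
print("2nd derivative ignores the baseline:", np.allclose(d2_clean, d2_drift, atol=1e-9))
```

Neither transform here learns anything from other spectra, which is why both are safe to apply before splitting; a mean-centring or scaling step fitted across spectra would not be.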

The cardinal sin: split by batch, not by row

Here is the single most important paragraph in the chapter. The most common, most seductive, and most damaging mistake in bioprocess ML is to take your aligned table of timepoints, shuffle the rows, and split 70/30 into train and test. The reported R² will be spectacular — and meaningless.

The reason is temporal and within-batch autocorrelation. Two Raman spectra taken an hour apart in the same batch are almost identical: same cells, same media lot, same probe, same fouling, nearly the same titer. If one of them is in the training set and the other in the test set, the model has effectively seen the answer — it is interpolating between two near-duplicate points, not generalizing. A row-wise random split scatters near-twins across the train/test boundary in every batch, so the test set is not independent of the training set. You measure memorization and call it skill.

The fix is grouped splitting: every row from a given batch goes entirely to train or entirely to test — never both. Then the test batches are genuinely unseen processes, and the score is an honest estimate of "how will this do on the next batch?" — which is the only question that matters in manufacturing. The same logic forces GroupKFold (grouped by batch) for cross-validation and a leave-one-batch-out scheme when batches are precious. This is not a refinement; it is the difference between a number you can defend to a regulator and a number that will collapse on first contact with a new lot.

The trap is worse than it looks because it hides. A row-wise split passes every code review, produces a beautiful predicted-versus-measured scatter, and survives until the model meets batch number seven in production and falls apart — at which point the validation evidence in your filing is wrong. Splitting by batch is the cheapest insurance in the entire pipeline, and examples/platform/ml/dataio.py makes it the default so that no later chapter can fall into the trap by accident.
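The mechanism can be reproduced on synthetic data in a few lines. The sketch below is deliberately extreme and is not the chapter's PLS soft sensor: every batch carries a hypothetical one-hot "signature" in its features and its own target offset, and the model is a 1-nearest-neighbour regressor that can only memorize. All numbers are invented.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupShuffleSplit, train_test_split
from sklearn.neighbors import KNeighborsRegressor

n_batches, n_t = 6, 50
levels = np.array([1.0, 5.0, 2.0, 8.0, 3.0, 9.0])  # hypothetical per-batch offsets

groups = np.repeat(np.arange(n_batches), n_t)       # batch key for every row
t = np.tile(np.arange(n_t), n_batches)              # timepoint index within batch
signature = 10.0 * np.eye(n_batches)[groups]        # batch-specific feature pattern
X = np.column_stack([signature, t / n_t])
y = levels[groups] + 0.01 * t                       # slow within-batch trend


def fit_score(X_tr, X_te, y_tr, y_te) -> float:
    """Fit a memorizing model and return R2 on the held-out rows."""
    model = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


# Wrong: shuffle rows, so each test row has a near-twin from its own batch in train.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
r2_row = fit_score(X_tr, X_te, y_tr, y_te)

# Right: hold out two whole batches (an integer test_size counts groups).
gss = GroupShuffleSplit(n_splits=1, test_size=2, random_state=0)
tr, te = next(gss.split(X, y, groups))
r2_grouped = fit_score(X[tr], X[te], y[tr], y[te])

print(f"row-wise R2      = {r2_row:.3f}")
print(f"batch-grouped R2 = {r2_grouped:.3f}")
print("held-out batches :", sorted(set(groups[te])))
```

The model and the data are identical in both runs; only the split changes. The row-wise score is near-perfect because every test row has a same-batch neighbour in the training set, while the grouped score shows that nothing about an unseen batch was learned.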

[Figure: hero diagram contrasting a wrong row-wise random split with a correct batch-grouped split. Left: six batch trajectories drawn as horizontal sparklines; a row-wise random split scatters individual timepoints into train and test, with red double-headed arrows marking near-identical neighbouring points that land on opposite sides of the train/test line, labelled as leakage with an inflated R² near 0.99. Right: the same six trajectories split as whole batches; batches one, two, three, and five go entirely to train (indigo), while batch four (the OOS sibling) and batch six go entirely to test (cyan), with a green "honest R²" label and a note that the test batches are genuinely unseen processes. Bottom band: the cold-start cadence, a dense Raman stream every few minutes against an offline reference marker appearing only twice per day.]

Caption: The defining choice of bioprocess ML, drawn twice: a row-wise random split (left) scatters near-duplicate within-batch neighbours across the train/test boundary and reports a fantasy score, while a batch-grouped split (right) keeps whole batches on one side, so the held-out batches are truly unseen and the score is honest. Both are set against the cold-start reality that ground truth arrives only once or twice a day. Original diagram by the authors, created with AI assistance.

The shared loader and a leak-free split, in code

This chapter contributes examples/platform/ml/dataio.py — the foundation module every later chapter imports. It loads the committed datasets, joins them on batch identity, and — crucially — provides a batch-aware split so that grouped splitting is the path of least resistance rather than an extra discipline. The excerpt below shows the two functions that matter: loading the per-timepoint Raman table with its batch_id group key intact, and splitting by batch.

# examples/platform/ml/dataio.py: shared loaders + batch-aware split helpers
from __future__ import annotations

from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# This file sits at examples/platform/ml/dataio.py, so parents[2] is examples/
# and the committed datasets resolve to examples/datasets/.
DATA = Path(__file__).resolve().parents[2] / "datasets"


def load_raman(target: str = "titer_g_L"):
    """Per-timepoint Raman design matrix, keeping the batch_id GROUP KEY intact.

    Returns X (n x 701 spectra), y (target), groups (batch_id per row), and
    wn (the wavenumber column names, in the same order as the columns of X).
    The groups array is the whole point: it is what makes a leak-free split possible.
    """
    df = pd.read_parquet(DATA / "raman_spectra.parquet")
    wn = [c for c in df.columns if c.startswith("wn_")]
    X = df[wn].to_numpy()
    y = df[target].to_numpy()
    groups = df["batch_id"].to_numpy()
    return X, y, groups, wn


def snv(X: np.ndarray) -> np.ndarray:
    """Standard Normal Variate: per-spectrum centre + scale. Row-wise, so leak-free."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    # Guard against a flat spectrum (zero standard deviation).
    return (X - mu) / np.where(sd == 0, 1.0, sd)


def batch_split(X, y, groups, test_size: float = 0.33, seed: int = 2026):
    """Split so that EVERY row of a batch is wholly in train OR test, never both.

    This is the single most important line in the whole example suite: a row-wise
    train_test_split here would leak within-batch neighbours and report a fantasy R2.

    test_size is a fraction of BATCHES, not rows, and GroupShuffleSplit rounds the
    test count up: 0.33 of 6 batches holds out 2, whereas 0.34 would hold out 3.
    """
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    tr, te = next(gss.split(X, y, groups))
    return X[tr], X[te], y[tr], y[te], groups[tr], groups[te]


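def batch_cv(groups, n_splits: int = 3):
    """GroupKFold folds, grouped by batch: the honest cross-validation for this data.

    Returns the splitter only; groups is accepted for API symmetry and must still
    be passed when iterating, e.g. batch_cv(groups).split(X, y, groups).
    """
    return GroupKFold(n_splits=n_splits)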

Running a small demonstration that fits the same PLS soft sensor under both splitting schemes shows the gap between the comfortable lie and the honest number:

$ python -m examples.platform.ml.dataio --demo
loaded raman_spectra.parquet: 337 rows x 701 wavenumbers, 6 batches
rows per batch: BATCH-2026-001..006 (≈56 each)

[ROW-WISE random split] train 222 / test 115 R2 = 0.992 <- LEAKED, do not trust
[BATCH-GROUPED split] train batches {001,002,003,005} / test {004,006}
train 224 / test 113 R2 = 0.951 <- honest, held-out batches

held-out test batches were never seen during fit; R2 reflects next-batch performance. (illustrative)

Both numbers are high because the simulated spectra carry the titer signal cleanly — but the row-wise number is high for the wrong reason. The honest, batch-grouped number is the one you could put in front of a reviewer, because the test batches BATCH-2026-004 (the OOS sibling) and BATCH-2026-006 were genuinely never seen during the fit. In real Raman the gap between the two numbers is typically far larger, and a model that looked excellent under a row-wise split has been known to fail outright on the first new batch.

Anatomy of one training example

Every model in this book consumes the same atomic unit: a single, contextualized, leak-aware training example. Pull one apart and the whole chapter is laid out as fields. The example below is one aligned timepoint from the golden run — a Raman spectrum, the online state at that instant, the batch-time index, the offline reference titer it is paired to, and the group key that keeps it leak-free.

[Figure: anatomy identity card of one bioprocess training example. An indigo header reads: training example, source BATCH-2026-001, batch-hour 168.0. A feature block lists the inputs: a 701-channel Raman spectrum row (wn_400 to wn_1800) shown as a sparkline, an SNV-preprocessed version beside it, and the aligned online state (temperature 36.51 degrees C, pH 7.04, dissolved oxygen, agitation). A green target block holds the offline reference titer_g_L from the LIMS, with a note that it was measured at batch-hour 168 and arrives only twice per day. A cyan key block highlights the group key batch_id = BATCH-2026-001, labelled as the field that decides train versus test. A violet provenance block lists the historian tag IRIs, the LIMS sample_id, the dataset hash from MANIFEST.sha256, and the ALCOA+ attributes attributable, contemporaneous, original. A footer notes that the cheap fast features outnumber the slow expensive target by hundreds to one.]

Caption: One training example, fully unpacked: the dense cheap features (Raman plus aligned online state), the SNV-preprocessed spectrum, the sparse expensive target from the LIMS, the batch-id group key that alone decides which side of the split it falls on, and the provenance and ALCOA+ attributes that make it trustworthy enough to learn from under GxP. Original diagram by the authors, created with AI assistance.

Read the card top to bottom and the argument is complete. The features are the cheap, fast signal: a 701-channel spectrum and the online state, available continuously. The SNV-preprocessed row is the model-ready transform, computed per-spectrum so it carries no leakage. The green target is the slow, expensive truth: an offline titer that exists only because a sample was pulled at batch-hour 168 and run on a bench, one of just twenty-eight such samples in the whole batch. The cyan group key, batch_id, looks like mere metadata but is the most powerful field on the card: it alone decides whether this example trains the model or tests it, and getting it wrong is the leakage failure above. The violet provenance block is what makes the example admissible at all: the historian tag IRIs and LIMS sample_id that tie it to its sources, the dataset hash from MANIFEST.sha256 that pins exactly which data this is, and the ALCOA+ attributes that make it data integrity rather than just data.

The cold-start reality: dense features, sparse truth

Now we name the constraint that shapes every later chapter. In our datasets the asymmetry is exact: the Raman probe and online tags produce hundreds to thousands of rows per batch, while the offline reference — the only ground truth for titer, metabolites, and viability — is sampled twice a day, 28 times across a 14-day batch, and the release CQAs exactly once, at the end. The bioprocess ML literature calls this the cold-start problem: living systems are observed sparsely and at low dimension by the offline reference, run-to-run variability "severely compromises transferability," and models decay fast in production [4] (research/consensus).

This has three hard consequences that the rest of the book keeps colliding with:

  1. Targets are the scarce resource, not features. You can always collect more spectra; you cannot easily collect more titers. Every method that wins in bioprocess — hybrid models, transfer learning, Bayesian priors, semi-supervised learning — is, at bottom, a way to spend fewer expensive labels. This is the through-line that explains why pure end-to-end deep learning so often stalls here [4].
  2. Drift is detected late. Because ground truth arrives twice a day, a soft sensor that began drifting at breakfast is not provably wrong until the evening sample comes back. The drift flag is, by construction, a lagging indicator — a problem Book 2's soft-sensor chapter names and Chapter 22 of this book has to engineer around.
  3. Data volume is not information. A million Raman points from one batch is still one batch's worth of information about how a process behaves run-to-run. Resampling a tag from 1-second to 1-minute cadence loses almost nothing, because neighbouring seconds are near-duplicates. The quantity that actually limits a batch-level model is the number of independent batches, and that grows by ones, slowly, at the cost of weeks and a fortune each. Confusing rows with information is the same error as the row-wise split, wearing different clothes.
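The "drift is detected late" consequence can be put into numbers with simple arithmetic. The sketch below assumes the two daily reference samples are evenly spaced 12 hours apart and ignores assay turnaround time, neither of which the text specifies; it asks how long a drift that begins at an arbitrary moment must wait for the next ground-truth sample.

```python
import numpy as np

SAMPLE_INTERVAL_H = 12.0  # assumption: twice a day, evenly spaced

# Candidate drift onset times across a 14-day batch, every 15 minutes.
onset_h = np.arange(0.0, 14 * 24, 0.25)

# The first reference sample at or after each onset, and the wait until it.
next_sample_h = np.ceil(onset_h / SAMPLE_INTERVAL_H) * SAMPLE_INTERVAL_H
wait_h = next_sample_h - onset_h

print(f"mean wait for ground truth : {wait_h.mean():.2f} h")
print(f"worst-case wait            : {wait_h.max():.2f} h")
```

Under these assumptions a drifting soft sensor runs unchecked for about six hours on average and nearly twelve in the worst case, before any bench turnaround is added.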

Data integrity: you cannot learn from data you cannot trust

There is a final readiness gate that sits above all the engineering: under GxP, a model is only as trustworthy as the data it learned from, and "trustworthy" has a formal definition. ALCOA+ is the data-integrity standard regulators expect — data must be Attributable, Legible, Contemporaneous, Original, and Accurate, plus Complete, Consistent, Enduring, and Available [5]. Every one of those attributes is a precondition for honest learning, not a paperwork afterthought:

  • Attributable and Original — you must know which instrument, which batch, which operator produced each value, or you cannot reconstruct the provenance a regulator (or a debugging engineer) will ask for. This is the provenance block on the anatomy card.
  • Contemporaneous — a value recorded at the time it occurred preserves the alignment that feature engineering depends on; a value back-keyed from a paper form has already lost its true clock.
  • Complete and Consistent — silently dropped rows or quietly reconciled outliers bias a model in ways no validation set will catch, because the bias is baked into both train and test.

The regulatory frameworks Book 5 returns to repeatedly — the FDA's 2023 discussion paper and 2025 credibility framework, the draft EU Annex 22, and the ISPE GAMP AI guide — all treat data integrity and data management as the first pillar of a credible model, before any discussion of algorithms or validation [6][7]. The consistent expectation is a model locked at validation with a predetermined change-control plan, riding on ALCOA+ data throughout its lifecycle [7]. The practical reading for this chapter: a soft sensor trained on a hybrid paper/digital dataset with no attributable provenance is not merely lower-quality — it is, for a GMP decision, inadmissible, no matter how good its R² looks.

What this chapter adds to the model suite

This chapter contributes the foundation module of Book 5's example suite:

  • examples/platform/ml/dataio.py — the shared data layer. It loads the committed datasets (raman_spectra.parquet, fedbatch_timeseries.parquet, offline_assays.csv, hplc_results.csv) into model-ready matrices with the batch_id group key preserved on every row; provides the leak-free snv() preprocessing and batch_split() / batch_cv() helpers; and binds each load to the dataset's MANIFEST.sha256 hash so every downstream model can record exactly which data it trained on. Every later chapter — the soft sensor, the clone ranker, the OOS predictor — imports dataio rather than re-reading parquet, so the batch-grouped split is the default, never an opt-in. The module turns "don't leak" from a discipline you must remember into the only path the API offers.
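What "binding a load to the manifest hash" might look like is sketched below with the standard library only. The manifest layout is an assumption (one "hex-digest, two spaces, filename" line per file, in the style of the sha256sum tool), and the helper names sha256_of, read_manifest, and verify are hypothetical rather than the functions dataio.py actually exposes. The demonstration writes throwaway files to a temporary directory.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def read_manifest(manifest: Path) -> dict[str, str]:
    """Parse assumed 'digest  filename' lines into {filename: digest}."""
    entries: dict[str, str] = {}
    for line in manifest.read_text().splitlines():
        if line.strip():
            hexdigest, name = line.split(maxsplit=1)
            entries[name.strip().lstrip("*")] = hexdigest
    return entries


def verify(data_file: Path, manifest: Path) -> bool:
    """True only if the file is listed and its current hash matches the manifest."""
    expected = read_manifest(manifest).get(data_file.name)
    return expected is not None and expected == sha256_of(data_file)


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "offline_assays.csv"
    data_file.write_text("batch_id,batch_hour,titer_g_L\n")
    manifest = Path(tmp) / "MANIFEST.sha256"
    manifest.write_text(f"{sha256_of(data_file)}  {data_file.name}\n")

    ok_before = verify(data_file, manifest)
    data_file.write_text("batch_id,batch_hour,titer_g_L\nEDITED\n")  # silent edit
    ok_after = verify(data_file, manifest)

print(ok_before, ok_after)  # True False
```

A loader that refuses to return data when verify is false, and records the matching digest alongside the trained model, gives each model a checkable answer to "which data was this trained on?".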

Why it matters

Every later chapter in this book — every soft sensor, every clone ranker, every hybrid twin, every OOS predictor — stands on the work of this one. A model is a function from data to a decision, and if the data is siloed it cannot be assembled, if it is non-FAIR it cannot be joined, if it is split row-wise its reported skill is fiction, and if it fails ALCOA+ it is inadmissible. The most common reason a promising bioprocess ML project dies is not that the chosen algorithm was wrong; it is that the data was never ready, or the validation was never honest. Getting the fuel right — readiness, contextualization, leak-free features, and a clear-eyed view of the cold-start cadence — is the unglamorous majority of the work and the part that decides whether anything downstream is real.

In the real world

The data-readiness barrier is the most consistently reported finding in the field. Surveys put silo access and non-standardized formats at the top of the blocker list (vendor/self-reported and industry consensus) [1][2], and vendor "AI-ready data" layers — TetraScience's Tetra OS and similar fabrics — exist precisely to monetize the gap (vendor/self-reported). Production-grade spectroscopic soft sensing, the strongest upstream ML deployment, depends entirely on the preprocessing-and-calibration discipline described here: in-line Raman with SNV/derivative preprocessing and PLS chemometrics is established practice for glucose, lactate, and titer, up to closed-loop glucose control in CHO culture [8] (production). And the splitting discipline is not academic hygiene — the bioprocess ML reviews single out data leakage from improper validation and the small-data / cold-start regime as the two technical reasons reported successes so often fail to reproduce or transfer [4] (research/consensus). The unglamorous verdict from every quarter is the same: fix the data first, and report a number you can defend.

Key terms

  • Data readiness — the state in which data is accessible, joined, well-described, and trustworthy enough to train a model; barrier number one in bioprocess ML.
  • FAIR — Findable, Accessible, Interoperable, Reusable: the four principles for data that machines can locate and use.
  • Historian / data shadow — the time-series store of dense online signals (temperature, pH, DO, Raman); the model's cheap, fast features.
  • Offline reference — the slow, expensive bench/LIMS measurement (titer, metabolites, release CQAs); the model's scarce ground truth.
  • Contextualization — attaching identity (batch, tag, unit, timestamp, quality) to a raw value so it can be joined across systems and become a feature.
  • Alignment / resampling — expressing all signals on a common clock (usually batch time) and a common cadence so they can be joined and fed to a model.
  • Batch-wise unfolding — flattening the (batch × variable × time) cube so each whole batch is one row, for batch-level prediction.
  • SNV (Standard Normal Variate) — per-spectrum centring and scaling that removes multiplicative scatter; leak-free because it is computed row by row.
  • Savitzky-Golay derivative — sliding-window polynomial smoothing-plus-derivative that removes spectral baseline and sharpens peaks.
  • Data leakage — information from the test set reaching the model during training; in bioprocess, almost always caused by a row-wise split mixing within-batch neighbours.
  • Batch-grouped split — putting every row of a batch wholly into train or test, so held-out batches are truly unseen; the correct split, and the GroupKFold / leave-one-batch-out family of cross-validation.
  • Cold start — the once-or-twice-a-day cadence of offline reference data that limits how fast a model can learn and how late drift is detected.
  • ALCOA+ — Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available: the data-integrity standard a learnable dataset must meet under GxP.

Where this leads

We have the fuel: data that is ready, contextualized, turned into leak-free features, and honestly split. The next chapter, Models and Validation: From PLS to Transformers, Under GxP, takes that fuel and chooses an engine — walking the ladder from partial least squares through tree ensembles to neural networks, asking at each rung how much data it needs and how it must be validated to be trusted with a decision about a medicine. The honest split we built here is exactly the harness that makes those validation numbers mean something.