Glossary

📍 Quick reference: This is your pocket dictionary for the whole book — every recurring ML and AI term, in plain words. Bookmark it and come back anytime a word stops making sense.

Machine learning has its own language, and applying it to a living, regulated process adds a second one on top. The most important terms from this book are collected here, listed alphabetically so they are easy to find. Each entry is a plain-language starting point; the chapter it points to gives the full, precise picture.

ALCOA+ — the data-integrity standard regulators expect: data must be Attributable, Legible, Contemporaneous, Original, and Accurate, plus Complete, Consistent, Enduring, and Available. A model is only as trustworthy as the data it learned from, so a dataset that fails ALCOA+ is inadmissible no matter how good the model looks. (See Data, the fuel.)

Applicability domain (AD) — a safety gate that asks whether a new input even resembles the data the model was trained on; if the input is too far outside that envelope, the model declines to trust its own prediction rather than guessing confidently. (See Models and validation.)
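
As a rough illustration of the gate described above, the sketch below flags an input when any feature sits too far from the training data. The per-feature z-score rule, the `max_z=3.0` cutoff, and the temperature/pH rows are all assumptions made for illustration; real applicability-domain checks often use multivariate distances instead.

```python
from statistics import mean, stdev

def fit_envelope(train_rows):
    """Store the mean and sample standard deviation of each training feature."""
    columns = list(zip(*train_rows))
    return [(mean(col), stdev(col)) for col in columns]

def in_domain(row, envelope, max_z=3.0):
    """Return True only if every feature lies within max_z standard deviations."""
    return all(abs(x - m) <= max_z * s for x, (m, s) in zip(row, envelope))

# Hypothetical training rows: (temperature in deg C, pH)
train_rows = [(36.8, 7.00), (37.0, 7.05), (37.2, 6.95), (37.0, 7.00)]
envelope = fit_envelope(train_rows)

inside = in_domain((37.1, 7.02), envelope)   # resembles the training data
outside = in_domain((39.5, 7.02), envelope)  # temperature far outside the envelope
```

When `in_domain` returns False, the calling code would withhold or flag the prediction rather than report it as trustworthy.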

Bayesian optimization (BO) — a smart search that fits a model (usually a Gaussian process) to the experiments run so far, then picks the next experiment most likely to improve the result, reaching a good answer in far fewer runs than testing every setting on a fixed grid. (See Process development.)

Bias-variance trade-off — the statistical reason a simpler model often wins on small data: a very flexible model has low bias but high variance (it chases noise and swings wildly), while a rigid one has higher bias but low variance, and on a few batches variance dominates the error. (See Models and validation.)

Champion/challenger — the safe way to replace a deployed model: run the new candidate (challenger) in shadow on live inputs while the old one (champion) still runs the plant, score both against the same ground truth, and promote the challenger only if it genuinely wins; learning happens between locked versions, never inside one. (See MLOps and lifecycle.)
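
The promotion decision can be sketched as a comparison of both models against the same reference values. This is a minimal sketch, assuming RMSE as the score and a hypothetical `min_gain` margin of 10%; the titer numbers are invented, and a real promotion would also go through change control.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between reference values and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def should_promote(reference, champion_pred, challenger_pred, min_gain=0.10):
    """Promote only if the challenger beats the champion by at least min_gain."""
    champion_error = rmse(reference, champion_pred)
    challenger_error = rmse(reference, challenger_pred)
    return challenger_error <= champion_error * (1.0 - min_gain)

# Hypothetical shadow-run data: reference titers and both models' predictions
reference = [4.0, 4.2, 4.1, 4.3]
champion_pred = [4.2, 4.4, 4.3, 4.5]
clear_winner = [4.05, 4.25, 4.15, 4.35]
near_tie = [4.19, 4.39, 4.29, 4.49]

promote_winner = should_promote(reference, champion_pred, clear_winner)
promote_tie = should_promote(reference, champion_pred, near_tie)
```

The margin keeps a challenger that is only marginally better from displacing a validated champion.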

Cold start (cold-start problem) — the once-or-twice-a-day cadence of the slow, expensive reference measurement that limits how fast a model can learn and how late drift is caught; the scarce resource is labels, not the flood of cheap feature rows. (See Data, the fuel.)

Concept drift — when the relationship the model learned changes, so the same inputs now imply a different answer (a cell line adapts, a feed strategy changes); the dangerous kind, because the inputs can look perfectly normal while predictions quietly go wrong. (See MLOps and lifecycle.)

Contextualization — attaching identity to a raw number (which batch, which tag, which unit, which timestamp, which quality flag) so it can be joined across systems and become a usable feature; a number means nothing until you know what it is. (See Data, the fuel.)

CPP (Critical Process Parameter) — a controllable input to the process, such as feed rate, temperature, or pH, that the model may use as a feature; contrast with a CQA, which is a measured product property. (See Hybrid models and digital twins.)

CQA (Critical Quality Attribute) — a measurable property of the drug (such as host-cell protein or SEC monomer) that must stay inside a safe range for the batch to be released; whether a model touches a CQA decides how tightly it must be governed. (See The learning problem.)

Cross-validation — rotating which slice (fold) of the data is held out for testing and averaging the scores, so scarce data is used efficiently; in bioprocess the folds must be grouped by batch, never split at random. (See Data, the fuel.)

Data leakage — any path by which information from the test set sneaks into training, making the reported score better than reality; in bioprocess it shows up as row-wise splitting, temporal leakage, target leakage, and scaling before the split, and it is the field's most common validation mistake. (See Data, the fuel.)
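
The scaling-before-the-split form is the easiest to show in a few lines. In this minimal sketch, with invented numbers and hypothetical helper names, the leaky scaler is fitted on training and test values together, while the honest scaler sees the training values only.

```python
from statistics import mean, stdev

def fit_scaler(values):
    """Learn centering and scaling parameters from the values given."""
    return mean(values), stdev(values)

def apply_scaler(values, params):
    """Apply previously learned parameters without refitting."""
    center, scale = params
    return [(v - center) / scale for v in values]

train = [1.0, 2.0, 3.0]   # hypothetical training-batch readings
test = [10.0]             # hypothetical held-out batch reading

leaky_params = fit_scaler(train + test)  # test data has influenced the scaler
honest_params = fit_scaler(train)        # fitted strictly on training data

test_scaled = apply_scaler(test, honest_params)
```

The held-out value changes the leaky parameters, so any score computed with them has already seen the test set.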

Data readiness — the state in which data is accessible, joined, well-described, and trustworthy enough to train a model; barrier number one, and the unglamorous majority of the work in any real project. (See Data, the fuel.)

Demo-to-plant gap — the structural distance between a model that dazzles in a curated demonstration and one that runs unattended every shift under GMP; surveys capture it as AI/ML's "most pilots, fewest scaled" profile. (See The learning problem.)

Deviation / CAPA — a recorded departure from a procedure (deviation) and the Corrective And Preventive Action that resolves it; the free-text records that language models are most used to triage, retrieve, and draft. (See Generative AI and LLMs.)

Developability — the bundle of properties (aggregation, viscosity, stability, chemical-liability motifs, immunogenicity) that decide whether a candidate antibody can actually be manufactured and dosed; predicted from sequence, years before any protein exists. (See Molecule discovery.)

Digital twin — a model (usually hybrid) wired to live plant data so it tracks a real asset in real time, used for soft-sensing, control, and what-if simulation; under GMP it is re-validated on a schedule, not allowed to quietly keep learning. (See Hybrid models and digital twins.)

Drift (model drift) — the gradual divergence between the relationship a model froze at validation and the one the world now obeys; the default trajectory of a static model watching a living process on changing hardware. (See MLOps and lifecycle.)

Evidence tier — a four-rung ladder for how good the evidence behind a claim is: press-release-only, vendor-self-reported, peer-reviewed-self-authored, and peer-reviewed-independent (the fact floor, at or above which a number may be stated as established fact). (See The learning problem.)

FAIR — Findable, Accessible, Interoperable, Reusable: the four principles for data that machines can locate and use; interoperable is the one most often missing when two plants record the same thing under different names and units. (See Data, the fuel.)

Feature — an input the model learns from, such as a Raman channel, a temperature reading, or a sequence motif count; features are cheap and plentiful, which makes their sheer count misleading about how much the model actually knows. (See Data, the fuel.)

Foundation model — a very large model pre-trained on huge amounts of data so a new task starts from a strong prior instead of a cold start; in bioprocess they are mostly aspiration, because comparable, shareable batch data barely exists. (See The frontier.)

Gaussian process (GP) — a model that returns not just a predicted value but a calibrated uncertainty at every point — narrow where it has data, wide where it does not; the rare case where a more sophisticated model is exactly what scarce data wants. (See Models and validation.)

Generative AI — models that produce new content (text, sequences) rather than only classifying or scoring; powerful for drafting and design, but excluded from critical GMP decisions under draft Annex 22. (See Generative AI and LLMs.)

Golden batch — a clean, on-target reference run whose healthy multivariate fingerprint defines "normal"; new runs are graded against it, and a batch that strays from the golden-batch envelope is flagged. (See QC and release.)

GMP / GxP / cGMP — GMP (Good Manufacturing Practice) is the legally binding rule-set for how a medicine is made; GxP is the umbrella for the whole "Good x Practice" family; cGMP is the FDA's "current" GMP. Data and models used for GMP decisions must meet a high bar. (See The learning problem.)

GroupKFold / leave-one-batch-out — the batch-aware cross-validation that keeps every row of a batch wholly on one side of the split, so held-out batches are genuinely unseen and the reported score is honest. (See Data, the fuel.)
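
A batch-aware split can be sketched without any library. The helper below is a hypothetical leave-one-batch-out generator with invented batch IDs. In scikit-learn, `GroupKFold` and `LeaveOneGroupOut` from `sklearn.model_selection` play the same role when given the batch IDs as `groups`.

```python
def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_indices, test_indices), one fold per batch."""
    for held_out in sorted(set(batch_ids)):
        train_idx = [i for i, b in enumerate(batch_ids) if b != held_out]
        test_idx = [i for i, b in enumerate(batch_ids) if b == held_out]
        yield held_out, train_idx, test_idx

# Hypothetical batch ID for each row of the dataset
batch_ids = ["B1", "B1", "B1", "B2", "B2", "B3", "B3", "B3"]
folds = list(leave_one_batch_out(batch_ids))

for held_out, train_idx, test_idx in folds:
    train_batches = {batch_ids[i] for i in train_idx}
    # No row of the held-out batch may appear on the training side.
    assert held_out not in train_batches
```

Every row of a batch lands on one side of the split, which is the property a random row-wise split breaks.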

Hallucination — fluent, confident output from a language model that is simply wrong; the central risk of generative AI in a regulated setting, reduced (never eliminated) by grounding the model in real documents and keeping a human reviewer. (See Generative AI and LLMs.)

Hotelling's T² / SPE — the two monitoring statistics behind MSPC charts: T² measures how far a batch sits from the center of the normal pattern, within the directions the model captures, while SPE (squared prediction error) measures how much of the batch the model cannot explain at all; a large SPE flags something the model has never seen. (See QC and release.)

Human-in-the-loop — the discipline (and now regulatory expectation) that a qualified person reviews and signs the model's output; the model accelerates the work but never makes the critical decision. (See Generative AI and LLMs.)

Hybrid model (grey-box) — a model that keeps the equations we already trust (mass balances, kinetics — a mechanistic, or white-box, backbone) and asks machine learning to cover only the part we cannot write down; the dominant paradigm in bioprocess because physics does the work the scarce data cannot. (See Hybrid models and digital twins.)

IRI (Internationalized Resource Identifier) — a global, web-style name for a concept; pulling a feature by its IRI instead of a fragile column name keeps a model from silently breaking when an upstream system renames a tag. (See Data, the fuel.)

Label (target) — the answer the model is trained to predict, such as a titer, a pass/fail, or a log-reduction value; in bioprocess labels are slow, expensive, and scarce, which makes them the true binding constraint, not the features. (See Data, the fuel.)

Large language model (LLM) — a neural model trained on vast text that reads and generates fluent natural language; the engine of the deviation/CAPA copilots and document assistants of the 2023-2026 wave. (See Generative AI and LLMs.)

Locked model — a model frozen for production use (weights, preprocessing, scaler, operating range all version-pinned and unchangeable in place); the only pattern draft Annex 22 permits for critical applications. (See MLOps and lifecycle.)

Maturity ladder — a three-rung label for how far a deployment has actually traveled: (production) running in a real GMP plant, (pilot) demonstrated at scale but not in routine use, and (research) a paper or lab result. (See The learning problem.)

MLOps — the operational discipline of deploying, monitoring, retraining, and governing models in production; under GMP it is dominated by the validation lifecycle rather than by the continuous deployment used everywhere else. (See MLOps and lifecycle.)

MSPC (Multivariate Statistical Process Control) — watching dozens of process signals together rather than one at a time, using PCA/PLS to fingerprint a healthy batch and flag any new run that strays; the most thoroughly deployed learning method in the industry. (See QC and release.)

Non-stationarity — the process moving under the model over time (cells drift, resin ages, raw-material lots change), so a calibration good this quarter can decay next quarter; the reason a model is never "done." (See The learning problem.)

OOS (Out-Of-Specification) — a measured result outside its allowed range; an OOS release CQA means the batch is rejected, and predicting an OOS before the slow assay confirms it is a key classification task. (See QC and release.)

Overfitting — when a flexible model memorizes the noise and quirks of its few training examples instead of the real relationship, so it looks brilliant in training and fails on the next genuinely new run; the constant danger in the small-data regime. (See Models and validation.)

PCCP (Predetermined Change Control Plan) — a pre-approved written specification of how a locked model may be retrained and updated, so a retrain inside the plan's envelope is a planned, documented event rather than a fresh regulatory negotiation. (See MLOps and lifecycle.)

PINN (Physics-Informed Neural Network) — a network that carries the governing equations as a soft penalty in its loss rather than structurally enforcing them; nudged toward obeying physics but not forced to, so it is weaker than a true hybrid when pushed outside its training range. (See Hybrid models and digital twins.)

PLS (Partial Least Squares) — the small-data workhorse: a linear method that compresses many correlated inputs (like 701 Raman channels) into a few latent components chosen to predict the target, and hands a reviewer interpretable coefficients and VIP scores a deep net cannot. (See Models and validation.)

PCA (Principal Component Analysis) — PLS's unsupervised cousin, compressing many correlated measurements into a few directions of greatest variance; the engine of MSPC golden-batch monitoring and its T²/SPE charts. (See Models and validation.)

Protein language model (PLM) — a model trained on hundreds of millions of natural protein sequences; it supplies sequence "naturalness" scores and transferable embeddings that let a small labeled dataset predict developability better than hand-built features. (See Molecule discovery.)

PSI (Population Stability Index) — a label-free metric of how far the model's input distribution has moved from its training distribution; a leading drift indicator that fires before the slow reference data can confirm an error, but must be tuned to the process's own normal batch-to-batch variability. (See MLOps and lifecycle.)
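
One common way to compute PSI is to bin a feature, compare the fraction of training values and live values in each bin, and sum (actual - expected) * ln(actual / expected) over the bins. The sketch below assumes hand-picked bin edges, a small `eps` floor for empty bins, and invented data; no alert threshold is shown because it has to be tuned to the process.

```python
import math

def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values in each bin; empty bins are floored at eps."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > e for e in edges)] += 1  # index of the bin holding v
    return [max(c / len(values), eps) for c in counts]

def psi(expected, actual, edges):
    """Population Stability Index between a reference sample and a new one."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

# Hypothetical feature values: training reference and a shifted live sample
train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
shifted = [x + 2.0 for x in train]
edges = [2.5, 4.5, 6.5]

psi_same = psi(train, train, edges)
psi_shifted = psi(train, shifted, edges)
```

The calculation needs no labels, which is why it can fire before the reference assay reports.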

R² (coefficient of determination) — a goodness-of-fit score where 1.0 is perfect, 0 is no better than always guessing the average, and a negative value is worse than that; the main regression metric, and the number a leaky split inflates. (See The learning problem.)
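
The three landmarks in this entry follow directly from the usual definition, R² = 1 - SS_res / SS_tot. A minimal sketch with invented titer values:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

titer = [2.0, 3.0, 4.0, 5.0]  # hypothetical reference values

perfect = r_squared(titer, titer)                    # 1.0: every point matched
mean_only = r_squared(titer, [3.5, 3.5, 3.5, 3.5])   # 0.0: always guessing the average
reversed_fit = r_squared(titer, [5.0, 4.0, 3.0, 2.0])  # -3.0: worse than the average
```

A negative score is not a bug in the formula; it means the model's errors are larger than those of a constant guess.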

RAG (Retrieval-Augmented Generation) — pairing a retriever over the firm's own validated documents with a language model told to answer only from those documents, with citations; the architecture that makes an LLM usable and auditable in GMP. (See Generative AI and LLMs.)

Reinforcement learning (RL) — learning a control policy by trial and feedback; famously data-hungry, so it is mostly held back in bioprocess, where physics-based model-predictive control dominates the critical path and you cannot ruin thousands of batches to learn a feed policy. (See The learning problem.)

Run-to-run variability — the biological reality that batches are neither independent nor identical (sister runs share a cell bank, a media lot, an operator), which breaks the textbook "independent and identically distributed" assumption and forces whole-batch splits. (See The learning problem.)

SHACL (Shapes Constraint Language) — a closed-world rule-language for checking that a graph of data carries every required value, present and in range; the same release-gate shape that decides whether a lot ships can be reused to gate which rows are fit to train on. (See Data, the fuel.)

Small-data ceiling — the binding constraint of bioprocess ML: you learn from dozens of costly runs, not millions, which is why hybrid models, transfer learning, and priors beat bigger networks and a deeper model usually does not help. (See The learning problem.)

SNV (Standard Normal Variate) — per-spectrum centering and scaling that cancels scatter differences; leak-free by construction because it is computed row by row and never looks at any other spectrum. (See Data, the fuel.)
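
The row-by-row nature of SNV is visible in code: the function takes one spectrum and nothing else. This sketch uses the sample standard deviation, which is an assumption (some implementations use the population form), and the intensity values are invented.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Center and scale one spectrum using only its own mean and spread."""
    center = mean(spectrum)
    spread = stdev(spectrum)  # sample standard deviation; undefined for a flat spectrum
    return [(x - center) / spread for x in spectrum]

# Hypothetical spectrum, and the same spectrum with a scatter-like gain and offset
raw = [1.0, 2.0, 4.0, 3.0]
scattered = [2.0 * x + 5.0 for x in raw]

snv_raw = snv(raw)
snv_scattered = snv(scattered)
```

Both versions map to the same corrected spectrum, and because no other row is consulted, SNV can be applied before or after the batch split without leaking.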

Soft sensor — a model that infers a slow, expensive offline measurement (titer, glucose, lactate) in real time from a cheap in-line signal (a Raman spectrum) between reference samples; the central model object of the book. (See Production bioreactor.)

Supervised learning — fitting a function from labeled examples; regression predicts a continuous number (the soft sensor) and classification predicts a discrete label (vision inspection, OOS prediction). (See The learning problem.)

Target leakage — a feature that is secretly a stand-in for the answer, or that would not exist at prediction time, which inflates the score in testing and then vanishes in production; the fix is that every feature must be computable strictly before the prediction moment. (See Data, the fuel.)

Titer — the concentration of antibody product in the broth, in grams per litre; the soft sensor's primary prediction target and a recurring quantity across the book. (See Production bioreactor.)

Train/validation/test split — dividing data into a part the model learns from, a part used to tune it, and a held-back part used to grade it on data it never saw; in bioprocess the split must hold out whole batches, or the score is fiction. (See Data, the fuel.)

Transfer learning — carrying a model or representation trained on one task or instrument over to a related one, so a small labeled set goes further; a key small-data workaround, but brittle on its own (a soft sensor moved naively to a second probe can collapse). (See The learning problem.)

Transformer — the attention-based architecture behind large language models; on short, scarce bioprocess time series it is mostly research, but on sequence and text (protein design, document copilots) it genuinely earns its place. (See Models and validation.)

Uncertainty quantification — attaching an honest spread to a prediction (an interval or a calibrated probability) instead of a bare number; a quality unit acts on "the value is at least 4.1" or "passes with probability 0.93," not on a point estimate. (See Models and validation.)

Unsupervised learning — learning the shape of "normal" without any labeled answers; in bioprocess this is MSPC golden-batch monitoring and anomaly detection, where a library of good batches alone defines the envelope. (See The learning problem.)

Validation paradox — the structural tension between GMP's demand for a locked, proven model and machine learning's nature of changing as it learns; managed today by locked-then-relearn (freeze, monitor, retrain off-line under change control), not by validating a continuously learning model. (See MLOps and lifecycle.)

VCD (Viable Cell Density) — the count of living cells per unit volume; notable as the soft-sensor family's weak spot, because it has no clean Raman band and a Raman VCD model transfers poorly. (See The learning problem.)

If a term here still feels fuzzy, follow it back into the chapter where it lives, and it will make far more sense in context.