Machine Learning, Soft Sensors, and Hybrid Models
Where we are: Part V, Chapter 17. Having turned managed data into control with classical statistics, we now reach the frontier: machine learning that predicts what we cannot easily measure and fuses with the physics we already know.
The last chapter, From Data to Knowledge: SPC, Multivariate Analysis, and Continued Process Verification, showed how disciplined statistics turn a flood of process data into decisions: charting one variable at a time with Statistical Process Control (SPC), watching many at once with Multivariate Data Analysis (MVDA), and monitoring every batch forever under Continued Process Verification (CPV). Those tools are powerful, but they mostly describe and flag. This chapter is about tools that predict and learn: machine learning, soft sensors, and hybrid models.
Here is the tension that makes the topic interesting. Some of the things you most want to know during a batch (how much product you have made, how many cells are alive, how much sugar is left to feed them) are slow, expensive, or impossible to measure in real time. Machine learning offers a tempting shortcut: learn the answer from data you can measure cheaply. The promise is real, but so are the limits, and biomanufacturing has limits all its own.
A good doctor does not draw blood every minute to know how you are doing. They watch cheap, fast signals (your pulse, your color, your breathing) and infer the expensive number underneath. A soft sensor does the same for a bioreactor: it watches the cheap signals it has and predicts the costly measurement it lacks. Machine learning is how it learns that inference from past batches. And a hybrid model is the doctor who also knows physiology, combining learned pattern-matching with real biological rules so the guess holds up even in a situation never seen before.
What this chapter covers
- Soft sensors: predicting hard-to-measure quantities from cheap online signals
- Machine learning in bioprocessing, plainly, and why "small data" changes everything
- Hybrid (gray-box) models that fuse mechanistic knowledge with data
- Validating artificial intelligence under GxP rules
- An honest reckoning of hype versus reality
Soft sensors: measuring the unmeasurable
A soft sensor (also called a virtual sensor or inferential sensor) is not a physical probe. It is a piece of software that estimates a quantity you cannot easily measure directly, using other signals you can [6]. The idea comes from the wider process industries, such as oil refining and chemicals, where Kadlec and colleagues laid out the now-standard recipe: gather historical data, clean it, train a model to map cheap inputs to the expensive target, then run that model live to produce a continuous prediction [6].
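The four-step recipe above can be sketched in a few lines. This is a minimal illustration, not a production soft sensor: it assumes a single cheap input (oxygen uptake rate), a straight-line relationship to viable cell density, and invented numbers. The names `fit_line` and `soft_sensor` are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (one input)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Steps 1 and 2: historical, already-cleaned pairs (invented values).
our_history = [1.0, 2.0, 3.0, 4.0, 5.0]    # cheap input: oxygen uptake rate
vcd_history = [2.1, 3.9, 6.2, 7.8, 10.0]   # costly target: viable cell density

# Step 3: train the mapping from cheap input to expensive target.
slope, intercept = fit_line(our_history, vcd_history)


def soft_sensor(our_now):
    """Step 4: turn a live reading into a continuous estimate."""
    return slope * our_now + intercept


estimate = soft_sensor(3.5)
print(f"estimated VCD: {estimate:.3f}")
```

A real soft sensor would usually take many inputs and a richer model, but the shape is the same: fit once on history, then call the fitted function on every new reading.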
In a bioreactor the targets are tantalizing. Titer, the concentration of product (say, antibody) in the broth, is the number the whole business cares about, yet it usually comes from a slow lab assay hours later. Viable cell density (VCD), meaning how many living cells are working, and the glucose concentration that feeds them are similarly awkward to track minute by minute. A soft sensor predicts these from signals that are available continuously, such as oxygen uptake, stirring power, or a spectroscopic reading.
This connects directly to Raman spectroscopy, the optical fingerprinting technique we met as a PAT (Process Analytical Technology) tool back in Chapter 4. A Raman probe produces a rich spectrum on a cycle of seconds to minutes, but a raw spectrum is not a glucose number; it must be turned into one by a chemometric model (statistics applied to chemical spectra). That model is a soft sensor: cheap, fast spectra in, the expensive concentration out [6].
Bioprocesses, though, make soft sensors unusually hard to build. Brunner and colleagues catalogue why: batches vary in length, a run passes through distinct phases (growth, then production) that obey different rules, and the few probes that exist can drift or fail mid-run [7]. A soft sensor that quietly trusts a faulty input can be worse than no sensor at all, so fault tolerance (knowing when not to believe yourself) is part of the job, not an afterthought [7].
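One simple form of fault tolerance is to check the input before trusting the output. The sketch below is an assumption-laden illustration, not the method of [7]: it refuses to predict when a recent reading is physically implausible or when the probe appears frozen. The class `InputCheck`, the thresholds, and the stand-in linear model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class InputCheck:
    low: float         # lowest physically plausible reading
    high: float        # highest physically plausible reading
    min_spread: float  # a flatter window than this is treated as a frozen probe


def input_is_trustworthy(window: Sequence[float], check: InputCheck) -> bool:
    """True only if every recent reading is in range and the signal still moves."""
    if not window:
        return False
    if any(v < check.low or v > check.high for v in window):
        return False
    return (max(window) - min(window)) >= check.min_spread


def guarded_estimate(window: Sequence[float], check: InputCheck,
                     model: Callable[[float], float]) -> Optional[float]:
    """Predict from the latest reading, or return None when the input is suspect."""
    if not input_is_trustworthy(window, check):
        return None
    return model(window[-1])


# Hypothetical limits and a stand-in linear model, for illustration only.
check = InputCheck(low=0.0, high=20.0, min_spread=0.01)
model = lambda our: 1.97 * our + 0.09

ok = guarded_estimate([3.40, 3.45, 3.50], check, model)      # healthy signal
frozen = guarded_estimate([3.50, 3.50, 3.50], check, model)  # probe stuck
spike = guarded_estimate([3.40, 45.0, 3.50], check, model)   # implausible value
```

Returning `None` rather than a number forces whatever consumes the estimate to handle the "do not believe me" case explicitly.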
Machine learning, plainly
Machine learning (ML) is software that improves its predictions by finding patterns in examples, rather than following rules a human wrote out by hand. Two broad kinds matter here. In supervised learning, you train on examples that come with the right answer attached (past batches where you logged both the cheap signals and the measured titer), so the model learns the mapping between them. A titer soft sensor is supervised learning. In unsupervised learning, there are no answers to copy; the algorithm groups or simplifies the data on its own, for instance by clustering batches into "behaved normally" and "drifted oddly," a cousin of the multivariate monitoring from the last chapter [5][9].
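To make the unsupervised case concrete, here is a minimal sketch of clustering batches into two groups with a plain k-means loop. It assumes each batch has been boiled down to one summary number (peak lactate, with invented values), and the function `two_means` is hypothetical. Nobody tells the algorithm which batches were odd; it only sees the numbers.

```python
def two_means(values, iterations=10):
    """Split 1-D values into two clusters with a basic k-means loop (k = 2)."""
    centers = [min(values), max(values)]  # deterministic starting guess
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers


# Hypothetical per-batch summary: peak lactate (g/L) for eight past batches.
peak_lactate = [1.8, 2.1, 1.9, 2.0, 4.2, 2.2, 4.5, 1.7]
labels, centers = two_means(peak_lactate)
print(labels)
```

The supervised counterpart is any fitted mapping from logged signals to a measured answer, such as a regression from online readings to assayed titer.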
ML has been applied across the whole bioprocess workflow: selecting cell lines, optimizing media, predicting scale-up, and monitoring and controlling production [5]. The reviews that survey this work are enthusiastic. They are also unusually frank about a problem most ML textbooks never face.
Most celebrated machine learning lives in a world of big data: millions of photos, billions of words. Biomanufacturing lives in a world of small data. A single batch can cost weeks and a fortune, so a process team may have only dozens of complete runs to learn from, not millions [5][9]. A model that is data-hungry will simply starve or, worse, overfit: memorize the handful of batches it saw and fail on the next one.
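Overfitting is easy to demonstrate on a handful of points. The sketch below, using invented data, compares a "memorizer" (a nearest-neighbor lookup, which reproduces its training data perfectly) with a plain straight-line fit, and scores both by leave-one-out testing: hide one batch, train on the rest, predict the hidden one. The helper names are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def nearest_neighbor(train_x, train_y, x_new):
    """A memorizer: return the target of the closest training example."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    return train_y[idx]


# Six invented batches: a cheap signal and a noisy titer that roughly tracks it.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.8, 3.3, 3.7, 5.4, 5.6]

# On its own training data the memorizer looks perfect.
train_err_nn = mean(abs(nearest_neighbor(x, y, xi) - yi) for xi, yi in zip(x, y))

# Leave-one-out: each batch is predicted by a model that never saw it.
nn_errs, line_errs = [], []
for i in range(len(x)):
    tx, ty = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    nn_errs.append(abs(nearest_neighbor(tx, ty, x[i]) - y[i]))
    slope, intercept = fit_line(tx, ty)
    line_errs.append(abs(slope * x[i] + intercept - y[i]))

loo_nn, loo_line = mean(nn_errs), mean(line_errs)
print(train_err_nn, loo_nn, loo_line)
```

On this toy data the memorizer's flawless training score disappears once it is tested on unseen batches, while the simpler line holds up better. With only dozens of runs, held-out testing of this kind is the only honest score.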
This small-data, high-cost-experiment reality is the binding constraint of bioprocess ML, and it is exactly what motivates the next idea [9].
Hybrid models: physics plus data
If you have very little data, the smartest move is to stop asking the data to learn everything from scratch. We already know a great deal about how bioreactors behave: mass balances, reaction kinetics, the basic arithmetic of cells consuming sugar and producing protein. That knowledge is a mechanistic (or first-principles) model: equations derived from physics and chemistry, not from data.
A hybrid model, also called a gray-box or semi-parametric model, combines the two. It keeps the trustworthy mechanistic backbone and uses a machine-learning component only for the parts we cannot write down cleanly, such as how cell growth rate depends on a messy combination of conditions [1]. The name "gray-box" sits deliberately between a transparent white box (pure equations) and an opaque black box (pure ML). Von Stosch and colleagues' survey gives the field its taxonomy: structurally, the data-driven and mechanistic parts can sit in series or in parallel, but in every case the physics constrains what the data is allowed to conclude [1].
A hybrid (gray-box) model: a mechanistic backbone supplies known physics while a machine-learning component covers what is hard to write down, and the physics keeps the learned part honest. Diagram by the authors.
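A serial hybrid structure can be sketched in a few lines. This is an illustrative toy under stated assumptions, not a model from [1]: the "learned" part is a stand-in linear map from temperature to maximum growth rate with made-up coefficients, and the mechanistic part is Monod growth with a glucose mass balance, integrated by explicit Euler steps. All names and parameter values are hypothetical.

```python
def learned_mu_max(temperature_c, coeffs):
    """Data-driven part: stand-in for a fitted model of max growth rate (1/h)."""
    a, b = coeffs
    return max(0.0, a * temperature_c + b)


def simulate(x0, s0, temperature_c, coeffs, hours,
             dt=0.1, ks=0.5, yield_xs=0.5):
    """Mechanistic part: Monod kinetics plus mass balances (explicit Euler)."""
    x, s = x0, s0  # biomass and glucose, both in g/L
    mu_max = learned_mu_max(temperature_c, coeffs)
    for _ in range(int(round(hours / dt))):
        mu = mu_max * s / (ks + s)               # growth slows as glucose runs out
        growth = min(mu * x * dt, s * yield_xs)  # cannot eat glucose that is not there
        x += growth
        s -= growth / yield_xs                   # every gram of cells costs glucose
    return x, s


fitted = (0.002, -0.033)  # hypothetical coefficients: mu_max = 0.04 1/h at 36.5 C
x_end, s_end = simulate(0.5, 10.0, 36.5, fitted, hours=48)

wild = (1.0, 0.0)         # a badly wrong learned part: mu_max = 36.5 1/h
x_wild, s_wild = simulate(0.5, 10.0, 36.5, wild, hours=48)

ceiling = 0.5 + 0.5 * 10.0  # biomass limit set by the mass balance alone
print(x_end, s_end, x_wild, s_wild, ceiling)
```

Even when the learned part is absurdly wrong, the mass balance caps biomass at what the available glucose can support. That is the sense in which the physics constrains what the data-driven part can conclude.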
Why does this fit data-scarce bioprocesses so well? Because the mechanistic part contributes knowledge the data never had to supply, the ML part has far less to learn, so it can succeed on a handful of batches where a pure black box would fail [1][3]. The evidence is concrete: for mammalian-cell processes making therapeutic proteins, Narayanan and colleagues showed that hybrid models predicted process behavior more accurately than either a purely mechanistic or a purely data-driven model on its own [8][1][3]. Hybrid modeling is not a compromise between two methods; in the small-data regime of bioprocesses, it often outperforms both on their own.
A soft sensor that survives the unseen: mechanistic physics keeps the learned model honest.
Original diagram by the authors, created with AI assistance.
It is also a practical enabler of the regulatory frameworks this book keeps returning to. An expert panel convened by von Stosch and colleagues argued that hybrid models are well suited to Quality by Design (QbD) and PAT, because a model that respects physics generalizes more safely across the design space (the proven region of operating conditions that reliably yields acceptable product) than one that has only ever seen a few points inside it [2]. This is the same QbD logic that ICH Q8(R2) (Pharmaceutical Development) and ICH Q9(R1) (Quality Risk Management) formalize: define a design space, and manage risk by staying inside it. The book Hybrid Modeling in Process Industries collects the theory and cross-industry case studies behind these claims [3].
Validating AI under GxP
A model that helps you understand a process is one thing. A model that decides something about a medicine (that releases a batch, sets a feed rate, or stands in for a lab test) is a regulated object, and that changes everything. GxP is the umbrella term for the "Good Practice" rules (Good Manufacturing, Laboratory, and Clinical Practice, among others) that govern medicine making. Under GxP, you cannot simply deploy a clever model; you must validate it and keep it trustworthy for its whole life.
Three difficulties make AI harder to govern than ordinary software. The first is model drift: the real process slowly changes (a new raw-material lot, an aging probe, a seasonal shift) until the world no longer matches the data the model learned from, and its predictions quietly decay. The second is explainability: a black-box model can be accurate yet unable to say why, which is uncomfortable when a regulator asks you to justify a decision about a human medicine. The third is the validation question itself: a model that keeps learning after deployment is a moving target, and traditional one-time validation was never designed for something that changes.
The U.S. FDA has begun to map this territory. Its 2023 CDER discussion paper, Artificial Intelligence in Drug Manufacturing, lays out the open questions for AI under cGMP: how to manage the data a model is trained on, how to validate and re-validate models, and how to apply risk-based expectations so that a model touching a critical decision faces more scrutiny than one that does not [4]. It is a discussion paper, not a finished rule: the regulatory frontier is still being drawn [4].
This is where the Computer Software Assurance (CSA) thinking from Chapter 11 earns its keep. CSA's core move (spend your validation effort in proportion to risk, and lean on continuous evidence rather than one heroic up-front test) is exactly the posture a learning model demands. An AI soft sensor needs ongoing assurance: monitoring its predictions for drift, defining when it must be retrained, and documenting that whole lifecycle, much as CPV monitors a process forever [4].
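Drift monitoring can start very simply: each time an offline assay arrives, compare it with what the soft sensor predicted for that sample, and watch the running bias. The sketch below is one possible approach under stated assumptions, not a regulatory requirement. The class `DriftMonitor`, the five-sample window, and the 10% limit are all hypothetical; a real limit would come from a documented risk assessment.

```python
from collections import deque


class DriftMonitor:
    """Track the recent relative bias of predictions against offline assays."""

    def __init__(self, window=5, rel_bias_limit=0.10):
        self.errors = deque(maxlen=window)  # keeps only the newest `window` errors
        self.rel_bias_limit = rel_bias_limit

    def add_pair(self, predicted, assayed):
        """Record the relative error once the offline result is available."""
        self.errors.append((predicted - assayed) / assayed)

    def mean_relative_bias(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def needs_review(self):
        """Flag only when the window is full and the average bias is too large."""
        full = len(self.errors) == self.errors.maxlen
        return full and abs(self.mean_relative_bias()) > self.rel_bias_limit


monitor = DriftMonitor()

# Invented (prediction, assay) pairs in g/L: the model agrees with the lab.
for pred, assay in [(3.0, 3.0), (4.1, 4.0), (3.9, 4.0), (5.0, 5.0), (2.0, 2.0)]:
    monitor.add_pair(pred, assay)
flag_before = monitor.needs_review()

# Invented pairs after a process change: the model now reads about 12% high.
for pred, assay in [(3.36, 3.0), (4.48, 4.0), (5.6, 5.0), (2.24, 2.0), (3.92, 3.5)]:
    monitor.add_pair(pred, assay)
flag_after = monitor.needs_review()
bias_after = monitor.mean_relative_bias()
```

A flag like this does not retrain anything by itself. It is a trigger for the documented review and retraining procedure the paragraph above describes.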
The same expectations apply on both sides of the Atlantic. In the United States, a model that touches a regulated record falls under 21 CFR Part 11 (electronic records and signatures); in the European Union, EU GMP Annex 11 (Computerised Systems) governs the validation of computational systems, including the models embedded in them. Both demand that a system be validated for its intended use, kept under change control, and audit-trailed throughout its life, which, for a model that keeps learning, means the validation never truly ends.
Why it matters
Every promise in this chapter rests on one foundation: data. A soft sensor is only as good as the batches it learned from. A hybrid model still needs clean, contextualized measurements to fit its data-driven part. And an AI under GxP cannot be validated unless its training data is itself trustworthy: attributable, complete, and well-described [4]. The reviews that survey ML in bioprocessing keep landing on the same conclusion: the binding constraint is rarely the algorithm. It is the quantity and quality of the data available to feed it [5][9]. This is the throughline of the entire book. The reason to manage data well is not only to satisfy an auditor; it is to make the most advanced tools in the field actually work.
In the real world
The most successful industrial deployments so far have been pragmatic, not flashy. Raman-based soft sensors that hold glucose at a setpoint, and hybrid models used to design experiments and shrink the number of costly runs, are exactly the targeted, physics-anchored applications the bioprocess ML literature recommends over grand black-box ambitions [5][8].
This is not theoretical. A typical glucose soft sensor runs on a process Raman analyzer, for example a Kaiser Optical Systems RamanRxn probe (now part of Endress+Hauser), that collects a spectrum every minute or two. A chemometric model turns each spectrum into a live reading the controller can act on instantly: it might report glucose = 3.8 g/L in real time, while the offline reference assay returns its own value, 3.2 g/L, only four hours later. That few-hours head start is the entire value: by the time the lab result arrives, the feed has already been corrected. The models themselves are ordinary software artifacts: vendor platforms such as Sartorius's SIMCA build and deploy chemometric models through their own runtime (SIMCA-Q / SIMCA-online). More broadly, trained ML models are increasingly serialized in interchange formats like PMML (Predictive Model Markup Language) or ONNX (Open Neural Network Exchange) so that a model fit on one system can be deployed, version-controlled, and audited on another.
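The idea of a model as a portable, auditable artifact can be illustrated without any ML library. The sketch below does not use PMML or ONNX; it substitutes plain JSON as a stand-in format and adds a SHA-256 fingerprint so that a deployed copy can be checked against the approved one. The function names, field names, and batch identifiers are hypothetical.

```python
import hashlib
import json


def package_model(name, version, coefficients, training_batches):
    """Serialize a model record deterministically and fingerprint it."""
    record = {
        "name": name,
        "version": version,
        "coefficients": coefficients,
        "training_batches": sorted(training_batches),
    }
    # sort_keys and fixed separators make the text identical on every system.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest


def verify_model(payload, expected_digest):
    """True only if the payload is byte-for-byte what was approved."""
    actual = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return actual == expected_digest


payload, digest = package_model(
    name="glucose_soft_sensor",  # hypothetical model name
    version="1.0.0",
    coefficients={"slope": 1.97, "intercept": 0.09},
    training_batches=["B-003", "B-001", "B-002"],  # invented batch IDs
)

restored = json.loads(payload)
tampered = payload.replace("1.97", "2.97")
print(verify_model(payload, digest), verify_model(tampered, digest))
```

Recording the fingerprint alongside the version in the change-control record is one way to show that the model running in production is the one that was validated.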
Validation frameworks are catching up too: alongside the FDA's AI discussion paper [4], the established computerized-system validation thinking we met in Chapter 11 is extending toward models, so that "validate your software" can begin to mean "validate your model" as well. And in the U.S., NIIMBL's new SABRE (Securing American Biomanufacturing Research and Education) pilot facility, being built to scale up and de-risk advanced biomanufacturing, points toward the dense, continuous data streams that, as the broader field moves toward continuous and intensified processing, make real-time soft sensing both possible and valuable. The frontier is not a robot that replaces the scientist; it is a model that earns its trust the same way every batch record does: through disciplined, defensible data.
Key terms
- Soft sensor (virtual / inferential sensor): software that estimates a hard-to-measure quantity from cheaper signals that are available in real time.
- Titer: the concentration of product (such as antibody) in the bioreactor broth.
- Viable cell density (VCD): the number of living, working cells in the culture.
- Chemometrics: statistical methods that turn chemical spectra, such as Raman, into concentration numbers.
- Machine learning (ML): software that improves its predictions by finding patterns in examples rather than following hand-written rules.
- Supervised learning: ML trained on examples that already carry the correct answer.
- Unsupervised learning: ML that finds structure in data with no answers provided.
- Overfitting: when a model memorizes its training examples and fails on new ones.
- Small data: the regime, typical in biomanufacturing, where experiments are so costly that only a few are available to learn from.
- Mechanistic (first-principles) model: a model built from physics and chemistry equations rather than from data.
- Hybrid (gray-box / semi-parametric) model: a model that combines a mechanistic backbone with a data-driven component.
- Model drift: the gradual decay of a model's accuracy as the real process changes away from its training data. For example, a glucose soft sensor trained on 2023 batches might begin reading 10 to 15% high after a new raw-material supplier is introduced, a clear signal that it needs retraining.
- Explainability: the degree to which a model can justify why it made a prediction.
- GxP: the family of "Good Practice" regulations governing medicine development and manufacture.
Where this leads
Soft sensors, hybrid models, and validated AI do not live in isolation; they only deliver their value when they plug into a factory that runs and integrates in real time. The next chapter, Real-Time Integration and Pharma 4.0: The Smart, Continuous Factory, pulls every thread of this book together: continuous and intensified processing, Real-Time Release Testing, the Pharma 4.0 vision, and the live data-integration efforts (above all the NIIMBL/NIST real-time lab-data proof of concept, the IOF Biopharma / BMIC work) that demand everything we have built, all at once. We close, honestly, on data spaces and the still-distant dream of the autonomous bioprocess.