From Data to Knowledge: SPC, Multivariate Analysis, and Continued Process Verification
Where we are: Part V opens. We have spent the whole book making data trustworthy, connected, and meaningful; now we finally use it, turning streams of numbers into control, assurance, and early warning.
The previous chapter wove everything together into two grand constructs. The digital thread is one connected, traceable record that follows a product from its first design idea all the way to the patient, with every measurement, decision, and deviation linked in a single chain. The digital twin is a living computational model of the process, continuously fed by that thread, mirroring the real plant as it runs. Both are spectacular achievements, and both are useless if no one reads them. A thread you never analyze is just an archive; a twin no one interrogates is just a dashboard no one watches.
This chapter is about reading the data: the classical analytics that convert governed, connected numbers into knowledge you can act on. These methods are decades old, peer-reviewed, and embedded in regulatory expectation; they are the foundation every newer technique in the next chapter builds on.
Think of a car's dashboard. The speedometer needle wobbles a little even on cruise control; that gentle jitter is normal and tells you nothing is wrong. But if the temperature gauge suddenly climbs, that is a signal: something specific changed, and you should pull over before the engine exceeds its limit. This chapter's core work is learning to tell the harmless wobble from the meaningful signal, and to catch the signal long before the gauge hits the red line.
What this chapter covers
We start with Statistical Process Control (SPC) for a single measurement, then explain why bioprocesses, with hundreds of tangled variables, demand multivariate methods (PCA, PLS, and multivariate SPC). We then meet Continued Process Verification (CPV), the regulatory expectation that you keep watching every commercial batch for as long as the product is made, and see what data infrastructure that promise actually requires.
Telling wobble from signal: classical SPC
Every process varies. Fill a hundred vials from the same tank and they will not weigh exactly the same. The central insight of statistical quality control, originating with Walter Shewhart's control-chart method (1920s-1930s) and later systematized by W. Edwards Deming, is that variation comes in two flavors [7][8]. Common-cause variation is the background hum of the process: the sum of countless tiny, unavoidable influences. It is stable and predictable. Special-cause variation is something new and assignable, such as a clogged filter, a wrong raw-material lot, or a miscalibrated probe. It is the temperature gauge climbing [7][8].
A control chart is the tool that separates the two. You plot a measurement over time and draw control limits, conventionally set at the historical mean ± 3σ (three standard deviations). For roughly normal data under common-cause variation alone, those limits capture about 99.7% of expected values, so a point outside is statistically rare enough to be worth investigating. Points dancing inside the limits are common-cause noise; a point outside, or a non-random pattern (a run, a trend), flags a special cause to investigate [7]. The simplest is the Shewhart chart, which reacts to each point on its own. Two cousins add memory: CUSUM (cumulative sum) and EWMA (exponentially weighted moving average) accumulate small deviations over time, so they catch slow drifts a Shewhart chart would shrug off [7][5].
"In spec" is not the same as "in control." A specification limit is a quality requirement: the value a product must meet to be acceptable. A control limit is a statistical description of how the process actually behaves. A batch can be inside spec yet wildly out of control (a special cause that simply has not breached the requirement yet), and a process can be in perfect control yet incapable of meeting spec. Confusing the two is one of the most common and costly errors in manufacturing [7].
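To make the distinction concrete, here is a minimal sketch of a Shewhart chart and an EWMA chart applied to the same data, alongside a specification check. All numbers, the specification limits, and the function names are hypothetical. Two simplifications are assumed: sigma is taken as the overall sample standard deviation of the reference data (production charts often estimate it from moving ranges), and the EWMA uses its steady-state limits with a smoothing weight of 0.2.

```python
from statistics import mean, stdev

def shewhart_limits(reference, n_sigma=3.0):
    """Lower limit, center line, upper limit from in-control reference data."""
    center = mean(reference)
    sigma = stdev(reference)  # simplification: overall sample standard deviation
    return center - n_sigma * sigma, center, center + n_sigma * sigma

def ewma_series(values, target, lam=0.2):
    """EWMA statistic z_i = lam*x_i + (1 - lam)*z_(i-1), started at the target."""
    z, out = target, []
    for x in values:
        z = lam * x + (1 - lam) * z
        out.append(z)
    return out

def ewma_limits(target, sigma, lam=0.2, n_sigma=3.0):
    """Steady-state EWMA limits: target +/- n_sigma*sigma*sqrt(lam/(2 - lam))."""
    half_width = n_sigma * sigma * (lam / (2 - lam)) ** 0.5
    return target - half_width, target + half_width

# Hypothetical fill weights (grams) from a period judged to be in control.
reference = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9]
# Hypothetical new data with a small sustained upward shift.
new = [10.2, 10.3, 10.2, 10.4, 10.3, 10.2, 10.3, 10.3, 10.2, 10.4]
# Hypothetical specification limits: a quality requirement, not a statistic.
LSL, USL = 9.0, 11.0

lcl, center, ucl = shewhart_limits(reference)
shewhart_alarms = [i for i, x in enumerate(new) if not lcl <= x <= ucl]

e_lcl, e_ucl = ewma_limits(center, stdev(reference))
ewma_alarms = [i for i, z in enumerate(ewma_series(new, center))
               if not e_lcl <= z <= e_ucl]

in_spec = all(LSL <= x <= USL for x in new)

print("All new points in spec:", in_spec)
print("Shewhart alarms at indices:", shewhart_alarms)
print("EWMA alarms at indices:", ewma_alarms)
```

With these made-up values every new point is in spec and inside the Shewhart limits, yet the EWMA crosses its limit at the fifth point: the process is in spec but no longer in control.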
One subtlety the literature insists on: building a chart and using a chart are different jobs. Phase I is the retrospective study where you analyze past data to establish what "in control" even means and to set the limits. Phase II is live monitoring, where new points are checked against those established limits. Treating the two as one, by setting limits from the same data you are judging, quietly corrupts the chart's statistics [8].
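The two phases can be sketched as two separate functions. This is an illustrative simplification, not a prescribed procedure: the Phase I step below drops any historical point outside the provisional limits and recomputes, whereas in practice a point is normally excluded only after an assignable cause has been found. The data and function names are hypothetical.

```python
from statistics import mean, stdev

def _limits(data, n_sigma):
    center, sigma = mean(data), stdev(data)
    return center - n_sigma * sigma, center + n_sigma * sigma

def phase1_limits(history, n_sigma=3.0, max_rounds=10):
    """Phase I: study past data and settle on limits for an in-control process."""
    data = list(history)
    for _ in range(max_rounds):
        lo, hi = _limits(data, n_sigma)
        kept = [x for x in data if lo <= x <= hi]
        if len(kept) == len(data):
            break  # nothing left outside the provisional limits
        data = kept
    return _limits(data, n_sigma)

def phase2_flags(new_points, limits):
    """Phase II: judge new points against limits that are now frozen."""
    lo, hi = limits
    return [not lo <= x <= hi for x in new_points]

# Hypothetical history: twenty ordinary values plus one special-cause excursion.
clean = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9]
history = clean + clean + [12.0]

frozen = phase1_limits(history)   # limits after the retrospective study
naive = _limits(history, 3.0)     # limits if the excursion is left in

new_points = [10.1, 10.6, 9.9]
print("Phase II flags with studied limits:", phase2_flags(new_points, frozen))
print("Phase II flags with naive limits:  ", phase2_flags(new_points, naive))
```

In this example the excursion left in the history widens the naive limits enough that the 10.6 reading passes unnoticed, while the studied limits flag it.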
Why one chart is not enough: going multivariate
Now the bioprocess problem. A single run on a modern bioreactor (a Sartorius ambr 250 high-throughput vessel, a Cytiva Xcellerex XDR single-use stirred tank, or a Cytiva WAVE rocking bag) is described not by one number but by hundreds: temperature, pH, dissolved oxygen, agitation, several gas flows, feed rates, off-gas readings, dozens of online and offline measurements, sampled every few seconds for days. Worse, these variables are deeply correlated: they move together. Raise agitation and dissolved oxygen rises; feed glucose and the pH shifts. They are not independent dials [4].
You cannot solve this by drawing three hundred separate control charts. Each individual chart can look perfectly normal while the combination is bizarre, like a person whose height is normal and whose weight is normal, but whose height-for-that-weight is alarming. Univariate charts are blind to the correlations, and they also inflate false alarms: as you add charts, the chance that at least one trips by accident climbs quickly, and with hundreds of charts false alarms become routine [7][5].
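The false-alarm inflation is easy to put numbers on. The sketch below assumes, purely for illustration, that the charts are independent and that each uses 3σ limits on normal data (a per-point false-alarm probability of about 0.0027). Real bioprocess variables are correlated, so the exact figures will differ, but the direction of the effect is the same.

```python
def family_false_alarm_rate(n_charts, per_chart_alpha=0.0027):
    """P(at least one false alarm) across independent charts at one time point."""
    return 1.0 - (1.0 - per_chart_alpha) ** n_charts

for m in (1, 10, 100, 300):
    print(f"{m:>3} charts -> {family_false_alarm_rate(m):.3f}")
```

Under these assumptions, three hundred charts give better-than-even odds that at least one is alarming at any given sampling instant, even when nothing is wrong.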
Each sensor in spec, yet the batch is abnormal; only the multivariate view sees it (after Nomikos & MacGregor).
Original diagram by the authors, created with AI assistance.
The escape is dimensionality reduction. Principal Component Analysis (PCA) is the foundational tool: it finds a handful of new combined variables, the principal components, that capture most of the genuine variation in the data, exploiting the correlations to compress hundreds of measurements into two or three meaningful axes [4]. Its partner, Partial Least Squares (PLS), does the same compression but aims it at predicting an outcome (say, final product quality) from the process variables [4]. Both are latent-variable methods: they assume the many things you measure are driven by a few hidden things you cannot directly see (the true state of the cell culture), and they reconstruct those hidden drivers [4].
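As a rough illustration of the latent-variable idea, the sketch below builds six synthetic measurements from two hidden drivers plus noise, then recovers a two-component PCA model with a singular value decomposition. The data, the mixing matrix, and the function name `fit_pca` are all invented for the example; commercial MVDA tools add cross-validation, missing-data handling, and more.

```python
import numpy as np

def fit_pca(X, n_components):
    """Autoscale the columns of X, then extract loadings and scores via SVD."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    Z = (X - mu) / sd                       # each variable: mean 0, variance 1
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T          # variables x components
    scores = Z @ loadings                   # observations x components
    explained = (S ** 2 / np.sum(S ** 2))[:n_components]
    return mu, sd, loadings, scores, explained

# Synthetic data: two hidden drivers generate six correlated measurements.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.8, -0.6, 0.0, 0.3, 0.5],
                   [0.0, 0.4, 0.7, 1.0, -0.9, 0.5]])
X = hidden @ mixing + 0.1 * rng.normal(size=(200, 6))

mu, sd, loadings, scores, explained = fit_pca(X, n_components=2)
print("scores shape:", scores.shape)
print("fraction of variance in 2 components:", round(float(explained.sum()), 3))
```

Because the six columns really are driven by two hidden variables, two components should capture nearly all the variance, which is the compression the paragraph above describes.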
With the data compressed, you monitor it with just two charts instead of hundreds. Hotelling's T-squared asks: is the process operating in a normal region of its multivariate space? It catches unusual combinations even when every single variable looks fine [4][5]. Its alarm threshold is not guessed: it comes from the F-distribution with k and (n - k) degrees of freedom, where k is the number of retained principal components and n the number of observations, and is usually drawn at the 99% confidence level (a deliberate trade-off, typically tuning the per-batch false-alarm rate to roughly 1 to 2%). The SPE chart (squared prediction error, also called Q) asks the complementary question: does the process still behave like the model expects, or has something genuinely new appeared that the model never saw? Together they give a near-complete health check from two lines on a screen [4][5].
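The two statistics can be sketched in a few lines. Everything here is an illustrative assumption: the data are synthetic (three variables driven by one hidden factor), the demo thresholds are simply the empirical 99th percentiles of the reference data, and `t2_limit` encodes one common form of the F-based limit for new observations. Its `f_crit` argument is the F quantile for k and n - k degrees of freedom, which you would look up or compute separately (for example with `scipy.stats.f.ppf(0.99, k, n - k)`).

```python
import numpy as np

def fit_model(X, k):
    """Autoscale, keep k principal components, record the score variances."""
    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mu) / sd
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:k].T
    score_var = (Z @ P).var(axis=0, ddof=1)
    return mu, sd, P, score_var

def t2_and_spe(X, model):
    """Hotelling's T-squared (inside the model plane) and SPE (distance off it)."""
    mu, sd, P, score_var = model
    Z = (X - mu) / sd
    T = Z @ P
    t2 = np.sum(T ** 2 / score_var, axis=1)
    spe = np.sum((Z - T @ P.T) ** 2, axis=1)
    return t2, spe

def t2_limit(k, n, f_crit):
    """F-based T-squared limit for a new observation; f_crit = F quantile(k, n-k)."""
    return k * (n - 1) * (n + 1) / (n * (n - k)) * f_crit

# Synthetic reference data: three variables that rise and fall together.
rng = np.random.default_rng(1)
driver = rng.normal(size=(200, 1))
X_ref = driver @ np.ones((1, 3)) + 0.1 * rng.normal(size=(200, 3))
model = fit_model(X_ref, k=1)

t2_ref, spe_ref = t2_and_spe(X_ref, model)
t2_lim = np.percentile(t2_ref, 99)    # empirical stand-in for the F-based limit
spe_lim = np.percentile(spe_ref, 99)  # empirical stand-in for a Q limit

X_new = np.array([[4.0, 4.0, 4.0],    # extreme, but the usual correlation holds
                  [1.5, -1.5, 0.0]])  # ordinary values, broken correlation
t2_new, spe_new = t2_and_spe(X_new, model)
print("T2 alarms: ", t2_new > t2_lim)
print("SPE alarms:", spe_new > spe_lim)
```

The first new point is far out along the direction the model knows, so it should trip T-squared; the second has unremarkable individual values but violates the learned correlation, so it should trip SPE instead.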
When one of these charts trips, a contribution plot answers the question "which of the original variables is driving the alarm?" and points the investigator straight at the suspect measurement instead of leaving them to hunt through hundreds [4].
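For the SPE chart, one simple way to build such a plot is to split the squared residual into its per-variable terms. The sketch below does that on synthetic data; the tag names are borrowed from this chapter only as labels, and the values are made-up scaled numbers rather than real process units.

```python
import numpy as np

tags = ["BR101.Temp.PV", "BR101.DO.PV", "BR101.pH.PV"]

# Synthetic reference data in which all three variables move together.
rng = np.random.default_rng(2)
driver = rng.normal(size=(200, 1))
X_ref = driver @ np.ones((1, 3)) + 0.1 * rng.normal(size=(200, 3))

mu, sd = X_ref.mean(axis=0), X_ref.std(axis=0, ddof=1)
Z_ref = (X_ref - mu) / sd
P = np.linalg.svd(Z_ref, full_matrices=False)[2][:1].T  # one-component model

def spe_contributions(x, mu, sd, P):
    """Per-variable squared residuals; they sum to the observation's SPE."""
    z = (x - mu) / sd
    residual = z - (z @ P) @ P.T
    return residual ** 2

# Hypothetical alarming observation: the second variable moves against the others.
x_alarm = np.array([1.0, -2.0, 1.0])
contrib = spe_contributions(x_alarm, mu, sd, P)
ranked = sorted(zip(tags, contrib), key=lambda pair: pair[1], reverse=True)
for tag, value in ranked:
    print(f"{tag:<15} {value:.2f}")
```

The variable at the top of the ranking is where an investigation would start; a large contribution is a pointer to a symptom, not by itself proof of the root cause.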
The golden batch: monitoring a process that evolves
A batch is not a steady state; it is a trajectory. The "normal" temperature on day one is not the normal temperature on day five; the culture grows, consumes, and changes throughout. So the reference for "in control" must itself be a moving target shaped like a healthy run.
The landmark solution is Multiway PCA (MPCA), introduced by Nomikos and MacGregor [3]. You take a historical library of past successful batches (the golden batches), each a full time-trajectory of every variable, and you unfold that three-dimensional block (batches × variables × time) into a form ordinary PCA can digest. The result is a multivariate fingerprint of what a good batch looks like at every moment of its life. A new batch is then monitored as it runs, point by point, against the time-varying control limits derived from that golden reference distribution [2][3].
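The unfolding step is the part that tends to look mysterious, so here is a minimal sketch of it. It assumes every batch has already been aligned to the same number of time points, uses a synthetic block of 30 batches, 3 variables, and 50 time points, and stops before the PCA itself. Handling a batch that is still running, where future values are not yet known, is a further step not shown here.

```python
import numpy as np

def unfold_batchwise(block):
    """Unfold (batches, variables, time) into a 2-D (batches, variables*time) matrix."""
    n_batches, n_vars, n_times = block.shape
    # Column j*n_times + t holds variable j at time point t.
    return block.reshape(n_batches, n_vars * n_times)

def center_and_scale(X):
    """Subtract the mean trajectory and scale each variable-time column."""
    mean_traj = X.mean(axis=0)
    spread = X.std(axis=0, ddof=1)
    return (X - mean_traj) / spread, mean_traj, spread

# Synthetic golden-batch library: nominal trajectories plus batch-to-batch noise.
rng = np.random.default_rng(3)
n_batches, n_vars, n_times = 30, 3, 50
t = np.linspace(0.0, 1.0, n_times)
nominal = np.stack([37.0 + 0.0 * t,    # flat temperature-like trajectory
                    60.0 - 20.0 * t,   # falling DO-like trajectory
                    7.1 - 0.2 * t])    # drifting pH-like trajectory
block = nominal + 0.05 * rng.normal(size=(n_batches, n_vars, n_times))

X = unfold_batchwise(block)
Z, mean_traj, spread = center_and_scale(X)
print("unfolded shape:", X.shape)
print("rows are batches, columns are (variable, time) pairs")
```

Centering each column removes the average trajectory, so what remains in `Z` is each batch's deviation from the typical healthy run at every time point. An ordinary PCA on `Z` then models how good batches vary around that trajectory.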
The payoff is early fault detection. Because the model knows the expected trajectory, a developing problem typically nudges the T-squared or SPE chart off course before any single sensor breaches a limit, and long before the final quality test, leaving time to investigate and potentially intervene. How much early warning you actually get depends on the process: for a 14-day fed-batch it may be many hours, while for a fast continuous step it may be minutes, and whether the batch can be rescued depends on the root cause [2].
From a library of good batches to live, multivariate, trajectory-aware monitoring of a new one. Figure by the authors, after Nomikos and MacGregor.
Continued Process Verification: monitoring as a promise
So far these are statistical tools. Now comes the regulatory force that makes using them non-optional. The FDA's 2011 process-validation guidance reframed validation as a three-stage lifecycle: Stage 1, process design; Stage 2, process qualification (proving the commercial process works); and, crucially, Stage 3, Continued Process Verification (CPV) [1].
CPV is the commitment that validation never ends. For the entire commercial life of the product, the manufacturer must continually monitor and trend data from every batch to give ongoing assurance that the process stays in a state of control [1]. The guidance explicitly calls for statistical methods (exactly the SPC and trending tools above) to detect special-cause variation as it emerges and to drive reduction of common-cause variability over time [1]. This is not a one-time study filed and forgotten; it is a perpetual, data-fed obligation. The data that feeds it is not optional either: the in-process sampling and testing behind those trends sits within the same cGMP world of in-process controls (the kind addressed by 21 CFR 211.110 on sampling and testing of in-process materials and drug products), where manufacturers establish and follow control procedures that monitor output and the performance of the processes responsible for variability. And because much of CPV is built on analytical measurements, the methods that produce those numbers should stay fit for purpose in the lifecycle spirit of ICH Q2(R2) (analytical validation) and ICH Q14 (analytical procedure development): a drifting assay can masquerade as a drifting process.
CPV does not float alone. It is the operational expression of ICH Q10, the pharmaceutical quality system framework, whose central goals are maintaining a state of control and pursuing continual improvement across the product lifecycle through process performance and product quality monitoring [6]. SPC charts are the eyes; CPV and ICH Q10 are the standing orders to keep them open.
Why it matters
CPV turns analytics into a data-management problem of the first order, and it cashes in every chapter that came before. To trend every batch forever, you need historical data that is available and enduring for the product's whole life (Part III's ALCOA+ and retention principles). To compare batch 4 with batch 4,000, every number must mean the same thing across years of instrument and software changes (Part IV's semantics and the digital thread), which is exactly why standards like ANSI/ISA-95 (IEC 62264) for equipment and tag nomenclature and ISA-88 (IEC 61512) for batch recipe structure matter so much: they keep a tag like BR101.Temp.PV (bioreactor 101, temperature, present value) meaning the same physical thing across every system and every year. A model never sees "the bioreactor"; it sees aligned rows like this:
Timestamp,BatchID,BR101.Temp.PV,BR101.DO.PV,BR101.pH.PV
2026-06-14T08:00:00Z,BR101-2241,37.0,52.3,7.02
2026-06-14T08:00:05Z,BR101-2241,37.0,51.8,7.02
2026-06-14T08:00:10Z,BR101-2241,37.1,51.1,7.01
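As a small sketch of what "aligned" buys you, rows in this shape can be grouped into per-batch trajectories with nothing more than the Python standard library. The function name `load_trajectories` is invented for the example, and it assumes every row carries a UTC timestamp in the format shown and a numeric value for every tag.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

raw = """Timestamp,BatchID,BR101.Temp.PV,BR101.DO.PV,BR101.pH.PV
2026-06-14T08:00:00Z,BR101-2241,37.0,52.3,7.02
2026-06-14T08:00:05Z,BR101-2241,37.0,51.8,7.02
2026-06-14T08:00:10Z,BR101-2241,37.1,51.1,7.01
"""

def load_trajectories(text):
    """Group rows by BatchID into {batch: {tag: [values...]}} in time order."""
    batches = defaultdict(lambda: defaultdict(list))
    rows = sorted(csv.DictReader(io.StringIO(text)),
                  key=lambda r: (r["BatchID"], r["Timestamp"]))
    for row in rows:
        batch = batches[row["BatchID"]]
        # "Z" means UTC; swap in an explicit offset so older Pythons can parse it.
        stamp = datetime.fromisoformat(row["Timestamp"].replace("Z", "+00:00"))
        batch["Timestamp"].append(stamp)
        for tag, value in row.items():
            if tag not in ("Timestamp", "BatchID"):
                batch[tag].append(float(value))
    return batches

trajectories = load_trajectories(raw)
batch = trajectories["BR101-2241"]
elapsed = (batch["Timestamp"][-1] - batch["Timestamp"][0]).total_seconds()
print("DO trajectory:", batch["BR101.DO.PV"])
print("seconds covered:", elapsed)
```

The code is trivial precisely because the tag names, units, and timestamps are already consistent; when they are not, most of the effort goes into reconciliation rather than modeling.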
To build a golden-batch model, you must pull thousands of such aligned trajectories from many systems at once (historians, the MES, the LIMS), which only works if Part II's architecture and connectivity standards actually connected them. And the CPV system itself runs under GxP, so it must be validated, audit-trailed, and integrity-controlled like any other regulated record. A CPV program is, in the end, the moment the bill comes due for all the data governance the rest of the book argued for. Skip that groundwork and CPV degrades into manually copying numbers into spreadsheets: slow, error-prone, and blind to the multivariate signals that matter most.
In the real world
Multivariate SPC is mainstream, not theoretical. Commercial process-analytics platforms make this routine: Sartorius SIMCA and SIMCA-online (the former Umetrics tools) and AspenTech's aspenProMV ship MPCA and PLS batch-monitoring out of the box, while general-purpose tools like SAS JMP add PCA/PLS multivariate control charts. All are in routine use, and golden-batch fingerprinting is a standard way large biomanufacturers watch live fermentations and detect faults early [2][3]. Dedicated CPV and trending suites such as MasterControl and Sartorius Umetrics are now part of an ordinary quality stack. CPV programs are an expected part of any commercial biologics license, and inspectors look for evidence that trending is real and acted upon, not cosmetic [1].
This is also precisely where the U.S. NIIMBL institute and its real-time lab-data proof of concept (the work of getting clean, connected lab data flowing in real time) are meant to land. That is a forward-looking aim, complemented by NIIMBL's SABRE pilot facility, a cGMP biomanufacturing and workforce-training facility being built nearby to scale up and de-risk advanced biomanufacturing. The hard part of CPV is rarely the statistics; those are textbook [7]. The hard part is getting clean, aligned, semantically consistent batch data flowing in real time from many instruments and partners so a model can run on it at all. Solve the data plumbing and the analytics in this chapter come almost for free; ignore it and the most advanced model in the world has nothing trustworthy to read.
Key terms
- Statistical Process Control (SPC): using statistics to distinguish normal process variation from meaningful change.
- Common-cause vs special-cause variation: the stable background noise of a process versus a new, assignable disturbance.
- Control chart: a time plot with statistical limits that separates noise from signal.
- Control limit vs specification limit: how the process actually behaves versus the quality requirement it must meet; not the same thing.
- Shewhart / CUSUM / EWMA: control charts reacting to single points, to cumulative deviation, and to weighted recent history.
- Phase I vs Phase II: establishing chart limits from history versus monitoring live data against them.
- Multivariate Data Analysis (MVDA): analyzing many correlated variables together rather than one at a time.
- PCA / PLS: latent-variable methods that compress many correlated variables into a few; PLS aims the compression at predicting an outcome.
- Hotelling's T-squared / SPE (Q): the two multivariate charts: am I in a normal region, and do I still behave as the model expects.
- Contribution plot: a diagnostic that names which original variable caused a multivariate alarm.
- Multiway PCA (MPCA) / golden batch / fingerprint: modeling whole batch trajectories from past successful runs to monitor a new one in real time.
- Continued Process Verification (CPV): FDA Stage 3: perpetual data-driven monitoring of every commercial batch.
- ICH Q10: the pharmaceutical quality system framework mandating a state of control and continual improvement.
Where this leads
These classical methods describe and detect: they tell you when the present moment is unusual. The frontier is to predict: to estimate a quantity you cannot measure in real time, or to forecast where a batch is heading before it gets there. The next chapter, Machine Learning, Soft Sensors, and Hybrid Models, builds directly on the data-driven, multivariate thinking introduced here: soft sensors that infer hard-to-measure values from easy ones, hybrid models that fuse mechanistic biology with data-driven learning, and the distinctive challenge of validating artificial intelligence inside a GxP environment.