Manufacturing Operations: Predictive Maintenance, Yield, and Scheduling
📍 Where we are: Part VI · The Whole System — Chapter 23. The last chapter built the MLOps spine — drift, retraining, and the validation paradox — that keeps any single model honest after deployment. Now we pull back from the model to the plant: the equipment, the schedule, and the batch outcome that all those unit-operation models add up to.
Every chapter before this one looked down a microscope at one step — a soft sensor on one bioreactor, a hybrid model of one capture column, a vision system on one filling line. This chapter looks up, at the factory as a system: a hall full of vessels, skids, chillers, and air handlers that wear out; a calendar of batches competing for those vessels; and, at the end of each campaign, a release decision that is either a clean pass or an expensive failure. The machine learning that lives at this altitude is the least glamorous in the book and, not coincidentally, the most deployed. It rarely touches a critical quality attribute directly. It keeps a motor from seizing, flags a batch that is drifting toward an out-of-spec result, drafts the review of a clean record so a human only reads the exceptions, and packs more product into the same steel.
This is also where the running example finally pays a debt. Across five books we have followed BATCH-2026-001 and its siblings, and we have always known there was one out-of-spec sibling, BATCH-2026-004. In this chapter it stops being a footnote and becomes a target: the positive label a batch-failure model is trying to predict before the lab confirms it. The honest part of the story is that with six batches you cannot train such a model — and that constraint, not the algorithm, is the real subject.
A good factory manager does three boring, valuable things. They listen to the machines — a bearing that sounds slightly rough today will fail next month, so fix it on a planned Saturday, not in the middle of a batch. They watch each batch as it runs and develop a gut feel for "this one's going sideways" before the final test confirms it. And they pack the schedule so no expensive vessel sits idle. Plant-level machine learning is software that does exactly those three things — predict the breakdown, predict the bad batch, pack the calendar — from the data the plant already generates. None of it cures cancer; all of it keeps the lights on and the steel full.
What this chapter covers
We move up a level of abstraction, from the unit operation to the plant, and walk the four families of operations ML that actually run in industry:
- Predictive maintenance (PdM) — learning an equipment-health signal from vibration, current, and temperature so a failure is caught with lead time, framed as the unsupervised "departure from healthy" problem it usually is.
- Batch-failure / yield / deviation prediction — a batch-level supervised task (one row per batch, not per second), with BATCH-2026-004's real HCP excursion as the failure mode, and an honest reckoning with the six-batch data ceiling.
- Right-first-time and review-by-exception — how ML rides on top of the MES execution layer (Körber PAS-X) to shrink the human review burden, and where the regulatory line forbids it from going further.
- Scheduling, capacity, supply-chain, and energy — the optimization and forecasting problems that are mostly not deep learning, and where the real money is.
Two runnable modules anchor the chapter: examples/platform/ml/batch_outcome.py (yield/failure prediction grounded in the real release table) and examples/platform/ml/pdm.py (an illustrative predictive-maintenance signal).
Predictive maintenance: the plant's most reliable ML
Ask a manufacturing-systems engineer where machine learning has actually earned its keep on the shop floor, and the answer is rarely "the bioreactor." It is the equipment around the bioreactor. A bioreactor's agitation motor, a chromatography skid's feed pump, a centrifuge's drive, a chiller compressor, an HVAC supply fan — each is a rotating or reciprocating machine with a physical signature (vibration spectrum, motor current, bearing temperature, acoustic emission) that drifts slowly as a bearing spalls or a seal degrades, often weeks before the part actually fails [1]. The economic case is stark: an unplanned failure mid-batch can scrap a multi-week run worth far more than the part; a planned swap on a maintenance Saturday costs the part and an hour.
What makes PdM tractable where CQA prediction is hard is that it does not depend on labels. A soft sensor needs the expensive offline reference to learn from. PdM cannot count on an equivalent, because real catastrophic failures are too rare to train a supervised classifier on: you may have one bearing failure per pump per several years. What it has instead is an abundance of healthy operating data. So the deployable formulation is almost always unsupervised: learn what "healthy" looks like from a baseline window of normal operation, then alarm when the live signal departs from that envelope. This is the same conceptual move as the golden-batch envelope and MSPC: build a model of normal, score novelty against it. The simplest version is a one-dimensional control chart on a vibration RMS; the richer versions score a multi-sensor feature vector, with the reconstruction error of a PCA or autoencoder model serving as the health score (isolation forests and one-class SVMs, which produce an anomaly score directly, are common too).
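The multi-sensor version of "model normal, score novelty" can be sketched in a few lines. This is a minimal illustration, not the chapter's module: the three-sensor data is synthetic, the function names `fit_healthy_pca` and `reconstruction_error` are hypothetical, and PCA is done with a plain NumPy SVD rather than any vendor platform's method.

```python
import numpy as np

def fit_healthy_pca(X_healthy: np.ndarray, n_components: int = 1) -> dict:
    """Fit a PCA model of 'normal' on healthy-period rows (days x sensors)."""
    mu = X_healthy.mean(axis=0)
    sd = X_healthy.std(axis=0, ddof=1)
    Z = (X_healthy - mu) / sd
    # Rows of vt are the principal directions; keep the leading ones.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return {"mu": mu, "sd": sd, "components": vt[:n_components]}

def reconstruction_error(model: dict, X: np.ndarray) -> np.ndarray:
    """Squared residual left after projecting onto the healthy subspace."""
    Z = (X - model["mu"]) / model["sd"]
    P = model["components"]
    Z_hat = Z @ P.T @ P          # project onto the subspace and back
    return ((Z - Z_hat) ** 2).sum(axis=1)

# Synthetic, illustrative data: three sensors driven by one shared load factor.
rng = np.random.default_rng(0)
load = rng.normal(size=(60, 1))
healthy = load @ np.array([[1.0, 0.8, 0.5]]) + 0.1 * rng.normal(size=(60, 3))
model = fit_healthy_pca(healthy, n_components=1)

# A seeded fault: sensor 0 rises without the matching change in the others.
faulty = healthy[:5].copy()
faulty[:, 0] += 1.5

healthy_score = reconstruction_error(model, healthy)
faulty_score = reconstruction_error(model, faulty)
print(f"mean healthy score {healthy_score.mean():.3f}, "
      f"mean faulty score {faulty_score.mean():.3f}")
```

The point of the residual is that it ignores variation the healthy model already explains (all sensors moving together with load) and reacts only when the sensors disagree with their usual relationship.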
The departure score, concretely
The reliable, auditable PdM pattern is a z-distance from a healthy baseline. Take a daily vibration RMS for an agitation motor, fit the mean and standard deviation over a baseline window of known-good days, and express every later day as a number of standard deviations from that baseline. When that departure crosses an alarm threshold — three sigma is the natural choice, borrowing the SPC convention — you have an alarm, and crucially you have it with lead time before the hard failure. The lead time is the entire value: it converts an emergency into a scheduled task.
The richer the feature, the earlier the warning. A raw vibration amplitude only spikes near the end; a spectral feature — energy at the bearing's characteristic defect frequency, or the kurtosis of the high-frequency band — moves much sooner, which is why production PdM platforms (AVEVA Predictive Analytics, Siemens Senseye, the equipment-analytics layer in Aizon and Rockwell) lean on frequency-domain features rather than a single scalar [1][2]. The method below is deliberately the scalar version, because it is the honest floor: if even a one-number z-score buys lead time, the spectral version buys more.
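The claim that a spectral feature moves sooner than a raw amplitude can be illustrated on a synthetic signal. The sketch below assumes a hypothetical defect frequency of 157 Hz and a 10 kHz sample rate; both are made-up values for illustration, and `band_energy` is a helper written here, not a function from any PdM platform.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square amplitude: the single scalar a simple chart tracks."""
    return float(np.sqrt(np.mean(x ** 2)))

def band_energy(x: np.ndarray, fs: float, f_lo: float, f_hi: float) -> float:
    """Spectral energy of x between f_lo and f_hi (Hz), via the real FFT."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return float(power[mask].sum())

fs = 10_000                       # assumed sample rate, Hz
t = np.arange(fs) / fs            # one second of signal
rng = np.random.default_rng(1)

# Healthy: a dominant 50 Hz rotation component plus broadband noise.
healthy = np.sin(2 * np.pi * 50 * t) + 0.2 * rng.normal(size=t.size)

# Early fault: a small tone at the (hypothetical) bearing defect frequency.
defect_hz = 157.0
faulty = healthy + 0.15 * np.sin(2 * np.pi * defect_hz * t)

rms_ratio = rms(faulty) / rms(healthy)
band_ratio = (band_energy(faulty, fs, 150, 165)
              / band_energy(healthy, fs, 150, 165))
print(f"RMS changes by x{rms_ratio:.3f}; defect-band energy by x{band_ratio:.1f}")
```

On this toy signal the overall RMS barely changes while the energy in the narrow defect band rises by a large factor, which is the reason a frequency-domain feature can warn earlier than a single amplitude.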
Batch-failure prediction: the OOS as a target
Predictive maintenance is about the equipment. The second family is about the product: can a model, partway through or at the end of a run, predict whether the batch will pass release? This is batch-failure prediction (and its continuous cousins, yield prediction and deviation prediction), and it is a genuinely different ML task from everything earlier in the book — because the unit of analysis is the batch, not the timepoint.
That distinction is the whole difficulty. A Raman soft sensor has thousands of training rows from a single batch, because it predicts a quantity that changes every minute. A batch-failure model has one row per batch — a fixed-length vector of process-summary features (peak viable cell density, integral lactate, harvest turbidity, capture load, cumulative hours outside the temperature NOR) paired with one binary outcome, pass or fail. The number of rows is the number of batches you have ever run, and that is the binding constraint of the entire plant-level enterprise [3]. A commercial process at a single site might accumulate a few hundred batches over years; a clinical or new-product process has a handful. This is the small-data ceiling from Chapter 1, at its most severe, because here it is measured in batches rather than rows.
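To make "one row per batch" concrete, here is a minimal sketch of collapsing a per-timepoint log into a batch-level feature table. The two batches, their values, the column names, and the temperature range `NOR` are all hypothetical placeholders, not data from the series' datasets, and the excursion-hours rule is a deliberately crude assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical per-timepoint log: two batches, four samples each.
ts = pd.DataFrame({
    "batch_id":      ["B1"] * 4 + ["B2"] * 4,
    "t_h":           [0, 24, 48, 72] * 2,
    "vcd_e6_per_mL": [0.5, 4.0, 12.0, 10.0, 0.5, 3.0, 9.0, 8.5],
    "lactate_g_L":   [0.2, 1.0, 1.5, 1.2, 0.2, 1.4, 2.2, 2.0],
    "temp_C":        [36.8, 37.0, 37.1, 36.9, 36.9, 38.2, 37.0, 37.0],
})
NOR = (36.5, 37.5)  # assumed temperature normal operating range, deg C

def summarise(g: pd.DataFrame) -> dict:
    """Collapse one batch's time series into fixed-length summary features."""
    t = g["t_h"].to_numpy(float)
    lac = g["lactate_g_L"].to_numpy(float)
    dt = np.diff(t)
    out_of_nor = ~g["temp_C"].between(*NOR).to_numpy()
    return {
        "peak_VCD_e6_per_mL": g["vcd_e6_per_mL"].max(),
        # Trapezoidal integral of lactate over time (g/L * h).
        "integral_lactate": float(((lac[1:] + lac[:-1]) / 2 * dt).sum()),
        # Crude rule: charge a whole interval when the sample closing it is out of range.
        "temp_excursion_h": float(dt[out_of_nor[1:]].sum()),
    }

summary = pd.DataFrame.from_dict(
    {batch_id: summarise(g) for batch_id, g in ts.groupby("batch_id")},
    orient="index",
)
print(summary)
```

Eight time-series rows become two training rows. However densely the process is sampled, the row count of the batch-level table only grows when another batch is run.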
The running example: BATCH-2026-004
The campaign in this series carries exactly one out-of-spec sibling. Read straight from examples/datasets/hplc_results.csv, the failure is unambiguous: BATCH-2026-004 has a host-cell-protein result of 128.0 ng/mg against a specification ceiling of 100.0 ng/mg — the only OOS value in the entire 66-row release table; every other assay on every batch passes. The other five batches (BATCH-2026-001/002/003/005/006) are clean, with HCP values from 14.953 to 28.203 ng/mg, all comfortably inside spec. So the label a batch-failure model would learn is real and it is single-attribute: this batch failed on HCP, host-cell-protein carry-over, not on monomer purity, charge variants, or any other CQA.
The mechanism matters for the features. Host-cell protein is a downstream-clearance attribute: HCP that survives the capture and polishing steps usually traces back to how much got loaded onto the columns, which traces back to how much made it through harvest and clarification. A turbid harvest carries more host-cell debris forward; an overloaded capture step clears less of it. So a physics-shaped batch-failure model should find harvest turbidity and capture load as the drivers of an HCP failure — not the upstream growth metrics that drive titer. A model that instead "predicts" the failure from, say, peak VCD would be learning a spurious correlation, and on six batches it absolutely could.
Why six batches cannot train a classifier
Here is the honest core. With six batches and one failure, you cannot fit, validate, or trust a batch-failure classifier — full stop. One positive example is not a class; a single held-out fold either contains the failure (and the train set has none) or it does not (and the test set has none). Any AUC you report is an artifact. The lesson of this chapter is not "look, we built a batch-failure model"; it is "here is what the model would look like, here is the data you would need, and here is why the plant-level constraint is batches-not-rows."
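The fold argument can be checked by brute force. The sketch below enumerates every possible train/test split of six labels with one positive and counts how many leave both classes on both sides, which is the minimum needed to fit a classifier and compute a ROC-AUC. The helper name `split_is_usable` is hypothetical; only the 5 PASS / 1 OOS label pattern comes from the chapter.

```python
from itertools import combinations

import numpy as np

# Five PASS (0) and one OOS (1); index 3 stands in for BATCH-2026-004.
y = np.array([0, 0, 0, 1, 0, 0])
n = len(y)

def split_is_usable(train_idx, test_idx) -> bool:
    """Usable only if training sees both classes AND the test set holds both."""
    return len(set(y[train_idx])) == 2 and len(set(y[test_idx])) == 2

total = 0
usable = 0
for k in range(1, n):                          # every possible test-set size
    for test in combinations(range(n), k):     # every test set of that size
        train = [i for i in range(n) if i not in test]
        total += 1
        usable += split_is_usable(train, list(test))

print(f"{usable} usable splits out of {total}")
# → 0 usable splits out of 62
```

No resampling scheme escapes this: the single failure is either in the training set or in the test set, never both.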
So the example module does two things side by side, and labels the boundary loudly. It reads the real outcomes from hplc_results.csv and batches.csv to state the ground truth (5 PASS, 1 OOS, the HCP excursion). Then it builds an illustrative synthetic cohort of a few hundred batches around the real process statistics — with failure driven by harvest turbidity into HCP carry-over, the real BATCH-2026-004 mechanism — purely to demonstrate the gradient-boosted classifier and its feature attribution. The model is real; the cohort it trains on is synthesised, and every number it prints is marked illustrative. That is the only intellectually honest way to show plant-level ML on a six-batch dataset.
Plant-level learning sits above the unit operations: predictive maintenance on the equipment (left), batch-failure and yield prediction on the product (center, with the real OOS BATCH-2026-004 as the target), and scheduling, capacity, supply-chain, and energy analytics on the campaign (right) — the cluster where production ML actually lives.
Original diagram by the authors, created with AI assistance.
Building the two models
The chapter's two modules are pure NumPy/Pandas/scikit-learn over the committed datasets, so they run with no services. Start with the batch-outcome model. It reads the real release outcomes, then trains a gradient-boosted failure classifier on the illustrative cohort. Because each row is one batch, the 5-fold cross-validation holds out whole batches, which is the only honest way to estimate performance on small data:
# examples/platform/ml/batch_outcome.py
from pathlib import Path

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Assumed dataset location (examples/datasets); adjust to the repository layout.
DATASETS = Path(__file__).resolve().parents[2] / "datasets"

# Plant-level summary features a completed batch carries by harvest/release.
FEATURES = [
    "peak_VCD_e6_per_mL",     # upstream growth summary
    "final_titer_g_L",        # productivity summary
    "integral_lactate",       # metabolic-burden summary (lactate * time)
    "harvest_turbidity_NTU",  # clarification carry-over -> HCP risk
    "protein_a_load_g_L",     # capture loading
    "temp_excursion_h",       # cumulative hours outside the temperature NOR
]


def real_release_outcomes() -> pd.DataFrame:
    """The ground truth: 6 batches, one OOS. Read straight from the datasets."""
    hplc = pd.read_csv(DATASETS / "hplc_results.csv")
    batches = pd.read_csv(DATASETS / "batches.csv")
    oos = hplc[hplc.result == "OOS"][["batch_id", "test", "value", "spec_high"]]
    oos = oos.rename(columns={"test": "failing_assay", "value": "failing_value"})
    # Left merge keeps every batch; passing batches get NaN in the failing_* columns.
    return batches.merge(oos, on="batch_id", how="left")


def train_failure_model(df: pd.DataFrame) -> dict:
    """Gradient-boosted batch-failure classifier, batch-out cross-validated."""
    X, y = df[FEATURES].to_numpy(), df["failed_release"].to_numpy()
    clf = GradientBoostingClassifier(n_estimators=200, max_depth=2, random_state=2026)
    # 5-fold CV is the honest small-data estimate; never score on the train fold.
    # Each row is one batch, so every prediction is made for a held-out batch.
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, proba)
    # Refit on all rows only to read the feature importances.
    clf.fit(X, y)
    imp = dict(sorted(zip(FEATURES, clf.feature_importances_), key=lambda kv: -kv[1]))
    return {"n": len(df), "n_failed": int(y.sum()),
            "cv_auc": round(float(auc), 3), "top_driver": next(iter(imp)),
            "importances": {k: round(float(v), 3) for k, v in imp.items()}}
Running python ml/batch_outcome.py prints the real ground truth, then the illustrative model:
REAL release outcomes (6 batches):
5 PASS, 1 OOS
OOS batch = BATCH-2026-004: HCP_ng_per_mg = 128.0 (spec <= 100.0)
batch-failure classifier (ILLUSTRATIVE cohort, n=240, 39 failed, rate=0.163):
5-fold CV ROC-AUC = 0.978 # illustrative
top driver = harvest_turbidity_NTU # turbidity -> HCP carry-over, the real BATCH-2026-004 mechanism
feature importances = {'harvest_turbidity_NTU': 0.895, 'protein_a_load_g_L': 0.06,
'temp_excursion_h': 0.03, 'integral_lactate': 0.016,
'final_titer_g_L': 0.0, 'peak_VCD_e6_per_mL': 0.0} # illustrative
ASSERT ok: the HCP-carry-over failure signal is learnable (illustrative).
Read it carefully. The real block is the only verbatim claim: one OOS, on HCP, at 128.0 ng/mg. The illustrative block earns its keep by showing the shape of a good batch-failure model — the cross-validated AUC of 0.978 (illustrative) is high because the synthetic cohort has a clean turbidity→HCP signal, and the feature importances do the right thing: harvest turbidity dominates (0.895), with capture load and temperature excursion next, and the upstream growth metrics correctly near zero. The model recovers the physics of the real failure mode rather than a spurious one — which is exactly the property you would qualify a real model against, even though you would never train it on a synthetic cohort for a release decision.
The predictive-maintenance module is the unsupervised departure score:
# examples/platform/ml/pdm.py
import numpy as np


def health_score(signal: np.ndarray, baseline_days: int = 21) -> dict:
    """Unsupervised PdM: z-distance from the healthy baseline window."""
    base = signal[:baseline_days]
    mu, sd = base.mean(), base.std(ddof=1)
    z = (signal - mu) / sd              # departure score, in baseline sigmas
    alarm = 3.0                         # 3-sigma, an SPC threshold
    # Index of the first day above the threshold, or -1 if it never trips.
    crossed = np.argmax(z > alarm) if (z > alarm).any() else -1
    return {"baseline_mean": round(float(mu), 3), "baseline_sd": round(float(sd), 3),
            "first_alarm_day": int(crossed), "max_z": round(float(z.max()), 2)}
predictive maintenance (ILLUSTRATIVE agitation-motor vibration):
healthy baseline = 1.821 +/- 0.054 mm/s RMS
3-sigma health alarm first trips on day 43 (max departure 38.17 sigma)
hard failure scheduled day 52 -> 9 days of lead time # illustrative
ASSERT ok: the health alarm fires before failure with lead time (illustrative).
Nine days of lead time (illustrative) is the entire economic argument: the bearing fails on day 52, but the health alarm fires on day 43, turning a mid-batch emergency into a planned swap. The simulator does not model equipment faults, so the vibration trace is synthetic — but the method, baseline-and-departure, is the one that ships.
Right-first-time and review-by-exception
The third family is not a prediction at all; it is a workload problem. Every GMP batch produces an electronic batch record (EBR) with thousands of entries, and historically a human reviewer read every one before release — slow, expensive, and the place where release timelines go to die. The execution-layer answer is review-by-exception: the MES executes the recipe, automatically verifies the entries that are in tolerance, and surfaces to a human reviewer only the exceptions — the entries that fell outside limits or need judgment. Right-first-time (RFT) is the companion metric: the fraction of batches that complete with no deviation requiring rework.
This is where ML layers onto an existing production backbone rather than replacing it. Körber PAS-X is the canonical MES here, and its vendor materials cite "up to 98% right-first-time" with review-by-exception as a headline capability — a figure that is vendor-self-reported and is a best-case ceiling, not a typical result, but the capability (rule-based auto-verification of in-tolerance entries) is genuinely production [4]. The ML addition on top is twofold: classifiers that triage which exceptions are likely to become real deviations versus noise, and the generative-AI deviation/CAPA drafting covered in the next chapter. The crucial qualifier is that all of it is human-in-the-loop: the model proposes, a qualified reviewer disposes. This is not a stylistic preference. It is the post-Purolea regulatory reality — the FDA's first AI-citing cGMP warning letter (2 April 2026) cited a firm that used AI agents to generate specs, SOPs, and master production records without quality-unit review, which is precisely the line review-by-exception ML must not cross [5].
The "up to 98% RFT" claim for Körber PAS-X is vendor-self-reported and a best-case ceiling [4]. Review-by-exception as an MES capability is (production) and independently recognized (Gartner rates PAS-X an MES Leader), but the specific RFT percentage and install-base figures are vendor materials, not independently verified. Treat the capability as established and the headline number as illustrative.
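The rule-based core of review-by-exception, and the RFT metric that sits beside it, can be sketched without any ML at all. Everything here is hypothetical: the `Entry` record, the parameters and limits, and the choice of which batch needed rework are invented for illustration and do not describe how PAS-X or any other MES implements the capability.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    """One batch-record entry with its tolerance limits (hypothetical schema)."""
    batch_id: str
    parameter: str
    value: float
    low: float
    high: float

    @property
    def in_tolerance(self) -> bool:
        return self.low <= self.value <= self.high

def review_queue(entries):
    """Auto-verify in-tolerance entries; return only the exceptions for a human."""
    return [e for e in entries if not e.in_tolerance]

def right_first_time(batch_ids, rework_batches) -> float:
    """Fraction of batches that completed with no deviation requiring rework."""
    batch_ids = set(batch_ids)
    return 1.0 - len(batch_ids & set(rework_batches)) / len(batch_ids)

entries = [
    Entry("B1", "pH", 7.02, 6.9, 7.2),
    Entry("B1", "temp_C", 37.0, 36.5, 37.5),
    Entry("B2", "pH", 7.35, 6.9, 7.2),           # out of tolerance
    Entry("B2", "temp_C", 37.1, 36.5, 37.5),
    Entry("B3", "hold_time_h", 26.0, 0.0, 24.0), # out of tolerance
]

queue = review_queue(entries)
# Assumed human disposition: B2 needed rework, B3's exception was closed without it.
rft = right_first_time({e.batch_id for e in entries}, rework_batches={"B2"})
print([(e.batch_id, e.parameter) for e in queue], round(rft, 3))
# → [('B2', 'pH'), ('B3', 'hold_time_h')] 0.667
```

Note the separation: the code decides what a human must look at, but which exceptions count as deviations requiring rework is an input supplied by the reviewer, not something the code concludes.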
Scheduling, capacity, supply chain, and energy
The fourth family is where the largest dollar figures live and where, tellingly, the methods are mostly not deep learning. A biomanufacturing site is a scarce set of expensive vessels and suites that many batches compete for; deciding which batch runs in which vessel when — around changeovers, cleaning validations, media and buffer prep, and shared utilities — is a production scheduling / capacity optimization problem. The right tools are operations research: mixed-integer linear programming, constraint programming, and discrete-event simulation, sometimes wrapped with ML demand forecasts as inputs. ML's role is usually to forecast the inputs (demand, batch duration, deviation likelihood) that the optimizer then schedules against, not to make the scheduling decision itself.
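A toy instance shows the shape of the scheduling problem. The sketch below assigns five batches to two vessels to minimise the campaign makespan by exhaustive enumeration; that brute force stands in for a MILP or constraint-programming solver and only works because the instance is tiny. The batch names and durations are hypothetical, and changeovers, cleaning, and shared utilities are deliberately ignored.

```python
from itertools import product

# Hypothetical batch durations in days (in practice, possibly an ML forecast).
durations = {"A": 14, "B": 12, "C": 10, "D": 9, "E": 7}
n_vessels = 2

def vessel_loads(assignment: dict) -> list:
    """Total occupied days per vessel for a batch -> vessel assignment."""
    loads = [0] * n_vessels
    for batch, vessel in assignment.items():
        loads[vessel] += durations[batch]
    return loads

def makespan(assignment: dict) -> int:
    """Days until the busiest vessel finishes: the objective to minimise."""
    return max(vessel_loads(assignment))

# Enumerate all n_vessels ** n_batches assignments and keep the best one.
candidates = (
    dict(zip(durations, combo))
    for combo in product(range(n_vessels), repeat=len(durations))
)
best = min(candidates, key=makespan)
best_makespan = makespan(best)
print(best_makespan, sorted(vessel_loads(best)))
# → 26 [26, 26]
```

Enumeration grows exponentially with the number of batches, which is why real sites hand the same objective and constraints to an optimization solver.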
Above the plant, supply-chain analytics forecast demand and predict stockouts so that a cold-chain biologic neither expires on a shelf nor runs short at a clinic — Sanofi's plai platform reports roughly 80% stockout prediction and 65% risk-to-root-cause, both single-company self-reported and not independently verified, so they belong in the "illustrative/self-reported" bucket [6]. Energy and utility analytics optimize the enormous HVAC, chilled-water, and clean-utility loads that dominate a facility's footprint; the WEF Global Lighthouse pharma sites report figures like ACG Shirwal's 31% energy reduction and Cipla Indore's 26% cost reduction — Lighthouse-program self-reported, deployed at real sites, but headline numbers all the same [7]. Amgen's New Albany, Ohio "smart facility" pairs AWS-based predictive maintenance and machine vision with aggressive energy/water targets (−75% energy, −80% water), and it is worth stressing those are targets, not audited results, and the site is staffed by 400+ people, not lights-out [8][2].
The pattern across all four families is the ISPE 7th Pharma 4.0 survey's central finding: AI/ML has the most pilots and the fewest scaled implementations of any digital technology in pharma, and the production clusters are exactly the ones in this chapter — monitoring, predictive maintenance, vision inspection, and human-in-the-loop documentation — not autonomous control of CQAs [9].
Anatomy of one batch-outcome prediction record
A batch-outcome prediction, like every artifact in this series, is worthless as a bare number. What makes P(fail) = 0.07 a governable decision support is everything that travels with it: the batch it scores, the features it read, the model and data version that produced it, the calibration that turns the score into a probability, and the disposition a human ultimately recorded. Unpack the record the model would persist for a single batch.
One batch-outcome prediction as a governed record: the batch and its summary features, the calibrated failure probability with its top driver, the eventual release outcome that grades it, and the governance block — intended use, human disposition, e-signature, and lineage — that keeps it advisory rather than an autonomous release.
Original diagram by the authors, created with AI assistance.
Read it top to bottom. The input rows pin the prediction to one batch and one as-of moment — a batch-failure model is far more useful partway through (end of capture, say) than at release, because an early warning leaves room to intervene; the features are the same six summary statistics the model trains on. The green core is the prediction proper: a calibrated failure_probability, an optional predicted yield_margin, a decision band (watch / hold / release-track), and the top_driver from the model's feature attribution — for BATCH-2026-004 that driver would be harvest turbidity, pointing the investigator at the real mechanism. The reconciliation rows hold the eventual truth: the actual release outcome and the realized HCP value the lab reports later, plus the residual that grades the prediction and feeds drift monitoring. The violet governance block is what an auditor opens first: the model version, the (illustrative) training-cohort hash, the intended use scoping it to advisory decision support, the human disposition and e-signature, and the run-to-model-to-registry lineage. Strip that block and the same prediction becomes an ungoverned number an AI made about a medicine — the Purolea failure mode in miniature.
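One way to picture the record is as a typed structure whose fields follow the blocks just described. This is a hypothetical schema, not the one the example platform persists: the class name, field names beyond those quoted in the text, and all values are placeholders, and the feature values are intentionally left empty.

```python
from dataclasses import dataclass, replace
from typing import Optional

BANDS = ("watch", "hold", "release-track")

@dataclass(frozen=True)
class BatchOutcomePrediction:
    # Inputs: which batch, at what moment, from which features.
    batch_id: str
    as_of: str
    features: dict
    # Prediction proper.
    failure_probability: float
    decision_band: str
    top_driver: str
    # Governance: what an auditor opens first.
    model_version: str
    training_cohort_hash: str
    intended_use: str
    lineage: str
    yield_margin: Optional[float] = None
    # Reconciliation and disposition, filled in later.
    actual_outcome: Optional[str] = None
    realized_hcp_ng_per_mg: Optional[float] = None
    human_disposition: Optional[str] = None
    e_signature: Optional[str] = None

    def __post_init__(self):
        if not 0.0 <= self.failure_probability <= 1.0:
            raise ValueError("failure_probability must be in [0, 1]")
        if self.decision_band not in BANDS:
            raise ValueError(f"decision_band must be one of {BANDS}")

    def is_dispositioned(self) -> bool:
        """Advisory until a human has recorded a disposition and signed it."""
        return bool(self.human_disposition) and bool(self.e_signature)

record = BatchOutcomePrediction(
    batch_id="BATCH-EXAMPLE", as_of="end_of_capture",
    features={"harvest_turbidity_NTU": None},   # placeholder, values omitted
    failure_probability=0.07, decision_band="release-track",
    top_driver="harvest_turbidity_NTU",
    model_version="example-0", training_cohort_hash="illustrative",
    intended_use="advisory decision support", lineage="run -> model -> registry",
)
signed = replace(record, human_disposition="reviewed, no action",
                 e_signature="reviewer-placeholder")
print(record.is_dispositioned(), signed.is_dispositioned())
# → False True
```

Making the record immutable means a disposition is a new version of the record rather than an overwrite, which suits an audit trail.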
The unsolved part: the batches-not-rows ceiling and confounded labels
The honest open problem here is not a missing algorithm; it is that the plant level is where the small-data ceiling is hardest and the labels are most confounded. Batches, not rows, are the unit of evidence, and batches are scarce, slow, and expensive — a new process may have single-digit completed runs, and even a mature one accumulates batches over years, by which time process changes have made the early ones a different distribution. Transfer learning and Bayesian priors help (warm-start a new product's failure model from a related one), but the documented hard problem is that run-to-run variability in living systems compromises transferability, so a model learned on one process or scale can mislead on the next [3].
The labels are worse than scarce: they are confounded and imbalanced. A failing batch like BATCH-2026-004 carries a single OOS attribute (HCP) but a whole vector of correlated process conditions, and with one failure there is no way to separate the cause from the coincidences; the model can learn the operator who happened to run it as easily as the turbidity that caused it. Failures are also rare by design. A well-run process should have very few, which is exactly the regime where a classifier has almost nothing to learn the positive class from, and where accuracy misleads: at a 1% failure rate, a model that always predicts "pass" scores 99% accuracy while catching no failures at all. Add the survivorship problem (batches halted mid-run for a deviation never produce a clean pass/fail label at all) and the picture is clear: plant-level prediction is the application where the data is least adequate to the ambition, which is why production deployments cluster in the unsupervised monitoring corner (PdM, MSPC) where you only have to model "normal," not in the supervised failure-prediction corner where you have to have seen failures.
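The accuracy trap is simple arithmetic. The sketch below assumes a hypothetical cohort of 100 batches with a 1% failure rate and scores a "model" that always predicts pass; the numbers are illustrative, not drawn from any dataset in the series.

```python
# Hypothetical cohort: 100 batches, one failure (1 = failed release).
y_true = [1] + [0] * 99
y_pred = [0] * 100            # the always-"pass" baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_pos / (true_pos + false_neg)   # share of real failures caught

print(accuracy, recall)
# → 0.99 0.0
```

Recall on the failure class, or a precision-recall curve, is the more informative yardstick here, because it cannot be inflated by the majority class.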
What this chapter adds to the model suite
This chapter contributes Book 5's plant-level pair — the two modules that operate above the unit operation:
- examples/platform/ml/batch_outcome.py — yield/failure prediction. It reads the real release outcomes (5 PASS, 1 OOS — the HCP excursion on BATCH-2026-004) and then trains a gradient-boosted batch-failure classifier on an explicitly-labelled illustrative synthetic cohort, with batch-out cross-validation and feature attribution that recovers the real turbidity→HCP mechanism. Its lesson, baked into its docstring, is the batches-not-rows ceiling: the model is real, the cohort is synthesised, and the boundary is loud.
- examples/platform/ml/pdm.py — an illustrative predictive-maintenance signal: an unsupervised baseline-and-departure health score on a synthetic agitation-motor vibration trace, alarming with lead time before a seeded failure. The method (z-distance from a healthy window, three-sigma alarm) is the deployable one; the fault is synthetic because the simulator models the process, not the equipment.
These sit beside the suite's prediction and search models as the only two that take the batch (not the timepoint) and the equipment (not the product) as their unit of analysis — the suite's plant-altitude entries. They coordinate with, but do not duplicate, the deviation-triage and release-prediction modules: those classify the records and assays; these score the batch and the machine.
Why it matters
Plant-level ML is where machine learning meets the profit-and-loss statement of a biologics facility. A predictive-maintenance alarm that converts a scrapped batch into a planned part swap saves more than any titer improvement; a batch-failure model that flags a drifting run at the end of capture leaves room to intervene before an OOS is locked in; review-by-exception that lets a reviewer read only the exceptions compresses release timelines that otherwise stretch to weeks; a scheduler that packs idle vessels turns capital already spent into product. None of it is glamorous and none of it controls a CQA — and that is exactly why it is the part of the AI story that is actually deployed. The unglamorous, human-in-the-loop, monitor-and-optimize corner is where the production maturity is, and a sober reading of this chapter is the antidote to the autonomous-factory hype the frontier chapter will examine.
In the real world
The named, attributed reality is consistent. On predictive maintenance, AVEVA Predictive Analytics and Siemens Senseye are the cross-industry platforms, with biopharma instances like J&J Mulund reporting reduced unplanned downtime and Amgen's Ohio facility pairing AWS-based PdM with machine vision — all vendor or self-reported, all (production) in the sense of being deployed, with the headline percentages illustrative [1][2][8]. On review-by-exception and RFT, Körber PAS-X is the canonical MES execution layer (production), with its "up to 98% RFT" figure vendor-self-reported, joined by Rockwell PharmaSuite and the GxP manufacturing-intelligence layer of Aizon (its flagship Grifols deployment across three sites is vendor-self-reported, though its 2020/2021 PDA-JPST AI-qualification study is a notable peer-reviewed exception) [4][10]. On the data layer that feeds all of it, TetraScience Tetra OS positions itself as the "AI-ready data" platform (deployment counts vendor-self-reported), reflecting the field's true #1 barrier: a Zifo study found 70% of firms struggle to access data for AI because of silos, and only 39% use standardized formats [11]. On supply chain and energy, Sanofi's plai and the WEF Lighthouse sites supply the headline figures, all self-reported.
The throughline is the ISPE 7th Pharma 4.0 survey: the most pilots, the fewest scaled implementations, with production concentrated in monitoring, predictive maintenance, vision, and human-in-the-loop documentation [9]. Plant-level ML is real, it is deployed, and it is deliberately bounded — it keeps the factory running and lets humans decide.
Key terms
- Predictive maintenance (PdM) — learning an equipment-health signal (vibration, current, temperature) to predict a failure with lead time; usually unsupervised, because real failures are too rare to train a supervised classifier.
- Departure / health score — the distance of a live equipment signal from a healthy baseline (a z-distance, or a PCA/autoencoder reconstruction error); an alarm fires when it crosses a threshold, typically three sigma.
- Lead time — the gap between the PdM alarm and the hard failure; the entire economic value, converting an emergency into a scheduled task.
- Batch-failure prediction — a batch-level supervised task: one summary-feature vector per batch in, one pass/fail outcome out; the unit of evidence is the batch, not the timepoint.
- Yield / deviation prediction — the continuous and event-level cousins of batch-failure prediction (predicting the yield margin, or the likelihood a deviation will require rework).
- Batches-not-rows ceiling — the binding plant-level constraint: the number of training examples equals the number of batches ever run, which is scarce, slow, and expensive even at commercial scale.
- Right-first-time (RFT) — the fraction of batches completing with no deviation requiring rework; the headline metric review-by-exception is meant to raise.
- Review-by-exception — an MES capability (Körber PAS-X) that auto-verifies in-tolerance batch-record entries and surfaces only the exceptions for human review; ML and generative AI layer on top, human-in-the-loop.
- Production scheduling / capacity optimization — deciding which batch runs in which vessel when, around changeovers and shared utilities; an operations-research problem (MILP, constraint programming, discrete-event simulation), with ML usually forecasting the inputs rather than making the decision.
- Supply-chain / energy analytics — demand and stockout forecasting, and HVAC/utility-load optimization; where the largest dollar figures sit and most headline percentages are self-reported.
- Confounded labels — the plant-level pathology where a single failing batch carries a whole vector of correlated conditions, so cause cannot be separated from coincidence on small data.
Where this leads
The factory runs, the equipment is watched, the batch is scored, and the schedule is packed — and at every step a human stayed in the loop, reading only the exceptions a model surfaced and disposing of them with a signature. The obvious next question is what drafts those exceptions, those CAPAs, those investigation reports, and how far the new wave of language models can be trusted to do it. The next chapter, Generative AI and LLMs: Copilots, CAPA, and the Limits of Agents, takes up the copilots that write the deviation and the agents that promise to act on it — and the hard regulatory line, drawn in the Purolea warning letter and draft Annex 22, that keeps them advisory.