Case Studies: Named Deployments and Their Evidence
📍 Where we are: Part VII · ML/AI in Industry Today — Chapter 26. The vendor landscape told you who sells ML for bioprocessing and what their software actually does. This chapter asks the harder question: who has actually deployed it, on a named molecule, in a real plant — and how good is the evidence?
A vendor slide is a promise; a deployment is a fact, but only as strong as its evidence. This chapter walks the named, disclosed cases — the ones a careful reader can trace to a paper, a conference talk, or a press release — and grades each one twice: by maturity (is it in commercial GMP, demonstrated at scale, or still academic?) and by evidence tier (who said so, and could anyone outside the company check it?). The uncomfortable finding, stated up front so the rest of the chapter cannot soften it: of all the headline efficiency numbers the industry quotes, essentially none clear the bar of peer-reviewed-independent fact. They are real, they are interesting, and they are almost all a single company reporting on its own process.
That is not a reason to dismiss them. It is a reason to read them correctly. A self-reported +26.8 percent titer from one smart lab is genuine signal about what is possible; it is not a benchmark you can promise your own management. The whole skill this chapter teaches is holding both thoughts at once.
Imagine reading restaurant reviews where every five-star rating was written by the restaurant's own owner. The food might genuinely be excellent — owners do not lie about everything — but you would weight those reviews very differently from a critic who paid for their own meal and has no stake in the outcome. Biomanufacturing ML case studies are mostly owner-written reviews. This chapter teaches you to read them like a critic: enjoy the dish, but never quote the owner's star rating as if a stranger gave it.
What this chapter covers
- The handful of genuine (production) deployments in commercial GMP, and exactly how solid each one's evidence is.
- The larger band of (pilot) demonstrations from named companies — including the most-cited hybrid-modeling case (Bristol Myers Squibb with DataHow) — read at the peer-reviewed numbers, not the vendor-page numbers.
- A pass down the named-company roster: Amgen, Genentech/Roche, Merck, Pfizer, Sanofi, BMS, Lilly, Biogen, WuXi Biologics, Samsung Biologics, Celltrion, and the big CDMOs Lonza and Fujifilm.
- Two attributions the field repeatedly gets wrong, corrected here: Genentech's MARS is a drug-product unit operation, not the Boehringer Ingelheim Protein A work; and DataHow is an independent company, not a Sartorius subsidiary.
- The self-reporting problem named plainly, with a machine-checkable ledger that computes how many of the disclosed numbers can actually be stated as fact.
How to grade a deployment: two axes, not one
The series uses a fixed convention, and this chapter leans on it harder than any other. Every named case gets a maturity marker and an evidence tier, and the two are independent — a claim can be production-deployed yet weakly evidenced, or research-stage yet rigorously proven.
Maturity answers "how far has it actually gone?": (production) deployed in GMP or commercial use; (pilot) demonstrated at scale but not running the plant; (research) academic or early proof-of-concept. Evidence tier answers "how much should I trust the claim?" In descending order: peer-reviewed-independent (a third party with no commercial stake published it), peer-reviewed-self-authored (the company published it in a journal, so the method was reviewed but the headline number is still theirs), vendor-self-reported (a vendor or manufacturer asserts it on its own materials), and press-release-only (a marketing announcement with no method, baseline, or replication).
The single rule that organizes the whole chapter: a number may be stated as established fact only at the peer-reviewed-independent tier or above. Everything below it is labeled illustrative or self-reported in the same sentence as the number. When you run this chapter's companion module, you will see that this rule disqualifies every headline figure in the named-deployment roster — which is precisely the point.
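The same-sentence labeling rule lends itself to a mechanical check. What follows is a minimal sketch, not part of the chapter's companion module: the function name unhedged_numbers, the hedge-word list, and the naive sentence splitter are all assumptions made for illustration, and it assumes every figure in the draft comes from a source below the fact floor.

```python
import re

# Hedge words taken from the chapter's rule; the tuple itself is an assumption.
HEDGES = ("illustrative", "self-reported")


def unhedged_numbers(text):
    """Return sentences that quote a figure but carry no hedge word."""
    # Naive splitter: break after ., ! or ? when followed by whitespace,
    # so a decimal such as 26.8 stays inside its sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        s for s in sentences
        if re.search(r"\d", s) and not any(h in s.lower() for h in HEDGES)
    ]


draft = (
    "WuXi's ISLFCC reports +26.8 percent average titer "
    "(illustrative, self-reported, PD scale). "
    "A smart lab will raise our titer by 26.8 percent."
)
flagged = unhedged_numbers(draft)
for sentence in flagged:
    print("needs a hedge:", sentence)
```

Only the second sentence is flagged: it repeats the figure but drops the label, which is exactly the slip the rule is meant to catch. A real check would also need to know each figure's tier rather than assuming all of them fall below the floor.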
The industry-wide framing comes from two independent-ish sources. The 7th ISPE Pharma 4.0 Survey (peer-reviewed-independent, in the sense of a neutral professional society reporting member data) found AI/ML to have the most pilot projects and the fewest scaled implementations of any digital technology [1]. McKinsey's State of AI 2025 found that across all industries about 88 percent of organizations use AI but roughly 6 percent achieve enterprise-wide impact (vendor/analyst tier) [2]. The shape — many pilots, few scaled — is the backdrop for every case below.
The (production) tier: a short, well-scrutinized list
Genuinely commercial-GMP ML in biomanufacturing is a short list, and it clusters where the vendor chapter said it would: monitoring, vision inspection, and soft sensing — not autonomous control of critical quality attributes.
Amgen Juncos: the cleanest production soft-sensor case
The strongest named production exemplar is Amgen's drug-substance plant at Juncos, Puerto Rico, where engineers deployed SIMCA OPLS (orthogonal partial least squares) models — the same MVDA lineage our running example's soft sensors sit in — to predict harvest titer and in-process states in commercial GMP manufacturing. The reported operational payoff is the elimination of roughly six hours of harvest idle time and about ten hours of idle time between chromatography columns, by predicting when material is ready rather than waiting for the slow assay [3]. This maps directly onto our spine: it is exactly the harvest-endpoint prediction and capture-pooling decisions, made on real GMP material.
Maturity: (production) — this is OPLS running in a commercial drug-substance facility, not a demo. Evidence: peer-reviewed-self-authored / vendor, and it matters which. The account is a first-party BioProcess International article written by three Amgen Juncos employees, alongside a Sartorius vendor case study [3]. The deployment is real and credible; the specific hour-savings are the company's own figures and have not been independently audited. So the honest sentence is: Amgen Juncos runs OPLS soft sensors in commercial GMP (production), and reports about 6 and 10 hours of idle-time savings (illustrative, self-reported).
Amgen AVI: the most production-ready ML in QC
The other clearly production case from Amgen is automated visual inspection (AVI) of vials and syringes using deep-learning computer vision — the single most mature ML use in QC. Rule-based inspection over-rejects (Amgen has cited up to roughly 20 percent false rejection of good containers), and a trained convolutional model both catches more true defects and rejects far fewer good units. Amgen reports auto-releasing on the order of 95 percent of syringes and vials through AI-assisted visual inspection [4].
Maturity: (production). Evidence: vendor / trade-press self-reported. The 95 percent figure comes from conference and trade coverage, the fully validated retrofit Amgen has described in detail was a syringe line, and reaching production took years of work and direct conversations with FDA [4]. The lesson the vision chapters drew — that vision inspection is where deep learning genuinely earns GMP trust — is true; the exact percentage is illustrative.
The pattern under the production tier
Step back and the production list has a shape: PLS/OPLS multivariate models (Amgen Juncos; the broad SIMCA/SIMCA-online installed base every large biologics maker runs) and deep-learning vision (AVI). Both are advisory or inspection roles where a human or a downstream test remains in the loop. Notably absent: a model that closes the loop on a release-defining CQA with no human gate. That absence is not an accident of disclosure — it is the regulatory line the draft EU GMP Annex 22 draws, which the next chapter takes up in full.
The (pilot) tier: demonstrated, named, but not running the plant
Below production sits a much larger band: named companies showing real ML on real molecules, at real scale, but as demonstrations rather than standing GMP operations. This is where most of the famous numbers live — and where the self-reporting problem is sharpest.
Bristol Myers Squibb with DataHow: read the peer-reviewed numbers
The most-cited hybrid-modeling case is Bristol Myers Squibb's process-development dataset modeled by DataHow: 48 experiments at 5 L, 12 critical process parameters, 18 CQAs. It is the canonical demonstration of the hybrid paradigm — a mechanistic backbone with an ML component — beating a pure black box in the small-data regime.
Here the evidence tier does real work, because there are two versions of the headline. DataHow's own case-study page reports roughly 22 percent better prediction accuracy and about 3x fewer experiments. A peer-reviewed companion in Biotechnology Journal (2024), co-authored by DataHow and BMS, headlines roughly 33 percent better accuracy with about half the data versus a black-box model [5]. The rule of this chapter is to prefer the peer-reviewed figures — both because the method was reviewed and because, oddly, they are the more conservative claim on the data axis (about half, i.e. 2x, rather than 3x). Maturity: (pilot) — this is process development, not GMP production. Evidence: peer-reviewed-self-authored for the journal version.
Two corrections this case attracts, fixed here because they keep recirculating. First, DataHow is an independent ETH Zurich spin-off — its Series A was led by Momenta with Rockwell Automation and Zürich Kantonalbank, and it has collaborated with Eppendorf and Genedata — it is not a Sartorius subsidiary. Second, the BMS/DataHow result is a prediction-accuracy and experiment-efficiency claim on a PD dataset; it is not a closed-loop GMP control deployment, and it should be read as (pilot), not production.
Genentech/Roche: MARS is drug product, not Protein A
A correction that prevents a real error: the Boehringer Ingelheim work predicting 16 quality attributes in-line during Protein A capture in about 30 seconds [6] is frequently — and wrongly — attributed to Genentech. The analogous Genentech/Roche program is MARS, an in-line spectroscopic monitoring effort on formulated drug product — a different unit operation entirely, at the fill-finish end of the spine, not the capture step [7]. Two different companies, two different unit operations, two different molecules' worth of context. Conflating them produces a phantom "Genentech Protein A CNN" that does not exist.
And while we are precise about the BI work: it predicted 16 CQAs in-line, fast, on Protein A — a genuine (pilot) result, peer-reviewed-self-authored — but the model was KNN regression, not a CNN or any deep network, and the paper makes no deep-learning-superiority claim [6]. It is excellent evidence for Raman PAT on capture; it is not evidence for a "deep-learning Raman wave."
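To make concrete how far k-nearest-neighbour regression is from a deep network, here is a minimal sketch of the general technique. It is not the Boehringer Ingelheim model: the function name knn_predict, the two-channel "spectra", and the two-attribute targets are toy values invented purely for illustration.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Average the target vectors of the k training points closest to query."""
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    n_targets = len(train_y[0])
    # Multi-output prediction: one plain mean per target column.
    return [sum(train_y[i][j] for i in nearest) / len(nearest)
            for j in range(n_targets)]


# Toy data (hypothetical): 4 "spectra" with 2 channels, 2 attributes each.
spectra = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
attributes = [[10.0, 1.0], [12.0, 2.0], [14.0, 3.0], [50.0, 9.0]]

prediction = knn_predict(spectra, attributes, query=[0.2, 0.2], k=3)
print(prediction)
```

There are no learned weights and no training loop: the prediction is an average over stored examples. That is why the result supports Raman PAT on capture without supporting any claim about deep learning.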
Roche/Genentech's other heavily publicized program is its NVIDIA AI-factory and Omniverse facility twins [8] — impressive, but facility-design and simulation twins (pilot), not closed-loop control of product quality. The digital-twin chapter already drew this distinction; the case-study reading is the same.
WuXi Biologics: the autonomous-lab headline, read carefully
The most striking single number in the disclosed literature is WuXi Biologics' ISLFCC (Industrial Smart Lab Framework for Cell Culture), which pairs decoder-only transformer models with robotic sampling and reports a +26.8 percent average titer improvement across three CHO clones, holding lactate below 1 g/L with no late-phase rebound — within a single batch [9]. It is a vivid demonstration of the autonomous-lab frontier.
Read it precisely. Evidence: peer-reviewed-self-authored (Biotechnology Progress, 2025) — the method was reviewed, which is more than most cases get. But it is single-company, self-reported, with no independent replication, and the maturity is process-development scale (3–15 L), not GMP (pilot). So: WuXi's ISLFCC reports +26.8 percent average titer (illustrative, self-reported, PD scale). The number tells you the ceiling a transformer-driven smart lab reached on its own clones; it does not tell you what your plant will get.
Sanofi, Pfizer, and the supply-chain / ops cases
Sanofi offers two often-quoted figures. Its SimplY yield-analytics program on Dupixent at Geel cites a +8 percent drug-substance improvement — but read closely it is a multi-year target, self-reported, not an audited realized outcome (production program, vendor-self-reported, illustrative) [10]. Its plai supply-chain control tower cites about 80 percent stockout prediction — from a June 2023 press release, with a separate ~65 percent risk-to-root-cause figure from an undated corporate page; neither independently verified (production, press-release-only, illustrative) [11]. This is the distribution / forecasting layer, where AI adoption is real but headline numbers are softest.
Pfizer's widely cited "Golden Batch" / Vox manufacturing-AI figures (thousands of hours saved per year, tens of thousands of extra doses per batch) trace to analyst and secondary write-ups and are largely small-molecule — they should not be presented as biologics CQA control (pilot, press-release-only, illustrative) [12]. Merck & Co. (MSD) contributes a peer-reviewed GenAI deviation-investigation study (GPT-4 / Claude-2) that is candid about "the interplay between apparent reasoning and hallucination" — a generative-AI result, retrieval-and-extraction not predictive control, tagged (research), peer-reviewed-self-authored [13].
Samsung, Celltrion, Lilly, Biogen, and the CDMOs
The remaining named players are mostly (pilot) programs disclosed at the corporate level, with the specific GMP models undisclosed. Samsung Biologics describes hybrid MPC, Raman soft sensors, and AI twins at Plant 5 — while executives candidly note much of the workflow is still manual (pilot, vendor-self-reported) [14]. Celltrion describes an AI platform and smart factory at Songdo (pilot, press-release-only) [15]. Eli Lilly and Biogen have broad smart-manufacturing and AI-in-operations programs disclosed in corporate communications, with no named GMP CQA-control deployment in the public record (pilot, press-release-only). The large CDMOs — Lonza and Fujifilm Diosynth — run MVDA/PAT and analytics platforms broadly across client programs (production for the monitoring layer), but per-client ML outcomes are contractually private and not publicly disclosed (vendor-self-reported). For contract manufacturers the monitoring backbone is real and widespread; the disclosed outcomes are not.
Every named deployment plotted on two axes at once: how far it has gone (maturity) against how much you should trust the claim (evidence tier). The production cases cluster at mid-tier evidence; the peer-reviewed cases cluster at pilot maturity; and the top-right quadrant — independently verified, in commercial GMP — is empty. The dashed established-fact line sits above every plotted headline number.
Original diagram by the authors, created with AI assistance.
The self-reporting problem, made machine-checkable
The argument of this chapter is structural, not anecdotal, so the contributed module encodes it as data. examples/platform/ml/case_ledger.py is a small stdlib-only ledger: each named deployment is a Case with an explicit maturity, tier, and verification note, and a method stated_as_fact_ok() that returns True only when the evidence clears the peer-reviewed-independent floor. The point of the code is not the code; it is that the curated evidence — which claim, at which tier — becomes auditable and countable rather than rhetorical.
# examples/platform/ml/case_ledger.py (excerpt)
from dataclasses import dataclass

# Evidence tiers, weakest to strongest; position in the tuple is the rank.
TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)
FACT_FLOOR = "peer-reviewed-independent"  # a number is "fact" only at/above this


@dataclass(frozen=True)
class Case:
    company: str
    application: str
    claim: str     # the disclosed headline, verbatim-ish
    maturity: str  # research | pilot | production
    tier: str      # one of TIER
    note: str      # the verification caveat

    def stated_as_fact_ok(self) -> bool:
        return TIER.index(self.tier) >= TIER.index(FACT_FLOOR)


def overstated_if_quoted():
    """Numeric headlines that do NOT clear the fact floor — must be hedged."""
    # LEDGER is the module-level list of Case records, defined outside this excerpt.
    return [c for c in LEDGER
            if any(s in c.claim for s in ("%", "+", "hrs", "doses"))
            and not c.stated_as_fact_ok()]
Running python3 case_ledger.py over the curated roster prints the distribution this chapter quotes:
case ledger: 16 named deployments
by maturity: {'production': 5, 'pilot': 10, 'research': 1}
by tier: {'peer-reviewed-self-authored': 7, 'vendor-self-reported': 4, 'press-release-only': 5}
headline numbers that must be hedged (below peer-reviewed-independent): 7 of 7 numeric claims
- Amgen (Juncos, PR): "~6 h harvest idle + ~10 h inter-column idle eliminated (illustrative)" [peer-reviewed-self-authored]
- Amgen: "~95% of syringes/vials auto-released (illustrative)" [vendor-self-reported]
- Bristol Myers Squibb (with DataHow): "~33% better accuracy with ~half the data vs black-box" [peer-reviewed-self-authored]
- Sanofi: "+8% drug substance over 3 yrs (illustrative)" [vendor-self-reported]
- Sanofi: "~80% stockout prediction (illustrative)" [press-release-only]
- WuXi Biologics: "+26.8% average titer across 3 CHO clones (illustrative)" [peer-reviewed-self-authored]
- Pfizer: "16,000 hrs/yr, +20,000 doses/batch (illustrative)" [press-release-only]
claims that clear the established-fact floor: 0
Read the last line again: zero of the named headline numbers clear the established-fact floor. Seven of the seven numeric claims in the roster must be hedged. That is not cynicism encoded in code — the ledger generously tags the BMS/DataHow and WuXi results as peer-reviewed-self-authored, the best tier any of them reach — it is simply what the evidence is. The distribution also makes the maturity story concrete: production cases exist, but the peer-reviewed cases sit in the pilot column, and the one research-tier case (Merck's deviation study) is the only one with a published critical self-assessment of its own failure mode.
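The counting itself is simple enough to redo by hand. The sketch below is an illustration, not the module's own code: it rebuilds only the seven hedged rows printed above (not the full 16-case roster, which the excerpt does not list), pairs each with the maturity this chapter assigns it, and tallies them. The names ROWS, by_maturity, by_tier, and clears_floor are assumptions.

```python
from collections import Counter

TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)
FACT_FLOOR = "peer-reviewed-independent"

# (company, maturity, tier) for the seven numeric headlines listed above.
ROWS = [
    ("Amgen (Juncos, PR)", "production", "peer-reviewed-self-authored"),
    ("Amgen", "production", "vendor-self-reported"),          # AVI
    ("Bristol Myers Squibb (with DataHow)", "pilot", "peer-reviewed-self-authored"),
    ("Sanofi", "production", "vendor-self-reported"),         # SimplY
    ("Sanofi", "production", "press-release-only"),           # plai
    ("WuXi Biologics", "pilot", "peer-reviewed-self-authored"),
    ("Pfizer", "pilot", "press-release-only"),
]

by_maturity = Counter(maturity for _, maturity, _ in ROWS)
by_tier = Counter(tier for _, _, tier in ROWS)

# A row clears the floor only if its tier ranks at or above FACT_FLOOR.
clears_floor = [company for company, _, tier in ROWS
                if TIER.index(tier) >= TIER.index(FACT_FLOOR)]

print(dict(by_maturity))
print(dict(by_tier))
print("clear the floor:", len(clears_floor), "of", len(ROWS))
```

Even in this subset the shape holds: four production rows and three pilot rows, none of them at the independent tier.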
Anatomy of one case-ledger entry
The signature of this series is to unpack a single record. For a survey chapter the record is not a prediction or a model — it is one evidence claim, and what travels alongside the number is the entire value. Take the Amgen Juncos row and lay it out field by field.
One case is a whole record: the company and unit operation that scope it, the disclosed claim, the two grades (maturity and evidence tier) read against the fact floor, the verification caveat that keeps it honest, and the stated_as_fact_ok verdict (here False, because a production deployment with first-party evidence is real but its number is still not independent fact).
Original diagram by the authors, created with AI assistance.
Read the card top to bottom and the chapter's method is laid out as fields. The header scopes the claim — which company, which application, where on the spine — because a number with no unit operation attached (a recurring sin of press releases) is uninterpretable. The claim row carries the disclosed headline, with its (illustrative) tag baked in. The two grade rows are the heart of it: maturity = production (this is real GMP) sits beside tier = peer-reviewed-self-authored (but the number is the company's own), and the established-fact floor is drawn above the current tier, so the eye sees immediately that the claim does not reach it. The note row holds the verification caveat — first-party authorship, no external audit — that a careful reader must carry alongside the number forever. And the relationships panel records where the claim came from (its reference) and what it links to on the spine, plus the stated_as_fact_ok boolean that the code computes: False. That single boolean is the chapter compressed into one bit.
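The card can also be laid out as a single typed record. This is a sketch under stated assumptions: the Case fields follow the module excerpt, but the application wording, the helper tiers_below_floor, and the printed layout are hypothetical and added only to show the distance between the claim's tier and the floor.

```python
from dataclasses import dataclass, asdict

TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)
FACT_FLOOR = "peer-reviewed-independent"


@dataclass(frozen=True)
class Case:
    company: str
    application: str
    claim: str
    maturity: str
    tier: str
    note: str

    def stated_as_fact_ok(self) -> bool:
        return TIER.index(self.tier) >= TIER.index(FACT_FLOOR)

    def tiers_below_floor(self) -> int:
        # Hypothetical helper: how many rungs the claim sits under the floor.
        return TIER.index(FACT_FLOOR) - TIER.index(self.tier)


juncos = Case(
    company="Amgen (Juncos, PR)",
    application="harvest-titer soft sensor (SIMCA OPLS)",  # wording assumed
    claim="~6 h harvest idle + ~10 h inter-column idle eliminated (illustrative)",
    maturity="production",
    tier="peer-reviewed-self-authored",
    note="first-party authorship, no external audit",
)

# Flatten the record and append the two computed fields.
card = {
    **asdict(juncos),
    "stated_as_fact_ok": juncos.stated_as_fact_ok(),
    "tiers_below_floor": juncos.tiers_below_floor(),
}
for field, value in card.items():
    print(f"{field:>18}: {value}")
```

The record sits one rung under the floor, which is as close as any case in this chapter gets.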
The unsolved part: there is no independent biomanufacturing-ML benchmark
The honest open problem is the empty quadrant in the hero figure: there is no independent, third-party benchmark for ML in biomanufacturing, and almost no possibility of one under current conditions. In computer vision, ImageNet let any lab measure any model against the same yardstick; in language, public leaderboards let strangers replicate claims. Biomanufacturing has nothing comparable, for structural reasons that are not going away soon.
The data is proprietary and small. A company's batch records are competitively sensitive and bound by GMP confidentiality, so the training set behind a +26.8 percent titer claim is, by construction, unshareable. The processes are non-identical: a different cell line, scale, raw-material lot, or facility makes one company's "titer improvement" incommensurable with another's. And the experiments are too expensive to replicate independently: no neutral lab is going to rerun a 48-experiment development campaign to check a vendor's hybrid-model claim. Federated learning was floated as a way around the data-sharing wall, but the one proven program (MELLODDY, ten pharma companies, billions of data points) was discovery QSAR, not manufacturing, and federated learning across physical production sites remains conceptual [16].
The consequence is that the evidence ceiling for this whole field is, structurally, peer-reviewed-self-authored — a company publishing its own reviewed result. That is genuinely better than a press release, but it is not the independent verification that would let you treat any number as a benchmark. Until the industry builds shared, anonymized, standardized datasets — the kind BioPhorum's "data as a product" workstream and FAIR-data efforts gesture at [17] — the top-right quadrant stays empty, and the correct posture toward every headline number stays: interesting, illustrative, and unverified.
What this chapter adds to the model suite
This chapter contributes examples/platform/ml/case_ledger.py and its flat export cases.csv. It is the suite's one non-model artifact, and deliberately so: the case-studies chapter's job is not to fit anything but to make the evidence countable. The module ships the curated roster of 16 named deployments as typed Case records, computes the maturity/tier distribution the chapter quotes, and — most usefully — exposes overstated_if_quoted(), which any later chapter or notebook can call to confirm that a headline it wants to cite must be hedged. It is the in-code embodiment of the series' evidence-tier convention: the rule "a number is fact only at peer-reviewed-independent or above" stops being a style note and becomes an assertion the build can check.
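As a rough picture of what a flat export and a downstream check might look like, the sketch below writes two of the chapter's rows to CSV in memory and reads them back. The schema of the real cases.csv is not shown in this chapter, so the column names, the export_rows function, and the computed stated_as_fact_ok column are all assumptions.

```python
import csv
import io

TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)
FACT_FLOOR = "peer-reviewed-independent"

# Hypothetical column layout; the real cases.csv may differ.
FIELDS = ["company", "claim", "maturity", "tier", "stated_as_fact_ok"]


def export_rows(cases):
    """Serialize case dicts to CSV text, adding the computed verdict column."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for case in cases:
        ok = TIER.index(case["tier"]) >= TIER.index(FACT_FLOOR)
        writer.writerow({**case, "stated_as_fact_ok": ok})
    return buf.getvalue()


cases = [
    {"company": "WuXi Biologics",
     "claim": "+26.8% average titer across 3 CHO clones (illustrative)",
     "maturity": "pilot", "tier": "peer-reviewed-self-authored"},
    {"company": "Sanofi",
     "claim": "~80% stockout prediction (illustrative)",
     "maturity": "production", "tier": "press-release-only"},
]

csv_text = export_rows(cases)

# A notebook downstream can read the flat file without importing the module.
rows = list(csv.DictReader(io.StringIO(csv_text)))
must_hedge = [r["company"] for r in rows if r["stated_as_fact_ok"] == "False"]
print(must_hedge)
```

Because CSV stores the verdict as text, the reader compares against the string "False"; a stricter export might write 0 and 1 instead.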
Why it matters
The temptation in a field this hyped is to quote the best number you can find and move on. This chapter exists to make that impossible to do honestly. The named deployments are real — Amgen runs OPLS in commercial GMP, BMS and DataHow published a reviewed hybrid-model result, WuXi's smart lab raised titer on its own clones — and that reality is the genuine basis for optimism about where the field is going. But every one of their headline numbers is a single company grading its own homework, and a reader who quotes them as benchmarks will over-promise, under-deliver, and eventually lose credibility with the quality unit and the regulator both. The discipline of attaching maturity and evidence tier to every claim is not academic fussiness; it is the difference between a defensible business case and a slide that falls apart under the first hard question. The same discipline the data book demands of a single data point — that it carry its provenance — this chapter demands of a single claim.
In the real world
The way these cases actually surface is telling. The best evidence lives in trade journals and society surveys, not regulatory filings: GAO found that despite many companies entering FDA's Emerging Technology Program, only a small number of approved applications over 2015–2022 used an advanced manufacturing technology at all [18]. The disclosed case studies run far ahead of the disclosed filings. The strongest single anchor remains Amgen across three roles (Juncos OPLS, AVI, and continuous Raman glucose control), because Amgen has chosen to publish and present its work in unusual detail; that very willingness is why it dominates the production tier of any honest roster. The DataHow/BMS hybrid case is the one most worth citing for process development, specifically because the peer-reviewed companion exists, and you cite the journal's figures (roughly 33 percent better accuracy with about half the data), not the vendor page's 22 percent and 3x. And the cautionary marker for the whole field is regulatory, not technical: FDA's first AI-citing cGMP warning letter (Purolea, April 2026) landed on a firm that used AI to generate records without quality-unit review, a reminder that the gap between a disclosed case study and a defensible deployment is exactly the human-in-the-loop governance the regulation chapter takes up next [19].
Key terms
- Maturity marker — (production) in GMP/commercial use, (pilot) demonstrated at scale, (research) academic/early; the "how far has it gone" axis.
- Evidence tier — peer-reviewed-independent, peer-reviewed-self-authored, vendor-self-reported, press-release-only; the "how much should I trust it" axis.
- Established-fact floor — the rule that a number may be stated as fact only at the peer-reviewed-independent tier or above; everything below is labeled illustrative or self-reported.
- Self-reported headline — a figure asserted by the company that benefits from it, with no independent replication; the default state of biomanufacturing-ML numbers.
- OPLS (orthogonal PLS) — the MVDA model family behind Amgen Juncos's harvest-titer soft sensors; PLS with orthogonal-variation filtering.
- AVI (automated visual inspection) — deep-learning computer-vision inspection of containers; the most production-ready ML in QC.
- MARS — Genentech/Roche's in-line spectroscopic monitoring of formulated drug product; a drug-product unit operation, not the Boehringer Ingelheim Protein A capture work.
- ISLFCC — WuXi Biologics' Industrial Smart Lab Framework for Cell Culture; transformer models plus robotics, source of the +26.8 percent titer (illustrative) headline.
- Case ledger — the chapter's contributed artifact: a typed, gradable record of one named deployment, making the evidence countable rather than rhetorical.
Where this leads
The cases are graded; the empty top-right quadrant and the Purolea warning letter both point the same direction. The next chapter, Regulation and Governance: FDA, Annex 22, and Validating a Model, turns from who deployed what to what is allowed and how you prove it — the FDA 2023 discussion paper and its model-credibility framework, the draft EU GMP Annex 22 that draws a hard line around generative and adaptive AI, the locked-model-plus-predetermined-change-control expectation, and the validation paradox of a model that must never silently learn. The case studies tell you the field is real; the regulation chapter tells you the conditions under which a real deployment becomes a defensible one.