Case Studies: Named Deployments and Their Evidence

📍 Where we are: Part VII · ML/AI in Industry Today — Chapter 26. The vendor landscape told you who sells ML for bioprocessing and what their software actually does. This chapter asks the harder question: who has actually deployed it, on a named molecule, in a real plant — and how good is the evidence?

A vendor slide is a promise; a deployment is a fact, but only as strong as its evidence. This chapter walks the named, disclosed cases — the ones a careful reader can trace to a paper, a conference talk, or a press release — and grades each one twice: by maturity (is it in commercial GMP — Good Manufacturing Practice, the legally binding quality rules a licensed drug plant must follow — demonstrated at scale, or still academic?) and by evidence tier (who said so, and could anyone outside the company check it?). The uncomfortable finding, stated up front so the rest of the chapter cannot soften it: of all the headline efficiency numbers the industry quotes, essentially none clear the bar of peer-reviewed-independent fact — the standard, defined in full below, that a neutral third party with no commercial stake published and could check. They are real, they are interesting, and they are almost all a single company reporting on its own process.

That is not a reason to dismiss them. It is a reason to read them correctly. A self-reported +26.8 percent titer (the concentration, or yield, of antibody product the culture makes) from one smart lab (a highly automated lab where models propose experiments and robots run them) is genuine signal about what is possible; it is not a benchmark you can promise your own management. The whole skill this chapter teaches is holding both thoughts at once — and the discipline that makes it teachable is refusing to ever write the number without its two grades attached.

The simple version

Imagine reading restaurant reviews where every five-star rating was written by the restaurant's own owner. The food might genuinely be excellent — owners do not lie about everything — but you would weight those reviews very differently from a critic who paid for their own meal and has no stake in the outcome. Biomanufacturing ML case studies are mostly owner-written reviews. This chapter teaches you to read them like a critic: enjoy the dish, but never quote the owner's star rating as if a stranger gave it.

What this chapter covers

The handful of genuine (production) deployments in commercial GMP, and exactly how solid each one's evidence is: Amgen Juncos OPLS soft sensors, Amgen deep-learning AVI (automated visual inspection), Amgen's continuous Raman glucose loop (a production-tier company running a pilot-tier control loop), and the broad MVDA/MSPC (multivariate data analysis / multivariate statistical process control) monitoring backbone every large maker runs. (A soft sensor is a model that predicts a hard-to-measure quantity from easy-to-measure signals, instead of waiting for the slow lab test.)
The larger band of (pilot) demonstrations from named companies — including the most-cited hybrid-modeling case (hybrid modeling = a known physics/chemistry model with a data-driven ML component added; Bristol Myers Squibb with DataHow) — read at the peer-reviewed numbers, not the vendor-page numbers.
A pass down the named-company roster: Amgen, Genentech/Roche, Boehringer Ingelheim, Merck & Co., Pfizer, Sanofi, BMS, Eli Lilly, Biogen, WuXi Biologics, Samsung Biologics, Celltrion, and the big CDMOs Lonza and Fujifilm Diosynth — each tagged with maturity and evidence tier in the same breath as its number.
Two corrections the field repeatedly gets wrong, fixed here: Genentech's MARS is a drug-product unit operation (one discrete manufacturing step), not the Boehringer Ingelheim Protein A work; and DataHow is an independent company, not a Sartorius subsidiary.
The self-reporting problem named plainly, with a machine-checkable ledger that computes how many of the disclosed numbers can actually be stated as fact — the answer is zero, and the code proves it.

How to grade a deployment: two axes, not one

The series uses a fixed convention, and this chapter leans on it harder than any other. Every named case gets a maturity marker and an evidence tier, and the two are independent — a claim can be production-deployed yet weakly evidenced (Amgen AVI), or research-stage yet rigorously reviewed (Merck's deviation study — a deviation is a documented departure from the approved process that must be investigated). Treating them as one axis is the most common reading error in the field: a press release about a GMP plant feels authoritative because the plant is real, and the reader silently upgrades the number's credibility to match the deployment's maturity. The two-axis grade exists to break that reflex.

Maturity answers how far has it actually gone? — (production) deployed in GMP or commercial use; (pilot) demonstrated at scale but not running the plant; (research) academic or early proof-of-concept. Evidence tier answers how much should I trust the claim? — in descending order: peer-reviewed-independent (a third party with no commercial stake published it), peer-reviewed-self-authored (the company published it in a journal, so the method was reviewed but the headline number is still theirs), vendor-self-reported (a vendor or manufacturer asserts it on its own materials), and press-release-only (a marketing announcement with no method, baseline, or replication). The gap between the top two tiers is the one readers most often miss: a Biotechnology Journal paper means strangers checked the modeling approach, not that strangers re-ran the experiment and got the same +33 percent (the BMS/DataHow figure detailed later in this chapter).

The single rule that organizes the whole chapter: a number may be stated as established fact only at the peer-reviewed-independent tier or above. Everything below it is labeled illustrative or self-reported in the same sentence as the number. When you run this chapter's companion module, you will see that this rule disqualifies every headline figure in the named-deployment roster — which is precisely the point. The rule is not pessimism; it is the only writing discipline that survives contact with a skeptical quality unit or a regulator who asks "says who?"

Evidence

The industry-wide framing comes from two independent-ish sources. The first is the 7th ISPE Pharma 4.0 Survey (ISPE is the International Society for Pharmaceutical Engineering, and "Pharma 4.0" is the industry's name for applying Industry 4.0 digital and automation methods to drug manufacturing). Graded peer-reviewed-independent, in the sense of a neutral professional society reporting member data, it found AI/ML to have the most pilot projects and the fewest scaled implementations of any digital technology [1]. The second is McKinsey's State of AI 2025 (analyst/consultancy tier), which found that across all industries about 88 percent of organizations use AI but only roughly 6 percent qualify as high performers capturing enterprise-wide value, with about 7 percent reporting AI fully scaled [2]. The shape (many pilots, few scaled) is the backdrop for every case below, and it is why a roster of disclosed deployments looks longer than the roster of disclosed scaled deployments.

The (production) tier: a short, well-scrutinized list

Genuinely commercial-GMP ML in biomanufacturing is a short list, and it clusters where the vendor chapter said it would: monitoring, vision inspection, and soft sensing — not autonomous control of critical quality attributes (a CQA is a measurable property of the drug that must stay in spec for it to be safe and effective). Every production case below is advisory or inspection; a human or a downstream test stays in the loop. That is not a coincidence of disclosure, it is the regulatory line the next chapter draws.

Amgen Juncos: the cleanest production soft-sensor case

The strongest named production exemplar is Amgen's drug-substance plant at Juncos, Puerto Rico, where engineers deployed OPLS (orthogonal partial least squares) batch-level models inside SIMCA-online (the industry-standard multivariate-analysis software the models run in), the same MVDA lineage our running example's soft sensors sit in. The models predict harvest titer (the antibody yield when the culture is harvested), the protein mass of the capture eluate pool (how much antibody comes off the purification column, the quantity that governs which fractions are pooled together), and other in-process states in commercial GMP manufacturing. The reported operational payoff is the elimination of roughly six hours of harvest idle time plus roughly ten hours of inter-column idle time (about sixteen hours in all), by predicting when material is ready rather than waiting for the slow assay (a lab test that measures the quantity directly) [3]. This maps directly onto our spine: these are exactly the harvest-endpoint prediction and capture-pooling decisions, made on real GMP material rather than a development dataset. The model is not exotic (it is the PLS family the soft-sensor chapter builds), which is exactly why it cleared GMP: regulators and quality units understand a linear latent-variable model (one that maps inputs to outputs through a few interpretable underlying factors) in a way they do not yet understand a transformer (the complex deep-learning architecture behind today's large AI models).

Maturity: (production) — this is OPLS running in a commercial drug-substance facility, not a demo. Evidence: peer-reviewed-self-authored / vendor, and it matters which. The account is a first-party BioProcess International article authored by three Amgen Juncos employees, complemented by a Sartorius SIMCA vendor case study [3]. The deployment is real and credible; the specific hour-savings are the company's own figures and have not been independently audited. So the honest sentence — the template for the whole chapter — is: Amgen Juncos runs OPLS soft sensors in commercial GMP (production), and reports about six hours of harvest idle and ten hours of inter-column idle eliminated (illustrative, self-reported).
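
The template sentence can be sketched as a small formatter. This is an illustrative helper, not part of the chapter's companion module: the name `honest_sentence` and its parameters are assumptions made for the example, and the tier ordering is copied from the convention defined earlier in this chapter.

```python
# Hypothetical helper (not from case_ledger.py): build the chapter's template
# sentence so the number never appears without its two grades.
TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)
FACT_FLOOR = "peer-reviewed-independent"


def honest_sentence(who: str, what: str, maturity: str, tier: str, claim: str) -> str:
    """Return one sentence carrying the deployment, its maturity, and a hedged number."""
    if tier not in TIER:
        raise ValueError(f"unknown evidence tier: {tier!r}")
    # Below the fact floor, the number must be labeled in the same sentence.
    is_fact = TIER.index(tier) >= TIER.index(FACT_FLOOR)
    hedge = "established" if is_fact else "illustrative, self-reported"
    return f"{who} runs {what} ({maturity}), and reports {claim} ({hedge})."


sentence = honest_sentence(
    who="Amgen Juncos",
    what="OPLS soft sensors in commercial GMP",
    maturity="production",
    tier="peer-reviewed-self-authored",
    claim="about six hours of harvest idle and ten hours of inter-column idle eliminated",
)
print(sentence)
```

The hedge is derived from the tier rather than typed by hand, so a writer cannot drop it without changing the tier argument.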

Amgen AVI: the most production-ready ML in QC

The other clearly production case from Amgen is automated visual inspection (AVI) of vials and syringes using deep-learning computer vision — the single most mature ML use in QC. Rule-based inspection over-rejects (Amgen has cited up to roughly 20 percent false rejection of good containers), and a trained convolutional neural network (CNN — the standard deep-learning architecture for images) both catches more true defects and rejects far fewer good units. The concrete, validated case Amgen describes in detail is an industry-first AI retrofit of a Syntegon (formerly Bosch) syringe-inspection machine at Juncos, reporting roughly a 70 percent boost in particle detection and a 60 percent reduction in false rejects — distinguishing bubbles from genuine contaminants in viscous solutions, the failure mode that defeats rule-based vision. From that base Amgen reports auto-releasing on the order of 95 percent of syringes and vials through AI-assisted inspection [4].

Maturity: (production). Evidence: vendor / trade-press self-reported. The 95 percent figure and the 20 percent rule-based false-rejection figure come from Amgen conference and trade coverage (e.g., the 2025 ISPE Aseptic Conference); the fully validated retrofit Amgen has described in detail was a syringe line, and reaching production took years of work and direct conversations with FDA [4]. The lesson the vision chapters drew — that vision inspection is where deep learning genuinely earns GMP trust, because a wrong call is caught by a human or a re-inspection rather than shipped in a vial — is true; the exact percentages are illustrative.

Amgen continuous Raman glucose control: production-tier company, pilot-tier loop

Amgen also published a continuous Raman plus deep-learning glucose feedback-control demonstration on the production bioreactor. Raman spectroscopy is an optical scan that infers a concentration such as glucose without a separate lab sample; here the feed loop is closed on Raman-inferred glucose to hold residual glucose tight and avoid the low-glucose excursions that drive up high-mannose glycoforms. These are antibody sugar chains left under-trimmed when nutrient-starved cells slow their glycan processing. They shorten the drug's serum half-life (faster clearance from the body) and alter its effector (immune-killing) function, which makes the glycoform profile a release-relevant CQA. The reported result is lower high-mannose alongside higher titer. This is the closest a named, peer-reviewed case comes to closed-loop control of a quality-relevant variable, and it is precisely why the maturity grade matters. The work is peer-reviewed-self-authored (Rashedi and colleagues, 2025) and the company is a production-tier operator, but the glucose-control loop itself is a demonstration (pilot), not a standing commercial closed loop signing off on release. Read it as the strongest evidence that the loop can close, not as evidence that it is closed in routine GMP. The distinction is the whole reason maturity is a separate axis from evidence.

The pattern under the production tier

Step back and the production list has a shape: PLS/OPLS multivariate models (Amgen Juncos; the broad SIMCA/SIMCA-online installed base every large biologics maker runs) and deep-learning vision (AVI). Both are advisory or inspection roles where a human or a downstream test remains in the loop. Notably absent: a model that closes the loop on a release-defining CQA with no human gate. Amgen's own Raman glucose work, the most control-like case in the roster, is a demonstration rather than a standing autonomous loop. That absence is not an accident of disclosure — it is the regulatory line the draft EU/PIC/S GMP Annex 22 draws, which the next chapter takes up in full.

The (pilot) tier: demonstrated, named, but not running the plant

Below production sits a much larger band: named companies showing real ML on real molecules, at real scale, but as demonstrations rather than standing GMP operations. This is where most of the famous numbers live — and where the self-reporting problem is sharpest, because a pilot has every incentive to publish its best run and no obligation to publish its baseline.

Bristol Myers Squibb with DataHow: read the peer-reviewed numbers

The most-cited hybrid-modeling case is Bristol Myers Squibb's process-development dataset modeled by DataHow: 48 experiments at 5 L development scale, 12 critical process parameters (the controllable inputs — feed rate, temperature, and so on — that drive product quality), 18 CQAs — a realistic PD (process-development) design-of-experiments size (a planned set of runs that varies inputs systematically), not a data-starved accident, which is exactly why hybrid modeling is the right tool. It is the canonical demonstration of the hybrid paradigm — a mechanistic backbone (the known physics and chemistry of the process) with an ML component bolted on — beating a pure black box (a model that learns only from data, with no built-in physics) in the small-data regime that defines real PD, where you have dozens of runs, not the millions of examples a deep network wants.

Here the evidence tier does real work, because there are two versions of the headline, and they disagree. DataHow's own case-study page reports roughly 22 percent better prediction accuracy and about 3x fewer experiments. A peer-reviewed companion in Biotechnology Journal (2024), co-authored by DataHow and BMS, headlines roughly 33 percent better accuracy with substantially less experimental data — about half — versus a black-box model [5]. The rule of this chapter is to prefer the peer-reviewed figures over the vendor page — both because the method was reviewed and because, instructively, the journal is the more conservative claim on the data axis (about half, i.e. 2x, rather than the vendor page's 3x). When the reviewed number is the smaller number, you have a small natural experiment in why peer review matters. Maturity: (pilot) — this is process development, not GMP production. Evidence: peer-reviewed-self-authored for the journal version.
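
To make the gap on the data axis concrete, the arithmetic below applies each claimed reduction to the 48-run design. This is an illustration only: the 48-run size comes from the text, but applying either factor to that exact design is an assumption made here for comparison, not a figure from either source.

```python
# Illustrative arithmetic only: the 48-run design size is from the text;
# applying each reduction factor to it is an assumption made for comparison.
runs_in_design = 48

vendor_page_factor = 3   # "about 3x fewer experiments" (vendor page)
journal_factor = 2       # "about half" the data (journal paper)

runs_vendor_claim = runs_in_design / vendor_page_factor
runs_journal_claim = runs_in_design / journal_factor

print(f"vendor page implies ~{runs_vendor_claim:.0f} runs")
print(f"journal implies     ~{runs_journal_claim:.0f} runs")
print(f"difference          ~{runs_journal_claim - runs_vendor_claim:.0f} runs")
```

On a design of this size, the two versions of the headline differ by several bioreactor runs, which is not a rounding difference for a process-development team.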

Caution

Two corrections this case attracts, fixed here because they keep recirculating. First, DataHow is an independent ETH Zurich spin-off — its Series A was led by Momenta with Rockwell Automation and Zürich Kantonalbank, and it has collaborated with the likes of Eppendorf and Genedata — it is not a Sartorius subsidiary. Conflating an independent modeling house with a hardware giant misreads both the incentives and the install base. Second, the BMS/DataHow result is a prediction-accuracy and experiment-efficiency claim on a PD dataset; it is not a closed-loop GMP control deployment, and it should be read as (pilot), not production.

Genentech/Roche: MARS is drug product, not Protein A

A correction that prevents a real error: the Boehringer Ingelheim work predicting 16 quality attributes in-line (measured during the process, not in a separate lab afterward) during Protein A capture — the affinity-chromatography step that captures the antibody out of the harvested broth — in about 30 seconds [6] is frequently — and wrongly — attributed to Genentech. The analogous Genentech/Roche program is MARS (multi-attribute Raman spectroscopy), an in-line spectroscopic monitoring effort on formulated drug product (the final filled, finished dosage form — distinct from drug substance, the bulk purified antibody made upstream): a Raman-plus-DoE-plus-MVDA (design-of-experiments plus multivariate analysis) method measuring multiple product-quality attributes of a formulated monoclonal antibody from a single scan [7]. That is a different unit operation entirely, at the fill-finish end of the spine, not the capture step. Two different companies, two different unit operations, two different molecules' worth of context. Conflating them produces a phantom "Genentech Protein A CNN" that does not exist anywhere in the literature — and once you have invented a deployment, you will quote its invented number.

And while we are precise about the BI work: it predicted 16 CQAs in-line, fast, on Protein A — a genuine (pilot) result, peer-reviewed-self-authored — but the model was k-nearest-neighbor (KNN) regression — a simple method that predicts from the most similar past examples — on smoothed (Butterworth-filtered) Raman spectra, not a CNN or any deep network, and the paper makes no deep-learning-superiority claim [6]. It is excellent evidence for Raman PAT on capture; it is not evidence for a "deep-learning Raman wave." The error of upgrading a KNN to a CNN in retelling is the same error as upgrading a pilot to production — the listener hears the more impressive word and stops checking.

Roche/Genentech's other heavily publicized program is its NVIDIA AI-factory and Omniverse facility twins [8] — impressive, but facility-design and simulation twins (pilot, press-release tier), virtual replicas of production lines for design and simulation, not closed-loop control of product quality. The digital-twin chapter already drew this distinction; the case-study reading is the same. A facility twin that lays out a plant is a genuine engineering tool and an entirely different claim from a process twin that controls a CQA.

WuXi Biologics: the autonomous-lab headline, read carefully

The most striking single number in the disclosed literature is WuXi Biologics' ISLFCC (Industrial Smart Lab Framework for Cell Culture). It pairs decoder-only transformer models (named CCGPT/ILGPT; the same GPT-style deep-learning architecture that powers large language models, here trained to propose feeds) with robotic sampling, and reports a +26.8 percent average titer improvement across three CHO clones (Chinese hamster ovary, the standard mammalian cell line used to make antibodies) by keeping the culture out of the late-phase lactate rebound that normally caps CHO fed-batch yield, all within a single batch [9]. Late in a fed-batch run, cells often start dumping lactate (a metabolic waste acid) back into the broth, which stresses them and limits how much antibody they make; here the loop holds lactate below 1 g/L with no late-phase rebound. It is a vivid demonstration of the autonomous-lab frontier: a generative architecture proposing feeds, robots executing them, the loop closing without a scientist between proposal and action.

Read it precisely. Evidence: peer-reviewed-self-authored (Biotechnology Progress, 2025) — the method was reviewed, which is more than most cases get. But it is single-company, self-reported, with no independent replication, and the maturity is process-development scale (3–15 L), not GMP (pilot). So: WuXi's ISLFCC reports +26.8 percent average titer (illustrative, self-reported, PD scale). The number tells you the ceiling a transformer-driven smart lab reached on its own clones (cell lines descended from a single founder cell), under its own baseline, in its own facility; it does not tell you what your plant will get, and it would be a category error to present it to management as a target. Read +26.8 percent as a real signal that this kitchen can cook — not a Michelin rating you can put on your own menu.

Sanofi, Pfizer, Merck, and the supply-chain / ops cases

Sanofi offers two often-quoted figures, and both reward close reading. Its SimplY yield-analytics program on Dupixent drug substance at the Geel, Belgium "digital lighthouse" site (a World Economic Forum designation for a plant showcasing advanced digital manufacturing) cites a +8 percent drug-substance improvement. Read closely, that figure is a three-year target attributed to a Sanofi manufacturing-digital executive: self-reported, not an audited realized outcome (production program, vendor-self-reported, illustrative) [10]. Its plai supply-chain control tower (co-developed with Aily Labs) cites about 80 percent stockout prediction, from a June 2023 press release, with a separate ~65 percent risk-to-root-cause figure from an undated corporate page; neither is independently verified (production, press-release-only, illustrative) [11]. This is the distribution / forecasting layer, where AI adoption is real but headline numbers are softest precisely because the outcome is hard to attribute to the model rather than to the year.

Pfizer's widely cited "Golden Batch" / Vox manufacturing-AI figures — roughly 16,000 hours per year saved and about 20,000 additional doses per batch — trace to Pfizer/AWS communications and analyst secondary write-ups, and they blend distinct claims: the 16,000 hours is scientist document-search time saved, the 20,000 doses is an mRNA-vaccine yield prediction, and the work is largely small-molecule / operations rather than biologics CQA control (pilot, press-release-only, illustrative) [12]. Quoting "16,000 hours" and "20,000 doses" in one breath, as a single biologics result, fuses two unrelated numbers into a phantom third — a textbook example of why a number with no unit operation attached is uninterpretable.

Merck & Co. (MSD) deserves its own line, because it is the lone case in the roster with a published critical self-assessment of its own failure mode, a point the closing ledger discussion leans on. It contributes a peer-reviewed GenAI deviation-investigation study (evaluating GPT-3.5, GPT-4, and Claude-2 on root-cause extraction and semantic search over real deviation records) that is candid about "the complex interplay between apparent reasoning and hallucination" [13]. This is a generative-AI result, retrieval and extraction rather than predictive control, tagged (research), peer-reviewed-self-authored. That candor is why it earns the research tier honestly rather than as a euphemism.

Samsung, Celltrion, Lilly, Biogen, and the CDMOs

The remaining named players are mostly (pilot) programs disclosed at the corporate level, with the specific GMP models undisclosed — the band where maturity claims outrun evidence by the widest margin. Samsung Biologics describes hybrid modeling (CFD — computational fluid dynamics, simulating how fluids move and mix — plus mechanistic plus ML), Raman spectroscopy as a glucose soft sensor with automated feeding, hybrid MPC (model predictive control — a controller that uses a model to plan its next moves), and AI/digital-twin technology at Plant 5 in Songdo — while executives candidly note much of the workflow is still manual (pilot, vendor-self-reported) [14]. Celltrion describes AI-powered smart factories at new API/drug-substance facilities in Songdo — autonomous logistics robots, automated warehousing, collaborative robots, intelligent manufacturing platforms — with quality control and production optimization slated for later phases (pilot, press-release-only) [15]. Eli Lilly and Biogen have broad smart-manufacturing and AI-in-operations programs disclosed in corporate communications, with no named GMP CQA-control deployment in the public record (pilot, press-release-only). The large CDMOs (contract development and manufacturing organizations) — Lonza and Fujifilm Diosynth — run MVDA/MSPC and PAT (process analytical technology — in-process measurement and control of quality) analytics platforms broadly across client programs (production for the monitoring layer), but per-client ML outcomes are contractually private and not publicly disclosed (vendor-self-reported). For contract manufacturers the monitoring backbone is real and widespread; the disclosed outcomes are not, because the outcomes belong to the client. The pattern across this whole tier: the infrastructure is genuinely there, the headline number is either absent or unverifiable.

The named deployments plotted on two axes at once: how far each has gone (maturity) against how much you should trust the claim (evidence tier). The production cases cluster at mid-tier evidence; the peer-reviewed cases cluster at pilot maturity; and the top-right quadrant — independently verified, in commercial GMP — is empty. The dashed established-fact line sits above every plotted headline number. Original diagram by the authors, created with AI assistance.

The self-reporting problem, made machine-checkable

The argument of this chapter is structural, not anecdotal, so the contributed module encodes it as data. examples/platform/ml/case_ledger.py is a small stdlib-only ledger: each named deployment is a Case with an explicit maturity, tier, unit operation, and verification note, and a method stated_as_fact_ok() that returns True only when the evidence clears the peer-reviewed-independent floor. The point of the code is not the code; it is that the curated evidence — which claim, at which tier — becomes auditable and countable rather than rhetorical. You cannot quietly upgrade a tier in a paragraph; you have to change a field, and the field has a name.

# examples/platform/ml/case_ledger.py  (excerpt)
from dataclasses import dataclass

MATURITY = ("research", "pilot", "production")   # ordered, least to most deployed
TIER = (                                         # ordered, weakest to strongest evidence
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)
FACT_FLOOR = "peer-reviewed-independent"   # a number is "fact" only at/above this

@dataclass(frozen=True)
class Case:
    company: str
    application: str
    unit_op: str          # where on the bioprocess spine
    claim: str            # the disclosed headline, verbatim-ish
    maturity: str         # one of MATURITY
    tier: str             # one of TIER
    note: str             # the verification caveat

    def stated_as_fact_ok(self) -> bool:
        # True only when this row's tier sits at or above the fact floor.
        return TIER.index(self.tier) >= TIER.index(FACT_FLOOR)

# LEDGER, the curated list of Case rows, is defined elsewhere in the module.
def overstated_if_quoted():
    """Numeric headlines that do NOT clear the fact floor — must be hedged."""
    return [c for c in LEDGER
            if ("%" in c.claim or "+" in c.claim or "hrs" in c.claim or "doses" in c.claim)
            and not c.stated_as_fact_ok()]

The design choices are deliberate. Each Case is a frozen dataclass, so a claim cannot be mutated after curation; tier and maturity draw from ordered tuples, so the fact-floor comparison is a simple index test rather than a judgment call; and note is a required field, not optional, so no row can be built without its caveat. The point is that the evidence convention stops being a style note in a CLAUDE file and becomes an assertion the interpreter can check.
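
The excerpt does not show how membership in the ordered tuples, or the presence of a caveat, is enforced. One way it could be done is a construction-time check, sketched below under that assumption. `CheckedCase` and `try_build` are hypothetical names for this example and are not part of case_ledger.py; the sample row uses the WuXi grades given in this chapter.

```python
from dataclasses import dataclass

MATURITY = ("research", "pilot", "production")
TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)


@dataclass(frozen=True)
class CheckedCase:
    """Hypothetical variant of Case that rejects bad rows at construction time."""
    company: str
    claim: str
    maturity: str
    tier: str
    note: str

    def __post_init__(self) -> None:
        # Reading fields is allowed on a frozen dataclass; only assignment is blocked.
        if self.maturity not in MATURITY:
            raise ValueError(f"maturity must be one of {MATURITY}")
        if self.tier not in TIER:
            raise ValueError(f"tier must be one of {TIER}")
        if not self.note.strip():
            raise ValueError("note (the verification caveat) must not be empty")


def try_build(**fields) -> str:
    """Return 'ok' or the validation error message for a candidate row."""
    try:
        CheckedCase(**fields)
        return "ok"
    except ValueError as err:
        return f"rejected: {err}"


good = dict(
    company="WuXi Biologics",
    claim="+26.8% average titer across 3 CHO clones (illustrative)",
    maturity="pilot",
    tier="peer-reviewed-self-authored",
    note="Single-company, no independent replication; PD scale.",
)
print(try_build(**good))
print(try_build(**{**good, "note": "  "}))                     # blank caveat
print(try_build(**{**good, "tier": "independently-verified"}))  # not a known tier
```

With a check like this, an unknown tier or a blank caveat fails when the row is built, rather than surfacing later as a confusing `TIER.index` error inside the fact-floor comparison.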

This is the same governance move the data book makes for any value that must be trusted: a note field that cannot be empty is a metadata requirement enforced in code rather than hoped for in a procedure, and unit_op constrained to the named steps of the bioprocess is the case-study equivalent of grounding a tag in the ISA-95 equipment hierarchy so that "harvest" means one specific operation, not whatever a writer felt like calling it. The flat cases.csv export matters for the same reason a data shadow is auditable: the typed records reduce, losslessly, to one table a reviewer can sort, filter, and re-derive the distribution from without running the Python — the analyst's check that the code's count and the data's count agree.
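
The flat export can be sketched with the standard library alone. The function names `to_csv` and `from_csv` are hypothetical, and the module's actual export code is not shown in this chapter; the single sample row uses the Amgen Juncos field values quoted in this chapter's anatomy section.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class Case:
    company: str
    application: str
    unit_op: str
    claim: str
    maturity: str
    tier: str
    note: str


def to_csv(cases: list[Case]) -> str:
    """Flatten typed records to CSV text: one row per case, one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Case)])
    writer.writeheader()
    for case in cases:
        writer.writerow(asdict(case))
    return buffer.getvalue()


def from_csv(text: str) -> list[Case]:
    """Rebuild the records from CSV text, to confirm nothing was lost."""
    return [Case(**row) for row in csv.DictReader(io.StringIO(text))]


rows = [
    Case(
        company="Amgen (Juncos, PR)",
        application="SIMCA OPLS harvest-titer / in-process models",
        unit_op="harvest + capture",
        claim="~6 h harvest idle + ~10 h inter-column idle eliminated (illustrative)",
        maturity="production",
        tier="peer-reviewed-self-authored",
        note="First-party BioProcess International account by Amgen Juncos engineers "
             "+ Sartorius case study; hour-savings not externally audited.",
    ),
]
text = to_csv(rows)
assert from_csv(text) == rows   # the round trip is lossless for all-string fields
print(text)
```

Because every field is a string, the round trip through CSV recovers the records exactly, which is what lets a reviewer re-derive the counts from the table without running the ledger code.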

Running python3 case_ledger.py over the curated roster prints the distribution this chapter quotes — reproduced here verbatim from the module's recorded run:

case ledger: 16 named deployments
  by maturity: {'production': 5, 'pilot': 10, 'research': 1}
  by tier:     {'peer-reviewed-self-authored': 7, 'vendor-self-reported': 4, 'press-release-only': 5}

headline numbers that must be hedged (below peer-reviewed-independent): 7 of 7 numeric claims
  - Amgen (Juncos, PR): "~6 h harvest idle + ~10 h inter-column idle eliminated (illustrative)"  [peer-reviewed-self-authored]
  - Amgen: "~95% of syringes/vials auto-released (illustrative)"  [vendor-self-reported]
  - Bristol Myers Squibb (with DataHow): "~33% better accuracy with ~half the data vs black-box"  [peer-reviewed-self-authored]
  - Sanofi: "+8% drug substance over 3 yrs (illustrative)"  [vendor-self-reported]
  - Sanofi: "~80% stockout prediction (illustrative)"  [press-release-only]
  - WuXi Biologics: "+26.8% average titer across 3 CHO clones (illustrative)"  [peer-reviewed-self-authored]
  - Pfizer: "16,000 hrs/yr, +20,000 doses/batch (illustrative)"  [press-release-only]

The module's fact_grade_claims() returns an empty list, so the run closes on the line the whole chapter is built to earn:

claims that clear the established-fact floor: 0

Read the last line again: zero of the named headline numbers clear the established-fact floor. Seven of the seven numeric claims in the roster must be hedged. That is not cynicism encoded in code. The ledger generously tags the BMS/DataHow and WuXi results as peer-reviewed-self-authored, the best tier any of them reach, and never demotes a journal paper to vendor tier just to make a point; the result is simply what the evidence is. The distribution also makes the maturity story concrete: production cases exist (five of them), but the peer-reviewed cases cluster in the pilot column, and the one research-tier case (Merck's deviation study) is the only one with a published critical self-assessment of its own failure mode. The two axes pull in opposite directions (the things that are deployed are not the things that are independently verified, and vice versa), which is exactly the empty-quadrant story the hero figure tells.
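
The count can be re-derived by hand from the seven claims printed in the recorded run. The sketch below transcribes only the company and tier of each printed line; the list name `NUMERIC_CLAIMS` is an assumption for this example, and the full 16-row ledger is not reproduced here.

```python
from collections import Counter

TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)
FACT_FLOOR = "peer-reviewed-independent"

# (company, tier) pairs transcribed from the recorded run quoted in this chapter.
NUMERIC_CLAIMS = [
    ("Amgen (Juncos, PR)", "peer-reviewed-self-authored"),
    ("Amgen", "vendor-self-reported"),
    ("Bristol Myers Squibb (with DataHow)", "peer-reviewed-self-authored"),
    ("Sanofi", "vendor-self-reported"),
    ("Sanofi", "press-release-only"),
    ("WuXi Biologics", "peer-reviewed-self-authored"),
    ("Pfizer", "press-release-only"),
]

floor = TIER.index(FACT_FLOOR)
by_tier = Counter(tier for _, tier in NUMERIC_CLAIMS)
clears = [company for company, tier in NUMERIC_CLAIMS if TIER.index(tier) >= floor]
must_hedge = len(NUMERIC_CLAIMS) - len(clears)

print(dict(by_tier))
print(f"must be hedged: {must_hedge} of {len(NUMERIC_CLAIMS)}")
print(f"clear the established-fact floor: {len(clears)}")
```

This is the analyst's cross-check in miniature: the tally from the printed rows and the tally from the module should agree, and a disagreement would mean a tier was changed in one place but not the other.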

Anatomy of one case-ledger entry

The signature of this series is to unpack a single record. For a survey chapter the record is not a prediction or a model — it is one evidence claim, and what travels alongside the number is the entire value. A bare "~6 h saved" is worthless; the same number with its company, unit operation, maturity, tier, and caveat attached is a defensible sentence. Take that Amgen Juncos row and lay it out field by field.

One case is a whole record: the company and unit operation that scope it, the disclosed claim, the two grades (maturity and evidence tier) read against the fact floor, the verification caveat that keeps it honest, and the stated_as_fact_ok verdict — here False, because a production deployment with first-party evidence is real but its number is still not independent fact. Original diagram by the authors, created with AI assistance.

Read the card field by field and the chapter's method is laid out as data:

company — Amgen (Juncos, PR). Not "Amgen" in the abstract but the specific site, because the deployment is the Juncos drug-substance plant. A claim with no site is a claim with no place to inspect.
application — SIMCA OPLS harvest-titer / in-process models. Names the model family (OPLS, the PLS lineage of the soft-sensor chapter) and the software it runs in (SIMCA-online), so the reader can place it in the vendor landscape and judge whether the method is one a quality unit would accept.
unit_op — harvest + capture. Pins the claim to two points on the spine: harvest endpoint and capture pooling. A number with no unit operation attached is uninterpretable, which is the recurring sin of the press-release tier (see the Pfizer 16,000-hours/20,000-doses fusion above).
claim — "~6 h harvest idle + ~10 h inter-column idle eliminated (illustrative)". The disclosed headline, carrying its (illustrative) tag inside the string so the hedge cannot be lost in transcription.
maturity — production. Real commercial GMP, the strongest maturity grade. This is the field that tempts a reader to upgrade the trust of the number — which is exactly why the next field exists.
tier — peer-reviewed-self-authored. The method has the credibility of a trade-press first-party account plus a vendor case study; the number is still Amgen's own. The established-fact floor sits one tier above this, so the figure does not reach it.
note — "First-party BioProcess International account by Amgen Juncos engineers + Sartorius case study; hour-savings not externally audited." The caveat a careful reader must carry alongside the number forever. It is a required field; the row could not exist without it.
stated_as_fact_ok() — False. The boolean the code computes: TIER.index("peer-reviewed-self-authored") >= TIER.index("peer-reviewed-independent"), which is False because the record's tier sits below the floor. That single bit is the chapter compressed: a production deployment with first-party evidence is real, and its number is still not independent fact.

The two grade rows are the heart of the card: maturity = production (this is real GMP) sits beside tier = peer-reviewed-self-authored (but the number is the company's own), and the established-fact floor is drawn above the current tier, so the eye sees immediately that the claim does not reach it. The whole anatomy is built so you cannot read the impressive maturity without also reading the unmet evidence bar in the same glance — which is the discipline the chapter wants in muscle memory.
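The card can also be read as a small typed record. What follows is a minimal sketch, not the module's actual source: the field names, the field values, and the TIER tuple are taken from the text, while the MATURITY and FACT_FLOOR constant names and the exact method body are assumptions made for illustration.

```python
from dataclasses import dataclass, replace

# Ordered vocabularies quoted in the chapter; position in the tuple is the rank.
MATURITY = ("research", "pilot", "production")  # constant name is an assumption
TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)
FACT_FLOOR = "peer-reviewed-independent"  # constant name is an assumption


@dataclass(frozen=True)
class Case:
    company: str
    application: str
    unit_op: str
    claim: str
    maturity: str
    tier: str
    note: str

    def stated_as_fact_ok(self) -> bool:
        # A derived value: True only when the tier is at or above the floor.
        return TIER.index(self.tier) >= TIER.index(FACT_FLOOR)


juncos = Case(
    company="Amgen (Juncos, PR)",
    application="SIMCA OPLS harvest-titer / in-process models",
    unit_op="harvest + capture",
    claim="~6 h harvest idle + ~10 h inter-column idle eliminated (illustrative)",
    maturity="production",
    tier="peer-reviewed-self-authored",
    note=(
        "First-party BioProcess International account by Amgen Juncos engineers "
        "+ Sartorius case study; hour-savings not externally audited."
    ),
)

# Hypothetical counterfactual: the same row if an independent replication existed.
promoted = replace(juncos, tier="peer-reviewed-independent")

print(juncos.maturity, juncos.stated_as_fact_ok())  # production False
print(promoted.stated_as_fact_ok())                 # True
```

The counterfactual row makes the design choice visible: maturity never enters the comparison, so no amount of production use can flip the verdict. Only a change to the named tier field can.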

The ledger is a controlled vocabulary in disguise

It is worth noticing what the Case record actually is, because it is the same discipline the ontology book makes formal. The fields are not free text: maturity is constrained to the ordered tuple ("research", "pilot", "production") and tier to a four-rung ladder, so the fact-floor test is a simple index comparison rather than a judgment call. That is a controlled vocabulary (a governed list of agreed terms), exactly the first remedy the data book reaches for when two systems must mean the same thing by the same word, and the input that makes a knowledge graph FAIR (Findable, Accessible, Interoperable, Reusable) rather than a pile of strings. An evidence tier typed against ("press-release-only", "vendor-self-reported", "peer-reviewed-self-authored", "peer-reviewed-independent") is the same construct an OWL ontology writes as an owl:oneOf enumeration or a SKOS concept scheme, and the same one a SHACL sh:in constraint checks when it refuses a releaseStatus that is not drawn from ("PASS" "OOS" "PENDING"). The Python tuple and the sh:in list are the same idea at two rigor levels: a value that is not in the set is not a typo to tolerate, it is a record that must not exist.
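One way to make "a record that must not exist" literal in plain Python is to reject out-of-vocabulary values at construction time. The sketch below is an assumption about how such a check could look, not a quotation of case_ledger.py: GradedClaim and try_build are hypothetical names, and only the two vocabularies and the required note come from the text.

```python
from dataclasses import dataclass

MATURITY = ("research", "pilot", "production")
TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)


@dataclass(frozen=True)
class GradedClaim:  # hypothetical record, trimmed to the graded fields
    maturity: str
    tier: str
    note: str

    def __post_init__(self) -> None:
        # Membership tests play the role an sh:in constraint plays in SHACL.
        if self.maturity not in MATURITY:
            raise ValueError(f"maturity {self.maturity!r} is not one of {MATURITY}")
        if self.tier not in TIER:
            raise ValueError(f"tier {self.tier!r} is not one of {TIER}")
        if not self.note.strip():
            raise ValueError("note is required")


def try_build(maturity: str, tier: str, note: str) -> str:
    """Return 'ok' if the record can exist, else the rejection message."""
    try:
        GradedClaim(maturity, tier, note)
    except ValueError as err:
        return f"rejected: {err}"
    return "ok"


print(try_build("production", "peer-reviewed-self-authored", "not externally audited"))
print(try_build("pretty mature", "peer-reviewed-self-authored", "not externally audited"))
print(try_build("pilot", "peer-reviewed-self-authored", ""))
```

The first call prints ok; the other two print a rejection, one for a free-text maturity and one for a missing provenance note.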

Read the whole ledger through the ontology lens and a stronger claim follows. The Case is a typed node; unit_op is the edge that pins it to the bioprocess spine the way bp:derivedFrom pins a lot to its genealogy; and stated_as_fact_ok() is a derived property — a computed truth, not an asserted one, exactly the kind of fact an ontology reasons out rather than stores. If this roster were lifted into the running campaign's graph, each case would become an instance carrying an IRI, its evidence tier a value from a published code list, its unit-operation tag a typed link a query could traverse — and overstated_if_quoted() would be one SPARQL filter over the tier ladder rather than a list comprehension. The point is not to rebuild the module in RDF; it is that the ledger already follows the ontology book's grammar — controlled values, typed relations, derived facts, a required provenance note — and that grammar is what makes the evidence countable instead of rhetorical. A column of free-text "maturity: pretty mature" strings could not have produced the zero this chapter closes on; an enumerated, machine-checkable vocabulary could.

This is also why the same grounding that makes a model trustworthy makes this ledger auditable. The frontier chapter argues that an ontology-grounded graph is what keeps a fluent model from inventing answers: a number pulled by its evidence-tier IRI cannot be quietly upgraded the way a number floating in prose can. The ledger is the small, in-suite proof of that argument. It does to a claim what a knowledge graph does to a measurement (attaches its provenance, constrains its type, and computes whether it may be stated as fact), so the evidence convention stops being a style note in a conventions file and becomes an assertion the interpreter can check.

The unsolved part: there is no independent biomanufacturing-ML benchmark

The honest open problem is the empty quadrant in the hero figure: there is no independent, third-party benchmark for ML in biomanufacturing, and almost no possibility of one under current conditions. In computer vision, ImageNet (a large public labeled-image dataset and competition) let any lab measure any model against the same yardstick; in language, public leaderboards let strangers replicate claims. Biomanufacturing has nothing comparable, for structural reasons that are not going away soon. The consequence is that the field's evidence ceiling is peer-reviewed-self-authored: one rung short of the top of the four-tier ladder (press-release-only, vendor-self-reported, peer-reviewed-self-authored, peer-reviewed-independent), and therefore one tier short of fact, no matter how good the science is.

Three walls, each independent:

The data is proprietary and small. A company's batch records are competitively sensitive and bound by GMP confidentiality, so the training set behind a +26.8 percent titer claim is, by construction, unshareable. There is no public corpus on which a stranger could even attempt to reproduce a result.
The processes are non-identical. A different cell line, scale, raw-material lot, or facility makes one company's "titer improvement" incommensurable with another's. Even if two companies shared data, a +33 percent on one molecule is not comparable to a +33 percent on another — there is no shared task definition the way "classify this image" is shared.
The experiments are too expensive to replicate. No neutral lab is going to run 48 GMP-scale runs, or stand up a transformer-plus-robotics smart lab, to check a vendor's claim. The cost of independent verification is the cost of the original study, which is the cost no third party will pay.

Federated learning, which trains a shared model across companies without any of them handing over its raw data, was floated as a way around the first wall. But the one proven program at scale (MELLODDY, ten pharma companies federating roughly 2.6 billion activity data points across about 21 million molecules) was discovery QSAR (quantitative structure-activity relationship: predicting a molecule's activity from its chemical structure, a drug-discovery task, not a manufacturing one), and federated learning across physical production sites remains conceptual [16]. The consequence is that the evidence ceiling for this whole field is, structurally, peer-reviewed-self-authored: a company publishing its own reviewed result. That is genuinely better than a press release, but it is not the independent verification that would let you treat any number as a benchmark. Until the industry builds shared, anonymized, standardized datasets, the kind that the "data as a product" workstream at BioPhorum (a biopharma industry collaboration) and FAIR-data efforts (FAIR = Findable, Accessible, Interoperable, Reusable) gesture at [17], the top-right quadrant stays empty, and the correct posture toward every headline number stays: interesting, illustrative, and unverified. The benchmark problem is not a gap someone forgot to fill; it is a structural feature of a regulated, competitive, expensive industry, and the right response is the two-axis grade, not a wish for an ImageNet that cannot exist.

What this chapter adds to the model suite

This chapter contributes examples/platform/ml/case_ledger.py and its flat export cases.csv. It is the suite's one non-model artifact, and deliberately so: the case-studies chapter's job is not to fit anything but to make the evidence countable. The module ships the curated roster of 16 named deployments as typed, frozen Case records; computes the maturity/tier distribution the chapter quotes via distribution(); and — most usefully — exposes overstated_if_quoted(), which any later chapter or notebook can call to confirm that a headline it wants to cite must be hedged, and fact_grade_claims(), which returns the (currently empty) set of claims that clear the floor. The companion cases.csv is the same ledger in flat form for a notebook. It is the in-code embodiment of the series' evidence-tier convention: the rule "a number is fact only at peer-reviewed-independent or above" stops being a style note and becomes an assertion the build can check, with fact_grade_claims() returning [] as the machine-verified proof.
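The three helpers named above can be sketched in a few lines. This is an illustration under stated assumptions, not the shipped module: the real functions appear to read a module-level roster of 16 cases, whereas these take the roster as an argument, and the three rows are placeholders invented for the example, not entries from the chapter's ledger.

```python
from collections import Counter
from typing import NamedTuple

TIER = (
    "press-release-only",
    "vendor-self-reported",
    "peer-reviewed-self-authored",
    "peer-reviewed-independent",
)
FACT_FLOOR = "peer-reviewed-independent"  # constant name is an assumption


class Row(NamedTuple):  # hypothetical, trimmed stand-in for a Case record
    label: str
    maturity: str
    tier: str


# Placeholder rows for illustration only; NOT the chapter's roster.
ROSTER = (
    Row("placeholder A", "production", "vendor-self-reported"),
    Row("placeholder B", "pilot", "peer-reviewed-self-authored"),
    Row("placeholder C", "research", "press-release-only"),
)


def clears_floor(row: Row) -> bool:
    return TIER.index(row.tier) >= TIER.index(FACT_FLOOR)


def distribution(roster) -> Counter:
    # Count rows per (maturity, tier) cell of the two-axis grid.
    return Counter((r.maturity, r.tier) for r in roster)


def overstated_if_quoted(roster) -> list:
    # Rows whose number must be hedged if it is quoted.
    return [r for r in roster if not clears_floor(r)]


def fact_grade_claims(roster) -> list:
    # Rows whose number may be stated as fact.
    return [r for r in roster if clears_floor(r)]


for cell, count in sorted(distribution(ROSTER).items()):
    print(cell, count)
print(f"must be hedged: {len(overstated_if_quoted(ROSTER))}")
print(f"claims that clear the established-fact floor: {len(fact_grade_claims(ROSTER))}")
```

Because the two list helpers partition the roster on one predicate, every row lands in exactly one of them, which is what lets a build assert on the empty list rather than on prose.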

Reproducibility here is unusually strong precisely because the module is so plain. It is standard-library only — no scikit-learn, no PyTorch, no network — so unlike a trained model it has no random seed to pin and no environment to drift: the run is deterministic by construction, and run_all.py invokes it alongside the rest of the suite — the same harness that pins every dataset by its MANIFEST.sha256 content hash in the data chapter. That makes it the one artifact in the suite a reader can audit by eye — the verbatim run above is the output, with nothing stochastic between the source and the printed count. The same open-source analytics lineage runs in earnest in the open-source book's SPC/MVDA/ML stack, where a calibration's dataset hash is the field that quietly invalidates a model when the probe changes; the case-ledger is that discipline applied to a claim rather than a calibration — the provenance note is the dataset hash of an evidence row.
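The determinism argument can be checked mechanically. A minimal sketch, assuming a flat export with a handful of columns: the real layout of cases.csv and the format of MANIFEST.sha256 are not shown in this chapter, so the column list, the placeholder rows, and the helper names export_csv and content_hash are all hypothetical.

```python
import csv
import hashlib
import io

FIELDS = ("company", "claim", "tier", "note")  # assumed column subset

# Placeholder rows for illustration only; NOT the chapter's roster.
ROWS = (
    ("placeholder A", "illustrative claim A", "vendor-self-reported", "first-party only"),
    ("placeholder B", "illustrative claim B", "press-release-only", "no method disclosed"),
)


def export_csv(rows) -> bytes:
    """Serialize rows to CSV bytes with a fixed header and line ending."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def content_hash(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


first = content_hash(export_csv(ROWS))
second = content_hash(export_csv(ROWS))

# Change only the provenance note of the second row.
edited = ROWS[:1] + (
    ("placeholder B", "illustrative claim B", "press-release-only", "edited note"),
)
third = content_hash(export_csv(edited))

print(first == second)  # True: same rows, same bytes, same digest
print(first == third)   # False: editing a note changes the digest
```

Pinning the line terminator and the encoding is the part that matters; without them the same rows could serialize to different bytes on different platforms and the hash would stop being a statement about content.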

Why it matters

The temptation in a field this hyped is to quote the best number you can find and move on. This chapter exists to make that impossible to do honestly. The named deployments are real — Amgen runs OPLS in commercial GMP, BMS and DataHow published a reviewed hybrid-model result, WuXi's smart lab raised titer on its own clones — and that reality is the genuine basis for optimism about where the field is going. But every one of their headline numbers is a single company grading its own homework, and a reader who quotes them as benchmarks will over-promise, under-deliver, and eventually lose credibility with the quality unit and the regulator both. The discipline of attaching maturity and evidence tier to every claim is not academic fussiness; it is the difference between a defensible business case and a slide that falls apart under the first hard question. The same discipline the data book demands of a single data point — that it carry its provenance — this chapter demands of a single claim: a number without its grade is not evidence, it is decoration.

In the real world

The way these cases actually surface is telling. The best evidence lives in trade journals and society surveys, not regulatory filings: the GAO (Government Accountability Office, the U.S. government's audit agency) found that only 16 approved applications or supplements incorporated an advanced manufacturing technology between 2015 and late 2022, while 112 advanced technologies were accepted into the Emerging Technology Program of CDER (the FDA's Center for Drug Evaluation and Research) over the same period [18]. The disclosed case studies run far ahead of the disclosed filings. The strongest single anchor remains Amgen across three roles (Juncos OPLS, AVI, and continuous Raman glucose control), because Amgen has chosen to publish and present its work in unusual detail; that very willingness is why it dominates the production tier of any honest roster, and it is a reminder that the roster is shaped as much by who discloses as by who deploys. The DataHow/BMS hybrid case is the one most worth citing for process development, specifically because the peer-reviewed companion exists, and you cite the journal's 33-percent-and-half-the-data numbers, not the vendor page's larger ones. And the cautionary marker for the whole field is regulatory, not technical: FDA's first AI-citing cGMP warning letter (Purolea, April 2026) landed on a firm that used AI to generate specifications, SOPs, and master production records without adequate review by the quality unit, the GMP-mandated department that must independently approve specifications, procedures, and records before release. The citation was under 21 CFR 211.22(c), the rule that assigns that approval authority. It is a reminder that the gap between a disclosed case study and a defensible deployment is exactly the human-in-the-loop governance the regulation chapter takes up next [19].

Key terms

GMP (Good Manufacturing Practice) — the legally binding quality rules a licensed drug plant must follow; the anchor of the maturity axis.
Titer — the concentration, or yield, of antibody product the culture makes; the central success metric the headline numbers report.
Soft sensor — a model that predicts a hard-to-measure quantity (e.g. harvest titer) from easy-to-measure signals, instead of waiting for the slow lab assay.
Raman spectroscopy — an optical scan that infers a concentration such as glucose without taking a separate lab sample; the basis of several in-line monitoring and feedback cases here.
Maturity marker — (production) in GMP/commercial use, (pilot) demonstrated at scale, (research) academic/early; the "how far has it gone" axis.
Evidence tier — peer-reviewed-independent, peer-reviewed-self-authored, vendor-self-reported, press-release-only; the "how much should I trust it" axis. The gap between the top two tiers (method reviewed vs. result independently replicated) is the one readers most often miss.
Established-fact floor — the rule that a number may be stated as fact only at the peer-reviewed-independent tier or above; everything below is labeled illustrative or self-reported in the same sentence as the number. Zero named headline numbers clear it.
Self-reported headline — a figure asserted by the company that benefits from it, with no independent replication; the default state of biomanufacturing-ML numbers.
OPLS (orthogonal PLS) — the MVDA model family behind Amgen Juncos's harvest-titer soft sensors; PLS with orthogonal-variation filtering, run in SIMCA-online.
AVI (automated visual inspection) — deep-learning computer-vision inspection of containers; the most production-ready ML in QC, validated first on an Amgen syringe line.
MARS — Genentech/Roche's multi-attribute Raman spectroscopy of formulated drug product; a drug-product unit operation, not the Boehringer Ingelheim Protein A capture work it is often confused with.
ISLFCC — WuXi Biologics' Industrial Smart Lab Framework for Cell Culture; decoder-only transformer models plus robotics, source of the +26.8 percent titer (illustrative) headline at PD scale.
Case ledger — the chapter's contributed artifact: a typed, frozen, gradable record of one named deployment, making the evidence countable rather than rhetorical.
Controlled vocabulary — a governed list of agreed terms a field is allowed to take; here the ordered maturity and tier tuples, the same construct an OWL owl:oneOf enumeration or a SHACL sh:in constraint expresses, and the input that makes a knowledge graph FAIR rather than a pile of strings.
Derived property — a value computed from other fields rather than asserted, such as stated_as_fact_ok(); an ontology reasons such facts out the way the ledger computes them, so a tier cannot be silently upgraded without changing a named field.

Where this leads

The cases are graded; the empty top-right quadrant and the Purolea warning letter both point the same direction. The next chapter, Regulation and Governance: FDA, Annex 22, and Validating a Model, turns from who deployed what to what is allowed and how you prove it — the FDA 2023 discussion paper and its model-credibility framework, the draft Annex 22 that draws a hard line around generative and adaptive AI, the locked-model-plus-predetermined-change-control expectation, and the validation paradox of a model that must never silently learn. The case studies tell you the field is real; the regulation chapter tells you the conditions under which a real deployment becomes a defensible one.
