The Frontier: Foundation Models, Autonomous Labs, and Agentic AI

📍 Where we are: Part VII · ML/AI in Industry Today — Chapter 28. The previous chapters mapped what is real: the vendor landscape that exists, the named deployments and their evidence, and the regulatory line that governs them. This chapter looks the other way — at the demonstrations that are not yet deployments, and asks honestly how far they are from the plant floor.

Every chapter so far has been disciplined about the difference between what is demonstrated and what is deployed. This one is about that difference. The frontier of bioprocess machine learning in 2024-2026 is loud: press releases announce "self-driving" bioreactors, "agentic" manufacturing platforms, and foundation models that will read every batch ever run and learn the whole process at once. Some of it is genuine, careful science published in peer-reviewed journals. Some of it is a research result wearing a product's clothes. And almost none of it is running on a commercial GMP (Good Manufacturing Practice — the legally binding, regulator-inspected regime a sellable drug must be manufactured under) line making a drug you can buy. The job of this chapter is to walk the four headline frontiers — autonomous labs, federated learning, foundation/time-series models, and agentic AI — and for each one, mark exactly where it sits on the road from a controlled demonstration to routine commercial use, with the maturity and evidence tier attached to every claim.

The honest summary is one sentence, and the rest of the chapter earns it: the production reality of AI in biomanufacturing clusters in monitoring, predictive maintenance, vision inspection, and human-in-the-loop documentation — not in the autonomous control of a critical quality attribute (a CQA — a measurable product property, such as glycosylation or purity, that must stay within its specification for the batch to be safe and effective, and that cannot be changed once the batch is released) — and the frontier is the set of things that promise to change that but have not yet [1].

The simple version

There is a long road between a science-fair robot that mixes chemicals on its own and a power plant you would trust to run unattended overnight. The science-fair robot is real, impressive, and exactly the right thing to be building — but nobody confuses it with the power plant. Most of the bioprocess AI "frontier" is at the science-fair stage: clever, working demonstrations in a controlled setting, often at a tenth or a hundredth of manufacturing scale, with a human watching closely. The frontier is the bridge being built between the demo and the plant. This chapter walks out onto the bridge and marks, honestly, how much of it is finished.

What this chapter covers

Read the four headline frontiers as four bets on the same problem — how to beat the small-data ceiling by generating data faster, pooling it across firms, pretraining on all of it, or acting on the little there is:

Self-driving bioreactors — autonomous design-of-experiments closing the loop on a real cultivation, and why the demonstrated case is perfusion process development, not GMP control.
Federated learning — training across institutions without sharing the data; what MELLODDY actually proved, and why it stays in discovery rather than crossing into manufacturing.
Bioprocess foundation and time-series models — the aspiration of one large pretrained model for all bioprocess data, why it does not yet exist, and what transfer learning does in the meantime.
Agentic AI in GMP — autonomous agents that plan and act; the hard line the Purolea warning letter and draft Annex 22 draw around them, confining them to non-critical, human-in-the-loop tasks.
The demo-versus-routine gap itself — measured, not asserted, using the ISPE Pharma 4.0 survey and the small-data ceiling that explains why the gap is so persistent.
A frontier readiness scorecard — a small, runnable artifact that scores each capability against data adequacy, demonstrated maturity, and regulatory clearance, so the gap is auditable rather than rhetorical.

The frontier, framed: a demonstration is not a deployment

Before any specific technology, fix the measuring stick. A claim in this field can sit at very different distances from the plant, and the whole chapter depends on keeping them straight. We use the same two axes the case-studies chapter introduced. Maturity is the deployment ladder: (research) means an academic or early demonstration, often at bench scale; (pilot) means demonstrated at a meaningful scale but not in routine commercial use; (production) means deployed in GMP or commercial operation. Evidence tier is how strong the proof is: peer-reviewed-independent (strongest), peer-reviewed-self-authored, vendor-self-reported, or press-release-only (weakest). A frontier claim is only as credible as the weaker of its two labels, and an efficiency headline carries its tier in the same breath or it carries nothing.

It is worth being concrete about why the weaker label dominates, because it is the lever the rest of the chapter pulls. A peer-reviewed paper co-authored by the vendor that built the rig (peer-reviewed-self-authored) clears the methodological bar — the method is real and reproducible in principle — but it does not clear the independence bar: the people who report the result are the people who profit from it being believed, and no competitor or regulator has yet reproduced it on different hardware with a different molecule. A vendor-self-reported claim clears neither. The two axes are also genuinely independent: a result can be high-maturity and low-evidence (a production deployment that exists only as a press release), or low-maturity and high-evidence (a careful bench study in an independent journal). The frontier is unusual in that almost everything on it is low maturity with mixed evidence — the research is often excellent, the deployment claim almost always weaker than it sounds, and the reader's whole job is to refuse to average the two into a comfortable middle.
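
The weaker-label rule can be written down in a few lines. The sketch below is illustrative only: the class names, the numeric ordering of the tiers, and the choice to normalize each axis before comparing them are assumptions made here, not a scoring scheme the chapter defines.

```python
from dataclasses import dataclass
from enum import IntEnum


class Maturity(IntEnum):
    RESEARCH = 1
    PILOT = 2
    PRODUCTION = 3


class Evidence(IntEnum):
    PRESS_RELEASE_ONLY = 1
    VENDOR_SELF_REPORTED = 2
    PEER_REVIEWED_SELF_AUTHORED = 3
    PEER_REVIEWED_INDEPENDENT = 4


@dataclass(frozen=True)
class Claim:
    name: str
    maturity: Maturity
    evidence: Evidence

    def _normalized(self):
        # Assumption: put each axis on a (0, 1] scale so the two can be compared.
        return (self.maturity / len(Maturity), self.evidence / len(Evidence))

    def credibility(self) -> float:
        # A claim is only as credible as the weaker of its two labels.
        return min(self._normalized())

    def limiting_axis(self) -> str:
        # Ties are reported as "maturity"; only a top-rated claim can tie.
        m, e = self._normalized()
        return "maturity" if m <= e else "evidence"


bench_study = Claim("careful bench study", Maturity.RESEARCH,
                    Evidence.PEER_REVIEWED_INDEPENDENT)
press_release = Claim("production deployment, press release only",
                      Maturity.PRODUCTION, Evidence.PRESS_RELEASE_ONLY)
```

Taking the minimum rather than the mean is the point: a production deployment known only from a press release scores on its evidence, and a strong independent bench study scores on its maturity, so neither gets averaged into a comfortable middle.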

The reason the frontier is mostly (research) and (pilot) is not timidity; it is the structural reality of the field this whole book has been describing. Living cells, an offline reference assay (a lab test run on a pulled sample away from the tank, so its result returns only once or twice a day rather than continuously), run-to-run variability, and fast model decay together form a small-data ceiling — the same ceiling that makes pure machine learning (a model learned from data alone) stall and hybrid modeling (a model that fuses known process physics with data, covered in the hybrid-models chapter) win [1][7]. A foundation model trained on text has the open internet; a bioprocess team has, on a good day, a few hundred completed runs of a given product, each one a 14-day commitment of a multimillion-dollar suite, with a single ground-truth glycan profile (the gold-standard lab measurement of the antibody's sugar attachments — a CQA the model is trying to predict) that comes back from an offline assay days later. Every frontier capability in this chapter runs into that ceiling, and how it copes with it is the single best predictor of whether it will cross from demo to deployment. The four frontiers can be read as four different bets on how to beat it — generate data faster, pool it across firms, pretrain on all of it, or act on the little there is — and the chapter's verdict is that each bet has a real, named obstacle still standing.

Self-driving bioreactors: the loop, closed at development scale

The most concrete frontier is the self-driving bioreactor — a cultivation (a tank of living cells being grown to make a product) that designs its own next experiment, runs it, learns from the result, and designs the next, with no human choosing the conditions in between. This is Bayesian optimization and the digital twin fused into a closed loop and pointed at real hardware. Mechanically, the loop has four moving parts that must all close in real time: a surrogate model that maps proposed operating conditions to predicted outcomes with calibrated uncertainty; an acquisition function that turns that uncertainty into the single most informative next experiment to run; an execution layer (liquid handlers, feed pumps, automated sampling) that performs the experiment without hands; and an analytics layer (at-line and inline PAT) that returns a measurement fast enough to feed the next round. Break any one link — the surrogate is overconfident, the analytics return too slowly, the hardware cannot execute the proposed feed profile — and the loop either stalls or wanders, which is exactly why the demonstrated cases are so carefully bounded.
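
The four links can be sketched as a toy loop in plain Python. Everything here is an assumption made for illustration: a one-dimensional operating condition on [0, 1], a zero-mean Gaussian process with a fixed kernel length, an upper-confidence-bound acquisition function, and a noise-free `run_experiment` standing in for the execution and analytics layers. It is not the published campaign's design, which uses a hybrid model and Bayesian optimal experimental design.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel: nearby conditions give correlated outcomes."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def surrogate(xs, ys, x_new, noise=1e-3):
    """Zero-mean GP posterior: predicted outcome and its standard deviation."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = 1.0 - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, math.sqrt(max(var, 0.0))


def acquisition(mean, std, kappa=2.0):
    """Upper confidence bound: predicted value plus a bonus for uncertainty."""
    return mean + kappa * std


def run_experiment(x):
    """Stand-in for execution plus analytics (toy response, optimum at 0.6)."""
    return -(x - 0.6) ** 2


def run_loop(rounds=6):
    candidates = [i / 50 for i in range(51)]
    history = [(x, run_experiment(x)) for x in (0.0, 1.0)]  # two seed runs
    for _ in range(rounds):
        xs = [p[0] for p in history]
        ys = [p[1] for p in history]
        # No human picks the next condition: the acquisition function does.
        x_next = max(candidates, key=lambda c: acquisition(*surrogate(xs, ys, c)))
        history.append((x_next, run_experiment(x_next)))
    return history
```

The sketch also makes the failure modes visible. If `surrogate` returned a standard deviation that was too small, `acquisition` would stop exploring; if `run_experiment` took days to return, every pass through the loop would wait on it.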

The cleanest demonstrated case is genuine and worth describing precisely, because it is so often mis-cited. A collaboration of DataHow, Sartorius, and Merck (Ares Trading / Merck KGaA's biopharma arm) ran an autonomous development campaign for a perfusion process (a continuous run in which cells are held at steady state while fresh media is constantly fed in and spent media drawn off, so it can be re-sampled day after day) making a monoclonal antibody (mAb — a single, identical antibody protein grown as the therapeutic product), published in Biotechnology and Bioengineering (2026) [2]. The machinery is the state of the art: a bank of 24 parallel ambr250 mini-bioreactors (an automated rig of 24 tiny 250 mL culture vessels run side by side), a cognitive digital twin combining a hybrid mechanistic-plus-data model with a step-wise Gaussian-process surrogate (a statistical model that returns not just a prediction but a calibrated uncertainty around it), and Bayesian optimal experimental design choosing each round's conditions to maximize information gain. The choice of each component is doing real work. The model is hybrid, not pure data-driven, precisely because the small-data ceiling forbids learning a 27-day perfusion trajectory from scratch — the mechanistic backbone supplies the mass balances (bookkeeping of what flows in, out, and accumulates) and growth kinetics (the equations for how fast the cells grow and consume nutrients) so the data only has to fit what the physics leaves free. The surrogate is a Gaussian process because the loop needs not just a prediction but a calibrated uncertainty on it: the acquisition function spends experiments where the model is most uncertain, and a point estimate with no error bar cannot drive that choice. And the design transfers what was learned on one cell line to accelerate another — a deliberate move to stretch a starved data budget across products rather than restart it. 
Over a 27-day perfusion cultivation — of which roughly twenty days ran under the autonomous loop, with no human selecting the operating points across that span — the system designed and executed its own experiments. This is real autonomous experimentation on a living mammalian culture, not a slideware promise.

And it is, unambiguously, process development at mini-bioreactor scale (research) — the authors themselves stress the gap between robotic capability (the rig can execute steps without hands) and device autonomy (the system can be trusted to decide) [3]. That distinction is the load-bearing one. Robotic capability is an engineering achievement that the rig plainly has; device autonomy is a trust claim that no one is yet making, because trusting the system to decide in a GMP context would mean trusting it to choose conditions that affect a released product's quality with no human in the decision. It is not a GMP deployment, it does not make a commercial drug, and the conditions it explores are development conditions, not a locked validated recipe [2][3]. The broader self-driving-lab literature sits in the same place: an independent Chemical Reviews survey catalogues impressive closed-loop systems across chemistry, materials, and biology [4], and a Royal Society review finds that optimization is the most common self-driving-lab task — tractable precisely because Bayesian optimization is well-suited to it — while the single most-cited gap is exactly the one our running example would face: extending from fast-growing microbial and abiotic (non-living, chemical) systems and a single tuned reactor to slow, expensive CHO (Chinese Hamster Ovary — the standard mammalian cell line for therapeutic antibodies) culture and dynamic, multi-vessel control [5]. The asymmetry is physical, not incidental. A microbial self-driving lab can run dozens of generations a day and turn an experiment around in hours; a CHO perfusion run is a 27-day commitment, so the same closed loop that completes a hundred design-build-test cycles in a week of E. coli work (a fast-dividing bacterium, the canonical microbial host) completes perhaps one in a month of mammalian work. 
A 27-day mammalian perfusion run is already heroic. A self-driving GMP suite running BATCH-2026-001 to release (the quality unit's irreversible decision to let a finished batch leave the building) is a different order of problem: it runs at 2000 L, under a validated recipe (a process locked and proven to reliably yield conforming product), and answers to a quality unit (the independent department accountable for that release decision). The difference is not only one of scale. The demonstration is a continuous perfusion run that can be re-sampled day after day, while the GMP target is a fed-batch (a single sealed run that is fed but not drained and concludes at one harvest) whose release CQAs return once, at the end, on a batch that cannot be re-run if the loop chose wrong.

Evidence

The DataHow/Sartorius/Merck self-driving perfusion campaign is (research), evidence tier peer-reviewed-self-authored — a peer-reviewed paper co-authored by the vendors and process owners who built it [2]. Read the two labels together: peer review means the method is sound and disclosed; self-authorship means the deployment significance has not been independently reproduced, on other hardware, by anyone without a stake in the result. It is the strongest demonstrated autonomous-bioreactor case to date, and it is still development-scale. Any "self-driving GMP bioreactor" claim should be read against this ceiling: the best published result is a 27-day development cultivation, human oversight intact.

Federated learning: proven in discovery, not yet in manufacturing

The second frontier answers a real industry problem: the most valuable training data is locked inside competitors. Every company's batches are a tiny dataset; the union of every company's batches would be a large one — but no firm will hand its process data to a rival, and much of it is regulated, confidential, or both. Federated learning (FL) is the mechanism that promises both: each participant trains on its own data behind its own firewall, and only model updates (gradients or weights — the internal numbers a model adjusts as it learns, not the data itself), never raw data, are pooled into a shared model. The data never leaves home; the learning is shared anyway. In practice the privacy guarantee is layered — secure aggregation so the orchestrator sees only the summed update and not any single participant's contribution, often with differential-privacy noise on top (small random perturbations added to the update so no single record can be reverse-engineered from it) — which is what makes a pharmaceutical company willing to contribute gradients computed on its proprietary chemistry at all.
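
The secure-aggregation layer can be illustrated with pairwise masks that cancel in the sum. This is a toy under stated assumptions: real protocols derive the masks from cryptographic key agreement and work in modular integer arithmetic, while this sketch draws them from a shared seeded generator and uses floats. The function names are hypothetical, and differential-privacy noise is left out so the cancellation stays exact up to rounding.

```python
import random


def masked_updates(updates, seed=0):
    """Return each participant's update with pairwise masks applied.

    For every pair (i, j) with i < j, participant i adds a shared random
    mask and participant j subtracts it, so the masks vanish in the sum.
    """
    rng = random.Random(seed)  # stand-in for pairwise key agreement
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.uniform(-1e3, 1e3) for _ in range(dim)]
            for d in range(dim):
                masked[i][d] += mask[d]
                masked[j][d] -= mask[d]
    return masked


def aggregate(masked):
    """The orchestrator sums what it receives; it never sees a raw update."""
    dim = len(masked[0])
    return [sum(m[d] for m in masked) for d in range(dim)]


# Hypothetical gradient vectors from three participants.
true_updates = [[0.10, -0.20], [0.05, 0.30], [-0.15, 0.40]]
received = masked_updates(true_updates)
pooled = aggregate(received)
```

Each vector in `received` is dominated by mask values and reveals little about the participant's own update, yet `pooled` matches the sum of `true_updates` to within floating-point error.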

The landmark proof is MELLODDY (Machine Learning Ledger Orchestration for Drug Discovery): ten pharmaceutical companies trained a shared predictive model across more than 2.6 billion data points spanning over 21 million molecules and tens of thousands of assays, without any company exposing its compounds or assay results to the others [6]. It worked, in the specific sense that matters: the federated model beat any single participant's local model on the participants' own private test sets, demonstrating real predictive benefit from pooling, while the privacy mechanism held — no company could reconstruct a rival's compounds from the shared updates. That is a genuine frontier crossed, and at a scale no single firm could approach alone.

The crucial caveat — and the reason it belongs in a frontier chapter rather than a deployment chapter — is what MELLODDY federated: it is drug discovery / QSAR (predicting molecular properties and bioactivity from chemical structure), not manufacturing. The molecules are static objects with crisp, comparable labels: a compound's structure is the same string of atoms at Bayer as at Novartis, and an IC50 against a named target means the same thing in both companies' assays once the assay is standardized. A manufacturing process is a living, time-evolving system whose "features" are entangled with each site's specific equipment, raw-material lots, and SOPs — so federating across physical production sites runs straight into a harder problem than federating across compound libraries: the data is not just private, it is not commensurable [6][7]. Two sites' "feed rate at day 7" mean subtly different things — different bioreactor geometry, different probe calibration, different feed medium lot — and raw-material lot variability is the sharpest case: site A's medium lot is a different physical material from site B's, so the same nominal "feed at day 7" is delivered into a chemically different broth, and no gradient exchange can reconcile what the materials themselves do not share. A day-7 reading, on top of that, may not even mark the same physiological state of the culture — so a gradient computed on site A's feature is not cleanly addable to a gradient computed on site B's same-named feature. Federated learning naively assumes the participants share a feature space; QSAR very nearly does, and manufacturing very nearly does not. 
A second, independent obstacle compounds the first. Even granting a shared feature space, manufacturing federation faces two further problems: extreme non-IID, client-imbalanced data (each site's data follows a different distribution, not the independent-and-identically-distributed mix that averaging assumes, and one site may hold thirty batches of mAb-A while another holds none), and per-site batch counts that are tiny by federated-learning standards. Both degrade FedAvg (the standard federated method that simply averages each site's model update) far more than they would in the near-IID compound-library setting where MELLODDY succeeded. Federated learning from molecules to processes is therefore an active research perspective, not a shipped capability; the manufacturing analogue of MELLODDY does not yet exist [7].
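
A minimal sketch of the FedAvg aggregation step shows how client imbalance plays out. It assumes the common formulation in which each site's parameters are weighted by its local sample count; the site names, batch counts, and one-parameter models are hypothetical.

```python
def fedavg(site_weights, site_counts):
    """Average per-site parameter vectors, weighted by local sample count."""
    total = sum(site_counts)
    dim = len(site_weights[0])
    return [
        sum(w[d] * n for w, n in zip(site_weights, site_counts)) / total
        for d in range(dim)
    ]


# Hypothetical: site A has 30 batches, site B has 2, and their local
# models disagree because the two sites see different distributions.
site_a_model, site_a_batches = [1.0], 30
site_b_model, site_b_batches = [3.0], 2

global_model = fedavg([site_a_model, site_b_model],
                      [site_a_batches, site_b_batches])
```

Here the global parameter lands at 1.125, close to site A's value and far from site B's. The averaged model is therefore a poor fit for the small site, and it is not clearly better than site A's own model either. That is the practical meaning of non-IID, imbalanced clients.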

So the maturity reads cleanly: federated learning is (pilot) as a mechanism and confined to discovery as an application. The privacy machinery is proven at ten-company scale; the application to manufacturing is conceptual and structurally harder than the proof. For our running example, this is the difference between "ten companies could pool a model that ranks mAb-A-like candidate molecules" (demonstrated) and "ten companies could pool a model that predicts BATCH-2026-004's HCP excursion (a spike in host-cell protein — a residual impurity from the production cells — above its limit) from their combined manufacturing history" (not demonstrated, and structurally harder — because the excursion's predictive features are exactly the non-commensurable ones).

Bioprocess foundation models: the aspiration that is not yet a product

The third frontier is the most hyped and the least real. A foundation model is a single large model pretrained on a vast, broad corpus, then adapted to many downstream tasks with little extra data — the pattern behind large language models and image generators. The aspiration, spoken in every other vendor deck, is a bioprocess foundation model: one model pretrained on every Raman spectrum (a light-scattering fingerprint read continuously inside the tank to infer what is dissolved in it), every batch trajectory, every chromatogram (the output trace of a purification step) ever recorded, that a new process could fine-tune on its handful of runs and immediately predict titer (how much product the cells have made, in grams per litre), glycosylation (the sugar pattern on the antibody — a CQA), and failure risk — the small-data ceiling finally escaped by borrowing scale from everyone else's data.

It is important to state this plainly: a bioprocess foundation model, in this sense, does not yet exist as an established system. It is an aspiration, not a product [8]. Two confusions inflate the impression that it does, and both dissolve on inspection. First, there are foundation models in the life sciences — single-cell genomics models like Geneformer and scGPT — but they operate on transcriptomics, sequences of gene expression (which genes a cell has switched on, read off as an ordered list), which is a sequence modality with a natural vocabulary (genes) and a web-scale public corpus (millions of openly deposited single-cell profiles), exactly the conditions that let the language-model recipe transfer [8]. A bioreactor's data has none of those: it is dense multivariate real-valued time series, there is no shared vocabulary across processes, and the corpus is proprietary and tiny. A model that predicts a cell's state from its gene-expression sequence does not transfer to predicting titer from a Raman spectrum any more than a protein-language model transfers to weather forecasting; the modality is simply different.

Second, there is a real and active research line on general time-series foundation models — models like Chronos, TimeGPT, and Moirai pretrained on enormous collections of generic time series and applied zero-shot to forecasting — and early work has begun probing whether such a model can forecast process signals [9]. That is the genuine seed of the idea, and it is the most credible path on offer. But a generic time-series model has never seen a Monod growth curve, a day-7 temperature excursion, a bolus feed, or the cell-specific perfusion rate and bleed that hold a high-density steady state; it has no notion of the mass balances and kinetics that the hybrid model bakes in for free, and "pretrained on millions of generic series" buys little when the target series — a specific product's perfusion run — is governed by mechanisms absent from electricity-demand and retail-sales data. The corpus that gives a language model its power simply does not exist for bioprocess time series, and a few hundred batches will not conjure it.

The honest near-term answer to "how do we borrow scale we do not have?" is the unglamorous one this book keeps returning to: transfer learning and Bayesian priors, not a foundation model [10][7]. The distinction is precise. A foundation model promises general pretraining — one model, all processes, fine-tune on anything. Transfer learning makes a specific, defensible bet: warm-start a new product's model from a related one, where you can argue the relatedness; carry the kinetic constants from model_fedbatch.py as a mechanistic prior, so the physics you already trust does the work the data cannot; reuse a calibration across probes with the piecewise direct standardization the transfer.py module already demonstrates. These are real, deployable, and exactly the workaround the literature recommends while the foundation-model dream remains a research aspiration — they beat the small-data ceiling not by pretending the data is large but by injecting knowledge that substitutes for the missing data. The frontier here is not a model you can download; it is a research direction with a hard, named obstacle — the same small-data ceiling, plus a missing corpus and a modality mismatch — standing in its way.
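
How a prior substitutes for missing data can be shown with the simplest conjugate case: a normal prior on one kinetic constant, updated with a few noisy estimates from a new product. The sketch assumes a known measurement standard deviation and uses made-up numbers for a maximum specific growth rate. It does not reproduce the contents of model_fedbatch.py or transfer.py.

```python
import math


def posterior_normal(prior_mean, prior_sd, obs_mean, obs_sd, n_obs):
    """Normal-normal update: a precision-weighted blend of prior and data."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n_obs / obs_sd ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * obs_mean) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)


# Hypothetical values, in 1/h: a prior carried over from a related
# product, then three runs of the new product.
post_mean, post_sd = posterior_normal(
    prior_mean=0.040, prior_sd=0.005,
    obs_mean=0.050, obs_sd=0.010, n_obs=3,
)
```

With only three runs the posterior mean sits between the prior and the data, at about 0.044, and its standard deviation is smaller than either source alone would give. As runs accumulate, `data_precision` grows and the prior's influence fades. That is the defensible version of borrowing scale: the borrowed knowledge is explicit and can be argued over.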

The prerequisite under all three data-pooling frontiers: a graph the data can actually be pooled in

Three of the four frontiers — self-driving labs (generate faster), federation (pool across firms), a foundation model (pretrain on all of it) — are the same bet at different scales: make the corpus bigger. They all founder on the same point, and it is the one the companion ontology book is entirely about: you cannot pool what is not commensurable, and bioprocess data is not commensurable by default. The fix is not more data, it is semantically harmonized data — the discipline the ML side has been treating as an aside and the ontology side treats as the whole job. It is worth making the connection concrete, because it is what would turn the foundation-model aspiration from a slogan into a buildable corpus.

The unit that makes pooling sound is not a column named feed_day7; it is a feature pulled by its ontology IRI — a globally unique identifier such as bp:feedRate realized on a typed, unit-bearing quantity rather than a fragile string. A historian tag emitting a bare BR101.Feed.PV = 0.40 carries no units, no context, and no guarantee that two sites mean the same thing by it; the same reading lifted into the graph becomes a feed rate of 0.40 vessel-volumes per day (a UCUM unit code — Unified Code for Units of Measure, the machine-readable unit standard the identifiers-and-units chapter specifies), realized on the production phase of BATCH-2026-001, within its declared normal operating range. Only the second form can be safely added to another site's contribution, because only the second form states what it is. This is exactly the non-commensurability the federated-learning section flagged, restated as the thing an ontology exists to fix: a shared upper ontology (BFO / IOF Core), aligned identifiers, and typed values are the precondition for a gradient from site A to be addable to a gradient from site B at all — the reuse-and-alignment discipline is the unglamorous work that a pooled corpus silently assumes and almost no one has done.
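
The difference between a bare tag and a typed quantity can be sketched as a small data structure that refuses to pool readings unless they state the same thing. The namespace URL behind the `bp:` prefix, the class and function names, the second batch's value, and the unit string "/d" are assumptions for illustration; the companion ontology's actual IRIs and unit codes may differ.

```python
from dataclasses import dataclass

BP = "https://example.org/bp#"  # hypothetical expansion of the bp: prefix


@dataclass(frozen=True)
class Quantity:
    iri: str      # what is measured, as a globally unique identifier
    value: float
    unit: str     # a UCUM-style unit code (assumed here: "/d" for per day)
    batch: str    # the batch the reading is realized on


def commensurable(a: Quantity, b: Quantity) -> bool:
    """Two readings may be pooled only if identity and unit both match."""
    return a.iri == b.iri and a.unit == b.unit


def pooled_mean(readings):
    first = readings[0]
    for r in readings[1:]:
        if not commensurable(first, r):
            raise ValueError(
                f"cannot pool {r.iri} [{r.unit}] with {first.iri} [{first.unit}]"
            )
    return sum(r.value for r in readings) / len(readings)


site_a = Quantity(BP + "feedRate", 0.40, "/d", "BATCH-2026-001")
site_b = Quantity(BP + "feedRate", 0.60, "/d", "BATCH-2026-002")
site_c_hourly = Quantity(BP + "feedRate", 0.02, "/h", "BATCH-2026-003")
```

`pooled_mean([site_a, site_b])` succeeds, while adding `site_c_hourly` raises an error instead of silently averaging per-day and per-hour numbers. A production system would convert units rather than reject them, but the refusal is the behavior a column named feed_day7 cannot provide.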

The same graph then does three more things a learning system needs and a flat table cannot supply, each drawn straight from the release-gate chapter:

A completeness contract for training data. The SHACL shapes (Shapes Constraint Language — a standard that checks a graph carries every required, in-range, typed value before it is trusted) that gate a release also gate what a model is allowed to learn from: before BATCH-2026-001's panel becomes a feature row, bp:ReleaseShape certifies every required CQA is present, singular, and in range. A model handed a row whose HMW result silently never loaded will not object — it imputes and reports a confident number — so the same closed-world gate that refuses a non-conformant release refuses a non-conformant training set. The release rule is the labeling contract: it defines what a valid PASS/OOS label even means.
A leak-free split, for free. Because lineage is explicit — bp:derivedFrom is the typed edge the genealogy is built on, in the spirit of PROV-O provenance — a model can be split the way honesty demands: grouped / leave-one-batch-out cross-validation that holds out every row sharing a derivedFrom ancestor, so BATCH-2026-001's sibling samples never leak across the fold and the reported score is one an unseen campaign would actually earn. The graph makes the grouping mechanical where a flat table leaves it to a hopeful convention.
A retrieval boundary for a grounded LLM. When the agentic frontier's LLM is asked a factual question — was a given lot released? — the honest answer lives in the verified graph, not the model's fluency, and GraphRAG (retrieval that walks typed edges and cites them) is how the model narrates a true lineage rather than inventing one. A SPARQL query that returns no conforming subgraph is the retrieval-time analogue of an out-of-distribution flag: a refusal to answer rather than a confident guess on unfamiliar ground. The ontology, in other words, is what lets an LLM be checked against ground truth instead of trusted on style.
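
The leak-free split in the second item can be made mechanical once lineage is explicit. The sketch below assumes the simplest case, where each node has at most one derivedFrom parent so every sample traces back to a single root batch; pooled or blended lots would need a graph traversal instead. The sample identifiers and function names are hypothetical.

```python
def lineage_root(node, derived_from):
    """Follow derivedFrom edges upward until a node with no parent is reached."""
    seen = set()
    while node in derived_from:
        if node in seen:
            raise ValueError(f"cycle in lineage at {node}")
        seen.add(node)
        node = derived_from[node]
    return node


def leave_one_batch_out(rows, derived_from):
    """Yield (held_out_root, train_rows, test_rows), one fold per root batch."""
    groups = {}
    for row in rows:
        groups.setdefault(lineage_root(row, derived_from), []).append(row)
    for held_out in sorted(groups):
        test = groups[held_out]
        train = [r for root, members in groups.items()
                 if root != held_out for r in members]
        yield held_out, train, test


# Hypothetical lineage: child -> parent it was derived from.
derived_from = {
    "S1": "BATCH-2026-001",
    "S2": "S1",  # a sub-sample of S1, so it shares S1's ancestor
    "S3": "BATCH-2026-002",
    "S4": "BATCH-2026-002",
    "S5": "BATCH-2026-003",
}
rows = ["S1", "S2", "S3", "S4", "S5"]
folds = list(leave_one_batch_out(rows, derived_from))
```

S2 never appears in a training fold while S1 is held out, even though no column in a flat table would say they are related. Where scikit-learn is available, passing the same root labels as `groups` to `LeaveOneGroupOut` gives the same partition.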

None of this is a fifth frontier competing with the four; it is the substrate the three data-pooling frontiers silently stand on. A foundation model trained on a billion incommensurable, ungoverned runs may learn less than a hybrid model trained on fifty harmonized ones — so the real near-term move is not a bigger model but a graph good enough to deserve one, which is a question of classification, identity, and governance, not of parameters. That is the prerequisite the hype skips and the ontology book makes its whole subject.

The four headline frontiers as tracks from research to production: each has traveled a real distance, none has reached routine commercial GMP use, and the wide band before the production gate — the demo-to-routine gap — is the honest subject of this chapter. Original diagram by the authors, created with AI assistance.

Agentic AI in GMP: the line drawn in 2026

The fourth frontier is the one regulators reached first. Agentic AI is a system that does not merely predict or draft but plans and acts — it decomposes a goal into steps, calls tools, and takes actions toward an objective with reduced human intervention. Built on the large language models of the generative-AI chapter, an agent could, in principle, read a deviation (a recorded departure from the approved procedure), query the historian (the system that logs every process value over time), draft a CAPA (Corrective and Preventive Action — the formal record of how a problem is fixed and prevented), update the SOP (Standard Operating Procedure — the approved written method), and schedule the corrective action — a loop of decisions, not a single suggestion. The capability is genuinely new: an LLM that merely drafts text is a tool; an agent that reads the deviation, decides which records to pull, writes the CAPA, and pushes the SOP change is taking actions that previously required a qualified human at each step. That is precisely why it collides with GMP, where the controlling principle is that a competent, accountable human dispositions anything that touches product quality. Vendors market this aggressively; "agentic" is the 2025-2026 buzzword, and the messaging consistently runs ahead of any demonstrated production use [1].

The reason this frontier is confined rather than advancing is that 2025 and 2026 produced two concrete regulatory facts that draw a hard line around autonomous agents in GMP. The first is enforcement: on 2 April 2026 the FDA issued its first AI-citing cGMP warning letter, to a firm (Purolea) that had used AI agents to generate specifications, SOPs, and master production records without quality-unit review [11]. The detail matters. The violation was not "you used AI"; the agency was explicit that AI use is permitted. The violation was that an AI agent produced GMP-controlling documents and a human quality unit did not review them before use, a 21 CFR 211.22(c) failure (the US rule requiring the quality unit to approve quality-affecting procedures before they are used), with the agent having omitted the process-validation requirement (the documented proof that a process reliably yields conforming product) entirely. That is precisely the failure mode an unconstrained agent invites: the more autonomously it acts, the more GMP-controlling artifacts it produces with no human in the decision, and the broader the surface that should have been reviewed and was not. The agency named the failure mode, not the technology.

The second is rulemaking: the draft EU/PIC/S GMP Annex 22 (Artificial Intelligence) — an appendix to the EU Good Manufacturing Practice rules, co-issued with PIC/S (the Pharmaceutical Inspection Co-operation Scheme, the international network of GMP regulators) — issued for joint EU/PIC/S consultation in July 2025, is the first manufacturing-specific AI rule, and it draws the line explicitly. For critical GMP applications it permits only static and deterministic models — static meaning no continuous or online learning after validation, deterministic meaning the same input yields the same output every time — and it excludes dynamic/adaptive, continuously-learning, probabilistic, and generative AI/LLM models from critical use [12][13]. Read against the definition of an agent, the exclusion is nearly total: an agent built on an LLM is generative, probabilistic, and (if it learns from its actions) adaptive — three of the exact properties the draft bars from the critical path. The permitted models must additionally be fully characterized, validated against predefined acceptance criteria on independent test data, log feature attribution and confidence scores, and be monitored after deployment [13]. The consistent expectation across FDA, EMA, Annex 22, and the ISPE GAMP guidance is a model locked at validation under a predetermined change-control plan, with human-in-the-loop (a human approves each action before it takes effect) or human-on-the-loop (a human monitors and can intervene, but the system acts on its own) oversight throughout — the opposite of an agent that adapts its own behavior in production [12][14]. (Annex 22 is a draft; finalization is expected mid-2026, and the specific exclusions are provisional, so cite it as draft.)
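
The draft's exclusion logic, as paraphrased above, reduces to three yes-or-no properties of a model. The sketch below encodes only that paraphrase. The class, field, and function names are hypothetical; it ignores the additional characterization, validation, logging, and monitoring requirements; and because the annex is a draft it should be read as an illustration, not a compliance check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    learns_after_validation: bool  # False means static
    deterministic: bool            # same input yields the same output
    generative: bool               # generative AI or LLM


def critical_use_exclusions(model: ModelProfile):
    """List the properties that would bar a model from critical GMP use."""
    reasons = []
    if model.learns_after_validation:
        reasons.append("adaptive: not static after validation")
    if not model.deterministic:
        reasons.append("probabilistic: same input may yield different output")
    if model.generative:
        reasons.append("generative AI/LLM")
    return reasons


locked_model = ModelProfile("locked regression soft sensor",
                            learns_after_validation=False,
                            deterministic=True, generative=False)
llm_agent = ModelProfile("self-adapting LLM agent",
                         learns_after_validation=True,
                         deterministic=False, generative=True)
```

The locked model returns an empty list and the agent returns all three reasons, which is the "nearly total" exclusion stated in code.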

The net is that agentic AI in GMP is real, demonstrated, and deliberately bounded to non-critical, human-in-the-loop tasks: drafting that a human reviews, triage that a human dispositions, retrieval that a human acts on. This is the frontier's distinctive feature: its ceiling is set by governance, not by technology. The autonomous agent that closes a CAPA on its own is, in 2026, on the wrong side of both an enforcement action and a draft rule. Unlike the small-data ceiling, which more data or better priors might eventually lift, this boundary will not move because the model improved. It moves only if regulators decide an adaptive, generative system can be trusted with a critical quality decision, and the entire posture of FDA, EMA, and PIC/S in 2025-2026 runs the other way, so it is unlikely to move quickly.

Reading the frontier: the demo-to-routine gap, measured

It would be easy to wave at "the gap" rhetorically. It is better to measure it. The most reliable instrument is the 7th ISPE Pharma 4.0 Survey, which asked the industry which digital technologies are in pilots versus scaled into routine use. AI/ML came back as the technology with the most pilot projects but the fewest scaled implementations — it trails big-data analytics, advanced analytics, robotic process automation (RPA — software bots that automate fixed clerical steps), GxP cloud (validated cloud infrastructure that meets the regulated "good practice" rules), and IIoT (the Industrial Internet of Things — networked plant sensors and devices), all of which are further along the maturity curve [1]. That ordering is itself diagnostic: the technologies ahead of AI/ML are the ones that do not depend on scarce living-system data — cloud and IIoT are infrastructure, RPA automates deterministic clerical steps, and big-data analytics consumes data the plant already generates in abundance. AI/ML sits last precisely because it is the one that needs the small-data ceiling to lift. The cross-industry pattern rhymes: broad surveys find that a large majority of organizations use AI while only a tiny fraction achieve enterprise-wide impact, and BioPhorum's Digital Plant Maturity Model places fully autonomous, self-optimizing operation (Level 5) as the explicitly aspirational end-state that essentially no plant has reached, with realistic organizations operating at Levels 3-4 [1][15].

The gap is therefore not an artifact of one cautious company; it is the industry's measured position. And its cause is the throughline of the entire book. The frontier capabilities all promise to escape the small-data ceiling — self-driving labs by generating data faster, federated learning by pooling it across firms, foundation models by pretraining on all of it — and each runs into the fact that living-system data is scarce, slow, confounded, and not commensurable across sites and scales [1][7]. The methods that have actually crossed into production — Raman soft sensing, MSPC, vision inspection, predictive maintenance — are precisely the ones that do not depend on escaping that ceiling: they model "normal," or they have abundant, cheap labels (an image of a vial), or they wrap a mechanistic backbone that supplies the knowledge the data lacks. That is the diagnostic rule the whole book has been building toward: a capability ships when it does not need the ceiling to lift, and stalls in pilot when it does. The frontier is the set of capabilities that still need the ceiling to lift, and the gap is the distance between needing it and lifting it.

# examples/platform/ml/frontier_scorecard.py  (excerpt)
# Score each frontier capability on three gates between a demo and routine GMP use.
# stdlib only — the DATA is the artifact (curated, sourced), not the code.
from dataclasses import dataclass

MATURITY = {"research": 1, "pilot": 2, "production": 3}   # how far it has traveled
TIER = {"press-release-only": 1, "vendor-self-reported": 2,
        "peer-reviewed-self-authored": 3, "peer-reviewed-independent": 4}

@dataclass(frozen=True)
class Frontier:
    name: str
    maturity: str          # demonstrated deployment ladder
    tier: str              # strongest evidence available
    data_adequacy: int     # 1-5: does enough commensurable data exist to support it?
    reg_cleared: bool      # is it permitted for CRITICAL GMP use today?
    anchor: str            # the strongest real demonstration

FRONTIERS = [
    Frontier("Self-driving bioreactors", "research", "peer-reviewed-self-authored",
             data_adequacy=2, reg_cleared=False,
             anchor="DataHow/Sartorius/Merck 27-day autonomous perfusion (ambr250, PD scale)"),
    Frontier("Federated learning (manufacturing)", "research", "peer-reviewed-self-authored",
             data_adequacy=1, reg_cleared=False,
             anchor="MELLODDY proved the mechanism — in discovery/QSAR, not manufacturing"),
    Frontier("Bioprocess foundation models", "research", "vendor-self-reported",
             data_adequacy=1, reg_cleared=False,
             anchor="Aspiration; generic time-series FMs probed; no bioprocess FM exists"),
    Frontier("Agentic AI in GMP", "pilot", "vendor-self-reported",
             data_adequacy=3, reg_cleared=False,
             anchor="Confined to non-critical, human-in-loop (Purolea letter, draft Annex 22)"),
]

def readiness(f: Frontier) -> dict:
    # routine GMP use needs ALL THREE gates: production maturity, adequate data, reg clearance.
    gaps = []
    if MATURITY[f.maturity] < MATURITY["production"]: gaps.append("not-yet-production")
    if f.data_adequacy < 4:                           gaps.append("data-ceiling")
    if not f.reg_cleared:                              gaps.append("not-cleared-for-critical-GMP")
    score = MATURITY[f.maturity] + TIER[f.tier] + f.data_adequacy + (3 if f.reg_cleared else 0)
    return {"name": f.name, "score": score, "routine_gmp_ready": not gaps, "gaps": gaps}

if __name__ == "__main__":
    for f in FRONTIERS:
        r = readiness(f)
        flag = "READY" if r["routine_gmp_ready"] else "GAP"
        print(f"[{flag}] {r['name']:<38} score={r['score']:>2}  gaps={','.join(r['gaps']) or 'none'}")
    n_ready = sum(readiness(f)["routine_gmp_ready"] for f in FRONTIERS)
    print(f"\nfrontiers ready for routine critical-GMP use: {n_ready} / {len(FRONTIERS)}")
    assert n_ready == 0, "no 2024-2026 frontier capability clears all three gates for critical GMP"
    print("ASSERT ok: the frontier is real, demonstrated, and not yet routine in critical GMP.")

Running it prints the chapter's thesis as a table — the gaps are explicit, per capability, and the bottom line is zero:

[GAP] Self-driving bioreactors               score= 6  gaps=not-yet-production,data-ceiling,not-cleared-for-critical-GMP
[GAP] Federated learning (manufacturing)     score= 5  gaps=not-yet-production,data-ceiling,not-cleared-for-critical-GMP
[GAP] Bioprocess foundation models           score= 4  gaps=not-yet-production,data-ceiling,not-cleared-for-critical-GMP
[GAP] Agentic AI in GMP                      score= 7  gaps=not-yet-production,data-ceiling,not-cleared-for-critical-GMP

frontiers ready for routine critical-GMP use: 0 / 4
ASSERT ok: the frontier is real, demonstrated, and not yet routine in critical GMP.

Read the scores, not just the verdict, because the spread is informative. Agentic AI scores highest (7) yet is no closer to routine critical-GMP use than the others. It is the most operationally mature and has the best data position of the four, but its distinctive blocking gate is regulatory: it fails the reg-clearance gate by design, which no score can buy back. Foundation models score lowest (4) because they are stalled on every axis at once. And here is the argument the scorecard encodes most sharply: all four frontiers fail the data-adequacy gate (data_adequacy < 4), even agentic AI, whose data is the most adequate of the set (data_adequacy = 3) but still short of the threshold. The two highest-evidence frontiers, self-driving bioreactors and federated learning, are both peer-reviewed and still fail it by a wide margin (data_adequacy = 2 and 1 against a threshold of 4; federated learning ties foundation models for the lowest value), encoding the chapter's point numerically: peer review buys methodological credit, not commensurable data. The scorecard is deliberately not a model. It is a survey artifact, the same shape as the case-studies ledger, and its value is that every cell is a sourced, falsifiable claim. If a self-driving GMP bioreactor ships next year, you update that one row (maturity="production", reg_cleared=True, and data_adequacy raised to 4 or more, since all three gates must clear), the assertion fails, and the book is wrong in a way you can see, which is exactly how a forward-looking chapter should age.
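The what-if can be rehearsed without editing the committed file. The sketch below re-declares only the three gates from the excerpt; Row and gaps are illustrative names that are not part of frontier_scorecard.py, and the edited values are hypothetical, not a forecast.

```python
from dataclasses import dataclass, replace
from typing import List

MATURITY = {"research": 1, "pilot": 2, "production": 3}


@dataclass(frozen=True)
class Row:                       # illustrative stand-in for one scorecard row
    name: str
    maturity: str
    data_adequacy: int           # 1-5, same scale as the excerpt
    reg_cleared: bool


def gaps(r: Row) -> List[str]:
    """Same three conjunctive gates as readiness() in the excerpt."""
    out = []
    if MATURITY[r.maturity] < MATURITY["production"]:
        out.append("not-yet-production")
    if r.data_adequacy < 4:
        out.append("data-ceiling")
    if not r.reg_cleared:
        out.append("not-cleared-for-critical-GMP")
    return out


today = Row("Self-driving bioreactors", "research", data_adequacy=2, reg_cleared=False)
# Hypothetical edits: promote two fields first, then the third.
two_fields = replace(today, maturity="production", reg_cleared=True)
all_three = replace(two_fields, data_adequacy=4)

for label, row in [("today", today), ("two fields", two_fields), ("all three", all_three)]:
    print(f"{label:<11} gaps={gaps(row) or 'none'}")
```

Under the excerpt's gates, promoting maturity and reg_cleared alone still leaves the data-ceiling gap; the row clears only once data_adequacy also reaches 4, because the three gates are conjunctive.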

Anatomy of a frontier claim, stress-tested

A frontier claim is the unit this chapter is built from, and like every artifact in the series it is only trustworthy when you can read what travels with it. Take the strongest one — the self-driving perfusion campaign — and unpack it field by field, then stress-test each field the way a skeptical reviewer would before letting it into a slide deck: for each field, ask what a marketer would do with it and what the field actually licenses.

One frontier claim, fully unpacked: what was actually demonstrated, at what scale, with what evidence tier — and, just as load-bearing, what it is NOT and the named gap its own authors flag. The fields that matter most are the ones that keep a real research result from being read as a deployment. Original diagram by the authors, created with AI assistance.

Read top to bottom and the discipline is the whole point — each field is a place a claim can be honestly stated or quietly inflated.

Demonstrated. Generous, and correctly so: a 27-day autonomous mammalian perfusion run is genuinely hard, and crediting it fully is what gives the rest of the card its credibility. Stress test: the marketer's move is to drop the qualifiers — "autonomous perfusion process," full stop. The field licenses "an autonomous development campaign over one 27-day cultivation," and the missing words are the whole difference.
Maturity / scale. (research), ambr250, not a GMP suite. Stress test: the inflation is to elide scale — "demonstrated on a perfusion process" invites the reader to picture a 2000 L production bioreactor, when the rig is a 250 mL mini-bioreactor bank. Scale is not a detail here; it is the gap.
Evidence. Peer-reviewed, which is real weight, with the tier named honestly: self-authored by the builders. Stress test: "peer-reviewed in Biotechnology and Bioengineering" is true and load-bearing, but reading it as independent validation is the error — the people reporting the significance are the people who built the rig, and no one else has reproduced it.
What it is NOT. Not a commercial drug, not a locked recipe, not autonomous CQA control. Stress test: this is the field a deck simply deletes, because every entry in it is a sentence the headline would rather you not finish. Its presence is the single best signal that a claim is being stated honestly.
Named gap. The most credible thing on the card, because it carries two distinct, separately-sourced limits: the authors' own caveat — robotic capability is not device autonomy [3] — plus the field-level gap the independent surveys flag, that microbial demonstrations do not transfer free to CHO [4][5]. Keeping the two apart matters: one is what the campaign's builders disclosed, the other is what reviewers outside the work identified. Beyond the cycle-time asymmetry, the Gaussian-process surrogate also assumes a roughly stationary response surface, which a multi-week mammalian run can violate within a campaign — so the acquisition function must re-confirm earlier points, not only explore. Stress test: a claim that names its own gap is far stronger than one that hides it; when the builders themselves flag the limit and the independent literature flags the rest, a reviewer can stop hunting for it.
Falsification. The footer that makes the claim age gracefully: it states the exact condition under which it would become a deployment — an autonomous loop running a released GMP batch under a predetermined change-control plan. Stress test: a claim with no falsification condition is not a hypothesis, it is a slogan; this one has a tripwire, so the day it crosses, the card flips visibly rather than the language quietly drifting.

The fields that matter most are the last three — the ones a marketing deck omits and a skeptical reviewer reads first.
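The card's fields can also be carried as a small record, so that a missing field is detectable rather than merely regrettable. This is a minimal sketch: ClaimCard, REVIEWER_FIRST, and omitted are hypothetical names, the field values paraphrase the walkthrough above, and "non-empty" is only a proxy for "honestly stated".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ClaimCard:
    demonstrated: str
    maturity_scale: str
    evidence: str
    what_it_is_not: List[str] = field(default_factory=list)
    named_gap: List[str] = field(default_factory=list)
    falsification: str = ""


# The three fields a skeptical reviewer reads first.
REVIEWER_FIRST = ("what_it_is_not", "named_gap", "falsification")


def omitted(card: ClaimCard) -> List[str]:
    """Return the reviewer-first fields that a card leaves empty."""
    return [name for name in REVIEWER_FIRST if not getattr(card, name)]


full_card = ClaimCard(
    demonstrated="autonomous development campaign over one 27-day perfusion cultivation",
    maturity_scale="research; ambr250 mini-bioreactors, not a GMP suite",
    evidence="peer-reviewed, self-authored by the builders",
    what_it_is_not=["a commercial drug", "a locked recipe", "autonomous CQA control"],
    named_gap=["robotic capability is not device autonomy",
               "microbial demonstrations do not transfer free to CHO"],
    falsification="an autonomous loop runs a released GMP batch under a "
                  "predetermined change-control plan",
)
# What a deck version of the same claim tends to keep.
deck_card = ClaimCard(
    demonstrated="autonomous perfusion process",
    maturity_scale="demonstrated on a perfusion process",
    evidence="peer-reviewed",
)

print("full card omits:", omitted(full_card) or "nothing")
print("deck card omits:", omitted(deck_card))
```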

The unsolved part: will the ceiling ever lift?

The honest open question under this entire chapter is whether the small-data ceiling is a temporary obstacle the frontier will eventually clear, or a structural property of biology that no amount of compute will overcome. The optimistic reading is that every ceiling in machine-learning history has eventually lifted: foundation models escaped small-data in language and vision by pretraining on web-scale corpora, and perhaps autonomous labs generating data around the clock, plus federated pooling across firms, plus a future bioprocess foundation model, will together manufacture the scale that biology withholds [7][9]. On this reading the three data-generating frontiers are complementary, not competing: self-driving labs raise the rate, federation raises the breadth, and a foundation model would convert the accumulated mass into transferable priors — and the agentic frontier waits on governance that will relax once the technical case is overwhelming.

The pessimistic reading is sharper and, today, better supported by evidence. Bioprocess data is not merely small; it has three properties that more of it does not fix. It is non-commensurable: two sites' day-7 feed rates are not the same feature, so pooling them (whether by federation or a foundation corpus) adds noise, not signal, unless every contributing process is semantically harmonized first. That harmonization is the shared-vocabulary, aligned-identifier, typed-value discipline that Book 4's reuse-and-alignment and identifiers-and-units chapters specify, riding on the ISA-95 equipment-and-batch model and B2MML exchange schema that give a "day-7 feed on BR-101" the same structural meaning across plants. It is the semantic-interoperability layer the data book builds, where each value also carries the governed data shadow (the lineage, units, and ALCOA+ integrity metadata that say what it is and whether it can be trusted), and it is the expensive thing nobody has done. It is confounded: one OOS (out-of-specification, a batch result that fails its acceptance limit) batch like BATCH-2026-004 carries a whole vector of correlated conditions, so a model learns "these twelve things co-occurred with failure" and cannot separate cause from coincidence. More confounded batches sharpen the correlation without ever isolating the cause; only a designed experiment can, and that is slow and expensive precisely in the regime where it is needed.
And it is fast-decaying — a process change makes last year's batches a different distribution, so a corpus does not accumulate the way a text corpus does; it expires — and part of the expiry is biological, not just procedural: cells drift across passages (successive rounds of re-growing the culture), working-cell-bank vials differ (the frozen stock of cells a campaign is started from is finite and slightly variable), and a bank has a finite lifetime, so a corpus pretrained on a decade of runs may have been pretrained on a population of cells that no longer exists, quite apart from any recipe change [1][7]. Put together, a foundation model trained on a billion incomparable, confounded, stale runs may learn less than a hybrid model trained on fifty good ones with the right physics — which is the empirical pattern the hybrid-modeling literature already reports under small data [10]. The agentic boundary compounds this: even if the data ceiling lifted tomorrow, an adaptive generative model would still be excluded from critical GMP by Annex 22's logic, so two independent obstacles — one statistical, one regulatory — would both have to fall.
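The non-commensurability point is easiest to see at its shallowest layer, units and scale. The sketch below is illustrative only: the two readings, the site names, and the tiny unit table are hypothetical, and normalizing units and working volume is far less than the semantic harmonization described above (it does nothing about what "day 7" or "feed" means at each site).

```python
from dataclasses import dataclass
from statistics import mean

# Conversion to mL/h as (multiply, divide) pairs; an assumed, deliberately tiny table.
TO_ML_PER_H = {"mL/h": (1.0, 1.0), "L/day": (1000.0, 24.0)}


@dataclass(frozen=True)
class FeedReading:
    site: str
    value: float
    unit: str
    working_volume_l: float      # vessel working volume in litres


def specific_feed(r: FeedReading) -> float:
    """Feed rate in mL/h per litre of working volume; refuses units it cannot type."""
    if r.unit not in TO_ML_PER_H:
        raise ValueError(f"{r.site}: unknown unit {r.unit!r}, refusing to pool")
    mul, div = TO_ML_PER_H[r.unit]
    return r.value * mul / div / r.working_volume_l


# Hypothetical day-7 readings from two sites at very different scales.
site_a = FeedReading("site-A", value=5.0, unit="mL/h", working_volume_l=0.25)
site_b = FeedReading("site-B", value=960.0, unit="L/day", working_volume_l=2000.0)

naive = mean([site_a.value, site_b.value])   # pools raw column values: meaningless
harmonized = [specific_feed(site_a), specific_feed(site_b)]

print(f"naive pooled mean of raw values: {naive}")
print(f"harmonized, mL/h per L of culture: {harmonized}")
```

In this constructed case the two raw numbers differ by orders of magnitude yet describe the same specific feed rate, which is why a same-named column is not evidence of a shared feature.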

No one knows which reading is correct, and intellectual honesty requires saying so. The frontier might cross the gap in a decade, or the gap might be the permanent shape of the field, with hybrid modeling and human-in-the-loop oversight not a transitional phase but the mature end-state — the field's equilibrium rather than its waiting room. The practical tell to watch is narrow and concrete: the day a single autonomous loop runs a released GMP batch under a predetermined change-control plan, the optimistic reading gains its first real evidence; until then the scorecard reads zero, and the honest default is that the workarounds are the destination. This chapter does not resolve it. It marks where the line is in 2026 and gives you the scorecard to watch it move.

What this chapter adds to the model suite

A forward-looking chapter contributes a survey artifact rather than a predictive model, in the spirit of the case-studies ledger:

examples/platform/ml/frontier_scorecard.py — a dependency-light (stdlib only) readiness scorecard. It encodes each of the four frontiers as a row carrying its maturity, evidence tier, a data-adequacy score, and a regulatory-clearance flag, with the strongest real demonstration named as the anchor. The readiness() function gates each capability on all three conditions for routine critical-GMP use — production maturity, adequate commensurable data, and regulatory clearance — and the module asserts that zero of the 2024-2026 frontiers clears all three. Like the case ledger, the data is the artifact: every field is a sourced, falsifiable claim, and the assertion is designed to flip the day a frontier genuinely crosses, so the chapter ages visibly rather than silently.

The reproducibility discipline is deliberate, and it is the same one the whole runnable suite carries: stdlib-only so there is nothing to pin and no environment to drift, an explicit assert so the script's exit status — not a human reading the output — is the verifier, and a committed, permissively licensed artifact so anyone can re-run it and get byte-identical output (the suite's heavier models, the scikit-learn and PyTorch chapters, hold their results stable the same way, through run_all.py under a fixed seed). That last property is what separates a falsifiable survey from a press release: a press-release figure cannot be re-run, while this scorecard prints the same verdict on every machine and fails loudly the day a field is edited to claim a crossing the evidence does not support. It is the open-source analogue of the analytics stack's CI gate, where a model's claim is only as good as the test that re-checks it on every change.

It coordinates with, and does not duplicate, the case ledger (named deployments and their evidence) and the transfer module (the real near-term workaround the foundation-model section points at). The scorecard sits one altitude above both: it scores capabilities, not deployments or models, and is the auditable form of the anatomy card — the same fields (maturity, tier, the named gap, the falsification condition), reduced to three pass/fail gates so the verdict is reproducible rather than rhetorical.
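The reproducibility discipline described in this section (exit status as the verifier, identical output on every run) can itself be checked mechanically. The following is a sketch under stated assumptions: run_and_fingerprint is a hypothetical helper, the two stand-in scripts are written to a temporary directory so the example runs without the real scorecard file, and the interpreter is assumed not to be started with -O, which strips assert statements.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def run_and_fingerprint(script: Path) -> Tuple[int, str]:
    """Run a script in a fresh interpreter; return (exit status, SHA-256 of stdout)."""
    proc = subprocess.run([sys.executable, str(script)],
                          capture_output=True, check=False)
    return proc.returncode, hashlib.sha256(proc.stdout).hexdigest()


# Stand-in scripts (hypothetical): one whose claim holds, one edited to claim a crossing.
PASSING = 'n_ready = 0\nprint("ready:", n_ready)\nassert n_ready == 0\n'
FAILING = 'n_ready = 1\nprint("ready:", n_ready)\nassert n_ready == 0\n'

with tempfile.TemporaryDirectory() as tmp:
    ok, bad = Path(tmp, "ok.py"), Path(tmp, "bad.py")
    ok.write_text(PASSING)
    bad.write_text(FAILING)
    first, second = run_and_fingerprint(ok), run_and_fingerprint(ok)
    failing_status, _ = run_and_fingerprint(bad)

print("exit status:", first[0], "| identical output:", first[1] == second[1])
print("edited-claim exit status:", failing_status)
```

A CI job can gate on exactly these two facts: a zero exit status and an unchanged output fingerprint.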

Why it matters

The frontier matters precisely because it is so easy to mis-sell, and mis-selling it is expensive. A plant that believes a self-driving GMP bioreactor is a purchasable product will set a strategy on a research demonstration; a quality unit that believes an agentic platform can close CAPAs autonomously will build a process that the next warning letter cites — the Purolea letter is the worked example of exactly that bill coming due [11]; an executive who believes a bioprocess foundation model exists will defund the unglamorous transfer-learning and hybrid-modeling work that is the actual near-term path. The discipline of this chapter — credit the real achievement, attach the maturity and evidence tier, name what it is not, and state how the claim would be falsified — is not pessimism. It is the only posture that lets a team invest in the frontier without being burned by it: build the demonstrations, fund the research, and deploy only what has crossed the gap, watching the scorecard for the day a capability genuinely moves a column.

In the real world

The named reality in 2026 is consistent across the four frontiers. Self-driving bioreactors: the DataHow/Sartorius/Merck autonomous perfusion campaign is the strongest demonstrated case, peer-reviewed and explicitly process-development scale (research) [2]; the wider autonomous-lab field is active and microbial-leaning, with the CHO/multi-vessel extension the named gap [4][5]. Federated learning: MELLODDY proved the mechanism at ten-company scale, in discovery; the manufacturing analogue is a research perspective blocked by non-commensurability, not a product (pilot in discovery only; research in manufacturing) [6][7]. Foundation models: no bioprocess foundation model exists; the genomics foundation models are a different modality, and generic time-series foundation models are an early research probe with no bioprocess corpus behind them (research/aspiration) [8][9]. Agentic AI: demonstrated for non-critical drafting and triage, confined by the Purolea warning letter and draft Annex 22 to human-in-the-loop, non-critical use (pilot, bounded) [11][12]. Above all of them sits the ISPE Pharma 4.0 finding: AI/ML has the most pilots and the fewest scaled implementations, and production clusters in monitoring, predictive maintenance, vision, and human-in-the-loop documentation, not autonomous control of CQAs [1]. The frontier is real, it is being built in earnest, and in 2026 none of it is routine on a commercial GMP line. That is not a criticism of the work; it is the honest map of where the work has reached.

Key terms

GMP (Good Manufacturing Practice) — the legally binding, regulator-inspected regime governing how a sellable drug is manufactured; the line the whole chapter measures the frontier against.
CQA (critical quality attribute) — a measurable product property (such as glycosylation or purity) that must stay within its specification for the batch to be safe and effective, and that cannot be changed once the batch is released.
CHO (Chinese Hamster Ovary) — the standard mammalian cell line for therapeutic antibodies; slow and expensive relative to microbial hosts, which is why closing an autonomous loop on it is so much harder.
Perfusion vs fed-batch — two process modes: a perfusion run continuously exchanges media at steady state, so it can be re-sampled day after day; a fed-batch is a single sealed run, fed but not drained, whose release CQAs return only once, at harvest.
HCP (host-cell protein) — a residual impurity from the production cell line that must be cleared; an HCP excursion is a spike above its limit.
OOS (out-of-specification) — a batch result that fails its acceptance limit (e.g. BATCH-2026-004).
PIC/S — the Pharmaceutical Inspection Co-operation Scheme, the international network of GMP regulators co-issuing draft Annex 22 with the EU.
Non-IID / FedAvg — non-IID data means each site's data follows a different distribution, not the independent-and-identically-distributed mix that averaging assumes; FedAvg is the standard federated method that simply averages each site's model update, which non-IID, client-imbalanced manufacturing data degrades.
Self-driving bioreactor — a cultivation that designs, runs, and learns from its own experiments in a closed loop, with no human selecting the conditions between rounds; demonstrated at development scale (research), not in GMP. The loop has four links — surrogate, acquisition function, execution, analytics — that must all close in real time.
Bayesian optimal experimental design — choosing each next experiment to maximize expected information gain about the process; the acquisition engine inside a self-driving lab, which needs a model that reports calibrated uncertainty (hence the Gaussian-process surrogate).
Cognitive digital twin — a hybrid mechanistic-plus-data model coupled to a surrogate (often a Gaussian process) that both predicts the process and proposes the next experiment; hybrid by necessity, because the small-data ceiling forbids learning a 27-day trajectory from data alone.
Robotic capability vs device autonomy — the authors' own distinction: a rig that can execute experimental steps without hands (capability, demonstrated) is not a system that can be trusted to decide (autonomy, the unmet trust claim that GMP would require).
Federated learning (FL) — training a shared model across institutions by exchanging only model updates, never raw data, typically via secure aggregation, so private datasets can contribute to a common model without being disclosed.
MELLODDY — the ten-company, 2.6-billion-data-point federated-learning project that proved the privacy mechanism and federated benefit at scale — in drug discovery / QSAR, not manufacturing.
Non-commensurability — the problem that the same-named feature (a feed rate, a day-7 reading) means subtly different things across sites and scales, which blocks naive pooling of manufacturing data and is the structural reason federation does not transfer from molecules to processes.
Foundation model — a single large model pretrained on a broad corpus and adapted to many downstream tasks with little extra data; for bioprocess time series it is an aspiration, not an existing product, blocked by a missing corpus and a modality mismatch.
Time-series foundation model — a model (Chronos, TimeGPT, Moirai) pretrained on large collections of generic time series for zero-shot forecasting; an early research probe for process signals, with no bioprocess-specific corpus, kinetics, or mass-balance knowledge behind it.
Agentic AI — a system that plans and acts toward a goal by calling tools and taking steps, not merely predicting or drafting; in GMP, confined to non-critical, human-in-the-loop tasks because it is generative, probabilistic, and adaptive — three properties draft Annex 22 bars from the critical path.
Predetermined change-control plan (PCCP) — a pre-approved scope of allowed model changes; the regulatory expectation that a model be locked at validation with changes governed in advance, the opposite of an unconstrained adaptive agent.
Demo-to-routine gap — the measured distance, captured by the ISPE Pharma 4.0 survey, between a demonstrated AI capability and its scaled, routine use in production; AI/ML shows the most pilots and the fewest scaled implementations of any surveyed enabling technology.
Small-data ceiling — the binding constraint of living-system data (scarce, slow, confounded, non-commensurable, fast-decaying) that every frontier capability promises to escape and none has yet escaped; the chapter's diagnostic rule is that a capability ships when it does not need the ceiling to lift and stalls in pilot when it does.
Semantically harmonized data / ontology IRI — the precondition that makes pooling sound: a feature pulled by a globally unique identifier (such as bp:feedRate) realized on a typed, unit-bearing quantity, not a fragile column name, so a value from one site is genuinely the same feature as the same-named value from another; the reuse-and-alignment work the data-pooling frontiers silently assume and almost no one has done.
SHACL completeness contract — the closed-world shape (bp:ReleaseShape) that certifies every required CQA is present, singular, typed, and in range before a record becomes a training row, so a model never silently learns from a hole; the same gate that refuses a non-conformant release refuses a non-conformant training set, and the release rule defines what a valid PASS/OOS label even is.
Grouped / leave-one-batch-out cross-validation — splitting a model's data on the bp:derivedFrom lineage rather than at random, holding out every row that shares a batch ancestor, so sibling samples of one lot never leak across the fold and the reported score is one an unseen campaign would actually earn; the graph's typed lineage edges supply the grouping key a flat table lacks.
GraphRAG / retrieval grounding — answering an LLM's factual question by walking the verified knowledge graph's typed edges and citing them, so the model narrates a true lineage rather than inventing one; a query that returns no conforming subgraph is the retrieval-time analogue of an out-of-distribution flag, the discipline the companion ontology chapter makes its whole subject.
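The Non-IID / FedAvg entry above can be made concrete with one aggregation step. This is a minimal sketch, assuming the standard sample-weighted form of federated averaging; it omits local training and secure aggregation entirely, and the two sites and their parameter vectors are hypothetical.

```python
from typing import List, Sequence, Tuple

Update = Tuple[int, Sequence[float]]     # (number of local samples, model parameters)


def fedavg(updates: Sequence[Update]) -> List[float]:
    """Sample-weighted average of per-site parameter vectors (one aggregation step)."""
    total = sum(n for n, _ in updates)
    if total <= 0:
        raise ValueError("no samples contributed")
    dim = len(updates[0][1])
    if any(len(w) != dim for _, w in updates):
        raise ValueError("sites disagree on parameter shape")
    return [sum(n * w[i] for n, w in updates) / total for i in range(dim)]


# Hypothetical, deliberately imbalanced sites whose local optima disagree (non-IID).
large_site = (900, [1.0, 0.0])
small_site = (100, [0.0, 1.0])

print(fedavg([large_site, small_site]))
```

With these made-up inputs the averaged parameters sit close to the large site's, which is the client-imbalance effect the entry names: the small site's different process barely registers in the shared model.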

Where this leads

The frontier is mapped, the gap is measured, and the scorecard reads zero — every headline capability is real, demonstrated, and not yet routine in critical GMP. Before the verdict, one of those four frontiers deserves a closer look — not for what it might someday decide, but for the unglamorous, governable job it can do now. The next chapter, Agentic AI for Connectivity, takes the agentic capability off the frontier and points it at the plant's oldest bottleneck — wiring a new instrument into the data model — and finds the honest answer is the same one this book keeps reaching: not a smarter agent, but a deterministic verifier and a human gate that make the agent safe to use at all.
