Molecule Discovery: Generative Design and Developability Prediction

📍 Where we are: Part II · Discovery & Development, Learned · Chapter 5. The previous chapter, Target and Concept, used learning to decide where a molecule should start: which target, which modality, which rough format. Now the format is fixed: an IgG monoclonal antibody, mAb-A. (A monoclonal antibody, "mAb," is a single, uniform Y-shaped immune protein mass-produced as a drug; IgG is its most common class, the antibody type nearly all therapeutic antibodies belong to.) The question becomes which exact sequence to advance, and whether that sequence can survive the factory that is still twelve chapters away.

A molecule is born as a string of amino acids — the ~20 chemical building blocks that, chained in a specific order (the sequence), fold into a protein — long before it is born as a batch. The choice of that sequence — made by an antibody discovery team staring at hundreds or thousands of candidate binders — quietly fixes most of what the rest of this book will struggle with. A candidate that binds beautifully but aggregates (whose molecules clump together into larger species — a safety and quality defect, because clumps can lose activity and provoke the immune system) will fight you in the production bioreactor, clog the polishing column, and fail the SEC (size-exclusion chromatography) high-molecular-weight gate — the purity test that catches clumped, aggregated protein. A candidate with a deamidation hotspot will drift its charge-variant profile every time the process breathes. The discovery team almost never sees the factory — but its sequence decisions are the factory's first and most irreversible inputs. This chapter is about the machine learning that tries to see the factory from the sequence, and increasingly to design sequences that the factory will like.

This is also where the small-data ceiling from Chapter 1 briefly lifts. Sequences are cheap; assays are not, but there are far more sequences with some measured property than there are complete GMP batches. Discovery is the one corner of the spine where genuinely large models — protein language models trained on hundreds of millions of natural sequences — have a foothold. The catch, which we will return to, is that the property the factory cares about is exactly the one that is still small-data.

The simple version

Think of hiring. A binding screen is the interview: it tells you the candidate can do the headline job (stick to the target). Developability prediction is the background check: can this candidate actually show up to work every day — not dissolve into a sticky mess, not pick up chemical damage, not provoke the immune system? You run the cheap background check (read the résumé, the sequence) on thousands of applicants before the expensive on-site (making protein, running assays), so you never waste a year developing someone who was always going to wash out. Generative design is the next step: instead of only screening applicants who walked in the door, you write the job description so precisely that you can propose ideal candidates and then verify them.

What this chapter covers

The discovery-to-manufacturability handoff: why a sequence is a manufacturing decision
Sequence features, the chemistry behind the classic liability motifs, and the scoring math
Developability prediction as a supervised-learning task, with the training and validation that make it honest
The biophysical guidelines (Therapeutic Antibody Profiler, the Jain panel) that anchor the field, with their actual metrics
Protein language models and structure models — the foundation-model wave reaching antibodies, and the math that makes embeddings useful
Generative antibody/protein design, named tools with maturity and evidence tiers, and where it is real versus aspirational
A runnable developability.py that turns a CDR sequence into a liability score tied to CQAs (critical quality attributes — measured product properties that must stay in spec; defined in Chapter 1)
The anatomy of one developability prediction, field by field, and the unsolved label-scarcity problem
The GMP/CMC angle: why a discovery model never decides anything, but shapes everything

The handoff nobody in discovery watches

The five-book series keeps insisting that manufacturing consequences are baked in upstream; nowhere is that more literal than here. A monoclonal antibody is roughly 1,300 amino acids across its heavy and light chains (the two pairs of protein strands that assemble into the Y shape — two identical long "heavy" chains and two identical short "light" chains). Each chain has a variable region (the tip of the Y, which differs from antibody to antibody and does the binding) and a constant region (the rest, shared across antibodies of the same class). The variable regions — and within them the six complementarity-determining regions (CDRs) that form the binding surface — are where discovery has freedom and where most liabilities live. The constant regions are largely fixed by the chosen isotype (the antibody subclass that sets the constant-region sequence — here IgG1, a subtype of IgG); the CDRs are the dozen-to-few-dozen residues that selection pressure for binding shapes, with no pressure at all for manufacturability unless someone adds it.

That is the whole problem. A display campaign (phage, yeast, or mammalian), a transgenic-animal/single-B-cell route, or a hybridoma optimizes affinity (binding strength — how tightly the antibody grips its target). These are all just different laboratory methods for finding and refining antibodies that bind well; the common thread is that every one of them selects for binding and nothing else. Affinity-optimized CDRs are frequently rich in aromatic and hydrophobic (water-repelling) residues (they make good binding contacts) — and those same residues drive aggregation and viscosity, the two properties that most often kill a high-concentration subcutaneous antibody (a subcutaneous, under-the-skin, dose must fit in a small roughly 1–2 mL injection, which forces protein concentrations above ~100 mg/mL — the regime where self-association raises viscosity past what a syringe or a fill line can push) [1][2]. The discovery team's reward function and the factory's reward function are not just different; on the most important axis they are opposed. Developability prediction exists to put the factory's reward function into the discovery loop early, computationally, before any of it is expensive.

Concretely, in our running example: the lead sequence behind WCB-CHO-001 and ultimately BATCH-2026-001 had to clear a developability screen before the cell line was ever built. Every CQA on the release certificate traces back to that screen. The SEC_HMW_pct of 1.287 and SEC_monomer_pct of 98.611 on BATCH-2026-001 (read straight from examples/datasets/hplc_results.csv) are, in part, a verdict on a sequence choice made years earlier — the aggregation propensity that the discovery screen tried to predict, now measured on a real released lot. A bare number is not a verdict without its specification window: SEC_HMW_pct 1.287% sits well inside its 0.0–3.0% spec (a PASS on the release certificate), and the charge-variant spread (CEX_acidic_pct 21.551, CEX_basic_pct 10.452, the acidic fraction inside its 10.0–30.0% window) is the in-process echo of the PTM hotspots that screen flagged or cleared. Discovery does not measure these; it bets on them.
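The pass/fail reading above is a plain window check. A minimal sketch, using only the values and specification windows quoted in this section; the dictionary layout and the function name in_spec are illustrative assumptions, not the suite's schema:

```python
# Values and spec windows quoted in the text for BATCH-2026-001.
# The dict layout and function name are illustrative assumptions.
RESULTS = {"SEC_HMW_pct": 1.287, "CEX_acidic_pct": 21.551}
SPECS = {"SEC_HMW_pct": (0.0, 3.0), "CEX_acidic_pct": (10.0, 30.0)}

def in_spec(attribute: str, value: float) -> bool:
    """True when the value lies inside the inclusive specification window."""
    low, high = SPECS[attribute]
    return low <= value <= high

verdicts = {name: ("PASS" if in_spec(name, value) else "FAIL")
            for name, value in RESULTS.items()}
print(verdicts)
```

The same number would read very differently against a tighter window, which is why the window travels with the value.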

Sequence features: the interpretable backbone, and the math

Before any deep model, there is a layer of features any chemist would recognize, and any audit would accept. These are computed directly from the sequence, they are cheap, and they map onto specific manufacturing failure modes. Four families matter, and each carries a concrete formula.

Hydrophobicity. Summarized by GRAVY (grand average of hydropathy), the mean Kyte-Doolittle hydropathy across the residues: assign each residue its tabulated Kyte-Doolittle hydropathy index — a standard per-amino-acid scale where positive means water-repelling (hydrophobic, e.g. Ile +4.5, Val +4.2) and negative means water-loving (hydrophilic, e.g. Arg -4.5) — then sum and divide by length. A net-hydrophobic CDR is "sticky": it favors self-association, which shows up downstream as HMW aggregate (the SEC HMW_pct attribute) and as elevated viscosity at the high concentrations a subcutaneous dose demands. The sequence-level average is the floor; a structure model lets you refine it into a spatial aggregation propensity (SAP) — a sum of side-chain hydrophobicity over a moving spatial window weighted by each atom's solvent-accessible surface area, which localizes the sticky patch rather than averaging it away. SAP is the metric that best tracks measured aggregation, but it needs a structure; the sequence GRAVY already separates the obvious offenders [1][2].
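As a concrete illustration of the sum-and-divide step, here is a minimal GRAVY sketch. The table is the standard Kyte-Doolittle scale; the function name gravy and the two test peptides are made up for the example and are not taken from the suite's script:

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(seq: str) -> float:
    """Mean Kyte-Doolittle hydropathy; positive means net hydrophobic."""
    if not seq:
        raise ValueError("empty sequence")
    # A non-standard letter raises KeyError rather than being silently skipped.
    return sum(KYTE_DOOLITTLE[aa] for aa in seq.upper()) / len(seq)

# Two made-up peptides, not real CDRs: one greasy, one polar.
print(round(gravy("IVLF"), 3))   # 3.825
print(round(gravy("RDSG"), 3))   # -2.3
```

Because it is an average, GRAVY cannot tell one concentrated hydrophobic patch from the same residues spread evenly, which is the gap SAP is meant to close.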

Net charge and isoelectric point. Estimated from the ionizable residues (the amino acids that can carry an electric charge) via the Henderson-Hasselbalch relation — the standard chemistry rule for how charged a group is at a given acidity. (Two background terms: pH measures how acidic a solution is, low pH = acidic; pKa is the pH at which a particular group is half-charged.) For each acidic group the fractional negative charge at pH is one over one-plus-ten-to-the-(pKa minus pH); for each basic group the fractional positive charge is one over one-plus-ten-to-the-(pH minus pKa); the net charge is the sum of the positives minus the sum of the negatives, with the N- and C-termini contributing their own pKa terms. Sweeping pH until the net charge crosses zero yields the isoelectric point (pI). Charge at formulation pH governs colloidal stability (extremes of charge can drive both aggregation and viscosity), and the distribution of charge across isoforms is literally the CEX (cation-exchange chromatography) assay's readout — the CEX_acidic_pct / CEX_main_pct / CEX_basic_pct triplet on every batch. The Therapeutic Antibody Profiler (TAP, the standard guideline screen detailed two sections below) adds two structure-derived charge metrics here — patches of positive and negative surface charge (PPC/PNC) and Fv charge-symmetry (Fv is the variable fragment, the binding tip of the Y) — because where the charge sits on the surface, not just the net total, predicts viscosity at high concentration [1].
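The pH sweep can be sketched as a bisection, since net charge only falls as pH rises. The side-chain pKa values below follow the chapter's developability.py excerpt; the two terminal pKa values, the function name isoelectric_point, and the example string are assumptions made for this sketch:

```python
# Side-chain pKa values follow the chapter's developability.py excerpt.
PKA_POS = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
# Terminal pKa values are ASSUMED here (typical textbook figures).
N_TERM_PKA = 9.0
C_TERM_PKA = 2.0

def net_charge(seq: str, pH: float) -> float:
    """Sum of fractional positive charges minus fractional negative charges."""
    pos = 1.0 / (1.0 + 10 ** (pH - N_TERM_PKA))
    neg = 1.0 / (1.0 + 10 ** (C_TERM_PKA - pH))
    for aa, pk in PKA_POS.items():
        pos += seq.count(aa) / (1.0 + 10 ** (pH - pk))
    for aa, pk in PKA_NEG.items():
        neg += seq.count(aa) / (1.0 + 10 ** (pk - pH))
    return pos - neg

def isoelectric_point(seq: str, lo: float = 0.0, hi: float = 14.0,
                      tol: float = 1e-4) -> float:
    """Bisect on pH until the zero crossing of net charge is bracketed within tol."""
    if net_charge(seq, lo) < 0 or net_charge(seq, hi) > 0:
        raise ValueError("net charge does not cross zero in the pH range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid      # still positive: the crossing is at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0

cdr = "ARDYYGSSWFDY"   # made-up string, not a real CDR
print(round(net_charge(cdr, 7.4), 2), round(isoelectric_point(cdr), 2))
```

A very arginine-rich string can stay positive all the way to pH 14 with these pKa values; the sketch raises an error in that case rather than returning a misleading pI.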

Post-translational-modification (PTM) liability motifs. Short sequence patterns — captured as regular expressions (regex, a compact notation for "match this text pattern"; e.g. N[GS] means "an N followed by either a G or an S") — where chemistry happens spontaneously and turns one molecule into a heterogeneous population. Each amino acid has a one-letter code (N = asparagine "Asn," G = glycine "Gly," D = aspartate "Asp," C = cysteine, M = methionine, W = tryptophan, P = proline, and so on); the patterns below are written in it:

N-G / N-S (regex N[GS]) — asparagine deamidation: Asn becomes Asp/isoAsp, adding a negative charge → an acidic charge variant (drives CEX_acidic_pct up over shelf life). The Gly that follows Asn gives the backbone the flexibility to form the cyclic succinimide intermediate, which is why N-G is the canonical hotspot.
D-G / D-T / D-S / D-H (regex D[GTSH]) — aspartate isomerization to isoAsp: another route to charge and potency drift.
N-X-S/T with X not Pro (regex N[^P][ST]) — an N-linked glycosylation sequon in a CDR: heterogeneous glycoforms, altered clearance, a quality nightmare in a binding region. Note this is an aberrant sequon: the antibody's wanted glycan sits on the conserved Fc asparagine (N297) and is a CQA the cell-culture process is built to control; a new sequon inside a CDR is an unintended, often heterogeneously-occupied site that can blunt binding and shift clearance — a quality liability precisely because it is in the wrong place.
M / W (regex [MW]) — methionine/tryptophan oxidation: charge and stability changes, often process- and storage-dependent. Grouping Met and Trp under one flat penalty is a deliberate conservative simplification — Met oxidation is high-prevalence and surface-exposure/peroxide-driven, while Trp oxidation is rarer, light- and radical-driven, and often more consequential in a CDR; a production tool weights them separately by solvent accessibility.
odd cysteine count — an unpaired (free) cysteine: a reactive thiol that forms intermolecular disulfides → covalent HMW aggregate that the polishing step cannot easily resolve. The screen models this by penalizing the count of C modulo two, so an even (paired) count costs nothing and an odd (unpaired) count costs the full penalty.
D-P (regex DP) — an acid-labile Asp-Pro bond → fragmentation to LMW species (the SEC LMW_pct attribute).

The exact-position chemistry matters more than the count: a hotspot buried in a structured framework strand is far less labile than the same motif exposed in a flexible CDR loop, which is why structure-aware tools weight motif solvent accessibility. The sequence regex is the conservative first pass — it over-flags, deliberately, because a false alarm at the desk costs nothing and a missed hotspot costs a charge-variant drift discovered in the factory.
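A minimal sketch of that conservative regex pass, using the patterns listed above. The dictionary layout, the function name scan_liabilities, the short CQA labels, and the lookahead wrapper that counts overlapping hits are choices made for this illustration, not necessarily what the suite's script does:

```python
import re

# Motif patterns and their consequences as described in the list above.
MOTIFS = {
    "deamidation":   (r"N[GS]",     "CEX_acidic_pct"),
    "isomerization": (r"D[GTSH]",   "charge/potency drift"),
    "glycosylation": (r"N[^P][ST]", "glycoform heterogeneity"),
    "oxidation":     (r"[MW]",      "charge/stability"),
    "fragmentation": (r"DP",        "SEC LMW_pct"),
}

def scan_liabilities(seq: str) -> list[tuple[str, int, str, str]]:
    """Return (liability, 0-based position, matched text, threatened attribute)."""
    hits = []
    for name, (pattern, cqa) in MOTIFS.items():
        # (?=(...)) is a zero-width lookahead, so overlapping motifs are all found.
        for m in re.finditer(f"(?=({pattern}))", seq):
            hits.append((name, m.start(), m.group(1), cqa))
    # Unpaired cysteine: an odd cysteine count is flagged once, with position -1.
    if seq.count("C") % 2 == 1:
        hits.append(("free_cysteine", -1, "C", "SEC HMW_pct"))
    return sorted(hits, key=lambda h: h[1])

for hit in scan_liabilities("ARDNGSWDPC"):   # made-up string, not a real CDR
    print(hit)
```

In the made-up string, the single stretch NGS trips both the deamidation and the glycosylation pattern, which is the deliberate over-flagging the text describes.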

Immunogenicity risk — the chance the patient's immune system attacks the drug as foreign. The trigger is T-cell epitopes: short fragments of the antibody (peptides) that the immune system can display on a cell-surface protein called MHC class II. When a fragment is displayed there, immune T-cells can recognize it and rally the body to make anti-drug antibodies that neutralize the medicine — so a sequence that produces many strongly-displayed fragments is a liability. This is its own ML sub-field: neural-network peptide-MHC binding predictors (NetMHCIIpan is the canonical one) slide along the sequence in overlapping 15-residue (15-mer) windows and score each window against a panel of HLA alleles (the human genetic variants of the MHC protein), returning a display likelihood per window [3]. The design response is deimmunization: mutating predicted epitopes while preserving binding.
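The sliding-window step is simple enough to sketch on its own. The scorer below is a placeholder passed in as a function: it is not NetMHCIIpan and not a binding predictor of any kind, and the names peptide_windows, epitope_hotspots, and toy_score are hypothetical:

```python
from typing import Callable

def peptide_windows(seq: str, k: int = 15) -> list[tuple[int, str]]:
    """All overlapping k-residue windows with their 0-based start index."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def epitope_hotspots(seq: str, score_fn: Callable[[str], float],
                     threshold: float, k: int = 15) -> list[tuple[int, str, float]]:
    """Windows whose score from score_fn meets or exceeds the threshold."""
    scored = [(i, pep, score_fn(pep)) for i, pep in peptide_windows(seq, k)]
    return [row for row in scored if row[2] >= threshold]

# Placeholder scorer for demonstration only: fraction of large hydrophobic
# residues. It is NOT a peptide-MHC binding predictor.
def toy_score(peptide: str) -> float:
    return sum(peptide.count(aa) for aa in "FILVWY") / len(peptide)

hits = epitope_hotspots("A" * 15 + "F" * 15, toy_score, threshold=0.5)
print(len(hits), hits[0][0])
```

A real screen would repeat the scoring once per HLA allele in the panel and keep the per-allele results, since a window that is harmless for one allele may be strongly displayed by another.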

The point of keeping these features explicit is not nostalgia. Under the regulatory posture this book keeps returning to — locked, explainable models for anything touching product quality — an interpretable feature whose mechanism a chemist can name is worth more than a marginally more accurate black box, because you can defend it. The deep models in the next section earn their place by being more accurate; they do not replace the obligation to explain.

Developability prediction as a learning task

Frame it as supervised learning and the difficulties become precise. The input is a sequence (or a structure model derived from it). The target is a measured developability property: a hydrophobic-interaction-chromatography (HIC) retention time as an aggregation proxy, a measured viscosity at 150 mg/mL, an accelerated-stability aggregation rate, an assay-panel pass/fail. The model is a regressor or classifier — historically a random forest or gradient-boosting model on the engineered features above, increasingly a learned representation on top of a protein language model.

How the label is made, and why it is the bottleneck. A single viscosity label is not free data; it is a multi-week wet-lab campaign. You must express the candidate (transient transfection, days), purify it (Protein A plus polishing, days), concentrate it to therapeutic strength (the step that itself fails on the stickiest candidates, biasing the dataset), and measure cone-and-plate or microfluidic viscosity at 150 mg/mL — consuming tens of milligrams of pure protein per data point. HIC retention is cheaper, and it is the field's standard aggregation surrogate for a mechanistic reason, not just a budgetary one: retention is governed by the same surface-exposed hydrophobicity that drives self-association, which is exactly what the patch-of-surface-hydrophobicity score (PSH) and SAP try to compute from structure — so a long HIC retention and a hot hydrophobic patch are two readouts of one underlying liability. Accelerated stability (weeks at 40 degrees Celsius, then SEC) is the most relevant but the slowest. So the label is expensive, noisy, and a proxy — three properties that make ordinary deep learning ill-suited and make the validation discipline below non-negotiable.

Training and validation that survive scrutiny. Because the labeled set is small (hundreds, not millions) and the candidates are related, the validation has to fight optimism harder than the model fights bias:

Cross-validation, not a single split. With a few hundred labels a single hold-out is a coin flip; k-fold CV is the floor, and the per-fold variance is reported alongside the mean because a model whose accuracy swings wildly across folds is not a model you put in a screen.
Grouped / scaffold splits. Random splits leak: two candidates from the same affinity-maturation lineage differ by a residue, so a random split puts near-duplicates on both sides and inflates the score. Splitting by germline family or campaign — keeping whole lineages on one side — gives the only honest estimate of how the model behaves on the next genuinely new molecule.
The right metric for the right shape. Developability triage cares about ranking (advance the best, drop the worst), so AUROC (the probability the model ranks a random true-bad candidate above a random true-good one — 0.5 is a coin flip, 1.0 is perfect) and Spearman correlation (how well the model's ordering matches the measured ordering) matter more than raw error; for the imbalanced "will this one fail" classifier, where failures are rare, AUPRC (the area under the precision-vs-recall curve) against the prevalence baseline (the score a no-skill model gets, equal to the fraction that actually fail) is the honest number, exactly as the suite's downstream release predictor reports it. And because triage ends in a decision, the gate is finally a threshold, not a ranking: where you cut trades recall (never discard a strong binder over a fixable liability) against precision (never spend a scarce wet-lab slot on a candidate that was always going to wash out) — the same two-operating-point choice the release predictor makes explicit.
A baseline that must be beaten. The PLM-embedding model has to beat the cheap engineered-feature model on the grouped split, or it is buying complexity for nothing — the same "earn your parameters" discipline the suite applies when a 1D-CNN fails to beat a PLS soft sensor and loses on parsimony.
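Two pieces of this discipline fit in a few lines of standard-library Python: a split that keeps whole lineages on one side, and the ranking metric with its no-skill baseline. This is a sketch under stated assumptions; the function names and the greedy fold-filling rule are illustrative, and a real pipeline would more likely use an established library's grouped splitter:

```python
def grouped_folds(groups: list[str], k: int) -> list[list[int]]:
    """Assign whole groups (e.g. lineages) to k test folds, largest group first."""
    members: dict[str, list[int]] = {}
    for idx, g in enumerate(groups):
        members.setdefault(g, []).append(idx)
    folds: list[list[int]] = [[] for _ in range(k)]
    for g in sorted(members, key=lambda name: (-len(members[name]), name)):
        min(folds, key=len).extend(members[g])   # fill the currently smallest fold
    return folds

def auroc(scores: list[float], labels: list[int]) -> float:
    """P(random positive outscores random negative); ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def prevalence(labels: list[int]) -> float:
    """No-skill AUPRC baseline: the fraction of positives (failures)."""
    return sum(labels) / len(labels)

# Hypothetical lineage labels for eight candidates.
lineages = ["L1", "L1", "L1", "L2", "L2", "L3", "L4", "L4"]
folds = grouped_folds(lineages, 3)
print(folds)
```

Each fold serves once as the test set while the remaining folds train the model; reporting the per-fold spread alongside the mean follows directly from looping over folds.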

Three things make this task genuinely hard, and they are the same three the whole book circles:

The labels are small-data and expensive. A measured viscosity at therapeutic concentration needs tens of milligrams of purified protein — you cannot run it on ten thousand candidates. The largest public biophysical panels number in the low hundreds of clinical-stage antibodies (the foundational Jain et al. survey assayed 137 clinical-stage mAbs across a dozen assays) [2]. Industrial datasets are larger but proprietary; the recent ensemble-deep-learning viscosity work that pushed performance forward did so precisely by assembling large-scale internal viscosity data, which is exactly the asset most teams lack [4].
The properties are correlated and multi-objective. Reducing hydrophobicity to fix aggregation can wreck affinity; germlining a PTM hotspot can shift charge. There is no single scalar "developability"; there is a Pareto surface, and a real screen reports a vector of risks, which is why the field gravitates to interpretable per-axis flags rather than one opaque number.
The distribution shifts. A model trained on one company's historical antibodies, or on clinical-stage molecules that already survived selection, is a biased sample of sequence space — and a generative model that proposes novel sequences pushes the predictor out of its training distribution by design. The predictor that scored the natural candidates may be unreliable on the engineered ones it most needs to grade.
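The Pareto idea in the second point can be made concrete with a small dominance filter. The risk vectors below are invented for illustration, every axis is treated as a risk where lower is better, and the names dominates and pareto_front are hypothetical:

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """a dominates b if it is no worse on every axis and better on at least one.
    All axes are risks here, so lower is better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates: dict[str, tuple[float, ...]]) -> list[str]:
    """Names of candidates that no other candidate dominates."""
    return [name for name, risks in candidates.items()
            if not any(dominates(other, risks)
                       for other_name, other in candidates.items()
                       if other_name != name)]

# Hypothetical risk vectors: (aggregation risk, charge risk, affinity loss).
panel = {
    "cand_A": (0.2, 0.5, 0.1),
    "cand_B": (0.4, 0.2, 0.3),
    "cand_C": (0.5, 0.6, 0.4),   # worse than cand_A on every axis
}
print(pareto_front(panel))   # ['cand_A', 'cand_B']
```

Neither cand_A nor cand_B beats the other on every axis, so both survive; choosing between them is a project decision, not something a single scalar score can settle.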

The strongest current evidence is research-tier but real. PROPERMAB (research) is a published framework (Regeneron) that predicts antibody developability properties — including HIC retention and high-concentration viscosity — in silico from sequence and structure-derived features plus protein-language-model embeddings, explicitly built to triage candidates before wet-lab assays [5]. The ACeT context-embedding-transformer work (research, preprint) predicts the expensive late-stage endpoints — high-concentration viscosity, in-vivo clearance, HIC retention, even clinical progression — from cheap early assay panels, learning to forecast the slow assays from the fast ones [6]. And interpretable-ML viscosity reduction (research) has shown that a model can not only predict high viscosity but point at the residues responsible, turning prediction into a concrete mutation instruction [7]. None of these is a GMP-deployed system — they are discovery-stage tools — but discovery is exactly where they belong, and the consequences propagate the length of the factory.

What makes the training set trustworthy is semantic, not statistical. Every defense above assumes the labeled panel is itself clean — that a "viscosity" column means the same assay at the same concentration across every row, that the candidate it sits beside is the candidate it was actually measured on, and that two near-duplicate sequences are known to share a lineage. None of that is guaranteed by a CSV; it is guaranteed by the data being semantically grounded the way Books 2-4 build it. Three threads from that work land directly on this task:

Features pulled by meaning, not by column name. The honest grouped split above only works if "germline family" and "campaign" are recorded as typed, resolvable facts rather than free-text a careless export can rename. When a developability feature is keyed to its ontology IRI (Internationalized Resource Identifier — a globally unique web name for a concept) instead of a fragile spreadsheet header, a renamed or reordered column fails loudly rather than silently mispredicting — the same FAIR (Findable, Accessible, Interoperable, Reusable) and identifier discipline the data chapter makes load-bearing, grounded in Book 4's identifiers-and-units and classes-and-taxonomy.
The lineage edge is the grouping key. The leak that grouped/scaffold splits exist to kill is the same one the rest of the book kills with batch-grouped cross-validation: near-duplicates straddling the train/test cut. Here the grouping key is the affinity-maturation lineage — and that lineage is exactly the derivedFrom-style genealogy edge Book 4 models, the relations-and-genealogy spine that roots every variant in its parent. A provenance graph (PROV-O / bp:derivedFrom lineage) makes "keep whole lineages on one side" a query, not a guess — the leave-one-lineage-out analogue of leave-one-batch-out.
A validated training set, by the same release-gate shape. The release gate this book audits CQAs with — the SHACL (Shapes Constraint Language) shapes of Book 4's release gate — is a closed-world completeness check: is every required value present, singular, typed, and in range? Pointed at a training table instead of a release lot, the identical shape guarantees the model's inputs are complete and admissible before a single fit, catching the dropped row or out-of-range label that ordinary validation reports as skill. That a measurement is a distinct thing from the run that produced it — BFO's continuant/occurrent cut from classes-and-taxonomy — is what keeps a viscosity value from being conflated with the assay occurrence when the two are joined into one feature row.

This is why the chapter keeps the features interpretable and named: an auditable feature is also a semantically-addressable one, and the same property that lets a chemist defend it lets a knowledge graph join it.
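The closed-world completeness check has a plain-Python analogue that can run before any fit. This is not SHACL and not the suite's release gate; the field names, types, and ranges in SCHEMA are hypothetical and exist only to show the shape of the check:

```python
# Hypothetical schema: field names, types, and ranges are illustrative only.
SCHEMA = {
    "candidate_id":        {"type": str},
    "lineage_id":          {"type": str},
    "viscosity_cP":        {"type": float, "min": 0.0, "max": 200.0},
    "concentration_mg_mL": {"type": float, "min": 150.0, "max": 150.0},
}

def validate_row(row: dict) -> list[str]:
    """Closed-world check: every required field present, typed, and in range,
    and no field outside the schema."""
    problems = []
    for field, rule in SCHEMA.items():
        if field not in row or row[field] is None:
            problems.append(f"{field}: missing")
            continue
        value = row[field]
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and not rule["min"] <= value <= rule["max"]:
            problems.append(f"{field}: {value} out of range")
    extra = set(row) - set(SCHEMA)
    problems.extend(f"{field}: unexpected field" for field in sorted(extra))
    return problems

good = {"candidate_id": "c1", "lineage_id": "L1",
        "viscosity_cP": 12.5, "concentration_mg_mL": 150.0}
bad = {"candidate_id": "c2", "viscosity_cP": 9.0, "concentration_mg_mL": 100.0}
print(validate_row(good), validate_row(bad))
```

Pinning the concentration to a single admissible value is one way to enforce that a "viscosity" column means the same assay condition on every row.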

Evidence

Developability ML is overwhelmingly (research) maturity: peer-reviewed or preprint, demonstrated on retrospective antibody panels, not embedded in a GMP decision. The Jain et al. biophysical survey [2] and the TAP guidelines [1] are peer-reviewed-independent and are the field's most cited anchors. PROPERMAB [5] and the ensemble-viscosity work [4] are peer-reviewed but self-authored on proprietary training data, so their headline accuracies are not independently reproducible; ACeT [6] is a preprint (not yet peer-reviewed). Treat every "we predict X with Y accuracy" claim in this space as model-and-dataset specific until shown otherwise.

The biophysical guidelines that anchor the field

Before any of the learned models, the field standardized on two reference points that need no training set, and they are why a discovery team can run developability triage on day one.

The Jain et al. survey is the empirical bedrock: 137 clinical-stage monoclonal antibodies (48 of them approved) run through a dozen biophysical assays — self-interaction, cross-interaction, HIC, accelerated-stability aggregation, expression titer, and more [2]. Its value is not a model but a yardstick: it tells you what "normal" looks like by showing where molecules that actually made it to the clinic land on each axis, and it establishes the headline correlations the whole chapter rests on — that surface hydrophobicity tracks aggregation and HIC retention, that charge asymmetry tracks self-association and viscosity. Nearly every learned model cites it as ground truth.

The Therapeutic Antibody Profiler (TAP) turns those correlations into a transparent, guideline-based screen [1]. From an antibody structure model it computes five metrics — total CDR length, surface hydrophobicity (the patch-of-surface-hydrophobicity score, PSH), patches of positive charge (PPC), patches of negative charge (PNC), and the structural Fv charge-symmetry parameter — and for each it places the candidate against the distribution spanned by clinical-stage antibodies, flagging anything in the amber or red tail. There is no fitted weight vector and no proprietary training set: the "model" is the clinical-stage reference distribution itself, which is exactly why an auditor accepts it and why it remains research, peer-reviewed-independent. It is the conceptual parent of the toy screen this chapter ships — same shape (structure/sequence → interpretable metrics → flag against a reference), at far greater fidelity.
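The flag-against-a-reference idea can be sketched without any fitted weights. The rule below (red outside the observed reference range, amber inside the extreme 5 percent on either side) is an assumption made for this illustration, applied two-sided for simplicity; the reference values are made up and the function name is hypothetical:

```python
def flag_against_reference(value: float, reference: list[float],
                           tail: float = 0.05) -> str:
    """Return 'red' outside the reference range, 'amber' in either extreme
    tail of it, and 'green' otherwise. The 5% tail is an assumed default."""
    ref = sorted(reference)
    if value < ref[0] or value > ref[-1]:
        return "red"
    n_tail = max(1, int(len(ref) * tail))      # how many points form one tail
    low_cut, high_cut = ref[n_tail - 1], ref[-n_tail]
    if value <= low_cut or value >= high_cut:
        return "amber"
    return "green"

# Made-up reference scores standing in for a clinical-stage distribution.
reference_psh = [float(x) for x in range(100, 200)]   # 100 evenly spaced values
for candidate in (150.0, 197.0, 250.0):
    print(candidate, flag_against_reference(candidate, reference_psh))
```

The only "parameters" are the reference values themselves, which is the property that makes this style of screen easy to defend.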

The foundation-model wave reaches antibodies

The thing that makes discovery different from the rest of this book is that here, the data is — for once — almost big. Two model families drive the current wave.

Protein language models (PLMs). Trained self-supervised on hundreds of millions of natural protein sequences, a PLM (the ESM family is the canonical example, trained by masked-language modeling — hide a residue, predict it from its context — over 250 million sequences) learns a representation in which evolutionarily plausible, well-folding sequences cluster and oddities stand out [8]. Two uses follow. First, the model's pseudo-likelihood of a sequence — for a masked-language model, the summed log of each residue's predicted probability when that residue is masked and the rest are visible (a masked-marginal pseudo-likelihood, not the true joint likelihood an autoregressive infilling model like IgLM would compute) — is itself a soft developability signal: natural-looking sequences tend to express and behave better, so a low pseudo-likelihood is a yellow flag, and a single point mutation's effect can be read off as the change in that score. Second, and more powerfully, the learned embeddings — the model's internal per-residue activation vectors, typically mean-pooled across the sequence into one fixed-length vector — are excellent features: a small regressor (a ridge or gradient-boosting model) trained on a few hundred labeled antibodies, fed PLM embeddings instead of hand-built features, often beats the engineered-feature model on the grouped split. The foundation model has done the representation learning that the small labeled set could never afford. This is transfer learning, the single most important workaround for bioprocess small-data, applied where it works best — and it is exactly the asset PROPERMAB folds into its feature set [5].
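Both uses reduce to short computations once a model supplies the numbers. In this sketch the model is a stand-in: masked_probs is a hypothetical callable that returns a residue-to-probability mapping for one masked position, and uniform_model is a dummy that knows nothing about proteins. No real PLM API is being shown:

```python
import math
from typing import Callable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pseudo_log_likelihood(seq: str,
                          masked_probs: Callable[[str, int], dict[str, float]]) -> float:
    """Sum over positions of log p(residue_i | all other residues)."""
    return sum(math.log(masked_probs(seq, i)[seq[i]]) for i in range(len(seq)))

def mutation_effect(seq: str, pos: int, new_aa: str, masked_probs) -> float:
    """Change in pseudo-log-likelihood for one substitution (negative = less natural)."""
    mutant = seq[:pos] + new_aa + seq[pos + 1:]
    return pseudo_log_likelihood(mutant, masked_probs) - pseudo_log_likelihood(seq, masked_probs)

def mean_pool(per_residue: list[list[float]]) -> list[float]:
    """Average per-residue embedding vectors into one fixed-length vector."""
    n = len(per_residue)
    return [sum(column) / n for column in zip(*per_residue)]

# Stand-in for a real masked language model: uniform over 20 residues.
def uniform_model(seq: str, i: int) -> dict[str, float]:
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

print(pseudo_log_likelihood("ACDE", uniform_model))
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))   # [2.0, 3.0]
```

Note the cost: scoring one sequence this way takes one model call per residue, because each position is masked in turn.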

The reason a discovery team can actually reach for these is that the wave is overwhelmingly open: the ESM family, AlphaFold2, ProteinMPNN, RFdiffusion, and IgLM all ship open weights and runnable code, the same open-source posture this book's companion suite adopts — standard-library developability.py here, scikit-learn and PyTorch downstream, fixed seeds and a run_all that reproduces every printed number. The embeddings those PLMs produce are just another feature vector feeding the open-source analytics stack — the PCA/PLS/ML chemometrics that Book 3's analytics chapter runs and that the knowledge-graph layer addresses by identity rather than column position. The toy screen here deliberately stops at the interpretable-feature tier; embedding a real PLM is a dependency and a GPU away, not a different architecture, which is exactly why the flow SEQUENCE → FEATURES → RISK → CQA is faithful even though the toy stops short of the embedding step.

Structure prediction models. AlphaFold2 and its successors made high-quality structure prediction routine [9], and antibody-specific predictors followed. A predicted structure unlocks spatial developability metrics — the SAP and surface-charge patches above, which sequence alone cannot see. The catch for antibodies specifically is the CDR-H3 loop: it is the most diverse, most functionally important, and hardest-to-predict region (low predicted-confidence scores cluster exactly there), so structure-based scores are most uncertain precisely where they matter most. A SAP computed on a poorly predicted CDR-H3 is a number with an error bar the size of the number.

Generative design: where it is real, and where it is not

The frontier moves from screening to proposing. Rather than scoring binders that walked in from a display campaign, generative models write candidates. Three families, with their named tools and honest maturity:

Antibody language models (research, peer-reviewed-independent). IgLM is an autoregressive infilling model trained on 558 million immunoglobulin variable sequences; it samples novel CDRs conditioned on the surrounding scaffold and chain/species tags, and generated libraries score better on in-silico developability than naive randomization [10]. This is the most mature generative-antibody primitive: published, reproducible, and directly antibody-shaped.
Inverse folding (research, peer-reviewed-independent). ProteinMPNN takes a desired backbone and designs a sequence to fold into it, recovering the native sequence on benchmark backbones far more often than physics-based design (52.4 percent native-sequence recovery versus 32.9 percent for Rosetta) [11]. For antibodies it is used to resurface a fixed binding loop's framework toward better developability without touching the paratope.
Structure-generating diffusion (research, peer-reviewed-independent). RFdiffusion generates protein backbones, including de novo binders to a target, by a denoising-diffusion process, which a downstream inverse-folding model then sequences [12]. It has genuine wet-lab successes for de novo binders, but de novo antibodies specifically remain harder than de novo mini-binders, and the CDR-H3 problem bites here too.
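The native-sequence recovery figure quoted for inverse folding is a simple per-position identity rate. A minimal sketch, assuming the designed and native sequences are already aligned and of equal length; the function name and the two ten-residue strings are made up:

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Made-up ten-residue strings, for illustration only.
print(sequence_recovery("ARDYYGSSWF", "ARDYFGSTWF"))   # 0.8
```

Recovery measures agreement with one known answer, not whether the design folds or binds, so it is a benchmark statistic rather than a developability signal.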

The promise is a closed loop: propose candidates that are simultaneously high-affinity and high-developability, then verify the survivors experimentally. The honest status is research-to-pilot. There are real, published wet-lab successes for de novo binders and engineered antibodies, and several well-funded biotechs are built on these methods, but the published, independently reproduced "designed an antibody that went to the clinic on developability-by-design" story is still thin: most clinic-bound claims are vendor-self-reported or press-release-only. Worse, the generated sequences stress-test exactly the predictors that are supposed to grade them, the vicious circle that the chapter's unsolved-part section takes head-on. We treat generative design as the most exciting and least settled part of the chapter: promising, partly real, not yet routine, and not within a country mile of a GMP decision.

The discovery lens on the spine: a binding screen floods an in-silico developability funnel with candidate sequences; cheap interpretable features and protein-language-model embeddings triage them before any protein is made; each surviving property is an early bet on a specific downstream CQA of the running example's released batch; and a generative loop proposes new candidates to grade. Original diagram by the authors, created with AI assistance.

A runnable developability screen

The example suite's contribution for this chapter is examples/platform/ml/developability.py — an illustrative but fully runnable screen that takes a CDR sequence, computes the interpretable features above, and rolls them into a liability score with per-axis flags, every flag tied to a manufacturing CQA. It is deliberately simple and transparent: a production tool replaces the hand-set weights with coefficients learned on industrial assay data and adds PLM embeddings and structure features, but the flow — SEQUENCE → FEATURES → RISK → CQA — is exactly faithful. The sequences are plausible IgG CDR-H3 strings for the running example's mAb-A line; they are not real therapeutic sequences.

The core turns a sequence into chemistry. Net charge is a Henderson-Hasselbalch sum over ionizable residues; GRAVY is the mean Kyte-Doolittle hydropathy; liabilities are regex motif hits, each annotated with the CQA it threatens:

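# examples/platform/ml/developability.py
import re  # the liability patterns below are regular expressions

# Side-chain pKa values for the ionizable residues.
PKA_POS = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
# The terminal pKa constants are not shown in this excerpt. The values below
# are ASSUMED; they reproduce the -1.028 printed for the lead CDR-H3.
N_TERM_PKA = 9.0
C_TERM_PKA = 2.0

def net_charge(seq: str, pH: float = 7.4) -> float:
    """Henderson-Hasselbalch net charge of a peptide at a given pH."""
    # Free termini: protonated N-terminus (+) and deprotonated C-terminus (-).
    pos = 1.0 / (1.0 + 10 ** (pH - N_TERM_PKA))
    neg = 1.0 / (1.0 + 10 ** (C_TERM_PKA - pH))
    # Basic side chains contribute their protonated fraction.
    for aa, pk in PKA_POS.items():
        pos += seq.count(aa) * (1.0 / (1.0 + 10 ** (pH - pk)))
    # Acidic side chains contribute their deprotonated fraction.
    for aa, pk in PKA_NEG.items():
        neg += seq.count(aa) * (1.0 / (1.0 + 10 ** (pk - pH)))
    return round(pos - neg, 3)

# Each liability motif carries its chemical risk AND its CQA consequence.
# Value layout: (regex pattern, "mechanism -> CQA consequence").
LIABILITY_MOTIFS = {
    "deamidation_NG":   (r"N[GS]",    "Asn deamidation -> acidic charge variant"),  # matches NG and NS
    "isomerization_DG": (r"D[GTSH]",  "Asp isomerization -> acidic/charge variant"),
    "n_glycosylation":  (r"N[^P][ST]","N-linked glycosylation sequon"),
    "oxidation_MW":     (r"[MW]",     "Met/Trp oxidation -> charge/stability"),
    "free_cysteine":    (r"C",        "unpaired Cys -> covalent HMW aggregate"),  # counts every Cys
    "fragmentation_DP": (r"DP",       "Asp-Pro acid-labile -> LMW fragment"),
}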

The risk model is an interpretable additive score: each motif hit adds a hand-set penalty keyed to its severity — two per deamidation site (charge variants), one-and-a-half per isomerization site, one per oxidation-prone residue, three per glycosylation sequon, three per fragmentation site, and a flat four for an odd (unpaired) cysteine count via the modulo-two trick; net-hydrophobic CDRs add five times the GRAVY (aggregation/viscosity), and charge magnitudes beyond two add a charge-extreme term. The weights are deliberately legible — a production model would replace them with coefficients fit on industrial assay data, but the additivity is the point: every contribution to the final number is traceable to one named chemical mechanism, which is what makes the score auditable rather than merely accurate.
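The additive roll-up described above can be sketched in a few lines. This is an illustration under stated assumptions, not the module's actual code: the function and constant names are hypothetical, the hydrophobic term is applied only when GRAVY is positive, and because the text says only that a charge-extreme term exists, the sketch assumes it adds the excess of the charge magnitude over two.

```python
# Hand-set penalty per motif hit, as listed in the paragraph above.
PER_HIT_WEIGHT = {
    "deamidation_NG": 2.0,
    "isomerization_DG": 1.5,
    "oxidation_MW": 1.0,
    "n_glycosylation": 3.0,
    "fragmentation_DP": 3.0,
}
ODD_CYS_PENALTY = 4.0  # flat, applied once when the Cys count is odd
GRAVY_WEIGHT = 5.0     # applied only to net-hydrophobic sequences (GRAVY > 0)
CHARGE_LIMIT = 2.0     # |net charge| beyond this counts as extreme


def liability_contributions(hits: dict, gravy: float, charge: float) -> dict:
    """Return every non-zero term by name so the total stays auditable."""
    parts = {}
    for motif, weight in PER_HIT_WEIGHT.items():
        n = hits.get(motif, 0)
        if n:
            parts[motif] = weight * n
    # Modulo-two trick: an odd Cys count implies at least one unpaired Cys.
    if hits.get("free_cysteine", 0) % 2 == 1:
        parts["free_cysteine"] = ODD_CYS_PENALTY
    if gravy > 0:
        parts["hydrophobic_cdr"] = GRAVY_WEIGHT * gravy
    # ASSUMPTION: linear excess over the limit; the exact form is not given.
    excess = abs(charge) - CHARGE_LIMIT
    if excess > 0:
        parts["charge_extreme"] = excess
    return parts


def liability_score(hits: dict, gravy: float, charge: float) -> float:
    """Sum of the named contributions, rounded to two decimals."""
    return round(sum(liability_contributions(hits, gravy, charge).values(), 0.0), 2)


# Hypothetical candidate: one deamidation site, one oxidation-prone residue,
# hydrophilic on average, modest charge.
example_hits = {"deamidation_NG": 1, "oxidation_MW": 1}
print(liability_contributions(example_hits, gravy=-0.9, charge=-1.0))
# {'deamidation_NG': 2.0, 'oxidation_MW': 1.0}
print(liability_score(example_hits, gravy=-0.9, charge=-1.0))  # 3.0
```

Returning the contributions as a dictionary, and only then summing them, is the design choice that keeps each point of the score attributable to one named mechanism.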

The screen then scores three candidates: the mAb-A lead CDR-H3 (which carries an N-G deamidation site), a germlined variant v2 that mutates N→Q to remove the site, and a contrived hydrophobic reject. Running python developability.py prints exactly this:

Antibody developability screen (illustrative)
====================================================

mAb-A_CDRH3_lead  (11 aa)
  GRAVY=-0.955  net_charge(pH7.4)=-1.028  liability_score=3.0
  motifs: {'deamidation_NG': 1, 'oxidation_MW': 1}
  - PTM hotspot -> acidic charge-variant drift (CEX)

mAb-A_CDRH3_v2  (11 aa)
  GRAVY=-0.955  net_charge(pH7.4)=-1.028  liability_score=1.0
  motifs: {'oxidation_MW': 1}
  - no high-severity liability detected (illustrative)

sticky_reject  (12 aa)
  GRAVY=+0.875  net_charge(pH7.4)=+0.974  liability_score=8.38
  motifs: {'oxidation_MW': 4}
  - HYDROPHOBIC CDR -> HMW aggregate / viscosity risk

ASSERT ok: germlining the N-G site lowers the liability score (3.0 -> 1.0).

Read it as a discovery scientist would. The lead carries one deamidation hotspot, flagged for CEX charge-variant drift — the exact attribute that becomes CEX_acidic_pct on the release certificate. Germlining N→Q removes the motif and drops the score from 3.0 to 1.0, and the script asserts that improvement so the claim cannot silently rot: a sequence edit at the desk lowers a predicted manufacturing risk, before a single cell is transfected. The hydrophobic reject scores highest at 8.38, flagged for HMW aggregate / viscosity — the SEC_HMW_pct attribute and the downstream viscosity that would have made a subcutaneous dose unfilterable. The score is illustrative; the mapping from sequence to CQA is the real lesson, and it is the same mapping every production developability tool implements at far greater fidelity.

Anatomy of one developability prediction

The series signature is to unpack a single record. A developability prediction is never just a number — like every artifact in this series, its value is in what travels with the number. The record the screen emits binds the sequence, every interpretable feature, the per-axis flags, the model and its training provenance, and — crucially — the downstream CQA each flag predicts, so a discovery decision can be traced forward to the batch it shaped.

One developability prediction, fully unpacked: the sequence that fed it, the interpretable features and motif hits that explain it, the liability score and per-axis flags as the green core, the reconciliation rows that map each flag to a real downstream CQA on BATCH-2026-001, and the provenance that records its training data, its PLM embeddings, and its advisory-only scope. Original diagram by the authors, created with AI assistance.

Field by field, the card carries the chapter's argument.

name — the candidate identifier, mAb-A_CDRH3_lead. It is the thread that ties this prediction to the lineage that becomes WCB-CHO-001; without a stable name, a discovery bet cannot be traced to the batch that tested it — the same lineage root that Book 4's upper spine traces forward to a vial of drug product.
sequence and length — the CDR-H3 string ARDNGYWLFDY and its length, 11 residues. This is the contract of the prediction: the screen claims nothing about residues it was not shown, and the length bounds every per-residue feature. A reviewer can recompute every downstream field from this one input.
gravy = -0.955 — the interpretable hydrophobicity evidence. Negative means net-hydrophilic on the Kyte-Doolittle scale, so the sequence-average aggregation flag does not fire — but this is the metric's floor, not an all-clear. This CDR is aromatic-rich: four of eleven residues are W/Y/F — the very residues affinity selection favors and aggregation fears. And Kyte-Doolittle scores Trp and Tyr as hydrophilic, underweighting exactly the aromatic stickiness that a structure-derived SAP would localize instead of averaging away. A quiet GRAVY here means "nothing obvious at the sequence average," not "no aggregation risk" — which is why the chapter never lets sequence GRAVY be the last word. A chemist can audit the number by summing the Kyte-Doolittle indices and dividing by 11; whether that number is trustworthy is exactly the SAP-vs-GRAVY point the sequence-features section made earlier, now sitting live in the running example.
net_charge_74 = -1.028 — net charge at physiological pH 7.4, from the Henderson-Hasselbalch sum. Its magnitude is well under the charge-extreme threshold of two, so the charge-extreme term contributes nothing; its negative sign nonetheless foreshadows the acidic-leaning charge-variant profile the factory will measure.
motif_hits — the dictionary {'deamidation_NG': 1, 'oxidation_MW': 1}: the actual regex hits, with counts. This is the explainability core — not "risky," but which motif, how many, and therefore which mechanism. One deamidation site and one oxidation-prone residue, each named.
liability_score = 3.0 — the green core, the additive roll-up: two from the deamidation hit plus one from the oxidation hit. The scalar is the headline, but it is the least actionable field; the flags are what a team acts on.
flags — the human-readable per-axis verdicts, here PTM hotspot -> acidic charge-variant drift (CEX). Each flag is a sentence a discovery scientist can carry into a review and a regulator could later read.
the reconciliation panel — the rows that make this chapter part of the series. Each flag points at the measured CQA it predicted on the released batch: the PTM hotspot at CEX_acidic_pct = 21.551, the (here-quiet) hydrophobicity axis against SEC_HMW_pct = 1.287, and the (also-quiet) free-cysteine axis against covalent HMW. This is what turns a developability bet into a falsifiable prediction once the factory finally runs.
the relationships panel — trained_on an antibody panel (a yardstick like the Jain set, not this molecule), embeddings_from a protein language model, advances_to cell-line development, and the standing scope: advisory only. A discovery model triages and proposes; it never releases, sets a setpoint, or decides anything a regulator would call critical.
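The claim that a reviewer can recompute the card from the sequence alone can be checked directly for the interpretable fields. A minimal sketch, assuming the standard Kyte-Doolittle scale and non-overlapping regex counting via `re.findall`; the helper names are illustrative, and whether the module counts hits in exactly this way is an assumption:

```python
import re

# Kyte-Doolittle hydropathy indices (standard published scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Same patterns as the motif table in the listing earlier in this chapter.
MOTIF_PATTERNS = {
    "deamidation_NG": r"N[GS]",
    "isomerization_DG": r"D[GTSH]",
    "n_glycosylation": r"N[^P][ST]",
    "oxidation_MW": r"[MW]",
    "free_cysteine": r"C",
    "fragmentation_DP": r"DP",
}


def gravy(seq: str) -> float:
    """Mean Kyte-Doolittle hydropathy over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def motif_hits(seq: str) -> dict:
    """Non-overlapping regex hit counts, keeping only motifs that fire."""
    counts = {name: len(re.findall(pat, seq)) for name, pat in MOTIF_PATTERNS.items()}
    return {name: n for name, n in counts.items() if n}


def aromatic_count(seq: str) -> int:
    """Number of Trp, Tyr and Phe residues."""
    return sum(aa in "WYF" for aa in seq)


lead = "ARDNGYWLFDY"
print(len(lead))                                  # 11
print(round(gravy(lead), 3))                      # -0.955
print(motif_hits(lead))                           # {'deamidation_NG': 1, 'oxidation_MW': 1}
print(aromatic_count(lead))                       # 4
print(KYTE_DOOLITTLE["W"], KYTE_DOOLITTLE["Y"])   # -0.9 -1.3
```

The last line makes the caveat in the gravy field concrete: Trp and Tyr carry negative indices on this scale, so an aromatic-rich loop can still average out as hydrophilic.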

The unsolved part: the predictor and the designer disagree exactly where it matters

The honest open problem here is a vicious circle that the whole field knows and no one has closed, and it sharpens precisely as generative design gets better. Developability predictors are trained on the antibodies that exist (clinical-stage molecules, internal historical panels), which are a sample of sequence space that already survived selection for being well-behaved. Generative designers (IgLM, RFdiffusion-plus-ProteinMPNN, the rest) exist precisely to propose sequences that do not exist yet, often deliberately far from the natural distribution to escape an affinity-developability trade-off. So the generative model produces its most valuable candidates in exactly the region where the predictor that should grade them has the least training support and is least trustworthy. The predictor says "looks fine"; whether that means "is fine" or "is unlike anything I was trained on" is unknowable from the score alone, and the more creative the designer, the more confidently wrong the predictor's silence can be.

The circle is self-reinforcing in a way worth naming: a generative model that is trained to maximize a predictor's developability score will, given enough freedom, learn to exploit the predictor's blind spots — producing sequences that score beautifully and behave terribly, the protein-design version of an adversarial example. The predictor and the designer end up in an arms race the wet lab is too slow to referee.

Three partial defenses exist, none complete:

Uncertainty estimation. Have the predictor report not just a score but a calibrated confidence — an ensemble's spread, a Gaussian-process posterior variance, a conformal interval — so an out-of-distribution candidate flags itself as "I have not seen anything like this." But two cautions bite: calibration on small biophysical datasets is itself unreliable (the very data scarcity that motivates the predictor undermines the confidence estimate), and a conformal interval's coverage guarantee assumes the new candidate is exchangeable with the calibration set — which a generative designer violates by construction, so the naive interval is least trustworthy on exactly the novel sequences it is meant to police. The more defensible out-of-distribution signal is often the blunt one: the candidate's distance to the training manifold in embedding space (or a distribution-shift-aware conformal variant).
Active learning. Spend the few wet-lab assays you can afford on the candidates the model is most uncertain about, then retrain. This is the principled loop, and it is the same data-efficiency logic that drives Bayesian optimization in process development — buy the most informative label, not the most convenient one — but each cycle costs weeks of express-purify-assay, so you get a handful of cycles, not hundreds.
Physics-anchored features. Prefer SAP from a structure model over a learned hydrophobicity correlation; prefer a mechanistic charge-patch term over a black-box embedding. Mechanism generalizes off-distribution because it encodes why, not just what correlated in the training set — the same hybrid-modeling instinct that wins everywhere else in this book and that the suite's hybrid digital twin embodies downstream.
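The blunt out-of-distribution signal mentioned under uncertainty estimation, distance to the training set in embedding space, can be sketched with the standard library alone. Everything here is an illustrative assumption: the toy 2-D vectors stand in for real PLM embeddings, and the mean k-nearest-neighbor distance with a 95th-percentile leave-one-out threshold is one reasonable heuristic among several, not a calibrated guarantee.

```python
import math


def euclidean(a, b) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_distance(query, reference, k: int = 3) -> float:
    """Mean distance from the query to its k nearest reference embeddings."""
    dists = sorted(euclidean(query, r) for r in reference)
    return sum(dists[:k]) / k


def ood_threshold(train, k: int = 3, quantile: float = 0.95) -> float:
    """Leave-one-out kNN distance at the chosen quantile of the training set."""
    loo = sorted(
        knn_distance(x, train[:i] + train[i + 1:], k) for i, x in enumerate(train)
    )
    return loo[min(len(loo) - 1, int(quantile * len(loo)))]


def flag_ood(candidates: dict, train, k: int = 3, quantile: float = 0.95) -> dict:
    """True for candidates that sit farther from the training set than it sits from itself."""
    threshold = ood_threshold(train, k, quantile)
    return {name: knn_distance(vec, train, k) > threshold
            for name, vec in candidates.items()}


# Toy stand-ins for embeddings of the antibodies the predictor was trained on.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
         (0.5, 0.5), (0.2, 0.8), (0.8, 0.2)]
candidates = {
    "near_training": (0.4, 0.6),  # inside the training cloud
    "far_design": (5.0, 5.0),     # a deliberately novel design
}
print(flag_ood(candidates, train))
# {'near_training': False, 'far_design': True}
```

A flagged candidate is not predicted to be bad; the flag only says that the predictor's score for it carries little training support, which makes it a natural candidate for one of the few wet-lab assays.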

The unsolved core remains: the only thing that settles a developability prediction on a novel sequence is making the protein and measuring it, and that is the expensive step the prediction was supposed to avoid. Discovery ML buys you a better-ordered queue, not a verdict.

What this chapter adds to the model suite

This chapter contributes one module to examples/platform/ml/:

developability.py — an illustrative sequence-feature → liability-score screen. It computes GRAVY, Henderson-Hasselbalch net charge, and PTM/aggregation liability motifs from a CDR sequence; rolls them into an interpretable additive liability score with per-axis flags; ties every flag to a downstream manufacturing CQA (HMW aggregate, charge variants, fragmentation); and asserts that germlining a deamidation hotspot lowers the predicted risk (3.0 to 1.0), so the central claim is CI-checkable. It is the discovery-stage bookend to the suite's downstream models: where soft_sensor_pls.py predicts a CQA during a run, developability.py predicts the propensity for that CQA from the sequence, years earlier.

The module is intentionally dependency-light (standard library only) so it runs anywhere, and intentionally transparent so the mechanism is auditable — both choices reflect the regulatory posture that interpretability beats marginal accuracy whenever a model's output shapes a quality outcome.

Why it matters

Discovery is where manufacturing difficulty is created, almost entirely for free, by people who never see the plant. A developability model that catches an aggregation-prone or hotspot-laden candidate at the sequence stage saves the cost of a cell line, a process, and an analytical package built around a molecule that was always going to fail — and it does so when the only cost of changing course is editing a string. Get the sequence right and the entire downstream half of this book has an easier job; get it wrong and the most sophisticated soft sensor or hybrid digital twin is merely instrumenting a molecule that should never have advanced. The earliest learning in the spine is the cheapest place to change the molecule, and the most expensive place to be wrong.

The CMC and GMP angle

It is worth being precise about why the most ambitious ML in this book runs here, because the answer is regulatory, not technical. A developability model is a research tool: its output is a triage and a proposal, reviewed by a scientist, and it never touches a release decision, a setpoint, or a batch record. It therefore sits entirely outside the locked-model strictures of draft EU/PIC/S GMP (Good Manufacturing Practice — the legally binding rules for how a medicine is actually manufactured) Annex 22 and the FDA model-credibility framework that will bind the manufacturing chapters [12][13] — no formal validation protocol, no change-control on every retrain, no human-in-the-loop sign-off on each inference. That freedom is exactly what lets generative models and large PLMs run at the discovery bench while they are forbidden at the bioreactor.

But the bets do not vanish; they defer. The developability decision becomes part of the molecule's CMC (Chemistry, Manufacturing, and Controls — the part of the regulatory dossier that defines how the product is made and controlled) story the moment it advances: the sequence and its known liabilities feed the control strategy — a flagged-but-retained deamidation site, for example, becomes a justification for the CEX charge-variant specification window the QC release chapter eventually enforces, and the structure-function rationale for that window is a section a reviewer reads. The discovery model's output is never in a regulatory filing; its consequences are everywhere in one. The trade is stark and clean: the model is free because it decides nothing, and it is graded — years later, on a real lot — by the very CQAs the factory measures.

In the real world

In practice, antibody discovery teams already run developability triage as a routine in-silico gate, anchored by published biophysical guidelines. The Therapeutic Antibody Profiler (TAP) (research, peer-reviewed-independent) computes its five developability metrics from an antibody structure model and flags candidates outside the range spanned by clinical-stage antibodies — a transparent, guideline-based screen that needs no proprietary training set [1]. The Jain et al. biophysical survey of 137 clinical-stage antibodies across a dozen assays is the empirical bedrock that tells you what "normal" even looks like, and is cited by nearly every learned model as ground truth [2]. Learned successors — PROPERMAB (research, peer-reviewed-self-authored) for sequence/structure/PLM-embedding property prediction [5], ensemble deep learning on large-scale viscosity data (research, peer-reviewed-self-authored) [4], the ACeT context-embedding transformer (research, preprint) forecasting late endpoints from early panels [6], interpretable viscosity-reduction ML (research) [7] — push accuracy higher where the training data exists. Generative platforms (research-to-pilot) — IgLM, ProteinMPNN, RFdiffusion and the businesses built on them [8][9][10][11] — are real and propose de novo binders and engineered antibodies, with genuine wet-lab successes but few independently-reproduced developability-by-design clinical stories (most clinic claims are vendor-self-reported or press-release-only).

The governance reality is the calm part of the story, and it is the reason discovery is where the most ambitious ML in this book actually runs: as the previous section laid out, a developability model decides nothing a regulator would call critical, so it sits outside draft Annex 22 and the FDA model-credibility framework [12][13] — its bets are simply deferred, to be graded years later by the very CQAs the QC release chapter measures. The ISPE Pharma 4.0 reality holds even here: the pilots cluster, the production deployments are scarce, and the value is in triage and proposal, not in any autonomous decision.

Key terms

CQA (critical quality attribute) — a measurable product property (e.g. SEC monomer, charge variants) that must stay inside its specification for the medicine to be safe and effective; defined in Chapter 1.
CDR (complementarity-determining region) — the variable-region loops that form an antibody's binding surface; where discovery has freedom and most liabilities live.
SEC / CEX / HIC — analytical chromatography assays: SEC (size-exclusion) measures aggregate and fragment content, CEX (cation-exchange) measures the charge-variant distribution, HIC (hydrophobic-interaction) reports a retention time used as an aggregation surrogate.
CMC / GMP — CMC (Chemistry, Manufacturing, and Controls) is the dossier section defining how the product is made and controlled; GMP (Good Manufacturing Practice) is the binding rule-set for how a medicine is manufactured. A developability model sits outside both because it decides nothing; its bets feed the CMC control strategy later.
Developability — the bundle of properties (aggregation, viscosity, stability, PTM liabilities, immunogenicity) that determine whether a binder can actually be manufactured and dosed.
Liability motif — a short sequence pattern (e.g. N-G deamidation, D-P fragmentation, unpaired cysteine) where spontaneous chemistry creates heterogeneity, each mapping to a specific CQA.
GRAVY — grand average of hydropathy; the mean Kyte-Doolittle hydrophobicity of a sequence, a sticky-CDR / aggregation proxy.
SAP (spatial aggregation propensity) — a structure-derived hydrophobicity score over a moving surface window; it localizes the sticky patch that sequence-average GRAVY misses, and is generally a stronger predictor of measured aggregation than GRAVY.
PTM (post-translational modification) — chemical change to a residue (deamidation, isomerization, oxidation, glycosylation) that turns one molecule into a heterogeneous population.
Deimmunization — mutating predicted T-cell epitopes to reduce anti-drug-antibody risk while preserving binding.
Protein language model (PLM) — a model trained self-supervised on hundreds of millions of natural sequences (e.g. ESM); supplies sequence pseudo-likelihoods and transferable embeddings for small-data property prediction.
Transfer learning — using a large-corpus model's learned representation as features for a small labeled task; the key small-data workaround, at its most effective in discovery.
Generative design — sampling or designing novel sequences (antibody language models such as IgLM, inverse folding such as ProteinMPNN, diffusion such as RFdiffusion) rather than only screening existing ones.
Therapeutic Antibody Profiler (TAP) — a guideline-based, structure-derived developability screen flagging candidates outside the clinical-stage range on five metrics.
Out-of-distribution problem — the failure mode where a generative model proposes its most valuable candidates exactly where the developability predictor has no training support.
Advisory-only scope — a discovery model triages and proposes but never makes a regulated decision; its bets are graded later by measured CQAs.
Semantic grounding — keying a feature to its ontology IRI (a globally unique web name) rather than a spreadsheet column, so a renamed input fails loudly and a knowledge graph can join it; the FAIR/identifier discipline of Book 4.
Lineage-grouped split — using the derivedFrom provenance edge (PROV-O lineage) as the cross-validation grouping key, the leave-one-lineage-out analogue of leave-one-batch-out, so affinity-maturation near-duplicates never straddle the train/test cut.
SHACL-validated training set — applying the closed-world completeness/in-range shape of the release gate to a training table, guaranteeing the model's inputs are present, singular, typed, and in spec before a fit.

Where this leads

A sequence has now survived the developability gate; the bet is placed, but no protein exists yet. The next chapter, Cell-Line Development: Ranking Clones with Machine Learning, turns the chosen sequence into producing cells and confronts the first wet-lab small-data wall in earnest: out of thousands of clones, which few will be high-producing, stable, and quality-consistent — and how learning ranks them from sparse, noisy early-screen data long before any of them becomes WCB-CHO-001.
