Molecule Discovery: Generative Design and Developability Prediction
📍 Where we are: Part II · Discovery & Development, Learned — Chapter 5. The previous chapter, Target and Concept, used learning to decide where a molecule should start: which target, which modality, which rough format. Now the format is fixed — an IgG monoclonal antibody, mAb-A — and the question becomes which exact sequence to advance, and whether that sequence can survive the factory that is still twelve chapters away.
A molecule is born as a string of amino acids long before it is born as a batch. The choice of that sequence — made by an antibody discovery team staring at hundreds or thousands of candidate binders — quietly fixes most of what the rest of this book will struggle with. A candidate that binds beautifully but aggregates will fight you in the production bioreactor, clog the polishing column, and fail the SEC high-molecular-weight gate. A candidate with a deamidation hotspot will drift its charge-variant profile every time the process breathes. The discovery team almost never sees the factory — but its sequence decisions are the factory's first and most irreversible inputs. This chapter is about the machine learning that tries to see the factory from the sequence, and increasingly to design sequences that the factory will like.
This is also where the small-data ceiling from Chapter 1 briefly lifts. Sequences are cheap; assays are not, but there are far more sequences with some measured property than there are complete GMP batches. Discovery is the one corner of the spine where genuinely large models — protein language models trained on hundreds of millions of natural sequences — have a foothold. The catch, which we will return to, is that the property the factory cares about is exactly the one that is still small-data.
Think of hiring. A binding screen is the interview: it tells you the candidate can do the headline job (stick to the target). Developability prediction is the background check: can this candidate actually show up to work every day — not dissolve into a sticky mess, not pick up chemical damage, not provoke the immune system? You run the cheap background check (read the résumé, the sequence) on thousands of applicants before the expensive on-site (making protein, running assays), so you never waste a year developing someone who was always going to wash out. Generative design is the next step: instead of only screening applicants who walked in the door, you write the job description so precisely that you can propose ideal candidates and then verify them.
What this chapter covers
- The discovery-to-manufacturability handoff: why a sequence is a manufacturing decision
- Sequence features, and the chemistry behind the classic liability motifs
- Developability prediction as a supervised-learning task, and what makes its labels so hard
- The biophysical guidelines (Therapeutic Antibody Profiler, the Jain panel) that anchor the field
- Protein language models and structure models — the foundation-model wave reaching antibodies
- Generative antibody/protein design, and where it is real versus aspirational
- A runnable developability.py that turns a CDR sequence into a liability score tied to CQAs
- The anatomy of one developability prediction, and the unsolved label-scarcity problem
- The GMP angle: why a discovery model never decides anything, but shapes everything
The handoff nobody in discovery watches
The five-book series keeps insisting that manufacturing consequences are baked in upstream; nowhere is that more literal than here. A monoclonal antibody is roughly 1,300 amino acids across its heavy and light chains, but the variable regions — and within them the six complementarity-determining regions (CDRs) that form the binding surface — are where discovery has freedom and where most liabilities live. The constant regions are largely fixed by the chosen isotype (here IgG1); the CDRs are the dozen-to-few-dozen residues that selection pressure for binding shapes, with no pressure at all for manufacturability unless someone adds it.
That is the whole problem. A phage-display or hybridoma campaign optimizes affinity. Affinity-optimized CDRs are frequently rich in aromatic and hydrophobic residues (they make good binding contacts) — and those same residues drive aggregation and viscosity, the two properties that most often kill a high-concentration subcutaneous antibody [1][2]. The discovery team's reward function and the factory's reward function are not just different; on the most important axis they are opposed. Developability prediction exists to put the factory's reward function into the discovery loop early, computationally, before any of it is expensive.
Concretely, in our running example: the lead sequence behind WCB-CHO-001 and ultimately BATCH-2026-001 had to clear a developability screen before the cell line was ever built. Every CQA on the release certificate traces back to that screen. The SEC_HMW_pct of 1.287 and SEC_monomer_pct of 98.611 on BATCH-2026-001 (read straight from examples/datasets/hplc_results.csv) are, in part, a verdict on a sequence choice made years earlier: the aggregation propensity that the discovery screen tried to predict, now measured on a real released lot. The charge-variant spread (CEX_acidic_pct 21.551, CEX_basic_pct 10.452) is the released-lot echo of the PTM hotspots that screen flagged or cleared. Discovery does not measure these; it bets on them.
Sequence features: the interpretable backbone
Before any deep model, there is a layer of features any chemist would recognize, and any audit would accept. These are computed directly from the sequence, they are cheap, and they map onto specific manufacturing failure modes. Four families matter.
Hydrophobicity. Summarized by GRAVY (grand average of hydropathy), the mean Kyte-Doolittle hydropathy across the residues. A net-hydrophobic CDR is "sticky": it favors self-association, which shows up downstream as HMW aggregate (the SEC HMW_pct attribute) and as elevated viscosity at the high concentrations a subcutaneous dose demands. Spatial aggregation-propensity scores refine this from structure, but the sequence-level summary already separates the obvious offenders [1][2].
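As a minimal sketch of the GRAVY calculation described above, the snippet below averages the published Kyte-Doolittle hydropathy values over a sequence. The function name `gravy` and the two example strings are hypothetical; the strings are not the module's sequences.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Mean Kyte-Doolittle hydropathy; a positive value means net hydrophobic."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return round(sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq), 3)


# Hypothetical CDR-like strings, for illustration only.
print(gravy("ARDYNGWYFDL"))   # -> -0.955 (net hydrophilic)
print(gravy("AVILFWVLIAMW"))  # -> 2.625  (net hydrophobic, "sticky")
```

Because Asn and Gln carry the same hydropathy value (-3.5), an N-to-Q substitution leaves GRAVY unchanged, which is why a lead and its N→Q variant report identical GRAVY in the screen output later in the chapter.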
Net charge and isoelectric point. Estimated from the ionizable residues via Henderson-Hasselbalch. Charge at formulation pH governs colloidal stability (extremes of charge can drive both aggregation and viscosity), and the distribution of charge across isoforms is literally the CEX assay's readout — the CEX_acidic_pct / CEX_main_pct / CEX_basic_pct triplet on every batch.
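The isoelectric point follows from the same Henderson-Hasselbalch sum: it is the pH at which net charge crosses zero. The sketch below finds it by bisection. The side-chain pKa values are the ones used in the chapter's developability.py excerpt; the terminal pKa values (9.0 and 2.0), the function name `isoelectric_point`, and the example peptide are assumptions made for this illustration.

```python
# Side-chain pKa values as listed in the chapter's developability.py excerpt.
PKA_POS = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
# ASSUMPTION: terminal pKa values are not given in the text; typical textbook figures.
N_TERM_PKA, C_TERM_PKA = 9.0, 2.0


def net_charge(seq: str, pH: float) -> float:
    """Unrounded Henderson-Hasselbalch net charge at a given pH."""
    pos = 1.0 / (1.0 + 10 ** (pH - N_TERM_PKA))
    neg = 1.0 / (1.0 + 10 ** (C_TERM_PKA - pH))
    pos += sum(seq.count(aa) / (1.0 + 10 ** (pH - pk)) for aa, pk in PKA_POS.items())
    neg += sum(seq.count(aa) / (1.0 + 10 ** (pk - pH)) for aa, pk in PKA_NEG.items())
    return pos - neg


def isoelectric_point(seq: str, lo: float = 0.0, hi: float = 14.0, tol: float = 1e-4) -> float:
    """Bisection on pH. Net charge only falls as pH rises, so there is one zero crossing.

    Assumes the crossing lies between lo and hi, which holds for typical CDR-sized peptides.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still net positive: the pI lies at a higher pH
        else:
            hi = mid
    return round((lo + hi) / 2.0, 2)


peptide = "ARDYNGWYFDL"  # hypothetical string, not a real therapeutic sequence
print(round(net_charge(peptide, 7.4), 3), isoelectric_point(peptide))
```

A peptide whose estimated pI sits well below formulation pH carries a net negative charge in the vial, which is the colloidal-stability reading the paragraph above describes.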
Post-translational-modification (PTM) liability motifs. Short sequence patterns where chemistry happens spontaneously and turns one molecule into a heterogeneous population:
- N-G/N-S — asparagine deamidation: Asn becomes Asp/isoAsp, adding a negative charge → an acidic charge variant (drives CEX_acidic_pct up over shelf life).
- D-G/D-T — aspartate isomerization: another route to charge and potency drift.
- N-X-S/T (X not Pro) — an N-linked glycosylation sequon in a CDR: heterogeneous glycoforms, altered clearance, a quality nightmare in a binding region.
- M/W — methionine/tryptophan oxidation: charge and stability changes, often process- and storage-dependent.
- odd cysteine count — an unpaired (free) cysteine: a reactive thiol that forms intermolecular disulfides → covalent HMW aggregate that the polishing step cannot easily resolve.
- D-P — an acid-labile Asp-Pro bond → fragmentation to LMW species (the SEC LMW_pct attribute).
Immunogenicity risk. Predicted T-cell epitopes — peptides likely to be presented on MHC class II and provoke anti-drug antibodies. This is its own ML sub-field (peptide-MHC binding predictors), and the design response is deimmunization: mutating predicted epitopes while preserving binding [3].
The point of keeping these features explicit is not nostalgia. Under the regulatory posture this book keeps returning to — locked, explainable models for anything touching product quality — an interpretable feature whose mechanism a chemist can name is worth more than a marginally more accurate black box, because you can defend it. The deep models in the next section earn their place by being more accurate; they do not replace the obligation to explain.
Developability prediction as a learning task
Frame it as supervised learning and the difficulties become precise. The input is a sequence (or a structure model derived from it). The target is a measured developability property: a hydrophobic-interaction-chromatography retention time as an aggregation proxy, a measured viscosity at 150 mg/mL, an accelerated-stability aggregation rate, an assay-panel pass/fail. The model is a regressor or classifier — historically a random forest or gradient-boosting model on the engineered features above, increasingly a learned representation on top of a protein language model.
Three things make this task genuinely hard, and they are the same three the whole book circles:
- The labels are small-data and expensive. A measured viscosity at therapeutic concentration needs tens of milligrams of purified protein — you cannot run it on ten thousand candidates. The largest public biophysical panels number in the low hundreds of clinical-stage antibodies (the foundational Jain et al. survey assayed 137 clinical-stage mAbs across a dozen assays) [2]. Industrial datasets are larger but proprietary; the recent ensemble-deep-learning viscosity work that pushed performance forward did so precisely by assembling large-scale internal viscosity data, which is exactly the asset most teams lack [4].
- The properties are correlated and multi-objective. Reducing hydrophobicity to fix aggregation can wreck affinity; germlining a PTM hotspot can shift charge. There is no single scalar "developability"; there is a Pareto surface, and a real screen reports a vector of risks, which is why the field gravitates to interpretable per-axis flags rather than one opaque number.
- The distribution shifts. A model trained on one company's historical antibodies, or on clinical-stage molecules that already survived selection, is a biased sample of sequence space — and a generative model that proposes novel sequences pushes the predictor out of its training distribution by design. The predictor that scored the natural candidates may be unreliable on the engineered ones it most needs to grade.
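The multi-objective point in the list above can be made concrete with a small Pareto filter: a candidate stays in contention unless some other candidate is at least as good on every risk axis and strictly better on one. This is a sketch under stated assumptions; the axis names and all numbers are made up for illustration, and every axis is treated as lower-is-better.

```python
from typing import Dict, List


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on at least one."""
    return all(a[k] <= b[k] for k in a) and any(a[k] < b[k] for k in a)


def pareto_front(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Names of the candidates that no other candidate dominates."""
    return [
        name
        for name, risks in candidates.items()
        if not any(
            dominates(other, risks)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical risk vectors (lower is better on every axis).
candidates = {
    "cand_A": {"aggregation": 0.2, "ptm": 0.6, "affinity_loss": 0.1},
    "cand_B": {"aggregation": 0.5, "ptm": 0.1, "affinity_loss": 0.3},
    "cand_C": {"aggregation": 0.6, "ptm": 0.7, "affinity_loss": 0.4},  # worse than A everywhere
}
print(pareto_front(candidates))  # -> ['cand_A', 'cand_B']
```

Neither cand_A nor cand_B beats the other on every axis, so both survive; choosing between them is a judgment about which risk matters more, which is why a screen reports the vector and not a single number.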
The strongest current evidence is research-tier but real. PROPERMAB (research) is a published framework that predicts antibody developability properties in silico from sequence and structure-derived features, explicitly built to triage candidates before wet-lab assays [5]. The ACeT work (research) predicts developability from early assay panels, learning to forecast the expensive late assays from the cheap early ones [6]. And interpretable-ML viscosity reduction (research) has shown that a model can not only predict high viscosity but point at the residues responsible, turning prediction into a design instruction [7]. None of these is a GMP-deployed system — they are discovery-stage tools — but discovery is exactly where they belong, and the consequences propagate the length of the factory.
Developability ML is overwhelmingly (research) maturity: peer-reviewed or preprint, demonstrated on retrospective antibody panels, not embedded in a GMP decision. The Jain et al. biophysical survey [2] is peer-reviewed-independent and is the field's most cited anchor. PROPERMAB [5] and the ensemble-viscosity work [4] are peer-reviewed but use proprietary training data, so their headline accuracies are not independently reproducible. Treat every "we predict X with Y accuracy" claim in this space as model-and-dataset specific until shown otherwise.
The foundation-model wave reaches antibodies
The thing that makes discovery different from the rest of this book is that here, the data is — for once — almost big. Two model families drive the current wave.
Protein language models (PLMs). Trained self-supervised on hundreds of millions of natural protein sequences, a PLM (the ESM family is the canonical example) learns a representation in which evolutionarily plausible, well-folding sequences cluster and oddities stand out [8]. Two uses follow. First, the model's pseudo-likelihood of a sequence is itself a soft developability signal — natural-looking sequences tend to express and behave better, so a low likelihood is a yellow flag. Second, and more powerfully, the learned embeddings are excellent features: a small regressor trained on a few hundred labeled antibodies, fed PLM embeddings instead of hand-built features, often beats the engineered-feature model — the foundation model has done the representation learning that the small labeled set could never afford. This is transfer learning, the single most important workaround for bioprocess small-data, applied where it works best.
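The transfer-learning pattern has two parts: a frozen embedding function and a small supervised head trained on the few labeled antibodies available. The sketch below shows only the pattern. Everything in it is an assumption made for illustration: `toy_embed` is an amino-acid composition vector standing in for a real PLM embedding, a k-nearest-neighbour head is swapped in for the regressor to keep the example dependency-free, and the sequences and retention-time labels are invented.

```python
import math
from typing import Callable, List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(seq: str) -> List[float]:
    """STAND-IN for a PLM embedding: the amino-acid composition vector.

    A real pipeline would return a pretrained model's pooled representation here.
    """
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(
    query: str,
    labeled: Sequence[Tuple[str, float]],
    embed: Callable[[str], List[float]] = toy_embed,
    k: int = 2,
) -> float:
    """Similarity-weighted mean label of the k labeled sequences closest to the query."""
    q = embed(query)
    nearest = sorted(((cosine(q, embed(s)), y) for s, y in labeled), reverse=True)[:k]
    weight = sum(sim for sim, _ in nearest)
    if weight <= 0:  # no similarity signal: fall back to a plain mean
        return sum(y for _, y in nearest) / len(nearest)
    return sum(sim * y for sim, y in nearest) / weight


# Invented (sequence, retention-time) pairs standing in for a small labeled panel.
labeled = [
    ("AVILFWVLIAMW", 12.0),
    ("LLVVAIFFWMAV", 11.0),
    ("ARDYNGSYSDK", 6.0),
    ("GSDKTEYNQRS", 5.5),
]
print(knn_predict("AVLLFWIIVAMW", labeled), knn_predict("ARDSNGSYTDK", labeled))
```

Swapping `toy_embed` for a real embedding function changes nothing else in the flow, which is the practical appeal: the expensive representation is reused, and only the small head sees the scarce labels.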
Structure prediction models. AlphaFold2 and its successors made high-quality structure prediction routine [9], and antibody-specific predictors followed. A predicted structure unlocks spatial developability metrics — patches of surface hydrophobicity or charge that sequence alone cannot see, the spatial aggregation propensity that best correlates with real aggregation. The catch for antibodies specifically is the CDR-H3 loop: it is the most diverse, most functionally important, and hardest-to-predict region, so structure-based scores are most uncertain exactly where they matter most.
Generative design. The frontier moves from screening to proposing. Generative antibody models (IgLM and related autoregressive sequence models) sample novel CDRs conditioned on a scaffold; protein-design diffusion models (RFdiffusion and relatives) and inverse-folding models (ProteinMPNN) design sequences to fit a desired backbone or binding mode [10][11]. The promise is a closed loop: propose candidates that are simultaneously high-affinity and high-developability, then verify the survivors experimentally. The honest status is research-to-pilot: there are real wet-lab successes for de novo binders and engineered antibodies, several biotechs are built on this, but the published, independently-reproduced "designed an antibody that went to the clinic on developability-by-design" story is still thin, and the generated sequences stress-test exactly the predictors that are supposed to grade them. We treat generative design as the most exciting and least settled part of the chapter — promising, partly real, not yet routine.
The discovery lens on the spine: a binding screen floods an in-silico developability funnel with candidate sequences; cheap interpretable features and protein-language-model embeddings triage them before any protein is made; each surviving property is an early bet on a specific downstream CQA of the running example's released batch; and a generative loop proposes new candidates to grade.
Original diagram by the authors, created with AI assistance.
A runnable developability screen
The example suite's contribution for this chapter is examples/platform/ml/developability.py — an illustrative but fully runnable screen that takes a CDR sequence, computes the interpretable features above, and rolls them into a liability score with per-axis flags, every flag tied to a manufacturing CQA. It is deliberately simple and transparent: a production tool replaces the hand-set weights with coefficients learned on industrial assay data and adds PLM embeddings and structure features, but the flow — SEQUENCE → FEATURES → RISK → CQA — is exactly faithful. The sequences are plausible IgG CDR-H3 strings for the running example's mAb-A line; they are not real therapeutic sequences.
The core turns a sequence into chemistry. Net charge is a Henderson-Hasselbalch sum over ionizable residues; GRAVY is the mean Kyte-Doolittle hydropathy; liabilities are regex motif hits, each annotated with the CQA it threatens:
# examples/platform/ml/developability.py

# Side-chain pKa values for the ionizable residues.
PKA_POS = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
# ASSUMPTION: the terminal pKa values are not shown in this excerpt; 9.0 and 2.0
# are typical textbook figures, supplied here so the function runs on its own.
N_TERM_PKA = 9.0
C_TERM_PKA = 2.0


def net_charge(seq: str, pH: float = 7.4) -> float:
    """Henderson-Hasselbalch net charge of a peptide at a given pH."""
    # Start with the free N-terminus (basic) and the free C-terminus (acidic).
    pos = 1.0 / (1.0 + 10 ** (pH - N_TERM_PKA))
    neg = 1.0 / (1.0 + 10 ** (C_TERM_PKA - pH))
    # Add the protonated fraction of each basic side chain.
    for aa, pk in PKA_POS.items():
        pos += seq.count(aa) * (1.0 / (1.0 + 10 ** (pH - pk)))
    # Add the deprotonated fraction of each acidic side chain.
    for aa, pk in PKA_NEG.items():
        neg += seq.count(aa) * (1.0 / (1.0 + 10 ** (pk - pH)))
    return round(pos - neg, 3)


# Each liability motif carries its chemical risk AND its CQA consequence.
LIABILITY_MOTIFS = {
    "deamidation_NG":   (r"N[GS]",     "Asn deamidation -> acidic charge variant"),
    "isomerization_DG": (r"D[GTSH]",   "Asp isomerization -> acidic/charge variant"),
    "n_glycosylation":  (r"N[^P][ST]", "N-linked glycosylation sequon"),
    "oxidation_MW":     (r"[MW]",      "Met/Trp oxidation -> charge/stability"),
    # This pattern matches every Cys; the odd-count rule from the text is not encoded here.
    "free_cysteine":    (r"C",         "unpaired Cys -> covalent HMW aggregate"),
    "fragmentation_DP": (r"DP",        "Asp-Pro acid-labile -> LMW fragment"),
}
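One way the motif table could be applied to a sequence is sketched below. The patterns are copied from the excerpt; the function name `scan_liabilities`, the lookahead-based counting, the odd-count treatment of cysteine, and the example strings are assumptions for illustration, not the module's confirmed implementation.

```python
import re
from typing import Dict

# Patterns copied from the LIABILITY_MOTIFS table in the excerpt (descriptions omitted).
MOTIF_PATTERNS = {
    "deamidation_NG": r"N[GS]",
    "isomerization_DG": r"D[GTSH]",
    "n_glycosylation": r"N[^P][ST]",
    "oxidation_MW": r"[MW]",
    "fragmentation_DP": r"DP",
}


def scan_liabilities(seq: str) -> Dict[str, int]:
    """Count motif hits in a sequence; motifs with zero hits are omitted."""
    hits: Dict[str, int] = {}
    for name, pattern in MOTIF_PATTERNS.items():
        # Wrapping the pattern in a lookahead counts overlapping hits separately,
        # e.g. "NNSS" contains two glycosylation sequons (NNS and NSS).
        count = len(re.findall(f"(?={pattern})", seq))
        if count:
            hits[name] = count
    # ASSUMPTION: cysteine is flagged only when the count is odd, following the
    # "odd cysteine count" rule in the text rather than a per-residue regex hit.
    if seq.count("C") % 2 == 1:
        hits["free_cysteine"] = 1
    return hits


# Hypothetical strings: a lead-like CDR and the same string with N replaced by Q.
print(scan_liabilities("ARDYNGWYFDL"))  # -> {'deamidation_NG': 1, 'oxidation_MW': 1}
print(scan_liabilities("ARDYQGWYFDL"))  # -> {'oxidation_MW': 1}
```

A plain `re.findall(pattern, seq)` would miss overlapping sequons because matches consume characters; the lookahead form is a common workaround when hit counts matter.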
The screen then scores three candidates: the mAb-A lead CDR-H3 (which carries an N-G deamidation site), a germlined variant v2 that mutates N→Q to remove the site, and a contrived hydrophobic reject. Running python developability.py prints exactly this:
Antibody developability screen (illustrative)
====================================================
mAb-A_CDRH3_lead (11 aa)
GRAVY=-0.955 net_charge(pH7.4)=-1.028 liability_score=3.0
motifs: {'deamidation_NG': 1, 'oxidation_MW': 1}
- PTM hotspot -> acidic charge-variant drift (CEX)
mAb-A_CDRH3_v2 (11 aa)
GRAVY=-0.955 net_charge(pH7.4)=-1.028 liability_score=1.0
motifs: {'oxidation_MW': 1}
- no high-severity liability detected (illustrative)
sticky_reject (12 aa)
GRAVY=+0.875 net_charge(pH7.4)=+0.974 liability_score=8.38
motifs: {'oxidation_MW': 4}
- HYDROPHOBIC CDR -> HMW aggregate / viscosity risk
ASSERT ok: germlining the N-G site lowers the liability score (3.0 -> 1.0).
Read it as a discovery scientist would. The lead carries one deamidation hotspot, flagged for CEX charge-variant drift, the exact attribute that becomes CEX_acidic_pct on the release certificate. Germlining N→Q removes the motif and cuts the score from 3.0 to 1.0 (the oxidation hit that both variants share remains), and the script asserts that improvement so the claim cannot silently rot: a sequence edit at the desk lowers a predicted manufacturing risk, before a single cell is transfected. The hydrophobic reject scores highest, flagged for HMW aggregate / viscosity: the SEC_HMW_pct attribute and the downstream viscosity that would have made a subcutaneous dose unfilterable. The score is illustrative; the mapping from sequence to CQA is the real lesson, and it is the same mapping every production developability tool implements at far greater fidelity.
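The additive roll-up can be sketched as follows. The weights here are assumptions, inferred so that the features printed in the screen output reproduce the printed scores (3.0, 1.0, 8.38); the module's real hand-set weights are not shown in the text, and the flag conditions are likewise a guess at the simplest rule consistent with the output.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: weights inferred from the printed output, not taken from the module.
MOTIF_WEIGHTS = {"deamidation_NG": 2.0, "oxidation_MW": 1.0}
DEFAULT_MOTIF_WEIGHT = 2.0  # hypothetical weight for motifs the output never exercises
HYDROPHOBIC_WEIGHT = 5.0    # hypothetical penalty per unit of positive GRAVY


def liability_score(gravy: float, motif_hits: Dict[str, int]) -> Tuple[float, List[str]]:
    """Additive score: weighted motif counts plus a penalty for net-hydrophobic sequences."""
    score = sum(MOTIF_WEIGHTS.get(m, DEFAULT_MOTIF_WEIGHT) * n for m, n in motif_hits.items())
    score += HYDROPHOBIC_WEIGHT * max(0.0, gravy)
    flags: List[str] = []
    if motif_hits.get("deamidation_NG", 0) > 0:
        flags.append("PTM hotspot -> acidic charge-variant drift (CEX)")
    if gravy > 0:
        flags.append("HYDROPHOBIC CDR -> HMW aggregate / viscosity risk")
    if not flags:
        flags.append("no high-severity liability detected (illustrative)")
    return round(score, 2), flags


# Features copied from the printed screen output.
lead_score, lead_flags = liability_score(-0.955, {"deamidation_NG": 1, "oxidation_MW": 1})
v2_score, v2_flags = liability_score(-0.955, {"oxidation_MW": 1})
reject_score, reject_flags = liability_score(0.875, {"oxidation_MW": 4})

# CI-style guard: removing the N-G site must lower the score.
assert v2_score < lead_score, "germlining the N-G site should lower the liability score"
print(lead_score, v2_score, reject_score)
```

Keeping the score additive means every point can be traced to one named term, so a reviewer can see which motif or property moved the number.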
Anatomy of one developability prediction
The series signature is to unpack a single record. A developability prediction is never just a number — like every artifact in this series, its value is in what travels with the number. The record the screen emits binds the sequence, every interpretable feature, the per-axis flags, the model and its training provenance, and — crucially — the downstream CQA each flag predicts, so a discovery decision can be traced forward to the batch it shaped.
One developability prediction, fully unpacked: the sequence that fed it, the interpretable features and motif hits that explain it, the liability score and per-axis flags as the green core, the reconciliation rows that map each flag to a real downstream CQA on BATCH-2026-001, and the provenance that records its training data, its PLM embeddings, and its advisory-only scope.
Original diagram by the authors, created with AI assistance.
Field by field, the card carries the chapter's argument. The input is the CDR-H3 sequence and its length — the contract of the prediction. The features block (GRAVY, net_charge_74, the motif_hits dictionary) is the interpretable evidence a chemist can audit. The green core is the liability_score with its human-readable flags — and the flags, not the scalar, are what a real team acts on. The reconciliation panel is the row that makes this chapter part of the series: each flag points at the measured CQA it predicted on the released batch — the PTM hotspot at CEX_acidic_pct = 21.551, the hydrophobicity at SEC_HMW_pct = 1.287 — so a developability bet becomes a falsifiable prediction once the factory finally runs. The relationships panel records trained_on an antibody panel, embeddings_from a protein language model, advances_to cell-line development, and the standing scope: advisory only. A discovery model triages and proposes; it never releases, sets a setpoint, or decides anything a regulator would call critical.
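The card described above could be carried as a small immutable record. This is a sketch, not the module's schema: `net_charge_74`, `motif_hits`, `liability_score`, `trained_on`, `embeddings_from`, and `advances_to` are names from the text, while the class name, the `reconciliation` layout, and the `to_json` helper are hypothetical. The sequence is a hypothetical 11-residue string chosen so that its GRAVY and motif hits agree with the printed lead values.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DevelopabilityPrediction:
    # Input contract
    sequence: str
    # Interpretable features
    gravy: float
    net_charge_74: float
    motif_hits: Dict[str, int]
    # Core prediction
    liability_score: float
    flags: List[str]
    # Reconciliation: each flag mapped to the measured CQA it predicted
    reconciliation: Dict[str, Dict[str, object]]
    # Provenance and scope
    trained_on: str
    embeddings_from: str
    advances_to: str
    scope: str = "advisory_only"

    def to_json(self) -> str:
        payload = asdict(self)
        payload["length"] = len(self.sequence)
        return json.dumps(payload, indent=2, sort_keys=True)


record = DevelopabilityPrediction(
    sequence="ARDYNGWYFDL",  # hypothetical, not a real therapeutic sequence
    gravy=-0.955,
    net_charge_74=-1.028,
    motif_hits={"deamidation_NG": 1, "oxidation_MW": 1},
    liability_score=3.0,
    flags=["PTM hotspot -> acidic charge-variant drift (CEX)"],
    reconciliation={
        "PTM hotspot": {"cqa": "CEX_acidic_pct", "batch": "BATCH-2026-001", "measured": 21.551},
    },
    trained_on="antibody panel",
    embeddings_from="protein language model",
    advances_to="cell-line development",
)
print(record.to_json())
```

Making the dataclass frozen is a deliberate choice in this sketch: a prediction that may later be compared against a released batch should not be editable after the fact.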
The unsolved part: the predictor and the designer disagree exactly where it matters
The honest open problem here is a vicious circle the whole field knows and none has closed. Developability predictors are trained on the antibodies that exist — clinical-stage molecules, internal historical panels — which are a sample of sequence space that already survived selection for being well-behaved. Generative designers exist precisely to propose sequences that do not exist yet, often deliberately far from the natural distribution to escape an affinity-developability trade-off. So the generative model produces its most valuable candidates in exactly the region where the predictor that should grade them has the least training support, and is least trustworthy. The predictor says "looks fine"; whether that means "is fine" or "is unlike anything I was trained on" is unknowable from the score alone.
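A crude way to make "unlike anything I was trained on" visible is to report, next to every score, how close the candidate is to its nearest training sequence. The sketch below uses `difflib.SequenceMatcher` from the standard library as a rough stand-in for a proper alignment-based identity; the function names, the 0.6 threshold, and the sequences are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, Tuple


def similarity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for alignment-based sequence identity."""
    return SequenceMatcher(None, a, b).ratio()


def novelty_check(
    candidate: str, training_seqs: Iterable[str], min_similarity: float = 0.6
) -> Tuple[float, bool]:
    """Return (best similarity to any training sequence, out_of_support flag)."""
    best = max(similarity(candidate, t) for t in training_seqs)
    return best, best < min_similarity


# Hypothetical training panel and candidates.
training = ["ARDYNGWYFDL", "ARGSYYDSSGYFDY"]
print(novelty_check("ARDYQGWYFDL", training))   # one substitution away from a training sequence
print(novelty_check("WWWWWWWWWWWW", training))  # far from everything in the panel
```

A flag like this does not say the prediction is wrong; it says the score is being extrapolated, which is the distinction the raw score cannot make.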
Three partial defenses exist, none complete. Uncertainty estimation — having the predictor report not just a score but a calibrated confidence, so out-of-distribution candidates flag themselves — helps, but calibration on small biophysical datasets is itself unreliable. Active learning — making the few wet-lab assays you can afford on the candidates the model is most uncertain about, then retraining — is the principled loop, and it is the same data-efficiency logic that drives Bayesian optimization in process development, but each cycle costs weeks. And physics-anchored features — spatial aggregation propensity from a structure model rather than a learned correlation — generalize better off-distribution because they encode mechanism, the same hybrid-modeling instinct that wins everywhere else in this book. The unsolved core remains: the only thing that settles a developability prediction on a novel sequence is making the protein and measuring it, and that is the expensive step the prediction was supposed to avoid. Discovery ML buys you a better-ordered queue, not a verdict.
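The active-learning step can be sketched as an acquisition rule: given predictions from several models for each candidate, spend the assay budget where the models disagree most. Ensemble disagreement is one common proxy for uncertainty, used here as an assumption; the function name, candidate names, and predicted values are invented for illustration.

```python
import statistics
from typing import Dict, List


def pick_for_assay(ensemble_preds: Dict[str, List[float]], budget: int) -> List[str]:
    """Rank candidates by ensemble disagreement (population std dev) and keep the top `budget`."""
    spread = {name: statistics.pstdev(preds) for name, preds in ensemble_preds.items()}
    return sorted(spread, key=spread.get, reverse=True)[:budget]


# Hypothetical predictions of one property from a four-model ensemble.
ensemble_preds = {
    "natural_1": [10.1, 10.3, 9.9, 10.2],   # models agree: little to learn from an assay
    "designed_7": [8.0, 14.5, 11.0, 6.5],   # models disagree strongly
    "designed_9": [12.0, 12.4, 9.0, 13.1],  # moderate disagreement
}
print(pick_for_assay(ensemble_preds, budget=2))  # -> ['designed_7', 'designed_9']
```

After the chosen candidates are assayed, their labels join the training set and the ensemble is refit, which is the loop whose cycle time the paragraph above puts at weeks.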
What this chapter adds to the model suite
This chapter contributes one module to examples/platform/ml/:
- developability.py — an illustrative sequence-feature → liability-score screen. It computes GRAVY, Henderson-Hasselbalch net charge, and PTM/aggregation liability motifs from a CDR sequence; rolls them into an interpretable additive liability score with per-axis flags; ties every flag to a downstream manufacturing CQA (HMW aggregate, charge variants, fragmentation); and asserts that germlining a deamidation hotspot lowers the predicted risk, so the central claim is CI-checkable. It is the discovery-stage bookend to the suite's downstream models: where soft_sensor_pls.py predicts a CQA during a run, developability.py predicts the propensity for that CQA from the sequence, years earlier.
The module is intentionally dependency-light (standard library only) so it runs anywhere, and intentionally transparent so the mechanism is auditable — both choices reflect the regulatory posture that interpretability beats marginal accuracy whenever a model's output shapes a quality outcome.
Why it matters
Discovery is where manufacturing difficulty is created, almost entirely for free, by people who never see the plant. A developability model that catches an aggregation-prone or hotspot-laden candidate at the sequence stage saves the cost of a cell line, a process, and an analytical package built around a molecule that was always going to fail — and it does so when the only cost of changing course is editing a string. Get the sequence right and the entire downstream half of this book has an easier job; get it wrong and the most sophisticated soft sensor or hybrid digital twin is merely instrumenting a molecule that should never have advanced. The earliest learning in the spine is the cheapest place to change the molecule, and the most expensive place to be wrong.
In the real world
In practice, antibody discovery teams already run developability triage as a routine in-silico gate, anchored by published biophysical guidelines. The Therapeutic Antibody Profiler (TAP) (research, peer-reviewed-independent) computes five developability metrics from an antibody structure model and flags candidates outside the range spanned by clinical-stage antibodies — a transparent, guideline-based screen that needs no proprietary training set [1]. The Jain et al. biophysical survey of 137 clinical-stage antibodies across a dozen assays is the empirical bedrock that tells you what "normal" even looks like, and is cited by nearly every learned model as ground truth [2]. Learned successors — PROPERMAB (research) for sequence/structure-based property prediction [5], ensemble deep learning on large-scale viscosity data (research) [4], interpretable viscosity-reduction ML (research) [7] — push accuracy higher where the training data exists. Generative platforms (research-to-pilot) are real businesses proposing de novo binders and engineered antibodies, with genuine wet-lab successes but few independently-reproduced developability-by-design clinical stories.
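A guideline-based screen of the kind described above reduces to a range check against a reference panel. The sketch below shows only that idea and is not TAP's actual procedure: the metric names, the reference values, and the simple inside/outside rule are all assumptions made for illustration.

```python
from typing import Dict, List


def range_flags(
    candidate: Dict[str, float], reference_panel: List[Dict[str, float]]
) -> Dict[str, str]:
    """Flag each metric as inside or outside the range spanned by the reference panel."""
    flags: Dict[str, str] = {}
    for metric, value in candidate.items():
        observed = [antibody[metric] for antibody in reference_panel]
        lo, hi = min(observed), max(observed)
        flags[metric] = "within_range" if lo <= value <= hi else "outside_range"
    return flags


# Made-up metric values standing in for a panel of clinical-stage antibodies.
reference_panel = [
    {"cdr_length": 44, "hydrophobic_patch": 110.0},
    {"cdr_length": 48, "hydrophobic_patch": 135.0},
    {"cdr_length": 52, "hydrophobic_patch": 150.0},
]
candidate = {"cdr_length": 50, "hydrophobic_patch": 190.0}
print(range_flags(candidate, reference_panel))
# -> {'cdr_length': 'within_range', 'hydrophobic_patch': 'outside_range'}
```

No model is fitted here, which is the appeal of this style of screen: the only "training data" is the observed range, and anyone can check a flag by hand.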
The governance reality is the calm part of the story. None of this is GMP-regulated: a developability model is a research tool whose output is reviewed by a scientist and never touches a release decision, so it sits entirely outside the locked-model strictures of draft EU GMP Annex 22 and the FDA model-credibility framework that will bind the manufacturing chapters [12][13]. That freedom is exactly why discovery is where the most ambitious ML in this book actually runs — and the trade is that its bets are only graded years later, by the very CQAs the QC release chapter measures on the released lot. The ISPE Pharma 4.0 reality holds even here: the pilots cluster, the production deployments are scarce, and the value is in triage and proposal, not in any autonomous decision.
Key terms
- CDR (complementarity-determining region) — the variable-region loops that form an antibody's binding surface; where discovery has freedom and most liabilities live.
- Developability — the bundle of properties (aggregation, viscosity, stability, PTM liabilities, immunogenicity) that determine whether a binder can actually be manufactured and dosed.
- Liability motif — a short sequence pattern (e.g. N-G deamidation, D-P fragmentation, unpaired cysteine) where spontaneous chemistry creates heterogeneity, each mapping to a specific CQA.
- GRAVY — grand average of hydropathy; the mean Kyte-Doolittle hydrophobicity of a sequence, a sticky-CDR / aggregation proxy.
- PTM (post-translational modification) — chemical change to a residue (deamidation, isomerization, oxidation, glycosylation) that turns one molecule into a heterogeneous population.
- Deimmunization — mutating predicted T-cell epitopes to reduce anti-drug-antibody risk while preserving binding.
- Protein language model (PLM) — a model trained self-supervised on hundreds of millions of natural sequences (e.g. ESM); supplies sequence likelihoods and transferable embeddings for small-data property prediction.
- Transfer learning — using a large-corpus model's learned representation as features for a small labeled task; the key small-data workaround, at its most effective in discovery.
- Generative design — sampling or designing novel sequences (antibody language models, inverse folding, diffusion) rather than only screening existing ones.
- Therapeutic Antibody Profiler (TAP) — a guideline-based, structure-derived developability screen flagging candidates outside the clinical-stage range.
- Out-of-distribution problem — the failure mode where a generative model proposes its most valuable candidates exactly where the developability predictor has no training support.
- Advisory-only scope — a discovery model triages and proposes but never makes a regulated decision; its bets are graded later by measured CQAs.
Where this leads
A sequence has now survived the developability gate; the bet is placed, but no protein exists yet. The next chapter, Cell-Line Development: Ranking Clones with Machine Learning, turns the chosen sequence into producing cells and confronts the first wet-lab small-data wall in earnest: out of thousands of clones, which few will be high-producing, stable, and quality-consistent — and how learning ranks them from sparse, noisy early-screen data long before any of them becomes WCB-CHO-001.