Modeling the Molecule: Sequence, Modality, and Developability

📍 Where we are: Part II · Discovery and Development, modeled — Chapter 5. The target is named. Now the program turns the concept into real candidate molecules, and we model what a "molecule" even is to a graph.

Discovery is a sieve. A program generates many candidate antibodies, measures them, and narrows to one lead that is potent and manufacturable. Every step of that sieve produces entities and decisions worth modeling — but first we have to answer a deceptively simple question: when we say "the molecule," what kind of thing is that to the graph? The answer turns out to be three different things, and keeping them apart is what lets the model follow one antibody from a sequence in a database to a protein in a vial.

The simple version

A song is three things at once: the written score, any one performance of it, and the recording you stream. Confuse them and you cannot say "the same song was performed twice." A candidate antibody is the same puzzle. There is its sequence — the score, pure information you can copy and email. There is each batch of protein actually made from it — the performances. And there is its modality and behavior — the kind of music it is. This chapter keeps those three apart, because only then can the graph say that the protein in this vial was built from that sequence and behaves in this way.

What this chapter covers

We model the candidate molecule as three distinct things — a sequence (information), a material (the protein when it is made), and a modality (a class) — and then model developability not as a score buried in a report but as a set of dispositions a candidate either has or lacks. We dissect one candidate's record, show how the screening campaign that selects the lead is an occurrent the graph can replay, and close on the hard problem of modeling design intent across thousands of candidates that mostly never get made.

A molecule is information before it is matter

The deepest modeling move in discovery is to separate the sequence from the substance. An antibody's amino-acid sequence is a generically dependent continuant — the same category as the product concept and the recipe — because it is information that can be copied across carriers without loss: it lives in a database, an email, a manufacturing record, and is identically itself in all of them [1]. The protein, by contrast, is a material entity that only exists once a cell makes it. This is not pedantry; it is what lets the graph state that BATCH-2026-001's material expresses a particular sequence, that the drug substance made downstream carries that same sequence, and that two physically distinct lots are "the same molecule" precisely because they realize the same information entity. Collapse sequence and substance into one node and you lose the ability to say any of that.

The modality — monoclonal antibody, bispecific, antibody-drug conjugate, fusion protein — is a third thing again: a class the candidate is an instance of, which the Protein Ontology and modality vocabularies supply. Modality matters to the model because it determines which downstream process classes are even applicable: a bispecific implies process steps and quality attributes a plain mAb does not, and typing the candidate by modality lets the graph reason about which manufacturing template fits. The "same molecule across lots" claim rests on a shared, verified type for the antibody itself:

# bioproc.ttl + align.ttl: the antibody as both a bp: material and a shared OBO class.
bp:Antibody a owl:Class ;
    rdfs:subClassOf bp:Material ;
    rdfs:subClassOf obo:GO_0071735 .   # GO 'IgG immunoglobulin complex' (verified via OLS4)

Two physically distinct lots are "the same molecule" precisely because they realize the same sequence and instantiate this same shared class.
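
At the instance level, the "same molecule across lots" claim might look like the sketch below. It is an illustration under stated assumptions, not an excerpt from the book's dataset: the `bp:` namespace IRI is a placeholder, and `bp:Sequence`, `bp:Batch`, `bp:hasMaterial`, `bp:expressesSequence`, `bp:BATCH-2026-002`, `bp:MAT-2026-001`, and `bp:MAT-2026-002` are hypothetical names introduced here. Only `bp:Antibody`, `bp:SEQ-mAb-A`, and `BATCH-2026-001` appear in the text.

```turtle
@prefix bp:   <https://example.org/bioproc#> .          # placeholder IRI (assumption)
@prefix obo:  <http://purl.obolibrary.org/obo/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# The sequence is information: typed under BFO 'generically dependent continuant'.
bp:Sequence rdfs:subClassOf obo:BFO_0000031 .
bp:SEQ-mAb-A a bp:Sequence .

# Two physically distinct lots of protein, each with its own material node.
bp:BATCH-2026-001 a bp:Batch ; bp:hasMaterial bp:MAT-2026-001 .
bp:BATCH-2026-002 a bp:Batch ; bp:hasMaterial bp:MAT-2026-002 .   # hypothetical second lot

# Both materials point at the one sequence node; that shared node carries the identity claim.
bp:MAT-2026-001 a bp:Antibody ; bp:expressesSequence bp:SEQ-mAb-A .
bp:MAT-2026-002 a bp:Antibody ; bp:expressesSequence bp:SEQ-mAb-A .
```

With this shape, "all lots of this molecule" reduces to following `bp:expressesSequence` backwards from `bp:SEQ-mAb-A`, a question that has no clean answer once sequence and substance share a node.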

Developability is a disposition, not a number

A candidate can bind the target perfectly and still be undevelopable — prone to aggregation, unstable, hard to express, viscous at high concentration. Developability is the bundle of properties that decide whether a candidate can actually be manufactured [2]. The naive model records these as numbers in a column. The better model, and the one BFO makes available, treats them as dispositions: a candidate has a disposition to aggregate, which is realized in an aggregation process under stress, and measured by an OBI assay that yields a typed value. The disposition is a real feature of the molecule that exists whether or not it is currently being measured; the assay result is the evidence for it.

Why bother with the distinction? Because it captures how discovery actually reasons. "This candidate aggregates" is a claim about the molecule's nature; "the assay read 12% high-molecular-weight species" is one measurement supporting that claim. Modeling the disposition separately from its measurements lets the graph accumulate multiple lines of evidence for the same underlying property, flag when two assays disagree, and carry the property forward as a risk into process development — where an aggregation-prone molecule drives specific control choices. The computational developability guidelines the field has published are, in ontology terms, a set of dispositions worth screening for early [3]. In the dataset the lead is exactly that — information plus material plus evidenced dispositions — and the 12% figure above is a real measured value, distinct from the disposition it supports:

# instances.ttl — the lead candidate as information + material + evidenced dispositions.
bp:mAb-A a bp:CandidateMolecule ;        # the lead; a bp:Antibody, so it instantiates the shared GO class
    bp:hasSequence bp:SEQ-mAb-A ;        # the sequence — copyable information, not the protein
    bp:bindsTo bp:TARGET-X ;             # potent against the PRO target from the last chapter
    bp:selectedBy bp:SCREEN-1 ;          # chosen by the screening campaign (an occurrent)
    bp:hasDevelopability bp:DEV-agg , bp:DEV-tm , bp:DEV-titer , bp:DEV-visc .
bp:DEV-agg a bp:AggregationPropensity ;  # the disposition — a real feature of the molecule
    bp:measuredBy bp:DEV-Assay-agg ;
    bp:aggregationPropensityPct "12.0"^^xsd:float .   # the evidence (12 percent HMW species), not the disposition
bp:SCREEN-1 a bp:ScreeningCampaign ; bp:hasOutput bp:mAb-A .

One candidate, fully unpacked: the sequence is information, the modality is a class, and each developability property is a disposition with its own measuring assay — so the molecule's manufacturability risks travel forward as modeled facts, not buried numbers. Original diagram by the authors, created with AI assistance.
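
The excerpt above attaches the percentage to the disposition node itself. One possible refinement, offered as a sketch rather than as the dataset's actual shape, gives each reading its own evidence node so that several measurements can accumulate under one disposition. The `bp:` namespace IRI is a placeholder, and `bp:hasEvidence`, `bp:MeasuredEvidence`, `bp:valuePct`, `bp:AggregationProcess`, `bp:STRESS-agg-1`, and `bp:MEAS-agg-1` are hypothetical names. The BFO identifiers are the standard ones for 'disposition', 'process', and 'realizes'.

```turtle
@prefix bp:   <https://example.org/bioproc#> .          # placeholder IRI (assumption)
@prefix obo:  <http://purl.obolibrary.org/obo/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

bp:AggregationPropensity rdfs:subClassOf obo:BFO_0000016 .   # BFO 'disposition'
bp:AggregationProcess    rdfs:subClassOf obo:BFO_0000015 .   # BFO 'process'

# The disposition: a feature of the molecule, with evidence hanging off it.
bp:DEV-agg a bp:AggregationPropensity ;
    bp:hasEvidence bp:MEAS-agg-1 .

# A hypothetical stress run in which the disposition is realized.
bp:STRESS-agg-1 a bp:AggregationProcess ;
    obo:BFO_0000055 bp:DEV-agg .                         # BFO 'realizes'

# The evidence: one assay reading, kept apart from the disposition it supports.
bp:MEAS-agg-1 a bp:MeasuredEvidence ;
    bp:measuredBy bp:DEV-Assay-agg ;                     # assay identifier from the excerpt above
    bp:valuePct "12.0"^^xsd:float .                      # the 12 percent HMW reading
```

A second assay would add another `bp:MeasuredEvidence` node under the same `bp:DEV-agg`, so disagreement between assays shows up as two values attached to one property instead of two unrelated columns.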

The campaign that selects the lead is an occurrent

Discovery is not only a set of molecules; it is a process of selecting among them, and that process is worth modeling as an occurrent. A screening campaign — express a panel, run binding and developability assays, rank, and pick — is exactly the kind of happening Part I's spine puts on the occurrent side. Modeling it means the graph records not just that this lead was chosen but which candidates it beat, on which assays, by which criteria. That provenance is the difference between a decision you can defend years later and one you can only assert. When a regulator or a partner asks "why this molecule?", the answer is a subgraph: the campaign, its candidate participants, their developability dispositions, and the selection criteria applied — the OBI investigation model made concrete [4].

Selecting the lead, modeled as a process: the screening campaign is an occurrent whose candidate participants, assays, and criteria stay in the graph — so "why this molecule?" is answerable as a subgraph, not a recollection. Original diagram by the authors, created with AI assistance.
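
The "why this molecule?" subgraph could be sketched as follows. This is a minimal illustration, not the book's data: `bp:SCREEN-1`, `bp:mAb-A`, `bp:hasOutput`, and `bp:DEV-Assay-agg` come from the earlier excerpt, while the namespace IRI, the losing candidates `bp:mAb-B` and `bp:mAb-C`, and the terms `bp:hasParticipant`, `bp:usedAssay`, `bp:appliedCriterion`, `bp:SelectionCriterion`, `bp:rejectedBy`, and `bp:failedCriterion` are hypothetical.

```turtle
@prefix bp:   <https://example.org/bioproc#> .          # placeholder IRI (assumption)
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# The campaign keeps every participant, not only the winner.
bp:SCREEN-1 a bp:ScreeningCampaign ;
    bp:hasParticipant   bp:mAb-A , bp:mAb-B , bp:mAb-C ;   # mAb-B and mAb-C are hypothetical
    bp:usedAssay        bp:DEV-Assay-agg ;
    bp:appliedCriterion bp:CRIT-agg ;
    bp:hasOutput        bp:mAb-A .

# The criterion is a node of its own, so a decision can cite it.
bp:CRIT-agg a bp:SelectionCriterion ;
    rdfs:label "aggregation propensity within the campaign's limit" .

# Each loser records which campaign rejected it and on what grounds.
bp:mAb-B a bp:CandidateMolecule ;
    bp:rejectedBy      bp:SCREEN-1 ;
    bp:failedCriterion bp:CRIT-agg .
bp:mAb-C a bp:CandidateMolecule ;
    bp:rejectedBy      bp:SCREEN-1 ;
    bp:failedCriterion bp:CRIT-agg .
```

Starting from `bp:mAb-A` and walking to `bp:SCREEN-1`, then out to its participants and criteria, returns the defence of the decision as data.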

The unsolved part: modeling intent at the scale of thousands

The honest difficulty here is scale and counterfactuals. A real campaign may generate thousands of candidates, the vast majority of which are never made, never advance, and exist only as sequences and a few assay points. Modeling every one as a full entity with dispositions and provenance is possible but rarely economical, so teams model the survivors richly and the rest thinly. The consequence is that the graph's account of "why this lead" is only as complete as the records kept for the losing candidates, and early-stage data is notoriously messy and under-curated. There is a genuine tension between the FAIR ideal (every candidate a first-class, well-described entity) and the reality that discovery moves fast and most candidates are disposable.

A second open problem is the link from in silico to in vitro. Modern discovery generates candidates and developability predictions computationally before anything is expressed, so the graph must relate a predicted disposition (from a model) to a measured one (from an assay) — and be honest about which is which. Conflating a computational prediction with an experimental result is a quiet data-integrity failure that a careless model invites. The discipline of typing every developability claim with its evidence source — prediction or measurement — is straightforward to state and routinely skipped, and it is exactly the kind of provenance the ML/AI book will insist on when those predictions come from learned models.
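
Typing each developability claim by its evidence source might be sketched like this. The sketch assumes a placeholder namespace IRI and hypothetical names throughout (`bp:Evidence`, `bp:PredictedEvidence`, `bp:MeasuredEvidence`, `bp:hasEvidence`, `bp:generatedBy`, `bp:CAND-0001`, `bp:SEQ-0001`, `bp:DEV-agg-0001`, `bp:PRED-agg-0001`, `bp:MODEL-dev-1`, `bp:MEAS-agg-1`). It also shows a thinly modeled candidate that was never expressed.

```turtle
@prefix bp:   <https://example.org/bioproc#> .          # placeholder IRI (assumption)
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Two kinds of evidence, declared disjoint so a reasoner can catch conflation.
bp:PredictedEvidence rdfs:subClassOf bp:Evidence .
bp:MeasuredEvidence  rdfs:subClassOf bp:Evidence ;
    owl:disjointWith bp:PredictedEvidence .

# A candidate that was never made: a sequence and one predicted disposition, nothing more.
bp:CAND-0001 a bp:CandidateMolecule ;
    bp:hasSequence       bp:SEQ-0001 ;
    bp:hasDevelopability bp:DEV-agg-0001 .
bp:DEV-agg-0001 a bp:AggregationPropensity ;
    bp:hasEvidence bp:PRED-agg-0001 .
bp:PRED-agg-0001 a bp:PredictedEvidence ;
    bp:generatedBy bp:MODEL-dev-1 .          # hypothetical model id; no predicted value is given in the text

# The lead's disposition, by contrast, is backed by a wet-lab assay.
bp:DEV-agg bp:hasEvidence bp:MEAS-agg-1 .
bp:MEAS-agg-1 a bp:MeasuredEvidence ;
    bp:measuredBy bp:DEV-Assay-agg .
```

The disjointness axiom is the design choice doing the work here: a node typed as both predicted and measured becomes a detectable inconsistency instead of a silent one.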

Why it matters

The decisions made in discovery echo through the entire lifecycle, and most of them are made on properties the model can carry forward — if it bothers to. An aggregation disposition flagged here becomes an aggregation CQA to control in process development and a release test in QC; a sequence named here is the identity that the drug substance must match. Model the molecule as three clean entities with evidenced dispositions and that forward thread is automatic; model it as a fuzzy node with some columns and the manufacturing team re-derives from scratch what discovery already knew. The molecule is the one entity that travels the whole length of the process, so getting its model right pays off at every later step.
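
The forward thread could be sketched as below, under the assumption of a placeholder namespace IRI and hypothetical terms (`bp:CriticalQualityAttribute`, `bp:tracesToDisposition`, `bp:controlledBy`, `bp:testedBy`, `bp:ProcessControl`, `bp:ReleaseTest`, `bp:DrugSubstance`, `bp:expressesSequence`, and the instance identifiers other than `bp:DEV-agg` and `bp:SEQ-mAb-A`).

```turtle
@prefix bp: <https://example.org/bioproc#> .            # placeholder IRI (assumption)

# Process development reuses the disposition node that discovery created.
bp:CQA-agg a bp:CriticalQualityAttribute ;
    bp:tracesToDisposition bp:DEV-agg ;     # the same node, not a re-derived copy
    bp:controlledBy        bp:CTRL-agg ;
    bp:testedBy            bp:QC-agg-release .

bp:CTRL-agg       a bp:ProcessControl .     # hypothetical control in process development
bp:QC-agg-release a bp:ReleaseTest .        # hypothetical release test in QC

# The drug substance must match the sequence named in discovery.
bp:DS-mAb-A a bp:DrugSubstance ;
    bp:expressesSequence bp:SEQ-mAb-A .
```

Because the CQA points at `bp:DEV-agg` itself, the evidence gathered in discovery stays reachable from the release test without copying it.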

In the real world

Antibody discovery is increasingly developability-aware, exactly the shift this chapter models: large surveys of clinical-stage antibodies quantified which biophysical properties separate molecules that progress from those that fail, and computational guidelines now screen candidates for those properties before a single one is expressed [2][3]. The sequences themselves live in databases that already use stable identifiers, and the assays that measure binding and developability are the OBI-style investigations the graph can reference rather than reinvent. What is still uneven is the discipline of carrying a candidate's modeled dispositions forward into manufacturing as the same entities — too often the rich discovery model is flattened to a sequence and a name at the tech-transfer wall, discarding precisely the manufacturability knowledge the next stage needs.

Key terms

Sequence — the antibody's amino-acid sequence, modeled as a generically dependent continuant (information), distinct from any protein material made from it.
Modality — the kind of molecule (monoclonal antibody, bispecific, ADC, fusion protein), modeled as a class that determines which downstream process and quality classes apply.
Developability — the bundle of properties deciding whether a candidate can be manufactured; modeled as dispositions the molecule has, each realized in a process and measured by an assay.
Disposition — a BFO realizable entity: a real feature of a thing (e.g., a propensity to aggregate) that exists whether or not it is currently being measured.
Lead — the candidate selected to advance, modeled as the output of a screening campaign that retains its losing competitors and selection criteria.
Screening campaign — the discovery process, modeled as an occurrent whose candidate participants, assays, and criteria make the selection decision reconstructible.

Where this leads

We have a lead molecule, modeled as sequence, material, modality, and evidenced dispositions. The next chapter, Modeling the Cell Line and the Cell-Bank Genealogy, reaches the entity where the manufacturing genealogy begins — the engineered cell line and the banked vials of cells, WCB-CHO-001 at the root of every derivedFrom chain in this book. We model the host organism by its taxonomy IRI, the cell-bank hierarchy as a lineage, and the uncomfortable question of how a graph represents the identity and stability of a living, mutating thing.
