Modeling the Seed Train and the Start of Genealogy

📍 Where we are: Part III · Upstream, modeled — Chapter 10. The recipe is on the plant floor. Now information becomes action: a vial is thawed, cells are grown, and the genealogy of this specific batch begins to be written, edge by edge.

Part II modeled knowledge — the molecule, the cell line, the design space, the portable recipe. Part III models events: the actual making of one batch, with its own concrete genealogy. It starts undramatically, with a technician thawing a single vial of WCB-CHO-001 and coaxing those few million cells, through a series of ever-larger vessels, into the billions needed to inoculate a production bioreactor. This seed train is the bridge from the banked cell line to production, and it is where the first real derivedFrom edges of the campaign — the ones this whole book has pointed toward — are finally laid down.

The simple version

You cannot plant a field from a single seed; you grow a seedling in a cup, move it to a pot, then a bed, then a field. Each move is a real step with a before and an after, and the field's plants trace back through the bed and the pot to that first seedling. The seed train is that staged growing-up for cells. Modeling it means recording each move as an event that turns one container of cells into the next, so the production batch can trace its ancestry, hop by hop, back to the vial it started from.

What this chapter covers

We model the seed train as a chain of expansion processes (EXP-001, EXP-002), each an occurrent that consumes one cell material and produces a larger one, laying the first derivedFrom edges of a specific campaign. We dissect the SEEDFLASK-001 and SEED-001 nodes, carry the passage count forward from the cell bank (8 at the bank, then 12, then 16, against a validated limit of 40), and confront the chapter's real modeling puzzle: how a continuous biological expansion gets represented as a handful of discrete graph nodes, and where to draw the lines.

Each expansion is a process that makes a new material

The seed train is a sequence of scale-ups: thaw into a shake flask, expand into a small bioreactor, then a larger one, until there are enough cells to inoculate production [1]. The clean way to model this uses exactly the continuant/occurrent split from Part I's spine. Each scale-up is an expansion process — an occurrent — that has as input the cells from the previous stage (a material entity) and produces as output a larger population (a new material entity). The cells participate in the expansion; the expansion hasOutput the next stage's material. String these together and the seed train is a chain of processes, each handing its product to the next as input.
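The chain-of-processes idea can be sketched in a few lines of Python. This is a minimal illustration, not the companion ontology's code: the class and function names are hypothetical, and the identifiers are the two expansion stages this chapter models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpansionProcess:
    """An occurrent: consumes one cell material and produces a larger one."""
    process_id: str
    has_input: str   # identifier of the material consumed
    has_output: str  # identifier of the material produced


def is_chain(processes):
    """True if each process hands its output to the next process as input."""
    return all(a.has_output == b.has_input
               for a, b in zip(processes, processes[1:]))


def derived_from_edges(processes):
    """Each expansion leaves one lineage edge: (output, 'derivedFrom', input)."""
    return [(p.has_output, "derivedFrom", p.has_input) for p in processes]


# The two stages named in this chapter (hypothetical in-memory form).
seed_train = [
    ExpansionProcess("EXP-001", "WCB-CHO-001", "SEEDFLASK-001"),
    ExpansionProcess("EXP-002", "SEEDFLASK-001", "SEED-001"),
]

for edge in derived_from_edges(seed_train):
    print(edge)
```

The design point is that lineage edges are not entered separately: under this sketch's assumptions they are read off the processes, one edge per expansion.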

The genealogy falls out of this for free. Because each stage's output material derivedFrom the previous stage's material, and the first stage derives from the thawed WCB-CHO-001 vial, the campaign's lineage is a derivedFrom chain rooted in the cell bank: SEEDFLASK-001 derivedFrom WCB-CHO-001, SEED-001 derivedFrom SEEDFLASK-001, and later BATCH-2026-001 derivedFrom SEED-001. The deliverable models the seed train as two real expansion stages, not one: a shake-flask culture (SEEDFLASK-001) and a seed-bioreactor culture (SEED-001), each the output of its own ExpansionProcess. The open-source loader writes the same lineage, though its coarse lot_genealogy.csv chain collapses the shake-flask intermediate that this ontology keeps. Here we see why the edges exist: each is the trace of a real material transformation, not an arbitrary link. And because derivedFrom is transitive, the production batch still derives from the cell bank automatically, no matter how many expansion stages sit between them. These are the first concrete edges of the campaign, exactly as the loadable dataset writes them:

# instances.ttl — the seed train: two real expansion stages, accumulating passage.
bp:SEEDFLASK-001 a bp:ShakeFlaskCulture ; rdfs:label "shake-flask seed culture" ;
    bp:derivedFrom bp:WCB-CHO-001 ;           # rooted in the cell bank
    bp:participatesIn bp:EXP-001 ;
    bp:passageNumber 12 .
bp:SEED-001 a bp:SeedBioreactorCulture ; rdfs:label "SEED-001 (seed bioreactor culture)" ;
    bp:derivedFrom bp:SEEDFLASK-001 ;         # ...one hop back to the shake flask
    bp:participatesIn bp:EXP-002 ;
    bp:passageNumber 16 .                     # the count carried forward, checkable at release
bp:EXP-001 a bp:ExpansionProcess ; rdfs:label "shake-flask expansion" ;
    bp:hasInput bp:WCB-CHO-001 ; bp:hasOutput bp:SEEDFLASK-001 ; bp:occursIn bp:SF-01 .
bp:EXP-002 a bp:ExpansionProcess ; rdfs:label "seed-bioreactor expansion" ;
    bp:hasInput bp:SEEDFLASK-001 ; bp:hasOutput bp:SEED-001 ; bp:occursIn bp:SBR-01 .
bp:BATCH-2026-001 bp:derivedFrom bp:SEED-001 .   # ...and so, transitively, back to WCB-CHO-001

The campaign's first genealogy edges: each expansion is a process turning one cell material into the next, the passage count climbing 8 to 12 to 16 while staying under the validated limit of 40, and the production batch tracing transitively back to the working cell bank. Original diagram by the authors, created with AI assistance.
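What "transitively back to WCB-CHO-001" means in practice can be shown with a small traversal. This is a sketch under stated assumptions: the edges are copied from the instances.ttl excerpt above into a plain dictionary, each material is assumed to have a single parent, and the function name `ancestors` is illustrative only. A real deployment would more likely use a SPARQL property path or a reasoner.

```python
# Direct derivedFrom edges, child -> parent, copied from the excerpt above.
DERIVED_FROM = {
    "SEEDFLASK-001": "WCB-CHO-001",
    "SEED-001": "SEEDFLASK-001",
    "BATCH-2026-001": "SEED-001",
}


def ancestors(material, derived_from=DERIVED_FROM):
    """Walk derivedFrom hop by hop and return the ancestors, nearest first."""
    lineage = []
    seen = {material}
    while material in derived_from:
        material = derived_from[material]
        if material in seen:  # guard against an accidentally cyclic graph
            raise ValueError(f"cycle in derivedFrom at {material}")
        seen.add(material)
        lineage.append(material)
    return lineage


print(ancestors("BATCH-2026-001"))
# ['SEED-001', 'SEEDFLASK-001', 'WCB-CHO-001']
```

Adding a third expansion stage would add one dictionary entry and one hop; the root the batch resolves to would not change.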

The passage count travels with the cells

The cell-line chapter introduced a fact that now becomes load-bearing: the passage (or generation) count, which bounds how long the cells may be grown before quality risks rise [2]. The seed train is where that count accumulates — each expansion adds generations — so the model carries it forward along the derivedFrom chain, incrementing at each stage: the working cell bank carries passageNumber 8, the shake-flask culture 12, the seed-bioreactor culture 16. This lets the graph answer a real GMP question — was this batch inoculated from cells within the validated passage limit? — as a query rather than a manual reconstruction from lab notebooks. A fact established at the root (the validated window — bp:PassageLimit-mAb-A bp:validatedPassageLimit 40) meets a fact accumulated along the train (the actual count, 16 at inoculation), and the release gate can later check one against the other. The seed train is unglamorous, but it is where a key piece of provenance is either kept or lost.

The seed train, unpacked: two expansion processes turning the thawed vial into enough cells to inoculate production, the derivedFrom edges rooted (transitively) in the cell bank, and the passage count carried forward to where the release gate can check it. Original diagram by the authors, created with AI assistance.
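The release-gate check described above reduces to comparing two numbers that live in different parts of the graph. The sketch below uses the passage values and the limit quoted in this chapter; the dictionary layout and function names are hypothetical, and treating the limit as inclusive is an assumption that a real process would take from its validation documents.

```python
# Passage numbers carried along the derivedFrom chain (values from this chapter).
PASSAGE_NUMBER = {
    "WCB-CHO-001": 8,
    "SEEDFLASK-001": 12,
    "SEED-001": 16,
}

# bp:PassageLimit-mAb-A bp:validatedPassageLimit 40
VALIDATED_PASSAGE_LIMIT = 40


def within_passage_limit(inoculum, passage_number=PASSAGE_NUMBER,
                         limit=VALIDATED_PASSAGE_LIMIT):
    """True if the inoculum's passage count does not exceed the limit.

    Assumption: the limit is inclusive (<=).
    """
    return passage_number[inoculum] <= limit


def passage_never_decreases(lineage_root_first, passage_number=PASSAGE_NUMBER):
    """Sanity check: the count should only accumulate along the train."""
    counts = [passage_number[m] for m in lineage_root_first]
    return all(a <= b for a, b in zip(counts, counts[1:]))


train = ["WCB-CHO-001", "SEEDFLASK-001", "SEED-001"]
print(within_passage_limit("SEED-001"), passage_never_decreases(train))
```

The second check is not a GMP requirement stated in the text; it is a cheap consistency test that catches a mis-keyed passage number before the release gate does.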

The unsolved part: discretizing a continuous, living expansion

Here is the puzzle the tidy chain hides. A seed train is not really a sequence of discrete events; it is continuous growth, cells dividing every day, occasionally split between vessels. The graph models it as a handful of nodes — a few expansion stages, a few materials — because that granularity is useful, but the choice of where to put the boundaries is a modeling judgment, not a fact the biology hands you. Is each vessel transfer a new material node, or each day of growth? Is a flask split into two the creation of two new materials, or one material in two containers? Reasonable modelers answer differently, and there is no canonical granularity — the same individuation problem the harvest chapter meets head-on. Model too coarsely and you lose the resolution to trace a contamination to a specific transfer; model too finely and the graph drowns in nodes nobody queries.

The deeper version of this is the one the cell-line chapter already named: the cells are alive and changing, so "the material at stage 3" is a population that was different yesterday and will be different tomorrow. The derivedFrom edge implies a clean parent-child handoff, but the reality is a smear of continuous growth that the discrete edge approximates. The model is genuinely useful — it makes lineage and passage queryable — but it is an abstraction over a continuous biological process, and pretending the nodes are crisp where the biology is smooth is a quiet way to over-trust the graph. The honest standard is to choose a granularity that answers the questions you actually ask (contamination tracing, passage compliance) and to document that the nodes are deliberate simplifications of a living continuum.
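The cost of a coarser granularity can be made concrete. The sketch below collapses the shake-flask intermediate out of the fine-grained chain, similar in spirit to the coarse lot_genealogy.csv chain mentioned earlier, and shows what survives and what does not. The function names are hypothetical, and the code assumes every hidden node is an intermediate with a parent of its own.

```python
# Fine-grained chain, child -> parent: the granularity this chapter keeps.
FINE = {
    "SEEDFLASK-001": "WCB-CHO-001",
    "SEED-001": "SEEDFLASK-001",
    "BATCH-2026-001": "SEED-001",
}


def collapse(derived_from, hidden):
    """Drop the hidden nodes, re-linking each child to its nearest kept ancestor.

    Assumption: every hidden node has a parent (it is never the root).
    """
    coarse = {}
    for child, parent in derived_from.items():
        if child in hidden:
            continue
        while parent in hidden:
            parent = derived_from[parent]
        coarse[child] = parent
    return coarse


def root_of(material, derived_from):
    """Follow derivedFrom until a material with no parent is reached."""
    while material in derived_from:
        material = derived_from[material]
    return material


coarse = collapse(FINE, hidden={"SEEDFLASK-001"})
print(coarse)
```

Both graphs resolve the batch to the same root, so passage compliance and cell-bank traceability survive the coarsening. What is lost is the shake-flask hop: a contamination introduced at that transfer has no node to attach to in the coarse graph.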

Why it matters

The seed train is where a campaign's genealogy is either rooted correctly or quietly broken. Lay the derivedFrom edges as the traces of real expansions, carry the passage count forward, and every downstream lineage question — back to the cell bank, within passage limits — is answerable by traversal. Skip it, or flatten the train to a single "inoculum" node, and the chain from production back to the cell-bank root loses its most fragile link exactly where living material is most variable. This unassuming stage is the on-ramp for the entire genealogy the rest of the book traverses.

From the wire to the graph

The passage count this chapter leans on does not appear in the graph by magic; it is sourced. The working cell bank's count comes from the electronic lab notebook (an ELN/LIMS such as Benchling, the kind of production tooling found in most biologics labs), and the companion ontology records that source rather than asserting the number as a bare fact. In examples/platform/ontology/instances.ttl the ELN is a prov:SoftwareAgent, and the claim that WCB-CHO-001 is at passage 8 is a prov:Entity attributed to it with prov:wasAttributedTo. This is the same provenance discipline the production stages use for the MES reconciliation, applied here to a number that travels all the way to the release gate.

The instances also show the seed train reusing the upstream controlled vocabularies rather than minting local terms. bp:SEED-001 declares bp:hasHostOrganism bp:CHO-host, the same host instance carried by the cell line and the working cell bank. The alignment that grounds those classes lives in examples/platform/ontology/align.ttl: bp:HostOrganism rdfs:subClassOf obo:NCBITaxon_10029 (NCBI Taxonomy Cricetulus griseus, the Chinese hamster) and bp:WorkingCellBank rdfs:subClassOf obo:CLO_0002421 (the CLO CHO cell type) — both production-grade reference vocabularies. The point: those IRIs are not confined to cell-line development; the host instance is carried forward into every expansion stage.

The excerpt below pairs the align.ttl axioms those classes rest on with the instance triples that use them. The alignment grounds the seed train up to shared reference vocabularies: the host organism and cell bank to OBO biology, the culture and expansion process to IOF, and the lineage edge to the Relation Ontology:

# align.ttl — the seed-train classes grounded UP to shared vocabularies (excerpt).
bp:HostOrganism       rdfs:subClassOf obo:NCBITaxon_10029 .       # NCBI Taxonomy 'Cricetulus griseus' (Chinese hamster, verified via OLS4)
bp:WorkingCellBank    rdfs:subClassOf obo:CLO_0002421 .           # Cell Line Ontology 'CHO cell'
bp:SeedTrainCulture   rdfs:subClassOf iof:CellCulture .           # IOF biopharma 'cell culture' (Released); ShakeFlaskCulture / SeedBioreactorCulture inherit it
bp:CellCultureProcess rdfs:subClassOf iof:ManufacturingProcess .  # IOF Core; bp:ExpansionProcess inherits this
bp:derivedFrom        rdfs:subPropertyOf obo:RO_0001000 .         # RO 'derives from' (the campaign's lineage edge)

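# instances.ttl: the seed-bioreactor culture reusing the aligned host, plus the sourced passage claim (excerpt).
bp:SEED-001 a bp:SeedBioreactorCulture ; rdfs:label "SEED-001 (seed bioreactor culture)" ;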
    bp:derivedFrom bp:SEEDFLASK-001 ;
    bp:participatesIn bp:EXP-002 ;
    bp:hasHostOrganism bp:CHO-host ;   # reuse the NCBI-Taxonomy-aligned host upstream, not only in cell-line dev
    bp:passageNumber 16 .

# The working-cell-bank vial's passage count is sourced from the ELN/LIMS (e.g. Benchling);
# the source is recorded as PROV provenance, the same discipline used for the MES reconciliation.
bp:ELN a prov:SoftwareAgent ; rdfs:label "electronic lab notebook (ELN/LIMS, e.g. Benchling)" .
bp:claim-passage-WCB a prov:Entity ; rdfs:label "source claim: WCB-CHO-001 passage number = 8" ;
    prov:wasAttributedTo bp:ELN .
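What the alignment buys can be sketched as a subclass walk: an individual typed with a local class is also an instance of the shared reference class above it. The first two subclass links below are stated in the align.ttl excerpt; the other three are assumptions read from its comments ("inherit it", "inherits this"). The dictionary form allows one superclass per class, which is a simplification, and the function names are illustrative.

```python
SUBCLASS_OF = {
    # Stated in the align.ttl excerpt.
    "bp:SeedTrainCulture": "iof:CellCulture",
    "bp:CellCultureProcess": "iof:ManufacturingProcess",
    # Assumed from the excerpt's comments, not shown as triples there.
    "bp:ShakeFlaskCulture": "bp:SeedTrainCulture",
    "bp:SeedBioreactorCulture": "bp:SeedTrainCulture",
    "bp:ExpansionProcess": "bp:CellCultureProcess",
}

# Asserted types from the instances excerpts in this chapter.
TYPE_OF = {
    "bp:SEEDFLASK-001": "bp:ShakeFlaskCulture",
    "bp:SEED-001": "bp:SeedBioreactorCulture",
    "bp:EXP-002": "bp:ExpansionProcess",
}


def superclasses(cls):
    """Return the chain of superclasses above cls, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_instance_of(individual, cls):
    """True if the individual's asserted type is cls or a subclass of it."""
    asserted = TYPE_OF[individual]
    return cls == asserted or cls in superclasses(asserted)


print(is_instance_of("bp:SEED-001", "iof:CellCulture"))
```

A query written against iof:CellCulture would therefore find both seed-train cultures without knowing the local bp: class names, which is the practical reason to ground classes upward rather than mint isolated terms.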

Which standards and vocabularies are production versus pilot, and where each one actually fits, is the subject of the ontologies and controlled vocabularies actually in use.

In the real world

The staged seed train, with its disciplined passage limits and characterization, is how every commercial mammalian-cell process actually bridges from a banked vial to production [1][2]. Modern intensified processes push more growth into perfusion-style seed steps, which changes the number and shape of the expansion nodes but not the modeling principle: each scale-up is still a process that derives a new material from the last. The open-source upstream chapter captures the live signals of these vessels, and the genealogy edges this chapter explains correspond to the rows its knowledge graph loads, with the shake-flask intermediate collapsed in the coarse lot_genealogy.csv chain. The seed train modeled here is the first stretch of the lineage that code walks.

Key terms

Seed train — the staged scale-up that grows a thawed cell-bank vial into enough cells to inoculate production; modeled as a chain of expansion processes.
Expansion process — one scale-up step, modeled as an occurrent consuming one cell material and producing a larger one, with the cells as participants.
derivedFrom (campaign edges) — the first concrete genealogy edges of a specific batch, each the trace of a real material transformation, rooted (transitively) in the cell bank.
Passage / generation count — the accumulating division count carried forward along the train, checkable against the validated limit at release.
Granularity / individuation — the modeling judgment of where to place material and process boundaries along a continuous, living expansion; deliberate simplification, not a fact of the biology.

Where this leads

The cells are grown and ready. The next chapter, Modeling the Production Bioreactor: A Process, Its Phases, and Its Parameters, reaches the heart of upstream — where the batch material and the cell-culture process must be kept distinct, where growth and production phases are modeled as sub-processes, and where the critical parameters from process development finally have a real run to attach to, complete with the dense sensor streams the graph indexes but does not swallow.
