Preface — Teaching the Process to Mean Something
📍 Where we are: Right at the door of the fourth book. Before we model a single step, let us agree on what this book is really about — turning a process that is made and recorded into a process the machine can reason about.
Welcome. This is a book about meaning. The first three books in this series taught how a biologic medicine is physically made, how the data it sheds is managed, and how to build the open-source software that holds that data. This book asks the next question: once every reading, result, and record exists, how do we make a computer understand what they mean — well enough to walk from a vial in a patient's hand back to the cell bank it grew from, in a single query, without a human translating between systems at every hop?
The answer is an ontology: a formal, shared, machine-readable model of what exists in the bioprocess and how it all connects. We will build that model the way the medicine itself is built — one step at a time, from the cell bank to the filled vial — and watch the separate facts of each step click into one navigable knowledge graph.
You need no background in logic or semantics to read this. If you have never written a triple, never opened an ontology editor, never heard of OWL, you are exactly the reader we wrote for. We define every specialized word the first time it appears, and again in a Key terms box at the end of each chapter.
Imagine every department in a factory keeps a perfect notebook — but each in its own private shorthand. The cell-culture team writes "BR-101," the lab writes "Lot 26-001," the warehouse writes material "1000457," and all three mean the same batch. Everything is recorded; nothing connects. An ontology is the shared dictionary and grammar that says exactly what a "batch," a "bioreactor," and a "result" are, and how they relate — so any person, or any machine, can read every notebook as one story. This book builds that dictionary for biomanufacturing, page by page, following the medicine through the plant.
What this chapter covers
A quick map of the preface: who this book is for and the promise we make; the central idea that the process becomes knowledge only when its meaning is modeled, not merely stored; the running example — one antibody batch — that threads every chapter; how this book stands on the three that came before it; the eight parts that follow the medicine from cell bank to patient and then out into the industry that builds it; and the small set of conventions you will see on every page.
The promise: every claim is traceable
This is a popular book in tone and a textbook in rigor. To keep those two honest, every non-obvious claim — every standard, every number, every "studies show" — carries a small bracketed marker like this [1]. Click it and you land on a single References page listing the exact standard, peer-reviewed paper, or regulatory document behind the statement. Where a claim can be checked, you can follow it to its source and check it.
The idea: a stored fact is not yet a known fact
Here is the conviction the whole book turns on.
Three books made the data; this book makes it mean
A modern batch leaves behind an enormous trail — sensor traces, lab results, batch records, signatures. The companion data book calls that trail the molecule's data shadow and shows how to manage it; the open-source book shows how to store it as concrete database rows. But stored is not the same as understood. A process historian (the plant's time-series database) can hold ten million perfectly time-stamped numbers and still not know that the tag BR101.Temp.PV and the lab field culture_temp describe the same physical quantity, or that this drug-substance lot descends from that bioreactor run. The meaning that a human supplies by reading two screens side by side is exactly what does not travel between systems on its own.
An ontology is where that meaning is made to travel. It promotes the relationships a human keeps in their head — this result is about that batch; that batch derived from this seed train; this parameter is critical to that quality attribute — into first-class, machine-readable facts. Once they are facts, a computer can follow them, combine them, and check them. The process stops being a pile of records and becomes a body of knowledge.
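As a minimal sketch of what a "first-class fact" looks like, the relationships above can be written as subject, predicate, object triples in plain Python. The predicate bp:isAbout and the result identifier bp:RESULT-SEC-001 are hypothetical names chosen for illustration; only bp:derivedFrom and the batch and seed-train identifiers come from the running example.

```python
# Each fact is a (subject, predicate, object) triple.
# ASSUMPTION: bp:isAbout and bp:RESULT-SEC-001 are hypothetical names.
TRIPLES = {
    ("bp:RESULT-SEC-001", "bp:isAbout", "bp:BATCH-2026-001"),
    ("bp:BATCH-2026-001", "bp:derivedFrom", "bp:SEED-001"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Follow one relation: every object linked to subject by predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def seed_trains_behind(result):
    """Combine two facts: result -> the batch it is about -> its seed train."""
    return {
        seed
        for batch in objects(result, "bp:isAbout")
        for seed in objects(batch, "bp:derivedFrom")
    }


print(seed_trains_behind("bp:RESULT-SEC-001"))
```

Nothing here knows what a batch or a seed train is. The point is only that once a relationship is a stored fact rather than a habit of mind, following and combining it becomes a mechanical operation.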
One physical measurement, carried across the series: a reading, then a data point, then a stored row, and finally a typed triple in a graph the machine can reason over.
Original diagram by the authors, created with AI assistance.
Quality by Design is, secretly, an ontology
The industry has wanted this for a long time without using the word. Quality by Design (QbD) — the regulatory framework that treats recorded process understanding as part of the product — asks you to identify which process settings are critical process parameters (CPPs) and which measurable product traits are critical quality attributes (CQAs), and to capture the relationships linking them [2]. Read that again: QbD asks for a model of entities (parameters, attributes, materials, steps) and relations (this parameter affects that attribute). That is an ontology in everything but name. This book makes the name — and the machinery underneath it — explicit, so that "process understanding" becomes something a computer can hold, not just something written into a development report.
The running example: one antibody, end to end
Rather than scatter toy examples, this book models one monoclonal antibody batch all the way through, the same batch the rest of the series follows. Its genealogy is a chain we will turn into graph edges again and again:
A working cell bank WCB-CHO-001 seeds a seed train SEED-001, which inoculates the production bioreactor batch BATCH-2026-001, whose harvest becomes a Protein A capture pool PApool-001, purified into a drug-substance lot DS-001, filled into a drug-product lot DP-001.
Each arrow in that sentence becomes one derivedFrom edge. The batch carries a release result — its size-exclusion %monomer purity, 98.611, an instance of the bp:monomerPct quality attribute — and the campaign includes a deliberately out-of-spec sibling, DP-004, so we can ask the graph the question an investigator actually asks: what else shares this lot's lineage? By the final chapters, that one sentence will be a graph you can query, validate, and trust. (Throughout, we use the same illustrative namespace as the open-source book, bp: for https://example.org/bioproc#, so a class here is the same class there.)
The medicine's journey, redrawn as a graph: every handoff between steps is a derivedFrom edge, and a release result hangs off the batch it describes — the shape this whole book builds toward.
Original diagram by the authors, created with AI assistance.
And this is not a thought experiment. That whole sentence is real, loadable RDF in the companion repository (examples/platform/ontology/) — here is its spine:
# instances.ttl — the running example as RDF (bp:derivedFrom is an owl:TransitiveProperty).
# These are the real, asserted immediate-parent edges; each downstream chapter inserts the
# per-unit-operation intermediate it runs through (clarified harvest, the viral-inactivated
# and viral-filtered pools, the polishing intermediate), and because derivedFrom is
# transitive, DS-001 still traces all the way back to the cell bank in a single walk.
# Prefix declarations are included so that this excerpt parses on its own.
@prefix bp: <https://example.org/bioproc#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

bp:WCB-CHO-001 a bp:WorkingCellBank ; bp:derivedFrom bp:MCB-CHO-001 . # the master cell bank above it
bp:SEED-001 a bp:SeedBioreactorCulture ; bp:derivedFrom bp:SEEDFLASK-001 .
bp:BATCH-2026-001 a bp:Batch ; bp:derivedFrom bp:SEED-001 ; # the batch material (the vessel BR-101 is a separate node)
    bp:monomerPct "98.611"^^xsd:float .
bp:PApool-001 a bp:CapturePool ; bp:derivedFrom bp:CLAR-001 . # via the clarified harvest
bp:DS-001 a bp:DrugSubstance ; bp:derivedFrom bp:POLpool-001 . # via the polishing intermediate
bp:DP-001 a bp:DrugProduct ; bp:derivedFrom bp:DS-001 .
A small script (validate.py) loads all 2,097 triples, reasons over them (closing to 7,077 under OWL-RL), walks this lineage end to end, runs nine SPARQL queries, and gates a release against SHACL — so every Turtle, SPARQL, and SHACL snippet in this book is a true excerpt of a dataset that actually runs.
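To make "walks this lineage end to end" concrete, here is a minimal plain-Python sketch of the walk. It is not the companion validate.py. The six asserted edges are copied from the excerpt above; the three bridging edges are assumptions added so the chain connects, and they skip intermediates (such as the viral-inactivated and viral-filtered pools) that later chapters insert. The one-parent-per-node dictionary is also a simplification.

```python
# Immediate-parent edges (child -> parent) taken from the instances.ttl excerpt.
ASSERTED = {
    "bp:WCB-CHO-001": "bp:MCB-CHO-001",
    "bp:SEED-001": "bp:SEEDFLASK-001",
    "bp:BATCH-2026-001": "bp:SEED-001",
    "bp:PApool-001": "bp:CLAR-001",
    "bp:DS-001": "bp:POLpool-001",
    "bp:DP-001": "bp:DS-001",
}
# ASSUMPTION: bridging edges not shown in the excerpt, simplified for illustration.
ASSUMED = {
    "bp:SEEDFLASK-001": "bp:WCB-CHO-001",
    "bp:CLAR-001": "bp:BATCH-2026-001",
    "bp:POLpool-001": "bp:PApool-001",
}


def lineage(node, parents):
    """Walk child -> parent edges to the root and return the ancestors in order."""
    chain, seen = [], {node}
    while node in parents:
        node = parents[node]
        if node in seen:  # guard against a cycle in bad data
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        chain.append(node)
    return chain


parents = {**ASSERTED, **ASSUMED}
for ancestor in lineage("bp:DP-001", parents):
    print(ancestor)
```

With only the asserted edges the walk stops at the polishing intermediate, because nothing in the excerpt says where that came from. That is the practical meaning of a gap in the genealogy: the query still runs, it just ends early.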
What you should already half-know (and where to get it)
This is the fourth book, and it stands on the three before it. You do not need them memorized — we reintroduce each idea as it becomes load-bearing — but it helps to know where each foundation was laid.
- The physical process — what a seed train, a bioreactor, a capture column, and a fill line actually do — is the subject of Biologic Drug Manufacturing, beginning with its bioprocessing overview. Every entity we model is a real thing from that book.
- The languages of meaning — RDF triples, OWL logic, SHACL constraints, the BFO upper ontology, the Industrial Ontologies Foundry, the FAIR principles — were introduced in Data Management's chapter on ontologies and FAIR data. This book assumes that chapter's vocabulary and goes deeper, step by step.
- The runnable graph — loading bioprocess CSVs into RDF with RDFLib, walking lineage with a SPARQL property path, gating it with SHACL — is built in real code in the knowledge-graph chapter of Open-Source Bioprocess Data Systems. Where this book says "the loader writes this triple," that is the file it means.
So a single fact threads all four books: a probe in the production bioreactor measures a temperature (Book 1); it becomes a tagged data point with a unit and a quality flag (Book 2); it lands as a ts.sensor_reading row joined to its batch (Book 3); and here it becomes an RDF triple whose predicate is a shared ontology property and whose value carries its unit, so any system can read it without being told what it means (Book 4).
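The last step of that journey, from stored row to triple, can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the Reading columns stand in for a ts.sensor_reading row whose real schema is not shown in this preface, bp:cultureTemperature is a hypothetical shared term, unit:DEG_C is written in the style of a QUDT unit identifier, and 36.8 is a made-up value.

```python
from dataclasses import dataclass

# ASSUMPTION: each system's private field name maps to one shared term.
# bp:cultureTemperature is a hypothetical name.
SHARED_TERM = {
    "BR101.Temp.PV": "bp:cultureTemperature",  # historian tag
    "culture_temp": "bp:cultureTemperature",   # lab field
}


@dataclass(frozen=True)
class Reading:
    """One row in the style of ts.sensor_reading (column names are assumed)."""
    batch_id: str
    field: str
    value: float
    unit: str


def to_triple(row: Reading):
    """Turn a stored row into (subject, predicate, (value, unit))."""
    if not row.unit:
        raise ValueError("refusing to emit a value without its unit")
    return (f"bp:{row.batch_id}", SHARED_TERM[row.field], (row.value, row.unit))


# 36.8 is an illustrative value, not data from the running example.
historian = Reading("BATCH-2026-001", "BR101.Temp.PV", 36.8, "unit:DEG_C")
lab = Reading("BATCH-2026-001", "culture_temp", 36.8, "unit:DEG_C")
print(to_triple(historian))
print(to_triple(historian) == to_triple(lab))
```

Two rows that arrived under two private names come out as the same fact, and a row with no unit is rejected before it can become an uninterpretable number.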
Throughout, we model the standard commercial way of making a monoclonal antibody, because it is the clearest teaching case, and point out where a modern continuous or perfusion process would change the model. The ontology does not care which you run — that flexibility is part of the point.
How to read this book: eight parts, following the medicine
The book is one continuous argument, told in eight parts that trace the batch from cell bank to patient, step back to assemble the whole graph, then survey how the real industry uses ontologies today before delivering an honest verdict.
- Part I — Foundations of the model. The small kit every later chapter uses: the domain-neutral upper ontology that keeps everyone's terms compatible, the classes and relations and axioms that make a vocabulary, and the identifiers and units that keep a value from ever travelling bare.
- Part II — Discovery and development, modeled. Where the entities are first named: the target, the molecule, the cell line, the design space, the analytical methods, the recipe that ships to the plant.
- Part III — Upstream, modeled. The seed train and the production bioreactor, where the genealogy chain begins and a process (an occurrent) must be modeled as carefully as a thing.
- Part IV — Downstream, modeled. Capture, viral safety, polishing, and the drug-substance lot — a sequence of material transformations the graph must keep in order.
- Part V — Fill-finish and release, modeled. The drug-product lot, the specification and the release gate, serialization, and the cold chain to the patient.
- Part VI — The whole graph. Assembling the full digital thread, governing the model so it does not rot, and measuring whether the data is actually FAIR.
- Part VII — Ontologies in industry today. Stepping out of the running example to survey the real ecosystem: the standards bodies and consortia that build the shared vocabulary, the specific ontologies and controlled vocabularies actually in use, the commercial platforms and knowledge-graph vendors, the enterprise graphs that big pharma genuinely runs, the regulatory semantics that are already mandated, the shop floor and the digital twin where formal ontologies are still arriving, and the convergence of ontologies with AI.
- Part VIII — The verdict. An honest reckoning of what modeling the bioprocess as knowledge genuinely achieves, what it leaves to people, and when it is worth doing.
A thread runs through all eight: the FAIR principles, the widely adopted standard that good data should be Findable, Accessible, Interoperable, and Reusable — usable by machines with minimal human help [3]. Ontologies are how a process's data becomes FAIR in fact rather than merely in claim; keep FAIR in mind, because we return to it, and to the uncomfortable gap between the two, repeatedly.
A few conventions
These appear on every page, so it is worth knowing them once.
- Citations. Inline markers like [3] link to the References page. The visible number is local to each chapter and restarts at [1] every chapter.
- Key terms. Each chapter ends with a short glossary of the terms it introduced, so you never have to scroll back.
- Code is real where it claims to be. Turtle, SPARQL, and SHACL shown as the runnable artifact of a chapter is the same shape the open-source companion executes; anything shown as illustrative configuration (a production triplestore, an imported upper ontology) is labelled as such.
- Admonitions. Coloured boxes flag the asides: a tip for the plain-English analogy near the top, a note for useful context, and a caution where a misunderstanding could genuinely mislead.
- Bilingual. The book is published in English and Korean (한국어), so a reader can follow it in either language.
- Trademarks. Product and standards-body names mentioned here (including but not limited to W3C, ISO/IEC, the OBO Foundry, the Industrial Ontologies Foundry, Allotrope Foundation, QUDT.org, Stanford's Protégé, Apache Jena, Ontotext GraphDB, Amazon Neptune) belong to their respective owners and are used for identification and editorial purposes only, with no claim of endorsement.
This book teaches how to think about modeling a regulated process as knowledge. It is not regulatory advice, and an ontology is not a validated system. Real manufacturing decisions must follow current official guidance and your organization's approved, validated procedures — a point Part VI returns to in full.
The unsolved part
A model that is correct on paper can still be hollow
It would be tidy to promise that ontologies simply solve the meaning problem. The honest truth is harder, and naming it now sets the standard the rest of the book is held to. An ontology is only as good as the discipline that authors it. A plant can adopt every standard — RDF on the wire, BFO at the top, SHACL at the gate — and still produce a graph that lies, because a human confidently labeled a field with a plausible-but-wrong term, or two teams coined two predicates for one concept, or a value loaded without its unit and quietly became uninterpretable.
The measurable version of this gap is sobering. When researchers scored real-world datasets against FAIR, nearly all were findable but only a minority reached even moderate interoperability, because their metadata was authored by hand without a controlled vocabulary [3] — present in form, hollow in fact. Biomanufacturing inherits that gap directly. So this book treats the ontology not as a finished artifact you install but as a practice you sustain: aligned to shared upper ontologies, authored once under change control, and gated on every load. Where the field genuinely still struggles — getting the metadata authored on the plant floor to be controlled, machine-checkable, and FAIR in fact — we will say so plainly rather than pretend the standard is the solution.
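What "gated on every load" might look like can be sketched in plain Python. This is a stand-in for the SHACL gate the book actually uses, not a reproduction of it. The vocabulary sets, the record layout, bp:cultureTemperature, bp:culture_temp_C, the unit identifiers, and the temperature value are all assumptions made for illustration.

```python
# ASSUMPTION: a tiny controlled vocabulary; a real gate would use SHACL shapes.
CONTROLLED_PREDICATES = {"bp:derivedFrom", "bp:monomerPct", "bp:cultureTemperature"}
MEASUREMENT_PREDICATES = {"bp:monomerPct", "bp:cultureTemperature"}


def gate(records):
    """Return readable violations; an empty list means the load may proceed."""
    violations = []
    for i, rec in enumerate(records):
        pred = rec.get("predicate")
        if pred not in CONTROLLED_PREDICATES:
            violations.append(f"record {i}: {pred!r} is not in the controlled vocabulary")
        if pred in MEASUREMENT_PREDICATES and not rec.get("unit"):
            violations.append(f"record {i}: measurement {pred!r} has no unit")
    return violations


incoming = [
    # Passes: known predicate, value with a unit (the unit is assumed).
    {"subject": "bp:BATCH-2026-001", "predicate": "bp:monomerPct",
     "value": 98.611, "unit": "unit:PERCENT"},
    # Fails: the value arrived bare, with no unit.
    {"subject": "bp:BATCH-2026-001", "predicate": "bp:cultureTemperature",
     "value": 36.8},
    # Fails: a second, privately coined predicate for an existing concept.
    {"subject": "bp:BATCH-2026-001", "predicate": "bp:culture_temp_C",
     "value": 36.8, "unit": "unit:DEG_C"},
]
for problem in gate(incoming):
    print(problem)
```

A gate like this catches the bare value and the rogue predicate. It cannot catch the first failure named above, a plausible-but-wrong term chosen confidently by a human, which is why the discipline of authoring matters as much as the check.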
Why it matters
If you remember one thing from this preface, make it this: in a regulated process, the difference between data you have and knowledge you can act on is a model of meaning. Without it, every cross-system question — what did this lot derive from? which other batches share its cell bank? what parameters drove this attribute? — is a fresh archaeology project, answered by someone exporting spreadsheets and matching identifiers by hand. With it, those questions become queries. That is not a convenience; in an investigation, it is the difference between a scoped deviation and a blind, campaign-wide quarantine. The model is what turns a heap of true facts into something you can reason over under pressure.
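As a hedged illustration of "those questions become queries," the sketch below answers two of them over a handful of derivedFrom edges in plain Python. Only the DP-001 to DS-001 edge is taken from the running example as written. DP-004's parent, the shortened edges that skip intermediates, and the second batch BATCH-2026-002 are assumptions invented for the example.

```python
# (child, parent) derivedFrom edges.
EDGES = [
    ("bp:DP-001", "bp:DS-001"),               # from the running example
    ("bp:DP-004", "bp:DS-001"),               # ASSUMED parent of the out-of-spec sibling
    ("bp:DS-001", "bp:BATCH-2026-001"),       # simplified: intermediates skipped
    ("bp:BATCH-2026-001", "bp:WCB-CHO-001"),  # simplified: seed train skipped
    ("bp:BATCH-2026-002", "bp:WCB-CHO-001"),  # HYPOTHETICAL second batch
]

# Index the edges from parent to children so we can walk downstream.
children = {}
for child, parent in EDGES:
    children.setdefault(parent, set()).add(child)


def descendants(node):
    """Everything that derives, directly or transitively, from node."""
    found, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


# What else shares the drug-substance lot of the out-of-spec DP-004?
print(sorted(descendants("bp:DS-001") - {"bp:DP-004"}))

# Which batches share this cell bank?
print(sorted(n for n in descendants("bp:WCB-CHO-001") if n.startswith("bp:BATCH")))
```

The first answer is what scopes a deviation: it names the lots that actually share material with the failing one, instead of leaving the whole campaign under suspicion.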
In the real world
This is not a thought experiment waiting for adopters. The languages are settled W3C standards running production knowledge graphs across industries; the upper ontology is a published ISO/IEC standard; and the council-governed model of building interoperable ontologies has worked at scale in the life sciences since the mid-2000s, when the OBO Foundry showed that biomedical ontologies built to shared design rules interlock instead of overlap [4]. Manufacturing copied the lesson with the Industrial Ontologies Foundry, and its biopharma working groups released manufacturing ontologies aimed at exactly the kind of monoclonal-antibody line this book models. On the analytical side, the Allotrope ontologies already give laboratory results one vendor-neutral meaning. The pieces exist, they are standardized, and they are converging — which is precisely why it is worth learning to use them well.
Key terms
- Ontology — a formal, shared, machine-readable model of what exists in a domain and how it relates; the dictionary-and-grammar that lets different systems mean the same thing.
- Knowledge graph — a web of interconnected facts (rather than rows in isolated tables) built from an ontology's classes and relations.
- Quality by Design (QbD) — the framework that treats recorded process understanding — the links between critical parameters and quality attributes — as essential to the product.
- Critical process parameter (CPP) / critical quality attribute (CQA) — a process setting that must be controlled because it affects quality; a measurable product trait that must stay within limits.
- derivedFrom — the genealogy relation, used throughout, linking a child material or lot to the parent it came from; the edge that makes the batch's lineage walkable.
- FAIR principles — the standard that data should be Findable, Accessible, Interoperable, and Reusable by machines; FAIR in fact is harder than FAIR in claim.
- Upper (foundational) ontology — a small, domain-neutral vocabulary of the most general categories, on which domain ontologies are built so they stay compatible.
- References page — the single page where every inline citation marker resolves to its source.
Where this leads
We have claimed that a stored fact becomes knowledge only when its meaning is modeled. The next chapter, The Upper Spine: Continuants, Occurrents, and Why Everyone Builds on BFO, starts at the very top of the model — the small, domain-neutral set of categories that everything else hangs from — and shows why anchoring a Bioreactor, a CellCultureProcess, and a MonomerPurity result to one shared spine is what keeps the cell-culture team's vocabulary and the lab's vocabulary from drifting into two private dialects all over again.