Preface — Teaching the Process to Mean Something

📍 Where we are: Right at the door of the fourth book. Before we model a single step, let us agree on what this book is really about — turning a process that is made and recorded into a process the machine can reason about.

Welcome. This is a book about meaning. The first three books in this series taught how a biologic medicine is physically made, how the data it sheds is managed, and how to build the open-source software that holds that data. This book asks the next question: once every reading, result, and record exists, how do we make a computer understand what they mean — well enough to walk from a vial in a patient's hand back to the cell bank it grew from, in a single query, without a human translating between systems at every hop?

The answer is an ontology: a formal, shared, machine-readable model of what exists in the bioprocess and how it all connects. We will build that model the way the medicine itself is built — one step at a time, from the cell bank to the filled vial — and watch the separate facts of each step click into one navigable knowledge graph.

You need no background in logic or semantics to read this. If you have never written a triple, never opened an ontology editor, never heard of OWL, you are exactly the reader we wrote for. We define every specialized word the first time it appears, and again in a Key terms box at the end of each chapter.

The simple version

Imagine every department in a factory keeps a perfect notebook — but each in its own private shorthand. The cell-culture team writes "BR-101," the lab writes "Lot 26-001," the warehouse writes material "1000457," and all three mean the same batch. Everything is recorded; nothing connects. An ontology is the shared dictionary and grammar that says exactly what a "batch," a "bioreactor," and a "result" are, and how they relate — so any person, or any machine, can read every notebook as one story. This book builds that dictionary for biomanufacturing, page by page, following the medicine through the plant.

What this chapter covers

A quick map of the preface: who this book is for and the promise we make; the central idea that the process becomes knowledge only when its meaning is modeled, not merely stored; the running example — one antibody batch — that threads every chapter; how this book stands on the three that came before it; the nine parts that walk the ontology-engineering lifecycle — from the requirements, through the model built on the running example, to the industry that builds it; and the small set of conventions you will see on every page.

The promise: every claim is traceable

This is a popular book in tone and a textbook in rigor. To keep those two honest, every non-obvious claim — every standard, every number, every "studies show" — carries a small bracketed marker like this [1]. Click it and you land on a single References page listing the exact standard, peer-reviewed paper, or regulatory document behind the statement. Where a claim can be checked, you can follow it to its source and check it.

The idea: a stored fact is not yet a known fact

Here is the conviction the whole book turns on.

Three books made the data; this book makes it mean

A modern batch leaves behind an enormous trail — sensor traces, lab results, batch records, signatures. The companion data book calls that trail the molecule's data shadow and shows how to manage it; the open-source book shows how to store it as concrete database rows. But stored is not the same as understood. A historian can hold ten million perfectly time-stamped numbers and still not know that the tag BR101.Temp.PV and the lab field culture_temp describe the same physical quantity, or that this drug-substance lot descends from that bioreactor run. The meaning that a human supplies by reading two screens side by side is exactly what does not travel between systems on its own.

An ontology is where that meaning is made to travel. It promotes the relationships a human keeps in their head — this result is about that batch; that batch derived from this seed train; this parameter is critical to that quality attribute — into first-class, machine-readable facts. Once they are facts, a computer can follow them, combine them, and check them. The process stops being a pile of records and becomes a body of knowledge.

Three systems on the plant floor each speak one of those private dialects. The process historian streams high-frequency tags (BR101.Temp.PV — the bioreactor-101 temperature process value, each with its engineering unit and a quality flag, typically over OPC UA); the MES (manufacturing execution system) / electronic batch record owns the genealogy and the operator signatures; and the LIMS (laboratory information management system) holds the sampled release results — the analytical tests run on a drawn sample to confirm the product is within spec, such as CE-SDS purity, SEC %monomer (the percentage of the molecule that is the intact, single antibody rather than a clump), and the glycan map (the profile of the sugar chains attached to it) — keyed to a sample identifier. Each is complete, and each is mute about the other two. The ontology is the grammar that lets one query read all three.

One physical measurement, carried across the series: a reading, then a data point, then a stored row, and finally a typed, unit-bearing value in a graph the machine can reason over. Original diagram by the authors, created with AI assistance.

Quality by Design is, secretly, an ontology

The industry has wanted this for a long time without using the word. Quality by Design (QbD) — the regulatory framework that treats recorded process understanding as part of the product — asks you to identify which process settings are critical process parameters (CPPs) and which measurable product traits are critical quality attributes (CQAs), and to capture the relationships linking them [2]. Read that again: QbD asks for a model of entities (parameters, attributes, materials, steps) and relations (this parameter affects that attribute). That is an ontology in everything but name. This book makes the name — and the machinery underneath it — explicit, so that "process understanding" becomes something a computer can hold, not just something written into a development report.

On this line those abstractions are concrete. The CPPs the model carries include bioreactor temperature and feed rate (the two a design-space query returns for monomer purity), and the validated low-pH viral-inactivation hold — pH 3.6 for 60 minutes, the step that holds the product briefly at acid pH to kill any enveloped virus. The CQAs are the release panel — the measured product traits, each with a limit it must stay within — with their real limits: SEC %monomer (the intact-antibody fraction, at least 95%), HMW aggregate (the high-molecular-weight clumps, at most 2.0%), charge-variant %main by CEX (60–80%), host-cell protein (residual protein from the CHO host cell, at most 100 ppm — parts per million), and protein concentration (45–55 mg/mL). QbD's promise — this parameter affects that attribute — is exactly the edge an ontology makes walkable.

The running example: one antibody, end to end

Rather than scatter toy examples, this book models one monoclonal antibody batch all the way through, the same batch the rest of the series follows. Its genealogy — its parent-to-child lineage, the same chain we later make walkable with the derivedFrom edge — is a chain we will turn into graph edges again and again (the CHO in the identifiers names the Chinese-hamster-ovary cell line the antibody is grown in):

a working cell bank WCB-CHO-001 seeds a seed train SEED-001, which inoculates the production bioreactor batch BATCH-2026-001, whose harvest becomes a Protein A capture pool PApool-001, purified into a drug-substance lot DS-001, filled into a drug-product lot DP-001.

Each arrow in that sentence becomes one derivedFrom edge. The six-edge chain above is the headline lineage; each downstream chapter threads in the per-unit-operation intermediate the material actually passes through (the clarified harvest, the viral-inactivated and viral-filtered pools, the polishing intermediate), and because derivedFrom is transitive, DS-001 still traces all the way back to the cell bank in a single walk. The batch carries a release result — its size-exclusion %monomer purity, 98.611, recorded with the bp:monomerPct datatype property (the bp: prefix is our illustrative namespace, https://example.org/bioproc#, the home address shared with the open-source book, so a class named here is the same class there). It is a convenience scalar; its fully unit-qualified form is the qudt:QuantityValue individual bp:DS-001-monomer (a preview we unpack in Part IV: there we distinguish the bare number from its unit-bearing twin and from the attribute's kind, the distinct class bp:MonomerContent — no need to parse the difference yet). The campaign also includes a deliberately failed sibling, DP-004: its monomer is fine (98.687%), but its high-molecular-weight aggregate is 2.41% against an at-most-2.0% release limit, so it goes out of spec on a different CQA, a failure the SHACL release gate catches.

Because DP-004 grew from the same working cell bank WCB-CHO-001 through a separate seed train and batch, one lineage walk answers the question an investigator actually asks during that deviation: which other lots share this cell bank? By the final chapters, that one sentence will be a graph you can query, validate, and trust.

The medicine's journey, redrawn as a graph: every handoff between steps is a derivedFrom edge, and an in-process monomer result is recorded on the batch (the release CQA's faithful home is the drug-substance lot DS-001) — the shape this whole book builds toward. Original diagram by the authors, created with AI assistance.

And this is not a thought experiment. That whole sentence is real, loadable RDF in the companion repository (examples/platform/ontology/) — here is its spine, written in Turtle, the standard plain-text notation for RDF triples. You do not need to read Turtle yet; just know that each line states a subject and one or more predicate object facts about it, a is shorthand for "is a" (the subject's type), and "98.611"^^xsd:float is a value tagged with its datatype (here, a floating-point number):

# instances.ttl — the running example as RDF (bp:derivedFrom is an owl:TransitiveProperty).
# These are the real, asserted immediate-parent edges; each downstream chapter inserts the
# per-unit-operation intermediate it runs through (clarified harvest, the viral-inactivated
# and viral-filtered pools, the polishing intermediate), and because derivedFrom is
# transitive, DS-001 still traces all the way back to the cell bank in a single walk.
bp:WCB-CHO-001    a bp:WorkingCellBank ; bp:derivedFrom bp:MCB-CHO-001 .  # the master cell bank above it
bp:SEED-001       a bp:SeedBioreactorCulture ; bp:derivedFrom bp:SEEDFLASK-001 .
bp:BATCH-2026-001 a bp:Batch ; bp:derivedFrom bp:SEED-001 ;   # the batch material (the vessel BR-101 is a separate node)
                  bp:monomerPct "98.611"^^xsd:float .
bp:PApool-001     a bp:CapturePool ; bp:derivedFrom bp:CLAR-001 .       # via the clarified harvest
bp:DS-001         a bp:DrugSubstance ; bp:derivedFrom bp:POLpool-001 .  # via the polishing intermediate
bp:DP-001         a bp:DrugProduct ; bp:derivedFrom bp:DS-001 .

A small script (validate.py) loads all 2,120 triples, reasons over them — automatically deriving the facts that logically follow (for example, that because derivedFrom is transitive, DS-001 derives from the cell bank), which grows the graph from 2,120 to 7,137 triples under the OWL-RL reasoning rules — and then runs the 23 competency questions that make up this ontology's requirements — each as a pass/fail acceptance test, the centerpiece of Part I. Every Turtle, SPARQL, and SHACL snippet in this book is a true excerpt of that dataset, and every number is what the harness actually prints.

Those 23 questions are not a loose test list: together they form an Ontology Requirements Specification Document (ORSD) in the sense of NeOn (one of the established named methodologies for engineering ontologies) — purpose, scope, intended uses, and the functional requirements a finished ontology must satisfy, each written as a competency question (a CQ: a question the finished model must be able to answer). Borrowing the test-first discipline of another such methodology, SAMOD, every CQ is also its own acceptance test — the requirements are the tests (requirements == tests): CQ-04 — when a drug-product lot fails, which other drug products share its lineage? — is the exact question the out-of-spec sibling DP-004 is in the dataset to answer, and it is green only when the graph actually returns the cell-bank siblings. Part I builds that catalog; Part VI runs it as live SPARQL.

What you should already half-know (and where to get it)

This is the fourth book, and it stands on the three before it. You do not need them memorized — we reintroduce each idea as it becomes load-bearing — but it helps to know where each foundation was laid.

The physical process — what a seed train, a bioreactor, a capture column, and a fill line actually do — is the subject of Biologic Drug Manufacturing, beginning with its bioprocessing overview. Every entity we model is a real thing from that book.
The languages of meaning — RDF triples, OWL logic, SHACL constraints, the BFO upper ontology, the Industrial Ontologies Foundry, the FAIR principles — were introduced in Data Management's chapter on ontologies and FAIR data. This book assumes that chapter's vocabulary and goes deeper, step by step.
The runnable graph — loading bioprocess CSVs into RDF with RDFLib, walking lineage with a SPARQL property path, gating it with SHACL — is built in real code in Open-Source Bioprocess Data Systems's knowledge-graph chapter. Where this book says "the loader writes this triple," that is the file it means.

So a single fact threads all four books: a probe in the production bioreactor measures a temperature (Book 1); it becomes a tagged data point with a unit and a quality flag (Book 2); it lands as a ts.sensor_reading row joined to its batch (Book 3); and here it becomes an RDF triple — the field's atomic unit of fact, a subject–predicate–object statement (here, this reading — has-value — that number) — whose predicate is a shared ontology property and whose value carries its unit, so any system can read it without being told what it means (Book 4).

note

Throughout, we model the standard commercial way of making a monoclonal antibody, because it is the clearest teaching case, and point out where a modern continuous or perfusion process would change the model. The ontology does not care which you run — that flexibility is part of the point.

How to read this book: nine parts, walking the lifecycle

The book is one continuous argument. It is organized the way an ontology is actually engineered — by the lifecycle: first specify what the model must answer, then reuse what already exists, then conceptualize, formalize, implement, and validate, then publish and maintain — and the running antibody campaign is the worked example threaded through every phase. It closes by surveying how the real industry uses ontologies today, then delivers an honest verdict. Nine parts:

Part I — Specification and requirements. The discipline every other part answers to: the competency questions the finished model must answer, written as an Ontology Requirements Specification Document and made executable, plus the one campaign and the proof harness the whole book is checked against.
Part II — Reuse: standing on what exists. Before minting anything, survey and align to what the field already built: the domain-neutral upper ontology (BFO) that keeps everyone's terms compatible, and the IOF and OBO ontologies and controlled vocabularies you reuse rather than reinvent.
Part III — Conceptualization. Naming the model's content on the running campaign: the classes and taxonomy that sit under the upper ontology, and the relations — derivedFrom genealogy chief among them — that turn a list of terms into a connected model.
Part IV — Formalization. Making the model bite: the OWL axioms and restrictions that turn description into enforceable constraint, and the identifiers and units (IRIs, QUDT) that keep a value from ever travelling bare.
Part V — Implementation. Filling the model with the campaign: building the instance graph of the one antibody batch, and the ETL that carries data off the wire (OPC UA, B2MML, the ELN) into the graph.
Part VI — Validation. Proving it answers: the competency questions run as executable SPARQL queries, and the release gate enforced as SHACL — the test-first loop that keeps all 23 questions green.
Part VII — Maintenance, publication, and FAIR. Keeping it alive and shareable: governing the model so it does not rot under change, publishing the finished vocabulary, and measuring whether the data is actually FAIR.
Part VIII — Ontologies in industry today. Stepping out of the running example to survey the real ecosystem: the standards bodies and consortia that build the shared vocabulary, the specific ontologies and controlled vocabularies actually in use, the commercial platforms and knowledge-graph vendors, the enterprise graphs that big pharma genuinely runs, the regulatory semantics that are already mandated, the shop floor and the digital twin where formal ontologies are still arriving, and the convergence of ontologies with AI.
Part IX — The verdict. An honest reckoning of what modeling the bioprocess as knowledge genuinely achieves, what it leaves to people, and when it is worth doing.

A thread runs through all nine: the FAIR principles, the widely adopted standard that good data should be Findable, Accessible, Interoperable, and Reusable — usable by machines with minimal human help [3]. Ontologies are how a process's data becomes FAIR in fact rather than merely in claim; keep FAIR in mind, because we return to it, and to the uncomfortable gap between the two, repeatedly.

A few conventions

These appear on every page, so it is worth knowing them once.

Citations. Inline markers like [3] link to the References page. The visible number is local to each chapter and restarts at [1] every chapter.
Key terms. Each chapter ends with a short glossary of the terms it introduced, so you never have to scroll back.
Code is real where it claims to be. Turtle, SPARQL, and SHACL shown as the runnable artifact of a chapter is the same shape the open-source companion executes; anything shown as illustrative configuration (a production triplestore — a database built to hold RDF triples — or an imported upper ontology) is labelled as such.
Admonitions. Coloured boxes flag the asides: a tip for the plain-English analogy near the top, a note for useful context, and a caution where a misunderstanding could genuinely mislead.
Bilingual. The book is published in English and Korean (한국어), so a reader can follow it in either language.
Trademarks. Product and standards-body names mentioned here (including but not limited to W3C, ISO/IEC, the OBO Foundry, the Industrial Ontologies Foundry, Allotrope Foundation, QUDT.org, Stanford's Protégé, Apache Jena, Ontotext GraphDB, Amazon Neptune) belong to their respective owners and are used for identification and editorial purposes only, with no claim of endorsement.

Caution

This book teaches how to think about modeling a regulated process as knowledge. It is not regulatory advice, and an ontology is not a validated system. Real manufacturing decisions must follow current official guidance and your organization's approved, validated procedures — a point Part VII returns to in full.

The unsolved part

A model that is correct on paper can still be hollow

It would be tidy to promise that ontologies simply solve the meaning problem. The honest truth is harder, and naming it now sets the standard the rest of the book is held to. An ontology is only as good as the discipline that authors it. A plant can adopt every standard — RDF on the wire, BFO at the top, SHACL at the gate — and still produce a graph that lies, because a human confidently labeled a field with a plausible-but-wrong term, or two teams coined two predicates for one concept, or a value loaded without its unit and quietly became uninterpretable.

The measurable version of this gap is sobering. When researchers scored real-world datasets against FAIR, nearly all were findable but only a minority reached even moderate interoperability, because their metadata was authored by hand without a controlled vocabulary [3] — present in form, hollow in fact. Biomanufacturing inherits that gap directly. So this book treats the ontology not as a finished artifact you install but as a practice you sustain: aligned to shared upper ontologies, authored once under change control, and gated on every load. Where the field genuinely still struggles — getting the metadata authored on the plant floor to be controlled, machine-checkable, and FAIR in fact — we will say so plainly rather than pretend the standard is the solution.

There is a quieter honesty owed up front about this very companion. The graph aligns to BFO, IOF, and the OBO ontologies — it asserts that bp:Material is a BFO material entity and bp:Equipment is also an IOF piece of equipment — but it does not owl:imports those upstream ontologies wholesale (owl:imports is the instruction that pulls another ontology's full contents in), and the harness loads only the local files. So a reasoner run here gets our local axioms plus those alignment edges (the asserted "our term is their term" links), not the upstream logic that would come with a full import — BFO's disjointness rules (which say certain categories can never overlap) or IOF's own upward chain of parent classes leading to the BFO process. That distinction — asserted alignment for shared vocabulary versus full import for cross-ontology entailment (the new conclusions a reasoner could then draw) — is not pedantry: it is the difference between a graph another IOF-importing tool can line up with and one that already carries the upstream logic. Part II makes the choice explicit and shows what each one buys.

Why it matters

If you remember one thing from this preface, make it this: in a regulated process, the difference between data you have and knowledge you can act on is a model of meaning. Without it, every cross-system question — what did this lot derive from? which other batches share its cell bank? what parameters drove this attribute? — is a fresh archaeology project, answered by someone exporting spreadsheets and matching identifiers by hand. With it, those questions become queries. That is not a convenience; in an investigation, it is the difference between a scoped deviation and a blind, campaign-wide quarantine. And the graph already carries the safety case a reviewer would demand: two orthogonal viral-clearance barriers — one inactivating enveloped virus chemically (the low-pH hold), the other removing virus physically by size (virus-retentive nanofiltration) — and because the two mechanisms are independent, their validated log-reduction values (LRV, the base-ten log of how many-fold each step clears virus) sum across the process exactly as the regulators expect (4.5 plus 4.2 equals a total of 8.7 LRV), a fact the harness checks rather than merely asserts. The model is what turns a heap of true facts into something you can reason over under pressure.

In the real world

This is not a thought experiment waiting for adopters. The languages are settled W3C standards running production knowledge graphs across industries; the upper ontology is a published ISO/IEC standard — the Basic Formal Ontology (BFO 2020), standardized as ISO/IEC 21838-2, which draws the model's most general line, between continuants that persist (a bioreactor, a lot) and occurrents that happen (a cell-culture run), with the Industrial Ontologies Foundry (IOF) Core one level below it supplying manufacturing-general terms (equipment, process, information content) that our biopharma classes specialize — that is, define as more specific sub-types of, the way bp:Bioreactor is a particular kind of IOF equipment; and the council-governed model of building interoperable ontologies has worked at scale in the life sciences since the mid-2000s, when the OBO Foundry showed that biomedical ontologies built to shared design rules interlock instead of overlap [4]. Manufacturing copied the lesson with the Industrial Ontologies Foundry, and its biopharma working groups released manufacturing ontologies aimed at exactly the kind of monoclonal-antibody line this book models. On the analytical side, the Allotrope ontologies already give laboratory results one vendor-neutral meaning. The pieces exist, they are standardized, and they are converging — which is precisely why it is worth learning to use them well.

Key terms

Ontology — a formal, shared, machine-readable model of what exists in a domain and how it relates; the dictionary-and-grammar that lets different systems mean the same thing.
Knowledge graph — a web of interconnected facts (rather than rows in isolated tables) built from an ontology's classes and relations.
RDF triple — the atomic fact of a knowledge graph: a subject, a predicate (the named relationship), and an object (a value or another thing) — for example DS-001 — derivedFrom — PApool-001. RDF, OWL, and SHACL are introduced in Data Management; here we put them to work.
Quality by Design (QbD) — the framework that treats recorded process understanding — the links between critical parameters and quality attributes — as essential to the product.
Critical process parameter (CPP) / critical quality attribute (CQA) — a process setting that must be controlled because it affects quality; a measurable product trait that must stay within limits.
derivedFrom — the genealogy relation, used throughout, linking a child material or lot to the parent it came from; the edge that makes the batch's lineage walkable.
FAIR principles — the standard that data should be Findable, Accessible, Interoperable, and Reusable by machines; FAIR in fact is harder than FAIR in claim.
Upper (foundational) ontology — a small, domain-neutral vocabulary of the most general categories, on which domain ontologies are built so they stay compatible.
References page — the single page where every inline citation marker resolves to its source.

Where this leads

We have claimed that a stored fact becomes knowledge only when its meaning is modeled. The next chapter, Specification: Competency Questions and the ORSD, starts where disciplined ontology actually starts — not with classes, but with the questions the finished model must be able to answer. It writes those competency questions down as an executable requirements document, the ORSD, and turns them into 23 pass/fail acceptance tests that validate.py runs and gates the build on — so that every later modeling choice is judged by whether it serves a real question rather than by how elegant it looks.

What this chapter covers​

The promise: every claim is traceable​

The idea: a stored fact is not yet a known fact​

Three books made the data; this book makes it mean​

Quality by Design is, secretly, an ontology​

The running example: one antibody, end to end​

What you should already half-know (and where to get it)​

How to read this book: nine parts, walking the lifecycle​

A few conventions​

The unsolved part​

A model that is correct on paper can still be hollow​

Why it matters​

In the real world​

Key terms​

Where this leads​