Ontologies and FAIR Data

📍 Where we are: The last chapter showed why numbers fail to connect even when they transmit perfectly; this one introduces the two deepest tools that fix it — ontologies and the FAIR principles.

In the previous chapter, Why Numbers Don't Connect: The Semantic Interoperability Problem, we drew a hard line between two ideas. Syntactic interoperability means two systems agree on format — the message parses, the fields line up, the bytes arrive intact. Semantic interoperability means they agree on meaning — both ends understand that a field labeled pH in one machine and a field labeled pH_value in another name the very same measured quantity. We saw that even flawless byte-transfer leaves a swamp of heterogeneity: the same real-world thing gets described differently everywhere — different units, different identifiers, different timestamp formats, different vocabularies. This chapter is about the durable cure for that swamp. Not another adapter that translates one private dialect into another, but a shared model of meaning that every system can point to.

The simple version

Think of a library before catalogs. Every librarian shelves books by their own private logic, so finding anything means asking the one person who shelved it. An ontology is the agreed-upon catalog system — it says exactly what a "book," an "author," and a "subject" are, and how they relate — so anyone, or any machine, can find and combine things without a human translator. FAIR is the promise that the catalog actually works: that data is easy to find, get, combine, and reuse. Ontologies build the catalog; FAIR is the service guarantee.

What this chapter covers

We build an ontology from scratch — classes, relations, and the languages that express them (RDF, OWL, SHACL). We then climb to the upper ontologies that let separate fields reuse one another's work (BFO and the Industrial Ontologies Foundry), descend into the biopharma domain ontologies and the council that governs them, unpack the FAIR principles one by one, and finish by showing how the two together turn siloed files into a single queryable graph.

What an ontology actually is

Classes, instances, relations, and axioms

Strip away the intimidating word and an ontology is a formal, shared, machine-readable model of what exists in a domain and how it relates [3]. It has a small number of moving parts.

A class is a category of thing: Bioreactor, CellCultureProcess, pH Measurement. An instance (or individual) is one concrete member of a class: bioreactor BR101 is an instance of Bioreactor. A relation (or property) connects things: BR101 is part of Upstream Suite 2; a pH Measurement is about a particular batch. Finally, axioms are logical statements that constrain the model so a computer can reason over it, for example "every CellCultureProcess has participant some LivingCell." Here "some" is a precise logical keyword meaning at least one, not a vague amount. Classes name the kinds, relations wire them together, and axioms make the wiring provable rather than merely suggestive: a reasoner can mechanically derive new facts from them and flag contradictions [3].

This is the leap past the previous chapter. A spreadsheet column header named pH is a label a human happens to recognize. An ontology class named pH Measurement, with axioms saying it measures hydrogen-ion activity on a defined scale, is something a machine can recognize and act on without being hand-told.

A small fact network: the classes Bioreactor and pH Measurement each have a concrete instance (BR101 and a pH reading of 7.02) linked by an 'instance of' arrow; relations then connect the instances — BR101 is part of Upstream Suite 2, the pH reading is about BATCH-2026-001, and that batch was produced by BR101. Classes name kinds of things, instances are concrete members, and relations connect them into a small fact network a computer can follow. Original diagram by the authors, created with AI assistance.

The languages: RDF triples, OWL logic, SHACL constraints

How is such a model written down so any system can read it? The foundation is RDF 1.1 — the Resource Description Framework (W3C, 2014), which represents every fact as a triple: subject – predicate – object, as in BR101 — isPartOf — Suite2 [7]. Each part is named with a globally unique web identifier (an IRI), so "BR101" here cannot be confused with someone else's "BR101" elsewhere. An IRI is written like a web address to guarantee global uniqueness, but it is a name, not a destination — it need not point at any live page you can open in a browser (the http://example.org/… below is a reserved placeholder that deliberately leads nowhere). Written out in the N-Triples serialization — one of several equivalent text formats for writing the same triple down — that single fact looks like this:

<http://example.org/BR101> <http://example.org/isPartOf> <http://example.org/Suite2> .

Stack millions of such triples together and they form a knowledge graph — a web of interconnected facts rather than rows in isolated tables [7]. Book 3 builds exactly this graph in runnable code: its knowledge-graph chapter loads bioprocess CSVs into RDF triples with RDFLib, where BATCH-2026-001 monomerPct 98.611 becomes one concrete row and DS-001 derivedFrom PApool-001 becomes one concrete edge — the data-point this chapter dissects, made physical.
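To make the idea tangible before any real tooling enters, here is a minimal sketch in plain Python that treats a graph as nothing more than a set of three-part tuples and answers a pattern query over it. It is not RDFLib and not a triplestore; the helper names match and to_ntriples are made up for this illustration, and the producedBy relation is borrowed from the figure description rather than from any published ontology.

```python
EX = "http://example.org/"  # reserved placeholder namespace, as in the text

# A graph is a set of (subject, predicate, object) tuples.
graph = {
    (EX + "BR101", EX + "isPartOf", EX + "Suite2"),
    (EX + "BATCH-2026-001", EX + "producedBy", EX + "BR101"),  # hypothetical predicate
}


def match(g, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in g
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def to_ntriples(triple):
    """Write one all-IRI triple in N-Triples form."""
    return "<{}> <{}> <{}> .".format(*triple)


# Which facts point at BR101 as their object?
for t in match(graph, o=EX + "BR101"):
    print(to_ntriples(t))
```

Leaving a slot as None turns it into a wildcard, which is the same idea a graph query language builds on.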

OWL 2, the Web Ontology Language (W3C, 2012), is the layer that adds the logic: it lets you state the classes, relations, and axioms above formally enough that automated reasoners can infer new facts and detect contradictions [8]. For example, once the location relation is declared transitive, from "BR101 is in Suite 2" and "Suite 2 is in Building 4" the reasoner concludes that BR101 is in Building 4.
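The Suite 2 and Building 4 inference above can be imitated in a few lines. The sketch below is a hand-rolled fixpoint loop in Python, not an OWL reasoner, and it assumes the location relation has been declared transitive (in OWL terms, an owl:TransitiveProperty). The predicate name locatedIn is hypothetical.

```python
def infer_transitive(triples, predicate):
    """Close a set of triples under transitivity of one predicate.

    Whenever (a, predicate, b) and (b, predicate, c) are present,
    (a, predicate, c) is added, until nothing new appears.
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        edges = [(s, o) for (s, p, o) in inferred if p == predicate]
        for a, b in edges:
            for b2, c in edges:
                if b == b2 and (a, predicate, c) not in inferred:
                    inferred.add((a, predicate, c))
                    changed = True
    return inferred


asserted = {
    ("BR101", "locatedIn", "Suite2"),
    ("Suite2", "locatedIn", "Building4"),
}
closed = infer_transitive(asserted, "locatedIn")
new_facts = closed - asserted
print(new_facts)  # the one fact nobody wrote down
```

The point of the exercise is the difference between asserted and closed: the extra fact was never stored, only derived.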

A common confusion is worth heading off. OWL is open-world — it assumes that what is not stated is merely unknown, not false — which is wrong for data validation, where a missing required field really is an error. That job belongs to SHACL, the Shapes Constraint Language, a W3C standard that checks an RDF graph against shapes — rules such as "every batch record must have exactly one approval signature" — and reports violations [9]. In short: OWL says what things mean; SHACL says what a valid record must contain.

Such a rule is not a hand-wave; it is written as a small block of Turtle that a validator runs. A SHACL shape (a closed-world rule expressed in RDF that says what a valid record must hold) gating the monomer purity of every batch put up for release reads like this:

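@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix bp:  <http://example.org/bp#> .   # placeholder namespace, assumed for illustration

# A SHACL shape: every released lot needs exactly one in-spec monomer result.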
bp:ReleaseShape a sh:NodeShape ;
    sh:targetClass bp:Batch ;
    sh:property [
        sh:path bp:monomerPct ;
        sh:minCount 1 ; sh:maxCount 1 ;     # present and singular — no cherry-picking a repeat
        sh:datatype xsd:float ;
        sh:minInclusive 95.0 ;              # at or above the release floor
        sh:message "Monomer purity is missing, duplicated, or below the 95.0 % release limit." ] .

The sh:minCount 1 is the closed-world half no reasoner supplies: it fails a lot whose monomer result is simply absent, the exact gap that opens when a LIMS integration silently drops a row. Book 4 builds this discipline out as a lifecycle: it starts from a competency question (CQ — a question, written before any modeling, that the finished ontology must be able to answer, such as does a released lot carry exactly one in-spec value for every required quality attribute?), then turns that question into the shape above as an executable acceptance test. Book 4's specification chapter catalogs twenty-three such questions, and its release-gate chapter shows the same ReleaseShape failing a real out-of-spec lot on exactly the one path that breached its limit — the gate this chapter sketches, run as code.
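To see what the shape's constraints actually check, here is a rough Python imitation of the same gate. It is not SHACL and uses no validator library; it simply re-states minCount, maxCount, datatype, and minInclusive as ordinary conditions. The function name check_release is made up, and batches 002 and 003 are invented test cases.

```python
RELEASE_FLOOR = 95.0  # mirrors sh:minInclusive in the shape above


def check_release(batch_id, triples):
    """Return violation messages for one batch; an empty list means it conforms."""
    values = [o for (s, p, o) in triples
              if s == batch_id and p == "bp:monomerPct"]
    violations = []
    if len(values) < 1:
        violations.append("minCount: monomerPct is missing")
    if len(values) > 1:
        violations.append("maxCount: monomerPct is duplicated")
    for v in values:
        if not isinstance(v, float):
            violations.append("datatype: monomerPct is not a float")
        elif v < RELEASE_FLOOR:
            violations.append("minInclusive: monomerPct below 95.0")
    return violations


triples = [
    ("bp:BATCH-2026-001", "bp:monomerPct", 98.611),
    ("bp:BATCH-2026-002", "bp:monomerPct", 93.2),       # hypothetical out-of-spec lot
    ("bp:BATCH-2026-003", "bp:approvedBy", "bp:QA-1"),  # hypothetical lot with no result
]

for batch in ("bp:BATCH-2026-001", "bp:BATCH-2026-002", "bp:BATCH-2026-003"):
    print(batch, check_release(batch, triples))
```

The third batch is the interesting one: nothing about it is wrong, the result is simply absent, and only a closed-world check reports that as a failure.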

Anatomy of one RDF triple

To see why a triple is the atom of all this, take one real fact off a release record — the monomer purity of a batch — and lay it out part by part. In a spreadsheet it is just a cell: 98.611. As a triple it becomes a self-describing statement that no human has to interpret.

The subject is the thing being described, named not by a local label but by an IRI: bp:BATCH-2026-001, a namespace prefix plus a local id that together form a globally unique web-style name (it looks like a web address and is guaranteed unique, but it need not point to a working website), so this batch cannot be confused with any other system's "BATCH-2026-001". The bp: prefix is shorthand for the batch ontology's namespace IRI, so bp:BATCH-2026-001 expands to a full IRI of the same shape as the http://example.org/… names in the N-Triples line earlier.

The predicate is the relation, and crucially it is not a private column header: it is bp:monomerPct, a property drawn from the shared ontology, so the lab system and the plant system that both emit this fact are pointing at the very same meaning. (The predicate is a property; the subject bp:BATCH-2026-001 is an instance of the class bp:Batch.)

The object is where the value lives, and it comes in two forms. As a typed literal it is never a bare number: it carries its datatype xsd:float, so a machine parses it as a number rather than text. To carry the unit too, the value sits in its own small quantity node, itself reached by an ordinary triple, so the unit rides along without breaking the three-part rule. That node pins the unit with a QUDT unit IRI, unit:PERCENT, so 98.611 unambiguously means 98.611 %, the exact normalization the previous chapter showed was missing from raw transfers.

Alternatively the object is itself an IRI edge — bp:derivedFrom → bp:SEED-001 — in which case the triple is a link in the genealogy chain rather than a measured value (the bioreactor batch derives from its seed-train lot). Stack three such triples about one subject and you have a Batch node with typed edges; a SHACL shape then gates it, checking it carries exactly one monomerPct before the batch may claim release.
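The anatomy above can be laid out as data. The following sketch expands prefixed names into full IRIs and writes one measured value as three triples through a quantity node. Only the XSD namespace and the QUDT unit namespace are real here; the bp: namespace and the bp:value and bp:unit predicates on the quantity node are placeholders invented for the illustration, not terms from a published ontology.

```python
# Prefix table. The bp: entry is a placeholder namespace (assumption).
PREFIXES = {
    "bp": "http://example.org/bp#",
    "unit": "http://qudt.org/vocab/unit/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand(curie):
    """Turn a prefixed name such as 'bp:Batch' into its full IRI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def quantity_triples(subject, predicate, value, unit, node_label):
    """Describe one measured value as three triples via a quantity node."""
    node = "_:" + node_label  # a blank node: a local helper node with no IRI
    literal = '"{}"^^<{}>'.format(value, expand("xsd:float"))
    return [
        (expand(subject), expand(predicate), node),   # batch -> quantity node
        (node, expand("bp:value"), literal),          # typed numeric value
        (node, expand("bp:unit"), expand(unit)),      # pinned unit IRI
    ]


triples = quantity_triples(
    "bp:BATCH-2026-001", "bp:monomerPct", 98.611, "unit:PERCENT", "q1"
)
for s, p, o in triples:
    print(s, p, o)
```

Each of the three lines is still an ordinary subject, predicate, object statement, which is how the unit travels with the value without bending the triple model.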

The genealogy chain is most vivid one step downstream, at the first purification unit operation. Protein A capture — the affinity-chromatography step that grabs the antibody out of the clarified harvest by its Fc stem and elutes it at low pH — produces a capture pool PApool-001, and that pool derivedFrom the clarified harvest, which derivedFrom the bioreactor batch. But capture also throws off measured triples of its own that the release record later needs to reconcile: a host-cell-protein (HCP — residual protein from the production cell line, an impurity) result cut by two-to-three log across the step, a leached Protein A value (ligand that sheds off the resin beads into the product, itself a tracked contaminant controlled to parts-per-million), and the operator's chosen pooling window (the slice of the elution peak, between two cut points, that was collected as product). Each is a quantity, so each needs a QUDT unit to be unambiguous — unit:NanoGM-PER-MilliGM for HCP and leached Protein A, not a bare ppm a downstream system might misread. Without a shared predicate and a pinned unit, the same capture-step HCP number arriving from the chromatography data system, the LIMS, and the batch record is exactly the swamp this chapter exists to drain — and the genealogy edge DS-001 derivedFrom PApool-001 is what later lets a recall query walk back from a failed drug-substance lot to the capture pool that may have carried the defect forward.

One release result as a triple: subject and predicate are globally unique IRIs, the object is either a QUDT-typed literal or an edge to another node, and stacked triples make a Batch node a SHACL shape can validate. Original diagram by the authors, created with AI assistance.

This is the same identity-card discipline the connectivity layer applies to a live tag: just as an OPC UA node hands back a value bundled with its quality, timestamp, and engineering unit instead of a bare number, an RDF triple refuses to let a value travel without its subject, its shared predicate, and its typed, unit-bearing object. The difference is reach: the OPC UA node carries identity across the wire to one reader; the triple carries identity into a graph any system can later query.

note

You do not hand-author triples any more than you write web pages in raw protocol. Ontologies are built and maintained by domain experts in specialized editors — Protégé (the free, open-source tool from Stanford that is the de facto standard for ontology authoring) being the most common, alongside the commercial TopBraid Composer and the open-source VocBench. The RDF/OWL/SHACL underneath is the interchange format, the way HTML is the format under a styled web page.

The ontology stack: upper, industrial, domain

Upper ontologies and the BFO spine

The problem with letting every field invent its own ontology is that a biologist's "process" and an engineer's "process" drift apart, and we are back to heterogeneity one level higher. The previous chapter named the upper ontology and BFO as the fix; here is how that fix actually works. The solution is an upper (or foundational) ontology: a small, domain-neutral vocabulary of the most general categories that everything falls under, such as things that endure through time versus things that happen, qualities, roles, and functions [3]. Build every domain ontology on the same spine and they become reusable and combinable by construction.

A leading upper ontology in science and engineering is BFO — the Basic Formal Ontology — and it is not a hobby project: it is published as an international standard, ISO/IEC 21838-2, which establishes BFO as a conformant top-level ontology [4]. BFO's core move is to split reality into continuants (things that persist through time as wholes — a cell, a bioreactor, a batch of drug substance) and occurrents (things that unfold in time — a fermentation, a purification step) [3]. Anchoring every domain term under one of these prevents whole categories of modeling error.

This coordinated, principle-based approach was pioneered in the life sciences by the OBO Foundry, a community that builds biomedical ontologies to shared design rules so they interlock instead of overlap [2]. Manufacturing took the lesson and built its own equivalent: the Industrial Ontologies Foundry (IOF), explicitly modeled on the OBO Foundry's governance, with BFO at the top and a BFO-aligned mid-level IOF Core Ontology that supplies industry-wide concepts every manufacturing domain can specialize [6][5].

Layered ontology stack: BFO, the upper ontology (ISO/IEC 21838-2), branches down to IOF Core mid-level industrial concepts and to the OBO Foundry biomedical ontologies; IOF Core flows down to the IOF biopharma manufacturing ontologies; Allotrope AFO analytical lab data aligns with BFO; the IOF biopharma ontologies and AFO both converge into one queryable biomanufacturing knowledge graph. A layered stack: one neutral upper ontology at the top, an industrial mid-level beneath it, and biopharma-specific ontologies at the bottom, all feeding a single graph. Original diagram by the authors, created with AI assistance.

Domain ontologies: IOF biopharma-manufacturing and BMIC governance

At the bottom of the stack live the domain ontologies that name the specifics of making a biologic. Two efforts matter most here.

The first is the IOF biopharmaceutical-manufacturing ontologies — the biopharma specialization of the IOF stack, developed by the OAGi (Open Applications Group) and NIIMBL (the U.S. National Institute for Innovation in Manufacturing Biopharmaceuticals) effort within IOF and released as open source over 2024-2025 [11]. They inherit BFO and IOF Core, so a CellCultureProcess defined there is automatically an occurrent and automatically interoperable with any other IOF-based industrial ontology. These are the same terms the open-source stack reaches for when it grounds its graph: Book 3's batch and equipment model maps the physical equipment hierarchy onto these manufacturing concepts, and its knowledge-graph build aligns a Batch node to IOF rather than inventing a private class. The IOF concepts also dovetail with the plant's transactional standards rather than competing with them — ISA-95 (the standard model for how plant-floor and business systems exchange production information) and its XML serialization B2MML (Business to Manufacturing Markup Language) define the batch, material, and equipment messages a MES already emits, and the ontology gives those messages a semantic layer so a Batch in a B2MML production response and a Batch in the graph resolve to the same IOF concept.

note

In this book we use BMIC for the governance body, the Biopharmaceutical Manufacturing Industry Council that develops and stewards these biopharma ontologies within IOF (alongside the OAGi/NIIMBL effort), rather than for the ontology artifact itself. This parallels how OBO and IOF steward their vocabularies through councils and shared principles rather than through any one person owning the vocabulary [2][6].

Allotrope AFO for analytical data

The second is the Allotrope Foundation Ontologies (AFO), the vocabulary behind the Allotrope analytical-data stack we met in the connectivity chapter — a set of ontologies giving laboratory measurements (chromatography, spectroscopy, and the rest) one vendor-agnostic meaning, so a result means the same thing regardless of which instrument produced it. These are precisely the analytical results released against a batch in the physical workflow — the assay numbers that gate QC release — and Book 3 wires them into a graph in its analytical-lab LIMS/ELN chapter, where an HPLC purity result is captured with its AFO meaning rather than as a loose column. AFO covers the lab; the IOF biopharmaceutical-manufacturing ontologies cover the manufacturing process; designed to share an upper ontology, they are built to meet in the same knowledge graph instead of in yet another adapter.

FAIR: the service guarantee for data

FAIR principles: Findable, Accessible, Interoperable, Reusable

Ontologies give data meaning. The FAIR principles give data a quality standard. Published in 2016, FAIR is an acronym — Findable, Accessible, Interoperable, Reusable — and its central, easily-missed insight is that the principles target machine-actionability: data should be usable by computers with minimal human help, because the volume and complexity of modern data have outgrown manual handling [1].

A three-stage journey: heterogeneous descriptions of one number converge through a shared ontology into FAIR, machine-actionable data. An ontology gives a number one agreed meaning: the foundation that makes data FAIR. Original diagram by the authors, created with AI assistance.

Unpacked, with a biomanufacturing example for each [1]:

Findable — every dataset has a globally unique, persistent identifier and rich metadata, so it can be located. A batch record carries a permanent ID and is indexed with its product, site, and date — not buried as final_v3_REALfinal.xlsx on one engineer's laptop.
Accessible — once found, data is retrievable by that identifier over a standard protocol, with clear access rules. An auditor's system can request that batch record through a documented interface, and is told plainly whether it may have it.
Interoperable — data uses shared, formal vocabularies — exactly the ontologies above — so it combines with other data. The record's pH field points to the same ontology property every other system uses, so readings from the lab and the plant can be merged without guessing.
Reusable — data is richly described with its context, provenance (where it came from and how), and a clear usage license, so others can trust and reuse it. A later process-development or tech-transfer team can reuse the batch data because its conditions, lineage, and terms of use travel with it. This is exactly the data a scale-up and technology transfer (moving a validated process from a small development reactor to a larger production one, or from one site to another) depends on: the receiving site can only reproduce the sending site's process if the critical process parameters, the genealogy, and the units arrive interpretable rather than as a spreadsheet whose column meanings must be re-discovered by phone — which is the difference between a clean transfer and a comparability investigation.

note

FAIR is not the same as open. Accessible means the access conditions are clear and the retrieval mechanism is standard — not that everyone may read everything [1]. Highly confidential, regulated manufacturing data can be fully FAIR while remaining tightly restricted — and indeed it must remain controlled, because the records governed here fall under 21 CFR Part 11 and EU Annex 11 — the U.S. FDA rule on electronic records and signatures and its European counterpart for computerised systems — which mandate access controls, audit trails, and traceability. The principle is well-defined access, not free access.

FAIR, ALCOA+, and a validated system

FAIR and the regulator's data-integrity frame are close cousins, and it is worth mapping them explicitly because they reinforce each other. The expectation a GMP inspector reads against is ALCOA+, the acronym for data that is Attributable, Legible, Contemporaneous, Original, and Accurate, plus the four extensions Complete, Consistent, Enduring, and Available. Several ALCOA+ letters map almost one-to-one onto what a typed, provenance-bearing triple delivers: a signer edge (bp:approvedBy) makes the record Attributable; a QUDT-typed, unit-bearing object keeps it Accurate and Original (the value never travels stripped of its meaning); a persistent IRI plus rich metadata makes it Available and FAIR-Findable at once; and the derivedFrom genealogy gives it Complete and Consistent lineage. FAIR-Reusable and ALCOA+-Complete are, in practice, the same requirement seen from two angles.

But a triple that satisfies ALCOA+ is only trustworthy if the system that produced it was proven to work — which is why FAIR data presupposes a validated computerized system. The companion CSV-to-CSA chapter covers this: a release-record system is qualified through IQ/OQ/PQ (Installation, Operational, and Performance Qualification — proof that the software was installed right, operates right, and performs right on its real workload), under the risk-based shift from heavy CSV (Computerized System Validation) to CSA (Computer Software Assurance — the FDA's critical-thinking-over-scripts approach that spends validation effort where patient risk is). A graph is only as trustworthy as the validated LIMS, MES, and historian that fed it: FAIR without validation is merely well-organized hearsay.

Why a machine-actionable graph is the foundation for ML

The "machine-actionability" at FAIR's centre is not an abstraction — it is the precondition for trustworthy machine learning on this data, and the connection is worth drawing because it is where the next book picks up. A model is only as honest as the structure of the data underneath it, in three concrete ways.

First, the shared predicate is what makes a leakage-free split possible. A model that predicts batch release from process data must be validated by grouping rows by batch: a leave-one-batch-out cross-validation (training on some batches and testing on an entirely held-out one, so the score reflects performance on a genuinely new batch rather than on rows it half-memorized). That grouping is only reliable when every row from a given batch resolves, through the same derivedFrom edge, to that batch's one IRI; without the shared genealogy edge, rows from one batch leak across the train/test boundary and the reported accuracy is a flattering illusion. The graph's lineage is the grouping key. Book 5's models-and-validation chapter builds exactly this batch-grouped, nested cross-validation.
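The batch-grouped split described above can be sketched without any ML library. The version below is a plain-Python outline of leave-one-batch-out folding, with no model and no nesting; the row layout, the pH values, and batches 002 and 003 are hypothetical.

```python
def leave_one_batch_out(rows, batch_key="batch"):
    """Yield (held_out, train_rows, test_rows), one fold per distinct batch."""
    batches = sorted({row[batch_key] for row in rows})
    for held_out in batches:
        train = [r for r in rows if r[batch_key] != held_out]
        test = [r for r in rows if r[batch_key] == held_out]
        yield held_out, train, test


# Hypothetical process rows; several rows can share one batch IRI.
rows = [
    {"batch": "bp:BATCH-2026-001", "ph": 7.02},
    {"batch": "bp:BATCH-2026-001", "ph": 7.05},
    {"batch": "bp:BATCH-2026-002", "ph": 6.98},
    {"batch": "bp:BATCH-2026-003", "ph": 7.10},
]

folds = list(leave_one_batch_out(rows))
for held_out, train, test in folds:
    print(held_out, "train rows:", len(train), "test rows:", len(test))
```

Because the fold boundary is drawn on the batch IRI rather than on the row, both rows of the first batch always land on the same side of the split.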

Second, the unit and the operating range define the applicability domain. A QUDT-typed value with a qualified range lets a model declare when a new input lies outside the envelope it was trained on (the applicability domain — the input region where a model's prediction can be trusted), so an out-of-range spectrum is flagged rather than silently extrapolated into a confident wrong answer. A bare number carries no such envelope.
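As a rough illustration of that envelope, the sketch below uses the simplest applicability domain there is: a single trained range with a unit attached. Real applicability-domain methods are usually multivariate and more careful; the class name, the range limits, and the second unit string are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainedRange:
    """The envelope a model was trained on, for one input."""
    unit: str
    low: float
    high: float


def in_domain(value, unit, trained):
    """True only if the unit matches and the value lies inside the trained range."""
    return unit == trained.unit and trained.low <= value <= trained.high


# Hypothetical envelope: the model only ever saw monomer purity in percent.
monomer_range = TrainedRange(unit="unit:PERCENT", low=95.0, high=99.5)

print(in_domain(98.611, "unit:PERCENT", monomer_range))   # inside the envelope
print(in_domain(0.98611, "unit:PERCENT", monomer_range))  # a fraction mislabeled as percent
print(in_domain(98.611, "unit:FRACTION", monomer_range))  # right number, wrong unit
```

The second and third calls are the cases a bare number cannot catch, because without a unit there is nothing to compare against.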

Third, the same provenance edges are model lineage. A deployed model is itself a node in the graph: trainedOn a dataset pinned by its hash, monitoredBy drift detectors, supersedes its prior version. This is provenance modeled the way the W3C PROV-O vocabulary (the standard for recording where data and artifacts came from) intends. That lineage is what separates model drift (the data-driven model going stale, watched as an MLOps concern) from genuine process drift (the living biology actually changing), and it is what lets a regulator trace which model decided what about which batch. The unsolved gap taken up later in this chapter, metadata authored by hand without a controlled vocabulary, is therefore also an ML gap: a learning system trained on un-aligned predicates inherits the swamp it was meant to escape.

Why it matters

For data management, ontologies and FAIR convert a recurring, expensive project into a permanent asset. Without them, every time you want to combine the bioreactor history, the chromatography results, and the release tests for one batch, someone writes throwaway code to reconcile mismatched names, units, and IDs — and rewrites it when a system changes. With a shared ontology, those datasets already speak one language; with FAIR, they are already findable, retrievable, and richly enough described to trust. The integration stops being a heroic data-archaeology effort and becomes a query. That is the difference between data you have and data you can actually use.

From heterogeneous data to machine-actionable knowledge

It helps to watch the whole machine run once, end to end, on the single number we anatomized above. Three systems report the monomer purity of one batch, and — exactly as the semantic-interoperability chapter warned — they disagree in every way that matters: the LIMS exports monomer_pct = 98.611 with no unit recorded, the historian logs a tag Monomer% = 98.6 rounded and with the percent buried in the label, and a CSV writes %Mono,0.98611,frac where "frac" could mean a fraction or a percent. Syntactically all three transmit fine. Semantically they are a swamp.

Now apply the stack. Ontology alignment maps all three column names to the one shared predicate bp:monomerPct, so the meaning is no longer guessed from a header. QUDT normalization converts the fraction to a percentage and pins the unit and datatype, so the value is no longer ambiguous — 98.611, xsd:float, unit:PERCENT. The result is a single typed triple: a fully self-describing fact. That triple passes a SHACL gate — exactly one monomerPct, value in range, before the batch may be released — and on conformance it loads into one RDF graph, where the batch becomes a node, derivedFrom edges chain the drug substance back through the capture pool and the bioreactor batch to the cell bank, and the typed result hangs off it. Reconstructing that lineage is no longer a data-archaeology project; it is one SPARQL query — RDF's query language — walking the derivedFrom chain in a single line. Starting from the drug-substance leaf bp:DS-001 (the last node in the chain, so its ancestor walk reaches everything upstream):

SELECT ?ancestor WHERE { bp:DS-001 (bp:derivedFrom)+ ?ancestor . }

Here SELECT names what to return, WHERE holds the pattern to match, and ?ancestor is a variable — a blank the engine fills with every node that matches. The + means follow derivedFrom one-or-more hops, so the whole lineage — capture pool, bioreactor batch, cell bank, and everything between — falls out of a single pattern.
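For readers who want to see what the one-or-more-hops operator is doing underneath, here is a small Python walk over the same chain. It is a sketch of the traversal idea, not a SPARQL engine; the identifiers HARVEST-001 and CELLBANK-001 are hypothetical stand-ins for the clarified harvest and the cell bank, which the text does not name.

```python
# (child, parent) pairs, read as: child derivedFrom parent.
DERIVED_FROM = [
    ("DS-001", "PApool-001"),
    ("PApool-001", "HARVEST-001"),     # hypothetical id for the clarified harvest
    ("HARVEST-001", "BATCH-2026-001"),
    ("BATCH-2026-001", "SEED-001"),
    ("SEED-001", "CELLBANK-001"),      # hypothetical id for the cell bank
]


def ancestors(start, edges):
    """Every node reachable from start over one or more derivedFrom hops."""
    found = []
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for child, parent in edges:
            if child == node and parent not in found:
                found.append(parent)      # record it once
                frontier.append(parent)   # then keep walking upstream
    return found


lineage = ancestors("DS-001", DERIVED_FROM)
print(lineage)
```

As with the + in the query, the starting node itself is not part of the answer; only what lies one or more hops upstream is returned.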

This is the batch's data shadow finally made joinable — the molecule was made once on the floor, its data twice over in a dozen systems, and only a shared predicate and a typed, unit-bearing object let those scattered shadows rejoin on one batch ID into a record a machine can trust.
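The alignment and normalization steps in this walk-through can be sketched in a few lines of Python. This is an outline under stated assumptions, not a production mapper: the source names and the SOURCE_SCALE table are invented, and the table stands in for a decision a data steward would have to confirm once per source, since "frac" and a missing unit cannot be resolved by code alone.

```python
from decimal import Decimal

# Column-name alignment: every private header maps to one shared predicate.
ALIGNMENT = {
    "monomer_pct": "bp:monomerPct",
    "Monomer%": "bp:monomerPct",
    "%Mono": "bp:monomerPct",
}

# Assumption: how each source scales its value, confirmed by a human once.
SOURCE_SCALE = {"lims": "percent", "historian": "percent", "csv": "fraction"}


def normalize(source, column, raw_value):
    """Return (predicate, value in percent, datatype, unit) for one source row."""
    predicate = ALIGNMENT[column]   # a KeyError here means an unaligned column
    value = Decimal(raw_value)      # exact decimal arithmetic on the raw text
    if SOURCE_SCALE[source] == "fraction":
        value *= 100                # 0.98611 becomes 98.61100
    return (predicate, float(value), "xsd:float", "unit:PERCENT")


source_rows = [
    ("lims", "monomer_pct", "98.611"),
    ("historian", "Monomer%", "98.6"),
    ("csv", "%Mono", "0.98611"),
]
facts = [normalize(*row) for row in source_rows]
for fact in facts:
    print(fact)
```

The LIMS and CSV rows land on the identical fact. The historian's value stays at its rounded 98.6, and choosing which source is authoritative is a reconciliation decision this sketch deliberately does not make.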

A four-stage pipeline: three heterogeneous source rows on the left (a LIMS field monomer_pct = 98.611 with no unit, a historian tag Monomer% = 98.6 rounded, and a CSV %Mono,0.98611,frac) flow into an alignment-and-normalize stage that maps every name to bp:monomerPct and converts the fraction to 98.611 with unit PERCENT and datatype xsd:float; that becomes one typed RDF triple, which passes a SHACL gate checking exactly one monomerPct before release, and on conformance loads into an RDF knowledge graph where a Batch node is chained by derivedFrom edges back to a seed train and a cell bank, with the QUDT-typed monomer result hanging off the batch and one SPARQL property-path query walking the genealogy. The full cure in one pass: heterogeneous columns are aligned to one ontology predicate, normalized with QUDT, gated by SHACL, and loaded as triples into a graph a single SPARQL query can traverse. Original diagram by the authors, created with AI assistance.

This is the data-point thread of the whole series made concrete. The physical batch is made in the production bioreactor and released through QC; here it becomes a typed triple with a shared predicate and a unit; and in Book 3 that triple is a literal row in a knowledge graph you can query.

The unsolved part: wire-compliant but not actually FAIR

It would be tidy to end on the machine running cleanly. The harder truth is that meeting the standards does not guarantee meeting the goal. FAIR is a set of principles, not a conformance test, and data can satisfy the easy letters while silently failing the hard ones.

A 2024 meta-research study of pandemic datasets makes the gap measurable. Surveying COVID-19 data resources against the FAIR principles, it found that nearly all were Findable — they had identifiers and showed up in searches — but only 46.7% reached even moderate Interoperability [10]. The reason is the one this chapter has been circling: metadata was authored by hand, in free text, without a controlled vocabulary, so the fields existed but did not point at any shared ontology. The data was findable and downloadable and still un-combinable — exactly the swamp, dressed in the language of compliance. "We use FAIR" can be true at the syntactic layer and hollow at the semantic one.

Biopharma inherits this gap directly. A plant can stand up a perfectly conformant historian, tag every point, and ship OPC UA and MQTT flawlessly — and still produce data that no downstream system can merge, because the predicates were never aligned to a shared ontology and the units were never pinned. This is the same failure family Book 3 calls "when the graph lies": vocabulary drift (two teams coin two predicates for one concept), edge sprawl (relationships multiply until traversal is meaningless), and type loss (a number loads without its datatype or unit and quietly becomes uninterpretable). A SHACL gate catches the structural cases; it cannot catch a human who confidently mislabels a field with a plausible-looking term. The open problem is not the existence of ontologies — they exist and are standardized — but the discipline and tooling to ensure that the metadata actually authored on the plant floor is controlled-vocabulary, machine-checkable, and FAIR in fact rather than in claim. That is an organizational and change-management problem as much as a technical one, and it is where the field genuinely still struggles.
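Vocabulary drift, at least, is cheap to detect mechanically. The sketch below flags any predicate that is not in an agreed list; the list itself, the function name, and the two drifted predicates are assumptions made for the example. In keeping with the limit just described, it catches only terms outside the vocabulary, never a wrong term chosen from inside it.

```python
# Assumed controlled vocabulary: the predicates this chapter has used.
CONTROLLED = {"bp:monomerPct", "bp:derivedFrom", "bp:approvedBy"}


def unaligned_predicates(triples, vocabulary):
    """Return, sorted, every predicate that is not in the controlled vocabulary."""
    return sorted({p for (_s, p, _o) in triples if p not in vocabulary})


incoming = [
    ("bp:BATCH-2026-001", "bp:monomerPct", 98.611),
    ("bp:BATCH-2026-001", "bp:monomerPercent", 98.611),  # second coinage, same concept
    ("bp:DS-001", "bp:derivedFrom", "bp:PApool-001"),
    ("bp:DS-001", "bp:madeFrom", "bp:PApool-001"),       # drifted synonym
]

drifted = unaligned_predicates(incoming, CONTROLLED)
print(drifted)
```

A check like this belongs at the point where metadata is authored, so a drifted term is rejected before it ever reaches the graph.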

In the real world

This is not theory waiting for adopters. BFO is a published ISO/IEC standard [4]; RDF 1.1, OWL 2, and SHACL are settled W3C recommendations that run production knowledge graphs across industries [7][8][9]; and the OBO-then-IOF lineage shows the council-governed model working at scale since the mid-2000s [2][6]. In biopharma specifically, the OAGi/NIIMBL effort within IOF released the IOF biopharmaceutical-manufacturing ontologies as open source over 2024-2025, stewarded by the BMIC council [11], and the Allotrope AFO already grounds vendor-neutral lab data: chromatography and mass-spectrometry results from instruments such as Agilent and Shimadzu analytical systems can be exported into the Allotrope Data Format and read with one shared meaning regardless of which vendor produced them.

One practical caveat the tool choice hides: an ontology, like any reusable artifact, ships under a license, and the terms differ. The IOF and OBO ontologies are released under permissive open licenses (Creative Commons / BSD-style) you may freely reuse and redistribute; the Allotrope AFO is distributed under the Allotrope Foundation's own membership terms, which historically required Allotrope membership to obtain the full framework. So "is it open?" must be checked per ontology before a stack standardizes on it, exactly as one checks a software dependency's license.

Once authored, the triples are physically stored in, and queried out of, a triplestore (the database engine; the knowledge graph is the contents it holds, the same way a table is the data and the database is the engine that serves it). Open-source engines include Apache Jena Fuseki and Oxigraph (the latter used in Book 3); commercial ones include Ontotext GraphDB, Stardog, and Amazon Neptune. This is the storage-and-query tier a real vendor conversation reaches first.

Key terms

Ontology — a formal, shared, machine-readable model of what exists in a domain and how it relates.
Class / instance / relation / axiom — a category of thing; a concrete member; a connection between things; a logical rule that constrains and enables reasoning.
RDF (triple, knowledge graph) — the W3C model representing facts as subject–predicate–object triples that link into a graph.
OWL — the W3C Web Ontology Language, which adds formal logic so reasoners can infer facts and find contradictions.
SHACL — the W3C Shapes Constraint Language, which validates whether an RDF graph meets required-content rules.
Upper / foundational ontology — a small, domain-neutral vocabulary of the most general categories that everything specializes.
BFO (Basic Formal Ontology, ISO/IEC 21838-2) — the standardized upper ontology splitting reality into continuants and occurrents.
OBO Foundry — the biomedical community whose coordinated, principle-based ontology model inspired the industrial equivalent.
IOF (Industrial Ontologies Foundry) / IOF Core — the manufacturing ontology suite modeled on OBO, with a BFO-aligned mid-level core.
IOF biopharmaceutical-manufacturing ontologies — the biopharma domain specialization of the IOF stack, developed by the OAGi/NIIMBL effort within IOF and released as open source over 2024-2025.
BMIC (Biopharmaceutical Manufacturing Industry Council) — in this book's usage, the governance body that develops and stewards the IOF biopharmaceutical-manufacturing ontologies within IOF; a council, not the ontology itself.
AFO (Allotrope Foundation Ontologies) — ontologies giving analytical laboratory data one vendor-agnostic meaning.
FAIR (Findable, Accessible, Interoperable, Reusable) — principles making data usable by machines; FAIR is not the same as open.
Machine-actionability — the property of being usable by computers with minimal human intervention.
RDF triple (subject–predicate–object) — the atomic unit of a knowledge graph: an IRI subject, an ontology-defined property (predicate), and an object that is either a typed literal or an IRI edge to another node.
IRI — an Internationalized Resource Identifier; the globally unique web name that makes a subject, predicate, or object unambiguous across systems.
QUDT — the Quantities, Units, Dimensions and Types vocabulary that lets a triple's value carry its unit and datatype, so a number is never bare.
SPARQL — the W3C query language for RDF; property paths such as (derivedFrom)+ walk a genealogy chain recursively in a single line.
SHACL shape — a closed-world rule expressed in RDF (such as bp:ReleaseShape) that says what a valid record must hold; here, the release gate requiring exactly one in-spec monomer result per batch.
Competency question (CQ) — a question, written before any modeling, that the finished ontology must be able to answer; in Book 4 each CQ becomes an executable acceptance test.
ALCOA+ — the GMP data-integrity expectation that data be Attributable, Legible, Contemporaneous, Original, and Accurate, plus Complete, Consistent, Enduring, and Available; several letters map directly onto a typed, provenance-bearing triple.
IQ/OQ/PQ, CSV, CSA — Installation/Operational/Performance Qualification under Computerized System Validation; CSA (Computer Software Assurance) is the FDA's risk-based, critical-thinking successor — the validation a FAIR data system presupposes.
Leave-one-batch-out cross-validation — validating a model by holding out whole batches, keyed on the shared derivedFrom genealogy edge, so the score is not inflated by rows leaking across the train/test split.
Applicability domain — the input region where a model's prediction can be trusted; a QUDT-typed value's qualified range is what lets a model flag an out-of-envelope input instead of extrapolating blindly.
PROV-O — the W3C provenance ontology for recording where data and artifacts (including a deployed model) came from, the vocabulary behind model-lineage edges like trainedOn and supersedes.
ISA-95 / B2MML — the standard model and its XML serialization for exchanging production information between plant-floor and business systems; the ontology gives those batch/material/equipment messages a shared semantic meaning.
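
Two of the terms above, the recursive `(derivedFrom)+` walk and leave-one-batch-out cross-validation, fit together, and a small plain-Python sketch can show how. It is an illustration under loud assumptions, not SPARQL and not any library's splitter: the lot identifiers, the `DERIVED_FROM` mapping, the `monomer_pct` values, and the function names are all made up, each lot is assumed to have at most one parent (a real genealogy can branch), and grouping rows by their root ancestor is one possible reading of "keyed on the shared derivedFrom genealogy edge".

```python
from collections import defaultdict

# Hypothetical genealogy edges: child lot -> the parent lot it was derivedFrom.
DERIVED_FROM = {
    "DP-001": "DS-001",
    "DS-001": "CB-A",
    "DP-002": "DS-002",
    "DS-002": "CB-A",
    "DP-003": "DS-003",
    "DS-003": "CB-B",
}


def ancestors(lot, edges=DERIVED_FROM):
    """Follow derivedFrom one or more times, like the path (derivedFrom)+."""
    chain = []
    seen = {lot}
    while lot in edges:
        lot = edges[lot]
        if lot in seen:  # guard against a cyclic (corrupt) genealogy
            break
        seen.add(lot)
        chain.append(lot)
    return chain


def root_batch(lot):
    """The last ancestor in the chain, or the lot itself if it has none."""
    chain = ancestors(lot)
    return chain[-1] if chain else lot


def leave_one_batch_out(rows):
    """Yield (held_out_batch, train_rows, test_rows), one fold per root batch."""
    groups = defaultdict(list)
    for row in rows:
        groups[root_batch(row["lot"])].append(row)
    for held_out in sorted(groups):
        test = groups[held_out]
        train = [r for b, rs in groups.items() if b != held_out for r in rs]
        yield held_out, train, test


# Made-up rows; the numbers carry no meaning beyond the example.
rows = [
    {"lot": "DP-001", "monomer_pct": 98.9},
    {"lot": "DP-002", "monomer_pct": 99.1},
    {"lot": "DP-003", "monomer_pct": 97.4},
]

folds = list(leave_one_batch_out(rows))
for held_out, train, test in folds:
    print(held_out, [r["lot"] for r in train], [r["lot"] for r in test])
```

Because DP-001 and DP-002 trace back to the same root, they always land on the same side of the split. A row-wise random split could put one in training and the other in test, which is the leakage the glossary entry warns about.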

Where this leads

We now have the full kit: connected systems, trustworthy records, governed semantics, and FAIR data that machines can find, combine, and trust. The next chapter, The Digital Thread and the Digital Twin, shows what becomes possible once that connected, semantic data is woven across the entire product lifecycle. The digital thread is one continuous, traceable record stitched from design to patient; the digital twin is a living, data-fed model of the process that mirrors and predicts the real thing. Neither is a new technology so much as a consequence: both work only because everything in the prior chapters, ending with the ontologies and FAIR principles of this one, is finally in place. And if you want to see this chapter's triple, SHACL gate, and SPARQL traversal as running code rather than concepts, Book 3's knowledge-graph chapter builds the whole thing on a laptop.
