
Ontologies and FAIR Data

📍 Where we are: The last chapter showed why numbers fail to connect even when they transmit perfectly; this one introduces the two deepest tools that fix it: ontologies and the FAIR principles.

In the previous chapter, Why Numbers Don't Connect: The Semantic Interoperability Problem, we drew a hard line between two ideas. Syntactic interoperability means two systems agree on format: the message parses, the fields line up, the bytes arrive intact. Semantic interoperability means they agree on meaning: both ends understand that a field labeled pH in one machine and a field labeled pH_value in another name the very same measured quantity. We saw that even flawless byte-transfer leaves a swamp of heterogeneity, in which the same real-world thing gets described differently everywhere: different units, different identifiers, different timestamp formats, different vocabularies. This chapter is about the durable cure for that swamp: not another adapter that translates one private dialect into another, but a shared model of meaning that every system can point to.

The simple version

Think of a library before catalogs. Every librarian shelves books by their own private logic, so finding anything means asking the one person who shelved it. An ontology is the agreed-upon catalog system: it says exactly what a "book," an "author," and a "subject" are, and how they relate, so anyone, or any machine, can find and combine things without a human translator. FAIR is the promise that the catalog actually works: that data is easy to find, get, combine, and reuse. Ontologies build the catalog; FAIR is the service guarantee.

What this chapter covers

We build an ontology from scratch: classes, relations, and the languages that express them (RDF, OWL, SHACL). We then climb to the upper ontologies that let separate fields reuse one another's work (BFO and the Industrial Ontologies Foundry), descend into the biopharma domain ontologies and the council that governs them, unpack the FAIR principles one by one, and finish by showing how the two together turn siloed files into a single queryable graph.

What an ontology actually is

Strip away the intimidating word and an ontology is a formal, shared, machine-readable model of what exists in a domain and how it relates [3]. It has a small number of moving parts.

A class is a category of thing, such as Bioreactor, CellCultureProcess, or pH Measurement. An instance (or individual) is one concrete member of a class: bioreactor BR-101 is an instance of Bioreactor. A relation (or property) connects things: BR-101 is part of Upstream Suite 2; a pH Measurement is about a particular batch. Finally, axioms are logical statements that constrain the model so a computer can reason over it, for example "every CellCultureProcess has participant some LivingCell." Classes name the kinds, relations wire them together, and axioms make the wiring provable rather than merely suggestive [3].

This is the leap past the previous chapter. A spreadsheet column header named pH is a label a human happens to recognize. An ontology class named pH Measurement, with axioms saying it measures hydrogen-ion activity on a defined scale, is something a machine can recognize and act on without being hand-told.
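The four parts can be sketched in a few lines of plain Python. This is only an illustration of the vocabulary, not how ontologies are actually authored (that is done in RDF/OWL, covered next). The predicate name instanceOf, the class name ManufacturingSuite, and the identifier UpstreamSuite2 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One statement linking a subject to an object through a predicate."""
    subject: str
    predicate: str
    obj: str


# Classes: the kinds of thing the model knows about.
# "ManufacturingSuite" is an assumed name for illustration.
classes = {"Bioreactor", "ManufacturingSuite", "CellCultureProcess", "LivingCell"}

# Instances and relations, both stored as facts.
facts = {
    Fact("BR-101", "instanceOf", "Bioreactor"),
    Fact("UpstreamSuite2", "instanceOf", "ManufacturingSuite"),
    Fact("BR-101", "isPartOf", "UpstreamSuite2"),
}

# An axiom held as plain data: every CellCultureProcess
# has participant some LivingCell. A real reasoner would act on this;
# here it is only recorded.
axioms = [("CellCultureProcess", "hasParticipant", "some", "LivingCell")]


def instances_of(class_name):
    """Return the names of all individuals stated to belong to a class."""
    return sorted(
        f.subject
        for f in facts
        if f.predicate == "instanceOf" and f.obj == class_name
    )


def related(subject, predicate):
    """Follow one relation outward from a subject."""
    return sorted(
        f.obj for f in facts if f.subject == subject and f.predicate == predicate
    )
```

Calling instances_of("Bioreactor") walks the facts to find members of the class, and related("BR-101", "isPartOf") follows the relation to the suite that contains the bioreactor.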

Classes name kinds of things, instances are concrete members, and relations connect them into a small fact network a computer can follow. Figure by the authors.

The languages: RDF, OWL, and a word on SHACL

How is such a model written down so any system can read it? The foundation is RDF 1.1, the Resource Description Framework (W3C, 2014), which represents every fact as a triple of subject, predicate, and object, as in BR-101, isPartOf, Suite2 [7]. Each part is named with a globally unique web identifier (an IRI), so "BR-101" here cannot be confused with someone else's "BR-101" elsewhere. Written out in the N-Triples serialization, that single fact looks like this:

<http://example.org/BR-101> <http://example.org/isPartOf> <http://example.org/Suite2> .

Stack millions of such triples together and they form a knowledge graph: a web of interconnected facts rather than rows in isolated tables [7].
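As a rough sketch of the idea, triples can be held as simple tuples, serialized to N-Triples, and queried by pattern. The http://example.org/ namespace is the placeholder used above; the extra triples, the hasType predicate, and the helper names are assumptions for illustration. A production system would use an RDF library and a triple store rather than a Python list.

```python
EX = "http://example.org/"  # placeholder namespace, as in the example above


def iri(local_name):
    """Wrap a local name in the example namespace, N-Triples style."""
    return f"<{EX}{local_name}>"


# A tiny graph: each tuple is (subject, predicate, object).
triples = [
    ("BR-101", "isPartOf", "Suite2"),
    ("Suite2", "isPartOf", "Building4"),   # assumed extra fact
    ("BR-101", "hasType", "Bioreactor"),   # hypothetical predicate
]


def to_ntriples(graph):
    """Serialize IRI-only triples, one statement per line."""
    return "\n".join(f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in graph)


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t
        for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]
```

For instance, match(triples, s="BR-101") returns every fact stated about the bioreactor, which is the basic move behind graph queries.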

OWL 2, the Web Ontology Language (W3C, 2012), is the layer that adds the logic: it lets you state the classes, relations, and axioms above formally enough that automated reasoners can infer new facts and detect contradictions [8]. For example, if BR-101 is in Suite 2 and Suite 2 is in Building 4, and the containment relation is declared transitive, the reasoner concludes that BR-101 is in Building 4.
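The containment inference can be imitated with a small fixed-point loop. This is a toy stand-in for one kind of OWL reasoning (a transitive property), not an OWL reasoner; the predicate name isLocatedIn and the function name are assumptions for the sketch.

```python
def infer_transitive(facts, predicate):
    """Return the facts plus everything implied by treating
    `predicate` as transitive (naive fixed-point iteration)."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        snapshot = list(inferred)
        for a, p1, b in snapshot:
            for c, p2, d in snapshot:
                # a -> b and b -> d under the same transitive predicate
                if p1 == predicate and p2 == predicate and b == c:
                    new_fact = (a, predicate, d)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred


# Only two facts are stated explicitly.
stated = {
    ("BR-101", "isLocatedIn", "Suite2"),
    ("Suite2", "isLocatedIn", "Building4"),
}

closure = infer_transitive(stated, "isLocatedIn")
new_facts = closure - stated  # what the "reasoner" added
```

Here new_facts holds the single derived statement that BR-101 is located in Building 4, a fact nobody typed in.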

A common confusion is worth heading off. OWL is open-world: it assumes that what is not stated is merely unknown, not false. That makes it a poor fit for data validation, where a missing required field really is an error. That job belongs to SHACL, the Shapes Constraint Language, a W3C standard that checks an RDF graph against shapes (rules such as "every batch record must have exactly one approval signature") and reports violations [9]. In short: OWL says what things mean; SHACL says what a valid record must contain.
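The closed-world flavor of validation can be sketched as a count check over a small graph. This mimics the intent of a cardinality constraint ("exactly one approval signature") in plain Python; it is not SHACL syntax, and the predicate names type and approvalSignature, the batch identifiers, and the function name are all hypothetical.

```python
from collections import Counter


def validate_exactly_one(graph, record_class, required_predicate):
    """Report every record of `record_class` that does not have
    exactly one value for `required_predicate`."""
    records = sorted(
        s for s, p, o in graph if p == "type" and o == record_class
    )
    counts = Counter(s for s, p, o in graph if p == required_predicate)
    return [
        f"{r}: expected exactly 1 {required_predicate}, found {counts[r]}"
        for r in records
        if counts[r] != 1
    ]


graph = [
    ("batch-001", "type", "BatchRecord"),
    ("batch-001", "approvalSignature", "sig-A"),
    ("batch-002", "type", "BatchRecord"),          # signature missing
    ("batch-003", "type", "BatchRecord"),
    ("batch-003", "approvalSignature", "sig-B"),
    ("batch-003", "approvalSignature", "sig-C"),   # one too many
]

violations = validate_exactly_one(graph, "BatchRecord", "approvalSignature")
```

The missing signature on batch-002 is reported as an error rather than treated as "unknown", which is exactly the behavior an open-world reasoner would not give you.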

note

You do not hand-author triples any more than you write web pages in raw protocol. Ontologies are built and maintained by domain experts in specialized editors, the most common being Protégé (the free, open-source tool from Stanford that is the de facto standard for ontology authoring), alongside the commercial TopBraid Composer and the open-source VocBench. The RDF/OWL/SHACL underneath is the interchange format, the way HTML is the format under a styled web page.

Upper ontologies: a shared spine

The problem with letting every field invent its own ontology is that a biologist's "process" and an engineer's "process" drift apart, and we are back to heterogeneity one level higher. The previous chapter named the upper ontology, and BFO in particular, as the fix; here is how it actually works. An upper (or foundational) ontology is a small, domain-neutral vocabulary of the most general categories that everything falls under: things that endure through time versus things that happen, qualities, roles, functions [3]. Build every domain ontology on the same spine and they become reusable and combinable by construction.

A leading upper ontology in science and engineering is BFO, the Basic Formal Ontology, and it is not a hobby project: it is published as an international standard, ISO/IEC 21838-2, which establishes BFO as a conformant top-level ontology [4]. BFO's core move is to split reality into continuants (things that persist through time as wholes, such as a cell, a bioreactor, or a batch of drug substance) and occurrents (things that unfold in time, such as a fermentation or a purification step) [3]. Anchoring every domain term under one of these prevents whole categories of modeling error.

This coordinated, principle-based approach was pioneered in the life sciences by the OBO Foundry, a community that builds biomedical ontologies to shared design rules so they interlock instead of overlap [2]. Manufacturing took the lesson and built its own equivalent: the Industrial Ontologies Foundry (IOF), explicitly modeled on the OBO Foundry's governance, with BFO at the top and a BFO-aligned mid-level IOF Core Ontology that supplies industry-wide concepts every manufacturing domain can specialize [6][5].

A layered stack: one neutral upper ontology at the top, an industrial mid-level beneath it, and biopharma-specific ontologies at the bottom, all feeding a single graph. Figure by the authors.

Domain ontologies for biopharma

At the bottom of the stack live the domain ontologies that name the specifics of making a biologic. Two efforts matter most here.

The first is the IOF biopharmaceutical-manufacturing ontologies, the biopharma specialization of the IOF stack, developed by the BMIC working groups within IOF and publicly released under an open MIT-style license in November 2025 [9]. They inherit BFO and IOF Core, so a CellCultureProcess defined there is automatically an occurrent and automatically interoperable with any other IOF-based industrial ontology.
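The "automatically an occurrent" claim is just inheritance along a subclass chain, which a short sketch can make concrete. The intermediate class names below (ManufacturingProcess, Process, Equipment) are simplified assumptions and do not reproduce the actual BFO or IOF Core hierarchies, which contain more levels.

```python
# Each class points to its direct parent. The chain is a deliberately
# simplified, assumed hierarchy, not the real BFO/IOF class tree.
SUBCLASS_OF = {
    "CellCultureProcess": "ManufacturingProcess",  # domain layer
    "ManufacturingProcess": "Process",             # mid-level layer
    "Process": "Occurrent",                        # upper layer
    "Bioreactor": "Equipment",
    "Equipment": "Continuant",
}


def ancestors(cls):
    """Walk upward from a class to the top of its chain."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(cls, ancestor):
    """True if `cls` is `ancestor` or inherits from it."""
    return cls == ancestor or ancestor in ancestors(cls)
```

Nothing in the domain layer has to state that a cell culture process is an occurrent; is_a("CellCultureProcess", "Occurrent") holds because the upper layers already say so, and the same check rejects modeling a bioreactor as something that happens.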

caution

A correction worth fixing in your mental model: in this book's usage, BMIC is not the name of the ontology. It is the Biopharmaceutical Manufacturing Industry Council, the governance body that develops and stewards these biopharma ontologies: the council, not the artifact. This follows the same logic by which OBO and IOF councils govern their suites through shared principles rather than one person owning the vocabulary [2][6].

The second is the Allotrope Foundation Ontologies (AFO), the vocabulary behind the Allotrope analytical-data stack we met in the connectivity chapter: a set of ontologies giving laboratory measurements (chromatography, spectroscopy, and the rest) one vendor-agnostic meaning, so a result means the same thing regardless of which instrument produced it. AFO covers the lab; the IOF biopharmaceutical-manufacturing ontologies cover the manufacturing process; designed to share an upper ontology, they are built to meet in the same knowledge graph instead of in yet another adapter.

FAIR: the service guarantee for data

Ontologies give data meaning. The FAIR principles give data a quality standard. Published in 2016, FAIR is an acronym (Findable, Accessible, Interoperable, Reusable), and its central, easily-missed insight is that the principles target machine-actionability: data should be usable by computers with minimal human help, because the volume and complexity of modern data have outgrown manual handling [1].

Figure: a three-stage journey in which heterogeneous descriptions of one number converge through a shared ontology into FAIR, machine-actionable data. An ontology gives a number one agreed meaning, the foundation that makes data FAIR. Original diagram by the authors, created with AI assistance.

Unpacked, with a biomanufacturing example for each [1]:

  • Findable: every dataset has a globally unique, persistent identifier and rich metadata, so it can be located. A batch record carries a permanent ID and is indexed with its product, site, and date, not buried as final_v3_REALfinal.xlsx on one engineer's laptop.
  • Accessible: once found, data is retrievable by that identifier over a standard protocol, with clear access rules. An auditor's system can request that batch record through a documented interface, and is told plainly whether it may have it.
  • Interoperable: data uses shared, formal vocabularies (exactly the ontologies above), so it combines with other data. The record's pH field points to the same ontology class every other system uses, so readings from the lab and the plant can be merged without guessing.
  • Reusable: data is richly described with its context, provenance (where it came from and how), and a clear usage license, so others can trust and reuse it. A later tech-transfer team can reuse the batch data because its conditions, lineage, and terms of use travel with it.
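The four letters can be read as a checklist over a record's metadata. The sketch below is a toy check under assumed field names (identifier, access_protocol, and so on) and made-up example values; it is not an official FAIR assessment, and the published principles are more detailed than four field groups.

```python
# Hypothetical metadata fields grouped by the principle they support.
REQUIRED = {
    "Findable": ["identifier", "metadata"],
    "Accessible": ["access_protocol", "access_policy"],
    "Interoperable": ["vocabulary_terms"],
    "Reusable": ["provenance", "license"],
}


def fair_gaps(record):
    """Return, per principle, the fields that are missing or empty."""
    gaps = {}
    for principle, fields in REQUIRED.items():
        missing = [name for name in fields if not record.get(name)]
        if missing:
            gaps[principle] = missing
    return gaps


# A well-described batch record (all values are illustrative).
batch_record = {
    "identifier": "https://example.org/batch/B-42",
    "metadata": {"product": "mAb-X", "site": "Site 1", "date": "2025-01-15"},
    "access_protocol": "https",
    "access_policy": "restricted: QA and auditors only",
    "vocabulary_terms": ["http://example.org/onto/pHMeasurement"],
    "provenance": "exported from the plant historian",
    "license": "internal use only",
}

# The spreadsheet on one engineer's laptop.
laptop_file = {"metadata": {"filename": "final_v3_REALfinal.xlsx"}}
```

Note that batch_record passes with a restrictive access policy in place: the check asks whether the access rules are stated, not whether the data is public.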

note

FAIR is not the same as open. Accessible means the access conditions are clear and the retrieval mechanism is standard, not that everyone may read everything [1]. Highly confidential, regulated manufacturing data can be fully FAIR while remaining tightly restricted. Indeed it must remain controlled, because the records governed here fall under 21 CFR Part 11 (the U.S. FDA rule on electronic records and signatures) and EU Annex 11 (its European counterpart for computerised systems), which mandate access controls, audit trails, and traceability. The principle is well-defined access, not free access.

Why it matters

For data management, ontologies and FAIR convert a recurring, expensive project into a permanent asset. Without them, every time you want to combine the bioreactor history, the chromatography results, and the release tests for one batch, someone writes throwaway code to reconcile mismatched names, units, and IDs, and rewrites it when a system changes. With a shared ontology, those datasets already speak one language; with FAIR, they are already findable, retrievable, and richly enough described to trust. The integration stops being a heroic data-archaeology effort and becomes a query. That is the difference between data you have and data you can actually use.
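A minimal sketch of "integration becomes a query", assuming two sources that label the same quantity differently. Each source declares once which shared class its column means; after that, combining them is a lookup rather than custom reconciliation code. The IRI, column names, batch ID, and values are all made up for illustration.

```python
# Hypothetical shared class IRI that both systems point to.
PH = "http://example.org/onto/pHMeasurement"

# Two sources with different local column names for the same quantity.
lab_rows = [{"batch": "B-42", "pH": 7.1}]
plant_rows = [{"batch_id": "B-42", "pH_value": 7.0}]

# One-time declarations: local column name -> shared class.
LAB_MAPPING = {"pH": PH}
PLANT_MAPPING = {"pH_value": PH}


def to_facts(rows, batch_key, mapping, source):
    """Turn source rows into (batch, quantity IRI, value, source) facts."""
    return [
        (row[batch_key], mapping[col], row[col], source)
        for row in rows
        for col in mapping
        if col in row
    ]


graph = to_facts(lab_rows, "batch", LAB_MAPPING, "lab") + to_facts(
    plant_rows, "batch_id", PLANT_MAPPING, "plant"
)


def readings(facts, batch, quantity):
    """Query every reading of one quantity for one batch, across sources."""
    return sorted(
        (value, source)
        for b, q, value, source in facts
        if b == batch and q == quantity
    )
```

A call such as readings(graph, "B-42", PH) returns both the lab and the plant value without any code that knows about pH versus pH_value; adding a third source means adding one mapping, not rewriting the query.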

In the real world

This is not theory waiting for adopters. BFO is a published ISO/IEC standard [4]; RDF 1.1, OWL 2, and SHACL are settled W3C recommendations that run production knowledge graphs across industries [7][8][9]; and the OBO-then-IOF lineage shows the council-governed model working at scale since the mid-2000s [2][6]. In biopharma specifically, BMIC released the IOF biopharmaceutical-manufacturing ontologies publicly in November 2025, and the Allotrope AFO already grounds vendor-neutral lab data: chromatography and mass-spectrometry results from instruments such as Agilent and Shimadzu analytical systems can be exported into the Allotrope Data Format and read with one shared meaning regardless of which vendor produced them. The U.S. NIIMBL institute and its Big Data Program (the real-time manufacturing-data and ontology work behind the IOF Biopharma release) sit exactly here: getting instruments, labs, and partners not merely connected but semantically aligned and FAIR, so a number measured once means the same thing everywhere it travels.

Key terms

  • Ontology: a formal, shared, machine-readable model of what exists in a domain and how it relates.
  • Class / instance / relation / axiom: a category of thing; a concrete member; a connection between things; a logical rule that constrains and enables reasoning.
  • RDF (triple, knowledge graph): the W3C model representing facts as subject–predicate–object triples that link into a graph.
  • OWL: the W3C Web Ontology Language, which adds formal logic so reasoners can infer facts and find contradictions.
  • SHACL: the W3C Shapes Constraint Language, which validates whether an RDF graph meets required-content rules.
  • Upper / foundational ontology: a small, domain-neutral vocabulary of the most general categories that everything specializes.
  • BFO (Basic Formal Ontology, ISO/IEC 21838-2): the standardized upper ontology splitting reality into continuants and occurrents.
  • OBO Foundry: the biomedical community whose coordinated, principle-based ontology model inspired the industrial equivalent.
  • IOF (Industrial Ontologies Foundry) / IOF Core: the manufacturing ontology suite modeled on OBO, with a BFO-aligned mid-level core.
  • IOF biopharmaceutical-manufacturing ontologies: the biopharma domain specialization of the IOF stack, developed by the BMIC working groups and publicly released in 2025.
  • BMIC (Biopharmaceutical Manufacturing Industry Council): in this book's usage, the governance body that develops and stewards the IOF biopharmaceutical-manufacturing ontologies; a council, not the ontology itself.
  • AFO (Allotrope Foundation Ontologies): ontologies giving analytical laboratory data one vendor-agnostic meaning.
  • FAIR (Findable, Accessible, Interoperable, Reusable): principles making data usable by machines; FAIR is not the same as open.
  • Machine-actionability: the property of being usable by computers with minimal human intervention.

Where this leads

We now have the full kit: connected systems, trustworthy records, governed semantics, and FAIR data that machines can find, combine, and trust. The next chapter, The Digital Thread and the Digital Twin, shows what becomes possible once that connected, semantic data is woven across the entire product lifecycle. The digital thread is one continuous, traceable record stitched from design to patient; the digital twin is a living, data-fed model of the process that mirrors and predicts the real thing. Neither is a new technology so much as a consequence: they only work because everything in the prior chapters, ending with the ontologies and FAIR principles of this one, is finally in place.