The Vocabularies in Use: From AFO to IDMP

📍 Where we are: Part VII · Ontologies in Industry Today — Chapter 25. The previous chapter named the consortia and standards bodies; this one inventories what they actually shipped, and asks the harder question of which vocabularies live in production versus merely on a download page.

The last chapter named the rooms: Allotrope, the OBO Foundry, the IOF, the regulators at EMA and FDA, the schema-modelers at the keyboard.

Naming the consortia is the easy half.

The harder question — the one that decides whether "anchor your batch to a shared ontology" is advice or wishful thinking — is which of those outputs a real plant has wired into a running system, and which sit published, downloadable, and unused. A standard with a stable URL is not the same as a standard with a production dependency on it.

So this chapter is an inventory followed by an honest accounting. We name the vocabularies, say what each one models, and sort them by a single unforgiving criterion: not how good the ontology is, but how deeply it has actually penetrated a GMP or regulatory system.

The picture that emerges is lopsided in a way the rest of the book has been quietly preparing you for.

The simple version

Think of a professional kitchen. Some tools are in the cook's hand every service — the knife, the pans, the order tickets. Some live in the pantry, real and relied upon but tasted only through the dishes they end up in — the spice jars, the stock. And some are catalog items the kitchen ordered but never unboxed.

Every one is "in the kitchen." Only the first kind is in use. Biopharma's ontologies sort the same three ways.

What this chapter covers

We survey the vocabularies a biopharma graph actually reaches for, sorted into three maturity tiers.

Tier 1 is the genuinely production, genuinely formal layer: the Allotrope Foundation Ontology (AFO) and its lightweight sibling the Allotrope Simple Model (ASM) for lab data, the ISO IDMP family and UNII for substance and product identity, and NCIt/CDISC terminology.

Tier 2 is production but indirect — reference ontologies such as ChEBI, the unit vocabularies QUDT and UCUM, the OBO life-science anchors, and the provenance ontology PROV-O — present in manufacturing only through the files and graphs that embed them.

Tier 3 is piloted, proposed, or academic: BAO, CHMO, PROCO, and the manufacturing-process vocabularies that are thinnest exactly where this book worked hardest. We close on the gap the tiering exposes.

[Figure: a three-tier maturity chart of the ontologies used in biopharma. A downward maturity axis on the left runs from most adopted to least adopted. Tier 1 (green) holds AFO/ASM, ISO IDMP/SPOR, GSRS/UNII, and NCIt/CDISC CT. Tier 2 (amber) holds QUDT, UCUM, ChEBI, and the OBO anchors plus PROV-O. The narrower Tier 3 (violet) holds BAO (Piloted), CHMO (Academic), and PROCO (Proposed). A rose band beneath reads that the map is detailed at the borders and blank in the interior.]

Caption: The mature vocabularies describe the edges, the lab and the registry, while the process vocabulary thins toward the bottom tier. Original diagram by the authors, created with AI assistance.

Tier 1 — production and formal: the lab and the registry

Two domains in biopharma have genuinely crossed from "published ontology" to "load-bearing in production." Both sit at the edges of manufacturing rather than its center, which is the first clue to the chapter's punchline.

The first is analytical-lab data. The Allotrope Foundation Ontology (AFO) is a BFO-aligned formal ontology — organized into Equipment, Material, Process, and Result taxonomies — first publicly released in March 2018 (production) [1].

AFO is exactly the kind of thing this book has argued for: real classes, a real upper-ontology anchor, real reuse — and, unusually for this survey, real instruments emitting it.

Its original heavyweight carrier, the HDF5-based Allotrope Data Format (ADF), proved too cumbersome for casual adoption.

So the Foundation released the Allotrope Simple Model (ASM) — a lightweight JSON serialization that carries the AFO-derived vocabulary with far less overhead — publicly on 8 February 2023, covering more than forty techniques (production) [2].

One hedge belongs here in plain sight. A frequently quoted claim that AFO usage "tripled over three years" is consortium self-reported and carries no absolute base, so read it as a direction of travel, not a measured headline number [2].

The second production domain is regulatory substance and product identification — treated in depth in the regulatory chapter, named here for the inventory.

The ISO IDMP family (ISO 11238 for substances, 11615 for products, plus 11616, 11239, and 11240) is implemented by EMA's SPOR services and is in production across the EU submission stack (production) [3].

On the FDA side, GSRS/UNII issues unique substance identifiers grounded in ISO 11238 and ISO/TS 19844 (production) [4].

And the NCI Thesaurus (NCIt) is the production distribution platform for CDISC Controlled Terminology and many SPL terminology subsets (production) [5].

A distinction matters here, because the rest of the book has been strict about it. IDMP, UNII, and CDISC CT are structured controlled vocabularies and registries — not BFO-aligned OWL ontologies like AFO — but they are no less production-grade for it.

The IDMP substance identifier is precisely what a real plant puts behind our running example's bp:DS-001, the drug substance our batch becomes, when it crosses from internal genealogy into a regulated submission.

Tier 2 — production, but indirect: the infrastructure layer

The next tier is real, maintained, and depended upon — but in manufacturing it rarely appears by name. It rides inside data files, or is imported by the ontologies that are used directly.

These are the spice jars: indispensable, but you taste them only in the dish.

ChEBI (Chemical Entities of Biological Interest, maintained at EMBL-EBI) is the reference chemistry ontology, reused by PubChem and by downstream ontologies; its presence on the manufacturing floor is indirect (production) [6].

Units are the cleanest example of the indirect pattern, and they split in a way worth pinning down — because the split tells you which vocabulary a given system is really speaking. QUDT (Quantities, Units, Dimensions and Types) is the unit ontology embedded inside the Allotrope ADF, where every quantity value carries its unit (production) [7].

But UCUM (the Unified Code for Units of Measure) is the one to watch, because UCUM — not QUDT — is the unit grammar mandated inside HL7/FHIR, and therefore inside EMA PMS, FDA SPL, and PQ-CMC FHIR: the very regulatory stack already in production (production) [8].

So the same SEC result on our bp:BATCH-2026-001 might carry a QUDT unit IRI inside an Allotrope file and a UCUM code on the way to a regulator — two unit grammars for one number, neither of them visible to anyone reading the value alone.
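A minimal Python sketch of that double encoding may help. It assumes a hypothetical SEC monomer-purity result of 98.7 percent; the `Measurement` class, its field names, and the value are illustrative and do not come from any Allotrope or FHIR schema. The QUDT IRI and UCUM code for percent are the published ones, and the second export borrows only the field names of the FHIR Quantity datatype.

```python
from dataclasses import dataclass

# Published identifiers for "percent" in each unit vocabulary.
QUDT_PERCENT = "http://qudt.org/vocab/unit/PERCENT"
UCUM_SYSTEM = "http://unitsofmeasure.org"  # system URI FHIR uses for UCUM
UCUM_PERCENT = "%"


@dataclass(frozen=True)
class Measurement:
    """Hypothetical in-house record; not an Allotrope or FHIR type."""
    batch: str
    attribute: str
    value: float
    qudt_unit: str
    ucum_code: str

    def as_lab_record(self) -> dict:
        # Illustrative shape only: the real ADF layout is not reproduced here.
        return {"batch": self.batch, "attribute": self.attribute,
                "value": self.value, "unit_iri": self.qudt_unit}

    def as_fhir_quantity(self) -> dict:
        # Field names follow the FHIR Quantity datatype.
        return {"value": self.value, "unit": self.ucum_code,
                "system": UCUM_SYSTEM, "code": self.ucum_code}


# Assumed example value; the source gives no actual SEC number.
sec = Measurement("bp:BATCH-2026-001", "SEC monomer purity", 98.7,
                  QUDT_PERCENT, UCUM_PERCENT)
lab = sec.as_lab_record()
fhir = sec.as_fhir_quantity()
```

Both exports carry the same `value`; only the unit grammar differs, which is why a system that reads the number alone cannot tell which vocabulary it came from.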

The OBO life-science anchors form a second cluster here. The Protein Ontology (PRO) grounds the molecule, the Cell Line Ontology and NCBI Taxonomy ground the cell line, and the Gene Ontology and Disease Ontology ground the target (production) [12].

These are the same public ontologies Part II reached for — and in industry they anchor the discovery and R&D graphs, not the GMP floor. The pattern repeats: the formal vocabulary is densest where the science is, and thins as you walk toward the manufacturing line.

Finally, PROV-O (the W3C provenance ontology), alongside the catalog vocabularies SKOS and DCAT, forms the provenance-and-catalog layer real FAIR programs lean on; PROV-O is a common way that lineage and genealogy, this book's entire payoff, are modeled in production catalogs (production) [13].

When a plant records that bp:DS-001 is derivedFrom bp:PApool-001, the production-grade way to say that to the outside world is very often a PROV-O wasDerivedFrom.
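As a rough sketch of that translation, the Python below rewrites an internal `derivedFrom` triple into its PROV-O form and prints it as N-Triples. The `https://example.org/bp/` namespace and the internal predicate IRI are assumptions made for illustration; only the `prov:wasDerivedFrom` IRI is taken from the W3C vocabulary.

```python
PROV = "http://www.w3.org/ns/prov#"
BP = "https://example.org/bp/"  # hypothetical namespace for the bp: prefix

# Internal predicates mapped to the public PROV-O property they correspond to.
LOCAL_TO_PROV = {BP + "derivedFrom": PROV + "wasDerivedFrom"}


def to_ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one IRI-only triple as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def publish(triples):
    """Rewrite mapped internal predicates to PROV-O; pass the rest through."""
    return [to_ntriple(s, LOCAL_TO_PROV.get(p, p), o) for s, p, o in triples]


internal = [(BP + "DS-001", BP + "derivedFrom", BP + "PApool-001")]
lines = publish(internal)
print("\n".join(lines))
```

Keeping the mapping in one table means the internal genealogy can stay in local terms while the exported view speaks the vocabulary outside consumers already understand.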

One more name belongs in this tier with a caveat. LinkML, the schema-modeling framework met in the governance chapter, is the pragmatic layer where several enterprise programs author their data models before emitting OWL and SHACL. It is a modeling convenience, not itself a formal upper ontology — useful precisely because it lets a team work in spreadsheets and YAML and still compile down to the formal artifacts the rest of this tier assumes.

The tiering, at a glance

Ontology / Vocabulary | Domain modeled | Maintainer | Maturity
--- | --- | --- | ---
AFO / ASM | Analytical-lab equipment, material, process, result | Allotrope Foundation | Production
ISO IDMP / SPOR | Substance and product identity (EU) | EMA / ISO | Production
GSRS / UNII | Unique substance identifiers (US) | FDA | Production
NCIt / CDISC CT | Clinical and SPL controlled terminology | NCI / CDISC | Production
QUDT | Quantities and units (in ADF) | QUDT.org | Production (indirect)
UCUM | Unit grammar (in HL7/FHIR, SPL, PMS) | Regenstrief | Production
ChEBI | Reference chemistry | EMBL-EBI | Production (reference)
PRO / CLO / GO / DOID | Molecule, cell line, gene, disease | OBO Foundry | Production (reference)
PROV-O | Provenance and lineage | W3C | Production
BAO | High-throughput-screening assays | OBO / academic | Piloted
CHMO | Chemical methods (extends OBI) | OBO Foundry | Academic
PROCO | Process chemistry | OBO Foundry | Proposed
IOF biopharma | Bioprocess unit ops, equipment, materials, QbD, recipe | IOF (with OAGi / NIIMBL) | Released 2026-02 (adoption nascent)

Tier 3 — piloted, proposed, academic: where the words run out

The third tier is where ambition outruns adoption. The vocabularies here are not bad — several are excellent — but they have not yet earned a production dependency anyone can point to.

The BioAssay Ontology (BAO) describes high-throughput-screening assays; it has been used to annotate large HTS assay collections and ties into the Open PHACTS and Pistoia efforts — real, but pilot-scale rather than plant-wide (piloted) [9].

Below that sit the chemical-method ontologies: CHMO (the Chemical Methods Ontology, which extends OBI) (academic) [10], and PROCO (the Process Chemistry Ontology, BFO-aligned, submitted to OBO in 2021, reusing ChEBI, CHMO, and AFO) (proposed) [11].

Both are credible, well-built, and BFO-anchored, yet neither has crossed into routine GMP-manufacturing use in any public evidence we found. That is the recurring shape of this tier: the modeling is sound, the adoption is not yet there.

The manufacturing-process vocabularies proper — unit operations, equipment state, in-process material — are treated in the shop-floor chapter, and they live firmly in this tier. That placement is not an accident of this chapter's survey; it is the central finding of the whole part.

The most serious attempt to give that interior formal, BFO-grounded words is IOF's biopharma domain ontology, and it is further along than a first glance suggests. Audited directly against its February 2026 release, it defines 171 classes — all marked Released — including 44 unit-operation classes (capture, viral clearance, viral inactivation, viral filtration, polishing, ultrafiltration, drug-product formulation) and 17 QbD-parameter classes (process parameter, quality attribute, normal-operating- and proven-acceptable-range expressions). This book's running example now binds its process steps, its cell line, and its QbD scaffolding to those real IRIs rather than minting local ones. But Released in a specification is not the same as depended-upon in a plant: IOF biopharma still has no production dependency anyone can point to, which is exactly why it sits in this tier. The words finally exist; the adoption does not — yet.

The unsolved part: the mature words describe the edges, not the process

Read the tiering back as a map of manufacturing and the gap is stark.

The vocabularies that are mature describe the lab and the substance/product registry — the analytical bench at one end and the regulatory filing at the other.

The vocabulary of the process itself — the unit operation that turned bp:SEED-001 into bp:BATCH-2026-001, the equipment state of the bioreactor, the identity of the in-process material in the capture pool — is exactly where the standards are thinnest and least adopted. This book spent its entire middle modeling that process.

The industry has mature, production-grade words for what goes into the process and what comes out of it, and only immature, piloted, or home-grown words for the transformation in between. The map is detailed at the borders and blank in the interior.

One qualifier the rest of this part insists on: that interior is no longer empty of words. IOF's biopharma ontology, Released in February 2026, now supplies formal unit-operation and QbD classes, and this book's running example binds to them directly. What the interior still lacks is adoption — a plant in production that actually depends on those terms. The borders are inked because the lab and the registry run on their vocabularies every day; the interior is drawn in pencil, because its vocabulary, however real and however Released, is not yet load-bearing anywhere.

Why it matters

When this book told you to "anchor to a shared ontology," these vocabularies are what it meant — and their uneven maturity is the reason a real plant's knowledge graph is a quilt.

Mature lab and registry vocabularies are stitched, panel by panel, to home-grown process terms that no consortium has yet standardized. Knowing which thread is load-bearing and which is decorative is the difference between a graph you can defend in an audit and one that merely looks semantic.
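One way to make the quilt visible is to count a graph's terms by namespace. The Python sketch below is illustrative only. The tier labels follow this chapter's table; the namespace prefixes are assumed base IRIs that should be checked against each vocabulary's current release; and the `https://example.org/bp/` namespace, the sample process terms, and the AFO local name are hypothetical.

```python
from collections import Counter

# Namespace prefix -> tier label from this chapter's table.
# Prefixes are assumptions; verify them against each vocabulary's release.
NAMESPACE_TIERS = {
    "http://purl.allotrope.org/ontologies/": "production",
    "http://www.w3.org/ns/prov#": "production",
    "http://qudt.org/": "production (indirect)",
    "http://purl.obolibrary.org/obo/CHEBI_": "production (reference)",
    "http://purl.obolibrary.org/obo/CHMO_": "academic",
}
HOME_GROWN = "home-grown"


def tier_of(iri: str) -> str:
    """Return the tier of the longest matching namespace, else home-grown."""
    matches = [ns for ns in NAMESPACE_TIERS if iri.startswith(ns)]
    return NAMESPACE_TIERS[max(matches, key=len)] if matches else HOME_GROWN


def quilt_report(iris) -> Counter:
    """Count how many terms in a graph fall into each tier."""
    return Counter(tier_of(iri) for iri in iris)


terms = [
    "http://purl.allotrope.org/ontologies/result#EXAMPLE_TERM",  # made-up local name
    "http://www.w3.org/ns/prov#wasDerivedFrom",
    "http://qudt.org/vocab/unit/PERCENT",
    "http://purl.obolibrary.org/obo/CHEBI_15377",
    "https://example.org/bp/CaptureStep",      # hypothetical process term
    "https://example.org/bp/BioreactorState",  # hypothetical process term
]
report = quilt_report(terms)
for tier, count in report.items():
    print(f"{tier}: {count}")
```

A count like this says nothing about modeling quality; it only shows how much of a graph rests on vocabularies with a production dependency and how much on local terms, which is the distinction an audit tends to probe.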

The tiering is not pessimism; it is the honest map you need before you decide where to build and where to borrow.

Key terms

  • AFO (Allotrope Foundation Ontology) — a BFO-aligned formal ontology for analytical-lab data, organized into Equipment, Material, Process, and Result, first publicly released March 2018.
  • ASM (Allotrope Simple Model) — the lightweight JSON serialization that carries AFO-derived vocabulary with far less overhead than the HDF5-based ADF; publicly released 8 February 2023.
  • ISO IDMP — the ISO family (11238, 11615, 11616, 11239, 11240) for identifying medicinal-product substances and products; a structured standard, not an OWL ontology, implemented in production by EMA's SPOR services.
  • UNII / GSRS — FDA's unique substance identifiers and the Global Substance Registration System that issues them, grounded in ISO 11238.
  • QUDT vs UCUM — two production unit vocabularies; QUDT rides inside the Allotrope ADF, while UCUM is the unit grammar mandated inside HL7/FHIR and the regulatory stack.
  • ChEBI — the EMBL-EBI reference chemistry ontology, present in pharma manufacturing indirectly through the files and ontologies that reuse it.
  • PROV-O — the W3C provenance ontology, a production-grade standard for modeling lineage and genealogy in FAIR data catalogs.
  • Maturity tier — this chapter's sorting axis: production-and-formal, production-but-indirect, or piloted/proposed/academic, judged by depth of real-system dependency rather than ontology quality.

Where this leads

The vocabularies are only half the story.

Someone has to package, host, and sell the machinery that loads them.

The next chapter, The Platforms: How Vendors Sell Semantics, turns from the standards to the commercial systems that wrap them.

It asks the question this tiering sets up: how much of the "semantic" in a vendor pitch is AFO and PROV-O underneath, and how much is marketing over a relational database.