The Vocabularies in Use: From AFO to IDMP

📍 Where we are: Part VIII · Ontologies in Industry Today. The previous chapter named the consortia and standards bodies; this one inventories what they actually shipped, and asks the harder question of which vocabularies live in production versus merely on a download page.

The last chapter named the rooms: Allotrope, the OBO Foundry, the IOF, the regulators at EMA and FDA, the schema-modelers at the keyboard.

Naming the consortia is the easy half.

The harder question — the one that decides whether "anchor your batch to a shared ontology" is advice or wishful thinking — is which of those outputs a real plant has wired into a running system, and which sit published, downloadable, and unused. A standard with a stable URL is not the same as a standard with a production dependency on it.

So this chapter is an inventory followed by an honest accounting. We name the vocabularies, say what each one models, and sort them by a single unforgiving criterion: not how good the ontology is, but how deep it has actually penetrated a GMP (Good Manufacturing Practice — the regulated, auditable production environment) or regulatory system.

The picture that emerges is lopsided in a way the rest of the book has been quietly preparing you for.

The simple version

Think of a professional kitchen. Some tools are in the cook's hand every service — the knife, the pans, the order tickets. Some live in the pantry, real and relied upon but tasted only through the dishes they end up in — the spice jars, the stock. And some are catalog items the kitchen ordered but never unboxed.

Every one is "in the kitchen." Only the first kind is in use. Biopharma's ontologies sort the same three ways.

What this chapter covers

We survey the vocabularies a biopharma graph actually reaches for, sorted into three maturity tiers.

Tier 1 is the genuinely production, genuinely formal layer: the Allotrope Foundation Ontology (AFO) and its lightweight sibling the Allotrope Simple Model (ASM) for lab data, the ISO IDMP family and UNII for substance and product identity, and NCIt/CDISC terminology.

Tier 2 is production but indirect: reference ontologies such as ChEBI, the unit vocabularies QUDT and UCUM, the OBO life-science anchors, and the provenance ontology PROV-O, all present in manufacturing only through the files and graphs that embed them.

Tier 3 is piloted, proposed, or academic: BAO, CHMO, PROCO, and the manufacturing-process vocabularies that are thinnest exactly where this book worked hardest. We close on the gap the tiering exposes.

One word about what the tiers measure, because it is easy to misread: they sort by how directly a system depends on a vocabulary, not by how deeply. A Tier-2 unit grammar can be more load-bearing than a Tier-1 registry and still sit a tier lower, because you only ever meet it inside a file or an import — never by name on the floor.

The mature vocabularies describe the edges — the lab and the registry — while the process vocabulary thins toward the bottom tier. Original diagram by the authors, created with AI assistance.

Tier 1 — production and formal: the lab and the registry

Two domains in biopharma have genuinely crossed from "published ontology" to "load-bearing in production." Both sit at the edges of manufacturing rather than its center, which is the first clue to the chapter's punchline.

The first is analytical-lab data. The Allotrope Foundation Ontology (AFO) is a BFO-aligned formal ontology — organized into Equipment, Material, Process, and Result taxonomies — first publicly released in March 2018 (production) [1]. "BFO-aligned" is the criterion that does the sorting work in this whole chapter, so it is worth fixing here: BFO (the Basic Formal Ontology) is the tiny top-level vocabulary of kinds — things that persist (an independent continuant, like a substance), things that happen (a process), the properties they carry (a quality), and the records about them (an information artifact) — that every rung of this book's stack hangs under. A vocabulary that anchors to BFO can be reasoned over — a program (a reasoner) can derive new facts and flag contradictions — and lined up with any other BFO-aligned vocabulary, which is precisely what a registry of bare identifiers cannot offer.

AFO is exactly the kind of thing this book has argued for: real classes, a real upper-ontology anchor, real reuse — and, unusually for this survey, real instruments emitting it.

Its original heavyweight carrier, the HDF5-based Allotrope Data Format (ADF), proved too cumbersome for casual adoption.

So the Foundation released the Allotrope Simple Model (ASM) — a lightweight JSON serialization that carries the AFO-derived vocabulary with far less overhead — publicly on 8 February 2023, covering more than forty techniques (production) [2].

One hedge belongs here in plain sight. A frequently-quoted claim that AFO usage "tripled over three years" is consortium self-reported and carries no absolute base, so read it as a self-reported direction of travel, not a measured headline number [2].

It is worth naming what AFO actually carries on the floor, because it is the one place the interior and a mature ontology touch. For our campaign, AFO is the vocabulary behind the release SEC chromatogram, the trace from size-exclusion chromatography (SEC), a lab method that separates molecules by size; bp:ADF-SEC-001 is the identifier this run's chromatogram carries in the graph, the bp: prefix marking it as one of this book's own terms. Every identifier on this page has that prefix:name shape, where the prefix names the vocabulary the term comes from (obo: for OBO, qudt: for QUDT, and so on), so you can read who owns a term at a glance. That chromatogram reports the drug substance at 98.611 % monomer (the intact, single-molecule antibody) against two limits: a monomer specification of at least 95 %, and a ceiling of 2.0 % on HMW (high-molecular-weight) aggregates, the clumps of antibody stuck together, which must stay below that ceiling. The 98.611 % monomer value clears the 95 % floor, so the result conforms to specification (a pass). Those are CQAs: critical quality attributes a regulator ties to safety and efficacy, measured by a validated method, with the chromatogram retained as the raw record. AFO is production-grade here precisely because the SEC result is a discrete, defensible fact, which is exactly why the lab end got a formal ontology before the bioreactor end did.
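The pass/fail reasoning above is simple enough to sketch. The following is a minimal illustration, not the companion's code: the class and function names are hypothetical, and only the monomer floor is checked because the text does not state the measured HMW value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecResult:
    """One SEC release result; the field names are illustrative."""
    sample: str
    monomer_pct: float


# Monomer floor quoted in the text ("at least 95 %"); the constant name is an assumption.
MONOMER_MIN_PCT = 95.0


def monomer_conforms(result: SecResult, floor: float = MONOMER_MIN_PCT) -> bool:
    """Return True when the monomer value meets the 'at least' floor."""
    return result.monomer_pct >= floor


release = SecResult(sample="bp:DS-001", monomer_pct=98.611)
print(monomer_conforms(release))  # True
```

The point of keeping the limit as a named value rather than a literal in the comparison is the same one the chapter makes about the result itself: a discrete fact checked against a discrete limit is easy to defend.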

The second production domain is regulatory substance and product identification — treated in depth in the regulatory chapter, named here for the inventory.

The ISO IDMP family (ISO is the international standards organization; IDMP is its Identification of Medicinal Products standard — ISO 11238 for substances, 11615 for products, plus 11616, 11239, and 11240) is implemented by the SPOR services (Substance, Product, Organisation, Referential — the four master-data registries) of the EMA (the European Medicines Agency, the EU drug regulator) and is in production across the EU submission stack — the systems through which a company files a drug for approval (production) [3].

On the FDA side, GSRS/UNII issues unique substance identifiers grounded in ISO 11238 and ISO/TS 19844 (production) [4].

And the NCI Thesaurus (NCIt) — a large reference vocabulary maintained by the US National Cancer Institute — is the production distribution platform for CDISC Controlled Terminology (the standardized clinical-trial term set from the CDISC standards body) and many SPL (Structured Product Labeling — the US drug-label format) terminology subsets (production) [5].

A distinction matters here, because the rest of the book has been strict about it. IDMP, UNII, and CDISC CT are structured controlled vocabularies and registries — not BFO-aligned OWL (Web Ontology Language) ontologies like AFO — but they are no less production-grade for it.

The IDMP substance identifier is precisely what a real plant puts behind our running example's bp:DS-001, the drug substance our batch becomes, when it crosses from internal genealogy into a regulated submission.

That crossing is already wired in the dataset: bp:DS-001 bp:hasSubstanceIdentifier bp:IDMP-DS-001. Note the deliberate modeling choice — the IDMP identifier is an information content entity about the substance, not the substance itself, so the graph links to it through a dedicated property rather than collapsing the registry key onto the material. UNII behaves the same way (a UNII is an identifier for a substance, grounded in ISO 11238 and ISO/TS 19844), which is why both sit in the production tier as registries without being OWL ontologies: their job is durable identity, not subsumption reasoning.
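A small sketch may make the modeling choice concrete. It uses plain Python tuples as a stand-in for an RDF store, holds only the one triple quoted above, and the helper names are hypothetical.

```python
# Triples as plain (subject, predicate, object) tuples, standing in for an RDF store.
GRAPH = {
    ("bp:DS-001", "bp:hasSubstanceIdentifier", "bp:IDMP-DS-001"),
}

HAS_ID = "bp:hasSubstanceIdentifier"


def identifiers_of(graph, material):
    """Registry keys attached to a material node."""
    return sorted(o for s, p, o in graph if s == material and p == HAS_ID)


def materials_for(graph, identifier):
    """Material nodes that carry a given registry key."""
    return sorted(s for s, p, o in graph if o == identifier and p == HAS_ID)


print(identifiers_of(GRAPH, "bp:DS-001"))       # ['bp:IDMP-DS-001']
print(materials_for(GRAPH, "bp:IDMP-DS-001"))   # ['bp:DS-001']
```

Because the material and its identifier are two nodes joined by a property, the link can be walked in either direction, and a second registry key could be attached later without touching the material node.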

Tier 2 — production, but indirect: the infrastructure layer

The next tier is real, maintained, and depended upon — but in manufacturing it rarely appears by name. It rides inside data files, or is imported by the ontologies that are used directly.

These are the spice jars: indispensable, but you taste them only in the dish.

ChEBI (Chemical Entities of Biological Interest, maintained at EMBL-EBI) is the reference chemistry ontology, reused by PubChem and by downstream ontologies; its presence on the manufacturing floor is indirect (production) [6].

Units are the cleanest example of the indirect pattern, and they split in a way worth pinning down — because the split tells you which vocabulary a given system is really speaking. QUDT (Quantities, Units, Dimensions and Types) is the unit ontology embedded inside the Allotrope ADF, where every quantity value carries its unit (production) [7].

But UCUM (the Unified Code for Units of Measure) is the one to watch, because UCUM — not QUDT — is the unit grammar mandated inside HL7/FHIR (the dominant healthcare-data exchange standard — HL7 the standards body, FHIR its modern web-API format), and therefore inside the regulatory submission systems built on top of it: EMA PMS (the European agency's Product Management Service), FDA SPL (Structured Product Labeling, the US drug-label format), and PQ-CMC FHIR (the FDA's manufacturing-and-quality submission format) — the very regulatory stack already in production (production) [8]. Because each of those mandates UCUM, a number that reaches them must wear a UCUM code.

So the same SEC release result on our bp:DS-001 sample might carry a QUDT unit IRI (an IRI is a web-address-style identifier — here, one a machine can follow to the unit's definition) inside its Allotrope ADF chromatogram and a UCUM code on the way to a regulator — two unit grammars for one number, neither of them visible to anyone reading the value alone. (The companion's identifiers-and-units chapter traces that split in full.)

It helps to see the tiering as the path a single number takes through a plant. A culture-temperature setpoint leaves the historian and the OPC UA stream as a bare UCUM string — 36.5 with unit Cel — and no formal ontology at all; that is the interior, running on home-grown tags. The same campaign's release SEC chromatogram leaves the analytical instrument as AFO/ASM, every quantity value carrying a QUDT unit IRI; that is one mature edge. The batch record that ties them together is ISA-95 / B2MML structure wrapped around unit-operation names no consortium has standardized. And when the substance crosses into a submission, its identity becomes an IDMP/UNII code and its numbers re-encode to UCUM for SPL and PQ-CMC FHIR; that is the other mature edge. Four systems, four vocabularies — formal at the two ends, improvised in the middle — and the companion suite's loaders (opcua_to_rdf.py, historian_to_rdf.py, b2mml_to_rdf.py, asm_to_rdf.py — rdf for RDF, the Resource Description Framework, the simple subject-predicate-object "triple" data model this book's graph is built from) walk exactly that path. Here the historian is the plant's time-series archive and OPC UA the industrial machine-to-machine protocol the readings stream over; ISA-95 / B2MML is the manufacturing-data standard that structures a batch record.

The running example carries both grammars on real values. The release SEC result lands as a QUDT-typed quantity — bp:DS-001-monomer a qudt:QuantityValue ; qudt:hasUnit unit:PERCENT ; qudt:hasQuantityKind qkind:DimensionlessRatio (read it as one fact built from clauses: this monomer value is a QUDT quantity value, has unit percent, and is of quantity kind dimensionless ratio) — with a dereferenceable unit IRI a reasoner can work over ("dereferenceable" means a machine can follow the IRI to fetch the unit's definition — its dimension and quantity kind). The same campaign's bioreactor temperature arrives off the historian and the OPC UA tag carrying its source UCUM string qudt:ucumCode "Cel", which the loader resolves to qudt:hasUnit unit:DEG_C (and qudt:hasQuantityKind qkind:Temperature) wherever the UCUM code maps cleanly to a QUDT unit — so this value, too, ends up wearing both grammars. The two are not redundant: the QUDT IRI is for inference, while the UCUM code is a grammar a parser validates and a regulator's FHIR resource demands. A mature pipeline keeps both — which is why historian_to_rdf.py, opcua_to_rdf.py, and b2mml_to_rdf.py all write qudt:ucumCode (and the QUDT unit IRI alongside it) on the way in.
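The resolution step the loaders perform can be sketched roughly as follows. This is an assumption-laden illustration, not the code in historian_to_rdf.py: the lookup table holds only the two units named in the text, the subject identifier in the usage lines is hypothetical, and qudt:numericValue is used here to carry the number although the chapter does not say which property the loaders use for it.

```python
# Illustrative lookup from UCUM codes to (QUDT unit, quantity kind), in prefix form.
# Only the two units named in the text appear; a real table would be far larger.
UCUM_TO_QUDT = {
    "Cel": ("unit:DEG_C", "qkind:Temperature"),
    "%": ("unit:PERCENT", "qkind:DimensionlessRatio"),
}


def quantity_triples(subject, value, ucum_code):
    """Emit triples for one reading: always keep the UCUM code,
    and add the QUDT unit and quantity kind when the code maps cleanly."""
    triples = [
        (subject, "rdf:type", "qudt:QuantityValue"),
        (subject, "qudt:numericValue", value),   # assumed property for the number
        (subject, "qudt:ucumCode", ucum_code),
    ]
    mapped = UCUM_TO_QUDT.get(ucum_code)
    if mapped is not None:
        unit_iri, kind_iri = mapped
        triples.append((subject, "qudt:hasUnit", unit_iri))
        triples.append((subject, "qudt:hasQuantityKind", kind_iri))
    return triples


# "bp:TEMP-READING-001" is a hypothetical identifier for the 36.5 Cel setpoint.
for triple in quantity_triples("bp:TEMP-READING-001", 36.5, "Cel"):
    print(triple)
```

An unmapped code still yields a value carrying its UCUM string, so nothing is lost on the way in; it simply lacks the QUDT IRI a reasoner would need.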

The OBO life-science anchors form a second cluster here. OBO — the Open Biological and Biomedical Ontologies Foundry — is the curated, BFO-aligned commons of life-science reference vocabularies the discovery world already standardized on, which is why a biopharma graph borrows from it rather than minting its own biology. The Protein Ontology (PRO) [12] grounds the molecule, while the Cell Line Ontology (CLO) and NCBI Taxonomy ground the cell line, and the Gene Ontology (GO) and Disease Ontology (DOID) ground the target — the same OBO anchors Part II reached for (production).

The running example reuses three of these by IRI, in align.ttl: bp:HostOrganism rdfs:subClassOf obo:NCBITaxon_10029 (read rdfs:subClassOf as "is a kind of": our host-organism class is declared a subclass of NCBI Taxonomy's Cricetulus griseus, the Chinese hamster behind the CHO line — Chinese Hamster Ovary cells, the workhorse host that grows most biologic drugs), bp:WorkingCellBank rdfs:subClassOf obo:CLO_0002421 (the Cell Line Ontology's CHO-cell class), and bp:Target rdfs:subClassOf obo:PR_000000001 (a PRO protein IRI) — the same edges CQ-20 (competency question 20, one of the book's numbered acceptance-test queries) and its "what host organism, by stable NCBI Taxon IRI" question resolve against. The cluster is Tier 2 here for the reason its label says: the plant never types its line "by hand," it inherits the OBO IRI through that alignment.
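How a query like CQ-20 resolves against those edges can be sketched without a triple store. The dictionary below restates the three alignments quoted above; the function names are hypothetical, and the single-parent mapping is a simplification (a real class can have several superclasses).

```python
# The three rdfs:subClassOf alignments quoted from align.ttl, as child -> parent.
SUBCLASS_OF = {
    "bp:HostOrganism": "obo:NCBITaxon_10029",
    "bp:WorkingCellBank": "obo:CLO_0002421",
    "bp:Target": "obo:PR_000000001",
}


def superclasses(cls, subclass_of=SUBCLASS_OF):
    """Walk rdfs:subClassOf upward, returning each superclass in order."""
    chain, seen = [], {cls}
    while cls in subclass_of:
        cls = subclass_of[cls]
        if cls in seen:  # guard against an accidental cycle
            break
        seen.add(cls)
        chain.append(cls)
    return chain


def host_taxon(local_class="bp:HostOrganism"):
    """Return the first NCBI Taxon IRI found above the class, or None."""
    for sup in superclasses(local_class):
        if sup.startswith("obo:NCBITaxon_"):
            return sup
    return None


print(host_taxon())  # obo:NCBITaxon_10029
```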

In industry these anchor the discovery and R&D graphs, not the GMP floor. The pattern repeats: the formal vocabulary is densest where the science is, and thins as you walk toward the manufacturing line.

Finally, PROV-O (the W3C provenance ontology), alongside the catalog vocabularies SKOS and DCAT, is the provenance-and-catalog layer real FAIR (Findable, Accessible, Interoperable, Reusable) data programs lean on; PROV-O is a common way that lineage and genealogy, this book's entire payoff, are modeled in production catalogs (production) [13].

When a plant records that bp:DS-001 is derivedFrom bp:POLpool-001 (its immediate parent — the Protein A capture pool bp:PApool-001 sits four hops up the same genealogy chain, through the polishing and the two viral pools), the production-grade way to say that to the outside world is very often a PROV-O wasDerivedFrom.

This book's own graph keeps the two layers visibly separate: it asserts lineage with the local bp:derivedFrom and reserves PROV-O for attribution and curation — prov:wasAttributedTo on each source claim, a prov:Activity for the steward's reconciliation. The PROV-O wasDerivedFrom is the export idiom, the word you reach for when the genealogy has to leave the building.
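The export idiom amounts to a predicate rename at the boundary. A minimal sketch, assuming the internal graph is available as tuples and that only lineage edges are exported; the function and variable names are hypothetical.

```python
# Local predicate -> PROV-O predicate used when lineage leaves the building.
LOCAL_TO_PROV = {"bp:derivedFrom": "prov:wasDerivedFrom"}


def export_lineage(triples, mapping=LOCAL_TO_PROV):
    """Return only the lineage triples, with the local predicate
    renamed to its PROV-O export form."""
    return [(s, mapping[p], o) for s, p, o in triples if p in mapping]


internal = [
    ("bp:DS-001", "bp:derivedFrom", "bp:POLpool-001"),
    ("bp:DS-001", "bp:hasSubstanceIdentifier", "bp:IDMP-DS-001"),
]

print(export_lineage(internal))
# [('bp:DS-001', 'prov:wasDerivedFrom', 'bp:POLpool-001')]
```

Keeping the mapping in one table means the internal graph never has to change when the export vocabulary does.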

One more name belongs in this tier with a caveat. LinkML, the schema-modeling framework met in the governance chapter, is the pragmatic layer where several enterprise programs author their data models before emitting OWL and SHACL (the Shapes Constraint Language, which checks that real data meets the ontology's rules — e.g. every batch must carry a substance identifier, every quantity value a unit). It is a modeling convenience, not itself a formal upper ontology — useful precisely because it lets a team work in spreadsheets and YAML and still compile down to the formal artifacts the rest of this tier assumes.
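The two example rules in the SHACL gloss above can be imitated in plain Python to show what such a check does. This is a stand-in for illustration only: it is not SHACL, a real pipeline would run the compiled shapes through a SHACL validator, and bp:Batch is a hypothetical class name.

```python
# (target class, required predicate) pairs mirroring the two example rules.
REQUIRED = [
    ("qudt:QuantityValue", "qudt:hasUnit"),
    ("bp:Batch", "bp:hasSubstanceIdentifier"),  # bp:Batch is a hypothetical class name
]


def violations(triples, required=REQUIRED):
    """Return sorted (node, missing predicate) pairs for every typed node
    that lacks a predicate its class requires."""
    types, preds = {}, {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
        preds.setdefault(s, set()).add(p)
    found = []
    for target, needed in required:
        for node, classes in types.items():
            if target in classes and needed not in preds.get(node, set()):
                found.append((node, needed))
    return sorted(found)


graph = [
    ("bp:DS-001-monomer", "rdf:type", "qudt:QuantityValue"),
    ("bp:DS-001-monomer", "qudt:hasUnit", "unit:PERCENT"),
    ("bp:reading-x", "rdf:type", "qudt:QuantityValue"),  # hypothetical node, no unit
]
print(violations(graph))  # [('bp:reading-x', 'qudt:hasUnit')]
```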

Why this inventory is also an AI-readiness inventory

There is a reason the newest argument for any of these vocabularies — restated in full in the AI frontier chapter and the companion volume's own frontier chapter — turns the same tier list into a checklist of what a model is allowed to stand on. A large language model is a fluent guesser: ask it whether bp:DS-001 met its monomer specification and it will compose a confident answer with the same ease whether or not it has the fact. Something else must supply the truth, and these vocabularies are that something. The technique the 2024-2026 wave settled on, retrieval-augmented generation — pulling verified facts from a trusted store and requiring the model to answer only from them — is only as trustworthy as the store, and in its graph-native form (GraphRAG) the store is exactly the bp: graph this chapter inventories. The mature edges are what make that store dependable: an AFO/ASM SEC result is a checkable lab fact, an IDMP/UNII code is a durable identity, and a model that retrieves either is grounded; a model improvising over the home-grown, vocabulary-free interior has nothing to retrieve and falls back to invention.

The unit split this section spent so long on is not academic once a model is involved — it is what keeps a learned feature dimensionally honest. A model that ingests 36.5 off the historian as a bare float will happily mix it with a Cel-meaning value and a K-meaning value as if they were the same number; the QUDT IRI and the qudt:ucumCode "Cel" string the loaders write are what let a feature pipeline (or the reasoner behind it) refuse that mistake before it reaches the model. A graph feature is only safe to learn from once it is unit-bearing and identified — which is precisely the lift from mute float to typed quantity that the Tier-2 unit vocabularies perform.
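The refusal described above can be sketched as a typed quantity that will not be averaged across units. The class and exception names are hypothetical, and the kelvin reading is an invented illustration (36.5 Cel expressed in K), not a value from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number that carries its UCUM unit code."""
    value: float
    ucum: str


class UnitMismatch(ValueError):
    """Raised when values with different units would be combined."""


def mean(quantities):
    """Average readings only if they all share one unit code."""
    units = {q.ucum for q in quantities}
    if len(units) != 1:
        raise UnitMismatch(f"expected exactly one unit, got {sorted(units)}")
    (unit,) = units
    return Quantity(sum(q.value for q in quantities) / len(quantities), unit)


print(mean([Quantity(36.5, "Cel"), Quantity(37.5, "Cel")]))
# Quantity(value=37.0, ucum='Cel')

try:
    mean([Quantity(36.5, "Cel"), Quantity(309.65, "K")])  # hypothetical K reading
except UnitMismatch as err:
    print("refused:", err)
```

A bare float has no way to raise this error, which is the whole argument for lifting the value into a typed quantity before it reaches a feature pipeline.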

Three of this book's own artifacts turn that grounding into a discipline a model must actually obey:

The SHACL gate becomes a retrieval gate. The shapes that refuse a non-conformant release — every batch carrying its substance identifier, every quantity value its unit — equally refuse a non-conformant retrieval. A SHACL-screened subgraph is the only honest training set or retrieval context a model that learns over these instances can have; a subgraph that fails its shapes is the hollow, mislabeled input a fluent model will cheerfully complete from training memory.
The lineage edges are the cross-validation fold. Because bp:derivedFrom is explicit and transitive, a model's data can be split the way bioprocess data demands: not by random rows but by batch, with a grouped / leave-one-batch-out scheme that holds out together every row sharing a bp:derivedFrom ancestor. Every lot in this campaign descends from the same working cell bank, so two lots off one bank are near-twins; the (bp:derivedFrom)+ walk back to that shared ancestor is the grouping key, the discipline the companion volume's validation chapter makes its default. A flat export of proprietary tables leaves that grouping to a hopeful convention; the graph makes it mechanical.
The reasoned graph wins the tie — the validation paradox. A fluent model is checked against held-out data, but that data is only as honest as the graph it came from. A model that contradicts a reasoned graph — one whose owl:TransitiveProperty closure and SHACL shapes have already been machine-checked — is, in that contradiction, the more likely wrong of the two, because the graph's answer was derived and certified while the model's was merely generated. This is the validation paradox the companion volume names, and it is why the same tiering that ranks these vocabularies for an audit ranks them for an AI: the mature, reasoned, shape-validated edges are exactly the trustworthy ground truth a model needs, and the unvalidated interior is exactly where a model is most free to invent.
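The grouped split in the second point can be sketched in a few lines. Everything named here is an assumption: the lot and bank identifiers are hypothetical, each node is given a single parent for simplicity, and the walk stops at the topmost ancestor. In a campaign where every lot shares one bank, that choice would put all rows in one group, so the walk would need to stop at a lower ancestor such as the batch; which level to group on is a modeling decision this sketch leaves open.

```python
def root_ancestor(node, parent_of):
    """Follow bp:derivedFrom edges upward until a node with no parent is reached."""
    seen = set()
    while node in parent_of and node not in seen:
        seen.add(node)
        node = parent_of[node]
    return node


def leave_one_group_out(rows, parent_of):
    """Yield (group key, train indices, test indices), holding out together
    every row whose material shares the same root ancestor."""
    groups = {}
    for i, material in enumerate(rows):
        groups.setdefault(root_ancestor(material, parent_of), []).append(i)
    for key in sorted(groups):
        test = groups[key]
        held = set(test)
        train = [i for i in range(len(rows)) if i not in held]
        yield key, train, test


# Hypothetical lineage: child -> parent along bp:derivedFrom.
parent_of = {
    "bp:LOT-A1": "bp:BANK-A",
    "bp:LOT-A2": "bp:BANK-A",
    "bp:LOT-B1": "bp:BANK-B",
}
rows = ["bp:LOT-A1", "bp:LOT-B1", "bp:LOT-A2"]  # one material per data row

for key, train, test in leave_one_group_out(rows, parent_of):
    print(key, train, test)
# bp:BANK-A [1] [0, 2]
# bp:BANK-B [0, 2] [1]
```

The two sibling lots off the same hypothetical bank always land on the same side of the split, which is the leakage the grouped scheme exists to prevent.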

So the lopsidedness this chapter is building toward has a second edge. The same map that is detailed at the borders and blank in the interior is also a map of where a model can be grounded and where it can only guess: the edges give a model a checkable, unit-bearing, lineage-anchored fact to retrieve, and the interior — mute, home-grown, un-reasoned — is precisely where a confident, fluent, wrong answer is most expensive and least catchable.

The tiering, at a glance

Ontology / Vocabulary	Domain modeled	Maintainer	Maturity
AFO / ASM	Analytical-lab equipment, material, process, result	Allotrope Foundation	Production
ISO IDMP / SPOR	Substance and product identity (EU)	EMA / ISO	Production
GSRS / UNII	Unique substance identifiers (US)	FDA	Production
NCIt / CDISC CT	Clinical and SPL controlled terminology	NCI / CDISC	Production
QUDT	Quantities and units (in ADF)	QUDT.org	Production (indirect)
UCUM	Unit grammar (in HL7/FHIR, SPL, PMS)	Regenstrief	Production
ChEBI	Reference chemistry	EMBL-EBI	Production (reference)
PRO / CLO / GO / DOID	Molecule, cell line, gene, disease	OBO Foundry	Production (reference)
PROV-O	Provenance and lineage	W3C	Production
BAO	High-throughput-screening assays	Univ. of Miami (academic)	Piloted
CHMO	Chemical methods (extends OBI)	OBO Foundry	Academic
PROCO	Process chemistry	OBO Foundry	Proposed
IOF biopharma	Bioprocess unit ops, equipment, materials, QbD, recipe	IOF (with OAGi / NIIMBL)	Released 2026-02 (adoption nascent)

A note on the maturity column: "Production (indirect)" and "Production (reference)" are no weaker than bare "Production" — they mark a vocabulary you depend on through a file or an import rather than one you cite by name. The tier is about directness of the dependency, not its weight.

One axis the table hides is what each vocabulary ships in, which is half the integration cost: AFO is OWL (and ASM re-serializes its terms as JSON-LD); CDISC CT travels as NCIt OWL plus flat files; IDMP and SPL are XML; UCUM is a code grammar embedded in FHIR JSON/XML; PROV-O, SKOS and DCAT are RDF. "In use" therefore means different things — AFO you can classify with a DL reasoner (a description-logic reasoner: software that reads OWL's logical definitions and works out, on its own, what falls under what), NCIt you mostly look up (it is a poly-hierarchical thesaurus — a browsable term list where a concept can sit under several parents at once — not engineered for that kind of automated OWL-DL reasoning), and IDMP you parse. The companion proves the cheapest round-trip: asm_to_rdf.py loads one Allotrope Simple Model JSON-LD document and yields the same bp:SEC-Result-001 triples the Turtle asserts — identical meaning, lighter file.

Tier 3 — piloted, proposed, academic: where the words run out

The third tier is where ambition outruns adoption. The vocabularies here are not bad — several are excellent — but they have not yet earned a production dependency anyone can point to.

The BioAssay Ontology (BAO) describes high-throughput-screening assays; it has been used to annotate large HTS assay collections and ties into the Open PHACTS and Pistoia efforts — real, but pilot-scale rather than plant-wide (piloted) [9].

Below that sit the chemical-method ontologies: CHMO (the Chemical Methods Ontology, which extends OBI) (academic) [10], and PROCO (the Process Chemistry Ontology, BFO-aligned, submitted to OBO in 2021, reusing ChEBI, CHMO, and AFO) (proposed) [11].

Both are credible, well-built, and BFO-anchored — and neither has crossed into routine GMP-manufacturing use in the public evidence found. That is the recurring shape of this tier: the modeling is sound, the adoption is not yet there.

The manufacturing-process vocabularies proper — unit operations, equipment state, in-process material — are treated in the shop-floor chapter, and they live firmly in this tier. That placement is not an accident of this chapter's survey; it is the central finding of the whole part.

The most serious attempt to give that interior formal, BFO-grounded words is IOF's biopharma domain ontology, and it is further along than a first glance suggests. Audited directly against its February 2026 release, it defines 171 classes, all marked Released (a class being a named category of thing the ontology defines, like "capture step" or "process parameter"). Among them are 44 unit-operation classes, the discrete process steps a biologic runs through (capture; viral inactivation and viral filtration, the two orthogonal viral-clearance steps; polishing; ultrafiltration; drug-product formulation), and 17 QbD-parameter classes (process parameter, quality attribute, normal-operating- and proven-acceptable-range expressions). QbD is Quality by Design, the regulatory approach of defining, up front, which process parameters and quality attributes must be controlled. This book's running example now binds its process steps, its cell line, and its QbD scaffolding to those real IRIs rather than minting local ones. But Released in a specification is not the same as depended-upon in a plant: IOF biopharma still has no production dependency anyone can point to, which is exactly why it sits in this tier. The words finally exist; the adoption does not, yet.

The unsolved part: the mature words describe the edges, not the process

Read the tiering back as a map of manufacturing and the gap is stark.

The vocabularies that are mature describe the lab and the substance/product registry — the analytical bench at one end and the regulatory filing at the other.

The vocabulary of the process itself — the cell-culture process (bp:CellCultureProcess) that turned bp:SEED-001 into bp:BATCH-2026-001, the equipment state of the bioreactor, the identity of the in-process material in the capture pool — is exactly where the standards are thinnest and least adopted. This book spent its entire middle modeling that process.

The industry has mature, production-grade words for what goes into the process and what comes out of it, and only immature, piloted, or home-grown words for the transformation in between. The map is detailed at the borders and blank in the interior.

There is a manufacturing reason the interior resisted standardization while the edges did not. An SEC result and a UNII code are discrete, plant-independent facts — the same number, the same identifier, whoever measures it — so a shared vocabulary for them is just a matter of agreeing on names. A unit operation is not a fact; it is a controlled trajectory. "Protein A capture" is the same noun at every plant, but a different resin, a different load (grams of mAb per litre of resin), a different residence time, a different wash-and-elution buffer train, and a different pool-collection criterion at each — the name standardizes, the recipe does not. The edges got their ontologies first precisely because what they describe is the same everywhere; the interior is thin because what it describes is, by design, where each manufacturer's process knowledge actually lives.

There is a modeling reason underneath the same fact, and it is the continuant/occurrent split this book has leaned on since Part II. A substance and a product are independent continuants with stable identity criteria — a drug substance is the same substance from vial to vial and year to year — and registries like IDMP and UNII are very good at minting durable identifiers for things that persist. The process interior, by contrast, is made of BFO occurrents — things that happen rather than persist: a capture step, a low-pH hold, a cell-culture run, each with temporal parts (a beginning, middle, and end), participants whose state changes continuously, and dispositions (latent tendencies, like a buffer's tendency to elute protein) realized only while the process unfolds. The borders are mature partly because they ask the easy ontological question (what persists); the interior lags partly because it asks the hard one (what happens).

One qualifier the rest of this part insists on: that interior is no longer empty of words. IOF's biopharma ontology, Released in February 2026, now supplies formal unit-operation and QbD classes, and this book's running example binds to them directly. What the interior still lacks is adoption — a plant in production that actually depends on those terms. The borders are inked because the lab and the registry run on their vocabularies every day; the interior is drawn in pencil, because its vocabulary, however real and however Released, is not yet load-bearing anywhere.

Why it matters

When this book told you to "anchor to a shared ontology," these vocabularies are what it meant — and their uneven maturity is the reason a real plant's knowledge graph is a quilt.

Mature lab and registry vocabularies are stitched, panel by panel, to home-grown process terms that no consortium has yet standardized. Knowing which thread is load-bearing and which is decorative is the difference between a graph you can defend in an audit and one that merely looks semantic — that uses ontology-shaped names without a real, reasoned-over vocabulary underneath.

The tiering is not pessimism; it is the honest map you need before you decide where to build and where to borrow.

Key terms

AFO (Allotrope Foundation Ontology) — a BFO-aligned formal ontology for analytical-lab data, organized into Equipment, Material, Process, and Result, first publicly released March 2018.
ASM (Allotrope Simple Model) — the lightweight JSON serialization that carries AFO-derived vocabulary with far less overhead than the HDF5-based ADF; publicly released 8 February 2023.
ISO IDMP — the ISO family (11238, 11615, 11616, 11239, 11240) for identifying medicinal-product substances and products; a structured standard, not an OWL ontology, implemented in production by EMA's SPOR services.
UNII / GSRS — FDA's unique substance identifiers and the Global Substance Registration System that issues them, grounded in ISO 11238.
QUDT vs UCUM — two production unit vocabularies; QUDT rides inside the Allotrope ADF, while UCUM is the unit grammar mandated inside HL7/FHIR and the regulatory stack.
ChEBI — the EMBL-EBI reference chemistry ontology, present in pharma manufacturing indirectly through the files and ontologies that reuse it.
PROV-O — the W3C provenance ontology, a production-grade standard for modeling lineage and genealogy in FAIR data catalogs.
Maturity tier — this chapter's sorting axis: production-and-formal, production-but-indirect, or piloted/proposed/academic, judged by depth of real-system dependency rather than ontology quality.
GraphRAG — retrieval-augmented generation whose trusted store is a knowledge graph, so a model retrieves connected, typed facts (such as bp:derivedFrom edges) to answer from rather than improvising; the mature edges of this inventory are what make that store dependable.
Grouped / leave-one-batch-out cross-validation — splitting a model's data by batch rather than by row, holding out every row sharing a bp:derivedFrom ancestor, so the score is not inflated by sibling lots leaking across the split; the graph's lineage edges supply the grouping key.
Validation paradox — a fluent model is checked against held-out data, but a reasoned, SHACL-validated graph is the more trustworthy of the two when they disagree, because the graph's answer was derived and certified while the model's was generated.

Where this leads

The vocabularies are only half the story.

Someone has to package, host, and sell the machinery that loads them.

The next chapter, The Platforms: How Vendors Sell Semantics, turns from the standards to the commercial systems that wrap them.

It asks the question this tiering sets up: how much of the "semantic" in a vendor pitch is AFO and PROV-O underneath, and how much is marketing over a relational database.
