Regulatory Semantics: IDMP, SPL, KASA, and the Structured Submission

📍 Where we are: Part VIII · Ontologies in Industry Today. The previous chapters walked the manufacturing floor, where formal ontologies are thinnest. We now cross to the regulatory interface — where structured, semantic data is most advanced, and in places legally mandated.

The earlier chapters of this Part delivered an uncomfortable finding. On the GMP (Good Manufacturing Practice) shop floor, where mAb-A (the monoclonal antibody this book follows as its running example) is actually made, formal ontology, meaning a machine-readable model of what things are and how they relate, is rare. One could read that as a verdict against the technology. That would be the wrong lesson.

Cross the boundary from the plant to the regulator and the picture inverts. The interface where a manufacturer tells an agency what a drug is — its substances, its strengths, its labeling, its quality — is the most semantically advanced corner of the entire industry.

And much of it is not optional. It is law. Where the floor's structured data is something a plant builds because it pays off, the regulatory interface's structured data is something a plant builds because, absent it, the filing is rejected at the door.

One framing point before the stack: mAb-A is a biologic, so in the United States it is filed as a Biologics License Application (BLA) under section 351 of the Public Health Service Act, not as a New Drug Application or Abbreviated New Drug Application under the Federal Food, Drug, and Cosmetic Act. That distinction quietly shapes much of what follows. The structured-assessment and structured-CMC (chemistry, manufacturing, and controls) pilots in this chapter were born on the small-molecule, solid-oral, generic side of the house, which is why a liquid monoclonal, the very product this book follows, tends to reach each of them last.

This chapter surveys that stack, and it does so with a distinction this book has insisted on from the first page. The regulatory interface is built largely from structured XML, HL7 FHIR (a healthcare data-exchange format), and controlled vocabularies (shared, official dictionaries of approved terms) — not from formal OWL ontologies, the logic-based models this book has built. Exactly one piece of it is a true OWL ontology with a reasoner — the engine that infers new facts from the ones already stated — behind it; the rest is mandated, validated, structured data.

Understanding why that mix is what shipped — and why a self-modifying, reasoning ontology has not — is the chapter's real payload. The answer turns out to lie not in any missing standard but in the regime that decides what is allowed to run in a regulated environment at all.

The simple version

Imagine every country agreed to fill out the same form, in the same boxes, using the same official dictionary of words, so that a computer could check the form before a human ever read it. That is roughly what drug regulators have built for product identity and labeling — a shared form with a shared codebook. What they did not build is a machine that reasons about your filing and rewrites it. The form is locked, audited, and validated; it does not think. That difference — a checkable form versus a thinking system — is the whole story of why structured data won at the regulatory boundary and live ontologies did not.

What this chapter covers

We begin with the legally binding core: ISO IDMP (Identification of Medicinal Products) and the EMA's SPOR master-data services, then the FDA's substance backbone, GSRS/UNII, and its labeling standard, SPL.

We then meet the nonclinical-data standard CDISC SEND and its enforcement teeth, the Technical Rejection Criteria, and KASA, the FDA's internal structured-assessment platform — the closest real-world cousin to this book's SHACL release gate (the rule-checked quality gate that decides whether a batch may be released).

From there we survey what is still piloted or proposed: PQ/CMC FHIR, eCTD v4.0 (the electronic Common Technical Document — the standard envelope a submission is packaged in), and the structured-CMC future in ICH M16 / SPQS. We note the one formal OWL artifact, the Pistoia IDMP Ontology, and the unit grammar that threads through everything, UCUM. Finally we confront the GxP validation gate — the real reason ontologies stay off the floor.

One substance identity rises from the knowledge graph through mandated, structured submission layers, color-coded by maturity, until the GxP validation gate stops a live reasoner. Original diagram by the authors, created with AI assistance.

The mandated core: ISO IDMP and EMA SPOR

The foundation is a family of ISO standards collectively called ISO IDMP — Identification of Medicinal Products — a set of data models for describing substances, products, dosage forms, units, and organisations in a uniform, machine-readable way.

In the European Union its use is not a recommendation. ISO IDMP is legally mandated for Member States, marketing-authorisation holders, and the EMA under Articles 25–26 of Commission Implementing Regulation (EU) No 520/2012 [1]. That word — mandated — is what sets this layer apart from almost everything else in the book: a manufacturer does not adopt IDMP because it is elegant, but because the law requires the filing to be structured. (production)

The EMA delivers IDMP through four master-data services known by the acronym SPOR — Substances, Products, Organisations, Referentials. Two of them launched first and have been operational since June 2017: the Referentials Management Service (RMS) and the Organisations Management Service (OMS).

The Products service, PMS, is built on HL7 FHIR — the same healthcare data-exchange standard the rest of this stack leans on. Its timeline is easy to garble, so it is worth stating precisely.

The PMS Product user interface went live read-only on 31 May 2024; the PMS API was delayed and released read-only on 3 July 2024 — the two dates are routinely conflated — and write and edit access rolled out from late 2024 [1]. (production)

This is the layer that, in our running example, would carry the real-world identity behind bp:DS-001. The IDMP substance identifier is the regulator's name for the same drug substance our graph tracks from the working cell bank WCB-CHO-001 (the vial of CHO — Chinese-hamster-ovary — production cells where the manufacturing chain begins) forward.

It is, in effect, what a real plant puts behind that node when it speaks to an agency rather than to its own historian: the internal bp: identifier is for the factory's knowledge graph; the IDMP identifier is for the world outside it. The two name the same substance, and keeping that correspondence true is part of what a working master-data discipline is for.

Ontologically the two are not the same kind of thing, and the companion model is careful about it. bp:DS-001 is the drug substance itself: an independent continuant, a BFO material entity (a physical thing that exists in its own right). The IDMP identity bp:IDMP-DS-001, which carries bp:uniiCode and bp:mpid, is an information content entity (a piece of information about a thing, not the thing itself). In the model, bp:SubstanceIdentifier rdfs:subClassOf bp:InformationArtifact rdfs:subClassOf iof:InformationContentEntity, which is a BFO generically dependent continuant. The identifier is about the substance through bp:isAbout, aligned to IAO's is about (obo:IAO_0000136). That is why an MPID or a UNII can be reissued or corrected while the substance it denotes is unchanged: you are editing the label, never the thing labelled.
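The label-versus-thing split can be sketched in a few lines of Python. This is an illustration only, not the companion's model: the class names and the placeholder code strings are assumptions, and only the identifiers bp:DS-001 and bp:IDMP-DS-001 come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Substance:
    """The material entity: the drug substance itself."""
    iri: str


@dataclass(frozen=True)
class SubstanceIdentifier:
    """An information content entity that is about a substance."""
    iri: str
    is_about: Substance
    unii_code: str  # placeholder strings below, not real codes
    mpid: str


ds_001 = Substance(iri="bp:DS-001")
identity = SubstanceIdentifier(
    iri="bp:IDMP-DS-001",
    is_about=ds_001,
    unii_code="UNII-PLACEHOLDER-1",  # hypothetical value
    mpid="MPID-PLACEHOLDER-1",       # hypothetical value
)

# Correcting the code yields a new label record; the substance is untouched.
corrected = replace(identity, unii_code="UNII-PLACEHOLDER-2")

print(corrected.is_about is ds_001)  # the label still points at the same thing
```

The design choice worth noticing is that the correction never touches the Substance object: only the record that is about it is reissued.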

In the companion this correspondence is not a wish but an executable acceptance test. Competency question CQ-16 — one of the plain-English questions the ontology must be able to answer, drawn from the project's requirements document (the ORSD) — asks, in the ORSD's own words, "Does the ISO IDMP regulatory identity attach to the same node the release gate validated, not a copy?" — and validate.py runs it as a one-line ASK (a SPARQL query that returns just true or false):

PREFIX bp: <https://example.org/bioproc#>
ASK {
  bp:DS-001 bp:hasSubstanceIdentifier ?id .   # the regulator's IDMP/UNII identity
  bp:DS-001 bp:monomerPct           ?m .       # the released CQA panel
}

It passes only because the IDMP identity and the release-gate result hang on the same bp:DS-001 node, not on a parallel record. The regulated identity and the released quality are one entity — which is exactly the master-data discipline named just above, expressed as a test that is green or red, never merely asserted.

And this is not prose dressed up as code. The same ASK lives in the companion's executable ORSD, cq-catalog.json, and is run by examples/platform/ontology/validate.py on a plain open-source triplestore. It uses the same idiom as the runnable SPARQL property-path digital-thread query that Book 3 builds with RDFLib, here pointed at the regulatory join. CQ-16's green or red is therefore reproducible on a laptop, not a claim a vendor asks you to believe. That matters at the regulatory boundary precisely because a structured submission is auditable only if its consistency checks can be re-run by someone who did not author them.
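To make the "same node, not a copy" logic concrete, here is a minimal sketch of what the ASK decides, assuming a plain Python set of tuples as a stand-in for the triplestore. It is not the companion's validate.py; the helper names and the bp:DS-001-copy node are hypothetical, introduced only to show the failing case.

```python
# Triples as (subject, predicate, object) tuples: a stand-in for a triplestore.
SAME_NODE = {
    ("bp:DS-001", "bp:hasSubstanceIdentifier", "bp:IDMP-DS-001"),
    ("bp:DS-001", "bp:monomerPct", 98.611),
}

# A parallel record: the identity hangs on a copy, not on the released node.
PARALLEL_RECORD = {
    ("bp:DS-001-copy", "bp:hasSubstanceIdentifier", "bp:IDMP-DS-001"),  # hypothetical node
    ("bp:DS-001", "bp:monomerPct", 98.611),
}


def has_property(triples, subject, predicate):
    """True if the subject carries at least one value for the predicate."""
    return any(s == subject and p == predicate for s, p, _ in triples)


def ask_cq16(triples, node="bp:DS-001"):
    """Mirror of the ASK: both patterns must match on the same subject."""
    return (has_property(triples, node, "bp:hasSubstanceIdentifier")
            and has_property(triples, node, "bp:monomerPct"))


print(ask_cq16(SAME_NODE))        # identity and release result share one node
print(ask_cq16(PARALLEL_RECORD))  # identity sits on a copy, so the check fails
```

The check is deliberately strict about the subject: both facts existing somewhere in the graph is not enough, they have to hang on one node.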

The FDA substance backbone: GSRS and UNII

One distinction underlies this whole layer, and it is worth naming plainly before the ISO numbers arrive, because it is the same one our running example already encodes: IDMP separates a substance — the molecule itself, mAb-A — from a product, the marketable thing built around it (a strength, a dose form, a pack). They get different standards and different identifiers, which is exactly why bp:IDMP-DS-001 carries both a bp:uniiCode (a substance code) and a bp:mpid, a Medicinal Product Identifier — two names for two layers of the same regulated identity.

The United States anchors substance identity in the Global Substance Registration System (GSRS), co-developed by the FDA and NIH/NCATS. GSRS implements the IDMP substance standards: ISO 11238 (substance identification) and its implementation guide, ISO/TS 19844. These sit beside the IDMP product standards (ISO 11615 and the 11616/11239/11240 family) that EMA's SPOR services implement on the product side. GSRS classifies every substance into one of a small set of substance classes (a monoclonal antibody such as mAb-A registers as a protein) and mints a free, non-proprietary UNII (Unique Ingredient Identifier) code for it; UNII codes are surfaced publicly on DailyMed. In our companion that UNII rides on bp:IDMP-DS-001 as bp:uniiCode, the same bp:SubstanceIdentifier node CQ-16 ties back to bp:DS-001.

The scale is real: a 2020 public release of the system held 116,636 substance definitions [2]. That a substance identifier is free and non-proprietary matters more than it sounds — it is the property that lets a UNII travel across systems and organisations without licensing friction, the same property this book argued for when it pinned global IRIs to the knowledge graph.

Notably, the EU substance service EU-SRS runs on the same GSRS software and has been live at the EMA since January 2023 — a rare case of two agencies sharing one codebase rather than two reconciling rival registries after the fact [2]. (production)

The labeling standard: SPL

When a manufacturer files a product label, it does not file a PDF; it files Structured Product Labeling (SPL), an HL7 v3 XML standard. SPL is mandatory under the FDA Amendments Act of 2007 and 21 CFR Part 207.

And it is not free text dressed up in tags. An SPL document draws on 44 controlled-terminology subsets from the NCI Thesaurus, plus NDC and UNII identifiers, so the meaning of each field is pinned to a shared vocabulary rather than left to prose a reviewer must interpret by hand [3]. The label, in other words, is computable: an agency can query across labels because every field denotes the same thing in every filing. (production)

Nonclinical data and its enforcement teeth: SEND

For nonclinical — that is, animal — study data, the FDA mandates CDISC SEND (Standard for Exchange of Nonclinical Data). One point must be stated plainly, because it is a common and consequential error: SEND governs nonclinical study data only. It is not a manufacturing or CMC standard; it has nothing to say about how BATCH-2026-001 was run or tested. Its controlled terminology is distributed through the NCI Thesaurus, which is the production distribution platform for CDISC Controlled Terminology.

SEND also carries real enforcement teeth. The Technical Rejection Criteria — a set of machine-checked error codes — became effective 15 September 2021, after which a non-conforming submission is technically rejected at the door, before any human reviewer sees it [4]. (production)

KASA: a rule-based assessment, not an ontology

The closest regulatory relative to this book's SHACL release gate is KASA — Knowledge-Aided Assessment and Structured Application — the FDA Office of Pharmaceutical Quality's internal review platform. It is worth pausing on, because it is the place where a regulator does, in production, something close to what this book has been modeling.

Launched in February 2021 for non-sterile solid-oral-dosage ANDAs, it has more than 300 active users, and a 2025 peer-reviewed paper documents more than 1,130 manufacturing assessments completed through September 2024 [5]. KASA applies rules and quantitative risk models to standardized CMC data, and it is designed to consume PQ/CMC structured data.

But it must be classified honestly: KASA is a structured-data and controlled-vocabulary platform, not a formal OWL ontology with a reasoner.

It checks structured meaning the way our SHACL gate checks DP-004 against spec — by rule, against a shape — not by inference (deriving a new fact the data never stated outright) over a knowledge graph (a web of facts linked by meaning). The resemblance to our release gate is real, and so is the line between them. (production, internal)
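The line between checking by rule and reasoning by inference can be shown side by side. The sketch below is illustrative Python, not KASA's logic or the companion's SHACL shapes: the function names and the intermediate node ex:HARVEST-1 are assumptions, while the 2.41 % HMW result for DP-004 and the 2.0 % limit are the figures this chapter uses for the release gate.

```python
def within_limit(value, upper_limit):
    """Rule check: compare a stated value with a stated limit. Adds no facts."""
    return value <= upper_limit


def infer_ancestors(derived_from, node):
    """Inference: walk stated one-step edges to derive ancestors never stated."""
    ancestors = []
    seen = {node}
    current = derived_from.get(node)
    while current is not None and current not in seen:
        ancestors.append(current)
        seen.add(current)
        current = derived_from.get(current)
    return ancestors


# Rule: DP-004's HMW aggregate (2.41 %) against the 2.0 % release limit.
dp_004_passes = within_limit(2.41, 2.0)

# Inference: only one-step edges are stated; the intermediate node is hypothetical.
DERIVED_FROM = {
    "bp:DS-001": "ex:HARVEST-1",       # hypothetical intermediate step
    "ex:HARVEST-1": "bp:WCB-CHO-001",
}
ds_001_ancestors = infer_ancestors(DERIVED_FROM, "bp:DS-001")

print(dp_004_passes)
print(ds_001_ancestors)
```

The rule check returns a verdict about data that was already there. The inference returns a fact (DS-001 descends from the cell bank) that no single stated edge contains, which is the capability the rule-based platforms do not claim.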

Piloted and proposed: the structured-submission future

Several pieces of the stack are real but not yet mandatory. It is worth being careful here, because the difference between published, accepted, and required is exactly the difference that gets lost when people describe the regulatory future as already here.

The table below sorts the near-future initiatives by maturity, and the tag on each is doing real work — a published implementation guide is not a mandate, and an accepted format is not a required one.

Initiative	What it is	Status
PQ/CMC FHIR	A FHIR R5 Implementation Guide profiling FHIR documents submitted as XML in eCTD Module 3, with NCIt terminology	Published and Connectathon-tested; limited to Solid Oral Dosage Form (liquids and biologics are future work); explicitly not yet a mandated format (piloted)
eCTD v4.0	The next submission envelope, built on the HL7 v3 Regulated Product Submissions message standard	FDA began accepting it 16 September 2024 (voluntary; v3.2.2 still supported); Japan PMDA mandates it from 1 April 2026; the EU targeted an optional go-live around end 2025 with a later mandate (production as an accepted format; not yet mandatory in the US or EU)
ICH M4Q(R2) / M16 / SPQS	A granular, IDMP-aligned restructuring of CTD Module 3, with the structured-data layer deferred to ICH M16 / Structured Product Quality Submissions	M4Q(R2) reached Step 2 consultation in 2025; the structured layer (M16/SPQS) is deferred; ICMRA's PQ Knowledge Management initiative proposes common identifiers building on UNII/GSRS and SPOR (proposed)

Read the dates against the calendar rather than the press release. eCTD v4.0 is an accepted format in the United States, not a required one, and v3.2.2 is still supported. Japan's PMDA mandate applies from 1 April 2026; the EU's optional go-live was targeted around the end of 2025, with any binding mandate still to follow [6][7][8].

What unites this row is that all three describe a more structured submission envelope and template — a better-organized way to hand over documents — rather than a reasoning layer over their contents. The submission is becoming more like a database; it is not becoming a model that infers.

The one distinction to keep

Across this whole row, more structured is not more reasoning. PQ/CMC FHIR, eCTD v4.0, and ICH M16/SPQS all make the submission more like a database — better-organized boxes, shared codebooks, machine-checkable fields. None of them adds a layer that infers a new fact from the ones submitted. That is the line the rest of this chapter walks up to: a checkable form is not a thinking system, and only the second kind is what the GxP gate forbids.

The one formal ontology, and the unit grammar

In this entire stack, exactly one artifact is a true OWL ontology: the Pistoia IDMP Ontology — public, MIT-licensed, with more than ten pharma companies reported to have adopted it, and being standardized through ISO/TS 21405. As the enterprise knowledge-graph chapter described, Johnson & Johnson built a product master on it for the EMA PMS deadline.

The ontology itself is best classed as (piloted) across the industry, with that one company's master data built on it and rolling toward production rather than uniformly in service [9]. It is the exception that proves the rule: where the rest of the regulatory stack reaches for structured data, here one formal ontology earns its place — and even it does so at the master-data layer, not on the floor.

One thread runs through every layer above: units. UCUM — the Unified Code for Units of Measure — is the unit standard embedded in HL7 and FHIR, and therefore inside EMA PMS, FDA SPL, and PQ/CMC FHIR alike [10].

This book's identifiers chapter pinned QUDT for the knowledge graph; the regulatory wire uses UCUM. A real plant must therefore speak both — QUDT inside its own graph, UCUM on the line to the agency — and maintain the mapping between them, because a quantity that loses its unit, or carries the wrong one, is a defect that no amount of structure elsewhere can repair.

The loadable graph already carries both grammars on a single value. The culture-temperature setpoint that came in from the OPC UA wire is typed once and labelled twice: qudt:hasUnit unit:DEG_C for the knowledge graph and qudt:ucumCode "Cel" for the wire — the QUDT IRI and the UCUM code naming the same degree Celsius. Maintaining that pairing, value by value, is the QUDT-to-UCUM mapping a regulated submission depends on.
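One way to picture that pairing is a small lookup that re-expresses a graph value for the wire. This is a minimal sketch under stated assumptions: the function name, the 36.5 setpoint, and the shape of the returned dictionary (modelled loosely on a FHIR Quantity, whose UCUM system URI is http://unitsofmeasure.org) are illustrative, and only the DEG_C to "Cel" and PERCENT to "%" pairs come from the text.

```python
UCUM_SYSTEM = "http://unitsofmeasure.org"

# QUDT unit (graph side) -> UCUM code (wire side).
QUDT_TO_UCUM = {
    "unit:DEG_C": "Cel",
    "unit:PERCENT": "%",
}


def to_wire(numeric_value, qudt_unit):
    """Re-express a graph quantity with its UCUM code; refuse unmapped units."""
    try:
        ucum_code = QUDT_TO_UCUM[qudt_unit]
    except KeyError:
        raise ValueError(f"no UCUM mapping for {qudt_unit!r}") from None
    return {"value": numeric_value, "system": UCUM_SYSTEM, "code": ucum_code}


# Hypothetical setpoint value; the text does not state the number.
setpoint = to_wire(36.5, "unit:DEG_C")
print(setpoint["value"], setpoint["code"])
```

Raising on an unmapped unit, rather than passing the number through bare, is the point of the sketch: a missing mapping should stop the submission, not degrade it silently.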

Be precise about the release value, though, because it is easy to over-claim. In our model the SEC %monomer (the share of intact, single antibody molecules measured by size-exclusion chromatography) on bp:BATCH-2026-001, the clean, released drug-substance batch our running example follows, is 98.611. It is stored as a fully qualified qudt:QuantityValue (qudt:numericValue 98.611, qudt:hasUnit unit:PERCENT, qudt:hasQuantityKind qkind:DimensionlessRatio) whose canonical UCUM code is the percent sign. The companion stamps the QUDT unit IRI on that node, not a materialised qudt:ucumCode; the UCUM string is what the value is re-expressed against when it crosses into a regulated submission, where PMS and PQ/CMC FHIR carry the UCUM code rather than the QUDT IRI. The number does not change; the grammar it is wrapped in does. And the discipline is, again, a test: competency question CQ-19 ("Does every stored quantity value carry a unit, a QUDT unit IRI or a UCUM code — no bare numbers?") is a SELECT of the offenders that must return zero rows, so a quantity that lost its unit is a failing build, not a silent defect. (production)
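The "SELECT of the offenders" idea translates directly into a filter that must come back empty. The sketch below assumes quantity values held as plain dictionaries; the node names prefixed ex: and the 36.5 setpoint are hypothetical, and the real CQ-19 is a SPARQL query in the companion rather than this Python function.

```python
QUANTITY_VALUES = [
    # The released %monomer: QUDT unit IRI only, no materialised UCUM code.
    {"node": "ex:monomer-BATCH-2026-001", "numericValue": 98.611,
     "hasUnit": "unit:PERCENT"},
    # A hypothetical setpoint that carries both grammars.
    {"node": "ex:culture-temp-setpoint", "numericValue": 36.5,
     "hasUnit": "unit:DEG_C", "ucumCode": "Cel"},
]


def bare_numbers(values):
    """Return the nodes that carry neither a QUDT unit nor a UCUM code."""
    return [v["node"] for v in values
            if not (v.get("hasUnit") or v.get("ucumCode"))]


offenders = bare_numbers(QUANTITY_VALUES)
print(offenders)  # an empty list is the passing result
```

Returning the offenders instead of a bare true or false keeps the failure actionable: a red build names the exact values that lost their units.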

The unsolved part: structured CMC is still mostly a PDF

For all this advancement, the genuinely missing piece is structured CMC itself.

The chemistry, manufacturing, and controls content of a submission — the part that actually describes how a batch is made and tested, the part nearest to everything this book has modeled — is still mostly narrative PDF. The structured-data layer meant to replace it, ICH M16 / SPQS, is proposed, not in force [8].

To see what "narrative PDF" costs, locate the running example in the dossier. Everything our graph knows about bp:DS-001 (its manufacture, its characterisation, and its release CQA panel) lives, in a real BLA, in CTD (Common Technical Document) section 3.2.S (Drug Substance); everything about the filled vial lives in 3.2.P (Drug Product). And "drug substance" is not an abstraction here: 3.2.S.2 (Manufacture) narrates, step by step, the very downstream train this book follows. That train runs from the Protein A capture column that binds mAb-A out of clarified harvest, through the low-pH viral inactivation hold and orthogonal polishing chromatography that strip aggregate and host-cell protein (HCP), to the 20-nanometre viral filtration step whose log-reduction value is a registered claim, and the UF/DF (ultrafiltration/diafiltration) that concentrates and buffer-exchanges into the final drug substance. Each step's validated operating ranges and its impurity-clearance evidence, including aggregate, HCP, and the 2.0 % HMW release limit our SHACL gate enforces, are prose with tables that a reviewer reads by eye. The structured-CMC future is the proposal to make 3.2.S.4 (Control of Drug Substance) and 3.2.P.5 (Control of Drug Product) machine-checkable in the same way our SHACL gate checks the panel. That the part nearest to everything this book modeled is the part still least structured is no accident of effort: it is the part where the floor's GxP records and the submission meet, and so the part the validation gate guards most jealously.

And there is a second, deeper limit hiding behind the first: even when M16 lands, it will arrive as validated, locked structured data long before it arrives as a live, reasoning ontology. The submission boundary will get more checkable. It will not soon get a reasoner. The reason for that is the subject of the next section, and it is the chapter's sharpest point.

The GxP gate: why ontologies stay off the floor

Here is the correction to a tempting misreading. The reason formal ontologies are scarce on the production floor is not a missing vocabulary or an immature standard — by this point in the book we have seen that the standards exist, that they are mature, and that much of the wire format is even free.

It is the computerized-systems-validation regime. GAMP 5 (2nd edition, 2022), 21 CFR Part 11 (electronic records and signatures), and EU GMP Annex 11 require that any system of record in a GxP — Good Practice, the regulated-quality umbrella — environment be validated, access-controlled, audit-trailed, and change-controlled [11]. A system that cannot demonstrate those properties cannot hold a record on which a batch decision rests. (in force)

A self-modifying, reasoner-driven ontology that imports external vocabularies and infers new facts (derives conclusions that were never explicitly written down) is exactly the kind of system that is hard to validate under that regime.

Ask the question a regulator would ask of a batch decision: what was the inferred fact, derived from which axioms (the model's stated logical rules), under which version of which imported ontology, at the precise moment the decision was made? That question has no stable answer if the model can change beneath you between one query and the next — and a record without a stable answer is, under Part 11 and Annex 11, not a record a regulated system may keep.

So the floor runs on locked-down structured systems instead — not because their semantics are poorer, but because they can be frozen, signed, and proven. The friction that keeps the reasoner out is regulatory, not technical, and naming it that way is the only way to understand why the most semantically advanced corner of pharma is also the one most committed to data that does not think.
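The regulator's question (which fact, from which axioms, under which versions, at which moment) can be read as the field list of a frozen record. The sketch below is one hedged way to picture that, not a validated design: the class, the axiom identifier, the ontology name and version strings, and the timestamp are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceRecord:
    inferred_fact: str
    axioms: tuple             # identifiers of the axioms used
    ontology_versions: tuple  # (ontology name, version) pairs
    decided_at: str           # ISO 8601 timestamp of the decision

    def fingerprint(self):
        """SHA-256 over a canonical JSON form; any change alters the digest."""
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_record(ontology_version):
    # All values here are hypothetical illustrations.
    return InferenceRecord(
        inferred_fact="bp:DS-001 derivedFrom bp:WCB-CHO-001",
        axioms=("ex:derivedFrom-is-transitive",),
        ontology_versions=(("ex:bioproc", ontology_version),),
        decided_at="2026-01-01T00:00:00Z",
    )


record = make_record("1.0.0")
same = make_record("1.0.0")
drifted = make_record("1.0.1")

print(record.fingerprint() == same.fingerprint())     # stable answer
print(record.fingerprint() == drifted.fingerprint())  # version drift is visible
```

A record like this only holds still if every input to the inference is pinned. A reasoner that pulls the latest imported vocabulary at query time cannot fill in the version field honestly, which is the validation problem in miniature.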

The same gate, drawn around a learning model

The logic that bars a live reasoner is not specific to ontologies. It is the exact logic the 2025-2026 regulators drew around machine-learning models, and seeing the two as one rule is the cleanest way to connect this chapter to the book's companion volume, Machine Learning & AI for Biomanufacturing.

The draft EU/PIC/S GMP Annex 22 (the first manufacturing-specific AI rule, issued for consultation in July 2025) permits, for critical GMP applications, only static, deterministic models — and explicitly excludes self-learning, adaptive, generative, and large-language-model systems from those critical uses, demanding a predetermined change-control plan for any update. Read that against the reasoner: a self-modifying reasoning ontology and a continuously-learning model fail validation for the identical reason. Each can answer differently tomorrow than today, so neither yields the stable, signable answer to "what was the fact, under which version, at the moment of decision?" that Part 11 and Annex 11 require. A reasoner (inference software that derives new facts) and a continuously-learning predictor are two names for the thing the GxP gate forbids: a record that does not hold still. The resolution the models-and-validation chapter lands on — locked-then-relearn, never continuously-learning, for anything touching a CQA — is the same resolution that keeps the regulatory stack built from frozen structured data rather than a live reasoner.

But the relationship runs deeper than a shared prohibition, and this is where the ontology earns its place beside the model rather than merely sharing its cage. A model supplies fluency; the graph supplies the truth it is checked against, and the ontologies-and-AI chapter makes the move explicit. Three concrete ties carry it into the regulated submission:

SHACL-validated training data. The release gate was built to refuse a non-conformant release; the identical shapes refuse a non-conformant retrieval or training set. Before a subgraph trains or grounds a model, conformance certifies that every lot has its bp:derivedFrom parent, every CQA its unit-bearing value (the CQ-19 discipline above), every signature its signer — so the model never learns from, or cites, a hollow graph it would otherwise complete from training memory. A submission's structured data is exactly such a SHACL-checkable graph, which is why the regulatory boundary is the most natural place a validated grounding store could live.
The reasoned graph as ground truth for a fluent model. A retrieval-augmented LLM checked against the ontology can narrate an IDMP filing — "what is the UNII behind bp:DS-001, and which release panel did the gate pass?" — by traversing the same typed edges CQ-16 walks, citing the path rather than inventing it. The graph does the knowing; the model does the talking. This is the validation paradox stated precisely: a reasoned graph can be frozen, versioned, and proven at a point in time, so a model grounded on a snapshot of it inherits an auditable answer, whereas a model that reasons or learns on its own cannot. The ontology is the one component here that is both inference-capable and freezable — which is exactly why grounding, not autonomy, is the deployable pattern.
Honest validation over instances. When the learning happens over the graph's instances — predicting a release outcome across the campaign's lots — the genealogy is the grouping key. A bp:derivedFrom walk back to the shared cell bank WCB-CHO-001 is the leave-one-batch-out (or leave-one-cell-bank-out) split that a grouped, nested cross-validation needs to report an honest score rather than a flattering one; sibling lots off one bank are near-twins, and a row-wise random split leaks the answer. The graph that traces a deviation and the graph that defines a leak-free validation fold are the same graph.
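The third tie, genealogy as the grouping key, can be sketched without any ML library. The code below is an illustrative assumption rather than the companion's implementation: the lot names, the intermediate harvest node, and the second cell bank WCB-CHO-002 are hypothetical, and only WCB-CHO-001 and the derivedFrom walk come from the text.

```python
# One-step derivedFrom edges: child -> parent.
DERIVED_FROM = {
    "lot-A1": "WCB-CHO-001",
    "lot-A2": "harvest-A",        # hypothetical intermediate step
    "harvest-A": "WCB-CHO-001",
    "lot-B1": "WCB-CHO-002",      # hypothetical second cell bank
    "lot-B2": "WCB-CHO-002",
}
LOTS = ["lot-A1", "lot-A2", "lot-B1", "lot-B2"]


def root_of(node, derived_from):
    """Walk derivedFrom edges back to the node with no parent (the cell bank)."""
    seen = set()
    while node in derived_from and node not in seen:
        seen.add(node)
        node = derived_from[node]
    return node


def leave_one_bank_out(lots, derived_from):
    """Yield (held-out bank, train lots, test lots), one fold per cell bank."""
    groups = {}
    for lot in lots:
        groups.setdefault(root_of(lot, derived_from), []).append(lot)
    for held_out in sorted(groups):
        train = [lot for bank in sorted(groups) if bank != held_out
                 for lot in groups[bank]]
        yield held_out, train, groups[held_out]


folds = list(leave_one_bank_out(LOTS, DERIVED_FROM))
for bank, train, test in folds:
    print(bank, train, test)
```

Because sibling lots always land on the same side of the split, no fold can score well simply by recognising a near-twin of a lot it has already seen.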

The net is a single sentence the whole book has been earning: the regulator already insists that a record be frozen, signed, and proven, and that insistence — not any missing standard — is why both a live reasoner and a continuously-learning model stay off the critical path, while a reasoned, locked graph is precisely what a fluent model can stand on without lying to a reviewer.

Why it matters

The SHACL release gate this book built, the rule that holds DP-004 back, is not a thought experiment. DP-004 is a separate, deliberately out-of-spec drug-product lot, distinct from the clean DS-001/BATCH-2026-001 drug-substance batch used in the unit-grammar example above. That is why its monomer figure, 98.687 %, differs from the 98.611 % seen earlier: they are two different lots, not one number reported twice. The gate fails DP-004 on one criterion, and the realistic one: high-molecular-weight (HMW) aggregate, meaning clumped-together antibody molecules, at 2.41 %, over the 2.0 % release limit, while monomer purity (98.687 %) stays in spec (within its allowed range). Soluble HMW aggregate is an attribute that carries immunogenicity risk; it trends up under thermal, agitation, and air-interface stress, and it is one of the first numbers a CMC reviewer interrogates on a comparability or stability package. A gate that tripped on monomer alone would be a toy; a gate that isolates the out-of-spec result to the aggregate path, leaving every other panel value conformant, is the shape of a real release decision.

That gate has a real-world cousin in KASA's rule-based assessment and in IDMP's mandated structured master data. Together they prove that checkable, structured meaning is already a regulatory reality at the submission boundary, even as a live, reasoning ontology stays rare on the GMP floor behind it. The reader who built the gate in the release-gate chapter has, without quite knowing it, been rehearsing the regulator's own move.

The book's through-line — turning records into knowledge that can be reasoned over under pressure — meets its hardest real-world test precisely here, where the law already demands structure but the validation regime still forbids a model that thinks for itself.

The lesson is not that ontologies failed at the regulatory edge. It is that structure shipped first, and reasoning, where it comes, will have to earn its way past the gate.

Key terms

ISO IDMP — the ISO family of standards for Identification of Medicinal Products; legally mandated in the EU for describing substances, products, and organisations in machine-readable form.
SPOR — the EMA's four IDMP master-data services: Substances, Products, Organisations, Referentials; RMS and OMS operational since 2017, PMS built on HL7 FHIR.
GSRS / UNII — the FDA/NIH Global Substance Registration System and the free, non-proprietary Unique Ingredient Identifier codes it mints; implements ISO 11238 and ISO/TS 19844.
SPL — Structured Product Labeling, an HL7 v3 XML standard mandatory for FDA product labels, using NCI Thesaurus controlled terminology, NDC, and UNII.
CDISC SEND — the FDA-mandated Standard for Exchange of Nonclinical (animal-study) Data; not a manufacturing or CMC standard.
KASA — the FDA's internal Knowledge-Aided Assessment and Structured Application platform; a rule-based, structured-data assessment system, not a formal OWL ontology.
PQ/CMC FHIR — a piloted FHIR R5 Implementation Guide for structured product-quality and CMC data, currently limited to solid oral dosage forms.
UCUM — the Unified Code for Units of Measure, the unit grammar embedded in HL7/FHIR and therefore in EMA PMS, SPL, and PQ/CMC FHIR.
Reasoner / inference — a reasoner is software that derives new facts an ontology did not state outright (inference); because what it concludes can shift as the model or its imported vocabularies change, such inferred results are hard to freeze and prove, which is why the GxP gate keeps a live reasoner off the production floor.
GxP validation regime — GAMP 5, 21 CFR Part 11, and EU GMP Annex 11 collectively; the requirement that GxP systems of record be validated, audited, and change-controlled, which makes live reasoning ontologies hard to deploy.
Annex 22 (locked model) — the draft EU/PIC/S GMP AI annex (consultation July 2025) that permits only static, deterministic models for critical GMP use and excludes self-learning, adaptive, generative, and LLM systems; the same locked-then-relearn logic that keeps a live reasoner off the floor.
GraphRAG grounding — retrieval-augmented generation whose trusted store is the validated knowledge graph, so a fluent model answers by traversing typed edges (the CQ-16 path) and citing them, rather than inventing a filing; the reasoned, locked graph is what a model can stand on without hallucinating.
Grouped / leave-one-batch-out validation — splitting every record of a lot — ideally every lot off one cell bank — wholly to train or test, so a model learning over the graph's instances is scored on genuinely unseen lots; the bp:derivedFrom walk to the shared cell bank is the grouping key.

Where this leads

The regulatory boundary, then, is where structured semantics is most mature and most enforced — and, not coincidentally, where the line between a checkable form and a thinking system is drawn most sharply.

The next chapter, The Shop Floor and the Digital Twin: Where Ontologies Are Still Arriving, turns back to the plant itself — to the live equipment, the process historians, and the digital-twin ambitions where formal models are only now beginning to appear, gated by the very validation regime this chapter just named.
