Regulatory Semantics: IDMP, SPL, KASA, and the Structured Submission

📍 Where we are: Part VII · Ontologies in Industry Today — Chapter 28. The previous chapters walked the manufacturing floor, where formal ontologies are thinnest. We now cross to the regulatory interface — where structured, semantic data is most advanced, and in places legally mandated.

The earlier chapters of this Part delivered an uncomfortable finding: on the GMP shop floor, where a monoclonal antibody is actually made, formal ontology — a machine-readable model of what things are and how they relate — is rare. One could read that as a verdict against the technology. It would be the wrong lesson.

Cross the boundary from the plant to the regulator and the picture inverts. The interface where a manufacturer tells an agency what a drug is — its substances, its strengths, its labeling, its quality — is the most semantically advanced corner of the entire industry.

And much of it is not optional. It is law. Where the floor's structured data is something a plant builds because it pays off, the regulatory interface's structured data is something a plant builds because, absent it, the filing is rejected at the door.

This chapter surveys that stack, and it does so with a distinction this book has insisted on from the first page. The regulatory interface is built largely from structured XML, HL7 FHIR, and controlled vocabularies — not from formal OWL ontologies. Exactly one piece of it is a true OWL ontology with a reasoner behind it; the rest is mandated, validated, structured data.

Understanding why that mix is what shipped — and why a self-modifying, reasoning ontology has not — is the chapter's real payload. The answer turns out to lie not in any missing standard but in the regime that decides what is allowed to run in a regulated environment at all.

The simple version

Imagine every country agreed to fill out the same form, in the same boxes, using the same official dictionary of words, so that a computer could check the form before a human ever read it. That is roughly what drug regulators have built for product identity and labeling — a shared form with a shared codebook. What they did not build is a machine that reasons about your filing and rewrites it. The form is locked, audited, and validated; it does not think. That difference — a checkable form versus a thinking system — is the whole story of why structured data won at the regulatory boundary and live ontologies did not.

What this chapter covers

We begin with the legally binding core: ISO IDMP (Identification of Medicinal Products) and the EMA's SPOR master-data services, then the FDA's substance backbone, GSRS/UNII, and its labeling standard, SPL.

We then meet the nonclinical-data standard CDISC SEND and its enforcement teeth, the Technical Rejection Criteria, and KASA, the FDA's internal structured-assessment platform — the closest real-world cousin to this book's SHACL release gate.

From there we survey what is still piloted or proposed: PQ/CMC FHIR, eCTD v4.0, and the structured-CMC future in ICH M16 / SPQS. We note the one formal OWL artifact, the Pistoia IDMP Ontology, and the unit grammar that threads through everything, UCUM. Finally we confront the GxP validation gate — the real reason ontologies stay off the floor.

A semantic-stack diagram: a left identity spine flows from a Knowledge graph node holding bp (WCB-CHO-001) down a green arrow to an IDMP substance ID node, then feeds rightward into an eCTD-envelope panel of five tier-colored rows (ISO IDMP and EMA SPOR, FDA GSRS and UNII, SPL labeling and CDISC SEND all production-green; PQ/CMC FHIR piloted-amber and eCTD v4.0 accepted-green; ICH M16/SPQS proposed-violet and Pistoia IDMP Ontology piloted-amber), with a green KASA callout and a rose GxP validation-gate callout below, under a banner reading that structure shipped first. One substance identity rises from the knowledge graph through mandated, structured submission layers, color-coded by maturity, until the GxP validation gate stops a live reasoner. Original diagram by the authors, created with AI assistance.

The mandated core: ISO IDMP and EMA SPOR

The foundation is a family of ISO standards collectively called ISO IDMP — Identification of Medicinal Products — a set of data models for describing substances, products, dosage forms, units, and organisations in a uniform, machine-readable way.

In the European Union its use is not a recommendation. ISO IDMP is legally mandated for Member States, marketing-authorisation holders, and the EMA under Articles 25–26 of Commission Implementing Regulation (EU) No 520/2012 [1]. That word — mandated — is what sets this layer apart from almost everything else in the book: a manufacturer does not adopt IDMP because it is elegant, but because the law requires the filing to be structured. (production)

The EMA delivers IDMP through four master-data services known by the acronym SPOR — Substances, Products, Organisations, Referentials. Two of them launched first and have been operational since June 2017: the Referentials Management Service (RMS) and the Organisations Management Service (OMS).

The Products service, PMS, is built on HL7 FHIR — the same healthcare data-exchange standard the rest of this stack leans on. Its timeline is easy to garble, so it is worth stating precisely.

The PMS Product user interface went live read-only on 31 May 2024. The PMS API was delayed and released, also read-only, on 3 July 2024; the two dates are routinely conflated. Write and edit access rolled out from late 2024 [1]. (production)

This is the layer that, in our running example, would carry the real-world identity behind bp:DS-001. The IDMP substance identifier is the regulator's name for the same drug substance our graph tracks from WCB-CHO-001 forward.

It is, in effect, what a real plant puts behind that node when it speaks to an agency rather than to its own historian: the internal bp: identifier is for the factory's knowledge graph; the IDMP identifier is for the world outside it. The two name the same substance, and keeping that correspondence true is part of what a working master-data discipline is for.
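One way to picture that correspondence is a small cross-reference table that is checked for one-to-one consistency. The sketch below is illustrative only: `CrossRef`, `check_one_to_one`, and the value `IDMP-SUBSTANCE-PLACEHOLDER` are hypothetical names introduced here, not real IDMP identifiers or any agency's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossRef:
    """One row linking an internal graph node to its external identifier."""
    internal_id: str  # identifier used inside the plant's knowledge graph
    external_id: str  # identifier used toward the regulator (placeholder here)


def check_one_to_one(rows: list[CrossRef]) -> list[str]:
    """Return a list of problems; an empty list means the mapping is one-to-one."""
    problems: list[str] = []
    by_internal: dict[str, str] = {}
    by_external: dict[str, str] = {}
    for row in rows:
        # setdefault stores the first pairing seen and returns it on later rows.
        prev_ext = by_internal.setdefault(row.internal_id, row.external_id)
        if prev_ext != row.external_id:
            problems.append(
                f"{row.internal_id} maps to both {prev_ext} and {row.external_id}"
            )
        prev_int = by_external.setdefault(row.external_id, row.internal_id)
        if prev_int != row.internal_id:
            problems.append(
                f"{row.external_id} is claimed by both {prev_int} and {row.internal_id}"
            )
    return problems


# Hypothetical pairing: the external value is a stand-in, not a real identifier.
rows = [CrossRef("bp:DS-001", "IDMP-SUBSTANCE-PLACEHOLDER")]
print(check_one_to_one(rows))
```

A check of this kind would presumably run whenever the master data changes, so that a drifted pairing is caught before it reaches a filing.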

The FDA substance backbone: GSRS and UNII

The United States anchors substance identity in the Global Substance Registration System (GSRS), co-developed by the FDA and NIH/NCATS. GSRS implements the IDMP substance standards ISO 11238 and ISO/TS 19844, classifies every substance into one of six classes, and mints a free, non-proprietary UNII (Unique Ingredient Identifier) code for it — codes surfaced publicly on DailyMed.

The scale is real: a 2020 public release of the system held 116,636 substance definitions [2]. That a substance identifier is free and non-proprietary matters more than it sounds — it is the property that lets a UNII travel across systems and organisations without licensing friction, the same property this book argued for when it pinned global IRIs to the knowledge graph.

Notably, the EU substance service EU-SRS runs on the same GSRS software and has been live at the EMA since January 2023: a rare case of two agencies sharing one codebase rather than reconciling two rival registries after the fact [2]. (production)

The labeling standard: SPL

When a manufacturer files a product label, it does not file a PDF; it files Structured Product Labeling (SPL), an HL7 v3 XML standard. SPL is mandatory under the FDA Amendments Act of 2007 and 21 CFR Part 207.

And it is not free text dressed up in tags. An SPL document draws on 44 controlled-terminology subsets from the NCI Thesaurus, plus NDC and UNII identifiers, so the meaning of each field is pinned to a shared vocabulary rather than left to prose a reviewer must interpret by hand [3]. The label, in other words, is computable: an agency can query across labels because every field denotes the same thing in every filing. (production)

Nonclinical data and its enforcement teeth: SEND

For nonclinical (that is, animal) study data, the FDA mandates CDISC SEND (Standard for Exchange of Nonclinical Data). One point must be stated plainly, because it is a common and consequential error: SEND governs nonclinical study data only. It is not a manufacturing or CMC standard; it has nothing to say about how BATCH-2026-001 was run or tested. Its controlled terminology is distributed through the NCI Thesaurus, the production distribution platform for CDISC Controlled Terminology.

SEND also carries real enforcement teeth. The Technical Rejection Criteria — a set of machine-checked error codes — became effective 15 September 2021, after which a non-conforming submission is technically rejected at the door, before any human reviewer sees it [4]. (production)
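The reject-at-the-door pattern can be sketched as a list of machine-checked rules applied before any human review. This is a minimal illustration, not the FDA's implementation: the codes `EX-1` and `EX-2`, the field names, and the `gate` function are all hypothetical, and the real Technical Rejection Criteria are not reproduced here.

```python
from typing import Callable

# Each rule pairs a code with a predicate that returns True when the check passes.
Rule = tuple[str, Callable[[dict], bool]]

# Hypothetical rules for illustration; real criteria are defined by the agency.
RULES: list[Rule] = [
    ("EX-1", lambda s: bool(s.get("study_id"))),
    ("EX-2", lambda s: len(s.get("datasets", [])) > 0),
]


def gate(submission: dict) -> list[str]:
    """Return the code of every failed rule; an empty list means no rejection."""
    return [code for code, passes in RULES if not passes(submission)]


print(gate({"study_id": "STUDY-A", "datasets": ["dataset-a"]}))
print(gate({}))
```

The point of the design is that the outcome is a list of codes a submitter can act on, produced with no interpretation by a reviewer.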

KASA: a rule-based assessment, not an ontology

The closest regulatory relative to this book's SHACL release gate is KASA — Knowledge-Aided Assessment and Structured Application — the FDA Office of Pharmaceutical Quality's internal review platform. It is worth pausing on, because it is the place where a regulator does, in production, something close to what this book has been modeling.

KASA launched in February 2021 for non-sterile solid-oral-dosage ANDAs (abbreviated new drug applications), and it has more than 300 active users; a 2025 peer-reviewed paper documents more than 1,130 manufacturing assessments completed through September 2024 [5]. KASA applies rules and quantitative risk models to standardized CMC data, and it is designed to consume PQ/CMC structured data.

But it must be classified honestly: KASA is a structured-data and controlled-vocabulary platform, not a formal OWL ontology with a reasoner.

It checks structured meaning the way our SHACL gate checks DP-004 against spec — by rule, against a shape — not by inference over a knowledge graph. The resemblance to our release gate is real, and so is the line between them. (production, internal)
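The line between checking by rule and inferring over a graph can be shown in a few lines. The sketch below is an assumption-laden illustration: the limits 95.0 and 100.0 and the `lineage` links are hypothetical, and neither function represents KASA or the book's SHACL gate.

```python
def within_shape(value: float, minimum: float, maximum: float) -> bool:
    """Rule check: compare a reported value with fixed limits. Nothing is derived."""
    return minimum <= value <= maximum


def infer_ancestors(node: str, derived_from: dict[str, str]) -> list[str]:
    """Inference, for contrast: derive unstated facts by following a relation."""
    chain: list[str] = []
    seen = {node}  # guards against a cyclic lineage
    while node in derived_from and derived_from[node] not in seen:
        node = derived_from[node]
        seen.add(node)
        chain.append(node)
    return chain


# Hypothetical limits applied to the reported SEC %monomer value.
print(within_shape(98.611, minimum=95.0, maximum=100.0))

# Hypothetical lineage: each key is assumed to be derived from its value.
lineage = {"DP-004": "DS-001", "DS-001": "WCB-CHO-001"}
print(infer_ancestors("DP-004", lineage))
```

The first function returns the same answer for the same inputs and limits every time. The second returns an answer that depends on whatever relations the graph happens to hold when it is asked, which is the property the validation regime finds hard to pin down.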

Piloted and proposed: the structured-submission future

Several pieces of the stack are real but not yet mandatory. It is worth being careful here, because the difference between published, accepted, and required is exactly the difference that gets lost when people describe the regulatory future as already here.

The table below sorts the near-future initiatives by maturity, and the tag on each is doing real work — a published implementation guide is not a mandate, and an accepted format is not a required one.

| Initiative | What it is | Status |
| --- | --- | --- |
| PQ/CMC FHIR | A FHIR R5 Implementation Guide profiling FHIR documents submitted as XML in eCTD Module 3, with NCIt terminology | Published and Connectathon-tested; limited to Solid Oral Dosage Form (liquids and biologics are future work); explicitly not yet a mandated format (piloted) |
| eCTD v4.0 | The next submission envelope, built on the HL7 v3 Regulated Product Submissions message standard | FDA began accepting it 16 September 2024 (voluntary; v3.2.2 still supported); Japan PMDA mandates it from 1 April 2026; the EU targeted an optional go-live around end 2025 with a later mandate (production as an accepted format; not yet mandatory in the US or EU) |
| ICH M4Q(R2) / M16 / SPQS | A granular, IDMP-aligned restructuring of CTD Module 3, with the structured-data layer deferred to ICH M16 / Structured Product Quality Submissions | M4Q(R2) reached Step 2 consultation in 2025; the structured layer (M16/SPQS) is deferred; ICMRA's PQ Knowledge Management initiative proposes common identifiers building on UNII/GSRS and SPOR (proposed) |

Read the dates against the calendar rather than the press release. eCTD v4.0 is an accepted format in the United States, not a required one, and v3.2.2 is still supported. Japan's PMDA mandate applies from 1 April 2026; the EU's optional go-live was targeted around the end of 2025, with any binding mandate still to follow [6][7][8].

What unites these three rows is that each describes a more structured submission envelope and template, a better-organized way to hand over documents, rather than a reasoning layer over their contents. The submission is becoming more like a database; it is not becoming a model that infers.

The one formal ontology, and the unit grammar

In this entire stack, exactly one artifact is a true OWL ontology: the Pistoia IDMP Ontology — public, MIT-licensed, with more than ten pharma companies reported to have adopted it, and being standardized through ISO/TS 21405. As the earlier sponsor chapter described, Johnson & Johnson built a product master on it for the EMA PMS deadline.

The ontology itself is best classed as (piloted) across the industry, with that one company's master data built on it and rolling toward production rather than uniformly in service [9]. It is the exception that proves the rule: where the rest of the regulatory stack reaches for structured data, here one formal ontology earns its place — and even it does so at the master-data layer, not on the floor.

One thread runs through every layer above: units. UCUM — the Unified Code for Units of Measure — is the unit standard embedded in HL7 and FHIR, and therefore inside EMA PMS, FDA SPL, and PQ/CMC FHIR alike [10].

This book's identifiers chapter pinned QUDT for the knowledge graph; the regulatory wire uses UCUM. A real plant must therefore speak both — QUDT inside its own graph, UCUM on the line to the agency — and maintain the mapping between them, because a quantity that loses its unit, or carries the wrong one, is a defect that no amount of structure elsewhere can repair.

The SEC %monomer of 98.611 on BATCH-2026-001 carries a QUDT-typed value in our model and acquires a UCUM unit code the moment it crosses into a regulated submission. The number does not change; the grammar it is wrapped in does. (production)
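The rewrapping step can be sketched as a lookup from a QUDT unit IRI to a UCUM code, failing loudly when no mapping exists. This is a minimal sketch under stated assumptions: the `QUDT_TO_UCUM` table and the `to_wire` function are hypothetical, the single entry assumes the QUDT percent unit is the one used in the graph, and the output only loosely follows the value, system, and code fields of a FHIR quantity.

```python
# Assumed mapping: QUDT unit IRIs on the left, UCUM codes on the right.
QUDT_TO_UCUM = {
    "http://qudt.org/vocab/unit/PERCENT": "%",
}

UCUM_SYSTEM = "http://unitsofmeasure.org"  # the URI FHIR uses to identify UCUM


def to_wire(value: float, qudt_unit: str) -> dict:
    """Rewrap a QUDT-typed value with a UCUM code; the number itself is unchanged."""
    try:
        code = QUDT_TO_UCUM[qudt_unit]
    except KeyError:
        # An unmapped unit is treated as a defect, never silently dropped.
        raise ValueError(f"no UCUM mapping for {qudt_unit}") from None
    return {"value": value, "system": UCUM_SYSTEM, "code": code}


print(to_wire(98.611, "http://qudt.org/vocab/unit/PERCENT"))
```

Raising on an unmapped unit, rather than emitting a bare number, reflects the chapter's point that a quantity without its unit is a defect in its own right.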

The unsolved part: structured CMC is still mostly a PDF

For all this advancement, the genuinely missing piece is structured CMC itself.

The chemistry, manufacturing, and controls content of a submission — the part that actually describes how a batch is made and tested, the part nearest to everything this book has modeled — is still mostly narrative PDF. The structured-data layer meant to replace it, ICH M16 / SPQS, is proposed, not in force [8].

And there is a second, deeper limit hiding behind the first: even when M16 lands, it will arrive as validated, locked structured data long before it arrives as a live, reasoning ontology. The submission boundary will get more checkable. It will not soon get a reasoner. The reason for that is the subject of the next section, and it is the chapter's sharpest point.

The GxP gate: why ontologies stay off the floor

Here is the correction to a tempting misreading. The reason formal ontologies are scarce on the production floor is not a missing vocabulary or an immature standard — by this point in the book we have seen that the standards exist, that they are mature, and that much of the wire format is even free.

It is the computerized-systems-validation regime. GAMP 5 (2nd edition, 2022), 21 CFR Part 11 (electronic records and signatures), and EU GMP Annex 11 require that any system of record in a GxP environment (GxP being the umbrella term for the regulated "Good Practice" quality disciplines) be validated, access-controlled, audit-trailed, and change-controlled [11]. A system that cannot demonstrate those properties cannot hold a record on which a batch decision rests. (in force)

A self-modifying, reasoner-driven ontology that imports external vocabularies and infers new facts is exactly the kind of system that is hard to validate under that regime.

Ask the question a regulator would ask of a batch decision: what was the inferred fact, derived from which axioms, under which version of which imported ontology, at the precise moment the decision was made? That question has no stable answer if the model can change beneath you between one query and the next — and a record without a stable answer is, under Part 11 and Annex 11, not a record a regulated system may keep.

So the floor runs on locked-down structured systems instead — not because their semantics are poorer, but because they can be frozen, signed, and proven. The friction that keeps the reasoner out is regulatory, not technical, and naming it that way is the only way to understand why the most semantically advanced corner of pharma is also the one most committed to data that does not think.
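The regulator's question about an inferred fact can be made concrete as a frozen record that captures the fact, its axioms, the imported ontology versions, and the decision time, plus a fingerprint that changes if any of them changes. This is a sketch of the idea only: `InferenceRecord`, its field names, and the sample values are hypothetical, and nothing here is claimed to satisfy Part 11 or Annex 11.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceRecord:
    """An immutable snapshot of one inferred fact and everything it rested on."""
    inferred_fact: str
    axioms: tuple[str, ...]
    ontology_versions: tuple[tuple[str, str], ...]  # (ontology name, version)
    decided_at: str  # ISO 8601 timestamp of the decision

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form; any changed field changes the hash."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Hypothetical values for illustration only.
record = InferenceRecord(
    inferred_fact="DP-004 is held",
    axioms=("example-axiom-1",),
    ontology_versions=(("example-ontology", "1.0.0"),),
    decided_at="2026-01-01T00:00:00Z",
)
print(record.fingerprint())
```

A live reasoner would have to emit something like this for every inference it contributes to a batch decision, and the validated system around it would have to prove the snapshot complete. That burden is what the locked-down systems avoid.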

Why it matters

The SHACL release gate this book built — the rule that holds DP-004 back because a critical quality attribute fell out of spec — is not a thought experiment.

It has a real-world cousin in KASA's rule-based assessment and in IDMP's mandated structured master data. Together they prove that checkable, structured meaning is already a regulatory reality at the submission boundary, even as a live, reasoning ontology stays rare on the GMP floor behind it. The reader who built the gate in the QC chapter has, without quite knowing it, been rehearsing the regulator's own move.

The book's through-line — turning records into knowledge that can be reasoned over under pressure — meets its hardest real-world test precisely here, where the law already demands structure but the validation regime still forbids a model that thinks for itself.

The lesson is not that ontologies failed at the regulatory edge. It is that structure shipped first, and reasoning, where it comes, will have to earn its way past the gate.

Key terms

  • ISO IDMP — the ISO family of standards for Identification of Medicinal Products; legally mandated in the EU for describing substances, products, and organisations in machine-readable form.
  • SPOR — the EMA's four IDMP master-data services: Substances, Products, Organisations, Referentials; RMS and OMS operational since 2017, PMS built on HL7 FHIR.
  • GSRS / UNII — the FDA/NIH Global Substance Registration System and the free, non-proprietary Unique Ingredient Identifier codes it mints; implements ISO 11238 and ISO/TS 19844.
  • SPL — Structured Product Labeling, an HL7 v3 XML standard mandatory for FDA product labels, using NCI Thesaurus controlled terminology, NDC, and UNII.
  • CDISC SEND — the FDA-mandated Standard for Exchange of Nonclinical (animal-study) Data; not a manufacturing or CMC standard.
  • KASA — the FDA's internal Knowledge-Aided Assessment and Structured Application platform; a rule-based, structured-data assessment system, not a formal OWL ontology.
  • PQ/CMC FHIR — a piloted FHIR R5 Implementation Guide for structured product-quality and CMC data, currently limited to solid oral dosage forms.
  • UCUM — the Unified Code for Units of Measure, the unit grammar embedded in HL7/FHIR and therefore in EMA PMS, SPL, and PQ/CMC FHIR.
  • GxP validation regime — GAMP 5, 21 CFR Part 11, and EU GMP Annex 11 collectively; the requirement that GxP systems of record be validated, audited, and change-controlled, which makes live reasoning ontologies hard to deploy.

Where this leads

The regulatory boundary, then, is where structured semantics is most mature and most enforced — and, not coincidentally, where the line between a checkable form and a thinking system is drawn most sharply.

The next chapter, The Shop Floor and the Digital Twin: Where Ontologies Are Still Arriving, turns back to the plant itself — to the live equipment, the process historians, and the digital-twin ambitions where formal models are only now beginning to appear, gated by the very validation regime this chapter just named.