Enterprise Knowledge Graphs at Big Pharma

📍 Where we are: Part VII · Ontologies in Industry Today — Chapter 27. The previous chapter walked the vendor landscape — the engines you can buy. This one asks the harder question: with those engines in hand, what have the largest drug makers actually built, and where did they put it?

A vendor demo and a production system are different animals. The triple stores, the reasoners, and the SHACL validators of the last chapter all work — but a working engine tells you nothing about where a company decided to bolt it in. This chapter inventories the enterprise knowledge graphs — large, cross-domain graphs that unify a company's data under a shared semantic model — that big pharma has put into production. The case studies are real, peer-reviewed or vendor-documented, and impressive at petabyte scale. They also share a pattern so consistent it becomes the chapter's real finding: the graphs cluster almost entirely in research, FAIR-data cataloging, and regulatory master data — and, on the public evidence, they have not yet reached the GMP manufacturing floor where this book's running batch is actually made.

The simple version

Imagine a global company that has finally built one giant, searchable index of everything it knows — every experiment, every dataset, every product record — and it works beautifully. Then you notice the index covers the research labs and the head-office filing cabinets but stops at the factory door. The machines that physically make the medicine are still keeping their own private notebooks. That gap, not the index, is the story.

What this chapter covers

We survey six production and two piloted enterprise knowledge graphs at named pharmaceutical companies, laid out as a single comparison table and then read closely. We distinguish a true formal ontology — an RDF/OWL model a reasoner can act on — from a contextualization program, which maps and links data without committing to a logic-bearing model; the difference is not pedantry, because only the former carries the guarantees earlier parts of this book were built to deliver. We trace where these graphs live: R&D discovery graphs, FAIR-data catalogs, and regulatory master data. We then name the pattern — that GMP-production deployments on the manufacturing floor are absent from the surveyed public evidence — and the coverage gaps in that evidence, chiefly contract manufacturers and Asian producers. Throughout, the maturity of each claim is marked in bold parentheses, because the distance between "shipped" and "announced" is exactly where this book refuses to oversell.

A two-zone survey map of big-pharma enterprise knowledge graphs. A large left zone titled R and D, FAIR cataloging and regulatory master data holds six green production cards (Roche EDIS, Novartis data42, AstraZeneca BIKG, Boehringer Ingelheim EKG, Novo Nordisk OBDM, Johnson and Johnson IDMP-O) above two amber piloted cards (Sanofi Modulus, Merck MSD and GSK Methods Hub), each card naming its system, scale and technology. A dashed rose boundary separates a smaller right zone, The GMP manufacturing floor, which is empty apart from a dashed batch card (BATCH-2026-001, bp goes out of spec), an em-dash reading no graph here yet, and rose notes on under-represented populations and unfinished work. A legend maps green to production, amber to piloted, and rose to absent from public evidence. Every deployed pharma knowledge graph in this survey clusters in R and D, FAIR catalogs, and master data, while the GMP manufacturing floor where the running batch is made stays a rose, ungraphed frontier. Original diagram by the authors, created with AI assistance.

The inventory

The table below is the chapter in miniature. Read the Status and Technology columns together: where the technology is RDF/OWL, the company built a formal ontology; where it is a mapping or "contextualization" layer, it did not.

| Company | System | Status | Technology | What it does |
|---|---|---|---|---|
| Roche | EDIS / Dataset Portal | (production) | DCAT, Dublin Core, PROV-O, SKOS, FOAF, PAV, JSON-LD over Ontotext GraphDB | FAIR catalog of ~20,000 datasets |
| Boehringer Ingelheim | Enterprise Knowledge Graph | (production) | RDF/OWL/SPARQL on metaphactory; also a Stardog R&D layer | Federates omics, IT, documents, trial data |
| AstraZeneca | BIKG | (production, R&D) | Property graph, ~14M nodes / 136M edges, 55 sources | Powers Mantis-ML 2.0 target identification |
| Novartis | data42 | (production) | Palantir Foundry Ontology (earlier AWS Neptune) | Unifies ~20 PB of R&D data |
| Novo Nordisk | OBDM | (production, R&D) | RDF/OWL reusing AFO, OBI, ChEBI, BFO; SKOS/SSSOM/ROBOT | Inferencing KG over research data |
| Johnson & Johnson | IDMP-O product master | (production master data) | Pistoia IDMP Ontology + Accurids | Regulatory product master for EMA PMS |
| Sanofi | Modulus / Connected Smart Factories | (piloted) | Contextualized data mapping across MES/LIMS/SAP | Links factory data; not a formal ontology |
| Merck (MSD) & GSK | Pistoia Methods Hub | (piloted) | ADF/AFO + a novel RDF graph model | Machine-readable HPLC-UV method transfer |
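
For readers who prefer to check the headline count mechanically, the inventory can be transcribed into a small data structure and tallied. This is only a sketch: the rows are copied from the table, the status qualifiers ("R&D", "master data") are collapsed to a single word as a simplification, and the names `INVENTORY` and `tally_by_status` are illustrative, not from any cited system.

```python
from collections import Counter

# Rows transcribed from the inventory table: (company, system, status).
# Simplification: "(production, R&D)" and "(production master data)"
# are both recorded as plain "production".
INVENTORY = [
    ("Roche", "EDIS / Dataset Portal", "production"),
    ("Boehringer Ingelheim", "Enterprise Knowledge Graph", "production"),
    ("AstraZeneca", "BIKG", "production"),
    ("Novartis", "data42", "production"),
    ("Novo Nordisk", "OBDM", "production"),
    ("Johnson & Johnson", "IDMP-O product master", "production"),
    ("Sanofi", "Modulus / Connected Smart Factories", "piloted"),
    ("Merck (MSD) & GSK", "Pistoia Methods Hub", "piloted"),
]


def tally_by_status(rows):
    """Count how many inventory rows carry each maturity status."""
    return Counter(status for _company, _system, status in rows)


status_counts = tally_by_status(INVENTORY)
print(dict(status_counts))
```

The tally reproduces the chapter's framing of six production systems and two pilots; it says nothing about which of them rests on a formal ontology, which the sections below take up case by case.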

The FAIR catalogs: Roche and Novartis

Roche's EDIS / Roche Dataset Portal is the cleanest example of a knowledge graph built for FAIR — the principle that data be Findable, Accessible, Interoperable, and Reusable. It rests on a stack of public vocabularies — DCAT, Dublin Core, PROV-O, SKOS, FOAF, PAV, served as JSON-LD through a FAIR Data Point — sitting over Ontotext GraphDB, and it catalogs roughly 20,000 datasets (production) [1]. Roche reports an internal FAIR-maturity score of 4.75 out of 5 and reuse of the underlying terminology stack across more than 100 applications [1]. Both are self-reported figures — read them as the company's own assessment, not an independent audit. What is unambiguous is the architecture: this is an ontology in the service of a catalog, the same FAIR pattern Part V argued for, running at the scale of a global research organization.
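
To make the vocabulary stack concrete, here is a minimal sketch of what a single catalog entry might look like when expressed with DCAT, Dublin Core, FOAF, and PROV-O terms and serialized as JSON-LD. It is not Roche's schema: the function name `dataset_record`, the `https://example.org/` identifiers, and the field choices are all assumptions made for illustration. Only the namespace IRIs and term names come from the public vocabularies.

```python
import json

# Public namespace IRIs for the vocabularies named in the text.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "prov": "http://www.w3.org/ns/prov#",
}


def dataset_record(dataset_id, title, publisher_name, keywords, derived_from=None):
    """Build one hypothetical catalog entry as a JSON-LD-shaped dict."""
    record = {
        "@context": CONTEXT,
        "@id": dataset_id,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:publisher": {
            "@type": "foaf:Organization",
            "foaf:name": publisher_name,
        },
        "dcat:keyword": list(keywords),
    }
    if derived_from is not None:
        # Provenance link to the (hypothetical) upstream dataset.
        record["prov:wasDerivedFrom"] = {"@id": derived_from}
    return record


example = dataset_record(
    dataset_id="https://example.org/dataset/assay-results-001",  # hypothetical
    title="Example assay results",
    publisher_name="Example Research Unit",
    keywords=["assay", "example"],
    derived_from="https://example.org/dataset/raw-plates-001",  # hypothetical
)
print(json.dumps(example, indent=2))
```

The point of the sketch is the division of labour: the catalog describes datasets with shared public terms, so a consumer can find and interpret an entry without knowing anything about the system that produced it.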

Novartis's data42 is the same idea at brute scale: roughly 20 petabytes of R&D data and on the order of 2 million patient-years, unified through the Palantir Foundry Ontology, with an earlier iteration on AWS Neptune and SageMaker, and with FAIR described as a stated "mantra" (production) [4]. Tellingly, Novartis was hiring an Ontology Designer to define object and action types and the "semantic contracts" between them [4] — a sign that the discipline this book teaches has become a salaried role, not a research curiosity. One caution on the noun: Foundry's "Ontology" is a structured object model — the preceding platforms chapter unpacks exactly this distinction — and whether it constitutes a formal OWL ontology in the BFO sense is a question the public record does not settle. The petabyte figure is real; the word "ontology" is doing softer work than it does elsewhere in this table.

The discovery graphs: AstraZeneca and Boehringer Ingelheim

AstraZeneca's BIKG (Biological Insights Knowledge Graph) is a discovery-side graph of roughly 14 million nodes and 136 million edges drawn from 55 data sources, and it powers Mantis-ML 2.0 for disease-gene target identification (production, R&D) [3]. One caution on the numbers: the 136-million-edge figure is the count reported in the cited 2024 paper, and edge counts drift across AstraZeneca's publications; a widely repeated "146 million" figure is not the value in this source. Treat 136M as a snapshot as of that paper, not a canonical constant. Note too that BIKG is a property graph, not an RDF/OWL deployment; it is a knowledge graph in the engineering sense, optimized for traversal and machine learning rather than for description-logic inference.
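
The phrase "optimized for traversal" can be illustrated with a toy property graph. Everything in the sketch below is invented for illustration: the node identifiers, labels, relationship types, and edge properties are placeholders and do not reflect BIKG's actual schema or contents. What it shows is the general shape: nodes and edges carry key-value properties, and the characteristic operation is walking paths rather than applying logical axioms.

```python
from collections import deque

# Hypothetical nodes: id -> properties (including a label).
nodes = {
    "g1": {"label": "Gene", "name": "GENE_A"},
    "p1": {"label": "Protein", "name": "PROTEIN_A"},
    "d1": {"label": "Disease", "name": "DISEASE_X"},
    "c1": {"label": "Compound", "name": "COMPOUND_Q"},
}

# Hypothetical edges: (source, relationship type, target, properties).
edges = [
    ("g1", "ENCODES", "p1", {"source": "db_one"}),
    ("p1", "ASSOCIATED_WITH", "d1", {"source": "db_two", "score": 0.8}),
    ("c1", "TARGETS", "p1", {"source": "db_three"}),
]


def shortest_path(start, goal):
    """Breadth-first search over the edges, ignoring direction."""
    adjacency = {}
    for src, _rel, dst, _props in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, []).append(src)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in adjacency.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no connection between the two nodes


gene_to_disease = shortest_path("g1", "d1")
print([nodes[n]["name"] for n in gene_to_disease])
```

Nothing in this structure tells a reasoner that a gene is a kind of anything; the graph answers "how are these two things connected?", which is the question target identification tends to ask.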

Boehringer Ingelheim runs an Enterprise Knowledge Graph Platform on metaphactory, using RDF/OWL with SPARQL federation across omics data, IT systems, documents, and the secondary use of clinical-trial data (production) [2]; it also operates a separate Stardog-based semantic layer for R&D. This is a genuine formal-ontology deployment, and the closest of the discovery graphs to the architecture this book recommends. The often-cited drug-discovery and supply-chain use cases, however, are best read as aspirational direction rather than shipped capability — the platform is real; some of the destinations it is pointed at are not yet arrivals.

The research ontology done right: Novo Nordisk

Novo Nordisk's OBDM (ontology-based data management) is the cleanest peer-reviewed account of an industrial pharma knowledge graph, published in 2025 (production, R&D) [5]. It is an RDF/OWL graph that deliberately reuses public ontologies — AFO, OBI, ChEBI, and BFO, the upper ontology this book built on — and is tooled with SKOS, SSSOM, and ROBOT. Its three components map almost exactly to the architecture earlier parts described: a data model, a set of controlled vocabularies, and an inferencing knowledge graph. For a reader who has followed this book, OBDM reads less like news and more like confirmation — evidence that the recommended pattern survives contact with a real, regulated enterprise rather than living only in textbooks.
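
What "inferencing" buys can be shown with a deliberately tiny example. The sketch below applies a single RDFS-style rule (an instance of a class is also an instance of its superclasses) until nothing new can be derived. The class and instance names under the `ex:` prefix are hypothetical and are not taken from OBDM or from AFO, OBI, ChEBI, or BFO. A real deployment would use an OWL reasoner or a triple store's rule engine, not a hand-written loop.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical asserted triples: two class axioms and one typed instance.
asserted = {
    ("ex:HplcRun", SUBCLASS_OF, "ex:ChromatographyRun"),
    ("ex:ChromatographyRun", SUBCLASS_OF, "ex:AnalyticalMeasurement"),
    ("ex:run-42", TYPE, "ex:HplcRun"),
}


def infer_types(triples):
    """Propagate rdf:type up the subclass hierarchy until a fixed point."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = {(s, o) for s, p, o in inferred if p == SUBCLASS_OF}
        for s, p, o in list(inferred):
            if p != TYPE:
                continue
            for child, parent in subclass_pairs:
                new_triple = (s, TYPE, parent)
                if o == child and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


def types_of(entity, triples):
    """Return every class the entity is known to belong to."""
    return {o for s, p, o in triples if s == entity and p == TYPE}


closure = infer_types(asserted)
print(sorted(types_of("ex:run-42", closure)))
```

Only one type was asserted for the run, yet a query for analytical measurements would now find it. That derived membership is the kind of guarantee a logic-bearing model carries and a plain link table does not.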

The regulatory master: Johnson & Johnson

Johnson & Johnson's product master is built on the Pistoia IDMP Ontology plus the Accurids platform, aimed at the EMA's Product Management Service obligations (production master data) [6]. Two honesty notes. First, this is regulatory master data (the canonical, governed record of what a product is), not process execution; it is the formal substrate behind something like our bp:DS-001 drug-substance identifier, not a model of the batch that produced it. Second, because the EMA deadline it serves falls in mid-2026, "production" here means built for that deadline and still rolling out rather than fully shipped; read it as a system going live against a deadline, not one with years of operation behind it.

The two that touch the floor — barely

Two efforts come closest to manufacturing, and both are instructive precisely because of how they stop short. Sanofi's Modulus, within its Connected Smart Factories program, is a "contextualized data" effort that maps and links data across MES, LIMS, SAP, and paper records, with sites in France and Singapore rolling out around 2026 (piloted) [7]. Be exact about what it is: a data-mapping and contextualization program, not a formal RDF/OWL ontology, and the tidy "ISA-95-style" framing sometimes attached to it is an outside inference, not Sanofi's own claim.
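
A contextualization layer, in the generic sense used here, can be sketched as a field mapping plus a join. The example below is not a description of Modulus: the record shapes, field names, and the `FIELD_MAP` table are assumptions made up for illustration, and only the batch identifier BATCH-2026-001 is borrowed from this book's running example.

```python
# Hypothetical records from three systems, each naming the batch differently.
SOURCES = {
    "mes": [{"order": "MES-001", "batch": "BATCH-2026-001", "step": "granulation"}],
    "lims": [{"sample": "LIMS-77", "batch_no": "BATCH-2026-001", "result": "pending"}],
    "sap": [{"material_doc": "SAP-9", "charge": "BATCH-2026-001"}],
}

# The "contextualization": which field in each system holds the batch id.
FIELD_MAP = {"mes": "batch", "lims": "batch_no", "sap": "charge"}


def contextualize(sources, field_map):
    """Group records from every system under the batch id they refer to."""
    linked = {}
    for system, records in sources.items():
        key_field = field_map[system]
        for record in records:
            batch_id = record[key_field]
            linked.setdefault(batch_id, {}).setdefault(system, []).append(record)
    return linked


linked = contextualize(SOURCES, FIELD_MAP)
print(sorted(linked["BATCH-2026-001"]))
```

The result is genuinely useful, since one lookup now returns the batch's footprint across all three systems. But the mapping states only that certain fields refer to the same thing; it contains no classes or axioms from which a reasoner could derive a fact that was not already recorded.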

The Pistoia Methods Hub pilot, run with Merck (MSD) and GSK, validated machine-readable transfer of an HPLC-UV analytical method between sites using ADF/AFO plus a novel RDF graph model — 55 standardized parameters, with about 1.35% retention-time variance across the transfer, peer-reviewed in 2025 (piloted) [8]. This is the single case that lands nearest the regulated QC bench, and it remains a pilot, not a plant-wide system. The identification of "Merck" here as MSD rather than the unrelated Merck KGaA rests partly on conference-program inference, so it is offered with that caveat attached.

The pattern: everywhere but the floor

Lay the inventory out and one shape is unmistakable. The production graphs cluster in two places: R&D and FAIR-data cataloging — Roche, Novartis, AstraZeneca, Boehringer Ingelheim, Novo Nordisk — and regulatory master data, Johnson & Johnson. The two efforts that reach toward manufacturing — Sanofi's Modulus and the Methods Hub — are a contextualization program and a pilot, respectively, and both are still rolling out. The discipline has proven itself on discovery and on the regulatory record. It has not, in public, proven itself on the GMP line.

The careful claim is this: named, GxP-production ontology deployments on the manufacturing floor are not found in the surveyed public evidence. That is deliberately not the same as saying they do not exist. Manufacturing IT is routinely confidential and competitively sensitive; a plant running a quiet OWL-backed batch-genealogy graph would have every reason not to publish it. Absence from the public record is weak evidence of absence in fact — but it is the only evidence we honestly have, and the consistency of the gap is itself striking.

The unsolved part: bridging the R&D graph to the validated line

There is a second silence in the evidence, and it is worth naming as plainly as the first. The published case studies skew heavily toward Western big-pharma innovators, and two populations are conspicuously under-represented. Contract manufacturers (CDMOs) — exactly where multi-client data-interoperability pain is sharpest, the pain this book's distribution chapter flagged — barely appear. And Asian manufacturers are nearly absent from the public-evidence base: Samsung Biologics surfaces once in connection with digital twins, while Celltrion, Lotte Biologics, WuXi, and NMPA-region players are largely missing from the surveyed literature. This book invents no details about them; it only flags the hole. The unfinished work is therefore twofold: bridging the proven R&D knowledge graph across the GMP validation boundary onto the production floor, and federating graphs across the CDMO and partner network so that a batch's full digital thread survives the handoffs between companies.

Why it matters

These eight systems are the real-world instances of the digital thread this book has been building toward — living proof that the engine runs at petabyte scale inside the most demanding companies in the industry. They are also proof of the harder fact: the most regulated, highest-stakes mile of the journey — the GMP manufacturing floor where BATCH-2026-001 is physically made and bp:DP-004 goes out of spec — is precisely where the graphs have not yet reached. The pattern is not a failure of the technology; it is a map of the frontier. Knowing exactly where the deployed art stops is what lets the next reader aim at the part that is still open.

Key terms

  • Enterprise knowledge graph — a large, cross-domain graph that unifies a company's data under a shared semantic model, queryable as one connected whole.
  • FAIR catalog — a knowledge graph whose primary job is to make datasets Findable, Accessible, Interoperable, and Reusable, typically via public metadata vocabularies such as DCAT and PROV-O.
  • Contextualization program — a data effort that maps and links records across systems without committing to a formal, logic-bearing RDF/OWL model; not the same as an ontology.
  • Regulatory master data — the canonical, governed record of what a product is (identifiers, ingredients, strengths), as opposed to a record of how a given batch was made.
  • BIKG — AstraZeneca's Biological Insights Knowledge Graph, a discovery-side property graph powering computational target identification.
  • OBDM — Novo Nordisk's ontology-based data management approach: a data model, controlled vocabularies, and an inferencing RDF/OWL knowledge graph reusing public ontologies.
  • IDMP Ontology — the Pistoia Alliance's OWL rendering of the ISO IDMP product-identification standards, used to build regulatory product masters.
  • Argument from absence — the careful claim that something is "not found in surveyed public evidence," which is weaker than, and must not be stated as, "does not exist."

Where this leads

Johnson & Johnson's product master pointed at a target this book has so far only glanced at: the regulator. The next chapter, Regulatory Semantics: IDMP, SPL, KASA, and the Structured Submission, follows the data the whole way to the agency — how the identifiers, structured product labels, and submission formats that health authorities mandate are themselves becoming semantic, and what it means when the record a company files is meant to be read by a machine as much as by a reviewer.