Semantics & the Digital Thread: Ontologies and a Knowledge Graph
Where we are: Part III · Storing & Connecting, Chapter 16. The historian holds the numbers and the relational model holds the batch; this chapter wires everything (batch, equipment, material, recipe, result) into one navigable graph so a single question can walk the whole product lifecycle.
A relational database is a filing cabinet: brilliant if you already know which drawer to open. But an investigator's questions are not "what is in drawer 7?"; they are "this vial of drug product failed a release test; show me everything it touched on the way here." That is not a lookup, it is a walk: drug product → drug substance → capture pool → bioreactor batch → seed train → the cell bank it all started from. A knowledge graph stores those relationships as first-class facts (DS-001 derivedFrom PApool-001, PApool-001 derivedFrom BATCH-2026-001) so you can ask the database to follow the chain itself, however many links deep, in one query. That chain, traced end to end, is what industry calls the digital thread.
What this chapter covers
In Chapter 14 we joined the historian to the batch in PostgreSQL, and in Chapter 3 we modeled ISA-88/95 in relational tables. SQL joins are perfect when you know the shape of the question in advance. They get awkward the moment the question is recursive and cross-system: "what did this lot derive from, all the way back, no matter how many hops?" This chapter answers exactly that class of question.
We will:
- model the plant as a small ontology (a shared, machine-readable vocabulary of what a Batch, a CapturePool, and a SeedTrain are) and explain how it aligns to the open standards everyone is converging on (RDF, the Industrial Ontologies Foundry, Allotrope, QUDT);
- load the relational facts (batches, lot genealogy, release results) into an RDF knowledge graph with real, tested Python;
- run a SPARQL digital-thread query that traces one batch's full genealogy in a single statement, and look at the actual output;
- and be honest about where the open-source triplestore is genuinely production-grade and where the GxP wrapper is still yours to build.
The runnable code for this chapter is one file, examples/chapters/16-semantics-knowledge-graph/build_graph.py, which builds the graph in-process with RDFLib so it runs on a laptop with zero services. The ontology Turtle, the SHACL shapes, and the Apache Jena Fuseki deployment are shown as illustrative configuration where they appear; they are how you would serve this graph at plant scale and are labelled as such.
Why a graph, when we already have SQL
Everything in this book so far has lived in tables, and tables have served us well. So why add a graph at all?
Because meaning does not travel between systems on its own. The bioreactor DCS calls a reading BR101.Temp.PV; the LIMS calls the same lot DS-2026-001; the ERP calls it material 1000457; a CofA PDF calls it "Lot 26-001." Each system is internally consistent and mutually unintelligible, and a graph is the place a shared model that reconciles them can live.
The graph's data model is RDF, the Resource Description Framework, which represents every fact as a triple: subject, predicate, object [1]. DS-001 derivedFrom PApool-001 is one triple. BATCH-2026-001 monomerPct 98.611 is another. There is no fixed schema to migrate when you add a new relationship; you just add more triples. And because the relationships are stored as data rather than implied by foreign keys, you can ask the graph to traverse them recursively in SPARQL, the W3C query language for RDF [2], something SQL can only do with awkward recursive CTEs, and only if you wrote them in advance. A graph that anyone can add to also needs a gate, which is why RDF comes with SHACL, a constraint language for validating triples against shapes [3]; we put it to work later in the chapter.
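For comparison, here is roughly what the same recursive walk costs in SQL. This is a minimal sketch using Python's built-in sqlite3 as a stand-in for the relational store; the table name lot_genealogy, its two columns, and the ancestors helper are assumptions made for illustration, and the edges are the five genealogy rows for the golden batch.

```python
import sqlite3

# child -> parent edges for one batch (illustrative in-memory copy).
EDGES = [
    ("SEED-001", "WCB-CHO-001"),
    ("BATCH-2026-001", "SEED-001"),
    ("PApool-001", "BATCH-2026-001"),
    ("DS-001", "PApool-001"),
    ("DP-001", "DS-001"),
]

# The recursion has to be spelled out by hand: an anchor row, then a
# self-join that keeps climbing from each lot to its parent.
LINEAGE_SQL = """
WITH RECURSIVE lineage(lot, depth) AS (
    SELECT parent, 1 FROM lot_genealogy WHERE child = :start
    UNION ALL
    SELECT g.parent, l.depth + 1
    FROM lot_genealogy AS g
    JOIN lineage AS l ON g.child = l.lot
)
SELECT lot, depth FROM lineage ORDER BY depth
"""


def ancestors(start: str) -> list[tuple[str, int]]:
    """Return (ancestor, hops) pairs for one lot, nearest ancestor first."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE lot_genealogy (child TEXT, parent TEXT)")
    con.executemany("INSERT INTO lot_genealogy VALUES (?, ?)", EDGES)
    rows = con.execute(LINEAGE_SQL, {"start": start}).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for lot, depth in ancestors("DS-001"):
        print(depth, lot)
```

It works, but the traversal logic lives in a hand-written CTE tied to one table, whereas in the graph the same walk is a single path operator over edges that are already data.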
The ontology: agreeing on what things are
Before we can link facts, we need a vocabulary that says what a Batch is, what a CapturePool is, and that derivedFrom connects a child to its parent. That vocabulary is an ontology. You can write a tiny local one (and we will), but the value comes from anchoring it to ontologies the rest of the industry already shares.
Three open standards matter here, and they stack:
- Allotrope (AFO) is the laboratory-analytics ontology: it gives standardized meaning to results, instruments, samples, and methods, so an HPLC result means the same thing whether it came from your LIMS or a contract lab [4].
- QUDT types every quantity and unit, so 98.611 % carries its dimension and unit as machine-readable facts rather than a string tacked onto a number [5]. A value without its unit is a future deviation waiting to happen.
- The Industrial Ontologies Foundry (IOF) publishes a BFO-grounded, mid-level Core ontology for manufacturing (a principled vocabulary for equipment, materials, processes, and the relations between them) with a biopharma (BMIC) release aimed at lines exactly like our mAb process [6]. Its design rationale (why a shared upper grounding prevents the "everyone-invents-their-own-classes" chaos) is laid out in the IOF Core paper [7].
In production you import those ontologies and map your plant onto them. For a laptop-runnable chapter we keep a small, local namespace and align it conceptually; this is the honest pattern the repo follows throughout. The vocabulary, written in Turtle, looks like this:
# Illustrative: platform/ontology/bioproc.ttl (the shape you would import & align to IOF/AFO).
@prefix bp:   <https://example.org/bioproc#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
bp:Batch         a owl:Class ; rdfs:label "Bioreactor batch" .
bp:CapturePool   a owl:Class ; rdfs:label "Protein A capture pool" .
bp:SeedTrain     a owl:Class ; rdfs:label "Seed train" .
bp:DrugSubstance a owl:Class ; rdfs:label "Drug substance lot" .
bp:DrugProduct   a owl:Class ; rdfs:label "Drug product lot" .
bp:derivedFrom a owl:ObjectProperty , owl:TransitiveProperty ;
    rdfs:comment "Child material/lot derived from a parent." .
bp:monomerPct a owl:DatatypeProperty ;
    rdfs:comment "SEC %monomer release CQA, typed against QUDT." .
Notice derivedFrom is declared owl:TransitiveProperty. That single word is the whole trick: it tells any reasoner that if a drug product derives from a drug substance, and the drug substance derives from a capture pool, then the drug product derives from the capture pool too, transitively, all the way down the chain.
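To make that inference concrete, the sketch below computes in plain Python the pairs a reasoner would be entitled to add from the transitive declaration. It is not an OWL reasoner; the transitive_closure helper is a hypothetical name, and the single-parent dictionary is a simplifying assumption (a real genealogy could pool several parents into one lot).

```python
# Asserted derivedFrom edges (child -> parent) for the golden batch.
DERIVED_FROM = {
    "DP-001": "DS-001",
    "DS-001": "PApool-001",
    "PApool-001": "BATCH-2026-001",
    "BATCH-2026-001": "SEED-001",
    "SEED-001": "WCB-CHO-001",
}


def transitive_closure(direct: dict[str, str]) -> set[tuple[str, str]]:
    """Return every (descendant, ancestor) pair implied by transitivity."""
    inferred: set[tuple[str, str]] = set()
    for child in direct:
        node = child
        seen = {child}
        while node in direct:
            node = direct[node]
            if node in seen:  # stop if the input accidentally contains a cycle
                break
            seen.add(node)
            inferred.add((child, node))
    return inferred


closure = transitive_closure(DERIVED_FROM)
print(len(DERIVED_FROM), "asserted edges ->", len(closure), "derivedFrom pairs")
print(("DP-001", "WCB-CHO-001") in closure)
```

Five asserted edges along a six-node chain yield fifteen pairs (5 + 4 + 3 + 2 + 1), including the drug product's link to the cell bank that nobody wrote down explicitly.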
From tables to a thread: relational facts are loaded into RDF triples, the genealogy becomes a chain of derivedFrom edges, and one SPARQL query walks that chain to reconstruct a lot's full lineage.
Original diagram by the authors, created with AI assistance.
The data we are turning into a graph
The graph is built from three committed golden datasets the simulator produced (SIM_SEED=2026), so every number below is reproducible. First, the batches and their release verdict, from examples/datasets/batches.csv:
batch_id,role,release
BATCH-2026-001,golden,PASS
BATCH-2026-002,sibling,PASS
BATCH-2026-003,sibling,PASS
BATCH-2026-004,sibling,OOS
BATCH-2026-005,sibling,PASS
BATCH-2026-006,sibling,PASS
Second, the lot genealogy (the parent/child edges that are the thread) from examples/datasets/lot_genealogy.csv:
batch_id,child,child_type,parent,parent_type
BATCH-2026-001,SEED-001,seed_train,WCB-CHO-001,wcb
BATCH-2026-001,BATCH-2026-001,bioreactor,SEED-001,seed_train
BATCH-2026-001,PApool-001,capture_pool,BATCH-2026-001,bioreactor
BATCH-2026-001,DS-001,drug_substance,PApool-001,capture_pool
BATCH-2026-001,DP-001,drug_product,DS-001,drug_substance
Read those five rows as a supply chain in miniature: one working cell bank (WCB-CHO-001) seeds a seed train, which inoculates the bioreactor batch, whose harvest becomes a Protein A capture pool, which is purified into drug substance, which is filled into drug product. Notice every batch traces back to the same WCB-CHO-001; that shared root is what makes a cell-bank-level investigation answerable across the whole campaign.
Third, the release assays, from examples/datasets/hplc_results.csv:
batch_id,test,value,unit,spec_low,spec_high,result
BATCH-2026-001,SEC_monomer_pct,98.611,%,95.0,100.0,PASS
BATCH-2026-001,SEC_HMW_pct,1.287,%,0.0,3.0,PASS
BATCH-2026-001,CEX_main_pct,70.686,%,60.0,80.0,PASS
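The result column in these rows can be re-derived from the value and its spec window, which is a cheap sanity check before the numbers are copied into a graph. The sketch below assumes the limits are inclusive and that an out-of-range value would be labelled OOS; both are assumptions, and verdict and recheck are hypothetical helper names.

```python
import csv
import io

HPLC_CSV = """batch_id,test,value,unit,spec_low,spec_high,result
BATCH-2026-001,SEC_monomer_pct,98.611,%,95.0,100.0,PASS
BATCH-2026-001,SEC_HMW_pct,1.287,%,0.0,3.0,PASS
BATCH-2026-001,CEX_main_pct,70.686,%,60.0,80.0,PASS
"""


def verdict(value: float, low: float, high: float) -> str:
    """PASS when the value lies inside the inclusive spec window, else OOS."""
    return "PASS" if low <= value <= high else "OOS"


def recheck(csv_text: str) -> list[tuple[str, str, bool]]:
    """Recompute each row's verdict and flag whether it matches the stored result."""
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        computed = verdict(
            float(row["value"]), float(row["spec_low"]), float(row["spec_high"])
        )
        out.append((row["test"], computed, computed == row["result"]))
    return out


for test, computed, matches in recheck(HPLC_CSV):
    print(f"{test:16} {computed}  matches stored result: {matches}")
```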
We will hang the SEC %monomer result on each batch node as a Critical Quality Attribute, so a lineage walk and a quality result live in the same graph. One honest caveat: %monomer is a release / drug-substance-stage SEC purity result, but for chapter simplicity the loader attaches it to the upstream bioreactor batch node (build_graph.py adds bp:monomerPct to BP[r.batch_id]). In a faithful model it would attach to the drug-substance lot; we mirror the repo's simplification here rather than imply the bioreactor batch itself was assayed for monomer.
Building the graph with RDFLib
In the full companion stack these triples are served by Apache Jena Fuseki, the mature open-source SPARQL 1.1 server backed by a persistent triplestore [8], which we deploy below. But to keep this chapter and its test runnable with zero services, we build the same graph in-process with RDFLib, the Python library for constructing, serializing, and querying RDF [9]. Here is the heart of the chapter: the loader, from examples/chapters/16-semantics-knowledge-graph/build_graph.py. It reads the three CSVs and emits triples:
from pathlib import Path

import pandas as pd
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import RDFS, XSD

# examples/datasets, two levels above this chapter's directory
DATA = Path(__file__).resolve().parents[2] / "datasets"
BP = Namespace("https://example.org/bioproc#")

def build_graph() -> Graph:
    g = Graph()
    g.bind("bp", BP)
    batches = pd.read_csv(DATA / "batches.csv")
    gen = pd.read_csv(DATA / "lot_genealogy.csv")
    rel = pd.read_csv(DATA / "hplc_results.csv")
    # batches: one typed, labelled node per batch, carrying its release status
    for _, b in batches.iterrows():
        s = BP[b.batch_id]
        g.add((s, RDF.type, BP.Batch))
        g.add((s, RDFS.label, Literal(b.batch_id)))
        g.add((s, BP.releaseStatus, Literal(b.release)))
Each g.add((subject, predicate, object)) is literally one triple written into the graph. The first loop turns every batch into a typed, labelled node carrying its release status.
The genealogy loop is where the thread is woven. For each row it types both child and parent, then adds the derivedFrom edge that links them:
    # genealogy edges: child bp:derivedFrom parent
    for _, e in gen.iterrows():
        child, parent = BP[e.child], BP[e.parent]
        g.add((child, RDF.type, BP[e.child_type.title().replace("_", "")]))
        g.add((parent, RDF.type, BP[e.parent_type.title().replace("_", "")]))
        g.add((child, BP.derivedFrom, parent))
        if e.parent_type == "bioreactor":
            g.add((child, BP.fromBatch, BP[e.parent]))
The .title().replace("_", "") turns the CSV's seed_train into the class SeedTrain and capture_pool into CapturePool: a small, deterministic mapping from relational vocabulary to ontology classes. Finally, the release results attach the CQA, typed against XSD.float (the QUDT alignment lives in the ontology) so the value's datatype is explicit, not guessed:
    # release results: the SEC %monomer CQA attaches to the batch
    titer = rel[rel.test == "SEC_monomer_pct"]  # any release assay; use monomer as a CQA
    for _, r in titer.iterrows():
        g.add((BP[r.batch_id], BP.monomerPct, Literal(float(r.value), datatype=XSD.float)))
    return g
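The type-name mapping used in the genealogy loop is easy to check in isolation. The sketch below applies the same expression to the type strings that appear in the genealogy CSV and compares the results with the five classes declared in the illustrative Turtle; to_class_name and the DECLARED set are hypothetical names introduced here.

```python
def to_class_name(csv_type: str) -> str:
    """Same expression the loader uses: snake_case -> CamelCase local name."""
    return csv_type.title().replace("_", "")


# Type strings that appear in lot_genealogy.csv.
TYPES = ["wcb", "seed_train", "bioreactor", "capture_pool", "drug_substance", "drug_product"]

# Classes declared in the illustrative bioproc.ttl vocabulary.
DECLARED = {"Batch", "CapturePool", "SeedTrain", "DrugSubstance", "DrugProduct"}

for t in TYPES:
    print(f"{t:15} -> {to_class_name(t)}")

undeclared = sorted({to_class_name(t) for t in TYPES} - DECLARED)
print("emitted but not declared:", undeclared)
```

The mapping yields Bioreactor and Wcb, which the small vocabulary does not declare. RDF accepts that without complaint, which is one reason a validation gate is worth having.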
The digital-thread query
Now the payoff. With the graph built, we ask one recursive question in SPARQL, the W3C standard query language for RDF [2]. The query, from the same file, walks the derivedFrom chain backwards from a drug-substance lot:
PREFIX bp: <https://example.org/bioproc#>
SELECT ?step ?type WHERE {
    bp:DS-001 (bp:derivedFrom)+ ?step .
    ?step a ?type .
} ORDER BY ?type
The load-bearing token is (bp:derivedFrom)+. That is a SPARQL property path: the + means "follow one or more derivedFrom hops." So this single line says "find every step DS-001 derives from, however many links away", exactly the recursive lineage walk that is painful in plain SQL.
Running the file end to end (python chapters/16-semantics-knowledge-graph/build_graph.py) produces this real output:
graph: 91 triples
digital thread - what DS-001 derives from:
BATCH-2026-001 (Batch)
BATCH-2026-001 (Bioreactor)
PApool-001 (CapturePool)
SEED-001 (SeedTrain)
WCB-CHO-001 (Wcb)
That is the complete lineage of one drug-substance lot, reconstructed by the graph in one query: the capture pool it was purified from, the bioreactor batch that produced it, the seed train that inoculated it, and the working cell bank at the very root. The batch appears twice, once typed Batch (from the batches loop) and once Bioreactor (from the genealogy loop), which is itself an honest, faithful picture of multi-source modeling: the same physical thing carries more than one class, and the graph happily holds both. This is the open-source realization of what vendors sell as a "manufacturing knowledge graph."
Because the release CQA lives in the same graph, you can extend the walk to answer the question an investigator actually asks: and what was its quality result? The companion file carries a second, tested query, THREAD_WITH_CQA_QUERY, that follows the same derivedFrom edges up the lineage to the upstream batch carrying the CQA and reads its %monomer off the same graph:
PREFIX bp: <https://example.org/bioproc#>
SELECT ?batch ?monomer WHERE {
    bp:DS-001 (bp:derivedFrom)+ ?batch .
    ?batch bp:monomerPct ?monomer .
}
The direction matters: the monomer-bearing batch is an ancestor of DS-001, so the path follows derivedFrom in its stored child-to-parent direction with (bp:derivedFrom)+, the same operator as the lineage query, rather than inverting the path to walk down to children. Run end to end, it returns the originating batch and its result:
lineage + quality - originating batch and its release %monomer:
BATCH-2026-001 monomerPct=98.611
Lineage and quality, one graph, one query.
Drawn as a graph, the chain the query walks is just a row of derivedFrom edges with the release CQA hanging off the batch.
Validating the graph with SHACL
A graph that anyone can add triples to needs a gate, or it rots. The Shapes Constraint Language (SHACL) is the W3C standard for validating an RDF graph against shape constraints; it is how you enforce, for example, that every Batch must have a release status and exactly one monomer result before it is allowed to claim it has been released [3]. It is the same mechanism Allotrope's data models use to constrain how their ontology may be applied. A minimal shape looks like this; note that it caps monomer results at one (sh:maxCount 1) without yet requiring one, so enforcing "exactly one" would also need sh:minCount 1:
# Illustrative: platform/ontology/shapes.ttl
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix bp:  <https://example.org/bioproc#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
bp:BatchShape a sh:NodeShape ;
    sh:targetClass bp:Batch ;
    sh:property [ sh:path bp:releaseStatus ; sh:minCount 1 ;
                  sh:in ( "PASS" "OOS" "PENDING" ) ] ;
    sh:property [ sh:path bp:monomerPct ;
                  sh:datatype xsd:float ; sh:maxCount 1 ] .
SHACL is what turns a knowledge graph from a convenient lookup into a data-quality control: a triple that violates a shape is caught at load time instead of surfacing as a contradiction during an investigation.
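To show what the shape above actually checks, here is a plain-Python stand-in that applies the same four constraints to a node's facts. It is not a SHACL engine (in practice you would hand the shapes file to one, such as pySHACL); check_batch, the (predicate, value) fact layout, and the BATCH-X identifier are hypothetical and exist only for illustration.

```python
ALLOWED_STATUS = {"PASS", "OOS", "PENDING"}


def check_batch(node: str, facts: list[tuple[str, object]]) -> list[str]:
    """Return violation messages for one Batch node; an empty list means it conforms."""
    status = [value for pred, value in facts if pred == "releaseStatus"]
    monomer = [value for pred, value in facts if pred == "monomerPct"]
    problems = []
    if len(status) < 1:  # sh:minCount 1
        problems.append(f"{node}: releaseStatus missing")
    for s in status:  # sh:in ( "PASS" "OOS" "PENDING" )
        if s not in ALLOWED_STATUS:
            problems.append(f"{node}: releaseStatus {s!r} not allowed")
    if len(monomer) > 1:  # sh:maxCount 1
        problems.append(f"{node}: {len(monomer)} monomerPct values")
    for m in monomer:  # stand-in for sh:datatype xsd:float
        if not isinstance(m, float):
            problems.append(f"{node}: monomerPct {m!r} is not a float")
    return problems


good = [("releaseStatus", "PASS"), ("monomerPct", 98.611)]
bad = [("releaseStatus", "RELEASED"), ("monomerPct", 98.6), ("monomerPct", "98.7")]

print(check_batch("BATCH-2026-001", good))
for message in check_batch("BATCH-X", bad):  # hypothetical malformed node
    print(message)
```

A batch with a status and no monomer value passes this check, exactly as it would pass the minimal shape.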
Serving it at scale: Fuseki and Oxigraph
In-process RDFLib is perfect for a chapter and a unit test, but a plant needs a persistent SPARQL endpoint many systems can query. Two open-source stores fit:
- Apache Jena Fuseki is the mature choice: a SPARQL 1.1 server backed by the TDB persistent triplestore, run as a service [8]. It is the semantics profile in the companion stack. You would load the same triples and expose /digitalthread/sparql over HTTP:
# Illustrative: load the graph into Fuseki (semantics profile)
curl -X POST --data-binary @bioproc.ttl \
     -H "Content-Type: text/turtle" \
     http://localhost:3030/digitalthread/data
- Oxigraph is the lightweight, embeddable alternative: a SPARQL 1.1 query/update database backed by RocksDB, ideal when you want a graph store without a JVM [10].
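Once a store such as Fuseki exposes the endpoint, any client can ask the lineage question over plain HTTP. The sketch below builds a SPARQL 1.1 Protocol query request with the standard library and parses the standard SPARQL JSON results format; it does not contact a server, and the localhost URL is an assumption based on the dataset name in the curl example. build_request and steps_from_json are hypothetical helper names.

```python
import json
import urllib.request

# Assumed local endpoint, following the /digitalthread dataset used above.
ENDPOINT = "http://localhost:3030/digitalthread/sparql"

QUERY = """PREFIX bp: <https://example.org/bioproc#>
SELECT ?step WHERE { bp:DS-001 (bp:derivedFrom)+ ?step . }"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a direct-POST SPARQL query request."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def steps_from_json(payload: str) -> list[str]:
    """Extract the ?step bindings from a SPARQL JSON results document."""
    doc = json.loads(payload)
    return [binding["step"]["value"] for binding in doc["results"]["bindings"]]


# Against a live server the two helpers would be combined like this:
#   with urllib.request.urlopen(build_request(ENDPOINT, QUERY)) as resp:
#       print(steps_from_json(resp.read().decode("utf-8")))
```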
The repo notes one practical caution. The stack uses the official Apache image apache/jena-fuseki:5.2.0, pinned by tag in examples/platform/compose/compose.yaml, with the corresponding manifest digest recorded alongside it in examples/platform/versions.lock (the pattern Chapter 22's supplier register relies on). For Fuseki specifically that digest is left as a VERIFY-BEFORE-USE placeholder: as the lock file and the reference-architecture license table both note, the community Fuseki image moved repositories, so you should resolve and record the real digest for your chosen registry mirror before you deploy by digest rather than trusting the tag alone.
Whichever store you choose, the deeper payoff is the one that has run through the whole trilogy: a SPARQL endpoint over a shared vocabulary with globally unique identifiers is the cleanest route to data that is genuinely FAIR (Findable, Accessible, Interoperable, and Reusable), with machine-actionability as the explicit design target [11]. A graph that another system can query without a bespoke integration is interoperability, made concrete.
Why it matters
The digital thread is not jargon; it is the literal mechanism of a modern investigation. A systematic literature review of the digital thread for smart manufacturing identifies semantic links (knowledge graphs and ontologies) as the connective tissue that provides traceability across the product lifecycle [12]. When a release test fails on DP-004 (our seeded OOS lot), the question is instantly "what else shares its lineage?", and because every batch in the campaign traces to WCB-CHO-001, the graph can answer "which other lots came off the same cell bank or the same capture skid?" in one traversal. That is impact analysis, and it is the difference between a scoped deviation and a blind, campaign-wide quarantine.
It matters for regulators because lineage and impact analysis are exactly what data-integrity and investigation expectations demand: the ability to walk from any record to every record it depends on, across systems, reproducibly. A knowledge graph makes that walk a query instead of a week of cross-referencing spreadsheets.
In the real world
Commercial vendors sell this under names like "manufacturing knowledge graph," "contextualized data fabric," or "self-service data layer," typically layered over a historian's asset model. What we built in one Python file and a Turtle vocabulary is the same idea expressed in open standards every system can speak, and the standards convergence is real: Allotrope, IOF/BMIC, and QUDT are industry efforts precisely so the graph at one site is intelligible at the next. It is not a paper exercise either: a 2026 peer-reviewed study built a biopharmaceutical knowledge graph that integrated heterogeneous process data and let engineers query parameter-to-outcome (CPP-to-CQA) relationships directly, the manufacturing-knowledge-graph idea working in practice [13].
This is the world NIIMBL's SABRE facility is built for. SABRE, the NIIMBL / University of Delaware pilot-scale cGMP (current Good Manufacturing Practice) plant that broke ground in April 2024, is a multi-vendor line where data from different skids, instruments, and lab systems must be tied to one batch, one lot, and one thread before anyone can reason across them. SABRE is a facility, not a data standard, but it is the kind of heterogeneous setting where a shared semantic layer stops being optional.
Now the honest OSS-vs-commercial reckoning. The graph technology is genuinely production-grade in open source: RDF, SPARQL, SHACL, Fuseki, and Oxigraph are mature, standards-compliant, and free of licensing traps (Apache-2 for Jena/Fuseki, permissive BSD for RDFLib, MIT for Oxigraph). You can build a real digital thread with zero license cost. What pure OSS does not hand you is the GxP wrapper: a validated, change-controlled mapping from your source systems into the graph, supplier accountability for the triplestore under GAMP-5, and assurance that the load process is complete and correct under qualification. The graph is also a derived view: its triples are copied from the relational record of truth, so unless the load is validated and re-run under change control, the graph can silently drift from the systems it claims to mirror, which is the opposite of data integrity. As elsewhere in this book: open source gets you a clean, inspectable engine; the validated system around it is yours to build or buy.
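The drift risk can at least be made visible with a reconciliation step that compares the edges in the relational source with the edges in the graph. The sketch below shows only the shape of such a check, not a validated control; reconcile is a hypothetical helper and the stale DS-OLD edge is invented for illustration.

```python
Edge = tuple[str, str]  # (child, parent)


def reconcile(source: set[Edge], graph: set[Edge]) -> dict[str, list[Edge]]:
    """Report edges the load dropped and edges the graph holds without a source row."""
    return {
        "missing_from_graph": sorted(source - graph),
        "not_in_source": sorted(graph - source),
    }


# Edges as they stand in the relational record of truth.
SOURCE: set[Edge] = {
    ("SEED-001", "WCB-CHO-001"),
    ("BATCH-2026-001", "SEED-001"),
    ("PApool-001", "BATCH-2026-001"),
    ("DS-001", "PApool-001"),
    ("DP-001", "DS-001"),
}

# A graph that has drifted: one edge was never reloaded, one stale edge lingers.
GRAPH: set[Edge] = (SOURCE - {("DP-001", "DS-001")}) | {("DP-001", "DS-OLD")}

report = reconcile(SOURCE, GRAPH)
for kind, edges in report.items():
    print(kind, edges)
```

An empty report on both sides is the minimum evidence that the derived view still mirrors its source; running it after every load, under change control, is the part the open-source stack leaves to you.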
Key terms
- Ontology: a shared, machine-readable vocabulary defining what entities exist (Batch, CapturePool) and how they relate (derivedFrom); the agreement that lets different systems mean the same thing.
- RDF (triple): the Resource Description Framework; every fact is a subject-predicate-object triple, with no fixed schema, so relationships are stored as first-class data.
- Knowledge graph: a graph of RDF triples linking batch, equipment, material, recipe, and result into one navigable whole.
- Digital thread: the end-to-end chain of linked records tracing a product across its lifecycle; here, the derivedFrom lineage from drug product back to the cell bank.
- SPARQL property path: a query operator (e.g. (derivedFrom)+) that follows a relationship recursively, enabling lineage walks of arbitrary depth in one statement.
- SHACL: the Shapes Constraint Language; validates an RDF graph against shape constraints to enforce data quality before bad triples enter the graph.
- IOF / Allotrope (AFO) / QUDT: the open standards the graph aligns to: a manufacturing ontology, a laboratory-analytics ontology, and a units-and-quantities ontology, respectively.
- Transitive property: an OWL relation (like derivedFrom) where A→B and B→C imply A→C, letting a reasoner infer the full chain.
- Triplestore: a database that stores and queries RDF triples over SPARQL; here, Apache Jena Fuseki or the embeddable Oxigraph.
Where this leads
We have made the data mean the same thing everywhere and proven a one-query digital thread against real campaign data. But the graph, like the contextualization view before it, assumes the numbers already live in our open historian. Most plants have a commercial historian holding decades of process data, and that is the boundary we cross next. In Chapter 17, Bridging to Commercial Historians: AVEVA/OSIsoft PI, we build a bidirectional bridge to a PI Web API (against a faithful mock, with the production base-URL and credential swap documented) so the open stack and the validated commercial system of record can finally exchange data.