Semantics & the Digital Thread: Ontologies and a Knowledge Graph

📍 Where we are: Part III · Storing & Connecting — Chapter 19. The historian (the time-series database that records every sensor reading from the plant) holds the numbers and the relational model holds the batch; this chapter wires everything — batch, equipment, material, recipe, result — into one navigable graph so a single question can walk the whole product lifecycle.

The simple version

A relational database is a filing cabinet: brilliant if you already know which drawer to open. But an investigator's questions are not "what is in drawer 7?" — they are "this vial of drug product failed a release test; show me everything it touched on the way here." That is not a lookup, it is a walk: drug product → drug substance → capture pool → bioreactor batch → seed train → the cell bank it all started from (each of those manufacturing steps is named again, in order, in "The data we are turning into a graph" below, and walked through as a physical operation in Book 1's Production Bioreactor chapter). A knowledge graph stores those relationships as first-class facts — DS-001 derivedFrom PApool-001, PApool-001 derivedFrom BATCH-2026-001 — so you can ask the database to follow the chain itself, however many links deep, in one query. That chain, traced end to end, is what industry calls the digital thread.

What this chapter covers

In Chapter 17 we joined the historian to the batch in PostgreSQL, and in Chapter 4 we modeled ISA-88/95 (the batch-control and manufacturing-operations standards) in relational tables. SQL joins are perfect when you know the shape of the question in advance. They get awkward the moment the question is recursive (it re-applies itself to its own result — follow a link, then follow the next link from wherever that landed, and so on) and cross-system: "what did this lot derive from, all the way back, no matter how many hops?" This chapter answers exactly that class of question.

We will:

model the plant as a small ontology — a shared, machine-readable vocabulary of what a Batch, a CapturePool, and a SeedTrain are — and explain how it aligns to the open standards everyone is converging on (RDF, the Industrial Ontologies Foundry, Allotrope, QUDT);
load the relational facts (batches, lot genealogy, release results) into an RDF knowledge graph with real, tested Python;
run a SPARQL digital-thread query that traces one batch's full genealogy in a single statement, and look at the actual output;
show how a single LinkML model can generate the SHACL, JSON Schema, SQL, and OWL the stack would otherwise hand-maintain as four separately drifting copies;
and be honest about where the open-source triplestore is genuinely production-grade and where the GxP wrapper (GxP = the umbrella of "Good x Practice" quality regulations — GMP for manufacturing, GLP for labs, GCP for clinical — that govern any data a regulator may inspect; the wrapper is the validated, change-controlled layer you build around the engine to satisfy them) is still yours to build.

The runnable code for this chapter is one file — examples/chapters/16-semantics-knowledge-graph/build_graph.py — which builds the graph in-process with RDFLib so it runs on a laptop with zero services. The ontology Turtle, the SHACL shapes, and the Apache Jena Fuseki deployment are shown as illustrative configuration where they appear; they are how you would serve this graph at plant scale and are labelled as such.

Why a graph, when we already have SQL

Everything in this book so far has lived in tables, and tables have served us well. So why add a graph at all?

Because meaning does not travel between systems on its own. The bioreactor DCS (distributed control system — the equipment-control software) calls a reading BR101.Temp.PV; the LIMS (laboratory information management system) calls the same lot DS-2026-001; the ERP (enterprise resource planning — the business/inventory system) calls it material 1000457; a CofA (Certificate of Analysis) PDF calls it "Lot 26-001." Each system is internally consistent and mutually unintelligible, and a graph is the place a shared model that reconciles them can live.

The graph's data model is RDF — the Resource Description Framework — which represents every fact as a triple: subject, predicate, object [1]. DS-001 derivedFrom PApool-001 is one triple. BATCH-2026-001 monomerPct 98.611 is another. There is no fixed schema to migrate when you add a new relationship; you just add more triples. And because the relationships are stored as data rather than implied by foreign keys, you can ask the graph to traverse them recursively in SPARQL, the W3C query language for RDF [2] — something SQL can only do with awkward recursive CTEs (common table expressions — named, self-joining subqueries, shown later in the chapter), and only if you wrote them in advance. A graph that anyone can add to also needs a gate, which is why RDF comes with SHACL, a constraint language for validating triples against shapes [3]; we put it to work later in the chapter.

Anatomy of the graph: a triple, then a node

Before we build anything, it is worth dissecting the two structures everything else is made of — because the whole graph is, quite literally, just these two things repeated. A triple is the atom; a node is a molecule of triples that share a subject. Both are written by the loader you will read in a moment, so every field below is a real fact the code emits, not a textbook abstraction.

Anatomy of a triple: subject, predicate, object

A triple is the smallest possible fact: three slots, no more. The loader's last loop writes g.add((BP[r.batch_id], BP.monomerPct, Literal(float(r.value), datatype=XSD.float))) — and that one line is the triple dissected below. Its subject is a globally-unique IRI (Internationalized Resource Identifier — a web-style global name, here bp:BATCH-2026-001), not a local primary key that means nothing outside one database; its predicate is also an IRI (bp:monomerPct), drawn from the ontology, that names the relation; and its object is where RDF earns the word graph. The object is either a typed literal — a lexical form (98.611) plus an xsd (XML Schema) or QUDT datatype tag that pins what kind of value it is — or another IRI, in which case the triple is an edge you can walk (bp:DS-001 bp:derivedFrom bp:PApool-001). That single fork — literal-or-resource — is the difference between a column value and a foreign key (a database column whose value points at another table's row), except RDF stores both as first-class, addressable data.

Identity card dissecting one RDF triple into three labelled cells — subject IRI, predicate IRI, and object — with the object split into a typed-literal case (98.611 with an xsd datatype tag and QUDT unit) and an IRI-edge case (derivedFrom to PApool-001), each contrasted against the equivalent relational column value and foreign key. One triple, field by field: the subject and predicate are IRIs, and the object is either a typed literal (a value) or another IRI (an edge to walk) — the fork that makes RDF a graph rather than a table. Original diagram by the authors, created with AI assistance.

Anatomy of a Batch node: one node is a bundle of triples

There is no separate "node" object in RDF; a node is simply every triple that shares a subject, viewed together. Take BATCH-2026-001 exactly as the loader writes it, and six triples share that one IRI as their subject. It is typed twice — rdf:type bp:Batch from the batches loop and rdf:type bp:Bioreactor from the genealogy loop — which is the honest two-class fact the digital-thread output makes visible. That is not duplication or a bug: RDF lets one thing belong to several classes at once, and this thing genuinely is both a production batch (a manufacturing run) and a bioreactor (the vessel that ran it), each asserted by a different source loop — so the graph faithfully keeps both rather than forcing a single label. It carries an rdfs:label, a bp:releaseStatus of "PASS", the bp:monomerPct CQA (Critical Quality Attribute) of 98.611^^xsd:float (the ^^ attaches the datatype, here a 32-bit float), and one outbound bp:derivedFrom edge pointing out to SEED-001, its parent seed. The genealogy also runs the other way: PApool-001, its child pool, carries an inbound bp:derivedFrom (and a convenience bp:fromBatch) pointing at this node — edges whose subject is the child, not this batch. Unpack the node and you see the graph's whole trick at one address: lineage and quality, inbound and outbound, all as triples on the same subjects. This is the graph analogue of the "one node, fully unpacked" reading card from the connectivity chapter.

The same node, fully unpacked: a Batch node is just the bundle of triples that share bp:BATCH-2026-001 as their subject — two type triples, a label, a release status, the monomer CQA, and the outbound derivedFrom edge, with the inbound edges from its child pool making it a hub in the thread. Original diagram by the authors, created with AI assistance.
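The "a node is a bundle of triples" idea can be sketched without any RDF library. The snippet below is a minimal illustration, assuming plain Python tuples stand in for triples and prefixed names are kept as strings; the helper names `node` and `inbound` are hypothetical and are not part of build_graph.py.

```python
# Triples written as (subject, predicate, object) tuples.
# These are the facts the text lists for BATCH-2026-001 and its child pool.
TRIPLES = [
    ("bp:BATCH-2026-001", "rdf:type", "bp:Batch"),
    ("bp:BATCH-2026-001", "rdf:type", "bp:Bioreactor"),
    ("bp:BATCH-2026-001", "rdfs:label", '"BATCH-2026-001"'),
    ("bp:BATCH-2026-001", "bp:releaseStatus", '"PASS"'),
    ("bp:BATCH-2026-001", "bp:monomerPct", '"98.611"^^xsd:float'),
    ("bp:BATCH-2026-001", "bp:derivedFrom", "bp:SEED-001"),
    ("bp:PApool-001", "bp:derivedFrom", "bp:BATCH-2026-001"),
    ("bp:PApool-001", "bp:fromBatch", "bp:BATCH-2026-001"),
]


def node(triples, subject):
    """Return the (predicate, object) pairs that share one subject."""
    return [(p, o) for s, p, o in triples if s == subject]


def inbound(triples, obj):
    """Return the (subject, predicate) pairs whose object is this IRI."""
    return [(s, p) for s, p, o in triples if o == obj]


if __name__ == "__main__":
    for predicate, value in node(TRIPLES, "bp:BATCH-2026-001"):
        print(predicate, value)
    print("inbound:", inbound(TRIPLES, "bp:BATCH-2026-001"))
```

Calling `node(TRIPLES, "bp:BATCH-2026-001")` yields the six outbound facts, while `inbound(...)` yields the two edges whose subject is the child pool.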

Where this node comes from — back along the trilogy spine

This node is the open-source landing point for two earlier books. The physical thing it stands for — the stirred-tank run that grew the cells and produced the harvest — is the subject of Book 1's Production Bioreactor chapter; that vessel is the bp:Bioreactor typing on this very node. The idea that a fact like monomerPct 98.611 should travel between systems as a self-describing, FAIR record rather than a bare cell in a spreadsheet is the open challenge posed in Book 2's Ontologies and FAIR data and Semantic interoperability chapters, with the lineage walk itself framed there as the digital thread and twin. What those two books describe as a step and a data-point, the triples on this page are the runnable code that finally implements.

The ontology: agreeing on what things are

Before we can link facts, we need a vocabulary that says what a Batch is, what a CapturePool is, and that derivedFrom connects a child to its parent. That vocabulary is an ontology. You can write a tiny local one — and we will — but the value comes from anchoring it to ontologies the rest of the industry already shares.

Three open standards matter here, and they stack:

Allotrope (AFO) is the laboratory-analytics ontology: it gives standardized meaning to results, instruments, samples, and methods, so an HPLC result means the same thing whether it came from your LIMS or a contract lab [4].
QUDT types every quantity and unit, so 98.611 % carries its dimension and unit as machine-readable facts rather than a string tacked onto a number [5]. A value without its unit is a future deviation waiting to happen.
The Industrial Ontologies Foundry (IOF) publishes a Core ontology for manufacturing grounded in the Basic Formal Ontology (BFO, the ISO/IEC 21838-2 top-level ontology) — a tiny, domain-neutral set of root categories (object, process, quality, and the like) that every more specific class hangs from, which is what lets independently built ontologies still agree on what kind of thing each term is. It is a mid-level, principled vocabulary for equipment, materials, processes, and the relations between them, with a biopharma release (BMIC, the IOF's Biopharma Manufacturing Industrial Content) aimed at production lines exactly like our mAb (monoclonal antibody) process [6]. Its design rationale (why a shared upper grounding prevents the "everyone-invents-their-own-classes" chaos) is laid out in the IOF Core paper [7]. One honest scope note: BMIC and these upper ontologies model intent and structure — recipes, specifications, equipment, and the CPP (Critical Process Parameter — a process input such as temperature, pH, or feed rate whose setting affects a quality outcome) / CQA relationships between them — not the raw measured arrays themselves. A spectrum or chromatogram is not flattened into triples; it lives in an Allotrope or AnIML file that the graph references by IRI.

In production you import those ontologies and map your plant onto them. For a laptop-runnable chapter we keep a small, local namespace and align it conceptually — this is the honest pattern the repo follows throughout. The vocabulary, written in Turtle (the most human-readable text format for RDF), looks like this. Two pieces of Turtle to read it: the bare a is shorthand for rdf:type ("is a"), and each prefix names which vocabulary a term comes from — rdf:/rdfs: for RDF's own core schema vocabulary, and owl: for the richer Web Ontology Language (OWL) layered on top of RDF, which adds expressive constructs like owl:TransitiveProperty.

# Illustrative — platform/ontology/bioproc.ttl (the shape you would import & align to IOF/AFO).
@prefix bp:   <https://example.org/bioproc#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

bp:Batch          a owl:Class ; rdfs:label "Bioreactor batch" .
bp:CapturePool    a owl:Class ; rdfs:label "Protein A capture pool" .
bp:SeedTrain      a owl:Class ; rdfs:label "Seed train" .
bp:DrugSubstance  a owl:Class ; rdfs:label "Drug substance lot" .
bp:DrugProduct    a owl:Class ; rdfs:label "Drug product lot" .

bp:derivedFrom    a owl:ObjectProperty , owl:TransitiveProperty ;
                  rdfs:comment "Child material/lot derived from a parent." .
bp:monomerPct     a owl:DatatypeProperty ;
                  rdfs:comment "SEC %monomer release CQA, typed against QUDT." .

Notice derivedFrom is declared owl:TransitiveProperty. That single word records the meaning of the relation: it says that if a drug product derives from a drug substance, and the drug substance derives from a capture pool, then the drug product derives from the capture pool too — transitively, all the way down the chain. A reasoner — a separate piece of software that reads the OWL axioms and writes out the new triples they logically imply — could use that declaration to materialize the full closure as stored facts. In this laptop chapter, though, we get the same end-to-end chain a lighter way and run no reasoner at all: the SPARQL (bp:derivedFrom)+ property path you will meet below walks the asserted edges directly at query time (which is why the loader emits 91 asserted triples and never invokes a reasoner). Transitivity is the declared semantics of the relation; the + path is how we actually traverse it here — two complementary routes to the identical chain, not the same mechanism.
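To make the contrast concrete, here is a rough sketch of what "materializing the closure" means for the five derivedFrom edges of BATCH-2026-001. It is plain Python rather than an OWL reasoner, and the function name `transitive_closure` is a hypothetical helper; a real reasoner would apply many more axioms than this single rule.

```python
# The five asserted derivedFrom edges for one batch, as (child, parent) pairs.
ASSERTED = {
    ("DP-001", "DS-001"),
    ("DS-001", "PApool-001"),
    ("PApool-001", "BATCH-2026-001"),
    ("BATCH-2026-001", "SEED-001"),
    ("SEED-001", "WCB-CHO-001"),
}


def transitive_closure(edges):
    """Apply the rule (a R b) and (b R c) implies (a R c) until nothing new appears."""
    closure = set(edges)
    while True:
        implied = {(a, d) for a, b in closure for c, d in closure if b == c}
        new = implied - closure
        if not new:
            return closure
        closure |= new


if __name__ == "__main__":
    closure = transitive_closure(ASSERTED)
    inferred = closure - ASSERTED
    print(len(ASSERTED), "asserted,", len(inferred), "inferred")
    print(("DP-001", "WCB-CHO-001") in closure)
```

For this six-node chain the closure holds 15 pairs, 10 of them inferred. That growth is the storage cost a materializing reasoner pays up front, and the cost a query-time property path avoids.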

From tables to a thread: relational facts are loaded into RDF triples, the genealogy becomes a chain of derivedFrom edges, and one SPARQL query walks that chain to reconstruct a lot's full lineage. Original diagram by the authors, created with AI assistance.

The data we are turning into a graph

The graph is built from three committed golden datasets — canonical, version-controlled reference data the book's process simulator generated from a fixed random seed (SIM_SEED=2026), so every number below is reproducible. First, the batches and their release verdict, from examples/datasets/batches.csv:

batch_id,role,release
BATCH-2026-001,golden,PASS
BATCH-2026-002,sibling,PASS
BATCH-2026-003,sibling,PASS
BATCH-2026-004,sibling,OOS
BATCH-2026-005,sibling,PASS
BATCH-2026-006,sibling,PASS

The release column is each batch's verdict — PASS, or OOS (Out Of Specification, i.e. a result outside its acceptance limits) for the one seeded failure, BATCH-2026-004, whose drug product DP-004 is the lot that fails release downstream on host-cell protein (we follow that thread in "Why it matters" below).

Second, the lot genealogy — the parent/child edges that are the thread — from examples/datasets/lot_genealogy.csv:

batch_id,child,child_type,parent,parent_type
BATCH-2026-001,SEED-001,seed_train,WCB-CHO-001,wcb
BATCH-2026-001,BATCH-2026-001,bioreactor,SEED-001,seed_train
BATCH-2026-001,PApool-001,capture_pool,BATCH-2026-001,bioreactor
BATCH-2026-001,DS-001,drug_substance,PApool-001,capture_pool
BATCH-2026-001,DP-001,drug_product,DS-001,drug_substance

Read those five rows as a supply chain in miniature: one working cell bank (WCB-CHO-001) seeds a seed train, which inoculates the bioreactor batch, whose clarified harvest becomes a Protein A capture pool, which is purified into drug substance, which is filled into drug product. Notice every batch traces back to the same WCB-CHO-001 — that shared root is what makes a cell-bank-level investigation answerable across the whole campaign. The listing above shows only BATCH-2026-001's five rows — one batch's slice of the thread; the file carries the full genealogy for all six batches, which is why the built graph comes out to 91 triples when the loader runs.

One honest bioprocess simplification to flag: this five-row chain collapses the whole downstream train into a single capture_pool → drug_substance hop. A faithful genealogy threads every unit operation Book 1 walks through — the capture pool is polished (a second, orthogonal chromatography step that clears residual aggregate and host-cell DNA, in Book 1's Polishing Chromatography chapter), then carried through two independent viral-safety steps — low-pH or detergent viral inactivation and 20-nanometre viral filtration — before UF/DF concentrates and buffer-exchanges it into the drug substance. Each of those is its own material node with its own derivedFrom edge, and because the property is transitive the lineage walk reaches the cell bank exactly the same way whether the chain is five hops or twelve — the genealogy gets longer, never structurally different. The graph would also gain the evidence each step emits — a viral-filtration step's log-reduction value, a UF/DF cycle's final concentration and diavolume count — hung off its node as a release-relevant attribute, exactly as the SEC monomer CQA hangs off the batch below.

Third, the release assays, from examples/datasets/hplc_results.csv:

batch_id,test,value,unit,spec_low,spec_high,result
BATCH-2026-001,SEC_monomer_pct,98.611,%,95.0,100.0,PASS
BATCH-2026-001,SEC_HMW_pct,1.287,%,0.0,3.0,PASS
BATCH-2026-001,CEX_main_pct,70.686,%,60.0,80.0,PASS

We will hang the SEC %monomer result on each batch node as a Critical Quality Attribute, so a lineage walk and a quality result live in the same graph. %monomer is the size-exclusion (SEC) purity result — the fraction of antibody that is intact, non-aggregated monomer; a high value (98.6%) means low high-molecular-weight aggregate (HMW, 1.287% here, with 0.439% low-molecular-weight), a core safety CQA because aggregates can be immunogenic (they can provoke an unwanted immune response in the patient). One honest caveat: %monomer is a release / drug-substance-stage SEC purity result, but for chapter simplicity the loader attaches it to the upstream bioreactor batch node (build_graph.py adds bp:monomerPct to BP[r.batch_id]). In a faithful model it would attach to the drug-substance lot; we mirror the repo's simplification here rather than imply the bioreactor batch itself was assayed for monomer.

Building the graph with RDFLib

In the full companion stack these triples are served by Apache Jena Fuseki — the mature open-source SPARQL 1.1 server backed by a persistent triplestore (a database purpose-built to store and query RDF triples) [8] — which we deploy below. But to keep this chapter and its test runnable with zero services, we build the same graph in-process with RDFLib, the Python library for constructing, serializing, and querying RDF [9]. Here is the heart of the chapter — the loader, from examples/chapters/16-semantics-knowledge-graph/build_graph.py. It reads the three CSVs and emits triples:

from pathlib import Path

import pandas as pd
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import RDFS, XSD

DATA = Path(__file__).resolve().parents[2] / "datasets"
BP = Namespace("https://example.org/bioproc#")


def build_graph() -> Graph:
    g = Graph()
    g.bind("bp", BP)

    batches = pd.read_csv(DATA / "batches.csv")
    gen = pd.read_csv(DATA / "lot_genealogy.csv")
    rel = pd.read_csv(DATA / "hplc_results.csv")

    for _, b in batches.iterrows():
        s = BP[b.batch_id]
        g.add((s, RDF.type, BP.Batch))
        g.add((s, RDFS.label, Literal(b.batch_id)))
        g.add((s, BP.releaseStatus, Literal(b.release)))

Each g.add((subject, predicate, object)) is literally one triple written into the graph. (RDF.type and RDFS.label here are not the author's own predicates — they are the standard, built-in RDF/RDFS terms RDFLib provides, the same rdf:type and rdfs:label from the Turtle above.) The first loop turns every batch into a typed, labelled node carrying its release status.

The genealogy loop is where the thread is woven. For each row it types both child and parent, then adds the derivedFrom edge that links them:

    # genealogy edges: child bp:derivedFrom parent
    for _, e in gen.iterrows():
        child, parent = BP[e.child], BP[e.parent]
        g.add((child, RDF.type, BP[e.child_type.title().replace("_", "")]))
        g.add((parent, RDF.type, BP[e.parent_type.title().replace("_", "")]))
        g.add((child, BP.derivedFrom, parent))
        if e.parent_type == "bioreactor":
            g.add((child, BP.fromBatch, BP[e.parent]))

The .title().replace("_", "") turns the CSV's seed_train into the class SeedTrain and capture_pool into CapturePool — a small, deterministic mapping from relational vocabulary to ontology classes. Finally, the release results attach the CQA, typed against XSD.float (the QUDT alignment lives in the ontology) so the value's datatype is explicit, not guessed:

    # release results: the SEC %monomer CQA attaches to the batch
    monomer = rel[rel.test == "SEC_monomer_pct"]  # %monomer purity, not titer (yield)
    for _, r in monomer.iterrows():
        g.add((BP[r.batch_id], BP.monomerPct, Literal(float(r.value), datatype=XSD.float)))
    return g
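The class-name mapping used in the genealogy loop can be checked in isolation. The sketch below applies the same `.title().replace("_", "")` expression to the six type strings that appear in the lot_genealogy.csv listing and compares the result with the classes declared in the illustrative Turtle vocabulary; `class_name`, `CSV_TYPES`, and `DECLARED` are names introduced here for illustration only.

```python
def class_name(csv_type: str) -> str:
    """Mirror the loader's expression: title-case each word, then drop underscores."""
    return csv_type.title().replace("_", "")


# Type strings that appear in the child_type / parent_type columns shown earlier.
CSV_TYPES = [
    "wcb",
    "seed_train",
    "bioreactor",
    "capture_pool",
    "drug_substance",
    "drug_product",
]

# Classes declared in the illustrative bioproc.ttl listing.
DECLARED = {"Batch", "CapturePool", "SeedTrain", "DrugSubstance", "DrugProduct"}

MAPPED = {t: class_name(t) for t in CSV_TYPES}
UNDECLARED = sorted(set(MAPPED.values()) - DECLARED)

if __name__ == "__main__":
    for csv_type, cls in MAPPED.items():
        print(f"{csv_type:15s} -> bp:{cls}")
    print("not in the illustrative vocabulary:", UNDECLARED)
```

Run against the listings in this chapter, the mapping produces two class names (Bioreactor and Wcb) that the short Turtle excerpt does not declare. RDF accepts that silently, which is one reason a validation gate is worth having.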

What the graph does not hold

It is worth being just as precise about what the loader does not write, because the boundary is a design choice, not an omission. BMIC and the upper ontologies it builds on are prescriptive: they model intent and structure — recipes, specifications, equipment, the CPP/CQA relationships between them — and there is no class in them for a raw measured array. A spectrum is roughly a thousand intensity-versus-wavenumber points; a chromatogram is a dense time series; a design space is a multi-dimensional curve. Flattening even one of those into subject–predicate–object triples would balloon the graph into millions of meaningless rows and still lose the array's shape.

So the heavy numeric payloads live where they belong — in the vendor-neutral analytical containers — and the graph links to them by IRI rather than swallowing them. There are three complements, each suited to a shape: Allotrope ADF, an HDF5 binary built around an n-dimensional Data Cube, carries spectra, chromatograms, and curves; AnIML, the ASTM open XML format, carries the same arrays in its SeriesSet; and Allotrope ASM, the JSON form, carries the scalar, machine-actionable results. The graph holds a triple like Batch-001 hasGlycanProfile <file://…/run.adf> — a pointer to the antibody's glycan profile (the array of attached-sugar structures, itself a quality attribute), not the array — and the same pattern points at a historian tag (hasTrace <opc.tcp://…>) for a live time series. The full lineup of these formats, and how a LADS server or a LIMS emits them, is the subject of The Analytical Lab: Instruments, LIMS & ELN; the chromatography side — exporting a peak as ASTM ANDI/NetCDF rather than a vendor blob — is covered in Downstream Capture: Chromatography & Filtration Skids.

This is exactly why the loader you just read splits its behaviour. A scalar release result — monomerPct 98.611 — maps cleanly into a triple, because a single typed number is a fact the graph can reason over and constrain with SHACL. But a spectrum or chromatogram stays a referenced document: the graph records that the batch has one, where it lives, and what it is, and leaves the array itself in the ADF or AnIML file. The graph is the index of the digital thread, not the warehouse for every number on it — and that division of labour is what keeps it small, queryable, and honest.

The digital-thread query

Now the payoff. With the graph built, we ask one recursive question in SPARQL — the W3C standard query language for RDF [2]. The query, from the same file, walks the derivedFrom chain from a drug-substance lot toward its ancestors:

PREFIX bp: <https://example.org/bioproc#>
SELECT ?step ?type WHERE {
  bp:DS-001 (bp:derivedFrom)+ ?step .
  ?step a ?type .
} ORDER BY ?type

(The bare a in ?step a ?type is the same rdf:type shorthand from the Turtle above — it carries into SPARQL unchanged, so that line reads "and bind ?type to whatever class each step is typed as.") The load-bearing token is (bp:derivedFrom)+. That is a SPARQL property path: the + means "follow one or more derivedFrom hops." So this single line says "find every step DS-001 derives from, however many links away" — exactly the recursive lineage walk that is painful in plain SQL.

Running the file end to end — python chapters/16-semantics-knowledge-graph/build_graph.py — produces this real output:

graph: 91 triples

digital thread — what DS-001 derives from:
  BATCH-2026-001 (Batch)
  BATCH-2026-001 (Bioreactor)
  PApool-001     (CapturePool)
  SEED-001       (SeedTrain)
  WCB-CHO-001    (Wcb)

That is the complete lineage of one drug-substance lot, reconstructed by the graph in one query: the capture pool it was purified from, the bioreactor batch that produced it, the seed train that inoculated it, and the working cell bank at the very root. (Wcb is just the .title() transform applied to the CSV's wcb — the working cell bank — so it reads a little oddly next to a clean two-word class like CapturePool; it is the same deterministic mapping, not a glitch.) The batch appears twice — once typed Batch (from the batches loop) and once Bioreactor (from the genealogy loop) — which is itself an honest, faithful picture of multi-source modeling: the same physical thing carries more than one class, and the graph happily holds both. This is the open-source realization of what vendors sell as a "manufacturing knowledge graph."

Because the release CQA lives in the same graph, you can extend the walk to answer the question an investigator actually asks — and what was its quality result? The companion file carries a second, tested query, THREAD_WITH_CQA_QUERY, that walks the same ancestor direction to the upstream batch carrying the CQA and reads its %monomer off the same graph:

PREFIX bp: <https://example.org/bioproc#>
SELECT ?batch ?monomer WHERE {
  bp:DS-001 (bp:derivedFrom)+ ?batch .
  ?batch bp:monomerPct ?monomer .
}

The direction is identical to the first query: the monomer-bearing batch is an ancestor of DS-001, so it uses the same (bp:derivedFrom)+ path toward ancestors — only the extra ?batch bp:monomerPct ?monomer line differs, filtering the walk down to the one ancestor that carries the CQA. Run end to end, it returns the originating batch and its result:

lineage + quality — originating batch and its release %monomer:
  BATCH-2026-001 monomerPct=98.611

Lineage and quality, one graph, one query. A query like this — a question the vocabulary must be able to answer — is what ontology engineers call a competency question (CQ), and they use a catalog of them as the ontology's acceptance test: each CQ pairs the query with an expected result, so "is the model good?" becomes a mechanical pass/fail rather than an opinion. The two queries above are textbook CQs — what does this lot derive from? and what is its originating batch's monomer? — and in Book 4 they appear almost verbatim as CQ-01 and CQ-03, with the lineage walk's expected answer fixed at a specific ancestor count so a model that stops returning the right set fails visibly on every build. Book 4 turns 23 such questions into runnable PASS/FAIL checks in its Competency questions as queries chapter, the same discipline by which the derivedFrom transitive relation and the SHACL release gate below are themselves validated — requirements expressed as executable tests.

Walking the chain: how a property path actually traverses

It is worth being precise about what (bp:derivedFrom)+ does, because the + is doing more than it looks. A bare bp:derivedFrom matches exactly one hop: DS-001's immediate parent, PApool-001, and nothing further. The + quantifier turns that into "one or more hops, of any length": the engine follows derivedFrom from DS-001 to PApool-001, then again from PApool-001 to BATCH-2026-001, then to SEED-001, then to WCB-CHO-001, binding ?step to every node it lands on along the way and stopping only when there are no more outgoing derivedFrom edges. The lineage is four hops here (the output lists five rows only because BATCH-2026-001 is returned once for each of its two classes), but the query never says "four". It says "however deep," which is exactly why the same line works for a two-step chain and a twenty-step one without edits.
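The traversal described above can be imitated in a few lines of plain Python. This is only a sketch of the "one or more hops" semantics, not how RDFLib or Jena evaluate property paths internally; the function name `ancestors` is hypothetical.

```python
from collections import deque

# derivedFrom edges for BATCH-2026-001, as (child, parent) pairs.
EDGES = [
    ("SEED-001", "WCB-CHO-001"),
    ("BATCH-2026-001", "SEED-001"),
    ("PApool-001", "BATCH-2026-001"),
    ("DS-001", "PApool-001"),
    ("DP-001", "DS-001"),
]


def ancestors(edges, start):
    """Follow child -> parent links one or more hops; the start is not included."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, []).append(parent)

    found = []
    queue = deque(parents.get(start, []))
    while queue:
        current = queue.popleft()
        if current in found:  # already visited: also guards against cycles
            continue
        found.append(current)
        queue.extend(parents.get(current, []))
    return found


if __name__ == "__main__":
    print(ancestors(EDGES, "DS-001"))
    print(ancestors(EDGES, "DP-001"))
```

Starting from DS-001 the walk visits four ancestors; starting one lot further downstream, at DP-001, it visits five. The function body is identical in both cases, which is the point of an unbounded path.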

That arbitrary depth is the load-bearing difference from SQL. To walk the same chain in PostgreSQL you write a recursive common table expression — a WITH RECURSIVE that re-joins the genealogy table to itself, seeds the recursion with DS-001, and unions each new generation onto the result until the join returns no rows. It works, but you must author the recursion by hand, name the join columns, and guard against cycles yourself. In SPARQL the recursion is the operator. A property path also composes: (bp:derivedFrom)+ is one path expression, but you can write bp:derivedFrom/bp:fromBatch for a fixed two-step hop, or alternation with |, so the query language itself expresses the shape of the walk rather than encoding it in procedural joins. The graph stores the edges as data, so traversal is a query, not a program.
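For comparison, the hand-authored recursion looks roughly like this. The sketch uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so it runs with no services; the table name `lot_genealogy`, the CTE name `lineage`, and the `depth` column are assumptions made for illustration, and the WITH RECURSIVE form shown is standard SQL that both engines accept.

```python
import sqlite3

# (child, parent) pairs from the lot_genealogy.csv listing.
ROWS = [
    ("SEED-001", "WCB-CHO-001"),
    ("BATCH-2026-001", "SEED-001"),
    ("PApool-001", "BATCH-2026-001"),
    ("DS-001", "PApool-001"),
    ("DP-001", "DS-001"),
]

LINEAGE_SQL = """
WITH RECURSIVE lineage(lot, depth) AS (
    -- seed: the immediate parent of the starting lot
    SELECT parent, 1 FROM lot_genealogy WHERE child = ?
    UNION
    -- step: re-join the genealogy table to the rows found so far
    SELECT g.parent, l.depth + 1
    FROM lot_genealogy AS g
    JOIN lineage AS l ON g.child = l.lot
)
SELECT lot, depth FROM lineage ORDER BY depth
"""


def lineage_sql(start):
    """Walk the genealogy table upward from one lot using a recursive CTE."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE lot_genealogy (child TEXT, parent TEXT)")
        con.executemany("INSERT INTO lot_genealogy VALUES (?, ?)", ROWS)
        return con.execute(LINEAGE_SQL, (start,)).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for lot, depth in lineage_sql("DS-001"):
        print(depth, lot)
```

Note that because `depth` changes on every pass, the UNION would not stop a cyclic genealogy from recursing indefinitely; the cycle guard is still the author's job, as the paragraph above says.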

Drawn as a graph, the chain the query walks is just a row of derivedFrom edges with the release CQA hanging off the batch — the semantics-knowledge-graph.svg hero above shows that lineage end to end.

The genealogy is also the grouping key a model must split on

The same derivedFrom thread does quiet but essential work for anyone building a model on this data, which is why a knowledge graph is more than an investigation tool — it is the lineage substrate machine learning needs to validate itself honestly. When you train a soft sensor or a release-prediction model on a campaign, the cardinal sin is a row-wise random train/test split, because sibling batches are not independent examples: every batch in this campaign descends from the same WCB-CHO-001, often sharing a media lot and a capture skid, so two batches off one cell bank are near-twins. Split them randomly and a near-twin lands on both sides of the train/test line — the model has effectively seen the answer, and its reported score is fantasy. The fix is a grouped split — every record from a batch goes wholly to train or wholly to test — and the genealogy is exactly where the group key lives. A (bp:derivedFrom)+ walk back to the shared cell bank is the leave-one-batch-out (or leave-one-cell-bank-out) grouping that Book 5's data chapter makes the default, and that the models-and-validation chapter turns into GroupKFold and nested cross-validation. The graph that traces a deviation and the graph that defines an honest validation fold are the same graph.
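A grouped split of this kind can be sketched in plain Python. The example is a minimal illustration, not the splitter Book 5 uses: the records are placeholders (three hypothetical samples per batch, with no measured values), the batch IDs come from batches.csv, and `leave_one_batch_out` is a hypothetical helper that plays the role a library class such as scikit-learn's GroupKFold would play in practice.

```python
BATCH_IDS = [
    "BATCH-2026-001",
    "BATCH-2026-002",
    "BATCH-2026-003",
    "BATCH-2026-004",
    "BATCH-2026-005",
    "BATCH-2026-006",
]

# Placeholder records: only the grouping key matters for the split.
RECORDS = [
    {"batch_id": batch_id, "sample": i}
    for batch_id in BATCH_IDS
    for i in range(3)
]


def leave_one_batch_out(records, group_key="batch_id"):
    """Yield (held_out, train, test) so that no group appears on both sides."""
    groups = sorted({r[group_key] for r in records})
    for held_out in groups:
        train = [r for r in records if r[group_key] != held_out]
        test = [r for r in records if r[group_key] == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    for held_out, train, test in leave_one_batch_out(RECORDS):
        print(held_out, "train:", len(train), "test:", len(test))
```

In a real pipeline the group key would come from the graph rather than a hard-coded list. Grouping by cell bank instead of batch would put this whole campaign in one group, since all six batches share WCB-CHO-001.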

Two more ML disciplines hang off the same triples. A model that predicts an outcome needs an applicability domain — a gate that asks whether a new batch even resembles the ones the model was trained on, so it can decline to guess out of its depth — and the lineage tells you when a batch sits outside the trained envelope (a new cell bank, an un-seen scale) before any prediction is trusted. And once a model is deployed, the graph is where its lineage belongs as first-class facts: which dataset hash, which model version, and which CQA it scored are triples like any other, so a later audit can walk from a released lot to the exact frozen model that touched it — the model-lineage and drift governance Book 5's MLOps chapter builds. That chapter also draws the distinction this graph helps keep straight: process drift (the living cells genuinely wandering batch to batch) is a real manufacturing signal the digital thread should preserve, whereas model drift (the predictor going stale against that moving process) is a defect to detect — and conflating the two is how a monitoring system either cries wolf or misses a real shift.

Validating the graph with SHACL

A graph that anyone can add triples to needs a gate, or it rots. The Shapes Constraint Language (SHACL) is the W3C standard for validating an RDF graph against shape constraints. It is how you enforce, for example, that every Batch must have a release status and at most one monomer result before it is allowed to claim it has been released [3]. It is the same mechanism Allotrope's data models use to constrain how their ontology may be applied. A minimal shape looks like this:

# Illustrative (simplified from platform/ontology/shapes.ttl, which attaches
# monomerPct to bp:ReleaseShape on the drug-substance/product lot — see the L131 caveat).
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix bp:  <https://example.org/bioproc#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

bp:BatchShape a sh:NodeShape ;
  sh:targetClass bp:Batch ;
  sh:property [ sh:path bp:releaseStatus ; sh:minCount 1 ;
                sh:in ( "PASS" "OOS" "PENDING" ) ] ;
  sh:property [ sh:path bp:monomerPct ;
                sh:datatype xsd:float ; sh:maxCount 1 ] .

(The square brackets [ … ] are Turtle's way of writing an anonymous, unnamed node inline — a blank node — here used to bundle each property constraint without giving it its own IRI.) SHACL is what turns a knowledge graph from a convenient lookup into a data-quality control: a triple that violates a shape is caught at load time instead of surfacing as a contradiction during an investigation.

The validation report: what SHACL hands back when a triple fails

SHACL does not just say yes or no. It returns a validation report, which is itself an RDF graph, so the failure is queryable like any other fact. Run a validator (pyshacl is a separate Python package built on rdflib; Fuseki exposes Jena's own SHACL validator over HTTP) and a clean graph reports sh:conforms true and nothing else. Break one rule, say a Batch somehow carries two bp:monomerPct triples, violating the sh:maxCount 1 constraint above, and the validator instead returns sh:conforms false plus a sh:ValidationResult that names the offending node in sh:focusNode (bp:BATCH-2026-001), the property in sh:resultPath (bp:monomerPct), the rule that fired in sh:sourceConstraintComponent (sh:MaxCountConstraintComponent), a sh:resultSeverity of sh:Violation, and a human-readable sh:resultMessage. Because every one of those is a triple, the report drops straight back into the same graph machinery: you can SPARQL-query "which focus nodes failed which constraints" across a whole load, rather than reading a stack trace. That is the difference between a constraint language and an assertion: the failure carries enough structured context to route an exception, not just halt one. Severity is what the load job gates on: a sh:Violation blocks the load, while a softer sh:Warning or sh:Info lets a borderline triple through with a flag, which is how you stage a new rule without rejecting historical data on day one. (Strictly, a result at any severity makes sh:conforms false, so the gate should inspect sh:resultSeverity rather than sh:conforms alone.)
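The shape of that report, and why it is easier to route than a stack trace, can be shown without a SHACL engine at all. The sketch below is a plain-Python mimic of the three BatchShape constraints and of the report fields named above; it is not pyshacl or Jena, and the function names and the monomer values are made up for illustration.

```python
from collections import defaultdict

ALLOWED_STATUS = ("PASS", "OOS", "PENDING")


def result(focus, path, component, severity="sh:Violation"):
    """One entry shaped like a sh:ValidationResult."""
    return {"focusNode": focus, "resultPath": path,
            "sourceConstraintComponent": component, "resultSeverity": severity}


def check_batch(focus, triples):
    """Apply the three BatchShape constraints to one focus node."""
    values = defaultdict(list)
    for s, p, o in triples:
        if s == focus:
            values[p].append(o)
    results = []
    statuses = values["bp:releaseStatus"]
    if len(statuses) < 1:  # sh:minCount 1
        results.append(result(focus, "bp:releaseStatus", "sh:MinCountConstraintComponent"))
    for status in statuses:  # sh:in ( "PASS" "OOS" "PENDING" )
        if status not in ALLOWED_STATUS:
            results.append(result(focus, "bp:releaseStatus", "sh:InConstraintComponent"))
    if len(values["bp:monomerPct"]) > 1:  # sh:maxCount 1
        results.append(result(focus, "bp:monomerPct", "sh:MaxCountConstraintComponent"))
    return results


def validation_report(batches, triples):
    """Collect results for every focus node; conforms only if there are none."""
    results = [r for b in batches for r in check_batch(b, triples)]
    return {"conforms": not results, "results": results}


triples = [
    ("bp:BATCH-2026-001", "bp:releaseStatus", "PASS"),
    ("bp:BATCH-2026-001", "bp:monomerPct", 98.1),  # made-up values: two results
    ("bp:BATCH-2026-001", "bp:monomerPct", 98.4),  # on one batch breaks maxCount 1
    ("bp:BATCH-2026-004", "bp:releaseStatus", "OOS"),
    ("bp:BATCH-2026-004", "bp:monomerPct", 96.0),
]
report = validation_report(["bp:BATCH-2026-001", "bp:BATCH-2026-004"], triples)

# "Which focus nodes failed which constraints", answered from the report itself.
failed = defaultdict(list)
for r in report["results"]:
    failed[r["focusNode"]].append(r["sourceConstraintComponent"])

# A load gate that blocks only on violations, letting warnings through flagged.
blocking = [r for r in report["results"] if r["resultSeverity"] == "sh:Violation"]
print(report["conforms"], dict(failed), len(blocking))
```

With a real validator the same questions are asked of the report graph in SPARQL; the point of the sketch is only that each failure arrives as structured fields a pipeline can branch on.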

One model, many schemas: authoring with LinkML

Step back and count the schemas this book has hand-written for the same handful of concepts. A Batch with a release status and a monomer result is an OWL class in the ontology above, a sh:NodeShape in the SHACL we just wrote, a CREATE TABLE in the Chapter 4 relational model, and — for the lab result that feeds the CQA — a JSON document with its own schema back in Chapter 14. Four schemas, four files, four chances to drift apart the next time someone adds a field. Nothing keeps them in step.

LinkML — the Linked data Modeling Language — is the open-source answer to that drift. It is a peer-reviewed, YAML-based data-modeling framework, developed at Lawrence Berkeley National Laboratory and licensed Apache-2.0, whose stated purpose is to curb the "proliferation of single-use data models" while making data FAIR [14]. You author the model once, in YAML, and generate the rest. Unlike the RDFLib loader, this LinkML model is shown as the authoring pattern to adopt — it is not yet checked into the companion stack — so treat the file path below as illustrative rather than runnable. The same Batch, modeled in bioproc.yaml:

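# Illustrative: platform/model/bioproc.yaml (the one source every schema below is generated from).
id: https://example.org/bioproc
name: bioproc                        # LinkML schemas need a name; this one is an assumption
default_range: string
imports:
  - linkml:types                     # supplies the built-in string and float ranges
prefixes:
  linkml: https://w3id.org/linkml/
  bp:   https://example.org/bioproc#
  qudt: http://qudt.org/schema/qudt/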
classes:
  Batch:
    class_uri: bp:Batch
    slots: [batch_id, release_status, monomer_pct]

slots:
  batch_id:
    identifier: true
  release_status:
    slot_uri: bp:releaseStatus
    range: ReleaseStatus            # enum -> SHACL sh:in + JSON Schema enum
  monomer_pct:
    slot_uri: bp:monomerPct          # the SEC %monomer CQA
    range: float
    unit:
      ucum_code: "%"                 # UCUM = Unified Code for Units of Measure; a unit can also carry a qudt: quantity-kind URI

enums:
  ReleaseStatus:
    permissible_values:
      PASS:
      OOS:
      PENDING:

From that single file the LinkML generators emit the artifacts you have been writing by hand. gen-shacl produces the sh:NodeShape — release_status's enum becomes the very sh:in ( "PASS" "OOS" "PENDING" ) constraint shown above. gen-json-schema emits a JSON Schema to validate JSON instances; gen-sqltables the CREATE TABLE DDL (class → table, slot → column); gen-owl the OWL ontology; gen-pydantic typed Python classes for the loaders. And linkml-validate checks a data file against the model directly. One reviewed source; every downstream schema regenerated from it.
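The single-source idea itself is small enough to sketch in plain Python. The code below is a toy illustration of the pattern, not LinkML's generator code: it holds the Batch model in a dictionary (a hypothetical structure, loosely echoing the YAML) and derives a CREATE TABLE statement and a JSON Schema from it, so the enum is written once and appears in both outputs.

```python
import json

# Hypothetical in-memory model, loosely echoing bioproc.yaml.
MODEL = {
    "class": "Batch",
    "slots": {
        "batch_id": {"range": "string", "identifier": True},
        "release_status": {"range": "string", "enum": ["PASS", "OOS", "PENDING"]},
        "monomer_pct": {"range": "float"},
    },
}

SQL_TYPES = {"string": "TEXT", "float": "DOUBLE PRECISION"}
JSON_TYPES = {"string": "string", "float": "number"}


def to_sql(model):
    """Derive a CREATE TABLE statement: class -> table, slot -> column."""
    columns = []
    for name, slot in model["slots"].items():
        column = f"  {name} {SQL_TYPES[slot['range']]}"
        if slot.get("identifier"):
            column += " PRIMARY KEY"
        if "enum" in slot:
            allowed = ", ".join(f"'{value}'" for value in slot["enum"])
            column += f" CHECK ({name} IN ({allowed}))"
        columns.append(column)
    return f"CREATE TABLE {model['class']} (\n" + ",\n".join(columns) + "\n);"


def to_json_schema(model):
    """Derive a JSON Schema object from the same slots."""
    properties = {}
    for name, slot in model["slots"].items():
        prop = {"type": JSON_TYPES[slot["range"]]}
        if "enum" in slot:
            prop["enum"] = list(slot["enum"])
        properties[name] = prop
    required = [n for n, s in model["slots"].items() if s.get("identifier")]
    return {"title": model["class"], "type": "object",
            "properties": properties, "required": required}


ddl = to_sql(MODEL)
schema = to_json_schema(MODEL)
print(ddl)
print(json.dumps(schema, indent=2))
```

Adding a permissible value to `release_status` in `MODEL` changes both outputs on the next run, which is the property the LinkML generators provide at full scale and with far richer semantics.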

Crucially for this chapter, a slot can carry its own semantics. The slot_uri shown above (and a companion exact_mappings list, which points a slot at equivalent terms in other vocabularies) let monomer_pct declare which concept it is, and the unit metaslot can reference a QUDT quantity-kind URI — so the meaning and units the graph relies on are pinned in the model rather than bolted on afterward [14]. One honest boundary: LinkML is the authoring layer, not a replacement for the standards. There is no official LinkML-to-Allotrope binding; aligning a field to an AFO term is a mapping you author, not a feature you import. What you gain is that the SHACL gate, the relational tables, the JSON Schema, and the OWL ontology stop drifting apart, because all four are generated from one source under change control — the precise opposite of the silent drift this chapter keeps warning about. Nor is it a research toy: the National Microbiome Data Collaborative models its entire metadata standard in LinkML [15], and the framework grew out of the Biolink Model that NCATS's Translator uses to harmonize hundreds of biomedical data sources [14].

One schema, five generated artifacts: bioproc.yaml is the single source of truth, and gen-shacl, gen-json-schema, gen-sqltables, gen-pydantic, and gen-owl derive the graph gate, JSON validation, relational tables, typed Python loaders, and the OWL ontology from it — so the five can never drift apart. Original diagram by the authors, created with AI assistance.

Serving it at scale: Fuseki and Oxigraph

In-process RDFLib is perfect for a chapter and a unit test, but a plant needs a persistent SPARQL endpoint many systems can query. Two open-source stores fit:

Apache Jena Fuseki is the mature choice — a SPARQL 1.1 server backed by the TDB persistent triplestore, run as a service [8]. It is the semantics profile in the companion stack. A freshly-upped Fuseki has no datasets (the compose block sets only ADMIN_PASSWORD), so you create the digitalthread dataset once — via the admin UI or a POST /$/datasets — and then load the same triples and expose /digitalthread/sparql over HTTP:

# Illustrative — load the graph into Fuseki (semantics profile);
# create the digitalthread dataset first (admin UI or POST /$/datasets).
curl -X POST --data-binary @bioproc.ttl \
  -H "Content-Type: text/turtle" \
  http://localhost:3030/digitalthread/data
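Once the triples are loaded, any client that speaks the SPARQL 1.1 Protocol can run the lineage walk over HTTP. The sketch below uses only the Python standard library and assumes the /digitalthread/sparql endpoint named above is reachable on localhost without authentication; the function names are hypothetical. It builds the request separately from sending it, so the query can be inspected without a running server.

```python
import json
import re
import urllib.request

# Endpoint path taken from the text; host, port and open access are assumptions.
ENDPOINT = "http://localhost:3030/digitalthread/sparql"

LINEAGE_QUERY = """\
PREFIX bp: <https://example.org/bioproc#>
SELECT ?ancestor WHERE {{
  bp:{lot} (bp:derivedFrom)+ ?ancestor .
}}
"""


def build_lineage_request(lot, endpoint=ENDPOINT):
    """Build a SPARQL Protocol POST that walks derivedFrom from one lot."""
    # Only accept simple IDs so the lot cannot smuggle extra SPARQL into the query.
    if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9-]*", lot):
        raise ValueError(f"unexpected lot identifier: {lot!r}")
    query = LINEAGE_QUERY.format(lot=lot)
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={"Content-Type": "application/sparql-query",
                 "Accept": "application/sparql-results+json"},
        method="POST",
    )


def fetch_ancestors(lot, endpoint=ENDPOINT):
    """Send the query and return ancestor IRIs (needs a live endpoint)."""
    with urllib.request.urlopen(build_lineage_request(lot, endpoint), timeout=10) as resp:
        body = json.load(resp)
    return [row["ancestor"]["value"] for row in body["results"]["bindings"]]


request = build_lineage_request("DS-001")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `fetch_ancestors("DS-001")` against a loaded dataset would return the IRIs along the chain; it is left uncalled here because it depends on the server being up.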

Oxigraph is the lightweight, embeddable alternative — a SPARQL 1.1 query/update database backed by RocksDB, ideal when you want a graph store without a JVM (the Java Virtual Machine — Fuseki and Jena are Java services that need one running; Oxigraph is a single native binary that does not) [10].

Both of those are RDF triplestores, and that choice is itself worth naming. The other large family of graph databases is the labeled property graph (LPG) — most widely deployed as Neo4j, queried in Cypher — which many teams reach for first because it is so easy to start with. We chose RDF/SPARQL deliberately: its globally unique IRIs, shared ontologies (IOF / Allotrope / QUDT), and W3C-standard query and constraint languages are exactly what make the graph interoperable across systems and sites rather than a fast store for one application. (Commercial RDF engines exist too — Ontotext GraphDB and AWS Neptune among them — if you want vendor support behind the same standards.) The LPG world is excellent engineering; it simply optimizes for a different thing than the cross-system digital thread this chapter is about.

The repo notes one practical caution. The stack uses the official Apache image apache/jena-fuseki:5.2.0, pinned by tag in examples/platform/compose/compose.yaml, with the corresponding manifest digest recorded alongside it in examples/platform/versions.lock (the pattern Chapter 25's supplier register relies on). For Fuseki specifically that digest — the cryptographic sha256:… fingerprint of the exact image contents — is left as a VERIFY-BEFORE-USE placeholder: as the lock file and the reference-architecture license table both note, the community Fuseki image moved repositories, so you should resolve and record the real digest for your chosen registry mirror before you deploy by digest. Pinning by digest, rather than trusting the movable :5.2.0 tag alone, is what guarantees every site pulls byte-for-byte the same image.
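A deployment script can refuse to proceed while the digest is still a placeholder. The sketch below is an assumption about how such a guard might look, not the repo's tooling: the function name is hypothetical, the lock-file format is not modeled, and the 64-character digest in the usage lines is an obviously fake value for demonstration.

```python
import re

# A resolved digest is "sha256:" followed by exactly 64 lowercase hex characters.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def pinned_reference(image, tag, digest):
    """Return an image reference pinned by digest, or refuse a placeholder."""
    if not DIGEST_RE.fullmatch(digest):
        raise ValueError(f"{image}:{tag} has no resolved digest ({digest!r})")
    return f"{image}:{tag}@{digest}"


fake_digest = "sha256:" + "a" * 64  # NOT a real Fuseki digest; for illustration only
print(pinned_reference("apache/jena-fuseki", "5.2.0", fake_digest))

try:
    pinned_reference("apache/jena-fuseki", "5.2.0", "VERIFY-BEFORE-USE")
except ValueError as err:
    print("refused:", err)
```

The guard only checks that a digest has been recorded in the right form. Confirming that it is the correct digest for your registry mirror is still the manual resolution step described above.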

Whichever store you choose, the deeper payoff is the one that has run through the whole trilogy: a SPARQL endpoint over a shared vocabulary with globally unique identifiers is the cleanest route to data that is genuinely FAIR — Findable, Accessible, Interoperable, and Reusable, with machine-actionability as the explicit design target [11]. A graph that another system can query without a bespoke integration is interoperability, made concrete.

Why it matters

The digital thread is not jargon; it is the literal mechanism of a modern investigation. A systematic literature review of the digital thread for smart manufacturing identifies semantic links (knowledge graphs and ontologies) as the connective tissue that provides traceability across the product lifecycle [12]. Take DP-004, the drug product filled from DS-004, which traces back through PApool-004 to our seeded OOS batch BATCH-2026-004. Suppose it fails release on host-cell protein (HCP: residual protein shed by the production cells, an impurity the Protein A capture step is meant to clear). The full release file carries an HCP result of 128.0 ng/mg on that batch against a 100.0 ng/mg ceiling, though the three-row excerpt shown earlier listed only the SEC and CEX assays. The question is instantly "what else shares its lineage?" Because every batch in the campaign traces to WCB-CHO-001, the graph can answer "which other lots came off the same cell bank?" in one traversal, and once you also model the shared chromatography skid as a node, the same traversal extends to "which lots ran on the same capture skid?". That is impact analysis, and it is the difference between a scoped deviation and a blind, campaign-wide quarantine.
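The impact-analysis traversal is the lineage walk run in both directions: up from the failing lot to a shared ancestor, then down from that ancestor to everything else it fed. The standard-library sketch below illustrates the logic on a hand-written edge list. The edges stand in for the graph, seed-train nodes are omitted, and the function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical (child, parent) derivedFrom edges; seed-train nodes are omitted.
DERIVED_FROM = [
    ("DP-001", "DS-001"), ("DS-001", "PApool-001"), ("PApool-001", "BATCH-2026-001"),
    ("DP-004", "DS-004"), ("DS-004", "PApool-004"), ("PApool-004", "BATCH-2026-004"),
    ("BATCH-2026-001", "WCB-CHO-001"), ("BATCH-2026-004", "WCB-CHO-001"),
]

parents = defaultdict(set)
children = defaultdict(set)
for child, parent in DERIVED_FROM:
    parents[child].add(parent)
    children[parent].add(child)


def reachable(start, edges):
    """Every node reachable from `start` in one or more hops (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        for neighbour in edges.get(queue.popleft(), ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def shares_ancestor(lot, ancestor):
    """Nodes descended from `ancestor`, minus `lot` and its own lineage."""
    own_lineage = reachable(lot, parents) | {lot}
    if ancestor not in own_lineage:
        raise ValueError(f"{lot} does not descend from {ancestor}")
    return reachable(ancestor, children) - own_lineage


impacted = shares_ancestor("DP-004", "WCB-CHO-001")
print(sorted(impacted))
# ['BATCH-2026-001', 'DP-001', 'DS-001', 'PApool-001']
```

Scoping the question to `BATCH-2026-004` instead of the cell bank returns an empty set here, because nothing outside DP-004's own lineage descends from that batch. Adding the capture skid as a node with its own edge type would let the same `reachable` helper answer the skid question.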

It matters for regulators because lineage and impact analysis are exactly what data-integrity and investigation expectations demand: the ability to walk from any record to every record it depends on, across systems, reproducibly — the traceability and attributable-record demands of ALCOA+ and of electronic-records rules like 21 CFR Part 11 and EU Annex 11, which this book covers in ALCOA+ by Construction and Part 11 / Annex 11 with Open Source. A knowledge graph makes that walk a query instead of a week of cross-referencing spreadsheets.

In the real world

Commercial vendors sell this under names like "manufacturing knowledge graph," "contextualized data fabric," or "self-service data layer," typically layered over a historian's asset model. What we built in one Python file and a Turtle vocabulary is the same idea expressed in open standards every system can speak — and the standards convergence is real: Allotrope, IOF/BMIC, and QUDT are industry efforts precisely so the graph at one site is intelligible at the next. It is not a paper exercise either: a 2026 peer-reviewed study built a biopharmaceutical knowledge graph that integrated heterogeneous process data and let engineers query parameter-to-outcome (CPP-to-CQA) relationships directly — the manufacturing-knowledge-graph idea, working in practice [13].

Now the honest OSS-vs-commercial reckoning. The graph technology is genuinely production-grade in open source: RDF, SPARQL, SHACL, Fuseki, and Oxigraph are mature, standards-compliant, and free of licensing traps (Apache-2 for Jena/Fuseki, permissive BSD for RDFLib, MIT for Oxigraph). You can build a real digital thread with zero license cost. What pure OSS does not hand you is the GxP wrapper: a validated, change-controlled mapping from your source systems into the graph, supplier accountability for the triplestore under GAMP-5 (Good Automated Manufacturing Practice, guide 5 — the industry standard for validating computerised systems), and assurance that the load process is complete and correct under qualification (the regulated-industry term for documented proof a system was installed and works as intended). Qualification here is the same IQ/OQ/PQ ladder a chromatography skid passes — installation qualification proves the triplestore and its loader were deployed to the specified version (which is exactly what the tag-plus-digest pin above gives you machine-checkable evidence for), operational qualification proves the load job and the SPARQL endpoint behave to spec across their range, and performance qualification proves the end-to-end thread reproduces a known lineage on real data — and any later change to the mapping, the ontology, or the image digest re-enters that ladder under change control rather than being hot-patched. Under modern Computer Software Assurance (CSA) thinking the depth of that testing is risk-scaled — a query that informs a release decision earns more scripted evidence than a read-only convenience view — but it is never skipped, because the graph holds records a regulator may inspect.

That inspectability is precisely the demand of the electronic-records rules: 21 CFR Part 11 (the US FDA rule on electronic records and signatures) and EU Annex 11 (its European counterpart) require that a system holding GxP records enforce attributable, audit-trailed, access-controlled, and accurate data — so the load job's reconciliation log, the SHACL gate's validation report, and the triplestore's own access controls are not nice-to-haves but the literal Part 11 / Annex 11 evidence, covered in depth in Part 11 / Annex 11 with Open Source. And because a graph at one site must mean the same thing at the next, the same wrapper is what carries the model through tech transfer: when the validated mapping moves from the development site to a commercial plant — or scales from a 10-litre development bioreactor to a 2000-litre production train — the IRIs, the ontology alignment, and the SHACL shapes travel unchanged, so the receiving site re-qualifies the load against its own systems rather than re-inventing the vocabulary. The graph is also a derived view — its triples are copied from the relational record of truth, so unless the load is validated and re-run under change control, the graph can silently drift from the systems it claims to mirror, which is the opposite of data integrity. As elsewhere in this book: open source gets you a clean, inspectable engine; the validated system around it is yours to build or buy.

When the graph lies: drift, FAIR-noncompliance, and ontology sprawl

It is tempting to treat a knowledge graph as automatically trustworthy because it speaks open standards. It is not, and the failure modes are worth naming so you build against them. The first is drift: because the graph is a derived copy, any column the load mapper does not cover, or any source edit made after the last load, leaves the graph asserting something the relational record no longer says — a contradiction an investigator may not notice until it matters. The fix is the dull one: the load is a validated, change-controlled job, re-run on a known trigger, with the triple count reconciled against the source (our loader prints graph: 91 triples precisely so a drift check has a number to assert against).
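A drift check can go one step past the triple count. The sketch below, in plain Python with hypothetical function names and a two-column mapping invented for illustration, rebuilds the triples the source rows should produce and diffs them against what the graph holds. It shows a case the count alone would miss: a release status edited after the last load.

```python
def expected_triples(rows):
    """Map source rows to the triples the load job should have produced."""
    triples = set()
    for row in rows:
        subject = f"bp:{row['batch_id']}"
        triples.add((subject, "rdf:type", "bp:Batch"))
        triples.add((subject, "bp:releaseStatus", row["release_status"]))
    return triples


def reconcile(rows, graph_triples):
    """Report triples the graph lacks and triples the source no longer supports."""
    expected = expected_triples(rows)
    return {"missing": expected - graph_triples, "stale": graph_triples - expected}


# Source of truth after an edit: BATCH-2026-004 moved from PENDING to OOS.
source_rows = [
    {"batch_id": "BATCH-2026-001", "release_status": "PASS"},
    {"batch_id": "BATCH-2026-004", "release_status": "OOS"},
]
# Graph as loaded before the edit (illustrative contents).
graph = {
    ("bp:BATCH-2026-001", "rdf:type", "bp:Batch"),
    ("bp:BATCH-2026-001", "bp:releaseStatus", "PASS"),
    ("bp:BATCH-2026-004", "rdf:type", "bp:Batch"),
    ("bp:BATCH-2026-004", "bp:releaseStatus", "PENDING"),
}

drift = reconcile(source_rows, graph)
print("counts match:", len(expected_triples(source_rows)) == len(graph))
print("missing:", drift["missing"])
print("stale:", drift["stale"])
```

Both sides hold four triples, so a count assertion passes while the graph still claims PENDING. The count remains a useful first tripwire for uncovered columns or dropped rows; the set difference is what catches an in-place edit.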

The second failure is quieter and is the one the FAIR movement exists to measure. Standards-compliance on the wire does not guarantee the data is actually findable, interoperable, or reusable, and when someone measures it, the gap is large. A 2024 meta-research study assessed COVID-19 research datasets against the FAIR principles and found that while essentially all were findable, only 46.7% reached even moderate compliance on interoperability and 21.5% on accessibility [16]. In other words, most of the datasets assessed fail the very "I" that a shared ontology is supposed to deliver. The cause is rarely the triplestore; it is metadata authored without a controlled vocabulary, units left as bare strings, identifiers that are local rather than global. A graph inherits exactly those defects from its inputs: load untyped values and you get a graph that looks FAIR and is not.

The third is ontology sprawl — the "everyone-invents-their-own-classes" chaos the IOF Core paper was written to prevent [7]. A local bp: namespace is fine for a laptop chapter, but if every site coins its own derivedFrom with subtly different meaning, the cross-site interoperability that justified RDF over a labeled-property graph evaporates. The discipline that holds all three failures off is the same one this chapter keeps returning to: align to shared upper ontologies (BFO, IOF, Allotrope, QUDT), author the model once under change control (LinkML), and gate every load with SHACL so a triple that lies is caught before it enters.

Key terms

Ontology — a shared, machine-readable vocabulary defining what entities exist (Batch, CapturePool) and how they relate (derivedFrom); the agreement that lets different systems mean the same thing.
RDF (triple) — the Resource Description Framework; every fact is a subject–predicate–object triple, with no fixed schema, so relationships are stored as first-class data.
Knowledge graph — a graph of RDF triples linking batch, equipment, material, recipe, and result into one navigable whole.
Digital thread — the end-to-end chain of linked records tracing a product across its lifecycle; here, the derivedFrom lineage from drug product back to the cell bank.
SPARQL property path — a query operator (e.g. (derivedFrom)+) that follows a relationship recursively, enabling lineage walks of arbitrary depth in one statement.
SHACL — the Shapes Constraint Language; validates an RDF graph against shape constraints to enforce data quality before bad triples enter the graph.
LinkML — an open, YAML-based data-modeling language; author the model once and generate the SHACL, JSON Schema, SQL DDL, OWL, and Pydantic classes from it, so those schemas stop drifting apart.
Basic Formal Ontology (BFO) — the ISO/IEC 21838-2 top-level ontology of domain-neutral root categories (object, process, quality) that IOF and BMIC are grounded in, so independently built ontologies still agree on what kind of thing each term is.
IOF / Allotrope (AFO) / QUDT — the open standards the graph aligns to: a manufacturing ontology, a laboratory-analytics ontology, and a units-and-quantities ontology, respectively.
Transitive property — an OWL relation (like derivedFrom) where A→B and B→C imply A→C; a reasoner (separate inference software) can materialize that full chain as stored triples, or — as in this chapter's runnable code — a SPARQL (derivedFrom)+ property path can compute the same closure at query time.
Triplestore — a database that stores and queries RDF triples over SPARQL; here, Apache Jena Fuseki or the embeddable Oxigraph.
IRI (Internationalized Resource Identifier) — the globally-unique name RDF gives every subject, predicate, and resource object (bp:BATCH-2026-001); unlike a local primary key it means the same thing across systems and sites, which is what makes the graph interoperable rather than database-local.
SHACL validation report — the RDF graph a SHACL run hands back: sh:conforms plus, for each failure, a sh:ValidationResult naming the focus node, property path, constraint, and severity — a queryable, structured exception rather than a yes/no or a stack trace.
GxP / GAMP-5 — GxP is the umbrella of "Good x Practice" quality regulations (GMP for manufacturing, GLP for labs, GCP for clinical) governing any data a regulator may inspect; GAMP-5 (Good Automated Manufacturing Practice, guide 5) is the industry standard for validating the computerised systems — like the triplestore and its load job — that produce or hold that data.
Qualification (IQ/OQ/PQ) — the documented proof, in three stages, that a system is fit for use: installation qualification (deployed to the specified version), operational qualification (behaves to spec across its range), and performance qualification (reproduces a known result end to end on real data). The graph's load job and SPARQL endpoint pass the same ladder a chromatography skid does, re-entered under change control on any change, with the testing depth risk-scaled under CSA.
Competency question (CQ) — a question the vocabulary must be able to answer, paired with its expected result so the ontology is graded by a mechanical pass/fail; the chapter's lineage and lineage-plus-quality SPARQL queries are CQs, catalogued and run as PASS/FAIL acceptance tests in Book 4.
Grouped (leave-one-batch-out) split — the validation discipline of putting every record from a batch — and ideally every batch off one cell bank — wholly on one side of the train/test line, so a model is scored on genuinely unseen lots; the genealogy's derivedFrom walk to the shared cell bank is the grouping key, and a row-wise random split that ignores it reports a fantasy score.

Where this leads

We have made the data mean the same thing everywhere and proven a one-query digital thread against real campaign data. But the graph, like the contextualization view before it, assumes the numbers already live in our open historian. Most plants have a commercial historian holding decades of process data — and that is the boundary we cross next. In Chapter 20 — Bridging to Commercial Historians: AVEVA/OSIsoft PI, we build a bidirectional bridge to a PI Web API (against a faithful mock, with the production base-URL and credential swap documented) so the open stack and the validated commercial system of record can finally exchange data.
