Implementation: From the Wire to the Graph

📍 Where we are: Part V · Implementation — methodology: SAMOD + LOT. The model is conceptualized, formalized, and aligned. Now it has to be fed. This chapter is the loading dock: it shows how the standards a plant already speaks — PI historian rows, OPC UA reads, B2MML batch records, Allotrope analytical files — become the very triples the rest of the book queries, with running code that proves each crosswalk rather than asserting it.

A model that only ever sees hand-typed Turtle is a model that has never met a factory. The dataset this book reasons over does not arrive as Turtle; it arrives as a historian tag streaming twice a second, an OPC UA variable read off a controller, an ISA-88 recipe exchanged as XML by an MES, and a chromatogram exported by an HPLC. Implementation is the act of turning each of those into the same graph — bp:BATCH-2026-001, bp:CCP-001, bp:SEC-Result-001 — without inventing new identity and without flattening a million sensor points into triples. Four small scripts in the companion dataset each take one real plant standard and emit the bp: triples that already live in instances.ttl, so the crosswalk is exhibited, not promised.

The simple version

Four delivery trucks back up to one loading dock. One carries temperature readings from the plant's data recorder; one carries the same readings straight off the bioreactor's controller; one carries the recipe-and-batch paperwork from the manufacturing system; one carries a lab result from the HPLC. They speak four different languages on the truck — but the dock has a translator for each, and every package comes off labeled in the same warehouse code. This chapter is those four translators. The trick is that the warehouse keeps the label and the shelf number (a number with its unit, and a pointer to the file) — it does not unpack the million-point chromatogram onto the shelf.

Start from the questions

This chapter underwrites one competency question above all, CQ-19: does every stored quantity carry a unit, with no bare numbers? (It is defined in the specification.) For a mAb campaign that requirement is not bookkeeping pedantry: a drug-substance lot is released by comparing a monomer purity to a 95.0 percent spec floor and a culture's temperature to a 36.0–37.0 °C range, and a comparison only means something if both sides carry the same unit. The wire is exactly where that unit gets dropped: a historian column gets read as a float and the Cel next to it discarded, and the graph fills with naked 36.51s that the release gate would then compare against a spec it cannot safely interpret. A bare 2.41 could be a percent of high-molecular-weight aggregate that fails release or a harmless count; without its unit the batch-disposition decision is built on sand. Every script below carries the unit across the boundary on purpose. The chapter also serves the structural question CQ-21 (which vessel a run occurred in, with equipment kept separate from the batch material) and the trace half of the genealogy, because the B2MML and historian bridges are where equipment and the dense stream attach to the run without corrupting the identity that makes a recall traceable.

Two wires, one observation: historian and OPC UA

A production bioreactor running a mAb culture is monitored every few seconds by a battery of probes (temperature, pH, dissolved oxygen, and more) because the cells are alive and the control strategy holds each parameter inside a narrow range for two weeks. That torrent of PAT data is the densest stream in the plant, and the first crosswalk has two front doors that must land in the same place. A reading can reach the graph as a PI historian row (ts, tag, value, unit, quality, batch_id), or it can be read one hop earlier, straight off the bioreactor controller as an OPC UA Data Access variable. The model's discipline is that both routes mint the same sosa:Observation, because the OPC UA NodeId's string identifier is the historian tag (ns=2;s=BR101.Temp.PV -> BR101.Temp.PV). The mapping is declared in RML and executed in process by historian_to_rdf.py:

# historian_to_rdf.py — each historian row becomes one sosa:Observation; the unit travels with it.
g.add((obs, RDF.type, SOSA.Observation))
g.add((obs, SOSA.observedProperty, prop))
g.add((obs, SOSA.hasSimpleResult, Literal(value, datatype=XSD.float)))
g.add((obs, SOSA.resultTime, Literal(ts, datatype=XSD.dateTime)))
# The unit is NO LONGER dropped: keep the raw UCUM code, and add the QUDT unit IRI where one exists.
g.add((obs, QUDT.ucumCode, Literal(unit, datatype=XSD.string)))
if unit in UCUM_TO_QUDT:  # e.g. {"Cel": unit:DEG_C}
    g.add((obs, QUDT.hasUnit, UCUM_TO_QUDT[unit]))
g.add((obs, BP.fromBatch, batch))
g.add((batch, BP.hasTrace, prop)) # the INDEX edge — the stream stays in PI

The qudt:ucumCode line is CQ-19 made operational: the Cel that sat in the historian's unit column rides along as a UCUM code on the observation, and where a clean QUDT unit exists it is also attached as an IRI (Cel -> unit:DEG_C). The OPC UA loader proves the second door lands in the same room: it reads EUInformation (the engineering unit, as a UCUM code), maps the Good StatusCode to PI quality 192, and mints the identical observation IRI:

# opcua_to_rdf.py — the NodeId identifier IS the tag, so the OPC UA read lands the same observation.
def tag_of(node_id: str) -> str:
    return node_id.split(";s=", 1)[1] if ";s=" in node_id else node_id
# ns=2;s=BR101.Temp.PV -> BR101.Temp.PV -> https://example.org/historian/obs/BR101.Temp.PV/<ts>
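
The claim that both routes mint one observation can be checked without an RDF library. The following is a minimal sketch, not the companion script: obs_iri, the percent-encoding of the timestamp, and both sample inputs are assumptions made for illustration; only tag_of and the base IRI come from the listing above.

```python
from urllib.parse import quote

OBS_BASE = "https://example.org/historian/obs/"  # base IRI from the comment above


def tag_of(node_id: str) -> str:
    """Return the string identifier of an OPC UA NodeId, or the input unchanged."""
    return node_id.split(";s=", 1)[1] if ";s=" in node_id else node_id


def obs_iri(tag: str, ts: str) -> str:
    """Mint an observation IRI from a tag and an ISO 8601 timestamp.

    Assumption: the timestamp is percent-encoded so ':' cannot break the path.
    """
    return f"{OBS_BASE}{tag}/{quote(ts, safe='')}"


# Hypothetical inputs: one historian row and one OPC UA read of the same sample.
historian_row = {"ts": "2026-01-05T08:00:00Z", "tag": "BR101.Temp.PV"}
opcua_read = {"ts": "2026-01-05T08:00:00Z", "node_id": "ns=2;s=BR101.Temp.PV"}

from_historian = obs_iri(historian_row["tag"], historian_row["ts"])
from_opcua = obs_iri(tag_of(opcua_read["node_id"]), opcua_read["ts"])

print(from_historian == from_opcua)  # True: both doors land on one IRI
```

Because identity is derived only from the tag and the timestamp, neither loader needs to know the other exists; a second read of the same point adds no new node.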

Run both and the index edges and the unit handling line up exactly. The historian bridge emits its index edges:

historian -> RDF: 23 triples from 3 rows

bp:hasTrace index edges (one per batch/tag — the stream stays in PI):
BATCH-2026-001 hasTrace https://example.org/historian/tag/BR101.Temp.PV
BATCH-2026-001 hasTrace https://example.org/historian/tag/BR101.pH.PV

OPC UA -> RDF: 16 triples from 2 variable reads

the NodeId identifier IS the historian tag, so both routes land the same observation:
ns=2;s=BR101.Temp.PV -> sosa:Observation (value 36.51, unit http://qudt.org/vocab/unit/DEG_C, quality 192)
ns=2;s=BR101.pH.PV -> sosa:Observation (value 7.02, unit pH, quality 192)

Note the honest asymmetry: Cel resolves to unit:DEG_C, but pH has no clean QUDT unit, so it stays a bare UCUM code. The value is still never bare — it carries pH as its code — but the crosswalk to a QUDT IRI is partial, and the script does not pretend otherwise. That is the index edge at work: the batch gets one bp:hasTrace to the tag IRI, and the millions of points stay in PI, never flattened into the graph.
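
The partial crosswalk can be made explicit in a few lines. This sketch assumes triples are plain tuples and that the mapping table holds only the one entry the chapter shows; unit_triples and the obs: identifiers are hypothetical names, not part of the companion scripts.

```python
# Triples are plain (subject, predicate, object) tuples so no RDF library is needed.
UNIT = "http://qudt.org/vocab/unit/"
UCUM_TO_QUDT = {"Cel": UNIT + "DEG_C"}  # the single mapping shown in the chapter


def unit_triples(obs: str, ucum: str) -> list[tuple[str, str, str]]:
    """Always keep the UCUM code; add a QUDT unit IRI only when one is mapped."""
    triples = [(obs, "qudt:ucumCode", ucum)]
    if ucum in UCUM_TO_QUDT:
        triples.append((obs, "qudt:hasUnit", UCUM_TO_QUDT[ucum]))
    return triples


temp = unit_triples("obs:temp", "Cel")  # code plus resolvable unit IRI
ph = unit_triples("obs:ph", "pH")       # code only: no QUDT unit is mapped

print(len(temp), len(ph))  # 2 1
```

The design choice is that the code is unconditional and the IRI is opportunistic, so an unmapped unit degrades to a code rather than to a bare number.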

The recipe and the batch record: B2MML, with equipment kept apart

The genealogy that this book reasons over does not, on a real floor, live as an RDF derivedFrom edge — it lives as a row in a proprietary MES, exchanged as B2MML XML (the MESA serialization of ISA-88/ISA-95). b2mml_to_rdf.py closes that gap by reading two real-shaped documents: a master recipe and an as-run batch production record. The master recipe (Recipe-mAb-A, version 2.0) maps to a bp:MasterBatchRecord with its production recipe element, its prescribed parameters, and an equipment requirement. The as-run record is where the most important modeling rule of the whole chapter shows up — the <Equipment> reference becomes an occursIn edge on the run, never a second rdf:type on the batch:

# b2mml_to_rdf.py — the batch is typed ONCE; the vessel becomes occursIn on the RUN.
batch = BP[_txt(info, f"{B2}BatchID")]
g.add((batch, RDF.type, BP.Batch)) # typed once, as a Material — never also as the vessel
run = BP["CCP-001"]
g.add((run, RDF.type, BP.CellCultureProcess))
g.add((run, BP.hasOutput, batch))
g.add((run, BP.realizes, BP[_txt(info, f"{B2}MasterRecipeID")]))
equip = info.find(f"{B2}Equipment")
if equip is not None:
    vessel = BP[_txt(equip, f"{B2}EquipmentID")]  # BR-101: a SEPARATE equipment node
    g.add((run, BP.occursIn, vessel))

This is CQ-21 enforced at ingest time, and it matters because of what the vessel and the batch are in a mAb campaign. The bioreactor BR-101 persists across campaigns — it ran a different batch last month and will run another next month; the batch BATCH-2026-001 is the antibody-bearing culture that exists only for this run. They are different BFO categories, and Material owl:disjointWith Equipment makes fusing them a flagged error. A naive loader would read <Equipment>BR-101</Equipment> and stamp the batch as a bioreactor, conflating the material with the vessel it was made in — exactly the planted conflation the disjointness guards exist to catch, and exactly the error that, left in, makes it impossible to ask which batches ran on a given vessel or to walk a recall back through the run. The faithful loader keeps BATCH-2026-001 a bp:Batch and lets only the run bp:CCP-001 occursIn bp:BR-101. A cell-culture run is not uniform either — it passes through a growth phase where cells multiply and a production phase where they make antibody, each with its own parameter ranges — so the per-phase actuals come across as bp:RealizedParameterSettings attached to the phase they govern, each carrying a qudt:QuantityValue so the setpoint is never a bare number: Cel resolves to unit:DEG_C and /d, the feed rate's per-day unit, to unit:PER-DAY. Run it:

B2MML -> RDF: 8 triples from the master recipe, 25 more from the as-run batch record (33 total)

the <Equipment> reference becomes occursIn on the RUN, not a 2nd type on the batch:
CCP-001 occursIn BR-101
BATCH-2026-001 a Batch (typed once — Material only)

These are the same bp:CCP-001, bp:BR-101, and bp:Recipe-mAb-A that instances.ttl asserts by hand — the crosswalk MES → ontology exhibited end to end.
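
For readers without the companion dataset, the equipment rule can be reproduced with the standard library alone. This is a sketch under stated assumptions: the XML is a toy document, the namespace urn:example:b2mml is a placeholder rather than the real B2MML namespace, and the element nesting is simplified; only the element names and the run identifier mirror the listing above.

```python
import xml.etree.ElementTree as ET

B2 = "{urn:example:b2mml}"  # hypothetical namespace, NOT the real B2MML one

# Toy as-run record; real B2MML nests these elements more deeply.
AS_RUN = """
<BatchInformation xmlns="urn:example:b2mml">
  <BatchID>BATCH-2026-001</BatchID>
  <MasterRecipeID>Recipe-mAb-A</MasterRecipeID>
  <Equipment><EquipmentID>BR-101</EquipmentID></Equipment>
</BatchInformation>
"""


def load_as_run(xml_text: str, run_id: str = "CCP-001") -> set[tuple[str, str, str]]:
    """Emit tuple triples; the vessel attaches to the run, never to the batch."""
    info = ET.fromstring(xml_text)
    batch = "bp:" + info.findtext(f"{B2}BatchID").strip()
    run = "bp:" + run_id
    triples = {
        (batch, "rdf:type", "bp:Batch"),  # typed once, as a material
        (run, "rdf:type", "bp:CellCultureProcess"),
        (run, "bp:hasOutput", batch),
        (run, "bp:realizes", "bp:" + info.findtext(f"{B2}MasterRecipeID").strip()),
    }
    equip = info.find(f"{B2}Equipment")
    if equip is not None:
        vessel = "bp:" + equip.findtext(f"{B2}EquipmentID").strip()
        triples.add((run, "bp:occursIn", vessel))
    return triples


triples = load_as_run(AS_RUN)
batch_types = [o for s, p, o in triples if s == "bp:BATCH-2026-001" and p == "rdf:type"]
print(batch_types)  # ['bp:Batch']
```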

The lab result: Allotrope ASM into the same SEC-Result node

The analytical stream is the richest of the four and the one most prone to two opposite mistakes: flattening a chromatogram into triples, or leaving the result a bare number in a vendor file. Before the wire format even matters, the result has to be shaped right, and the most common analytical-modeling error is to fuse "the test" into one node. The model instead splits it into three kinds of thing, because for a mAb release that separation is what makes the number defensible. A method (bp:SEC-Method) — the validated size-exclusion procedure — is a plan, copyable information, the same SEC method whether run today or next campaign. An assay (bp:SEC-Assay-001) is an occurrent: the actual running of that method on a specific sample, by a named analyst, on a dated instrument. And a result (bp:SEC-Result-001) is an information artifact, a typed fact about the sample. Keep these three apart and the graph can say that one validated method was run as many assays yielding many results, that two release results came from the same method, or that a method was revised between two runs — none of which a single fused "test" node can express, and all of which a quality investigator needs when a lot is questioned. Fuse them and the release trace collapses to a number with no provenance.

The sample is the fourth node, and it is the bridge from the lab back to the process. A sample is a material entity derivedFrom the batch it was pulled from, so a result about the sample is, transitively, evidence about the batch. This is how an analytical number rejoins the genealogy — not by being stamped with a batch string, but by being about a specimen that derives from a lot, a chain the transitive derivedFrom walks automatically back to the cell bank. The result holds the scalar and a pointer to the curve, never the curve itself:

# instances.ttl — method (plan) / assay (occurrent) / sample / result, with the curve referenced.
bp:SEC-Method a bp:Method ; rdfs:label "validated SEC method" .
bp:SMP-DS-001 a bp:Sample ; rdfs:label "DS-001 release sample" ; bp:derivedFrom bp:DS-001 .
bp:SEC-Assay-001 a bp:SECAssay ; rdfs:label "SEC assay on DS-001 sample" ;
bp:realizes bp:SEC-Method ;
bp:hasInput bp:SMP-DS-001 ;
bp:hasDevice bp:HPLC-07 ;
bp:performedBy bp:Analyst-AB ;
bp:assayDate "2026-03-10"^^xsd:date ;
bp:hasResult bp:SEC-Result-001 .
bp:SEC-Result-001 a bp:SECResult ; rdfs:label "SEC result for DS-001" ;
bp:isAbout bp:SMP-DS-001 ;
bp:monomerPct "98.611"^^xsd:float ;
bp:specLow 95.0 ; bp:verdict "PASS" ;
bp:hasChromatogram bp:ADF-SEC-001 . # the heavy curve, referenced not embedded
bp:ADF-SEC-001 a bp:AnalyticalDataFile ;
rdfs:seeAlso <https://example.org/adf/SEC-Result-001.adf> .
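
The transitive walk from a result back to the cell bank can be sketched as a plain graph traversal. Only the first edge (sample to DS-001) and the isAbout link come from the Turtle above; the harvest and cell-bank identifiers are hypothetical stand-ins for the intermediate lots, and in the real dataset a reasoner or a SPARQL property path would do this work.

```python
# derivedFrom edges: child -> list of parents (a material may have several).
DERIVED_FROM = {
    "bp:SMP-DS-001": ["bp:DS-001"],           # asserted in the Turtle above
    "bp:DS-001": ["bp:HARVEST-001"],          # hypothetical intermediate
    "bp:HARVEST-001": ["bp:BATCH-2026-001"],  # hypothetical edge
    "bp:BATCH-2026-001": ["bp:WCB-001"],      # hypothetical working cell bank
}
IS_ABOUT = {"bp:SEC-Result-001": "bp:SMP-DS-001"}


def ancestors(node: str, edges: dict[str, list[str]]) -> set[str]:
    """Return every material reachable by following derivedFrom transitively."""
    seen: set[str] = set()
    stack = list(edges.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # guards against cycles and repeated parents
            seen.add(current)
            stack.extend(edges.get(current, []))
    return seen


def evidence_about(result: str) -> set[str]:
    """Everything a result is evidence about: its sample plus the sample's ancestry."""
    sample = IS_ABOUT[result]
    return {sample} | ancestors(sample, DERIVED_FROM)


print("bp:BATCH-2026-001" in evidence_about("bp:SEC-Result-001"))  # True
```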

On a real floor that same fact rarely arrives as Turtle — and it rarely arrives in just one shape. The same SEC purity can come off an Agilent system or a Waters system, each exporting its own file format and its own field names, the instrument-level heterogeneity that makes swapping an HPLC or moving the work to a contract lab quietly painful. Allotrope is the answer to exactly that problem: it gives an analytical result one standardized meaning regardless of which vendor produced it, so a monomerPct is not just a number with a unit but a result of a known measurement type, produced by a typed device, on a typed sample, by a named method — the same fact whether the lab changes instruments or the run moves sites [1]. On the wire that fact most often arrives as ASM — the Allotrope Simple Model, a lightweight JSON-LD document emitted by instrument software, the cheaper sibling of a full ADF data cube. Because ASM is JSON-LD, its @context maps each plain key onto the same bp:, af-r:, and qudt: IRIs, so loading either vendor's export yields the same bp:SEC-Result-001 triples. asm_to_rdf.py parses the one document and checks the headline triples against the dataset:

ASM JSON-LD -> RDF: 7 triples from one Allotrope Simple Model document

same bp:SEC-Result-001 as the Turtle (cheaper ingest, identical meaning):
[OK] typed bp:SECResult
[OK] typed AFO chromatogram
[OK] monomerPct 98.611
[OK] verdict PASS
[OK] hasChromatogram -> ADF-SEC-001

The same 98.611 monomer purity that threads this whole book, the same PASS verdict, the same pointer to bp:ADF-SEC-001 — from the light JSON path instead of the heavy ADF. OBI supplies the investigation frame (the assay as a planned process) and Allotrope (AFO) the analytical meaning (the result as a size-exclusion chromatogram) [1][2]; the alignment file types bp:SECResult up to both.
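
How an @context turns plain keys into shared IRIs can be illustrated with the standard json module. This is a deliberately reduced sketch: the document, its key names, and the bp: namespace IRI are all assumptions, and the hand-rolled expand function covers only flat prefix terms, so it is not a conformant JSON-LD processor and does not reproduce the seven triples of the real script.

```python
import json

BP = "https://example.org/bp#"  # hypothetical namespace IRI for the bp: prefix

# Toy ASM-like document; real ASM exports carry far more structure.
ASM_DOC = json.loads("""
{
  "@context": {
    "bp": "https://example.org/bp#",
    "monomerPct": "bp:monomerPct",
    "verdict": "bp:verdict",
    "hasChromatogram": {"@id": "bp:hasChromatogram", "@type": "@id"}
  },
  "@id": "bp:SEC-Result-001",
  "@type": "bp:SECResult",
  "monomerPct": 98.611,
  "verdict": "PASS",
  "hasChromatogram": "bp:ADF-SEC-001"
}
""")


def expand(curie: str, ctx: dict) -> str:
    """Expand 'prefix:local' using a string-valued prefix entry in the context."""
    prefix, sep, local = curie.partition(":")
    if sep and isinstance(ctx.get(prefix), str):
        return ctx[prefix] + local
    return curie


def to_triples(doc: dict) -> set[tuple]:
    """Map each plain key through the context to a full predicate IRI."""
    ctx = doc["@context"]
    subject = expand(doc["@id"], ctx)
    triples = {(subject, "rdf:type", expand(doc["@type"], ctx))}
    for key, value in doc.items():
        if key.startswith("@"):
            continue
        term = ctx[key]
        if isinstance(term, dict):
            predicate = expand(term["@id"], ctx)
            if term.get("@type") == "@id":  # the value is itself an IRI reference
                value = expand(value, ctx)
        else:
            predicate = expand(term, ctx)
        triples.add((subject, predicate, value))
    return triples


triples = to_triples(ASM_DOC)
print((BP + "SEC-Result-001", BP + "monomerPct", 98.611) in triples)  # True
```

A second vendor's export with different plain keys would need only a different @context to land on the same predicates, which is the vendor-neutrality argument in miniature.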

SEC is not the only assay run on that one specimen, and the loading dock has to land the whole release panel as a set, not a row of disconnected numbers. The same bp:SMP-DS-001 sample carries three results, because a mAb is released on three orthogonal quality questions at once: monomer purity by SEC, charge-variant distribution by cation exchange, and host-cell-protein impurity by ELISA. The dataset asserts all three against the same specimen, each typed to the ontology supplying its meaning:

# instances.ttl — the release panel: three assays, three results, one sample, one batch.
bp:CEX-Assay-001 a bp:CEXAssay ; rdfs:label "CEX assay on DS-001 sample" ;
bp:hasInput bp:SMP-DS-001 ; bp:hasResult bp:CEX-Result-001 .
bp:CEX-Result-001 a bp:CEXResult ; rdfs:label "CEX result for DS-001" ;
bp:isAbout bp:SMP-DS-001 ; bp:cexMainPct "70.686"^^xsd:float ; bp:verdict "PASS" .
bp:HCP-Assay-001 a bp:HCPAssay ; rdfs:label "HCP ELISA on DS-001 sample" ;
bp:hasInput bp:SMP-DS-001 ; bp:hasResult bp:HCP-Result-001 .
bp:HCP-Result-001 a bp:Result ; rdfs:label "HCP result for DS-001" ;
bp:isAbout bp:SMP-DS-001 ; bp:hcpPpm "12.0"^^xsd:float ; bp:verdict "PASS" .

bp:HCPAssay is rdfs:subClassOf obo:OBI_0000661 — OBI's ELISA — where AFO has no single verified IRI, while the SEC result subclasses the AFO chromatogram; the two vocabularies are not redundant but complementary, each naming what the other does not. The payoff of keeping method, assay, sample, and result as separate nodes is exactly this: the release panel is a set of results sharing one specimen that derives from one batch, not a flat record of columns sharing a string — so the question "did every CQA on this lot pass on a sample that truly derives from this batch?" is a traversal, not a clerical reconciliation.
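
The release-gate question as a traversal might look like the sketch below. The three results, their shared sample, and the sample-to-lot edge come from the listings above; panel_passes, derives_from, and the single-parent edge map are hypothetical simplifications, and a production check would run as a query against the loaded graph.

```python
# result -> (sample it isAbout, verdict), taken from the instance listings above.
RESULTS = {
    "bp:SEC-Result-001": ("bp:SMP-DS-001", "PASS"),
    "bp:CEX-Result-001": ("bp:SMP-DS-001", "PASS"),
    "bp:HCP-Result-001": ("bp:SMP-DS-001", "PASS"),
}
DERIVED_FROM = {"bp:SMP-DS-001": "bp:DS-001"}  # single-parent simplification


def derives_from(node: str, target: str, edges: dict[str, str]) -> bool:
    """Follow derivedFrom upward until the target is found or the chain ends."""
    seen = set()
    while node in edges and node not in seen:
        seen.add(node)
        node = edges[node]
        if node == target:
            return True
    return False


def panel_passes(lot: str, results: dict, edges: dict[str, str]) -> bool:
    """True only if the lot has results and every one of them passed."""
    verdicts = [
        verdict
        for sample, verdict in results.values()
        if derives_from(sample, lot, edges)
    ]
    return bool(verdicts) and all(v == "PASS" for v in verdicts)


print(panel_passes("bp:DS-001", RESULTS, DERIVED_FROM))  # True
```

Requiring at least one result guards against a vacuous pass: a lot with no results traceable to it does not clear the gate.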

[Figure: a loading-dock diagram with an index zone and a warehouse zone. On the left, four inbound lanes converge on one knowledge graph: a PI historian row and an OPC UA variable read both landing the same SOSA observation carrying value 36.51 with the UCUM code Cel and the QUDT unit DEG_C; a B2MML master recipe and as-run batch record landing the run CCP-001 that occursIn the vessel BR-101 with the batch BATCH-2026-001 typed once as a material; and an Allotrope ASM JSON-LD result landing the SEC-Result-001 node holding monomerPct 98.611 with its spec and PASS verdict. On the right, a warehouse zone holds the dense payloads the graph only points at (the PI point stream and an Allotrope ADF chromatogram cube), with a single hasChromatogram IRI pointer and a hasTrace tag edge crossing the boundary, captioned that scalars and pointers are indexed in the graph while arrays stay in their files.]

The loading dock: four plant standards (historian, OPC UA, B2MML, and Allotrope ASM) each cross into the one graph as the same typed nodes, every quantity carrying its unit, while the dense stream and the chromatogram stay in their files behind a hasTrace or hasChromatogram pointer. Original diagram by the authors, created with AI assistance.

Evaluation: index, not payload — and units everywhere

The four scripts together demonstrate the two invariants implementation must protect. The first is index versus payload, and for a mAb it is not a storage optimization but a correctness rule. A scalar like 98.611 maps cleanly into a triple, because one typed number is a fact the graph can reason over and compare to the 95.0 spec floor. A chromatogram is not: it is thousands of intensity-versus-time points whose shape — where the monomer peak sits relative to the aggregate and fragment peaks — is the actual evidence that justifies the 98.611. Flattening that array into subject-predicate-object triples would explode the graph and still lose the very shape a reviewer re-examines when a lot is questioned. So none of the loaders ever emit the stream or the curve. The historian and OPC UA bridges emit one bp:hasTrace per batch/tag plus a small, bounded set of observations — the graph holds the index, PI holds the points. The analytical loader holds monomerPct 98.611 and a bp:hasChromatogram IRI — the graph holds the release number, the ADF file holds the curve that backs it. An investigator queries the graph to find which results failed and what they were about, then follows the IRI to the chromatogram only when the raw signal must be re-examined: the graph stays small and queryable, the evidence stays in the file built to hold it [1][3].
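
The index side of that boundary is small enough to sketch. The rows below are hypothetical (the 36.51 and 7.02 values echo the output shown earlier, the third row is invented), and index_edges is an illustrative name; the tag IRI base comes from the historian output above.

```python
TAG_BASE = "https://example.org/historian/tag/"  # base IRI from the output above

# Hypothetical historian rows: two temperature points and one pH point.
ROWS = [
    {"ts": "2026-01-05T08:00:00Z", "tag": "BR101.Temp.PV", "value": 36.51,
     "unit": "Cel", "quality": 192, "batch_id": "BATCH-2026-001"},
    {"ts": "2026-01-05T08:00:30Z", "tag": "BR101.Temp.PV", "value": 36.49,
     "unit": "Cel", "quality": 192, "batch_id": "BATCH-2026-001"},
    {"ts": "2026-01-05T08:00:00Z", "tag": "BR101.pH.PV", "value": 7.02,
     "unit": "pH", "quality": 192, "batch_id": "BATCH-2026-001"},
]


def index_edges(rows: list[dict]) -> set[tuple[str, str, str]]:
    """One hasTrace edge per distinct (batch, tag) pair, however many rows arrive."""
    return {
        ("bp:" + row["batch_id"], "bp:hasTrace", TAG_BASE + row["tag"])
        for row in rows
    }


edges = index_edges(ROWS)
print(len(ROWS), "rows ->", len(edges), "index edges")  # 3 rows -> 2 index edges
```

Because the edges form a set keyed on batch and tag, the index grows with the number of instrumented tags, not with the sampling rate.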

The second is CQ-19: no bare numbers, the invariant the release decision rests on. Every numeric crossing the dock carries its unit. The historian and OPC UA observations carry qudt:ucumCode and, where one exists, qudt:hasUnit; the B2MML realized settings carry a qudt:QuantityValue with both a UCUM code and a QUDT unit IRI; the ASM result keeps monomerPct typed xsd:float so a 98.611 is unambiguously a percent compared to a 95.0 percent floor, never a stray scalar. The honest gap — pH has no clean QUDT unit, since pH is a dimensionless logarithmic quantity — is reported, not hidden, which is the whole point of the unit being there: you can see which readings made it all the way to a resolvable unit IRI and which only carry a code, rather than discovering the ambiguity at the moment a batch is being dispositioned.
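
A CQ-19 audit over loaded observations could be sketched as a three-way classification. The triples here are hypothetical tuples, including a deliberately bare obs:bad reading added to show what a failure looks like; audit_units is an illustrative name, and the real check would run as a query over the graph.

```python
# Hypothetical loaded triples; obs:bad is planted to show a CQ-19 violation.
TRIPLES = [
    ("obs:temp", "sosa:hasSimpleResult", 36.51),
    ("obs:temp", "qudt:ucumCode", "Cel"),
    ("obs:temp", "qudt:hasUnit", "http://qudt.org/vocab/unit/DEG_C"),
    ("obs:ph", "sosa:hasSimpleResult", 7.02),
    ("obs:ph", "qudt:ucumCode", "pH"),
    ("obs:bad", "sosa:hasSimpleResult", 2.41),
]


def audit_units(triples: list[tuple]) -> dict[str, set[str]]:
    """Sort every valued observation into resolved, code-only, or bare."""
    valued = {s for s, p, _ in triples if p == "sosa:hasSimpleResult"}
    coded = {s for s, p, _ in triples if p == "qudt:ucumCode"}
    resolved = {s for s, p, _ in triples if p == "qudt:hasUnit"}
    return {
        "resolved": valued & resolved,             # has a QUDT unit IRI
        "code_only": (valued & coded) - resolved,  # UCUM code but no IRI
        "bare": valued - coded - resolved,         # violates CQ-19
    }


report = audit_units(TRIPLES)
print(sorted(report["bare"]))  # ['obs:bad']
```

Keeping code_only as its own bucket is what lets the pH gap be reported rather than hidden: it is visible, but it is not a violation.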

The unsolved part: a faithful loader is a hand-written contract, not a generated one

These four scripts each execute one crosswalk faithfully, but each was authored by a person who understood both the source standard and the target model, and that does not generalize for free. The RML mapping in historian-map.rml.ttl is declarative, and a real virtual-graph engine (Ontop, or an RML processor) could run it over the live SQL historian; but the OPC UA NodeId-to-tag rule, the B2MML "equipment becomes occursIn, not a second type" rule, and the ASM @context were all decisions a modeler made, not facts a machine derived. The conflation CQ-21 forbids, a batch typed as its vessel, is exactly the mistake an auto-generated mapping makes, because the B2MML XML genuinely puts the <Equipment>BR-101</Equipment> reference inside the batch element. If the loader takes that literally and stamps BATCH-2026-001 as a bioreactor, the material and the vessel fuse into one node, the edge that should read "run occursIn vessel" collapses, and the lineage that a recall walks back to the cell bank is broken at its most data-rich step. A faithful loader is therefore a hand-written contract, not a generated one.
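
The difference between the literal and the faithful reading can be caught mechanically by a disjointness guard. In this sketch the class name bp:Bioreactor and the two class sets are assumptions made for illustration, and conflated is a hypothetical helper; in the real model the check falls out of the owl:disjointWith axiom between Material and Equipment.

```python
MATERIAL_CLASSES = {"bp:Batch"}
EQUIPMENT_CLASSES = {"bp:Bioreactor"}  # hypothetical class name for the vessel

# Literal reading: the <Equipment> reference is stamped onto the batch itself.
NAIVE = [
    ("bp:BATCH-2026-001", "rdf:type", "bp:Batch"),
    ("bp:BATCH-2026-001", "rdf:type", "bp:Bioreactor"),
]

# Faithful reading: the vessel is its own node, reached from the run.
FAITHFUL = [
    ("bp:BATCH-2026-001", "rdf:type", "bp:Batch"),
    ("bp:BR-101", "rdf:type", "bp:Bioreactor"),
    ("bp:CCP-001", "bp:occursIn", "bp:BR-101"),
]


def conflated(triples: list[tuple[str, str, str]]) -> list[str]:
    """Return nodes typed as both a material and a piece of equipment."""
    types: dict[str, set[str]] = {}
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            types.setdefault(subject, set()).add(obj)
    return sorted(
        node
        for node, classes in types.items()
        if classes & MATERIAL_CLASSES and classes & EQUIPMENT_CLASSES
    )


print(conflated(NAIVE))     # ['bp:BATCH-2026-001']
print(conflated(FAITHFUL))  # []
```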

There is a deeper gap beneath the per-script effort: the analytical domain alone sits at the meeting point of three large ontology efforts that share a goal but not a seamless join. OBI models the investigation, AFO models the analytical result, and the IOF biopharma modules model the manufacturing process the sample came from — and while all are reconcilable in principle, there is no single turnkey mapping that fuses them. A result modeled in AFO and a batch modeled in IOF meet only through a crosswalk a team authors, the same OBO–IOF seam that runs through the discovery chapters, now widened by a third party. There is no published, turnkey adapter that maps B2MML or Allotrope into IOF-grounded triples; every plant authors and maintains its own, and the cost of keeping those adapters correct as the source schemas evolve is a real, recurring tax that no amount of ontology rigor on the target side pays down.

The adoption barrier is sharpest for analytical results, and it is a discipline cost, not a standards-maturity one. AFO is large and genuinely complex; full conformance — emitting every result as a conformant ADF or ASM with complete annotation — is a heavy lift that, in 2026, most labs have not made. A plant can hold thousands of results as bare numbers with vendor metadata and call them standardized while none of them actually carry AFO meaning, which is why FAIR-in-fact lags FAIR-in-claim. The tooling exists and the standards are real; the architecture this chapter exhibits (index scalars, reference arrays, type and unit everything) is clear and correct. What remains uneven is the day-to-day discipline of making every loader honor it. Implementation is where the model stops being a diagram and starts being a maintenance burden — an honest one, but a burden.

Why it matters

A graph is only as trustworthy as the loaders that fill it, and in a mAb plant the two things the loaders most have to protect are the release gate and the genealogy. The release gate is the decision to disposition a drug-substance lot: a reviewer checks that every CQA result — monomer purity over 95.0, charge-variant distribution in spec, HCP impurity under limit — passes on a sample that genuinely derives from that lot. If the dock drops units, that comparison validates against numbers it cannot safely interpret, and a passing or failing verdict becomes guesswork. The genealogy is the derivedFrom chain from a vialed drug product back through the drug-substance lot, the harvest, the production batch, the seed train, to the working cell bank — the chain a recall walks to scope which lots share an origin. If the dock conflates a batch with its vessel, the run-to-vessel edge collapses and that walk loses its most data-rich link. If the dock flattens a chromatogram, the graph swells until it is unusable; if it leaves the result a bare blob, the release evidence is unqueryable and the graph is useless. The four crosswalks here show the alternative: each real standard a plant already speaks lands as the same typed node, with its unit, behind an index pointer to the payload — so the genealogy, the trace, and the release evidence are all queryable without the graph ever swallowing the stream. This is where the cheap discipline of the upstream chapters either pays off or is quietly thrown away.

In the real world

Historians, OPC UA, B2MML, and Allotrope are the standards real biopharma plants run today — and the maturity is uneven, which is the point. PI historians and OPC UA are production-default for process data; B2MML is how MES vendors (Körber PAS-X, Siemens Opcenter, Rockwell FactoryTalk Batch) exchange recipes and batch records; Allotrope's ADF and its lightweight ASM sibling give analytical results vendor-neutral meaning and are deployed in real labs, while SiLA 2 and OPC UA LADS device integration are piloted but not yet default [1][2][3]. The companion open-source book shows the same wire-to-graph pattern in a running stack, and Part VIII's tiered survey of what is actually in production maps which of these standards a plant truly runs versus pilots. What remains the frontier is making every loader carry units and respect the index boundary by default, and stitching OBI, AFO, and IOF into one graph without a hand-built adapter at every join.

Key terms

  • Crosswalk — a mapping from one standard's structure into the target ontology's triples; here, four of them (historian, OPC UA, B2MML, ASM), each exhibited by a running script rather than asserted.
  • SOSA observation — the W3C Sensor/Observation/Sample/Actuator node a sensor reading becomes; carries its observed property, simple result, result time, and unit.
  • UCUM code / QUDT unit — the unit travelling with a reading: a raw UCUM code (e.g. Cel, pH) and, where one exists, a resolvable QUDT unit IRI (unit:DEG_C) — CQ-19's "no bare numbers" made concrete.
  • B2MML — the MESA XML serialization of ISA-88/ISA-95 that MES tools use to exchange recipes and batch records; mapped so <Equipment> becomes occursIn on the run, not a second type on the batch.
  • ASM (Allotrope Simple Model) — Allotrope's lightweight JSON-LD result format whose @context resolves to the same IRIs as the Turtle, so a light file yields the same bp:SEC-Result-001 as a full ADF.
  • Index versus payload — the boundary the loaders protect: the graph holds the number and a hasTrace / hasChromatogram pointer; the dense stream and curve stay in PI or an ADF/AnIML file.
  • Method / assay / result — a plan, an occurrent, and an information artifact respectively; three nodes, not one fused "test," so one validated method can be run as many assays yielding many results, with full release provenance.
  • Sample — the specimen pulled from a lot, modeled derivedFrom the batch, so a result about the sample is transitively evidence about the batch — the bridge that rejoins an analytical number to the genealogy.
  • Release panel — the set of CQA results on one specimen that gate a drug-substance lot (here SEC monomer, CEX charge variants, HCP impurity), each isAbout the same sample so the gate is a traversal, not a clerical reconciliation.

Where this leads

The graph is fed: four plant standards land as the same typed, unit-carrying, index-respecting nodes. Now we ask what the graph can answer. The next chapter, Competency Questions as Queries, turns the requirements catalog into the SPARQL and reasoner checks that run against this loaded graph — lineage to the root, impact across a campaign, the trajectory of a quality attribute — proving that the model engineered across Parts I–V actually answers the questions it was built for.