A Canonical Bioreactor Information Model: Companion Specs, NodeSet2, and Semantic Alignment
📍 Where we are: Part I · The Blueprint — Chapter 6. Chapter 5 gave every signal a canonical name; this chapter gives the bioreactor a canonical model — a platform-agnostic, device-modular OPC UA information model, typed against companion specifications and aligned to a shared ontology, so two vendors' servers can expose the same meaning, not just the same protocol.
Two bioreactor vendors can both "speak OPC UA" and still be unable to talk to each other — like two people who both speak English but file the same fact under different headings in different drawers. One server publishes the antibody concentration as Titer in grams per litre under a Process folder; the other calls it ProductConc in milligrams per millilitre under an Analytics tree. Your code that read the first one breaks on the second. The fix is to agree, up front, on one shape: this is a bioreactor, these are its devices, each device exposes these signals, with these datatypes and these units, and here is the dictionary word each one means. That agreed shape is a canonical information model, and the agreements that make it portable across vendors are companion specifications. This chapter designs one for our bioreactor — modular by device, typed, unit-bearing, and tied to a shared vocabulary — and is honest that for the bioreactor itself, the industry agreement does not fully exist yet.
What this chapter covers
Chapter 5 solved naming — every signal has a disciplined name like BR101.Titer.PV. But a name is a label on a box; a model says what is inside the box, what type it is, what unit it carries, and what it means. This chapter builds that model. We cover:
- Why standards-compliance is not interoperability — two OPC UA servers can both conform and still expose the same quantity at incompatible address-space paths, datatypes, and unit conventions.
- A canonical model, modularized by device — the bioreactor as a set of reusable device modules (in-situ probe rack, cell-culture analyzer, Raman PAT) rather than one flat tag list, because a distributed analyzer is a different device from the probe rack.
- Typing against companion specifications — OPC UA DI for the probes, the Analyzer Devices companion spec (OPC 10020) for the at-line analyzer, and LADS (OPC 30500) for the Raman/lab device — and the honest gap that the bioreactor itself has no published companion spec.
- Units and semantic alignment — every node carrying a UCUM unit and an IOF/Allotrope-anchored ontology IRI, so meaning travels with the value.
- NodeSet2 as the model-as-a-file — the standard XML serialization a server imports — built and validated in examples/chapters/06-bioreactor-information-model/bioreactor_model.py.
Standards-compliant is not the same as interoperable
It is tempting to think that once every device "speaks OPC UA," the integration problem is solved. It is not, and the reason is the crux of this chapter. The OPC UA transport (the address space, the security handshake, the subscription model from the connectivity chapter) moves bytes reliably and in a self-describing form. What it does not do, on its own, is make two vendors agree on what shape those bytes take. Vendor A exposes the antibody concentration as a node named Titer, a Double in g/L, under a Process object; vendor B exposes the same physical quantity as ProductConc, a Float in mg/mL, under an Analytics folder, with a different NodeId path. The two units happen to be numerically identical (1 mg/mL = 1 g/L), yet the unit strings, datatypes, names, and paths all differ. Both servers are fully OPC UA-compliant. A collector written against A's address space silently fails against B's, or worse, silently misreads it.
This is the difference between syntactic interoperability (we agree on the protocol and the data format) and semantic interoperability (we agree on what the data means) that the data-management book names directly. Transport is solved; shape and meaning are not. The remedy is to stop letting each vendor invent its own address space and instead agree on a canonical information model: one published, vendor-neutral description of what a bioreactor is in OPC UA terms — which objects, which variables, which datatypes, which units — that every conforming server is expected to expose the same way. The NIIMBL Big-Data interoperability work is exactly this: producing a platform-agnostic bioreactor data model so that an integration written once works against any conforming device, rather than being hand-mapped per vendor [1].
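The mismatch can be made concrete with a small sketch. The two dictionaries below are hypothetical stand-ins for what a client might read from each vendor's address space; the paths, values, and the `CANONICAL_MAP` table are illustrative assumptions, not taken from any real server. The point is only that a collector keyed to vendor A's path finds nothing at vendor B, and that a hand-written mapping per vendor is the work a shared model is meant to remove.

```python
from typing import Optional

# Hypothetical snapshots of two vendors' address spaces: path -> (value, datatype, unit).
VENDOR_A = {"Process/Titer": (2.4, "Double", "g/L")}
VENDOR_B = {"Analytics/ProductConc": (2.4, "Float", "mg/mL")}

# Hand-written mapping onto one canonical signal (illustrative only).
# Each entry: vendor path -> (canonical name, factor that converts the value into g/L).
CANONICAL_MAP = {
    "Process/Titer": ("Titer", 1.0),
    "Analytics/ProductConc": ("Titer", 1.0),  # 1 mg/mL == 1 g/L
}


def naive_read(address_space: dict) -> Optional[float]:
    """A collector hard-coded against vendor A's path."""
    entry = address_space.get("Process/Titer")
    return entry[0] if entry else None


def canonical_read(address_space: dict) -> dict:
    """Translate whatever the vendor exposes into canonical name -> (value, unit)."""
    out = {}
    for path, (value, _datatype, _unit) in address_space.items():
        if path not in CANONICAL_MAP:
            continue  # unknown node: skip rather than guess its meaning
        name, factor = CANONICAL_MAP[path]
        out[name] = (float(value) * factor, "g/L")
    return out


print(naive_read(VENDOR_A))      # 2.4
print(naive_read(VENDOR_B))      # None: same quantity, different path
print(canonical_read(VENDOR_B))  # {'Titer': (2.4, 'g/L')}
```

Every new vendor adds rows to a table like `CANONICAL_MAP`. A canonical information model moves that agreement into the servers themselves, so the table is never needed.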
A model modularized by device
The first design decision is the one the RFP work makes explicit and that distinguishes a real model from a tag dump: modularize by device. A production bioreactor is not one monolithic instrument. It is an in-situ probe rack (temperature, pH, dissolved oxygen, agitation, pressure) physically integrated with the vessel; plus a separate, often distributed, cell-culture analyzer that draws samples and reports viable-cell density, glucose, lactate, and titer; plus inline PAT instruments such as a Raman probe with its soft-sensor outputs. These are different devices, frequently from different vendors, connected and replaced independently. A canonical model that lumps them into one flat list cannot represent that a site swapped its analyzer for a different model, or that the Raman probe is optional.
So the model is a set of reusable device modules, each a self-contained object type with its own variables, that a server composes. The companion module bioreactor_model.py builds exactly three:
- in-situ-probes — the standard probes, modeled as an OPC UA DI device whose variables are typed as AnalogItemType (the OPC UA Data Access variable type that already carries an engineering-unit range).
- cell-culture-analyzer — the at-line analyzer, typed against the Analyzer Devices companion specification.
- raman-pat — the Raman probe and its soft-sensor outputs, typed against LADS.
Modularization is what makes the model match the plant's real connect-and-replace topology, and it is the unit at which conformance is checked: a vendor's analyzer either conforms to the cell-culture-analyzer module or it does not, independently of the probe rack.
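One way to picture composition by module is the sketch below. The `Signal` and `DeviceModule` dataclasses, the `REQUIRED` table, and the `conforms` helper are hypothetical names introduced for illustration; the companion bioreactor_model.py may structure its data differently. Signal names and units are copied from the module listing in this chapter. The sketch shows the two properties argued for here: a bioreactor is a composition of independent modules, and conformance is decided per module.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Signal:
    name: str
    ucum: str


@dataclass
class DeviceModule:
    name: str
    conforms_to: str
    signals: list = field(default_factory=list)


# Signals each module is expected to expose (hypothetical conformance rule).
REQUIRED = {
    "in-situ-probes": {"Temperature", "pH", "DissolvedO2", "Agitation", "Pressure"},
    "cell-culture-analyzer": {"ViableCellDensity", "Glucose", "Lactate", "Titer"},
    "raman-pat": {"RamanSpectrum", "GlucoseSoftSensor"},
}


def conforms(module: DeviceModule) -> bool:
    """A module conforms if every required signal is present and carries a unit."""
    exposed = {s.name for s in module.signals if s.ucum}
    return REQUIRED[module.name] <= exposed


probes = DeviceModule("in-situ-probes", "OPC UA DI", [
    Signal("Temperature", "Cel"), Signal("pH", "[pH]"), Signal("DissolvedO2", "%"),
    Signal("Agitation", "min-1"), Signal("Pressure", "mmHg"),
])
analyzer = DeviceModule("cell-culture-analyzer", "OPC 10020 Analyzer Devices", [
    Signal("ViableCellDensity", "10*6.{cells}/mL"), Signal("Glucose", "g/L"),
    Signal("Lactate", "g/L"), Signal("Titer", "g/L"),
])
# A replacement analyzer that reports fewer analytes: checked on its own.
swapped = DeviceModule("cell-culture-analyzer", "OPC 10020 Analyzer Devices", [
    Signal("Glucose", "g/L"), Signal("Lactate", "g/L"),
])

bioreactor = {"BR101": [probes, analyzer]}  # Raman module omitted: it is optional

for module in bioreactor["BR101"] + [swapped]:
    print(f"{module.name:24s} conforms={conforms(module)}")
```

Because the check is scoped to one module, swapping the analyzer changes one verdict and leaves the probe rack's untouched.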
Typing against companion specifications
A module is only portable if its types are published and shared, which is what an OPC UA companion specification is: a standardized, vendor-neutral set of object and variable types for a class of device, layered on the OPC UA base. Two relevant ones exist and the model uses both. LADS — the Laboratory and Analytical Device Standard, OPC 30500 — is the first OPC UA companion specification for laboratory and analytical devices, released by the OPC Foundation with SPECTARIS and VDMA; several biopharmaceutical-instrument vendors contributed to it, which is why it is the natural type for the Raman/lab module [2]. The Analyzer Devices companion specification, OPC 10020, standardizes the address-space shape of analyzer instruments — channels, results, and their metadata — and is the natural type for the cell-culture analyzer module [3]. Typing a node against AnalyserChannelType or a LADS FunctionalUnitType rather than a bare BaseDataVariableType is what lets a generic client browse any conforming analyzer and know where the results are.
Here is the honest gap, and the chapter owes it to you plainly: there is no published OPC UA companion specification for a mammalian-cell bioreactor itself. LADS covers the lab and analytical instruments around it; the Analyzer Devices spec covers analyzers; PA-DIM covers process-automation transmitters; MTP covers modular skids — but the CHO fed-batch bioreactor at the center, with its headline signals (viable-cell density, online glucose, titer), has no standard model to conform to [1]. That absence is why this chapter is "design a canonical model" rather than "import the standard one," and it is precisely the gap the NIIMBL bioreactor-data-schema work aims to fill. Where a companion spec exists (the analyzer, the lab device, the process transmitter), conform to it; where it does not (the bioreactor proper), invent a model with the same discipline — modular, typed, unit-bearing, semantically aligned — so it is at least rigorously self-consistent and ready to align to a standard when one arrives.
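The conform-where-possible rule can be written down as a lookup with an explicit hole in it. The table below is only a sketch: the device-class keys are hypothetical labels, and the values are the specifications as this chapter names them. What matters is that the bioreactor entry is absent by design rather than by oversight.

```python
from typing import Optional

# Device class -> the specification this chapter says covers it.
COMPANION_SPECS = {
    "in-situ-probe": "OPC UA DI",
    "analyzer": "OPC 10020 Analyzer Devices",
    "lab-device": "OPC 30500 LADS",
    "process-transmitter": "PA-DIM",
    "modular-skid": "MTP",
}


def companion_spec_for(device_class: str) -> Optional[str]:
    """Return the spec to conform to, or None when the model must be authored."""
    return COMPANION_SPECS.get(device_class)


def typing_strategy(device_class: str) -> str:
    spec = companion_spec_for(device_class)
    if spec is None:
        return f"{device_class}: no published spec, author a model to the same discipline"
    return f"{device_class}: conform to {spec}"


for device_class in ("analyzer", "lab-device", "bioreactor"):
    print(typing_strategy(device_class))
```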
Units and semantic alignment: making meaning travel
Two more fields on every node carry the meaning the address space alone would lose. The first is the engineering unit, expressed as a UCUM code (Unified Code for Units of Measure) such as Cel, g/L, [pH], or mm[Hg], and carried in the OPC UA EngineeringUnits property so the value is never a bare number. (Strict UCUM writes millimetres of mercury as mm[Hg]; the module's listing prints the common shorthand mmHg.) The unitless 0.40 that the legacy-skids chapter warns about is exactly the failure a canonical model forbids by construction: a node without a unit is not conformant.
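As a rough sketch of what "a node without a unit is not conformant" could look like in code, the record below mirrors the four fields of the OPC UA EUInformation structure (namespace URI, unit id, display name, description). The namespace URI, the placeholder unit id, and the small `KNOWN_UCUM` allow-list are assumptions made for illustration. A real deployment would validate against the UCUM grammar rather than a fixed list and would choose its own unit-id scheme.

```python
from dataclasses import dataclass

UCUM_NAMESPACE = "http://unitsofmeasure.org"  # assumed namespace URI for UCUM codes

# Tiny allow-list for this sketch only; real UCUM is a grammar, not a fixed list.
KNOWN_UCUM = {"Cel", "g/L", "mg/mL", "[pH]", "%", "1"}


@dataclass(frozen=True)
class EngineeringUnit:
    """Mirrors the fields of the OPC UA EUInformation structure."""
    namespace_uri: str
    unit_id: int
    display_name: str
    description: str


def make_unit(ucum_code: str, description: str = "") -> EngineeringUnit:
    """Build a unit record, refusing empty or unrecognised codes."""
    if not ucum_code:
        raise ValueError("node has no unit: a bare number is not conformant")
    if ucum_code not in KNOWN_UCUM:
        raise ValueError(f"unrecognised UCUM code: {ucum_code!r}")
    # unit_id = -1 is a placeholder; this sketch defines no numeric id scheme.
    return EngineeringUnit(UCUM_NAMESPACE, -1, ucum_code, description)


titer_unit = make_unit("g/L", "gram per litre")
print(titer_unit.display_name)

try:
    make_unit("")  # the bare-number case
except ValueError as err:
    print("rejected:", err)
```

Rejecting the empty code at construction time is the design choice worth noting: the unitless value cannot enter the model at all, so it never has to be caught downstream.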
The second is a semantic IRI — a globally unique identifier that says what the node is in a shared vocabulary, not just what one server calls it. Each node in the model carries an IRI into the running example's bioprocess ontology (bp:Temperature, bp:ProductTiter, afo:RamanSpectrum), aligned to the Industrial Ontologies Foundry (IOF) biopharma classes and the Allotrope Foundation Ontology (AFO) [4][5]. This is the bridge to the semantics and knowledge-graph chapter and to the companion ontology book: the same canonical model, read one way, is an OPC UA address space a server exposes; read another, its nodes are typed triples whose units are QUDT/UCUM IRIs and whose completeness a SHACL shape can gate. Naming (Chapter 5) gave the signal a consistent string; this model gives it a type, a unit, and a meaning that survive crossing a vendor boundary.
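The second reading of the model, as typed triples, can be sketched without any RDF library. The `ex:` subject prefix and the predicates `bp:hasUcumUnit` and `bp:hasSignal` are hypothetical names chosen for illustration, not vocabulary confirmed by the companion ontology book; only `rdf:type` is standard. The sketch shows that a node carrying a unit and an IRI converts mechanically into statements a graph can store.

```python
def node_to_triples(device: str, name: str, iri: str, ucum: str) -> list:
    """Lift one model node into (subject, predicate, object) triples."""
    subject = f"ex:{device}.{name}"
    return [
        (subject, "rdf:type", iri),                # what the node means
        (subject, "bp:hasUcumUnit", f'"{ucum}"'),  # hypothetical unit predicate
        (f"ex:{device}", "bp:hasSignal", subject), # hypothetical containment predicate
    ]


triples = node_to_triples("BR101", "Titer", "bp:ProductTiter", "g/L")
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```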
NodeSet2: the model as a file
A canonical model is not an idea; it is a file a server imports. The OPC UA standard serialization is NodeSet2 XML (the UANodeSet schema) — every object, variable, datatype, and reference written out so any conforming server can load the model and expose it. The module examples/chapters/06-bioreactor-information-model/bioreactor_model.py builds the model as data and emits the NodeSet2 fragment for a node, so the abstract "agree on a shape" becomes a concrete artifact. The validator is the model's acceptance criterion — every node must be typed, unit-bearing, ranged, and semantically aligned:
# examples/chapters/06-bioreactor-information-model/bioreactor_model.py (excerpt)
def validate(node: Node) -> list[str]:
    """A canonical-model node must be typed, unit-bearing, ranged, and semantically aligned."""
    bad = []
    if not node.datatype: bad.append("no-datatype")
    if not node.base_type: bad.append("no-companion-type")
    if not node.ucum: bad.append("no-unit")  # the bare-number failure
    if not (node.lo < node.hi): bad.append("bad-range")
    if ":" not in node.iri: bad.append("no-ontology-iri")  # home-namespace-only
    return bad
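To see the validator reject a node, the sketch below pairs it with a minimal stand-in for `Node`. The dataclass fields are inferred from the attributes the excerpt reads (`datatype`, `base_type`, `ucum`, `lo`, `hi`, `iri`), plus a `name` added for readability; the real class in bioreactor_model.py may carry more. `validate` is repeated, with the same checks, so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Node:  # hypothetical stand-in for the module's Node class
    name: str
    datatype: str
    base_type: str
    ucum: str
    lo: float
    hi: float
    iri: str


def validate(node: Node) -> list:
    """Same checks as the excerpt: typed, unit-bearing, ranged, aligned."""
    bad = []
    if not node.datatype:
        bad.append("no-datatype")
    if not node.base_type:
        bad.append("no-companion-type")
    if not node.ucum:
        bad.append("no-unit")
    if not (node.lo < node.hi):
        bad.append("bad-range")
    if ":" not in node.iri:
        bad.append("no-ontology-iri")
    return bad


good = Node("Titer", "Double", "AnalyserChannelType", "g/L", 0.0, 10.0, "bp:ProductTiter")
bare = Node("Titer", "Double", "", "", 0.0, 0.0, "ProductTiter")

print(validate(good))  # []
print(validate(bare))  # ['no-companion-type', 'no-unit', 'bad-range', 'no-ontology-iri']
```

The second node is what a raw tag dump typically offers: a name, a datatype, and nothing else. The validator turns each missing piece into a named, countable defect.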
Running python bioreactor_model.py prints the model, the NodeSet2 fragment, and the conformance check, verbatim:
canonical bioreactor information model — platform-agnostic, modular, typed, aligned
device modules: 3 variable nodes: 11
-- module 'in-situ-probes' -> conforms to OPC UA DI (Devices) — AnalogItemType
Temperature Double unit=Cel bp:Temperature
pH Double unit=[pH] bp:pH
DissolvedO2 Double unit=% bp:DissolvedOxygen
Agitation Double unit=min-1 bp:AgitationRate
Pressure Double unit=mmHg bp:HeadspacePressure
-- module 'cell-culture-analyzer' -> conforms to OPC 10020 Analyzer Devices
ViableCellDensity Double unit=10*6.{cells}/mL bp:ViableCellDensity
Glucose Double unit=g/L bp:GlucoseConcentration
Lactate Double unit=g/L bp:LactateConcentration
Titer Double unit=g/L bp:ProductTiter
-- module 'raman-pat' -> conforms to OPC 30500 LADS (Laboratory & Analytical Device Standard)
RamanSpectrum Float[] unit=1 afo:RamanSpectrum
GlucoseSoftSensor Double unit=g/L bp:GlucoseConcentration
-- NodeSet2 serialization (one node, the standard XML a server imports) --
<UAVariable BrowseName="1:Titer" DataType="Double" ParentNodeId="ns=1;s=BR101.cell-culture-analyzer">
<DisplayName>Titer</DisplayName>
<References>
<Reference ReferenceType="HasTypeDefinition">AnalyserChannelType</Reference>
</References>
<!-- EngineeringUnits: UCUM "g/L"; NormalRange: 0.0..10.0; semantic IRI: bp:ProductTiter -->
</UAVariable>
-- MODEL CONFORMANCE (every node modular, typed, unit-bearing, aligned) --
nodes conforming: 11/11
every node carries a UCUM unit: True
every node aligned to an ontology IRI: True
Read the NodeSet2 fragment as the whole chapter in seven lines. The node is named canonically (Titer), typed against a companion spec (AnalyserChannelType, the Analyzer Devices channel type), placed in a device module (BR101.cell-culture-analyzer), unit-bearing (UCUM g/L), ranged (its normal operating band), and aligned (bp:ProductTiter, an IRI into the shared ontology). A server that imports this exposes the titer the same way as any other conforming server; a client written once reads them all; and the same node lifts cleanly into the knowledge graph because its unit and meaning came along. That is the artifact the agentic connectivity PoC in the ML book maps real, messy vendor devices onto: the canonical target whose unit and type its Chain-of-Logic verifier checks every proposed mapping against.
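A fragment of this shape can be produced with the Python standard library alone. The sketch below is an assumption about how such an emitter might look, not the code in bioreactor_model.py, and `emit_variable` is a hypothetical name. It reproduces the simplified fragment from the listing, which leaves out details a complete UANodeSet file needs, such as a NodeId attribute on every node.

```python
import xml.etree.ElementTree as ET


def emit_variable(name, datatype, parent, type_def, ucum, lo, hi, iri) -> str:
    """Serialize one variable node as a simplified NodeSet2-style fragment."""
    var = ET.Element("UAVariable", {
        "BrowseName": f"1:{name}",
        "DataType": datatype,
        "ParentNodeId": parent,
    })
    ET.SubElement(var, "DisplayName").text = name
    refs = ET.SubElement(var, "References")
    ref = ET.SubElement(refs, "Reference", {"ReferenceType": "HasTypeDefinition"})
    ref.text = type_def
    # Unit, range, and meaning travel as a comment here, as in the listing.
    var.append(ET.Comment(
        f' EngineeringUnits: UCUM "{ucum}"; NormalRange: {lo}..{hi}; semantic IRI: {iri} '
    ))
    ET.indent(var)  # pretty-print; available from Python 3.9
    return ET.tostring(var, encoding="unicode")


fragment = emit_variable(
    "Titer", "Double", "ns=1;s=BR101.cell-culture-analyzer",
    "AnalyserChannelType", "g/L", 0.0, 10.0, "bp:ProductTiter",
)
print(fragment)
```

Building the tree with ElementTree rather than string formatting keeps attribute escaping correct if a name or IRI ever contains a reserved XML character.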
One canonical model, two readings: an OPC UA address space a server exposes (BR101 composing three companion-spec-typed device modules) and, because every node carries a UCUM unit and an IOF/Allotrope-aligned IRI, a set of typed triples that lift into the knowledge graph. The bioreactor itself has no published companion spec yet — so the model is authored to the same discipline, ready to align when one arrives.
Original diagram by the authors, created with AI assistance.
Why it matters
Naming made each signal consistent within our platform; a canonical information model is what makes it consistent across vendors — the difference between an integration you write once and one you hand-map for every new bioreactor a site installs. The cost of skipping it is the quiet, expensive normal of biomanufacturing today: every device speaks a standard protocol and still nobody agrees on the shape, so integration engineers spend their time mapping ProductConc in mg/mL to Titer in g/L, one vendor at a time, and a knowledge graph built on top inherits every inconsistency. The model in this chapter is the antidote, and it is also the precondition for everything more ambitious: a digital twin, a cross-site soft sensor, an agentic connectivity tool that drafts mappings — all of them assume a canonical target to map onto. Building that target, modular and typed and unit-bearing and semantically aligned, is unglamorous schema work, and it is the load-bearing work the rest of the data platform silently stands on.
In the real world
The companion-specification landscape is real, recent, and uneven — and the unevenness is the story. LADS (OPC 30500) is published (Release 1.0.1, January 2024) and was demonstrated end-to-end at a 2025 hackathon alongside the Allotrope Foundation Ontology, with biopharmaceutical-instrument vendors among its contributors [2]. The Analyzer Devices companion specification (OPC 10020) standardizes analyzer address spaces [3], and PA-DIM and MTP cover process transmitters and modular skids respectively. But the honest verdict the connectivity chapter already states holds here: there is no companion specification for a CHO fed-batch bioreactor or a Protein A capture step, so the headline analytes this whole book follows have no standard model to conform to, and most servers still expose vendor-specific address spaces [1]. On the semantic side, the IOF biopharma reference ontologies (released February 2026, with unit-operation and quality-parameter classes) and the Allotrope Foundation Ontology give the alignment target, with QUDT/UCUM for units [4][5]. The OSS-versus-commercial line is the book's recurring one: the NodeSet2 format, the published companion specs, and open-source tooling (asyncua loads a NodeSet) are all freely usable, and you own authoring a faithful, conformant model for the parts the standards do not yet cover — which, for the bioreactor itself, is still the center of the plant.
Key terms
- Information model — the published, vendor-neutral description of what a device is in OPC UA terms (its objects, variables, datatypes, references), as opposed to just a list of tag names; what makes two servers expose the same quantity the same way.
- Companion specification — a standardized, shared set of OPC UA object and variable types for a class of device, layered on the OPC UA base; conforming to one is what makes a module portable across vendors.
- OPC 10020 (Analyzer Devices) — the OPC UA companion specification standardizing analyzer instruments' address spaces (channels, results, metadata); the type the cell-culture-analyzer module conforms to.
- LADS (OPC 30500) — the Laboratory and Analytical Device Standard, the first OPC UA companion specification for lab/analytical devices; the type the Raman/PAT module conforms to.
- Modularization by device — designing the model as reusable per-device modules (probe rack, analyzer, Raman) rather than one flat tag list, so it matches the plant's real connect-and-replace topology and conformance is checked per device.
- NodeSet2 (UANodeSet) — the standard OPC UA XML serialization of an information model; the file a server imports to expose the canonical model.
- UCUM engineering unit — the Unified Code for Units of Measure code (Cel, g/L, [pH]) carried in the OPC UA EngineeringUnits property, so a value is never a bare number; a node without a unit is non-conformant.
- Semantic IRI / alignment — a globally unique identifier (e.g. bp:ProductTiter, aligned to IOF biopharma and the Allotrope Foundation Ontology) that says what a node means in a shared vocabulary, so the model lifts into the knowledge graph.
- The missing bioreactor companion spec — the honest gap: LADS, OPC 10020, PA-DIM, and MTP cover the lab/analyzer/transmitter/skid, but no published companion spec covers the CHO bioreactor itself, which is why this model is authored, not imported.
Where this leads
The bioreactor now has a canonical shape — modular by device, typed against the companion specs that exist, authored to the same discipline where they do not, unit-bearing, and aligned to a shared vocabulary. But a model is still a blueprint until something speaks it over a wire. The next chapter opens Part II, Speaking OT: OPC UA, MQTT, and Sparkplug B, and stands up the protocols that carry this model's nodes off the bioreactor and into the platform — turning the shape we designed here into live, secured, self-describing data.