The Reference Architecture: One Stack, Layer by Layer

📍 Where we are: the blueprint. Before we type a single command, we unfold the whole platform onto one page — every layer, every open-source tool, every ISA-95 level — and draw the line where open source stops and commercial systems begin.

The preface made you a promise: clone one repository and build a working bioprocess data platform, layer by layer, in open source. This chapter is the map of that platform. Think of it as the architectural drawing you pin to the wall before construction starts — the one every later chapter quietly points back to.

We will not run anything yet. Chapter 2 does that. Here we do the harder thing: we make sure you can see the shape of the whole stack — how a number born on a pH probe travels through a message bus, lands in a historian, gets stitched to a batch record, becomes a triple in a knowledge graph, and finally helps assemble a regulatory submission — before you build the first piece of it. (A triple is a single subject-predicate-object fact, and a knowledge graph is the web those facts form; we build one for real, aligned to industrial ontologies, in The Semantic Layer: a Knowledge Graph, so end-to-end lineage can be queried with SPARQL rather than reconstructed by hand.)

The simple version

Imagine a building with eight floors. The ground floor is the loading dock where raw goods (sensor readings) arrive. Each floor above adds meaning: one labels and routes the deliveries, one warehouses them, one cross-references every box against the order it belongs to, one lets inspectors trace any box back to its origin. The top floors are where the regulators visit. Our job in this chapter is to draw the floor plan, name the open-source tool that staffs each floor, and mark — in red ink — the floors where the free staff cannot pass inspection alone and you must hire a licensed professional.

What this chapter covers

The layered blueprint, bottom to top, and why the companion repo's build order follows it.
Each layer mapped to an ISA-95 / Purdue level and to the open-source tool we chose for it.
Where the honest open-source-software (OSS)↔commercial boundary falls, and why it falls there.
The one design decision that makes the whole stack joinable: the historian and the batch model living in the same database.
A component-and-license inventory, and the data-integrity question that recurs through every chapter.

The blueprint on one page

Every layer of the platform has the same job description: take the data from the layer below, add one kind of meaning, and hand it up. The table lists the layers in that order, so its first row is the bottom of the stack:

Layer	What it adds	ISA-95 / Purdue level	OSS tool in this book
Edge connectivity	a standard, self-describing way to read sensors	Level 0–2 (sensors, control)	OPC UA (via asyncua, the Python library)
Message bus	a named, real-time stream of every value	Level 2–3 boundary	MQTT + Sparkplug B (Mosquitto)
Historian / TSDB (time-series database)	durable, queryable time-series at scale	Level 3	TimescaleDB hypertable
Batch & equipment model	the context that makes a number mean something	Level 3	PostgreSQL (ISA-88/95)
Contextualization	the join: this value, this batch, this phase	Level 3	SQL views
Semantics	machine-traceable lineage across systems	Level 3–4	RDF / SPARQL (Apache Jena Fuseki)
Compliance / trust	the audit-trailed record of truth	Level 3–4	Postgres audit + hash chain
Analytics	prediction and process understanding	Level 3–4	Python (SPC = statistical process control charts, PLS soft-sensor)

If "Level 0–4", "Purdue", or "DCS" are new to you, read the table as a preview: the next subsection — Reading the ISA-95 ladder — defines the rungs.

The mapping to ISA-95 — the international standard (IEC 62264) for integrating enterprise systems (the business software that plans and tracks production) and control systems (the automation that physically runs the equipment), refreshed in its 2025 edition — is not decoration [1]. It tells you where each tool legitimately lives and, crucially, where the boundaries between them sit. The single most important boundary in this whole table is the one between Level 2 (the control system that actually runs the bioreactor — the vessel where the cells grow to make the drug) and Level 3 (everything we build in this book). The validated control system is sacred; we never write into it. We read from it.

That principle has a name. NAMUR — the European user association of automation technology — published the NAMUR Open Architecture (NOA) concept precisely for this: a second, read-mostly data channel that feeds monitoring, historization, and optimization without altering the validated core process control system [2]. Almost everything in this book lives on the NOA side of that line. When you hear us say "we never touch the DCS" (distributed control system — the computer system that automatically runs the plant's equipment in real time), NOA is the standard that blesses it.

Reading the ISA-95 ladder: where each tool legitimately lives

ISA-95 is a five-rung ladder, and the value of the rungs is that they pin each tool to a legitimate place rather than wherever it happened to be installed. Levels 0–2 are the validated world: the sensors and actuators (Level 0–1) and the basic and supervisory control that runs the process in real time (Level 2) — the DCS or PLC (programmable logic controller — a ruggedized industrial computer that drives individual machines), the part a regulator has qualified (formally proven, with documented evidence, to do exactly what it must) and that you must not perturb. Level 3 is manufacturing operations: historization, batch records, scheduling, dispatch — the layer where every open-source tool in this book lives. Level 4 is business planning and logistics, the ERP (enterprise resource planning) world. Reading the blueprint table against that ladder, the OPC UA edge straddles Level 0–2, the MQTT/Sparkplug bus sits on the Level 2–3 boundary, and everything from the historian upward is Level 3 and above [1]. The discipline the ladder enforces is simple: a tool may read from the level below, but the validated levels never take a write from above. Place a component one rung too low and you have quietly turned an analytics convenience into a change to a qualified system.

The Level 2/3 boundary: why we never write into the DCS (NOA)

Of the four boundaries on the ladder, the Level 2/3 line is the one this whole book is built to respect. Below it sits the validated control system; above it sits everything we are free to build. Writing across that line (pushing a value, a setpoint, or even a configuration change down into the DCS from a Level-3 tool) would put unvalidated code inside the validated boundary, and re-qualifying the control system is precisely the cost NOA exists to avoid. The NAMUR Open Architecture concept formalizes the escape: a second, read-mostly channel taps process data out of the control system for monitoring and optimization, while the validated core keeps its single, qualified path of control [2]. Every collector, historian, and view in this stack is a consumer on that second channel. The conduit itself is governed by the OT/IT security standard ISA/IEC 62443, where OT (operational technology) means the plant-floor systems that run the equipment and IT (information technology) means the business and data systems above them. That conduit is a firewall, a data diode (hardware that physically permits traffic in only one direction), or a one-way gateway, so that "read-mostly" is enforced by the network, not merely by good intentions [10]. When a later chapter shows code reading from a DCS mock, this is the rule it is honoring: data comes out of the control system, and nothing goes back in.
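The read-up, never-write-down rule can be stated as a small predicate. The sketch below is illustrative only: `may_read`, `may_write`, and `VALIDATED_MAX_LEVEL` are hypothetical names, not part of the companion repo, and in a real plant the rule is enforced by the network conduit rather than by application code.

```python
# Hypothetical sketch: the ISA-95 discipline "read from below, never write
# into the validated levels" expressed as two predicates.

VALIDATED_MAX_LEVEL = 2  # Levels 0-2 are the qualified control world


def may_read(consumer_level: int, source_level: int) -> bool:
    """A component may read from its own level or any level below it."""
    return source_level <= consumer_level


def may_write(writer_level: int, target_level: int) -> bool:
    """A component above the validated boundary never writes into it."""
    crosses_down = (
        writer_level > VALIDATED_MAX_LEVEL
        and target_level <= VALIDATED_MAX_LEVEL
    )
    return not crosses_down


# A Level-3 collector reading from the Level-2 control system is allowed.
collector_reads_dcs = may_read(consumer_level=3, source_level=2)
# A Level-3 tool pushing a setpoint down into Level 2 is refused.
tool_writes_setpoint = may_write(writer_level=3, target_level=2)
```

A check like this is useful as a design-review aid or a unit test over an architecture inventory; it does not replace a data diode or firewall.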

The reference architecture, bottom to top. Data flows upward: each layer adds one kind of meaning and hands it to the next. The unshaded layers are pure open source you build and run on a laptop; the shaded compliance band near the submission is the honest hybrid — the GxP last mile of validation, qualified signatures, and vendor accountability that open source alone does not deliver. Original diagram by the authors, created with AI assistance.

The same stack, as a dataflow

The table lists the layers in the order you build them, and the data travels through them in that same order, from the probe upward. Here is the same architecture as the journey of a single pH reading on bioreactor BR101:

A left-to-right dataflow of one pH reading on bioreactor BR101: pH probe to OPC UA server to Mosquitto broker to TimescaleDB hypertable; the ISA-88/95 batch model in PostgreSQL joins the historian by a plain line in one database; both feed a contextualization view that fans out to semantics, analytics, and an audit hash chain, with semantics and audit converging into the regulatory submission.

Notice that the historian and the batch model sit side by side and are connected by a plain line, not an arrow. That is the keystone of the design, and it deserves its own section.

One database for the historian and the batch model

In most facilities the historian and the batch/relational world are two different products from two different vendors, and joining them is a nightly ETL (extract-transform-load — a scheduled job that copies data out of one system, reshapes it, and loads it into another) job that someone babysits, fragile precisely because it runs after the fact, on a schedule, between two systems that were never designed to agree. We refuse that split. Look at how the companion repo defines its database, in examples/platform/compose/compose.yaml. This is a Docker Compose file — a single YAML file that declares each service (a container) by its pinned image, the ports it exposes, and the volumes it mounts; Chapter 2 runs it, so here just read it as the stack's manifest:

services:
  # --- core --------------------------------------------------------------
  postgres:
    # timescale/timescaledb IS PostgreSQL + TimescaleDB, so the historian
    # hypertable and the ISA-88/95 batch model live in one joinable database.
    image: timescale/timescaledb:2.17.2-pg17
    profiles: ["core"]
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-bioproc}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-bioproc}
      POSTGRES_DB: ${POSTGRES_DB:-bioproc}
    ports: ["5432:5432"]
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ../db:/docker-entrypoint-initdb.d:ro   # 00-60 schema files run on first init

TimescaleDB is not a separate database — it is a PostgreSQL extension. Its core abstraction, the hypertable, is an ordinary PostgreSQL table that is automatically partitioned by time into chunks, so high-rate sensor data behaves like any other relational table while staying fast at scale [3]. Because the historian is PostgreSQL, the time-series and the batch context are not two systems pretending to talk — they are two schemas in one database you can join with plain SQL. (A schema, in the database sense, is a named namespace that groups related tables inside one database — not a diagram.) The historian lives in schema ts; the batch model lives in schema s88. Here is the hypertable, from examples/platform/db/20-historian.sql:

CREATE TABLE ts.sensor_reading (
    ts       timestamptz      NOT NULL,
    tag      text             NOT NULL,
    value    double precision,
    unit     text,
    quality  smallint         NOT NULL DEFAULT 192,  -- legacy OPC DA: 192 Good, 64 Uncertain, 0 Bad
    batch_id text
);

SELECT create_hypertable('ts.sensor_reading', 'ts', chunk_time_interval => INTERVAL '1 day');

That quality column is small but load-bearing. OPC UA — the platform-independent, service-oriented protocol that carries data and its metadata from the sensor up through Level 3 [4] — attaches a quality status to every value, which the schema stores in the compact legacy OPC DA encoding: 192 for Good, 64 for Uncertain, 0 for Bad (the OPC-DA-vs-OPC-UA distinction is unpacked below). We preserve it all the way into the historian. When an inspector later asks whether a value was trustworthy at the moment it was recorded, the answer is a column, not a guess. This is the ALCOA+ attribute "Original" made concrete. ALCOA+ is the regulators' shorthand for the properties trustworthy data must have — Attributable, Legible, Contemporaneous, Original, and Accurate, plus Complete, Consistent, Enduring, and Available — and the data-integrity guidance that frames the rest of this book expects the original measurement and its context to survive intact [5].

Anatomy of a ts.sensor_reading row

The whole stack exists to carry one thing upward: a sensor reading. So it is worth slowing down and dissecting a single row of ts.sensor_reading field by field, because every later chapter writes into, reads from, or extends exactly this shape. The row below is the first dissolved-oxygen reading of the seeded golden batch, the industry's term for an exemplary reference run whose trajectory later batches are compared against. It comes from examples/datasets/fedbatch_timeseries_10min.sample.csv: the BR101.DO.PV row at batch start, 2026-01-05 00:00. It is a real value, not an illustration.

An identity card dissecting one row of the ts.sensor_reading historian table, field by field: ts (source timestamp), tag (the structured signal name), value, unit, a highlighted quality column carrying the legacy OPC DA quality code 192 for Good, and batch_id; a decode panel breaks the tag BR101.DO.PV into asset, measurement, and role. One row of ts.sensor_reading as it lands at Level 3: six columns, each carrying a different kind of meaning, before any batch context is joined. Original diagram by the authors, created with AI assistance.

Read the six columns as six promises the historian makes about every value. ts is the source timestamp — when the measurement happened on the floor, not when it reached the database — typed timestamptz so the offset is never ambiguous. tag is the signal's identity as text: BR101.DO.PV decodes to asset (BR101, the ISA-95 unit), measurement (DO, dissolved oxygen), and role (PV, the process value — the evidence — not .SP, the recipe setpoint). value is the number, 40.8224, stored as double precision. unit is %sat (percent of air saturation — the standard way dissolved oxygen is reported), the answer to "40.8224 of what"; a number without its unit is a defect waiting to happen. quality is the trust flag — the next section dissects it on its own. And batch_id is BATCH-2026-001, the single text column that lets this reading be stitched to the GMP (Good Manufacturing Practice — the regulated quality rules production must follow) batch record later, with no foreign-key ceremony at write time (a foreign key is the relational-database constraint that forces one row to reference an existing row in another table; skipping it "at write time" lets the historian record a reading fast, even before its batch record exists). Six columns, and not one of them is decoration: drop any one and a downstream question becomes unanswerable.
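The six columns and the three-part tag can be mirrored in a few lines of Python. This is a minimal sketch, not code from the companion repo: the `SensorReading` class and `decode_tag` helper are hypothetical names, and the helper assumes every tag follows the ASSET.MEASUREMENT.ROLE pattern of the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SensorReading:
    """One row of ts.sensor_reading; field names follow the table's columns."""
    ts: datetime                    # source timestamp, timezone-aware
    tag: str                        # e.g. "BR101.DO.PV"
    value: Optional[float]
    unit: Optional[str]
    quality: int = 192              # legacy OPC DA: 192 Good, 64 Uncertain, 0 Bad
    batch_id: Optional[str] = None


def decode_tag(tag: str) -> dict:
    """Split 'ASSET.MEASUREMENT.ROLE' into its parts (assumed convention)."""
    asset, measurement, role = tag.split(".")
    return {"asset": asset, "measurement": measurement, "role": role}


# The example row dissected in the text.
row = SensorReading(
    ts=datetime(2026, 1, 5, 0, 0, tzinfo=timezone.utc),
    tag="BR101.DO.PV",
    value=40.8224,
    unit="%sat",
    quality=192,
    batch_id="BATCH-2026-001",
)
parts = decode_tag(row.tag)
```

Freezing the dataclass is a deliberate choice in this sketch: a reading is evidence, so code that handles it should not be able to mutate it in place.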

Back along the trilogy spine

This row is the end of a chain the other two books begin. The BR101.DO.PV reading is born on a physical probe in a stirred bioreactor — the step Book 1 walks through in the production bioreactor. Book 2 then names what that measurement is: the data shadow the plant casts of itself, and the open challenge of giving it trustworthy context. What you see dissected above is the open-source code that finally implements that data-point — and the same ts.sensor_reading shape is what Book 2's real-time release vision reads back out.

The quality column, field by field

Of the six columns, quality is the one that earns its place by being almost invisible until it matters. Its declaration is quality smallint NOT NULL DEFAULT 192, and that single line encodes a data-integrity decision. The three values are not arbitrary: they are the classic OPC DA quality codes — 192 (0xC0) is Good, 64 is Uncertain, 0 is Bad — the legacy 8-bit encoding the schema comment above names, which the historian stores as a compact smallint [4]. OPC DA (Classic) is the older, Windows-bound predecessor of the modern OPC UA protocol used everywhere else in this stack; its terse quality byte is convenient to store, so we keep it as the column's default. They are not the OPC UA StatusCode values: a native OPC UA Good StatusCode is the 32-bit 0x00000000, with the severity living in its top two bits. We store the compact OPC DA smallint and map it to the full OPC UA StatusCode at the protocol boundary. The companion repo's OPC UA server in examples/chapters/05-connectivity-opcua-mqtt/opcua_server.py maps the historian's smallint to the protocol's StatusCode and back, so a value's trustworthiness survives the round trip from probe to database to subscriber — quality flows through, end to end, rather than being re-asserted as Good at each hop.
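The mapping at the protocol boundary can be sketched as follows. This is an illustration under stated assumptions, not the code in opcua_server.py: it translates only the severity (Good, Uncertain, Bad), it discards any OPC UA substatus, and it treats the reserved bit patterns on both sides as Bad. The function names are hypothetical.

```python
# OPC UA StatusCode: severity lives in the top two bits of the 32-bit code.
#   00 = Good, 01 = Uncertain, 10 = Bad
UA_GOOD, UA_UNCERTAIN, UA_BAD = 0x00000000, 0x40000000, 0x80000000

# Legacy OPC DA quality byte: the two major quality bits are the top two.
#   11 = Good (192), 01 = Uncertain (64), 00 = Bad (0)
DA_GOOD, DA_UNCERTAIN, DA_BAD = 192, 64, 0


def da_to_ua(da_quality: int) -> int:
    """Map a stored OPC DA quality value to a bare OPC UA severity code."""
    major = da_quality & 0xC0  # keep only the two major quality bits
    if major == DA_GOOD:
        return UA_GOOD
    if major == DA_UNCERTAIN:
        return UA_UNCERTAIN
    return UA_BAD  # assumption: the reserved 0x80 pattern is treated as Bad


def ua_to_da(status_code: int) -> int:
    """Collapse an OPC UA StatusCode to the compact smallint the table stores."""
    severity = (status_code >> 30) & 0b11
    if severity == 0:
        return DA_GOOD
    if severity == 1:
        return DA_UNCERTAIN
    return DA_BAD  # assumption: the reserved severity 0b11 is treated as Bad
```

Because the substatus is dropped on the way in, this sketch is lossy in one direction; a design that must preserve the full StatusCode would need a wider column beside the smallint.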

Why store it as a column at all, rather than dropping every non-Good point on the floor? Because which points were uncertain, and when, is itself regulated evidence. The ALCOA+ attribute "Original" requires that the first capture of a measurement — including its qualifying metadata — survive intact, and the MHRA (the UK Medicines and Healthcare products Regulatory Agency) data-integrity guidance is explicit that the meaning and context of data must be preserved through any processing [5]. A historian that silently averages an Uncertain reading into a phase mean has destroyed Original data; a historian that keeps quality beside value lets an analytics layer choose to exclude it, and lets an inspector see that it was excluded. The column is small; the obligation it discharges is not. In the golden batch this is not hypothetical: a brief day-7 temperature excursion (the setpoint dips to 36.5 °C on 2026-01-12) flags the dissolved-oxygen readings Uncertain (quality 64) for a few hours, and those exact points survive in the historian for an analyst to exclude or an inspector to see.
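Keeping quality beside value lets the analytics layer exclude a point visibly instead of silently. The sketch below shows one way to do that; `phase_mean` is a hypothetical helper, and the four readings are made-up numbers, not values from the golden batch.

```python
from statistics import fmean

GOOD = 192  # legacy OPC DA code for Good, as stored in the quality column


def phase_mean(readings):
    """Return (mean of Good values, list of excluded readings).

    `readings` is an iterable of (value, quality) pairs. Nothing is
    deleted: excluded points are returned so they can be reported.
    """
    readings = list(readings)
    kept = [v for v, q in readings if q == GOOD and v is not None]
    excluded = [(v, q) for v, q in readings if q != GOOD or v is None]
    mean = fmean(kept) if kept else None
    return mean, excluded


# Illustrative values only: three Good points and one Uncertain point.
sample = [(40.0, 192), (42.0, 192), (55.0, 64), (41.0, 192)]
mean, excluded = phase_mean(sample)
```

The function returns the excluded points rather than discarding them, so a report can state both the mean and exactly which readings were left out of it.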

The layer that turns numbers into knowledge

A row in ts.sensor_reading says BR101.DO.PV = 41.3 %sat at 2026-01-05T08:00:00Z. True, but nearly useless on its own. Was the batch growing (building up cells) or producing (making the antibody)? Was this DO reading inside the controlled band for that phase? Answering needs the batch model. The contextualization layer is the join that supplies it, defined once in examples/platform/db/60-views.sql:

-- A reading with its full batch + phase context.
CREATE OR REPLACE VIEW s88.v_batch_sensor AS
SELECT r.ts, r.tag, r.value, r.unit, r.quality, r.batch_id,
       b.product_id, b.recipe_id, b.unit_id,
       bp.phase_id, ph.name AS phase_name
FROM ts.sensor_reading r
JOIN s88.batch b              ON b.batch_id = r.batch_id
LEFT JOIN s88.batch_phase bp  ON bp.batch_id = r.batch_id
     AND r.ts >= bp.start_ts AND (bp.end_ts IS NULL OR r.ts < bp.end_ts)
LEFT JOIN s88.phase ph        ON ph.phase_id = bp.phase_id;

This single view is the architectural payoff of keeping everything in one database. It joins a time-series reading (ts.sensor_reading) to the batch it belonged to (s88.batch) and to the ISA-88 phase that was active at that instant (s88.batch_phase → s88.phase). The batch model itself follows the combined hierarchy — the ISA-95 equipment model (enterprise → site → area → unit) and the ISA-88 procedural/recipe model (recipe → operation → phase), the two standards meeting at the unit — modeled in examples/platform/db/10-isa88-95.sql. (ISA-95 governs the physical equipment ladder; ISA-88 governs how a recipe is procedurally structured, which is why the book pairs them.) The repo seeds it with a concrete fed-batch CHO (Chinese hamster ovary, the workhorse host cell line for antibody production) line — fed-batch meaning the culture is periodically fed nutrient to extend its productive life — in examples/platform/db/seed/seed_cho_line.sql:

INSERT INTO s88.unit VALUES
    ('BR101',         'UPSTREAM',   'Production Bioreactor 101', 'bioreactor',     'Sartorius', 'Biostat STR 50'),
    ('N1SEED',        'UPSTREAM',   'N-1 Seed Bioreactor',       'bioreactor',     'Sartorius', 'Biostat STR 10'),
    ('PA01',          'DOWNSTREAM', 'Protein A Capture Skid',    'chromatography', 'Cytiva',    'AKTA process'),
    ('TFF01',         'DOWNSTREAM', 'UF/DF Skid',                'tff',            'Cytiva',    'AKTA flux'),
    ('FILL-LINE-01',  'FILL',       'Aseptic Fill Line',         'fill_line',      'Bausch+Stroebel', 'KSF');

That is our running case in data: one production bioreactor, an N-1 seed, a Protein A capture skid, a UF/DF skid, and a fill line. Together they form the fed-batch CHO + Protein A monoclonal-antibody process the whole trilogy follows. PA01 is the Protein A affinity-capture step that binds the antibody out of clarified harvest, the cell-free liquid left after the cells are filtered out. The repo's protein_a_summary seeds 92% recovery at ~58 g/L dynamic binding capacity, the resin's antibody-holding limit. TFF01 is the tangential-flow ultrafiltration/diafiltration (UF/DF) step that concentrates and buffer-exchanges the eluate (the antibody-bearing liquid released from the capture column).

The data shapes carry the unit-op physics with them. A Protein A cycle is a bind-wash-elute-strip sequence, so PA01's signals are the chromatogram's UV A280 trace, conductivity, and pH against column volumes. Dynamic binding capacity is a loading limit (the grams of antibody one litre of resin holds before it breaks through), which is why it lives beside recovery rather than as a sensor tag. TFF01's signals are transmembrane pressure, crossflow rate, and the diavolumes of buffer exchanged. The same downstream chemistry produces the very release CQAs (critical quality attributes) the analytics and semantics layers later gate on: monomer purity, HMW (high-molecular-weight) aggregate, and host-cell protein. The data model that captures these steps is therefore also what makes a downstream deviation traceable to its cause.

Book 1 walks through the mechanism of each step (Protein A capture, viral inactivation and filtration, polishing, and UF/DF); here we only carry their data. A perfusion / multi-column continuous variant of the same line appears as a sidebar in later chapters; the same relational schema is reused, with the continuous case adding perfusion-specific signals (perfusion rate in vessel volumes per day, cell-specific perfusion rate, cell-bleed rate) on top of the shared tables.

Once the contextualization view exists, a question that used to be an archaeology project ("what was dissolved oxygen doing, by phase, for the golden batch?") becomes the one-line query make contextualize runs against s88.v_batch_sensor.

Anatomy of a contextualized reading

Hold the row from the previous section in your head, then look at what the join makes of it. The view does not rewrite the reading; it inherits all six columns unchanged and appends five new ones, each pulled from the batch model. The card below dissects the result — the same BR101.DO.PV reading, now carrying its product, recipe, equipment, and phase.

The same reading after s88.v_batch_sensor: the historian's six columns travel through untouched, and one SQL join — not a nightly ETL pipeline — appends the five fields that turn a number into knowledge. Original diagram by the authors, created with AI assistance.

The five appended fields each answer a question the bare row could not. product_id (MAB-001), recipe_id (CHO-MAB-001), and unit_id (BR101) come from a plain JOIN s88.batch b ON b.batch_id = r.batch_id: they say what this lot makes, which ISA-88 recipe governed it, and which ISA-95 unit it ran on. The last two, phase_id and phase_name, are subtler: they come from a LEFT JOIN s88.batch_phase that brackets the reading's ts against each phase's half-open window (start_ts inclusive, end_ts exclusive), then a LEFT JOIN s88.phase to resolve the human name. For our example row stamped 2026-01-05 00:00:00+00, that temporal join lands inside the Inoculate window (PH1), so the same reading that was phase-blind in the historian now knows it was taken during inoculation. The LEFT joins are deliberate: a reading that arrives before its phase windows are seeded still appears, with phase fields null, rather than vanishing from the view. The join to s88.batch is an inner JOIN, however, so a reading whose batch_id is null or not yet present in s88.batch stays out of this view until its batch row exists.

The architectural point is the one the whole chapter turns on. The second card is the first card plus the meaning the join adds — and that "plus" is a CREATE VIEW, evaluated at query time, because the historian and the batch model are two schemas in one database. In the more common two-product world, that same enrichment is a scheduled ETL job copying time-series rows into a relational store (or vice versa), with its own failure modes, latency, and reconciliation burden. Keeping one database collapses the pipeline into a join, and the join is the keystone payoff of the blueprint.
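The join can be tried without the full stack. The sketch below rebuilds the view's logic in an in-memory SQLite database using Python's standard library. It is an approximation under stated assumptions: SQLite has no ts and s88 schemas here, so the tables are unqualified; timestamps are stored as ISO-8601 text, which compares correctly only because every value uses the same format; and the end of the Inoculate window (06:00) is a made-up value for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sensor_reading (ts TEXT, tag TEXT, value REAL, unit TEXT,
                             quality INTEGER, batch_id TEXT);
CREATE TABLE batch (batch_id TEXT, product_id TEXT, recipe_id TEXT, unit_id TEXT);
CREATE TABLE batch_phase (batch_id TEXT, phase_id TEXT, start_ts TEXT, end_ts TEXT);
CREATE TABLE phase (phase_id TEXT, name TEXT);

-- Same join shape as s88.v_batch_sensor, minus the schema prefixes.
CREATE VIEW v_batch_sensor AS
SELECT r.ts, r.tag, r.value, r.unit, r.quality, r.batch_id,
       b.product_id, b.recipe_id, b.unit_id,
       bp.phase_id, ph.name AS phase_name
FROM sensor_reading r
JOIN batch b             ON b.batch_id = r.batch_id
LEFT JOIN batch_phase bp ON bp.batch_id = r.batch_id
     AND r.ts >= bp.start_ts AND (bp.end_ts IS NULL OR r.ts < bp.end_ts)
LEFT JOIN phase ph       ON ph.phase_id = bp.phase_id;
""")

con.execute("INSERT INTO batch VALUES "
            "('BATCH-2026-001', 'MAB-001', 'CHO-MAB-001', 'BR101')")
con.execute("INSERT INTO phase VALUES ('PH1', 'Inoculate')")
# Illustrative window: only PH1 is seeded, and its end time is invented.
con.execute("INSERT INTO batch_phase VALUES "
            "('BATCH-2026-001', 'PH1', "
            "'2026-01-05T00:00:00Z', '2026-01-05T06:00:00Z')")
con.executemany(
    "INSERT INTO sensor_reading VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2026-01-05T00:00:00Z", "BR101.DO.PV", 40.8224, "%sat", 192,
         "BATCH-2026-001"),
        ("2026-01-05T08:00:00Z", "BR101.DO.PV", 41.3, "%sat", 192,
         "BATCH-2026-001"),
    ],
)

rows = con.execute(
    "SELECT ts, value, product_id, phase_name "
    "FROM v_batch_sensor ORDER BY ts"
).fetchall()
```

The first reading picks up the Inoculate phase. The second falls outside every seeded window, so it still appears with a null phase_name, which is the LEFT JOIN behavior the view relies on.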

The same row, one rung up: a triple, a shape, a competency question

The contextualized row is also the natural hand-off to the semantics layer, and it is worth seeing the bridge in miniature before the semantics chapter builds it in full. Each row of s88.v_batch_sensor is, viewed semantically, a small bundle of RDF triples sharing one subject — the reading — so the relational (tag, value, unit, quality, batch_id, phase_id) tuple maps almost field-for-field onto subject-predicate-object facts:

# The same contextualized reading, expressed as triples (illustrative).
bp:reading-BR101-DO-20260105T0000  bp:onTag       "BR101.DO.PV" ;
                                    bp:value       40.8224 ;
                                    bp:unit        unit:PERCENT ;   # QUDT
                                    bp:quality     "Good" ;
                                    bp:fromBatch   bp:BATCH-2026-001 ;
                                    bp:duringPhase bp:PH1-Inoculate .

The unit column becomes a QUDT-typed quantity (QUDT is the Quantities, Units, Dimensions, and Types ontology), the quality flag becomes a first-class fact, and batch_id/phase_id become edges the graph can walk: the difference between a foreign key and a relation you can traverse, unpacked in classes and taxonomy and relations and genealogy. Two ontology disciplines then ride along with the bridge. First, quality's NOT NULL DEFAULT 192 and the presence of a unit are closed-world completeness obligations, which is exactly the job of a SHACL shape (the Shapes Constraint Language, which validates that a graph carries the structure it must). A shape with sh:minCount 1 on quality is the graph-side mirror of the schema's NOT NULL; the same constraint on unit is stricter than the table shown earlier, which leaves unit nullable. The release-gate-and-SHACL chapter turns such shapes into a release decision. Second, the contextualization view answers, in SQL, a competency question the graph answers in SPARQL ("what was dissolved oxygen, by phase, for the golden batch?"), and a single such question being answerable both ways is the cheapest possible check that the relational model and the future graph agree on what a reading is. The deeper point, made honest by the identifiers-and-units chapter: BATCH-2026-001 is only a global identifier if everyone resolves it the same way, which is why the same text key that joins two PostgreSQL schemas becomes an IRI that joins two organizations.
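The row-to-triples bridge and the completeness check can be imitated in plain Python before any RDF tooling is involved. The sketch below is not SHACL and does not use an RDF library: `row_to_triples` and `missing_required` are hypothetical helpers, the subject string is an invented naming scheme, and the required-predicate tuple stands in for sh:minCount 1 on unit and quality.

```python
# Predicates that must appear at least once per reading (stand-in for
# sh:minCount 1 on unit and quality).
REQUIRED = ("unit", "quality")

PREDICATES = ("tag", "value", "unit", "quality", "batch_id", "phase_id")


def row_to_triples(row: dict) -> list:
    """Turn one contextualized row into (subject, predicate, object) tuples.

    Null columns produce no triple, which is how a missing value shows
    up in a graph: as an absent fact, not as a NULL.
    """
    subject = f"reading:{row['tag']}@{row['ts']}"  # illustrative identifier
    return [(subject, p, row[p]) for p in PREDICATES if row.get(p) is not None]


def missing_required(triples: list) -> list:
    """List the required predicates that no triple carries."""
    present = {predicate for _, predicate, _ in triples}
    return [p for p in REQUIRED if p not in present]


complete = {
    "ts": "2026-01-05T00:00:00Z", "tag": "BR101.DO.PV", "value": 40.8224,
    "unit": "%sat", "quality": 192,
    "batch_id": "BATCH-2026-001", "phase_id": "PH1",
}
unitless = dict(complete, unit=None)

ok = missing_required(row_to_triples(complete))
gaps = missing_required(row_to_triples(unitless))
```

The point of the exercise is the shape of the check: a row with a null unit yields a graph with no unit fact, and a minimum-count rule is what turns that absence into a reportable violation.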

The message bus: a named stream, not a tangle of point-to-point links

Between the sensors and the historian sits the message bus. Its job is to turn a chaos of point-to-point connections into one named, real-time stream that any consumer can subscribe to. We use MQTT — a lightweight publish/subscribe protocol — carried by the Mosquitto broker, pinned in the same compose file:

  mosquitto:
    image: eclipse-mosquitto:2.0.22
    profiles: ["core"]
    ports: ["1883:1883"]
    volumes:
      - ../mosquitto/mosquitto.conf:/mosquitto/config/mosquitto.conf:ro

Raw MQTT topics are a free-for-all, though, and a free-for-all is the enemy of data integrity. So on top of MQTT we adopt Sparkplug B, the Eclipse specification that imposes a standardized topic namespace, compact payloads, and — most importantly — birth/death session state so a consumer always knows whether a device is alive and what it is reporting [6]. Sparkplug is the mechanism behind the "Unified Namespace" idea you will meet in Chapter 5: one self-describing, broker-mediated address space for the whole plant. We will build the OPC UA → Sparkplug collector in Chapter 7; here, the point is simply that the bus is a named layer with a standard, not improvised plumbing.
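The Sparkplug B namespace is regular enough to build and parse by hand, which is a useful way to see what the standard buys. The sketch below follows the specification's topic layout, spBv1.0/group_id/message_type/edge_node_id/device_id, with the device segment present only on device-level messages. The helper names and the identifiers UPSTREAM and EDGE01 are hypothetical, and the liveness tracker is deliberately simplified: a real consumer would also mark every device under a node as offline when that node's NDEATH arrives.

```python
NODE_TYPES = {"NBIRTH", "NDEATH", "NDATA", "NCMD"}
DEVICE_TYPES = {"DBIRTH", "DDEATH", "DDATA", "DCMD"}


def sparkplug_topic(group_id, message_type, edge_node_id, device_id=None):
    """Build a Sparkplug B topic string for a node or device message."""
    if message_type in DEVICE_TYPES:
        if device_id is None:
            raise ValueError(f"{message_type} requires a device_id")
        return f"spBv1.0/{group_id}/{message_type}/{edge_node_id}/{device_id}"
    if message_type in NODE_TYPES:
        if device_id is not None:
            raise ValueError(f"{message_type} is node-level; no device_id")
        return f"spBv1.0/{group_id}/{message_type}/{edge_node_id}"
    raise ValueError(f"unknown Sparkplug message type: {message_type}")


def track_liveness(alive: dict, topic: str) -> dict:
    """Update an alive-map from a BIRTH or DEATH topic; ignore other types."""
    parts = topic.split("/")
    _namespace, group_id, message_type, edge_node_id = parts[:4]
    device_id = parts[4] if len(parts) > 4 else None
    key = (group_id, edge_node_id, device_id)
    if message_type.endswith("BIRTH"):
        alive[key] = True
    elif message_type.endswith("DEATH"):
        alive[key] = False
    return alive


# Hypothetical identifiers: group UPSTREAM, edge node EDGE01, device BR101.
data_topic = sparkplug_topic("UPSTREAM", "DDATA", "EDGE01", "BR101")
state = track_liveness({}, sparkplug_topic("UPSTREAM", "DBIRTH", "EDGE01", "BR101"))
```

Contrast this with a raw MQTT topic, where nothing stops two teams from publishing the same signal under two unrelated paths and nothing tells a subscriber that a publisher has gone silent.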

One honest scoping note belongs here. MQTT/Sparkplug is the OT-side bus we build, chosen for its lightweight publish/subscribe fit close to the edge. Many enterprises run a second, IT-side streaming backbone in parallel — most commonly Apache Kafka — for high-throughput event pipelines between business systems, and the two are routinely bridged with an MQTT-to-Kafka connector at the OT/IT boundary. We stay deliberately on the OT side of that line; Kafka is the name to know for the moment the data crosses into the enterprise.

Where open source stops: the honest boundary

Now the red ink. Trace the dataflow upward again and watch where pure open source runs out of road.

Capture, historize, contextualize, visualize, reason, and analyze (every layer below the shaded compliance band) are pure, runnable open source. You will build all of it on a laptop, and the make test target proves it works. That is roughly the first 80% of the platform.

The compliance and trust band is where the honest hybrid begins. No open-source component is 21 CFR Part 11 compliant out of the box [7], and none will be by download alone. Part 11 is the US FDA's (Food and Drug Administration's) rule governing trustworthy electronic records and electronic signatures; EU Annex 11 is its European counterpart for computerised systems, and no open-source component satisfies that regime out of the box either. We demonstrate a trust layer, a system-versioned audit trail and a cryptographic hash chain in PostgreSQL (Chapters 23–24), and it is genuinely useful: it makes tampering detectable. But it does not make tampering impossible; a database superuser can disable the trigger that writes the audit row. Part 11 compliance is a property of a validated system and its procedures, not of any single tool [7]. The modern guidance agrees on the method. GAMP 5 (a widely used industry framework for validating computerised systems) in its second edition adds a risk-based, critical-thinking approach (and an open-source appendix) that lets OSS be used in GxP inside a validated lifecycle [8]. GxP is the family of "Good Practice" quality regulations (Good Manufacturing, Laboratory, and Clinical Practice), and a validated lifecycle means the system is built, tested, and documented under controls that produce formal evidence it is fit for regulated use. The FDA's Computer Software Assurance guidance reframes that lifecycle as risk-proportionate assurance, leveraging logs, audit trails, and supplier evidence rather than documentation theater [9]. That is exactly the work we show, and exactly why no tool arrives compliant.
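Why a hash chain makes tampering detectable, but not impossible, is easiest to see in a few lines. The sketch below is a generic illustration in Python and is not the PostgreSQL implementation the later chapters build: the function names, the all-zero genesis value, and the audit payloads are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's content."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Add an audit entry whose hash commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"payload": payload, "prev_hash": prev, "hash": entry_hash(prev, payload)}
    )


def verify(chain: list) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


# Illustrative audit entries, not real records.
chain = []
append(chain, {"action": "UPDATE", "table": "batch", "user": "analyst1"})
append(chain, {"action": "INSERT", "table": "batch_phase", "user": "analyst1"})
intact = verify(chain)

chain[0]["payload"]["user"] = "someone_else"  # simulate an after-the-fact edit
tampered_detected = not verify(chain)
```

Note what the sketch does not do: anyone who can rewrite the whole chain can also recompute every hash. That is the gap that procedures, access control, and independently stored anchor hashes have to close, and it is why the text calls this detection rather than prevention.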

The commercial systems are the other side of the hybrid. AVEVA PI, SAP, Emerson DeltaV, Siemens, and commercial LIMS (laboratory information management systems) — respectively a historian, an ERP, two control systems, and lab-data software — cannot run on a laptop and are license-locked. The integration code we write is real; its counterpart is a clearly labelled mock with the same API contract and a documented production swap (Chapters 20–22). And the OT/IT boundary the whole architecture straddles is itself a security perimeter, governed by ISA/IEC 62443 — the zones-and-conduits standard (formerly ISA-99) that says where a firewall, a data diode, or a one-way NOA channel belongs between the Purdue levels [10]. The final chapter scores, line by line, exactly what you trade away at each boundary.

A component and license inventory

Because we recommend specific tools, we owe you their licenses — and the 2026 traps. The core stack in examples/platform/compose/compose.yaml is small and deliberate:

Component	Pinned image	License	Note for 2026
PostgreSQL + TimescaleDB	`timescale/timescaledb:2.17.2-pg17`	PostgreSQL License + Apache-2.0 core + TSL Community	Apache-2.0 core plus free TSL Community automation (continuous aggregates, retention policy) — source-available (you can read and run the code) but not OSI-approved open source (the licence restricts some commercial uses); we avoid only the TSL Hypercore columnstore/compression/HA (high availability — redundant failover) features (see Ch 16).
MQTT broker	`eclipse-mosquitto:2.0.22`	EPL-2.0 / EDL-1.0	Pinned to the stable 2.0.x line for reproducibility; EPL-2.0/EDL-1.0 are permissive Eclipse Foundation open-source licenses with no SaaS-hosting restriction.
Dashboards	`grafana/grafana-oss:11.4.0`	AGPL-3.0	Local use is fine; redistributing or hosting it as SaaS for others triggers AGPL obligations.
Triplestore	`apache/jena-fuseki:5.2.0`	Apache-2.0	Verify the digest — the community image moved.
Metrics store	`victoriametrics/victoria-metrics:v1.108.1`	Apache-2.0	Shipped instead of InfluxDB 3 to dodge the v3 license flip — InfluxDB's v3 moved away from a permissive open-source license, so we use Apache-2.0 VictoriaMetrics to keep the metrics store freely usable.

Every image is pinned by tag, and in the repo's lock file by digest as well, so the running stack, the license inventory, and the supplier register that Chapter 25 generates for validation can never quietly drift apart. The historian's OSS posture is written right into the schema comment in examples/platform/db/20-historian.sql:

-- Apache-2.0 core (hypertables, create_hypertable, time_bucket, drop_chunks) plus
-- free TimescaleDB Community (TSL) automation: continuous aggregates and
-- add_retention_policy. TSL is free-to-use and source-available, but NOT OSI
-- open source. We deliberately do NOT use the TSL Hypercore columnstore/compression,
-- so a strictly Apache-2.0 build is one cron-driven drop_chunks away — see Chapter 16.

That comment captures the book's whole license philosophy: we use the free, source-available features deliberately and flag the licensed ones loudly, so a tool you adopt for being "open" does not ambush you later.
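The pinning rule in the inventory above can also be checked mechanically. The sketch below is a hypothetical helper, not part of the companion repo: it assumes image references in the simple `repo:tag` or `repo:tag@sha256:<digest>` form used in the table, and it does not handle registry hosts that carry a port number.

```python
import re

# Matches repo:tag with an optional @sha256:<64 hex chars> digest suffix.
PINNED = re.compile(
    r"^(?P<repo>[^:@\s]+):(?P<tag>[^:@\s]+)(@sha256:(?P<digest>[0-9a-f]{64}))?$"
)


def check_pin(image: str, require_digest: bool = False) -> list:
    """Return a list of pinning problems for one image reference (empty means OK)."""
    match = PINNED.match(image)
    if match is None:
        return ["not in repo:tag[@sha256:digest] form"]
    problems = []
    if match.group("tag") == "latest":
        problems.append("floating tag 'latest'")
    if require_digest and match.group("digest") is None:
        problems.append("no sha256 digest")
    return problems


# Image references taken from the inventory table.
images = [
    "timescale/timescaledb:2.17.2-pg17",
    "eclipse-mosquitto:2.0.22",
    "grafana/grafana-oss:11.4.0",
]
report = {image: check_pin(image) for image in images}
```

Run with `require_digest=True` against a lock file, the same check would flag any entry that is pinned by tag alone.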

Why it matters

An architecture diagram is not a decoration; it is a decision record. Every later chapter is a thin slice over this one shared blueprint — the companion repo is built exactly as "thin chapters over a thick shared platform." Chapter 7 does not redefine the historian; it writes into the ts.sensor_reading you saw above. Chapter 17 does not invent a join; it extends the s88.v_batch_sensor view. Chapter 23 does not bolt on a new database; it adds an audit schema beside the ones already there. Chapter 29 does not invent the analytics; it trains an SPC chart on the release/CofA results and a PLS soft-sensor on the committed Raman dataset, with held-out validation and run-to-model lineage — see Process Analytics: SPC, MVDA & Soft Sensors. Because the layers are defined once and reused, the build order is the architecture: make up brings up the core (database, broker, dashboards), make seed loads the ISA-88/95 line, and every chapter after that turns on one more profile.

Why the analytics layer needs the two columns below it

The analytics layer is the one place where the blueprint's discipline is not optional, because it is what keeps a model trustworthy, and two columns you already met are what make it possible. The batch_id is the unit of statistical independence: rows within one fed-batch run are autocorrelated near-duplicates, so a model honestly validated on this data must split by whole batch (scikit-learn's GroupKFold or LeaveOneGroupOut, grouping on batch_id), never row-wise. A row-wise split leaks a near-twin of every test point into training and reports a fantasy R². This is the field's single most common validation error, unpacked in the chapters on the learning problem and on models and validation. The quality column is the gate on the inputs: dropping Uncertain (64) and Bad (0) points before fitting is not data-tidying, it is enforcing the model's applicability domain (the input region it was calibrated on and may be trusted inside), so the day-7 excursion's Uncertain DO never silently trains or scores the soft sensor.
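The two rules in that paragraph can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the book's analytics code: the row dictionaries and their keys are hypothetical, and in practice scikit-learn's LeaveOneGroupOut with batch_id as the group would do the splitting.

```python
GOOD = 192  # legacy OPC DA "Good"; 64 is Uncertain and 0 is Bad, as in the text


def keep_good(rows):
    """Quality gate: drop Uncertain and Bad points before any fitting or scoring."""
    return [r for r in rows if r["quality"] == GOOD]


def leave_one_batch_out(rows):
    """Yield (held_out_batch, train_rows, test_rows), always splitting by whole batch_id."""
    batch_ids = sorted({r["batch_id"] for r in rows})
    for held_out in batch_ids:
        train = [r for r in rows if r["batch_id"] != held_out]
        test = [r for r in rows if r["batch_id"] == held_out]
        yield held_out, train, test


# Hypothetical contextualized rows: three batches, one Uncertain point.
rows = [
    {"batch_id": "B-001", "value": 40.1, "quality": 192},
    {"batch_id": "B-001", "value": 39.8, "quality": 192},
    {"batch_id": "B-002", "value": 41.0, "quality": 192},
    {"batch_id": "B-002", "value": 12.3, "quality": 64},   # excluded by the gate
    {"batch_id": "B-003", "value": 40.5, "quality": 192},
]

clean = keep_good(rows)
folds = list(leave_one_batch_out(clean))
```

Because each fold holds out every row of one batch, no within-run near-duplicate of a test point can appear in the training set.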

The same contextualized row also makes the two kinds of drift separable, which a flat historian cannot do. Process drift is the living system genuinely changing (a new raw-material lot, a cell line adapting with passage), visible as a shift in the value distribution within its phase; model drift is the soft sensor going stale against that moving world. A monitor on the input distribution (a Population Stability Index, label-free and leading) catches the first; a control chart on the prediction residual against the slow offline assay catches the second. This is the two-detector design grounded in the MLOps and lifecycle chapter and in the hybrid-model argument for why a physics backbone drifts more slowly than a black box. And because the audit trail and the dataset are version-pinned in the same database, a prediction can be traced back to the exact rows, schema version, and model version that produced it: model lineage that is the analytics-side echo of the lot genealogy the semantics layer walks. The blueprint does not make the model good; it makes the model governable, which under GxP is the harder and more valuable property.
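As a rough sketch of the first detector, the Population Stability Index compares the binned distribution of a reference window with that of a current window. The version below is one common formulation, written under assumptions: equal-width bins taken from the reference range, out-of-range values clamped into the edge bins, and a small floor on empty bins to keep the logarithm finite. The function name and defaults are hypothetical.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one input variable."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref = fractions(reference)
    cur = fractions(current)
    # Sum over bins of (current - reference) * ln(current / reference).
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical windows: the same phase before and after a raw-material lot change.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in baseline]

no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

Every term in the sum is non-negative, so the index is zero only when the binned distributions match and grows as they diverge. It needs no labels, which is what makes it a leading indicator ahead of the slow offline assay.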

Getting the blueprint right early is also what makes the regulated end achievable at all. If the historian and the batch model were two disconnected products, the contextualization view — and with it every audit trail, lineage query, and golden-batch comparison built on top — would be a fragile ETL pipeline instead of a one-line SQL join. The architecture is the difference between a batch investigation that takes three weeks and one that takes an afternoon.
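The shape of that join can be shown at laptop scale with Python's built-in sqlite3 module. This is a sketch under assumptions, not the repo's schema: the table and column names below are hypothetical stand-ins for ts.sensor_reading and the batch model, the view only mimics the idea behind s88.v_batch_sensor, and timestamps are ISO-8601 strings so that text comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor_reading (
    ts TEXT, equipment_id TEXT, tag TEXT, value REAL, quality INTEGER
);
CREATE TABLE batch_phase (
    batch_id TEXT, equipment_id TEXT, phase TEXT, start_ts TEXT, end_ts TEXT
);
-- Contextualization: attach batch and phase to each reading taken on the
-- same equipment while that phase was active.
CREATE VIEW v_batch_sensor AS
SELECT p.batch_id, p.phase, r.ts, r.equipment_id, r.tag, r.value, r.quality
FROM sensor_reading AS r
JOIN batch_phase AS p
  ON p.equipment_id = r.equipment_id
 AND r.ts >= p.start_ts
 AND r.ts <  p.end_ts;
""")

conn.executemany(
    "INSERT INTO sensor_reading VALUES (?, ?, ?, ?, ?)",
    [
        ("2026-01-01T00:10:00", "BR-101", "pH", 7.02, 192),
        ("2026-01-01T02:30:00", "BR-101", "pH", 6.95, 192),
        ("2026-01-01T09:00:00", "BR-101", "pH", 7.10, 64),  # after the last phase ends
    ],
)
conn.executemany(
    "INSERT INTO batch_phase VALUES (?, ?, ?, ?, ?)",
    [
        ("B-001", "BR-101", "Inoculation", "2026-01-01T00:00:00", "2026-01-01T01:00:00"),
        ("B-001", "BR-101", "Growth", "2026-01-01T01:00:00", "2026-01-01T08:00:00"),
    ],
)

rows = conn.execute(
    "SELECT batch_id, phase, value FROM v_batch_sensor ORDER BY ts"
).fetchall()
```

Here the first two readings come back labelled with their batch and phase, while the third, taken outside any phase window, drops out of the inner join. With both tables in one database, asking which phase a reading belongs to is a query, not an export-and-reconcile exercise.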

The evidence behind "three weeks vs. an afternoon"

That contrast is not rhetorical flourish. Reviews of data science in biopharmaceutical manufacturing identify the field's defining obstacle as exactly this fragmentation: process data is scattered across heterogeneous, vendor-specific systems with incompatible formats and no shared context, so a large share of any analysis or investigation is spent not analyzing but locating, aligning, and re-contextualizing data by hand before the real question can even be asked [11]. When the historian and the batch model are separate products, every cross-system investigation pays that tax up front — manually correlating time-series exports against batch records, reconciling timestamps, and re-deriving which phase was active — which is where the multi-week figure comes from. The one-database join in this blueprint pays the tax once, at schema-design time, so the investigation itself is a query. The lesson the literature draws is the same one this chapter draws: the cost of disconnected data is paid on every investigation, while the cost of contextualization-by-design is paid only once [11].

In the real world

Walk into a modern biomanufacturing facility and you will find this exact layering, even if no one drew it on a wall. There will be OPC UA servers on the skids, a historian (often a commercial one), a relational MES (manufacturing execution system — the Level-3 software that holds batch records and dispatches production) holding the batch model, and — increasingly — open-source tools like Grafana and PostgreSQL running alongside the validated systems. The skill the industry actually pays for is knowing where the boundary between them sits and why it sits there. ISA-95 gives that boundary a vocabulary [1]; ISA/IEC 62443 gives the OT/IT perimeter its controls [10]; NOA gives the read-mostly analytics channel its legitimacy [2].

One boundary this blueprint deliberately leaves off the page is the enterprise IT data plane above Level 4. In a large company the on-prem stack you build here typically hands its data onward — through pipeline orchestrators like Apache Airflow and SQL-transformation layers like dbt — into a cloud data lake or lakehouse (Snowflake, Databricks / Delta Lake, or the AWS and Azure equivalents) where cross-site analytics and AI live. Our make targets and SQL views are the laptop-scale stand-in for that machinery; the moment you scale past one site, those are the tools this architecture connects to, and we name them here so the hand-off is a known edge rather than a surprise.

The institutional momentum is real. New pilot-scale cGMP (current Good Manufacturing Practice) facilities are where architectures like this one get exercised against real, regulated production. And in every such plant the same question recurs, the one this book returns to in every chapter: which layer holds the trustworthy, audit-trailed record of truth? The blueprint you just read is our answer to where that record can live; the trust chapters are our honest accounting of how far open source can take it, and where it cannot.

Key terms

Reference architecture: the layered blueprint of the whole platform, where each layer adds one kind of meaning and maps to an ISA-95 level and an open-source tool.
ISA-95 (IEC 62264): the standard layered model (Levels 0–4) for integrating enterprise and control systems, used to place each tool.
Purdue model / ISA-99 → ISA/IEC 62443: the OT/IT layering and its zones-and-conduits security standard, defining where boundary controls belong.
NAMUR Open Architecture (NOA): the concept of a second, read-mostly data channel for monitoring and optimization that never alters the validated control system.
OPC UA: the platform-independent, self-describing protocol that carries each value plus its quality status from sensor to application.
MQTT / Sparkplug B: the lightweight publish/subscribe transport (Mosquitto) and the specification that gives it a standardized namespace and birth/death state — the basis of the Unified Namespace.
Hypertable: a TimescaleDB table auto-partitioned by time, the historian abstraction that keeps high-rate sensor data inside PostgreSQL.
Contextualization: the SQL join (s88.v_batch_sensor) that ties a raw reading to its batch, equipment, and ISA-88 phase.
Semantics (RDF / SPARQL): representing each contextualized fact as a subject-predicate-object triple in a knowledge graph (served by Apache Jena Fuseki) so lineage can be queried across systems with the SPARQL query language — built in full in the semantics chapter.
SHACL shape / competency question: the graph-side mirror of the schema's NOT NULL — a closed-world completeness check that a required fact (unit, quality, a release result) is present — and a question the model must be able to answer, the cheapest check that relational and graph views agree.
Applicability domain: the input region a model was calibrated on and may be trusted inside; the quality column is what lets analytics exclude out-of-domain (Uncertain/Bad) points before fitting.
Grouped (leave-one-batch-out) cross-validation: splitting validation data by whole batch_id so autocorrelated within-run rows never straddle the train/test line — the guard against the leakage that inflates a soft sensor's reported skill.
Process drift vs. model drift: the living system genuinely changing (a shift in the value distribution) versus the model going stale against it; the contextualized row makes the two separable, each caught by a different detector.
Model lineage: the audit-trailed link from a prediction back to the exact rows, schema version, and model version that produced it — the analytics-side echo of lot genealogy.
ALCOA+: the regulators' shorthand for the properties trustworthy data must have — Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available.
Quality code (OPC DA / OPC UA StatusCode): the per-value trust flag carried from sensor to historian. The historian stores the compact legacy OPC DA encoding — 192 Good, 64 Uncertain, 0 Bad — and maps it to the full 32-bit OPC UA StatusCode at the protocol boundary; it is the ALCOA+ "Original" attribute made into a column.
ETL (Extract–Transform–Load): the scheduled pipeline that copies and reconciles data between two separate products; the fragile alternative this blueprint avoids by keeping the historian and the batch model in one joinable database.
GxP: the family of "Good Practice" quality regulations (Good Manufacturing, Laboratory, and Clinical Practice) that govern regulated production and its records.
HA (high availability): a system design that survives a component failure without losing service, typically through redundant, automatically failing-over instances.
Honest hybrid: the stance that pure OSS covers ~80% of the stack while the GxP last mile (validation, electronic signatures, high availability, vendor accountability) is met with hardening or commercial systems.
Record of truth: the audit-trailed, trustworthy electronic record subject to Part 11 / Annex 11 — the recurring question of which layer holds it.

Where this leads

You now hold the map: eight layers from sensor to submission, each pinned to an ISA-95 level and an open-source tool, with the OSS-versus-commercial boundary drawn in ink. The next chapter, Standing Up the Stack: One docker compose up, turns the map into a running machine. We bring up the core profile — PostgreSQL + TimescaleDB, Mosquitto, and Grafana — with a single pinned command, start the CHO simulator alongside it (a Python package, not a Compose service), prove the stack is alive with a first-data-point smoke test, and watch the first reading flow into the historian you just designed.
