Operating, Scaling & Securing the Platform

📍 Where we are: Part VI · Operating at Scale — the stack works; now we keep it alive, locked down, and recoverable when something breaks at 3 a.m.

The simple version

Building a kitchen is the fun part. Running a restaurant that passes a health inspection every single day for ten years is the hard part. This chapter is about day two: the night the disk fills, the morning a new CVE drops, the audit that asks "show me you can restore last Tuesday." Open source builds you a beautiful kitchen. The fire suppression system, the locks on the doors, and the logbook the inspector reads — those you have to install and maintain yourself, and a few of them you will end up buying.

What this chapter covers

We have spent twenty-four chapters making the platform do things. This chapter makes it survive things. We will walk through four day-two realities where open source gets honestly tested:

High availability (HA): why "just cluster it" is harder than it sounds, and how the free MQTT-broker clustering story (running several message-relay servers as one fault-tolerant group) shrank in 2025.
Backup and point-in-time recovery (PITR): the one thing a regulator will absolutely ask you to demonstrate, and the tooling that makes it routine for PostgreSQL.
Network segmentation, TLS, certificates, and secrets: keeping the plant floor and the data stack in separate, defended zones.
CVE response and self-observability: patch cadence, a cautionary tale about a lean broker, and watching the watcher with VictoriaMetrics (the open-source metrics database the stack uses to monitor itself).

Throughout, cGMP (current Good Manufacturing Practice — the binding quality regulations for making medicines) and EU GMP Annex 11 set the bar, and we are honest about where pure open source stops short of it.

Day two is the real exam

Everything before this chapter assumed the happy path: services start, data flows, dashboards render. Day-two operations is the unglamorous discipline of assuming the unhappy path. A node dies mid-batch. A certificate silently expires on a Sunday. A dependency you have never heard of gets a critical advisory. The regulator does not grade you on the demo — they grade you on the recovery.

This matters more in our world than in most. EU GMP Annex 11 makes availability, backup, and business continuity explicit obligations: section 7 (§7) requires regular, verified backups of data and a check that restored data is accurate, and section 16 (§16) requires documented business-continuity arrangements so a critical system failure does not stop you from making or releasing medicine [1]. You do not get to treat resilience as a nice-to-have. It is a line item an inspector reads back to you.

High availability, and the shrinking free-clustering story

HA means: a single component can fail and the system keeps serving. For a database, that is a replica ready to take over. For a message broker, that is a cluster of brokers sharing the load and the state — the live record of which clients are connected and what they have subscribed to, so any broker in the cluster can serve any client.

Here is the honest part. The open-source broker we ship in the core stack, Eclipse Mosquitto — an MQTT broker, the server that receives the plant's lightweight publish/subscribe sensor messages and relays them between devices and our data stack — is a deliberately single-broker design. It is small, fast, rock-solid, and pinned (locked to one exact version so an update cannot silently change it) to the stable eclipse-mosquitto:2.0.22 line in our compose file (compose.yaml, the single YAML file that declares every container the stack runs). What it is not is clustered. Mosquitto's answer to spanning multiple brokers is its bridge feature, which connects independent brokers and forwards selected topics between them — useful for federating sites, but it is not native clustering with shared session state and automatic failover [2]. If the one Mosquitto process is down, its clients have nowhere to reconnect until it comes back.

The obvious upgrade used to be EMQX, which offered free multi-node clustering. In 2025 that door narrowed: from EMQX 5.9 the project moved from Apache 2.0 (a permissive license — free to use, modify, and run in production with no scale limit) to the Business Source License (BSL 1.1), and running a production multi-node cluster now requires a commercial license [3]. This is the recurring 2026 license trap in miniature — the feature you most want for HA is exactly the one that is no longer free. So the chapter is blunt: for a regulated single-site mAb (monoclonal antibody — a protein drug grown in cultured cells, the running example throughout this book) line, a well-monitored single Mosquitto plus a fast restart and a documented failover SOP is a defensible, honest posture; true broker HA at scale is where you either pay EMQX, run HiveMQ, or accept a hybrid commercial component. Do not pretend a Mosquitto bridge is a cluster.

The database side is friendlier. Our postgres service is timescale/timescaledb:2.17.2-pg17 — PostgreSQL with the TimescaleDB extension, so the historian hypertables (the time-series store for high-rate sensor data — see the historian chapter) and the ISA-88/95 batch model (the standard relational model of batches, equipment, and recipes — see the batch & equipment model chapter) live in one joinable database, defined once in examples/platform/compose/compose.yaml:

# examples/platform/compose/compose.yaml
services:
  # --- core --------------------------------------------------------------
  postgres:
    # timescale/timescaledb IS PostgreSQL + TimescaleDB, so the historian
    # hypertable and the ISA-88/95 batch model live in one joinable database.
    image: timescale/timescaledb:2.17.2-pg17
    profiles: ["core"]
    <<: *restart
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-bioproc}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-bioproc}
      POSTGRES_DB: ${POSTGRES_DB:-bioproc}
    ports: ["5432:5432"]
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ../db:/docker-entrypoint-initdb.d:ro   # 00-60 schema files run on first init
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-bioproc} -d ${POSTGRES_DB:-bioproc}"]
      interval: 5s
      timeout: 5s
      retries: 20

Notice three operational habits baked into that block. The image is pinned by tag, so a silent minor bump can never smuggle in a Debian/glibc jump that corrupts indexes — and the matching manifest digests are recorded in examples/platform/versions.lock (committed today and regenerated by make lock); the full supplier-register treatment is taken up in the validation chapter. The data lives on a named volume (pgdata), not the container's ephemeral layer, so the container is disposable but the data is not. And there is a real healthcheck, so the orchestrator — and the make up poller — knows the difference between "process started" and "actually ready to serve."
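The first habit can be made mechanical. The sketch below is a minimal Python check that flags image references which could float to a new build. The function name `is_pinned` and its rule (a digest always counts, otherwise an explicit tag other than `latest` is required) are assumptions made for illustration and are not something the repo ships.

```python
def is_pinned(image: str) -> bool:
    """Return True if an image reference names an explicit version.

    Hypothetical rule: a digest (@sha256:...) always counts as pinned;
    otherwise an explicit tag other than 'latest' is required.
    """
    if "@sha256:" in image:
        return True
    # Look for the tag only in the last path segment, so a registry port
    # such as 'registry.local:5000/app' is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag at all means the implicit 'latest'
    tag = last_segment.split(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    for ref in (
        "timescale/timescaledb:2.17.2-pg17",
        "timescale/timescaledb",
        "eclipse-mosquitto:latest",
    ):
        print(ref, "pinned" if is_pinned(ref) else "FLOATING")
```

A tag can still be re-pushed by its publisher, which is why the digests in versions.lock matter: the tag check catches the careless case, the digest catches the rest.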

Streaming replication: the free, production-grade database HA

PostgreSQL also gives you real streaming replication: a primary streams its write-ahead log (the WAL — PostgreSQL's ordered record of every change, detailed in the PITR section below) to one or more hot standbys (replica databases kept continuously up to date and ready to take over) that can be promoted — switched from replica to primary — in seconds. That is genuine, free, production-grade database HA, and it is the strongest HA story in our whole stack. The mechanism is the same WAL stream that powers PITR — the standby simply replays the primary's WAL continuously instead of from an archive, so a single discipline (ship and replay WAL) buys you both a hot failover target and a point-in-time rewind. Because TimescaleDB is PostgreSQL, the historian hypertables replicate with everything else; there is no separate cluster to stand up for the high-rate sensor data. The asymmetry is the lesson — the relational core clusters well in the open; the broker does not.
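Replication is only useful if the standby is actually keeping up, so lag is the number to watch. PostgreSQL reports WAL positions as log sequence numbers (LSNs) such as `0/3000060`, and the sketch below turns two of them into a byte lag. It assumes you have already read the primary's position (for example from `pg_current_wal_lsn()`) and the standby's replay position (for example from `pg_last_wal_replay_lsn()`); the threshold `MAX_LAG_BYTES` and the helper names are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to a byte position.

    An LSN prints as two hexadecimal numbers: the high 32 bits, a slash,
    then the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replay_lag_bytes(primary_lsn: str, standby_replay_lsn: str) -> int:
    """Bytes of WAL the standby has not yet replayed."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_replay_lsn)


# Hypothetical alert threshold: one default-sized 16 MiB WAL segment.
MAX_LAG_BYTES = 16 * 1024 * 1024


def standby_is_fresh(primary_lsn: str, standby_replay_lsn: str,
                     max_lag: int = MAX_LAG_BYTES) -> bool:
    """True if the standby is within the allowed replay lag."""
    return replay_lag_bytes(primary_lsn, standby_replay_lsn) <= max_lag


if __name__ == "__main__":
    print(replay_lag_bytes("0/3000060", "0/3000000"))
```

A standby that is many segments behind is still a backup of sorts, but it is not the seconds-to-promote failover target the paragraph above describes.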

The Mosquitto bridge is not a cluster

It is worth saying plainly because the mistake is common: Mosquitto's bridge is not HA. A bridge connects two independent brokers and forwards selected topics between them; each broker keeps its own sessions, its own retained messages, and its own subscriptions [2]. If broker A dies, the clients connected to A do not silently fail over to B: they have no session on B, and B never received their CONNECT (the handshake packet a client sends to open a session with a broker). Their queued QoS 1/2 messages live in A's now-dead persistence file. (QoS is MQTT's quality-of-service level: level 1 promises at-least-once delivery and level 2 exactly-once, so for both the broker must hold the message until it is acknowledged; level 0 is fire-and-forget, with no such guarantee.) A bridge federates sites that are each independently available; it does not make one logical broker out of two. Native clustering (shared session state, a client that reconnects to any node and finds its subscriptions intact, automatic failover) is exactly the feature Mosquitto does not ship and EMQX moved behind a commercial license in 2025. So the honest single-site posture stands: one well-monitored Mosquitto, a fast restart, and a documented failover SOP, with the limitation written down rather than papered over with a bridge diagram that looks like a cluster but is not one.

Backup and point-in-time recovery

If you remember one operational duty from this book, make it this one. A backup you have never restored is a rumor. PITR turns "we lost data" into "we rewound to 14:32 yesterday, just before the bad migration (a database schema change)."

PostgreSQL's PITR works by combining two things: a base backup (a full physical copy of the data directory) plus the continuously archived WAL — the stream of write-ahead-log segments that record every change since that base. To recover, you restore the base and then replay WAL forward to any moment you choose [4]. That last clause is the magic: you are not limited to the last nightly snapshot; you can land on a specific transaction or timestamp.
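The "restore the base, then replay forward" rule implies a selection step: the base backup you start from must have finished before the moment you want to land on. A minimal Python sketch of that choice follows. The function name and the list of backup end times are hypothetical; in practice the backup tool makes this selection for you.

```python
from datetime import datetime


def pick_base_backup(backup_end_times, target):
    """Return the end time of the newest base backup usable for `target`.

    WAL replay only moves forward, so the base backup must have finished
    before the recovery target.
    """
    usable = [t for t in backup_end_times if t < target]
    if not usable:
        raise ValueError("no base backup finished before the recovery target")
    return max(usable)


if __name__ == "__main__":
    weekly_fulls = [datetime(2025, 1, 5, 2, 0), datetime(2025, 1, 12, 2, 0)]
    target = datetime(2025, 1, 14, 14, 32)  # just before the bad migration
    print("restore base from", pick_base_backup(weekly_fulls, target))
    print("then replay WAL up to", target)
```

The failure branch is the one worth rehearsing: a target older than your oldest retained base backup cannot be reached, however much WAL you kept.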

Doing the WAL plumbing by hand is fiddly, so the open-source tool of choice is pgBackRest (PostgreSQL License — permissive, free). It manages full, differential, and incremental backups; supports off-host and encrypted repositories so your backups are not sitting next to the thing they are meant to protect; and performs time- or transaction-targeted PITR restores [5]. Below is a representative ops/pgbackrest.conf of the shape the chapter recommends — illustrative configuration, not yet a runnable service in the repo, shown here so you can see exactly what the operator wires up:

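# ops/pgbackrest.conf  (illustrative: the operator's day-two artifact)
[global]
repo1-path=/var/lib/pgbackrest
# keep 4 weekly full backups
repo1-retention-full=4
# encrypt the repository at rest
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<from-secrets-manager>
# request an immediate checkpoint so the backup starts without waiting
start-fast=y
# parallel worker processes for compression and transfer
process-max=2

# stanza: the one PostgreSQL cluster this repository protects
[bioproc]
pg1-path=/var/lib/postgresql/data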

A weekly full plus daily differentials, with continuous WAL archiving in between, is a sane starting cadence. The point an auditor cares about is not the config — it is the restore runbook: a written, tested procedure that proves you can rebuild the database to a chosen point and that the restored data verifies. Annex 11 §7 does not just want backups; it wants checked backups, and §16 wants the continuity plan that uses them [1]. A quarterly restore drill, logged, is how you turn "we have backups" into evidence.
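The drill cadence is easy to state and easy to let slip, so it is worth a check of its own. The Python sketch below reads a drill log and reports whether a verified restore is overdue. The log shape (a date plus a verified flag) and the 92-day window standing in for "quarterly" are assumptions for illustration.

```python
from datetime import date, timedelta


def drill_overdue(drills, today, max_age_days=92):
    """Return True if no verified restore drill falls inside the window.

    `drills` is a list of (drill_date, verified) pairs. A drill that ran
    but whose restored data was never verified does not count as evidence.
    """
    verified_dates = [d for d, verified in drills if verified]
    if not verified_dates:
        return True
    return today - max(verified_dates) > timedelta(days=max_age_days)


if __name__ == "__main__":
    log = [
        (date(2025, 1, 15), True),
        (date(2025, 4, 10), False),  # restore ran, verification skipped
    ]
    print("overdue:", drill_overdue(log, today=date(2025, 5, 1)))
```

Counting only verified drills mirrors the §7 wording: the obligation is checked backups, not attempted ones.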

The historian needs the same care. Because TimescaleDB is PostgreSQL, the hypertables in our ts schema are covered by the very same physical backup and WAL stream — one PITR strategy protects both the batch model and the high-rate sensor data. The historian leans on hypertables and drop_chunks (Apache-2.0) alongside the continuous aggregates and add_retention_policy automation, which are free TSL Community features — source-available and free to run, but not OSI-open (not on the Open Source Initiative's approved-license list, which is the formal bar this book uses for true open source) — and we deliberately avoid the TSL columnstore/compression tier, so there is no proprietary data-tiering layer to complicate a restore.

Anatomy of a database change: what a restore (or a revert) actually replays

PITR rewinds past a bad change. But the cheaper, first-line defense against a bad change is to make every change a small, reversible, self-verifying unit in the first place — so most "we broke the schema" incidents never reach the point where you reach for a backup. The repo ships exactly that artifact: a Sqitch migration under examples/platform/db/migrations. The pgBackRest manifest the PITR story leans on is still illustrative in this repo, so the honest artifact to dissect is the one that is checked in and runnable — a real change, with a real verify gate, that the chapter's "a patch is a change, and a change must be re-verified" rule is built from. The plan file lists the change recipe_param_no_overlap; on disk it is three SQL scripts — deploy/, verify/, and revert/ — plus the sqitch.plan entry and sqitch.conf engine config. Read field by field, a migration is a value-plus-provenance record in the same spirit as a historian reading: not just what changed, but when it was planned, how it proves itself, and how it is undone.

One Sqitch migration, field by field — a schema change is a reversible, self-verifying unit of record, not a one-way SQL statement you hope worked. Original diagram by the authors, created with AI assistance.

Three fields carry the day-two weight. verify is the gate: sqitch.conf sets verify = true under [deploy], so the engine runs the change's verify script as part of the deploy and reverts the change in the same run if the verify fails. The verify script for recipe_param_no_overlap is deliberately blunt: it runs SELECT 1 / CASE WHEN count(*) = 1 THEN 1 ELSE 0 END against pg_constraint, so it divides by zero (raising an error, and so failing the deploy) unless exactly one matching constraint exists. revert is the reversibility rule in one line: ALTER TABLE s88.recipe_parameter DROP CONSTRAINT recipe_parameter_no_overlap. And deploy is the forward SQL: in plain terms, a rule that forbids two versions of the same recipe parameter from being "in effect" at the same time, which is exactly what makes a contiguous version history trustworthy. In PostgreSQL terms that rule is a GiST exclusion constraint, a constraint that rejects a new row if it would conflict with an existing one, using PostgreSQL's GiST index type to test the conflict fast. Here it matches rows for the same parameter (recipe_id, name) and compares their validity windows with tstzrange(valid_from, valid_to, '[)') WITH &&, where && is PostgreSQL's range-overlap operator, so any new row whose window overlaps an existing one for the same parameter is rejected. The btree_gist extension supplies the equality operators that let the scalar columns take part in the GiST index. The '[)' half-open bound is load-bearing: it makes the range include valid_from but exclude valid_to, so back-to-back versions (the valid_to of one equal to the valid_from of the next) abut without overlapping and the exclusion lets a contiguous version history stand. The deploy carries no explicit BEGIN/COMMIT: Sqitch wraps each change in its own transaction on PostgreSQL, which is precisely what makes the auto-revert clean.
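The half-open bound is easier to trust once you see the comparison it implies. The Python sketch below re-implements the overlap test outside the database, purely to illustrate the semantics; PostgreSQL enforces the real rule. The row shape and function names are hypothetical, a valid_to of None stands for an open-ended version, and the sketch assumes valid_from is strictly earlier than valid_to.

```python
from datetime import date


def windows_overlap(a_from, a_to, b_from, b_to):
    """Overlap test for half-open [from, to) windows.

    Each window includes its start and excludes its end, so two windows
    that merely touch (a_to == b_from) do not overlap.
    """
    a_starts_before_b_ends = b_to is None or a_from < b_to
    b_starts_before_a_ends = a_to is None or b_from < a_to
    return a_starts_before_b_ends and b_starts_before_a_ends


def would_violate(existing_rows, new_row):
    """True if new_row overlaps an existing version of the same parameter.

    Rows are (recipe_id, name, valid_from, valid_to) tuples.
    """
    recipe_id, name, start, end = new_row
    return any(
        row[0] == recipe_id
        and row[1] == name
        and windows_overlap(row[2], row[3], start, end)
        for row in existing_rows
    )


if __name__ == "__main__":
    rows = [(1, "pH_setpoint", date(2025, 1, 1), date(2025, 2, 1))]
    next_version = (1, "pH_setpoint", date(2025, 2, 1), None)
    print("back-to-back accepted:", not would_violate(rows, next_version))
```

With a closed upper bound the back-to-back row would collide on the shared instant, and every version handover would need an artificial gap.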

The migration gate: deploy, verify, or auto-revert

That trio composes into a gate, and the gate is the reason a half-applied schema change is not a thing that can quietly survive in this stack.

Flow of the migration gate: sqitch deploy runs the forward SQL in its own transaction, then runs the verify script; on a green verify-passes branch the change is committed and recorded in the sqitch registry, on a rose verify-fails branch the change is auto-reverted in the same run, and both branches converge on a known, never-half-applied state.

Run sqitch deploy. The deploy script runs inside its own transaction. The verify script runs immediately after; because verify = true, its result decides the outcome with no human in the loop. If it passes, the change commits and is recorded in the Sqitch registry. If it fails — the divide-by-zero fires — the engine runs revert/ in the same invocation and the database lands back exactly where it started. Either branch ends in a known state, never the dangerous middle where the table half-changed and nobody is sure. This is the migration-time mirror of the restore runbook: the runbook proves you can get back to a known point after the fact; the verify gate proves you never left one in the first place. For a regulated system where every schema change is a controlled change, "the change reverts itself if it cannot prove it worked" is a far stronger posture than "we ran the SQL and it didn't error."
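The control flow of that gate is small enough to model in a few lines. The Python sketch below is a toy, not Sqitch: the database is a set of constraint names, the registry is a list, and the three scripts are plain functions. Every name except the constraint and change names quoted above is hypothetical.

```python
CONSTRAINT = "recipe_parameter_no_overlap"


def gated_deploy(constraints, registry, change, deploy, verify, revert):
    """Run deploy, then verify; record on success, revert on failure."""
    deploy(constraints)
    if verify(constraints):
        registry.append(change)  # models the entry in the change registry
        return "committed"
    revert(constraints)
    return "reverted"


def good_deploy(constraints):
    constraints.add(CONSTRAINT)


def broken_deploy(constraints):
    pass  # models forward SQL that ran without creating the constraint


def verify(constraints):
    return CONSTRAINT in constraints


def revert(constraints):
    constraints.discard(CONSTRAINT)


if __name__ == "__main__":
    db, registry = set(), []
    print(gated_deploy(db, registry, "recipe_param_no_overlap",
                       broken_deploy, verify, revert), db, registry)
    print(gated_deploy(db, registry, "recipe_param_no_overlap",
                       good_deploy, verify, revert), db, registry)
```

The property to notice is that neither branch returns with the registry and the database disagreeing: a recorded change is always a verified one.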

The same guarantee, one layer up: the constraint as a graph shape

The GiST exclusion constraint guards contiguous recipe-parameter history inside one PostgreSQL table. But the knowledge-graph chapter lifts that same plant data into RDF triples, where the identical "no two versions of the same parameter overlap in time" rule is expressed not as a database constraint but as a SHACL shape — a declarative pattern a graph's data must conform to, the graph-world analogue of the CHECK/exclusion constraints SQL enforces row by row. The two are complements, not duplicates: the GiST constraint stops a bad row from ever being written; the SHACL shape lets a validator (and an auditor) confirm after the fact that the lifted graph still honours the rule across systems, which is exactly the closed-world "is a required fact missing or duplicated?" check the release gate builds in Book 4. This is also why a migration is worth reading as a provenance record and not just a diff: its planned_at timestamp, planner identity, and verify result are precisely the prov:wasGeneratedBy / prov:generatedAtTime facts the W3C PROV-O vocabulary uses to make a change attributable and time-stamped — the database-change equivalent of the audit trail that makes a process record trustworthy under ALCOA+. Stated as the kind of competency question an ontology is built to answer — "reconstruct the schema and the data exactly as they stood at 14:32 yesterday, and show which change took effect when" — the PITR rewind and the Sqitch registry are two halves of the same audited answer, the temporal walk that derivedFrom and the genealogy spine perform for lots rather than for schema versions. The day-two artifacts on this page are not outside the semantic layer; they are the place its integrity guarantees are physically enforced.

Drawing the OT/IT line: segmentation, TLS, and secrets

A mAb line is two worlds stitched together. On the operational technology (OT) side sit the PLCs (programmable logic controllers), the DCS (distributed control system), the bioreactor skid — a self-contained equipment module on a frame (the vessel packaged with its own pumps, valves, sensors, and local controls) — and its OPC UA server (the OPC Unified Architecture standard interface that exposes the skid's tags — see the connectivity chapter) — long-lived, fragile, rarely patched mid-campaign. On the information technology (IT) side sit our brokers, databases, and dashboards. The cardinal rule of plant-floor security is that these worlds do not share a flat network.

Zones, conduits, and Security Levels

The framework here is IEC 62443, specifically Part 3-3, which formalizes zones and conduits: you carve the plant into security zones, define the conduits (the only sanctioned paths) between them, and assign each a target Security Level [6]. A zone is a grouping of assets that share a security requirement (the OT skid, the IT data stack); a conduit is the controlled communication path between zones — and the rule is that traffic crosses only through a conduit, never zone-to-zone on a flat wire. Each zone gets a target Security Level (SL 1 through SL 4) reflecting the strength of attacker it must withstand, which then drives the concrete controls on its conduits. For us that means the OPC UA server and edge gateway sit in an OT zone; the historian and broker sit in an IT/DMZ (demilitarized zone — an isolated buffer network) zone; and the only conduit between them is the collector (the OPC-UA-to-MQTT bridge shown in the figure that reads tags from the OT side and republishes them to the IT side), through a firewalled, authenticated, encrypted link. Nothing on the business network reaches a PLC directly. The payoff compounds with the CVE section below: when a component sits inside a properly drawn zone, its exposure drops, and exposure is the difference between a CVSS (vulnerability-severity score, defined below) base score — the score for the flaw's intrinsic technical severity, before any account of where it actually sits — and your actual risk.
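The zone-and-conduit rule is, at bottom, an allowlist: a path between two zones exists only if a conduit for it has been declared. The Python sketch below models that default-deny lookup. The zone labels and the single conduit entry are assumptions drawn from the description above; a real site would declare its own and enforce them in firewalls, not in application code.

```python
# Hypothetical zone labels for this stack.
ZONES = {"ot", "it_dmz", "business"}

# The only declared conduit: OT to IT/DMZ through the collector.
CONDUITS = {("ot", "it_dmz"): "collector"}


def route(src_zone: str, dst_zone: str) -> str:
    """Return the conduit traffic must use, or raise if none is declared."""
    if src_zone not in ZONES or dst_zone not in ZONES:
        raise ValueError("unknown zone")
    if src_zone == dst_zone:
        return "intra-zone"
    conduit = CONDUITS.get((src_zone, dst_zone))
    if conduit is None:
        # Default deny: no declared conduit means no path.
        raise PermissionError(f"no conduit from {src_zone} to {dst_zone}")
    return conduit


if __name__ == "__main__":
    print(route("ot", "it_dmz"))
    try:
        route("business", "ot")
    except PermissionError as err:
        print("blocked:", err)
```

Keying the table on an ordered pair makes direction explicit: a conduit from OT to IT does not imply one from IT back to OT.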

Three labeled security zones left to right: an OT zone containing PLC slash DCS linked to an OPC UA server BR101, a firewall-and-TLS conduit holding the OPC UA to MQTT collector, and an IT zone with Mosquitto, TimescaleDB plus Postgres, and Grafana; a read-mostly arrow crosses into the conduit and an MQTT slash TLS arrow crosses into the data stack.

That read-mostly arrow is not decoration — it is the NAMUR Open Architecture instinct (the process-industry pattern of adding a second, read-only monitoring-and-analytics channel that never writes back to the validated control system) we built on in Chapter 1: monitoring and analytics tap a second channel, they do not reach back and write to validated controls.

It is worth being concrete about what crosses that conduit, because the segmentation only makes sense once you picture the OT side as a real mAb line rather than an abstract "skid." The OPC UA tags streaming out are the in-process signals of named unit operations: the production bioreactor's pH, dissolved oxygen, and feed rate; the Protein A capture column's UV trace and conductivity as the antibody binds and elutes; the low-pH viral-inactivation hold's pH and timer; the polishing and viral-filtration steps' pressures; and the final UF/DF skid's transmembrane pressure as the drug substance is concentrated and buffer-exchanged. Each of those is a validated control on the OT side — adjust a Protein A elution gradient or a viral-hold pH from the IT network and you have changed how a medicine is purified, which is precisely the write the read-mostly conduit forbids. The historian and analytics get every one of those readings; they get to write back none of them.

Now the uncomfortable confession that runs through this whole book. The dev broker config we shipped is wide open on purpose, and it tells you so in examples/platform/mosquitto/mosquitto.conf:

# examples/platform/mosquitto/mosquitto.conf
# Mosquitto broker config for the local dev stack (Chapter 7).
# Dev-only: anonymous access on the plain 1883 listener. Chapter 28 (operating &
# securing) replaces this with TLS + per-client ACLs; never ship anonymous in
# a real plant.
listener 1883
allow_anonymous true

# enable the $SYS topic tree so the healthcheck can confirm the broker is alive
sys_interval 10

persistence true
persistence_location /mosquitto/data/
log_dest stdout

allow_anonymous true on a plaintext 1883 listener is perfect for a laptop and catastrophic for a plant. This chapter is exactly where that promise comes due. The hardened version moves to TLS on 8883, requires client certificates, and enforces per-client ACLs so a compromised sensor account cannot subscribe to everything:

# ops/mosquitto.tls.conf  (illustrative hardening for production)
listener 8883
allow_anonymous false
cafile   /mosquitto/certs/ca.crt
certfile /mosquitto/certs/server.crt
keyfile  /mosquitto/certs/server.key
require_certificate true
acl_file /mosquitto/config/acl

Those certificates are themselves an operational burden: they expire, and an expired broker or OPC UA cert is a self-inflicted outage that always seems to land on a holiday. You need a tracked inventory of every cert with its expiry, automated rotation where you can, and OpenSSL or a small internal CA to mint them. And the private keys, the repo1-cipher-pass, the database password — none of those belong in a compose file or a Git repo. They belong in a secrets manager, injected at runtime. The ${POSTGRES_PASSWORD:-bioproc} default you saw above is a dev convenience; in production that variable is sourced from a vault, never typed.
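The tracked inventory only helps if something reads it before the expiry date does. The Python sketch below lists certificates that expire within a warning window. The inventory is a plain mapping of name to expiry time that you would populate from your own records; the certificate names and the 30-day window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expiring(inventory, now, warn_days=30):
    """Return names of certificates that expire within `warn_days` of `now`.

    `inventory` maps a certificate name to its not-after time. Already
    expired certificates are included, since they need action first.
    """
    horizon = now + timedelta(days=warn_days)
    return sorted(
        name for name, not_after in inventory.items() if not_after <= horizon
    )


if __name__ == "__main__":
    utc = timezone.utc
    certs = {
        "mosquitto-server": datetime(2025, 7, 1, tzinfo=utc),
        "opcua-br101": datetime(2025, 12, 24, tzinfo=utc),
        "internal-ca": datetime(2030, 1, 1, tzinfo=utc),
    }
    print(expiring(certs, now=datetime(2025, 12, 1, tzinfo=utc)))
```

Run on a schedule and wired to an alert, a check like this turns the holiday outage into a ticket raised a month earlier.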

A four-pillar diagram of day-two operations: HA, backup and PITR, OT/IT segmentation with TLS and secrets, and CVE-watch plus self-observability, wrapped by Annex 11 and IEC 62443.

The four pillars of day-two operations for the OSS bioprocess stack — availability, recoverability, defended zones, and a watched, patched stack — each tied to the regulation or standard that makes it non-optional. Original diagram by the authors, created with AI assistance.

CVE response: the NanoMQ cautionary tale

Choosing a lean open-source component does not exempt you from maintaining it — if anything, it raises the stakes, because the maintenance is now your job. Consider the broker you didn't pick. NanoMQ is an attractively tiny MQTT broker, and precisely because it is small it has had to be patched. In 2026 an advisory landed: an out-of-bounds read in the MQTT v5 Variable Byte Integer parser, get_var_integer(), remotely triggerable by a crafted packet against versions up to and including 0.24.6 [7]. The National Vulnerability Database catalogued it as CVE-2026-21888, classified CWE-125 (out-of-bounds read), rated High at CVSS 3.1 base 7.5 [8].

That score is not a vibe; it comes from a defined model. The Common Vulnerability Scoring System turns a vulnerability's characteristics into a vector string and a 0–10 number; this particular CVE was scored under CVSS 3.1, but the same principle is made explicit in the later CVSS v4.0 specification — the base score measures technical severity, not your risk [9]. A 7.5 on an internet-facing broker is a fire drill; the same 7.5 on a broker locked inside an IEC 62443 OT zone with no inbound exposure is something you schedule into the next maintenance window. Severity is the input; your segmentation and exposure turn it into a triage decision. This is why the previous section pays off here — good zoning literally lowers the real-world risk of a given CVE.
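To see that the number really does fall out of a formula, the Python sketch below implements the CVSS 3.1 base-score equations for the scope-unchanged case, using the metric weights and rounding rule from the published specification. The vector in the example is an assumption: the advisory's actual vector is not quoted here, and this is simply one network-reachable, no-privileges vector that produces 7.5.

```python
import math

# Metric weights from the CVSS 3.1 specification (scope-unchanged values).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as the specification defines it."""
    scaled = int(round(value * 100000))
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the CVSS 3.1 base score for a scope-unchanged vector."""
    parts = dict(item.split(":") for item in vector.split("/"))
    if parts.get("S") != "U":
        raise NotImplementedError("this sketch covers scope-unchanged only")
    w = {metric: WEIGHTS[metric][parts[metric]] for metric in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    return roundup(min(impact + exploitability, 10))


if __name__ == "__main__":
    # Assumed vector for illustration only.
    print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H"))
```

Nothing in those inputs describes your network. Where the component sits enters only afterwards, in triage.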

So the day-two loop is: inventory → watch → triage → patch → re-verify. The inventory is a software bill of materials (SBOM): the operator generates one with make sbom (Syft → CycloneDX per pinned compose image, already in this repo) and pins every component by digest with make lock. The Grype/Trivy CVE-scan step and the supplier register are taken up in the validation chapter; the scan target is not yet wired into this repo's Makefile, so treat that step as the recommended shape rather than a make target you can run today. A CVE-watch runbook then subscribes to advisories for every pinned image. Each new advisory gets a CVSS-informed triage against where the component actually sits. Patching means bumping the pinned tag and digest, rebuilding, and re-running the test suite — because in a validated environment a patch is a change, and a change must be re-verified, not just applied. And the scanners themselves belong on the threat surface, not above it: treat scanner binaries and their feeds as suppliers too, run them against a reviewed allowlist, and make scanner provenance part of the assessment rather than blind trust — supply-chain compromises of widely used developer tooling are a recurring industry pattern, not a hypothetical.
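The watch and triage steps of that loop can be sketched in a few lines of Python. The sketch matches an advisory's "up to and including" version bound against an inventory, then turns severity plus exposure into an action. The inventory shape, the deployment names, the 7.0 cut-off, and the two action labels are hypothetical policy, and the version parser assumes purely numeric dotted versions.

```python
def parse_version(text: str):
    """Parse a dotted numeric version such as '0.24.6' into a tuple."""
    return tuple(int(part) for part in text.split("."))


def affected_deployments(inventory, component, last_affected):
    """Return deployments running `component` at or below `last_affected`.

    `inventory` maps a deployment name to a (component, version) pair.
    """
    limit = parse_version(last_affected)
    return sorted(
        name
        for name, (comp, version) in inventory.items()
        if comp == component and parse_version(version) <= limit
    )


def triage(base_score: float, exposed: bool) -> str:
    """Hypothetical policy: severity is the input, exposure decides urgency."""
    if exposed and base_score >= 7.0:
        return "patch now"
    return "next maintenance window"


if __name__ == "__main__":
    inventory = {
        "edge-broker": ("nanomq", "0.24.6"),
        "lab-broker": ("nanomq", "0.24.7"),
        "core-broker": ("mosquitto", "2.0.22"),
    }
    hits = affected_deployments(inventory, "nanomq", "0.24.6")
    print(hits, triage(7.5, exposed=False))
```

Whatever the triage answer, the patch step that follows is still a controlled change and ends with the test suite, not with the tag bump.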

Field evidence: why segmentation, not severity, decides your risk

The claim that "exposure, not the base score, is what bites you" is not a rhetorical flourish; it is what internet-wide measurement keeps finding. A landmark ZMap-class survey of the public IPv4 space across five industrial-control protocols found more than 60,000 publicly reachable ICS systems, including roughly 23,000 genuine Modbus devices answering on port 502 and about 2,800 Siemens S7 controllers across 75 countries. These are devices that, by protocol design, do no authentication and will execute commands from anyone who can route a packet to them [11]. The same pattern recurs for the message layer: a Shodan-based scan found more than 49,000 publicly reachable MQTT brokers, over 32,000 of them with no password protection at all [12]. Read those two numbers together and the lesson for our stack is exact: the NanoMQ CVE-2026-21888 out-of-bounds read is a 7.5 everywhere, but the population that actually gets exploited is the one sitting on a routable address with no zone in front of it. The defense that moves your real risk is not a lower CVSS score (there is no such thing); it is a conduit. A broker or database that an attacker cannot reach is one whose next critical advisory you can schedule into a maintenance window instead of firefighting at 3 a.m. This is also why the PITR runbook earns its keep: when the field-failure mode is "an exposed instance got compromised or corrupted," a tested restore to a known-good point is the recovery half of a posture whose prevention half is segmentation.

Why a learning model rides on this same plumbing

The analytics chapters that follow do not just consume this platform; they inherit its day-two disciplines wholesale, because a deployed model is one more change-controlled component that decays, must be watched, and must be recoverable. Three of this chapter's habits map almost one-to-one onto MLOps (the operational lifecycle of a model in production):

PITR and the dataset hash are model lineage. A model is only as reproducible as the data it was fit on. The same WAL-and-base-backup discipline that lets you rebuild the database to 14:32 yesterday is what lets you pin the exact training snapshot behind a released model — so a retrain records a dataset hash and a version bump rather than a vague "we refit on recent data," exactly the locked-model, governed-retrain loop Book 5 builds. A model whose training data you cannot rewind to is a model you cannot revalidate.
A CVE re-verify is a model revalidation. The rule that "a patch is a change, and a change must be re-verified" is the identical rule that governs a learning model under GMP: you may not silently swap the weights any more than you may silently bump a pinned tag. Both go through the same controlled-change gate, which is why validating something that learns reuses this chapter's posture rather than inventing a new one.
Segmentation lowers exposure; the applicability domain lowers a different exposure. Just as a CVSS base score becomes a real risk only once you account for where a component sits, a model's headline accuracy becomes a real risk only once you account for where its inputs sit — inside the region it was calibrated on, or outside it. A model asked to predict on inputs beyond its applicability domain (the input region the training data actually covered) is extrapolating, the data-driven equivalent of an unsegmented, internet-facing broker: technically the same model, operationally a fire drill. The drift detectors of the MLOps chapter are the model-world counterpart of the CVE-watch runbook — both watch for the moment a once-safe component stops being safe, and neither can be skipped without converting a deferred incident into a 3 a.m. one.
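The dataset hash in the first of those habits is simple to produce. The Python sketch below fingerprints a training snapshot so that a retrain can record exactly which data it saw. The canonical form (rows as sorted-key JSON, sorted and newline-joined) and the field names are assumptions; any scheme works as long as it is fixed and written down.

```python
import hashlib
import json


def dataset_hash(rows) -> str:
    """Return a SHA-256 fingerprint of a list of row dictionaries.

    Rows are serialised with sorted keys and then sorted as strings, so
    neither column order nor row order changes the hash.
    """
    canonical = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":")) for row in rows
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


if __name__ == "__main__":
    snapshot = [
        {"batch": "B-001", "tag": "pH", "value": 7.02},
        {"batch": "B-001", "tag": "DO", "value": 41.5},
    ]
    print(dataset_hash(snapshot))
```

Stored next to the model version, the hash is what lets a later PITR rewind be checked: restore to the recorded time, re-extract, and the fingerprint either matches or it does not.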

The lesson is the same one the whole chapter argues: the infrastructure is not separate from the analytics. A governed, recoverable, self-watched stack is the precondition for trusting any model trained on the data it holds.

Watching the watcher with VictoriaMetrics

You cannot operate what you cannot see. The platform needs to monitor itself: broker connection counts, database replication lag, disk headroom, backup success, certificate expiry, service health. For that the stack ships VictoriaMetrics, pinned to victoriametrics/victoria-metrics:v1.108.1 and gated behind the ops/analytics profiles in the same compose file:

# examples/platform/compose/compose.yaml
  # --- analytics ---------------------------------------------------------
  victoriametrics:
    image: victoriametrics/victoria-metrics:v1.108.1
    profiles: ["analytics", "ops"]
    <<: *restart
    ports: ["8428:8428"]

VictoriaMetrics is the Apache-2.0 metrics store we ship instead of InfluxDB. That is a deliberate dodge of the InfluxDB v3 license flip, the same class of move as EMQX's: the newer InfluxDB core stepped away from a fully permissive open-source license, so we avoid building on it. It is the same instinct that kept us on Mosquitto over BSL-era EMQX. Operationally it is generous. A single node comfortably handles sub-million-sample-per-second ingestion. HA at that tier means running two identical single-node instances fed by replicated remote-write: the metrics stream is written to both instances at once, so either can serve if the other fails. Only at larger scale do you reach for the cluster version, whose HA instead comes from a replication factor (each sample is stored on more than one cluster node) [10]. For one mAb line, single-node is plenty, and that simplicity is itself a feature: fewer moving parts to back up, patch, and validate. The metrics it scrapes feed the alerts in Grafana that page you before the disk fills, not after.
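
An alert that pages "before the disk fills, not after" needs a projection, not just a threshold. The sketch below fits a least-squares line to recent disk-usage samples and estimates the seconds remaining until capacity, the same idea as PromQL's predict_linear function. The function names, the sample values, and the four-hour paging horizon are assumptions for illustration, not the stack's actual alert rule.

```python
def seconds_until_full(samples, capacity_bytes):
    """Project time-to-full from (unix_seconds, used_bytes) samples.

    Returns None when usage is flat or shrinking, otherwise the estimated
    number of seconds after the latest sample at which the disk is full.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    # Ordinary least-squares slope: bytes written per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    latest_t = max(t for t, _ in samples)
    fitted_now = mean_u + slope * (latest_t - mean_t)
    return max(0.0, (capacity_bytes - fitted_now) / slope)


def should_page(samples, capacity_bytes, horizon_seconds=4 * 3600):
    """Page when the projected fill time falls inside the horizon."""
    remaining = seconds_until_full(samples, capacity_bytes)
    return remaining is not None and remaining <= horizon_seconds


# Illustrative values: usage grows 10 bytes/s toward a 1000-byte "disk".
samples = [(0, 100), (10, 200), (20, 300)]
print(seconds_until_full(samples, 1000))  # 70.0
print(should_page(samples, 1000, horizon_seconds=60))  # False
```

In practice the query engine would evaluate this in the alert rule itself. The Python version only makes the arithmetic explicit, so the chosen horizon can be reasoned about and tested.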

The plant data platform readiness review

This chapter covers six separate disciplines (HA, PITR, segmentation, secrets, CVE response, self-monitoring), and the quiet failure mode is treating them as a list you read rather than one you run. So before this platform is allowed to hold GMP-relevant data, it should pass a single readiness review: a go/no-go gate aimed at the platform itself, in which every line resolves to an artifact an inspector can see or a command an operator can run. make test proves the stack functions; this review proves it is fit to operate.

| Readiness area | The check | How you prove it | Go condition |
| --- | --- | --- | --- |
| Functional acceptance | The stack builds, comes up, and its tests pass | `make up` then `make test` (the committed determinism + db + analytics suite) | green on a clean checkout |
| Data integrity & audit chain | The append-only record's hash chain is unbroken and tamper-evident | `verify_chain()` over the audit table (the ALCOA+ chapter) | verifies clean; a planted edit is detected |
| Contextualization | Raw tags actually join to recipe, equipment, and batch, with no orphaned readings | the contextualization join, spot-checked against a known batch | every reading resolves to its `batch_id` and asset |
| Backup & restore drill | You can rebuild the database to a chosen moment and the restore verifies | a logged quarterly restore drill via pgBackRest PITR (Annex 11 §7) | restored data checks out against a known point |
| Availability & continuity | Failover is documented and rehearsed; the broker's honest single-node limit is written down, not papered over | the PostgreSQL standby promotion plus the Mosquitto fast-restart SOP (Annex 11 §16) | a rehearsed failover; a continuity plan on file |
| Segmentation & secrets | OT/IT zones are enforced, TLS terminates where it should, and no secret sits in a compose file | the network diagram plus a `docker compose` config review | conduits in place; secrets externalized and rotated |
| Supply chain & CVE posture | Every image is pinned by digest, an SBOM exists, and there is a runbook for the next advisory | `make lock` (→ `versions.lock`) + `make sbom` + the CVE-watch runbook | digests pinned; SBOM current; runbook owned |
| Observability | The stack watches itself, so a filling disk or a stalled ingest is caught before an inspector finds it | the VictoriaMetrics metrics feeding Grafana alerts | alerts fire on a simulated fault |
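
To make the table "a list you run", each row can be wired to a probe that returns pass or fail. The sketch below is one possible shape for such a gate, not the repo's implementation: Check and run_review are hypothetical names, and the lambdas stand in for real probes such as running make test or a restore drill and inspecting the result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    area: str                    # a row of the readiness table
    probe: Callable[[], bool]    # returns True when the go condition holds


def run_review(checks):
    """Run every probe and return (go, per-area results).

    A probe that raises counts as a failure, and an empty review is a no-go:
    the gate must never pass because nothing was checked.
    """
    results = {}
    for check in checks:
        try:
            results[check.area] = bool(check.probe())
        except Exception:
            results[check.area] = False
    go = bool(results) and all(results.values())
    return go, results


# Stand-in probes; real ones would shell out or query the running stack.
checks = [
    Check("Functional acceptance", lambda: True),
    Check("Backup & restore drill", lambda: False),
]
go, results = run_review(checks)
print("GO" if go else "NO-GO")
for area, passed in results.items():
    print(f"  {area}: {'pass' if passed else 'FAIL'}")
```

The gate runs every check rather than stopping at the first failure, so the person signing sees the platform's whole posture in one pass.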

Two honesties close the review, and they are this book's whole thesis in one place. First, a green review proves the platform is operable and defensible, not that it is validated — the GAMP 5 / CSA lifecycle, the supplier register, and the qualified sign-off are the wrapper that turns a passing checklist into a validated system, and open source hands you the engine, not that wrapper. Second, the honest single-site limits stay written down rather than hidden: Mosquitto is not clustered, the pgBackRest config is still illustrative in this repo, and the CVE-scan step is not yet a make target — a readiness review that concealed those would be exactly the paper-over this chapter warns against. The gate's job is not to declare victory; it is to make the platform's real posture legible, line by line, to the person who has to sign for it.

Why it matters

A platform is not finished when it works once; it is finished when it keeps working through failure, attack, and time — and can prove it did. Annex 11 turns availability, backup, and continuity into auditable obligations [1]; IEC 62443 turns "keep OT and IT apart" into a defensible architecture [6]. Skipping day-two operations is not a shortcut; it is a deferred outage and a deferred audit finding, both with interest.

In the real world

Real biomanufacturers run this hybrid honestly. The PostgreSQL/PITR story is genuinely production-grade in the open — pgBackRest backs some of the largest regulated databases in the world [4][5]. The broker-HA story is where money changes hands: shops that need true clustered MQTT either license EMQX after its 2025 BSL move [3], buy HiveMQ, or — most commonly on a single GMP site — run a closely watched single broker with a fast-failover SOP and accept Mosquitto's honest single-node limits [2]. The pattern that survives contact with an inspector is never "we used open source"; it is "we used open source inside a validated lifecycle, with tested restores, segmented networks, rotated certs, a CVE runbook, and a stack that watches itself." Pure OSS gets you most of the way. The last mile — clustered broker HA, vendor accountability, a turnkey Part 11 wrapper (21 CFR Part 11, the FDA rule governing electronic records and electronic signatures) — is hybrid, and saying so is the whole point of this book.

Key terms

MQTT / broker: MQTT is the lightweight publish/subscribe messaging protocol the plant uses to move sensor readings; a broker is the server that receives published messages and relays them to subscribers.
High availability (HA): designing so a single component failure does not take the service down; for databases, a hot standby; for brokers, a cluster.
PITR (point-in-time recovery): restoring a database to any chosen moment by replaying archived WAL onto a base backup.
WAL (write-ahead log): PostgreSQL's ordered record of every change; the raw material of replication and PITR.
pgBackRest: the permissive open-source PostgreSQL backup/restore tool used for encrypted, off-host, targeted recovery.
Open source vs source-available: OSI-open means the license is on the Open Source Initiative's approved list, so the software is free to read, modify, and run at any scale (e.g. Apache-2.0, the PostgreSQL License; the permissive kind adds almost no conditions). Source-available means you can read the source but production use is restricted (e.g. BSL, TimescaleDB TSL Community); this book treats only OSI-open as truly free.
BSL (Business Source License): a source-available license that restricts production use of certain features (e.g. EMQX clustering from 5.9) — free to read, not free to run at scale.
IEC 62443: the OT-security standard family; Part 3-3 defines zones, conduits, and Security Levels.
Zones and conduits: segmenting a plant into trust zones with controlled, monitored paths between them.
CVE / CVSS: a catalogued vulnerability identifier (e.g. CVE-2026-21888) and the 0–10 model that scores its technical severity.
SBOM (software bill of materials): the machine-readable inventory of every component and version in the stack, the basis for CVE watching.
Self-observability: the platform monitoring its own health and metrics, here via VictoriaMetrics.
Readiness review: the go/no-go gate that consolidates the chapter's separate disciplines — functional acceptance (make test), audit-chain verification (verify_chain()), contextualization, a logged restore drill, failover, segmentation/secrets, supply-chain/CVE posture, and observability — into one runnable checklist that proves the platform is fit to operate, while stating plainly that operable is not the same as validated.
Sqitch migration: a version-controlled database change as a named, reversible, self-verifying unit — a deploy, a verify, and a revert script per change, with verify = true auto-reverting any change that cannot prove it worked.
Security Level (SL): in IEC 62443, the target resilience (SL 1–4) assigned to a zone, reflecting the strength of attacker it must withstand and driving the controls on its conduits.
SHACL / PROV-O: the graph-world counterparts of the database's safeguards — SHACL is a shape language that validates whether RDF data conforms to required patterns (the closed-world analogue of a CHECK constraint), and PROV-O is the W3C vocabulary for recording who generated a record and when, the provenance backbone of an audit trail.
Competency question: a plain-English question a data model or ontology must be able to answer — here, "rebuild the schema and data exactly as they stood at a chosen instant" — used as a pass/fail design test.
MLOps / model lineage / applicability domain: MLOps is the operational lifecycle of a deployed model (drift detection, governed retraining, rollback); model lineage is the pinned record of the exact data and version behind a released model; the applicability domain is the input region a model was actually trained on, outside which its predictions are extrapolation.

Where this leads

The platform is now standing, defended, recoverable, and watched. With a trustworthy stream of well-governed data flowing through it, we can finally ask the most rewarding question of all: what can we learn from the data? The next chapter, Process Analytics: SPC, MVDA & Soft Sensors, turns the historian and the lab tables into statistical process control charts, multivariate batch models, and a Raman-to-titer soft sensor — the analytics payoff that the whole platform was built to enable.
