Operating, Scaling & Securing the Platform
Where we are: Part VI · Operating at Scale. The stack works; now we keep it alive, locked down, and recoverable when something breaks at 3 a.m.
Building a kitchen is the fun part. Running a restaurant that passes a health inspection every single day for ten years is the hard part. This chapter is about day two: the night the disk fills, the morning a new CVE drops, the audit that asks "show me you can restore last Tuesday." Open source builds you a beautiful kitchen. The fire suppression system, the locks on the doors, and the logbook the inspector reads: those you have to install and maintain yourself, and a few of them you will end up buying.
What this chapter covers
We have spent twenty-four chapters making the platform do things. This chapter makes it survive things. We will walk through four day-two realities where open source gets honestly tested:
- High availability (HA): why "just cluster it" is harder than it sounds, and how the free MQTT-broker clustering story shrank in 2025.
- Backup and point-in-time recovery (PITR): the one thing a regulator will absolutely ask you to demonstrate, and the tooling that makes it routine for PostgreSQL.
- Network segmentation, TLS, certificates, and secrets: keeping the plant floor and the data stack in separate, defended zones.
- CVE response and self-observability: patch cadence, a cautionary tale about a lean broker, and watching the watcher with VictoriaMetrics.
Throughout, cGMP (current Good Manufacturing Practice, the binding quality regulations for making medicines) and EU GMP Annex 11 set the bar, and we are honest about where pure open source stops short of it.
Day two is the real exam
Everything before this chapter assumed the happy path: services start, data flows, dashboards render. Day-two operations is the unglamorous discipline of assuming the unhappy path. A node dies mid-batch. A certificate silently expires on a Sunday. A dependency you have never heard of gets a critical advisory. The regulator does not grade you on the demo; they grade you on the recovery.
This matters more in our world than in most. EU GMP Annex 11 makes availability, backup, and business continuity explicit obligations: §7 requires regular, verified backups of data and a check that restored data is accurate, and §16 requires documented business-continuity arrangements so a critical system failure does not stop you from making or releasing medicine [1]. You do not get to treat resilience as a nice-to-have. It is a line item an inspector reads back to you.
High availability, and the shrinking free-clustering story
HA means: a single component can fail and the system keeps serving. For a database, that is a replica ready to take over. For a message broker, that is a cluster of brokers sharing the load and the state.
Here is the honest part. The open-source broker we ship in the core stack, Eclipse Mosquitto, is a deliberately single-broker design. It is small, fast, rock-solid, and pinned to the stable eclipse-mosquitto:2.0.22 line in our compose file. What it is not is clustered. Mosquitto's answer to spanning multiple brokers is its bridge feature, which connects independent brokers and forwards selected topics between them. That is useful for federating sites, but it is not native clustering with shared session state and automatic failover [2]. If the one Mosquitto process is down, its clients have nowhere to reconnect until it comes back.
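While the single broker restarts, every client is on its own, so reconnect behaviour deserves a line in the failover SOP. What follows is a minimal sketch of capped exponential backoff with jitter. The function names and the 1 s base / 60 s cap are illustrative assumptions, not Mosquitto settings or anything shipped in the repo.

```python
import random


def backoff_ceilings(base=1.0, cap=60.0, attempts=8):
    """Upper bound (seconds) for each reconnect attempt: base * 2^n, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(attempts)]


def backoff_delays(base=1.0, cap=60.0, attempts=8, rng=None):
    """Actual delays with full jitter, so clients do not all reconnect at once.

    base, cap and attempts are hypothetical defaults chosen for illustration.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, ceiling) for ceiling in backoff_ceilings(base, cap, attempts)]


ceilings = backoff_ceilings()
delays = backoff_delays(rng=random.Random(0))  # seeded only to make the demo repeatable
print(ceilings)
```

The ceilings are what the SOP can quote as the worst-case wait per attempt. The jitter matters once more than a handful of clients reconnect to the same broker at the same moment.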
The obvious upgrade used to be EMQX, which offered free multi-node clustering. In 2025 that door narrowed: from EMQX 5.9 the project moved from Apache 2.0 to the Business Source License (BSL 1.1), and running a production multi-node cluster now requires a commercial license [3]. This is the recurring 2026 license trap in miniature: the feature you most want for HA is exactly the one that is no longer free. So the chapter is blunt: for a regulated single-site mAb line, a well-monitored single Mosquitto plus a fast restart and a documented failover SOP is a defensible, honest posture; true broker HA at scale is where you either pay EMQX, run HiveMQ, or accept a hybrid commercial component. Do not pretend a Mosquitto bridge is a cluster.
The database side is friendlier. Our postgres service is timescale/timescaledb:2.17.2-pg17, that is, PostgreSQL with the TimescaleDB extension, so the historian hypertables and the ISA-88/95 batch model live in one joinable database, defined once in examples/platform/compose/compose.yaml:
# examples/platform/compose/compose.yaml
services:
  # --- core --------------------------------------------------------------
  postgres:
    # timescale/timescaledb IS PostgreSQL + TimescaleDB, so the historian
    # hypertable and the ISA-88/95 batch model live in one joinable database.
    image: timescale/timescaledb:2.17.2-pg17
    profiles: ["core"]
    <<: *restart
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-bioproc}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-bioproc}
      POSTGRES_DB: ${POSTGRES_DB:-bioproc}
    ports: ["5432:5432"]
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ../db:/docker-entrypoint-initdb.d:ro   # 00-60 schema files run on first init
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-bioproc} -d ${POSTGRES_DB:-bioproc}"]
      interval: 5s
      timeout: 5s
      retries: 20
Notice three operational habits baked into that block. The image is pinned by tag, so a silent minor bump can never smuggle in a Debian/glibc jump that corrupts indexes. The compose header also notes that the matching manifest digests are meant to be recorded in a versions.lock companion file (the digest-pinning artifact we build in the supply-chain chapter; the repo ships the tag pin today, the lock file is that chapter's deliverable). The data lives on a named volume (pgdata), not the container's ephemeral layer, so the container is disposable but the data is not. And there is a real healthcheck, so the orchestrator (and the make up poller) knows the difference between "process started" and "actually ready to serve."
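The gap between "started" and "ready" is easy to see in a polling loop. The sketch below loosely mirrors the healthcheck numbers above (5 s interval, 20 retries). wait_until_ready is a hypothetical helper written for this illustration, not the repo's make up poller, and a real probe would shell out to pg_isready rather than use the canned answers shown here.

```python
import time


def wait_until_ready(probe, interval=5.0, retries=20, sleep=time.sleep):
    """Call probe() until it returns True or the retry budget is spent.

    Returns the number of attempts used; raises TimeoutError on exhaustion.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        if attempt < retries:
            sleep(interval)  # wait before the next probe, never after the last
    raise TimeoutError(f"service not ready after {retries} attempts")


# Demo with a fake probe: not ready twice, then ready. Sleeping is stubbed out.
answers = iter([False, False, True])
attempts = wait_until_ready(lambda: next(answers), sleep=lambda seconds: None)
print(attempts)
```

With the defaults, the worst case is 20 probes separated by 19 sleeps of 5 s, which is roughly the budget the compose healthcheck gives PostgreSQL on first init.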
PostgreSQL also gives you real streaming replication: a primary streams its write-ahead log to one or more hot standbys that can be promoted in seconds. That is genuine, free, production-grade database HA, and it is the strongest HA story in our whole stack. The asymmetry is the lesson: the relational core clusters well in the open; the broker does not.
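A standby is only useful if it is close behind the primary, and that distance is measurable. PostgreSQL reports WAL positions as LSNs of the form X/Y (two hexadecimal numbers), and the byte gap between the primary's current LSN and the standby's replay LSN is the replication lag. The sketch below does that arithmetic in plain Python; the helper names are illustrative, and in practice the two values would come from pg_current_wal_lsn() and the replay_lsn column of pg_stat_replication.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(primary_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the standby has not yet replayed."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replay_lsn)


lag = replication_lag_bytes("0/3000060", "0/3000000")
print(lag)
```

The server can do the same subtraction with pg_wal_lsn_diff(); the point of the sketch is only to show what the number means before you put an alert threshold on it.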
Backup and point-in-time recovery
If you remember one operational duty from this book, make it this one. A backup you have never restored is a rumor. PITR turns "we lost data" into "we rewound to 14:32 yesterday, just before the bad migration."
PostgreSQL's PITR works by combining two things: a base backup (a full physical copy of the data directory) plus the continuously archived WAL, the stream of write-ahead-log segments that record every change since that base. To recover, you restore the base and then replay WAL forward to any moment you choose [4]. That last clause is the magic: you are not limited to the last nightly snapshot; you can land on a specific transaction or timestamp.
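The "restore the base, then replay forward" rule implies a selection step: the base backup must have finished before the recovery target, and you normally want the most recent one that did, so there is less WAL to replay. A minimal sketch of that choice, with hypothetical timestamps and a helper name invented for this example:

```python
from datetime import datetime


def choose_base_backup(backup_stop_times, target):
    """Pick the most recent base backup that finished before the recovery target."""
    candidates = [stop for stop in backup_stop_times if stop < target]
    if not candidates:
        raise ValueError("no base backup precedes the recovery target")
    return max(candidates)


# Hypothetical weekly fulls and a target just before a bad migration.
backups = [datetime(2025, 3, 2, 1, 0), datetime(2025, 3, 9, 1, 0)]
chosen = choose_base_backup(backups, datetime(2025, 3, 11, 14, 32))
print(chosen)
```

The sketch assumes the archived WAL is unbroken between the chosen backup and the target. A gap in the archive makes every later target unreachable from that base, which is one reason archive monitoring belongs in the runbook.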
Doing the WAL plumbing by hand is fiddly, so the open-source tool of choice is pgBackRest (MIT License: permissive, free). It manages full, differential, and incremental backups; supports off-host and encrypted repositories so your backups are not sitting next to the thing they are meant to protect; and performs time- or transaction-targeted PITR restores [5]. Below is a representative ops/pgbackrest.conf of the shape the chapter recommends. It is illustrative configuration, not yet a runnable service in the repo, shown here so you can see exactly what the operator wires up:
# ops/pgbackrest.conf (illustrative: the operator's day-two artifact)
[global]
repo1-path=/var/lib/pgbackrest
# keep 4 weekly full backups
repo1-retention-full=4
# encrypt the repository at rest
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<from-secrets-manager>
start-fast=y
process-max=2

[bioproc]
pg1-path=/var/lib/postgresql/data
A weekly full plus daily differentials, with continuous WAL archiving in between, is a sane starting cadence. The point an auditor cares about is not the config; it is the restore runbook: a written, tested procedure that proves you can rebuild the database to a chosen point and that the restored data verifies. Annex 11 §7 does not just want backups; it wants checked backups, and §16 wants the continuity plan that uses them [1]. A quarterly restore drill, logged, is how you turn "we have backups" into evidence.
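"The restored data verifies" needs a concrete check before it can go in a drill log. One simple approach, sketched here under assumptions, is to fingerprint each table on the source and on the restored copy and list any that differ. The table names and rows are made up, and a real drill would pull the rows (or server-side checksums) from the two databases as of the same recovery point.

```python
import hashlib


def table_fingerprint(rows):
    """Order-independent SHA-256 over a table's rows (each row a tuple)."""
    digest = hashlib.sha256()
    for line in sorted(repr(row) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def verify_restore(source_tables, restored_tables):
    """Return sorted names of source tables that are missing or differ after restore."""
    mismatched = []
    for name, rows in source_tables.items():
        restored = restored_tables.get(name)
        if restored is None or table_fingerprint(rows) != table_fingerprint(restored):
            mismatched.append(name)
    return sorted(mismatched)


source = {"batch": [(1, "B-001"), (2, "B-002")], "ts.sensor": [(1, 36.9)]}
good_restore = {"batch": [(2, "B-002"), (1, "B-001")], "ts.sensor": [(1, 36.9)]}
bad_restore = {"batch": [(1, "B-001")], "ts.sensor": [(1, 36.9)]}
print(verify_restore(source, good_restore), verify_restore(source, bad_restore))
```

An empty list, recorded with the date, the recovery target, and the operator's name, is the kind of evidence the quarterly drill is meant to produce.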
The historian needs the same care. Because TimescaleDB is PostgreSQL, the hypertables in our ts schema are covered by the very same physical backup and WAL stream: one PITR strategy protects both the batch model and the high-rate sensor data. We deliberately use only the Apache-2 OSS feature set, so there is no proprietary compression or data-tiering layer to complicate a restore.
Drawing the OT/IT line: segmentation, TLS, and secrets
A mAb line is two worlds stitched together. On the operational technology (OT) side sit the PLCs, the DCS, the bioreactor skid's OPC UA server: long-lived, fragile, rarely patched mid-campaign. On the information technology (IT) side sit our brokers, databases, and dashboards. The cardinal rule of plant-floor security is that these worlds do not share a flat network.
The framework here is IEC 62443, specifically Part 3-3, which formalizes zones and conduits: you carve the plant into security zones, define the conduits (the only sanctioned paths) between them, and assign each a target Security Level [6]. For us that means the OPC UA server and edge gateway sit in an OT zone; the historian and broker sit in an IT/DMZ zone; and the only conduit between them is the collector, through a firewalled, authenticated, encrypted link. Nothing on the business network reaches a PLC directly.
That read-mostly arrow is not decoration; it is the NAMUR Open Architecture instinct we built on in Chapter 1: monitoring and analytics tap a second channel, they do not reach back and write to validated controls.
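The zone-and-conduit model is small enough to write down as data, which makes firewall rule reviews easier to reason about. The sketch below encodes the layout described above: one sanctioned, one-way conduit from the OT zone to the IT/DMZ zone through the collector. The host names (including the "erp" business host) and the flow_allowed helper are illustrative assumptions. Real enforcement lives in firewalls and broker ACLs, not in Python.

```python
# Which zone each host lives in (hypothetical host names).
ZONE_OF = {
    "opcua-server": "ot",
    "edge-gateway": "ot",
    "broker": "it-dmz",
    "historian": "it-dmz",
    "erp": "business",
}

# The only sanctioned cross-zone paths: (source zone, destination zone) -> conduit.
CONDUITS = {("ot", "it-dmz"): "collector"}


def flow_allowed(src, dst, via=None):
    """True if traffic from src to dst stays in one zone or uses the sanctioned conduit."""
    src_zone, dst_zone = ZONE_OF[src], ZONE_OF[dst]
    if src_zone == dst_zone:
        return True
    return via is not None and CONDUITS.get((src_zone, dst_zone)) == via


print(flow_allowed("opcua-server", "historian", via="collector"))  # sanctioned path
print(flow_allowed("erp", "opcua-server"))                         # business to PLC side
```

Because the conduit table has no entry from "it-dmz" or "business" back to "ot", any flow in that direction is rejected, which is the read-mostly direction expressed as a rule.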
Now the uncomfortable confession that runs through this whole book. The dev broker config we shipped is wide open on purpose, and it tells you so in examples/platform/mosquitto/mosquitto.conf:
# examples/platform/mosquitto/mosquitto.conf
# Mosquitto broker config for the local dev stack (Chapter 5).
# Dev-only: anonymous access on the plain 1883 listener. Chapter 25 (operating &
# securing) replaces this with TLS + per-client ACLs; never ship anonymous in
# a real plant.
listener 1883
allow_anonymous true
# enable the $SYS topic tree so the healthcheck can confirm the broker is alive
sys_interval 10
persistence true
persistence_location /mosquitto/data/
log_dest stdout
allow_anonymous true on a plaintext 1883 listener is perfect for a laptop and catastrophic for a plant. This chapter is exactly where that promise comes due. The hardened version moves to TLS on 8883, requires client certificates, and enforces per-client ACLs so a compromised sensor account cannot subscribe to everything:
# ops/mosquitto.tls.conf (illustrative hardening for production)
listener 8883
allow_anonymous false
cafile /mosquitto/certs/ca.crt
certfile /mosquitto/certs/server.crt
keyfile /mosquitto/certs/server.key
require_certificate true
# use the client certificate's CN as the MQTT username so the ACL can key on it
use_identity_as_username true
acl_file /mosquitto/config/acl
Those certificates are themselves an operational burden: they expire, and an expired broker or OPC UA cert is a self-inflicted outage that always seems to land on a holiday. You need a tracked inventory of every cert with its expiry, automated rotation where you can, and OpenSSL or a small internal CA to mint them. And the private keys, the repo1-cipher-pass, the database password: none of those belong in a compose file or a Git repo. They belong in a secrets manager, injected at runtime. The ${POSTGRES_PASSWORD:-bioproc} default you saw above is a dev convenience; in production that variable is sourced from a vault, never typed.
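A tracked inventory only helps if something reads it every day. Here is a minimal sketch of that check, assuming the inventory is a mapping of certificate name to its notAfter timestamp. The names, dates, and 30-day warning window are hypothetical; in practice the timestamps would be extracted from the certificates themselves.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(inventory, now, warn_days=30):
    """Return (name, days_left) for certs expiring within warn_days, soonest first.

    Already-expired certificates are included with a negative days_left.
    """
    horizon = now + timedelta(days=warn_days)
    due = [
        (name, (not_after - now).days)
        for name, not_after in inventory.items()
        if not_after <= horizon
    ]
    return sorted(due, key=lambda item: item[1])


now = datetime(2025, 1, 1, tzinfo=timezone.utc)  # fixed so the demo is repeatable
inventory = {
    "broker-server": now + timedelta(days=10),
    "opcua-server": now + timedelta(days=400),
    "collector-client": now - timedelta(days=1),
}
report = expiring_soon(inventory, now)
print(report)
```

Feeding days_left into the metrics store as a gauge lets the same alerting path that watches disk space also page on certificates.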
The four pillars of day-two operations for the OSS bioprocess stack (availability, recoverability, defended zones, and a watched, patched stack), each tied to the regulation or standard that makes it non-optional. Original diagram by the authors, created with AI assistance.
CVE response: the NanoMQ cautionary tale
Choosing a lean open-source component does not exempt you from maintaining it; if anything, it raises the stakes, because the maintenance is now your job. Consider the broker you didn't pick. NanoMQ is an attractively tiny MQTT broker, and precisely because it is small it has had to be patched. In 2026 an advisory landed: an out-of-bounds read in the MQTT v5 Variable Byte Integer parser, get_var_integer(), remotely triggerable by a crafted packet against versions up to and including 0.24.6 [7]. The National Vulnerability Database catalogued it as CVE-2026-21888, classified CWE-125 (out-of-bounds read), rated High at CVSS 3.1 base 7.5 [8].
That score is not a vibe; it comes from a defined model. The Common Vulnerability Scoring System turns a vulnerability's characteristics into a vector string and a 0-10 number; this particular CVE was scored under CVSS 3.1, but the same principle is made explicit in the later CVSS v4.0 specification: the base score measures technical severity, not your risk [9]. A 7.5 on an internet-facing broker is a fire drill; the same 7.5 on a broker locked inside an IEC 62443 OT zone with no inbound exposure is something you schedule into the next maintenance window. Severity is the input; your segmentation and exposure turn it into a triage decision. This is why the previous section pays off here: good zoning literally lowers the real-world risk of a given CVE.
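"Severity in, triage out" can be made explicit as a small decision table. In the sketch below, severity_band follows the CVSS v3.1 qualitative rating scale (Low below 4.0, Medium below 7.0, High below 9.0, Critical above that). Everything in triage is a hypothetical site policy invented for illustration: the exposure categories, the weights, and the action names would be set by your own quality system.

```python
BAND_RANK = {"None": 0, "Low": 1, "Medium": 2, "High": 3, "Critical": 4}

# Hypothetical exposure categories and weights; a real site defines its own.
EXPOSURE_WEIGHT = {"internet-facing": 2, "internal": 1, "isolated-zone": 0}


def severity_band(score):
    """Map a CVSS v3.1 base score to its qualitative rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


def triage(score, exposure):
    """Combine technical severity with exposure into an action (illustrative policy)."""
    urgency = BAND_RANK[severity_band(score)] + EXPOSURE_WEIGHT[exposure]
    if urgency >= 5:
        return "emergency patch"
    if urgency >= 3:
        return "next maintenance window"
    return "routine update"


print(triage(7.5, "internet-facing"), "|", triage(7.5, "isolated-zone"))
```

The same 7.5 lands in two different rows depending on where the component sits. Writing the table down also gives the change-control record something to cite when a patch is deferred.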
So the day-two loop is: inventory → watch → triage → patch → re-verify. The inventory is a software bill of materials (SBOM): the operator generates one with a tool such as Syft, scans it with Grype and Trivy, and pins every component by digest in a supplier register. The SBOM/scan toolchain and that register are the supply-chain chapter's deliverables, not yet wired into this repo's Makefile, so treat the workflow here as the recommended shape rather than a make target you can run today. A CVE-watch runbook then subscribes to advisories for every pinned image. Each new advisory gets a CVSS-informed triage against where the component actually sits. Patching means bumping the pinned tag and digest, rebuilding, and re-running the test suite, because in a validated environment a patch is a change, and a change must be re-verified, not just applied. And the scanners themselves belong on the threat surface, not above it: treat scanner binaries and their feeds as suppliers too, run them against a reviewed allowlist, and make scanner provenance part of the assessment rather than blind trust; supply-chain compromises of widely used developer tooling are a recurring industry pattern, not a hypothetical.
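The "watch" step reduces to one question per advisory: is anything in the inventory inside the affected version range? The sketch below answers it for an advisory of the "up to and including version X" shape, using the image versions pinned in this chapter as the inventory. The helper names are illustrative, the comparison handles only simple dotted numeric versions, and a real workflow would let the SBOM scanners do this matching.

```python
def parse_version(text):
    """Turn '2.0.22', 'v1.108.1' or '2.17.2-pg17' into a comparable tuple of ints."""
    core = text.lstrip("v").split("-")[0]
    return tuple(int(part) for part in core.split("."))


def affected(inventory, component, max_affected):
    """Inventory entries for `component` at or below the last affected version."""
    limit = parse_version(max_affected)
    return [
        (name, version)
        for name, version in inventory
        if name == component and parse_version(version) <= limit
    ]


# Versions as pinned in the compose excerpts in this chapter.
STACK = [
    ("eclipse-mosquitto", "2.0.22"),
    ("timescaledb", "2.17.2-pg17"),
    ("victoria-metrics", "v1.108.1"),
]

# The NanoMQ advisory does not touch this stack, because NanoMQ is not in it.
print(affected(STACK, "nanomq", "0.24.6"))
# A hypothetical inventory that did include an affected NanoMQ would be flagged.
print(affected(STACK + [("nanomq", "0.24.6")], "nanomq", "0.24.6"))
```

An empty result is itself worth logging: it records that the advisory was assessed and found not applicable.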
Watching the watcher with VictoriaMetrics
You cannot operate what you cannot see. The platform needs to monitor itself: broker connection counts, database replication lag, disk headroom, backup success, certificate expiry, service health. For that the stack ships VictoriaMetrics, pinned to victoriametrics/victoria-metrics:v1.108.1 and gated behind the ops/analytics profiles in the same compose file:
# examples/platform/compose/compose.yaml
  # --- analytics ---------------------------------------------------------
  victoriametrics:
    image: victoriametrics/victoria-metrics:v1.108.1
    profiles: ["analytics", "ops"]
    <<: *restart
    ports: ["8428:8428"]
VictoriaMetrics is the Apache-2 metrics store we ship instead of InfluxDB, a deliberate dodge of the InfluxDB v3 license flip and the same instinct that kept us on Mosquitto over BSL-era EMQX. Operationally it is generous: a single node, run in its HA mode, comfortably handles sub-million-sample-per-second ingestion, and only at larger scale do you reach for the cluster version that adds replication-based HA [10]. For one mAb line, single-node is plenty, and that simplicity is itself a feature: fewer moving parts to back up, patch, and validate. The metrics it scrapes feed the alerts in Grafana that page you before the disk fills, not after.
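Paging "before the disk fills" means alerting on a forecast, not on a threshold already crossed. The sketch below fits a least-squares line through recent disk-usage samples and extrapolates to the volume's capacity, the same idea as PromQL's predict_linear. The sample values and the hours_until_full helper are made up for illustration; a production alert would be an expression over the scraped metrics.

```python
def hours_until_full(samples, capacity_bytes):
    """Linear forecast of time until the disk is full.

    samples: list of (hours, used_bytes) pairs, oldest first.
    Returns hours from the latest sample, or None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    latest_t = samples[-1][0]
    return (capacity_bytes - intercept) / slope - latest_t


# Usage grows by 10 units per hour on a 100-unit volume.
eta = hours_until_full([(0, 50), (1, 60), (2, 70)], capacity_bytes=100)
print(eta)
```

Alerting when the forecast drops below the time it takes to react (hours for an on-call engineer, days for a procurement cycle) is what turns the metric into a page that arrives early enough.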
Why it matters
A platform is not finished when it works once; it is finished when it keeps working through failure, attack, and time, and can prove it did. Annex 11 turns availability, backup, and continuity into auditable obligations [1]; IEC 62443 turns "keep OT and IT apart" into a defensible architecture [6]. Skipping day-two operations is not a shortcut; it is a deferred outage and a deferred audit finding, both with interest.
In the real world
Real biomanufacturers run this hybrid honestly. The PostgreSQL/PITR story is genuinely production-grade in the open: pgBackRest backs some of the largest regulated databases in the world [4][5]. The broker-HA story is where money changes hands: shops that need true clustered MQTT either license EMQX after its 2025 BSL move [3], buy HiveMQ, or, most commonly on a single GMP site, run a closely watched single broker with a fast-failover SOP and accept Mosquitto's honest single-node limits [2]. NIIMBL's SABRE facility (the NIIMBL / University of Delaware pilot-scale cGMP plant that broke ground in April 2024 and is still under construction) is exactly the kind of next-generation site where these day-two patterns get exercised on real, regulated equipment rather than a laptop. The pattern that survives contact with an inspector is never "we used open source"; it is "we used open source inside a validated lifecycle, with tested restores, segmented networks, rotated certs, a CVE runbook, and a stack that watches itself." Pure OSS gets you most of the way. The last mile (clustered broker HA, vendor accountability, a turnkey Part 11 wrapper) is hybrid, and saying so is the whole point of this book.
Key terms
- High availability (HA): designing so a single component failure does not take the service down; for databases, a hot standby; for brokers, a cluster.
- PITR (point-in-time recovery): restoring a database to any chosen moment by replaying archived WAL onto a base backup.
- WAL (write-ahead log): PostgreSQL's ordered record of every change; the raw material of replication and PITR.
- pgBackRest: the permissive open-source PostgreSQL backup/restore tool used for encrypted, off-host, targeted recovery.
- BSL (Business Source License): a source-available license that restricts production use of certain features (e.g. EMQX clustering from 5.9); free to read, not free to run at scale.
- IEC 62443: the OT-security standard family; Part 3-3 defines zones, conduits, and Security Levels.
- Zones and conduits: segmenting a plant into trust zones with controlled, monitored paths between them.
- CVE / CVSS: a catalogued vulnerability identifier (e.g. CVE-2026-21888) and the 0-10 model that scores its technical severity.
- SBOM (software bill of materials): the machine-readable inventory of every component and version in the stack, the basis for CVE watching.
- Self-observability: the platform monitoring its own health and metrics, here via VictoriaMetrics.
Where this leads
The platform is now standing, defended, recoverable, and watched. With a trustworthy stream of well-governed data flowing through it, we can finally ask the most rewarding question of all: what can we learn from the data? The next chapter, Process Analytics: SPC, MVDA & Soft Sensors, turns the historian and the lab tables into statistical process control charts, multivariate batch models, and a Raman-to-titer soft sensor: the analytics payoff that the whole platform was built to enable.