Standing Up the Stack: One docker compose up

📍 Where we are: Part I, Chapter 2 — we read the blueprint last chapter; now we boot the whole platform on your laptop so every later chapter has something real to run against.

The simple version

Think of the companion repo as a flat-pack data plant. Every machine — the database, the message broker, the dashboards, the bioreactor simulator — comes in its own sealed box (a container). One instruction sheet (compose.yaml) says which boxes to open, how to wire them together, and how to check each one switched on. You type one command, and the plant assembles itself. Type another, and it folds back into boxes — leaving no mess on your machine.

What this chapter covers

This is the hands-on turning point of the book. By the end you will have cloned one repository and brought a real, multi-service bioprocess data platform to life on a single laptop. We will:

Walk the one compose.yaml that defines the whole core stack, and explain why each service is there.
Explain why pinned image tags matter — the influxdb:latest→v3 license trap is the cautionary tale.
Run the Makefile that is the exact command surface the book prints.
Confirm the stack is alive with a first-data-point smoke test.
Meet the deterministic CHO (Chinese Hamster Ovary — the standard mammalian cell line for antibody drugs; see Biologic Drug Manufacturing, /cell-line-development) simulator the whole book feeds from.

Everything below comes from files that exist in examples/ and were run. No invented flags, no invented output. New words like historian, hypertable, profile, and healthcheck are collected in the Key terms list at the end — skim it first if you like.

One file, the whole core

A modern container platform lets you describe a set of services declaratively — what image each runs, what ports it exposes, what volumes it mounts, how to tell it is healthy — and then bring them all up with a single command [1]. That description is a formal artifact: the Compose Specification defines the schema for services, networks, and volumes, so the same YAML behaves identically on your machine, a colleague's, and the CI runner (the continuous-integration server that re-runs the whole stack automatically on every change) [2].

Here is the top of the real file, in examples/platform/compose/compose.yaml (the acronyms on the capture line are expanded inline here for first read; the on-disk comment keeps them terse on one line):

# compose.yaml — the base stack for "Open-Source Bioprocess Data Systems".
# One file defines every service; Docker Compose PROFILES gate what comes up so a
# reader only pays for the chapter they are on:
#   core       Ch 1-2, 4-6, 16-18 (db + broker + dashboards; the CHO simulator is a
#                              separate Python package run via `make data`)
#   capture    Ch 3, 7-15     (the OPC UA server + collector — the OSS (open-source)
#                              <-> DCS (plant distributed control system) mirror)
#   semantics  Ch 19          (triplestore)
#   commercial Ch 20-22       (PI, LIMS, DeltaV mocks — laptop-unrunnable systems)
#   trust      Ch 23-24       (identity, signing, object store)
#   analytics  Ch 29          (notebooks, model tracking)
# Bring up just the foundation with:  docker compose --profile core up -d
#
# Images are pinned by tag for reproducibility; the matching manifest digests are
# recorded in versions.lock (revisited in the supply-chain chapter, Ch 25).

name: bioprocess-data-stack

The key design idea is thin chapters over a thick shared platform. Every service in the entire book is declared exactly once, in this single file, and tagged with a Compose profile (core, capture, semantics, commercial, trust, analytics; trust is documented in the file's header comment, not yet carried by a live service block). docker compose --profile core up starts only what the Part I blueprint chapters need — roughly 3 GB of RAM — and each later Part turns on one more profile. You never re-declare the stack; you only ever switch a profile on. Your laptop's memory and CPU scale with the chapter you are actually reading.
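The gating rule is small enough to sketch in a few lines of Python. This is an illustration only, not code from the companion repo: the function name active_services is hypothetical, and the dictionary simply restates the profile tags this chapter describes. It assumes the standard Compose behavior that a service starts when at least one of its profiles is enabled, and that a service with no profiles always starts.

```python
# Illustrative model of Compose profile gating; not part of the companion repo.
# The service -> profiles mapping restates what this chapter describes.
SERVICES = {
    "postgres": ["core"],
    "mosquitto": ["core"],
    "grafana": ["core"],
    "fuseki": ["semantics"],
    "victoriametrics": ["analytics"],
}


def active_services(services, enabled_profiles):
    """Return the services Compose would start for the enabled profiles."""
    enabled = set(enabled_profiles)
    return sorted(
        name
        for name, profiles in services.items()
        # No profiles means "always on"; otherwise one match is enough.
        if not profiles or enabled.intersection(profiles)
    )


if __name__ == "__main__":
    print(active_services(SERVICES, ["core"]))
    print(active_services(SERVICES, ["core", "semantics"]))
```

Enabling a second profile only ever adds services to the list; nothing already running is re-declared, which is the "switch a profile on" behavior described above.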

The core profile is the always-on foundation. Note one deliberate choice in examples/platform/compose/compose.yaml:

  postgres:
    # timescale/timescaledb IS PostgreSQL + TimescaleDB, so the historian
    # hypertable and the ISA-88/95 batch model live in one joinable database.
    image: timescale/timescaledb:2.17.2-pg17
    profiles: ["core"]
    <<: *restart
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-bioproc}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-bioproc}
      POSTGRES_DB: ${POSTGRES_DB:-bioproc}
    ports: ["5432:5432"]
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ../db:/docker-entrypoint-initdb.d:ro   # 00-60 schema files run on first init
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-bioproc} -d ${POSTGRES_DB:-bioproc}"]
      interval: 5s
      timeout: 5s
      retries: 20

There is no separate "time-series database" container. The timescale/timescaledb image is PostgreSQL [3] with the TimescaleDB extension already installed [4]. That single decision pays off for the rest of the book: the high-rate sensor history and the ISA-88/95 batch model (the ISA-standard way of describing a batch as a recipe run on equipment — 88 is the batch/recipe model, 95 the equipment/enterprise hierarchy; built in the next chapter) live in the same database, so a query can join "what the dissolved-oxygen probe read at 14:32" to "which batch and recipe phase was running" without copying data between systems. We will lean on that join hard in the contextualization chapters.

Two more details in that block earn their keep. The volumes line mounts ../db into PostgreSQL's first-boot init directory, so the numbered schema files (00-init.sql through 60-views.sql) run automatically the first time the database starts — the schema is code, not a manual step. And healthcheck runs pg_isready every five seconds so the platform can know, not guess, when the database is ready to accept connections. Healthchecks are how later chapters' tests wait for a clean dependency before they run.

Anatomy of a Compose service definition

That postgres: block is not a special case — it is the recurring artifact of this entire book. Every service in compose.yaml, from mosquitto to victoriametrics to the OPC UA collector you meet in the capture chapters, is one of these same eight-or-so fields filled in differently. Learn to read one and you can read all of them. So it is worth dissecting this one field by field, because each line answers a different question the platform needs answered before it can run a service safely.

Every service in the book is one of these cards with the fields filled in differently; the green block is the image (tag now, digest in the lockfile) and the violet block is the healthcheck that lets dependents wait for "ready," not merely "started." Original diagram by the authors, created with AI assistance.

Reading the card top to bottom:

image — the only mandatory field, and the one the whole reproducibility argument hangs on. timescale/timescaledb:2.17.2-pg17 names a repository and a pinned MAJOR.MINOR.PATCH tag; the companion versions.lock records the matching immutable sha256: manifest digest (sha256:3324f81c…) so the human-readable file and the content-addressed truth cannot drift apart. We dissect tag-versus-digest on its own below.
profiles: ["core"] — the gate. This service only starts under docker compose --profile core up; flip a different profile on and it stays dormant. This is the one line that makes "thin chapters over a thick platform" work.
<<: *restart — a YAML merge of the shared &restart anchor defined once at the top of the file (restart: unless-stopped). Declared in one place, merged into every service, so the restart policy can never disagree across services.
environment — configuration injected as environment variables. The ${POSTGRES_USER:-bioproc} syntax is default-substitution: take the value from a .env file or the shell if it is set, otherwise fall back to bioproc. The stack runs out of the box with no .env, yet every credential is overridable for a real deployment.
ports: ["5432:5432"] — a host:container mapping. The left number is the port on your laptop; the right is the port inside the container. That is why psql … -h localhost -p 5432 reaches PostgreSQL from outside the Docker network.
volumes — two volumes of two different kinds, which is the subtlety worth catching. pgdata:/var/lib/postgresql/data is a named volume: Docker-managed storage that survives make down (which is exactly why your data is still there after a restart, and why make clean's -v is what truly wipes it). ../db:/docker-entrypoint-initdb.d:ro is a read-only bind mount of a host directory into PostgreSQL's first-boot init folder — the mechanism that auto-runs the 00–60 schema files on first start.
healthcheck — the field that separates "the container is running" from "the service is ready," dissected in its own section next.
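The default-substitution rule from the environment entry above can be mimicked in a short Python sketch. It is an assumption-laden illustration, not Compose's actual implementation: the helper name interpolate is hypothetical, and it handles only the ${NAME:-default} form used in this file (fall back to the default when the variable is unset or empty).

```python
import re

# Matches ${NAME:-default}; a simplified subset of Compose interpolation.
_PATTERN = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*):-(?P<default>[^}]*)\}")


def interpolate(text, env):
    """Replace ${NAME:-default} with env[NAME], or default if unset or empty."""
    def _resolve(match):
        value = env.get(match.group("name"), "")
        return value if value else match.group("default")
    return _PATTERN.sub(_resolve, text)


if __name__ == "__main__":
    probe = "pg_isready -U ${POSTGRES_USER:-bioproc} -d ${POSTGRES_DB:-bioproc}"
    print(interpolate(probe, {}))                            # no .env: defaults apply
    print(interpolate(probe, {"POSTGRES_USER": "qa_user"}))  # one override (hypothetical value)
```

With an empty environment every placeholder collapses to bioproc, which is why the stack runs with no .env file at all.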

Reading a healthcheck: how the stack knows it's ready

A container can be running long before the program inside it can do useful work: PostgreSQL has to start up, replay its write-ahead log, and open its socket before it will accept a connection. A healthcheck is how Compose closes that gap: it runs a probe command on a clock and flips the service's health status from starting to healthy only once the probe succeeds. The postgres block declares four sub-fields:

test — the probe itself: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-bioproc} -d ${POSTGRES_DB:-bioproc}"]. pg_isready is PostgreSQL's own readiness utility [3]; it returns exit code 0 only when the server is accepting connections. CMD-SHELL runs the string through the container's shell, so the command and its flags are parsed as written; the ${…} defaults are filled in earlier, by Compose itself, when it reads the file.
interval: 5s — re-run the probe every five seconds.
timeout: 5s — a single probe that takes longer than this counts as a failure.
retries: 20 — allow up to twenty consecutive failures before declaring the service unhealthy (so a slow first boot is tolerated, but a genuinely dead database is eventually caught).

The payoff is the grafana service's depends_on: { postgres: { condition: service_healthy } }. Grafana does not merely wait for the Postgres container to exist — it waits for that healthcheck to pass. The healthcheck is the contract; depends_on is the consumer of it. Every later chapter's test suite leans on the same contract to avoid the classic flake of querying a database that has not finished starting.
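The healthcheck bookkeeping can be modelled as a tiny state machine. The sketch below is a simplified model written for this explanation, not Docker's code: the function name health_status is hypothetical, it ignores options such as start_period, and it assumes the usual rule that one success marks the service healthy while retries consecutive failures mark it unhealthy.

```python
def health_status(probe_results, retries):
    """Fold a sequence of probe outcomes (True = exit code 0) into a status."""
    status = "starting"
    failing_streak = 0
    for ok in probe_results:
        if ok:
            # Any success resets the streak and marks the service healthy.
            status, failing_streak = "healthy", 0
        else:
            failing_streak += 1
            if failing_streak >= retries:
                status = "unhealthy"
    return status


if __name__ == "__main__":
    slow_boot = [False] * 6 + [True]   # ready on the seventh probe
    dead_db = [False] * 20             # never answers
    print(health_status(slow_boot, retries=20))
    print(health_status(dead_db, retries=20))
```

A dependent declared with condition: service_healthy is only released in the first case; in the second it never starts, which is the behavior the depends_on contract relies on.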

The rest of core is the broker and the dashboards; a triplestore and a metrics store ride along in the same file under other profiles:

  mosquitto:
    image: eclipse-mosquitto:2.0.22
    profiles: ["core"]
    <<: *restart
    ports: ["1883:1883"]
    volumes:
      - ../mosquitto/mosquitto.conf:/mosquitto/config/mosquitto.conf:ro
    healthcheck:
      test: ["CMD-SHELL", "mosquitto_sub -t '$$SYS/#' -C 1 -W 3 -h localhost || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 10

  grafana:
    image: grafana/grafana-oss:11.4.0
    profiles: ["core"]
    <<: *restart
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - ../dashboards/provisioning:/etc/grafana/provisioning:ro
      - grafana:/var/lib/grafana
    depends_on:
      postgres:
        condition: service_healthy

mosquitto is the MQTT (Message Queuing Telemetry Transport) broker: a lightweight message bus where producers publish readings to named topics and consumers subscribe to them. It is the path simulator telemetry will flow across in the capture chapters [5]. grafana-oss is the dashboard layer that queries the historian and draws the batch-overlay and golden-batch charts the book builds [6]. Notice that grafana declares depends_on with condition: service_healthy against postgres, so Compose will not even start the Grafana container until the database has passed its healthcheck. Note also the doubled dollar sign in the mosquitto probe: $$SYS is how a Compose file writes a literal $SYS, so that Compose does not try to interpolate it as a variable. Two more services, fuseki (the knowledge-graph triplestore, a database that stores facts as subject-predicate-object triples, built in Chapter 19) and victoriametrics (stack self-monitoring: it records the stack's own metrics), sit behind the semantics and analytics profiles respectively and stay dormant until those Parts. The header comment lists "notebooks, model tracking" as the eventual intent of the analytics profile, but the shipped stack carries only VictoriaMetrics there; the Jupyter-and-MLflow served path lives in the full repo, as the analytics chapter is careful to note.

The OPC UA server and OPC UA collector that mirror a DCS — a plant's distributed control system — over the wire are not in the core profile; they switch on with the capture profile later, exactly the same way. The point is that the file you are reading now already declares the seats for them. (Field-gateway tools like Telegraf [7] and Node-RED [8] are introduced in the edge-gateway chapter and are not services in this compose file.)

Layered diagram of the Open-Source Bioprocess Data Systems core stack: a single compose.yaml emitting profile-gated containers — PostgreSQL+TimescaleDB, Mosquitto, and Grafana — wired by a Docker network, with the CHO simulator as a separate Python package feeding datasets in, and make targets driving data, up, seed, load, and the smoke test.

The one Compose file fans out into a handful of pinned, health-checked containers; the Makefile is the only command surface you ever type, and profiles decide how much of the plant is powered on. Original diagram by the authors, created with AI assistance.

Why pinned tags matter (the `latest` trap)

Look again at every image: line: timescale/timescaledb:2.17.2-pg17, eclipse-mosquitto:2.0.22, grafana/grafana-oss:11.4.0. None of them says :latest. That is not fussiness; it is the difference between a reproducible plant and a time bomb.

Containers are distributed as OCI images — a content-addressable manifest plus layers, identifiable by an immutable digest [9]. A tag like 2.17.2-pg17 is a friendly label pointing at one such digest; a tag like latest points at whatever the maintainer pushed most recently. Semantic versioning gives the tag meaning: MAJOR.MINOR.PATCH, where a MAJOR bump signals a breaking change [10]. Pin the version and you can reason about what an upgrade will and will not break.

Anatomy of a versions.lock line: tag vs digest

The whole pinning argument lives in one line of platform/versions.lock, and that line has exactly two halves worth separating in your head:

A versions.lock line is <image:tag> sha256:<digest>: the tag is a label you can re-point, the digest is the content fingerprint; :latest keeps the label and swaps the content underneath you. Original diagram by the authors, created with AI assistance.

The tag — timescale/timescaledb:2.17.2-pg17 — is a mutable, re-pointable label. It is human-readable and semver-meaningful, which is exactly why compose.yaml is written in tags: the file stays legible and the license table lines up by name. But a tag is only a name; the maintainer can re-point it at a new image tomorrow.
The digest — sha256:3324f81c… — is the immutable content. It is the hash of the image manifest itself, so it cannot change without becoming a different digest. Pin by digest (image: <repo>@sha256:…) and you get a byte-exact image or an error — never a surprise.

versions.lock records the tag→digest pair for the stack's pinned images (TimescaleDB, Mosquitto, Grafana, VictoriaMetrics), so the readable file and the content-addressed truth are written down side by side and can be checked against each other. The lockfile even flags one image whose community repository moved: its digest stays marked VERIFY-BEFORE-USE until you re-fetch and record the real sha256: digest from the container registry you pull from (the image host, e.g. Docker Hub or your organization's mirror of it). It is a small, honest reminder that "the same tag" is not a guarantee of "the same bytes."
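To make the two halves of a lock line concrete, here is a minimal Python sketch that splits one apart. It assumes the <image:tag> sha256:<digest> layout described above and assumes the VERIFY-BEFORE-USE marker sits in the digest position; the helper names are hypothetical, and the all-zero digest is a made-up placeholder, not the digest of any real image.

```python
import re

# A full sha256 digest is the prefix plus 64 lowercase hex characters.
_DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def parse_lock_line(line):
    """Split '<image:tag> <digest>' into (image, tag, digest, verified)."""
    reference, digest = line.split()
    # Split on the last colon so a registry host with a port still parses.
    image, _, tag = reference.rpartition(":")
    if not image:
        raise ValueError(f"missing tag in {reference!r}")
    verified = bool(_DIGEST.match(digest))
    return image, tag, digest, verified


def digest_reference(image, digest):
    """Build the byte-exact form: <repo>@sha256:<digest>."""
    return f"{image}@{digest}"


if __name__ == "__main__":
    fake_digest = "sha256:" + "0" * 64  # placeholder only, not a real digest
    image, tag, digest, verified = parse_lock_line(
        f"timescale/timescaledb:2.17.2-pg17 {fake_digest}"
    )
    print(tag, verified)
    print(digest_reference(image, digest))
    # An unverified entry is reported as such instead of being trusted.
    print(parse_lock_line("example/image:1.0.0 VERIFY-BEFORE-USE")[3])
```

The verified flag is the useful part: a line whose digest is still a placeholder is detectable by a script, so it can fail a check rather than pass silently.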

Why `:latest` is a silent time bomb (the field record)

The canonical horror story is InfluxDB. A reader who wrote influxdb:latest in 2024 woke up one morning pulling InfluxDB 3 — a near-total rewrite with a changed storage engine and a changed license posture — silently, in place, with no warning. The book sidesteps that entire class of accident by avoiding InfluxDB (we ship VictoriaMetrics under Apache-2.0 instead) and, more importantly, by pinning everything. compose.yaml pins each image by its human-readable tag; a companion platform/versions.lock records the matching immutable manifest digest (<image:tag> sha256:…) for each one. The supply-chain chapter (Chapter 25) builds on that lockfile to cross-check the running stack, the license inventory, and the supplier register against one pinned list — so they cannot silently drift apart.

That anecdote is not a one-off; it is a documented class of failure. Container-security researchers call a re-pointable tag a mutable (or "mutant") tag, and they treat :latest as a live threat rather than a style nit: because the tag can be re-pointed to a different digest at any moment, the image you scanned and approved and the image that actually gets pulled and runs need not be the same bytes — a gap that has been demonstrated as an admission-control bypass where a clean tag is swapped for a malicious image after the scan passes [13]. The InfluxDB swap is the benign-looking version of exactly that mechanism: same name, new content, no warning. The defensive posture the same researchers recommend is precisely what this stack does — verify the digest the tag resolves to, so "the same tag" is forced to mean "the same image" [13]. A pinned tag plus a recorded digest turns "we trust the label" into "we can prove the bytes."

A few 2026 license traps are worth flagging as you choose versions, because they bite quietly. TimescaleDB ships its core — hypertables, time_bucket, and drop_chunks (its data-retention command, which deletes old time chunks) — under Apache-2.0, a permissive open-source license. But the continuous aggregates (pre-computed rolling summaries) and add_retention_policy automation this stack actually uses are free TSL Community features (Timescale License — source-available and free to run, but not an OSI-approved open-source license), while the Hypercore columnstore/compression and HA (high availability) are the licensed TSL tier we deliberately avoid; a strictly Apache-2.0-only build would have to schedule its own cron-driven drop_chunks instead (Chapter 16). Grafana is AGPL-3.0 — perfectly fine to run locally, but redistributing or hosting it as a service for others triggers source-sharing obligations. InfluxDB v3, EMQX's BSL (Business Source License), and Redpanda's RCL (Redpanda Community License) — all source-available licenses that restrict commercial or hosted use — are the other landmines this stack steps around on purpose.

The Makefile is the command surface

You will never type a raw docker compose incantation in this book. Every action goes through make, and the book prints exactly what you type. Here is the real examples/Makefile:

COMPOSE := docker compose -f platform/compose/compose.yaml
PY := sim/.venv/bin/python
export DATABASE_URL ?= postgresql://bioproc:bioproc@localhost:5432/bioproc

.DEFAULT_GOAL := help
.PHONY: help venv up down seed data load contextualize alcoa soft-sensor test clean

help: ## list targets
	@grep -hE '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | awk 'BEGIN{FS=":.*?## "}{printf "  %-14s %s\n", $$1, $$2}'

venv: ## create the Python env and install the simulator (uv)
	cd sim && uv venv --python 3.12 .venv && uv pip install --python .venv -e . "psycopg[binary]" "asyncua==2.0" scikit-learn

up: ## bring up the core stack (postgres+timescale, mosquitto, grafana)
	$(COMPOSE) --profile core up -d
	@echo "waiting for postgres..." && sleep 3
	@until docker exec bioprocess-data-stack-postgres-1 pg_isready -U bioproc >/dev/null 2>&1; do sleep 2; done
	@echo "core stack up."

The venv target builds the Python environment with uv, a fast Python package and virtual-environment manager (the later (uv) markers refer to the same tool). make help self-documents from the ## comments, so the menu and the code never disagree. make up brings up the core profile and then blocks until pg_isready succeeds, so the command does not return "done" until the database can actually take a connection. That polling loop is the small, honest difference between "the container started" and "the service is ready."

The startup handshake: depends_on, conditions, and the make up poll loop

It is worth tracing the readiness contract end to end, because it runs through three layers that all agree on the same fact:

Inside the service — postgres's own healthcheck runs pg_isready every 5s and flips the service to healthy once it passes (the section above).
Between services — grafana's depends_on: { postgres: { condition: service_healthy } } reads that flag, so Compose holds Grafana back until the database is genuinely accepting connections rather than merely present. Without the condition, depends_on would only wait for the container to be started, a weaker promise that causes the classic "connection refused on first boot" flake.
At the command surface — make up does not trust either layer blindly. After docker compose --profile core up -d returns, it runs its own until docker exec … pg_isready …; do sleep 2; done loop and only prints core stack up. when the probe answers. So the shell command, the inter-service dependency, and the in-container probe all gate on the same pg_isready truth.

The reason to belt-and-suspenders it this way is that the three layers protect different consumers: the in-container check protects Compose's own scheduling, the depends_on condition protects sibling services, and the make loop protects you — the human (or CI runner) about to type make seed against a database that had better be ready. One probe, three guards.
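The third guard, the poll loop, translates directly into Python. This sketch is an illustration rather than anything in the repo: wait_until_ready is a hypothetical name, the probe is any callable that returns True when the service answers, and the timeout is an addition of this sketch (the Makefile loop shown above polls without one).

```python
import time


def wait_until_ready(probe, interval=2.0, timeout=120.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True; raise TimeoutError after timeout seconds.

    Returns the number of attempts it took.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(f"not ready after {attempts} attempts")
        sleep(interval)


if __name__ == "__main__":
    # Stand-in probe: fails twice, then succeeds. A real one would run pg_isready.
    answers = iter([False, False, True])
    print(wait_until_ready(lambda: next(answers), interval=0.01))
```

Passing sleep and clock as parameters keeps the loop testable without real waiting; bounding the wait means a database that never comes up produces an error instead of a hung terminal.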

The full build order for the foundation is short:

make venv          # Python env + the simulator (uv)
make data          # generate every dataset deterministically + MANIFEST.sha256
make up            # bring up the core stack (postgres+timescale, mosquitto, grafana)
make seed          # load the ISA-88/95 reference CHO line
make load          # load the datasets into the historian + lab tables

make load truncates and bulk-COPYs the full-resolution parquet, so it is idempotent — re-running reloads the golden batch rather than duplicating it. make down stops the stack but keeps your data in named volumes; make clean runs docker compose down -v to delete the volumes when you want a truly fresh start. Because the whole environment is one declarative file plus one command, tearing it down leaves nothing scattered across your machine.
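Why truncate-then-load is idempotent is easy to demonstrate in miniature. The sketch below is a stand-in, not the repo's loader: it uses Python's built-in sqlite3 instead of PostgreSQL, DELETE instead of TRUNCATE, executemany instead of COPY, and a simplified hypothetical table. The two rows are taken from the golden trace shown later in this chapter.

```python
import sqlite3

ROWS = [
    ("2026-01-05 00:00:00+00:00", "BR101.DO.PV", 40.8224),
    ("2026-01-05 00:00:00+00:00", "BR101.pH.PV", 7.0511),
]


def load(conn, rows):
    """Empty the table, then insert every row, inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM sensor_reading")  # SQLite's stand-in for TRUNCATE
        conn.executemany("INSERT INTO sensor_reading VALUES (?, ?, ?)", rows)


def row_count(conn):
    """Count the rows currently in the table."""
    return conn.execute("SELECT count(*) FROM sensor_reading").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_reading (ts TEXT, tag TEXT, value REAL)")
load(conn, ROWS)
load(conn, ROWS)  # second run replaces the rows, it does not append
print(row_count(conn))
```

Because the table is emptied inside the same transaction that refills it, running the load twice leaves the same rows as running it once.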

The first data point: a smoke test

Bringing services up is not the same as proving the platform works end to end. The smoke test for this stack is the simplest possible question: can a number land in the historian and come back joined to a batch?

After make up && make seed && make load, the simulator's datasets are in PostgreSQL+TimescaleDB. A first sanity check is to count what landed in the historian hypertable directly:

docker exec -e PGPASSWORD=bioproc bioprocess-data-stack-postgres-1 \
  psql -U bioproc -d bioproc \
  -c "select tag, count(*), round(min(value)::numeric,2) lo, round(max(value)::numeric,2) hi \
      from ts.sensor_reading where batch_id='BATCH-2026-001' group by tag order by tag limit 4;"

      tag       | count |  lo   |  hi
----------------+-------+-------+-------
 BR101.DO.PV    | 20160 | 30.04 | 43.77
 BR101.Temp.PV  | 20160 | 36.36 | 37.12
 BR101.Titer.PV | 20160 | -0.11 |  5.82
 BR101.pH.PV    | 20160 |  6.91 |  7.08

Read a tag like BR101.DO.PV as equipment.measurement.role: bioreactor BR101, its dissolved-oxygen (DO) probe, reporting a process value (PV) rather than a setpoint (SP). These culture terms — titer, viable-cell density, viability, lactate, osmolality, fed-batch feeding — are explained in Biologic Drug Manufacturing (/production-bioreactor); here we only need that the numbers are realistic.

Those ranges are the fed-batch process telling the truth about itself: temperature held near 37 °C, pH spanning roughly 6.9–7.1, dissolved oxygen riding between about 30 and 44 %sat, and titer (the concentration of antibody product in the broth, in grams per litre) climbing from essentially zero (a slightly negative measurement at inoculation, from measurement noise on a near-zero online titer signal) to roughly 6 g/L over the run.

These are characteristic CHO setpoints — DO held around 30–50% air saturation, pH near 7.0, 37 °C physiological — and the run ends the way a healthy 14-day batch should: viable-cell density peaking above 20 million cells/mL, viability sliding from ~96% to ~64% (cells naturally age and die late in a fed-batch as nutrients deplete — an expected, healthy end-of-run profile), lactate accumulating to a few g/L and osmolality climbing past 340 mOsm/kg (all visible in the offline-assay dataset). Each tag has 20,160 rows — one every minute across the 14-day batch (the full-resolution fedbatch_timeseries.parquet that make load ingests; datasets/ also ships a 10-minute CSV downsample for file-replay chapters).
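The row count is worth checking by hand, because it is a quick way to spot a partial load. A minimal sketch of the arithmetic follows; the function name is hypothetical, and the 10-minute figure is only what the same formula would give, assuming the downsample covers the same 14 days with no extra endpoint row.

```python
MINUTES_PER_DAY = 24 * 60


def rows_per_tag(days, sample_minutes):
    """Number of samples for one tag at a fixed sampling period."""
    return days * MINUTES_PER_DAY // sample_minutes


print(rows_per_tag(14, 1))    # one-minute data over 14 days: 20160
print(rows_per_tag(14, 10))   # the same span at 10-minute spacing: 2016
```

If the count query returns anything other than 20,160 for a tag, the load was interrupted or the wrong file was ingested.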

The real smoke test, though, is the join. make contextualize (which we build properly in the contextualization chapter) runs exactly this query against the same stack:

select phase_name, count(*) n, round(avg(value)::numeric,1) avg_DO
  from s88.v_batch_sensor where batch_id='BATCH-2026-001' and tag='BR101.DO.PV'
  group by phase_name order by min(ts);

If that returns dissolved-oxygen averages broken out by recipe phase, the platform is alive in the way that matters: a raw sensor value, captured into the historian, has been reunited with its ISA-88 process context inside one query. That is the whole platform in miniature, and it is the proof the rest of the book builds on.
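The shape of that join can be tried without the stack running. The following sketch uses Python's built-in sqlite3 and is hypothetical throughout: the real definition of s88.v_batch_sensor is not shown in this chapter, so the batch_phase table, the phase names PHASE-A and PHASE-B, the phase boundaries, and the four readings are all invented for illustration. It assumes only that the view pairs each reading with the recipe phase whose time window contains it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor_reading (ts TEXT, tag TEXT, value REAL, batch_id TEXT);
CREATE TABLE batch_phase (batch_id TEXT, phase_name TEXT, start_ts TEXT, end_ts TEXT);
""")
# Invented readings: three DO values and one pH value for the same batch.
conn.executemany("INSERT INTO sensor_reading VALUES (?, ?, ?, ?)", [
    ("2026-01-05 00:00", "BR101.DO.PV", 40.0, "BATCH-2026-001"),
    ("2026-01-05 12:00", "BR101.DO.PV", 42.0, "BATCH-2026-001"),
    ("2026-01-08 00:00", "BR101.DO.PV", 35.0, "BATCH-2026-001"),
    ("2026-01-08 00:00", "BR101.pH.PV", 7.0, "BATCH-2026-001"),
])
# Invented phase windows (start inclusive, end exclusive).
conn.executemany("INSERT INTO batch_phase VALUES (?, ?, ?, ?)", [
    ("BATCH-2026-001", "PHASE-A", "2026-01-05 00:00", "2026-01-07 00:00"),
    ("BATCH-2026-001", "PHASE-B", "2026-01-07 00:00", "2026-01-19 00:00"),
])

QUERY = """
SELECT p.phase_name, count(*) AS n, round(avg(r.value), 1) AS avg_do
  FROM sensor_reading r
  JOIN batch_phase p
    ON p.batch_id = r.batch_id
   AND r.ts >= p.start_ts AND r.ts < p.end_ts
 WHERE r.batch_id = ? AND r.tag = ?
 GROUP BY p.phase_name
 ORDER BY min(r.ts)
"""
rows = conn.execute(QUERY, ("BATCH-2026-001", "BR101.DO.PV")).fetchall()
print(rows)
```

The two DO readings on the first day fall in the first window and the third falls in the second, so the query returns one averaged row per phase in time order; the pH row is filtered out by the tag predicate.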

That batch_id join is also the seam where this relational stack later becomes a semantic one. The same fact the SQL view returns — this reading belongs to this batch — is, in the knowledge-graph chapter, one RDF (Resource Description Framework — the standard way of writing a fact as a subject-predicate-object triple) triple: BATCH-2026-001 derivedFrom SEED-001. The historian value and the batch context that SQL reconciles on a foreign key, RDF stores as first-class edges you can walk recursively, which is what lets a single SPARQL query trace a lot's full genealogy back to the cell bank; that translation, and the Apache Jena Fuseki triplestore the semantics profile reserves a seat for, is built in the semantics & knowledge-graph chapter, and the broader ontology engineering behind it — competency questions, classes and relations, and the SHACL release gate — is the subject of Book 4 (classes and taxonomy, relations and genealogy, the release gate and SHACL). The point worth carrying forward now is that the join we just proved is the competency question a later SHACL shape will enforce closed-world: "is every required result present, singular, and in spec for this lot?" — exactly the question a missing-row LIMS integration fails, and exactly the question SQL's open-world LEFT JOIN cannot raise on its own.

The CHO simulator the whole book feeds from

The book has no real bioreactor, so it ships a deterministic one. The Python package bioproc_sim (installed by make venv, driven by make data) generates every dataset in the book from one fixed master seed, SIM_SEED=2026, so the 14-day fed-batch trace is byte-for-byte identical on every machine. That determinism is not a gimmick — it is what lets CI assert a MANIFEST.sha256 and catch any silent drift in the data.
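The link between a fixed seed and a stable hash can be shown with a toy generator. This is a sketch of the principle only: generate_trace is a hypothetical stand-in for bioproc_sim, its two-column CSV is invented, and it does not reproduce the layout of MANIFEST.sha256, which this chapter does not show.

```python
import hashlib
import random


def generate_trace(seed, n_rows=5):
    """Produce a tiny CSV trace from a seeded generator (toy stand-in for the simulator)."""
    rng = random.Random(seed)  # private generator, no shared global state
    lines = ["minute,value"]
    for minute in range(n_rows):
        lines.append(f"{minute},{37.0 + rng.uniform(-0.1, 0.1):.4f}")
    return ("\n".join(lines) + "\n").encode("utf-8")


def sha256_hex(data):
    """Hex digest of the bytes, the kind of value a manifest would record."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    first = sha256_hex(generate_trace(2026))
    second = sha256_hex(generate_trace(2026))
    other = sha256_hex(generate_trace(2027))
    print(first == second)  # same seed, same bytes, same hash
    print(first == other)   # different seed, different hash
```

A CI job only has to compare one recorded hash against one freshly computed hash; any change in the generator, the formatting, or the seed shows up as a mismatch.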

The fed-batch run models a CHO culture with logistic viable-cell growth, Monod glucose/glutamine kinetics (lactate is produced as a byproduct during growth and consumed late, not a limiting substrate), antibody titer accumulating with the integral of viable cells (a largely non-growth-associated production term — antibody accumulates fastest as growth slows, the way real CHO cultures make most of their titer in the production phase), and PID-controlled DO and pH with bounded noise. It even seeds a deliberate 0.5 °C excursion on day 7 and scheduled bolus feeds on days 3, 5, 7, 9, 11, and 13 — so later chapters have real events to detect, alarm on, and review. A row of the golden trace looks like this:

ts,tag,value,unit,quality,batch_id
2026-01-05 00:00:00+00:00,BR101.DO.PV,40.8224,%sat,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.Temp.PV,37.0145,degC,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.Titer.PV,-0.0045,g/L,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.pH.PV,7.0511,pH,192,BATCH-2026-001

The quality column carries the legacy OPC DA (Classic) quality code, where 192 = Good, 64 = Uncertain, 0 = Bad. (Note the reversal: in modern OPC UA — a different, newer standard, untangled in Chapter 7 — StatusCode 0 means Good, so the same digit means opposite things in the two systems; on this page only the 192/64/0 scale is used.) The unit keeps the engineering unit attached to the number, and batch_id is the thread that ties every reading back to the ISA-88/95 model. The same engine can also stream live to an OPC UA server and to Mosquitto for the capture chapters, or dump flat goldens to datasets/ for chapters you want to follow by replaying files without booting a producer. One seed, one source of truth, every number in the book.
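The 192/64/0 scale is not arbitrary: in the Classic OPC DA quality byte the top two bits carry the major quality, and the lower bits carry sub-status detail. A minimal decoding sketch follows; the function name da_quality is hypothetical and the code is not from the companion repo.

```python
def da_quality(code):
    """Classify an OPC DA quality byte by its top two bits (mask 0xC0)."""
    major = code & 0xC0
    if major == 0xC0:   # 192
        return "Good"
    if major == 0x40:   # 64
        return "Uncertain"
    if major == 0x00:   # 0
        return "Bad"
    return "Reserved"   # 0x80 is not a defined quality state


if __name__ == "__main__":
    for code in (192, 64, 0):
        print(code, da_quality(code))
```

Masking rather than comparing for equality matters in practice: a value such as 216 (Good with a sub-status set) still classifies as Good, where a naive code == 192 test would reject it.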

That batch_id column does one more quiet job that the machine-learning chapters depend on absolutely: it is the grouping key for honest model validation. The minute-resolution rows within one batch are not independent draws — every reading in BATCH-2026-001 shares the same cells, the same media lot, the same probe calibration — so splitting a soft-sensor's training and test rows at random leaks batch identity across the split and reports an accuracy the model will never see in production. The defensible split is grouped, leave-one-batch-out cross-validation (GroupKFold keyed on batch_id, so whole batches, never single rows, fall on one side of the split), which is precisely why the simulator ships several distinct batches rather than one long trace. The same batch_id thread that contextualizes a sensor reading here is the thread that keeps a model's validation from fooling itself; the leakage trap, grouped CV, and the applicability domain (the input region a model was actually calibrated on, beyond which its predictions are extrapolation) are worked through in Book 5's models and validation chapter. The deterministic MANIFEST.sha256 matters to ML for a second reason too: a model is only reproducible if its training data is, so a recorded dataset hash is the upstream half of model lineage — pin the data digest beside the model version and a reviewer can prove which exact bytes produced which exact model, the foundation the MLOps and lifecycle chapter builds drift detection and governed retraining on. (One caution the ML chapters stress: a simulator that is byte-identical everywhere is a reproducibility gift but also a covariate-shift blind spot — a model that learns this one seed's world has not yet met the run-to-run wander of a real cell line, which is its own kind of drift.)
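The grouped split is simple enough to write out by hand. The sketch below is a plain-Python illustration of leave-one-batch-out, with a hypothetical function name; in practice scikit-learn (which make venv installs) offers GroupKFold and LeaveOneGroupOut for the same job. Batch IDs other than BATCH-2026-001 are invented for the example.

```python
def leave_one_batch_out(batch_ids):
    """Yield (held_out, train_idx, test_idx), holding out one whole batch per fold."""
    for held_out in sorted(set(batch_ids)):
        test_idx = [i for i, b in enumerate(batch_ids) if b == held_out]
        train_idx = [i for i, b in enumerate(batch_ids) if b != held_out]
        yield held_out, train_idx, test_idx


if __name__ == "__main__":
    # One batch label per row; only the first ID appears in this chapter.
    rows = ["BATCH-2026-001"] * 3 + ["BATCH-2026-002"] * 2 + ["BATCH-2026-003"] * 2
    for held_out, train_idx, test_idx in leave_one_batch_out(rows):
        print(held_out, "train rows:", train_idx, "test rows:", test_idx)
```

Every fold keeps all rows of the held-out batch on the test side, so no reading from that batch can leak into training, which is exactly the property a random row-level split lacks.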

Data-flow graph of the simulator-fed stack: bioproc_sim with SIM_SEED=2026 emits datasets via make data and a live Mosquitto MQTT stream; make load lands the datasets in the ts.sensor_reading TimescaleDB hypertable, while make seed loads platform/db/seed into the s88 ISA-88/95 model; the historian and the model converge in the s88.v_batch_sensor join on batch_id, which feeds Grafana dashboards.

Why it matters

Everything downstream — the historian, the batch model, contextualization, the audit chain, the titer soft-sensor — assumes a working, reproducible foundation. Get that wrong and every later chapter inherits the flakiness. Get it right and the book becomes something you run, not something you read.

There is a regulatory dividend too. A version-pinned, declarative, automated environment is exactly the artifact a qualification effort (the regulated-industry term for documented proof a system was installed and works as intended) wants. GAMP 5 (2nd Edition), an industry guide for validating GxP ("Good x Practice", the umbrella for the FDA/EU pharma-quality regulations) computerized systems, frames a risk-based lifecycle and gives explicit attention to infrastructure qualification and to open-source software [11]. When your infrastructure is code (one Compose file, one lockfile of digests in versions.lock, one Makefile), your installation evidence is reproducible and reviewable rather than a screenshot of someone's terminal. The FDA's Computer Software Assurance (CSA) final guidance points the same way: assurance should be risk-based and least-burdensome, leaning on logs, automation, and supplier evidence rather than ritual documentation [12]. A clean make up that boots a pinned stack and passes a healthcheck is precisely the kind of objective, repeatable evidence those frameworks reward. A deterministic rebuild plus a recorded MANIFEST.sha256/digest (what the make alcoa target checks) supports the integrity side of ALCOA+, the regulators' shorthand for records that are Attributable, Legible, Contemporaneous, Original and Accurate (plus Complete, Consistent, Enduring and Available): you can prove the bytes did not change, though a hash alone says nothing about who recorded them or when. (We make the validation case in full later: ALCOA+ data integrity in Chapter 23, Part 11 / Annex 11 in Chapter 24, and a GAMP 5 + CSA walkthrough in Chapter 25.)
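To make "prove the bytes did not change" concrete, here is a minimal sketch of checking files against a recorded hash manifest. It assumes the manifest uses the common `sha256sum` layout (`<hex digest>  <file name>` per line). The actual format of MANIFEST.sha256 and the internals of the make alcoa target are not shown in this section, so treat the function names, the file name `readings.csv`, and the column names as hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Return the names of files whose current hash differs from the manifest.

    Assumes sha256sum-style lines: '<hex digest>  <file name>'.
    """
    root = Path(manifest_path).parent
    mismatches = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        recorded, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes '*' in binary mode
        if sha256_of(root / name) != recorded:
            mismatches.append(name)
    return mismatches


# Demo in a throwaway directory: record a hash, then alter one reading.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "readings.csv"  # hypothetical dataset name
    data.write_text("batch_id,minute,ph\nBATCH-2026-001,0,7.0\n")
    manifest = Path(tmp) / "MANIFEST.sha256"
    manifest.write_text(f"{sha256_of(data)}  {data.name}\n")

    clean = verify_manifest(manifest)  # nothing changed yet
    data.write_text("batch_id,minute,ph\nBATCH-2026-001,0,7.1\n")
    tampered = verify_manifest(manifest)  # one value edited

print(clean, tampered)  # prints: [] ['readings.csv']
```

Recording the same digest beside a model version is what lets a reviewer tie one exact dataset to one exact model.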

In the real world

Real plants do not run a single laptop Compose file, of course. A production historian might be AVEVA PI on dedicated, highly available hardware; the DCS is Emerson DeltaV or Siemens; the LIMS is commercial and validated. Those systems cannot run on a laptop and are not open source, which is why this book is honest about being a hybrid: the open-source core here gets you perhaps 80% of the way, and the GxP last mile is where commercial systems and formal validation take over. That last mile covers e-signatures under Part 11 (FDA 21 CFR Part 11, the rule governing electronic records and signatures), vendor accountability, and validated HA (high availability). No tool in this stack is Part 11-compliant out of the box, and saying otherwise would be marketing, not engineering.

But the architecture this chapter stands up is the same one the big shops use, just at laptop scale: a time-series historian beside a relational system of record, a message bus between the floor and IT, dashboards over the top. A reproducible, profile-gated dev stack is how you experiment with that architecture without a cleanroom — and the same Compose-and-Make discipline scales straight into the IQ/OQ evidence (Installation Qualification and Operational Qualification — the documented proof a regulated system was installed correctly and runs as specified) a real facility would demand. The honest gap is in the operational qualities — uptime guarantees, certified support, formal validation packages — not in the shape of the data plant.

It is worth being precise about how this maps to the qualification ladder, because the mapping is exact, not loose. A pinned compose.yaml plus the versions.lock digests is your IQ evidence: proof the right components, at the right versions, were installed. Because it is content-addressed it is stronger than a screenshot: a reviewer can re-pull and confirm the bytes. The healthchecks and the make up poll loop are OQ evidence: proof each service operates as specified (the database accepts connections, the broker answers, dashboards reach the historian) under a written, repeatable test. What a laptop stack cannot supply is PQ (Performance Qualification, proof the whole process performs reproducibly over time under production load, with the real materials and operators), because PQ is a property of the live GMP process, not of the software: it needs the actual cell-culture campaign, qualified operators, and the historian holding months of real batches, not a 14-day simulator trace. That IQ/OQ-on-a-laptop, PQ-only-in-the-plant boundary is the same hybrid honesty the rest of the chapter keeps. The shift from CSV (Computerized System Validation, the older document-heavy approach; not comma-separated values) to CSA, which is what lets this code-as-evidence count toward IQ/OQ, is the subject of Book 2's GAMP 5 and CSA chapter.
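The "written, repeatable test" behind OQ evidence can be pictured as a poll loop with a deadline. The sketch below is a generic illustration and not a copy of the repo's make up logic: `wait_until_healthy`, the probe names, and the fake probes are all hypothetical, and a real probe would ask the database, broker, or dashboard whether it is ready.

```python
import time


def wait_until_healthy(probes, timeout_s=60.0, interval_s=1.0):
    """Poll each named probe until all report ready or the deadline passes.

    probes maps a service name to a zero-argument callable returning True
    when that service is ready. Returns a dict of service name -> bool,
    which doubles as a record of what passed.
    """
    deadline = time.monotonic() + timeout_s
    status = {name: False for name in probes}
    while True:
        for name, probe in probes.items():
            if status[name]:
                continue  # already healthy, no need to ask again
            try:
                status[name] = bool(probe())
            except Exception:
                status[name] = False  # a failing probe counts as "not ready yet"
        if all(status.values()) or time.monotonic() >= deadline:
            return status
        time.sleep(interval_s)


# Fake probes standing in for real checks: the database needs three polls.
calls = {"db": 0}


def fake_db_probe():
    calls["db"] += 1
    return calls["db"] >= 3


result = wait_until_healthy(
    {"db": fake_db_probe, "broker": lambda: True},
    timeout_s=5,
    interval_s=0,
)
never_ready = wait_until_healthy({"db": lambda: False}, timeout_s=0, interval_s=0)

print(result)       # prints: {'db': True, 'broker': True}
print(never_ready)  # prints: {'db': False}
```

Returning the per-service status rather than a single boolean keeps the outcome reviewable: a failed run shows which service missed the deadline.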

One scope note for the bioprocess reader, since the simulator only models the upstream bioreactor. The BATCH-2026-001 material this stack records does not stop at harvest — in a real campaign it flows through the downstream purification train that the time-series historian here will never see the inside of: Protein A capture chromatography (the affinity step that binds the antibody and washes everything else away), low-pH viral inactivation, polishing chromatography (the ion-exchange and mixed-mode steps that strip residual host-cell protein, DNA, and aggregate), viral filtration, and UF/DF to the drug substance (ultrafiltration/diafiltration, which concentrates and buffer-exchanges the antibody into its final formulation). Those steps are batch operations with their own in-process tags, their own deviations, and their own release results — and the reason the batch_id thread and the ISA-88/95 model matter so much is that they are what let a single later query follow BATCH-2026-001 from this bioreactor trace all the way to the host-cell-protein and monomer-purity numbers on the drug-substance lot it becomes.

Key terms

Container / OCI image — a sealed, portable bundle of an application and its dependencies, identified by an immutable content digest; the unit each service ships as.
Docker Compose / compose.yaml — the declarative file (and the tool) that defines a multi-service application and brings it all up with one command.
Profile — a Compose service attribute (profiles:) that gates which services start, so a reader only powers on the layer the current chapter needs (core, capture, semantics, commercial, trust, analytics).
Tag pinning — fixing an image to a specific MAJOR.MINOR.PATCH version (and digest) rather than :latest, so the environment is reproducible and upgrades are deliberate.
Healthcheck — a command the platform runs to decide whether a service is actually ready, so dependents (and tests) can wait correctly.
Image digest (sha256:…) — the immutable, content-addressed hash of an image manifest; unlike a tag it cannot be re-pointed, so pinning by digest guarantees byte-identical content. versions.lock records the digest behind each tag.
Named volume vs bind mount — two ways a container gets storage: a named volume (pgdata) is Docker-managed and persists across make down; a bind mount (../db:…:ro) maps a host directory in, here read-only to auto-run the schema on first boot.
ISA-88/95 — the international standards for the batch/recipe model (88) and the equipment/enterprise hierarchy (95); the relational backbone the next chapter builds, so a sensor reading can be tied to the batch, recipe, and phase it belongs to.
Historian — the time-series store for high-rate process data; here, a TimescaleDB hypertable inside the same PostgreSQL database as the batch model.
Hypertable — TimescaleDB's PostgreSQL table that is automatically partitioned by time into chunks for fast time-series writes and queries.
SIM_SEED=2026 — the fixed master seed that makes the CHO simulator's output byte-for-byte identical everywhere, so datasets are reproducible and CI can verify them.
Grouped / leave-one-batch-out cross-validation — splitting a model's training and test data by whole batch (keyed on batch_id) rather than by random row, so correlated rows from the same batch never straddle the split and inflate the reported accuracy; the honest validation a soft sensor needs (Book 5).
IQ / OQ / PQ — the qualification ladder: Installation Qualification (the right components installed — here the pinned compose.yaml and versions.lock), Operational Qualification (each service operates to spec — here the healthchecks and make up poll loop), and Performance Qualification (the whole process performs reproducibly over time — a property only the live GMP plant can supply, not a laptop).

Where this leads

The stack is up and the simulator's numbers are sitting in PostgreSQL — but before we build on it, it is worth seeing what this running stack already is. The next chapter, Virtual Commissioning: Protocol Emulators and the Hardware-Free Testbed, names it: a flight simulator for the plant's data layer, where the emulated OPC UA bioreactor and Modbus skid let every later integration be proven on the bench — behind an acceptance gate — before a single byte reaches real hardware.
