Preface — Open-Source Bioprocess Data Systems

📍 Where we are: the front door of the third book in the series — where you stop reading about bioprocess data and start building the platform that carries it from a sensor on a bioreactor all the way to a regulatory submission.

The first book in this series, Biologic Drug Manufacturing, walked you through how a monoclonal antibody (the protein drug being made) is actually made — the living CHO cells (the Chinese Hamster Ovary cells engineered to produce it), the fed-batch bioreactor (the tank where they grow while nutrients are added in stages), the Protein A column (the affinity chromatography column that captures the antibody out of the broth). The second, Data Management in Biopharmaceutical Manufacturing, stepped back and asked how the data from that process should be governed: who owns it, how it stays trustworthy, what "good" looks like. Both were books you read.

This one is a book you run.

You will clone one Git repository, type make up, and watch a real, open-source bioprocess data platform come to life on your laptop. By the end you will have a historian (a database built for time-stamped sensor readings) storing live sensor data, and a batch model that makes those numbers mean something. You will have an audit trail (the tamper-evident log of who changed what, when) you can try to tamper with (and catch yourself doing it). And you will have a soft-sensor (a model that infers a hard-to-measure quantity from cheaper signals) that predicts antibody titer (how much antibody the culture has accumulated, in grams per litre) from a Raman spectrum (a fast optical measurement of what is dissolved in the broth). Every code block in this book comes from a file in that repository that was actually executed and tested. Nothing is hand-waved.

The simple version

The first book taught you to cook the dish. The second taught you kitchen hygiene and how the health inspector thinks. This book hands you the kitchen: the stoves, the thermometers, the logbook, and the recipe cards — all open-source, all wired together — and shows you how to build it yourself, then tells you honestly which appliances you still have to buy from a commercial vendor because the free ones cannot pass inspection on their own.

What this chapter covers

Who this book is for, and how it hands off from the two books before it.
The two ideas that hold the whole book together: open-source first, and an honest hybrid where pure OSS runs out of road.
The one rule we never break: every claim is either runnable here or cited.
The running case — a fed-batch CHO + Protein A antibody line — and exactly how to clone and start the companion repository.
The conventions you will see in every chapter: how code is shown, how sources are cited, and how we flag the places where open source is not enough.

Who this book is for

If you are a process engineer who has watched data disappear into a vendor's black box, an automation or data engineer asked to "just connect the bioreactor to the cloud," a data scientist tired of receiving CSVs (plain comma-separated data files) stripped of the context that says what each column means, or a student who wants to see how a real pharmaceutical data stack fits together — this book is for you. You do not need to be a microbiologist or a regulatory-affairs specialist; the two earlier books supply that grounding, and we recap what matters as we go. You will be more comfortable if you have seen a terminal before. You do not need to be a Docker expert — Chapter 2 brings you up to speed — but you should be willing to type commands and read error messages without panic.

We assume a laptop with about 16 GB of RAM, with Docker and Docker Compose installed [1], Python 3.12 with uv (the fast Python environment and package installer that make venv uses), Git [2], and GNU Make (the task runner that maps a short name like make up to the real commands behind it). Make is on the list because every command in this book is a make target you type. That is the entire prerequisite list. The core stack, the part you bring up first, fits in roughly 3 GB of RAM, and the later, heavier pieces are opt-in so your machine only carries the chapter you are on.

The two ideas behind the book

Open-source first

The thesis of this book is that you can build most of a modern, regulator-aware bioprocess data platform out of open-source software, and that doing so is clarifying: when every layer is inspectable, you finally understand what your data is doing. We assemble a full stack — a PostgreSQL + TimescaleDB historian (the time-stamped sensor database), an Eclipse Mosquitto message broker (a relay that carries sensor messages from publishers to subscribers), a Grafana dashboard layer (the charts you look at), an ISA-88/95 batch model in SQL (the schema that says which readings belong to which batch and phase), a knowledge graph (a database that stores facts as a navigable web of relationships, built in Chapter 19), an ALCOA+ audit chain (the data-integrity attributes a trustworthy record must have — Attributable, Legible, Contemporaneous, Original, Accurate, plus four more, spelled out later), and a Raman-to-titer soft-sensor — almost entirely from permissively licensed tools (licenses that let you use, modify, and redistribute the software with few strings attached).

But "open source first" is a starting posture, not a religion. Where a commercial system is the honest answer, we say so and we integrate with it.

The honest hybrid

Here is the line we will not let you cross unaware: no open-source tool is 21 CFR Part 11 compliant out of the box, and none ever will be by download alone [3]. Part 11 — the U.S. Food and Drug Administration's rule on electronic records and electronic signatures, codified in Title 21 of the U.S. Code of Federal Regulations (CFR) — sets the criteria for treating an electronic record as trustworthy and equivalent to a paper one [3]. Its companion guidance makes the point even sharper: compliance is risk-based and attaches to a validated system and its surrounding procedures, applied with enforcement discretion on certain controls — not to a piece of software sitting on a shelf [4]. The European counterpart, EudraLex Annex 11, says the same thing in different words: a computerised system used in a regulated process must be validated (validation = documented, risk-based evidence that the system does what it is intended to do and keeps doing it), with audit trails and data-integrity controls, and that property belongs to the whole system, not to any one component [5].

So our honest estimate, repeated throughout the book: pure open source gets you roughly 80% of the way. It captures the data, stores it, contextualizes it, visualizes it, and even lets you reason over it semantically. The last 20% is the GxP last mile (GxP is "Good x Practice," the family of GMP/GLP/GCP quality regulations): validation, qualified electronic signatures, vendor accountability, and high-availability guarantees. There you either do serious hardening work yourself or reach for a commercial product. A related term you will meet throughout is cGMP, current Good Manufacturing Practice: the FDA-enforced quality system that GxP data-integrity rules live inside.

This is not a defeat for open source. It is what the modern guidance actually expects. The second edition of ISPE's GAMP 5 — the industry's main reference for validating computerized systems — explicitly added an appendix on open-source software and a risk-based assurance approach, confirming that OSS can be used in a GxP setting, but only inside a validated lifecycle [6]. The FDA's Computer Software Assurance thinking pushes in the same direction: assurance is a risk-proportionate, intended-use activity layered onto software, which is precisely why no tool arrives compliant [7]. We will build the runnable 80%, point clearly at the hybrid 20%, and in the final chapter score exactly what you give up at each boundary.

A horizontal layered stack from a bioreactor sensor on the left to a regulatory submission on the right, with each layer labelled by its open-source tool and shaded to show how much is pure OSS versus where a commercial system or heavy validation takes over near the regulated end.

From sensor to submission: data flows left to right through capture, historian, batch context, semantics, and trust layers. The green-shaded band (left) is pure open source you build and run here; the rose-shaded last mile near the submission is the honest hybrid — validation, qualified e-signatures, and vendor accountability that open source alone does not deliver. Original diagram by the authors, created with AI assistance.

The one rule: runnable or cited

Every factual claim in this book is backed one of two ways. Either it is runnable — there is a file in the companion repository you can execute to see it for yourself — or it is cited to a primary source: a regulation, a standard, an official documentation page, or a peer-reviewed paper.

The companion repository lives in the examples/ folder of the repo you clone, and you clone it once (clone = download a local working copy with git clone). Its README.md states the contract plainly:

# Open-Source Bioprocess Data Systems — companion repo

Everything the book claims is **either runnable here or cited.** This repo is the
"runnable" half — and CI runs the same `make` targets the book prints, so
"implementable on a laptop" is *proven*, not asserted.

That last sentence is the strongest promise we make. The same make targets you type while reading are run on a clean machine by continuous integration. If a command in this book did not work, the build would have failed before publication.

The running case and how to start it

Throughout the book we follow one process: a fed-batch CHO cell culture producing a monoclonal antibody that is then captured on a Protein A column. This is the workhorse that still makes most approved antibodies today, and we model it at bench scale, with an 8 L bioreactor feeding a 1 L Protein A column, so the numbers are realistic rather than a toy. Over its fourteen days the golden batch (the clean, on-target reference run every later chapter benchmarks against) climbs to roughly 20 million viable cells per mL and accumulates about 5.9 g/L of antibody before viability (the fraction of cells still alive) tails off near 68%; every one of those numbers is in the dataset you generate. Where it matters, we call out the modern intensified variant: a perfusion culture, where fresh medium flows in and spent medium with harvest is drawn off continuously instead of in batches. The simulator ships it as a 30-day continuous run that would, in principle, feed a multi-column continuous-capture train. So the perfusion bioreactor is something you can run, not just read about, while the multi-column capture train is only sketched as the direction the model would extend.

Getting started is five commands. From examples/README.md:

make venv          # Python env + the simulator (uv)
make data          # generate every dataset deterministically + MANIFEST.sha256
make up            # bring up the core stack (postgres+timescale, mosquitto, grafana)
make seed          # load the ISA-88/95 reference CHO line
make load          # load the datasets into the historian + lab tables

The word deterministically in make data is load-bearing. The simulator runs from a single fixed seed (SIM_SEED=2026), so the fourteen-day batch is byte-for-byte identical on your laptop and on ours. After generation, a MANIFEST.sha256 records the hash of every file — all 18 deterministic artifacts (timeseries, offline assays, chromatograms, Raman spectra, the perfusion run, and the ASM and AnIML interchange exports — vendor-neutral standard formats for analytical and lab data):

e3a78c7291c873b8ede54611d775152c356da138c22be1fc2f35ded04b359701  batches.csv
9b2eec56e5af4d12d7b78895578199fc88a49cf352ba2b45f99f447d275aa441  fedbatch_timeseries.parquet
6f4b5c4261fe3d7c4052326bd4d8fd3530c1cc06f8f0fd305dbb706cf9a0547c  fedbatch_timeseries_10min.csv

If your numbers ever disagree with the book's, the manifest tells you immediately, before a single chart misleads you. That is FAIR data thinking in miniature — Findable, Accessible, Interoperable, Reusable — with reproducibility baked in as a design goal rather than a hope [8].
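The manifest check itself is small enough to sketch in Python. This is an illustration under assumptions, not the repository's own code: it assumes each manifest line is a hex digest, whitespace, then a filename relative to the manifest's directory, and the helper names sha256_of and verify_manifest are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> list[str]:
    """Return the names of files that are missing or whose hash differs.

    Assumed layout (hypothetical helper, not the repo's code): each non-empty
    line is '<hex digest>  <filename>', with filenames relative to the
    manifest's own directory.
    """
    mismatches = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip()
        target = manifest.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            mismatches.append(name)
    return mismatches
```

Calling verify_manifest(Path("datasets/MANIFEST.sha256")) after make data should return an empty list if every file reproduces; any name it returns is a file to investigate before trusting a chart built on it.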

What `make up` actually starts

The command make up is a thin wrapper. In examples/Makefile it brings up only the core profile and waits for the database to answer:

up: ## bring up the core stack (postgres+timescale, mosquitto, grafana)
	$(COMPOSE) --profile core up -d
	@echo "waiting for postgres..." && sleep 3
	@until docker exec bioprocess-data-stack-postgres-1 pg_isready -U bioproc >/dev/null 2>&1; do sleep 2; done
	@echo "core stack up."

Behind it sits one declarative file, examples/platform/compose/compose.yaml, that defines every service exactly once. Docker Compose lets a single YAML describe a multi-container application and start it with one command [9], and we lean on its profiles feature so you pay only for the layer you are on:

services:
  # --- core --------------------------------------------------------------
  postgres:
    # timescale/timescaledb IS PostgreSQL + TimescaleDB, so the historian
    # hypertable and the ISA-88/95 batch model live in one joinable database.
    image: timescale/timescaledb:2.17.2-pg17
    profiles: ["core"]
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-bioproc} -d ${POSTGRES_DB:-bioproc}"]
      interval: 5s
      timeout: 5s
      retries: 20

  mosquitto:
    image: eclipse-mosquitto:2.0.22
    profiles: ["core"]

  grafana:
    image: grafana/grafana-oss:11.4.0
    profiles: ["core"]

Two things in that snippet are worth slowing down for. First, every image is pinned to an exact version — timescale/timescaledb:2.17.2-pg17, eclipse-mosquitto:2.0.22, grafana/grafana-oss:11.4.0 — never :latest. A floating tag is a time bomb in a regulated system, and the repository's design goes a step further: it pins each image's underlying digest in a committed versions.lock so the running stack, the license inventory, and the validation register can never silently drift apart. Second, the comment on postgres reveals a deliberate choice: TimescaleDB is PostgreSQL with a time-series extension, so your high-rate sensor history and your ISA-88/95 batch context live in the same joinable database. That single decision saves you an entire category of integration pain later.
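The "never :latest" rule is easy to turn into a check. The following Python sketch is an assumption-laden illustration rather than anything shipped in the repository: is_pinned and unpinned_images are hypothetical helpers that inspect image reference strings only, and they treat a reference as pinned if it carries a digest or an explicit tag other than latest.

```python
def is_pinned(image_ref: str) -> bool:
    """True if the reference has a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image_ref:
        return True
    # The tag, if present, follows a ':' in the final path segment, so a
    # registry port such as 'localhost:5000/app' is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag: Docker falls back to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


def unpinned_images(image_refs: list[str]) -> list[str]:
    """Return the references that would float with the registry."""
    return [ref for ref in image_refs if not is_pinned(ref)]


core_images = [
    "timescale/timescaledb:2.17.2-pg17",
    "eclipse-mosquitto:2.0.22",
    "grafana/grafana-oss:11.4.0",
]
print(unpinned_images(core_images))
```

An exact tag passes this check but is still a mutable label, which is why the digest pin matters as a second line of defense.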

When the command finishes, you have a historian, a message broker, and a dashboard server running — the always-on foundation every later chapter assumes.

Anatomy of a core-stack service: what `make up` actually boots

The snippet above is trimmed for the page; the real postgres service block in examples/platform/compose/compose.yaml is a few lines longer, and every one of them earns its place. It is worth dissecting field by field, because this single block is the literal thing make up boots, and it encodes — in eight short keys — most of what makes the stack reproducible, self-initializing, and safe to wait on. The card below is that dissection.

Identity card dissecting the postgres service block field by field: image (pinned tag), profiles core, environment defaults, ports, the two volumes, restart policy, and a highlighted healthcheck that make up waits on.

The postgres core service, field by field: a pinned image, a profile gate, a self-initializing init-mount, and a healthcheck that make up blocks on before it declares the stack up. Original diagram by the authors, created with AI assistance.

Read the keys in order:

image: timescale/timescaledb:2.17.2-pg17 — pinned to an exact tag, never :latest, and deliberately a TimescaleDB image rather than stock PostgreSQL. Because timescale/timescaledb is PostgreSQL with the time-series extension compiled in, your high-rate sensor history (ts.sensor_reading) and your ISA-88/95 batch context (s88.batch) live in one database you can join — match each sensor row to its batch on a shared key — with a single query.
profiles: ["core"] — the pay-for-what-you-use gate. make up runs docker compose --profile core up -d, so only the services tagged core (this database, the Mosquitto broker, Grafana) start; the OPC UA collector (the service that reads from an industrial OPC UA control-system server), the commercial mocks (labelled stand-ins that imitate paid systems), and the triplestore (the graph database behind the knowledge graph) stay dormant until a later chapter asks for them.
environment — POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB all default to bioproc via ${POSTGRES_USER:-bioproc} syntax, so the stack runs with zero configuration yet every value is overridable from your shell.
ports: ["5432:5432"] — maps the container's PostgreSQL port to your laptop, which is how the make seed and make load targets reach the database with psql.
volumes — two of them, doing two different jobs. pgdata:/var/lib/postgresql/data is a named volume so your data survives make down and a reboot. The second, ../db:/docker-entrypoint-initdb.d:ro, is the clever one: PostgreSQL's official entrypoint runs every *.sql file in that directory, in filename order, the first time the database initializes — so mounting platform/db/ there makes the container build its own schema (00-init.sql through 60-views.sql: extensions and schemas, the ISA-88/95 model, the historian hypertable, lab and event tables, governance, the ALCOA+ audit chain, and the contextualization views) with no extra command from you.
healthcheck — the green block on the card, and the reason make up is trustworthy. It runs pg_isready -U bioproc -d bioproc every 5 seconds, up to 20 retries, and make up mirrors it from the host with an until docker exec … pg_isready loop before it prints core stack up. That handshake is what guarantees make seed and make load never race a database that is still initializing.

One block, and you can already read the stack's whole philosophy off it: pinned, profile-gated, self-seeding, and health-gated.
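The health gate in the last item is a plain poll-until-ready loop. Here is a minimal Python sketch of that pattern, assuming an injected probe callable where a real script would shell out to pg_isready; wait_until_ready is a hypothetical name, and the defaults simply borrow the healthcheck's 5-second interval and 20 retries.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    interval_s: float = 5.0,
    retries: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call probe() until it returns True; return the number of attempts used.

    Raises TimeoutError if the probe never succeeds within `retries` attempts.
    The `sleep` parameter is injectable so the loop can be tested instantly.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        if attempt < retries:
            sleep(interval_s)  # back off before the next attempt
    raise TimeoutError(f"service not ready after {retries} attempts")
```

Unlike the Makefile's until loop, this version gives up after a bounded number of attempts, which is usually what you want in CI, where a hung database should fail the build rather than stall it.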

Anatomy of a provenance pin: how `pinned` and `deterministic` are made checkable

"Pinned" and "deterministic" are easy words to assert and hard ones to prove. The repository makes both checkable with two plain-text records, and they are worth dissecting side by side because they answer two different questions — did the right software run? and did it produce the right data?

Identity card dissecting two provenance records: one line of versions.lock split into tag plus sha256 digest with a VERIFY-BEFORE-USE placeholder, and one line of MANIFEST.sha256 as a file hash plus filename, with a panel mapping each to the claim it makes checkable.

Two records make the book's promises auditable: a versions.lock line pins each image's digest (the software that ran), and a MANIFEST.sha256 line pins each dataset's hash (the data it produced). Original diagram by the authors, created with AI assistance.

The first record is the image pin, produced by the make lock target, which writes platform/versions.lock. Each line is a tag followed by its content digest — timescale/timescaledb:2.17.2-pg17 sha256:<64-hex digest>. The tag is the mutable, human-readable label; the sha256: digest is the immutable content address of the exact image layers that ran. A tag can be re-pushed under your feet; a digest cannot. (When make lock cannot resolve a digest — say you are offline — it writes the honest placeholder VERIFY-BEFORE-USE rather than a wrong value, so the gap is visible instead of silent.) versions.lock is committed with resolved digests, so a reader can audit the running stack against a pinned record straight from the repo, and make lock regenerates it against whatever registry you actually pull from.
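The tag-plus-digest line can be parsed and checked in a few lines. This is a minimal sketch, assuming each line holds exactly two whitespace-separated fields and that the VERIFY-BEFORE-USE placeholder sits where the digest would be; ImagePin and parse_pin are hypothetical names, not code from the repository.

```python
import re
from dataclasses import dataclass
from typing import Optional

PLACEHOLDER = "VERIFY-BEFORE-USE"
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


@dataclass(frozen=True)
class ImagePin:
    tag: str                # mutable, human-readable label
    digest: Optional[str]   # immutable content address, or None if unresolved

    @property
    def resolved(self) -> bool:
        return self.digest is not None


def parse_pin(line: str) -> ImagePin:
    """Parse one '<tag> <digest-or-placeholder>' line (assumed layout)."""
    parts = line.split()
    if len(parts) != 2:
        raise ValueError(f"expected two fields, got {len(parts)}: {line!r}")
    tag, value = parts
    if value == PLACEHOLDER:
        return ImagePin(tag, None)  # the gap stays visible instead of silent
    if not DIGEST_RE.fullmatch(value):
        raise ValueError(f"not a sha256 digest: {value!r}")
    return ImagePin(tag, value)
```

A review script could refuse to proceed while any parsed pin has resolved set to False, which keeps an offline-generated lock file from being mistaken for an audited one.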

The second record is the data pin, datasets/MANIFEST.sha256, written by make data. Each line has the form sha256(file), two spaces, then the filename, for example e3a78c72…9701  batches.csv. Re-run make data from the fixed seed (SIM_SEED=2026) and every hash must reproduce byte-for-byte; if one disagrees, the manifest tells you before a single chart can mislead you.

Together the two records turn the book's adjectives into tests: pinned means the digest in versions.lock matches what is running, and deterministic means re-running make data reproduces every hash in MANIFEST.sha256. Neither is a marketing claim; each is a command you can run.

What the CI actually proves: the named assertion behind "runnable"

The README's strongest sentence — CI runs the same make targets the book prints — is only as good as the assertions behind it. The implementability evidence is the pytest suite that make test runs (in examples/tests/), and one test in particular backs the determinism claim: test_determinism_two_runs_identical in tests/test_simulator.py simulates the golden batch twice and asserts the two value arrays are np.array_equal. A sibling test, test_fedbatch_shape_and_quality, pins the shape of that batch — 16 tags (named sensor signals, such as temperature or pH) over the full run, and exactly the day-7 quality excursion (a stretch where the data quality is flagged questionable: 180 minutes of Uncertain quality on two tags) the later chapters lean on. These are not decorative tests; they are the named assertions that make "reproducible" a fact rather than a promise. If either failed, the build would fail before the book shipped.
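The shape of such a determinism test is worth seeing once. The sketch below is not the repository's simulator or its test: simulate_toy_batch is a hypothetical stand-in (a seeded, noisy logistic growth curve built on the standard library), used only to show the pattern of running twice from one seed and asserting equality.

```python
import random


def simulate_toy_batch(seed: int, steps: int = 14 * 24) -> list[float]:
    """Hypothetical stand-in for a simulator: hourly values over fourteen days."""
    rng = random.Random(seed)  # private generator, so no global state leaks in
    value, series = 0.1, []
    for _ in range(steps):
        # Logistic growth toward a plateau plus a small seeded disturbance.
        value += 0.02 * value * (1.0 - value / 6.0) + rng.gauss(0.0, 0.001)
        series.append(value)
    return series


def test_two_runs_identical() -> None:
    # Same seed, same code path: the two series must match exactly.
    assert simulate_toy_batch(seed=2026) == simulate_toy_batch(seed=2026)


def test_seed_matters() -> None:
    # A different seed must change the output, or the seed is not being used.
    assert simulate_toy_batch(seed=2026) != simulate_toy_batch(seed=2027)
```

The second test is the easy one to forget: a simulator that ignores its seed would pass the first test trivially, so checking that the seed actually changes the output guards against a determinism claim that is true for the wrong reason.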

Honest hybrid in the field: what the data-integrity record actually shows

The anecdote a few sections down — a batch investigation that drags on for three weeks because nobody can reconstruct what the sensors were doing during a day-7 excursion — is not a rhetorical flourish. Regulators have spent the last decade writing down exactly this failure mode. The MHRA's GXP Data Integrity Guidance and Definitions exists because audit-trail gaps, unattributable records, and irreproducible data are recurring, citable findings in GMP inspections, not rare accidents [11]; it codifies the ALCOA+ attributes — Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, and Available — precisely because each maps to a way real records have been found wanting. The PIC/S guide PI 041-1 says the same thing to inspectors across more than fifty regulatory authorities: data integrity controls belong to a validated system and its procedures, and the absence of a reliable audit trail is itself a deficiency [12]. This is the cited reality under our runnable demonstration: when Chapter 23 has you tamper with the audit chain and catch yourself, you are exercising, in miniature, the control whose absence fills these guidance documents. Open source can make tampering detectable; it cannot, by download alone, make a system validated — which is the whole reason for the honest hybrid.
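What "detectable" means can be shown with a generic hash chain, in which each entry's hash covers the previous entry's hash. This Python sketch is a textbook illustration under assumptions, not the book's audit chain: the record fields and the function names entry_hash, append, and first_broken_link are all hypothetical.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed starting value for the first link


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous link together with a canonical form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list[dict], record: dict) -> None:
    """Add a record, linking it to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev, "hash": entry_hash(prev, record)})


def first_broken_link(chain: list[dict]) -> Optional[int]:
    """Return the index of the first entry that fails verification, else None."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["record"]):
            return index
        prev = entry["hash"]
    return None
```

Editing any stored record changes the hash it should produce, so verification fails at that entry. The sketch also shows the limit the paragraph above describes: someone who can rewrite every later hash as well defeats it, which is why the surrounding procedures and controls, not the data structure alone, carry the compliance weight.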

Conventions you will see everywhere

Code comes from the repo. Each block names its source file (for example, in examples/platform/compose/compose.yaml). We never invent an API, a flag, or a line of output.
Mocks are labelled as mocks. Systems that cannot run on a laptop — AVEVA PI, SAP, Emerson DeltaV, commercial LIMS (these are industrial data historians, enterprise resource-planning software, a process control system, and laboratory information management systems, respectively) — are represented by clearly labelled mocks (lightweight fakes that stand in for the real thing) built with FastAPI or OPC UA, and they honor the real API contract (the same request-and-response shape the genuine system exposes), with a documented swap to the production endpoint (the real system's address) when you go live. The integration code is real; we always tell you when its counterpart is simulated.
Citations are inline. Numeric markers like [1] link to this chapter's reference list. There is no bibliography in the body.
Honesty about limits is a feature. Wherever open source falls short of GxP, Part 11, high availability, or vendor accountability, we say so plainly and point at the hybrid or commercial answer. We also flag 2026 license traps — TimescaleDB's TSL (Timescale License) features and Grafana's AGPL (a copyleft license that can force you to publish your own source code if you offer the tool as a hosted service), and others — so a tool you adopt for being "open" does not surprise you in a redistribution or SaaS (software-as-a-service, i.e. running it as a hosted product) context.
Bilingual. This book ships in English and Korean; code, identifiers, and log output stay verbatim in both.

Why it matters

A bioprocess generates a torrent of data — thousands of tagged sensor readings a minute, offline assays, chromatograms, spectra — and in most facilities that data is scattered across a historian no one can query, spreadsheets on a share drive, and a vendor cloud you rent but do not control. The cost is not abstract. It shows up as a batch investigation that takes three weeks because nobody can reconstruct what the sensors were doing during a day-7 temperature excursion, or as a technology transfer that stalls because the receiving site cannot interpret the sending site's tag names.

Building the stack yourself, in the open, changes your relationship to that data. You stop treating the platform as magic and start treating it as engineering. And because we hold the whole thing to the runnable or cited rule, you learn the difference between what a vendor's marketing claims and what a regulation actually requires — which, when an inspector is standing in your facility, is the only distinction that counts.

In the real world

The honest hybrid is not a compromise we invented to sell a book; it is the daily reality of every biomanufacturing data team. Walk into a modern facility and you will find open-source tools — Grafana, PostgreSQL, Python, MQTT — running happily alongside a validated commercial historian and an MES (Manufacturing Execution System — the software that directs and records production on the shop floor) that cost millions. The skill the industry actually pays for is knowing where the boundary sits and why.

The standards bodies have caught up to this. ISA-95 (IEC 62264), now in its 2025 edition, gives us the layered Level 0–4 reference model the whole book is organized around: from Level 0, the field sensors on the floor, up through control and operations to Level 4, the enterprise business systems [10]. GAMP 5's second edition explicitly blessed open-source software inside a validated lifecycle [6], and the FDA's Computer Software Assurance guidance reframed validation as risk-proportionate assurance rather than documentation theater [7]. Read together with Part 11 [3], its scope-and-application guidance [4], and EU Annex 11 [5], the message is consistent: regulators care that your system is controlled and your data has integrity, not which logo is on the box.

Key terms

OSS (open-source software): software whose source is publicly available under a license permitting use, study, and modification. The foundation of this book's stack.
Honest hybrid: this book's stance that pure OSS covers roughly 80% of a bioprocess data platform, while the GxP last mile (validation, e-signatures, high availability (HA), vendor accountability) is met with hardening or commercial systems.
Validation / a validated system: documented, risk-based evidence that a computerized system does what it is intended to do — and that the controls and procedures around it keep it that way — so its records can be trusted; the property belongs to the whole system, not to any one tool.
GxP / cGMP: the family of "Good x Practice" quality regulations; cGMP is current Good Manufacturing Practice, the FDA-enforced quality system that GxP data-integrity rules live inside.
21 CFR Part 11: the FDA rule defining when an electronic record or signature is trustworthy enough to replace paper.
EU Annex 11: the European GMP guidance on computerised systems; the regulatory counterpart to Part 11.
GAMP 5: ISPE's risk-based reference for validating GxP computerized systems; its 2nd edition addresses open source directly.
ISA-95 (IEC 62264): the standard layered model (Levels 0–4) for integrating enterprise and control systems, used to organize the stack.
FAIR: data that is Findable, Accessible, Interoperable, and Reusable — a design goal for everything the platform produces.
Fed-batch CHO / Protein A: the running process — Chinese Hamster Ovary cells fed nutrients over a batch, the antibody captured on a Protein A affinity column.
Docker Compose / profile: the tool that boots the multi-container stack from one YAML file; profiles gate which services start so you load only the chapter you need.
Deterministic simulator (SIM_SEED=2026): the fixed-seed engine that makes every dataset byte-for-byte reproducible.
Provenance pin (versions.lock / MANIFEST.sha256): the two plain-text records that make "pinned" and "deterministic" checkable — make lock records each image's sha256: digest, and make data records each dataset's file hash.
Init-mount (docker-entrypoint-initdb.d): the directory PostgreSQL's entrypoint runs on first start; mounting platform/db/ there lets the container build its own schema (00–60) with no extra command.

Where this leads

You now know the contract: an open-source-first stack you build and run, an honest hybrid where open source stops, and a runnable or cited rule that keeps everyone — including us — honest. The next chapter, The Reference Architecture: One Stack, Layer by Layer, unfolds that stack onto a single page. We map every layer — edge connectivity, message bus, historian, batch model, contextualization, semantics, compliance, analytics — to its ISA-95 level and to the open-source tool we chose for it, and we draw the OSS-versus-commercial boundary in ink so you can see the whole journey from sensor to submission before you build the first piece of it.