Preface โ From Sensor to Submission
๐ Where we are: the front door of the third and final book in the trilogy โ where you stop reading about bioprocess data and start building the platform that carries it from a sensor on a bioreactor all the way to a regulatory submission.
The first book in this trilogy, From Cell to Cure, walked you through how a monoclonal antibody is actually made โ the living CHO cells, the fed-batch bioreactor, the Protein A column. The second, Data Management in Biopharmaceutical Manufacturing Processes, stepped back and asked how the data from that process should be governed: who owns it, how it stays trustworthy, what "good" looks like. Both were books you read.
This one is a book you run.
You will clone one Git repository, type make up, and watch a real, open-source bioprocess data platform come to life on your laptop. By the end you will have a historian storing live sensor data, a batch model that makes those numbers mean something, an audit trail you can try to tamper with (and catch yourself doing it), and a soft-sensor that predicts antibody titer from a Raman spectrum. Every code block in this book comes from a file in that repository that was actually executed and tested. Nothing is hand-waved.
The first book taught you to cook the dish. The second taught you kitchen hygiene and how the health inspector thinks. This book hands you the kitchen: the stoves, the thermometers, the logbook, and the recipe cards โ all open-source, all wired together โ and shows you how to build it yourself, then tells you honestly which appliances you still have to buy from a commercial vendor because the free ones cannot pass inspection on their own.
What this chapter coversโ
- Who this book is for, and how it hands off from the two books before it.
- The two ideas that hold the whole book together: open-source first, and an honest hybrid where pure OSS runs out of road.
- The one rule we never break: every claim is either runnable here or cited.
- The running case โ a fed-batch CHO + Protein A antibody line โ and exactly how to clone and start the companion repository.
- The conventions you will see in every chapter: how code is shown, how sources are cited, and how we flag the places where open source is not enough.
Who this book is forโ
If you are a process engineer who has watched data disappear into a vendor's black box, an automation or data engineer asked to "just connect the bioreactor to the cloud," a data scientist tired of receiving CSVs with no context, or a student who wants to see how a real pharmaceutical data stack fits together โ this book is for you. You do not need to be a microbiologist or a regulatory-affairs specialist; the two earlier books supply that grounding, and we recap what matters as we go. You will be more comfortable if you have seen a terminal before. You do not need to be a Docker expert โ Chapter 2 brings you up to speed โ but you should be willing to type commands and read error messages without panic.
We assume a laptop with about 16 GB of RAM, Docker and Docker Compose installed [1], Python 3.12 with uv (the fast Python environment and package installer that make venv uses), Git [2], and GNU Make โ since every command in this book is a make target. That is the entire prerequisite list. The core stack โ the part you bring up first โ fits in roughly 3 GB of RAM, and later, heavier pieces are opt-in so your machine only carries the chapter you are on.
The two ideas behind the bookโ
Open-source firstโ
The thesis of this book is that you can build most of a modern, regulator-aware bioprocess data platform out of open-source software, and that doing so is clarifying: when every layer is inspectable, you finally understand what your data is doing. We assemble a full stack โ a PostgreSQL + TimescaleDB historian, an Eclipse Mosquitto message broker, a Grafana dashboard layer, an ISA-88/95 batch model in SQL, a knowledge graph, an ALCOA+ audit chain, and a Raman-to-titer soft-sensor โ almost entirely from permissively licensed tools.
But "open source first" is a starting posture, not a religion. Where a commercial system is the honest answer, we say so and we integrate with it.
The honest hybridโ
Here is the line we will not let you cross unaware: no open-source tool is 21 CFR Part 11 compliant out of the box, and none ever will be by download alone [3]. Part 11 โ the U.S. Food and Drug Administration's rule on electronic records and electronic signatures โ sets the criteria for treating an electronic record as trustworthy and equivalent to a paper one [3]. Its companion guidance makes the point even sharper: compliance is risk-based and attaches to a validated system and its surrounding procedures, applied with enforcement discretion on certain controls โ not to a piece of software sitting on a shelf [4]. The European counterpart, EudraLex Annex 11, says the same thing in different words: a computerised system used in a regulated process must be validated, with audit trails and data-integrity controls, and that property belongs to the whole system, not to any one component [5].
So our honest estimate, repeated throughout the book: pure open source gets you roughly 80% of the way. It captures the data, stores it, contextualizes it, visualizes it, and even lets you reason over it semantically. The last 20% โ the GxP last mile of validation, qualified electronic signatures, vendor accountability, and high-availability guarantees โ is where you either do serious hardening work yourself or reach for a commercial product. (We define cGMP, current Good Manufacturing Practice, here on first use: the FDA-enforced quality system that GxP data integrity rules live inside.)
This is not a defeat for open source. It is what the modern guidance actually expects. The second edition of ISPE's GAMP 5 โ the industry's main reference for validating computerized systems โ explicitly added an appendix on open-source software and a risk-based assurance approach, confirming that OSS can be used in a GxP setting, but only inside a validated lifecycle [6]. The FDA's Computer Software Assurance thinking pushes in the same direction: assurance is a risk-proportionate, intended-use activity layered onto software, which is precisely why no tool arrives compliant [7]. We will build the runnable 80%, point clearly at the hybrid 20%, and in the final chapter score exactly what you give up at each boundary.
From sensor to submission: data flows left to right through capture, historian, batch context, semantics, and trust layers. The unshaded portion is pure open source you build and run here; the shaded last mile near the submission is the honest hybrid โ validation, qualified e-signatures, and vendor accountability that open source alone does not deliver. Original diagram by the authors, created with AI assistance.
The one rule: runnable or citedโ
Every factual claim in this book is backed one of two ways. Either it is runnable โ there is a file in the companion repository you can execute to see it for yourself โ or it is cited to a primary source: a regulation, a standard, an official documentation page, or a peer-reviewed paper.
The companion repository lives at examples/ and you clone it once. Its README.md states the contract plainly:
# From Sensor to Submission โ companion repo
Everything the book claims is **either runnable here or cited.** This repo is the
"runnable" half โ and CI runs the same `make` targets the book prints, so
"implementable on a laptop" is *proven*, not asserted.
That last sentence is the strongest promise we make. The same make targets you type while reading are run on a clean machine by continuous integration. If a command in this book did not work, the build would have failed before publication.
The running case and how to start itโ
Throughout the book we follow one process: a fed-batch CHO cell culture producing a monoclonal antibody, captured on a Protein A column โ the workhorse that still makes most approved antibodies today. Where it matters, we call out the modern intensified variant: a perfusion culture feeding a multi-column continuous capture train. The simulator ships both, so the difference is something you can run, not just read about.
Getting started is five commands. From examples/README.md:
make venv # Python env + the simulator (uv)
make data # generate every dataset deterministically + MANIFEST.sha256
make up # bring up the core stack (postgres+timescale, mosquitto, grafana)
make seed # load the ISA-88/95 reference CHO line
make load # load the datasets into the historian + lab tables
The word deterministically in make data is load-bearing. The simulator runs from a single fixed seed (SIM_SEED=2026), so the fourteen-day batch is byte-for-byte identical on your laptop and on ours. After generation, a MANIFEST.sha256 records the hash of every file:
e3a78c7291c873b8ede54611d775152c356da138c22be1fc2f35ded04b359701 batches.csv
9b2eec56e5af4d12d7b78895578199fc88a49cf352ba2b45f99f447d275aa441 fedbatch_timeseries.parquet
6f4b5c4261fe3d7c4052326bd4d8fd3530c1cc06f8f0fd305dbb706cf9a0547c fedbatch_timeseries_10min.csv
If your numbers ever disagree with the book's, the manifest tells you immediately, before a single chart misleads you. That is FAIR data thinking in miniature โ Findable, Accessible, Interoperable, Reusable โ with reproducibility baked in as a design goal rather than a hope [8].
What make up actually startsโ
The command make up is a thin wrapper. In examples/Makefile it brings up only the core profile and waits for the database to answer:
up: ## bring up the core stack (postgres+timescale, mosquitto, grafana)
$(COMPOSE) --profile core up -d
@echo "waiting for postgres..." && sleep 3
@until docker exec sensor-to-submission-postgres-1 pg_isready -U bioproc >/dev/null 2>&1; do sleep 2; done
@echo "core stack up."
Behind it sits one declarative file, examples/platform/compose/compose.yaml, that defines every service exactly once. Docker Compose lets a single YAML describe a multi-container application and start it with one command [9], and we lean on its profiles feature so you pay only for the layer you are on:
services:
# --- core --------------------------------------------------------------
postgres:
# timescale/timescaledb IS PostgreSQL + TimescaleDB, so the historian
# hypertable and the ISA-88/95 batch model live in one joinable database.
image: timescale/timescaledb:2.17.2-pg17
profiles: ["core"]
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-bioproc} -d ${POSTGRES_DB:-bioproc}"]
interval: 5s
timeout: 5s
retries: 20
mosquitto:
image: eclipse-mosquitto:2.0.22
profiles: ["core"]
grafana:
image: grafana/grafana-oss:11.4.0
profiles: ["core"]
Two things in that snippet are worth slowing down for. First, every image is pinned to an exact version โ timescale/timescaledb:2.17.2-pg17, eclipse-mosquitto:2.0.22, grafana/grafana-oss:11.4.0 โ never :latest. A floating tag is a time bomb in a regulated system, and the repository's design goes a step further: it intends to pin each image's underlying digest in a versions.lock so the running stack, the license inventory, and the validation register can never silently drift apart. Second, the comment on postgres reveals a deliberate choice: TimescaleDB is PostgreSQL with a time-series extension, so your high-rate sensor history and your ISA-88/95 batch context live in the same joinable database. That single decision saves you an entire category of integration pain later.
When the command finishes, you have a historian, a message broker, and a dashboard server running โ the always-on foundation every later chapter assumes.
Conventions you will see everywhereโ
- Code comes from the repo. Each block names its source file (for example, in
examples/platform/compose/compose.yaml). We never invent an API, a flag, or a line of output. - Mocks are labelled as mocks. Systems that cannot run on a laptop โ AVEVA PI, SAP, Emerson DeltaV, commercial LIMS โ are represented by clearly labelled FastAPI or OPC UA mocks that honor the real API contract, with a documented swap to the production endpoint. The integration code is real; we always tell you when its counterpart is simulated.
- Citations are inline. Numeric markers like [1] link to this chapter's reference list. There is no bibliography in the body.
- Honesty about limits is a feature. Wherever open source falls short of GxP, Part 11, high availability, or vendor accountability, we say so plainly and point at the hybrid or commercial answer. We also flag 2026 license traps โ TimescaleDB's TSL features, Grafana's AGPL, and others โ so a tool you adopt for being "open" does not surprise you in a redistribution or SaaS context.
- Bilingual. This book ships in English and Korean; code, identifiers, and log output stay verbatim in both.
Why it mattersโ
A bioprocess generates a torrent of data โ thousands of tagged sensor readings a minute, offline assays, chromatograms, spectra โ and in most facilities that data is scattered across a historian no one can query, spreadsheets on a share drive, and a vendor cloud you rent but do not control. The cost is not abstract. It shows up as a batch investigation that takes three weeks because nobody can reconstruct what the sensors were doing during a day-7 temperature excursion, or as a technology transfer that stalls because the receiving site cannot interpret the sending site's tag names.
Building the stack yourself, in the open, changes your relationship to that data. You stop treating the platform as magic and start treating it as engineering. And because we hold the whole thing to the runnable or cited rule, you learn the difference between what a vendor's marketing claims and what a regulation actually requires โ which, when an inspector is standing in your facility, is the only distinction that counts.
In the real worldโ
The honest hybrid is not a compromise we invented to sell a book; it is the daily reality of every biomanufacturing data team. Walk into a modern facility and you will find open-source tools โ Grafana, PostgreSQL, Python, MQTT โ running happily alongside a validated commercial historian and an MES that cost millions. The skill the industry actually pays for is knowing where the boundary sits and why.
The standards bodies have caught up to this. ISA-95 (IEC 62264), now in its 2025 edition, gives us the layered Level 0โ4 reference model the whole book is organized around, from the sensor on the floor to the enterprise system [10]. GAMP 5's second edition explicitly blessed open-source software inside a validated lifecycle [6], and the FDA's Computer Software Assurance guidance reframed validation as risk-proportionate assurance rather than documentation theater [7]. Read together with Part 11 [3], its scope-and-application guidance [4], and EU Annex 11 [5], the message is consistent: regulators care that your system is controlled and your data has integrity, not which logo is on the box.
The institutional momentum is real, too. NIIMBL โ the U.S. public-private National Institute for Innovation in Manufacturing Biopharmaceuticals โ exists to accelerate exactly this kind of capability across the industry. Its pilot-scale cGMP facility with the University of Delaware, known as SABRE, broke ground in April 2024 and is under construction as of mid-2026. SABRE is a facility, not a data program โ but facilities like it are where the data architectures this book teaches will be exercised against real, regulated production. The open question SABRE and every plant like it must answer is the one we return to in every chapter: which layer holds the trustworthy, audit-trailed record of truth?
Key termsโ
- OSS (open-source software): software whose source is publicly available under a license permitting use, study, and modification. The foundation of this book's stack.
- Honest hybrid: this book's stance that pure OSS covers roughly 80% of a bioprocess data platform, while the GxP last mile (validation, e-signatures, HA, vendor accountability) is met with hardening or commercial systems.
- GxP / cGMP: the family of "Good x Practice" quality regulations; cGMP is current Good Manufacturing Practice, the FDA-enforced quality system that GxP data-integrity rules live inside.
- 21 CFR Part 11: the FDA rule defining when an electronic record or signature is trustworthy enough to replace paper.
- EU Annex 11: the European GMP guidance on computerised systems; the regulatory counterpart to Part 11.
- GAMP 5: ISPE's risk-based reference for validating GxP computerized systems; its 2nd edition addresses open source directly.
- ISA-95 (IEC 62264): the standard layered model (Levels 0โ4) for integrating enterprise and control systems, used to organize the stack.
- FAIR: data that is Findable, Accessible, Interoperable, and Reusable โ a design goal for everything the platform produces.
- Fed-batch CHO / Protein A: the running process โ Chinese Hamster Ovary cells fed nutrients over a batch, the antibody captured on a Protein A affinity column.
- Docker Compose / profile: the tool that boots the multi-container stack from one YAML file; profiles gate which services start so you load only the chapter you need.
- Deterministic simulator (
SIM_SEED=2026): the fixed-seed engine that makes every dataset byte-for-byte reproducible.
Where this leadsโ
You now know the contract: an open-source-first stack you build and run, an honest hybrid where open source stops, and a runnable or cited rule that keeps everyone โ including us โ honest. The next chapter, The Reference Architecture: One Stack, Layer by Layer, unfolds that stack onto a single page. We map every layer โ edge connectivity, message bus, historian, batch model, contextualization, semantics, compliance, analytics โ to its ISA-95 level and to the open-source tool we chose for it, and we draw the OSS-versus-commercial boundary in ink so you can see the whole journey from sensor to submission before you build the first piece of it.