Upstream Capture: The Production Bioreactor
Where we are: Part II, Capturing the Process, Chapter 7. The wiring is in place (OPC UA, MQTT, the edge gateway); now we point it at the single most valuable data source in the plant and capture a real 14-day batch.
Picture a 2,000-litre stainless-steel kettle of living cells that you have to keep alive for two weeks. A handful of probes report its temperature, acidity, oxygen, stirring speed, and how much medicine has accumulated, once a second, forever. Our job in this chapter is to catch every one of those numbers, stamp each with a little "is this reading trustworthy?" flag, and file the fast-changing ones in the historian while the batch-defining facts go to the relational record. Get this layer right and the whole rest of the platform has clean fuel; get it wrong and every dashboard, model, and audit downstream inherits the mess.
What this chapter covers
The production bioreactor is where the drug is actually made, so its data is the data a regulator reads first. This chapter:
- introduces the signals a fed-batch CHO bioreactor produces (setpoints in, process values out) and the controller they come from;
- runs the deterministic simulator that generates the 14-day trace the rest of the book reuses, and shows its real output;
- explains quality flags, deadband, and how a deliberate day-7 excursion shows up in the data;
- shows where each signal lands: high-rate readings to the TimescaleDB historian, batch-scoped facts to PostgreSQL;
- loads that trace into the running stack with a real COPY path;
- and calls out, honestly, what changes for the perfusion / continuous variant and where pure open source still needs help.
The bioreactor as a data source
A fed-batch Chinese-hamster-ovary (CHO) culture is the workhorse of approved monoclonal-antibody (mAb) manufacturing. You inoculate a few hundred thousand cells per millilitre, hold the environment exactly where the cells like it, feed concentrated nutrients in boluses as they get hungry, and over roughly two weeks the cells multiply, then age and die, leaving behind a broth rich in antibody. The bioreactor's measurements are the highest-value, most-scrutinized data in the plant precisely because they close the loop: pH, dissolved oxygen, and feed are not just observed, they are actively controlled, and those control decisions move product quality [1].
There are two directions of data:
- Data in: the setpoints and recipe: "hold temperature at 37.0 °C, pH at 7.0, dissolved oxygen at 40 %sat." These come from the recipe (Chapter 3's ISA-88 model) and are written down to the distributed control system (DCS) or single-use skid controller.
- Data out: the process values (PVs) the instruments actually measure, plus derived signals like off-gas CO₂ and an in-line titer estimate, plus alarms.
On a real line the skid or DCS exposes all of this as an OPC UA server. OPC UA (IEC 62541) is the platform-independent industrial interoperability standard, and its strength is that the server is self-describing: each node carries not just a value but a data type, an engineering unit, and metadata [2]. Crucially, every value also arrives with a StatusCode (Good, Uncertain, or Bad) alongside its value and timestamp, so the consumer always knows whether a reading can be trusted [3]. That status often originates in the field device itself: the NAMUR NE 107 recommendation condenses device health into four standardized signals (Failure, Function Check, Out of Specification, Maintenance Required), which a well-behaved controller maps onto the OPC UA quality it publishes [4]. A pH probe mid-calibration, for instance, should report Function Check, which downstream becomes Uncertain.
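As a rough illustration of that mapping, the sketch below turns an NE 107 signal into the numeric quality codes this chapter stores (192 Good, 64 Uncertain, 0 Bad). Only the Function Check to Uncertain row comes from the text; the other rows, the `Ne107` enum, and the `quality_for` helper are hypothetical choices made for the example, and a real controller's mapping should be taken from its own documentation.

```python
from enum import Enum

# Numeric quality codes as used by this chapter's historian schema.
GOOD, UNCERTAIN, BAD = 192, 64, 0


class Ne107(Enum):
    """NAMUR NE 107 status signals, plus NORMAL for a healthy device."""
    NORMAL = "normal"
    FAILURE = "failure"
    FUNCTION_CHECK = "function_check"
    OUT_OF_SPECIFICATION = "out_of_specification"
    MAINTENANCE_REQUIRED = "maintenance_required"


# Hypothetical mapping table: only FUNCTION_CHECK -> UNCERTAIN is stated in the text.
NE107_TO_QUALITY = {
    Ne107.NORMAL: GOOD,
    Ne107.FAILURE: BAD,                     # assumption: value cannot be used
    Ne107.FUNCTION_CHECK: UNCERTAIN,        # from the text (probe mid-calibration)
    Ne107.OUT_OF_SPECIFICATION: UNCERTAIN,  # assumption
    Ne107.MAINTENANCE_REQUIRED: GOOD,       # assumption: value still valid
}


def quality_for(status: Ne107) -> int:
    """Return the numeric quality code to store for a device status."""
    return NE107_TO_QUALITY[status]


print(quality_for(Ne107.FUNCTION_CHECK))  # 64
```

Keeping the mapping in one table makes it easy to review and version alongside the tag dictionary.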
We don't have a 2,000-litre kettle on the laptop, so the repo ships a simulator that produces exactly this shape of data, and does it deterministically, so every reader's numbers match the book's.
Generating a real 14-day batch
The simulator lives in examples/sim/bioproc_sim/fed_batch.py. It is intentionally simple but mechanistically honest: logistic-style growth limited by glucose and glutamine (Monod kinetics), a death phase as the culture ages and nutrients deplete, lactate produced during growth and consumed late, and antibody titer accumulating roughly with the integral of viable biomass. PID-style controllers hold temperature, pH, and dissolved oxygen in band with bounded sensor noise. Here is the kinetic core of the integration loop:
# examples/sim/bioproc_sim/fed_batch.py
for k in range(1, n):
    # nutrient limitation + inhibition
    mu = (MU_MAX
          * glc[k - 1] / (K_GLC + glc[k - 1])
          * gln[k - 1] / (K_GLN + gln[k - 1])
          / (1.0 + lac[k - 1] / LAC_INHIB))
    starving = (glc[k - 1] < 0.3) or (gln[k - 1] < 0.15)
    age = k * DT_DAY
    # death is low while young, accelerates with culture age and toxic by-products
    kd = (KD_BASE
          * (1.0 + (age / KD_AGE_DAY) ** KD_AGE_EXP)
          * (1.0 + 1.2 * starving)
          * (1.0 + 0.04 * amm[k - 1]))
    dXv = (mu - kd) * Xv[k - 1]
    biomass = Xv[k - 1]
    ...
    # antibody production is largely non-growth-associated (rises as growth slows)
    d_titer = Q_P * biomass * (1.0 + 2.0 * (1.0 - mu / MU_MAX))
That last line encodes a real fed-batch fact: most antibody is made after the cells stop dividing, which is why titer keeps climbing while viability falls. The bolus feeds on days 3, 5, 7, 9, 11, and 13 top up glucose and glutamine so the culture doesn't starve too early:
# examples/sim/bioproc_sim/fed_batch.py
FEED_DAYS = (3, 5, 7, 9, 11, 13)
...
if k in feed_steps:
    glc[k] += FEED_GLC
    gln[k] += FEED_GLN
    V[k] += FEED_VOL
    feedA[k] += FEED_GLC * FEED_VOL          # crude kg bookkeeping
    feedB[k] += FEED_GLN * 0.146 * FEED_VOL  # glutamine MW-scaled
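The excerpt does not show how `feed_steps` is built. One plausible reading, assuming a one-minute integration step and that each bolus lands on the first step of its feed day, is sketched below; `STEPS_PER_DAY`, `N_STEPS`, and the set comprehension are illustrative assumptions, not the module's actual code.

```python
FEED_DAYS = (3, 5, 7, 9, 11, 13)  # from the module excerpt
STEPS_PER_DAY = 24 * 60           # assumption: one integration step per minute
N_STEPS = 14 * STEPS_PER_DAY      # 14-day batch

# Hypothetical derivation: the step index at the start of each feed day.
feed_steps = {day * STEPS_PER_DAY for day in FEED_DAYS}

print(sorted(feed_steps))  # [4320, 7200, 10080, 12960, 15840, 18720]
```

A set keeps the `k in feed_steps` membership test inside the integration loop cheap.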
Determinism comes from a single seeded random stream (the whole book is pinned to SIM_SEED=2026), so the noise on every probe is byte-identical on every machine. Running the module directly gives the smoke output:
$ python -m bioproc_sim.fed_batch
BATCH-2026-001: rows=322560 tags=16
final VCD=18.2e6 viab=64% titer=5.77 g/L
That is a believable end of batch: a final viable cell density of about 18 million cells/mL (down from a higher mid-batch peak), viability dropping into the mid-60s as the culture ages, and a final titer of 5.77 g/L. Sixteen tags over 20,160 one-minute samples gives 322,560 rows: the exact dataset every later chapter queries.
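The two claims above, reproducible noise and the row count, can be checked with a short sketch. It uses the standard-library `random` module as a stand-in for whatever generator the simulator actually uses; `probe_noise` and its `sigma` default are hypothetical.

```python
import random

SIM_SEED = 2026
N_TAGS = 16
N_SAMPLES = 14 * 24 * 60  # 14 days of one-minute samples


def probe_noise(seed: int, n: int, sigma: float = 0.05) -> list[float]:
    """Draw n Gaussian noise samples from a freshly seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


# Same seed, same stream: the two draws are identical element by element.
first = probe_noise(SIM_SEED, 5)
second = probe_noise(SIM_SEED, 5)
print(first == second)                # True

# Row count of the long-format stream: samples x tags.
print(N_SAMPLES, N_SAMPLES * N_TAGS)  # 20160 322560
```

The important design point is a single generator seeded once; seeding per probe or per call would change the stream and break comparisons between readers.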
Sixteen tags, two homes
The simulator emits two products. One is the internal state trajectory (cell density, metabolites, volume) that the offline-assay and Raman simulators reuse so every dataset in the book agrees with every other. The other is the long-format tag stream a historian would actually store. The tag dictionary is declared right in the module:
# examples/sim/bioproc_sim/fed_batch.py
def _tag_specs() -> dict[str, str]:
    return {
        "BR101.Temp.PV": "degC",
        "BR101.Temp.SP": "degC",
        "BR101.pH.PV": "pH",
        "BR101.pH.SP": "pH",
        "BR101.DO.PV": "%sat",
        "BR101.DO.SP": "%sat",
        "BR101.Agitation.PV": "rpm",
        ...
        "BR101.OnlineGlucose.PV": "g/L",
        "BR101.Titer.PV": "g/L",
    }
Notice the naming follows the convention from Chapter 4: <asset>.<measurement>.<role>, where BR101 is the production bioreactor unit seeded into the ISA-95 model, and .PV / .SP separate the measured process value from its setpoint. That separation matters: the setpoint is recipe data; the process value is evidence.
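A small helper can make the naming convention explicit when routing or validating tags. The sketch below is illustrative only: `TagName` and `parse_tag` are hypothetical names, and it assumes every tag has exactly three dot-separated parts, as the tags shown here do.

```python
from typing import NamedTuple


class TagName(NamedTuple):
    asset: str        # e.g. "BR101"
    measurement: str  # e.g. "Temp"
    role: str         # e.g. "PV" or "SP"


def parse_tag(tag: str) -> TagName:
    """Split '<asset>.<measurement>.<role>' into its three parts."""
    parts = tag.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"tag does not match <asset>.<measurement>.<role>: {tag!r}")
    return TagName(*parts)


tag = parse_tag("BR101.Temp.PV")
print(tag.asset, tag.measurement, tag.role)  # BR101 Temp PV
```

Rejecting malformed names at ingestion is cheaper than discovering them later in a join.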
The long-format stream looks like this (the first rows of examples/datasets/fedbatch_timeseries_10min.sample.csv):
ts,tag,value,unit,quality,batch_id
2026-01-05 00:00:00+00:00,BR101.Agitation.PV,81.4323,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.Agitation.SP,81.6008,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.PV,40.8224,%sat,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.SP,40.0,%sat,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.FeedA.PV,0.0,kg,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.FeedB.PV,0.0,kg,192,BATCH-2026-001
...
Long format (one row per tag per timestamp) is the right shape for a historian: it absorbs new tags without schema changes and partitions cleanly by time. But not every signal belongs in the historian. The architectural rule, repeated all through this book, is fast and numeric goes to the time-series database; batch-defining facts go to the relational database. The 20,160 minute-by-minute temperature readings are historian data. The single fact that "BATCH-2026-001 ran recipe R-mAb-01 on unit BR101 from 2026-01-05 to 2026-01-19, status Released" is relational, ISA-88-modelled data, and the join key between the two worlds is batch_id, which is why every historian row carries it. The time-series stream is also where the procedural structure lives: ISA-88 organizes a batch into procedure → unit procedure → operation → phase [5], and the same ts ordering that the historian uses is what later lets us slice the trace into those phases (inoculation, growth, production, harvest) and write phase boundaries into the batch model.
On a real line the collector is either the FreeOpcUa asyncua client (the recommended open-source pure-Python OPC UA library, which subscribes to the skid's nodes and reads value + status [6]) or, for steady interval polling, Telegraf's OPC UA input plugin, which collects tags on a fixed interval and lets you choose whether the timestamp comes from the source, the server, or the gather time. That choice is not cosmetic: contemporaneous capture means preferring the source timestamp so the record reflects when the value was true, not when we happened to poll it [7]. In this chapter we replay the committed golden trace instead of standing up the live OPC UA path, which Chapter 5 already built.
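Long format is convenient to store but often awkward to analyse, so a common step is pivoting it to one record per batch and timestamp. The sketch below does that with the standard library on four of the sample rows shown earlier; `pivot_wide` is a hypothetical helper, and in practice pandas or a SQL query would likely do the same job.

```python
import csv
import io
from collections import defaultdict

# Four rows copied from the sample file shown above.
LONG_CSV = """ts,tag,value,unit,quality,batch_id
2026-01-05 00:00:00+00:00,BR101.Agitation.PV,81.4323,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.Agitation.SP,81.6008,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.PV,40.8224,%sat,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.SP,40.0,%sat,192,BATCH-2026-001
"""


def pivot_wide(long_csv: str) -> dict[tuple[str, str], dict[str, float]]:
    """Group long-format rows into {(batch_id, ts): {tag: value}}."""
    wide: dict[tuple[str, str], dict[str, float]] = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(long_csv)):
        wide[(row["batch_id"], row["ts"])][row["tag"]] = float(row["value"])
    return dict(wide)


wide = pivot_wide(LONG_CSV)
key = ("BATCH-2026-001", "2026-01-05 00:00:00+00:00")
# Control error at the first timestamp: process value minus setpoint.
do_error = wide[key]["BR101.DO.PV"] - wide[key]["BR101.DO.SP"]
print(round(do_error, 4))
```

Because batch_id is part of the key, the pivoted records can still be joined back to the relational batch record.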
The production bioreactor as the plant's highest-value data source: recipe and setpoints flow in, quality-flagged process values flow out, and the stream splits between the historian and the batch model on a shared batch_id. Original diagram by the authors, created with AI assistance.
Quality flags and the day-7 excursion
The most important field after value is quality. The repo uses the OPC UA numeric quality codes directly (192 Good, 64 Uncertain, 0 Bad), and the historian schema makes that explicit and defaults it to Good:
-- examples/platform/db/20-historian.sql
CREATE TABLE ts.sensor_reading (
    ts        timestamptz NOT NULL,
    tag       text NOT NULL,
    value     double precision,
    unit      text,
    quality   smallint NOT NULL DEFAULT 192, -- OPC UA: 192 Good, 64 Uncertain, 0 Bad
    batch_id  text
);
A reading you cannot vouch for is not the same as a missing reading, and it is emphatically not the same as a good one. Storing the flag alongside the value is what lets a later audit trail, alarm rule, or model treat an uncertain point honestly rather than silently averaging it in.
To give later chapters something real to find, the simulator injects a deliberate fault. On day 7 the temperature setpoint dips 0.5 °C for three hours and the dissolved-oxygen probe reports Uncertain for that window:
# examples/sim/bioproc_sim/fed_batch.py
if excursion:
    # day-7 cooling excursion: setpoint dips 0.5 degC for ~3 h, DO reads uncertain
    e0 = int(7 * 24 * 60)
    e1 = e0 + 180
    temp_sp[e0:e1] = 36.5
    temp[e0:e1] = 36.5 + rng.normal(0, 0.05, e1 - e0)
    do_uncertain[e0:e1] = True
The arithmetic is checkable: 180 minutes × 2 affected tags (BR101.Temp.PV and BR101.DO.PV) = 360 rows that should carry quality 64, out of 322,560 total. Grouping the generated stream by quality confirms it exactly:
quality
64 360
192 322200
Those 360 rows are not noise; they are the chapter's gift to the data-integrity chapters. The day-7 dip is a genuine, attributable deviation that the contextualization views (Chapter 14), the ALCOA+ audit trail (Chapter 20), and the audit-trail-review report (the capstone) will rediscover and explain. A deviation you can trace from raw signal to investigation to disposition is exactly what contemporaneous, attributable, accurate capture is supposed to enable [8].
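The excursion arithmetic can be reproduced in a few lines. The sketch below recomputes the sample indices, the expected row counts, and the wall-clock window, assuming sample index 0 corresponds to the first timestamp in the sample file (2026-01-05 00:00 UTC); `BATCH_START` and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

MINUTES_PER_DAY = 24 * 60
N_SAMPLES = 14 * MINUTES_PER_DAY
N_TAGS = 16
BATCH_START = datetime(2026, 1, 5, tzinfo=timezone.utc)  # assumption: index 0

e0 = 7 * MINUTES_PER_DAY  # first affected sample index
e1 = e0 + 180             # slice end (exclusive), three hours later
affected_tags = ("BR101.Temp.PV", "BR101.DO.PV")  # as stated in the text

uncertain_rows = (e1 - e0) * len(affected_tags)
good_rows = N_SAMPLES * N_TAGS - uncertain_rows
print(e0, e1)                     # 10080 10260
print(uncertain_rows, good_rows)  # 360 322200

window_start = BATCH_START + timedelta(minutes=e0)
window_end = BATCH_START + timedelta(minutes=e1)
print(window_start.isoformat())   # 2026-01-12T00:00:00+00:00
print(window_end.isoformat())     # 2026-01-12T03:00:00+00:00
```

Because the slice end is exclusive, the last affected sample is the one at 02:59, one minute before `window_end`.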
A word on deadband. The simulator stores at a flat one-minute cadence for reproducibility, but a real historian compresses: it stores a new point only when a value moves more than a configured deadband (say, ±0.05 °C) or after a maximum time gap. Deadband saves enormous storage on slow signals, but it must be set per tag with care: too wide and you erase the very excursion you needed to see. The honest engineering rule is to treat deadband as part of the data-integrity design, not a storage afterthought.
Where the readings land
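The deadband rule described above can be sketched as a simple filter. This is a plain absolute-deadband with a maximum time gap, not any particular vendor's compression algorithm; `deadband_compress` and the toy trace are hypothetical.

```python
def deadband_compress(samples, deadband, max_gap):
    """Keep a sample when it moves more than `deadband` from the last stored
    value, or when `max_gap` minutes have passed since the last stored point.

    samples: list of (minute, value) tuples in time order.
    """
    stored = []
    for minute, value in samples:
        if not stored:
            stored.append((minute, value))  # always keep the first point
            continue
        last_minute, last_value = stored[-1]
        if abs(value - last_value) > deadband or (minute - last_minute) >= max_gap:
            stored.append((minute, value))
    return stored


# Toy trace: 37.0 degC, a 0.5 degC dip for three minutes, then back to 37.0.
trace = [(m, 36.5 if 10 <= m < 13 else 37.0) for m in range(20)]

print(deadband_compress(trace, deadband=0.05, max_gap=60))
# [(0, 37.0), (10, 36.5), (13, 37.0)]
print(deadband_compress(trace, deadband=1.0, max_gap=60))
# [(0, 37.0)]
```

With the tight deadband, 20 samples shrink to 3 and the dip survives; with the wide one, the dip disappears from the stored record entirely, which is the failure mode the text warns about.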
The historian is a single TimescaleDB hypertable: an ordinary PostgreSQL table that is automatically partitioned by time into chunks, so writes stay fast and old chunks can be dropped or aggregated [9]. The DDL also pre-rolls one-minute and one-hour summaries as continuous aggregates and bounds raw retention:
-- examples/platform/db/20-historian.sql
SELECT create_hypertable('ts.sensor_reading', 'ts', chunk_time_interval => INTERVAL '1 day');
CREATE INDEX ON ts.sensor_reading (tag, ts DESC);
CREATE INDEX ON ts.sensor_reading (batch_id, ts DESC);
CREATE MATERIALIZED VIEW ts.sensor_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
       tag,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value,
       last(value, ts) AS last_value
FROM ts.sensor_reading
GROUP BY bucket, tag
WITH NO DATA;
One honesty note baked into the comments: this DDL uses only Apache-2 / Community features. TimescaleDB's columnstore compression, data tiering, and high-availability features live under the Timescale License (TSL), so the book deliberately avoids them and uses chunk-based partitioning plus a retention policy as the open-source-safe substitute. It is the kind of license trap we flag throughout: the pure-OSS path costs you some compression efficiency, and the chapter says so plainly rather than pretending the gap isn't there.
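For readers who want to see what the continuous aggregate computes without a database, the sketch below emulates the bucket-and-summarize step in plain Python. It is an illustration of the idea only: `rollup` is a hypothetical helper, buckets are aligned to the Unix epoch (an assumption that matches whole-minute and whole-hour boundaries in UTC), and it recomputes everything on each call rather than maintaining the result incrementally.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def rollup(readings, bucket=timedelta(hours=1)):
    """Group (ts, tag, value) readings into fixed-width time buckets and
    return {(bucket_start, tag): {"avg", "min", "max", "last"}}."""
    groups = defaultdict(list)
    for ts, tag, value in readings:
        bucket_start = ts - ((ts - EPOCH) % bucket)  # floor ts to the bucket width
        groups[(bucket_start, tag)].append((ts, value))
    summary = {}
    for key, points in groups.items():
        points.sort()  # time order, so the final value is the latest reading
        values = [v for _, v in points]
        summary[key] = {
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
            "last": values[-1],
        }
    return summary


def at(hour, minute):
    return datetime(2026, 1, 12, hour, minute, tzinfo=timezone.utc)


readings = [
    (at(0, 10), "BR101.Temp.PV", 37.0),
    (at(0, 50), "BR101.Temp.PV", 36.5),
    (at(1, 5), "BR101.Temp.PV", 37.1),
]
hourly = rollup(readings)
print(hourly[(at(0, 0), "BR101.Temp.PV")])
# {'avg': 36.75, 'min': 36.5, 'max': 37.0, 'last': 36.5}
```

Keeping min and max next to the average is what lets a short excursion stay visible in a rollup that would otherwise smooth it away.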
Loading the golden trace is a one-shot bulk COPY in examples/tools/load_datasets.py:
# examples/tools/load_datasets.py
def load_timeseries(conn) -> int:
    df = pd.read_parquet(DATA / "fedbatch_timeseries.parquet")
    buf = io.StringIO()
    df[["ts", "tag", "value", "unit", "quality", "batch_id"]].to_csv(buf, index=False, header=False)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.execute("TRUNCATE ts.sensor_reading")
        with cur.copy("COPY ts.sensor_reading (ts, tag, value, unit, quality, batch_id) "
                      "FROM STDIN WITH (FORMAT csv)") as copy:
            copy.write(buf.read())
    return len(df)
COPY is the right tool for 322,560 rows: far faster than row-by-row inserts, and it preserves the quality flag and batch key exactly. Running the loader against the core stack prints:
loaded: 322560 sensor readings, 1344 offline results, 66 release results, 30 genealogy edges
The same script also writes offline assay and release results into the relational lab schema, and there it deliberately goes row-by-row through INSERT so the ALCOA+ audit trigger fires (Chapter 20). Bulk-load fast time-series; insert-with-audit the batch-defining facts. That asymmetry is the architecture.
Why it matters
The production bioreactor is where critical process parameters (CPPs) are established. FDA's Process Analytical Technology framework describes the bioreactor as the locus of the quality and performance attributes measured during processing: the place where timely, in-process measurement of CPPs lets you understand and ultimately control the process [10]. When a batch is reviewed for release, these are the numbers a reviewer scrutinizes most, and CGMP guidance is explicit that such data must be reliable and accurate, with critical-step values recorded contemporaneously and audit trails reviewed by the quality unit [11].
That is why three small design choices in this chapter carry so much weight: capturing the quality flag (so an uncertain reading is never mistaken for a good one), preferring the source timestamp (so the record is contemporaneous), and keeping the batch key on every row (so the historian and the GMP batch record can always be rejoined). None of these is glamorous. All of them are the difference between a data platform a regulator trusts and one they don't.
In the real world
On a commercial line the OPC UA server is part of a validated DCS or single-use skid (Emerson DeltaV, Siemens, or a vendor controller), and an enterprise historian like AVEVA PI sits beside it. PI is genuinely excellent at exactly this: high-rate compression, decades of retention, and battle-tested collectors. The honest open-source reality is that TimescaleDB on the Apache-2 build will capture, store, and serve this data faithfully and at laptop scale, but you give up PI's turnkey compression, its huge connector ecosystem, and a vendor you can hold contractually accountable in a GxP audit. That trade-off (roughly 80 % of the capability in pure OSS, with the validated last mile being where commercial tools and a hybrid architecture earn their keep) is the spine of this whole book, and the integration chapters (17 onward) show the bridge to PI explicitly.
NIIMBL (the National Institute for Innovation in Manufacturing Biopharmaceuticals, a U.S. public-private institute) and its SABRE pilot-scale current-good-manufacturing-practice (cGMP) facility being built with the University of Delaware (groundbreaking April 2024) exist precisely to de-risk this kind of modern, data-rich bioprocessing at pilot scale. The data shapes in this chapter (quality-flagged OPC UA tags landing in a historian, joined by batch to an ISA-88/95 model) are the same shapes such a facility's data architecture must handle, whatever vendors it picks.
Two honest limits on the simulator. First, there is no real OPC UA server, PLC, or DCS here; asyncua and Telegraf prove the integration code and the data shapes, not vendor-specific quirks, and field OPC UA security is notoriously mis-configured in practice, which Chapter 5 covers. Second, the perfusion / intensified-continuous variant changes the picture: instead of a 14-day batch you run 30+ days of steady state with a perfusion-rate tag, cell-bleed, and harvest streams, and the sample rate climbs. In-line PAT such as Raman spectroscopy adds real-time, release-relevant signals (antibody glycosylation occupancy has been monitored live in CHO bioreactors by in-situ Raman [12]), and those high-value spectra must be captured alongside the CPP setpoints. The historian and the batch_id join survive the switch; the cadence, the tag set, and the volume all grow, which is exactly the stress test Chapter 13 puts the store through.
Key terms
- Fed-batch CHO culture: the dominant mAb production mode, in which Chinese-hamster-ovary cells are grown in a sealed bioreactor with periodic nutrient feeds over ~2 weeks.
- Setpoint (SP) vs process value (PV): the target the controller aims for vs the value the instrument measures; SP is recipe data, PV is evidence.
- CPP (critical process parameter): a process input (e.g., pH, DO, temperature) whose variability affects a critical quality attribute and so must be controlled.
- OPC UA StatusCode / quality flag: Good (192), Uncertain (64), Bad (0); travels with every value and timestamp.
- NAMUR NE 107: standardizes field-device health into four signals (Failure / Function Check / Out of Specification / Maintenance Required) that map onto quality.
- Hypertable: a PostgreSQL table that TimescaleDB auto-partitions by time into chunks for fast time-series writes and retention.
- Continuous aggregate: a materialized, incrementally maintained rollup (1-minute / 1-hour avg/min/max) over the raw hypertable.
- Deadband: the minimum change required before a historian stores a new point; a storage-vs-fidelity trade-off that is also a data-integrity decision.
- Long format: one row per tag per timestamp; the schema-stable shape a historian stores.
Where this leads
The bioreactor gives us a dense, in-line stream, but in-line probes drift, and the most decisive numbers (viable cell density, viability, metabolites, true titer) still come from samples pulled by hand and run on a bench analyzer. The next chapter, Seed Train & Cell-Culture Offline Analytics, follows the culture back to its origins and shows how to capture those offline results, link each sample to the right batch and timepoint, and reconcile the noisier-but-authoritative lab values against the in-line trace we just stored.