Upstream Capture: The Production Bioreactor

📍 Where we are: Part II, Capturing the Process — Chapter 9. The wiring is in place (OPC UA, MQTT, the edge gateway); now we point it at the single most valuable data source in the plant and capture a real 14-day batch.

The simple version

Picture a 2,000-litre stainless-steel kettle of living cells that you have to keep alive for two weeks. A handful of probes report its temperature, acidity, oxygen, stirring speed, and how much medicine has accumulated — once a second, forever. Our job in this chapter is to catch every one of those numbers, stamp each with a little "is this reading trustworthy?" flag, and file the fast-changing ones in the historian (the plant's time-series database) while the batch-defining facts go to the relational record (the structured database of who/what/when, kept for audit). Get this layer right and the whole rest of the platform has clean fuel; get it wrong and every dashboard, model, and audit downstream inherits the mess.

What this chapter covers

The production bioreactor is where the drug is actually made, so its data is the data a regulator reads first. This chapter:

introduces the signals a fed-batch CHO bioreactor produces — setpoints in, process values out — and the controller they come from;
runs the deterministic simulator that generates the 14-day trace the rest of the book reuses, and shows its real output;
explains quality flags, deadband, and how a deliberate day-7 excursion shows up in the data;
shows where each signal lands, and why it takes two databases: high-rate readings to the TimescaleDB historian (a time-series database, tuned for huge volumes of timestamped numbers), batch-scoped facts to PostgreSQL (the general-purpose relational database it is built on, tuned for structured records and audit trails);
loads that trace into the running stack (the Docker Compose services brought up earlier in Part II) with a real bulk-load COPY path (COPY being PostgreSQL's fast bulk-load command, introduced below);
and calls out, honestly, what changes for the perfusion / continuous variant and where pure open source still needs help.

The bioreactor as a data source

A fed-batch Chinese-hamster-ovary (CHO) culture is the workhorse of approved monoclonal-antibody (mAb) manufacturing. You inoculate (seed the reactor with) a few hundred thousand cells per millilitre, hold the environment exactly where the cells like it, and feed concentrated nutrients in boluses (discrete top-up doses) as they get hungry. Over roughly two weeks the cells multiply, then age and die, leaving behind a broth rich in antibody; its concentration in g/L is the titer. The bioreactor's measurements are the highest-value, most-scrutinized data in the plant precisely because they close the loop (the controller feeds each reading straight back into the action that adjusts it): pH, dissolved oxygen, and feed are not just observed, they are actively controlled, and those control decisions move product quality [11].

Two directions of data: setpoints in, process values out

There are two directions of data:

Data in: the setpoints and recipe, such as "hold temperature at 37.0 °C, pH at 7.0, dissolved oxygen at 40 %sat" (percent of air saturation, i.e. 40 % of the oxygen the medium would hold if fully air-saturated). These come from the recipe (Chapter 4's ISA-88 model; ISA-88 models the recipe/procedure, i.e. what to do, while ISA-95, below, models the equipment hierarchy, i.e. where it runs) and are written down to the distributed control system (DCS) or single-use skid controller (a skid is a pre-assembled, frame-mounted equipment unit with its own controller and instruments).
Data out: the process values (PVs) the instruments actually measure, plus two further signals and alarms. The first is off-gas CO₂, the CO₂ fraction of the gas leaving the reactor, read directly by an exhaust-gas analyzer and a proxy for how fast the cells are respiring. The second is an in-line titer estimate, a model-derived soft-sensor value inferred from spectra rather than a direct probe reading (see Process Analytics: SPC, MVDA & Soft Sensors).

On a real line the skid or DCS exposes all of this as an OPC UA server, OPC UA being the communication protocol industrial machines use to share data (introduced in Chapter 7). OPC UA (IEC 62541) is the platform-independent industrial interoperability standard, and its strength is that the server is self-describing: each node carries not just a value but a data type, an engineering unit, and metadata [1]. Crucially, every value also arrives with a StatusCode (Good, Uncertain, or Bad) alongside its value and timestamp, so the consumer always knows whether a reading can be trusted [2]. That status often originates in the field device itself. The NAMUR NE 107 recommendation (NAMUR being the international user association of automation technology in process industries) condenses device health into four standardized signals: Failure, Function Check, Out of Specification, and Maintenance Required. A well-behaved controller maps these onto the OPC UA quality it publishes [3]. A pH probe mid-calibration, for instance, should report Function Check, which downstream becomes Uncertain. Where those skid instruments expose a standard model rather than a vendor-specific one, the relevant standard is PA-DIM (Process Automation Device Information Model), the OPC UA companion specification from Chapter 7 that fixes the shape of this process-instrument data and folds in exactly these NAMUR diagnostics. (A companion specification is an add-on vocabulary that standardizes the data model for one industry on top of base OPC UA.)
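
The hand-off from device health to published quality can be sketched as a small lookup. Only the Function Check to Uncertain case is stated in the text above; the other three rows are an assumed site policy chosen for illustration, and the names Ne107, NE107_TO_QUALITY, and published_quality are hypothetical.

```python
from enum import Enum
from typing import Optional


class Ne107(Enum):
    """The four NAMUR NE 107 device-health signals."""
    FAILURE = "Failure"
    FUNCTION_CHECK = "Function Check"
    OUT_OF_SPECIFICATION = "Out of Specification"
    MAINTENANCE_REQUIRED = "Maintenance Required"


# ASSUMPTION: only FUNCTION_CHECK -> Uncertain comes from the chapter.
# The other rows are one plausible policy, not a rule from any standard.
NE107_TO_QUALITY = {
    Ne107.FAILURE: "Bad",
    Ne107.FUNCTION_CHECK: "Uncertain",
    Ne107.OUT_OF_SPECIFICATION: "Uncertain",
    Ne107.MAINTENANCE_REQUIRED: "Good",
}


def published_quality(signal: Optional[Ne107]) -> str:
    """Quality class the controller publishes; no active signal means healthy."""
    if signal is None:
        return "Good"
    return NE107_TO_QUALITY[signal]


# A pH probe mid-calibration reports Function Check.
print(published_quality(Ne107.FUNCTION_CHECK))
```

Keeping the mapping in one table makes the policy reviewable: a change to how Maintenance Required is treated is a one-line, auditable diff rather than logic scattered through the collector.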

We don't have a 2,000-litre kettle on the laptop, so the repo ships a simulator that produces exactly this shape of data, and does it deterministically, so every reader's numbers match the book's. For laptop reproducibility the simulator runs a small bench working volume: it starts near 8 L and is fed in 0.25 L boluses. The intensive quantities (those that do not depend on how big the tank is: concentrations, temperature, pH), the data shape, and the control logic are identical to what a 2,000 L production single-use bioreactor (SUB) produces; only the extensive numbers (those that scale with size: working volume and cumulative feed mass) scale.

Generating a real 14-day batch

The simulator lives in examples/sim/bioproc_sim/fed_batch.py. It is intentionally simple but mechanistically honest: logistic-style growth (it speeds up, then self-limits as resources run low) limited by glucose and glutamine (Monod kinetics — growth rate falls as a nutrient is used up), a death phase as the culture ages and nutrients deplete, lactate produced during growth and consumed late, and antibody titer accumulating roughly with the integral of viable biomass (biomass being the amount of living cells; the integral is the running total of those cells over time — the more live-cell-time the culture racks up, the more antibody it makes). PID-style (proportional-integral-derivative — the standard feedback-control algorithm) controllers hold temperature, pH, and dissolved oxygen in band (within an acceptable range) with bounded sensor noise. Here is the kinetic core of the integration loop:

# examples/sim/bioproc_sim/fed_batch.py
for k in range(1, n):
    # nutrient limitation + inhibition
    mu = (MU_MAX
          * glc[k - 1] / (K_GLC + glc[k - 1])
          * gln[k - 1] / (K_GLN + gln[k - 1])
          / (1.0 + lac[k - 1] / LAC_INHIB))
    starving = (glc[k - 1] < 0.3) or (gln[k - 1] < 0.15)
    age = k * DT_DAY
    # death is low while young, accelerates with culture age and toxic by-products
    kd = (KD_BASE
          * (1.0 + (age / KD_AGE_DAY) ** KD_AGE_EXP)
          * (1.0 + 1.2 * starving)
          * (1.0 + 0.04 * amm[k - 1]))

    dXv = (mu - kd) * Xv[k - 1]
    biomass = Xv[k - 1]
    ...
    # antibody production is largely non-growth-associated (rises as growth slows)
    d_titer = Q_P * biomass * (1.0 + 2.0 * (1.0 - mu / MU_MAX))

That last line encodes a real fed-batch fact: most antibody is made after the cells stop dividing, which is why titer keeps climbing while viability falls. The bolus feeds on days 3, 5, 7, 9, 11, and 13 top up glucose and glutamine so the culture doesn't starve too early:

# examples/sim/bioproc_sim/fed_batch.py
FEED_DAYS = (3, 5, 7, 9, 11, 13)
...
    if k in feed_steps:
        glc[k] += FEED_GLC
        gln[k] += FEED_GLN
        V[k] += FEED_VOL
        feedA[k] += FEED_GLC * FEED_VOL          # crude kg bookkeeping
        feedB[k] += FEED_GLN * 0.146 * FEED_VOL  # glutamine MW-scaled
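
The two kinetic ideas in the excerpts above, Monod limitation and the growth-dependent productivity factor, can be tried in isolation. This is a minimal sketch: the constant values below are placeholders chosen for illustration (the repo's real values are not shown in the excerpts), and the helper names growth_rate and productivity_multiplier are hypothetical.

```python
# Placeholder constants for illustration only; not the repo's values.
MU_MAX = 0.04     # assumed maximum specific growth rate
K_GLC = 0.5       # assumed glucose half-saturation constant
K_GLN = 0.2       # assumed glutamine half-saturation constant
LAC_INHIB = 40.0  # assumed lactate inhibition scale


def growth_rate(glc: float, gln: float, lac: float) -> float:
    """Monod limitation on glucose and glutamine, damped by lactate inhibition."""
    return (MU_MAX
            * glc / (K_GLC + glc)
            * gln / (K_GLN + gln)
            / (1.0 + lac / LAC_INHIB))


def productivity_multiplier(mu: float) -> float:
    """The (1 + 2 * (1 - mu / MU_MAX)) factor from the d_titer line."""
    return 1.0 + 2.0 * (1.0 - mu / MU_MAX)


# Growth rate falls as glucose is used up, all else equal.
rich = growth_rate(glc=6.0, gln=4.0, lac=0.0)
lean = growth_rate(glc=0.3, gln=4.0, lac=0.0)
print(f"mu rich={rich:.4f}  mu lean={lean:.4f}")

# Per-cell antibody output runs from 1x at full growth to 3x at zero growth.
print(productivity_multiplier(MU_MAX), productivity_multiplier(0.0))
```

The multiplier is the whole of the "non-growth-associated" claim in code form: a cell that has stopped dividing contributes three times the titer increment of one growing at MU_MAX.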

Determinism comes from a single seeded random stream — the whole book is pinned to SIM_SEED=2026 — so the noise on every probe is byte-identical on every machine. Running the module directly gives the smoke output:

$ python -m bioproc_sim.fed_batch
BATCH-2026-001: rows=322560 tags=16
  final VCD=18.2e6  viab=64%  titer=5.77 g/L

That is a believable end of batch: a final viable cell density (VCD — the count of living cells per millilitre, written 18.2e6 for 18.2 million) of about 18 million cells/mL (down from a higher late-batch peak near day 12), viability dropping into the mid-60s as the culture ages, and a final titer of 5.77 g/L. Sixteen tags over 20,160 one-minute samples (14 days × 24 h × 60 min) gives 322,560 rows — the exact dataset every later chapter queries.
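
The row count is easy to reproduce. The sketch below assumes sample 0 falls on the first timestamp in the committed sample CSV (2026-01-05 00:00 UTC) and that samples are exactly one minute apart.

```python
from datetime import datetime, timedelta, timezone

DAYS = 14
SAMPLES_PER_DAY = 24 * 60   # one sample per minute
N_TAGS = 16

n_samples = DAYS * SAMPLES_PER_DAY
n_rows = n_samples * N_TAGS
print(n_samples, n_rows)

# ASSUMPTION: sample 0 is stamped with the first ts in the sample CSV.
start = datetime(2026, 1, 5, tzinfo=timezone.utc)
last_sample = start + timedelta(minutes=n_samples - 1)
print(last_sample.isoformat())
```

Under those assumptions the last one-minute sample lands at 23:59 on 2026-01-18, one minute before the fourteenth day closes.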

Sixteen tags, two homes

The simulator emits two products. One is the internal state trajectory (the time-path of the culture's internal variables — cell density, metabolites such as glucose and lactate, volume) that the offline-assay and Raman simulators reuse so every dataset in the book agrees with every other. The other is the long-format tag stream (one row per tag per timestamp — defined below) a historian would actually store. The tag dictionary is declared right in the module:

# examples/sim/bioproc_sim/fed_batch.py
def _tag_specs() -> dict[str, str]:
    return {
        "BR101.Temp.PV": "degC",
        "BR101.Temp.SP": "degC",
        "BR101.pH.PV": "pH",
        "BR101.pH.SP": "pH",
        "BR101.DO.PV": "%sat",
        "BR101.DO.SP": "%sat",
        "BR101.Agitation.PV": "rpm",
        ...  # (+ Agitation.SP/FeedA/FeedB/Pressure/Volume/Offgas O2+CO2 = 16 total)
        "BR101.OnlineGlucose.PV": "g/L",
        "BR101.Titer.PV": "g/L",
    }

Notice the naming follows the convention from Chapter 5: <asset>.<measurement>.<role>, where BR101 is the production bioreactor, registered as a unit in the ISA-95 equipment hierarchy (its s88.unit row), and .PV / .SP separate the measured process value from its setpoint. That separation matters: the setpoint is recipe data; the process value is evidence.
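
Because the tag name is structured, a consumer can decode it rather than treat it as an opaque string. A minimal sketch follows; the Tag type and parse_tag helper are hypothetical, and restricting the role to PV or SP is an assumption based on the tags shown above.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    asset: str        # e.g. BR101, the ISA-95 unit
    measurement: str  # e.g. DO
    role: str         # PV (process value) or SP (setpoint)


def parse_tag(name: str) -> Tag:
    """Split '<asset>.<measurement>.<role>' and reject anything else."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <asset>.<measurement>.<role>, got {name!r}")
    asset, measurement, role = parts
    # ASSUMPTION: only the two roles that appear in the tag dictionary are valid.
    if role not in {"PV", "SP"}:
        raise ValueError(f"unknown role {role!r} in {name!r}")
    return Tag(asset, measurement, role)


tag = parse_tag("BR101.DO.PV")
print(tag)
print("evidence" if tag.role == "PV" else "recipe")
```

Rejecting malformed names at parse time keeps a mistyped tag from quietly becoming a new, unjoined series in the historian.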

The long-format stream looks like this — the first rows of examples/datasets/fedbatch_timeseries_10min.sample.csv:

ts,tag,value,unit,quality,batch_id
2026-01-05 00:00:00+00:00,BR101.Agitation.PV,81.4323,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.Agitation.SP,81.6008,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.PV,40.8224,%sat,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.SP,40.0,%sat,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.FeedA.PV,0.0,kg,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.FeedB.PV,0.0,kg,192,BATCH-2026-001
...

Long format — one row per tag per timestamp — is the right shape for a historian: it absorbs new tags without schema changes and partitions cleanly by time. But not every signal belongs in the historian. The architectural rule, repeated all through this book, is fast and numeric goes to the time-series database; batch-defining facts go to the relational database. The 20,160 minute-by-minute temperature readings are historian data. The single fact that "BATCH-2026-001 ran recipe R-mAb-01 on unit BR101 from 2026-01-05 to 2026-01-19, status Released" is relational, ISA-88-modelled data — and the join key between the two worlds is batch_id, which is why every historian row carries it. The time-series stream is also where the procedural structure lives: ISA-88 organizes a batch into procedure → unit procedure → operation → phase [4], and the same ts ordering that the historian uses is what later lets us slice the trace into those phases (inoculation, growth, production, harvest) and write phase boundaries into the batch model — and Book 4 (From the Wire to the Graph) shows how this same historian row and ISA-88/95 recipe become one queryable RDF graph that machines can reason over.
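
The batch_id join can be illustrated without a database. In this sketch the two CSV rows are copied from the sample above, while the BATCHES dictionary is a hypothetical in-memory stand-in for the relational batch record.

```python
import csv
import io

SAMPLE = """ts,tag,value,unit,quality,batch_id
2026-01-05 00:00:00+00:00,BR101.DO.PV,40.8224,%sat,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.SP,40.0,%sat,192,BATCH-2026-001
"""

# Stand-in for the relational batch record; the real one lives in PostgreSQL.
BATCHES = {
    "BATCH-2026-001": {"recipe": "R-mAb-01", "unit": "BR101", "status": "Released"},
}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    batch = BATCHES[row["batch_id"]]   # the join on batch_id
    print(row["tag"], float(row["value"]), row["unit"],
          "->", batch["recipe"], batch["status"])
```

A new tag would add rows here, never columns, which is the property that lets the historian absorb instruments without a schema change.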

Anatomy of a historian reading: six fields, one row

Long format is the shape; the muscle, as with Chapter 7's OPC UA node, is one row. A reading is never a bare 40.8224 — it travels with its time, its identity, its unit, its trust flag, and the batch it belongs to. Take the dissolved-oxygen line from that CSV:

ts        2026-01-05 00:00:00+00     -- contemporaneous source time (timestamptz)
tag       BR101.DO.PV                -- the signal's identity
value     40.8224                    -- the measurement (double precision)
unit      %sat                       -- what 40.8224 means
quality   192                        -- legacy OPC DA: 192 Good, 64 Uncertain, 0 Bad
batch_id  BATCH-2026-001             -- the join key to the GMP batch record

Every field earns its place — here they are as one card:

ts — contemporaneous source time. A timestamptz, and (per the collector note below) preferably the source timestamp: when the value was true, not when we polled it.
tag — the signal's identity, structured not opaque. BR101.DO.PV decodes as <asset>.<measurement>.<role>: BR101 is the ISA-95 unit, DO the measurement, and .PV the process value (evidence) as opposed to .SP, the setpoint (recipe). That one suffix is the line between what the plant aimed for and what it got.
value + unit — the measurement, never naked. 40.8224 is meaningless without %sat; the pair travels together, exactly as the OPC UA value in Chapter 7 carried its data type alongside the number (a Variant holding Double 4.902).
quality — the trust flag. 192 is the legacy OPC DA (Classic) Good code the edge node passes through (64 Uncertain, 0 Bad). These numbers are not arbitrary: OPC DA (the older "Classic" standard that predates OPC UA) packs quality into one byte, and its bit layout puts plain "Good" at the value 192 — whereas the newer OPC UA standard reverses the convention so that an all-zero StatusCode means Good, which is why OPC UA-native Good is simply 0 (recall Chapter 7). The edge node deliberately stores the device's native quality byte unchanged rather than re-encoding it to the OPC UA convention — preserving the original code is the more contemporaneous, audit-friendly choice (no silent transform between probe and database); a consumer that needs the OPC UA convention maps 192→Good, 64→Uncertain, 0→Bad at read time. It is the most important field after value: a reading you cannot vouch for is not a missing reading and emphatically not a good one, and storing the flag beside the value is what lets a later audit or model treat it honestly.
batch_id — the join key. The single column that lets the historian and the relational GMP record be rejoined, which is why every row carries it.

One historian row, fully unpacked: time, identity (with the tag-name grammar decoded), value, unit, the quality trust flag, and the batch join key — the bioreactor's analog of the OPC UA DataValue card. Original diagram by the authors, created with AI assistance.

This is the unit the rest of the chapter defends: the whole point of the capture pipeline is to fill in this row without ever losing the quality flag, the source time, or the batch key between the probe and the database.
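
The three quality codes fall out of the legacy OPC DA bit layout, in which the top two bits of the quality byte carry the major class. A minimal read-time decoder might look like this (the function name is hypothetical; substatus and limit bits are deliberately ignored):

```python
def opc_da_quality_class(quality: int) -> str:
    """Classify a legacy OPC DA quality byte by its top two bits (QQSSSSLL)."""
    major = (quality & 0xC0) >> 6   # keep bits 7-6, drop substatus and limit bits
    return {0b11: "Good", 0b01: "Uncertain", 0b00: "Bad"}.get(major, "Reserved")


for code in (192, 64, 0):
    print(code, f"{code:08b}", opc_da_quality_class(code))
```

Masking rather than comparing against 192 exactly means a code that carries extra substatus bits is still classified by its major class, while the stored byte itself stays untouched.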

The same row as a graph fact: one row, one `sosa:Observation`

This six-field row is also the atom of the digital thread. When the row reaches the knowledge graph (built in Chapter 19 and modelled in Book 4's From the Wire to the Graph), it does not stay a CSV line — it mints one sosa:Observation, the W3C Semantic-Sensor-Network node a sensor reading becomes, where the six columns sort cleanly onto graph predicates rather than being flattened into anonymous numbers:

# Illustrative: the BR101.DO.PV row above as one sosa:Observation (see /ontology/from-the-wire-to-the-graph).
bp:obs-BR101-DO-20260105T0000 a sosa:Observation ;
    sosa:observedProperty bp:BR101.DO.PV ;            # the tag = the observable property
    sosa:hasSimpleResult  "40.8224"^^xsd:float ;      # value
    qudt:ucumCode         "%{sat}" ;                  # unit, carried across the boundary
    sosa:resultTime       "2026-01-05T00:00:00+00:00"^^xsd:dateTime ;  # source ts
    bp:quality            192 ;                        # the trust flag, preserved not dropped
    sosa:hasFeatureOfInterest bp:BATCH-2026-001 .     # batch_id = the material entity observed

The mapping is exact because each field denotes a kind of thing: the value is the magnitude of a quality (dissolved oxygen) realized in a process; the tag names the observable property the stream tracks; the unit is the dimension that magnitude is measured in (carried as a UCUM code so 40.8224 is never a bare number); the ts locates the reading in the run; and batch_id is the material entity the reading is about (sosa:hasFeatureOfInterest). That sorting is the whole mapping; the equipment BR101, encoded inside the tag, stays a separate continuant. Crucially, the historian and OPC UA bridges emit the index — one observation per tag plus a bp:hasTrace pointer — and leave the 322,560-point stream where it lives, exactly the index-versus-payload boundary the ontology book makes a correctness rule (a chromatogram or a full trace is never exploded into triples).

A crosswalk from the six-field ts.sensor_reading row to one sosa:Observation: ts maps to sosa:resultTime, tag to sosa:observedProperty (bp:BR101.DO.PV), value to sosa:hasSimpleResult, unit to qudt:ucumCode, quality to bp:quality (192), and batch_id to sosa:hasFeatureOfInterest, pointing at a separate bp node for BATCH-2026-001, with a green SHACL shape-gate badge enforcing the value-plus-unit pair so 40.8224 is never a bare number.

Each ts.sensor_reading field becomes one predicate on a single sosa:Observation (value, unit, ts, quality, the tag, and a batch_id edge to the bp:Batch node), then a SHACL shape gate enforces the value-plus-unit pair. Original diagram by the authors, created with AI assistance.

Modelling the row this way pays off the moment a question is cross-system rather than a lookup. The day-7 deviation becomes a competency question a SPARQL ASK can answer over the graph — "does every released batch carry an in-spec, present, Good-quality reading for each required CPP?" — and the no-bare-numbers discipline (every magnitude carries a unit) is enforced as a SHACL shape and the CQ-19 audit query, the same closed-world is-a-required-field-missing? check the release gate runs on the QC panel. The 192 we preserve here is what lets that gate distinguish a missing reading from an untrustworthy one — a distinction a bare avg(value) silently destroys.
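
The difference between a missing reading and an untrustworthy one is easy to show on a toy window. In this sketch the values are illustrative rather than taken from the golden trace, and the summarize helper is hypothetical.

```python
from statistics import mean
from typing import Optional

GOOD, UNCERTAIN = 192, 64

# (value, quality) pairs; illustrative values, not from the golden trace.
readings = [(40.1, GOOD), (39.9, GOOD), (55.0, UNCERTAIN), (40.0, GOOD)]


def summarize(rows: list[tuple[float, int]]) -> dict[str, Optional[float]]:
    """Average only Good readings, and report how many were set aside."""
    good = [v for v, q in rows if q == GOOD]
    return {
        "naive_avg": mean(v for v, _ in rows) if rows else None,
        "good_avg": mean(good) if good else None,
        "n_total": len(rows),
        "n_not_good": len(rows) - len(good),
    }


print(summarize(readings))   # one Uncertain point pulls the naive average up
print(summarize([]))         # nothing captured: n_total is 0, averages are None
```

An empty window and a window full of Uncertain points both lack a good_avg, but n_total and n_not_good tell them apart, which is the distinction a bare average throws away.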

Where this row comes from in the series

This ts.sensor_reading row is the open-source landing place for a story the two sister books before it tell first. In Book 1, the production bioreactor is the physical control loop — the living kettle whose pH, DO, and feed are actively held in band. In Book 2, that same measurement becomes a tagged six-field reading the moment it is born and acquires the full lifecycle of a data point — captured, contextualized, retained, reviewed. The six fields above are exactly that data-point, now realized as a concrete historian row you can COPY into a running database.

On a real line the collector is either the FreeOpcUa asyncua client — the recommended open-source pure-Python OPC UA library, which subscribes to the skid's nodes and reads value + status [5] — or, for steady interval polling, Telegraf's OPC UA input plugin, which collects tags on a fixed interval and lets you choose whether the timestamp comes from the source, the server, or the gather time. That choice is not cosmetic: contemporaneous capture means preferring the source timestamp so the record reflects when the value was true, not when we happened to poll it [6]. In this chapter we replay the committed golden trace instead of standing up the live OPC UA path, which Chapter 7 already built.
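
The timestamp preference can be written down as a small rule. This is a sketch of the policy only, not of how asyncua or Telegraf implement it; the record_timestamp helper is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def record_timestamp(source_ts: Optional[datetime],
                     server_ts: Optional[datetime],
                     gather_ts: datetime) -> tuple[datetime, str]:
    """Pick the most contemporaneous timestamp available and say which it was."""
    if source_ts is not None:
        return source_ts, "source"     # when the value was true
    if server_ts is not None:
        return server_ts, "server"     # when the server saw it
    return gather_ts, "gather"         # when we polled: last resort


source = datetime(2026, 1, 5, 0, 0, 0, tzinfo=timezone.utc)
gathered = datetime(2026, 1, 5, 0, 0, 7, tzinfo=timezone.utc)
print(record_timestamp(source, None, gathered))
print(record_timestamp(None, None, gathered))
```

Returning the provenance label alongside the time lets a reviewer later see which readings fell back to poll time, instead of having to trust that none did.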

A fed-batch CHO bioreactor as a data source: setpoints and recipe flowing in from the DCS, process values and quality flags flowing out over OPC UA, splitting into the high-rate TimescaleDB historian and the relational ISA-88/95 batch model joined on batch_id.

The production bioreactor as the plant's highest-value data source: recipe and setpoints flow in, quality-flagged process values flow out, and the stream splits between the historian and the batch model on a shared batch_id. Original diagram by the authors, created with AI assistance.

The capture-to-store walkthrough

The hero figure is the architecture; here is the same path as a numbered pipeline — the way Chapter 7 walked the OPC UA handshake and the Sparkplug lifecycle (Sparkplug is an MQTT messaging convention for industrial data) — except this one ends by splitting:

1. The field probe measures, and reports its own health as NAMUR NE 107 signals (Failure, Function Check, Out of Specification, Maintenance Required).
2. The skid or DCS OPC UA server publishes each value already wearing a StatusCode and a source timestamp; the Function Check of a probe mid-calibration, for instance, becomes Uncertain.
3. The collector (asyncua subscribing, or Telegraf polling on a fixed interval) reads value + status + source time, choosing the source timestamp so the record is contemporaneous.
4. The router sends each reading by kind: fast numeric tags one way, batch-defining facts the other.
5. The two stores receive them: high-rate readings bulk-COPY into the TimescaleDB historian; batch facts INSERT row-by-row into PostgreSQL, so the per-row ALCOA+ audit trigger fires once per record (the mechanism is unpacked in Bulk-load vs insert-with-audit below). The two are rejoined whenever needed on batch_id. ALCOA+ is the set of data-integrity principles a regulator expects: Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available (detailed in Book 2 and Chapter 23).

The split in step 4 is not an implementation detail; it is the architecture, and step 5 — why the two stores are written so differently — is what the rest of this chapter unpacks.
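
The routing decision itself is tiny, which is part of the point. The sketch below is illustrative only: the Reading and BatchFact types and the route function are hypothetical names, not the repo's code.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Reading:
    """A fast numeric tag value."""
    ts: str
    tag: str
    value: float
    quality: int
    batch_id: str


@dataclass(frozen=True)
class BatchFact:
    """A batch-defining fact."""
    batch_id: str
    recipe: str
    unit: str
    status: str


def route(item: Union[Reading, BatchFact]) -> str:
    """Return the destination store for one captured item."""
    if isinstance(item, Reading):
        return "historian"      # bulk COPY path
    if isinstance(item, BatchFact):
        return "relational"     # row-by-row INSERT path, audited
    raise TypeError(f"unroutable item: {item!r}")


print(route(Reading("2026-01-05 00:00:00+00", "BR101.DO.PV", 40.8224, 192, "BATCH-2026-001")))
print(route(BatchFact("BATCH-2026-001", "R-mAb-01", "BR101", "Released")))
```

Both types carry batch_id, so whichever way an item is routed, the key needed to rejoin the two stores travels with it.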

Quality flags and the day-7 excursion

The most important field after value is quality. The repo uses the legacy OPC DA (Classic) numeric quality codes directly — 192 Good, 64 Uncertain, 0 Bad (recall from Chapter 7 that OPC UA-native Good is 0) — and the historian schema makes that explicit and defaults it to Good:

-- examples/platform/db/20-historian.sql
CREATE TABLE ts.sensor_reading (
    ts       timestamptz      NOT NULL,
    tag      text             NOT NULL,
    value    double precision,
    unit     text,
    quality  smallint         NOT NULL DEFAULT 192,  -- legacy OPC DA: 192 Good, 64 Uncertain, 0 Bad
    batch_id text
);

A reading you cannot vouch for is not the same as a missing reading, and it is emphatically not the same as a good one. Storing the flag alongside the value is what lets a later audit trail, alarm rule, or model treat an uncertain point honestly rather than silently averaging it in.

To give later chapters something real to find, the simulator injects a deliberate fault on day 7, written as a single combined event: the temperature setpoint dips 0.5 °C for three hours and, in the same window, the dissolved-oxygen probe reports Uncertain. The pairing is realistic, since a cooling disturbance shifts oxygen solubility and unsettles the DO probe, and coupling the two gives later chapters one traceable deviation rather than two:

# examples/sim/bioproc_sim/fed_batch.py
if excursion:
    # day-7 cooling excursion: setpoint dips 0.5 degC for ~3 h, DO reads uncertain
    e0 = int(7 * 24 * 60)
    e1 = e0 + 180
    temp_sp[e0:e1] = 36.5
    temp[e0:e1] = 36.5 + rng.normal(0, 0.05, e1 - e0)
    do_uncertain[e0:e1] = True

The arithmetic is checkable: 180 minutes × 2 affected tags (BR101.Temp.PV and BR101.DO.PV) = 360 rows that should carry quality 64, out of 322,560 total. Grouping the generated stream by quality confirms it exactly:

quality
64        360
192    322200

Those 360 rows are not noise — they are the chapter's gift to the data-integrity chapters. The day-7 dip is a genuine, attributable deviation that the contextualization views (Chapter 17), the ALCOA+ audit trail (Chapter 23), and the audit-trail-review report (the capstone) will rediscover and explain. A deviation you can trace from raw signal to investigation to disposition is exactly what contemporaneous, attributable, accurate capture is supposed to enable [10].
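
The excursion window and its row count can be re-derived from the simulator's indices. The sketch assumes sample 0 carries the first timestamp in the sample CSV and that samples are one minute apart.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: sample 0 is stamped 2026-01-05 00:00 UTC, as in the sample CSV.
START = datetime(2026, 1, 5, tzinfo=timezone.utc)
TOTAL_ROWS = 322_560

e0 = 7 * 24 * 60          # first excursion sample, as in the simulator
e1 = e0 + 180             # exclusive end, 180 one-minute samples later
affected_tags = 2         # BR101.Temp.PV and BR101.DO.PV

uncertain_rows = (e1 - e0) * affected_tags
good_rows = TOTAL_ROWS - uncertain_rows

window_start = START + timedelta(minutes=e0)
window_end = START + timedelta(minutes=e1)   # first sample after the window
print(uncertain_rows, good_rows)
print(window_start.isoformat(), "to", window_end.isoformat())
```

Under those assumptions the flagged window runs from midnight to 03:00 UTC on 2026-01-12, which is the time range a later investigation query would filter on.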

The day-7 excursion as the data sees it: a 3-hour setpoint dip and an Uncertain DO window that surface as exactly 360 rows flagged quality 64 — the deviation later chapters rediscover and explain. Original diagram by the authors, created with AI assistance.

When it goes wrong in the field: drift, deadband loss, and timestamp lies

The day-7 fault is simulated clean; real capture fails in messier ways, and each failure mode is an argument for one of the columns we just dissected.

Probe drift. An in-line pH or DO probe does not hold calibration for two weeks. In single-use CHO culture the working practice is to compare the in-line pH against a daily offline reference and recalibrate whenever the two diverge by more than 0.05 pH units — and over a typical commercial fed-batch (commonly 14 to 17 days, of which our worked batch is the 14-day case) that recalibration is needed daily to every few days [13]. That is exactly why the in-line value is evidence, not truth, and why the next chapter pulls offline lab values to anchor it: the PV and the lab number must be reconcilable, which is only possible if both carry their timestamp and batch_id.
Deadband data loss. A real historian compresses by storing a new point only when a value moves past a configured deadband (say, ±0.05 °C) or after a maximum time gap. That saves enormous storage on a slow signal — but set the deadband too wide and you erase the transient you needed to see: a day-7 dip that is never stored is a deviation that is never investigated. Deadband is a per-tag data-integrity decision, not a storage afterthought.
Timestamp lies. If the collector stamps arrival time instead of source time, every reading is subtly wrong about when it was true — and on a slow field bus or after a buffered reconnect, source and server time can diverge by seconds or minutes. Preferring the source timestamp (step 3 above) is what keeps ts honest; Chapter 7 dissected exactly this split in the OPC UA DataValue.

None of these is exotic; all of them are why a regulator treats process-data capture as a data-integrity surface, expecting the quality flag, the contemporaneous timestamp, and a reviewable trail rather than a hopeful average [9].
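
The deadband failure mode is worth seeing on a toy signal. The sketch below is a simplified last-stored-value deadband, not any particular historian's compression algorithm, and the trace is noise-free for clarity; deadband_compress is a hypothetical helper.

```python
from typing import Optional


def deadband_compress(samples: list[tuple[int, float]],
                      deadband: float,
                      max_gap: Optional[int] = None) -> list[tuple[int, float]]:
    """Keep a (t, value) sample only if it moved past the deadband since the
    last stored point, or if max_gap time units have passed since that point."""
    stored: list[tuple[int, float]] = []
    for t, value in samples:
        if not stored:
            stored.append((t, value))   # always keep the first point
            continue
        last_t, last_value = stored[-1]
        moved = abs(value - last_value) > deadband
        stale = max_gap is not None and (t - last_t) >= max_gap
        if moved or stale:
            stored.append((t, value))
    return stored


# Toy trace: 37.0 degC with a 0.5 degC dip during minutes 10 to 12.
trace = [(t, 36.5 if 10 <= t < 13 else 37.0) for t in range(20)]

tight = deadband_compress(trace, deadband=0.05)
wide = deadband_compress(trace, deadband=1.0)
print(tight)   # the start and end of the dip survive
print(wide)    # only the first point survives; the dip is gone
```

With the tight deadband, twenty samples compress to three and the excursion is still fully described; with the wide one, the stored record claims the reactor sat at 37.0 °C throughout.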

What this stream means for a model downstream

This chapter stops at capture, but every failure mode above is also a model failure mode, and it is worth naming the bridge so the next book is not a surprise. A soft sensor that infers titer from this stream (the Raman model of Process Analytics) decays for the very reasons the columns defend against, and the ML book sorts that decay into two kinds. Covariate shift, where the input distribution moves while the underlying relationship holds, is exactly what probe drift and a new raw-material lot do to the spectral background. A model fed those drifted inputs is extrapolating outside the region it was calibrated on, which is why this stream must carry an applicability domain check that flags a reading the model has no business scoring. Concept drift, where the input-to-output relationship itself changes (a cell line adapting over passages, a media reformulation), is the dangerous kind: invisible in the inputs and only catchable against the slow offline reference the next chapter pulls. The model book's drift detectors are exactly the two halves of this split: a leading, label-free monitor on the inputs and a lagging control chart on the prediction residual (MLOps and lifecycle).

Two more disciplines are decided here, at capture, not later at modelling. First, the batch_id we keep on every row is what makes a leak-free validation split possible. Because consecutive one-minute readings inside a batch are near-duplicates, a naïve row-wise split lets the same batch land on both sides of train and test and reports a fantasy R², so the honest metric holds out whole batches with grouped cross-validation (GroupKFold / LeaveOneGroupOut on batch_id). Row-wise leakage is the field's single most common validation error, and this stream's batch key is what lets you avoid it (the learning problem). Second, the day-7 excursion is not just a data-integrity artifact: its 360 Uncertain rows are a labelled signal. A model trained on this trace must either exclude them or treat the flag as a feature, exactly as a later release model treats an OOS batch as a training label rather than noise. Keeping the quality flag and the batch_id on every row is therefore the same act that keeps the data honest and keeps a future model's lineage traceable to the exact rows it learned from.
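
Holding out whole batches needs nothing more than the batch_id column. The sketch below is a dependency-free stand-in for scikit-learn's LeaveOneGroupOut, written out so the mechanism is visible; the helper name is hypothetical, and the second and third batch ids are invented for illustration.

```python
from typing import Iterator, Sequence


def leave_one_batch_out(batch_ids: Sequence[str]) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_idx, test_idx) pairs, holding out one whole batch per fold."""
    for held_out in sorted(set(batch_ids)):
        test = [i for i, b in enumerate(batch_ids) if b == held_out]
        train = [i for i, b in enumerate(batch_ids) if b != held_out]
        yield train, test


# Hypothetical ids beyond BATCH-2026-001, one entry per row of a training table.
batch_ids = ["BATCH-2026-001"] * 3 + ["BATCH-2026-002"] * 3 + ["BATCH-2026-003"] * 2

for train, test in leave_one_batch_out(batch_ids):
    train_batches = {batch_ids[i] for i in train}
    test_batches = {batch_ids[i] for i in test}
    assert train_batches.isdisjoint(test_batches)   # no batch straddles the split
    print(sorted(test_batches), "held out;", len(train), "training rows")
```

A row-wise shuffle of the same eight rows would almost certainly put neighbouring minutes of one batch on both sides, which is exactly the leak the grouped split rules out.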

Where the readings land

The historian is a single TimescaleDB hypertable: an ordinary PostgreSQL table that is automatically partitioned by time into chunks, so writes stay fast and old chunks can be dropped or aggregated [7]. The DDL (Data Definition Language — the CREATE statements that define the tables) also pre-rolls one-minute and one-hour summaries as continuous aggregates (pre-computed rollups, defined in Key terms) and bounds raw retention (caps how long the raw per-minute rows are kept before old chunks are dropped):

-- examples/platform/db/20-historian.sql
SELECT create_hypertable('ts.sensor_reading', 'ts', chunk_time_interval => INTERVAL '1 day');
CREATE INDEX ON ts.sensor_reading (tag, ts DESC);
CREATE INDEX ON ts.sensor_reading (batch_id, ts DESC);

CREATE MATERIALIZED VIEW ts.sensor_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
       tag,
       avg(value)  AS avg_value,
       min(value)  AS min_value,
       max(value)  AS max_value,
       last(value, ts) AS last_value
FROM ts.sensor_reading
GROUP BY bucket, tag
WITH NO DATA;

One honesty note baked into the comments: this DDL mixes two licence tiers, and they are not both OSI-open. The hypertable, time_bucket, and drop_chunks are Apache-2.0 core (Apache-2.0 is a permissive licence recognized by the Open Source Initiative, or OSI, the body that certifies what counts as "open source"). The continuous aggregates and the add_retention_policy automation this stack uses are free TimescaleDB Community (TSL) features: their source is published and they cost nothing to run, but the licence adds use restrictions, so the OSI does not certify TSL as open source the way it does Apache-2.0. That is why the stack you just brought up runs the free Community build, while a regulated site that must ship a strictly OSI-certified stack drops to the Apache-2.0 core instead. The retention swap is small: the TSL add_retention_policy runs on a built-in scheduler, and the strictly-open alternative is to call the Apache-2.0 drop_chunks function yourself on a timer (a cron job, cron being the standard Unix task scheduler). By the same licence split, such a build also gives up the continuous aggregates shown above. TimescaleDB's columnstore compression and high-availability features are the licensed TSL tier we deliberately avoid. It is the kind of licence trap we flag throughout: the pure-OSS path costs you some compression efficiency, and the chapter says so plainly rather than pretending the gap isn't there.
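
For the strictly Apache-2.0 path, the cron job only has to send one statement. The sketch below builds it, assuming the TimescaleDB 2.x drop_chunks(relation, older_than => ...) form; the helper name is hypothetical and the 90-day window is a placeholder, since the retention period is not stated here.

```python
def drop_chunks_sql(hypertable: str, keep_days: int) -> str:
    """Build the statement a cron job would send to drop chunks older than keep_days."""
    if keep_days <= 0:
        raise ValueError("keep_days must be positive")
    # Values are formatted in directly because this runs from a trusted ops
    # script with fixed arguments, never from user input.
    return (f"SELECT drop_chunks('{hypertable}', "
            f"older_than => INTERVAL '{keep_days} days');")


# ASSUMPTION: 90 days is a placeholder retention window.
print(drop_chunks_sql("ts.sensor_reading", keep_days=90))
```

How the statement reaches the database (psql from a crontab entry, a small Python job, an orchestrator) is a site decision; what matters for the audit trail is that the retention rule lives in version-controlled code rather than in someone's memory.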

Loading the golden trace is a one-shot bulk COPY in examples/tools/load_datasets.py:

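# examples/tools/load_datasets.py
import io

import pandas as pd

# DATA is assumed to be a pathlib.Path to the dataset directory,
# defined elsewhere in this module.

def load_timeseries(conn) -> int:
    """Bulk-load the golden trace into ts.sensor_reading; return the row count."""
    df = pd.read_parquet(DATA / "fedbatch_timeseries.parquet")
    # Serialize the six historian columns to headerless CSV, in COPY column order.
    buf = io.StringIO()
    df[["ts", "tag", "value", "unit", "quality", "batch_id"]].to_csv(buf, index=False, header=False)
    buf.seek(0)
    with conn.cursor() as cur:
        # One-shot reload: empty the table first so a rerun cannot duplicate rows.
        cur.execute("TRUNCATE ts.sensor_reading")
        # psycopg 3 style COPY: stream the whole CSV buffer to the server.
        with cur.copy("COPY ts.sensor_reading (ts, tag, value, unit, quality, batch_id) "
                      "FROM STDIN WITH (FORMAT csv)") as copy:
            copy.write(buf.read())
    return len(df)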

COPY is the right tool for 322,560 rows — far faster than row-by-row inserts, and it preserves the quality flag and batch key exactly. Running the loader against the core stack (the docker compose --profile core services) prints:

loaded: 322560 sensor readings, 1344 offline results, 66 release results, 30 genealogy edges
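
The claim that the CSV path preserves the quality flag and batch key can be checked without a database. The following is a minimal standard-library sketch of the same serialization step, not the loader itself; the function name to_copy_buffer, the tag name, the batch identifier, and the sample values are all hypothetical.

```python
import csv
import io

# Column order must match the COPY column list.
COLUMNS = ["ts", "tag", "value", "unit", "quality", "batch_id"]


def to_copy_buffer(rows):
    """Serialize dict rows to headerless CSV, in COPY column order."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[c] for c in COLUMNS])
    buf.seek(0)
    return buf


# Illustrative rows only: one Good (192) and one Uncertain (64) reading.
rows = [
    {"ts": "2024-01-01T00:00:00+00:00", "tag": "BR101.pH.PV", "value": 7.01,
     "unit": "pH", "quality": 192, "batch_id": "B-0001"},
    {"ts": "2024-01-01T00:01:00+00:00", "tag": "BR101.pH.PV", "value": 6.40,
     "unit": "pH", "quality": 64, "batch_id": "B-0001"},
]

# Parse the buffer back, as the server would see it, and compare the keys.
parsed = list(csv.reader(to_copy_buffer(rows)))
for original, received in zip(rows, parsed):
    assert received[4] == str(original["quality"])
    assert received[5] == original["batch_id"]
print(f"{len(parsed)} rows round-tripped with quality and batch_id intact")
```

A check like this is cheap to keep next to the loader: if someone reorders the column list or drops the quality column, the round trip fails before any row reaches the historian.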

Bulk-load vs insert-with-audit: the asymmetry is the architecture

The same script also writes offline assay and release results into the relational lab schema, and there it deliberately goes row-by-row through INSERT, with the ALCOA+ audit trigger firing on every row (Chapter 23). The mechanism behind the asymmetry is concrete: a per-row (FOR EACH ROW) audit trigger is attached only to the regulated relational tables (lab.result, holding the released QC assay values, and s88.batch, the GMP batch record; both are records a regulator reviews, defined in the next two chapters), so each INSERT logs one audit entry. The historian's ts.sensor_reading table carries no such trigger, so it can be bulk-COPYed for speed with nothing to log. (PostgreSQL's COPY FROM also invokes row-level triggers, so it is the absence of a trigger on the historian table, not the use of COPY in itself, that keeps the bulk load free of audit entries.) Bulk-load fast time-series; insert-with-audit the batch-defining facts. That asymmetry is the architecture: the historian optimizes for volume and the relational record optimizes for accountability, and the same batch_id on both sides is what keeps them one story.
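
The per-row trigger mechanism described above can be demonstrated in miniature. This sketch swaps in SQLite (from the Python standard library) for PostgreSQL so that it runs anywhere; the table layouts, the audit_log columns, the batch identifier, and the viability value are simplified assumptions, and a real ALCOA+ trigger would also record who made the change and the old and new values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lab_result (
    id INTEGER PRIMARY KEY, batch_id TEXT, assay TEXT, value REAL
);
CREATE TABLE sensor_reading (
    ts TEXT, tag TEXT, value REAL, quality INTEGER, batch_id TEXT
);
CREATE TABLE audit_log (
    table_name TEXT, row_id INTEGER, action TEXT,
    logged_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- The audit trigger is attached to the regulated table only.
CREATE TRIGGER lab_result_audit AFTER INSERT ON lab_result
FOR EACH ROW BEGIN
    INSERT INTO audit_log (table_name, row_id, action)
    VALUES ('lab_result', NEW.id, 'INSERT');
END;
""")

# Regulated facts: each insert fires the trigger once.
conn.executemany(
    "INSERT INTO lab_result (batch_id, assay, value) VALUES (?, ?, ?)",
    [("B-0001", "titer_g_per_L", 5.77), ("B-0001", "viability_pct", 92.0)],
)

# Historian rows: no trigger is attached, so nothing is logged.
conn.executemany(
    "INSERT INTO sensor_reading VALUES (?, ?, ?, ?, ?)",
    [("2024-01-01T00:00:00Z", "BR101.pH.PV", 7.0, 192, "B-0001")] * 1000,
)

audit_rows = conn.execute("SELECT count(*) FROM audit_log").fetchone()[0]
sensor_rows = conn.execute("SELECT count(*) FROM sensor_reading").fetchone()[0]
print(f"audit entries: {audit_rows}, sensor rows: {sensor_rows}")
```

Run as written, the audit table ends with two entries, one per regulated insert, and none for the thousand sensor rows. The cost of accountability scales with the number of regulated facts, not with the size of the time series.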

That asymmetry is also a deliberate risk-tiering decision, and it is worth naming the regulatory frame the validation chapter and Chapter 23 develop in full. Under the FDA's Computer Software Assurance (CSA) mindset — the 2022 successor framing to traditional Computer System Validation (CSV) that scales testing rigor to a function's risk rather than testing every feature equally — the two write paths sit in different tiers on purpose. The audit-trailed lab.result and s88.batch writes are high-risk: they hold the released values and the disposition a reviewer signs, so they need the per-row trail, the access controls, and the attributable electronic signatures that 21 CFR Part 11 (the US electronic-records-and-signatures rule) and EU GMP Annex 11 (its EU computerised-systems counterpart) require of a record that backs a release decision. The bulk-COPYed historian rows are lower-risk raw evidence — voluminous, machine-generated, and write-once — so their integrity is defended by the contemporaneous source timestamp, the preserved quality flag, and a reviewable acquisition trail rather than a per-row audit entry. CSA's whole point is to spend the expensive controls where a wrong value reaches a patient and not to drown the historian in them; the split this chapter builds is that principle expressed in two CREATE TABLE choices.

Why it matters

The production bioreactor is where critical process parameters (CPPs) are established — a CPP being a process input (such as pH, DO, or temperature) whose variation would affect a critical quality attribute (CQA), a measurable property of the drug itself that must stay within limits for the product to be safe and effective. FDA's Process Analytical Technology framework describes the bioreactor as the locus of the quality and performance attributes measured during processing — the place where timely, in-process measurement of CPPs lets you understand and ultimately control the process [8]. When a batch is reviewed for release, these are the numbers a reviewer scrutinizes most, and CGMP guidance is explicit that such data must be reliable and accurate, with critical-step values recorded contemporaneously and audit trails reviewed by the quality unit [9].

That is why three small design choices in this chapter carry so much weight: capturing the quality flag (so an uncertain reading is never mistaken for a good one), preferring the source timestamp (so the record is contemporaneous), and keeping the batch key on every row (so the historian and the GMP batch record can always be rejoined). None of these is glamorous. All of them are the difference between a data platform a regulator trusts and one they don't.
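
Those three choices fit in a few lines of code. The sketch below is one possible shape for the step that turns a collected sample into a historian row, assuming the legacy OPC DA quality byte described in this chapter (Good 192, Uncertain 64, Bad 0, with the quality class held in the top two bits). The names RawSample, quality_class, and to_historian_row, and the fallback to the server timestamp when no source timestamp exists, are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

QUALITY_MASK = 0xC0                       # top two bits of an OPC DA quality byte
GOOD, UNCERTAIN, BAD = 0xC0, 0x40, 0x00   # 192, 64, 0


@dataclass(frozen=True)
class RawSample:
    tag: str
    value: float
    unit: str
    quality: int
    batch_id: str
    server_ts: datetime                    # when the collector gathered it
    source_ts: Optional[datetime] = None   # when the value was true


def quality_class(quality: int) -> str:
    """Classify a quality byte; anything not Good or Uncertain counts as bad."""
    major = quality & QUALITY_MASK
    return {GOOD: "good", UNCERTAIN: "uncertain"}.get(major, "bad")


def to_historian_row(sample: RawSample) -> tuple:
    """Build (ts, tag, value, unit, quality, batch_id) for ts.sensor_reading."""
    if not sample.batch_id:
        # Without the batch key the row can never be rejoined to the batch record.
        raise ValueError(f"{sample.tag}: missing batch_id")
    # Prefer the source timestamp; fall back to server time (an assumption here).
    ts = sample.source_ts or sample.server_ts
    # The quality flag is stored as received, never filtered or rewritten.
    return (ts, sample.tag, sample.value, sample.unit, sample.quality, sample.batch_id)


# Illustrative sample: a buffered reading collected five minutes late.
true_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
late_time = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
row = to_historian_row(
    RawSample("BR101.pH.PV", 6.4, "pH", 64, "B-0001", late_time, true_time)
)
print(row[0].isoformat(), quality_class(row[4]))
```

Note that an Uncertain reading is still stored; the flag travels with it so that downstream consumers decide what to exclude, rather than the collector deciding silently.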

In the real world

On a commercial line the OPC UA server is part of a validated DCS or single-use skid (Emerson DeltaV, Siemens, or a vendor controller), and an enterprise historian like AVEVA PI sits beside it. PI is genuinely excellent at exactly this: high-rate compression, decades of retention, and battle-tested collectors. The honest open-source reality is that TimescaleDB on the Apache-2.0 build will capture, store, and serve this data faithfully and at laptop scale, but you give up PI's turnkey compression, its huge connector ecosystem, and a vendor you can hold contractually accountable in a GxP audit (GxP being the family of regulated Good Practice standards, such as Good Manufacturing Practice and Good Laboratory Practice, that an inspector audits a facility against). That trade-off is the spine of this whole book: our rough estimate is that pure OSS covers most of the capability, with the validated last mile being where commercial tools and a hybrid architecture earn their keep. The integration chapters (17 onward) show the bridge to PI explicitly.

And the physical vessel itself has changed: a modern fed-batch CHO production reactor is usually not a fixed stainless kettle but a single-use bioreactor (SUB), such as the Sartorius Biostat STR that BR101 is modelled on, the Thermo HyPerforma S.U.B., or the Cytiva Xcellerex XDR. Each is fronted by its own OPC UA skid controller, so the data shape we built here (quality-flagged tags joined on batch_id) is the same whichever platform a plant actually buys.

That same vessel is also why the bench-vs-2,000 L split earlier in the chapter is more than a laptop convenience. A commercial process is not invented at scale; it is transferred there. The recipe R-mAb-01 is developed and characterized in small bench reactors, then scaled up to the production SUB by holding the intensive quantities constant — the setpoints, the per-volume feed strategy, and a matched mixing/oxygen-transfer regime (commonly anchored on volumetric power input and the volumetric oxygen-transfer coefficient k_La) — while only the extensive numbers (working volume, cumulative feed mass) grow, which is exactly the intensive/extensive line the simulator preserves. Tech transfer is the controlled handover that moves that characterized recipe between sites or scales, and the receiving plant cannot run a single GMP batch on BR101 until its automation is qualified: IQ (installation qualification — documented proof the skid, probes, and OPC UA server are installed and configured to specification), OQ (operational qualification — proof each controlled parameter holds its setpoint and each alarm and interlock fires across the operating range), and PQ (performance qualification — proof the whole process makes conforming product across consecutive runs) — the C in CGMP made literal for the very instruments whose tags this chapter captures. A .PV you trust is a .PV whose probe was IQ/OQ-qualified and whose calibration is current; the quality flag is the runtime echo of that one-time qualification.
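
The intensive/extensive split in scale-up reduces to simple arithmetic for the feed. The sketch below is illustrative only: the class name Recipe, the function extensive_totals, the 2 L bench volume, and every numeric value are hypothetical, and the matched mixing and oxygen-transfer regime (power input, k_La) is not modelled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Intensive quantities: unchanged when the recipe moves between scales."""
    ph_sp: float
    temp_sp_c: float
    feed_g_per_l: float  # cumulative feed per litre of working volume


def extensive_totals(recipe: Recipe, working_volume_l: float) -> dict:
    """Derive the extensive numbers that grow with the vessel."""
    return {
        "working_volume_l": working_volume_l,
        "feed_mass_kg": recipe.feed_g_per_l * working_volume_l / 1000.0,
    }


# Hypothetical values, chosen only to make the arithmetic visible.
recipe = Recipe(ph_sp=7.0, temp_sp_c=36.5, feed_g_per_l=150.0)
bench = extensive_totals(recipe, working_volume_l=2.0)
production = extensive_totals(recipe, working_volume_l=2000.0)

print(bench)
print(production)
```

The recipe object is identical at both scales; only the derived totals differ, here by the thousandfold ratio of the working volumes. Keeping the setpoints in one structure and deriving the extensive numbers from volume is one way to make that line hard to cross by accident.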

Two honest limits on the simulator. First, there is no real OPC UA server, PLC, or DCS here; asyncua and Telegraf prove the integration code and the data shapes, not vendor-specific quirks, and field OPC UA security is notoriously misconfigured in practice, which Chapter 7 covers. Second, the perfusion / intensified-continuous variant changes the picture: instead of a 14-day batch you run 30+ days of steady state (the culture is held at constant conditions rather than allowed to age and die). Fresh medium flows in continuously while spent medium plus product flows out as a continuous harvest stream, and excess cells are removed as a cell bleed to hold cell density steady; a perfusion-rate tag is added and the sample rate climbs. In-line PAT (Process Analytical Technology: instruments that measure quality-relevant signals during the process rather than after) such as Raman spectroscopy (a light-scattering technique that reads the broth's chemical fingerprint) adds real-time, release-relevant signals. Antibody glycosylation occupancy (how much of the antibody carries its expected sugar groups, a quality attribute) has been monitored live in CHO bioreactors by in-situ Raman [12]. Those high-value spectra must be captured alongside the CPP setpoints, where a chemometric soft-sensor model (covered in Process Analytics: SPC, MVDA & Soft Sensors, and end-to-end in the ML book's analytical-methods chapter) turns the raw spectrum into the glycosylation or titer number. The historian and the batch_id join survive the switch; the cadence, the tag set, and the volume all grow, which is exactly the stress test Chapter 16 puts the store through.

Key terms

Fed-batch CHO culture — the dominant mAb production mode: Chinese-hamster-ovary cells grown in a sealed bioreactor with periodic nutrient feeds over ~2 weeks.
Titer — the concentration of antibody product accumulated in the culture broth, in grams per litre (g/L); the bioreactor's headline output (here 5.77 g/L at harvest).
VCD (viable cell density) — the count of living cells per millilitre of culture (here ~18 million cells/mL, written 18.2e6); the state variable whose accumulation over time drives antibody titer.
Historian — the plant's time-series database, where high-rate tag readings are stored (here a TimescaleDB hypertable).
Skid — a pre-assembled, frame-mounted equipment unit (here the bioreactor) with its own controller and instruments, fronted by an OPC UA server.
Single-use bioreactor (SUB) — a production vessel built around a disposable bag rather than a cleaned-and-sterilized stainless tank; representative platforms include the Sartorius Biostat STR, Thermo HyPerforma S.U.B., and Cytiva Xcellerex XDR, each fronted by an OPC UA skid controller.
Setpoint (SP) vs process value (PV) — the target the controller aims for vs the value the instrument measures; SP is recipe data, PV is evidence.
CPP (critical process parameter) — a process input (e.g., pH, DO, temperature) whose variability affects a critical quality attribute (CQA) — a measurable property of the drug that must stay within limits — and so must be controlled.
ALCOA+ — the regulator's data-integrity principles (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available) that the row-by-row audit trail upholds.
GxP — the family of regulated Good-Practice standards (Good Manufacturing Practice, Good Laboratory Practice, and so on) an inspector audits a regulated facility against.
Quality flag (legacy OPC DA codes) — Good (192), Uncertain (64), Bad (0); travels with every value and timestamp. These are the legacy OPC DA (Classic) codes the edge node passes through; OPC UA-native quality is a StatusCode where Good is 0 (Chapter 7).
NAMUR NE 107 — standardizes field-device health into four signals (Failure / Function Check / Out of Specification / Maintenance Required) that map onto quality.
Hypertable — a PostgreSQL table that TimescaleDB auto-partitions by time into chunks for fast time-series writes and retention.
Continuous aggregate — a materialized, incrementally maintained rollup (1-minute / 1-hour avg/min/max) over the raw hypertable.
Deadband — the minimum change required before a historian stores a new point; a storage-vs-fidelity trade-off that is also a data-integrity decision.
Long format — one row per tag per timestamp; the schema-stable shape a historian stores.
ts.sensor_reading row — the long-format historian row (ts, tag, value, unit, quality, batch_id); the bioreactor's analog of an OPC UA DataValue, carrying time, identity, unit, the quality trust flag, and the batch join key.
Source vs server timestamp — source time is when the value was true (preferred, for contemporaneous capture); server / gather time is when it was collected. The two diverge on a slow bus or after a buffered reconnect, so the collector is configured to keep the source time.
sosa:Observation — the W3C Semantic-Sensor-Network node one historian row becomes in the knowledge graph: its observed property (the tag), simple result (the value), result time (the source ts), unit, and feature of interest (the batch_id), so the reading is queryable as a fact rather than a CSV line (from the wire to the graph).
Index vs payload — the rule that the graph holds the index (one observation per tag plus a trace pointer) while the historian holds the payload (the full point stream); a trace or chromatogram is never exploded into triples.
Covariate vs concept drift — the two ways a model on this stream decays: covariate shift moves the inputs (probe drift, a new raw-material lot) and is visible without labels; concept drift moves the relationship (a cell line adapting) and is catchable only against the slow offline reference (MLOps and lifecycle).
Grouped (leave-one-batch-out) cross-validation — holding out whole batches (GroupKFold / LeaveOneGroupOut on batch_id) so near-duplicate within-batch rows never straddle train and test; the batch key kept here is what makes this leak-free split possible (the learning problem).
Scale-up & tech transfer — moving a bench-characterized recipe to the production SUB by holding intensive quantities (setpoints, per-volume feed, mixing/k_La regime) constant while extensive ones (volume, feed mass) grow; the controlled handover between scales or sites.
IQ / OQ / PQ — installation, operational, and performance qualification: the documented proof that the skid, probes, and OPC UA server are installed to spec, hold setpoints and fire alarms across range, and make conforming product over consecutive runs — before any GMP batch runs on BR101.
CSA (Computer Software Assurance) — the FDA's risk-based successor framing to CSV that scales testing rigor to a function's risk; the per-row audit trail on lab.result/s88.batch (high-risk) versus the bulk-COPYed historian (lower-risk raw evidence) is that risk-tiering in practice, with Part 11 / Annex 11 signatures on the records that back a release.
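
The Deadband entry above can be made concrete with a short sketch. It implements the simplest variant, an absolute deadband compared against the last stored value; real historians often add a maximum time between stored points and other refinements. The function name apply_deadband and the sample values are assumptions for illustration.

```python
def apply_deadband(points, deadband):
    """Keep a point only if it differs from the last stored value by more
    than `deadband`. `points` is an iterable of (ts, value); the first
    point is always stored.
    """
    stored = []
    for ts, value in points:
        if not stored or abs(value - stored[-1][1]) > deadband:
            stored.append((ts, value))
    return stored


# Illustrative pH trace: minute index and value (made-up numbers).
trace = [(0, 7.00), (1, 7.01), (2, 7.02), (3, 7.06), (4, 7.05), (5, 6.90)]
kept = apply_deadband(trace, deadband=0.05)
print(kept)
```

With a 0.05 deadband, the small wobble at minutes 1, 2, and 4 is dropped while the step at minute 3 and the excursion at minute 5 are kept. Widen the deadband far enough and the excursion itself would vanish from the record, which is why the setting is a data-integrity decision and not only a storage one.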

Where this leads

The bioreactor gives us a dense, in-line stream — but in-line probes drift, and the most decisive numbers (viable cell density, viability, metabolites, true titer) still come from samples pulled by hand and run on a bench analyzer. The next chapter, Seed Train & Cell-Culture Offline Analytics, follows the culture back to its origins and shows how to capture those offline results, link each sample to the right batch and timepoint, and reconcile the noisier-but-authoritative lab values against the in-line trace we just stored.
