Seed Train & Cell-Culture Offline Analytics

📍 Where we are: Part II · Capturing the Process — Chapter 10. The historian (the time-series database that archives process data) already swallows the bioreactor's live tags (its named sensor channels); this chapter captures the other half of the truth — the manual seed-train entries and the offline bench results that never touch the DCS (the distributed control system that runs the equipment) — and ties each one to the batch and the moment it represents.

The simple version

The bioreactor's online sensors are like a fitness tracker on your wrist: always on, automatic, streaming. But twice a day a technician also draws a tube of culture, walks it to a bench analyzer, and gets numbers the wristband can never give — how many cells are alive, how much sugar is left, how much of the antibody drug the cells are secreting has accumulated. Those results arrive as a CSV emailed off an instrument, or typed into a form. The hard part is not measuring them. It is gluing each result back to the right batch and the exact minute the sample was pulled — because a number with no anchor in the batch record is, to a regulator, no number at all.

What this chapter covers

Chapters 7–9 captured everything the distributed control system (DCS) emits over OPC UA (the industrial machine-to-machine protocol Chapter 7 covers, used here to move sensor values off the equipment): temperature, pH, dissolved oxygen, feed pumps. But a CHO fed-batch run generates a whole second stream of data that the DCS never sees. (CHO is Chinese Hamster Ovary, the workhorse mammalian cell line for antibody production: an engineered, indefinitely growing cell population descended from one parent. Fed-batch means a culture grown in one vessel with concentrated nutrient feeds added over the run.) The seed train, the staged expansion of cells from a thawed vial up to the production bioreactor, is largely logged by hand. And the most decision-critical numbers in cell culture, the ones a process scientist actually steers by, come off bench analyzers: viable cell density, viability (the fraction of cells still alive), glucose, lactate, and titer (how much antibody product has accumulated).

This chapter is about that off-DCS world. We will:

generate realistic offline / at-line bench results with the deterministic simulator (the companion repo's software model of a CHO run, introduced in earlier chapters, that produces the same numbers on every machine instead of reading a real plant), drawn from the same underlying culture state as the online trace;
land those results in a relational lab schema designed for sample-to-batch genealogy;
build a file-watch ingester that picks up an analyzer's CSV drop automatically;
and confront the genuinely hard part — reconciling the two timestamps every offline result has (when the sample was taken versus when it was measured), and handling the late, corrected, and amended results that make manual data a data-integrity minefield.

The two artifacts at the heart of this chapter both exist and are tested in the companion repo: the simulator module examples/sim/bioproc_sim/offline_assays.py, and the lab schema in examples/platform/db/30-lab-events.sql. The file-watcher code is shown as a condensed excerpt of the repo's file-ingester service (examples/services/file-ingester/app.py), which is built on the watchdog library; it is labelled where it appears.

Why offline data is a different animal

In the measurement taxonomy the FDA's (the US drug regulator's) Process Analytical Technology (PAT) framework made standard, a sensor wired into the tank is on-line (or in-line if the probe sits in the broth — the liquid cell culture filling the vessel); a sample pulled and run on a nearby analyzer within minutes is at-line; and a sample carried to a separate lab is off-line [6]. The framework's whole point is that every step away from in-line adds delay — the time between when the sample represents the process and when you finally know the answer. That delay is exactly the thing we have to record, because the result is contemporaneous with the measurement, but it is evidence about an earlier moment in the batch.

What lands on those analyzers? In a CHO seed train and production culture, the standing panel is viable cell density (VCD) and viability — historically by a manual hemocytometer-and-trypan-blue count (cells counted under a microscope on a ruled counting slide, with a blue dye that stains only the dead ones), today usually by an automated imaging counter, with method choice itself a validated, integrity-relevant decision [8] — plus the metabolite set every nutrient-feeding strategy is built on: glucose, lactate, glutamine, ammonium, and osmolality [7]. These are laboratory-informatics data: in a mature plant they are captured through a LIMS (laboratory information management system), an ELN (electronic lab notebook), or a laboratory execution system rather than the process automation stack, and the ASTM E1578 guide to laboratory informatics is the reference that frames them that way — a different data lineage, different systems, different validation than the DCS tags [2].

That split is the chapter's organizing fact. Two streams, two custody chains, one batch they must both serve.

Generating offline results that agree with the online trace

The simulator does something deliberate and important: it samples the offline panel from the same kinetic state the in-line tags come from, then adds analytical noise and a non-negativity floor. So a bench VCD agrees with the online cell-growth model — just noisier and far sparser. That is not a shortcut; it is the reconciliation problem this chapter exists to solve, baked into the test data.

Here is the core of examples/sim/bioproc_sim/offline_assays.py:

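# examples/sim/bioproc_sim/offline_assays.py
import numpy as np
import pandas as pd

# BatchResult, simulate, and stream_rng are defined elsewhere in the
# bioproc_sim package; their imports are elided from this excerpt.
SAMPLES_PER_DAY = 2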


def sample(result: BatchResult | None = None, batch_id: str = "BATCH-2026-001") -> pd.DataFrame:
    """Two offline samples per day from the fed-batch state, with assay noise + LoD."""
    if result is None:
        result = simulate(batch_id)
    s = result.state
    rng = stream_rng("offline_assays", result.batch_id)

    minutes = []
    day = 0.0
    while day <= 14.0 + 1e-9:
        for frac in (0.25, 0.75):  # ~06:00 and ~18:00
            m = int(round((day + frac) * 1440))
            if m < len(s):
                minutes.append(m)
        day += 1.0
    minutes = sorted(set(minutes))

    rows = []
    for i, m in enumerate(minutes, start=1):
        st = s.iloc[m]
        rows.append({
            "sample_id": f"{result.batch_id}-OFF-{i:03d}",
            "batch_id": result.batch_id,
            "sample_time": st["ts"],
            "sample_point": "BR101",
            "VCD_e6_per_mL": max(0.0, round(st.Xv_e6_per_mL * (1 + rng.normal(0, 0.05)), 2)),
            "viability_pct": float(np.clip(round(st.viability_pct + rng.normal(0, 1.2), 1), 0, 100)),
            "glucose_g_L": max(0.0, round(st.glucose_g_L + rng.normal(0, 0.15), 2)),
            "lactate_g_L": max(0.0, round(st.lactate_g_L + rng.normal(0, 0.10), 2)),
            "glutamine_mM": max(0.0, round(st.glutamine_mM + rng.normal(0, 0.10), 2)),
            "ammonia_mM": max(0.0, round(st.ammonia_mM + rng.normal(0, 0.20), 2)),
            "osmolality_mOsm_kg": int(round(st.osmolality_mOsm_kg + rng.normal(0, 4))),
            "titer_g_L": max(0.0, round(st.titer_g_L * (1 + rng.normal(0, 0.04)), 3)),
            "pH_offline": round(float(np.clip(st.pH + rng.normal(0, 0.02), 6.6, 7.4)), 2),
        })
    return pd.DataFrame(rows)

Three details are worth slowing down on, because they encode real lab practice. First, the schedule: two samples a day at frac = 0.25 and 0.75 (roughly 06:00 and 18:00), which is a realistic offline cadence and gives 28 results over a 14-day run, against the ~20,160 one-minute online rows per tag (14 days × 1,440 minutes a day). Offline data is sparse. Second, sample_time is taken from the simulated state row's own timestamp st["ts"], the moment the sample represents, not the moment the function runs. Third, the noise is per-analyte and physically scaled: VCD and titer get a multiplicative 4–5 % error (counting and assay imprecision grows with magnitude), while glucose and lactate get a small additive error. The cell-density, metabolite (glucose, lactate, glutamine, ammonia), and titer values are floored at zero (a non-negativity floor, not a positive limit of detection) so a near-zero reading never goes negative, while viability is clipped to 0–100 % and offline pH to 6.6–7.4. In practice a Cedex titer assay (a lab measurement run on a Roche Cedex analyzer, detailed later) has a real lower limit of detection (LoD, the smallest value an instrument can reliably distinguish from zero) around 0.01–0.05 g/L, which is why the first titer rows (~0.002–0.008) would report as "not detected" on a real instrument.
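
The schedule arithmetic in the first point can be checked in isolation. The sketch below reproduces only the loop that picks sample minutes, under the assumption that the simulated state holds one row per minute for 14 days (20,160 rows). The names N_STATE_ROWS and sample_minutes are ours, not the repo's.

```python
# Standalone check of the sampling schedule, assuming one state row per minute
# for a 14-day run. N_STATE_ROWS is a hypothetical stand-in for len(result.state).
MINUTES_PER_DAY = 1440
N_STATE_ROWS = 14 * MINUTES_PER_DAY  # 20,160 one-minute rows


def sample_minutes(n_rows: int = N_STATE_ROWS) -> list[int]:
    """Return the minute offsets sampled at day fractions 0.25 and 0.75."""
    minutes = set()
    for day in range(15):  # days 0..14, mirroring `while day <= 14.0`
        for frac in (0.25, 0.75):
            m = int(round((day + frac) * MINUTES_PER_DAY))
            if m < n_rows:  # day-14 offsets fall past the last row and are dropped
                minutes.add(m)
    return sorted(minutes)


minutes = sample_minutes()
print(len(minutes), minutes[:3])             # sample count and the first offsets
print(f"{len(minutes) / N_STATE_ROWS:.4%}")  # offline rows as a share of online rows
```

The day-14 iterations contribute nothing because both offsets land beyond the last state row, which is why the loop bound of 14 still yields 28 samples rather than 30.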

The rng = stream_rng("offline_assays", result.batch_id) line is why the whole book is reproducible: every random stream is derived from the master seed (SIM_SEED=2026) plus a per-stream label, so this dataset is byte-identical on every machine and in CI.
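
How a master seed plus a label can yield independent, repeatable streams is easy to sketch. The version below is not the repo's stream_rng. It is a hypothetical stand-in (demo_stream_rng) that hashes the seed, the stream label, and the batch id into an integer and seeds Python's standard random.Random with it. The "release_assays" label is likewise invented for the comparison.

```python
import hashlib
import random


def demo_stream_rng(seed: int, stream: str, batch_id: str) -> random.Random:
    """Hypothetical stand-in for stream_rng: one repeatable generator per (stream, batch)."""
    key = f"{seed}:{stream}:{batch_id}".encode("utf-8")
    # SHA-256 gives the same digest on every machine, unlike the built-in hash().
    derived = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(derived)


a = demo_stream_rng(2026, "offline_assays", "BATCH-2026-001")
b = demo_stream_rng(2026, "offline_assays", "BATCH-2026-001")
c = demo_stream_rng(2026, "release_assays", "BATCH-2026-001")  # invented label

draws_a = [a.gauss(0, 0.05) for _ in range(3)]
draws_b = [b.gauss(0, 0.05) for _ in range(3)]
draws_c = [c.gauss(0, 0.05) for _ in range(3)]
print(draws_a == draws_b)  # compares two generators built from the same key
print(draws_a == draws_c)  # compares generators built from different labels
```

The design point is that adding a new stream never shifts the numbers of an existing one, because each stream's seed depends only on its own label.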

Run the module directly and it prints its own summary:

$ SIM_SEED=2026 python -m bioproc_sim.offline_assays
offline samples: 28 rows over 14 days
                sample_id  VCD_e6_per_mL  viability_pct  glucose_g_L  titer_g_L
0  BATCH-2026-001-OFF-001           0.34           96.6         6.18      0.002
1  BATCH-2026-001-OFF-002           0.43           96.6         6.26      0.008
2  BATCH-2026-001-OFF-003           0.56           99.0         6.01      0.014
3  BATCH-2026-001-OFF-004           0.72           97.5         5.99      0.022
4  BATCH-2026-001-OFF-005           0.96           96.7         5.69      0.033
release assays: 11 rows; OOS=0

That is the module's printed summary — a row count, a five-row head(), and a one-line release-assay tally (OOS = out-of-specification — a result outside its allowed range; here zero). The full per-batch rows are written to the committed golden file examples/datasets/offline_assays.csv by generate.py, which concatenates sample() over every batch (BATCH-2026-001 through -006, 28 rows each, 168 rows in all). Its first rows for the reference batch show the full wide panel — the at-line analyte set in one place:

sample_id,batch_id,sample_time,sample_point,VCD_e6_per_mL,viability_pct,glucose_g_L,lactate_g_L,glutamine_mM,ammonia_mM,osmolality_mOsm_kg,titer_g_L,pH_offline
BATCH-2026-001-OFF-001,BATCH-2026-001,2026-01-05 06:00:00+00:00,BR101,0.34,96.6,6.18,0.13,4.13,0.68,293,0.002,7.06
BATCH-2026-001-OFF-002,BATCH-2026-001,2026-01-05 18:00:00+00:00,BR101,0.43,96.6,6.26,0.19,4.31,0.38,292,0.008,7.04
BATCH-2026-001-OFF-003,BATCH-2026-001,2026-01-06 06:00:00+00:00,BR101,0.56,99.0,6.01,0.32,3.83,0.45,287,0.014,7.05

Read down the VCD_e6_per_mL column — 0.34, 0.43, 0.56 million cells/mL — and you are watching the production culture (BR101, the production bioreactor's vessel tag) climb out of lag (the slow settling-in phase right after the cells are added) into exponential growth (the phase where the population doubles at a steady rate) from a ~0.3e6 inoculum (the starting batch of cells seeded into the vessel; e6 is scientific shorthand for ×10⁶, so 0.3e6 = 0.3 million cells/mL): the culture was inoculated at ~0.3e6 viable cells/mL, so the first bench reading of 0.34e6 six hours later is the inoculum barely started. The titer column starts at essentially zero because antibody accumulation lags growth. The seed train itself — the staged expansion that fed this vessel — is a separate upstream lot, logged the same offline way; it simply is not the trace you are reading here.
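
The "climbing into exponential growth" reading can be put into numbers. The sketch below takes the three bench VCD values from the rows above and applies the textbook exponential-growth relation, mu = ln(x1/x0) / (t1 - t0). It assumes inoculation at 00:00 on 2026-01-05, so the samples sit at 6, 18, and 30 hours. With only two noisy points the estimate is rough: the 5 % counting noise moves it noticeably.

```python
import math

# (hours since inoculation, VCD in million cells/mL) for OFF-001..OFF-003.
# The hour values assume the batch was inoculated at 2026-01-05 00:00 UTC.
readings = [(6.0, 0.34), (18.0, 0.43), (30.0, 0.56)]


def specific_growth_rate(t0: float, x0: float, t1: float, x1: float) -> float:
    """Exponential-growth estimate: mu = ln(x1 / x0) / (t1 - t0), in 1/h."""
    return math.log(x1 / x0) / (t1 - t0)


(t_first, x_first), (t_last, x_last) = readings[0], readings[-1]
mu = specific_growth_rate(t_first, x_first, t_last, x_last)
doubling_time_h = math.log(2) / mu

print(f"mu = {mu:.4f} 1/h, doubling time = {doubling_time_h:.1f} h")
```

On these three readings the doubling time comes out at roughly 33 hours, which is the kind of sanity check a process scientist runs on bench data before trusting a trend.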

Designing the experiment: ambr and DOE

The single fed-batch run we have been generating is what a commercial batch looks like: one recipe, one set of conditions. But the conditions in that recipe were not guessed; they were found, in a process-development (PD) campaign that ran dozens of small cultures in parallel to map how the process responds to its inputs. The representative workhorses there are micro- and parallel bioreactor systems. The Sartorius ambr15 and ambr250 are the names you will most often meet, again as an industry example: the ambr15 runs up to 48 microbioreactors and the ambr250 up to 24 in its high-throughput configuration. Each of those vessels is a miniature of the BR101 we simulate, and each one is a row in an experiment, not a continuous trace.

That experiment is almost always a design of experiments (DOE): a structured grid of input combinations chosen so that a handful of runs reveals the effect of each factor and their interactions. In a Quality-by-Design (QbD) program (a regulator-endorsed development approach that builds quality in by understanding the process rather than testing it in at the end), the goal of the DOE is to chart the design space: the multidimensional region of inputs within which the process reliably makes in-spec product. The design is built and the response model fitted in a statistics package; Sartorius BioPAT MODDE is the long-standing representative tool. Here is the data-shape twist that matters for this book: MODDE's native run table is a proprietary binary file (.mip), not a CSV. So the most decision-rich table in all of process development is locked, by default, inside a vendor format, and getting it into the open lab schema means an export step, exactly like the Cedex .txt discussed later in this chapter.

The DOE table is also shaped backwards from everything else in this chapter. The historian stores long-narrow rows — one timestamp, one tag, one value, millions of times. A DOE run table is wide and short: a few factor columns (feed rate, temperature, pH setpoint…) and a few response columns (final titer, viability, an aggregate quality score), with one row per experimental run. Reading across a row tells you a whole experiment's recipe and outcome; reading down a column tells you how one factor moved. That is the same wide-row shape a bench analyzer's panel has (the Nova FLEX2, detailed below), and it maps into lab.sample plus lab.result the same way — each run is a sample, each response a result — which is how a PD design space and a commercial batch end up living in one schema.
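
The wide-to-long mapping described above can be sketched in a few lines. Everything in the example is hypothetical: the factor and response names, the values, and the choice to carry the factor settings on the sample as a conditions field (the lab.sample table shown later in this chapter has no such column, so where the recipe lives is left open here).

```python
# Hypothetical DOE run table: factor names, response names, and values are
# invented for illustration and do not come from any real study.
FACTORS = ("temp_C", "pH_setpoint")
RESPONSES = {"final_titer_g_L": "g/L", "final_viability_pct": "%"}

doe_runs = [
    {"run_id": "DOE-001", "temp_C": 36.5, "pH_setpoint": 6.9,
     "final_titer_g_L": 3.1, "final_viability_pct": 88.0},
    {"run_id": "DOE-002", "temp_C": 37.0, "pH_setpoint": 7.1,
     "final_titer_g_L": 3.6, "final_viability_pct": 84.5},
]


def to_sample_and_results(run: dict) -> tuple[dict, list[dict]]:
    """Split one wide DOE row into a sample record and one result per response."""
    sample = {"sample_id": run["run_id"],
              "conditions": {f: run[f] for f in FACTORS}}  # hypothetical field
    results = [{"sample_id": run["run_id"], "test_id": name,
                "value": run[name], "unit": unit}
               for name, unit in RESPONSES.items()]
    return sample, results


samples, results = [], []
for run in doe_runs:
    s, rs = to_sample_and_results(run)
    samples.append(s)
    results.extend(rs)

print(len(samples), len(results))  # one sample per run, one result per response
```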

The honest hybrid note belongs here too. Open-source statistics can absolutely do the math: statsmodels fits the regression, pyDOE (and its successors) generates classical factorial and response-surface designs, and scikit-learn will fit a response-surface model and optimize over it. What the OSS stack does not hand you is MODDE's packaging of that math — a validated, vendor-accountable design-space deliverable with the documented design rationale, the diagnostics, and the regulatory-grade report that a QbD submission leans on. The arithmetic is open; the accountable design-space artifact is the GxP last mile (GxP = the umbrella of Good-Practice regulations — GMP, GLP, GCP — that govern any data a regulator may inspect), the same pattern as the signing wrapper around lab.result discussed later in this chapter.
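
To make "the arithmetic is open" concrete, here is a minimal sketch of a two-factor, two-level full factorial and its effect estimates, using only the standard library. It is not pyDOE's or MODDE's API, and the factor names and titer responses are invented for illustration. A real study would add replicates, center points, and a fitted model with diagnostics.

```python
from itertools import product

# Two hypothetical factors, each at a low (-1) and high (+1) coded level.
factors = ["temp", "feed_rate"]
design = list(product((-1, 1), repeat=len(factors)))  # 2^2 full factorial, 4 runs

# Invented responses (final titer, g/L), one per run, for illustration only.
titer = {(-1, -1): 2.8, (-1, 1): 3.4, (1, -1): 3.0, (1, 1): 4.0}


def main_effect(index: int) -> float:
    """Mean response at the high level minus mean response at the low level."""
    high = [titer[run] for run in design if run[index] == 1]
    low = [titer[run] for run in design if run[index] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


def interaction_effect() -> float:
    """Half the difference between the feed-rate effect at high and low temperature."""
    effect_at_high_temp = titer[(1, 1)] - titer[(1, -1)]
    effect_at_low_temp = titer[(-1, 1)] - titer[(-1, -1)]
    return (effect_at_high_temp - effect_at_low_temp) / 2


for i, name in enumerate(factors):
    print(name, round(main_effect(i), 3))
print("interaction", round(interaction_effect(), 3))
```

Four runs are enough to separate the two main effects and their interaction, which is the economy a DOE buys over changing one factor at a time.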

A lab schema built for genealogy

A spreadsheet of results is worthless to a regulator unless every row is anchored to which batch and which sample. ISA-88/IEC 61512 (the batch-control standard) gives us the procedural-and-physical hierarchy — a nesting that runs from the process cell (a production area) down through its units (the individual vessels) to the batch (one execution of the recipe) and the lot (the released material it yields) — that defines what "the right batch" even means, and it is the spine that lets a seed-train or at-line sample attach to the correct genealogy [1]. We modeled that hierarchy in PostgreSQL (the open-source relational database this stack runs on) back in Chapter 4. The lab layer is the part that hangs results off it, and it lives in examples/platform/db/30-lab-events.sql:

-- examples/platform/db/30-lab-events.sql
CREATE TABLE lab.sample (
    sample_id    text PRIMARY KEY,
    batch_id     text REFERENCES s88.batch,
    sample_time  timestamptz NOT NULL,
    sample_point text NOT NULL,
    sample_type  text NOT NULL DEFAULT 'in_process'   -- in_process | release | stability
);

CREATE TABLE lab.test (
    test_id   text PRIMARY KEY,
    name      text NOT NULL,
    unit      text,
    spec_low  numeric,
    spec_high numeric
);

CREATE TABLE lab.result (
    result_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sample_id   text NOT NULL REFERENCES lab.sample,
    test_id     text REFERENCES lab.test,
    value       numeric,
    text_value  text,
    unit        text,
    result_ts   timestamptz NOT NULL DEFAULT now(),
    analyst     text,
    instrument_id text,
    status      text NOT NULL DEFAULT 'preliminary',   -- preliminary | verified | rejected
    UNIQUE (sample_id, test_id, result_ts)
);
CREATE INDEX ON lab.result (sample_id);

This little schema carries more compliance design than it looks. Three points do the heavy lifting.

Two timestamps, on purpose. lab.sample.sample_time is when the sample was taken — the moment in the batch the value is evidence about. lab.result.result_ts is when the result was recorded. These are different events, sometimes hours apart, and PostgreSQL's timestamp with time zone stores both as absolute UTC (Coordinated Universal Time, one global reference clock) instants so the gap between "sampled" and "known" is queryable rather than lost [11]. That gap is the sample-to-insight delay the PAT framework warns about, made into a column.

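batch_id REFERENCES s88.batch is the genealogy link as a foreign key: the database refuses to record a sample that names a batch that does not exist, and every result must in turn point at a real sample. Sample-to-batch traceability stops being a convention and becomes an invariant the engine enforces. One caveat is visible in the DDL: batch_id carries no NOT NULL, so a sample with no batch at all (NULL) still passes the check; only a non-NULL value is validated against s88.batch.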

status and the UNIQUE (sample_id, test_id, result_ts) constraint are how corrections work without lying. A preliminary result and its later verified value are two rows, not an overwrite — which is exactly what the next section needs.
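
The three points can be exercised without a PostgreSQL server. The sketch below is an approximation, not the repo's schema: it uses Python's built-in sqlite3 with trimmed, unqualified tables (batch, sample, result), stores timestamps as ISO text, and relies on SQLite's julianday() for the time gap. The 06:25 and 09:10 result times are the ones this chapter uses for sample OFF-003.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE batch  (batch_id TEXT PRIMARY KEY);
CREATE TABLE sample (sample_id TEXT PRIMARY KEY,
                     batch_id TEXT REFERENCES batch,
                     sample_time TEXT NOT NULL);
CREATE TABLE result (result_id INTEGER PRIMARY KEY,
                     sample_id TEXT NOT NULL REFERENCES sample,
                     test_id TEXT, value REAL, result_ts TEXT NOT NULL,
                     status TEXT NOT NULL DEFAULT 'preliminary',
                     UNIQUE (sample_id, test_id, result_ts));
INSERT INTO batch VALUES ('BATCH-2026-001');
INSERT INTO sample VALUES ('BATCH-2026-001-OFF-003', 'BATCH-2026-001',
                           '2026-01-06 06:00:00');
""")

# 1. A sample naming a batch that does not exist is refused by the engine.
try:
    conn.execute("INSERT INTO sample VALUES "
                 "('X-1', 'NO-SUCH-BATCH', '2026-01-06 06:00:00')")
    fk_refused = False
except sqlite3.IntegrityError:
    fk_refused = True

# 2. Preliminary and verified values coexist because result_ts differs.
insert_sql = ("INSERT INTO result (sample_id, test_id, value, result_ts, status) "
              "VALUES (?, ?, ?, ?, ?)")
conn.execute(insert_sql, ("BATCH-2026-001-OFF-003", "Glucose", 6.01,
                          "2026-01-06 06:25:00", "preliminary"))
conn.execute(insert_sql, ("BATCH-2026-001-OFF-003", "Glucose", 6.01,
                          "2026-01-06 09:10:00", "verified"))

# 3. The sample-to-result delay is an ordinary query over the two timestamps.
delays = conn.execute("""
    SELECT r.status,
           ROUND((julianday(r.result_ts) - julianday(s.sample_time)) * 24 * 60)
    FROM result r JOIN sample s USING (sample_id)
    ORDER BY r.result_ts
""").fetchall()
print(fk_refused, delays)
```

In PostgreSQL the same delay is simply result_ts - sample_time, which yields an interval, because both columns are timestamptz.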

The matching CofA / release panel comes from the same simulator module. The release_results() function emits one certificate-of-analysis row-set per batch — the CofA being the formal quality document that must pass before the batch can be released for shipment — against realistic mAb (monoclonal antibody) specification ranges, the allowed high/low limits each result must fall within:

# examples/sim/bioproc_sim/offline_assays.py
# realistic mAb release-assay specs: (name, low, high, unit, target, sd)
_RELEASE_SPECS = [
    ("SEC_monomer_pct", 95.0, 100.0, "%", 98.5, 0.4),
    ("SEC_HMW_pct", 0.0, 3.0, "%", 1.1, 0.3),
    ("SEC_LMW_pct", 0.0, 2.0, "%", 0.4, 0.15),
    ("CEX_main_pct", 60.0, 80.0, "%", 70.0, 2.0),
    ("CEX_acidic_pct", 10.0, 30.0, "%", 20.0, 1.8),
    ("CEX_basic_pct", 5.0, 20.0, "%", 10.0, 1.2),
    # ... HCP, residual Protein A, host-cell DNA, endotoxin, bioburden follow
]

These are the product-quality release tests, not the cell-culture analytes captured earlier: SEC (size-exclusion chromatography) reports the percentage of intact monomer versus high- and low-molecular-weight species (HMW/LMW, that is, aggregates and fragments); CEX (cation-exchange chromatography) splits the main charge variant from its acidic and basic forms; and the rest are purity and safety limits: HCP (host-cell protein), residual Protein A (the affinity ligand on the capture resin, which can leach into the product), host-cell DNA, endotoxin, and bioburden. Each is judged PASS if its value falls inside its spec range and OOS if it does not.

>>> release_results().head(6).to_string(index=False)
      batch_id            test  value unit  spec_low  spec_high result
BATCH-2026-001 SEC_monomer_pct 98.611    %      95.0      100.0   PASS
BATCH-2026-001     SEC_HMW_pct  1.287    %       0.0        3.0   PASS
BATCH-2026-001     SEC_LMW_pct  0.439    %       0.0        2.0   PASS
BATCH-2026-001    CEX_main_pct 70.686    %      60.0       80.0   PASS
BATCH-2026-001  CEX_acidic_pct 21.551    %      10.0       30.0   PASS
BATCH-2026-001   CEX_basic_pct 10.452    %       5.0       20.0   PASS

The result column is just "PASS" if low <= val <= high else "OOS": the same in-spec / out-of-specification (OOS) logic a real release decision (the formal call to ship or reject a batch) turns on. These rows feed the LIMS/ELN chapter (Chapter 14), the knowledge graph (Chapter 19), and the commercial-LIMS bridge (Chapter 22).
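
The in-spec check is small enough to show whole. This sketch wraps the one-line rule in a function (the name disposition is ours) and applies it to two rows from the table above plus one invented out-of-range value that exercises the OOS branch. The all-PASS roll-up is a simplification: in practice an OOS result opens an investigation and does not simply flip a flag.

```python
def disposition(value: float, low: float, high: float) -> str:
    """In-spec check with inclusive limits, as in the release table."""
    return "PASS" if low <= value <= high else "OOS"


# (label, value, spec_low, spec_high). The first two rows come from the table
# above; the third value is invented to show the out-of-specification branch.
checks = [
    ("SEC_monomer_pct", 98.611, 95.0, 100.0),
    ("SEC_HMW_pct", 1.287, 0.0, 3.0),
    ("SEC_HMW_pct (invented)", 3.4, 0.0, 3.0),
]

outcomes = {label: disposition(v, lo, hi) for label, v, lo, hi in checks}
all_in_spec = all(o == "PASS" for o in outcomes.values())
print(outcomes, all_in_spec)
```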

The off-DCS capture path: bench results arrive as a file drop, are parsed for both their sample and result times, linked to the batch genealogy, and landed as append-only rows that can be corrected without overwriting. Original diagram by the authors, created with AI assistance.

Anatomy of a lab.result row: what a recorded result carries

The whole compliance story of this chapter lives in a single lab.result row, so it is worth dissecting one field by field — the same way Chapter 7 took apart an OPC UA DataValue. Take the verified glucose result for sample BATCH-2026-001-OFF-003, the row the lifecycle SQL below produces. A naive system would store the one number a scientist asks for, 6.01. The schema stores ten things, and each one closes a hole a regulator would otherwise find.

One lab.result row carries far more than its number: a surrogate identity, two foreign keys, the value with its unit and recorded time, who recorded it, and the constraint that admits corrections without an overwrite. Original diagram by the authors, created with AI assistance.

Read the card top to bottom and the design intent is plain:

result_id is a bigint GENERATED ALWAYS AS IDENTITY surrogate primary key. GENERATED ALWAYS means the engine assigns it and an ordinary INSERT cannot supply or override it (PostgreSQL rejects an explicit value unless the statement adds OVERRIDING SYSTEM VALUE), so every result has a stable, machine-minted handle that no human edits, which is exactly what an audit trail wants to reference.
sample_id and test_id are foreign keys, not free text. sample_id points at the lab.sample row that carries the sample_time and the batch_id; test_id points at the lab.test row that defines the analyte's name, unit, and spec range. The result inherits its genealogy and its specification by reference rather than by copy.
value is numeric and text_value is text: a deliberate split so the one table holds both a quantitative glucose reading (6.01, text_value NULL) and a qualitative result like a PASS/OOS disposition or a "Not Detected", without forcing a number into a string or vice versa.
unit travels with the value. 6.01 is meaningless; 6.01 g/L is a result. The unit is stored on the row even though lab.test also carries one, because the result is the contemporaneous record of what the instrument actually reported.
result_ts is the recorded time — here 2026-01-06 09:10:00+00. This is the field that makes the chapter's two-timestamp problem tractable, and the next section is built entirely on the gap between it and sample_time.
analyst records who recorded the value — SVC_INGEST for a machine-written preliminary row, a human login like a.kowalski for a verified one. instrument_id records which device produced it; on the offline file-drop path the ingester leaves it NULL (the CSV does not name the instrument), whereas the CofA bridge in Chapter 22 fills it with a real device id like HPLC-07. Recording the field as nullable rather than omitting it is the honest choice: the column exists, and a richer feed populates it.
status is the lifecycle flag — preliminary | verified | rejected, defaulting to preliminary. It is the difference between "the machine said so" and "a qualified person stands behind it".

The single most important line on the card is the bottom one: UNIQUE (sample_id, test_id, result_ts). That tuple — not result_id — is the real business key, and it is deliberately wide enough to let two results for the same sample and same test coexist as long as their recorded times differ. That is the whole mechanism by which a correction is an append, not an overwrite, which the lifecycle section makes concrete.
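
If corrections are appends, a reader of the table needs a rule for which row is current. The text does not prescribe one, so the sketch below assumes a plausible policy: for each (sample_id, test_id), take the row with the latest result_ts that is not rejected, and leave every earlier row in place as history. The Result class and current_results function are illustrative names, not repo code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Result:
    sample_id: str
    test_id: str
    value: float
    result_ts: datetime
    status: str


def current_results(rows: list[Result]) -> dict[tuple[str, str], Result]:
    """Latest non-rejected row per (sample_id, test_id); `rows` is not modified."""
    latest: dict[tuple[str, str], Result] = {}
    for r in sorted(rows, key=lambda r: r.result_ts):
        if r.status != "rejected":
            latest[(r.sample_id, r.test_id)] = r  # later rows replace earlier ones
    return latest


utc = timezone.utc
history = [
    Result("BATCH-2026-001-OFF-003", "Glucose", 6.01,
           datetime(2026, 1, 6, 6, 25, tzinfo=utc), "preliminary"),
    Result("BATCH-2026-001-OFF-003", "Glucose", 6.01,
           datetime(2026, 1, 6, 9, 10, tzinfo=utc), "verified"),
]

current = current_results(history)[("BATCH-2026-001-OFF-003", "Glucose")]
print(current.status, len(history))  # the chosen row, with the full history intact
```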

Back along the trilogy spine

The lab.sample and lab.result rows on that card are the digital shadow of a real bench draw on a real seed train. Book 1 walks the physical steps these rows record: thawing and expanding the master cell bank vial (the qualified frozen stock of cells the whole process starts from) in Cell Line Development, then growing the N-1 inoculum (the seed culture in the stage just before the production reactor) through the staged Seed Train up to the production bioreactor. Book 2's tour of where process data is born frames the open challenge of anchoring each offline measurement to its batch and its true sample moment — the problem this schema's two timestamps and batch_id foreign key are the concrete answer to.

Anatomy of the CSV drop row: what the analyzer writes

The lab.result row is the destination. The source is a single line in the analyzer's CSV drop, and it has a different shape worth dissecting in parallel — because the seam between the two is where most of the work and most of the mistakes live. Here is the first reference-batch line of offline_assays.csv, taken apart.

The analyzer's wide row splits four ways: four columns become one lab.sample, three mapped columns become preliminary lab.result rows, the result time is minted at ingest, and a per-vendor parser seam decides whether the file even arrives as CSV. Original diagram by the authors, created with AI assistance.

The amber block at the top holds the four columns that describe the sample, not any measurement: sample_id, batch_id, sample_time, sample_point. The ingester's upsert_sample (update-or-insert: it creates the lab.sample row the first time and quietly reuses it if the same sample arrives again) turns those four into exactly one lab.sample row, and batch_id is the foreign key that anchors the whole row-set to the GMP batch. Crucially, sample_time is the only timestamp the file carries — the moment the sample was taken. There is no column for when the result was recorded, which is the gap the next section is about.

Below the amber block, each analyte column becomes its own lab.result row — but only the three the machine is trusted to capture unattended. The INGEST_TESTS map names them: VCD_e6_per_mL → VCD (e6/mL), glucose_g_L → Glucose (g/L), titer_g_L → Titer (g/L). Every one of those rows is written status = 'preliminary'. The rest of the wide panel — viability, lactate, glutamine, ammonia, osmolality, offline pH — is present in the file but is not auto-ingested; it waits for an analyst to enter and verify it. The result_ts is the field the file does not have at all: the DB default now() mints it at ingest, which is precisely why the recorded time is later than the sampled time.

The violet panel at the bottom is the seam that bites multi-vendor sites first. This clean wide-row shape is native to only a minority of instruments — the Nova FLEX2 and the Vi-CELL BLU write CSV directly — while a Roche Cedex Bio HT writes a tab-delimited .txt, a Vi-CELL XR writes .txt/.xls, and a proprietary binary like a MODDE .mip must be exported before anything can read it. A real deployment hangs a small per-vendor parser off the on_created hook to normalize each layout into this one row shape before ingest() ever runs.
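
The split described above can be traced on the actual first line of the golden file. This sketch uses only the standard csv module and plain dicts in place of the ingester's database helpers. The header, the data line, and the INGEST_TESTS map are copied from this chapter, while SAMPLE_COLS is a name introduced here.

```python
import csv
import io

HEADER = ("sample_id,batch_id,sample_time,sample_point,VCD_e6_per_mL,viability_pct,"
          "glucose_g_L,lactate_g_L,glutamine_mM,ammonia_mM,osmolality_mOsm_kg,"
          "titer_g_L,pH_offline")
LINE = ("BATCH-2026-001-OFF-001,BATCH-2026-001,2026-01-05 06:00:00+00:00,BR101,"
        "0.34,96.6,6.18,0.13,4.13,0.68,293,0.002,7.06")

SAMPLE_COLS = ("sample_id", "batch_id", "sample_time", "sample_point")
INGEST_TESTS = {
    "VCD_e6_per_mL": ("VCD", "e6/mL"),
    "glucose_g_L": ("Glucose", "g/L"),
    "titer_g_L": ("Titer", "g/L"),
}

row = next(csv.DictReader(io.StringIO(HEADER + "\n" + LINE)))

# Four columns describe the sample itself.
sample = {col: row[col] for col in SAMPLE_COLS}

# Three mapped analyte columns become preliminary results; no result time is
# set here because the file has none (the database default supplies it).
results = [
    {"sample_id": row["sample_id"], "test_id": test_id,
     "value": float(row[col]), "unit": unit, "status": "preliminary"}
    for col, (test_id, unit) in INGEST_TESTS.items()
]

# Everything else is present in the file but left for an analyst.
not_ingested = [c for c in row if c not in SAMPLE_COLS and c not in INGEST_TESTS]

print(sample["sample_time"], [r["value"] for r in results], not_ingested)
```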

Ingesting the file drop

Real analyzers — Vi-CELL-style cell counters, Nova/Cedex-style metabolite analyzers — export a CSV or a vendor file to a watched folder.

Take one representative bench analyzer — the Nova Biomedical BioProfile FLEX2 is a common example — and the dual nature of this layer comes into focus. The FLEX2 runs a wide chemistry panel on a single drawn sample (glucose, lactate, glutamine, ammonium, osmolality, plus gases and cell density), and it is one of the few instruments here whose on-disk export is native CSV — the FLEX2 and the Vi-CELL BLU below are the only native-CSV cases in this chapter. That CSV is a wide, tabular-per-sample row: one line per sample, one column per analyte. That row shape is exactly the basis for the INGEST_TESTS column-to-test_id mapping above — each column becomes a lab.result row tagged with its test. But the FLEX2 can also expose the same results over OPC UA, which means one physical instrument can land on two routes at once: a file drop into this watched folder, and a live tag into the historian over the OPC UA path from "Speaking OT: OPC UA, MQTT, and Sparkplug B" (Chapter 7). The two routes are not redundant — the OPC UA tag gives you the value sooner, the CSV file drop gives you the full inspectable panel — and reconciling them back to one sample is the same two-timestamp problem this chapter is built around.

Contrast that with a Roche Cedex Bio HT, again a representative industry example: a chemistry analyzer that runs the metabolite panel and an IgG-titer assay. Its export is not native CSV at all: it writes a tab-delimited .txt archive, a vendor file with its own header and column conventions. That is precisely why the ingester is written to catch "a CSV or a vendor file" rather than assuming CSV. The .csv filter in OfflineDrop.on_created below is the easy case, and a real deployment adds a small per-vendor parser that normalizes the Cedex .txt (or any other instrument's layout) into the same wide-row shape before ingest() ever sees it.
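
A per-vendor parser of that kind might look like the sketch below. The tab-delimited layout and its column names are invented for illustration and are not the real Cedex Bio HT export format; only the idea (rename known columns into the common wide-row names, fail loudly on anything unmapped) is the point. A real parser would also have to supply batch_id and sample_point, which this hypothetical file does not carry.

```python
import csv
import io

# Hypothetical tab-delimited export: column names and layout are invented
# for illustration and are NOT the real Cedex Bio HT format.
VENDOR_TXT = (
    "Sample ID\tDraw Time\tGLC [g/L]\tIgG [g/L]\n"
    "BATCH-2026-001-OFF-003\t2026-01-06 06:00:00+00:00\t6.01\t0.014\n"
)

# Vendor column -> column name in the common wide-row shape.
COLUMN_MAP = {
    "Sample ID": "sample_id",
    "Draw Time": "sample_time",
    "GLC [g/L]": "glucose_g_L",
    "IgG [g/L]": "titer_g_L",
}


def normalize_vendor_txt(text: str) -> list[dict]:
    """Rename known vendor columns; raise on any column the map does not cover."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    unknown = set(reader.fieldnames or []) - set(COLUMN_MAP)
    if unknown:
        raise ValueError(f"unmapped vendor columns: {sorted(unknown)}")
    return [{COLUMN_MAP[k]: v for k, v in row.items()} for row in reader]


rows = normalize_vendor_txt(VENDOR_TXT)
print(rows[0]["sample_id"], rows[0]["glucose_g_L"], rows[0]["titer_g_L"])
```

Refusing unmapped columns is a deliberate choice in this sketch: a silent drop would hide a firmware update that renamed or added an analyte.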

The cell counter shows how messy this gets even within one vendor's lineup. The Beckman Coulter Vi-CELL family, our representative automated cell counter, splits two ways: the older Vi-CELL XR writes .txt and .xls/.xlsx exports, so it needs a parser just like the Cedex; the newer Vi-CELL BLU writes native CSV, and can be fed through Benchling's open-source allotropy Python library to canonicalize that CSV into Allotrope ASM JSON. (allotropy only canonicalizes text, Excel, and CSV exports to ASM; it does not crack open proprietary binary formats.) Two counters, one product family, three on-disk formats between them: this is why "the instrument writes CSV" is never a safe assumption. The ASM path itself (what a vendor-neutral, ontology-tagged result file buys you, and where it stops) is the subject of "The Analytical Lab: Instruments, LIMS & ELN" (Chapter 14); here it is enough to know that the clean native-CSV instruments are the lucky minority.

We catch the dropped file the moment it lands, using the watchdog library, whose Observer plus a FileSystemEventHandler is the standard Python way to react to filesystem events [9]. Here is the heart of the repo's file-ingester service, examples/services/file-ingester/app.py (the DB-helper and main() boilerplate elided for space):

# examples/services/file-ingester/app.py (excerpt)
import pandas as pd
import psycopg
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

# offline CSV column -> (test_id, unit): the preliminary panel the machine captures
INGEST_TESTS = {
    "VCD_e6_per_mL": ("VCD", "e6/mL"),
    "glucose_g_L": ("Glucose", "g/L"),
    "titer_g_L": ("Titer", "g/L"),
}


class OfflineDrop(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or not event.src_path.endswith(".csv"):
            return
        ingest(event.src_path)


def ingest(path: str) -> int:
    """Load one offline CSV drop; return the number of result rows written."""
    df = pd.read_csv(path, parse_dates=["sample_time"])      # parse, don't guess
    df["sample_time"] = df["sample_time"].dt.tz_convert("UTC")  # one canonical zone
    written = 0
    with psycopg.connect(DSN, autocommit=False) as conn:
        for _, row in df.iterrows():
            upsert_sample(conn, row.sample_id, row.batch_id,
                          row.sample_time.to_pydatetime(), row.sample_point)
            for col, (test_id, unit) in INGEST_TESTS.items():
                insert_result(conn, row.sample_id, test_id, float(row[col]), unit,
                              analyst="SVC_INGEST", status="preliminary")
                written += 1
        conn.commit()
    return written

The unglamorous lines are the ones that matter. parse_dates=["sample_time"] and tz_convert("UTC") lean on pandas' time-series handling to turn the offset-bearing timestamp the file carries into a single, timezone-aware UTC instant [10], because an analyzer in a Newark suite and a server in UTC must agree on when, or the reconciliation in the next section is built on sand. (tz_convert only works on timezone-aware data; an instrument that writes naive local time needs a tz_localize to its own zone first.) The autocommit=False connection plus the single conn.commit() after the loop wraps the whole file in one database transaction (a unit of work the engine applies all-or-nothing): if any row fails, the engine rolls the rest back, so a partial CSV never leaves half a sample's results behind. And every ingested row is written status="preliminary": the machine captures the value, but a human still has to verify it before it counts toward a release decision.
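
The timezone step deserves one concrete case, because pandas' tz_convert raises on timezone-naive data and a naive column must be localized first. The sketch below shows the same two-branch logic with the standard library alone. It assumes, purely for illustration, an instrument PC that writes naive local time at a fixed UTC-5 offset; a real deployment would use a proper IANA zone so daylight-saving changes are handled.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the instrument writes naive local time at a fixed UTC-5 offset.
# A fixed offset ignores daylight saving; it is used here only to keep the
# example free of any timezone-database dependency.
INSTRUMENT_TZ = timezone(timedelta(hours=-5))


def to_utc(stamp: str) -> datetime:
    """Parse an ISO-style timestamp and return a timezone-aware UTC instant."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        # Naive string: attach the instrument's zone first, never assume UTC.
        parsed = parsed.replace(tzinfo=INSTRUMENT_TZ)
    return parsed.astimezone(timezone.utc)


aware = to_utc("2026-01-06 06:00:00+00:00")  # already carries an offset
naive = to_utc("2026-01-06 01:00:00")        # local wall-clock time, no offset
print(aware.isoformat(), naive.isoformat(), aware == naive)
```

Both strings resolve to the same instant, which is the property the reconciliation depends on: two systems may write the time differently, but they must not disagree about when the sample was pulled.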

The hard part: reconciling timestamps, and handling late and corrected results

Now the genuinely difficult, genuinely regulatory work. Manual entry and late or corrected data are the classic data-integrity risk areas; the FDA's data-integrity guidance is blunt that what protects you is recording the original value, a contemporaneous timestamp, and a documented reason for any change [3]. ALCOA+ is the regulators' data-integrity checklist: data must be Attributable, Legible, Contemporaneous, Original, and Accurate, and the "+" adds Complete, Consistent, Enduring, and Available. The MHRA (the UK drug regulator) spells out the two attributes most at risk here: Contemporaneous (record it at the time the activity is performed) and Original (preserve the first capture, not a transcribed summary) [4]. And the PIC/S (an international scheme that coordinates inspectors across regulators) data-management guide gives the lifecycle frame: corrections and amendments are normal and expected, but they must be handled so the original remains visible and the change is traceable [5].

The result lifecycle: sampled, preliminary, verified, corrected

Our schema is built to honor all three without a single overwrite. Walk one offline glucose result through its life — the same BATCH-2026-001-OFF-003 glucose result the anatomy card dissected:

-- 1. The sample is pulled at 06:00; that moment is recorded immediately.
INSERT INTO lab.sample (sample_id, batch_id, sample_time, sample_point)
VALUES ('BATCH-2026-001-OFF-003', 'BATCH-2026-001',
        '2026-01-06 06:00:00+00', 'BR101');

-- 2. The analyzer reports at 06:25; ingester writes a PRELIMINARY value.
INSERT INTO lab.result (sample_id, test_id, value, unit, result_ts, analyst, status)
VALUES ('BATCH-2026-001-OFF-003', 'Glucose', 6.01, 'g/L',
        '2026-01-06 06:25:00+00', 'SVC_INGEST', 'preliminary');

-- 3. The analyst reviews and VERIFIES — a NEW row, not an update.
INSERT INTO lab.result (sample_id, test_id, value, unit, result_ts, analyst, status)
VALUES ('BATCH-2026-001-OFF-003', 'Glucose', 6.01, 'g/L',
        '2026-01-06 09:10:00+00', 'a.kowalski', 'verified');

Both rows survive. The UNIQUE (sample_id, test_id, result_ts) constraint lets them coexist because their result_ts differ, and the original preliminary capture is never destroyed — Original and Contemporaneous, both preserved. A genuine correction (say a transcription error caught the next day) is the same move: a new row carrying the corrected value, a new result_ts, and — in Chapter 23, where we add the reason-for-change audit trail and the tamper-evident hash chain — a recorded reason and signer. This chapter builds the append-only bones; Chapter 23 adds the muscle.
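
The append-only pattern can be tried without a PostgreSQL server. The sketch below uses the standard library's sqlite3 as a stand-in, so several details are assumptions rather than the chapter's schema: the table is named result instead of lab.result, timestamps are stored as ISO-8601 text instead of timestamptz, and only the columns needed for the demonstration are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE result (
        result_id INTEGER PRIMARY KEY AUTOINCREMENT,
        sample_id TEXT NOT NULL,
        test_id   TEXT NOT NULL,
        value     REAL NOT NULL,
        result_ts TEXT NOT NULL,   -- ISO-8601 in UTC (stand-in for timestamptz)
        analyst   TEXT NOT NULL,
        status    TEXT NOT NULL,
        UNIQUE (sample_id, test_id, result_ts)
    )
""")

INSERT = ("INSERT INTO result (sample_id, test_id, value, result_ts, analyst, status) "
          "VALUES (?, ?, ?, ?, ?, ?)")
ROWS = [
    ("BATCH-2026-001-OFF-003", "Glucose", 6.01,
     "2026-01-06T06:25:00+00:00", "SVC_INGEST", "preliminary"),
    ("BATCH-2026-001-OFF-003", "Glucose", 6.01,
     "2026-01-06T09:10:00+00:00", "a.kowalski", "verified"),
]
conn.executemany(INSERT, ROWS)

# Full history, oldest first. Text ordering is safe only because every
# timestamp here uses the same UTC offset.
history = conn.execute(
    "SELECT status, analyst FROM result "
    "WHERE sample_id = ? AND test_id = ? ORDER BY result_ts",
    ("BATCH-2026-001-OFF-003", "Glucose"),
).fetchall()
current_status = history[-1][0]

# Replaying the identical capture is refused by the UNIQUE constraint.
try:
    conn.execute(INSERT, ROWS[0])
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

print(history)             # [('preliminary', 'SVC_INGEST'), ('verified', 'a.kowalski')]
print(current_status)      # verified
print(duplicate_accepted)  # False
```

The "current" value is always derived by reading the newest row, never by editing an older one, which is what keeps the preliminary capture available to a reviewer.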

The two timestamps, and why the gap is a column not a footnote

Notice what the three inserts above quietly establish: the sample's sample_time is 06:00, the preliminary result_ts is 06:25, and the verified result_ts is 09:10. Three clocks, one result. The temptation in a homegrown system is to keep one timestamp and call it "the time of the result" — and that single shortcut is what turns offline data into a data-integrity liability, because it silently asserts that the value characterizes the batch at the moment it was recorded rather than the earlier moment the sample represents.

The schema refuses that conflation by spending two columns instead of one. lab.sample.sample_time belongs to the sample; lab.result.result_ts belongs to each result row. Because both are timestamptz, PostgreSQL stores them as absolute UTC instants and the gap between them — the sample-to-insight delay the PAT framework warns about — is a value you can subtract, filter, and chart, not a fact lost to a transcription [11]. A near-real-time at-line reading might show a gap of minutes; an off-line stability assay shipped to a contract lab might show days. Either way the gap is recorded, which is the only way a reviewer can later tell a contemporaneous capture from a back-dated one. The MHRA puts Contemporaneous and Original at the top of the at-risk list for exactly this reason [4]: the cheap mistake is to overwrite the original capture or to stamp it with the wrong clock.
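
Because both timestamps are absolute instants, the gap is plain arithmetic. A minimal sketch using the three times from the inserts above; sample_to_result_delay is a hypothetical helper, and its refusal of a negative gap is an assumed sanity check rather than something the schema itself enforces.

```python
from datetime import datetime, timedelta

sample_time = datetime.fromisoformat("2026-01-06T06:00:00+00:00")
result_rows = [
    ("preliminary", datetime.fromisoformat("2026-01-06T06:25:00+00:00")),
    ("verified", datetime.fromisoformat("2026-01-06T09:10:00+00:00")),
]


def sample_to_result_delay(sample_time: datetime, result_ts: datetime) -> timedelta:
    """Return how long after the pull a result row was recorded."""
    if result_ts < sample_time:
        # Assumed check: a result cannot be recorded before its sample exists.
        raise ValueError("result_ts precedes sample_time")
    return result_ts - sample_time


delays = {status: sample_to_result_delay(sample_time, ts) for status, ts in result_rows}
for status, delay in delays.items():
    print(status, delay)
# preliminary 0:25:00
# verified 3:10:00
```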

Two custody chains, one timeline: reconciling offline with online

Reconciling the sample timestamp with the online trace is now a clean join, because the historian carries the same batch_id and the sample carries its sample_time:

-- Online DO at (or just after) the moment OFF-003 was pulled.
SELECT s.sample_id, s.sample_time, r.value AS bench_glucose,
       (SELECT value FROM ts.sensor_reading t
        WHERE t.batch_id = s.batch_id AND t.tag = 'BR101.DO.PV'
          AND t.ts >= s.sample_time
        ORDER BY t.ts LIMIT 1) AS online_do_at_sample
FROM lab.sample s
JOIN lab.result r ON r.sample_id = s.sample_id AND r.test_id = 'Glucose'
WHERE s.sample_id = 'BATCH-2026-001-OFF-003' AND r.status = 'verified';

       sample_id        |      sample_time       | bench_glucose | online_do_at_sample
------------------------+------------------------+---------------+---------------------
 BATCH-2026-001-OFF-003 | 2026-01-06 06:00:00+00 |          6.01 |              39.059

The bench glucose drawn at 06:00 now sits next to the first online dissolved-oxygen reading at or after that instant. The tag BR101.DO.PV reads as bioreactor 101, dissolved oxygen, process value, the convention Chapter 7 introduced. Two custody chains, one timeline: that is the entire point of capturing offline data properly rather than leaving it stranded in a spreadsheet.
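
The correlated subquery above is an "as-of" lookup: the first online reading at or after the sample time. The same logic can be sketched in Python with a binary search. The readings below are invented illustrative values, not data from the run, and max_lag is a hypothetical guard that the SQL does not have: it refuses a match that lands too long after the pull.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def first_reading_at_or_after(readings, sample_time, max_lag):
    """Return the first (ts, value) with ts >= sample_time, or None.

    readings must be sorted by timestamp. max_lag is a hypothetical
    tolerance: a reading later than sample_time + max_lag is not accepted.
    """
    times = [ts for ts, _ in readings]
    i = bisect_left(times, sample_time)   # index of the first ts >= sample_time
    if i == len(readings):
        return None
    ts, value = readings[i]
    if ts - sample_time > max_lag:
        return None
    return ts, value


def utc(text):
    """Parse a UTC wall-clock string into an aware datetime."""
    return datetime.fromisoformat(text + "+00:00")


# Illustrative online trace (made-up values), sorted by time.
readings = [
    (utc("2026-01-06T05:59:40"), 40.0),
    (utc("2026-01-06T06:00:10"), 39.5),
    (utc("2026-01-06T06:00:40"), 39.0),
]
sample_time = utc("2026-01-06T06:00:00")

match = first_reading_at_or_after(readings, sample_time, timedelta(seconds=30))
too_strict = first_reading_at_or_after(readings, sample_time, timedelta(seconds=5))
print(match[1])    # 39.5
print(too_strict)  # None
```

Returning None instead of a stale reading keeps a gap in the online trace visible, rather than pairing the bench value with a number from a different moment.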

The same row, as a triple a graph can reason over

The relational lab.result row is the system of record, but once you stop seeing rows and start seeing facts it is also a bundle of subject–predicate–object triples, the atom of an RDF (Resource Description Framework) graph. That re-reading is what the knowledge-graph chapter does for the whole campaign, and it is worth previewing here because the offline panel is exactly the data a digital thread most needs. The verified glucose result reads as three triples, preceded by two that anchor its sample to the batch, written in Turtle (the human-readable RDF text format). In this preview the unit is abbreviated to a comment beside the value; in the full graph each value carries a QUDT unit IRI so 6.01 is never a bare number tacked onto a string:

@prefix bp:   <https://example.org/bioproc#> .
@prefix qudt: <http://qudt.org/schema/qudt/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

bp:BATCH-2026-001-OFF-003 bp:fromBatch    bp:BATCH-2026-001 ;     # the s88.batch FK, as an edge
                          bp:sampleTime    "2026-01-06T06:00:00Z"^^xsd:dateTime .
bp:RESULT-OFF-003-Glu     bp:ofSample      bp:BATCH-2026-001-OFF-003 ;
                          bp:glucose       "6.01"^^xsd:float ;     # qudt:unit grams-per-litre
                          bp:resultStatus  "verified" .

The batch_id foreign key becomes a bp:fromBatch edge you can walk, so the sample-to-batch genealogy this schema enforces relationally is the same derivedFrom/fromBatch lineage a SPARQL property path traverses in one hop. Three open vocabularies give those terms shared meaning rather than local-only names: Allotrope ASM (the vendor-neutral JSON the allotropy path above emits) tags a result, instrument, and sample the same way whether it came from your Cedex or a contract lab; IOF/BMIC (the Industrial Ontologies Foundry's biopharma content, grounded in the ISO/IEC 21838-2 BFO upper ontology) types the batch and material; and QUDT types the quantity and unit — the alignment the classes-and-taxonomy and identifiers-and-units chapters build, and the same stack the relations-and-genealogy chapter uses to make derivedFrom a transitive edge.

What the database enforces with constraints, the graph enforces with SHACL (the Shapes Constraint Language, a closed-world gate over triples). The batch_id REFERENCES s88.batch foreign key and the status lifecycle become a node shape: every result must point at exactly one real sample, carry a unit, and bear a status drawn from the controlled set. The illustrative shape below spells out the sample and status constraints and omits the unit constraint for brevity. It is the precise pattern the release-gate chapter models for the CofA panel, where a missing required test is a failure now, not an open question.

# Illustrative — the relational constraints, re-expressed as a closed-world SHACL shape.
@prefix bp: <https://example.org/bioproc#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

bp:ResultShape a sh:NodeShape ;
    sh:targetClass bp:LabResult ;
    sh:property [ sh:path bp:ofSample ; sh:minCount 1 ; sh:maxCount 1 ;
                  sh:class bp:Sample ;                                  # genealogy, as a constraint
                  sh:message "A result must attach to exactly one real sample." ] ;
    sh:property [ sh:path bp:resultStatus ; sh:minCount 1 ;
                  sh:in ( "preliminary" "verified" "rejected" ) ] .

This makes the chapter's two-stream join a competency question — a question the vocabulary must be able to answer, the acceptance-test unit Book 4 turns into runnable PASS/FAIL checks in its competency-questions chapter. "What verified glucose was the culture at, and what was the dissolved oxygen at that same sampled instant?" is the SPARQL twin of the SQL join above; "which lab results have no verified row?" is the SHACL gate's negative. The honest scope note matches the graph chapter's: a single typed scalar like glucose maps cleanly into a triple, but a raw spectrum or chromatogram stays a referenced Allotrope/AnIML file the graph points at by IRI — the index of the thread, not the warehouse for every array on it. And both timestamps survive the lift: sample_time and result_ts become typed xsd:dateTime literals, so PROV-O-style provenance (who recorded what, when) is queryable rather than buried — the Attributable and Contemporaneous attributes, expressed in the graph the next chapters of Book 4 govern.
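
For readers who have not met SHACL, the two constraints in the shape above can be mimicked in plain Python. This is an analogy only, not a SHACL engine: check_result and the dictionary layout are hypothetical, and a real graph would be validated by a SHACL processor. What the sketch preserves is the closed-world stance, where a missing fact counts as a violation.

```python
ALLOWED_STATUS = {"preliminary", "verified", "rejected"}


def check_result(result: dict, known_samples: set) -> list:
    """Return the violations for one result node; an empty list means it conforms.

    result maps a predicate name to the list of values asserted for it,
    a hypothetical stand-in for the triples about one bp:LabResult.
    """
    violations = []

    # bp:ofSample: exactly one value, and it must be a known sample.
    samples = result.get("ofSample", [])
    if len(samples) != 1 or samples[0] not in known_samples:
        violations.append("A result must attach to exactly one real sample.")

    # bp:resultStatus: at least one value, each from the controlled set.
    statuses = result.get("resultStatus", [])
    if not statuses:
        violations.append("A result must carry a status.")
    for status in statuses:
        if status not in ALLOWED_STATUS:
            violations.append(f"Status {status!r} is not in the controlled set.")

    return violations


known_samples = {"BATCH-2026-001-OFF-003"}
good = {"ofSample": ["BATCH-2026-001-OFF-003"], "resultStatus": ["verified"]}
orphan = {"resultStatus": ["approved"]}   # no sample, unknown status

print(check_result(good, known_samples))         # []
print(len(check_result(orphan, known_samples)))  # 2
```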

Why it matters

Process scientists steer cell culture on the offline panel, not the online tags. Glucose and lactate decide the feeding strategy; VCD and viability decide harvest timing (when to end the culture and collect the antibody-laden broth for purification); titer is the number the whole campaign is judged by. Yet this is the data most likely to be mishandled, because it is the data a human touches. A value typed into the wrong batch, a sample time guessed after the fact, an out-of-spec result quietly overwritten — these are exactly the findings that fill regulatory warning letters — the formal enforcement notices a regulator issues when an inspection turns up serious violations. Building the capture path so that the original is preserved, the two timestamps are distinct, and the batch link is a hard foreign key turns the riskiest data in the plant into the most defensible.

What the warning letters actually say

This is not a hypothetical risk; it is a measured one. A retrospective analysis of every FDA warning letter issued to pharmaceutical companies from 2010 through 2020 identified three dominant current Good Manufacturing Practice (cGMP, the regulations that define how drugs must be manufactured) deficiency categories [12]. Documentation and data-integrity findings accounted for roughly 21% of the cGMP letters, second only to process validation (26%: findings that the manufacturing process was not proven to reliably make conforming product) and ahead of quality control (15%: findings in the testing-and-release function itself). Documentation was cited as a major deficiency in about 20–25% of letters on average [12]. The mechanisms behind those findings are the ones this schema is built against. The FDA's data-integrity guidance is blunt that the protection is recording the original value with a contemporaneous timestamp and a documented reason for any change [3], and inspectors keep finding the opposite: results not recorded contemporaneously, original captures overwritten, and timestamps that do not survive scrutiny, which is precisely the MHRA's Contemporaneous and Original attributes failing in the field [4]. An append-only lab.result with two distinct timestamptz columns and a batch_id foreign key is not gold-plating; it is the engineering answer to the single most common laboratory-records citation class.

It also completes the picture for everything downstream. Because both online and offline data now carry batch_id and a trustworthy time, a single join can reconcile them — the same move the contextualization view (Chapter 17) makes for the online stream, extended here to the offline panel. The soft-sensor in Chapter 29 — a model that predicts a hard-to-measure value from easy live signals, here a Raman (a light-scattering spectroscopy probe) reading turned into a titer estimate — needs the offline titer as its training label (the known, true value the model learns to reproduce); a mislinked sample poisons the model. Offline data is not a footnote to the historian. It is the other half.

Why the modeling chapter leans on this schema

Three properties of this little lab schema are not just compliance bookkeeping — they are what makes a defensible soft sensor possible at all, and Book 5 depends on every one of them. First, the batch_id foreign key is what lets a model be validated honestly. A spectrum gives a model 701 collinear columns, but a fed-batch run gives it one genuinely independent observation, so the only sound way to score a titer sensor is a grouped, leave-one-batch-out split — train on five batches, test on the sixth held-out whole batch, never mixing rows from one run across the train/test line. Splitting random rows instead leaks: two samples from the same culture taken hours apart are nearly identical, so a row-wise split lets the model memorize a batch it then "predicts," and the reported R² flatters itself. The batch_id this schema enforces as a hard foreign key is precisely the grouping key that split needs — the models-and-validation chapter builds the batch-grouped split on exactly this column, and the learning-problem chapter names the cold-start reality that the binding constraint is the number of independent batches, not rows.
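
The grouped split described above needs nothing more than the batch_id column. A minimal standard-library sketch follows; leave_one_batch_out is a hypothetical helper and the batch labels are invented. scikit-learn's LeaveOneGroupOut provides the same behaviour for readers who already use that library.

```python
from collections import defaultdict


def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_rows, test_rows) with whole batches held out.

    batch_ids[i] is the batch that row i belongs to. No batch ever appears
    on both sides of a fold.
    """
    rows_by_batch = defaultdict(list)
    for row_index, batch_id in enumerate(batch_ids):
        rows_by_batch[batch_id].append(row_index)

    for held_out in sorted(rows_by_batch):
        test_rows = rows_by_batch[held_out]
        train_rows = [row
                      for batch_id in sorted(rows_by_batch) if batch_id != held_out
                      for row in rows_by_batch[batch_id]]
        yield held_out, train_rows, test_rows


# Invented example: five samples drawn from three batches.
batch_ids = ["B1", "B1", "B2", "B2", "B3"]
folds = list(leave_one_batch_out(batch_ids))
for held_out, train_rows, test_rows in folds:
    print(held_out, train_rows, test_rows)
# B1 [2, 3, 4] [0, 1]
# B2 [0, 1, 4] [2, 3]
# B3 [0, 1, 2, 3] [4]
```

The number of folds equals the number of batches, not the number of rows, which makes the cold-start constraint visible in the validation loop itself.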

Second, the two-stream join is the applicability-domain check waiting to happen. A locked soft sensor is only trustworthy inside the input envelope it was calibrated on; the moment a live spectrum carries structure the training set never saw, its number must be flagged, not trusted. That label-free gate — a Hotelling T² and squared-prediction-error (SPE) distance off the model's plane — runs on the online stream, but it is the offline panel that grounds it: the verified titer landing twice a day is the reference that tells you whether the sensor's drift is real. The models-and-validation chapter wires that applicability-domain gate to the PLS sensor; this schema supplies its truth source.

Third, the sample_time/result_ts gap is the difference between process drift and model drift. When a Raman titer prediction and the bench reference diverge, the question is whether the culture moved (a real process excursion the model correctly tracked) or the model went stale (probe fouling, a new media lot — true model drift). You can only tell them apart by charting the model-minus-reference residual at each sampled instant, which needs the bench result anchored to the moment the sample represents, not the moment it was measured — exactly the two-timestamp discipline this schema spends two columns on. The mlops-and-lifecycle chapter builds that residual control chart (and its leading-indicator partner, a Population Stability Index on the inputs) on the offline groundings this capture path produces, and pins each model version to the sha256 of the very dataset these rows form — the model-lineage record that makes "which data trained this?" a hash, not a guess. The honest limit is the one that chapter names: those offline groundings arrive only once or twice a day, so a residual chart catches a drifting model only days after the drift begins — which is why the schema's job is to make every one of those sparse, precious, human-touched results trustworthy.
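
The residual chart can be sketched with the standard library. Everything numeric here is invented for illustration, and the three-sigma limits are an assumed, conventional choice rather than the rule the later chapter necessarily uses. The point of the sketch is the pairing key: prediction and bench reference are matched on sample_time, not on the time the bench value was recorded.

```python
from statistics import mean, stdev


def residuals_at_samples(predicted, reference):
    """Model-minus-reference residual at each sample_time present in both."""
    return {t: predicted[t] - reference[t] for t in sorted(reference) if t in predicted}


def control_limits(baseline_residuals, k=3.0):
    """Return (lower, upper) limits at k standard deviations around the baseline mean."""
    centre = mean(baseline_residuals)
    spread = stdev(baseline_residuals)
    return centre - k * spread, centre + k * spread


def out_of_control(residuals, limits):
    """Map each sample_time to True when its residual falls outside the limits."""
    lower, upper = limits
    return {t: not (lower <= r <= upper) for t, r in residuals.items()}


# Invented baseline residuals (g/L) from a period when the model was trusted.
baseline = [0.02, -0.01, 0.03, -0.02, 0.00, 0.01]
limits = control_limits(baseline)

# Invented titer values (g/L), keyed by sample_time.
predicted = {"2026-01-10T06:00Z": 3.01, "2026-01-10T18:00Z": 3.24,
             "2026-01-11T06:00Z": 3.49, "2026-01-11T18:00Z": 3.75}
reference = {"2026-01-10T06:00Z": 3.00, "2026-01-10T18:00Z": 3.20,
             "2026-01-11T06:00Z": 3.40, "2026-01-11T18:00Z": 3.60}

flags = out_of_control(residuals_at_samples(predicted, reference), limits)
print(list(flags.values()))  # [False, False, True, True]
```

With only two groundings a day, the first flag in this invented series appears a full day after the residual starts to grow, which is the latency the paragraph above describes.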

In the real world

In a running biomanufacturing plant this path is owned by a LIMS or a laboratory execution system, with the instruments often connected through middleware that captures the raw file and the analyst's verification step. The ASTM E1578 guide is the map of that landscape, and it is worth being honest that the open-source corner of it is thin [2]. We can build the ingester and the schema in pure open-source software (OSS) — Python, watchdog (Apache-2.0), pandas (BSD), PostgreSQL — and it works, deterministically, on a laptop. What pure OSS does not hand you is a validated, vendor-accountable instrument-interface layer with built-in second-person review workflows (a second qualified person checks and signs off the first person's entry) and 21 CFR Part 11 (the FDA rule governing electronic records and electronic signatures) electronic signatures out of the box. The OSS LIMS we use later, SENAITE, is a capable teaching system, but its only published Part-11 gap analysis dates to 2019 and lists real gaps (e-signatures, retention, password controls) — which is why this book ships that gap list as an honest limit and pairs SENAITE with a separate signing service rather than claiming compliance. The append-only lab.result table here is the OSS-clean ~80 %: correct, inspectable, in Git. The validated review-and-sign wrapper around it is the GxP last mile, and it is hybrid — an open-source core paired with a proprietary, validated signing layer rather than one or the other.

This is also where a multi-vendor facility feels the pain first. A pilot-scale cGMP line, stitched together from many skids and many bench analyzers, is precisely where offline results from a dozen instruments must all be reconciled back to one batch and one sample time before anyone can reason about the run — the kind of physical setting this off-DCS capture path is built to serve.

Key terms

Seed train — the staged expansion of cells from a thawed vial through progressively larger vessels up to the production bioreactor; much of its data is logged manually.
Offline / at-line / on-line / in-line — the PAT measurement taxonomy by where and how fast a value is obtained: off-line (sample to a separate lab), at-line (sample to a nearby analyzer in minutes), on-line (sensor in a loop), in-line (probe in the broth).
VCD (viable cell density) — count of living cells per mL, a primary offline cell-culture measurement, by manual or automated count.
Metabolite panel — glucose, lactate, glutamine, ammonium, osmolality; the offline analytes that drive feeding and harvest decisions.
Titer — accumulated product (antibody) concentration; the offline result the campaign is judged by and the soft-sensor's training label.
sample_time vs result_ts — the moment a sample was taken versus the moment its result was recorded; stored as separate timestamptz columns because they are different events.
Preliminary / verified result — the two-row pattern that captures a machine-reported value and a human-confirmed value without overwriting the original.
Sample-to-batch genealogy — the ISA-88-rooted linkage (batch_id foreign key) that ties each lab result to the specific batch and lot it characterizes.
CofA (certificate of analysis) — the release-assay row-set (SEC, CEX, HCP, residual Protein A, endotoxin) judged PASS/OOS against specification.
DOE / design space (QbD) — a design of experiments is a structured grid of input combinations run (often on parallel micro-bioreactors) to map the process; the design space is the QbD region of inputs that reliably yields in-spec product. Its native run table is wide-and-short (factor and response columns, one row per run).
.mip — the proprietary binary native file of the BioPAT MODDE DOE package; like the Cedex .txt, it must be exported before the open lab schema can read it — DOE data is not natively CSV.
ALCOA+ (Original, Contemporaneous) — the regulators' data-integrity checklist (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available); the two attributes most at risk for manual and late results are Original (preserve the first capture) and Contemporaneous (record it at the time).
result_id (surrogate key) — the bigint GENERATED ALWAYS AS IDENTITY primary key the engine mints for every lab.result row; GENERATED ALWAYS blocks a human from supplying or overriding it, giving each result a stable handle for the audit trail. The real business key is the wider UNIQUE (sample_id, test_id, result_ts) tuple that lets corrections coexist.
INGEST_TESTS — the file-ingester's column-to-test_id map (VCD_e6_per_mL → VCD, glucose_g_L → Glucose, titer_g_L → Titer) that selects which wide-CSV columns are auto-captured as preliminary results; the rest of the panel waits for an analyst.
RDF triple / SHACL / competency question — the same lab.result row, re-read as subject–predicate–object facts: the batch_id foreign key becomes a walkable fromBatch edge, the status lifecycle becomes a SHACL closed-world gate, and the two-stream join becomes a SPARQL competency question — aligned to Allotrope ASM, IOF/BMIC, and QUDT in the knowledge-graph chapter.
Grouped / leave-one-batch-out split — validating a soft sensor by holding out a whole batch (grouped on batch_id) rather than random rows, because two samples from one culture leak into each other; the only honest score on small, collinear batch data.
Applicability domain — the input envelope a locked model was calibrated on; a live spectrum outside it (caught by a Hotelling T²/SPE gate) must be flagged, with the offline panel as the reference that grounds the check.
Process drift vs. model drift — the two-timestamp gap lets a divergence between a Raman prediction and the bench reference be attributed either to the culture genuinely moving (process) or to the model going stale (probe fouling, a new lot); only the residual-vs-reference chart, anchored to sample_time, separates them.

Where this leads

We have now captured both halves of the upstream truth — the streaming DCS tags and the sparse, human-touched offline panel — and tied them to one batch timeline. But not every signal arrives over a modern protocol or a tidy CSV. The next chapter, Connecting Legacy & Commercial Skids: Modbus, Siemens S7, PLC4X, drops down to the oldest and most stubborn layer of the plant, where data hides behind register maps and proprietary PLC protocols, and shows how to pull it into the same historian and the same batch model with open-source drivers.
