Seed Train & Cell-Culture Offline Analytics

📍 Where we are: Part II · Capturing the Process, Chapter 8. The historian already swallows the bioreactor's live tags; this chapter captures the other half of the truth (the manual seed-train entries and the offline bench results that never touch the DCS) and ties each one to the batch and the moment it represents.

The simple version

The bioreactor's online sensors are like a fitness tracker on your wrist: always on, automatic, streaming. But twice a day a technician also draws a tube of culture, walks it to a bench analyzer, and gets numbers the wristband can never give: how many cells are alive, how much sugar is left, how much antibody has accumulated. Those results arrive as a CSV emailed off an instrument, or typed into a form. The hard part is not measuring them. It is gluing each result back to the right batch and the exact minute the sample was pulled, because a number with no anchor in the batch record is, to a regulator, no number at all.

What this chapter covers

Chapters 5–7 captured everything the distributed control system (DCS) emits over OPC UA: temperature, pH, dissolved oxygen, feed pumps. But a CHO fed-batch run generates a whole second stream of data that the DCS never sees. The seed train, the staged expansion of cells from a thawed vial up to the production bioreactor, is largely logged by hand. And the most decision-critical numbers in cell culture, the ones a process scientist actually steers by, come off bench analyzers: viable cell density, viability, glucose, lactate, titer.

This chapter is about that off-DCS world. We will:

  • generate realistic offline / at-line bench results with the deterministic simulator, drawn from the same underlying culture state as the online trace;
  • land those results in a relational lab schema designed for sample-to-batch genealogy;
  • build a file-watch ingester that picks up an analyzer's CSV drop automatically;
  • and confront the genuinely hard part: reconciling the two timestamps every offline result has (when the sample was taken versus when it was measured), and handling the late, corrected, and amended results that make manual data a data-integrity minefield.

The two artifacts at the heart of this chapter both exist and are tested in the companion repo: the simulator module examples/sim/bioproc_sim/offline_assays.py, and the lab schema in examples/platform/db/30-lab-events.sql. The file-watcher code is shown as a condensed excerpt of the repo's file-ingester service (examples/services/file-ingester/app.py), which is built on the watchdog library; it is labelled where it appears.

Why offline data is a different animal

In the measurement taxonomy the FDA's Process Analytical Technology framework made standard, a sensor wired into the tank is on-line (or in-line if the probe sits in the broth); a sample pulled and run on a nearby analyzer within minutes is at-line; and a sample carried to a separate lab is off-line [1]. The framework's whole point is that every step away from in-line adds delay: the time between when the sample represents the process and when you finally know the answer. That delay is exactly the thing we have to record, because the result is contemporaneous with the measurement, but it is evidence about an earlier moment in the batch.

What lands on those analyzers? In a CHO seed train and production culture, the standing panel is viable cell density (VCD) and viability (historically by manual hemocytometer-and-trypan-blue count, today usually by an automated imaging counter, with method choice itself a validated, integrity-relevant decision [2]) plus the metabolite set every nutrient-feeding strategy is built on: glucose, lactate, glutamine, ammonium, and osmolality [3]. These are laboratory-informatics data: in a mature plant they are captured through a LIMS, ELN, or laboratory execution system rather than the process automation stack, and the ASTM E1578 guide to laboratory informatics is the reference that frames them that way: a different data lineage, different systems, different validation than the DCS tags [4].

That split is the chapter's organizing fact. Two streams, two custody chains, one batch they must both serve.

Generating offline results that agree with the online trace

The simulator does something deliberate and important: it samples the offline panel from the same kinetic state the in-line tags come from, then adds analytical noise and a limit of detection. So a bench VCD agrees with the online cell-growth model; it is just noisier and far sparser. That is not a shortcut; it is the reconciliation problem this chapter exists to solve, baked into the test data.

Here is the core of examples/sim/bioproc_sim/offline_assays.py:

# examples/sim/bioproc_sim/offline_assays.py
import numpy as np
import pandas as pd

# BatchResult, simulate and stream_rng come from elsewhere in the bioproc_sim
# package; their imports are elided in this excerpt.

SAMPLES_PER_DAY = 2


def sample(result: BatchResult | None = None, batch_id: str = "BATCH-2026-001") -> pd.DataFrame:
    """Two offline samples per day from the fed-batch state, with assay noise + LoD."""
    if result is None:
        result = simulate(batch_id)
    s = result.state
    rng = stream_rng("offline_assays", result.batch_id)

    # Build the sampling schedule as minute offsets into the one-minute state trace.
    minutes = []
    day = 0.0
    while day <= 14.0 + 1e-9:
        for frac in (0.25, 0.75):  # ~06:00 and ~18:00
            m = int(round((day + frac) * 1440))
            if m < len(s):  # skip draws that fall past the end of the run
                minutes.append(m)
        day += 1.0
    minutes = sorted(set(minutes))

    rows = []
    for i, m in enumerate(minutes, start=1):
        st = s.iloc[m]  # true culture state at the moment the sample is drawn
        rows.append({
            "sample_id": f"{result.batch_id}-OFF-{i:03d}",
            "batch_id": result.batch_id,
            "sample_time": st["ts"],
            "sample_point": "BR101",
            "VCD_e6_per_mL": max(0.0, round(st.Xv_e6_per_mL * (1 + rng.normal(0, 0.05)), 2)),
            "viability_pct": float(np.clip(round(st.viability_pct + rng.normal(0, 1.2), 1), 0, 100)),
            "glucose_g_L": max(0.0, round(st.glucose_g_L + rng.normal(0, 0.15), 2)),
            "lactate_g_L": max(0.0, round(st.lactate_g_L + rng.normal(0, 0.10), 2)),
            "glutamine_mM": max(0.0, round(st.glutamine_mM + rng.normal(0, 0.10), 2)),
            "ammonia_mM": max(0.0, round(st.ammonia_mM + rng.normal(0, 0.20), 2)),
            "osmolality_mOsm_kg": int(round(st.osmolality_mOsm_kg + rng.normal(0, 4))),
            "titer_g_L": max(0.0, round(st.titer_g_L * (1 + rng.normal(0, 0.04)), 3)),
            "pH_offline": round(float(np.clip(st.pH + rng.normal(0, 0.02), 6.6, 7.4)), 2),
        })
    return pd.DataFrame(rows)

Three details are worth slowing down on, because they encode real lab practice. First, the schedule: two samples a day at frac = 0.25 and 0.75 (roughly 06:00 and 18:00), which is a realistic offline cadence and gives 28 results over a 14-day run, against the ~20,160 one-minute online rows per tag. Offline data is sparse. Second, sample_time is taken from the simulated state row's own timestamp st["ts"], the moment the sample represents, not the moment the function runs. Third, the noise is per-analyte and physically scaled: VCD and titer get a multiplicative 4–5 % error (counting and assay imprecision grows with magnitude), while glucose and lactate get a small additive error. Every concentration is floored at zero so a near-LoD reading never goes negative; viability and pH are clipped to a plausible range instead.
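The schedule arithmetic can be checked in isolation. The sketch below reproduces only the minute-offset loop from sample(), assuming the state trace holds exactly 14 × 1440 one-minute rows (the ~20,160 figure quoted above); sample_minutes and its constants are illustrative names, not part of the repo.

```python
MINUTES_PER_DAY = 1440
RUN_DAYS = 14


def sample_minutes(n_rows: int = RUN_DAYS * MINUTES_PER_DAY) -> list[int]:
    """Minute offsets of the twice-daily draws that fall inside the trace."""
    minutes = []
    for day in range(RUN_DAYS + 1):  # days 0..14, mirroring the while loop in sample()
        for frac in (0.25, 0.75):  # quarter-day and three-quarter-day marks
            m = int(round((day + frac) * MINUTES_PER_DAY))
            if m < n_rows:  # day-14 draws fall past the last row and are dropped
                minutes.append(m)
    return minutes


schedule = sample_minutes()
print(len(schedule), schedule[:2], schedule[-1])  # 28 [360, 1080] 19800
```

Under that assumption, the loop visits 15 candidate days but the two day-14 draws land beyond the end of the trace, which is why the run yields 28 samples rather than 30: one bench result for every 720 online rows per tag.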

The rng = stream_rng("offline_assays", result.batch_id) line is why the whole book is reproducible: every random stream is derived from the master seed (SIM_SEED=2026) plus a per-stream label, so this dataset is byte-identical on every machine and in CI.
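How stream_rng derives its seed is not shown in this chapter, so the following is only a sketch of the general idea under stated assumptions: hash the master seed together with the stream labels and use the digest as the seed. The function name stream_rng_sketch is hypothetical, and it returns a standard-library random.Random where the repo code evidently uses a NumPy-style generator (rng.normal).

```python
import hashlib
import random


def stream_rng_sketch(master_seed: int, *labels: str) -> random.Random:
    """Derive an independent, reproducible random stream from seed + labels."""
    key = "|".join([str(master_seed), *labels]).encode("utf-8")
    # A cryptographic digest is stable across processes and machines,
    # unlike the built-in hash() of a str, which is salted per process.
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed)


a = stream_rng_sketch(2026, "offline_assays", "BATCH-2026-001")
b = stream_rng_sketch(2026, "offline_assays", "BATCH-2026-001")
c = stream_rng_sketch(2026, "offline_assays", "BATCH-2026-002")

draws_a = [a.gauss(0, 0.05) for _ in range(3)]
draws_b = [b.gauss(0, 0.05) for _ in range(3)]
draws_c = [c.gauss(0, 0.05) for _ in range(3)]
print(draws_a == draws_b, draws_a == draws_c)
```

Two streams built from the same seed and labels replay the same noise, while changing only the batch label yields a different stream, so adding a batch never perturbs the numbers already generated for another.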

Run the module directly and it prints its own summary:

$ SIM_SEED=2026 python -m bioproc_sim.offline_assays
offline samples: 28 rows over 14 days
                sample_id  VCD_e6_per_mL  viability_pct  glucose_g_L  titer_g_L
0  BATCH-2026-001-OFF-001           0.34           96.6         6.18      0.002
1  BATCH-2026-001-OFF-002           0.43           96.6         6.26      0.008
2  BATCH-2026-001-OFF-003           0.56           99.0         6.01      0.014
3  BATCH-2026-001-OFF-004           0.72           97.5         5.99      0.022
4  BATCH-2026-001-OFF-005           0.96           96.7         5.69      0.033
release assays: 11 rows; OOS=0

That is the module's printed summary: a row count, a five-row head(), and a one-line release-assay tally. The full per-batch rows are written to the committed golden file examples/datasets/offline_assays.csv by generate.py, which concatenates sample() over every batch (BATCH-2026-001 through -006, 28 rows each, 168 rows in all). Its first rows for the reference batch show the full wide panel, the at-line analyte set in one place:

sample_id,batch_id,sample_time,sample_point,VCD_e6_per_mL,viability_pct,glucose_g_L,lactate_g_L,glutamine_mM,ammonia_mM,osmolality_mOsm_kg,titer_g_L,pH_offline
BATCH-2026-001-OFF-001,BATCH-2026-001,2026-01-05 06:00:00+00:00,BR101,0.34,96.6,6.18,0.13,4.13,0.68,293,0.002,7.06
BATCH-2026-001-OFF-002,BATCH-2026-001,2026-01-05 18:00:00+00:00,BR101,0.43,96.6,6.26,0.19,4.31,0.38,292,0.008,7.04
BATCH-2026-001-OFF-003,BATCH-2026-001,2026-01-06 06:00:00+00:00,BR101,0.56,99.0,6.01,0.32,3.83,0.45,287,0.014,7.05

Read down the VCD_e6_per_mL column (0.34, 0.43, 0.56 million cells/mL) and you are watching the early seed-expansion ramp: a low-density inoculum doubling toward production density. The titer column starts at essentially zero because antibody accumulation lags growth. This is the seed-train story told in numbers.

A lab schema built for genealogy

A spreadsheet of results is worthless to a regulator unless every row is anchored to which batch and which sample. ISA-88/IEC 61512 gives us the procedural-and-physical hierarchy (process cell, unit, batch, lot) that defines what "the right batch" even means, and it is the spine that lets a seed-train or at-line sample attach to the correct genealogy [5]. We modeled that hierarchy in PostgreSQL back in Chapter 3. The lab layer is the part that hangs results off it, and it lives in examples/platform/db/30-lab-events.sql:

-- examples/platform/db/30-lab-events.sql
CREATE TABLE lab.sample (
    sample_id    text PRIMARY KEY,
    batch_id     text REFERENCES s88.batch,
    sample_time  timestamptz NOT NULL,
    sample_point text NOT NULL,
    sample_type  text NOT NULL DEFAULT 'in_process'  -- in_process | release | stability
);

CREATE TABLE lab.test (
    test_id   text PRIMARY KEY,
    name      text NOT NULL,
    unit      text,
    spec_low  numeric,
    spec_high numeric
);

CREATE TABLE lab.result (
    result_id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sample_id     text NOT NULL REFERENCES lab.sample,
    test_id       text REFERENCES lab.test,
    value         numeric,
    text_value    text,
    unit          text,
    result_ts     timestamptz NOT NULL DEFAULT now(),
    analyst       text,
    instrument_id text,
    status        text NOT NULL DEFAULT 'preliminary',  -- preliminary | verified | rejected
    UNIQUE (sample_id, test_id, result_ts)
);
CREATE INDEX ON lab.result (sample_id);

This little schema carries more compliance design than it looks. Three points do the heavy lifting.

Two timestamps, on purpose. lab.sample.sample_time is when the sample was taken: the moment in the batch the value is evidence about. lab.result.result_ts is when the result was recorded. These are different events, sometimes hours apart, and PostgreSQL's timestamp with time zone stores both as absolute UTC instants so the gap between "sampled" and "known" is queryable rather than lost [6]. That gap is the sample-to-insight delay the PAT framework warns about, made into a column.
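To make the gap concrete, here is a minimal sketch that computes the sample-to-result delay for one sample. The three instants are the ones used in this chapter's glucose walk-through for sample OFF-003; the variable names are illustrative.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# When the sample was drawn (lab.sample.sample_time).
sample_time = datetime(2026, 1, 6, 6, 0, tzinfo=UTC)

# When each result row was recorded (lab.result.result_ts).
results = [
    ("preliminary", datetime(2026, 1, 6, 6, 25, tzinfo=UTC)),
    ("verified", datetime(2026, 1, 6, 9, 10, tzinfo=UTC)),
]

# Delay in minutes between "sampled" and "known", per result status.
delays_min = {
    status: (result_ts - sample_time).total_seconds() / 60
    for status, result_ts in results
}
print(delays_min)  # {'preliminary': 25.0, 'verified': 190.0}
```

Because both values are timezone-aware instants, the subtraction is unambiguous; the same arithmetic in SQL is simply result_ts minus sample_time.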

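batch_id REFERENCES s88.batch is the genealogy link as a foreign key: the database refuses to record a sample, and therefore any result hanging off it, that names a batch which does not exist. Sample-to-batch traceability stops being a convention and becomes an invariant the engine enforces. One caveat: as declared, the column is nullable, so a sample with no batch_id at all still passes the check; add NOT NULL if every sample must name a batch.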
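The foreign-key behaviour is easy to see in miniature. The sketch below swaps in the standard library's SQLite for PostgreSQL, so it is an illustration of the principle rather than the book's platform: the tables are cut-down stand-ins without schema prefixes, and try_insert is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE batch (batch_id TEXT PRIMARY KEY)")
conn.execute(
    """CREATE TABLE sample (
           sample_id   TEXT PRIMARY KEY,
           batch_id    TEXT REFERENCES batch(batch_id),  -- nullable, as in lab.sample
           sample_time TEXT NOT NULL)"""
)
conn.execute("INSERT INTO batch VALUES ('BATCH-2026-001')")


def try_insert(sample_id, batch_id):
    """Attempt one sample insert and report whether the engine allowed it."""
    try:
        conn.execute(
            "INSERT INTO sample VALUES (?, ?, ?)",
            (sample_id, batch_id, "2026-01-06T06:00:00+00:00"),
        )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


outcomes = {
    "known batch": try_insert("S-1", "BATCH-2026-001"),
    "unknown batch": try_insert("S-2", "BATCH-9999-999"),
    "NULL batch": try_insert("S-3", None),
}
print(outcomes)
```

The unknown batch is refused by the engine, while the NULL batch slips through, which is the argument for pairing the foreign key with NOT NULL whenever an orphan sample should be impossible.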

status and the UNIQUE (sample_id, test_id, result_ts) constraint are how corrections work without lying. A preliminary result and its later verified value are two rows, not an overwrite, which is exactly what the next section needs.

The matching CofA / release panel comes from the same simulator module. The release_results() function emits one certificate-of-analysis row-set per batch against realistic mAb spec ranges:

# examples/sim/bioproc_sim/offline_assays.py
# realistic mAb release-assay specs: (name, low, high, unit, target, sd)
_RELEASE_SPECS = [
("SEC_monomer_pct", 95.0, 100.0, "%", 98.5, 0.4),
("SEC_HMW_pct", 0.0, 3.0, "%", 1.1, 0.3),
("SEC_LMW_pct", 0.0, 2.0, "%", 0.4, 0.15),
("CEX_main_pct", 60.0, 80.0, "%", 70.0, 2.0),
("CEX_acidic_pct", 10.0, 30.0, "%", 20.0, 1.8),
("CEX_basic_pct", 5.0, 20.0, "%", 10.0, 1.2),
# ... HCP, residual Protein A, host-cell DNA, endotoxin, bioburden follow
]
>>> print(release_results().head(6).to_string(index=False))
      batch_id            test  value unit  spec_low  spec_high result
BATCH-2026-001 SEC_monomer_pct 98.611    %      95.0      100.0   PASS
BATCH-2026-001     SEC_HMW_pct  1.287    %       0.0        3.0   PASS
BATCH-2026-001     SEC_LMW_pct  0.439    %       0.0        2.0   PASS
BATCH-2026-001    CEX_main_pct 70.686    %      60.0       80.0   PASS
BATCH-2026-001  CEX_acidic_pct 21.551    %      10.0       30.0   PASS
BATCH-2026-001   CEX_basic_pct 10.452    %       5.0       20.0   PASS

The result column is just "PASS" if low <= val <= high else "OOS": the same in-spec / out-of-specification logic a real release decision turns on. These rows feed the LIMS/ELN chapter (11), the knowledge graph (16), and the commercial-LIMS bridge (19).
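That one-line rule can be written out as a small function. In this sketch the first two rows reuse values from the release table above, and the third is a hypothetical excursion added only to show the OOS branch; judge is an illustrative name.

```python
def judge(value: float, spec_low: float, spec_high: float) -> str:
    """Inclusive spec check: a value exactly on a limit still passes."""
    return "PASS" if spec_low <= value <= spec_high else "OOS"


rows = [
    ("SEC_monomer_pct", 98.611, 95.0, 100.0),
    ("SEC_HMW_pct", 1.287, 0.0, 3.0),
    ("SEC_HMW_pct (hypothetical excursion)", 3.4, 0.0, 3.0),
]

verdicts = [(name, judge(value, low, high)) for name, value, low, high in rows]
for name, verdict in verdicts:
    print(f"{name}: {verdict}")
```

The chained comparison makes both limits inclusive, so a value sitting exactly on spec_high is in specification; whether limits are inclusive is a choice a real specification has to state explicitly.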

[Figure] A seed-train flask icon and a bench analyzer feed two offline samples per day into a CSV file drop; a watchdog file-watcher picks up the drop, parses two timestamps (sample_time and result_ts), links the sample to its batch via the s88.batch foreign key, and writes preliminary then verified rows into the lab.sample and lab.result tables, which sit beside the online historian trace on one batch timeline.

The off-DCS capture path: bench results arrive as a file drop, are parsed for both their sample and result times, linked to the batch genealogy, and landed as append-only rows that can be corrected without overwriting. Original diagram by the authors, created with AI assistance.

Ingesting the file drop

Real analyzers (Cedex-style cell counters, Nova-style metabolite analyzers) export a CSV or a vendor file to a watched folder. We catch it the moment it lands, using the watchdog library, whose Observer plus a FileSystemEventHandler is the standard Python way to react to filesystem events [7]. Here is the heart of the repo's file-ingester service, examples/services/file-ingester/app.py (the DB-helper and main() boilerplate elided for space):

# examples/services/file-ingester/app.py (excerpt)
import pandas as pd
import psycopg
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

# DSN, upsert_sample() and insert_result() are the elided DB helpers.

# offline CSV column -> (test_id, unit): the preliminary panel the machine captures
INGEST_TESTS = {
    "VCD_e6_per_mL": ("VCD", "e6/mL"),
    "glucose_g_L": ("Glucose", "g/L"),
    "titer_g_L": ("Titer", "g/L"),
}


class OfflineDrop(FileSystemEventHandler):
    def on_created(self, event):
        # React only to newly created CSV files, not directories or other exports.
        if event.is_directory or not event.src_path.endswith(".csv"):
            return
        ingest(event.src_path)


def ingest(path: str) -> int:
    df = pd.read_csv(path, parse_dates=["sample_time"])  # parse, don't guess
    df["sample_time"] = df["sample_time"].dt.tz_convert("UTC")  # one canonical zone
    written = 0
    with psycopg.connect(DSN, autocommit=False) as conn:
        for _, row in df.iterrows():
            upsert_sample(conn, row.sample_id, row.batch_id,
                          row.sample_time.to_pydatetime(), row.sample_point)
            for col, (test_id, unit) in INGEST_TESTS.items():
                insert_result(conn, row.sample_id, test_id, float(row[col]), unit,
                              analyst="SVC_INGEST", status="preliminary")
                written += 1
        conn.commit()  # one transaction per file: all rows land or none do
    # The excerpt declares -> int; returning the result-row count is an assumption.
    return written

The unglamorous lines are the ones that matter. parse_dates=["sample_time"] and tz_convert("UTC") lean on pandas' time-series handling to turn the offset-bearing string the instrument wrote into a single, timezone-aware UTC instant [8], because an analyzer in a Newark suite and a server in UTC must agree on when, or the reconciliation in the next section is built on sand. Note that tz_convert only works on timestamps that already carry an offset, as the golden CSV's +00:00 values do; a file of naive local times would raise an error instead of being silently assumed to be UTC, and would need an explicit localization step first. Every ingested row is written status="preliminary": the machine captures the value, but a human still has to verify it before it counts toward a release decision.
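The same normalization rule can be shown without pandas. The sketch below uses only the standard library and is an illustration of the principle, not the ingester's code: to_utc is a hypothetical helper, and the -05:00 string stands in for what an instrument in a Newark suite might export in January.

```python
from datetime import datetime, timezone


def to_utc(raw: str) -> datetime:
    """Parse an ISO-8601 timestamp and normalize it to UTC; never guess a zone."""
    ts = datetime.fromisoformat(raw)
    if ts.tzinfo is None:
        # A naive local time is ambiguous evidence: refuse rather than assume UTC.
        raise ValueError(f"sample_time {raw!r} has no UTC offset")
    return ts.astimezone(timezone.utc)


local = to_utc("2026-01-06 01:00:00-05:00")   # hypothetical local-time export
golden = to_utc("2026-01-06 06:00:00+00:00")  # as written in the golden CSV

print(local.isoformat(), local == golden)
```

Both strings resolve to the same instant, 06:00 UTC, so the sample joins the online trace at the right minute regardless of which clock wrote it. Rejecting naive strings is a deliberate design choice in this sketch: a loud failure at ingest is easier to defend than a silently shifted sample time.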

The hard part: reconciling timestamps, and handling late and corrected results

Now the genuinely difficult, genuinely regulatory work. Manual entry and late or corrected data are the classic data-integrity risk areas; the FDA's data-integrity guidance is blunt that what protects you is recording the original value, a contemporaneous timestamp, and a documented reason for any change [9]. The MHRA spells out the two ALCOA attributes most at risk here: Contemporaneous (record it at the time the activity is performed) and Original (preserve the first capture, not a transcribed summary) [10]. And the PIC/S data-management guide gives the lifecycle frame: corrections and amendments are normal and expected, but they must be handled so the original remains visible and the change is traceable [11].

Our schema is built to honor all three without a single overwrite. Walk one offline glucose result through its life:

-- 1. The sample is pulled at 06:00; that moment is recorded immediately.
INSERT INTO lab.sample (sample_id, batch_id, sample_time, sample_point)
VALUES ('BATCH-2026-001-OFF-003', 'BATCH-2026-001',
        '2026-01-06 06:00:00+00', 'BR101');

-- 2. The analyzer reports at 06:25; ingester writes a PRELIMINARY value.
INSERT INTO lab.result (sample_id, test_id, value, unit, result_ts, analyst, status)
VALUES ('BATCH-2026-001-OFF-003', 'Glucose', 6.01, 'g/L',
        '2026-01-06 06:25:00+00', 'SVC_INGEST', 'preliminary');

-- 3. The analyst reviews and VERIFIES: a NEW row, not an update.
INSERT INTO lab.result (sample_id, test_id, value, unit, result_ts, analyst, status)
VALUES ('BATCH-2026-001-OFF-003', 'Glucose', 6.01, 'g/L',
        '2026-01-06 09:10:00+00', 'a.kowalski', 'verified');

Both rows survive. The UNIQUE (sample_id, test_id, result_ts) constraint lets them coexist because their result_ts differ, and the original preliminary capture is never destroyed: Original and Contemporaneous, both preserved. A genuine correction (say a transcription error caught the next day) is the same move: a new row carrying the corrected value, a new result_ts, and (in Chapter 20, where we add the reason-for-change audit trail and the tamper-evident hash chain) a recorded reason and signer. This chapter builds the append-only bones; Chapter 20 adds the muscle.
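An append-only table raises an obvious follow-up question: which row is the current value? The chapter does not prescribe a rule, so the sketch below assumes one for illustration: ignore rejected rows, prefer the newest verified row, and fall back to the newest remaining row. The in-memory history mirrors the two glucose rows above; current_result is a hypothetical helper.

```python
from datetime import datetime, timezone

UTC = timezone.utc
SAMPLE = "BATCH-2026-001-OFF-003"

# The two rows written in the walk-through; nothing is ever updated in place.
history = [
    {"sample_id": SAMPLE, "test_id": "Glucose", "value": 6.01,
     "status": "preliminary", "analyst": "SVC_INGEST",
     "result_ts": datetime(2026, 1, 6, 6, 25, tzinfo=UTC)},
    {"sample_id": SAMPLE, "test_id": "Glucose", "value": 6.01,
     "status": "verified", "analyst": "a.kowalski",
     "result_ts": datetime(2026, 1, 6, 9, 10, tzinfo=UTC)},
]


def current_result(rows, sample_id, test_id):
    """Newest verified row if one exists, else the newest non-rejected row."""
    candidates = [
        r for r in rows
        if r["sample_id"] == sample_id
        and r["test_id"] == test_id
        and r["status"] != "rejected"
    ]
    if not candidates:
        return None
    verified = [r for r in candidates if r["status"] == "verified"]
    return max(verified or candidates, key=lambda r: r["result_ts"])


current = current_result(history, SAMPLE, "Glucose")
print(current["status"], current["analyst"], len(history))
```

The selection reads the history without touching it: the preliminary row is still there for an auditor, and a later correction would simply become the newest verified row.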

Reconciling the sample timestamp with the online trace is now a clean join, because the historian carries the same batch_id and the sample carries its sample_time:

-- Online DO at (or just after) the moment OFF-003 was pulled.
SELECT s.sample_id, s.sample_time, r.value AS bench_glucose,
       (SELECT value FROM ts.sensor_reading t
         WHERE t.batch_id = s.batch_id AND t.tag = 'BR101.DO.PV'
           AND t.ts >= s.sample_time
         ORDER BY t.ts LIMIT 1) AS online_do_at_sample
  FROM lab.sample s
  JOIN lab.result r ON r.sample_id = s.sample_id AND r.test_id = 'Glucose'
 WHERE s.sample_id = 'BATCH-2026-001-OFF-003' AND r.status = 'verified';

       sample_id        |      sample_time       | bench_glucose | online_do_at_sample
------------------------+------------------------+---------------+---------------------
 BATCH-2026-001-OFF-003 | 2026-01-06 06:00:00+00 |          6.01 |              39.059

The bench glucose drawn at 06:00 now sits next to the online dissolved-oxygen reading from that same instant. Two custody chains, one timeline, which is the entire point of capturing offline data properly rather than leaving it stranded in a spreadsheet.
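The subquery's "first reading at or after the sample time" lookup has a direct in-memory equivalent. The sketch below uses a binary search over sorted readings; the 39.059 value at 06:00 comes from the query output above, the neighbouring readings are made up for illustration, and the max_lag guard is an added assumption (the SQL has no such bound) so that a historian gap cannot silently pair a sample with a reading from much later.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


def first_at_or_after(readings, when, max_lag=timedelta(minutes=5)):
    """readings: (ts, value) pairs sorted by ts. Return the first pair with
    ts >= when, or None if there is none within max_lag."""
    times = [ts for ts, _ in readings]
    i = bisect_left(times, when)  # index of the first ts >= when
    if i == len(readings):
        return None
    ts, value = readings[i]
    if ts - when > max_lag:  # a gap in the trace: better no match than a wrong one
        return None
    return ts, value


sample_time = datetime(2026, 1, 6, 6, 0, tzinfo=UTC)
online_do = [
    (datetime(2026, 1, 6, 5, 59, tzinfo=UTC), 39.1),   # illustrative value
    (datetime(2026, 1, 6, 6, 0, tzinfo=UTC), 39.059),  # from the query output
    (datetime(2026, 1, 6, 6, 1, tzinfo=UTC), 39.0),    # illustrative value
]
match = first_at_or_after(online_do, sample_time)

# A trace with a gap: the only reading is an hour after the sample.
gap_trace = [(datetime(2026, 1, 6, 7, 0, tzinfo=UTC), 41.2)]
gap_match = first_at_or_after(gap_trace, sample_time)

print(match, gap_match)
```

With a complete one-minute trace the lookup returns the 06:00 reading; with the gapped trace it returns None instead of borrowing a value from 07:00.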

Why it matters

Process scientists steer cell culture on the offline panel, not the online tags. Glucose and lactate decide the feeding strategy; VCD and viability decide harvest timing; titer is the number the whole campaign is judged by. Yet this is the data most likely to be mishandled, because it is the data a human touches. A value typed into the wrong batch, a sample time guessed after the fact, an out-of-spec result quietly overwritten: these are exactly the findings that fill regulatory warning letters. Building the capture path so that the original is preserved, the two timestamps are distinct, and the batch link is a hard foreign key turns the riskiest data in the plant into the most defensible.

It also completes the picture for everything downstream. The contextualization view (Chapter 14) can only join online and offline data to a batch because both now carry batch_id and a trustworthy time. The soft-sensor in Chapter 26, a Raman-to-titer model, needs the offline titer as its training label; a mislinked sample poisons the model. Offline data is not a footnote to the historian. It is the other half.

In the real world

In a running biomanufacturing plant this path is owned by a LIMS or a laboratory execution system, with the instruments often connected through middleware that captures the raw file and the analyst's verification step. The ASTM E1578 guide is the map of that landscape, and it is worth being honest that the open-source corner of it is thin [4]. We can build the ingester and the schema in pure OSS (Python, watchdog (Apache-2.0), pandas (BSD), PostgreSQL) and it works, deterministically, on a laptop. What pure OSS does not hand you is a validated, vendor-accountable instrument-interface layer with built-in second-person review workflows and Part 11 electronic signatures out of the box. The OSS LIMS we use later, SENAITE, is a capable teaching system, but its only published Part-11 gap analysis dates to 2019 and lists real gaps (e-signatures, retention, password controls), which is why this book ships that gap list as an honest limit and pairs SENAITE with a separate signing service rather than claiming compliance. The append-only lab.result table here is the OSS-clean ~80 %: correct, inspectable, in Git. The validated review-and-sign wrapper around it is the GxP last mile, and it is hybrid.

This is also where a multi-vendor facility feels the pain first. NIIMBL, the U.S. public-private National Institute for Innovation in Manufacturing Biopharmaceuticals, is building SABRE, a pilot-scale current Good Manufacturing Practice (cGMP) facility with the University of Delaware that broke ground in April 2024. A line like SABRE's, stitched together from many skids and many bench analyzers, is precisely where offline results from a dozen instruments must all be reconciled back to one batch and one sample time before anyone can reason about the run. SABRE is a facility, not a data standard, but it is the kind of physical setting this off-DCS capture path is built to serve.

Key terms

  • Seed train: the staged expansion of cells from a thawed vial through progressively larger vessels up to the production bioreactor; much of its data is logged manually.
  • Offline / at-line / on-line / in-line: the PAT measurement taxonomy by where and how fast a value is obtained: off-line (sample to a separate lab), at-line (sample to a nearby analyzer in minutes), on-line (sensor in a loop), in-line (probe in the broth).
  • VCD (viable cell density): count of living cells per mL, a primary offline cell-culture measurement, by manual or automated count.
  • Metabolite panel: glucose, lactate, glutamine, ammonium, osmolality; the offline analytes that drive feeding and harvest decisions.
  • Titer: accumulated product (antibody) concentration; the offline result the campaign is judged by and the soft-sensor's training label.
  • sample_time vs result_ts: the moment a sample was taken versus the moment its result was recorded; stored as separate timestamptz columns because they are different events.
  • Preliminary / verified result: the two-row pattern that captures a machine-reported value and a human-confirmed value without overwriting the original.
  • Sample-to-batch genealogy: the ISA-88-rooted linkage (batch_id foreign key) that ties each lab result to the specific batch and lot it characterizes.
  • CofA (certificate of analysis): the release-assay row-set (SEC, CEX, HCP, residual Protein A, endotoxin) judged PASS/OOS against specification.
  • ALCOA+ (Original, Contemporaneous): the data-integrity attributes most at risk for manual and late results: preserve the first capture; record it at the time.

Where this leads

We have now captured both halves of the upstream truth (the streaming DCS tags and the sparse, human-touched offline panel) and tied them to one batch timeline. But not every signal arrives over a modern protocol or a tidy CSV. The next chapter, Connecting Legacy & Commercial Skids: Modbus, Siemens S7, PLC4X, drops down to the oldest and most stubborn layer of the plant, where data hides behind register maps and proprietary PLC protocols, and shows how to pull it into the same historian and the same batch model with open-source drivers.