A Tour of Where Process Data Is Born
Where we are: Having followed one data point through its lifecycle in The Lifecycle of a Data Point, we now zoom out to the whole factory and re-walk the entire monoclonal-antibody process as a chain of places where process data is born.
In the previous chapter we traced a single measurement from the instant a sensor or analyst created it, through capture, contextualization, and archival: the data lifecycle that is the spine of everything in this book. We also learned that raw data without metadata (the context that says what, when, where, and under which conditions a value was recorded) is just noise. Now comes the natural next question: where do all those data points actually come from? To answer it, we take a walking tour of a biopharmaceutical plant, but with a twist. Instead of asking "what does each machine make?", we ask "what data does each machine emit?"
Imagine touring a busy restaurant kitchen, but you can only see the order tickets, not the food. Each station (the grill, the fryer, the plating bench) spits out its own kind of ticket at its own rhythm. The bioprocess is the same. Every piece of equipment is quietly printing a different kind of "ticket," and our job is to read the factory as a stream of tickets rather than a stream of liquid.
What this chapter covers
We will reframe the standard process map as a data map, station by station: upstream cell culture, downstream purification, fill-finish, and quality control. Along the way we introduce two ideas that the rest of Part I builds on: the difference between data measured during the process and data measured afterward, and the notion of batch genealogy, the chaining of every station's data into one coherent story. We close by looking at how continuous manufacturing changes the very shape of the data.
The process map is a data map
A monoclonal antibody (mAb) is a therapeutic protein grown by living cells, the kind of medicine used to treat cancers and autoimmune disease. Making one is a relay race of unit operations: discrete processing steps, each performed by dedicated equipment (a bioreactor, a chromatography column, a filter). The traditional way to draw this process is as a flow of material: cells in, purified drug out.
But there is a second, parallel flow that never appears on the material diagram: data. A foundational principle of modern bioprocessing, formalized by regulators in the FDA's Process Analytical Technology (PAT) framework (guidance that encourages building quality into a process by measuring it as it runs rather than only testing the final product [8]), is that quality should be understood and controlled through measurement at each step [1]. Every unit operation, in other words, is also a data station. It generates a characteristic signature of sensor readings, probe traces, and analyzer outputs that, taken together, describe what happened inside [2].
Adopt that lens and the plant looks different. Here is the same mAb process, drawn as the data each station emits.
Each station emits a different kind of data; the manufacturing line is also a data-generating line.
A bench-scale bioreactor instrumented with probes: an upstream data source generating continuous temperature, pH, and dissolved-oxygen traces. This is a bench/development-scale vessel, shown because it carries the same instrumentation as the much larger production tank described below.
Bench-scale bioreactor. Image by Jonas Schenk, public domain, via Wikimedia Commons.
Upstream: the bioreactor as a time-series fountain
Upstream means cell culture and product generation: everything from growing the cells through harvest, the point at which the culture is ready to be separated from the cells. (Harvest, and the clarification step right after it, is the conventional boundary between upstream and downstream.) Upstream's centerpiece is the bioreactor, the large, instrumented tank where the cells live and secrete product. As a data station, the bioreactor is the noisiest and most continuous one in the whole plant.
Probes inside the tank read temperature, pH (acidity), and dissolved oxygen (the oxygen available for the cells to breathe) every few seconds, producing dense time-series data: a stream of values stamped with the time they were taken. A typical sample might land in a historian as a row like BR101.Temp.PV,2026-06-13T08:00:00Z,37.5,degC (one tag, one timestamp, one value, one unit), and several hundred of those arrive every minute. Layered on top are discrete events: each nutrient feed, each dose of base to correct pH, each sample drawn. And increasingly, spectroscopic probes such as Raman, which shine light into the broth and infer its chemistry from how that light scatters, turn the bioreactor into a multivariate data source in its own right: a single in-line Raman probe has been shown to track glucose, lactate, glutamine, and viable cell density simultaneously and in real time [4], so what used to be four separate lab pulls becomes one continuous stream. The next chapter gives Raman and its cousins their proper introduction; here it is enough to see that the bioreactor alone can emit dozens of correlated channels at once.
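To make the "one tag, one timestamp, one value, one unit" idea concrete, here is a minimal Python sketch that parses the example historian row into a typed record. The comma-separated layout is assumed from that single example; real historians expose their own export formats, and the `Sample` class and `parse_row` function are illustrative names, not part of any product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Sample:
    tag: str             # e.g. "BR101.Temp.PV"
    timestamp: datetime  # timezone-aware, UTC
    value: float
    unit: str


def parse_row(row: str) -> Sample:
    """Parse one 'tag,timestamp,value,unit' line (layout assumed from the example row)."""
    tag, ts, value, unit = row.strip().split(",")
    # fromisoformat() on older Python versions rejects a trailing "Z", so normalise it.
    when = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return Sample(tag, when, float(value), unit)


sample = parse_row("BR101.Temp.PV,2026-06-13T08:00:00Z,37.5,degC")
print(sample.tag, sample.timestamp.isoformat(), sample.value, sample.unit)
```

Keeping the timestamp timezone-aware and the unit alongside the value is the point: a bare 37.5 with no tag, time, or unit is exactly the "noise" the previous chapter warned about.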
This introduces our first big distinction: not all measurements are taken in the same place relative to the process stream. A measurement is in-line when the sensor sits inside the process stream and reads it in place, with no sample removed: the temperature probe immersed in the broth is in-line, and a result arrives continuously. When a technician instead pulls a sample to read it nearby, soon after, the measurement is at-line. And there is a third location, on-line, that sits between in-line and at-line: a small side-stream is automatically diverted out of the process, measured, and often returned, which is why "online" data is not quite the same as "in-line" data. These three (in-line, on-line, and at-line) are the process-analyzer modes the FDA's PAT framework defines [8]. Set against them is off-line measurement (a sample taken away and analyzed later in a separate lab), the conventional approach PAT aims to reduce by moving measurement closer to the process. The very next chapter splits all four locations apart in detail. For now, hold one idea: the bioreactor mixes a flood of continuous in-line (and some on-line) data with a trickle of at-line and off-line lab results that must later be reunited.
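One way to carry this distinction in the data itself is to tag every measurement with its mode. The sketch below is only an illustration of that idea: the enum, the `PAT_MODES` set, and the `needs_reuniting` helper are hypothetical names, and the grouping simply restates the definitions in the paragraph above.

```python
from enum import Enum


class MeasurementMode(Enum):
    IN_LINE = "in-line"    # sensor sits in the process stream, no sample removed
    ON_LINE = "on-line"    # side-stream diverted automatically, measured, often returned
    AT_LINE = "at-line"    # sample pulled by hand and read nearby, soon after
    OFF_LINE = "off-line"  # sample taken away and analyzed later in a separate lab


# The three process-analyzer modes named in the text; off-line is the conventional contrast.
PAT_MODES = {MeasurementMode.IN_LINE, MeasurementMode.ON_LINE, MeasurementMode.AT_LINE}


def needs_reuniting(mode: MeasurementMode) -> bool:
    """At-line and off-line results arrive late and must be matched back to the stream."""
    return mode in {MeasurementMode.AT_LINE, MeasurementMode.OFF_LINE}


print(needs_reuniting(MeasurementMode.OFF_LINE), needs_reuniting(MeasurementMode.IN_LINE))
```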
Reuniting them is itself a data problem, and it is where soft sensors earn their keep. A soft sensor is a model that fuses frequent online signals with sparse at-line or off-line values to estimate a process variable that no single probe measures directly: biomass, growth rate, or glucose uptake, inferred in real time for monitoring and control rather than measured outright [6]. For instance, a soft sensor might infer real-time biomass concentration from frequent Raman spectral readings paired with the occasional manual cell count, letting the control system adjust feeding before an analyst's next lab result arrives hours later. The online stream gives the soft sensor its rhythm; the occasional lab value keeps it honest. This is more than a convenience: it is why soft sensors sit at the heart of the data architecture, because they are how a plant turns a fast-but-indirect signal and a slow-but-trustworthy one into a single number a controller can act on.
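The fusion described above can be sketched in a few lines of Python. This is a deliberately tiny stand-in, not a production soft sensor: it assumes a single online proxy signal that relates linearly to biomass, whereas real soft sensors generally use richer multivariate models. All numbers and names here are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical paired data: an online proxy signal (arbitrary units) read at the
# moments an analyst also took a manual cell count (10^6 cells/mL).
proxy_at_sampling = [1.0, 2.0, 3.0, 4.0]
manual_counts = [2.1, 3.9, 6.0, 8.0]

# The sparse, trustworthy lab values calibrate the model ...
slope, intercept = fit_line(proxy_at_sampling, manual_counts)


def estimate_biomass(proxy_value: float) -> float:
    """Soft-sensor estimate between lab samples, from the online signal alone."""
    return slope * proxy_value + intercept


# ... and the frequent online stream then yields an estimate at every reading.
online_stream = [4.2, 4.5, 4.9]
print([round(estimate_biomass(v), 2) for v in online_stream])
```

Each new manual count would be appended to the calibration pairs and the line refitted, which is the "keeps it honest" step in miniature.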
Downstream: clarification, traces, and phase events
Downstream is purification: separating the antibody from everything else the cells produced. It opens with clarification and capture: the harvested broth is run through a centrifuge and depth filters to remove cells and debris before the first column. Even this housekeeping step is a data station. It emits turbidity (a measure of how cloudy the stream still is), filter pressure, and flow readings, and a rising pressure here signals a filter loading up just as it does anywhere else.
Downstream's workhorse, though, is chromatography, in which the protein mixture is pumped through a column packed with resin that grabs some molecules and lets others flow past. The skids that run it (process-scale systems such as Cytiva's ÄKTA process line or Sartorius's Resolute) are not just pumps and valves; as a data station, a chromatography skid is a chorus of continuous traces plus sharp phase events. Detectors at the column outlet continuously record UV absorbance at 280 nm (a proxy for how much protein is passing; it might sit near zero during loading and spike past 2.5 absorbance units at the peak of elution), conductivity (a proxy for salt concentration), and pH [5]. Overlaid are the named phases (load, wash, elute, regenerate), each a timestamped event marking a deliberate change in conditions. The decisive moment is the pooling decision: when to start and stop collecting the purified product. Conventionally this is judged from the standard UV trace; more advanced approaches measure the eluate's composition directly. In one large-scale demonstration, an on-line HPLC analyzer made the real-time pooling call from the data itself [5], though that remains a demonstrated technique rather than routine practice.
A process chromatography skid. Each purification step emits UV, conductivity, and pH traces plus discrete phase-transition events. (The unit shown is an earlier Amersham Pharmacia Biotech system; the ÄKTA and Resolute lines are named only as current examples.)
Image by Kitmondo Lab, CC BY 2.0 (https://creativecommons.org/licenses/by/2.0/), via Wikimedia Commons (File: Amersham Pharmacia Biotech chromotography skid.jpg).
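The conventional UV-based pooling call can be sketched as a simple threshold rule in Python. The thresholds, the trace values, and the `pool_window` function are all hypothetical; real pooling criteria are process-specific and set during development.

```python
def pool_window(times_min, uv_au, start_au=0.2, stop_au=0.2):
    """Return (start, stop) times bracketing the elution peak, or None if no peak.

    Collection starts at the first reading at or above start_au and stops at the
    first later reading that falls below stop_au. Both thresholds are illustrative.
    """
    start = None
    for t, a in zip(times_min, uv_au):
        if start is None:
            if a >= start_au:
                start = t
        elif a < stop_au:
            return start, t
    # Peak never came back down within the trace: pool runs to the last reading.
    return (start, times_min[-1]) if start is not None else None


# Hypothetical elution trace: near zero, a peak past 2.5 AU, then back to baseline.
times = [0, 1, 2, 3, 4, 5, 6, 7]
a280 = [0.02, 0.05, 0.60, 2.10, 2.60, 1.20, 0.15, 0.03]

print(pool_window(times, a280))  # → (2, 6)
```

The two returned times are themselves phase events: stored with the batch identity, they record exactly which slice of the trace became product.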
Sandwiched between chromatography steps is tangential flow filtration (TFF): pushing liquid across a membrane to concentrate the product or swap its buffer. Its data signature is simpler but just as telling: flux (how fast liquid crosses the membrane) and transmembrane pressure (the pressure difference driving it across). A creeping pressure rise is the membrane's way of reporting that it is fouling.
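A short Python sketch shows how that signature might be computed and watched. It assumes the commonly used definition of transmembrane pressure as the average of feed and retentate pressure minus permeate pressure, which the text above does not spell out; the readings, the rise threshold, and the function names are hypothetical.

```python
def transmembrane_pressure(p_feed, p_retentate, p_permeate):
    """TMP as the mean feed-side pressure minus permeate pressure (same units throughout)."""
    return (p_feed + p_retentate) / 2 - p_permeate


def is_creeping_up(tmp_series, min_rise):
    """Flag a sustained rise: the last reading exceeds the first by at least
    min_rise and no reading drops below its predecessor."""
    rising = all(b >= a for a, b in zip(tmp_series, tmp_series[1:]))
    return rising and (tmp_series[-1] - tmp_series[0]) >= min_rise


# Hypothetical readings in bar: (feed, retentate, permeate) at successive times.
readings = [(1.50, 0.50, 0.10), (1.60, 0.50, 0.10), (1.80, 0.60, 0.10), (2.00, 0.60, 0.10)]
tmp = [transmembrane_pressure(*r) for r in readings]

print([round(v, 2) for v in tmp], is_creeping_up(tmp, min_rise=0.2))
```

A real monitor would smooth the signal and account for deliberate flow changes before raising a fouling alarm; the monotonic check here is only the simplest possible version of "creeping".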
Notice the pattern. Each station has a shape of data. The bioreactor's is a long dense time-series; chromatography's is a bundle of traces punctuated by events; TFF's is a pair of pressure-and-flow signals. Recognizing these shapes is half the battle of managing the data, because each shape wants to be stored and analyzed differently.
Fill-finish and QC: the last drops and the final word
Fill-finish is where bulk drug becomes the vials or syringes a patient receives. Its data is a blend: a fill-weight for every container (a continuous stream of scalar measurements), machine-vision images checking each unit for defects, and environmental monitoring (particle counts and microbial samples proving the room stayed clean).
Finally comes quality control (QC), the analytical lab that confirms the batch is what it claims to be. QC is the archetype of offline data. Instruments such as HPLC (high-performance liquid chromatography) produce chromatograms (curves) and tables of results hours or days after the sample was drawn, using methods validated to the standard set out in ICH Q2(R2) (the international guideline for validating analytical procedures). These are direct analytical measurements, and they carry the regulator-grade weight that releases the batch, which is exactly why they must be linked back, unambiguously, to the batch and the station they describe.
Batch genealogy: stitching the stations into one story
Each station tells a true but partial story. The medicine is the batch (one defined quantity of product), and to judge it you must be able to chain together every station's data into a single, traceable narrative: which bioreactor run, fed by which media lot, fed which downstream columns, filled on which line, and tested by which QC results. That chain is batch genealogy (or lineage). This is not optional bookkeeping: a complete batch record is mandated by 21 CFR Part 211 Subpart J in the United States and by EU Annex 11 for the computerized systems that hold it, and regulators treat this end-to-end traceability as a pillar of the control strategy, the documented plan of controls that assures quality [7].
In practice the linking is not a metaphor; it is a shared key written into every record. The same batch identity (B-26-0847 below) is stamped onto the bioreactor stream, the chromatography pool, and the QC result, so a query for that one value can reassemble the whole story:
{
  "batch_id": "B-26-0847",
  "station": "BR101",
  "timestamp": "2026-06-13T08:00:00Z",
  "temperature_degC": 37.5,
  "pH": 7.0,
  "DO_pct_sat": 50,
  "operator": "A. Okafor"
}
What batch genealogy looks like in data: every step stamped with the same batch identity (B-26-0847).
Original diagram by the authors, created with AI assistance.
In a real plant this linking is the job of dedicated software: a data historian capturing the time-series and a manufacturing execution system (MES) such as Körber Werum PAS-X, Rockwell FactoryTalk PharmaSuite, or DELMIA Apriso recording which lot, line, and operator belongs to which batch. The way those records are structured is itself standardized: the ISA-88 batch-control standard defines how a batch and its procedural steps are described, and ISA-95 governs how that shop-floor information links up to the business systems above it. We return to both standards when we look at automation and plant architecture in detail.
Without genealogy, a deviation in QC is an orphan: you cannot trace it back to the feed that was late or the column that fouled. Linking is therefore the central act of data management, and it is exactly why metadata matters: every data point needs to carry enough context to say which batch and which station it belongs to.
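The "shared key" idea can be sketched in Python as a lookup that gathers every record carrying one batch identity and puts them in time order. Only the BR101 record mirrors the example in the text; the other stations, field names, and values are hypothetical, and a real plant would run this query against a historian and MES rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical records from several stations, arriving in no particular order.
records = [
    {"batch_id": "B-26-0847", "station": "BR101", "timestamp": "2026-06-13T08:00:00Z", "pH": 7.0},
    {"batch_id": "B-26-0847", "station": "CHROM1", "timestamp": "2026-06-27T10:15:00Z", "event": "elute"},
    {"batch_id": "B-26-0848", "station": "BR102", "timestamp": "2026-06-14T09:00:00Z", "pH": 6.9},
    {"batch_id": "B-26-0847", "station": "QC", "timestamp": "2026-07-02T14:30:00Z", "result": "pass"},
]


def build_genealogy(rows):
    """Index records by batch_id, each batch sorted into time order."""
    by_batch = defaultdict(list)
    for row in rows:
        by_batch[row["batch_id"]].append(row)
    for batch_rows in by_batch.values():
        # ISO 8601 UTC strings in one consistent format sort chronologically as text.
        batch_rows.sort(key=lambda r: r["timestamp"])
    return dict(by_batch)


genealogy = build_genealogy(records)
story = [r["station"] for r in genealogy["B-26-0847"]]
print(story)  # → ['BR101', 'CHROM1', 'QC']
```

A record that arrives without a `batch_id` cannot be placed in any story at all, which is the orphan problem in its simplest form.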
Why it matters
Reframing the plant as data stations is not a metaphor for its own sake: it dictates how the data must be handled. Different shapes demand different stores: a high-frequency time-series, a bundle of chromatography traces, and a table of QC results each have distinct volume, structure, and timing. Online and offline data arrive at wildly different frequencies and latencies and must be reconciled into a unified timeline. And the whole point of collecting any of it (proving the batch is safe and effective, and improving the process) collapses unless every station's output is linked into one genealogy. Get the data architecture right and the factory becomes legible; get it wrong and you are left with a pile of disconnected, untraceable numbers.
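Reconciling fast and slow data onto one timeline is often done with an "as-of" lookup: for each sparse lab result, find the most recent online reading at or before the moment the sample was drawn. The Python sketch below uses only the standard library; the signal names, values, and the choice to align on draw time rather than result-arrival time are assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical data, times in minutes from inoculation.
online_times = [0, 1, 2, 3, 4, 5]                  # frequent in-line pH readings
online_ph = [7.00, 7.01, 6.99, 6.98, 6.97, 6.96]
offline = [(2.5, "glucose_g_per_L", 4.1),          # sparse lab results, stamped with
           (4.0, "glucose_g_per_L", 3.6)]          # the time the sample was drawn


def latest_online_at(t):
    """Most recent online reading taken at or before time t (an 'as-of' lookup)."""
    i = bisect_right(online_times, t) - 1
    if i < 0:
        return None  # no online reading exists yet at time t
    return online_times[i], online_ph[i]


timeline = []
for drawn_at, name, value in offline:
    timeline.append({"t": drawn_at, name: value, "pH_asof": latest_online_at(drawn_at)})

for row in timeline:
    print(row)
```

Aligning on the draw time matters because a lab result that arrives hours later still describes the process as it was when the sample left the tank.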
In the real world
This is precisely the territory that the U.S. NIIMBL institute investigates, and that the pilot-scale SABRE Center it is building at the University of Delaware is designed to demonstrate and teach. As industry shifts from running one big tank in fed-batch mode toward integrated continuous bioprocessing (perfusion bioreactors feeding continuous downstream steps that run at steady state with minimal hold volume [3]), the data itself changes shape. Continuous operation replaces tidy, discrete batches with streams monitored continuously at steady state [3], and the FDA has explicitly contrasted this with batch production, noting how it reshapes traceability and the control strategy [7]. When there is no single harvest event, "the batch" must be defined by a slice of time, a change we return to in the chapter on continuous and intensified processing.
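Defining a batch as a slice of time can be sketched as a window filter over a continuous stream. In this Python example the 24-hour window length, the readings, and the `slice_lot` function are hypothetical; how a lot is actually delimited in continuous manufacturing is a matter for the control strategy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical continuous stream: one reading every 6 hours over two days.
t0 = datetime(2026, 6, 13, tzinfo=timezone.utc)
stream = [(t0 + timedelta(hours=6 * i), 37.0 + 0.1 * i) for i in range(8)]


def slice_lot(readings, start, end):
    """Readings whose timestamp lies in the half-open window [start, end)."""
    return [(ts, v) for ts, v in readings if start <= ts < end]


# Two back-to-back 24-hour "batches" cut from the same stream.
lot_a = slice_lot(stream, t0, t0 + timedelta(days=1))
lot_b = slice_lot(stream, t0 + timedelta(days=1), t0 + timedelta(days=2))

print(len(lot_a), len(lot_b))  # → 4 4
```

The half-open window is a deliberate choice: a reading taken exactly on the boundary belongs to one lot only, so adjacent lots never share or drop a data point.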
Key terms
- Unit operation: a discrete processing step performed by dedicated equipment; here, also a data station.
- Monoclonal antibody (mAb): a therapeutic protein grown by living cells.
- Time-series data: a stream of values each stamped with the time it was recorded.
- In-line: measured by a sensor inside the process stream, with no sample removed, giving continuous data.
- On-line: a side-stream is automatically diverted out of the process, measured, and often returned, giving near-continuous data.
- At-line / off-line: measured on a removed sample, nearby and soon (at-line) or later in a separate lab (off-line).
- Clarification and capture: removing cells and debris from harvested broth before purification; emits turbidity, pressure, and flow.
- Chromatography: purification by passing the mixture through a resin-packed column; emits UV, conductivity, and pH traces.
- Pooling decision: choosing when to start and stop collecting purified product from a column.
- Tangential flow filtration (TFF): filtration across a membrane to concentrate or buffer-exchange; emits flux and transmembrane pressure.
- Fill-finish: filling bulk drug into vials or syringes; emits fill-weights, vision images, and environmental data.
- Quality control (QC): the analytical lab that confirms batch identity and quality; the archetype of offline data.
- Soft sensor: a model that fuses online signals with sparse at-line or offline values to estimate a process variable no probe measures directly.
- Batch: one defined quantity of product.
- Batch genealogy (lineage): the chained, traceable record linking every station's data for one batch.
- Integrated continuous bioprocessing: connected unit operations running at steady state, producing data streams rather than discrete batches.
Where this leads
We have toured the plant as a chain of data-generating stations and seen that the primary sources of all this data are physical things: probes, detectors, cameras, and lab analyzers. The next chapter, Instruments and Sensors as Data Sources, zooms all the way in on those instruments: it classifies them by where they measure (in-line, on-line, at-line, off-line), gives Process Analytical Technology and spectroscopic sensors such as Raman and NIR a proper introduction, and shows how each instrument produces a characteristic data shape (scalar, spectral, chromatographic, or image) that the rest of the data stack must be built to receive.