Instruments and Sensors as Data Sources
Where we are: Part II opens here. Having toured where process data is born, we now meet the physical instruments that actually create it, the first true sources in our data supply chain.
In the last chapter, A Tour of Where Process Data Is Born, we walked the whole journey of making a monoclonal antibody (a Y-shaped protein drug grown by living cells) and reframed every station as a place that emits data. But we treated those data as if they simply appeared. They do not. Every number, spectrum, and chromatogram begins life inside a physical device: a probe dipped in a tank, a laser shining into a vessel, an instrument humming in a quality lab. This chapter is about those devices, the original sources from which all manufacturing data flows.
Think of a modern car. Some gauges read continuously while you drive (speed, fuel, engine temperature), never leaving the dashboard. Others require a trip to the garage, where a mechanic plugs in equipment to run a deeper diagnostic you cannot do at speed. A bioreactor is the same: some instruments watch the living process every second without ever removing a drop, while others demand that a sample be pulled out and carried to a specialized machine. The trick is knowing which question each instrument can answer, and what shape of data it hands back.
What this chapter covers
We will first learn to classify instruments by where they measure relative to the process. Then we will meet the two great families of probes, simple single-number sensors and rich spectroscopic ones, and the framework, called PAT, that turns their signals into control. Finally we will visit the off-line analytical instruments of the quality lab and see why each instrument produces a characteristic data shape that downstream systems must be built to hold.
Where an instrument measures: the four locations
Engineers classify every measurement by its physical relationship to the process stream. The regulator's foundational guidance on the subject, the U.S. FDA's 2004 framework for Process Analytical Technology, formalized this vocabulary, and a widely cited review by Rathore and colleagues laid it out cleanly for biopharmaceuticals [1][2]. There are four locations:
- In-line: the sensor sits inside the process stream and measures in place, with no sample removed. A pH probe immersed in the bioreactor liquid is in-line. Nothing is taken out; the measurement happens where the liquid lives.
- On-line: a small side-stream is automatically diverted out of the process, measured, and (often) returned. The product never leaves the closed, sterile system, but the measurement happens just beside the main flow rather than within it. (An off-gas analyzer is loosely grouped here too, though it measures the exhaust gas leaving the vessel headspace rather than a returned liquid side-stream.)
- At-line: a sample is physically removed from the process and measured nearby, within seconds to minutes, usually right next to the equipment. The sample leaves the stream but not the room.
- Off-line: a sample is removed and carried away to a separate laboratory, where it may be measured hours or days later on large, specialized instruments.
The difference is not academic. It governs how fresh the data is, how automatically it arrives, and whether a result can steer the process while it is still running or only judge it afterward.
A measurement-location taxonomy: the closer to the stream, the faster and more continuous the data; the farther away, the more definitive but delayed.
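The four locations can be captured as a small lookup that later systems might attach to every data source. The sketch below is illustrative only: the enum, the helper functions, and the instrument assignments are hypothetical names of our own, and the rule that only in-line and on-line measurements are candidates for closed-loop control is a simplifying assumption rather than a regulatory definition.

```python
from enum import Enum


class MeasurementLocation(Enum):
    """The four locations from the text, ordered from inside the stream outward."""
    IN_LINE = "in-line"    # sensor sits in the stream; no sample removed
    ON_LINE = "on-line"    # automated side-stream; stays in the closed system
    AT_LINE = "at-line"    # sample removed, measured near the equipment
    OFF_LINE = "off-line"  # sample carried to a separate laboratory


# Assumption for illustration: only measurements that arrive automatically,
# without a person pulling a sample, are treated as real-time control candidates.
AUTOMATIC = {MeasurementLocation.IN_LINE, MeasurementLocation.ON_LINE}


def sample_is_removed(location: MeasurementLocation) -> bool:
    """True when a sample is physically taken out of the process."""
    return location in {MeasurementLocation.AT_LINE, MeasurementLocation.OFF_LINE}


def can_steer_running_process(location: MeasurementLocation) -> bool:
    """Rough rule of thumb, not a regulatory definition."""
    return location in AUTOMATIC


# Hypothetical instrument assignments, following the examples in the text.
instruments = {
    "pH probe": MeasurementLocation.IN_LINE,
    "off-gas analyzer": MeasurementLocation.ON_LINE,
    "QC lab HPLC": MeasurementLocation.OFF_LINE,
}

for name, location in instruments.items():
    print(f"{name}: {location.value}, "
          f"real-time control candidate: {can_steer_running_process(location)}")
```

Recording the location as explicit metadata, rather than leaving it implicit in an instrument's name, is one way to let downstream systems reason about data freshness.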
Two families of probes: single-number and spectroscopic
Within the in-line and on-line world, probes come in two broad kinds, and the distinction matters enormously for data.
Univariate probes return one number at a time. ("Univariate" simply means one variable.) The classic quartet measures temperature, pH (how acidic the liquid is), dissolved oxygen or DO (how much oxygen the cells have to breathe), and dissolved carbon dioxide. A capacitance probe (such as the Aber Instruments Futura line, a long-standing example in cell-culture suites) adds a fifth: by sensing how the liquid responds to a tiny electric field, it estimates viable cell density / biomass (roughly, how many living cells are present) without anyone counting them. Each of these probes produces a steady stream of single readings over time: a scalar time-series, the simplest data shape there is [2].
In practice that time-series lands in a historian or a flat file as plain rows. A handful of seconds from one bioreactor might look like this, with a timestamp and then one column per probe:
timestamp,pH_PV,DO_percent,temperature_C
2026-06-13T09:00:00Z,7.02,48.5,36.99
2026-06-13T09:00:05Z,7.01,48.2,37.00
2026-06-13T09:00:10Z,7.02,47.9,37.01
2026-06-13T09:00:15Z,7.00,47.6,37.00
The columns carry tag-style names (pH_PV is the process variable, that is, the measured value, of the pH loop), and the same reading might be addressed elsewhere as a tag like BR101.pH.PV. We meet those tag conventions in the automation chapter.
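To make the scalar time-series shape concrete, here is a minimal sketch that reads the four sample rows above using only the Python standard library. The function name `parse_rows` is our own, and a real historian export would need more care (missing values, quality flags, time zones).

```python
import csv
import io
from datetime import datetime
from statistics import mean

# The four sample rows from the text, embedded so the sketch is self-contained.
RAW = """timestamp,pH_PV,DO_percent,temperature_C
2026-06-13T09:00:00Z,7.02,48.5,36.99
2026-06-13T09:00:05Z,7.01,48.2,37.00
2026-06-13T09:00:10Z,7.02,47.9,37.01
2026-06-13T09:00:15Z,7.00,47.6,37.00
"""


def parse_rows(text):
    """Turn CSV text into a list of dicts with typed values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            # Swap the trailing "Z" for an explicit offset so that older
            # Python versions can parse the timestamp.
            "timestamp": datetime.fromisoformat(
                record["timestamp"].replace("Z", "+00:00")),
            "pH_PV": float(record["pH_PV"]),
            "DO_percent": float(record["DO_percent"]),
            "temperature_C": float(record["temperature_C"]),
        })
    return rows


rows = parse_rows(RAW)
interval = (rows[1]["timestamp"] - rows[0]["timestamp"]).total_seconds()
mean_do = mean(row["DO_percent"] for row in rows)

print(f"{len(rows)} readings, one every {interval:.0f} s")
print(f"mean dissolved oxygen: {mean_do:.2f} %")
```

Each row is one timestamp plus one number per probe, which is why this shape fits so comfortably in a historian or a plain table.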
Schematic of an instrumented bioreactor: probes for temperature, pH, dissolved oxygen, and more feed continuous signals to the control and data systems.
Industrial bioreactor schematic. Image by NIST, public domain, via Wikimedia Commons.
Spectroscopic probes are far richer. Instead of one number, they shine light into the liquid and record how the sample absorbs or scatters it across hundreds or thousands of wavelengths at once: a spectrum. Because the spectrum reflects many chemical species simultaneously, these are multivariate ("many variables") instruments. A review by Lourenço and colleagues surveys the main types used in bioreactors: UV-Vis (ultraviolet and visible light), near-infrared (NIR) light, fluorescence, and Raman spectroscopy [3]. A full Raman, NIR, or fluorescence scan hands back the same general data shape: a long vector of intensity values, one number per wavelength. UV-Vis is the exception worth flagging: a full-spectrum UV-Vis instrument is multivariate, but many in-line UV-Vis probes report at just one or a few fixed wavelengths, such as an optical-density biomass reading at 600 nm (OD600) or a protein concentration at 280 nm (A280), and are effectively univariate scalars despite using light.
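The single-wavelength case shows why such a probe behaves like a scalar sensor: one absorbance reading maps to one concentration through the Beer-Lambert law, c = A / (ε · l). The sketch below is a hedged illustration; the function name is ours, and the extinction coefficient and path length are placeholder values (the coefficient is of the magnitude often quoted for antibodies, but the real value is molecule-specific and must come from the product's own characterization).

```python
def concentration_from_a280(absorbance, extinction_ml_per_mg_cm, path_length_cm):
    """Beer-Lambert law rearranged: c = A / (epsilon * l), giving mg/mL."""
    if extinction_ml_per_mg_cm <= 0 or path_length_cm <= 0:
        raise ValueError("extinction coefficient and path length must be positive")
    return absorbance / (extinction_ml_per_mg_cm * path_length_cm)


# Placeholder inputs, assumed for illustration only.
A280_READING = 0.70     # absorbance units reported by the probe
EXTINCTION = 1.4        # mL / (mg * cm), hypothetical value for this molecule
PATH_LENGTH_CM = 1.0    # optical path length of the flow cell

protein_mg_per_ml = concentration_from_a280(A280_READING, EXTINCTION, PATH_LENGTH_CM)
print(f"estimated protein concentration: {protein_mg_per_ml:.2f} mg/mL")
```

One number in, one number out: no multivariate model is needed, which is the practical difference from a full spectrum.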
Raman is the spectroscopy a bioprocess engineer reaches for most. It measures the faint, characteristic way molecules scatter laser light. Water is a weak Raman scatterer, so aqueous samples give clean spectra (an advantage over infrared methods, where water absorbs strongly and swamps the signal), at the cost of a fluorescence background that can interfere in biological matrices. Commercial in-line Raman systems built for bioreactors are familiar fixtures by now: the Kaiser Optical Systems RamanRxn probes (now part of Endress+Hauser) and the Tornado Spectral Systems analyzers (now a Bruker company) are two examples a plant engineer would recognize. In a landmark 2011 study, Abu-Absi and colleagues showed that a single in-line Raman probe could track glucose (the cells' food), lactate (a waste product), glutamine, and viable cell density / biomass (the same quantity the capacitance probe estimates) simultaneously inside a mammalian-cell bioreactor, in real time [4]. Each of those four outputs (glucose, lactate, glutamine, and biomass) required its own multivariate calibration model, built from training data; one Raman spectrum does not "contain" four readings until four separate, validated models extract them (a point we return to under PAT). Authoritative reviews have since established Raman as a mainstream tool across pharmaceutical manufacturing and bioprocessing, from early development to commercial control [5][6]. Broader surveys confirm the same pattern for the whole spectroscopic family, upstream and downstream alike [7].
PAT: not just sensors, but a framework
A Raman spectrum is not a glucose concentration. It is thousands of intensity numbers. Turning that raw spectrum into "the glucose is 4.2 grams per liter" requires a chemometric model: a mathematical recipe, built and calibrated in advance, that maps the spectral fingerprint to a concentration [3][7]. The instrument senses; the model interprets. Building that recipe is itself a software task: practitioners typically fit such models in dedicated chemometrics packages (Eigenvector Research's PLS_Toolbox, Sartorius's SIMCA, or general statistical tools like JMP Pro), usually with a partial-least-squares regression linking spectra to known reference values.
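To show the idea of "spectrum in, concentration out" without any commercial package, here is a deliberately tiny sketch of partial-least-squares regression with a single latent component, written in plain Python. Everything in it is an assumption for illustration: the five-channel "spectra" are synthetic and noise-free, the function names are ours, and a real chemometric model would use many more channels, several components, spectral preprocessing, and formal validation.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def fit_pls1_one_component(spectra, concentrations):
    """Fit a one-component PLS1 model on mean-centered data."""
    n = len(spectra)
    x_mean = [sum(col) / n for col in zip(*spectra)]
    y_mean = sum(concentrations) / n
    xc = [[v - m for v, m in zip(row, x_mean)] for row in spectra]
    yc = [y - y_mean for y in concentrations]
    # Weight vector: the spectral direction that covaries most with y.
    w = [dot(col, yc) for col in zip(*xc)]
    norm = dot(w, w) ** 0.5
    w = [v / norm for v in w]
    # Scores: each centered spectrum projected onto the weight vector.
    t = [dot(row, w) for row in xc]
    # Regression of concentration on the scores.
    q = dot(t, yc) / dot(t, t)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "q": q}


def predict(model, spectrum):
    """Map one spectrum (a vector) to one concentration (a scalar)."""
    centered = [v - m for v, m in zip(spectrum, model["x_mean"])]
    return model["y_mean"] + dot(centered, model["w"]) * model["q"]


# Synthetic data: a fixed baseline plus concentration times a pure-component profile.
PURE_PROFILE = [0.0, 0.2, 1.0, 0.3, 0.0]
BASELINE = [0.5, 0.5, 0.6, 0.5, 0.5]


def synthetic_spectrum(concentration):
    return [b + concentration * p for b, p in zip(BASELINE, PURE_PROFILE)]


training_conc = [1.0, 2.0, 3.0, 5.0, 6.0]   # hypothetical reference values, g/L
model = fit_pls1_one_component(
    [synthetic_spectrum(c) for c in training_conc], training_conc)

estimate = predict(model, synthetic_spectrum(4.2))
print(f"predicted concentration: {estimate:.3f} g/L")
```

Because the synthetic data are perfectly linear, one component recovers the concentration almost exactly; real spectra overlap and drift, which is why the calibration and validation work described in the text matters.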
This pairing of measurement with model is the heart of Process Analytical Technology (PAT). The FDA's 2004 guidance defines PAT deliberately broadly: not a list of gadgets, but a system for designing, analyzing, and controlling manufacturing through timely measurements of critical quality and performance attributes, the goal being to build quality in rather than test for it at the end [1]. In that view, a sensor only becomes "PAT" when its data is actually used to understand and steer the process.
A chemometric model that converts a spectrum into a concentration is the seed of what later chapters call a soft sensor: a "virtual instrument" that reports a value no physical probe measured directly, inferred instead from other signals. We meet soft sensors properly in the chapter on machine learning and soft sensors; for now, just notice that the data shape they consume is the spectrum produced here.
Because these model-based methods are now used to make quality decisions, regulators treat them as governed analytical procedures. The international guideline ICH Q14, adopted in 2023, sets out a science- and risk-based way to develop such procedures (including multivariate, spectroscopic methods like NIR) with proper calibration and ongoing lifecycle monitoring [8]. The same expectation appears in binding regulation on both sides of the Atlantic: in the United States, 21 CFR 211.160 and 211.165 require that test methods be validated and equipment be qualified before results support release, while in the European Union, Annex 11 (the computerised-systems annex to EU GMP) calls in its Section 4 for validation and qualification and in its Section 1 for risk management across the system's lifecycle. The pharmaceutical industry's practical playbook for getting there, GAMP 5, lays out the validation lifecycle phases that a chemometric model or analytical instrument is typically taken through. An instrument, in other words, is not a neutral fact-machine; it is a validated, version-controlled data source.
The analytical instruments of the quality lab
Not all questions can be answered by a probe in a tank. The most definitive measurements of product quality (is this truly the right antibody, and how pure is it?) come from large off-line instruments in the quality control (QC) laboratory.
A laboratory centrifuge in the QC lab: an off-line workhorse that prepares and clarifies samples before the analytical instruments (HPLC, LC-MS, CE) produce the chromatograms and results that define product quality attributes.
Laboratory centrifuge. Image by Ivangiesen, CC0 (public-domain dedication), via Wikimedia Commons.
The workhorses include:
- HPLC / UPLC: high- (or ultra-) performance liquid chromatography, run on commercial platforms a QC analyst would name without hesitation: the Waters ACQUITY UPLC, the Agilent 1290 Infinity II, or the Shimadzu Nexera series. The sample flows through a packed column where different molecules travel at different speeds and emerge separated. The instrument records a chromatogram: a curve of detector signal over time, whose peaks reveal how much of each component is present. The processed result of one such run is essentially a small table, each row a peak, like this:

  | Retention time (min) | Peak area | Analyte |
  |---|---|---|
  | 7.83 | 1,284,500 | Main antibody (monomer) |
  | 6.41 | 51,200 | High-molecular-weight aggregate |
  | 9.12 | 18,640 | Low-molecular-weight fragment |

  Those three numbers might condense a chromatogram of tens of thousands of raw points, which is exactly why the raw file and the processed result must both be kept, a distinction we return to below.
- CE: capillary electrophoresis, which separates molecules by pushing them through a fine tube under an electric field; it likewise yields a peak-bearing trace.
- Mass spectrometry (MS), usually paired with chromatography as LC-MS, which weighs molecules with extraordinary precision. Rogers and colleagues introduced the high-resolution MS multi-attribute method (MAM), a single LC-MS run intended to quantify many quality attributes of a biologic at once to support characterization, QC testing, and release/disposition decisions [9]. (At the time Rogers and colleagues published in 2015, MAM was an advancing method already in use at some facilities but not yet universally adopted for release testing.) Its data is high-dimensional: a mass spectrum at every point along a chromatogram.
- ddPCR: droplet digital PCR, which counts specific DNA molecules one by one, used for example to measure residual host-cell DNA. Its output is a single scalar/count result (a concentration) rather than a spectrum, chromatogram, or image, so its data shape is a scalar, the simplest of the shapes catalogued below.
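A processed peak table like the HPLC example above is usually read as relative peak areas. The following sketch computes each peak's share of the total area from the three example rows; the dictionary layout and function name are our own, and treating area percent as a purity figure assumes all species respond equally at the detector, which a validated method would need to justify.

```python
# The three peaks from the example table in the text.
peaks = [
    {"analyte": "Main antibody (monomer)", "retention_min": 7.83, "area": 1_284_500},
    {"analyte": "High-molecular-weight aggregate", "retention_min": 6.41, "area": 51_200},
    {"analyte": "Low-molecular-weight fragment", "retention_min": 9.12, "area": 18_640},
]


def percent_areas(peak_table):
    """Return each analyte's share of the total integrated area, in percent."""
    total = sum(peak["area"] for peak in peak_table)
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return {peak["analyte"]: 100.0 * peak["area"] / total for peak in peak_table}


shares = percent_areas(peaks)

# Report in elution order (earliest retention time first).
for peak in sorted(peaks, key=lambda p: p["retention_min"]):
    print(f'{peak["retention_min"]:5.2f} min  {shares[peak["analyte"]]:6.2f} %  {peak["analyte"]}')
```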
These instruments produce two layers of data that are easy to confuse but must be kept distinct: the raw file (the instrument's full native recording, every data point it captured) and the processed result (the integrated peak areas, computed concentrations, and pass/fail verdicts derived from it). The raw file is the evidence; the processed result is the conclusion. Sound data management preserves both.
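The relationship between the two layers can be sketched in a few lines: a raw trace of many points is integrated into a single peak area. The example below uses a synthetic Gaussian peak and simple trapezoidal integration; both are assumptions for illustration, since real chromatography software applies baseline correction and peak-detection rules defined by the analytical method.

```python
import math


def trapezoid_area(times, signal):
    """Integrate signal over time with the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (signal[i] + signal[i - 1]) * (times[i] - times[i - 1])
    return area


# Synthetic raw trace: one Gaussian peak at 5.0 min, sampled every 0.01 min.
PEAK_HEIGHT = 100.0
PEAK_CENTER_MIN = 5.0
PEAK_SIGMA_MIN = 0.2

times = [i * 0.01 for i in range(1001)]
signal = [
    PEAK_HEIGHT * math.exp(-0.5 * ((t - PEAK_CENTER_MIN) / PEAK_SIGMA_MIN) ** 2)
    for t in times
]

# Raw file: every point the detector recorded (the evidence).
raw_file = {"times_min": times, "signal": signal}

# Processed result: a single derived number (the conclusion).
processed_result = {
    "peak_area": trapezoid_area(raw_file["times_min"], raw_file["signal"]),
    "n_raw_points": len(raw_file["times_min"]),
}

print(f'{processed_result["n_raw_points"]} raw points condensed into '
      f'one area of {processed_result["peak_area"]:.2f}')
```

The area cannot be re-derived or re-examined once the raw points are gone, which is the practical reason for keeping both layers.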
Data shapes and why they matter
A pattern runs through all of these instruments: each family produces a characteristic data shape, and that shape dictates how the data must be stored, sized, and integrated [2][3][9]. The sizes and rates in the last column are illustrative orders of magnitude (rough author estimates to convey scale, not figures drawn from those references):
| Instrument | Data shape | Approximate size and rate |
|---|---|---|
| Univariate probe (T, pH, DO, dissolved CO2, capacitance) | Scalar time-series | One number per reading, sampled anywhere from several times per second to once per minute |
| Spectroscopic probe (Raman, NIR, fluorescence) | Spectrum: a vector | Hundreds to thousands of numbers per scan |
| Chromatography (HPLC, CE) | Chromatogram: a signal-vs-time curve | Thousands of points per injection |
| Mass spectrometry (LC-MS / MAM) | High-dimensional spectra over time | Megabytes to gigabytes per run |
| Imaging / microscopy | Image (pixel grid) | Large binary files |
A scalar reading is trivial to store but arrives relentlessly, so volume comes from its high sampling rate over weeks of culture. A spectrum is a single timestamped vector, and a database designed only for single numbers will struggle to hold it gracefully. A chromatogram is a curve that means little without the software and method that interpret it. A mass-spectrometry run can be gigabytes. Plan storage for scalars alone and the first Raman probe or LC-MS instrument will overwhelm the system.
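A back-of-the-envelope calculation makes the sizing point tangible. Every figure below is an assumption chosen for illustration (a 14-day culture, one scalar reading every 5 seconds as in the sample rows earlier, one 3,000-channel spectrum every 10 minutes, 8 bytes per stored value); none comes from the cited references, and real systems add overhead for timestamps, indexes, and metadata.

```python
# Illustrative assumptions only; adjust to the actual process and instruments.
CULTURE_DAYS = 14
BYTES_PER_VALUE = 8              # one 64-bit float per stored number
SCALAR_INTERVAL_S = 5            # one scalar reading every 5 seconds
SCAN_INTERVAL_S = 600            # one spectrum every 10 minutes
VALUES_PER_SPECTRUM = 3000       # intensity channels per scan

culture_seconds = CULTURE_DAYS * 24 * 3600

# One univariate probe: many small readings.
scalar_values = culture_seconds // SCALAR_INTERVAL_S
scalar_bytes = scalar_values * BYTES_PER_VALUE

# One spectroscopic probe: fewer readings, each a long vector.
scans = culture_seconds // SCAN_INTERVAL_S
spectral_values = scans * VALUES_PER_SPECTRUM
spectral_bytes = spectral_values * BYTES_PER_VALUE

ratio = spectral_values / scalar_values

print(f"scalar probe:   {scalar_values:,} values, {scalar_bytes / 1e6:.1f} MB")
print(f"spectral probe: {spectral_values:,} values, {spectral_bytes / 1e6:.1f} MB")
print(f"spectral / scalar volume ratio: {ratio:.0f}x")
```

Under these assumptions a single spectroscopic probe stores roughly 25 times as many values as a single scalar probe, even though it reports 120 times less often.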
Why it matters
Every later concept in this book (process control, the electronic batch record, data integrity, analytics) rests on these primary sources. If we do not know an instrument's measurement location, we cannot know how fresh its data is or whether it can drive real-time control. If we do not respect an instrument's data shape, we will build storage and integration that quietly drops information: a Raman spectrum flattened to a single number, a chromatogram saved as a verdict with the underlying curve discarded. The shape of the source is the first design constraint on every system downstream of it.
In the real world
Modern continuous biomanufacturing (the kind advanced by the U.S. NIIMBL institute, and which a new pilot-scale cGMP facility known as SABRE, now being built next to NIIMBL to scale up, de-risk, and teach advanced biomanufacturing, will help bring online) depends heavily on in-line and on-line PAT precisely because a flowing process cannot wait hours for an off-line lab result before it must act. In-line Raman and NIR probes, paired with chemometric models, let such facilities watch glucose, product titer, and impurities continuously and adjust on the fly, exactly the PAT vision the FDA framed in 2004 [1][6]. Meanwhile the QC lab's off-line LC-MS and the multi-attribute method provide the definitive, regulator-grade verdict on product quality [9]. A real plant runs both worlds at once, with fast continuous signals for control and slow definitive results for release, and the data architecture must serve both.
Key terms
- In-line / on-line / at-line / off-line: the four locations a measurement can happen, from inside the stream to a distant lab, ordered roughly from fastest to most definitive.
- Univariate probe: a sensor that returns one number at a time (temperature, pH, dissolved oxygen, dissolved CO2, capacitance/biomass).
- pH: a number describing how acidic a liquid is.
- Dissolved oxygen (DO): how much oxygen is available in the liquid for cells to breathe.
- Capacitance / biomass probe: a probe that estimates viable cell density (biomass) from how living cells respond to an electric field.
- Viable cell density (VCD) / biomass: how many living cells are present in the culture; estimated electrically by a capacitance probe or optically by Raman.
- Spectroscopic probe: a multivariate instrument that records a spectrum across many wavelengths (Raman, NIR, fluorescence, full-spectrum UV-Vis).
- Spectrum: a vector of intensity values, one per wavelength, reflecting many chemical species at once.
- Chemometric model: a calibrated mathematical recipe that converts a spectrum into a concentration or other property.
- Process Analytical Technology (PAT): the FDA framework of designing, analyzing, and controlling manufacturing through timely measurement, not merely the sensors themselves.
- Soft sensor: a virtual instrument that infers a value from other signals rather than measuring it directly.
- HPLC / UPLC: liquid chromatography that separates a mixture and records a chromatogram.
- Chromatogram: a curve of detector signal over time whose peaks quantify components.
- Mass spectrometry (MS / LC-MS): an instrument that weighs molecules precisely; basis of the multi-attribute method (MAM).
- Multi-attribute method (MAM): a single LC-MS method that measures many product quality attributes at once.
- ddPCR: droplet digital PCR, which counts specific DNA molecules individually.
- Raw file vs. processed result: the instrument's full native recording versus the integrated, computed conclusion drawn from it.
- Data shape: the structural form an instrument's output takes: scalar, spectral vector, chromatogram, or image.
Where this leads
We now have raw signals pouring out of probes and instruments, but a signal is not yet a controlled process or a stored record. Between the sensor and the database sits the automation layer. The next chapter, Automation and Process Control Data, introduces the controllers (PLCs, DCS, and SCADA) that read these instrument signals, act on them, and in doing so generate a whole new class of data: setpoints, alarms, events, and recipes. There we will meet the ISA-88 batch-control standard and see how this control and recipe data becomes the backbone of the electronic batch record.