Instruments and Sensors as Data Sources

📍 Where we are: Part II opens here — having toured where process data is born, we now meet the physical instruments that actually create it, the first true sources in our data supply chain.

In the last chapter, A Tour of Where Process Data Is Born, we walked the whole journey of making a monoclonal antibody — a Y-shaped protein drug grown by living cells — and reframed every station as a place that emits data. But we treated those data as if they simply appeared. They do not. Every number, spectrum, and chromatogram begins life inside a physical device: a probe dipped in a tank, a laser shining into a vessel, an instrument humming in a quality lab. This chapter is about those devices — the original sources from which all manufacturing data flows.

The simple version

Think of a modern car. Some gauges read continuously while you drive — speed, fuel, engine temperature — never leaving the dashboard. Others require a trip to the garage, where a mechanic plugs in equipment to run a deeper diagnostic you cannot do at speed. A bioreactor is the same: some instruments watch the living process every second without ever removing a drop, while others demand that a sample be pulled out and carried to a specialized machine. The trick is knowing which question each instrument can answer, and what shape of data it hands back.

What this chapter covers

We will first learn to classify instruments by where they measure relative to the process. Then we will meet the two great families of probes — simple single-number sensors and rich spectroscopic ones — dissect the anatomy of a single raw reading, and meet the framework, called PAT, that turns those signals into control. Finally we will visit the off-line analytical instruments of the quality lab, see why each instrument produces a characteristic data shape that downstream systems must be built to hold, and confront an open problem in how those shapes are integrated.

How and where an instrument measures

Where an instrument measures: the four locations

Engineers classify every measurement by its physical relationship to the process stream. The regulator's foundational guidance on the subject — the U.S. FDA's 2004 framework for Process Analytical Technology — formalized this vocabulary, and a widely cited review by Rathore and colleagues laid it out cleanly for biopharmaceuticals [1][2]. There are four locations:

In-line — the sensor sits inside the process stream and measures in place, with no sample removed. A pH probe immersed in the bioreactor liquid is in-line. Nothing is taken out; the measurement happens where the liquid lives.
On-line — a small side-stream is automatically diverted out of the process, measured, and (often) returned. The product never leaves the closed, sterile system, but the measurement happens just beside the main flow rather than within it. (An off-gas analyzer is loosely grouped here too, though it measures the exhaust gas leaving the vessel headspace rather than a returned liquid side-stream.)
At-line — a sample is physically removed from the process and measured nearby, within seconds to minutes, usually right next to the equipment. The sample leaves the stream but not the room.
Off-line — a sample is removed and carried away to a separate laboratory, where it may be measured hours or days later on large, specialized instruments.

The difference is not academic. It governs how fresh the data is, how automatically it arrives, and whether a result can steer the process while it is still running or only judge it afterward. Book 1 walks the physical production bioreactor where most of these in-line and on-line probes actually sit; here we care about what each location does to the data that comes off them.

A measurement-location taxonomy: the closer to the stream, the faster and more continuous the data; the farther away, the more definitive but delayed. Original diagram by the authors, created with AI assistance.

Two families of probes: single-number and spectroscopic

Within the in-line and on-line world, probes come in two broad kinds, and the distinction matters enormously for data.

Univariate probes return one number at a time. ("Univariate" simply means one variable.) The classic quartet measures temperature, pH (how acidic the liquid is), dissolved oxygen or DO (how much oxygen the cells have to breathe), and dissolved carbon dioxide. A capacitance probe, such as the Aber Instruments Futura line (a long-standing example in cell-culture suites), adds a fifth. It senses how the liquid responds to a tiny alternating electric field. Only cells with an intact, undamaged membrane store charge (each acts like a tiny capacitor), so the signal counts living cells and ignores dead ones and debris. From that response the probe estimates viable cell density, or biomass (roughly, how many living cells are present), without anyone counting them. Each of these probes produces a steady stream of single readings over time: a scalar time-series, the simplest data shape there is [2].

In practice that time-series lands in a historian or a flat file as plain rows. A handful of seconds from one bioreactor might look like this — a timestamp, then one column per probe:

timestamp,pH_PV,DO_percent,temperature_C
2026-06-13T09:00:00Z,7.02,48.5,36.99
2026-06-13T09:00:05Z,7.01,48.2,37.00
2026-06-13T09:00:10Z,7.02,47.9,37.01
2026-06-13T09:00:15Z,7.00,47.6,37.00

The columns carry tag-style names — pH_PV is the present value of the pH loop — and the same reading might be addressed elsewhere as a tag like BR101.pH.PV. We meet those tag conventions in the automation chapter.
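A minimal sketch of reading those rows, using only Python's standard library. It unpivots each wide row (one column per probe) into one reading per probe, which is closer to how a historian addresses values tag by tag. The function name `unpivot` and the in-memory `SAMPLE` string are illustrative assumptions, not part of any historian's API.

```python
import csv
import io
from datetime import datetime

# The four sample rows shown above, held in memory for illustration.
SAMPLE = """timestamp,pH_PV,DO_percent,temperature_C
2026-06-13T09:00:00Z,7.02,48.5,36.99
2026-06-13T09:00:05Z,7.01,48.2,37.00
2026-06-13T09:00:10Z,7.02,47.9,37.01
2026-06-13T09:00:15Z,7.00,47.6,37.00
"""


def unpivot(text):
    """Turn wide rows (one column per probe) into long rows (one per reading)."""
    readings = []
    for row in csv.DictReader(io.StringIO(text)):
        # Older Python versions reject a trailing 'Z', so swap it for an explicit UTC offset.
        ts = datetime.fromisoformat(row["timestamp"].replace("Z", "+00:00"))
        for column, raw in row.items():
            if column != "timestamp":
                readings.append((ts, column, float(raw)))
    return readings


readings = unpivot(SAMPLE)
do_values = [value for _, column, value in readings if column == "DO_percent"]
print(len(readings), do_values)
```

Four rows of three probes become twelve readings, each still carrying its own timestamp. Note what the flat file does not carry: no unit, no quality flag, no batch identifier.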

Schematic of an instrumented bioreactor: probes for temperature, pH, dissolved oxygen, and more feed continuous signals to the control and data systems. Industrial bioreactor schematic. Image by NIST, public domain, via Wikimedia Commons.

Spectroscopic probes are far richer. Instead of one number, they shine light into the liquid and record how the sample absorbs or scatters it across hundreds or thousands of wavelengths at once: a spectrum. Because the spectrum reflects many chemical species simultaneously, these are multivariate ("many variables") instruments. A review by Lourenço and colleagues surveys the main types used in bioreactors: UV-Vis (ultraviolet and visible light), near-infrared (NIR) light, fluorescence, and Raman spectroscopy [3]. A full Raman, NIR, or fluorescence scan hands back the same general data shape: a long vector of intensity values, one number per wavelength. UV-Vis is the exception worth flagging. A full-spectrum UV-Vis instrument is multivariate, but many in-line UV-Vis probes report at just one or a few fixed wavelengths, such as an optical-density biomass reading at 600 nm (OD600; nm is nanometres, the unit of the light's wavelength) or a protein concentration at 280 nm (A280). Such probes are effectively univariate scalars despite using light.

Raman is the spectroscopy a bioprocess engineer reaches for most. It measures the faint, characteristic way molecules scatter laser light. Water is a weak Raman scatterer, so aqueous samples give clean spectra, an advantage over infrared methods, where water absorbs strongly and swamps the signal. The cost is a fluorescence background (a competing glow that some molecules emit when hit by the laser) that can interfere in biological matrices (the complex mixture of cells, nutrients, and waste that makes up the culture). Commercial in-line Raman systems built for bioreactors are familiar fixtures by now: the Kaiser Optical Systems RamanRxn probes (now part of Endress+Hauser) and the Tornado Spectral Systems analyzers (now a Bruker company) are two examples a plant engineer would recognize. In a landmark 2011 study, Abu-Absi and colleagues showed that a single in-line Raman probe could track glucose (the cells' food), lactate (a waste product), glutamine (an amino acid, another nutrient the cells consume), and viable cell density (the same biomass quantity the capacitance probe estimates) simultaneously inside a mammalian-cell bioreactor, in real time [4]. Each of those four outputs required its own multivariate calibration model, built from training data; one Raman spectrum does not "contain" four readings until four separate, validated models extract them (a point we return to under PAT). Authoritative reviews have since established Raman as a mainstream tool across pharmaceutical manufacturing and bioprocessing, from early development to commercial control [5][6]. Broader surveys confirm the same pattern for the whole spectroscopic family, upstream and downstream alike [7].

Anatomy of a raw sensor reading: six fields that never travel naked

The CSV above hides something important. When a single reading leaves an in-line probe, it is not a bare number — or it must not be, if the data is to be trustworthy. A well-formed raw reading carries six fields bound together, and dropping any one of them quietly corrupts the record. Pulling one DO-probe reading apart shows the whole skeleton:

One raw reading, fully unpacked: two timestamps, a tag identity, a value bound to its unit, a quality flag, and the batch join key — six fields that must travel together from the probe to the database. Original diagram by the authors, created with AI assistance.

The six fields are:

Source time. When the sensor actually sampled the liquid — the contemporaneous instant the measurement represents. This is the time that belongs to the reading; record anything else in its place and you have falsified the moment the datum describes.
Collection time. When a historian or collector actually wrote the reading down. It is usually close to the source time but not identical, and conflating the two is a classic data-integrity error: a value buffered at the edge for thirty seconds and stamped only on arrival looks, to a careless system, like it was sampled thirty seconds late. Keeping both clocks separate is what makes a record contemporaneous. We treat timestamps and the full birth-to-archive journey of a single value in The Lifecycle of a Data Point.
Tag identity. BR101.DO.PV is not a free-text label; it is structured as asset.measurement.role — the present value (PV) of the dissolved-oxygen loop on bioreactor BR101. That grammar is how thousands of signals stay addressable, and we return to it in the automation and connectivity chapters.
Value, bound to its unit. A number with no unit is meaningless and dangerous: 47.9 could be percent saturation, milligrams per liter, or kilopascals. The unit (here, %sat) is part of the datum, not metadata to be looked up later — which is why the value and unit count as a single inseparable field rather than two.
Quality flag. A small integer stamped beside the value, recording whether the reading can be trusted. The familiar 192 Good / 64 Uncertain / 0 Bad triad is the legacy OPC DA quality byte (OPC is a long-standing industrial standard for moving sensor data between plant equipment and software), still ubiquitous because so much installed equipment speaks the older protocol; its successor OPC UA replaces the single byte with a richer 32-bit StatusCode whose Good is simply the number 0, while Bad or Uncertain readings carry large code numbers. A probe mid-calibration reports Uncertain; a reading you cannot vouch for is not a missing reading and emphatically not a good one. Storing the flag with the value is what lets a later audit treat the number honestly. We unpack both quality conventions in the connectivity chapter.
Batch ID. The single join key — BATCH-2026-001 — that ties this stray reading back to the GMP (Good Manufacturing Practice — the legally mandated quality regime under which every record in this chapter must hold up) batch record, so a continuous historian stream and the discrete batch facts can be rejoined.
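The six fields listed above can be sketched as one immutable record. This is an illustrative Python shape, not the schema of any particular historian: the class name `RawReading`, its method names, and the thirty-second buffering delay in the example are assumptions. The two helper functions classify a legacy OPC DA quality byte and an OPC UA StatusCode by their top two bits, which is where both conventions encode Good, Uncertain, or Bad.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RawReading:
    source_time: datetime      # when the sensor sampled the liquid
    collection_time: datetime  # when the collector wrote the value down
    tag: str                   # asset.measurement.role
    value: float
    unit: str                  # bound to the value, not looked up later
    quality: int               # legacy OPC DA quality byte
    batch_id: str              # join key back to the batch record

    def tag_parts(self):
        """Split the tag into (asset, measurement, role)."""
        asset, measurement, role = self.tag.split(".")
        return asset, measurement, role

    def buffering_delay_s(self):
        """Seconds between sampling and collection."""
        return (self.collection_time - self.source_time).total_seconds()


def da_quality(byte):
    """Classify an OPC DA quality byte by its top two bits."""
    return {0xC0: "Good", 0x40: "Uncertain", 0x00: "Bad"}.get(byte & 0xC0, "Unknown")


def ua_severity(status_code):
    """Classify a 32-bit OPC UA StatusCode by its top two (severity) bits."""
    return {0: "Good", 1: "Uncertain", 2: "Bad"}.get((status_code >> 30) & 0b11, "Unknown")


reading = RawReading(
    source_time=datetime(2026, 6, 13, 9, 0, 10, tzinfo=timezone.utc),
    collection_time=datetime(2026, 6, 13, 9, 0, 40, tzinfo=timezone.utc),  # assumed 30 s buffer
    tag="BR101.DO.PV",
    value=47.9,
    unit="%sat",
    quality=192,
    batch_id="BATCH-2026-001",
)
print(reading.tag_parts(), reading.buffering_delay_s(), da_quality(reading.quality))
```

Keeping both timestamps on the record makes the buffering delay a computed fact rather than something a downstream system has to guess. Freezing the dataclass is a small design choice in the same spirit: a raw reading is evidence, so corrections should create a new record instead of mutating the old one.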

This six-field record is the atom of process data, and it is the shape Book 3 stores in code: the open-source upstream-bioreactor chapter writes each reading into a ts.sensor_reading row with columns (ts, tag, value, unit, quality, batch_id). That is the same record as a concrete database row, with value and unit held in separate columns and a single ts column for the timestamp. The journey from probe wire to that row, with the quality flag and source time preserved end to end, is the job of the edge gateway and the OPC UA / MQTT connectivity layer.

The same six fields can be expressed as semantic triples — subject-predicate-object facts of the kind Book 4 builds its bioprocess knowledge graph from — and reading the anatomy this way turns each field into something a machine can reason over rather than merely store. A value bound to its unit is exactly the QUDT (Quantities, Units, Dimensions and Types — a published units vocabulary) discipline Book 4 calls the typed value: instead of a bare 47.9, the reading becomes a quantity value carrying the magnitude as a typed number, the unit as a resolvable identifier, and the quantity kind (here, oxygen saturation) as a separate fact, so the percent can never be misread as kilopascals downstream. The tag identity BR101.DO.PV, meanwhile, is a local plant key, and the moment two systems each hold their own "BR101" you have the reconciliation problem the identifiers-and-units chapter is built around — the cure being a globally unique identifier (an IRI, internationalized resource identifier) that means the same probe to every system, not a name that collides across plants. In Turtle (the standard text syntax for RDF graphs), one DO reading reads:

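# one DO reading as semantic triples: the unit and quality flag become reasoned facts, not strings
# the bp: namespace is a placeholder for illustration; the others are the published QUDT, PROV-O, and XSD namespaces
@prefix bp:   <https://example.org/bioprocess/> .
@prefix qudt: <http://qudt.org/schema/qudt/> .
@prefix unit: <http://qudt.org/vocab/unit/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

bp:reading-DO-0042 a qudt:QuantityValue ;
    qudt:numericValue "47.9"^^xsd:float ;   # the value, typed as a number
    qudt:hasUnit unit:PERCENT ;              # the unit as an IRI, not a column header
    bp:fromTag "BR101.DO.PV" ;               # the structured asset.measurement.role tag
    bp:qualityFlag "Good" ;                  # the OPC quality byte, normalized
    bp:partOfBatch bp:BATCH-2026-001 ;       # the join key, now an edge in the graph
    prov:generatedAtTime "2026-06-13T09:00:10Z"^^xsd:dateTime .   # the source time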

That prov:generatedAtTime is no accident: the source-versus-collection-time distinction is precisely a PROV-O (the W3C provenance vocabulary) question of when an entity was generated versus when an activity recorded it, and modeling them as distinct facts is what keeps the record contemporaneous in a form an auditor's query can check. The same discipline is what lets a constraint be enforced rather than hoped for: a SHACL shape (the Shapes Constraint Language, which validates that a graph has the required structure) can gate the rule "every stored quantity value carries a unit — no bare numbers," rejecting any reading that arrives unit-less, exactly the release-gate mechanism the release-gate-and-SHACL chapter makes executable. And a plant's data steward can pose the six-field atom as a competency question — a question the data must be able to answer, such as "return every DO reading for BATCH-2026-001 whose quality flag is not Good" — and answer it with one SPARQL query over the graph. The point is not that a historian must become a triple store; it is that the six fields are already a small, well-formed semantic record, and naming them in a shared vocabulary is what makes one plant's reading legible to another system, a regulator's query, or a model years later.
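To make the two ideas above concrete without a triple store, here is a plain-Python stand-in. It is not SHACL and not SPARQL: the "no bare numbers" rule becomes an ordinary predicate, and the competency question becomes a list filter. The sample readings and the function names `gate` and `suspect_do_readings` are invented for illustration.

```python
# Illustrative readings; the third one arrives without a unit.
readings = [
    {"tag": "BR101.DO.PV", "value": 47.9, "unit": "%sat", "quality": "Good", "batch_id": "BATCH-2026-001"},
    {"tag": "BR101.DO.PV", "value": 47.6, "unit": "%sat", "quality": "Uncertain", "batch_id": "BATCH-2026-001"},
    {"tag": "BR101.DO.PV", "value": 47.1, "unit": None, "quality": "Good", "batch_id": "BATCH-2026-001"},
    {"tag": "BR101.pH.PV", "value": 7.02, "unit": "pH", "quality": "Bad", "batch_id": "BATCH-2026-001"},
]


def gate(candidates):
    """Split readings into accepted and rejected under the rule 'every value carries a unit'."""
    accepted = [r for r in candidates if r.get("unit")]
    rejected = [r for r in candidates if not r.get("unit")]
    return accepted, rejected


def suspect_do_readings(candidates, batch_id):
    """Competency question: every DO reading for a batch whose quality flag is not Good."""
    return [
        r for r in candidates
        if r["batch_id"] == batch_id
        and r["tag"].split(".")[1] == "DO"
        and r["quality"] != "Good"
    ]


accepted, rejected = gate(readings)
suspects = suspect_do_readings(accepted, "BATCH-2026-001")
print(len(accepted), len(rejected), [r["value"] for r in suspects])
```

The difference from the graph version is where the rule lives. Here it is buried in application code; a SHACL shape states the same constraint as data that any conforming validator can enforce.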

PAT as measured meaning: chemometric models and validated interpretation

A Raman spectrum is not a glucose concentration. It is thousands of intensity numbers. Turning that raw spectrum into "the glucose is 4.2 grams per liter" requires a chemometric model — a mathematical recipe, built and calibrated in advance, that maps the spectral fingerprint to a concentration [3][7]. The instrument senses; the model interprets. Building that recipe is itself a software task: practitioners typically fit such models in dedicated chemometrics packages — Eigenvector Research's PLS_Toolbox, Sartorius's SIMCA, or general statistical tools like JMP Pro — usually with a partial-least-squares (PLS) regression linking spectra to known reference values.

Crucially, the model is never fed the raw spectrum. A measured spectrum carries nuisance variation that has nothing to do with chemistry — a slowly rising baseline from fluorescence or stray light, overall scale shifts from probe fouling or changing turbidity, and random instrument noise. So a fixed sequence of preprocessing steps is applied first: a baseline correction to subtract the drifting floor, a normalization such as standard normal variate (SNV) to remove multiplicative scale, and often a Savitzky-Golay derivative that smooths the curve while sharpening the chemically informative peaks. Only the cleaned spectrum reaches the regression. This matters for data management because the preprocessing recipe is part of the method: re-deriving a result years later means storing not just the raw spectrum but the exact preprocessing and model version that produced the number — another reason the raw file alone is never enough.
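A minimal sketch of the first two preprocessing steps, in standard-library Python. The baseline correction here is a straight line drawn through the first and last points, a deliberately crude stand-in for the polynomial or asymmetric-least-squares fits used in practice, and the Savitzky-Golay derivative is left out. The function names and the five-point toy spectrum are assumptions for illustration.

```python
import statistics


def subtract_linear_baseline(spectrum):
    """Subtract the straight line that joins the first and last points."""
    n = len(spectrum)
    first, last = spectrum[0], spectrum[-1]
    return [y - (first + (last - first) * i / (n - 1)) for i, y in enumerate(spectrum)]


def snv(spectrum):
    """Standard normal variate: centre on the mean, scale by the sample standard deviation."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(y - mean) / sd for y in spectrum]


def preprocess(spectrum):
    """Apply the steps in a fixed order, as the stored method recipe would."""
    return snv(subtract_linear_baseline(spectrum))


# A toy peak, and the same peak doubled in scale and riding on a sloping floor.
peak = [0.0, 1.0, 4.0, 1.0, 0.0]
tilted_and_scaled = [2.0 * p + 10.0 + 0.5 * i for i, p in enumerate(peak)]

clean = preprocess(peak)
recovered = preprocess(tilted_and_scaled)
print(max(abs(a - b) for a, b in zip(clean, recovered)))
```

The two inputs differ only by nuisance variation (offset, slope, and scale), and after preprocessing they agree to within floating-point error. That is the property the regression relies on, and it is also why the recipe has to be stored: change the order or the parameters and the same raw spectrum yields a different cleaned one.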

A calibrated model is also not calibrated forever. Probes and their models drift. An electrochemical pH or DO probe ages physically (its membrane fouls, its reference electrode shifts), so it must be recalibrated against known standards on a schedule: buffer solutions for pH, zero and air-saturation points for DO. A spectroscopic model drifts more subtly. It was trained on a particular cell line, medium lot, and probe, and as any of those changes the mapping from spectrum to concentration slowly stops holding, so the predicted glucose creeps away from the true value even though the spectrum looks fine. Catching that requires periodic reference samples and a maintenance discipline of recalibration, model updating, and change control (the formal, documented process of reviewing and approving any change to a validated method before it is made), which is what ICH Q14 expects of a method across its lifecycle. A probe reading is therefore only as trustworthy as the calibration behind it.

Two cautions from the machine-learning side sharpen exactly how trustworthy that calibration is, and both are easy to get silently wrong.

The first is how the model is validated before it is trusted at all. A chemometric model fit on spectra collected within a single batch will report a flattering accuracy if it is scored on spectra from that same batch. Because consecutive readings minutes apart are nearly identical, the model is effectively graded on data it has all but seen, a form of data leakage (when information from the test set bleeds into training and inflates the score). The honest test is leave-one-batch-out cross-validation: hold out whole batches, train on the rest, and score only on the batches the model never saw, so a high number reflects real generalization to a new culture rather than memorized within-batch noise. Book 5's models-and-validation chapter makes this batch-grouped split the gate that makes any soft-sensor number admissible.

The second caution is the applicability domain (AD), the region of spectra the model was actually trained on. A spectrum that drifts outside that region (a new medium lot, a fouled probe, an unfamiliar matrix) is one the model must extrapolate on, exactly where a data-driven model is least trustworthy. A well-built soft sensor flags such an out-of-envelope reading before its number is believed, using the same Hotelling T² and squared-prediction-error (SPE) check that multivariate process monitoring already uses.

Model drift is also not the same as process drift. A control chart on the bioreactor's own signals catches the process wandering, while a control chart on the model's prediction-minus-reference residual catches the model going stale. These are two distinct monitors, because a model can drift while the process is in control and vice versa, the asymmetry the MLOps chapter builds its two detectors around.

All of which is why a deployed model is not a file but a versioned, lineage-tracked object: the exact training dataset (pinned by a content hash), the preprocessing recipe, the operating range, and the validation evidence travel with the weights, so "which model produced this number, trained on what data?" is answerable. This is the model-lineage discipline this book returns to under data integrity and Book 5 makes runnable.
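The leave-one-batch-out split described above is simple enough to sketch directly. This version uses only the standard library and works on a list of batch labels, one per spectrum; the function name and the eight-spectrum example are illustrative assumptions.

```python
def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_indices, test_indices), holding out one whole batch at a time."""
    for held_out in sorted(set(batch_ids)):
        train = [i for i, b in enumerate(batch_ids) if b != held_out]
        test = [i for i, b in enumerate(batch_ids) if b == held_out]
        yield held_out, train, test


# One batch label per spectrum: three batches, eight spectra in total.
batch_ids = ["B1", "B1", "B1", "B2", "B2", "B3", "B3", "B3"]
splits = list(leave_one_batch_out(batch_ids))

for held_out, train, test in splits:
    train_batches = {batch_ids[i] for i in train}
    test_batches = {batch_ids[i] for i in test}
    # No batch may appear on both sides of a split: that overlap is the leakage.
    assert not train_batches & test_batches
    print(held_out, train, test)
```

A random row-wise split of the same eight spectra would almost always put neighbouring readings from one batch on both sides, which is exactly the leak the text warns about. Libraries offer the same grouping idea ready-made; scikit-learn's `LeaveOneGroupOut` is one example.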

This pairing of measurement with model is the heart of Process Analytical Technology (PAT). The FDA's 2004 guidance defines PAT deliberately broadly: not a list of gadgets, but a system for designing, analyzing, and controlling manufacturing through timely measurements of critical quality and performance attributes — the goal being to build quality in, rather than test for it at the end [1]. In that view, a sensor only becomes "PAT" when its data is actually used to understand and steer the process.

note

A chemometric model that converts a spectrum into a concentration is the seed of what later chapters call a soft sensor — a "virtual instrument" that reports a value no physical probe measured directly, inferred instead from other signals. We meet soft sensors properly in the chapter on machine learning and soft sensors; for now, just notice that the data shape they consume is the spectrum produced here.

PAT is not an upstream-only story. The same in-line spectroscopy is increasingly aimed at the downstream purification train, where the data it produces feeds an immediate decision rather than a slow lab result. The first purification step, Protein A capture — the affinity column that grabs the antibody out of the clarified harvest, described in Book 1's capture chromatography chapter — already throws off a rich in-line stream: a UV-280 absorbance trace whose tall elution peak marks when the pure antibody comes off, and against which an operator sets the pooling window (the two cut points between which eluate is collected as product). In-line UV and Raman during capture can drive that pooling decision in real time and even monitor product-related quality as the peak elutes, the same PAT loop the upstream Raman runs, now steering a downstream unit operation. Other downstream steps lean on the same instruments: a conductivity and pH probe gates the low-pH hold of viral inactivation, an in-line UV trace tracks the polishing step's separation, and a pressure-decay or diffusion test on a viral filter is itself a measured integrity verdict. Each is a data source whose shape and location this chapter's taxonomy still governs.

Because these model-based methods are now used to make quality decisions, regulators treat them as governed analytical procedures. The international guideline ICH Q14, adopted in 2023, sets out a science- and risk-based way to develop such procedures (including multivariate, spectroscopic methods like NIR) with proper calibration and ongoing lifecycle monitoring [8]. The same expectation appears in binding regulation on both sides of the Atlantic: in the United States, 21 CFR 211.160 and 211.165 require that test methods be validated and equipment be qualified before results support release [13], while in the European Union, Annex 11 (the computerised-systems annex to EU GMP) calls in its Section 4 for validation and qualification and in its Section 1 for risk management across the system's lifecycle [11]. The pharmaceutical industry's practical playbook for getting there, GAMP 5 (Second Edition, 2022), lays out the risk-based validation lifecycle that a chemometric model or analytical instrument is typically taken through [14]. Before that instrument can produce GMP data at all, it is qualified through the standard three-stage sequence of IQ / OQ / PQ (Installation, Operational, and Performance Qualification: proof that it was installed right, operates right, and performs right on the real workload). These are the V-model rungs that the CSV-to-CSA chapter lays out in full, where CSV means computerized system validation rather than the comma-separated file format shown earlier. An instrument, in other words, is not a neutral fact-machine; it is a validated, version-controlled data source.

That qualification is also not a one-time event, because instruments and their models move between sites and scales. When a PAT method is transferred from a development lab to a manufacturing plant, or scaled up from a 10-litre to a 2,000-litre bioreactor, the chemometric model rarely survives the move untouched: a different probe, a different vessel geometry, and a different optical path shift the spectra enough that the calibration must be re-verified and often re-calibrated at the receiving site — a known manufacturing pain point, since a multi-site Raman model is notoriously hard to make transferable, and a model trusted across an un-bridged scale or probe change is quietly extrapolating. The disciplined fix is an explicit calibration-transfer step plus revalidation under change control, treating the moved method as a new qualification rather than an inherited one. The same logic governs cleaning validation for shared equipment: a probe or column reused across products must be proven clean to documented limits between campaigns, and the data proving it is one more governed record this chapter's primary sources generate.

:::note Annex 11 is being rewritten
The Annex 11 in force today dates from 2011. A substantially expanded revised draft was published for public consultation in July 2025 (consultation closed that October; finalization is expected in 2026), broadening the annex to audit-trail review, cloud and SaaS service providers, data integrity across networked multi-system environments, and, for the first time, AI/ML, while formalizing the ALCOA+ principles this book returns to in Part III [15]. A chemometric model deployed for release today should be built with that direction of travel in mind.
:::

The data they hand back

The analytical instruments of the quality lab: HPLC, CE, LC-MS, ddPCR

Not all questions can be answered by a probe in a tank. The most definitive measurements of product quality — is this truly the right antibody, and how pure is it? — come from large off-line instruments in the quality control (QC) laboratory. Book 1 describes the physical analytical and formulation development work these instruments support, and the chromatography that the QC HPLC mirrors analytically is the same separation chemistry used preparatively at capture chromatography to purify the product.

A laboratory centrifuge in the QC lab, an off-line workhorse that prepares and clarifies samples before the analytical instruments (HPLC, LC-MS, CE) produce the chromatograms and results that define product quality attributes. Laboratory centrifuge. Image by Ivangiesen, CC0 (public-domain dedication), via Wikimedia Commons.

The workhorses include:

HPLC / UPLC — high- (or ultra-) performance liquid chromatography, run on commercial platforms a QC analyst would name without hesitation: the Waters ACQUITY UPLC, the Agilent 1290 Infinity II, or the Shimadzu Nexera series. The sample flows through a packed column where different molecules travel at different speeds and emerge separated. The instrument records a chromatogram: a curve of detector signal over time, whose peaks reveal how much of each component is present. The example below is a size-exclusion chromatography (SEC) purity run — the mode that sorts molecules by physical size, so the intact antibody, the larger aggregates clumped from several copies, and the smaller broken fragments each emerge at their own retention time — the time, measured from injection, at which a component leaves the column. The processed result of one such run is essentially a small table — each row a peak — like this:

Retention time (min)	Peak area	Analyte
7.83	1,284,500	Main antibody (monomer)
6.41	51,200	High-molecular-weight (HMW) aggregate
9.12	18,640	Low-molecular-weight (LMW) fragment

The larger aggregate elutes first (at 6.41 min) and the smaller fragment last (at 9.12 min) — a hallmark of SEC, where the column is packed with tiny porous beads: small molecules detour into the bead pores and are delayed, while bigger molecules are excluded from the pores and so take the shorter route straight through, eluting first. Dividing each peak area by the total gives a purity percentage: here the monomer is about 95% of the antibody mass. Those three numbers might condense a chromatogram of tens of thousands of raw points — which is exactly why the raw file and the processed result must both be kept, a distinction we return to below.
CE — capillary electrophoresis, which separates molecules by pushing them through a fine tube under an electric field; it likewise yields a peak-bearing trace.
Mass spectrometry (MS), usually paired with chromatography as LC-MS, which weighs molecules with extraordinary precision. Rogers and colleagues introduced the high-resolution MS multi-attribute method (MAM) — a single LC-MS run intended to quantify many quality attributes of a biologic at once to support characterization, QC testing, and release/disposition decisions [9]. (At the time Rogers and colleagues published in 2015, MAM was an advancing method already in use at some facilities but not yet universally adopted for release testing.) Its data is high-dimensional: a mass spectrum at every point along a chromatogram.
ddPCR — droplet digital PCR, which counts specific DNA molecules one by one, used for example to measure residual host-cell DNA — trace DNA from the living production cells that must be cleared from the final drug, so its tiny remaining amount is counted. Its output is a single scalar/count result — a concentration — rather than a spectrum, chromatogram, or image, so its data shape is a scalar, the simplest of the shapes catalogued below.
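The purity arithmetic from the SEC example above can be checked in a few lines. The peak areas are the three from the table; the dictionary keys are shortened labels chosen for illustration.

```python
# Peak areas from the SEC example table.
peaks = {
    "monomer": 1_284_500,
    "HMW aggregate": 51_200,
    "LMW fragment": 18_640,
}

total = sum(peaks.values())
purity = {name: 100 * area / total for name, area in peaks.items()}

for name, pct in purity.items():
    print(f"{name}: {pct:.2f}%")
```

The monomer comes out at roughly 94.8% of the total integrated area, which is the "about 95%" quoted in the text. Every percentage depends on the total, so a re-integration that shifts one peak's area changes all three numbers. That is one more reason the raw chromatogram has to survive alongside the table.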


These instruments produce two layers of data that are easy to confuse but must be kept distinct. The raw file is the instrument's full native recording: every data point it captured. The processed result is everything derived from it: integrated peak areas (the area under each peak, found by integration, the math of measuring area under a curve), computed concentrations, and pass/fail verdicts. The raw file is the evidence; the processed result is the conclusion. Sound data management preserves both.

Data shapes and why they matter: scalar, spectrum, chromatogram, image

A pattern runs through all of these instruments: each family produces a characteristic data shape, and that shape dictates how the data must be stored, sized, and integrated [2][3][9]. The sizes and rates in the last column are illustrative orders of magnitude — rough author estimates to convey scale, not figures drawn from those references:

Instrument	Data shape	Approximate size and rate
Univariate probe (temperature, pH, DO, dissolved CO2, capacitance)	Scalar time-series	One number per channel, logged roughly once per second to once per minute
Spectroscopic probe (Raman, NIR, fluorescence)	Spectrum: a vector	Hundreds to thousands of numbers per scan
Chromatography (HPLC, CE)	Chromatogram: a signal-vs-time curve	Thousands of points per injection
Mass spectrometry (LC-MS / MAM)	High-dimensional spectra over time	Megabytes to gigabytes per run
Imaging / microscopy	Image (pixel grid)	Large binary files

A scalar reading is trivial to store but arrives relentlessly, so volume comes from its sustained logging rate over weeks of culture. (A probe transduces effectively continuously; what reaches storage is whatever cadence the historian is configured to log — typically once a second to once a minute, often thinned further by compression, a point we return to in Plant Information Systems.) A spectrum is a single timestamped vector — and a database designed only for single numbers will struggle to hold it gracefully. A chromatogram is a curve that means little without the software and method that interpret it. A mass-spectrometry run can be gigabytes. Plan storage for scalars alone and the first Raman probe or LC-MS instrument will overwhelm the system.
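A back-of-envelope count shows how "relentlessly" adds up. The sketch below assumes a 14-day culture and five univariate channels; both figures are illustrative assumptions, not numbers from the references, and compression is ignored.

```python
SECONDS_PER_DAY = 86_400


def readings_logged(days, interval_s, channels):
    """Count stored readings for a given culture length, logging interval, and channel count."""
    return int(days * SECONDS_PER_DAY / interval_s) * channels


# Assumed scenario: 14-day culture, five univariate probes.
per_second = readings_logged(days=14, interval_s=1, channels=5)
per_minute = readings_logged(days=14, interval_s=60, channels=5)
print(per_second, per_minute)
```

At one reading per second, that single bioreactor stores about six million values in a run; at one per minute, about a hundred thousand. The sixty-fold gap is entirely a historian configuration choice, which is why the logging cadence belongs in the record of how the data was produced.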

Why data shape matters downstream: the integration challenge

The danger is not just storage volume; it is silent loss. When a rich, high-dimensional output meets a system that was designed only for scalars, the path of least resistance is to keep the convenient summary and throw away the raw — and nothing in the pipeline necessarily warns you it happened. A Raman spectrum becomes "glucose 4.2 g/L" and the thousands of intensities behind it evaporate; a chromatogram becomes three peak areas and the underlying curve is gone. The scalar survives because the number is the datum. The spectrum and the chromatogram are at risk because their summary looks like the whole answer.

Three shapes, three fates through integration: a scalar survives intact, while a spectrum and a chromatogram can be silently flattened to a summary unless the raw is deliberately retained. Original diagram by the authors, created with AI assistance.

The unsolved challenge: silent data-shape loss

This is one of the genuinely open problems in bioprocess data, and it is not solved by buying a bigger disk. High-dimensional measurements — a Raman spectrum, an LC-MS run, a chromatogram — are routinely reduced to summary statistics at the moment of integration, with the raw signal never retained, because the receiving historian or LIMS (Laboratory Information Management System — the database of record for the quality lab's samples and results, which we meet in Plant Information Systems) has no natural slot for a vector or a binary instrument file. The summary is enough to make today's decision, so the raw is dropped. The cost surfaces later: when a chemometric model must be recalibrated, when a regulator asks to re-integrate a chromatogram with a corrected baseline, or when a new analytical question is posed of an old batch, the evidence is simply gone.

That is not merely inconvenient: it breaks reproducibility and violates the Original principle of ALCOA+ (the data-integrity framework whose letters stand for Attributable, Legible, Contemporaneous, Original, and Accurate, plus the "+" extensions Complete, Consistent, Enduring, and Available), to which we devote a later chapter. Here Original means the raw instrument file, the true source record; a peak table is a derived conclusion, not the original. Discarding the raw violates the Complete principle too, because the record no longer contains all the data secured during testing. Regulators are explicit about raw versus processed data: U.S. 21 CFR 211.194 requires that laboratory records include a complete record of all data secured in the course of each test, and EU Annex 11 (clause 4.8) requires that, where data are transferred to another format or system, validation include checks that the data are not altered in value or meaning [10][11].

Because the raw file and the processed result are electronic records, the chromatography data system that holds them falls squarely under 21 CFR Part 11 and EU Annex 11, the electronic-records-and-signatures rules that demand a secure, time-stamped audit trail (a tamper-evident log of who changed what, when, and why), so that re-integrating a chromatogram with a corrected baseline leaves a reviewable trace rather than silently overwriting the first answer. Proving that the system actually enforces those controls is the job of computerized system validation. The modern, risk-based way to do it is Computer Software Assurance (CSA), the FDA's shift from documenting everything identically toward spending assurance effort where patient risk is highest, and it is the subject of the CSV-to-CSA chapter. The function that integrates a peak and decides pass or fail is a Critical, patient-impacting feature that earns rigorous scripted proof; a free-text comment field is not. The same record carries two very different levels of scrutiny.

Rathore and colleagues, surveying the data challenges of multivariate process monitoring, make the same point from the engineering side: the value of spectroscopic and high-dimensional methods is only realized if the underlying data is captured and managed so that models can be built, audited, and maintained over a product's lifecycle [12]. What makes this unsolved in practice is that the architectural fix (store the richest shape natively, keep raw and processed side by side, and propagate both downstream) runs against decades of scalar-first historian design and the ordinary pressure to keep only what the next decision needs. The chapters on data integrity and the digital thread return to this; the design rule it implies is simple and unforgiving: size every system for the richest data shape it will ever see, not the simplest.
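
One way to picture "raw and processed side by side" is a record that holds the raw bytes once and appends every processed result as a new version tied to the raw file's hash. The sketch below is illustrative only: the class names, field names, method labels, and numeric values are hypothetical, and an in-memory list is not a tamper-evident audit trail or a validated Part 11 control.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ProcessedResult:
    raw_sha256: str    # ties the conclusion to the exact raw evidence
    method: str        # which processing method produced it
    values: dict       # the summary numbers (peak areas, concentrations)
    user: str          # who
    reason: str        # why
    recorded_at: str   # when (UTC, ISO 8601)

@dataclass
class MeasurementRecord:
    raw_bytes: bytes                        # the instrument's native file, kept whole
    results: list = field(default_factory=list)

    @property
    def raw_sha256(self):
        return hashlib.sha256(self.raw_bytes).hexdigest()

    def add_result(self, method, values, user, reason):
        """Append a new processed result; earlier results are never overwritten."""
        entry = ProcessedResult(
            raw_sha256=self.raw_sha256,
            method=method,
            values=dict(values),
            user=user,
            reason=reason,
            recorded_at=datetime.now(timezone.utc).isoformat(),
        )
        self.results.append(entry)
        return entry

    @property
    def current(self):
        return self.results[-1]

# Hypothetical usage with made-up values.
record = MeasurementRecord(raw_bytes=b"hypothetical raw chromatogram file contents")
record.add_result("SEC method v1", {"monomer_area_pct": 98.1},
                  "analyst_a", "initial integration")
record.add_result("SEC method v1, corrected baseline", {"monomer_area_pct": 97.6},
                  "analyst_b", "baseline corrected on review")

print(len(record.results), "versions; current:", record.current.values)
```

The design choice worth noting is that re-integration adds a second entry rather than replacing the first, and both entries point at the same raw hash, so a reviewer can see what changed and recompute either answer from the retained evidence.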

Why it matters

Every later concept in this book — process control, the electronic batch record, data integrity, analytics — rests on these primary sources. If we do not know an instrument's measurement location, we cannot know how fresh its data is or whether it can drive real-time control. If we do not respect an instrument's data shape, we will build storage and integration that quietly drops information: a Raman spectrum flattened to a single number, a chromatogram saved as a verdict with the underlying curve discarded. The shape of the source is the first design constraint on every system downstream of it.

In the real world

Modern continuous biomanufacturing depends heavily on in-line and on-line PAT precisely because a flowing process cannot wait hours for an off-line lab result before it must act. In-line Raman and NIR probes, paired with chemometric models, let such facilities watch glucose, product titer, and impurities continuously and adjust on the fly, exactly the PAT vision the FDA framed in 2004 [1][6]. (Titer is the concentration of antibody the cells have made, in grams per liter: the headline measure of how productive a culture is.) Meanwhile the QC lab's off-line LC-MS and the multi-attribute method provide the definitive, regulator-grade verdict on product quality [9]. A real plant runs both worlds at once, fast continuous signals for control and slow definitive results for release, and the data architecture must serve both.

Key terms

GMP (Good Manufacturing Practice) — the legally mandated quality regime under which production records, including the batch record, must be kept and held up to inspection.
In-line / on-line / at-line / off-line — the four locations a measurement can happen, from inside the stream to a distant lab, ordered roughly from fastest to most definitive.
Univariate probe — a sensor that returns one number at a time (temperature, pH, dissolved oxygen, dissolved CO2, capacitance/biomass).
pH — a number describing how acidic a liquid is.
Dissolved oxygen (DO) — how much oxygen is available in the liquid for cells to breathe.
Capacitance / biomass probe — a probe that estimates viable cell density (biomass) from how living cells respond to an electric field.
Viable cell density (VCD) / biomass — how many living cells are present in the culture; estimated electrically by a capacitance probe or optically by Raman.
Spectroscopic probe — a multivariate instrument that records a spectrum across many wavelengths (Raman, NIR, fluorescence, full-spectrum UV-Vis).
Spectrum — a vector of intensity values, one per wavelength, reflecting many chemical species at once.
Chemometric model — a calibrated mathematical recipe (usually a partial-least-squares regression) that converts a spectrum into a concentration or other property.
Spectrum preprocessing — the fixed sequence of cleanup steps (baseline correction, standard normal variate normalization, Savitzky-Golay derivatives) applied to a raw spectrum before the chemometric model sees it; part of the method, so it must be stored to reproduce a result.
Probe / model drift — the gradual loss of accuracy as a physical probe ages or a chemometric model's training conditions no longer match the process; corrected by scheduled recalibration, reference samples, and model maintenance under change control.
Process Analytical Technology (PAT) — the FDA framework of designing, analyzing, and controlling manufacturing through timely measurement, not merely the sensors themselves.
Soft sensor — a virtual instrument that infers a value from other signals rather than measuring it directly.
Titer — the concentration of product (here, antibody) in the culture, in grams per liter; the headline measure of how much a batch made.
LIMS — Laboratory Information Management System, the database of record for the quality lab's samples, tests, and results.
HPLC / UPLC — liquid chromatography that separates a mixture and records a chromatogram.
Chromatogram — a curve of detector signal over time whose peaks quantify components.
Size-exclusion chromatography (SEC) — a chromatography mode that separates molecules by physical size, used to measure monomer purity versus aggregates and fragments; larger species elute first.
Mass spectrometry (MS / LC-MS) — an instrument that weighs molecules precisely; basis of the multi-attribute method (MAM).
Multi-attribute method (MAM) — a single LC-MS method that measures many product quality attributes at once.
ddPCR — droplet digital PCR, which counts specific DNA molecules individually.
Raw file vs. processed result — the instrument's full native recording versus the integrated, computed conclusion drawn from it.
Data shape — the structural form an instrument's output takes: scalar, spectral vector, chromatogram, or image.
Source time vs. collection time — when the sensor actually sampled versus when a historian recorded it; the contemporaneous record is the source time.
Tag identity — the structured asset.measurement.role name (e.g. BR101.DO.PV) that makes one signal addressable among thousands.
Quality flag — a small integer stamped beside a value recording whether the reading can be trusted (legacy OPC DA: 192 Good, 64 Uncertain, 0 Bad; OPC UA replaces it with a 32-bit StatusCode whose Good is 0).
Silent data-shape loss — the integration failure in which a spectrum, chromatogram, or LC-MS run is reduced to a summary with the raw discarded, breaking reproducibility and ALCOA Original.
Semantic triple / IRI / QUDT — expressing a reading as subject-predicate-object facts in a shared vocabulary; an IRI (internationalized resource identifier) is a globally unique name for a probe or batch, and QUDT pins a value's unit and quantity kind as machine-readable identifiers rather than a column header.
SHACL / competency question — a SHACL (Shapes Constraint Language) shape gates a constraint such as "every value carries a unit"; a competency question is a question the data must be able to answer (e.g. every not-Good DO reading for a batch), answerable by one SPARQL query.
Leave-one-batch-out cross-validation — validating a chemometric model by holding out whole batches and scoring only on the unseen ones, so the reported accuracy reflects real generalization and is not inflated by within-batch data leakage.
Applicability domain (AD) — the region of spectra a model was trained on; a reading outside it forces the model to extrapolate and should be flagged (by Hotelling T² / SPE) before its prediction is trusted.
Model drift vs. process drift — a stale model (caught by a control chart on the prediction residual) versus a wandering process (caught by a control chart on the process signal); distinct failures needing distinct monitors.
IQ / OQ / PQ — Installation, Operational, and Performance Qualification: the three-stage proof that an instrument was installed, operates, and performs correctly before it produces GMP data.
Calibration transfer — the explicit step that re-verifies or re-calibrates a chemometric model when a PAT method moves to a new site, scale, or probe, since the spectra shift enough to break the original calibration.
ALCOA+ — the data-integrity principles (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available) a GMP record must satisfy; discarding a raw file violates Original and Complete.
21 CFR Part 11 / audit trail — the electronic-records-and-signatures rule (with its EU counterpart Annex 11) requiring a secure, time-stamped, tamper-evident log of who changed a record, when, and why.
Computer Software Assurance (CSA) — the FDA's risk-based successor to exhaustive computerized-system validation, spending rigorous scripted testing on patient-impacting functions and lighter checks on trivial ones.

Where this leads

We now have raw signals pouring out of probes and instruments — but a signal is not yet a controlled process or a stored record. Between the sensor and the database sits the automation layer. The next chapter, Automation and Process Control Data, introduces the controllers — PLCs, DCS, and SCADA — that read these instrument signals, act on them, and in doing so generate a whole new class of data: setpoints, alarms, events, and recipes. There we will meet the ISA-88 batch-control standard and see how this control and recipe data becomes the backbone of the electronic batch record.
