The Lifecycle of a Data Point
Where we are: In The Biologic and Its Data Shadow (Chapter 1) we met the data shadow; now we follow one data point through its entire lifetime, the journey that is the spine on which all of data management hangs.
In the last chapter we met the data shadow: the sensor traces, batch records, test results, and signatures that trail every biologic and are as essential to the product as the molecule itself. A shadow, though, is not one thing. It is millions of individual data points, each born somewhere, doing some job, then aging into a record that must survive for years. To understand the shadow, we have to follow a single point through its entire life.
That life has a shape, and regulators have a name for it: the data lifecycle. They define it as roughly all phases in the life of the data, from generation and recording through processing, use, retention, archive/retrieval, and destruction [1]. Every point you will ever manage travels this same road.
A data point is like a photograph from a wedding. Someone has to take it (generation). It gets developed and cropped (processing). It only means something once you label who, where, and when (contextualization). People look at it to decide things (use). It goes into an album you can find later (retention and archival). And eventually, decades on, it may be thrown away (disposal). A loose, unlabeled photo in a drawer is almost useless, and so is a number with no story attached.
What this chapter covers
We will trace one measurement, a pH reading of 7.0, from the instant it is created to the day it is destroyed. Along the way we meet the difference between raw and processed data, the all-important idea of metadata, and the reason a bare number is just noise. We finish with the "four V's" that make bioprocess data genuinely hard, and the gap that the rest of this book exists to close.
The road every data point travels
A data point is generated the moment something measures the world. Our pH probe sits in a bioreactor, the tank where living cells grow the antibody. Today that is usually a single-use, plastic-and-film vessel, such as a Sartorius Ambr, a Thermo Scientific HyPerforma single-use bioreactor (S.U.B.), an Eppendorf BioBLU, or a Sartorius Biostat STR. The probe senses the acidity of the broth and reports 7.0. That instant of creation is also a moment of capture: the value must be recorded somewhere durable, or it never existed at all. Capture happens automatically (a sensor writing to a control system) or by hand (an analyst entering a result into a logbook). The captured value lands in some format, often a vendor-specific one: a .ch file from an Agilent ChemStation HPLC, an .eds file from an Applied Biosystems qPCR instrument (or an .rdml export in the open Real-time PCR Data Markup Language standard), or a proprietary historian record from the bioreactor controller. A growing push toward open standards like AnIML (Analytical Information Markup Language) and the Allotrope Data Format (ADF), from the Allotrope Foundation, aims to make that captured data readable decades from now, regardless of which instrument made it.
Next comes processing: raw signals are converted, averaged, calibrated, or calculated into a usable result. Then contextualization attaches meaning. Then the point is reviewed and used: a human or algorithm checks it and acts on it. Findings are reported. Finally the record enters retention and archival, where it sits, retrievable, until its lawful disposal.
The data lifecycle: the seven stages every measurement passes through, from the probe to the shredder.
How long does retention last? For medicines, a long time. Current Good Manufacturing Practice, the legally enforced rules for how drugs are made, requires that records be kept and remain readable well beyond batch release. In the United States, 21 CFR 211.180 (eCFR Title 21, Part 211, Subpart J on records and reports) sets the floor at one year after the batch's expiry date, and that is a minimum. The European Union's GMP expectations (EudraLex Volume 4, Chapter 4 on documentation, with EU Annex 11, the European counterpart to the US 21 CFR Part 11, governing how electronic records stay readable) require batch documentation to be kept until at least one year after the batch's expiry date, or at least five years after the Qualified Person certifies the batch, whichever is longer. Japan and other regions run their own schedules. One year is simply the floor for a US-regulated product, and many companies retain far longer by business policy or regional law. Crucially, a data point's owner and the controls over it follow it across every stage, not just while it is fresh [3].
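As a rough illustration of the two floors just described, the sketch below computes the earliest permissible disposal date under each rule. The function names, the example dates, and the leap-day handling are assumptions made for illustration. It is a simplified reading of the rules as summarized above, not a substitute for the regulations or for a company's own retention schedule.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Add whole years; map 29 Feb onto 28 Feb when the target year is not a leap year."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def us_minimum_retention(expiry: date) -> date:
    """US floor as summarized above: one year after batch expiry."""
    return add_years(expiry, 1)


def eu_minimum_retention(expiry: date, qp_certified: date) -> date:
    """EU floor as summarized above: the later of expiry + 1 year and certification + 5 years."""
    return max(add_years(expiry, 1), add_years(qp_certified, 5))


# Hypothetical batch dates, chosen only to exercise both branches.
expiry = date(2025, 6, 30)
qp_certified = date(2022, 7, 15)

print(us_minimum_retention(expiry))                # 2026-06-30
print(eu_minimum_retention(expiry, qp_certified))  # 2027-07-15
```

In this made-up case the five-year clock from certification outlasts the one-year clock from expiry, which is why the EU date falls later than the US one.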
Raw, processed, and the metadata that makes it real
Our 7.0 is, at birth, raw data: the original, unaltered values exactly as the instrument first recorded them [1]. Raw data is sacred. From it we derive processed data: the calibrated, averaged, or calculated results people actually use. The two are different objects, and regulators insist you keep the raw form so any result can be re-traced to its source [6].
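A minimal sketch of that separation, assuming four hypothetical raw pH readings and a simple mean as the processing step. The raw values sit in an immutable tuple, and the processed result carries a reference back to them rather than replacing them. The field names are illustrative, not taken from any particular system.

```python
from statistics import mean

# Hypothetical raw readings, stored in a tuple so they cannot be edited in place.
RAW_READINGS = (7.02, 6.98, 7.01, 6.99)


def process(raw: tuple) -> dict:
    """Derive a processed result while keeping a reference to its raw source."""
    return {
        "result": round(mean(raw), 2),
        "operation": "mean of raw readings",
        "n_readings": len(raw),
        "raw_source": raw,  # the raw data is kept alongside, not overwritten
    }


processed = process(RAW_READINGS)
print(processed["result"], "derived from", processed["n_readings"], "raw readings")
```

Because the processed record points at its source, anyone reviewing the result can recompute it from the untouched raw values.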

An analyst draws a sample from a fed-batch culture: the physical act that creates an offline data point, which must then be captured, contextualized, and linked to the batch.
Fed-batch sampling. Image by Luis Fernando Flores LAB, licensed under CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/), via Wikimedia Commons; used unmodified. This image is licensed under CC BY-SA 4.0 and may be reused under the same license; this license applies to the image only, not to the rest of this book.
Surrounding both is metadata, literally "data about data": the information that gives a value its meaning and its history [1]. The unit, the timestamp, the instrument's identity and calibration state, who recorded it, the method used: all of it is metadata. If a chromatography instrument reports a peak area of 4527.3, the metadata that travels with it includes the unit (mAU·s), the timestamp, the instrument ID, the calibration state, the batch ID, the method, the operator, and the substance being measured. Strip all of that away and the number is orphaned.
Two more terms set up a later part of the book. The original record is the first durable capture of the data, in the format it was created [2]. A true copy is an exact, verified reproduction, including its metadata, that preserves the full meaning and can stand in for the original [4]. The difference between an original, a copy, and a corrupted half-copy is the difference between a defensible batch and a rejected one.
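One common way to check that a copy is exact is to compare cryptographic fingerprints of the value and its metadata together. The sketch below assumes records held as simple dictionaries and uses SHA-256 over a canonical JSON form. The instrument ID is hypothetical, and a matching hash is only one possible verification technique, not a method the guidance prescribes.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Hash a record's value and metadata together, using a canonical key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_exact_copy(copy: dict, source: dict) -> bool:
    """True only if every field, metadata included, matches the source."""
    return fingerprint(copy) == fingerprint(source)


original = {
    "value": 4527.3,
    "unit": "mAU·s",
    "instrument_id": "HPLC-01",  # hypothetical identifier
    "batch_id": "L-22-0417",
    "timestamp": "2022-06-10T06:14:32Z",
}
faithful_copy = dict(original)   # value and metadata both carried over
half_copy = {"value": 4527.3}    # the number survived, its context did not

print(is_exact_copy(faithful_copy, original))  # True
print(is_exact_copy(half_copy, original))      # False
```

The half-copy fails even though the number itself is intact, which mirrors the point above: a copy without its metadata does not preserve the full meaning.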
In Chapter 1 we met ALCOA: Attributable, Legible, Contemporaneous, Original, Accurate. Regulators now extend it to ALCOA+, appending four further qualities to those same five: Complete, Consistent, Enduring, and Available [2]. (The full treatment lives in Data Integrity and ALCOA+.) Notice how naturally these map onto the lifecycle: attributable and contemporaneous are about capture; enduring and available are about retention; original and accurate are about the raw-versus-processed boundary.
Why a number alone is noise
This is the core of the chapter. What is 7.0?
It could be a pH. It could be 7.0 grams of glucose per liter, 7.0 million cells per milliliter, or seven o'clock. By itself it carries no truth, only a digit. Data plus context equals information [7]. To turn our reading into something a person can trust and act on, we must bind to it:
- a unit: pH (so we know what quantity it measures);
- a timestamp: 06:14 on day 7 of the run (so we know when, which helps show it was recorded contemporaneously);
- an equipment ID: Bioreactor BR-204, probe PRB-17, last calibrated yesterday;
- a batch ID: the specific lot of drug substance this belongs to;
- a method: the standard procedure that says how the reading is taken.
Only now is 7.0 information: "the culture in BR-204 held pH 7.0 at hour 150 of batch L-22-0417, by a calibrated probe, per method SOP-pH-03." In a real system that same fact is stored not as a sentence but as a structured record, every field carrying one piece of the context:
{
  "measurement": "pH",
  "value": 7.0,
  "unit": "pH units",
  "timestamp": "2022-06-10T06:14:32Z",
  "equipment_id": "BR-204",
  "sensor_id": "PRB-17",
  "batch_id": "L-22-0417",
  "method": "SOP-pH-03",
  "recorded_by": "analyst_15"
}
That sentence, and that record, can support a decision. The bare 7.0 cannot. This is why contextualization is not paperwork: it is what converts a measurement into evidence [3]. A regulatory inspector reviewing your batch file will not accept a bare number; they will ask you to prove where it came from, when, and under what conditions, which you cannot do without that context.
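The binding described above can be sketched as a simple completeness check: before a value is treated as information, confirm that each piece of context is present. The list of required fields follows the five items named in this section. The function name and the dictionary representation are assumptions for illustration.

```python
# The five pieces of context named in this section.
REQUIRED_CONTEXT = ("unit", "timestamp", "equipment_id", "batch_id", "method")


def missing_context(record: dict) -> list:
    """Return the names of required context fields that are absent or empty."""
    return [field for field in REQUIRED_CONTEXT if not record.get(field)]


contextualized = {
    "measurement": "pH",
    "value": 7.0,
    "unit": "pH units",
    "timestamp": "2022-06-10T06:14:32Z",
    "equipment_id": "BR-204",
    "sensor_id": "PRB-17",
    "batch_id": "L-22-0417",
    "method": "SOP-pH-03",
    "recorded_by": "analyst_15",
}
bare_number = {"value": 7.0}

print(missing_context(contextualized))  # []
print(missing_context(bare_number))     # ['unit', 'timestamp', 'equipment_id', 'batch_id', 'method']
```

A real system would enforce far more than field presence (valid units, plausible timestamps, known equipment), but even this small check separates the structured record from the orphaned 7.0.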
A subtle but critical rule: you cannot quietly delete a number you dislike. Failing, suspect, and out-of-specification results are data too, and they must be retained and reviewed alongside the rest, never discarded to make a batch look clean [4]. The lifecycle keeps the inconvenient points as faithfully as the flattering ones.
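In code, that rule amounts to flagging rather than filtering. The sketch below assumes hypothetical specification limits of pH 6.8 to 7.2 and made-up readings. Every result comes back from the review, with out-of-specification points marked for attention instead of removed.

```python
def review(results: list, low: float, high: float) -> list:
    """Return every result with an in_spec flag added; nothing is dropped."""
    return [{**r, "in_spec": low <= r["value"] <= high} for r in results]


# Hypothetical readings; the second one is out of specification.
readings = [
    {"id": 1, "value": 7.0},
    {"id": 2, "value": 7.6},
    {"id": 3, "value": 6.9},
]

reviewed = review(readings, low=6.8, high=7.2)
flagged = [r["id"] for r in reviewed if not r["in_spec"]]

print(len(reviewed), "results retained;", "flagged for review:", flagged)
```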
The four V's, made concrete
Bioprocess data is demanding along four dimensions: the "four V's," a lens borrowed from the broader world of big data. Made real, they look like this.
Volume. A single bioreactor run carries ten to twenty probes sampling every few seconds for one to three weeks; add offline lab assays, imaging, and genomics, and one batch can generate gigabytes of structured and unstructured records.
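To make the probe portion of that figure concrete, here is a back-of-the-envelope estimate. The probe count and run length sit within the ranges given above; the five-second interval and the 100 bytes per stored sample (value plus timestamp and tags) are assumptions picked purely for illustration.

```python
PROBES = 20              # upper end of the "ten to twenty" range
SAMPLE_INTERVAL_S = 5    # assumed reading of "every few seconds"
RUN_DAYS = 14            # a two-week run, within "one to three weeks"
BYTES_PER_SAMPLE = 100   # assumed storage cost per sample, including its context

samples_per_probe = RUN_DAYS * 24 * 3600 // SAMPLE_INTERVAL_S
total_samples = PROBES * samples_per_probe
total_gb = total_samples * BYTES_PER_SAMPLE / 1e9

print(total_samples)       # 4838400
print(round(total_gb, 2))  # 0.48
```

Under these assumptions the probe traces alone come to nearly five million points and roughly half a gigabyte, before any offline assays, imaging, or genomics are counted.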
Velocity. Some of that data arrives in real time and must be acted on now: a pH excursion left for tomorrow's review may have already spoiled today's cells.
Variety. The data comes in many shapes: continuous sensor traces, single lab results, free-text operator notes, chromatograms, electronic signatures. Some is machine-generated, some hand-entered, and both must be governed under one consistent set of integrity rules [8].
Veracity. Every point must be trustworthy (genuinely attributable, accurate, and complete) because patients' safety rests on it [2]. Veracity is not subjective: it can be defined, scored, and even monitored automatically across this flood of heterogeneous data [8].
Why it matters
If the lifecycle is the spine, then managing data is managing that spine end to end. A measurement that is captured but never contextualized is unusable. One that is used but never retained is indefensible. One that is retained but not as a faithful original or true copy is worthless to an inspector. Good data management is simply the discipline of carrying every point cleanly through all seven stages, with its metadata intact, for its entire required life [5].
In the real world
Industry learned the hard way that managing this lifecycle is not optional. The ISPE GAMP guidance on records and data integrity treats the full life cycle, from the moment data is born to the day it is destroyed, as the spine of the quality system, the unit that controls must follow [5]. The general data-management body of knowledge, DAMA-DMBOK, supplies the vocabulary that any industry uses to describe this journey: capture, contextualization, use, retention, and archival [7]. And regulators go further: they ask manufacturers to map their data flows, drawing point by point where each measurement is born, where it travels, and where it could be altered or lost [3].
And here is the gap that drives the rest of this book. Our pH point was simple. But a real batch creates data in dozens of places, in dozens of shapes, on systems that often do not speak to one another: a process-control system (a Siemens or Emerson DCS) here, a chromatography data system (Waters Empower) there, a LIMS, a historian, a Manufacturing Execution System (such as Siemens Opcenter or Dassault Systèmes DELMIA), and a partner's spreadsheet somewhere else. Each captures its own islands of data with its own metadata conventions. Standards exist to give those islands a common shape (ANSI/ISA-88 defines how batch and recipe data are structured, and ISA-95 defines how plant-floor data connects to the business systems above it), but mapping every real instrument onto them is the hard part. Stitching those islands into one trustworthy, connected record, so the whole shadow can be read as a single story, is the central problem of biopharmaceutical data management. This data-integration problem is exactly the kind of challenge that public-private institutes like the U.S. NIIMBL aim to tackle as the field modernizes. Alongside that mission, NIIMBL's neighboring SABRE (Securing American Biomanufacturing Research and Education) Center, a pilot-scale cGMP biomanufacturing and workforce-training facility that remains under construction at the University of Delaware, is intended to scale up and de-risk advanced biomanufacturing and train the people who will run it.
Key terms
- Data lifecycle: the full journey of a data point through generation/capture, processing, contextualization, review/use, reporting, retention/archival, and disposal.
- Raw data: the original, unaltered values exactly as first recorded by an instrument or person.
- Processed data: calibrated, averaged, or calculated results derived from raw data.
- Metadata: "data about data"; the unit, timestamp, equipment, batch, method, and authorship that give a value meaning and history.
- Original record: the first durable capture of data, in the format it was created.
- True copy: an exact, verified reproduction (including metadata) that preserves full meaning and can stand in for the original.
- Contextualization: attaching unit, time, equipment, batch, and method so a number becomes information.
- Information: data plus context; a number you can actually trust and act on.
- ALCOA+: the ALCOA qualities from Chapter 1 (Attributable, Legible, Contemporaneous, Original, Accurate) extended with four more: Complete, Consistent, Enduring, Available.
- Retention: keeping records readable for their required life (for medicines, at least one year past batch expiry, often longer).
- The four V's: volume, velocity, variety, and veracity; the dimensions that make bioprocess data hard.
Where this leads
We have followed one point through its life, but we treated its birthplace as a single probe. In reality, a biologic is built across many unit operations, each one a busy factory of measurements. In the next chapter, A Tour of Where Process Data Is Born, we walk the whole monoclonal-antibody process (upstream, downstream, fill-finish, and quality control), reframed entirely as a chain of data-generating stations, so you can see exactly where every point in the shadow first draws breath.