The Lifecycle of a Data Point

📍 Where we are: Part I, Chapter 2 — In The Biologic and Its Data Shadow (Chapter 1) we met the data shadow; now we follow one data point through its entire lifetime — the journey that is the spine on which all of data management hangs.

In the last chapter we met the data shadow — the sensor traces, batch records, test results, and signatures that trail every biologic and are as essential to the product as the molecule itself. A shadow, though, is not one thing. It is millions of individual data points, each born somewhere, doing some job, then aging into a record that must survive for years. To understand the shadow, we have to follow a single point through its entire life.

That life has a shape, and regulators have a name for it: the data lifecycle. They define it as all phases in the life of the data — from generation and recording through processing, use, retention, archive/retrieval, and destruction [1]. Every point you will ever manage travels this same road.

The simple version

A data point is like a photograph from a wedding. Someone has to take it (generation). It gets developed and cropped (processing). It only means something once you label who, where, and when (contextualization). People look at it to decide things (use). It goes into an album you can find later (retention and archival). And eventually, decades on, it may be thrown away (disposal). A loose, unlabeled photo in a drawer is almost useless — and so is a number with no story attached.

What this chapter covers

We will trace one measurement, a pH reading of 7.0, from the instant it is created to the day it is destroyed. Along the way we meet the difference between raw and processed data, the all-important idea of metadata, and the reason a bare number is just noise. We dissect the six fields that have to travel with a reading before it counts as evidence, then turn to the "four V's" that make bioprocess data genuinely hard, and look at how the lifecycle treats a point that is missing or untrustworthy. We finish by naming the one part of this flow that is still, honestly, unsolved: the gap the rest of this book exists to close.

The seven stages: from probe to archive

A data point is generated the moment something measures the world. Our pH probe sits in a bioreactor, the tank where living cells grow the antibody (the physical step covered in the production-bioreactor chapter of Book 1). Today that tank is usually a single-use, plastic-and-film vessel, such as a Sartorius Ambr, a Thermo Scientific HyPerforma single-use bioreactor (S.U.B.), an Eppendorf BioBLU, or a Sartorius Biostat STR. The probe senses the acidity of the broth (the liquid cell culture) and reports 7.0. That instant of creation is also a moment of capture: the value must be recorded somewhere durable, or for every practical purpose it never existed. Capture happens automatically (a sensor writing to a control system, the automation that reads each sensor and adjusts pumps and valves to hold the process on target) or by hand (an analyst entering a result into a logbook).

The captured value lands in some format, often a vendor-specific one: a .ch file from an Agilent ChemStation HPLC (a chromatography analyzer), an .eds file from an Applied Biosystems qPCR (real-time PCR) instrument, or a proprietary record from the bioreactor controller's historian (the time-series database that archives every sensor tag). Open alternatives exist for some instruments, such as the .rdml export defined by the Real-time PCR Data Markup Language standard. A growing push toward open standards like AnIML (Analytical Information Markup Language, developed under ASTM) and the Allotrope Foundation's Allotrope Data Format (ADF) aims to make captured data readable decades from now, regardless of which instrument made it.

Next comes processing: raw signals are converted, averaged, calibrated, or calculated into a usable result. Then contextualization attaches meaning. Then the point is reviewed and used — a human or algorithm checks it and acts on it. Findings are reported. Finally the record enters retention and archival, where it sits, retrievable, until its lawful disposal.

Seven-stage data lifecycle drawn as a two-row flow: generation and capture, processing, and contextualization (capture and prepare), then review and use and reporting (use), then retention and archival (retain), and finally disposal (retire). The data lifecycle: the seven stages every measurement passes through, from the probe to the shredder. Original diagram by the authors, created with AI assistance.

How long does retention last? For medicines, a long time. Current Good Manufacturing Practice, the legally enforced rules for how drugs are made, requires that records be kept and remain readable well beyond batch release. In the United States, the specific rule is 21 CFR 211.180(a), which sits in Subpart J (records and reports) of Part 211 of Title 21 of the Code of Federal Regulations. It requires records to be kept for at least one year after the batch's expiry date or, for certain over-the-counter products that are exempt from expiration dating, three years after distribution of the batch. That is a floor, not a target. The European Union runs longer still. Its GMP expectations (EudraLex Volume 4, Chapter 4 on documentation, together with Annex 11 on computerised systems, the European counterpart to US 21 CFR Part 11 on electronic records and signatures, which governs how electronic records stay readable) require batch manufacturing and packaging records to be kept until at least one year after the batch's expiry date, or at least five years after the Qualified Person (the EU-designated person who legally certifies each batch for release) certifies the batch, whichever is longer. Japan and other regions run their own schedules, and many companies retain far longer by business policy or regional law.
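The two retention floors reduce to simple date arithmetic. The sketch below is illustrative only: the function names, the example dates, and the leap-day handling are assumptions, and the US function covers only the case of a batch with an expiry date.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years; 29 Feb falls back to 28 Feb (an assumption)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:  # 29 Feb in a non-leap target year
        return d.replace(year=d.year + years, day=28)


def us_retention_floor(expiry: date) -> date:
    """Earliest disposal date under a one-year-after-expiry floor."""
    return add_years(expiry, 1)


def eu_retention_floor(expiry: date, qp_certified: date) -> date:
    """Later of (expiry + 1 year) and (QP certification + 5 years)."""
    return max(add_years(expiry, 1), add_years(qp_certified, 5))


# Hypothetical batch: certified 1 July 2022, expiring 30 June 2024.
expiry = date(2024, 6, 30)
certified = date(2022, 7, 1)
print(us_retention_floor(expiry))             # 2025-06-30
print(eu_retention_floor(expiry, certified))  # 2027-07-01
```

For this hypothetical batch the five-year certification clock, not the expiry date, sets the EU floor. These dates are minimums; a company policy or another region's rule can only push them later.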

Readability is not free — a proprietary .ch or historian archive is readable only while its original software survives, so long-retention strategy means either maintaining validated legacy systems or migrating raw data into a vendor-neutral form like AnIML/ADF, which is one reason those open formats matter as much for retention as for capture. Crucially, a data point's owner and the controls over it follow it across every stage, not just while it is fresh [3].

Three objects, one record: raw, processed, and metadata

Raw, processed, and the metadata that makes it real

Our 7.0 is, at birth, raw data — the original, unaltered values exactly as the instrument first recorded them [1]. Raw data is sacred. From it we derive processed data: the calibrated, averaged, or calculated results people actually use. The two are different objects, and regulators insist you keep the raw form so any result can be re-traced to its source [6]. For chromatography and other reprocessable data, raw data is not the printed result but the complete electronic data file — the digitized detector signal plus the integration method and audit trail — because the same injection can be re-integrated into a different answer; regulators treat that whole file, not the report, as the original record [4].

The same raw-versus-processed boundary appears, sharper still, one step downstream. After the bioreactor, the antibody is purified — and the first purification step, Protein A capture (the affinity-chromatography step that fishes the antibody out of the clarified harvest, covered in the capture chapter of Book 1), is itself a busy generator of data points. Its central in-process datum is the UV A280 trace — the ultraviolet absorbance at 280 nm that tracks how much protein flows out of the column — and the raw form is the continuous chromatogram (the same digitized detector signal, integration method, and audit trail), while the processed result is a derived number such as the load challenge in grams of antibody per litre of resin or the percentage of product breakthrough. A bare "1820" off that detector is exactly as orphaned as a bare 7.0: it needs the unit (mAU), the timestamp, the column and skid IDs, the resin lot, the method, and the batch_id before it can prove the column was loaded within its validated dynamic binding capacity. Two operational facts ride along with that data point and are pure metadata in lifecycle terms. First, the chromatography skid and its detector are not trusted on faith — they pass IQ/OQ/PQ (Installation, Operational, and Performance Qualification — documented proof the equipment was installed right, operates right, and performs right for its real workload), so the instrument's calibration and qualification state is part of every reading's context, and how that proof is produced — increasingly through risk-based Computer Software Assurance (CSA) rather than exhaustive Computerized System Validation (CSV) — is the subject of Validating Computerized Systems. 
Second, when a process moves from the development lab to the manufacturing plant — tech transfer and scale-up — the same six-field record shape must survive the move even though the column is now a hundred times larger; the metadata conventions are what let a small-scale data point and a commercial-scale one be compared at all. The lifecycle of a downstream data point is identical to an upstream one; only the instrument changes.

An analyst in a laboratory drawing a liquid sample from a fed-batch bioreactor culture

An analyst draws a sample from a fed-batch culture — the physical act that creates an offline data point, which must then be captured, contextualized, and linked to the batch.

Fed-batch sampling. Image by Luis Fernando Flores LAB, licensed under CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/), via Wikimedia Commons; used unmodified. This image is licensed under CC BY-SA 4.0 and may be reused under the same license; this license applies to the image only, not to the rest of this book.

The metadata that makes meaning: unit, time, equipment, batch, method, operator

Surrounding both is metadata — literally "data about data": the information that gives a value its meaning and its history [1]. The unit, the timestamp, the instrument's identity and calibration state, who recorded it, the method used — all metadata. If a chromatography instrument reports a peak area of 4527.3, the metadata that travels with it includes the unit (mAU·s), the timestamp, the instrument ID, the calibration state, the batch ID, the method, the operator, and the substance being measured. Strip all of that away and the number is orphaned. Where each of these pieces of metadata is physically produced — which sensor, which lab instrument, which operator entry — is the subject of Instruments and Sensors as Data Sources; here we care only that they must all be present and bound to the value.

Two more terms set up a later part of the book. The original record is the first durable capture of the data, in the format it was created [2]. A true copy is an exact, verified reproduction — including its metadata — that preserves the full meaning and can stand in for the original [4]. The difference between an original, a copy, and a corrupted half-copy is the difference between a defensible batch and a rejected one.

note

In Chapter 1 we met ALCOA — Attributable, Legible, Contemporaneous, Original, Accurate. Regulators now extend it to ALCOA+, appending four further qualities to those same five: Complete, Consistent, Enduring, and Available [2]. (The full treatment lives in Data Integrity and ALCOA+.) Notice how naturally these map onto the lifecycle: attributable and contemporaneous are about capture; enduring and available are about retention; original and accurate are about the raw-versus-processed boundary.

Anatomy of a contextualized reading: six fields that travel together

Why a number alone is noise

This is the core of the chapter. What is 7.0?

It could be a pH. It could be 7.0 grams of glucose per liter, 7.0 million cells per milliliter, or seven o'clock. By itself it carries no truth — only a digit. Data plus context equals information [7]. To turn our reading into something a person can trust and act on, we must bind five pieces of context to the value:

a unit — pH (so we know what dimension it measures);
a timestamp — 06:14 on day 7 of the run, which is hour 150 counting from the run's start (so we know when, which proves it was recorded contemporaneously);
an equipment ID — Bioreactor BR204, probe PRB-17, last calibrated yesterday;
a batch ID — the specific lot of drug substance this belongs to;
a method — the standard procedure that says how the reading is taken.

The value plus those five contexts is the six-part record we will dissect below. Only now is 7.0 information: "the culture in BR204 held pH 7.0 at hour 150 of batch L-22-0417, by a calibrated probe, per method SOP-pH-03." Attaching that context is itself a build step — the open-source companion shows how raw tags are joined to recipe, equipment, and batch facts in its contextualization chapter. In a real system that same fact is stored not as a sentence but as a structured record, every field carrying one piece of the context:

```json
{
  "measurement": "pH",
  "value": 7.0,
  "unit": "pH",
  "timestamp": "2022-06-10T06:14:32Z",
  "equipment_id": "BR204",
  "sensor_id": "PRB-17",
  "batch_id": "L-22-0417",
  "method": "SOP-pH-03",
  "recorded_by": "analyst_15"
}
```

That sentence — and that record — can support a decision. The bare 7.0 cannot. This is why contextualization is not paperwork — it is what converts a measurement into evidence [3]. A regulatory inspector reviewing your batch file will not accept a bare number; they will ask you to prove where it came from, when, and under what conditions — which you cannot do without that context.

The six fields, dissected

The JSON above is the full contextualized record. Strip it to the load-bearing minimum a historian must carry on every row and you reach six fields. Note two changes from the full record: equipment_id, sensor_id, and method collapse into the structured tag, and we add a quality flag that the full record omitted. A trustworthy reading is then six fields, each carrying one piece of the story. They are easiest to see laid out as an identity card.

timestamp — when the value was true, recorded in a single unambiguous clock (here, UTC). This is what proves the reading was captured contemporaneously rather than back-filled.
tag — the signal's identity, structured rather than opaque. BR204.pH.PV decodes as <asset>.<measurement>.<role>: the asset BR204, the measurement pH, and the role .PV — the process value (what the culture actually did) as opposed to .SP, the setpoint (what the recipe asked for), and .MV/.OUT, the controller's output to the actuator (e.g. the base-addition pump command, since adding base raises pH) — the PV–SP–MV trio is how every control loop appears in the historian.
value — the measurement itself, 7.0 — meaningless on its own, which is the whole point.
unit — the dimension, pH, without which 7.0 could be glucose, cell density, or a clock time.
quality — the trust flag: Good, Uncertain, or Bad. A reading you cannot vouch for is not the same as a good one, and the only honest place to record that doubt is beside the value itself.
batch_id — the join key, L-22-0417: the one field that lets this reading be rejoined to every other record from the same lot.
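The six fields and the tag convention can be written down as a small typed record. This is a minimal sketch, not the companion volume's schema: the class name, the validation rules, and the `decode_tag` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Role suffixes as described above; the wording of the labels is illustrative.
ROLES = {"PV": "process value", "SP": "setpoint",
         "MV": "controller output", "OUT": "controller output"}
QUALITIES = {"Good", "Uncertain", "Bad"}


@dataclass(frozen=True)
class SensorReading:
    timestamp: datetime
    tag: str
    value: float
    unit: str
    quality: str
    batch_id: str

    def __post_init__(self):
        # Refuse a record that is missing part of its context.
        if self.timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        if self.quality not in QUALITIES:
            raise ValueError(f"unknown quality flag: {self.quality!r}")
        if not self.unit or not self.batch_id:
            raise ValueError("unit and batch_id are required")


def decode_tag(tag: str) -> dict:
    """Split <asset>.<measurement>.<role> into its three parts."""
    asset, measurement, role = tag.split(".")
    return {"asset": asset, "measurement": measurement, "role": ROLES[role]}


reading = SensorReading(
    timestamp=datetime(2022, 6, 10, 6, 14, 32, tzinfo=timezone.utc),
    tag="BR204.pH.PV", value=7.0, unit="pH", quality="Good",
    batch_id="L-22-0417",
)
print(decode_tag(reading.tag))
# {'asset': 'BR204', 'measurement': 'pH', 'role': 'process value'}
```

Making the record immutable (`frozen=True`) mirrors the idea that a captured reading is not edited in place; any correction would be a new record.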

An identity-card diagram dissecting a contextualized pH reading into six fields — timestamp, tag BR204.pH.PV, unit pH, quality flag Good, and batch_id L-22-0417 — with a green core block showing the value 7.0 paired with its unit and a violet panel decoding the tag name into asset, measurement, and role. The six fields that turn a bare 7.0 into evidence: strip any one away and the number is orphaned; carried together they make the reading defensible. Original diagram by the authors, created with AI assistance.

This is not a textbook abstraction. The open-source companion volume builds exactly this record as a single database row: the upstream-bioreactor chapter of Book 3 defines a ts.sensor_reading table whose columns are ts, tag, value, unit, quality, batch_id. These are the same six fields, now concrete SQL, with batch_id stamped on every row precisely so the historian and the GMP batch record can be rejoined. The conceptual card here and the table there are the same artifact at two altitudes; turning the first into the second cleanly, every time, across every instrument, is the discipline this whole series is about.

The same record as a machine-readable triple

The JSON row and the SQL row are two writings of one fact; there is a third, and it is the one that lets different systems agree on what the fact means. In a knowledge graph the reading becomes a small bundle of RDF triples, statements of the form subject, predicate, object, where the subject is a globally unique name (an IRI, an Internationalized Resource Identifier: a web-scale name that means the same thing in every system, unlike a local key) rather than a row number. Written in Turtle (a compact RDF text syntax):

```turtle
@prefix bp:   <https://example.org/bioproc#> .
@prefix qudt: <http://qudt.org/schema/qudt/> .
@prefix unit: <http://qudt.org/vocab/unit/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

bp:reading-BR204-pH-h150 a bp:pHMeasurement ;
    bp:ofBatch       bp:L-22-0417 ;            # the join key, now a resolvable IRI
    bp:onEquipment   bp:BR204 ;
    bp:byProbe       bp:PRB-17 ;
    bp:atTime        "2022-06-10T06:14:32Z"^^xsd:dateTime ;
    bp:hasQuality    "Good" ;
    bp:hasValue [ a qudt:QuantityValue ;        # the value never travels bare
        qudt:numericValue "7.0"^^xsd:float ;
        qudt:hasUnit unit:PH ] .
```

That qudt:hasUnit is the same "data plus context equals information" rule the chapter started with, made enforceable rather than hoped-for: the unit travels as a machine-readable IRI, not a string in a column header, so no system can misread 7.0 as glucose. (How QUDT — the Quantities, Units, Dimensions and Types vocabulary — and a global IRI keep a value self-describing is the whole subject of Book 4's Identifiers and Units chapter.) Two further moves close the loop with the rest of this chapter. First, the quality flag becomes a constraint a gate can check: a SHACL shape (the Shapes Constraint Language — a way to validate that graph data has the required structure) can require that every release reading carry a bp:hasQuality from the controlled set ("Good" "Uncertain" "Bad") and exactly one timestamp, so a malformed reading is rejected at the door — the same closed-world checklist Book 4's release gate runs over a lot's full CQA panel. Second, the islands question — "show me everything about this batch" — becomes a one-line SPARQL query (the query language for RDF, as SQL is for tables), and it is a competency question: a question the data model must be able to answer to be fit for purpose.

```sparql
# CQ: return every reading bound to batch L-22-0417, across all systems, in one query.
PREFIX bp:   <https://example.org/bioproc#>
PREFIX qudt: <http://qudt.org/schema/qudt/>
SELECT ?reading ?value ?unit ?when WHERE {
  ?reading bp:ofBatch bp:L-22-0417 ;
           bp:hasValue/qudt:numericValue ?value ;
           bp:hasValue/qudt:hasUnit ?unit ;
           bp:atTime ?when .
} ORDER BY ?when
```

The query answers in one pass only because every system minted bp:L-22-0417 as the same IRI — which is exactly the island join that breaks when the historian, the chromatography data system, and the LIMS each spell the batch differently. Modeling the reading this way also pins down what kind of thing it is: the pH value is a quality that inheres in the broth — a thing that persists and bears properties through time, a continuant in the upper-ontology sense — while the bioreactor run that produced it is an occurrent, a process that happens and is over. Keeping the two straight (the reading is about the broth, not the same thing as the run) is the provenance discipline that lets a lineage walk never cross from a batch into the vessel that held it. That whole semantic layer — globally unique identity, qualified values, and validation shapes — is what Semantic Interoperability and Ontologies and FAIR Data build later in this book, and what Book 4 develops in full as the engineered cure for the islands problem.
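The closed-world check that the text assigns to a SHACL shape can be illustrated in plain Python. This is not SHACL and not Book 4's release gate, only the same rule written as a function; the dictionary layout and the function name are assumptions.

```python
ALLOWED_QUALITY = {"Good", "Uncertain", "Bad"}


def release_gate_violations(reading: dict) -> list[str]:
    """Return the rule violations for one reading; an empty list means it passes.

    `reading` maps a predicate name to the list of values asserted for it,
    mimicking how a graph can hold zero, one, or many objects per predicate.
    """
    problems = []
    qualities = reading.get("hasQuality", [])
    if not qualities or any(q not in ALLOWED_QUALITY for q in qualities):
        problems.append("hasQuality must be present and drawn from Good/Uncertain/Bad")
    if len(reading.get("atTime", [])) != 1:
        problems.append("atTime must appear exactly once")
    return problems


ok = {"hasQuality": ["Good"], "atTime": ["2022-06-10T06:14:32Z"]}
bad = {"hasQuality": ["Fine"], "atTime": []}
print(release_gate_violations(ok))   # []
print(release_gate_violations(bad))  # two violations: bad flag, missing timestamp
```

The point of the closed-world framing is that a missing timestamp counts as a failure rather than as something merely unknown, so a malformed reading is rejected instead of silently accepted.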

caution

A subtle but critical rule: you cannot quietly delete a number you dislike. Failing, suspect, and out-of-specification results are data too, and they must be retained and reviewed alongside the rest, never discarded to make a batch look clean [4]. This is exactly what the quality flag is for: a reading marked Uncertain or Bad stays in the record and is reviewed with its doubt recorded beside it. The lifecycle keeps the inconvenient points as faithfully as the flattering ones.

The four Vs in one batch: volume, velocity, variety, veracity

Bioprocess data is demanding along four dimensions — the "four V's," a lens borrowed from the broader world of big data. Made real in a single batch, they look like this.

Volume. A single bioreactor run carries ten to twenty probes sampling every few seconds for one to three weeks — and that is only the production stage; the seed train that grew the cells beforehand was already logging its own traces. Those scalar probe traces are modest — a few hundred megabytes of time-series, not gigabytes. What pushes a batch into the gigabytes is the rich layers stacked on top: spectral scans (Raman and NIR, hundreds of wavelengths per acquisition), imaging, and genomics, alongside the offline lab assays. Add those, and one batch can generate gigabytes of structured and unstructured records.
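A back-of-envelope calculation shows why the scalar traces stay in the megabyte range. The figure of 50 bytes per stored row and the 5-second sampling interval are assumptions chosen for illustration; real historians compress, so actual sizes are often smaller.

```python
def trace_size_mb(probes: int, interval_s: float, days: float,
                  bytes_per_row: int) -> float:
    """Uncompressed size of scalar probe traces, in megabytes."""
    rows = probes * (days * 86_400 / interval_s)  # 86,400 seconds per day
    return rows * bytes_per_row / 1e6


# Small run: 10 probes for one week. Large run: 20 probes for three weeks.
low = trace_size_mb(probes=10, interval_s=5, days=7, bytes_per_row=50)
high = trace_size_mb(probes=20, interval_s=5, days=21, bytes_per_row=50)
print(f"{low:.0f} MB to {high:.0f} MB")  # 60 MB to 363 MB
```

Under these assumptions even the large run produces about 7.3 million rows, which is well short of a gigabyte. The spectral, imaging, and genomic layers are what change the order of magnitude.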

Velocity. Some of that data arrives in real time and must be acted on now — a pH excursion left for tomorrow's review may have already spoiled today's cells.

Variety. The data comes in many shapes: continuous sensor traces, single lab results, free-text operator notes, chromatograms, electronic signatures. Some is machine-generated, some hand-entered, and both must be governed under one consistent set of integrity rules [8].

Veracity. Every point must be trustworthy — genuinely attributable, accurate, and complete — because patients' safety rests on it [2]. Veracity is not subjective: it can be defined, scored, and even monitored automatically across this flood of heterogeneous data [8].

Why a model also needs the context, not just the number

The chapter's central claim — that a bare 7.0 is noise — is not only a regulatory point; it is the foundational premise of any machine-learning model trained on this data, and the six-field record is what makes the data learnable rather than misleading. Three of the fields earn their keep the moment a model touches the data. The batch_id is the field that prevents data leakage — the cardinal sin where information from the test set bleeds into training and inflates the score. Probe readings from one run are highly correlated, so a model evaluated by splitting rows at random will see readings from the same batch in both training and test and report a fictitiously good number; the honest evaluation is a grouped (leave-one-batch-out) cross-validation that keeps every reading from a batch on the same side of the split, and that split is only possible because batch_id was stamped on every row at capture. The quality flag is a label a model must respect: a reading marked Bad or Uncertain is a different kind of input from a Good one, and silently feeding all three to a model as if equal teaches it on data the plant itself does not trust. And the timestamp plus tag are what let a deployed model tell process drift (the living culture genuinely moving batch to batch) apart from model drift (the model going stale against an unchanged process) — a distinction the model cannot make without knowing which signal, on which equipment, at which instant each value belongs to. The same context that turns 7.0 into evidence for an inspector is what turns it into an admissible training example: a model trained on orphaned numbers learns coincidences, and the metadata is the difference between a model a regulator can trust and one that flatters itself. How that machinery is built — leak-free splits, drift detection, and the locked-model lifecycle that keeps a learning model validated under GMP — is the subject of Book 5's data-the-fuel and MLOps-and-lifecycle chapters.
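The grouped split described above can be sketched in a few lines of standard-library Python. The function name and the extra batch IDs (L-22-0418, L-22-0419) are hypothetical; this is an illustration of the idea, not Book 5's implementation.

```python
from collections import defaultdict


def leave_one_batch_out(rows):
    """Yield (held_out_batch, train_rows, test_rows), one fold per batch_id."""
    by_batch = defaultdict(list)
    for row in rows:
        by_batch[row["batch_id"]].append(row)
    for held_out in sorted(by_batch):
        test = by_batch[held_out]
        train = [r for b, rs in by_batch.items() if b != held_out for r in rs]
        yield held_out, train, test


rows = [
    {"batch_id": "L-22-0417", "value": 7.0},
    {"batch_id": "L-22-0417", "value": 7.1},
    {"batch_id": "L-22-0418", "value": 6.9},
    {"batch_id": "L-22-0419", "value": 7.2},
]
for held_out, train, test in leave_one_batch_out(rows):
    train_batches = {r["batch_id"] for r in train}
    assert held_out not in train_batches  # no batch sits on both sides
    print(held_out, len(train), len(test))
```

A random row-level split of the same four rows could place one L-22-0417 reading in training and the other in test, which is exactly the leakage the grouped split rules out. In practice a library routine such as scikit-learn's `LeaveOneGroupOut` does the same job, with `batch_id` passed as the group label.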

When a point is missing: gaps, the absent reading, and cautious imputation

The lifecycle so far assumed every point arrives. Real time-series have holes, and how you treat a hole is itself a data-integrity decision. The first move is to separate two failures that look alike on a chart but are opposites in meaning.

An absent reading is no row at all: the probe was offline, the network dropped, the historian's deadband (the swinging-door compression that only writes a new point when the signal moves enough, introduced in Plant Information Systems) legitimately suppressed an unchanged value, or nobody was scheduled to pull the offline sample yet. A Bad-quality reading is a row that exists but cannot be trusted: the value is there, but its quality flag reads Bad or Uncertain because the probe drifted out of range, a calibration lapsed, or a fault was detected. The distinction is not academic. A gap says we do not know; a Bad point says we measured, but do not believe it — and the honest record keeps both, rather than letting a chart's smooth line imply a continuity that never existed. This is why Book 5's OPC UA interface refuses to write a non-GOOD value silently: an out-of-domain prediction is demoted to UNCERTAIN and a sensor fault to BAD, and the receiving system raises an exception rather than storing a confident-looking number (the manufacturing-operations chapter builds exactly that gate).

Detecting a gap means knowing the expected cadence. A pH tag written every second is missing if ten seconds pass with no new point and the deadband cannot explain it; an offline titer sampled twice a day is not missing at hour 3, only overdue after its scheduled draw. So gap detection is cadence-aware: it compares the actual arrival pattern against the tag's declared sampling interval and its compression rule, and it treats a stale value — one whose timestamp has stopped advancing — as its own kind of hole. A value that arrives but sits outside the sensor's validated range is not a gap; it is a candidate for demotion to Bad, which keeps the row and records the doubt beside it, exactly as the caution above demands.
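A cadence-aware gap check can be sketched as a comparison between consecutive timestamps and the tag's declared interval. The function name and the `tolerance` multiplier are assumptions; the multiplier is a crude stand-in for the tag's compression rule, which a real historian would model explicitly.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=2.0):
    """Return (start, end) pairs where consecutive points are further apart
    than `tolerance` times the declared sampling interval.

    `tolerance` is a hypothetical knob: a deadband can legitimately stretch
    the spacing between stored points, so some slack is allowed.
    """
    limit = expected_interval * tolerance
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


t0 = datetime(2022, 6, 10, 6, 14, 0)
seconds = [0, 1, 2, 3, 15, 16]  # a 12-second hole after t0 + 3 s
stamps = [t0 + timedelta(seconds=s) for s in seconds]
for start, end in find_gaps(stamps, timedelta(seconds=1)):
    print(start.time(), "->", end.time())  # 06:14:03 -> 06:14:15
```

The same timestamps checked against a 10-second declared interval report no gap at all, which is the sense in which detection depends on the cadence and not on the data alone.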

Representing a gap honestly means not fabricating a row. A missing point is recorded as missing — an explicit null with a reason code, or simply the absence of a row that the cadence metadata makes legible — never back-filled with a made-up value that would later read as a real measurement. If a downstream view needs an evenly-spaced series (many charts and models do), the fill is a processed artifact, not raw data: computed on read, labelled as derived, and never written back over the original record. That is this chapter's raw-versus-processed boundary applied to a hole.

Imputing a gap — estimating the missing value — is sometimes necessary and always to be done cautiously. The common methods rise in ambition: LOCF (last-observation-carried-forward, hold the previous value), linear interpolation between the bracketing points, and model-based imputation (a soft sensor or the mechanistic twin fills the value from correlated signals; the semantic-interoperability chapter shows the resampling-and-LOCF version of this on the wire). Each is a guess dressed as data, so three rules hold. First, an imputed value is processed, tagged as such, and never overwrites the raw gap. Second, it is kept out of any record a regulator reads as measured evidence — a batch release does not turn on an interpolated number. Third, and most dangerous for machine learning: bioprocess missingness is rarely random. A probe often fails because the culture went wrong, and analysts pull more offline samples precisely when a batch is misbehaving — so the gaps are not missing-at-random, and naïvely imputing them (or worse, silently dropping the rows) teaches a model that trouble is quiet. The safest imputations carry their own uncertainty forward, and a model that must handle gaps is evaluated with them present, not on a tidied series the plant never actually sees. This is the same honesty the ALCOA+ Complete principle demands of the record and that Book 5's applicability-domain and leakage discipline demand of the model.
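The first rule, that an imputed value is processed, tagged, and never overwrites the raw gap, can be made concrete with a fill that is computed on read. This is a minimal sketch covering only LOCF and linear interpolation; the function name, the row layout, and the origin labels are assumptions, and the pH values are invented for the example.

```python
def fill_on_read(series, method="locf"):
    """Return a derived view of `series`; the input list is never modified.

    `series` is a time-ordered list of (hour, value) pairs, with value None
    for a gap. Each output row is (hour, value, origin), so a filled point
    can never pass as a measurement.
    """
    known = [(t, v) for t, v in series if v is not None]
    out = []
    for t, v in series:
        if v is not None:
            out.append((t, v, "measured"))
            continue
        before = [(kt, kv) for kt, kv in known if kt < t]
        after = [(kt, kv) for kt, kv in known if kt > t]
        if method == "linear" and before and after:
            (t0, v0), (t1, v1) = before[-1], after[0]
            filled = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
            origin = "imputed:linear"
        elif before:  # LOCF, also the fallback when nothing follows the gap
            filled, origin = before[-1][1], "imputed:locf"
        else:  # nothing to carry forward: the gap stays a gap
            filled, origin = None, "missing"
        out.append((t, filled, origin))
    return out


raw = [(148, 7.0), (149, None), (150, 7.25)]
print(fill_on_read(raw, "locf"))    # hour 149 -> 7.0, origin "imputed:locf"
print(fill_on_read(raw, "linear"))  # hour 149 -> 7.125, origin "imputed:linear"
print(raw[1])                       # (149, None): the raw gap is untouched
```

Because the origin travels with every row, a downstream report or model can filter on it, for example by excluding anything that is not "measured" from a release calculation.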

The islands problem: why mapping disconnected systems remains unsolved

A real batch is not one probe

Our pH point was simple. But a real batch creates data in dozens of places, in dozens of shapes, on systems that often do not speak to one another — a process-control system (a Siemens or Emerson DCS — Distributed Control System) here, a chromatography data system (Waters Empower) there, a LIMS (Laboratory Information Management System), a historian, a Manufacturing Execution System (such as Siemens Opcenter or Dassault Systèmes DELMIA), and a partner's spreadsheet somewhere else. The plant-floor systems that hold these islands are surveyed in Plant Information Systems; what matters here is that each one captures its own island of data with its own metadata conventions. Standards exist to give those islands a common shape — ANSI/ISA-88 defines how batch and recipe data are structured, and ISA-95 defines how plant-floor data connects to the business systems above it — but mapping every real instrument onto them is the hard part.

The trouble is rarely the values; it is the metadata around them. The same batch is "L-22-0417" in the historian, "Lot-220417" in the chromatography data system, and "220417-L" in the LIMS. One system stamps time in UTC, another in site-local time, a third records only a US-format calendar date. One carries an explicit unit, another assumes it, a third leaves the unit blank. None of these readings is wrong — but they will not line up, and so the single most basic question an investigator asks — "show me everything about this batch" — cannot be answered by a query. Someone has to reconcile the islands by hand.
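To see what reconciling the batch key involves, here is a deliberately naive normalizer for the three spellings above. The rule it encodes (every spelling contains the same six digits) is a hypothetical assumption that happens to hold for this example; it is not a general solution, and a real plant would need a governed mapping table.

```python
import re


def canonical_batch_id(raw: str) -> str:
    """Map a system-specific batch spelling onto one canonical key.

    Assumes (hypothetically) that every spelling carries the same six digits
    and differs only in prefixes, suffixes, and separators.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) != 6:
        raise ValueError(f"cannot reconcile batch key: {raw!r}")
    return f"L-{digits[:2]}-{digits[2:]}"


spellings = {"historian": "L-22-0417", "cds": "Lot-220417", "lims": "220417-L"}
keys = {system: canonical_batch_id(s) for system, s in spellings.items()}
print(keys)  # every system maps to 'L-22-0417'
```

The fragility is the lesson: one system that adds a site prefix or a two-digit suffix breaks the rule, and the clock and unit mismatches described above are not touched by it at all.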

A diagram of the islands problem: three systems — a bioreactor historian, a chromatography data system, and a quality LIMS — each holding the same batch under an incompatible batch key, clock, and unit convention, with rose clash markers between them, a panel explaining why the join fails, and a green harmonized-record target labeled the unsolved bridge. The islands problem: three systems hold the same batch, but their batch keys, clocks, and unit conventions do not match, so the records cannot be joined without manual reconciliation. Original diagram by the authors, created with AI assistance.

Unsolved challenge: bridging the islands

Here is the honest state of the art. Stitching those islands into one trustworthy, connected record, so the whole shadow can be read as a single story, is the central problem of biopharmaceutical data management, and it is not yet solved in the general case. Where the conventions clash, the same fact has to be captured redundantly in each system and then reconciled by hand, and that reconciliation sits on the critical path of every investigation: a root-cause analysis that should take an afternoon instead waits days while someone proves that "Lot-220417" and "220417-L" are the same batch. Regulators anticipate exactly this by asking manufacturers to map their data flows (to draw, point by point, where each measurement is born, where it travels, and where it could be altered or lost [3]), and the GAMP guidance treats end-to-end traceability of the record as a requirement, not a nicety [5]. The most credible path forward is shared semantics at the point of capture: standardized, machine-readable formats so that the unit, the timestamp, and the identifiers mean the same thing in every system. How that shared meaning is engineered is the whole subject of Semantic Interoperability and Ontologies and FAIR Data later in this book. AnIML, developed under ASTM, and the Allotrope Foundation's Allotrope Data Format (ADF) are the leading collaborative attempts to define that common shape for analytical and process data [9]. But adoption is partial, legacy instruments emit their own formats, and harmonizing a real plant onto one vocabulary remains a genuinely open, expensive problem. This book exists largely to chart it.

Why it matters: the lifecycle as the spine of quality control

If the lifecycle is the spine, then managing data is managing that spine end to end. A measurement that is captured but never contextualized is unusable. One that is used but never retained is indefensible. One that is retained but not as a faithful original or true copy is worthless to an inspector. Good data management is simply the discipline of carrying every point cleanly through all seven stages, with its metadata intact, for its entire required life [5].

Industry learned this the hard way: managing the lifecycle is not optional. The ISPE GAMP guidance on records and data integrity treats the full life cycle — from the moment data is born to the day it is destroyed — as the spine of the quality system, the unit that controls must follow [5]. The general data-management body of knowledge, DAMA-DMBOK, supplies the vocabulary that any industry uses to describe this journey: capture, contextualization, use, retention, and archival [7].

In the real world

The lifecycle in this chapter is not a paper abstraction. It is stitched together from the real instruments and systems a plant runs every day, and our 7.0 is born inside that hardware. At generation, the pH probe sits in a single-use bioreactor from a vendor such as Sartorius (Ambr, Biostat STR), Thermo Fisher Scientific (HyPerforma S.U.B.), or Eppendorf (BioBLU). At capture, the value lands in a vendor format, such as an Agilent ChemStation .ch file for chromatography or a proprietary historian record from the bioreactor controller, which open formats such as ASTM's AnIML and the Allotrope Foundation's Allotrope Data Format (ADF) aim to make readable decades later. Through processing and use it flows across a process-control system (a Siemens or Emerson DCS), a chromatography data system (Waters Empower), a historian (such as an OSIsoft/AVEVA PI server), a LIMS, and a Manufacturing Execution System (Siemens Opcenter or Dassault Systèmes DELMIA). Each of those is an island with its own metadata conventions, and the unsolved work of this book is bridging them: the same batch lives as one key in Empower, another in PI, and a third in the LIMS, and today someone still reconciles them by hand. Knowing which box each stage lives in is the difference between reading a vendor's brochure and reading your own batch record.

Key terms

Data lifecycle — the full journey of a data point: generation/capture, processing, contextualization, review/use, reporting, retention/archival, and disposal.
Raw data — the original, unaltered values exactly as first recorded by an instrument or person.
Processed data — calibrated, averaged, or calculated results derived from raw data.
Metadata — "data about data"; the unit, timestamp, equipment, batch, method, and authorship that give a value meaning and history.
Original record — the first durable capture of data, in the format it was created.
True copy — an exact, verified reproduction (including metadata) that preserves full meaning and can stand in for the original.
Contextualization — attaching unit, time, equipment, batch, and method so a number becomes information.
Information — data plus context; a number you can actually trust and act on.
Tag — the structured address of a signal, <asset>.<measurement>.<role> (e.g. BR204.pH.PV), so a thousand sensors share one table as values of one schema rather than new columns.
Process value (PV) / setpoint (SP) — what the process actually did (.PV) versus what the recipe asked for (.SP).
Quality flag — the trust marker stored beside a value (Good, Uncertain, or Bad); a reading you cannot vouch for is not the same as a good one.
Absent reading vs. Bad-quality reading — a gap (no row at all: probe offline, deadband, or not yet sampled) versus a row that exists but is flagged Bad/Uncertain; a gap says "we do not know," a Bad point says "we measured but do not believe it," and the honest record keeps both.
Imputation (LOCF / interpolation / model-based) — estimating a missing value by carrying the last one forward, interpolating between neighbours, or predicting from correlated signals; always a processed artifact, tagged as derived, kept out of measured evidence, and treated cautiously because bioprocess gaps are usually not missing-at-random (a probe often fails because the batch went wrong).
The islands problem — the difficulty of joining the same batch across disconnected systems whose batch keys, clocks, and units do not match, forcing redundant capture and manual reconciliation.
RDF triple / IRI — a fact written as subject — predicate — object in a knowledge graph, whose subject is a globally unique web name (an Internationalized Resource Identifier) that means the same thing in every system, so the same batch IRI joins records across islands.
SHACL / SPARQL — the Shapes Constraint Language that validates whether a reading has the required structure (e.g. exactly one timestamp and a quality flag from a controlled set), and the query language for RDF that answers a competency question like "show me everything about this batch" in one pass.
QUDT — the Quantities, Units, Dimensions and Types vocabulary, which lets a value carry its unit as a machine-readable IRI rather than a string buried in a column header, so 7.0 can never be misread as glucose.
Continuant vs. occurrent — a thing that persists and bears qualities through time (the broth, a batch) versus a process that happens and is over (the bioreactor run); the pH reading is a quality of the continuant, not the same thing as the occurrent that produced it.
Data leakage / grouped cross-validation — the modeling error where readings from one batch land in both training and test (inflating a model's score), and the leave-one-batch-out evaluation that prevents it — possible only because batch_id is stamped on every row.
Process drift vs. model drift — the living culture genuinely changing batch to batch versus a model going stale against an unchanged process; telling them apart needs the timestamp, tag, and batch_id on each value.
IQ/OQ/PQ — Installation, Operational, and Performance Qualification: documented proof that equipment (e.g. a chromatography skid) was installed right, operates right, and performs right for its real workload; the instrument's qualification state is part of every reading's metadata.
CSV / CSA — Computerized System Validation, the documented evidence that a regulated computer system is fit for use, and the FDA-led shift to risk-based Computer Software Assurance that spends validation effort where patient risk is highest.
ALCOA+ — the ALCOA qualities from Chapter 1 (Attributable, Legible, Contemporaneous, Original, Accurate) extended with four more: Complete, Consistent, Enduring, Available.
Retention — keeping records readable for their required life (for medicines, at least one year past batch expiry, often longer).
The four V's — volume, velocity, variety, and veracity; the dimensions that make bioprocess data hard.
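Several of these terms (tag, quality flag, absent versus Bad-quality reading, and imputation) fit together in one small data structure, sketched below in Python. This is an illustration under stated assumptions, not a prescribed schema: the `Reading` class, its field names, the `derived` marker, and the choice to label imputed values "Uncertain" are all hypothetical, as are the sample values.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Reading:
    tag: str                 # <asset>.<measurement>.<role>, e.g. "BR204.pH.PV"
    timestamp: str           # ISO 8601 text, kept simple for the sketch
    value: float
    unit: str
    batch_id: str
    quality: str = "Good"    # Good, Uncertain, or Bad
    derived: bool = False    # True only for imputed (processed) values


def parse_tag(tag: str) -> dict:
    """Split a tag into its asset, measurement, and role parts."""
    asset, measurement, role = tag.split(".")
    return {"asset": asset, "measurement": measurement, "role": role}


def fill_gaps_locf(series):
    """Last-observation-carried-forward over (timestamp, Reading-or-None) pairs.

    None marks an absent reading (a gap). Bad-quality readings are kept
    untouched and are never used as the value to carry forward.
    """
    filled = []
    last_good: Optional[Reading] = None
    for timestamp, reading in series:
        if reading is None:
            if last_good is None:
                filled.append(None)  # nothing to carry: the gap stays a gap
            else:
                # Assumption: imputed points are marked derived and Uncertain.
                filled.append(replace(last_good, timestamp=timestamp,
                                      quality="Uncertain", derived=True))
        else:
            filled.append(reading)
            if reading.quality == "Good" and not reading.derived:
                last_good = reading
    return filled


# Hypothetical series: a leading gap, a good point, a gap, a Bad point.
good = Reading("BR204.pH.PV", "2024-01-01T00:01:00Z", 7.0, "pH", "B-001")
bad = Reading("BR204.pH.PV", "2024-01-01T00:03:00Z", 3.2, "pH", "B-001", quality="Bad")
series = [
    ("2024-01-01T00:00:00Z", None),
    (good.timestamp, good),
    ("2024-01-01T00:02:00Z", None),
    (bad.timestamp, bad),
]
filled = fill_gaps_locf(series)
```

Because the imputed point carries `derived=True`, a later query can exclude it from measured evidence while a trending display can still use it, and the Bad reading survives in the record instead of being overwritten.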

Where this leads

We have followed one point through its life, but we treated its birthplace as a single probe. In reality, a biologic is built across many unit operations, each one a busy factory of measurements. In the next chapter, A Tour of Where Process Data Is Born, we walk the whole monoclonal-antibody process — upstream, downstream, fill-finish, and quality control — but reframed entirely as a chain of data-generating stations, so you can see exactly where every point in the shadow first draws breath.
