The Biologic and Its Data Shadow
Where we are: Part I, Chapter 1. Having met the two products of biomanufacturing in the Preface (the molecule and its data), we now look closely at that second product: the data shadow that every batch casts.
In the Preface, Making the Same Medicine Twice, we made a claim that sounds almost philosophical: a biologic is manufactured twice. Once as a molecule, grown inside living cells, and once as a body of data that proves the molecule is what we say it is. This chapter takes that claim and makes it concrete. We are going to follow a single batch of medicine and count its shadow.
The word "shadow" is deliberate. A shadow is not the object, but it is cast by the object, follows it everywhere, and tells you a great deal about its shape. The data shadow of a batch is every number, record, signature, and sensor trace generated while that batch was made and tested. By the end of this chapter you will see that the shadow is not optional paperwork bolted on at the end. It is, in a real regulatory and scientific sense, part of the product.
Think of an expensive bottle of wine. The wine in the glass is the product, but a serious collector also wants the provenance: where the grapes grew, the weather that year, who bottled it, how it was stored, and proof the bottle was never tampered with. For wine, that paper trail is a nice-to-have. For a biologic medicine going into a human body, the paper trail is the law, and it is enormous. This book is about that paper trail, except that almost none of it is paper anymore.
What this chapter covers
- Why a biologic's complexity forces data to the center of the story
- The many kinds of data one single batch produces, and roughly how much
- The legal rule that "if it isn't documented, it didn't happen"
- The four families of data we will return to again and again
- Why the verb in this book's title, manage, is the hard part
The molecule that is defined by how it is made
A quick reminder, because everything else follows from it. A biologic is a large, fragile protein medicine (a molecular machine made of thousands of atoms) that can only be built by living cells, not by ordinary chemistry. Our recurring example is the monoclonal antibody (mAb), a Y-shaped protein engineered to lock onto one specific target in the body, such as a cancer signal. (The companion guide From Cell to Cure covers the biology in depth; here we care about its consequences for data.)
A small-molecule drug like aspirin can be written as an exact chemical formula. A biologic cannot. It is too big, too flexible, and decorated with tiny sugar chains that the cells attach as they grow it. Two factories with the same gene but different culture conditions (pH, temperature, nutrient levels) can produce subtly different molecules, because living cells perform post-translational modifications (PTMs): chemical edits made to the protein after it is built, such as glycosylation (the attaching of sugar chains), that shift with the cell's state. This is why the industry repeats the phrase "the product is the process": the manufacturing process does not merely make the medicine, it helps define what the medicine is [1].
That single fact is the reason data sits at the center of biomanufacturing rather than at the edge. If the process defines the product, then the record of the process is your only proof of what the product actually is. Modern regulatory thinking formalized this. Under Quality by Design (QbD), a development philosophy in which quality is built in deliberately rather than tested in afterward, you must identify which process settings and which product properties truly matter, and understand how they relate [1]. The international guideline ICH Q8(R2) turned this into expectations: define what the product needs to do, identify its critical quality attributes, and map the design space of conditions that reliably deliver them [2]. Every one of those words is, in practice, a request for data.
The data shadow of a single batch
Now picture one batch: one run through the factory, producing perhaps a few kilograms of purified antibody (the drug substance, meaning the bulk antibody before it is filled into vials) that, after fill/finish, become tens of thousands of vials of finished drug product. Watch the data it throws off.
Time-series from sensors. Inside the bioreactor (the large vessel where cells grow, for example a 2,000-litre stirred-tank vessel such as a Cytiva Xcellerex XDR or a Thermo Fisher HyPerforma single-use bioreactor), probes measure temperature, pH (acidity), dissolved oxygen, stirring speed, and more, often every few seconds, for one to three weeks. These continuous streams are the heartbeat of the batch. The framework that pushed the industry to measure such critical quality and performance attributes in real time is the FDA's Process Analytical Technology (PAT) initiative [3]. A single bioreactor can generate millions of timestamped readings before the cells are ever harvested: ten to twenty probes sampled around 0.5 Hz (one reading every two seconds) over one to three weeks add up to roughly three to eighteen million points per batch. Each point arrives as a timestamped tag reading (BR101.Temp.PV, BR101.pH.PV) and lands in a process Historian database such as OSIsoft PI (AVEVA) or Rockwell FactoryTalk. A few rows look like this:
timestamp,tag,value,unit,quality
2026-06-13T14:03:00Z,BR101.Temp.PV,37.0,degC,Good
2026-06-13T14:03:00Z,BR101.pH.PV,7.05,pH,Good
2026-06-13T14:03:00Z,BR101.DO.PV,42.3,%sat,Good
2026-06-13T14:03:02Z,BR101.Temp.PV,37.0,degC,Good
2026-06-13T14:03:02Z,BR101.pH.PV,7.04,pH,Good
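The arithmetic behind the reading count can be checked directly from rows like these. The following is a minimal sketch in Python, assuming every probe reports at the same fixed interval for the whole run; the function names are illustrative and do not belong to any historian product.

```python
import csv
import io
from datetime import datetime

# The five sample rows from the text, reproduced verbatim.
SAMPLE = """timestamp,tag,value,unit,quality
2026-06-13T14:03:00Z,BR101.Temp.PV,37.0,degC,Good
2026-06-13T14:03:00Z,BR101.pH.PV,7.05,pH,Good
2026-06-13T14:03:00Z,BR101.DO.PV,42.3,%sat,Good
2026-06-13T14:03:02Z,BR101.Temp.PV,37.0,degC,Good
2026-06-13T14:03:02Z,BR101.pH.PV,7.04,pH,Good
"""


def parse_ts(text):
    # Swap the trailing "Z" so fromisoformat also works before Python 3.11.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def sampling_interval_s(rows, tag):
    """Seconds between the first two readings of one tag."""
    times = [parse_ts(r["timestamp"]) for r in rows if r["tag"] == tag]
    return (times[1] - times[0]).total_seconds()


def points_per_batch(probes, interval_s, days):
    """Readings logged if every probe reports once per interval (assumption)."""
    return probes * int(days * 86_400 / interval_s)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
interval = sampling_interval_s(rows, "BR101.Temp.PV")  # 2.0 s, i.e. 0.5 Hz
low = points_per_batch(10, interval, 7)    # ten probes, one week
high = points_per_batch(20, interval, 21)  # twenty probes, three weeks
print(f"{low:,} to {high:,} readings per batch")
```

Under these assumptions the count runs from about 3 million readings (ten probes, one week) to about 18 million (twenty probes, three weeks).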
Alarms and events. Every time a value drifts out of range, an operator opens a valve, or a pump starts, the control system logs it: what happened, when, and often who triggered it.
The electronic batch record (EBR). This is the master narrative of the batch: the step-by-step recipe that was followed, the materials added, the parameters confirmed, and the human sign-offs at each stage. It lives in a Manufacturing Execution System (MES); commercial examples include Körber PAS-X, Siemens Opcenter Execution Pharma, and Rockwell PharmaSuite. We will meet it properly in later chapters; for now, know it is the spine the rest of the data hangs from.
Analytical results. Once material is harvested and purified, the Quality Control (QC) laboratory tests it for identity, purity, potency, and safety. Each test produces results, and behind each result sits a raw instrument file: the original output of a chromatograph or mass spectrometer, often a large and proprietary digital file. A single high-performance liquid chromatography (HPLC) run, for instance, is commonly stored as a vendor-specific .ch file or a .d directory of 5 to 100 MB.
Environmental monitoring. Cleanrooms are watched constantly (airborne particles, microbial samples, temperature, humidity) to prove the surroundings stayed clean while the batch was open.
Material genealogy. Every raw material, growth medium, filter, and single-use plastic bag carries a lot number, a supplier, and a certificate. The thread linking a finished vial back to every ingredient and consumable that touched it is its genealogy, or lineage.
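Tracing a genealogy amounts to walking a graph of "was made from" links. The sketch below is a minimal illustration in Python, assuming each lot simply lists the lots consumed to make it; every lot identifier is hypothetical, and a real system would hold these links in the MES or ERP rather than in a dictionary.

```python
# Hypothetical lot identifiers; each lot maps to the lots consumed to make it.
CONSUMED = {
    "DP-VIAL-0001": ["DS-0001", "VIAL-LOT-77"],
    "DS-0001": ["HARVEST-0001", "FILTER-LOT-12"],
    "HARVEST-0001": ["MEDIA-LOT-34", "BAG-LOT-56", "CELLBANK-WCB-03"],
}


def trace_back(lot, consumed):
    """Return every upstream lot that directly or indirectly fed into `lot`."""
    seen = set()
    stack = [lot]
    while stack:
        current = stack.pop()
        for parent in consumed.get(current, []):
            if parent not in seen:  # guard against revisiting shared lots
                seen.add(parent)
                stack.append(parent)
    return seen


upstream = trace_back("DP-VIAL-0001", CONSUMED)
print(sorted(upstream))
```

The same traversal run in the opposite direction (from a raw-material lot to every vial it touched) is what a recall investigation would need.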
Figure: A single batch's data shadow. Many independent streams of very different kinds of data, from many sources, all describe the same run and connect to one Electronic Batch Record (EBR).
Original diagram by the authors, created with AI assistance.
Notice the variety, not just the volume. Tidy numeric streams sit beside large binary instrument files, beside human-signed forms, beside supplier certificates. They live in different systems, in different formats, and the hard work of this book is connecting them. Shared standards exist to make that connection possible: ISA-88 (also issued as ANSI/ISA-88, the batch-control model that defines recipes, procedures, and equipment hierarchy) gives the batch a common structure, and OPC UA (Open Platform Communications Unified Architecture) is the modern protocol that lets instruments and historians exchange those tagged readings between vendors. Applying them across an entire facility, however, is the real labor.
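One way to picture the common structure ISA-88 provides is its procedural hierarchy, in which a procedure breaks down into unit procedures, operations, and phases. The sketch below is a minimal illustration in Python, not the standard's own data model; every recipe, unit, and phase name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Phase:
    name: str


@dataclass
class Operation:
    name: str
    phases: list = field(default_factory=list)


@dataclass
class UnitProcedure:
    name: str
    unit: str  # the equipment unit the procedure runs on
    operations: list = field(default_factory=list)


@dataclass
class Procedure:
    name: str
    unit_procedures: list = field(default_factory=list)


def phase_paths(procedure):
    """Flatten the hierarchy into one path per phase."""
    paths = []
    for up in procedure.unit_procedures:
        for op in up.operations:
            for ph in op.phases:
                paths.append(f"{procedure.name}/{up.name}/{op.name}/{ph.name}")
    return paths


# Hypothetical upstream recipe for illustration only.
recipe = Procedure("mAb-Upstream", [
    UnitProcedure("Production-Culture", "BR101", [
        Operation("Inoculate", [Phase("Transfer-Seed")]),
        Operation("Grow", [Phase("Control-Temp"), Phase("Control-pH")]),
    ]),
])
paths = phase_paths(recipe)
print("\n".join(paths))
```

When every system agrees on a structure like this, a sensor reading, an alarm, and a signature can all be attached to the same named step of the same batch.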
If it isn't documented, it didn't happen
In ordinary work, you do the job and the paperwork is a chore afterward. In medicine manufacturing, the rule is inverted. The records are the evidence that the job was done correctly, and without them the work is treated as if it never occurred. This is the heart of cGMP (current Good Manufacturing Practice), the body of regulation governing how medicines are made.
In the United States, 21 CFR Part 211 spells this out. It requires a master production record (the approved recipe), a batch production and control record for every batch (the as-executed account), the recording of test results, and the protection of backup data, all kept on file for years after the batch is released [4]. The data shadow is not a courtesy. It is a legal obligation, batch by batch.
The records must also be trustworthy, which is a separate problem from merely existing. The FDA's guidance on data integrity describes the qualities reliable records must have, often summarized by the acronym ALCOA (Attributable, Legible, Contemporaneous, Original, and Accurate) and backed by audit trails that capture who changed what and when [5]. Regulators now extend these five into ALCOA+, appending four further qualities (Complete, Consistent, Enduring, Available); we return to the full set in Data Integrity and ALCOA+. The word Contemporaneous matters most here: you must record an action as it happens, not reconstruct it from memory later. A reading written down an hour after the fact is, in the eyes of a regulator, a different and weaker kind of evidence.
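The difference between a contemporaneous record and a reconstructed one can be made concrete by storing two timestamps: when the action happened and when the entry was written. The sketch below is a minimal illustration in Python; the field names and the one-minute tolerance are assumptions chosen for the example, not values taken from any regulation or guidance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)  # frozen: an entry cannot be silently altered in place
class Record:
    what: str
    value: float
    recorded_by: str       # Attributable: who or what made the entry
    measured_at: datetime  # when the action happened
    recorded_at: datetime  # when the entry was written


# Hypothetical tolerance; a real limit would come from site procedures.
MAX_LAG = timedelta(minutes=1)


def is_contemporaneous(record, max_lag=MAX_LAG):
    """True if the entry was written at, or shortly after, the event."""
    lag = record.recorded_at - record.measured_at
    return timedelta(0) <= lag <= max_lag


t0 = datetime(2026, 6, 13, 14, 3, 0, tzinfo=timezone.utc)
sensor = Record("BR101.Temp.PV", 37.0, "probe TT-101", t0, t0)
logbook = Record("BR101.Temp.PV", 37.0, "operator (hypothetical)",
                 t0, t0 + timedelta(hours=1))

print(is_contemporaneous(sensor), is_contemporaneous(logbook))
```

The sensor entry passes and the entry written an hour later does not, which mirrors the regulator's view that the second is a weaker kind of evidence.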
"Contemporaneous" is why so much of biomanufacturing now happens through validated computer systems rather than notebooks. A sensor that timestamps its own reading the instant it takes it is the strongest possible witness. A human transcribing that number into a logbook later is the weakest. Much of modern data management exists to keep records close to the moment they are born.
Because most of these records are now electronic, a second rule applies: 21 CFR Part 11, which sets the conditions under which electronic records and electronic signatures are accepted as the legal equals of paper and ink [6]. When an operator clicks "approve" in a batch record system, Part 11 is what makes that click binding. The same expectations exist outside the United States: in the European Union, EU GMP Annex 11 governs computerised systems and is the close counterpart to Part 11, so a product sold on both sides of the Atlantic must satisfy both.
Four families of data
The shadow is large, but it is not formless. Throughout this book we will sort data into four families. Meet them briefly now; each gets its own treatment later.
- Process data: the CPPs. A Critical Process Parameter (CPP) is a setting that, if it varies too much, will change the product: bioreactor temperature, pH, feed rate. These are the "how we made it" numbers, mostly the sensor time-series above. Identifying which parameters are truly critical is exactly the QbD exercise that ICH Q8 demands [2].
- Quality data: the CQAs. A Critical Quality Attribute (CQA) is a property of the product itself that must stay within limits to keep the medicine safe and effective: its purity, its potency, the pattern of sugar decorations on the antibody. These are the "what we made" numbers from the QC lab, and the analytical procedures that produce them now have their own QbD guideline, ICH Q14 (Analytical Procedure Development), a companion to the method-validation guideline ICH Q2. The industry's well-known A-Mab case study is a worked example of teams systematically ranking an antibody's CQAs and CPPs [9].
- Metadata. Data about data: the units, the timestamp, the instrument, the operator, the calibration status. A bare number ("37") is meaningless. 37 degrees Celsius, measured by probe TT-101, calibrated last Tuesday, at 14:03 is information. (Probe tags follow an equipment-specific schema: BR-01-TT-01 might mean bioreactor unit 01, temperature transmitter 01, and the exact convention must be fixed in the facility's data dictionary so every system reads the same name the same way.) Metadata, and the audit trails that protect it, are central to the integrity rules above and to the risk-based system controls described in GAMP 5, the standard guide for validating computerized systems in regulated industry [10].
- Master data. The stable reference information that does not change batch to batch: the approved recipe, the product specification, the list of materials, the equipment register. If process and quality data are the story of one batch, master data is the unchanging cast of characters every batch shares.
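How the families fit together can be sketched in a few lines: a process reading carries its metadata, and a limit drawn from master data decides whether the reading is acceptable. The following is a minimal illustration in Python; the class names and the 36.5 to 37.5 degC range are hypothetical and are not taken from any real specification.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    """Process data plus the metadata that turns a number into a fact."""
    value: float
    unit: str
    tag: str
    measured_at: datetime
    calibration_valid: bool


@dataclass(frozen=True)
class CppLimit:
    """Master data: the approved range for one critical process parameter."""
    tag: str
    unit: str
    low: float
    high: float


def check(reading, limit):
    # Refuse to compare values that do not describe the same thing.
    if reading.tag != limit.tag or reading.unit != limit.unit:
        raise ValueError("reading and limit describe different things")
    if not reading.calibration_valid:
        return "suspect: probe out of calibration"
    if limit.low <= reading.value <= limit.high:
        return "in range"
    return "out of range"


when = datetime(2026, 6, 13, 14, 3, tzinfo=timezone.utc)
reading = Reading(37.0, "degC", "BR101.Temp.PV", when, True)
limit = CppLimit("BR101.Temp.PV", "degC", 36.5, 37.5)  # hypothetical range
print(check(reading, limit))
```

Without the unit, tag, and calibration status, the bare value 37.0 could not be checked against anything, which is the practical meaning of "a bare number is meaningless".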
Hold these four loosely for now. The point is simply that the shadow has structure, and naming its parts is the first step to managing it.
Why "manage" is the operative word
We could have called this book Recording Data in Biomanufacturing. We did not, because recording is the easy part. The hard part is everything the verb manage implies.
The data must be captured at the moment of creation, faithfully and contemporaneously. It must be contextualized, that is, joined to the metadata that turns a number into a fact. It must be protected so it cannot be silently altered or lost, satisfying the integrity and electronic-records rules above [5][6]. It must be connected across the many systems that hold its fragments, so the genealogy of a vial can actually be traced. And it must be retained: the regulatory minimum under 21 CFR Part 211 is one year after the batch's expiration date [4], but firms in practice keep records for a decade or more, because a safety question can surface long after the medicine has shipped.
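The retention rule reduces to simple date arithmetic. The sketch below is a minimal illustration in Python, assuming retention is counted in whole years from the batch's expiration date; the example expiration date and the ten-year company policy are hypothetical.

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years, handling 29 February."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return d.replace(year=d.year + years, day=28)


def earliest_disposal(expiration, retention_years=1):
    """Earliest date records could be discarded under a given retention period."""
    return add_years(expiration, retention_years)


expiry = date(2028, 6, 30)                    # hypothetical batch expiration
regulatory = earliest_disposal(expiry)        # one year after expiration
company = earliest_disposal(expiry, 10)       # hypothetical ten-year policy
print(regulatory.isoformat(), company.isoformat())
```

The gap between the two dates is the period during which a firm keeps records by choice rather than by obligation.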
That is a tall order, and the industry increasingly treats process data not as a regulatory burden but as a genuine asset: fuel for analytics, process understanding, and the predictive models that are reshaping the field [7][8]. A well-managed data shadow does not just defend a batch in an audit; it teaches you how to make the next batch better.
Why it matters
If you take one idea from this chapter, take this: the data shadow is as essential to the product as the molecule. Lose the molecule and you lose one batch. Lose the data, or fail to make it trustworthy, and you can no longer prove that any of your batches are what you claim, which can halt a product entirely. Regulators do not ask to taste the medicine; they ask to see its records [4][5]. The shadow is how a biologic proves it is itself.
In the real world

Filled vials of drug product. Behind every vial stands a far larger volume of data proving how it was made and tested.
Vials. Image by CSIRO, https://commons.wikimedia.org/wiki/File:CSIRO_ScienceImage_11474_Vials.jpg, licensed under CC BY 3.0 (https://creativecommons.org/licenses/by/3.0/), via Wikimedia Commons.
The data shadow grows even larger as the industry modernizes. The U.S. NIIMBL institute (the National Institute for Innovation in Manufacturing Biopharmaceuticals), alongside the University of Delaware's forthcoming SABRE Center (a co-located cGMP biomanufacturing and workforce-training pilot facility being built to scale up and de-risk advanced processing and to support NIIMBL), is helping advance continuous and intensified processing, where cells produce nonstop rather than in a single fed-batch tank (a vessel that is filled, run once to completion, then emptied). Continuous processing means continuous data: there is no neat "end of batch" moment, so the sensor streams never stop and the need for real-time PAT measurement and live integrity becomes acute [3]. On a real plant floor this data follows a well-worn path: probes on the bioreactor feed a control system (PLCs supervised by SCADA software, or a distributed control system, such as Siemens SIMATIC or Emerson DeltaV), which streams tagged readings into a Historian (such as OSIsoft PI, now part of AVEVA) and surfaces them in the MES batch record (Körber PAS-X, Siemens Opcenter), while the QC lab's chromatographs and mass spectrometers deposit their raw files into a chromatography data system (CDS) such as Waters Empower or Thermo Chromeleon. At the same time, frameworks like GAMP 5 push the field toward a risk-based, lifecycle view of the very computer systems that capture all this data [10]. The shadow is not shrinking. Managing it well is becoming the defining engineering challenge of modern biomanufacturing.
Key terms
- Data shadow: the full body of records a batch generates (sensor traces, batch records, test results, and signatures); as essential to the product as the molecule.
- Biologic: a large, complex protein medicine made by living cells rather than by chemistry alone.
- Monoclonal antibody (mAb): a Y-shaped protein, every copy identical, engineered to lock onto one specific target.
- Drug substance: the purified bulk antibody, before it is filled into vials.
- Drug product: the finished, filled vials of medicine, produced from drug substance in the fill/finish step.
- The product is the process: the principle that how a biologic is made helps define what it is.
- Quality by Design (QbD): building quality into a process deliberately by understanding which parameters and attributes matter.
- Bioreactor: the vessel in which living cells are grown to make the product.
- Fed-batch: the conventional batch mode in which a bioreactor is filled, run once to completion, then emptied (contrasted with continuous processing).
- Electronic batch record (EBR): the as-executed digital account of how one batch was made, held in a Manufacturing Execution System (MES).
- Historian: a database built to store streams of timestamped process readings (tags) from instruments and control systems.
- Post-translational modification (PTM): a chemical edit a living cell makes to a protein after building it; glycosylation, the attaching of sugar chains, is the key example for antibodies.
- PAT (Process Analytical Technology): an FDA framework for measuring critical quality and process attributes in real time.
- cGMP (current Good Manufacturing Practice): the regulations governing how medicines must be manufactured.
- ALCOA: the five data-integrity qualities records must have (Attributable, Legible, Contemporaneous, Original, Accurate). Regulators now extend this to ALCOA+, adding four further qualities (Complete, Consistent, Enduring, Available); the full treatment appears in Data Integrity and ALCOA+.
- Audit trail: a secure log of who changed what data, and when.
- Critical Process Parameter (CPP): a process setting that, if it varies too much, changes the product.
- Critical Quality Attribute (CQA): a property of the product that must stay within limits for it to be safe and effective.
- Metadata: data about data (units, timestamps, instruments, operators); what turns a bare number into a fact.
- Master data: stable reference information shared across batches, such as recipes, specifications, and equipment registers.
- Material genealogy: the traceable lineage linking a finished vial back to every ingredient and consumable that touched it.
Where this leads
We have surveyed the shadow from above and seen its scale, variety, and legal weight. But a shadow is made of individual points, and the only way to truly understand data management is to follow one. In the next chapter, The Lifecycle of a Data Point, we zoom all the way in: we will track a single measurement from the instant a sensor or analyst creates it through capture, processing, contextualization, decision-making, retention, and archival. Along the way we will draw the line between raw data and metadata, and confront the idea at the core of everything that follows: data without context is just noise.