Preface — Making the Same Medicine Twice

📍 Where we are: Right at the door. Before any single step of making a medicine, let us agree on what this book is really about — and why a biologic is, in effect, made twice.

Welcome. This is a book about something invisible. When a modern medicine is manufactured, two products come off the line. One you can hold: a clear liquid in a vial, a protein grown by living cells. The other you cannot hold at all — it is the complete record of how that liquid came to be, every measurement and signature and test result that proves the vial contains exactly what the label says. This second product is data, and managing it well is the difference between a medicine that reaches a patient and one that does not.

You need no background to read this. If you have never set foot in a factory, never read a regulation, never thought about databases, you are exactly the reader we wrote for. We will define every specialized word the first time it appears, and again in a Key terms box at the end of each chapter.

The simple version

Imagine baking a cake for a contest where the judges do two things: they taste the cake and they read your notebook — every ingredient weighed, every oven temperature, every minute timed, signed and dated as you went. A delicious cake is not enough on its own; if a page of the notebook is missing, you are disqualified no matter how perfect the cake. And a flawless notebook cannot rescue a cake that fails the taste test. In regulated medicine-making, the notebook is the product just as much as the cake is — and both must pass. This book is about keeping that notebook.

What this chapter covers

A quick map of the preface: who the book is for and the promise we make to you; the central idea that data is a manufactured product in its own right; how the five parts of the book fit together; how this book relates to its companion process guide, Biologic Drug Manufacturing; and the small set of conventions — citations, glossaries, bilingual text — that you will see on every page.

The promise: every claim is traceable

This is a popular book in tone and a textbook in rigor. Those two goals can fight each other, so we made a rule. Every claim that is not obvious — every number, every regulatory fact, every "studies show" — carries a small bracketed marker like this [1]. Click it and you land on a single References page that lists the exact peer-reviewed paper or regulatory document behind the statement. Wherever a claim is non-obvious, you can follow it to its source and check.

Data is the product's twin

Here is the idea the whole book turns on.

A batch makes two products: the molecule and its data shadow

A biologic — a medicine made by living cells rather than by pure chemistry, such as a monoclonal antibody (one identical antibody protein, grown by the billions in engineered cells, that locks onto a single target molecule — an antigen — in the body) — is hard to make the same way twice. It is grown, not assembled, so the way you make it helps define what it actually is. Every step of that growing-and-purifying journey is laid out, vessel by vessel, in the companion guide, Biologic Drug Manufacturing; this book rides alongside it and asks, at each step, what data is born here, and where does it go?

Quality by Design: process data as evidence

The regulatory field's answer to the make-it-twice challenge, called Quality by Design (QbD), treats deep, recorded understanding of the process as being as essential to the product as the molecule itself: you identify which process settings are critical process parameters (CPPs) and which measurable product traits are critical quality attributes (CQAs), and you capture the data linking them [1]. The discipline is to control the CPPs — a tank temperature, a feed rate — so the CQAs — purity, the right sugar pattern on the antibody (antibodies carry attached sugar chains — their glycosylation — whose shape affects how the drug works and how long it lasts in the body) — always land in spec, and to keep the data that proves you did. In other words, modern medicine-making does not just produce a molecule; it produces knowledge about the molecule, written down. That recorded knowledge — not the operator's memory, not a verbal assurance — is the evidence a regulator weighs when deciding whether a batch may reach a patient. This toolkit has kept growing: later guidelines extend the same idea to managing a product across its whole lifecycle (the International Council for Harmonisation — ICH, the body whose guidelines harmonise drug-quality expectations across regions, here ICH Q12 [9]) and to running the process continuously rather than in discrete batches (ICH Q13 [10]) — directions the real-world sections at the end of this preface and throughout the book pick up.

Process Analytical Technology: measuring quality in real time

That knowledge is enormous. A single batch leaves behind a continuous trail of sensor readings — temperature, oxygen, acidity, and more — captured in real time so that quality can be judged during manufacture rather than only after, an approach regulators call Process Analytical Technology (PAT) [2]. Each reading is a small, structured fact. A single temperature point from a bioreactor might be stored as tag=BR101.Temp.PV, value=37.2, unit=°C, timestamp=2026-06-14T14:32:07Z, status=Good — a named tag (a structured, hierarchical name like BR101.Temp.PV, which we unpack in Part II), a value, a unit, the exact instant, and a quality flag (here, that the reading itself is reliable — not a judgment of whether 37.2 °C is within the acceptable range). Millions of such facts accumulate. Add to that the laboratory test results, the step-by-step batch records, and the human signatures, and you have a second artifact running in parallel with the medicine itself. We call it the molecule's data shadow.

One process, two products: living cells in a bioreactor produce both the molecule in the vial and its data shadow; both must pass before a batch is released One process, two products — the molecule and its data shadow are made together, judged together, and released together. Original diagram by the authors, created with AI assistance.

Anatomy of a batch's data shadow

The data shadow is not one big blob; it comes from six distinct sources, each born in a different place, each with its own shape. There are the continuous sensor readings — in-line and on-line measurements taken directly in the process (in-line = a probe sitting inside the vessel; on-line = an automatically drawn side-stream measured right beside it) — a dense time-series — temperature, pH, oxygen — sampled as often as every second, with slower-moving tags read less frequently and historians (the time-series databases that store tag data) typically storing only the points that change; the at-line and off-line lab results (at-line = a sample tested on a machine beside the line; off-line = a sample carried away to the lab) — discrete numbers like titer (how much antibody per litre of culture) and viability (the fraction of cells still alive), one per pulled sample, tested in the QC lab; the phase-transition events that mark when one stage of the recipe ends and the next begins; the batch master record, the metadata describing product, recipe, equipment, and operator in the language of the ISA-88 and ISA-95 standards (ISA-88 describes how a batch recipe and its procedures are structured; ISA-95 describes how a plant's equipment and enterprise are organized — the shared vocabularies we build on in Part IV); the process events — feeds, doses, the moment a sample was pulled; and the audit trail and electronic signatures recording who did what, when, and who approved it. Six sources, six formats, six birthplaces.

What turns those six scattered sources into one coherent story is a single shared key: the batch_id. Stamp the same batch identifier on every reading, result, and event, and a bare 37.2 becomes "37.2 °C, in batch BATCH-2026-001, during the Production phase, on equipment BR101" — a number with a context you can trust.

Identity card for a batch's data shadow: six sources — continuous sensor readings, at-line and off-line lab results, phase-transition events, the batch master record, process events, and the audit trail with signatures — each with an example data point and where it is born, all stitched together by one shared batch_id key into a single contextualized record The six sources of a batch's data shadow, each born in a different system, all joined by one shared batch_id into a single trustworthy record. Original diagram by the authors, created with AI assistance.

Batch genealogy: linking stations into one traceable story

Those six sources are not produced all at once in one room. They are laid down station by station as the batch moves through the plant — the seed train that wakes the cells, the production bioreactor that grows them, the purification suites, the fill line. Each station hands the batch forward and, with it, its own slice of the data shadow. Batch genealogy is the chain of identifiers that lets you walk that path backwards: from a vial in a patient's hand, to the fill lot, to the bioreactor run, to the very vial of frozen cells the batch started from. When the genealogy holds, an investigator can answer "what happened to this dose?" in minutes; when it breaks, the answer can take weeks — a problem we return to below.

The hard consequence is this: in a regulated process, a batch that was not recorded effectively did not happen. If the data proving a batch is good is missing, incomplete, or untrustworthy, the medicine cannot be released — even if the molecule in the tank is perfect. The data is not paperwork about the product. It is a product.

The unsolved part: the fragmentation problem

Here is the honest difficulty at the center of this whole subject, and it is not yet solved. We have just described six neat sources joined by a tidy batch_id. In a real plant they are rarely so tidy. The sensor traces live in a historian; the lab results live in a LIMS (Laboratory Information Management System); the batch record and process events live in an MES (Manufacturing Execution System); the analyst's notes live in an ELN (Electronic Laboratory Notebook); the raw instrument files live on the instrument. Each system was bought from a different vendor, in a different decade, speaks a different format, and timestamps to a different clock. None of them agrees, out of the box, on what a "batch" even is or how to spell its identifier.

The cost is concrete. When something goes wrong and an investigator has to reconstruct what happened, most of the effort goes not into analyzing the data but into locating and aligning it — exporting spreadsheets, matching timestamps by hand, reconciling one system's batch name against another's. Regulators have noticed: modern data-integrity guidance explicitly extends its expectations to data that is distributed across multiple computerized systems, insisting that the complete, attributable record survive the seams between them [4]. And the scientific-data community's answer — making data Findable, Accessible, Interoperable, and Reusable — turns out to be far harder to achieve inside a heterogeneous manufacturing plant than inside a single research lab, precisely because the context that gives a number meaning is scattered across systems that were never designed to share it [5].

This book does not pretend the problem is finished. What it offers is the way through: a shared model of what a batch is, a shared vocabulary so a number means the same thing everywhere, and an architecture that joins the sources back together. The concrete shape of that join — a ts.sensor_reading row stitched to its s88.batch context so the reading is never bare — is exactly what the third book's reference architecture builds in code. Naming the problem honestly is the first step; the rest of this book, and its companion, is the answer.

How to read this book

The five parts of this book

The book is one continuous argument, told in five parts. You can read straight through, or jump to the part you need.

Part I — Why data is the product's twin — the idea you just met, unpacked. What a batch's data shadow contains, and why an undocumented batch did not happen.
Part II — Sources, systems and architecture — where the data is born (sensors, instruments, people) and the systems that capture and store it across a plant.
Part III — Integrity, compliance and validation — the rules that make data trustworthy enough to bet a patient's safety on, and how that trust is proven.
Part IV — Semantics and the digital thread — making data not just stored but meaningful: shared vocabularies and connections so a number means the same thing to every system and person who reads it — expressed formally enough that a machine can check and query it (a knowledge graph of RDF triples, validated by SHACL shapes and answered by SPARQL queries), the layer the companion Ontologies for Biopharmaceutical Manufacturing builds in full.
Part V — Analytics and the future — what becomes possible once the data is clean, connected, and trustworthy, from process insight to the factories of tomorrow — including the predictive models that, done honestly under GxP, demand the rigor this book keeps returning to: leakage-safe validation, an explicit applicability domain, and drift detection, developed in depth by the companion Machine Learning & AI for Biomanufacturing.

A thread runs through all five: the FAIR principles, a widely adopted set of guiding principles holding that good scientific data should be Findable, Accessible, Interoperable, and Reusable — not locked in a drawer or a format only one machine can read [3]. Keep FAIR in mind; we return to it often.

From the process guide to this book: the data shadow's journey

This book is the middle volume of a trilogy. The first book, Biologic Drug Manufacturing, is a beginner's guide to the actual physical process of making a biologic — choosing a target, building the cells, growing them in bioreactors (the warm tanks where living cells produce the medicine), purifying the result, and filling it into vials. That guide answers how the medicine is made. This book answers how the data is made and managed. And a third book, Open-Source Bioprocess Data Systems, answers how you build the software that holds it all — the same batch_id-stamped reading, shown as a concrete database row.

So a single fact threads through all three: a probe in the production bioreactor physically measures a temperature (book one); that measurement becomes a tagged data point with units, a timestamp, and a quality flag (this book); and that point lands in a database as a ts.sensor_reading row joined to its s88.batch context, exactly as the open-source reference architecture shows it. The physical step, the data point, the stored row — one thread, three books.

You do not need the other two to follow along — we will reintroduce each process step as it becomes relevant. But if you would rather meet the biology and the equipment before the data, start with Biologic Drug Manufacturing; the volumes are designed to be read side by side.

note

Throughout, we lean on the standard commercial way of making biologics to teach the basics, and point out where the modern, more continuous approach differs. You will meet that modern path in the real-world sections.

A few conventions

These appear on every page, so it is worth knowing them once.

Citations. Inline markers like [2] link to the References page. The visible number is local to each chapter and restarts at [1] in every chapter.
Key terms. Each chapter ends with a short glossary of the terms it introduced, so you never have to scroll back.
Admonitions. Coloured boxes flag the helpful asides: a tip for the plain-English analogy near the top, a note for useful context, and a caution where a misunderstanding could genuinely cause harm.
Bilingual. The book is published in English and Korean (한국어), so a reader can follow it in either language.
Trademarks. Product and company names mentioned in this book (including but not limited to Siemens, gPROMS, AspenTech, Aspen Hybrid Models, DataHow, DataHowLab, OPC UA, GAMP, PI System, Sartorius, Thermo Fisher, Cytiva, Waters, Agilent, AVEVA, OSIsoft, InfoPlus.21) may be trademarks or registered trademarks of their respective owners and are used for identification and editorial purposes only, with no claim of endorsement.

caution

This book teaches how to think about data management in medicine-making. It is not regulatory advice, and it is not a validated procedure. Real manufacturing decisions must follow current official guidance and your organization's approved processes.

Why it matters

If you remember one thing, make it this: managing the data is not the clerical tail of manufacturing — it is half the manufacturing. The molecule and its data shadow are made together, judged together, and released together. Treat the data as an afterthought and you risk a batch that is physically fine but legally and scientifically unreleasable; treat it as a product — designed, built, and quality-checked with the same care as the molecule — and everything downstream, from regulatory approval to patient trust, rests on solid ground. Every chapter that follows is, at bottom, about earning that trust.

In the real world

This is not a theoretical concern. The Quality by Design framework that anchors Part I came directly out of regulatory science — set out in the ICH guidelines Q8(R2) [6], Q9 [7], and Q10 [8] — reframing process data as core to the product rather than a by-product [1]. The push toward real-time, in-process measurement — PAT — likewise came from regulators seeking to build quality in during manufacture instead of testing for it at the end, codified in the FDA's 2004 PAT guidance [2]. And the FAIR principles that shape how we store and connect data originated in the broader scientific community and have since spread across industry and government [3].

These ideas are realized on real equipment and real software. The bioreactors that grow the cells come from vendors such as Sartorius, Thermo Fisher, and Cytiva; the process chromatography systems that purify the product at plant scale from vendors such as Cytiva, alongside the analytical chromatography instruments in the QC lab from Waters and Agilent. Each instrument's continuous measurements — temperature, pH, dissolved oxygen — become the dense sensor streams that form part of the data shadow. Those streams do not vanish into the ether: they are captured by process historians (such as the AVEVA/OSIsoft PI System or AspenTech InfoPlus.21), orchestrated by a Manufacturing Execution System (MES), and joined by lab results held in a Laboratory Information Management System (LIMS). And because patients depend on it, that whole data shadow is governed by binding rules: electronic records and electronic signatures must satisfy the FDA's 21 CFR Part 11 (Title 21 of the U.S. Code of Federal Regulations, Part 11) and the EU's Annex 11, while computer systems are validated under guidance such as GAMP 5 (ISPE) — the regulations and frameworks this book returns to throughout.

Public-private institutes such as NIIMBL (the U.S. National Institute for Innovation in Manufacturing Biopharmaceuticals) and industry consortia like BioPhorum carry these ideas into modern biomanufacturing, with cGMP (current Good Manufacturing Practice — the binding quality rules for making medicine) pilot facilities scaling up and de-risking manufacturing innovations and training the workforce [11]. The themes of this book are the live, daily concerns of the people who make medicine.

Key terms

Biologic — a medicine made by living cells rather than by pure chemistry.
Monoclonal antibody (mAb) — one identical antibody protein, grown in vast numbers by engineered cells, that binds to a single target molecule (an antigen) in the body.
Antigen — the specific target molecule in the body that a monoclonal antibody is designed to recognize and lock onto.
Bioreactor — the warm tank in which living cells are grown to produce the medicine.
Batch / lot — one complete manufacturing run that produces a defined amount of product.
Data shadow — the complete body of recorded data (sensor traces, test results, batch records, signatures) generated alongside a batch.
Tag — the structured, hierarchical name identifying a single measured signal (e.g. BR101.Temp.PV), under which a historian stores its time-series.
Historian — the time-series database that stores a plant's tag readings over time, typically keeping only the points that change.
Titer — how much antibody the cells make per litre of culture.
Viability — the fraction of cells in a culture that are still alive.
ISA-88 / ISA-95 — standards that define, respectively, how a batch recipe and its procedures are structured and how a plant's equipment and enterprise are organized.
batch_id — the shared identifier stamped on every reading, result, and event of one batch, the key that stitches the data shadow's separate sources into one record.
Batch genealogy — the chain of identifiers linking a finished dose back through every station and earlier batch it came from, enabling end-to-end traceability.
Quality by Design (QbD) — a framework that treats recorded process understanding as essential to the product itself.
Critical process parameter (CPP) — a process setting that must be controlled because it affects product quality.
Critical quality attribute (CQA) — a measurable product trait that must stay within limits for the medicine to be safe and effective.
Process Analytical Technology (PAT) — measuring critical attributes in real time, during manufacture, rather than only after.
FAIR principles — the guiding principles that data should be Findable, Accessible, Interoperable, and Reusable.
cGMP (current Good Manufacturing Practice) — the legally binding quality rules a facility must follow to manufacture medicine.
References page — the single page where every inline citation marker resolves to its source.

Where this leads

We have claimed that every gram of medicine casts a data shadow as essential as the molecule itself. The next chapter, The Biologic and Its Data Shadow, makes that concrete: it walks through the scale and the many types of data a single batch generates — the sensor traces, the batch records, the test results, the signatures — and introduces the documentation imperative at the heart of regulated manufacturing, the rule that has guided this industry for decades: if it isn't documented, it didn't happen.

What this chapter covers​

The promise: every claim is traceable​

Data is the product's twin​

A batch makes two products: the molecule and its data shadow​

Quality by Design: process data as evidence​

Process Analytical Technology: measuring quality in real time​

Anatomy of a batch's data shadow​

Batch genealogy: linking stations into one traceable story​

The unsolved part: the fragmentation problem​

How to read this book​

The five parts of this book​

From the process guide to this book: the data shadow's journey​

A few conventions​

Why it matters​

In the real world​

Key terms​

Where this leads​