Validating Computerized Systems: GAMP 5 and the Move to CSA

📍 Where we are: The last chapter made electronic records legally trustworthy; this one proves the systems that hold those records actually work — and shows how the industry is learning to prove it with thinking instead of paperwork.

In the previous chapter we saw that an electronic record or signature can stand in for paper only if you can trust it — and that 21 CFR Part 11 (Part 11 of Title 21 of the U.S. Code of Federal Regulations — the electronic-records rule) and EU GMP Annex 11 (its European counterpart) both demand that the computer system holding those records be validated before you rely on it [8][7]. That single word — validated — is a whole discipline. This chapter is about what it means to prove that a regulated data system does what it is supposed to do, and about a quiet revolution in how the industry does that proving.

The simple version

Think about getting a new car onto the road. The old way was to fill a binder with photographs and signed checklists proving you tested every screw, every wire, the radio, the cup-holders: the same exhaustive paperwork whether the part was the brakes or the glove-box light. The new way says: think first. Hammer on the brakes and the steering, because lives depend on them. Glance at the cup-holder, note it works, and move on. You spend your effort where the risk is. That shift, from "test everything the same way and document it heavily" to "think about what matters and prove that well", is the move from CSV to CSA.

What this chapter covers

We start with Computerized System Validation (CSV) — what it is and why it became a burden. Then we meet GAMP 5, the industry's risk-based playbook, with its software categories and its famous V-model. Then comes Computer Software Assurance (CSA), the FDA-led shift toward critical thinking over scripts. Then we open the requirements-traceability matrix — the single table an inspector actually reads — and take one row apart to show how the whole philosophy collapses into one cell. Finally we connect validation back to the data-integrity controls of the last two chapters, and glimpse the frontier: validating cloud software and AI models.

What validation is, and why it grew heavy

Computerized System Validation (CSV) is the act of producing documented evidence that a computer system does what it is intended to do — and nothing it should not [6]. It is not a single test but a lifecycle of activities that establishes, with a high degree of assurance, that a system is fit for its intended use [6]. The FDA's General Principles of Software Validation (2002) (FDA's medical-device software-validation guidance, whose principles the wider industry applies broadly) set this baseline: software validation should be an integrated part of the system's whole life, paired with risk management, because you can never test every possible path through complex software, so you must reason about where failure would do harm [6].

Why is this required at all? Because in regulated biomanufacturing the software is part of the product's quality system. A miscalibrated bioreactor controller or a spreadsheet that rounds the wrong way can corrupt a patient's medicine — and corrupt the records that are supposed to prove the medicine is safe. The requirement begins in the core GMP (Good Manufacturing Practice) regulation itself: 21 CFR 211.68 (the same Title 21 numbering scheme, here a specific section rather than a Part) requires that automatic and electronic equipment used in manufacturing be routinely calibrated, inspected, or checked according to a written program designed to assure proper performance (211.68(a)), and that the data such systems hold be checked for accuracy and backed up (211.68(b)) — the GMP basis the FDA reads as requiring computerized-system validation [12]. Part 11 and Annex 11 then extend that expectation to the electronic records and signatures those systems produce, making validation a legal duty, not a nicety [8][7].

Here is the catch. Over two decades, validation drifted into ritual. Teams wrote enormous test scripts — step-by-step scripted procedures — and screenshotted every click, applying the same exhaustive treatment to a trivial label printer as to a system controlling a sterilizing filter [9]. Fear of regulators turned "documented evidence" into "document everything, identically." The result was slow, expensive, and — perversely — often worse for quality, because energy went into paperwork volume instead of into testing the things that could actually hurt a patient [2][9].

Verification vs. validation

A useful distinction: verification asks "did we build the system right?" — does it meet its specification? Validation asks "did we build the right system?" — is it fit for the real intended use? Good practice does both, and modern standards increasingly fold them under a single risk-based umbrella [1].

GAMP 5: a risk-based playbook

The most influential cure for ritual validation is GAMP 5 — Good Automated Manufacturing Practice, a guide published by ISPE (the International Society for Pharmaceutical Engineering) [1]. GAMP 5 is not a law; it is the industry's most widely followed interpretation of how to validate computerized systems sensibly. Its organizing idea is in its subtitle: a risk-based approach [1]. You scale the effort to the risk — a principle borrowed directly from formal quality risk management, codified in ICH Q9 (a guideline of the International Council for Harmonisation, the body that issues globally harmonised pharmaceutical guidelines), which tells the industry to make risk-based decisions, match the formality of the effort to the level of risk, and reduce subjectivity [5].

GAMP 5 software categories and the risk-based principle

GAMP 5 puts a practical handle on "how much effort" through software categories — a way of classifying software by its type and by how much it is configured or custom-built, because custom code carries more unknown risk [1]:

Category 1 — Infrastructure software: operating systems, databases, the plumbing. You manage it, you do not validate it as an application.
Category 3 — Non-configured products: commercial off-the-shelf software used as-is, "out of the box."
Category 4 — Configured products: commercial software you tailor with settings — a LIMS or MES (the lab and manufacturing-execution systems from the previous chapter) configured to your process. A commercial single-use bioreactor supplied with a proprietary control system and then extensively configured on-site for a specific cell line and process recipe is a tangible Category 4 example: a commercial product reshaped to the customer's process without writing new code.
Category 5 — Custom (bespoke) software: code written specifically for you, carrying the most risk and so the most scrutiny.

(There is no Category 2: the old GAMP 4 firmware category was dropped when GAMP 5 was introduced, firmware is now placed in Category 3, 4, or 5 according to its complexity, and the remaining categories were simply never renumbered.) The higher the category, the more you must prove. Crucially, where a product is commercially supplied you can leverage the supplier: if a vendor has already tested and documented its product, you assess that work and reuse it rather than re-testing from scratch [1]. This is exactly the philosophy of ASTM E2500, the science- and risk-based verification standard that argues you should verify a system is fit for intended use using the best available knowledge, including the supplier's, instead of rote, identical qualification of every component [4].

The V-model: URS/FS/DS down, IQ/OQ/PQ up

GAMP 5's signature picture is the V-model, which pairs every specification on the way down with a matching verification on the way up. You specify what the system must do, then build, then prove — and each promise made on the left is checked by a test on the right [1].

A V-shaped GAMP 5 model: User Requirements, Functional Specification, and Design Specification descend the left arm into Build and Configure at the bottom, then Installation, Operational, and Performance Qualification ascend the right arm, with dashed verified-by links pairing each specification to its matching qualification. The GAMP 5 V-model: each specification on the left descent is proven by a matching verification on the right ascent. Original diagram by the authors, created with AI assistance.

A concrete walk-through helps. Imagine validating a new potency-assay method in a LIMS. The User Requirement states the plain goal — "the system must calculate potency results with ±5% accuracy against reference standards." The Functional Specification details the calculation logic that meets it. The Design Specification describes the database schema and input-validation rules. On the way back up: Installation Qualification confirms the software builds and installs as specified; Operational Qualification confirms the calculation logic functions as specified in a test environment; and Performance Qualification confirms the system meets the ±5% accuracy requirement against reference standards on real samples in the customer's own lab.
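The PQ acceptance criterion in this walk-through comes down to simple arithmetic. The following is a minimal sketch, assuming the ±5% is read as relative error against each reference standard's nominal value; the function names and the sample values are illustrative and are not taken from any real LIMS.

```python
def relative_error_pct(measured: float, reference: float) -> float:
    """Signed relative error of a measured potency against its reference, in percent."""
    if reference == 0:
        raise ValueError("reference potency must be non-zero")
    return (measured - reference) / reference * 100.0


def pq_accuracy_passes(pairs, tolerance_pct: float = 5.0) -> bool:
    """True only if every (measured, reference) pair is within the tolerance."""
    return all(abs(relative_error_pct(m, r)) <= tolerance_pct for m, r in pairs)


# Hypothetical PQ run: (measured potency, reference-standard potency), same units.
pq_run = [(98.2, 100.0), (103.9, 100.0), (51.0, 50.0)]
print(pq_accuracy_passes(pq_run))                     # every result within 5%
print(pq_accuracy_passes(pq_run + [(106.0, 100.0)]))  # one result reads 6% high
```

The first call passes and the second fails, because a single out-of-tolerance result is enough to fail the rung. A real protocol would also fix the number of replicates and the statistical treatment in advance; this sketch deliberately leaves those out.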

For a batch-execution system (MES) the OQ rung carries one extra check: the executing batch recipe is verified to follow the ISA-88 (IEC 61512) procedure model — the recipe-decomposition standard, explained in Architecture and Integration: ISA-95, OT/IT, and the Edge-to-Cloud Path, that defines how a recipe breaks down into procedures, unit procedures, operations, and phases.

The same V applies to the process equipment that makes the medicine, not only the lab and plant software, and a downstream purification skid shows why the risk-based reading matters. Consider validating the chromatography data system (CDS) and control software on a Protein A capture chromatography skid. Protein A capture is the affinity step that grabs the antibody out of the clarified harvest; it is the single highest-leverage purification move, detailed in Book 1's Capture: grabbing the antibody (Protein A). Some of the skid's software functions are unambiguously Critical and earn full scripted IQ/OQ/PQ. The pooling-window cut points (the two thresholds, set against the live ultraviolet trace, between which the elution peak is collected as product) decide which slice of the antibody peak becomes drug substance and which is sent to waste. Set them wrong and aggregate-rich tailing material or leached Protein A contaminates the pool. The software that detects the peak, applies the cut criteria, and records the pooling decision into the batch record is therefore tested rigorously and verified to hold a complete, attributable, unaltered raw chromatogram (gaps here are a recurring data-integrity inspection finding). The dynamic-binding-capacity (DBC) alarm that stops the load before antibody breaks through to waste is likewise Critical. By contrast, a cosmetic trend-display colour or an operator-convenience report layout on the same CDS is genuinely low-risk and earns a lighter, unscripted check: the very CSA discrimination the next sections build. Equipment-side, the skid's installation is proven by Factory Acceptance Testing (FAT) at the vendor and Site Acceptance Testing (SAT) on site. The chromatography step's own qualification (column packing checks such as HETP and peak asymmetry, and the proven resin-reuse lifetime) feeds the same body of evidence, which the CSA verification leverages rather than re-deriving.

The traditional verification rungs are IQ / OQ / PQ — Installation, Operational, and Performance Qualification: proof that the system was installed right, operates right, and performs right for its real workload [4]. In the classic scheme a fourth rung, Design Qualification (DQ), precedes IQ — a documented check that the design itself will meet the requirements before anything is built or installed. That three- (or four-) script ladder is the legacy default, not the only model. ASTM E2500 and the second edition of GAMP 5 increasingly fold IQ/OQ/PQ into a single risk-based verification that leverages a supplier's Factory and Site Acceptance Testing (FAT/SAT) and commissioning evidence rather than re-running three separate scripted qualifications from scratch — so the same proof is assembled once, from the best available evidence, instead of transcribed three times [4][1]. The second edition of GAMP 5 (2022) modernized all of this for how software is now built and bought — embracing iterative Agile development, cloud services, and emerging AI, and elevating critical thinking as the explicit thread that decides where effort goes [1].

The same V is now read two ways. Drawn plainly, as in the classic diagram, the V pairs each specification with its verification and every rung is tested alike. Drawn through the CSA lens, it gains an overlay: the rigor applied to each rung is no longer uniform but dialed by risk, so a critical, patient-impacting function earns rigorous scripted proof on its rung while a trivial one earns a lighter, unscripted check and far less paperwork.

The same V-model, now read through CSA: each specification is still proven by its paired verification, but risk decides how hard each rung is tested. Original diagram by the authors, created with AI assistance.

From CSV to CSA: Computer Software Assurance, thinking over scripts

The 2022 FDA Computer Software Assurance shift

That phrase — critical thinking — is the hinge to the latest act of this story. In 2022 the FDA published a draft guidance introducing Computer Software Assurance (CSA) — a guidance written for medical-device production and quality-system software, whose risk-based principles the drug and biologics industry has rapidly adopted by analogy — and in 2025 it issued the final version [3][2]. CSA is a deliberate course-correction on the burden CSV had become. Its message: focus assurance effort on what matters to patient safety and product quality, apply critical thinking to decide how much testing each function needs, and use the least-burdensome approach that still gives confidence [2].

Scripted vs. unscripted (exploratory) testing

CSA changes the testing toolkit, not just the attitude. Alongside heavy scripted testing, it explicitly endorses lighter methods — unscripted testing (such as exploratory or ad-hoc testing, where a skilled tester probes the system without a pre-written script) — for lower-risk features, with documentation proportionate to risk rather than uniform and exhaustive [2][9]. The decision flows from a simple question first: if this software feature failed, could it harm a patient or compromise product quality? High-impact, direct-to-patient functions get rigorous scripted proof; low-impact functions get lighter, faster assurance [9].

A concrete example shows the logic in action. In a LIMS, the field that stores an analyst's free-text comment on a test result is not critical: if it failed, the reportable assay result itself would be unaffected, so CSA classifies it as lower-risk and allows unscripted testing with proportionate documentation. By contrast, the code that calculates assay potency from raw instrument data is critical — it directly affects the number on which a patient's medicine is released — so it requires rigorous scripted testing and full IQ/OQ/PQ. Same system, two functions, two very different levels of assurance.

A two-by-two matrix mapping a function's risk to the patient, product, and data against testing intensity (unscripted versus scripted), sorting each into appropriate, danger, or wasted-effort quadrants. CSA matches testing effort to risk — rigor where it protects the patient, restraint where it doesn't. Original diagram by the authors, created with AI assistance.
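The matrix comes down to two yes/no questions, which can be sketched in a few lines. This is a minimal illustration with hypothetical function names; the quadrant labels follow the figure, and nothing here is prescribed by the FDA guidance itself.

```python
def required_rigor(could_harm_patient_or_quality: bool) -> str:
    """CSA's first question: would failure harm a patient or product quality?"""
    return "scripted" if could_harm_patient_or_quality else "unscripted"


def quadrant(high_risk: bool, testing: str) -> str:
    """Place a (risk, testing intensity) pair in the two-by-two matrix."""
    if testing not in ("scripted", "unscripted"):
        raise ValueError(f"unknown testing intensity: {testing!r}")
    if testing == required_rigor(high_risk):
        return "appropriate"
    # High risk tested lightly is the dangerous mismatch;
    # low risk tested heavily is effort spent in the wrong place.
    return "danger" if high_risk else "wasted effort"


print(quadrant(high_risk=True, testing="scripted"))     # potency calculation
print(quadrant(high_risk=False, testing="unscripted"))  # free-text comment field
print(quadrant(high_risk=True, testing="unscripted"))   # under-tested critical function
print(quadrant(high_risk=False, testing="scripted"))    # over-tested trivial function
```

Real risk assessments use more than two levels and weigh detectability and severity as well, so treat the boolean here as a simplification of the figure rather than a working risk model.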

Crucially, CSA does not throw out CSV or GAMP 5 — it operationalizes their risk-based intent [9]. The final 2025 guidance supplements the 2002 software-validation guidance and supersedes only its specific validation section, leaving the broader lifecycle approach intact [2][6]. GAMP 5's second edition and CSA are best read as two voices saying the same thing: stop measuring quality by the weight of the binder [1][2].

The requirements-traceability matrix as the evidence backbone

If the V-model is the shape of validation and CSA is its attitude, the requirements-traceability matrix (RTM) is the record that ties them together. It is the artifact an inspector actually reads: a table in which every user requirement maps forward to the test that proves it, and every test maps back to a requirement, so that nothing is untested and nothing is untraced. The matrix is where the philosophy of this chapter turns into a row of data you can point at.

Each row carries a small, fixed set of columns: a stable requirement id, the plain-language requirement text, the GAMP software category that sets the validation tier, the risk class that sets the testing rigor, the verification that covers it, and — the column that matters most — the test it resolves to, with its result and a link to the evidence. Under CSA the matrix is where the risk-based principle becomes visible: the Critical-risk rows draw rigorous scripted proof, while low-risk rows are satisfied by a lighter check, and the same table shows both decisions side by side. This is the data-management view of the same record that the open-source companion makes concrete: in Validating an Open-Source GxP Stack, one row of a traceability matrix — generated from the test IDs rather than hand-written into a shipped file — resolves not to a screenshot folder but to a runnable pytest node that continuous-integration (CI) re-executes on every commit — so the evidence is generated, not transcribed.

Anatomy of a traceability-matrix row

Take a single row apart, because each column is one of the inspector's questions answered in machine-readable form. The row below is URS-003 — the requirement that record changes be attributable, reasoned, and tamper-evident, the same data-integrity control the Part 11 chapter made a legal duty. Its gamp_cat is 5 because in this example the tamper-evident hash-chain is bespoke code written for this stack (the same chain the Part 11 chapter dissected), not a configured vendor feature — which is exactly why it earns the highest scrutiny rather than letting us leverage a supplier. Its risk is Critical, and its test_id resolves to an executable test whose result is a PASS backed by a CI log rather than a binder page.

Each column of a traceability row plays one role; the test_id is the executable punchline — it names a test that runs, so the evidence is regenerated, not transcribed. Original diagram by the authors, created with AI assistance.

Read field by field, the first five columns are bookkeeping that any binder-based matrix would carry. The sixth, test_id, changes the kind of document this is: a traditional matrix points its verification column at evidence a human transcribed once, while a modern one points at evidence a machine regenerates — which is the whole difference between CSV and CSA expressed in a single cell. The row also hands the reader across the series' three intertwined threads — physical manufacturing, data, and open-source: the physical control point it protects (the batch record's attributable edits) traces back to the manufacturing step in Quality, Regulatory, and the Data Backbone, and the concrete implementation of this exact row lives in the open-source validation chapter.
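The "nothing untested, nothing untraced" promise can be sketched as a two-way check over rows. The field names gamp_cat, risk, and test_id come from the row described above; every other field name, the second requirement (URS-007), and the pytest node ids are hypothetical and exist only to make the example run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RtmRow:
    req_id: str                   # stable requirement id, e.g. "URS-003"
    text: str                     # plain-language requirement
    gamp_cat: int                 # GAMP software category (1, 3, 4, or 5)
    risk: str                     # risk class, e.g. "Critical"
    verification: str             # verification covering the row (illustrative values)
    test_id: Optional[str]        # executable test the row resolves to, if any
    result: Optional[str] = None  # "PASS" or "FAIL" once the test has run


def traceability_gaps(rows, known_tests):
    """Return (requirement ids with no test, tests that trace to no requirement)."""
    untested = sorted(r.req_id for r in rows if not r.test_id)
    traced = {r.test_id for r in rows if r.test_id}
    untraced = sorted(set(known_tests) - traced)
    return untested, untraced


rows = [
    RtmRow("URS-003", "Record changes are attributable, reasoned, tamper-evident",
           5, "Critical", "OQ", "tests/test_audit.py::test_hash_chain", "PASS"),
    RtmRow("URS-007", "Analyst can attach a free-text comment",
           4, "Low", "unscripted", None),  # test never written: a forward gap
]
known = ["tests/test_audit.py::test_hash_chain", "tests/test_misc.py::test_orphan"]
print(traceability_gaps(rows, known))
```

The call reports URS-007 as untested and the orphan test as untraced, the two failure modes the matrix exists to rule out. Running a check like this in CI turns a reviewer's eyeball scan into a failing build.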

The traceability row as a graph: a shape, not a spreadsheet cell

Notice the deeper structure under that row. The matrix says "every requirement maps forward to a test and back again, so nothing is untested and nothing is untraced." That is not a query over the rows that exist — it is a claim about the rows that should exist and might be missing, which is exactly the closed-world check (a check that treats a missing required fact as a failure, not as merely unknown) the ontology book reaches for when it gates a release. The same URS-003 record can be expressed as RDF (Resource Description Framework — the standard way to write data as subject-predicate-object triples, the atomic facts of a graph) and gated by a SHACL (Shapes Constraint Language — a language for validating that graph data has a required structure) shape, instead of as a row a human eyeballs:

# URS-003 as a set of triples, and a SHACL shape that enforces its traceability as a closed-world check.
@prefix v:   <https://example.org/validation#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .

v:URS-003 a v:UserRequirement ;
    v:gampCategory 5 ;
    v:riskClass    "Critical" ;
    v:verifiedBy   v:TEST-003 .          # forward link to the test that proves it

v:TraceabilityShape a sh:NodeShape ;
    sh:targetClass v:UserRequirement ;
    sh:property [ sh:path v:verifiedBy ;  # every requirement MUST resolve to a test
                  sh:minCount 1 ;
                  sh:message "Untraced requirement: no test resolves it." ] ;
    sh:property [ sh:path v:riskClass ;   # and carry a risk class from the controlled set
                  sh:minCount 1 ; sh:maxCount 1 ;
                  sh:in ( "Critical" "High" "Medium" "Low" ) ] .

An untraced requirement now surfaces as a violation in a SHACL validation report, not as a gap someone has to notice. And the matrix's follow-up question, whether every Critical requirement still has a passing test, is a short SPARQL (the standard query language for RDF graphs, as SQL is for tables) competency question (a plain-English question the data must be able to answer, run as a pass/fail check):

# CQ: list any Critical requirement whose test is missing or not PASS — the inspector's question.
PREFIX v: <https://example.org/validation#>
SELECT ?req WHERE {
  ?req a v:UserRequirement ; v:riskClass "Critical" .
  OPTIONAL { ?req v:verifiedBy ?t . ?t v:result ?r }
  FILTER (!BOUND(?r) || ?r != "PASS")
}

The point is not to retire the matrix table — it is to see that the matrix already is a small graph, and that its "nothing untested, nothing untraced" promise is precisely the completeness guarantee a SHACL gate enforces mechanically. The same record also carries provenance in the W3C PROV-O sense (the standard vocabulary for recording who or what produced a fact, from which activity) — the test_id link is a prov:wasGeneratedBy edge from the result back to the CI run that produced it, so an auditor reads not just the PASS but the activity that asserted it. The ontology book builds exactly this machinery — a release decision modeled as a SHACL shape whose missing-result violation routes an investigation — in Validation: The Release Gate and SHACL, and turns the FAIR (Findable, Accessible, Interoperable, Reusable) test for such records into a measured scorecard in Maintenance: Publication, the Assembled Thread, and FAIR. A modern RTM and a SHACL release gate are the same idea in two notations: a checklist whose completeness a machine can guarantee.

Validation never ends: data integrity and periodic review

Validation is not a one-time gate you pass and forget. A regulated system must keep its data-integrity controls working: the audit trail (the secure, time-stamped record of who did what, when, and why — the backbone of trust we met in the Part 11 chapter) must itself be validated and reviewed, not merely switched on [8][7]. Annex 11 makes this explicit, requiring that computerised systems be validated, that data be protected, and that systems undergo periodic review to confirm they remain in a validated state over their whole life [7]. A system validated five years ago, since patched, reconfigured, and upgraded, is not automatically trustworthy today — periodic review is how you re-earn that trust [7].

Validating in a world of continuous change

Periodic review assumes a system that changes slowly — a patch here, an upgrade there, reviewed on a comfortable annual cadence. The modern software supply chain breaks that assumption. A cloud or SaaS (software-as-a-service) platform is a moving target: the vendor ships features, security fixes, and infrastructure changes on its own schedule, often weekly, and the customer cannot freeze the version the way they once froze a server in a locked room. Agile development inside the customer's own teams adds the same churn from the inside. GAMP 5's second edition explicitly reaches for this world — endorsing iterative delivery, continuous integration, and the leveraging of supplier and automated evidence rather than one-time scripted runs [1] — and CSA's least-burdensome posture is what makes re-proving a change cheap enough to do continuously rather than dreading it [2]. The aspiration is a system that is continuously validated: every change re-triggers a proportionate, mostly-automated assurance check, and the validated state is maintained as a living property rather than a five-year-old certificate. Regulators are catching up to exactly this gap: the EU put a revised Annex 11 (and a companion update to Chapter 4 on documentation) out for public consultation in 2025, explicitly extending coverage to cloud and SaaS service providers, audit-trail review, data integrity across networked multi-system environments, and AI/ML — the very 2011-era assumptions this section says are breaking [11]. Whether that aspiration is reachable in practice is the open problem of the next section.

The unsolved challenge: keeping a moving system validated

CSA's promise is to cut low-value documentation and let assurance follow risk. But it sharpens a problem it does not fully solve. The FDA's Computer Software Assurance for Production and Quality System Software guidance (FDA, 2025; draft 2022) and GAMP 5 Second Edition (ISPE, 2022) both tell you to scale effort to risk and to leverage supplier and automated evidence [2][1] — yet neither answers cleanly how you keep a GxP system — one governed by Good Practice regulations such as GMP and GLP — in a validated state when it changes weekly, and who owns the evidence when the software is a vendor's multi-tenant service.

Three knots remain genuinely hard. First, cadence: traditional periodic review and change control were built for slow change, and a cloud platform that updates every week can outpace any human-paced revalidation cycle. The validated state risks becoming a fiction the moment the vendor pushes an update you did not schedule [2]. Second, ownership of evidence: in a multi-tenant SaaS, the customer cannot see, freeze, or fully test the shared instance every other tenant also runs; you must assess and trust the vendor's qualification evidence, but you remain the regulated party the inspector holds accountable, so the burden does not actually transfer with the software [1][4]. The practical instruments for discharging this duty are supplier audits and shared (consortium) audits, the vendor's qualification package, and a SOC 2 Type II report as third-party assurance the regulated customer assesses and trusts. They let you build confidence without owning the instance, but never let you hand off the accountability. Third, continuous proof: making "every change re-triggers a proportionate check" real requires the assurance itself to be automated and re-runnable, which is an engineering capability most regulated quality systems have not yet built. The open-source companion's bet that an automated pytest suite can be the operational evidence is one concrete attempt at exactly this, explored in Validating an Open-Source GxP Stack and, for the records side, Part 11 and Annex 11 on Open Source. The honest state of the art is that CSA gives permission to validate continuously; it does not hand anyone a finished method for doing so when the software underfoot never stops moving.

Why it matters

For data management, validation is what converts "the system says so" into "the system can be trusted to say so." It is what lets the batch's data shadow (the complete digital record trailing campaign BATCH-2026-001 from seed vial to released vial) be trusted as much as the molecule itself: every released potency result, every audit-trail edit, every genealogy link, and every downstream use of the data (releasing a batch, investigating a deviation, training a model) rests on the system that produced it having been proven fit for purpose. It is also what makes a FAIR data shadow worth the effort: data that is findable, accessible, interoperable, and reusable is only as trustworthy as the validated system that recorded it, and FAIR without validation is merely well-organized hearsay. The CSA shift matters because it focuses assurance effort where risk is highest, on the functions that directly protect patient safety and data quality, rather than applying uniform scrutiny to every component [2][9]. Done well, that means more real quality for less wasted paperwork, and a faster path to adopting the modern, data-rich systems the rest of this book depends on.

In the real world

The economics here are not abstract. The legacy CSV approach was so document-heavy that companies routinely delayed software upgrades, sometimes for years, purely to avoid the revalidation paperwork, which left plants running old, less secure systems [3][9]. The FDA launched CSA in part to break that logjam and encourage automation that improves quality [3]. GAMP 5's "leverage the supplier" principle is now everyday practice: when a biomanufacturer adopts a cloud or SaaS platform (a cloud-hosted MES or LIMS delivered as a subscription rather than installed on the customer's own servers), it assesses and reuses the vendor's testing and qualification evidence instead of pretending it built the software itself [1][4]. For a Category 3 commercial off-the-shelf SaaS product, much of the assurance can come from the supplier's audited documentation rather than from re-validating every function on-site. The same risk-based, lifecycle thinking is spreading beyond software validation: ICH Q14 on analytical procedure development encourages an enhanced, lifecycle approach to the methods that generate the data (measure what matters, manage it over the method's life), echoing CSA's "scale the effort to the risk" logic in the laboratory [10]. (Q14 was finalized at Step 4 in November 2023 as a pair with the revised ICH Q2(R2) on analytical-procedure validation, the two together extending the enhanced-approach idea from products to methods.)

And the frontier keeps moving. Validating AI/ML models, whose accuracy can drift as the data and process they watch move away from what they were trained on, stretches these frameworks in new ways: such models need periodic re-validation and drift monitoring beyond a one-time IQ/OQ/PQ, a challenge GAMP 5's second edition began to address [1]. The FDA's 2025 draft guidance on AI to support regulatory decisions and ISPE's emerging GAMP AI/ML guidance are the first concrete attempts to extend these frameworks to learning systems [13][14], a thread we return to in MLOps and Lifecycle: Drift, Retraining, and the Validation Paradox.

What "validating a model" actually demands

The reason an AI model strains the IQ/OQ/PQ frame is worth making concrete, because it changes what evidence the PQ rung must produce. A model's PQ is not a one-time accuracy reading — it has to prove the accuracy will generalize, and for bioprocess that proof has three parts that ordinary software qualification never needed.

First, the held-out test must be split by batch, not by row — a discipline called grouped (leave-one-batch-out) cross-validation (holding all of one batch's rows out of training and scoring only on that unseen batch). Splitting a single batch's time-series randomly lets near-identical neighbouring rows land in both training and test, an error called data leakage (when information from the test set bleeds into training), which inflates the reported accuracy into a number that collapses the first time the model meets a genuinely new batch. A validation that reports a flattering R² from a leaked split is not a validation; it is the same "looks proven, isn't" failure the binder era produced, now in a different medium.
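The batch-wise split is easy to get right once it is written down. The following is a minimal, dependency-free sketch of leave-one-batch-out folds; the function name and the batch labels are illustrative.

```python
from collections import defaultdict


def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_indices, test_indices), one fold per batch.

    Every row of the held-out batch goes to the test side, so no batch
    ever contributes rows to both training and scoring.
    """
    rows_by_batch = defaultdict(list)
    for i, batch in enumerate(batch_ids):
        rows_by_batch[batch].append(i)
    for held_out, test_idx in rows_by_batch.items():
        train_idx = [i for batch, idx in rows_by_batch.items()
                     if batch != held_out for i in idx]
        yield held_out, train_idx, test_idx


# Hypothetical time-series rows, each tagged with the batch it came from.
batch_ids = ["B1", "B1", "B1", "B2", "B2", "B3", "B3", "B3"]
for held_out, train_idx, test_idx in leave_one_batch_out(batch_ids):
    train_batches = {batch_ids[i] for i in train_idx}
    assert held_out not in train_batches  # the leakage guard
    print(held_out, train_idx, test_idx)
```

The assertion inside the loop is the whole point: a random row-wise split would fail it almost immediately. In a scikit-learn workflow the same folds are available from LeaveOneGroupOut in sklearn.model_selection, with the batch labels passed as the groups argument.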

Second, the qualification must declare the model's applicability domain: the input region (cell line, scale, raw-material lots, operating ranges) over which it was proven and outside which its output is by definition extrapolation and untrusted. This is the model's analogue of a method's validated range, and it makes the distinction between model drift (the model going stale because the relationship it learned has moved) and ordinary process drift auditable: the same input-distribution shift that signals a process leaving its envelope also signals that the model is being asked to predict outside the data it was qualified on.
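
A declared applicability domain can be enforced at prediction time as a simple gate. The sketch below assumes the simplest possible shape, a set of qualified cell lines plus independent numeric ranges; real domains are often multivariate, and every name and value here (`ApplicabilityDomain`, `scale_L`, `temperature_C`, the ranges) is hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Mapping, Tuple


@dataclass(frozen=True)
class ApplicabilityDomain:
    """Input region a model was qualified on (all values here are hypothetical)."""

    cell_lines: FrozenSet[str]
    numeric_ranges: Mapping[str, Tuple[float, float]]  # name -> (low, high), inclusive

    def violations(self, cell_line: str, inputs: Mapping[str, float]) -> List[str]:
        """Return the reasons an input falls outside the domain; empty means inside."""
        reasons = []
        if cell_line not in self.cell_lines:
            reasons.append(f"cell line {cell_line!r} not qualified")
        for name, (low, high) in self.numeric_ranges.items():
            value = inputs.get(name)
            if value is None:
                reasons.append(f"{name} missing")
            elif not low <= value <= high:
                reasons.append(f"{name}={value} outside [{low}, {high}]")
        return reasons


domain = ApplicabilityDomain(
    cell_lines=frozenset({"CHO-K1"}),
    numeric_ranges={"scale_L": (2.0, 200.0), "temperature_C": (35.0, 37.5)},
)

inside = domain.violations("CHO-K1", {"scale_L": 50.0, "temperature_C": 36.8})
outside = domain.violations("CHO-K1", {"scale_L": 2000.0, "temperature_C": 36.8})

print(inside)   # no violations: the prediction is within the qualified region
print(outside)  # the scale is far beyond what the model was proven on
```

Returning the reasons rather than a bare true/false keeps the check auditable: each rejected prediction carries a record of which declared limit it crossed.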

Third — and this is where CSA's logic returns exactly — the model is risk-classified by what it touches: a soft sensor that merely advises an operator on a critical quality attribute (CQA — a measurable product property that must stay within limits to release the batch) is held to a far lighter evidence bar than one that autonomously acts on it, the same patient-impact question CSA asks of every software function. The resolution the regulators have converged on is locked-then-relearn: the model is frozen in production, never edited in place, and any retraining produces a new validated version promoted through change control along a pre-agreed Predetermined Change Control Plan (PCCP — a pre-approved written specification of how a model may be retrained and updated) — the bridge that lets a model evolve along a path proven safe in advance, rather than treating every retrain as an unforeseen change. The ML book builds this lifecycle end to end — grouped cross-validation and leakage in Models and Validation: From PLS to Transformers, and drift detection, the PCCP, and the validation-versus-learning paradox in MLOps and Lifecycle.
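
The locked-then-relearn pattern can be illustrated with a frozen record and a promotion gate. This is a sketch under stated assumptions, not a rendering of any regulator's template: the two plan fields (`allowed_algorithms`, `max_grouped_cv_rmse`) are hypothetical stand-ins for whatever a real PCCP specifies, and the metric values are invented.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelVersion:
    """A locked model: frozen=True makes in-place edits raise an error."""

    version: int
    algorithm: str
    grouped_cv_rmse: float  # error from batch-grouped cross-validation


@dataclass(frozen=True)
class ChangeControlPlan:
    """Hypothetical PCCP: the retraining changes agreed in advance."""

    allowed_algorithms: Tuple[str, ...]
    max_grouped_cv_rmse: float


def promote(
    current: ModelVersion, candidate: ModelVersion, plan: ChangeControlPlan
) -> ModelVersion:
    """Return the candidate as the new production version if it stays inside the plan."""
    if candidate.version != current.version + 1:
        raise ValueError("retraining must produce the next version, not overwrite one")
    if candidate.algorithm not in plan.allowed_algorithms:
        raise ValueError("algorithm change is outside the plan; needs full change control")
    if candidate.grouped_cv_rmse > plan.max_grouped_cv_rmse:
        raise ValueError("candidate misses the pre-agreed acceptance criterion")
    return candidate


plan = ChangeControlPlan(allowed_algorithms=("PLS",), max_grouped_cv_rmse=0.30)
v1 = ModelVersion(version=1, algorithm="PLS", grouped_cv_rmse=0.25)
v2 = ModelVersion(version=2, algorithm="PLS", grouped_cv_rmse=0.22)

production = promote(v1, v2, plan)

try:
    v1.grouped_cv_rmse = 0.10  # attempting to edit the locked model in place
except FrozenInstanceError:
    print("locked: retrain into a new version instead")
```

The design choice worth noting is that the plan is data, written before any retraining happens. A retrain that stays inside it is promoted by a routine check; one that steps outside it is rejected and falls back to ordinary change control.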

Key terms

Computerized System Validation (CSV) — producing documented evidence that a computer system consistently does what it is intended to do, fit for its intended use.
GxP — the family of Good Practice regulations (GMP, GLP, GCP, etc.) governing regulated life-sciences work.
Verification vs. validation — verification asks "built right?" (meets spec); validation asks "right thing built?" (fit for use).
GAMP 5 — ISPE's widely used risk-based guide for validating regulated (GxP) computerized systems; second edition (2022).
Software categories (1, 3, 4, 5) — GAMP 5's classification of software by its type and how much it is configured or custom-built, scaling required effort.
Leverage the supplier — assessing and reusing a vendor's testing and documentation instead of re-testing from scratch.
V-model — the GAMP 5 framework pairing each specification with a matching verification.
IQ / OQ / PQ — Installation, Operational, and Performance Qualification: the traditional verification rungs.
ASTM E2500 — a science- and risk-based standard for verifying systems are fit for intended use.
ICH Q9 — the quality risk management guideline that justifies scaling effort to risk.
Computer Software Assurance (CSA) — the FDA's risk-based, least-burdensome, critical-thinking approach succeeding traditional CSV.
Critical thinking — deciding how much testing a function needs based on its impact on patient safety and product quality.
Scripted vs. unscripted testing — pre-written step-by-step tests versus lighter exploratory/ad-hoc testing for lower-risk features.
Periodic review — recurring confirmation that a system remains in a validated state over its whole life.
Requirements-traceability matrix (RTM) — the table mapping each requirement forward to the test that proves it and back again, so nothing is untested or untraced; the evidence backbone of validation.
Risk class — the per-requirement rating (such as Critical) that, under CSA, sets how rigorously its verification is tested.
Continuous validation — the aspiration of keeping a frequently-changing system (cloud, SaaS, Agile) in a validated state by re-proving each change with proportionate, mostly-automated assurance rather than infrequent revalidation.
SHACL traceability shape — expressing a requirements-traceability row as RDF triples and gating it with a SHACL (Shapes Constraint Language) shape, so an untraced or untested requirement becomes a validation report rather than a gap a human must notice.
Competency question — a plain-English question the data must answer, run as a pass/fail SPARQL check (e.g. "list any Critical requirement whose test is missing or not PASS").
PROV-O provenance — the W3C standard for recording who or what produced a fact and from which activity; the test_id link is a prov:wasGeneratedBy edge from a result back to the CI run that produced it.
Grouped (leave-one-batch-out) cross-validation — splitting a model's held-out test by batch rather than by row, so near-identical neighbouring rows cannot leak into both training and test.
Data leakage — information from the test set bleeding into training, which inflates reported model accuracy into a number that collapses on a genuinely new batch.
Applicability domain — the input region (cell line, scale, raw-material lots, operating ranges) over which a model was proven; outside it the output is extrapolation and untrusted — the model's analogue of a method's validated range.
Locked-then-relearn / PCCP — freezing a model in production and never editing it in place; any retraining produces a new validated version promoted through change control along a pre-approved Predetermined Change Control Plan.

Where this leads

Validation proves a single system is trustworthy; data integrity controls keep each record honest. But neither, on its own, makes an organization's data trustworthy at scale — that requires policies, roles, and definitions that everyone follows. The next chapter, Data Governance, Data Quality, and Master Data, supplies that organizational backbone: what governance is, the dimensions that make data "good," how metadata and master data keep meaning consistent across systems, and who owns and stewards the data — the human structures that make all the technical controls of Part III actually stick.
