Validating an Open-Source Stack: GAMP 5 & CSA

📍 Where we are: Part V · Trust. We have a running stack with an audit trail and e-signatures; now we have to prove it is fit for GxP use, with no vendor to hide behind. (GxP is the umbrella term for the "Good x Practice" rules that health authorities enforce on regulated drug work: Good Manufacturing Practice (GMP), Good Laboratory Practice (GLP), and Good Clinical Practice (GCP).)

The simple version

Buying validated software is like hiring a contractor who hands you a thick binder of paperwork and takes the blame if the roof leaks. Open source is like building the deck yourself: nobody hands you the binder, and nobody else signs off. That is not a deal-breaker — a home inspector still passes a well-built deck — but you have to keep the receipts, the photos, and a list of every board you used. This chapter is about keeping those receipts in a way an inspector will accept, and doing it as code you can re-run.

In the last chapters you brought up the core stack, loaded the audit triggers, and sketched how an e-signature service would slot in. Everything works. But "works on my laptop" is not "validated." A regulated manufacturer running this stack to make a monoclonal antibody (mAb — a therapeutic protein grown in living cells; the physical process is Book 1's subject) must be able to tell an inspector, in writing, why they trust it — and for open source there is no supplier to lean on.

This is the chapter where the honest-hybrid thesis bites hardest. Pure open source can carry you about 80% of the way to a defensible validation package. The last mile — the procedures, the change control, the human sign-off, the audited environment — is yours, and it is the same burden whether the software underneath cost zero dollars or a million.

What this chapter covers

Why "validated software" is a myth and validation is a property of a system + procedures, framed by 21 CFR Part 11 §11.10(a), 21 CFR 820.70(i), and FDA's General Principles of Software Validation.
GAMP 5 categorization of each stack component — PostgreSQL as infrastructure, our batch model as Category 5 — and the CSA (Computer Software Assurance) shift from documentation to risk.
Supplier and provenance assessment for community projects that have no supplier quality system, backed by an SBOM pinned to image digests.
A URS → IQ/OQ/PQ traceability matrix and an IQ manifest locked to image digests.
Automated OQ as pytest — the tests in examples/tests/ are the validation evidence, re-run by CI (continuous integration — the automation that re-runs the test suite on every code change) on every change.

"Validated software" does not exist

Start by deleting a phrase from your vocabulary: "validated software." No download is compliant. The law is precise about this. 21 CFR Part 11 (Title 21 of the US Code of Federal Regulations, the FDA's binding rule on electronic records and signatures) requires, in §11.10(a), "validation of systems to ensure accuracy, reliability, consistent intended performance, and the ability to discern invalid or altered records" [5]. For medical devices, 21 CFR 820.70(i) is even more direct: when computers are used as part of production or the quality system, the manufacturer "shall validate computer software for its intended use according to an established protocol" [6]. Both put the obligation on the system as you use it, not on the vendor's box.

FDA's General Principles of Software Validation gives us the verbs — installation, operational, and performance qualification (IQ/OQ/PQ) — and frames validation as a lifecycle activity, not a one-time test [4]. The good news, which the rest of this chapter cashes out, is that for an infrastructure-as-code stack — one whose every component is declared in version-controlled files (here a compose.yaml) rather than hand-installed — each of those verbs maps cleanly onto something you can run.

GAMP 5, Second Edition, is the industry's playbook for doing this proportionately, and crucially it does not treat open source as forbidden fruit: it provides a risk-based, critical-thinking approach and explicit guidance on software categories, suppliers, and even open-source software [1]. The companion Pharmaceutical Engineering article from the GAMP Community of Practice spells out the OSS angle: keep a catalog of the open-source components, assess the project's governance and sustainability, and verify the installed copy matches the intended version from a reputable source [7].

Categorize before you validate

The first GAMP 5 move is to categorize each component, because category drives effort. You do not validate PostgreSQL's B-tree implementation; you validate the application you built on top of it. Here is how our stack splits.

Component	GAMP category	Why	Validation focus
PostgreSQL / TimescaleDB	1 (Infrastructure)	Established platform software (the SQL database and its time-series extension)	Correct install, version pinned, configured per spec
Mosquitto, Grafana, Fuseki	1 (Infrastructure)	Configurable platform tools, used unmodified (the MQTT message broker, the dashboard server, the RDF triplestore)	Install + configuration verification
Grafana dashboards (`br101-batch-overlay.json`), Grafana provisioning, Mosquitto config (`mosquitto.conf`)	4 (Configured)	We configure, not code, the behavior	Verify the configuration meets the requirement
`bioproc_sim`, the ISA-88/95 model, audit triggers, the soft-sensor	5 (Custom)	Code we wrote for this intended use	Full lifecycle: requirements → design → code review → test

The category-1-vs-5 split is the whole economy of validation. The bulk of the stack is infrastructure or configuration; only the code we authored is Category 5 and needs the heavyweight treatment. That is exactly where we point our OQ tests.

Flow diagram: a URS feeds a Risk and GAMP category decision diamond that branches by category — Cat 1 infrastructure to IQ install and config verified, Cat 4 configured and Cat 5 custom both to OQ behavior tested with pytest — then IQ and OQ converge into PQ end-to-end on the real process, which flows to signed evidence plus a traceability matrix.

The GAMP category dial: why 1 vs 5 is the whole economy

It is worth dwelling on why the category, not the price tag, sets the bill. GAMP 5's categories are a risk dial, not a hierarchy of quality [1]. Category 1 is established infrastructure — an operating system, a database engine — whose correct functioning is presumed because a vast user base exercises it daily; your duty is to prove the right version is installed and configured to spec, not to re-test its internals. Category 4 is configured product: you assemble behavior from supplied building blocks (a Grafana dashboard, a Mosquitto broker config), so you verify that the configuration meets the requirement. Category 5 is custom code written for one intended use — our simulator, the ISA-88/95 model, the audit triggers — and it gets the full lifecycle because nobody else has ever run exactly this, so nobody else's usage is evidence.

The economic point is that effort is not uniform. Treating PostgreSQL as Category 5 — writing requirements and code reviews for a B-tree — would be both ruinously expensive and less convincing than the millions of production deployments that already vouch for it. Treating our audit trigger as Category 1 would be negligent, because we are the only people who have ever run it. The dial lets you spend the validation budget where the residual risk actually lives, which for this stack is the handful of Category 5 components. CSA, below, sharpens the same instinct: within Category 5, the risk of each function decides how hard you test it.
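The routing described above can be written down as a small lookup. This is a minimal sketch, not code from the repo: the `QualificationPlan` type and the `plan_for` helper are hypothetical names, and the stage and focus strings simply restate the category table in this section.

```python
from dataclasses import dataclass


# Hypothetical helper types: they restate the category table as data.
@dataclass(frozen=True)
class QualificationPlan:
    category: int
    label: str
    stage: str   # the qualification stage that carries most of the evidence
    focus: str


PLANS = {
    1: QualificationPlan(1, "Infrastructure", "IQ", "install and configuration verified"),
    4: QualificationPlan(4, "Configured", "OQ", "configuration meets the requirement"),
    5: QualificationPlan(5, "Custom", "OQ", "full lifecycle: requirements, design, review, test"),
}


def plan_for(category: int) -> QualificationPlan:
    """Return the plan for a GAMP category used in this chapter (1, 4 or 5)."""
    try:
        return PLANS[category]
    except KeyError:
        raise ValueError(f"no plan defined for GAMP category {category}") from None


# A few components from the table, keyed to their category.
stack = {"PostgreSQL / TimescaleDB": 1, "Grafana dashboards": 4, "audit triggers": 5}
for component, category in stack.items():
    plan = plan_for(category)
    print(f"{component}: Cat {plan.category} ({plan.label}) -> {plan.stage}: {plan.focus}")
```

Keeping the mapping as data rather than prose means a new component has to be given a category before it can be planned at all, which is the order GAMP 5 asks for.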

CSA: stop writing, start thinking

For years, CSV (Computer System Validation) drifted into a documentation arms race — pages of screenshots proving a button is blue. FDA's Computer Software Assurance (CSA) guidance is the correction. Issued as a draft in 2022 [2] and finalized in 2025 (with an administrative update in early 2026) [3], CSA tells you to spend effort in proportion to risk: high-risk, patient-impacting functions get rigorous scripted testing (each step written out and evidenced in advance); low-risk functions get lighter, unscripted (exploratory) or automated checks. The recordkeeping scales with the risk, not the other way around.

CSA is a gift to an OSS stack. It explicitly blesses leveraging existing evidence — logs, audit trails, automated test results — instead of re-deriving everything by hand. That is precisely what a Git repository full of pytest runs and CI logs gives you. The assurance activity becomes: identify the intended use, judge the risk, choose the lightest test that establishes confidence, and let the automation produce the record.

And this is not theory. The IMPALA Consortium — Roche, MSD, and Boehringer Ingelheim — independently validated a community-developed open-source R package to GxP/GCP standards and published exactly how they did it [8]. Regulated pharma can defensibly validate software with no vendor. The pattern they used is the pattern here: assess provenance, define intended use, test against requirements, keep the evidence.

Supplier assessment when there is no supplier

GAMP 5 expects a supplier assessment — normally a questionnaire or audit of the vendor's quality system. Open source has no vendor to audit, so you assess the project and the artifact instead. Two questions: is the project a trustworthy "supplier," and is the bit-for-bit copy you are running the one you think it is?

Scoring project health as a supplier proxy

When there is no quality manual to request, the GAMP CoP open-source guidance says to assess the project itself as the proxy for a supplier [7]. The criteria are concrete and, helpfully, mostly observable from the public record:

Supplier-quality dimension	What you would ask a vendor	The open-source proxy (where to look)
Governance & ownership	"Who is accountable for this product?"	A named foundation or maintainer team, a documented governance model, a published security policy (`SECURITY.md`).
Release & defect handling	"What is your patch SLA?" (service-level agreement — a promised turnaround time)	Release cadence, tagged semantic versions, a changelog, time-to-fix on past CVEs (Common Vulnerabilities and Exposures — publicly tracked security flaws).
Sustainability	"Will you still exist in five years?"	Contributor count and bus factor (how many maintainers would have to leave before the project stalls — a bus factor of one is a single point of failure), funding/sponsorship, downstream adopters.
License clarity	"What may we do with it?"	An SPDX-identifiable license, no ambiguous dual-licensing surprises (see the TimescaleDB trap below).

This is not a softer bar than a vendor audit — for a thinly-staffed project it can be a harder one, because a single-maintainer repo with no security policy may fail the assessment outright. That is a legitimate finding, and it is why the honest verdict at the end of this chapter notes that assessing a fragile project can cost more than buying a supported product. The OpenSSF Scorecard and similar tools automate much of this, turning "is this a trustworthy supplier?" into a repeatable, evidenced check rather than a gut feeling.
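The table's four dimensions can be turned into a repeatable check. The sketch below is illustrative only: `ProjectEvidence`, `assess`, the field names, and the thresholds (at least one tagged release a year, at least two active maintainers) are assumptions made for this example, not criteria taken from GAMP 5 or from the OpenSSF Scorecard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProjectEvidence:
    # Hypothetical fields; each mirrors one row of the supplier-proxy table.
    name: str
    has_governance_doc: bool
    has_security_policy: bool          # for example a SECURITY.md in the repo
    tagged_releases_last_year: int
    active_maintainers: int
    spdx_license_id: Optional[str]


def assess(e: ProjectEvidence) -> List[str]:
    """Return a list of findings; an empty list means no finding was raised."""
    findings = []
    if not e.has_governance_doc:
        findings.append("governance: no documented governance model")
    if not e.has_security_policy:
        findings.append("governance: no published security policy")
    if e.tagged_releases_last_year == 0:      # assumed threshold
        findings.append("release handling: no tagged release in the last year")
    if e.active_maintainers < 2:              # assumed threshold
        findings.append("sustainability: single point of failure (bus factor of one)")
    if not e.spdx_license_id:
        findings.append("license: no SPDX-identifiable license")
    return findings


# Two made-up projects, purely to exercise the check.
solo = ProjectEvidence("hypothetical-solo-lib", False, False, 0, 1, None)
healthy = ProjectEvidence("hypothetical-foundation-db", True, True, 6, 12, "Apache-2.0")

for project in (solo, healthy):
    print(project.name, assess(project) or "no findings")
```

The output is a list of findings rather than a single score, so each one can be dispositioned in the supplier assessment record.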

For the second question — is the running copy the one you think it is? — you pin and verify. Our examples/platform/compose/compose.yaml (the Docker Compose file that declares each service as a container image — a frozen, downloadable bundle of a program and everything it needs to run) pins every image by human-readable tag, and the repo's examples/platform/versions.lock records the matching immutable manifest digest for each one (regenerated by make lock):

# from examples/platform/compose/compose.yaml
postgres:
  image: timescale/timescaledb:2.17.2-pg17
  profiles: ["core"]

mosquitto:
  image: eclipse-mosquitto:2.0.22
  profiles: ["core"]

grafana:
  image: grafana/grafana-oss:11.4.0
  profiles: ["core"]

A tag like :2.17.2-pg17 is a human-friendly label that can be re-pointed; a digest (@sha256:…) is content-addressable and cannot. Pinning both is the difference between "we ran TimescaleDB 2.17" and "we ran this exact TimescaleDB 2.17." That distinction is what makes the install reproducible, and reproducibility is the foundation of every IQ claim. One image is honest about its gap. The community Fuseki image moved to a different registry (the server that hosts container images), and because a digest identifies one exact set of bytes, a copy re-published elsewhere is not guaranteed to match the old pin. The entry therefore carries a VERIFY-BEFORE-USE placeholder in versions.lock until you re-fetch ("resolve") the digest from whichever copy (mirror) you now pull; docker buildx imagetools inspect apache/jena-fuseki:5.2.0 prints it. That is a real-world reminder that provenance for a moved community artifact is exactly the kind of finding a supplier assessment surfaces.
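The tag-versus-digest distinction is easy to demonstrate. The sketch below uses stand-in byte strings rather than a real image manifest, so the digest it prints is not the TimescaleDB digest; `content_digest` and `is_digest_pinned` are hypothetical helper names.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """A content-addressable identifier: it is a pure function of the bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def is_digest_pinned(image_ref: str) -> bool:
    """True when the reference ends in @sha256: followed by 64 lowercase hex characters."""
    _, sep, digest = image_ref.partition("@sha256:")
    return bool(sep) and len(digest) == 64 and all(c in "0123456789abcdef" for c in digest)


# Stand-in bytes only; a real digest is computed over the image manifest.
original = b"stand-in manifest, build A"
rebuilt = b"stand-in manifest, build B"   # what a re-pointed tag might serve later

assert content_digest(original) == content_digest(original)   # same bytes, same digest
assert content_digest(original) != content_digest(rebuilt)    # any change, new digest

tag_only = "timescale/timescaledb:2.17.2-pg17"
pinned = tag_only + "@" + content_digest(original)

print(is_digest_pinned(tag_only))   # False
print(is_digest_pinned(pinned))     # True
```

A check like `is_digest_pinned` is cheap enough to run over every image reference before an IQ is signed.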

The SBOM as supplier dossier: what, whence, which license

The machine-readable supplier dossier is a Software Bill of Materials (SBOM). We generate one with Syft, the Apache-2.0 tool that walks a container image and lists every package inside it [11]. The output is a standardized format — CycloneDX, now Ecma International ECMA-424 [9], or SPDX, ratified as ISO/IEC 5962:2021 [10] — so the inventory has formal, inspection-defensible footing rather than being a homemade spreadsheet. A trimmed CycloneDX row looks like this:

// illustrative SBOM row (CycloneDX 1.6), produced by `make sbom` (Syft)
{
  "type": "container",
  "name": "timescale/timescaledb",
  "version": "2.17.2-pg17",
  "purl": "pkg:docker/timescale/timescaledb@2.17.2-pg17",
  "hashes": [{ "alg": "SHA-256", "content": "3324f81c…" }],  // bare hex of the digest pinned in versions.lock
  "licenses": [
    { "license": { "name": "PostgreSQL License" } },
    { "license": { "name": "Timescale License (TSL)" } }
  ]
}

Read that row field by field, because each column is one of the inspector's questions answered in machine-readable form. type: container and name say what the component is; version and purl (the Package URL) say whence it came — pkg:docker/timescale/timescaledb@2.17.2-pg17 is a globally unambiguous coordinate a scanner can resolve back to a registry. The hashes entry carries the same SHA-256 manifest digest recorded in versions.lock, so the inventory is pinned to the bit-for-bit artifact, not to a re-pointable tag. And licenses answers which license — the column that turns the SBOM from a parts list into a legal dossier. In our repo the SBOM is not hand-written: make sbom runs Syft against each image in compose.yaml and writes one CycloneDX file per image under compliance/ (for example compliance/sbom-timescale__timescaledb__2.17.2-pg17.cdx.json), so the dossier is regenerated from the running images, never typed.

That single artifact answers three questions an inspector will ask: what is in here, where did it come from, and under what license. And it forces an honesty the prose alone can hide. The pinned timescale/timescaledb:2.17.2-pg17 is the Community bundle: PostgreSQL itself ships under the PostgreSQL License, TimescaleDB's core is Apache-2.0, and the Community features are covered by the source-available Timescale License (TSL), which is free to read and run but is not OSI-approved open source. The SBOM row therefore carries TSL rather than a clean Apache-2.0 (the strictly Apache-only build would be the -oss tag). The book is deliberate about this trap (see the historian chapter): it uses the free TSL Community automation (continuous aggregates and add_retention_policy) while staying off the one TSL feature it does not need, Hypercore columnstore/compression; it ships VictoriaMetrics instead of InfluxDB v3; and it flags Grafana's AGPL for redistribution. The license column is not decoration. The supplier register and the SBOM share one source (versions.lock), so the recorded license is whatever the pinned image actually is, and the inventory cannot quietly drift away from the running stack.
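Because the SBOM is machine-readable, the license question can be asked by a script instead of a reviewer's eye. The sketch below parses a component shaped like the illustrative row in this section; the `REVIEW_REQUIRED` set and both helper functions are assumptions for the example, standing in for whatever license policy a site actually maintains.

```python
import json

# Same shape as the illustrative CycloneDX row (hashes omitted for brevity).
SBOM_COMPONENT = json.loads("""
{
  "type": "container",
  "name": "timescale/timescaledb",
  "version": "2.17.2-pg17",
  "purl": "pkg:docker/timescale/timescaledb@2.17.2-pg17",
  "licenses": [
    {"license": {"name": "PostgreSQL License"}},
    {"license": {"name": "Timescale License (TSL)"}}
  ]
}
""")

# Assumption: a site-maintained list of license names that trigger legal review.
REVIEW_REQUIRED = {"Timescale License (TSL)"}


def license_names(component: dict) -> list:
    """Collect each license by SPDX id when present, otherwise by free-text name."""
    names = []
    for entry in component.get("licenses", []):
        lic = entry.get("license", {})
        names.append(lic.get("id") or lic.get("name") or "UNKNOWN")
    return names


def needs_review(component: dict) -> list:
    """Licenses on this component that the site policy says must be reviewed."""
    return sorted(set(license_names(component)) & REVIEW_REQUIRED)


print(SBOM_COMPONENT["purl"], license_names(SBOM_COMPONENT))
print("review required:", needs_review(SBOM_COMPONENT))
```

Run against every file under compliance/, a check like this would turn the TSL caveat from a paragraph someone has to remember into a finding that appears whenever the pinned image changes.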

The traceability matrix: URS → IQ/OQ/PQ

Validation is fundamentally about answering "did we build what we said we needed?" The artifact that proves it is a traceability matrix: every user requirement (URS) maps forward to a test, and every test maps back to a requirement. Nothing untested, nothing untraced.

Here is a slice of the requirements-to-test matrix for our stack, expressed as the CSV the repo would ship under compliance/ (the "golden batch" it refers to is the canonical reference run the book trends everything against):

# illustrative compliance/traceability.csv (URS -> test, generated from test IDs)
urs_id,requirement,gamp_cat,risk,verifies,test_id
URS-001,"Historian stores all bioreactor tags for a batch",1,High,test_historian_loaded,tests/test_db.py::test_historian_loaded
URS-002,"Readings within a batch resolve to their ISA-88 phase (all four golden-batch phases present)",5,High,test_contextualization_joins_phase,tests/test_db.py::test_contextualization_joins_phase
URS-003,"Record changes are attributable, reasoned, tamper-evident",5,Critical,test_audit_captures_update,tests/test_db.py::test_audit_captures_update
URS-004,"The audit hash chain has no broken links",5,Critical,test_alcoa_chain_intact,tests/test_db.py::test_alcoa_chain_intact
URS-005,"Generated process data is deterministic & reproducible",5,Medium,test_determinism_two_runs_identical,tests/test_simulator.py::test_determinism_two_runs_identical

The test_id column is the punchline: each requirement does not point to a paragraph of prose, it points to a function that runs. That is CSA in practice — the evidence is generated, not transcribed.
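The "nothing untested, nothing untraced" rule is itself checkable. Here is a minimal sketch that reads a matrix in the column layout shown above and lists any requirement whose test_id is not a pytest node address. The `untraced` helper and the URS-099 row are hypothetical, added only so the check has something to catch.

```python
import csv
import io

# Two rows from the slice above plus one invented, deliberately untraced row.
MATRIX = """urs_id,requirement,gamp_cat,risk,verifies,test_id
URS-003,"Record changes are attributable, reasoned, tamper-evident",5,Critical,test_audit_captures_update,tests/test_db.py::test_audit_captures_update
URS-004,"The audit hash chain has no broken links",5,Critical,test_alcoa_chain_intact,tests/test_db.py::test_alcoa_chain_intact
URS-099,"Hypothetical requirement with no test yet",5,High,,
"""


def untraced(matrix_text: str) -> list:
    """Return the urs_id of every row whose test_id is not a pytest node address."""
    rows = csv.DictReader(io.StringIO(matrix_text))
    return [row["urs_id"] for row in rows if "::" not in (row["test_id"] or "")]


gaps = untraced(MATRIX)
print(gaps)   # ['URS-099']
```

In CI the same function could fail the build whenever the list is non-empty, so a requirement cannot be merged without naming the test that verifies it.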

Anatomy of a traceability-matrix row

Take the third row apart, field by field, because every column is doing a distinct job and the last one is what makes this matrix executable rather than narrative. URS-003 — "record changes are attributable, reasoned, tamper-evident" — is one of the two Critical-risk requirements in the table, and it is the one an inspector will press hardest on.

Identity-card anatomy of one traceability.csv row for URS-003: columns urs_id, requirement, gamp_cat 5, a Critical risk pill, verifies, and a highlighted green test_id block resolving to the pytest node tests/test_db.py colon-colon test_audit_captures_update, with a violet panel listing the two assertions the node makes (action equals UPDATE, and verify_chain returns zero broken links).

Each column of a traceability row plays one role; the test_id is the executable punchline. It names a pytest node CI re-runs on every commit, so the evidence is generated, not transcribed.

Original diagram by the authors, created with AI assistance.

The first five columns are unremarkable bookkeeping that any binder-based matrix would carry: a stable urs_id, the plain-language requirement, the gamp_cat that sets the validation tier (5, custom code), the risk that sets the testing rigor (Critical), and a verifies label naming the check. The sixth column is the one that changes the kind of document this is. test_id does not resolve to a page reference or a screenshot folder; it resolves to tests/test_db.py::test_audit_captures_update, a pytest node address that CI can execute. That node, which we dissect under OQ below, performs a real UPDATE with a user and a reason, then asserts both that the change was captured (action == "UPDATE") and that the tamper-evident hash chain is unbroken (verify_chain() returns zero rows). A traditional Computer System Validation matrix points its verifies column at evidence a human transcribed once; this one points at evidence the machine regenerates, which is the whole difference between Computer System Validation and CSA. (Note the honest framing: the repo does not yet ship a literal compliance/traceability.csv. The matrix is generated from the test IDs, and the real artifact it dissects is the live pytest node those IDs name.)

The matrix as a graph: a requirement is a triple, a test is a shape

The traceability CSV has a quietly semantic shape, and naming it connects this chapter to the knowledge-graph chapter that runs Fuseki — the same Category 1 triplestore on the supplier register above. Each row asserts facts about a requirement, and facts are triples (subject-predicate-object, the atom of RDF — the graph data model that stores relationships as first-class data). URS-003 modeled as triples is just the row in graph form — bp:URS-003 bp:risk "Critical", bp:URS-003 bp:gampCat 5, bp:URS-003 bp:verifiedBy bp:test_audit_captures_update — and the "every test maps back to a requirement, nothing untraced" rule that the matrix exists to enforce is exactly a SHACL (Shapes Constraint Language) shape: a closed-world gate that fails a requirement node carrying no verifiedBy edge, the same mechanism Book 4 uses to make a release specification an executable gate:

# illustrative: the "nothing untraced" rule as a SHACL shape (closed-world).
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix bp:  <https://example.org/bioproc#> .

bp:TraceableRequirementShape a sh:NodeShape ;
  sh:targetClass bp:Requirement ;
  sh:property [ sh:path bp:verifiedBy ; sh:minCount 1 ;
                sh:message "Requirement has no verifying test: untraced." ] ;
  sh:property [ sh:path bp:gampCat ; sh:datatype xsd:integer ; sh:maxCount 1 ] .

That sh:minCount 1 is the graph saying what the matrix says in prose — a missing edge is a failure now, not an open question — which is why SHACL's closed-world stance, not OWL's open-world reasoning, is the right tool for a completeness check (a coverage gap is a thing that should be present and is not). The traceability question itself reads as a SPARQL competency question — a query the model must be able to answer, used as an acceptance test the way Book 4 runs its 23 competency questions as PASS/FAIL checks:

# competency question: which Critical-risk requirements have no verifying test? (must return zero)
PREFIX bp: <https://example.org/bioproc#>
SELECT ?req WHERE {
  ?req a bp:Requirement ; bp:risk "Critical" .
  FILTER NOT EXISTS { ?req bp:verifiedBy ?t . }
}

The honest boundary is the same one SHACL always carries: the shape proves the matrix is complete and well-formed — every requirement traced, every test named — not that the test it names is correct. Completeness is a graph property a machine guarantees; correctness still rests on the qualified-person judgment the gate records but does not replace. And as with the digital-thread graph, this traceability graph would be a derived view of the test IDs, so it earns the same validated, change-controlled load the semantics chapter demands of any graph holding GxP-relevant facts.

A pipeline showing user requirements flowing through GAMP categorization and risk assessment into IQ, OQ, and PQ stages, where IQ verifies pinned image digests, OQ runs pytest against the live stack, and PQ exercises the end-to-end batch, all feeding a signed traceability matrix and SBOM as inspection evidence.

From a written requirement to a runnable test: the OSS validation lifecycle treats GAMP categorization and risk as the dial that sets how much qualification each component gets, then captures IQ as a digest-pinned manifest, OQ as automated pytest, and PQ as an end-to-end batch — every step traceable back to a URS.

Original diagram by the authors, created with AI assistance.

IQ: the install manifest, pinned to digests

Installation Qualification answers "is the right thing installed, configured correctly, in the right place?" For infrastructure-as-code this is almost free, because the install is a file. The IQ manifest is a snapshot of the running stack's components and their digests — the evidence that what is deployed matches what was specified:

// illustrative compliance/iq_manifest.json (captured from the running stack)
{
  "captured_utc": "2026-06-14T09:00:00Z",
  "compose_file_sha256": "…",
  "components": [
    { "service": "postgres",  "image": "timescale/timescaledb:2.17.2-pg17",
      "digest": "sha256:3324f81c…", "gamp_category": 1, "profile": "core" },
    { "service": "grafana",   "image": "grafana/grafana-oss:11.4.0",
      "digest": "sha256:d8ea3779…", "gamp_category": 1, "profile": "core" }
  ]
}

Because the manifest is generated from the same versions.lock digests that the SBOM and supplier register use, an IQ that passes proves three things at once: the right versions are installed, they match the license inventory, and they match what CI tested. If a digest drifts, the OQ suite in the next section is designed to fail loudly — which is the whole point.

Anatomy of an IQ-manifest entry

Take one component apart the way we took the traceability row apart, because the IQ manifest's power is concentrated in a single field. The artifact that actually exists in the repo is examples/platform/versions.lock (the manifest JSON above is a presentation of the same data), so we dissect a real versions.lock line: the postgres service, timescale/timescaledb:2.17.2-pg17, pinned to digest sha256:3324f81c….

One versions.lock line, field by field: the tag records what you meant to run; the immutable sha256 digest is the bit-for-bit IQ claim, and because three artifacts read the same digest they cannot drift apart.

Original diagram by the authors, created with AI assistance.

The service, image:tag, gamp_category, and profile fields are context: they say which compose service this is, what human-readable image was requested, that it is Category 1 infrastructure (verify the install, do not rebuild it), and which profile brings it up. The load-bearing field is the digest. A tag like :2.17.2-pg17 is mutable — a registry can re-point it to a freshly rebuilt image tomorrow — so a tag alone supports "we meant to run TimescaleDB 2.17," not "we ran this exact image." The sha256 manifest digest is content-addressable: it is computed from the image bytes, so it cannot be silently re-pointed, and it is therefore the bit-for-bit claim an IQ rests on. The quiet superpower is that the same digest is read by three different artifacts — the versions.lock supplier register, the Syft SBOM's hashes field, and the IQ manifest's digest field — all regenerated from one source. That is why a single passing IQ simultaneously proves install (the right image is deployed), license (it matches the SBOM's recorded license), and CI-tested (it matches what make test exercised). One fingerprint, three guarantees, no transcription.
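The IQ comparison itself is a few lines. The sketch below assumes the locked and running digests have already been loaded into two dictionaries keyed by service name; the on-disk format of versions.lock is not shown in this chapter, so no parser is attempted. The truncated digests are the ones quoted above, and the drifted value is an obvious placeholder.

```python
def iq_check(locked: dict, running: dict) -> list:
    """Compare running digests against the lock; return a list of findings."""
    findings = []
    for service, digest in sorted(locked.items()):
        actual = running.get(service)
        if actual is None:
            findings.append(f"{service}: in lock but not running")
        elif actual != digest:
            findings.append(f"{service}: digest drift (locked {digest}, running {actual})")
    for service in sorted(running.keys() - locked.keys()):
        findings.append(f"{service}: running but not in lock")
    return findings


# Truncated digests as quoted in this chapter; treated here as opaque strings.
locked = {"postgres": "sha256:3324f81c…", "grafana": "sha256:d8ea3779…"}

running_ok = dict(locked)
running_drifted = {"postgres": "sha256:PLACEHOLDER-DRIFTED", "grafana": "sha256:d8ea3779…"}

print(iq_check(locked, running_ok))        # []
print(iq_check(locked, running_drifted))
```

An empty list is the passing IQ; anything else is a deviation to investigate before the OQ evidence can be trusted.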

OQ: the tests in the repo are the evidence

Operational Qualification answers "does it do what it should, across its operating range?" Here is where the hands-on nature of this book pays off: we already wrote the OQ. The pytest suite [12] in examples/tests/ is not a developer convenience bolted on afterward; it is the operational evidence, and it runs against the live compose stack. (pytest is the standard Python testing tool: it discovers and runs small test_… functions and reports pass/fail.)

Look at examples/tests/test_db.py. These tests connect to the running PostgreSQL+TimescaleDB and assert that the system behaves to requirement. The historian and contextualization checks verify URS-001 and URS-002:

# from examples/tests/test_db.py
def test_historian_loaded(conn):
    n = _scalar(conn, "select count(*) from ts.sensor_reading where batch_id='BATCH-2026-001'")
    assert n > 300_000   # 16 tags x ~20160 minutes


def test_contextualization_joins_phase(conn):
    # the golden batch should surface all four named ISA-88 phases (LEFT JOIN allows NULL outside a window)
    rows = _scalar(conn, "select count(distinct phase_name) from s88.v_batch_sensor "
                         "where batch_id='BATCH-2026-001' and phase_name is not null")
    assert rows >= 4   # Inoculate, Growth, Production, Harvest

Those four phases are the canonical fed-batch arc, in which a culture is started with cells and fed nutrients as it runs (the physical process is Book 1's subject): inoculate the bioreactor (the vessel the cells grow in), grow viable cells, switch to antibody production, then harvest and clarify (filter the cells out of the liquid). The 16 tags the historian check counts are the bioreactor's instrumentation. Temperature, pH, dissolved oxygen and agitation are each carried as a measured process value (PV) plus its setpoint (the target value the controller tries to hold it to), so those four quantities make 8 tags. Eight single process-value channels add 8 more: the two nutrient feeds (cumulative feed mass, in kg), vessel pressure and working volume, two off-gas channels (CO2 and O2), and online glucose and titer (grams of antibody per litre). Four PV-plus-setpoint pairs and eight single channels make 16 tags in all. Each tag is sampled once a minute across the 14-day batch (14 × 24 × 60 = 20,160 minutes), so 16 tags × 20,160 minutes gives a nominal 322,560 readings, comfortably above the 300,000 floor the test asserts.
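The arithmetic behind the 300,000 floor is worth checking once in code. The tag names below are hypothetical labels for the channels the paragraph lists; only the counts matter.

```python
# Hypothetical tag labels; the counts follow the paragraph above.
PAIRED = ["temperature", "pH", "dissolved_oxygen", "agitation"]   # PV + setpoint each
SINGLE = [
    "feed_1_mass", "feed_2_mass",      # two nutrient feeds
    "pressure", "working_volume",
    "offgas_co2", "offgas_o2",
    "glucose", "titer",
]

tag_count = 2 * len(PAIRED) + len(SINGLE)
minutes = 14 * 24 * 60                 # one sample per tag per minute for 14 days
nominal_readings = tag_count * minutes

print(tag_count, minutes, nominal_readings)   # 16 20160 322560
assert nominal_readings > 300_000             # the floor test_historian_loaded asserts
```

The gap between the nominal 322,560 and the asserted 300,000 leaves the test some headroom while still failing if a whole tag goes missing (one tag is 20,160 readings, and losing two would drop the total below the floor).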

The two data-integrity tests are the most important, because they verify the Critical-risk URS-003 and URS-004 — the ALCOA+ controls (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available — the data-integrity attributes regulators hold a GMP record to) an inspector cares about most. test_audit_captures_update performs a real UPDATE, supplying a user and a reason, then asserts the change was captured and the tamper-evident hash chain is still intact. A hash chain links every audit row to the one before it by storing a fingerprint (hash) computed from both the new row and the prior row's fingerprint, so deleting or reordering any entry breaks the math of every later link — which is what makes silent tampering detectable; verify_chain() walks the chain and returns the broken links (zero means intact):

# from examples/tests/test_db.py
def test_audit_captures_update(conn):
    # an UPDATE must record old + new + who + why and keep the chain intact
    with conn.cursor() as cur:
        cur.execute("select set_config('app.user','pytest',false), "
                    "set_config('app.reason','test correction',false)")
        cur.execute("update lab.result set value = value where result_id = "
                    "(select result_id from lab.result limit 1)")
        conn.commit()
    last = _scalar(conn, "select action from audit.change_log "
                         "where app_user='pytest' order by seq desc limit 1")
    assert last == "UPDATE"
    assert _scalar(conn, "select count(*) from audit.verify_chain()") == 0

That single test, run and logged, is an OQ script that exercises the audit trail end to end. It is better evidence than a screenshot, because anyone — including the inspector — can re-run it and watch it pass.
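To make the hash-chain mechanics concrete, here is a minimal Python stand-in. It is not the repo's SQL: the real chain lives in audit.change_log and is checked by audit.verify_chain(), whose implementation is not shown in this chapter. The `row_hash`, `append` and `broken_links` helpers, the all-zero genesis value, and the JSON serialization are all assumptions made for the sketch.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64   # assumed starting value for the first link


def row_hash(prev_hash: str, row: dict) -> str:
    """Fingerprint of a row combined with the previous row's fingerprint."""
    payload = prev_hash + json.dumps(row, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain: list, row: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"row": row, "hash": row_hash(prev, row)})


def broken_links(chain: list) -> list:
    """Recompute the chain from the start; return indexes whose stored hash is wrong."""
    broken, prev = [], GENESIS
    for i, entry in enumerate(chain):
        expected = row_hash(prev, entry["row"])
        if entry["hash"] != expected:
            broken.append(i)
        prev = expected   # carry the recomputed value forward
    return broken


chain = []
append(chain, {"action": "INSERT", "app_user": "pytest", "reason": "initial entry"})
append(chain, {"action": "UPDATE", "app_user": "pytest", "reason": "test correction"})
append(chain, {"action": "UPDATE", "app_user": "pytest", "reason": "second correction"})
assert broken_links(chain) == []

tampered = copy.deepcopy(chain)
tampered[1]["row"]["reason"] = "edited after the fact"
print(broken_links(tampered))   # [1, 2]
```

Editing the middle row breaks its own link and every link after it, because each later fingerprint was computed from a predecessor that no longer exists. That cascade is what an empty result from the chain check rules out.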

The simulator tests in examples/tests/test_simulator.py cover the reproducibility requirement (URS-005). Determinism is a validation property in its own right: if the same inputs do not produce the same outputs, you cannot qualify anything built on them.

# from examples/tests/test_simulator.py
def test_determinism_two_runs_identical():
    a = fed_batch.simulate("BATCH-2026-001").tags["value"].to_numpy()
    b = fed_batch.simulate("BATCH-2026-001").tags["value"].to_numpy()
    assert np.array_equal(a, b)
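How fed_batch.simulate achieves determinism is not shown here. One common approach, sketched below under that caveat, is to derive a seed from the batch ID and use a private random generator. The `seed_from_batch` and `simulate_noise` names are hypothetical.

```python
import hashlib
import random


def seed_from_batch(batch_id: str) -> int:
    """Derive a stable seed from the batch ID.

    Python's built-in hash() is salted per process for strings, so it would
    give a different seed on every run; a SHA-256 prefix does not.
    """
    digest = hashlib.sha256(batch_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simulate_noise(batch_id: str, n: int = 5) -> list:
    """Generate n pseudo-random values that depend only on the batch ID."""
    rng = random.Random(seed_from_batch(batch_id))   # private generator, no global state
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


first = simulate_noise("BATCH-2026-001")
second = simulate_noise("BATCH-2026-001")
other = simulate_noise("BATCH-2026-002")

assert first == second   # same input, same output
assert first != other    # a different batch gets a different stream
```

Using a private generator rather than the module-level random functions matters for validation: another test that happens to draw random numbers cannot shift this one's output.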

And the whole suite is fronted by a single command in examples/Makefile, which is the exact line a validation engineer (or CI) types to produce the operational evidence:

# from examples/Makefile
test: ## run the test suite (determinism + db + analytics)
	$(PY) -m pytest -q tests

make test is the OQ execution. The console output, captured to a file, is the OQ result. Re-running it on a clean machine — which CI does on every commit — is regression-testing your validated state for free.

The Category 5 soft-sensor gets the same treatment, but its OQ is a predictive-floor test: for a model, "operates correctly" means it still clears a predictive floor on held-out data, so it must clear a held-out R-squared threshold (assert m["r2"] > 0.85) or CI fails. Here that held-out split is within a single golden batch, so it shows the signal is genuinely present rather than proving field accuracy on an unseen batch; the honest bar, a leave-one-batch-out validation, is covered in the analytics chapter. The worked example lives in the analytics and capstone chapters: see Process Analytics: SPC, MVDA & Soft Sensors and Capstone: One Batch, End to End.

A Category 5 model hides a validation trap a plain function does not, and naming it is what keeps the soft-sensor's OQ honest rather than self-flattering. A random row-wise train/test split leaks: sibling batches off one cell bank share a media lot and a skid, so a near-twin landing on both sides of the split lets the model see the answer and report a fantasy score. The defensible split is grouped (every reading from a batch goes wholly to train or wholly to test), so the held-out R-squared measures generalization to an unseen batch, not memorization of a seen one; Book 5 makes this GroupKFold / leave-one-batch-out fold the default in Models and Validation.

Two more model-only obligations ride alongside the predictive floor and belong in the validation package even though the analytics chapter is where they are exercised. The first is an applicability domain: a gate (a Hotelling's-T-squared / residual check) that declines to predict when a new batch sits outside the trained envelope, so the sensor abstains rather than guesses out of its depth. The second is a clean split between process drift (the living CHO culture genuinely wandering batch to batch, a real signal the historian should preserve) and model drift (the predictor going stale against that moving process, a defect to detect), the distinction Book 5's MLOps and Lifecycle chapter builds monitoring around.

Under CSA the depth of a model's OQ is risk-scaled the same way every other Category 5 function is: a soft-sensor whose output informs a release or a feed decision earns scripted, grouped-CV evidence and a logged model lineage (which dataset hash, which frozen model version, which CQA it scored, captured exactly as the IQ manifest captures an image digest), while a read-only trend view earns far less. The model is a Category 5 component; CSA's risk dial applies to it unchanged.
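
The grouped split described above can be sketched in a few lines of plain Python. This is an illustrative stand-in with hypothetical batch labels, not the analytics chapter's code.

```python
from collections import defaultdict
from typing import Iterator


def leave_one_batch_out(
    batch_ids: list[str],
) -> Iterator[tuple[str, list[int], list[int]]]:
    """Yield (held_out_batch, train_rows, test_rows), one fold per batch."""
    rows_by_batch: dict[str, list[int]] = defaultdict(list)
    for row, batch in enumerate(batch_ids):
        rows_by_batch[batch].append(row)
    for held_out, test_rows in rows_by_batch.items():
        # Every reading from the held-out batch goes to test, none to train.
        train_rows = [
            r for b, rows in rows_by_batch.items() if b != held_out for r in rows
        ]
        yield held_out, train_rows, test_rows


# Hypothetical batch labels, one per reading.
batch_ids = ["B1", "B1", "B2", "B2", "B2", "B3"]
for held_out, train_rows, test_rows in leave_one_batch_out(batch_ids):
    train_batches = {batch_ids[r] for r in train_rows}
    assert held_out not in train_batches  # no batch straddles the split
```

In practice scikit-learn's LeaveOneGroupOut or GroupKFold does the same job. The hand-rolled version is shown only to make the "wholly to train or wholly to test" rule visible.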

PQ: the batch is the proof

Performance Qualification answers "does it work for the real job, under real conditions?" For us that is the end-to-end run: a full simulated fed-batch CHO campaign (Chinese Hamster Ovary, the workhorse mammalian cell line most therapeutic antibodies are grown in; the physical process is Book 1's subject) flowing from sensor through historian and contextualization to a reviewable, signed dataset and an audit-trail review report. PQ is where the abstract requirements meet the actual mAb process.

Concretely, PQ replays the golden BATCH-2026-001, the reference run the book trends everything against: a 14-day fed-batch in which titer (grams of antibody per litre, g/L) accumulates, with a deliberate 0.5 degC temperature excursion (a brief departure from the target setpoint) injected on day 7. PQ then proves that the excursion surfaces as a quality-flagged deviation the audit trail can explain. That path is exercised by make data && make load plus the test_db/test_simulator suite (contextualization, audit chain, determinism); a signed reviewable-dataset and audit-trail-review report is the PQ deliverable a deploying site adds on top: the headline "implementation is possible" proof for the whole book.
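
To make the "excursion surfaces as a flagged deviation" claim concrete, here is a toy PQ-style assertion. The 37.0 degC setpoint, the 0.25 degC tolerance, the one-reading-per-day sampling, and the flag_excursions helper are all assumptions for illustration; the repo's actual quality-flag logic is not shown in this chapter.

```python
def flag_excursions(
    readings: list[tuple[int, float]], setpoint: float, tolerance: float
) -> list[int]:
    """Return the days whose temperature departs from setpoint by more than tolerance."""
    return [day for day, temp_c in readings if abs(temp_c - setpoint) > tolerance]


# Assumed values for illustration only.
SETPOINT_C = 37.0
TOLERANCE_C = 0.25

# Fourteen daily readings at setpoint, then inject the 0.5 degC excursion on day 7.
readings = [(day, SETPOINT_C) for day in range(1, 15)]
readings[6] = (7, SETPOINT_C + 0.5)

flagged = flag_excursions(readings, SETPOINT_C, TOLERANCE_C)
assert flagged == [7]
```

A real PQ check would go one step further and confirm that the flagged deviation also has a matching, attributable entry in the audit trail.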

What actually gets cited: the field-failure datapoint

It is easy to treat validation as a box-ticking ritual until you look at what inspectors actually write up. A retrospective analysis of FDA Warning Letters (the agency's formal written notice that violations must be corrected) issued to pharmaceutical companies over 2010–2020 found that documentation and data-integrity deficiencies accounted for roughly 21% of cGMP (current Good Manufacturing Practice — "current" meaning the standards in force today) warning letters, with documentation cited as a major deficiency in something like 20–25% of letters on average [13]. These are not exotic findings; they are the bread-and-butter citations — records that are not attributable, audit trails that are absent or disabled, data that cannot be tied back to who changed it and why. In US drug-GMP terms they map to the recordkeeping and electronic-records expectations of 21 CFR 211.68 and 211.180 and 21 CFR Part 11 [5].

That statistic is the honest mirror for this chapter's whole approach. The two Critical-risk requirements in our traceability matrix — URS-003 (changes are attributable, reasoned, tamper-evident) and URS-004 (the hash chain has no broken links) — are precisely the controls that, when missing, become a data-integrity citation. The reason we point our most rigorous, re-runnable evidence at exactly those two rows is that the field-failure record says this is where regulated facilities actually get cited. An automated OQ that re-proves attribution and chain integrity on every commit is not gold-plating; it is defending the line item most likely to appear on a Form 483 (the list of deficiencies an FDA investigator hands a site at the close of an inspection).

And it is worth naming the limit honestly: a passing pytest chain proves the control worked when tested, not that a determined insider cannot defeat it. There is also a narrower limit in the check itself: verify_chain() proves no link was deleted or reordered, but it does not recompute each row hash from its payload, so a silent in-place edit of an already-logged old_row/new_row is not caught by this check alone — the row-level hash is the defense there. As the Ch 23 hash chain showed, a superuser who disables the audit trigger can still bypass it — detectability is not impossibility. The datapoint above is a reason to test these controls hard, not a claim that testing makes them tamper-proof.
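
The difference between a link-only check and a payload recomputation is easiest to see on a toy chain. The sketch below uses an invented three-field row layout and SHA-256 over "prev_hash|payload"; it is not the schema or hashing rule behind audit.verify_chain(), only a model of the two kinds of check.

```python
import hashlib


def row_hash(prev_hash: str, payload: str) -> str:
    """Hash of a row: covers the previous row's hash and this row's payload."""
    return hashlib.sha256((prev_hash + "|" + payload).encode("utf-8")).hexdigest()


def build_chain(payloads: list[str]) -> list[dict]:
    """Build a toy audit chain where each row stores its predecessor's hash."""
    chain, prev = [], "GENESIS"
    for payload in payloads:
        h = row_hash(prev, payload)
        chain.append({"payload": payload, "prev_hash": prev, "row_hash": h})
        prev = h
    return chain


def broken_links(chain: list[dict]) -> list[int]:
    """Link-only check: each row must point at the stored hash of the row before it."""
    return [
        i for i in range(1, len(chain))
        if chain[i]["prev_hash"] != chain[i - 1]["row_hash"]
    ]


def bad_row_hashes(chain: list[dict]) -> list[int]:
    """Payload check: recompute every row hash from its stored payload."""
    return [
        i for i, row in enumerate(chain)
        if row_hash(row["prev_hash"], row["payload"]) != row["row_hash"]
    ]


chain = build_chain(["INSERT ph=7.0", "UPDATE ph=7.1", "UPDATE ph=7.2"])
chain[1]["payload"] = "UPDATE ph=9.9"  # silent in-place edit, stored hashes untouched

assert broken_links(chain) == []     # the link-only check sees nothing wrong
assert bad_row_hashes(chain) == [1]  # recomputing the row hash exposes the edit
```

Running both checks together covers deletion, reordering, and in-place edits. Neither one helps against someone who can rewrite the payload and every downstream hash, which is the insider limit named above.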

Why it matters

Without this chapter, the preceding chapters are a hobby project. A regulated facility cannot run software it cannot defend. The reason this matters so much for open source specifically is the missing vendor: when you buy a validated LIMS, the supplier's quality system absorbs part of your burden. With OSS, that burden lands entirely on you. It does not disappear, but CSA plus automated testing make it cheaper to discharge honestly than the old screenshot-driven CSV ever did. The repo turns validation from a documentation project into an engineering one, where the evidence is generated by the same CI that builds the software.

In the real world

The honest verdict: this approach is real, and it is being done. The IMPALA Consortium's published validation of a community R package is proof that big pharma will stand behind independently-assessed OSS in a GxP context [8]. GAMP 5 Second Edition's open-source appendix and the GAMP CoP guidance mean the framework explicitly supports it [1] [7]. And CSA's finalization in 2025 means the risk-based, evidence-leveraging posture this chapter takes is now the expected inspection-ready approach, not a clever workaround [3].

But be brutally clear about the limits. Automated pytest is excellent OQ evidence; it is not the whole validation package. You still owe: a validation plan and report signed by a qualified person, a change-control procedure (every digest change re-triggers assessment), periodic review, a deviation process, and — the recurring honesty of this book — the recognition that no OSS component is Part 11-compliant out of the box. The Ch 23 hash chain makes tampering detectable, not impossible; a superuser who disables the trigger can still bypass it. The supplier-assessment effort for a thinly-maintained project can exceed the cost of buying a supported one, and that trade-off is a legitimate business decision, not a failure. This chapter shows you can get there with open source; it never pretends the last mile is free.
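
The "every digest change re-triggers assessment" rule lends itself to an automated gate. The sketch below compares pinned digests against what is actually running; the dictionary layout, component names, and shortened digests are hypothetical, and how the running digests are collected (for example from the container runtime) is left out.

```python
from typing import Optional


def digest_drift(
    pinned: dict[str, str], running: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return components whose running digest differs from the pinned one."""
    drift = {}
    for name in sorted(set(pinned) | set(running)):
        if pinned.get(name) != running.get(name):
            # (pinned, running); None means the component is missing on that side.
            drift[name] = (pinned.get(name), running.get(name))
    return drift


# Hypothetical, shortened digests for illustration only.
pinned = {"db": "sha256:aaa1", "broker": "sha256:bbb2"}
running = {"db": "sha256:aaa1", "broker": "sha256:ccc3"}

changed = digest_drift(pinned, running)
assert changed == {"broker": ("sha256:bbb2", "sha256:ccc3")}
```

A non-empty result does not mean the change is wrong. It means the change-control procedure has work to do before the new digest is treated as part of the validated state.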

Key terms

GxP — the family of "Good x Practice" regulations (GMP, GLP, GCP) that govern regulated pharmaceutical work; the rules this stack must satisfy.
GAMP 5 — ISPE's risk-based framework for validating GxP computerized systems; its categories (1 infrastructure → 5 custom) scale validation effort.
ALCOA+ — the data-integrity attributes a regulated record must hold: Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available.
Infrastructure-as-code — defining a stack's every component in version-controlled files (here the compose.yaml and versions.lock) rather than by hand-clicked installs, so the install is itself a file you can re-run and audit.
CSA (Computer Software Assurance) — FDA's risk-based successor to heavyweight CSV: spend assurance effort in proportion to patient/data risk; leverage existing evidence.
CSV (Computer System Validation) — the traditional, often document-heavy approach CSA reforms.
IQ / OQ / PQ — Installation, Operational, and Performance Qualification: is it installed right, does it operate right, does it perform for the real job.
URS (User Requirement Specification) — the statements of what the system must do; the root of the traceability matrix.
Traceability matrix — the mapping from each URS to the test(s) that verify it (and back).
Supplier assessment — evaluating a software supplier's quality; for OSS, assessing the project's health and verifying the artifact instead.
SBOM — Software Bill of Materials; a standardized inventory (CycloneDX, standardized as ECMA-424; SPDX, standardized as ISO/IEC 5962:2021) of every component, its version, provenance, and license.
Image digest — a content-addressable SHA-256 of a container image; pinning by digest makes an install bit-for-bit reproducible.
purl (Package URL) — a standardized, globally unambiguous coordinate for a software component (e.g. pkg:docker/timescale/timescaledb@2.17.2-pg17) carried in the SBOM so a scanner can resolve provenance back to its source registry.
OpenSSF Scorecard — an automated check that scores an open-source project's security and maintenance practices (security policy, release signing, branch protection, dependency update hygiene), turning the project-health half of a supplier assessment into a repeatable, evidenced result.
Grouped split / leave-one-batch-out — the leak-free train/test split for a Category 5 model, where every reading from a batch goes wholly to train or wholly to test, so the held-out score measures generalization to an unseen batch rather than memorization of a near-twin sibling.
Applicability domain — a model's gate (a Hotelling's-T-squared / residual check) that declines to predict when a new batch sits outside the trained envelope, so the soft-sensor abstains rather than guesses out of its depth.
Process drift vs. model drift — process drift is the living CHO culture genuinely wandering batch to batch (a real signal to preserve); model drift is the predictor going stale against that moving process (a defect to detect) — conflating them makes a monitor cry wolf or miss a real shift.
Model lineage — the captured record of which dataset hash, frozen model version, and CQA a deployed model scored, logged exactly as the IQ manifest captures an image digest, so an audit can walk from a released lot back to the model that touched it.
SHACL traceability shape / competency question — the "nothing untraced" rule expressed as a closed-world SHACL shape (a requirement node with no verifiedBy edge fails) and a SPARQL query (which Critical requirements have no test?) run as a PASS/FAIL acceptance test, the graph form of the traceability matrix.

Where this leads

Validation answers "is this system trustworthy?" — but trust is not global. A record that satisfies FDA's Part 11 may face different residency, retention, and audit-trail expectations in the EU, Korea, China, or Japan. The next chapter, Data Across Jurisdictions: FDA, EU, PIC/S, NMPA, PMDA, MFDS, takes the validated stack across borders and shows how to encode those differing rules as data and policy rather than as forked deployments.
